* [PATCH v6 0/7] DAX fsync/msync support
@ 2015-12-23 19:39 ` Ross Zwisler
  0 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

Changes since v5 [1]:

1) Merged with Dan's changes to fs/dax.c that were staged in -mm and -next.

2) Store sectors in the address_space radix tree for DAX entries instead of
addresses.  This lets us get the addresses back from the block driver via
dax_map_atomic() during fsync/msync, which protects us against races with
block device removal; see the sketch after this change list. (Dan)

3) Reordered things a bit in dax_writeback_one() so we clear the
PAGECACHE_TAG_TOWRITE tag even if the radix tree entry is corrupt.  This
prevents an infinite loop where dax_writeback_one() never gets far enough
to clear that tag, yet dax_writeback_mapping_range() keeps finding the same
entry via find_get_entries_tag().

4) Changed the ordering so that the radix tree insertion happens before the
insertion into the page tables.  This ensures we can't end up with a
successful page table insertion and a failed radix tree insertion, which
would leave us with a writeable PTE that has no corresponding radix tree
entry.

5) Got rid of the 'nrdax' variable in struct address_space and renamed
'nrshadows' to 'nrexceptional' so that it can be used for both DAX and
shadow exceptional entries.  We explicitly prevent shadow entries from
being added to radix trees for DAX mappings, so the single counter can
safely be reused for both purposes. (Jan)

6) Updated all my WARN_ON() calls to use the return value to know whether
I've hit an error. (Andrew)
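
As a rough illustration of the fsync/msync flow from change 2, the per-entry
writeback could look something like the sketch below.  This is not code from
the series; the blk_dax_ctl / dax_map_atomic() / dax_unmap_atomic() usage
follows the API staged in -next and may differ in detail, and 'entry' and
'bdev' are assumed to come from the tagged radix tree walk and the inode's
block device:

	struct blk_dax_ctl dax = {
		.sector = RADIX_DAX_SECTOR(entry),	/* sector saved at fault time */
		.size = PAGE_SIZE,			/* a PMD entry would use PMD_SIZE */
	};

	if (dax_map_atomic(bdev, &dax) < 0)
		return;		/* block device went away, nothing to flush */

	wb_cache_pmem(dax.addr, dax.size);	/* unordered cache write-back */
	dax_unmap_atomic(bdev, &dax);

	/* ... once every dirty entry has been written back ... */
	wmb_pmem();				/* single ordering point */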

This series applies cleanly and was tested against next-20151223.

A working tree can be found here:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v6

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-December/003588.html

Ross Zwisler (7):
  pmem: add wb_cache_pmem() to the PMEM API
  dax: support dirty DAX entries in radix tree
  mm: add find_get_entries_tag()
  dax: add support for fsync/msync
  ext2: call dax_pfn_mkwrite() for DAX fsync/msync
  ext4: call dax_pfn_mkwrite() for DAX fsync/msync
  xfs: call dax_pfn_mkwrite() for DAX fsync/msync

 arch/x86/include/asm/pmem.h |  11 +--
 fs/block_dev.c              |   2 +-
 fs/dax.c                    | 196 ++++++++++++++++++++++++++++++++++++++++++--
 fs/ext2/file.c              |   4 +-
 fs/ext4/file.c              |   4 +-
 fs/inode.c                  |   2 +-
 fs/xfs/xfs_file.c           |   7 +-
 include/linux/dax.h         |   7 ++
 include/linux/fs.h          |   3 +-
 include/linux/pagemap.h     |   3 +
 include/linux/pmem.h        |  22 ++++-
 include/linux/radix-tree.h  |   9 ++
 mm/filemap.c                |  91 ++++++++++++++++++--
 mm/truncate.c               |  69 +++++++++-------
 mm/vmscan.c                 |   9 +-
 mm/workingset.c             |   4 +-
 16 files changed, 384 insertions(+), 59 deletions(-)

-- 
2.6.3

* [PATCH v6 1/7] pmem: add wb_cache_pmem() to the PMEM API
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.

One thing worth noting is that we really do want this to be part of the
PMEM API as opposed to a stand-alone function like clflush_cache_range()
because of ordering restrictions.  By having wb_cache_pmem() as part of the
PMEM API we can leave it unordered, call it multiple times to write back
large amounts of memory, and then order the multiple calls with a single
wmb_pmem().
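
For example, a caller that needs to flush several independently mapped ranges
can issue all of the write-backs and then pay for the ordering only once.
This is an illustrative sketch rather than code from this patch; the
pmem_range structure and flush_pmem_ranges() helper are made up for the
example:

	struct pmem_range {
		void __pmem	*addr;
		size_t		size;
	};

	/* Write back 'nr' pmem ranges, then make them durable together. */
	static void flush_pmem_ranges(struct pmem_range *ranges, int nr)
	{
		int i;

		for (i = 0; i < nr; i++)
			wb_cache_pmem(ranges[i].addr, ranges[i].size);	/* unordered */
		wmb_pmem();	/* one ordering point covers all write-backs above */
	}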

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pmem.h | 11 ++++++-----
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 1544fab..c57fd1e 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -133,7 +134,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	void *vaddr = (void __force *)addr;
 
 	memset(vaddr, 0, size);
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index acfea8c..7c3d11a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.6.3

* [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback sectors for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely on
the fact that only one type of exceptional entry can be found in a given
radix tree based on its usage.  This happens for free with DAX vs shmem but
we explicitly prevent shadow entries from being added to radix trees for
DAX mappings.

The only shadow entries that would be generated for DAX radix trees would
be to track zero page mappings that were created for holes.  These pages
would receive minimal benefit from having shadow entries, and the choice
to have only one type of exceptional entry in a given radix tree makes the
logic simpler both in clear_exceptional_entry() and in the rest of DAX.
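
To make the new entry format concrete, recording a dirty PTE-sized DAX
mapping could look roughly like the sketch below.  The real insertion and
tagging logic lands in the later "dax: add support for fsync/msync" patch;
the dax_record_dirty_pte() helper here is purely illustrative, and 'mapping',
'index' and 'sector' are assumed to come from the fault path:

	/* Sketch only: remember that the PTE for 'index' is dirty. */
	static void dax_record_dirty_pte(struct address_space *mapping,
			pgoff_t index, sector_t sector)
	{
		/* pack the sector and the PTE type into an exceptional entry */
		void *entry = RADIX_DAX_ENTRY(sector, false);

		spin_lock_irq(&mapping->tree_lock);
		if (!radix_tree_insert(&mapping->page_tree, index, entry)) {
			mapping->nrexceptional++;
			radix_tree_tag_set(&mapping->page_tree, index,
					PAGECACHE_TAG_DIRTY);
		}
		spin_unlock_irq(&mapping->tree_lock);
	}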

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c             |  2 +-
 fs/inode.c                 |  2 +-
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  3 +-
 include/linux/radix-tree.h |  9 ++++++
 mm/filemap.c               | 17 ++++++++----
 mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
 mm/vmscan.c                |  9 +++++-
 mm/workingset.c            |  4 +--
 9 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8c1f467..29b9c9b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 4c8f719..6e3e5d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -495,7 +495,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrexceptional);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 12ba937..905565f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -432,7 +432,8 @@ struct address_space {
 	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
+	/* number of shadow or DAX exceptional entries */
+	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..ba8e0fc 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 847ee43..7b8be78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -123,9 +124,9 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
 
 	if (shadow) {
-		mapping->nrshadows++;
+		mapping->nrexceptional++;
 		/*
-		 * Make sure the nrshadows update is committed before
+		 * Make sure the nrexceptional update is committed before
 		 * the nrpages update so that final truncate racing
 		 * with reclaim does not see both counters 0 at the
 		 * same time and miss a shadow entry.
@@ -579,9 +580,13 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		if (WARN_ON(dax_mapping(mapping)))
+			return -EINVAL;
+
 		if (shadowp)
 			*shadowp = p;
-		mapping->nrshadows--;
+		mapping->nrexceptional--;
 		if (node)
 			workingset_node_shadows_dec(node);
 	}
@@ -1245,9 +1250,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..e3ee0e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+			mapping->nrexceptional--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+					&slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrexceptional--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+					&node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +237,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -402,7 +411,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-	unsigned long nrshadows;
+	unsigned long nrexceptional;
 	unsigned long nrpages;
 
 	/*
@@ -416,14 +425,14 @@ void truncate_inode_pages_final(struct address_space *mapping)
 
 	/*
 	 * When reclaim installs eviction entries, it increases
-	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * nrexceptional first, then decreases nrpages.  Make sure we see
 	 * this in the right order or we might miss an entry.
 	 */
 	nrpages = mapping->nrpages;
 	smp_rmb();
-	nrshadows = mapping->nrshadows;
+	nrexceptional = mapping->nrexceptional;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrexceptional) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44ec50f..30e0cd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,7 @@
 #include <linux/oom.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
+#include <linux/dax.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 * inode reclaim needs to empty out the radix tree or
 		 * the nodes are lost.  Don't plant shadows behind its
 		 * back.
+		 *
+		 * We also don't store shadows for DAX mappings because the
+		 * only page cache pages found in these are zero pages
+		 * covering holes, and because we don't want to mix DAX
+		 * exceptional entries and shadow exceptional entries in the
+		 * same page_tree.
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
-		    !mapping_exiting(mapping))
+		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa01713..61ead9e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,8 +351,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-			BUG_ON(!mapping->nrshadows);
-			mapping->nrshadows--;
+			BUG_ON(!mapping->nrexceptional);
+			mapping->nrexceptional--;
 		}
 	}
 	BUG_ON(node->count);
-- 
2.6.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
@ 2015-12-23 19:39   ` Ross Zwisler
  0 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback sectors for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely on
the fact that only one type of exceptional entry can be found in a given
radix tree based on its usage.  This happens for free with DAX vs shmem but
we explicitly prevent shadow entries from being added to radix trees for
DAX mappings.

The only shadow entries that would be generated for DAX radix trees would
be to track zero page mappings that were created for holes.  These pages
would receive minimal benefit from having shadow entries, and the choice
to have only one type of exceptional entry in a given radix tree makes the
logic simpler both in clear_exceptional_entry() and in the rest of DAX.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c             |  2 +-
 fs/inode.c                 |  2 +-
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  3 +-
 include/linux/radix-tree.h |  9 ++++++
 mm/filemap.c               | 17 ++++++++----
 mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
 mm/vmscan.c                |  9 +++++-
 mm/workingset.c            |  4 +--
 9 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8c1f467..29b9c9b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 4c8f719..6e3e5d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -495,7 +495,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrexceptional);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 12ba937..905565f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -432,7 +432,8 @@ struct address_space {
 	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
+	/* number of shadow or DAX exceptional entries */
+	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..ba8e0fc 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 847ee43..7b8be78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -123,9 +124,9 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
 
 	if (shadow) {
-		mapping->nrshadows++;
+		mapping->nrexceptional++;
 		/*
-		 * Make sure the nrshadows update is committed before
+		 * Make sure the nrexceptional update is committed before
 		 * the nrpages update so that final truncate racing
 		 * with reclaim does not see both counters 0 at the
 		 * same time and miss a shadow entry.
@@ -579,9 +580,13 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		if (WARN_ON(dax_mapping(mapping)))
+			return -EINVAL;
+
 		if (shadowp)
 			*shadowp = p;
-		mapping->nrshadows--;
+		mapping->nrexceptional--;
 		if (node)
 			workingset_node_shadows_dec(node);
 	}
@@ -1245,9 +1250,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..e3ee0e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+			mapping->nrexceptional--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+					&slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrexceptional--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+					&node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +237,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -402,7 +411,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-	unsigned long nrshadows;
+	unsigned long nrexceptional;
 	unsigned long nrpages;
 
 	/*
@@ -416,14 +425,14 @@ void truncate_inode_pages_final(struct address_space *mapping)
 
 	/*
 	 * When reclaim installs eviction entries, it increases
-	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * nrexceptional first, then decreases nrpages.  Make sure we see
 	 * this in the right order or we might miss an entry.
 	 */
 	nrpages = mapping->nrpages;
 	smp_rmb();
-	nrshadows = mapping->nrshadows;
+	nrexceptional = mapping->nrexceptional;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrexceptional) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44ec50f..30e0cd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,7 @@
 #include <linux/oom.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
+#include <linux/dax.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 * inode reclaim needs to empty out the radix tree or
 		 * the nodes are lost.  Don't plant shadows behind its
 		 * back.
+		 *
+		 * We also don't store shadows for DAX mappings because the
+		 * only page cache pages found in these are zero pages
+		 * covering holes, and because we don't want to mix DAX
+		 * exceptional entries and shadow exceptional entries in the
+		 * same page_tree.
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
-		    !mapping_exiting(mapping))
+		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa01713..61ead9e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,8 +351,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-			BUG_ON(!mapping->nrshadows);
-			mapping->nrshadows--;
+			BUG_ON(!mapping->nrexceptional);
+			mapping->nrexceptional--;
 		}
 	}
 	BUG_ON(node->count);
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
@ 2015-12-23 19:39   ` Ross Zwisler
  0 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, J. Bruce Fields, linux-mm, Andreas Dilger,
	H. Peter Anvin, Jeff Layton, Dan Williams, linux-nvdimm, x86,
	Ingo Molnar, Matthew Wilcox, Ross Zwisler, linux-ext4, xfs,
	Alexander Viro, Thomas Gleixner, Theodore Ts'o, Jan Kara,
	linux-fsdevel, Andrew Morton, Matthew Wilcox

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback sectors for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely on
the fact that only one type of exceptional entry can be found in a given
radix tree based on its usage.  This happens for free with DAX vs shmem but
we explicitly prevent shadow entries from being added to radix trees for
DAX mappings.

The only shadow entries that would be generated for DAX radix trees would
be to track zero page mappings that were created for holes.  These pages
would receive minimal benefit from having shadow entries, and the choice
to have only one type of exceptional entry in a given radix tree makes the
logic simpler both in clear_exceptional_entry() and in the rest of DAX.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c             |  2 +-
 fs/inode.c                 |  2 +-
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  3 +-
 include/linux/radix-tree.h |  9 ++++++
 mm/filemap.c               | 17 ++++++++----
 mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
 mm/vmscan.c                |  9 +++++-
 mm/workingset.c            |  4 +--
 9 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8c1f467..29b9c9b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 4c8f719..6e3e5d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -495,7 +495,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrexceptional);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 12ba937..905565f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -432,7 +432,8 @@ struct address_space {
 	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
+	/* number of shadow or DAX exceptional entries */
+	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..ba8e0fc 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 847ee43..7b8be78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -123,9 +124,9 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
 
 	if (shadow) {
-		mapping->nrshadows++;
+		mapping->nrexceptional++;
 		/*
-		 * Make sure the nrshadows update is committed before
+		 * Make sure the nrexceptional update is committed before
 		 * the nrpages update so that final truncate racing
 		 * with reclaim does not see both counters 0 at the
 		 * same time and miss a shadow entry.
@@ -579,9 +580,13 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		if (WARN_ON(dax_mapping(mapping)))
+			return -EINVAL;
+
 		if (shadowp)
 			*shadowp = p;
-		mapping->nrshadows--;
+		mapping->nrexceptional--;
 		if (node)
 			workingset_node_shadows_dec(node);
 	}
@@ -1245,9 +1250,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..e3ee0e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+			mapping->nrexceptional--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+					&slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrexceptional--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+					&node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +237,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -402,7 +411,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-	unsigned long nrshadows;
+	unsigned long nrexceptional;
 	unsigned long nrpages;
 
 	/*
@@ -416,14 +425,14 @@ void truncate_inode_pages_final(struct address_space *mapping)
 
 	/*
 	 * When reclaim installs eviction entries, it increases
-	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * nrexceptional first, then decreases nrpages.  Make sure we see
 	 * this in the right order or we might miss an entry.
 	 */
 	nrpages = mapping->nrpages;
 	smp_rmb();
-	nrshadows = mapping->nrshadows;
+	nrexceptional = mapping->nrexceptional;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrexceptional) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44ec50f..30e0cd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,7 @@
 #include <linux/oom.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
+#include <linux/dax.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 * inode reclaim needs to empty out the radix tree or
 		 * the nodes are lost.  Don't plant shadows behind its
 		 * back.
+		 *
+		 * We also don't store shadows for DAX mappings because the
+		 * only page cache pages found in these are zero pages
+		 * covering holes, and because we don't want to mix DAX
+		 * exceptional entries and shadow exceptional entries in the
+		 * same page_tree.
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
-		    !mapping_exiting(mapping))
+		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa01713..61ead9e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,8 +351,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-			BUG_ON(!mapping->nrshadows);
-			mapping->nrshadows--;
+			BUG_ON(!mapping->nrexceptional);
+			mapping->nrexceptional--;
 		}
 	}
 	BUG_ON(node->count);
-- 
2.6.3
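
The hunks above teach the generic lookup and truncate paths that an
exceptional entry in a DAX mapping's radix tree is not a workingset shadow
entry.  As a reminder of what that means for callers, here is a minimal,
hypothetical sketch of how a find_get_entry() result has to be handled
(handle_exceptional() and do_something() are placeholders, not functions
from this series):

	struct page *page = find_get_entry(mapping, index);

	if (radix_tree_exceptional_entry(page)) {
		/*
		 * Not a struct page: a workingset shadow entry, a
		 * shmem/tmpfs swap entry or, with this series, a DAX
		 * entry.  It carries no page reference, so it must not
		 * be released with page_cache_release().
		 */
		handle_exceptional(mapping, index, page);
	} else if (page) {
		/* A real page, returned with an elevated refcount. */
		do_something(page);
		page_cache_release(page);
	}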


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 3/7] mm: add find_get_entries_tag()
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

Add find_get_entries_tag() to the family of functions that includes
find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this function)
that are marked with the PAGECACHE_TAG_TOWRITE tag.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/pagemap.h |  3 +++
 mm/filemap.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4d08b6c..92395a0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices);
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b8be78..1e215fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1499,6 +1499,74 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
+/**
+ * find_get_entries_tag - find and return entries that match @tag
+ * @mapping:	the address_space to search
+ * @start:	the starting page cache index
+ * @tag:	the tag index
+ * @nr_entries:	the maximum number of entries
+ * @entries:	where the resulting entries are placed
+ * @indices:	the cache indices corresponding to the entries in @entries
+ *
+ * Like find_get_entries, except we only return entries which are tagged with
+ * @tag.
+ */
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_tagged(slot, &mapping->page_tree,
+				   &iter, start, tag) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				goto restart;
+			}
+
+			/*
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		entries[ret] = page;
+		if (++ret == nr_entries)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(find_get_entries_tag);
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.6.3
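
As a usage illustration, a caller that walks every TOWRITE-tagged entry of a
mapping could be structured roughly as below.  This mirrors the loop that
patch 4/7 adds in dax_writeback_mapping_range(); process_entry() is a
hypothetical placeholder, not part of this series:

	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	unsigned int i;

	pagevec_init(&pvec, 0);
	for (;;) {
		pvec.nr = find_get_entries_tag(mapping, start,
				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
				pvec.pages, indices);
		if (!pvec.nr)
			break;

		for (i = 0; i < pvec.nr; i++) {
			/*
			 * pvec.pages[i] is either a normal page (returned
			 * with an elevated refcount) or an exceptional
			 * entry such as a DAX entry; indices[i] is its
			 * offset in the file.
			 */
			process_entry(mapping, indices[i], pvec.pages[i]);
		}
		start = indices[pvec.nr - 1] + 1;
	}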


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 4/7] dax: add support for fsync/msync
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

To handle fsync/msync properly and efficiently, DAX needs to track dirty
pages so that it can flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.
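
The exceptional-entry format used for this is defined earlier in the series
(patch 2/7).  Purely as an illustration of the idea, and not the kernel's
actual macros, a block sector plus a PTE/PMD size flag can be packed into an
exceptional radix tree entry along these lines:

	/* Illustrative sketch only; see patch 2/7 for the real definitions. */
	#define EX_DAX_SHIFT	4	/* low bits carry the entry type */
	#define EX_DAX_PTE	(0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
	#define EX_DAX_PMD	(0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)

	#define EX_DAX_ENTRY(sector, pmd) \
		((void *)(((unsigned long)(sector) << EX_DAX_SHIFT) | \
			  ((pmd) ? EX_DAX_PMD : EX_DAX_PTE)))
	#define EX_DAX_TYPE(entry)	((unsigned long)(entry) & 0xf)
	#define EX_DAX_SECTOR(entry)	((unsigned long)(entry) >> EX_DAX_SHIFT)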

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   2 +
 mm/filemap.c        |   6 ++
 3 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 82d0bff..050610d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
@@ -323,6 +324,176 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+#define NO_SECTOR -1
+
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, index);
+
+	if (entry) {
+		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+			goto dirty;
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrexceptional--;
+	}
+
+	if (sector == NO_SECTOR) {
+		/*
+		 * This can happen during correct operation if our pfn_mkwrite
+		 * fault raced against a hole punch operation.  If this
+		 * happens the pte that was hole punched will have been
+		 * unmapped and the radix tree entry will have been removed by
+		 * the time we are called, but the call will still happen.  We
+		 * will return all the way up to wp_pfn_shared(), where the
+		 * pte_same() check will fail, eventually causing page fault
+		 * to be retried by the CPU.
+		 */
+		goto unlock;
+	}
+
+	error = radix_tree_insert(page_tree, index,
+			RADIX_DAX_ENTRY(sector, pmd_entry));
+	if (error)
+		goto unlock;
+
+	mapping->nrexceptional++;
+ dirty:
+	if (dirty)
+		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
+static int dax_writeback_one(struct block_device *bdev,
+		struct address_space *mapping, pgoff_t index, void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	struct radix_tree_node *node;
+	struct blk_dax_ctl dax;
+	void **slot;
+	int ret = 0;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+
+	/* another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto unlock;
+
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	dax.sector = RADIX_DAX_SECTOR(entry);
+	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * We cannot hold tree_lock while calling dax_map_atomic() because it
+	 * eventually calls cond_resched().
+	 */
+	ret = dax_map_atomic(bdev, &dax);
+	if (ret < 0)
+		return ret;
+
+	if (WARN_ON_ONCE(ret < dax.size)) {
+		ret = -EIO;
+		dax_unmap_atomic(bdev, &dax);
+		return ret;
+	}
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * We need to revalidate our radix entry while holding tree_lock
+	 * before we do the writeback.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unmap;
+	if (*slot != entry)
+		goto unmap;
+
+	wb_cache_pmem(dax.addr, dax.size);
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start_page, end_page;
+	struct pagevec pvec;
+	void *entry;
+	int i, ret = 0;
+
+	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+		return -EIO;
+
+	rcu_read_lock();
+	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	rcu_read_unlock();
+
+	/* see if the start of our range is covered by a PMD entry */
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start &= PMD_MASK;
+
+	start_page = start >> PAGE_CACHE_SHIFT;
+	end_page = end >> PAGE_CACHE_SHIFT;
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			ret = dax_writeback_one(bdev, mapping, indices[i],
+					pvec.pages[i]);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	wmb_pmem();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -362,6 +533,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
+	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+			vmf->flags & FAULT_FLAG_WRITE);
+	if (error)
+		goto out;
+
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
@@ -486,6 +662,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -579,7 +756,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	struct block_device *bdev = NULL;
 	pgoff_t size, pgoff;
 	sector_t block;
-	int result = 0;
+	int error, result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -721,6 +898,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		if (write) {
+			error = dax_radix_entry(mapping, pgoff, dax.sector,
+					true, true);
+			if (error) {
+				dax_pmd_dbg(bdev, address,
+						"PMD radix insertion failed");
+				goto fallback;
+			}
+		}
+
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
 				__func__, current->comm, address,
@@ -779,15 +966,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..8204c3d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e215fc..2e7c8d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,12 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, lstart, lend);
+		if (err)
+			return err;
+	}
+
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
-- 
2.6.3
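
To make the user-visible effect of this patch concrete, here is a small,
self-contained userspace sketch.  The path /mnt/dax/file is an assumption
(any file on a filesystem mounted with -o dax will do) and error handling is
minimal.  With this series applied, the msync()/fsync() below flushes the
cache lines dirtied through the mapping out to persistent media; before it,
DAX mappings had no dirty tracking to drive such a flush.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* Assumed path: a file on a DAX-mounted (-o dax) filesystem. */
		int fd = open("/mnt/dax/file", O_CREAT | O_RDWR, 0644);

		if (fd < 0 || ftruncate(fd, 4096) < 0)
			return 1;

		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Stores go straight to the pmem pages mapped by DAX... */
		strcpy(p, "hello, persistent world");

		/*
		 * ...but they may still sit in the CPU cache.  The write
		 * fault left a dirty entry in the mapping's radix tree, so
		 * the flush below finds it and writes the cache lines back.
		 */
		if (msync(p, 4096, MS_SYNC) < 0 || fsync(fd) < 0)
			return 1;

		munmap(p, 4096);
		close(fd);
		return 0;
	}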


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 11a42c5..2c88d68 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
@@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 
 	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
-- 
2.6.3
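
Patches 6/7 and 7/7 make the analogous change for ext4 and XFS.  Stripped of
the per-filesystem locking details, the shape shared by all three handlers is
roughly the following (an illustrative sketch, not code taken from any of
these filesystems):

	static int fs_dax_pfn_mkwrite(struct vm_area_struct *vma,
				      struct vm_fault *vmf)
	{
		struct inode *inode = file_inode(vma->vm_file);
		loff_t size;
		int ret;

		sb_start_pagefault(inode->i_sb);
		file_update_time(vma->vm_file);
		/* take the filesystem's mmap/truncate lock (fs specific) */

		size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
		if (vmf->pgoff >= size)
			ret = VM_FAULT_SIGBUS;
		else
			ret = dax_pfn_mkwrite(vma, vmf); /* dirties the entry */

		/* drop the filesystem's mmap/truncate lock */
		sb_end_pagefault(inode->i_sb);
		return ret;
	}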


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 6/7] ext4: call dax_pfn_mkwrite() for DAX fsync/msync
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 60683ab..fa899c9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(sb);
 	file_update_time(vma->vm_file);
@@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(sb);
 
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v6 7/7] xfs: call dax_pfn_mkwrite() for DAX fsync/msync
  2015-12-23 19:39 ` Ross Zwisler
  (?)
@ 2015-12-23 19:39   ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2015-12-23 19:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f5392ab..40ffbb1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1603,9 +1603,8 @@ xfs_filemap_pmd_fault(
 /*
  * pfn_mkwrite was originally inteneded to ensure we capture time stamp
  * updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
- * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
- * barrier in place.
+ * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
+ * to ensure we serialise the fault barrier in place.
  */
 static int
 xfs_filemap_pfn_mkwrite(
@@ -1628,6 +1627,8 @@ xfs_filemap_pfn_mkwrite(
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else if (IS_DAX(inode))
+		ret = dax_pfn_mkwrite(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
-- 
2.6.3


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* RE: [PATCH v6 3/7] mm: add find_get_entries_tag()
  2015-12-23 19:39   ` Ross Zwisler
  (?)
@ 2015-12-24  0:28     ` Elliott, Robert (Persistent Memory)
  -1 siblings, 0 replies; 75+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2015-12-24  0:28 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel
  Cc: Dave Hansen, Dave Chinner, J. Bruce Fields, linux-mm,
	Andreas Dilger, H. Peter Anvin, Jeff Layton, linux-nvdimm, x86,
	Ingo Molnar, linux-ext4, xfs, Alexander Viro, Thomas Gleixner,
	Theodore Ts'o, Jan Kara, linux-fsdevel, Andrew Morton,
	Matthew Wilcox

> -----Original Message-----
> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
> Ross Zwisler
> Sent: Wednesday, December 23, 2015 1:39 PM
> Subject: [PATCH v6 3/7] mm: add find_get_entries_tag()
> 
...
> diff --git a/mm/filemap.c b/mm/filemap.c
...
> +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> +			int tag, unsigned int nr_entries,
> +			struct page **entries, pgoff_t *indices)
> +{
> +	void **slot;
> +	unsigned int ret = 0;
...
> +	radix_tree_for_each_tagged(slot, &mapping->page_tree,
> +				   &iter, start, tag) {
...
> +		indices[ret] = iter.index;
> +		entries[ret] = page;
> +		if (++ret == nr_entries)
> +			break;
> +	}

Using >= would provide more safety from buffer overflow
problems in case ret ever jumped ahead by more than one.
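As a sketch, the suggested defensive form only changes the exit test of the
quoted hunk (hypothetical rewrite, not a posted patch):

		indices[ret] = iter.index;
		entries[ret] = page;
		/* stop once the caller's entries[]/indices[] buffers are
		 * full, even if ret were ever to step past nr_entries */
		if (++ret >= nr_entries)
			break;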
---
Robert Elliott, HPE Persistent Memory

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
  2015-12-23 19:39   ` Ross Zwisler
  (?)
@ 2015-12-30  8:02     ` Bob Liu
  -1 siblings, 0 replies; 75+ messages in thread
From: Bob Liu @ 2015-12-30  8:02 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

Hi Ross,

On 12/24/2015 03:39 AM, Ross Zwisler wrote:
> Add support for tracking dirty DAX entries in the struct address_space
> radix tree.  This tree is already used for dirty page writeback, and it
> already supports the use of exceptional (non struct page*) entries.
> 
> In order to properly track dirty DAX pages we will insert new exceptional
> entries into the radix tree that represent dirty DAX PTE or PMD pages.

I may get it wrong, but there is "struct page" for persistent memory after
"[PATCH v4 00/18]get_user_pages() for dax pte and pmd mappings".
So why not just add "struct page" to radix tree directly just like normal page cache?

Then we don't need to deal with any exceptional entries and special writeback.

Thanks,
Bob

> These exceptional entries will also contain the writeback sectors for the
> PTE or PMD faults that we can use at fsync/msync time.
> 
> There are currently two types of exceptional entries (shmem and shadow)
> that can be placed into the radix tree, and this adds a third.  We rely on
> the fact that only one type of exceptional entry can be found in a given
> radix tree based on its usage.  This happens for free with DAX vs shmem but
> we explicitly prevent shadow entries from being added to radix trees for
> DAX mappings.
> 
> The only shadow entries that would be generated for DAX radix trees would
> be to track zero page mappings that were created for holes.  These pages
> would receive minimal benefit from having shadow entries, and the choice
> to have only one type of exceptional entry in a given radix tree makes the
> logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/block_dev.c             |  2 +-
>  fs/inode.c                 |  2 +-
>  include/linux/dax.h        |  5 ++++
>  include/linux/fs.h         |  3 +-
>  include/linux/radix-tree.h |  9 ++++++
>  mm/filemap.c               | 17 ++++++++----
>  mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
>  mm/vmscan.c                |  9 +++++-
>  mm/workingset.c            |  4 +--
>  9 files changed, 78 insertions(+), 42 deletions(-)

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
  2015-12-30  8:02     ` Bob Liu
  (?)
@ 2015-12-30 20:39       ` Dan Williams
  -1 siblings, 0 replies; 75+ messages in thread
From: Dan Williams @ 2015-12-30 20:39 UTC (permalink / raw)
  To: Bob Liu
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Wed, Dec 30, 2015 at 12:02 AM, Bob Liu <bob.liu@oracle.com> wrote:
> Hi Ross,
>
> On 12/24/2015 03:39 AM, Ross Zwisler wrote:
>> Add support for tracking dirty DAX entries in the struct address_space
>> radix tree.  This tree is already used for dirty page writeback, and it
>> already supports the use of exceptional (non struct page*) entries.
>>
>> In order to properly track dirty DAX pages we will insert new exceptional
>> entries into the radix tree that represent dirty DAX PTE or PMD pages.
>
> I may get it wrong, but there is "struct page" for persistent memory after
> "[PATCH v4 00/18]get_user_pages() for dax pte and pmd mappings".
> So why not just add "struct page" to radix tree directly just like normal page cache?
>
> Then we don't need to deal with any exceptional entries and special writeback.

That "struct page" is optional and fsync/msync needs to operate in its absence.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
  2015-12-30 20:39       ` Dan Williams
  (?)
  (?)
@ 2015-12-31  3:28         ` Bob Liu
  -1 siblings, 0 replies; 75+ messages in thread
From: Bob Liu @ 2015-12-31  3:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen


On 12/31/2015 04:39 AM, Dan Williams wrote:
> On Wed, Dec 30, 2015 at 12:02 AM, Bob Liu <bob.liu@oracle.com> wrote:
>> Hi Ross,
>>
>> On 12/24/2015 03:39 AM, Ross Zwisler wrote:
>>> Add support for tracking dirty DAX entries in the struct address_space
>>> radix tree.  This tree is already used for dirty page writeback, and it
>>> already supports the use of exceptional (non struct page*) entries.
>>>
>>> In order to properly track dirty DAX pages we will insert new exceptional
>>> entries into the radix tree that represent dirty DAX PTE or PMD pages.
>>
>> I may get it wrong, but there is "struct page" for persistent memory after
>> "[PATCH v4 00/18]get_user_pages() for dax pte and pmd mappings".
>> So why not just add "struct page" to radix tree directly just like normal page cache?
>>
>> Then we don't need to deal with any exceptional entries and special writeback.
> 
> That "struct page" is optional and fsync/msync needs to operate in its absence.
> 

Is there any special reason or scenario where "struct page" should not be enabled?
I don't see any disadvantage to always forcing "struct page" on when using the DAX model for pmem.
The benefit would be that things could be simpler, with fewer potential bugs thanks to smaller patches.

Happy New Year!
Bob

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
  2015-12-31  3:28         ` Bob Liu
  (?)
@ 2015-12-31 22:08           ` Dan Williams
  -1 siblings, 0 replies; 75+ messages in thread
From: Dan Williams @ 2015-12-31 22:08 UTC (permalink / raw)
  To: Bob Liu
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Wed, Dec 30, 2015 at 7:28 PM, Bob Liu <bob.liu@oracle.com> wrote:
>
> On 12/31/2015 04:39 AM, Dan Williams wrote:
>> On Wed, Dec 30, 2015 at 12:02 AM, Bob Liu <bob.liu@oracle.com> wrote:
>>> Hi Ross,
>>>
>>> On 12/24/2015 03:39 AM, Ross Zwisler wrote:
>>>> Add support for tracking dirty DAX entries in the struct address_space
>>>> radix tree.  This tree is already used for dirty page writeback, and it
>>>> already supports the use of exceptional (non struct page*) entries.
>>>>
>>>> In order to properly track dirty DAX pages we will insert new exceptional
>>>> entries into the radix tree that represent dirty DAX PTE or PMD pages.
>>>
>>> I may get it wrong, but there is "struct page" for persistent memory after
>>> "[PATCH v4 00/18]get_user_pages() for dax pte and pmd mappings".
>>> So why not just add "struct page" to radix tree directly just like normal page cache?
>>>
>>> Then we don't need to deal with any exceptional entries and special writeback.
>>
>> That "struct page" is optional and fsync/msync needs to operate in its absence.
>>
>
> Is there any special reason or scenario where "struct page" should not be enabled?
> I don't see any disadvantage to always forcing "struct page" on when using the DAX model for pmem.
> The benefit would be that things could be simpler, with fewer potential bugs thanks to smaller patches.
>

We can't enable struct page coverage by default.

The persistent memory capacity may be too large to allocate the memmap
array from DRAM.  Allocating it from pmem reduces the size of the
device and we can't have a block device change sizes just by booting a
different kernel (any kernel less than 4.5).  So, enabling struct page
must be an explicit action.
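For a rough sense of the scale involved, a back-of-the-envelope calculation
(assuming the usual 64-byte struct page and 4 KiB pages; the figures are
illustrative, not from this thread):

	memmap overhead:  64 / 4096  ~= 1.6% of capacity
	1 TiB pmem namespace  ->  ~16 GiB of struct page backing

which is why the memmap either has to be carved out of the pmem itself or
omitted, and why turning it on is left as an explicit choice.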

> Happy New Year!

Happy New Year!

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2015-12-23 19:39   ` Ross Zwisler
  (?)
@ 2016-01-03 18:13     ` Dan Williams
  -1 siblings, 0 replies; 75+ messages in thread
From: Dan Williams @ 2016-01-03 18:13 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm, X86 ML,
	XFS Developers, Andrew Morton, Matthew Wilcox, Dave Hansen

On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
>
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

I'm hitting the following report with the ndctl dax test [1] on
next-20151231.  I bisected it to commit 3cb108f941de
("dax-add-support-for-fsync-sync-v6").  I'll take a closer look
tomorrow, but in case someone can beat me to it, here's the backtrace:

------------[ cut here ]------------
kernel BUG at fs/inode.c:497!
[..]
CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
RSP: 0018:ffff880307797d50  EFLAGS: 00010002
RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
Stack:
ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
Call Trace:
[<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
[<ffffffff81281038>] evict+0xb8/0x180
[<ffffffff8128113b>] dispose_list+0x3b/0x50
[<ffffffff81282014>] evict_inodes+0x144/0x170
[<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
[<ffffffff81264837>] kill_block_super+0x27/0x70
[<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
[<ffffffff81264e9c>] deactivate_super+0x5c/0x60
[<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
[<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
[<ffffffff810c4f26>] task_work_run+0x76/0x90
[<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
[<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20
74 1a a8 40 75 18 48 c7 8
3 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f
0b 0f 1f 44 00 00 0f 1f
44 00 00 55
RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
RSP <ffff880307797d50>
---[ end trace 3b1d8898a94a4fc1 ]---

[1]: git://git@github.com:pmem/ndctl.git pending
make TESTS="test/dax.sh" check

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 2/7] dax: support dirty DAX entries in radix tree
  2015-12-23 19:39   ` Ross Zwisler
  (?)
@ 2016-01-05  9:41     ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-01-05  9:41 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

On Wed 23-12-15 12:39:15, Ross Zwisler wrote:
> Add support for tracking dirty DAX entries in the struct address_space
> radix tree.  This tree is already used for dirty page writeback, and it
> already supports the use of exceptional (non struct page*) entries.
> 
> In order to properly track dirty DAX pages we will insert new exceptional
> entries into the radix tree that represent dirty DAX PTE or PMD pages.
> These exceptional entries will also contain the writeback sectors for the
> PTE or PMD faults that we can use at fsync/msync time.
> 
> There are currently two types of exceptional entries (shmem and shadow)
> that can be placed into the radix tree, and this adds a third.  We rely on
> the fact that only one type of exceptional entry can be found in a given
> radix tree based on its usage.  This happens for free with DAX vs shmem but
> we explicitly prevent shadow entries from being added to radix trees for
> DAX mappings.
> 
> The only shadow entries that would be generated for DAX radix trees would
> be to track zero page mappings that were created for holes.  These pages
> would receive minimal benefit from having shadow entries, and the choice
> to have only one type of exceptional entry in a given radix tree makes the
> logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-03 18:13     ` Dan Williams
  (?)
@ 2016-01-05 11:13       ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-01-05 11:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Sun 03-01-16 10:13:06, Dan Williams wrote:
> On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> I'm hitting the following report with the ndctl dax test [1] on
> next-20151231.  I bisected it to
>  commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
> closer look tomorrow, but in case someone can beat me to it, here's
> the back-trace:
> 
> ------------[ cut here ]------------
> kernel BUG at fs/inode.c:497!

I suppose this is the check that mapping->nrexceptional is zero, isn't it?
Hum, I don't see how that could happen given we call
truncate_inode_pages_final() just before the clear_inode() call which
removes all the exceptional entries from the radix tree.  And there's not
much room for a race during umount... Does the radix tree really contain
any entry or is it an accounting bug?
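
For reference, a minimal sketch (illustrative, not the literal fs/inode.c
source) of the accounting assertion that trips at fs/inode.c:497 once this
series teaches the mapping to count exceptional entries:

	/* in clear_inode(), after truncate_inode_pages_final() has run */
	BUG_ON(inode->i_data.nrpages);
	BUG_ON(inode->i_data.nrexceptional);	/* the check referred to above */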

								Honza

> [..]
> CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
> RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP: 0018:ffff880307797d50  EFLAGS: 00010002
> RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
> RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
> RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
> R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
> R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
> FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
> Stack:
> ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
> ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
> ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
> Call Trace:
> [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
> [<ffffffff81281038>] evict+0xb8/0x180
> [<ffffffff8128113b>] dispose_list+0x3b/0x50
> [<ffffffff81282014>] evict_inodes+0x144/0x170
> [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
> [<ffffffff81264837>] kill_block_super+0x27/0x70
> [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
> [<ffffffff81264e9c>] deactivate_super+0x5c/0x60
> [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
> [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
> [<ffffffff810c4f26>] task_work_run+0x76/0x90
> [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
> [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
> Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20
> 74 1a a8 40 75 18 48 c7 83 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f>
> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 44 00 00 0f 1f 44 00 00 55
> RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP <ffff880307797d50>
> ---[ end trace 3b1d8898a94a4fc1 ]---
> 
> [1]: git://git@github.com:pmem/ndctl.git pending
> make TESTS="test/dax.sh" check
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2015-12-23 19:39   ` Ross Zwisler
  (?)
@ 2016-01-05 11:13     ` Jan Kara
  -1 siblings, 0 replies; 75+ messages in thread
From: Jan Kara @ 2016-01-05 11:13 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Dan Williams, Matthew Wilcox, Dave Hansen

On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
> 
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
...
> +static int dax_writeback_one(struct block_device *bdev,
> +		struct address_space *mapping, pgoff_t index, void *entry)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type = RADIX_DAX_TYPE(entry);
> +	struct radix_tree_node *node;
> +	struct blk_dax_ctl dax;
> +	void **slot;
> +	int ret = 0;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */
> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +
> +	/* another fsync thread may have already written back this entry */
> +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto unlock;
> +
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +
> +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	dax.sector = RADIX_DAX_SECTOR(entry);
> +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	/*
> +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> +	 * eventually calls cond_resched().
> +	 */
> +	ret = dax_map_atomic(bdev, &dax);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (WARN_ON_ONCE(ret < dax.size)) {
> +		ret = -EIO;
> +		dax_unmap_atomic(bdev, &dax);
> +		return ret;
> +	}
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * We need to revalidate our radix entry while holding tree_lock
> +	 * before we do the writeback.
> +	 */

Do we really need to revalidate here? dax_map_atomic() makes sure the addr
& size is still part of the device. I guess you are concerned that due to
truncate or similar operation those sectors needn't belong to the same file
anymore but we don't really care about flushing sectors for someone else,
do we?

Otherwise the patch looks good to me.

> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unmap;
> +	if (*slot != entry)
> +		goto unmap;
> +
> +	wb_cache_pmem(dax.addr, dax.size);
> + unmap:
> +	dax_unmap_atomic(bdev, &dax);
> + unlock:
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-05 11:13       ` Jan Kara
  (?)
  (?)
@ 2016-01-05 15:50         ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2016-01-05 15:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Tue, Jan 05, 2016 at 12:13:46PM +0100, Jan Kara wrote:
> On Sun 03-01-16 10:13:06, Dan Williams wrote:
> > On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > > pages so it is able to flush them durably to media on demand.
> > >
> > > The tracking of dirty pages is done via the radix tree in struct
> > > address_space.  This radix tree is already used by the page writeback
> > > infrastructure for tracking dirty pages associated with an open file, and
> > > it already has support for exceptional (non struct page*) entries.  We
> > > build upon these features to add exceptional entries to the radix tree for
> > > DAX dirty PMD or PTE pages at fault time.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > I'm hitting the following report with the ndctl dax test [1] on
> > next-20151231.  I bisected it to
> >  commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
> > closer look tomorrow, but in case someone can beat me to it, here's
> > the back-trace:
> > 
> > ------------[ cut here ]------------
> > kernel BUG at fs/inode.c:497!
> 
> I suppose this is the check that mapping->nrexceptional is zero, isn't it?
> Hum, I don't see how that could happen given we call
> truncate_inode_pages_final() just before the clear_inode() call which
> removes all the exceptional entries from the radix tree.  And there's not
> much room for a race during umount... Does the radix tree really contain
> any entry or is it an accounting bug?
> 
> 								Honza

I think this is a bug with the existing way that we handle PMD faults.  The
issue is that the PMD path doesn't properly remove radix tree entries for zero
pages covering holes.  The PMD path calls unmap_mapping_range() to unmap the
range out of the struct address_space, but it is missing a call to
truncate_inode_pages_range() or similar to clear out those entries in the
radix tree.  Up until now we didn't notice; we just had an orphaned entry in
the radix tree.  But with my code we then find the page entry in the radix
tree when handling a PMD fault, remove it, and add in a PMD entry.  This
causes us to be off on both our mapping->nrpages and mapping->nrexceptional
counts.

In the PTE path we properly remove the pages from the radix tree when
upgrading from a hole to a real DAX entry via the delete_from_page_cache()
call, which eventually calls page_cache_tree_delete().
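
To make the gap concrete, here is a rough sketch (illustrative only, not the
fix that was eventually merged) of what the PMD hole path is missing,
assuming 'mapping' and 'pgoff' from the fault handler's context:

	/* existing behaviour: tear down the zero-page PTEs covering the range */
	unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_SHIFT, PMD_SIZE, 0);

	/* missing piece: also drop the corresponding radix tree entries so
	 * mapping->nrpages / mapping->nrexceptional stay consistent */
	truncate_inode_pages_range(mapping, (loff_t)pgoff << PAGE_SHIFT,
				   ((loff_t)pgoff << PAGE_SHIFT) + PMD_SIZE - 1);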

I'm working on a fix now (and making sure all the above is correct).

- Ross

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-05 11:13     ` Jan Kara
  (?)
  (?)
@ 2016-01-05 17:12       ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2016-01-05 17:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Dan Williams,
	Matthew Wilcox, Dave Hansen

On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
> On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ...
> > +static int dax_writeback_one(struct block_device *bdev,
> > +		struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	struct radix_tree_node *node;
> > +	struct blk_dax_ctl dax;
> > +	void **slot;
> > +	int ret = 0;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked.  These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> > +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > +		goto unlock;
> > +	if (*slot != entry)
> > +		goto unlock;
> > +
> > +	/* another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto unlock;
> > +
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > +		ret = -EIO;
> > +		goto unlock;
> > +	}
> > +
> > +	dax.sector = RADIX_DAX_SECTOR(entry);
> > +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +
> > +	/*
> > +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> > +	 * eventually calls cond_resched().
> > +	 */
> > +	ret = dax_map_atomic(bdev, &dax);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	if (WARN_ON_ONCE(ret < dax.size)) {
> > +		ret = -EIO;
> > +		dax_unmap_atomic(bdev, &dax);
> > +		return ret;
> > +	}
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * We need to revalidate our radix entry while holding tree_lock
> > +	 * before we do the writeback.
> > +	 */
> 
> Do we really need to revalidate here? dax_map_atomic() makes sure the addr
> & size is still part of the device. I guess you are concerned that due to
> truncate or similar operation those sectors needn't belong to the same file
> anymore but we don't really care about flushing sectors for someone else,
> do we?
> 
> Otherwise the patch looks good to me.

Yep, the concern is that we could have somehow raced against a truncate
operation while we weren't holding the tree_lock, and that now the address we
are about to flush belongs to another file or is unallocated by the
filesystem.
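
An illustrative interleaving of that race (a sketch, not taken from the
kernel sources):

	fsync thread                          truncate
	------------                          --------
	look up radix entry, read sector
	spin_unlock_irq(&mapping->tree_lock)
	                                      entry removed, blocks freed
	                                      and possibly reallocated
	dax_map_atomic()   /* still a valid device address */
	wb_cache_pmem(addr, size)
	   /* may flush sectors that no longer belong to this file unless
	      the entry is revalidated under tree_lock first */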

I agree that this should be non-destructive - if you think the additional
check and locking isn't worth the overhead, I'm happy to take it out.  I don't
have a strong opinion either way.

Thanks for the review!

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-05 17:12       ` Ross Zwisler
  (?)
  (?)
@ 2016-01-05 17:20         ` Dan Williams
  -1 siblings, 0 replies; 75+ messages in thread
From: Dan Williams @ 2016-01-05 17:20 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Dan Williams, Matthew Wilcox, Dave Hansen

On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
>> On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
>> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
>> > pages so it is able to flush them durably to media on demand.
>> >
>> > The tracking of dirty pages is done via the radix tree in struct
>> > address_space.  This radix tree is already used by the page writeback
>> > infrastructure for tracking dirty pages associated with an open file, and
>> > it already has support for exceptional (non struct page*) entries.  We
>> > build upon these features to add exceptional entries to the radix tree for
>> > DAX dirty PMD or PTE pages at fault time.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> ...
>> > +static int dax_writeback_one(struct block_device *bdev,
>> > +           struct address_space *mapping, pgoff_t index, void *entry)
>> > +{
>> > +   struct radix_tree_root *page_tree = &mapping->page_tree;
>> > +   int type = RADIX_DAX_TYPE(entry);
>> > +   struct radix_tree_node *node;
>> > +   struct blk_dax_ctl dax;
>> > +   void **slot;
>> > +   int ret = 0;
>> > +
>> > +   spin_lock_irq(&mapping->tree_lock);
>> > +   /*
>> > +    * Regular page slots are stabilized by the page lock even
>> > +    * without the tree itself locked.  These unlocked entries
>> > +    * need verification under the tree lock.
>> > +    */
>> > +   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
>> > +           goto unlock;
>> > +   if (*slot != entry)
>> > +           goto unlock;
>> > +
>> > +   /* another fsync thread may have already written back this entry */
>> > +   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
>> > +           goto unlock;
>> > +
>> > +   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
>> > +
>> > +   if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
>> > +           ret = -EIO;
>> > +           goto unlock;
>> > +   }
>> > +
>> > +   dax.sector = RADIX_DAX_SECTOR(entry);
>> > +   dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
>> > +   spin_unlock_irq(&mapping->tree_lock);
>> > +
>> > +   /*
>> > +    * We cannot hold tree_lock while calling dax_map_atomic() because it
>> > +    * eventually calls cond_resched().
>> > +    */
>> > +   ret = dax_map_atomic(bdev, &dax);
>> > +   if (ret < 0)
>> > +           return ret;
>> > +
>> > +   if (WARN_ON_ONCE(ret < dax.size)) {
>> > +           ret = -EIO;
>> > +           dax_unmap_atomic(bdev, &dax);
>> > +           return ret;
>> > +   }
>> > +
>> > +   spin_lock_irq(&mapping->tree_lock);
>> > +   /*
>> > +    * We need to revalidate our radix entry while holding tree_lock
>> > +    * before we do the writeback.
>> > +    */
>>
>> Do we really need to revalidate here? dax_map_atomic() makes sure the addr
>> & size is still part of the device. I guess you are concerned that due to
>> truncate or similar operation those sectors needn't belong to the same file
>> anymore but we don't really care about flushing sectors for someone else,
>> do we?
>>
>> Otherwise the patch looks good to me.
>
> Yep, the concern is that we could have somehow raced against a truncate
> operation while we weren't holding the tree_lock, and that now the address we
> are about to flush belongs to another file or is unallocated by the
> filesystem.
>
> I agree that this should be non-destructive - if you think the additional
> check and locking isn't worth the overhead, I'm happy to take it out.  I don't
> have a strong opinion either way.
>

My concern is whether flushing potentially invalid virtual addresses
is problematic on some architectures.  Maybe it's just FUD, but it's
less work in my opinion to just revalidate the address versus auditing
each arch for this concern.

At a minimum we can change the comment to not say "We need to" and
instead say "TODO: are all archs ok with flushing potentially invalid
addresses?"

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-05 17:20         ` Dan Williams
@ 2016-01-05 18:14           ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2016-01-05 18:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Jan Kara, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote:
> On Tue, Jan 5, 2016 at 9:12 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Jan 05, 2016 at 12:13:58PM +0100, Jan Kara wrote:
> >> On Wed 23-12-15 12:39:17, Ross Zwisler wrote:
> >> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> >> > pages so it is able to flush them durably to media on demand.
> >> >
> >> > The tracking of dirty pages is done via the radix tree in struct
> >> > address_space.  This radix tree is already used by the page writeback
> >> > infrastructure for tracking dirty pages associated with an open file, and
> >> > it already has support for exceptional (non struct page*) entries.  We
> >> > build upon these features to add exceptional entries to the radix tree for
> >> > DAX dirty PMD or PTE pages at fault time.
> >> >
> >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> ...
> >> > +static int dax_writeback_one(struct block_device *bdev,
> >> > +           struct address_space *mapping, pgoff_t index, void *entry)
> >> > +{
> >> > +   struct radix_tree_root *page_tree = &mapping->page_tree;
> >> > +   int type = RADIX_DAX_TYPE(entry);
> >> > +   struct radix_tree_node *node;
> >> > +   struct blk_dax_ctl dax;
> >> > +   void **slot;
> >> > +   int ret = 0;
> >> > +
> >> > +   spin_lock_irq(&mapping->tree_lock);
> >> > +   /*
> >> > +    * Regular page slots are stabilized by the page lock even
> >> > +    * without the tree itself locked.  These unlocked entries
> >> > +    * need verification under the tree lock.
> >> > +    */
> >> > +   if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> >> > +           goto unlock;
> >> > +   if (*slot != entry)
> >> > +           goto unlock;
> >> > +
> >> > +   /* another fsync thread may have already written back this entry */
> >> > +   if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> >> > +           goto unlock;
> >> > +
> >> > +   radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> >> > +
> >> > +   if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> >> > +           ret = -EIO;
> >> > +           goto unlock;
> >> > +   }
> >> > +
> >> > +   dax.sector = RADIX_DAX_SECTOR(entry);
> >> > +   dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> >> > +   spin_unlock_irq(&mapping->tree_lock);
> >> > +
> >> > +   /*
> >> > +    * We cannot hold tree_lock while calling dax_map_atomic() because it
> >> > +    * eventually calls cond_resched().
> >> > +    */
> >> > +   ret = dax_map_atomic(bdev, &dax);
> >> > +   if (ret < 0)
> >> > +           return ret;
> >> > +
> >> > +   if (WARN_ON_ONCE(ret < dax.size)) {
> >> > +           ret = -EIO;
> >> > +           dax_unmap_atomic(bdev, &dax);
> >> > +           return ret;
> >> > +   }
> >> > +
> >> > +   spin_lock_irq(&mapping->tree_lock);
> >> > +   /*
> >> > +    * We need to revalidate our radix entry while holding tree_lock
> >> > +    * before we do the writeback.
> >> > +    */
> >>
> >> Do we really need to revalidate here? dax_map_atomic() makes sure the addr
> >> & size is still part of the device. I guess you are concerned that due to
> >> truncate or similar operation those sectors needn't belong to the same file
> >> anymore but we don't really care about flushing sectors for someone else,
> >> do we?
> >>
> >> Otherwise the patch looks good to me.
> >
> > Yep, the concern is that we could have somehow raced against a truncate
> > operation while we weren't holding the tree_lock, and that now the address we
> > are about to flush belongs to another file or is unallocated by the
> > filesystem.
> >
> > I agree that this should be non-destructive - if you think the additional
> > check and locking isn't worth the overhead, I'm happy to take it out.  I don't
> > have a strong opinion either way.
> >
> 
> My concern is whether flushing potentially invalid virtual addresses
> is problematic on some architectures.  Maybe it's just FUD, but it's
> less work in my opinion to just revalidate the address versus auditing
> each arch for this concern.

I don't think that the addresses have the potential of being invalid from the
driver's point of view - we are still holding a reference on the block queue
via dax_map_atomic(), so we should be protected against races vs block device
removal.  I think the only question is whether it is okay to flush an address
that we know to be valid from the block device's point of view, but which the
filesystem may have truncated from being allocated to our inode.

Does that all make sense?

> At a minimum we can change the comment to not say "We need to" and
> instead say "TODO: are all archs ok with flushing potentially invalid
> addresses?"

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-05 18:14           ` Ross Zwisler
@ 2016-01-05 18:22             ` Dan Williams
  -1 siblings, 0 replies; 75+ messages in thread
From: Dan Williams @ 2016-01-05 18:22 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Jan Kara, linux-kernel,
	H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm, X86 ML,
	XFS Developers, Andrew Morton, Matthew Wilcox, Dave Hansen

On Tue, Jan 5, 2016 at 10:14 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Jan 05, 2016 at 09:20:47AM -0800, Dan Williams wrote:
[..]
>> My concern is whether flushing potentially invalid virtual addresses
>> is problematic on some architectures.  Maybe it's just FUD, but it's
>> less work in my opinion to just revalidate the address versus auditing
>> each arch for this concern.
>
> I don't think that the addresses have the potential of being invalid from the
> driver's point of view - we are still holding a reference on the block queue
> via dax_map_atomic(), so we should be protected against races vs block device
> removal.  I think the only question is whether it is okay to flush an address
> that we know to be valid from the block device's point of view, but which the
> filesystem may have truncated from being allocated to our inode.
>
> Does that all make sense?

Yes, I was confusing which revalidation we were talking about.  As
long as the dax_map_atomic() is there I don't think we need any
further revalidation.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v6 4/7] dax: add support for fsync/msync
  2016-01-03 18:13     ` Dan Williams
@ 2016-01-06 18:10       ` Ross Zwisler
  -1 siblings, 0 replies; 75+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Sun, Jan 03, 2016 at 10:13:06AM -0800, Dan Williams wrote:
> On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> >
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> I'm hitting the following report with the ndctl dax test [1] on
> next-20151231.  I bisected it to
>  commit 3cb108f941de "dax-add-support-for-fsync-sync-v6".  I'll take a
> closer look tomorrow, but in case someone can beat me to it, here's
> the back-trace:
> 
> ------------[ cut here ]------------
> kernel BUG at fs/inode.c:497!
> [..]
> CPU: 1 PID: 3001 Comm: umount Tainted: G           O    4.4.0-rc7+ #2412
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> task: ffff8800da2a5a00 ti: ffff880307794000 task.ti: ffff880307794000
> RIP: 0010:[<ffffffff81280171>]  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP: 0018:ffff880307797d50  EFLAGS: 00010002
> RAX: ffff8800da2a5a00 RBX: ffff8800ca2e7328 RCX: ffff8800da2a5a28
> RDX: 0000000000000001 RSI: 0000000000000005 RDI: ffff8800ca2e7530
> RBP: ffff880307797d60 R08: ffffffff82900ae0 R09: 0000000000000000
> R10: ffff8800ca2e7548 R11: 0000000000000000 R12: ffff8800ca2e7530
> R13: ffff8800ca2e7328 R14: ffff8800da2e88d0 R15: ffff8800da2e88d0
> FS:  00007f2b22f4a880(0000) GS:ffff88031fc40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00005648abd933e8 CR3: 000000007f3fc000 CR4: 00000000000006e0
> Stack:
> ffff8800ca2e7328 ffff8800ca2e7000 ffff880307797d88 ffffffffa01c18af
> ffff8800ca2e7328 ffff8800ca2e74d0 ffffffffa01ec740 ffff880307797db0
> ffffffff81281038 ffff8800ca2e74c0 ffff880307797e00 ffff8800ca2e7328
> Call Trace:
> [<ffffffffa01c18af>] xfs_fs_evict_inode+0x5f/0x110 [xfs]
> [<ffffffff81281038>] evict+0xb8/0x180
> [<ffffffff8128113b>] dispose_list+0x3b/0x50
> [<ffffffff81282014>] evict_inodes+0x144/0x170
> [<ffffffff8126447f>] generic_shutdown_super+0x3f/0xf0
> [<ffffffff81264837>] kill_block_super+0x27/0x70
> [<ffffffff81264a53>] deactivate_locked_super+0x43/0x70
> [<ffffffff81264e9c>] deactivate_super+0x5c/0x60
> [<ffffffff81285aff>] cleanup_mnt+0x3f/0x90
> [<ffffffff81285b92>] __cleanup_mnt+0x12/0x20
> [<ffffffff810c4f26>] task_work_run+0x76/0x90
> [<ffffffff81003e3a>] syscall_return_slowpath+0x20a/0x280
> [<ffffffff8192671a>] int_ret_from_sys_call+0x25/0x9f
> Code: 48 8d 93 30 03 00 00 48 39 c2 75 23 48 8b 83 d0 00 00 00 a8 20
> 74 1a a8 40 75 18 48 c7 8
> 3 d0 00 00 00 60 00 00 00 5b 41 5c 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f
> 0b 0f 1f 44 00 00 0f 1f
> 44 00 00 55
> RIP  [<ffffffff81280171>] clear_inode+0x71/0x80
> RSP <ffff880307797d50>
> ---[ end trace 3b1d8898a94a4fc1 ]---
> 
> [1]: git://git@github.com:pmem/ndctl.git pending
> make TESTS="test/dax.sh" check

This issue is fixed with patch 2 of v7:

https://lists.01.org/pipermail/linux-nvdimm/2016-January/003888.html

Thanks for the report!

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2016-01-06 18:10 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-23 19:39 [PATCH v6 0/7] DAX fsync/msync support Ross Zwisler
2015-12-23 19:39 ` Ross Zwisler
2015-12-23 19:39 ` Ross Zwisler
2015-12-23 19:39 ` [PATCH v6 1/7] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39 ` [PATCH v6 2/7] dax: support dirty DAX entries in radix tree Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-30  8:02   ` Bob Liu
2015-12-30  8:02     ` Bob Liu
2015-12-30  8:02     ` Bob Liu
2015-12-30 20:39     ` Dan Williams
2015-12-30 20:39       ` Dan Williams
2015-12-30 20:39       ` Dan Williams
2015-12-31  3:28       ` Bob Liu
2015-12-31  3:28         ` Bob Liu
2015-12-31  3:28         ` Bob Liu
2015-12-31  3:28         ` Bob Liu
2015-12-31 22:08         ` Dan Williams
2015-12-31 22:08           ` Dan Williams
2015-12-31 22:08           ` Dan Williams
2016-01-05  9:41   ` Jan Kara
2016-01-05  9:41     ` Jan Kara
2016-01-05  9:41     ` Jan Kara
2015-12-23 19:39 ` [PATCH v6 3/7] mm: add find_get_entries_tag() Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-24  0:28   ` Elliott, Robert (Persistent Memory)
2015-12-24  0:28     ` Elliott, Robert (Persistent Memory)
2015-12-24  0:28     ` Elliott, Robert (Persistent Memory)
2015-12-23 19:39 ` [PATCH v6 4/7] dax: add support for fsync/msync Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2016-01-03 18:13   ` Dan Williams
2016-01-03 18:13     ` Dan Williams
2016-01-03 18:13     ` Dan Williams
2016-01-05 11:13     ` Jan Kara
2016-01-05 11:13       ` Jan Kara
2016-01-05 11:13       ` Jan Kara
2016-01-05 15:50       ` Ross Zwisler
2016-01-05 15:50         ` Ross Zwisler
2016-01-05 15:50         ` Ross Zwisler
2016-01-05 15:50         ` Ross Zwisler
2016-01-06 18:10     ` Ross Zwisler
2016-01-06 18:10       ` Ross Zwisler
2016-01-06 18:10       ` Ross Zwisler
2016-01-06 18:10       ` Ross Zwisler
2016-01-05 11:13   ` Jan Kara
2016-01-05 11:13     ` Jan Kara
2016-01-05 11:13     ` Jan Kara
2016-01-05 17:12     ` Ross Zwisler
2016-01-05 17:12       ` Ross Zwisler
2016-01-05 17:12       ` Ross Zwisler
2016-01-05 17:12       ` Ross Zwisler
2016-01-05 17:20       ` Dan Williams
2016-01-05 17:20         ` Dan Williams
2016-01-05 17:20         ` Dan Williams
2016-01-05 17:20         ` Dan Williams
2016-01-05 18:14         ` Ross Zwisler
2016-01-05 18:14           ` Ross Zwisler
2016-01-05 18:14           ` Ross Zwisler
2016-01-05 18:22           ` Dan Williams
2016-01-05 18:22             ` Dan Williams
2016-01-05 18:22             ` Dan Williams
2016-01-05 18:22             ` Dan Williams
2015-12-23 19:39 ` [PATCH v6 5/7] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39 ` [PATCH v6 6/7] ext4: " Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39 ` [PATCH v6 7/7] xfs: " Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler
2015-12-23 19:39   ` Ross Zwisler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.