* [PATCH 0/3 v1] dax: Clear dirty bits after flushing caches
From: Jan Kara @ 2016-06-21 15:45 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-fsdevel, linux-mm, Jan Kara

Hello,

Currently we never clear dirty bits in the radix tree of a DAX inode. Thus
fsync(2) or even periodic writeback flushes all the dirty pfns again and
again. These patches implement clearing of the dirty tag in the radix tree
so that we issue flushes only when needed.

The difficulty with clearing the dirty tag is that we have to protect against
a concurrent page fault setting the dirty tag and writing new data into the
page. So we need a lock serializing page faults against clearing of the dirty
tag and write-protecting of PTEs (so that we get another page fault when the
pfn is written to again and can set the dirty tag again).

The effect of the patch set is easily visible:

Writing 1 GB of data via mmap, then fsync twice.

Before this patch set both fsyncs take ~205 ms on my test machine. After the
patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
clearing dirty bits, etc. is very noticeable) while the second fsync takes
below 1 us.
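
For reference, a rough userspace sketch of this test (mount point and file
name are just examples) would be:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define LEN (1024UL * 1024 * 1024)	/* 1 GB */

static double seconds(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	int fd = open("/mnt/pmem/testfile", O_CREAT | O_RDWR, 0644);
	char *buf;
	double t;
	int i;

	if (fd < 0 || ftruncate(fd, LEN) < 0) {
		perror("open/ftruncate");
		return 1;
	}
	buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, LEN);	/* dirty every page through the mapping */
	for (i = 0; i < 2; i++) {
		t = seconds();
		fsync(fd);
		printf("fsync %d: %f s\n", i + 1, seconds() - t);
	}
	return 0;
}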

As a bonus, these patches make filesystem freezing for DAX filesystems
reliable because mappings are now properly write-protected while freezing the
fs.

Patches have passed xfstests for both xfs and ext4.

So far the patches don't work with PMD pages - that's next on my todo list.

								Honza

* [PATCH 1/3] dax: Make cache flushing protected by entry lock
From: Jan Kara @ 2016-06-21 15:45 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-fsdevel, linux-mm, Jan Kara

Currently, flushing of caches for DAX mappings ignores the entry lock.
So far this was OK (modulo a bug where a difference in the entry lock
bit could cause cache flushing to be mistakenly skipped), but in the
following patches we will write-protect PTEs on cache flushing and
clear dirty tags. For that we will need more exclusion, so do the
cache flushing under an entry lock. This allows us to remove one
lock-unlock pair of mapping->tree_lock as a bonus.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 62 +++++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 761495bf5eb9..5209f8cd0bee 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -669,35 +669,54 @@ static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
 	struct radix_tree_root *page_tree = &mapping->page_tree;
-	int type = RADIX_DAX_TYPE(entry);
-	struct radix_tree_node *node;
+	int type;
 	struct blk_dax_ctl dax;
-	void **slot;
 	int ret = 0;
+	void *entry2, **slot;
 
-	spin_lock_irq(&mapping->tree_lock);
 	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
+	 * A page got tagged dirty in DAX mapping? Something is seriously
+	 * wrong.
 	 */
-	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-
-	/* another fsync thread may have already written back this entry */
-	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto unlock;
+	if (WARN_ON(!radix_tree_exceptional_entry(entry)))
+		return -EIO;
 
+	spin_lock_irq(&mapping->tree_lock);
+	entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
+	/* Entry got punched out / reallocated? */
+	if (!entry2 || !radix_tree_exceptional_entry(entry2))
+		goto put_unlock;
+	/*
+	 * Entry got reallocated elsewhere? No need to writeback. We have to
+	 * compare sectors as we must not bail out due to difference in lockbit
+	 * or entry type.
+	 */
+	if (RADIX_DAX_SECTOR(entry2) != RADIX_DAX_SECTOR(entry))
+		goto put_unlock;
+	type = RADIX_DAX_TYPE(entry2);
 	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
 		ret = -EIO;
-		goto unlock;
+		goto put_unlock;
 	}
+	entry = entry2;
+
+	/* Another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto put_unlock;
+	/* Lock the entry to serialize with page faults */
+	entry = lock_slot(mapping, slot);
+	/*
+	 * We can clear the tag now but we have to be careful so that concurrent
+	 * dax_writeback_one() calls for the same index cannot finish before we
+	 * actually flush the caches. This is achieved as the calls will look
+	 * at the entry only under tree_lock and once they do that they will
+	 * see the entry locked and wait for it to unlock.
+	 */
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+	spin_unlock_irq(&mapping->tree_lock);
 
 	dax.sector = RADIX_DAX_SECTOR(entry);
 	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
-	spin_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * We cannot hold tree_lock while calling dax_map_atomic() because it
@@ -713,15 +732,12 @@ static int dax_writeback_one(struct block_device *bdev,
 	}
 
 	wb_cache_pmem(dax.addr, dax.size);
-
-	spin_lock_irq(&mapping->tree_lock);
-	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
-	spin_unlock_irq(&mapping->tree_lock);
- unmap:
+unmap:
 	dax_unmap_atomic(bdev, &dax);
 	return ret;
 
- unlock:
+put_unlock:
+	put_unlocked_mapping_entry(mapping, index, entry2);
 	spin_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
-- 
2.6.6


* [PATCH 2/3] mm: Export follow_pte()
From: Jan Kara @ 2016-06-21 15:45 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-fsdevel, linux-mm, Jan Kara

DAX will need to implement its own version of check_page_address(). To
avoid duplicating page table walking code, export follow_pte() which
does what we need.
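
The intended caller pattern (roughly what the following patch does) is:

	pte_t *ptep;
	spinlock_t *ptl;

	if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
		return;		/* no PTE currently maps this address */
	/* ... inspect or modify *ptep while holding ptl ... */
	pte_unmap_unlock(ptep, ptl);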

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 2 ++
 mm/memory.c        | 5 +++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5df5feb49575..989f5d949db3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1193,6 +1193,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
+	       spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 15322b73636b..f6175d63c2e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3647,8 +3647,8 @@ out:
 	return -EINVAL;
 }
 
-static inline int follow_pte(struct mm_struct *mm, unsigned long address,
-			     pte_t **ptepp, spinlock_t **ptlp)
+int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
+	       spinlock_t **ptlp)
 {
 	int res;
 
@@ -3657,6 +3657,7 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 			   !(res = __follow_pte(mm, address, ptepp, ptlp)));
 	return res;
 }
+EXPORT_SYMBOL(follow_pte);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
-- 
2.6.6


* [PATCH 3/3] dax: Clear dirty entry tags on cache flush
From: Jan Kara @ 2016-06-21 15:45 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: linux-fsdevel, linux-mm, Jan Kara

Currently we never clear dirty tags in DAX mappings, and thus the address
ranges to flush keep accumulating. Now that we have locking of radix tree
entries, we have all the locking necessary to reliably clear the radix
tree dirty tag when flushing caches for the corresponding address range.
Similarly to page_mkclean(), we also have to write-protect pages to get a
page fault when a page is next written to, so that we can mark the
entry dirty again.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5209f8cd0bee..c0c4eecb5f73 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,7 @@
 #include <linux/vmstat.h>
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * We use lowest available bit in exceptional entry for locking, other two
@@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	return new_entry;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	return address;
+}
+
+/* Walk all mappings of a given index of a file and writeprotect them */
+static void dax_mapping_entry_mkclean(struct address_space *mapping,
+				      pgoff_t index, unsigned long pfn)
+{
+	struct vm_area_struct *vma;
+	pte_t *ptep;
+	pte_t pte;
+	spinlock_t *ptl;
+	bool changed;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+		unsigned long address;
+
+		cond_resched();
+
+		if (!(vma->vm_flags & VM_SHARED))
+			continue;
+
+		address = pgoff_address(index, vma);
+		changed = false;
+		if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
+			continue;
+		if (pfn != pte_pfn(*ptep))
+			goto unlock;
+		if (!pte_dirty(*ptep) && !pte_write(*ptep))
+			goto unlock;
+
+		flush_cache_page(vma, address, pfn);
+		pte = ptep_clear_flush(vma, address, ptep);
+		pte = pte_wrprotect(pte);
+		pte = pte_mkclean(pte);
+		set_pte_at(vma->vm_mm, address, ptep, pte);
+		changed = true;
+unlock:
+		pte_unmap_unlock(pte, ptl);
+
+		if (changed)
+			mmu_notifier_invalidate_page(vma->vm_mm, address);
+	}
+	i_mmap_unlock_read(mapping);
+}
+
 static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
@@ -723,17 +777,30 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * eventually calls cond_resched().
 	 */
 	ret = dax_map_atomic(bdev, &dax);
-	if (ret < 0)
+	if (ret < 0) {
+		put_locked_mapping_entry(mapping, index, entry);
 		return ret;
+	}
 
 	if (WARN_ON_ONCE(ret < dax.size)) {
 		ret = -EIO;
 		goto unmap;
 	}
 
+	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
 	wb_cache_pmem(dax.addr, dax.size);
+	/*
+	 * After we have flushed the cache, we can clear the dirty tag. There
+	 * cannot be new dirty data in the pfn after the flush has completed as
+	 * the pfn mappings are writeprotected and fault waits for mapping
+	 * entry lock.
+	 */
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&mapping->tree_lock);
 unmap:
 	dax_unmap_atomic(bdev, &dax);
+	put_locked_mapping_entry(mapping, index, entry);
 	return ret;
 
 put_unlock:
-- 
2.6.6


* Re: [PATCH 3/3] dax: Clear dirty entry tags on cache flush
From: kbuild test robot @ 2016-06-21 17:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, linux-mm, kbuild-all, linux-fsdevel

Hi,

[auto build test ERROR on v4.7-rc4]
[also build test ERROR on next-20160621]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/dax-Clear-dirty-bits-after-flushing-caches/20160621-234931
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/pagemap.h:10:0,
                    from include/linux/blkdev.h:14,
                    from fs/dax.c:18:
   fs/dax.c: In function 'dax_mapping_entry_mkclean':
>> arch/x86/include/asm/pgtable_32.h:52:38: error: incompatible type for argument 1 of '__kunmap_atomic'
    #define pte_unmap(pte) kunmap_atomic((pte))
                                         ^
   include/linux/highmem.h:127:18: note: in definition of macro 'kunmap_atomic'
     __kunmap_atomic(addr);                                  \
                     ^~~~
>> include/linux/mm.h:1681:2: note: in expansion of macro 'pte_unmap'
     pte_unmap(pte);     \
     ^~~~~~~~~
>> fs/dax.c:714:3: note: in expansion of macro 'pte_unmap_unlock'
      pte_unmap_unlock(pte, ptl);
      ^~~~~~~~~~~~~~~~
   In file included from include/linux/highmem.h:34:0,
                    from include/linux/pagemap.h:10,
                    from include/linux/blkdev.h:14,
                    from fs/dax.c:18:
   arch/x86/include/asm/highmem.h:68:6: note: expected 'void *' but argument is of type 'pte_t {aka union <anonymous>}'
    void __kunmap_atomic(void *kvaddr);
         ^~~~~~~~~~~~~~~

vim +/pte_unmap_unlock +714 fs/dax.c

   698			address = pgoff_address(index, vma);
   699			changed = false;
   700			if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
   701				continue;
   702			if (pfn != pte_pfn(*ptep))
   703				goto unlock;
   704			if (!pte_dirty(*ptep) && !pte_write(*ptep))
   705				goto unlock;
   706	
   707			flush_cache_page(vma, address, pfn);
   708			pte = ptep_clear_flush(vma, address, ptep);
   709			pte = pte_wrprotect(pte);
   710			pte = pte_mkclean(pte);
   711			set_pte_at(vma->vm_mm, address, ptep, pte);
   712			changed = true;
   713	unlock:
 > 714			pte_unmap_unlock(pte, ptl);
   715	
   716			if (changed)
   717				mmu_notifier_invalidate_page(vma->vm_mm, address);
   718		}
   719		i_mmap_unlock_read(mapping);
   720	}
   721	
   722	static int dax_writeback_one(struct block_device *bdev,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* Re: [PATCH 3/3] dax: Clear dirty entry tags on cache flush
From: kbuild test robot @ 2016-06-21 20:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-nvdimm, linux-mm, kbuild-all, linux-fsdevel

Hi,

[auto build test ERROR on v4.7-rc4]
[also build test ERROR on next-20160621]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/dax-Clear-dirty-bits-after-flushing-caches/20160621-234931
config: i386-allyesconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/dax.c: In function 'dax_mapping_entry_mkclean':
>> fs/dax.c:714:343: error: incompatible type for argument 1 of '__kunmap_atomic'
      pte_unmap_unlock(pte, ptl);
                                                                                                                                                                                                                                                                                                                                                          ^
   In file included from include/linux/highmem.h:34:0,
                    from include/linux/pagemap.h:10,
                    from include/linux/blkdev.h:14,
                    from fs/dax.c:18:
   arch/x86/include/asm/highmem.h:68:6: note: expected 'void *' but argument is of type 'pte_t {aka union <anonymous>}'
    void __kunmap_atomic(void *kvaddr);
         ^~~~~~~~~~~~~~~

vim +/__kunmap_atomic +714 fs/dax.c

   708			pte = ptep_clear_flush(vma, address, ptep);
   709			pte = pte_wrprotect(pte);
   710			pte = pte_mkclean(pte);
   711			set_pte_at(vma->vm_mm, address, ptep, pte);
   712			changed = true;
   713	unlock:
 > 714			pte_unmap_unlock(pte, ptl);
   715	
   716			if (changed)
   717				mmu_notifier_invalidate_page(vma->vm_mm, address);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

* Re: [PATCH 3/3] dax: Clear dirty entry tags on cache flush
From: Jan Kara @ 2016-06-23 10:47 UTC (permalink / raw)
  To: linux-nvdimm
  Cc: linux-mm, linux-fsdevel, Dan Williams, Ross Zwisler, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 193 bytes --]

Hi,

the previous version had a bug which manifested itself on i586. Attached is
a new version of the patch if someone is interested.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0003-dax-Clear-dirty-entry-tags-on-cache-flush.patch --]
[-- Type: text/x-patch, Size: 3694 bytes --]

From a55cfe2dc362826f5e81beb6c7607f0daa5d317d Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 17 Jun 2016 17:14:48 +0200
Subject: [PATCH 3/3] dax: Clear dirty entry tags on cache flush

Currently we never clear dirty tags in DAX mappings, and thus the address
ranges to flush keep accumulating. Now that we have locking of radix tree
entries, we have all the locking necessary to reliably clear the radix
tree dirty tag when flushing caches for the corresponding address range.
Similarly to page_mkclean(), we also have to write-protect pages to get a
page fault when a page is next written to, so that we can mark the
entry dirty again.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 68 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5209f8cd0bee..4b31658660dc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -31,6 +31,7 @@
 #include <linux/vmstat.h>
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * We use lowest available bit in exceptional entry for locking, other two
@@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 	return new_entry;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	return address;
+}
+
+/* Walk all mappings of a given index of a file and writeprotect them */
+static void dax_mapping_entry_mkclean(struct address_space *mapping,
+				      pgoff_t index, unsigned long pfn)
+{
+	struct vm_area_struct *vma;
+	pte_t *ptep;
+	pte_t pte;
+	spinlock_t *ptl;
+	bool changed;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+		unsigned long address;
+
+		cond_resched();
+
+		if (!(vma->vm_flags & VM_SHARED))
+			continue;
+
+		address = pgoff_address(index, vma);
+		changed = false;
+		if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
+			continue;
+		if (pfn != pte_pfn(*ptep))
+			goto unlock;
+		if (!pte_dirty(*ptep) && !pte_write(*ptep))
+			goto unlock;
+
+		flush_cache_page(vma, address, pfn);
+		pte = ptep_clear_flush(vma, address, ptep);
+		pte = pte_wrprotect(pte);
+		pte = pte_mkclean(pte);
+		set_pte_at(vma->vm_mm, address, ptep, pte);
+		changed = true;
+unlock:
+		pte_unmap_unlock(ptep, ptl);
+
+		if (changed)
+			mmu_notifier_invalidate_page(vma->vm_mm, address);
+	}
+	i_mmap_unlock_read(mapping);
+}
+
 static int dax_writeback_one(struct block_device *bdev,
 		struct address_space *mapping, pgoff_t index, void *entry)
 {
@@ -723,17 +777,30 @@ static int dax_writeback_one(struct block_device *bdev,
 	 * eventually calls cond_resched().
 	 */
 	ret = dax_map_atomic(bdev, &dax);
-	if (ret < 0)
+	if (ret < 0) {
+		put_locked_mapping_entry(mapping, index, entry);
 		return ret;
+	}
 
 	if (WARN_ON_ONCE(ret < dax.size)) {
 		ret = -EIO;
 		goto unmap;
 	}
 
+	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
 	wb_cache_pmem(dax.addr, dax.size);
+	/*
+	 * After we have flushed the cache, we can clear the dirty tag. There
+	 * cannot be new dirty data in the pfn after the flush has completed as
+	 * the pfn mappings are writeprotected and fault waits for mapping
+	 * entry lock.
+	 */
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&mapping->tree_lock);
 unmap:
 	dax_unmap_atomic(bdev, &dax);
+	put_locked_mapping_entry(mapping, index, entry);
 	return ret;
 
 put_unlock:
-- 
2.6.6



* Re: [PATCH 1/3] dax: Make cache flushing protected by entry lock
From: Ross Zwisler @ 2016-06-24 21:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Jun 21, 2016 at 05:45:13PM +0200, Jan Kara wrote:
> Currently, flushing of caches for DAX mappings ignores the entry lock.
> So far this was OK (modulo a bug where a difference in the entry lock
> bit could cause cache flushing to be mistakenly skipped), but in the
> following patches we will write-protect PTEs on cache flushing and
> clear dirty tags. For that we will need more exclusion, so do the
> cache flushing under an entry lock. This allows us to remove one
> lock-unlock pair of mapping->tree_lock as a bonus.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 62 +++++++++++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 761495bf5eb9..5209f8cd0bee 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -669,35 +669,54 @@ static int dax_writeback_one(struct block_device *bdev,
>  		struct address_space *mapping, pgoff_t index, void *entry)
>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	int type = RADIX_DAX_TYPE(entry);
> -	struct radix_tree_node *node;
> +	int type;
>  	struct blk_dax_ctl dax;
> -	void **slot;
>  	int ret = 0;
> +	void *entry2, **slot;

Nit: Let's retain the "reverse X-mas tree" ordering of our variable
definitions.
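
That is, order the declarations from the longest line down to the shortest,
something like:

	struct radix_tree_root *page_tree = &mapping->page_tree;
	struct blk_dax_ctl dax;
	void *entry2, **slot;
	int ret = 0;
	int type;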

> -	spin_lock_irq(&mapping->tree_lock);
>  	/*
> -	 * Regular page slots are stabilized by the page lock even
> -	 * without the tree itself locked.  These unlocked entries
> -	 * need verification under the tree lock.
> +	 * A page got tagged dirty in DAX mapping? Something is seriously
> +	 * wrong.
>  	 */
> -	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> -		goto unlock;
> -	if (*slot != entry)
> -		goto unlock;
> -
> -	/* another fsync thread may have already written back this entry */
> -	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> -		goto unlock;
> +	if (WARN_ON(!radix_tree_exceptional_entry(entry)))
> +		return -EIO;
>  
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
> +	/* Entry got punched out / reallocated? */
> +	if (!entry2 || !radix_tree_exceptional_entry(entry2))
> +		goto put_unlock;
> +	/*
> +	 * Entry got reallocated elsewhere? No need to writeback. We have to
> +	 * compare sectors as we must not bail out due to difference in lockbit
> +	 * or entry type.
> +	 */
> +	if (RADIX_DAX_SECTOR(entry2) != RADIX_DAX_SECTOR(entry))
> +		goto put_unlock;
> +	type = RADIX_DAX_TYPE(entry2);
>  	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
>  		ret = -EIO;
> -		goto unlock;
> +		goto put_unlock;
>  	}
> +	entry = entry2;

I don't think you need to set 'entry' here - you reset it in 4 lines via
lock_slot(), and don't use it in between.

> +
> +	/* Another fsync thread may have already written back this entry */
> +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto put_unlock;
> +	/* Lock the entry to serialize with page faults */
> +	entry = lock_slot(mapping, slot);

As of this patch nobody unlocks the slot.  :)  A quick test of "write, fsync,
fsync" confirms that it deadlocks.

You introduce the proper calls to unlock the slot via
put_locked_mapping_entry() in patch 3/3 - those probably need to be in this
patch instead.

> +	/*
> +	 * We can clear the tag now but we have to be careful so that concurrent
> +	 * dax_writeback_one() calls for the same index cannot finish before we
> +	 * actually flush the caches. This is achieved as the calls will look
> +	 * at the entry only under tree_lock and once they do that they will
> +	 * see the entry locked and wait for it to unlock.
> +	 */
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +	spin_unlock_irq(&mapping->tree_lock);
>  
>  	dax.sector = RADIX_DAX_SECTOR(entry);
>  	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> -	spin_unlock_irq(&mapping->tree_lock);
>  
>  	/*
>  	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> @@ -713,15 +732,12 @@ static int dax_writeback_one(struct block_device *bdev,
>  	}
>  
>  	wb_cache_pmem(dax.addr, dax.size);
> -
> -	spin_lock_irq(&mapping->tree_lock);
> -	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> -	spin_unlock_irq(&mapping->tree_lock);
> - unmap:
> +unmap:
>  	dax_unmap_atomic(bdev, &dax);
>  	return ret;
>  
> - unlock:
> +put_unlock:
> +	put_unlocked_mapping_entry(mapping, index, entry2);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	return ret;
>  }
> -- 
> 2.6.6
> 

* Re: [PATCH 2/3] mm: Export follow_pte()
  2016-06-21 15:45   ` Jan Kara
@ 2016-06-24 21:55     ` Ross Zwisler
  -1 siblings, 0 replies; 31+ messages in thread
From: Ross Zwisler @ 2016-06-24 21:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Jun 21, 2016 at 05:45:14PM +0200, Jan Kara wrote:
> DAX will need to implement its own version of check_page_address(). To
						page_check_address()

> avoid duplicating page table walking code, export follow_pte() which
> does what we need.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/mm.h | 2 ++
>  mm/memory.c        | 5 +++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5df5feb49575..989f5d949db3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1193,6 +1193,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			struct vm_area_struct *vma);
>  void unmap_mapping_range(struct address_space *mapping,
>  		loff_t const holebegin, loff_t const holelen, int even_cows);
> +int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
> +	       spinlock_t **ptlp);
>  int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>  	unsigned long *pfn);
>  int follow_phys(struct vm_area_struct *vma, unsigned long address,
> diff --git a/mm/memory.c b/mm/memory.c
> index 15322b73636b..f6175d63c2e9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3647,8 +3647,8 @@ out:
>  	return -EINVAL;
>  }
>  
> -static inline int follow_pte(struct mm_struct *mm, unsigned long address,
> -			     pte_t **ptepp, spinlock_t **ptlp)
> +int follow_pte(struct mm_struct *mm, unsigned long address, pte_t **ptepp,
> +	       spinlock_t **ptlp)
>  {
>  	int res;
>  
> @@ -3657,6 +3657,7 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
>  			   !(res = __follow_pte(mm, address, ptepp, ptlp)));
>  	return res;
>  }
> +EXPORT_SYMBOL(follow_pte);
>  
>  /**
>   * follow_pfn - look up PFN at a user virtual address
> -- 
> 2.6.6
> 
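For context, patch 3/3 in this series ends up using the newly exported helper
roughly as follows (condensed from that patch; error handling and the
surrounding loop over VMAs are elided):

	pte_t *ptep;
	spinlock_t *ptl;

	if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
		continue;	/* no PTE currently mapped at this address */
	/* ... inspect or modify *ptep while holding the PTE lock ... */
	pte_unmap_unlock(ptep, ptl);
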
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] dax: Clear dirty entry tags on cache flush
  2016-06-21 15:45   ` Jan Kara
@ 2016-06-28 21:38     ` Ross Zwisler
  -1 siblings, 0 replies; 31+ messages in thread
From: Ross Zwisler @ 2016-06-28 21:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Jun 21, 2016 at 05:45:15PM +0200, Jan Kara wrote:
> Currently we never clear dirty tags in DAX mappings and thus address
> ranges to flush accumulate. Now that we have locking of radix tree
> entries, we have all the locking necessary to reliably clear the radix
> tree dirty tag when flushing caches for corresponding address range.
> Similarly to page_mkclean() we also have to write-protect pages to get a
> page fault when the page is next written to so that we can mark the
> entry dirty again.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I think we still have a race where we can end up with a writeable PTE but a
clean DAX radix tree entry.  Here is the flow:

Thread 1					Thread 2
======== 					========
wp_pfn_shared()
  dax_pfn_mkwrite()
    get_unlocked_mapping_entry()
    radix_tree_tag_set(DIRTY)
    put_unlocked_mapping_entry()
						dax_writeback_one()
						  lock_slot()
						  radix_tree_tag_clear(TOWRITE)
						  dax_mapping_entry_mkclean()
						  wb_cache_pmem()
						  radix_tree_tag_clear(DIRTY)
						  put_locked_mapping_entry()
  wp_page_reuse()
    maybe_mkwrite()

When this ends the radix tree entry will have been written back and the
TOWRITE and DIRTY radix tree tags will have been cleared, but the PTE will be
writeable due to the last maybe_mkwrite() call.  This will result in no new
dax_pfn_mkwrite() calls happening to re-dirty the radix tree entry.

Essentially the problem is that we don't hold any locks through all the work
done by wp_pfn_shared(), so we have no way to do maybe_mkwrite() safely.

Perhaps we can lock the radix tree slot in dax_pfn_mkwrite(), then have
another callback into DAX after the wp_page_reuse() => maybe_mkwrite() to
unlock the slot?  This would guarantee that they happen together from DAX's
point of view.

Thread 1's flow would then be:

Thread 1
========
wp_pfn_shared()
  dax_pfn_mkwrite()
    lock_slot()
    radix_tree_tag_set(DIRTY)
  wp_page_reuse()
    maybe_mkwrite()
    new_dax_call_to_unlock_slot()
      put_unlocked_mapping_entry()

> ---
>  fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5209f8cd0bee..c0c4eecb5f73 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -31,6 +31,7 @@
>  #include <linux/vmstat.h>
>  #include <linux/pfn_t.h>
>  #include <linux/sizes.h>
> +#include <linux/mmu_notifier.h>
>  
>  /*
>   * We use lowest available bit in exceptional entry for locking, other two
> @@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
>  	return new_entry;
>  }
>  
> +static inline unsigned long
> +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> +{
> +	unsigned long address;
> +
> +	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +	return address;
> +}
> +
> +/* Walk all mappings of a given index of a file and writeprotect them */
> +static void dax_mapping_entry_mkclean(struct address_space *mapping,
> +				      pgoff_t index, unsigned long pfn)
> +{
> +	struct vm_area_struct *vma;
> +	pte_t *ptep;
> +	pte_t pte;
> +	spinlock_t *ptl;
> +	bool changed;
> +
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
> +		unsigned long address;
> +
> +		cond_resched();

Do we really need to cond_resched() between every PTE clean?  Maybe it would
be better to cond_resched() after each call to dax_writeback_one() in
dax_writeback_mapping_range() or something so we can be sure we've done a good
amount of work between each?
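
Something along these lines, for example (hypothetical placement; the loop body
of dax_writeback_mapping_range() is paraphrased rather than quoted):

	/* for each dirty entry found in the radix tree */
	ret = dax_writeback_one(bdev, mapping, indices[i], pvec.pages[i]);
	if (ret < 0)
		return ret;
	cond_resched();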

> +
> +		if (!(vma->vm_flags & VM_SHARED))
> +			continue;
> +
> +		address = pgoff_address(index, vma);
> +		changed = false;
> +		if (follow_pte(vma->vm_mm, address, &ptep, &ptl))
> +			continue;
> +		if (pfn != pte_pfn(*ptep))
> +			goto unlock;
> +		if (!pte_dirty(*ptep) && !pte_write(*ptep))
> +			goto unlock;
> +
> +		flush_cache_page(vma, address, pfn);
> +		pte = ptep_clear_flush(vma, address, ptep);
> +		pte = pte_wrprotect(pte);
> +		pte = pte_mkclean(pte);
> +		set_pte_at(vma->vm_mm, address, ptep, pte);
> +		changed = true;
> +unlock:
> +		pte_unmap_unlock(pte, ptl);
> +
> +		if (changed)
> +			mmu_notifier_invalidate_page(vma->vm_mm, address);
> +	}
> +	i_mmap_unlock_read(mapping);
> +}
> +
>  static int dax_writeback_one(struct block_device *bdev,
>  		struct address_space *mapping, pgoff_t index, void *entry)
>  {
> @@ -723,17 +777,30 @@ static int dax_writeback_one(struct block_device *bdev,
>  	 * eventually calls cond_resched().
>  	 */
>  	ret = dax_map_atomic(bdev, &dax);
> -	if (ret < 0)
> +	if (ret < 0) {
> +		put_locked_mapping_entry(mapping, index, entry);
>  		return ret;
> +	}
>  
>  	if (WARN_ON_ONCE(ret < dax.size)) {
>  		ret = -EIO;
>  		goto unmap;
>  	}
>  
> +	dax_mapping_entry_mkclean(mapping, index, pfn_t_to_pfn(dax.pfn));
>  	wb_cache_pmem(dax.addr, dax.size);
> +	/*
> +	 * After we have flushed the cache, we can clear the dirty tag. There
> +	 * cannot be new dirty data in the pfn after the flush has completed as
> +	 * the pfn mappings are writeprotected and fault waits for mapping
> +	 * entry lock.
> +	 */
> +	spin_lock_irq(&mapping->tree_lock);
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
> +	spin_unlock_irq(&mapping->tree_lock);
>  unmap:
>  	dax_unmap_atomic(bdev, &dax);
> +	put_locked_mapping_entry(mapping, index, entry);
>  	return ret;
>  
>  put_unlock:
> -- 
> 2.6.6
> 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/3 v1] dax: Clear dirty bits after flushing caches
  2016-06-21 15:45 ` Jan Kara
@ 2016-06-28 21:41   ` Ross Zwisler
  -1 siblings, 0 replies; 31+ messages in thread
From: Ross Zwisler @ 2016-06-28 21:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, linux-nvdimm

On Tue, Jun 21, 2016 at 05:45:12PM +0200, Jan Kara wrote:
> Hello,
> 
> currently we never clear dirty bits in the radix tree of a DAX inode. Thus
> fsync(2) or even periodical writeback flush all the dirty pfns again and
> again. This patches implement clearing of the dirty tag in the radix tree
> so that we issue flush only when needed.
> 
> The difficulty with clearing the dirty tag is that we have to protect against
> a concurrent page fault setting the dirty tag and writing new data into the
> page. So we need a lock serializing page fault and clearing of the dirty tag
> and write-protecting PTEs (so that we get another pagefault when pfn is written
> to again and we have to set the dirty tag again).
> 
> The effect of the patch set is easily visible:
> 
> Writing 1 GB of data via mmap, then fsync twice.
> 
> Before this patch set both fsyncs take ~205 ms on my test machine, after the
> patch set the first fsync takes ~283 ms (the additional cost of walking PTEs,
> clearing dirty bits etc. is very noticeable), the second fsync takes below
> 1 us.
> 
> As a bonus, these patches make filesystem freezing for DAX filesystems
> reliable because mappings are now properly writeprotected while freezing the
> fs.
> 
> Patches have passed xfstests for both xfs and ext4.
> 
> So far the patches don't work with PMD pages - that's next on my todo list.

Regarding the PMD work, I had a go at this a while ago.  You may (or may not)
find these patches useful:

mm: add follow_pte_pmd()
	https://patchwork.kernel.org/patch/7616241/
mm: add pmd_mkclean()
	https://patchwork.kernel.org/patch/7616261/
mm: add pgoff_mkclean()
	https://patchwork.kernel.org/patch/7616221/
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/3] dax: Make cache flushing protected by entry lock
  2016-06-24 21:44     ` Ross Zwisler
@ 2016-06-29 20:28       ` Jan Kara
  -1 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2016-06-29 20:28 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-fsdevel, linux-mm, Jan Kara, linux-nvdimm

Hi Ross,

all your comments look fine and I've incorporated them. Thanks for the review!

								Honza

On Fri 24-06-16 15:44:45, Ross Zwisler wrote:
> On Tue, Jun 21, 2016 at 05:45:13PM +0200, Jan Kara wrote:
> > Currently, flushing of caches for DAX mappings was ignoring entry lock.
> > So far this was ok (modulo a bug that a difference in entry lock could
> > cause cache flushing to be mistakenly skipped) but in the following
> > patches we will write-protect PTEs on cache flushing and clear dirty
> > tags. For that we will need more exclusion. So do cache flushing under
> > an entry lock. This allows us to remove one lock-unlock pair of
> > mapping->tree_lock as a bonus.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c | 62 +++++++++++++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 39 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 761495bf5eb9..5209f8cd0bee 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -669,35 +669,54 @@ static int dax_writeback_one(struct block_device *bdev,
> >  		struct address_space *mapping, pgoff_t index, void *entry)
> >  {
> >  	struct radix_tree_root *page_tree = &mapping->page_tree;
> > -	int type = RADIX_DAX_TYPE(entry);
> > -	struct radix_tree_node *node;
> > +	int type;
> >  	struct blk_dax_ctl dax;
> > -	void **slot;
> >  	int ret = 0;
> > +	void *entry2, **slot;
> 
> Nit: Let's retain the "reverse X-mas tree" ordering of our variable
> definitions.
> 
> > -	spin_lock_irq(&mapping->tree_lock);
> >  	/*
> > -	 * Regular page slots are stabilized by the page lock even
> > -	 * without the tree itself locked.  These unlocked entries
> > -	 * need verification under the tree lock.
> > +	 * A page got tagged dirty in DAX mapping? Something is seriously
> > +	 * wrong.
> >  	 */
> > -	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > -		goto unlock;
> > -	if (*slot != entry)
> > -		goto unlock;
> > -
> > -	/* another fsync thread may have already written back this entry */
> > -	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > -		goto unlock;
> > +	if (WARN_ON(!radix_tree_exceptional_entry(entry)))
> > +		return -EIO;
> >  
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
> > +	/* Entry got punched out / reallocated? */
> > +	if (!entry2 || !radix_tree_exceptional_entry(entry2))
> > +		goto put_unlock;
> > +	/*
> > +	 * Entry got reallocated elsewhere? No need to writeback. We have to
> > +	 * compare sectors as we must not bail out due to difference in lockbit
> > +	 * or entry type.
> > +	 */
> > +	if (RADIX_DAX_SECTOR(entry2) != RADIX_DAX_SECTOR(entry))
> > +		goto put_unlock;
> > +	type = RADIX_DAX_TYPE(entry2);
> >  	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> >  		ret = -EIO;
> > -		goto unlock;
> > +		goto put_unlock;
> >  	}
> > +	entry = entry2;
> 
> I don't think you need to set 'entry' here - you reset it in 4 lines via
> lock_slot(), and don't use it in between.
> 
> > +
> > +	/* Another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto put_unlock;
> > +	/* Lock the entry to serialize with page faults */
> > +	entry = lock_slot(mapping, slot);
> 
> As of this patch nobody unlocks the slot.  :)  A quick test of "write, fsync,
> fsync" confirms that it deadlocks.
> 
> You introduce the proper calls to unlock the slot via
> put_locked_mapping_entry() in patch 3/3 - those probably need to be in this
> patch instead.
> 
> > +	/*
> > +	 * We can clear the tag now but we have to be careful so that concurrent
> > +	 * dax_writeback_one() calls for the same index cannot finish before we
> > +	 * actually flush the caches. This is achieved as the calls will look
> > +	 * at the entry only under tree_lock and once they do that they will
> > +	 * see the entry locked and wait for it to unlock.
> > +	 */
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> >  
> >  	dax.sector = RADIX_DAX_SECTOR(entry);
> >  	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > -	spin_unlock_irq(&mapping->tree_lock);
> >  
> >  	/*
> >  	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> > @@ -713,15 +732,12 @@ static int dax_writeback_one(struct block_device *bdev,
> >  	}
> >  
> >  	wb_cache_pmem(dax.addr, dax.size);
> > -
> > -	spin_lock_irq(&mapping->tree_lock);
> > -	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > -	spin_unlock_irq(&mapping->tree_lock);
> > - unmap:
> > +unmap:
> >  	dax_unmap_atomic(bdev, &dax);
> >  	return ret;
> >  
> > - unlock:
> > +put_unlock:
> > +	put_unlocked_mapping_entry(mapping, index, entry2);
> >  	spin_unlock_irq(&mapping->tree_lock);
> >  	return ret;
> >  }
> > -- 
> > 2.6.6
> > 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/3] mm: Export follow_pte()
  2016-06-24 21:55     ` Ross Zwisler
@ 2016-06-29 20:29       ` Jan Kara
  -1 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2016-06-29 20:29 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-fsdevel, linux-mm, Jan Kara, linux-nvdimm

On Fri 24-06-16 15:55:12, Ross Zwisler wrote:
> On Tue, Jun 21, 2016 at 05:45:14PM +0200, Jan Kara wrote:
> > DAX will need to implement its own version of check_page_address(). To
> 						page_check_address()

Thanks. Fixed.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/3] dax: Clear dirty entry tags on cache flush
  2016-06-28 21:38     ` Ross Zwisler
@ 2016-06-29 20:47       ` Jan Kara
  -1 siblings, 0 replies; 31+ messages in thread
From: Jan Kara @ 2016-06-29 20:47 UTC (permalink / raw)
  To: Ross Zwisler; +Cc: linux-fsdevel, linux-mm, Jan Kara, linux-nvdimm

On Tue 28-06-16 15:38:57, Ross Zwisler wrote:
> On Tue, Jun 21, 2016 at 05:45:15PM +0200, Jan Kara wrote:
> > Currently we never clear dirty tags in DAX mappings and thus address
> > ranges to flush accumulate. Now that we have locking of radix tree
> > entries, we have all the locking necessary to reliably clear the radix
> > tree dirty tag when flushing caches for corresponding address range.
> > Similarly to page_mkclean() we also have to write-protect pages to get a
> > page fault when the page is next written to so that we can mark the
> > entry dirty again.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> I think we still have a race where we can end up with a writeable PTE but a
> clean DAX radix tree entry.  Here is the flow:
> 
> Thread 1					Thread 2
> ======== 					========
> wp_pfn_shared()
>   dax_pfn_mkwrite()
>     get_unlocked_mapping_entry()
>     radix_tree_tag_set(DIRTY)
>     put_unlocked_mapping_entry()
> 						dax_writeback_one()
> 						  lock_slot()
> 						  radix_tree_tag_clear(TOWRITE)
> 						  dax_mapping_entry_mkclean()
> 						  wb_cache_pmem()
> 						  radix_tree_tag_clear(DIRTY)
> 						  put_locked_mapping_entry()
>   wp_page_reuse()
>     maybe_mkwrite()
> 
> When this ends the radix tree entry will have been written back and the
> TOWRITE and DIRTY radix tree tags will have been cleared, but the PTE will be
> writeable due to the last maybe_mkwrite() call.  This will result in no new
> dax_pfn_mkwrite() calls happening to re-dirty the radix tree entry.

Good point, thanks for spotting this! There are several ways to fix this -
one is the callback you suggest below, but I don't like that one much. Another
is returning VM_FAULT_DAX_LOCKED as we do for COW faults and handling that
properly in the page_mkwrite() paths. Finally, we could just handle the PTE
changes within the pfn_mkwrite() handler like we already do for the ->fault
path and thus keep the locking private to the DAX code. I haven't decided yet
what's best. I'll think about it.
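
For illustration, a rough sketch of that last option, using the entry locking
helpers from this series (purely illustrative; lookup failures and error
handling are elided):

	/* hypothetical shape of dax_pfn_mkwrite() with private locking */
	spin_lock_irq(&mapping->tree_lock);
	entry = get_unlocked_mapping_entry(mapping, index, &slot);
	/* assume a valid exceptional entry was found */
	entry = lock_slot(mapping, slot);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
	/* ... make the PTE writeable while the radix tree entry is locked ... */
	put_locked_mapping_entry(mapping, index, entry);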

> Essentially the problem is that we don't hold any locks through all the work
> done by wp_pfn_shared() so that we can do maybe_mkwrite() safely.
> 
> Perhaps we can lock the radix tree slot in dax_pfn_mkwrite(), then have
> another callback into DAX after the wp_page_reuse() => maybe_mkwrite() to
> unlock the slot?  This would guarantee that they happen together from DAX's
> point of view.
> 
> Thread 1's flow would then be:
> 
> Thread 1
> ========
> wp_pfn_shared()
>   dax_pfn_mkwrite()
>     lock_slot()
>     radix_tree_tag_set(DIRTY)
>   wp_page_reuse()
>     maybe_mkwrite()
>     new_dax_call_to_unlock_slot()
>       put_unlocked_mapping_entry()
> 
> > ---
> >  fs/dax.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 68 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5209f8cd0bee..c0c4eecb5f73 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -31,6 +31,7 @@
> >  #include <linux/vmstat.h>
> >  #include <linux/pfn_t.h>
> >  #include <linux/sizes.h>
> > +#include <linux/mmu_notifier.h>
> >  
> >  /*
> >   * We use lowest available bit in exceptional entry for locking, other two
> > @@ -665,6 +666,59 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> >  	return new_entry;
> >  }
> >  
> > +static inline unsigned long
> > +pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> > +{
> > +	unsigned long address;
> > +
> > +	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> > +	return address;
> > +}
> > +
> > +/* Walk all mappings of a given index of a file and writeprotect them */
> > +static void dax_mapping_entry_mkclean(struct address_space *mapping,
> > +				      pgoff_t index, unsigned long pfn)
> > +{
> > +	struct vm_area_struct *vma;
> > +	pte_t *ptep;
> > +	pte_t pte;
> > +	spinlock_t *ptl;
> > +	bool changed;
> > +
> > +	i_mmap_lock_read(mapping);
> > +	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
> > +		unsigned long address;
> > +
> > +		cond_resched();
> 
> Do we really need to cond_resched() between every PTE clean?  Maybe it would
> be better to cond_resched() after each call to dax_writeback_one() in
> dax_writeback_mapping_range() or something so we can be sure we've done a good
> amount of work between each?

This is copied over from the rmap code and I'd prefer to keep the code similar.
Note that cond_resched() does not mean we are going to reschedule. It is a
pretty cheap call and we will be rescheduled only if we have used up our CPU
slice...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-06-29 20:48 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-21 15:45 [PATCH 0/3 v1] dax: Clear dirty bits after flushing caches Jan Kara
2016-06-21 15:45 ` Jan Kara
2016-06-21 15:45 ` [PATCH 1/3] dax: Make cache flushing protected by entry lock Jan Kara
2016-06-21 15:45   ` Jan Kara
2016-06-24 21:44   ` Ross Zwisler
2016-06-24 21:44     ` Ross Zwisler
2016-06-29 20:28     ` Jan Kara
2016-06-29 20:28       ` Jan Kara
2016-06-21 15:45 ` [PATCH 2/3] mm: Export follow_pte() Jan Kara
2016-06-21 15:45   ` Jan Kara
2016-06-24 21:55   ` Ross Zwisler
2016-06-24 21:55     ` Ross Zwisler
2016-06-29 20:29     ` Jan Kara
2016-06-29 20:29       ` Jan Kara
2016-06-21 15:45 ` [PATCH 3/3] dax: Clear dirty entry tags on cache flush Jan Kara
2016-06-21 15:45   ` Jan Kara
2016-06-21 17:31   ` kbuild test robot
2016-06-21 17:31     ` kbuild test robot
2016-06-21 17:31     ` kbuild test robot
2016-06-21 20:59   ` kbuild test robot
2016-06-21 20:59     ` kbuild test robot
2016-06-21 20:59     ` kbuild test robot
2016-06-23 10:47   ` Jan Kara
2016-06-23 10:47     ` Jan Kara
2016-06-23 10:47     ` Jan Kara
2016-06-28 21:38   ` Ross Zwisler
2016-06-28 21:38     ` Ross Zwisler
2016-06-29 20:47     ` Jan Kara
2016-06-29 20:47       ` Jan Kara
2016-06-28 21:41 ` [PATCH 0/3 v1] dax: Clear dirty bits after flushing caches Ross Zwisler
2016-06-28 21:41   ` Ross Zwisler
