linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries
@ 2017-05-04 19:59 Ross Zwisler
  2017-05-04 19:59 ` [PATCH v2 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
  2017-05-05  7:29 ` [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
  0 siblings, 2 replies; 4+ messages in thread
From: Ross Zwisler @ 2017-05-04 19:59 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Alexey Kuznetsov, Andrey Ryabinin,
	Anna Schumaker, Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jan Kara, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criterion: the page must not be mapped.  This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can end up in a situation where a DAX
exceptional entry, either a huge zero page or a regular DAX entry, is
mapped but has no associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

We could potentially do an rmap walk to see if each of the entries actually
has any active mappings before we remove it, but this might end up being
very expensive and doesn't currently look to be worth it.

So, just remove dax_invalidate_mapping_entry() and leave the DAX entries in
the radix tree.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---
 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85abd74..166504c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d8a3dc0..f8e1833 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 6263aff..c537184 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v2 2/2] dax: fix data corruption due to stale mmap reads
  2017-05-04 19:59 [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
@ 2017-05-04 19:59 ` Ross Zwisler
  2017-05-05  7:29 ` [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
  1 sibling, 0 replies; 4+ messages in thread
From: Ross Zwisler @ 2017-05-04 19:59 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Alexey Kuznetsov, Andrey Ryabinin,
	Anna Schumaker, Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jan Kara, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

Users of DAX can suffer data corruption from stale mmap reads via the
following sequence:

- open an mmap over a 2MiB hole

- read from the hole via the mmap, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we incorrectly
  leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero page
  mapping is still intact we read back zeroes instead of the new data.

We fix this by unconditionally calling invalidate_inode_pages2_range() in
dax_iomap_actor() for new block allocations, and by enhancing
invalidate_inode_pages2_range() so that it properly unmaps the DAX entries
being removed from the radix tree.

This is based on an initial patch from Jan Kara.

I've written an fstest that triggers this error:
http://www.spinics.net/lists/linux-mm/msg126276.html

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---

Changes since v1:
 - Instead of unmapping each DAX entry individually in
   __dax_invalidate_mapping_entry(), unmap the whole range at once
   inside of invalidate_inode_pages2_range().  Each unmap requires an rmap
   walk so this should be less expensive, plus now we don't have to drop
   and re-acquire the mapping->tree_lock for each entry. (Jan)

These patches apply cleanly to v4.11 and have passed an xfstest run.
They also apply to v4.10.13 with a little help from git am's 3-way merge.

---
 fs/dax.c      |  8 ++++----
 mm/truncate.c | 10 ++++++++++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 166504c..1f2c880 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -999,11 +999,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		return -EIO;
 
 	/*
-	 * Write can allocate block for an area which has a hole page mapped
-	 * into page tables. We have to tear down these mappings so that data
-	 * written by write(2) is visible in mmap.
+	 * Write can allocate block for an area which has a hole page or zero
+	 * PMD entry in the radix tree.  We have to tear down these mappings so
+	 * that data written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/mm/truncate.c b/mm/truncate.c
index c537184..ad40316 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -683,6 +683,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}
+
+	/*
+	 * Ensure that any DAX exceptional entries that have been invalidated
+	 * are also unmapped.
+	 */
+	if (dax_mapping(mapping)) {
+		unmap_mapping_range(mapping, (loff_t)start << PAGE_SHIFT,
+				(loff_t)(1 + end - start) << PAGE_SHIFT, 0);
+	}
+
 	cleancache_invalidate_inode(mapping);
 	return ret;
 }
-- 
2.9.3


* Re: [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries
  2017-05-04 19:59 [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
  2017-05-04 19:59 ` [PATCH v2 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
@ 2017-05-05  7:29 ` Jan Kara
  2017-05-08 17:08   ` Ross Zwisler
  1 sibling, 1 reply; 4+ messages in thread
From: Jan Kara @ 2017-05-05  7:29 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Alexander Viro, Alexey Kuznetsov,
	Andrey Ryabinin, Anna Schumaker, Christoph Hellwig, Dan Williams,
	Darrick J. Wong, Eric Van Hensbergen, Jan Kara, Jens Axboe,
	Johannes Weiner, Konrad Rzeszutek Wilk, Latchesar Ionkov,
	linux-cifs, linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm,
	Matthew Wilcox, Ron Minnich, samba-technical, Steve French,
	Trond Myklebust, v9fs-developer

On Thu 04-05-17 13:59:09, Ross Zwisler wrote:
> dax_invalidate_mapping_entry() currently removes DAX exceptional entries
> only if they are clean and unlocked.  This is done via:
> 
> invalidate_mapping_pages()
>   invalidate_exceptional_entry()
>     dax_invalidate_mapping_entry()
> 
> However, for page cache pages removed in invalidate_mapping_pages() there
> is an additional criterion: the page must not be mapped.  This
> is noted in the comments above invalidate_mapping_pages() and is checked in
> invalidate_inode_page().
> 
> For DAX entries this means that we can end up in a situation where a DAX
> exceptional entry, either a huge zero page or a regular DAX entry, is
> mapped but has no associated radix tree entry. This is
> inconsistent with the rest of the DAX code and with what happens in the
> page cache case.
> 
> We aren't able to unmap the DAX exceptional entry because according to its
> comments invalidate_mapping_pages() isn't allowed to block, and
> unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
> 
> We could potentially do an rmap walk to see if each of the entries actually
> has any active mappings before we remove it, but this might end up being
> very expensive and doesn't currently look to be worth it.
> 
> So, just remove dax_invalidate_mapping_entry() and leave the DAX entries in
> the radix tree.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]

Ah, I've just sent out a series which contains these two patches and
another two patches which change the entry locking to fix the last spotted
race...  So either just take my last two patches on top of these two or
take my series as a whole.

								Honza

> ---
>  fs/dax.c            | 29 -----------------------------
>  include/linux/dax.h |  1 -
>  mm/truncate.c       |  9 +++------
>  3 files changed, 3 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85abd74..166504c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  }
>  
>  /*
> - * Invalidate exceptional DAX entry if easily possible. This handles DAX
> - * entries for invalidate_inode_pages() so we evict the entry only if we can
> - * do so without blocking.
> - */
> -int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
> -{
> -	int ret = 0;
> -	void *entry, **slot;
> -	struct radix_tree_root *page_tree = &mapping->page_tree;
> -
> -	spin_lock_irq(&mapping->tree_lock);
> -	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
> -	if (!entry || !radix_tree_exceptional_entry(entry) ||
> -	    slot_locked(mapping, slot))
> -		goto out;
> -	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> -	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> -		goto out;
> -	radix_tree_delete(page_tree, index);
> -	mapping->nrexceptional--;
> -	ret = 1;
> -out:
> -	spin_unlock_irq(&mapping->tree_lock);
> -	if (ret)
> -		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
> -	return ret;
> -}
> -
> -/*
>   * Invalidate exceptional DAX entry if it is clean.
>   */
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d8a3dc0..f8e1833 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  		    const struct iomap_ops *ops);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 6263aff..c537184 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
>  
>  /*
>   * Invalidate exceptional entry if easily possible. This handles exceptional
> - * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
> - * clean entries.
> + * entries for invalidate_inode_pages().
>   */
>  static int invalidate_exceptional_entry(struct address_space *mapping,
>  					pgoff_t index, void *entry)
>  {
> -	/* Handled by shmem itself */
> -	if (shmem_mapping(mapping))
> +	/* Handled by shmem itself, or for DAX we do nothing. */
> +	if (shmem_mapping(mapping) || dax_mapping(mapping))
>  		return 1;
> -	if (dax_mapping(mapping))
> -		return dax_invalidate_mapping_entry(mapping, index);
>  	clear_shadow_entry(mapping, index, entry);
>  	return 1;
>  }
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries
  2017-05-05  7:29 ` [PATCH v2 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
@ 2017-05-08 17:08   ` Ross Zwisler
  0 siblings, 0 replies; 4+ messages in thread
From: Ross Zwisler @ 2017-05-08 17:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Fri, May 05, 2017 at 09:29:12AM +0200, Jan Kara wrote:
> On Thu 04-05-17 13:59:09, Ross Zwisler wrote:
> > dax_invalidate_mapping_entry() currently removes DAX exceptional entries
> > only if they are clean and unlocked.  This is done via:
> > 
> > invalidate_mapping_pages()
> >   invalidate_exceptional_entry()
> >     dax_invalidate_mapping_entry()
> > 
> > However, for page cache pages removed in invalidate_mapping_pages() there
> > is an additional criterion: the page must not be mapped.  This
> > is noted in the comments above invalidate_mapping_pages() and is checked in
> > invalidate_inode_page().
> > 
> > For DAX entries this means that we can end up in a situation where a DAX
> > exceptional entry, either a huge zero page or a regular DAX entry, is
> > mapped but has no associated radix tree entry. This is
> > inconsistent with the rest of the DAX code and with what happens in the
> > page cache case.
> > 
> > We aren't able to unmap the DAX exceptional entry because according to its
> > comments invalidate_mapping_pages() isn't allowed to block, and
> > unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
> > 
> > We could potentially do an rmap walk to see if each of the entries actually
> > has any active mappings before we remove it, but this might end up being
> > very expensive and doesn't currently look to be worth it.
> > 
> > So, just remove dax_invalidate_mapping_entry() and leave the DAX entries in
> > the radix tree.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> > Reported-by: Jan Kara <jack@suse.cz>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > Cc: <stable@vger.kernel.org>    [4.10+]
> 
> Ah, I've just sent out a series which contains these two patches and
> another two patches which change the entry locking to fix the last spotted
> race...  So either just take my last two patches on top of these two or
> take my series as a whole.

Sounds good. You added a better comment in invalidate_inode_pages2_range(), so
let's just use your version of this series.

