linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Alexey Kuznetsov <kuznet@virtuozzo.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Christoph Hellwig <hch@lst.de>,
	Dan Williams <dan.j.williams@intel.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Eric Van Hensbergen <ericvh@gmail.com>, Jan Kara <jack@suse.cz>,
	Jens Axboe <axboe@kernel.dk>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Latchesar Ionkov <lucho@ionkov.net>,
	linux-cifs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nfs@vger.kernel.org,
	linux-nvdimm@ml01.01.org, Matthew Wilcox <mawilcox@microsoft.com>,
	Ron Minnich <rminnich@sandia.gov>,
	samba-technical@lists.samba.org, Steve French <sfrench@samba.org>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	v9fs-developer@lists.sourceforge.net
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Date: Tue, 25 Apr 2017 13:10:43 +0200	[thread overview]
Message-ID: <20170425111043.GH2793@quack2.suse.cz> (raw)
In-Reply-To: <20170421034437.4359-2-ross.zwisler@linux.intel.com>

On Thu 20-04-17 21:44:37, Ross Zwisler wrote:
> Users of DAX can suffer data corruption from stale mmap reads via the
> following sequence:
> 
> - open an mmap over a 2MiB hole
> 
> - read from a 2MiB hole, faulting in a 2MiB zero page
> 
> - write to the hole with write(3p).  The write succeeds but we incorrectly
>   leave the 2MiB zero page mapping intact.
> 
> - via the mmap, read the data that was just written.  Since the zero page
>   mapping is still intact we read back zeroes instead of the new data.
> 
> We fix this by unconditionally calling invalidate_inode_pages2_range() in
> dax_iomap_actor() for new block allocations, and by enhancing
> __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
> being removed from the radix tree.
> 
> This is based on an initial patch from Jan Kara.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]
> ---
>  fs/dax.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 166504c..3f445d5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
>  					  pgoff_t index, bool trunc)
>  {
>  	int ret = 0;
> -	void *entry;
> +	void *entry, **slot;
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	if (!trunc &&
>  	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
>  	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
>  		goto out;
> +
> +	/*
> +	 * Make sure 'entry' remains valid while we drop mapping->tree_lock to
> +	 * do the unmap_mapping_range() call.
> +	 */
> +	entry = lock_slot(mapping, slot);

This also stops page faults from mapping the entry again. Maybe worth
mentioning here as well.

> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
> +			(loff_t)PAGE_SIZE << dax_radix_order(entry), 0);

Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping
large ranges - each unmap means an rmap walk... Since this is a data
corruption class of bug, let's fix it this way for now but I think we'll
need to improve this later.

E.g. what if we called unmap_mapping_range() for the whole invalidated
range after removing the radix tree entries?

Hum, but now thinking more about it I have hard time figuring out why write
vs fault cannot actually still race:

CPU1 - write(2)				CPU2 - read fault

					dax_iomap_pte_fault()
					  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
					  grab_mapping_entry()
					  - we add zero page in the radix
					    tree & map it to page tables

Similarly read vs write fault may end up racing in a wrong way and try to
replace already existing exceptional entry with a hole page?

								Honza
> +
> +	spin_lock_irq(&mapping->tree_lock);
>  	radix_tree_delete(page_tree, index);
>  	mapping->nrexceptional--;
>  	ret = 1;
>  out:
> -	put_unlocked_mapping_entry(mapping, index, entry);
>  	spin_unlock_irq(&mapping->tree_lock);
> +	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
>  	return ret;
>  }
>  /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		return -EIO;
>  
>  	/*
> -	 * Write can allocate block for an area which has a hole page mapped
> -	 * into page tables. We have to tear down these mappings so that data
> -	 * written by write(2) is visible in mmap.
> +	 * Write can allocate block for an area which has a hole page or zero
> +	 * PMD entry in the radix tree.  We have to tear down these mappings so
> +	 * that data written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if (iomap->flags & IOMAP_F_NEW) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2017-04-25 11:10 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-18 19:38   ` Ross Zwisler
2017-04-19 15:11     ` Andrey Ryabinin
2017-04-19 19:28       ` Ross Zwisler
2017-04-20 14:35         ` Jan Kara
2017-04-20 14:44           ` Jan Kara
2017-04-20 19:14             ` Ross Zwisler
2017-04-21  3:44               ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
2017-04-21  3:44                 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
2017-04-25 11:10                   ` Jan Kara [this message]
2017-04-25 22:59                     ` Ross Zwisler
2017-04-26  8:52                       ` Jan Kara
2017-04-26 22:52                         ` Ross Zwisler
2017-04-27  7:26                           ` Jan Kara
2017-05-01 22:38                             ` Ross Zwisler
2017-05-04  9:12                               ` Jan Kara
2017-05-01 22:59                             ` Dan Williams
2017-04-25 10:10                 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
2017-05-01 16:54                   ` Ross Zwisler
2017-04-18 22:46   ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
2017-04-19 15:15     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-18 18:51   ` Nikolay Borisov
2017-04-19 13:22     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-18 15:24 ` [PATCH 0/4] Properly invalidate data in the cleancache Konrad Rzeszutek Wilk
2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-25  8:25     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-25  8:34     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-25  8:37     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-25  8:41     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170425111043.GH2793@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=akpm@linux-foundation.org \
    --cc=anna.schumaker@netapp.com \
    --cc=aryabinin@virtuozzo.com \
    --cc=axboe@kernel.dk \
    --cc=dan.j.williams@intel.com \
    --cc=darrick.wong@oracle.com \
    --cc=ericvh@gmail.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@lst.de \
    --cc=konrad.wilk@oracle.com \
    --cc=kuznet@virtuozzo.com \
    --cc=linux-cifs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-nvdimm@ml01.01.org \
    --cc=lucho@ionkov.net \
    --cc=mawilcox@microsoft.com \
    --cc=rminnich@sandia.gov \
    --cc=ross.zwisler@linux.intel.com \
    --cc=samba-technical@lists.samba.org \
    --cc=sfrench@samba.org \
    --cc=trond.myklebust@primarydata.com \
    --cc=v9fs-developer@lists.sourceforge.net \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).