From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>, Jan Kara <jack@suse.cz>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	linux-mm@kvack.org, Christoph Hellwig <hch@lst.de>,
	linux-cifs@vger.kernel.org,
	Matthew Wilcox <mawilcox@microsoft.com>,
	Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Eric Van Hensbergen <ericvh@gmail.com>,
	linux-nvdimm@lists.01.org,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	v9fs-developer@lists.sourceforge.net,
	Jens Axboe <axboe@kernel.dk>,
	linux-nfs@vger.kernel.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	samba-technical@lists.samba.org, linux-kernel@vger.kernel.org,
	Steve French <sfrench@samba.org>,
	Alexey Kuznetsov <kuznet@virtuozzo.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-fsdevel@vger.kernel.org, Ron Minnich <rminnich@sandia.gov>,
	Andrew Morton <akpm@linux-foundation.org>,
	Anna Schumaker <anna.schumaker@netapp.com>
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
Date: Tue, 25 Apr 2017 13:10:43 +0200	[thread overview]
Message-ID: <20170425111043.GH2793@quack2.suse.cz> (raw)
In-Reply-To: <20170421034437.4359-2-ross.zwisler@linux.intel.com>

On Thu 20-04-17 21:44:37, Ross Zwisler wrote:
> Users of DAX can suffer data corruption from stale mmap reads via the
> following sequence:
> 
> - open an mmap over a 2MiB hole
> 
> - read from a 2MiB hole, faulting in a 2MiB zero page
> 
> - write to the hole with write(3p).  The write succeeds but we incorrectly
>   leave the 2MiB zero page mapping intact.
> 
> - via the mmap, read the data that was just written.  Since the zero page
>   mapping is still intact we read back zeroes instead of the new data.
> 
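[Editor's note: the access pattern above can be reproduced from user space. The sketch below (an editorial addition, not part of the patch) runs the same sequence on an ordinary filesystem, where MAP_SHARED mappings stay coherent with write(2) and the final read sees the new data; on the buggy DAX path described here, the stale zero-page mapping would make the final read return zeroes instead.]

```python
import mmap
import os
import tempfile

# 2MiB, matching the PMD-sized hole in the bug report.
SZ = 2 * 1024 * 1024

fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, SZ)                    # file is one big hole
    m = mmap.mmap(fd, SZ, prot=mmap.PROT_READ)
    assert m[:4] == b"\x00" * 4             # read faults in zero page(s)
    os.pwrite(fd, b"DATA", 0)               # write(2) into the hole
    result = m[:4]                          # read back through the mmap
    m.close()
finally:
    os.close(fd)
    os.unlink(path)

print(result)  # b'DATA' on a coherent (non-DAX, or fixed) filesystem
```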
> We fix this by unconditionally calling invalidate_inode_pages2_range() in
> dax_iomap_actor() for new block allocations, and by enhancing
> __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
> being removed from the radix tree.
> 
> This is based on an initial patch from Jan Kara.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]
> ---
>  fs/dax.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 166504c..3f445d5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
>  					  pgoff_t index, bool trunc)
>  {
>  	int ret = 0;
> -	void *entry;
> +	void *entry, **slot;
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	if (!trunc &&
>  	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
>  	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
>  		goto out;
> +
> +	/*
> +	 * Make sure 'entry' remains valid while we drop mapping->tree_lock to
> +	 * do the unmap_mapping_range() call.
> +	 */
> +	entry = lock_slot(mapping, slot);

This also stops page faults from mapping the entry again. Maybe worth
mentioning here as well.

> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
> +			(loff_t)PAGE_SIZE << dax_radix_order(entry), 0);

Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping
large ranges - each unmap means an rmap walk... Since this is a data
corruption class of bug, let's fix it this way for now but I think we'll
need to improve this later.

E.g. what if we called unmap_mapping_range() for the whole invalidated
range after removing the radix tree entries?

Hum, but now, thinking more about it, I have a hard time figuring out why
write vs fault cannot actually still race:

CPU1 - write(2)				CPU2 - read fault

					dax_iomap_pte_fault()
					  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
					  grab_mapping_entry()
					  - we add zero page in the radix
					    tree & map it to page tables

Similarly, a read vs a write fault may end up racing in the wrong way and try
to replace an already existing exceptional entry with a hole page?
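[Editor's note: the interleaving in the diagram above can be modelled deterministically. The toy model below is an editorial sketch; the function and entry names are hypothetical simplifications of the kernel paths named in the diagram. It shows that when invalidate_inode_pages2_range() runs before the racing fault inserts its zero-page entry, the invalidation is a no-op and the stale entry survives.]

```python
# Simplified model: the radix tree maps index -> entry, and the page
# tables record what mmap readers observe for that index.
radix_tree = {}
page_tables = {}

def write_path(index):
    # CPU1: ->iomap_begin allocated blocks; dax_iomap_actor() then calls
    # invalidate_inode_pages2_range(), but the tree is still empty, so
    # there is nothing to invalidate.
    radix_tree.pop(index, None)

def fault_path(index):
    # CPU2: grab_mapping_entry() installs the zero page in the radix
    # tree and maps it into the page tables -- after the write finished.
    radix_tree[index] = "zero"
    page_tables[index] = "zeroes"

# CPU2's ->iomap_begin() saw a hole, CPU1's write then ran to
# completion, and only afterwards does CPU2 finish its fault:
write_path(0)
fault_path(0)
print(page_tables[0])  # stale zeroes remain visible via mmap
```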

								Honza
> +
> +	spin_lock_irq(&mapping->tree_lock);
>  	radix_tree_delete(page_tree, index);
>  	mapping->nrexceptional--;
>  	ret = 1;
>  out:
> -	put_unlocked_mapping_entry(mapping, index, entry);
>  	spin_unlock_irq(&mapping->tree_lock);
> +	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
>  	return ret;
>  }
>  /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		return -EIO;
>  
>  	/*
> -	 * Write can allocate block for an area which has a hole page mapped
> -	 * into page tables. We have to tear down these mappings so that data
> -	 * written by write(2) is visible in mmap.
> +	 * Write can allocate block for an area which has a hole page or zero
> +	 * PMD entry in the radix tree.  We have to tear down these mappings so
> +	 * that data written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if (iomap->flags & IOMAP_F_NEW) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
  reply	other threads:[~2017-04-25 11:10 UTC|newest]

Thread overview: 144+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-18 19:38   ` Ross Zwisler
2017-04-19 15:11     ` Andrey Ryabinin
2017-04-19 19:28       ` Ross Zwisler
     [not found]         ` <20170419192836.GA6364-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2017-04-20 14:35           ` Jan Kara
     [not found]             ` <20170420143510.GF22135-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2017-04-20 14:44               ` Jan Kara
2017-04-20 19:14                 ` Ross Zwisler
2017-04-21  3:44                   ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
2017-04-21  3:44                     ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
2017-04-25 11:10                       ` Jan Kara [this message]
2017-04-25 22:59                         ` Ross Zwisler
2017-04-26  8:52                           ` Jan Kara
2017-04-26 22:52                             ` Ross Zwisler
2017-04-27  7:26                               ` Jan Kara
2017-05-01 22:38                                 ` Ross Zwisler
2017-05-04  9:12                                   ` Jan Kara
2017-05-01 22:59                                 ` Dan Williams
2017-04-24 17:49                     ` [PATCH 1/2] xfs: fix incorrect argument count check Ross Zwisler
2017-04-24 17:49                       ` [PATCH 2/2] dax: add regression test for stale mmap reads Ross Zwisler
2017-04-25 11:27                         ` Eryu Guan
2017-04-25 20:39                           ` Ross Zwisler
2017-04-26  3:42                             ` Eryu Guan
2017-04-25 10:10                     ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
2017-05-01 16:54                       ` Ross Zwisler
     [not found]   ` <20170414140753.16108-2-aryabinin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2017-04-18 22:46     ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
2017-04-19 15:15       ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-18 18:51   ` Nikolay Borisov
2017-04-19 13:22     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
     [not found] ` <20170414140753.16108-1-aryabinin-5HdwGun5lf+gSpxsJD1C4w@public.gmane.org>
2017-04-18 15:24   ` [PATCH 0/4] Properly invalidate data in the cleancache Konrad Rzeszutek Wilk
2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-25  8:25     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-25  8:34     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-25  8:37     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-25  8:41     ` Jan Kara
