linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] Properly invalidate data in the cleancache.
@ 2017-04-14 14:07 Andrey Ryabinin
  2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
                   ` (5 more replies)
  0 siblings, 6 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-14 14:07 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Eric Van Hensbergen,
	Ron Minnich, Latchesar Ionkov, Steve French, Matthew Wilcox,
	Ross Zwisler, Trond Myklebust, Anna Schumaker, Andrew Morton,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	samba-technical, linux-nfs, linux-mm

We've noticed that after a direct IO write, a buffered read sometimes gets
stale data coming from the cleancache.
The reason for this is that some direct write hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero, so we may not invalidate
data in the cleancache.

Another odd thing is that we check only for ->nrpages and don't check for ->nrexceptional,
but invalidate_inode_pages2[_range]() invalidates exceptional entries as well.
So we invalidate exceptional entries only if ->nrpages != 0? This doesn't feel right.

 - Patch 1 fixes direct IO writes by removing the ->nrpages check.
 - Patch 2 fixes a similar case in invalidate_bdev().
     Note: I only fixed the conditional cleancache_invalidate_inode() here.
       Do we also need to add an ->nrexceptional check into invalidate_bdev()?

 - Patches 3-4: some optimizations.

Andrey Ryabinin (4):
  fs: fix data invalidation in the cleancache during direct IO
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  mm/truncate: bail out early from invalidate_inode_pages2_range() if
    mapping is empty
  mm/truncate: avoid pointless cleancache_invalidate_inode() calls.

 fs/9p/vfs_file.c |  2 +-
 fs/block_dev.c   | 11 +++++------
 fs/cifs/inode.c  |  2 +-
 fs/dax.c         |  2 +-
 fs/iomap.c       | 16 +++++++---------
 fs/nfs/direct.c  |  6 ++----
 fs/nfs/inode.c   |  8 +++++---
 mm/filemap.c     | 26 +++++++++++---------------
 mm/truncate.c    | 13 +++++++++----
 9 files changed, 42 insertions(+), 44 deletions(-)

-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
@ 2017-04-14 14:07 ` Andrey Ryabinin
  2017-04-18 19:38   ` Ross Zwisler
  2017-04-18 22:46   ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
  2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-14 14:07 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Eric Van Hensbergen,
	Ron Minnich, Latchesar Ionkov, Steve French, Matthew Wilcox,
	Ross Zwisler, Trond Myklebust, Anna Schumaker, Andrew Morton,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	samba-technical, linux-nfs, linux-mm

Some direct write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. If the page cache is empty,
a buffered read following a direct IO write would get stale data from
the cleancache.

It also doesn't feel right to check only for ->nrpages, because
invalidate_inode_pages2[_range]() invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of the
nrpages state.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 fs/9p/vfs_file.c |  2 +-
 fs/cifs/inode.c  |  2 +-
 fs/dax.c         |  2 +-
 fs/iomap.c       | 16 +++++++---------
 fs/nfs/direct.c  |  6 ++----
 fs/nfs/inode.c   |  8 +++++---
 mm/filemap.c     | 26 +++++++++++---------------
 7 files changed, 28 insertions(+), 34 deletions(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 3de3b4a8..786d0de 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -423,7 +423,7 @@ v9fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		unsigned long pg_start, pg_end;
 		pg_start = origin >> PAGE_SHIFT;
 		pg_end = (origin + retval - 1) >> PAGE_SHIFT;
-		if (inode->i_mapping && inode->i_mapping->nrpages)
+		if (inode->i_mapping)
 			invalidate_inode_pages2_range(inode->i_mapping,
 						      pg_start, pg_end);
 		iocb->ki_pos += retval;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index c3b2fa0..6539fa3 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1857,7 +1857,7 @@ cifs_invalidate_mapping(struct inode *inode)
 {
 	int rc = 0;
 
-	if (inode->i_mapping && inode->i_mapping->nrpages != 0) {
+	if (inode->i_mapping) {
 		rc = invalidate_inode_pages2(inode->i_mapping);
 		if (rc)
 			cifs_dbg(VFS, "%s: could not invalidate inode %p\n",
diff --git a/fs/dax.c b/fs/dax.c
index 2e382fe..1e8cca0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 	 * into page tables. We have to tear down these mappings so that data
 	 * written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if ((iomap->flags & IOMAP_F_NEW)) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
diff --git a/fs/iomap.c b/fs/iomap.c
index 0b457ff..7e1f947 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -880,16 +880,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_WRITE;
 	}
 
-	if (mapping->nrpages) {
-		ret = filemap_write_and_wait_range(mapping, start, end);
-		if (ret)
-			goto out_free_dio;
+	ret = filemap_write_and_wait_range(mapping, start, end);
+	if (ret)
+		goto out_free_dio;
 
-		ret = invalidate_inode_pages2_range(mapping,
+	ret = invalidate_inode_pages2_range(mapping,
 				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-		WARN_ON_ONCE(ret);
-		ret = 0;
-	}
+	WARN_ON_ONCE(ret);
+	ret = 0;
 
 	inode_dio_begin(inode);
 
@@ -944,7 +942,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
 	 * this invalidation fails, tough, the write still worked...
 	 */
-	if (iov_iter_rw(iter) == WRITE && mapping->nrpages) {
+	if (iov_iter_rw(iter) == WRITE) {
 		int err = invalidate_inode_pages2_range(mapping,
 				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 		WARN_ON_ONCE(err);
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index aab32fc..183ab4d 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -1024,10 +1024,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
 
 	result = nfs_direct_write_schedule_iovec(dreq, iter, pos);
 
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_SHIFT, end);
-	}
+	invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end);
 
 	nfs_end_io_direct(inode);
 
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index f489a5a..b727ec8 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1118,10 +1118,12 @@ static int nfs_invalidate_mapping(struct inode *inode, struct address_space *map
 			if (ret < 0)
 				return ret;
 		}
-		ret = invalidate_inode_pages2(mapping);
-		if (ret < 0)
-			return ret;
 	}
+
+	ret = invalidate_inode_pages2(mapping);
+	if (ret < 0)
+		return ret;
+
 	if (S_ISDIR(inode->i_mode)) {
 		spin_lock(&inode->i_lock);
 		memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
diff --git a/mm/filemap.c b/mm/filemap.c
index e9e5f7b..d233d59 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2721,18 +2721,16 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * about to write.  We do this *before* the write so that we can return
 	 * without clobbering -EIOCBQUEUED from ->direct_IO().
 	 */
-	if (mapping->nrpages) {
-		written = invalidate_inode_pages2_range(mapping,
+	written = invalidate_inode_pages2_range(mapping,
 					pos >> PAGE_SHIFT, end);
-		/*
-		 * If a page can not be invalidated, return 0 to fall back
-		 * to buffered write.
-		 */
-		if (written) {
-			if (written == -EBUSY)
-				return 0;
-			goto out;
-		}
+	/*
+	 * If a page can not be invalidated, return 0 to fall back
+	 * to buffered write.
+	 */
+	if (written) {
+		if (written == -EBUSY)
+			return 0;
+		goto out;
 	}
 
 	data = *from;
@@ -2746,10 +2744,8 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * so we don't support it 100%.  If this invalidation
 	 * fails, tough, the write still worked...
 	 */
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_SHIFT, end);
-	}
+	invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end);
 
 	if (written > 0) {
 		pos += written;
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev()
  2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
  2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
@ 2017-04-14 14:07 ` Andrey Ryabinin
  2017-04-18 18:51   ` Nikolay Borisov
  2017-04-14 14:07 ` [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-14 14:07 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Eric Van Hensbergen,
	Ron Minnich, Latchesar Ionkov, Steve French, Matthew Wilcox,
	Ross Zwisler, Trond Myklebust, Anna Schumaker, Andrew Morton,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	samba-technical, linux-nfs, linux-mm

invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0,
which doesn't make any sense.
Make invalidate_bdev() always invalidate the cleancache data.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 fs/block_dev.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e405d8e..7af4787 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -103,12 +103,11 @@ void invalidate_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
-		return;
-
-	invalidate_bh_lrus();
-	lru_add_drain_all();	/* make sure all lru add caches are flushed */
-	invalidate_mapping_pages(mapping, 0, -1);
+	if (mapping->nrpages) {
+		invalidate_bh_lrus();
+		lru_add_drain_all();	/* make sure all lru add caches are flushed */
+		invalidate_mapping_pages(mapping, 0, -1);
+	}
 	/* 99% of the time, we don't need to flush the cleancache on the bdev.
 	 * But, for the strange corners, lets be cautious
 	 */
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
  2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
  2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
@ 2017-04-14 14:07 ` Andrey Ryabinin
  2017-04-14 14:07 ` [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-14 14:07 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Eric Van Hensbergen,
	Ron Minnich, Latchesar Ionkov, Steve French, Matthew Wilcox,
	Ross Zwisler, Trond Myklebust, Anna Schumaker, Andrew Morton,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	samba-technical, linux-nfs, linux-mm

If the mapping is empty (both ->nrpages and ->nrexceptional are zero) we can avoid
pointless lookups in the empty radix tree and bail out immediately after the
cleancache invalidation.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 mm/truncate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index 6263aff..8f12b0e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -624,6 +624,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int did_range_unmap = 0;
 
 	cleancache_invalidate_inode(mapping);
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+		return 0;
+
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
                   ` (2 preceding siblings ...)
  2017-04-14 14:07 ` [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
@ 2017-04-14 14:07 ` Andrey Ryabinin
  2017-04-18 15:24 ` [PATCH 0/4] Properly invalidate data in the cleancache Konrad Rzeszutek Wilk
  2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
  5 siblings, 0 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-14 14:07 UTC (permalink / raw)
  To: Alexander Viro, linux-fsdevel
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Eric Van Hensbergen,
	Ron Minnich, Latchesar Ionkov, Steve French, Matthew Wilcox,
	Ross Zwisler, Trond Myklebust, Anna Schumaker, Andrew Morton,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	samba-technical, linux-nfs, linux-mm

truncate_inode_pages_range() and invalidate_inode_pages2_range() call
cleancache_invalidate_inode() twice - on entry and on exit. That's a
pointless waste of time; it's enough to call it once, on exit.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 mm/truncate.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 8f12b0e..83a059e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -266,9 +266,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		index;
 	int		i;
 
-	cleancache_invalidate_inode(mapping);
 	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
-		return;
+		goto out;
 
 	/* Offsets within partial pages */
 	partial_start = lstart & (PAGE_SIZE - 1);
@@ -363,7 +362,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	 * will be released, just zeroed, so we can bail out now.
 	 */
 	if (start >= end)
-		return;
+		goto out;
 
 	index = start;
 	for ( ; ; ) {
@@ -410,6 +409,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		pagevec_release(&pvec);
 		index++;
 	}
+
+out:
 	cleancache_invalidate_inode(mapping);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -623,9 +624,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	cleancache_invalidate_inode(mapping);
 	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
-		return 0;
+		goto out;
 
 	pagevec_init(&pvec, 0);
 	index = start;
@@ -689,6 +689,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}
+
+out:
 	cleancache_invalidate_inode(mapping);
 	return ret;
 }
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/4] Properly invalidate data in the cleancache.
  2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
                   ` (3 preceding siblings ...)
  2017-04-14 14:07 ` [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
@ 2017-04-18 15:24 ` Konrad Rzeszutek Wilk
  2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
  5 siblings, 0 replies; 37+ messages in thread
From: Konrad Rzeszutek Wilk @ 2017-04-18 15:24 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, linux-fsdevel, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Ross Zwisler,
	Trond Myklebust, Anna Schumaker, Andrew Morton, Jan Kara,
	Jens Axboe, Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm

On Fri, Apr 14, 2017 at 05:07:49PM +0300, Andrey Ryabinin wrote:
> We've noticed that after direct IO write, buffered read sometimes gets
> stale data which is coming from the cleancache.

That is not good.
> The reason for this is that some direct write hooks call invalidate_inode_pages2[_range]()
> conditionally iff mapping->nrpages is not zero, so we may not invalidate
> data in the cleancache.
> 
> Another odd thing is that we check only for ->nrpages and don't check for ->nrexceptional,

Yikes.
> but invalidate_inode_pages2[_range] also invalidates exceptional entries as well.
> So we invalidate exceptional entries only if ->nrpages != 0? This doesn't feel right.
> 
>  - Patch 1 fixes direct IO writes by removing ->nrpages check.
>  - Patch 2 fixes similar case in invalidate_bdev(). 
>      Note: I only fixed conditional cleancache_invalidate_inode() here.
>        Do we also need to add ->nrexceptional check in into invalidate_bdev()?
>      
>  - Patches 3-4: some optimizations.

Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thanks!
> 
> Andrey Ryabinin (4):
>   fs: fix data invalidation in the cleancache during direct IO
>   fs/block_dev: always invalidate cleancache in invalidate_bdev()
>   mm/truncate: bail out early from invalidate_inode_pages2_range() if
>     mapping is empty
>   mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
> 
>  fs/9p/vfs_file.c |  2 +-
>  fs/block_dev.c   | 11 +++++------
>  fs/cifs/inode.c  |  2 +-
>  fs/dax.c         |  2 +-
>  fs/iomap.c       | 16 +++++++---------
>  fs/nfs/direct.c  |  6 ++----
>  fs/nfs/inode.c   |  8 +++++---
>  mm/filemap.c     | 26 +++++++++++---------------
>  mm/truncate.c    | 13 +++++++++----
>  9 files changed, 42 insertions(+), 44 deletions(-)
> 
> -- 
> 2.10.2
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev()
  2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
@ 2017-04-18 18:51   ` Nikolay Borisov
  2017-04-19 13:22     ` Andrey Ryabinin
  0 siblings, 1 reply; 37+ messages in thread
From: Nikolay Borisov @ 2017-04-18 18:51 UTC (permalink / raw)
  To: Andrey Ryabinin, Alexander Viro, linux-fsdevel
  Cc: Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Ross Zwisler,
	Trond Myklebust, Anna Schumaker, Andrew Morton, Jan Kara,
	Jens Axboe, Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm



On 14.04.2017 17:07, Andrey Ryabinin wrote:
> invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
> which doesn't make any sense.
> Make invalidate_bdev() always invalidate cleancache data.
> 
> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> ---
>  fs/block_dev.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index e405d8e..7af4787 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -103,12 +103,11 @@ void invalidate_bdev(struct block_device *bdev)
>  {
>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>  
> -	if (mapping->nrpages == 0)
> -		return;
> -
> -	invalidate_bh_lrus();
> -	lru_add_drain_all();	/* make sure all lru add caches are flushed */
> -	invalidate_mapping_pages(mapping, 0, -1);
> +	if (mapping->nrpages) {
> +		invalidate_bh_lrus();
> +		lru_add_drain_all();	/* make sure all lru add caches are flushed */
> +		invalidate_mapping_pages(mapping, 0, -1);
> +	}

How is this different from the current code? You will only invalidate
the mapping iff ->nrpages > 0 (I assume it can't go below 0)?
Perhaps just remove the if altogether?

>  	/* 99% of the time, we don't need to flush the cleancache on the bdev.
>  	 * But, for the strange corners, lets be cautious
>  	 */
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
@ 2017-04-18 19:38   ` Ross Zwisler
  2017-04-19 15:11     ` Andrey Ryabinin
  2017-04-18 22:46   ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-04-18 19:38 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, linux-fsdevel, Konrad Rzeszutek Wilk,
	Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov, Steve French,
	Matthew Wilcox, Ross Zwisler, Trond Myklebust, Anna Schumaker,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, v9fs-developer,
	linux-kernel, linux-cifs, samba-technical, linux-nfs, linux-mm

On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> buffered read following after direct IO write would get stale data from
> the cleancache.
> 
> Also it doesn't feel right to check only for ->nrpages because
> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> 
> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> state.
> 
> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> ---
<>
> diff --git a/fs/dax.c b/fs/dax.c
> index 2e382fe..1e8cca0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	 * into page tables. We have to tear down these mappings so that data
>  	 * written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if ((iomap->flags & IOMAP_F_NEW)) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);

tl;dr: I think the old code is correct, and that you don't need this change.

This should be harmless, but could slow us down a little if we keep
calling invalidate_inode_pages2_range() without really needing to.  Really for
DAX I think we need to call invalidate_inode_pages2_range() only if we have
zero pages mapped over the place where we are doing I/O, which is why we check
nrpages.

Is DAX even allowed to be used at the same time as cleancache?  From a brief
look at Documentation/vm/cleancache.txt, it seems like these two features are
incompatible.  With DAX we already are avoiding the page cache completely.

Anyway, I don't see how this change in DAX can save us from a data corruption
(which is what you're seeing, right?), and I think it could slow us down, so
I'd prefer to leave things as they are.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
  2017-04-18 19:38   ` Ross Zwisler
@ 2017-04-18 22:46   ` Andrew Morton
  2017-04-19 15:15     ` Andrey Ryabinin
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2017-04-18 22:46 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, linux-fsdevel, Konrad Rzeszutek Wilk,
	Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov, Steve French,
	Matthew Wilcox, Ross Zwisler, Trond Myklebust, Anna Schumaker,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	samba-technical, linux-nfs, linux-mm

On Fri, 14 Apr 2017 17:07:50 +0300 Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:

> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> buffered read following after direct IO write would get stale data from
> the cleancache.
> 
> Also it doesn't feel right to check only for ->nrpages because
> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> 
> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> state.

I'm not understanding this.  I can buy the argument about
nrexceptional, but why does cleancache require the
invalidate_inode_pages2_range() call even when ->nrpages is zero?

I *assume* it's because invalidate_inode_pages2_range() calls
cleancache_invalidate_inode(), yes?  If so, can we please add this to
the changelog?  If not then please explain further.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev()
  2017-04-18 18:51   ` Nikolay Borisov
@ 2017-04-19 13:22     ` Andrey Ryabinin
  0 siblings, 0 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-19 13:22 UTC (permalink / raw)
  To: Nikolay Borisov, Alexander Viro, linux-fsdevel
  Cc: Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Ross Zwisler,
	Trond Myklebust, Anna Schumaker, Andrew Morton, Jan Kara,
	Jens Axboe, Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm

On 04/18/2017 09:51 PM, Nikolay Borisov wrote:
> 
> 
> On 14.04.2017 17:07, Andrey Ryabinin wrote:
>> invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
>> which doesn't make any sense.
>> Make invalidate_bdev() always invalidate cleancache data.
>>
>> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
>> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> ---
>>  fs/block_dev.c | 11 +++++------
>>  1 file changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index e405d8e..7af4787 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -103,12 +103,11 @@ void invalidate_bdev(struct block_device *bdev)
>>  {
>>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>>  
>> -	if (mapping->nrpages == 0)
>> -		return;
>> -
>> -	invalidate_bh_lrus();
>> -	lru_add_drain_all();	/* make sure all lru add caches are flushed */
>> -	invalidate_mapping_pages(mapping, 0, -1);
>> +	if (mapping->nrpages) {
>> +		invalidate_bh_lrus();
>> +		lru_add_drain_all();	/* make sure all lru add caches are flushed */
>> +		invalidate_mapping_pages(mapping, 0, -1);
>> +	}
> 
> How is this different than the current code? You will only invalidate
> the mapping iff ->nrpages > 0 ( I assume it can't go down below 0) ?

The difference is that invalidate_bdev() now always calls cleancache_invalidate_inode()
(you won't see it in this diff; it's placed after this if (mapping->nrpages) {} block).

> Perhaps just remove the if altogether?
> 

Given that invalidate_mapping_pages() invalidates exceptional entries as well, it certainly doesn't look
right that we look only at mapping->nrpages and completely ignore ->nrexceptional.
So maybe removing the if() would be the right thing to do. But I think that should be a separate patch, as it would
fix another bug, probably introduced by commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache").

My intention here was to fix only the cleancache case.


>>  	/* 99% of the time, we don't need to flush the cleancache on the bdev.
>>  	 * But, for the strange corners, lets be cautious
>>  	 */
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-18 19:38   ` Ross Zwisler
@ 2017-04-19 15:11     ` Andrey Ryabinin
  2017-04-19 19:28       ` Ross Zwisler
  0 siblings, 1 reply; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-19 15:11 UTC (permalink / raw)
  To: Ross Zwisler, Alexander Viro, linux-fsdevel,
	Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Trond Myklebust,
	Anna Schumaker, Andrew Morton, Jan Kara, Jens Axboe,
	Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm

On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
>> Some direct write fs hooks call invalidate_inode_pages2[_range]()
>> conditionally iff mapping->nrpages is not zero. If page cache is empty,
>> buffered read following after direct IO write would get stale data from
>> the cleancache.
>>
>> Also it doesn't feel right to check only for ->nrpages because
>> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
>>
>> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
>> state.
>>
>> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
>> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
>> ---
> <>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 2e382fe..1e8cca0 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>>  	 * into page tables. We have to tear down these mappings so that data
>>  	 * written by write(2) is visible in mmap.
>>  	 */
>> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
>> +	if ((iomap->flags & IOMAP_F_NEW)) {
>>  		invalidate_inode_pages2_range(inode->i_mapping,
>>  					      pos >> PAGE_SHIFT,
>>  					      (end - 1) >> PAGE_SHIFT);
> 
> tl;dr: I think the old code is correct, and that you don't need this change.
> 
> This should be harmless, but could slow us down a little if we keep
> calling invalidate_inode_pages2_range() without really needing to.  Really for
> DAX I think we need to call invalidate_inode_page2_range() only if we have
> zero pages mapped over the place where we are doing I/O, which is why we check
> nrpages.
> 

Checking only for ->nrpages looks strange, because invalidate_inode_pages2_range() also
invalidates exceptional radix tree entries. Is it correct that we invalidate
exceptional entries only if ->nrpages > 0 and skip the invalidation otherwise?


> Is DAX even allowed to be used at the same time as cleancache?  From a brief
> look at Documentation/vm/cleancache.txt, it seems like these two features are
> incompatible.  With DAX we already are avoiding the page cache completely.

tl;dr: I think you're right.

cleancache may store any PageUptodate && PageMappedToDisk page evicted from the page cache (see __delete_from_page_cache()).
DAX deletes hole pages via __delete_from_page_cache(), but I don't see us marking such pages as Uptodate or MappedToDisk,
so they will never go into the cleancache.

Later, cleancache_get_page() is called on the read side (e.g. from mpage_readpages(), which is called from blkdev_readpages()).
I assume that DAX doesn't use the a_ops->readpages() method, so cleancache_get_page() is never called for DAX.


> Anyway, I don't see how this change in DAX can save us from a data corruption
> (which is what you're seeing, right?), and I think it could slow us down, so
> I'd prefer to leave things as they are.
> 

I'll remove this hunk from v2.

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-18 22:46   ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
@ 2017-04-19 15:15     ` Andrey Ryabinin
  0 siblings, 0 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-19 15:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, linux-fsdevel, Konrad Rzeszutek Wilk,
	Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov, Steve French,
	Matthew Wilcox, Ross Zwisler, Trond Myklebust, Anna Schumaker,
	Jan Kara, Jens Axboe, Johannes Weiner, Alexey Kuznetsov,
	Christoph Hellwig, v9fs-developer, linux-kernel, linux-cifs,
	linux-nfs, linux-mm



On 04/19/2017 01:46 AM, Andrew Morton wrote:
> On Fri, 14 Apr 2017 17:07:50 +0300 Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> 
>> Some direct write fs hooks call invalidate_inode_pages2[_range]()
>> conditionally iff mapping->nrpages is not zero. If page cache is empty,
>> buffered read following after direct IO write would get stale data from
>> the cleancache.
>>
>> Also it doesn't feel right to check only for ->nrpages because
>> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
>>
>> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
>> state.
> 
> I'm not understanding this.  I can buy the argument about
> nrexceptional, but why does cleancache require the
> invalidate_inode_pages2_range) call even when ->nrpages is zero?
> 
> I *assume* it's because invalidate_inode_pages2_range() calls
> cleancache_invalidate_inode(), yes?  If so, can we please add this to
> the changelog?  If not then please explain further.
> 

Yes, your assumption is correct. I'll fix the changelog.


* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-19 15:11     ` Andrey Ryabinin
@ 2017-04-19 19:28       ` Ross Zwisler
  2017-04-20 14:35         ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-04-19 19:28 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Ross Zwisler, Alexander Viro, linux-fsdevel,
	Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Trond Myklebust,
	Anna Schumaker, Andrew Morton, Jan Kara, Jens Axboe,
	Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm

On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote:
> On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> >> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> >> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> >> buffered read following after direct IO write would get stale data from
> >> the cleancache.
> >>
> >> Also it doesn't feel right to check only for ->nrpages because
> >> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> >>
> >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> >> state.
> >>
> >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> >> ---
> > <>
> >> diff --git a/fs/dax.c b/fs/dax.c
> >> index 2e382fe..1e8cca0 100644
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >>  	 * into page tables. We have to tear down these mappings so that data
> >>  	 * written by write(2) is visible in mmap.
> >>  	 */
> >> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> >> +	if ((iomap->flags & IOMAP_F_NEW)) {
> >>  		invalidate_inode_pages2_range(inode->i_mapping,
> >>  					      pos >> PAGE_SHIFT,
> >>  					      (end - 1) >> PAGE_SHIFT);
> > 
> > tl;dr: I think the old code is correct, and that you don't need this change.
> > 
> > This should be harmless, but could slow us down a little if we keep
> > calling invalidate_inode_pages2_range() without really needing to.  Really for
> > DAX I think we need to call invalidate_inode_page2_range() only if we have
> > zero pages mapped over the place where we are doing I/O, which is why we check
> > nrpages.
> > 
> 
> Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
> invalidates exceptional radix tree entries. Is that correct that we invalidate
> exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?

For DAX we only invalidate clean DAX exceptional entries so that we can keep
dirty entries around for writeback, but yes you're correct that we only do the
invalidation if nrpages > 0.  And yes, it does seem a bit weird. :)


* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-19 19:28       ` Ross Zwisler
@ 2017-04-20 14:35         ` Jan Kara
  2017-04-20 14:44           ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-04-20 14:35 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel,
	Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Trond Myklebust,
	Anna Schumaker, Andrew Morton, Jan Kara, Jens Axboe,
	Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm

On Wed 19-04-17 13:28:36, Ross Zwisler wrote:
> On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote:
> > On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> > > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> > >> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> > >> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> > >> buffered read following after direct IO write would get stale data from
> > >> the cleancache.
> > >>
> > >> Also it doesn't feel right to check only for ->nrpages because
> > >> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> > >>
> > >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> > >> state.
> > >>
> > >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> > >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> > >> ---
> > > <>
> > >> diff --git a/fs/dax.c b/fs/dax.c
> > >> index 2e382fe..1e8cca0 100644
> > >> --- a/fs/dax.c
> > >> +++ b/fs/dax.c
> > >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >>  	 * into page tables. We have to tear down these mappings so that data
> > >>  	 * written by write(2) is visible in mmap.
> > >>  	 */
> > >> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> > >> +	if ((iomap->flags & IOMAP_F_NEW)) {
> > >>  		invalidate_inode_pages2_range(inode->i_mapping,
> > >>  					      pos >> PAGE_SHIFT,
> > >>  					      (end - 1) >> PAGE_SHIFT);
> > > 
> > > tl;dr: I think the old code is correct, and that you don't need this change.
> > > 
> > > This should be harmless, but could slow us down a little if we keep
> > > calling invalidate_inode_pages2_range() without really needing to.  Really for
> > > DAX I think we need to call invalidate_inode_page2_range() only if we have
> > > zero pages mapped over the place where we are doing I/O, which is why we check
> > > nrpages.
> > > 
> > 
> > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
> > invalidates exceptional radix tree entries. Is that correct that we invalidate
> > exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?
> 
> For DAX we only invalidate clean DAX exceptional entries so that we can keep
> dirty entries around for writeback, but yes you're correct that we only do the
> invalidation if nrpages > 0.  And yes, it does seem a bit weird. :)

Actually in this place the nrpages check is deliberate since there should
only be hole pages or nothing in the invalidated range - see the comment
before the if. But thinking more about it this assumption actually is not
right in presence of zero PMD entries in the radix tree. So this change
actually also fixes a possible bug for DAX but we should do it as a
separate patch with a proper changelog.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-20 14:35         ` Jan Kara
@ 2017-04-20 14:44           ` Jan Kara
  2017-04-20 19:14             ` Ross Zwisler
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-04-20 14:44 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrey Ryabinin, Alexander Viro, linux-fsdevel,
	Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Trond Myklebust,
	Anna Schumaker, Andrew Morton, Jan Kara, Jens Axboe,
	Johannes Weiner, Alexey Kuznetsov, Christoph Hellwig,
	v9fs-developer, linux-kernel, linux-cifs, samba-technical,
	linux-nfs, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

On Thu 20-04-17 16:35:10, Jan Kara wrote:
> On Wed 19-04-17 13:28:36, Ross Zwisler wrote:
> > On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote:
> > > On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> > > > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> > > >> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> > > >> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> > > >> buffered read following after direct IO write would get stale data from
> > > >> the cleancache.
> > > >>
> > > >> Also it doesn't feel right to check only for ->nrpages because
> > > >> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> > > >>
> > > >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> > > >> state.
> > > >>
> > > >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> > > >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> > > >> ---
> > > > <>
> > > >> diff --git a/fs/dax.c b/fs/dax.c
> > > >> index 2e382fe..1e8cca0 100644
> > > >> --- a/fs/dax.c
> > > >> +++ b/fs/dax.c
> > > >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > > >>  	 * into page tables. We have to tear down these mappings so that data
> > > >>  	 * written by write(2) is visible in mmap.
> > > >>  	 */
> > > >> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> > > >> +	if ((iomap->flags & IOMAP_F_NEW)) {
> > > >>  		invalidate_inode_pages2_range(inode->i_mapping,
> > > >>  					      pos >> PAGE_SHIFT,
> > > >>  					      (end - 1) >> PAGE_SHIFT);
> > > > 
> > > > tl;dr: I think the old code is correct, and that you don't need this change.
> > > > 
> > > > This should be harmless, but could slow us down a little if we keep
> > > > calling invalidate_inode_pages2_range() without really needing to.  Really for
> > > > DAX I think we need to call invalidate_inode_page2_range() only if we have
> > > > zero pages mapped over the place where we are doing I/O, which is why we check
> > > > nrpages.
> > > > 
> > > 
> > > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
> > > invalidates exceptional radix tree entries. Is that correct that we invalidate
> > > exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?
> > 
> > For DAX we only invalidate clean DAX exceptional entries so that we can keep
> > dirty entries around for writeback, but yes you're correct that we only do the
> > invalidation if nrpages > 0.  And yes, it does seem a bit weird. :)
> 
> Actually in this place the nrpages check is deliberate since there should
> only be hole pages or nothing in the invalidated range - see the comment
> before the if. But thinking more about it this assumption actually is not
> right in presence of zero PMD entries in the radix tree. So this change
> actually also fixes a possible bug for DAX but we should do it as a
> separate patch with a proper changelog.

Something like the attached patch. Ross?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-dax-Fix-inconsistency-between-mmap-and-write-2.patch --]
[-- Type: text/x-patch, Size: 1580 bytes --]

From da79b4b72a6fe5fcf1a554ca1ce77cb462e8a306 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 20 Apr 2017 16:38:20 +0200
Subject: [PATCH] dax: Fix inconsistency between mmap and write(2)

When a process has a PMD-sized hole mapped via mmap and later allocates
part of the file underlying this area using write(2), the memory mappings
may not get invalidated as required if the file has no hole pages
allocated, and thus the view via mmap will not show the data written by
write(2). Fix the problem by always invalidating memory mappings covering
the part of the file for which blocks got allocated.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85abd741253d..da7bc44e5725 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1028,11 +1028,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		return -EIO;
 
 	/*
-	 * Write can allocate block for an area which has a hole page mapped
-	 * into page tables. We have to tear down these mappings so that data
-	 * written by write(2) is visible in mmap.
+	 * Write can allocate block for an area which has a hole page or zero
+	 * PMD entry in the radix tree.  We have to tear down these mappings so
+	 * that data written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
-- 
2.12.0



* Re: [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-20 14:44           ` Jan Kara
@ 2017-04-20 19:14             ` Ross Zwisler
  2017-04-21  3:44               ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
  0 siblings, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-04-20 19:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrey Ryabinin, Alexander Viro, linux-fsdevel,
	Konrad Rzeszutek Wilk, Eric Van Hensbergen, Ron Minnich,
	Latchesar Ionkov, Steve French, Matthew Wilcox, Trond Myklebust,
	Anna Schumaker, Andrew Morton, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, v9fs-developer,
	linux-kernel, linux-cifs, samba-technical, linux-nfs, linux-mm

On Thu, Apr 20, 2017 at 04:44:31PM +0200, Jan Kara wrote:
> On Thu 20-04-17 16:35:10, Jan Kara wrote:
> > On Wed 19-04-17 13:28:36, Ross Zwisler wrote:
> > > On Wed, Apr 19, 2017 at 06:11:31PM +0300, Andrey Ryabinin wrote:
> > > > On 04/18/2017 10:38 PM, Ross Zwisler wrote:
> > > > > On Fri, Apr 14, 2017 at 05:07:50PM +0300, Andrey Ryabinin wrote:
> > > > >> Some direct write fs hooks call invalidate_inode_pages2[_range]()
> > > > >> conditionally iff mapping->nrpages is not zero. If page cache is empty,
> > > > >> buffered read following after direct IO write would get stale data from
> > > > >> the cleancache.
> > > > >>
> > > > >> Also it doesn't feel right to check only for ->nrpages because
> > > > >> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> > > > >>
> > > > >> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> > > > >> state.
> > > > >>
> > > > >> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> > > > >> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> > > > >> ---
> > > > > <>
> > > > >> diff --git a/fs/dax.c b/fs/dax.c
> > > > >> index 2e382fe..1e8cca0 100644
> > > > >> --- a/fs/dax.c
> > > > >> +++ b/fs/dax.c
> > > > >> @@ -1047,7 +1047,7 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > > > >>  	 * into page tables. We have to tear down these mappings so that data
> > > > >>  	 * written by write(2) is visible in mmap.
> > > > >>  	 */
> > > > >> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> > > > >> +	if ((iomap->flags & IOMAP_F_NEW)) {
> > > > >>  		invalidate_inode_pages2_range(inode->i_mapping,
> > > > >>  					      pos >> PAGE_SHIFT,
> > > > >>  					      (end - 1) >> PAGE_SHIFT);
> > > > > 
> > > > > tl;dr: I think the old code is correct, and that you don't need this change.
> > > > > 
> > > > > This should be harmless, but could slow us down a little if we keep
> > > > > calling invalidate_inode_pages2_range() without really needing to.  Really for
> > > > > DAX I think we need to call invalidate_inode_page2_range() only if we have
> > > > > zero pages mapped over the place where we are doing I/O, which is why we check
> > > > > nrpages.
> > > > > 
> > > > 
> > > > Check for ->nrpages only looks strange, because invalidate_inode_pages2_range() also
> > > > invalidates exceptional radix tree entries. Is that correct that we invalidate
> > > > exceptional entries only if ->nrpages > 0 and skip invalidation otherwise?
> > > 
> > > For DAX we only invalidate clean DAX exceptional entries so that we can keep
> > > dirty entries around for writeback, but yes you're correct that we only do the
> > > invalidation if nrpages > 0.  And yes, it does seem a bit weird. :)
> > 
> > Actually in this place the nrpages check is deliberate since there should
> > only be hole pages or nothing in the invalidated range - see the comment
> > before the if. But thinking more about it this assumption actually is not
> > right in presence of zero PMD entries in the radix tree. So this change
> > actually also fixes a possible bug for DAX but we should do it as a
> > separate patch with a proper changelog.
> 
> Something like the attached patch. Ross?

Yep, great catch, this is a real issue.  The attached patch isn't sufficient,
though, because invalidate_inode_pages2_range() for DAX exceptional entries
only wipes out the radix tree entry, and doesn't call unmap_mapping_range() as
it does in the case of real pages.

I'm working on a fix and an associated xfstest test.


* [PATCH 1/2] dax: prevent invalidation of mapped DAX entries
  2017-04-20 19:14             ` Ross Zwisler
@ 2017-04-21  3:44               ` Ross Zwisler
  2017-04-21  3:44                 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
  2017-04-25 10:10                 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
  0 siblings, 2 replies; 37+ messages in thread
From: Ross Zwisler @ 2017-04-21  3:44 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Alexey Kuznetsov, Andrey Ryabinin,
	Anna Schumaker, Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jan Kara, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

dax_invalidate_mapping_entry() currently removes DAX exceptional entries
only if they are clean and unlocked.  This is done via:

invalidate_mapping_pages()
  invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

However, for page cache pages removed in invalidate_mapping_pages() there
is an additional criteria which is that the page must not be mapped.  This
is noted in the comments above invalidate_mapping_pages() and is checked in
invalidate_inode_page().

For DAX entries this means that we can end up in a situation where a
DAX exceptional entry, either a huge zero page or a regular DAX entry,
could end up mapped but without an associated radix tree entry. This is
inconsistent with the rest of the DAX code and with what happens in the
page cache case.

We aren't able to unmap the DAX exceptional entry because according to its
comments invalidate_mapping_pages() isn't allowed to block, and
unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

Since we essentially never have unmapped DAX entries to evict from the
radix tree, just remove dax_invalidate_mapping_entry().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---

This series applies cleanly to the current v4.11-rc7 based linux/master,
and has passed an xfstests run with DAX on ext4 and XFS.

These patches also apply to v4.10.9 with a little work from the 3-way
merge feature.

 fs/dax.c            | 29 -----------------------------
 include/linux/dax.h |  1 -
 mm/truncate.c       |  9 +++------
 3 files changed, 3 insertions(+), 36 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85abd74..166504c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
 }
 
 /*
- * Invalidate exceptional DAX entry if easily possible. This handles DAX
- * entries for invalidate_inode_pages() so we evict the entry only if we can
- * do so without blocking.
- */
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = 0;
-	void *entry, **slot;
-	struct radix_tree_root *page_tree = &mapping->page_tree;
-
-	spin_lock_irq(&mapping->tree_lock);
-	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
-	if (!entry || !radix_tree_exceptional_entry(entry) ||
-	    slot_locked(mapping, slot))
-		goto out;
-	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
-	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
-		goto out;
-	radix_tree_delete(page_tree, index);
-	mapping->nrexceptional--;
-	ret = 1;
-out:
-	spin_unlock_irq(&mapping->tree_lock);
-	if (ret)
-		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
-	return ret;
-}
-
-/*
  * Invalidate exceptional DAX entry if it is clean.
  */
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d8a3dc0..f8e1833 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
 void dax_wake_mapping_entry_waiter(struct address_space *mapping,
diff --git a/mm/truncate.c b/mm/truncate.c
index 6263aff..c537184 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
 
 /*
  * Invalidate exceptional entry if easily possible. This handles exceptional
- * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
- * clean entries.
+ * entries for invalidate_inode_pages().
  */
 static int invalidate_exceptional_entry(struct address_space *mapping,
 					pgoff_t index, void *entry)
 {
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem itself, or for DAX we do nothing. */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return 1;
-	if (dax_mapping(mapping))
-		return dax_invalidate_mapping_entry(mapping, index);
 	clear_shadow_entry(mapping, index, entry);
 	return 1;
 }
-- 
2.9.3


* [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-21  3:44               ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
@ 2017-04-21  3:44                 ` Ross Zwisler
  2017-04-25 11:10                   ` Jan Kara
  2017-04-25 10:10                 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
  1 sibling, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-04-21  3:44 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Ross Zwisler, Alexander Viro, Alexey Kuznetsov, Andrey Ryabinin,
	Anna Schumaker, Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jan Kara, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

Users of DAX can suffer data corruption from stale mmap reads via the
following sequence:

- open an mmap over a 2MiB hole

- read from a 2MiB hole, faulting in a 2MiB zero page

- write to the hole with write(3p).  The write succeeds but we incorrectly
  leave the 2MiB zero page mapping intact.

- via the mmap, read the data that was just written.  Since the zero page
  mapping is still intact we read back zeroes instead of the new data.

We fix this by unconditionally calling invalidate_inode_pages2_range() in
dax_iomap_actor() for new block allocations, and by enhancing
__dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
being removed from the radix tree.

This is based on an initial patch from Jan Kara.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
Reported-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>    [4.10+]
---
 fs/dax.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 166504c..3f445d5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
 {
 	int ret = 0;
-	void *entry;
+	void *entry, **slot;
 	struct radix_tree_root *page_tree = &mapping->page_tree;
 
 	spin_lock_irq(&mapping->tree_lock);
-	entry = get_unlocked_mapping_entry(mapping, index, NULL);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
 	if (!entry || !radix_tree_exceptional_entry(entry))
 		goto out;
 	if (!trunc &&
 	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
 		goto out;
+
+	/*
+	 * Make sure 'entry' remains valid while we drop mapping->tree_lock to
+	 * do the unmap_mapping_range() call.
+	 */
+	entry = lock_slot(mapping, slot);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
+			(loff_t)PAGE_SIZE << dax_radix_order(entry), 0);
+
+	spin_lock_irq(&mapping->tree_lock);
 	radix_tree_delete(page_tree, index);
 	mapping->nrexceptional--;
 	ret = 1;
 out:
-	put_unlocked_mapping_entry(mapping, index, entry);
 	spin_unlock_irq(&mapping->tree_lock);
+	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
 	return ret;
 }
 /*
@@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		return -EIO;
 
 	/*
-	 * Write can allocate block for an area which has a hole page mapped
-	 * into page tables. We have to tear down these mappings so that data
-	 * written by write(2) is visible in mmap.
+	 * Write can allocate block for an area which has a hole page or zero
+	 * PMD entry in the radix tree.  We have to tear down these mappings so
+	 * that data written by write(2) is visible in mmap.
 	 */
-	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+	if (iomap->flags & IOMAP_F_NEW) {
 		invalidate_inode_pages2_range(inode->i_mapping,
 					      pos >> PAGE_SHIFT,
 					      (end - 1) >> PAGE_SHIFT);
-- 
2.9.3


* [PATCH v2 0/4] Properly invalidate data in the cleancache.
  2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
                   ` (4 preceding siblings ...)
  2017-04-18 15:24 ` [PATCH 0/4] Properly invalidate data in the cleancache Konrad Rzeszutek Wilk
@ 2017-04-24 16:41 ` Andrey Ryabinin
  2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
                     ` (3 more replies)
  5 siblings, 4 replies; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-24 16:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

Changes since v1:
 - Exclude DAX/nfs/cifs/9p hunks from the first patch. None of these
     filesystems call cleancache_get_page() (neither directly nor via
     mpage_readpage[s]()), so they are not affected by this bug.
 - Updated changelog.
     

We've noticed that after a direct IO write, a buffered read sometimes gets
stale data coming from the cleancache.
The reason is that some direct write hooks call invalidate_inode_pages2[_range]()
only if mapping->nrpages is not zero, so we may fail to invalidate
data in the cleancache.

Another odd thing is that we check only for ->nrpages and don't check for ->nrexceptional,
even though invalidate_inode_pages2[_range] invalidates exceptional entries as well.
So we invalidate exceptional entries only if ->nrpages != 0? This doesn't feel right.

 - Patch 1 fixes direct IO writes by removing the ->nrpages check.
 - Patch 2 fixes a similar case in invalidate_bdev().
     Note: I only fixed the conditional cleancache_invalidate_inode() here.
       Do we also need to add an ->nrexceptional check to invalidate_bdev()?

 - Patches 3-4: some optimizations.

Andrey Ryabinin (4):
  fs: fix data invalidation in the cleancache during direct IO
  fs/block_dev: always invalidate cleancache in invalidate_bdev()
  mm/truncate: bail out early from invalidate_inode_pages2_range() if
    mapping is empty
  mm/truncate: avoid pointless cleancache_invalidate_inode() calls.

 fs/block_dev.c | 11 +++++------
 fs/iomap.c     | 18 ++++++++----------
 mm/filemap.c   | 26 +++++++++++---------------
 mm/truncate.c  | 13 +++++++++----
 4 files changed, 33 insertions(+), 35 deletions(-)

-- 
2.10.2


* [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
@ 2017-04-24 16:41   ` Andrey Ryabinin
  2017-04-25  8:25     ` Jan Kara
  2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-24 16:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrey Ryabinin, stable, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
conditionally iff mapping->nrpages is not zero. This can't be right,
because invalidate_inode_pages2[_range]() also invalidates data in
the cleancache via a cleancache_invalidate_inode() call.
So if the page cache is empty but there is some data in the cleancache,
a buffered read after a direct IO write would get stale data from
the cleancache.

Also it doesn't feel right to check only for ->nrpages because
invalidate_inode_pages2[_range] invalidates exceptional entries as well.

Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
state.

Note: nfs, cifs and 9p don't need a similar fix because they never call
cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
they are not affected by this bug.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
---
 fs/iomap.c   | 18 ++++++++----------
 mm/filemap.c | 26 +++++++++++---------------
 2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index cdeed39..f6a6013 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -881,16 +881,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_WRITE;
 	}
 
-	if (mapping->nrpages) {
-		ret = filemap_write_and_wait_range(mapping, start, end);
-		if (ret)
-			goto out_free_dio;
+	ret = filemap_write_and_wait_range(mapping, start, end);
+	if (ret)
+		goto out_free_dio;
 
-		ret = invalidate_inode_pages2_range(mapping,
-				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
-		WARN_ON_ONCE(ret);
-		ret = 0;
-	}
+	ret = invalidate_inode_pages2_range(mapping,
+			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	WARN_ON_ONCE(ret);
+	ret = 0;
 
 	inode_dio_begin(inode);
 
@@ -945,7 +943,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
 	 * this invalidation fails, tough, the write still worked...
 	 */
-	if (iov_iter_rw(iter) == WRITE && mapping->nrpages) {
+	if (iov_iter_rw(iter) == WRITE) {
 		int err = invalidate_inode_pages2_range(mapping,
 				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
 		WARN_ON_ONCE(err);
diff --git a/mm/filemap.c b/mm/filemap.c
index 9eab40e..b7b973b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2720,18 +2720,16 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * about to write.  We do this *before* the write so that we can return
 	 * without clobbering -EIOCBQUEUED from ->direct_IO().
 	 */
-	if (mapping->nrpages) {
-		written = invalidate_inode_pages2_range(mapping,
+	written = invalidate_inode_pages2_range(mapping,
 					pos >> PAGE_SHIFT, end);
-		/*
-		 * If a page can not be invalidated, return 0 to fall back
-		 * to buffered write.
-		 */
-		if (written) {
-			if (written == -EBUSY)
-				return 0;
-			goto out;
-		}
+	/*
+	 * If a page can not be invalidated, return 0 to fall back
+	 * to buffered write.
+	 */
+	if (written) {
+		if (written == -EBUSY)
+			return 0;
+		goto out;
 	}
 
 	written = mapping->a_ops->direct_IO(iocb, from);
@@ -2744,10 +2742,8 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * so we don't support it 100%.  If this invalidation
 	 * fails, tough, the write still worked...
 	 */
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_SHIFT, end);
-	}
+	invalidate_inode_pages2_range(mapping,
+				pos >> PAGE_SHIFT, end);
 
 	if (written > 0) {
 		pos += written;
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev()
  2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
  2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
@ 2017-04-24 16:41   ` Andrey Ryabinin
  2017-04-25  8:34     ` Jan Kara
  2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
  2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
  3 siblings, 1 reply; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-24 16:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrey Ryabinin, stable, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0,
which doesn't make any sense.
Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
regardless of mapping->nrpages value.

Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
---
 fs/block_dev.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 065d7c5..f625dce 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -104,12 +104,11 @@ void invalidate_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
-		return;
-
-	invalidate_bh_lrus();
-	lru_add_drain_all();	/* make sure all lru add caches are flushed */
-	invalidate_mapping_pages(mapping, 0, -1);
+	if (mapping->nrpages) {
+		invalidate_bh_lrus();
+		lru_add_drain_all();	/* make sure all lru add caches are flushed */
+		invalidate_mapping_pages(mapping, 0, -1);
+	}
 	/* 99% of the time, we don't need to flush the cleancache on the bdev.
 	 * But, for the strange corners, lets be cautious
 	 */
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
  2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
  2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
@ 2017-04-24 16:41   ` Andrey Ryabinin
  2017-04-25  8:37     ` Jan Kara
  2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
  3 siblings, 1 reply; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-24 16:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

If the mapping is empty (both ->nrpages and ->nrexceptional are zero) we can
avoid pointless lookups in the empty radix tree and bail out immediately after
the cleancache invalidation.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 mm/truncate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index 6263aff..8f12b0e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -624,6 +624,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int did_range_unmap = 0;
 
 	cleancache_invalidate_inode(mapping);
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
+		return 0;
+
 	pagevec_init(&pvec, 0);
 	index = start;
 	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
                     ` (2 preceding siblings ...)
  2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
@ 2017-04-24 16:41   ` Andrey Ryabinin
  2017-04-25  8:41     ` Jan Kara
  3 siblings, 1 reply; 37+ messages in thread
From: Andrey Ryabinin @ 2017-04-24 16:41 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Andrey Ryabinin, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

truncate_inode_pages_range() and invalidate_inode_pages2_range() call
cleancache_invalidate_inode() twice - on entry and on exit.
This is a pointless waste of time; it's enough to call it once, at
exit.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 mm/truncate.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 8f12b0e..83a059e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -266,9 +266,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	pgoff_t		index;
 	int		i;
 
-	cleancache_invalidate_inode(mapping);
 	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
-		return;
+		goto out;
 
 	/* Offsets within partial pages */
 	partial_start = lstart & (PAGE_SIZE - 1);
@@ -363,7 +362,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	 * will be released, just zeroed, so we can bail out now.
 	 */
 	if (start >= end)
-		return;
+		goto out;
 
 	index = start;
 	for ( ; ; ) {
@@ -410,6 +409,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		pagevec_release(&pvec);
 		index++;
 	}
+
+out:
 	cleancache_invalidate_inode(mapping);
 }
 EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -623,9 +624,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 	int ret2 = 0;
 	int did_range_unmap = 0;
 
-	cleancache_invalidate_inode(mapping);
 	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
-		return 0;
+		goto out;
 
 	pagevec_init(&pvec, 0);
 	index = start;
@@ -689,6 +689,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 		cond_resched();
 		index++;
 	}
+
+out:
 	cleancache_invalidate_inode(mapping);
 	return ret;
 }
-- 
2.10.2

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO
  2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
@ 2017-04-25  8:25     ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2017-04-25  8:25 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, stable, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

On Mon 24-04-17 19:41:32, Andrey Ryabinin wrote:
> Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
> conditionally iff mapping->nrpages is not zero. This can't be right,
> because invalidate_inode_pages2[_range]() also invalidates data in
> the cleancache via a cleancache_invalidate_inode() call.
> So if the page cache is empty but there is some data in the cleancache,
> a buffered read after a direct IO write would get stale data from
> the cleancache.
> 
> Also it doesn't feel right to check only for ->nrpages because
> invalidate_inode_pages2[_range] invalidates exceptional entries as well.
> 
> Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages
> state.
> 
> Note: nfs, cifs and 9p don't need a similar fix because they never call
> cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
> they are not affected by this bug.
> 
> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: <stable@vger.kernel.org>

OK, looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza


> ---
>  fs/iomap.c   | 18 ++++++++----------
>  mm/filemap.c | 26 +++++++++++---------------
>  2 files changed, 19 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index cdeed39..f6a6013 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -881,16 +881,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_WRITE;
>  	}
>  
> -	if (mapping->nrpages) {
> -		ret = filemap_write_and_wait_range(mapping, start, end);
> -		if (ret)
> -			goto out_free_dio;
> +	ret = filemap_write_and_wait_range(mapping, start, end);
> +	if (ret)
> +		goto out_free_dio;
>  
> -		ret = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(ret);
> -		ret = 0;
> -	}
> +	ret = invalidate_inode_pages2_range(mapping,
> +			start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +	WARN_ON_ONCE(ret);
> +	ret = 0;
>  
>  	inode_dio_begin(inode);
>  
> @@ -945,7 +943,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
>  	 * this invalidation fails, tough, the write still worked...
>  	 */
> -	if (iov_iter_rw(iter) == WRITE && mapping->nrpages) {
> +	if (iov_iter_rw(iter) == WRITE) {
>  		int err = invalidate_inode_pages2_range(mapping,
>  				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
>  		WARN_ON_ONCE(err);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9eab40e..b7b973b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2720,18 +2720,16 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * about to write.  We do this *before* the write so that we can return
>  	 * without clobbering -EIOCBQUEUED from ->direct_IO().
>  	 */
> -	if (mapping->nrpages) {
> -		written = invalidate_inode_pages2_range(mapping,
> +	written = invalidate_inode_pages2_range(mapping,
>  					pos >> PAGE_SHIFT, end);
> -		/*
> -		 * If a page can not be invalidated, return 0 to fall back
> -		 * to buffered write.
> -		 */
> -		if (written) {
> -			if (written == -EBUSY)
> -				return 0;
> -			goto out;
> -		}
> +	/*
> +	 * If a page can not be invalidated, return 0 to fall back
> +	 * to buffered write.
> +	 */
> +	if (written) {
> +		if (written == -EBUSY)
> +			return 0;
> +		goto out;
>  	}
>  
>  	written = mapping->a_ops->direct_IO(iocb, from);
> @@ -2744,10 +2742,8 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * so we don't support it 100%.  If this invalidation
>  	 * fails, tough, the write still worked...
>  	 */
> -	if (mapping->nrpages) {
> -		invalidate_inode_pages2_range(mapping,
> -					      pos >> PAGE_SHIFT, end);
> -	}
> +	invalidate_inode_pages2_range(mapping,
> +				pos >> PAGE_SHIFT, end);
>  
>  	if (written > 0) {
>  		pos += written;
> -- 
> 2.10.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev()
  2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
@ 2017-04-25  8:34     ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2017-04-25  8:34 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, stable, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

On Mon 24-04-17 19:41:33, Andrey Ryabinin wrote:
> invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0,
> which doesn't make any sense.
> Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
> regardless of mapping->nrpages value.
> 
> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: <stable@vger.kernel.org>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/block_dev.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 065d7c5..f625dce 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -104,12 +104,11 @@ void invalidate_bdev(struct block_device *bdev)
>  {
>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>  
> -	if (mapping->nrpages == 0)
> -		return;
> -
> -	invalidate_bh_lrus();
> -	lru_add_drain_all();	/* make sure all lru add caches are flushed */
> -	invalidate_mapping_pages(mapping, 0, -1);
> +	if (mapping->nrpages) {
> +		invalidate_bh_lrus();
> +		lru_add_drain_all();	/* make sure all lru add caches are flushed */
> +		invalidate_mapping_pages(mapping, 0, -1);
> +	}
>  	/* 99% of the time, we don't need to flush the cleancache on the bdev.
>  	 * But, for the strange corners, lets be cautious
>  	 */
> -- 
> 2.10.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
  2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
@ 2017-04-25  8:37     ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2017-04-25  8:37 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

On Mon 24-04-17 19:41:34, Andrey Ryabinin wrote:
> If the mapping is empty (both ->nrpages and ->nrexceptional are zero) we can
> avoid pointless lookups in the empty radix tree and bail out immediately after
> the cleancache invalidation.
> 
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/truncate.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 6263aff..8f12b0e 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -624,6 +624,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  	int did_range_unmap = 0;
>  
>  	cleancache_invalidate_inode(mapping);
> +	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> +		return 0;
> +
>  	pagevec_init(&pvec, 0);
>  	index = start;
>  	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
> -- 
> 2.10.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
  2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
@ 2017-04-25  8:41     ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2017-04-25  8:41 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Alexander Viro, Konrad Rzeszutek Wilk, Ross Zwisler,
	Andrew Morton, Jan Kara, Jens Axboe, Johannes Weiner,
	Alexey Kuznetsov, Christoph Hellwig, Nikolay Borisov,
	linux-kernel, linux-fsdevel, linux-mm

On Mon 24-04-17 19:41:35, Andrey Ryabinin wrote:
> truncate_inode_pages_range() and invalidate_inode_pages2_range() call
> cleancache_invalidate_inode() twice - on entry and on exit.
> This is a pointless waste of time; it's enough to call it once, at
> exit.
> 
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Looks sensible to me but I don't really know cleancache :). Anyway feel
free to add:

Acked-by: Jan Kara <jack@suse.cz>
	
								Honza

> ---
>  mm/truncate.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 8f12b0e..83a059e 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -266,9 +266,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	pgoff_t		index;
>  	int		i;
>  
> -	cleancache_invalidate_inode(mapping);
>  	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> -		return;
> +		goto out;
>  
>  	/* Offsets within partial pages */
>  	partial_start = lstart & (PAGE_SIZE - 1);
> @@ -363,7 +362,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  	 * will be released, just zeroed, so we can bail out now.
>  	 */
>  	if (start >= end)
> -		return;
> +		goto out;
>  
>  	index = start;
>  	for ( ; ; ) {
> @@ -410,6 +409,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
>  		pagevec_release(&pvec);
>  		index++;
>  	}
> +
> +out:
>  	cleancache_invalidate_inode(mapping);
>  }
>  EXPORT_SYMBOL(truncate_inode_pages_range);
> @@ -623,9 +624,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  	int ret2 = 0;
>  	int did_range_unmap = 0;
>  
> -	cleancache_invalidate_inode(mapping);
>  	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
> -		return 0;
> +		goto out;
>  
>  	pagevec_init(&pvec, 0);
>  	index = start;
> @@ -689,6 +689,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>  		cond_resched();
>  		index++;
>  	}
> +
> +out:
>  	cleancache_invalidate_inode(mapping);
>  	return ret;
>  }
> -- 
> 2.10.2
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] dax: prevent invalidation of mapped DAX entries
  2017-04-21  3:44               ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
  2017-04-21  3:44                 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
@ 2017-04-25 10:10                 ` Jan Kara
  2017-05-01 16:54                   ` Ross Zwisler
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-04-25 10:10 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Alexander Viro, Alexey Kuznetsov,
	Andrey Ryabinin, Anna Schumaker, Christoph Hellwig, Dan Williams,
	Darrick J. Wong, Eric Van Hensbergen, Jan Kara, Jens Axboe,
	Johannes Weiner, Konrad Rzeszutek Wilk, Latchesar Ionkov,
	linux-cifs, linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm,
	Matthew Wilcox, Ron Minnich, samba-technical, Steve French,
	Trond Myklebust, v9fs-developer

On Thu 20-04-17 21:44:36, Ross Zwisler wrote:
> dax_invalidate_mapping_entry() currently removes DAX exceptional entries
> only if they are clean and unlocked.  This is done via:
> 
> invalidate_mapping_pages()
>   invalidate_exceptional_entry()
>     dax_invalidate_mapping_entry()
> 
> However, for page cache pages removed in invalidate_mapping_pages() there
> is an additional criteria which is that the page must not be mapped.  This
> is noted in the comments above invalidate_mapping_pages() and is checked in
> invalidate_inode_page().
> 
> For DAX entries this means that we can end up in a situation where a
> DAX exceptional entry, either a huge zero page or a regular DAX entry,
> could end up mapped but without an associated radix tree entry. This is
> inconsistent with the rest of the DAX code and with what happens in the
> page cache case.
> 
> We aren't able to unmap the DAX exceptional entry because according to its
> comments invalidate_mapping_pages() isn't allowed to block, and
> unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
> 
> Since we essentially never have unmapped DAX entries to evict from the
> radix tree, just remove dax_invalidate_mapping_entry().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]

Just as a side note - we wouldn't really have to unmap the mapping range
covered by the DAX exceptional entry. It would be enough to find out
whether such range is mapped and bail out in that case. But that would
still be pretty expensive for DAX - we'd have to do rmap walk similar as in
dax_mapping_entry_mkclean() and IMHO it is not worth it. So I agree with
what you did. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> 
> This series applies cleanly to the current v4.11-rc7 based linux/master,
> and has passed an xfstests run with DAX on ext4 and XFS.
> 
> These patches also apply to v4.10.9 with a little work from the 3-way
> merge feature.
> 
>  fs/dax.c            | 29 -----------------------------
>  include/linux/dax.h |  1 -
>  mm/truncate.c       |  9 +++------
>  3 files changed, 3 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85abd74..166504c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -507,35 +507,6 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
>  }
>  
>  /*
> - * Invalidate exceptional DAX entry if easily possible. This handles DAX
> - * entries for invalidate_inode_pages() so we evict the entry only if we can
> - * do so without blocking.
> - */
> -int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
> -{
> -	int ret = 0;
> -	void *entry, **slot;
> -	struct radix_tree_root *page_tree = &mapping->page_tree;
> -
> -	spin_lock_irq(&mapping->tree_lock);
> -	entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
> -	if (!entry || !radix_tree_exceptional_entry(entry) ||
> -	    slot_locked(mapping, slot))
> -		goto out;
> -	if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> -	    radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> -		goto out;
> -	radix_tree_delete(page_tree, index);
> -	mapping->nrexceptional--;
> -	ret = 1;
> -out:
> -	spin_unlock_irq(&mapping->tree_lock);
> -	if (ret)
> -		dax_wake_mapping_entry_waiter(mapping, index, entry, true);
> -	return ret;
> -}
> -
> -/*
>   * Invalidate exceptional DAX entry if it is clean.
>   */
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d8a3dc0..f8e1833 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -41,7 +41,6 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  		    const struct iomap_ops *ops);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> -int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
>  void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 6263aff..c537184 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -67,17 +67,14 @@ static void truncate_exceptional_entry(struct address_space *mapping,
>  
>  /*
>   * Invalidate exceptional entry if easily possible. This handles exceptional
> - * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
> - * clean entries.
> + * entries for invalidate_inode_pages().
>   */
>  static int invalidate_exceptional_entry(struct address_space *mapping,
>  					pgoff_t index, void *entry)
>  {
> -	/* Handled by shmem itself */
> -	if (shmem_mapping(mapping))
> +	/* Handled by shmem itself, or for DAX we do nothing. */
> +	if (shmem_mapping(mapping) || dax_mapping(mapping))
>  		return 1;
> -	if (dax_mapping(mapping))
> -		return dax_invalidate_mapping_entry(mapping, index);
>  	clear_shadow_entry(mapping, index, entry);
>  	return 1;
>  }
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-21  3:44                 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
@ 2017-04-25 11:10                   ` Jan Kara
  2017-04-25 22:59                     ` Ross Zwisler
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-04-25 11:10 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Alexander Viro, Alexey Kuznetsov,
	Andrey Ryabinin, Anna Schumaker, Christoph Hellwig, Dan Williams,
	Darrick J. Wong, Eric Van Hensbergen, Jan Kara, Jens Axboe,
	Johannes Weiner, Konrad Rzeszutek Wilk, Latchesar Ionkov,
	linux-cifs, linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm,
	Matthew Wilcox, Ron Minnich, samba-technical, Steve French,
	Trond Myklebust, v9fs-developer

On Thu 20-04-17 21:44:37, Ross Zwisler wrote:
> Users of DAX can suffer data corruption from stale mmap reads via the
> following sequence:
> 
> - open an mmap over a 2MiB hole
> 
> - read from a 2MiB hole, faulting in a 2MiB zero page
> 
> - write to the hole with write(3p).  The write succeeds but we incorrectly
>   leave the 2MiB zero page mapping intact.
> 
> - via the mmap, read the data that was just written.  Since the zero page
>   mapping is still intact we read back zeroes instead of the new data.
> 
> We fix this by unconditionally calling invalidate_inode_pages2_range() in
> dax_iomap_actor() for new block allocations, and by enhancing
> __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
> being removed from the radix tree.
> 
> This is based on an initial patch from Jan Kara.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>    [4.10+]
> ---
>  fs/dax.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 166504c..3f445d5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
>  					  pgoff_t index, bool trunc)
>  {
>  	int ret = 0;
> -	void *entry;
> +	void *entry, **slot;
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
>  	if (!entry || !radix_tree_exceptional_entry(entry))
>  		goto out;
>  	if (!trunc &&
>  	    (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
>  	     radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
>  		goto out;
> +
> +	/*
> +	 * Make sure 'entry' remains valid while we drop mapping->tree_lock to
> +	 * do the unmap_mapping_range() call.
> +	 */
> +	entry = lock_slot(mapping, slot);

This also stops page faults from mapping the entry again. Maybe worth
mentioning here as well.

> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
> +			(loff_t)PAGE_SIZE << dax_radix_order(entry), 0);

Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping
large ranges - each unmap means an rmap walk... Since this is a data
corruption class of bug, let's fix it this way for now but I think we'll
need to improve this later.

E.g. what if we called unmap_mapping_range() for the whole invalidated
range after removing the radix tree entries?

Hum, but now thinking more about it I have hard time figuring out why write
vs fault cannot actually still race:

CPU1 - write(2)				CPU2 - read fault

					dax_iomap_pte_fault()
					  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
					  grab_mapping_entry()
					  - we add zero page in the radix
					    tree & map it to page tables

Similarly read vs write fault may end up racing in a wrong way and try to
replace already existing exceptional entry with a hole page?

								Honza
> +
> +	spin_lock_irq(&mapping->tree_lock);
>  	radix_tree_delete(page_tree, index);
>  	mapping->nrexceptional--;
>  	ret = 1;
>  out:
> -	put_unlocked_mapping_entry(mapping, index, entry);
>  	spin_unlock_irq(&mapping->tree_lock);
> +	dax_wake_mapping_entry_waiter(mapping, index, entry, true);
>  	return ret;
>  }
>  /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		return -EIO;
>  
>  	/*
> -	 * Write can allocate block for an area which has a hole page mapped
> -	 * into page tables. We have to tear down these mappings so that data
> -	 * written by write(2) is visible in mmap.
> +	 * Write can allocate block for an area which has a hole page or zero
> +	 * PMD entry in the radix tree.  We have to tear down these mappings so
> +	 * that data written by write(2) is visible in mmap.
>  	 */
> -	if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +	if (iomap->flags & IOMAP_F_NEW) {
>  		invalidate_inode_pages2_range(inode->i_mapping,
>  					      pos >> PAGE_SHIFT,
>  					      (end - 1) >> PAGE_SHIFT);
> -- 
> 2.9.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-25 11:10                   ` Jan Kara
@ 2017-04-25 22:59                     ` Ross Zwisler
  2017-04-26  8:52                       ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-04-25 22:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
<>
> Hum, but now thinking more about it I have hard time figuring out why write
> vs fault cannot actually still race:
> 
> CPU1 - write(2)				CPU2 - read fault
> 
> 					dax_iomap_pte_fault()
> 					  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 					  grab_mapping_entry()
> 					  - we add zero page in the radix
> 					    tree & map it to page tables
> 
> Similarly read vs write fault may end up racing in a wrong way and try to
> replace already existing exceptional entry with a hole page?

Yep, this race seems real to me, too.  This seems very much like the issues
that exist when a thread is doing direct I/O.  One thread is doing I/O to an
intermediate buffer (page cache for direct I/O case, zero page for us), and
the other is going around it directly to media, and they can get out of sync.

IIRC the direct I/O code looked something like:

1/ invalidate existing mappings
2/ do direct I/O to media
3/ invalidate mappings again, just in case.  Should be cheap if there weren't
   any conflicting faults.  This makes sure any new allocations we made are
   faulted in.

I guess one option would be to replicate that logic in the DAX I/O path, or we
could try and enhance our locking so page faults can't race with I/O since
both can allocate blocks.

I'm not sure, but will think on it.


* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-25 22:59                     ` Ross Zwisler
@ 2017-04-26  8:52                       ` Jan Kara
  2017-04-26 22:52                         ` Ross Zwisler
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Kara @ 2017-04-26  8:52 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> <>
> > Hum, but now thinking more about it I have hard time figuring out why write
> > vs fault cannot actually still race:
> > 
> > CPU1 - write(2)				CPU2 - read fault
> > 
> > 					dax_iomap_pte_fault()
> > 					  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 					  grab_mapping_entry()
> > 					  - we add zero page in the radix
> > 					    tree & map it to page tables
> > 
> > Similarly read vs write fault may end up racing in a wrong way and try to
> > replace already existing exceptional entry with a hole page?
> 
> Yep, this race seems real to me, too.  This seems very much like the issues
> that exist when a thread is doing direct I/O.  One thread is doing I/O to an
> intermediate buffer (page cache for direct I/O case, zero page for us), and
> the other is going around it directly to media, and they can get out of sync.
> 
> IIRC the direct I/O code looked something like:
> 
> 1/ invalidate existing mappings
> 2/ do direct I/O to media
> 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
>    any conflicting faults.  This makes sure any new allocations we made are
>    faulted in.

Yeah, the problem is that people generally expect weird behavior when they
mix direct and buffered IO (let alone mmap); however, everyone expects
standard read(2) and write(2) to be completely coherent with mmap(2).

> I guess one option would be to replicate that logic in the DAX I/O path, or we
> could try and enhance our locking so page faults can't race with I/O since
> both can allocate blocks.

In the abstract, the problem is that the radix tree (and page tables)
cache block mapping information, and the operation "read block mapping
information, store it in the radix tree" is not serialized in any way
against other block allocations, so the information we store can be out
of date by the time we store it.

One way to solve this would be to move the ->iomap_begin call in the fault
paths under the entry lock, although that would mean redoing how ext4
handles DAX faults because with the current code it would create a lock
inversion wrt transaction start.

Another solution would be to grab i_mmap_sem for write when doing a write
fault of a page and similarly have it grabbed for writing when doing
write(2). This would scale rather poorly, but if we later replaced it with a
range lock (Davidlohr has already posted a nice implementation of one) it
wouldn't be as bad. But I guess option 1) is better...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-26  8:52                       ` Jan Kara
@ 2017-04-26 22:52                         ` Ross Zwisler
  2017-04-27  7:26                           ` Jan Kara
  0 siblings, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-04-26 22:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
> On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> > <>
> > > Hum, but now thinking more about it I have hard time figuring out why write
> > > vs fault cannot actually still race:
> > > 
> > > CPU1 - write(2)				CPU2 - read fault
> > > 
> > > 					dax_iomap_pte_fault()
> > > 					  ->iomap_begin() - sees hole
> > > dax_iomap_rw()
> > >   iomap_apply()
> > >     ->iomap_begin - allocates blocks
> > >     dax_iomap_actor()
> > >       invalidate_inode_pages2_range()
> > >         - there's nothing to invalidate
> > > 					  grab_mapping_entry()
> > > 					  - we add zero page in the radix
> > > 					    tree & map it to page tables
> > > 
> > > Similarly read vs write fault may end up racing in a wrong way and try to
> > > replace already existing exceptional entry with a hole page?
> > 
> > Yep, this race seems real to me, too.  This seems very much like the issues
> > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
> > intermediate buffer (page cache for direct I/O case, zero page for us), and
> > the other is going around it directly to media, and they can get out of sync.
> > 
> > IIRC the direct I/O code looked something like:
> > 
> > 1/ invalidate existing mappings
> > 2/ do direct I/O to media
> > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
> >    any conflicting faults.  This makes sure any new allocations we made are
> >    faulted in.
> 
> Yeah, the problem is people generally expect weird behavior when they mix
> direct and buffered IO (or let alone mmap) however everyone expects
> standard read(2) and write(2) to be completely coherent with mmap(2).

Yep, fair enough.

> > I guess one option would be to replicate that logic in the DAX I/O path, or we
> > could try and enhance our locking so page faults can't race with I/O since
> > both can allocate blocks.
> 
> In the abstract way, the problem is that we have radix tree (and page
> tables) cache block mapping information and the operation: "read block
> mapping information, store it in the radix tree" is not serialized in any
> way against other block allocations so the information we store can be out
> of date by the time we store it.
> 
> One way to solve this would be to move ->iomap_begin call in the fault
> paths under entry lock although that would mean I have to redo how ext4
> handles DAX faults because with current code it would create lock inversion
> wrt transaction start.

I don't think this alone is enough to save us.  The I/O path doesn't currently
take any DAX radix tree entry locks, so our race would just become:

CPU1 - write(2)				CPU2 - read fault

					dax_iomap_pte_fault()
					  grab_mapping_entry() // newly moved
					  ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
					  - we add zero page in the radix
					    tree & map it to page tables

In their current form I don't think we want to take DAX radix tree entry locks
in the I/O path because that would effectively serialize I/O over a given
radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
would be serialized.

> Another solution would be to grab i_mmap_sem for write when doing write
> fault of a page and similarly have it grabbed for writing when doing
> write(2). This would scale rather poorly but if we later replaced it with a
> range lock (Davidlohr has already posted a nice implementation of it) it
> won't be as bad. But I guess option 1) is better...

The best idea I had for handling this sounds similar, which would be to
convert the radix tree locks to essentially be reader/writer locks.  I/O and
faults that don't modify the block mapping could just take read-level locks,
and could all run concurrently.  I/O or faults that modify a block mapping
would take a write lock, and serialize with other writers and readers.

You could know if you needed a write lock without asking the filesystem - if
you're a write and the radix tree entry is empty or is for a zero page, you
grab the write lock.

This dovetails nicely with the idea of having the radix tree act as a cache
for block mappings.  You take the appropriate lock on the radix tree entry,
and it has the block mapping info for your I/O or fault so you don't have to
call into the FS.  I/O would also participate so we would keep info about
block mappings that we gather from I/O to help shortcut our page faults.

How does this sound vs the range lock idea?  How hard do you think it would be
to convert our current wait queue system to reader/writer style locking?

Also, how do you think we should deal with the current PMD corruption?  Should
we go with the current fix (I can augment the comments as you suggested), and
then handle optimizations to that approach and the solution to this larger
race as a follow-on?


* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-26 22:52                         ` Ross Zwisler
@ 2017-04-27  7:26                           ` Jan Kara
  2017-05-01 22:38                             ` Ross Zwisler
  2017-05-01 22:59                             ` Dan Williams
  0 siblings, 2 replies; 37+ messages in thread
From: Jan Kara @ 2017-04-27  7:26 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
> On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
> > On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
> > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
> > > <>
> > > > Hum, but now thinking more about it I have hard time figuring out why write
> > > > vs fault cannot actually still race:
> > > > 
> > > > CPU1 - write(2)				CPU2 - read fault
> > > > 
> > > > 					dax_iomap_pte_fault()
> > > > 					  ->iomap_begin() - sees hole
> > > > dax_iomap_rw()
> > > >   iomap_apply()
> > > >     ->iomap_begin - allocates blocks
> > > >     dax_iomap_actor()
> > > >       invalidate_inode_pages2_range()
> > > >         - there's nothing to invalidate
> > > > 					  grab_mapping_entry()
> > > > 					  - we add zero page in the radix
> > > > 					    tree & map it to page tables
> > > > 
> > > > Similarly read vs write fault may end up racing in a wrong way and try to
> > > > replace already existing exceptional entry with a hole page?
> > > 
> > > Yep, this race seems real to me, too.  This seems very much like the issues
> > > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
> > > intermediate buffer (page cache for direct I/O case, zero page for us), and
> > > the other is going around it directly to media, and they can get out of sync.
> > > 
> > > IIRC the direct I/O code looked something like:
> > > 
> > > 1/ invalidate existing mappings
> > > 2/ do direct I/O to media
> > > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
> > >    any conflicting faults.  This makes sure any new allocations we made are
> > >    faulted in.
> > 
> > Yeah, the problem is people generally expect weird behavior when they mix
> > direct and buffered IO (or let alone mmap) however everyone expects
> > standard read(2) and write(2) to be completely coherent with mmap(2).
> 
> Yep, fair enough.
> 
> > > I guess one option would be to replicate that logic in the DAX I/O path, or we
> > > could try and enhance our locking so page faults can't race with I/O since
> > > both can allocate blocks.
> > 
> > In the abstract way, the problem is that we have radix tree (and page
> > tables) cache block mapping information and the operation: "read block
> > mapping information, store it in the radix tree" is not serialized in any
> > way against other block allocations so the information we store can be out
> > of date by the time we store it.
> > 
> > One way to solve this would be to move ->iomap_begin call in the fault
> > paths under entry lock although that would mean I have to redo how ext4
> > handles DAX faults because with current code it would create lock inversion
> > wrt transaction start.
> 
> I don't think this alone is enough to save us.  The I/O path doesn't currently
> take any DAX radix tree entry locks, so our race would just become:
> 
> CPU1 - write(2)				CPU2 - read fault
> 
> 					dax_iomap_pte_fault()
> 					  grab_mapping_entry() // newly moved
> 					  ->iomap_begin() - sees hole
> dax_iomap_rw()
>   iomap_apply()
>     ->iomap_begin - allocates blocks
>     dax_iomap_actor()
>       invalidate_inode_pages2_range()
>         - there's nothing to invalidate
> 					  - we add zero page in the radix
> 					    tree & map it to page tables
> 
> In their current form I don't think we want to take DAX radix tree entry locks
> in the I/O path because that would effectively serialize I/O over a given
> radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
> would be serialized.

Note that invalidate_inode_pages2_range() will see the entry created by
grab_mapping_entry() on CPU2 and block waiting for its lock and this is
exactly what stops the race. The invalidate_inode_pages2_range()
effectively makes sure there isn't any page fault in progress for given
range...

Also note that writes to a file are serialized by i_rwsem anyway (and at
least serialization of writes to the overlapping range is required by POSIX)
so this doesn't add any more serialization than we already have.

> > Another solution would be to grab i_mmap_sem for write when doing write
> > fault of a page and similarly have it grabbed for writing when doing
> > write(2). This would scale rather poorly but if we later replaced it with a
> > range lock (Davidlohr has already posted a nice implementation of it) it
> > won't be as bad. But I guess option 1) is better...
> 
> The best idea I had for handling this sounds similar, which would be to
> convert the radix tree locks to essentially be reader/writer locks.  I/O and
> faults that don't modify the block mapping could just take read-level locks,
> and could all run concurrently.  I/O or faults that modify a block mapping
> would take a write lock, and serialize with other writers and readers.

Well, this would be difficult to implement inside the radix tree (not
enough bits in the entry) so you'd have to go for some external locking
primitive anyway. And if you do that, the read-write range lock Davidlohr
has implemented is what you describe - well, we could also have a radix
tree with rwsems, but I suspect the overhead of maintaining that would be
too large. It would require a larger rewrite than reusing entry locks as I
suggest above, though, and it isn't an obvious performance win for
realistic workloads either, so I'd like to see some performance numbers
before going that way. It likely improves the situation where processes
race to fault the same page for which we already know the block mapping,
but I'm not sure that translates to any measurable performance wins for
workloads on a DAX filesystem.

> You could know if you needed a write lock without asking the filesystem - if
> you're a write and the radix tree entry is empty or is for a zero page, you
> grab the write lock.
> 
> This dovetails nicely with the idea of having the radix tree act as a cache
> for block mappings.  You take the appropriate lock on the radix tree entry,
> and it has the block mapping info for your I/O or fault so you don't have to
> call into the FS.  I/O would also participate so we would keep info about
> block mappings that we gather from I/O to help shortcut our page faults.
> 
> How does this sound vs the range lock idea?  How hard do you think it would be
> to convert our current wait queue system to reader/writer style locking?
> 
> Also, how do you think we should deal with the current PMD corruption?  Should
> we go with the current fix (I can augment the comments as you suggested), and
> then handle optimizations to that approach and the solution to this larger
> race as a follow-on?

So for now I'm still more inclined to stay with the radix tree lock as
is, just fix up the locking as I suggest, and go for a larger rewrite only
if we can demonstrate further performance wins.

WRT your second patch, if we go with the locking as I suggest, it is enough
to unmap the whole range after invalidate_inode_pages2() has cleared the
radix tree entries (*), which will be much cheaper (for large writes) than
unmapping entry by entry. So I'd go for that. I'll prepare a patch for the
locking change - it will require changes to ext4 transaction handling so it
won't be completely trivial.

(*) The flow of information is: filesystem block mapping info -> radix tree
-> page tables, so if the filesystem block mapping info changes, we should
go invalidate the corresponding radix tree entries (new entries will already
have uptodate info) and then invalidate the corresponding page tables
(again, once the radix tree has no stale entries, we are sure new page
table entries will be uptodate).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 1/2] dax: prevent invalidation of mapped DAX entries
  2017-04-25 10:10                 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
@ 2017-05-01 16:54                   ` Ross Zwisler
  0 siblings, 0 replies; 37+ messages in thread
From: Ross Zwisler @ 2017-05-01 16:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Tue, Apr 25, 2017 at 12:10:41PM +0200, Jan Kara wrote:
> On Thu 20-04-17 21:44:36, Ross Zwisler wrote:
> > dax_invalidate_mapping_entry() currently removes DAX exceptional entries
> > only if they are clean and unlocked.  This is done via:
> > 
> > invalidate_mapping_pages()
> >   invalidate_exceptional_entry()
> >     dax_invalidate_mapping_entry()
> > 
> > However, for page cache pages removed in invalidate_mapping_pages() there
> > is an additional criteria which is that the page must not be mapped.  This
> > is noted in the comments above invalidate_mapping_pages() and is checked in
> > invalidate_inode_page().
> > 
> > For DAX entries this means that we can end up in a situation where a
> > DAX exceptional entry, either a huge zero page or a regular DAX entry,
> > could end up mapped but without an associated radix tree entry. This is
> > inconsistent with the rest of the DAX code and with what happens in the
> > page cache case.
> > 
> > We aren't able to unmap the DAX exceptional entry because according to its
> > comments invalidate_mapping_pages() isn't allowed to block, and
> > unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
> > 
> > Since we essentially never have unmapped DAX entries to evict from the
> > radix tree, just remove dax_invalidate_mapping_entry().
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> > Reported-by: Jan Kara <jack@suse.cz>
> > Cc: <stable@vger.kernel.org>    [4.10+]
> 
> Just as a side note - we wouldn't really have to unmap the mapping range
> covered by the DAX exceptional entry. It would be enough to find out
> whether such a range is mapped and bail out in that case. But that would
> still be pretty expensive for DAX - we'd have to do an rmap walk similar to
> the one in dax_mapping_entry_mkclean() and IMHO it is not worth it. So I
> agree with what you did. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Yep, that makes sense.  Thanks for the review.


* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-27  7:26                           ` Jan Kara
@ 2017-05-01 22:38                             ` Ross Zwisler
  2017-05-04  9:12                               ` Jan Kara
  2017-05-01 22:59                             ` Dan Williams
  1 sibling, 1 reply; 37+ messages in thread
From: Ross Zwisler @ 2017-05-01 22:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Thu, Apr 27, 2017 at 09:26:59AM +0200, Jan Kara wrote:
> On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
<>
> > I don't think this alone is enough to save us.  The I/O path doesn't currently
> > take any DAX radix tree entry locks, so our race would just become:
> > 
> > CPU1 - write(2)				CPU2 - read fault
> > 
> > 					dax_iomap_pte_fault()
> > 					  grab_mapping_entry() // newly moved
> > 					  ->iomap_begin() - sees hole
> > dax_iomap_rw()
> >   iomap_apply()
> >     ->iomap_begin - allocates blocks
> >     dax_iomap_actor()
> >       invalidate_inode_pages2_range()
> >         - there's nothing to invalidate
> > 					  - we add zero page in the radix
> > 					    tree & map it to page tables
> > 
> > In their current form I don't think we want to take DAX radix tree entry locks
> > in the I/O path because that would effectively serialize I/O over a given
> > radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
> > would be serialized.
> 
> Note that invalidate_inode_pages2_range() will see the entry created by
> grab_mapping_entry() on CPU2 and block waiting for its lock and this is
> exactly what stops the race. The invalidate_inode_pages2_range()
> effectively makes sure there isn't any page fault in progress for given
> range...

Yep, this is the bit that I was missing.  Thanks.

> Also note that writes to a file are serialized by i_rwsem anyway (and at
> least serialization of writes to the overlapping range is required by POSIX)
> so this doesn't add any more serialization than we already have.
> 
> > > Another solution would be to grab i_mmap_sem for write when doing write
> > > fault of a page and similarly have it grabbed for writing when doing
> > > write(2). This would scale rather poorly but if we later replaced it with a
> > > range lock (Davidlohr has already posted a nice implementation of it) it
> > > won't be as bad. But I guess option 1) is better...
> > 
> > The best idea I had for handling this sounds similar, which would be to
> > convert the radix tree locks to essentially be reader/writer locks.  I/O and
> > faults that don't modify the block mapping could just take read-level locks,
> > and could all run concurrently.  I/O or faults that modify a block mapping
> > would take a write lock, and serialize with other writers and readers.
> 
> Well, this would be difficult to implement inside the radix tree (not
> enough bits in the entry) so you'd have to go for some external locking
> primitive anyway. And if you do that, read-write range lock Davidlohr has
> implemented is what you describe - well we could also have a radix tree
> with rwsems but I suspect the overhead of maintaining that would be too
> large. It would require larger rewrite than reusing entry locks as I
> suggest above though and it isn't an obvious performance win for realistic
> workloads either so I'd like to see some performance numbers before going
> that way. It likely improves a situation where processes race to fault the
> same page for which we already know the block mapping but I'm not sure if
> that translates to any measurable performance wins for workloads on DAX
> filesystem.
> 
> > You could know if you needed a write lock without asking the filesystem - if
> > you're a write and the radix tree entry is empty or is for a zero page, you
> > grab the write lock.
> > 
> > This dovetails nicely with the idea of having the radix tree act as a cache
> > for block mappings.  You take the appropriate lock on the radix tree entry,
> > and it has the block mapping info for your I/O or fault so you don't have to
> > call into the FS.  I/O would also participate so we would keep info about
> > block mappings that we gather from I/O to help shortcut our page faults.
> > 
> > How does this sound vs the range lock idea?  How hard do you think it would be
> > to convert our current wait queue system to reader/writer style locking?
> > 
> > Also, how do you think we should deal with the current PMD corruption?  Should
> > we go with the current fix (I can augment the comments as you suggested), and
> > then handle optimizations to that approach and the solution to this larger
> > race as a follow-on?
> 
> So for now I'm still more inclined to just stay with the radix tree lock as
> is and just fix up the locking as I suggest and go for larger rewrite only
> if we can demonstrate further performance wins.

Sounds good.

> WRT your second patch, if we go with the locking as I suggest, it is enough
> to unmap the whole range after invalidate_inode_pages2() has cleared radix
> tree entries (*) which will be much cheaper (for large writes) than doing
> unmapping entry by entry.

I'm still not convinced that it is safe to do the unmap in a separate step.  I
see your point about it being expensive to do an rmap walk to unmap each entry
in __dax_invalidate_mapping_entry(), but I think we might need to, because the
unmap is part of the contract imposed by invalidate_inode_pages2_range() and
invalidate_inode_pages2().  This exists in the header comment above each:

 * Any pages which are found to be mapped into pagetables are unmapped prior
 * to invalidation.

If you look at the usage of invalidate_inode_pages2_range() in
generic_file_direct_write() for example (which I realize we won't call for a
DAX inode, but still), I think that it really does rely on the fact that
invalidated pages are unmapped, right?  If it didn't, and hole pages were
mapped, the hole pages could remain mapped while a direct I/O write allocated
blocks and then wrote real data.

If we really want to unmap the entire range at once, maybe it would have to be
done in invalidate_inode_pages2_range(), after the loop?  My hesitation about
this is that we'd be leaking yet more DAX special casing up into the
mm/truncate.c code.

Or am I missing something?

> So I'd go for that. I'll prepare a patch for the
> locking change - it will require changes to ext4 transaction handling so it
> won't be completely trivial.
> 
> (*) The flow of information is: filesystem block mapping info -> radix tree
> -> page tables so if 'filesystem block mapping info' changes, we should go
> invalidate corresponding radix tree entries (new entries will already have
> uptodate info) and then invalidate corresponding page tables (again once
> radix tree has no stale entries, we are sure new page table entries will be
> uptodate).
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-04-27  7:26                           ` Jan Kara
  2017-05-01 22:38                             ` Ross Zwisler
@ 2017-05-01 22:59                             ` Dan Williams
  1 sibling, 0 replies; 37+ messages in thread
From: Dan Williams @ 2017-05-01 22:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Darrick J. Wong, Eric Van Hensbergen,
	Jens Axboe, Johannes Weiner, Konrad Rzeszutek Wilk,
	Latchesar Ionkov, linux-cifs, linux-fsdevel, Linux MM, linux-nfs,
	linux-nvdimm@lists.01.org, Matthew Wilcox, Ron Minnich,
	samba-technical, Steve French, Trond Myklebust, v9fs-developer

On Thu, Apr 27, 2017 at 12:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 26-04-17 16:52:36, Ross Zwisler wrote:
>> On Wed, Apr 26, 2017 at 10:52:35AM +0200, Jan Kara wrote:
>> > On Tue 25-04-17 16:59:36, Ross Zwisler wrote:
>> > > On Tue, Apr 25, 2017 at 01:10:43PM +0200, Jan Kara wrote:
>> > > <>
>> > > > Hum, but now thinking more about it I have hard time figuring out why write
>> > > > vs fault cannot actually still race:
>> > > >
>> > > > CPU1 - write(2)                         CPU2 - read fault
>> > > >
>> > > >                                         dax_iomap_pte_fault()
>> > > >                                           ->iomap_begin() - sees hole
>> > > > dax_iomap_rw()
>> > > >   iomap_apply()
>> > > >     ->iomap_begin - allocates blocks
>> > > >     dax_iomap_actor()
>> > > >       invalidate_inode_pages2_range()
>> > > >         - there's nothing to invalidate
>> > > >                                           grab_mapping_entry()
>> > > >                                           - we add zero page in the radix
>> > > >                                             tree & map it to page tables
>> > > >
>> > > > Similarly read vs write fault may end up racing in a wrong way and try to
>> > > > replace already existing exceptional entry with a hole page?
>> > >
>> > > Yep, this race seems real to me, too.  This seems very much like the issues
>> > > that exist when a thread is doing direct I/O.  One thread is doing I/O to an
>> > > intermediate buffer (page cache for direct I/O case, zero page for us), and
>> > > the other is going around it directly to media, and they can get out of sync.
>> > >
>> > > IIRC the direct I/O code looked something like:
>> > >
>> > > 1/ invalidate existing mappings
>> > > 2/ do direct I/O to media
>> > > 3/ invalidate mappings again, just in case.  Should be cheap if there weren't
>> > >    any conflicting faults.  This makes sure any new allocations we made are
>> > >    faulted in.
>> >
>> > Yeah, the problem is that people generally expect weird behavior when
>> > they mix direct and buffered IO (let alone mmap), but everyone expects
>> > standard read(2) and write(2) to be completely coherent with mmap(2).
>>
>> Yep, fair enough.
>>
>> > > I guess one option would be to replicate that logic in the DAX I/O path, or we
>> > > could try and enhance our locking so page faults can't race with I/O since
>> > > both can allocate blocks.
>> >
>> > In the abstract way, the problem is that we have radix tree (and page
>> > tables) cache block mapping information and the operation: "read block
>> > mapping information, store it in the radix tree" is not serialized in any
>> > way against other block allocations so the information we store can be out
>> > of date by the time we store it.
>> >
>> > One way to solve this would be to move ->iomap_begin call in the fault
>> > paths under entry lock although that would mean I have to redo how ext4
>> > handles DAX faults because with current code it would create lock inversion
>> > wrt transaction start.
>>
>> I don't think this alone is enough to save us.  The I/O path doesn't currently
>> take any DAX radix tree entry locks, so our race would just become:
>>
>> CPU1 - write(2)                               CPU2 - read fault
>>
>>                                       dax_iomap_pte_fault()
>>                                         grab_mapping_entry() // newly moved
>>                                         ->iomap_begin() - sees hole
>> dax_iomap_rw()
>>   iomap_apply()
>>     ->iomap_begin - allocates blocks
>>     dax_iomap_actor()
>>       invalidate_inode_pages2_range()
>>         - there's nothing to invalidate
>>                                         - we add zero page in the radix
>>                                           tree & map it to page tables
>>
>> In their current form I don't think we want to take DAX radix tree entry locks
>> in the I/O path because that would effectively serialize I/O over a given
>> radix tree entry. For a 2MiB entry, for example, all I/O to that 2MiB range
>> would be serialized.
>
> Note that invalidate_inode_pages2_range() will see the entry created by
> grab_mapping_entry() on CPU2 and block waiting for its lock, and this is
> exactly what stops the race. invalidate_inode_pages2_range() thus
> effectively makes sure there isn't any page fault in progress for the
> given range...
>
> Also note that writes to a file are serialized by i_rwsem anyway (and at
> least serialization of writes to the overlapping range is required by POSIX)
> so this doesn't add any more serialization than we already have.
>
>> > Another solution would be to grab i_mmap_sem for write when doing write
>> > fault of a page and similarly have it grabbed for writing when doing
>> > write(2). This would scale rather poorly but if we later replaced it with a
>> > range lock (Davidlohr has already posted a nice implementation of it) it
>> > won't be as bad. But I guess option 1) is better...
>>
>> The best idea I had for handling this sounds similar, which would be to
>> convert the radix tree locks to essentially be reader/writer locks.  I/O and
>> faults that don't modify the block mapping could just take read-level locks,
>> and could all run concurrently.  I/O or faults that modify a block mapping
>> would take a write lock, and serialize with other writers and readers.
>
> Well, this would be difficult to implement inside the radix tree (not
> enough bits in the entry) so you'd have to go for some external locking
> primitive anyway. And if you do that, the read-write range lock Davidlohr
> has implemented is what you describe - well, we could also have a radix
> tree with rwsems, but I suspect the overhead of maintaining that would be
> too large. It would require a larger rewrite than reusing entry locks as I
> suggest above, though, and it isn't an obvious performance win for
> realistic workloads either, so I'd like to see some performance numbers
> before going that way. It likely improves the situation where processes
> race to fault the same page for which we already know the block mapping,
> but I'm not sure that translates into any measurable performance win for
> workloads on a DAX filesystem.

I'm also concerned about inventing new / fancy radix tree infrastructure
when we're already in the space of needing struct page for any
non-trivial usage of DAX. As Kirill's transparent-huge-page page cache
implementation matures I'd be interested in looking at a transition
path away from radix tree locking towards something that is shared with
the common-case page cache locking.


* Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads
  2017-05-01 22:38                             ` Ross Zwisler
@ 2017-05-04  9:12                               ` Jan Kara
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Kara @ 2017-05-04  9:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Andrew Morton, linux-kernel, Alexander Viro,
	Alexey Kuznetsov, Andrey Ryabinin, Anna Schumaker,
	Christoph Hellwig, Dan Williams, Darrick J. Wong,
	Eric Van Hensbergen, Jens Axboe, Johannes Weiner,
	Konrad Rzeszutek Wilk, Latchesar Ionkov, linux-cifs,
	linux-fsdevel, linux-mm, linux-nfs, linux-nvdimm, Matthew Wilcox,
	Ron Minnich, samba-technical, Steve French, Trond Myklebust,
	v9fs-developer

On Mon 01-05-17 16:38:55, Ross Zwisler wrote:
> > So for now I'm still more inclined to just stay with the radix tree lock as
> > is and just fix up the locking as I suggest and go for larger rewrite only
> > if we can demonstrate further performance wins.
> 
> Sounds good.
> 
> > WRT your second patch, if we go with the locking as I suggest, it is enough
> > to unmap the whole range after invalidate_inode_pages2() has cleared radix
> > tree entries (*) which will be much cheaper (for large writes) than doing
> > unmapping entry by entry.
> 
> I'm still not convinced that it is safe to do the unmap in a separate step.  I
> see your point about it being expensive to do a rmap walk to unmap each entry
> in __dax_invalidate_mapping_entry(), but I think we might need to because the
> unmap is part of the contract imposed by invalidate_inode_pages2_range() and
> invalidate_inode_pages2().  This exists in the header comment above each:
> 
>  * Any pages which are found to be mapped into pagetables are unmapped prior
>  * to invalidation.
> 
> If you look at the usage of invalidate_inode_pages2_range() in
> generic_file_direct_write() for example (which I realize we won't call for a
> DAX inode, but still), I think that it really does rely on the fact that
> invalidated pages are unmapped, right?  If it didn't, and hole pages were
> mapped, the hole pages could remain mapped while a direct I/O write allocated
> blocks and then wrote real data.
> 
> If we really want to unmap the entire range at once, maybe it would have to be
> done in invalidate_inode_pages2_range(), after the loop?  My hesitation about
> this is that we'd be leaking yet more DAX special casing up into the
> mm/truncate.c code.
> 
> Or am I missing something?

No, my thinking was to do the unmapping at the end of
invalidate_inode_pages2_range(). I agree it means more special-casing for
DAX in mm/truncate.c.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2017-05-04 14:44 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-14 14:07 [PATCH 0/4] Properly invalidate data in the cleancache Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-18 19:38   ` Ross Zwisler
2017-04-19 15:11     ` Andrey Ryabinin
2017-04-19 19:28       ` Ross Zwisler
2017-04-20 14:35         ` Jan Kara
2017-04-20 14:44           ` Jan Kara
2017-04-20 19:14             ` Ross Zwisler
2017-04-21  3:44               ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Ross Zwisler
2017-04-21  3:44                 ` [PATCH 2/2] dax: fix data corruption due to stale mmap reads Ross Zwisler
2017-04-25 11:10                   ` Jan Kara
2017-04-25 22:59                     ` Ross Zwisler
2017-04-26  8:52                       ` Jan Kara
2017-04-26 22:52                         ` Ross Zwisler
2017-04-27  7:26                           ` Jan Kara
2017-05-01 22:38                             ` Ross Zwisler
2017-05-04  9:12                               ` Jan Kara
2017-05-01 22:59                             ` Dan Williams
2017-04-25 10:10                 ` [PATCH 1/2] dax: prevent invalidation of mapped DAX entries Jan Kara
2017-05-01 16:54                   ` Ross Zwisler
2017-04-18 22:46   ` [PATCH 1/4] fs: fix data invalidation in the cleancache during direct IO Andrew Morton
2017-04-19 15:15     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-18 18:51   ` Nikolay Borisov
2017-04-19 13:22     ` Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-14 14:07 ` [PATCH 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-18 15:24 ` [PATCH 0/4] Properly invalidate data in the cleancache Konrad Rzeszutek Wilk
2017-04-24 16:41 ` [PATCH v2 " Andrey Ryabinin
2017-04-24 16:41   ` [PATCH v2 1/4] fs: fix data invalidation in the cleancache during direct IO Andrey Ryabinin
2017-04-25  8:25     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev() Andrey Ryabinin
2017-04-25  8:34     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 3/4] mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty Andrey Ryabinin
2017-04-25  8:37     ` Jan Kara
2017-04-24 16:41   ` [PATCH v2 4/4] mm/truncate: avoid pointless cleancache_invalidate_inode() calls Andrey Ryabinin
2017-04-25  8:41     ` Jan Kara
