* [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
@ 2013-12-07 10:55 Zheng Liu
  2013-12-16  9:37 ` Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-12-07 10:55 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Christoph Hellwig, Dmitry Monakhov, Dave Chinner,
	Alexander Viro, Zheng Liu

From: Zheng Liu <wenqing.lz@taobao.com>

Currently we check i_size in the buffered read path after we know the page
is uptodate.  But this can return zero-filled pages to userspace when the
read races with append dio writes.  We can use the following code snippet
to reproduce the problem on an ext2/3/4 filesystem.

code snippet:
  #define _GNU_SOURCE

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <memory.h>

  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <errno.h>

  #include <pthread.h>

  #define BUF_ALIGN	1024

  struct writer_data {
  	int fd;
  	size_t blksize;
  	char *buf;
  };

  static void *writer(void *arg)
  {
  	struct writer_data *data = (struct writer_data *)arg;
  	int ret;

  	ret = write(data->fd, data->buf, data->blksize);
  	if (ret < 0)
  		fprintf(stderr, "write file failed: %s\n", strerror(errno));

  	return NULL;
  }

  int main(int argc, char *argv[])
  {
  	pthread_t tid;
  	struct writer_data wdata;
  	size_t max_blocks = 10 * 1024;
  	size_t blksize = 1 * 1024 * 1024;
  	char *rbuf, *wbuf;
  	int readfd, writefd;
  	int i, j;

  	if (argc < 2) {
  		fprintf(stderr, "usage: %s [filename]\n", argv[0]);
  		exit(1);
  	}

  	writefd = open(argv[1], O_CREAT|O_DIRECT|O_WRONLY|O_APPEND|O_TRUNC, S_IRWXU);
  	if (writefd < 0) {
  		fprintf(stderr, "failed to open wfile: %s\n", strerror(errno));
  		exit(1);
  	}
  	readfd = open(argv[1], O_DIRECT|O_RDONLY, S_IRWXU);
  	if (readfd < 0) {
  		fprintf(stderr, "failed to open rfile: %s\n", strerror(errno));
  		exit(1);
  	}

  	if (posix_memalign((void **)&wbuf, BUF_ALIGN, blksize)) {
  		fprintf(stderr, "failed to alloc memory: %s\n", strerror(errno));
  		exit(1);
  	}

  	if (posix_memalign((void **)&rbuf, 4096, blksize)) {
  		fprintf(stderr, "failed to alloc memory: %s\n", strerror(errno));
  		exit(1);
  	}

  	memset(wbuf, 'a', blksize);

  	wdata.fd = writefd;
  	wdata.blksize = blksize;
  	wdata.buf = wbuf;

  	for (i = 0; i < max_blocks; i++) {
  		void *retval;
  		int ret;

  		ret = pthread_create(&tid, NULL, writer, &wdata);
  		if (ret) {
  			fprintf(stderr, "create thread failed: %s\n", strerror(errno));
  			exit(1);
  		}

  		memset(rbuf, 'b', blksize);
  		/* poll until the writer has extended the file past this block */
  		do {
  			ret = pread(readfd, rbuf, blksize, i * blksize);
  		} while (ret <= 0);

  		if (ret < 0) {
  			fprintf(stderr, "read file failed: %s\n", strerror(errno));
  			exit(1);
  		}

  		if (pthread_join(tid, &retval)) {
  			fprintf(stderr, "pthread join failed: %s\n", strerror(errno));
  			exit(1);
  		}

  		if (ret >= 0) {
  			for (j = 0; j < ret; j++) {
  				if (rbuf[j] != 'a') {
  				fprintf(stderr, "encounter an error: offset %d\n",
  					i);
  					goto err;
  				}
  			}
  		}
  	}

  err:
  	free(wbuf);
  	free(rbuf);

  	return 0;
  }

build & run:
  $ gcc code.c -o code -lpthread # build
  $ ./code ${filename} # run

As expected, we should read either nothing or data filled with 'a'.  But
instead we read zero-filled data.  I took a closer look at the code and it
seems that there is a bug in the vfs.  Let me describe my findings here.

  reader					writer
                                                generic_file_aio_write()
                                                ->__generic_file_aio_write()
                                                  ->generic_file_direct_write()
  generic_file_aio_read()
  ->do_generic_file_read()
    [fallback to buffered read]

    ->find_get_page()
    ->page_cache_sync_readahead()
    ->find_get_page()
    [in the find_page label, we can't find the
     page either before or after calling
     page_cache_sync_readahead(), so we go to
     the no_cached_page label]

    ->page_cache_alloc_cold()
    ->add_to_page_cache_lru()
    [in the no_cached_page label, we allocate a
     page and go to the readpage label]

    ->aops->readpage()
    [in the readpage label, the readpage()
     callback is called and mpage_readpage()
     leaves the page zero-filled (e.g. on
     ext3/4), and we go to the page_ok label]

                                                  ->a_ops->direct_IO()
                                                  ->i_size_write()
                                                  [we enlarge the i_size]

    Here we check i_size
    [in the page_ok label, we check i_size,
     but it has already been enlarged.  Thus
     we pass the check and return a
     zero-filled page]

This commit first trims desc->count to avoid reading too much data when
(*ppos + desc->count) >= i_size, and then returns directly if desc->count
== 0.  This fixes the problem above.  It also covers the case where the
reader does buffered reads while another task does append dio writes.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
changelog:
v3: 
 * Rebase against 3.13-rc3
 * Add 'Reviewed-by' tag because Jan has reviewed this patch

						- Zheng
 mm/filemap.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..622d49ac2a24 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1108,18 +1108,25 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 	pgoff_t prev_index;
 	unsigned long offset;      /* offset into pagecache page */
 	unsigned int prev_offset;
+	loff_t isize;
 	int error;
 
+	/* trim desc->count to avoid exposing stale data to userspace */
+	isize = i_size_read(inode);
+	if (*ppos + desc->count >= isize)
+		desc->count = isize - *ppos;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
 	prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
 	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	if (desc->count == 0)
+		goto out;
+
 	for (;;) {
 		struct page *page;
 		pgoff_t end_index;
-		loff_t isize;
 		unsigned long nr, ret;
 
 		cond_resched();
-- 
1.7.9.7


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
@ 2013-12-16  9:37 ` Steven Whitehouse
  2013-12-16 10:01   ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2013-12-16  9:37 UTC (permalink / raw)
  To: Zheng Liu
  Cc: linux-fsdevel, Jan Kara, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

Hi,

On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently we check i_size in buffered read path after we know the page
> is update.  But it could return some zero-filled pages to the userspace
> when we do some append dio writes.  We could use the following code
> snippet to reproduce this problem in a ext2/3/4 filesystem.
> 
If the page is not uptodate, then neither (potentially) is i_size, since
the underlying fs has not had a chance to get its own locks and update
the inode size.

I suspect that the correct fix would be to implement ->launder_page to
avoid the race that you've identified here, if I've understood it
correctly,

Steve.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-16  9:37 ` Steven Whitehouse
@ 2013-12-16 10:01   ` Jan Kara
  2013-12-17  9:43     ` Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-12-16 10:01 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Zheng Liu, linux-fsdevel, Jan Kara, Christoph Hellwig,
	Dmitry Monakhov, Dave Chinner, Alexander Viro, Zheng Liu

On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> Hi,
> 
> On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently we check i_size in buffered read path after we know the page
> > is update.  But it could return some zero-filled pages to the userspace
> > when we do some append dio writes.  We could use the following code
> > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > 
> If the page is not uptodate, then neither (potentially) is i_size, since
> the underlying fs has not had a chance to get its own locks and update
> the inode size.
  Hum, this isn't really about page being uptodate or not, is it? It is
more about the fact that gfs2 updates i_size only from gfs2_readpage()
function when obtaining inode lock. Or do you know about other filesystem
having the same problem as gfs2? E.g. ocfs2 updates i_size from
ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
do something similar?

> I suspect that the correct fix would be to implement ->launder_page to
> avoid the race that you've identified here, if I've understood it
> correctly,
  I don't understand how that would work. Can you elaborate a bit? Here the
problem is i_size gets extended by DIO write during buffered read (which is
a fallback from DIO read) so we return zero-filled page instead of either
data filled page or short read. I don't see where launder_page() would come
into the picture...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-16 10:01   ` Jan Kara
@ 2013-12-17  9:43     ` Steven Whitehouse
  2013-12-17 11:16       ` Zheng Liu
  2013-12-17 14:01       ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
  0 siblings, 2 replies; 21+ messages in thread
From: Steven Whitehouse @ 2013-12-17  9:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zheng Liu, linux-fsdevel, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

Hi,

On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > Currently we check i_size in buffered read path after we know the page
> > > is update.  But it could return some zero-filled pages to the userspace
> > > when we do some append dio writes.  We could use the following code
> > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > 
> > If the page is not uptodate, then neither (potentially) is i_size, since
> > the underlying fs has not had a chance to get its own locks and update
> > the inode size.
>   Hum, this isn't really about page being uptodate or not, is it? It is
> more about the fact that gfs2 updates i_size only from gfs2_readpage()
> function when obtaining inode lock. Or do you know about other filesystem
> having the same problem as gfs2? E.g. ocfs2 updates i_size from
> ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> do something similar?
> 
Well we could do that, but I don't think it makes sense really. The
point of having the page cache is to put it "in front" of the expensive
operations where possible (such as getting cluster locks). I don't think
it makes sense to grab and drop a cluster lock on every call to
generic_file_aio_read() when, for cached pages, we would not even need
to call ->readpage.

The problem appears to be that somehow we are getting a page returned
when it should not be. So I suspect that the correct place to catch that
is in ->readpage(s)

> > I suspect that the correct fix would be to implement ->launder_page to
> > avoid the race that you've identified here, if I've understood it
> > correctly,
>   I don't understand how that would work. Can you elaborate a bit? Here the
> problem is i_size gets extended by DIO write during buffered read (which is
> a fallback from DIO read) so we return zero-filled page instead of either
> data filled page or short read. I don't see where launder_page() would come
> into the picture...
> 
> 								Honza

Ah, sorry. I was thinking the other way around wrt to
read/write :( However, I still don't think that generic_file_aio_read()
is the right place to fix this. I note that XFS seems to pass the test
and it also uses mpage_readpage and mpage_readpages as well as
generic_file_aio_read() so maybe that is a clue as to where the answer
lies. GFS2 seems to fail the test in the same way as ext3 at the moment.

The other question that we've not really answered is why it is useful to
mix buffered and direct i/o to the same file at the same time anyway.
While I agree that it ought to work, I'm not convinced that its
enormously useful, which is maybe why it has not been spotted before,

Steve.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17  9:43     ` Steven Whitehouse
@ 2013-12-17 11:16       ` Zheng Liu
  2013-12-17 12:17         ` Steven Whitehouse
  2013-12-17 14:01       ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
  1 sibling, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-12-17 11:16 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

Hi Steven,

On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > Currently we check i_size in buffered read path after we know the page
> > > > is update.  But it could return some zero-filled pages to the userspace
> > > > when we do some append dio writes.  We could use the following code
> > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > 
> > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > the underlying fs has not had a chance to get its own locks and update
> > > the inode size.
> >   Hum, this isn't really about page being uptodate or not, is it? It is
> > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > function when obtaining inode lock. Or do you know about other filesystem
> > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > do something similar?
> > 
> Well we could do that, but I don't think it makes sense really. The
> point of having the page cache is to put it "in front" of the expensive
> operations where possible (such as getting cluster locks). I don't think
> it makes sense to grab and drop a cluster lock on every call to
> generic_file_aio_read() when, for cached pages, we would not even need
> to call ->readpage.
> 
> The problem appears to be that somehow we are getting a page returned
> when it should not be. So I suspect that the correct place to catch that
> is in ->readpage(s)
> 
> > > I suspect that the correct fix would be to implement ->launder_page to
> > > avoid the race that you've identified here, if I've understood it
> > > correctly,
> >   I don't understand how that would work. Can you elaborate a bit? Here the
> > problem is i_size gets extended by DIO write during buffered read (which is
> > a fallback from DIO read) so we return zero-filled page instead of either
> > data filled page or short read. I don't see where launder_page() would come
> > into the picture...
> > 
> > 								Honza
> 
> Ah, sorry. I was thinking the other way around wrt to
> read/write :( However, I still don't think that generic_file_aio_read()
> is the right place to fix this. I note that XFS seems to pass the test
> and it also uses mpage_readpage and mpage_readpages as well as
> generic_file_aio_read() so maybe that is a clue as to where the answer
> lies. GFS2 seems to fail the test in the same way as ext3 at the moment.

Yes, xfs can pass the test under direct IO.  I suspect that the reason
is that xfs uses its ilock/iolock.  But it doesn't mean that all file
systems need to fix this issue by themselves, because the root cause is
in the vfs layer.

> 
> The other question that we've not really answered is why it is useful to
> mix buffered and direct i/o to the same file at the same time anyway.
> While I agree that it ought to work, I'm not convinced that its
> enormously useful, which is maybe why it has not been spotted before,

Yes, mixing buffered reads and direct writes seems to be useless.  But in
our production system, the issue appears under direct read/write.  The
application has a reader and a writer.  The reader should read either
nothing or the content that has been written by the writer.  But now the
reader gets a zero-filled page.  Hence the application needs to work
around this issue using a mutex or some other flag, which doesn't make
sense to the userspace programmer.

TBH, we found this issue two years ago but didn't take care of it at the
time.  So currently the application just works around it as I said above,
and it also issues fsync(2) to flush the dirty pages in order to make sure
the data has reached the disk.  Obviously, the performance of this
application is impacted by this.  Now we know that we don't need to flush
the dirty pages, but I think we'd better fix the underlying problem.
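
(For illustration only, a minimal sketch of that kind of userspace
workaround, assuming a pthread-based reader/writer similar to the
reproducer above; the publish/wait helpers are hypothetical, not the
actual application code:)

  #include <pthread.h>
  #include <unistd.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
  static long blocks_done;	/* blocks known to be safely on disk */

  /* writer side: publish a block only after write() + fsync() */
  static void publish_block(int fd)
  {
  	/* ... write() one block with O_DIRECT|O_APPEND here ... */
  	fsync(fd);		/* the costly flush mentioned above */
  	pthread_mutex_lock(&lock);
  	blocks_done++;
  	pthread_cond_broadcast(&cond);
  	pthread_mutex_unlock(&lock);
  }

  /* reader side: wait until a block is published before pread()ing it */
  static void wait_for_block(long blkno)
  {
  	pthread_mutex_lock(&lock);
  	while (blocks_done <= blkno)
  		pthread_cond_wait(&cond, &lock);
  	pthread_mutex_unlock(&lock);
  }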

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17 11:16       ` Zheng Liu
@ 2013-12-17 12:17         ` Steven Whitehouse
  2013-12-17 16:41           ` Zheng Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2013-12-17 12:17 UTC (permalink / raw)
  To: Zheng Liu
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

Hi,

On Tue, 2013-12-17 at 19:16 +0800, Zheng Liu wrote:
> Hi Steven,
> 
> On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > Hi,
> > > > 
> > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > 
> > > > > Currently we check i_size in buffered read path after we know the page
> > > > > is update.  But it could return some zero-filled pages to the userspace
> > > > > when we do some append dio writes.  We could use the following code
> > > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > > 
> > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > the underlying fs has not had a chance to get its own locks and update
> > > > the inode size.
> > >   Hum, this isn't really about page being uptodate or not, is it? It is
> > > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > > function when obtaining inode lock. Or do you know about other filesystem
> > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > > do something similar?
> > > 
> > Well we could do that, but I don't think it makes sense really. The
> > point of having the page cache is to put it "in front" of the expensive
> > operations where possible (such as getting cluster locks). I don't think
> > it makes sense to grab and drop a cluster lock on every call to
> > generic_file_aio_read() when, for cached pages, we would not even need
> > to call ->readpage.
> > 
> > The problem appears to be that somehow we are getting a page returned
> > when it should not be. So I suspect that the correct place to catch that
> > is in ->readpage(s)
> > 
> > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > avoid the race that you've identified here, if I've understood it
> > > > correctly,
> > >   I don't understand how that would work. Can you elaborate a bit? Here the
> > > problem is i_size gets extended by DIO write during buffered read (which is
> > > a fallback from DIO read) so we return zero-filled page instead of either
> > > data filled page or short read. I don't see where launder_page() would come
> > > into the picture...
> > > 
> > > 								Honza
> > 
> > Ah, sorry. I was thinking the other way around wrt to
> > read/write :( However, I still don't think that generic_file_aio_read()
> > is the right place to fix this. I note that XFS seems to pass the test
> > and it also uses mpage_readpage and mpage_readpages as well as
> > generic_file_aio_read() so maybe that is a clue as to where the answer
> > lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
> 
> Yes, xfs can pass the test under direct IO.  I suspect that the reason
> is that xfs uses ilock/iolock.  But it doesn't means that all file
> systems need to fix this issue by themselves because the root cause is
> in vfs layer.
>
I'm not convinced that this is the case though. The VFS is asking the fs
to provide a page via ->readpage(s) and therefore the fs should not be
providing a page (presumably marked uptodate in this case) where one
does not already exist, and if it does exist, then the content of the
page should be correct.

I don't see how this test can be done at a point before the fs has a
chance to get its own locks, and thus ensuring that the inode size is
actually correct.

However, I'd like to understand what is going on at ->readpage level,
and thus why we get this page generated incorrectly, and what actual i/o
gets generated (or not) in the problematic case.

> > 
> > The other question that we've not really answered is why it is useful to
> > mix buffered and direct i/o to the same file at the same time anyway.
> > While I agree that it ought to work, I'm not convinced that its
> > enormously useful, which is maybe why it has not been spotted before,
> 
> Yes, mixed buffered read and direct write seems to be useless.  But in
> our product system, the issue will appear under direct read/write.  The
> application has a reader and a writer.  The reader should read nothing
> or the content that has been written by writer.  But now the reader gets
> the zero-filled page.  Hence the application needs to prevent this issue
> using a mutex or other flag.  It doesn't make sense to the userspace
> programmer.
> 
I'm not sure I fully understand what you are saying here... if you see
the same issue using only O_DIRECT reads and writes, how are you getting
that?

> TBH, we have found this issue two year ago but we don't take care of it.
> So currently the application just solves it as I said above.  But it will
> issue a flush calling fsync(2) to flush the dirty page in order to make
> sure the data has been flushed into the disk.  Obviously, the performance
> of this application is impacted by this issue.  Now we have known that we
> don't need to flush the dirty page but I think we'd better fix it.
> 
> Regards,
>                                                 - Zheng

Calling fsync is a separate issue though, since that is required anyway
if you want to be sure that the data has actually reached the disk,

Steve.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17  9:43     ` Steven Whitehouse
  2013-12-17 11:16       ` Zheng Liu
@ 2013-12-17 14:01       ` Jan Kara
  2013-12-17 15:32         ` Steven Whitehouse
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Kara @ 2013-12-17 14:01 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Jan Kara, Zheng Liu, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Dave Chinner, Alexander Viro, Zheng Liu

  Hi,

On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > Currently we check i_size in buffered read path after we know the page
> > > > is update.  But it could return some zero-filled pages to the userspace
> > > > when we do some append dio writes.  We could use the following code
> > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > 
> > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > the underlying fs has not had a chance to get its own locks and update
> > > the inode size.
> >   Hum, this isn't really about page being uptodate or not, is it? It is
> > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > function when obtaining inode lock. Or do you know about other filesystem
> > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > do something similar?
> > 
> Well we could do that, but I don't think it makes sense really. The
> point of having the page cache is to put it "in front" of the expensive
> operations where possible (such as getting cluster locks). I don't think
> it makes sense to grab and drop a cluster lock on every call to
> generic_file_aio_read() when, for cached pages, we would not even need
> to call ->readpage.
  Well, but in a cluster filesystem I fail to see how you want to
synchronize a read from the pagecache on one node with a truncate / write
on another node. Either the modification has to be broadcast to all nodes
having the file cached, or the read has to check whether the data cached in
the pagecache is still valid. How is gfs2 solving this?

> The problem appears to be that somehow we are getting a page returned
> when it should not be. So I suspect that the correct place to catch that
> is in ->readpage(s)
  There are two things which aren't ideal:
1) DIO writes and buffered reads can race resulting in reads returning
zero-filled pages. Fixing this corner case would slow down the common case
so we decided to put this race in the category of "don't do it when it
hurts".
2) DIO read falls back to buffered read on failure - this is there to
simplify life of filesystems which don't support DIO at all or do not
handle some special cases like DIO from holes, not block-aligned DIO etc.

Combination of these two leads to the problem that DIO read can return
zero-filled page when racing with DIO write. With enough luck this can
even lead to exposure of uninitialized data:
  DIO WRITE           BUFFERED READ
- allocates blocks
                      - sees blocks mapped so it reads data
- submits data write
- waits for IO completion
- extends i_size
                      - sees large i_size so hands over data to userspace

  Now without proper serialization of reads & writes this is hard to avoid
(we plan range locks for that but that's still blocked by some mm changes).
But checking i_size before reading data at least makes the race *much*
less serious (you have to have two threads modifying i_size in
parallel to read running and since these threads are synchronized on
i_mutex, some races become impossible, the ones returning 0s are still
possible but harder to trigger).

Now we could check i_size in ->readpage / ->readpages but conceptually that
seems like the wrong thing to do since readpage only cares about buffers
being mapped. OTOH generic_file_aio_read() et al already handle i_size
checks. And GFS2 seems to be the only fs that has problems with stale
i_size (OCFS2 and NFS both revalidate it in their ->aio_read callbacks).
So I'd be happier if we could handle this inside GFS2...
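
(For illustration, a hypothetical sketch of that revalidate-then-delegate
pattern against the 3.13-era API; the myfs_* names are made up and this is
not the actual ocfs2 or NFS code:)

  /* Hypothetical: refresh i_size under the filesystem's own lock before
   * handing the read off to the generic path. */
  static ssize_t myfs_file_aio_read(struct kiocb *iocb,
  				  const struct iovec *iov,
  				  unsigned long nr_segs, loff_t pos)
  {
  	struct inode *inode = file_inode(iocb->ki_filp);
  	int err;

  	/* hypothetical helper: take the cluster/inode lock, pull the
  	 * authoritative size into the VFS inode, drop the lock again */
  	err = myfs_revalidate_size(inode);
  	if (err)
  		return err;

  	return generic_file_aio_read(iocb, iov, nr_segs, pos);
  }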

> > > I suspect that the correct fix would be to implement ->launder_page to
> > > avoid the race that you've identified here, if I've understood it
> > > correctly,
> >   I don't understand how that would work. Can you elaborate a bit? Here the
> > problem is i_size gets extended by DIO write during buffered read (which is
> > a fallback from DIO read) so we return zero-filled page instead of either
> > data filled page or short read. I don't see where launder_page() would come
> > into the picture...
> > 
> > 								Honza
> 
> Ah, sorry. I was thinking the other way around wrt to
> read/write :( However, I still don't think that generic_file_aio_read()
> is the right place to fix this. I note that XFS seems to pass the test
> and it also uses mpage_readpage and mpage_readpages as well as
> generic_file_aio_read() so maybe that is a clue as to where the answer
> lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
  It is a good question why XFS doesn't have the problem. Looking into
the code, XFS doesn't use i_mutex but rather its private rw-sem for
serialization (XFS_IOLOCK). Even reads hold the semaphore in shared mode,
and as the test program is written, DIO reads will hold it in exclusive
mode for a while as well. I guess that provides enough serialization that
the test program doesn't hit the race (although the race still seems
possible).

> The other question that we've not really answered is why it is useful to
> mix buffered and direct i/o to the same file at the same time anyway.
> While I agree that it ought to work, I'm not convinced that its
> enormously useful, which is maybe why it has not been spotted before,
  As I wrote above, mixing buffered and direct IO isn't really supported
(although we should fix the data exposure race). But DIO read vs DIO write
sounds sensible enough to support. And when DIO read falls back to buffered
read, we have problems...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17 14:01       ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
@ 2013-12-17 15:32         ` Steven Whitehouse
  2013-12-17 16:39           ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2013-12-17 15:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Zheng Liu, linux-fsdevel, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

Hi,

On Tue, 2013-12-17 at 15:01 +0100, Jan Kara wrote:
> Hi,
> 
> On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > Hi,
> > > > 
> > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > 
> > > > > Currently we check i_size in buffered read path after we know the page
> > > > > is update.  But it could return some zero-filled pages to the userspace
> > > > > when we do some append dio writes.  We could use the following code
> > > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > > 
> > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > the underlying fs has not had a chance to get its own locks and update
> > > > the inode size.
> > >   Hum, this isn't really about page being uptodate or not, is it? It is
> > > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > > function when obtaining inode lock. Or do you know about other filesystem
> > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > > do something similar?
> > > 
> > Well we could do that, but I don't think it makes sense really. The
> > point of having the page cache is to put it "in front" of the expensive
> > operations where possible (such as getting cluster locks). I don't think
> > it makes sense to grab and drop a cluster lock on every call to
> > generic_file_aio_read() when, for cached pages, we would not even need
> > to call ->readpage.
>   Well, but in a cluster filesystem I fail to see how do you want to
> synchronize read from pagecache on one node with truncate / write on another
> node. Either the modification has to broadcast to all nodes having the file
> cached or read has to check whether the data cached in pagecache is still
> valid. How is gfs2 solving this?
> 

The same way that every other page cache based filesystem does it. If
there is a cached page that is uptodate, then it can be copied to the
user, and the page lock is used to ensure that the state doesn't change
under us. If there is no cached page, then the vfs has to ask the
filesystem to provide one, and then the cluster locks (glocks) are
taken. Those locks are then cached until another node, or memory
pressure results in them being dropped.

GFS2's internal locking via the glock layer will invalidate all cached
pages before it drops the glock to a mode other than shared or
exclusive. It will also write back all dirty data when dropping an
exclusive lock. For dio we use the "deferred" mode which is basically a
shared mode that is incompatible with the normal shared mode - it allows
shared caching of metadata, but no caching of data. GFS2 reverts to
buffered I/O in order to deal with cases where we need to allocate
blocks, including file extension.

> > The problem appears to be that somehow we are getting a page returned
> > when it should not be. So I suspect that the correct place to catch that
> > is in ->readpage(s)
>   There are two things which aren't ideal:
> 1) DIO writes and buffered reads can race resulting in reads returning
> zero-filled pages. Fixing this corner case would slow down the common case
> so we decided to put this race in the cathegory of "don't do it when it
> hurts".

Well if I alter the test program to use buffered reads, then it takes
much longer to hit the race. With the original O_DIRECT reads, then it
is not a great surprise that not all the write i/o has actually hit disk
at the time of the read, since as you say there is no synchronization
between the reads and writes.

So that makes sense to me.
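
(For reference, the alteration described above amounts to something like
the following change in the reproducer; I'm assuming only the read-side
open changes and the writer keeps O_DIRECT|O_APPEND:)

  	/* buffered-read variant: drop O_DIRECT from the read-side open */
  	readfd = open(argv[1], O_RDONLY);
  	if (readfd < 0) {
  		fprintf(stderr, "failed to open rfile: %s\n", strerror(errno));
  		exit(1);
  	}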

> 2) DIO read falls back to buffered read on failure - this is there to
> simplify life of filesystems which don't support DIO at all or do not
> handle some special cases like DIO from holes, not block-aligned DIO etc.
> 
> Combination of these two leads to the problem that DIO read can return
> zero-filled page when racing with DIO write. With enough luck this can
> even lead to exposure of uninitialized data:
>   DIO WRITE           BUFFERED READ
> - allocates blocks
>                       - sees blocks mapped so it reads data
> - submits data write
> - waits for IO completion
> - extends i_size
>                       - sees large i_size so hands over data to userspace
> 
>   Now without proper serialization of reads & writes this is hard to avoid
> (we plan range locks for that but that's still blocked by some mm changes).
> But checking i_size before reading data at least makes the race *much*
> less serious (you have to have two threads modifying i_size in
> parallel to read running and since these threads are synchronized on
> i_mutex, some races become impossible, the ones returning 0s are still
> possible but harder to trigger).
> 
Well I agree with pretty much all of that except that part about adding
in the extra i_size test. It may make the race harder to hit, but it
doesn't eliminate it. Instead it means that we'll have to add a bunch of
extra code into the fast path just to ensure that it is uptodate.

The reason that the i_size check that is already in
do_generic_file_read() is ok is that there is a locked page at the time
of the i_size_read call, as per the comment:

                /*
                 * i_size must be checked after we know the page is Uptodate.
                 *
                 * Checking i_size after the check allows us to calculate
                 * the correct value for "nr", which means the zero-filled
                 * part of the page is not copied back to userspace (unless
                 * another truncate extends the file - this is desired though).
                 */


> Now we could check i_size in ->readpage / ->readpages but conceptually that
> seems as a wrong thing to do since readpage only cares about buffers being
> mapped. OTOH generic_file_aio_read() et al already handles i_size checks.
> And GFS2 seems to be the only fs that has problems with stale i_size (OCFS2
> and NFS both revalidate it in their \.aio_read callbacks). So I'd be
> happier if we could handle this inside GFS2...
> 
I would argue that GFS2 is doing this the right way in this case though
- it makes no sense to put the i_size check into the fast path, when it
is only required in the case that there is no cached page (i.e the slow
path). In addition, putting it in the slow path (where the fs can take
its own locks) also fixes the race, rather than just making it more
difficult to hit.

So far as I can tell, the VFS code was originally intended to work in
exactly this way; however, it is not that obvious that this is the case,
so maybe better documentation is required here.

If we did add in some GFS2 code to update i_size ahead of
generic_file_aio_read() then we can only do that by grabbing and
dropping the glock. That is a relatively expensive operation, and still
doesn't fix the race as we must drop the glock before calling
generic_file_aio_read() since we must not hold glocks over the copy to
user stage.

Incidentally, for write calls, we do exactly that to update the i_size,
but only in the case that O_APPEND has been used, since we don't need to
do it until we get to write_begin/end otherwise.

I'm going to do a bit more investigation though, as I've not yet managed
to convince myself of exactly where the problem lies.

> > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > avoid the race that you've identified here, if I've understood it
> > > > correctly,
> > >   I don't understand how that would work. Can you elaborate a bit? Here the
> > > problem is i_size gets extended by DIO write during buffered read (which is
> > > a fallback from DIO read) so we return zero-filled page instead of either
> > > data filled page or short read. I don't see where launder_page() would come
> > > into the picture...
> > > 
> > > 								Honza
> > 
> > Ah, sorry. I was thinking the other way around wrt to
> > read/write :( However, I still don't think that generic_file_aio_read()
> > is the right place to fix this. I note that XFS seems to pass the test
> > and it also uses mpage_readpage and mpage_readpages as well as
> > generic_file_aio_read() so maybe that is a clue as to where the answer
> > lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
>   It is a good question why XFS doesn't have the problem. Looking into
> the code, XFS doesn't use i_mutex but rather it's private rw-sem for
> serialization (XFS_IOLOCK). Even reads hold the semaphore in shared mode
> and as the test program is written, DIO reads will hold it in exclusive
> mode for a while as well. A guess that provides enough serialization that
> the test program doesn't hit the race (although the race still seems
> possible).
> 
Yes, that sounds quite likely.

> > The other question that we've not really answered is why it is useful to
> > mix buffered and direct i/o to the same file at the same time anyway.
> > While I agree that it ought to work, I'm not convinced that its
> > enormously useful, which is maybe why it has not been spotted before,
>   As I wrote above, mixing buffered and direct IO isn't really supported
> (although we should fix the data exposure race). But DIO read vs DIO write
> sounds sensible enough to support. And when DIO read falls back to buffered
> read, we have problems...
> 
> 								Honza

Yes, agreed. I was really asking just to see if there was a use case
where this mixed use does make sense, as I couldn't immediately see why
someone would want to do that in the first place,

Steve.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17 15:32         ` Steven Whitehouse
@ 2013-12-17 16:39           ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2013-12-17 16:39 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Jan Kara, Zheng Liu, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Dave Chinner, Alexander Viro, Zheng Liu

  Hi,

On Tue 17-12-13 15:32:38, Steven Whitehouse wrote:
> On Tue, 2013-12-17 at 15:01 +0100, Jan Kara wrote:
> > On Tue 17-12-13 09:43:42, Steven Whitehouse wrote:
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > > 
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > > 
> > > > > > Currently we check i_size in buffered read path after we know the page
> > > > > > is update.  But it could return some zero-filled pages to the userspace
> > > > > > when we do some append dio writes.  We could use the following code
> > > > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > > > 
> > > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > > the underlying fs has not had a chance to get its own locks and update
> > > > > the inode size.
> > > >   Hum, this isn't really about page being uptodate or not, is it? It is
> > > > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > > > function when obtaining inode lock. Or do you know about other filesystem
> > > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > > > do something similar?
> > > > 
> > > Well we could do that, but I don't think it makes sense really. The
> > > point of having the page cache is to put it "in front" of the expensive
> > > operations where possible (such as getting cluster locks). I don't think
> > > it makes sense to grab and drop a cluster lock on every call to
> > > generic_file_aio_read() when, for cached pages, we would not even need
> > > to call ->readpage.
> >   Well, but in a cluster filesystem I fail to see how do you want to
> > synchronize read from pagecache on one node with truncate / write on another
> > node. Either the modification has to broadcast to all nodes having the file
> > cached or read has to check whether the data cached in pagecache is still
> > valid. How is gfs2 solving this?
> > 
> 
> The same way that every other page cache based filesystem does it. If
> there is a cached page that is uptodate, then it can be copied to the
> user, and the page lock is used to ensure that the state doesn't change
> under us.
  If a page is uptodate, we don't ever take the page lock (see
do_generic_file_read())...

> If there is no cached page, then the vfs has to ask the
> filesystem to provide one, and then the cluster locks (glocks) are
> taken. Those locks are then cached until another node, or memory
> pressure results in them being dropped.
> 
> GFS2's internal locking via the glock layer will invalidate all cached
> pages before it drops the glock to a mode other than shared or
> exclusive. It will also write back all dirty data when dropping an
> exclusive lock. For dio we use the "deferred" mode which is basically a
> shared mode that is incompatible with the normal shared mode - it allows
> shared caching of metadata, but no caching of data. GFS2 reverts to
> buffered I/O in order to deal with cases where we need to allocate
> blocks, including file extension.
  OK, thanks for explanation. Now I see how you avoid the races I was
thinking about.

> > > The problem appears to be that somehow we are getting a page returned
> > > when it should not be. So I suspect that the correct place to catch that
> > > is in ->readpage(s)
> >   There are two things which aren't ideal:
> > 1) DIO writes and buffered reads can race resulting in reads returning
> > zero-filled pages. Fixing this corner case would slow down the common case
> > so we decided to put this race in the cathegory of "don't do it when it
> > hurts".
> 
> Well if I alter the test program to use buffered reads, then it takes
> much longer to hit the race. With the original O_DIRECT reads, then it
> is not a great surprise that not all the write i/o has actually hit disk
> at the time of the read, since as you say there is no synchronization
> between the reads and writes.
> 
> So that makes sense to me.
> 
> > 2) DIO read falls back to buffered read on failure - this is there to
> > simplify life of filesystems which don't support DIO at all or do not
> > handle some special cases like DIO from holes, not block-aligned DIO etc.
> > 
> > Combination of these two leads to the problem that DIO read can return
> > zero-filled page when racing with DIO write. With enough luck this can
> > even lead to exposure of uninitialized data:
> >   DIO WRITE           BUFFERED READ
> > - allocates blocks
> >                       - sees blocks mapped so it reads data
> > - submits data write
> > - waits for IO completion
> > - extends i_size
> >                       - sees large i_size so hands over data to userspace
> > 
> >   Now without proper serialization of reads & writes this is hard to avoid
> > (we plan range locks for that but that's still blocked by some mm changes).
> > But checking i_size before reading data at least makes the race *much*
> > less serious (you have to have two threads modifying i_size in
> > parallel to read running and since these threads are synchronized on
> > i_mutex, some races become impossible, the ones returning 0s are still
> > possible but harder to trigger).
> > 
> Well I agree with pretty much all of that except that part about adding
> in the extra i_size test. It may make the race harder to hit, but it
> doesn't eliminate it. Instead in means that we'll have to add a bunch of
> extra code into the fast path just to ensure that it is uptodate.
> 
> The reason that the i_size check that is already in
> do_generic_file_read() is ok is that there is a locked page at the time
> of the i_size_read call, as per the comment:
> 
>                 /*
>                  * i_size must be checked after we know the page is Uptodate.
>                  *
>                  * Checking i_size after the check allows us to calculate
>                  * the correct value for "nr", which means the zero-filled
>                  * part of the page is not copied back to userspace (unless
>                  * another truncate extends the file - this is desired though).
>                  */
  Note that the page is *not* locked at the time of the i_size check. And
I'm aware of the comment - however that seems to talk about a case where
truncate() makes the file smaller. If we checked i_size only before having
the page uptodate, we could see the i_size value from before the truncate
and thus return to userspace data that is already zeroed out by
block_truncate_page() or its equivalent. Actually, the race still seems to
be there, as we could read i_size before the truncate begins, and if we
take a long enough nap just after that, file_read_actor() could copy data
already zeroed out by block_truncate_page(). That just doesn't seem likely
to happen in practice.

So you are right that we might be better off holding the page lock during
the i_size check and the copy to userspace. However that isn't really
possible, as it could deadlock - consider an evil program that does

buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
read(fd, buf, 4096);

In that case file_read_actor() will hit a page fault when copying data to
userspace, and to satisfy the page fault we would need the very same page
lock that we would be holding over the i_size check. This is the reason
why the write path goes through the hoops of ->write_begin, ->write_end,
and prefaulting pages.
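
(For illustration, a minimal, self-contained version of the kind of "evil
program" described above; hypothetical, with a 4096-byte page size
assumed:)

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
  	char *buf;
  	int fd;

  	if (argc < 2) {
  		fprintf(stderr, "usage: %s <file>\n", argv[0]);
  		return 1;
  	}
  	fd = open(argv[1], O_RDWR);
  	if (fd < 0) {
  		perror("open");
  		return 1;
  	}
  	/* map the first page of the very file we are about to read */
  	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  	if (buf == MAP_FAILED) {
  		perror("mmap");
  		return 1;
  	}
  	/* copying into buf faults in the same page that the read targets;
  	 * if the read path held that page's lock across the copy to
  	 * userspace, this would deadlock */
  	if (read(fd, buf, 4096) < 0)
  		perror("read");
  	return 0;
  }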

> > Now we could check i_size in ->readpage / ->readpages but conceptually that
> > seems as a wrong thing to do since readpage only cares about buffers being
> > mapped. OTOH generic_file_aio_read() et al already handles i_size checks.
> > And GFS2 seems to be the only fs that has problems with stale i_size (OCFS2
> > and NFS both revalidate it in their \.aio_read callbacks). So I'd be
> > happier if we could handle this inside GFS2...
> > 
> I would argue that GFS2 is doing this the right way in this case though
> - it makes no sense to put the i_size check into the fast path, when it
> is only required in the case that there is no cached page (i.e the slow
> path). In addition, putting it in the slow path (where the fs can take
> its own locks) also fixes the race, rather than just making it more
> difficult to hit.
  I disagree here: the i_size check is necessary regardless of whether the
page is uptodate or not. For local filesystems truncate can happen
completely independently of the read...

> So far as I can tell the VFS code was originally intended to work in
> exactly this way, however it is not that obvious that this is the case,
> maybe better documentation is required in this case.
  Originally, the VFS never thought about clustered filesystems, so i_size
was always uptodate. We only had to take care of truncate changing i_size
while a read was running. And to mitigate problems with that we needed to
check i_size after the page is uptodate.

> If we did add in some GFS2 code to update i_size ahead of
> generic_file_aio_read() then we can only do that by grabbing and
> dropping the glock. That is a relatively expensive operation, and still
> doesn't fix the race as we must drop the glock before calling
> generic_file_aio_read() since we must not hold glocks over the copy to
> user stage.
  Is it that expensive even if you already have the glock cached? Because
if you have a cached page, you have to have a cached glock as well, if I
understand the GFS2 design right.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17 12:17         ` Steven Whitehouse
@ 2013-12-17 16:41           ` Zheng Liu
  2013-12-19 12:27             ` Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2013-12-17 16:41 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

On Tue, Dec 17, 2013 at 12:17:44PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On Tue, 2013-12-17 at 19:16 +0800, Zheng Liu wrote:
> > Hi Steven,
> > 
> > On Tue, Dec 17, 2013 at 09:43:42AM +0000, Steven Whitehouse wrote:
> > > Hi,
> > > 
> > > On Mon, 2013-12-16 at 11:01 +0100, Jan Kara wrote:
> > > > On Mon 16-12-13 09:37:34, Steven Whitehouse wrote:
> > > > > Hi,
> > > > > 
> > > > > On Sat, 2013-12-07 at 18:55 +0800, Zheng Liu wrote:
> > > > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > > > 
> > > > > > Currently we check i_size in buffered read path after we know the page
> > > > > > is update.  But it could return some zero-filled pages to the userspace
> > > > > > when we do some append dio writes.  We could use the following code
> > > > > > snippet to reproduce this problem in a ext2/3/4 filesystem.
> > > > > > 
> > > > > If the page is not uptodate, then neither (potentially) is i_size, since
> > > > > the underlying fs has not had a chance to get its own locks and update
> > > > > the inode size.
> > > >   Hum, this isn't really about page being uptodate or not, is it? It is
> > > > more about the fact that gfs2 updates i_size only from gfs2_readpage()
> > > > function when obtaining inode lock. Or do you know about other filesystem
> > > > having the same problem as gfs2? E.g. ocfs2 updates i_size from
> > > > ocfs2_file_aio_read() before calling generic_file_aio_read(), cannot gfs2
> > > > do something similar?
> > > > 
> > > Well we could do that, but I don't think it makes sense really. The
> > > point of having the page cache is to put it "in front" of the expensive
> > > operations where possible (such as getting cluster locks). I don't think
> > > it makes sense to grab and drop a cluster lock on every call to
> > > generic_file_aio_read() when, for cached pages, we would not even need
> > > to call ->readpage.
> > > 
> > > The problem appears to be that somehow we are getting a page returned
> > > when it should not be. So I suspect that the correct place to catch that
> > > is in ->readpage(s)
> > > 
> > > > > I suspect that the correct fix would be to implement ->launder_page to
> > > > > avoid the race that you've identified here, if I've understood it
> > > > > correctly,
> > > >   I don't understand how that would work. Can you elaborate a bit? Here the
> > > > problem is i_size gets extended by DIO write during buffered read (which is
> > > > a fallback from DIO read) so we return zero-filled page instead of either
> > > > data filled page or short read. I don't see where launder_page() would come
> > > > into the picture...
> > > > 
> > > > 								Honza
> > > 
> > > Ah, sorry. I was thinking the other way around wrt to
> > > read/write :( However, I still don't think that generic_file_aio_read()
> > > is the right place to fix this. I note that XFS seems to pass the test
> > > and it also uses mpage_readpage and mpage_readpages as well as
> > > generic_file_aio_read() so maybe that is a clue as to where the answer
> > > lies. GFS2 seems to fail the test in the same way as ext3 at the moment.
> > 
> > Yes, xfs can pass the test under direct IO.  I suspect that the reason
> > is that xfs uses ilock/iolock.  But it doesn't means that all file
> > systems need to fix this issue by themselves because the root cause is
> > in vfs layer.
> >
> I'm not convinced that this is the case though. The VFS is asking the fs
> to provide a page via ->readpage(s) and therefore the fs should not be
> providing a page (presumably marked uptodate in this case) where one
> does not already exist, and if it does exist, then the content of the
> page should be correct.

As far as I understand, ->readpage(s) should return zero-filled pages if
the page doesn't have a block mapping, because the content should be '0'.

> 
> I don't see how this test can be done at a point before the fs has a
> chance to get its own locks, and thus ensuring that the inode size is
> actually correct.
> 
> However, I'd like to understand what is going on at ->readpage level,
> and thus why we get this page generated incorrectly, and what actual i/o
> gets generated (or not) in the problematic case.
> 
> > > 
> > > The other question that we've not really answered is why it is useful to
> > > mix buffered and direct i/o to the same file at the same time anyway.
> > > While I agree that it ought to work, I'm not convinced that its
> > > enormously useful, which is maybe why it has not been spotted before,
> > 
> > Yes, mixed buffered read and direct write seems to be useless.  But in
> > our product system, the issue will appear under direct read/write.  The
> > application has a reader and a writer.  The reader should read nothing
> > or the content that has been written by writer.  But now the reader gets
> > the zero-filled page.  Hence the application needs to prevent this issue
> > using a mutex or other flag.  It doesn't make sense to the userspace
> > programmer.
> > 
> I'm not sure I fully understand what you are saying here... if you see
> the same issue using only O_DIRECT reads and writes, how are you getting
> that?

Sorry, maybe I didn't describe the test program clearly.  In the commit
log the test program does direct reads/writes.  I found this problem
because the app developer asked me why his application doesn't work well
under direct read/write.

> 
> > TBH, we found this issue two years ago but we didn't take care of it.
> > So currently the application just solves it as I said above.  But it has
> > to issue a flush via fsync(2) to flush the dirty pages in order to make
> > sure the data has been flushed to disk.  Obviously, the performance
> > of this application is impacted by this issue.  Now we know that we
> > don't need to flush the dirty pages, but I think we'd better fix it.
> > 
> > Regards,
> >                                                 - Zheng
> 
> Calling fsync is a separate issue though, since that is required anyway
> if you want to be sure that the data has actually reached the disk,

Yes, calling fsync is another issue.  I just want to say that this issue
has impacted the application, at least in our production system.

Thanks,
                                                - Zheng

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-17 16:41           ` Zheng Liu
@ 2013-12-19 12:27             ` Steven Whitehouse
  2013-12-19 22:44               ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2013-12-19 12:27 UTC (permalink / raw)
  To: Zheng Liu
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Dmitry Monakhov,
	Dave Chinner, Alexander Viro, Zheng Liu

[-- Attachment #1: Type: text/plain, Size: 2177 bytes --]

Hi,

Sorry for the delay... this has turned out to be a somewhat more
complicated investigation than I'd first expected. In fact there are
still a few things I don't yet understand, however I thought I'd let you
know how I'm getting on in the mean time.

So I started looking at GFS2, since that's what I'm most familiar with.
I've found a couple of bugs which I'm about to post patches for. Even
with those patches GFS2 doesn't pass the test all the time, although it
does get through it some of the time, and it does last longer than ext4.

Since I wondered whether I was just lucky running the test against XFS,
I've run it again several times, and I still have not seen a single
failure on XFS.

In order to gain a bit more information about the problem, I added a few
more things to the printf, and in particular I note that under GFS2 I
see ret (the amount of data read) at various different sizes. On ext4,
ret is always the full 1M buffer size. So I assume that is the
difference which your patch was intended to cure.

However, I also printed out j, the offset where the first error occurs,
and in both the ext4 and gfs2 cases, that offset is 0, and after exactly
4096 bytes, there is an 'a'. I'd have expected to see a number of pages
of 'a' followed by zero pages, but instead, I'm seeing a single zero
page followed by at least one 'a'. I've not extended my instrumentation
to print out a list of which pages are zero and which 'a', but that is
an unexpected result.

Some tracing shows that with the additional GFS2 patches, the data does
get written to disk correctly, ahead of the read which is issued. Also,
since the test does allocating writes, GFS2 will fall back to buffered
I/O for that, so only the read is direct I/O. Since we see similar
results in the GFS2 and ext4 cases, this missing first page, which is
common to both, looks like it might be related to the read side of
things.

I'm attaching my updated version of your test program, which I'd like to
add to our test suite in due course, if you have no objections.

So far I'm not sure what causes the missing first page issue though...
I'll keep looking in the mean time,

Steve.


[-- Attachment #2: dio-append.c --]
[-- Type: text/x-csrc, Size: 2558 bytes --]

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <memory.h>

#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>

#include <pthread.h>

#define BUF_ALIGN	1024

struct writer_data {
	int fd;
	size_t blksize;
	char *buf;
};

static void *writer(void *arg)
{
	struct writer_data *data = (struct writer_data *)arg;
	int ret;

	ret = write(data->fd, data->buf, data->blksize);
	if (ret < 0)
		fprintf(stderr, "write file failed: %s\n", strerror(errno));

	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t tid;
	struct writer_data wdata;
	size_t max_blocks = 10 * 1024;
	size_t blksize = 1 * 1024 * 1024;
	char *rbuf, *wbuf;
	int readfd, writefd;
	int i, j;
	int k;

	if (argc < 2) {
		fprintf(stderr, "usage: %s [filename]\n", argv[0]);
		exit(1);
	}

	writefd = open(argv[1], O_CREAT|O_DIRECT|O_WRONLY|O_APPEND|O_TRUNC, S_IRWXU);
	if (writefd < 0) {
		fprintf(stderr, "failed to open wfile: %s\n", strerror(errno));
		exit(1);
	}
	readfd = open(argv[1], O_DIRECT|O_RDONLY, S_IRWXU);
	if (readfd < 0) {
		fprintf(stderr, "failed to open rfile: %s\n", strerror(errno));
		exit(1);
	}

	if (posix_memalign((void **)&wbuf, BUF_ALIGN, blksize)) {
		fprintf(stderr, "failed to alloc memory: %s\n", strerror(errno));
		exit(1);
	}

	if (posix_memalign((void **)&rbuf, 4096, blksize)) {
		fprintf(stderr, "failed to alloc memory: %s\n", strerror(errno));
		exit(1);
	}

	memset(wbuf, 'a', blksize);

	wdata.fd = writefd;
	wdata.blksize = blksize;
	wdata.buf = wbuf;

	for (i = 0; i < max_blocks; i++) {
		void *retval;
		int ret;

		ret = pthread_create(&tid, NULL, writer, &wdata);
		if (ret) {
			fprintf(stderr, "create thread failed: %s\n", strerror(errno));
			exit(1);
		}

		memset(rbuf, 'b', blksize);
		do {
			ret = pread(readfd, rbuf, blksize, i * blksize);
		} while (ret <= 0);

		if (ret < 0) {
			fprintf(stderr, "read file failed: %s\n", strerror(errno));
			exit(1);
		}

		if (pthread_join(tid, &retval)) {
			fprintf(stderr, "pthread join failed: %s\n", strerror(errno));
			exit(1);
		}

		if (ret >= 0) {
			for (j = 0; j < ret; j++) {
				if (rbuf[j] != 'a') {
					char c = rbuf[j];
					fprintf(stderr, "encounter an error: offset %d, j=%d, ret=%d rbuf[j]=0x%02x\n",
						i, j, ret, (unsigned char)c);
					for (k = j; k < ret; k++) {
						if (rbuf[k] != c)
							break;
					}
					/* guard against reading past the bytes returned by pread */
					fprintf(stderr, "0x%02x continues for %d bytes, next=0x%02x\n",
						(unsigned char)c, k - j,
						k < ret ? (unsigned char)rbuf[k] : 0);
					goto err;
				}
			}
		}
	}

err:
	free(wbuf);
	free(rbuf);

	return 0;
}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-19 12:27             ` Steven Whitehouse
@ 2013-12-19 22:44               ` Dave Chinner
  2013-12-20  9:28                 ` Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2013-12-19 22:44 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Zheng Liu, Jan Kara, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Alexander Viro, Zheng Liu

On Thu, Dec 19, 2013 at 12:27:53PM +0000, Steven Whitehouse wrote:
> Hi,
> 
> Sorry for the delay... this has turned out to be a somewhat more
> complicated investigation than I'd first expected. In fact there are
> still a few things I don't yet understand, however I thought I'd let you
> know how I'm getting on in the mean time.
> 
> So I started looking at GFS2, since thats what I'm most familiar with.
> I've found a couple of bugs which I'm about to post patches for,
> although even with those patches GFS2 doesn't pass the test all the
> time, although it does get through the test some of the time, and it
> does last longer than ext4.
> 
> Since I wondered whether I was just lucky running the test against XFS,
> I've run it again several times, and I still have not seen a single
> failure on XFS.

IF this is a failure due to a buffered IO fallback from the direct
IO path, then XFS will never fail because it doesn't ever fall back
to buffered IO. i.e. XFS *always* does direct IO.

Looking at the test code, the appending direct IO write is
effectively a single 1MB IO (i.e. one atomic i_size update after it
completes). Hence if the filesystem doesn't fall back to buffered IO
for appending writes, then the direct IO reads should never read
data between the old EOF and the new EOF until after the new EOF is
reflected in i_size.

> In order to gain a bit more information about the problem, I added a few
> more things to the printf, and in particular I note that under GFS2 I
> see ret (the amount of data read) at various different sizes.

That implies multiple i_size updates during the write, which implies
buffered IO for the writes.

> On ext4,
> ret is always the full 1M buffer size. So I assume that is the
> difference which your patch was intended to cure.
>
> However, I also printed out j, the offset where the first error occurs,
> and in both the ext4 and gfs2 cases, that offset is 0, and after exactly
> 4096 bytes, there is an 'a'. I'd have expected to see a number of pages
> of 'a' followed by zero pages, but instead, I'm seeing a single zero
> page followed by at least one 'a'. I've not extended my instrumentation
> to print out a list of which pages are zero and which 'a', but that is
> an unexpected result.

i.e. the write is being done page by page rather than in chunks
limited by the size of a bio. Again, that implies buffered
writes, not direct IO writes.

> Some tracing shows that with the additional GFS2 patches, the data does
> get written to disk correctly, ahead of the read which is issued. Also
> since the test does allocating writes, GFS2 will fall back to buffered
> I/O for that, and only the read is direct I/O, so since we see similar
> results in the GFS2 and ext4 cases, this missing first page which is
> common to both looks like it might be related to the read side of
> things.

Ok, buffered writes explain all those symptoms, and that means what
you are seeing is a buffered write/direct IO read race condition.
If the first page is missing data from the direct IO read, that
implies that the inode size has been updated by the buffered write
path, but a direct IO read of zeroes means the data in the page
cache has not been flushed to disk. Given that the direct IO read
path does a filemap_write_and_wait_range() call before issuing the
direct IO, that implies the first page was not flushed to disk by
the cache flush.

I'm not sure why that may be the case, but that's where I'd
start looking.....

> I'm attaching my updated version of your test program, which I'd like to
> add to our test suite in due course, if you have no objections.

IIRC, there's a patch to add it to xfstests.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-19 22:44               ` Dave Chinner
@ 2013-12-20  9:28                 ` Steven Whitehouse
  2013-12-23  3:00                   ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2013-12-20  9:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Zheng Liu, Jan Kara, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Alexander Viro, Zheng Liu, Lukas Czerner

Hi,

On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:
> On Thu, Dec 19, 2013 at 12:27:53PM +0000, Steven Whitehouse wrote:
> > Hi,
> > 
> > Sorry for the delay... this has turned out to be a somewhat more
> > complicated investigation than I'd first expected. In fact there are
> > still a few things I don't yet understand, however I thought I'd let you
> > know how I'm getting on in the mean time.
> > 
> > So I started looking at GFS2, since thats what I'm most familiar with.
> > I've found a couple of bugs which I'm about to post patches for,
> > although even with those patches GFS2 doesn't pass the test all the
> > time, although it does get through the test some of the time, and it
> > does last longer than ext4.
> > 
> > Since I wondered whether I was just lucky running the test against XFS,
> > I've run it again several times, and I still have not seen a single
> > failure on XFS.
> 
> IF this is a failure due to a buffered IO fallback from the direct
> IO path, then XFS will never fail because it doesn't ever fall back
> to buffered IO. i.e. XFS *always* does direct IO.
> 
> Looking at the test code, the appending direct IO write is
> effectively a single 1MB IO (i.e. one atomic i_size update after it
> completes). Hence if the filesystem doesn't fall back to buffered IO
> for appending writes, then the direct IO reads should never read
> data between the old EOF and the new EOF until after the new EOF is
> reflected in i_size.
> 
Well GFS2 does fall back, not so sure about ext4.

> > In order to gain a bit more information about the problem, I added a few
> > more things to the printf, and in particular I note that under GFS2 I
> > see ret (the amount of data read) at various different sizes.
> 
> That implies multiple i_size updates during the write, which implies
> buffered IO for the writes.
> 
Yes, that's exactly what I'd expect to see. I've been tracing what's going
on using a mixture of trace_printk and existing trace points.

> > On ext4,
> > ret is always the full 1M buffer size. So I assume that is the
> > difference which your patch was intended to cure.
> >
> > However, I also printed out j, the offset where the first error occurs,
> > and in both the ext4 and gfs2 cases, that offset is 0, and after exactly
> > 4096 bytes, there is an 'a'. I'd have expected to see a number of pages
> > of 'a' followed by zero pages, but instead, I'm seeing a single zero
> > page followed by at least one 'a'. I've not extended my instrumentation
> > to print out a list of which pages are zero and which 'a', but that is
> > an unexpected result.
> 
> i.e. the write is being done page by page rather than in chunks
> limited by the size of a bio. Again, that implies that buffered
> writes, not direct IO writes.
> 
> > Some tracing shows that with the additional GFS2 patches, the data does
> > get written to disk correctly, ahead of the read which is issued. Also
> > since the test does allocating writes, GFS2 will fall back to buffered
> > I/O for that, and only the read is direct I/O, so since we see similar
> > results in the GFS2 and ext4 cases, this missing first page which is
> > common to both looks like it might be related to the read side of
> > things.
> 
> Ok, buffered writes explain all those symptoms, and that means what
> you are seeing is a buffered write/direct IO read race condition.
> If the first page is missing data from the direct IO read, that
> implies that the inode size has been updated by the buffered write
> path, but a direct Io read of zeroes means the data in the page
> cache has not been flushed to disk. Given that the direct IO read
> path does a filemap_write_and_wait_range() call before issuing the
> direct IO, that implies the first page was not flushed to disk by
> the cache flush.
> 
> I'm not sure why that may be the case, but that's where I'd
> start looking.....
> 
Indeed - I've just spent the last couple of days looking at exactly
this :-) I'm shortly going to lose access to the machine I've been doing
tests on until the New Year, so progress may slow down for a little
while on my side.

However there is nothing obvious in the trace. It looks like the write
I/O gets pushed out in two chunks, the earlier one containing the "first
page" missing block along with other blocks, and the second one is
pushed out by the filemap_write_and_wait_range added in the patch I
posted yesterday. Both have completed before we send out the read
(O_DIRECT) which results in a single I/O to disk - again exactly what
I'd expect to see. There are two calls to bmap for the O_DIRECT read,
the first one returns short (correctly since we hit EOF) and the second
one returns unmapped since it is a request to map the remainder of the
file, which is beyond EOF. That all looks correct to me, and I can't see
a difference between the working and broken cases.

I did also try the obvious experiment of changing the
filemap_write_and_wait_range into a plain filemap_write_and_wait and
that made no difference.

I also checked that after the error occurs, the file does end up
correctly written, with all bytes set to 'a'. So it is definitely a race
somewhere, rather than a permanent condition.

I think I may also have interested Lukas Czerner in taking a look from
the ext4 perspective.

> > I'm attaching my updated version of your test program, which I'd like to
> > add to our test suite in due course, if you have no objections.
> 
> IIRC, there's a patch to add it to xfstests.
> 
> Cheers,
> 
> Dave.

Ok, in which case we can easily use it in xfstests. Thanks for the heads
up,

Steve.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-20  9:28                 ` Steven Whitehouse
@ 2013-12-23  3:00                   ` Dave Chinner
  2014-01-14 15:22                     ` Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2013-12-23  3:00 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Zheng Liu, Jan Kara, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Alexander Viro, Zheng Liu, Lukas Czerner

On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:

[snip]

> > Ok, buffered writes explain all those symptoms, and that means what
> > you are seeing is a buffered write/direct IO read race condition.
> > If the first page is missing data from the direct IO read, that
> > implies that the inode size has been updated by the buffered write
> > path, but a direct Io read of zeroes means the data in the page
> > cache has not been flushed to disk. Given that the direct IO read
> > path does a filemap_write_and_wait_range() call before issuing the
> > direct IO, that implies the first page was not flushed to disk by
> > the cache flush.
> > 
> > I'm not sure why that may be the case, but that's where I'd
> > start looking.....
> > 
> Indeed - I've just spent the last couple of days looking at exactly
> this :-) I'm shortly going to lose access to the machine I've been doing
> tests on until the New Year, so progress may slow down for a little
> while on my side.
> 
> However there is nothing obvious in the trace. It looks like the write
> I/O gets pushed out in two chunks, the earlier one containing the "first
> page" missing block along with other blocks, and the second one is
> pushed out by the filemap_write_and_wait_range added in the patch I
> posted yesterday. Both have completed before we send out the read
> (O_DIRECT) which results in a single I/O to disk - again exactly what
> I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> the first one returns short (correctly since we hit EOF) and the second
> one returns unmapped since it is a request to map the remainder of the
> file, which is beyond EOF.

Oh, interesting that there are two bmap calls for the read. Have a
look at do_direct_IO(), specifically the case where we hit the
buffer_unmapped/do_holes code:

do_holes:
			/* Handle holes */
			if (!buffer_mapped(map_bh)) {
				loff_t i_size_aligned;

				/* AKPM: eargh, -ENOTBLK is a hack */
				if (dio->rw & WRITE) {
					page_cache_release(page);
					return -ENOTBLK;
				}

				/*
				 * Be sure to account for a partial block as the
				 * last block in the file
				 */
				i_size_aligned = ALIGN(i_size_read(dio->inode),
							1 << blkbits);
				if (sdio->block_in_file >=
						i_size_aligned >> blkbits) {
					/* We hit eof */
					page_cache_release(page);
					goto out;
				}
				zero_user(page, block_in_page << blkbits,
						1 << blkbits);
				sdio->block_in_file++;
				block_in_page++;
				goto next_block;
			}

So, if we hit a hole, we read the inode size to determine if we are
beyond EOF. If we are beyond EOF, we're done. If not, the block gets
zeroed, and we move on to the next block.

Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
holding the i_mutex across the read, which means it is probably
racing with the buffered write (I'm not sure what locking gfs2 does
here). Hence if the buffered write completes a page and updates the
inode size between the point in time that the direct IO read mapped
a hole and then checked the EOF, it will zero the page rather than
returning a short read, and then move on to reading the next
block. If the next block is still in the hole but is beyond EOF it
will then abort, returning a short read of a zeroed page....
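
Spelling that window out as a timeline (just a rough sketch of what I'm
describing above, assuming the appending write has fallen back to a
buffered write as it does on GFS2):

/*
 * direct IO read (do_direct_IO)       appending write (buffered fallback)
 *
 * get_block() maps a hole
 *                                     page completed, i_size extended
 * i_size_read(): block now < EOF
 * zero_user() zeroes the block
 * next block: still a hole, but
 * beyond EOF -> goto out
 * => short read of a zeroed page
 */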

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2013-12-23  3:00                   ` Dave Chinner
@ 2014-01-14 15:22                     ` Steven Whitehouse
  2014-01-14 19:19                       ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2014-01-14 15:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Zheng Liu, Jan Kara, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Alexander Viro, Zheng Liu, Lukas Czerner

Hi,

On Mon, 2013-12-23 at 14:00 +1100, Dave Chinner wrote:
> On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> > On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:
> 
> [snip]
> 
> > > Ok, buffered writes explain all those symptoms, and that means what
> > > you are seeing is a buffered write/direct IO read race condition.
> > > If the first page is missing data from the direct IO read, that
> > > implies that the inode size has been updated by the buffered write
> > > path, but a direct Io read of zeroes means the data in the page
> > > cache has not been flushed to disk. Given that the direct IO read
> > > path does a filemap_write_and_wait_range() call before issuing the
> > > direct IO, that implies the first page was not flushed to disk by
> > > the cache flush.
> > > 
> > > I'm not sure why that may be the case, but that's where I'd
> > > start looking.....
> > > 
> > Indeed - I've just spent the last couple of days looking at exactly
> > this :-) I'm shortly going to lose access to the machine I've been doing
> > tests on until the New Year, so progress may slow down for a little
> > while on my side.
> > 
> > However there is nothing obvious in the trace. It looks like the write
> > I/O gets pushed out in two chunks, the earlier one containing the "first
> > page" missing block along with other blocks, and the second one is
> > pushed out by the filemap_write_and_wait_range added in the patch I
> > posted yesterday. Both have completed before we send out the read
> > (O_DIRECT) which results in a single I/O to disk - again exactly what
> > I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> > the first one returns short (correctly since we hit EOF) and the second
> > one returns unmapped since it is a request to map the remainder of the
> > file, which is beyond EOF.
> 
> Oh, interesting that there are two bmap calls for the read. Have a
> look at do_direct_IO(), specifically the case where we hit the
> buffer_unmapped/do_holes code:
> 
[snip]
> So, if we hit a hole, we read the inode size to determine if we are
> beyond EOF. If we are beyond EOF, we're done. If not, the block gets
> zeroed, and we move on to the nect block.
> 
> Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
> holding the i_mutex across the read, which means it is probably
> racing with the buffered write (I'm not sure what locking gfs2 does
> here). Hence if the buffered write completes a page and updates the
> inode size between the point in time that the direct IO read mapped
> a hole and then checked the EOF, it will zero the page rather than
> returning a short read, and then move on to reading the next
> block. If the next block is still in the hole but is beyond EOF it
> will then abort, returning a short read of a zeroed page....
> 
> Cheers,
> 
> Dave.

Sorry for the delay - I have been working on this and I've now figured
out what's really going on here. I'll describe the GFS2 case first, but
it seems that it is something that will affect other fs too, most likely
including local filesystems.

So we have one thread doing writes to a file, 1M at a time, using
O_DIRECT. Now with GFS2 that results in more or less a fallback to a
buffered write - there is one important difference however, and that is
that the direct i/o page invalidations still occur, just as if it were a
direct i/o write.

Thread two is doing direct i/o reads. When this results in calling
->direct_IO, all is well. GFS2 does all the correct locking to ensure
that this is coherent with the buffered writes. There is a problem
though - in generic_file_aio_read() we have:

       /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
        if (filp->f_flags & O_DIRECT) {
                loff_t size;
                struct address_space *mapping;
                struct inode *inode;

                mapping = filp->f_mapping;
                inode = mapping->host;
                if (!count)
                        goto out; /* skip atime */
                size = i_size_read(inode);
                if (pos < size) {
                        retval = filemap_write_and_wait_range(mapping, pos,
                                        pos + iov_length(iov, nr_segs) - 1);
                        if (!retval) {
                                retval = mapping->a_ops->direct_IO(READ, iocb,
                                                        iov, pos, nr_segs);
                        }

The important thing here is that it is checking the i_size with no
locking! This means that it can check i_size, find it's the same as the
requested read position (if the read gets in first, before the next block
is written to the file) and then fall back to a buffered read.

That buffered read then gets a page which is invalidated by the page
invalidation after the (DIO, but fallen back to buffered) write has
occurred. I added some trace_printk() calls to the read path and it is
fairly clear that is the path that's being taken when the failure occurs.
Also, it explains why, despite the disk having been used and reused many
times, the page that is returned from the failure is always 0, so it's
not a page that's been read from disk. It also explains why I've only
ever seen this for the first page of the read, and the rest of the read
is ok.
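
To spell out the ordering (a rough sketch only - the exact page cache
state transitions are glossed over here):

/*
 * O_DIRECT read                        O_DIRECT write (buffered fallback)
 *
 * size = i_size_read(): pos >= size,
 * so ->direct_IO is never called
 *                                      data copied into the page cache,
 *                                      i_size extended past pos
 * fall through to buffered read
 *                                      direct-i/o style page invalidation
 * the page the read picked up is
 * invalidated under it; what comes
 * back to userspace is all zeroes
 */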

Changing the test "if (pos < size) {" to "if (1) {" results in zero
failures from the test program, since it takes the ->direct_IO path each
time which then checks the i_size under the glock, in a race free
manner.

So far as I can tell, this can be just as much a problem for local
filesystems, since they may update i_size at any time too. I note that
XFS appears to wrap this bit of code under its iolock, so that probably
explains why XFS doesn't suffer from this issue.

So what is the correct fix... we have several options I think:

 a) Ignore it on the basis that it's not important for normal workloads
 b) Simply remove the "if (pos < size) {" test and leave the fs to get
on with it (I know that leaves the btrfs test which also uses i_size
later on, but that doesn't appear to be an issue since it's after we've
called ->direct_IO). Is it an issue that this may call an extra
filemap_write_and_wait_range() in some cases where it didn't before? I
suspect not since the extra calls would be for cases where the pages
were beyond the end of file anyway (according to i_size), so a no-op in
most cases.
 c) Try to prevent race between page invalidation in dio write path and
buffered read somehow (how?)
 d) any other solutions?

I'm leaning towards (b) being the best option at the moment, since it
seems simple to do and looks to have no obvious unwanted side effects,

Steve.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2014-01-14 15:22                     ` Steven Whitehouse
@ 2014-01-14 19:19                       ` Jan Kara
  2014-01-15  7:19                         ` Zheng Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2014-01-14 19:19 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Dave Chinner, Zheng Liu, Jan Kara, linux-fsdevel,
	Christoph Hellwig, Dmitry Monakhov, Alexander Viro, Zheng Liu,
	Lukas Czerner

  Hello,

On Tue 14-01-14 15:22:13, Steven Whitehouse wrote:
> On Mon, 2013-12-23 at 14:00 +1100, Dave Chinner wrote:
> > On Fri, Dec 20, 2013 at 09:28:44AM +0000, Steven Whitehouse wrote:
> > > On Fri, 2013-12-20 at 09:44 +1100, Dave Chinner wrote:
> > 
> > [snip]
> > 
> > > > Ok, buffered writes explain all those symptoms, and that means what
> > > > you are seeing is a buffered write/direct IO read race condition.
> > > > If the first page is missing data from the direct IO read, that
> > > > implies that the inode size has been updated by the buffered write
> > > > path, but a direct Io read of zeroes means the data in the page
> > > > cache has not been flushed to disk. Given that the direct IO read
> > > > path does a filemap_write_and_wait_range() call before issuing the
> > > > direct IO, that implies the first page was not flushed to disk by
> > > > the cache flush.
> > > > 
> > > > I'm not sure why that may be the case, but that's where I'd
> > > > start looking.....
> > > > 
> > > Indeed - I've just spent the last couple of days looking at exactly
> > > this :-) I'm shortly going to lose access to the machine I've been doing
> > > tests on until the New Year, so progress may slow down for a little
> > > while on my side.
> > > 
> > > However there is nothing obvious in the trace. It looks like the write
> > > I/O gets pushed out in two chunks, the earlier one containing the "first
> > > page" missing block along with other blocks, and the second one is
> > > pushed out by the filemap_write_and_wait_range added in the patch I
> > > posted yesterday. Both have completed before we send out the read
> > > (O_DIRECT) which results in a single I/O to disk - again exactly what
> > > I'd expect to see. There are two calls to bmap for the O_DIRECT read,
> > > the first one returns short (correctly since we hit EOF) and the second
> > > one returns unmapped since it is a request to map the remainder of the
> > > file, which is beyond EOF.
> > 
> > Oh, interesting that there are two bmap calls for the read. Have a
> > look at do_direct_IO(), specifically the case where we hit the
> > buffer_unmapped/do_holes code:
> > 
> [snip]
> > So, if we hit a hole, we read the inode size to determine if we are
> > beyond EOF. If we are beyond EOF, we're done. If not, the block gets
> > zeroed, and we move on to the nect block.
> > 
> > Now, GFS2 does not set DIO_LOCKING, so the direct IO code is not
> > holding the i_mutex across the read, which means it is probably
> > racing with the buffered write (I'm not sure what locking gfs2 does
> > here). Hence if the buffered write completes a page and updates the
> > inode size between the point in time that the direct IO read mapped
> > a hole and then checked the EOF, it will zero the page rather than
> > returning a short read, and then move on to reading the next
> > block. If the next block is still in the hole but is beyond EOF it
> > will then abort, returning a short read of a zeroed page....
> > 
> > Cheers,
> > 
> > Dave.
> 
> Sorry for the delay - I have been working on this and I've now figured
> out whats really going on here. I'll describe the GFS2 case first, but
> it seems that it is something that will affect other fs too, including
> local fs most likely.
> 
> So we have one thread doing writes to a file, 1M at a time, using
> O_DIRECT. Now with GFS2 that results in more or less a fallback to a
> buffered write - there is one important difference however, and that is
> that the direct i/o page invalidations still occur, just as if it were a
> direct i/o write.
> 
> Thread two is doing direct i/o reads. When this results in calling
> ->direct_IO, all is well. GFS2 does all the correct locking to ensure
> that this is coherent with the buffered writes. There is a problem
> though - in generic_file_aio_read() we have:
> 
>        /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
>         if (filp->f_flags & O_DIRECT) {
>                 loff_t size;
>                 struct address_space *mapping;
>                 struct inode *inode;
> 
>                 mapping = filp->f_mapping;
>                 inode = mapping->host;
>                 if (!count)
>                         goto out; /* skip atime */
>                 size = i_size_read(inode);
>                 if (pos < size) {
>                         retval = filemap_write_and_wait_range(mapping, pos,
>                                         pos + iov_length(iov, nr_segs) - 1);
>                         if (!retval) {
>                                 retval = mapping->a_ops->direct_IO(READ, iocb,
>                                                         iov, pos, nr_segs);
>                         }
> 
> The important thing here is that it is checking the i_size with no
> locking! This means that it can check i_size, find its the same as the
> request read position (if the read gets in first before the next block
> is written to the file) and then fall back to buffered read.
> 
> That buffered read then gets a page which is invalidated by the page
> invalidation after the (DIO, but fallen back to buffered) write has
> occurred. I added some trace_printk() calls to the read path and it is
> fairly clear that is the path thats being taken when the failure occurs.
> Also, it explains why, despite the disk having been used and reused many
> times, the page that is returned from the failure is always 0, so its
> not a page thats been read from disk. It also explains why I've only
> ever seen this for the first page of the read, and the rest of the read
> is ok.
  Yes, a race like this is something Zheng and I have been talking about from
the beginning. Thanks for tracking down the details.

> Changing the test "if (pos < size) {" to "if (1) {" results in zero
> failures from the test program, since it takes the ->direct_IO path each
> time which then checks the i_size under the glock, in a race free
> manner.
> 
> So far as I can tell, this can be just as much a problem for local
> filesystems, since they may update i_size at any time too. I note that
> XFS appears to wrap this bit of code under its iolock, so that probably
> explains why XFS doesn't suffer from this issue.
> 
> So what is the correct fix... we have several options I think:
> 
>  a) Ignore it on the basis that its not important for normal workloads
  Zheng has a real-world use case which hits this race and it does look
rather normal. So I don't like a) very much.

>  b) Simply remove the "if (pos < size) {" test and leave the fs to get
> on with it (I know that leaves the btrfs test which also uses i_size
> later on, but that doesn't appear to be an issue since its after we've
> called ->direct_IO). Is it an issue that this may call an extra
> filemap_write_and_wait_range() in some cases where it didn't before? I
> suspect not since the extra calls would be for cases where the pages
> were beyond the end of file anyway (according to i_size), so a no-op in
> most cases.
>  c) Try to prevent race between page invalidation in dio write path and
> buffered read somehow (how?)
>  d) any other solutions?
  Well, there's Zheng's patch to make buffered write exit if pos >= i_size.
That fixes the race for anything but GFS2 as well. But b) will fix it as
well and if it works for GFS2 then I'm fine with it. Actually, it's not
that much different from Zheng's patch after all. We won't reach the
buffered read path at all because of
                        if (retval < 0 || !count || *ppos >= size) {
                                file_accessed(filp);
                                goto out;
                        }
check. Hum, but for that check to reliably prevent a buffered read from
returning a zero-filled page, we need to sample 'size' before calling
->direct_IO (as we currently do), which will still cause GFS2 the same
trouble Zheng's patch did, won't it?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes
  2014-01-14 19:19                       ` Jan Kara
@ 2014-01-15  7:19                         ` Zheng Liu
  2014-01-16 15:35                           ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Zheng Liu @ 2014-01-15  7:19 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Steven Whitehouse, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Alexander Viro, Zheng Liu, Lukas Czerner

Hi all,

Sorry for the delay because of the holiday and other issues.

On Tue, Jan 14, 2014 at 08:19:01PM +0100, Jan Kara wrote:
[...]
> > Sorry for the delay - I have been working on this and I've now figured
> > out whats really going on here. I'll describe the GFS2 case first, but
> > it seems that it is something that will affect other fs too, including
> > local fs most likely.
> > 
> > So we have one thread doing writes to a file, 1M at a time, using
> > O_DIRECT. Now with GFS2 that results in more or less a fallback to a
> > buffered write - there is one important difference however, and that is
> > that the direct i/o page invalidations still occur, just as if it were a
> > direct i/o write.
> > 
> > Thread two is doing direct i/o reads. When this results in calling
> > ->direct_IO, all is well. GFS2 does all the correct locking to ensure
> > that this is coherent with the buffered writes. There is a problem
> > though - in generic_file_aio_read() we have:
> > 
> >        /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
> >         if (filp->f_flags & O_DIRECT) {
> >                 loff_t size;
> >                 struct address_space *mapping;
> >                 struct inode *inode;
> > 
> >                 mapping = filp->f_mapping;
> >                 inode = mapping->host;
> >                 if (!count)
> >                         goto out; /* skip atime */
> >                 size = i_size_read(inode);
> >                 if (pos < size) {
> >                         retval = filemap_write_and_wait_range(mapping, pos,
> >                                         pos + iov_length(iov, nr_segs) - 1);
> >                         if (!retval) {
> >                                 retval = mapping->a_ops->direct_IO(READ, iocb,
> >                                                         iov, pos, nr_segs);
> >                         }
> > 
> > The important thing here is that it is checking the i_size with no
> > locking! This means that it can check i_size, find its the same as the
> > request read position (if the read gets in first before the next block
> > is written to the file) and then fall back to buffered read.
> > 
> > That buffered read then gets a page which is invalidated by the page
> > invalidation after the (DIO, but fallen back to buffered) write has
> > occurred. I added some trace_printk() calls to the read path and it is
> > fairly clear that is the path thats being taken when the failure occurs.
> > Also, it explains why, despite the disk having been used and reused many
> > times, the page that is returned from the failure is always 0, so its
> > not a page thats been read from disk. It also explains why I've only
> > ever seen this for the first page of the read, and the rest of the read
> > is ok.
>   Yes, race like this is something Zheng and I have been talking about from
> the beginning. Thanks for tracking down the details.
> 
> > Changing the test "if (pos < size) {" to "if (1) {" results in zero
> > failures from the test program, since it takes the ->direct_IO path each
> > time which then checks the i_size under the glock, in a race free
> > manner.
> > 
> > So far as I can tell, this can be just as much a problem for local
> > filesystems, since they may update i_size at any time too. I note that
> > XFS appears to wrap this bit of code under its iolock, so that probably
> > explains why XFS doesn't suffer from this issue.
> > 
> > So what is the correct fix... we have several options I think:
> > 
> >  a) Ignore it on the basis that its not important for normal workloads
>   Zheng has a real world usecase which hits this race and it does look
> rather normal. So don't like a) very much.

Yes, as far as I know, there are at least two applications that have hit
this problem in our production systems at Taobao.  That is why I dug into
this bug and hope it can be fixed.

> 
> >  b) Simply remove the "if (pos < size) {" test and leave the fs to get
> > on with it (I know that leaves the btrfs test which also uses i_size
> > later on, but that doesn't appear to be an issue since its after we've
> > called ->direct_IO). Is it an issue that this may call an extra
> > filemap_write_and_wait_range() in some cases where it didn't before? I
> > suspect not since the extra calls would be for cases where the pages
> > were beyond the end of file anyway (according to i_size), so a no-op in
> > most cases.

I tried this approach and I can confirm that it is OK for ext4.

Dave, I have sent out a patch for xfstests to add a new test case
for this bug but I haven't gotten any response [1].  Could you
please review it?  Thanks.

1. http://comments.gmane.org/gmane.comp.file-systems.xfs.general/58428

Regards,
                                                - Zheng

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC] [PATCH] Fix race when checking i_size on direct i/o read
  2014-01-15  7:19                         ` Zheng Liu
@ 2014-01-16 15:35                           ` Steven Whitehouse
  2014-01-17 10:20                             ` Miklos Szeredi
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2014-01-16 15:35 UTC (permalink / raw)
  To: Zheng Liu
  Cc: Jan Kara, Dave Chinner, linux-fsdevel, Christoph Hellwig,
	Dmitry Monakhov, Alexander Viro, Zheng Liu, Lukas Czerner,
	Miklos Szeredi, linux-ext4, Chris Mason, Josef Bacik


Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.

Removing the i_size test from the direct i/o read path at VFS level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:

 a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue

 b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side. Someone who knows fuse
better may be able to confirm whether this is actually required or not.

 c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.

The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.

There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.

So please have a look at this and let me know what you think. I guess
that when the time comes to submit it, it should probably be via the vfs
tree.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7e70506..89fdfd1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2710,6 +2710,9 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	inode = file->f_mapping->host;
 	i_size = i_size_read(inode);
 
+	if ((rw == READ) && (offset > i_size))
+		return 0;
+
 	/* optimization for short read */
 	if (async_dio && rw != WRITE && offset + count > i_size) {
 		if (offset >= i_size)
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..0184286 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1428,30 +1428,28 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
-		if (pos < size) {
-			retval = filemap_write_and_wait_range(mapping, pos,
+		retval = filemap_write_and_wait_range(mapping, pos,
 					pos + iov_length(iov, nr_segs) - 1);
-			if (!retval) {
-				retval = mapping->a_ops->direct_IO(READ, iocb,
-							iov, pos, nr_segs);
-			}
-			if (retval > 0) {
-				*ppos = pos + retval;
-				count -= retval;
-			}
+		if (!retval) {
+			retval = mapping->a_ops->direct_IO(READ, iocb,
+							   iov, pos, nr_segs);
+		}
+		if (retval > 0) {
+			*ppos = pos + retval;
+			count -= retval;
+		}
 
-			/*
-			 * Btrfs can have a short DIO read if we encounter
-			 * compressed extents, so if there was an error, or if
-			 * we've already read everything we wanted to, or if
-			 * there was a short read because we hit EOF, go ahead
-			 * and return.  Otherwise fallthrough to buffered io for
-			 * the rest of the read.
-			 */
-			if (retval < 0 || !count || *ppos >= size) {
-				file_accessed(filp);
-				goto out;
-			}
+		/*
+		 * Btrfs can have a short DIO read if we encounter
+		 * compressed extents, so if there was an error, or if
+		 * we've already read everything we wanted to, or if
+		 * there was a short read because we hit EOF, go ahead
+		 * and return.  Otherwise fallthrough to buffered io for
+		 * the rest of the read.
+		 */
+		if (retval < 0 || !count || *ppos >= size) {
+			file_accessed(filp);
+			goto out;
 		}
 	}
 



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC] [PATCH] Fix race when checking i_size on direct i/o read
  2014-01-16 15:35                           ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
@ 2014-01-17 10:20                             ` Miklos Szeredi
  2014-01-24 14:42                               ` Steven Whitehouse
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2014-01-17 10:20 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Zheng Liu, Jan Kara, Dave Chinner, Linux-Fsdevel,
	Christoph Hellwig, Dmitry Monakhov, Alexander Viro, Zheng Liu,
	Lukas Czerner, linux-ext4, Chris Mason, Josef Bacik

On Thu, Jan 16, 2014 at 4:35 PM, Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
> reads with append dio writes" thread on linux-fsdevel, this patch is my
> current version of the fix proposed as option (b) in that thread.
>
> Removing the i_size test from the direct i/o read path at VFS level
> means that filesystems now have to deal with requests which are beyond
> i_size themselves. These I've divided into three sets:
>
>  a) Those with "no op" ->direct_IO (9p, cifs, ceph)
> These are obviously not going to be an issue
>
>  b) Those with "home brew" ->direct_IO (nfs, fuse)
> I've been told that NFS should not have any problem with the larger
> i_size, however I've added an extra test to FUSE to duplicate the
> original behaviour just to be on the safe side. Someone who knows fuse
> better maybe able to confirm whether this is actually required or not.
>
>  c) Those using __blockdev_direct_IO()
> These call through to ->get_block() which should deal with the EOF
> condition correctly. I've verified that with GFS2 and I believe that
> Zheng has verified it for ext4. I've also run the test on XFS and it
> passes both before and after this change.
>
> The part of the patch in filemap.c looks a lot larger than it really is
> - there are only two lines of real change. The rest is just indentation
> of the contained code.
>
> There remains a test of i_size though, which was added for btrfs. It
> doesn't cause the other filesystems a problem as the test is performed
> after ->direct_IO has been called. It is possible that there is a race
> that does matter to btrfs, however this patch doesn't change that, so
> its still an overall improvement.
>
> So please have a look at this and let me know what you think. I guess
> that when time comes to submit it, it should probably be via the vfs
> tree.
>
> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Chris Mason <clm@fb.com>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 7e70506..89fdfd1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2710,6 +2710,9 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>         inode = file->f_mapping->host;
>         i_size = i_size_read(inode);
>
> +       if ((rw == READ) && (offset > i_size))
> +               return 0;
> +

Hmm, OK.   It's not strictly needed, but a valid optimization.  So ACK.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] Fix race when checking i_size on direct i/o read
  2014-01-17 10:20                             ` Miklos Szeredi
@ 2014-01-24 14:42                               ` Steven Whitehouse
  2014-01-27 10:13                                 ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Steven Whitehouse @ 2014-01-24 14:42 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Zheng Liu, Jan Kara, Dave Chinner, Linux-Fsdevel, Miklos Szeredi,
	Christoph Hellwig, Dmitry Monakhov, Alexander Viro, Zheng Liu,
	Lukas Czerner, linux-ext4, Chris Mason, Josef Bacik,
	linux-kernel, Andrew Morton


So far I've had one ACK for this, and no other comments. So I think it
is probably time to send this via some suitable tree. I'm guessing that
the vfs tree would be the most appropriate route, but I'm not sure that
there is one at the moment (I don't see anything recent at kernel.org),
so in that case I think -mm is the "back up plan". Al, please let me
know if you will take this?

Steve.

---------------------

Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
reads with append dio writes" thread on linux-fsdevel, this patch is my
current version of the fix proposed as option (b) in that thread.

Removing the i_size test from the direct i/o read path at vfs level
means that filesystems now have to deal with requests which are beyond
i_size themselves. These I've divided into three sets:

 a) Those with "no op" ->direct_IO (9p, cifs, ceph)
These are obviously not going to be an issue

 b) Those with "home brew" ->direct_IO (nfs, fuse)
I've been told that NFS should not have any problem with the larger
i_size, however I've added an extra test to FUSE to duplicate the
original behaviour just to be on the safe side.

 c) Those using __blockdev_direct_IO()
These call through to ->get_block() which should deal with the EOF
condition correctly. I've verified that with GFS2 and I believe that
Zheng has verified it for ext4. I've also run the test on XFS and it
passes both before and after this change.

The part of the patch in filemap.c looks a lot larger than it really is
- there are only two lines of real change. The rest is just indentation
of the contained code.

There remains a test of i_size though, which was added for btrfs. It
doesn't cause the other filesystems a problem as the test is performed
after ->direct_IO has been called. It is possible that there is a race
that does matter to btrfs, however this patch doesn't change that, so
its still an overall improvement.


Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 74f6ca5..77bcc30 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2727,6 +2727,9 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	inode = file->f_mapping->host;
 	i_size = i_size_read(inode);
 
+	if ((rw == READ) && (offset > i_size))
+		return 0;
+
 	/* optimization for short read */
 	if (async_dio && rw != WRITE && offset + count > i_size) {
 		if (offset >= i_size)
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a7f3e0..d56d3c1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1428,30 +1428,28 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
-		if (pos < size) {
-			retval = filemap_write_and_wait_range(mapping, pos,
+		retval = filemap_write_and_wait_range(mapping, pos,
 					pos + iov_length(iov, nr_segs) - 1);
-			if (!retval) {
-				retval = mapping->a_ops->direct_IO(READ, iocb,
-							iov, pos, nr_segs);
-			}
-			if (retval > 0) {
-				*ppos = pos + retval;
-				count -= retval;
-			}
+		if (!retval) {
+			retval = mapping->a_ops->direct_IO(READ, iocb,
+							   iov, pos, nr_segs);
+		}
+		if (retval > 0) {
+			*ppos = pos + retval;
+			count -= retval;
+		}
 
-			/*
-			 * Btrfs can have a short DIO read if we encounter
-			 * compressed extents, so if there was an error, or if
-			 * we've already read everything we wanted to, or if
-			 * there was a short read because we hit EOF, go ahead
-			 * and return.  Otherwise fallthrough to buffered io for
-			 * the rest of the read.
-			 */
-			if (retval < 0 || !count || *ppos >= size) {
-				file_accessed(filp);
-				goto out;
-			}
+		/*
+		 * Btrfs can have a short DIO read if we encounter
+		 * compressed extents, so if there was an error, or if
+		 * we've already read everything we wanted to, or if
+		 * there was a short read because we hit EOF, go ahead
+		 * and return.  Otherwise fallthrough to buffered io for
+		 * the rest of the read.
+		 */
+		if (retval < 0 || !count || *ppos >= size) {
+			file_accessed(filp);
+			goto out;
 		}
 	}
 



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] Fix race when checking i_size on direct i/o read
  2014-01-24 14:42                               ` Steven Whitehouse
@ 2014-01-27 10:13                                 ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2014-01-27 10:13 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Alexander Viro, Zheng Liu, Jan Kara, Dave Chinner, Linux-Fsdevel,
	Miklos Szeredi, Christoph Hellwig, Dmitry Monakhov, Zheng Liu,
	Lukas Czerner, linux-ext4, Chris Mason, Josef Bacik,
	linux-kernel, Andrew Morton

On Fri 24-01-14 14:42:22, Steven Whitehouse wrote:
> 
> So far I've had one ACK for this, and no other comments. So I think it
> is probably time to send this via some suitable tree. I'm guessing that
> the vfs tree would be the most appropriate route, but not sure that
> there is one at the moment (don't see anything recent at kernel.org)
> so in that case I think -mm is the "back up plan". Al, please let me
> know if you will take this?
> 
> Steve.
> 
> ---------------------
> 
> Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio
> reads with append dio writes" thread on linux-fsdevel, this patch is my
> current version of the fix proposed as option (b) in that thread.
  I think it would be good to have full information about the race we are
trying to solve, and why your patch fixes it, in the changelog. We can still
defer to the list archives for the discussion of why you've decided to fix it
the way you've fixed it.

Other than that I agree with the change so feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> Removing the i_size test from the direct i/o read path at vfs level
> means that filesystems now have to deal with requests which are beyond
> i_size themselves. These I've divided into three sets:
> 
>  a) Those with "no op" ->direct_IO (9p, cifs, ceph)
> These are obviously not going to be an issue
> 
>  b) Those with "home brew" ->direct_IO (nfs, fuse)
> I've been told that NFS should not have any problem with the larger
> i_size; however, I've added an extra test to FUSE to duplicate the
> original behaviour, just to be on the safe side.
> 
>  c) Those using __blockdev_direct_IO()
> These call through to ->get_block(), which should deal with the EOF
> condition correctly; a rough sketch of that pattern follows below. I've
> verified that with GFS2, and I believe that Zheng has verified it for
> ext4. I've also run the test on XFS and it passes both before and after
> this change.
> 
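> As an illustration of category (c) only (a hypothetical helper, not code
> taken from any of the filesystems above), a ->get_block() of this kind can
> simply leave the buffer unmapped for blocks at or beyond EOF; the generic
> dio code then ends the read early rather than returning data past i_size:
> 
>   /* Hypothetical example: refuse to map blocks at or beyond EOF. */
>   static int example_get_block(struct inode *inode, sector_t iblock,
>                                struct buffer_head *bh_result, int create)
>   {
>   	loff_t isize = i_size_read(inode);
>   	sector_t last_block = (isize + (1 << inode->i_blkbits) - 1)
>   					>> inode->i_blkbits;
> 
>   	if (!create && iblock >= last_block)
>   		return 0;	/* past EOF: bh_result stays unmapped */
> 
>   	/* ... normal lookup and map_bh(bh_result, ...) would go here ... */
>   	return 0;
>   }
> 
> Real filesystems typically get this behaviour for free from their block or
> extent lookup, since nothing is mapped beyond EOF; the sketch just makes
> the EOF handling explicit.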
> The part of the patch in filemap.c looks a lot larger than it really is
> - there are only two lines of real change. The rest is just indentation
> of the contained code.
> 
> There remains a test of i_size, though, which was added for btrfs. It
> doesn't cause the other filesystems a problem, as the test is performed
> after ->direct_IO has been called. It is possible that there is a race
> that does matter to btrfs; however, this patch doesn't change that, so
> it's still an overall improvement.
> 
> 
> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Acked-by: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Chris Mason <clm@fb.com>
> Cc: Josef Bacik <jbacik@fb.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 74f6ca5..77bcc30 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -2727,6 +2727,9 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
>  	inode = file->f_mapping->host;
>  	i_size = i_size_read(inode);
>  
> +	if ((rw == READ) && (offset > i_size))
> +		return 0;
> +
>  	/* optimization for short read */
>  	if (async_dio && rw != WRITE && offset + count > i_size) {
>  		if (offset >= i_size)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a7f3e0..d56d3c1 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1428,30 +1428,28 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
>  		if (!count)
>  			goto out; /* skip atime */
>  		size = i_size_read(inode);
> -		if (pos < size) {
> -			retval = filemap_write_and_wait_range(mapping, pos,
> +		retval = filemap_write_and_wait_range(mapping, pos,
>  					pos + iov_length(iov, nr_segs) - 1);
> -			if (!retval) {
> -				retval = mapping->a_ops->direct_IO(READ, iocb,
> -							iov, pos, nr_segs);
> -			}
> -			if (retval > 0) {
> -				*ppos = pos + retval;
> -				count -= retval;
> -			}
> +		if (!retval) {
> +			retval = mapping->a_ops->direct_IO(READ, iocb,
> +							   iov, pos, nr_segs);
> +		}
> +		if (retval > 0) {
> +			*ppos = pos + retval;
> +			count -= retval;
> +		}
>  
> -			/*
> -			 * Btrfs can have a short DIO read if we encounter
> -			 * compressed extents, so if there was an error, or if
> -			 * we've already read everything we wanted to, or if
> -			 * there was a short read because we hit EOF, go ahead
> -			 * and return.  Otherwise fallthrough to buffered io for
> -			 * the rest of the read.
> -			 */
> -			if (retval < 0 || !count || *ppos >= size) {
> -				file_accessed(filp);
> -				goto out;
> -			}
> +		/*
> +		 * Btrfs can have a short DIO read if we encounter
> +		 * compressed extents, so if there was an error, or if
> +		 * we've already read everything we wanted to, or if
> +		 * there was a short read because we hit EOF, go ahead
> +		 * and return.  Otherwise fallthrough to buffered io for
> +		 * the rest of the read.
> +		 */
> +		if (retval < 0 || !count || *ppos >= size) {
> +			file_accessed(filp);
> +			goto out;
>  		}
>  	}
>  
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


end of thread, other threads:[~2014-01-27 10:13 UTC | newest]

Thread overview: 21+ messages
2013-12-07 10:55 [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Zheng Liu
2013-12-16  9:37 ` Steven Whitehouse
2013-12-16 10:01   ` Jan Kara
2013-12-17  9:43     ` Steven Whitehouse
2013-12-17 11:16       ` Zheng Liu
2013-12-17 12:17         ` Steven Whitehouse
2013-12-17 16:41           ` Zheng Liu
2013-12-19 12:27             ` Steven Whitehouse
2013-12-19 22:44               ` Dave Chinner
2013-12-20  9:28                 ` Steven Whitehouse
2013-12-23  3:00                   ` Dave Chinner
2014-01-14 15:22                     ` Steven Whitehouse
2014-01-14 19:19                       ` Jan Kara
2014-01-15  7:19                         ` Zheng Liu
2014-01-16 15:35                           ` [RFC] [PATCH] Fix race when checking i_size on direct i/o read Steven Whitehouse
2014-01-17 10:20                             ` Miklos Szeredi
2014-01-24 14:42                               ` Steven Whitehouse
2014-01-27 10:13                                 ` Jan Kara
2013-12-17 14:01       ` [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes Jan Kara
2013-12-17 15:32         ` Steven Whitehouse
2013-12-17 16:39           ` Jan Kara
