All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [BUG] ext2/3/4: dio reads stale data when we do some append dio writes
Date: Tue, 19 Nov 2013 17:53:03 +0800	[thread overview]
Message-ID: <20131119095302.GA4534@gmail.com> (raw)

Hi all,

Currently we check i_size in buffered read path after we know the page
is updated.  But it could return some zero-filled pages to the userspace
when we do some append dio writes meanwhile we do some dio reads.  We
could use the following code snippet to reproduce this problem in a
ext2/3/4 filesystem.  In the mean time, I run this code snippet against
xfs and btrfs, and they can survive.  Furthermore, if we do some buffered
read, xfs also has the same problem.  So I guess that this might be a
generic problem.

code snippet:
  #define _GNU_SOURCE

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <memory.h>

  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <errno.h>

  #include <pthread.h>

  #define BUF_ALIGN	1024

  struct writer_data {
  	int fd;
  	size_t blksize;
  	char *buf;
  };

  static void *writer(void *arg)
  {
  	struct writer_data *data = (struct writer_data *)arg;
  	int ret;

  	ret = write(data->fd, data->buf, data->blksize);
  	if (ret < 0)
  		fprintf(stderr, "write file failed: %s\n", strerror(errno));

  	return NULL;
  }

  int main(int argc, char *argv[])
  {
  	pthread_t tid;
  	struct writer_data wdata;
  	size_t max_blocks = 10 * 1024;
  	size_t blksize = 1 * 1024 * 1024;
  	char *rbuf, *wbuf;
  	int readfd, writefd;
  	int i, j;

  	if (argc < 2) {
  		fprintf(stderr, "usage: %s [filename]\n", argv[0]);
  		exit(1);
  	}

  	writefd = open(argv[1], O_CREAT|O_DIRECT|O_WRONLY|O_APPEND|O_TRUNC, S_IRWXU);
  	if (writefd < 0) {
  		fprintf(stderr, "failed to open wfile: %s\n", strerror(errno));
  		exit(1);
  	}
  	readfd = open(argv[1], O_DIRECT|O_RDONLY, S_IRWXU);
  	if (readfd < 0) {
  		fprintf(stderr, "failed to open rfile: %s\n", strerror(errno));
  		exit(1);
  	}

  	if (posix_memalign((void **)&wbuf, BUF_ALIGN, blksize)) {
  		fprintf(stderr, "failed to alloc memory: %s\n", strerror(errno));
  		exit(1);
  	}

  	if (posix_memalign((void **)&rbuf, 4096, blksize)) {
  		fprintf(stderr, "failed to alloc memory: %s\n", strerror(errno));
  		exit(1);
  	}

  	memset(wbuf, 'a', blksize);

  	wdata.fd = writefd;
  	wdata.blksize = blksize;
  	wdata.buf = wbuf;

  	for (i = 0; i < max_blocks; i++) {
  		void *retval;
  		int ret;

  		ret = pthread_create(&tid, NULL, writer, &wdata);
  		if (ret) {
  			fprintf(stderr, "create thread failed: %s\n", strerror(errno));
  			exit(1);
  		}

  		memset(rbuf, 'b', blksize);
  		do {
  			ret = pread(readfd, rbuf, blksize, i * blksize);
  		} while (ret <= 0);

  		if (ret < 0) {
  			fprintf(stderr, "read file failed: %s\n", strerror(errno));
  			exit(1);
  		}

  		if (pthread_join(tid, &retval)) {
  			fprintf(stderr, "pthread join failed: %s\n", strerror(errno));
  			exit(1);
  		}

  		if (ret >= 0) {
  			for (j = 0; j < ret; j++) {
  				if (rbuf[i] != 'a') {
  					fprintf(stderr, "encounter an error: offset %ld\n",
  						i);
  					goto err;
  				}
  			}
  		}
  	}

  err:
  	free(wbuf);
  	free(rbuf);

  	return 0;
  }

build & run:
  $ gcc code.c -o code -lpthread # build
  $ ./code ${filename} # run

As we expected, we should read nothing or data with 'a'.  But now we
read data with '0'.  I take a closer look at the code and it seems that
there is a bug in vfs.  Let me describe my found here.

  reader					writer
                                                generic_file_aio_write()
                                                ->__generic_file_aio_write()
                                                  ->generic_file_direct_write()
  generic_file_aio_read()
  ->do_generic_file_read()
    [fallback to buffered read]

    ->find_get_page()
    ->page_cache_sync_readahead()
    ->find_get_page()
    [in find_page label, we couldn't find a
     page before and after calling
     page_cache_sync_readahead().  So go to
     no_cached_page label]

    ->page_cache_alloc_cold()
    ->add_to_page_cache_lru()
    [in no_cached_page label, we alloc a page
     and goto readpage label.]

    ->aops->readpage()
    [in readpage label, readpage() callback
     is called and mpage_readpage() return a
     zero-filled page (e.g. ext3/4), and go
     to page_ok label]

                                                  ->a_ops->direct_IO()
                                                  ->i_size_write()
                                                  [we enlarge the i_size]

    Here we check i_size
    [in page_ok label, we check i_size but
     it has been enlarged.  Thus, we pass
     the check and return a zero-filled page]

I attach a patch below to fix this problem in vfs.  However, to be honest, the
fix is very dirty.  But frankly I haven't had a good solution until now.  So I
send this mail to talk about this problem.  Please let me know if I miss
something.

Any comment and idea are always welcome.

Thanks,
						- Zheng

From: Zheng Liu <wenqing.lz@taobao.com>

Subject: [PATCH] vfs: check i_size at the beginning of do_generic_file_read()

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 mm/filemap.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6aec4..9de2ad8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1107,6 +1107,8 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 	pgoff_t index;
 	pgoff_t last_index;
 	pgoff_t prev_index;
+	pgoff_t end_index;
+	loff_t isize;
 	unsigned long offset;      /* offset into pagecache page */
 	unsigned int prev_offset;
 	int error;
@@ -1117,10 +1119,17 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
 	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	/*
+	 * We must check i_size at the beginning of this function to avoid to return
+	 * zero-filled page to userspace when the application does append dio writes.
+	 */
+	isize = i_size_read(inode);
+	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+	if (unlikely(!isize || index > end_index))
+		goto out;
+
 	for (;;) {
 		struct page *page;
-		pgoff_t end_index;
-		loff_t isize;
 		unsigned long nr, ret;
 
 		cond_resched();
-- 
1.7.9.7


             reply	other threads:[~2013-11-19  9:53 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-19  9:53 Zheng Liu [this message]
2013-11-19 10:22 ` [BUG] ext2/3/4: dio reads stale data when we do some append dio writes Christoph Hellwig
2013-11-19 10:45   ` Zheng Liu
2013-11-19 11:01     ` Christoph Hellwig
2013-11-19 11:19       ` Zheng Liu
2013-11-19 11:18         ` Christoph Hellwig
2013-11-19 11:51           ` Zheng Liu
2013-11-19 11:51             ` Zheng Liu
2013-11-19 12:09             ` Dave Chinner
2013-11-19 12:09               ` Dave Chinner
2013-11-19 12:18               ` Zheng Liu
2013-11-19 12:01           ` Dave Chinner
2013-11-19 12:20             ` Zheng Liu
2013-11-19 12:20               ` Zheng Liu
2013-11-19 10:54 ` Dmitry Monakhov
2013-11-19 11:45   ` Zheng Liu
2013-11-27 23:01 ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131119095302.GA4534@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.