linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
@ 2007-12-10  7:52 Hisashi Hifumi
  2007-12-12  1:00 ` Zach Brown
  0 siblings, 1 reply; 6+ messages in thread
From: Hisashi Hifumi @ 2007-12-10  7:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-fsdevel

Hi.

Current dio has some problems:
1, In ext3 ordered, dio write can return with EIO because of the race 
between invalidation of
a page and jbd. jbd pins the bhs while committing journal so 
try_to_release_page fails when jbd
is committing the transaction.

Past discussion about this issue is as follows.
http://marc.info/?t=119343431200004&r=1&w=2
http://marc.info/?t=112656762800002&r=1&w=2

2, invalidate_inode_pages2_range() sets ret=-EIO when invalidate_complete_page2()
fails, but this ret is cleared if do_launder_page() succeed on a page of 
next index.
In this case, dio is carried out even if invalidate_complete_page2() fails 
on some pages.
This can cause inconsistency between memory and blocks on HDD because the page
cache still exists.

I solved problems above by introducing invalidate_inode_pages3_range() and 
falling
through to buffered I/O when invalidation of a page failed.
We can distinguish between failure of page invalidation and other errors
with the return value of invalidate_inode_pages3_range(). When invalidation 
of a page
fails, this function exits immediately so that inconsistency between memory and
HDD blocks can be avoided and the number of the cached page that should be
acquired again is smaller.

Thanks.

Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>

diff -Nrup linux-2.6.24-rc4.org/include/linux/fs.h 
linux-2.6.24-rc4/include/linux/fs.h
--- linux-2.6.24-rc4.org/include/linux/fs.h	2007-12-06 12:05:30.000000000 +0900
+++ linux-2.6.24-rc4/include/linux/fs.h	2007-12-06 12:06:38.000000000 +0900
@@ -1670,6 +1670,8 @@ static inline void invalidate_remote_ino
  extern int invalidate_inode_pages2(struct address_space *mapping);
  extern int invalidate_inode_pages2_range(struct address_space *mapping,
  					 pgoff_t start, pgoff_t end);
+extern int invalidate_inode_pages3_range(struct address_space *mapping,
+					 pgoff_t start, pgoff_t end);
  extern int write_inode_now(struct inode *, int);
  extern int filemap_fdatawrite(struct address_space *);
  extern int filemap_flush(struct address_space *);
diff -Nrup linux-2.6.24-rc4.org/mm/filemap.c linux-2.6.24-rc4/mm/filemap.c
--- linux-2.6.24-rc4.org/mm/filemap.c	2007-12-06 12:05:31.000000000 +0900
+++ linux-2.6.24-rc4/mm/filemap.c	2007-12-06 12:06:38.000000000 +0900
@@ -2491,14 +2491,18 @@ generic_file_direct_IO(int rw, struct ki
  	/*
  	 * After a write we want buffered reads to be sure to go to disk to get
  	 * the new data.  We invalidate clean cached page from the region we're
-	 * about to write.  We do this *before* the write so that we can return
-	 * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
+	 * about to write.
+	 * If invalidation of any pages fails, we return 0 so that we can fall
+	 * through to buffered I/O.
  	 */
  	if (rw == WRITE && mapping->nrpages) {
-		retval = invalidate_inode_pages2_range(mapping,
+		retval = invalidate_inode_pages3_range(mapping,
  					offset >> PAGE_CACHE_SHIFT, end);
-		if (retval)
+		if (retval) {
+			if (retval > 0)
+				retval = 0;
  			goto out;
+		}
  	}

  	retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
diff -Nrup linux-2.6.24-rc4.org/mm/truncate.c linux-2.6.24-rc4/mm/truncate.c
--- linux-2.6.24-rc4.org/mm/truncate.c	2007-12-06 12:05:31.000000000 +0900
+++ linux-2.6.24-rc4/mm/truncate.c	2007-12-07 12:14:13.000000000 +0900
@@ -452,6 +452,86 @@ int invalidate_inode_pages2_range(struct
  EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);

  /**
+ * invalidate_inode_pages3_range - remove range of pages from an address_space
+ * @mapping: the address_space
+ * @start: the page offset 'from' which to invalidate
+ * @end: the page offset 'to' which to invalidate (inclusive)
+ *
+ * Any pages which are found to be mapped into pagetables are unmapped prior to
+ * invalidation.
+ *
+ * When a page can not be invalidated, this function exits immediately.
+ *
+ * Returns 1 if any pages could not be invalidated.
+ */
+int invalidate_inode_pages3_range(struct address_space *mapping,
+				  pgoff_t start, pgoff_t end)
+{
+	struct pagevec pvec;
+	pgoff_t next;
+	int i;
+	int ret = 0;
+	int did_range_unmap = 0;
+	int wrapped = 0;
+
+	pagevec_init(&pvec, 0);
+	next = start;
+	while (next <= end && !ret && !wrapped &&
+		pagevec_lookup(&pvec, mapping, next,
+			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+			pgoff_t page_index;
+
+			lock_page(page);
+			if (page->mapping != mapping) {
+				unlock_page(page);
+				continue;
+			}
+			page_index = page->index;
+			next = page_index + 1;
+			if (next == 0)
+				wrapped = 1;
+			if (page_index > end) {
+				unlock_page(page);
+				break;
+			}
+			wait_on_page_writeback(page);
+			if (page_mapped(page)) {
+				if (!did_range_unmap) {
+					/*
+					 * Zap the rest of the file in one hit.
+					 */
+					unmap_mapping_range(mapping,
+					   (loff_t)page_index<<PAGE_CACHE_SHIFT,
+					   (loff_t)(end - page_index + 1)
+							<< PAGE_CACHE_SHIFT,
+					    0);
+					did_range_unmap = 1;
+				} else {
+					/*
+					 * Just zap this page
+					 */
+					unmap_mapping_range(mapping,
+					  (loff_t)page_index<<PAGE_CACHE_SHIFT,
+					  PAGE_CACHE_SIZE, 0);
+				}
+			}
+			BUG_ON(page_mapped(page));
+			ret = do_launder_page(mapping, page);
+			if (ret == 0 && !invalidate_complete_page2(mapping, page))
+				ret = 1;
+			unlock_page(page);
+		}
+		if (!ret)
+			pagevec_release(&pvec);
+		cond_resched();
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(invalidate_inode_pages3_range);
+
+/**
   * invalidate_inode_pages2 - remove all pages from an address_space
   * @mapping: the address_space
   * 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
  2007-12-10  7:52 [PATCH] dio: falling through to buffered I/O when invalidation of a page fails Hisashi Hifumi
@ 2007-12-12  1:00 ` Zach Brown
  2007-12-12  7:51   ` [PATCH] dio: falling through to buffered I/O when invalidationof " Hisashi Hifumi
  2007-12-14 18:59   ` [PATCH] dio: falling through to buffered I/O when invalidation of " Badari Pulavarty
  0 siblings, 2 replies; 6+ messages in thread
From: Zach Brown @ 2007-12-12  1:00 UTC (permalink / raw)
  To: Hisashi Hifumi
  Cc: akpm, linux-kernel, linux-fsdevel, Chris Mason, Badari Pulavarty

Hisashi Hifumi wrote:
> Hi.
> 
> Current dio has some problems:
> 1, In ext3 ordered, dio write can return with EIO because of the race
> between invalidation of
> a page and jbd. jbd pins the bhs while committing journal so
> try_to_release_page fails when jbd
> is committing the transaction.

Yeah.  It sure would be fantastic if some ext3 expert could stop this
from happening somehow.  But that hasn't happened in.. uh.. Badari, for
how many years has this been on the radar? :)

> 
> Past discussion about this issue is as follows.
> http://marc.info/?t=119343431200004&r=1&w=2
> http://marc.info/?t=112656762800002&r=1&w=2
> 
> 2, invalidate_inode_pages2_range() sets ret=-EIO when
> invalidate_complete_page2()
> fails, but this ret is cleared if do_launder_page() succeed on a page of
> next index.

Oops.  That's too bad.  So maybe we should fix it by not stomping on
that return code?

	ret2 = do_launder()
	if (ret2 == 0)
		ret2 = invalidate()
	if (ret == 0)
		ret = ret2

I'd be surprised if we ever wanted to mask an -EIO when later pages
laundered successfully.

> In this case, dio is carried out even if invalidate_complete_page2()
> fails on some pages.
> This can cause inconsistency between memory and blocks on HDD because
> the page
> cache still exists.

Yeah.

> I solved problems above by introducing invalidate_inode_pages3_range()
> and falling
> through to buffered I/O when invalidation of a page failed.

Well, I like the idea of more intelligently dealing with the known
problem between dio and ext3.  I'm not sure that falling back to
buffered is right.

If dio can tell that it only failed because jbd held the buffer, should
we have waited for the transaction to complete before trying to
invalidate again?

If we could do that, we'd avoid performing the IO twice.

> We can distinguish between failure of page invalidation and other errors
> with the return value of invalidate_inode_pages3_range().

I'm not sure duplicating the invalidation loop into a new function is
the right thing.  Maybe we'd just tweak inode_pages2 to indicate to the
caller the specific failing circumstances somehow.  Maybe.

- z

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] dio: falling through to buffered I/O when invalidationof a page fails
  2007-12-12  1:00 ` Zach Brown
@ 2007-12-12  7:51   ` Hisashi Hifumi
  2007-12-14 18:59   ` [PATCH] dio: falling through to buffered I/O when invalidation of " Badari Pulavarty
  1 sibling, 0 replies; 6+ messages in thread
From: Hisashi Hifumi @ 2007-12-12  7:51 UTC (permalink / raw)
  To: Zach Brown
  Cc: akpm, linux-kernel, linux-fsdevel, Chris Mason, Badari Pulavarty


 >
 >>
 >> Past discussion about this issue is as follows.
 >> http://marc.info/?t=119343431200004&r=1&w=2
 >> http://marc.info/?t=112656762800002&r=1&w=2
 >>
 >> 2, invalidate_inode_pages2_range() sets ret=-EIO when
 >> invalidate_complete_page2()
 >> fails, but this ret is cleared if do_launder_page() succeed on a page of
 >> next index.
 >
 >Oops.  That's too bad.  So maybe we should fix it by not stomping on
 >that return code?
 >
 >	ret2 = do_launder()
 >	if (ret2 == 0)
 >		ret2 = invalidate()
 >	if (ret == 0)
 >		ret = ret2
 >
 >I'd be surprised if we ever wanted to mask an -EIO when later pages
 >laundered successfully.

This can preserve ret of -EIO. But it cannot be distinguished between
the error of invalidation and other error only by this fix.


 >> I solved problems above by introducing invalidate_inode_pages3_range()
 >> and falling
 >> through to buffered I/O when invalidation of a page failed.
 >
 >Well, I like the idea of more intelligently dealing with the known
 >problem between dio and ext3.  I'm not sure that falling back to
 >buffered is right.

In the past discussion above, it was said that dio was a sort of best
effort function. So dio falls back to buffered IO when writing to hall, I think.
My idea is that this notion applies to this issue.

 >
 >> We can distinguish between failure of page invalidation and other errors
 >> with the return value of invalidate_inode_pages3_range().
 >
 >I'm not sure duplicating the invalidation loop into a new function is
 >the right thing.  Maybe we'd just tweak inode_pages2 to indicate to the
 >caller the specific failing circumstances somehow.  Maybe.

It will be better that adding one parameter to invalidate_inode_pages2_range so
that caller can deal with a number of cases.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
  2007-12-12  1:00 ` Zach Brown
  2007-12-12  7:51   ` [PATCH] dio: falling through to buffered I/O when invalidationof " Hisashi Hifumi
@ 2007-12-14 18:59   ` Badari Pulavarty
  2007-12-14 19:15     ` Zach Brown
  1 sibling, 1 reply; 6+ messages in thread
From: Badari Pulavarty @ 2007-12-14 18:59 UTC (permalink / raw)
  To: Zach Brown
  Cc: Hisashi Hifumi, Andrew Morton, lkml, linux-fsdevel, Chris Mason

On Tue, 2007-12-11 at 17:00 -0800, Zach Brown wrote:
> Hisashi Hifumi wrote:
> > Hi.
> > 
> > Current dio has some problems:
> > 1, In ext3 ordered, dio write can return with EIO because of the race
> > between invalidation of
> > a page and jbd. jbd pins the bhs while committing journal so
> > try_to_release_page fails when jbd
> > is committing the transaction.
> 
> Yeah.  It sure would be fantastic if some ext3 expert could stop this
> from happening somehow.  But that hasn't happened in.. uh.. Badari, for
> how many years has this been on the radar? :)

I used to have a test case that would reproduce the problem some what
consistently. But with invalidate_range() introduction and Jan Kara's
re-write of journal commit handling, I can't reproduce the problem
anymore. So I gave up fixing the problem which I can't reproduce.

If anyone has a testcase - I can take a look at the problem again.

Thanks,
Badari




^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] dio: falling through to buffered I/O when invalidation of a page fails
  2007-12-14 18:59   ` [PATCH] dio: falling through to buffered I/O when invalidation of " Badari Pulavarty
@ 2007-12-14 19:15     ` Zach Brown
  2007-12-17  2:38       ` [PATCH] dio: falling through to buffered I/O when invalidationof " Hisashi Hifumi
  0 siblings, 1 reply; 6+ messages in thread
From: Zach Brown @ 2007-12-14 19:15 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Hisashi Hifumi, Andrew Morton, lkml, linux-fsdevel, Chris Mason


> If anyone has a testcase - I can take a look at the problem again.

I can try and throw something together..

- z

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] dio: falling through to buffered I/O when invalidationof a page fails
  2007-12-14 19:15     ` Zach Brown
@ 2007-12-17  2:38       ` Hisashi Hifumi
  0 siblings, 0 replies; 6+ messages in thread
From: Hisashi Hifumi @ 2007-12-17  2:38 UTC (permalink / raw)
  To: Zach Brown, Badari Pulavarty
  Cc: Andrew Morton, lkml, linux-fsdevel, Chris Mason


At 04:15 07/12/15, Zach Brown wrote:
 >
 >> If anyone has a testcase - I can take a look at the problem again.
 >
 >I can try and throw something together..
 >
 >- z

I did a test by using fsstress.

I modified the dio write() of fsstress to check return value, and input
following command;

# fsstress -d /root/testdir/ -l 0 -n 1000 -p 1000 -z -f creat=1 -f write=1 
-f dwrite=1

I got EIO from dio write.


I encountered some PostgreSQL mailing list archive that mentioned about 
this issue.
http://groups.google.com/group/pgsql.hackers/browse_thread/thread/c7515ccd5ac1269c/dc2ddeac3fe0d8c1?lnk=raot&fwc=1

dio can be used in PostgreSQL to write transaction log(WAL). It is said 
that when dio is used,
EIO is returned under some workload.
  


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-12-17  2:39 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-10  7:52 [PATCH] dio: falling through to buffered I/O when invalidation of a page fails Hisashi Hifumi
2007-12-12  1:00 ` Zach Brown
2007-12-12  7:51   ` [PATCH] dio: falling through to buffered I/O when invalidationof " Hisashi Hifumi
2007-12-14 18:59   ` [PATCH] dio: falling through to buffered I/O when invalidation of " Badari Pulavarty
2007-12-14 19:15     ` Zach Brown
2007-12-17  2:38       ` [PATCH] dio: falling through to buffered I/O when invalidationof " Hisashi Hifumi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).