* [Patch] invalidate range of pages after direct IO write
@ 2005-01-29  2:16 Zach Brown
  2005-02-04  0:19 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Zach Brown @ 2005-01-29  2:16 UTC (permalink / raw)
  To: linux-kernel; +Cc: Zach Brown


After a direct IO write, only invalidate the pages that the write intersected.
invalidate_inode_pages2_range(mapping, pgoff_t start, pgoff_t end) is added and
called from generic_file_direct_IO().  This doesn't break some subtle agreement
with some other part of the code, does it?
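
As a worked example (assuming 4k pages, with len standing for the total
iov_length): a 512 byte O_DIRECT write at offset 12288 now only invalidates
page index 3, since

	start = offset >> PAGE_CACHE_SHIFT;			/* 12288 >> 12 = 3 */
	end   = (offset + len - 1) >> PAGE_CACHE_SHIFT;		/* 12799 >> 12 = 3 */

rather than every cached page in the file.  The "- 1" keeps a write that
ends exactly on a page boundary from invalidating the page after it.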

While we're in there, invalidate_inode_pages2() was calling
unmap_mapping_range() with the wrong convention in the single page case.  It
was providing the byte offset of the final page rather than the length of the
hole being unmapped.  This is also fixed.

This was lightly tested with a 10k op fsx run with O_DIRECT on a 16MB file in
ext3 on a junky old IDE drive.  Totaling vmstat columns of blocks read and
written during the runs shows that read traffic drops significantly.  The run
time seems to have gone down a little.

Two runs before the patch gave the following real/user/sys times and total
blocks in and out:

0m28.029s 0m20.093s 0m3.166s 16673 125107 
0m27.949s 0m20.068s 0m3.227s 18426 126094

and after the patch:

0m26.775s 0m19.996s 0m3.060s 3505 124982
0m26.856s 0m19.935s 0m3.052s 3505 125279

Signed-off-by: Zach Brown <zach.brown@oracle.com>
---

 include/linux/fs.h |    2 ++
 mm/filemap.c       |    5 ++++-
 mm/truncate.c      |   52 ++++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 44 insertions(+), 15 deletions(-)

Index: 2.6-mm-odirinv/include/linux/fs.h
===================================================================
--- 2.6-mm-odirinv.orig/include/linux/fs.h	2005-01-28 14:14:19.000000000 -0800
+++ 2.6-mm-odirinv/include/linux/fs.h	2005-01-28 14:14:35.000000000 -0800
@@ -1369,6 +1369,8 @@
 		invalidate_inode_pages(inode->i_mapping);
 }
 extern int invalidate_inode_pages2(struct address_space *mapping);
+extern int invalidate_inode_pages2_range(struct address_space *mapping,
+					 pgoff_t start, pgoff_t end);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
Index: 2.6-mm-odirinv/mm/filemap.c
===================================================================
--- 2.6-mm-odirinv.orig/mm/filemap.c	2005-01-28 13:32:06.000000000 -0800
+++ 2.6-mm-odirinv/mm/filemap.c	2005-01-28 14:21:04.000000000 -0800
@@ -2325,7 +2325,10 @@
 		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
 						offset, nr_segs);
 		if (rw == WRITE && mapping->nrpages) {
-			int err = invalidate_inode_pages2(mapping);
+			pgoff_t end = (offset + iov_length(iov, nr_segs) - 1)
+				      >> PAGE_CACHE_SHIFT;
+			int err = invalidate_inode_pages2_range(mapping,
+					offset >> PAGE_CACHE_SHIFT, end);
 			if (err)
 				retval = err;
 		}
Index: 2.6-mm-odirinv/mm/truncate.c
===================================================================
--- 2.6-mm-odirinv.orig/mm/truncate.c	2005-01-28 13:32:06.000000000 -0800
+++ 2.6-mm-odirinv/mm/truncate.c	2005-01-28 17:03:09.783939857 -0800
@@ -99,7 +99,7 @@
 }
 
 /**
- * truncate_inode_pages - truncate range of pages specified by start and
+ * truncate_inode_pages_range - truncate range of pages specified by start and
  * end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
@@ -279,28 +279,38 @@
 EXPORT_SYMBOL(invalidate_inode_pages);
 
 /**
- * invalidate_inode_pages2 - remove all pages from an address_space
+ * invalidate_inode_pages2_range - remove range of pages from an address_space
  * @mapping - the address_space
+ * @start: the page offset 'from' which to invalidate
+ * @end: the page offset 'to' which to invalidate (inclusive)
  *
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *
  * Returns -EIO if any pages could not be invalidated.
  */
-int invalidate_inode_pages2(struct address_space *mapping)
+int invalidate_inode_pages2_range(struct address_space *mapping,
+				  pgoff_t start, pgoff_t end)
 {
 	struct pagevec pvec;
-	pgoff_t next = 0;
+	pgoff_t next;
 	int i;
 	int ret = 0;
-	int did_full_unmap = 0;
+	int did_range_unmap = 0;
 
 	pagevec_init(&pvec, 0);
-	while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	next = start;
+	while (next <= end &&
+	       !ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			int was_dirty;
 
+			if (page->index > end) {
+				next = page->index;
+				break;
+			}
+
 			lock_page(page);
 			if (page->mapping != mapping) {	/* truncate race? */
 				unlock_page(page);
@@ -309,24 +319,23 @@
 			wait_on_page_writeback(page);
 			next = page->index + 1;
 			while (page_mapped(page)) {
-				if (!did_full_unmap) {
+				if (!did_range_unmap) {
 					/*
 					 * Zap the rest of the file in one hit.
-					 * FIXME: invalidate_inode_pages2()
-					 * should take start/end offsets.
 					 */
 					unmap_mapping_range(mapping,
-						page->index << PAGE_CACHE_SHIFT,
-					  	-1, 0);
-					did_full_unmap = 1;
+					    page->index << PAGE_CACHE_SHIFT,
+					    (end - page->index + 1)
+							<< PAGE_CACHE_SHIFT,
+					    0);
+					did_range_unmap = 1;
 				} else {
 					/*
 					 * Just zap this page
 					 */
 					unmap_mapping_range(mapping,
 					  page->index << PAGE_CACHE_SHIFT,
-					  (page->index << PAGE_CACHE_SHIFT)+1,
-					  0);
+					  PAGE_CACHE_SIZE, 0);
 				}
 			}
 			was_dirty = test_clear_page_dirty(page);
@@ -342,4 +351,19 @@
 	}
 	return ret;
 }
+EXPORT_SYMBOL_GPL(invalidate_inode_pages2_range);
+
+/**
+ * invalidate_inode_pages2 - remove all pages from an address_space
+ * @mapping - the address_space
+ *
+ * Any pages which are found to be mapped into pagetables are unmapped prior to
+ * invalidation.
+ *
+ * Returns -EIO if any pages could not be invalidated.
+ */
+int invalidate_inode_pages2(struct address_space *mapping)
+{
+	return invalidate_inode_pages2_range(mapping, 0, ~0UL);
+}
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);


* Re: [Patch] invalidate range of pages after direct IO write
  2005-01-29  2:16 [Patch] invalidate range of pages after direct IO write Zach Brown
@ 2005-02-04  0:19 ` Andrew Morton
  2005-02-04  1:52   ` Zach Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2005-02-04  0:19 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-kernel, zach.brown

Zach Brown <zach.brown@oracle.com> wrote:
>
> After a direct IO write, only invalidate the pages that the write intersected.
> invalidate_inode_pages2_range(mapping, pgoff_t start, pgoff_t end) is added and
> called from generic_file_direct_IO().  This doesn't break some subtle agreement
> with some other part of the code, does it?

It should be OK.

Note that the same optimisation should be made in the call to
unmap_mapping_range() in generic_file_direct_IO().  Currently we try and
unmap the whole file, even if we're only writing a single byte.  Given that
you're now calculating iov_length() in there we might as well use that
number a few lines further up in that function.
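
Something like this, I mean (untested sketch, assuming the existing code
there is the blanket unmap_mapping_range(mapping, 0, -1, 0) call):

	if (rw == WRITE && mapping_mapped(mapping))
		unmap_mapping_range(mapping, offset,
				    iov_length(iov, nr_segs), 0);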

Note that invalidate_inode_pages2[_range] also does the unmapping thing -
in the direct-io case we don't expect that path to ever trigger: the only
way we'll find mapped pages here is if someone raced and faulted a page
back in.

Reading the code, I'm unable to convince myself that it won't go into an
infinite loop if it finds a page at page->index = -1.  But I didn't try
very hard ;)

Minor note on this:

	return invalidate_inode_pages2_range(mapping, 0, ~0UL);

I just use `-1' there.  We don't _know_ that pgoff_t is an unsigned long. 
Some smarty may come along one day and make it unsigned long long, in which
case the code will break.  Using -1 here just works everywhere.
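
For instance, on a 32-bit box where that smarty has made pgoff_t an
unsigned long long:

	pgoff_t a = ~0UL;	/* 0x00000000ffffffff - too small */
	pgoff_t b = -1;		/* 0xffffffffffffffff - the maximum pgoff_t */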

I'll make that change and plop the patch into -mm, but we need to think
about the infinite-loop problem..


* Re: [Patch] invalidate range of pages after direct IO write
  2005-02-04  0:19 ` Andrew Morton
@ 2005-02-04  1:52   ` Zach Brown
  2005-02-04  2:28     ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Zach Brown @ 2005-02-04  1:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

> Note that the same optimisation should be made in the call to
> unmap_mapping_range() in generic_file_direct_IO().  Currently we try and
> unmap the whole file, even if we're only writing a single byte.  Given that
> you're now calculating iov_length() in there we might as well use that
> number a few lines further up in that function.

Indeed.  I can throw that together.  I also have a patch that introduces
filemap_write_and_wait_range() and calls it from the read bit of
__blockdev_direct_IO().  It didn't change the numbers for the cheesy fsx
load I was using, and then I got distracted.  I can try harder.
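
Roughly what I have in mind for it (a sketch only, the details may well
change when I dust the patch off):

	/* write back and wait on just lstart..lend (byte offsets, inclusive) */
	int filemap_write_and_wait_range(struct address_space *mapping,
					 loff_t lstart, loff_t lend);

so the direct IO read path can do something like

	err = filemap_write_and_wait_range(mapping, offset,
				offset + iov_length(iov, nr_segs) - 1);

instead of writing back and waiting on the whole mapping.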

> Reading the code, I'm unable to convince myself that it won't go into an
> infinite loop if it finds a page at page->index = -1.  But I didn't try
> very hard ;)

I'm unconvinced as well.  I got the pattern from
truncate_inode_pages_range(), and it seems to have a related problem when
end is something that page->index can never be greater than.  It just
starts over.

I wonder if we should add some mapping_for_each_range() (less ridiculous
names welcome :)) macro that handles this crap for the caller who just
works with page pointers.  We could introduce some iterator struct that
the caller would put on the stack and pass in to hold state, something
like the 'n' in list_for_each_safe().  It looks like a few
pagevec_lookup() callers could use this.
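
The shape I'm imagining, very roughly (all of these names are invented,
and this ignores the release/locking details that make it interesting):

	struct mapping_range_iter {
		struct pagevec pvec;	/* current batch from pagevec_lookup() */
		unsigned int i;		/* cursor within pvec */
		pgoff_t next;		/* next index to look up */
		pgoff_t end;		/* last index of interest, inclusive */
	};

so the open-coded lookup loops collapse into one loop over struct page
pointers with the iterator state on the caller's stack.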

> Minor note on this:
> 
> 	return invalidate_inode_pages2_range(mapping, 0, ~0UL);
> 
> I just use `-1' there.

Roger.  I was just mimicking invalidate_inode_pages().

> I'll make that change and plop the patch into -mm, but we need to think
> about the infinite-loop problem..

I can try hacking together that macro and auditing pagevec_lookup()
callers..

- z


* Re: [Patch] invalidate range of pages after direct IO write
  2005-02-04  1:52   ` Zach Brown
@ 2005-02-04  2:28     ` Andrew Morton
  2005-02-04 23:12       ` Zach Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2005-02-04  2:28 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-kernel

Zach Brown <zach.brown@oracle.com> wrote:
>
> > I'll make that change and plop the patch into -mm, but we need to think
>  > about the infinite-loop problem..
> 
>  I can try hacking together that macro and auditing pagevec_lookup()
>  callers..

I'd be inclined to hold off on the macro until we actually get the
open-coded stuff right..  Sometimes the page lookup loops take an end+1
argument and sometimes they take an inclusive `end'.  I think.  That might
make it a bit messy.

How's this look?



- Don't look up more pages than we're going to use

- Don't test page->index until we've locked the page

- Check for the cursor wrapping at the end of the mapping.

--- 25/mm/truncate.c~invalidate-range-of-pages-after-direct-io-write-fix	2005-02-03 18:20:22.000000000 -0800
+++ 25-akpm/mm/truncate.c	2005-02-03 18:28:24.627796400 -0800
@@ -264,18 +264,14 @@ int invalidate_inode_pages2_range(struct
 	pagevec_init(&pvec, 0);
 	next = start;
 	while (next <= end &&
-	       !ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	       !ret && pagevec_lookup(&pvec, mapping, next,
+			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			int was_dirty;
 
-			if (page->index > end) {
-				next = page->index;
-				break;
-			}
-
 			lock_page(page);
-			if (page->mapping != mapping) {	/* truncate race? */
+			if (page->mapping != mapping || page->index > end) {
 				unlock_page(page);
 				continue;
 			}
@@ -311,6 +307,8 @@ int invalidate_inode_pages2_range(struct
 		}
 		pagevec_release(&pvec);
 		cond_resched();
+		if (next == 0)
+			break;		/* The pgoff_t wrapped */
 	}
 	return ret;
 }
_



* Re: [Patch] invalidate range of pages after direct IO write
  2005-02-04  2:28     ` Andrew Morton
@ 2005-02-04 23:12       ` Zach Brown
  2005-02-04 23:35         ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Zach Brown @ 2005-02-04 23:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

> I'd be inclined to hold off on the macro until we actually get the
> open-coded stuff right..  Sometimes the page lookup loops take an end+1
> argument and sometimes they take an inclusive `end'.  I think.  That might
> make it a bit messy.
> 
> How's this look?

Looks good.  It certainly stops the worst behaviour we were worried
about.  I wonder about 2 things:

1) Now that we're calculating a specific length for pagevec_lookup(), is
testing that page->index > end still needed for pages that get remapped
somewhere weird before we locked them?  If it is, I imagine we should test
for < start as well.

2) If we get a pagevec full of pages that fail the != mapping test we
will probably just try again, not having changed next.  This is fine, we
won't find them in our mapping again.  But this won't happen if next
started as 0 and we didn't update it.  I don't know if retrying is the
intended behaviour or if we care that the start == 0 case doesn't do it.

I'm getting less excited about the iterating macro the more I think about
it.  Tearing down the pagevec in some complicated for(;;) doesn't sound
reliable, and I fear that some function that takes a per-page callback
function pointer from the caller will turn people's stomachs.

- z


* Re: [Patch] invalidate range of pages after direct IO write
  2005-02-04 23:12       ` Zach Brown
@ 2005-02-04 23:35         ` Andrew Morton
  2005-02-07 21:19           ` Zach Brown
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2005-02-04 23:35 UTC (permalink / raw)
  To: Zach Brown; +Cc: linux-kernel

Zach Brown <zach.brown@oracle.com> wrote:
>
> Andrew Morton wrote:
> 
> > I'd be inclined to hold off on the macro until we actually get the
> > open-coded stuff right..  Sometimes the page lookup loops take an end+1
> > argument and sometimes they take an inclusive `end'.  I think.  That might
> > make it a bit messy.
> > 
> > How's this look?
> 
> Looks good.  It certainly stops the worst behaviour we were worried
> about.  I wonder about 2 things:
> 
> 1) Now that we're calculating a specific length for pagevec_lookup(), is
> testing that page->index > end still needed for pages that get remapped
> somewhere weird before we locked them?  If it is, I imagine we should test
> for < start as well.

Nope.  We're guaranteed that the pages which pagevec_lookup() returned are

a) at or beyond `start' and

b) in ascending pgoff_t order, although not necessarily contiguous. 
   There will be gaps for not-present pages.  It's just an in-order gather.

So we need the `page->index > end' test to cope with the situation where a
request for three pages returned the three pages at indices 10, 11 and
3,000,000.

> 2) If we get a pagevec full of pages that fail the != mapping test we
> will probably just try again, not having changed next.  This is fine, we
> won't find them in our mapping again.

Yes, good point.  That page should go away once we drop our ref, and
it's not in the radix tree any more.

>  But this won't happen if next
> started as 0 and we didn't update it.  I don't know if retrying is the
> intended behaviour or if we care that the start == 0 case doesn't do it.

Good point.  Let's make it explicit?

--- 25/mm/truncate.c~invalidate-range-of-pages-after-direct-io-write-fix-fix	Fri Feb  4 15:33:52 2005
+++ 25-akpm/mm/truncate.c	Fri Feb  4 15:34:47 2005
@@ -260,11 +260,12 @@ int invalidate_inode_pages2_range(struct
 	int i;
 	int ret = 0;
 	int did_range_unmap = 0;
+	int wrapped = 0;
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (next <= end &&
-	       !ret && pagevec_lookup(&pvec, mapping, next,
+	while (next <= end && !ret && !wrapped &&
+		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -277,6 +278,8 @@ int invalidate_inode_pages2_range(struct
 			}
 			wait_on_page_writeback(page);
 			next = page->index + 1;
+			if (next == 0)
+				wrapped = 1;
 			while (page_mapped(page)) {
 				if (!did_range_unmap) {
 					/*
@@ -307,8 +310,6 @@ int invalidate_inode_pages2_range(struct
 		}
 		pagevec_release(&pvec);
 		cond_resched();
-		if (next == 0)
-			break;		/* The pgoff_t wrapped */
 	}
 	return ret;
 }
_

> I'm getting less excited about the iterating macro the more I think about
> it.  Tearing down the pagevec in some complicated for(;;) doesn't sound
> reliable, and I fear that some function that takes a per-page callback
> function pointer from the caller will turn people's stomachs.

heh, OK.



* Re: [Patch] invalidate range of pages after direct IO write
  2005-02-04 23:35         ` Andrew Morton
@ 2005-02-07 21:19           ` Zach Brown
  0 siblings, 0 replies; 7+ messages in thread
From: Zach Brown @ 2005-02-07 21:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

>> But this won't happen if next
>>started as 0 and we didn't update it.  I don't know if retrying is the
>>intended behaviour or if we care that the start == 0 case doesn't do it.
> 
> 
> Good point.  Let's make it explicit?

Looks great.  I briefly had visions of some bitfield to pack the three
boolean ints we have and then quickly came to my senses. :)

I threw together those other two patches that work with ranges around
direct IO (unmapping before r/w, and writing and waiting before reads).
rc3-mm1 is angry with my test machine so they're actually against
current -bk with this first invalidation patch applied.  I hope that
doesn't make life harder than it needs to be.  I'll send them under
separate cover.

- z


