linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] filemap and readahead fixes for linux-next
@ 2009-04-10  6:09 Wu Fengguang
  2009-04-10  6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Ying Han

Hi Andrew,

This is the original patchset minus the fault retry ones,
rebased onto next-20090409.

Thanks,
Fengguang
-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead()
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
@ 2009-04-10  6:09 ` Wu Fengguang
  2009-04-10  6:09 ` [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead() Wu Fengguang
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-move-max_sane_readahead.patch --]
[-- Type: text/plain, Size: 1618 bytes --]

Impact: code simplification.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/fadvise.c   |    2 +-
 mm/filemap.c   |    3 +--
 mm/madvise.c   |    3 +--
 mm/readahead.c |    1 +
 4 files changed, 4 insertions(+), 5 deletions(-)

--- mm.orig/mm/fadvise.c
+++ mm/mm/fadvise.c
@@ -101,7 +101,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
 		
 		ret = force_page_cache_readahead(mapping, file,
 				start_index,
-				max_sane_readahead(nrpages));
+				nrpages);
 		if (ret > 0)
 			ret = 0;
 		break;
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1389,8 +1389,7 @@ do_readahead(struct address_space *mappi
 	if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
 		return -EINVAL;
 
-	force_page_cache_readahead(mapping, filp, index,
-					max_sane_readahead(nr));
+	force_page_cache_readahead(mapping, filp, index, nr);
 	return 0;
 }
 
--- mm.orig/mm/madvise.c
+++ mm/mm/madvise.c
@@ -123,8 +123,7 @@ static long madvise_willneed(struct vm_a
 		end = vma->vm_end;
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	force_page_cache_readahead(file->f_mapping,
-			file, start, max_sane_readahead(end - start));
+	force_page_cache_readahead(file->f_mapping, file, start, end - start);
 	return 0;
 }
 
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -210,6 +210,7 @@ int force_page_cache_readahead(struct ad
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
+	nr_to_read = max_sane_readahead(nr_to_read);
 	while (nr_to_read) {
 		int err;
 

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead()
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
  2009-04-10  6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
@ 2009-04-10  6:09 ` Wu Fengguang
  2009-04-10  6:10 ` [PATCH 3/9] readahead: remove one unnecessary radix tree lookup Wu Fengguang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-sane-max.patch --]
[-- Type: text/plain, Size: 627 bytes --]

Just in case someone aggressively set a huge readahead size.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -357,7 +357,7 @@ ondemand_readahead(struct address_space 
 		   bool hit_readahead_marker, pgoff_t offset,
 		   unsigned long req_size)
 {
-	int	max = ra->ra_pages;	/* max readahead pages */
+	unsigned long max = max_sane_readahead(ra->ra_pages);
 	pgoff_t prev_offset;
 	int	sequential;
 

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 3/9] readahead: remove one unnecessary radix tree lookup
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
  2009-04-10  6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
  2009-04-10  6:09 ` [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead() Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10  6:10 ` [PATCH 4/9] readahead: increase interleaved readahead size Wu Fengguang
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-interleaved-offset.patch --]
[-- Type: text/plain, Size: 659 bytes --]

(hit_readahead_marker != 0) means the page at @offset is present,
so we can search for the first non-present page starting from @offset+1.
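
For illustration only (not part of the patch), a small userspace model of
the lookup, with a plain array standing in for the radix tree and made-up
page numbers; both calls find the same hole, the new one just skips the
page we already know is present:

#include <stdio.h>
#include <stdbool.h>

/* cached[i] == true means page i is present in the page cache */
static bool cached[16] = { [0] = true, [1] = true, [2] = true, [3] = true };

/* rough model of radix_tree_next_hole(): first absent index at or after
 * @index, scanning at most @max_scan slots */
static unsigned long next_hole(unsigned long index, unsigned long max_scan)
{
	while (max_scan--) {
		if (index >= 16 || !cached[index])
			break;
		index++;
	}
	return index;
}

int main(void)
{
	unsigned long offset = 2, max = 8;	/* fault on a present page */

	/* old: also scans @offset itself, although it is known present */
	printf("old: start = %lu\n", next_hole(offset, max + 1));
	/* new: skip the known-present page, scan the same tail */
	printf("new: start = %lu\n", next_hole(offset + 1, max));
	return 0;
}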

Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -395,7 +395,7 @@ ondemand_readahead(struct address_space 
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset,max+1);
+		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
 		rcu_read_unlock();
 
 		if (!start || start - offset > max)

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/9] readahead: increase interleaved readahead size
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
                   ` (2 preceding siblings ...)
  2009-04-10  6:10 ` [PATCH 3/9] readahead: remove one unnecessary radix tree lookup Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10  6:10 ` [PATCH 5/9] readahead: remove sync/async readahead call dependency Wu Fengguang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-interleaved-size.patch --]
[-- Type: text/plain, Size: 583 bytes --]

Make sure the interleaved readahead size is larger than the request size.
This also makes the readahead window grow more quickly.
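
For illustration only (not part of the patch), a worked example with
made-up numbers; next_ra_size() below is merely an assumed stand-in for
get_next_ra_size():

#include <stdio.h>

/* assumed stand-in for get_next_ra_size(): double, capped at max */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	return 2 * cur < max ? 2 * cur : max;
}

int main(void)
{
	unsigned long offset = 100;	/* faulting page of this read */
	unsigned long start = 102;	/* first hole found after it */
	unsigned long req_size = 16;	/* pages wanted by this read */
	unsigned long max = 32;

	unsigned long old_size = start - offset;		/* old async_size only */
	unsigned long new_size = start - offset + req_size;	/* with this patch */

	printf("old window: %lu pages (smaller than the request)\n",
	       next_ra_size(old_size, max));
	printf("new window: %lu pages\n", next_ra_size(new_size, max));
	return 0;
}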

Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |    1 +
 1 file changed, 1 insertion(+)

--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -403,6 +403,7 @@ ondemand_readahead(struct address_space 
 
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
+		ra->size += req_size;
 		ra->size = get_next_ra_size(ra, max);
 		ra->async_size = ra->size;
 		goto readit;

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 5/9] readahead: remove sync/async readahead call dependency
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
                   ` (3 preceding siblings ...)
  2009-04-10  6:10 ` [PATCH 4/9] readahead: increase interleaved readahead size Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10  6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-remove-call-dependancy.patch --]
[-- Type: text/plain, Size: 1788 bytes --]

The readahead call scheme is error-prone in that it expects the call sites
to check for async readahead after doing any sync one, i.e.

			if (!page)
				page_cache_sync_readahead();
			page = find_get_page();
			if (page && PageReadahead(page))
				page_cache_async_readahead();

This is because PG_readahead could be set by a sync readahead for the _current_
newly faulted-in page, and the readahead code simply expects one more callback
on the same page to start the async readahead. If the caller fails to do so, it
will miss the PG_readahead bit and never be able to start an async readahead.

Eliminate this insane constraint by piggy-backing the async part into the
current readahead window.

Now if an async readahead should be started immediately after a sync one,
the readahead logic itself will do it. So the following code becomes valid:
(the 'else' in particular)

			if (!page)
				page_cache_sync_readahead();
			else if (PageReadahead(page))
				page_cache_async_readahead();
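
For illustration only (not part of the patch), a minimal userspace model
of the piggy-back; the numbers are made up and next_ra_size() is an
assumed stand-in for get_next_ra_size():

#include <stdio.h>

struct ra_state { unsigned long start, size, async_size; };

/* assumed stand-in for get_next_ra_size(): double, capped at max */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	return 2 * cur < max ? 2 * cur : max;
}

int main(void)
{
	/* made-up scenario: a 16-page sequential read with max = 8 just
	 * triggered a sync readahead; the initial window was capped at
	 * max and is entirely async, so PG_readahead landed on the very
	 * first page -- the one this read is consuming right now */
	struct ra_state ra = { .start = 0, .size = 8, .async_size = 8 };
	unsigned long offset = 0, max = 8;

	if (offset == ra.start && ra.size == ra.async_size) {
		/* the read would hit its own marker: merge the next
		 * window into this submission instead of relying on a
		 * second callback from the caller */
		ra.async_size = next_ra_size(ra.size, max);
		ra.size += ra.async_size;
	}
	printf("submit %lu pages, async tail of %lu pages\n",
	       ra.size, ra.async_size);
	return 0;
}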

Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -421,6 +421,16 @@ ondemand_readahead(struct address_space 
 	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
 
 readit:
+	/*
+	 * Will this read hit the readahead marker made by itself?
+	 * If so, trigger the readahead marker hit now, and merge
+	 * the resulted next readahead window into the current one.
+	 */
+	if (offset == ra->start && ra->size == ra->async_size) {
+		ra->async_size = get_next_ra_size(ra, max);
+		ra->size += ra->async_size;
+	}
+
 	return ra_submit(ra, mapping, filp);
 }
 

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
                   ` (4 preceding siblings ...)
  2009-04-10  6:10 ` [PATCH 5/9] readahead: remove sync/async readahead call dependency Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10 23:48   ` Andrew Morton
  2009-04-10  6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Pavel Levshin, wli, Nick Piggin, Wu Fengguang,
	Linus Torvalds, Ying Han

[-- Attachment #1: readahead-mmap-split-code-and-cleanup.patch --]
[-- Type: text/plain, Size: 8906 bytes --]

From: Linus Torvalds <torvalds@linux-foundation.org>

This shouldn't really change behavior all that much, but the single
rather complex function with read-ahead inside a loop etc is broken up
into more manageable pieces.

The behaviour is also less subtle, with the read-ahead being done up-front 
rather than inside some subtle loop and thus avoiding the now unnecessary 
extra state variables (ie "did_readaround" is gone).

Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
which case the original code would jump directly to the no_cached_page path.

Cc: Pavel Levshin <lpk@581.spb.su>
Cc: wli@movementarian.org
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, so this is something I did in Mexico when I wasn't scuba-diving, and 
was "watching" the kids at the pool. It was brought on by looking at git 
mmap file behaviour under cold-cache behaviour: git does ok, but my laptop 
disk is really slow, and I tried to verify that the kernel did a 
reasonable job of read-ahead when taking page faults.

I think it did, but quite frankly, the filemap_fault() code was totally 
unreadable. So this separates out the read-ahead cases, and adds more 
comments, and also changes it so that we do asynchronous read-ahead 
*before* we actually wait for the page we are waiting for to become 
unlocked.

Not that it seems to make any real difference on my laptop, but I really 
hated how it was doing a

	page = get_lock_page(..)

and then doing read-ahead after that: which just guarantees that we have 
to wait for any outstanding IO on "page" to complete before we can even 
submit any new read-ahead! That just seems totally broken!

So it replaces the "get_lock_page()" at the top with a broken-out page 
cache lookup, which allows us to look at the page state flags and make 
appropriate decisions on what we should do without waiting for the locked 
bit to clear.

It does add many more lines than it removes:

	 mm/filemap.c |  192 +++++++++++++++++++++++++++++++++++++++-------------------
	 1 files changed, 130 insertions(+), 62 deletions(-)

but that's largely due to (a) the new function headers etc due to the 
split-up and (b) new or extended comments especially about the helper 
functions. The code, in many ways, is actually simpler, apart from the 
fairly trivial expansion of the equivalent of "get_lock_page()" into the 
function.

Comments? I tried to avoid changing the read-ahead logic itself, although 
the old code did some strange things like doing *both* async readahead and 
then looking up the page and doing sync readahead (which I think was just 
due to the code being so damn messily organized, not on purpose).

			Linus

---
 mm/filemap.c |  156 +++++++++++++++++++++++++++----------------------
 1 file changed, 89 insertions(+), 67 deletions(-)

--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1455,6 +1455,68 @@ static int page_cache_read(struct file *
 
 #define MMAP_LOTSAMISS  (100)
 
+/*
+ * Synchronous readahead happens when we don't even find
+ * a page in the page cache at all.
+ */
+static void do_sync_mmap_readahead(struct vm_area_struct *vma,
+				   struct file_ra_state *ra,
+				   struct file *file,
+				   pgoff_t offset)
+{
+	unsigned long ra_pages;
+	struct address_space *mapping = file->f_mapping;
+
+	/* If we don't want any read-ahead, don't bother */
+	if (VM_RandomReadHint(vma))
+		return;
+
+	if (VM_SequentialReadHint(vma)) {
+		page_cache_sync_readahead(mapping, ra, file, offset, 1);
+		return;
+	}
+
+	if (ra->mmap_miss < INT_MAX)
+		ra->mmap_miss++;
+
+	/*
+	 * Do we miss much more than hit in this file? If so,
+	 * stop bothering with read-ahead. It will only hurt.
+	 */
+	if (ra->mmap_miss > MMAP_LOTSAMISS)
+		return;
+
+	ra_pages = max_sane_readahead(ra->ra_pages);
+	if (ra_pages) {
+		pgoff_t start = 0;
+
+		if (offset > ra_pages / 2)
+			start = offset - ra_pages / 2;
+		do_page_cache_readahead(mapping, file, start, ra_pages);
+	}
+}
+
+/*
+ * Asynchronous readahead happens when we find the page and PG_readahead,
+ * so we want to possibly extend the readahead further..
+ */
+static void do_async_mmap_readahead(struct vm_area_struct *vma,
+				    struct file_ra_state *ra,
+				    struct file *file,
+				    struct page *page,
+				    pgoff_t offset)
+{
+	struct address_space *mapping = file->f_mapping;
+
+	/* If we don't want any read-ahead, don't bother */
+	if (VM_RandomReadHint(vma))
+		return;
+	if (ra->mmap_miss > 0)
+		ra->mmap_miss--;
+	if (PageReadahead(page))
+		page_cache_async_readahead(mapping, ra, file, page, offset, 1);
+}
+
 /**
  * filemap_fault - read in file data for page fault handling
  * @vma:	vma in which the fault was taken
@@ -1474,78 +1536,44 @@ int filemap_fault(struct vm_area_struct 
 	struct address_space *mapping = file->f_mapping;
 	struct file_ra_state *ra = &file->f_ra;
 	struct inode *inode = mapping->host;
+	pgoff_t offset = vmf->pgoff;
 	struct page *page;
 	pgoff_t size;
-	int did_readaround = 0;
 	int ret = 0;
 
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (vmf->pgoff >= size)
+	if (offset >= size)
 		return VM_FAULT_SIGBUS;
 
-	/* If we don't want any read-ahead, don't bother */
-	if (VM_RandomReadHint(vma))
-		goto no_cached_page;
-
 	/*
 	 * Do we have something in the page cache already?
 	 */
-retry_find:
-	page = find_lock_page(mapping, vmf->pgoff);
-	/*
-	 * For sequential accesses, we use the generic readahead logic.
-	 */
-	if (VM_SequentialReadHint(vma)) {
-		if (!page) {
-			page_cache_sync_readahead(mapping, ra, file,
-							   vmf->pgoff, 1);
-			page = find_lock_page(mapping, vmf->pgoff);
-			if (!page)
-				goto no_cached_page;
-		}
-		if (PageReadahead(page)) {
-			page_cache_async_readahead(mapping, ra, file, page,
-							   vmf->pgoff, 1);
-		}
-	}
-
-	if (!page) {
-		unsigned long ra_pages;
-
-		ra->mmap_miss++;
-
+	page = find_get_page(mapping, offset);
+	if (likely(page)) {
 		/*
-		 * Do we miss much more than hit in this file? If so,
-		 * stop bothering with read-ahead. It will only hurt.
+		 * We found the page, so try async readahead before
+		 * waiting for the lock.
 		 */
-		if (ra->mmap_miss > MMAP_LOTSAMISS)
-			goto no_cached_page;
+		do_async_mmap_readahead(vma, ra, file, page, offset);
+		lock_page(page);
 
-		/*
-		 * To keep the pgmajfault counter straight, we need to
-		 * check did_readaround, as this is an inner loop.
-		 */
-		if (!did_readaround) {
-			ret = VM_FAULT_MAJOR;
-			count_vm_event(PGMAJFAULT);
-		}
-		did_readaround = 1;
-		ra_pages = max_sane_readahead(file->f_ra.ra_pages);
-		if (ra_pages) {
-			pgoff_t start = 0;
-
-			if (vmf->pgoff > ra_pages / 2)
-				start = vmf->pgoff - ra_pages / 2;
-			do_page_cache_readahead(mapping, file, start, ra_pages);
+		/* Did it get truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			put_page(page);
+			goto no_cached_page;
 		}
-		page = find_lock_page(mapping, vmf->pgoff);
+	} else {
+		/* No page in the page cache at all */
+		do_sync_mmap_readahead(vma, ra, file, offset);
+		count_vm_event(PGMAJFAULT);
+		ret = VM_FAULT_MAJOR;
+retry_find:
+		page = find_lock_page(mapping, offset);
 		if (!page)
 			goto no_cached_page;
 	}
 
-	if (!did_readaround)
-		ra->mmap_miss--;
-
 	/*
 	 * We have a locked page in the page cache, now we need to check
 	 * that it's up-to-date. If not, it is going to be due to an error.
@@ -1553,18 +1581,18 @@ retry_find:
 	if (unlikely(!PageUptodate(page)))
 		goto page_not_uptodate;
 
-	/* Must recheck i_size under page lock */
+	/*
+	 * Found the page and have a reference on it.
+	 * We must recheck i_size under page lock.
+	 */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
+	if (unlikely(offset >= size)) {
 		unlock_page(page);
 		page_cache_release(page);
 		return VM_FAULT_SIGBUS;
 	}
 
-	/*
-	 * Found the page and have a reference on it.
-	 */
-	ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
+	ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
@@ -1573,7 +1601,7 @@ no_cached_page:
 	 * We're only likely to ever get here if MADV_RANDOM is in
 	 * effect.
 	 */
-	error = page_cache_read(file, vmf->pgoff);
+	error = page_cache_read(file, offset);
 
 	/*
 	 * The page we want has now been added to the page cache.
@@ -1593,12 +1621,6 @@ no_cached_page:
 	return VM_FAULT_SIGBUS;
 
 page_not_uptodate:
-	/* IO error path */
-	if (!did_readaround) {
-		ret = VM_FAULT_MAJOR;
-		count_vm_event(PGMAJFAULT);
-	}
-
 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
 	 * Try to re-read it _once_. We do this synchronously,

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 7/9] readahead: sequential mmap readahead
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
                   ` (5 preceding siblings ...)
  2009-04-10  6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10 23:34   ` Andrew Morton
  2009-04-10  6:10 ` [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead Wu Fengguang
  2009-04-10  6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
  8 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-mmap-sequential-readahead.patch --]
[-- Type: text/plain, Size: 1796 bytes --]

Auto-detect sequential mmap reads and do readahead for them.

The sequential mmap readahead will be triggered when
- sync readahead: it's a major fault and (prev_offset == offset-1);
- async readahead: minor fault on PG_readahead page with valid readahead state.
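
For illustration only (not part of the patch), the sync-side check with
made-up numbers, assuming 4K pages:

#include <stdio.h>

#define PAGE_CACHE_SHIFT 12	/* assuming 4K pages */

int main(void)
{
	/* prev_pos is the byte position recorded by the previous fault;
	 * shifting it down yields the page index that was read last time */
	unsigned long long prev_pos = 7ULL << PAGE_CACHE_SHIFT;
	unsigned long offset = 8;	/* page index of this major fault */

	if (offset - 1 == (unsigned long)(prev_pos >> PAGE_CACHE_SHIFT))
		printf("sequential mmap read: do readahead\n");
	else
		printf("not sequential: fall back to readaround\n");
	return 0;
}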

The benefits of doing readahead instead of read-around:
- less I/O wait thanks to async readahead
- double real I/O size and no more cache hits

The single stream case is improved a little.
For 100,000 sequential mmap reads:

                                    user       system    cpu        total
(1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
(1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
(2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607

The patched (2) has the smallest total time, since it has no cache hit overheads
and less I/O block time (thanks to async readahead). Here the I/O size
makes little difference, since there is only a single stream.

Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
since half of the read-around pages will be readahead cache hits.

This is going to make _real_ differences for _concurrent_ IO streams.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
 	if (VM_RandomReadHint(vma))
 		return;
 
-	if (VM_SequentialReadHint(vma)) {
+	if (VM_SequentialReadHint(vma) ||
+			offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
 		page_cache_sync_readahead(mapping, ra, file, offset, 1);
 		return;
 	}

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
                   ` (6 preceding siblings ...)
  2009-04-10  6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10  6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
  8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Linus Torvalds, Nick Piggin, Wu Fengguang, Ying Han

[-- Attachment #1: readahead-mmap-full-async-readahead-size.patch --]
[-- Type: text/plain, Size: 2486 bytes --]

We need this in one particular case and two more general ones.

Now we do async readahead for sequential mmap reads, and do it with the help of
PG_readahead. For normal reads, PG_readahead is a sufficient condition to do
a sequential readahead. But unfortunately, for mmap reads, there is a tiny nuisance:

[11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
[11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
[11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
[11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
An unfavorably small readahead. The original dumb read-around size could be more efficient.

That happened because ld-linux.so does a read(832) in L1 before mmap(),
which triggers a 4-page readahead, with the second page tagged PG_readahead.

L0: open("/lib/libc.so.6", O_RDONLY)        = 3
L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
L7: close(3)                                = 0

In general, the PG_readahead flag will also be hit in cases
- sequential reads
- clustered random reads
A full readahead size is desirable in both cases.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1515,7 +1515,8 @@ static void do_async_mmap_readahead(stru
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
 	if (PageReadahead(page))
-		page_cache_async_readahead(mapping, ra, file, page, offset, 1);
+		page_cache_async_readahead(mapping, ra, file,
+					   page, offset, ra->ra_pages);
 }
 
 /**

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 9/9] readahead: record mmap read-around states in file_ra_state
  2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
                   ` (7 preceding siblings ...)
  2009-04-10  6:10 ` [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead Wu Fengguang
@ 2009-04-10  6:10 ` Wu Fengguang
  2009-04-10 23:38   ` Andrew Morton
  8 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Ying Han

[-- Attachment #1: readahead-mmap-readaround-use-ra_submit.patch --]
[-- Type: text/plain, Size: 3906 bytes --]

Mmap read-around now shares the same code style and data structure
with readahead code.

This also removes do_page_cache_readahead().
Its last user, mmap read-around, has been changed to call ra_submit().

The no-readahead-if-congested logic is dropped along the way.
Users are pretty sensitive to slow loading of executables,
so it's undesirable to disable mmap read-around on a congested queue.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 include/linux/mm.h |    5 +++--
 mm/filemap.c       |   12 +++++++-----
 mm/readahead.c     |   23 ++---------------------
 3 files changed, 12 insertions(+), 28 deletions(-)

--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1487,13 +1487,15 @@ static void do_sync_mmap_readahead(struc
 	if (ra->mmap_miss > MMAP_LOTSAMISS)
 		return;
 
+	/*
+	 * mmap read-around
+	 */
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	if (ra_pages) {
-		pgoff_t start = 0;
-
-		if (offset > ra_pages / 2)
-			start = offset - ra_pages / 2;
-		do_page_cache_readahead(mapping, file, start, ra_pages);
+		ra->start = max_t(long, 0, offset - ra_pages/2);
+		ra->size = ra_pages;
+		ra->async_size = 0;
+		ra_submit(ra, mapping, file);
 	}
 }
 
--- mm.orig/include/linux/mm.h
+++ mm/include/linux/mm.h
@@ -1181,8 +1181,6 @@ void task_dirty_inc(struct task_struct *
 #define VM_MAX_READAHEAD	128	/* kbytes */
 #define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 
-int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read);
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
 
@@ -1200,6 +1198,9 @@ void page_cache_async_readahead(struct a
 				unsigned long size);
 
 unsigned long max_sane_readahead(unsigned long nr);
+unsigned long ra_submit(struct file_ra_state *ra,
+		        struct address_space *mapping,
+			struct file *filp);
 
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -133,15 +133,12 @@ out:
 }
 
 /*
- * do_page_cache_readahead actually reads a chunk of disk.  It allocates all
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates all
  * the pages first, then submits them all for I/O. This avoids the very bad
  * behaviour which would occur if page allocations are causing VM writeback.
  * We really don't want to intermingle reads and writes like that.
  *
  * Returns the number of pages requested, or the maximum amount of I/O allowed.
- *
- * do_page_cache_readahead() returns -1 if it encountered request queue
- * congestion.
  */
 static int
 __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
@@ -232,22 +229,6 @@ int force_page_cache_readahead(struct ad
 }
 
 /*
- * This version skips the IO if the queue is read-congested, and will tell the
- * block layer to abandon the readahead if request allocation would block.
- *
- * force_page_cache_readahead() will ignore queue congestion and will block on
- * request queues.
- */
-int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read)
-{
-	if (bdi_read_congested(mapping->backing_dev_info))
-		return -1;
-
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
-}
-
-/*
  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
  * sensible upper limit.
  */
@@ -260,7 +241,7 @@ unsigned long max_sane_readahead(unsigne
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
-static unsigned long ra_submit(struct file_ra_state *ra,
+unsigned long ra_submit(struct file_ra_state *ra,
 		       struct address_space *mapping, struct file *filp)
 {
 	int actual;

-- 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/9] readahead: sequential mmap readahead
  2009-04-10  6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
@ 2009-04-10 23:34   ` Andrew Morton
  2009-04-12  6:50     ` Wu Fengguang
  2009-04-12  7:09     ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2009-04-10 23:34 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, npiggin, torvalds, fengguang.wu, yinghan

On Fri, 10 Apr 2009 14:10:04 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Auto-detect sequential mmap reads and do readahead for them.
> 
> The sequential mmap readahead will be triggered when
> - sync readahead: it's a major fault and (prev_offset == offset-1);
> - async readahead: minor fault on PG_readahead page with valid readahead state.
> 
> The benefits of doing readahead instead of read-around:
> - less I/O wait thanks to async readahead
> - double real I/O size and no more cache hits
> 
> The single stream case is improved a little.
> For 100,000 sequential mmap reads:
> 
>                                     user       system    cpu        total
> (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
> (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
> (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
> 
> The patched (2) has smallest total time, since it has no cache hit overheads
> and less I/O block time(thanks to async readahead). Here the I/O size
> makes no much difference, since there's only one single stream.
> 
> Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
> since the half of the read-around pages will be readahead cache hits.
> 
> This is going to make _real_ differences for _concurrent_ IO streams.
> 
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/filemap.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- mm.orig/mm/filemap.c
> +++ mm/mm/filemap.c
> @@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
>  	if (VM_RandomReadHint(vma))
>  		return;
>  
> -	if (VM_SequentialReadHint(vma)) {
> +	if (VM_SequentialReadHint(vma) ||
> +			offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
>  		page_cache_sync_readahead(mapping, ra, file, offset, 1);
>  		return;
>  	}
> 

We've always believed that readaround was beneficial for more random
access patterns - classically faulting in an executable.  Although I
don't recall that this belief was very well substantiated.

(The best results I ever got was by doing readaround and setting the
size to a few MB, so we slurp the entire executable into memory in one
hit.  lol.)

So my question is: what is the probability that this change will
inadvertently cause a randomish-access workload to fall into readahead
(rather than readaround) mode, and what is the impact when this
happens?



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] readahead: record mmap read-around states in file_ra_state
  2009-04-10  6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
@ 2009-04-10 23:38   ` Andrew Morton
  2009-04-11  4:24     ` Wu Fengguang
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-10 23:38 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, npiggin, torvalds, yinghan

On Fri, 10 Apr 2009 14:10:06 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Mmap read-around now shares the same code style and data structure
> with readahead code.
> 
> This also removes do_page_cache_readahead().
> Its last user, mmap read-around, has been changed to call ra_submit().
> 
> The no-readahead-if-congested logic is dumped by the way.
> Users will be pretty sensitive about the slow loading of executables.
> So it's unfavorable to disabled mmap read-around on a congested queue.

Did you verify that the read-congested code ever triggers?

It used to be (and probably still is) the case that
bdi_read_congested() is very very rare, because the read queue is long
and the kernel rarely puts many read requests into it.  You can of
course create this condition with a fake workload with may
threads/processes, but it _is_ fake.

Some real-world workloads (databases?) will of course trigger
bdi_read_congested().  But they're usually doing fixed-sized reads, and
if we're doing _any_ readahead/readaround in that case, readahead is
busted.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
  2009-04-10  6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
@ 2009-04-10 23:48   ` Andrew Morton
  2009-04-11 13:58     ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-10 23:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-kernel, lpk, wli, npiggin, fengguang.wu, torvalds, yinghan,
	KOSAKI Motohiro, Nick Piggin, Hugh Dickins, Rik van Riel

On Fri, 10 Apr 2009 14:10:03 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> @@ -1553,18 +1581,18 @@ retry_find:
>  	if (unlikely(!PageUptodate(page)))
>  		goto page_not_uptodate;
>  
> -	/* Must recheck i_size under page lock */
> +	/*
> +	 * Found the page and have a reference on it.
> +	 * We must recheck i_size under page lock.
> +	 */
>  	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> -	if (unlikely(vmf->pgoff >= size)) {
> +	if (unlikely(offset >= size)) {
>  		unlock_page(page);
>  		page_cache_release(page);
>  		return VM_FAULT_SIGBUS;
>  	}

This hunk broke
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch.


I fixed it thusly:

	/*
	 * Found the page and have a reference on it.
	 * We must recheck i_size under page lock.
	 */
	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (unlikely(offset >= size)) {
		unlock_page(page);
		page_cache_release(page);
		return VM_FAULT_SIGBUS;
	}
+	update_page_reclaim_stat(page);
	ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
	vmf->page = page;
	return ret | VM_FAULT_LOCKED;

which seems logical to me.

Although now I look at it, it seems that
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch should
go into 2.6.30?

Ah.  But I have a note here that I didn't like it, because it adds lots
of new spinlocking to fastpaths.  So I'll leave things as they stand
until we have had a little talk about that.



From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Unfortunately, the following two patches had conflicting concepts.

  1. commit 9ff473b9a72942c5ac0ad35607cae28d8d59ed7a
     (vmscan: evict streaming IO first)

  2. commit bf3f3bc5e734706730c12a323f9b2068052aa1f0
     (mm: don't mark_page_accessed in fault path)

(1) requires that a page fault update reclaim stat via
    mark_page_accessed(), but

(2) removed mark_page_accessed().

However, (1) actually only needed to update reclaim stat, but not activate
the page.  Then, the fault-path can call update_page_reclaim_stat() to
solve this conflict.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap.h |    1 +
 mm/filemap.c         |    2 +-
 mm/memory.c          |    2 ++
 mm/swap.c            |   24 +++++++++++++++++++-----
 4 files changed, 23 insertions(+), 6 deletions(-)

diff -puN include/linux/swap.h~mm-update_page_reclaim_stat-is-called-from-page-fault-path include/linux/swap.h
--- a/include/linux/swap.h~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/include/linux/swap.h
@@ -179,6 +179,7 @@ extern void __lru_cache_add(struct page 
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
+extern void update_page_reclaim_stat(struct page *page);
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
diff -puN mm/filemap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path mm/filemap.c
--- a/mm/filemap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/mm/filemap.c
@@ -1595,7 +1595,7 @@ retry_find:
 		page_cache_release(page);
 		return VM_FAULT_SIGBUS;
 	}
-
+	update_page_reclaim_stat(page);
 	ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
diff -puN mm/memory.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path mm/memory.c
--- a/mm/memory.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/mm/memory.c
@@ -2507,6 +2507,8 @@ static int do_swap_page(struct mm_struct
 		try_to_free_swap(page);
 	unlock_page(page);
 
+	update_page_reclaim_stat(page);
+
 	if (write_access) {
 		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
 		if (ret & VM_FAULT_ERROR)
diff -puN mm/swap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path mm/swap.c
--- a/mm/swap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/mm/swap.c
@@ -151,8 +151,9 @@ void  rotate_reclaimable_page(struct pag
 	}
 }
 
-static void update_page_reclaim_stat(struct zone *zone, struct page *page,
-				     int file, int rotated)
+static void update_page_reclaim_stat_locked(struct zone *zone,
+					    struct page *page,
+					    int file, int rotated)
 {
 	struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;
 	struct zone_reclaim_stat *memcg_reclaim_stat;
@@ -171,6 +172,19 @@ static void update_page_reclaim_stat(str
 		memcg_reclaim_stat->recent_rotated[file]++;
 }
 
+void update_page_reclaim_stat(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+
+	spin_lock_irq(&zone->lru_lock);
+	/* if the page isn't reclaimable, it doesn't update reclaim stat */
+	if (PageLRU(page) && !PageUnevictable(page)) {
+		update_page_reclaim_stat_locked(zone, page,
+					 !!page_is_file_cache(page), 1);
+	}
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 /*
  * FIXME: speed this up?
  */
@@ -182,14 +196,14 @@ void activate_page(struct page *page)
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		int file = page_is_file_cache(page);
 		int lru = LRU_BASE + file;
-		del_page_from_lru_list(zone, page, lru);
 
+		del_page_from_lru_list(zone, page, lru);
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(zone, page, lru);
 		__count_vm_event(PGACTIVATE);
 
-		update_page_reclaim_stat(zone, page, !!file, 1);
+		update_page_reclaim_stat_locked(zone, page, !!file, 1);
 	}
 	spin_unlock_irq(&zone->lru_lock);
 }
@@ -427,7 +441,7 @@ void ____pagevec_lru_add(struct pagevec 
 		file = is_file_lru(lru);
 		if (active)
 			SetPageActive(page);
-		update_page_reclaim_stat(zone, page, file, active);
+		update_page_reclaim_stat_locked(zone, page, file, active);
 		add_page_to_lru_list(zone, page, lru);
 	}
 	if (zone)
_


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 9/9] readahead: record mmap read-around states in file_ra_state
  2009-04-10 23:38   ` Andrew Morton
@ 2009-04-11  4:24     ` Wu Fengguang
  0 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-11  4:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, torvalds, yinghan

On Sat, Apr 11, 2009 at 07:38:53AM +0800, Andrew Morton wrote:
> On Fri, 10 Apr 2009 14:10:06 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Mmap read-around now shares the same code style and data structure
> > with readahead code.
> > 
> > This also removes do_page_cache_readahead().
> > Its last user, mmap read-around, has been changed to call ra_submit().
> > 
> > The no-readahead-if-congested logic is dumped by the way.
> > Users will be pretty sensitive about the slow loading of executables.
> > So it's unfavorable to disabled mmap read-around on a congested queue.
> 
> Did you verify that the read-congested code ever triggers?

No.

Sorry, the case I described is an imagined one. There could be other
counter-cases, however.

> It used to be (and probably still is) the case that
> bdi_read_congested() is very very rare, because the read queue is long
> and the kernel rarely puts many read requests into it.  You can of
> course create this condition with a fake workload with may
> threads/processes, but it _is_ fake.

The major workloads that could trigger read congestions:

1) file servers running highly concurrent IO streams
2) concurrent sys_readahead()s on a desktop, for fast booting
3) mysql is also able to issue concurrent sys_readahead()s
4) more workloads beyond my imagination

For 1) the change is favorable or irrelevant.

For 2) and 3) the change is irrelevant. The user space readahead
process must make sure the degree of parallelism is kept under control,
so that read congestion _never_ happens; otherwise normal reads will be blocked.

> Some real-world workloads (databases?) will of course trigger
> bdi_read_congested().  But they're usually doing fixed-sized reads, and
> if we're doing _any_ readahead/readaround in that case, readahead is
> busted.

Hmm I didn't have this possibility in mind: the read congestion is
exactly caused by a pool of concurrent mmap readers _themselves_.

Let's assume they are doing N-page sized _sparse_ random reads.
(otherwise this patch will be favorable)

The current mmap_miss accounting works so that if N >= 2, readaround
stays enabled forever: each major fault increments mmap_miss by one while
the following N-1 minor faults on read-around pages decrement it, so the
counter never climbs past MMAP_LOTSAMISS. If N == 1, readaround will
be disabled quickly. The designer of this logic must be a master!
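
For illustration only, a little userspace model of that accounting (not
kernel code), reusing the MMAP_LOTSAMISS == 100 threshold from
mm/filemap.c; an N-page clustered read costs one major fault (a miss)
plus N-1 minor faults on read-around pages (hits):

#include <stdio.h>

#define MMAP_LOTSAMISS	100

int main(void)
{
	int n;

	for (n = 1; n <= 3; n++) {
		int mmap_miss = 0, disabled = 0, cluster, i;

		for (cluster = 0; cluster < 1000; cluster++) {
			if (mmap_miss > MMAP_LOTSAMISS)
				disabled = 1;	/* readaround gets skipped */
			mmap_miss++;		/* the major fault (miss) */
			for (i = 1; i < n; i++)	/* minor faults (hits) */
				if (mmap_miss > 0)
					mmap_miss--;
		}
		printf("N=%d: readaround %s, mmap_miss ends at %d\n",
		       n, disabled ? "gets disabled" : "stays enabled",
		       mmap_miss);
	}
	return 0;
}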

Why? If an application does 2-page random reads, the best option
is the read() syscall, because the size information is _lost_ when
doing mmap reads. If the application author cares about
performance (e.g. a big DBMS), he will find that out either by
theorizing or through experiments.

IMHO the kernel mmap readaround algorithm shall only be optimized for
- enabled: sequential reads
- enabled: large random reads
- enabled: clustered random reads
- disabled: 1-page random reads
and to perform badly and therefore discourage
- enabled and discouraged: small (2-page) and sparse random reads

It will do undesirable readaround for small sparse random mmap readers,
and I do think that we want this bad behavior to push such workloads
into using the more optimal read() syscall.

Therefore the change introduced by this patch is in line with the
above principles: either the readaround is favorable and should be
insisted on even under read congestion, or readaround is already
unfavorable and we keep it that way. If there are user complaints, good. We
_helped_ them discover performance bugs in their application and can
point them to the optimal solution. If that's not viable in the short term,
there are workarounds like reducing the readahead size.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap  page fault readahead
  2009-04-10 23:48   ` Andrew Morton
@ 2009-04-11 13:58     ` KOSAKI Motohiro
  2009-04-11 18:49       ` Andrew Morton
  0 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-04-11 13:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, linux-kernel, lpk, wli, torvalds, yinghan,
	Nick Piggin, Hugh Dickins, Rik van Riel

2009/4/11 Andrew Morton <akpm@linux-foundation.org>:
> On Fri, 10 Apr 2009 14:10:03 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
>> @@ -1553,18 +1581,18 @@ retry_find:
>>       if (unlikely(!PageUptodate(page)))
>>               goto page_not_uptodate;
>>
>> -     /* Must recheck i_size under page lock */
>> +     /*
>> +      * Found the page and have a reference on it.
>> +      * We must recheck i_size under page lock.
>> +      */
>>       size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> -     if (unlikely(vmf->pgoff >= size)) {
>> +     if (unlikely(offset >= size)) {
>>               unlock_page(page);
>>               page_cache_release(page);
>>               return VM_FAULT_SIGBUS;
>>       }
>
> This hunk broke
> mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch.
>
>
> I fixed it thusly:
>
>        /*
>         * Found the page and have a reference on it.
>         * We must recheck i_size under page lock.
>         */
>        size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>        if (unlikely(offset >= size)) {
>                unlock_page(page);
>                page_cache_release(page);
>                return VM_FAULT_SIGBUS;
>        }
> +       update_page_reclaim_stat(page);
>        ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
>        vmf->page = page;
>        return ret | VM_FAULT_LOCKED;
>
> which seems logical to me.
>
> Although now I look at it, it seems that
> mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch should
> go into 2.6.30?

I haven't seen this patch in detail and I don't think
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch
should go into 2.6.30, but I have one question.


> Ah.  But I have a note here that I didn't like it, because it adds lots
> of new spinlocking to fastpaths.  So I'll leave things as they stand
> until we have had a little talk about that.

add?

old code: grab zone->lru_lock via mark_page_accessed()
new code: grab zone->lru_lock via update_reclaim_stat()

one remove, one add.

But I agree that the lock can maybe be removed.
I'm sorry for my late work.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap  page fault readahead
  2009-04-11 13:58     ` KOSAKI Motohiro
@ 2009-04-11 18:49       ` Andrew Morton
  2009-04-12 23:16         ` KOSAKI Motohiro
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-11 18:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu Fengguang, linux-kernel, lpk, wli, torvalds, yinghan,
	Nick Piggin, Hugh Dickins, Rik van Riel

On Sat, 11 Apr 2009 22:58:31 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > Ah.  But I have a note here that I didn't like it, because it adds lots
> > of new spinlocking to fastpaths.  So I'll leave things as they stand
> > until we have had a little talk about that.
> 
> add?
> 
> old code: grab zone->lru_lock via mark_page_accessed()
> new code: grab zone->lru_lock via update_reclaim_stat()
> 
> one remove, one add.
> 

mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch adds
new calls to update_page_reclaim_stat() into do_swap_page() and
filemap_fault().  update_page_reclaim_stat() does spin_lock_irq() and
spin_unlock_irq().  It looks like a net slowdown to me.

> But I agree its lock can be removed maybe..

It would be nice to try to do something about it - every little bit
counts.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 7/9] readahead: sequential mmap readahead
  2009-04-10 23:34   ` Andrew Morton
@ 2009-04-12  6:50     ` Wu Fengguang
  2009-04-12  7:09     ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
  1 sibling, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-12  6:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, torvalds, yinghan

On Sat, Apr 11, 2009 at 07:34:13AM +0800, Andrew Morton wrote:
> On Fri, 10 Apr 2009 14:10:04 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Auto-detect sequential mmap reads and do readahead for them.
> > 
> > The sequential mmap readahead will be triggered when
> > - sync readahead: it's a major fault and (prev_offset == offset-1);
> > - async readahead: minor fault on PG_readahead page with valid readahead state.
> > 
> > The benefits of doing readahead instead of read-around:
> > - less I/O wait thanks to async readahead
> > - double real I/O size and no more cache hits
> > 
> > The single stream case is improved a little.
> > For 100,000 sequential mmap reads:
> > 
> >                                     user       system    cpu        total
> > (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
> > (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
> > (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
> > 
> > The patched (2) has smallest total time, since it has no cache hit overheads
> > and less I/O block time(thanks to async readahead). Here the I/O size
> > makes no much difference, since there's only one single stream.
> > 
> > Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
> > since the half of the read-around pages will be readahead cache hits.
> > 
> > This is going to make _real_ differences for _concurrent_ IO streams.
> > 
> > Cc: Nick Piggin <npiggin@suse.de>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/filemap.c |    3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > --- mm.orig/mm/filemap.c
> > +++ mm/mm/filemap.c
> > @@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
> >  	if (VM_RandomReadHint(vma))
> >  		return;
> >  
> > -	if (VM_SequentialReadHint(vma)) {
> > +	if (VM_SequentialReadHint(vma) ||
> > +			offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
> >  		page_cache_sync_readahead(mapping, ra, file, offset, 1);
> >  		return;
> >  	}
> > 
> 
> We've always believed that readaround was beneficial for more random
> access patterns - classically faulting in an executable.  Although I
> don't recall that this belief was very well substantiated.
> 
> (The best results I ever got was by doing readaround and setting the
> size to a few MB, so we slurp the entire executable into memory in one
> hit.  lol.)
> 
> So my question is: what is the probability that this change will
> inadvertently cause a randomish-access workload to fall into readahead
> (rather than readaround) mode, and what is the impact when this
> happens?

Good question!

I did some measuring in order to answer this question.

It's an NFS-root debian desktop system, readahead size = 60 pages.
The numbers are grabbed after a fresh boot into console.

approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size(pages)
   A            383             31.6%           383             11
   B            225             32.4%           390             11
   C            224             32.6%           307             13

case A: mmap sync/async readahead disabled
case B: mmap sync/async readahead enabled, with enforced full async readahead size
case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
or:
A = vanilla 2.6.30-rc1
B = A plus this patchset
C = B plus the following change

@@ static void do_sync_mmap_readahead(struc
 	if (VM_SequentialReadHint(vma) ||
 			offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
-		page_cache_sync_readahead(mapping, ra, file, offset, 1);
+		page_cache_sync_readahead(mapping, ra, file, offset, ra->ra_pages);
 
The theory is
- readahead is _good_ for clustered random reads, and can perform
  _better_ than readaround because it can be _async_.
For this patchset:
- sync readahead size could be smaller than readaround size, hence may
  make things worse by producing more, smaller IOs
- async readahead size is guaranteed to be larger than readaround
  size, and they are _async_, hence will mostly behave better

The summary of the numbers is
- random mmap reads have a good chance of triggering readahead
- 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
- case C can further reduce IO count by 1/4
- readahead miss ratios are not much affected

Final conclusion:
- this patchset reduces major faults by 1/3 with no other overheads;
- mmap io can be further reduced by 1/4 with the following patch.

Raw data follows.

Thanks,
Fengguang
---

Note:
- The duplicate cat runs are from different fresh boots, which shows
  that data variances are <1%.
- The readahead miss ratio is approximated by
  (unreferenced pages reported by page-types) / (LRU file pages reported by meminfo),
  where the meminfo values are in kB and divided by 4 to convert to 4K pages

A: disable sync/async mmap readahead(only readaround)
-----------------------------------------------------

pgmajfault 383
readahead miss ratio ~= 3576 : (36988+8244)/4 = 31.6%
wfg@hp ~% cat /debug/readahead/stats
pattern         count sync_count mmap_count  eof_count       size async_size     actual
initial0          515        515          0        325          4          3          2
subsequent         44          1          0         29         17         17          7
marker             18          0          0         12         11         11          6
around            383        383        383        185         60          0         25
random             43         43          0          4          1          0          1
all              1003        942        383        555         26          2         11
wfg@hp ~% cat /debug/readahead/stats
pattern         count sync_count mmap_count  eof_count       size async_size     actual
initial0          510        510          0        320          4          3          2
subsequent         44          1          0         29         17         17          7
marker             18          0          0         12         11         11          6
around            383        383        383        185         60          0         25
random             43         43          0          4          1          0          1
all               998        937        383        550         26          2         11
wfg@hp ~% cat /debug/readahead/stats
pattern      ra_count   io_count sync_count mmap_count  eof_count    ra_size async_size    io_size
initial0          514        499        499          0        324          4          3          2
subsequent         44         21          1          0          6         17         17         15
marker             18          7          0          0          1         11         11         17
around            383        383        383        383        185         60          0         25
random             43         43         43          0          4          1          0          1
all              1002        953        926        383        520         26          2         11
wfg@hp ~% sudo ./page-types
  flags page-count       MB    symbolic-flags    long-symbolic-flags
0x00000     496335     1938  __________________
0x00004          1        0  __R_______________  referenced
0x00008          8        0  ___U______________  uptodate
0x00014          5        0  __R_D_____________  referenced,dirty
0x00020          1        0  _____l____________  lru
0x00028       3576       13  ___U_l____________  uptodate,lru
0x0002c       5539       21  __RU_l____________  referenced,uptodate,lru
0x00068       3752       14  ___U_lA___________  uptodate,lru,active
0x0006c       1467        5  __RU_lA___________  referenced,uptodate,lru,active
0x00078          3        0  ___UDlA___________  uptodate,dirty,lru,active
0x0007c         17        0  __RUDlA___________  referenced,uptodate,dirty,lru,active
0x00080       2390        9  _______S__________  slab
0x000c0        108        0  ______AS__________  active,slab
0x00228         89        0  ___U_l___x________  uptodate,lru,reclaim
0x0022c         43        0  __RU_l___x________  referenced,uptodate,lru,reclaim
0x00268         21        0  ___U_lA__x________  uptodate,lru,active,reclaim
0x0026c         73        0  __RU_lA__x________  referenced,uptodate,lru,active,reclaim
0x00400        540        2  __________B_______  buddy
  total     513968     2007
wfg@hp ~% cat /proc/meminfo
MemTotal:        1978892 kB
MemFree:         1878628 kB
Buffers:               0 kB
Cached:            45312 kB
SwapCached:            0 kB
Active:            17608 kB
Inactive:          36988 kB
Active(anon):       9364 kB
Inactive(anon):        0 kB
Active(file):       8244 kB
Inactive(file):    36988 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          9160 kB
Mapped:             9588 kB
Slab:              26700 kB
SReclaimable:      14064 kB
SUnreclaim:        12636 kB
PageTables:         1648 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      989444 kB
Committed_AS:      34640 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       83404 kB
VmallocChunk:   34359654179 kB
DirectMap4k:        5824 kB
DirectMap2M:     2050048 kB


B: mmap sync/async readahead enabled, with enforced full async readahead size
-----------------------------------------------------------------------------

pgmajfault 225
readahead miss ratio ~= 3666 : (37080+8192)/4 = 32.4%
wfg@hp ~% cat /debug/readahead/stats # zero lines stripped
pattern         count sync_count mmap_count  eof_count       size async_size     actual
initial0          548        548         40        336          4          3          2
initial             5          5          4          1          4          3          3
subsequent        276          2        231        144         31         31         19
marker            151          0        133        131         54         54          6
around            180        180        180        142         60          0         19
random             43         43          0          4          1          0          1
all              1203        778        588        758         25         15          9
wfg@hp ~% cat /debug/readahead/stats            
pattern         count sync_count mmap_count  eof_count       size async_size     actual
initial0          560        560         40        347          4          3          2
initial             6          6          4          2          4          3          3
subsequent        275          2        232        143         32         31         19
marker            152          0        134        132         54         54          6
around            181        181        181        143         60          0         19
random             43         43          0          4          1          0          1
all              1217        792        591        771         24         15          9
wfg@hp ~% cat /debug/readahead/stats # an extended and more accurate version
pattern      ra_count   io_count sync_count mmap_count  eof_count    ra_size async_size    io_size
initial0          547        532        532         40        335          4          3          2
initial             4          4          4          4          0          4          3          4
subsequent        275        165          2        144         33         32         31         33
marker            151         29          0         22          9         54         54         33
around            180        180        180        180        142         60          0         19
random             43         43         43          0          4          1          0          1
all              1200        953        761        390        523         25         15         11
wfg@hp ~% sudo ./page-types
   flags        page-count       MB  symbolic-flags           long-symbolic-flags
0x000004                 1        0  __R____________________  referenced
0x000020                 1        0  _____l_________________  lru
0x000028              3666       14  ___U_l_________________  uptodate,lru
0x00002c              5587       21  __RU_l_________________  referenced,uptodate,lru
0x000068               549        2  ___U_lA________________  uptodate,lru,active
0x00006c              1506        5  __RU_lA________________  referenced,uptodate,lru,active
0x000080              1469        5  _______S_______________  slab
0x0000c0                49        0  ______AS_______________  active,slab
0x000228                49        0  ___U_l___x_____________  uptodate,lru,reclaim
0x000400               533        2  __________B____________  buddy
0x000800             19245       75  ___________r___________  reserved
0x002008                11        0  ___U_________b_________  uptodate,swapbacked
0x002068              3231       12  ___U_lA______b_________  uptodate,lru,active,swapbacked
0x00206c                25        0  __RU_lA______b_________  referenced,uptodate,lru,active,swapbacked
0x002078                 3        0  ___UDlA______b_________  uptodate,dirty,lru,active,swapbacked
0x00207c                17        0  __RUDlA______b_________  referenced,uptodate,dirty,lru,active,swapbacked
0x010000                15        0  ________________H______  head
0x010014                 1        0  __R_D___________H______  referenced,dirty,head
0x010080               909        3  _______S________H______  slab,head
0x0100c0                59        0  ______AS________H______  active,slab,head
0x020000              4266       16  _________________T_____  tail
0x020014                 4        0  __R_D____________T_____  referenced,dirty,tail
0x400000            472772     1846  ______________________n  noflags
   total            513968     2007
wfg@hp ~% cat /proc/meminfo
MemTotal:        1978892 kB
MemFree:         1878776 kB
Buffers:               0 kB
Cached:            45352 kB
SwapCached:            0 kB
Active:            17456 kB
Inactive:          37080 kB
Active(anon):       9264 kB
Inactive(anon):        0 kB
Active(file):       8192 kB
Inactive(file):    37080 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          9184 kB
Mapped:             9588 kB
Slab:              26592 kB
SReclaimable:      14016 kB
SUnreclaim:        12576 kB
PageTables:         1624 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      989444 kB
Committed_AS:      34640 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       83404 kB
VmallocChunk:   34359654179 kB
DirectMap4k:        5824 kB
DirectMap2M:     2050048 kB


C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
----------------------------------------------------------------------------------

pgmajfault 224
readahead miss ratio ~= 3760 : (37596+8484)/4 = 32.6%

wfg@hp ~% cat /debug/readahead/stats
pattern         count sync_count mmap_count  eof_count       size async_size     actual
initial0          554        554         40        354         12          7          5
initial             4          4          4          3        120         60         63
subsequent        185          1        142        119         33         33         19
marker            147          0        129        130         54         54          5
around            179        179        179        142         60          0         19
random             43         43          0          4          1          0          1
all              1112        781        494        752         29         16         10
wfg@hp ~% cat /debug/readahead/stats
pattern         count sync_count mmap_count  eof_count       size async_size     actual
initial0          550        550         40        350         12          7          5
initial             4          4          4          3        120         60         63
subsequent        186          1        142        120         33         33         19
marker            147          0        129        130         54         54          5
around            179        179        179        142         60          0         19
random             43         43          0          4          1          0          1
all              1109        777        494        749         29         16         10
wfg@hp ~% cat /debug/readahead/stats
pattern      ra_count   io_count sync_count mmap_count  eof_count    ra_size async_size    io_size
initial0          551        536        536         40        351         12          7          5
initial             4          4          4          4          3        120         60         63
subsequent        186         87          1         66         21         33         33         41
marker            147         25          0         18          8         54         54         31
around            179        179        179        179        142         60          0         19
random             43         43         43          0          4          1          0          1
all              1110        874        763        307        529         29         16         13
wfg@hp ~% sudo ./page-types
  flags page-count       MB    symbolic-flags    long-symbolic-flags
0x00000     496178     1938  __________________
0x00004          1        0  __R_______________  referenced
0x00008         12        0  ___U______________  uptodate
0x00014          5        0  __R_D_____________  referenced,dirty
0x00020          1        0  _____l____________  lru
0x00028       3760       14  ___U_l____________  uptodate,lru
0x0002c       5566       21  __RU_l____________  referenced,uptodate,lru
0x00068       3806       14  ___U_lA___________  uptodate,lru,active
0x0006c       1546        6  __RU_lA___________  referenced,uptodate,lru,active
0x00078          3        0  ___UDlA___________  uptodate,dirty,lru,active
0x0007c         17        0  __RUDlA___________  referenced,uptodate,dirty,lru,active
0x00080       2393        9  _______S__________  slab
0x000c0        109        0  ______AS__________  active,slab
0x00228         48        0  ___U_l___x________  uptodate,lru,reclaim
0x00400        523        2  __________B_______  buddy
  total     513968     2007
wfg@hp ~% cat /proc/meminfo 
MemTotal:        1978892 kB
MemFree:         1877516 kB
Buffers:               0 kB
Cached:            46160 kB
SwapCached:            0 kB
Active:            18012 kB
Inactive:          37596 kB
Active(anon):       9528 kB
Inactive(anon):        0 kB
Active(file):       8484 kB
Inactive(file):    37596 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          9328 kB
Mapped:             9748 kB
Slab:              26920 kB
SReclaimable:      14324 kB
SUnreclaim:        12596 kB
PageTables:         1632 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      989444 kB
Committed_AS:      34800 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       83404 kB
VmallocChunk:   34359654179 kB
DirectMap4k:        5824 kB
DirectMap2M:     2050048 kB


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] readahead: enforce full sync mmap readahead size
  2009-04-10 23:34   ` Andrew Morton
  2009-04-12  6:50     ` Wu Fengguang
@ 2009-04-12  7:09     ` Wu Fengguang
  2009-04-12 15:15       ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-12  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, npiggin, torvalds, yinghan

Now that we do readahead for sequential mmap reads, here is
a simple evaluation of the impacts, and one further optimization.

It's an NFS-root Debian desktop system with a readahead size of 60 pages.
The numbers are grabbed after a fresh boot into console.

approach        pgmajfault      RA miss ratio   mmap IO count   avg IO size (pages)
   A            383             31.6%           383             11
   B            225             32.4%           390             11
   C            224             32.6%           307             13

case A: mmap sync/async readahead disabled
case B: mmap sync/async readahead enabled, with enforced full async readahead size
case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
or:
A = vanilla 2.6.30-rc1
B = A plus mmap readahead
C = B plus this patch

The numbers show that
- random mmap reads have a good chance of triggering readahead
- 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
- case C can further reduce the IO count by 1/4
- readahead miss ratios are hardly affected

The theory is
- readahead is _good_ for clustered random reads, and can perform
  _better_ than readaround because it can be _async_.
- async readahead size is guaranteed to be larger than readaround
  size, and it is _async_, hence will mostly behave better
However, for B
- sync readahead size could be smaller than readaround size, hence may
  make things worse by producing more, smaller IOs
which is fixed by this patch.

Final conclusion:
- mmap readahead reduced major faults by 1/3 with no obvious overheads;
- mmap IO can be further reduced by 1/4 with this patch.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
 	if (VM_SequentialReadHint(vma) ||
 			offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
 		ra->flags |= RA_FLAG_MMAP;
-		page_cache_sync_readahead(mapping, ra, file, offset, 1);
+		page_cache_sync_readahead(mapping, ra, file, offset,
+					  ra->ra_pages);
 		return;
 	}
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] readahead: enforce full sync mmap readahead size
  2009-04-12  7:09     ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
@ 2009-04-12 15:15       ` Linus Torvalds
  2009-04-13 13:53         ` Wu Fengguang
  2009-04-14  7:01         ` Nick Piggin
  0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2009-04-12 15:15 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel, npiggin, yinghan



On Sun, 12 Apr 2009, Wu Fengguang wrote:
>
> Now that we do readahead for sequential mmap reads, here is
> a simple evaluation of the impacts, and one further optimization.

Hmm.

Wu, I just went through your latest (?) series of 1-9 and they all looked 
(a) quite small and (b) all of them looked like good cleanups. 

And not only do they look good, you seem to have numbers to back it all up 
too.

In other words, I'd really prefer to merge this sooner rather than later. 
There just doesn't seem to be any reason _not_ to. Is there any reason to 
not just take this? I realize that it's past -rc1, but this is way smaller 
and saner-looking than the average patch that makes it in past -rc1.

Besides, it was originally posted before -rc1, and the last series didn't 
have the much more intrusive page-fault-retry patches. I'd leave those for 
the next merge window, but the read-ahead series (1-9 plus this final 
one-liner) seem to be pure improvement - both in code readability _and_ in 
numbers - with no real contentious issues.

No?

			Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap  page fault readahead
  2009-04-11 18:49       ` Andrew Morton
@ 2009-04-12 23:16         ` KOSAKI Motohiro
  0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-04-12 23:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Wu Fengguang, linux-kernel, lpk, wli, torvalds,
	yinghan, Nick Piggin, Hugh Dickins, Rik van Riel

> On Sat, 11 Apr 2009 22:58:31 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > Ah.  But I have a note here that I didn't like it, because it adds lots
> > > of new spinlocking to fastpaths.  So I'll leave things as they stand
> > > until we have had a little talk about that.
> > 
> > add?
> > 
> > old code: grab zone->lru_lock via mark_page_accessed()
> > new code: grab zone->lru_lock via update_reclaim_stat()
> > 
> > one remove, one add.
> > 
> 
> mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch adds
> new calls to update_page_reclaim_stat() into do_swap_page() and
> filemap_fault().  update_page_reclaim_stat() does spin_lock_irq() and
> spin_unlock_irq().  It looks like a net slowdown to me.

Ah, I compared against the code from before Nick's mark_page_accessed()
removal, but you didn't.

But I agree this patch reduces the worth of Nick's work.
I have to improve it more.

> > But I agree its lock can be removed maybe..
> 
> It would be nice to try to do something about it - every little bit
> counts.

Yes. I can't oppose it.





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] readahead: enforce full sync mmap readahead size
  2009-04-12 15:15       ` Linus Torvalds
@ 2009-04-13 13:53         ` Wu Fengguang
  2009-04-14  7:01         ` Nick Piggin
  1 sibling, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-13 13:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel, npiggin, yinghan

On Sun, Apr 12, 2009 at 08:15:12AM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 12 Apr 2009, Wu Fengguang wrote:
> >
> > Now that we do readahead for sequential mmap reads, here is
> > a simple evaluation of the impacts, and one further optimization.
> 
> Hmm.
> 
> Wu, I just went through your latest (?) series of 1-9 and they all looked 
> (a) quite small and (b) all of them looked like good cleanups. 
> 
> And not only do they look good, you seem to have numbers to back it all up 
> too.
> 
> In other words, I'd really prefer to merge this sooner rather than later. 
> There just doesn't seem to be any reason _not_ to. Is there any reason to 
> not just take this? I realize that it's past -rc1, but this is way smaller 
> and saner-looking than the average patch that makes it in past -rc1.
> 
> Besides, it was originally posted before -rc1, and the last series didn't 
> have the much more intrusive page-fault-retry patches. I'd leave those for 
> the next merge window, but the read-ahead series (1-9 plus this final 
> one-liner) seem to be pure improvement - both in code readability _and_ in 
> numbers - with no real contentious issues.
> 
> No?

They should be fine for 2.6.30. For various reasons I'm submitting them a
bit late, but they are in fact mostly old patches that have received
good pondering and testing in my cookroom :-)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] readahead: enforce full sync mmap readahead size
  2009-04-12 15:15       ` Linus Torvalds
  2009-04-13 13:53         ` Wu Fengguang
@ 2009-04-14  7:01         ` Nick Piggin
  1 sibling, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2009-04-14  7:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Wu Fengguang, Andrew Morton, linux-kernel, yinghan

On Sun, Apr 12, 2009 at 08:15:12AM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 12 Apr 2009, Wu Fengguang wrote:
> >
> > Now that we do readahead for sequential mmap reads, here is
> > a simple evaluation of the impacts, and one further optimization.
> 
> Hmm.
> 
> Wu, I just went through your latest (?) series of 1-9 and they all looked 
> (a) quite small and (b) all of them looked like good cleanups. 
> 
> And not only do they look good, you seem to have numbers to back it all up 
> too.

They look pretty good to me too.

 
> In other words, I'd really prefer to merge this sooner rather than later. 
> There just doesn't seem to be any reason _not_ to. Is there any reason to 
> not just take this? I realize that it's past -rc1, but this is way smaller 
> and saner-looking than the average patch that makes it in past -rc1.
> 
> Besides, it was originally posted before -rc1, and the last series didn't 
> have the much more intrusive page-fault-retry patches. I'd leave those for 
> the next merge window, but the read-ahead series (1-9 plus this final 
> one-liner) seem to be pure improvement - both in code readability _and_ in 
> numbers - with no real contentious issues.
> 
> No?

I guess untested code, especially with heuristics, can always cause
non-intuitive problems for some people. So the other side of the argument
is: what's the harm in putting them in -mm until the next merge window?

That said, I like these kinds of things, so I don't object :)


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-04-14  7:01 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-10  6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
2009-04-10  6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
2009-04-10  6:09 ` [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead() Wu Fengguang
2009-04-10  6:10 ` [PATCH 3/9] readahead: remove one unnecessary radix tree lookup Wu Fengguang
2009-04-10  6:10 ` [PATCH 4/9] readahead: increase interleaved readahead size Wu Fengguang
2009-04-10  6:10 ` [PATCH 5/9] readahead: remove sync/async readahead call dependency Wu Fengguang
2009-04-10  6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
2009-04-10 23:48   ` Andrew Morton
2009-04-11 13:58     ` KOSAKI Motohiro
2009-04-11 18:49       ` Andrew Morton
2009-04-12 23:16         ` KOSAKI Motohiro
2009-04-10  6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
2009-04-10 23:34   ` Andrew Morton
2009-04-12  6:50     ` Wu Fengguang
2009-04-12  7:09     ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
2009-04-12 15:15       ` Linus Torvalds
2009-04-13 13:53         ` Wu Fengguang
2009-04-14  7:01         ` Nick Piggin
2009-04-10  6:10 ` [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead Wu Fengguang
2009-04-10  6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
2009-04-10 23:38   ` Andrew Morton
2009-04-11  4:24     ` Wu Fengguang
