* [PATCH 0/9] filemap and readahead fixes for linux-next
@ 2009-04-10 6:09 Wu Fengguang
2009-04-10 6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
` (8 more replies)
0 siblings, 9 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Ying Han
Hi Andrew,
This is the original patchset minus the fault retry ones,
rebased onto next-20090409.
Thanks,
Fengguang
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead()
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
@ 2009-04-10 6:09 ` Wu Fengguang
2009-04-10 6:09 ` [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead() Wu Fengguang
` (7 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-move-max_sane_readahead.patch --]
[-- Type: text/plain, Size: 1618 bytes --]
Impact: code simplification.
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/fadvise.c | 2 +-
mm/filemap.c | 3 +--
mm/madvise.c | 3 +--
mm/readahead.c | 1 +
4 files changed, 4 insertions(+), 5 deletions(-)
--- mm.orig/mm/fadvise.c
+++ mm/mm/fadvise.c
@@ -101,7 +101,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
ret = force_page_cache_readahead(mapping, file,
start_index,
- max_sane_readahead(nrpages));
+ nrpages);
if (ret > 0)
ret = 0;
break;
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1389,8 +1389,7 @@ do_readahead(struct address_space *mappi
if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
return -EINVAL;
- force_page_cache_readahead(mapping, filp, index,
- max_sane_readahead(nr));
+ force_page_cache_readahead(mapping, filp, index, nr);
return 0;
}
--- mm.orig/mm/madvise.c
+++ mm/mm/madvise.c
@@ -123,8 +123,7 @@ static long madvise_willneed(struct vm_a
end = vma->vm_end;
end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- force_page_cache_readahead(file->f_mapping,
- file, start, max_sane_readahead(end - start));
+ force_page_cache_readahead(file->f_mapping, file, start, end - start);
return 0;
}
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -210,6 +210,7 @@ int force_page_cache_readahead(struct ad
if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
return -EINVAL;
+ nr_to_read = max_sane_readahead(nr_to_read);
while (nr_to_read) {
int err;
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead()
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
2009-04-10 6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
@ 2009-04-10 6:09 ` Wu Fengguang
2009-04-10 6:10 ` [PATCH 3/9] readahead: remove one unnecessary radix tree lookup Wu Fengguang
` (6 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-sane-max.patch --]
[-- Type: text/plain, Size: 627 bytes --]
Just in case someone aggressively sets a huge readahead size.
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -357,7 +357,7 @@ ondemand_readahead(struct address_space
bool hit_readahead_marker, pgoff_t offset,
unsigned long req_size)
{
- int max = ra->ra_pages; /* max readahead pages */
+ unsigned long max = max_sane_readahead(ra->ra_pages);
pgoff_t prev_offset;
int sequential;
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 3/9] readahead: remove one unnecessary radix tree lookup
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
2009-04-10 6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
2009-04-10 6:09 ` [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead() Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 6:10 ` [PATCH 4/9] readahead: increase interleaved readahead size Wu Fengguang
` (5 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-interleaved-offset.patch --]
[-- Type: text/plain, Size: 659 bytes --]
(hit_readahead_marker != 0) means the page at @offset is present,
so we can search for the first non-present page starting from @offset+1.
Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -395,7 +395,7 @@ ondemand_readahead(struct address_space
pgoff_t start;
rcu_read_lock();
- start = radix_tree_next_hole(&mapping->page_tree, offset,max+1);
+ start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
rcu_read_unlock();
if (!start || start - offset > max)
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 4/9] readahead: increase interleaved readahead size
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
` (2 preceding siblings ...)
2009-04-10 6:10 ` [PATCH 3/9] readahead: remove one unnecessary radix tree lookup Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 6:10 ` [PATCH 5/9] readahead: remove sync/async readahead call dependency Wu Fengguang
` (4 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-interleaved-size.patch --]
[-- Type: text/plain, Size: 583 bytes --]
Make sure the interleaved readahead size is larger than the request size.
This also makes the readahead window grow more quickly.
Reported-by: Xu Chenfeng <xcf@ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 1 +
1 file changed, 1 insertion(+)
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -403,6 +403,7 @@ ondemand_readahead(struct address_space
ra->start = start;
ra->size = start - offset; /* old async_size */
+ ra->size += req_size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
goto readit;
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 5/9] readahead: remove sync/async readahead call dependency
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
` (3 preceding siblings ...)
2009-04-10 6:10 ` [PATCH 4/9] readahead: increase interleaved readahead size Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
` (3 subsequent siblings)
8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-remove-call-dependancy.patch --]
[-- Type: text/plain, Size: 1788 bytes --]
The readahead call scheme is error-prone in that it expects the call sites
to check for an async readahead after doing any sync one. I.e.

	if (!page)
		page_cache_sync_readahead();
	page = find_get_page();
	if (page && PageReadahead(page))
		page_cache_async_readahead();

This is because PG_readahead could be set by a sync readahead on the _current_
newly faulted-in page, and the readahead code simply expects one more callback
on the same page to start the async readahead. If the caller fails to do so, it
will miss the PG_readahead bits and never be able to start an async readahead.
Eliminate this insane constraint by piggy-backing the async part into the
current readahead window.
Now if an async readahead should be started immediately after a sync one,
the readahead logic itself will do it. So the following code becomes valid:
(the 'else' in particular)
if (!page)
page_cache_sync_readahead();
else if (PageReadahead(page))
page_cache_async_readahead();
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 10 ++++++++++
1 file changed, 10 insertions(+)
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -421,6 +421,16 @@ ondemand_readahead(struct address_space
ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
readit:
+ /*
+ * Will this read hit the readahead marker made by itself?
+ * If so, trigger the readahead marker hit now, and merge
+ * the resulted next readahead window into the current one.
+ */
+ if (offset == ra->start && ra->size == ra->async_size) {
+ ra->async_size = get_next_ra_size(ra, max);
+ ra->size += ra->async_size;
+ }
+
return ra_submit(ra, mapping, filp);
}
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
` (4 preceding siblings ...)
2009-04-10 6:10 ` [PATCH 5/9] readahead: remove sync/async readahead call dependency Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 23:48 ` Andrew Morton
2009-04-10 6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
` (2 subsequent siblings)
8 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, Pavel Levshin, wli, Nick Piggin, Wu Fengguang,
Linus Torvalds, Ying Han
[-- Attachment #1: readahead-mmap-split-code-and-cleanup.patch --]
[-- Type: text/plain, Size: 8906 bytes --]
From: Linus Torvalds <torvalds@linux-foundation.org>
This shouldn't really change behavior all that much, but the single
rather complex function with read-ahead inside a loop etc is broken up
into more manageable pieces.
The behaviour is also less subtle, with the read-ahead being done up-front
rather than inside some subtle loop and thus avoiding the now unnecessary
extra state variables (ie "did_readaround" is gone).
Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM was set, in
which case the original code jumped directly to the no_cached_page path.
Cc: Pavel Levshin <lpk@581.spb.su>
Cc: wli@movementarian.org
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Ok, so this is something I did in Mexico when I wasn't scuba-diving, and
was "watching" the kids at the pool. It was brought on by looking at git
mmap file behaviour under cold-cache behaviour: git does ok, but my laptop
disk is really slow, and I tried to verify that the kernel did a
reasonable job of read-ahead when taking page faults.
I think it did, but quite frankly, the filemap_fault() code was totally
unreadable. So this separates out the read-ahead cases, and adds more
comments, and also changes it so that we do asynchronous read-ahead
*before* we actually wait for the page we are waiting for to become
unlocked.
Not that it seems to make any real difference on my laptop, but I really
hated how it was doing a
page = get_lock_page(..)
and then doing read-ahead after that: which just guarantees that we have
to wait for any outstanding IO on "page" to complete before we can even
submit any new read-ahead! That just seems totally broken!
So it replaces the "get_lock_page()" at the top with a broken-out page
cache lookup, which allows us to look at the page state flags and make
appropriate decisions on what we should do without waiting for the locked
bit to clear.
It does add many more lines than it removes:
mm/filemap.c | 192 +++++++++++++++++++++++++++++++++++++++-------------------
1 files changed, 130 insertions(+), 62 deletions(-)
but that's largely due to (a) the new function headers etc due to the
split-up and (b) new or extended comments especially about the helper
functions. The code, in many ways, is actually simpler, apart from the
fairly trivial expansion of the equivalent of "get_lock_page()" into the
function.
Comments? I tried to avoid changing the read-ahead logic itself, although
the old code did some strange things like doing *both* async readahead and
then looking up the page and doing sync readahead (which I think was just
due to the code being so damn messily organized, not on purpose).
Linus
---
mm/filemap.c | 156 +++++++++++++++++++++++++++----------------------
1 file changed, 89 insertions(+), 67 deletions(-)
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1455,6 +1455,68 @@ static int page_cache_read(struct file *
#define MMAP_LOTSAMISS (100)
+/*
+ * Synchronous readahead happens when we don't even find
+ * a page in the page cache at all.
+ */
+static void do_sync_mmap_readahead(struct vm_area_struct *vma,
+ struct file_ra_state *ra,
+ struct file *file,
+ pgoff_t offset)
+{
+ unsigned long ra_pages;
+ struct address_space *mapping = file->f_mapping;
+
+ /* If we don't want any read-ahead, don't bother */
+ if (VM_RandomReadHint(vma))
+ return;
+
+ if (VM_SequentialReadHint(vma)) {
+ page_cache_sync_readahead(mapping, ra, file, offset, 1);
+ return;
+ }
+
+ if (ra->mmap_miss < INT_MAX)
+ ra->mmap_miss++;
+
+ /*
+ * Do we miss much more than hit in this file? If so,
+ * stop bothering with read-ahead. It will only hurt.
+ */
+ if (ra->mmap_miss > MMAP_LOTSAMISS)
+ return;
+
+ ra_pages = max_sane_readahead(ra->ra_pages);
+ if (ra_pages) {
+ pgoff_t start = 0;
+
+ if (offset > ra_pages / 2)
+ start = offset - ra_pages / 2;
+ do_page_cache_readahead(mapping, file, start, ra_pages);
+ }
+}
+
+/*
+ * Asynchronous readahead happens when we find the page and PG_readahead,
+ * so we want to possibly extend the readahead further..
+ */
+static void do_async_mmap_readahead(struct vm_area_struct *vma,
+ struct file_ra_state *ra,
+ struct file *file,
+ struct page *page,
+ pgoff_t offset)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ /* If we don't want any read-ahead, don't bother */
+ if (VM_RandomReadHint(vma))
+ return;
+ if (ra->mmap_miss > 0)
+ ra->mmap_miss--;
+ if (PageReadahead(page))
+ page_cache_async_readahead(mapping, ra, file, page, offset, 1);
+}
+
/**
* filemap_fault - read in file data for page fault handling
* @vma: vma in which the fault was taken
@@ -1474,78 +1536,44 @@ int filemap_fault(struct vm_area_struct
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
+ pgoff_t offset = vmf->pgoff;
struct page *page;
pgoff_t size;
- int did_readaround = 0;
int ret = 0;
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (vmf->pgoff >= size)
+ if (offset >= size)
return VM_FAULT_SIGBUS;
- /* If we don't want any read-ahead, don't bother */
- if (VM_RandomReadHint(vma))
- goto no_cached_page;
-
/*
* Do we have something in the page cache already?
*/
-retry_find:
- page = find_lock_page(mapping, vmf->pgoff);
- /*
- * For sequential accesses, we use the generic readahead logic.
- */
- if (VM_SequentialReadHint(vma)) {
- if (!page) {
- page_cache_sync_readahead(mapping, ra, file,
- vmf->pgoff, 1);
- page = find_lock_page(mapping, vmf->pgoff);
- if (!page)
- goto no_cached_page;
- }
- if (PageReadahead(page)) {
- page_cache_async_readahead(mapping, ra, file, page,
- vmf->pgoff, 1);
- }
- }
-
- if (!page) {
- unsigned long ra_pages;
-
- ra->mmap_miss++;
-
+ page = find_get_page(mapping, offset);
+ if (likely(page)) {
/*
- * Do we miss much more than hit in this file? If so,
- * stop bothering with read-ahead. It will only hurt.
+ * We found the page, so try async readahead before
+ * waiting for the lock.
*/
- if (ra->mmap_miss > MMAP_LOTSAMISS)
- goto no_cached_page;
+ do_async_mmap_readahead(vma, ra, file, page, offset);
+ lock_page(page);
- /*
- * To keep the pgmajfault counter straight, we need to
- * check did_readaround, as this is an inner loop.
- */
- if (!did_readaround) {
- ret = VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- }
- did_readaround = 1;
- ra_pages = max_sane_readahead(file->f_ra.ra_pages);
- if (ra_pages) {
- pgoff_t start = 0;
-
- if (vmf->pgoff > ra_pages / 2)
- start = vmf->pgoff - ra_pages / 2;
- do_page_cache_readahead(mapping, file, start, ra_pages);
+ /* Did it get truncated? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ put_page(page);
+ goto no_cached_page;
}
- page = find_lock_page(mapping, vmf->pgoff);
+ } else {
+ /* No page in the page cache at all */
+ do_sync_mmap_readahead(vma, ra, file, offset);
+ count_vm_event(PGMAJFAULT);
+ ret = VM_FAULT_MAJOR;
+retry_find:
+ page = find_lock_page(mapping, offset);
if (!page)
goto no_cached_page;
}
- if (!did_readaround)
- ra->mmap_miss--;
-
/*
* We have a locked page in the page cache, now we need to check
* that it's up-to-date. If not, it is going to be due to an error.
@@ -1553,18 +1581,18 @@ retry_find:
if (unlikely(!PageUptodate(page)))
goto page_not_uptodate;
- /* Must recheck i_size under page lock */
+ /*
+ * Found the page and have a reference on it.
+ * We must recheck i_size under page lock.
+ */
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
+ if (unlikely(offset >= size)) {
unlock_page(page);
page_cache_release(page);
return VM_FAULT_SIGBUS;
}
- /*
- * Found the page and have a reference on it.
- */
- ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT;
+ ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
vmf->page = page;
return ret | VM_FAULT_LOCKED;
@@ -1573,7 +1601,7 @@ no_cached_page:
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
- error = page_cache_read(file, vmf->pgoff);
+ error = page_cache_read(file, offset);
/*
* The page we want has now been added to the page cache.
@@ -1593,12 +1621,6 @@ no_cached_page:
return VM_FAULT_SIGBUS;
page_not_uptodate:
- /* IO error path */
- if (!did_readaround) {
- ret = VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- }
-
/*
* Umm, take care of errors if the page isn't up-to-date.
* Try to re-read it _once_. We do this synchronously,
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 7/9] readahead: sequential mmap readahead
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
` (5 preceding siblings ...)
2009-04-10 6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 23:34 ` Andrew Morton
2009-04-10 6:10 ` [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead Wu Fengguang
2009-04-10 6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
8 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-mmap-sequential-readahead.patch --]
[-- Type: text/plain, Size: 1796 bytes --]
Auto-detect sequential mmap reads and do readahead for them.
The sequential mmap readahead will be triggered when
- sync readahead: it's a major fault and (prev_offset == offset-1);
- async readahead: minor fault on PG_readahead page with valid readahead state.
The benefits of doing readahead instead of read-around:
- less I/O wait thanks to async readahead
- double real I/O size and no more cache hits
The single stream case is improved a little.
For 100,000 sequential mmap reads:
user system cpu total
(1-1) plain -mm, 128KB readaround: 3.224 2.554 48.40% 11.838
(1-2) plain -mm, 256KB readaround: 3.170 2.392 46.20% 11.976
(2) patched -mm, 128KB readahead: 3.117 2.448 47.33% 11.607
The patched kernel (2) has the smallest total time, since it has no cache-hit
overheads and less I/O block time (thanks to async readahead). Here the I/O
size makes little difference, since there is only a single stream.
Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
since half of the read-around pages will be readahead cache hits.
This is going to make _real_ differences for _concurrent_ IO streams.
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/filemap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
if (VM_RandomReadHint(vma))
return;
- if (VM_SequentialReadHint(vma)) {
+ if (VM_SequentialReadHint(vma) ||
+ offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
page_cache_sync_readahead(mapping, ra, file, offset, 1);
return;
}
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
` (6 preceding siblings ...)
2009-04-10 6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
8 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Linus Torvalds, Nick Piggin, Wu Fengguang, Ying Han
[-- Attachment #1: readahead-mmap-full-async-readahead-size.patch --]
[-- Type: text/plain, Size: 2486 bytes --]
We need this in one particular case and two more general ones.
Now we do async readahead for sequential mmap reads, and do it with the help of
PG_readahead. For normal reads, PG_readahead is a sufficient condition to do
a sequential readahead. But unfortunately, for mmap reads, there is a tiny nuisance:
[11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
[11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
[11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
[11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~
An unfavorably small readahead. The original dumb read-around size could be more efficient.
That happened because ld-linux.so does a read(832) in L1 before mmap(),
which triggers a 4-page readahead, with the second page tagged PG_readahead.
L0: open("/lib/libc.so.6", O_RDONLY) = 3
L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
L7: close(3) = 0
In general, the PG_readahead flag will also be hit in cases
- sequential reads
- clustered random reads
A full readahead size is desirable in both cases.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/filemap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1515,7 +1515,8 @@ static void do_async_mmap_readahead(stru
if (ra->mmap_miss > 0)
ra->mmap_miss--;
if (PageReadahead(page))
- page_cache_async_readahead(mapping, ra, file, page, offset, 1);
+ page_cache_async_readahead(mapping, ra, file,
+ page, offset, ra->ra_pages);
}
/**
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 9/9] readahead: record mmap read-around states in file_ra_state
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
` (7 preceding siblings ...)
2009-04-10 6:10 ` [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead Wu Fengguang
@ 2009-04-10 6:10 ` Wu Fengguang
2009-04-10 23:38 ` Andrew Morton
8 siblings, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-10 6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: LKML, Nick Piggin, Linus Torvalds, Ying Han
[-- Attachment #1: readahead-mmap-readaround-use-ra_submit.patch --]
[-- Type: text/plain, Size: 3906 bytes --]
Mmap read-around now shares the same code style and data structure
with readahead code.
This also removes do_page_cache_readahead().
Its last user, mmap read-around, has been changed to call ra_submit().
The no-readahead-if-congested logic is dropped along the way.
Users will be pretty sensitive to slow loading of executables,
so it's unfavorable to disable mmap read-around on a congested queue.
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 5 +++--
mm/filemap.c | 12 +++++++-----
mm/readahead.c | 23 ++---------------------
3 files changed, 12 insertions(+), 28 deletions(-)
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1487,13 +1487,15 @@ static void do_sync_mmap_readahead(struc
if (ra->mmap_miss > MMAP_LOTSAMISS)
return;
+ /*
+ * mmap read-around
+ */
ra_pages = max_sane_readahead(ra->ra_pages);
if (ra_pages) {
- pgoff_t start = 0;
-
- if (offset > ra_pages / 2)
- start = offset - ra_pages / 2;
- do_page_cache_readahead(mapping, file, start, ra_pages);
+ ra->start = max_t(long, 0, offset - ra_pages/2);
+ ra->size = ra_pages;
+ ra->async_size = 0;
+ ra_submit(ra, mapping, file);
}
}
--- mm.orig/include/linux/mm.h
+++ mm/include/linux/mm.h
@@ -1181,8 +1181,6 @@ void task_dirty_inc(struct task_struct *
#define VM_MAX_READAHEAD 128 /* kbytes */
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
-int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
- pgoff_t offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
@@ -1200,6 +1198,9 @@ void page_cache_async_readahead(struct a
unsigned long size);
unsigned long max_sane_readahead(unsigned long nr);
+unsigned long ra_submit(struct file_ra_state *ra,
+ struct address_space *mapping,
+ struct file *filp);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- mm.orig/mm/readahead.c
+++ mm/mm/readahead.c
@@ -133,15 +133,12 @@ out:
}
/*
- * do_page_cache_readahead actually reads a chunk of disk. It allocates all
+ * __do_page_cache_readahead() actually reads a chunk of disk. It allocates all
* the pages first, then submits them all for I/O. This avoids the very bad
* behaviour which would occur if page allocations are causing VM writeback.
* We really don't want to intermingle reads and writes like that.
*
* Returns the number of pages requested, or the maximum amount of I/O allowed.
- *
- * do_page_cache_readahead() returns -1 if it encountered request queue
- * congestion.
*/
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
@@ -232,22 +229,6 @@ int force_page_cache_readahead(struct ad
}
/*
- * This version skips the IO if the queue is read-congested, and will tell the
- * block layer to abandon the readahead if request allocation would block.
- *
- * force_page_cache_readahead() will ignore queue congestion and will block on
- * request queues.
- */
-int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
- pgoff_t offset, unsigned long nr_to_read)
-{
- if (bdi_read_congested(mapping->backing_dev_info))
- return -1;
-
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
-}
-
-/*
* Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
* sensible upper limit.
*/
@@ -260,7 +241,7 @@ unsigned long max_sane_readahead(unsigne
/*
* Submit IO for the read-ahead request in file_ra_state.
*/
-static unsigned long ra_submit(struct file_ra_state *ra,
+unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping, struct file *filp)
{
int actual;
--
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 7/9] readahead: sequential mmap readahead
2009-04-10 6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
@ 2009-04-10 23:34 ` Andrew Morton
2009-04-12 6:50 ` Wu Fengguang
2009-04-12 7:09 ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2009-04-10 23:34 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, npiggin, torvalds, fengguang.wu, yinghan
On Fri, 10 Apr 2009 14:10:04 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Auto-detect sequential mmap reads and do readahead for them.
>
> The sequential mmap readahead will be triggered when
> - sync readahead: it's a major fault and (prev_offset == offset-1);
> - async readahead: minor fault on PG_readahead page with valid readahead state.
>
> The benefits of doing readahead instead of read-around:
> - less I/O wait thanks to async readahead
> - double real I/O size and no more cache hits
>
> The single stream case is improved a little.
> For 100,000 sequential mmap reads:
>
> user system cpu total
> (1-1) plain -mm, 128KB readaround: 3.224 2.554 48.40% 11.838
> (1-2) plain -mm, 256KB readaround: 3.170 2.392 46.20% 11.976
> (2) patched -mm, 128KB readahead: 3.117 2.448 47.33% 11.607
>
> The patched (2) has smallest total time, since it has no cache hit overheads
> and less I/O block time(thanks to async readahead). Here the I/O size
> makes no much difference, since there's only one single stream.
>
> Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
> since the half of the read-around pages will be readahead cache hits.
>
> This is going to make _real_ differences for _concurrent_ IO streams.
>
> Cc: Nick Piggin <npiggin@suse.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> mm/filemap.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- mm.orig/mm/filemap.c
> +++ mm/mm/filemap.c
> @@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
> if (VM_RandomReadHint(vma))
> return;
>
> - if (VM_SequentialReadHint(vma)) {
> + if (VM_SequentialReadHint(vma) ||
> + offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
> page_cache_sync_readahead(mapping, ra, file, offset, 1);
> return;
> }
>
We've always believed that readaround was beneficial for more random
access patterns - classically faulting in an executable. Although I
don't recall that this belief was very well substantiated.
(The best results I ever got was by doing readaround and setting the
size to a few MB, so we slurp the entire executable into memory in one
hit. lol.)
So my question is: what is the probability that this change will
inadvertently cause a randomish-access workload to fall into readahead
(rather than readaround) mode, and what is the impact when this
happens?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 9/9] readahead: record mmap read-around states in file_ra_state
2009-04-10 6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
@ 2009-04-10 23:38 ` Andrew Morton
2009-04-11 4:24 ` Wu Fengguang
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-10 23:38 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, npiggin, torvalds, yinghan
On Fri, 10 Apr 2009 14:10:06 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Mmap read-around now shares the same code style and data structure
> with readahead code.
>
> This also removes do_page_cache_readahead().
> Its last user, mmap read-around, has been changed to call ra_submit().
>
> The no-readahead-if-congested logic is dropped along the way.
> Users will be pretty sensitive to the slow loading of executables,
> so it's unfavorable to disable mmap read-around on a congested queue.
Did you verify that the read-congested code ever triggers?
It used to be (and probably still is) the case that
bdi_read_congested() is very very rare, because the read queue is long
and the kernel rarely puts many read requests into it. You can of
course create this condition with a fake workload with many
threads/processes, but it _is_ fake.
Some real-world workloads (databases?) will of course trigger
bdi_read_congested(). But they're usually doing fixed-sized reads, and
if we're doing _any_ readahead/readaround in that case, readahead is
busted.
* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
2009-04-10 6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
@ 2009-04-10 23:48 ` Andrew Morton
2009-04-11 13:58 ` KOSAKI Motohiro
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-10 23:48 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel, lpk, wli, npiggin, fengguang.wu, torvalds, yinghan,
KOSAKI Motohiro, Nick Piggin, Hugh Dickins, Rik van Riel
On Fri, 10 Apr 2009 14:10:03 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> @@ -1553,18 +1581,18 @@ retry_find:
> if (unlikely(!PageUptodate(page)))
> goto page_not_uptodate;
>
> - /* Must recheck i_size under page lock */
> + /*
> + * Found the page and have a reference on it.
> + * We must recheck i_size under page lock.
> + */
> size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> - if (unlikely(vmf->pgoff >= size)) {
> + if (unlikely(offset >= size)) {
> unlock_page(page);
> page_cache_release(page);
> return VM_FAULT_SIGBUS;
> }
This hunk broke
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch.
I fixed it thusly:
/*
* Found the page and have a reference on it.
* We must recheck i_size under page lock.
*/
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (unlikely(offset >= size)) {
unlock_page(page);
page_cache_release(page);
return VM_FAULT_SIGBUS;
}
+ update_page_reclaim_stat(page);
ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
vmf->page = page;
return ret | VM_FAULT_LOCKED;
which seems logical to me.
Although now I look at it, it seems that
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch should
go into 2.6.30?
Ah. But I have a note here that I didn't like it, because it adds lots
of new spinlocking to fastpaths. So I'll leave things as they stand
until we have had a little talk about that.
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Unfortunately, the following two patches had conflicting concepts.
1. commit 9ff473b9a72942c5ac0ad35607cae28d8d59ed7a
(vmscan: evict streaming IO first)
2. commit bf3f3bc5e734706730c12a323f9b2068052aa1f0
(mm: don't mark_page_accessed in fault path)
(1) requires that a page fault update reclaim stat via
mark_page_accessed(), but
(2) removed mark_page_accessed().
However, (1) actually only needed to update the reclaim stat, not
activate the page. So the fault path can call update_page_reclaim_stat()
to resolve this conflict.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/swap.h | 1 +
mm/filemap.c | 2 +-
mm/memory.c | 2 ++
mm/swap.c | 24 +++++++++++++++++++-----
4 files changed, 23 insertions(+), 6 deletions(-)
diff -puN include/linux/swap.h~mm-update_page_reclaim_stat-is-called-from-page-fault-path include/linux/swap.h
--- a/include/linux/swap.h~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/include/linux/swap.h
@@ -179,6 +179,7 @@ extern void __lru_cache_add(struct page
extern void lru_cache_add_lru(struct page *, enum lru_list lru);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
+extern void update_page_reclaim_stat(struct page *page);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
diff -puN mm/filemap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path mm/filemap.c
--- a/mm/filemap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/mm/filemap.c
@@ -1595,7 +1595,7 @@ retry_find:
page_cache_release(page);
return VM_FAULT_SIGBUS;
}
-
+ update_page_reclaim_stat(page);
ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
vmf->page = page;
return ret | VM_FAULT_LOCKED;
diff -puN mm/memory.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path mm/memory.c
--- a/mm/memory.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/mm/memory.c
@@ -2507,6 +2507,8 @@ static int do_swap_page(struct mm_struct
try_to_free_swap(page);
unlock_page(page);
+ update_page_reclaim_stat(page);
+
if (write_access) {
ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
if (ret & VM_FAULT_ERROR)
diff -puN mm/swap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path mm/swap.c
--- a/mm/swap.c~mm-update_page_reclaim_stat-is-called-from-page-fault-path
+++ a/mm/swap.c
@@ -151,8 +151,9 @@ void rotate_reclaimable_page(struct pag
}
}
-static void update_page_reclaim_stat(struct zone *zone, struct page *page,
- int file, int rotated)
+static void update_page_reclaim_stat_locked(struct zone *zone,
+ struct page *page,
+ int file, int rotated)
{
struct zone_reclaim_stat *reclaim_stat = &zone->reclaim_stat;
struct zone_reclaim_stat *memcg_reclaim_stat;
@@ -171,6 +172,19 @@ static void update_page_reclaim_stat(str
memcg_reclaim_stat->recent_rotated[file]++;
}
+void update_page_reclaim_stat(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ /* if the page isn't reclaimable, it doesn't update reclaim stat */
+ if (PageLRU(page) && !PageUnevictable(page)) {
+ update_page_reclaim_stat_locked(zone, page,
+ !!page_is_file_cache(page), 1);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+}
+
/*
* FIXME: speed this up?
*/
@@ -182,14 +196,14 @@ void activate_page(struct page *page)
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
int file = page_is_file_cache(page);
int lru = LRU_BASE + file;
- del_page_from_lru_list(zone, page, lru);
+ del_page_from_lru_list(zone, page, lru);
SetPageActive(page);
lru += LRU_ACTIVE;
add_page_to_lru_list(zone, page, lru);
__count_vm_event(PGACTIVATE);
- update_page_reclaim_stat(zone, page, !!file, 1);
+ update_page_reclaim_stat_locked(zone, page, !!file, 1);
}
spin_unlock_irq(&zone->lru_lock);
}
@@ -427,7 +441,7 @@ void ____pagevec_lru_add(struct pagevec
file = is_file_lru(lru);
if (active)
SetPageActive(page);
- update_page_reclaim_stat(zone, page, file, active);
+ update_page_reclaim_stat_locked(zone, page, file, active);
add_page_to_lru_list(zone, page, lru);
}
if (zone)
_
* Re: [PATCH 9/9] readahead: record mmap read-around states in file_ra_state
2009-04-10 23:38 ` Andrew Morton
@ 2009-04-11 4:24 ` Wu Fengguang
0 siblings, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-11 4:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, npiggin, torvalds, yinghan
On Sat, Apr 11, 2009 at 07:38:53AM +0800, Andrew Morton wrote:
> On Fri, 10 Apr 2009 14:10:06 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Mmap read-around now shares the same code style and data structure
> > with readahead code.
> >
> > This also removes do_page_cache_readahead().
> > Its last user, mmap read-around, has been changed to call ra_submit().
> >
> > The no-readahead-if-congested logic is dropped along the way.
> > Users will be pretty sensitive to the slow loading of executables,
> > so it's unfavorable to disable mmap read-around on a congested queue.
>
> Did you verify that the read-congested code ever triggers?
No.
Sorry, the case described is an imagined one. There could be other
counter-cases, however.
> It used to be (and probably still is) the case that
> bdi_read_congested() is very very rare, because the read queue is long
> and the kernel rarely puts many read requests into it. You can of
> course create this condition with a fake workload with many
> threads/processes, but it _is_ fake.
The major workloads that could trigger read congestions:
1) file servers running highly concurrent IO streams
2) concurrent sys_readahead()s on a desktop, for fast booting
3) mysql is also able to issue concurrent sys_readahead()s
4) more workloads beyond my imagination
For 1) the change is favorable or irrelevant.
For 2) and 3) the change is irrelevant. The user-space readahead
process must make sure the degree of parallelism is kept under control,
so that read congestion _never_ happens; otherwise normal reads would be blocked.
> Some real-world workloads (databases?) will of course trigger
> bdi_read_congested(). But they're usually doing fixed-sized reads, and
> if we're doing _any_ readahead/readaround in that case, readahead is
> busted.
Hmm, I didn't have this possibility in mind: the read congestion is
exactly caused by a pool of concurrent mmap readers _themselves_.
Let's assume they are doing N-page sized _sparse_ random reads.
(otherwise this patch will be favorable)
The current mmap_miss accounting works so that if N >= 2, the
readaround will stay enabled forever; if N == 1, the readaround will
be disabled quickly. The designer of this logic must be a master!
Why? If an application is to do 2-page random reads, the best option
is the read() syscall, because the size info is _lost_ when
doing mmap reads. If the application author cares about
performance (e.g. a big DBMS), he will discover that truth either by
theorizing or through experiments.
IMHO the kernel mmap readaround algorithm shall only be optimized for
- enabled: sequential reads
- enabled: large random reads
- enabled: clustered random reads
- disabled: 1-page random reads
and to perform badly for, and therefore discourage,
- enabled but discouraged: small (2-page) and sparse random reads
It will do undesirable readaround for small sparse random mmap readers,
and I do think that we want this bad behavior to push such workloads
into using the more optimal read() syscall.
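The N >= 2 vs. N == 1 behavior described above can be sketched as a toy model (the names mirror the kernel's mmap_miss accounting, but the threshold value and the bookkeeping here are simplified for illustration and are not the real fault path):

```c
#include <assert.h>

/* Toy model of the mmap_miss readaround heuristic described above.
 * The names mirror the kernel's, but the threshold value and the
 * bookkeeping are simplified for illustration. */
#define MMAP_LOTSAMISS 100

struct ra_state { int mmap_miss; };

/* One sparse random N-page mmap read: the first page misses the
 * cache (a major fault), the remaining N-1 pages hit the pages
 * pulled in by readaround (minor faults). */
static void random_read(struct ra_state *ra, int npages)
{
	int i;

	ra->mmap_miss++;
	for (i = 1; i < npages; i++)
		if (ra->mmap_miss > 0)
			ra->mmap_miss--;
}

static int readaround_enabled(const struct ra_state *ra)
{
	return ra->mmap_miss <= MMAP_LOTSAMISS;
}

/* Is readaround still enabled after 'reads' random N-page reads? */
static int enabled_after(int reads, int npages)
{
	struct ra_state ra = { 0 };
	int i;

	for (i = 0; i < reads; i++)
		random_read(&ra, npages);
	return readaround_enabled(&ra);
}
```

With N == 1 every read is a pure miss and mmap_miss soon exceeds the threshold, turning readaround off; with N >= 2 the hits cancel the misses, so readaround stays on indefinitely.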
Therefore the change introduced by this patch is in line with the
above principles: either the readaround is favorable and should be
insisted even in read congestions, or readaround is already
unfavorable and let's keep it. If there are user complaints, good: we
_helped_ them discover performance bugs in their applications and can
point them to the optimal solution. If that's not viable in the short term,
there are workarounds like reducing the readahead size.
Thanks,
Fengguang
* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
2009-04-10 23:48 ` Andrew Morton
@ 2009-04-11 13:58 ` KOSAKI Motohiro
2009-04-11 18:49 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-04-11 13:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Wu Fengguang, linux-kernel, lpk, wli, torvalds, yinghan,
Nick Piggin, Hugh Dickins, Rik van Riel
2009/4/11 Andrew Morton <akpm@linux-foundation.org>:
> On Fri, 10 Apr 2009 14:10:03 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
>> @@ -1553,18 +1581,18 @@ retry_find:
>> if (unlikely(!PageUptodate(page)))
>> goto page_not_uptodate;
>>
>> - /* Must recheck i_size under page lock */
>> + /*
>> + * Found the page and have a reference on it.
>> + * We must recheck i_size under page lock.
>> + */
>> size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>> - if (unlikely(vmf->pgoff >= size)) {
>> + if (unlikely(offset >= size)) {
>> unlock_page(page);
>> page_cache_release(page);
>> return VM_FAULT_SIGBUS;
>> }
>
> This hunk broke
> mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch.
>
>
> I fixed it thusly:
>
> /*
> * Found the page and have a reference on it.
> * We must recheck i_size under page lock.
> */
> size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> if (unlikely(offset >= size)) {
> unlock_page(page);
> page_cache_release(page);
> return VM_FAULT_SIGBUS;
> }
> + update_page_reclaim_stat(page);
> ra->prev_pos = (loff_t)offset << PAGE_CACHE_SHIFT;
> vmf->page = page;
> return ret | VM_FAULT_LOCKED;
>
> which seems logical to me.
>
> Although now I look at it, it seems that
> mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch should
> go into 2.6.30?
I haven't seen this patch in detail and I don't think
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch
should go into 2.6.30, but I have one question.
> Ah. But I have a note here that I didn't like it, because it adds lots
> of new spinlocking to fastpaths. So I'll leave things as they stand
> until we have had a little talk about that.
add?
old code: grab zone->lru_lock via mark_page_accessed()
new code: grab zone->lru_lock via update_reclaim_stat()
one remove, one add.
But I agree its lock can maybe be removed.
I'm sorry for my late work.
* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
2009-04-11 13:58 ` KOSAKI Motohiro
@ 2009-04-11 18:49 ` Andrew Morton
2009-04-12 23:16 ` KOSAKI Motohiro
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2009-04-11 18:49 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Wu Fengguang, linux-kernel, lpk, wli, torvalds, yinghan,
Nick Piggin, Hugh Dickins, Rik van Riel
On Sat, 11 Apr 2009 22:58:31 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > Ah. But I have a note here that I didn't like it, because it adds lots
> > of new spinlocking to fastpaths. So I'll leave things as they stand
> > until we have had a little talk about that.
>
> add?
>
> old code: grab zone->lru_lock via mark_page_accessed()
> new code: grab zone->lru_lock via update_reclaim_stat()
>
> one remove, one add.
>
mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch adds
new calls to update_page_reclaim_stat() into do_swap_page() and
filemap_fault(). update_page_reclaim_stat() does spin_lock_irq() and
spin_unlock_irq(). It looks like a net slowdown to me.
> But I agree its lock can maybe be removed.
It would be nice to try to do something about it - every little bit
counts.
* Re: [PATCH 7/9] readahead: sequential mmap readahead
2009-04-10 23:34 ` Andrew Morton
@ 2009-04-12 6:50 ` Wu Fengguang
2009-04-12 7:09 ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
1 sibling, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-12 6:50 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, npiggin, torvalds, yinghan
On Sat, Apr 11, 2009 at 07:34:13AM +0800, Andrew Morton wrote:
> On Fri, 10 Apr 2009 14:10:04 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Auto-detect sequential mmap reads and do readahead for them.
> >
> > The sequential mmap readahead will be triggered when
> > - sync readahead: it's a major fault and (prev_offset == offset-1);
> > - async readahead: minor fault on PG_readahead page with valid readahead state.
> >
> > The benefits of doing readahead instead of read-around:
> > - less I/O wait thanks to async readahead
> > - double real I/O size and no more cache hits
> >
> > The single stream case is improved a little.
> > For 100,000 sequential mmap reads:
> >
> > user system cpu total
> > (1-1) plain -mm, 128KB readaround: 3.224 2.554 48.40% 11.838
> > (1-2) plain -mm, 256KB readaround: 3.170 2.392 46.20% 11.976
> > (2) patched -mm, 128KB readahead: 3.117 2.448 47.33% 11.607
> >
> > The patched (2) has the smallest total time, since it has no cache-hit
> > overheads and less I/O block time (thanks to async readahead). Here the
> > I/O size makes little difference, since there is only a single stream.
> >
> > Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
> > since half of the read-around pages will be readahead cache hits.
> >
> > This is going to make _real_ differences for _concurrent_ IO streams.
> >
> > Cc: Nick Piggin <npiggin@suse.de>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > mm/filemap.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > --- mm.orig/mm/filemap.c
> > +++ mm/mm/filemap.c
> > @@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
> > if (VM_RandomReadHint(vma))
> > return;
> >
> > - if (VM_SequentialReadHint(vma)) {
> > + if (VM_SequentialReadHint(vma) ||
> > + offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
> > page_cache_sync_readahead(mapping, ra, file, offset, 1);
> > return;
> > }
> >
>
> We've always believed that readaround was beneficial for more random
> access patterns - classically faulting in an executable. Although I
> don't recall that this belief was very well substantiated.
>
> (The best results I ever got was by doing readaround and setting the
> size to a few MB, so we slurp the entire executable into memory in one
> hit. lol.)
>
> So my question is: what is the probability that this change will
> inadvertently cause a randomish-access workload to fall into readahead
> (rather than readaround) mode, and what is the impact when this
> happens?
Good question!
I did some measuring in order to answer this question.
It's an NFS-root debian desktop system, readahead size = 60 pages.
The numbers are grabbed after a fresh boot into console.
approach pgmajfault RA miss ratio mmap IO count avg IO size(pages)
A 383 31.6% 383 11
B 225 32.4% 390 11
C 224 32.6% 307 13
case A: mmap sync/async readahead disabled
case B: mmap sync/async readahead enabled, with enforced full async readahead size
case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
or:
A = vanilla 2.6.30-rc1
B = A plus this patchset
C = B plus the following change
@@ static void do_sync_mmap_readahead(struc
if (VM_SequentialReadHint(vma) ||
offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
- page_cache_sync_readahead(mapping, ra, file, offset, 1);
+ page_cache_sync_readahead(mapping, ra, file, offset, ra->ra_pages);
The theory is
- readahead is _good_ for clustered random reads, and can perform
_better_ than readaround because they could be _async_.
For this patchset:
- sync readahead size could be smaller than readaround size, hence may
  make things worse by producing more, smaller IOs
- async readahead size is guaranteed to be larger than readaround
size, and they are _async_, hence will mostly behave better
The summaries on the numbers are
- there is a good chance for random mmap reads to trigger readahead
- 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
- case C can further reduce IO count by 1/4
- readahead miss ratios are not quite affected
Final conclusion:
- this patchset reduced major faults by 1/3 with no other overheads;
- mmap io can be further reduced by 1/4 with the following patch.
Raw data follows.
Thanks,
Fengguang
---
Note:
- The duplicate cat runs are from different fresh boots, which shows
  that data variance is <1%.
- The readahead miss ratio is approximated by
(unreferenced pages reported by page-types) / (LRU file pages reported by meminfo)
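The approximation above works out as follows (a small sanity check, assuming 4KB pages so that meminfo kB values divide by 4; the numbers below are those of run A):

```c
#include <assert.h>

/* Readahead miss ratio as approximated above: unreferenced
 * uptodate,lru pages (from page-types) divided by the file LRU size
 * (Active(file) + Inactive(file) from meminfo, converted from kB to
 * pages).  Assumes 4KB pages. */
static double miss_ratio_percent(long unreferenced_pages,
				 long active_file_kb,
				 long inactive_file_kb)
{
	long file_lru_pages = (active_file_kb + inactive_file_kb) / 4;

	return 100.0 * unreferenced_pages / file_lru_pages;
}
```

For run A, miss_ratio_percent(3576, 8244, 36988) gives roughly 31.6%, matching the figure quoted in that section.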
A: mmap sync/async readahead disabled (readaround only)
-----------------------------------------------------
pgmajfault 383
readahead miss ratio ~= 3576 : (36988+8244)/4 = 31.6%
wfg@hp ~% cat /debug/readahead/stats
pattern count sync_count mmap_count eof_count size async_size actual
initial0 515 515 0 325 4 3 2
subsequent 44 1 0 29 17 17 7
marker 18 0 0 12 11 11 6
around 383 383 383 185 60 0 25
random 43 43 0 4 1 0 1
all 1003 942 383 555 26 2 11
wfg@hp ~% cat /debug/readahead/stats
pattern count sync_count mmap_count eof_count size async_size actual
initial0 510 510 0 320 4 3 2
subsequent 44 1 0 29 17 17 7
marker 18 0 0 12 11 11 6
around 383 383 383 185 60 0 25
random 43 43 0 4 1 0 1
all 998 937 383 550 26 2 11
wfg@hp ~% cat /debug/readahead/stats
pattern ra_count io_count sync_count mmap_count eof_count ra_size async_size io_size
initial0 514 499 499 0 324 4 3 2
subsequent 44 21 1 0 6 17 17 15
marker 18 7 0 0 1 11 11 17
around 383 383 383 383 185 60 0 25
random 43 43 43 0 4 1 0 1
all 1002 953 926 383 520 26 2 11
wfg@hp ~% sudo ./page-types
flags page-count MB symbolic-flags long-symbolic-flags
0x00000 496335 1938 __________________
0x00004 1 0 __R_______________ referenced
0x00008 8 0 ___U______________ uptodate
0x00014 5 0 __R_D_____________ referenced,dirty
0x00020 1 0 _____l____________ lru
0x00028 3576 13 ___U_l____________ uptodate,lru
0x0002c 5539 21 __RU_l____________ referenced,uptodate,lru
0x00068 3752 14 ___U_lA___________ uptodate,lru,active
0x0006c 1467 5 __RU_lA___________ referenced,uptodate,lru,active
0x00078 3 0 ___UDlA___________ uptodate,dirty,lru,active
0x0007c 17 0 __RUDlA___________ referenced,uptodate,dirty,lru,active
0x00080 2390 9 _______S__________ slab
0x000c0 108 0 ______AS__________ active,slab
0x00228 89 0 ___U_l___x________ uptodate,lru,reclaim
0x0022c 43 0 __RU_l___x________ referenced,uptodate,lru,reclaim
0x00268 21 0 ___U_lA__x________ uptodate,lru,active,reclaim
0x0026c 73 0 __RU_lA__x________ referenced,uptodate,lru,active,reclaim
0x00400 540 2 __________B_______ buddy
total 513968 2007
wfg@hp ~% cat /proc/meminfo
MemTotal: 1978892 kB
MemFree: 1878628 kB
Buffers: 0 kB
Cached: 45312 kB
SwapCached: 0 kB
Active: 17608 kB
Inactive: 36988 kB
Active(anon): 9364 kB
Inactive(anon): 0 kB
Active(file): 8244 kB
Inactive(file): 36988 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 9160 kB
Mapped: 9588 kB
Slab: 26700 kB
SReclaimable: 14064 kB
SUnreclaim: 12636 kB
PageTables: 1648 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 989444 kB
Committed_AS: 34640 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 83404 kB
VmallocChunk: 34359654179 kB
DirectMap4k: 5824 kB
DirectMap2M: 2050048 kB
B: mmap sync/async readahead enabled, with enforced full async readahead size
-----------------------------------------------------------------------------
pgmajfault 225
readahead miss ratio ~= 3666 : (37080+8192)/4 = 32.4%
wfg@hp ~% cat /debug/readahead/stats # zero lines stripped
pattern count sync_count mmap_count eof_count size async_size actual
initial0 548 548 40 336 4 3 2
initial 5 5 4 1 4 3 3
subsequent 276 2 231 144 31 31 19
marker 151 0 133 131 54 54 6
around 180 180 180 142 60 0 19
random 43 43 0 4 1 0 1
all 1203 778 588 758 25 15 9
wfg@hp ~% cat /debug/readahead/stats
pattern count sync_count mmap_count eof_count size async_size actual
initial0 560 560 40 347 4 3 2
initial 6 6 4 2 4 3 3
subsequent 275 2 232 143 32 31 19
marker 152 0 134 132 54 54 6
around 181 181 181 143 60 0 19
random 43 43 0 4 1 0 1
all 1217 792 591 771 24 15 9
wfg@hp ~% cat /debug/readahead/stats # an extended and more accurate version
pattern ra_count io_count sync_count mmap_count eof_count ra_size async_size io_size
initial0 547 532 532 40 335 4 3 2
initial 4 4 4 4 0 4 3 4
subsequent 275 165 2 144 33 32 31 33
marker 151 29 0 22 9 54 54 33
around 180 180 180 180 142 60 0 19
random 43 43 43 0 4 1 0 1
all 1200 953 761 390 523 25 15 11
wfg@hp ~% sudo ./page-types
flags page-count MB symbolic-flags long-symbolic-flags
0x000004 1 0 __R____________________ referenced
0x000020 1 0 _____l_________________ lru
0x000028 3666 14 ___U_l_________________ uptodate,lru
0x00002c 5587 21 __RU_l_________________ referenced,uptodate,lru
0x000068 549 2 ___U_lA________________ uptodate,lru,active
0x00006c 1506 5 __RU_lA________________ referenced,uptodate,lru,active
0x000080 1469 5 _______S_______________ slab
0x0000c0 49 0 ______AS_______________ active,slab
0x000228 49 0 ___U_l___x_____________ uptodate,lru,reclaim
0x000400 533 2 __________B____________ buddy
0x000800 19245 75 ___________r___________ reserved
0x002008 11 0 ___U_________b_________ uptodate,swapbacked
0x002068 3231 12 ___U_lA______b_________ uptodate,lru,active,swapbacked
0x00206c 25 0 __RU_lA______b_________ referenced,uptodate,lru,active,swapbacked
0x002078 3 0 ___UDlA______b_________ uptodate,dirty,lru,active,swapbacked
0x00207c 17 0 __RUDlA______b_________ referenced,uptodate,dirty,lru,active,swapbacked
0x010000 15 0 ________________H______ head
0x010014 1 0 __R_D___________H______ referenced,dirty,head
0x010080 909 3 _______S________H______ slab,head
0x0100c0 59 0 ______AS________H______ active,slab,head
0x020000 4266 16 _________________T_____ tail
0x020014 4 0 __R_D____________T_____ referenced,dirty,tail
0x400000 472772 1846 ______________________n noflags
total 513968 2007
wfg@hp ~% cat /proc/meminfo
MemTotal: 1978892 kB
MemFree: 1878776 kB
Buffers: 0 kB
Cached: 45352 kB
SwapCached: 0 kB
Active: 17456 kB
Inactive: 37080 kB
Active(anon): 9264 kB
Inactive(anon): 0 kB
Active(file): 8192 kB
Inactive(file): 37080 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 9184 kB
Mapped: 9588 kB
Slab: 26592 kB
SReclaimable: 14016 kB
SUnreclaim: 12576 kB
PageTables: 1624 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 989444 kB
Committed_AS: 34640 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 83404 kB
VmallocChunk: 34359654179 kB
DirectMap4k: 5824 kB
DirectMap2M: 2050048 kB
C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
----------------------------------------------------------------------------------
pgmajfault 224
readahead miss ratio ~= 3760 : (37596+8484)/4 = 32.6%
wfg@hp ~% cat /debug/readahead/stats
pattern count sync_count mmap_count eof_count size async_size actual
initial0 554 554 40 354 12 7 5
initial 4 4 4 3 120 60 63
subsequent 185 1 142 119 33 33 19
marker 147 0 129 130 54 54 5
around 179 179 179 142 60 0 19
random 43 43 0 4 1 0 1
all 1112 781 494 752 29 16 10
wfg@hp ~% cat /debug/readahead/stats
pattern count sync_count mmap_count eof_count size async_size actual
initial0 550 550 40 350 12 7 5
initial 4 4 4 3 120 60 63
subsequent 186 1 142 120 33 33 19
marker 147 0 129 130 54 54 5
around 179 179 179 142 60 0 19
random 43 43 0 4 1 0 1
all 1109 777 494 749 29 16 10
wfg@hp ~% cat /debug/readahead/stats
pattern ra_count io_count sync_count mmap_count eof_count ra_size async_size io_size
initial0 551 536 536 40 351 12 7 5
initial 4 4 4 4 3 120 60 63
subsequent 186 87 1 66 21 33 33 41
marker 147 25 0 18 8 54 54 31
around 179 179 179 179 142 60 0 19
random 43 43 43 0 4 1 0 1
all 1110 874 763 307 529 29 16 13
wfg@hp ~% sudo ./page-types
flags page-count MB symbolic-flags long-symbolic-flags
0x00000 496178 1938 __________________
0x00004 1 0 __R_______________ referenced
0x00008 12 0 ___U______________ uptodate
0x00014 5 0 __R_D_____________ referenced,dirty
0x00020 1 0 _____l____________ lru
0x00028 3760 14 ___U_l____________ uptodate,lru
0x0002c 5566 21 __RU_l____________ referenced,uptodate,lru
0x00068 3806 14 ___U_lA___________ uptodate,lru,active
0x0006c 1546 6 __RU_lA___________ referenced,uptodate,lru,active
0x00078 3 0 ___UDlA___________ uptodate,dirty,lru,active
0x0007c 17 0 __RUDlA___________ referenced,uptodate,dirty,lru,active
0x00080 2393 9 _______S__________ slab
0x000c0 109 0 ______AS__________ active,slab
0x00228 48 0 ___U_l___x________ uptodate,lru,reclaim
0x00400 523 2 __________B_______ buddy
total 513968 2007
wfg@hp ~% cat /proc/meminfo
MemTotal: 1978892 kB
MemFree: 1877516 kB
Buffers: 0 kB
Cached: 46160 kB
SwapCached: 0 kB
Active: 18012 kB
Inactive: 37596 kB
Active(anon): 9528 kB
Inactive(anon): 0 kB
Active(file): 8484 kB
Inactive(file): 37596 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 9328 kB
Mapped: 9748 kB
Slab: 26920 kB
SReclaimable: 14324 kB
SUnreclaim: 12596 kB
PageTables: 1632 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 989444 kB
Committed_AS: 34800 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 83404 kB
VmallocChunk: 34359654179 kB
DirectMap4k: 5824 kB
DirectMap2M: 2050048 kB
* [PATCH] readahead: enforce full sync mmap readahead size
2009-04-10 23:34 ` Andrew Morton
2009-04-12 6:50 ` Wu Fengguang
@ 2009-04-12 7:09 ` Wu Fengguang
2009-04-12 15:15 ` Linus Torvalds
1 sibling, 1 reply; 22+ messages in thread
From: Wu Fengguang @ 2009-04-12 7:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, npiggin, torvalds, yinghan
Now that we do readahead for sequential mmap reads, here is
a simple evaluation of the impacts, and one further optimization.
It's an NFS-root debian desktop system, readahead size = 60 pages.
The numbers are grabbed after a fresh boot into console.
approach pgmajfault RA miss ratio mmap IO count avg IO size(pages)
A 383 31.6% 383 11
B 225 32.4% 390 11
C 224 32.6% 307 13
case A: mmap sync/async readahead disabled
case B: mmap sync/async readahead enabled, with enforced full async readahead size
case C: mmap sync/async readahead enabled, with enforced full sync/async readahead size
or:
A = vanilla 2.6.30-rc1
B = A plus mmap readahead
C = B plus this patch
The numbers show that
- there is a good chance for random mmap reads to trigger readahead
- 'pgmajfault' is reduced by 1/3, due to the _async_ nature of readahead
- case C can further reduce IO count by 1/4
- readahead miss ratios are not quite affected
The theory is
- readahead is _good_ for clustered random reads, and can perform
_better_ than readaround because they could be _async_.
- async readahead size is guaranteed to be larger than readaround
size, and they are _async_, hence will mostly behave better
However for B
- sync readahead size could be smaller than readaround size, hence may
  make things worse by producing more, smaller IOs
which will be fixed by this patch.
Final conclusion:
- mmap readahead reduced major faults by 1/3 with no obvious overheads;
- mmap io can be further reduced by 1/4 with this patch.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/filemap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- mm.orig/mm/filemap.c
+++ mm/mm/filemap.c
@@ -1471,7 +1471,8 @@ static void do_sync_mmap_readahead(struc
if (VM_SequentialReadHint(vma) ||
offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
ra->flags |= RA_FLAG_MMAP;
- page_cache_sync_readahead(mapping, ra, file, offset, 1);
+ page_cache_sync_readahead(mapping, ra, file, offset,
+ ra->ra_pages);
return;
}
* Re: [PATCH] readahead: enforce full sync mmap readahead size
2009-04-12 7:09 ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
@ 2009-04-12 15:15 ` Linus Torvalds
2009-04-13 13:53 ` Wu Fengguang
2009-04-14 7:01 ` Nick Piggin
0 siblings, 2 replies; 22+ messages in thread
From: Linus Torvalds @ 2009-04-12 15:15 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel, npiggin, yinghan
On Sun, 12 Apr 2009, Wu Fengguang wrote:
>
> Now that we do readahead for sequential mmap reads, here is
> a simple evaluation of the impacts, and one further optimization.
Hmm.
Wu, I just went through your latest (?) series of 1-9 and they all looked
(a) quite small and (b) all of them looked like good cleanups.
And not only do they look good, you seem to have numbers to back it all up
too.
In other words, I'd really prefer to merge this sooner rather than later.
There just doesn't seem to be any reason _not_ to. Is there any reason to
not just take this? I realize that it's past -rc1, but this is way smaller
and saner-looking than the average patch that makes it in past -rc1.
Besides, it was originally posted before -rc1, and the last series didn't
have the much more intrusive page-fault-retry patches. I'd leave those for
the next merge window, but the read-ahead series (1-9 plus this final
one-liner) seem to be pure improvement - both in code readability _and_ in
numbers - with no real contentious issues.
No?
Linus
* Re: [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead
2009-04-11 18:49 ` Andrew Morton
@ 2009-04-12 23:16 ` KOSAKI Motohiro
0 siblings, 0 replies; 22+ messages in thread
From: KOSAKI Motohiro @ 2009-04-12 23:16 UTC (permalink / raw)
To: Andrew Morton
Cc: kosaki.motohiro, Wu Fengguang, linux-kernel, lpk, wli, torvalds,
yinghan, Nick Piggin, Hugh Dickins, Rik van Riel
> On Sat, 11 Apr 2009 22:58:31 +0900 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
> > > Ah.  But I have a note here that I didn't like it, because it adds lots
> > > of new spinlocking to fastpaths.  So I'll leave things as they stand
> > > until we have had a little talk about that.
> >
> > add?
> >
> > old code: grab zone->lru_lock via mark_page_accessed()
> > new code: grab zone->lru_lock via update_reclaim_stat()
> >
> > one remove, one add.
> >
>
> mm-update_page_reclaim_stat-is-called-from-page-fault-path.patch adds
> new calls to update_page_reclaim_stat() into do_swap_page() and
> filemap_fault(). update_page_reclaim_stat() does spin_lock_irq() and
> spin_unlock_irq(). It looks like a net slowdown to me.
Ah, I was comparing against the code before Nick's mark_page_accessed()
removal, but you weren't.
Still, I agree this patch reduces the worth of Nick's work.
I have to improve it further.
> > But I agree its lock can be removed maybe..
>
> It would be nice to try to do something about it - every little bit
> counts.
Yes. I can't oppose it.
* Re: [PATCH] readahead: enforce full sync mmap readahead size
2009-04-12 15:15 ` Linus Torvalds
@ 2009-04-13 13:53 ` Wu Fengguang
2009-04-14 7:01 ` Nick Piggin
1 sibling, 0 replies; 22+ messages in thread
From: Wu Fengguang @ 2009-04-13 13:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel, npiggin, yinghan
On Sun, Apr 12, 2009 at 08:15:12AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 12 Apr 2009, Wu Fengguang wrote:
> >
> > Now that we do readahead for sequential mmap reads, here is
> > a simple evaluation of the impacts, and one further optimization.
>
> Hmm.
>
> Wu, I just went through your latest (?) series of 1-9 and they all looked
> (a) quite small and (b) all of them looked like good cleanups.
>
> And not only do they look good, you seem to have numbers to back it all up
> too.
>
> In other words, I'd really prefer to merge this sooner rather than later.
> There just doesn't seem to be any reason _not_ to. Is there any reason to
> not just take this? I realize that it's past -rc1, but this is way smaller
> and saner-looking than the average patch that makes it in past -rc1.
>
> Besides, it was originally posted before -rc1, and the last series didn't
> have the much more intrusive page-fault-retry patches. I'd leave those for
> the next merge window, but the read-ahead series (1-9 plus this final
> one-liner) seem to be pure improvement - both in code readability _and_ in
> numbers - with no real contentious issues.
>
> No?
They should be fine for 2.6.30. For various reasons I'm submitting them a
bit late, but they are in fact mostly old patches that have received
good pondering and testing in my cookroom :-)
Thanks,
Fengguang
* Re: [PATCH] readahead: enforce full sync mmap readahead size
2009-04-12 15:15 ` Linus Torvalds
2009-04-13 13:53 ` Wu Fengguang
@ 2009-04-14 7:01 ` Nick Piggin
1 sibling, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2009-04-14 7:01 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Wu Fengguang, Andrew Morton, linux-kernel, yinghan
On Sun, Apr 12, 2009 at 08:15:12AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 12 Apr 2009, Wu Fengguang wrote:
> >
> > Now that we do readahead for sequential mmap reads, here is
> > a simple evaluation of the impacts, and one further optimization.
>
> Hmm.
>
> Wu, I just went through your latest (?) series of 1-9 and they all looked
> (a) quite small and (b) all of them looked like good cleanups.
>
> And not only do they look good, you seem to have numbers to back it all up
> too.
They look pretty good to me too.
> In other words, I'd really prefer to merge this sooner rather than later.
> There just doesn't seem to be any reason _not_ to. Is there any reason to
> not just take this? I realize that it's past -rc1, but this is way smaller
> and saner-looking than the average patch that makes it in past -rc1.
>
> Besides, it was originally posted before -rc1, and the last series didn't
> have the much more intrusive page-fault-retry patches. I'd leave those for
> the next merge window, but the read-ahead series (1-9 plus this final
> one-liner) seem to be pure improvement - both in code readability _and_ in
> numbers - with no real contentious issues.
>
> No?
I guess untested code, especially with heuristics, can always cause
non-intuitive problems for some people. So the other side of the argument
is: what's the harm in keeping them in -mm until the next merge window?
That said, I like these kinds of things, so I don't object :)
end of thread, other threads:[~2009-04-14 7:01 UTC | newest]
Thread overview: 22+ messages
2009-04-10 6:09 [PATCH 0/9] filemap and readahead fixes for linux-next Wu Fengguang
2009-04-10 6:09 ` [PATCH 1/9] readahead: move max_sane_readahead() calls into force_page_cache_readahead() Wu Fengguang
2009-04-10 6:09 ` [PATCH 2/9] readahead: apply max_sane_readahead() limit in ondemand_readahead() Wu Fengguang
2009-04-10 6:10 ` [PATCH 3/9] readahead: remove one unnecessary radix tree lookup Wu Fengguang
2009-04-10 6:10 ` [PATCH 4/9] readahead: increase interleaved readahead size Wu Fengguang
2009-04-10 6:10 ` [PATCH 5/9] readahead: remove sync/async readahead call dependency Wu Fengguang
2009-04-10 6:10 ` [PATCH 6/9] readahead: clean up and simplify the code for filemap page fault readahead Wu Fengguang
2009-04-10 23:48 ` Andrew Morton
2009-04-11 13:58 ` KOSAKI Motohiro
2009-04-11 18:49 ` Andrew Morton
2009-04-12 23:16 ` KOSAKI Motohiro
2009-04-10 6:10 ` [PATCH 7/9] readahead: sequential mmap readahead Wu Fengguang
2009-04-10 23:34 ` Andrew Morton
2009-04-12 6:50 ` Wu Fengguang
2009-04-12 7:09 ` [PATCH] readahead: enforce full sync mmap readahead size Wu Fengguang
2009-04-12 15:15 ` Linus Torvalds
2009-04-13 13:53 ` Wu Fengguang
2009-04-14 7:01 ` Nick Piggin
2009-04-10 6:10 ` [PATCH 8/9] readahead: enforce full readahead size on async mmap readahead Wu Fengguang
2009-04-10 6:10 ` [PATCH 9/9] readahead: record mmap read-around states in file_ra_state Wu Fengguang
2009-04-10 23:38 ` Andrew Morton
2009-04-11 4:24 ` Wu Fengguang