* [PATCH 00/33] Adaptive read-ahead V12
[not found] <20060524111246.420010595@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 15:44 ` Andrew Morton
[not found] ` <20060524111857.983845462@localhost.localdomain>
` (27 subsequent siblings)
28 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
Andrew,
This is the 12th release of the adaptive readahead patchset.
It has received testing in a wide range of applications over the past
six months, and has been polished up considerably.
Please consider it for inclusion in the -mm tree.
Performance benefits
====================
Besides file servers and desktops, it has recently been found to benefit
postgresql databases considerably.
I explained to pgsql users how the patch may help their db performance:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
[QUOTE]
HOW IT WORKS
In adaptive readahead, the context based method may be of particular
interest to postgresql users. It works by peeking into the file cache
and checking whether there are any history pages present or accessed. In this
way it can detect almost all forms of sequential / semi-sequential read
patterns, e.g.
- parallel / interleaved sequential scans on one file
- sequential reads across file open/close
- mixed sequential / random accesses
- sparse / skimming sequential read
It also has methods to detect some less common cases:
- reading backward
- seeking all over, reading N pages at a time
WAYS TO BENEFIT FROM IT
As we know, postgresql relies on the kernel to do proper readahead.
The adaptive readahead might help performance in the following cases:
- concurrent sequential scans
- sequential scan on a fragmented table
(some DBs suffer from this problem, not sure for pgsql)
- index scan with clustered matches
- index scan on majority rows (in case the planner goes wrong)
And received positive responses:
[QUOTE from Michael Stone]
I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
with the patch the job took 1.7M ms. Another VACUUM that normally takes
between 300k-500k ms took 150k. Definitely a promising addition.
[QUOTE from Michael Stone]
>I'm thinking about it, we're already using a fixed read-ahead of 16MB
>using blockdev on the stock Redhat 2.6.9 kernel, it would be nice to
>not have to set this so we may try it.
FWIW, I never saw much performance difference from doing that. Wu's
patch, OTOH, gave a big boost.
[QUOTE: odbc-bench with Postgresql 7.4.11 on dual Opteron]
Base kernel:
Transactions per second: 92.384758
Transactions per second: 99.800896
With the patch, vm.readahead_ratio = 100:
Transactions per second: 105.461952
Transactions per second: 105.458664
vm.readahead_ratio = 100 ; vm.readahead_hit_rate = 1:
Transactions per second: 113.055367
Transactions per second: 124.815910
Patches
=======
All 33 patches are bisect friendly:
special care has been taken to make them compile cleanly at each step.
The following 29 patches are only logically separated -
one should not remove one of them and expect the others to compile cleanly:
[patch 01/33] readahead: kconfig options
[patch 02/33] radixtree: look-aside cache
[patch 03/33] radixtree: hole scanning functions
[patch 04/33] readahead: page flag PG_readahead
[patch 05/33] readahead: refactor do_generic_mapping_read()
[patch 06/33] readahead: refactor __do_page_cache_readahead()
[patch 07/33] readahead: insert cond_resched() calls
[patch 08/33] readahead: common macros
[patch 09/33] readahead: events accounting
[patch 10/33] readahead: support functions
[patch 11/33] readahead: sysctl parameters
[patch 12/33] readahead: min/max sizes
[patch 13/33] readahead: state based method - aging accounting
[patch 14/33] readahead: state based method - data structure
[patch 15/33] readahead: state based method - routines
[patch 16/33] readahead: state based method
[patch 17/33] readahead: context based method
[patch 18/33] readahead: initial method - guiding sizes
[patch 19/33] readahead: initial method - thrashing guard size
[patch 20/33] readahead: initial method - expected read size
[patch 21/33] readahead: initial method - user recommended size
[patch 22/33] readahead: initial method
[patch 23/33] readahead: backward prefetching method
[patch 24/33] readahead: seeking reads method
[patch 25/33] readahead: thrashing recovery method
[patch 26/33] readahead: call scheme
[patch 27/33] readahead: laptop mode
[patch 28/33] readahead: loop case
[patch 29/33] readahead: nfsd case
The following 4 patches are for debugging purposes, and are for -mm only:
[patch 30/33] readahead: turn on by default
[patch 31/33] readahead: debug radix tree new functions
[patch 32/33] readahead: debug traces showing accessed file names
[patch 33/33] readahead: debug traces showing read patterns
Diffstat
========
Documentation/sysctl/vm.txt | 37
block/ll_rw_blk.c | 34
drivers/block/loop.c | 6
fs/file_table.c | 7
fs/mpage.c | 4
fs/nfsd/vfs.c | 5
include/linux/backing-dev.h | 3
include/linux/fs.h | 57 +
include/linux/mm.h | 31
include/linux/mmzone.h | 5
include/linux/page-flags.h | 5
include/linux/radix-tree.h | 87 ++
include/linux/sysctl.h | 2
include/linux/writeback.h | 6
kernel/sysctl.c | 28
lib/radix-tree.c | 202 ++++-
mm/Kconfig | 62 +
mm/filemap.c | 90 ++
mm/page-writeback.c | 2
mm/page_alloc.c | 2
mm/readahead.c | 1641 +++++++++++++++++++++++++++++++++++++++++++-
mm/swap.c | 2
mm/vmscan.c | 4
23 files changed, 2262 insertions(+), 60 deletions(-)
Changelog
=========
V12 2006-05-24
- improve small files case
- allow pausing of events accounting
- disable sparse read-ahead by default
- a bug fix in radix_tree_cache_lookup_parent()
- more cleanups
V11 2006-03-19
- patchset rework
- add kconfig option to make the feature compile-time selectable
- improve radix tree scan functions
- fix bug of using smp_processor_id() in preemptible code
- avoid overflow in compute_thrashing_threshold()
- disable sparse read prefetching if (readahead_hit_rate == 1)
- make thrashing recovery a standalone function
- random cleanups
V10 2005-12-16
- remove delayed page activation
- remove live page protection
- revert mmap readaround to old behavior
- default to original readahead logic
- default to original readahead size
- merge comment fixes from Andreas Mohr
- merge radixtree cleanups from Christoph Lameter
- reduce sizeof(struct file_ra_state) by unnamed union
- stateful method cleanups
- account other read-ahead paths
V9 2005-12-3
- standalone mmap read-around code, a little smarter and more tunable
- make stateful method sensible of request size
- decouple readahead_ratio from live pages protection
- let readahead_ratio contribute to ra_size grow speed in stateful method
- account variance of ra_size
V8 2005-11-25
- balance zone aging only in page reclaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low latency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values
V7 2005-11-09
- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if the drive is spun down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size
V6 2005-11-01
- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages
V5 2005-10-28
- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities
V4 2005-10-15
- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache
V3 2005-10-06
- major code reorganization and documentation
- stateful estimation of thrashing-threshold
- context method with accelerated grow-up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statistics data collection
- synchronized page aging between zones
V2 2005-09-15
- delayed page activation
- look-ahead: towards pipelined read-ahead
V1 2005-09-13
Initial release which features:
o stateless (for now)
o adapts to available memory / read speed
o free of thrashing (in theory)
And handles:
o large number of slow streams (FTP server)
o open/read/close access patterns (NFS server)
o multiple interleaved, sequential streams in one file
(multithread / multimedia / database)
Cheers,
Wu Fengguang
--
Dept. Automation University of Science and Technology of China
^ permalink raw reply [flat|nested] 107+ messages in thread
* [PATCH 02/33] radixtree: look-aside cache
[not found] ` <20060524111857.983845462@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Nick Piggin, Christoph Lameter
[-- Attachment #1: radixtree-lookaside-cache.patch --]
[-- Type: text/plain, Size: 9264 bytes --]
Introduce a set of lookup functions to the radix tree for the read-ahead logic.
Other access patterns with high locality may also benefit from them.
- radix_tree_lookup_parent(root, index, level)
Perform partial lookup, return the @level'th parent of the slot at
@index.
- radix_tree_cache_xxx()
Init/Query the cache.
- radix_tree_cache_lookup(root, cache, index)
Perform lookup with the aid of a look-aside cache.
For sequential scans, it has a time complexity of 64*O(1) + 1*O(logN).
Typical usage:
void func() {
+ struct radix_tree_cache cache;
+
+ radix_tree_cache_init(&cache);
read_lock_irq(&mapping->tree_lock);
for(;;) {
- page = radix_tree_lookup(&mapping->page_tree, index);
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache, index);
}
read_unlock_irq(&mapping->tree_lock);
}
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/radix-tree.h | 83 +++++++++++++++++++++++++++++++++++
lib/radix-tree.c | 104 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 161 insertions(+), 26 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/radix-tree.h
+++ linux-2.6.17-rc4-mm3/include/linux/radix-tree.h
@@ -26,12 +26,29 @@
#define RADIX_TREE_MAX_TAGS 2
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
+#ifdef __KERNEL__
+#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
+#else
+#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
+#endif
+
+#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
+
struct radix_tree_root {
unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node *rnode;
};
+/*
+ * Lookaside cache to support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+ unsigned long first_index;
+ struct radix_tree_node *tree_node;
+};
+
#define RADIX_TREE_INIT(mask) { \
.height = 0, \
.gfp_mask = (mask), \
@@ -49,9 +66,14 @@ do { \
} while (0)
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_parent(struct radix_tree_root *, unsigned long,
+ unsigned int);
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache);
+void *radix_tree_cache_lookup_parent(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
@@ -74,4 +96,61 @@ static inline void radix_tree_preload_en
preempt_enable();
}
+/**
+ * radix_tree_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+ unsigned long index)
+{
+ return radix_tree_lookup_parent(root, index, 0);
+}
+
+/**
+ * radix_tree_cache_init - init a look-aside cache
+ * @cache: look-aside cache
+ *
+ * Init the radix tree look-aside cache @cache.
+ */
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+ cache->first_index = RADIX_TREE_MAP_MASK;
+ cache->tree_node = NULL;
+}
+
+/**
+ * radix_tree_cache_lookup - cached lookup on a radix tree
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root,
+ * and make use of @cache to speedup the lookup process.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index)
+{
+ return radix_tree_cache_lookup_parent(root, cache, index, 0);
+}
+
+static inline unsigned int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+ return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+ return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline unsigned long
+radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+ return cache->first_index;
+}
+
#endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.17-rc4-mm3.orig/lib/radix-tree.c
+++ linux-2.6.17-rc4-mm3/lib/radix-tree.c
@@ -309,36 +309,90 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);
-static inline void **__lookup_slot(struct radix_tree_root *root,
- unsigned long index)
+/**
+ * radix_tree_lookup_parent - low level lookup routine
+ * @root: radix tree root
+ * @index: index key
+ * @level: stop at that many levels from the tree leaf
+ *
+ * Lookup the @level'th parent of the slot at @index in radix tree @root.
+ * The return value is:
+ * @level == 0: page at @index;
+ * @level == 1: the corresponding bottom level tree node;
+ * @level < height: (@level-1)th parent node of the bottom node
+ * that contains @index;
+ * @level >= height: the root node.
+ */
+void *radix_tree_lookup_parent(struct radix_tree_root *root,
+ unsigned long index, unsigned int level)
{
unsigned int height, shift;
- struct radix_tree_node **slot;
+ struct radix_tree_node *slot;
height = root->height;
if (index > radix_tree_maxindex(height))
return NULL;
- if (height == 0 && root->rnode)
- return (void **)&root->rnode;
-
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;
+ slot = root->rnode;
- while (height > 0) {
- if (*slot == NULL)
+ while (height > level) {
+ if (slot == NULL)
return NULL;
- slot = (struct radix_tree_node **)
- ((*slot)->slots +
- ((index >> shift) & RADIX_TREE_MAP_MASK));
+ slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
}
- return (void **)slot;
+ return slot;
+}
+EXPORT_SYMBOL(radix_tree_lookup_parent);
+
+/**
+ * radix_tree_cache_lookup_parent - cached lookup node
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root,
+ * and return the node @level levels from the bottom in the search path.
+ *
+ * @cache stores the last accessed upper level tree node by this
+ * function, and is always checked first before searching in the tree.
+ * It can improve speed for access patterns with strong locality.
+ *
+ * NOTE:
+ * - The cache becomes invalid on leaving the lock;
+ * - Do not intermix calls with different @level.
+ */
+void *radix_tree_cache_lookup_parent(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level)
+{
+ struct radix_tree_node *node;
+ unsigned long i;
+ unsigned long mask;
+
+ if (level >= root->height)
+ return radix_tree_lookup_parent(root, index, level);
+
+ i = (index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK;
+ mask = (~0UL) << ((level + 1) * RADIX_TREE_MAP_SHIFT);
+
+ if ((index & mask) == cache->first_index)
+ return cache->tree_node->slots[i];
+
+ node = radix_tree_lookup_parent(root, index, level + 1);
+ if (!node)
+ return 0;
+
+ cache->tree_node = node;
+ cache->first_index = (index & mask);
+ return node->slots[i];
}
+EXPORT_SYMBOL(radix_tree_cache_lookup_parent);
/**
* radix_tree_lookup_slot - lookup a slot in a radix tree
@@ -350,25 +404,27 @@ static inline void **__lookup_slot(struc
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
- return __lookup_slot(root, index);
+ struct radix_tree_node *node;
+
+ node = radix_tree_lookup_parent(root, index, 1);
+ return node->slots + (index & RADIX_TREE_MAP_MASK);
}
EXPORT_SYMBOL(radix_tree_lookup_slot);
/**
- * radix_tree_lookup - perform lookup operation on a radix tree
- * @root: radix tree root
- * @index: index key
+ * radix_tree_cache_count - items in the cached node
+ * @cache: radix tree look-aside cache
*
- * Lookup the item at the position @index in the radix tree @root.
+ * Query the number of items contained in the cached node.
*/
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache)
{
- void **slot;
-
- slot = __lookup_slot(root, index);
- return slot != NULL ? *slot : NULL;
+ if (!(cache->first_index & RADIX_TREE_MAP_MASK))
+ return cache->tree_node->count;
+ else
+ return 0;
}
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_cache_count);
/**
* radix_tree_tag_set - set a tag on a radix tree node
--
* [PATCH 03/33] radixtree: hole scanning functions
[not found] ` <20060524111858.357709745@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 16:19 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: radixtree-scan-hole.patch --]
[-- Type: text/plain, Size: 3784 bytes --]
Introduce a pair of functions to scan radix tree for hole/empty item.
include/linux/radix-tree.h | 4 +
lib/radix-tree.c | 104 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 108 insertions(+)
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
--- linux-2.6.17-rc4-mm3.orig/include/linux/radix-tree.h
+++ linux-2.6.17-rc4-mm3/include/linux/radix-tree.h
@@ -74,6 +74,10 @@ unsigned int radix_tree_cache_count(stru
void *radix_tree_cache_lookup_parent(struct radix_tree_root *root,
struct radix_tree_cache *cache,
unsigned long index, unsigned int level);
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
--- linux-2.6.17-rc4-mm3.orig/lib/radix-tree.c
+++ linux-2.6.17-rc4-mm3/lib/radix-tree.c
@@ -427,6 +427,110 @@ unsigned int radix_tree_cache_count(stru
EXPORT_SYMBOL(radix_tree_cache_count);
/**
+ * radix_tree_scan_hole_backward - scan backward for hole
+ * @root: radix tree root
+ * @index: index key
+ * @max_scan: advice on max items to scan (it may scan a little more)
+ *
+ * Scan backward from @index for a hole/empty item, stop when
+ * - hit hole
+ * - @max_scan or more items scanned
+ * - hit index 0
+ *
+ * Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ unsigned long origin;
+ int i;
+
+ origin = index;
+ radix_tree_cache_init(&cache);
+
+ while (origin - index < max_scan) {
+ node = radix_tree_cache_lookup_parent(root, &cache, index, 1);
+ if (!node)
+ break;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index - RADIX_TREE_MAP_SIZE) |
+ RADIX_TREE_MAP_MASK;
+ goto check_underflow;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+check_underflow:
+ if (unlikely(index == ULONG_MAX)) {
+ index = 0;
+ break;
+ }
+ }
+
+out:
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole_backward);
+
+/**
+ * radix_tree_scan_hole - scan for hole
+ * @root: radix tree root
+ * @index: index key
+ * @max_scan: advice on max items to scan (it may scan a little more)
+ *
+ * Scan forward from @index for a hole/empty item, stop when
+ * - hit hole
+ * - hit EOF
+ * - hit index ULONG_MAX
+ * - @max_scan or more items scanned
+ *
+ * Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+ unsigned long index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ unsigned long origin;
+ int i;
+
+ origin = index;
+ radix_tree_cache_init(&cache);
+
+ while (index - origin < max_scan) {
+ node = radix_tree_cache_lookup_parent(root, &cache, index, 1);
+ if (!node)
+ break;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index | RADIX_TREE_MAP_MASK) + 1;
+ goto check_overflow;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE;
+ i++, index++) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+check_overflow:
+ if (unlikely(!index)) {
+ index = ULONG_MAX;
+ break;
+ }
+ }
+out:
+ return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole);
+
+/**
* radix_tree_tag_set - set a tag on a radix tree node
* @root: radix tree root
* @index: index key
--
* [PATCH 04/33] readahead: page flag PG_readahead
[not found] ` <20060524111858.869793445@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 16:23 ` Andrew Morton
2006-05-24 12:27 ` Peter Zijlstra
1 sibling, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-page-flag-PG_readahead.patch --]
[-- Type: text/plain, Size: 1792 bytes --]
A new page flag, PG_readahead, is introduced as a look-ahead mark, which
reminds the caller to give the adaptive read-ahead logic a chance to do
read-ahead ahead of time for I/O pipelining.
It roughly corresponds to `ahead_start' of the stock read-ahead logic.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/page-flags.h | 5 +++++
mm/page_alloc.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
+++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
@@ -89,6 +89,7 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_readahead 20 /* Reminder to do readahead */
#if (BITS_PER_LONG > 32)
@@ -372,6 +373,10 @@ extern void __mod_page_state_offset(unsi
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageReadahead(page) test_bit(PG_readahead, &(page)->flags)
+#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
--- linux-2.6.17-rc4-mm3.orig/mm/page_alloc.c
+++ linux-2.6.17-rc4-mm3/mm/page_alloc.c
@@ -564,7 +564,7 @@ static int prep_new_page(struct page *pa
if (PageReserved(page))
return 1;
- page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
--
* [PATCH 05/33] readahead: refactor do_generic_mapping_read()
[not found] ` <20060524111859.540640819@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-refactor-do_generic_mapping_read.patch --]
[-- Type: text/plain, Size: 2538 bytes --]
In do_generic_mapping_read(), release the accessed page somewhat later,
so that it can be passed to and used by the adaptive read-ahead code.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/filemap.c | 18 ++++++++++++------
1 files changed, 12 insertions(+), 6 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -813,10 +813,12 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
loff_t isize;
struct page *cached_page;
+ struct page *prev_page;
int error;
struct file_ra_state ra = *_ra;
cached_page = NULL;
+ prev_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -855,6 +857,11 @@ find_page:
handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
+
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
+
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -889,7 +896,6 @@ page_ok:
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
- page_cache_release(page);
if (ret == nr && desc->count)
continue;
goto out;
@@ -901,7 +907,6 @@ page_not_up_to_date:
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
unlock_page(page);
- page_cache_release(page);
continue;
}
@@ -931,7 +936,6 @@ readpage:
* invalidate_inode_pages got it
*/
unlock_page(page);
- page_cache_release(page);
goto find_page;
}
unlock_page(page);
@@ -952,7 +956,6 @@ readpage:
isize = i_size_read(inode);
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
if (unlikely(!isize || index > end_index)) {
- page_cache_release(page);
goto out;
}
@@ -961,7 +964,6 @@ readpage:
if (index == end_index) {
nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
if (nr <= offset) {
- page_cache_release(page);
goto out;
}
}
@@ -971,7 +973,6 @@ readpage:
readpage_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
- page_cache_release(page);
goto out;
no_cached_page:
@@ -996,6 +997,9 @@ no_cached_page:
}
page = cached_page;
cached_page = NULL;
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
goto readpage;
}
@@ -1005,6 +1009,8 @@ out:
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (cached_page)
page_cache_release(cached_page);
+ if (prev_page)
+ page_cache_release(prev_page);
if (filp)
file_accessed(filp);
}
--
* [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
[not found] ` <20060524111859.909928820@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 16:30 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-refactor-__do_page_cache_readahead.patch --]
[-- Type: text/plain, Size: 2546 bytes --]
Add look-ahead support to __do_page_cache_readahead(),
which is needed by the adaptive read-ahead logic.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 15 +++++++++------
1 files changed, 9 insertions(+), 6 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -266,7 +266,8 @@ out:
*/
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
- pgoff_t offset, unsigned long nr_to_read)
+ pgoff_t offset, unsigned long nr_to_read,
+ unsigned long lookahead_size)
{
struct inode *inode = mapping->host;
struct page *page;
@@ -279,7 +280,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
/*
* Preallocate as many pages as we will need.
@@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
+ if (page_idx == nr_to_read - lookahead_size)
+ __SetPageReadahead(page);
ret++;
}
read_unlock_irq(&mapping->tree_lock);
@@ -338,7 +341,7 @@ int force_page_cache_readahead(struct ad
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
err = __do_page_cache_readahead(mapping, filp,
- offset, this_chunk);
+ offset, this_chunk, 0);
if (err < 0) {
ret = err;
break;
@@ -385,7 +388,7 @@ int do_page_cache_readahead(struct addre
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
}
/*
@@ -405,7 +408,7 @@ blockable_page_cache_readahead(struct ad
if (!block && bdi_read_congested(mapping->backing_dev_info))
return 0;
- actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
return check_ra_success(ra, nr_to_read, actual);
}
@@ -450,7 +453,7 @@ static int make_ahead_window(struct addr
* @req_size: hint: total size of the read which the caller is performing in
* PAGE_CACHE_SIZE units
*
- * page_cache_readahead() is the main function. If performs the adaptive
+ * page_cache_readahead() is the main function. It performs the adaptive
* readahead window size management and submits the readahead I/O.
*
* Note that @filp is purely used for passing on to the ->readpage[s]()
--
* [PATCH 07/33] readahead: insert cond_resched() calls
[not found] ` <20060524111900.419314658@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Con Kolivas
[-- Attachment #1: readahead-insert-cond_resched-calls.patch --]
[-- Type: text/plain, Size: 2140 bytes --]
Since VM_MAX_READAHEAD is greatly enlarged and the algorithm is more
complex, it becomes necessary to insert some cond_resched() calls in
the read-ahead path.
If desktop users still experience audio jitter with the new read-ahead code,
please try one of the following ways to get rid of it:
1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
2) reduce the read-ahead request size by running
blockdev --setra 256 /dev/hda # or whatever device you are using
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
This patch was recommended by Con Kolivas to improve response time for the desktop.
Thanks!
fs/mpage.c | 4 +++-
mm/readahead.c | 9 +++++++--
2 files changed, 10 insertions(+), 3 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -146,8 +146,10 @@ int read_cache_pages(struct address_spac
continue;
}
ret = filler(data, page);
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
+ cond_resched();
__pagevec_lru_add(&lru_pvec);
+ }
if (ret) {
while (!list_empty(pages)) {
struct page *victim;
@@ -184,8 +186,10 @@ static int read_pages(struct address_spa
if (!add_to_page_cache(page, mapping,
page->index, GFP_KERNEL)) {
mapping->a_ops->readpage(filp, page);
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
+ cond_resched();
__pagevec_lru_add(&lru_pvec);
+ }
} else
page_cache_release(page);
}
@@ -297,6 +301,7 @@ __do_page_cache_readahead(struct address
continue;
read_unlock_irq(&mapping->tree_lock);
+ cond_resched();
page = page_cache_alloc_cold(mapping);
read_lock_irq(&mapping->tree_lock);
if (!page)
--- linux-2.6.17-rc4-mm3.orig/fs/mpage.c
+++ linux-2.6.17-rc4-mm3/fs/mpage.c
@@ -407,8 +407,10 @@ mpage_readpages(struct address_space *ma
&last_block_in_bio, &map_bh,
&first_logical_block,
get_block);
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
+ cond_resched();
__pagevec_lru_add(&lru_pvec);
+ }
} else {
page_cache_release(page);
}
--
* [PATCH 08/33] readahead: common macros
[not found] ` <20060524111900.970898174@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 5:56 ` Nick Piggin
2006-05-25 16:33 ` Andrew Morton
0 siblings, 2 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-common-macros.patch --]
[-- Type: text/plain, Size: 1649 bytes --]
Define some commonly used macros for the read-ahead logic.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -5,6 +5,8 @@
*
* 09Apr2002 akpm@zip.com.au
* Initial version.
+ * 21May2006 Wu Fengguang <wfg@mail.ustc.edu.cn>
+ * Adaptive read-ahead framework.
*/
#include <linux/kernel.h>
@@ -14,6 +16,14 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/nfsd/const.h>
+
+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
+#define PAGES_KB(size) PAGES_BYTE((size)*1024)
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -21,7 +31,7 @@ void default_unplug_io_fn(struct backing
EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
- .ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+ .ra_pages = PAGES_KB(VM_MAX_READAHEAD),
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
@@ -50,7 +60,7 @@ static inline unsigned long get_max_read
static inline unsigned long get_min_readahead(struct file_ra_state *ra)
{
- return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ return PAGES_KB(VM_MIN_READAHEAD);
}
static inline void reset_ahead_window(struct file_ra_state *ra)
--
* [PATCH 09/33] readahead: events accounting
[not found] ` <20060524111901.581603095@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 16:36 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Jörn Engel, Ingo Oeser
[-- Attachment #1: readahead-events-accounting.patch --]
[-- Type: text/plain, Size: 10611 bytes --]
A debugfs file named `readahead/events' is created following advice from
Jörn Engel, Andrew Morton and Ingo Oeser.
It reveals various read-ahead activities/events, and is vital for testing.
---------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.
- Preparations
## First compile kernel with CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debugfs none /debug
- For each session with distinct access pattern
echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to back it up first
# echo 3 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 1 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2
The commented-out commands can uncover more detailed file accesses,
which is sometimes useful. Note that the log file can grow huge!
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 293 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 292 insertions(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -19,12 +19,76 @@
#include <linux/writeback.h>
#include <linux/nfsd/const.h>
+/*
+ * Detailed classification of read-ahead behaviors.
+ */
+#define RA_CLASS_SHIFT 4
+#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+ RA_CLASS_ALL,
+ RA_CLASS_INITIAL,
+ RA_CLASS_STATE,
+ RA_CLASS_CONTEXT,
+ RA_CLASS_CONTEXT_AGGRESSIVE,
+ RA_CLASS_BACKWARD,
+ RA_CLASS_THRASHING,
+ RA_CLASS_SEEK,
+ RA_CLASS_NONE,
+ RA_CLASS_COUNT
+};
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+ RA_EVENT_CACHE_MISS, /* read cache misses */
+ RA_EVENT_RANDOM_READ, /* random reads */
+ RA_EVENT_IO_CONGESTION, /* i/o congestion */
+ RA_EVENT_IO_CACHE_HIT, /* canceled i/o due to cache hit */
+ RA_EVENT_IO_BLOCK, /* wait for i/o completion */
+
+ RA_EVENT_READAHEAD, /* read-ahead issued */
+ RA_EVENT_READAHEAD_HIT, /* read-ahead page hit */
+ RA_EVENT_LOOKAHEAD, /* look-ahead issued */
+ RA_EVENT_LOOKAHEAD_HIT, /* look-ahead mark hit */
+ RA_EVENT_LOOKAHEAD_NOACTION, /* look-ahead mark ignored */
+ RA_EVENT_READAHEAD_MMAP, /* read-ahead for mmap access */
+ RA_EVENT_READAHEAD_EOF, /* read-ahead reaches EOF */
+ RA_EVENT_READAHEAD_SHRINK, /* ra_size falls under previous la_size */
+ RA_EVENT_READAHEAD_THRASHING, /* read-ahead thrashing happened */
+ RA_EVENT_READAHEAD_MUTILATE, /* read-ahead mutilated by imbalanced aging */
+ RA_EVENT_READAHEAD_RESCUE, /* read-ahead rescued */
+
+ RA_EVENT_READAHEAD_CUBE,
+ RA_EVENT_COUNT
+};
+
+#ifdef CONFIG_DEBUG_READAHEAD
+u32 initial_ra_hit;
+u32 initial_ra_miss;
+u32 debug_level = 1;
+u32 disable_stateful_method = 0;
+static const char * const ra_class_name[];
+static void ra_account(struct file_ra_state *ra, enum ra_event e, int pages);
+# define debug_inc(var) do { var++; } while (0)
+# define debug_option(o) (o)
+#else
+# define ra_account(ra, e, pages) do { } while (0)
+# define debug_inc(var) do { } while (0)
+# define debug_option(o) (0)
+# define debug_level (0)
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+#define dprintk(args...) \
+ do { if (debug_level >= 2) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+ do { if (debug_level >= 3) printk(KERN_DEBUG args); } while(0)
+
#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
#define PAGES_KB(size) PAGES_BYTE((size)*1024)
#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
@@ -365,6 +429,9 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
return ret;
}
@@ -400,10 +467,16 @@ static inline int check_ra_success(struc
int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read)
{
+ unsigned long ret;
+
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+ ret = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+ ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
+ return ret;
}
/*
@@ -425,6 +498,10 @@ blockable_page_cache_readahead(struct ad
actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+ ra_account(NULL, RA_EVENT_READAHEAD, actual);
+ dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+ mapping->host->i_ino, offset, nr_to_read, actual);
+
return check_ra_success(ra, nr_to_read, actual);
}
@@ -604,3 +681,217 @@ unsigned long max_sane_readahead(unsigne
__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef CONFIG_DEBUG_READAHEAD
+
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+static const char * const ra_class_name[] = {
+ "total",
+ "initial",
+ "state",
+ "context",
+ "contexta",
+ "backward",
+ "onthrash",
+ "onseek",
+ "none"
+};
+
+static const char * const ra_event_name[] = {
+ "cache_miss",
+ "random_read",
+ "io_congestion",
+ "io_cache_hit",
+ "io_block",
+ "readahead",
+ "readahead_hit",
+ "lookahead",
+ "lookahead_hit",
+ "lookahead_ignore",
+ "readahead_mmap",
+ "readahead_eof",
+ "readahead_shrink",
+ "readahead_thrash",
+ "readahead_mutilt",
+ "readahead_rescue"
+};
+
+static unsigned long ra_events[RA_CLASS_COUNT][RA_EVENT_COUNT][2];
+
+static void ra_account(struct file_ra_state *ra, enum ra_event e, int pages)
+{
+ enum ra_class c;
+
+ if (!debug_level)
+ return;
+
+ if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+ c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+ pages = -pages;
+ } else if (ra)
+ c = ra->flags & RA_CLASS_MASK;
+ else
+ c = RA_CLASS_NONE;
+
+ if (!c)
+ c = RA_CLASS_NONE;
+
+ ra_events[c][e][0] += 1;
+ ra_events[c][e][1] += pages;
+
+ if (e == RA_EVENT_READAHEAD)
+ ra_events[c][RA_EVENT_READAHEAD_CUBE][1] += pages * pages;
+}
+
+static int ra_events_show(struct seq_file *s, void *_)
+{
+ int i;
+ int c;
+ int e;
+ static const char event_fmt[] = "%-16s";
+ static const char class_fmt[] = "%10s";
+ static const char item_fmt[] = "%10lu";
+ static const char percent_format[] = "%9lu%%";
+ static const char * const table_name[] = {
+ "[table requests]",
+ "[table pages]",
+ "[table summary]"};
+
+ for (i = 0; i <= 1; i++) {
+ for (e = 0; e < RA_EVENT_COUNT; e++) {
+ ra_events[RA_CLASS_ALL][e][i] = 0;
+ for (c = RA_CLASS_INITIAL; c < RA_CLASS_NONE; c++)
+ ra_events[RA_CLASS_ALL][e][i] += ra_events[c][e][i];
+ }
+
+ seq_printf(s, event_fmt, table_name[i]);
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ for (e = 0; e < RA_EVENT_COUNT; e++) {
+ if (e == RA_EVENT_READAHEAD_CUBE)
+ continue;
+ if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+ continue;
+ if (e == RA_EVENT_IO_BLOCK && i == 1)
+ continue;
+
+ seq_printf(s, event_fmt, ra_event_name[e]);
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, item_fmt, ra_events[c][e][i]);
+ seq_puts(s, "\n");
+ }
+ seq_puts(s, "\n");
+ }
+
+ seq_printf(s, event_fmt, table_name[2]);
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "random_rate");
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, percent_format,
+ (ra_events[c][RA_EVENT_RANDOM_READ][0] * 100) /
+ ((ra_events[c][RA_EVENT_RANDOM_READ][0] +
+ ra_events[c][RA_EVENT_READAHEAD][0]) | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "ra_hit_rate");
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, percent_format,
+ (ra_events[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+ (ra_events[c][RA_EVENT_READAHEAD][1] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "la_hit_rate");
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, percent_format,
+ (ra_events[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+ (ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "var_ra_size");
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, item_fmt,
+ (ra_events[c][RA_EVENT_READAHEAD_CUBE][1] -
+ ra_events[c][RA_EVENT_READAHEAD][1] *
+ (ra_events[c][RA_EVENT_READAHEAD][1] /
+ (ra_events[c][RA_EVENT_READAHEAD][0] | 1))) /
+ (ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_ra_size");
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, item_fmt,
+ (ra_events[c][RA_EVENT_READAHEAD][1] +
+ ra_events[c][RA_EVENT_READAHEAD][0] / 2) /
+ (ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_la_size");
+ for (c = 0; c < RA_CLASS_COUNT; c++)
+ seq_printf(s, item_fmt,
+ (ra_events[c][RA_EVENT_LOOKAHEAD][1] +
+ ra_events[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+ (ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+ seq_puts(s, "\n");
+
+ return 0;
+}
+
+static int ra_events_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, ra_events_show, NULL);
+}
+
+static ssize_t ra_events_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+	memset(ra_events, 0, sizeof(ra_events));
+	return size;	/* consume the whole write */
+}
+
+struct file_operations ra_events_fops = {
+ .owner = THIS_MODULE,
+ .open = ra_events_open,
+ .write = ra_events_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+#define READAHEAD_DEBUGFS_ENTRY_U32(var) \
+ debugfs_create_u32(__stringify(var), 0644, root, &var)
+
+#define READAHEAD_DEBUGFS_ENTRY_BOOL(var) \
+ debugfs_create_bool(__stringify(var), 0644, root, &var)
+
+static int __init readahead_init(void)
+{
+ struct dentry *root;
+
+ root = debugfs_create_dir("readahead", NULL);
+
+ debugfs_create_file("events", 0644, root, NULL, &ra_events_fops);
+
+ READAHEAD_DEBUGFS_ENTRY_U32(initial_ra_hit);
+ READAHEAD_DEBUGFS_ENTRY_U32(initial_ra_miss);
+
+ READAHEAD_DEBUGFS_ENTRY_U32(debug_level);
+ READAHEAD_DEBUGFS_ENTRY_BOOL(disable_stateful_method);
+
+ return 0;
+}
+
+module_init(readahead_init)
+
+#endif /* CONFIG_DEBUG_READAHEAD */
--
* [PATCH 10/33] readahead: support functions
[not found] ` <20060524111901.976888971@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 5:13 ` Nick Piggin
2006-05-25 16:48 ` Andrew Morton
0 siblings, 2 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-support-functions.patch --]
[-- Type: text/plain, Size: 4222 bytes --]
Several support functions for adaptive read-ahead.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 11 +++++
mm/readahead.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1029,6 +1029,17 @@ void handle_ra_miss(struct address_space
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+extern int readahead_ratio;
+#else
+#define readahead_ratio 1
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+static inline int prefer_adaptive_readahead(void)
+{
+ return readahead_ratio >= 10;
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -683,6 +683,113 @@ unsigned long max_sane_readahead(unsigne
}
/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ * 1. state based - the default one
+ * 2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ * a read-ahead chunk
+ * +-----------------------------------------+
+ * | # PG_readahead |
+ * +-----------------------------------------+
+ * ^ When this page is read, notify me for the next read-ahead.
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * The nature of read-ahead allows false tests to occur occasionally.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+ pgoff_t offset)
+{
+ return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+ pgoff_t offset)
+{
+ struct page *page;
+
+ read_lock_irq(&mapping->tree_lock);
+ page = __find_page(mapping, offset);
+ read_unlock_irq(&mapping->tree_lock);
+ return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+ int pgrescue;
+ pgoff_t index;
+ struct zone *zone;
+ struct address_space *mapping;
+
+ BUG_ON(!nr_pages || !page);
+ pgrescue = 0;
+ index = page_index(page);
+ mapping = page_mapping(page);
+
+ dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+ mapping->host->i_ino, index, nr_pages);
+
+ for(;;) {
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+
+ if (!PageLRU(page))
+ goto out_unlock;
+
+ while (page_mapping(page) == mapping &&
+ page_index(page) == index) {
+ struct page *the_page = page;
+ page = next_page(page);
+ if (!PageActive(the_page) &&
+ !PageLocked(the_page) &&
+ page_count(the_page) == 1) {
+ list_move(&the_page->lru, &zone->inactive_list);
+ pgrescue++;
+ }
+ index++;
+ if (!--nr_pages)
+ goto out_unlock;
+ }
+
+ spin_unlock_irq(&zone->lru_lock);
+
+ cond_resched();
+ page = find_page(mapping, index);
+ if (!page)
+ goto out;
+ }
+out_unlock:
+ spin_unlock_irq(&zone->lru_lock);
+out:
+ ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+ return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
* Read-ahead events accounting.
*/
#ifdef CONFIG_DEBUG_READAHEAD
--
* [PATCH 11/33] readahead: sysctl parameters
[not found] ` <20060524111902.491708692@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-25 4:50 ` [PATCH 12/33] readahead: min/max sizes Nick Piggin
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-parameter-sysctl-variables.patch --]
[-- Type: text/plain, Size: 5039 bytes --]
Add new sysctl entries in /proc/sys/vm:
- readahead_ratio = 50
i.e. limit the read-ahead size to (readahead_ratio)% of the thrashing threshold
- readahead_hit_rate = 1
i.e. a read-ahead hit ratio >= (1/readahead_hit_rate) is deemed OK
readahead_ratio also provides a way to select read-ahead logic at runtime:
condition action
==========================================================================
readahead_ratio == 0 disable read-ahead
readahead_ratio <= 9 select the (old) stock read-ahead logic
readahead_ratio >= 10 select the (new) adaptive read-ahead logic
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
Documentation/sysctl/vm.txt | 37 +++++++++++++++++++++++++++++++++++++
include/linux/sysctl.h | 2 ++
kernel/sysctl.c | 28 ++++++++++++++++++++++++++++
mm/readahead.c | 17 +++++++++++++++++
4 files changed, 84 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -20,6 +20,23 @@
#include <linux/nfsd/const.h>
/*
+ * Adaptive read-ahead parameters.
+ */
+
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL_GPL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 1;
+
+/*
* Detailed classification of read-ahead behaviors.
*/
#define RA_CLASS_SHIFT 4
--- linux-2.6.17-rc4-mm3.orig/include/linux/sysctl.h
+++ linux-2.6.17-rc4-mm3/include/linux/sysctl.h
@@ -194,6 +194,8 @@ enum
VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_SWAP_PREFETCH=34, /* swap prefetch */
+ VM_READAHEAD_RATIO=35, /* percent of read-ahead size to thrashing-threshold */
+ VM_READAHEAD_HIT_RATE=36, /* one accessed page legitimizes so many read-ahead pages */
};
/* CTL_NET names: */
--- linux-2.6.17-rc4-mm3.orig/kernel/sysctl.c
+++ linux-2.6.17-rc4-mm3/kernel/sysctl.c
@@ -77,6 +77,12 @@ extern int percpu_pagelist_fraction;
extern int compat_log;
extern int print_fatal_signals;
+#if defined(CONFIG_ADAPTIVE_READAHEAD)
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+static int one = 1;
+#endif
+
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
int nmi_watchdog_enabled;
@@ -987,6 +993,28 @@ static ctl_table vm_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+ {
+ .ctl_name = VM_READAHEAD_RATIO,
+ .procname = "readahead_ratio",
+ .data = &readahead_ratio,
+ .maxlen = sizeof(readahead_ratio),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = VM_READAHEAD_HIT_RATE,
+ .procname = "readahead_hit_rate",
+ .data = &readahead_hit_rate,
+ .maxlen = sizeof(readahead_hit_rate),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ },
+#endif
{ .ctl_name = 0 }
};
--- linux-2.6.17-rc4-mm3.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.17-rc4-mm3/Documentation/sysctl/vm.txt
@@ -31,6 +31,8 @@ Currently, these files are in /proc/sys/
- zone_reclaim_interval
- panic_on_oom
- swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
==============================================================
@@ -202,3 +204,38 @@ copying back pages from swap into the sw
practice it can take many minutes before the vm is idle enough.
The default value is 1.
+
+==============================================================
+
+readahead_ratio
+
+This limits the readahead size to a percentage of the thrashing threshold.
+The thrashing threshold is dynamically estimated from the _history_ read
+speed and system load, and used to deduce the _future_ readahead size.
+
+Set it to a smaller value if you do not have enough memory for all the
+concurrent readers, or if the I/O loads fluctuate a lot. But if there is
+plenty of memory (>2MB per reader), a bigger value may help performance.
+
+readahead_ratio also selects the readahead logic:
+ VALUE CODE PATH
+ -------------------------------------------
+ 0 disable readahead totally
+ 1-9 select the stock readahead logic
+ 10-inf select the adaptive readahead logic
+
+The default value is 50. Reasonable values would be [50, 100].
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of (readahead-pages : accessed-pages).
+Useful only when (readahead_ratio >= 10). If the previous readahead
+request had a bad hit rate, the kernel will be reluctant to do the next
+readahead.
+
+Larger values help catch more sparse access patterns. Be aware that
+reading ahead on sparse patterns trades memory for speed.
+
+The default value is 1. It is recommended to keep the value below 16.
--
* [PATCH 13/33] readahead: state based method - aging accounting
[not found] ` <20060524111903.510268987@localhost.localdomain>
@ 2006-05-24 11:12 ` Wu Fengguang
2006-05-26 17:04 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-stateful-aging.patch --]
[-- Type: text/plain, Size: 5221 bytes --]
Collect info about the global available memory and its consumption speed.
The data are used by the stateful method to estimate the thrashing threshold.
They are the decisive factor of the correctness/accuracy of the resulting
read-ahead size.
- On NUMA systems, the accounting is done on a per-node basis. It works for
the two common real-world schemes:
- the reader process allocates caches in a node-affine manner;
- the reader process allocates caches evenly from a set of nodes.
- On non-NUMA systems, readahead_aging is mainly increased on the first
access of the read-ahead pages, in order to make it grow constantly and
smoothly. This helps improve accuracy on small/fast read-aheads, at
the cost of a little more overhead.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 9 +++++++++
include/linux/mmzone.h | 5 +++++
mm/Kconfig | 5 +++++
mm/readahead.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap.c | 2 ++
mm/vmscan.c | 4 ++++
6 files changed, 74 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/Kconfig
+++ linux-2.6.17-rc4-mm3/mm/Kconfig
@@ -203,3 +203,8 @@ config DEBUG_READAHEAD
echo 1 > /debug/readahead/debug_level # stop filling my kern.log
Say N for production servers.
+
+config READAHEAD_SMOOTH_AGING
+ def_bool n if NUMA
+ default y if !NUMA
+ depends on ADAPTIVE_READAHEAD
--- linux-2.6.17-rc4-mm3.orig/include/linux/mmzone.h
+++ linux-2.6.17-rc4-mm3/include/linux/mmzone.h
@@ -161,6 +161,11 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
+ /* The accumulated number of activities that may cause page aging,
+ * that is, make some pages closer to the tail of inactive_list.
+ */
+ unsigned long aging_total;
+
/* A count of how many reclaimers are scanning this zone */
atomic_t reclaim_in_progress;
--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1044,6 +1044,15 @@ static inline int prefer_adaptive_readah
return readahead_ratio >= 10;
}
+DECLARE_PER_CPU(unsigned long, readahead_aging);
+static inline void inc_readahead_aging(void)
+{
+#ifdef CONFIG_READAHEAD_SMOOTH_AGING
+ if (prefer_adaptive_readahead())
+ per_cpu(readahead_aging, raw_smp_processor_id())++;
+#endif
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.17-rc4-mm3.orig/mm/vmscan.c
+++ linux-2.6.17-rc4-mm3/mm/vmscan.c
@@ -457,6 +457,9 @@ static unsigned long shrink_page_list(st
if (PageWriteback(page))
goto keep_locked;
+ if (!PageReferenced(page))
+ inc_readahead_aging();
+
referenced = page_referenced(page, 1);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
@@ -655,6 +658,7 @@ static unsigned long shrink_inactive_lis
&page_list, &nr_scan);
zone->nr_inactive -= nr_taken;
zone->pages_scanned += nr_scan;
+ zone->aging_total += nr_scan;
spin_unlock_irq(&zone->lru_lock);
nr_scanned += nr_scan;
--- linux-2.6.17-rc4-mm3.orig/mm/swap.c
+++ linux-2.6.17-rc4-mm3/mm/swap.c
@@ -127,6 +127,8 @@ void fastcall mark_page_accessed(struct
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
+ if (PageLRU(page))
+ inc_readahead_aging();
}
}
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -18,6 +18,7 @@
#include <linux/pagevec.h>
#include <linux/writeback.h>
#include <linux/nfsd/const.h>
+#include <asm/div64.h>
/*
* Adaptive read-ahead parameters.
@@ -37,6 +38,14 @@ EXPORT_SYMBOL_GPL(readahead_ratio);
int readahead_hit_rate = 1;
/*
+ * Measures the aging process of cold pages.
+ * Mainly increased on fresh page references to make it smooth.
+ */
+#ifdef CONFIG_READAHEAD_SMOOTH_AGING
+DEFINE_PER_CPU(unsigned long, readahead_aging);
+#endif
+
+/*
* Detailed classification of read-ahead behaviors.
*/
#define RA_CLASS_SHIFT 4
@@ -805,6 +814,46 @@ out:
}
/*
+ * The node's effective length of inactive_list(s).
+ */
+static unsigned long node_free_and_cold_pages(void)
+{
+ unsigned int i;
+ unsigned long sum = 0;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].nr_inactive +
+ zones[i].free_pages - zones[i].pages_low;
+
+ return sum;
+}
+
+/*
+ * The node's accumulated aging activities.
+ */
+static unsigned long node_readahead_aging(void)
+{
+ unsigned long sum = 0;
+
+#ifdef CONFIG_READAHEAD_SMOOTH_AGING
+ unsigned long cpu;
+ cpumask_t mask = node_to_cpumask(numa_node_id());
+
+ for_each_cpu_mask(cpu, mask)
+ sum += per_cpu(readahead_aging, cpu);
+#else
+ unsigned int i;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].aging_total;
+#endif
+
+ return sum;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory. Reasonable?
*
* Table of concrete numbers for 4KB page size:
--
* [PATCH 14/33] readahead: state based method - data structure
[not found] ` <20060524111904.019763011@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-25 6:03 ` Nick Piggin
2006-05-26 17:05 ` Andrew Morton
0 siblings, 2 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-stateful-data.patch --]
[-- Type: text/plain, Size: 3134 bytes --]
Extend struct file_ra_state to support the adaptive read-ahead logic.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/fs.h | 57 +++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 47 insertions(+), 10 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/fs.h
+++ linux-2.6.17-rc4-mm3/include/linux/fs.h
@@ -613,21 +613,58 @@ struct fown_struct {
/*
* Track a single file's readahead state
+ *
+ * Diagram for the adaptive readahead logic:
+ *
+ * |--------- old chunk ------->|-------------- new chunk -------------->|
+ * +----------------------------+----------------------------------------+
+ * | # | # |
+ * +----------------------------+----------------------------------------+
+ * ^ ^ ^ ^
+ * file_ra_state.la_index .ra_index .lookahead_index .readahead_index
+ *
+ * Deduced sizes:
+ * |----------- readahead size ------------>|
+ * +----------------------------+----------------------------------------+
+ * | # | # |
+ * +----------------------------+----------------------------------------+
+ * |------- invoke interval ------>|-- lookahead size -->|
*/
struct file_ra_state {
- unsigned long start; /* Current window */
- unsigned long size;
- unsigned long flags; /* ra flags RA_FLAG_xxx*/
- unsigned long cache_hit; /* cache hit count*/
- unsigned long prev_page; /* Cache last read() position */
- unsigned long ahead_start; /* Ahead window */
- unsigned long ahead_size;
- unsigned long ra_pages; /* Maximum readahead window */
- unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
- unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
+ union {
+ struct { /* conventional read-ahead */
+ unsigned long start; /* Current window */
+ unsigned long size;
+ unsigned long ahead_start; /* Ahead window */
+ unsigned long ahead_size;
+ unsigned long cache_hit; /* cache hit count */
+ };
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+ struct { /* adaptive read-ahead */
+ pgoff_t la_index;
+ pgoff_t ra_index;
+ pgoff_t lookahead_index;
+ pgoff_t readahead_index;
+ unsigned long age;
+ uint64_t cache_hits;
+ };
+#endif
+ };
+
+ /* mmap read-around */
+ unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
+ unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
+
+ /* common ones */
+ unsigned long flags; /* ra flags RA_FLAG_xxx*/
+ unsigned long prev_page; /* Cache last read() position */
+ unsigned long ra_pages; /* Maximum readahead window */
};
#define RA_FLAG_MISS 0x01 /* a cache miss occurred against this file */
#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
+#define RA_FLAG_MMAP (1UL<<31) /* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD (1UL<<30) /* disable look-ahead */
+#define RA_FLAG_EOF (1UL<<29) /* readahead hits EOF */
struct file {
/*
--
^ permalink raw reply [flat|nested] 107+ messages in thread
* [PATCH 15/33] readahead: state based method - routines
[not found] ` <20060524111904.683513683@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-26 17:15 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-stateful-routines.patch --]
[-- Type: text/plain, Size: 5765 bytes --]
Define some helpers on struct file_ra_state.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 186 insertions(+), 2 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -854,6 +854,190 @@ static unsigned long node_readahead_agin
}
/*
+ * Some helpers for querying/building a read-ahead request.
+ *
+ * Diagram for some variable names used frequently:
+ *
+ * |<------- la_size ------>|
+ * +-----------------------------------------+
+ * | # |
+ * +-----------------------------------------+
+ * ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+static enum ra_class ra_class_new(struct file_ra_state *ra)
+{
+ return ra->flags & RA_CLASS_MASK;
+}
+
+static inline enum ra_class ra_class_old(struct file_ra_state *ra)
+{
+ return (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+}
+
+static unsigned long ra_readahead_size(struct file_ra_state *ra)
+{
+ return ra->readahead_index - ra->ra_index;
+}
+
+static unsigned long ra_lookahead_size(struct file_ra_state *ra)
+{
+ return ra->readahead_index - ra->lookahead_index;
+}
+
+static unsigned long ra_invoke_interval(struct file_ra_state *ra)
+{
+ return ra->lookahead_index - ra->la_index;
+}
+
+/*
+ * The 64bit cache_hits stores three accumulated values and a counter value.
+ * MSB LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+ return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
+}
+
+/*
+ * Conceptual code:
+ * ra_cache_hit(ra, 1) += ra_cache_hit(ra, 0);
+ * ra_cache_hit(ra, 0) = 0;
+ */
+static void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+ int n;
+
+ n = ra_cache_hit(ra, 0);
+ ra->cache_hits -= n;
+ n <<= 16;
+ ra->cache_hits += n;
+}
+
+/*
+ * The read-ahead is deemed success if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+ return ra_cache_hit(ra, 0) * readahead_hit_rate >=
+ (ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+ if (index < ra->la_index || index >= ra->readahead_index)
+ return 0;
+
+ if (index >= ra->ra_index)
+ return 1;
+ else
+ return -1;
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static void ra_set_class(struct file_ra_state *ra,
+ enum ra_class ra_class)
+{
+ unsigned long flags_mask;
+ unsigned long flags;
+ unsigned long old_ra_class;
+
+ flags_mask = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+ flags = ra->flags & flags_mask;
+
+ old_ra_class = ra_class_new(ra) << RA_CLASS_SHIFT;
+
+ ra->flags = flags | old_ra_class | ra_class;
+
+ ra_addup_cache_hit(ra);
+ if (ra_class != RA_CLASS_STATE)
+ ra->cache_hits <<= 16;
+
+ ra->age = node_readahead_aging();
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static void ra_set_index(struct file_ra_state *ra,
+ pgoff_t la_index, pgoff_t ra_index)
+{
+ ra->la_index = la_index;
+ ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static void ra_set_size(struct file_ra_state *ra,
+ unsigned long ra_size, unsigned long la_size)
+{
+ /* Disable look-ahead for loopback file. */
+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+ la_size = 0;
+
+ ra->readahead_index = ra->ra_index + ra_size;
+ ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+ struct address_space *mapping, struct file *filp)
+{
+ enum ra_class ra_class = ra_class_new(ra);
+ unsigned long ra_size = ra_readahead_size(ra);
+ unsigned long la_size = ra_lookahead_size(ra);
+ pgoff_t eof_index = PAGES_BYTE(i_size_read(mapping->host)) + 1;
+ int actual;
+
+ if (unlikely(ra->ra_index >= eof_index))
+ return 0;
+
+ /* Snap to EOF. */
+ if (ra->readahead_index + ra_size / 2 > eof_index) {
+ if (ra_class == RA_CLASS_CONTEXT_AGGRESSIVE &&
+ eof_index > ra->lookahead_index + 1)
+ la_size = eof_index - ra->lookahead_index;
+ else
+ la_size = 0;
+ ra_size = eof_index - ra->ra_index;
+ ra_set_size(ra, ra_size, la_size);
+ ra->flags |= RA_FLAG_EOF;
+ }
+
+ actual = __do_page_cache_readahead(mapping, filp,
+ ra->ra_index, ra_size, la_size);
+
+#ifdef CONFIG_DEBUG_READAHEAD
+ if (ra->flags & RA_FLAG_MMAP)
+ ra_account(ra, RA_EVENT_READAHEAD_MMAP, actual);
+ if (ra->readahead_index == eof_index)
+ ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+ if (la_size)
+ ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+ if (ra_size > actual)
+ ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
+ ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+ dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+ ra_class_name[ra_class],
+ mapping->host->i_ino, ra->la_index,
+ ra->ra_index, ra_size, la_size, actual);
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+ return actual;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory. Reasonable?
*
* Table of concrete numbers for 4KB page size:
@@ -925,10 +1109,10 @@ static void ra_account(struct file_ra_st
return;
if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
- c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+ c = ra_class_old(ra);
pages = -pages;
} else if (ra)
- c = ra->flags & RA_CLASS_MASK;
+ c = ra_class_new(ra);
else
c = RA_CLASS_NONE;
--
* [PATCH 17/33] readahead: context based method
[not found] ` <20060524111905.586110688@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-25 5:26 ` Nick Piggin
` (2 more replies)
2006-05-24 12:37 ` Peter Zijlstra
1 sibling, 3 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 17152 bytes --]
This is the slow code path of adaptive read-ahead.
No valid state info is available, so the page cache is queried to obtain
the required position/timing info. This kind of estimation is more conservative
than the stateful method, and also fluctuates more with load variance.
HOW IT WORKS
============
It works by peeking into the file cache and checking whether there are any history
pages present or accessed. In this way it can detect almost all forms of
sequential / semi-sequential read patterns, e.g.
- parallel / interleaved sequential scans on one file
- sequential reads across file open/close
- mixed sequential / random accesses
- sparse / skimming sequential read
HOW DATABASES CAN BENEFIT FROM IT
=================================
The adaptive readahead might help db performance in the following cases:
- concurrent sequential scans
- sequential scan on a fragmented table
- index scan with clustered matches
- index scan on majority rows (in case the planner goes wrong)
ALGORITHM STEPS
===============
- look back/forward to find the ra_index;
- look back to estimate a thrashing safe ra_size;
- assemble the next read-ahead request in file_ra_state;
- submit it.
ALGORITHM DYNAMICS
==================
* startup
When a sequential read is detected, the chunk size is set to readahead-min
and grows with each readahead. The growth rate is controlled by
readahead-ratio. When readahead-ratio == 100, the new logic grows chunk
sizes exponentially -- like the current logic, but lags behind it at
early steps.
* stabilize
When chunk size reaches readahead-max, or comes close to
(readahead-ratio * thrashing-threshold)
it stops growing and stays there.
The main difference with the stock readahead logic occurs at and after
the time chunk size stops growing:
- The current logic grows the chunk size exponentially in the normal case and
halves it each time thrashing is seen. That can lead to
thrashing on almost every readahead for very slow streams.
- The new logic can stop at a size below the thrashing-threshold,
and stay there stable.
* on stream speed up or system load fall
thrashing-threshold follows up and chunk size is likely to be enlarged.
* on stream slow down or system load rocket up
thrashing-threshold falls down.
If thrashing happened, the next read would be treated as a random read,
and with another read the chunk-size-growing-phase is restarted.
For a slow stream that has (thrashing-threshold < readahead-max):
- When readahead-ratio = 100, there is only one chunk in cache at
most time;
- When readahead-ratio = 50, there are two chunks in cache at most
time.
- Lowering readahead-ratio helps gracefully cut down the chunk size
without thrashing.
OVERHEADS
=========
The context based method has some overhead over the stateful method, due
to extra locking and memory scans.
Running oprofile on the following command shows the following differences:
# diff sparse sparse1
total oprofile samples run1 run2
stateful method 560482 558696
stateless method 564463 559413
So the average overhead is about 0.4%.
Detailed diffprofile data:
# diffprofile oprofile.50.stateful oprofile.50.stateless
2998 41.1% isolate_lru_pages
2669 26.4% shrink_zone
1822 14.7% system_call
1419 27.6% radix_tree_delete
1376 14.8% _raw_write_lock
1279 27.4% free_pages_bulk
1111 12.0% _raw_write_unlock
1035 43.3% free_hot_cold_page
849 15.3% unlock_page
786 29.6% page_referenced
710 4.6% kmap_atomic
651 26.4% __pagevec_release_nonlru
586 16.1% __rmqueue
578 11.3% find_get_page
481 15.5% page_waitqueue
440 6.6% add_to_page_cache
420 33.7% fget_light
260 4.3% get_page_from_freelist
223 13.7% find_busiest_group
221 35.1% mutex_debug_check_no_locks_freed
211 0.0% radix_tree_scan_hole
198 35.5% delay_tsc
195 14.8% ext3_get_branch
182 12.6% profile_tick
173 0.0% radix_tree_cache_lookup_node
164 22.9% find_next_bit
162 50.3% page_cache_readahead_adaptive
...
106 0.0% radix_tree_scan_hole_backward
...
-51 -7.6% radix_tree_preload
...
-68 -2.1% radix_tree_insert
...
-87 -2.0% mark_page_accessed
-88 -2.0% __pagevec_lru_add
-103 -7.7% softlockup_tick
-107 -71.8% free_block
-122 -77.7% do_IRQ
-132 -82.0% do_timer
-140 -47.1% ack_edge_ioapic_vector
-168 -81.2% handle_IRQ_event
-192 -35.2% irq_entries_start
-204 -14.8% rw_verify_area
-214 -13.2% account_system_time
-233 -9.5% radix_tree_lookup_node
-234 -16.6% scheduler_tick
-259 -58.7% __do_IRQ
-266 -6.8% put_page
-318 -29.3% rcu_pending
-333 -3.0% do_generic_mapping_read
-337 -28.3% hrtimer_run_queues
-493 -27.0% __rcu_pending
-1038 -9.4% default_idle
-3323 -3.5% __copy_to_user_ll
-10331 -5.9% do_mpage_readpage
# diffprofile oprofile.50.stateful2 oprofile.50.stateless2
1739 1.1% do_mpage_readpage
833 0.9% __copy_to_user_ll
340 21.3% find_busiest_group
288 9.5% free_hot_cold_page
261 4.6% _raw_read_unlock
239 3.9% get_page_from_freelist
201 0.0% radix_tree_scan_hole
163 14.3% raise_softirq
160 0.0% radix_tree_cache_lookup_node
160 11.8% update_process_times
136 9.3% fget_light
121 35.1% page_cache_readahead_adaptive
117 36.0% restore_all
117 2.8% mark_page_accessed
109 6.4% rebalance_tick
107 9.4% sys_read
102 0.0% radix_tree_scan_hole_backward
...
63 4.0% readahead_cache_hit
...
-10 -15.9% radix_tree_node_alloc
...
-39 -1.7% radix_tree_lookup_node
-39 -10.3% irq_entries_start
-43 -1.3% radix_tree_insert
...
-47 -4.6% __do_page_cache_readahead
-64 -9.3% radix_tree_preload
-65 -5.4% rw_verify_area
-65 -2.2% vfs_read
-70 -4.7% timer_interrupt
-71 -1.0% __wake_up_bit
-73 -1.1% radix_tree_delete
-79 -12.6% __mod_page_state_offset
-94 -1.8% __find_get_block
-94 -2.2% __pagevec_lru_add
-102 -1.7% free_pages_bulk
-116 -1.3% _raw_read_lock
-123 -7.4% do_sync_read
-130 -8.4% ext3_get_blocks_handle
-142 -3.8% put_page
-146 -7.9% mpage_readpages
-147 -5.6% apic_timer_interrupt
-168 -1.6% _raw_write_unlock
-172 -5.0% page_referenced
-206 -3.2% unlock_page
-212 -15.0% restore_nocheck
-213 -2.1% default_idle
-245 -5.0% __rmqueue
-278 -4.3% find_get_page
-282 -2.1% system_call
-287 -11.8% run_timer_softirq
-300 -2.7% _raw_write_lock
-420 -3.2% shrink_zone
-661 -5.7% isolate_lru_pages
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 329 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 329 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1185,6 +1185,335 @@ state_based_readahead(struct address_spa
}
/*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with figure:
+ *
+ * chunk A chunk B chunk C head
+ *
+ * l01 l11 l12 l21 l22
+ *| |-->|-->| |------>|-->| |------>|
+ *| +-------+ +-----------+ +-------------+ |
+ *| | # | | # | | # | |
+ *| +-------+ +-----------+ +-------------+ |
+ *| |<==============|<===========================|<============================|
+ * L0 L1 L2
+ *
+ * Let f(l) = L be a map from
+ * l: the number of pages read by the stream
+ * to
+ * L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * f(l01) <= L0
+ * f(l11 + l12) = L1
+ * f(l21 + l22) = L2
+ * ...
+ * f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ * <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of continuous history pages left in the inactive_list is always
+ * a lower estimation of the true thrashing-threshold.
+ */
+
+#define PAGE_REFCNT_0 0
+#define PAGE_REFCNT_1 (1 << PG_referenced)
+#define PAGE_REFCNT_2 (1 << PG_active)
+#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK PAGE_REFCNT_3
+
+/*
+ * STATUS REFERENCE COUNT
+ * __ 0
+ * _R PAGE_REFCNT_1
+ * A_ PAGE_REFCNT_2
+ * AR PAGE_REFCNT_3
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+ return page->flags & PAGE_REFCNT_MASK;
+}
+
+/*
+ * STATUS REFERENCE COUNT TYPE
+ * __ 0 fresh
+ * _R PAGE_REFCNT_1 stale
+ * A_ PAGE_REFCNT_2 disturbed once
+ * AR PAGE_REFCNT_3 disturbed twice
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+ if (!page || PageActive(page))
+ return 0;
+
+ return page_refcnt(page);
+}
+
+/*
+ * Find past-the-end index of the segment at @index.
+ */
+static pgoff_t find_segtail(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ pgoff_t ra_index;
+
+ cond_resched();
+ read_lock_irq(&mapping->tree_lock);
+ ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+ read_unlock_irq(&mapping->tree_lock);
+
+ if (ra_index <= index + max_scan)
+ return ra_index;
+ else
+ return 0;
+}
+
+/*
+ * Find past-the-end index of the segment before @index.
+ */
+static pgoff_t find_segtail_backward(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct page *page;
+ pgoff_t origin;
+
+ origin = index;
+ if (max_scan > index)
+ max_scan = index;
+
+ cond_resched();
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (; origin - index < max_scan;) {
+ page = radix_tree_cache_lookup(&mapping->page_tree,
+ &cache, --index);
+ if (page) {
+ read_unlock_irq(&mapping->tree_lock);
+ return index + 1;
+ }
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ return 0;
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t last_index)
+{
+ struct page *page;
+ int size = last_index - first_index + 1;
+ int count = 0;
+ int i;
+
+ cond_resched();
+ read_lock_irq(&mapping->tree_lock);
+
+ /*
+ * The first page may well be the chunk head and have been accessed,
+ * so including index 0 in the samples makes the estimation optimistic. This
+ * behavior guarantees a readahead when (size < ra_max) and
+ * (readahead_hit_rate >= 16).
+ */
+ for (i = 0; i < 16;) {
+ page = __find_page(mapping, first_index +
+ size * ((i++ * 29) & 15) / 16);
+ if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+ break;
+ }
+
+ read_unlock_irq(&mapping->tree_lock);
+
+ return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static unsigned long query_page_cache_segment(struct address_space *mapping,
+ struct file_ra_state *ra,
+ unsigned long *remain, pgoff_t offset,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ pgoff_t index;
+ unsigned long count;
+ unsigned long nr_lookback;
+ struct radix_tree_cache cache;
+
+ /*
+ * Scan backward and check the near @ra_max pages.
+ * The count here determines ra_size.
+ */
+ cond_resched();
+ read_lock_irq(&mapping->tree_lock);
+ index = radix_tree_scan_hole_backward(&mapping->page_tree,
+ offset, ra_max);
+ read_unlock_irq(&mapping->tree_lock);
+
+ *remain = offset - index;
+
+ if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
+ count = *remain;
+ else if (count_cache_hit(mapping, index + 1, offset) *
+ readahead_hit_rate >= *remain)
+ count = *remain;
+ else
+ count = ra_min;
+
+ /*
+ * Unnecessary to count more?
+ */
+ if (count < ra_max)
+ goto out;
+
+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+ goto out;
+
+ /*
+ * Check the far pages coarsely.
+ * The enlarged count here helps increase la_size.
+ */
+ nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+ 100 / (readahead_ratio | 1);
+
+ cond_resched();
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (count += ra_max; count < nr_lookback; count += ra_max) {
+ struct radix_tree_node *node;
+ node = radix_tree_cache_lookup_parent(&mapping->page_tree,
+ &cache, offset - count, 1);
+ if (!node)
+ break;
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+out:
+ /*
+ * For sequential read that extends from index 0, the counted value
+ * may well be far under the true threshold, so return it unmodified
+ * for further processing in adjust_rala_aggressive().
+ */
+ if (count >= offset)
+ count = offset;
+ else
+ count = max(ra_min, count * readahead_ratio / 100);
+
+ ddprintk("query_page_cache_segment: "
+ "ino=%lu, idx=%lu, count=%lu, remain=%lu\n",
+ mapping->host->i_ino, offset, count, *remain);
+
+ return count;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of stateless method is perhaps the slow grow up speed of
+ * ra_size. The logic tries to make up for this in the important case of
+ * sequential reads that extend from start of file. In this case, the ra_size
+ * is not chosen to make the whole next chunk safe (as in normal ones). Only
+ * half of which is safe. The added 'unsafe' half is the look-ahead part. It
+ * is expected to be safeguarded by rescue_pages() when the previous chunks are
+ * lost.
+ */
+static int adjust_rala_aggressive(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ pgoff_t index = *ra_size;
+
+ *ra_size -= min(*ra_size, *la_size);
+ *ra_size = *ra_size * readahead_ratio / 100;
+ *la_size = index * readahead_ratio / 100;
+ *ra_size += *la_size;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ *
+ * RETURN VALUE HINT
+ * 1 @ra contains a valid ra-request, please submit it
+ * 0 no seq-pattern discovered, please try the next method
+ * -1 please don't do _any_ readahead
+ */
+static int
+try_context_based_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct page *prev_page,
+ struct page *page, pgoff_t index,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ pgoff_t ra_index;
+ unsigned long ra_size;
+ unsigned long la_size;
+ unsigned long remain_pages;
+
+ /* Where to start read-ahead?
+ * NFSv3 daemons may process adjacent requests in parallel,
+ * leading to many locally disordered, globally sequential reads.
+ * So do not require nearby history pages to be present or accessed.
+ */
+ if (page) {
+ ra_index = find_segtail(mapping, index, ra_max * 5 / 4);
+ if (!ra_index)
+ return -1;
+ } else if (prev_page || find_page(mapping, index - 1)) {
+ ra_index = index;
+ } else if (readahead_hit_rate > 1) {
+ ra_index = find_segtail_backward(mapping, index,
+ readahead_hit_rate + ra_min);
+ if (!ra_index)
+ return 0;
+ ra_min += 2 * (index - ra_index);
+ index = ra_index; /* pretend the request starts here */
+ } else
+ return 0;
+
+ ra_size = query_page_cache_segment(mapping, ra, &remain_pages,
+ index, ra_min, ra_max);
+
+ la_size = ra_index - index;
+ if (page && remain_pages <= la_size &&
+ remain_pages < index && la_size > 1) {
+ rescue_pages(page, la_size);
+ return -1;
+ }
+
+ if (ra_size == index) {
+ if (!adjust_rala_aggressive(ra_max, &ra_size, &la_size))
+ return -1;
+ ra_set_class(ra, RA_CLASS_CONTEXT_AGGRESSIVE);
+ } else {
+ if (!adjust_rala(ra_max, &ra_size, &la_size))
+ return -1;
+ ra_set_class(ra, RA_CLASS_CONTEXT);
+ }
+
+ ra_set_index(ra, index, ra_index);
+ ra_set_size(ra, ra_size, la_size);
+
+ return 1;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory. Reasonable?
*
* Table of concrete numbers for 4KB page size:
--
* [PATCH 18/33] readahead: initial method - guiding sizes
[not found] ` <20060524111906.245276338@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-initial-sizes.patch --]
[-- Type: text/plain, Size: 2420 bytes --]
Introduce three guiding sizes for the initial readahead method.
- ra_pages0: recommended readahead on start-of-file
- ra_expect_bytes: expected read size on start-of-file
- ra_thrash_bytes: estimated thrashing threshold
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
block/ll_rw_blk.c | 4 +---
include/linux/backing-dev.h | 3 +++
mm/readahead.c | 3 +++
3 files changed, 7 insertions(+), 3 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/backing-dev.h
+++ linux-2.6.17-rc4-mm3/include/linux/backing-dev.h
@@ -24,6 +24,9 @@ typedef int (congested_fn)(void *, int);
struct backing_dev_info {
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
+ unsigned long ra_pages0; /* recommended readahead on start of file */
+ unsigned long ra_expect_bytes; /* expected read size on start of file */
+ unsigned long ra_thrash_bytes; /* thrashing threshold */
unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -122,6 +122,9 @@ EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
.ra_pages = PAGES_KB(VM_MAX_READAHEAD),
+ .ra_pages0 = PAGES_KB(128),
+ .ra_expect_bytes = 1024 * VM_MIN_READAHEAD,
+ .ra_thrash_bytes = 1024 * VM_MIN_READAHEAD,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
--- linux-2.6.17-rc4-mm3.orig/block/ll_rw_blk.c
+++ linux-2.6.17-rc4-mm3/block/ll_rw_blk.c
@@ -249,9 +249,6 @@ void blk_queue_make_request(request_queu
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
q->make_request_fn = mfn;
- q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
- q->backing_dev_info.state = 0;
- q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
blk_queue_hardsect_size(q, 512);
blk_queue_dma_alignment(q, 511);
@@ -1850,6 +1847,7 @@ request_queue_t *blk_alloc_queue_node(gf
q->kobj.ktype = &queue_ktype;
kobject_init(&q->kobj);
+ q->backing_dev_info = default_backing_dev_info;
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
--
* [PATCH 19/33] readahead: initial method - thrashing guard size
[not found] ` <20060524111906.588647885@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-initial-size-thrash.patch --]
[-- Type: text/plain, Size: 1619 bytes --]
backing_dev_info.ra_thrash_bytes is dynamically updated to be a little above
the thrashing safe read-ahead size. It is used in the initial method where
the thrashing threshold for the particular reader is still unknown.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -817,6 +817,22 @@ out:
}
/*
+ * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
+ * read-ahead sizes, which makes it a slightly risky(*) estimate of the
+ * _minimal_ read-ahead thrashing threshold on the device.
+ *
+ * (*) Note that being a bit risky can _help_ overall performance.
+ */
+static inline void update_ra_thrash_bytes(struct backing_dev_info *bdi,
+ unsigned long ra_size)
+{
+ ra_size <<= PAGE_CACHE_SHIFT;
+ bdi->ra_thrash_bytes = (bdi->ra_thrash_bytes < ra_size) ?
+ (ra_size + bdi->ra_thrash_bytes * 1023) / 1024:
+ (ra_size + bdi->ra_thrash_bytes * 7) / 8;
+}
+
+/*
* The node's effective length of inactive_list(s).
*/
static unsigned long node_free_and_cold_pages(void)
@@ -1180,6 +1196,10 @@ state_based_readahead(struct address_spa
if (!adjust_rala(growth_limit, &ra_size, &la_size))
return 0;
+ /* ra_size in its _steady_ state reflects thrashing threshold */
+ if (page && ra_old + ra_old / 8 >= ra_size)
+ update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
ra_set_class(ra, RA_CLASS_STATE);
ra_set_index(ra, index, ra->readahead_index);
ra_set_size(ra, ra_size, la_size);
--
* [PATCH 20/33] readahead: initial method - expected read size
[not found] ` <20060524111907.134685550@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-25 5:34 ` [PATCH 22/33] readahead: initial method Nick Piggin
2006-05-26 17:29 ` [PATCH 20/33] readahead: initial method - expected read size Andrew Morton
0 siblings, 2 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-initial-size-expect.patch --]
[-- Type: text/plain, Size: 3385 bytes --]
backing_dev_info.ra_expect_bytes is dynamically updated to track the expected
read size on start-of-file. It allows the initial readahead to be more
aggressive and hence efficient.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
fs/file_table.c | 7 ++++++
include/linux/mm.h | 1
mm/readahead.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 63 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1032,6 +1032,7 @@ unsigned long page_cache_readahead(struc
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+void fastcall readahead_close(struct file *file);
#ifdef CONFIG_ADAPTIVE_READAHEAD
extern int readahead_ratio;
--- linux-2.6.17-rc4-mm3.orig/fs/file_table.c
+++ linux-2.6.17-rc4-mm3/fs/file_table.c
@@ -12,6 +12,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp_lock.h>
+#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/security.h>
#include <linux/eventpoll.h>
@@ -160,6 +161,12 @@ void fastcall __fput(struct file *file)
might_sleep();
fsnotify_close(file);
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+ if (file->f_ra.flags & RA_FLAG_EOF)
+ readahead_close(file);
+#endif
+
/*
* The function eventpoll_release() should be the first called
* in the file cleanup chain.
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1555,6 +1555,61 @@ static inline void get_readahead_bounds(
PAGES_KB(128)), *ra_max / 2);
}
+/*
+ * When closing a normal readonly file,
+ * - on cache hit: increase `backing_dev_info.ra_expect_bytes' slowly;
+ * - on cache miss: decrease it rapidly.
+ *
+ * The resulting `ra_expect_bytes' answers the question:
+ * How many pages are expected to be read on start-of-file?
+ */
+void fastcall readahead_close(struct file *file)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct address_space *mapping = inode->i_mapping;
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+ unsigned long pos = file->f_pos;
+ unsigned long pgrahit = file->f_ra.cache_hits;
+ unsigned long pgaccess = 1 + pos / PAGE_CACHE_SIZE;
+ unsigned long pgcached = mapping->nrpages;
+
+ if (!pos) /* pread */
+ return;
+
+ if (pgcached > bdi->ra_pages0) /* excessive reads */
+ return;
+
+ if (pgaccess >= pgcached) {
+ if (bdi->ra_expect_bytes < bdi->ra_pages0 * PAGE_CACHE_SIZE)
+ bdi->ra_expect_bytes += pgcached * PAGE_CACHE_SIZE / 8;
+
+ debug_inc(initial_ra_hit);
+ dprintk("initial_ra_hit on file %s size %lluK "
+ "pos %lu by %s(%d)\n",
+ file->f_dentry->d_name.name,
+ i_size_read(inode) / 1024,
+ pos,
+ current->comm, current->pid);
+ } else {
+ unsigned long missed;
+
+ missed = (pgcached - pgaccess) * PAGE_CACHE_SIZE;
+ if (bdi->ra_expect_bytes >= missed / 2)
+ bdi->ra_expect_bytes -= missed / 2;
+
+ debug_inc(initial_ra_miss);
+ dprintk("initial_ra_miss on file %s "
+ "size %lluK cached %luK hit %luK "
+ "pos %lu by %s(%d)\n",
+ file->f_dentry->d_name.name,
+ i_size_read(inode) / 1024,
+ pgcached << (PAGE_CACHE_SHIFT - 10),
+ pgrahit << (PAGE_CACHE_SHIFT - 10),
+ pos,
+ current->comm, current->pid);
+ }
+}
+
#endif /* CONFIG_ADAPTIVE_READAHEAD */
/*
--
* [PATCH 23/33] readahead: backward prefetching method
[not found] ` <20060524111908.569533741@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-26 17:37 ` Nate Diller
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-backward.patch --]
[-- Type: text/plain, Size: 1450 bytes --]
Readahead policy for reading backward.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1574,6 +1574,46 @@ initial_readahead(struct address_space *
}
/*
+ * Backward prefetching.
+ *
+ * No look-ahead and thrashing safety guard: should be unnecessary.
+ */
+static int
+try_read_backward(struct file_ra_state *ra, pgoff_t begin_index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ pgoff_t end_index;
+
+ /* Are we reading backward? */
+ if (begin_index > ra->prev_page)
+ return 0;
+
+ if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD &&
+ ra_has_index(ra, ra->prev_page)) {
+ ra_size += 2 * ra_cache_hit(ra, 0);
+ end_index = ra->la_index;
+ } else {
+ ra_size += ra_size + ra_size * (readahead_hit_rate - 1) / 2;
+ end_index = ra->prev_page;
+ }
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ /* Read traces close enough to be covered by the prefetching? */
+ if (end_index > begin_index + ra_size)
+ return 0;
+
+ begin_index = end_index - ra_size;
+
+ ra_set_class(ra, RA_CLASS_BACKWARD);
+ ra_set_index(ra, begin_index, begin_index);
+ ra_set_size(ra, ra_size, 0);
+
+ return 1;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory. Reasonable?
*
* Table of concrete numbers for 4KB page size:
--
* [PATCH 24/33] readahead: seeking reads method
[not found] ` <20060524111909.147416866@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-onseek.patch --]
[-- Type: text/plain, Size: 1822 bytes --]
Readahead policy on read after seeking.
It tries to detect sequences like:
seek(), 5*read(); seek(), 6*read(); seek(), 4*read(); ...
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 files changed, 43 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1614,6 +1614,49 @@ try_read_backward(struct file_ra_state *
}
/*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ *
+ * i.e. detect the following sequences:
+ * seek(), 5*read(); seek(), 6*read(); seek(), 4*read(); ...
+ *
+ * Databases are known to have this seek-and-read-N-pages pattern.
+ */
+static int
+try_readahead_on_seek(struct file_ra_state *ra, pgoff_t index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ unsigned long hit0 = ra_cache_hit(ra, 0);
+ unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+ unsigned long hit2 = ra_cache_hit(ra, 2);
+ unsigned long hit3 = ra_cache_hit(ra, 3);
+
+ /* There's a previous read-ahead request? */
+ if (!ra_has_index(ra, ra->prev_page))
+ return 0;
+
+ /* The previous read-ahead sequences have similar sizes? */
+ if (!(ra_size < hit1 && hit1 > hit2 / 2 &&
+ hit2 > hit3 / 2 &&
+ hit3 > hit1 / 2))
+ return 0;
+
+ hit1 = max(hit1, hit2);
+
+ /* Follow the same prefetching direction. */
+ if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD)
+ index = ((index > hit1 - ra_size) ? index - hit1 + ra_size : 0);
+
+ ra_size = min(hit1, ra_max);
+
+ ra_set_class(ra, RA_CLASS_SEEK);
+ ra_set_index(ra, index, index);
+ ra_set_size(ra, ra_size, 0);
+
+ return 1;
+}
+
+/*
* ra_min is mainly determined by the size of cache memory. Reasonable?
*
* Table of concrete numbers for 4KB page size:
--
* [PATCH 25/33] readahead: thrashing recovery method
[not found] ` <20060524111909.635589701@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-method-onthrash.patch --]
[-- Type: text/plain, Size: 1746 bytes --]
Readahead policy after thrashing.
It tries to recover gracefully from the thrashing.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 42 ++++++++++++++++++++++++++++++++++++++++++
1 files changed, 42 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1657,6 +1657,48 @@ try_readahead_on_seek(struct file_ra_sta
}
/*
+ * Readahead thrashing recovery.
+ */
+static unsigned long
+thrashing_recovery_readahead(struct address_space *mapping,
+ struct file *filp, struct file_ra_state *ra,
+ pgoff_t index, unsigned long ra_max)
+{
+ unsigned long ra_size;
+
+ if (find_page(mapping, index - 1))
+ ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+ ra->readahead_index - index);
+ ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+ ra->readahead_index - index);
+
+ /*
+ * Some thrashing occurred in (ra_index, la_index], in which case the
+ * old read-ahead chunk is lost soon after the new one is allocated.
+ * Ensure that we recover all needed pages in the old chunk.
+ */
+ if (index < ra->ra_index)
+ ra_size = ra->ra_index - index;
+ else {
+ /* After thrashing, we know the exact thrashing-threshold. */
+ ra_size = ra_cache_hit(ra, 0);
+ update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
+ /* And we'd better be a bit conservative. */
+ ra_size = ra_size * 3 / 4;
+ }
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ ra_set_class(ra, RA_CLASS_THRASHING);
+ ra_set_index(ra, index, index);
+ ra_set_size(ra, ra_size, ra_size / LOOKAHEAD_RATIO);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+/*
* ra_min is mainly determined by the size of cache memory. Reasonable?
*
* Table of concrete numbers for 4KB page size:
--
* [PATCH 26/33] readahead: call scheme
[not found] ` <20060524111910.207894375@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 9456 bytes --]
The read-ahead logic is called when the reading hits
- a look-ahead mark;
- a non-present page.
ra.prev_page should be properly set up on entrance, and readahead_cache_hit()
should be called on every page reference to maintain the cache_hits counter.
This call scheme achieves the following goals:
- makes all stateful/stateless methods happy;
- eliminates the cache hit problem naturally;
- lives in harmony with application managed read-aheads via
fadvise/madvise.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 6 ++
mm/filemap.c | 51 ++++++++++++++++-
mm/readahead.c | 152 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 205 insertions(+), 4 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1033,6 +1033,12 @@ void handle_ra_miss(struct address_space
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
void fastcall readahead_close(struct file *file);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ pgoff_t first_index, pgoff_t index, pgoff_t last_index);
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page);
#ifdef CONFIG_ADAPTIVE_READAHEAD
extern int readahead_ratio;
--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -847,14 +847,32 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- if (index == next_index)
+
+ if (!prefer_adaptive_readahead() && index == next_index)
next_index = page_cache_readahead(mapping, &ra, filp,
index, last_index - index);
find_page:
page = find_get_page(mapping, index);
+ if (prefer_adaptive_readahead()) {
+ if (unlikely(page == NULL)) {
+ ra.prev_page = prev_index;
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, NULL,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ page = find_get_page(mapping, index);
+ } else if (PageReadahead(page)) {
+ ra.prev_page = prev_index;
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, page,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ }
+ }
if (unlikely(page == NULL)) {
- handle_ra_miss(mapping, &ra, index);
+ if (!prefer_adaptive_readahead())
+ handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
@@ -862,6 +880,9 @@ find_page:
page_cache_release(prev_page);
prev_page = page;
+ if (prefer_adaptive_readahead())
+ readahead_cache_hit(&ra, page);
+
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -1005,6 +1026,8 @@ no_cached_page:
out:
*_ra = ra;
+ if (prefer_adaptive_readahead())
+ _ra->prev_page = prev_index;
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (cached_page)
@@ -1290,6 +1313,7 @@ struct page *filemap_nopage(struct vm_ar
unsigned long size, pgoff;
int did_readaround = 0, majmin = VM_FAULT_MINOR;
+ ra->flags |= RA_FLAG_MMAP;
pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
retry_all:
@@ -1307,19 +1331,33 @@ retry_all:
*
* For sequential accesses, we use the generic readahead logic.
*/
- if (VM_SequentialReadHint(area))
+ if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
page_cache_readahead(mapping, ra, file, pgoff, 1);
+
/*
* Do we have something in the page cache already?
*/
retry_find:
page = find_get_page(mapping, pgoff);
+ if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+ if (!page) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, NULL,
+ pgoff, pgoff, pgoff + 1);
+ page = find_get_page(mapping, pgoff);
+ } else if (PageReadahead(page)) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, page,
+ pgoff, pgoff, pgoff + 1);
+ }
+ }
if (!page) {
unsigned long ra_pages;
if (VM_SequentialReadHint(area)) {
- handle_ra_miss(mapping, ra, pgoff);
+ if (!prefer_adaptive_readahead())
+ handle_ra_miss(mapping, ra, pgoff);
goto no_cached_page;
}
ra->mmap_miss++;
@@ -1356,6 +1394,9 @@ retry_find:
if (!did_readaround)
ra->mmap_hit++;
+ if (prefer_adaptive_readahead())
+ readahead_cache_hit(ra, page);
+
/*
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
@@ -1370,6 +1411,8 @@ success:
mark_page_accessed(page);
if (type)
*type = majmin;
+ if (prefer_adaptive_readahead())
+ ra->prev_page = page->index;
return page;
outside_data_content:
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1717,6 +1717,158 @@ static inline void get_readahead_bounds(
PAGES_KB(128)), *ra_max / 2);
}
+/**
+ * page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
+ * @mapping, @ra, @filp: the same as page_cache_readahead()
+ * @prev_page: the page at @index-1, may be NULL to let the function find it
+ * @page: the page at @index, or NULL if non-present
+ * @begin_index, @index, @end_index: offsets into @mapping
+ * [@begin_index, @end_index) is the read the caller is performing
+ * @index indicates the page to be read now
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * The caller is expected to point ra->prev_page to the previously accessed
+ * page, and to call it on two conditions:
+ * 1. @page == NULL
+ * A cache miss happened, some pages have to be read in
+ * 2. @page != NULL && PageReadahead(@page)
+ * A look-ahead mark encountered, this is set by a previous read-ahead
+ * invocation to instruct the caller to give the function a chance to
+ * check up and do next read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ pgoff_t begin_index, pgoff_t index, pgoff_t end_index)
+{
+ unsigned long size;
+ unsigned long ra_min;
+ unsigned long ra_max;
+ int ret;
+
+ might_sleep();
+
+ if (page) {
+ if (!TestClearPageReadahead(page))
+ return 0;
+ if (bdi_read_congested(mapping->backing_dev_info)) {
+ ra_account(ra, RA_EVENT_IO_CONGESTION,
+ end_index - index);
+ return 0;
+ }
+ }
+
+ if (page)
+ ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+ ra->readahead_index - ra->lookahead_index);
+ else if (index)
+ ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+ size = end_index - index;
+ get_readahead_bounds(ra, &ra_min, &ra_max);
+
+ /* readahead disabled? */
+ if (unlikely(!ra_max || !readahead_ratio)) {
+ size = max_sane_readahead(size);
+ goto readit;
+ }
+
+ /*
+ * Start of file.
+ */
+ if (index == 0)
+ return initial_readahead(mapping, filp, ra, size);
+
+ /*
+ * State based sequential read-ahead.
+ */
+ if (!debug_option(disable_stateful_method) &&
+ index == ra->lookahead_index && ra_cache_hit_ok(ra))
+ return state_based_readahead(mapping, filp, ra, page,
+ index, size, ra_max);
+
+ /*
+ * Recover from possible thrashing.
+ */
+ if (!page && index == ra->prev_page + 1 && ra_has_index(ra, index))
+ return thrashing_recovery_readahead(mapping, filp, ra,
+ index, ra_max);
+
+ /*
+ * Backward read-ahead.
+ */
+ if (!page && begin_index == index &&
+ try_read_backward(ra, index, size, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Context based sequential read-ahead.
+ */
+ ret = try_context_based_readahead(mapping, ra, prev_page, page,
+ index, ra_min, ra_max);
+ if (ret > 0)
+ return ra_dispatch(ra, mapping, filp);
+ if (ret < 0)
+ return 0;
+
+ /* No action on look ahead time? */
+ if (page) {
+ ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+ ra->readahead_index - index);
+ return 0;
+ }
+
+ /*
+ * Random read that follows a sequential one.
+ */
+ if (try_readahead_on_seek(ra, index, size, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Random read.
+ */
+ if (size > ra_max)
+ size = ra_max;
+
+readit:
+ size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+ ra_account(ra, RA_EVENT_RANDOM_READ, size);
+ dprintk("random_read(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+ mapping->host->i_ino, mapping->nrpages,
+ begin_index, index, end_index, size);
+
+ return size;
+}
+
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * readahead_cache_hit() is the feedback route of the adaptive read-ahead
+ * logic. It must be called on every access on the read-ahead pages.
+ */
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+ if (!PageUptodate(page))
+ ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+ if (!ra_has_index(ra, page->index))
+ return;
+
+ ra->cache_hits++;
+
+ if (page->index >= ra->ra_index)
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+ else
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
/*
* When closing a normal readonly file,
* - on cache hit: increase `backing_dev_info.ra_expect_bytes' slowly;
--
* [PATCH 27/33] readahead: laptop mode
[not found] ` <20060524111910.544274094@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-26 17:38 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Bart Samwel
[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3153 bytes --]
When the laptop drive is spun down, defer look-ahead to spin-up time.
The implementation employs a poll-based method, since performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/music. The user space application is responsible for
proper caching to hide the spin-up-and-read delay.
------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:
# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda # this is the max possible value
Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
which needs patching your movie player.
- Be sure to restore them to sane values in normal operations!
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/writeback.h | 6 ++++++
mm/page-writeback.c | 2 +-
mm/readahead.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 37 insertions(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/writeback.h
+++ linux-2.6.17-rc4-mm3/include/linux/writeback.h
@@ -86,6 +86,12 @@ void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(void);
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+ return !timer_pending(&laptop_mode_wb_timer);
+}
+
/* These are exported to sysctl. */
extern int dirty_background_ratio;
extern int vm_dirty_ratio;
--- linux-2.6.17-rc4-mm3.orig/mm/page-writeback.c
+++ linux-2.6.17-rc4-mm3/mm/page-writeback.c
@@ -389,7 +389,7 @@ static void wb_timer_fn(unsigned long un
static void laptop_timer_fn(unsigned long unused);
static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
/*
* Periodic writeback of "old" data.
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -817,6 +817,31 @@ out:
}
/*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static inline int renew_lookahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ pgoff_t index, pgoff_t new_index)
+{
+ struct page *page;
+
+ if (index == ra->lookahead_index &&
+ new_index >= ra->readahead_index)
+ return 1;
+
+ page = find_page(mapping, new_index);
+ if (!page)
+ return 1;
+
+ __SetPageReadahead(page);
+ if (ra->lookahead_index == index)
+ ra->lookahead_index = new_index;
+
+ return 0;
+}
+
+/*
* Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
* read-ahead sizes. Which makes it an a-bit-risky(*) estimation of the
* _minimal_ read-ahead thrashing threshold on the device.
@@ -1760,6 +1785,11 @@ page_cache_readahead_adaptive(struct add
end_index - index);
return 0;
}
+ if (laptop_mode && laptop_spinned_down()) {
+ if (!renew_lookahead(mapping, ra, index,
+ index + LAPTOP_POLL_INTERVAL))
+ return 0;
+ }
}
if (page)
--
* [PATCH 28/33] readahead: loop case
[not found] ` <20060524111911.032100160@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
2006-05-24 14:01 ` Limin Wang
1 sibling, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-loop-case.patch --]
[-- Type: text/plain, Size: 894 bytes --]
Disable look-ahead for loop files.
Loopback files normally contain filesystems, in which case the upper layer
already does proper look-ahead; more look-ahead on the loopback file only
ruins the read-ahead hit rate.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
I'd like to thank Tero Grundström for uncovering the loopback problem.
drivers/block/loop.c | 6 ++++++
1 files changed, 6 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/drivers/block/loop.c
+++ linux-2.6.17-rc4-mm3/drivers/block/loop.c
@@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
mapping = file->f_mapping;
inode = mapping->host;
+ /*
+ * The upper layer should already do proper look-ahead,
+ * one more look-ahead here only ruins the cache hit rate.
+ */
+ file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
if (!(file->f_mode & FMODE_WRITE))
lo_flags |= LO_FLAGS_READ_ONLY;
--
* [PATCH 29/33] readahead: nfsd case
[not found] ` <20060524111911.607080495@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Neil Brown
[-- Attachment #1: readahead-nfsd-case.patch --]
[-- Type: text/plain, Size: 4646 bytes --]
Bypass nfsd raparms cache -- the new logic does not rely on it.
--------------------------------
For the case of NFS service, the new read-ahead logic
+ can handle disordered nfsd requests
+ can handle concurrent sequential requests on large files
with the help of look-ahead
- will have much ado about the concurrent ones on small files
--------------------------------
Notes about the concurrent nfsd requests issue:
nfsd read requests can be out of order, concurrent and with no ra-state info.
They are handled by the context based read-ahead method, which does the job
in the following steps:
1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages to page cache
A single read-ahead chunk on the client side will be disassembled and serviced
by many concurrent nfsd instances on the server side. It is highly likely that
two or more of these parallel nfsd instances will be in step 1/2/3 at the same
time. Without knowing that others are working on the same file region, they
will issue overlapping read-ahead requests, which lead to many conflicts at
step 4.
There is little hope of eliminating the concurrency problem in a general and
efficient way. But experiments show that mounting with tcp,rsize=32768 can
cut down the overhead a lot.
--------------------------------
Here are the benchmark outputs. The test cases cover
- small/big files
- small/big rsize mount option
- serialized/parallel nfsd processing
`serialized' means running the following command to enforce serialized
nfsd request processing:
# for pid in `pidof nfsd`; do taskset -p 1 $pid; done
8 nfsd; local mount with tcp,rsize=8192
=======================================
SERIALIZED, SMALL FILES
readahead_ratio = 0, ra_max = 128kb (old logic, the ra_max is not quite relevant)
96.51s real 11.32s system 3.27s user 160334+2829 cs diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb (new read-ahead logic)
94.88s real 11.53s system 3.20s user 152415+3777 cs diff -r $NFSDIR $NFSDIR2
SERIALIZED, BIG FILES
readahead_ratio = 0, ra_max = 128kb
56.52s real 3.38s system 1.23s user 47930+5256 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
32.54s real 5.71s system 1.38s user 23851+17007 cs diff $NFSFILE $NFSFILE2
PARALLEL, SMALL FILES
readahead_ratio = 0, ra_max = 128kb
99.87s real 11.41s system 3.15s user 173945+9163 cs diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
100.14s real 12.06s system 3.16s user 170865+13406 cs diff -r $NFSDIR $NFSDIR2
PARALLEL, BIG FILES
readahead_ratio = 0, ra_max = 128kb
63.35s real 5.68s system 1.57s user 82594+48747 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
33.87s real 10.17s system 1.55s user 72291+100079 cs diff $NFSFILE $NFSFILE2
8 nfsd; local mount with tcp,rsize=32768
========================================
Note that the normal numbers are now much better, and come close to those of
the serialized runs.
PARALLEL/NORMAL
readahead_ratio = 8, ra_max = 1024kb (old logic)
48.36s real 2.22s system 1.51s user 7209+4110 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb (new logic)
30.04s real 2.46s system 1.33s user 5420+2492 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 8, ra_max = 1024kb
92.99s real 10.32s system 3.23s user 145004+1826 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
90.96s real 10.68s system 3.22s user 144414+2520 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
SERIALIZED
readahead_ratio = 8, ra_max = 1024kb
47.58s real 2.10s system 1.27s user 7933+1357 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
29.46s real 2.41s system 1.38s user 5590+2613 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 8, ra_max = 1024kb
93.02s real 10.67s system 3.25s user 144850+2286 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
91.15s real 11.04s system 3.31s user 144432+2814 cs diff -r $NFSDIR $NFSDIR2 > /dev/null
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
Greg Banks gave valuable recommendations on the test cases, which helped me
get a more complete picture. Thanks!
fs/nfsd/vfs.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/fs/nfsd/vfs.c
+++ linux-2.6.17-rc4-mm3/fs/nfsd/vfs.c
@@ -829,7 +829,10 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
#endif
/* Get readahead parameters */
- ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+ if (prefer_adaptive_readahead())
+ ra = NULL;
+ else
+ ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
if (ra && ra->p_set)
file->f_ra = ra->p_ra;
--
* [PATCH 30/33] readahead: turn on by default
[not found] ` <20060524111912.156646847@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-kconfig-option-default-on.patch --]
[-- Type: text/plain, Size: 570 bytes --]
Enable the adaptive readahead logic by default.
It helps collect more early testers, and is meant to be a -mm only patch.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/Kconfig | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/mm/Kconfig
+++ linux-2.6.17-rc4-mm3/mm/Kconfig
@@ -152,7 +152,7 @@ config MIGRATION
#
config ADAPTIVE_READAHEAD
bool "Adaptive file readahead (EXPERIMENTAL)"
- default n
+ default y
depends on EXPERIMENTAL
help
Readahead is a technique employed by the kernel in an attempt
--
* [PATCH 31/33] readahead: debug radix tree new functions
[not found] ` <20060524111912.485160282@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-debug-radix-tree.patch --]
[-- Type: text/plain, Size: 2040 bytes --]
Do some sanity checks on the newly added radix tree code.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -63,6 +63,8 @@ enum ra_class {
RA_CLASS_COUNT
};
+#define DEBUG_READAHEAD_RADIXTREE
+
/* Read-ahead events to be accounted. */
enum ra_event {
RA_EVENT_CACHE_MISS, /* read cache misses */
@@ -1315,6 +1317,16 @@ static pgoff_t find_segtail(struct addre
cond_resched();
read_lock_irq(&mapping->tree_lock);
ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ BUG_ON(!__find_page(mapping, index));
+ WARN_ON(ra_index < index);
+ if (ra_index != index && !__find_page(mapping, ra_index - 1))
+ printk(KERN_ERR "radix_tree_scan_hole(index=%lu ra_index=%lu "
+ "max_scan=%lu nrpages=%lu) fooled!\n",
+ index, ra_index, max_scan, mapping->nrpages);
+ if (ra_index != ~0UL && ra_index - index < max_scan)
+ WARN_ON(__find_page(mapping, ra_index));
+#endif
read_unlock_irq(&mapping->tree_lock);
if (ra_index <= index + max_scan)
@@ -1407,6 +1419,13 @@ static unsigned long query_page_cache_se
read_lock_irq(&mapping->tree_lock);
index = radix_tree_scan_hole_backward(&mapping->page_tree,
offset, ra_max);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ WARN_ON(index > offset);
+ if (index != offset)
+ WARN_ON(!__find_page(mapping, index + 1));
+ if (index && offset - index < ra_max)
+ WARN_ON(__find_page(mapping, index));
+#endif
read_unlock_irq(&mapping->tree_lock);
*remain = offset - index;
@@ -1442,6 +1461,11 @@ static unsigned long query_page_cache_se
struct radix_tree_node *node;
node = radix_tree_cache_lookup_parent(&mapping->page_tree,
&cache, offset - count, 1);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (node != radix_tree_lookup_parent(&mapping->page_tree,
+ offset - count, 1))
+ BUG();
+#endif
if (!node)
break;
}
--
* [PATCH 32/33] readahead: debug traces showing accessed file names
[not found] ` <20060524111912.967392912@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-debug-traces-file-list.patch --]
[-- Type: text/plain, Size: 1011 bytes --]
Print file names on their first read-ahead, for tracing file access patterns.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 14 ++++++++++++++
1 files changed, 14 insertions(+)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1074,6 +1074,20 @@ static int ra_dispatch(struct file_ra_st
ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
ra_account(ra, RA_EVENT_READAHEAD, actual);
+ if (!ra->ra_index && filp->f_dentry->d_inode) {
+ char *fn;
+ static char path[1024];
+ unsigned long size;
+
+ size = (i_size_read(filp->f_dentry->d_inode)+1023)/1024;
+ fn = d_path(filp->f_dentry, filp->f_vfsmnt, path, 1000);
+ if (!IS_ERR(fn))
+ ddprintk("ino %lu is %s size %luK by %s(%d)\n",
+ filp->f_dentry->d_inode->i_ino,
+ fn, size,
+ current->comm, current->pid);
+ }
+
dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
ra_class_name[ra_class],
mapping->host->i_ino, ra->la_index,
--
* [PATCH 33/33] readahead: debug traces showing read patterns
[not found] ` <20060524111913.603476893@localhost.localdomain>
@ 2006-05-24 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang
[-- Attachment #1: readahead-debug-traces-access-pattern.patch --]
[-- Type: text/plain, Size: 2664 bytes --]
Print all relevant read requests to help discover the access pattern.
If you are experiencing performance problems, or want to help improve
the read-ahead logic, please send me the trace data. Thanks.
- Preparations
# Compile kernel with option CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debug none /debug
- For each session with distinct access pattern
echo > /debug/readahead # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
echo 8 > /debug/readahead/debug_level # show verbose printk traces
# do one benchmark/task
echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/filemap.c | 23 ++++++++++++++++++++++-
1 files changed, 22 insertions(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -45,6 +45,12 @@ static ssize_t
generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
+#ifdef CONFIG_DEBUG_READAHEAD
+extern u32 debug_level;
+#else
+#define debug_level 0
+#endif /* CONFIG_DEBUG_READAHEAD */
+
/*
* Shared mappings implemented 30.11.1994. It's not fully working yet,
* though.
@@ -829,6 +835,10 @@ void do_generic_mapping_read(struct addr
if (!isize)
goto out;
+ if (debug_level >= 5)
+ printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+ inode->i_ino, index, last_index - index);
+
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -883,6 +893,11 @@ find_page:
if (prefer_adaptive_readahead())
readahead_cache_hit(&ra, page);
+ if (debug_level >= 7)
+ printk(KERN_DEBUG "read-page(ino=%lu, idx=%lu, io=%s)\n",
+ inode->i_ino, index,
+ PageUptodate(page) ? "hit" : "miss");
+
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -1334,7 +1349,6 @@ retry_all:
if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
page_cache_readahead(mapping, ra, file, pgoff, 1);
-
/*
* Do we have something in the page cache already?
*/
@@ -1397,6 +1411,13 @@ retry_find:
if (prefer_adaptive_readahead())
readahead_cache_hit(ra, page);
+ if (debug_level >= 6)
+ printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s, io=%s)\n",
+ inode->i_ino, pgoff,
+ VM_RandomReadHint(area) ? "random" :
+ (VM_SequentialReadHint(area) ? "sequential" : "none"),
+ PageUptodate(page) ? "hit" : "miss");
+
/*
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
--
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 04/33] readahead: page flag PG_readahead
[not found] ` <20060524111858.869793445@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 04/33] readahead: page flag PG_readahead Wu Fengguang
@ 2006-05-24 12:27 ` Peter Zijlstra
[not found] ` <20060524123740.GA16304@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 12:27 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
On Wed, 2006-05-24 at 19:12 +0800, Wu Fengguang wrote:
> plain text document attachment
> (readahead-page-flag-PG_readahead.patch)
> A new page flag PG_readahead is introduced as a look-ahead mark, which
> reminds the caller to give the adaptive read-ahead logic a chance to do
> read-ahead ahead of time for I/O pipelining.
>
> It roughly corresponds to `ahead_start' of the stock read-ahead logic.
>
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
>
> include/linux/page-flags.h | 5 +++++
> mm/page_alloc.c | 2 +-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> --- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
> +++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
> @@ -89,6 +89,7 @@
> #define PG_reclaim 17 /* To be reclaimed asap */
> #define PG_nosave_free 18 /* Free, should not be written */
> #define PG_buddy 19 /* Page is free, on buddy lists */
> +#define PG_readahead 20 /* Reminder to do readahead */
>
Page flags are grouped by four; 20 would start a new set.
Also in my tree (git from a few days ago), 20 is taken by PG_uncached.
What code is this patch-set against?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 04/33] readahead: page flag PG_readahead
[not found] ` <20060524123740.GA16304@mail.ustc.edu.cn>
@ 2006-05-24 12:37 ` Wu Fengguang
2006-05-24 12:48 ` Peter Zijlstra
1 sibling, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 12:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel
On Wed, May 24, 2006 at 02:27:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2006-05-24 at 19:12 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (readahead-page-flag-PG_readahead.patch)
> > A new page flag PG_readahead is introduced as a look-ahead mark, which
> > reminds the caller to give the adaptive read-ahead logic a chance to do
> > read-ahead ahead of time for I/O pipelining.
> >
> > It roughly corresponds to `ahead_start' of the stock read-ahead logic.
> >
> > Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> > ---
> >
> > include/linux/page-flags.h | 5 +++++
> > mm/page_alloc.c | 2 +-
> > 2 files changed, 6 insertions(+), 1 deletion(-)
> >
> > --- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
> > +++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
> > @@ -89,6 +89,7 @@
> > #define PG_reclaim 17 /* To be reclaimed asap */
> > #define PG_nosave_free 18 /* Free, should not be written */
> > #define PG_buddy 19 /* Page is free, on buddy lists */
> > +#define PG_readahead 20 /* Reminder to do readahead */
> >
>
> Page flags are grouped by four; 20 would start a new set.
> Also in my tree (git from a few days ago), 20 is taken by PG_uncached.
Thanks, grouped and renumbered it as 21.
> What code is this patch-set against?
It's against the latest -mm tree: linux-2.6.17-rc4-mm3.
Wu
---
Subject: readahead: page flag PG_readahead
A new page flag PG_readahead is introduced as a look-ahead mark, which
reminds the caller to give the adaptive read-ahead logic a chance to do
read-ahead ahead of time for I/O pipelining.
It roughly corresponds to `ahead_start' of the stock read-ahead logic.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/page-flags.h | 5 +++++
mm/page_alloc.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
--- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
+++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
@@ -90,6 +90,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define PG_readahead 21 /* Reminder to do readahead */
+
#if (BITS_PER_LONG > 32)
/*
@@ -372,6 +374,10 @@ extern void __mod_page_state_offset(unsi
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageReadahead(page) test_bit(PG_readahead, &(page)->flags)
+#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
--- linux-2.6.17-rc4-mm3.orig/mm/page_alloc.c
+++ linux-2.6.17-rc4-mm3/mm/page_alloc.c
@@ -564,7 +564,7 @@ static int prep_new_page(struct page *pa
if (PageReserved(page))
return 1;
- page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060524111905.586110688@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 17/33] readahead: context based method Wu Fengguang
@ 2006-05-24 12:37 ` Peter Zijlstra
[not found] ` <20060524133353.GA16508@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 12:37 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
> +#define PAGE_REFCNT_0 0
> +#define PAGE_REFCNT_1 (1 << PG_referenced)
> +#define PAGE_REFCNT_2 (1 << PG_active)
> +#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
> +#define PAGE_REFCNT_MASK PAGE_REFCNT_3
> +
> +/*
> + * STATUS REFERENCE COUNT
> + * __ 0
> + * _R PAGE_REFCNT_1
> + * A_ PAGE_REFCNT_2
> + * AR PAGE_REFCNT_3
> + *
> + * A/R: Active / Referenced
> + */
> +static inline unsigned long page_refcnt(struct page *page)
> +{
> + return page->flags & PAGE_REFCNT_MASK;
> +}
> +
> +/*
> + * STATUS REFERENCE COUNT TYPE
> + * __ 0 fresh
> + * _R PAGE_REFCNT_1 stale
> + * A_ PAGE_REFCNT_2 disturbed once
> + * AR PAGE_REFCNT_3 disturbed twice
> + *
> + * A/R: Active / Referenced
> + */
> +static inline unsigned long cold_page_refcnt(struct page *page)
> +{
> + if (!page || PageActive(page))
> + return 0;
> +
> + return page_refcnt(page);
> +}
> +
Why all of this if all you're ever going to use is cold_page_refcnt?
What about something like this:
static inline int cold_page_referenced(struct page *page)
{
if (!page || PageActive(page))
return 0;
return !!PageReferenced(page);
}
> +
> +/*
> + * Count/estimate cache hits in range [first_index, last_index].
> + * The estimation is simple and optimistic.
> + */
> +static int count_cache_hit(struct address_space *mapping,
> + pgoff_t first_index, pgoff_t last_index)
> +{
> + struct page *page;
> + int size = last_index - first_index + 1;
> + int count = 0;
> + int i;
> +
> + cond_resched();
> + read_lock_irq(&mapping->tree_lock);
> +
> + /*
> > + * The first page may well be the chunk head and have been accessed,
> + * so it is index 0 that makes the estimation optimistic. This
> + * behavior guarantees a readahead when (size < ra_max) and
> + * (readahead_hit_rate >= 16).
> + */
> + for (i = 0; i < 16;) {
> + page = __find_page(mapping, first_index +
> + size * ((i++ * 29) & 15) / 16);
> + if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
cold_page_referenced(page) && ++count >= 2
> + break;
> + }
> +
> + read_unlock_irq(&mapping->tree_lock);
> +
> + return size * count / i;
> +}
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 04/33] readahead: page flag PG_readahead
[not found] ` <20060524123740.GA16304@mail.ustc.edu.cn>
2006-05-24 12:37 ` Wu Fengguang
@ 2006-05-24 12:48 ` Peter Zijlstra
1 sibling, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 12:48 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
On Wed, 2006-05-24 at 20:37 +0800, Wu Fengguang wrote:
> On Wed, May 24, 2006 at 02:27:36PM +0200, Peter Zijlstra wrote:
> > On Wed, 2006-05-24 at 19:12 +0800, Wu Fengguang wrote:
> > > --- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
> > > +++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
> > > @@ -89,6 +89,7 @@
> > > #define PG_reclaim 17 /* To be reclaimed asap */
> > > #define PG_nosave_free 18 /* Free, should not be written */
> > > #define PG_buddy 19 /* Page is free, on buddy lists */
> > > +#define PG_readahead 20 /* Reminder to do readahead */
> > >
> >
> > Page flags are grouped by four; 20 would start a new set.
> > Also in my tree (git from a few days ago), 20 is taken by PG_uncached.
>
> Thanks, grouped and renumbered it as 21.
>
> > What code is this patch-set against?
>
> It's against the latest -mm tree: linux-2.6.17-rc4-mm3.
Ah, now I see, -mm has got a trick up its sleeve for PG_uncached.
20 would indeed be the correct number for -mm. Then my sole comment
would be the grouping, which is a stylistic nit really.
Sorry for the confusion.
Peter
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060524133353.GA16508@mail.ustc.edu.cn>
@ 2006-05-24 13:33 ` Wu Fengguang
2006-05-24 15:53 ` Peter Zijlstra
1 sibling, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 13:33 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel
On Wed, May 24, 2006 at 02:37:48PM +0200, Peter Zijlstra wrote:
> On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
>
> > +#define PAGE_REFCNT_0 0
> > +#define PAGE_REFCNT_1 (1 << PG_referenced)
> > +#define PAGE_REFCNT_2 (1 << PG_active)
> > +#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
> > +#define PAGE_REFCNT_MASK PAGE_REFCNT_3
> > +
> > +/*
> > + * STATUS REFERENCE COUNT
> > + * __ 0
> > + * _R PAGE_REFCNT_1
> > + * A_ PAGE_REFCNT_2
> > + * AR PAGE_REFCNT_3
> > + *
> > + * A/R: Active / Referenced
> > + */
> > +static inline unsigned long page_refcnt(struct page *page)
> > +{
> > + return page->flags & PAGE_REFCNT_MASK;
> > +}
> > +
> > +/*
> > + * STATUS REFERENCE COUNT TYPE
> > + * __ 0 fresh
> > + * _R PAGE_REFCNT_1 stale
> > + * A_ PAGE_REFCNT_2 disturbed once
> > + * AR PAGE_REFCNT_3 disturbed twice
> > + *
> > + * A/R: Active / Referenced
> > + */
> > +static inline unsigned long cold_page_refcnt(struct page *page)
> > +{
> > + if (!page || PageActive(page))
> > + return 0;
> > +
> > + return page_refcnt(page);
> > +}
> > +
>
> Why all of this if all you're ever going to use is cold_page_refcnt.
Well, the two functions have a long history...
There used to be a PG_activate flag which made the two functions quite
different. It was later removed for fear of the behavior changes it
introduced. However, there's still a possibility that someone will
reintroduce similar flags in the future :)
> What about something like this:
>
> static inline int cold_page_referenced(struct page *page)
> {
> if (!page || PageActive(page))
> return 0;
> return !!PageReferenced(page);
> }
Ah, here's another theory: the algorithm uses a reference count
conceptually, so it may be better to retain the current form.
Thanks,
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 28/33] readahead: loop case
[not found] ` <20060524111911.032100160@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 28/33] readahead: loop case Wu Fengguang
@ 2006-05-24 14:01 ` Limin Wang
[not found] ` <20060525154846.GA6907@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Limin Wang @ 2006-05-24 14:01 UTC (permalink / raw)
To: linux-kernel
If the loopback file is bigger than the memory size, it may cause cache
misses again; might it be better to turn on the read-ahead?
Regards,
Limin
* Wu Fengguang <wfg@mail.ustc.edu.cn> [2006-05-24 19:13:14 +0800]:
> Disable look-ahead for loop file.
>
> Loopback files normally contain filesystems, in which case there are already
> proper look-aheads in the upper layer, more look-aheads on the loopback file
> only ruins the read-ahead hit rate.
>
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
>
> I'd like to thank Tero Grundström for uncovering the loopback problem.
>
> drivers/block/loop.c | 6 ++++++
> 1 files changed, 6 insertions(+)
>
> --- linux-2.6.17-rc4-mm3.orig/drivers/block/loop.c
> +++ linux-2.6.17-rc4-mm3/drivers/block/loop.c
> @@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
> mapping = file->f_mapping;
> inode = mapping->host;
>
> + /*
> + * The upper layer should already do proper look-ahead,
> + * one more look-ahead here only ruins the cache hit rate.
> + */
> + file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
> +
> if (!(file->f_mode & FMODE_WRITE))
> lo_flags |= LO_FLAGS_READ_ONLY;
>
>
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060524133353.GA16508@mail.ustc.edu.cn>
2006-05-24 13:33 ` Wu Fengguang
@ 2006-05-24 15:53 ` Peter Zijlstra
[not found] ` <20060525012556.GA6111@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 15:53 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
On Wed, 2006-05-24 at 21:33 +0800, Wu Fengguang wrote:
> On Wed, May 24, 2006 at 02:37:48PM +0200, Peter Zijlstra wrote:
> > On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
> >
> > > +#define PAGE_REFCNT_0 0
> > > +#define PAGE_REFCNT_1 (1 << PG_referenced)
> > > +#define PAGE_REFCNT_2 (1 << PG_active)
> > > +#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
> > > +#define PAGE_REFCNT_MASK PAGE_REFCNT_3
> > > +
> > > +/*
> > > + * STATUS REFERENCE COUNT
> > > + * __ 0
> > > + * _R PAGE_REFCNT_1
> > > + * A_ PAGE_REFCNT_2
> > > + * AR PAGE_REFCNT_3
> > > + *
> > > + * A/R: Active / Referenced
> > > + */
> > > +static inline unsigned long page_refcnt(struct page *page)
> > > +{
> > > + return page->flags & PAGE_REFCNT_MASK;
> > > +}
> > > +
> > > +/*
> > > + * STATUS REFERENCE COUNT TYPE
> > > + * __ 0 fresh
> > > + * _R PAGE_REFCNT_1 stale
> > > + * A_ PAGE_REFCNT_2 disturbed once
> > > + * AR PAGE_REFCNT_3 disturbed twice
> > > + *
> > > + * A/R: Active / Referenced
> > > + */
> > > +static inline unsigned long cold_page_refcnt(struct page *page)
> > > +{
> > > + if (!page || PageActive(page))
> > > + return 0;
> > > +
> > > + return page_refcnt(page);
> > > +}
> > > +
> >
> > Why all of this if all you're ever going to use is cold_page_refcnt.
>
> Well, the two functions have a long history...
>
> There has been a PG_activate which makes the two functions quite
> different. It was later removed for fear of the behavior changes it
> introduced. However, there's still possibility that someone
> reintroduce similar flags in the future :)
>
> > What about something like this:
> >
> > static inline int cold_page_referenced(struct page *page)
> > {
> > if (!page || PageActive(page))
> > return 0;
> > return !!PageReferenced(page);
> > }
>
> Ah, here's another theory: the algorithm uses reference count
> conceptually, so it may be better to retain the current form.
Reference count of what, exactly? If you were to say of the page, I'd
have expected only the first function, page_refcnt().
What I don't exactly understand is why you specialise to the inactive
list. Why do you need that?
The reason I'm asking is that when I merge this with my page replacement
work, I need to find a generalised concept. cold_page_refcnt() would
come to mean something like: the number of references for those pages that
are direct reclaim candidates. And honestly, that doesn't make a lot of
sense.
If you could explain the concept behind this, I'd be grateful.
Peter
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060525012556.GA6111@mail.ustc.edu.cn>
@ 2006-05-25 1:25 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 1:25 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel
On Wed, May 24, 2006 at 05:53:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2006-05-24 at 21:33 +0800, Wu Fengguang wrote:
> > On Wed, May 24, 2006 at 02:37:48PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
> > >
> > > > +#define PAGE_REFCNT_0 0
> > > > +#define PAGE_REFCNT_1 (1 << PG_referenced)
> > > > +#define PAGE_REFCNT_2 (1 << PG_active)
> > > > +#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
> > > > +#define PAGE_REFCNT_MASK PAGE_REFCNT_3
> > > > +
> > > > +/*
> > > > + * STATUS REFERENCE COUNT
> > > > + * __ 0
> > > > + * _R PAGE_REFCNT_1
> > > > + * A_ PAGE_REFCNT_2
> > > > + * AR PAGE_REFCNT_3
> > > > + *
> > > > + * A/R: Active / Referenced
> > > > + */
> > > > +static inline unsigned long page_refcnt(struct page *page)
> > > > +{
> > > > + return page->flags & PAGE_REFCNT_MASK;
> > > > +}
> > > > +
> > > > +/*
> > > > + * STATUS REFERENCE COUNT TYPE
> > > > + * __ 0 fresh
> > > > + * _R PAGE_REFCNT_1 stale
> > > > + * A_ PAGE_REFCNT_2 disturbed once
> > > > + * AR PAGE_REFCNT_3 disturbed twice
> > > > + *
> > > > + * A/R: Active / Referenced
> > > > + */
> > > > +static inline unsigned long cold_page_refcnt(struct page *page)
> > > > +{
> > > > + if (!page || PageActive(page))
> > > > + return 0;
> > > > +
> > > > + return page_refcnt(page);
> > > > +}
> > > > +
> > >
> > > Why all of this if all you're ever going to use is cold_page_refcnt.
> >
> > Well, the two functions have a long history...
> >
> > There has been a PG_activate which makes the two functions quite
> > different. It was later removed for fear of the behavior changes it
> > introduced. However, there's still possibility that someone
> > reintroduce similar flags in the future :)
> >
> > > What about something like this:
> > >
> > > static inline int cold_page_referenced(struct page *page)
> > > {
> > > if (!page || PageActive(page))
> > > return 0;
> > > return !!PageReferenced(page);
> > > }
> >
> > Ah, here's another theory: the algorithm uses reference count
> > conceptually, so it may be better to retain the current form.
>
> Reference count of what exactly, if you were to say of the page, I'd
> have expected only the first function, page_refcnt().
>
> What I don't exactly understand is why you specialise to the inactive
> list. Why do you need that?
>
> The reason I'm asking is that when I merge this with my page replacement
> work, I need to find a generalised concept. cold_page_refcnt() would
> become to mean something like: number of references for those pages that
> are direct reclaim candidates. And honestly, that doesn't make a lot of
> sense.
>
> If you could explain the concept behind this, I'd be grateful.
Good question, and sorry for not mentioning this earlier...
There are some background info here:
[DISTURBS] section of
http://marc.theaimsgroup.com/?l=linux-kernel&m=112678976802381&w=2
[DELAYED ACTIVATION] section of
http://marc.theaimsgroup.com/?l=linux-kernel&m=112679176611006&w=2
It involves a tricky situation where there are two sequential readers
that come close enough together that the follower retouches the pages
visited by the leader:
chunk 1 chunk 2 chunk 3
========== =============------- --------------------
follower ^ leader ^
It is all OK if the revisited pages still stay in the inactive list;
these pages will act as a measurement of len(inactive list)/speed(leader).
But if the revisited pages (marked by '=') are sent to the active list
immediately, the measurement will no longer be as accurate. The trace
is 'disturbed'. In this case, using page_refcnt() can be aggressive
and unsafe against thrashing, while cold_page_refcnt() is conservative.
So either one of page_refcnt()/cold_page_refcnt() should be OK, as
long as we know the consequence of this situation. After all, it is
really uncommon to see many invocations of the context based method,
and even rarer for this kind of situation to happen.
Regards,
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 12/33] readahead: min/max sizes
2006-05-24 11:12 ` [PATCH 11/33] readahead: sysctl parameters Wu Fengguang
@ 2006-05-25 4:50 ` Nick Piggin
[not found] ` <20060525121206.GI4996@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25 4:50 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
Wu Fengguang wrote:
>- Enlarge VM_MAX_READAHEAD to 1024 if new read-ahead code is compiled in.
> This value is no longer tightly coupled with, and therefore no longer
> constrained by, the thrashing problem. The adaptive read-ahead logic merely takes
> it as an upper bound, and will not stick to it under memory pressure.
>
I guess this size enlargement is one of the main reasons your
patchset improves performance in some cases.
There is currently some sort of thrashing protection in there.
Obviously you've found it to be unable to cope with some situations
and introduced a lot of really fancy stuff to fix it. Are these just
academic access patterns, or do you have real test cases that
demonstrate this failure (ie. can we try to incrementally improve
the current logic as well as work towards merging your readahead
rewrite?)
--
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 10/33] readahead: support functions
2006-05-24 11:12 ` [PATCH 10/33] readahead: support functions Wu Fengguang
@ 2006-05-25 5:13 ` Nick Piggin
[not found] ` <20060525111318.GH4996@mail.ustc.edu.cn>
2006-05-25 16:48 ` Andrew Morton
1 sibling, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25 5:13 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
Wu Fengguang wrote:
>+#ifdef CONFIG_ADAPTIVE_READAHEAD
>+
>+/*
>+ * The nature of read-ahead allows false tests to occur occasionally.
>+ * Here we just do not bother to call get_page(), it's meaningless anyway.
>+ */
>+static inline struct page *__find_page(struct address_space *mapping,
>+ pgoff_t offset)
>+{
>+ return radix_tree_lookup(&mapping->page_tree, offset);
>+}
>+
>+static inline struct page *find_page(struct address_space *mapping,
>+ pgoff_t offset)
>+{
>+ struct page *page;
>+
>+ read_lock_irq(&mapping->tree_lock);
>+ page = __find_page(mapping, offset);
>+ read_unlock_irq(&mapping->tree_lock);
>+ return page;
>+}
>
>
Meh, this is just open-coded elsewhere in readahead.c; I'd either
open code it, or do a new patch to replace the existing callers.
find_page should be in mm/filemap.c, btw (or include/linux/pagemap.h).
>+
>+/*
>+ * Move pages in danger (of thrashing) to the head of inactive_list.
>+ * Not expected to happen frequently.
>+ */
>+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
>
>
Should probably be in mm/vmscan.c
--
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
2006-05-24 11:13 ` [PATCH 17/33] readahead: context based method Wu Fengguang
@ 2006-05-25 5:26 ` Nick Piggin
[not found] ` <20060525080308.GB4996@mail.ustc.edu.cn>
2006-05-26 17:23 ` Andrew Morton
2006-05-26 17:27 ` Andrew Morton
2 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25 5:26 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
Wu Fengguang wrote:
>
>+/*
>+ * Look back and check history pages to estimate thrashing-threshold.
>+ */
>+static unsigned long query_page_cache_segment(struct address_space *mapping,
>+ struct file_ra_state *ra,
>+ unsigned long *remain, pgoff_t offset,
>+ unsigned long ra_min, unsigned long ra_max)
>+{
>+ pgoff_t index;
>+ unsigned long count;
>+ unsigned long nr_lookback;
>+ struct radix_tree_cache cache;
>+
>+ /*
>+ * Scan backward and check the near @ra_max pages.
>+ * The count here determines ra_size.
>+ */
>+ cond_resched();
>+ read_lock_irq(&mapping->tree_lock);
>+ index = radix_tree_scan_hole_backward(&mapping->page_tree,
>+ offset, ra_max);
>+ read_unlock_irq(&mapping->tree_lock);
>
Why do you drop this lock just to pick it up again a few instructions
down the line? (is ra_cache_hit_ok or count_cache_hit very big or
unable to be called without the lock?)
>+
>+ *remain = offset - index;
>+
>+ if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
>+ count = *remain;
>+ else if (count_cache_hit(mapping, index + 1, offset) *
>+ readahead_hit_rate >= *remain)
>+ count = *remain;
>+ else
>+ count = ra_min;
>+
>+ /*
>+ * Unnecessary to count more?
>+ */
>+ if (count < ra_max)
>+ goto out;
>+
>+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
>+ goto out;
>+
>+ /*
>+ * Check the far pages coarsely.
>+ * The enlarged count here helps increase la_size.
>+ */
>+ nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
>+ 100 / (readahead_ratio | 1);
>+
>+ cond_resched();
>+ radix_tree_cache_init(&cache);
>+ read_lock_irq(&mapping->tree_lock);
>+ for (count += ra_max; count < nr_lookback; count += ra_max) {
>+ struct radix_tree_node *node;
>+ node = radix_tree_cache_lookup_parent(&mapping->page_tree,
>+ &cache, offset - count, 1);
>+ if (!node)
>+ break;
>+ }
>+ read_unlock_irq(&mapping->tree_lock);
>
Yuck. Apart from not being commented, this depends on the internal
implementation of the radix tree. This should just be packaged up in some
radix-tree function to do exactly what you want (eg. is there a hole of
N contiguous pages).
And then again you can be rid of the radix-tree cache.
Yes, it increasingly appears that you're using the cache because you're
using the wrong abstractions. Eg. this is basically half implementing
some data-structure internal detail.
--
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 22/33] readahead: initial method
2006-05-24 11:13 ` [PATCH 20/33] readahead: initial method - expected read size Wu Fengguang
@ 2006-05-25 5:34 ` Nick Piggin
[not found] ` <20060525085957.GC4996@mail.ustc.edu.cn>
2006-05-26 17:29 ` [PATCH 20/33] readahead: initial method - expected read size Andrew Morton
1 sibling, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25 5:34 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
BTW. while your patchset might be nicely broken down, I think your
naming and descriptions are letting it down a little bit.
Wu Fengguang wrote:
>Aggressive readahead policy for read on start-of-file.
>
>Instead of selecting a conservative readahead size,
>it tries to do large readahead in the first place.
>
>However we have to watch on two cases:
> - do not ruin the hit rate for file-head-checkers
> - do not lead to thrashing for memory tight systems
>
>
How does it handle this case:
- don't needlessly read ahead too much if the file is already in cache
Would the current readahead mechanism benefit from more aggressive
start-of-file readahead?
--
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 08/33] readahead: common macros
2006-05-24 11:12 ` [PATCH 08/33] readahead: common macros Wu Fengguang
@ 2006-05-25 5:56 ` Nick Piggin
[not found] ` <20060525104117.GE4996@mail.ustc.edu.cn>
[not found] ` <20060525134224.GJ4996@mail.ustc.edu.cn>
2006-05-25 16:33 ` Andrew Morton
1 sibling, 2 replies; 107+ messages in thread
From: Nick Piggin @ 2006-05-25 5:56 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
Wu Fengguang wrote:
>Define some common used macros for the read-ahead logics.
>
>Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
>---
>
> mm/readahead.c | 14 ++++++++++++--
> 1 files changed, 12 insertions(+), 2 deletions(-)
>
>--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
>+++ linux-2.6.17-rc4-mm3/mm/readahead.c
>@@ -5,6 +5,8 @@
> *
> * 09Apr2002 akpm@zip.com.au
> * Initial version.
>+ * 21May2006 Wu Fengguang <wfg@mail.ustc.edu.cn>
>+ * Adaptive read-ahead framework.
> */
>
> #include <linux/kernel.h>
>@@ -14,6 +16,14 @@
> #include <linux/blkdev.h>
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
>+#include <linux/writeback.h>
>+#include <linux/nfsd/const.h>
>
How come you're adding these includes?
>+
>+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
>+#define PAGES_KB(size) PAGES_BYTE((size)*1024)
>
Don't really like the names. Don't think they do anything for clarity, but
if you can come up with something better for PAGES_BYTE I might change my
mind ;) (just forget about PAGES_KB - people know what *1024 means)
Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
It is saying nothing about minimum, so presumably 0 is the correct choice.
>+
>+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
>+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
>
Again, it is probably easier just to use the expanded version. Then the
reader can immediately say: ah, the next page on the LRU list (rather
than, maybe, the next page in the pagecache).
>
> void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
> {
>@@ -21,7 +31,7 @@ void default_unplug_io_fn(struct backing
> EXPORT_SYMBOL(default_unplug_io_fn);
>
> struct backing_dev_info default_backing_dev_info = {
>- .ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
>+ .ra_pages = PAGES_KB(VM_MAX_READAHEAD),
> .state = 0,
> .capabilities = BDI_CAP_MAP_COPY,
> .unplug_io_fn = default_unplug_io_fn,
>@@ -50,7 +60,7 @@ static inline unsigned long get_max_read
>
> static inline unsigned long get_min_readahead(struct file_ra_state *ra)
> {
>- return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
>+ return PAGES_KB(VM_MIN_READAHEAD);
> }
>
> static inline void reset_ahead_window(struct file_ra_state *ra)
>
>--
>
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 14/33] readahead: state based method - data structure
2006-05-24 11:13 ` [PATCH 14/33] readahead: state based method - data structure Wu Fengguang
@ 2006-05-25 6:03 ` Nick Piggin
[not found] ` <20060525104353.GF4996@mail.ustc.edu.cn>
2006-05-26 17:05 ` Andrew Morton
1 sibling, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25 6:03 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
Wu Fengguang wrote:
>Extend struct file_ra_state to support the adaptive read-ahead logic.
>
Another nitpick: It is usually OK to do these things in the same patch
that actually uses the new data (or functions -- eg. patch 15).
If the addition is complex or in a completely different subsystem
(eg. your rescue_pages function), _that_ can justify it being split
into its own patch. Then you might also prepend the subject with mm:
and cc linux-mm to get better reviews.
--
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060525080308.GB4996@mail.ustc.edu.cn>
@ 2006-05-25 8:03 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 8:03 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 03:26:00PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
> >+ cond_resched();
> >+ read_lock_irq(&mapping->tree_lock);
> >+ index = radix_tree_scan_hole_backward(&mapping->page_tree,
> >+ offset, ra_max);
> >+ read_unlock_irq(&mapping->tree_lock);
> >
>
> Why do you drop this lock just to pick it up again a few instructions
> down the line? (is ra_cache_hit_ok or count_cache_hit very big or
> unable to be called without the lock?)
Nice catch, will fix it.
> >+
> >+ *remain = offset - index;
> >+
> >+ if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
> >+ count = *remain;
> >+ else if (count_cache_hit(mapping, index + 1, offset) *
> >+ readahead_hit_rate >= *remain)
> >+ count = *remain;
> >+ else
> >+ count = ra_min;
> >+
> >+ /*
> >+ * Unnecessary to count more?
> >+ */
> >+ if (count < ra_max)
> >+ goto out;
> >+
> >+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
> >+ goto out;
> >+
> >+ /*
> >+ * Check the far pages coarsely.
> >+ * The enlarged count here helps increase la_size.
> >+ */
> >+ nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
> >+ 100 / (readahead_ratio | 1);
> >+
> >+ cond_resched();
> >+ radix_tree_cache_init(&cache);
> >+ read_lock_irq(&mapping->tree_lock);
> >+ for (count += ra_max; count < nr_lookback; count += ra_max) {
> >+ struct radix_tree_node *node;
> >+ node = radix_tree_cache_lookup_parent(&mapping->page_tree,
> >+ &cache, offset - count, 1);
> >+ if (!node)
> >+ break;
> >+ }
> >+ read_unlock_irq(&mapping->tree_lock);
> >
>
> Yuck. Apart from not being commented, this depends on internal
> implementation of radix-tree. This should just be packaged up in some
> radix-tree function to do exactly what you want (eg. is there a hole of
> N contiguous pages).
Yes, it is ugly.
Maybe we can make it a function named radix_tree_scan_hole_coarse().
> And then again you can be rid of the radix-tree cache.
>
> Yes, it increasingly appears that you're using the cache because you're
> using the wrong abstractions. Eg. this is basically half implementing
> some data-structure internal detail.
Sorry for not being aware of this problem :)
Wu
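Nick's suggested abstraction (a radix-tree primitive that directly answers "is there a hole within the last N pages?") can be modeled in user space with a plain presence array. This is only an illustrative sketch of the intended semantics; the names `scan_hole_backward` and `demo_scan` are hypothetical, and the real kernel function would walk radix-tree nodes under mapping->tree_lock:

```c
#include <assert.h>

/*
 * Toy model of radix_tree_scan_hole_backward(): starting just below
 * @offset, walk backward over at most @max_scan slots and return the
 * index of the first absent page (the "hole").  If every scanned slot
 * is present, the loop runs off the scanned window and the caller
 * knows there is no hole within @max_scan pages.
 */
static long scan_hole_backward(const unsigned char *present,
			       long offset, long max_scan)
{
	long i;

	for (i = offset - 1; i >= 0 && offset - 1 - i < max_scan; i--)
		if (!present[i])
			return i;	/* first hole below @offset */
	return i;			/* no hole within the window */
}

/* Cache pages 3..9 and probe backward from offset 10. */
static long demo_scan(void)
{
	unsigned char present[16] = { 0 };
	long i;

	for (i = 3; i <= 9; i++)
		present[i] = 1;
	return scan_hole_backward(present, 10, 16);
}
```

With pages 3..9 cached, the scan from offset 10 stops at the hole at index 2, so the caller computes remain = offset - index = 8, mirroring the `*remain = offset - index;` step in the quoted patch.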
* Re: [PATCH 22/33] readahead: initial method
[not found] ` <20060525085957.GC4996@mail.ustc.edu.cn>
@ 2006-05-25 8:59 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 8:59 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 03:34:30PM +1000, Nick Piggin wrote:
> BTW. while your patchset might be nicely broken down, I think your
> naming and descriptions are letting it down a little bit.
:) Maybe more practice will help.
> Wu Fengguang wrote:
>
> >Aggressive readahead policy for read on start-of-file.
> >
> >Instead of selecting a conservative readahead size,
> >it tries to do large readahead in the first place.
> >
> >However we have to watch on two cases:
> > - do not ruin the hit rate for file-head-checkers
> > - do not lead to thrashing for memory tight systems
> >
> >
>
> How does it handle
> - don't needlessly readahead too much if the file is in cache
It is prevented by the calling scheme.
The adaptive readahead logic will only be called on:
- reading a non-cached page
  So readahead will be started/stopped on demand.
- reading a PG_readahead-marked page
  Since the PG_readahead mark is only set on fresh new pages in
  __do_page_cache_readahead(), readahead will automatically cease on
  cache hit.
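That calling scheme can be illustrated roughly as follows (all names here are made up for the sketch; the real hooks live in the kernel's page-cache read path). Readahead is consulted only on a cache miss or on a page carrying the look-ahead mark:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the real page cache state. */
struct demo_page {
	bool cached;		/* page already in the page cache? */
	bool pg_readahead;	/* look-ahead mark, set on fresh pages only */
};

static int ra_calls;		/* how often the adaptive logic ran */

static void adaptive_readahead(struct demo_page *page)
{
	ra_calls++;
	page->pg_readahead = false;	/* mark is consumed when it fires */
}

/* Sketch of the read path: consult readahead only on a miss or mark. */
static void do_demo_read(struct demo_page *page)
{
	if (!page->cached || page->pg_readahead)
		adaptive_readahead(page);
	page->cached = true;		/* I/O brings the page into cache */
}
```

Re-reading an already cached, unmarked page any number of times leaves ra_calls unchanged, which is how readahead automatically ceases on cache hit.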
>
> Would the current readahead mechanism benefit from more aggressive
> start-of-file
> readahead?
It will have the same benefits (and drawbacks).
[QUOTE FROM ANOTHER MAIL]
> can we try to incrementally improve the current logic as well as work
> towards merging your readahead rewrite?
The current readahead is left untouched on purpose.
If I understand it right, its simplicity is a great virtue. And it is
hard to improve it without losing this virtue or disturbing old users.
The new framework, meanwhile, provides an ideal testbed for fancy new
things. We can do experimental things without drawing complaints (at
least before it stabilizes after a year or so), and then port proven
features back to the current logic.
Wu
* Re: [PATCH 08/33] readahead: common macros
[not found] ` <20060525104117.GE4996@mail.ustc.edu.cn>
@ 2006-05-25 10:41 ` Wu Fengguang
2006-05-26 3:33 ` Nick Piggin
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 10:41 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
> >+#include <linux/writeback.h>
> >+#include <linux/nfsd/const.h>
> >
>
> How come you're adding these includes?
For something added in the past and removed later...
> >+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> >+#define PAGES_KB(size) PAGES_BYTE((size)*1024)
> >
> Don't really like the names. Don't think they do anything for clarity, but
> if you can come up with something better for PAGES_BYTE I might change my
> mind ;) (just forget about PAGES_KB - people know what *1024 means)
No, they are mainly for concision. Don't you think it's cleaner to write
PAGES_KB(VM_MAX_READAHEAD)
than
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE
Admittedly the names are somewhat awkward though :)
> Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
> 4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
> It is saying nothing about minimum, so presumably 0 is the correct choice.
The macros were first introduced exactly for this reason ;)
It is rumored that there will be 64K page support, and this macro
helps round up the 16K sized VM_MIN_READAHEAD. The eof_index also
needs rounding up.
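The rounding concern can be seen in a user-space model of the macros under a hypothetical 64K page size: a plain truncating shift would turn the 16KB VM_MIN_READAHEAD into zero pages, while the round-up yields one page.

```c
#include <assert.h>

/* Hypothetical 64K page configuration, for illustration only. */
#define PAGE_CACHE_SHIFT	16
#define PAGE_CACHE_SIZE		(1UL << PAGE_CACHE_SHIFT)

/* The macros from the patch: byte count rounded up to whole pages. */
#define PAGES_BYTE(size)	(((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
#define PAGES_KB(size)		PAGES_BYTE((size) * 1024UL)
```

Here PAGES_KB(16) evaluates to 1, where the truncating (16 * 1024) >> PAGE_CACHE_SHIFT would give 0 and silently disable the minimum readahead.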
> >+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
> >+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
> >
>
> Again, it is probably easier just to use the expanded version. Then the
> reader can immediately say: ah, the next page on the LRU list (rather
> than, maybe, the next page in the pagecache).
Ok, will expand it.
Wu
* Re: [PATCH 14/33] readahead: state based method - data structure
[not found] ` <20060525104353.GF4996@mail.ustc.edu.cn>
@ 2006-05-25 10:43 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 10:43 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 04:03:31PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
>
> >Extend struct file_ra_state to support the adaptive read-ahead logic.
> >
>
> Another nitpick: It is usually OK to do these things in the same patch
> that actually uses the new data (or functions -- eg. patch 15).
>
> If the addition is complex or in a completely different subsystem
> (eg. your rescue_pages function), _that_ can justify it being split
> into its own patch. Then you might also prepend the subject with mm:
> and cc linux-mm to get better reviews.
Ok, thanks for the advice.
Regards,
Wu
* Re: [PATCH 10/33] readahead: support functions
[not found] ` <20060525111318.GH4996@mail.ustc.edu.cn>
@ 2006-05-25 11:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 11:13 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 03:13:16PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
>
> >+#ifdef CONFIG_ADAPTIVE_READAHEAD
> >+
> >+/*
> >+ * The nature of read-ahead allows false tests to occur occasionally.
> >+ * Here we just do not bother to call get_page(), it's meaningless anyway.
> >+ */
> >+static inline struct page *__find_page(struct address_space *mapping,
> >+ pgoff_t offset)
> >+{
> >+ return radix_tree_lookup(&mapping->page_tree, offset);
> >+}
> >+
> >+static inline struct page *find_page(struct address_space *mapping,
> >+ pgoff_t offset)
> >+{
> >+ struct page *page;
> >+
> >+ read_lock_irq(&mapping->tree_lock);
> >+ page = __find_page(mapping, offset);
> >+ read_unlock_irq(&mapping->tree_lock);
> >+ return page;
> >+}
> >
> >
>
> Meh, this is just open-coded elsewhere in readahead.c; I'd either
> open code it, or do a new patch to replace the existing callers.
> find_page should be in mm/filemap.c, btw (or include/linux/pagemap.h).
Maybe it should stay in readahead.c.
I got this early warning from Andrew:
find_page() is not meant to be a general API, for it can
easily be abused.
> >+
> >+/*
> >+ * Move pages in danger (of thrashing) to the head of inactive_list.
> >+ * Not expected to happen frequently.
> >+ */
> >+static unsigned long rescue_pages(struct page *page, unsigned long
> >nr_pages)
> >
> >
>
> Should probably be in mm/vmscan.c
Maybe. It's a highly specialized function. It protects a contiguous
range of sequential readahead pages in a file. Do you mean to move it
because of the zone->lru_lock protected statements?
Regards,
Wu
* Re: [PATCH 12/33] readahead: min/max sizes
[not found] ` <20060525121206.GI4996@mail.ustc.edu.cn>
@ 2006-05-25 12:12 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 12:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 02:50:59PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
>
> >- Enlarge VM_MAX_READAHEAD to 1024 if new read-ahead code is compiled in.
> > This value is no longer tightly coupled with the thrashing problem,
> > therefore constrained by it. The adaptive read-ahead logic merely takes
> > it as an upper bound, and will not stick to it under memory pressure.
> >
>
> I guess this size enlargement is one of the main reasons your
> patchset improves performance in some cases.
Sure, I started the patch to fulfill the 1M _default_ size dream ;-)
The majority of users will never enjoy the performance improvement if
we stick to the 128k default size. And that won't be possible with the
current readahead logic, since it lacks a basic thrashing protection
mechanism.
> There is currently some sort of thrashing protection in there.
> Obviously you've found it to be unable to cope with some situations
> and introduced a lot of really fancy stuff to fix it. Are these just
> academic access patterns, or do you have real test cases that
> demonstrate this failure (ie. can we try to incrementally improve
> the current logic as well as work towards merging your readahead
> rewrite?)
But to be serious, in the process I realized that it's about much more
than the max readahead size. The fancy features come more from _real_
needs than from academic goals. I've seen real-world improvements from
desktop/file server/backup server/database users for most of the
implemented features.
Wu
* Re: [PATCH 08/33] readahead: common macros
[not found] ` <20060525134224.GJ4996@mail.ustc.edu.cn>
@ 2006-05-25 13:42 ` Wu Fengguang
2006-05-25 14:38 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 13:42 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
> >+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> >+#define PAGES_KB(size) PAGES_BYTE((size)*1024)
> >
> Don't really like the names. Don't think they do anything for clarity, but
> if you can come up with something better for PAGES_BYTE I might change my
> mind ;) (just forget about PAGES_KB - people know what *1024 means)
>
> Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
> 4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
> It is saying nothing about minimum, so presumably 0 is the correct choice.
Got an idea, how about these ones:
#define FULL_PAGES(bytes) ((bytes) >> PAGE_CACHE_SHIFT)
#define PARTIAL_PAGES(bytes) (((bytes)+PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT)
* Re: [PATCH 08/33] readahead: common macros
2006-05-25 13:42 ` Wu Fengguang
@ 2006-05-25 14:38 ` Andrew Morton
0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 14:38 UTC (permalink / raw)
To: Wu Fengguang; +Cc: nickpiggin, linux-kernel
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
> > >+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> > >+#define PAGES_KB(size) PAGES_BYTE((size)*1024)
> > >
> > Don't really like the names. Don't think they do anything for clarity, but
> > if you can come up with something better for PAGES_BYTE I might change my
> > mind ;) (just forget about PAGES_KB - people know what *1024 means)
> >
> > Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
> > 4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
> > It is saying nothing about minimum, so presumably 0 is the correct choice.
>
> Got an idea, how about these ones:
>
> #define FULL_PAGES(bytes) ((bytes) >> PAGE_CACHE_SHIFT)
I dunno. We've traditionally open-coded things like this.
> #define PARTIAL_PAGES(bytes) (((bytes)+PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT)
That's identical to include/linux/kernel.h:DIV_ROUND_UP(), from the gfs2 tree.
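The equivalence is easy to check: with DIV_ROUND_UP as defined in include/linux/kernel.h, the proposed PARTIAL_PAGES(bytes) is exactly DIV_ROUND_UP(bytes, PAGE_CACHE_SIZE), while FULL_PAGES is plain truncating division (a 4K page size is assumed here purely for the sketch):

```c
#include <assert.h>

/* DIV_ROUND_UP as in include/linux/kernel.h. */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define PAGE_CACHE_SHIFT	12	/* assume 4K pages */
#define PAGE_CACHE_SIZE		(1UL << PAGE_CACHE_SHIFT)

/* The two proposed helpers: round down vs. round up to whole pages. */
#define FULL_PAGES(bytes)	((bytes) >> PAGE_CACHE_SHIFT)
#define PARTIAL_PAGES(bytes)	(((bytes) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
```

For 4095 bytes, FULL_PAGES gives 0 while PARTIAL_PAGES and DIV_ROUND_UP both give 1, which matches Nick's earlier point about which rounding a caller actually wants.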
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-24 11:12 ` [PATCH 00/33] Adaptive read-ahead V12 Wu Fengguang
@ 2006-05-25 15:44 ` Andrew Morton
2006-05-25 19:26 ` Michael Stone
` (4 more replies)
0 siblings, 5 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 15:44 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg, mstone
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Andrew,
>
> This is the 12th release of the adaptive readahead patchset.
>
> It has received tests in a wide range of applications in the past
> six months, and has been polished up considerably.
>
> Please consider it for inclusion in -mm tree.
>
>
> Performance benefits
> ====================
>
> Besides file servers and desktops, it has recently been found to benefit
> postgresql databases a lot.
>
> I explained to pgsql users how the patch may help their db performance:
> http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
> [QUOTE]
> HOW IT WORKS
>
> In adaptive readahead, the context based method may be of particular
> interest to postgresql users. It works by peeking into the file cache
> and checking if there are any history pages present or accessed. In this
> way it can detect almost all forms of sequential / semi-sequential read
> patterns, e.g.
> - parallel / interleaved sequential scans on one file
> - sequential reads across file open/close
> - mixed sequential / random accesses
> - sparse / skimming sequential read
>
> It also has methods to detect some less common cases:
> - reading backward
> - seeking all over reading N pages
>
> WAYS TO BENEFIT FROM IT
>
> As we know, postgresql relies on the kernel to do proper readahead.
> The adaptive readahead might help performance in the following cases:
> - concurrent sequential scans
> - sequential scan on a fragmented table
> (some DBs suffer from this problem, not sure for pgsql)
> - index scan with clustered matches
> - index scan on majority rows (in case the planner goes wrong)
>
> And received positive responses:
> [QUOTE from Michael Stone]
> I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
> with the patch the job took 1.7M ms. Another VACUUM that normally takes
> between 300k-500k ms took 150k. Definitely a promising addition.
>
> [QUOTE from Michael Stone]
> >I'm thinking about it, we're already using a fixed read-ahead of 16MB
> >using blockdev on the stock Redhat 2.6.9 kernel, it would be nice to
> >not have to set this so we may try it.
>
> FWIW, I never saw much performance difference from doing that. Wu's
> patch, OTOH, gave a big boost.
>
> [QUOTE: odbc-bench with Postgresql 7.4.11 on dual Opteron]
> Base kernel:
> Transactions per second: 92.384758
> Transactions per second: 99.800896
>
> After the read-ahead patch, vm.readahead_ratio = 100:
> Transactions per second: 105.461952
> Transactions per second: 105.458664
>
> vm.readahead_ratio = 100 ; vm.readahead_hit_rate = 1:
> Transactions per second: 113.055367
> Transactions per second: 124.815910
These are nice-looking numbers, but one wonders. If optimising readahead
makes this much difference to postgresql performance then postgresql should
be doing the readahead itself, rather than relying upon the kernel's
ability to guess what the application will be doing in the future. Because
surely the database can do a better job of that than the kernel.
That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
application-level readahead.
Has this been considered or attempted?
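A minimal sketch of what that would look like from the application side (error handling trimmed; app_readahead and the scratch-file demo are made-up names, but the posix_fadvise() calls are the standard API):

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Disable kernel readahead heuristics on @path, then explicitly
 * prefetch the byte range [off, off + len) that the application
 * knows it will need.  Returns 0 on success, -1 on failure.
 */
static int app_readahead(const char *path, off_t off, off_t len)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	/* kernel: don't guess my access pattern */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0) {
		close(fd);
		return -1;
	}
	/* kernel: start reading this range now, I'll want it soon */
	if (posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED) != 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

/* Exercise the helper on a small scratch file. */
static int fadvise_demo(void)
{
	const char *path = "fadvise_demo.tmp";
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;
	fputs("hello readahead", f);
	fclose(f);
	ret = app_readahead(path, 0, 4096);
	remove(path);
	return ret;
}
```

Note that posix_fadvise() returns 0 on success and an error number (not -1 with errno) on failure, which is why the return values are compared against 0 directly.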
* Re: [PATCH 28/33] readahead: loop case
[not found] ` <20060525154846.GA6907@mail.ustc.edu.cn>
@ 2006-05-25 15:48 ` wfg
0 siblings, 0 replies; 107+ messages in thread
From: wfg @ 2006-05-25 15:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Limin Wang
On Wed, May 24, 2006 at 10:01:35PM +0800, Limin Wang wrote:
>
> If the loopback file is bigger than the memory size, it may cause misses
> again; might it be better to turn on the read-ahead?
>
The readahead is always on; only the look-ahead is being disabled :-)
> > Disable look-ahead for loop file.
> >
> > Loopback files normally contain filesystems, in which case there are already
> > proper look-aheads in the upper layer, more look-aheads on the loopback file
> > only ruins the read-ahead hit rate.
* Re: [PATCH 03/33] radixtree: hole scanning functions
2006-05-24 11:12 ` [PATCH 03/33] radixtree: hole scanning functions Wu Fengguang
@ 2006-05-25 16:19 ` Andrew Morton
[not found] ` <20060526070416.GB5135@mail.ustc.edu.cn>
[not found] ` <20060526110559.GA14398@mail.ustc.edu.cn>
0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:19 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Introduce a pair of functions to scan radix tree for hole/empty item.
>
There's a userspace radix-tree test harness at
http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
If/when these new features are merged up, it would be good to have new
testcases added to that suite, please.
In the meantime you may care to develop those tests anyway, and see if you
can trip up the new features.
* Re: [PATCH 04/33] readahead: page flag PG_readahead
2006-05-24 11:12 ` [PATCH 04/33] readahead: page flag PG_readahead Wu Fengguang
@ 2006-05-25 16:23 ` Andrew Morton
[not found] ` <20060526070646.GC5135@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:23 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> An new page flag PG_readahead is introduced as a look-ahead mark, which
> reminds the caller to give the adaptive read-ahead logic a chance to do
> read-ahead ahead of time for I/O pipelining.
>
> It roughly corresponds to `ahead_start' of the stock read-ahead logic.
>
This isn't a very revealing description of what this flag does.
> +#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
uh-oh. This is extremely risky. Needs extensive justification, please.
* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
2006-05-24 11:12 ` [PATCH 06/33] readahead: refactor __do_page_cache_readahead() Wu Fengguang
@ 2006-05-25 16:30 ` Andrew Morton
2006-05-25 22:33 ` Paul Mackerras
[not found] ` <20060526071339.GE5135@mail.ustc.edu.cn>
0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:30 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Add look-ahead support to __do_page_cache_readahead(),
> which is needed by the adaptive read-ahead logic.
You'd need to define "look-ahead support" before telling us you've added it ;)
> @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> break;
> page->index = page_offset;
> list_add(&page->lru, &page_pool);
> + if (page_idx == nr_to_read - lookahead_size)
> + __SetPageReadahead(page);
> ret++;
> }
OK. But the __SetPageFoo() things still give me the creeps.
OT: look:
read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
read_lock_irq(&mapping->tree_lock);
we should have a page allocation function which just allocates a page from
this CPU's per-cpu-pages magazine, and fails if the magazine is empty:
page = alloc_pages_local(mapping_gfp_mask(x)|__GFP_COLD);
if (!page) {
read_unlock_irq(&mapping->tree_lock);
/*
* This will refill the per-cpu-pages magazine
*/
page = page_cache_alloc_cold(mapping);
read_lock_irq(&mapping->tree_lock);
}
* Re: [PATCH 08/33] readahead: common macros
2006-05-24 11:12 ` [PATCH 08/33] readahead: common macros Wu Fengguang
2006-05-25 5:56 ` Nick Piggin
@ 2006-05-25 16:33 ` Andrew Morton
1 sibling, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:33 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Define some common used macros for the read-ahead logics.
>
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
>
> mm/readahead.c | 14 ++++++++++++--
> 1 files changed, 12 insertions(+), 2 deletions(-)
>
> --- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
> +++ linux-2.6.17-rc4-mm3/mm/readahead.c
> @@ -5,6 +5,8 @@
> *
> * 09Apr2002 akpm@zip.com.au
> * Initial version.
> + * 21May2006 Wu Fengguang <wfg@mail.ustc.edu.cn>
> + * Adaptive read-ahead framework.
> */
>
> #include <linux/kernel.h>
> @@ -14,6 +16,14 @@
> #include <linux/blkdev.h>
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/nfsd/const.h>
Why on earth are we including that file?
Whatever goodies it contains should be moved into fs.h or mm.h or something.
> +
> +#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> +#define PAGES_KB(size) PAGES_BYTE((size)*1024)
These aren't proving popular.
> +#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
> +#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
hm. Makes sense I guess, but normally we'll be iterating across lists with
the list_for_each*() helpers, so I'm a little surprised that the above are
needed.
* Re: [PATCH 09/33] readahead: events accounting
2006-05-24 11:12 ` [PATCH 09/33] readahead: events accounting Wu Fengguang
@ 2006-05-25 16:36 ` Andrew Morton
[not found] ` <20060526070943.GD5135@mail.ustc.edu.cn>
[not found] ` <20060527132002.GA4814@mail.ustc.edu.cn>
0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:36 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg, joern, ioe-lkml
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> A debugfs file named `readahead/events' is created according to advice from
> Jörn Engel, Andrew Morton and Ingo Oeser.
If everyone's patches all get merged up we'd expect that this facility be
migrated over to use Martin Peschke's statistics infrastructure.
That's not a thing you should do now, but it would be a useful test of
Martin's work if you could find time to look at it and let us know whether
the infrastructure which he has provided would suit this application,
thanks.
* Re: [PATCH 10/33] readahead: support functions
2006-05-24 11:12 ` [PATCH 10/33] readahead: support functions Wu Fengguang
2006-05-25 5:13 ` Nick Piggin
@ 2006-05-25 16:48 ` Andrew Morton
[not found] ` <20060526073114.GH5135@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:48 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> +/*
> + * The nature of read-ahead allows false tests to occur occasionally.
> + * Here we just do not bother to call get_page(), it's meaningless anyway.
> + */
> +static inline struct page *__find_page(struct address_space *mapping,
> + pgoff_t offset)
> +{
> + return radix_tree_lookup(&mapping->page_tree, offset);
> +}
> +
> +static inline struct page *find_page(struct address_space *mapping,
> + pgoff_t offset)
> +{
> + struct page *page;
> +
> + read_lock_irq(&mapping->tree_lock);
> + page = __find_page(mapping, offset);
> + read_unlock_irq(&mapping->tree_lock);
> + return page;
> +}
Would much prefer that this be called probe_page() and that it return 0 or
1, so nobody is tempted to dereference `page'.
> +/*
> + * Move pages in danger (of thrashing) to the head of inactive_list.
> + * Not expected to happen frequently.
> + */
> +static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> +{
> + int pgrescue;
> + pgoff_t index;
> + struct zone *zone;
> + struct address_space *mapping;
> +
> + BUG_ON(!nr_pages || !page);
> + pgrescue = 0;
> + index = page_index(page);
> + mapping = page_mapping(page);
> +
> + dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
> + mapping->host->i_ino, index, nr_pages);
> +
> + for(;;) {
> + zone = page_zone(page);
> + spin_lock_irq(&zone->lru_lock);
> +
> + if (!PageLRU(page))
> + goto out_unlock;
> +
> + while (page_mapping(page) == mapping &&
> + page_index(page) == index) {
> + struct page *the_page = page;
> + page = next_page(page);
> + if (!PageActive(the_page) &&
> + !PageLocked(the_page) &&
> + page_count(the_page) == 1) {
> + list_move(&the_page->lru, &zone->inactive_list);
> + pgrescue++;
> + }
> + index++;
> + if (!--nr_pages)
> + goto out_unlock;
> + }
> +
> + spin_unlock_irq(&zone->lru_lock);
> +
> + cond_resched();
> + page = find_page(mapping, index);
> + if (!page)
> + goto out;
Yikes! We do not have a reference on this page. Now, it happens that
page_zone() on a random freed page will work OK. At present. I think.
Depends on things like memory hot-remove, balloon drivers and heaven knows
what.
But it's not at all clear that the combination
spin_lock_irq(&zone->lru_lock);
if (!PageLRU(page))
goto out_unlock;
is a safe thing to do against a freed page, or against a freed and
reused-for-we-dont-know-what page. It probably _is_ safe, as we're
probably setting and clearing PG_lru inside lru_lock in other places. But
it's not obvious that these things will be true for all time and Nick keeps
on trying to diddle with that stuff. There's quite a bit of subtle
dependency being introduced here.
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 15:44 ` Andrew Morton
@ 2006-05-25 19:26 ` Michael Stone
2006-05-25 19:40 ` David Lang
` (3 subsequent siblings)
4 siblings, 0 replies; 107+ messages in thread
From: Michael Stone @ 2006-05-25 19:26 UTC (permalink / raw)
To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel
On Thu, May 25, 2006 at 08:44:15AM -0700, Andrew Morton wrote:
>These are nice-looking numbers, but one wonders. If optimising readahead
>makes this much difference to postgresql performance then postgresql should
>be doing the readahead itself, rather than relying upon the kernel's
>ability to guess what the application will be doing in the future. Because
>surely the database can do a better job of that than the kernel.
In this particular case Wu had asked about postgres numbers, so I
reported some postgres numbers. You could probably get similar speedups
out of postgres by implementing readahead in postgres. OTOH, the kernel
patch also gives substantial speedups to thing like cp; the question
comes down to whether it's better for every application to implement
readahead or for the kernel to do it. (There are, of course, other
concerns like maintainability or whether performance degrades in other
cases, but I didn't test that. :)
Mike Stone
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 15:44 ` Andrew Morton
2006-05-25 19:26 ` Michael Stone
@ 2006-05-25 19:40 ` David Lang
2006-05-25 22:01 ` Andrew Morton
[not found] ` <20060526011939.GA6220@mail.ustc.edu.cn>
` (2 subsequent siblings)
4 siblings, 1 reply; 107+ messages in thread
From: David Lang @ 2006-05-25 19:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel, mstone
On Thu, 25 May 2006, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>>
>
> These are nice-looking numbers, but one wonders. If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future. Because
> surely the database can do a better job of that than the kernel.
>
> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> application-level readahead.
>
> Has this been considered or attempted?
Postgres chooses not to try and duplicate OS functionality in its I/O
routines.
it doesn't try to determine where on disk the data is (other than
splitting the data into multiple files and possibly spreading things
between directories)
it doesn't try to do its own readahead.
it _does_ maintain its own journal, but depends on the OS to do the right
thing when an fsync is issued on the files.
yes, it could be rewritten to do all this itself, but the project has
decided not to try and figure out the best options for all the different
filesystems and OSes that it runs on, and instead trusts the OS developers
to do reasonable things.
besides, do you really want to have every program doing its own
readahead?
David Lang
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 22:01 ` Andrew Morton
@ 2006-05-25 20:28 ` David Lang
2006-05-26 0:48 ` Michael Stone
1 sibling, 0 replies; 107+ messages in thread
From: David Lang @ 2006-05-25 20:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: wfg, linux-kernel, mstone
> If the developers of that program want to squeeze the last 5% out of it
> then sure, I'd expect them to use such OS-provided I/O scheduling
> facilities. Database developers do that sort of thing all the time.
>
> We have an application which knows what it's doing sending IO requests to
> the kernel which must then try to reverse engineer what the application is
> doing via this rather inappropriate communication channel.
>
> Is that dumb, or what?
>
> Given that the application already knows what it's doing, it's in a much
> better position to issue the anticipatory IO requests than is the kernel.
if a program is trying to squeeze every last bit of performance out of a
system then you are right, it should run on the bare hardware. however
in reality many people are willing to sacrifice a little performance for
maintainability and portability.
if Adaptive read-ahead was only useful for Postgres (and had a negative
effect on everything else, even if it's just the added complication in the
kernel) then I would agree that it should be in Postgres, not in the
kernel. but I don't believe that this is the case, this patch series helps
in a large number of workloads (including 'cp' according to some other
posters), postgres was just used as the example in this subthread.
gnome startup has some serious read-ahead issues from what I've heard,
should it include an I/O scheduler as well (after all it knows what it's
going to be doing, why should the kernel have to reverse-enginer it)
David Lang
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 19:40 ` David Lang
@ 2006-05-25 22:01 ` Andrew Morton
2006-05-25 20:28 ` David Lang
2006-05-26 0:48 ` Michael Stone
0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 22:01 UTC (permalink / raw)
To: David Lang; +Cc: wfg, linux-kernel, mstone
David Lang <dlang@digitalinsight.com> wrote:
>
> On Thu, 25 May 2006, Andrew Morton wrote:
>
> > Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >>
> >
> > These are nice-looking numbers, but one wonders. If optimising readahead
> > makes this much difference to postgresql performance then postgresql should
> > be doing the readahead itself, rather than relying upon the kernel's
> > ability to guess what the application will be doing in the future. Because
> > surely the database can do a better job of that than the kernel.
> >
> > That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> > readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> > application-level readahead.
> >
> > Has this been considered or attempted?
>
> Postgres chooses not to try to duplicate OS functionality in its I/O
> routines.
>
> It doesn't try to determine where on disk the data is (other than
> splitting the data into multiple files and possibly spreading things
> between directories).
>
> It doesn't try to do its own readahead.
>
> It _does_ maintain its own journal, but depends on the OS to do the right
> thing when an fsync is issued on the files.
>
> Yes, it could be rewritten to do all this itself, but the project has
> decided not to try to figure out the best options for all the different
> filesystems and OSes that it runs on, and instead trusts the OS developers
> to do reasonable things.
>
> Besides, do you really want to have every program doing its own
> readahead?
>
If the developers of that program want to squeeze the last 5% out of it
then sure, I'd expect them to use such OS-provided I/O scheduling
facilities. Database developers do that sort of thing all the time.
We have an application which knows what it's doing sending IO requests to
the kernel which must then try to reverse engineer what the application is
doing via this rather inappropriate communication channel.
Is that dumb, or what?
Given that the application already knows what it's doing, it's in a much
better position to issue the anticipatory IO requests than is the kernel.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
2006-05-25 16:30 ` Andrew Morton
@ 2006-05-25 22:33 ` Paul Mackerras
2006-05-25 22:40 ` Andrew Morton
[not found] ` <20060526071339.GE5135@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Paul Mackerras @ 2006-05-25 22:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel
Andrew Morton writes:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> > break;
> > page->index = page_offset;
> > list_add(&page->lru, &page_pool);
> > + if (page_idx == nr_to_read - lookahead_size)
> > + __SetPageReadahead(page);
> > ret++;
> > }
>
> OK. But the __SetPageFoo() things still give me the creeps.
I just hope that Wu Fengguang, or whoever is making these patches,
realizes that on some architectures, doing __set_bit on one CPU
concurrently with another CPU doing set_bit on a different bit in the
same word can result in the second CPU's update getting lost...
Paul.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
2006-05-25 22:33 ` Paul Mackerras
@ 2006-05-25 22:40 ` Andrew Morton
0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 22:40 UTC (permalink / raw)
To: Paul Mackerras; +Cc: wfg, linux-kernel
Paul Mackerras <paulus@samba.org> wrote:
>
> Andrew Morton writes:
>
> > Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > > @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> > > break;
> > > page->index = page_offset;
> > > list_add(&page->lru, &page_pool);
> > > + if (page_idx == nr_to_read - lookahead_size)
> > > + __SetPageReadahead(page);
> > > ret++;
> > > }
> >
> > OK. But the __SetPageFoo() things still give me the creeps.
>
> I just hope that Wu Fengguang, or whoever is making these patches,
> realizes that on some architectures, doing __set_bit on one CPU
> concurrently with another CPU doing set_bit on a different bit in the
> same word can result in the second CPU's update getting lost...
>
That's true even on x86.
Yes, this is understood - in this case he's following Nick's dubious lead
in leveraging our knowledge that no other code path will be attempting to
modify this page's flags at this time. It's just been taken off the
freelist, it's not yet on the LRU and we own the only ref to it.
The only hole I was able to shoot in this is swsusp, which walks mem_map[]
fiddling with page flags. But when it does this, only one CPU is running.
But I'm itching for an excuse to extirpate it all ;)
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 22:01 ` Andrew Morton
2006-05-25 20:28 ` David Lang
@ 2006-05-26 0:48 ` Michael Stone
1 sibling, 0 replies; 107+ messages in thread
From: Michael Stone @ 2006-05-26 0:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Lang, wfg, linux-kernel
On Thu, May 25, 2006 at 03:01:49PM -0700, Andrew Morton wrote:
>If the developers of that program want to squeeze the last 5% out of it
>then sure, I'd expect them to use such OS-provided I/O scheduling
>facilities.
Maybe, if we were talking about squeezing the last 5%. But all
applications should be required to greatly complicate their IO routines
for the last 30%? To reimplement something the kernel already does (at
least to some degree), as opposed to making the kernel implementation
better? "Is that dumb, or what?" :-)
>Database developers do that sort of thing all the time.
Even the Oracle people seem to have figured out that they were doing too
much that's properly the responsibility of the OS, creating a maintenance
and portability nightmare.
Mike Stone
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
[not found] ` <20060526011939.GA6220@mail.ustc.edu.cn>
@ 2006-05-26 1:19 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 1:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, mstone
On Thu, May 25, 2006 at 08:44:15AM -0700, Andrew Morton wrote:
> These are nice-looking numbers, but one wonders. If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future. Because
> surely the database can do a better job of that than the kernel.
>
> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> application-level readahead.
>
> Has this been considered or attempted?
There have been many lengthy debates on the postgresql mailing list,
and it seems that there has been _strong_ resistance to it.
IMHO, the best scheme would be:
- leave _obvious_ patterns to the kernel,
i.e. all kinds of (semi-)sequential reads
- do fadvise() for _non-obvious_ patterns at _critical_ points,
i.e. the index scans
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 15:44 ` Andrew Morton
` (2 preceding siblings ...)
[not found] ` <20060526011939.GA6220@mail.ustc.edu.cn>
@ 2006-05-26 2:10 ` Jon Smirl
2006-05-26 3:14 ` Nick Piggin
2006-05-26 14:00 ` Andi Kleen
4 siblings, 1 reply; 107+ messages in thread
From: Jon Smirl @ 2006-05-26 2:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel, mstone
On 5/25/06, Andrew Morton <akpm@osdl.org> wrote:
> These are nice-looking numbers, but one wonders. If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future. Because
> surely the database can do a better job of that than the kernel.
>
> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> application-level readahead.
Users have also reported that this patch fixes performance problems
from web servers using sendfile(). In the case of lighttpd they
actually stopped using sendfile() for large transfers and wrote a user
space replacement where they could control readahead manually. With
this patch in place sendfile() went back to being faster than the user
space implementation.
--
Jon Smirl
jonsmirl@gmail.com
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-26 2:10 ` Jon Smirl
@ 2006-05-26 3:14 ` Nick Piggin
0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2006-05-26 3:14 UTC (permalink / raw)
To: Jon Smirl; +Cc: Andrew Morton, Wu Fengguang, linux-kernel, mstone
Jon Smirl wrote:
> On 5/25/06, Andrew Morton <akpm@osdl.org> wrote:
>
>> These are nice-looking numbers, but one wonders. If optimising
>> readahead
>> makes this much difference to postgresql performance then postgresql
>> should
>> be doing the readahead itself, rather than relying upon the kernel's
>> ability to guess what the application will be doing in the future.
>> Because
>> surely the database can do a better job of that than the kernel.
>>
>> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable
>> kernel
>> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
>> application-level readahead.
>
>
> Users have also reported that this patch fixes performance problems
> from web servers using sendfile(). In the case of lighttpd they
> actually stopped using sendfile() for large transfers and wrote a user
> space replacement where they could control readahead manually. With
> this patch in place sendfile() went back to being faster than the user
> space implementation.
Of course, that is something one would expect should be made to work
properly with the current readahead implementation.
I don't see Wu's patches getting in for a little while yet.
Reproducible test cases (preferably without a whole lot of network clients)
should get this problem fixed.
--
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 08/33] readahead: common macros
2006-05-25 10:41 ` Wu Fengguang
@ 2006-05-26 3:33 ` Nick Piggin
[not found] ` <20060526065906.GA5135@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-26 3:33 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
Wu Fengguang wrote:
>On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
>
>>>+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >>
>>>PAGE_CACHE_SHIFT)
>>>+#define PAGES_KB(size) PAGES_BYTE((size)*1024)
>>>
>>>
>>Don't really like the names. Don't think they do anything for clarity, but
>>if you can come up with something better for PAGES_BYTE I might change my
>>mind ;) (just forget about PAGES_KB - people know what *1024 means)
>>
>
>No, they are mainly for concision. Don't you think it's cleaner to write
> PAGES_KB(VM_MAX_READAHEAD)
>than
> (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE
>
>
No. Apart from semantics being different (which I'll address below), anybody
with any business looking at this code will immediately know and understand
what the latter line means. Not so for the former.
>Admittedly the names are somewhat awkward though :)
>
>
>>Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
>>4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
>>It is saying nothing about minimum, so presumably 0 is the correct choice.
>>
>
>The macros were first introduced exactly for this reason ;)
>
>It is rumored that there will be 64K page support, and this macro
>helps round up the 16K sized VM_MIN_READAHEAD. The eof_index also
>needs rounding up.
>
But VM_MIN_READAHEAD of course should be rounded up, for the same
reasons I said VM_MAX_READAHEAD should be rounded down.
So OK as a bug fix, but it needs to be in its own patch, not in a "common
macros" one, and sufficiently commented (and preferably outside your core
adaptive readahead code so it can be quickly merged up)
--
Send instant messages to your online friends http://au.messenger.yahoo.com
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 08/33] readahead: common macros
[not found] ` <20060526065906.GA5135@mail.ustc.edu.cn>
@ 2006-05-26 6:59 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 6:59 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
Hello Nick and Andrew,
Updated the patch as recommended.
Thanks,
Wu
---
readahead-macros-min-max-rapages.patch
---
Subject: readahead: introduce {MIN,MAX}_RA_PAGES
Define two convenient macros for read-ahead:
- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD
Note that the rounded _up_ MIN_RA_PAGES will work flawlessly with large
page sizes like 64k.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -17,13 +17,21 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+/*
+ * Convenient macros for min/max read-ahead pages.
+ * Note that MAX_RA_PAGES is rounded down, while MIN_RA_PAGES is rounded up.
+ * The latter is necessary for systems with a large page size (e.g. 64k).
+ */
+#define MAX_RA_PAGES (VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
+#define MIN_RA_PAGES DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
- .ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+ .ra_pages = MAX_RA_PAGES,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
@@ -52,7 +60,7 @@ static inline unsigned long get_max_read
static inline unsigned long get_min_readahead(struct file_ra_state *ra)
{
- return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ return MIN_RA_PAGES;
}
static inline void reset_ahead_window(struct file_ra_state *ra)
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 03/33] radixtree: hole scanning functions
[not found] ` <20060526070416.GB5135@mail.ustc.edu.cn>
@ 2006-05-26 7:04 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 7:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Thu, May 25, 2006 at 09:19:46AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Introduce a pair of functions to scan radix tree for hole/empty item.
> >
>
> There's a userspace radix-tree test harness at
> http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
>
> If/when these new features are merged up, it would be good to have new
> testcases added to that suite, please.
>
> In the meantime you may care to develop those tests anyway, and see if you
> can trip up the new features.
Handy tool.
I'll update it with the newly introduced functions, and write
corresponding test cases.
Thanks,
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 04/33] readahead: page flag PG_readahead
[not found] ` <20060526070646.GC5135@mail.ustc.edu.cn>
@ 2006-05-26 7:06 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 7:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Thu, May 25, 2006 at 09:23:11AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > A new page flag PG_readahead is introduced as a look-ahead mark, which
> > reminds the caller to give the adaptive read-ahead logic a chance to do
> > read-ahead ahead of time for I/O pipelining.
> >
> > It roughly corresponds to `ahead_start' of the stock read-ahead logic.
> >
>
> This isn't a very revealing description of what this flag does.
Updated to:
A new page flag, PG_readahead, is introduced.
It acts as a look-ahead mark, which tells the page reader:
Hey, it's time to invoke the adaptive read-ahead logic!
For the sake of I/O pipelining, don't wait until it runs out of
cached pages. ;-)
> > +#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
>
> uh-oh. This is extremly risky. Needs extensive justification, please.
Ok, removed the ugly __ :-)
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 09/33] readahead: events accounting
[not found] ` <20060526070943.GD5135@mail.ustc.edu.cn>
@ 2006-05-26 7:09 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 7:09 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, joern, ioe-lkml
On Thu, May 25, 2006 at 09:36:27AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > A debugfs file named `readahead/events' is created according to advice from
> > Jörn Engel, Andrew Morton and Ingo Oeser.
>
> If everyone's patches all get merged up we'd expect that this facility be
> migrated over to use Martin Peschke's statistics infrastructure.
>
> That's not a thing you should do now, but it would be a useful test of
> Martin's work if you could find time to look at it and let us know whether
> the infrastructure which he has provided would suit this application,
> thanks.
Sure, I'll look into it when I am able to settle down.
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
[not found] ` <20060526071339.GE5135@mail.ustc.edu.cn>
@ 2006-05-26 7:13 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 7:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Thu, May 25, 2006 at 09:30:39AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Add look-ahead support to __do_page_cache_readahead(),
> > which is needed by the adaptive read-ahead logic.
>
> You'd need to define "look-ahead support" before telling us you've added it ;)
>
> > @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> > break;
> > page->index = page_offset;
> > list_add(&page->lru, &page_pool);
> > + if (page_idx == nr_to_read - lookahead_size)
> > + __SetPageReadahead(page);
> > ret++;
> > }
>
> OK. But the __SetPageFoo() things still give me the creeps.
Hehe, updated to SetPageReadahead().
> OT: look:
>
> read_unlock_irq(&mapping->tree_lock);
> page = page_cache_alloc_cold(mapping);
> read_lock_irq(&mapping->tree_lock);
>
> we should have a page allocation function which just allocates a page from
> this CPU's per-cpu-pages magazine, and fails if the magazine is empty:
>
> page = alloc_pages_local(mapping_gfp_mask(x)|__GFP_COLD);
> if (!page) {
> read_unlock_irq(&mapping->tree_lock);
> /*
> * This will refill the per-cpu-pages magazine
> */
> page = page_cache_alloc_cold(mapping);
> read_lock_irq(&mapping->tree_lock);
> }
Seems good, except that alloc_pages_local() would not be able to
spread memory among nodes as page_cache_alloc_cold() does.
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 10/33] readahead: support functions
[not found] ` <20060526073114.GH5135@mail.ustc.edu.cn>
@ 2006-05-26 7:31 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 7:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Thu, May 25, 2006 at 09:48:29AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > +/*
> > + * The nature of read-ahead allows false tests to occur occasionally.
> > + * Here we just do not bother to call get_page(), it's meaningless anyway.
> > + */
> > +static inline struct page *__find_page(struct address_space *mapping,
> > + pgoff_t offset)
> > +{
> > + return radix_tree_lookup(&mapping->page_tree, offset);
> > +}
> > +
> > +static inline struct page *find_page(struct address_space *mapping,
> > + pgoff_t offset)
> > +{
> > + struct page *page;
> > +
> > + read_lock_irq(&mapping->tree_lock);
> > + page = __find_page(mapping, offset);
> > + read_unlock_irq(&mapping->tree_lock);
> > + return page;
> > +}
>
> Would much prefer that this be called probe_page() and that it return 0 or
> 1, so nobody is tempted to dereference `page'.
Good idea. I'd add them to filemap.c.
> > +/*
> > + * Move pages in danger (of thrashing) to the head of inactive_list.
> > + * Not expected to happen frequently.
> > + */
> > +static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> > +{
> > + int pgrescue;
> > + pgoff_t index;
> > + struct zone *zone;
> > + struct address_space *mapping;
> > +
> > + BUG_ON(!nr_pages || !page);
> > + pgrescue = 0;
> > + index = page_index(page);
> > + mapping = page_mapping(page);
> > +
> > + dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
> > + mapping->host->i_ino, index, nr_pages);
> > +
> > + for(;;) {
> > + zone = page_zone(page);
> > + spin_lock_irq(&zone->lru_lock);
> > +
> > + if (!PageLRU(page))
> > + goto out_unlock;
> > +
> > + while (page_mapping(page) == mapping &&
> > + page_index(page) == index) {
> > + struct page *the_page = page;
> > + page = next_page(page);
> > + if (!PageActive(the_page) &&
> > + !PageLocked(the_page) &&
> > + page_count(the_page) == 1) {
> > + list_move(&the_page->lru, &zone->inactive_list);
> > + pgrescue++;
> > + }
> > + index++;
> > + if (!--nr_pages)
> > + goto out_unlock;
> > + }
> > +
> > + spin_unlock_irq(&zone->lru_lock);
> > +
> > + cond_resched();
> > + page = find_page(mapping, index);
> > + if (!page)
> > + goto out;
>
> Yikes! We do not have a reference on this page. Now, it happens that
> page_zone() on a random freed page will work OK. At present. I think.
> Depends on things like memory hot-remove, balloon drivers and heaven knows
> what.
>
> But it's not at all clear that the combination
>
> spin_lock_irq(&zone->lru_lock);
>
> if (!PageLRU(page))
> goto out_unlock;
>
> is a safe thing to do against a freed page, or against a freed and
> reused-for-we-dont-know-what page. It probably _is_ safe, as we're
> probably setting and clearing PG_lru inside lru_lock in other places. But
> it's not obvious that these things will be true for all time and Nick keeps
> on trying to diddle with that stuff. There's quite a bit of subtle
> dependency being introduced here.
I saw some code pieces like
spin_lock_irqsave(&zone->lru_lock, flags);
VM_BUG_ON(!PageLRU(page));
__ClearPageLRU(page);
del_page_from_lru(zone, page);
spin_unlock_irqrestore(&zone->lru_lock, flags);
They give me the impression that PG_lru and page->lru are always changed
together, under the protection of zone->lru_lock...
I agree correctness is the top priority, so I'll stop playing with fire here.
Thanks,
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 03/33] radixtree: hole scanning functions
[not found] ` <20060526110559.GA14398@mail.ustc.edu.cn>
@ 2006-05-26 11:05 ` Wu Fengguang
2006-05-26 16:19 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 11:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Nick Piggin
On Thu, May 25, 2006 at 09:19:46AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Introduce a pair of functions to scan radix tree for hole/empty item.
> >
>
> There's a userspace radix-tree test harness at
> http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
>
> If/when these new features are merged up, it would be good to have new
> testcases added to that suite, please.
>
> In the meantime you may care to develop those tests anyway, and see if you
> can trip up the new features.
The new radix-tree.c/.h breaks the compile badly.
Is there any particular reason not to implement it as a module?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-25 15:44 ` Andrew Morton
` (3 preceding siblings ...)
2006-05-26 2:10 ` Jon Smirl
@ 2006-05-26 14:00 ` Andi Kleen
2006-05-26 16:25 ` Andrew Morton
2006-05-26 23:54 ` Folkert van Heusden
4 siblings, 2 replies; 107+ messages in thread
From: Andi Kleen @ 2006-05-26 14:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, wfg, mstone
Andrew Morton <akpm@osdl.org> writes:
>
> These are nice-looking numbers, but one wonders. If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future. Because
> surely the database can do a better job of that than the kernel.
With that argument we should remove all readahead from the kernel?
Because it's already trying to guess what the application will do.
I suspect it's better to have good readahead code in the kernel
than in a zillion applications.
-Andi
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 03/33] radixtree: hole scanning functions
2006-05-26 11:05 ` Wu Fengguang
@ 2006-05-26 16:19 ` Andrew Morton
0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 16:19 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, nickpiggin
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> On Thu, May 25, 2006 at 09:19:46AM -0700, Andrew Morton wrote:
> > Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > >
> > > Introduce a pair of functions to scan radix tree for hole/empty item.
> > >
> >
> > There's a userspace radix-tree test harness at
> > http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
> >
> > If/when these new features are merged up, it would be good to have new
> > testcases added to that suite, please.
> >
> > In the meantime you may care to develop those tests anyway, and see if you
> > can trip up the new features.
>
> The new radix-tree.c/.h breaks the compile badly.
Sprinkling more stub header files in there usually fixes that.
> Is there any particular reason not to implement it as a module?
Well. It's a heck of a lot more convenient to throw testcases at a
userspace app, and to debug it and to performance analyse it.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-26 14:00 ` Andi Kleen
@ 2006-05-26 16:25 ` Andrew Morton
2006-05-26 23:54 ` Folkert van Heusden
1 sibling, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 16:25 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, wfg, mstone
Andi Kleen <ak@suse.de> wrote:
>
> Andrew Morton <akpm@osdl.org> writes:
> >
> > These are nice-looking numbers, but one wonders. If optimising readahead
> > makes this much difference to postgresql performance then postgresql should
> > be doing the readahead itself, rather than relying upon the kernel's
> > ability to guess what the application will be doing in the future. Because
> > surely the database can do a better job of that than the kernel.
>
> With that argument we should remove all readahead from the kernel?
> Because it's already trying to guess what the application will do.
>
> I suspect it's better to have good readahead code in the kernel
> than in a zillion applications.
>
Wu: "this readahead patch speeds up postgres"
Me: "but postgres could be sped up even more via X"
everyone: "ah, you're saying that's a reason for not altering readahead!".
Would everyone *please* stop being so completely and utterly thick?
Thank you.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 13/33] readahead: state based method - aging accounting
2006-05-24 11:12 ` [PATCH 13/33] readahead: state based method - aging accounting Wu Fengguang
@ 2006-05-26 17:04 ` Andrew Morton
[not found] ` <20060527062234.GB4991@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:04 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
(hey, I haven't finished reading the last batch yet)
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> /*
> + * The node's effective length of inactive_list(s).
> + */
> +static unsigned long node_free_and_cold_pages(void)
> +{
> + unsigned int i;
> + unsigned long sum = 0;
> + struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
> +
> + for (i = 0; i < MAX_NR_ZONES; i++)
> + sum += zones[i].nr_inactive +
> + zones[i].free_pages - zones[i].pages_low;
> +
> + return sum;
> +}
I guess this should go into page_alloc.c along with all the similar functions.
Is this function well-named? Why does it have "cold" in the name?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 14/33] readahead: state based method - data structure
2006-05-24 11:13 ` [PATCH 14/33] readahead: state based method - data structure Wu Fengguang
2006-05-25 6:03 ` Nick Piggin
@ 2006-05-26 17:05 ` Andrew Morton
[not found] ` <20060527070248.GD4991@mail.ustc.edu.cn>
[not found] ` <20060527082758.GF4991@mail.ustc.edu.cn>
1 sibling, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:05 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> #define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
> #define RA_FLAG_INCACHE 0x02 /* file is already in cache */
> +#define RA_FLAG_MMAP (1UL<<31) /* mmaped page access */
> +#define RA_FLAG_NO_LOOKAHEAD (1UL<<30) /* disable look-ahead */
> +#define RA_FLAG_EOF (1UL<<29) /* readahead hits EOF */
Odd. Why not use 4, 8, 16?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 15/33] readahead: state based method - routines
2006-05-24 11:13 ` [PATCH 15/33] readahead: state based method - routines Wu Fengguang
@ 2006-05-26 17:15 ` Andrew Morton
[not found] ` <20060527020616.GA7418@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:15 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Define some helpers on struct file_ra_state.
>
> +/*
> + * The 64bit cache_hits stores three accumulated values and a counter value.
> + * MSB LSB
> + * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
> + */
> +static int ra_cache_hit(struct file_ra_state *ra, int nr)
> +{
> + return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
> +}
So... why not use four u16s?
> +/*
> + * Submit IO for the read-ahead request in file_ra_state.
> + */
> +static int ra_dispatch(struct file_ra_state *ra,
> + struct address_space *mapping, struct file *filp)
> +{
> + enum ra_class ra_class = ra_class_new(ra);
> + unsigned long ra_size = ra_readahead_size(ra);
> + unsigned long la_size = ra_lookahead_size(ra);
> + pgoff_t eof_index = PAGES_BYTE(i_size_read(mapping->host)) + 1;
Sigh. I guess one gets used to that PAGES_BYTE thing after a while. If
you're not familiar with it, it obfuscates things.
<hunts around for its definition>
So in fact it's converting a loff_t to a pgoff_t. Why not call it
convert_loff_t_to_pgoff_t()? ;)
Something better, anyway. Something lower-case and an inline-not-a-macro, too.
> + int actual;
> +
> + if (unlikely(ra->ra_index >= eof_index))
> + return 0;
> +
> + /* Snap to EOF. */
> + if (ra->readahead_index + ra_size / 2 > eof_index) {
You've had a bit of a think and you've arrived at a design decision
surrounding the arithmetic in here. It's very very hard to look at this line
of code and to work out why you decided to implement it in this fashion.
The only way to make such code comprehensible (and hence maintainable) is
to fully comment such things.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
2006-05-24 11:13 ` [PATCH 17/33] readahead: context based method Wu Fengguang
2006-05-25 5:26 ` Nick Piggin
@ 2006-05-26 17:23 ` Andrew Morton
[not found] ` <20060527021252.GB7418@mail.ustc.edu.cn>
2006-05-26 17:27 ` Andrew Morton
2 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:23 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> +#define PAGE_REFCNT_0 0
> +#define PAGE_REFCNT_1 (1 << PG_referenced)
> +#define PAGE_REFCNT_2 (1 << PG_active)
> +#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
> +#define PAGE_REFCNT_MASK PAGE_REFCNT_3
> +
> +/*
> + * STATUS REFERENCE COUNT
> + * __ 0
> + * _R PAGE_REFCNT_1
> + * A_ PAGE_REFCNT_2
> + * AR PAGE_REFCNT_3
> + *
> + * A/R: Active / Referenced
> + */
> +static inline unsigned long page_refcnt(struct page *page)
> +{
> + return page->flags & PAGE_REFCNT_MASK;
> +}
> +
This assumes that PG_referenced < PG_active. Nobody knows that this
assumption was made and someone might go and reorder the page flags and
subtly break readahead.
We need to either not do it this way, or put a big comment in page-flags.h,
or even redefine PG_active to be PG_referenced+1.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
2006-05-24 11:13 ` [PATCH 17/33] readahead: context based method Wu Fengguang
2006-05-25 5:26 ` Nick Piggin
2006-05-26 17:23 ` Andrew Morton
@ 2006-05-26 17:27 ` Andrew Morton
[not found] ` <20060527080443.GE4991@mail.ustc.edu.cn>
2 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:27 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> This is the slow code path of adaptive read-ahead.
>
> ...
>
> +
> +/*
> + * Count/estimate cache hits in range [first_index, last_index].
> + * The estimation is simple and optimistic.
> + */
> +static int count_cache_hit(struct address_space *mapping,
> + pgoff_t first_index, pgoff_t last_index)
> +{
> + struct page *page;
> + int size = last_index - first_index + 1;
`size' might overflow.
> + int count = 0;
> + int i;
> +
> + cond_resched();
> + read_lock_irq(&mapping->tree_lock);
> +
> + /*
> + * The first page may well is chunk head and has been accessed,
> + * so it is index 0 that makes the estimation optimistic. This
> + * behavior guarantees a readahead when (size < ra_max) and
> + * (readahead_hit_rate >= 16).
> + */
> + for (i = 0; i < 16;) {
> + page = __find_page(mapping, first_index +
> + size * ((i++ * 29) & 15) / 16);
29?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 20/33] readahead: initial method - expected read size
2006-05-24 11:13 ` [PATCH 20/33] readahead: initial method - expected read size Wu Fengguang
2006-05-25 5:34 ` [PATCH 22/33] readahead: initial method Nick Piggin
@ 2006-05-26 17:29 ` Andrew Morton
[not found] ` <20060527063826.GC4991@mail.ustc.edu.cn>
1 sibling, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:29 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> backing_dev_info.ra_expect_bytes is dynamically updated to be the expected
> read pages on start-of-file. It allows the initial readahead to be more
> aggressive and hence efficient.
>
>
> +void fastcall readahead_close(struct file *file)
eww, fastcall.
> +{
> + struct inode *inode = file->f_dentry->d_inode;
> + struct address_space *mapping = inode->i_mapping;
> + struct backing_dev_info *bdi = mapping->backing_dev_info;
> + unsigned long pos = file->f_pos;
f_pos is loff_t.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 23/33] readahead: backward prefetching method
2006-05-24 11:13 ` [PATCH 23/33] readahead: backward prefetching method Wu Fengguang
@ 2006-05-26 17:37 ` Nate Diller
2006-05-26 19:22 ` Nathan Scott
0 siblings, 1 reply; 107+ messages in thread
From: Nate Diller @ 2006-05-26 17:37 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel
On 5/24/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> Readahead policy for reading backward.
Just curious, who actually does this? I noticed you submitted patches
to do profiling of actual read loads, so this must be based on data
you've seen. Could you include a comment in the actual code relating
to the loads that it affects?
thanks
NATE
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 27/33] readahead: laptop mode
2006-05-24 11:13 ` [PATCH 27/33] readahead: laptop mode Wu Fengguang
@ 2006-05-26 17:38 ` Andrew Morton
0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:38 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, wfg, bart
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> /*
> + * Set a new look-ahead mark at @new_index.
> + * Return 0 if the new mark is successfully set.
> + */
> +static inline int renew_lookahead(struct address_space *mapping,
> + struct file_ra_state *ra,
> + pgoff_t index, pgoff_t new_index)
> +{
> + struct page *page;
> +
> + if (index == ra->lookahead_index &&
> + new_index >= ra->readahead_index)
> + return 1;
> +
> + page = find_page(mapping, new_index);
> + if (!page)
> + return 1;
> +
> + __SetPageReadahead(page);
> + if (ra->lookahead_index == index)
> + ra->lookahead_index = new_index;
> +
> + return 0;
> +}
> +
This is a pagecache page and other CPUs can look it up and play with it.
The __SetPageReadahead() is quite wrong here.
And we don't have a reference on this page, so this code appears to be racy.
You could fix that by taking and dropping a ref on the page, but it'd be
quicker to take tree_lock and do the SetPageReadahead() while holding it.
This function is too large to inline.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 23/33] readahead: backward prefetching method
2006-05-26 17:37 ` Nate Diller
@ 2006-05-26 19:22 ` Nathan Scott
[not found] ` <20060528123006.GC6478@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Nathan Scott @ 2006-05-26 19:22 UTC (permalink / raw)
To: Nate Diller; +Cc: Wu Fengguang, Andrew Morton, linux-kernel
On Fri, May 26, 2006 at 10:37:56AM -0700, Nate Diller wrote:
> On 5/24/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > Readahead policy for reading backward.
>
> Just curious, who actually does this? I noticed you submitted patches
Nastran does this, and probably other FEA codes. IIRC, iozone
will measure this too - it is very important to some people in
certain scientific arenas.
cheers.
--
Nathan
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-26 14:00 ` Andi Kleen
2006-05-26 16:25 ` Andrew Morton
@ 2006-05-26 23:54 ` Folkert van Heusden
2006-05-27 0:00 ` Con Kolivas
1 sibling, 1 reply; 107+ messages in thread
From: Folkert van Heusden @ 2006-05-26 23:54 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, wfg, mstone
> > These are nice-looking numbers, but one wonders. If optimising readahead
> > makes this much difference to postgresql performance then postgresql should
> > be doing the readahead itself, rather than relying upon the kernel's
> > ability to guess what the application will be doing in the future. Because
> > surely the database can do a better job of that than the kernel.
> With that argument we should remove all readahead from the kernel?
> Because it's already trying to guess what the application will do.
> I suspect it's better to have good readahead code in the kernel
> than in a zillion applications.
Maybe a pluggable read-ahead system could be implemented.
Folkert van Heusden
--
Ever wonder what is out there? Any alien races? Then please support
the seti@home project: setiathome.ssl.berkeley.edu
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-26 23:54 ` Folkert van Heusden
@ 2006-05-27 0:00 ` Con Kolivas
2006-05-27 0:08 ` Con Kolivas
0 siblings, 1 reply; 107+ messages in thread
From: Con Kolivas @ 2006-05-27 0:00 UTC (permalink / raw)
To: linux-kernel; +Cc: Folkert van Heusden, Andi Kleen, Andrew Morton, wfg, mstone
On Saturday 27 May 2006 09:54, Folkert van Heusden wrote:
> > > These are nice-looking numbers, but one wonders. If optimising
> > > readahead makes this much difference to postgresql performance then
> > > postgresql should be doing the readahead itself, rather than relying
> > > upon the kernel's ability to guess what the application will be doing
> > > in the future. Because surely the database can do a better job of that
> > > than the kernel.
> >
> > With that argument we should remove all readahead from the kernel?
> > Because it's already trying to guess what the application will do.
> > I suspect it's better to have good readahead code in the kernel
> > than in a zillion applications.
>
> Maybe a pluggable read-ahead system could be implemented.
Pluggable anything is unpopular with Linus and other maintainers. See
pluggable cpu scheduler and pluggable page replacement policy (vm) patchsets.
--
-ck
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-27 0:00 ` Con Kolivas
@ 2006-05-27 0:08 ` Con Kolivas
2006-05-28 22:20 ` Diego Calleja
0 siblings, 1 reply; 107+ messages in thread
From: Con Kolivas @ 2006-05-27 0:08 UTC (permalink / raw)
To: linux-kernel; +Cc: Folkert van Heusden, Andi Kleen, Andrew Morton, wfg, mstone
On Saturday 27 May 2006 10:00, Con Kolivas wrote:
> On Saturday 27 May 2006 09:54, Folkert van Heusden wrote:
> > > > These are nice-looking numbers, but one wonders. If optimising
> > > > readahead makes this much difference to postgresql performance then
> > > > postgresql should be doing the readahead itself, rather than relying
> > > > upon the kernel's ability to guess what the application will be doing
> > > > in the future. Because surely the database can do a better job of
> > > > that than the kernel.
> > >
> > > With that argument we should remove all readahead from the kernel?
> > > Because it's already trying to guess what the application will do.
> > > I suspect it's better to have good readahead code in the kernel
> > > than in a zillion applications.
> >
> > Maybe a pluggable read-ahead system could be implemented.
>
> Pluggable anything is unpopular with Linus and other maintainers. See
> pluggable cpu scheduler and pluggable page replacement policy (vm)
> patchsets.
Sorry, I should have been clearer. The belief is that certain infrastructure
components do not benefit from a pluggable framework, and readahead probably
comes under that description. It's not like Linus was implying we should only
have one filesystem, for example, since filesystems are after all pluggable
features.
--
-ck
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 15/33] readahead: state based method - routines
[not found] ` <20060527020616.GA7418@mail.ustc.edu.cn>
@ 2006-05-27 2:06 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 2:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:15:36AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Define some helpers on struct file_ra_state.
> >
> > +/*
> > + * The 64bit cache_hits stores three accumulated values and a counter value.
> > + * MSB LSB
> > + * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
> > + */
> > +static int ra_cache_hit(struct file_ra_state *ra, int nr)
> > +{
> > + return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
> > +}
>
> So... why not use four u16s?
Sure, me too, have been thinking about it ;-)
> > +/*
> > + * Submit IO for the read-ahead request in file_ra_state.
> > + */
> > +static int ra_dispatch(struct file_ra_state *ra,
> > + struct address_space *mapping, struct file *filp)
> > +{
> > + enum ra_class ra_class = ra_class_new(ra);
> > + unsigned long ra_size = ra_readahead_size(ra);
> > + unsigned long la_size = ra_lookahead_size(ra);
> > + pgoff_t eof_index = PAGES_BYTE(i_size_read(mapping->host)) + 1;
>
> Sigh. I guess one gets used to that PAGES_BYTE thing after a while. If
> you're not familiar with it, it obfuscates things.
>
> <hunts around for its definition>
>
> So in fact it's converting a loff_t to a pgoff_t. Why not call it
> convert_loff_t_to_pgoff_t()? ;)
>
> Something better, anyway. Something lower-case and an inline-not-a-macro, too.
I'm now using DIV_ROUND_UP(), maybe we can settle with that.
> > + int actual;
> > +
> > + if (unlikely(ra->ra_index >= eof_index))
> > + return 0;
> > +
> > + /* Snap to EOF. */
> > + if (ra->readahead_index + ra_size / 2 > eof_index) {
>
> You've had a bit of a think and you've arrived at a design decision
> surrounding the arithmetic in here. It's very very hard to look at this line
> of code and to work out why you decided to implement it in this fashion.
> The only way to make such code comprehensible (and hence maintainable) is
> to fully comment such things.
Sorry for being a bit lazy.
It is true that some situations are rather tricky, and some
if()/numbers are carefully chosen. I'll continue expanding/detailing
the documentation with future releases. Or would you prefer to add
them as small and distinct patches?
Comments for this one (also rationalized code):
/*
* Snap to EOF, if the request
* - crossed the EOF boundary;
* - is close to EOF (explained below).
*
* Imagine a file sized 18 pages, and we decided to read-ahead the
* first 16 pages. It is highly possible that in the near future we
* will have to do another read-ahead for the remaining 2 pages,
* which is an unfavorable small I/O.
*
* So we prefer to take a bit of risk to enlarge the current read-ahead,
* to eliminate possible future small I/O.
*/
if (ra->readahead_index + ra_readahead_size(ra)/4 > eof_index) {
ra->readahead_index = eof_index;
if (ra->lookahead_index > eof_index)
ra->lookahead_index = eof_index;
ra->flags |= RA_FLAG_EOF;
}
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060527021252.GB7418@mail.ustc.edu.cn>
@ 2006-05-27 2:12 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 2:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:23:43AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > +#define PAGE_REFCNT_0 0
> > +#define PAGE_REFCNT_1 (1 << PG_referenced)
> > +#define PAGE_REFCNT_2 (1 << PG_active)
> > +#define PAGE_REFCNT_3 ((1 << PG_active) | (1 << PG_referenced))
> > +#define PAGE_REFCNT_MASK PAGE_REFCNT_3
> > +
> > +/*
> > + * STATUS REFERENCE COUNT
> > + * __ 0
> > + * _R PAGE_REFCNT_1
> > + * A_ PAGE_REFCNT_2
> > + * AR PAGE_REFCNT_3
> > + *
> > + * A/R: Active / Referenced
> > + */
> > +static inline unsigned long page_refcnt(struct page *page)
> > +{
> > + return page->flags & PAGE_REFCNT_MASK;
> > +}
> > +
>
> This assumes that PG_referenced < PG_active. Nobody knows that this
> assumption was made and someone might go and reorder the page flags and
> subtly break readahead.
>
> We need to either not do it this way, or put a big comment in page-flags.h,
> or even redefine PG_active to be PG_referenced+1.
I have had a code segment like:
#if PG_active < PG_referenced
# error unexpected page flags order
#endif
I'd add it back.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 13/33] readahead: state based method - aging accounting
[not found] ` <20060527062234.GB4991@mail.ustc.edu.cn>
@ 2006-05-27 6:22 ` Wu Fengguang
2006-05-27 7:00 ` Andrew Morton
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 6:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:04:26AM -0700, Andrew Morton wrote:
>
> (hey, I haven't finished reading the last batch yet)
>
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > /*
> > + * The node's effective length of inactive_list(s).
> > + */
> > +static unsigned long node_free_and_cold_pages(void)
> > +{
> > + unsigned int i;
> > + unsigned long sum = 0;
> > + struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
> > +
> > + for (i = 0; i < MAX_NR_ZONES; i++)
> > + sum += zones[i].nr_inactive +
> > + zones[i].free_pages - zones[i].pages_low;
> > +
> > + return sum;
> > +}
>
> I guess this should go into page_alloc.c along with all the similar functions.
Moved as adviced.
> Is this function well-named? Why does it have "cold" in the name?
Because it only sums `nr_inactive', leaving out `nr_active'.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 20/33] readahead: initial method - expected read size
[not found] ` <20060527063826.GC4991@mail.ustc.edu.cn>
@ 2006-05-27 6:38 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 6:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:29:34AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > backing_dev_info.ra_expect_bytes is dynamically updated to be the expected
> > read pages on start-of-file. It allows the initial readahead to be more
> > aggressive and hence efficient.
> >
> >
> > +void fastcall readahead_close(struct file *file)
>
> eww, fastcall.
Hehe, it's a tiny function, and calls no further sub-routines
except debugging ones. Still not necessary?
> > +{
> > + struct inode *inode = file->f_dentry->d_inode;
> > + struct address_space *mapping = inode->i_mapping;
> > + struct backing_dev_info *bdi = mapping->backing_dev_info;
> > + unsigned long pos = file->f_pos;
>
> f_pos is loff_t.
Just meant to be a little more compact ;)
+ unsigned long pos = file->f_pos;
+ unsigned long pgrahit = file->f_ra.cache_hits;
+ unsigned long pgaccess = 1 + pos / PAGE_CACHE_SIZE;
+ unsigned long pgcached = mapping->nrpages;
+
+ if (!pos) /* pread */
+ return;
+
+ if (pgcached > bdi->ra_pages0) /* excessive reads */
+ return;
Here the f_pos will almost definitely have small values.
+
+ if (pgaccess >= pgcached) {
Fixed by adding a comment to clarify it:
+ unsigned long pos = file->f_pos; /* supposed to be small */
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 13/33] readahead: state based method - aging accounting
2006-05-27 6:22 ` Wu Fengguang
@ 2006-05-27 7:00 ` Andrew Morton
[not found] ` <20060527072201.GA5284@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-27 7:00 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> > Is this function well-named? Why does it have "cold" in the name?
>
> Because it only sums `nr_inactive', leaving out `nr_active'.
We use the term "cold" to refer to probably-cache-cold pages in the page
allocator. How about you use "inactive"?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 14/33] readahead: state based method - data structure
[not found] ` <20060527070248.GD4991@mail.ustc.edu.cn>
@ 2006-05-27 7:02 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 7:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:05:52AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > #define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
> > #define RA_FLAG_INCACHE 0x02 /* file is already in cache */
> > +#define RA_FLAG_MMAP (1UL<<31) /* mmaped page access */
> > +#define RA_FLAG_NO_LOOKAHEAD (1UL<<30) /* disable look-ahead */
> > +#define RA_FLAG_EOF (1UL<<29) /* readahead hits EOF */
>
> Odd. Why not use 4, 8, 16?
Sorry, the lower 8 bits are for ra_class values in the new code. It can
cause data corruption when dynamically switching between the two logics :(
I'd like to change the flags member to explicit ones like
struct {
unsigned miss :1;
unsigned incache :1;
unsigned mmap :1;
unsigned no_lookahead :1;
unsigned eof :1;
} flags;
unsigned class_new :4;
unsigned class_old :4;
Reasonable?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 13/33] readahead: state based method - aging accounting
[not found] ` <20060527072201.GA5284@mail.ustc.edu.cn>
@ 2006-05-27 7:22 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 7:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Sat, May 27, 2006 at 12:00:58AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > > Is this function well-named? Why does it have "cold" in the name?
> >
> > Because it only sums `nr_inactive', leaving out `nr_active'.
>
> We use the term "cold" to refer to probably-cache-cold pages in the page
> allocator. How about you use "inactive"?
Got it, thanks.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 17/33] readahead: context based method
[not found] ` <20060527080443.GE4991@mail.ustc.edu.cn>
@ 2006-05-27 8:04 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 8:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:27:16AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > This is the slow code path of adaptive read-ahead.
> >
> > ...
> >
> > +
> > +/*
> > + * Count/estimate cache hits in range [first_index, last_index].
> > + * The estimation is simple and optimistic.
> > + */
> > +static int count_cache_hit(struct address_space *mapping,
> > + pgoff_t first_index, pgoff_t last_index)
> > +{
> > + struct page *page;
> > + int size = last_index - first_index + 1;
>
> `size' might overflow.
It does. Fixed the caller:
@@query_page_cache_segment()
index = radix_tree_scan_hole_backward(&mapping->page_tree,
- offset, ra_max);
+ offset - 1, ra_max);
Here (offset >= 1) always holds.
> > + int count = 0;
> > + int i;
> > +
> > + cond_resched();
> > + read_lock_irq(&mapping->tree_lock);
> > +
> > + /*
> > + * The first page may well is chunk head and has been accessed,
> > + * so it is index 0 that makes the estimation optimistic. This
> > + * behavior guarantees a readahead when (size < ra_max) and
> > + * (readahead_hit_rate >= 16).
> > + */
> > + for (i = 0; i < 16;) {
> > + page = __find_page(mapping, first_index +
> > + size * ((i++ * 29) & 15) / 16);
>
> 29?
It's a prime number. Should be made obvious by the following macro:
#define CACHE_HIT_HASH_KEY 29 /* some prime number */
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 14/33] readahead: state based method - data structure
[not found] ` <20060527082758.GF4991@mail.ustc.edu.cn>
@ 2006-05-27 8:27 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 8:27 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
On Fri, May 26, 2006 at 10:05:52AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > #define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
> > #define RA_FLAG_INCACHE 0x02 /* file is already in cache */
> > +#define RA_FLAG_MMAP (1UL<<31) /* mmaped page access */
> > +#define RA_FLAG_NO_LOOKAHEAD (1UL<<30) /* disable look-ahead */
> > +#define RA_FLAG_EOF (1UL<<29) /* readahead hits EOF */
>
> Odd. Why not use 4, 8, 16?
I'm now settled with:
-#define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
-#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
+#define RA_FLAG_MISS (1UL<<31) /* a cache miss occured against this file */
+#define RA_FLAG_INCACHE (1UL<<30) /* file is already in cache */
+#define RA_FLAG_MMAP (1UL<<29) /* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD (1UL<<28) /* disable look-ahead */
+#define RA_FLAG_EOF (1UL<<27) /* readahead hits EOF */
And still let the low bits hold ra_class values.
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 09/33] readahead: events accounting
[not found] ` <20060527132002.GA4814@mail.ustc.edu.cn>
@ 2006-05-27 13:20 ` Wu Fengguang
2006-05-29 8:19 ` Martin Peschke
0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 13:20 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, joern, ioe-lkml, Martin Peschke
On Thu, May 25, 2006 at 09:36:27AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > A debugfs file named `readahead/events' is created according to advice from
> > Jörn Engel, Andrew Morton and Ingo Oeser.
>
> If everyone's patches all get merged up we'd expect that this facility be
> migrated over to use Martin Peschke's statistics infrastructure.
>
> That's not a thing you should do now, but it would be a useful test of
> Martin's work if you could find time to look at it and let us know whether
> the infrastructure which he has provided would suit this application,
> thanks.
Hi, Martin is doing a great job, thanks.
I have read its doc. It should be suitable for the various
readahead numbers. And it seems trivial work to port to it :)
However it might also make sense to keep the current _table_ interface.
It shows us the whole picture at a glance:
% cat /debug/readahead/events
[table requests] total newfile state context contexta [...]
cache_miss 136302 538 3860 11317 490
read_random 62176 160 424 1633 60
io_congestion 0 0 0 0 0
io_cache_hit 34521 663 10071 15611 1423
io_block 204302 42174 10408 68277 2226
readahead 251478 70746 96846 73636 2561
lookahead 136315 14805 86267 32738 2505
lookahead_hit 103384 8038 74605 9097 598
lookahead_ignore 0 0 0 0 0
readahead_mmap 6911 0 0 0 0
readahead_eof 70793 55935 8500 648 581
readahead_shrink 473 0 473 0 0
readahead_thrash 0 0 0 0 0
readahead_mutilt 2526 24 1079 1403 20
readahead_rescue 1209 0 0 0 0
[table pages] total newfile state context contexta
cache_miss 1292350444 282817 35557285 86087568 5592690
read_random 10299237 177 426 1903 63
io_congestion 0 0 0 0 0
io_cache_hit 2194663 9289 1507054 414311 184715
io_block 204302 42174 10408 68277 2226
readahead 26122947 770681 21815335 3097682 259587
readahead_hit 23101714 588811 19906233 2209547 191269
lookahead 21397630 173502 19872014 936474 415640
lookahead_hit 18663196 98004 17879848 596562 88782
lookahead_ignore 0 0 0 0 0
readahead_mmap 170509 0 0 0 0
readahead_eof 1950484 432763 1342148 47368 34742
readahead_shrink 19900 0 19900 0 0
readahead_thrash 0 0 0 0 0
readahead_mutilt 220331 485 186922 29900 3024
readahead_rescue 119592 0 0 0 0
[table summary] total newfile state context contexta
random_rate 19% 0% 0% 2% 2%
ra_hit_rate 88% 76% 91% 71% 73%
la_hit_rate 75% 54% 86% 27% 23%
var_ra_size 13850 130 5802 6709 10563
avg_ra_size 104 11 225 42 101
avg_la_size 157 12 230 29 166
When Martin's work is included into -mm, I would like to reduce
several col/rows from the table to Martin's infrastructure, and
perhaps add some more items. One obvious candidate collection is the
ra_account(NULL, ...) calls, which do not quite fit the table
interface and deserves individual files.
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 23/33] readahead: backward prefetching method
[not found] ` <20060528123006.GC6478@mail.ustc.edu.cn>
@ 2006-05-28 12:30 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-28 12:30 UTC (permalink / raw)
To: Nathan Scott; +Cc: Nate Diller, Andrew Morton, linux-kernel
On Sat, May 27, 2006 at 05:22:43AM +1000, Nathan Scott wrote:
> On Fri, May 26, 2006 at 10:37:56AM -0700, Nate Diller wrote:
> > On 5/24/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > > Readahead policy for reading backward.
> >
> > Just curious, who actually does this? I noticed you submitted patches
>
> Nastran does this, and probably other FEA codes. IIRC, iozone
> will measure this too - it is very important to some people in
> certain scientific arenas.
Thanks.
It makes sense to have a list of use cases for the
less-common-but-still-important access patterns.
Cheers,
Wu
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-27 0:08 ` Con Kolivas
@ 2006-05-28 22:20 ` Diego Calleja
2006-05-28 22:31 ` kernel
0 siblings, 1 reply; 107+ messages in thread
From: Diego Calleja @ 2006-05-28 22:20 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux-kernel, folkert, ak, akpm, wfg, mstone
On Sat, 27 May 2006 10:08:41 +1000,
Con Kolivas <kernel@kolivas.org> wrote:
> On Saturday 27 May 2006 10:00, Con Kolivas wrote:
> Sorry, I should have been clearer. The belief is that certain infrastructure
> components do not benefit from a pluggable framework, and readahead probably
> comes under that description. It's not like Linus was implying we should only
> have one filesystem, for example, since filesystems are after all pluggable
> features.
That leaves another question that I (a poor user) may have missed: Why is
adaptive read-ahead compile-time configurable instead of completely replacing
the old system?
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
2006-05-28 22:20 ` Diego Calleja
@ 2006-05-28 22:31 ` kernel
[not found] ` <20060529030445.GB5994@mail.ustc.edu.cn>
0 siblings, 1 reply; 107+ messages in thread
From: kernel @ 2006-05-28 22:31 UTC (permalink / raw)
To: Diego Calleja; +Cc: linux-kernel, folkert, ak, akpm, wfg, mstone
Quoting Diego Calleja <diegocg@gmail.com>:
> That leaves another question that I (a poor user) may have missed: Why is
> adaptive read-ahead compile-time configurable instead of completely
> replacing
> the old system?
That was done to appease the users out there that had worse performance with it.
In the early stages of development of this code it was rather detrimental on an
ordinary desktop. Fortunately that seems to have gotten a lot better. I don't
think the final version should be a compile time option. It's either "adaptive"
and better everywhere or it's not.
--
-ck
^ permalink raw reply [flat|nested] 107+ messages in thread
* Re: [PATCH 00/33] Adaptive read-ahead V12
[not found] ` <20060529030445.GB5994@mail.ustc.edu.cn>
@ 2006-05-29 3:04 ` Wu Fengguang
0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-29 3:04 UTC (permalink / raw)
To: kernel; +Cc: Diego Calleja, linux-kernel, folkert, ak, akpm, mstone
On Mon, May 29, 2006 at 08:31:43AM +1000, kernel@kolivas.org wrote:
> Quoting Diego Calleja <diegocg@gmail.com>:
>
> > That leaves another question that I (a poor user) may have missed: Why is
> > adaptive read-ahead compile-time configurable instead of completely
> > replacing
> > the old system?
>
> That was done to appease the users out there who had worse performance with it.
> In the early stages of development of this code it was rather detrimental on an
> ordinary desktop. Fortunately that seems to have gotten a lot better. I don't
> think the final version should be a compile time option. It's either "adaptive"
> and better everywhere or it's not.
Hehe, I have a dream - that it helps *everywhere* ;-)
Wu
* Re: [PATCH 09/33] readahead: events accounting
2006-05-27 13:20 ` Wu Fengguang
@ 2006-05-29 8:19 ` Martin Peschke
0 siblings, 0 replies; 107+ messages in thread
From: Martin Peschke @ 2006-05-29 8:19 UTC (permalink / raw)
To: Wu Fengguang, Andrew Morton, linux-kernel, joern, ioe-lkml,
Martin Peschke
Wu Fengguang wrote:
> On Thu, May 25, 2006 at 09:36:27AM -0700, Andrew Morton wrote:
>> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>>> A debugfs file named `readahead/events' is created according to advice from
>>> Jörn Engel, Andrew Morton and Ingo Oeser.
>> If everyone's patches all get merged up we'd expect that this facility be
>> migrated over to use Martin Peschke's statistics infrastructure.
>>
>> That's not a thing you should do now, but it would be a useful test of
>> Martin's work if you could find time to look at it and let us know whether
>> the infrastructure which he has provided would suit this application,
>> thanks.
>
> Hi, Martin is doing a great job, thanks.
>
> I have read its documentation. It should be suitable for the various
> readahead counters. And porting to it looks like trivial work :)
Wu, great :) If you have questions (e.g. on how to set up your statistics so
that the output looks compact) or further requirements (like an enhancement
of the code that accumulates numbers), feel free to get back to me.
Thanks, Martin
Thread overview: 107+ messages
[not found] <20060524111246.420010595@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 00/33] Adaptive read-ahead V12 Wu Fengguang
2006-05-25 15:44 ` Andrew Morton
2006-05-25 19:26 ` Michael Stone
2006-05-25 19:40 ` David Lang
2006-05-25 22:01 ` Andrew Morton
2006-05-25 20:28 ` David Lang
2006-05-26 0:48 ` Michael Stone
[not found] ` <20060526011939.GA6220@mail.ustc.edu.cn>
2006-05-26 1:19 ` Wu Fengguang
2006-05-26 2:10 ` Jon Smirl
2006-05-26 3:14 ` Nick Piggin
2006-05-26 14:00 ` Andi Kleen
2006-05-26 16:25 ` Andrew Morton
2006-05-26 23:54 ` Folkert van Heusden
2006-05-27 0:00 ` Con Kolivas
2006-05-27 0:08 ` Con Kolivas
2006-05-28 22:20 ` Diego Calleja
2006-05-28 22:31 ` kernel
[not found] ` <20060529030445.GB5994@mail.ustc.edu.cn>
2006-05-29 3:04 ` Wu Fengguang
[not found] ` <20060524111857.983845462@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 02/33] radixtree: look-aside cache Wu Fengguang
[not found] ` <20060524111858.357709745@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 03/33] radixtree: hole scanning functions Wu Fengguang
2006-05-25 16:19 ` Andrew Morton
[not found] ` <20060526070416.GB5135@mail.ustc.edu.cn>
2006-05-26 7:04 ` Wu Fengguang
[not found] ` <20060526110559.GA14398@mail.ustc.edu.cn>
2006-05-26 11:05 ` Wu Fengguang
2006-05-26 16:19 ` Andrew Morton
[not found] ` <20060524111858.869793445@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 04/33] readahead: page flag PG_readahead Wu Fengguang
2006-05-25 16:23 ` Andrew Morton
[not found] ` <20060526070646.GC5135@mail.ustc.edu.cn>
2006-05-26 7:06 ` Wu Fengguang
2006-05-24 12:27 ` Peter Zijlstra
[not found] ` <20060524123740.GA16304@mail.ustc.edu.cn>
2006-05-24 12:37 ` Wu Fengguang
2006-05-24 12:48 ` Peter Zijlstra
[not found] ` <20060524111859.540640819@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 05/33] readahead: refactor do_generic_mapping_read() Wu Fengguang
[not found] ` <20060524111859.909928820@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 06/33] readahead: refactor __do_page_cache_readahead() Wu Fengguang
2006-05-25 16:30 ` Andrew Morton
2006-05-25 22:33 ` Paul Mackerras
2006-05-25 22:40 ` Andrew Morton
[not found] ` <20060526071339.GE5135@mail.ustc.edu.cn>
2006-05-26 7:13 ` Wu Fengguang
[not found] ` <20060524111900.419314658@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 07/33] readahead: insert cond_resched() calls Wu Fengguang
[not found] ` <20060524111900.970898174@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 08/33] readahead: common macros Wu Fengguang
2006-05-25 5:56 ` Nick Piggin
[not found] ` <20060525104117.GE4996@mail.ustc.edu.cn>
2006-05-25 10:41 ` Wu Fengguang
2006-05-26 3:33 ` Nick Piggin
[not found] ` <20060526065906.GA5135@mail.ustc.edu.cn>
2006-05-26 6:59 ` Wu Fengguang
[not found] ` <20060525134224.GJ4996@mail.ustc.edu.cn>
2006-05-25 13:42 ` Wu Fengguang
2006-05-25 14:38 ` Andrew Morton
2006-05-25 16:33 ` Andrew Morton
[not found] ` <20060524111901.581603095@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 09/33] readahead: events accounting Wu Fengguang
2006-05-25 16:36 ` Andrew Morton
[not found] ` <20060526070943.GD5135@mail.ustc.edu.cn>
2006-05-26 7:09 ` Wu Fengguang
[not found] ` <20060527132002.GA4814@mail.ustc.edu.cn>
2006-05-27 13:20 ` Wu Fengguang
2006-05-29 8:19 ` Martin Peschke
[not found] ` <20060524111901.976888971@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 10/33] readahead: support functions Wu Fengguang
2006-05-25 5:13 ` Nick Piggin
[not found] ` <20060525111318.GH4996@mail.ustc.edu.cn>
2006-05-25 11:13 ` Wu Fengguang
2006-05-25 16:48 ` Andrew Morton
[not found] ` <20060526073114.GH5135@mail.ustc.edu.cn>
2006-05-26 7:31 ` Wu Fengguang
[not found] ` <20060524111902.491708692@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 11/33] readahead: sysctl parameters Wu Fengguang
2006-05-25 4:50 ` [PATCH 12/33] readahead: min/max sizes Nick Piggin
[not found] ` <20060525121206.GI4996@mail.ustc.edu.cn>
2006-05-25 12:12 ` Wu Fengguang
[not found] ` <20060524111903.510268987@localhost.localdomain>
2006-05-24 11:12 ` [PATCH 13/33] readahead: state based method - aging accounting Wu Fengguang
2006-05-26 17:04 ` Andrew Morton
[not found] ` <20060527062234.GB4991@mail.ustc.edu.cn>
2006-05-27 6:22 ` Wu Fengguang
2006-05-27 7:00 ` Andrew Morton
[not found] ` <20060527072201.GA5284@mail.ustc.edu.cn>
2006-05-27 7:22 ` Wu Fengguang
[not found] ` <20060524111904.019763011@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 14/33] readahead: state based method - data structure Wu Fengguang
2006-05-25 6:03 ` Nick Piggin
[not found] ` <20060525104353.GF4996@mail.ustc.edu.cn>
2006-05-25 10:43 ` Wu Fengguang
2006-05-26 17:05 ` Andrew Morton
[not found] ` <20060527070248.GD4991@mail.ustc.edu.cn>
2006-05-27 7:02 ` Wu Fengguang
[not found] ` <20060527082758.GF4991@mail.ustc.edu.cn>
2006-05-27 8:27 ` Wu Fengguang
[not found] ` <20060524111904.683513683@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 15/33] readahead: state based method - routines Wu Fengguang
2006-05-26 17:15 ` Andrew Morton
[not found] ` <20060527020616.GA7418@mail.ustc.edu.cn>
2006-05-27 2:06 ` Wu Fengguang
[not found] ` <20060524111906.245276338@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 18/33] readahead: initial method - guiding sizes Wu Fengguang
[not found] ` <20060524111906.588647885@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 19/33] readahead: initial method - thrashing guard size Wu Fengguang
[not found] ` <20060524111907.134685550@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 20/33] readahead: initial method - expected read size Wu Fengguang
2006-05-25 5:34 ` [PATCH 22/33] readahead: initial method Nick Piggin
[not found] ` <20060525085957.GC4996@mail.ustc.edu.cn>
2006-05-25 8:59 ` Wu Fengguang
2006-05-26 17:29 ` [PATCH 20/33] readahead: initial method - expected read size Andrew Morton
[not found] ` <20060527063826.GC4991@mail.ustc.edu.cn>
2006-05-27 6:38 ` Wu Fengguang
[not found] ` <20060524111908.569533741@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 23/33] readahead: backward prefetching method Wu Fengguang
2006-05-26 17:37 ` Nate Diller
2006-05-26 19:22 ` Nathan Scott
[not found] ` <20060528123006.GC6478@mail.ustc.edu.cn>
2006-05-28 12:30 ` Wu Fengguang
[not found] ` <20060524111909.147416866@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 24/33] readahead: seeking reads method Wu Fengguang
[not found] ` <20060524111909.635589701@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 25/33] readahead: thrashing recovery method Wu Fengguang
[not found] ` <20060524111910.207894375@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 26/33] readahead: call scheme Wu Fengguang
[not found] ` <20060524111910.544274094@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 27/33] readahead: laptop mode Wu Fengguang
2006-05-26 17:38 ` Andrew Morton
[not found] ` <20060524111911.607080495@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 29/33] readahead: nfsd case Wu Fengguang
[not found] ` <20060524111912.156646847@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 30/33] readahead: turn on by default Wu Fengguang
[not found] ` <20060524111912.485160282@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 31/33] readahead: debug radix tree new functions Wu Fengguang
[not found] ` <20060524111912.967392912@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 32/33] readahead: debug traces showing accessed file names Wu Fengguang
[not found] ` <20060524111913.603476893@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 33/33] readahead: debug traces showing read patterns Wu Fengguang
[not found] ` <20060524111905.586110688@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 17/33] readahead: context based method Wu Fengguang
2006-05-25 5:26 ` Nick Piggin
[not found] ` <20060525080308.GB4996@mail.ustc.edu.cn>
2006-05-25 8:03 ` Wu Fengguang
2006-05-26 17:23 ` Andrew Morton
[not found] ` <20060527021252.GB7418@mail.ustc.edu.cn>
2006-05-27 2:12 ` Wu Fengguang
2006-05-26 17:27 ` Andrew Morton
[not found] ` <20060527080443.GE4991@mail.ustc.edu.cn>
2006-05-27 8:04 ` Wu Fengguang
2006-05-24 12:37 ` Peter Zijlstra
[not found] ` <20060524133353.GA16508@mail.ustc.edu.cn>
2006-05-24 13:33 ` Wu Fengguang
2006-05-24 15:53 ` Peter Zijlstra
[not found] ` <20060525012556.GA6111@mail.ustc.edu.cn>
2006-05-25 1:25 ` Wu Fengguang
[not found] ` <20060524111911.032100160@localhost.localdomain>
2006-05-24 11:13 ` [PATCH 28/33] readahead: loop case Wu Fengguang
2006-05-24 14:01 ` Limin Wang
[not found] ` <20060525154846.GA6907@mail.ustc.edu.cn>
2006-05-25 15:48 ` wfg