linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/33] Adaptive read-ahead V12
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

Andrew,

This is the 12th release of the adaptive readahead patchset.

It has been tested with a wide range of applications over the past
six months, and has been polished up considerably.

Please consider it for inclusion in the -mm tree.


Performance benefits
====================

Besides file servers and desktops, it has recently been found to benefit
postgresql databases a lot.

I explained to pgsql users how the patch may help their db performance:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
[QUOTE]
	HOW IT WORKS

	In adaptive readahead, the context based method may be of particular
	interest to postgresql users. It works by peeking into the file cache
	and checking whether any history pages are present or have been
	accessed. In this way it can detect almost all forms of sequential /
	semi-sequential read patterns, e.g.
		- parallel / interleaved sequential scans on one file
		- sequential reads across file open/close
		- mixed sequential / random accesses
		- sparse / skimming sequential read

	It also has methods to detect some less common cases:
		- reading backward
		- seeking all over, reading N pages

	WAYS TO BENEFIT FROM IT

	As we know, postgresql relies on the kernel to do proper readahead.
	The adaptive readahead might help performance in the following cases:
		- concurrent sequential scans
		- sequential scan on a fragmented table
		  (some DBs suffer from this problem, not sure for pgsql)
		- index scan with clustered matches
		- index scan on majority rows (in case the planner goes wrong)
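
To illustrate the context based method quoted above with code: below is a
sketch only (the helper is invented for illustration; it builds on the
radix tree scanning function introduced in patch 03, and assumes the
caller holds mapping->tree_lock):

	/*
	 * Probe the page cache for history pages: a long run of cached
	 * pages just below @index suggests a (semi-)sequential reader,
	 * even across file open/close or interleaved streams.
	 */
	static int is_sequential_context(struct address_space *mapping,
					 pgoff_t index, unsigned long max_scan)
	{
		pgoff_t hole;

		if (!index)
			return 0;

		hole = radix_tree_scan_hole_backward(&mapping->page_tree,
						     index - 1, max_scan);

		return (index - 1) - hole >= 2;	/* arbitrary threshold */
	}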

And received positive responses:
[QUOTE from Michael Stone]
	I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
	with the patch the job took 1.7M ms. Another VACUUM that normally takes
	between 300k-500k ms took 150k. Definitely a promising addition.

[QUOTE from Michael Stone]
	>I'm thinking about it, we're already using a fixed read-ahead of 16MB
	>using blockdev on the stock Redhat 2.6.9 kernel, it would be nice to
	>not have to set this so we may try it.

	FWIW, I never saw much performance difference from doing that. Wu's
	patch, OTOH, gave a big boost.

[QUOTE: odbc-bench with Postgresql 7.4.11 on dual Opteron]
	Base kernel:
	 Transactions per second:                92.384758
	 Transactions per second:                99.800896

	After the read-ahead patch, vm.readahead_ratio = 100:
	 Transactions per second:                105.461952
	 Transactions per second:                105.458664

	vm.readahead_ratio = 100 ; vm.readahead_hit_rate = 1:
	 Transactions per second:                113.055367
	 Transactions per second:                124.815910


Patches
=======
All 33 patches are bisect friendly:
special care has been taken to make them compile cleanly at each step.

The following 29 patches are only logically separated -
one should not remove any one of them and expect the others to compile cleanly:

[patch 01/33] readahead: kconfig options
[patch 02/33] radixtree: look-aside cache
[patch 03/33] radixtree: hole scanning functions
[patch 04/33] readahead: page flag PG_readahead
[patch 05/33] readahead: refactor do_generic_mapping_read()
[patch 06/33] readahead: refactor __do_page_cache_readahead()
[patch 07/33] readahead: insert cond_resched() calls
[patch 08/33] readahead: common macros
[patch 09/33] readahead: events accounting
[patch 10/33] readahead: support functions
[patch 11/33] readahead: sysctl parameters
[patch 12/33] readahead: min/max sizes
[patch 13/33] readahead: state based method - aging accounting
[patch 14/33] readahead: state based method - data structure
[patch 15/33] readahead: state based method - routines
[patch 16/33] readahead: state based method
[patch 17/33] readahead: context based method
[patch 18/33] readahead: initial method - guiding sizes
[patch 19/33] readahead: initial method - thrashing guard size
[patch 20/33] readahead: initial method - expected read size
[patch 21/33] readahead: initial method - user recommended size
[patch 22/33] readahead: initial method
[patch 23/33] readahead: backward prefetching method
[patch 24/33] readahead: seeking reads method
[patch 25/33] readahead: thrashing recovery method
[patch 26/33] readahead: call scheme
[patch 27/33] readahead: laptop mode
[patch 28/33] readahead: loop case
[patch 29/33] readahead: nfsd case

The following 4 patches are for debugging purpose, and for -mm only:

[patch 30/33] readahead: turn on by default
[patch 31/33] readahead: debug radix tree new functions
[patch 32/33] readahead: debug traces showing accessed file names
[patch 33/33] readahead: debug traces showing read patterns


Diffstat
========
 Documentation/sysctl/vm.txt |   37 
 block/ll_rw_blk.c           |   34 
 drivers/block/loop.c        |    6 
 fs/file_table.c             |    7 
 fs/mpage.c                  |    4 
 fs/nfsd/vfs.c               |    5 
 include/linux/backing-dev.h |    3 
 include/linux/fs.h          |   57 +
 include/linux/mm.h          |   31 
 include/linux/mmzone.h      |    5 
 include/linux/page-flags.h  |    5 
 include/linux/radix-tree.h  |   87 ++
 include/linux/sysctl.h      |    2 
 include/linux/writeback.h   |    6 
 kernel/sysctl.c             |   28 
 lib/radix-tree.c            |  202 ++++-
 mm/Kconfig                  |   62 +
 mm/filemap.c                |   90 ++
 mm/page-writeback.c         |    2 
 mm/page_alloc.c             |    2 
 mm/readahead.c              | 1641 +++++++++++++++++++++++++++++++++++++++++++-
 mm/swap.c                   |    2 
 mm/vmscan.c                 |    4 
 23 files changed, 2262 insertions(+), 60 deletions(-)


Changelog
=========

V12  2006-05-24
- improve small files case
- allow pausing of events accounting
- disable sparse read-ahead by default
- a bug fix in radix_tree_cache_lookup_parent()
- more cleanups

V11  2006-03-19
- patchset rework
- add kconfig option to make the feature compile-time selectable
- improve radix tree scan functions
- fix bug of using smp_processor_id() in preemptible code
- avoid overflow in compute_thrashing_threshold()
- disable sparse read prefetching if (readahead_hit_rate == 1)
- make thrashing recovery a standalone function
- random cleanups

V10  2005-12-16
- remove delayed page activation
- remove live page protection
- revert mmap readaround to old behavior
- default to original readahead logic
- default to original readahead size
- merge comment fixes from Andreas Mohr
- merge radixtree cleanups from Christoph Lameter
- reduce sizeof(struct file_ra_state) by unnamed union
- stateful method cleanups
- account other read-ahead paths

V9  2005-12-3
- standalone mmap read-around code, a little smarter and more tunable
- make stateful method sensible of request size
- decouple readahead_ratio from live pages protection
- let readahead_ratio contribute to ra_size grow speed in stateful method
- account variance of ra_size

V8  2005-11-25

- balance zone aging only in page reclaim paths and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low latency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values

V7  2005-11-09

- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if drive is spun down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection more simple and robust
- attempt to improve responsiveness on large read-ahead size

V6  2005-11-01

- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages

V5  2005-10-28

- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of code pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities

V4  2005-10-15

- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache

V3  2005-10-06

- major code reorganization and documentation
- stateful estimation of thrashing-threshold
- context method with accelerated grow-up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statistics data collection
- synchronized page aging between zones

V2  2005-09-15

- delayed page activation
- look-ahead: towards pipelined read-ahead

V1  2005-09-13

Initial release which features:
        o stateless (for now)
        o adapts to available memory / read speed
        o free of thrashing (in theory)

And handles:
        o large number of slow streams (FTP server)
        o open/read/close access patterns (NFS server)
        o multiple interleaved, sequential streams in one file
          (multithread / multimedia / database)

Cheers,
Wu Fengguang
--
Dept. Automation                University of Science and Technology of China


* [PATCH 02/33] radixtree: look-aside cache
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Nick Piggin, Christoph Lameter

[-- Attachment #1: radixtree-lookaside-cache.patch --]
[-- Type: text/plain, Size: 9264 bytes --]

Introduce a set of lookup functions to the radix tree for the read-ahead logic.
Other access patterns with high locality may also benefit from them.

- radix_tree_lookup_parent(root, index, level)
	Perform partial lookup, return the @level'th parent of the slot at
	@index.

- radix_tree_cache_xxx()
	Init/Query the cache.
- radix_tree_cache_lookup(root, cache, index)
	Perform lookup with the aid of a look-aside cache.
	For sequential scans, the cached bottom-level node satisfies all but
	one of every RADIX_TREE_MAP_SIZE lookups in O(1); only that one falls
	through to a full O(logN) tree walk. With the default
	RADIX_TREE_MAP_SHIFT of 6 this amounts to 64*O(1) + 1*O(logN) per
	64 consecutive pages.

	Typical usage:

   void func() {
  +       struct radix_tree_cache cache;
  +
  +       radix_tree_cache_init(&cache);
          read_lock_irq(&mapping->tree_lock);
          for(;;) {
  -               page = radix_tree_lookup(&mapping->page_tree, index);
  +               page = radix_tree_cache_lookup(&mapping->page_tree, &cache, index);
          }
          read_unlock_irq(&mapping->tree_lock);
   }

Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/radix-tree.h |   83 +++++++++++++++++++++++++++++++++++
 lib/radix-tree.c           |  104 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 161 insertions(+), 26 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/radix-tree.h
+++ linux-2.6.17-rc4-mm3/include/linux/radix-tree.h
@@ -26,12 +26,29 @@
 #define RADIX_TREE_MAX_TAGS 2
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
+#ifdef __KERNEL__
+#define RADIX_TREE_MAP_SHIFT	(CONFIG_BASE_SMALL ? 4 : 6)
+#else
+#define RADIX_TREE_MAP_SHIFT	3	/* For more stressful testing */
+#endif
+
+#define RADIX_TREE_MAP_SIZE	(1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK	(RADIX_TREE_MAP_SIZE-1)
+
 struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	*rnode;
 };
 
+/*
+ * Lookaside cache to support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+	unsigned long first_index;
+	struct radix_tree_node *tree_node;
+};
+
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
 	.gfp_mask = (mask),						\
@@ -49,9 +66,14 @@ do {									\
 } while (0)
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_parent(struct radix_tree_root *, unsigned long,
+							unsigned int);
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache);
+void *radix_tree_cache_lookup_parent(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index, unsigned int level);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
@@ -74,4 +96,61 @@ static inline void radix_tree_preload_en
 	preempt_enable();
 }
 
+/**
+ *	radix_tree_lookup    -    perform lookup operation on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+							unsigned long index)
+{
+	return radix_tree_lookup_parent(root, index, 0);
+}
+
+/**
+ *	radix_tree_cache_init    -    init a look-aside cache
+ *	@cache:		look-aside cache
+ *
+ *	Init the radix tree look-aside cache @cache.
+ */
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+	cache->first_index = RADIX_TREE_MAP_MASK;
+	cache->tree_node = NULL;
+}
+
+/**
+ *	radix_tree_cache_lookup    -    cached lookup on a radix tree
+ *	@root:		radix tree root
+ *	@cache:		look-aside cache
+ *	@index:		index key
+ *
+ *	Lookup the item at the position @index in the radix tree @root,
+ *	and make use of @cache to speedup the lookup process.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+						struct radix_tree_cache *cache,
+						unsigned long index)
+{
+	return radix_tree_cache_lookup_parent(root, cache, index, 0);
+}
+
+static inline unsigned int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+	return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+	return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline unsigned long
+radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+	return cache->first_index;
+}
+
 #endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.17-rc4-mm3.orig/lib/radix-tree.c
+++ linux-2.6.17-rc4-mm3/lib/radix-tree.c
@@ -309,36 +309,90 @@ int radix_tree_insert(struct radix_tree_
 }
 EXPORT_SYMBOL(radix_tree_insert);
 
-static inline void **__lookup_slot(struct radix_tree_root *root,
-				   unsigned long index)
+/**
+ *	radix_tree_lookup_parent    -    low level lookup routine
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@level:		stop at that many levels from the tree leaf
+ *
+ *	Lookup the @level'th parent of the slot at @index in radix tree @root.
+ *	The return value is:
+ *	@level == 0:      page at @index;
+ *	@level == 1:      the corresponding bottom level tree node;
+ *	@level < height:  (@level-1)th parent node of the bottom node
+ *			  that contains @index;
+ *	@level >= height: the root node.
+ */
+void *radix_tree_lookup_parent(struct radix_tree_root *root,
+				unsigned long index, unsigned int level)
 {
 	unsigned int height, shift;
-	struct radix_tree_node **slot;
+	struct radix_tree_node *slot;
 
 	height = root->height;
 
 	if (index > radix_tree_maxindex(height))
 		return NULL;
 
-	if (height == 0 && root->rnode)
-		return (void **)&root->rnode;
-
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
-	slot = &root->rnode;
+	slot = root->rnode;
 
-	while (height > 0) {
-		if (*slot == NULL)
+	while (height > level) {
+		if (slot == NULL)
 			return NULL;
 
-		slot = (struct radix_tree_node **)
-			((*slot)->slots +
-				((index >> shift) & RADIX_TREE_MAP_MASK));
+		slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
 
-	return (void **)slot;
+	return slot;
+}
+EXPORT_SYMBOL(radix_tree_lookup_parent);
+
+/**
+ *	radix_tree_cache_lookup_parent    -    cached lookup node
+ *	@root:		radix tree root
+ *	@cache:		look-aside cache
+ *	@index:		index key
+ *	@level:		stop at that many levels from the tree leaf
+ *
+ *	Lookup the item at the position @index in the radix tree @root,
+ *	and return the node @level levels from the bottom in the search path.
+ *
+ *	@cache stores the last accessed upper level tree node by this
+ *	function, and is always checked first before searching in the tree.
+ *	It can improve speed for access patterns with strong locality.
+ *
+ *	NOTE:
+ *	- The cache becomes invalid on leaving the lock;
+ *	- Do not intermix calls with different @level.
+ */
+void *radix_tree_cache_lookup_parent(struct radix_tree_root *root,
+				struct radix_tree_cache *cache,
+				unsigned long index, unsigned int level)
+{
+	struct radix_tree_node *node;
+	unsigned long i;
+	unsigned long mask;
+
+	if (level >= root->height)
+		return radix_tree_lookup_parent(root, index, level);
+
+	i = (index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK;
+	mask = (~0UL) << ((level + 1) * RADIX_TREE_MAP_SHIFT);
+
+	if ((index & mask) == cache->first_index)
+		return cache->tree_node->slots[i];
+
+	node = radix_tree_lookup_parent(root, index, level + 1);
+	if (!node)
+		return NULL;
+
+	cache->tree_node = node;
+	cache->first_index = (index & mask);
+	return node->slots[i];
 }
+EXPORT_SYMBOL(radix_tree_cache_lookup_parent);
 
 /**
  *	radix_tree_lookup_slot    -    lookup a slot in a radix tree
@@ -350,25 +404,27 @@ static inline void **__lookup_slot(struc
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return __lookup_slot(root, index);
+	struct radix_tree_node *node;
+
+	node = radix_tree_lookup_parent(root, index, 1);
+	return node ? node->slots + (index & RADIX_TREE_MAP_MASK) : NULL;
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
 /**
- *	radix_tree_lookup    -    perform lookup operation on a radix tree
- *	@root:		radix tree root
- *	@index:		index key
+ *	radix_tree_cache_count    -    items in the cached node
+ *	@cache:      radix tree look-aside cache
  *
- *	Lookup the item at the position @index in the radix tree @root.
+ *      Query the number of items contained in the cached node.
  */
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned int radix_tree_cache_count(struct radix_tree_cache *cache)
 {
-	void **slot;
-
-	slot = __lookup_slot(root, index);
-	return slot != NULL ? *slot : NULL;
+	if (!(cache->first_index & RADIX_TREE_MAP_MASK))
+		return cache->tree_node->count;
+	else
+		return 0;
 }
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_cache_count);
 
 /**
  *	radix_tree_tag_set - set a tag on a radix tree node

--


* [PATCH 03/33] radixtree: hole scanning functions
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: radixtree-scan-hole.patch --]
[-- Type: text/plain, Size: 3784 bytes --]

Introduce a pair of functions to scan the radix tree for a hole (empty item).
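
A hypothetical usage sketch (the caller below is invented for illustration;
it assumes index > 0, that mapping->tree_lock is held, and 1000 is an
arbitrary scan limit):

	/*
	 * Find the extent of the contiguous cached run around @index:
	 * scan backward for the first hole below it, and forward for
	 * the first hole above it.
	 */
	static void cached_run(struct address_space *mapping, pgoff_t index,
			       pgoff_t *first, pgoff_t *last)
	{
		*first = 1 + radix_tree_scan_hole_backward(
				&mapping->page_tree, index - 1, 1000);
		*last = radix_tree_scan_hole(&mapping->page_tree,
				index, 1000);
	}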

 include/linux/radix-tree.h |    4 +
 lib/radix-tree.c           |  104 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

--- linux-2.6.17-rc4-mm3.orig/include/linux/radix-tree.h
+++ linux-2.6.17-rc4-mm3/include/linux/radix-tree.h
@@ -74,6 +74,10 @@ unsigned int radix_tree_cache_count(stru
 void *radix_tree_cache_lookup_parent(struct radix_tree_root *root,
 				struct radix_tree_cache *cache,
 				unsigned long index, unsigned int level);
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
--- linux-2.6.17-rc4-mm3.orig/lib/radix-tree.c
+++ linux-2.6.17-rc4-mm3/lib/radix-tree.c
@@ -427,6 +427,110 @@ unsigned int radix_tree_cache_count(stru
 EXPORT_SYMBOL(radix_tree_cache_count);
 
 /**
+ *	radix_tree_scan_hole_backward    -    scan backward for hole
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan backward from @index for a hole/empty item, stop when
+ *      - hit hole
+ *      - @max_scan or more items scanned
+ *      - hit index 0
+ *
+ *      Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole_backward(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct radix_tree_node *node;
+	unsigned long origin;
+	int i;
+
+	origin = index;
+	radix_tree_cache_init(&cache);
+
+	while (origin - index < max_scan) {
+		node = radix_tree_cache_lookup_parent(root, &cache, index, 1);
+		if (!node)
+			break;
+
+		if (node->count == RADIX_TREE_MAP_SIZE) {
+			index = (index - RADIX_TREE_MAP_SIZE) |
+					RADIX_TREE_MAP_MASK;
+			goto check_underflow;
+		}
+
+		for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+			if (!node->slots[i])
+				goto out;
+		}
+
+check_underflow:
+		if (unlikely(index == ULONG_MAX)) {
+			index = 0;
+			break;
+		}
+	}
+
+out:
+	return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole_backward);
+
+/**
+ *	radix_tree_scan_hole    -    scan for hole
+ *	@root:		radix tree root
+ *	@index:		index key
+ *	@max_scan:      advice on max items to scan (it may scan a little more)
+ *
+ *      Scan forward from @index for a hole/empty item, stop when
+ *      - hit hole
+ *      - hit EOF
+ *      - hit index ULONG_MAX
+ *      - @max_scan or more items scanned
+ *
+ *      Return the corresponding index.
+ */
+unsigned long radix_tree_scan_hole(struct radix_tree_root *root,
+				unsigned long index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct radix_tree_node *node;
+	unsigned long origin;
+	int i;
+
+	origin = index;
+	radix_tree_cache_init(&cache);
+
+	while (index - origin < max_scan) {
+		node = radix_tree_cache_lookup_parent(root, &cache, index, 1);
+		if (!node)
+			break;
+
+		if (node->count == RADIX_TREE_MAP_SIZE) {
+			index = (index | RADIX_TREE_MAP_MASK) + 1;
+			goto check_overflow;
+		}
+
+		for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE;
+								i++, index++) {
+			if (!node->slots[i])
+				goto out;
+		}
+
+check_overflow:
+		if (unlikely(!index)) {
+			index = ULONG_MAX;
+			break;
+		}
+	}
+out:
+	return index;
+}
+EXPORT_SYMBOL(radix_tree_scan_hole);
+
+/**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
  *	@index:		index key

--


* [PATCH 04/33] readahead: page flag PG_readahead
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-page-flag-PG_readahead.patch --]
[-- Type: text/plain, Size: 1792 bytes --]

A new page flag, PG_readahead, is introduced as a look-ahead mark: it
reminds the caller to give the adaptive read-ahead logic a chance to
do read-ahead ahead of time, for I/O pipelining.

It roughly corresponds to `ahead_start' of the stock read-ahead logic.
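
How the mark is intended to be consumed (a sketch; readahead_submit_next()
is a made-up name, the actual call scheme is wired up by a later patch in
this series):

	/*
	 * In the read path: hitting the marked page means the reader
	 * has advanced deep enough into the current read-ahead chunk,
	 * so submit the next chunk now to keep the I/O pipelined.
	 */
	if (TestClearPageReadahead(page))
		readahead_submit_next(mapping, filp, page->index);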

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/page-flags.h |    5 +++++
 mm/page_alloc.c            |    2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
+++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
@@ -89,6 +89,7 @@
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
+#define PG_readahead		20	/* Reminder to do readahead */
 
 
 #if (BITS_PER_LONG > 32)
@@ -372,6 +373,10 @@ extern void __mod_page_state_offset(unsi
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
--- linux-2.6.17-rc4-mm3.orig/mm/page_alloc.c
+++ linux-2.6.17-rc4-mm3/mm/page_alloc.c
@@ -564,7 +564,7 @@ static int prep_new_page(struct page *pa
 	if (PageReserved(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);

--


* [PATCH 05/33] readahead: refactor do_generic_mapping_read()
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-refactor-do_generic_mapping_read.patch --]
[-- Type: text/plain, Size: 2538 bytes --]

In do_generic_mapping_read(), delay the release of accessed pages, so
that they can be passed to and used by the adaptive read-ahead code.
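
The change boils down to this pattern (a simplified sketch, not the
actual diff):

	/*
	 * Keep one reference to the previously accessed page instead of
	 * dropping it right away, so that the read-ahead code can still
	 * inspect it (e.g. test its page flags).  The reference moves
	 * forward with each iteration and is dropped once more on exit.
	 */
	if (prev_page)
		page_cache_release(prev_page);
	prev_page = page;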

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/filemap.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -813,10 +813,12 @@ void do_generic_mapping_read(struct addr
 	unsigned long prev_index;
 	loff_t isize;
 	struct page *cached_page;
+	struct page *prev_page;
 	int error;
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
+	prev_page = NULL;
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	next_index = index;
 	prev_index = ra.prev_page;
@@ -855,6 +857,11 @@ find_page:
 			handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
+
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -889,7 +896,6 @@ page_ok:
 		index += offset >> PAGE_CACHE_SHIFT;
 		offset &= ~PAGE_CACHE_MASK;
 
-		page_cache_release(page);
 		if (ret == nr && desc->count)
 			continue;
 		goto out;
@@ -901,7 +907,6 @@ page_not_up_to_date:
 		/* Did it get unhashed before we got the lock? */
 		if (!page->mapping) {
 			unlock_page(page);
-			page_cache_release(page);
 			continue;
 		}
 
@@ -931,7 +936,6 @@ readpage:
 					 * invalidate_inode_pages got it
 					 */
 					unlock_page(page);
-					page_cache_release(page);
 					goto find_page;
 				}
 				unlock_page(page);
@@ -952,7 +956,6 @@ readpage:
 		isize = i_size_read(inode);
 		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 		if (unlikely(!isize || index > end_index)) {
-			page_cache_release(page);
 			goto out;
 		}
 
@@ -961,7 +964,6 @@ readpage:
 		if (index == end_index) {
 			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
 			if (nr <= offset) {
-				page_cache_release(page);
 				goto out;
 			}
 		}
@@ -971,7 +973,6 @@ readpage:
 readpage_error:
 		/* UHHUH! A synchronous read error occurred. Report it */
 		desc->error = error;
-		page_cache_release(page);
 		goto out;
 
 no_cached_page:
@@ -996,6 +997,9 @@ no_cached_page:
 		}
 		page = cached_page;
 		cached_page = NULL;
+		if (prev_page)
+			page_cache_release(prev_page);
+		prev_page = page;
 		goto readpage;
 	}
 
@@ -1005,6 +1009,8 @@ out:
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
 		page_cache_release(cached_page);
+	if (prev_page)
+		page_cache_release(prev_page);
 	if (filp)
 		file_accessed(filp);
 }

--


* [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-refactor-__do_page_cache_readahead.patch --]
[-- Type: text/plain, Size: 2546 bytes --]

Add look-ahead support to __do_page_cache_readahead(),
which is needed by the adaptive read-ahead logic.
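
For example (the numbers are illustrative only), a call such as:

	/* 32-page read-ahead chunk, look-ahead mark 8 pages from its end */
	__do_page_cache_readahead(mapping, filp, offset, 32, 8);

reads 32 pages starting at offset and sets PG_readahead on the page at
offset+24 (i.e. nr_to_read - lookahead_size pages in), so that the next
chunk can be submitted 8 pages before the current one is exhausted.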

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -266,7 +266,8 @@ out:
  */
 static int
 __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
-			pgoff_t offset, unsigned long nr_to_read)
+			pgoff_t offset, unsigned long nr_to_read,
+			unsigned long lookahead_size)
 {
 	struct inode *inode = mapping->host;
 	struct page *page;
@@ -279,7 +280,7 @@ __do_page_cache_readahead(struct address
 	if (isize == 0)
 		goto out;
 
- 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
+		if (page_idx == nr_to_read - lookahead_size)
+			__SetPageReadahead(page);
 		ret++;
 	}
 	read_unlock_irq(&mapping->tree_lock);
@@ -338,7 +341,7 @@ int force_page_cache_readahead(struct ad
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk);
+						offset, this_chunk, 0);
 		if (err < 0) {
 			ret = err;
 			break;
@@ -385,7 +388,7 @@ int do_page_cache_readahead(struct addre
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 }
 
 /*
@@ -405,7 +408,7 @@ blockable_page_cache_readahead(struct ad
 	if (!block && bdi_read_congested(mapping->backing_dev_info))
 		return 0;
 
-	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 
 	return check_ra_success(ra, nr_to_read, actual);
 }
@@ -450,7 +453,7 @@ static int make_ahead_window(struct addr
  * @req_size: hint: total size of the read which the caller is performing in
  *            PAGE_CACHE_SIZE units
  *
- * page_cache_readahead() is the main function.  If performs the adaptive
+ * page_cache_readahead() is the main function.  It performs the adaptive
  * readahead window size management and submits the readahead I/O.
  *
  * Note that @filp is purely used for passing on to the ->readpage[s]()

--


* [PATCH 07/33] readahead: insert cond_resched() calls
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Con Kolivas

[-- Attachment #1: readahead-insert-cond_resched-calls.patch --]
[-- Type: text/plain, Size: 2140 bytes --]

Since VM_MAX_READAHEAD is greatly enlarged and the algorithm is more
complex, it becomes necessary to insert some cond_resched() calls into
the read-ahead path.

If desktop users still feel audio jitters with the new read-ahead code,
please try one of the following ways to get rid of it:

1) compile kernel with CONFIG_PREEMPT_VOLUNTARY/CONFIG_PREEMPT
2) reduce the read-ahead request size by running
	blockdev --setra 256 /dev/hda # or whatever device you are using

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


This patch was recommended by Con Kolivas to improve response time for desktops.
Thanks!

 fs/mpage.c     |    4 +++-
 mm/readahead.c |    9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -146,8 +146,10 @@ int read_cache_pages(struct address_spac
 			continue;
 		}
 		ret = filler(data, page);
-		if (!pagevec_add(&lru_pvec, page))
+		if (!pagevec_add(&lru_pvec, page)) {
+			cond_resched();
 			__pagevec_lru_add(&lru_pvec);
+		}
 		if (ret) {
 			while (!list_empty(pages)) {
 				struct page *victim;
@@ -184,8 +186,10 @@ static int read_pages(struct address_spa
 		if (!add_to_page_cache(page, mapping,
 					page->index, GFP_KERNEL)) {
 			mapping->a_ops->readpage(filp, page);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else
 			page_cache_release(page);
 	}
@@ -297,6 +301,7 @@ __do_page_cache_readahead(struct address
 			continue;
 
 		read_unlock_irq(&mapping->tree_lock);
+		cond_resched();
 		page = page_cache_alloc_cold(mapping);
 		read_lock_irq(&mapping->tree_lock);
 		if (!page)
--- linux-2.6.17-rc4-mm3.orig/fs/mpage.c
+++ linux-2.6.17-rc4-mm3/fs/mpage.c
@@ -407,8 +407,10 @@ mpage_readpages(struct address_space *ma
 					&last_block_in_bio, &map_bh,
 					&first_logical_block,
 					get_block);
-			if (!pagevec_add(&lru_pvec, page))
+			if (!pagevec_add(&lru_pvec, page)) {
+				cond_resched();
 				__pagevec_lru_add(&lru_pvec);
+			}
 		} else {
 			page_cache_release(page);
 		}

--


* [PATCH 08/33] readahead: common macros
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-common-macros.patch --]
[-- Type: text/plain, Size: 1649 bytes --]

Define some commonly used macros for the read-ahead logic.
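
For illustration, assuming 4KB pages:

	PAGES_BYTE(5000) == 2;	/* two pages cover 5000 bytes */
	PAGES_KB(128) == 32;	/* same as (128 * 1024) / PAGE_CACHE_SIZE */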

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -5,6 +5,8 @@
  *
  * 09Apr2002	akpm@zip.com.au
  *		Initial version.
+ * 21May2006	Wu Fengguang <wfg@mail.ustc.edu.cn>
+ *		Adaptive read-ahead framework.
  */
 
 #include <linux/kernel.h>
@@ -14,6 +16,14 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/writeback.h>
+#include <linux/nfsd/const.h>
+
+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
+#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -21,7 +31,7 @@ void default_unplug_io_fn(struct backing
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+	.ra_pages	= PAGES_KB(VM_MAX_READAHEAD),
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
@@ -50,7 +60,7 @@ static inline unsigned long get_max_read
 
 static inline unsigned long get_min_readahead(struct file_ra_state *ra)
 {
-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	return PAGES_KB(VM_MIN_READAHEAD);
 }
 
 static inline void reset_ahead_window(struct file_ra_state *ra)

--


* [PATCH 09/33] readahead: events accounting
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Jörn Engel, Ingo Oeser

[-- Attachment #1: readahead-events-accounting.patch --]
[-- Type: text/plain, Size: 10611 bytes --]

A debugfs file named `readahead/events' is created following advice from
Jörn Engel, Andrew Morton and Ingo Oeser.

It reveals various read-ahead activities/events, and is vital to the testing.

---------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.

- Preparations

## First compile kernel with CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debugfs none /debug

- For each session with distinct access pattern

echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
# echo 3 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 1 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

The commented out commands can uncover more detailed file accesses,
which are useful sometimes. Note that the log file can grow huge!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  293 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 292 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -19,12 +19,76 @@
 #include <linux/writeback.h>
 #include <linux/nfsd/const.h>
 
+/*
+ * Detailed classification of read-ahead behaviors.
+ */
+#define RA_CLASS_SHIFT 4
+#define RA_CLASS_MASK  ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+	RA_CLASS_ALL,
+	RA_CLASS_INITIAL,
+	RA_CLASS_STATE,
+	RA_CLASS_CONTEXT,
+	RA_CLASS_CONTEXT_AGGRESSIVE,
+	RA_CLASS_BACKWARD,
+	RA_CLASS_THRASHING,
+	RA_CLASS_SEEK,
+	RA_CLASS_NONE,
+	RA_CLASS_COUNT
+};
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+	RA_EVENT_CACHE_MISS,		/* read cache misses */
+	RA_EVENT_RANDOM_READ,		/* random reads */
+	RA_EVENT_IO_CONGESTION,		/* i/o congestion */
+	RA_EVENT_IO_CACHE_HIT,		/* canceled i/o due to cache hit */
+	RA_EVENT_IO_BLOCK,		/* wait for i/o completion */
+
+	RA_EVENT_READAHEAD,		/* read-ahead issued */
+	RA_EVENT_READAHEAD_HIT,		/* read-ahead page hit */
+	RA_EVENT_LOOKAHEAD,		/* look-ahead issued */
+	RA_EVENT_LOOKAHEAD_HIT,		/* look-ahead mark hit */
+	RA_EVENT_LOOKAHEAD_NOACTION,	/* look-ahead mark ignored */
+	RA_EVENT_READAHEAD_MMAP,	/* read-ahead for mmap access */
+	RA_EVENT_READAHEAD_EOF,		/* read-ahead reaches EOF */
+	RA_EVENT_READAHEAD_SHRINK,	/* ra_size falls under previous la_size */
+	RA_EVENT_READAHEAD_THRASHING,	/* read-ahead thrashing happened */
+	RA_EVENT_READAHEAD_MUTILATE,	/* read-ahead mutilated by imbalanced aging */
+	RA_EVENT_READAHEAD_RESCUE,	/* read-ahead rescued */
+
+	RA_EVENT_READAHEAD_CUBE,
+	RA_EVENT_COUNT
+};
+
+#ifdef CONFIG_DEBUG_READAHEAD
+u32 initial_ra_hit;
+u32 initial_ra_miss;
+u32 debug_level = 1;
+u32 disable_stateful_method = 0;
+static const char * const ra_class_name[];
+static void ra_account(struct file_ra_state *ra, enum ra_event e, int pages);
+#  define debug_inc(var)		do { var++; } while (0)
+#  define debug_option(o)		(o)
+#else
+#  define ra_account(ra, e, pages)	do { } while (0)
+#  define debug_inc(var)		do { } while (0)
+#  define debug_option(o)		(0)
+#  define debug_level 			(0)
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+#define dprintk(args...) \
+	do { if (debug_level >= 2) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+	do { if (debug_level >= 3) printk(KERN_DEBUG args); } while(0)
+
 #define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
 #define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
 
 #define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
 #define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
 
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
@@ -365,6 +429,9 @@ int force_page_cache_readahead(struct ad
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
 	}
+
+	ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
 	return ret;
 }
 
@@ -400,10 +467,16 @@ static inline int check_ra_success(struc
 int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read)
 {
+	unsigned long ret;
+
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
-	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+	ret = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+	ra_account(NULL, RA_EVENT_READAHEAD, ret);
+
+	return ret;
 }
 
 /*
@@ -425,6 +498,10 @@ blockable_page_cache_readahead(struct ad
 
 	actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 
+	ra_account(NULL, RA_EVENT_READAHEAD, actual);
+	dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+			mapping->host->i_ino, offset, nr_to_read, actual);
+
 	return check_ra_success(ra, nr_to_read, actual);
 }
 
@@ -604,3 +681,217 @@ unsigned long max_sane_readahead(unsigne
 	__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
 	return min(nr, (inactive + free) / 2);
 }
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef CONFIG_DEBUG_READAHEAD
+
+#include <linux/init.h>
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+static const char * const ra_class_name[] = {
+	"total",
+	"initial",
+	"state",
+	"context",
+	"contexta",
+	"backward",
+	"onthrash",
+	"onseek",
+	"none"
+};
+
+static const char * const ra_event_name[] = {
+	"cache_miss",
+	"random_read",
+	"io_congestion",
+	"io_cache_hit",
+	"io_block",
+	"readahead",
+	"readahead_hit",
+	"lookahead",
+	"lookahead_hit",
+	"lookahead_ignore",
+	"readahead_mmap",
+	"readahead_eof",
+	"readahead_shrink",
+	"readahead_thrash",
+	"readahead_mutilt",
+	"readahead_rescue"
+};
+
+static unsigned long ra_events[RA_CLASS_COUNT][RA_EVENT_COUNT][2];
+
+static void ra_account(struct file_ra_state *ra, enum ra_event e, int pages)
+{
+	enum ra_class c;
+
+	if (!debug_level)
+		return;
+
+	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		pages = -pages;
+	} else if (ra)
+		c = ra->flags & RA_CLASS_MASK;
+	else
+		c = RA_CLASS_NONE;
+
+	if (!c)
+		c = RA_CLASS_NONE;
+
+	ra_events[c][e][0] += 1;
+	ra_events[c][e][1] += pages;
+
+	if (e == RA_EVENT_READAHEAD)
+		ra_events[c][RA_EVENT_READAHEAD_CUBE][1] += pages * pages;
+}
+
+static int ra_events_show(struct seq_file *s, void *_)
+{
+	int i;
+	int c;
+	int e;
+	static const char event_fmt[] = "%-16s";
+	static const char class_fmt[] = "%10s";
+	static const char item_fmt[] = "%10lu";
+	static const char percent_format[] = "%9lu%%";
+	static const char * const table_name[] = {
+		"[table requests]",
+		"[table pages]",
+		"[table summary]"};
+
+	for (i = 0; i <= 1; i++) {
+		for (e = 0; e < RA_EVENT_COUNT; e++) {
+			ra_events[RA_CLASS_ALL][e][i] = 0;
+			for (c = RA_CLASS_INITIAL; c < RA_CLASS_NONE; c++)
+				ra_events[RA_CLASS_ALL][e][i] += ra_events[c][e][i];
+		}
+
+		seq_printf(s, event_fmt, table_name[i]);
+		for (c = 0; c < RA_CLASS_COUNT; c++)
+			seq_printf(s, class_fmt, ra_class_name[c]);
+		seq_puts(s, "\n");
+
+		for (e = 0; e < RA_EVENT_COUNT; e++) {
+			if (e == RA_EVENT_READAHEAD_CUBE)
+				continue;
+			if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+				continue;
+			if (e == RA_EVENT_IO_BLOCK && i == 1)
+				continue;
+
+			seq_printf(s, event_fmt, ra_event_name[e]);
+			for (c = 0; c < RA_CLASS_COUNT; c++)
+				seq_printf(s, item_fmt, ra_events[c][e][i]);
+			seq_puts(s, "\n");
+		}
+		seq_puts(s, "\n");
+	}
+
+	seq_printf(s, event_fmt, table_name[2]);
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, class_fmt, ra_class_name[c]);
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "random_rate");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_RANDOM_READ][0] * 100) /
+			((ra_events[c][RA_EVENT_RANDOM_READ][0] +
+			  ra_events[c][RA_EVENT_READAHEAD][0]) | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "ra_hit_rate");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+			(ra_events[c][RA_EVENT_READAHEAD][1] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "la_hit_rate");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, percent_format,
+			(ra_events[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+			(ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "var_ra_size");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_READAHEAD_CUBE][1] -
+			 ra_events[c][RA_EVENT_READAHEAD][1] *
+			(ra_events[c][RA_EVENT_READAHEAD][1] /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1))) /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_ra_size");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_READAHEAD][1] +
+			 ra_events[c][RA_EVENT_READAHEAD][0] / 2) /
+			(ra_events[c][RA_EVENT_READAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	seq_printf(s, event_fmt, "avg_la_size");
+	for (c = 0; c < RA_CLASS_COUNT; c++)
+		seq_printf(s, item_fmt,
+			(ra_events[c][RA_EVENT_LOOKAHEAD][1] +
+			 ra_events[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+			(ra_events[c][RA_EVENT_LOOKAHEAD][0] | 1));
+	seq_puts(s, "\n");
+
+	return 0;
+}
+
+static int ra_events_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, ra_events_show, NULL);
+}
+
+static ssize_t ra_events_write(struct file *file, const char __user *buf,
+						size_t size, loff_t *offset)
+{
+	memset(ra_events, 0, sizeof(ra_events));
+	return size;	/* consume the whole write */
+}
+
+static struct file_operations ra_events_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ra_events_open,
+	.write		= ra_events_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+#define READAHEAD_DEBUGFS_ENTRY_U32(var) \
+	debugfs_create_u32(__stringify(var), 0644, root, &var)
+
+#define READAHEAD_DEBUGFS_ENTRY_BOOL(var) \
+	debugfs_create_bool(__stringify(var), 0644, root, &var)
+
+static int __init readahead_init(void)
+{
+	struct dentry *root;
+
+	root = debugfs_create_dir("readahead", NULL);
+
+	debugfs_create_file("events", 0644, root, NULL, &ra_events_fops);
+
+	READAHEAD_DEBUGFS_ENTRY_U32(initial_ra_hit);
+	READAHEAD_DEBUGFS_ENTRY_U32(initial_ra_miss);
+
+	READAHEAD_DEBUGFS_ENTRY_U32(debug_level);
+	READAHEAD_DEBUGFS_ENTRY_BOOL(disable_stateful_method);
+
+	return 0;
+}
+
+module_init(readahead_init)
+
+#endif /* CONFIG_DEBUG_READAHEAD */

--


* [PATCH 10/33] readahead: support functions
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-support-functions.patch --]
[-- Type: text/plain, Size: 4222 bytes --]

Several support functions of adaptive read-ahead.
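
Among them, prefer_adaptive_readahead() is the runtime switch between the
stock and the adaptive logic. A sketch of the expected use at call sites
(page_cache_readahead_adaptive() is a name assumed here; the actual call
scheme comes with patch 26):

	if (prefer_adaptive_readahead())
		page_cache_readahead_adaptive(mapping, &ra, filp,
						index, req_size);
	else
		page_cache_readahead(mapping, &ra, filp, index, req_size);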

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |   11 +++++
 mm/readahead.c     |  107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1029,6 +1029,17 @@ void handle_ra_miss(struct address_space
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+extern int readahead_ratio;
+#else
+#define readahead_ratio 1
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+static inline int prefer_adaptive_readahead(void)
+{
+	return readahead_ratio >= 10;
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -683,6 +683,113 @@ unsigned long max_sane_readahead(unsigne
 }
 
 /*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ *   1. state based   - the default one
+ *   2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ *                 a read-ahead chunk
+ *    +-----------------------------------------+
+ *    |       # PG_readahead                    |
+ *    +-----------------------------------------+
+ *            ^ When this page is read, notify me for the next read-ahead.
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * The nature of read-ahead allows false tests to occur occasionally.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+							pgoff_t offset)
+{
+	return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+							pgoff_t offset)
+{
+	struct page *page;
+
+	read_lock_irq(&mapping->tree_lock);
+	page = __find_page(mapping, offset);
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+	int pgrescue;
+	pgoff_t index;
+	struct zone *zone;
+	struct address_space *mapping;
+
+	BUG_ON(!nr_pages || !page);
+	pgrescue = 0;
+	index = page_index(page);
+	mapping = page_mapping(page);
+
+	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+			mapping->host->i_ino, index, nr_pages);
+
+	for(;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out_unlock;
+
+		while (page_mapping(page) == mapping &&
+				page_index(page) == index) {
+			struct page *the_page = page;
+			page = next_page(page);
+			if (!PageActive(the_page) &&
+					!PageLocked(the_page) &&
+					page_count(the_page) == 1) {
+				list_move(&the_page->lru, &zone->inactive_list);
+				pgrescue++;
+			}
+			index++;
+			if (!--nr_pages)
+				goto out_unlock;
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+
+		cond_resched();
+		page = find_page(mapping, index);
+		if (!page)
+			goto out;
+	}
+out_unlock:
+	spin_unlock_irq(&zone->lru_lock);
+out:
+	ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+	return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
  * Read-ahead events accounting.
  */
 #ifdef CONFIG_DEBUG_READAHEAD

--


* [PATCH 11/33] readahead: sysctl parameters
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-parameter-sysctl-variables.patch --]
[-- Type: text/plain, Size: 5039 bytes --]

Add new sysctl entries in /proc/sys/vm:

- readahead_ratio = 50
	i.e. limit read-ahead size to (readahead_ratio)% of the thrashing threshold
- readahead_hit_rate = 1
	i.e. a read-ahead hit ratio >= (1/readahead_hit_rate) is deemed OK

readahead_ratio also provides a way to select read-ahead logic at runtime:

	condition			    action
==========================================================================
readahead_ratio == 0		disable read-ahead
readahead_ratio <= 9		select the (old) stock read-ahead logic
readahead_ratio >= 10		select the (new) adaptive read-ahead logic

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 Documentation/sysctl/vm.txt |   37 +++++++++++++++++++++++++++++++++++++
 include/linux/sysctl.h      |    2 ++
 kernel/sysctl.c             |   28 ++++++++++++++++++++++++++++
 mm/readahead.c              |   17 +++++++++++++++++
 4 files changed, 84 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -20,6 +20,23 @@
 #include <linux/nfsd/const.h>
 
 /*
+ * Adaptive read-ahead parameters.
+ */
+
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL_GPL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 1;
+
+/*
  * Detailed classification of read-ahead behaviors.
  */
 #define RA_CLASS_SHIFT 4
--- linux-2.6.17-rc4-mm3.orig/include/linux/sysctl.h
+++ linux-2.6.17-rc4-mm3/include/linux/sysctl.h
@@ -194,6 +194,8 @@ enum
 	VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
 	VM_PANIC_ON_OOM=33,	/* panic at out-of-memory */
 	VM_SWAP_PREFETCH=34,	/* swap prefetch */
+	VM_READAHEAD_RATIO=35,	/* percent of read-ahead size to thrashing-threshold */
+	VM_READAHEAD_HIT_RATE=36, /* one accessed page legitimizes so many read-ahead pages */
 };
 
 /* CTL_NET names: */
--- linux-2.6.17-rc4-mm3.orig/kernel/sysctl.c
+++ linux-2.6.17-rc4-mm3/kernel/sysctl.c
@@ -77,6 +77,12 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int print_fatal_signals;
 
+#if defined(CONFIG_ADAPTIVE_READAHEAD)
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+static int one = 1;
+#endif
+
 #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
 int unknown_nmi_panic;
 int nmi_watchdog_enabled;
@@ -987,6 +993,28 @@ static ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+	{
+		.ctl_name	= VM_READAHEAD_RATIO,
+		.procname	= "readahead_ratio",
+		.data		= &readahead_ratio,
+		.maxlen		= sizeof(readahead_ratio),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+	},
+	{
+		.ctl_name	= VM_READAHEAD_HIT_RATE,
+		.procname	= "readahead_hit_rate",
+		.data		= &readahead_hit_rate,
+		.maxlen		= sizeof(readahead_hit_rate),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &one,
+	},
+#endif
 	{ .ctl_name = 0 }
 };
 
--- linux-2.6.17-rc4-mm3.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.17-rc4-mm3/Documentation/sysctl/vm.txt
@@ -31,6 +31,8 @@ Currently, these files are in /proc/sys/
 - zone_reclaim_interval
 - panic_on_oom
 - swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
 
 ==============================================================
 
@@ -202,3 +204,38 @@ copying back pages from swap into the sw
 practice it can take many minutes before the vm is idle enough.
 
 The default value is 1.
+
+==============================================================
+
+readahead_ratio
+
+This limits the readahead size to a percentage of the thrashing threshold.
+The thrashing threshold is dynamically estimated from the _history_ read
+speed and system load, to deduce the _future_ readahead request size.
+
+Set it to a smaller value if you do not have enough memory for all the
+concurrent readers, or if the I/O load fluctuates a lot. But if there's
+plenty of memory (>2MB per reader), a bigger value may help performance.
+
+readahead_ratio also selects the readahead logic:
+	VALUE	CODE PATH
+	-------------------------------------------
+	    0	disable readahead totally
+	  1-9	select the stock readahead logic
+	10-inf	select the adaptive readahead logic
+
+The default value is 50.  Reasonable values would be [50, 100].
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of (readahead-pages : accessed-pages).
+Useful only when (readahead_ratio >= 10). If the previous readahead
+request has a bad hit rate, the kernel will be reluctant to do the
+next readahead.
+
+Larger values help catch more sparse access patterns. Be aware that
+readahead of sparse patterns sacrifices memory for speed.
+
+The default value is 1.  It is recommended to keep the value below 16.

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 13/33] readahead: state based method - aging accounting
       [not found] ` <20060524111903.510268987@localhost.localdomain>
@ 2006-05-24 11:12   ` Wu Fengguang
  2006-05-26 17:04     ` Andrew Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-stateful-aging.patch --]
[-- Type: text/plain, Size: 5221 bytes --]

Collect info about the global available memory and its consumption speed.
The data are used by the stateful method to estimate the thrashing threshold.

They are the decisive factor in the correctness/accuracy of the resulting
read-ahead size.

- On NUMA systems, the accounting is done on a per-node basis. It works for
  the two common real-world schemes:
	  - the reader process allocates cache pages in a node-affine manner;
	  - the reader process allocates cache pages _evenly_ across a set of nodes.

- On non-NUMA systems, readahead_aging is mainly increased on the first
  access of read-ahead pages, in order to make it go up constantly and
  smoothly. This improves the accuracy for small/fast read-aheads, at
  the cost of a little more overhead.
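
For intuition, here is a minimal sketch of how such counters could feed a
thrashing-threshold estimate (the exact formula lives in a later patch of
this series; stream_pages/global_aging/inactive_len are illustrative
parameters, not the patch's variables):

	/*
	 * If the stream consumed stream_pages pages while the node's aging
	 * advanced by global_aging pages, then by the time its cached pages
	 * drift through an inactive list of inactive_len pages, it will have
	 * consumed about inactive_len * stream_pages / global_aging pages --
	 * a natural thrashing-threshold estimate.
	 */
	static unsigned long estimate_thrashing_threshold(
				unsigned long stream_pages,
				unsigned long global_aging,
				unsigned long inactive_len)
	{
		if (global_aging == 0)
			return inactive_len;
		return inactive_len * stream_pages / global_aging;
	}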

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h     |    9 +++++++++
 include/linux/mmzone.h |    5 +++++
 mm/Kconfig             |    5 +++++
 mm/readahead.c         |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap.c              |    2 ++
 mm/vmscan.c            |    4 ++++
 6 files changed, 74 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/Kconfig
+++ linux-2.6.17-rc4-mm3/mm/Kconfig
@@ -203,3 +203,8 @@ config DEBUG_READAHEAD
 	  echo 1 > /debug/readahead/debug_level # stop filling my kern.log
 
 	  Say N for production servers.
+
+config READAHEAD_SMOOTH_AGING
+	def_bool n if NUMA
+	default y if !NUMA
+	depends on ADAPTIVE_READAHEAD
--- linux-2.6.17-rc4-mm3.orig/include/linux/mmzone.h
+++ linux-2.6.17-rc4-mm3/include/linux/mmzone.h
@@ -161,6 +161,11 @@ struct zone {
 	unsigned long		pages_scanned;	   /* since last reclaim */
 	int			all_unreclaimable; /* All pages pinned */
 
+	/* The accumulated number of activities that may cause page aging,
+	 * that is, move some pages closer to the tail of the inactive_list.
+	 */
+	unsigned long 		aging_total;
+
 	/* A count of how many reclaimers are scanning this zone */
 	atomic_t		reclaim_in_progress;
 
--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1044,6 +1044,15 @@ static inline int prefer_adaptive_readah
 	return readahead_ratio >= 10;
 }
 
+DECLARE_PER_CPU(unsigned long, readahead_aging);
+static inline void inc_readahead_aging(void)
+{
+#ifdef CONFIG_READAHEAD_SMOOTH_AGING
+	if (prefer_adaptive_readahead())
+		per_cpu(readahead_aging, raw_smp_processor_id())++;
+#endif
+}
+
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 #ifdef CONFIG_IA64
--- linux-2.6.17-rc4-mm3.orig/mm/vmscan.c
+++ linux-2.6.17-rc4-mm3/mm/vmscan.c
@@ -457,6 +457,9 @@ static unsigned long shrink_page_list(st
 		if (PageWriteback(page))
 			goto keep_locked;
 
+		if (!PageReferenced(page))
+			inc_readahead_aging();
+
 		referenced = page_referenced(page, 1);
 		/* In active use or really unfreeable?  Activate it. */
 		if (referenced && page_mapping_inuse(page))
@@ -655,6 +658,7 @@ static unsigned long shrink_inactive_lis
 					     &page_list, &nr_scan);
 		zone->nr_inactive -= nr_taken;
 		zone->pages_scanned += nr_scan;
+		zone->aging_total += nr_scan;
 		spin_unlock_irq(&zone->lru_lock);
 
 		nr_scanned += nr_scan;
--- linux-2.6.17-rc4-mm3.orig/mm/swap.c
+++ linux-2.6.17-rc4-mm3/mm/swap.c
@@ -127,6 +127,8 @@ void fastcall mark_page_accessed(struct 
 		ClearPageReferenced(page);
 	} else if (!PageReferenced(page)) {
 		SetPageReferenced(page);
+		if (PageLRU(page))
+			inc_readahead_aging();
 	}
 }
 
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -18,6 +18,7 @@
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
 #include <linux/nfsd/const.h>
+#include <asm/div64.h>
 
 /*
  * Adaptive read-ahead parameters.
@@ -37,6 +38,14 @@ EXPORT_SYMBOL_GPL(readahead_ratio);
 int readahead_hit_rate = 1;
 
 /*
+ * Measures the aging process of cold pages.
+ * Mainly increased on fresh page references to make it smooth.
+ */
+#ifdef CONFIG_READAHEAD_SMOOTH_AGING
+DEFINE_PER_CPU(unsigned long, readahead_aging);
+#endif
+
+/*
  * Detailed classification of read-ahead behaviors.
  */
 #define RA_CLASS_SHIFT 4
@@ -805,6 +814,46 @@ out:
 }
 
 /*
+ * The node's effective length of inactive_list(s).
+ */
+static unsigned long node_free_and_cold_pages(void)
+{
+	unsigned int i;
+	unsigned long sum = 0;
+	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		sum += zones[i].nr_inactive +
+			zones[i].free_pages - zones[i].pages_low;
+
+	return sum;
+}
+
+/*
+ * The node's accumulated aging activities.
+ */
+static unsigned long node_readahead_aging(void)
+{
+       unsigned long sum = 0;
+
+#ifdef CONFIG_READAHEAD_SMOOTH_AGING
+       unsigned long cpu;
+       cpumask_t mask = node_to_cpumask(numa_node_id());
+
+       for_each_cpu_mask(cpu, mask)
+	       sum += per_cpu(readahead_aging, cpu);
+#else
+       unsigned int i;
+       struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+       for (i = 0; i < MAX_NR_ZONES; i++)
+	       sum += zones[i].aging_total;
+#endif
+
+       return sum;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 14/33] readahead: state based method - data structure
       [not found] ` <20060524111904.019763011@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-25  6:03     ` Nick Piggin
  2006-05-26 17:05     ` Andrew Morton
  0 siblings, 2 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-stateful-data.patch --]
[-- Type: text/plain, Size: 3134 bytes --]

Extend struct file_ra_state to support the adaptive read-ahead logic.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/fs.h |   57 +++++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 47 insertions(+), 10 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/fs.h
+++ linux-2.6.17-rc4-mm3/include/linux/fs.h
@@ -613,21 +613,58 @@ struct fown_struct {
 
 /*
  * Track a single file's readahead state
+ *
+ * Diagram for the adaptive readahead logic:
+ *
+ *  |--------- old chunk ------->|-------------- new chunk -------------->|
+ *  +----------------------------+----------------------------------------+
+ *  |               #            |                  #                     |
+ *  +----------------------------+----------------------------------------+
+ *                  ^            ^                  ^                     ^
+ *  file_ra_state.la_index    .ra_index   .lookahead_index      .readahead_index
+ *
+ * Deduced sizes:
+ *                               |----------- readahead size ------------>|
+ *  +----------------------------+----------------------------------------+
+ *  |               #            |                  #                     |
+ *  +----------------------------+----------------------------------------+
+ *                  |------- invoke interval ------>|-- lookahead size -->|
  */
 struct file_ra_state {
-	unsigned long start;		/* Current window */
-	unsigned long size;
-	unsigned long flags;		/* ra flags RA_FLAG_xxx*/
-	unsigned long cache_hit;	/* cache hit count*/
-	unsigned long prev_page;	/* Cache last read() position */
-	unsigned long ahead_start;	/* Ahead window */
-	unsigned long ahead_size;
-	unsigned long ra_pages;		/* Maximum readahead window */
-	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
-	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
+	union {
+		struct { /* conventional read-ahead */
+			unsigned long start;		/* Current window */
+			unsigned long size;
+			unsigned long ahead_start;	/* Ahead window */
+			unsigned long ahead_size;
+			unsigned long cache_hit;        /* cache hit count */
+		};
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+		struct { /* adaptive read-ahead */
+			pgoff_t la_index;
+			pgoff_t ra_index;
+			pgoff_t lookahead_index;
+			pgoff_t readahead_index;
+			unsigned long age;
+			uint64_t cache_hits;
+		};
+#endif
+	};
+
+	/* mmap read-around */
+	unsigned long mmap_hit;         /* Cache hit stat for mmap accesses */
+	unsigned long mmap_miss;        /* Cache miss stat for mmap accesses */
+
+	/* common ones */
+	unsigned long flags;            /* ra flags RA_FLAG_xxx */
+	unsigned long prev_page;        /* Cache last read() position */
+	unsigned long ra_pages;         /* Maximum readahead window */
 };
 #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
 #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
+#define RA_FLAG_MMAP		(1UL<<31)	/* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD	(1UL<<30)	/* disable look-ahead */
+#define RA_FLAG_EOF		(1UL<<29)	/* readahead hits EOF */
 
 struct file {
 	/*

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 15/33] readahead: state based method - routines
       [not found] ` <20060524111904.683513683@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-26 17:15     ` Andrew Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-stateful-routines.patch --]
[-- Type: text/plain, Size: 5765 bytes --]

Define some helpers on struct file_ra_state.
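
One helper family below deserves a note: file_ra_state.cache_hits packs four
16-bit counters into a single 64-bit word. A self-contained sketch of that
packing, assuming the same layout as ra_cache_hit()/ra_addup_cache_hit() in
the patch:

	#include <stdint.h>

	/* Slot 0 is the live hit counter; slots 1-3 hold accumulated history. */
	static int cache_hit_slot(uint64_t cache_hits, int nr)
	{
		return (cache_hits >> (nr * 16)) & 0xFFFF;
	}

	/* Conceptually: slot1 += slot0; slot0 = 0; */
	static uint64_t addup_cache_hit(uint64_t cache_hits)
	{
		uint64_t n = cache_hit_slot(cache_hits, 0);

		return cache_hits - n + (n << 16);
	}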

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 186 insertions(+), 2 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -854,6 +854,190 @@ static unsigned long node_readahead_agin
 }
 
 /*
+ * Some helpers for querying/building a read-ahead request.
+ *
+ * Diagram for some variable names used frequently:
+ *
+ *                                   |<------- la_size ------>|
+ *                  +-----------------------------------------+
+ *                  |                #                        |
+ *                  +-----------------------------------------+
+ *      ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+static enum ra_class ra_class_new(struct file_ra_state *ra)
+{
+	return ra->flags & RA_CLASS_MASK;
+}
+
+static inline enum ra_class ra_class_old(struct file_ra_state *ra)
+{
+	return (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+}
+
+static unsigned long ra_readahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->ra_index;
+}
+
+static unsigned long ra_lookahead_size(struct file_ra_state *ra)
+{
+	return ra->readahead_index - ra->lookahead_index;
+}
+
+static unsigned long ra_invoke_interval(struct file_ra_state *ra)
+{
+	return ra->lookahead_index - ra->la_index;
+}
+
+/*
+ * The 64bit cache_hits stores three accumulated values and a counter value.
+ * MSB                                                                   LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+	return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
+}
+
+/*
+ * Conceptual code:
+ * ra_cache_hit(ra, 1) += ra_cache_hit(ra, 0);
+ * ra_cache_hit(ra, 0) = 0;
+ */
+static void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+	int n;
+
+	n = ra_cache_hit(ra, 0);
+	ra->cache_hits -= n;
+	n <<= 16;
+	ra->cache_hits += n;
+}
+
+/*
+ * The read-ahead is deemed a success if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+	return ra_cache_hit(ra, 0) * readahead_hit_rate >=
+					(ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+	if (index < ra->la_index || index >= ra->readahead_index)
+		return 0;
+
+	if (index >= ra->ra_index)
+		return 1;
+	else
+		return -1;
+}
+
+/*
+ * Which method is issuing this read-ahead?
+ */
+static void ra_set_class(struct file_ra_state *ra,
+				enum ra_class ra_class)
+{
+	unsigned long flags_mask;
+	unsigned long flags;
+	unsigned long old_ra_class;
+
+	flags_mask = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+	flags = ra->flags & flags_mask;
+
+	old_ra_class = ra_class_new(ra) << RA_CLASS_SHIFT;
+
+	ra->flags = flags | old_ra_class | ra_class;
+
+	ra_addup_cache_hit(ra);
+	if (ra_class != RA_CLASS_STATE)
+		ra->cache_hits <<= 16;
+
+	ra->age = node_readahead_aging();
+}
+
+/*
+ * Where is the old read-ahead and look-ahead?
+ */
+static void ra_set_index(struct file_ra_state *ra,
+				pgoff_t la_index, pgoff_t ra_index)
+{
+	ra->la_index = la_index;
+	ra->ra_index = ra_index;
+}
+
+/*
+ * Where is the new read-ahead and look-ahead?
+ */
+static void ra_set_size(struct file_ra_state *ra,
+				unsigned long ra_size, unsigned long la_size)
+{
+	/* Disable look-ahead for loopback file. */
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		la_size = 0;
+
+	ra->readahead_index = ra->ra_index + ra_size;
+	ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+			struct address_space *mapping, struct file *filp)
+{
+	enum ra_class ra_class = ra_class_new(ra);
+	unsigned long ra_size = ra_readahead_size(ra);
+	unsigned long la_size = ra_lookahead_size(ra);
+	pgoff_t eof_index = PAGES_BYTE(i_size_read(mapping->host)) + 1;
+	int actual;
+
+	if (unlikely(ra->ra_index >= eof_index))
+		return 0;
+
+	/* Snap to EOF. */
+	if (ra->readahead_index + ra_size / 2 > eof_index) {
+		if (ra_class == RA_CLASS_CONTEXT_AGGRESSIVE &&
+				eof_index > ra->lookahead_index + 1)
+			la_size = eof_index - ra->lookahead_index;
+		else
+			la_size = 0;
+		ra_size = eof_index - ra->ra_index;
+		ra_set_size(ra, ra_size, la_size);
+		ra->flags |= RA_FLAG_EOF;
+	}
+
+	actual = __do_page_cache_readahead(mapping, filp,
+					ra->ra_index, ra_size, la_size);
+
+#ifdef CONFIG_DEBUG_READAHEAD
+	if (ra->flags & RA_FLAG_MMAP)
+		ra_account(ra, RA_EVENT_READAHEAD_MMAP, actual);
+	if (ra->readahead_index == eof_index)
+		ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+	if (la_size)
+		ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+	if (ra_size > actual)
+		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
+	ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+			ra_class_name[ra_class],
+			mapping->host->i_ino, ra->la_index,
+			ra->ra_index, ra_size, la_size, actual);
+#endif /* CONFIG_DEBUG_READAHEAD */
+
+	return actual;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:
@@ -925,10 +1109,10 @@ static void ra_account(struct file_ra_st
 		return;
 
 	if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
-		c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+		c = ra_class_old(ra);
 		pages = -pages;
 	} else if (ra)
-		c = ra->flags & RA_CLASS_MASK;
+		c = ra_class_new(ra);
 	else
 		c = RA_CLASS_NONE;
 

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 17/33] readahead: context based method
       [not found] ` <20060524111905.586110688@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-25  5:26     ` Nick Piggin
                       ` (2 more replies)
  2006-05-24 12:37   ` Peter Zijlstra
  1 sibling, 3 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 17152 bytes --]

This is the slow code path of adaptive read-ahead.

No valid state info is available, so the page cache is queried to obtain
the required position/timing info. This kind of estimation is more
conservative than the stateful method's, and also fluctuates more with load
variance.


HOW IT WORKS
============

It works by peeking into the file cache and checking whether there are any
history pages present or accessed. In this way it can detect almost all forms of
sequential / semi-sequential read patterns, e.g.
        - parallel / interleaved sequential scans on one file
        - sequential reads across file open/close
        - mixed sequential / random accesses
        - sparse / skimming sequential read


HOW DATABASES CAN BENEFIT FROM IT
=================================

The adaptive readahead might help db performance in the following cases:
        - concurrent sequential scans
        - sequential scan on a fragmented table
        - index scan with clustered matches
        - index scan on majority rows (in case the planner goes wrong)


ALGORITHM STEPS
===============

        - look back/forward to find the ra_index;
        - look back to estimate a thrashing safe ra_size;
        - assemble the next read-ahead request in file_ra_state;
        - submit it.


ALGORITHM DYNAMICS
==================

* startup
When a sequential read is detected, the chunk size is set to readahead-min
and grows with each readahead.  The growth speed is controlled by
readahead-ratio.  When readahead-ratio == 100, the new logic grows chunk
sizes exponentially -- like the current logic, but lagging behind it in
the early steps.

* stabilize
When chunk size reaches readahead-max, or comes close to
        (readahead-ratio * thrashing-threshold)
it stops growing and stays there.

The main difference from the stock readahead logic shows up at and after
the point where the chunk size stops growing:
     -  The current logic grows the chunk size exponentially in the normal
        case and shrinks it by a factor of 2 each time thrashing is seen.
        That can lead to thrashing on almost every readahead for very slow
        streams.
     -  The new logic can stop at a size below the thrashing-threshold,
        and stay there, stable.

* on stream speed-up or system load drop
The thrashing-threshold rises, and the chunk size is likely to be enlarged.

* on stream slow-down or system load spike
The thrashing-threshold falls.
If thrashing happens, the next read is treated as a random read, and the
chunk-size-growing phase restarts with the read after that.

For a slow stream with (thrashing-threshold < readahead-max):
      - When readahead-ratio = 100, there is only one chunk in cache
        most of the time;
      - When readahead-ratio = 50, there are two chunks in cache most
        of the time.
      - Lowering readahead-ratio helps cut down the chunk size gracefully,
        without thrashing (see the toy model below).
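
A toy model of the startup/stabilize behaviour described above (a sketch
under stated assumptions: cur/ra_max/thresh are hypothetical inputs in
pages, not the patch's variables):

	/* Grow the chunk until it hits ra_max or ratio% of the threshold. */
	unsigned long next_chunk(unsigned long cur, unsigned long ra_max,
				 unsigned long thresh, int readahead_ratio)
	{
		unsigned long limit = thresh * readahead_ratio / 100;

		if (limit > ra_max)
			limit = ra_max;
		cur *= 2;		/* exponential growth at startup */
		if (cur > limit)
			cur = limit;	/* then stabilize below the threshold */
		return cur;
	}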


OVERHEADS
=========

The context based method has some overhead compared to the stateful method,
due to more locking and memory scans.

Profiling the following command with oprofile shows these sample counts:

	# diff sparse sparse1

	total oprofile samples	run1	run2
	stateful method		560482	558696
	stateless method	564463	559413

So the average overhead is about 0.4%.

Detailed diffprofile data:

# diffprofile oprofile.50.stateful oprofile.50.stateless
      2998    41.1% isolate_lru_pages
      2669    26.4% shrink_zone
      1822    14.7% system_call
      1419    27.6% radix_tree_delete
      1376    14.8% _raw_write_lock
      1279    27.4% free_pages_bulk
      1111    12.0% _raw_write_unlock
      1035    43.3% free_hot_cold_page
       849    15.3% unlock_page
       786    29.6% page_referenced
       710     4.6% kmap_atomic
       651    26.4% __pagevec_release_nonlru
       586    16.1% __rmqueue
       578    11.3% find_get_page
       481    15.5% page_waitqueue
       440     6.6% add_to_page_cache
       420    33.7% fget_light
       260     4.3% get_page_from_freelist
       223    13.7% find_busiest_group
       221    35.1% mutex_debug_check_no_locks_freed
       211     0.0% radix_tree_scan_hole
       198    35.5% delay_tsc
       195    14.8% ext3_get_branch
       182    12.6% profile_tick
       173     0.0% radix_tree_cache_lookup_node
       164    22.9% find_next_bit
       162    50.3% page_cache_readahead_adaptive
...
       106     0.0% radix_tree_scan_hole_backward
...
       -51    -7.6% radix_tree_preload
...
       -68    -2.1% radix_tree_insert
...
       -87    -2.0% mark_page_accessed
       -88    -2.0% __pagevec_lru_add
      -103    -7.7% softlockup_tick
      -107   -71.8% free_block
      -122   -77.7% do_IRQ
      -132   -82.0% do_timer
      -140   -47.1% ack_edge_ioapic_vector
      -168   -81.2% handle_IRQ_event
      -192   -35.2% irq_entries_start
      -204   -14.8% rw_verify_area
      -214   -13.2% account_system_time
      -233    -9.5% radix_tree_lookup_node
      -234   -16.6% scheduler_tick
      -259   -58.7% __do_IRQ
      -266    -6.8% put_page
      -318   -29.3% rcu_pending
      -333    -3.0% do_generic_mapping_read
      -337   -28.3% hrtimer_run_queues
      -493   -27.0% __rcu_pending
     -1038    -9.4% default_idle
     -3323    -3.5% __copy_to_user_ll
    -10331    -5.9% do_mpage_readpage

# diffprofile oprofile.50.stateful2 oprofile.50.stateless2
      1739     1.1% do_mpage_readpage
       833     0.9% __copy_to_user_ll
       340    21.3% find_busiest_group
       288     9.5% free_hot_cold_page
       261     4.6% _raw_read_unlock
       239     3.9% get_page_from_freelist
       201     0.0% radix_tree_scan_hole
       163    14.3% raise_softirq
       160     0.0% radix_tree_cache_lookup_node
       160    11.8% update_process_times
       136     9.3% fget_light
       121    35.1% page_cache_readahead_adaptive
       117    36.0% restore_all
       117     2.8% mark_page_accessed
       109     6.4% rebalance_tick
       107     9.4% sys_read
       102     0.0% radix_tree_scan_hole_backward
...
        63     4.0% readahead_cache_hit
...
       -10   -15.9% radix_tree_node_alloc
...
       -39    -1.7% radix_tree_lookup_node
       -39   -10.3% irq_entries_start
       -43    -1.3% radix_tree_insert
...
       -47    -4.6% __do_page_cache_readahead
       -64    -9.3% radix_tree_preload
       -65    -5.4% rw_verify_area
       -65    -2.2% vfs_read
       -70    -4.7% timer_interrupt
       -71    -1.0% __wake_up_bit
       -73    -1.1% radix_tree_delete
       -79   -12.6% __mod_page_state_offset
       -94    -1.8% __find_get_block
       -94    -2.2% __pagevec_lru_add
      -102    -1.7% free_pages_bulk
      -116    -1.3% _raw_read_lock
      -123    -7.4% do_sync_read
      -130    -8.4% ext3_get_blocks_handle
      -142    -3.8% put_page
      -146    -7.9% mpage_readpages
      -147    -5.6% apic_timer_interrupt
      -168    -1.6% _raw_write_unlock
      -172    -5.0% page_referenced
      -206    -3.2% unlock_page
      -212   -15.0% restore_nocheck
      -213    -2.1% default_idle
      -245    -5.0% __rmqueue
      -278    -4.3% find_get_page
      -282    -2.1% system_call
      -287   -11.8% run_timer_softirq
      -300    -2.7% _raw_write_lock
      -420    -3.2% shrink_zone
      -661    -5.7% isolate_lru_pages

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |  329 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 329 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1185,6 +1185,335 @@ state_based_readahead(struct address_spa
 }
 
 /*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with a figure:
+ *
+ *   chunk A           chunk B                      chunk C                 head
+ *
+ *   l01 l11           l12   l21                    l22
+ *| |-->|-->|       |------>|-->|                |------>|
+ *| +-------+       +-----------+                +-------------+               |
+ *| |   #   |       |       #   |                |       #     |               |
+ *| +-------+       +-----------+                +-------------+               |
+ *| |<==============|<===========================|<============================|
+ *        L0                     L1                            L2
+ *
+ * Let f(l) = L be a map from
+ * 	l: the number of pages read by the stream
+ * to
+ * 	L: the number of pages pushed into inactive_list in the meantime
+ * then
+ * 	f(l01) <= L0
+ * 	f(l11 + l12) = L1
+ * 	f(l21 + l22) = L2
+ * 	...
+ * 	f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ *			   <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of continuous history pages left in the inactive_list is always
+ * a lower estimate of the true thrashing-threshold.
+ */
+
+#define PAGE_REFCNT_0           0
+#define PAGE_REFCNT_1           (1 << PG_referenced)
+#define PAGE_REFCNT_2           (1 << PG_active)
+#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
+
+/*
+ * STATUS   REFERENCE COUNT
+ *  __                   0
+ *  _R       PAGE_REFCNT_1
+ *  A_       PAGE_REFCNT_2
+ *  AR       PAGE_REFCNT_3
+ *
+ *  A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+        return page->flags & PAGE_REFCNT_MASK;
+}
+
+/*
+ * STATUS   REFERENCE COUNT      TYPE
+ *  __                   0      fresh
+ *  _R       PAGE_REFCNT_1      stale
+ *  A_       PAGE_REFCNT_2      disturbed once
+ *  AR       PAGE_REFCNT_3      disturbed twice
+ *
+ *  A/R: Active / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+	if (!page || PageActive(page))
+		return 0;
+
+	return page_refcnt(page);
+}
+
+/*
+ * Find past-the-end index of the segment at @index.
+ */
+static pgoff_t find_segtail(struct address_space *mapping,
+					pgoff_t index, unsigned long max_scan)
+{
+	pgoff_t ra_index;
+
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+	read_unlock_irq(&mapping->tree_lock);
+
+	if (ra_index <= index + max_scan)
+		return ra_index;
+	else
+		return 0;
+}
+
+/*
+ * Find past-the-end index of the segment before @index.
+ */
+static pgoff_t find_segtail_backward(struct address_space *mapping,
+					pgoff_t index, unsigned long max_scan)
+{
+	struct radix_tree_cache cache;
+	struct page *page;
+	pgoff_t origin;
+
+	origin = index;
+	if (max_scan > index)
+		max_scan = index;
+
+	cond_resched();
+	radix_tree_cache_init(&cache);
+	read_lock_irq(&mapping->tree_lock);
+	for (; origin - index < max_scan;) {
+		page = radix_tree_cache_lookup(&mapping->page_tree,
+							&cache, --index);
+		if (page) {
+			read_unlock_irq(&mapping->tree_lock);
+			return index + 1;
+		}
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+	return 0;
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+				pgoff_t first_index, pgoff_t last_index)
+{
+	struct page *page;
+	int size = last_index - first_index + 1;
+	int count = 0;
+	int i;
+
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+
+	/*
+	 * The first page may well be the chunk head and has been accessed,
+	 * so it is index 0 that makes the estimation optimistic. This
+	 * behavior guarantees a readahead when (size < ra_max) and
+	 * (readahead_hit_rate >= 16).
+	 */
+	for (i = 0; i < 16;) {
+		page = __find_page(mapping, first_index +
+						size * ((i++ * 29) & 15) / 16);
+		if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+			break;
+	}
+
+	read_unlock_irq(&mapping->tree_lock);
+
+	return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static unsigned long query_page_cache_segment(struct address_space *mapping,
+				struct file_ra_state *ra,
+				unsigned long *remain, pgoff_t offset,
+				unsigned long ra_min, unsigned long ra_max)
+{
+	pgoff_t index;
+	unsigned long count;
+	unsigned long nr_lookback;
+	struct radix_tree_cache cache;
+
+	/*
+	 * Scan backward and check the near @ra_max pages.
+	 * The count here determines ra_size.
+	 */
+	cond_resched();
+	read_lock_irq(&mapping->tree_lock);
+	index = radix_tree_scan_hole_backward(&mapping->page_tree,
+							offset, ra_max);
+	read_unlock_irq(&mapping->tree_lock);
+
+	*remain = offset - index;
+
+	if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
+		count = *remain;
+	else if (count_cache_hit(mapping, index + 1, offset) *
+						readahead_hit_rate >= *remain)
+		count = *remain;
+	else
+		count = ra_min;
+
+	/*
+	 * Unnecessary to count more?
+	 */
+	if (count < ra_max)
+		goto out;
+
+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+		goto out;
+
+	/*
+	 * Check the far pages coarsely.
+	 * The enlarged count here helps increase la_size.
+	 */
+	nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+						100 / (readahead_ratio | 1);
+
+	cond_resched();
+	radix_tree_cache_init(&cache);
+	read_lock_irq(&mapping->tree_lock);
+	for (count += ra_max; count < nr_lookback; count += ra_max) {
+		struct radix_tree_node *node;
+		node = radix_tree_cache_lookup_parent(&mapping->page_tree,
+						&cache, offset - count, 1);
+		if (!node)
+			break;
+	}
+	read_unlock_irq(&mapping->tree_lock);
+
+out:
+	/*
+	 *  For sequential read that extends from index 0, the counted value
+	 *  may well be far under the true threshold, so return it unmodified
+	 *  for further processing in adjust_rala_aggressive().
+	 */
+	if (count >= offset)
+		count = offset;
+	else
+		count = max(ra_min, count * readahead_ratio / 100);
+
+	ddprintk("query_page_cache_segment: "
+			"ino=%lu, idx=%lu, count=%lu, remain=%lu\n",
+			mapping->host->i_ino, offset, count, *remain);
+
+	return count;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of the stateless method is perhaps the slow ramp-up
+ * speed of ra_size. The logic tries to make up for this in the important case
+ * of sequential reads that extend from the start of file. In this case, the
+ * ra_size is not chosen to make the whole next chunk safe (as in normal
+ * ones); only half of it is. The added 'unsafe' half is the look-ahead part,
+ * which is expected to be safeguarded by rescue_pages() when the previous
+ * chunks are lost.
+ */
+static int adjust_rala_aggressive(unsigned long ra_max,
+				unsigned long *ra_size, unsigned long *la_size)
+{
+	pgoff_t index = *ra_size;
+
+	*ra_size -= min(*ra_size, *la_size);
+	*ra_size = *ra_size * readahead_ratio / 100;
+	*la_size = index * readahead_ratio / 100;
+	*ra_size += *la_size;
+
+	if (*ra_size > ra_max)
+		*ra_size = ra_max;
+	if (*la_size > *ra_size)
+		*la_size = *ra_size;
+
+	return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ *
+ * RETURN VALUE		HINT
+ *      1		@ra contains a valid ra-request, please submit it
+ *      0		no seq-pattern discovered, please try the next method
+ *     -1		please don't do _any_ readahead
+ */
+static int
+try_context_based_readahead(struct address_space *mapping,
+			struct file_ra_state *ra, struct page *prev_page,
+			struct page *page, pgoff_t index,
+			unsigned long ra_min, unsigned long ra_max)
+{
+	pgoff_t ra_index;
+	unsigned long ra_size;
+	unsigned long la_size;
+	unsigned long remain_pages;
+
+	/* Where to start read-ahead?
+	 * NFSv3 daemons may process adjacent requests in parallel,
+	 * leading to many locally disordered, globally sequential reads.
+	 * So do not require nearby history pages to be present or accessed.
+	 */
+	if (page) {
+		ra_index = find_segtail(mapping, index, ra_max * 5 / 4);
+		if (!ra_index)
+			return -1;
+	} else if (prev_page || find_page(mapping, index - 1)) {
+		ra_index = index;
+	} else if (readahead_hit_rate > 1) {
+		ra_index = find_segtail_backward(mapping, index,
+						readahead_hit_rate + ra_min);
+		if (!ra_index)
+			return 0;
+		ra_min += 2 * (index - ra_index);
+		index = ra_index;	/* pretend the request starts here */
+	} else
+		return 0;
+
+	ra_size = query_page_cache_segment(mapping, ra, &remain_pages,
+							index, ra_min, ra_max);
+
+	la_size = ra_index - index;
+	if (page && remain_pages <= la_size &&
+			remain_pages < index && la_size > 1) {
+		rescue_pages(page, la_size);
+		return -1;
+	}
+
+	if (ra_size == index) {
+		if (!adjust_rala_aggressive(ra_max, &ra_size, &la_size))
+			return -1;
+		ra_set_class(ra, RA_CLASS_CONTEXT_AGGRESSIVE);
+	} else {
+		if (!adjust_rala(ra_max, &ra_size, &la_size))
+			return -1;
+		ra_set_class(ra, RA_CLASS_CONTEXT);
+	}
+
+	ra_set_index(ra, index, ra_index);
+	ra_set_size(ra, ra_size, la_size);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 18/33] readahead: initial method - guiding sizes
       [not found] ` <20060524111906.245276338@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-initial-sizes.patch --]
[-- Type: text/plain, Size: 2420 bytes --]

Introduce three guiding sizes for the initial readahead method.
	- ra_pages0:	   recommended readahead on start-of-file
	- ra_expect_bytes: expected read size on start-of-file
	- ra_thrash_bytes: estimated thrashing threshold

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


 block/ll_rw_blk.c           |    4 +---
 include/linux/backing-dev.h |    3 +++
 mm/readahead.c              |    3 +++
 3 files changed, 7 insertions(+), 3 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/backing-dev.h
+++ linux-2.6.17-rc4-mm3/include/linux/backing-dev.h
@@ -24,6 +24,9 @@ typedef int (congested_fn)(void *, int);
 
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
+	unsigned long ra_pages0; /* recommended readahead on start of file */
+	unsigned long ra_expect_bytes;	/* expected read size on start of file */
+	unsigned long ra_thrash_bytes;	/* thrashing threshold */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -122,6 +122,9 @@ EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
 	.ra_pages	= PAGES_KB(VM_MAX_READAHEAD),
+	.ra_pages0	= PAGES_KB(128),
+	.ra_expect_bytes = 1024 * VM_MIN_READAHEAD,
+	.ra_thrash_bytes = 1024 * VM_MIN_READAHEAD,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
--- linux-2.6.17-rc4-mm3.orig/block/ll_rw_blk.c
+++ linux-2.6.17-rc4-mm3/block/ll_rw_blk.c
@@ -249,9 +249,6 @@ void blk_queue_make_request(request_queu
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	q->make_request_fn = mfn;
-	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.state = 0;
-	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
@@ -1850,6 +1847,7 @@ request_queue_t *blk_alloc_queue_node(gf
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
+	q->backing_dev_info = default_backing_dev_info;
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
 

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 19/33] readahead: initial method - thrashing guard size
       [not found] ` <20060524111906.588647885@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-initial-size-thrash.patch --]
[-- Type: text/plain, Size: 1619 bytes --]

backing_dev_info.ra_thrash_bytes is dynamically updated to sit a little above
the thrashing-safe read-ahead size. It is used in the initial method, where
the thrashing threshold for the particular reader is still unknown.
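
The update below is a biased running average: it rises slowly and falls fast,
so the estimate hugs the minimum stable read-ahead size. A standalone sketch
of the same arithmetic (biased_avg() is a hypothetical name; values in bytes):

	unsigned long biased_avg(unsigned long thrash_bytes, unsigned long ra_bytes)
	{
		if (thrash_bytes < ra_bytes)	/* rise slowly: 1/1024 weight */
			return (ra_bytes + thrash_bytes * 1023) / 1024;
		return (ra_bytes + thrash_bytes * 7) / 8; /* fall fast: 1/8 weight */
	}

For example, an estimate of 1MB meeting a stable 2MB read-ahead moves up by
about 1KB per update, while meeting a 512KB one drops it by ~64KB at once.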

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -817,6 +817,22 @@ out:
 }
 
 /*
+ * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
+ * read-ahead sizes, which makes it a slightly risky(*) estimate of the
+ * _minimal_ read-ahead thrashing threshold on the device.
+ *
+ * (*) Note that being a bit risky can _help_ overall performance.
+ */
+static inline void update_ra_thrash_bytes(struct backing_dev_info *bdi,
+						unsigned long ra_size)
+{
+	ra_size <<= PAGE_CACHE_SHIFT;
+	bdi->ra_thrash_bytes = (bdi->ra_thrash_bytes < ra_size) ?
+				(ra_size + bdi->ra_thrash_bytes * 1023) / 1024:
+				(ra_size + bdi->ra_thrash_bytes *    7) /    8;
+}
+
+/*
  * The node's effective length of inactive_list(s).
  */
 static unsigned long node_free_and_cold_pages(void)
@@ -1180,6 +1196,10 @@ state_based_readahead(struct address_spa
 	if (!adjust_rala(growth_limit, &ra_size, &la_size))
 		return 0;
 
+	/* ra_size in its _steady_ state reflects thrashing threshold */
+	if (page && ra_old + ra_old / 8 >= ra_size)
+		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
 	ra_set_class(ra, RA_CLASS_STATE);
 	ra_set_index(ra, index, ra->readahead_index);
 	ra_set_size(ra, ra_size, la_size);

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 20/33] readahead: initial method - expected read size
       [not found] ` <20060524111907.134685550@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-25  5:34     ` [PATCH 22/33] readahead: initial method Nick Piggin
  2006-05-26 17:29     ` [PATCH 20/33] readahead: initial method - expected read size Andrew Morton
  0 siblings, 2 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-initial-size-expect.patch --]
[-- Type: text/plain, Size: 3385 bytes --]

backing_dev_info.ra_expect_bytes is dynamically updated to track the expected
read size on start-of-file. It allows the initial readahead to be more
aggressive and hence more efficient.
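
The close-time feedback in this patch follows a simple rule: full use of the
cached pages nudges the expectation up, unused readahead pulls it down by half
the waste. A minimal sketch of that rule (update_expect() and its parameters
are illustrative names, not the patch's code):

	unsigned long update_expect(unsigned long expect, unsigned long cap,
				    unsigned long cached, unsigned long accessed,
				    unsigned long page_size)
	{
		if (accessed >= cached) {		/* cache hit */
			if (expect < cap)
				expect += cached * page_size / 8;
		} else {				/* cache miss */
			unsigned long missed = (cached - accessed) * page_size;

			if (expect >= missed / 2)
				expect -= missed / 2;
		}
		return expect;
	}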


Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 fs/file_table.c    |    7 ++++++
 include/linux/mm.h |    1 
 mm/readahead.c     |   55 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 63 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1032,6 +1032,7 @@ unsigned long page_cache_readahead(struc
 void handle_ra_miss(struct address_space *mapping, 
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
+void fastcall readahead_close(struct file *file);
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 extern int readahead_ratio;
--- linux-2.6.17-rc4-mm3.orig/fs/file_table.c
+++ linux-2.6.17-rc4-mm3/fs/file_table.c
@@ -12,6 +12,7 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/smp_lock.h>
+#include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/security.h>
 #include <linux/eventpoll.h>
@@ -160,6 +161,12 @@ void fastcall __fput(struct file *file)
 	might_sleep();
 
 	fsnotify_close(file);
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+	if (file->f_ra.flags & RA_FLAG_EOF)
+		readahead_close(file);
+#endif
+
 	/*
 	 * The function eventpoll_release() should be the first called
 	 * in the file cleanup chain.
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1555,6 +1555,61 @@ static inline void get_readahead_bounds(
 					PAGES_KB(128)), *ra_max / 2);
 }
 
+/*
+ * When closing a normal readonly file,
+ * 	- on cache hit:  increase `backing_dev_info.ra_expect_bytes' slowly;
+ * 	- on cache miss: decrease it rapidly.
+ *
+ * The resulting `ra_expect_bytes' answers the question:
+ * 	How many pages are expected to be read on start-of-file?
+ */
+void fastcall readahead_close(struct file *file)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct address_space *mapping = inode->i_mapping;
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long pos = file->f_pos;
+	unsigned long pgrahit = file->f_ra.cache_hits;
+	unsigned long pgaccess = 1 + pos / PAGE_CACHE_SIZE;
+	unsigned long pgcached = mapping->nrpages;
+
+	if (!pos)				/* pread */
+		return;
+
+	if (pgcached > bdi->ra_pages0)		/* excessive reads */
+		return;
+
+	if (pgaccess >= pgcached) {
+		if (bdi->ra_expect_bytes < bdi->ra_pages0 * PAGE_CACHE_SIZE)
+			bdi->ra_expect_bytes += pgcached * PAGE_CACHE_SIZE / 8;
+
+		debug_inc(initial_ra_hit);
+		dprintk("initial_ra_hit on file %s size %lluK "
+				"pos %lu by %s(%d)\n",
+				file->f_dentry->d_name.name,
+				i_size_read(inode) / 1024,
+				pos,
+				current->comm, current->pid);
+	} else {
+		unsigned long missed;
+
+		missed = (pgcached - pgaccess) * PAGE_CACHE_SIZE;
+		if (bdi->ra_expect_bytes >= missed / 2)
+			bdi->ra_expect_bytes -= missed / 2;
+
+		debug_inc(initial_ra_miss);
+		dprintk("initial_ra_miss on file %s "
+				"size %lluK cached %luK hit %luK "
+				"pos %lu by %s(%d)\n",
+				file->f_dentry->d_name.name,
+				i_size_read(inode) / 1024,
+				pgcached << (PAGE_CACHE_SHIFT - 10),
+				pgrahit << (PAGE_CACHE_SHIFT - 10),
+				pos,
+				current->comm, current->pid);
+	}
+}
+
 #endif /* CONFIG_ADAPTIVE_READAHEAD */
 
 /*

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 23/33] readahead: backward prefetching method
       [not found] ` <20060524111908.569533741@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-26 17:37     ` Nate Diller
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-backward.patch --]
[-- Type: text/plain, Size: 1450 bytes --]

Readahead policy for reading backward.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1574,6 +1574,46 @@ initial_readahead(struct address_space *
 }
 
 /*
+ * Backward prefetching.
+ *
+ * No look-ahead and thrashing safety guard: should be unnecessary.
+ */
+static int
+try_read_backward(struct file_ra_state *ra, pgoff_t begin_index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	pgoff_t end_index;
+
+	/* Are we reading backward? */
+	if (begin_index > ra->prev_page)
+		return 0;
+
+	if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD &&
+					ra_has_index(ra, ra->prev_page)) {
+		ra_size += 2 * ra_cache_hit(ra, 0);
+		end_index = ra->la_index;
+	} else {
+		ra_size += ra_size + ra_size * (readahead_hit_rate - 1) / 2;
+		end_index = ra->prev_page;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	/* Read traces close enough to be covered by the prefetching? */
+	if (end_index > begin_index + ra_size)
+		return 0;
+
+	begin_index = end_index - ra_size;
+
+	ra_set_class(ra, RA_CLASS_BACKWARD);
+	ra_set_index(ra, begin_index, begin_index);
+	ra_set_size(ra, ra_size, 0);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 24/33] readahead: seeking reads method
       [not found] ` <20060524111909.147416866@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-onseek.patch --]
[-- Type: text/plain, Size: 1822 bytes --]

Readahead policy on read after seeking.

It tries to detect sequences like:
	seek(), 5*read(); seek(), 6*read(); seek(), 4*read(); ...
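
The detection relies on the last few run sizes being within roughly a factor
of two of each other; a sketch of just that similarity test (hit1..hit3 stand
for the accumulated hit counters the patch keeps in file_ra_state):

	/* E.g. runs of 5, 6 and 4 pages pass; 5, 6 and 40 pages do not. */
	int similar_run_sizes(unsigned long hit1, unsigned long hit2,
			      unsigned long hit3)
	{
		return hit1 > hit2 / 2 && hit2 > hit3 / 2 && hit3 > hit1 / 2;
	}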

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   43 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 43 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1614,6 +1614,49 @@ try_read_backward(struct file_ra_state *
 }
 
 /*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ *
+ * i.e. detect the following sequences:
+ * 	seek(), 5*read(); seek(), 6*read(); seek(), 4*read(); ...
+ *
+ * Databases are known to have this seek-and-read-N-pages pattern.
+ */
+static int
+try_readahead_on_seek(struct file_ra_state *ra, pgoff_t index,
+			unsigned long ra_size, unsigned long ra_max)
+{
+	unsigned long hit0 = ra_cache_hit(ra, 0);
+	unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+	unsigned long hit2 = ra_cache_hit(ra, 2);
+	unsigned long hit3 = ra_cache_hit(ra, 3);
+
+	/* There's a previous read-ahead request? */
+	if (!ra_has_index(ra, ra->prev_page))
+		return 0;
+
+	/* The previous read-ahead sequences have similar sizes? */
+	if (!(ra_size < hit1 && hit1 > hit2 / 2 &&
+				hit2 > hit3 / 2 &&
+				hit3 > hit1 / 2))
+		return 0;
+
+	hit1 = max(hit1, hit2);
+
+	/* Follow the same prefetching direction. */
+	if ((ra->flags & RA_CLASS_MASK) == RA_CLASS_BACKWARD)
+		index = ((index > hit1 - ra_size) ? index - hit1 + ra_size : 0);
+
+	ra_size = min(hit1, ra_max);
+
+	ra_set_class(ra, RA_CLASS_SEEK);
+	ra_set_index(ra, index, index);
+	ra_set_size(ra, ra_size, 0);
+
+	return 1;
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 25/33] readahead: thrashing recovery method
       [not found] ` <20060524111909.635589701@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-method-onthrash.patch --]
[-- Type: text/plain, Size: 1746 bytes --]

Readahead policy after thrashing.

It tries to recover gracefully from the thrashing.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 42 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1657,6 +1657,48 @@ try_readahead_on_seek(struct file_ra_sta
 }
 
 /*
+ * Readahead thrashing recovery.
+ */
+static unsigned long
+thrashing_recovery_readahead(struct address_space *mapping,
+				struct file *filp, struct file_ra_state *ra,
+				pgoff_t index, unsigned long ra_max)
+{
+	unsigned long ra_size;
+
+	if (find_page(mapping, index - 1))
+		ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+						ra->readahead_index - index);
+	ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+						ra->readahead_index - index);
+
+	/*
+	 * Some thrashing occurred in (ra_index, la_index], in which case the
+	 * old read-ahead chunk is lost soon after the new one is allocated.
+	 * Ensure that we recover all needed pages in the old chunk.
+	 */
+	if (index < ra->ra_index)
+		ra_size = ra->ra_index - index;
+	else {
+		/* After thrashing, we know the exact thrashing-threshold. */
+		ra_size = ra_cache_hit(ra, 0);
+		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
+
+		/* And we'd better be a bit conservative. */
+		ra_size = ra_size * 3 / 4;
+	}
+
+	if (ra_size > ra_max)
+		ra_size = ra_max;
+
+	ra_set_class(ra, RA_CLASS_THRASHING);
+	ra_set_index(ra, index, index);
+	ra_set_size(ra, ra_size, ra_size / LOOKAHEAD_RATIO);
+
+	return ra_dispatch(ra, mapping, filp);
+}
+
+/*
  * ra_min is mainly determined by the size of cache memory. Reasonable?
  *
  * Table of concrete numbers for 4KB page size:

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 26/33] readahead: call scheme
       [not found] ` <20060524111910.207894375@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 9456 bytes --]

The read-ahead logic is called when the reading hits
        - a look-ahead mark;
        - a non-present page.

ra.prev_page should be properly set up on entry, and readahead_cache_hit()
should be called on every page reference to maintain the cache_hits counter
(the caller-side protocol is sketched below).

This call scheme achieves the following goals:
        - makes all stateful/stateless methods happy;
        - eliminates the cache hit problem naturally;
        - lives in harmony with application-managed read-aheads via
          fadvise/madvise.
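
Caller side, the protocol boils down to the following shape (a condensed
sketch of the do_generic_mapping_read() changes in this patch, with error
handling omitted):

	page = find_get_page(mapping, index);
	if (page == NULL) {			/* condition 1: cache miss */
		ra.prev_page = prev_index;
		page_cache_readahead_adaptive(mapping, &ra, filp, prev_page,
					NULL, begin_index, index, end_index);
		page = find_get_page(mapping, index);
	} else if (PageReadahead(page)) {	/* condition 2: look-ahead mark */
		ra.prev_page = prev_index;
		page_cache_readahead_adaptive(mapping, &ra, filp, prev_page,
					page, begin_index, index, end_index);
	}
	if (page)
		readahead_cache_hit(&ra, page);	/* feedback on every reference */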

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/mm.h |    6 ++
 mm/filemap.c       |   51 ++++++++++++++++-
 mm/readahead.c     |  152 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 205 insertions(+), 4 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1033,6 +1033,12 @@ void handle_ra_miss(struct address_space
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
 void fastcall readahead_close(struct file *file);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t first_index, pgoff_t index, pgoff_t last_index);
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page);
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 extern int readahead_ratio;
--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -847,14 +847,32 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+
+		if (!prefer_adaptive_readahead() && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (prefer_adaptive_readahead()) {
+			if (unlikely(page == NULL)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, NULL,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, page,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
 
@@ -862,6 +880,9 @@ find_page:
 			page_cache_release(prev_page);
 		prev_page = page;
 
+		if (prefer_adaptive_readahead())
+			readahead_cache_hit(&ra, page);
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1005,6 +1026,8 @@ no_cached_page:
 
 out:
 	*_ra = ra;
+	if (prefer_adaptive_readahead())
+		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
@@ -1290,6 +1313,7 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	ra->flags |= RA_FLAG_MMAP;
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
 retry_all:
@@ -1307,19 +1331,33 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
+
 	/*
 	 * Do we have something in the page cache already?
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, NULL,
+						pgoff, pgoff, pgoff + 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, page,
+						pgoff, pgoff, pgoff + 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1356,6 +1394,9 @@ retry_find:
 	if (!did_readaround)
 		ra->mmap_hit++;
 
+	if (prefer_adaptive_readahead())
+		readahead_cache_hit(ra, page);
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.
@@ -1370,6 +1411,8 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
+	if (prefer_adaptive_readahead())
+		ra->prev_page = page->index;
 	return page;
 
 outside_data_content:
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1717,6 +1717,158 @@ static inline void get_readahead_bounds(
 					PAGES_KB(128)), *ra_max / 2);
 }
 
+/**
+ * page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
+ * @mapping, @ra, @filp: the same as page_cache_readahead()
+ * @prev_page: the page at @index-1, may be NULL to let the function find it
+ * @page: the page at @index, or NULL if non-present
+ * @begin_index, @index, @end_index: offsets into @mapping
+ * 		[@begin_index, @end_index) is the read the caller is performing
+ *	 	@index indicates the page to be read now
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * The caller is expected to point ra->prev_page to the previously accessed
+ * page, and to call this function under two conditions:
+ * 1. @page == NULL
+ *    A cache miss happened; some pages have to be read in.
+ * 2. @page != NULL && PageReadahead(@page)
+ *    A look-ahead mark was encountered; it was set by a previous read-ahead
+ *    invocation to give this function a chance to check up and do the next
+ *    read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t begin_index, pgoff_t index, pgoff_t end_index)
+{
+	unsigned long size;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+
+	might_sleep();
+
+	if (page) {
+		if (!TestClearPageReadahead(page))
+			return 0;
+		if (bdi_read_congested(mapping->backing_dev_info)) {
+			ra_account(ra, RA_EVENT_IO_CONGESTION,
+							end_index - index);
+			return 0;
+		}
+	}
+
+	if (page)
+		ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+				ra->readahead_index - ra->lookahead_index);
+	else if (index)
+		ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+	size = end_index - index;
+	get_readahead_bounds(ra, &ra_min, &ra_max);
+
+	/* readahead disabled? */
+	if (unlikely(!ra_max || !readahead_ratio)) {
+		size = max_sane_readahead(size);
+		goto readit;
+	}
+
+	/*
+	 * Start of file.
+	 */
+	if (index == 0)
+		return initial_readahead(mapping, filp, ra, size);
+
+	/*
+	 * State based sequential read-ahead.
+	 */
+	if (!debug_option(disable_stateful_method) &&
+			index == ra->lookahead_index && ra_cache_hit_ok(ra))
+		return state_based_readahead(mapping, filp, ra, page,
+							index, size, ra_max);
+
+	/*
+	 * Recover from possible thrashing.
+	 */
+	if (!page && index == ra->prev_page + 1 && ra_has_index(ra, index))
+		return thrashing_recovery_readahead(mapping, filp, ra,
+								index, ra_max);
+
+	/*
+	 * Backward read-ahead.
+	 */
+	if (!page && begin_index == index &&
+				try_read_backward(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Context based sequential read-ahead.
+	 */
+	ret = try_context_based_readahead(mapping, ra, prev_page, page,
+							index, ra_min, ra_max);
+	if (ret > 0)
+		return ra_dispatch(ra, mapping, filp);
+	if (ret < 0)
+		return 0;
+
+	/* No action on look ahead time? */
+	if (page) {
+		ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+						ra->readahead_index - index);
+		return 0;
+	}
+
+	/*
+	 * Random read that follows a sequential one.
+	 */
+	if (try_readahead_on_seek(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Random read.
+	 */
+	if (size > ra_max)
+		size = ra_max;
+
+readit:
+	size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+	ra_account(ra, RA_EVENT_RANDOM_READ, size);
+	dprintk("random_read(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+			mapping->host->i_ino, mapping->nrpages,
+			begin_index, index, end_index, size);
+
+	return size;
+}
+
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * readahead_cache_hit() is the feedback route of the adaptive read-ahead
+ * logic. It must be called on every access on the read-ahead pages.
+ */
+void fastcall readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+	if (!PageUptodate(page))
+		ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+	if (!ra_has_index(ra, page->index))
+		return;
+
+	ra->cache_hits++;
+
+	if (page->index >= ra->ra_index)
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+	else
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
 /*
  * When closing a normal readonly file,
  * 	- on cache hit:  increase `backing_dev_info.ra_expect_bytes' slowly;

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 27/33] readahead: laptop mode
       [not found] ` <20060524111910.544274094@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-26 17:38     ` Andrew Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Bart Samwel

[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3153 bytes --]

When the laptop drive is spun down, defer look-ahead to spin-up time.

The implementation employs a poll-based method, since performance is not a
concern in this code path. The poll interval is 64KB (16 pages with 4KB
pages), which should be small enough for movies/music. The user-space
application is responsible for proper caching to hide the spin-up-and-read
delay.

------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:

# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda      # this is the max possible value

Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
  which requires patching your movie player.
- Be sure to restore them to sane values in normal operations!

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 include/linux/writeback.h |    6 ++++++
 mm/page-writeback.c       |    2 +-
 mm/readahead.c            |   30 ++++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/writeback.h
+++ linux-2.6.17-rc4-mm3/include/linux/writeback.h
@@ -86,6 +86,12 @@ void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(void);
 
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+	return !timer_pending(&laptop_mode_wb_timer);
+}
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern int vm_dirty_ratio;
--- linux-2.6.17-rc4-mm3.orig/mm/page-writeback.c
+++ linux-2.6.17-rc4-mm3/mm/page-writeback.c
@@ -389,7 +389,7 @@ static void wb_timer_fn(unsigned long un
 static void laptop_timer_fn(unsigned long unused);
 
 static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
  * Periodic writeback of "old" data.
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -817,6 +817,31 @@ out:
 }
 
 /*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static inline int renew_lookahead(struct address_space *mapping,
+				struct file_ra_state *ra,
+				pgoff_t index, pgoff_t new_index)
+{
+	struct page *page;
+
+	if (index == ra->lookahead_index &&
+			new_index >= ra->readahead_index)
+		return 1;
+
+	page = find_page(mapping, new_index);
+	if (!page)
+		return 1;
+
+	__SetPageReadahead(page);
+	if (ra->lookahead_index == index)
+		ra->lookahead_index = new_index;
+
+	return 0;
+}
+
+/*
  * Update `backing_dev_info.ra_thrash_bytes' to be a _biased_ average of
  * read-ahead sizes. Which makes it an a-bit-risky(*) estimation of the
  * _minimal_ read-ahead thrashing threshold on the device.
@@ -1760,6 +1785,11 @@ page_cache_readahead_adaptive(struct add
 							end_index - index);
 			return 0;
 		}
+		if (laptop_mode && laptop_spinned_down()) {
+			if (!renew_lookahead(mapping, ra, index,
+						index + LAPTOP_POLL_INTERVAL))
+				return 0;
+		}
 	}
 
 	if (page)

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 28/33] readahead: loop case
       [not found] ` <20060524111911.032100160@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  2006-05-24 14:01   ` Limin Wang
  1 sibling, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-loop-case.patch --]
[-- Type: text/plain, Size: 894 bytes --]

Disable look-ahead for loop files.

Loopback files normally contain filesystems, in which case there is already
proper look-ahead in the upper layer; more look-ahead on the loopback file
only ruins the read-ahead hit rate.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

I'd like to thank Tero Grundström for uncovering the loopback problem.

 drivers/block/loop.c |    6 ++++++
 1 files changed, 6 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/drivers/block/loop.c
+++ linux-2.6.17-rc4-mm3/drivers/block/loop.c
@@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
 	mapping = file->f_mapping;
 	inode = mapping->host;
 
+	/*
+	 * The upper layer should already do proper look-ahead,
+	 * one more look-ahead here only ruins the cache hit rate.
+	 */
+	file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
 	if (!(file->f_mode & FMODE_WRITE))
 		lo_flags |= LO_FLAGS_READ_ONLY;
 

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 29/33] readahead: nfsd case
       [not found] ` <20060524111911.607080495@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang, Neil Brown

[-- Attachment #1: readahead-nfsd-case.patch --]
[-- Type: text/plain, Size: 4646 bytes --]

Bypass nfsd raparms cache -- the new logic does not rely on it.

--------------------------------
For the case of NFS service, the new read-ahead logic
+ can handle disordered nfsd requests
+ can handle concurrent sequential requests on large files
  with the help of look-ahead
- still struggles with concurrent requests on small files

--------------------------------
Notes about the concurrent nfsd requests issue:

nfsd read requests can arrive out of order, concurrently, and with no
ra-state info. They are handled by the context based read-ahead method,
which does the job in the following steps:

1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages into page cache

A single read-ahead chunk on the client side will be disassembled and
serviced by many concurrent nfsd instances on the server side. It is highly
likely for two or more of these parallel nfsd instances to be in step 1/2/3
at the same time. Not knowing that others are working on the same file
region, they will issue overlapping read-ahead requests, which leads to many
conflicts at step 4, as the sketch below illustrates.

There is little hope of eliminating the concurrency problem in a general and
efficient way. But experiments show that mounting with tcp,rsize=32768 can
cut down the overhead a lot.
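
To make the conflict concrete, here is a minimal sketch of the per-thread
sequence (an illustration only, not code from the patchset; the step-2
heuristics are stubbed out with a fixed size):

/*
 * Each nfsd thread runs this independently.  Two threads that pass
 * step 1 together will both decide to read the same region, and the
 * loser of the step-4 race gets -EEXIST back from
 * add_to_page_cache_lru().
 */
static void context_readahead_sketch(struct address_space *mapping,
				     pgoff_t index)
{
	struct page *page;
	pgoff_t start = index, i, size;

	/* 1. scan in page cache: find where the cached run ends */
	while ((page = find_get_page(mapping, start))) {
		page_cache_release(page);
		start++;
	}

	/* 2. make read-ahead decisions (stubbed: always 32 pages) */
	size = 32;

	/* 3. alloc new pages;  4. insert them into the page cache */
	for (i = start; i < start + size; i++) {
		page = page_cache_alloc_cold(mapping);
		if (!page)
			break;
		if (add_to_page_cache_lru(page, mapping, i, GFP_KERNEL)) {
			/* conflict: a concurrent thread got here first */
			page_cache_release(page);
			continue;
		}
		/* ...submit the page for read I/O... */
	}
}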

--------------------------------
Here are the benchmark outputs. The test cases cover
- small/big files
- small/big rsize mount option
- serialized/parallel nfsd processing

`serialized' means running the following command to enforce serialized
nfsd request processing:

# for pid in `pidof nfsd`; do taskset -p 1 $pid; done

8 nfsd; local mount with tcp,rsize=8192
=======================================

SERIALIZED, SMALL FILES
readahead_ratio = 0, ra_max = 128kb (old logic, the ra_max is not quite relevant)
96.51s real  11.32s system  3.27s user  160334+2829 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb (new read-ahead logic)
94.88s real  11.53s system  3.20s user  152415+3777 cs  diff -r $NFSDIR $NFSDIR2

SERIALIZED, BIG FILES
readahead_ratio = 0, ra_max = 128kb
56.52s real  3.38s system  1.23s user  47930+5256 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
32.54s real  5.71s system  1.38s user  23851+17007 cs  diff $NFSFILE $NFSFILE2

PARALLEL, SMALL FILES
readahead_ratio = 0, ra_max = 128kb
99.87s real  11.41s system  3.15s user  173945+9163 cs  diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
100.14s real  12.06s system  3.16s user  170865+13406 cs  diff -r $NFSDIR $NFSDIR2

PARALLEL, BIG FILES
readahead_ratio = 0, ra_max = 128kb
63.35s real  5.68s system  1.57s user  82594+48747 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
33.87s real  10.17s system  1.55s user  72291+100079 cs  diff $NFSFILE $NFSFILE2

8 nfsd; local mount with tcp,rsize=32768
========================================
Note that the normal numbers are now much better, and come close to
those of the serialized runs.

PARALLEL/NORMAL

readahead_ratio = 8, ra_max = 1024kb (old logic)
48.36s real  2.22s system  1.51s user  7209+4110 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb (new logic)
30.04s real  2.46s system  1.33s user  5420+2492 cs  diff $NFSFILE $NFSFILE2

readahead_ratio = 8, ra_max = 1024kb
92.99s real  10.32s system  3.23s user  145004+1826 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
90.96s real  10.68s system  3.22s user  144414+2520 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null

SERIALIZED

readahead_ratio = 8, ra_max = 1024kb
47.58s real  2.10s system  1.27s user  7933+1357 cs  diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
29.46s real  2.41s system  1.38s user  5590+2613 cs  diff $NFSFILE $NFSFILE2

readahead_ratio = 8, ra_max = 1024kb
93.02s real  10.67s system  3.25s user  144850+2286 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null
readahead_ratio = 70, ra_max = 1024kb
91.15s real  11.04s system  3.31s user  144432+2814 cs  diff -r $NFSDIR $NFSDIR2 > /dev/null

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---


Greg Banks gave a valuable recommendation on the test cases, which helped
me get a more complete picture. Thanks!

 fs/nfsd/vfs.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/fs/nfsd/vfs.c
+++ linux-2.6.17-rc4-mm3/fs/nfsd/vfs.c
@@ -829,7 +829,10 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
 #endif
 
 	/* Get readahead parameters */
-	ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+	if (prefer_adaptive_readahead())
+		ra = NULL;
+	else
+		ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
 
 	if (ra && ra->p_set)
 		file->f_ra = ra->p_ra;

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 30/33] readahead: turn on by default
       [not found] ` <20060524111912.156646847@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-kconfig-option-default-on.patch --]
[-- Type: text/plain, Size: 570 bytes --]

Enable the adaptive readahead logic by default.

It helps gather more early testers, and is meant to be an -mm-only patch.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/Kconfig |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/mm/Kconfig
+++ linux-2.6.17-rc4-mm3/mm/Kconfig
@@ -152,7 +152,7 @@ config MIGRATION
 #
 config ADAPTIVE_READAHEAD
 	bool "Adaptive file readahead (EXPERIMENTAL)"
-	default n
+	default y
 	depends on EXPERIMENTAL
 	help
 	  Readahead is a technique employed by the kernel in an attempt

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 31/33] readahead: debug radix tree new functions
       [not found] ` <20060524111912.485160282@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-radix-tree.patch --]
[-- Type: text/plain, Size: 2040 bytes --]

Do some sanity checks on the newly added radix tree code.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -63,6 +63,8 @@ enum ra_class {
 	RA_CLASS_COUNT
 };
 
+#define DEBUG_READAHEAD_RADIXTREE
+
 /* Read-ahead events to be accounted. */
 enum ra_event {
 	RA_EVENT_CACHE_MISS,		/* read cache misses */
@@ -1315,6 +1317,16 @@ static pgoff_t find_segtail(struct addre
 	cond_resched();
 	read_lock_irq(&mapping->tree_lock);
 	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	BUG_ON(!__find_page(mapping, index));
+	WARN_ON(ra_index < index);
+	if (ra_index != index && !__find_page(mapping, ra_index - 1))
+		printk(KERN_ERR "radix_tree_scan_hole(index=%lu ra_index=%lu "
+				"max_scan=%lu nrpages=%lu) fooled!\n",
+				index, ra_index, max_scan, mapping->nrpages);
+	if (ra_index != ~0UL && ra_index - index < max_scan)
+		WARN_ON(__find_page(mapping, ra_index));
+#endif
 	read_unlock_irq(&mapping->tree_lock);
 
 	if (ra_index <= index + max_scan)
@@ -1407,6 +1419,13 @@ static unsigned long query_page_cache_se
 	read_lock_irq(&mapping->tree_lock);
 	index = radix_tree_scan_hole_backward(&mapping->page_tree,
 							offset, ra_max);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+	WARN_ON(index > offset);
+	if (index != offset)
+		WARN_ON(!__find_page(mapping, index + 1));
+	if (index && offset - index < ra_max)
+		WARN_ON(__find_page(mapping, index));
+#endif
 	read_unlock_irq(&mapping->tree_lock);
 
 	*remain = offset - index;
@@ -1442,6 +1461,11 @@ static unsigned long query_page_cache_se
 		struct radix_tree_node *node;
 		node = radix_tree_cache_lookup_parent(&mapping->page_tree,
 						&cache, offset - count, 1);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+		if (node != radix_tree_lookup_parent(&mapping->page_tree,
+							offset - count, 1))
+			BUG();
+#endif
 		if (!node)
 			break;
 	}

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 32/33] readahead: debug traces showing accessed file names
       [not found] ` <20060524111912.967392912@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-traces-file-list.patch --]
[-- Type: text/plain, Size: 1011 bytes --]

Print file names on their first read-ahead, for tracing file access patterns.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/readahead.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -1074,6 +1074,20 @@ static int ra_dispatch(struct file_ra_st
 		ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
 	ra_account(ra, RA_EVENT_READAHEAD, actual);
 
+	if (!ra->ra_index && filp->f_dentry->d_inode) {
+		char *fn;
+		static char path[1024];
+		unsigned long size;
+
+		size = (i_size_read(filp->f_dentry->d_inode)+1023)/1024;
+		fn = d_path(filp->f_dentry, filp->f_vfsmnt, path, 1000);
+		if (!IS_ERR(fn))
+			ddprintk("ino %lu is %s size %luK by %s(%d)\n",
+					filp->f_dentry->d_inode->i_ino,
+					fn, size,
+					current->comm, current->pid);
+	}
+
 	dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
 			ra_class_name[ra_class],
 			mapping->host->i_ino, ra->la_index,

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 33/33] readahead: debug traces showing read patterns
       [not found] ` <20060524111913.603476893@localhost.localdomain>
@ 2006-05-24 11:13   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 11:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Wu Fengguang

[-- Attachment #1: readahead-debug-traces-access-pattern.patch --]
[-- Type: text/plain, Size: 2664 bytes --]

Print all relevant read requests to help discover the access pattern.

If you are experiencing performance problems, or want to help improve
the read-ahead logic, please send me the trace data. Thanks.

- Preparations

# Compile kernel with option CONFIG_DEBUG_READAHEAD
mkdir /debug
mount -t debugfs none /debug

- For each session with distinct access pattern

echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to back it up first
echo 8 > /debug/readahead/debug_level # show verbose printk traces
# do one benchmark/task
echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---

 mm/filemap.c |   23 ++++++++++++++++++++++-
 1 files changed, 22 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/mm/filemap.c
+++ linux-2.6.17-rc4-mm3/mm/filemap.c
@@ -45,6 +45,12 @@ static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	loff_t offset, unsigned long nr_segs);
 
+#ifdef CONFIG_DEBUG_READAHEAD
+extern u32 debug_level;
+#else
+#define debug_level 0
+#endif /* CONFIG_DEBUG_READAHEAD */
+
 /*
  * Shared mappings implemented 30.11.1994. It's not fully working yet,
  * though.
@@ -829,6 +835,10 @@ void do_generic_mapping_read(struct addr
 	if (!isize)
 		goto out;
 
+	if (debug_level >= 5)
+		printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+			inode->i_ino, index, last_index - index);
+
 	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
 	for (;;) {
 		struct page *page;
@@ -883,6 +893,11 @@ find_page:
 		if (prefer_adaptive_readahead())
 			readahead_cache_hit(&ra, page);
 
+		if (debug_level >= 7)
+			printk(KERN_DEBUG "read-page(ino=%lu, idx=%lu, io=%s)\n",
+				inode->i_ino, index,
+				PageUptodate(page) ? "hit" : "miss");
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1334,7 +1349,6 @@ retry_all:
 	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
-
 	/*
 	 * Do we have something in the page cache already?
 	 */
@@ -1397,6 +1411,13 @@ retry_find:
 	if (prefer_adaptive_readahead())
 		readahead_cache_hit(ra, page);
 
+	if (debug_level >= 6)
+		printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s, io=%s)\n",
+			inode->i_ino, pgoff,
+			VM_RandomReadHint(area) ? "random" :
+			(VM_SequentialReadHint(area) ? "sequential" : "none"),
+			PageUptodate(page) ? "hit" : "miss");
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 04/33] readahead: page flag PG_readahead
       [not found] ` <20060524111858.869793445@localhost.localdomain>
  2006-05-24 11:12   ` [PATCH 04/33] readahead: page flag PG_readahead Wu Fengguang
@ 2006-05-24 12:27   ` Peter Zijlstra
       [not found]     ` <20060524123740.GA16304@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 12:27 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

On Wed, 2006-05-24 at 19:12 +0800, Wu Fengguang wrote:
> plain text document attachment
> (readahead-page-flag-PG_readahead.patch)
> A new page flag PG_readahead is introduced as a look-ahead mark, which
> reminds the caller to give the adaptive read-ahead logic a chance to do
> read-ahead ahead of time for I/O pipelining.
> 
> It roughly corresponds to `ahead_start' of the stock read-ahead logic.
> 
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
> 
>  include/linux/page-flags.h |    5 +++++
>  mm/page_alloc.c            |    2 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> --- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
> +++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
> @@ -89,6 +89,7 @@
>  #define PG_reclaim		17	/* To be reclaimed asap */
>  #define PG_nosave_free		18	/* Free, should not be written */
>  #define PG_buddy		19	/* Page is free, on buddy lists */
> +#define PG_readahead		20	/* Reminder to do readahead */
>  

Page flags are grouped by four; 20 would start a new set.
Also in my tree (git from a few days ago), 20 is taken by PG_uncached.
What code is this patch-set against?




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 04/33] readahead: page flag PG_readahead
       [not found]     ` <20060524123740.GA16304@mail.ustc.edu.cn>
@ 2006-05-24 12:37       ` Wu Fengguang
  2006-05-24 12:48       ` Peter Zijlstra
  1 sibling, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 12:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel

On Wed, May 24, 2006 at 02:27:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2006-05-24 at 19:12 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (readahead-page-flag-PG_readahead.patch)
> > A new page flag PG_readahead is introduced as a look-ahead mark, which
> > reminds the caller to give the adaptive read-ahead logic a chance to do
> > read-ahead ahead of time for I/O pipelining.
> > 
> > It roughly corresponds to `ahead_start' of the stock read-ahead logic.
> > 
> > Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> > ---
> > 
> >  include/linux/page-flags.h |    5 +++++
> >  mm/page_alloc.c            |    2 +-
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > --- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
> > +++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
> > @@ -89,6 +89,7 @@
> >  #define PG_reclaim		17	/* To be reclaimed asap */
> >  #define PG_nosave_free		18	/* Free, should not be written */
> >  #define PG_buddy		19	/* Page is free, on buddy lists */
> > +#define PG_readahead		20	/* Reminder to do readahead */
> >  
> 
> > Page flags are grouped by four; 20 would start a new set.
> > Also in my tree (git from a few days ago), 20 is taken by PG_uncached.

Thanks, grouped and renumbered it as 21.

> What code is this patch-set against?

It's against the latest -mm tree: linux-2.6.17-rc4-mm3.

Wu
---

Subject: readahead: page flag PG_readahead

A new page flag PG_readahead is introduced as a look-ahead mark, which
reminds the caller to give the adaptive read-ahead logic a chance to do
read-ahead ahead of time for I/O pipelining.

It roughly corresponds to `ahead_start' of the stock read-ahead logic.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
 include/linux/page-flags.h |    5 +++++
 mm/page_alloc.c            |    2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

--- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
+++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
@@ -90,6 +90,8 @@
 #define PG_nosave_free		18	/* Free, should not be written */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_readahead		21	/* Reminder to do readahead */
+
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -372,6 +374,10 @@ extern void __mod_page_state_offset(unsi
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageReadahead(page)	test_bit(PG_readahead, &(page)->flags)
+#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
--- linux-2.6.17-rc4-mm3.orig/mm/page_alloc.c
+++ linux-2.6.17-rc4-mm3/mm/page_alloc.c
@@ -564,7 +564,7 @@ static int prep_new_page(struct page *pa
 	if (PageReserved(page))
 		return 1;
 
-	page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+	page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
 			1 << PG_referenced | 1 << PG_arch_1 |
 			1 << PG_checked | 1 << PG_mappedtodisk);
 	set_page_private(page, 0);

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/33] readahead: context based method
       [not found] ` <20060524111905.586110688@localhost.localdomain>
  2006-05-24 11:13   ` [PATCH 17/33] readahead: context based method Wu Fengguang
@ 2006-05-24 12:37   ` Peter Zijlstra
       [not found]     ` <20060524133353.GA16508@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 12:37 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:

> +#define PAGE_REFCNT_0           0
> +#define PAGE_REFCNT_1           (1 << PG_referenced)
> +#define PAGE_REFCNT_2           (1 << PG_active)
> +#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
> +#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
> +
> +/*
> + * STATUS   REFERENCE COUNT
> + *  __                   0
> + *  _R       PAGE_REFCNT_1
> + *  A_       PAGE_REFCNT_2
> + *  AR       PAGE_REFCNT_3
> + *
> + *  A/R: Active / Referenced
> + */
> +static inline unsigned long page_refcnt(struct page *page)
> +{
> +        return page->flags & PAGE_REFCNT_MASK;
> +}
> +
> +/*
> + * STATUS   REFERENCE COUNT      TYPE
> + *  __                   0      fresh
> + *  _R       PAGE_REFCNT_1      stale
> + *  A_       PAGE_REFCNT_2      disturbed once
> + *  AR       PAGE_REFCNT_3      disturbed twice
> + *
> + *  A/R: Active / Referenced
> + */
> +static inline unsigned long cold_page_refcnt(struct page *page)
> +{
> +	if (!page || PageActive(page))
> +		return 0;
> +
> +	return page_refcnt(page);
> +}
> +

Why all of this if all you're ever going to use is cold_page_refcnt.
What about something like this:

static inline int cold_page_referenced(struct page *page)
{
	if (!page || PageActive(page))
		return 0;
	return !!PageReferenced(page);
}

> +
> +/*
> + * Count/estimate cache hits in range [first_index, last_index].
> + * The estimation is simple and optimistic.
> + */
> +static int count_cache_hit(struct address_space *mapping,
> +				pgoff_t first_index, pgoff_t last_index)
> +{
> +	struct page *page;
> +	int size = last_index - first_index + 1;
> +	int count = 0;
> +	int i;
> +
> +	cond_resched();
> +	read_lock_irq(&mapping->tree_lock);
> +
> +	/*
> +	 * The first page may well be the chunk head and has been accessed,
> +	 * so it is index 0 that makes the estimation optimistic. This
> +	 * behavior guarantees a readahead when (size < ra_max) and
> +	 * (readahead_hit_rate >= 16).
> +	 */
> +	for (i = 0; i < 16;) {
> +		page = __find_page(mapping, first_index +
> +						size * ((i++ * 29) & 15) / 16);
> +		if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
                      cold_page_referenced(page) && ++count >= 2
> +			break;
> +	}
> +
> +	read_unlock_irq(&mapping->tree_lock);
> +
> +	return size * count / i;
> +}
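
An aside on the sampling expression quoted above, since the magic numbers
are not explained anywhere:

/*
 * 29 is odd, hence coprime to 16, so (i * 29) & 15 for i = 0..15
 * yields each of 0..15 exactly once:
 *
 *	0 13 10 7 4 1 14 11 8 5 2 15 12 9 6 3
 *
 * The loop thus probes up to 16 distinct offsets spread pseudo-randomly
 * over [first_index, last_index], instead of clustering its samples at
 * the start of the range.
 */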



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 04/33] readahead: page flag PG_readahead
       [not found]     ` <20060524123740.GA16304@mail.ustc.edu.cn>
  2006-05-24 12:37       ` Wu Fengguang
@ 2006-05-24 12:48       ` Peter Zijlstra
  1 sibling, 0 replies; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 12:48 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

On Wed, 2006-05-24 at 20:37 +0800, Wu Fengguang wrote:
> On Wed, May 24, 2006 at 02:27:36PM +0200, Peter Zijlstra wrote:
> > On Wed, 2006-05-24 at 19:12 +0800, Wu Fengguang wrote:


> > > --- linux-2.6.17-rc4-mm3.orig/include/linux/page-flags.h
> > > +++ linux-2.6.17-rc4-mm3/include/linux/page-flags.h
> > > @@ -89,6 +89,7 @@
> > >  #define PG_reclaim		17	/* To be reclaimed asap */
> > >  #define PG_nosave_free		18	/* Free, should not be written */
> > >  #define PG_buddy		19	/* Page is free, on buddy lists */
> > > +#define PG_readahead		20	/* Reminder to do readahead */
> > >  
> > 
> > Page flags are gouped by four, 20 would start a new set.
> > Also in my tree (git from a few days ago), 20 is taken by PG_unchached.
> 
> Thanks, grouped and renumbered it as 21.
> 
> > What code is this patch-set against?
> 
> It's against the latest -mm tree: linux-2.6.17-rc4-mm3.

Ah, now I see, -mm has got a trick up its sleeve for PG_uncached.

20 would indeed be the correct number for -mm. Then my sole comment
would be the grouping, which is a stylistic nit really.

Sorry for the confusion.

Peter


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/33] readahead: context based method
       [not found]     ` <20060524133353.GA16508@mail.ustc.edu.cn>
@ 2006-05-24 13:33       ` Wu Fengguang
  2006-05-24 15:53       ` Peter Zijlstra
  1 sibling, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-24 13:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel

On Wed, May 24, 2006 at 02:37:48PM +0200, Peter Zijlstra wrote:
> On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
> 
> > +#define PAGE_REFCNT_0           0
> > +#define PAGE_REFCNT_1           (1 << PG_referenced)
> > +#define PAGE_REFCNT_2           (1 << PG_active)
> > +#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
> > +#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
> > +
> > +/*
> > + * STATUS   REFERENCE COUNT
> > + *  __                   0
> > + *  _R       PAGE_REFCNT_1
> > + *  A_       PAGE_REFCNT_2
> > + *  AR       PAGE_REFCNT_3
> > + *
> > + *  A/R: Active / Referenced
> > + */
> > +static inline unsigned long page_refcnt(struct page *page)
> > +{
> > +        return page->flags & PAGE_REFCNT_MASK;
> > +}
> > +
> > +/*
> > + * STATUS   REFERENCE COUNT      TYPE
> > + *  __                   0      fresh
> > + *  _R       PAGE_REFCNT_1      stale
> > + *  A_       PAGE_REFCNT_2      disturbed once
> > + *  AR       PAGE_REFCNT_3      disturbed twice
> > + *
> > + *  A/R: Active / Referenced
> > + */
> > +static inline unsigned long cold_page_refcnt(struct page *page)
> > +{
> > +	if (!page || PageActive(page))
> > +		return 0;
> > +
> > +	return page_refcnt(page);
> > +}
> > +
> 
> Why all of this if all you're ever going to use is cold_page_refcnt.

Well, the two functions have a long history...

There used to be a PG_activate flag which made the two functions quite
different. It was later removed for fear of the behavior changes it
introduced. However, there's still a possibility that someone will
reintroduce similar flags in the future :)

> What about something like this:
> 
> static inline int cold_page_referenced(struct page *page)
> {
> 	if (!page || PageActive(page))
> 		return 0;
> 	return !!PageReferenced(page);
> }

Ah, here's another theory: the algorithm uses reference count
conceptually, so it may be better to retain the current form.

Thanks,
Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 28/33] readahead: loop case
       [not found] ` <20060524111911.032100160@localhost.localdomain>
  2006-05-24 11:13   ` [PATCH 28/33] readahead: loop case Wu Fengguang
@ 2006-05-24 14:01   ` Limin Wang
       [not found]     ` <20060525154846.GA6907@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Limin Wang @ 2006-05-24 14:01 UTC (permalink / raw)
  To: linux-kernel


If the loopback file is bigger than the memory size, it may cause misses
again; might it be better to turn on the read-ahead?


Regards,
Limin
* Wu Fengguang <wfg@mail.ustc.edu.cn> [2006-05-24 19:13:14 +0800]:

> Disable look-ahead for loop file.
> 
> Loopback files normally contain filesystems, in which case there are already
> proper look-aheads in the upper layer, more look-aheads on the loopback file
> only ruins the read-ahead hit rate.
> 
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
> 
> I'd like to thank Tero Grundström for uncovering the loopback problem.
> 
>  drivers/block/loop.c |    6 ++++++
>  1 files changed, 6 insertions(+)
> 
> --- linux-2.6.17-rc4-mm3.orig/drivers/block/loop.c
> +++ linux-2.6.17-rc4-mm3/drivers/block/loop.c
> @@ -779,6 +779,12 @@ static int loop_set_fd(struct loop_devic
>  	mapping = file->f_mapping;
>  	inode = mapping->host;
>  
> +	/*
> +	 * The upper layer should already do proper look-ahead,
> +	 * one more look-ahead here only ruins the cache hit rate.
> +	 */
> +	file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
> +
>  	if (!(file->f_mode & FMODE_WRITE))
>  		lo_flags |= LO_FLAGS_READ_ONLY;
>  
> 
> --
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/33] readahead: context based method
       [not found]     ` <20060524133353.GA16508@mail.ustc.edu.cn>
  2006-05-24 13:33       ` Wu Fengguang
@ 2006-05-24 15:53       ` Peter Zijlstra
       [not found]         ` <20060525012556.GA6111@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2006-05-24 15:53 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

On Wed, 2006-05-24 at 21:33 +0800, Wu Fengguang wrote:
> On Wed, May 24, 2006 at 02:37:48PM +0200, Peter Zijlstra wrote:
> > On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
> > 
> > > +#define PAGE_REFCNT_0           0
> > > +#define PAGE_REFCNT_1           (1 << PG_referenced)
> > > +#define PAGE_REFCNT_2           (1 << PG_active)
> > > +#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
> > > +#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
> > > +
> > > +/*
> > > + * STATUS   REFERENCE COUNT
> > > + *  __                   0
> > > + *  _R       PAGE_REFCNT_1
> > > + *  A_       PAGE_REFCNT_2
> > > + *  AR       PAGE_REFCNT_3
> > > + *
> > > + *  A/R: Active / Referenced
> > > + */
> > > +static inline unsigned long page_refcnt(struct page *page)
> > > +{
> > > +        return page->flags & PAGE_REFCNT_MASK;
> > > +}
> > > +
> > > +/*
> > > + * STATUS   REFERENCE COUNT      TYPE
> > > + *  __                   0      fresh
> > > + *  _R       PAGE_REFCNT_1      stale
> > > + *  A_       PAGE_REFCNT_2      disturbed once
> > > + *  AR       PAGE_REFCNT_3      disturbed twice
> > > + *
> > > + *  A/R: Active / Referenced
> > > + */
> > > +static inline unsigned long cold_page_refcnt(struct page *page)
> > > +{
> > > +	if (!page || PageActive(page))
> > > +		return 0;
> > > +
> > > +	return page_refcnt(page);
> > > +}
> > > +
> > 
> > Why all of this if all you're ever going to use is cold_page_refcnt.
> 
> Well, the two functions have a long history...
> 
> There used to be a PG_activate flag which made the two functions quite
> different. It was later removed for fear of the behavior changes it
> introduced. However, there's still a possibility that someone will
> reintroduce similar flags in the future :)
> 
> > What about something like this:
> > 
> > static inline int cold_page_referenced(struct page *page)
> > {
> > 	if (!page || PageActive(page))
> > 		return 0;
> > 	return !!PageReferenced(page);
> > }
> 
> Ah, here's another theory: the algorithm uses reference count
> conceptually, so it may be better to retain the current form.

Reference count of what, exactly? If you mean of the page, I'd have
expected only the first function, page_refcnt().

What I don't exactly understand is why you specialise to the inactive
list. Why do you need that?

The reason I'm asking is that when I merge this with my page replacement
work, I need to find a generalised concept. cold_page_refcnt() would
become to mean something like: number of references for those pages that
are direct reclaim candidates. And honestly, that doesn't make a lot of
sense.

If you could explain the concept behind this, I'd be grateful.

Peter




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/33] readahead: context based method
       [not found]         ` <20060525012556.GA6111@mail.ustc.edu.cn>
@ 2006-05-25  1:25           ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25  1:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-kernel

On Wed, May 24, 2006 at 05:53:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2006-05-24 at 21:33 +0800, Wu Fengguang wrote:
> > On Wed, May 24, 2006 at 02:37:48PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2006-05-24 at 19:13 +0800, Wu Fengguang wrote:
> > > 
> > > > +#define PAGE_REFCNT_0           0
> > > > +#define PAGE_REFCNT_1           (1 << PG_referenced)
> > > > +#define PAGE_REFCNT_2           (1 << PG_active)
> > > > +#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
> > > > +#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
> > > > +
> > > > +/*
> > > > + * STATUS   REFERENCE COUNT
> > > > + *  __                   0
> > > > + *  _R       PAGE_REFCNT_1
> > > > + *  A_       PAGE_REFCNT_2
> > > > + *  AR       PAGE_REFCNT_3
> > > > + *
> > > > + *  A/R: Active / Referenced
> > > > + */
> > > > +static inline unsigned long page_refcnt(struct page *page)
> > > > +{
> > > > +        return page->flags & PAGE_REFCNT_MASK;
> > > > +}
> > > > +
> > > > +/*
> > > > + * STATUS   REFERENCE COUNT      TYPE
> > > > + *  __                   0      fresh
> > > > + *  _R       PAGE_REFCNT_1      stale
> > > > + *  A_       PAGE_REFCNT_2      disturbed once
> > > > + *  AR       PAGE_REFCNT_3      disturbed twice
> > > > + *
> > > > + *  A/R: Active / Referenced
> > > > + */
> > > > +static inline unsigned long cold_page_refcnt(struct page *page)
> > > > +{
> > > > +	if (!page || PageActive(page))
> > > > +		return 0;
> > > > +
> > > > +	return page_refcnt(page);
> > > > +}
> > > > +
> > > 
> > > Why all of this if all you're ever going to use is cold_page_refcnt.
> > 
> > Well, the two functions have a long history...
> > 
> > There used to be a PG_activate flag which made the two functions quite
> > different. It was later removed for fear of the behavior changes it
> > introduced. However, there's still a possibility that someone will
> > reintroduce similar flags in the future :)
> > 
> > > What about something like this:
> > > 
> > > static inline int cold_page_referenced(struct page *page)
> > > {
> > > 	if (!page || PageActive(page))
> > > 		return 0;
> > > 	return !!PageReferenced(page);
> > > }
> > 
> > Ah, here's another theory: the algorithm uses reference count
> > conceptually, so it may be better to retain the current form.
> 
> Reference count of what, exactly? If you mean of the page, I'd have
> expected only the first function, page_refcnt().
> 
> What I don't exactly understand is why you specialise to the inactive
> list. Why do you need that?
> 
> The reason I'm asking is that when I merge this with my page replacement
> work, I need to find a generalised concept. cold_page_refcnt() would
> come to mean something like: number of references for those pages that
> are direct reclaim candidates. And honestly, that doesn't make a lot of
> sense.
> 
> If you could explain the concept behind this, I'd be grateful.

Good question, and sorry for not mentioning this earlier...

There are some background info here:

        [DISTURBS] section of
        http://marc.theaimsgroup.com/?l=linux-kernel&m=112678976802381&w=2

        [DELAYED ACTIVATION] section of
        http://marc.theaimsgroup.com/?l=linux-kernel&m=112679176611006&w=2

It involves a tricky situation where there are two sequential readers
that come close enough that the follower retouches the pages
visited by the leader:

          chunk 1         chunk 2               chunk 3
        ==========  =============-------  --------------------
                       follower ^                     leader ^

It is all OK if the revisited pages still stay in the inactive list;
these pages will act as a measurement of len(inactive list)/speed(leader).
But if the revisited pages (marked by '=') are sent to the active list
immediately, the measurement will no longer be as accurate. The trace
is 'disturbed'. In this case, using page_refcnt() can be aggressive
and unsafe with respect to thrashing, while cold_page_refcnt() is
conservative.

So either one of page_refcnt()/cold_page_refcnt() should be OK, as
long as we know the consequences of this situation. After all, the
context based method is invoked only rarely, and this kind of
situation is rarer still.

Regards,
Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 12/33] readahead: min/max sizes
  2006-05-24 11:12   ` [PATCH 11/33] readahead: sysctl parameters Wu Fengguang
@ 2006-05-25  4:50     ` Nick Piggin
       [not found]       ` <20060525121206.GI4996@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25  4:50 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Wu Fengguang wrote:

>- Enlarge VM_MAX_READAHEAD to 1024 if new read-ahead code is compiled in.
>  This value is no longer tightly coupled with the thrashing problem,
>  and therefore no longer constrained by it. The adaptive read-ahead
>  logic merely takes it as an upper bound, and will not stick to it
>  under memory pressure.
>

I guess this size enlargement is one of the main reasons your
patchset improves performance in some cases.

There is currently some sort of thrashing protection in there.
Obviously you've found it to be unable to cope with some situations
and introduced a lot of really fancy stuff to fix it. Are these just
academic access patterns, or do you have real test cases that
demonstrate this failure (ie. can we try to incrementally improve
the current logic as well as work towards merging your readahead
rewrite?)

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/33] readahead: support functions
  2006-05-24 11:12   ` [PATCH 10/33] readahead: support functions Wu Fengguang
@ 2006-05-25  5:13     ` Nick Piggin
       [not found]       ` <20060525111318.GH4996@mail.ustc.edu.cn>
  2006-05-25 16:48     ` Andrew Morton
  1 sibling, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25  5:13 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Wu Fengguang wrote:

>+#ifdef CONFIG_ADAPTIVE_READAHEAD
>+
>+/*
>+ * The nature of read-ahead allows false tests to occur occasionally.
>+ * Here we just do not bother to call get_page(), it's meaningless anyway.
>+ */
>+static inline struct page *__find_page(struct address_space *mapping,
>+							pgoff_t offset)
>+{
>+	return radix_tree_lookup(&mapping->page_tree, offset);
>+}
>+
>+static inline struct page *find_page(struct address_space *mapping,
>+							pgoff_t offset)
>+{
>+	struct page *page;
>+
>+	read_lock_irq(&mapping->tree_lock);
>+	page = __find_page(mapping, offset);
>+	read_unlock_irq(&mapping->tree_lock);
>+	return page;
>+}
>  
>

Meh, this is just open-coded elsewhere in readahead.c; I'd either
open code it, or do a new patch to replace the existing callers.
find_page should be in mm/filemap.c, btw (or include/linux/pagemap.h).

>+
>+/*
>+ * Move pages in danger (of thrashing) to the head of inactive_list.
>+ * Not expected to happen frequently.
>+ */
>+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
>  
>

Should probably be in mm/vmscan.c

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/33] readahead: context based method
  2006-05-24 11:13   ` [PATCH 17/33] readahead: context based method Wu Fengguang
@ 2006-05-25  5:26     ` Nick Piggin
       [not found]       ` <20060525080308.GB4996@mail.ustc.edu.cn>
  2006-05-26 17:23     ` Andrew Morton
  2006-05-26 17:27     ` Andrew Morton
  2 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25  5:26 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Wu Fengguang wrote:

>
>+/*
>+ * Look back and check history pages to estimate thrashing-threshold.
>+ */
>+static unsigned long query_page_cache_segment(struct address_space *mapping,
>+				struct file_ra_state *ra,
>+				unsigned long *remain, pgoff_t offset,
>+				unsigned long ra_min, unsigned long ra_max)
>+{
>+	pgoff_t index;
>+	unsigned long count;
>+	unsigned long nr_lookback;
>+	struct radix_tree_cache cache;
>+
>+	/*
>+	 * Scan backward and check the near @ra_max pages.
>+	 * The count here determines ra_size.
>+	 */
>+	cond_resched();
>+	read_lock_irq(&mapping->tree_lock);
>+	index = radix_tree_scan_hole_backward(&mapping->page_tree,
>+							offset, ra_max);
>+	read_unlock_irq(&mapping->tree_lock);
>

Why do you drop this lock just to pick it up again a few instructions
down the line? (is ra_cache_hit_ok or count_cache_hit very big or
unable to be called without the lock?)

>+
>+	*remain = offset - index;
>+
>+	if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
>+		count = *remain;
>+	else if (count_cache_hit(mapping, index + 1, offset) *
>+						readahead_hit_rate >= *remain)
>+		count = *remain;
>+	else
>+		count = ra_min;
>+
>+	/*
>+	 * Unnecessary to count more?
>+	 */
>+	if (count < ra_max)
>+		goto out;
>+
>+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
>+		goto out;
>+
>+	/*
>+	 * Check the far pages coarsely.
>+	 * The enlarged count here helps increase la_size.
>+	 */
>+	nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
>+						100 / (readahead_ratio | 1);
>+
>+	cond_resched();
>+	radix_tree_cache_init(&cache);
>+	read_lock_irq(&mapping->tree_lock);
>+	for (count += ra_max; count < nr_lookback; count += ra_max) {
>+		struct radix_tree_node *node;
>+		node = radix_tree_cache_lookup_parent(&mapping->page_tree,
>+						&cache, offset - count, 1);
>+		if (!node)
>+			break;
>+	}
>+	read_unlock_irq(&mapping->tree_lock);
>

Yuck. Apart from not being commented, this depends on internal
implementation of radix-tree. This should just be packaged up in some
radix-tree function to do exactly what you want (eg. is there a hole of
N contiguous pages).

And then again you can be rid of the radix-tree cache.

Yes, it increasingly appears that you're using the cache because you're
using the wrong abstractions. Eg. this is basically half implementing
some data-structure internal detail.
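
For reference, modulo the node-granularity approximation, the quoted loop
computes roughly this (a de-optimized sketch, not code from the patchset):

/*
 * Walk backward in ra_max strides; stop at the first stride whose page
 * is absent.  The radix_tree_cache_lookup_parent() form in the patch
 * checks the containing tree node instead of the page itself, which is
 * cheaper but coarser: a node is present whenever any page under it is.
 */
for (count += ra_max; count < nr_lookback; count += ra_max)
	if (!radix_tree_lookup(&mapping->page_tree, offset - count))
		break;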

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 22/33] readahead: initial method
  2006-05-24 11:13   ` [PATCH 20/33] readahead: initial method - expected read size Wu Fengguang
@ 2006-05-25  5:34     ` Nick Piggin
       [not found]       ` <20060525085957.GC4996@mail.ustc.edu.cn>
  2006-05-26 17:29     ` [PATCH 20/33] readahead: initial method - expected read size Andrew Morton
  1 sibling, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25  5:34 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

BTW. while your patchset might be nicely broken down, I think your
naming and descriptions are letting it down a little bit.

Wu Fengguang wrote:

>Aggressive readahead policy for read on start-of-file.
>
>Instead of selecting a conservative readahead size,
>it tries to do large readahead in the first place.
>
>However we have to watch on two cases:
>	- do not ruin the hit rate for file-head-checkers
>	- do not lead to thrashing for memory tight systems
>
>

How does it handle
             -  don't needlessly readahead too much if the file is in cache


Would the current readahead mechanism benefit from more aggressive
start-of-file readahead?

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
  2006-05-24 11:12   ` [PATCH 08/33] readahead: common macros Wu Fengguang
@ 2006-05-25  5:56     ` Nick Piggin
       [not found]       ` <20060525104117.GE4996@mail.ustc.edu.cn>
       [not found]       ` <20060525134224.GJ4996@mail.ustc.edu.cn>
  2006-05-25 16:33     ` Andrew Morton
  1 sibling, 2 replies; 107+ messages in thread
From: Nick Piggin @ 2006-05-25  5:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Wu Fengguang wrote:

>Define some common used macros for the read-ahead logics.
>
>Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
>---
>
> mm/readahead.c |   14 ++++++++++++--
> 1 files changed, 12 insertions(+), 2 deletions(-)
>
>--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
>+++ linux-2.6.17-rc4-mm3/mm/readahead.c
>@@ -5,6 +5,8 @@
>  *
>  * 09Apr2002	akpm@zip.com.au
>  *		Initial version.
>+ * 21May2006	Wu Fengguang <wfg@mail.ustc.edu.cn>
>+ *		Adaptive read-ahead framework.
>  */
> 
> #include <linux/kernel.h>
>@@ -14,6 +16,14 @@
> #include <linux/blkdev.h>
> #include <linux/backing-dev.h>
> #include <linux/pagevec.h>
>+#include <linux/writeback.h>
>+#include <linux/nfsd/const.h>
>

How come you're adding these includes?

>+
>+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
>+#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
>
Don't really like the names. Don't think they do anything for clarity, but
if you can come up with something better for PAGES_BYTE I might change my
mind ;) (just forget about PAGES_KB - people know what *1024 means)

Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
It is saying nothing about minimum, so presumably 0 is the correct choice.
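
To spell the rounding out (an illustration, not from the patch), with
PAGE_CACHE_SIZE == 4096:

/*
 *   PAGES_BYTE(4095) == 1,  PAGES_BYTE(4096) == 1,  PAGES_BYTE(4097) == 2
 *
 * The macro rounds up, so a *maximum* expressed as 4095 bytes becomes a
 * whole page, while a round-down form, (size) >> PAGE_CACHE_SHIFT,
 * would give 0.
 */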


>+
>+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
>+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
>

Again, it is probably easier just to use the expanded version. Then the
reader can immediately say: ah, the next page on the LRU list (rather
than, maybe, the next page in the pagecache).

> 
> void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
> {
>@@ -21,7 +31,7 @@ void default_unplug_io_fn(struct backing
> EXPORT_SYMBOL(default_unplug_io_fn);
> 
> struct backing_dev_info default_backing_dev_info = {
>-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
>+	.ra_pages	= PAGES_KB(VM_MAX_READAHEAD),
> 	.state		= 0,
> 	.capabilities	= BDI_CAP_MAP_COPY,
> 	.unplug_io_fn	= default_unplug_io_fn,
>@@ -50,7 +60,7 @@ static inline unsigned long get_max_read
> 
> static inline unsigned long get_min_readahead(struct file_ra_state *ra)
> {
>-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
>+	return PAGES_KB(VM_MIN_READAHEAD);
> }
> 
> static inline void reset_ahead_window(struct file_ra_state *ra)
>
>--
>

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 14/33] readahead: state based method - data structure
  2006-05-24 11:13   ` [PATCH 14/33] readahead: state based method - data structure Wu Fengguang
@ 2006-05-25  6:03     ` Nick Piggin
       [not found]       ` <20060525104353.GF4996@mail.ustc.edu.cn>
  2006-05-26 17:05     ` Andrew Morton
  1 sibling, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-25  6:03 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Wu Fengguang wrote:

>Extend struct file_ra_state to support the adaptive read-ahead logic.
>

Another nitpick: It is usually OK to do these things in the same patch
that actually uses the new data (or functions -- eg. patch 15).

If the addition is complex or in a completely different subsystem
(eg. your rescue_pages function), _that_ can justify it being split
into its own patch. Then you might also prepend the subject with mm:
and cc linux-mm to get better reviews.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 17/33] readahead: context based method
       [not found]       ` <20060525080308.GB4996@mail.ustc.edu.cn>
@ 2006-05-25  8:03         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25  8:03 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 03:26:00PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
> >+	cond_resched();
> >+	read_lock_irq(&mapping->tree_lock);
> >+	index = radix_tree_scan_hole_backward(&mapping->page_tree,
> >+							offset, ra_max);
> >+	read_unlock_irq(&mapping->tree_lock);
> >
> 
> Why do you drop this lock just to pick it up again a few instructions
> down the line? (is ra_cache_hit_ok or count_cache_hit very big or
> unable to be called without the lock?)

Nice catch, will fix it.

> >+
> >+	*remain = offset - index;
> >+
> >+	if (offset == ra->readahead_index && ra_cache_hit_ok(ra))
> >+		count = *remain;
> >+	else if (count_cache_hit(mapping, index + 1, offset) *
> >+						readahead_hit_rate >= 
> >*remain)
> >+		count = *remain;
> >+	else
> >+		count = ra_min;
> >+
> >+	/*
> >+	 * Unnecessary to count more?
> >+	 */
> >+	if (count < ra_max)
> >+		goto out;
> >+
> >+	if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
> >+		goto out;
> >+
> >+	/*
> >+	 * Check the far pages coarsely.
> >+	 * The enlarged count here helps increase la_size.
> >+	 */
> >+	nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
> >+						100 / (readahead_ratio | 1);
> >+
> >+	cond_resched();
> >+	radix_tree_cache_init(&cache);
> >+	read_lock_irq(&mapping->tree_lock);
> >+	for (count += ra_max; count < nr_lookback; count += ra_max) {
> >+		struct radix_tree_node *node;
> >+		node = radix_tree_cache_lookup_parent(&mapping->page_tree,
> >+						&cache, offset - count, 1);
> >+		if (!node)
> >+			break;
> >+	}
> >+	read_unlock_irq(&mapping->tree_lock);
> >
> 
> Yuck. Apart from not being commented, this depends on internal
> implementation of radix-tree. This should just be packaged up in some
> radix-tree function to do exactly what you want (eg. is there a hole of
> N contiguous pages).

Yes, it is ugly.
Maybe we can make it a function named radix_tree_scan_hole_coarse().
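Just to sketch the idea (the name and signature here are only a
suggestion, nothing is settled):

	/*
	 * Probe backward from @index in @stride-sized steps, at most
	 * @max_scan probes, and return the offset of the first probe
	 * that hits a hole (no radix tree node present).  Coarse by
	 * design: it may step over small holes between probe points.
	 */
	unsigned long radix_tree_scan_hole_coarse(struct radix_tree_root *root,
				unsigned long index, unsigned long stride,
				unsigned long max_scan);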

> And then again you can be rid of the radix-tree cache.
> 
> Yes, it increasingly appears that you're using the cache because you're
> using the wrong abstractions. Eg. this is basically half implementing
> some data-structure internal detail.

Sorry for not being aware of this problem :)

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 22/33] readahead: initial method
       [not found]       ` <20060525085957.GC4996@mail.ustc.edu.cn>
@ 2006-05-25  8:59         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25  8:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 03:34:30PM +1000, Nick Piggin wrote:
> BTW. while your patchset might be nicely broken down, I think your
> naming and descriptions are letting it down a little bit.

:) Maybe more practice will help.

> Wu Fengguang wrote:
> 
> >Aggressive readahead policy for read on start-of-file.
> >
> >Instead of selecting a conservative readahead size,
> >it tries to do large readahead in the first place.
> >
> >However we have to watch on two cases:
> >	- do not ruin the hit rate for file-head-checkers
> >	- do not lead to thrashing for memory tight systems
> >
> >
> 
> How does it handle
>             -  don't needlessly readahead too much if the file is in cache

It is prevented by the calling scheme.

The adaptive readahead logic will only be called on
        - read a non-cached page
                So readahead will be started/stopped on demand.
        - read a PG_readahead marked page
                Since the PG_readahead mark will only be set on fresh
                new pages in __do_page_cache_readahead(), readahead
                will automatically cease on cache hit.
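
A minimal sketch of that calling scheme, as seen from the read path
(the names here are illustrative only, not the exact identifiers in
the patch):

	page = find_get_page(mapping, index);
	if (!page) {
		/* cache miss: let the adaptive logic open/resume a window */
		adaptive_readahead(mapping, ra, filp, index);
		page = find_get_page(mapping, index);
	} else if (PageReadahead(page)) {
		/* look-ahead mark hit: pipeline the next window early */
		adaptive_readahead(mapping, ra, filp, index);
	}
	/* plain cache hits fall through: readahead is never invoked */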

> 
> Would the current readahead mechanism benefit from more aggressive 
> start-of-file
> readahead?

It will have the same benefits (and drawbacks).

[QUOTE FROM ANOTHER MAIL]
> can we try to incrementally improve the current logic as well as work
> towards merging your readahead rewrite?

The current readahead is left untouched on purpose.

If I understand it right, its simplicity is a great virtue.  And it is
hard to improve it without losing this virtue, or without disturbing
old users.

The new framework then provides an ideal testbed for fancy new things.
We can do experimental things without calling for complaints (until it
stabilizes, say after a year). And then we might port some proven
features to the current logic.

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
       [not found]       ` <20060525104117.GE4996@mail.ustc.edu.cn>
@ 2006-05-25 10:41         ` Wu Fengguang
  2006-05-26  3:33           ` Nick Piggin
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 10:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
> >+#include <linux/writeback.h>
> >+#include <linux/nfsd/const.h>
> >
> 
> How come you're adding these includes?

They are left over from something added earlier and removed later...

> >+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> >+#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
> >
> Don't really like the names. Don't think they do anything for clarity, but
> if you can come up with something better for PAGES_BYTE I might change my
> mind ;) (just forget about PAGES_KB - people know what *1024 means)

No, they are mainly for concision. Don't you think it's cleaner to write
        PAGES_KB(VM_MAX_READAHEAD)
than
        (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE

Admittedly the names are somewhat awkward though :)

> Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
> 4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
> It is saying nothing about minimum, so presumably 0 is the correct choice.

The macros were first introduced exactly for this reason ;)

It is rumored that there will be 64K page support, and this macro
helps round up the 16K-sized VM_MIN_READAHEAD. The eof_index also
needs rounding up.
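
To make the rounding concrete, assume 64K pages (PAGE_CACHE_SHIFT = 16)
and VM_MIN_READAHEAD = 16 kbytes:

	PAGES_KB(16) = (16*1024 + 65536 - 1) >> 16 = 1	/* rounded up   */
	(16 * 1024) / 65536                        = 0	/* plain divide */

Without the round-up, the minimum readahead would silently collapse to
zero pages.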

> >+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
> >+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
> >
> 
> Again, it is probably easier just to use the expanded version. Then the
> reader can immediately say: ah, the next page on the LRU list (rather
> than, maybe, the next page in the pagecache).

Ok, will expand it.

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 14/33] readahead: state based method - data structure
       [not found]       ` <20060525104353.GF4996@mail.ustc.edu.cn>
@ 2006-05-25 10:43         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 10:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 04:03:31PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
> 
> >Extend struct file_ra_state to support the adaptive read-ahead logic.
> >
> 
> Another nitpick: It is usually OK to do these things in the same patch
> that actually uses the new data (or functions -- eg. patch 15).
> 
> If the addition is complex or in a completely different subsystem
> (eg. your rescue_pages function), _that_ can justify it being split
> into its own patch. Then you might also prepend the subject with mm:
> and cc linux-mm to get better reviews.

Ok, thanks for the advice.

Regards,
Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/33] readahead: support functions
       [not found]       ` <20060525111318.GH4996@mail.ustc.edu.cn>
@ 2006-05-25 11:13         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 11:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 03:13:16PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
> 
> >+#ifdef CONFIG_ADAPTIVE_READAHEAD
> >+
> >+/*
> >+ * The nature of read-ahead allows false tests to occur occasionally.
> >+ * Here we just do not bother to call get_page(), it's meaningless anyway.
> >+ */
> >+static inline struct page *__find_page(struct address_space *mapping,
> >+							pgoff_t offset)
> >+{
> >+	return radix_tree_lookup(&mapping->page_tree, offset);
> >+}
> >+
> >+static inline struct page *find_page(struct address_space *mapping,
> >+							pgoff_t offset)
> >+{
> >+	struct page *page;
> >+
> >+	read_lock_irq(&mapping->tree_lock);
> >+	page = __find_page(mapping, offset);
> >+	read_unlock_irq(&mapping->tree_lock);
> >+	return page;
> >+}
> > 
> >
> 
> Meh, this is just open-coded elsewhere in readahead.c; I'd either
> open code it, or do a new patch to replace the existing callers.
> find_page should be in mm/filemap.c, btw (or include/linux/pagemap.h).

Maybe it should stay in readahead.c.

I got this early warning from Andrew:
        find_page() is not meant to be a general API, for it can
        easily be abused.

> >+
> >+/*
> >+ * Move pages in danger (of thrashing) to the head of inactive_list.
> >+ * Not expected to happen frequently.
> >+ */
> >+static unsigned long rescue_pages(struct page *page, unsigned long 
> >nr_pages)
> > 
> >
> 
> Should probably be in mm/vmscan.c

Maybe. It's a highly specialized function. It protects a contiguous
range of sequential readahead pages in a file. Do you mean to move it
because of the zone->lru_lock protected statements?

Regards,
Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 12/33] readahead: min/max sizes
       [not found]       ` <20060525121206.GI4996@mail.ustc.edu.cn>
@ 2006-05-25 12:12         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 12:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 02:50:59PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
> 
> >- Enlarge VM_MAX_READAHEAD to 1024 if new read-ahead code is compiled in.
> > This value is no longer tightly coupled with the thrashing problem,
> > therefore constrained by it. The adaptive read-ahead logic merely takes
> > it as an upper bound, and will not stick to it under memory pressure.
> >
> 
> I guess this size enlargement is one of the main reasons your
> patchset improves performance in some cases.

Sure, I started the patch to fulfill the 1M _default_ size dream ;-)
The majority of users will never enjoy the performance improvement if
we stick to the 128k default size forever.  And it won't be possible
for the current readahead logic, since it lacks a basic thrashing
protection mechanism.

> There is currently some sort of thrashing protection in there.
> Obviously you've found it to be unable to cope with some situations
> and introduced a lot of really fancy stuff to fix it. Are these just
> academic access patterns, or do you have real test cases that
> demonstrate this failure (ie. can we try to incrementally improve
> the current logic as well as work towards merging your readahead
> rewrite?)

But to be serious, in the process I realized that it is about much
more than the max readahead size. The fancy features come more out of
_real_ needs than out of academic goals. I've seen real world
improvements from desktop/file server/backup server/database users
for most of the implemented features.

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
       [not found]       ` <20060525134224.GJ4996@mail.ustc.edu.cn>
@ 2006-05-25 13:42         ` Wu Fengguang
  2006-05-25 14:38           ` Andrew Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-25 13:42 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
> >+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> >+#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
> >
> Don't really like the names. Don't think they do anything for clarity, but
> if you can come up with something better for PAGES_BYTE I might change my
> mind ;) (just forget about PAGES_KB - people know what *1024 means)
> 
> Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
> 4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
> It is saying nothing about minimum, so presumably 0 is the correct choice.

Got an idea, how about these ones:

#define FULL_PAGES(bytes)    ((bytes) >> PAGE_CACHE_SHIFT)
#define PARTIAL_PAGES(bytes) (((bytes)+PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT)

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
  2006-05-25 13:42         ` Wu Fengguang
@ 2006-05-25 14:38           ` Andrew Morton
  0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 14:38 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: nickpiggin, linux-kernel

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
> > >+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> > >+#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
> > >
> > Don't really like the names. Don't think they do anything for clarity, but
> > if you can come up with something better for PAGES_BYTE I might change my
> > mind ;) (just forget about PAGES_KB - people know what *1024 means)
> > 
> > Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
> > 4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
> > It is saying nothing about minimum, so presumably 0 is the correct choice.
> 
> Got an idea, how about these ones:
> 
> #define FULL_PAGES(bytes)    ((bytes) >> PAGE_CACHE_SHIFT)

I dunno.  We've traditionally open-coded things like this.

> #define PARTIAL_PAGES(bytes) (((bytes)+PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT)

That's identical to include/linux/kernel.h:DIV_ROUND_UP(), from the gfs2 tree.
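
(For reference, DIV_ROUND_UP is the usual round-up idiom:

	#define DIV_ROUND_UP(n,d)	(((n) + (d) - 1) / (d))

so PARTIAL_PAGES(bytes) is just DIV_ROUND_UP(bytes, PAGE_CACHE_SIZE).)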

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-24 11:12 ` [PATCH 00/33] Adaptive read-ahead V12 Wu Fengguang
@ 2006-05-25 15:44   ` Andrew Morton
  2006-05-25 19:26     ` Michael Stone
                       ` (4 more replies)
  0 siblings, 5 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 15:44 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg, mstone

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Andrew,
> 
> This is the 12th release of the adaptive readahead patchset.
> 
> It has received tests in a wide range of applications in the past
> six months, and polished up considerably.
> 
> Please consider it for inclusion in -mm tree.
> 
> 
> Performance benefits
> ====================
> 
> Besides file servers and desktops, it is recently found to benefit
> postgresql databases a lot.
> 
> I explained to pgsql users how the patch may help their db performance:
> http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
> [QUOTE]
> 	HOW IT WORKS
> 
> 	In adaptive readahead, the context based method may be of particular
> 	interest to postgresql users. It works by peeking into the file cache
> 	and check if there are any history pages present or accessed. In this
> 	way it can detect almost all forms of sequential / semi-sequential read
> 	patterns, e.g.
> 		- parallel / interleaved sequential scans on one file
> 		- sequential reads across file open/close
> 		- mixed sequential / random accesses
> 		- sparse / skimming sequential read
> 
> 	It also has methods to detect some less common cases:
> 		- reading backward
> 		- seeking all over reading N pages
> 
> 	WAYS TO BENEFIT FROM IT
> 
> 	As we know, postgresql relies on the kernel to do proper readahead.
> 	The adaptive readahead might help performance in the following cases:
> 		- concurrent sequential scans
> 		- sequential scan on a fragmented table
> 		  (some DBs suffer from this problem, not sure for pgsql)
> 		- index scan with clustered matches
> 		- index scan on majority rows (in case the planner goes wrong)
> 
> And received positive responses:
> [QUOTE from Michael Stone]
> 	I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
> 	with the patch the job took 1.7M ms. Another VACUUM that normally takes
> 	between 300k-500k ms took 150k. Definitely a promising addition.
> 
> [QUOTE from Michael Stone]
> 	>I'm thinking about it, we're already using a fixed read-ahead of 16MB
> 	>using blockdev on the stock Redhat 2.6.9 kernel, it would be nice to
> 	>not have to set this so we may try it.
> 
> 	FWIW, I never saw much performance difference from doing that. Wu's
> 	patch, OTOH, gave a big boost.
> 
> [QUOTE: odbc-bench with Postgresql 7.4.11 on dual Opteron]
> 	Base kernel:
> 	 Transactions per second:                92.384758
> 	 Transactions per second:                99.800896
> 
> 	After the readahead patch, vm.readahead_ratio = 100:
> 	 Transactions per second:                105.461952
> 	 Transactions per second:                105.458664
> 
> 	vm.readahead_ratio = 100 ; vm.readahead_hit_rate = 1:
> 	 Transactions per second:                113.055367
> 	 Transactions per second:                124.815910

These are nice-looking numbers, but one wonders.  If optimising readahead
makes this much difference to postgresql performance then postgresql should
be doing the readahead itself, rather than relying upon the kernel's
ability to guess what the application will be doing in the future.  Because
surely the database can do a better job of that than the kernel.

That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
application-level readahead.
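
In code, the suggested pattern would be roughly this (standard POSIX
calls; next_offset/next_len stand for whatever the database predicts
it will need):

	/* once per file descriptor: switch off the kernel heuristics */
	posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

	/* before each batch of reads: prefetch exactly the needed range */
	posix_fadvise(fd, next_offset, next_len, POSIX_FADV_WILLNEED);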

Has this been considered or attempted?

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 28/33] readahead: loop case
       [not found]     ` <20060525154846.GA6907@mail.ustc.edu.cn>
@ 2006-05-25 15:48       ` wfg
  0 siblings, 0 replies; 107+ messages in thread
From: wfg @ 2006-05-25 15:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Limin Wang

On Wed, May 24, 2006 at 10:01:35PM +0800, Limin Wang wrote:
> 
> If the loopback file is bigger than the memory size, it may cause misses
> again; might it be better to turn on the read-ahead?
> 

The readahead is always on; it is only the lookahead that gets disabled :-)

> > Disable look-ahead for loop file.
> > 
> > Loopback files normally contain filesystems, in which case there is already
> > proper look-ahead in the upper layer; more look-ahead on the loopback file
> > only ruins the read-ahead hit rate.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 03/33] radixtree: hole scanning functions
  2006-05-24 11:12   ` [PATCH 03/33] radixtree: hole scanning functions Wu Fengguang
@ 2006-05-25 16:19     ` Andrew Morton
       [not found]       ` <20060526070416.GB5135@mail.ustc.edu.cn>
       [not found]       ` <20060526110559.GA14398@mail.ustc.edu.cn>
  0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:19 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Introduce a pair of functions to scan radix tree for hole/empty item.
>

There's a userspace radix-tree test harness at
http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.

If/when these new features are merged up, it would be good to have new
testcases added to that suite, please.

In the meanwhile you may care to develop those tests anyway, and see if
you can trip up the new features.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 04/33] readahead: page flag PG_readahead
  2006-05-24 11:12   ` [PATCH 04/33] readahead: page flag PG_readahead Wu Fengguang
@ 2006-05-25 16:23     ` Andrew Morton
       [not found]       ` <20060526070646.GC5135@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:23 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> A new page flag PG_readahead is introduced as a look-ahead mark, which
> reminds the caller to give the adaptive read-ahead logic a chance to do
> read-ahead ahead of time for I/O pipelining.
> 
> It roughly corresponds to `ahead_start' of the stock read-ahead logic.
> 

This isn't a very revealing description of what this flag does.

> +#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)

uh-oh.  This is extremely risky.  Needs extensive justification, please.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
  2006-05-24 11:12   ` [PATCH 06/33] readahead: refactor __do_page_cache_readahead() Wu Fengguang
@ 2006-05-25 16:30     ` Andrew Morton
  2006-05-25 22:33       ` Paul Mackerras
       [not found]       ` <20060526071339.GE5135@mail.ustc.edu.cn>
  0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:30 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Add look-ahead support to __do_page_cache_readahead(),
> which is needed by the adaptive read-ahead logic.

You'd need to define "look-ahead support" before telling us you've added it ;)

> @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
>  			break;
>  		page->index = page_offset;
>  		list_add(&page->lru, &page_pool);
> +		if (page_idx == nr_to_read - lookahead_size)
> +			__SetPageReadahead(page);
>  		ret++;
>  	}

OK.  But the __SetPageFoo() things still give me the creeps.


OT: look:

		read_unlock_irq(&mapping->tree_lock);
		page = page_cache_alloc_cold(mapping);
		read_lock_irq(&mapping->tree_lock);

we should have a page allocation function which just allocates a page from
this CPU's per-cpu-pages magazine, and fails if the magazine is empty:

		page = 	alloc_pages_local(mapping_gfp_mask(x)|__GFP_COLD);
		if (!page) {
			read_unlock_irq(&mapping->tree_lock);
			/*
			 * This will refill the per-cpu-pages magazine
			 */
			page = page_cache_alloc_cold(mapping);
			read_lock_irq(&mapping->tree_lock);
		}


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
  2006-05-24 11:12   ` [PATCH 08/33] readahead: common macros Wu Fengguang
  2006-05-25  5:56     ` Nick Piggin
@ 2006-05-25 16:33     ` Andrew Morton
  1 sibling, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:33 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Define some commonly used macros for the read-ahead logic.
> 
> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
> ---
> 
>  mm/readahead.c |   14 ++++++++++++--
>  1 files changed, 12 insertions(+), 2 deletions(-)
> 
> --- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
> +++ linux-2.6.17-rc4-mm3/mm/readahead.c
> @@ -5,6 +5,8 @@
>   *
>   * 09Apr2002	akpm@zip.com.au
>   *		Initial version.
> + * 21May2006	Wu Fengguang <wfg@mail.ustc.edu.cn>
> + *		Adaptive read-ahead framework.
>   */
>  
>  #include <linux/kernel.h>
> @@ -14,6 +16,14 @@
>  #include <linux/blkdev.h>
>  #include <linux/backing-dev.h>
>  #include <linux/pagevec.h>
> +#include <linux/writeback.h>
> +#include <linux/nfsd/const.h>

Why on earth are we including that file?

Whatever goodies it contains should be moved into fs.h or mm.h or something.

> +
> +#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
> +#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)

These aren't proving popular.

> +#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
> +#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))

hm.  Makes sense I guess, but normally we'll be iterating across lists with
the list_for_each*() helpers, so I'm a little surprised that the above are
needed.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 09/33] readahead: events accounting
  2006-05-24 11:12   ` [PATCH 09/33] readahead: events accounting Wu Fengguang
@ 2006-05-25 16:36     ` Andrew Morton
       [not found]       ` <20060526070943.GD5135@mail.ustc.edu.cn>
       [not found]       ` <20060527132002.GA4814@mail.ustc.edu.cn>
  0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:36 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg, joern, ioe-lkml

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> A debugfs file named `readahead/events' is created according to advice from
>  Jörn Engel, Andrew Morton and Ingo Oeser.

If everyone's patches all get merged up we'd expect that this facility be
migrated over to use Martin Peschke's statistics infrastructure.

That's not a thing you should do now, but it would be a useful test of
Martin's work if you could find time to look at it and let us know whether
the infrastructure which he has provided would suit this application,
thanks.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/33] readahead: support functions
  2006-05-24 11:12   ` [PATCH 10/33] readahead: support functions Wu Fengguang
  2006-05-25  5:13     ` Nick Piggin
@ 2006-05-25 16:48     ` Andrew Morton
       [not found]       ` <20060526073114.GH5135@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 16:48 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> +/*
> + * The nature of read-ahead allows false tests to occur occasionally.
> + * Here we just do not bother to call get_page(), it's meaningless anyway.
> + */
> +static inline struct page *__find_page(struct address_space *mapping,
> +							pgoff_t offset)
> +{
> +	return radix_tree_lookup(&mapping->page_tree, offset);
> +}
> +
> +static inline struct page *find_page(struct address_space *mapping,
> +							pgoff_t offset)
> +{
> +	struct page *page;
> +
> +	read_lock_irq(&mapping->tree_lock);
> +	page = __find_page(mapping, offset);
> +	read_unlock_irq(&mapping->tree_lock);
> +	return page;
> +}

Would much prefer that this be called probe_page() and that it return 0 or
1, so nobody is tempted to dereference `page'.

> +/*
> + * Move pages in danger (of thrashing) to the head of inactive_list.
> + * Not expected to happen frequently.
> + */
> +static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> +{
> +	int pgrescue;
> +	pgoff_t index;
> +	struct zone *zone;
> +	struct address_space *mapping;
> +
> +	BUG_ON(!nr_pages || !page);
> +	pgrescue = 0;
> +	index = page_index(page);
> +	mapping = page_mapping(page);
> +
> +	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
> +			mapping->host->i_ino, index, nr_pages);
> +
> +	for(;;) {
> +		zone = page_zone(page);
> +		spin_lock_irq(&zone->lru_lock);
> +
> +		if (!PageLRU(page))
> +			goto out_unlock;
> +
> +		while (page_mapping(page) == mapping &&
> +				page_index(page) == index) {
> +			struct page *the_page = page;
> +			page = next_page(page);
> +			if (!PageActive(the_page) &&
> +					!PageLocked(the_page) &&
> +					page_count(the_page) == 1) {
> +				list_move(&the_page->lru, &zone->inactive_list);
> +				pgrescue++;
> +			}
> +			index++;
> +			if (!--nr_pages)
> +				goto out_unlock;
> +		}
> +
> +		spin_unlock_irq(&zone->lru_lock);
> +
> +		cond_resched();
> +		page = find_page(mapping, index);
> +		if (!page)
> +			goto out;

Yikes!  We do not have a reference on this page.  Now, it happens that
page_zone() on a random freed page will work OK.  At present.  I think. 
Depends on things like memory hot-remove, balloon drivers and heaven knows
what.

But it's not at all clear that the combination

		spin_lock_irq(&zone->lru_lock);

		if (!PageLRU(page))
			goto out_unlock;

is a safe thing to do against a freed page, or against a freed and
reused-for-we-dont-know-what page.  It probably _is_ safe, as we're
probably setting and clearing PG_lru inside lru_lock in other places.  But
it's not obvious that these things will be true for all time and Nick keeps
on trying to diddle with that stuff.  There's quite a bit of subtle
dependency being introduced here.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 15:44   ` Andrew Morton
@ 2006-05-25 19:26     ` Michael Stone
  2006-05-25 19:40     ` David Lang
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 107+ messages in thread
From: Michael Stone @ 2006-05-25 19:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel

On Thu, May 25, 2006 at 08:44:15AM -0700, Andrew Morton wrote:
>These are nice-looking numbers, but one wonders.  If optimising readahead
>makes this much difference to postgresql performance then postgresql should
>be doing the readahead itself, rather than relying upon the kernel's
>ability to guess what the application will be doing in the future.  Because
>surely the database can do a better job of that than the kernel.

In this particular case Wu had asked about postgres numbers, so I 
reported some postgres numbers. You could probably get similar speedups 
out of postgres by implementing readahead in postgres. OTOH, the kernel 
patch also gives substantial speedups to things like cp; the question 
comes down to whether it's better for every application to implement 
readahead or for the kernel to do it. (There are, of course, other 
concerns like maintainability or whether performance degrades in other 
cases, but I didn't test that. :)

Mike Stone

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 15:44   ` Andrew Morton
  2006-05-25 19:26     ` Michael Stone
@ 2006-05-25 19:40     ` David Lang
  2006-05-25 22:01       ` Andrew Morton
       [not found]     ` <20060526011939.GA6220@mail.ustc.edu.cn>
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 107+ messages in thread
From: David Lang @ 2006-05-25 19:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel, mstone

On Thu, 25 May 2006, Andrew Morton wrote:

> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>>
>
> These are nice-looking numbers, but one wonders.  If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future.  Because
> surely the database can do a better job of that than the kernel.
>
> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> application-level readahead.
>
> Has this been considered or attempted?

Postgres chooses not to try and duplicate OS functionality in its I/O 
routines.

it doesn't try to determine where on disk the data is (other than 
splitting the data into multiple files and possibly spreading things 
between directories)

it doesn't try to do its own readahead.

it _does_ maintain its own journal, but depends on the OS to do the right 
thing when a fsync is issued on the files.

yes it could be re-written to do all this itself, but the project has 
decided not to try and figure out the best options for all the different 
filesystems and OS's that it runs on, and instead trusts the OS developers 
to do reasonable things.

besides, do you really want to have every program doing its own 
readahead?

David Lang

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 22:01       ` Andrew Morton
@ 2006-05-25 20:28         ` David Lang
  2006-05-26  0:48         ` Michael Stone
  1 sibling, 0 replies; 107+ messages in thread
From: David Lang @ 2006-05-25 20:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: wfg, linux-kernel, mstone

> If the developers of that program want to squeeze the last 5% out of it
> then sure, I'd expect them to use such OS-provided I/O scheduling
> facilities.  Database developers do that sort of thing all the time.
>
> We have an application which knows what it's doing sending IO requests to
> the kernel which must then try to reverse engineer what the application is
> doing via this rather inappropriate communication channel.
>
> Is that dumb, or what?
>
> Given that the application already knows what it's doing, it's in a much
> better position to issue the anticipatory IO requests than is the kernel.

if a program is trying to squeeze every last bit of performance out of a 
system then you are right, it should run on the bare hardware. however 
in reality many people are willing to sacrifice a little performance for 
maintainability and portability.

if Adaptive read-ahead were only useful for Postgres (and had a negative 
effect on everything else, even if it's just the added complication in the 
kernel) then I would agree that it should be in Postgres, not in the 
kernel. but I don't believe that this is the case: this patch series helps 
in a large number of workloads (including 'cp' according to some other 
posters); postgres was just used as the example in this subthread.

gnome startup has some serious read-ahead issues from what I've heard; 
should it include an I/O scheduler as well (after all, it knows what it's 
going to be doing, so why should the kernel have to reverse-engineer it)?

David Lang


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 19:40     ` David Lang
@ 2006-05-25 22:01       ` Andrew Morton
  2006-05-25 20:28         ` David Lang
  2006-05-26  0:48         ` Michael Stone
  0 siblings, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 22:01 UTC (permalink / raw)
  To: David Lang; +Cc: wfg, linux-kernel, mstone

David Lang <dlang@digitalinsight.com> wrote:
>
> On Thu, 25 May 2006, Andrew Morton wrote:
> 
> > Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >>
> >
> > These are nice-looking numbers, but one wonders.  If optimising readahead
> > makes this much difference to postgresql performance then postgresql should
> > be doing the readahead itself, rather than relying upon the kernel's
> > ability to guess what the application will be doing in the future.  Because
> > surely the database can do a better job of that than the kernel.
> >
> > That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> > readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> > application-level readahead.
> >
> > Has this been considered or attempted?
> 
> Postgres chooses not to try and duplicate OS functionality in its I/O 
> routines.
> 
> it doesn't try to determine where on disk the data is (other than 
> splitting the data into multiple files and possibly spreading things 
> between directories)
> 
> it doesn't try to do its own readahead.
> 
> it _does_ maintain its own journal, but depends on the OS to do the right 
> thing when a fsync is issued on the files.
> 
> yes it could be re-written to do all this itself, but the project has 
> decided not to try and figure out the best options for all the different 
> filesystems and OS's that it runs on, and instead trusts the OS developers 
> to do reasonable things.
> 
> besides, do you really want to have every program doing its own 
> readahead?
> 

If the developers of that program want to squeeze the last 5% out of it
then sure, I'd expect them to use such OS-provided I/O scheduling
facilities.  Database developers do that sort of thing all the time.

We have an application which knows what it's doing sending IO requests to
the kernel which must then try to reverse engineer what the application is
doing via this rather inappropriate communication channel.

Is that dumb, or what?

Given that the application already knows what it's doing, it's in a much
better position to issue the anticipatory IO requests than is the kernel.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
  2006-05-25 16:30     ` Andrew Morton
@ 2006-05-25 22:33       ` Paul Mackerras
  2006-05-25 22:40         ` Andrew Morton
       [not found]       ` <20060526071339.GE5135@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Paul Mackerras @ 2006-05-25 22:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel

Andrew Morton writes:

> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> >  			break;
> >  		page->index = page_offset;
> >  		list_add(&page->lru, &page_pool);
> > +		if (page_idx == nr_to_read - lookahead_size)
> > +			__SetPageReadahead(page);
> >  		ret++;
> >  	}
> 
> OK.  But the __SetPageFoo() things still give me the creeps.

I just hope that Wu Fengguang, or whoever is making these patches,
realizes that on some architectures, doing __set_bit on one CPU
concurrently with another CPU doing set_bit on a different bit in the
same word can result in the second CPU's update getting lost...
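
To spell the race out (a sketch, treating __set_bit as a plain
load/modify/store on the flags word):

	unsigned long old = page->flags;	/* CPU0 loads the word */
	/* meanwhile CPU1: set_bit(PG_lru, &page->flags), atomically */
	page->flags = old | (1UL << PG_readahead);
	/* CPU0 stores the stale word: CPU1's PG_lru update is lost */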

Paul.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
  2006-05-25 22:33       ` Paul Mackerras
@ 2006-05-25 22:40         ` Andrew Morton
  0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-25 22:40 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: wfg, linux-kernel

Paul Mackerras <paulus@samba.org> wrote:
>
> Andrew Morton writes:
> 
> > Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > > @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> > >  			break;
> > >  		page->index = page_offset;
> > >  		list_add(&page->lru, &page_pool);
> > > +		if (page_idx == nr_to_read - lookahead_size)
> > > +			__SetPageReadahead(page);
> > >  		ret++;
> > >  	}
> > 
> > OK.  But the __SetPageFoo() things still give me the creeps.
> 
> I just hope that Wu Fengguang, or whoever is making these patches,
> realizes that on some architectures, doing __set_bit on one CPU
> concurrently with another CPU doing set_bit on a different bit in the
> same word can result in the second CPU's update getting lost...
> 

That's true even on x86.

Yes, this is understood - in this case he's following Nick's dubious lead
in leveraging our knowledge that no other code path will be attempting to
modify this page's flags at this time.  It's just been taken off the
freelist, it's not yet on the LRU and we own the only ref to it.

The only hole I was able to shoot in this is swsusp, which walks mem_map[]
fiddling with page flags.  But when it does this, only one CPU is running.

But I'm itching for an excuse to extirpate it all ;)

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 22:01       ` Andrew Morton
  2006-05-25 20:28         ` David Lang
@ 2006-05-26  0:48         ` Michael Stone
  1 sibling, 0 replies; 107+ messages in thread
From: Michael Stone @ 2006-05-26  0:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: David Lang, wfg, linux-kernel

On Thu, May 25, 2006 at 03:01:49PM -0700, Andrew Morton wrote:
>If the developers of that program want to squeeze the last 5% out of it
>then sure, I'd expect them to use such OS-provided I/O scheduling
>facilities.  

Maybe, if we were talking about squeezing the last 5%. But all 
applications should be required to greatly complicate their IO routines 
for the last 30%? To reimplement something the kernel already does (at 
least to some degree), as opposed to making the kernel implementation 
better? "Is that dumb, or what?" :-)

>Database developers do that sort of thing all the time.

Even the Oracle people seem to have figured out they were doing too much 
that's properly the responsibility of the OS, creating a maintenance 
and portability nightmare. 

Mike Stone

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
       [not found]     ` <20060526011939.GA6220@mail.ustc.edu.cn>
@ 2006-05-26  1:19       ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  1:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, mstone

On Thu, May 25, 2006 at 08:44:15AM -0700, Andrew Morton wrote:
> These are nice-looking numbers, but one wonders.  If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future.  Because
> surely the database can do a better job of that than the kernel.
> 
> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> application-level readahead.
> 
> Has this been considered or attempted?

There have been many lengthy debates on the postgresql mailing list,
and it seems that there is _strong_ resistance to it.

IMHO, the best scheme would be
        - leave _obvious_ patterns to the kernel
                i.e. all kinds of (semi-)sequential reads
        - do fadvise() for _non-obvious_ patterns on _critical_ points
                i.e. the index scans

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 15:44   ` Andrew Morton
                       ` (2 preceding siblings ...)
       [not found]     ` <20060526011939.GA6220@mail.ustc.edu.cn>
@ 2006-05-26  2:10     ` Jon Smirl
  2006-05-26  3:14       ` Nick Piggin
  2006-05-26 14:00     ` Andi Kleen
  4 siblings, 1 reply; 107+ messages in thread
From: Jon Smirl @ 2006-05-26  2:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Wu Fengguang, linux-kernel, mstone

On 5/25/06, Andrew Morton <akpm@osdl.org> wrote:
> These are nice-looking numbers, but one wonders.  If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future.  Because
> surely the database can do a better job of that than the kernel.
>
> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
> application-level readahead.

Users have also reported that this patch fixes performance problems
from web servers using sendfile(). In the case of lighttpd they
actually stopped using sendfile() for large transfers and wrote a user
space replacement where they could control readahead manually. With
this patch in place sendfile() went back to being faster than the user
space implementation.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-26  2:10     ` Jon Smirl
@ 2006-05-26  3:14       ` Nick Piggin
  0 siblings, 0 replies; 107+ messages in thread
From: Nick Piggin @ 2006-05-26  3:14 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Andrew Morton, Wu Fengguang, linux-kernel, mstone

Jon Smirl wrote:

> On 5/25/06, Andrew Morton <akpm@osdl.org> wrote:
>
>> These are nice-looking numbers, but one wonders.  If optimising readahead
>> makes this much difference to postgresql performance then postgresql should
>> be doing the readahead itself, rather than relying upon the kernel's
>> ability to guess what the application will be doing in the future.  Because
>> surely the database can do a better job of that than the kernel.
>>
>> That would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel
>> readahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch
>> application-level readahead.
>
>
> Users have also reported that this patch fixes performance problems
> from web servers using sendfile(). In the case of lighttpd they
> actually stopped using sendfile() for large transfers and wrote a user
> space replacement where they could control readahead manually. With
> this patch in place sendfile() went back to being faster than the user
> space implementation.


Of course, that is something one would expect should be made to work
properly with the current readahead implementation.

I don't see Wu's patches getting in for a little while yet.

Reproducible test cases (preferably without a whole lot of network clients)
should get this problem fixed.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
  2006-05-25 10:41         ` Wu Fengguang
@ 2006-05-26  3:33           ` Nick Piggin
       [not found]             ` <20060526065906.GA5135@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Nick Piggin @ 2006-05-26  3:33 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

Wu Fengguang wrote:

>On Thu, May 25, 2006 at 03:56:24PM +1000, Nick Piggin wrote:
>
>>>+#define PAGES_BYTE(size) (((size) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT)
>>>+#define PAGES_KB(size)	 PAGES_BYTE((size)*1024)
>>>
>>>
>>Don't really like the names. Don't think they do anything for clarity, but
>>if you can come up with something better for PAGES_BYTE I might change my
>>mind ;) (just forget about PAGES_KB - people know what *1024 means)
>>
>
>No, they are mainly for concision. Don't you think it's cleaner to write
>        PAGES_KB(VM_MAX_READAHEAD)
>than
>        (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE
>
>

No. Apart from semantics being different (which I'll address below), anybody
with any business looking at this code will immediately know and understand
what the latter line means. Not so for the former.

>Admittedly the names are somewhat awkward though :)
>
>
>>Also: the replacements are wrong: if you've defined VM_MAX_READAHEAD to be
>>4095 bytes, you don't want the _actual_ readahead to be 4096 bytes, do you?
>>It is saying nothing about minimum, so presumably 0 is the correct choice.
>>
>
>The macros were first introduced exactly for this reason ;)
>
>It is rumored that there will be 64K page support, and this macro
>helps round up the 16K-sized VM_MIN_READAHEAD. The eof_index also
>needs rounding up.
>

But VM_MIN_READAHEAD of course should be rounded up, for the same
reasons I said VM_MAX_READAHEAD should be rounded down.

So OK as a bug fix, but it needs to be in its own patch, not in a "common
macros" one, and sufficiently commented (and preferably outside your core
adaptive readahead code so it can be quickly merged up).


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 08/33] readahead: common macros
       [not found]             ` <20060526065906.GA5135@mail.ustc.edu.cn>
@ 2006-05-26  6:59               ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  6:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

Hello Nick and Andrew,

Updated the patch as recommended.

Thanks,
Wu
---

readahead-macros-min-max-rapages.patch
---

Subject: readahead: introduce {MIN,MAX}_RA_PAGES

Define two convenient macros for read-ahead:
	- MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
	- MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD

Note that the rounded _up_ MIN_RA_PAGES will work flawlessly with large
page sizes like 64k.

Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
 mm/readahead.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -17,13 +17,21 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 
+/*
+ * Convenient macros for min/max read-ahead pages.
+ * Note that MAX_RA_PAGES is rounded down, while MIN_RA_PAGES is rounded up.
+ * The latter is necessary for systems with large page sizes (e.g. 64k).
+ */
+#define MAX_RA_PAGES	(VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
+#define MIN_RA_PAGES	DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
+
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
 }
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
-	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+	.ra_pages	= MAX_RA_PAGES,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
 	.unplug_io_fn	= default_unplug_io_fn,
@@ -52,7 +60,7 @@ static inline unsigned long get_max_read
 
 static inline unsigned long get_min_readahead(struct file_ra_state *ra)
 {
-	return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	return MIN_RA_PAGES;
 }
 
 static inline void reset_ahead_window(struct file_ra_state *ra)

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 03/33] radixtree: hole scanning functions
       [not found]       ` <20060526070416.GB5135@mail.ustc.edu.cn>
@ 2006-05-26  7:04         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  7:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, May 25, 2006 at 09:19:46AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Introduce a pair of functions to scan radix tree for hole/empty item.
> >
> 
> There's a userspace radix-tree test harness at
> http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
> 
> If/when these new features are merged up, it would be good to have new
> testcases added to that suite, please.
> 
> In the meanwhile you may care to develop those tests anyway, and see if
> you can trip up the new features.

Handy tool.

I'll update it with the newly introduced functions, and write
corresponding test cases.

Thanks,
Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 04/33] readahead: page flag PG_readahead
       [not found]       ` <20060526070646.GC5135@mail.ustc.edu.cn>
@ 2006-05-26  7:06         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  7:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, May 25, 2006 at 09:23:11AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > A new page flag PG_readahead is introduced as a look-ahead mark, which
> > reminds the caller to give the adaptive read-ahead logic a chance to do
> > read-ahead ahead of time for I/O pipelining.
> > 
> > It roughly corresponds to `ahead_start' of the stock read-ahead logic.
> > 
> 
> This isn't a very revealing description of what this flag does.

Updated to:

A new page flag PG_readahead is introduced.

It acts as a look-ahead mark, which tells the page reader:
        Hey, it's time to invoke the adaptive read-ahead logic!
        For the sake of I/O pipelining, don't wait until it runs out of
        cached pages.  ;-)

> > +#define __SetPageReadahead(page) __set_bit(PG_readahead, &(page)->flags)
> 
> uh-oh.  This is extremely risky.  Needs extensive justification, please.

Ok, removed the ugly __ :-)

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 09/33] readahead: events accounting
       [not found]       ` <20060526070943.GD5135@mail.ustc.edu.cn>
@ 2006-05-26  7:09         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  7:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, joern, ioe-lkml

On Thu, May 25, 2006 at 09:36:27AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > A debugfs file named `readahead/events' is created according to advice from
> >  Jörn Engel, Andrew Morton and Ingo Oeser.
> 
> If everyone's patches all get merged up we'd expect that this facility be
> migrated over to use Martin Peschke's statistics infrastructure.
> 
> That's not a thing you should do now, but it would be a useful test of
> Martin's work if you could find time to look at it and let us know whether
> the infrastructure which he has provided would suit this application,
> thanks.

Sure, I'll look into it when I am able to settle down.

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 06/33] readahead: refactor __do_page_cache_readahead()
       [not found]       ` <20060526071339.GE5135@mail.ustc.edu.cn>
@ 2006-05-26  7:13         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  7:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, May 25, 2006 at 09:30:39AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Add look-ahead support to __do_page_cache_readahead(),
> > which is needed by the adaptive read-ahead logic.
> 
> You'd need to define "look-ahead support" before telling us you've added it ;)
> 
> > @@ -302,6 +303,8 @@ __do_page_cache_readahead(struct address
> >  			break;
> >  		page->index = page_offset;
> >  		list_add(&page->lru, &page_pool);
> > +		if (page_idx == nr_to_read - lookahead_size)
> > +			__SetPageReadahead(page);
> >  		ret++;
> >  	}
> 
> OK.  But the __SetPageFoo() things still give me the creeps.

Hehe, updated to SetPageReadahead().

> OT: look:
> 
> 		read_unlock_irq(&mapping->tree_lock);
> 		page = page_cache_alloc_cold(mapping);
> 		read_lock_irq(&mapping->tree_lock);
> 
> we should have a page allocation function which just allocates a page from
> this CPU's per-cpu-pages magazine, and fails if the magazine is empty:
> 
> 		page = 	alloc_pages_local(mapping_gfp_mask(x)|__GFP_COLD);
> 		if (!page) {
> 			read_unlock_irq(&mapping->tree_lock);
> 			/*
> 			 * This will refill the per-cpu-pages magazine
> 			 */
> 			page = page_cache_alloc_cold(mapping);
> 			read_lock_irq(&mapping->tree_lock);
> 		}

Seems good, except that alloc_pages_local() would not be able to
spread memory among nodes as page_cache_alloc_cold() does.

Wu

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 10/33] readahead: support functions
       [not found]       ` <20060526073114.GH5135@mail.ustc.edu.cn>
@ 2006-05-26  7:31         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26  7:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Thu, May 25, 2006 at 09:48:29AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > +/*
> > + * The nature of read-ahead allows false tests to occur occasionally.
> > + * Here we just do not bother to call get_page(), it's meaningless anyway.
> > + */
> > +static inline struct page *__find_page(struct address_space *mapping,
> > +							pgoff_t offset)
> > +{
> > +	return radix_tree_lookup(&mapping->page_tree, offset);
> > +}
> > +
> > +static inline struct page *find_page(struct address_space *mapping,
> > +							pgoff_t offset)
> > +{
> > +	struct page *page;
> > +
> > +	read_lock_irq(&mapping->tree_lock);
> > +	page = __find_page(mapping, offset);
> > +	read_unlock_irq(&mapping->tree_lock);
> > +	return page;
> > +}
> 
> Would much prefer that this be called probe_page() and that it return 0 or
> 1, so nobody is tempted to dereference `page'.

Good idea. I'd add them to filemap.c.

> > +/*
> > + * Move pages in danger (of thrashing) to the head of inactive_list.
> > + * Not expected to happen frequently.
> > + */
> > +static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> > +{
> > +	int pgrescue;
> > +	pgoff_t index;
> > +	struct zone *zone;
> > +	struct address_space *mapping;
> > +
> > +	BUG_ON(!nr_pages || !page);
> > +	pgrescue = 0;
> > +	index = page_index(page);
> > +	mapping = page_mapping(page);
> > +
> > +	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
> > +			mapping->host->i_ino, index, nr_pages);
> > +
> > +	for(;;) {
> > +		zone = page_zone(page);
> > +		spin_lock_irq(&zone->lru_lock);
> > +
> > +		if (!PageLRU(page))
> > +			goto out_unlock;
> > +
> > +		while (page_mapping(page) == mapping &&
> > +				page_index(page) == index) {
> > +			struct page *the_page = page;
> > +			page = next_page(page);
> > +			if (!PageActive(the_page) &&
> > +					!PageLocked(the_page) &&
> > +					page_count(the_page) == 1) {
> > +				list_move(&the_page->lru, &zone->inactive_list);
> > +				pgrescue++;
> > +			}
> > +			index++;
> > +			if (!--nr_pages)
> > +				goto out_unlock;
> > +		}
> > +
> > +		spin_unlock_irq(&zone->lru_lock);
> > +
> > +		cond_resched();
> > +		page = find_page(mapping, index);
> > +		if (!page)
> > +			goto out;
> 
> Yikes!  We do not have a reference on this page.  Now, it happens that
> page_zone() on a random freed page will work OK.  At present.  I think. 
> Depends on things like memory hot-remove, balloon drivers and heaven knows
> what.
> 
> But it's not at all clear that the combination
> 
> 		spin_lock_irq(&zone->lru_lock);
> 
> 		if (!PageLRU(page))
> 			goto out_unlock;
> 
> is a safe thing to do against a freed page, or against a freed and
> reused-for-we-dont-know-what page.  It probably _is_ safe, as we're
> probably setting and clearing PG_lru inside lru_lock in other places.  But
> it's not obvious that these things will be true for all time and Nick keeps
> on trying to diddle with that stuff.  There's quite a bit of subtle
> dependency being introduced here.

I saw some code pieces like
                spin_lock_irqsave(&zone->lru_lock, flags);
                VM_BUG_ON(!PageLRU(page));
                __ClearPageLRU(page);
                del_page_from_lru(zone, page);
                spin_unlock_irqrestore(&zone->lru_lock, flags);

They give me the impression that PG_lru and page->lru are always changed
together, under the protection of zone->lru_lock...

Correctness is the top priority though, so I'll stop playing with fire here.
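
For reference, a minimal sketch of the ref-taking variant suggested above
(illustrative only, built on the existing find_get_page() /
page_cache_release() helpers in place of the bare find_page()):

		cond_resched();
		page = find_get_page(mapping, index);	/* takes a reference */
		if (!page)
			goto out;
		/* ... inspect and move the page under zone->lru_lock ... */
		page_cache_release(page);

The reference pins the page, so it cannot be freed and reused while its
flags are being examined.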

Thanks,
Wu


* Re: [PATCH 03/33] radixtree: hole scanning functions
       [not found]       ` <20060526110559.GA14398@mail.ustc.edu.cn>
@ 2006-05-26 11:05         ` Wu Fengguang
  2006-05-26 16:19           ` Andrew Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-26 11:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Nick Piggin

On Thu, May 25, 2006 at 09:19:46AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Introduce a pair of functions to scan radix tree for hole/empty item.
> >
> 
> There's a userspace radix-tree test harness at
> http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
> 
> If/when these new features are merged up, it would be good to have new
> testcases added to that suite, please.
> 
> In the meanwhile you may care to develop those tests anyway, see if you
> can trip up the new features.

The new radix-tree.c/.h breaks the compile badly.

Is there any particular reason not to implement it as a module?


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-25 15:44   ` Andrew Morton
                       ` (3 preceding siblings ...)
  2006-05-26  2:10     ` Jon Smirl
@ 2006-05-26 14:00     ` Andi Kleen
  2006-05-26 16:25       ` Andrew Morton
  2006-05-26 23:54       ` Folkert van Heusden
  4 siblings, 2 replies; 107+ messages in thread
From: Andi Kleen @ 2006-05-26 14:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, wfg, mstone

Andrew Morton <akpm@osdl.org> writes:
> 
> These are nice-looking numbers, but one wonders.  If optimising readahead
> makes this much difference to postgresql performance then postgresql should
> be doing the readahead itself, rather than relying upon the kernel's
> ability to guess what the application will be doing in the future.  Because
> surely the database can do a better job of that than the kernel.

With that argument we should remove all readahead from the kernel? 
Because it's already trying to guess what the application will do. 

I suspect it's better to have good readahead code in the kernel
than in a zillion applications.

-Andi


* Re: [PATCH 03/33] radixtree: hole scanning functions
  2006-05-26 11:05         ` Wu Fengguang
@ 2006-05-26 16:19           ` Andrew Morton
  0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 16:19 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, nickpiggin

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> On Thu, May 25, 2006 at 09:19:46AM -0700, Andrew Morton wrote:
> > Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > >
> > > Introduce a pair of functions to scan radix tree for hole/empty item.
> > >
> > 
> > There's a userspace radix-tree test harness at
> > http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz.
> > 
> > If/when these new features are merged up, it would be good to have new
> > testcases added to that suite, please.
> > 
> > In the meanwhile you may care to develop those tests anyway, see if you
> > can trip up the new features.
> 
> The new radix-tree.c/.h breaks the compile badly.

Sprinkling more stub header files in there usually fixes that.

> Is there any particular reason not to implement it as a module?

Well.  It's a heck of a lot more convenient to throw testcases at a
userspace app, and to debug it and to performance analyse it.


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-26 14:00     ` Andi Kleen
@ 2006-05-26 16:25       ` Andrew Morton
  2006-05-26 23:54       ` Folkert van Heusden
  1 sibling, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 16:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, wfg, mstone

Andi Kleen <ak@suse.de> wrote:
>
> Andrew Morton <akpm@osdl.org> writes:
> > 
> > These are nice-looking numbers, but one wonders.  If optimising readahead
> > makes this much difference to postgresql performance then postgresql should
> > be doing the readahead itself, rather than relying upon the kernel's
> > ability to guess what the application will be doing in the future.  Because
> > surely the database can do a better job of that than the kernel.
> 
> With that argument we should remove all readahead from the kernel? 
> Because it's already trying to guess what the application will do. 
> 
> I suspect it's better to have good readahead code in the kernel
> than in a zillion applications.
> 

Wu: "this readahead patch speeds up postgres"

Me: "but postgres could be sped up even more via X"

everyone: "ah, you're saying that's a reason for not altering readahead!".


Would everyone *please* stop being so completely and utterly thick?

Thank you.


* Re: [PATCH 13/33] readahead: state based method - aging accounting
  2006-05-24 11:12   ` [PATCH 13/33] readahead: state based method - aging accounting Wu Fengguang
@ 2006-05-26 17:04     ` Andrew Morton
       [not found]       ` <20060527062234.GB4991@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:04 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg


(hey, I haven't finished reading the last batch yet)

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
>  /*
>  + * The node's effective length of inactive_list(s).
>  + */
>  +static unsigned long node_free_and_cold_pages(void)
>  +{
>  +	unsigned int i;
>  +	unsigned long sum = 0;
>  +	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
>  +
>  +	for (i = 0; i < MAX_NR_ZONES; i++)
>  +		sum += zones[i].nr_inactive +
>  +			zones[i].free_pages - zones[i].pages_low;
>  +
>  +	return sum;
>  +}

I guess this should go into page_alloc.c along with all the similar functions.

Is this function well-named?  Why does it have "cold" in the name?


* Re: [PATCH 14/33] readahead: state based method - data structure
  2006-05-24 11:13   ` [PATCH 14/33] readahead: state based method - data structure Wu Fengguang
  2006-05-25  6:03     ` Nick Piggin
@ 2006-05-26 17:05     ` Andrew Morton
       [not found]       ` <20060527070248.GD4991@mail.ustc.edu.cn>
       [not found]       ` <20060527082758.GF4991@mail.ustc.edu.cn>
  1 sibling, 2 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:05 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> >  #define RA_FLAG_MISS 0x01	/* a cache miss occurred against this file */
>   #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
>  +#define RA_FLAG_MMAP		(1UL<<31)	/* mmaped page access */
>  +#define RA_FLAG_NO_LOOKAHEAD	(1UL<<30)	/* disable look-ahead */
>  +#define RA_FLAG_EOF		(1UL<<29)	/* readahead hits EOF */

Odd.  Why not use 4, 8, 16?


* Re: [PATCH 15/33] readahead: state based method - routines
  2006-05-24 11:13   ` [PATCH 15/33] readahead: state based method - routines Wu Fengguang
@ 2006-05-26 17:15     ` Andrew Morton
       [not found]       ` <20060527020616.GA7418@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:15 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> Define some helpers on struct file_ra_state.
> 
> +/*
> + * The 64bit cache_hits stores three accumulated values and a counter value.
> + * MSB                                                                   LSB
> + * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
> + */
> +static int ra_cache_hit(struct file_ra_state *ra, int nr)
> +{
> +	return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
> +}

So...   why not use four u16s?

> +/*
> + * Submit IO for the read-ahead request in file_ra_state.
> + */
> +static int ra_dispatch(struct file_ra_state *ra,
> +			struct address_space *mapping, struct file *filp)
> +{
> +	enum ra_class ra_class = ra_class_new(ra);
> +	unsigned long ra_size = ra_readahead_size(ra);
> +	unsigned long la_size = ra_lookahead_size(ra);
> +	pgoff_t eof_index = PAGES_BYTE(i_size_read(mapping->host)) + 1;

Sigh.  I guess one gets used to that PAGES_BYTE thing after a while.  If
you're not familiar with it, it obfuscates things.

<hunts around for its definition>

So in fact it's converting a loff_t to a pgoff_t.  Why not call it
convert_loff_t_to_pgoff_t()?  ;)

Something better, anyway.  Something lower-case and an inline-not-a-macro, too.

> +	int actual;
> +
> +	if (unlikely(ra->ra_index >= eof_index))
> +		return 0;
> +
> +	/* Snap to EOF. */
> +	if (ra->readahead_index + ra_size / 2 > eof_index) {

You've had a bit of a think and you've arrived at a design decision
surrounding the arithmetic in here.  It's very very hard to look at this line
of code and to work out why you decided to implement it in this fashion. 
The only way to make such code comprehensible (and hence maintainable) is
to fully comment such things.




* Re: [PATCH 17/33] readahead: context based method
  2006-05-24 11:13   ` [PATCH 17/33] readahead: context based method Wu Fengguang
  2006-05-25  5:26     ` Nick Piggin
@ 2006-05-26 17:23     ` Andrew Morton
       [not found]       ` <20060527021252.GB7418@mail.ustc.edu.cn>
  2006-05-26 17:27     ` Andrew Morton
  2 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:23 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> +#define PAGE_REFCNT_0           0
>  +#define PAGE_REFCNT_1           (1 << PG_referenced)
>  +#define PAGE_REFCNT_2           (1 << PG_active)
>  +#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
>  +#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
>  +
>  +/*
>  + * STATUS   REFERENCE COUNT
>  + *  __                   0
>  + *  _R       PAGE_REFCNT_1
>  + *  A_       PAGE_REFCNT_2
>  + *  AR       PAGE_REFCNT_3
>  + *
>  + *  A/R: Active / Referenced
>  + */
>  +static inline unsigned long page_refcnt(struct page *page)
>  +{
>  +        return page->flags & PAGE_REFCNT_MASK;
>  +}
>  +

This assumes that PG_referenced < PG_active.  Nobody knows that this
assumption was made and someone might go and reorder the page flags and
subtly break readahead.

We need to either not do it this way, or put a big comment in page-flags.h,
or even redefine PG_active to be PG_referenced+1.



* Re: [PATCH 17/33] readahead: context based method
  2006-05-24 11:13   ` [PATCH 17/33] readahead: context based method Wu Fengguang
  2006-05-25  5:26     ` Nick Piggin
  2006-05-26 17:23     ` Andrew Morton
@ 2006-05-26 17:27     ` Andrew Morton
       [not found]       ` <20060527080443.GE4991@mail.ustc.edu.cn>
  2 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:27 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> This is the slow code path of adaptive read-ahead.
> 
> ...
>
> +
> +/*
> + * Count/estimate cache hits in range [first_index, last_index].
> + * The estimation is simple and optimistic.
> + */
> +static int count_cache_hit(struct address_space *mapping,
> +				pgoff_t first_index, pgoff_t last_index)
> +{
> +	struct page *page;
> +	int size = last_index - first_index + 1;

`size' might overflow.

> +	int count = 0;
> +	int i;
> +
> +	cond_resched();
> +	read_lock_irq(&mapping->tree_lock);
> +
> +	/*
> > +	 * The first page may well be the chunk head and already accessed,
> +	 * so it is index 0 that makes the estimation optimistic. This
> +	 * behavior guarantees a readahead when (size < ra_max) and
> +	 * (readahead_hit_rate >= 16).
> +	 */
> +	for (i = 0; i < 16;) {
> +		page = __find_page(mapping, first_index +
> +						size * ((i++ * 29) & 15) / 16);

29?




* Re: [PATCH 20/33] readahead: initial method - expected read size
  2006-05-24 11:13   ` [PATCH 20/33] readahead: initial method - expected read size Wu Fengguang
  2006-05-25  5:34     ` [PATCH 22/33] readahead: initial method Nick Piggin
@ 2006-05-26 17:29     ` Andrew Morton
       [not found]       ` <20060527063826.GC4991@mail.ustc.edu.cn>
  1 sibling, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:29 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> > backing_dev_info.ra_expect_bytes is dynamically updated to be the expected
> read pages on start-of-file. It allows the initial readahead to be more
> aggressive and hence efficient.
> 
> 
> +void fastcall readahead_close(struct file *file)

eww, fastcall.

> +{
> +	struct inode *inode = file->f_dentry->d_inode;
> +	struct address_space *mapping = inode->i_mapping;
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	unsigned long pos = file->f_pos;

f_pos is loff_t.



* Re: [PATCH 23/33] readahead: backward prefetching method
  2006-05-24 11:13   ` [PATCH 23/33] readahead: backward prefetching method Wu Fengguang
@ 2006-05-26 17:37     ` Nate Diller
  2006-05-26 19:22       ` Nathan Scott
  0 siblings, 1 reply; 107+ messages in thread
From: Nate Diller @ 2006-05-26 17:37 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, linux-kernel

On 5/24/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> Readahead policy for reading backward.

Just curious, who actually does this?  I noticed you submitted patches
to do profiling of actual read loads, so this must be based on data
you've seen.  Could you include a comment in the actual code relating
to the loads that it affects?

thanks

NATE


* Re: [PATCH 27/33] readahead: laptop mode
  2006-05-24 11:13   ` [PATCH 27/33] readahead: laptop mode Wu Fengguang
@ 2006-05-26 17:38     ` Andrew Morton
  0 siblings, 0 replies; 107+ messages in thread
From: Andrew Morton @ 2006-05-26 17:38 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel, wfg, bart

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
>   /*
>  + * Set a new look-ahead mark at @new_index.
>  + * Return 0 if the new mark is successfully set.
>  + */
>  +static inline int renew_lookahead(struct address_space *mapping,
>  +				struct file_ra_state *ra,
>  +				pgoff_t index, pgoff_t new_index)
>  +{
>  +	struct page *page;
>  +
>  +	if (index == ra->lookahead_index &&
>  +			new_index >= ra->readahead_index)
>  +		return 1;
>  +
>  +	page = find_page(mapping, new_index);
>  +	if (!page)
>  +		return 1;
>  +
>  +	__SetPageReadahead(page);
>  +	if (ra->lookahead_index == index)
>  +		ra->lookahead_index = new_index;
>  +
>  +	return 0;
>  +}
>  +

This is a pagecache page and other CPUs can look it up and play with it. 
The __SetPageReadahead() is quite wrong here.

And we don't have a reference on this page, so this code appears to be racy.

You could fix that by taking and dropping a ref on the page, but it'd be
quicker to take tree_lock and do the SetPageReadahead() while holding it.

This function is too large to inline.
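
A minimal sketch of that locking (illustrative only; SetPageReadahead() is
assumed to be the atomic variant of the patchset's PG_readahead helper,
done while tree_lock pins the lookup):

	read_lock_irq(&mapping->tree_lock);
	page = radix_tree_lookup(&mapping->page_tree, new_index);
	if (page)
		SetPageReadahead(page);	/* atomic, safe vs. other CPUs */
	read_unlock_irq(&mapping->tree_lock);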


* Re: [PATCH 23/33] readahead: backward prefetching method
  2006-05-26 17:37     ` Nate Diller
@ 2006-05-26 19:22       ` Nathan Scott
       [not found]         ` <20060528123006.GC6478@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Nathan Scott @ 2006-05-26 19:22 UTC (permalink / raw)
  To: Nate Diller; +Cc: Wu Fengguang, Andrew Morton, linux-kernel

On Fri, May 26, 2006 at 10:37:56AM -0700, Nate Diller wrote:
> On 5/24/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > Readahead policy for reading backward.
> 
> Just curious, who actually does this?  I noticed you submitted patches

Nastran does this, and probably other FEA codes.  IIRC, iozone
will measure this too - it is very important to some people in
certain scientific arenas.

cheers.

-- 
Nathan


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-26 14:00     ` Andi Kleen
  2006-05-26 16:25       ` Andrew Morton
@ 2006-05-26 23:54       ` Folkert van Heusden
  2006-05-27  0:00         ` Con Kolivas
  1 sibling, 1 reply; 107+ messages in thread
From: Folkert van Heusden @ 2006-05-26 23:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, wfg, mstone

> > These are nice-looking numbers, but one wonders.  If optimising readahead
> > makes this much difference to postgresql performance then postgresql should
> > be doing the readahead itself, rather than relying upon the kernel's
> > ability to guess what the application will be doing in the future.  Because
> > surely the database can do a better job of that than the kernel.
> With that argument we should remove all readahead from the kernel? 
> Because it's already trying to guess what the application will do. 
> I suspect it's better to have good readahead code in the kernel
> than in a zillion applications.

Maybe a pluggable read-ahead system could be implemented.


Folkert van Heusden

-- 
Ever wonder what is out there? Any alien races? Then please support
the seti@home project: setiathome.ssl.berkeley.edu
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-26 23:54       ` Folkert van Heusden
@ 2006-05-27  0:00         ` Con Kolivas
  2006-05-27  0:08           ` Con Kolivas
  0 siblings, 1 reply; 107+ messages in thread
From: Con Kolivas @ 2006-05-27  0:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: Folkert van Heusden, Andi Kleen, Andrew Morton, wfg, mstone

On Saturday 27 May 2006 09:54, Folkert van Heusden wrote:
> > > These are nice-looking numbers, but one wonders.  If optimising
> > > readahead makes this much difference to postgresql performance then
> > > postgresql should be doing the readahead itself, rather than relying
> > > upon the kernel's ability to guess what the application will be doing
> > > in the future.  Because surely the database can do a better job of that
> > > than the kernel.
> >
> > With that argument we should remove all readahead from the kernel?
> > Because it's already trying to guess what the application will do.
> > I suspect it's better to have good readahead code in the kernel
> > than in a zillion applications.
>
> Maybe a pluggable read-ahead system could be implemented.

Pluggable anything is unpopular with Linus and other maintainers. See 
pluggable cpu scheduler and pluggable page replacement policy (vm) patchsets.

-- 
-ck


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-27  0:00         ` Con Kolivas
@ 2006-05-27  0:08           ` Con Kolivas
  2006-05-28 22:20             ` Diego Calleja
  0 siblings, 1 reply; 107+ messages in thread
From: Con Kolivas @ 2006-05-27  0:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Folkert van Heusden, Andi Kleen, Andrew Morton, wfg, mstone

On Saturday 27 May 2006 10:00, Con Kolivas wrote:
> On Saturday 27 May 2006 09:54, Folkert van Heusden wrote:
> > > > These are nice-looking numbers, but one wonders.  If optimising
> > > > readahead makes this much difference to postgresql performance then
> > > > postgresql should be doing the readahead itself, rather than relying
> > > > upon the kernel's ability to guess what the application will be doing
> > > > in the future.  Because surely the database can do a better job of
> > > > that than the kernel.
> > >
> > > With that argument we should remove all readahead from the kernel?
> > > Because it's already trying to guess what the application will do.
> > > I suspect it's better to have good readahead code in the kernel
> > > than in a zillion applications.
> >
> > Maybe a pluggable read-ahead system could be implemented.
>
> Pluggable anything is unpopular with Linus and other maintainers. See
> pluggable cpu scheduler and pluggable page replacement policy (vm)
> patchsets.

Sorry, I should have been clearer. The belief is that certain infrastructure
components do not benefit from a pluggable framework, and readahead probably
comes under that description. It's not like Linus was implying we should only
have one filesystem, for example, since filesystems are after all pluggable
features.

-- 
-ck


* Re: [PATCH 15/33] readahead: state based method - routines
       [not found]       ` <20060527020616.GA7418@mail.ustc.edu.cn>
@ 2006-05-27  2:06         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  2:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:15:36AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > Define some helpers on struct file_ra_state.
> > 
> > +/*
> > + * The 64bit cache_hits stores three accumulated values and a counter value.
> > + * MSB                                                                   LSB
> > + * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
> > + */
> > +static int ra_cache_hit(struct file_ra_state *ra, int nr)
> > +{
> > +	return (ra->cache_hits >> (nr * 16)) & 0xFFFF;
> > +}
> 
> So...   why not use four u16s?

Sure, I have been thinking about that too ;-)
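
Something like this, perhaps (just a sketch, assuming the packed u64 can
simply become an array in struct file_ra_state):

	u16 cache_hits[4];	/* 3 accumulated values + 1 counter */

	static int ra_cache_hit(struct file_ra_state *ra, int nr)
	{
		return ra->cache_hits[nr];
	}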

> > +/*
> > + * Submit IO for the read-ahead request in file_ra_state.
> > + */
> > +static int ra_dispatch(struct file_ra_state *ra,
> > +			struct address_space *mapping, struct file *filp)
> > +{
> > +	enum ra_class ra_class = ra_class_new(ra);
> > +	unsigned long ra_size = ra_readahead_size(ra);
> > +	unsigned long la_size = ra_lookahead_size(ra);
> > +	pgoff_t eof_index = PAGES_BYTE(i_size_read(mapping->host)) + 1;
> 
> Sigh.  I guess one gets used to that PAGES_BYTE thing after a while.  If
> you're not familiar with it, it obfuscates things.
> 
> <hunts around for its definition>
> 
> So in fact it's converting a loff_t to a pgoff_t.  Why not call it
> convert_loff_t_to_pgoff_t()?  ;)
> 
> Something better, anyway.  Something lower-case and an inline-not-a-macro, too.

I'm now using DIV_ROUND_UP(); maybe we can settle on that.
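
That is, something along these lines (illustrative; the helper name
bytes_to_pages() is made up here):

	/* number of pages needed to hold the first @bytes bytes */
	static inline pgoff_t bytes_to_pages(loff_t bytes)
	{
		return DIV_ROUND_UP(bytes, PAGE_CACHE_SIZE);
	}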

> > +	int actual;
> > +
> > +	if (unlikely(ra->ra_index >= eof_index))
> > +		return 0;
> > +
> > +	/* Snap to EOF. */
> > +	if (ra->readahead_index + ra_size / 2 > eof_index) {
> 
> You've had a bit of a think and you've arrived at a design decision
> surrounding the arithmetic in here.  It's very very hard to look at this line
> of code and to work out why you decided to implement it in this fashion. 
> The only way to make such code comprehensible (and hence maintainable) is
> to fully comment such things.

Sorry for being a bit lazy.

It is true that some situations are rather tricky, and some of the
if() conditions/numbers are carefully chosen. I'll keep expanding and
detailing the documentation in future releases. Or would you prefer
that I add the comments as small, distinct patches?

Comments for this one(also rationalized code):

        /* 
         * Snap to EOF, if the request
         *      - crossed the EOF boundary;
         *      - is close to EOF(explained below).
         * 
         * Imagine a file sized 18 pages, where we decided to read ahead the
         * first 16 pages. It is highly likely that in the near future we
         * will have to do another read-ahead for the remaining 2 pages,
         * which is an unfavorably small I/O.
         *
         * So we prefer to take a bit of risk and enlarge the current
         * read-ahead, to eliminate a possible future small I/O.
         */
        if (ra->readahead_index + ra_readahead_size(ra)/4 > eof_index) {
                ra->readahead_index = eof_index;
                if (ra->lookahead_index > eof_index)
                        ra->lookahead_index = eof_index;
                ra->flags |= RA_FLAG_EOF;
        }

Wu


* Re: [PATCH 17/33] readahead: context based method
       [not found]       ` <20060527021252.GB7418@mail.ustc.edu.cn>
@ 2006-05-27  2:12         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  2:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:23:43AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > +#define PAGE_REFCNT_0           0
> >  +#define PAGE_REFCNT_1           (1 << PG_referenced)
> >  +#define PAGE_REFCNT_2           (1 << PG_active)
> >  +#define PAGE_REFCNT_3           ((1 << PG_active) | (1 << PG_referenced))
> >  +#define PAGE_REFCNT_MASK        PAGE_REFCNT_3
> >  +
> >  +/*
> >  + * STATUS   REFERENCE COUNT
> >  + *  __                   0
> >  + *  _R       PAGE_REFCNT_1
> >  + *  A_       PAGE_REFCNT_2
> >  + *  AR       PAGE_REFCNT_3
> >  + *
> >  + *  A/R: Active / Referenced
> >  + */
> >  +static inline unsigned long page_refcnt(struct page *page)
> >  +{
> >  +        return page->flags & PAGE_REFCNT_MASK;
> >  +}
> >  +
> 
> This assumes that PG_referenced < PG_active.  Nobody knows that this
> assumption was made and someone might go and reorder the page flags and
> subtly break readahead.
> 
> We need to either not do it this way, or put a big comment in page-flags.h,
> or even redefine PG_active to be PG_referenced+1.

I have had a code segment like:

#if PG_active < PG_referenced
#  error unexpected page flags order
#endif

I'd add it back.
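
An alternative sketch that does not depend on the bit order at all would
be to test the two flags explicitly (illustrative; it returns 0..3
directly, so the PAGE_REFCNT_* constants would have to be adjusted to
match):

	static inline unsigned long page_refcnt(struct page *page)
	{
		return (PageActive(page) << 1) | PageReferenced(page);
	}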


* Re: [PATCH 13/33] readahead: state based method - aging accounting
       [not found]       ` <20060527062234.GB4991@mail.ustc.edu.cn>
@ 2006-05-27  6:22         ` Wu Fengguang
  2006-05-27  7:00           ` Andrew Morton
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  6:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:04:26AM -0700, Andrew Morton wrote:
> 
> (hey, I haven't finished reading the last batch yet)
> 
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> >  /*
> >  + * The node's effective length of inactive_list(s).
> >  + */
> >  +static unsigned long node_free_and_cold_pages(void)
> >  +{
> >  +	unsigned int i;
> >  +	unsigned long sum = 0;
> >  +	struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
> >  +
> >  +	for (i = 0; i < MAX_NR_ZONES; i++)
> >  +		sum += zones[i].nr_inactive +
> >  +			zones[i].free_pages - zones[i].pages_low;
> >  +
> >  +	return sum;
> >  +}
> 
> I guess this should go into page_alloc.c along with all the similar functions.

Moved, as advised.

> Is this function well-named?  Why does it have "cold" in the name?

Because it only sums `nr_inactive', leaving out `nr_active'.


* Re: [PATCH 20/33] readahead: initial method - expected read size
       [not found]       ` <20060527063826.GC4991@mail.ustc.edu.cn>
@ 2006-05-27  6:38         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  6:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:29:34AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > backing_dev_info.ra_expect_bytes is dynamically updated to be the expected
> > read pages on start-of-file. It allows the initial readahead to be more
> > aggressive and hence efficient.
> > 
> > 
> > +void fastcall readahead_close(struct file *file)
> 
> eww, fastcall.

Hehe, it's a tiny function, and it calls no further subroutines
except the debugging ones.  Still not necessary?

> > +{
> > +	struct inode *inode = file->f_dentry->d_inode;
> > +	struct address_space *mapping = inode->i_mapping;
> > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	unsigned long pos = file->f_pos;
> 
> f_pos is loff_t.

Just meant to be a little more compact ;)

+       unsigned long pos = file->f_pos;
+       unsigned long pgrahit = file->f_ra.cache_hits;
+       unsigned long pgaccess = 1 + pos / PAGE_CACHE_SIZE;
+       unsigned long pgcached = mapping->nrpages;
+
+       if (!pos)                               /* pread */
+               return;
+
+       if (pgcached > bdi->ra_pages0)          /* excessive reads */
+               return;

Here f_pos will almost certainly have small values.

+
+       if (pgaccess >= pgcached) {


Fixed by adding a comment to clarify it:

+       unsigned long pos = file->f_pos;  /* supposed to be small */


* Re: [PATCH 13/33] readahead: state based method - aging accounting
  2006-05-27  6:22         ` Wu Fengguang
@ 2006-05-27  7:00           ` Andrew Morton
       [not found]             ` <20060527072201.GA5284@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: Andrew Morton @ 2006-05-27  7:00 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-kernel

Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>
> > Is this function well-named?  Why does it have "cold" in the name?
> 
>  Because it only sums `nr_inactive', leaving out `nr_active'.

We use the term "cold" to refer to probably-cache-cold pages in the page
allocator.  How about you use "inactive"?


* Re: [PATCH 14/33] readahead: state based method - data structure
       [not found]       ` <20060527070248.GD4991@mail.ustc.edu.cn>
@ 2006-05-27  7:02         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  7:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:05:52AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > >  #define RA_FLAG_MISS 0x01	/* a cache miss occurred against this file */
> >   #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
> >  +#define RA_FLAG_MMAP		(1UL<<31)	/* mmaped page access */
> >  +#define RA_FLAG_NO_LOOKAHEAD	(1UL<<30)	/* disable look-ahead */
> >  +#define RA_FLAG_EOF		(1UL<<29)	/* readahead hits EOF */
> 
> Odd.  Why not use 4, 8, 16?

Sorry, the lower 8 bits are for ra_class values in the new code. It could
cause data corruption when dynamically switching between the two logics :(

I'd like to change the flags member to explicit ones like

        struct {
                unsigned miss           :1;
                unsigned incache        :1;
                unsigned mmap           :1;
                unsigned no_lookahead   :1;
                unsigned eof            :1;
        } flags;

        unsigned class_new              :4;
        unsigned class_old              :4;

Reasonable?


* Re: [PATCH 13/33] readahead: state based method - aging accounting
       [not found]             ` <20060527072201.GA5284@mail.ustc.edu.cn>
@ 2006-05-27  7:22               ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  7:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Sat, May 27, 2006 at 12:00:58AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > > Is this function well-named?  Why does it have "cold" in the name?
> > 
> >  Because it only sums `nr_inactive', leaving out `nr_active'.
> 
> We use the term "cold" to refer to probably-cache-cold pages in the page
> allocator.  How about you use "inactive"?

Got it, thanks.


* Re: [PATCH 17/33] readahead: context based method
       [not found]       ` <20060527080443.GE4991@mail.ustc.edu.cn>
@ 2006-05-27  8:04         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  8:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:27:16AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > This is the slow code path of adaptive read-ahead.
> > 
> > ...
> >
> > +
> > +/*
> > + * Count/estimate cache hits in range [first_index, last_index].
> > + * The estimation is simple and optimistic.
> > + */
> > +static int count_cache_hit(struct address_space *mapping,
> > +				pgoff_t first_index, pgoff_t last_index)
> > +{
> > +	struct page *page;
> > +	int size = last_index - first_index + 1;
> 
> `size' might overflow.

It does. Fixed the caller:
        @@query_page_cache_segment()
        index = radix_tree_scan_hole_backward(&mapping->page_tree,
-                                                       offset, ra_max);
+                                                       offset - 1, ra_max);
Here (offset >= 1) always holds.

> > +	int count = 0;
> > +	int i;
> > +
> > +	cond_resched();
> > +	read_lock_irq(&mapping->tree_lock);
> > +
> > +	/*
> > +	 * The first page may well be the chunk head and already accessed,
> > +	 * so it is index 0 that makes the estimation optimistic. This
> > +	 * behavior guarantees a readahead when (size < ra_max) and
> > +	 * (readahead_hit_rate >= 16).
> > +	 */
> > +	for (i = 0; i < 16;) {
> > +		page = __find_page(mapping, first_index +
> > +						size * ((i++ * 29) & 15) / 16);
> 
> 29?

It's a prime number. That should be made obvious by the following macro:

#define CACHE_HIT_HASH_KEY      29      /* some prime number */
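
Being odd, 29 is coprime to 16, so (i * 29) & 15 visits each of the 16
evenly spaced sample slots exactly once, just in a scattered order. A
quick userspace illustration (not kernel code):

	#include <stdio.h>

	int main(void)
	{
		int i;

		/* the visit order of the 16 sample slots */
		for (i = 0; i < 16; i++)
			printf("%d ", (i * 29) & 15);
		printf("\n");	/* 0 13 10 7 4 1 14 11 8 5 2 15 12 9 6 3 */

		return 0;
	}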


* Re: [PATCH 14/33] readahead: state based method - data structure
       [not found]       ` <20060527082758.GF4991@mail.ustc.edu.cn>
@ 2006-05-27  8:27         ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27  8:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Fri, May 26, 2006 at 10:05:52AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> >  #define RA_FLAG_MISS 0x01	/* a cache miss occured against this file */
> >   #define RA_FLAG_INCACHE 0x02	/* file is already in cache */
> >  +#define RA_FLAG_MMAP		(1UL<<31)	/* mmaped page access */
> >  +#define RA_FLAG_NO_LOOKAHEAD	(1UL<<30)	/* disable look-ahead */
> >  +#define RA_FLAG_EOF		(1UL<<29)	/* readahead hits EOF */
> 
> Odd.  Why not use 4, 8, 16?

I'm now settled with:

-#define RA_FLAG_MISS 0x01      /* a cache miss occurred against this file */
-#define RA_FLAG_INCACHE 0x02   /* file is already in cache */
+#define RA_FLAG_MISS   (1UL<<31) /* a cache miss occurred against this file */
+#define RA_FLAG_INCACHE        (1UL<<30) /* file is already in cache */
+#define RA_FLAG_MMAP           (1UL<<29) /* mmaped page access */
+#define RA_FLAG_NO_LOOKAHEAD   (1UL<<28) /* disable look-ahead */
+#define RA_FLAG_EOF            (1UL<<27) /* readahead hits EOF */

And still let the low bits hold ra_class values.


* Re: [PATCH 09/33] readahead: events accounting
       [not found]       ` <20060527132002.GA4814@mail.ustc.edu.cn>
@ 2006-05-27 13:20         ` Wu Fengguang
  2006-05-29  8:19           ` Martin Peschke
  0 siblings, 1 reply; 107+ messages in thread
From: Wu Fengguang @ 2006-05-27 13:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, joern, ioe-lkml, Martin Peschke

On Thu, May 25, 2006 at 09:36:27AM -0700, Andrew Morton wrote:
> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> >
> > A debugfs file named `readahead/events' is created according to advice from
> >  Jörn Engel, Andrew Morton and Ingo Oeser.
> 
> If everyone's patches all get merged up we'd expect that this facility be
> migrated over to use Martin Peschke's statistics infrastructure.
> 
> That's not a thing you should do now, but it would be a useful test of
> Martin's work if you could find time to look at it and let us know whether
> the infrastructure which he has provided would suit this application,
> thanks.

Hi, Martin is doing a great job, thanks.

I have read through its documentation.  It should be suitable for the
various readahead numbers, and porting to it looks like trivial work :)

However it might also make sense to keep the current _table_ interface.
It shows us the whole picture at a glance:

% cat /debug/readahead/events
[table requests]     total   newfile     state   context  contexta [...]
cache_miss          136302       538      3860     11317       490
read_random          62176       160       424      1633        60
io_congestion            0         0         0         0         0
io_cache_hit         34521       663     10071     15611      1423
io_block            204302     42174     10408     68277      2226
readahead           251478     70746     96846     73636      2561
lookahead           136315     14805     86267     32738      2505
lookahead_hit       103384      8038     74605      9097       598
lookahead_ignore         0         0         0         0         0
readahead_mmap        6911         0         0         0         0
readahead_eof        70793     55935      8500       648       581
readahead_shrink       473         0       473         0         0
readahead_thrash         0         0         0         0         0
readahead_mutilt      2526        24      1079      1403        20
readahead_rescue      1209         0         0         0         0

[table pages]        total   newfile     state   context  contexta
cache_miss      1292350444    282817  35557285  86087568   5592690
read_random       10299237       177       426      1903        63
io_congestion            0         0         0         0         0
io_cache_hit       2194663      9289   1507054    414311    184715
io_block            204302     42174     10408     68277      2226
readahead         26122947    770681  21815335   3097682    259587
readahead_hit     23101714    588811  19906233   2209547    191269
lookahead         21397630    173502  19872014    936474    415640
lookahead_hit     18663196     98004  17879848    596562     88782
lookahead_ignore         0         0         0         0         0
readahead_mmap      170509         0         0         0         0
readahead_eof      1950484    432763   1342148     47368     34742
readahead_shrink     19900         0     19900         0         0
readahead_thrash         0         0         0         0         0
readahead_mutilt    220331       485    186922     29900      3024
readahead_rescue    119592         0         0         0         0

[table summary]      total   newfile     state   context  contexta
random_rate            19%        0%        0%        2%        2%
ra_hit_rate            88%       76%       91%       71%       73%
la_hit_rate            75%       54%       86%       27%       23%
var_ra_size          13850       130      5802      6709     10563
avg_ra_size            104        11       225        42       101
avg_la_size            157        12       230        29       166


When Martin's work is included in -mm, I would like to move several
columns/rows from the table over to Martin's infrastructure, and
perhaps add some more items. One obvious candidate collection is the
ra_account(NULL, ...) calls, which do not quite fit the table
interface and deserve individual files.

Wu


* Re: [PATCH 23/33] readahead: backward prefetching method
       [not found]         ` <20060528123006.GC6478@mail.ustc.edu.cn>
@ 2006-05-28 12:30           ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-28 12:30 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Nate Diller, Andrew Morton, linux-kernel

On Sat, May 27, 2006 at 05:22:43AM +1000, Nathan Scott wrote:
> On Fri, May 26, 2006 at 10:37:56AM -0700, Nate Diller wrote:
> > On 5/24/06, Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> > > Readahead policy for reading backward.
> > 
> > Just curious, who actually does this?  I noticed you submitted patches
> 
> Nastran does this, and probably other FEA codes.  IIRC, iozone
> will measure this too - it is very important to some people in
> certain scientific arenas.

Thanks.

It makes sense to have a list of use cases for the
less-common-but-still-important access patterns.

Cheers,
Wu


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-27  0:08           ` Con Kolivas
@ 2006-05-28 22:20             ` Diego Calleja
  2006-05-28 22:31               ` kernel
  0 siblings, 1 reply; 107+ messages in thread
From: Diego Calleja @ 2006-05-28 22:20 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, folkert, ak, akpm, wfg, mstone

On Sat, 27 May 2006 10:08:41 +1000,
Con Kolivas <kernel@kolivas.org> wrote:
> On Saturday 27 May 2006 10:00, Con Kolivas wrote:

> Sorry, I should have been clearer. The belief is that certain infrastructure
> components do not benefit from a pluggable framework, and readahead probably
> comes under that description. It's not like Linus was implying we should only
> have one filesystem, for example, since filesystems are after all pluggable
> features.

That leaves another question that I (a poor user) may have missed: Why is
adaptive read-ahead compile-time configurable instead of completely replacing
the old system?


* Re: [PATCH 00/33] Adaptive read-ahead V12
  2006-05-28 22:20             ` Diego Calleja
@ 2006-05-28 22:31               ` kernel
       [not found]                 ` <20060529030445.GB5994@mail.ustc.edu.cn>
  0 siblings, 1 reply; 107+ messages in thread
From: kernel @ 2006-05-28 22:31 UTC (permalink / raw)
  To: Diego Calleja; +Cc: linux-kernel, folkert, ak, akpm, wfg, mstone

Quoting Diego Calleja <diegocg@gmail.com>:

> That leaves another question that I (a poor user) may have missed: Why is
> adaptive read-ahead compile-time configurable instead of completely
> replacing
> the old system?

That was done to appease the users out there who had worse performance with it.
In the early stages of development of this code it was rather detrimental on an
ordinary desktop. Fortunately that seems to have gotten a lot better. I don't
think the final version should be a compile-time option. It's either "adaptive"
and better everywhere or it's not.

--
-ck



* Re: [PATCH 00/33] Adaptive read-ahead V12
       [not found]                 ` <20060529030445.GB5994@mail.ustc.edu.cn>
@ 2006-05-29  3:04                   ` Wu Fengguang
  0 siblings, 0 replies; 107+ messages in thread
From: Wu Fengguang @ 2006-05-29  3:04 UTC (permalink / raw)
  To: kernel; +Cc: Diego Calleja, linux-kernel, folkert, ak, akpm, mstone

On Mon, May 29, 2006 at 08:31:43AM +1000, kernel@kolivas.org wrote:
> Quoting Diego Calleja <diegocg@gmail.com>:
> 
> > That leaves another question that I (a poor user) may have missed: Why is
> > adaptive read-ahead compile-time configurable instead of completely
> > replacing
> > the old system?
> 
> That was done to appease the users out there who had worse performance with it.
> In the early stages of development of this code it was rather detrimental on an
> ordinary desktop. Fortunately that seems to have gotten a lot better. I don't
> think the final version should be a compile-time option. It's either "adaptive"
> and better everywhere or it's not.
 
Hehe, I have a dream - that it helps *everywhere* ;-)

Wu


* Re: [PATCH 09/33] readahead: events accounting
  2006-05-27 13:20         ` Wu Fengguang
@ 2006-05-29  8:19           ` Martin Peschke
  0 siblings, 0 replies; 107+ messages in thread
From: Martin Peschke @ 2006-05-29  8:19 UTC (permalink / raw)
  To: Wu Fengguang, Andrew Morton, linux-kernel, joern, ioe-lkml,
	Martin Peschke

Wu Fengguang wrote:
> On Thu, May 25, 2006 at 09:36:27AM -0700, Andrew Morton wrote:
>> Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
>>> A debugfs file named `readahead/events' is created according to advice from
>>>  Jörn Engel, Andrew Morton and Ingo Oeser.
>> If everyone's patches all get merged up we'd expect that this facility be
>> migrated over to use Martin Peschke's statistics infrastructure.
>>
>> That's not a thing you should do now, but it would be a useful test of
>> Martin's work if you could find time to look at it and let us know whether
>> the infrastructure which he has provided would suit this application,
>> thanks.
> 
> Hi, Martin is doing a great job, thanks.
> 
> I have read through its documentation.  It should be suitable for the
> various readahead numbers, and porting to it looks like trivial work :)

Wu, great :) If you have questions (e.g. on how to set up your statistics so
that the output looks compact) or more requirements (like an enhancement
of the code that accumulates numbers), feel free to get back to me.
Thanks, Martin


