* [PATCH 00/15] 512K readahead size with thrashing safe readahead v2
From: Wu Fengguang @ 2010-02-24  3:10 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Chris Mason, Peter Zijlstra, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Christian Ehrhardt, Matt Mackall,
	Nick Piggin, Linux Memory Management List, linux-fsdevel,
	Wu Fengguang, LKML

Andrew,

This enlarges the default readahead size from 128K to 512K.
To avoid possible regressions, it also
- scales down the readahead size on small devices and small-memory systems
- adds thrashing-safe context readahead
- adds readahead tracing/stats support to help expose possible problems

The patchset also includes several algorithm updates:
- no start-of-file readahead after lseek
- faster radix_tree_next_hole()/radix_tree_prev_hole()
- pagecache context based mmap read-around


Changes since v1:
- update mmap read-around heuristics (Thanks to Nick Piggin)
- radix_tree_lookup_leaf_node() for the pagecache based mmap read-around
- use __print_symbolic() to show readahead pattern names
  (Thanks to Steven Rostedt)
- scale down readahead size proportional to system memory
  (Thanks to Matt Mackall)
- add readahead size kernel parameter (by Nikanth Karthikesan)
- add comments from Christian Ehrhardt

Changes since RFC:
- move the lengthy intro text to individual patch changelogs
- treat get_capacity()==0 as an uninitialized value (Thanks to Vivek Goyal)
- increase readahead size limit for small devices (Thanks to Jens Axboe)
- add fio test results by Vivek Goyal


[PATCH 01/15] readahead: limit readahead size for small devices
[PATCH 02/15] readahead: retain inactive lru pages to be accessed soon
[PATCH 03/15] readahead: bump up the default readahead size
[PATCH 04/15] readahead: make default readahead size a kernel parameter
[PATCH 05/15] readahead: limit readahead size for small memory systems
[PATCH 06/15] readahead: replace ra->mmap_miss with ra->ra_flags
[PATCH 07/15] readahead: thrashing safe context readahead
[PATCH 08/15] readahead: record readahead patterns
[PATCH 09/15] readahead: add tracing event
[PATCH 10/15] readahead: add /debug/readahead/stats
[PATCH 11/15] readahead: dont do start-of-file readahead after lseek()
[PATCH 12/15] radixtree: introduce radix_tree_lookup_leaf_node()
[PATCH 13/15] radixtree: speed up the search for hole
[PATCH 14/15] readahead: reduce MMAP_LOTSAMISS for mmap read-around
[PATCH 15/15] readahead: pagecache context based mmap read-around

 Documentation/kernel-parameters.txt |    4 
 block/blk-core.c                    |    3 
 block/genhd.c                       |   24 +
 fs/fuse/inode.c                     |    2 
 fs/read_write.c                     |    3 
 include/linux/fs.h                  |   64 +++
 include/linux/mm.h                  |    8 
 include/linux/radix-tree.h          |    2 
 include/trace/events/readahead.h    |   78 ++++
 lib/radix-tree.c                    |   94 ++++-
 mm/Kconfig                          |   13 
 mm/filemap.c                        |   30 +
 mm/readahead.c                      |  459 ++++++++++++++++++++++----
 13 files changed, 680 insertions(+), 104 deletions(-)



* [PATCH 01/15] readahead: limit readahead size for small devices
From: Wu Fengguang @ 2010-02-24  3:10 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Li Shaohua, Clemens Ladisch, Wu Fengguang,
	Chris Mason, Peter Zijlstra, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML


Linus reports a _really_ small & slow (505kB, 15kB/s) USB device on
which blkid runs unpleasantly slowly. He managed to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns them into 48kB.

     lseek 0,    read 1024   => readahead 4 pages (start of file)
     lseek 1536, read 16384  => readahead 8 pages (page contiguous)

The readahead heuristics involved here are reasonable in general, so it
is good to fix blkid with fadvise(RANDOM), as Linus already did.
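
(Purely as an illustration of that user-space fix, not part of this
patch: a minimal sketch, with "/dev/sdb" as a placeholder device.)

	#define _XOPEN_SOURCE 600
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* Mark the fd as random-access so the kernel stops inflating
		 * the small probe reads into large readahead requests. */
		int fd = open("/dev/sdb", O_RDONLY);
		int err;

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* len == 0 means "up to the end of the file/device" */
		err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
		if (err)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
		/* ... do the 1kB+16kB probe reads here ... */
		close(fd);
		return 0;
	}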

For the kernel part, Linus suggests:
  So maybe we could be less aggressive about read-ahead when the size of
  the device is small? Turning a 16kB read into a 64kB one is a big deal,
  when it's about 15% of the whole device!

This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).

Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:

        disk size    readahead size
     (scale by 4)      (scale by 2)
               1M                8k
               4M               16k
              16M               32k
              64M               64k
             256M              128k
               1G              256k
        --------------------------- (*)
               4G              512k
              16G             1024k
              64G             2048k
             256G             4096k

(*) Since the default readahead size is 512k, this limit only takes
effect for devices whose size is less than 4G.
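
A quick user-space check of this mapping (only a sketch: ra_limit_pages()
is a made-up helper, 512-byte sectors and 4KB pages are assumed, and the
actual in-kernel code is in the hunk below):

	#include <stdio.h>

	/* floor(log2(v)), v > 0 */
	static unsigned long ilog2_ul(unsigned long v)
	{
		unsigned long r = 0;

		while (v >>= 1)
			r++;
		return r;
	}

	/* disk size in bytes -> readahead limit in pages */
	static unsigned long ra_limit_pages(unsigned long long bytes)
	{
		unsigned long size = (bytes >> 9) >> 9;	/* sectors, then >> 9 as in add_disk() */

		return size ? 1UL << (ilog2_ul(size) / 2) : 0;
	}

	int main(void)
	{
		unsigned long long mb = 1ULL << 20, sizes[] = { 1, 64, 1024, 4096 };
		int i;

		/* prints: 1M -> 8k, 64M -> 64k, 1024M -> 256k, 4096M -> 512k */
		for (i = 0; i < 4; i++)
			printf("%lluM -> %luk\n", sizes[i],
			       ra_limit_pages(sizes[i] * mb) * 4);
		return 0;
	}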

The formula was determined from the following data, collected by this script:

	#!/bin/sh

	# please make sure BDEV is not mounted or opened by others
	BDEV=sdb

	for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
	do
		echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb 
		time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
	done

The principle is that the formula should not limit the readahead size to
a degree that would impact any device's sequential read performance.

The Intel SSD is special in that its throughput increases steadily with
larger readahead size. However it may take years for Linux to increase
its default readahead size to 2MB, so we don't take it seriously in the
formula.

SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)

	rasize	1st run		2nd run
	----------------------------------
	  4k	123 MB/s	122 MB/s
	 16k  	153 MB/s	153 MB/s
	 32k	161 MB/s	162 MB/s
	 64k	167 MB/s	168 MB/s
	128k	197 MB/s	197 MB/s
	256k	217 MB/s	217 MB/s
	512k	238 MB/s	234 MB/s
	  1M	251 MB/s	248 MB/s
	  2M	259 MB/s	257 MB/s
==>	  4M	269 MB/s	264 MB/s
	  8M	266 MB/s	266 MB/s

Note that ==> points to the readahead size that yields plateau throughput.

SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)

	rasize  1st             2nd
	--------------------------------
	  4k     41 MB/s         41 MB/s
	 16k     85 MB/s         81 MB/s
	 32k    102 MB/s        109 MB/s
	 64k    125 MB/s        144 MB/s
	128k    183 MB/s        185 MB/s
	256k    216 MB/s        216 MB/s
	512k    216 MB/s        236 MB/s
	1024k   251 MB/s        252 MB/s
	  2M    258 MB/s        258 MB/s
==>       4M    266 MB/s        266 MB/s
	  8M    266 MB/s        266 MB/s

SSD 30G SanDisk SATA 5000

	  4k	29.6 MB/s	29.6 MB/s	29.6 MB/s
	 16k	52.1 MB/s	52.1 MB/s	52.1 MB/s
	 32k	61.5 MB/s	61.5 MB/s	61.5 MB/s
	 64k	67.2 MB/s	67.2 MB/s	67.1 MB/s
	128k	71.4 MB/s	71.3 MB/s	71.4 MB/s
	256k	73.4 MB/s	73.4 MB/s	73.3 MB/s
==>	512k	74.6 MB/s	74.6 MB/s	74.6 MB/s
	  1M	74.7 MB/s	74.6 MB/s	74.7 MB/s
	  2M	76.1 MB/s	74.6 MB/s	74.6 MB/s

USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165

	  4k	7.9 MB/s 	7.9 MB/s 	7.9 MB/s
	 16k	17.9 MB/s	17.9 MB/s	17.9 MB/s
	 32k	24.5 MB/s	24.5 MB/s	24.5 MB/s
	 64k	28.7 MB/s	28.7 MB/s	28.7 MB/s
	128k	28.8 MB/s	28.9 MB/s	28.9 MB/s
==>	256k	30.5 MB/s	30.5 MB/s	30.5 MB/s
	512k	30.9 MB/s	31.0 MB/s	30.9 MB/s
	  1M	31.0 MB/s	30.9 MB/s	30.9 MB/s
	  2M	30.9 MB/s	30.9 MB/s	30.9 MB/s

USB stick 4G SanDisk  Cruzer idVendor=0781, idProduct=5151

	  4k	6.4 MB/s 	6.4 MB/s 	6.4 MB/s
	 16k	13.4 MB/s	13.4 MB/s	13.2 MB/s
	 32k	17.8 MB/s	17.9 MB/s	17.8 MB/s
	 64k	21.3 MB/s	21.3 MB/s	21.2 MB/s
	128k	21.4 MB/s	21.4 MB/s	21.4 MB/s
==>	256k	23.3 MB/s	23.2 MB/s	23.2 MB/s
	512k	23.3 MB/s	23.8 MB/s	23.4 MB/s
	  1M	23.8 MB/s	23.4 MB/s	23.3 MB/s
	  2M	23.4 MB/s	23.2 MB/s	23.4 MB/s

USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113

	  4k	6.7 MB/s 	6.9 MB/s 	6.7 MB/s
	 16k	11.7 MB/s	11.7 MB/s	11.7 MB/s
	 32k	12.4 MB/s	12.4 MB/s	12.4 MB/s
   	 64k	13.4 MB/s	13.4 MB/s	13.4 MB/s
	128k	13.4 MB/s	13.4 MB/s	13.4 MB/s
==>	256k	13.6 MB/s	13.6 MB/s	13.6 MB/s
	512k	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  1M	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  2M	13.7 MB/s	13.7 MB/s	13.7 MB/s

64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey

	4KB:    139.339 s, 376 kB/s
	16KB:   81.0427 s, 647 kB/s
	32KB:   71.8513 s, 730 kB/s
==>	64KB:   67.3872 s, 778 kB/s
	128KB:  67.5434 s, 776 kB/s
	256KB:  65.9019 s, 796 kB/s
	512KB:  66.2282 s, 792 kB/s
	1024KB: 67.4632 s, 777 kB/s
	2048KB: 69.9759 s, 749 kB/s

CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/genhd.c |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

--- linux.orig/block/genhd.c	2010-02-03 20:40:37.000000000 +0800
+++ linux/block/genhd.c	2010-02-04 21:19:07.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
 	struct backing_dev_info *bdi;
 	dev_t devt;
 	int retval;
+	unsigned long size;
 
 	/* minors == 0 indicates to use ext devt from part0 and should
 	 * be accompanied with EXT_DEVT flag.  Make sure all
@@ -551,6 +552,29 @@ void add_disk(struct gendisk *disk)
 	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
 				   "bdi");
 	WARN_ON(retval);
+
+	/*
+	 * Limit default readahead size for small devices.
+	 *        disk size    readahead size
+	 *               1M                8k
+	 *               4M               16k
+	 *              16M               32k
+	 *              64M               64k
+	 *             256M              128k
+	 *               1G              256k
+	 *        ---------------------------
+	 *               4G              512k
+	 *              16G             1024k
+	 *              64G             2048k
+	 *             256G             4096k
+	 * Since the default readahead size is 512k, this limit
+	 * only takes effect for devices whose size is less than 4G.
+	 */
+	if (get_capacity(disk)) {
+		size = get_capacity(disk) >> 9;
+		size = 1UL << (ilog2(size) / 2);
+		bdi->ra_pages = min(bdi->ra_pages, size);
+	}
 }
 
 EXPORT_SYMBOL(add_disk);




* [PATCH 02/15] readahead: retain inactive lru pages to be accessed soon
From: Wu Fengguang @ 2010-02-24  3:10 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Chris Frost, Steve VanDeBogart, KAMEZAWA Hiroyuki,
	Wu Fengguang, Chris Mason, Peter Zijlstra, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Christian Ehrhardt, Matt Mackall,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML


From: Chris Frost <frost@cs.ucla.edu>

Ensure that cached pages in the inactive list are not prematurely evicted;
move such pages to lru head when they are covered by
- in-kernel heuristic readahead
- a posix_fadvise(POSIX_FADV_WILLNEED) hint from an application

Before this patch, pages already in core could be evicted before the
pages covered by the same prefetch scan that were not yet in core.
This behavior may force many small read requests onto the disk.

In particular, posix_fadvise(... POSIX_FADV_WILLNEED) on an in-core page
has no effect on the page's location in the LRU list, even if it is the
next victim on the inactive list.
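
(For context, a minimal sketch of the kind of application hint involved;
"data.db" is only a placeholder file name:)

	#define _XOPEN_SOURCE 600
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* Ask the kernel to prefetch the first 16MB of the file.
		 * With this patch the hint also keeps already-cached pages
		 * in that range from being evicted early. */
		int fd = open("data.db", O_RDONLY);
		int err;

		if (fd < 0) {
			perror("open");
			return 1;
		}
		err = posix_fadvise(fd, 0, 16 << 20, POSIX_FADV_WILLNEED);
		if (err)
			fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
		/* ... later reads of that range should now hit the cache ... */
		close(fd);
		return 0;
	}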

This change helps address the performance problems we encountered
while modifying SQLite and the GIMP to use large file prefetching.
Overall these prefetching techniques improved the runtime of large
benchmarks by 10-17x for these applications. More details are in the
publication _Reducing Seek Overhead with Application-Directed Prefetching_
(USENIX ATC 2009) and at http://libprefetch.cs.ucla.edu/.

Signed-off-by: Chris Frost <frost@cs.ucla.edu>
Signed-off-by: Steve VanDeBogart <vandebo@cs.ucla.edu>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

--- linux.orig/mm/readahead.c	2010-02-24 10:44:26.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:40.000000000 +0800
@@ -9,7 +9,9 @@
 
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/memcontrol.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <linux/module.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -133,6 +135,40 @@ out:
 }
 
 /*
+ * The file range is expected to be accessed in near future.  Move pages
+ * (possibly in inactive lru tail) to lru head, so that they are retained
+ * in memory for some reasonable time.
+ */
+static void retain_inactive_pages(struct address_space *mapping,
+				  pgoff_t index, int len)
+{
+	int i;
+	struct page *page;
+	struct zone *zone;
+
+	for (i = 0; i < len; i++) {
+		page = find_get_page(mapping, index + i);
+		if (!page)
+			continue;
+
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (PageLRU(page) &&
+		    !PageActive(page) &&
+		    !PageUnevictable(page)) {
+			int lru = page_lru_base_type(page);
+
+			del_page_from_lru_list(zone, page, lru);
+			add_page_to_lru_list(zone, page, lru);
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+		put_page(page);
+	}
+}
+
+/*
  * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates all
  * the pages first, then submits them all for I/O. This avoids the very bad
  * behaviour which would occur if page allocations are causing VM writeback.
@@ -184,6 +220,14 @@ __do_page_cache_readahead(struct address
 	}
 
 	/*
+	 * Normally readahead will auto stop on cached segments, so we won't
+	 * hit many cached pages. If it does happen, bring the inactive pages
+	 * adjacent to the newly prefetched ones (if any).
+	 */
+	if (ret < nr_to_read)
+		retain_inactive_pages(mapping, offset, page_idx);
+
+	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
 	 * uptodate then the caller will launch readpage again, and
 	 * will then handle the error.




* [PATCH 03/15] readahead: bump up the default readahead size
From: Wu Fengguang @ 2010-02-24  3:10 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Chris Mason, Peter Zijlstra, Martin Schwidefsky,
	Paul Gortmaker, Matt Mackall, David Woodhouse,
	Christian Ehrhardt, Wu Fengguang, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML


Use a 512KB max readahead size and a 32KB min readahead size.

The former helps I/O performance for common workloads.
The latter will be used in the thrashing-safe context readahead.


====== Rationale for the 512KB size ======

I believe it yields more I/O throughput without noticeably increasing
I/O latency for today's HDD.

For example, for a 100MB/s and 8ms access time HDD, its random IO or
highly concurrent sequential IO would in theory be:

io_size KB  access_time  transfer_time  io_latency   util%   throughput KB/s
4           8             0.04           8.04        0.49%    497.57  
8           8             0.08           8.08        0.97%    990.33  
16          8             0.16           8.16        1.92%   1961.69 
32          8             0.31           8.31        3.76%   3849.62 
64          8             0.62           8.62        7.25%   7420.29 
128         8             1.25           9.25       13.51%  13837.84
256         8             2.50          10.50       23.81%  24380.95
512         8             5.00          13.00       38.46%  39384.62
1024        8            10.00          18.00       55.56%  56888.89
2048        8            20.00          28.00       71.43%  73142.86
4096        8            40.00          48.00       83.33%  85333.33

The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to
~39MB/s, while only increasing the (minimal) IO latency from 9.25ms to 13ms.
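
(The table above can be reproduced from the stated 100MB/s transfer rate
and 8ms access time; a small sketch of that model, taking 1MB/s as 1024KB/s:)

	#include <stdio.h>

	int main(void)
	{
		double access_ms = 8.0, kb_per_ms = 100.0 * 1024 / 1000;
		int kb;

		for (kb = 4; kb <= 4096; kb *= 2) {
			double xfer_ms = kb / kb_per_ms;	/* transfer time */
			double lat_ms = access_ms + xfer_ms;	/* io latency   */

			printf("%4d KB: %6.2f ms, util %5.2f%%, %8.2f KB/s\n",
			       kb, lat_ms, 100 * xfer_ms / lat_ms,
			       kb / lat_ms * 1000);
		}
		return 0;
	}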

As for SSDs, I find that the Intel X25-M SSD benefits from a large
readahead size even for sequential reads:

	rasize	1st run		2nd run
	----------------------------------
	  4k	123 MB/s	122 MB/s
	 16k  	153 MB/s	153 MB/s
	 32k	161 MB/s	162 MB/s
	 64k	167 MB/s	168 MB/s
	128k	197 MB/s	197 MB/s
	256k	217 MB/s	217 MB/s
	512k	238 MB/s	234 MB/s
	  1M	251 MB/s	248 MB/s
	  2M	259 MB/s	257 MB/s
   	  4M	269 MB/s	264 MB/s
	  8M	266 MB/s	266 MB/s

The two other impacts of an enlarged readahead size are

- memory footprint (caused by readahead miss)
	Sequential readahead hit ratio is pretty high regardless of max
	readahead size; the extra memory footprint is mainly caused by
	enlarged mmap read-around.
	I measured my desktop:
	- under Xwindow:
		128KB readahead hit ratio = 143MB/230MB = 62%
		512KB readahead hit ratio = 138MB/248MB = 55%
		  1MB readahead hit ratio = 130MB/253MB = 51%
	- under console: (seems more stable than the Xwindow data)
		128KB readahead hit ratio = 30MB/56MB   = 53%
		  1MB readahead hit ratio = 30MB/59MB   = 51%
	So the impact to memory footprint looks acceptable.

- readahead thrashing
	It will now cost 1MB readahead buffer per stream.  Memory tight
	systems typically do not run multiple streams; but if they do
	so, it should help I/O performance as long as we can avoid
	thrashing, which can be achieved with the following patches.

I also booted the system into the console with different readahead sizes,
and found that both io_count and readahead_hit_ratio dropped by ~10%
when increasing readahead_size from 128k to 512k. I guess typical
desktop users would prefer the reduced IO count (for fast boot) at
the cost of a dozen MB of memory.

readahead_size	io_count   avg_io_pages   total_readahead_pages	 readahead_hit_ratio
            4k      6765              1    6765			 -
          128k      1077              8    8616			 78.5%
          512k       897             11    9867			 68.6%
         1024k       867             12   10404			 65.0%
total_readahead_pages = io_count * avg_io_size


====== Remarks by Christian Ehrhardt ======

- 512 is by far superior to 128 for sequential reads
- improvements with iozone sequential read scaling from 1 to 64 parallel
  processes up to +35%
- readahead sizes larger than 512 turned out not to be "more useful",
  while increasing the chance of thrashing on low-memory systems


====== Benchmarks by Vivek Goyal ======

I have two paths to the HP EVA and a multipath device setup (dm-3).
I run an increasing number of sequential readers. The file system is ext3
and the file size is 1G.
I have run the tests 3 times (3 sets) and taken the average.

Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
======================================================================
                    2.6.33-rc5                2.6.33-rc5-readahead
job   Set NR  ReadBW(KB/s)   MaxClat(us)    ReadBW(KB/s)   MaxClat(us)
---   --- --  ------------   -----------    ------------   -----------
bsr   3   1   141768         130965         190302         97937.3    
bsr   3   2   131979         135402         185636         223286     
bsr   3   4   132351         420733         185986         363658     
bsr   3   8   133152         455434         184352         428478     
bsr   3   16  130316         674499         185646         594311     

I ran the same test on a different piece of hardware. There are a few SATA
disks (5-6) in a striped configuration behind a hardware RAID controller.

Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
======================================================================
                    2.6.33-rc5                2.6.33-rc5-readahead
job   Set NR  ReadBW(KB/s)   MaxClat(us)    ReadBW(KB/s)   MaxClat(us)    
---   --- --  ------------   -----------    ------------   -----------    
bsr   3   1   147569         14369.7        160191         22752          
bsr   3   2   124716         243932         149343         184698         
bsr   3   4   123451         327665         147183         430875         
bsr   3   8   122486         455102         144568         484045         
bsr   3   16  117645         1.03957e+06    137485         1.06257e+06    


CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Paul Gortmaker <paul.gortmaker@windriver.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Woodhouse <dwmw2@infradead.org>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by:  Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/include/linux/mm.h	2010-02-24 10:44:26.000000000 +0800
+++ linux/include/linux/mm.h	2010-02-24 10:44:41.000000000 +0800
@@ -1186,8 +1186,8 @@ int write_one_page(struct page *page, in
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD	128	/* kbytes */
-#define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
+#define VM_MAX_READAHEAD	512	/* kbytes */
+#define VM_MIN_READAHEAD	32	/* kbytes (includes current page) */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 03/15] readahead: bump up the default readahead size
@ 2010-02-24  3:10   ` Wu Fengguang
  0 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Chris Mason, Peter Zijlstra, Martin Schwidefsky,
	Paul Gortmaker, Matt Mackall, David Woodhouse,
	Christian Ehrhardt, Wu Fengguang, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-enlarge-default-size.patch --]
[-- Type: text/plain, Size: 7038 bytes --]

Use 512kb max readahead size, and 32kb min readahead size.

The former helps io performance for common workloads.
The latter will be used in the thrashing safe context readahead.


====== Rationals on the 512kb size ======

I believe it yields more I/O throughput without noticeably increasing
I/O latency for today's HDD.

For example, for a 100MB/s and 8ms access time HDD, its random IO or
highly concurrent sequential IO would in theory be:

io_size KB  access_time  transfer_time  io_latency   util%   throughput KB/s
4           8             0.04           8.04        0.49%    497.57  
8           8             0.08           8.08        0.97%    990.33  
16          8             0.16           8.16        1.92%   1961.69 
32          8             0.31           8.31        3.76%   3849.62 
64          8             0.62           8.62        7.25%   7420.29 
128         8             1.25           9.25       13.51%  13837.84
256         8             2.50          10.50       23.81%  24380.95
512         8             5.00          13.00       38.46%  39384.62
1024        8            10.00          18.00       55.56%  56888.89
2048        8            20.00          28.00       71.43%  73142.86
4096        8            40.00          48.00       83.33%  85333.33

The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to
~39MB/s, while merely increases (minimal) IO latency from 9.25ms to 13ms.

As for SSD, I find that Intel X25-M SSD desires large readahead size
even for sequential reads:

	rasize	1st run		2nd run
	----------------------------------
	  4k	123 MB/s	122 MB/s
	 16k  	153 MB/s	153 MB/s
	 32k	161 MB/s	162 MB/s
	 64k	167 MB/s	168 MB/s
	128k	197 MB/s	197 MB/s
	256k	217 MB/s	217 MB/s
	512k	238 MB/s	234 MB/s
	  1M	251 MB/s	248 MB/s
	  2M	259 MB/s	257 MB/s
   	  4M	269 MB/s	264 MB/s
	  8M	266 MB/s	266 MB/s

The two other impacts of an enlarged readahead size are

- memory footprint (caused by readahead miss)
	Sequential readahead hit ratio is pretty high regardless of max
	readahead size; the extra memory footprint is mainly caused by
	enlarged mmap read-around.
	I measured my desktop:
	- under Xwindow:
		128KB readahead hit ratio = 143MB/230MB = 62%
		512KB readahead hit ratio = 138MB/248MB = 55%
		  1MB readahead hit ratio = 130MB/253MB = 51%
	- under console: (seems more stable than the Xwindow data)
		128KB readahead hit ratio = 30MB/56MB   = 53%
		  1MB readahead hit ratio = 30MB/59MB   = 51%
	So the impact to memory footprint looks acceptable.

- readahead thrashing
	It will now cost a 1MB readahead buffer per stream.  Memory-tight
	systems typically do not run multiple streams; but if they do,
	it should help I/O performance as long as we can avoid
	thrashing, which can be achieved with the following patches.

I also boot the system into the console with different readahead sizes,
and find that both io_count and readahead_hit_ratio are reduced by ~10%
when increasing readahead_size from 128k to 512k. I guess typical desktop
users would prefer the reduced IO count (for fast boot) at the cost of a
dozen MB of memory.

readahead_size	io_count   avg_io_pages   total_readahead_pages	 readahead_hit_ratio
            4k      6765              1    6765			 -
          128k      1077              8    8616			 78.5%
          512k       897             11    9867			 68.6%
         1024k       867             12   10404			 65.0%
total_readahead_pages = io_count * avg_io_pages


====== Remarks by Christian Ehrhardt ======

- 512 is by far superior to 128 for sequential reads
- improvements of up to +35% with iozone sequential reads, scaling from 1
  to 64 parallel processes
- readahead sizes larger than 512 revealed to not be "more useful", only
  increasing the chance of thrashing on low-memory systems


====== Benchmarks by Vivek Goyal ======

I have two paths to the HP EVA and a multipath device setup (dm-3).
I run an increasing number of sequential readers. The file system is ext3
and the file size is 1G.
I have run the tests 3 times (3 sets) and taken the average.

Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
======================================================================
                    2.6.33-rc5                2.6.33-rc5-readahead
job   Set NR  ReadBW(KB/s)   MaxClat(us)    ReadBW(KB/s)   MaxClat(us)
---   --- --  ------------   -----------    ------------   -----------
bsr   3   1   141768         130965         190302         97937.3    
bsr   3   2   131979         135402         185636         223286     
bsr   3   4   132351         420733         185986         363658     
bsr   3   8   133152         455434         184352         428478     
bsr   3   16  130316         674499         185646         594311     

I ran the same test on a different piece of hardware: a few SATA disks
(5-6) in a striped configuration behind a hardware RAID controller.

Workload=bsr      iosched=cfq     Filesz=1G   bs=32K
======================================================================
                    2.6.33-rc5                2.6.33-rc5-readahead
job   Set NR  ReadBW(KB/s)   MaxClat(us)    ReadBW(KB/s)   MaxClat(us)    
---   --- --  ------------   -----------    ------------   -----------    
bsr   3   1   147569         14369.7        160191         22752          
bsr   3   2   124716         243932         149343         184698         
bsr   3   4   123451         327665         147183         430875         
bsr   3   8   122486         455102         144568         484045         
bsr   3   16  117645         1.03957e+06    137485         1.06257e+06    


CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Paul Gortmaker <paul.gortmaker@windriver.com>
CC: Matt Mackall <mpm@selenic.com>
CC: David Woodhouse <dwmw2@infradead.org>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by:  Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/mm.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/include/linux/mm.h	2010-02-24 10:44:26.000000000 +0800
+++ linux/include/linux/mm.h	2010-02-24 10:44:41.000000000 +0800
@@ -1186,8 +1186,8 @@ int write_one_page(struct page *page, in
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD	128	/* kbytes */
-#define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
+#define VM_MAX_READAHEAD	512	/* kbytes */
+#define VM_MIN_READAHEAD	32	/* kbytes (includes current page) */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 04/15] readahead: make default readahead size a kernel parameter
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Ankit Jain, Dave Chinner, Christian Ehrhardt,
	Nikanth Karthikesan, Wu Fengguang, Chris Mason, Peter Zijlstra,
	Clemens Ladisch, Olivier Galibert, Vivek Goyal, Matt Mackall,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-kernel-parameter.patch --]
[-- Type: text/plain, Size: 3120 bytes --]

From: Nikanth Karthikesan <knikanth@suse.de>

Add a new kernel parameter "readahead", which allows the user to override
the static VM_MAX_READAHEAD=512kb default.
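
For illustration, some hypothetical boot command lines and the resulting
default_backing_dev_info.ra_pages values (assuming a 4KB PAGE_CACHE_SIZE;
the exact numbers follow from the parsing code below):

	readahead=2M	=> ra_pages = 512	(2MB default readahead)
	readahead=64k	=> ra_pages = 16	(64KB default readahead)
	readahead=0	=> ra_pages = 0		(no readahead)
	readahead=256M	=> clamped to 128M	(ra_pages = 32768)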

CC: Ankit Jain <radical@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/kernel-parameters.txt |    4 ++++
 block/blk-core.c                    |    3 +--
 fs/fuse/inode.c                     |    2 +-
 mm/readahead.c                      |   22 ++++++++++++++++++++++
 4 files changed, 28 insertions(+), 3 deletions(-)

--- linux.orig/Documentation/kernel-parameters.txt	2010-02-24 10:44:26.000000000 +0800
+++ linux/Documentation/kernel-parameters.txt	2010-02-24 10:44:42.000000000 +0800
@@ -2200,6 +2200,10 @@ and is between 256 and 4096 characters. 
 			Run specified binary instead of /init from the ramdisk,
 			used for early userspace startup. See initrd.
 
+	readahead=nn[KM]
+			Default max readahead size for block devices.
+			Range: 0; 4k - 128m
+
 	reboot=		[BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c
--- linux.orig/block/blk-core.c	2010-02-24 10:44:26.000000000 +0800
+++ linux/block/blk-core.c	2010-02-24 10:44:42.000000000 +0800
@@ -498,8 +498,7 @@ struct request_queue *blk_alloc_queue_no
 
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
-	q->backing_dev_info.ra_pages =
-			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	q->backing_dev_info.ra_pages = default_backing_dev_info.ra_pages;
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	q->backing_dev_info.name = "block";
--- linux.orig/fs/fuse/inode.c	2010-02-24 10:44:26.000000000 +0800
+++ linux/fs/fuse/inode.c	2010-02-24 10:44:42.000000000 +0800
@@ -870,7 +870,7 @@ static int fuse_bdi_init(struct fuse_con
 	int err;
 
 	fc->bdi.name = "fuse";
-	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	fc->bdi.ra_pages = default_backing_dev_info.ra_pages;
 	fc->bdi.unplug_io_fn = default_unplug_io_fn;
 	/* fuse does it's own writeback accounting */
 	fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
--- linux.orig/mm/readahead.c	2010-02-24 10:44:40.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:42.000000000 +0800
@@ -19,6 +19,28 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+static int __init config_readahead_size(char *str)
+{
+	unsigned long bytes;
+
+	if (!str)
+		return -EINVAL;
+	bytes = memparse(str, &str);
+	if (*str != '\0')
+		return -EINVAL;
+
+	if (bytes) {
+		if (bytes < PAGE_CACHE_SIZE)	/* missed 'k'/'m' suffixes? */
+			return -EINVAL;
+		if (bytes > 128 << 20)		/* limit to 128MB */
+			bytes = 128 << 20;
+	}
+
+	default_backing_dev_info.ra_pages = bytes / PAGE_CACHE_SIZE;
+	return 0;
+}
+early_param("readahead", config_readahead_size);
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 05/15] readahead: limit readahead size for small memory systems
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Matt Mackall, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Nick Piggin, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-small-memory-limit.patch --]
[-- Type: text/plain, Size: 2687 bytes --]

When lifting the default readahead size from 128KB to 512KB,
make sure it won't add memory pressure on small-memory systems.

For read-ahead, the memory pressure is mainly readahead buffers consumed
by too many concurrent streams. The context readahead can adapt the
readahead size to the thrashing threshold well, so in principle we don't
need to adapt the default _max_ read-ahead size to memory pressure.

For read-around, the memory pressure mainly comes from read-around misses
on executables/libraries, which could be reduced by scaling down the
read-around size on fast "reclaim passes".

This patch presents a straightforward solution: limit the default
readahead size in proportion to available system memory, i.e.
                512MB mem => 512KB readahead size
                128MB mem => 128KB readahead size
                 32MB mem =>  32KB readahead size (minimal)

Strictly speaking, only the read-around size has to be limited.  However we
don't bother to separate the read-around size from the read-ahead size for now.
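
For reference, a minimal userspace sketch (not kernel code) of the scaling
rule above; it reimplements roundup_pow_of_two() locally and assumes 4KB
pages plus the 512KB default cap:

/* Sketch: readahead limit = roundup_pow_of_two(totalram_pages / 1024)
 * pages, never above the 512KB default and never below the 32KB minimum. */
#include <stdio.h>

static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long r = 1;

	while (r < n)
		r <<= 1;
	return r;
}

int main(void)
{
	const unsigned long mem_mb[] = { 32, 64, 128, 512, 2048 };
	int i;

	for (i = 0; i < (int)(sizeof(mem_mb) / sizeof(mem_mb[0])); i++) {
		unsigned long pages = mem_mb[i] << 8;	/* 4KB pages */
		unsigned long ra = roundup_pow_of_two(pages / 1024);

		if (ra > 128)		/* 512KB default cap */
			ra = 128;
		if (ra < 8)		/* 32KB minimum */
			ra = 8;
		printf("%5luMB mem => %4luKB readahead\n", mem_mb[i], ra * 4);
	}
	return 0;
}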

CC: Matt Mackall <mpm@selenic.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

--- linux.orig/mm/readahead.c	2010-02-24 10:44:42.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:42.000000000 +0800
@@ -19,6 +19,10 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+#define MIN_READAHEAD_PAGES DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
+
+static int __initdata user_defined_readahead_size;
+
 static int __init config_readahead_size(char *str)
 {
 	unsigned long bytes;
@@ -36,11 +40,33 @@ static int __init config_readahead_size(
 			bytes = 128 << 20;
 	}
 
+	user_defined_readahead_size = 1;
 	default_backing_dev_info.ra_pages = bytes / PAGE_CACHE_SIZE;
 	return 0;
 }
 early_param("readahead", config_readahead_size);
 
+static int __init check_readahead_size(void)
+{
+	/*
+	 * Scale down default readahead size for small memory systems.
+	 * For example, a 64MB box will do 64KB read-ahead/read-around
+	 * instead of the default 512KB.
+	 *
+	 * Note that the default readahead size will also be scaled down
+	 * for small devices in add_disk().
+	 */
+	if (!user_defined_readahead_size) {
+		unsigned long max = roundup_pow_of_two(totalram_pages / 1024);
+		if (default_backing_dev_info.ra_pages > max)
+		    default_backing_dev_info.ra_pages = max;
+		if (default_backing_dev_info.ra_pages < MIN_READAHEAD_PAGES)
+		    default_backing_dev_info.ra_pages = MIN_READAHEAD_PAGES;
+	}
+	return 0;
+}
+fs_initcall(check_readahead_size);
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 06/15] readahead: replace ra->mmap_miss with ra->ra_flags
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Andi Kleen, Steven Whitehouse,
	Wu Fengguang, Chris Mason, Peter Zijlstra, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Christian Ehrhardt, Matt Mackall,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-flags.patch --]
[-- Type: text/plain, Size: 3066 bytes --]

Introduce a readahead flags field and embed the existing mmap_miss in it
(mainly to save space).

It also changes the mmap_miss upper bound from INT_MAX to 4095 (the
12-bit READAHEAD_MMAP_MISS mask). This is to help adapt properly to
changing mmap access patterns.

It is possible to lose flag updates in race conditions; however, the
impact should be limited.  For the race to happen, two threads sharing
the same file descriptor must be in page fault or readahead at the
same time.

Note that concurrent page faults have always been racy in this respect.

And if the race ever happens, we'll lose one mmap_miss++ or mmap_miss--,
which may change some concrete readahead behavior but won't really
impact overall I/O performance.
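
For illustration, a standalone userspace sketch (not kernel code) of the
low-bit packing used by the helpers below; the counter saturates at
READAHEAD_MMAP_MISS instead of spilling into the higher flag bits:

/* Sketch: mmap_miss lives in the low 12 bits of ra_flags and saturates
 * at 4095, so the flag bits above it are never corrupted by overflow. */
#include <stdio.h>

#define READAHEAD_MMAP_MISS	0x00000fff

static unsigned int mmap_miss_inc(unsigned int *ra_flags)
{
	unsigned int miss = *ra_flags & READAHEAD_MMAP_MISS;

	if (miss < READAHEAD_MMAP_MISS) {
		miss++;
		*ra_flags = miss | (*ra_flags & ~READAHEAD_MMAP_MISS);
	}
	return miss;
}

int main(void)
{
	unsigned int ra_flags = 0x10000000;	/* pretend one flag bit is set */
	int i;

	for (i = 0; i < 5000; i++)
		mmap_miss_inc(&ra_flags);

	/* prints: mmap_miss = 4095, other flags = 0x10000000 */
	printf("mmap_miss = %u, other flags = 0x%08x\n",
	       ra_flags & READAHEAD_MMAP_MISS,
	       ra_flags & ~READAHEAD_MMAP_MISS);
	return 0;
}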

CC: Nick Piggin <npiggin@suse.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |   30 +++++++++++++++++++++++++++++-
 mm/filemap.c       |    7 ++-----
 2 files changed, 31 insertions(+), 6 deletions(-)

--- linux.orig/include/linux/fs.h	2010-02-24 10:44:30.000000000 +0800
+++ linux/include/linux/fs.h	2010-02-24 10:44:43.000000000 +0800
@@ -889,10 +889,38 @@ struct file_ra_state {
 					   there are only # of pages ahead */
 
 	unsigned int ra_pages;		/* Maximum readahead window */
-	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
+	unsigned int ra_flags;
 	loff_t prev_pos;		/* Cache last read() position */
 };
 
+/* ra_flags bits */
+#define	READAHEAD_MMAP_MISS	0x00000fff /* cache misses for mmap access */
+
+/*
+ * Don't do ra_flags++ directly to avoid possible overflow:
+ * the ra fields can be accessed concurrently in a racy way.
+ */
+static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
+{
+	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+	if (miss < READAHEAD_MMAP_MISS) {
+		miss++;
+		ra->ra_flags = miss | (ra->ra_flags &~ READAHEAD_MMAP_MISS);
+	}
+	return miss;
+}
+
+static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
+{
+	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+	if (miss) {
+		miss--;
+		ra->ra_flags = miss | (ra->ra_flags &~ READAHEAD_MMAP_MISS);
+	}
+}
+
 /*
  * Check if @index falls in the readahead windows.
  */
--- linux.orig/mm/filemap.c	2010-02-24 10:44:25.000000000 +0800
+++ linux/mm/filemap.c	2010-02-24 10:44:43.000000000 +0800
@@ -1418,14 +1418,12 @@ static void do_sync_mmap_readahead(struc
 		return;
 	}
 
-	if (ra->mmap_miss < INT_MAX)
-		ra->mmap_miss++;
 
 	/*
 	 * Do we miss much more than hit in this file? If so,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
-	if (ra->mmap_miss > MMAP_LOTSAMISS)
+	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
 		return;
 
 	/*
@@ -1455,8 +1453,7 @@ static void do_async_mmap_readahead(stru
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
 		return;
-	if (ra->mmap_miss > 0)
-		ra->mmap_miss--;
+	ra_mmap_miss_dec(ra);
 	if (PageReadahead(page))
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 07/15] readahead: thrashing safe context readahead
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Wu Fengguang, Chris Mason, Peter Zijlstra,
	Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-thrashing-safe-mode.patch --]
[-- Type: text/plain, Size: 13775 bytes --]

Introduce a more complete version of context readahead, which is a
full-fledged readahead algorithm by itself. It replaces some of the
existing cases.

- oversize read
  no behavior change, except that in thrashed mode async_size will be 0
- random read
  no behavior change; implies some different internal handling.
  The random read will now be recorded in file_ra_state, which means that
  in an intermixed sequential+random pattern, the sequential part's state
  will be flushed by the random ones, and hence will be serviced by the
  context readahead instead of the stateful one. It also means that the
  first readahead for a sequential read in the middle of a file will be
  started by the stateful readahead, instead of by the sequential cache
  miss.
- sequential cache miss
  better
  When walking out of a cached page segment, the readahead size will be
  fully restored immediately instead of ramping up from the initial size.
- hit readahead marker without valid state
  better in rare cases; costs more radix tree lookups, but won't be a
  problem with the optimized radix_tree_prev_hole().  The added radix
  tree scan for history pages is to calculate the thrashing-safe
  readahead size and the adaptive async size.

The algorithm first looks ahead to find the start point of the next
read-ahead, then looks backward in the page cache to get an estimate
of the thrashing threshold.

It is able to automatically adapt to the thrashing threshold in a smooth
workload.  The estimation theory can be illustrated with the figure below:

   chunk A           chunk B                      chunk C                 head

   l01 l11           l12   l21                    l22
| |-->|-->|       |------>|-->|                |------>|
| +-------+       +-----------+                +-------------+               |
| |   #   |       |       #   |                |       #     |               |
| +-------+       +-----------+                +-------------+               |
| |<==============|<===========================|<============================|
        L0                     L1                            L2

 Let f(l) = L be a map from
     l: the number of pages read by the stream
 to
     L: the number of pages pushed into inactive_list in the meantime
 then
     f(l01) <= L0
     f(l11 + l12) = L1
     f(l21 + l22) = L2
     ...
     f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
                        <= Length(inactive_list) = f(thrashing-threshold)

So the count of continuous history pages left in inactive_list is always a
lower-bound estimate of the true thrashing threshold. Given a stable
workload, the readahead size will keep ramping up and then stabilize in the
range

	(thrashing_threshold/2, thrashing_threshold)

This is good because it is in fact bad to always reach thrashing_threshold:
that would not only be more susceptible to fluctuations, but would also
impose eviction pressure on the cached pages.
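
As a hypothetical walk-through of the sizing in the new ondemand_readahead()
below (all numbers invented for illustration): suppose max = 128 pages
(512KB), the backward scan finds size = 100 contiguous history pages, the
stream is well past the start of the file (so the size is not doubled), and
start - offset = 12 pages are already cached ahead.  Then the new window is
100 - 12 = 88 pages, clamped to [MIN_READAHEAD_PAGES, max] = 88, and
async_size = min(88, 1 + 88/READAHEAD_ASYNC_RATIO) = 12 with
READAHEAD_ASYNC_RATIO = 8.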

To demonstrate the thrashing safety, I ran 300 streams of 200KB/s each
with mem=128M.

Only 2031/61325 = 3.3% of the readahead windows are thrashed (due to
workload fluctuation):

# cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io       size async_size    io_size
initial            20          9          4         20         20         12         73         37         35
subsequent          3          3          0          1          0          1          8          8          1
context         61325          1       5479      61325       6788          5         14          2         13
thrash           2031          0       1222       2031       2031          0          9          0          6
around            235         90        142        235        235        235         60          0         19
fadvise             0          0          0          0          0          0          0          0          0
random            223        133          0         91         91          1          1          0          1
all             63837        236       6847      63703       9165          0         14          2         13

And the readahead inside a single stream is working as expected:

# grep streams-3162 /debug/tracing/trace
         streams-3162  [000]  8602.455953: readahead: readahead-context(dev=0:2, ino=0, req=287352+1, ra=287354+10-2, async=1) = 10
         streams-3162  [000]  8602.907873: readahead: readahead-context(dev=0:2, ino=0, req=287362+1, ra=287364+20-3, async=1) = 20
         streams-3162  [000]  8604.027879: readahead: readahead-context(dev=0:2, ino=0, req=287381+1, ra=287384+14-2, async=1) = 14
         streams-3162  [000]  8604.754722: readahead: readahead-context(dev=0:2, ino=0, req=287396+1, ra=287398+10-2, async=1) = 10
         streams-3162  [000]  8605.191228: readahead: readahead-context(dev=0:2, ino=0, req=287406+1, ra=287408+18-3, async=1) = 18
         streams-3162  [000]  8606.831895: readahead: readahead-context(dev=0:2, ino=0, req=287423+1, ra=287426+12-2, async=1) = 12
         streams-3162  [000]  8606.919614: readahead: readahead-thrash(dev=0:2, ino=0, req=287425+1, ra=287425+8-0, async=0) = 1
         streams-3162  [000]  8607.545016: readahead: readahead-context(dev=0:2, ino=0, req=287436+1, ra=287438+9-2, async=1) = 9
         streams-3162  [000]  8607.960039: readahead: readahead-context(dev=0:2, ino=0, req=287445+1, ra=287447+18-3, async=1) = 18
         streams-3162  [000]  8608.790973: readahead: readahead-context(dev=0:2, ino=0, req=287462+1, ra=287465+21-3, async=1) = 21
         streams-3162  [000]  8609.763138: readahead: readahead-context(dev=0:2, ino=0, req=287483+1, ra=287486+15-2, async=1) = 15
         streams-3162  [000]  8611.467401: readahead: readahead-context(dev=0:2, ino=0, req=287499+1, ra=287501+11-2, async=1) = 11
         streams-3162  [000]  8642.512413: readahead: readahead-context(dev=0:2, ino=0, req=288053+1, ra=288056+10-2, async=1) = 10
         streams-3162  [000]  8643.246618: readahead: readahead-context(dev=0:2, ino=0, req=288064+1, ra=288066+22-3, async=1) = 22
         streams-3162  [000]  8644.278613: readahead: readahead-context(dev=0:2, ino=0, req=288085+1, ra=288088+16-3, async=1) = 16
         streams-3162  [000]  8644.395782: readahead: readahead-context(dev=0:2, ino=0, req=288087+1, ra=288087+21-3, async=0) = 5
         streams-3162  [000]  8645.109918: readahead: readahead-context(dev=0:2, ino=0, req=288101+1, ra=288108+8-1, async=1) = 8
         streams-3162  [000]  8645.285078: readahead: readahead-context(dev=0:2, ino=0, req=288105+1, ra=288116+8-1, async=1) = 8
         streams-3162  [000]  8645.731794: readahead: readahead-context(dev=0:2, ino=0, req=288115+1, ra=288122+14-2, async=1) = 13
         streams-3162  [000]  8646.114250: readahead: readahead-context(dev=0:2, ino=0, req=288123+1, ra=288136+8-1, async=1) = 8
         streams-3162  [000]  8646.626320: readahead: readahead-context(dev=0:2, ino=0, req=288134+1, ra=288144+16-3, async=1) = 16
         streams-3162  [000]  8647.035721: readahead: readahead-context(dev=0:2, ino=0, req=288143+1, ra=288160+10-2, async=1) = 10
         streams-3162  [000]  8647.693082: readahead: readahead-context(dev=0:2, ino=0, req=288157+1, ra=288165+12-2, async=1) = 8
         streams-3162  [000]  8648.221368: readahead: readahead-context(dev=0:2, ino=0, req=288168+1, ra=288177+15-2, async=1) = 15
         streams-3162  [000]  8649.280800: readahead: readahead-context(dev=0:2, ino=0, req=288190+1, ra=288192+23-3, async=1) = 23
	 [...]

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |    1 
 mm/readahead.c     |  155 ++++++++++++++++++++++++-------------------
 2 files changed, 88 insertions(+), 68 deletions(-)

--- linux.orig/mm/readahead.c	2010-02-24 10:44:42.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:44.000000000 +0800
@@ -68,6 +68,11 @@ static int __init check_readahead_size(v
 fs_initcall(check_readahead_size);
 
 /*
+ * Set async size to 1/# of the thrashing threshold.
+ */
+#define READAHEAD_ASYNC_RATIO	8
+
+/*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
  */
@@ -441,39 +446,16 @@ static pgoff_t count_history_pages(struc
 }
 
 /*
- * page cache context based read-ahead
+ * Is @index recently readahead but not yet read by application?
+ * The low boundary is permissively estimated.
  */
-static int try_context_readahead(struct address_space *mapping,
-				 struct file_ra_state *ra,
-				 pgoff_t offset,
-				 unsigned long req_size,
-				 unsigned long max)
+static bool ra_thrashed(struct file_ra_state *ra, pgoff_t index)
 {
-	pgoff_t size;
-
-	size = count_history_pages(mapping, ra, offset, max);
-
-	/*
-	 * no history pages:
-	 * it could be a random read
-	 */
-	if (!size)
-		return 0;
-
-	/*
-	 * starts from beginning of file:
-	 * it is a strong indication of long-run stream (or whole-file-read)
-	 */
-	if (size >= offset)
-		size *= 2;
-
-	ra->start = offset;
-	ra->size = get_init_ra_size(size + req_size, max);
-	ra->async_size = ra->size;
-
-	return 1;
+	return (index >= ra->start - ra->size &&
+		index <  ra->start + ra->size);
 }
 
+
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
@@ -484,12 +466,26 @@ ondemand_readahead(struct address_space 
 		   unsigned long req_size)
 {
 	unsigned long max = max_sane_readahead(ra->ra_pages);
+	unsigned int size;
+	pgoff_t start;
 
 	/*
 	 * start of file
 	 */
-	if (!offset)
-		goto initial_readahead;
+	if (!offset) {
+		ra->start = offset;
+		ra->size = get_init_ra_size(req_size, max);
+		ra->async_size = ra->size > req_size ?
+				 ra->size - req_size : ra->size;
+		goto readit;
+	}
+
+	/*
+	 * Context readahead is thrashing safe, and can adapt to near the
+	 * thrashing threshold given a stable workload.
+	 */
+	if (ra->ra_flags & READAHEAD_THRASHED)
+		goto context_readahead;
 
 	/*
 	 * It's the expected callback offset, assume sequential access.
@@ -504,58 +500,81 @@ ondemand_readahead(struct address_space 
 	}
 
 	/*
-	 * Hit a marked page without valid readahead state.
-	 * E.g. interleaved reads.
-	 * Query the pagecache for async_size, which normally equals to
-	 * readahead size. Ramp it up and use it as the new readahead size.
+	 * oversize read, no need to query page cache
 	 */
-	if (hit_readahead_marker) {
-		pgoff_t start;
+	if (req_size > max && !hit_readahead_marker) {
+		ra->start = offset;
+		ra->size = max;
+		ra->async_size = max;
+		goto readit;
+	}
 
+	/*
+	 * page cache context based read-ahead
+	 *
+	 *     ==========================_____________..............
+	 *                          [ current window ]
+	 *                               ^offset
+	 * 1)                            |---- A ---->[start
+	 * 2) |<----------- H -----------|
+	 * 3)                            |----------- H ----------->]end
+	 *                                            [ new window ]
+	 *    [=] cached,visited [_] cached,to-be-visited [.] not cached
+	 *
+	 * 1) A = pages ahead = previous async_size
+	 * 2) H = history pages = thrashing safe size
+	 * 3) H - A = new readahead size
+	 */
+context_readahead:
+	if (hit_readahead_marker) {
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = radix_tree_next_hole(&mapping->page_tree,
+					     offset + 1, max);
 		rcu_read_unlock();
-
+		/*
+		 * there are enough pages ahead: no readahead
+		 */
 		if (!start || start - offset > max)
 			return 0;
+	} else
+		start = offset;
 
+	size = count_history_pages(mapping, ra, offset,
+				   READAHEAD_ASYNC_RATIO * max);
+	/*
+	 * no history pages cached, could be
+	 * 	- a random read
+	 * 	- a thrashed sequential read
+	 */
+	if (!size && !hit_readahead_marker) {
+		if (!ra_thrashed(ra, offset)) {
+			ra->size = min(req_size, max);
+		} else {
+			retain_inactive_pages(mapping, offset, min(2 * max,
+						ra->start + ra->size - offset));
+			ra->size = max_t(int, ra->size/2, MIN_READAHEAD_PAGES);
+			ra->ra_flags |= READAHEAD_THRASHED;
+		}
+		ra->async_size = 0;
 		ra->start = start;
-		ra->size = start - offset;	/* old async_size */
-		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max);
-		ra->async_size = ra->size;
 		goto readit;
 	}
-
 	/*
-	 * oversize read
+	 * history pages start from beginning of file:
+	 * it is a strong indication of long-run stream (or whole-file reads)
 	 */
-	if (req_size > max)
-		goto initial_readahead;
-
-	/*
-	 * sequential cache miss
-	 */
-	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
-		goto initial_readahead;
-
-	/*
-	 * Query the page cache and look for the traces(cached history pages)
-	 * that a sequential stream would leave behind.
-	 */
-	if (try_context_readahead(mapping, ra, offset, req_size, max))
-		goto readit;
-
+	if (size >= offset)
+		size *= 2;
 	/*
-	 * standalone, small random read
-	 * Read as is, and do not pollute the readahead state.
+	 * pages to readahead are already cached
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	if (size <= start - offset)
+		return 0;
 
-initial_readahead:
-	ra->start = offset;
-	ra->size = get_init_ra_size(req_size, max);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	size -= start - offset;
+	ra->start = start;
+	ra->size = clamp_t(unsigned int, size, MIN_READAHEAD_PAGES, max);
+	ra->async_size = min(ra->size, 1 + size / READAHEAD_ASYNC_RATIO);
 
 readit:
 	/*
--- linux.orig/include/linux/fs.h	2010-02-24 10:44:43.000000000 +0800
+++ linux/include/linux/fs.h	2010-02-24 10:44:44.000000000 +0800
@@ -895,6 +895,7 @@ struct file_ra_state {
 
 /* ra_flags bits */
 #define	READAHEAD_MMAP_MISS	0x00000fff /* cache misses for mmap access */
+#define READAHEAD_THRASHED	0x10000000
 
 /*
  * Don't do ra_flags++ directly to avoid possible overflow:



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 07/15] readahead: thrashing safe context readahead
@ 2010-02-24  3:10   ` Wu Fengguang
  0 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Wu Fengguang, Chris Mason, Peter Zijlstra,
	Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-thrashing-safe-mode.patch --]
[-- Type: text/plain, Size: 14000 bytes --]

Introduce a more complete version of context readahead, which is a
full-fledged readahead algorithm by itself. It replaces some of the
existing cases.

- oversize read
  no behavior change, except that in thrashed mode async_size will be 0
- random read
  no behavior change; implies some different internal handling
  Random reads will now be recorded in file_ra_state, which means that
  in an intermixed sequential+random pattern, the sequential part's
  state will be flushed by the random reads and hence be serviced by the
  context readahead instead of the stateful one. It also means that the
  first readahead for a sequential read starting in the middle of a file
  will be started by the stateful readahead rather than by the old
  sequential cache miss heuristic.
- sequential cache miss
  better
  When walking out of a cached page segment, the readahead size will
  be fully restored immediately instead of ramping up from the initial size.
- hit readahead marker without valid state
  better in rare cases; costs more radix tree lookups, but that won't be a
  problem with the optimized radix_tree_prev_hole().  The added radix tree
  scan for history pages is used to calculate the thrashing safe readahead
  size and the adaptive async size.

The algorithm first looks ahead to find the start point of the next
read-ahead, then looks backward in the page cache to get an estimate
of the thrashing threshold.

It is able to automatically adapt to the thrashing threshold under a
smooth workload.  The estimation theory can be illustrated with the
following figure:

   chunk A           chunk B                      chunk C                 head

   l01 l11           l12   l21                    l22
| |-->|-->|       |------>|-->|                |------>|
| +-------+       +-----------+                +-------------+               |
| |   #   |       |       #   |                |       #     |               |
| +-------+       +-----------+                +-------------+               |
| |<==============|<===========================|<============================|
        L0                     L1                            L2

 Let f(l) = L be a map from
     l: the number of pages read by the stream
 to
     L: the number of pages pushed into inactive_list in the meantime
 then
     f(l01) <= L0
     f(l11 + l12) = L1
     f(l21 + l22) = L2
     ...
     f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
                        <= Length(inactive_list) = f(thrashing-threshold)

So the count of contiguous history pages left in the inactive_list is
always a lower bound on the true thrashing threshold. Given a stable
workload, the readahead size will keep ramping up and then stabilize in
the range

	(thrashing_threshold/2, thrashing_threshold)

This is good because it is actually bad to always run right at the
thrashing_threshold: that would not only be more susceptible to
fluctuations, but would also impose eviction pressure on the cached pages.
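
As an aside, the bound can be checked with a minimal user-space model
(illustration only, not kernel code; the threshold T and the plain FIFO
eviction are simplifying assumptions).  A reader pushes every page it
touches into a FIFO "inactive list" of capacity T, and the count of
contiguous cached history pages behind the current position never
exceeds T:

#include <assert.h>
#include <stdio.h>

#define NPAGES	1000
#define T	100	/* assumed thrashing threshold = inactive list length */

static int cached[NPAGES];	/* 1 = page still in the page cache */
static int fifo[NPAGES];	/* eviction order of cached pages */
static int fifo_head, fifo_tail, nr_cached;

static void read_page(int p)
{
	if (cached[p])
		return;
	cached[p] = 1;
	fifo[fifo_tail++] = p;
	if (++nr_cached > T) {		/* evict the oldest cached page */
		cached[fifo[fifo_head++]] = 0;
		nr_cached--;
	}
}

/* analogue of count_history_pages(): contiguous cached pages ending at p */
static int history(int p)
{
	int n = 0;

	while (p >= 0 && cached[p--])
		n++;
	return n;
}

int main(void)
{
	int p;

	for (p = 0; p < NPAGES; p++) {
		read_page(p);
		assert(history(p) <= T);	/* estimate never exceeds the threshold */
	}
	printf("history(%d) = %d, T = %d\n", NPAGES - 1, history(NPAGES - 1), T);
	return 0;
}

With a single stream the estimate converges to T itself; with competing
streams fewer history pages survive in the list, so the estimate becomes
smaller but stays on the safe side.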

To demonstrate the thrashing safety, I ran 300 streams of 200KB/s each
with mem=128M.

Only 2031/61325 = 3.3% of the readahead windows were thrashed (due to
workload fluctuation):

# cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io       size async_size    io_size
initial            20          9          4         20         20         12         73         37         35
subsequent          3          3          0          1          0          1          8          8          1
context         61325          1       5479      61325       6788          5         14          2         13
thrash           2031          0       1222       2031       2031          0          9          0          6
around            235         90        142        235        235        235         60          0         19
fadvise             0          0          0          0          0          0          0          0          0
random            223        133          0         91         91          1          1          0          1
all             63837        236       6847      63703       9165          0         14          2         13

And the readahead inside a single stream is working as expected:

# grep streams-3162 /debug/tracing/trace
         streams-3162  [000]  8602.455953: readahead: readahead-context(dev=0:2, ino=0, req=287352+1, ra=287354+10-2, async=1) = 10
         streams-3162  [000]  8602.907873: readahead: readahead-context(dev=0:2, ino=0, req=287362+1, ra=287364+20-3, async=1) = 20
         streams-3162  [000]  8604.027879: readahead: readahead-context(dev=0:2, ino=0, req=287381+1, ra=287384+14-2, async=1) = 14
         streams-3162  [000]  8604.754722: readahead: readahead-context(dev=0:2, ino=0, req=287396+1, ra=287398+10-2, async=1) = 10
         streams-3162  [000]  8605.191228: readahead: readahead-context(dev=0:2, ino=0, req=287406+1, ra=287408+18-3, async=1) = 18
         streams-3162  [000]  8606.831895: readahead: readahead-context(dev=0:2, ino=0, req=287423+1, ra=287426+12-2, async=1) = 12
         streams-3162  [000]  8606.919614: readahead: readahead-thrash(dev=0:2, ino=0, req=287425+1, ra=287425+8-0, async=0) = 1
         streams-3162  [000]  8607.545016: readahead: readahead-context(dev=0:2, ino=0, req=287436+1, ra=287438+9-2, async=1) = 9
         streams-3162  [000]  8607.960039: readahead: readahead-context(dev=0:2, ino=0, req=287445+1, ra=287447+18-3, async=1) = 18
         streams-3162  [000]  8608.790973: readahead: readahead-context(dev=0:2, ino=0, req=287462+1, ra=287465+21-3, async=1) = 21
         streams-3162  [000]  8609.763138: readahead: readahead-context(dev=0:2, ino=0, req=287483+1, ra=287486+15-2, async=1) = 15
         streams-3162  [000]  8611.467401: readahead: readahead-context(dev=0:2, ino=0, req=287499+1, ra=287501+11-2, async=1) = 11
         streams-3162  [000]  8642.512413: readahead: readahead-context(dev=0:2, ino=0, req=288053+1, ra=288056+10-2, async=1) = 10
         streams-3162  [000]  8643.246618: readahead: readahead-context(dev=0:2, ino=0, req=288064+1, ra=288066+22-3, async=1) = 22
         streams-3162  [000]  8644.278613: readahead: readahead-context(dev=0:2, ino=0, req=288085+1, ra=288088+16-3, async=1) = 16
         streams-3162  [000]  8644.395782: readahead: readahead-context(dev=0:2, ino=0, req=288087+1, ra=288087+21-3, async=0) = 5
         streams-3162  [000]  8645.109918: readahead: readahead-context(dev=0:2, ino=0, req=288101+1, ra=288108+8-1, async=1) = 8
         streams-3162  [000]  8645.285078: readahead: readahead-context(dev=0:2, ino=0, req=288105+1, ra=288116+8-1, async=1) = 8
         streams-3162  [000]  8645.731794: readahead: readahead-context(dev=0:2, ino=0, req=288115+1, ra=288122+14-2, async=1) = 13
         streams-3162  [000]  8646.114250: readahead: readahead-context(dev=0:2, ino=0, req=288123+1, ra=288136+8-1, async=1) = 8
         streams-3162  [000]  8646.626320: readahead: readahead-context(dev=0:2, ino=0, req=288134+1, ra=288144+16-3, async=1) = 16
         streams-3162  [000]  8647.035721: readahead: readahead-context(dev=0:2, ino=0, req=288143+1, ra=288160+10-2, async=1) = 10
         streams-3162  [000]  8647.693082: readahead: readahead-context(dev=0:2, ino=0, req=288157+1, ra=288165+12-2, async=1) = 8
         streams-3162  [000]  8648.221368: readahead: readahead-context(dev=0:2, ino=0, req=288168+1, ra=288177+15-2, async=1) = 15
         streams-3162  [000]  8649.280800: readahead: readahead-context(dev=0:2, ino=0, req=288190+1, ra=288192+23-3, async=1) = 23
	 [...]

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |    1 
 mm/readahead.c     |  155 ++++++++++++++++++++++++-------------------
 2 files changed, 88 insertions(+), 68 deletions(-)

--- linux.orig/mm/readahead.c	2010-02-24 10:44:42.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:44.000000000 +0800
@@ -68,6 +68,11 @@ static int __init check_readahead_size(v
 fs_initcall(check_readahead_size);
 
 /*
+ * Set async size to 1/# of the thrashing threshold.
+ */
+#define READAHEAD_ASYNC_RATIO	8
+
+/*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
  */
@@ -441,39 +446,16 @@ static pgoff_t count_history_pages(struc
 }
 
 /*
- * page cache context based read-ahead
+ * Is @index recently readahead but not yet read by application?
+ * The low boundary is permissively estimated.
  */
-static int try_context_readahead(struct address_space *mapping,
-				 struct file_ra_state *ra,
-				 pgoff_t offset,
-				 unsigned long req_size,
-				 unsigned long max)
+static bool ra_thrashed(struct file_ra_state *ra, pgoff_t index)
 {
-	pgoff_t size;
-
-	size = count_history_pages(mapping, ra, offset, max);
-
-	/*
-	 * no history pages:
-	 * it could be a random read
-	 */
-	if (!size)
-		return 0;
-
-	/*
-	 * starts from beginning of file:
-	 * it is a strong indication of long-run stream (or whole-file-read)
-	 */
-	if (size >= offset)
-		size *= 2;
-
-	ra->start = offset;
-	ra->size = get_init_ra_size(size + req_size, max);
-	ra->async_size = ra->size;
-
-	return 1;
+	return (index >= ra->start - ra->size &&
+		index <  ra->start + ra->size);
 }
 
+
 /*
  * A minimal readahead algorithm for trivial sequential/random reads.
  */
@@ -484,12 +466,26 @@ ondemand_readahead(struct address_space 
 		   unsigned long req_size)
 {
 	unsigned long max = max_sane_readahead(ra->ra_pages);
+	unsigned int size;
+	pgoff_t start;
 
 	/*
 	 * start of file
 	 */
-	if (!offset)
-		goto initial_readahead;
+	if (!offset) {
+		ra->start = offset;
+		ra->size = get_init_ra_size(req_size, max);
+		ra->async_size = ra->size > req_size ?
+				 ra->size - req_size : ra->size;
+		goto readit;
+	}
+
+	/*
+	 * Context readahead is thrashing safe, and can adapt to near the
+	 * thrashing threshold given a stable workload.
+	 */
+	if (ra->ra_flags & READAHEAD_THRASHED)
+		goto context_readahead;
 
 	/*
 	 * It's the expected callback offset, assume sequential access.
@@ -504,58 +500,81 @@ ondemand_readahead(struct address_space 
 	}
 
 	/*
-	 * Hit a marked page without valid readahead state.
-	 * E.g. interleaved reads.
-	 * Query the pagecache for async_size, which normally equals to
-	 * readahead size. Ramp it up and use it as the new readahead size.
+	 * oversize read, no need to query page cache
 	 */
-	if (hit_readahead_marker) {
-		pgoff_t start;
+	if (req_size > max && !hit_readahead_marker) {
+		ra->start = offset;
+		ra->size = max;
+		ra->async_size = max;
+		goto readit;
+	}
 
+	/*
+	 * page cache context based read-ahead
+	 *
+	 *     ==========================_____________..............
+	 *                          [ current window ]
+	 *                               ^offset
+	 * 1)                            |---- A ---->[start
+	 * 2) |<----------- H -----------|
+	 * 3)                            |----------- H ----------->]end
+	 *                                            [ new window ]
+	 *    [=] cached,visited [_] cached,to-be-visited [.] not cached
+	 *
+	 * 1) A = pages ahead = previous async_size
+	 * 2) H = history pages = thrashing safe size
+	 * 3) H - A = new readahead size
+	 */
+context_readahead:
+	if (hit_readahead_marker) {
 		rcu_read_lock();
-		start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+		start = radix_tree_next_hole(&mapping->page_tree,
+					     offset + 1, max);
 		rcu_read_unlock();
-
+		/*
+		 * there are enough pages ahead: no readahead
+		 */
 		if (!start || start - offset > max)
 			return 0;
+	} else
+		start = offset;
 
+	size = count_history_pages(mapping, ra, offset,
+				   READAHEAD_ASYNC_RATIO * max);
+	/*
+	 * no history pages cached, could be
+	 * 	- a random read
+	 * 	- a thrashed sequential read
+	 */
+	if (!size && !hit_readahead_marker) {
+		if (!ra_thrashed(ra, offset)) {
+			ra->size = min(req_size, max);
+		} else {
+			retain_inactive_pages(mapping, offset, min(2 * max,
+						ra->start + ra->size - offset));
+			ra->size = max_t(int, ra->size/2, MIN_READAHEAD_PAGES);
+			ra->ra_flags |= READAHEAD_THRASHED;
+		}
+		ra->async_size = 0;
 		ra->start = start;
-		ra->size = start - offset;	/* old async_size */
-		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max);
-		ra->async_size = ra->size;
 		goto readit;
 	}
-
 	/*
-	 * oversize read
+	 * history pages start from beginning of file:
+	 * it is a strong indication of long-run stream (or whole-file reads)
 	 */
-	if (req_size > max)
-		goto initial_readahead;
-
-	/*
-	 * sequential cache miss
-	 */
-	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
-		goto initial_readahead;
-
-	/*
-	 * Query the page cache and look for the traces(cached history pages)
-	 * that a sequential stream would leave behind.
-	 */
-	if (try_context_readahead(mapping, ra, offset, req_size, max))
-		goto readit;
-
+	if (size >= offset)
+		size *= 2;
 	/*
-	 * standalone, small random read
-	 * Read as is, and do not pollute the readahead state.
+	 * pages to readahead are already cached
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	if (size <= start - offset)
+		return 0;
 
-initial_readahead:
-	ra->start = offset;
-	ra->size = get_init_ra_size(req_size, max);
-	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+	size -= start - offset;
+	ra->start = start;
+	ra->size = clamp_t(unsigned int, size, MIN_READAHEAD_PAGES, max);
+	ra->async_size = min(ra->size, 1 + size / READAHEAD_ASYNC_RATIO);
 
 readit:
 	/*
--- linux.orig/include/linux/fs.h	2010-02-24 10:44:43.000000000 +0800
+++ linux/include/linux/fs.h	2010-02-24 10:44:44.000000000 +0800
@@ -895,6 +895,7 @@ struct file_ra_state {
 
 /* ra_flags bits */
 #define	READAHEAD_MMAP_MISS	0x00000fff /* cache misses for mmap access */
+#define READAHEAD_THRASHED	0x10000000
 
 /*
  * Don't do ra_flags++ directly to avoid possible overflow:



^ permalink raw reply	[flat|nested] 94+ messages in thread


* [PATCH 08/15] readahead: record readahead patterns
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
	Chris Mason, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-tracepoints.patch --]
[-- Type: text/plain, Size: 6162 bytes --]

Record the readahead pattern in ra_flags. This information can then be
examined by users via the readahead tracing/stats interfaces.

Currently 7 patterns are defined:

      	pattern			readahead for
-----------------------------------------------------------
	RA_PATTERN_INITIAL	start-of-file/oversize read
	RA_PATTERN_SUBSEQUENT	trivial     sequential read
	RA_PATTERN_CONTEXT	interleaved sequential read
	RA_PATTERN_THRASH	thrashed    sequential read
	RA_PATTERN_MMAP_AROUND	mmap fault
	RA_PATTERN_FADVISE	posix_fadvise()
	RA_PATTERN_RANDOM	random read
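
The pattern is stored in a 4-bit field of ra_flags (READAHEAD_PATTERN,
bits 20-23) by the fs.h hunk below.  For reference, a stand-alone
user-space sketch of that packing -- it mirrors ra_set_pattern() and
ra_pattern() but is not the kernel code itself, and it omits the clamp
to RA_PATTERN_ALL that ra_pattern() applies:

#include <assert.h>
#include <stdio.h>

#define READAHEAD_PATTERN_SHIFT	20
#define READAHEAD_PATTERN	0x00f00000
#define READAHEAD_THRASHED	0x10000000

enum readahead_pattern {
	RA_PATTERN_INITIAL,
	RA_PATTERN_SUBSEQUENT,
	RA_PATTERN_CONTEXT,
	RA_PATTERN_THRASH,
	RA_PATTERN_MMAP_AROUND,
	RA_PATTERN_FADVISE,
	RA_PATTERN_RANDOM,
	RA_PATTERN_ALL,
	RA_PATTERN_MAX
};

static unsigned int set_pattern(unsigned int ra_flags, int pattern)
{
	return (ra_flags & ~READAHEAD_PATTERN) |
	       (pattern << READAHEAD_PATTERN_SHIFT);
}

static int get_pattern(unsigned int ra_flags)
{
	return (ra_flags & READAHEAD_PATTERN) >> READAHEAD_PATTERN_SHIFT;
}

int main(void)
{
	unsigned int flags = READAHEAD_THRASHED;	/* some other bit already set */
	int p;

	for (p = RA_PATTERN_INITIAL; p < RA_PATTERN_MAX; p++) {
		flags = set_pattern(flags, p);
		assert(get_pattern(flags) == p);
		assert(flags & READAHEAD_THRASHED);	/* other bits are preserved */
	}
	printf("pattern field round-trips for values 0..%d\n", RA_PATTERN_MAX - 1);
	return 0;
}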

CC: Ingo Molnar <mingo@elte.hu> 
CC: Jens Axboe <jens.axboe@oracle.com> 
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |   32 ++++++++++++++++++++++++++++++++
 include/linux/mm.h |    4 +++-
 mm/filemap.c       |    9 +++++++--
 mm/readahead.c     |   17 +++++++++++++----
 4 files changed, 55 insertions(+), 7 deletions(-)

--- linux.orig/include/linux/fs.h	2010-02-24 10:44:44.000000000 +0800
+++ linux/include/linux/fs.h	2010-02-24 10:44:45.000000000 +0800
@@ -894,8 +894,40 @@ struct file_ra_state {
 };
 
 /* ra_flags bits */
+#define READAHEAD_PATTERN_SHIFT	20
+#define READAHEAD_PATTERN	0x00f00000
 #define	READAHEAD_MMAP_MISS	0x00000fff /* cache misses for mmap access */
 #define READAHEAD_THRASHED	0x10000000
+#define	READAHEAD_MMAP		0x20000000
+
+/*
+ * Which policy makes decision to do the current read-ahead IO?
+ */
+enum readahead_pattern {
+	RA_PATTERN_INITIAL,
+	RA_PATTERN_SUBSEQUENT,
+	RA_PATTERN_CONTEXT,
+	RA_PATTERN_THRASH,
+	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_FADVISE,
+	RA_PATTERN_RANDOM,
+	RA_PATTERN_ALL,		/* for summary stats */
+	RA_PATTERN_MAX
+};
+
+static inline int ra_pattern(int ra_flags)
+{
+	int pattern = (ra_flags & READAHEAD_PATTERN)
+			       >> READAHEAD_PATTERN_SHIFT;
+
+	return min(pattern, RA_PATTERN_ALL);
+}
+
+static inline void ra_set_pattern(struct file_ra_state *ra, int pattern)
+{
+	ra->ra_flags = (ra->ra_flags & ~READAHEAD_PATTERN) |
+			    (pattern << READAHEAD_PATTERN_SHIFT);
+}
 
 /*
  * Don't do ra_flags++ directly to avoid possible overflow:
--- linux.orig/mm/readahead.c	2010-02-24 10:44:44.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:45.000000000 +0800
@@ -339,7 +339,10 @@ unsigned long max_sane_readahead(unsigne
  * Submit IO for the read-ahead request in file_ra_state.
  */
 unsigned long ra_submit(struct file_ra_state *ra,
-		       struct address_space *mapping, struct file *filp)
+			struct address_space *mapping,
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size)
 {
 	int actual;
 
@@ -473,6 +476,7 @@ ondemand_readahead(struct address_space 
 	 * start of file
 	 */
 	if (!offset) {
+		ra_set_pattern(ra, RA_PATTERN_INITIAL);
 		ra->start = offset;
 		ra->size = get_init_ra_size(req_size, max);
 		ra->async_size = ra->size > req_size ?
@@ -493,6 +497,7 @@ ondemand_readahead(struct address_space 
 	 */
 	if ((offset == (ra->start + ra->size - ra->async_size) ||
 	     offset == (ra->start + ra->size))) {
+		ra_set_pattern(ra, RA_PATTERN_SUBSEQUENT);
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max);
 		ra->async_size = ra->size;
@@ -503,6 +508,7 @@ ondemand_readahead(struct address_space 
 	 * oversize read, no need to query page cache
 	 */
 	if (req_size > max && !hit_readahead_marker) {
+		ra_set_pattern(ra, RA_PATTERN_INITIAL);
 		ra->start = offset;
 		ra->size = max;
 		ra->async_size = max;
@@ -548,8 +554,10 @@ context_readahead:
 	 */
 	if (!size && !hit_readahead_marker) {
 		if (!ra_thrashed(ra, offset)) {
+			ra_set_pattern(ra, RA_PATTERN_RANDOM);
 			ra->size = min(req_size, max);
 		} else {
+			ra_set_pattern(ra, RA_PATTERN_THRASH);
 			retain_inactive_pages(mapping, offset, min(2 * max,
 						ra->start + ra->size - offset));
 			ra->size = max_t(int, ra->size/2, MIN_READAHEAD_PAGES);
@@ -566,12 +574,13 @@ context_readahead:
 	if (size >= offset)
 		size *= 2;
 	/*
-	 * pages to readahead are already cached
+	 * Pages to readahead are already cached?
 	 */
 	if (size <= start - offset)
 		return 0;
-
 	size -= start - offset;
+
+	ra_set_pattern(ra, RA_PATTERN_CONTEXT);
 	ra->start = start;
 	ra->size = clamp_t(unsigned int, size, MIN_READAHEAD_PAGES, max);
 	ra->async_size = min(ra->size, 1 + size / READAHEAD_ASYNC_RATIO);
@@ -587,7 +596,7 @@ readit:
 		ra->size += ra->async_size;
 	}
 
-	return ra_submit(ra, mapping, filp);
+	return ra_submit(ra, mapping, filp, offset, req_size);
 }
 
 /**
--- linux.orig/include/linux/mm.h	2010-02-24 10:44:41.000000000 +0800
+++ linux/include/linux/mm.h	2010-02-24 10:44:45.000000000 +0800
@@ -1208,7 +1208,9 @@ void page_cache_async_readahead(struct a
 unsigned long max_sane_readahead(unsigned long nr);
 unsigned long ra_submit(struct file_ra_state *ra,
 			struct address_space *mapping,
-			struct file *filp);
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size);
 
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux.orig/mm/filemap.c	2010-02-24 10:44:43.000000000 +0800
+++ linux/mm/filemap.c	2010-02-24 10:44:45.000000000 +0800
@@ -1413,6 +1413,7 @@ static void do_sync_mmap_readahead(struc
 
 	if (VM_SequentialReadHint(vma) ||
 			offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
+		ra->ra_flags |= READAHEAD_MMAP;
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
 		return;
@@ -1431,10 +1432,12 @@ static void do_sync_mmap_readahead(struc
 	 */
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	if (ra_pages) {
+		ra->ra_flags |= READAHEAD_MMAP;
+		ra_set_pattern(ra, RA_PATTERN_MMAP_AROUND);
 		ra->start = max_t(long, 0, offset - ra_pages/2);
 		ra->size = ra_pages;
 		ra->async_size = 0;
-		ra_submit(ra, mapping, file);
+		ra_submit(ra, mapping, file, offset, 1);
 	}
 }
 
@@ -1454,9 +1457,11 @@ static void do_async_mmap_readahead(stru
 	if (VM_RandomReadHint(vma))
 		return;
 	ra_mmap_miss_dec(ra);
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		ra->ra_flags |= READAHEAD_MMAP;
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
 }
 
 /**



^ permalink raw reply	[flat|nested] 94+ messages in thread


* [PATCH 09/15] readahead: add tracing event
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Ingo Molnar, Steven Rostedt, Peter Zijlstra,
	Wu Fengguang, Chris Mason, Clemens Ladisch, Olivier Galibert,
	Vivek Goyal, Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 4221 bytes --]

Example output:

# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace  # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
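
To read these lines: req=offset+nr_requested is what the caller asked
for, ra=start+size-async_size is the readahead window that was set up,
async is 1 when the window starts beyond the requested offset, and the
trailing number is how many pages were actually submitted for IO.  A
small user-space sketch that decodes one such line, assuming the exact
TP_printk format added below:

#include <stdio.h>

int main(void)
{
	const char *line = "readahead-subsequent(dev=0:15, ino=100177, "
			   "req=4+2, ra=12+16-16, async=1) = 16";
	char pattern[16];
	unsigned int major, minor, size, async_size, async;
	unsigned long ino, offset, req_size, start;
	int actual;

	if (sscanf(line, "readahead-%15[^(](dev=%u:%u, ino=%lu, "
			 "req=%lu+%lu, ra=%lu+%u-%u, async=%u) = %d",
		   pattern, &major, &minor, &ino, &offset, &req_size,
		   &start, &size, &async_size, &async, &actual) == 11)
		printf("%s: window [%lu, %lu), async tail %u, submitted %d pages\n",
		       pattern, start, start + size, async_size, actual);
	return 0;
}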

CC: Ingo Molnar <mingo@elte.hu> 
CC: Jens Axboe <jens.axboe@oracle.com> 
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/readahead.h |   78 +++++++++++++++++++++++++++++
 mm/readahead.c                   |   11 ++++
 2 files changed, 89 insertions(+)

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/include/trace/events/readahead.h	2010-02-24 10:44:46.000000000 +0800
@@ -0,0 +1,78 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM readahead
+
+#if !defined(_TRACE_READAHEAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_READAHEAD_H
+
+#include <linux/tracepoint.h>
+
+#define show_pattern_name(val)						   \
+	__print_symbolic(val,						   \
+			{ RA_PATTERN_INITIAL,		"initial"	}, \
+			{ RA_PATTERN_SUBSEQUENT,	"subsequent"	}, \
+			{ RA_PATTERN_CONTEXT,		"context"	}, \
+			{ RA_PATTERN_THRASH,		"thrash"	}, \
+			{ RA_PATTERN_MMAP_AROUND,	"around"	}, \
+			{ RA_PATTERN_FADVISE,		"fadvise"	}, \
+			{ RA_PATTERN_RANDOM,		"random"	}, \
+			{ RA_PATTERN_ALL,		"all"		})
+
+/*
+ * Tracepoint for read-ahead decisions.
+ */
+TRACE_EVENT(readahead,
+	TP_PROTO(struct address_space *mapping,
+		 pgoff_t offset,
+		 unsigned long req_size,
+		 unsigned int ra_flags,
+		 pgoff_t start,
+		 unsigned int size,
+		 unsigned int async_size,
+		 unsigned int actual),
+
+	TP_ARGS(mapping, offset, req_size,
+		ra_flags, start, size, async_size, actual),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	pgoff_t,	offset		)
+		__field(	unsigned long,	req_size	)
+		__field(	unsigned int,	pattern		)
+		__field(	pgoff_t,	start		)
+		__field(	unsigned int,	size		)
+		__field(	unsigned int,	async_size	)
+		__field(	unsigned int,	actual		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= mapping->host->i_sb->s_dev;
+		__entry->ino		= mapping->host->i_ino;
+		__entry->pattern	= ra_pattern(ra_flags);
+		__entry->offset		= offset;
+		__entry->req_size	= req_size;
+		__entry->start		= start;
+		__entry->size		= size;
+		__entry->async_size	= async_size;
+		__entry->actual		= actual;
+	),
+
+	TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+		  "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+			show_pattern_name(__entry->pattern),
+			MAJOR(__entry->dev),
+			MINOR(__entry->dev),
+			__entry->ino,
+			__entry->offset,
+			__entry->req_size,
+			__entry->start,
+			__entry->size,
+			__entry->async_size,
+			__entry->start > __entry->offset,
+			__entry->actual)
+);
+
+#endif /* _TRACE_READAHEAD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux.orig/mm/readahead.c	2010-02-24 10:44:45.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:46.000000000 +0800
@@ -19,6 +19,9 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/readahead.h>
+
 #define MIN_READAHEAD_PAGES DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
 
 static int __initdata user_defined_readahead_size;
@@ -322,6 +325,11 @@ int force_page_cache_readahead(struct ad
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
 	}
+
+	trace_readahead(mapping, offset, nr_to_read,
+			RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+			offset, nr_to_read, 0, ret);
+
 	return ret;
 }
 
@@ -349,6 +357,9 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	trace_readahead(mapping, offset, req_size, ra->ra_flags,
+			ra->start, ra->size, ra->async_size, actual);
+
 	return actual;
 }
 



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 10/15] readahead: add /debug/readahead/stats
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
	Chris Mason, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 8329 bytes --]

Collect readahead stats when CONFIG_READAHEAD_STATS=y.

This is enabled by default because the added overhead is trivial:
two readahead_stats() calls per readahead.

Example output:
(taken from a freshly booted NFS-ROOT box with rsize=16k)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io       size async_size    io_size
initial           524        216         26        498        498         18          7          4          4
subsequent        181         80          1        130         13         60         25         25         24
context            94         28          3         85         64          8          7          2          5
thrash              0          0          0          0          0          0          0          0          0
around            162        121         33        162        162        162         60          0         21
fadvise             0          0          0          0          0          0          0          0          0
random            137          0          0        137        137          0          1          0          1
all              1098        445         63       1012        874          0         17          6          9

The two most important columns are:
- io		number of readahead IOs
- io_size	average readahead IO size (in pages)
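
As a worked reading of the table above: the "initial" row records 524
readahead requests, of which 498 actually submitted IO, averaging 4 pages
per IO (io_size is the total pages actually read divided by io); with 4KB
pages that is 16KB per IO, consistent with the rsize=16k NFS mount noted
above. The difference between the readahead and io columns is the number
of requests that submitted no IO at all, because every page was already
cached or the window lay entirely beyond EOF.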

CC: Ingo Molnar <mingo@elte.hu> 
CC: Jens Axboe <jens.axboe@oracle.com> 
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/Kconfig     |   13 +++
 mm/readahead.c |  187 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 198 insertions(+), 2 deletions(-)

--- linux.orig/mm/readahead.c	2010-02-24 10:44:46.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:47.000000000 +0800
@@ -89,6 +89,189 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request contains/beyond EOF page */
+	RA_ACCOUNT_CHIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap accesses */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    unsigned int ra_flags,
+			    pgoff_t start,
+			    unsigned int size,
+			    unsigned int async_size,
+			    int actual)
+{
+	unsigned int pattern = ra_pattern(ra_flags);
+
+	ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+	ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
+	ra_stats[pattern][RA_ACCOUNT_ASIZE] += async_size;
+	ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+	if (actual < size) {
+		if (start + size >
+		    (i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT)
+			ra_stats[pattern][RA_ACCOUNT_EOF]++;
+		else
+			ra_stats[pattern][RA_ACCOUNT_CHIT]++;
+	}
+
+	if (!actual)
+		return;
+
+	ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+	if (start <= offset && start + size > offset)
+		ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+	if (ra_flags & READAHEAD_MMAP)
+		ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	static const char * const ra_pattern_names[] = {
+		[RA_PATTERN_INITIAL]		= "initial",
+		[RA_PATTERN_SUBSEQUENT]		= "subsequent",
+		[RA_PATTERN_CONTEXT]		= "context",
+		[RA_PATTERN_THRASH]		= "thrash",
+		[RA_PATTERN_MMAP_AROUND]	= "around",
+		[RA_PATTERN_FADVISE]		= "fadvise",
+		[RA_PATTERN_RANDOM]		= "random",
+		[RA_PATTERN_ALL]		= "all",
+	};
+	unsigned long count, iocount;
+	unsigned long i;
+
+	seq_printf(s, "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+			"pattern",
+			"readahead", "eof_hit", "cache_hit",
+			"io", "sync_io", "mmap_io",
+			"size", "async_size", "io_size");
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		count = ra_stats[i][RA_ACCOUNT_COUNT];
+		iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu %10lu "
+			   "%10lu %10lu %10lu\n",
+				ra_pattern_names[i],
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CHIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_SIZE]   / count,
+				ra_stats[i][RA_ACCOUNT_ASIZE]  / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	memset(ra_stats, 0, sizeof(ra_stats));
+	return size;
+}
+
+static struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct dentry *ra_debug_root;
+
+static int debugfs_create_readahead(void)
+{
+	struct dentry *debugfs_stats;
+
+	ra_debug_root = debugfs_create_dir("readahead", NULL);
+	if (!ra_debug_root)
+		goto out;
+
+	debugfs_stats = debugfs_create_file("stats", 0644, ra_debug_root,
+					    NULL, &readahead_stats_fops);
+	if (!debugfs_stats)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+static int __init readahead_init(void)
+{
+	debugfs_create_readahead();
+	return 0;
+}
+
+static void __exit readahead_exit(void)
+{
+	debugfs_remove_recursive(ra_debug_root);
+}
+
+module_init(readahead_init);
+module_exit(readahead_exit);
+#endif
+
+static void readahead_event(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    unsigned int ra_flags,
+			    pgoff_t start,
+			    unsigned int size,
+			    unsigned int async_size,
+			    unsigned int actual)
+{
+#ifdef CONFIG_READAHEAD_STATS
+	readahead_stats(mapping, offset, req_size, ra_flags,
+			start, size, async_size, actual);
+	readahead_stats(mapping, offset, req_size,
+			RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
+			start, size, async_size, actual);
+#endif
+	trace_readahead(mapping, offset, req_size, ra_flags,
+			start, size, async_size, actual);
+}
+
 /*
  * see if a page needs releasing upon read_cache_pages() failure
  * - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -326,7 +509,7 @@ int force_page_cache_readahead(struct ad
 		nr_to_read -= this_chunk;
 	}
 
-	trace_readahead(mapping, offset, nr_to_read,
+	readahead_event(mapping, offset, nr_to_read,
 			RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
 			offset, nr_to_read, 0, ret);
 
@@ -357,7 +540,7 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
-	trace_readahead(mapping, offset, req_size, ra->ra_flags,
+	readahead_event(mapping, offset, req_size, ra->ra_flags,
 			ra->start, ra->size, ra->async_size, actual);
 
 	return actual;
--- linux.orig/mm/Kconfig	2010-02-24 10:44:23.000000000 +0800
+++ linux/mm/Kconfig	2010-02-24 10:44:47.000000000 +0800
@@ -283,3 +283,16 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  of 1 says that all excess pages should be trimmed.
 
 	  See Documentation/nommu-mmap.txt for more information.
+
+config READAHEAD_STATS
+	bool "Collect page-cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  Enable readahead events accounting. Usage:
+
+	  # mount -t debugfs none /debug
+
+	  # echo > /debug/readahead/stats  # reset counters
+	  # do benchmarks
+	  # cat /debug/readahead/stats     # check counters



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 11/15] readahead: dont do start-of-file readahead after lseek()
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Linus Torvalds, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2041 bytes --]

Some applications (e.g. blkid, id3tool) seek around the file
to get information. For example, blkid does
	     seek to	0
	     read	1024
	     seek to	1536
	     read	16384

The start-of-file readahead heuristic is wrong for them; their
access pattern can be identified by the lseek() calls.

So test-and-set a READAHEAD_LSEEK flag on lseek() and don't
do start-of-file readahead on seeing it. Proposed by Linus.
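
As an illustration, here is a minimal userspace sketch (hypothetical
helper, not part of this patch) of the lseek()+read() pattern described
above:

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical blkid-style probe; not part of this patch. */
static int probe(const char *path)
{
	char buf[16384];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	lseek(fd, 0, SEEK_SET);		/* seek to 0    */
	read(fd, buf, 1024);		/* read 1024    */
	lseek(fd, 1536, SEEK_SET);	/* seek to 1536 */
	read(fd, buf, 16384);		/* read 16384   */
	close(fd);
	return 0;
}

With READAHEAD_LSEEK set by the first lseek(), the read at offset 0 only
brings in the pages it actually needs, instead of a full
get_init_ra_size() window.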

Acked-by: Linus Torvalds <torvalds@linux-foundation.org> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/read_write.c    |    3 +++
 include/linux/fs.h |    1 +
 mm/readahead.c     |    5 +++++
 3 files changed, 9 insertions(+)

--- linux.orig/mm/readahead.c	2010-02-24 10:44:47.000000000 +0800
+++ linux/mm/readahead.c	2010-02-24 10:44:48.000000000 +0800
@@ -672,6 +672,11 @@ ondemand_readahead(struct address_space 
 	if (!offset) {
 		ra_set_pattern(ra, RA_PATTERN_INITIAL);
 		ra->start = offset;
+		if ((ra->ra_flags & READAHEAD_LSEEK) && req_size <= max) {
+			ra->size = req_size;
+			ra->async_size = 0;
+			goto readit;
+		}
 		ra->size = get_init_ra_size(req_size, max);
 		ra->async_size = ra->size > req_size ?
 				 ra->size - req_size : ra->size;
--- linux.orig/fs/read_write.c	2010-02-24 10:44:30.000000000 +0800
+++ linux/fs/read_write.c	2010-02-24 10:44:48.000000000 +0800
@@ -71,6 +71,9 @@ generic_file_llseek_unlocked(struct file
 		file->f_version = 0;
 	}
 
+	if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
+		file->f_ra.ra_flags |= READAHEAD_LSEEK;
+
 	return offset;
 }
 EXPORT_SYMBOL(generic_file_llseek_unlocked);
--- linux.orig/include/linux/fs.h	2010-02-24 10:44:45.000000000 +0800
+++ linux/include/linux/fs.h	2010-02-24 10:44:48.000000000 +0800
@@ -899,6 +899,7 @@ struct file_ra_state {
 #define	READAHEAD_MMAP_MISS	0x00000fff /* cache misses for mmap access */
 #define READAHEAD_THRASHED	0x10000000
 #define	READAHEAD_MMAP		0x20000000
+#define	READAHEAD_LSEEK		0x40000000 /* be conservative after lseek() */
 
 /*
  * Which policy makes decision to do the current read-ahead IO?



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 12/15] radixtree: introduce radix_tree_lookup_leaf_node()
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: radixtree-radix_tree_lookup_leaf_node.patch --]
[-- Type: text/plain, Size: 3339 bytes --]

This will be used by the pagecache context based read-ahead/read-around
heuristic to quickly check one pagecache range (see the sketch below):
- whether there is any hole
- whether there are any pages
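
As a rough sketch of the intended use (a hypothetical helper, not part of
this patch, and assuming struct radix_tree_node and its ->count field are
visible to the caller, as the kernel-doc below implies):

/* Return true if the leaf node covering @index holds any cached pages. */
static bool leaf_has_pages(struct address_space *mapping, pgoff_t index)
{
	struct radix_tree_node *node;
	bool ret = false;

	rcu_read_lock();
	node = radix_tree_lookup_leaf_node(&mapping->page_tree, index);
	if (node)
		ret = node->count != 0;
	rcu_read_unlock();

	return ret;
}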

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/radix-tree.h |    2 ++
 lib/radix-tree.c           |   27 ++++++++++++++++++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

--- linux.orig/lib/radix-tree.c	2010-02-24 10:44:23.000000000 +0800
+++ linux/lib/radix-tree.c	2010-02-24 10:44:49.000000000 +0800
@@ -359,7 +359,7 @@ EXPORT_SYMBOL(radix_tree_insert);
  * is_slot == 0 : search for the node.
  */
 static void *radix_tree_lookup_element(struct radix_tree_root *root,
-				unsigned long index, int is_slot)
+				unsigned long index, int is_slot, int level)
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
@@ -369,7 +369,7 @@ static void *radix_tree_lookup_element(s
 		return NULL;
 
 	if (!radix_tree_is_indirect_ptr(node)) {
-		if (index > 0)
+		if (index > 0 || level > 0)
 			return NULL;
 		return is_slot ? (void *)&root->rnode : node;
 	}
@@ -390,7 +390,7 @@ static void *radix_tree_lookup_element(s
 
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	} while (height > 0);
+	} while (height > level);
 
 	return is_slot ? (void *)slot:node;
 }
@@ -410,7 +410,7 @@ static void *radix_tree_lookup_element(s
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return (void **)radix_tree_lookup_element(root, index, 1);
+	return (void **)radix_tree_lookup_element(root, index, 1, 0);
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
@@ -428,11 +428,28 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
  */
 void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
 {
-	return radix_tree_lookup_element(root, index, 0);
+	return radix_tree_lookup_element(root, index, 0, 0);
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
 /**
+ *	radix_tree_lookup_leaf_node    -    lookup leaf node on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the leaf node that covers @index in the radix tree @root.
+ *	Return NULL if the node does not exist, or is the special root node.
+ *
+ *	The typical usage is to check the value of node->count, which shall be
+ *	performed inside rcu_read_lock to prevent the node from being freed.
+ */
+struct radix_tree_node *
+radix_tree_lookup_leaf_node(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_lookup_element(root, index, 0, 1);
+}
+
+/**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
  *	@index:		index key
--- linux.orig/include/linux/radix-tree.h	2010-02-24 10:44:23.000000000 +0800
+++ linux/include/linux/radix-tree.h	2010-02-24 10:44:49.000000000 +0800
@@ -158,6 +158,8 @@ static inline void radix_tree_replace_sl
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+struct radix_tree_node *
+radix_tree_lookup_leaf_node(struct radix_tree_root *root, unsigned long index);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 12/15] radixtree: introduce radix_tree_lookup_leaf_node()
@ 2010-02-24  3:10   ` Wu Fengguang
  0 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: radixtree-radix_tree_lookup_leaf_node.patch --]
[-- Type: text/plain, Size: 3564 bytes --]

This will be used by the pagecache context based read-ahead/read-around
heuristic to quickly check one pagecache range:
- if there is any hole
- if there is any pages

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/radix-tree.h |    2 ++
 lib/radix-tree.c           |   27 ++++++++++++++++++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

--- linux.orig/lib/radix-tree.c	2010-02-24 10:44:23.000000000 +0800
+++ linux/lib/radix-tree.c	2010-02-24 10:44:49.000000000 +0800
@@ -359,7 +359,7 @@ EXPORT_SYMBOL(radix_tree_insert);
  * is_slot == 0 : search for the node.
  */
 static void *radix_tree_lookup_element(struct radix_tree_root *root,
-				unsigned long index, int is_slot)
+				unsigned long index, int is_slot, int level)
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
@@ -369,7 +369,7 @@ static void *radix_tree_lookup_element(s
 		return NULL;
 
 	if (!radix_tree_is_indirect_ptr(node)) {
-		if (index > 0)
+		if (index > 0 || level > 0)
 			return NULL;
 		return is_slot ? (void *)&root->rnode : node;
 	}
@@ -390,7 +390,7 @@ static void *radix_tree_lookup_element(s
 
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	} while (height > 0);
+	} while (height > level);
 
 	return is_slot ? (void *)slot:node;
 }
@@ -410,7 +410,7 @@ static void *radix_tree_lookup_element(s
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return (void **)radix_tree_lookup_element(root, index, 1);
+	return (void **)radix_tree_lookup_element(root, index, 1, 0);
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
@@ -428,11 +428,28 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
  */
 void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
 {
-	return radix_tree_lookup_element(root, index, 0);
+	return radix_tree_lookup_element(root, index, 0, 0);
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
 /**
+ *	radix_tree_lookup_leaf_node    -    lookup leaf node on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the leaf node that covers @index in the radix tree @root.
+ *	Return NULL if the node does not exist, or is the special root node.
+ *
+ *	The typical usage is to check the value of node->count, which shall be
+ *	performed inside rcu_read_lock to prevent the node from being freed.
+ */
+struct radix_tree_node *
+radix_tree_lookup_leaf_node(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_lookup_element(root, index, 0, 1);
+}
+
+/**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
  *	@index:		index key
--- linux.orig/include/linux/radix-tree.h	2010-02-24 10:44:23.000000000 +0800
+++ linux/include/linux/radix-tree.h	2010-02-24 10:44:49.000000000 +0800
@@ -158,6 +158,8 @@ static inline void radix_tree_replace_sl
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+struct radix_tree_node *
+radix_tree_lookup_leaf_node(struct radix_tree_root *root, unsigned long index);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 12/15] radixtree: introduce radix_tree_lookup_leaf_node()
@ 2010-02-24  3:10   ` Wu Fengguang
  0 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: radixtree-radix_tree_lookup_leaf_node.patch --]
[-- Type: text/plain, Size: 3564 bytes --]

This will be used by the pagecache context based read-ahead/read-around
heuristic to quickly check one pagecache range:
- if there is any hole
- if there is any pages

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/radix-tree.h |    2 ++
 lib/radix-tree.c           |   27 ++++++++++++++++++++++-----
 2 files changed, 24 insertions(+), 5 deletions(-)

--- linux.orig/lib/radix-tree.c	2010-02-24 10:44:23.000000000 +0800
+++ linux/lib/radix-tree.c	2010-02-24 10:44:49.000000000 +0800
@@ -359,7 +359,7 @@ EXPORT_SYMBOL(radix_tree_insert);
  * is_slot == 0 : search for the node.
  */
 static void *radix_tree_lookup_element(struct radix_tree_root *root,
-				unsigned long index, int is_slot)
+				unsigned long index, int is_slot, int level)
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
@@ -369,7 +369,7 @@ static void *radix_tree_lookup_element(s
 		return NULL;
 
 	if (!radix_tree_is_indirect_ptr(node)) {
-		if (index > 0)
+		if (index > 0 || level > 0)
 			return NULL;
 		return is_slot ? (void *)&root->rnode : node;
 	}
@@ -390,7 +390,7 @@ static void *radix_tree_lookup_element(s
 
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
-	} while (height > 0);
+	} while (height > level);
 
 	return is_slot ? (void *)slot:node;
 }
@@ -410,7 +410,7 @@ static void *radix_tree_lookup_element(s
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
-	return (void **)radix_tree_lookup_element(root, index, 1);
+	return (void **)radix_tree_lookup_element(root, index, 1, 0);
 }
 EXPORT_SYMBOL(radix_tree_lookup_slot);
 
@@ -428,11 +428,28 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
  */
 void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
 {
-	return radix_tree_lookup_element(root, index, 0);
+	return radix_tree_lookup_element(root, index, 0, 0);
 }
 EXPORT_SYMBOL(radix_tree_lookup);
 
 /**
+ *	radix_tree_lookup_leaf_node    -    lookup leaf node on a radix tree
+ *	@root:		radix tree root
+ *	@index:		index key
+ *
+ *	Lookup the leaf node that covers @index in the radix tree @root.
+ *	Return NULL if the node does not exist, or is the special root node.
+ *
+ *	The typical usage is to check the value of node->count, which shall be
+ *	performed inside rcu_read_lock to prevent the node from being freed.
+ */
+struct radix_tree_node *
+radix_tree_lookup_leaf_node(struct radix_tree_root *root, unsigned long index)
+{
+	return radix_tree_lookup_element(root, index, 0, 1);
+}
+
+/**
  *	radix_tree_tag_set - set a tag on a radix tree node
  *	@root:		radix tree root
  *	@index:		index key
--- linux.orig/include/linux/radix-tree.h	2010-02-24 10:44:23.000000000 +0800
+++ linux/include/linux/radix-tree.h	2010-02-24 10:44:49.000000000 +0800
@@ -158,6 +158,8 @@ static inline void radix_tree_replace_sl
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+struct radix_tree_node *
+radix_tree_lookup_leaf_node(struct radix_tree_root *root, unsigned long index);
 void *radix_tree_delete(struct radix_tree_root *, unsigned long);
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,



^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH 13/15] radixtree: speed up the search for hole
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: radixtree-scan-hole-fast.patch --]
[-- Type: text/plain, Size: 2722 bytes --]

Replace the hole scan functions with faster versions that walk one leaf
node at a time instead of looking up every index from the tree root:
	- radix_tree_next_hole(root, index, max_scan)
	- radix_tree_prev_hole(root, index, max_scan)
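
A sketch of the typical call pattern, for reference (illustrative only,
not part of the patch; as the function comments note, the tree must be
kept stable, e.g. by rcu_read_lock(), while scanning):

	/*
	 * Illustrative only: how many pages are cached contiguously starting
	 * at @index?  radix_tree_next_hole() returns the index of the first
	 * missing page at or after @index, or an index beyond the scanned
	 * window if no hole was found within @max_scan slots.
	 */
	static unsigned long cached_run_length(struct address_space *mapping,
					       pgoff_t index,
					       unsigned long max_scan)
	{
		unsigned long hole;

		rcu_read_lock();
		hole = radix_tree_next_hole(&mapping->page_tree, index, max_scan);
		rcu_read_unlock();

		return hole - index;
	}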

Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 lib/radix-tree.c |   67 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 56 insertions(+), 11 deletions(-)

--- linux.orig/lib/radix-tree.c	2010-02-24 10:44:49.000000000 +0800
+++ linux/lib/radix-tree.c	2010-02-24 10:44:50.000000000 +0800
@@ -647,18 +647,41 @@ EXPORT_SYMBOL(radix_tree_tag_get);
  *	under rcu_read_lock.
  */
 unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan)
+				   unsigned long index, unsigned long max_scan)
 {
-	unsigned long i;
+	struct radix_tree_node *node;
+	unsigned long origin = index;
+	int i;
+
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return index;
+
+	if (!radix_tree_is_indirect_ptr(node))
+		return index ? index : 1;
 
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
+	while (index - origin < max_scan) {
+		node = radix_tree_lookup_leaf_node(root, index);
+		if (!node)
 			break;
-		index++;
-		if (index == 0)
+
+		if (node->count == RADIX_TREE_MAP_SIZE) {
+			index = (index | RADIX_TREE_MAP_MASK) + 1;
+			goto check_overflow;
+		}
+
+		for (i = index & RADIX_TREE_MAP_MASK;
+		     i < RADIX_TREE_MAP_SIZE;
+		     i++, index++)
+			if (rcu_dereference(node->slots[i]) == NULL)
+				goto out;
+
+check_overflow:
+		if (unlikely(index == 0))
 			break;
 	}
 
+out:
 	return index;
 }
 EXPORT_SYMBOL(radix_tree_next_hole);
@@ -686,16 +709,38 @@ EXPORT_SYMBOL(radix_tree_next_hole);
 unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
 				   unsigned long index, unsigned long max_scan)
 {
-	unsigned long i;
+	struct radix_tree_node *node;
+	unsigned long origin = index;
+	int i;
+
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return index;
+
+	if (!radix_tree_is_indirect_ptr(node))
+		return index ? index : ULONG_MAX;
 
-	for (i = 0; i < max_scan; i++) {
-		if (!radix_tree_lookup(root, index))
+	while (origin - index < max_scan) {
+		node = radix_tree_lookup_leaf_node(root, index);
+		if (!node)
 			break;
-		index--;
-		if (index == LONG_MAX)
+
+		if (node->count == RADIX_TREE_MAP_SIZE) {
+			index = (index - RADIX_TREE_MAP_SIZE) |
+					 RADIX_TREE_MAP_MASK;
+			goto check_underflow;
+		}
+
+		for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--)
+			if (rcu_dereference(node->slots[i]) == NULL)
+				goto out;
+
+check_underflow:
+		if (unlikely(index == ULONG_MAX))
 			break;
 	}
 
+out:
 	return index;
 }
 EXPORT_SYMBOL(radix_tree_prev_hole);



^ permalink raw reply	[flat|nested] 94+ messages in thread


* [PATCH 14/15] readahead: reduce MMAP_LOTSAMISS for mmap read-around
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-mmap-around.patch --]
[-- Type: text/plain, Size: 874 bytes --]

Now that we lift the readahead size from 128KB to 512KB,
MMAP_LOTSAMISS should be shrunk accordingly.

We shrink it a bit more, so that for sparse random access patterns,
only 10*512KB or ~5MB of memory will be wasted, instead of the previous
100*128KB or ~12.5MB. The new threshold "10" is still large enough to
avoid turning off read-around for typical executable/lib page faults.
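
For reference, the arithmetic above spelled out in a throwaway user-space
snippet (illustrative only; 128KB/512KB are the old and new default
read-around sizes, 100/10 the old and new MMAP_LOTSAMISS values):

	#include <stdio.h>

	int main(void)
	{
		/* worst case memory wasted on read-around for one sparse-random file */
		unsigned int old_kb = 100 * 128;	/* 12800KB ~= 12.5MB */
		unsigned int new_kb =  10 * 512;	/*  5120KB ~=    5MB */

		printf("old worst case: %u KB (%.1f MB)\n", old_kb, old_kb / 1024.0);
		printf("new worst case: %u KB (%.1f MB)\n", new_kb, new_kb / 1024.0);
		return 0;
	}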

CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux.orig/mm/filemap.c	2010-02-21 23:56:22.000000000 +0800
+++ linux/mm/filemap.c	2010-02-21 23:56:26.000000000 +0800
@@ -1393,7 +1393,7 @@ static int page_cache_read(struct file *
 	return ret;
 }
 
-#define MMAP_LOTSAMISS  (100)
+#define MMAP_LOTSAMISS  (10)
 
 /*
  * Synchronous readahead happens when we don't even find



^ permalink raw reply	[flat|nested] 94+ messages in thread


* [PATCH 15/15] readahead: pagecache context based mmap read-around
  2010-02-24  3:10 ` Wu Fengguang
  (?)
@ 2010-02-24  3:10   ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-24  3:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-mmap-around-context.patch --]
[-- Type: text/plain, Size: 1315 bytes --]

Do mmap read-around when there are cached pages in the nearby 256KB
(the range covered by one radix tree leaf node).

There is a failure case though: for a sequence of page faults at page
index 64*i+1, i=1,2,3,..., this heuristic will keep doing pointless
read-arounds.  Hopefully the pattern won't appear in real workloads.
Note that the readahead heuristic has a similar failure case.
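
To make the failure case concrete, a throwaway user-space sketch
(illustrative only; it assumes the default RADIX_TREE_MAP_SHIFT of 6,
so one leaf node covers 64 page slots, i.e. 256KB with 4KB pages):

	#include <stdio.h>

	int main(void)
	{
		unsigned long i;

		for (i = 1; i <= 4; i++) {
			unsigned long fault = 64 * i + 1;
			/*
			 * Per the changelog: the read-around triggered by the
			 * previous fault leaves cached pages in leaf node
			 * fault/64, so the "nearby pages cached?" test keeps
			 * passing even though every fault itself misses.
			 */
			printf("fault at page index %lu -> leaf node %lu\n",
			       fault, fault / 64);
		}
		return 0;
	}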

CC: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- linux.orig/mm/filemap.c	2010-02-23 13:20:39.000000000 +0800
+++ linux/mm/filemap.c	2010-02-23 13:22:36.000000000 +0800
@@ -1421,11 +1421,17 @@ static void do_sync_mmap_readahead(struc
 
 
 	/*
-	 * Do we miss much more than hit in this file? If so,
-	 * stop bothering with read-ahead. It will only hurt.
+	 * Do we miss much more than hit in this file? If so, stop bothering
+	 * with read-around, unless some nearby pages were accessed recently.
 	 */
-	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
-		return;
+	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS) {
+		struct radix_tree_node *node;
+		rcu_read_lock();
+		node = radix_tree_lookup_leaf_node(&mapping->page_tree, offset);
+		rcu_read_unlock();
+		if (!node)
+			return;
+	}
 
 	/*
 	 * mmap read-around



^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 01/15] readahead: limit readahead size for small devices
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25  3:11     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25  3:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Li Shaohua, Clemens Ladisch,
	Chris Mason, Peter Zijlstra, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> on which blkid runs unpleasantly slow. He manages to optimize the blkid
> reads down to 1kB+16kB, but still kernel read-ahead turns it into 48kB.

> CC: Li Shaohua<shaohua.li@intel.com>
> CC: Clemens Ladisch<clemens@ladisch.de>
> Acked-by: Jens Axboe<jens.axboe@oracle.com>
> Tested-by: Vivek Goyal<vgoyal@redhat.com>
> Tested-by: Linus Torvalds<torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 02/15] readahead: retain inactive lru pages to be accessed soon
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25  3:17     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25  3:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Chris Frost, Steve VanDeBogart,
	KAMEZAWA Hiroyuki, Chris Mason, Peter Zijlstra, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Christian Ehrhardt, Matt Mackall,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> From: Chris Frost<frost@cs.ucla.edu>
>
> Ensure that cached pages in the inactive list are not prematurely evicted;
> move such pages to lru head when they are covered by
> - in-kernel heuristic readahead
> - a posix_fadvise(POSIX_FADV_WILLNEED) hint from an application

> Signed-off-by: Chris Frost<frost@cs.ucla.edu>
> Signed-off-by: Steve VanDeBogart<vandebo@cs.ucla.edu>
> Signed-off-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

When we get into the situation where readahead thrashing
would occur, we will end up evicting other stuff more
quickly from the inactive file list.  However, that will
be the case either with or without this code...

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 03/15] readahead: bump up the default readahead size
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25  4:02     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25  4:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Chris Mason, Peter Zijlstra,
	Martin Schwidefsky, Paul Gortmaker, Matt Mackall,
	David Woodhouse, Christian Ehrhardt, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Use 512kb max readahead size, and 32kb min readahead size.
>
> The former helps io performance for common workloads.
> The latter will be used in the thrashing safe context readahead.

> CC: Jens Axboe<jens.axboe@oracle.com>
> CC: Chris Mason<chris.mason@oracle.com>
> CC: Peter Zijlstra<a.p.zijlstra@chello.nl>
> CC: Martin Schwidefsky<schwidefsky@de.ibm.com>
> CC: Paul Gortmaker<paul.gortmaker@windriver.com>
> CC: Matt Mackall<mpm@selenic.com>
> CC: David Woodhouse<dwmw2@infradead.org>
> Tested-by: Vivek Goyal<vgoyal@redhat.com>
> Tested-by: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
> Acked-by:  Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 02/15] readahead: retain inactive lru pages to be accessed soon
  2010-02-25  3:17     ` Rik van Riel
@ 2010-02-25 12:27       ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-25 12:27 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Morton, Jens Axboe, Chris Frost, Steve VanDeBogart,
	KAMEZAWA Hiroyuki, Chris Mason, Peter Zijlstra, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Christian Ehrhardt, Matt Mackall,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML

On Thu, Feb 25, 2010 at 11:17:41AM +0800, Rik van Riel wrote:
> On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> > From: Chris Frost<frost@cs.ucla.edu>
> >
> > Ensure that cached pages in the inactive list are not prematurely evicted;
> > move such pages to lru head when they are covered by
> > - in-kernel heuristic readahead
> > - a posix_fadvise(POSIX_FADV_WILLNEED) hint from an application
> 
> > Signed-off-by: Chris Frost<frost@cs.ucla.edu>
> > Signed-off-by: Steve VanDeBogart<vandebo@cs.ucla.edu>
> > Signed-off-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>
> 
> Acked-by: Rik van Riel <riel@redhat.com>
> 
> When we get into the situation where readahead thrashing
> would occur, we will end up evicting other stuff more
> quickly from the inactive file list.  However, that will
> be the case either with or without this code...

Thanks. I'm actually not afraid of it adding memory pressure to the
readahead thrashing case.  The context readahead (patch 07) can
adaptively control the memory pressure with or without this patch.

It does add memory pressure to mmap read-around. A typical read-around
request would cover some cached pages (whether or not they are
memory-mapped), and all those pages would be moved to LRU head by
this patch.

This somehow implicitly adds LRU lifetime to executable/lib pages.

Hopefully this won't behave too badly. And it will be limited by the
smaller readahead size on small memory systems (patch 05).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 04/15] readahead: make default readahead size a kernel parameter
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 14:59     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 14:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Ankit Jain, Dave Chinner,
	Christian Ehrhardt, Nikanth Karthikesan, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Matt Mackall, Nick Piggin, Linux Memory Management List,
	linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> From: Nikanth Karthikesan<knikanth@suse.de>
>
> Add new kernel parameter "readahead", which allows user to override
> the static VM_MAX_READAHEAD=512kb.
>
> CC: Ankit Jain<radical@gmail.com>
> CC: Dave Chinner<david@fromorbit.com>
> CC: Christian Ehrhardt<ehrhardt@linux.vnet.ibm.com>
> Signed-off-by: Nikanth Karthikesan<knikanth@suse.de>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 05/15] readahead: limit readahead size for small memory systems
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 15:00     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 15:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Nick Piggin, Linux Memory Management List,
	linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> When lifting the default readahead size from 128KB to 512KB,
> make sure it won't add memory pressure to small memory systems.
>
> For read-ahead, the memory pressure is mainly readahead buffers consumed
> by too many concurrent streams. The context readahead can adapt
> readahead size to thrashing threshold well.  So in principle we don't
> need to adapt the default _max_ read-ahead size to memory pressure.
>
> For read-around, the memory pressure is mainly read-around misses on
> executables/libraries. Which could be reduced by scaling down
> read-around size on fast "reclaim passes".
>
> This patch presents a straightforward solution: to limit default
> readahead size proportional to available system memory, ie.
>                  512MB mem =>  512KB readahead size
>                  128MB mem =>  128KB readahead size
>                   32MB mem =>   32KB readahead size (minimal)
>
> Strictly speaking, only read-around size has to be limited.  However we
> don't bother to separate read-around size from read-ahead size for now.
>
> CC: Matt Mackall<mpm@selenic.com>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 05/15] readahead: limit readahead size for small memory systems
  2010-02-24  3:10   ` Wu Fengguang
  (?)
@ 2010-02-25 15:25     ` Christian Ehrhardt
  -1 siblings, 0 replies; 94+ messages in thread
From: Christian Ehrhardt @ 2010-02-25 15:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML



Wu Fengguang wrote:
 > When lifting the default readahead size from 128KB to 512KB,
 > make sure it won't add memory pressure to small memory systems.
 >
 > For read-ahead, the memory pressure is mainly readahead buffers consumed
 > by too many concurrent streams. The context readahead can adapt
 > readahead size to thrashing threshold well.  So in principle we don't
 > need to adapt the default _max_ read-ahead size to memory pressure.
 >
 > For read-around, the memory pressure is mainly read-around misses on
 > executables/libraries. Which could be reduced by scaling down
 > read-around size on fast "reclaim passes".
 >
 > This patch presents a straightforward solution: to limit default
 > readahead size proportional to available system memory, ie.
 >                 512MB mem => 512KB readahead size
 >                 128MB mem => 128KB readahead size
 >                  32MB mem =>  32KB readahead size (minimal)
 >
 > Strictly speaking, only read-around size has to be limited.  However we
 > don't bother to separate read-around size from read-ahead size for now.
 >
 > CC: Matt Mackall <mpm@selenic.com>
 > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

What I state here is for read-ahead in a "multi iozone sequential"
setup; I can't speak for real "read-around" workloads.
So probably your table is fine to cover read-around+read-ahead in one
number.

I have tested 256MB mem systems with 512kb readahead quite a lot.
On those, 512kb is still by far superior to smaller readaheads and I
didn't see major thrashing or memory pressure impact.

Therefore I would recommend a table like:
                >=256MB mem => 512KB readahead size
                  128MB mem => 128KB readahead size
                   32MB mem =>  32KB readahead size (minimal)

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 06/15] readahead: replace ra->mmap_miss with ra->ra_flags
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 15:52     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 15:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Nick Piggin, Andi Kleen,
	Steven Whitehouse, Chris Mason, Peter Zijlstra, Clemens Ladisch,
	Olivier Galibert, Vivek Goyal, Christian Ehrhardt, Matt Mackall,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Introduce a readahead flags field and embed the existing mmap_miss in it
> (mainly to save space).
>
> It also changes the mmap_miss upper bound from LONG_MAX to 4096.
> This is to help adapt properly for changing mmap access patterns.
>
> It will be possible to lose the flags in race conditions, however the
> impact should be limited.  For the race to happen, there must be two
> threads sharing the same file descriptor to be in page fault or
> readahead at the same time.
>
> Note that it has always been racy for "page faults" at the same time.
>
> And if ever the race happens, we'll lose one mmap_miss++ or mmap_miss--.
> Which may change some concrete readahead behavior, but won't really
> impact overall I/O performance.
>
> CC: Nick Piggin<npiggin@suse.de>
> CC: Andi Kleen<andi@firstfloor.org>
> CC: Steven Whitehouse<swhiteho@redhat.com>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 07/15] readahead: thrashing safe context readahead
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 16:24     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 16:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Chris Mason, Peter Zijlstra,
	Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Introduce a more complete version of context readahead, which is a
> full-fledged readahead algorithm by itself. It replaces some of the
> existing cases.

> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread


* Re: [PATCH 08/15] readahead: record readahead patterns
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 22:37     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 22:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Ingo Molnar, Peter Zijlstra,
	Chris Mason, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Record the readahead pattern in ra_flags. This info can be examined by
> users via the readahead tracing/stats interfaces.
>
> Currently 7 patterns are defined:
>
>        	pattern			readahead for
> -----------------------------------------------------------
> 	RA_PATTERN_INITIAL	start-of-file/oversize read
> 	RA_PATTERN_SUBSEQUENT	trivial     sequential read
> 	RA_PATTERN_CONTEXT	interleaved sequential read
> 	RA_PATTERN_THRASH	thrashed    sequential read
> 	RA_PATTERN_MMAP_AROUND	mmap fault
> 	RA_PATTERN_FADVISE	posix_fadvise()
> 	RA_PATTERN_RANDOM	random read
>
> CC: Ingo Molnar<mingo@elte.hu>
> CC: Jens Axboe<jens.axboe@oracle.com>
> CC: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread
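
One plausible way to pack the seven pattern names above into a few bits of
ra->ra_flags is sketched below. The bit layout, shift and mask are
illustrative assumptions, not the patch's actual encoding; only the
RA_PATTERN_* names come from the changelog.

enum ra_pattern {
	RA_PATTERN_INITIAL,
	RA_PATTERN_SUBSEQUENT,
	RA_PATTERN_CONTEXT,
	RA_PATTERN_THRASH,
	RA_PATTERN_MMAP_AROUND,
	RA_PATTERN_FADVISE,
	RA_PATTERN_RANDOM,
	RA_PATTERN_MAX,			/* illustrative sentinel */
};

#define RA_PATTERN_SHIFT	0	/* assumed position within ra_flags */
#define RA_PATTERN_MASK		0x7	/* 3 bits are enough for 7 patterns */

/* store a pattern into a flags word, clearing any previous value */
static inline void ra_set_pattern(unsigned long *ra_flags, enum ra_pattern p)
{
	*ra_flags = (*ra_flags & ~(RA_PATTERN_MASK << RA_PATTERN_SHIFT)) |
		    ((unsigned long)p << RA_PATTERN_SHIFT);
}

/* read the pattern back, e.g. for the tracing/stats interfaces */
static inline enum ra_pattern ra_get_pattern(unsigned long ra_flags)
{
	return (ra_flags >> RA_PATTERN_SHIFT) & RA_PATTERN_MASK;
}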

* Re: [PATCH 09/15] readahead: add tracing event
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 22:38     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 22:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Ingo Molnar, Steven Rostedt,
	Peter Zijlstra, Chris Mason, Clemens Ladisch, Olivier Galibert,
	Vivek Goyal, Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Example output:
>
> # echo 1>  /debug/tracing/events/readahead/enable
> # cp test-file /dev/null
> # cat /debug/tracing/trace  # trimmed output
> readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
>
> CC: Ingo Molnar<mingo@elte.hu>
> CC: Jens Axboe<jens.axboe@oracle.com>
> CC: Steven Rostedt<rostedt@goodmis.org>
> CC: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH 10/15] readahead: add /debug/readahead/stats
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 22:40     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 22:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Ingo Molnar, Peter Zijlstra,
	Chris Mason, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Collect readahead stats when CONFIG_READAHEAD_STATS=y.
>
> This is enabled by default because the added overhead is trivial:
> two readahead_stats() calls per readahead.
>
> Example output:
> (taken from a fresh booted NFS-ROOT box with rsize=16k)
>
> $ cat /debug/readahead/stats
> pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io       size async_size    io_size
> initial           524        216         26        498        498         18          7          4          4
> subsequent        181         80          1        130         13         60         25         25         24
> context            94         28          3         85         64          8          7          2          5
> thrash              0          0          0          0          0          0          0          0          0
> around            162        121         33        162        162        162         60          0         21
> fadvise             0          0          0          0          0          0          0          0          0
> random            137          0          0        137        137          0          1          0          1
> all              1098        445         63       1012        874          0         17          6          9
>
> The two most important columns are
> - io		number of readahead IO
> - io_size	average readahead IO size
>
> CC: Ingo Molnar<mingo@elte.hu>
> CC: Jens Axboe<jens.axboe@oracle.com>
> CC: Peter Zijlstra<a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread
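
As a rough illustration of the per-pattern accounting behind such a table,
here is a small userspace sketch. The counter layout, field names and the
readahead_stats() signature are assumptions made for the example; only the
pattern names and the idea of counting per readahead come from the changelog.

#include <stdio.h>

enum { INITIAL, SUBSEQUENT, CONTEXT, THRASH, AROUND, FADVISE, RANDOM, NR_PATTERNS };
enum { RA_COUNT, RA_IO, RA_SIZE, RA_IO_SIZE, NR_FIELDS };

static unsigned long ra_stats[NR_PATTERNS][NR_FIELDS];

/* one call per readahead decision; io_submitted says whether real I/O was issued */
static void readahead_stats(int pattern, int io_submitted, unsigned long pages)
{
	ra_stats[pattern][RA_COUNT]++;
	ra_stats[pattern][RA_SIZE] += pages;
	if (io_submitted) {
		ra_stats[pattern][RA_IO]++;
		ra_stats[pattern][RA_IO_SIZE] += pages;
	}
}

int main(void)
{
	/* simulate three readaheads that all issued I/O */
	readahead_stats(INITIAL, 1, 4);
	readahead_stats(SUBSEQUENT, 1, 8);
	readahead_stats(SUBSEQUENT, 1, 16);

	/* the "size"/"io_size" columns of the table are per-pattern averages */
	printf("subsequent: io=%lu avg io_size=%lu pages\n",
	       ra_stats[SUBSEQUENT][RA_IO],
	       ra_stats[SUBSEQUENT][RA_IO_SIZE] / ra_stats[SUBSEQUENT][RA_IO]);
	return 0;
}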

* Re: [PATCH 11/15] readahead: dont do start-of-file readahead after lseek()
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 22:42     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 22:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Linus Torvalds, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Some applications (e.g. blkid, id3tool, etc.) seek around the file
> to get information. For example, blkid does
> 	     seek to	0
> 	     read	1024
> 	     seek to	1536
> 	     read	16384
>
> The start-of-file readahead heuristic is wrong for such applications,
> whose access pattern can be identified by their lseek() calls.
>
> So test-and-set a READAHEAD_LSEEK flag on lseek() and don't
> do start-of-file readahead on seeing it. Proposed by Linus.
>
> Acked-by: Linus Torvalds<torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread
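
A minimal sketch of the test-and-set idea described above. READAHEAD_LSEEK
is the flag name from the changelog; the helper names, the bit value and the
exact call sites are assumptions for illustration, not the patch itself. It
also assumes the ra_flags field added earlier in this series.

/* kernel-style fragment, assumes <linux/fs.h> and ra->ra_flags */
#define READAHEAD_LSEEK		0x1	/* assumed bit in ra->ra_flags */

/* to be called from the lseek() path: remember the explicit seek */
static inline void ra_note_lseek(struct file_ra_state *ra)
{
	ra->ra_flags |= READAHEAD_LSEEK;
}

/*
 * To be checked on the first read at offset 0: skip the start-of-file
 * readahead heuristic if the file position was explicitly moved before.
 */
static inline int ra_want_initial_readahead(struct file_ra_state *ra,
					    pgoff_t offset)
{
	return offset == 0 && !(ra->ra_flags & READAHEAD_LSEEK);
}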

* Re: [PATCH 12/15] radixtree: introduce radix_tree_lookup_leaf_node()
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 23:13     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 23:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Nick Piggin, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> This will be used by the pagecache context based read-ahead/read-around
> heuristic to quickly check one pagecache range:
> - whether there is any hole
> - whether there are any pages
>
> Cc: Nick Piggin<nickpiggin@yahoo.com.au>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH 13/15] radixtree: speed up the search for hole
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 23:37     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 23:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Nick Piggin, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Nick Piggin,
	Linux Memory Management List, linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Replace the hole scan functions with faster versions:
> 	- radix_tree_next_hole(root, index, max_scan)
> 	- radix_tree_prev_hole(root, index, max_scan)
>
> Cc: Nick Piggin<nickpiggin@yahoo.com.au>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread
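
For context, the two functions are used roughly like this by readahead code:
given a page index, find where the cached run ends so the next window can
start there. A sketch follows (kernel-style fragment; locking kept minimal,
error handling omitted; page_tree is the radix tree root in struct
address_space of that era).

/*
 * Sketch: return the index of the first missing pagecache page at or
 * after 'index', scanning at most 'max_scan' slots.  If no hole is found
 * within the scan window, 'index + max_scan' comes back.
 */
static pgoff_t first_hole_after(struct address_space *mapping,
				pgoff_t index, unsigned long max_scan)
{
	pgoff_t hole;

	rcu_read_lock();
	hole = radix_tree_next_hole(&mapping->page_tree, index, max_scan);
	rcu_read_unlock();

	return hole;
}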

* Re: [PATCH 14/15] readahead: reduce MMAP_LOTSAMISS for mmap read-around
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-25 23:42     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-25 23:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Nick Piggin, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Linux Memory Management List,
	linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Now that we have lifted the readahead size from 128KB to 512KB,
> MMAP_LOTSAMISS should be shrunk accordingly.
>
> We shrink it a bit more, so that for sparse random access patterns,
> only 10*512KB or ~5MB of memory will be wasted, instead of the previous
> 100*128KB or ~12.5MB. The new threshold "10" is still big enough to avoid
> turning off read-around for typical executable/lib page faults.
>
> CC: Nick Piggin<npiggin@suse.de>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH 15/15] readahead: pagecache context based mmap read-around
  2010-02-24  3:10   ` Wu Fengguang
@ 2010-02-26  1:33     ` Rik van Riel
  -1 siblings, 0 replies; 94+ messages in thread
From: Rik van Riel @ 2010-02-26  1:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Nick Piggin, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Christian Ehrhardt, Matt Mackall, Linux Memory Management List,
	linux-fsdevel, LKML

On 02/23/2010 10:10 PM, Wu Fengguang wrote:
> Do mmap read-around when there are cached pages in the nearby 256KB
> (covered by one radix tree node).
>
> There is a failure case though: for a sequence of page faults at page
> index 64*i+1, i=1,2,3,..., this heuristic will keep doing pointless
> read-arounds.  Hopefully the pattern won't appear in real workloads.
> Note that the readahead heuristic has a similar failure case.
>
> CC: Nick Piggin<npiggin@suse.de>
> Signed-off-by: Wu Fengguang<fengguang.wu@intel.com>

Acked-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 94+ messages in thread
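
The 256KB figure comes from the radix tree fanout: one leaf node covers
64 slots, i.e. 64 pages of 4KB. A sketch of the nearby-pages check is below;
note that radix_tree_lookup_leaf_node() is introduced by patch 12 of this
series and its exact signature here is an assumption, used only to
illustrate the idea.

/*
 * Sketch only: decide whether a fault at 'offset' should trigger
 * read-around by checking for any cached page in the same 64-page
 * aligned chunk (256KB with 4KB pages), i.e. the range covered by one
 * radix tree leaf node.  The signature of radix_tree_lookup_leaf_node()
 * is assumed for illustration.
 */
static int nearby_pages_cached(struct address_space *mapping, pgoff_t offset)
{
	void *node;

	rcu_read_lock();
	node = radix_tree_lookup_leaf_node(&mapping->page_tree, offset & ~63UL);
	rcu_read_unlock();

	return node != NULL;
}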

* Re: [PATCH 05/15] readahead: limit readahead size for small memory systems
  2010-02-25 15:25     ` Christian Ehrhardt
@ 2010-02-26  2:29       ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-26  2:29 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

On Thu, Feb 25, 2010 at 11:25:54PM +0800, Christian Ehrhardt wrote:
> 
> 
> Wu Fengguang wrote:
>  > When lifting the default readahead size from 128KB to 512KB,
>  > make sure it won't add memory pressure to small memory systems.
>  >
>  > For read-ahead, the memory pressure is mainly readahead buffers consumed
>  > by too many concurrent streams. The context readahead can adapt
>  > readahead size to thrashing threshold well.  So in principle we don't
>  > need to adapt the default _max_ read-ahead size to memory pressure.
>  >
>  > For read-around, the memory pressure is mainly read-around misses on
>  > executables/libraries. Which could be reduced by scaling down
>  > read-around size on fast "reclaim passes".
>  >
>  > This patch presents a straightforward solution: to limit default
>  > readahead size proportional to available system memory, ie.
>  >                 512MB mem => 512KB readahead size
>  >                 128MB mem => 128KB readahead size
>  >                  32MB mem =>  32KB readahead size (minimal)
>  >
>  > Strictly speaking, only read-around size has to be limited.  However we
>  > don't bother to seperate read-around size from read-ahead size for now.
>  >
>  > CC: Matt Mackall <mpm@selenic.com>
>  > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> What I state here is for read-ahead in a "multi iozone sequential"
> setup; I can't speak for real "read-around" workloads.
> So probably your table is fine to cover read-around+read-ahead in one 
> number.

OK.

> I have tested 256MB mem systems with 512kb readahead quite a lot.
> On those, 512kb is still by far superior to smaller readahead sizes, and I
> didn't see major thrashing or memory pressure impact.

In fact I'd expect a 64MB box to also benefit from 512kb readahead :)

> Therefore I would recommend a table like:
>                 >=256MB mem => 512KB readahead size
>                   128MB mem => 128KB readahead size
>                    32MB mem =>  32KB readahead size (minimal)

So, I'm fed up with trying to compromise between the read-ahead size and
the read-around size.

There is no point in introducing a separate read-around size knob that would
only confuse the user, though.  Instead, I'll introduce a read-around size
limit _on top of_ the readahead size. This will allow power users to adjust
the read-ahead/read-around size at the same time, while saving the low end
from unnecessary memory pressure :) I'm assuming that low-end users have no
need to request a large read-around size.

Thanks,
Fengguang
---
readahead: limit read-ahead size for small memory systems

When lifting the default readahead size from 128KB to 512KB,
make sure it won't add memory pressure to small memory systems.

For read-ahead, the memory pressure is mainly readahead buffers consumed
by too many concurrent streams. The context readahead can adapt
readahead size to thrashing threshold well.  So in principle we don't
need to adapt the default _max_ read-ahead size to memory pressure.

For read-around, the memory pressure is mainly read-around misses on
executables/libraries, which could be reduced by scaling down the
read-around size on fast "reclaim passes".

This patch presents a straightforward solution: limit the default
read-ahead size in proportion to available system memory, i.e.
                512MB mem => 512KB readahead size
                128MB mem => 128KB readahead size
                 32MB mem =>  32KB readahead size

CC: Matt Mackall <mpm@selenic.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c   |    2 +-
 mm/readahead.c |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

--- linux.orig/mm/filemap.c	2010-02-26 10:04:28.000000000 +0800
+++ linux/mm/filemap.c	2010-02-26 10:08:33.000000000 +0800
@@ -1431,7 +1431,7 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = min(ra->ra_pages, roundup_pow_of_two(totalram_pages / 1024));
 	if (ra_pages) {
 		ra->start = max_t(long, 0, offset - ra_pages/2);
 		ra->size = ra_pages;

^ permalink raw reply	[flat|nested] 94+ messages in thread
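
To make the arithmetic of the hunk above concrete, here is a stand-alone
userspace sketch of min(ra->ra_pages, roundup_pow_of_two(totalram_pages / 1024)),
assuming 4KB pages and the default 512KB (128-page) readahead;
roundup_pow_of_two() is re-implemented here only so the example compiles
outside the kernel.

#include <stdio.h>

/* round up to the next power of two (userspace stand-in for the kernel helper) */
static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	/* 512KB default readahead = 128 pages of 4KB */
	unsigned long ra_pages = 128;
	unsigned long mem_mb[] = { 32, 64, 128, 256, 512, 1024 };

	for (int i = 0; i < 6; i++) {
		unsigned long totalram_pages = mem_mb[i] << 8; /* MB -> 4KB pages */
		unsigned long limit = roundup_pow_of_two(totalram_pages / 1024);
		unsigned long ra = limit < ra_pages ? limit : ra_pages;

		printf("%4luMB mem -> read-around %3lu pages (%lu KB)\n",
		       mem_mb[i], ra, ra * 4);
	}
	return 0;
}

Running it reproduces the table from the changelog: 32MB => 32KB,
128MB => 128KB, and 512MB or more => the full 512KB (capped by ra_pages).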

* [PATCH] readahead: add notes on readahead size
  2010-02-26  2:29       ` Wu Fengguang
@ 2010-02-26  2:48         ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-26  2:48 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

> readahead: limit read-ahead size for small memory systems
> 
> When lifting the default readahead size from 128KB to 512KB,
> make sure it won't add memory pressure to small memory systems.

BTW, I wrote some comments to summarize the readahead size rules, which
have become fairly complex by now.

==
readahead: add notes on readahead size

Basically, currently the default max readahead size
- is 512k
- is boot time configurable with "readahead="
and is auto scaled down:
- for small devices
- for small memory systems (read-around size alone)

CC: Matt Mackall <mpm@selenic.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |   22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

--- linux.orig/mm/readahead.c	2010-02-26 10:11:41.000000000 +0800
+++ linux/mm/readahead.c	2010-02-26 10:11:55.000000000 +0800
@@ -7,6 +7,28 @@
  *		Initial version.
  */
 
+/*
+ * Notes on readahead size.
+ *
+ * The default max readahead size is VM_MAX_READAHEAD=512k,
+ * which can be changed by the user with the boot-time parameter "readahead="
+ * or the runtime interface "/sys/devices/virtual/bdi/default/read_ahead_kb".
+ * The latter normally only takes effect for devices hot-added afterwards.
+ *
+ * The effective max readahead size for each block device can be accessed with
+ * 1) the `blockdev` command
+ * 2) /sys/block/sda/queue/read_ahead_kb
+ * 3) /sys/devices/virtual/bdi/$(env stat -c '%t:%T' /dev/sda)/read_ahead_kb
+ *
+ * They are typically initialized to the global default size, but may be
+ * auto-scaled down for small devices in add_disk(). NFS, software RAID, btrfs,
+ * etc. have special rules to set up their default readahead size.
+ *
+ * The mmap read-around size typically equals the readahead size, with an
+ * extra limit proportional to system memory size.  For example, a 64MB box
+ * will have a 64KB read-around size limit, 128MB mem => 128KB limit, etc.
+ */
+
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/memcontrol.h>

^ permalink raw reply	[flat|nested] 94+ messages in thread
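
As a small usage illustration of the interfaces listed in the comment, the
program below prints the effective readahead size of one disk by reading the
sysfs file named above (the device name sda is just an example).

#include <stdio.h>

int main(void)
{
	const char *path = "/sys/block/sda/queue/read_ahead_kb";
	unsigned long kb;
	FILE *f = fopen(path, "r");

	if (!f || fscanf(f, "%lu", &kb) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);

	printf("max readahead for sda: %lu KB\n", kb);
	return 0;
}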

* Re: [PATCH 05/15] readahead: limit readahead size for small memory systems
  2010-02-26  2:29       ` Wu Fengguang
  (?)
@ 2010-02-26  7:23         ` Christian Ehrhardt
  -1 siblings, 0 replies; 94+ messages in thread
From: Christian Ehrhardt @ 2010-02-26  7:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

Unfortunately I have no chance to measure this atm, but this patch now looks
really good to me.
Thanks for adapting it to a read-ahead only per mem limit.
Acked-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>


Wu Fengguang wrote:
> On Thu, Feb 25, 2010 at 11:25:54PM +0800, Christian Ehrhardt wrote:
>>
>> Wu Fengguang wrote:
>>  > When lifting the default readahead size from 128KB to 512KB,
>>  > make sure it won't add memory pressure to small memory systems.
>>  >
>>  > For read-ahead, the memory pressure is mainly readahead buffers consumed
>>  > by too many concurrent streams. The context readahead can adapt
>>  > readahead size to thrashing threshold well.  So in principle we don't
>>  > need to adapt the default _max_ read-ahead size to memory pressure.
>>  >
>>  > For read-around, the memory pressure is mainly read-around misses on
>>  > executables/libraries. Which could be reduced by scaling down
>>  > read-around size on fast "reclaim passes".
>>  >
>>  > This patch presents a straightforward solution: to limit default
>>  > readahead size proportional to available system memory, ie.
>>  >                 512MB mem => 512KB readahead size
>>  >                 128MB mem => 128KB readahead size
>>  >                  32MB mem =>  32KB readahead size (minimal)
>>  >
>>  > Strictly speaking, only read-around size has to be limited.  However we
>>  > don't bother to seperate read-around size from read-ahead size for now.
>>  >
>>  > CC: Matt Mackall <mpm@selenic.com>
>>  > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>>
>> What I state here is for read ahead in a "multi iozone sequential" 
>> setup, I can't speak for real "read around" workloads.
>> So probably your table is fine to cover read-around+read-ahead in one 
>> number.
> 
> OK.
> 
>> I have tested 256MB mem systems with 512kb readahead quite a lot.
>> On those 512kb is still by far superior to smaller readaheads and I 
>> didn't see major trashing or memory pressure impact.
> 
> In fact I'd expect a 64MB box to also benefit from 512kb readahead :)
> 
>> Therefore I would recommend a table like:
>>                 >=256MB mem => 512KB readahead size
>>                   128MB mem => 128KB readahead size
>>                    32MB mem =>  32KB readahead size (minimal)
> 
> So, I'm fed up with compromising the read-ahead size with read-around
> size.
> 
> There is no good to introduce a read-around size to confuse the user
> though.  Instead, I'll introduce a read-around size limit _on top of_
> the readahead size. This will allow power users to adjust
> read-ahead/read-around size at the same time, while saving the low end
> from unnecessary memory pressure :) I made the assumption that low end
> users have no need to request a large read-around size.
> 
> Thanks,
> Fengguang
> ---
> readahead: limit read-ahead size for small memory systems
> 
> When lifting the default readahead size from 128KB to 512KB,
> make sure it won't add memory pressure to small memory systems.
> 
> For read-ahead, the memory pressure is mainly readahead buffers consumed
> by too many concurrent streams. The context readahead can adapt
> readahead size to thrashing threshold well.  So in principle we don't
> need to adapt the default _max_ read-ahead size to memory pressure.
> 
> For read-around, the memory pressure is mainly read-around misses on
> executables/libraries. Which could be reduced by scaling down
> read-around size on fast "reclaim passes".
> 
> This patch presents a straightforward solution: to limit default
> read-ahead size proportional to available system memory, ie.
>                 512MB mem => 512KB readahead size
>                 128MB mem => 128KB readahead size
>                  32MB mem =>  32KB readahead size
> 
> CC: Matt Mackall <mpm@selenic.com>
> CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/filemap.c   |    2 +-
>  mm/readahead.c |   22 ++++++++++++++++++++++
>  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> --- linux.orig/mm/filemap.c	2010-02-26 10:04:28.000000000 +0800
> +++ linux/mm/filemap.c	2010-02-26 10:08:33.000000000 +0800
> @@ -1431,7 +1431,7 @@ static void do_sync_mmap_readahead(struc
>  	/*
>  	 * mmap read-around
>  	 */
> -	ra_pages = max_sane_readahead(ra->ra_pages);
> +	ra_pages = min(ra->ra_pages, roundup_pow_of_two(totalram_pages / 1024));
>  	if (ra_pages) {
>  		ra->start = max_t(long, 0, offset - ra_pages/2);
>  		ra->size = ra_pages;

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH 05/15] readahead: limit readahead size for small memory systems
  2010-02-26  7:23         ` Christian Ehrhardt
@ 2010-02-26  7:38           ` Wu Fengguang
  -1 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-26  7:38 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

Christian,

On Fri, Feb 26, 2010 at 03:23:40PM +0800, Christian Ehrhardt wrote:
> Unfortunately without a chance to measure this atm, this patch now looks 
> really good to me.
> Thanks for adapting it to a read-ahead only per mem limit.
> Acked-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Thank you. Effective measurement is hard because it really depends on
how the user wants to stress his small memory system ;) So I think
a simple-to-understand and yet reasonable limit scheme should be OK.

Thanks,
Fengguang
---
readahead: limit read-ahead size for small memory systems

When lifting the default readahead size from 128KB to 512KB,
make sure it won't add memory pressure to small memory systems.

For read-ahead, the memory pressure is mainly readahead buffers consumed
by too many concurrent streams. The context readahead can adapt
readahead size to thrashing threshold well.  So in principle we don't
need to adapt the default _max_ read-ahead size to memory pressure.

For read-around, the memory pressure is mainly read-around misses on
executables/libraries, which could be reduced by scaling down the
read-around size on fast "reclaim passes".

This patch presents a straightforward solution: limit the default
read-around size in proportion to available system memory, i.e.

                512MB mem => 512KB read-around size
                128MB mem => 128KB read-around size
                 32MB mem =>  32KB read-around size

This will allow power users to adjust read-ahead/read-around size at
once, while saving the low end from unnecessary memory pressure, under
the assumption that low end users have no need to request a large
read-around size.

CC: Matt Mackall <mpm@selenic.com>
Acked-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c   |    2 +-
 mm/readahead.c |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

--- linux.orig/mm/filemap.c	2010-02-26 10:04:28.000000000 +0800
+++ linux/mm/filemap.c	2010-02-26 10:08:33.000000000 +0800
@@ -1431,7 +1431,7 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = min(ra->ra_pages, roundup_pow_of_two(totalram_pages / 1024));
 	if (ra_pages) {
 		ra->start = max_t(long, 0, offset - ra_pages/2);
 		ra->size = ra_pages;

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH 05/15] readahead: limit readahead size for small memory systems
@ 2010-02-26  7:38           ` Wu Fengguang
  0 siblings, 0 replies; 94+ messages in thread
From: Wu Fengguang @ 2010-02-26  7:38 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, Jens Axboe, Matt Mackall, Chris Mason,
	Peter Zijlstra, Clemens Ladisch, Olivier Galibert, Vivek Goyal,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

Christian,

On Fri, Feb 26, 2010 at 03:23:40PM +0800, Christian Ehrhardt wrote:
> Unfortunately without a chance to measure this atm, this patch now looks 
> really good to me.
> Thanks for adapting it to a read-ahead only per mem limit.
> Acked-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

Thank you. Effective measurement is hard because it really depends on
how the user want to stress use his small memory system ;) So I think
a simple to understand and yet reasonable limit scheme would be OK.

Thanks,
Fengguang
---
readahead: limit read-ahead size for small memory systems

When lifting the default readahead size from 128KB to 512KB,
make sure it won't add memory pressure to small memory systems.

For read-ahead, the memory pressure is mainly readahead buffers consumed
by too many concurrent streams. The context readahead can adapt
readahead size to thrashing threshold well.  So in principle we don't
need to adapt the default _max_ read-ahead size to memory pressure.

For read-around, the memory pressure is mainly read-around misses on
executables/libraries. Which could be reduced by scaling down
read-around size on fast "reclaim passes".

This patch presents a straightforward solution: to limit default
read-ahead size proportional to available system memory, ie.

                512MB mem => 512KB read-around size
                128MB mem => 128KB read-around size
                 32MB mem =>  32KB read-around size

This will allow power users to adjust the read-ahead/read-around sizes at
once, while saving low-end systems from unnecessary memory pressure, under
the assumption that low-end users have no need for a large
read-around size.
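
For illustration only (a sketch, not part of the patch): a minimal C program
showing how the capped ra_pages is then used to center the read-around window
on the faulting page, mirroring the filemap.c hunk below. The helper name
place_readaround() is hypothetical, and offsets/sizes are in pages.

#include <stdio.h>

/* Hypothetical helper (sketch only): mirrors the ra->start/ra->size
 * placement in the filemap.c hunk below.  'offset' is the faulting
 * page index; the window is centred on it and clamped at offset 0. */
static void place_readaround(unsigned long offset, unsigned long ra_pages,
			     unsigned long *start, unsigned long *size)
{
	*start = offset > ra_pages / 2 ? offset - ra_pages / 2 : 0;
	*size  = ra_pages;
}

int main(void)
{
	unsigned long start, size;

	/* e.g. fault at page 10 with a 128-page (512KB) window:
	 * the window is clamped to [0, 128) rather than going negative. */
	place_readaround(10, 128, &start, &size);
	printf("start=%lu size=%lu\n", start, size);	/* start=0 size=128 */
	return 0;
}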

CC: Matt Mackall <mpm@selenic.com>
Acked-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c   |    2 +-
 mm/readahead.c |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

--- linux.orig/mm/filemap.c	2010-02-26 10:04:28.000000000 +0800
+++ linux/mm/filemap.c	2010-02-26 10:08:33.000000000 +0800
@@ -1431,7 +1431,7 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
-	ra_pages = max_sane_readahead(ra->ra_pages);
+	ra_pages = min(ra->ra_pages, roundup_pow_of_two(totalram_pages / 1024));
 	if (ra_pages) {
 		ra->start = max_t(long, 0, offset - ra_pages/2);
 		ra->size = ra_pages;


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] readahead: add notes on readahead size
  2010-02-26  2:48         ` Wu Fengguang
@ 2010-02-26 14:17           ` Vivek Goyal
  -1 siblings, 0 replies; 94+ messages in thread
From: Vivek Goyal @ 2010-02-26 14:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christian Ehrhardt, Andrew Morton, Jens Axboe, Matt Mackall,
	Chris Mason, Peter Zijlstra, Clemens Ladisch, Olivier Galibert,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

On Fri, Feb 26, 2010 at 10:48:37AM +0800, Wu Fengguang wrote:
> > readahead: limit read-ahead size for small memory systems
> > 
> > When lifting the default readahead size from 128KB to 512KB,
> > make sure it won't add memory pressure to small memory systems.
> 
> btw, I wrote some comments to summarize the now complex readahead size
> rules..
> 
> ==
> readahead: add notes on readahead size
> 
> Basically, currently the default max readahead size
> - is 512k
> - is boot time configurable with "readahead="
> and is auto scaled down:
> - for small devices
> - for small memory systems (read-around size alone)
> 
> CC: Matt Mackall <mpm@selenic.com>
> CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/readahead.c |   22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> --- linux.orig/mm/readahead.c	2010-02-26 10:11:41.000000000 +0800
> +++ linux/mm/readahead.c	2010-02-26 10:11:55.000000000 +0800
> @@ -7,6 +7,28 @@
>   *		Initial version.
>   */
>  
> +/*
> + * Notes on readahead size.
> + *
> + * The default max readahead size is VM_MAX_READAHEAD=512k,
> + * which can be changed by the user with the boot-time parameter "readahead="
> + * or runtime interface "/sys/devices/virtual/bdi/default/read_ahead_kb".
> + * The latter normally takes effect only for devices hot-added afterwards.
> + *
> + * The effective max readahead size for each block device can be accessed with
> + * 1) the `blockdev` command
> + * 2) /sys/block/sda/queue/read_ahead_kb
> + * 3) /sys/devices/virtual/bdi/$(env stat -c '%t:%T' /dev/sda)/read_ahead_kb
> + *
> + * They are typically initialized with the global default size; however, they
> + * may be auto-scaled down for small devices in add_disk(). NFS, software RAID,
> + * btrfs etc. have special rules to set up their default readahead sizes.
> + *
> + * The mmap read-around size typically equals the readahead size, with an
> + * extra limit proportional to system memory size.  For example, a 64MB box
> + * will have a 64KB read-around size limit, 128MB mem => 128KB limit, etc.
> + */
> +

Great. I was confused by the many ways to control the readahead size.
This documentation helps a lot.

Vivek

>  #include <linux/kernel.h>
>  #include <linux/fs.h>
>  #include <linux/memcontrol.h>
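
For a concrete usage example (a sketch only, not from the thread; the device
name sda is an assumption), here is a minimal C program reading the effective
per-device value via interface 2) in the comment above:

#include <stdio.h>

int main(void)
{
	/* Interface 2) above: the per-queue sysfs knob.
	 * The device name (sda) is an assumption; adjust as needed. */
	const char *path = "/sys/block/sda/queue/read_ahead_kb";
	unsigned long kb;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%lu", &kb) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("effective readahead for sda: %lu KB\n", kb);
	return 0;
}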

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] readahead: add notes on readahead size
@ 2010-02-26 14:17           ` Vivek Goyal
  0 siblings, 0 replies; 94+ messages in thread
From: Vivek Goyal @ 2010-02-26 14:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christian Ehrhardt, Andrew Morton, Jens Axboe, Matt Mackall,
	Chris Mason, Peter Zijlstra, Clemens Ladisch, Olivier Galibert,
	Nick Piggin, Linux Memory Management List, linux-fsdevel, LKML,
	Rik van Riel

On Fri, Feb 26, 2010 at 10:48:37AM +0800, Wu Fengguang wrote:
> > readahead: limit read-ahead size for small memory systems
> > 
> > When lifting the default readahead size from 128KB to 512KB,
> > make sure it won't add memory pressure to small memory systems.
> 
> btw, I wrote some comments to summarize the now complex readahead size
> rules..
> 
> ==
> readahead: add notes on readahead size
> 
> Basically, currently the default max readahead size
> - is 512k
> - is boot time configurable with "readahead="
> and is auto scaled down:
> - for small devices
> - for small memory systems (read-around size alone)
> 
> CC: Matt Mackall <mpm@selenic.com>
> CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/readahead.c |   22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> --- linux.orig/mm/readahead.c	2010-02-26 10:11:41.000000000 +0800
> +++ linux/mm/readahead.c	2010-02-26 10:11:55.000000000 +0800
> @@ -7,6 +7,28 @@
>   *		Initial version.
>   */
>  
> +/*
> + * Notes on readahead size.
> + *
> + * The default max readahead size is VM_MAX_READAHEAD=512k,
> + * which can be changed by the user with the boot-time parameter "readahead="
> + * or runtime interface "/sys/devices/virtual/bdi/default/read_ahead_kb".
> + * The latter normally takes effect only for devices hot-added afterwards.
> + *
> + * The effective max readahead size for each block device can be accessed with
> + * 1) the `blockdev` command
> + * 2) /sys/block/sda/queue/read_ahead_kb
> + * 3) /sys/devices/virtual/bdi/$(env stat -c '%t:%T' /dev/sda)/read_ahead_kb
> + *
> + * They are typically initialized with the global default size; however, they
> + * may be auto-scaled down for small devices in add_disk(). NFS, software RAID,
> + * btrfs etc. have special rules to set up their default readahead sizes.
> + *
> + * The mmap read-around size typically equals the readahead size, with an
> + * extra limit proportional to system memory size.  For example, a 64MB box
> + * will have a 64KB read-around size limit, 128MB mem => 128KB limit, etc.
> + */
> +

Great. I was confused by the many ways to control the readahead size.
This documentation helps a lot.

Vivek

>  #include <linux/kernel.h>
>  #include <linux/fs.h>
>  #include <linux/memcontrol.h>

^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2010-02-26 14:18 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-24  3:10 [PATCH 00/15] 512K readahead size with thrashing safe readahead v2 Wu Fengguang
2010-02-24  3:10 ` Wu Fengguang
2010-02-24  3:10 ` Wu Fengguang
2010-02-24  3:10 ` [PATCH 01/15] readahead: limit readahead size for small devices Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25  3:11   ` Rik van Riel
2010-02-25  3:11     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 02/15] readahead: retain inactive lru pages to be accessed soon Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25  3:17   ` Rik van Riel
2010-02-25  3:17     ` Rik van Riel
2010-02-25 12:27     ` Wu Fengguang
2010-02-25 12:27       ` Wu Fengguang
2010-02-24  3:10 ` [PATCH 03/15] readahead: bump up the default readahead size Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25  4:02   ` Rik van Riel
2010-02-25  4:02     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 04/15] readahead: make default readahead size a kernel parameter Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 14:59   ` Rik van Riel
2010-02-25 14:59     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 05/15] readahead: limit readahead size for small memory systems Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 15:00   ` Rik van Riel
2010-02-25 15:00     ` Rik van Riel
2010-02-25 15:25   ` Christian Ehrhardt
2010-02-25 15:25     ` Christian Ehrhardt
2010-02-25 15:25     ` Christian Ehrhardt
2010-02-26  2:29     ` Wu Fengguang
2010-02-26  2:29       ` Wu Fengguang
2010-02-26  2:48       ` [PATCH] readahead: add notes on readahead size Wu Fengguang
2010-02-26  2:48         ` Wu Fengguang
2010-02-26 14:17         ` Vivek Goyal
2010-02-26 14:17           ` Vivek Goyal
2010-02-26  7:23       ` [PATCH 05/15] readahead: limit readahead size for small memory systems Christian Ehrhardt
2010-02-26  7:23         ` Christian Ehrhardt
2010-02-26  7:23         ` Christian Ehrhardt
2010-02-26  7:38         ` Wu Fengguang
2010-02-26  7:38           ` Wu Fengguang
2010-02-24  3:10 ` [PATCH 06/15] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 15:52   ` Rik van Riel
2010-02-25 15:52     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 07/15] readahead: thrashing safe context readahead Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 16:24   ` Rik van Riel
2010-02-25 16:24     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 08/15] readahead: record readahead patterns Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 22:37   ` Rik van Riel
2010-02-25 22:37     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 09/15] readahead: add tracing event Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 22:38   ` Rik van Riel
2010-02-25 22:38     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 10/15] readahead: add /debug/readahead/stats Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 22:40   ` Rik van Riel
2010-02-25 22:40     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 11/15] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 22:42   ` Rik van Riel
2010-02-25 22:42     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 12/15] radixtree: introduce radix_tree_lookup_leaf_node() Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 23:13   ` Rik van Riel
2010-02-25 23:13     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 13/15] radixtree: speed up the search for hole Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 23:37   ` Rik van Riel
2010-02-25 23:37     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 14/15] readahead: reduce MMAP_LOTSAMISS for mmap read-around Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-25 23:42   ` Rik van Riel
2010-02-25 23:42     ` Rik van Riel
2010-02-24  3:10 ` [PATCH 15/15] readahead: pagecache context based " Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-24  3:10   ` Wu Fengguang
2010-02-26  1:33   ` Rik van Riel
2010-02-26  1:33     ` Rik van Riel
