* [PATCH 00/10] readahead stats/tracing, backwards prefetching and more (v3)
From: Wu Fengguang @ 2011-12-19 10:23 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Linux Memory Management List, linux-fsdevel,
	Wu Fengguang, LKML

Andrew,

This series introduces per-cpu readahead stats, tracing and backwards
prefetching, fixes context readahead for SSD random reads, and makes a
few other minor changes.

Changes since v2:
- use per-cpu counters for readahead stats
- make context readahead more conservative
- simplify readahead tracing format and use __print_symbolic()
- backwards prefetching and snap to EOF fixes and cleanups

Changes since v1:
- use bit fields: pattern, for_mmap, for_metadata, lseek
- comment the various readahead patterns
- drop boot options "readahead=" and "readahead_stats="
- add for_metadata
- add snapping to EOF

 [PATCH 01/10] block: limit default readahead size for small devices
 [PATCH 02/10] readahead: make context readahead more conservative
 [PATCH 03/10] readahead: record readahead patterns
 [PATCH 04/10] readahead: tag mmap page fault call sites
 [PATCH 05/10] readahead: tag metadata call sites
 [PATCH 06/10] readahead: add vfs/readahead tracing event
 [PATCH 07/10] readahead: add /debug/readahead/stats
 [PATCH 08/10] readahead: basic support for backwards prefetching
 [PATCH 09/10] readahead: dont do start-of-file readahead after lseek()
 [PATCH 10/10] readahead: snap readahead request to EOF

 block/genhd.c              |   20 ++
 fs/Makefile                |    1 
 fs/ext3/dir.c              |    1 
 fs/ext4/dir.c              |    1 
 fs/read_write.c            |    3 
 fs/trace.c                 |    2 
 include/linux/fs.h         |   41 ++++
 include/linux/mm.h         |    4 
 include/trace/events/vfs.h |   78 +++++++++
 mm/Kconfig                 |   15 +
 mm/filemap.c               |    9 -
 mm/readahead.c             |  301 +++++++++++++++++++++++++++++++++--
 12 files changed, 461 insertions(+), 15 deletions(-)

Thanks,
Fengguang

* [PATCH 01/10] block: limit default readahead size for small devices
From: Wu Fengguang @ 2011-12-19 10:23 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Li Shaohua, Clemens Ladisch, Jens Axboe,
	Rik van Riel, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 6992 bytes --]

Linus reports a _really_ small & slow (505kB, 15kB/s) USB device on
which blkid runs unpleasantly slowly. He managed to optimize the blkid
reads down to 1kB+16kB, but kernel readahead still turns that into 48kB.

     lseek 0,    read 1024   => readahead 4 pages (start of file)
     lseek 1536, read 16384  => readahead 8 pages (page contiguous)

The readahead heuristics involved here are reasonable in general, so
the right fix on the userspace side is to make blkid use
fadvise(RANDOM), as Linus already did.
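
As an illustration, the userspace side of such a fix might look like
this (a sketch, not the actual blkid change):

	/* Probe a device with readahead disabled via POSIX_FADV_RANDOM. */
	#include <fcntl.h>

	int open_for_probing(const char *dev)
	{
		int fd = open(dev, O_RDONLY);

		if (fd >= 0)
			/* declare random access: stop inflating small reads */
			posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
		return fd;
	}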

For the kernel part, Linus suggests:
  So maybe we could be less aggressive about read-ahead when the size of
  the device is small? Turning a 16kB read into a 64kB one is a big deal,
  when it's about 15% of the whole device!

This looks reasonable: smaller devices tend to be slower (USB sticks
as well as micro/mobile/old hard disks).

Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:

        disk size    readahead size
     (scale by 4)      (scale by 2)
               1M                8k
               4M               16k
              16M               32k
              64M               64k
             256M              128k
        --------------------------- (*)
               1G              256k
               4G              512k
              16G             1024k
              64G             2048k
             256G             4096k

(*) Since the default readahead size is 128k, this limit only takes
effect for devices whose size is less than 256M.
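
To make the mapping concrete, here is a minimal userspace sketch of the
same formula (assuming 4kB pages, capacity counted in 512-byte sectors,
and __builtin_clzll() standing in for the kernel's ilog2()):

	#include <stdio.h>

	static int ilog2(unsigned long long n)
	{
		return 63 - __builtin_clzll(n);	/* undefined for n == 0 */
	}

	int main(void)
	{
		/* disk sizes in bytes: 1M, 4M, 16M, 64M, 256M, 1G */
		unsigned long long bytes[] = { 1ULL << 20, 1ULL << 22,
			1ULL << 24, 1ULL << 26, 1ULL << 28, 1ULL << 30 };

		for (int i = 0; i < 6; i++) {
			unsigned long long sectors = bytes[i] >> 9;
			unsigned long pages = 1UL << (ilog2(sectors >> 9) / 2);

			printf("%6lluM -> %4luk readahead\n",
			       bytes[i] >> 20, pages * 4);
		}
		return 0;
	}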

The formula was determined from the following data, collected by this
script:

	#!/bin/sh

	# please make sure BDEV is not mounted or opened by others
	BDEV=sdb

	for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
	do
		echo 3 > /proc/sys/vm/drop_caches
		echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
		time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
	done

The guiding principle is that the formula must not limit the readahead
size so much that it hurts any device's sequential read performance.

The Intel SSD is special in that its throughput keeps increasing with
larger readahead sizes. However, it may take years for Linux to raise
its default readahead size to 2MB, so the formula does not try to
accommodate it.

SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)

	rasize	1st run		2nd run
	----------------------------------
	  4k	123 MB/s	122 MB/s
	 16k  	153 MB/s	153 MB/s
	 32k	161 MB/s	162 MB/s
	 64k	167 MB/s	168 MB/s
	128k	197 MB/s	197 MB/s
	256k	217 MB/s	217 MB/s
	512k	238 MB/s	234 MB/s
	  1M	251 MB/s	248 MB/s
	  2M	259 MB/s	257 MB/s
==>	  4M	269 MB/s	264 MB/s
	  8M	266 MB/s	266 MB/s

Note that ==> points to the readahead size that yields plateau throughput.

SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)

	rasize  1st             2nd
	--------------------------------
	  4k     41 MB/s         41 MB/s
	 16k     85 MB/s         81 MB/s
	 32k    102 MB/s        109 MB/s
	 64k    125 MB/s        144 MB/s
	128k    183 MB/s        185 MB/s
	256k    216 MB/s        216 MB/s
	512k    216 MB/s        236 MB/s
	1024k   251 MB/s        252 MB/s
	  2M    258 MB/s        258 MB/s
==>       4M    266 MB/s        266 MB/s
	  8M    266 MB/s        266 MB/s

SSD 30G SanDisk SATA 5000

	  4k	29.6 MB/s	29.6 MB/s	29.6 MB/s
	 16k	52.1 MB/s	52.1 MB/s	52.1 MB/s
	 32k	61.5 MB/s	61.5 MB/s	61.5 MB/s
	 64k	67.2 MB/s	67.2 MB/s	67.1 MB/s
	128k	71.4 MB/s	71.3 MB/s	71.4 MB/s
	256k	73.4 MB/s	73.4 MB/s	73.3 MB/s
==>	512k	74.6 MB/s	74.6 MB/s	74.6 MB/s
	  1M	74.7 MB/s	74.6 MB/s	74.7 MB/s
	  2M	76.1 MB/s	74.6 MB/s	74.6 MB/s

USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165

	  4k	7.9 MB/s 	7.9 MB/s 	7.9 MB/s
	 16k	17.9 MB/s	17.9 MB/s	17.9 MB/s
	 32k	24.5 MB/s	24.5 MB/s	24.5 MB/s
	 64k	28.7 MB/s	28.7 MB/s	28.7 MB/s
	128k	28.8 MB/s	28.9 MB/s	28.9 MB/s
==>	256k	30.5 MB/s	30.5 MB/s	30.5 MB/s
	512k	30.9 MB/s	31.0 MB/s	30.9 MB/s
	  1M	31.0 MB/s	30.9 MB/s	30.9 MB/s
	  2M	30.9 MB/s	30.9 MB/s	30.9 MB/s

USB stick 4G SanDisk  Cruzer idVendor=0781, idProduct=5151

	  4k	6.4 MB/s 	6.4 MB/s 	6.4 MB/s
	 16k	13.4 MB/s	13.4 MB/s	13.2 MB/s
	 32k	17.8 MB/s	17.9 MB/s	17.8 MB/s
	 64k	21.3 MB/s	21.3 MB/s	21.2 MB/s
	128k	21.4 MB/s	21.4 MB/s	21.4 MB/s
==>	256k	23.3 MB/s	23.2 MB/s	23.2 MB/s
	512k	23.3 MB/s	23.8 MB/s	23.4 MB/s
	  1M	23.8 MB/s	23.4 MB/s	23.3 MB/s
	  2M	23.4 MB/s	23.2 MB/s	23.4 MB/s

USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113

	  4k	6.7 MB/s 	6.9 MB/s 	6.7 MB/s
	 16k	11.7 MB/s	11.7 MB/s	11.7 MB/s
	 32k	12.4 MB/s	12.4 MB/s	12.4 MB/s
   	 64k	13.4 MB/s	13.4 MB/s	13.4 MB/s
	128k	13.4 MB/s	13.4 MB/s	13.4 MB/s
==>	256k	13.6 MB/s	13.6 MB/s	13.6 MB/s
	512k	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  1M	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  2M	13.7 MB/s	13.7 MB/s	13.7 MB/s

64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey

	4KB:    139.339 s, 376 kB/s
	16KB:   81.0427 s, 647 kB/s
	32KB:   71.8513 s, 730 kB/s
==>	64KB:   67.3872 s, 778 kB/s
	128KB:  67.5434 s, 776 kB/s
	256KB:  65.9019 s, 796 kB/s
	512KB:  66.2282 s, 792 kB/s
	1024KB: 67.4632 s, 777 kB/s
	2048KB: 69.9759 s, 749 kB/s

An unnamed SD card (Yakui):

         4k     195.873 s,  5.5 MB/s
         8k     123.425 s,  8.7 MB/s
         16k    86.6425 s, 12.4 MB/s
         32k    66.7519 s, 16.1 MB/s
==>      64k    58.5262 s, 18.3 MB/s
         128k   59.3847 s, 18.1 MB/s
         256k   59.3188 s, 18.1 MB/s
         512k   59.0218 s, 18.2 MB/s

CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/genhd.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

--- linux-next.orig/block/genhd.c	2011-12-16 19:36:06.000000000 +0800
+++ linux-next/block/genhd.c	2011-12-16 19:36:10.000000000 +0800
@@ -577,6 +577,7 @@ exit:
 void add_disk(struct gendisk *disk)
 {
 	struct backing_dev_info *bdi;
+	size_t size;
 	dev_t devt;
 	int retval;
 
@@ -622,6 +623,25 @@ void add_disk(struct gendisk *disk)
 	WARN_ON(retval);
 
 	disk_add_events(disk);
+
+	/*
+	 * Scale down default readahead size for small devices.
+	 *        disk size    readahead size
+	 *               1M                8k
+	 *               4M               16k
+	 *              16M               32k
+	 *              64M               64k
+	 *             255M               64k (the round down effect)
+	 *             256M              128k
+	 *               1G              256k
+	 *               4G              512k
+	 *              16G             1024k
+	 */
+	size = get_capacity(disk);
+	if (size) {
+		size = 1 << (ilog2(size >> 9) / 2);
+		bdi->ra_pages = min(bdi->ra_pages, size);
+	}
 }
 EXPORT_SYMBOL(add_disk);

* [PATCH 02/10] readahead: make context readahead more conservative
From: Wu Fengguang @ 2011-12-19 10:23 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-context-tt --]
[-- Type: text/plain, Size: 1998 bytes --]

Try to avoid negatively impacting moderately dense random reads on SSD.

Queries-per-second (QPS) numbers provided by Taobao:

		QPS	case
		-------------------------------------------------------
		7536	disable context readahead totally
w/ patch:	7129	slower size rampup and start RA on the 3rd read
		6717	slower size rampup
w/o patch:	5581	unmodified context readahead

Previously, readahead would be started when reading page N+1 as long
as page N had been read recently. After this patch, readahead only
starts when *three* random reads happen to access the consecutive
pages N, N+1, N+2. The probability of that happening is extremely low
for pure random reads, unless they are very dense, in which case some
readahead is actually deserved.

Also start with a smaller readahead window. The impact on interleaved
sequential reads should be small, because for a long-running stream
the small readahead window ramp-up phase is negligible.

Context readahead does benefit clustered random reads on HDDs, whose
seek cost is high. However, as SSDs are increasingly used for random
read workloads, it is better for context readahead to concentrate on
interleaved sequential reads.
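
The new trigger can be modeled in a few lines (a toy sketch, not
kernel code; it assumes a history() helper that counts how many
consecutive pages immediately before the read are already cached):

	#include <stdio.h>

	static int cached[16];

	static unsigned long history(unsigned long offset)
	{
		unsigned long n = 0;

		while (offset >= n + 1 && cached[offset - n - 1])
			n++;
		return n;
	}

	int main(void)
	{
		unsigned long reads[] = { 5, 6, 7 };	/* pages N, N+1, N+2 */

		for (int i = 0; i < 3; i++) {
			unsigned long off = reads[i], req_size = 1;

			/* new condition: history must exceed the request size */
			printf("read page %lu: history=%lu -> %s\n", off,
			       history(off), history(off) > req_size ?
			       "start readahead" : "read as-is");
			cached[off] = 1;
		}
		return 0;
	}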

Tested-by: Tao Ma <tm@tao.ma>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- linux-next.orig/mm/readahead.c	2011-12-19 18:12:35.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-19 18:12:51.000000000 +0800
@@ -369,10 +369,10 @@ static int try_context_readahead(struct 
 	size = count_history_pages(mapping, ra, offset, max);
 
 	/*
-	 * no history pages:
+	 * not enough history pages:
 	 * it could be a random read
 	 */
-	if (!size)
+	if (size <= req_size)
 		return 0;
 
 	/*
@@ -383,8 +383,8 @@ static int try_context_readahead(struct 
 		size *= 2;
 
 	ra->start = offset;
-	ra->size = get_init_ra_size(size + req_size, max);
-	ra->async_size = ra->size;
+	ra->size = min(size + req_size, max);
+	ra->async_size = 1;
 
 	return 1;
 }

* [PATCH 03/10] readahead: record readahead patterns
From: Wu Fengguang @ 2011-12-19 10:23 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra, Jan Kara,
	Rik van Riel, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-tracepoints.patch --]
[-- Type: text/plain, Size: 6845 bytes --]

Record the readahead pattern in ra->pattern and extend ra_submit()
parameters, to be used by the next readahead tracing/stats patches.

7 patterns are defined:

      	pattern			readahead for
-----------------------------------------------------------
	RA_PATTERN_INITIAL	start-of-file read
	RA_PATTERN_SUBSEQUENT	trivial sequential read
	RA_PATTERN_CONTEXT	interleaved sequential read
	RA_PATTERN_OVERSIZE	oversize read
	RA_PATTERN_MMAP_AROUND	mmap fault
	RA_PATTERN_FADVISE	posix_fadvise()
	RA_PATTERN_RANDOM	random read

Note that random reads will now be recorded in file_ra_state. This
won't worsen cache bouncing, because the ra->prev_pos update in
do_generic_file_read() already pollutes the data cache, and
filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
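
As an illustration, a userspace sketch of the interleaved read pattern
described for RA_PATTERN_CONTEXT above (the file name is an
assumption; error handling elided):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = open("data", O_RDONLY);	/* >= 8MB file assumed */

		/* two sequential streams through one file descriptor */
		for (off_t i = 0; i < 512; i++) {
			pread(fd, buf, sizeof(buf), i * 4096);
			pread(fd, buf, sizeof(buf), (1000 + i) * 4096);
		}
		close(fd);
		return 0;
	}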

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |   36 +++++++++++++++++++++++++++++++++++-
 include/linux/mm.h |    4 +++-
 mm/filemap.c       |    3 ++-
 mm/readahead.c     |   29 ++++++++++++++++++++++-------
 4 files changed, 62 insertions(+), 10 deletions(-)

--- linux-next.orig/include/linux/fs.h	2011-12-19 18:12:35.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-12-19 18:13:07.000000000 +0800
@@ -947,11 +947,45 @@ struct file_ra_state {
 					   there are only # of pages ahead */
 
 	unsigned int ra_pages;		/* Maximum readahead window */
-	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
+	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
+	u8 pattern;			/* one of RA_PATTERN_* */
+
 	loff_t prev_pos;		/* Cache last read() position */
 };
 
 /*
+ * Which policy makes decision to do the current read-ahead IO?
+ *
+ * RA_PATTERN_INITIAL		readahead window is initially opened,
+ *				normally when reading from start of file
+ * RA_PATTERN_SUBSEQUENT	readahead window is pushed forward
+ * RA_PATTERN_CONTEXT		no readahead window available, querying the
+ *				page cache to decide readahead start/size.
+ *				This typically happens on interleaved reads (eg.
+ *				reading pages 0, 1000, 1, 1001, 2, 1002, ...)
+ *				where one file_ra_state struct is not enough
+ *				for recording 2+ interleaved sequential read
+ *				streams.
+ * RA_PATTERN_MMAP_AROUND	read-around on mmap page faults
+ *				(w/o any sequential/random hints)
+ * RA_PATTERN_FADVISE		triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
+ * RA_PATTERN_OVERSIZE		a random read larger than max readahead size,
+ *				do max readahead to break down the read size
+ * RA_PATTERN_RANDOM		a small random read
+ */
+enum readahead_pattern {
+	RA_PATTERN_INITIAL,
+	RA_PATTERN_SUBSEQUENT,
+	RA_PATTERN_CONTEXT,
+	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_FADVISE,
+	RA_PATTERN_OVERSIZE,
+	RA_PATTERN_RANDOM,
+	RA_PATTERN_ALL,		/* for summary stats */
+	RA_PATTERN_MAX
+};
+
+/*
  * Check if @index falls in the readahead windows.
  */
 static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
--- linux-next.orig/mm/readahead.c	2011-12-19 18:12:51.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-19 18:13:07.000000000 +0800
@@ -249,7 +249,10 @@ unsigned long max_sane_readahead(unsigne
  * Submit IO for the read-ahead request in file_ra_state.
  */
 unsigned long ra_submit(struct file_ra_state *ra,
-		       struct address_space *mapping, struct file *filp)
+			struct address_space *mapping,
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size)
 {
 	int actual;
 
@@ -382,6 +385,7 @@ static int try_context_readahead(struct 
 	if (size >= offset)
 		size *= 2;
 
+	ra->pattern = RA_PATTERN_CONTEXT;
 	ra->start = offset;
 	ra->size = min(size + req_size, max);
 	ra->async_size = 1;
@@ -403,8 +407,10 @@ ondemand_readahead(struct address_space 
 	/*
 	 * start of file
 	 */
-	if (!offset)
+	if (!offset) {
+		ra->pattern = RA_PATTERN_INITIAL;
 		goto initial_readahead;
+	}
 
 	/*
 	 * It's the expected callback offset, assume sequential access.
@@ -412,6 +418,7 @@ ondemand_readahead(struct address_space 
 	 */
 	if ((offset == (ra->start + ra->size - ra->async_size) ||
 	     offset == (ra->start + ra->size))) {
+		ra->pattern = RA_PATTERN_SUBSEQUENT;
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max);
 		ra->async_size = ra->size;
@@ -434,6 +441,7 @@ ondemand_readahead(struct address_space 
 		if (!start || start - offset > max)
 			return 0;
 
+		ra->pattern = RA_PATTERN_CONTEXT;
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
 		ra->size += req_size;
@@ -445,14 +453,18 @@ ondemand_readahead(struct address_space 
 	/*
 	 * oversize read
 	 */
-	if (req_size > max)
+	if (req_size > max) {
+		ra->pattern = RA_PATTERN_OVERSIZE;
 		goto initial_readahead;
+	}
 
 	/*
 	 * sequential cache miss
 	 */
-	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
+	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) {
+		ra->pattern = RA_PATTERN_INITIAL;
 		goto initial_readahead;
+	}
 
 	/*
 	 * Query the page cache and look for the traces(cached history pages)
@@ -463,9 +475,12 @@ ondemand_readahead(struct address_space 
 
 	/*
 	 * standalone, small random read
-	 * Read as is, and do not pollute the readahead state.
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	ra->pattern = RA_PATTERN_RANDOM;
+	ra->start = offset;
+	ra->size = req_size;
+	ra->async_size = 0;
+	goto readit;
 
 initial_readahead:
 	ra->start = offset;
@@ -483,7 +498,7 @@ readit:
 		ra->size += ra->async_size;
 	}
 
-	return ra_submit(ra, mapping, filp);
+	return ra_submit(ra, mapping, filp, offset, req_size);
 }
 
 /**
--- linux-next.orig/include/linux/mm.h	2011-12-19 18:12:35.000000000 +0800
+++ linux-next/include/linux/mm.h	2011-12-19 18:13:07.000000000 +0800
@@ -1447,7 +1447,9 @@ void page_cache_async_readahead(struct a
 unsigned long max_sane_readahead(unsigned long nr);
 unsigned long ra_submit(struct file_ra_state *ra,
 			struct address_space *mapping,
-			struct file *filp);
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size);
 
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-next.orig/mm/filemap.c	2011-12-19 18:12:35.000000000 +0800
+++ linux-next/mm/filemap.c	2011-12-19 18:13:07.000000000 +0800
@@ -1597,11 +1597,12 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
+	ra->pattern = RA_PATTERN_MMAP_AROUND;
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
 	ra->size = ra_pages;
 	ra->async_size = ra_pages / 4;
-	ra_submit(ra, mapping, file);
+	ra_submit(ra, mapping, file, offset, 1);
 }
 
 /*

* [PATCH 04/10] readahead: tag mmap page fault call sites
From: Wu Fengguang @ 2011-12-19 10:23 UTC
  To: Andrew Morton
  Cc: Andi Kleen, Jan Kara, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-for-mmap --]
[-- Type: text/plain, Size: 2068 bytes --]

Introduce a bit field, ra->for_mmap, for tagging mmap reads.
The tag is cleared immediately after submitting the IO.
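
As an illustration, here is a userspace sketch of an mmap read that
would enter do_sync_mmap_readahead() and get tagged (the file name is
an assumption; error handling elided):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		struct stat st;
		int fd = open("data", O_RDONLY);
		char *p;

		fstat(fd, &st);
		p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

		/* touching each new page faults and may trigger readahead */
		for (off_t i = 0; i < st.st_size; i += 4096)
			(void)*(volatile char *)(p + i);

		munmap(p, st.st_size);
		close(fd);
		return 0;
	}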

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |    1 +
 mm/filemap.c       |    6 +++++-
 mm/readahead.c     |    1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/fs.h	2011-12-16 19:36:12.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-12-16 19:36:12.000000000 +0800
@@ -949,6 +949,7 @@ struct file_ra_state {
 	unsigned int ra_pages;		/* Maximum readahead window */
 	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
 	u8 pattern;			/* one of RA_PATTERN_* */
+	unsigned int for_mmap:1;	/* readahead for mmap accesses */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };
--- linux-next.orig/mm/filemap.c	2011-12-16 19:36:12.000000000 +0800
+++ linux-next/mm/filemap.c	2011-12-16 19:36:12.000000000 +0800
@@ -1578,6 +1578,7 @@ static void do_sync_mmap_readahead(struc
 		return;
 
 	if (VM_SequentialReadHint(vma)) {
+		ra->for_mmap = 1;
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
 		return;
@@ -1597,6 +1598,7 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
+	ra->for_mmap = 1;
 	ra->pattern = RA_PATTERN_MMAP_AROUND;
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
@@ -1622,9 +1624,11 @@ static void do_async_mmap_readahead(stru
 		return;
 	if (ra->mmap_miss > 0)
 		ra->mmap_miss--;
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		ra->for_mmap = 1;
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
 }
 
 /**
--- linux-next.orig/mm/readahead.c	2011-12-16 19:36:12.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-16 19:36:12.000000000 +0800
@@ -259,6 +259,7 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	ra->for_mmap = 0;
 	return actual;
 }
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 05/10] readahead: tag metadata call sites
  2011-12-19 10:23 ` Wu Fengguang
@ 2011-12-19 10:23   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-19 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Jan Kara, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-for-metadata --]
[-- Type: text/plain, Size: 1978 bytes --]

We may be doing more metadata readahead in the future.
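
With the stats patch later in this series applied, the tagging can be
verified in the meta_io column of /debug/readahead/stats, e.g. (a
usage sketch; /mnt/test is a placeholder mount point):

	echo 1 > /debug/readahead/stats_enable
	ls -lR /mnt/test > /dev/null
	cat /debug/readahead/stats	# check the meta_io column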

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/ext3/dir.c      |    4 +++-
 fs/ext4/dir.c      |    4 +++-
 include/linux/fs.h |    1 +
 mm/readahead.c     |    1 +
 4 files changed, 8 insertions(+), 2 deletions(-)

--- linux-next.orig/fs/ext3/dir.c	2011-12-16 19:36:05.000000000 +0800
+++ linux-next/fs/ext3/dir.c	2011-12-16 19:36:13.000000000 +0800
@@ -136,7 +136,9 @@ static int ext3_readdir(struct file * fi
 			pgoff_t index = map_bh.b_blocknr >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
-			if (!ra_has_index(&filp->f_ra, index))
+			if (!ra_has_index(&filp->f_ra, index)) {
+				filp->f_ra.for_metadata = 1;
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
 					index, 1);
+			}
--- linux-next.orig/fs/ext4/dir.c	2011-12-16 19:36:05.000000000 +0800
+++ linux-next/fs/ext4/dir.c	2011-12-16 19:36:13.000000000 +0800
@@ -153,7 +153,9 @@ static int ext4_readdir(struct file *fil
 			pgoff_t index = map.m_pblk >>
 					(PAGE_CACHE_SHIFT - inode->i_blkbits);
-			if (!ra_has_index(&filp->f_ra, index))
+			if (!ra_has_index(&filp->f_ra, index)) {
+				filp->f_ra.for_metadata = 1;
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
 					&filp->f_ra, filp,
 					index, 1);
+			}
--- linux-next.orig/include/linux/fs.h	2011-12-16 19:36:12.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-12-16 19:36:13.000000000 +0800
@@ -950,6 +950,7 @@ struct file_ra_state {
 	u16 mmap_miss;			/* Cache miss stat for mmap accesses */
 	u8 pattern;			/* one of RA_PATTERN_* */
 	unsigned int for_mmap:1;	/* readahead for mmap accesses */
+	unsigned int for_metadata:1;	/* readahead for meta data */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };
--- linux-next.orig/mm/readahead.c	2011-12-16 19:36:12.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-16 19:36:13.000000000 +0800
@@ -260,6 +260,7 @@ unsigned long ra_submit(struct file_ra_s
 					ra->start, ra->size, ra->async_size);
 
 	ra->for_mmap = 0;
+	ra->for_metadata = 0;
 	return actual;
 }
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 06/10] readahead: add vfs/readahead tracing event
  2011-12-19 10:23 ` Wu Fengguang
@ 2011-12-19 10:23   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-19 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra, Jan Kara,
	Rik van Riel, Steven Rostedt, Wu Fengguang,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 5682 bytes --]

This is very useful for verifying whether the readahead algorithms are
working as expected.

Example output:

# echo 1 > /debug/tracing/events/vfs/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace  # trimmed output
pattern=initial bdi=0:16 ino=100177 req=0+2 ra=0+4-2 async=0 actual=4
pattern=subsequent bdi=0:16 ino=100177 req=2+2 ra=4+8-8 async=1 actual=8
pattern=subsequent bdi=0:16 ino=100177 req=4+2 ra=12+16-16 async=1 actual=16
pattern=subsequent bdi=0:16 ino=100177 req=12+2 ra=28+32-32 async=1 actual=32
pattern=subsequent bdi=0:16 ino=100177 req=28+2 ra=60+60-60 async=1 actual=24
pattern=subsequent bdi=0:16 ino=100177 req=60+2 ra=120+60-60 async=1 actual=0
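
Decoding the second line above, as a worked example: req=2+2 is the
triggering read (2 pages starting at page 2); ra=4+8-8 is the readahead
decision (start=4, size=8, async_size=8, i.e. the whole window is
marked for async pipelining); async=1 because start > offset, meaning
the new window lies entirely ahead of the request (an async readahead
triggered by the PG_readahead mark rather than a cache miss); and
actual=8 pages of IO were submitted.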

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/Makefile                |    1 
 fs/trace.c                 |    2 
 include/trace/events/vfs.h |   77 +++++++++++++++++++++++++++++++++++
 mm/readahead.c             |   24 ++++++++++
 4 files changed, 104 insertions(+)

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/vfs.h	2011-12-16 19:36:14.000000000 +0800
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
+#include <linux/tracepoint.h>
+
+#define READAHEAD_PATTERNS						   \
+			{ RA_PATTERN_INITIAL,		"initial"	}, \
+			{ RA_PATTERN_SUBSEQUENT,	"subsequent"	}, \
+			{ RA_PATTERN_CONTEXT,		"context"	}, \
+			{ RA_PATTERN_MMAP_AROUND,	"around"	}, \
+			{ RA_PATTERN_FADVISE,		"fadvise"	}, \
+			{ RA_PATTERN_OVERSIZE,		"oversize"	}, \
+			{ RA_PATTERN_RANDOM,		"random"	}, \
+			{ RA_PATTERN_ALL,		"all"		}
+
+TRACE_EVENT(readahead,
+	TP_PROTO(struct address_space *mapping,
+		 pgoff_t offset,
+		 unsigned long req_size,
+		 enum readahead_pattern pattern,
+		 pgoff_t start,
+		 unsigned long size,
+		 unsigned long async_size,
+		 unsigned int actual),
+
+	TP_ARGS(mapping, offset, req_size, pattern, start, size, async_size,
+		actual),
+
+	TP_STRUCT__entry(
+		__array(char,		bdi, 32)
+		__field(ino_t,		ino)
+		__field(pgoff_t,	offset)
+		__field(unsigned long,	req_size)
+		__field(unsigned int,	pattern)
+		__field(pgoff_t,	start)
+		__field(unsigned int,	size)
+		__field(unsigned int,	async_size)
+		__field(unsigned int,	actual)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->bdi,
+			dev_name(mapping->backing_dev_info->dev), 32);
+		__entry->ino		= mapping->host->i_ino;
+		__entry->offset		= offset;
+		__entry->req_size	= req_size;
+		__entry->pattern	= pattern;
+		__entry->start		= start;
+		__entry->size		= size;
+		__entry->async_size	= async_size;
+		__entry->actual		= actual;
+	),
+
+	TP_printk("pattern=%s bdi=%s ino=%lu "
+		  "req=%lu+%lu ra=%lu+%d-%d async=%d actual=%d",
+			__print_symbolic(__entry->pattern, READAHEAD_PATTERNS),
+			__entry->bdi,
+			__entry->ino,
+			__entry->offset,
+			__entry->req_size,
+			__entry->start,
+			__entry->size,
+			__entry->async_size,
+			__entry->start > __entry->offset,
+			__entry->actual)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux-next.orig/mm/readahead.c	2011-12-16 19:36:13.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-16 19:36:14.000000000 +0800
@@ -17,6 +17,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
+#include <trace/events/vfs.h>
 
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
@@ -32,6 +33,21 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+static inline void readahead_event(struct address_space *mapping,
+				   pgoff_t offset,
+				   unsigned long req_size,
+				   bool for_mmap,
+				   bool for_metadata,
+				   enum readahead_pattern pattern,
+				   pgoff_t start,
+				   unsigned long size,
+				   unsigned long async_size,
+				   int actual)
+{
+	trace_readahead(mapping, offset, req_size,
+			pattern, start, size, async_size, actual);
+}
+
 /*
  * see if a page needs releasing upon read_cache_pages() failure
  * - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -228,6 +244,9 @@ int force_page_cache_readahead(struct ad
 			ret = err;
 			break;
 		}
+		readahead_event(mapping, offset, nr_to_read, 0, 0,
+				RA_PATTERN_FADVISE, offset, this_chunk, 0,
+				err);
 		ret += err;
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
@@ -259,6 +278,11 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	readahead_event(mapping, offset, req_size,
+			ra->for_mmap, ra->for_metadata,
+			ra->pattern, ra->start, ra->size, ra->async_size,
+			actual);
+
 	ra->for_mmap = 0;
 	ra->for_metadata = 0;
 	return actual;
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/fs/trace.c	2011-12-16 19:36:14.000000000 +0800
@@ -0,0 +1,2 @@
+#define CREATE_TRACE_POINTS
+#include <trace/events/vfs.h>
--- linux-next.orig/fs/Makefile	2011-12-16 19:36:05.000000000 +0800
+++ linux-next/fs/Makefile	2011-12-16 19:36:14.000000000 +0800
@@ -48,6 +48,7 @@ obj-$(CONFIG_NFS_COMMON)	+= nfs_common/
 obj-$(CONFIG_GENERIC_ACL)	+= generic_acl.o
 
 obj-$(CONFIG_FHANDLE)		+= fhandle.o
+obj-$(CONFIG_TRACEPOINTS)	+= trace.o
 
 obj-y				+= quota/
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 07/10] readahead: add /debug/readahead/stats
  2011-12-19 10:23 ` Wu Fengguang
@ 2011-12-19 10:23   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-19 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Rik van Riel, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 9156 bytes --]

The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
and will remain inactive by default.

It can be enabled/disabled at runtime through the debugfs interface:

	echo 1 > /debug/readahead/stats_enable
	echo 0 > /debug/readahead/stats_enable

The added overhead is two readahead_stats() calls per readahead, which
is a trivial cost unless there are concurrent random reads on super
fast SSDs. The counters are per-cpu, so updating them is cheap; they
are only summed up when the stats file is read. Considering that
normal users won't need this except when debugging performance
problems, it's disabled by default, so it looks reasonable to keep
this debug code simple rather than trying to improve its scalability
further.

Example output:
(taken from a freshly booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size
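
As a worked example, the "around" row above reads: 184 mmap read-around
readaheads were requested, all 184 issued real IO (all synchronous, all
from mmap page faults), 177 of them overlapped already-cached pages,
the average requested size was 58 pages, and the average IO actually
submitted was 53 pages, the difference being pages already in cache.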

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  193 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 208 insertions(+)

--- linux-next.orig/mm/readahead.c	2011-12-16 19:59:36.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-16 20:00:37.000000000 +0800
@@ -33,6 +33,193 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/ftrace_event.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+	READAHEAD_PATTERNS
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+static DEFINE_PER_CPU(unsigned long[RA_PATTERN_ALL][RA_ACCOUNT_MAX], ra_stat);
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    bool for_mmap,
+			    bool for_metadata,
+			    enum readahead_pattern pattern,
+			    pgoff_t start,
+			    unsigned long size,
+			    unsigned long async_size,
+			    int actual)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+	preempt_disable();
+
+	__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_COUNT]);
+	__this_cpu_add(ra_stat[pattern][RA_ACCOUNT_SIZE], size);
+	__this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ASYNC_SIZE], async_size);
+	__this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ACTUAL], actual);
+
+	if (start + size >= eof)
+		__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_EOF]);
+	if (actual < size)
+		__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_CACHE_HIT]);
+
+	if (actual) {
+		__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_IOCOUNT]);
+
+		if (start <= offset && offset < start + size)
+			__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_SYNC]);
+
+		if (for_mmap)
+			__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_MMAP]);
+		if (for_metadata)
+			__this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_METADATA]);
+	}
+
+	preempt_enable();
+}
+
+static void ra_stats_clear(void)
+{
+	int cpu;
+	int i, j;
+
+	for_each_online_cpu(cpu)
+		for (i = 0; i < RA_PATTERN_ALL; i++)
+			for (j = 0; j < RA_ACCOUNT_MAX; j++)
+				per_cpu(ra_stat[i][j], cpu) = 0;
+}
+
+static void ra_stats_sum(unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+	int cpu;
+	int i, j;
+
+	for_each_online_cpu(cpu)
+		for (i = 0; i < RA_PATTERN_ALL; i++)
+			for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+				unsigned long n = per_cpu(ra_stat[i][j], cpu);
+				ra_stats[i][j] += n;
+				ra_stats[RA_PATTERN_ALL][j] += n;
+			}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	unsigned long i;
+	unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	memset(ra_stats, 0, sizeof(ra_stats));
+	ra_stats_sum(ra_stats);
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
+			   "%10lu %10lu %10lu %10lu %10lu\n",
+				ra_pattern_names[i].name,
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	ra_stats_clear();
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
 static inline void readahead_event(struct address_space *mapping,
 				   pgoff_t offset,
 				   unsigned long req_size,
@@ -44,6 +231,12 @@ static inline void readahead_event(struc
 				   unsigned long async_size,
 				   int actual)
 {
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable)
+		readahead_stats(mapping, offset, req_size,
+				for_mmap, for_metadata,
+				pattern, start, size, async_size, actual);
+#endif
 	trace_readahead(mapping, offset, req_size,
 			pattern, start, size, async_size, actual);
 }
--- linux-next.orig/mm/Kconfig	2011-12-16 19:59:36.000000000 +0800
+++ linux-next/mm/Kconfig	2011-12-16 19:59:44.000000000 +0800
@@ -396,3 +396,18 @@ config FRONTSWAP
 	  and swap data is stored as normal on the matching swap device.
 
 	  If unsure, say Y to enable frontswap.
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable



^ permalink raw reply	[flat|nested] 36+ messages in thread

+++ linux-next/mm/Kconfig	2011-12-16 19:59:44.000000000 +0800
@@ -396,3 +396,18 @@ config FRONTSWAP
 	  and swap data is stored as normal on the matching swap device.
 
 	  If unsure, say Y to enable frontswap.
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 08/10] readahead: basic support for backwards prefetching
  2011-12-19 10:23 ` Wu Fengguang
@ 2011-12-19 10:23   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-19 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Li Shaohua, Jan Kara, Wu Fengguang,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-backwards.patch --]
[-- Type: text/plain, Size: 5588 bytes --]

Add the backwards prefetching feature. It stays pretty simple as long
as async prefetching and interleaved reads are not supported.

tail(1) and tac(1) are observed to issue this reverse read pattern:

tail-3501  [006]   111.881191: readahead: readahead-random(bdi=0:16, ino=1548450, req=750+1, ra=750+1-0, async=0) = 1
tail-3501  [006]   111.881506: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=748+2, ra=746+5-0, async=0) = 4
tail-3501  [006]   111.882021: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=744+2, ra=726+25-0, async=0) = 20
tail-3501  [006]   111.883713: readahead: readahead-backwards(bdi=0:16, ino=1548450, req=724+2, ra=626+125-0, async=0) = 100

 tac-3528  [001]   118.671924: readahead: readahead-random(bdi=0:16, ino=1548445, req=750+1, ra=750+1-0, async=0) = 1
 tac-3528  [001]   118.672371: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=748+2, ra=746+5-0, async=0) = 4
 tac-3528  [001]   118.673039: readahead: readahead-backwards(bdi=0:16, ino=1548445, req=744+2, ra=726+25-0, async=0) = 20

Here is the behavior with an 8-page read sequence from 10000 down to 0.
(The readahead size is a bit large since it's an NFS mount.)

readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8
readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32
readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128
readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368

And a simple 1-page read sequence from 10000 down to 0.

readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1
readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4
readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16
readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64
readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 444
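
For the curious, here is a minimal userspace sketch (not from the
patch; grow() is a made-up, simplified stand-in for the kernel's
get_next_ra_size(), which in this tree roughly quadruples small windows
and doubles larger ones, capped at max). It reproduces the ra=start+size
progression of the 8-page NFS trace above; set size = 1 to reproduce
the 1-page sequence:

#include <stdio.h>

/* simplified stand-in for get_next_ra_size() */
static unsigned long grow(unsigned long size, unsigned long max)
{
	size = (size < max / 16) ? size * 4 : size * 2;
	return size < max ? size : max;
}

int main(void)
{
	unsigned long start = 10000, size = 8, max = 1920;

	while (start) {
		size = grow(size, max);
		if (size > start) {
			/* would cross the file head: clamp, stop at 0 */
			size = start < max ? start : max;
			start = 0;
		} else
			start -= size;	/* slide the window backwards */
		printf("ra=%lu+%lu\n", start, size);
	}
	return 0;
}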

CC: Andi Kleen <andi@firstfloor.org>
CC: Li Shaohua <shaohua.li@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h         |    2 ++
 include/trace/events/vfs.h |    1 +
 mm/readahead.c             |   20 ++++++++++++++++++++
 3 files changed, 23 insertions(+)

--- linux-next.orig/include/linux/fs.h	2011-12-19 16:09:45.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-12-19 16:10:08.000000000 +0800
@@ -970,6 +970,7 @@ struct file_ra_state {
  *				streams.
  * RA_PATTERN_MMAP_AROUND	read-around on mmap page faults
  *				(w/o any sequential/random hints)
+ * RA_PATTERN_BACKWARDS		reverse reading detected
  * RA_PATTERN_FADVISE		triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
  * RA_PATTERN_OVERSIZE		a random read larger than max readahead size,
  *				do max readahead to break down the read size
@@ -980,6 +981,7 @@ enum readahead_pattern {
 	RA_PATTERN_SUBSEQUENT,
 	RA_PATTERN_CONTEXT,
 	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_BACKWARDS,
 	RA_PATTERN_FADVISE,
 	RA_PATTERN_OVERSIZE,
 	RA_PATTERN_RANDOM,
--- linux-next.orig/mm/readahead.c	2011-12-19 16:09:45.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-19 16:10:08.000000000 +0800
@@ -686,6 +686,26 @@ ondemand_readahead(struct address_space 
 	}
 
 	/*
+	 * backwards reading
+	 */
+	if (offset < ra->start && offset + req_size >= ra->start) {
+		ra->pattern = RA_PATTERN_BACKWARDS;
+		ra->size = get_next_ra_size(ra, max);
+		if (ra->size > ra->start) {
+			/*
+			 * ra->start may be concurrently set to some huge
+			 * value, the min() at least avoids submitting huge IO
+			 * in this race condition
+			 */
+			ra->size = min(ra->start, max);
+			ra->start = 0;
+		} else
+			ra->start -= ra->size;
+		ra->async_size = 0;
+		goto readit;
+	}
+
+	/*
 	 * Query the page cache and look for the traces(cached history pages)
 	 * that a sequential stream would leave behind.
 	 */
--- linux-next.orig/include/trace/events/vfs.h	2011-12-19 16:09:45.000000000 +0800
+++ linux-next/include/trace/events/vfs.h	2011-12-19 16:09:45.000000000 +0800
@@ -14,6 +14,7 @@
 			{ RA_PATTERN_SUBSEQUENT,	"subsequent"	}, \
 			{ RA_PATTERN_CONTEXT,		"context"	}, \
 			{ RA_PATTERN_MMAP_AROUND,	"around"	}, \
+			{ RA_PATTERN_BACKWARDS,		"backwards"	}, \
 			{ RA_PATTERN_FADVISE,		"fadvise"	}, \
 			{ RA_PATTERN_OVERSIZE,		"oversize"	}, \
 			{ RA_PATTERN_RANDOM,		"random"	}, \



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 09/10] readahead: dont do start-of-file readahead after lseek()
  2011-12-19 10:23 ` Wu Fengguang
@ 2011-12-19 10:23   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-19 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Rik van Riel, Linus Torvalds, Wu Fengguang,
	Linux Memory Management List, linux-fsdevel, LKML

[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2250 bytes --]

Some applications (e.g. blkid, id3tool) seek around a file to extract
bits of information. For example, blkid does

	     seek to	0
	     read	1024
	     seek to	1536
	     read	16384

The start-of-file readahead heuristic is wrong for such applications,
whose access pattern is identifiable by the leading lseek() calls.

So test-and-set an lseek flag in file_ra_state on lseek() and skip the
start-of-file readahead when it is set. Proposed by Linus.
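
As an illustration, a hypothetical userspace demo (not part of the
patch; the filename is made up) of the access pattern this targets:

#include <fcntl.h>
#include <unistd.h>

/*
 * blkid-style probing: every read() follows an lseek(), so with this
 * patch the kernel serves them as plain random reads instead of
 * pulling in a full start-of-file readahead window.
 */
int main(void)
{
	char buf[16384];
	int fd = open("testfile", O_RDONLY);

	if (fd < 0)
		return 1;
	lseek(fd, 0, SEEK_SET);		/* lseek_execute() sets f_ra.lseek */
	if (read(fd, buf, 1024) < 0)	/* reads just these pages */
		return 1;
	lseek(fd, 1536, SEEK_SET);	/* flag set again */
	if (read(fd, buf, sizeof(buf)) < 0)
		return 1;
	close(fd);
	return 0;
}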

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/read_write.c    |    3 +++
 include/linux/fs.h |    1 +
 mm/readahead.c     |    4 ++++
 3 files changed, 8 insertions(+)

--- linux-next.orig/mm/readahead.c	2011-12-19 16:09:45.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-19 16:10:06.000000000 +0800
@@ -476,6 +476,7 @@ unsigned long ra_submit(struct file_ra_s
 			ra->pattern, ra->start, ra->size, ra->async_size,
 			actual);
 
+	ra->lseek = 0;
 	ra->for_mmap = 0;
 	ra->for_metadata = 0;
 	return actual;
@@ -627,6 +628,8 @@ ondemand_readahead(struct address_space 
 	 * start of file
 	 */
 	if (!offset) {
+		if (ra->lseek && req_size < max)
+			goto random_read;
 		ra->pattern = RA_PATTERN_INITIAL;
 		goto initial_readahead;
 	}
@@ -712,6 +715,7 @@ ondemand_readahead(struct address_space 
 	if (try_context_readahead(mapping, ra, offset, req_size, max))
 		goto readit;
 
+random_read:
 	/*
 	 * standalone, small random read
 	 */
--- linux-next.orig/fs/read_write.c	2011-12-18 14:06:28.000000000 +0800
+++ linux-next/fs/read_write.c	2011-12-19 16:09:45.000000000 +0800
@@ -47,6 +47,9 @@ static loff_t lseek_execute(struct file 
 		file->f_pos = offset;
 		file->f_version = 0;
 	}
+
+	file->f_ra.lseek = 1;
+
 	return offset;
 }
 
--- linux-next.orig/include/linux/fs.h	2011-12-19 16:09:45.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-12-19 16:09:45.000000000 +0800
@@ -951,6 +951,7 @@ struct file_ra_state {
 	u8 pattern;			/* one of RA_PATTERN_* */
 	unsigned int for_mmap:1;	/* readahead for mmap accesses */
 	unsigned int for_metadata:1;	/* readahead for meta data */
+	unsigned int lseek:1;		/* this read has a leading lseek */
 
 	loff_t prev_pos;		/* Cache last read() position */
 };



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 10/10] readahead: snap readahead request to EOF
  2011-12-19 10:23 ` Wu Fengguang
@ 2011-12-19 10:23   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-19 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Jan Kara, Wu Fengguang, Linux Memory Management List,
	linux-fsdevel, LKML

[-- Attachment #1: readahead-eof --]
[-- Type: text/plain, Size: 1623 bytes --]

If the file size is 20kb and the readahead request is [0, 16kb),
it's better to expand the request to [0, 20kb), which will likely
save one followup I/O for the ending [16kb, 20kb).

If the readahead request already covers EOF, trim it down to EOF.
Also don't set the PG_readahead mark, to avoid an unnecessary future
invocation of the readahead code.

This special handling looks worthwhile because small to medium sized
files are pretty common.
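
A quick walk-through with 4k pages and, as an assumption for
illustration, the default 128k readahead window (ra_pages = 32): the
20kb file ends at eof = 5 pages and the request is start = 0, size = 4.
snap_to_eof() below probes with size + min(size/2, ra_pages/4) =
4 + min(2, 8) = 6 pages; since 0 + 6 > 5, the request is snapped to
ra->size = 5 and ra->async_size = 0, so the tail page goes out in the
same IO and no readahead mark is planted.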

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/readahead.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

--- linux-next.orig/mm/readahead.c	2011-12-19 16:09:45.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-19 16:10:04.000000000 +0800
@@ -457,6 +457,25 @@ unsigned long max_sane_readahead(unsigne
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
+static void snap_to_eof(struct file_ra_state *ra, struct address_space *mapping)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+	pgoff_t start = ra->start;
+	unsigned int size = ra->size;
+
+	/*
+	 * skip backwards and random reads
+	 */
+	if (ra->pattern > RA_PATTERN_MMAP_AROUND)
+		return;
+
+	size += min(size / 2, ra->ra_pages / 4);
+	if (start + size > eof) {
+		ra->size = eof - start;
+		ra->async_size = 0;
+	}
+}
+
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */
@@ -468,6 +487,8 @@ unsigned long ra_submit(struct file_ra_s
 {
 	int actual;
 
+	snap_to_eof(ra, mapping);
+
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 07/10 v2] readahead: add /debug/readahead/stats
  2011-12-19 10:23   ` Wu Fengguang
@ 2011-12-23 12:59     ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-23 12:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Rik van Riel, Linux Memory Management List, linux-fsdevel, LKML,
	Jan Kara, Dave Chinner

The accounting code is compiled in by default (CONFIG_READAHEAD_STATS=y)
but remains inactive until explicitly enabled at runtime.

It can be enabled/disabled through the debugfs interface:

	echo 1 > /debug/readahead/stats_enable
	echo 0 > /debug/readahead/stats_enable
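
(Here /debug is shorthand for wherever debugfs is mounted, usually
/sys/kernel/debug.)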

Example output:
(taken from a freshly booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IOs actually submitted
- io_size	average size of each readahead IO, in pages

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 217 insertions(+)

This switches to the percpu_counter facilities.
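
Compared to the raw per-CPU arrays in the previous version, writers now
batch updates CPU-locally and fold them into the shared count only
every RA_STAT_BATCH (INT_MAX/2) events, so the fast path needs no
explicit preempt_disable(); the read side regains exactness by walking
all CPUs in percpu_counter_sum(), and resetting becomes a simple
percpu_counter_set().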

--- linux-next.orig/mm/readahead.c	2011-12-23 20:29:14.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-23 20:50:04.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/ftrace_event.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+	READAHEAD_PATTERNS
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH	(INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+	add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    bool for_mmap,
+			    bool for_metadata,
+			    enum readahead_pattern pattern,
+			    pgoff_t start,
+			    unsigned long size,
+			    unsigned long async_size,
+			    int actual)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+	inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+	add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+	add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+	add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+	if (start + size >= eof)
+		inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+	if (actual < size)
+		inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+	if (actual) {
+		inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+		if (start <= offset && offset < start + size)
+			inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+		if (for_mmap)
+			inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+		if (for_metadata)
+			inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+	}
+}
+
+static void readahead_stats_reset(void)
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+			s64 n = percpu_counter_sum(&ra_stat[i][j]);
+			ra_stats[i][j] += n;
+			ra_stats[RA_PATTERN_ALL][j] += n;
+		}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+	int i;
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	memset(ra_stats, 0, sizeof(ra_stats));
+	readahead_stats_sum(ra_stats);
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+			   "%10lld %10lld %10lld %10lld %10lld\n",
+				ra_pattern_names[i].name,
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	readahead_stats_reset();
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+	int i, j;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_init(&ra_stat[i][j], 0);
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
 static inline void readahead_event(struct address_space *mapping,
 				   pgoff_t offset,
 				   unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
 				   unsigned long async_size,
 				   int actual)
 {
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable)
+		readahead_stats(mapping, offset, req_size,
+				for_mmap, for_metadata,
+				pattern, start, size, async_size, actual);
+#endif
 	trace_readahead(mapping, offset, req_size,
 			pattern, start, size, async_size, actual);
 }
--- linux-next.orig/mm/Kconfig	2011-12-23 20:28:06.000000000 +0800
+++ linux-next/mm/Kconfig	2011-12-23 20:29:31.000000000 +0800
@@ -396,3 +396,18 @@ config FRONTSWAP
 	  and swap data is stored as normal on the matching swap device.
 
 	  If unsure, say Y to enable frontswap.
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 07/10 v2] readahead: add /debug/readahead/stats
@ 2011-12-23 12:59     ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2011-12-23 12:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
	Rik van Riel, Linux Memory Management List, linux-fsdevel, LKML,
	Jan Kara, Dave Chinner

The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
and will remain inactive by default.

It can be runtime enabled/disabled through the debugfs interface

	echo 1 > /debug/readahead/stats_enable
	echo 0 > /debug/readahead/stats_enable

Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 217 insertions(+)

This switches to the percpu_counter facilities.

--- linux-next.orig/mm/readahead.c	2011-12-23 20:29:14.000000000 +0800
+++ linux-next/mm/readahead.c	2011-12-23 20:50:04.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/ftrace_event.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+	READAHEAD_PATTERNS
+};
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH	(INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+	add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    bool for_mmap,
+			    bool for_metadata,
+			    enum readahead_pattern pattern,
+			    pgoff_t start,
+			    unsigned long size,
+			    unsigned long async_size,
+			    int actual)
+{
+	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+	inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+	add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+	add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+	add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+	if (start + size >= eof)
+		inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+	if (actual < size)
+		inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+	if (actual) {
+		inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+		if (start <= offset && offset < start + size)
+			inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+		if (for_mmap)
+			inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+		if (for_metadata)
+			inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+	}
+}
+
+static void readahead_stats_reset(void)
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+	int i, j;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+			s64 n = percpu_counter_sum(&ra_stat[i][j]);
+			ra_stats[i][j] += n;
+			ra_stats[RA_PATTERN_ALL][j] += n;
+		}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+	int i;
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	memset(ra_stats, 0, sizeof(ra_stats));
+	readahead_stats_sum(ra_stats);
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+			   "%10lld %10lld %10lld %10lld %10lld\n",
+				ra_pattern_names[i].name,
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	readahead_stats_reset();
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+	int i, j;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	for (i = 0; i < RA_PATTERN_ALL; i++)
+		for (j = 0; j < RA_ACCOUNT_MAX; j++)
+			percpu_counter_init(&ra_stat[i][j], 0);
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
 static inline void readahead_event(struct address_space *mapping,
 				   pgoff_t offset,
 				   unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
 				   unsigned long async_size,
 				   int actual)
 {
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable)
+		readahead_stats(mapping, offset, req_size,
+				for_mmap, for_metadata,
+				pattern, start, size, async_size, actual);
+#endif
 	trace_readahead(mapping, offset, req_size,
 			pattern, start, size, async_size, actual);
 }
--- linux-next.orig/mm/Kconfig	2011-12-23 20:28:06.000000000 +0800
+++ linux-next/mm/Kconfig	2011-12-23 20:29:31.000000000 +0800
@@ -396,3 +396,18 @@ config FRONTSWAP
 	  and swap data is stored as normal on the matching swap device.
 
 	  If unsure, say Y to enable frontswap.
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  This provides accounting facilities for page cache readahead events.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable
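
A note on consuming the interface: the stats file is plain ASCII in the
fixed column order printed by readahead_stats_show(), so a workload
harness can sample it directly rather than eyeballing cat output. A
minimal userspace sketch (illustrative only, not part of the patch; it
assumes debugfs is mounted at /sys/kernel/debug and the eleven columns
shown above):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/readahead/stats", "r");
	char line[256];

	if (!f) {
		perror("readahead/stats");
		return 1;
	}
	if (!fgets(line, sizeof(line), f)) {	/* skip the header line */
		fclose(f);
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char pattern[16];
		long long v[10];

		/* pattern name plus the ten numeric columns */
		if (sscanf(line, "%15s %lld %lld %lld %lld %lld "
				 "%lld %lld %lld %lld %lld", pattern,
			   &v[0], &v[1], &v[2], &v[3], &v[4],
			   &v[5], &v[6], &v[7], &v[8], &v[9]) == 11)
			printf("%-10s io=%lld io_size=%lld\n",
			       pattern, v[3], v[9]);	/* the two key columns */
	}
	fclose(f);
	return 0;
}

Note that writing any value to the stats file resets all counters:
readahead_stats_write() ignores the buffer contents, so the "echo 0"
in the help text is purely a convention.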

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 07/10 v2] readahead: add /debug/readahead/stats
  2011-12-23 12:59     ` Wu Fengguang
@ 2011-12-23 13:48       ` Jan Kara
  -1 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2011-12-23 13:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
	Peter Zijlstra, Rik van Riel, Linux Memory Management List,
	linux-fsdevel, LKML, Jan Kara, Dave Chinner

On Fri 23-12-11 20:59:12, Wu Fengguang wrote:
> The accounting code is compiled in by default (CONFIG_READAHEAD_STATS=y),
> but stays inactive until enabled at runtime.
> 
> It can be enabled and disabled at runtime through the debugfs interface:
> 
> 	echo 1 > /debug/readahead/stats_enable
> 	echo 0 > /debug/readahead/stats_enable
> 
> Example output:
> (taken from a freshly booted NFS-ROOT console box with rsize=524288)
> 
> $ cat /debug/readahead/stats
> pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
> initial           702        511          0        692        692          0          0          2          0          2
> subsequent          7          0          1          7          1          1          0         23         22         23
> context           160        161          0          2          0          1          0          0          0         16
> around            184        184        177        184        184        184          0         58          0         53
> backwards           2          0          2          2          2          0          0          4          0          3
> fadvise          2593         47          8       2588       2588          0          0          1          0          1
> oversize            0          0          0          0          0          0          0          0          0          0
> random             45         20          0         44         44          0          0          1          0          1
> all              3697        923        188       3519       3511        186          0          4          0          4
> 
> The two most important columns are
> - io		number of readahead IO
> - io_size	average readahead IO size
> 
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Jens Axboe <axboe@kernel.dk>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  Looks good to me.

  Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/Kconfig     |   15 +++
>  mm/readahead.c |  202 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 217 insertions(+)
> 
> This switches to the percpu_counter facilities.
> 
> --- linux-next.orig/mm/readahead.c	2011-12-23 20:29:14.000000000 +0800
> +++ linux-next/mm/readahead.c	2011-12-23 20:50:04.000000000 +0800
> @@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
>  
>  #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
>  
> +#ifdef CONFIG_READAHEAD_STATS
> +#include <linux/ftrace_event.h>
> +#include <linux/seq_file.h>
> +#include <linux/debugfs.h>
> +
> +static u32 readahead_stats_enable __read_mostly;
> +
> +static const struct trace_print_flags ra_pattern_names[] = {
> +	READAHEAD_PATTERNS
> +};
> +
> +enum ra_account {
> +	/* number of readaheads */
> +	RA_ACCOUNT_COUNT,	/* readahead request */
> +	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
> +	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
> +	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
> +	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
> +	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
> +	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
> +	/* number of readahead pages */
> +	RA_ACCOUNT_SIZE,	/* readahead size */
> +	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
> +	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
> +	/* end mark */
> +	RA_ACCOUNT_MAX,
> +};
> +
> +#define RA_STAT_BATCH	(INT_MAX / 2)
> +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
> +
> +static inline void add_ra_stat(int i, int j, s64 amount)
> +{
> +	__percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
> +}
> +
> +static inline void inc_ra_stat(int i, int j)
> +{
> +	add_ra_stat(i, j, 1);
> +}
> +
> +static void readahead_stats(struct address_space *mapping,
> +			    pgoff_t offset,
> +			    unsigned long req_size,
> +			    bool for_mmap,
> +			    bool for_metadata,
> +			    enum readahead_pattern pattern,
> +			    pgoff_t start,
> +			    unsigned long size,
> +			    unsigned long async_size,
> +			    int actual)
> +{
> +	pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
> +
> +	inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
> +	add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
> +	add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
> +	add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
> +
> +	if (start + size >= eof)
> +		inc_ra_stat(pattern, RA_ACCOUNT_EOF);
> +	if (actual < size)
> +		inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
> +
> +	if (actual) {
> +		inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
> +
> +		if (start <= offset && offset < start + size)
> +			inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
> +
> +		if (for_mmap)
> +			inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
> +		if (for_metadata)
> +			inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
> +	}
> +}
> +
> +static void readahead_stats_reset(void)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++)
> +			percpu_counter_set(&ra_stat[i][j], 0);
> +}
> +
> +static void
> +readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> +{
> +	int i, j;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> +			s64 n = percpu_counter_sum(&ra_stat[i][j]);
> +			ra_stats[i][j] += n;
> +			ra_stats[RA_PATTERN_ALL][j] += n;
> +		}
> +}
> +
> +static int readahead_stats_show(struct seq_file *s, void *_)
> +{
> +	long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
> +	int i;
> +
> +	seq_printf(s,
> +		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
> +		   "pattern", "readahead", "eof_hit", "cache_hit",
> +		   "io", "sync_io", "mmap_io", "meta_io",
> +		   "size", "async_size", "io_size");
> +
> +	memset(ra_stats, 0, sizeof(ra_stats));
> +	readahead_stats_sum(ra_stats);
> +
> +	for (i = 0; i < RA_PATTERN_MAX; i++) {
> +		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
> +		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
> +		/*
> +		 * avoid division-by-zero
> +		 */
> +		if (count == 0)
> +			count = 1;
> +		if (iocount == 0)
> +			iocount = 1;
> +
> +		seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
> +			   "%10lld %10lld %10lld %10lld %10lld\n",
> +				ra_pattern_names[i].name,
> +				ra_stats[i][RA_ACCOUNT_COUNT],
> +				ra_stats[i][RA_ACCOUNT_EOF],
> +				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
> +				ra_stats[i][RA_ACCOUNT_IOCOUNT],
> +				ra_stats[i][RA_ACCOUNT_SYNC],
> +				ra_stats[i][RA_ACCOUNT_MMAP],
> +				ra_stats[i][RA_ACCOUNT_METADATA],
> +				ra_stats[i][RA_ACCOUNT_SIZE] / count,
> +				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
> +				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
> +	}
> +
> +	return 0;
> +}
> +
> +static int readahead_stats_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, readahead_stats_show, NULL);
> +}
> +
> +static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
> +				     size_t size, loff_t *offset)
> +{
> +	readahead_stats_reset();
> +	return size;
> +}
> +
> +static const struct file_operations readahead_stats_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= readahead_stats_open,
> +	.write		= readahead_stats_write,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static int __init readahead_create_debugfs(void)
> +{
> +	struct dentry *root;
> +	struct dentry *entry;
> +	int i, j;
> +
> +	root = debugfs_create_dir("readahead", NULL);
> +	if (!root)
> +		goto out;
> +
> +	entry = debugfs_create_file("stats", 0644, root,
> +				    NULL, &readahead_stats_fops);
> +	if (!entry)
> +		goto out;
> +
> +	entry = debugfs_create_bool("stats_enable", 0644, root,
> +				    &readahead_stats_enable);
> +	if (!entry)
> +		goto out;
> +
> +	for (i = 0; i < RA_PATTERN_ALL; i++)
> +		for (j = 0; j < RA_ACCOUNT_MAX; j++)
> +			percpu_counter_init(&ra_stat[i][j], 0);
> +
> +	return 0;
> +out:
> +	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
> +	return -ENOMEM;
> +}
> +
> +late_initcall(readahead_create_debugfs);
> +#endif
> +
>  static inline void readahead_event(struct address_space *mapping,
>  				   pgoff_t offset,
>  				   unsigned long req_size,
> @@ -44,6 +240,12 @@ static inline void readahead_event(struc
>  				   unsigned long async_size,
>  				   int actual)
>  {
> +#ifdef CONFIG_READAHEAD_STATS
> +	if (readahead_stats_enable)
> +		readahead_stats(mapping, offset, req_size,
> +				for_mmap, for_metadata,
> +				pattern, start, size, async_size, actual);
> +#endif
>  	trace_readahead(mapping, offset, req_size,
>  			pattern, start, size, async_size, actual);
>  }
> --- linux-next.orig/mm/Kconfig	2011-12-23 20:28:06.000000000 +0800
> +++ linux-next/mm/Kconfig	2011-12-23 20:29:31.000000000 +0800
> @@ -396,3 +396,18 @@ config FRONTSWAP
>  	  and swap data is stored as normal on the matching swap device.
>  
>  	  If unsure, say Y to enable frontswap.
> +
> +config READAHEAD_STATS
> +	bool "Collect page cache readahead stats"
> +	depends on DEBUG_FS
> +	default y
> +	help
> +	  This provides accounting facilities for page cache readahead events.
> +
> +	  To do readahead accounting for a workload:
> +
> +	  echo 1 > /sys/kernel/debug/readahead/stats_enable
> +	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
> +	  # run the workload
> +	  cat /sys/kernel/debug/readahead/stats       # check counters
> +	  echo 0 > /sys/kernel/debug/readahead/stats_enable

^ permalink raw reply	[flat|nested] 36+ messages in thread
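
A side note on the percpu_counter switch mentioned in the changelog:
the accounting hot path only updates a per-cpu delta, which is folded
into the shared count once it crosses the batch (RA_STAT_BATCH, i.e.
INT_MAX/2, above), so readahead submission never bounces a shared
cacheline; the seq_file read side pays for an exact
percpu_counter_sum() instead. A minimal self-contained sketch of that
pattern (module-style; the ra_demo name is illustrative, not from the
patch):

#include <linux/module.h>
#include <linux/percpu_counter.h>

static struct percpu_counter ra_demo;

static int __init ra_demo_init(void)
{
	int err = percpu_counter_init(&ra_demo, 0);

	if (err)
		return err;
	/* fast path: the increment stays in the per-cpu delta until
	 * the batch threshold would be exceeded */
	__percpu_counter_add(&ra_demo, 1, INT_MAX / 2);
	/* slow path (the stats file read side): fold every per-cpu
	 * delta for an exact snapshot */
	pr_info("ra_demo = %lld\n", percpu_counter_sum(&ra_demo));
	return 0;
}

static void __exit ra_demo_exit(void)
{
	percpu_counter_destroy(&ra_demo);
}

module_init(ra_demo_init);
module_exit(ra_demo_exit);
MODULE_LICENSE("GPL");

The large batch is also what makes a plain percpu_counter_read()
inaccurate by up to batch * num_online_cpus(), which is why
readahead_stats_show() sums rather than reads.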

