linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/8] readahead stats/tracing, backwards prefetching and more
@ 2011-11-21  9:18 Wu Fengguang
  2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
                   ` (8 more replies)
  0 siblings, 9 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Wu Fengguang, LKML,
	Andi Kleen

Andrew,

I'm getting around to picking up the readahead work again :-)

This first series mainly adds some debug facilities, supports the long-missed
backwards prefetching capability, and picks up some old patches that somehow
got delayed (shame on me).

The next step will be to better handle readahead thrashing. That will require
rewriting part of the algorithms, which is why I'd like to keep the backwards
prefetching simple and stupid for now.

Once (almost) free of readahead thrashing, we'll be in a good position to raise
the default readahead size, which I suspect would be the single most effective
way to improve performance for the large number of casually maintained Linux
file servers.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 10:00   ` Christoph Hellwig
                     ` (2 more replies)
  2011-11-21  9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Li Shaohua,
	Clemens Ladisch, Jens Axboe, Rik van Riel, Wu Fengguang, LKML,
	Andi Kleen

[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 6861 bytes --]

Linus reported a _really_ small & slow (505kB, 15kB/s) USB device on which
blkid runs unpleasantly slowly. He managed to optimize the blkid reads down
to 1kB+16kB, but kernel readahead still turns them into 48kB:

     lseek 0,    read 1024   => readahead 4 pages (start of file)
     lseek 1536, read 16384  => readahead 8 pages (page contiguous)

The readahead heuristics involved here are reasonable ones in general, so the
userspace side is best fixed in blkid with fadvise(RANDOM), as Linus has
already done.

For the kernel part, Linus suggests:
  So maybe we could be less aggressive about read-ahead when the size of
  the device is small? Turning a 16kB read into a 64kB one is a big deal,
  when it's about 15% of the whole device!

This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).

Given that the non-rotational attribute is not always reported, we can use
the disk size as a hint for the max readahead size. This patch uses a formula
that yields the following concrete limits:

        disk size    readahead size
     (scale by 4)      (scale by 2)
               1M                8k
               4M               16k
              16M               32k
              64M               64k
             256M              128k
        --------------------------- (*)
               1G              256k
               4G              512k
              16G             1024k
              64G             2048k
             256G             4096k

(*) Since the default readahead size is 128k, this limit only takes
effect for devices whose size is less than 256M.

The formula was determined from the following data, collected by this script:

	#!/bin/sh

	# please make sure BDEV is not mounted or opened by others
	BDEV=sdb

	for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
	do
		echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
		time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
	done

The principle is that the formula must not limit the readahead size to a
degree that would hurt any device's sequential read performance.

The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to raise its
default readahead size to 2MB, so we don't weigh it heavily in the formula.

SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)

	rasize	1st run		2nd run
	----------------------------------
	  4k	123 MB/s	122 MB/s
	 16k  	153 MB/s	153 MB/s
	 32k	161 MB/s	162 MB/s
	 64k	167 MB/s	168 MB/s
	128k	197 MB/s	197 MB/s
	256k	217 MB/s	217 MB/s
	512k	238 MB/s	234 MB/s
	  1M	251 MB/s	248 MB/s
	  2M	259 MB/s	257 MB/s
==>	  4M	269 MB/s	264 MB/s
	  8M	266 MB/s	266 MB/s

Note that ==> points to the readahead size that yields plateau throughput.

SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)

	rasize  1st             2nd
	--------------------------------
	  4k     41 MB/s         41 MB/s
	 16k     85 MB/s         81 MB/s
	 32k    102 MB/s        109 MB/s
	 64k    125 MB/s        144 MB/s
	128k    183 MB/s        185 MB/s
	256k    216 MB/s        216 MB/s
	512k    216 MB/s        236 MB/s
	1024k   251 MB/s        252 MB/s
	  2M    258 MB/s        258 MB/s
==>	  4M    266 MB/s        266 MB/s
	  8M    266 MB/s        266 MB/s

SSD 30G SanDisk SATA 5000

	  4k	29.6 MB/s	29.6 MB/s	29.6 MB/s
	 16k	52.1 MB/s	52.1 MB/s	52.1 MB/s
	 32k	61.5 MB/s	61.5 MB/s	61.5 MB/s
	 64k	67.2 MB/s	67.2 MB/s	67.1 MB/s
	128k	71.4 MB/s	71.3 MB/s	71.4 MB/s
	256k	73.4 MB/s	73.4 MB/s	73.3 MB/s
==>	512k	74.6 MB/s	74.6 MB/s	74.6 MB/s
	  1M	74.7 MB/s	74.6 MB/s	74.7 MB/s
	  2M	76.1 MB/s	74.6 MB/s	74.6 MB/s

USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165

	  4k	7.9 MB/s 	7.9 MB/s 	7.9 MB/s
	 16k	17.9 MB/s	17.9 MB/s	17.9 MB/s
	 32k	24.5 MB/s	24.5 MB/s	24.5 MB/s
	 64k	28.7 MB/s	28.7 MB/s	28.7 MB/s
	128k	28.8 MB/s	28.9 MB/s	28.9 MB/s
==>	256k	30.5 MB/s	30.5 MB/s	30.5 MB/s
	512k	30.9 MB/s	31.0 MB/s	30.9 MB/s
	  1M	31.0 MB/s	30.9 MB/s	30.9 MB/s
	  2M	30.9 MB/s	30.9 MB/s	30.9 MB/s

USB stick 4G SanDisk  Cruzer idVendor=0781, idProduct=5151

	  4k	6.4 MB/s 	6.4 MB/s 	6.4 MB/s
	 16k	13.4 MB/s	13.4 MB/s	13.2 MB/s
	 32k	17.8 MB/s	17.9 MB/s	17.8 MB/s
	 64k	21.3 MB/s	21.3 MB/s	21.2 MB/s
	128k	21.4 MB/s	21.4 MB/s	21.4 MB/s
==>	256k	23.3 MB/s	23.2 MB/s	23.2 MB/s
	512k	23.3 MB/s	23.8 MB/s	23.4 MB/s
	  1M	23.8 MB/s	23.4 MB/s	23.3 MB/s
	  2M	23.4 MB/s	23.2 MB/s	23.4 MB/s

USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113

	  4k	6.7 MB/s 	6.9 MB/s 	6.7 MB/s
	 16k	11.7 MB/s	11.7 MB/s	11.7 MB/s
	 32k	12.4 MB/s	12.4 MB/s	12.4 MB/s
	 64k	13.4 MB/s	13.4 MB/s	13.4 MB/s
	128k	13.4 MB/s	13.4 MB/s	13.4 MB/s
==>	256k	13.6 MB/s	13.6 MB/s	13.6 MB/s
	512k	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  1M	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  2M	13.7 MB/s	13.7 MB/s	13.7 MB/s

64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey

	4KB:    139.339 s, 376 kB/s
	16KB:   81.0427 s, 647 kB/s
	32KB:   71.8513 s, 730 kB/s
==>	64KB:   67.3872 s, 778 kB/s
	128KB:  67.5434 s, 776 kB/s
	256KB:  65.9019 s, 796 kB/s
	512KB:  66.2282 s, 792 kB/s
	1024KB: 67.4632 s, 777 kB/s
	2048KB: 69.9759 s, 749 kB/s

An unnamed SD card (Yakui):

         4k     195.873 s,  5.5 MB/s
         8k     123.425 s,  8.7 MB/s
         16k    86.6425 s, 12.4 MB/s
         32k    66.7519 s, 16.1 MB/s
==>      64k    58.5262 s, 18.3 MB/s
         128k   59.3847 s, 18.1 MB/s
         256k   59.3188 s, 18.1 MB/s
         512k   59.0218 s, 18.2 MB/s

CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/genhd.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

--- linux-next.orig/block/genhd.c	2011-10-31 00:13:51.000000000 +0800
+++ linux-next/block/genhd.c	2011-11-18 11:27:08.000000000 +0800
@@ -623,6 +623,26 @@ void add_disk(struct gendisk *disk)
 	WARN_ON(retval);
 
 	disk_add_events(disk);
+
+	/*
+	 * Limit default readahead size for small devices.
+	 *        disk size    readahead size
+	 *               1M                8k
+	 *               4M               16k
+	 *              16M               32k
+	 *              64M               64k
+	 *             256M              128k
+	 *               1G              256k
+	 *               4G              512k
+	 *              16G             1024k
+	 *              64G             2048k
+	 *             256G             4096k
+	 */
+	if (get_capacity(disk)) {
+		unsigned long size = get_capacity(disk) >> 9;
+		size = 1UL << (ilog2(size) / 2);
+		bdi->ra_pages = min(bdi->ra_pages, size);
+	}
 }
 EXPORT_SYMBOL(add_disk);
 




* [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
  2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 10:01   ` Christoph Hellwig
  2011-11-21  9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ankit Jain,
	Dave Chinner, Christian Ehrhardt, Rik van Riel,
	Nikanth Karthikesan, Wu Fengguang, LKML, Andi Kleen

[-- Attachment #1: readahead-kernel-parameter.patch --]
[-- Type: text/plain, Size: 3085 bytes --]

From: Nikanth Karthikesan <knikanth@suse.de>

Add a new kernel parameter "readahead=", which allows the user to override
the static VM_MAX_READAHEAD=128kB default.

CC: Ankit Jain <radical@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/kernel-parameters.txt |    6 ++++++
 block/blk-core.c                    |    3 +--
 fs/fuse/inode.c                     |    2 +-
 mm/readahead.c                      |   19 +++++++++++++++++++
 4 files changed, 27 insertions(+), 3 deletions(-)

--- linux-next.orig/Documentation/kernel-parameters.txt	2011-10-19 11:11:14.000000000 +0800
+++ linux-next/Documentation/kernel-parameters.txt	2011-11-20 11:09:56.000000000 +0800
@@ -2245,6 +2245,12 @@ bytes respectively. Such letter suffixes
 			Run specified binary instead of /init from the ramdisk,
 			used for early userspace startup. See initrd.
 
+	readahead=nn[KM]
+			Default max readahead size for block devices.
+
+			This default max readahead size may be overridden
+			in some cases, notably NFS, btrfs and software RAID.
+
 	reboot=		[BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c
--- linux-next.orig/block/blk-core.c	2011-11-08 10:18:16.000000000 +0800
+++ linux-next/block/blk-core.c	2011-11-20 10:49:33.000000000 +0800
@@ -462,8 +462,7 @@ struct request_queue *blk_alloc_queue_no
 	if (!q)
 		return NULL;
 
-	q->backing_dev_info.ra_pages =
-			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	q->backing_dev_info.ra_pages = default_backing_dev_info.ra_pages;
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	q->backing_dev_info.name = "block";
--- linux-next.orig/fs/fuse/inode.c	2011-11-08 10:18:39.000000000 +0800
+++ linux-next/fs/fuse/inode.c	2011-11-20 10:50:12.000000000 +0800
@@ -878,7 +878,7 @@ static int fuse_bdi_init(struct fuse_con
 	int err;
 
 	fc->bdi.name = "fuse";
-	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	fc->bdi.ra_pages = default_backing_dev_info.ra_pages;
 	/* fuse does it's own writeback accounting */
 	fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
 
--- linux-next.orig/mm/readahead.c	2011-11-20 10:48:57.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-20 11:09:22.000000000 +0800
@@ -18,6 +18,25 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+static int __init config_readahead_size(char *str)
+{
+	unsigned long bytes;
+
+	if (!str)
+		return -EINVAL;
+	bytes = memparse(str, &str);
+	if (*str != '\0')
+		return -EINVAL;
+
+	/* missed 'k'/'m' suffixes? */
+	if (bytes && bytes < PAGE_CACHE_SIZE)
+		return -EINVAL;
+
+	default_backing_dev_info.ra_pages = bytes / PAGE_CACHE_SIZE;
+	return 0;
+}
+early_param("readahead", config_readahead_size);
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.




* [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
  2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
  2011-11-21  9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 11:04   ` Steven Whitehouse
  2011-11-21 23:01   ` Andrew Morton
  2011-11-21  9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Steven Whitehouse, Rik van Riel, Wu Fengguang, LKML

[-- Attachment #1: readahead-flags.patch --]
[-- Type: text/plain, Size: 3095 bytes --]

Introduce a readahead flags field and embed the existing mmap_miss in it
(mainly to save space).

Flag updates can be lost under race conditions; however, the impact should be
limited.  For the race to happen, two threads sharing the same file
descriptor must be in page fault or readahead at the same time.

Note that concurrent page faults have always been racy in this respect.

If the race ever happens, we'll lose one mmap_miss++ or mmap_miss--, which
may change some concrete readahead behavior, but won't really impact overall
I/O performance.

CC: Andi Kleen <andi@firstfloor.org>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |   31 ++++++++++++++++++++++++++++++-
 mm/filemap.c       |    9 ++-------
 2 files changed, 32 insertions(+), 8 deletions(-)

--- linux-next.orig/include/linux/fs.h	2011-11-20 11:30:55.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-11-20 11:48:53.000000000 +0800
@@ -945,10 +945,39 @@ struct file_ra_state {
 					   there are only # of pages ahead */
 
 	unsigned int ra_pages;		/* Maximum readahead window */
-	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
+	unsigned int ra_flags;
 	loff_t prev_pos;		/* Cache last read() position */
 };
 
+/* ra_flags bits */
+#define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
+
+/*
+ * Don't do ra_flags++ directly to avoid possible overflow:
+ * the ra fields can be accessed concurrently in a racy way.
+ */
+static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
+{
+	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+	/* the upper bound avoids banging the cache line unnecessarily */
+	if (miss < READAHEAD_MMAP_MISS) {
+		miss++;
+		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
+	}
+	return miss;
+}
+
+static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
+{
+	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+	if (miss) {
+		miss--;
+		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
+	}
+}
+
 /*
  * Check if @index falls in the readahead windows.
  */
--- linux-next.orig/mm/filemap.c	2011-11-20 11:30:55.000000000 +0800
+++ linux-next/mm/filemap.c	2011-11-20 11:48:29.000000000 +0800
@@ -1597,15 +1597,11 @@ static void do_sync_mmap_readahead(struc
 		return;
 	}
 
-	/* Avoid banging the cache line if not needed */
-	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
-		ra->mmap_miss++;
-
 	/*
 	 * Do we miss much more than hit in this file? If so,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
-	if (ra->mmap_miss > MMAP_LOTSAMISS)
+	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
 		return;
 
 	/*
@@ -1633,8 +1629,7 @@ static void do_async_mmap_readahead(stru
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
 		return;
-	if (ra->mmap_miss > 0)
-		ra->mmap_miss--;
+	ra_mmap_miss_dec(ra);
 	if (PageReadahead(page))
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);




* [PATCH 4/8] readahead: record readahead patterns
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-11-21  9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 23:19   ` Andrew Morton
  2011-11-21  9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, Wu Fengguang, LKML,
	Andi Kleen

[-- Attachment #1: readahead-tracepoints.patch --]
[-- Type: text/plain, Size: 6975 bytes --]

Record the readahead pattern in ra_flags and extend the ra_submit()
parameters, to be used by the next readahead tracing/stats patches.

Seven patterns are defined:

      	pattern			readahead for
-----------------------------------------------------------
	RA_PATTERN_INITIAL	start-of-file read
	RA_PATTERN_SUBSEQUENT	trivial sequential read
	RA_PATTERN_CONTEXT	interleaved sequential read
	RA_PATTERN_OVERSIZE	oversize read
	RA_PATTERN_MMAP_AROUND	mmap fault
	RA_PATTERN_FADVISE	posix_fadvise()
	RA_PATTERN_RANDOM	random read

Note that random reads will now be recorded in file_ra_state as well.
This won't worsen cache bouncing, because the ra->prev_pos update in
do_generic_file_read() already pollutes the data cache, and
filemap_fault() will stop calling into us after MMAP_LOTSAMISS.

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |   33 +++++++++++++++++++++++++++++++++
 include/linux/mm.h |    4 +++-
 mm/filemap.c       |    9 +++++++--
 mm/readahead.c     |   30 +++++++++++++++++++++++-------
 4 files changed, 66 insertions(+), 10 deletions(-)

--- linux-next.orig/include/linux/fs.h	2011-11-20 20:10:48.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-11-20 20:18:29.000000000 +0800
@@ -951,6 +951,39 @@ struct file_ra_state {
 
 /* ra_flags bits */
 #define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
+#define	READAHEAD_MMAP		0x00010000
+
+#define READAHEAD_PATTERN_SHIFT	28
+#define READAHEAD_PATTERN	0xf0000000
+
+/*
+ * Which policy makes decision to do the current read-ahead IO?
+ */
+enum readahead_pattern {
+	RA_PATTERN_INITIAL,
+	RA_PATTERN_SUBSEQUENT,
+	RA_PATTERN_CONTEXT,
+	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_FADVISE,
+	RA_PATTERN_OVERSIZE,
+	RA_PATTERN_RANDOM,
+	RA_PATTERN_ALL,		/* for summary stats */
+	RA_PATTERN_MAX
+};
+
+static inline unsigned int ra_pattern(unsigned int ra_flags)
+{
+	unsigned int pattern = ra_flags >> READAHEAD_PATTERN_SHIFT;
+
+	return min_t(unsigned int, pattern, RA_PATTERN_ALL);
+}
+
+static inline void ra_set_pattern(struct file_ra_state *ra,
+				  unsigned int pattern)
+{
+	ra->ra_flags = (ra->ra_flags & ~READAHEAD_PATTERN) |
+			    (pattern << READAHEAD_PATTERN_SHIFT);
+}
 
 /*
  * Don't do ra_flags++ directly to avoid possible overflow:
--- linux-next.orig/mm/readahead.c	2011-11-20 20:10:48.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-20 20:18:14.000000000 +0800
@@ -268,13 +268,17 @@ unsigned long max_sane_readahead(unsigne
  * Submit IO for the read-ahead request in file_ra_state.
  */
 unsigned long ra_submit(struct file_ra_state *ra,
-		       struct address_space *mapping, struct file *filp)
+			struct address_space *mapping,
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size)
 {
 	int actual;
 
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	ra->ra_flags &= ~READAHEAD_MMAP;
 	return actual;
 }
 
@@ -401,6 +405,7 @@ static int try_context_readahead(struct 
 	if (size >= offset)
 		size *= 2;
 
+	ra_set_pattern(ra, RA_PATTERN_CONTEXT);
 	ra->start = offset;
 	ra->size = get_init_ra_size(size + req_size, max);
 	ra->async_size = ra->size;
@@ -422,8 +427,10 @@ ondemand_readahead(struct address_space 
 	/*
 	 * start of file
 	 */
-	if (!offset)
+	if (!offset) {
+		ra_set_pattern(ra, RA_PATTERN_INITIAL);
 		goto initial_readahead;
+	}
 
 	/*
 	 * It's the expected callback offset, assume sequential access.
@@ -431,6 +438,7 @@ ondemand_readahead(struct address_space 
 	 */
 	if ((offset == (ra->start + ra->size - ra->async_size) ||
 	     offset == (ra->start + ra->size))) {
+		ra_set_pattern(ra, RA_PATTERN_SUBSEQUENT);
 		ra->start += ra->size;
 		ra->size = get_next_ra_size(ra, max);
 		ra->async_size = ra->size;
@@ -453,6 +461,7 @@ ondemand_readahead(struct address_space 
 		if (!start || start - offset > max)
 			return 0;
 
+		ra_set_pattern(ra, RA_PATTERN_CONTEXT);
 		ra->start = start;
 		ra->size = start - offset;	/* old async_size */
 		ra->size += req_size;
@@ -464,14 +473,18 @@ ondemand_readahead(struct address_space 
 	/*
 	 * oversize read
 	 */
-	if (req_size > max)
+	if (req_size > max) {
+		ra_set_pattern(ra, RA_PATTERN_OVERSIZE);
 		goto initial_readahead;
+	}
 
 	/*
 	 * sequential cache miss
 	 */
-	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
+	if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) {
+		ra_set_pattern(ra, RA_PATTERN_INITIAL);
 		goto initial_readahead;
+	}
 
 	/*
 	 * Query the page cache and look for the traces(cached history pages)
@@ -482,9 +495,12 @@ ondemand_readahead(struct address_space 
 
 	/*
 	 * standalone, small random read
-	 * Read as is, and do not pollute the readahead state.
 	 */
-	return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+	ra_set_pattern(ra, RA_PATTERN_RANDOM);
+	ra->start = offset;
+	ra->size = req_size;
+	ra->async_size = 0;
+	goto readit;
 
 initial_readahead:
 	ra->start = offset;
@@ -502,7 +518,7 @@ readit:
 		ra->size += ra->async_size;
 	}
 
-	return ra_submit(ra, mapping, filp);
+	return ra_submit(ra, mapping, filp, offset, req_size);
 }
 
 /**
--- linux-next.orig/include/linux/mm.h	2011-11-20 20:10:48.000000000 +0800
+++ linux-next/include/linux/mm.h	2011-11-20 20:10:49.000000000 +0800
@@ -1456,7 +1456,9 @@ void page_cache_async_readahead(struct a
 unsigned long max_sane_readahead(unsigned long nr);
 unsigned long ra_submit(struct file_ra_state *ra,
 			struct address_space *mapping,
-			struct file *filp);
+			struct file *filp,
+			pgoff_t offset,
+			unsigned long req_size);
 
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-next.orig/mm/filemap.c	2011-11-20 20:10:48.000000000 +0800
+++ linux-next/mm/filemap.c	2011-11-20 20:10:49.000000000 +0800
@@ -1592,6 +1592,7 @@ static void do_sync_mmap_readahead(struc
 		return;
 
 	if (VM_SequentialReadHint(vma)) {
+		ra->ra_flags |= READAHEAD_MMAP;
 		page_cache_sync_readahead(mapping, ra, file, offset,
 					  ra->ra_pages);
 		return;
@@ -1607,11 +1608,13 @@ static void do_sync_mmap_readahead(struc
 	/*
 	 * mmap read-around
 	 */
+	ra->ra_flags |= READAHEAD_MMAP;
+	ra_set_pattern(ra, RA_PATTERN_MMAP_AROUND);
 	ra_pages = max_sane_readahead(ra->ra_pages);
 	ra->start = max_t(long, 0, offset - ra_pages / 2);
 	ra->size = ra_pages;
 	ra->async_size = ra_pages / 4;
-	ra_submit(ra, mapping, file);
+	ra_submit(ra, mapping, file, offset, 1);
 }
 
 /*
@@ -1630,9 +1633,11 @@ static void do_async_mmap_readahead(stru
 	if (VM_RandomReadHint(vma))
 		return;
 	ra_mmap_miss_dec(ra);
-	if (PageReadahead(page))
+	if (PageReadahead(page)) {
+		ra->ra_flags |= READAHEAD_MMAP;
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
+	}
 }
 
 /**




* [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-11-21  9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 14:17   ` Andi Kleen
  2011-11-21 23:29   ` Andrew Morton
  2011-11-21  9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, Wu Fengguang, LKML,
	Andi Kleen

[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 10222 bytes --]

The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
and will remain inactive unless explicitly enabled, either with the boot option

	readahead_stats=1

or through the debugfs interface

	echo 1 > /debug/readahead/stats_enable

The added overhead is two readahead_stats() calls per readahead. That is a
trivial cost unless there are concurrent random reads on super-fast SSDs,
where updating the global ra_stats[][] may lead to cache-line bouncing.
Considering that normal users won't need this except when debugging
performance problems, it's disabled by default, so it seems reasonable to
keep this debug code simple rather than trying to improve its scalability.

Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io       size async_size    io_size
initial           545        347         10        535        535          0         74         38          3
subsequent         48         41          1         11          1          5         53         53         15
context           156        156          0          3          0          1       1690       1690         12
around            152        152          0        152        152        152       1920        480         45
backwards           2          0          2          2          2          0          4          0          3
fadvise          2566          0          0       2566          0          0          0          0          1
oversize            0          0          0          0          0          0          0          0          0
random             30          0          1         29         29          0          1          0          1
all              3499        696         14       3298        719          0        171        102          3

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 Documentation/kernel-parameters.txt |    6 
 mm/Kconfig                          |   15 ++
 mm/readahead.c                      |  194 ++++++++++++++++++++++++++
 3 files changed, 215 insertions(+)

--- linux-next.orig/mm/readahead.c	2011-11-21 17:08:43.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-21 17:13:28.000000000 +0800
@@ -18,6 +18,17 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+static const char * const ra_pattern_names[] = {
+	[RA_PATTERN_INITIAL]            = "initial",
+	[RA_PATTERN_SUBSEQUENT]         = "subsequent",
+	[RA_PATTERN_CONTEXT]            = "context",
+	[RA_PATTERN_MMAP_AROUND]        = "around",
+	[RA_PATTERN_FADVISE]            = "fadvise",
+	[RA_PATTERN_OVERSIZE]           = "oversize",
+	[RA_PATTERN_RANDOM]             = "random",
+	[RA_PATTERN_ALL]                = "all",
+};
+
 static int __init config_readahead_size(char *str)
 {
 	unsigned long bytes;
@@ -51,6 +62,182 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static int __init config_readahead_stats(char *str)
+{
+	int enable = 1;
+	get_option(&str, &enable);
+	readahead_stats_enable = enable;
+	return 0;
+}
+early_param("readahead_stats", config_readahead_stats);
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CHIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static void readahead_stats(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    unsigned int ra_flags,
+			    pgoff_t start,
+			    unsigned int size,
+			    unsigned int async_size,
+			    int actual)
+{
+	unsigned int pattern = ra_pattern(ra_flags);
+
+	ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+	ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
+	ra_stats[pattern][RA_ACCOUNT_ASIZE] += async_size;
+	ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+	if (actual < size) {
+		if (start + size >
+		    (i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT)
+			ra_stats[pattern][RA_ACCOUNT_EOF]++;
+		else
+			ra_stats[pattern][RA_ACCOUNT_CHIT]++;
+	}
+
+	if (!actual)
+		return;
+
+	ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+	if (start <= offset && offset < start + size)
+		ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+	if (ra_flags & READAHEAD_MMAP)
+		ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	unsigned long i;
+
+	seq_printf(s, "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+			"pattern",
+			"readahead", "eof_hit", "cache_hit",
+			"io", "sync_io", "mmap_io",
+			"size", "async_size", "io_size");
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu %10lu "
+			   "%10lu %10lu %10lu\n",
+				ra_pattern_names[i],
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CHIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_SIZE]   / count,
+				ra_stats[i][RA_ACCOUNT_ASIZE]  / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	memset(ra_stats, 0, sizeof(ra_stats));
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
+static void readahead_event(struct address_space *mapping,
+			    pgoff_t offset,
+			    unsigned long req_size,
+			    unsigned int ra_flags,
+			    pgoff_t start,
+			    unsigned int size,
+			    unsigned int async_size,
+			    unsigned int actual)
+{
+#ifdef CONFIG_READAHEAD_STATS
+	if (readahead_stats_enable) {
+		readahead_stats(mapping, offset, req_size, ra_flags,
+				start, size, async_size, actual);
+		readahead_stats(mapping, offset, req_size,
+				RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
+				start, size, async_size, actual);
+	}
+#endif
+}
+
 /*
  * see if a page needs releasing upon read_cache_pages() failure
  * - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -247,10 +434,14 @@ int force_page_cache_readahead(struct ad
 			ret = err;
 			break;
 		}
+		readahead_event(mapping, offset, nr_to_read,
+				RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+				offset, this_chunk, 0, err);
 		ret += err;
 		offset += this_chunk;
 		nr_to_read -= this_chunk;
 	}
+
 	return ret;
 }
 
@@ -278,6 +469,9 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	readahead_event(mapping, offset, req_size, ra->ra_flags,
+			ra->start, ra->size, ra->async_size, actual);
+
 	ra->ra_flags &= ~READAHEAD_MMAP;
 	return actual;
 }
--- linux-next.orig/mm/Kconfig	2011-11-21 17:08:31.000000000 +0800
+++ linux-next/mm/Kconfig	2011-11-21 17:08:51.000000000 +0800
@@ -373,3 +373,18 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To enable accounting early, boot kernel with "readahead_stats=1".
+	  Or run these commands after boot:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
--- linux-next.orig/Documentation/kernel-parameters.txt	2011-11-21 17:08:38.000000000 +0800
+++ linux-next/Documentation/kernel-parameters.txt	2011-11-21 17:08:51.000000000 +0800
@@ -2251,6 +2251,12 @@ bytes respectively. Such letter suffixes
 			This default max readahead size may be overrode
 			in some cases, notably NFS, btrfs and software RAID.
 
+	readahead_stats[=0|1]
+			Enable/disable readahead stats accounting.
+
+			It's also possible to enable/disable it after boot:
+			echo 1 > /sys/kernel/debug/readahead/stats_enable
+
 	reboot=		[BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 6/8] readahead: add debug tracing event
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
                   ` (4 preceding siblings ...)
  2011-11-21  9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 14:01   ` Steven Rostedt
  2011-11-21  9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Steven Rostedt, Peter Zijlstra, Rik van Riel,
	Wu Fengguang, LKML, Andi Kleen

[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 3524 bytes --]

This is very useful for verifying whether the algorithms are working
to our expectations.

Example output:

# echo 1 > /debug/tracing/events/vfs/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace  # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/vfs.h |   64 +++++++++++++++++++++++++++++++++++
 mm/readahead.c             |    5 ++
 2 files changed, 69 insertions(+)

--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/vfs.h	2011-11-21 17:17:45.000000000 +0800
@@ -0,0 +1,64 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(readahead,
+	TP_PROTO(struct address_space *mapping,
+		 pgoff_t offset,
+		 unsigned long req_size,
+		 unsigned int ra_flags,
+		 pgoff_t start,
+		 unsigned int size,
+		 unsigned int async_size,
+		 unsigned int actual),
+
+	TP_ARGS(mapping, offset, req_size,
+		ra_flags, start, size, async_size, actual),
+
+	TP_STRUCT__entry(
+		__field(	dev_t,		dev		)
+		__field(	ino_t,		ino		)
+		__field(	pgoff_t,	offset		)
+		__field(	unsigned long,	req_size	)
+		__field(	unsigned int,	pattern		)
+		__field(	pgoff_t,	start		)
+		__field(	unsigned int,	size		)
+		__field(	unsigned int,	async_size	)
+		__field(	unsigned int,	actual		)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= mapping->host->i_sb->s_dev;
+		__entry->ino		= mapping->host->i_ino;
+		__entry->pattern	= ra_pattern(ra_flags);
+		__entry->offset		= offset;
+		__entry->req_size	= req_size;
+		__entry->start		= start;
+		__entry->size		= size;
+		__entry->async_size	= async_size;
+		__entry->actual		= actual;
+	),
+
+	TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+		  "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+			ra_pattern_names[__entry->pattern],
+			MAJOR(__entry->dev),
+			MINOR(__entry->dev),
+			__entry->ino,
+			__entry->offset,
+			__entry->req_size,
+			__entry->start,
+			__entry->size,
+			__entry->async_size,
+			__entry->start > __entry->offset,
+			__entry->actual)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux-next.orig/mm/readahead.c	2011-11-21 17:17:45.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-21 17:17:49.000000000 +0800
@@ -48,6 +48,9 @@ static int __init config_readahead_size(
 }
 early_param("readahead", config_readahead_size);
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vfs.h>
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
@@ -236,6 +239,8 @@ static void readahead_event(struct addre
 				start, size, async_size, actual);
 	}
 #endif
+	trace_readahead(mapping, offset, req_size, ra_flags,
+			start, size, async_size, actual);
 }
 
 /*




* [PATCH 7/8] readahead: basic support for backwards prefetching
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
                   ` (5 preceding siblings ...)
  2011-11-21  9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 23:33   ` Andrew Morton
  2011-11-21  9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
  2011-11-21  9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
  8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Li Shaohua, Wu Fengguang, LKML

[-- Attachment #1: readahead-backwards.patch --]
[-- Type: text/plain, Size: 3890 bytes --]

Add the backwards prefetching feature. It's pretty simple if we don't
support async prefetching and interleaved reads.

Here is the behavior with an 8-page read sequence from 10000 down to 0.
(The readahead size is a bit large since it's an NFS mount.)

readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8
readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32
readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128
readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368

And a simple 1-page read sequence from 10000 down to 0.

readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1
readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4
readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16
readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64
readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 444

CC: Andi Kleen <andi@firstfloor.org>
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/fs.h |    1 +
 mm/readahead.c     |   14 ++++++++++++++
 2 files changed, 15 insertions(+)

--- linux-next.orig/include/linux/fs.h	2011-11-21 17:17:44.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-11-21 17:17:47.000000000 +0800
@@ -964,6 +964,7 @@ enum readahead_pattern {
 	RA_PATTERN_SUBSEQUENT,
 	RA_PATTERN_CONTEXT,
 	RA_PATTERN_MMAP_AROUND,
+	RA_PATTERN_BACKWARDS,
 	RA_PATTERN_FADVISE,
 	RA_PATTERN_OVERSIZE,
 	RA_PATTERN_RANDOM,
--- linux-next.orig/mm/readahead.c	2011-11-21 17:17:45.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-21 17:17:47.000000000 +0800
@@ -23,6 +23,7 @@ static const char * const ra_pattern_nam
 	[RA_PATTERN_SUBSEQUENT]         = "subsequent",
 	[RA_PATTERN_CONTEXT]            = "context",
 	[RA_PATTERN_MMAP_AROUND]        = "around",
+	[RA_PATTERN_BACKWARDS]          = "backwards",
 	[RA_PATTERN_FADVISE]            = "fadvise",
 	[RA_PATTERN_OVERSIZE]           = "oversize",
 	[RA_PATTERN_RANDOM]             = "random",
@@ -686,6 +687,19 @@ ondemand_readahead(struct address_space 
 	}
 
 	/*
+	 * backwards reading
+	 */
+	if (offset < ra->start && offset + req_size >= ra->start) {
+		ra_set_pattern(ra, RA_PATTERN_BACKWARDS);
+		ra->size = get_next_ra_size(ra, max);
+		if (ra->size > ra->start)
+			ra->size = ra->start;
+		ra->async_size = 0;
+		ra->start -= ra->size;
+		goto readit;
+	}
+
+	/*
 	 * Query the page cache and look for the traces(cached history pages)
 	 * that a sequential stream would leave behind.
 	 */




* [PATCH 8/8] readahead: dont do start-of-file readahead after lseek()
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
                   ` (6 preceding siblings ...)
  2011-11-21  9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
@ 2011-11-21  9:18 ` Wu Fengguang
  2011-11-21 23:36   ` Andrew Morton
  2011-11-21  9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
  8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21  9:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Rik van Riel,
	Linus Torvalds, Wu Fengguang, LKML, Andi Kleen

[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2136 bytes --]

Some applications (e.g. blkid, id3tool) seek around the file
to get information. For example, blkid does

	     seek to	0
	     read	1024
	     seek to	1536
	     read	16384

The start-of-file readahead heuristic is wrong for such applications,
whose access pattern can be identified by their lseek() calls.

So test-and-set a READAHEAD_LSEEK flag on lseek() and don't
do start-of-file readahead on seeing it. Proposed by Linus.

Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/read_write.c    |    4 ++++
 include/linux/fs.h |    1 +
 mm/readahead.c     |    3 +++
 3 files changed, 8 insertions(+)

--- linux-next.orig/mm/readahead.c	2011-11-20 22:02:01.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-20 22:02:03.000000000 +0800
@@ -629,6 +629,8 @@ ondemand_readahead(struct address_space 
 	 * start of file
 	 */
 	if (!offset) {
+		if ((ra->ra_flags & READAHEAD_LSEEK) && req_size < max)
+			goto random_read;
 		ra_set_pattern(ra, RA_PATTERN_INITIAL);
 		goto initial_readahead;
 	}
@@ -707,6 +709,7 @@ ondemand_readahead(struct address_space 
 	if (try_context_readahead(mapping, ra, offset, req_size, max))
 		goto readit;
 
+random_read:
 	/*
 	 * standalone, small random read
 	 */
--- linux-next.orig/fs/read_write.c	2011-11-20 22:02:01.000000000 +0800
+++ linux-next/fs/read_write.c	2011-11-20 22:02:03.000000000 +0800
@@ -47,6 +47,10 @@ static loff_t lseek_execute(struct file 
 		file->f_pos = offset;
 		file->f_version = 0;
 	}
+
+	if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
+		file->f_ra.ra_flags |= READAHEAD_LSEEK;
+
 	return offset;
 }
 
--- linux-next.orig/include/linux/fs.h	2011-11-20 22:02:01.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-11-20 22:02:03.000000000 +0800
@@ -952,6 +952,7 @@ struct file_ra_state {
 /* ra_flags bits */
 #define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
 #define	READAHEAD_MMAP		0x00010000
+#define	READAHEAD_LSEEK		0x00020000 /* be conservative after lseek() */
 
 #define READAHEAD_PATTERN_SHIFT	28
 #define READAHEAD_PATTERN	0xf0000000




* Re: [PATCH 0/8] readahead stats/tracing, backwards prefetching and more
  2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-11-21  9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
@ 2011-11-21  9:56 ` Christoph Hellwig
  2011-11-21 12:00   ` Wu Fengguang
  8 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2011-11-21  9:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel, LKML,
	Andi Kleen

On Mon, Nov 21, 2011 at 05:18:19PM +0800, Wu Fengguang wrote:
> Andrew,
> 
> I'm getting around to pick up the readahead works again :-)
> 
> This first series is mainly to add some debug facilities, to support the long
> missed backwards prefetching capability, and some old patches that somehow got
> delayed (shame on me).
> 
> The next step would be to better handle the readahead thrashing situations.
> That would require rewriting part of the algorithms, this is why I'd like to
> keep the backwards prefetching simple and stupid for now.
> 
> When (almost) free of readahead thrashing, we'll be in a good position to lift
> the default readahead size. Which I suspect would be the single most efficient
> way to improve performance for the large volumes of casually maintained Linux
> file servers.

Btw, if you work actively in that area, I have a todo list item I was
planning to look into sooner or later: instead of embedding the ra
state in the struct file, allocate it dynamically.  That way files that
either don't use the pagecache or aren't read from won't have to pay
the price for an increased struct file size, and if we have to we
could enlarge it more easily.  Besides removing f_version from the
common struct file and allocating f_owner separately, those seem to be
the easiest ways to get the struct file size down.



* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
@ 2011-11-21 10:00   ` Christoph Hellwig
  2011-11-21 11:24     ` Wu Fengguang
  2011-11-21 12:47     ` Andi Kleen
  2011-11-21 14:46   ` Jeff Moyer
  2011-11-21 22:52   ` Andrew Morton
  2 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2011-11-21 10:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Li Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
	Andi Kleen

On Mon, Nov 21, 2011 at 05:18:20PM +0800, Wu Fengguang wrote:
> > This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
> 
> Given that the non-rotational attribute is not always reported, we can
> take disk size as a max readahead size hint. This patch uses a formula
> that generates the following concrete limits:

Given that you mentioned the rotational flag and device size in this
mail, as well as benchmarking with an Intel SSD - did you measure
how useful large readahead sizes still are with high-end flash devices
that have extremely high read IOPS rates?



* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-21  9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
@ 2011-11-21 10:01   ` Christoph Hellwig
  2011-11-21 11:35     ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2011-11-21 10:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Ankit Jain, Dave Chinner, Christian Ehrhardt, Rik van Riel,
	Nikanth Karthikesan, LKML, Andi Kleen

On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> From: Nikanth Karthikesan <knikanth@suse.de>
> 
> Add new kernel parameter "readahead=", which allows user to override
> the static VM_MAX_READAHEAD=128kb.

Is a boot-time parameter really such a good idea?  I would at least make
it a sysctl so that it's run-time controllable, including being able
to set it from initscripts.



* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-21  9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
@ 2011-11-21 11:04   ` Steven Whitehouse
  2011-11-21 11:42     ` Wu Fengguang
  2011-11-21 23:01   ` Andrew Morton
  1 sibling, 1 reply; 47+ messages in thread
From: Steven Whitehouse @ 2011-11-21 11:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Andi Kleen, Rik van Riel, LKML

Hi,

I'm not quite sure why you copied me in on this patch, but I've had a
look at it and it seems ok to me. Some of the other patches in this
series look as if they might be rather useful for the GFS2 dir readahead
code though, so I'll be keeping an eye on developments in this area,

Acked-by: Steven Whitehouse <swhiteho@redhat.com>

Steve.

On Mon, 2011-11-21 at 17:18 +0800, Wu Fengguang wrote:
> plain text document attachment (readahead-flags.patch)
> Introduce a readahead flags field and embed the existing mmap_miss in it
> (mainly to save space).
> 
> It will be possible to lose the flags in race conditions, however the
> impact should be limited.  For the race to happen, there must be two
> threads sharing the same file descriptor to be in page fault or
> readahead at the same time.
> 
> Note that it has always been racy for "page faults" at the same time.
> 
> And if ever the race happens, we'll lose one mmap_miss++ or mmap_miss--.
> Which may change some concrete readahead behavior, but won't really
> impact overall I/O performance.
> 
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Steven Whitehouse <swhiteho@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/fs.h |   31 ++++++++++++++++++++++++++++++-
>  mm/filemap.c       |    9 ++-------
>  2 files changed, 32 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/include/linux/fs.h	2011-11-20 11:30:55.000000000 +0800
> +++ linux-next/include/linux/fs.h	2011-11-20 11:48:53.000000000 +0800
> @@ -945,10 +945,39 @@ struct file_ra_state {
>  					   there are only # of pages ahead */
>  
>  	unsigned int ra_pages;		/* Maximum readahead window */
> -	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
> +	unsigned int ra_flags;
>  	loff_t prev_pos;		/* Cache last read() position */
>  };
>  
> +/* ra_flags bits */
> +#define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
> +
> +/*
> + * Don't do ra_flags++ directly to avoid possible overflow:
> + * the ra fields can be accessed concurrently in a racy way.
> + */
> +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> +{
> +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> +	/* the upper bound avoids banging the cache line unnecessarily */
> +	if (miss < READAHEAD_MMAP_MISS) {
> +		miss++;
> +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> +	}
> +	return miss;
> +}
> +
> +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> +{
> +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> +	if (miss) {
> +		miss--;
> +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> +	}
> +}
> +
>  /*
>   * Check if @index falls in the readahead windows.
>   */
> --- linux-next.orig/mm/filemap.c	2011-11-20 11:30:55.000000000 +0800
> +++ linux-next/mm/filemap.c	2011-11-20 11:48:29.000000000 +0800
> @@ -1597,15 +1597,11 @@ static void do_sync_mmap_readahead(struc
>  		return;
>  	}
>  
> -	/* Avoid banging the cache line if not needed */
> -	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
> -		ra->mmap_miss++;
> -
>  	/*
>  	 * Do we miss much more than hit in this file? If so,
>  	 * stop bothering with read-ahead. It will only hurt.
>  	 */
> -	if (ra->mmap_miss > MMAP_LOTSAMISS)
> +	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
>  		return;
>  
>  	/*
> @@ -1633,8 +1629,7 @@ static void do_async_mmap_readahead(stru
>  	/* If we don't want any read-ahead, don't bother */
>  	if (VM_RandomReadHint(vma))
>  		return;
> -	if (ra->mmap_miss > 0)
> -		ra->mmap_miss--;
> +	ra_mmap_miss_dec(ra);
>  	if (PageReadahead(page))
>  		page_cache_async_readahead(mapping, ra, file,
>  					   page, offset, ra->ra_pages);
> 
> 




* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21 10:00   ` Christoph Hellwig
@ 2011-11-21 11:24     ` Wu Fengguang
  2011-11-21 12:47     ` Andi Kleen
  1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 11:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel, Li,
	Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
	Andi Kleen

On Mon, Nov 21, 2011 at 06:00:04PM +0800, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:20PM +0800, Wu Fengguang wrote:
> > This looks reasonable: smaller device tend to be slower (USB sticks as
> > well as micro/mobile/old hard disks).
> > 
> > Given that the non-rotational attribute is not always reported, we can
> > take disk size as a max readahead size hint. This patch uses a formula
> > that generates the following concrete limits:
> 
> Given that you mentioned the rotational flag and device size in this
> mail, as well as benchmarking with an intel SSD  -  did you measure
> how useful large read ahead sizes still are with highend Flash device
> that have extremly high read IOP rates?

I don't know -- I don't have access to such highend devices.

However, the patch changelog has a simple test script. It would be
highly appreciated if someone could help collect the data :)

Thanks,
Fengguang


* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-21 10:01   ` Christoph Hellwig
@ 2011-11-21 11:35     ` Wu Fengguang
  2011-11-24 22:28       ` Jan Kara
  0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 11:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Ankit Jain, Dave Chinner, Christian Ehrhardt, Rik van Riel,
	Nikanth Karthikesan, LKML, Andi Kleen

On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > From: Nikanth Karthikesan <knikanth@suse.de>
> > 
> > Add new kernel parameter "readahead=", which allows user to override
> > the static VM_MAX_READAHEAD=128kb.
> 
> Is a boot-time parameter really such a good idea?  I would at least

It's most convenient to set at boot time, because the default size
will be used to initialize all the block devices.

> make it a sysctl so that it's run-time controllable, including
> being able to set it from initscripts.

Once booted, it's more natural to set the sizes one by one, for
example

        blockdev --setra 1024 /dev/sda2
or
        echo 512 > /sys/block/sda/queue/read_ahead_kb

And you still have the chance to modify the global default, but the
change will only be inherited by newly created devices thereafter:

        echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb

The above command is very suitable for use in initscripts.  However,
there is no natural way to do this as a sysctl, since there is no such
global value.

Thanks,
Fengguang


* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-21 11:04   ` Steven Whitehouse
@ 2011-11-21 11:42     ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 11:42 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Andi Kleen, Rik van Riel, LKML

Hi Steven,

On Mon, Nov 21, 2011 at 07:04:27PM +0800, Steven Whitehouse wrote:
> Hi,
> 
> I'm not quite sure why you copied me in to this patch, but I've had a

Yeah it's such an old patch that I've forgotten why I added CC to you ;)

> look at it and it seems ok to me. Some of the other patches in this
> series look as if they might be rather useful for the GFS2 dir readahead
> code though, so I'll be keeping an eye on developments in this area,
> 
> Acked-by: Steven Whitehouse <swhiteho@redhat.com>

Thanks! That reminds me of the "metadata readahead". It should be
possible to do more metadata readahead in the future, hence we might
add a "meta_io" column to the readahead stats file :)

Thanks,
Fengguang

> On Mon, 2011-11-21 at 17:18 +0800, Wu Fengguang wrote:
> > plain text document attachment (readahead-flags.patch)
> > Introduce a readahead flags field and embed the existing mmap_miss in it
> > (mainly to save space).
> > 
> > It will be possible to lose the flags in race conditions, however the
> > impact should be limited.  For the race to happen, there must be two
> > threads sharing the same file descriptor to be in page fault or
> > readahead at the same time.
> > 
> > Note that it has always been racy for "page faults" at the same time.
> > 
> > And if ever the race happens, we'll lose one mmap_miss++ or mmap_miss--.
> > Which may change some concrete readahead behavior, but won't really
> > impact overall I/O performance.
> > 
> > CC: Andi Kleen <andi@firstfloor.org>
> > CC: Steven Whitehouse <swhiteho@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/fs.h |   31 ++++++++++++++++++++++++++++++-
> >  mm/filemap.c       |    9 ++-------
> >  2 files changed, 32 insertions(+), 8 deletions(-)
> > 
> > --- linux-next.orig/include/linux/fs.h	2011-11-20 11:30:55.000000000 +0800
> > +++ linux-next/include/linux/fs.h	2011-11-20 11:48:53.000000000 +0800
> > @@ -945,10 +945,39 @@ struct file_ra_state {
> >  					   there are only # of pages ahead */
> >  
> >  	unsigned int ra_pages;		/* Maximum readahead window */
> > -	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
> > +	unsigned int ra_flags;
> >  	loff_t prev_pos;		/* Cache last read() position */
> >  };
> >  
> > +/* ra_flags bits */
> > +#define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
> > +
> > +/*
> > + * Don't do ra_flags++ directly to avoid possible overflow:
> > + * the ra fields can be accessed concurrently in a racy way.
> > + */
> > +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> > +{
> > +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > +	/* the upper bound avoids banging the cache line unnecessarily */
> > +	if (miss < READAHEAD_MMAP_MISS) {
> > +		miss++;
> > +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > +	}
> > +	return miss;
> > +}
> > +
> > +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> > +{
> > +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > +	if (miss) {
> > +		miss--;
> > +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > +	}
> > +}
> > +
> >  /*
> >   * Check if @index falls in the readahead windows.
> >   */
> > --- linux-next.orig/mm/filemap.c	2011-11-20 11:30:55.000000000 +0800
> > +++ linux-next/mm/filemap.c	2011-11-20 11:48:29.000000000 +0800
> > @@ -1597,15 +1597,11 @@ static void do_sync_mmap_readahead(struc
> >  		return;
> >  	}
> >  
> > -	/* Avoid banging the cache line if not needed */
> > -	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
> > -		ra->mmap_miss++;
> > -
> >  	/*
> >  	 * Do we miss much more than hit in this file? If so,
> >  	 * stop bothering with read-ahead. It will only hurt.
> >  	 */
> > -	if (ra->mmap_miss > MMAP_LOTSAMISS)
> > +	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
> >  		return;
> >  
> >  	/*
> > @@ -1633,8 +1629,7 @@ static void do_async_mmap_readahead(stru
> >  	/* If we don't want any read-ahead, don't bother */
> >  	if (VM_RandomReadHint(vma))
> >  		return;
> > -	if (ra->mmap_miss > 0)
> > -		ra->mmap_miss--;
> > +	ra_mmap_miss_dec(ra);
> >  	if (PageReadahead(page))
> >  		page_cache_async_readahead(mapping, ra, file,
> >  					   page, offset, ra->ra_pages);
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 0/8] readahead stats/tracing, backwards prefetching and more
  2011-11-21  9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
@ 2011-11-21 12:00   ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 12:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel, LKML,
	Andi Kleen

On Mon, Nov 21, 2011 at 05:56:38PM +0800, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:19PM +0800, Wu Fengguang wrote:
> > Andrew,
> > 
> > I'm getting around to picking up the readahead work again :-)
> > 
> > This first series is mainly to add some debug facilities, to support the
> > long-missed backwards prefetching capability, and some old patches that
> > somehow got delayed (shame on me).
> > 
> > The next step would be to better handle readahead thrashing situations.
> > That would require rewriting part of the algorithms, which is why I'd like
> > to keep the backwards prefetching simple and stupid for now.
> > 
> > When (almost) free of readahead thrashing, we'll be in a good position to
> > lift the default readahead size, which I suspect would be the single most
> > effective way to improve performance for the large volume of casually
> > maintained Linux file servers.
> 
> Btw, if you work actively in that area I have a todo list item I was
> planning to look into sooner or later: instead of embedding the ra
> state into the struct file, allocate it dynamically.  That way files that
> either don't use the pagecache, or aren't read from, won't have to
> pay the price of an increased struct file size, and if we have to we
> could enlarge it more easily.

Agreed. That's good to have, please allow me to move it into my todo list :)

> Besides removing f_version from the common
> struct file and also allocating f_owner separately, those seem to be the
> easiest ways to get the struct file size down.

Yeah, there doesn't seem to be much code accessing fown_struct.
I may consider that when I'm working on file_ra_state, but no promises ;)

Thanks,
Fengguang


* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21 10:00   ` Christoph Hellwig
  2011-11-21 11:24     ` Wu Fengguang
@ 2011-11-21 12:47     ` Andi Kleen
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2011-11-21 12:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Andrew Morton, Linux Memory Management List,
	linux-fsdevel, Li Shaohua, Clemens Ladisch, Jens Axboe,
	Rik van Riel, LKML, Andi Kleen

On Mon, Nov 21, 2011 at 05:00:04AM -0500, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:20PM +0800, Wu Fengguang wrote:
> > Given that the non-rotational attribute is not always reported, we can
> > take disk size as a max readahead size hint. This patch uses a formula
> > that generates the following concrete limits:
> 
> Given that you mentioned the rotational flag and device size in this
> mail, as well as benchmarking with an intel SSD  -  did you measure
> how useful large read ahead sizes still are with highend Flash device
> that have extremly high read IOP rates?

The higher the IOPS, the larger the "window" you need to keep everything
going, I suspect.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 6/8] readahead: add debug tracing event
  2011-11-21  9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
@ 2011-11-21 14:01   ` Steven Rostedt
  0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2011-11-21 14:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML,
	Andi Kleen

On Mon, 2011-11-21 at 17:18 +0800, Wu Fengguang wrote:
> plain text document attachment (readahead-tracer.patch)
> This is very useful for verifying whether the algorithms are working
> to our expectations.
> 
> Example output:
> 
> # echo 1 > /debug/tracing/events/vfs/readahead/enable
> # cp test-file /dev/null
> # cat /debug/tracing/trace  # trimmed output
> readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
> 
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Jens Axboe <jens.axboe@oracle.com>
> CC: Steven Rostedt <rostedt@goodmis.org>

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>




* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-21  9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
@ 2011-11-21 14:17   ` Andi Kleen
  2011-11-22 14:14     ` Wu Fengguang
  2011-11-21 23:29   ` Andrew Morton
  1 sibling, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2011-11-21 14:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML,
	Andi Kleen

> +static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];

Why not make it per cpu?  That should get the overhead down, probably
even enough that it can be enabled by default.

BTW I have an older framework to make it really easy to add per
cpu stats counters to debugfs. Will repost, that would simplify
it even more.

-Andi


* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
  2011-11-21 10:00   ` Christoph Hellwig
@ 2011-11-21 14:46   ` Jeff Moyer
  2011-11-21 22:52   ` Andrew Morton
  2 siblings, 0 replies; 47+ messages in thread
From: Jeff Moyer @ 2011-11-21 14:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Li Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
	Andi Kleen

Wu Fengguang <fengguang.wu@intel.com> writes:

> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> on which blkid runs unpleasantly slow. He manages to optimize the blkid
> reads down to 1kB+16kB, but still kernel read-ahead turns it into 48kB.
>
>      lseek 0,    read 1024   => readahead 4 pages (start of file)
>      lseek 1536, read 16384  => readahead 8 pages (page contiguous)
>
> The readahead heuristics involved here are reasonable ones in general.
> So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
>
> For the kernel part, Linus suggests:
>   So maybe we could be less aggressive about read-ahead when the size of
>   the device is small? Turning a 16kB read into a 64kB one is a big deal,
>   when it's about 15% of the whole device!
>
> This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
>
> Given that the non-rotational attribute is not always reported, we can
> take disk size as a max readahead size hint. This patch uses a formula
> that generates the following concrete limits:
>
>         disk size    readahead size
>      (scale by 4)      (scale by 2)
>                1M                8k
>                4M               16k
>               16M               32k
>               64M               64k
>              256M              128k
>         --------------------------- (*)
>                1G              256k
>                4G              512k
>               16G             1024k
>               64G             2048k
>              256G             4096k
>
> (*) Since the default readahead size is 128k, this limit only takes
> effect for devices whose size is less than 256M.

Scaling readahead by the size of the device may make sense up to a
point.  But really, that shouldn't be the only factor considered.  Just
because you have a big disk, it doesn't mean it's fast, and it sure
doesn't mean that you should waste memory with readahead data that may
not be used before it's evicted (whether due to readahead on data that
isn't needed or thrashing of the page cache).

So, I think reducing the readahead size for smaller devices is fine.
I'm not a big fan of going the other way.

Cheers,
Jeff


* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
  2011-11-21 10:00   ` Christoph Hellwig
  2011-11-21 14:46   ` Jeff Moyer
@ 2011-11-21 22:52   ` Andrew Morton
  2011-11-22 14:23     ` Jeff Moyer
  2011-11-23 12:18     ` Wu Fengguang
  2 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 22:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Li Shaohua,
	Clemens Ladisch, Jens Axboe, Rik van Riel, LKML, Andi Kleen

On Mon, 21 Nov 2011 17:18:20 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> on which blkid runs unpleasantly slow. He manages to optimize the blkid
> reads down to 1kB+16kB, but still kernel read-ahead turns it into 48kB.
> 
>      lseek 0,    read 1024   => readahead 4 pages (start of file)

I'm disturbed that the code did a 4 page (16kbyte?) readahead after an
lseek.  Given the high probability that the next read will occur after
a second lseek, that's a mistake.

Was an lseek to offset 0 special-cased?

>      lseek 1536, read 16384  => readahead 8 pages (page contiguous)
> 
> The readahead heuristics involved here are reasonable ones in general.
> So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
> 
> For the kernel part, Linus suggests:
>   So maybe we could be less aggressive about read-ahead when the size of
>   the device is small? Turning a 16kB read into a 64kB one is a big deal,
>   when it's about 15% of the whole device!
> 
> This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).

Spose so.  Obviously there are other characteristics which should be
considered when choosing a readahead size, but one of them can be disk
size and that's what this change does.

In a better world, userspace would run a
work-out-what-readahead-size-to-use script each time a distro is
installed and when new storage devices are added/detected.  Userspace
would then remember that readahead size for subsequent bootups.

In the real world, we shovel guaranteed-to-be-wrong guesswork into the
kernel and everyone just uses the results.  Sigh.

> --- linux-next.orig/block/genhd.c	2011-10-31 00:13:51.000000000 +0800
> +++ linux-next/block/genhd.c	2011-11-18 11:27:08.000000000 +0800
> @@ -623,6 +623,26 @@ void add_disk(struct gendisk *disk)
>  	WARN_ON(retval);
>  
>  	disk_add_events(disk);
> +
> +	/*
> +	 * Limit default readahead size for small devices.
> +	 *        disk size    readahead size
> +	 *               1M                8k
> +	 *               4M               16k
> +	 *              16M               32k
> +	 *              64M               64k
> +	 *             256M              128k
> +	 *               1G              256k
> +	 *               4G              512k
> +	 *              16G             1024k
> +	 *              64G             2048k
> +	 *             256G             4096k
> +	 */
> +	if (get_capacity(disk)) {
> +		unsigned long size = get_capacity(disk) >> 9;

get_capacity() returns sector_t.  This expression will overflow with a
2T disk.  I'm not sure if we successfully support 2T disks on 32-bit
machines, but changes like this will guarantee that we don't :)

> +		size = 1UL << (ilog2(size) / 2);

I think there's a rounddown_pow_of_two() hiding in that expression?

> +		bdi->ra_pages = min(bdi->ra_pages, size);

I don't have a clue why that min() is in there.  It needs a comment,
please.

> +	}




* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-21  9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
  2011-11-21 11:04   ` Steven Whitehouse
@ 2011-11-21 23:01   ` Andrew Morton
  2011-11-23 12:47     ` Wu Fengguang
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Steven Whitehouse, Rik van Riel, LKML

On Mon, 21 Nov 2011 17:18:22 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Introduce a readahead flags field and embed the existing mmap_miss in it
> (mainly to save space).

What an ugly patch.

> It will be possible to lose the flags in race conditions, however the
> impact should be limited.  For the race to happen, there must be two
> threads sharing the same file descriptor to be in page fault or
> readahead at the same time.
> 
> Note that it has always been racy for "page faults" at the same time.
> 
> And if ever the race happen, we'll lose one mmap_miss++ or mmap_miss--.
> Which may change some concrete readahead behavior, but won't really
> impact overall I/O performance.
> 
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Steven Whitehouse <swhiteho@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/fs.h |   31 ++++++++++++++++++++++++++++++-
>  mm/filemap.c       |    9 ++-------
>  2 files changed, 32 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/include/linux/fs.h	2011-11-20 11:30:55.000000000 +0800
> +++ linux-next/include/linux/fs.h	2011-11-20 11:48:53.000000000 +0800
> @@ -945,10 +945,39 @@ struct file_ra_state {
>  					   there are only # of pages ahead */
>  
>  	unsigned int ra_pages;		/* Maximum readahead window */
> -	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
> +	unsigned int ra_flags;

And it doesn't actually save any space, unless ra_flags gets used for
something else in a subsequent patch.  And if it does, perhaps ra_flags
should be ulong, which is compatible with the bitops.h code.

Or perhaps we should use a bitfield and let the compiler do the work.

>  	loff_t prev_pos;		/* Cache last read() position */
>  };
>  
> +/* ra_flags bits */
> +#define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
> +
> +/*
> + * Don't do ra_flags++ directly to avoid possible overflow:
> + * the ra fields can be accessed concurrently in a racy way.
> + */
> +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> +{
> +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> +	/* the upper bound avoids banging the cache line unnecessarily */
> +	if (miss < READAHEAD_MMAP_MISS) {
> +		miss++;
> +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> +	}
> +	return miss;
> +}
> +
> +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> +{
> +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> +	if (miss) {
> +		miss--;
> +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> +	}
> +}

It's strange that ra_mmap_miss_inc() returns the new value whereas
ra_mmap_miss_dec() returns void.




* Re: [PATCH 4/8] readahead: record readahead patterns
  2011-11-21  9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
@ 2011-11-21 23:19   ` Andrew Morton
  2011-11-29  2:40     ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:19 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

On Mon, 21 Nov 2011 17:18:23 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Record the readahead pattern in ra_flags and extend the ra_submit()
> parameters, to be used by the next readahead tracing/stats patches.
> 
> 7 patterns are defined:
> 
>       	pattern			readahead for
> -----------------------------------------------------------
> 	RA_PATTERN_INITIAL	start-of-file read
> 	RA_PATTERN_SUBSEQUENT	trivial sequential read
> 	RA_PATTERN_CONTEXT	interleaved sequential read
> 	RA_PATTERN_OVERSIZE	oversize read
> 	RA_PATTERN_MMAP_AROUND	mmap fault
> 	RA_PATTERN_FADVISE	posix_fadvise()
> 	RA_PATTERN_RANDOM	random read

It would be useful to spell out in full detail what an "interleaved
sequential read" is, and why a read is considered "oversized", etc. 
The 'enum readahead_pattern' definition site would be a good place for
this.

> Note that random reads will be recorded in file_ra_state now.
> This won't deteriorate cache bouncing because the ra->prev_pos update
> in do_generic_file_read() already pollutes the data cache, and
> filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
> 
> --- linux-next.orig/include/linux/fs.h	2011-11-20 20:10:48.000000000 +0800
> +++ linux-next/include/linux/fs.h	2011-11-20 20:18:29.000000000 +0800
> @@ -951,6 +951,39 @@ struct file_ra_state {
>  
>  /* ra_flags bits */
>  #define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
> +#define	READAHEAD_MMAP		0x00010000

Why leave a gap?

And what is READAHEAD_MMAP anyway?

> +#define READAHEAD_PATTERN_SHIFT	28

Why 28?

> +#define READAHEAD_PATTERN	0xf0000000
> +
> +/*
> + * Which policy makes decision to do the current read-ahead IO?
> + */
> +enum readahead_pattern {
> +	RA_PATTERN_INITIAL,
> +	RA_PATTERN_SUBSEQUENT,
> +	RA_PATTERN_CONTEXT,
> +	RA_PATTERN_MMAP_AROUND,
> +	RA_PATTERN_FADVISE,
> +	RA_PATTERN_OVERSIZE,
> +	RA_PATTERN_RANDOM,
> +	RA_PATTERN_ALL,		/* for summary stats */
> +	RA_PATTERN_MAX
> +};

Again, the behaviour is all undocumented.  I see from the code that
multiple flags can be set at the same time.  So afacit a file can be
marked RANDOM and SUBSEQUENT at the same time, which seems oxymoronic.

This reader wants to know what the implications of this are - how the
code chooses, prioritises and acts.  But this code doesn't tell me.

> +static inline unsigned int ra_pattern(unsigned int ra_flags)
> +{
> +	unsigned int pattern = ra_flags >> READAHEAD_PATTERN_SHIFT;

OK, no masking is needed because the code silently assumes that arg
`ra_flags' came out of an ra_state.ra_flags and it also silently
assumes that no higher bits are used in ra_state.ra_flags.

That's a bit of a handgrenade - if someone redoes the flags
enumeration, the code will explode.

> +	return min_t(unsigned int, pattern, RA_PATTERN_ALL);
> +}

<scratches head>

What the heck is that min_t() doing in there?

> +static inline void ra_set_pattern(struct file_ra_state *ra,
> +				  unsigned int pattern)
> +{
> +	ra->ra_flags = (ra->ra_flags & ~READAHEAD_PATTERN) |
> +			    (pattern << READAHEAD_PATTERN_SHIFT);
> +}
>  
>  /*
>   * Don't do ra_flags++ directly to avoid possible overflow:
>
> ...
>


* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-21  9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
  2011-11-21 14:17   ` Andi Kleen
@ 2011-11-21 23:29   ` Andrew Morton
  2011-11-21 23:32     ` Andi Kleen
  2011-11-29  3:23     ` Wu Fengguang
  1 sibling, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

On Mon, 21 Nov 2011 17:18:24 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
> and will remain inactive unless enabled explicitly with either boot option
> 
> 	readahead_stats=1
> 
> or through the debugfs interface
> 
> 	echo 1 > /debug/readahead/stats_enable

It's unfortunate that these two things have different names.

I'd have thought that the debugfs knob was sufficient - no need for the
boot option.

> The added overheads are two readahead_stats() calls per readahead.
> Which is trivial costs unless there are concurrent random reads on
> super fast SSDs, which may lead to cache bouncing when updating the
> global ra_stats[][]. Considering that normal users won't need this
> except when debugging performance problems, it's disabled by default.
> So it looks reasonable to keep this debug code simple rather than trying
> to improve its scalability.

I may be wrong, but I don't think the CPU cost of this code matters a
lot.  People will rarely turn it on and disk IO is a lot slower than
CPU actions and it's waaaaaaay more important to get high-quality info
about readahead than it is to squeeze out a few CPU cycles.

>
> ...
>
> @@ -51,6 +62,182 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
>  
>  #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
>  
> +#ifdef CONFIG_READAHEAD_STATS
> +#include <linux/seq_file.h>
> +#include <linux/debugfs.h>
> +
> +static u32 readahead_stats_enable __read_mostly;
> +
> +static int __init config_readahead_stats(char *str)
> +{
> +	int enable = 1;
> +	get_option(&str, &enable);
> +	readahead_stats_enable = enable;
> +	return 0;
> +}
> +early_param("readahead_stats", config_readahead_stats);

Why use early_param() rather than plain old __setup()?

> +enum ra_account {
> +	/* number of readaheads */
> +	RA_ACCOUNT_COUNT,	/* readahead request */
> +	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
> +	RA_ACCOUNT_CHIT,	/* readahead request covers some cached pages */

I don't like chit :)  "cache_hit" would be better.  Or just "hit".

> +	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
> +	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
> +	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
> +	/* number of readahead pages */
> +	RA_ACCOUNT_SIZE,	/* readahead size */
> +	RA_ACCOUNT_ASIZE,	/* readahead async size */
> +	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
> +	/* end mark */
> +	RA_ACCOUNT_MAX,
> +};
> +
>
> ...
>
> +static void readahead_event(struct address_space *mapping,
> +			    pgoff_t offset,
> +			    unsigned long req_size,
> +			    unsigned int ra_flags,
> +			    pgoff_t start,
> +			    unsigned int size,
> +			    unsigned int async_size,
> +			    unsigned int actual)
> +{
> +#ifdef CONFIG_READAHEAD_STATS
> +	if (readahead_stats_enable) {
> +		readahead_stats(mapping, offset, req_size, ra_flags,
> +				start, size, async_size, actual);
> +		readahead_stats(mapping, offset, req_size,
> +				RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> +				start, size, async_size, actual);
> +	}
> +#endif
> +}

The stub should be inlined, methinks.  The overhead of evaluating and
preparing eight arguments is significant.  I don't think the compiler
is yet smart enough to save us.

>
> ...
>
> --- linux-next.orig/Documentation/kernel-parameters.txt	2011-11-21 17:08:38.000000000 +0800
> +++ linux-next/Documentation/kernel-parameters.txt	2011-11-21 17:08:51.000000000 +0800
> @@ -2251,6 +2251,12 @@ bytes respectively. Such letter suffixes
> 			This default max readahead size may be overridden
> 			in some cases, notably NFS, btrfs and software RAID.
>  
> +	readahead_stats[=0|1]
> +			Enable/disable readahead stats accounting.
> +
> +			It's also possible to enable/disable it after boot:
> +			echo 1 > /sys/kernel/debug/readahead/stats_enable

Can the current setting be read back?




* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-21 23:29   ` Andrew Morton
@ 2011-11-21 23:32     ` Andi Kleen
  2011-11-29  3:23     ` Wu Fengguang
  1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2011-11-21 23:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Linux Memory Management List, linux-fsdevel,
	Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML,
	Andi Kleen

> I may be wrong, but I don't think the CPU cost of this code matters a
> lot.  People will rarely turn it on and disk IO is a lot slower than
> CPU actions and it's waaaaaaay more important to get high-quality info
> about readahead than it is to squeeze out a few CPU cycles.

In its current form it would cache-line bounce, which tends to be
extremely slow. But the solution is probably to make it per CPU.

-Andi



* Re: [PATCH 7/8] readahead: basic support for backwards prefetching
  2011-11-21  9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
@ 2011-11-21 23:33   ` Andrew Morton
  2011-11-29  3:08     ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Li Shaohua, LKML

On Mon, 21 Nov 2011 17:18:26 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Add the backwards prefetching feature. It's pretty simple if we don't
> support async prefetching and interleaved reads.

Well OK, but I wonder how many applications out there read files in
reverse order.  Is it common enough to bother special-casing in the
kernel like this?




* Re: [PATCH 8/8] readahead: dont do start-of-file readahead after lseek()
  2011-11-21  9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
@ 2011-11-21 23:36   ` Andrew Morton
  2011-11-22 14:18     ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Rik van Riel,
	Linus Torvalds, LKML, Andi Kleen

On Mon, 21 Nov 2011 17:18:27 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Some applications (eg. blkid, id3tool etc.) seek around the file
> to get information. For example, blkid does
> 
> 	     seek to	0
> 	     read	1024
> 	     seek to	1536
> 	     read	16384
> 
> The start-of-file readahead heuristic is wrong for such programs, whose
> access pattern can be identified by the lseek() calls.

ah, there we are.

> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/read_write.c    |    4 ++++
>  include/linux/fs.h |    1 +
>  mm/readahead.c     |    3 +++
>  3 files changed, 8 insertions(+)
> 
> --- linux-next.orig/mm/readahead.c	2011-11-20 22:02:01.000000000 +0800
> +++ linux-next/mm/readahead.c	2011-11-20 22:02:03.000000000 +0800
> @@ -629,6 +629,8 @@ ondemand_readahead(struct address_space 
>  	 * start of file
>  	 */
>  	if (!offset) {
> +		if ((ra->ra_flags & READAHEAD_LSEEK) && req_size < max)
> +			goto random_read;
>  		ra_set_pattern(ra, RA_PATTERN_INITIAL);
>  		goto initial_readahead;
>  	}
> @@ -707,6 +709,7 @@ ondemand_readahead(struct address_space 
>  	if (try_context_readahead(mapping, ra, offset, req_size, max))
>  		goto readit;
>  
> +random_read:
>  	/*
>  	 * standalone, small random read
>  	 */
> --- linux-next.orig/fs/read_write.c	2011-11-20 22:02:01.000000000 +0800
> +++ linux-next/fs/read_write.c	2011-11-20 22:02:03.000000000 +0800
> @@ -47,6 +47,10 @@ static loff_t lseek_execute(struct file 
>  		file->f_pos = offset;
>  		file->f_version = 0;
>  	}
> +
> +	if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
> +		file->f_ra.ra_flags |= READAHEAD_LSEEK;
> +
>  	return offset;
>  }

Confused.  How does READAHEAD_LSEEK get cleared again?


* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-21 14:17   ` Andi Kleen
@ 2011-11-22 14:14     ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-22 14:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML

Andi,

On Mon, Nov 21, 2011 at 10:17:59PM +0800, Andi Kleen wrote:
> > +static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
> 
> Why not make it per cpu?  That should get the overhead down, probably
> even enough that it can be enabled by default.
> 
> BTW I have an older framework to make it really easy to add per
> cpu stats counters to debugfs. Will repost, that would simplify
> it even more.

That's definitely a good facility to have. I would be happy to become
its first user :)

Thanks,
Fengguang


* Re: [PATCH 8/8] readahead: dont do start-of-file readahead after lseek()
  2011-11-21 23:36   ` Andrew Morton
@ 2011-11-22 14:18     ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-22 14:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Rik van Riel,
	Linus Torvalds, LKML, Andi Kleen

> > --- linux-next.orig/fs/read_write.c	2011-11-20 22:02:01.000000000 +0800
> > +++ linux-next/fs/read_write.c	2011-11-20 22:02:03.000000000 +0800
> > @@ -47,6 +47,10 @@ static loff_t lseek_execute(struct file 
> >  		file->f_pos = offset;
> >  		file->f_version = 0;
> >  	}
> > +
> > +	if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
> > +		file->f_ra.ra_flags |= READAHEAD_LSEEK;
> > +
> >  	return offset;
> >  }
> 
> Confused.  How does READAHEAD_LSEEK get cleared again?

I thought it wasn't necessary (at least for this case). But yeah, it's
good to clear it, to keep the behavior predictable and avoid surprises.

And it would be simple to do, in ra_submit():

-       ra->ra_flags &= ~READAHEAD_MMAP;
+       ra->ra_flags &= ~(READAHEAD_MMAP | READAHEAD_LSEEK);

Thanks,
Fengguang


* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21 22:52   ` Andrew Morton
@ 2011-11-22 14:23     ` Jeff Moyer
  2011-11-23 12:18     ` Wu Fengguang
  1 sibling, 0 replies; 47+ messages in thread
From: Jeff Moyer @ 2011-11-22 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Linux Memory Management List, linux-fsdevel,
	Li Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
	Andi Kleen

Andrew Morton <akpm@linux-foundation.org> writes:

> In a better world, userspace would run a
> work-out-what-readahead-size-to-use script each time a distro is
> installed and when new storage devices are added/detected.  Userspace
> would then remember that readahead size for subsequent bootups.

I'd be interested to hear what factors you think should be taken into
account by such a script.  I agree that there are certain things, like
timing of reads of different sizes, or heuristics based on the size of
installed memory, which could contribute to the default readahead size.
However, other things, like memory pressure while running the desired
workload, can't really be measured by an installer or one-time script.

> In the real world, we shovel guaranteed-to-be-wrong guesswork into the
> kernel and everyone just uses the results.  Sigh.

I'm not sure a userspace tool is the panacea you paint.  However, if you
can provide some guidance on what you think could make things better,
I'm happy to give it a go.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/8] block: limit default readahead size for small devices
  2011-11-21 22:52   ` Andrew Morton
  2011-11-22 14:23     ` Jeff Moyer
@ 2011-11-23 12:18     ` Wu Fengguang
  1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-23 12:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Li Shaohua,
	Clemens Ladisch, Jens Axboe, Rik van Riel, LKML, Andi Kleen

On Mon, Nov 21, 2011 at 02:52:47PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:20 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> > on which blkid runs unpleasantly slow. He manages to optimize the blkid
> > reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
> > 
> >      lseek 0,    read 1024   => readahead 4 pages (start of file)
> 
> I'm disturbed that the code did a 4 page (16kbyte?) readahead after an
> lseek.  Given the high probability that the next read will occur after
> a second lseek, that's a mistake.

Agreed.

> Was an lseek to offset 0 special-cased?

Yup, as you see in the last patch :)

> >      lseek 1536, read 16384  => readahead 8 pages (page contiguous)
> > 
> > The readahead heuristics involved here are reasonable ones in general.
> > So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
> > 
> > For the kernel part, Linus suggests:
> >   So maybe we could be less aggressive about read-ahead when the size of
> >   the device is small? Turning a 16kB read into a 64kB one is a big deal,
> >   when it's about 15% of the whole device!
> > 
> > This looks reasonable: smaller devices tend to be slower (USB sticks as
> > well as micro/mobile/old hard disks).
> 
> Spose so.  Obviously there are other characteristics which should be
> considered when choosing a readahead size, but one of them can be disk
> size and that's what this change does.

The other characteristics include memory size, disk throughput and
latency, and (unfortunately) above all the real workload and its
sensitivity to IO latencies.

Disk throughput and latency would be better hints than disk size;
however they are not that easy to measure. Disk size is readily
available w/o any cost, hence this patch.

> In a better world, userspace would run a
> work-out-what-readahead-size-to-use script each time a distro is
> installed and when new storage devices are added/detected.  Userspace
> would then remember that readahead size for subsequent bootups.
> 
> In the real world, we shovel guaranteed-to-be-wrong guesswork into the
> kernel and everyone just uses the results.  Sigh.

Heh. Disk characteristics should be measurable with some effort.
However I won't even try to guess the workload and user demands...

Practically, I would choose a meaningful default size (512KB or 1MB)
for the majority and reduce it on small memory/slow disk systems (this
patch does the easy work based on disk size).

> > --- linux-next.orig/block/genhd.c	2011-10-31 00:13:51.000000000 +0800
> > +++ linux-next/block/genhd.c	2011-11-18 11:27:08.000000000 +0800
> > @@ -623,6 +623,26 @@ void add_disk(struct gendisk *disk)
> >  	WARN_ON(retval);
> >  
> >  	disk_add_events(disk);
> > +
> > +	/*
> > +	 * Limit default readahead size for small devices.
> > +	 *        disk size    readahead size
> > +	 *               1M                8k
> > +	 *               4M               16k
> > +	 *              16M               32k
> > +	 *              64M               64k
> > +	 *             256M              128k
> > +	 *               1G              256k
> > +	 *               4G              512k
> > +	 *              16G             1024k
> > +	 *              64G             2048k
> > +	 *             256G             4096k
> > +	 */
> > +	if (get_capacity(disk)) {
> > +		unsigned long size = get_capacity(disk) >> 9;
> 
> get_capacity() returns sector_t.  This expression will overflow with a
> 2T disk.  I'm not sure if we successfully support 2T disks on 32-bit
> machines, but changes like this will guarantee that we don't :)

Good catch! I'll change the type to size_t.

> > +		size = 1UL << (ilog2(size) / 2);
> 
> I think there's a rounddown_pow_of_two() hiding in that expression?

Yeah, added a line of comment for that.

> > +		bdi->ra_pages = min(bdi->ra_pages, size);
> 
> I don't have a clue why that min() is in there.  It needs a comment,
> please.

It's actually explained by the word "Limit" in the comment; I'll change
it to "Scale down" to make it more obvious.

Thanks,
Fengguang
---
Subject: block: limit default readahead size for small devices
Date: Fri Nov 18 11:27:12 CST 2011

Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slow. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.

     lseek 0,    read 1024   => readahead 4 pages (start of file)
     lseek 1536, read 16384  => readahead 8 pages (page contiguous)

The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.

For the kernel part, Linus suggests:
  So maybe we could be less aggressive about read-ahead when the size of
  the device is small? Turning a 16kB read into a 64kB one is a big deal,
  when it's about 15% of the whole device!

This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).

Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:

        disk size    readahead size
     (scale by 4)      (scale by 2)
               1M                8k
               4M               16k
              16M               32k
              64M               64k
             256M              128k
        --------------------------- (*)
               1G              256k
               4G              512k
              16G             1024k
              64G             2048k
             256G             4096k

(*) Since the default readahead size is 128k, this limit only takes
effect for devices whose size is less than 256M.

The formula is determined on the following data, collected by script:

	#!/bin/sh

	# please make sure BDEV is not mounted or opened by others
	BDEV=sdb

	for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
	do
		echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
		time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
	done

The principle is that the formula shall not limit the readahead size to
such a degree that it would impact any device's sequential read
performance.

The Intel SSD is special in that its throughput increases steadily with
larger readahead size. However it may take years for Linux to increase
its default readahead size to 2MB, so we don't take it seriously in the
formula.

SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)

	rasize	1st run		2nd run
	----------------------------------
	  4k	123 MB/s	122 MB/s
	 16k  	153 MB/s	153 MB/s
	 32k	161 MB/s	162 MB/s
	 64k	167 MB/s	168 MB/s
	128k	197 MB/s	197 MB/s
	256k	217 MB/s	217 MB/s
	512k	238 MB/s	234 MB/s
	  1M	251 MB/s	248 MB/s
	  2M	259 MB/s	257 MB/s
==>	  4M	269 MB/s	264 MB/s
	  8M	266 MB/s	266 MB/s

Note that ==> points to the readahead size that yields plateau throughput.

SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)

	rasize  1st             2nd
	--------------------------------
	  4k     41 MB/s         41 MB/s
	 16k     85 MB/s         81 MB/s
	 32k    102 MB/s        109 MB/s
	 64k    125 MB/s        144 MB/s
	128k    183 MB/s        185 MB/s
	256k    216 MB/s        216 MB/s
	512k    216 MB/s        236 MB/s
	1024k   251 MB/s        252 MB/s
	  2M    258 MB/s        258 MB/s
==>       4M    266 MB/s        266 MB/s
	  8M    266 MB/s        266 MB/s

SSD 30G SanDisk SATA 5000

	  4k	29.6 MB/s	29.6 MB/s	29.6 MB/s
	 16k	52.1 MB/s	52.1 MB/s	52.1 MB/s
	 32k	61.5 MB/s	61.5 MB/s	61.5 MB/s
	 64k	67.2 MB/s	67.2 MB/s	67.1 MB/s
	128k	71.4 MB/s	71.3 MB/s	71.4 MB/s
	256k	73.4 MB/s	73.4 MB/s	73.3 MB/s
==>	512k	74.6 MB/s	74.6 MB/s	74.6 MB/s
	  1M	74.7 MB/s	74.6 MB/s	74.7 MB/s
	  2M	76.1 MB/s	74.6 MB/s	74.6 MB/s

USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165

	  4k	7.9 MB/s 	7.9 MB/s 	7.9 MB/s
	 16k	17.9 MB/s	17.9 MB/s	17.9 MB/s
	 32k	24.5 MB/s	24.5 MB/s	24.5 MB/s
	 64k	28.7 MB/s	28.7 MB/s	28.7 MB/s
	128k	28.8 MB/s	28.9 MB/s	28.9 MB/s
==>	256k	30.5 MB/s	30.5 MB/s	30.5 MB/s
	512k	30.9 MB/s	31.0 MB/s	30.9 MB/s
	  1M	31.0 MB/s	30.9 MB/s	30.9 MB/s
	  2M	30.9 MB/s	30.9 MB/s	30.9 MB/s

USB stick 4G SanDisk  Cruzer idVendor=0781, idProduct=5151

	  4k	6.4 MB/s 	6.4 MB/s 	6.4 MB/s
	 16k	13.4 MB/s	13.4 MB/s	13.2 MB/s
	 32k	17.8 MB/s	17.9 MB/s	17.8 MB/s
	 64k	21.3 MB/s	21.3 MB/s	21.2 MB/s
	128k	21.4 MB/s	21.4 MB/s	21.4 MB/s
==>	256k	23.3 MB/s	23.2 MB/s	23.2 MB/s
	512k	23.3 MB/s	23.8 MB/s	23.4 MB/s
	  1M	23.8 MB/s	23.4 MB/s	23.3 MB/s
	  2M	23.4 MB/s	23.2 MB/s	23.4 MB/s

USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113

	  4k	6.7 MB/s 	6.9 MB/s 	6.7 MB/s
	 16k	11.7 MB/s	11.7 MB/s	11.7 MB/s
	 32k	12.4 MB/s	12.4 MB/s	12.4 MB/s
   	 64k	13.4 MB/s	13.4 MB/s	13.4 MB/s
	128k	13.4 MB/s	13.4 MB/s	13.4 MB/s
==>	256k	13.6 MB/s	13.6 MB/s	13.6 MB/s
	512k	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  1M	13.7 MB/s	13.7 MB/s	13.7 MB/s
	  2M	13.7 MB/s	13.7 MB/s	13.7 MB/s

64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey

	4KB:    139.339 s, 376 kB/s
	16KB:   81.0427 s, 647 kB/s
	32KB:   71.8513 s, 730 kB/s
==>	64KB:   67.3872 s, 778 kB/s
	128KB:  67.5434 s, 776 kB/s
	256KB:  65.9019 s, 796 kB/s
	512KB:  66.2282 s, 792 kB/s
	1024KB: 67.4632 s, 777 kB/s
	2048KB: 69.9759 s, 749 kB/s

An unnamed SD card (Yakui):

         4k     195.873 s,  5.5 MB/s
         8k     123.425 s,  8.7 MB/s
         16k    86.6425 s, 12.4 MB/s
         32k    66.7519 s, 16.1 MB/s
==>      64k    58.5262 s, 18.3 MB/s
         128k   59.3847 s, 18.1 MB/s
         256k   59.3188 s, 18.1 MB/s
         512k   59.0218 s, 18.2 MB/s

CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 block/genhd.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

--- linux-next.orig/block/genhd.c	2011-11-21 20:32:56.000000000 +0800
+++ linux-next/block/genhd.c	2011-11-23 20:11:23.000000000 +0800
@@ -578,6 +578,7 @@ exit:
 void add_disk(struct gendisk *disk)
 {
 	struct backing_dev_info *bdi;
+	size_t size;
 	dev_t devt;
 	int retval;
 
@@ -623,6 +624,25 @@ void add_disk(struct gendisk *disk)
 	WARN_ON(retval);
 
 	disk_add_events(disk);
+
+	/*
+	 * Scale down default readahead size for small devices.
+	 *        disk size    readahead size
+	 *               1M                8k
+	 *               4M               16k
+	 *              16M               32k
+	 *              64M               64k
+	 *             255M               64k (the round down effect)
+	 *             256M              128k
+	 *               1G              256k
+	 *               4G              512k
+	 *              16G             1024k
+	 */
+	size = get_capacity(disk);
+	if (size) {
+		size = 1 << (ilog2(size >> 9) / 2);
+		bdi->ra_pages = min(bdi->ra_pages, size);
+	}
 }
 EXPORT_SYMBOL(add_disk);
 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-21 23:01   ` Andrew Morton
@ 2011-11-23 12:47     ` Wu Fengguang
  2011-11-23 20:31       ` Andrew Morton
  0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-23 12:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Steven Whitehouse, Rik van Riel, LKML

On Mon, Nov 21, 2011 at 03:01:16PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:22 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Introduce a readahead flags field and embed the existing mmap_miss in it
> > (mainly to save space).
> 
> What an ugly patch.

Indeed..

> > It will be possible to lose the flags in race conditions, however the
> > impact should be limited.  For the race to happen, there must be two
> > threads sharing the same file descriptor to be in page fault or
> > readahead at the same time.
> > 
> > Note that it has always been racy for "page faults" at the same time.
> > 
> > And if ever the race happen, we'll lose one mmap_miss++ or mmap_miss--.
> > Which may change some concrete readahead behavior, but won't really
> > impact overall I/O performance.
> > 
> > CC: Andi Kleen <andi@firstfloor.org>
> > CC: Steven Whitehouse <swhiteho@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/fs.h |   31 ++++++++++++++++++++++++++++++-
> >  mm/filemap.c       |    9 ++-------
> >  2 files changed, 32 insertions(+), 8 deletions(-)
> > 
> > --- linux-next.orig/include/linux/fs.h	2011-11-20 11:30:55.000000000 +0800
> > +++ linux-next/include/linux/fs.h	2011-11-20 11:48:53.000000000 +0800
> > @@ -945,10 +945,39 @@ struct file_ra_state {
> >  					   there are only # of pages ahead */
> >  
> >  	unsigned int ra_pages;		/* Maximum readahead window */
> > -	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
> > +	unsigned int ra_flags;
> 
> And it doesn't actually save any space, unless ra_flags gets used for
> something else in a subsequent patch.  And if it does, perhaps ra_flags

Because it's a preparation patch. There will be more fields defined later.

> should be ulong, which is compatible with the bitops.h code.
> Or perhaps we should use a bitfield and let the compiler do the work.

What if we do

        u16     mmap_miss;
        u16     ra_flags;

That would get rid of this patch. I'd still like to pack the various
flags as well as pattern into one single ra_flags, which makes it
convenient to pass things around (as one single parameter).

> >  	loff_t prev_pos;		/* Cache last read() position */
> >  };
> >  
> > +/* ra_flags bits */
> > +#define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
> > +
> > +/*
> > + * Don't do ra_flags++ directly to avoid possible overflow:
> > + * the ra fields can be accessed concurrently in a racy way.
> > + */
> > +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> > +{
> > +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > +	/* the upper bound avoids banging the cache line unnecessarily */
> > +	if (miss < READAHEAD_MMAP_MISS) {
> > +		miss++;
> > +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > +	}
> > +	return miss;
> > +}
> > +
> > +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> > +{
> > +	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > +	if (miss) {
> > +		miss--;
> > +		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > +	}
> > +}
> 
> It's strange that ra_mmap_miss_inc() returns the new value whereas
> ra_mmap_miss_dec() returns void.

Simply because no one needs to check the return value of ra_mmap_miss_dec()...
But yeah, it's good to make them look symmetric.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-23 12:47     ` Wu Fengguang
@ 2011-11-23 20:31       ` Andrew Morton
  2011-11-29  3:42         ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-23 20:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Steven Whitehouse, Rik van Riel, LKML

On Wed, 23 Nov 2011 20:47:45 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> > should be ulong, which is compatible with the bitops.h code.
> > Or perhaps we should use a bitfield and let the compiler do the work.
> 
> What if we do
> 
>         u16     mmap_miss;
>         u16     ra_flags;
> 
> That would get rid of this patch. I'd still like to pack the various
> flags as well as pattern into one single ra_flags, which makes it
> convenient to pass things around (as one single parameter).

I'm not sure that this will improve things much...

Again, how does the code look if you use a bitfield and let the
compiler do the work?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-21 11:35     ` Wu Fengguang
@ 2011-11-24 22:28       ` Jan Kara
  2011-11-25  0:36         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-11-24 22:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Andrew Morton, Linux Memory Management List,
	linux-fsdevel, Ankit Jain, Dave Chinner, Christian Ehrhardt,
	Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen

On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> > On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > > From: Nikanth Karthikesan <knikanth@suse.de>
> > > 
> > > Add new kernel parameter "readahead=", which allows user to override
> > > the static VM_MAX_READAHEAD=128kb.
> > 
> > Is a boot-time paramter really such a good idea?  I would at least
> 
> It's most convenient to set at boot time, because the default size
> will be used to initialize all the block devices.
> 
> > make it a sysctl so that it's run-time controllable, including
> > being able to set it from initscripts.
> 
> Once booted up, it's more natural to set the sizes one by one, for
> example
> 
>         blockdev --setra 1024 /dev/sda2
> or
>         echo 512 > /sys/block/sda/queue/read_ahead_kb
> 
> And you still have the chance to modify the global default, but the
> change will only be inherited by newly created devices thereafter:
> 
>         echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
> 
> The above command is very suitable for use in initscripts.  However
> there is no natural way to do a sysctl, as there is no such global
> value.
  Well, you can always have an udev rule to set read_ahead_kb to whatever
you want. In some respect that looks like a nicer solution to me...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-24 22:28       ` Jan Kara
@ 2011-11-25  0:36         ` Dave Chinner
  2011-11-28  2:39           ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2011-11-25  0:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Christoph Hellwig, Andrew Morton,
	Linux Memory Management List, linux-fsdevel, Ankit Jain,
	Christian Ehrhardt, Rik van Riel, Nikanth Karthikesan, LKML,
	Andi Kleen

On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
> On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> > On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> > > On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > > > From: Nikanth Karthikesan <knikanth@suse.de>
> > > > 
> > > > Add new kernel parameter "readahead=", which allows user to override
> > > > the static VM_MAX_READAHEAD=128kb.
> > > 
> > > Is a boot-time paramter really such a good idea?  I would at least
> > 
> > It's most convenient to set at boot time, because the default size
> > will be used to initialize all the block devices.
> > 
> > > make it a sysctl so that it's run-time controllable, including
> > > being able to set it from initscripts.
> > 
> > Once booted up, it's more natural to set the sizes one by one, for
> > example
> > 
> >         blockdev --setra 1024 /dev/sda2
> > or
> >         echo 512 > /sys/block/sda/queue/read_ahead_kb
> > 
> > And you still have the chance to modify the global default, but the
> > change will only be inherited by newly created devices thereafter:
> > 
> >         echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
> > 
> > The above command is very suitable for use in initscripts.  However
> > there is no natural way to do a sysctl, as there is no such global
> > value.
>   Well, you can always have an udev rule to set read_ahead_kb to whatever
> you want. In some respect that looks like a nicer solution to me...

And one that has already been in use for exactly this purpose for
years. Indeed, it's far more flexible because you can give different
types of devices different default readahead settings quite easily,
and you can set different defaults for just about any tunable
parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
way.

Hence I don't think we should treat default readahead any
differently from any other configurable storage parameter - we've
already got places to change the per-device defaults to something
sensible at boot/discovery time....
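For reference, a udev rule of the kind being discussed might look like
the fragment below (the match keys and the 64k value are illustrative,
not a recommended policy):

```
# /etc/udev/rules.d/60-readahead.rules (illustrative example)
# Give USB-attached disks a smaller readahead than the built-in default.
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_BUS}=="usb", \
    ATTR{queue/read_ahead_kb}="64"
```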

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-25  0:36         ` Dave Chinner
@ 2011-11-28  2:39           ` Wu Fengguang
  2011-11-30 13:04             ` Christian Ehrhardt
  0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-28  2:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Christoph Hellwig, Andrew Morton,
	Linux Memory Management List, linux-fsdevel, Ankit Jain,
	Christian Ehrhardt, Rik van Riel, Nikanth Karthikesan, LKML,
	Andi Kleen

On Fri, Nov 25, 2011 at 08:36:33AM +0800, Dave Chinner wrote:
> On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
> > On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> > > On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> > > > On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > > > > From: Nikanth Karthikesan <knikanth@suse.de>
> > > > > 
> > > > > Add new kernel parameter "readahead=", which allows user to override
> > > > > the static VM_MAX_READAHEAD=128kb.
> > > > 
> > > > Is a boot-time paramter really such a good idea?  I would at least
> > > 
> > > It's most convenient to set at boot time, because the default size
> > > will be used to initialize all the block devices.
> > > 
> > > > make it a sysctl so that it's run-time controllable, including
> > > > being able to set it from initscripts.
> > > 
> > > Once booted up, it's more natural to set the sizes one by one, for
> > > example
> > > 
> > >         blockdev --setra 1024 /dev/sda2
> > > or
> > >         echo 512 > /sys/block/sda/queue/read_ahead_kb
> > > 
> > > And you still have the chance to modify the global default, but the
> > > change will only be inherited by newly created devices thereafter:
> > > 
> > >         echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
> > > 
> > > The above command is very suitable for use in initscripts.  However
> > > there is no natural way to do a sysctl, as there is no such global
> > > value.
> >   Well, you can always have an udev rule to set read_ahead_kb to whatever
> > you want. In some respect that looks like a nicer solution to me...
> 
> And one that has already been in use for exactly this purpose for
> years. Indeed, it's far more flexible because you can give different
> types of devices different default readahead settings quite easily,
> and you can set different defaults for just about any tunable
> parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
> way.

I'm interested in this usage, too. Would you share some of your rules?

> Hence I don't think we should treat default readahead any
> differently from any other configurable storage parameter - we've
> already got places to change the per-device defaults to something
> sensible at boot/discovery time....

OK, I'll drop this patch.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/8] readahead: record readahead patterns
  2011-11-21 23:19   ` Andrew Morton
@ 2011-11-29  2:40     ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29  2:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

On Mon, Nov 21, 2011 at 03:19:19PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:23 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Record the readahead pattern in ra_flags and extend the ra_submit()
> > parameters, to be used by the next readahead tracing/stats patches.
> > 
> > 7 patterns are defined:
> > 
> >       	pattern			readahead for
> > -----------------------------------------------------------
> > 	RA_PATTERN_INITIAL	start-of-file read
> > 	RA_PATTERN_SUBSEQUENT	trivial sequential read
> > 	RA_PATTERN_CONTEXT	interleaved sequential read
> > 	RA_PATTERN_OVERSIZE	oversize read
> > 	RA_PATTERN_MMAP_AROUND	mmap fault
> > 	RA_PATTERN_FADVISE	posix_fadvise()
> > 	RA_PATTERN_RANDOM	random read
> 
> It would be useful to spell out in full detail what an "interleaved
> sequential read" is, and why a read is considered "oversized", etc. 
> The 'enum readahead_pattern' definition site would be a good place for
> this.

Good point, here is the added comments:

/*
 * Which policy makes decision to do the current read-ahead IO?
 *
 * RA_PATTERN_INITIAL           readahead window is initially opened,
 *                              normally when reading from start of file
 * RA_PATTERN_SUBSEQUENT        readahead window is pushed forward
 * RA_PATTERN_CONTEXT           no readahead window available, querying the
 *                              page cache to decide readahead start/size.
 *                              This typically happens on interleaved reads (eg.
 *                              reading pages 0, 1000, 1, 1001, 2, 1002, ...)
 *                              where one file_ra_state struct is not enough
 *                              for recording 2+ interleaved sequential read
 *                              streams.
 * RA_PATTERN_MMAP_AROUND       read-around on mmap page faults
 *                              (w/o any sequential/random hints)
 * RA_PATTERN_BACKWARDS         reverse reading detected
 * RA_PATTERN_FADVISE           triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
 * RA_PATTERN_OVERSIZE          a random read larger than max readahead size,
 *                              do max readahead to break down the read size
 * RA_PATTERN_RANDOM            a small random read
 */

> > Note that random reads will be recorded in file_ra_state now.
> > This won't deteriorate cache bouncing because the ra->prev_pos update
> > in do_generic_file_read() already pollutes the data cache, and
> > filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
> > 
> > --- linux-next.orig/include/linux/fs.h	2011-11-20 20:10:48.000000000 +0800
> > +++ linux-next/include/linux/fs.h	2011-11-20 20:18:29.000000000 +0800
> > @@ -951,6 +951,39 @@ struct file_ra_state {
> >  
> >  /* ra_flags bits */
> >  #define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
> > +#define	READAHEAD_MMAP		0x00010000
> 
> Why leave a gap?

Never mind, it's now converted to a bit field :)

> And what is READAHEAD_MMAP anyway?

READAHEAD_MMAP will be set for mmap page faults.

> > +#define READAHEAD_PATTERN_SHIFT	28
> 
> Why 28?

Bits 28-32 are for READAHEAD_PATTERN.

Anyway it will be gone when breaking down the ra_flags fields into
individual variables.

> > +#define READAHEAD_PATTERN	0xf0000000
> > +
> > +/*
> > + * Which policy makes decision to do the current read-ahead IO?
> > + */
> > +enum readahead_pattern {
> > +	RA_PATTERN_INITIAL,
> > +	RA_PATTERN_SUBSEQUENT,
> > +	RA_PATTERN_CONTEXT,
> > +	RA_PATTERN_MMAP_AROUND,
> > +	RA_PATTERN_FADVISE,
> > +	RA_PATTERN_OVERSIZE,
> > +	RA_PATTERN_RANDOM,
> > +	RA_PATTERN_ALL,		/* for summary stats */
> > +	RA_PATTERN_MAX
> > +};
> 
> Again, the behaviour is all undocumented.  I see from the code that
> multiple flags can be set at the same time.  So afacit a file can be
> marked RANDOM and SUBSEQUENT at the same time, which seems oxymoronic.

Nope, it will be classified into one "pattern" exclusively.

> This reader wants to know what the implications of this are - how the
> code chooses, prioritises and acts.  But this code doesn't tell me.

Hope the comment addresses this issue. The precise logic happens
mainly inside ondemand_readahead().

> > +static inline unsigned int ra_pattern(unsigned int ra_flags)
> > +{
> > +	unsigned int pattern = ra_flags >> READAHEAD_PATTERN_SHIFT;
> 
> OK, no masking is needed because the code silently assumes that arg
> `ra_flags' came out of an ra_state.ra_flags and it also silently
> assumes that no higher bits are used in ra_state.ra_flags.
> 
> That's a bit of a handgrenade - if someone redoes the flags
> enumeration, the code will explode.

Yeah, sorry for playing such tricks. I will get rid of this function
entirely and use a plain assignment to ra->pattern.

> > +	return min_t(unsigned int, pattern, RA_PATTERN_ALL);
> > +}
> 
> <scratches head>
> 
> What the heck is that min_t() doing in there?

Just for safety... not really necessary given correct code.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/8] readahead: basic support for backwards prefetching
  2011-11-21 23:33   ` Andrew Morton
@ 2011-11-29  3:08     ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29  3:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Li Shaohua, LKML

On Mon, Nov 21, 2011 at 03:33:09PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:26 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Add the backwards prefetching feature. It's pretty simple if we don't
> > support async prefetching and interleaved reads.
> 
> Well OK, but I wonder how many applications out there read files in
> reverse order.  Is it common enough to bother special-casing in the
> kernel like this?

Maybe not so many applications, but surely there are some real cases
out there. I remember an IBM paper on databases (from many years ago,
so I cannot recall the exact title) that shows a graph containing
backwards reading curves among the other ones.

Recently Shaohua even ran into a performance regression caused by glibc
optimizing memcpy to access pages in reverse order (15, 14, 13, ... 0).

Well, this patch may not be the most pertinent fix to that particular
issue. But you can see that such access patterns arise in surprising
places.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-21 23:29   ` Andrew Morton
  2011-11-21 23:32     ` Andi Kleen
@ 2011-11-29  3:23     ` Wu Fengguang
  2011-11-29  4:49       ` Andrew Morton
  1 sibling, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29  3:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

On Mon, Nov 21, 2011 at 03:29:58PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:24 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
> > and will remain inactive unless enabled explicitly with either boot option
> > 
> > 	readahead_stats=1
> > 
> > or through the debugfs interface
> > 
> > 	echo 1 > /debug/readahead/stats_enable
> 
> It's unfortunate that these two things have different names.

Yes, unfortunately.

> I'd have thought that the debugfs knob was sufficient - no need for the
> boot option.

The boot option was intended to catch boot-time readaheads.
However it's not that big a deal, so I'll drop the boot option.

> > The added overheads are two readahead_stats() calls per readahead.
> > Which is trivial costs unless there are concurrent random reads on
> > super fast SSDs, which may lead to cache bouncing when updating the
> > global ra_stats[][]. Considering that normal users won't need this
> > except when debugging performance problems, it's disabled by default.
> > So it looks reasonable to keep this debug code simple rather than trying
> > to improve its scalability.
> 
> I may be wrong, but I don't think the CPU cost of this code matters a
> lot.  People will rarely turn it on and disk IO is a lot slower than
> CPU actions and it's waaaaaaay more important to get high-quality info
> about readahead than it is to squeeze out a few CPU cycles.

Agreed in general.

> > @@ -51,6 +62,182 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
> >  
> >  #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
> >  
> > +#ifdef CONFIG_READAHEAD_STATS
> > +#include <linux/seq_file.h>
> > +#include <linux/debugfs.h>
> > +
> > +static u32 readahead_stats_enable __read_mostly;
> > +
> > +static int __init config_readahead_stats(char *str)
> > +{
> > +	int enable = 1;
> > +	get_option(&str, &enable);
> > +	readahead_stats_enable = enable;
> > +	return 0;
> > +}
> > +early_param("readahead_stats", config_readahead_stats);
> 
> Why use early_param() rather than plain old __setup()?

Heh, it's a mindless copy from other code ;)
Anyway, the readahead_stats boot parameter will be dropped.

> > +enum ra_account {
> > +	/* number of readaheads */
> > +	RA_ACCOUNT_COUNT,	/* readahead request */
> > +	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
> > +	RA_ACCOUNT_CHIT,	/* readahead request covers some cached pages */
> 
> I don't like chit :)  "cache_hit" would be better.  Or just "hit".

Yeah it's not good. I renamed it to RA_ACCOUNT_CACHE_HIT.

> > +	RA_ACCOUNT_ASIZE,	/* readahead async size */

Also renamed that to RA_ACCOUNT_ASYNC_SIZE.

> > +	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
> > +	/* end mark */
> > +	RA_ACCOUNT_MAX,
> > +};
> > +
> >
> > ...
> >
> > +static void readahead_event(struct address_space *mapping,
> > +			    pgoff_t offset,
> > +			    unsigned long req_size,
> > +			    unsigned int ra_flags,
> > +			    pgoff_t start,
> > +			    unsigned int size,
> > +			    unsigned int async_size,
> > +			    unsigned int actual)
> > +{
> > +#ifdef CONFIG_READAHEAD_STATS
> > +	if (readahead_stats_enable) {
> > +		readahead_stats(mapping, offset, req_size, ra_flags,
> > +				start, size, async_size, actual);
> > +		readahead_stats(mapping, offset, req_size,
> > +				RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> > +				start, size, async_size, actual);
> > +	}
> > +#endif
> > +}
> 
> The stub should be inlined, methinks.  The overhead of evaluating and
> preparing eight arguments is significant.  I don't think the compiler
> is yet smart enough to save us.

The parameter list actually gets even more out of control when doing
the bit fields:

+       readahead_event(mapping, offset, req_size,
+                       ra->pattern, ra->for_mmap, ra->for_metadata,
+                       ra->start + ra->size >= eof,
+                       ra->start, ra->size, ra->async_size, actual);

So I end up passing file_ra_state around. The added cost is that I'll
have to dynamically create a file_ra_state for the fadvise case, which
should be acceptable since that's a cold path.
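
A minimal, hypothetical sketch of the bitfield approach under
discussion -- the field names follow the thread, but the widths and
layout are made up, not the actual kernel struct -- showing how the
pattern and per-request flags collapse into one struct that travels as
a single parameter:

```c
/* Hypothetical sketch (not the real kernel layout): let the compiler
 * pack the readahead pattern and per-request flags into bitfields, so
 * the accounting code takes one struct pointer instead of a long
 * parameter list. */

enum ra_pattern { RA_INITIAL, RA_SUBSEQUENT, RA_CONTEXT, RA_AROUND,
		  RA_FADVISE, RA_OVERSIZE, RA_RANDOM, RA_ALL };

struct ra_state {
	unsigned long start;		/* first page to read ahead */
	unsigned int size;		/* number of pages */
	unsigned int async_size;	/* read-ahead-of-use portion */
	unsigned int pattern : 4;	/* enum ra_pattern */
	unsigned int for_mmap : 1;	/* triggered by a page fault */
	unsigned int for_metadata : 1;	/* reading fs metadata */
};

/* one parameter carries everything the accounting code needs */
unsigned int describe(const struct ra_state *ra)
{
	return (ra->pattern << 2) | (ra->for_mmap << 1) | ra->for_metadata;
}
```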

> >
> > ...
> >
> > --- linux-next.orig/Documentation/kernel-parameters.txt	2011-11-21 17:08:38.000000000 +0800
> > +++ linux-next/Documentation/kernel-parameters.txt	2011-11-21 17:08:51.000000000 +0800
> > @@ -2251,6 +2251,12 @@ bytes respectively. Such letter suffixes
> >  			This default max readahead size may be overrode
> >  			in some cases, notably NFS, btrfs and software RAID.
> >  
> > +	readahead_stats[=0|1]
> > +			Enable/disable readahead stats accounting.
> > +
> > +			It's also possible to enable/disable it after boot:
> > +			echo 1 > /sys/kernel/debug/readahead/stats_enable
> 
> Can the current setting be read back?

Yes. The current setting can be read with:

        cat /sys/kernel/debug/readahead/stats_enable

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
  2011-11-23 20:31       ` Andrew Morton
@ 2011-11-29  3:42         ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29  3:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
	Steven Whitehouse, Rik van Riel, LKML

On Wed, Nov 23, 2011 at 12:31:50PM -0800, Andrew Morton wrote:
> On Wed, 23 Nov 2011 20:47:45 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > should be ulong, which is compatible with the bitops.h code.
> > > Or perhaps we should use a bitfield and let the compiler do the work.
> > 
> > What if we do
> > 
> >         u16     mmap_miss;
> >         u16     ra_flags;
> > 
> > That would get rid of this patch. I'd still like to pack the various
> > flags as well as pattern into one single ra_flags, which makes it
> > convenient to pass things around (as one single parameter).
> 
> I'm not sure that this will improve things much...
> 
> Again, how does the code look if you use a bitfield and let the
> compiler do the worK?

It results in much cleaner code, as you may find in the V2 patches :-)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-29  3:23     ` Wu Fengguang
@ 2011-11-29  4:49       ` Andrew Morton
  2011-11-29  6:41         ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-29  4:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

On Tue, 29 Nov 2011 11:23:23 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> > > +{
> > > +#ifdef CONFIG_READAHEAD_STATS
> > > +	if (readahead_stats_enable) {
> > > +		readahead_stats(mapping, offset, req_size, ra_flags,
> > > +				start, size, async_size, actual);
> > > +		readahead_stats(mapping, offset, req_size,
> > > +				RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> > > +				start, size, async_size, actual);
> > > +	}
> > > +#endif
> > > +}
> > 
> > The stub should be inlined, methinks.  The overhead of evaluating and
> > preparing eight arguments is significant.  I don't think the compiler
> > is yet smart enough to save us.
> 
> The parameter list actually becomes even out of control when doing the
> bit fields:
> 
> +       readahead_event(mapping, offset, req_size,
> +                       ra->pattern, ra->for_mmap, ra->for_metadata,
> +                       ra->start + ra->size >= eof,
> +                       ra->start, ra->size, ra->async_size, actual);
> 
> So I end up passing file_ra_state around. The added cost is, I'll have
> to dynamically create a file_ra_state for the fadvise case, which
> should be acceptable since it's a cold path.

That will reduce the cost of something which would have zero cost by
making this function a static inline when CONFIG_READAHEAD_STATS=n.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-29  4:49       ` Andrew Morton
@ 2011-11-29  6:41         ` Wu Fengguang
  2011-11-29 12:29           ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29  6:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

On Tue, Nov 29, 2011 at 12:49:50PM +0800, Andrew Morton wrote:
> On Tue, 29 Nov 2011 11:23:23 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > > +{
> > > > +#ifdef CONFIG_READAHEAD_STATS
> > > > +	if (readahead_stats_enable) {
> > > > +		readahead_stats(mapping, offset, req_size, ra_flags,
> > > > +				start, size, async_size, actual);
> > > > +		readahead_stats(mapping, offset, req_size,
> > > > +				RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> > > > +				start, size, async_size, actual);
> > > > +	}
> > > > +#endif
> > > > +}
> > > 
> > > The stub should be inlined, methinks.  The overhead of evaluating and
> > > preparing eight arguments is significant.  I don't think the compiler
> > > is yet smart enough to save us.
> > 
> > The parameter list actually becomes even out of control when doing the
> > bit fields:
> > 
> > +       readahead_event(mapping, offset, req_size,
> > +                       ra->pattern, ra->for_mmap, ra->for_metadata,
> > +                       ra->start + ra->size >= eof,
> > +                       ra->start, ra->size, ra->async_size, actual);
> > 
> > So I end up passing file_ra_state around. The added cost is, I'll have
> > to dynamically create a file_ra_state for the fadvise case, which
> > should be acceptable since it's a cold path.
> 
> That will reduce the cost of something which would have zero cost by
> making this function a static inline when CONFIG_READAHEAD_STATS=n.

What I do now is remove the readahead_event() function altogether,
as done by the patch below.

Do you suggest removing fadvise_ra and still passing the many raw
values to readahead_stats()? (That would need the inline function
readahead_event() restored, because there will be two call sites.)

Thanks,
Fengguang
---
Subject: readahead: add /debug/readahead/stats
Date: Sun Nov 20 11:25:50 CST 2011

The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
but will remain inactive until explicitly enabled at runtime.

It can be runtime enabled/disabled through the debugfs interface

	echo 1 > /debug/readahead/stats_enable
	echo 0 > /debug/readahead/stats_enable

The added overhead is two readahead_stats() calls per readahead, which
is a trivial cost unless there are concurrent random reads on super
fast SSDs, where it may lead to cache bouncing when updating the global
ra_stats[][]. Considering that normal users won't need this except when
debugging performance problems, it's disabled by default. So it looks
reasonable to keep this debug code simple rather than trying to improve
its scalability.

Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)

$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4

The two most important columns are
- io		number of readahead IO
- io_size	average readahead IO size

CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/Kconfig     |   15 +++
 mm/readahead.c |  183 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 196 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/readahead.c	2011-11-29 14:14:36.000000000 +0800
+++ linux-next/mm/readahead.c	2011-11-29 14:24:25.000000000 +0800
@@ -18,6 +18,17 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+static const char * const ra_pattern_names[] = {
+	[RA_PATTERN_INITIAL]            = "initial",
+	[RA_PATTERN_SUBSEQUENT]         = "subsequent",
+	[RA_PATTERN_CONTEXT]            = "context",
+	[RA_PATTERN_MMAP_AROUND]        = "around",
+	[RA_PATTERN_FADVISE]            = "fadvise",
+	[RA_PATTERN_OVERSIZE]           = "oversize",
+	[RA_PATTERN_RANDOM]             = "random",
+	[RA_PATTERN_ALL]                = "all",
+};
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
@@ -32,6 +43,167 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
 
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+enum ra_account {
+	/* number of readaheads */
+	RA_ACCOUNT_COUNT,	/* readahead request */
+	RA_ACCOUNT_EOF,		/* readahead request covers EOF */
+	RA_ACCOUNT_CACHE_HIT,	/* readahead request covers some cached pages */
+	RA_ACCOUNT_IOCOUNT,	/* readahead IO */
+	RA_ACCOUNT_SYNC,	/* readahead IO that is synchronous */
+	RA_ACCOUNT_MMAP,	/* readahead IO by mmap page faults */
+	RA_ACCOUNT_METADATA,	/* readahead IO on metadata */
+	/* number of readahead pages */
+	RA_ACCOUNT_SIZE,	/* readahead size */
+	RA_ACCOUNT_ASYNC_SIZE,	/* readahead async size */
+	RA_ACCOUNT_ACTUAL,	/* readahead actual IO size */
+	/* end mark */
+	RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static inline void readahead_stats(struct file_ra_state *ra,
+				   struct address_space *mapping,
+				   pgoff_t offset,
+				   unsigned long req_size,
+				   pgoff_t eof,
+				   int actual)
+{
+	enum readahead_pattern pattern = ra->pattern;
+
+recount:
+	ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+	ra_stats[pattern][RA_ACCOUNT_SIZE] += ra->size;
+	ra_stats[pattern][RA_ACCOUNT_ASYNC_SIZE] += ra->async_size;
+	ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+	if (ra->start + ra->size >= eof)
+		ra_stats[pattern][RA_ACCOUNT_EOF]++;
+	if (actual < ra->size)
+		ra_stats[pattern][RA_ACCOUNT_CACHE_HIT]++;
+
+	if (actual) {
+		ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+		if (ra->start <= offset && offset < ra->start + ra->size)
+			ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+		if (ra->for_mmap)
+			ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+		if (ra->for_metadata)
+			ra_stats[pattern][RA_ACCOUNT_METADATA]++;
+	}
+
+	if (pattern != RA_PATTERN_ALL) {
+		pattern = RA_PATTERN_ALL;
+		goto recount;
+	}
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+	unsigned long i;
+
+	seq_printf(s,
+		   "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+		   "pattern", "readahead", "eof_hit", "cache_hit",
+		   "io", "sync_io", "mmap_io", "meta_io",
+		   "size", "async_size", "io_size");
+
+	for (i = 0; i < RA_PATTERN_MAX; i++) {
+		unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+		unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+		/*
+		 * avoid division-by-zero
+		 */
+		if (count == 0)
+			count = 1;
+		if (iocount == 0)
+			iocount = 1;
+
+		seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
+			   "%10lu %10lu %10lu %10lu %10lu\n",
+				ra_pattern_names[i],
+				ra_stats[i][RA_ACCOUNT_COUNT],
+				ra_stats[i][RA_ACCOUNT_EOF],
+				ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+				ra_stats[i][RA_ACCOUNT_IOCOUNT],
+				ra_stats[i][RA_ACCOUNT_SYNC],
+				ra_stats[i][RA_ACCOUNT_MMAP],
+				ra_stats[i][RA_ACCOUNT_METADATA],
+				ra_stats[i][RA_ACCOUNT_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+				ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+	}
+
+	return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+				     size_t size, loff_t *offset)
+{
+	memset(ra_stats, 0, sizeof(ra_stats));
+	return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+	.owner		= THIS_MODULE,
+	.open		= readahead_stats_open,
+	.write		= readahead_stats_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+	struct dentry *root;
+	struct dentry *entry;
+
+	root = debugfs_create_dir("readahead", NULL);
+	if (!root)
+		goto out;
+
+	entry = debugfs_create_file("stats", 0644, root,
+				    NULL, &readahead_stats_fops);
+	if (!entry)
+		goto out;
+
+	entry = debugfs_create_bool("stats_enable", 0644, root,
+				    &readahead_stats_enable);
+	if (!entry)
+		goto out;
+
+	return 0;
+out:
+	printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+	return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#else
+#define readahead_stats_enable	0
+static inline void readahead_stats(struct file_ra_state *ra,
+				   struct address_space *mapping,
+				   pgoff_t offset,
+				   unsigned long req_size,
+				   pgoff_t eof,
+				   int actual)
+{
+}
+#endif
+
 /*
  * see if a page needs releasing upon read_cache_pages() failure
  * - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -209,6 +381,9 @@ out:
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		pgoff_t offset, unsigned long nr_to_read)
 {
+	struct file_ra_state fadvice_ra = {
+		.pattern	= RA_PATTERN_FADVISE,
+	};
 	int ret = 0;
 
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
@@ -222,8 +397,9 @@ int force_page_cache_readahead(struct ad
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
-		err = __do_page_cache_readahead(mapping, filp,
-						offset, this_chunk, 0);
+		fadvice_ra.start = offset;
+		fadvice_ra.size = this_chunk;
+		err = ra_submit(&fadvice_ra, mapping, filp, offset, nr_to_read);
 		if (err < 0) {
 			ret = err;
 			break;
@@ -267,6 +443,9 @@ unsigned long ra_submit(struct file_ra_s
 	actual = __do_page_cache_readahead(mapping, filp,
 					ra->start, ra->size, ra->async_size);
 
+	if (readahead_stats_enable)
+		readahead_stats(ra, mapping, offset, req_size, eof, actual);
+
 	ra->for_mmap = 0;
 	ra->for_metadata = 0;
 	return actual;
--- linux-next.orig/mm/Kconfig	2011-11-29 14:14:25.000000000 +0800
+++ linux-next/mm/Kconfig	2011-11-29 14:14:37.000000000 +0800
@@ -373,3 +373,18 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+	bool "Collect page cache readahead stats"
+	depends on DEBUG_FS
+	default y
+	help
+	  This provides the readahead events accounting facilities.
+
+	  To do readahead accounting for a workload:
+
+	  echo 1 > /sys/kernel/debug/readahead/stats_enable
+	  echo 0 > /sys/kernel/debug/readahead/stats  # reset counters
+	  # run the workload
+	  cat /sys/kernel/debug/readahead/stats       # check counters
+	  echo 0 > /sys/kernel/debug/readahead/stats_enable

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
  2011-11-29  6:41         ` Wu Fengguang
@ 2011-11-29 12:29           ` Wu Fengguang
  0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
	Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen

>  int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
>  		pgoff_t offset, unsigned long nr_to_read)
>  {
> +	struct file_ra_state fadvice_ra = {
> +		.pattern	= RA_PATTERN_FADVISE,
> +	};
>  	int ret = 0;
>  
>  	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
> @@ -222,8 +397,9 @@ int force_page_cache_readahead(struct ad
>  
>  		if (this_chunk > nr_to_read)
>  			this_chunk = nr_to_read;
> -		err = __do_page_cache_readahead(mapping, filp,
> -						offset, this_chunk, 0);
> +		fadvice_ra.start = offset;
> +		fadvice_ra.size = this_chunk;
> +		err = ra_submit(&fadvice_ra, mapping, filp, offset, nr_to_read);
>  		if (err < 0) {
>  			ret = err;
>  			break;

It looks like we can safely use filp->f_ra:

@@ -214,6 +386,7 @@ int force_page_cache_readahead(struct ad
        if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
                return -EINVAL;

+       filp->f_ra.pattern = RA_PATTERN_FADVISE;
        nr_to_read = max_sane_readahead(nr_to_read);
        while (nr_to_read) {
                int err;
@@ -222,8 +395,9 @@ int force_page_cache_readahead(struct ad
               
                if (this_chunk > nr_to_read)
                        this_chunk = nr_to_read;
-               err = __do_page_cache_readahead(mapping, filp,
-                                               offset, this_chunk, 0);
+               filp->f_ra.start = offset;
+               filp->f_ra.size = this_chunk;
+               err = ra_submit(&filp->f_ra, mapping, filp, offset, nr_to_read);
                if (err < 0) {
                        ret = err;
                        break;

But still, it adds one more function call to the fadvise path.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-28  2:39           ` Wu Fengguang
@ 2011-11-30 13:04             ` Christian Ehrhardt
  2011-11-30 13:29               ` Wu Fengguang
  0 siblings, 1 reply; 47+ messages in thread
From: Christian Ehrhardt @ 2011-11-30 13:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, Andrew Morton,
	Linux Memory Management List, linux-fsdevel, Ankit Jain,
	Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen



On 11/28/2011 03:39 AM, Wu Fengguang wrote:
> On Fri, Nov 25, 2011 at 08:36:33AM +0800, Dave Chinner wrote:
>> On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
>>> On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
>>>> On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
>>>>> On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
>>>>>> From: Nikanth Karthikesan<knikanth@suse.de>
>>>>>>
[...]

>>
>> And one that has already been in use for exactly this purpose for
>> years. Indeed, it's far more flexible because you can give different
>> types of devices different default readahead settings quite easily,
>> and it you can set different defaults for just about any tunable
>> parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
>> way.
>
> I'm interested in this usage, too. Would you share some of your rules?
>

FYI - this is an example of the rules SUSE has been delivering in SLES
on s390 for a while now. With small modifications it could be used for
everything Dave mentioned above.

cat /etc/udev/rules.d/60-readahead.rules
#
# Rules to set an increased default max readahead size for s390 disk devices
#
# This file should be installed in /etc/udev/rules.d
#

SUBSYSTEM!="block", GOTO="ra_end"

ACTION!="add", GOTO="ra_end"

# on device add set initial readahead to 512 (instead of in kernel 128)
KERNEL=="sd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
KERNEL=="dasd*[!0-9]", ATTR{queue/read_ahead_kb}="512"

LABEL="ra_end"

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-30 13:04             ` Christian Ehrhardt
@ 2011-11-30 13:29               ` Wu Fengguang
  2011-11-30 16:09                 ` Jan Kara
  0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 13:29 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Dave Chinner, Jan Kara, Christoph Hellwig, Andrew Morton,
	Linux Memory Management List, linux-fsdevel, Ankit Jain,
	Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen

On Wed, Nov 30, 2011 at 09:04:11PM +0800, Christian Ehrhardt wrote:
> 
> 
> On 11/28/2011 03:39 AM, Wu Fengguang wrote:
> > On Fri, Nov 25, 2011 at 08:36:33AM +0800, Dave Chinner wrote:
> >> On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
> >>> On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> >>>> On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> >>>>> On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> >>>>>> From: Nikanth Karthikesan<knikanth@suse.de>
> >>>>>>
> [...]
> 
> >>
> >> And one that has already been in use for exactly this purpose for
> >> years. Indeed, it's far more flexible because you can give different
> >> types of devices different default readahead settings quite easily,
> >> and it you can set different defaults for just about any tunable
> >> parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
> >> way.
> >
> > I'm interested in this usage, too. Would you share some of your rules?
> >
> 
> FYI - This is an example of a rules Suse delivers in SLES @ s390 for a 
> while now. With little modifications it could be used for all Dave 
> mentioned above.

It's a really good example, thank you!

> cat /etc/udev/rules.d/60-readahead.rules
> #
> # Rules to set an increased default max readahead size for s390 disk devices
> #
> # This file should be installed in /etc/udev/rules.d
> #
> 
> SUBSYSTEM!="block", GOTO="ra_end"
> 
> ACTION!="add", GOTO="ra_end"
> 
> # on device add set initial readahead to 512 (instead of in kernel 128)
> KERNEL=="sd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
> KERNEL=="dasd*[!0-9]", ATTR{queue/read_ahead_kb}="512"

So SLES (@s390 and maybe more) is already shipping with 512kb
readahead size? Good to know this!

Thanks,
Fengguang

> 
> LABEL="ra_end"
> 
> -- 
> 
> Grüsse / regards, Christian Ehrhardt
> IBM Linux Technology Center, System z Linux Performance

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
  2011-11-30 13:29               ` Wu Fengguang
@ 2011-11-30 16:09                 ` Jan Kara
  0 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-30 16:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christian Ehrhardt, Dave Chinner, Jan Kara, Christoph Hellwig,
	Andrew Morton, Linux Memory Management List, linux-fsdevel,
	Ankit Jain, Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen

On Wed 30-11-11 21:29:28, Wu Fengguang wrote:
> > cat /etc/udev/rules.d/60-readahead.rules
> > #
> > # Rules to set an increased default max readahead size for s390 disk devices
> > #
> > # This file should be installed in /etc/udev/rules.d
> > #
> > 
> > SUBSYSTEM!="block", GOTO="ra_end"
> > 
> > ACTION!="add", GOTO="ra_end"
> > 
> > # on device add set initial readahead to 512 (instead of in kernel 128)
> > KERNEL=="sd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
> > KERNEL=="dasd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
> 
> So SLES (@s390 and maybe more) is already shipping with 512kb
> readahead size? Good to know this!
  SLES (and openSUSE) has been shipping with 512kb readahead on
everything since about the 2.6.16 days... With some types of storage it
makes a significant difference.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2011-11-30 16:09 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-21  9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
2011-11-21  9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
2011-11-21 10:00   ` Christoph Hellwig
2011-11-21 11:24     ` Wu Fengguang
2011-11-21 12:47     ` Andi Kleen
2011-11-21 14:46   ` Jeff Moyer
2011-11-21 22:52   ` Andrew Morton
2011-11-22 14:23     ` Jeff Moyer
2011-11-23 12:18     ` Wu Fengguang
2011-11-21  9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
2011-11-21 10:01   ` Christoph Hellwig
2011-11-21 11:35     ` Wu Fengguang
2011-11-24 22:28       ` Jan Kara
2011-11-25  0:36         ` Dave Chinner
2011-11-28  2:39           ` Wu Fengguang
2011-11-30 13:04             ` Christian Ehrhardt
2011-11-30 13:29               ` Wu Fengguang
2011-11-30 16:09                 ` Jan Kara
2011-11-21  9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
2011-11-21 11:04   ` Steven Whitehouse
2011-11-21 11:42     ` Wu Fengguang
2011-11-21 23:01   ` Andrew Morton
2011-11-23 12:47     ` Wu Fengguang
2011-11-23 20:31       ` Andrew Morton
2011-11-29  3:42         ` Wu Fengguang
2011-11-21  9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
2011-11-21 23:19   ` Andrew Morton
2011-11-29  2:40     ` Wu Fengguang
2011-11-21  9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
2011-11-21 14:17   ` Andi Kleen
2011-11-22 14:14     ` Wu Fengguang
2011-11-21 23:29   ` Andrew Morton
2011-11-21 23:32     ` Andi Kleen
2011-11-29  3:23     ` Wu Fengguang
2011-11-29  4:49       ` Andrew Morton
2011-11-29  6:41         ` Wu Fengguang
2011-11-29 12:29           ` Wu Fengguang
2011-11-21  9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
2011-11-21 14:01   ` Steven Rostedt
2011-11-21  9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
2011-11-21 23:33   ` Andrew Morton
2011-11-29  3:08     ` Wu Fengguang
2011-11-21  9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
2011-11-21 23:36   ` Andrew Morton
2011-11-22 14:18     ` Wu Fengguang
2011-11-21  9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
2011-11-21 12:00   ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).