* [PATCH 0/8] readahead stats/tracing, backwards prefetching and more
@ 2011-11-21 9:18 Wu Fengguang
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
` (8 more replies)
0 siblings, 9 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Wu Fengguang, LKML,
Andi Kleen
Andrew,
I'm getting around to picking up the readahead work again :-)
This first series mainly adds some debug facilities, the long-missed
backwards prefetching capability, and some old patches that somehow got
delayed (shame on me).
The next step would be to better handle readahead thrashing. That will
require rewriting parts of the algorithms, which is why I'd like to keep
the backwards prefetching simple and stupid for now.
When (almost) free of readahead thrashing, we'll be in a good position to lift
the default readahead size, which I suspect would be the single most effective
way to improve performance for the many casually maintained Linux file servers
out there.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 10:00 ` Christoph Hellwig
` (2 more replies)
2011-11-21 9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
` (7 subsequent siblings)
8 siblings, 3 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Li Shaohua,
Clemens Ladisch, Jens Axboe, Rik van Riel, Wu Fengguang, LKML,
Andi Kleen
[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 6861 bytes --]
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slowly. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel readahead still turns that into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
    disk size      readahead size
 (scale by 4)        (scale by 2)
           1M                  8k
           4M                 16k
          16M                 32k
          64M                 64k
         256M                128k
 --------------------------------- (*)
           1G                256k
           4G                512k
          16G               1024k
          64G               2048k
         256G               4096k
(*) Since the default readahead size is 128k, this limit only takes
effect for devices whose size is less than 256M.
The formula is determined on the following data, collected by script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
	echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
	time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula shall not limit the readahead size to a
degree that would hurt any device's sequential read performance.
The Intel SSD is special in that its throughput keeps increasing with
larger readahead sizes. However, it may take years for Linux to raise its
default readahead size to 2MB, so we don't give it much weight in the
formula.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
==> 64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
An unnamed SD card (Yakui):
4k 195.873 s, 5.5 MB/s
8k 123.425 s, 8.7 MB/s
16k 86.6425 s, 12.4 MB/s
32k 66.7519 s, 16.1 MB/s
==> 64k 58.5262 s, 18.3 MB/s
128k 59.3847 s, 18.1 MB/s
256k 59.3188 s, 18.1 MB/s
512k 59.0218 s, 18.2 MB/s
CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
--- linux-next.orig/block/genhd.c 2011-10-31 00:13:51.000000000 +0800
+++ linux-next/block/genhd.c 2011-11-18 11:27:08.000000000 +0800
@@ -623,6 +623,26 @@ void add_disk(struct gendisk *disk)
 	WARN_ON(retval);
 
 	disk_add_events(disk);
+
+	/*
+	 * Limit default readahead size for small devices.
+	 *        disk size    readahead size
+	 *               1M                8k
+	 *               4M               16k
+	 *              16M               32k
+	 *              64M               64k
+	 *             256M              128k
+	 *               1G              256k
+	 *               4G              512k
+	 *              16G             1024k
+	 *              64G             2048k
+	 *             256G             4096k
+	 */
+	if (get_capacity(disk)) {
+		unsigned long size = get_capacity(disk) >> 9;
+		size = 1UL << (ilog2(size) / 2);
+		bdi->ra_pages = min(bdi->ra_pages, size);
+	}
 }
 EXPORT_SYMBOL(add_disk);
* [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 10:01 ` Christoph Hellwig
2011-11-21 9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
` (6 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ankit Jain,
Dave Chinner, Christian Ehrhardt, Rik van Riel,
Nikanth Karthikesan, Wu Fengguang, LKML, Andi Kleen
[-- Attachment #1: readahead-kernel-parameter.patch --]
[-- Type: text/plain, Size: 3085 bytes --]
From: Nikanth Karthikesan <knikanth@suse.de>
Add a new kernel parameter "readahead=", which allows the user to override
the static VM_MAX_READAHEAD=128kB default.
CC: Ankit Jain <radical@gmail.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Documentation/kernel-parameters.txt | 6 ++++++
block/blk-core.c | 3 +--
fs/fuse/inode.c | 2 +-
mm/readahead.c | 19 +++++++++++++++++++
4 files changed, 27 insertions(+), 3 deletions(-)
--- linux-next.orig/Documentation/kernel-parameters.txt 2011-10-19 11:11:14.000000000 +0800
+++ linux-next/Documentation/kernel-parameters.txt 2011-11-20 11:09:56.000000000 +0800
@@ -2245,6 +2245,12 @@ bytes respectively. Such letter suffixes
 			Run specified binary instead of /init from the ramdisk,
 			used for early userspace startup. See initrd.
 
+	readahead=nn[KM]
+			Default max readahead size for block devices.
+
+			This default max readahead size may be overridden
+			in some cases, notably NFS, btrfs and software RAID.
+
 	reboot=		[BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode
 			Format: <reboot_mode>[,<reboot_mode2>[,...]]
 			See arch/*/kernel/reboot.c or arch/*/kernel/process.c
--- linux-next.orig/block/blk-core.c 2011-11-08 10:18:16.000000000 +0800
+++ linux-next/block/blk-core.c 2011-11-20 10:49:33.000000000 +0800
@@ -462,8 +462,7 @@ struct request_queue *blk_alloc_queue_no
 	if (!q)
 		return NULL;
 
-	q->backing_dev_info.ra_pages =
-			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	q->backing_dev_info.ra_pages = default_backing_dev_info.ra_pages;
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	q->backing_dev_info.name = "block";
--- linux-next.orig/fs/fuse/inode.c 2011-11-08 10:18:39.000000000 +0800
+++ linux-next/fs/fuse/inode.c 2011-11-20 10:50:12.000000000 +0800
@@ -878,7 +878,7 @@ static int fuse_bdi_init(struct fuse_con
 	int err;
 
 	fc->bdi.name = "fuse";
-	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	fc->bdi.ra_pages = default_backing_dev_info.ra_pages;
 	/* fuse does it's own writeback accounting */
 	fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
--- linux-next.orig/mm/readahead.c 2011-11-20 10:48:57.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-20 11:09:22.000000000 +0800
@@ -18,6 +18,25 @@
 #include <linux/pagevec.h>
 #include <linux/pagemap.h>
 
+static int __init config_readahead_size(char *str)
+{
+	unsigned long bytes;
+
+	if (!str)
+		return -EINVAL;
+	bytes = memparse(str, &str);
+	if (*str != '\0')
+		return -EINVAL;
+
+	/* missed 'k'/'m' suffixes? */
+	if (bytes && bytes < PAGE_CACHE_SIZE)
+		return -EINVAL;
+
+	default_backing_dev_info.ra_pages = bytes / PAGE_CACHE_SIZE;
+	return 0;
+}
+early_param("readahead", config_readahead_size);
+
 /*
  * Initialise a struct file's readahead state. Assumes that the caller has
  * memset *ra to zero.
* [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
2011-11-21 9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 11:04 ` Steven Whitehouse
2011-11-21 23:01 ` Andrew Morton
2011-11-21 9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
` (5 subsequent siblings)
8 siblings, 2 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Steven Whitehouse, Rik van Riel, Wu Fengguang, LKML
[-- Attachment #1: readahead-flags.patch --]
[-- Type: text/plain, Size: 3095 bytes --]
Introduce a readahead flags field and embed the existing mmap_miss in it
(mainly to save space).
Flag updates can be lost in race conditions; however, the impact should be
limited. For the race to happen, two threads sharing the same file
descriptor must be in page fault or readahead at the same time. Note that
concurrent page faults have always been racy in this regard. If the race
ever happens, we'll lose one mmap_miss++ or mmap_miss--, which may change
some concrete readahead behavior, but won't really impact overall I/O
performance.
CC: Andi Kleen <andi@firstfloor.org>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 31 ++++++++++++++++++++++++++++++-
mm/filemap.c | 9 ++-------
2 files changed, 32 insertions(+), 8 deletions(-)
--- linux-next.orig/include/linux/fs.h 2011-11-20 11:30:55.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-20 11:48:53.000000000 +0800
@@ -945,10 +945,39 @@ struct file_ra_state {
 					   there are only # of pages ahead */
 
 	unsigned int ra_pages;		/* Maximum readahead window */
-	unsigned int mmap_miss;		/* Cache miss stat for mmap accesses */
+	unsigned int ra_flags;
 	loff_t prev_pos;		/* Cache last read() position */
 };
 
+/* ra_flags bits */
+#define	READAHEAD_MMAP_MISS	0x000003ff /* cache misses for mmap access */
+
+/*
+ * Don't do ra_flags++ directly to avoid possible overflow:
+ * the ra fields can be accessed concurrently in a racy way.
+ */
+static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
+{
+	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+	/* the upper bound avoids banging the cache line unnecessarily */
+	if (miss < READAHEAD_MMAP_MISS) {
+		miss++;
+		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
+	}
+	return miss;
+}
+
+static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
+{
+	unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+	if (miss) {
+		miss--;
+		ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
+	}
+}
+
 /*
  * Check if @index falls in the readahead windows.
  */
--- linux-next.orig/mm/filemap.c	2011-11-20 11:30:55.000000000 +0800
+++ linux-next/mm/filemap.c	2011-11-20 11:48:29.000000000 +0800
@@ -1597,15 +1597,11 @@ static void do_sync_mmap_readahead(struc
 		return;
 	}
 
-	/* Avoid banging the cache line if not needed */
-	if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
-		ra->mmap_miss++;
-
 	/*
 	 * Do we miss much more than hit in this file? If so,
 	 * stop bothering with read-ahead. It will only hurt.
 	 */
-	if (ra->mmap_miss > MMAP_LOTSAMISS)
+	if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
 		return;
 
 	/*
@@ -1633,8 +1629,7 @@ static void do_async_mmap_readahead(stru
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
 		return;
-	if (ra->mmap_miss > 0)
-		ra->mmap_miss--;
+	ra_mmap_miss_dec(ra);
 	if (PageReadahead(page))
 		page_cache_async_readahead(mapping, ra, file,
 					   page, offset, ra->ra_pages);
* [PATCH 4/8] readahead: record readahead patterns
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
` (2 preceding siblings ...)
2011-11-21 9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 23:19 ` Andrew Morton
2011-11-21 9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
` (4 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, Wu Fengguang, LKML,
Andi Kleen
[-- Attachment #1: readahead-tracepoints.patch --]
[-- Type: text/plain, Size: 6975 bytes --]
Record the readahead pattern in ra_flags and extend the ra_submit()
parameters, to be used by the next readahead tracing/stats patches.
7 patterns are defined:
pattern			readahead for
-----------------------------------------------------------
RA_PATTERN_INITIAL	start-of-file read
RA_PATTERN_SUBSEQUENT	trivial sequential read
RA_PATTERN_CONTEXT	interleaved sequential read
RA_PATTERN_OVERSIZE	oversize read
RA_PATTERN_MMAP_AROUND	mmap fault
RA_PATTERN_FADVISE	posix_fadvise()
RA_PATTERN_RANDOM	random read
Note that random reads will be recorded in file_ra_state now.
This won't worsen cache bouncing, because the ra->prev_pos update in
do_generic_file_read() already pollutes the data cache, and
filemap_fault() will stop calling into us after MMAP_LOTSAMISS misses.
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 33 +++++++++++++++++++++++++++++++++
include/linux/mm.h | 4 +++-
mm/filemap.c | 9 +++++++--
mm/readahead.c | 30 +++++++++++++++++++++++-------
4 files changed, 66 insertions(+), 10 deletions(-)
--- linux-next.orig/include/linux/fs.h 2011-11-20 20:10:48.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-20 20:18:29.000000000 +0800
@@ -951,6 +951,39 @@ struct file_ra_state {
/* ra_flags bits */
#define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
+#define READAHEAD_MMAP 0x00010000
+
+#define READAHEAD_PATTERN_SHIFT 28
+#define READAHEAD_PATTERN 0xf0000000
+
+/*
+ * Which policy makes decision to do the current read-ahead IO?
+ */
+enum readahead_pattern {
+ RA_PATTERN_INITIAL,
+ RA_PATTERN_SUBSEQUENT,
+ RA_PATTERN_CONTEXT,
+ RA_PATTERN_MMAP_AROUND,
+ RA_PATTERN_FADVISE,
+ RA_PATTERN_OVERSIZE,
+ RA_PATTERN_RANDOM,
+ RA_PATTERN_ALL, /* for summary stats */
+ RA_PATTERN_MAX
+};
+
+static inline unsigned int ra_pattern(unsigned int ra_flags)
+{
+ unsigned int pattern = ra_flags >> READAHEAD_PATTERN_SHIFT;
+
+ return min_t(unsigned int, pattern, RA_PATTERN_ALL);
+}
+
+static inline void ra_set_pattern(struct file_ra_state *ra,
+ unsigned int pattern)
+{
+ ra->ra_flags = (ra->ra_flags & ~READAHEAD_PATTERN) |
+ (pattern << READAHEAD_PATTERN_SHIFT);
+}
/*
* Don't do ra_flags++ directly to avoid possible overflow:
--- linux-next.orig/mm/readahead.c 2011-11-20 20:10:48.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-20 20:18:14.000000000 +0800
@@ -268,13 +268,17 @@ unsigned long max_sane_readahead(unsigne
* Submit IO for the read-ahead request in file_ra_state.
*/
unsigned long ra_submit(struct file_ra_state *ra,
- struct address_space *mapping, struct file *filp)
+ struct address_space *mapping,
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long req_size)
{
int actual;
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ ra->ra_flags &= ~READAHEAD_MMAP;
return actual;
}
@@ -401,6 +405,7 @@ static int try_context_readahead(struct
if (size >= offset)
size *= 2;
+ ra_set_pattern(ra, RA_PATTERN_CONTEXT);
ra->start = offset;
ra->size = get_init_ra_size(size + req_size, max);
ra->async_size = ra->size;
@@ -422,8 +427,10 @@ ondemand_readahead(struct address_space
/*
* start of file
*/
- if (!offset)
+ if (!offset) {
+ ra_set_pattern(ra, RA_PATTERN_INITIAL);
goto initial_readahead;
+ }
/*
* It's the expected callback offset, assume sequential access.
@@ -431,6 +438,7 @@ ondemand_readahead(struct address_space
*/
if ((offset == (ra->start + ra->size - ra->async_size) ||
offset == (ra->start + ra->size))) {
+ ra_set_pattern(ra, RA_PATTERN_SUBSEQUENT);
ra->start += ra->size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
@@ -453,6 +461,7 @@ ondemand_readahead(struct address_space
if (!start || start - offset > max)
return 0;
+ ra_set_pattern(ra, RA_PATTERN_CONTEXT);
ra->start = start;
ra->size = start - offset; /* old async_size */
ra->size += req_size;
@@ -464,14 +473,18 @@ ondemand_readahead(struct address_space
/*
* oversize read
*/
- if (req_size > max)
+ if (req_size > max) {
+ ra_set_pattern(ra, RA_PATTERN_OVERSIZE);
goto initial_readahead;
+ }
/*
* sequential cache miss
*/
- if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
+ if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) {
+ ra_set_pattern(ra, RA_PATTERN_INITIAL);
goto initial_readahead;
+ }
/*
* Query the page cache and look for the traces(cached history pages)
@@ -482,9 +495,12 @@ ondemand_readahead(struct address_space
/*
* standalone, small random read
- * Read as is, and do not pollute the readahead state.
*/
- return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+ ra_set_pattern(ra, RA_PATTERN_RANDOM);
+ ra->start = offset;
+ ra->size = req_size;
+ ra->async_size = 0;
+ goto readit;
initial_readahead:
ra->start = offset;
@@ -502,7 +518,7 @@ readit:
ra->size += ra->async_size;
}
- return ra_submit(ra, mapping, filp);
+ return ra_submit(ra, mapping, filp, offset, req_size);
}
/**
--- linux-next.orig/include/linux/mm.h 2011-11-20 20:10:48.000000000 +0800
+++ linux-next/include/linux/mm.h 2011-11-20 20:10:49.000000000 +0800
@@ -1456,7 +1456,9 @@ void page_cache_async_readahead(struct a
unsigned long max_sane_readahead(unsigned long nr);
unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping,
- struct file *filp);
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long req_size);
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-next.orig/mm/filemap.c 2011-11-20 20:10:48.000000000 +0800
+++ linux-next/mm/filemap.c 2011-11-20 20:10:49.000000000 +0800
@@ -1592,6 +1592,7 @@ static void do_sync_mmap_readahead(struc
return;
if (VM_SequentialReadHint(vma)) {
+ ra->ra_flags |= READAHEAD_MMAP;
page_cache_sync_readahead(mapping, ra, file, offset,
ra->ra_pages);
return;
@@ -1607,11 +1608,13 @@ static void do_sync_mmap_readahead(struc
/*
* mmap read-around
*/
+ ra->ra_flags |= READAHEAD_MMAP;
+ ra_set_pattern(ra, RA_PATTERN_MMAP_AROUND);
ra_pages = max_sane_readahead(ra->ra_pages);
ra->start = max_t(long, 0, offset - ra_pages / 2);
ra->size = ra_pages;
ra->async_size = ra_pages / 4;
- ra_submit(ra, mapping, file);
+ ra_submit(ra, mapping, file, offset, 1);
}
/*
@@ -1630,9 +1633,11 @@ static void do_async_mmap_readahead(stru
if (VM_RandomReadHint(vma))
return;
ra_mmap_miss_dec(ra);
- if (PageReadahead(page))
+ if (PageReadahead(page)) {
+ ra->ra_flags |= READAHEAD_MMAP;
page_cache_async_readahead(mapping, ra, file,
page, offset, ra->ra_pages);
+ }
}
/**
* [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
` (3 preceding siblings ...)
2011-11-21 9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 14:17 ` Andi Kleen
2011-11-21 23:29 ` Andrew Morton
2011-11-21 9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
` (3 subsequent siblings)
8 siblings, 2 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, Wu Fengguang, LKML,
Andi Kleen
[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 10222 bytes --]
The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
but remains inactive unless explicitly enabled, either with the boot option

	readahead_stats=1

or through the debugfs interface

	echo 1 > /debug/readahead/stats_enable
The added overhead is two readahead_stats() calls per readahead, which is a
trivial cost unless there are concurrent random reads on super-fast SSDs,
where updates to the global ra_stats[][] may lead to cache bouncing.
Considering that normal users won't need this except when debugging
performance problems, the accounting is disabled by default, so it looks
reasonable to keep this debug code simple rather than trying to improve its
scalability.
Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)
$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io       size async_size    io_size
initial           545        347         10        535        535          0         74         38          3
subsequent         48         41          1         11          1          5         53         53         15
context           156        156          0          3          0          1       1690       1690         12
around            152        152          0        152        152        152       1920        480         45
backwards           2          0          2          2          2          0          4          0          3
fadvise          2566          0          0       2566          0          0          0          0          1
oversize            0          0          0          0          0          0          0          0          0
random             30          0          1         29         29          0          1          0          1
all              3499        696         14       3298        719          0        171        102          3
The two most important columns are
- io number of readahead IO
- io_size average readahead IO size
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Documentation/kernel-parameters.txt | 6
mm/Kconfig | 15 ++
mm/readahead.c | 194 ++++++++++++++++++++++++++
3 files changed, 215 insertions(+)
--- linux-next.orig/mm/readahead.c 2011-11-21 17:08:43.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-21 17:13:28.000000000 +0800
@@ -18,6 +18,17 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+static const char * const ra_pattern_names[] = {
+ [RA_PATTERN_INITIAL] = "initial",
+ [RA_PATTERN_SUBSEQUENT] = "subsequent",
+ [RA_PATTERN_CONTEXT] = "context",
+ [RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_FADVISE] = "fadvise",
+ [RA_PATTERN_OVERSIZE] = "oversize",
+ [RA_PATTERN_RANDOM] = "random",
+ [RA_PATTERN_ALL] = "all",
+};
+
static int __init config_readahead_size(char *str)
{
unsigned long bytes;
@@ -51,6 +62,182 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static int __init config_readahead_stats(char *str)
+{
+ int enable = 1;
+ get_option(&str, &enable);
+ readahead_stats_enable = enable;
+ return 0;
+}
+early_param("readahead_stats", config_readahead_stats);
+
+enum ra_account {
+ /* number of readaheads */
+ RA_ACCOUNT_COUNT, /* readahead request */
+ RA_ACCOUNT_EOF, /* readahead request covers EOF */
+ RA_ACCOUNT_CHIT, /* readahead request covers some cached pages */
+ RA_ACCOUNT_IOCOUNT, /* readahead IO */
+ RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
+ RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
+ /* number of readahead pages */
+ RA_ACCOUNT_SIZE, /* readahead size */
+ RA_ACCOUNT_ASIZE, /* readahead async size */
+ RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
+ /* end mark */
+ RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static void readahead_stats(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ int actual)
+{
+ unsigned int pattern = ra_pattern(ra_flags);
+
+ ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+ ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
+ ra_stats[pattern][RA_ACCOUNT_ASIZE] += async_size;
+ ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+ if (actual < size) {
+ if (start + size >
+ (i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT)
+ ra_stats[pattern][RA_ACCOUNT_EOF]++;
+ else
+ ra_stats[pattern][RA_ACCOUNT_CHIT]++;
+ }
+
+ if (!actual)
+ return;
+
+ ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+ if (start <= offset && offset < start + size)
+ ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+ if (ra_flags & READAHEAD_MMAP)
+ ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+ unsigned long i;
+
+ seq_printf(s, "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+ "pattern",
+ "readahead", "eof_hit", "cache_hit",
+ "io", "sync_io", "mmap_io",
+ "size", "async_size", "io_size");
+
+ for (i = 0; i < RA_PATTERN_MAX; i++) {
+ unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+ unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+ /*
+ * avoid division-by-zero
+ */
+ if (count == 0)
+ count = 1;
+ if (iocount == 0)
+ iocount = 1;
+
+ seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu %10lu "
+ "%10lu %10lu %10lu\n",
+ ra_pattern_names[i],
+ ra_stats[i][RA_ACCOUNT_COUNT],
+ ra_stats[i][RA_ACCOUNT_EOF],
+ ra_stats[i][RA_ACCOUNT_CHIT],
+ ra_stats[i][RA_ACCOUNT_IOCOUNT],
+ ra_stats[i][RA_ACCOUNT_SYNC],
+ ra_stats[i][RA_ACCOUNT_MMAP],
+ ra_stats[i][RA_ACCOUNT_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ASIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+ }
+
+ return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ memset(ra_stats, 0, sizeof(ra_stats));
+ return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = readahead_stats_open,
+ .write = readahead_stats_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+ struct dentry *root;
+ struct dentry *entry;
+
+ root = debugfs_create_dir("readahead", NULL);
+ if (!root)
+ goto out;
+
+ entry = debugfs_create_file("stats", 0644, root,
+ NULL, &readahead_stats_fops);
+ if (!entry)
+ goto out;
+
+ entry = debugfs_create_bool("stats_enable", 0644, root,
+ &readahead_stats_enable);
+ if (!entry)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+ return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
+static void readahead_event(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual)
+{
+#ifdef CONFIG_READAHEAD_STATS
+ if (readahead_stats_enable) {
+ readahead_stats(mapping, offset, req_size, ra_flags,
+ start, size, async_size, actual);
+ readahead_stats(mapping, offset, req_size,
+ RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
+ start, size, async_size, actual);
+ }
+#endif
+}
+
/*
* see if a page needs releasing upon read_cache_pages() failure
* - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -247,10 +434,14 @@ int force_page_cache_readahead(struct ad
ret = err;
break;
}
+ readahead_event(mapping, offset, nr_to_read,
+ RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+ offset, this_chunk, 0, err);
ret += err;
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
return ret;
}
@@ -278,6 +469,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ readahead_event(mapping, offset, req_size, ra->ra_flags,
+ ra->start, ra->size, ra->async_size, actual);
+
ra->ra_flags &= ~READAHEAD_MMAP;
return actual;
}
--- linux-next.orig/mm/Kconfig 2011-11-21 17:08:31.000000000 +0800
+++ linux-next/mm/Kconfig 2011-11-21 17:08:51.000000000 +0800
@@ -373,3 +373,18 @@ config CLEANCACHE
in a negligible performance hit.
If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+ bool "Collect page cache readahead stats"
+ depends on DEBUG_FS
+ default y
+ help
+ This provides the readahead events accounting facilities.
+
+ To enable accounting early, boot kernel with "readahead_stats=1".
+ Or run these commands after boot:
+
+ echo 1 > /sys/kernel/debug/readahead/stats_enable
+ echo 0 > /sys/kernel/debug/readahead/stats # reset counters
+ # run the workload
+ cat /sys/kernel/debug/readahead/stats # check counters
--- linux-next.orig/Documentation/kernel-parameters.txt 2011-11-21 17:08:38.000000000 +0800
+++ linux-next/Documentation/kernel-parameters.txt 2011-11-21 17:08:51.000000000 +0800
@@ -2251,6 +2251,12 @@ bytes respectively. Such letter suffixes
This default max readahead size may be overridden
in some cases, notably NFS, btrfs and software RAID.
+ readahead_stats[=0|1]
+ Enable/disable readahead stats accounting.
+
+ It's also possible to enable/disable it after boot:
+ echo 1 > /sys/kernel/debug/readahead/stats_enable
+
reboot= [BUGS=X86-32,BUGS=ARM,BUGS=IA-64] Rebooting mode
Format: <reboot_mode>[,<reboot_mode2>[,...]]
See arch/*/kernel/reboot.c or arch/*/kernel/process.c
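The readahead_event() helper added above accounts each event twice: once under its specific pattern and once under an aggregate RA_PATTERN_ALL bucket, with the pattern packed into the top bits of ra_flags (READAHEAD_PATTERN_SHIFT = 28, READAHEAD_PATTERN = 0xf0000000 elsewhere in the series). A minimal userspace sketch of that encoding (helper and macro names here are illustrative, not the kernel's):

```c
/* Userspace sketch of the series' pattern encoding: the readahead
 * pattern occupies the top four bits of ra_flags, so one 32-bit
 * word carries both the flag bits and the access pattern. */
#define PATTERN_SHIFT 28
#define PATTERN_MASK  0xf0000000u

/* Replace whatever pattern is stored and keep the other flag bits. */
static unsigned int set_pattern(unsigned int ra_flags, unsigned int pattern)
{
	return (ra_flags & ~PATTERN_MASK) | (pattern << PATTERN_SHIFT);
}

/* Extract the stored pattern index. */
static unsigned int get_pattern(unsigned int ra_flags)
{
	return (ra_flags & PATTERN_MASK) >> PATTERN_SHIFT;
}
```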
* [PATCH 6/8] readahead: add debug tracing event
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
` (4 preceding siblings ...)
2011-11-21 9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 14:01 ` Steven Rostedt
2011-11-21 9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
` (2 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Steven Rostedt, Peter Zijlstra, Rik van Riel,
Wu Fengguang, LKML, Andi Kleen
[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 3524 bytes --]
This is very useful for verifying whether the algorithms are working
to our expectations.
Example output:
# echo 1 > /debug/tracing/events/vfs/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
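Each line reads as req=&lt;offset&gt;+&lt;req_size&gt;, ra=&lt;start&gt;+&lt;size&gt;-&lt;async_size&gt;, with the trailing value being the number of pages actually submitted. As an illustration (not part of the patch), such lines can be parsed back into their fields with a small sscanf() helper:

```c
#include <stdio.h>

/* Hypothetical helper: parse one readahead trace line into its fields.
 * Format:
 *   readahead-<pattern>(dev=MAJ:MIN, ino=I, req=offset+req_size,
 *                       ra=start+size-async_size, async=A) = actual
 * Returns 1 on success, 0 on a malformed line. */
struct ra_trace {
	char pattern[16];
	int major, minor;
	unsigned long ino, offset, req_size, start;
	int size, async_size, async, actual;
};

static int parse_ra_trace(const char *line, struct ra_trace *t)
{
	return sscanf(line,
		      "readahead-%15[^(](dev=%d:%d, ino=%lu, req=%lu+%lu, "
		      "ra=%lu+%d-%d, async=%d) = %d",
		      t->pattern, &t->major, &t->minor, &t->ino,
		      &t->offset, &t->req_size, &t->start,
		      &t->size, &t->async_size, &t->async, &t->actual) == 11;
}
```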
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/vfs.h | 64 +++++++++++++++++++++++++++++++++++
mm/readahead.c | 5 ++
2 files changed, 69 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/vfs.h 2011-11-21 17:17:45.000000000 +0800
@@ -0,0 +1,64 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size,
+ ra_flags, start, size, async_size, actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->pattern = ra_pattern(ra_flags);
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ ra_pattern_names[__entry->pattern],
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux-next.orig/mm/readahead.c 2011-11-21 17:17:45.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-21 17:17:49.000000000 +0800
@@ -48,6 +48,9 @@ static int __init config_readahead_size(
}
early_param("readahead", config_readahead_size);
+#define CREATE_TRACE_POINTS
+#include <trace/events/vfs.h>
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -236,6 +239,8 @@ static void readahead_event(struct addre
start, size, async_size, actual);
}
#endif
+ trace_readahead(mapping, offset, req_size, ra_flags,
+ start, size, async_size, actual);
}
/*
* [PATCH 7/8] readahead: basic support for backwards prefetching
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
` (5 preceding siblings ...)
2011-11-21 9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 23:33 ` Andrew Morton
2011-11-21 9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
2011-11-21 9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Li Shaohua, Wu Fengguang, LKML
[-- Attachment #1: readahead-backwards.patch --]
[-- Type: text/plain, Size: 3890 bytes --]
Add the backwards prefetching feature. It's kept pretty simple by not
supporting async prefetching and interleaved reads.
Here is the behavior with an 8-page read sequence from 10000 down to 0.
(The readahead size is a bit large since it's an NFS mount.)
readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8
readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32
readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128
readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368
And a simple 1-page read sequence from 10000 down to 0.
readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1
readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4
readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16
readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64
readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 444
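The growth pattern above is the usual get_next_ra_size() ramp-up (quadruple while the window is small, then double, capped at the 1920-page maximum of this NFS mount), plus the extra rule that the window is clipped at the old window's start and slid downwards. A userspace sketch of that arithmetic (function names are illustrative, not the kernel's):

```c
/* Mimic the get_next_ra_size() ramp: 4x while small, then 2x,
 * capped at the per-bdi maximum (in pages). */
static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	unsigned long newsize = (cur < max / 16) ? 4 * cur : 2 * cur;

	return newsize < max ? newsize : max;
}

/* *start/*size describe the previous window; on return they describe
 * the new backwards window, clipped so it never extends below page 0
 * and sliding down so it ends where the old window began. */
static void backwards_window(unsigned long *start, unsigned long *size,
			     unsigned long max)
{
	unsigned long newsize = next_ra_size(*size, max);

	if (newsize > *start)
		newsize = *start;
	*start -= newsize;
	*size = newsize;
}
```

Stepping this from the initial random 10000+8 window reproduces the 8-page trace above exactly.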
CC: Andi Kleen <andi@firstfloor.org>
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 1 +
mm/readahead.c | 14 ++++++++++++++
2 files changed, 15 insertions(+)
--- linux-next.orig/include/linux/fs.h 2011-11-21 17:17:44.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-21 17:17:47.000000000 +0800
@@ -964,6 +964,7 @@ enum readahead_pattern {
RA_PATTERN_SUBSEQUENT,
RA_PATTERN_CONTEXT,
RA_PATTERN_MMAP_AROUND,
+ RA_PATTERN_BACKWARDS,
RA_PATTERN_FADVISE,
RA_PATTERN_OVERSIZE,
RA_PATTERN_RANDOM,
--- linux-next.orig/mm/readahead.c 2011-11-21 17:17:45.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-21 17:17:47.000000000 +0800
@@ -23,6 +23,7 @@ static const char * const ra_pattern_nam
[RA_PATTERN_SUBSEQUENT] = "subsequent",
[RA_PATTERN_CONTEXT] = "context",
[RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_BACKWARDS] = "backwards",
[RA_PATTERN_FADVISE] = "fadvise",
[RA_PATTERN_OVERSIZE] = "oversize",
[RA_PATTERN_RANDOM] = "random",
@@ -686,6 +687,19 @@ ondemand_readahead(struct address_space
}
/*
+ * backwards reading
+ */
+ if (offset < ra->start && offset + req_size >= ra->start) {
+ ra_set_pattern(ra, RA_PATTERN_BACKWARDS);
+ ra->size = get_next_ra_size(ra, max);
+ if (ra->size > ra->start)
+ ra->size = ra->start;
+ ra->async_size = 0;
+ ra->start -= ra->size;
+ goto readit;
+ }
+
+ /*
* Query the page cache and look for the traces(cached history pages)
* that a sequential stream would leave behind.
*/
* [PATCH 8/8] readahead: dont do start-of-file readahead after lseek()
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
` (6 preceding siblings ...)
2011-11-21 9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
@ 2011-11-21 9:18 ` Wu Fengguang
2011-11-21 23:36 ` Andrew Morton
2011-11-21 9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Rik van Riel,
Linus Torvalds, Wu Fengguang, LKML, Andi Kleen
[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2136 bytes --]
Some applications (eg. blkid, id3tool etc.) seek around the file
to get information. For example, blkid does
seek to 0
read 1024
seek to 1536
read 16384
The start-of-file readahead heuristic is wrong for such applications,
whose access pattern can be identified by their lseek() calls.
So test-and-set a READAHEAD_LSEEK flag on lseek() and skip the
start-of-file readahead when the flag is set. Proposed by Linus.
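The resulting decision at the start-of-file branch can be sketched in userspace as follows (names are made up: RA_LSEEK stands in for READAHEAD_LSEEK, max is the per-file readahead limit in pages):

```c
/* Hypothetical mirror of the new check in ondemand_readahead(). */
#define RA_LSEEK 0x1u	/* set (once) by lseek() */

/* Returns 1 when a read of req_size pages at page 'offset' should get
 * the aggressive start-of-file readahead, 0 when it should fall
 * through to the small-random-read path. */
static int want_initial_readahead(unsigned int ra_flags,
				  unsigned long offset,
				  unsigned long req_size,
				  unsigned long max)
{
	if (offset != 0)
		return 0;	/* not reading from the start of file */
	if ((ra_flags & RA_LSEEK) && req_size < max)
		return 0;	/* seeky reader, e.g. blkid */
	return 1;
}
```

A large request at offset 0 still gets initial readahead even after a seek, since in that case the readahead can't hurt.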
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/read_write.c | 4 ++++
include/linux/fs.h | 1 +
mm/readahead.c | 3 +++
3 files changed, 8 insertions(+)
--- linux-next.orig/mm/readahead.c 2011-11-20 22:02:01.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-20 22:02:03.000000000 +0800
@@ -629,6 +629,8 @@ ondemand_readahead(struct address_space
* start of file
*/
if (!offset) {
+ if ((ra->ra_flags & READAHEAD_LSEEK) && req_size < max)
+ goto random_read;
ra_set_pattern(ra, RA_PATTERN_INITIAL);
goto initial_readahead;
}
@@ -707,6 +709,7 @@ ondemand_readahead(struct address_space
if (try_context_readahead(mapping, ra, offset, req_size, max))
goto readit;
+random_read:
/*
* standalone, small random read
*/
--- linux-next.orig/fs/read_write.c 2011-11-20 22:02:01.000000000 +0800
+++ linux-next/fs/read_write.c 2011-11-20 22:02:03.000000000 +0800
@@ -47,6 +47,10 @@ static loff_t lseek_execute(struct file
file->f_pos = offset;
file->f_version = 0;
}
+
+ if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
+ file->f_ra.ra_flags |= READAHEAD_LSEEK;
+
return offset;
}
--- linux-next.orig/include/linux/fs.h 2011-11-20 22:02:01.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-20 22:02:03.000000000 +0800
@@ -952,6 +952,7 @@ struct file_ra_state {
/* ra_flags bits */
#define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
#define READAHEAD_MMAP 0x00010000
+#define READAHEAD_LSEEK 0x00020000 /* be conservative after lseek() */
#define READAHEAD_PATTERN_SHIFT 28
#define READAHEAD_PATTERN 0xf0000000
* Re: [PATCH 0/8] readahead stats/tracing, backwards prefetching and more
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
` (7 preceding siblings ...)
2011-11-21 9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
@ 2011-11-21 9:56 ` Christoph Hellwig
2011-11-21 12:00 ` Wu Fengguang
8 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2011-11-21 9:56 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel, LKML,
Andi Kleen
On Mon, Nov 21, 2011 at 05:18:19PM +0800, Wu Fengguang wrote:
> Andrew,
>
> I'm getting around to pick up the readahead works again :-)
>
> This first series is mainly to add some debug facilities, to support the long
> missed backwards prefetching capability, and some old patches that somehow get
> delayed (shame me).
>
> The next step would be to better handle the readahead thrashing situations.
> That would require rewriting part of the algorithms, this is why I'd like to
> keep the backwards prefetching simple and stupid for now.
>
> When (almost) free of readahead thrashing, we'll be in a good position to lift
> the default readahead size. Which I suspect would be the single most efficient
> way to improve performance for the large volumes of casually maintained Linux
> file servers.
Btw, if you work actively in that area I have a todo list item I was
planning to look into sooner or later: instead of embedding the ra
state into the struct file, allocate it dynamically. That way files that
either don't use the pagecache, or aren't read from, won't have to
pay the price for an increased struct file size, and if we have to we
could enlarge it more easily. Besides removing f_version from the common
struct file and also allocating f_owner separately, those seem to be the
easiest ways to get the struct file size down.
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
@ 2011-11-21 10:00 ` Christoph Hellwig
2011-11-21 11:24 ` Wu Fengguang
2011-11-21 12:47 ` Andi Kleen
2011-11-21 14:46 ` Jeff Moyer
2011-11-21 22:52 ` Andrew Morton
2 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2011-11-21 10:00 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Li Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
Andi Kleen
On Mon, Nov 21, 2011 at 05:18:20PM +0800, Wu Fengguang wrote:
> > This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
>
> Given that the non-rotational attribute is not always reported, we can
> take disk size as a max readahead size hint. This patch uses a formula
> that generates the following concrete limits:
Given that you mentioned the rotational flag and device size in this
mail, as well as benchmarking with an Intel SSD - did you measure
how useful large readahead sizes still are with high-end Flash devices
that have extremely high read IOPS rates?
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-21 9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
@ 2011-11-21 10:01 ` Christoph Hellwig
2011-11-21 11:35 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2011-11-21 10:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Ankit Jain, Dave Chinner, Christian Ehrhardt, Rik van Riel,
Nikanth Karthikesan, LKML, Andi Kleen
On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> From: Nikanth Karthikesan <knikanth@suse.de>
>
> Add new kernel parameter "readahead=", which allows user to override
> the static VM_MAX_READAHEAD=128kb.
Is a boot-time parameter really such a good idea? I would at least make
it a sysctl so that it's run-time controllable, including being able
to set it from initscripts.
* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-21 9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
@ 2011-11-21 11:04 ` Steven Whitehouse
2011-11-21 11:42 ` Wu Fengguang
2011-11-21 23:01 ` Andrew Morton
1 sibling, 1 reply; 47+ messages in thread
From: Steven Whitehouse @ 2011-11-21 11:04 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Andi Kleen, Rik van Riel, LKML
Hi,
I'm not quite sure why you copied me in on this patch, but I've had a
look at it and it seems ok to me. Some of the other patches in this
series look as if they might be rather useful for the GFS2 dir readahead
code though, so I'll be keeping an eye on developments in this area,
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Steve.
On Mon, 2011-11-21 at 17:18 +0800, Wu Fengguang wrote:
> plain text document attachment (readahead-flags.patch)
> Introduce a readahead flags field and embed the existing mmap_miss in it
> (mainly to save space).
>
> It will be possible to lose the flags in race conditions, however the
> impact should be limited. For the race to happen, there must be two
> threads sharing the same file descriptor to be in page fault or
> readahead at the same time.
>
> Note that it has always been racy for "page faults" at the same time.
>
> And if ever the race happens, we'll lose one mmap_miss++ or mmap_miss--.
> Which may change some concrete readahead behavior, but won't really
> impact overall I/O performance.
>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Steven Whitehouse <swhiteho@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/fs.h | 31 ++++++++++++++++++++++++++++++-
> mm/filemap.c | 9 ++-------
> 2 files changed, 32 insertions(+), 8 deletions(-)
>
> --- linux-next.orig/include/linux/fs.h 2011-11-20 11:30:55.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-20 11:48:53.000000000 +0800
> @@ -945,10 +945,39 @@ struct file_ra_state {
> there are only # of pages ahead */
>
> unsigned int ra_pages; /* Maximum readahead window */
> - unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
> + unsigned int ra_flags;
> loff_t prev_pos; /* Cache last read() position */
> };
>
> +/* ra_flags bits */
> +#define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
> +
> +/*
> + * Don't do ra_flags++ directly to avoid possible overflow:
> + * the ra fields can be accessed concurrently in a racy way.
> + */
> +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> +{
> + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> + /* the upper bound avoids banging the cache line unnecessarily */
> + if (miss < READAHEAD_MMAP_MISS) {
> + miss++;
> + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> + }
> + return miss;
> +}
> +
> +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> +{
> + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> + if (miss) {
> + miss--;
> + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> + }
> +}
> +
> /*
> * Check if @index falls in the readahead windows.
> */
> --- linux-next.orig/mm/filemap.c 2011-11-20 11:30:55.000000000 +0800
> +++ linux-next/mm/filemap.c 2011-11-20 11:48:29.000000000 +0800
> @@ -1597,15 +1597,11 @@ static void do_sync_mmap_readahead(struc
> return;
> }
>
> - /* Avoid banging the cache line if not needed */
> - if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
> - ra->mmap_miss++;
> -
> /*
> * Do we miss much more than hit in this file? If so,
> * stop bothering with read-ahead. It will only hurt.
> */
> - if (ra->mmap_miss > MMAP_LOTSAMISS)
> + if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
> return;
>
> /*
> @@ -1633,8 +1629,7 @@ static void do_async_mmap_readahead(stru
> /* If we don't want any read-ahead, don't bother */
> if (VM_RandomReadHint(vma))
> return;
> - if (ra->mmap_miss > 0)
> - ra->mmap_miss--;
> + ra_mmap_miss_dec(ra);
> if (PageReadahead(page))
> page_cache_async_readahead(mapping, ra, file,
> page, offset, ra->ra_pages);
>
>
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 10:00 ` Christoph Hellwig
@ 2011-11-21 11:24 ` Wu Fengguang
2011-11-21 12:47 ` Andi Kleen
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 11:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel, Li,
Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
Andi Kleen
On Mon, Nov 21, 2011 at 06:00:04PM +0800, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:20PM +0800, Wu Fengguang wrote:
> > This looks reasonable: smaller devices tend to be slower (USB sticks as
> > well as micro/mobile/old hard disks).
> >
> > Given that the non-rotational attribute is not always reported, we can
> > take disk size as a max readahead size hint. This patch uses a formula
> > that generates the following concrete limits:
>
> Given that you mentioned the rotational flag and device size in this
> mail, as well as benchmarking with an Intel SSD - did you measure
> how useful large readahead sizes still are with high-end Flash devices
> that have extremely high read IOPS rates?
I don't know -- I don't have access to such high-end devices.
However, the patch changelog includes a simple test script. It would be
highly appreciated if someone could help collect the data :)
Thanks,
Fengguang
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-21 10:01 ` Christoph Hellwig
@ 2011-11-21 11:35 ` Wu Fengguang
2011-11-24 22:28 ` Jan Kara
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 11:35 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Ankit Jain, Dave Chinner, Christian Ehrhardt, Rik van Riel,
Nikanth Karthikesan, LKML, Andi Kleen
On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > From: Nikanth Karthikesan <knikanth@suse.de>
> >
> > Add new kernel parameter "readahead=", which allows user to override
> > the static VM_MAX_READAHEAD=128kb.
>
> Is a boot-time parameter really such a good idea? I would at least
It's most convenient to set at boot time, because the default size
will be used to initialize all the block devices.
> make it a sysctl so that it's run-time controllable, including
> being able to set it from initscripts.
Once booted, it's more natural to set the sizes one device at a time, for
example:
blockdev --setra 1024 /dev/sda2
or
echo 512 > /sys/block/sda/queue/read_ahead_kb
And you still have the chance to modify the global default, but the
change will only be inherited by newly created devices thereafter:
echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
The above command is very suitable for use in initscripts. However,
there is no natural way to do this via sysctl, as there is no such
global value.
Thanks,
Fengguang
* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-21 11:04 ` Steven Whitehouse
@ 2011-11-21 11:42 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 11:42 UTC (permalink / raw)
To: Steven Whitehouse
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Andi Kleen, Rik van Riel, LKML
Hi Steven,
On Mon, Nov 21, 2011 at 07:04:27PM +0800, Steven Whitehouse wrote:
> Hi,
>
> I'm not quite sure why you copied me in on this patch, but I've had a
Yeah, it's such an old patch that I've forgotten why I CCed you ;)
> look at it and it seems ok to me. Some of the other patches in this
> series look as if they might be rather useful for the GFS2 dir readahead
> code though, so I'll be keeping an eye on developments in this area,
>
> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Thanks! That reminds me of the "metadata readahead". It should be
possible to do more metadata readahead in future, hence we might add a
"meta_io" column in the readahead stats file :)
Thanks,
Fengguang
> On Mon, 2011-11-21 at 17:18 +0800, Wu Fengguang wrote:
> > plain text document attachment (readahead-flags.patch)
> > Introduce a readahead flags field and embed the existing mmap_miss in it
> > (mainly to save space).
> >
> > It will be possible to lose the flags in race conditions, however the
> > impact should be limited. For the race to happen, there must be two
> > threads sharing the same file descriptor to be in page fault or
> > readahead at the same time.
> >
> > Note that it has always been racy for "page faults" at the same time.
> >
> > And if ever the race happens, we'll lose one mmap_miss++ or mmap_miss--.
> > Which may change some concrete readahead behavior, but won't really
> > impact overall I/O performance.
> >
> > CC: Andi Kleen <andi@firstfloor.org>
> > CC: Steven Whitehouse <swhiteho@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > include/linux/fs.h | 31 ++++++++++++++++++++++++++++++-
> > mm/filemap.c | 9 ++-------
> > 2 files changed, 32 insertions(+), 8 deletions(-)
> >
> > --- linux-next.orig/include/linux/fs.h 2011-11-20 11:30:55.000000000 +0800
> > +++ linux-next/include/linux/fs.h 2011-11-20 11:48:53.000000000 +0800
> > @@ -945,10 +945,39 @@ struct file_ra_state {
> > there are only # of pages ahead */
> >
> > unsigned int ra_pages; /* Maximum readahead window */
> > - unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
> > + unsigned int ra_flags;
> > loff_t prev_pos; /* Cache last read() position */
> > };
> >
> > +/* ra_flags bits */
> > +#define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
> > +
> > +/*
> > + * Don't do ra_flags++ directly to avoid possible overflow:
> > + * the ra fields can be accessed concurrently in a racy way.
> > + */
> > +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> > +{
> > + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > + /* the upper bound avoids banging the cache line unnecessarily */
> > + if (miss < READAHEAD_MMAP_MISS) {
> > + miss++;
> > + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > + }
> > + return miss;
> > +}
> > +
> > +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> > +{
> > + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > + if (miss) {
> > + miss--;
> > + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > + }
> > +}
> > +
> > /*
> > * Check if @index falls in the readahead windows.
> > */
> > --- linux-next.orig/mm/filemap.c 2011-11-20 11:30:55.000000000 +0800
> > +++ linux-next/mm/filemap.c 2011-11-20 11:48:29.000000000 +0800
> > @@ -1597,15 +1597,11 @@ static void do_sync_mmap_readahead(struc
> > return;
> > }
> >
> > - /* Avoid banging the cache line if not needed */
> > - if (ra->mmap_miss < MMAP_LOTSAMISS * 10)
> > - ra->mmap_miss++;
> > -
> > /*
> > * Do we miss much more than hit in this file? If so,
> > * stop bothering with read-ahead. It will only hurt.
> > */
> > - if (ra->mmap_miss > MMAP_LOTSAMISS)
> > + if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
> > return;
> >
> > /*
> > @@ -1633,8 +1629,7 @@ static void do_async_mmap_readahead(stru
> > /* If we don't want any read-ahead, don't bother */
> > if (VM_RandomReadHint(vma))
> > return;
> > - if (ra->mmap_miss > 0)
> > - ra->mmap_miss--;
> > + ra_mmap_miss_dec(ra);
> > if (PageReadahead(page))
> > page_cache_async_readahead(mapping, ra, file,
> > page, offset, ra->ra_pages);
> >
> >
>
>
* Re: [PATCH 0/8] readahead stats/tracing, backwards prefetching and more
2011-11-21 9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
@ 2011-11-21 12:00 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-21 12:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel, LKML,
Andi Kleen
On Mon, Nov 21, 2011 at 05:56:38PM +0800, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:19PM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > I'm getting around to pick up the readahead works again :-)
> >
> > This first series is mainly to add some debug facilities, to support the long
> > missed backwards prefetching capability, and some old patches that somehow get
> > delayed (shame me).
> >
> > The next step would be to better handle the readahead thrashing situations.
> > That would require rewriting part of the algorithms, this is why I'd like to
> > keep the backwards prefetching simple and stupid for now.
> >
> > When (almost) free of readahead thrashing, we'll be in a good position to lift
> > the default readahead size. Which I suspect would be the single most efficient
> > way to improve performance for the large volumes of casually maintained Linux
> > file servers.
>
> Btw, if you work actively in that area I have a todo list item I was
> planning to look into sooner or later: instead of embedding the ra
> state into the struct file allocate it dynamically. That way files that
> either don't use the pagecache, or aren't read from, won't have to
> pay the price for an increased struct file size, and if we have to we
> could enlarge it more easily.
Agreed. That's good to have, please allow me to move it into my todo list :)
> Besides removing f_version from the common
> struct file and also allocating f_owner separately, those seem to be the
> easiest ways to get the struct file size down.
Yeah, there doesn't seem to be much code accessing fown_struct.
I may consider that when I'm working on file_ra_state, but no promises ;)
Thanks,
Fengguang
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 10:00 ` Christoph Hellwig
2011-11-21 11:24 ` Wu Fengguang
@ 2011-11-21 12:47 ` Andi Kleen
1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2011-11-21 12:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Wu Fengguang, Andrew Morton, Linux Memory Management List,
linux-fsdevel, Li Shaohua, Clemens Ladisch, Jens Axboe,
Rik van Riel, LKML, Andi Kleen
On Mon, Nov 21, 2011 at 05:00:04AM -0500, Christoph Hellwig wrote:
> On Mon, Nov 21, 2011 at 05:18:20PM +0800, Wu Fengguang wrote:
> > Given that the non-rotational attribute is not always reported, we can
> > take disk size as a max readahead size hint. This patch uses a formula
> > that generates the following concrete limits:
>
> Given that you mentioned the rotational flag and device size in this
> mail, as well as benchmarking with an Intel SSD - did you measure
> how useful large readahead sizes still are with high-end Flash devices
> that have extremely high read IOPS rates?
The more IOPS, the larger the "window" you need to keep everything
going, I suspect.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH 6/8] readahead: add debug tracing event
2011-11-21 9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
@ 2011-11-21 14:01 ` Steven Rostedt
0 siblings, 0 replies; 47+ messages in thread
From: Steven Rostedt @ 2011-11-21 14:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML,
Andi Kleen
On Mon, 2011-11-21 at 17:18 +0800, Wu Fengguang wrote:
> plain text document attachment (readahead-tracer.patch)
> This is very useful for verifying whether the algorithms are working
> to our expectations.
>
> Example output:
>
> # echo 1 > /debug/tracing/events/vfs/readahead/enable
> # cp test-file /dev/null
> # cat /debug/tracing/trace # trimmed output
> readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Jens Axboe <jens.axboe@oracle.com>
> CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-21 9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
@ 2011-11-21 14:17 ` Andi Kleen
2011-11-22 14:14 ` Wu Fengguang
2011-11-21 23:29 ` Andrew Morton
1 sibling, 1 reply; 47+ messages in thread
From: Andi Kleen @ 2011-11-21 14:17 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML,
Andi Kleen
> +static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
Why not make it per cpu? That should get the overhead down, probably
even enough that it can be enabled by default.
BTW I have an older framework to make it really easy to add per
cpu stats counters to debugfs. Will repost, that would simplify
it even more.
-Andi
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
2011-11-21 10:00 ` Christoph Hellwig
@ 2011-11-21 14:46 ` Jeff Moyer
2011-11-21 22:52 ` Andrew Morton
2 siblings, 0 replies; 47+ messages in thread
From: Jeff Moyer @ 2011-11-21 14:46 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Li Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
Andi Kleen
Wu Fengguang <fengguang.wu@intel.com> writes:
> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> on which blkid runs unpleasantly slow. He manages to optimize the blkid
> reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
>
> lseek 0, read 1024 => readahead 4 pages (start of file)
> lseek 1536, read 16384 => readahead 8 pages (page contiguous)
>
> The readahead heuristics involved here are reasonable ones in general.
> So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
>
> For the kernel part, Linus suggests:
> So maybe we could be less aggressive about read-ahead when the size of
> the device is small? Turning a 16kB read into a 64kB one is a big deal,
> when it's about 15% of the whole device!
>
> This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
>
> Given that the non-rotational attribute is not always reported, we can
> take disk size as a max readahead size hint. This patch uses a formula
> that generates the following concrete limits:
>
> disk size readahead size
> (scale by 4) (scale by 2)
> 1M 8k
> 4M 16k
> 16M 32k
> 64M 64k
> 256M 128k
> --------------------------- (*)
> 1G 256k
> 4G 512k
> 16G 1024k
> 64G 2048k
> 256G 4096k
>
> (*) Since the default readahead size is 128k, this limit only takes
> effect for devices whose size is less than 256M.
Scaling readahead by the size of the device may make sense up to a
point. But really, that shouldn't be the only factor considered. Just
because you have a big disk, it doesn't mean it's fast, and it sure
doesn't mean that you should waste memory with readahead data that may
not be used before it's evicted (whether due to readahead on data that
isn't needed or thrashing of the page cache).
So, I think reducing the readahead size for smaller devices is fine.
I'm not a big fan of going the other way.
Cheers,
Jeff
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
2011-11-21 10:00 ` Christoph Hellwig
2011-11-21 14:46 ` Jeff Moyer
@ 2011-11-21 22:52 ` Andrew Morton
2011-11-22 14:23 ` Jeff Moyer
2011-11-23 12:18 ` Wu Fengguang
2 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 22:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Li Shaohua,
Clemens Ladisch, Jens Axboe, Rik van Riel, LKML, Andi Kleen
On Mon, 21 Nov 2011 17:18:20 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> on which blkid runs unpleasantly slow. He manages to optimize the blkid
> reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
>
> lseek 0, read 1024 => readahead 4 pages (start of file)
I'm disturbed that the code did a 4 page (16kbyte?) readahead after an
lseek. Given the high probability that the next read will occur after
a second lseek, that's a mistake.
Was an lseek to offset 0 special-cased?
> lseek 1536, read 16384 => readahead 8 pages (page contiguous)
>
> The readahead heuristics involved here are reasonable ones in general.
> So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
>
> For the kernel part, Linus suggests:
> So maybe we could be less aggressive about read-ahead when the size of
> the device is small? Turning a 16kB read into a 64kB one is a big deal,
> when it's about 15% of the whole device!
>
> This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
Spose so. Obviously there are other characteristics which should be
considered when choosing a readahead size, but one of them can be disk
size and that's what this change does.
In a better world, userspace would run a
work-out-what-readahead-size-to-use script each time a distro is
installed and when new storage devices are added/detected. Userspace
would then remember that readahead size for subsequent bootups.
In the real world, we shovel guaranteed-to-be-wrong guesswork into the
kernel and everyone just uses the results. Sigh.
> --- linux-next.orig/block/genhd.c 2011-10-31 00:13:51.000000000 +0800
> +++ linux-next/block/genhd.c 2011-11-18 11:27:08.000000000 +0800
> @@ -623,6 +623,26 @@ void add_disk(struct gendisk *disk)
> WARN_ON(retval);
>
> disk_add_events(disk);
> +
> + /*
> + * Limit default readahead size for small devices.
> + * disk size readahead size
> + * 1M 8k
> + * 4M 16k
> + * 16M 32k
> + * 64M 64k
> + * 256M 128k
> + * 1G 256k
> + * 4G 512k
> + * 16G 1024k
> + * 64G 2048k
> + * 256G 4096k
> + */
> + if (get_capacity(disk)) {
> + unsigned long size = get_capacity(disk) >> 9;
get_capacity() returns sector_t. This expression will overflow with a
2T disk. I'm not sure if we successfully support 2T disks on 32-bit
machines, but changes like this will guarantee that we don't :)
> + size = 1UL << (ilog2(size) / 2);
I think there's a rounddown_pow_of_two() hiding in that expression?
> + bdi->ra_pages = min(bdi->ra_pages, size);
I don't have a clue why that min() is in there. It needs a comment,
please.
> + }
* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-21 9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
2011-11-21 11:04 ` Steven Whitehouse
@ 2011-11-21 23:01 ` Andrew Morton
2011-11-23 12:47 ` Wu Fengguang
1 sibling, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Steven Whitehouse, Rik van Riel, LKML
On Mon, 21 Nov 2011 17:18:22 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Introduce a readahead flags field and embed the existing mmap_miss in it
> (mainly to save space).
What an ugly patch.
> It will be possible to lose the flags in race conditions, however the
> impact should be limited. For the race to happen, there must be two
> threads sharing the same file descriptor to be in page fault or
> readahead at the same time.
>
> Note that it has always been racy for "page faults" at the same time.
>
> And if ever the race happens, we'll lose one mmap_miss++ or mmap_miss--.
> Which may change some concrete readahead behavior, but won't really
> impact overall I/O performance.
>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Steven Whitehouse <swhiteho@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> include/linux/fs.h | 31 ++++++++++++++++++++++++++++++-
> mm/filemap.c | 9 ++-------
> 2 files changed, 32 insertions(+), 8 deletions(-)
>
> --- linux-next.orig/include/linux/fs.h 2011-11-20 11:30:55.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-20 11:48:53.000000000 +0800
> @@ -945,10 +945,39 @@ struct file_ra_state {
> there are only # of pages ahead */
>
> unsigned int ra_pages; /* Maximum readahead window */
> - unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
> + unsigned int ra_flags;
And it doesn't actually save any space, unless ra_flags gets used for
something else in a subsequent patch. And if it does, perhaps ra_flags
should be ulong, which is compatible with the bitops.h code.
Or perhaps we should use a bitfield and let the compiler do the work.
> loff_t prev_pos; /* Cache last read() position */
> };
>
> +/* ra_flags bits */
> +#define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
> +
> +/*
> + * Don't do ra_flags++ directly to avoid possible overflow:
> + * the ra fields can be accessed concurrently in a racy way.
> + */
> +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> +{
> + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> + /* the upper bound avoids banging the cache line unnecessarily */
> + if (miss < READAHEAD_MMAP_MISS) {
> + miss++;
> + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> + }
> + return miss;
> +}
> +
> +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> +{
> + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> +
> + if (miss) {
> + miss--;
> + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> + }
> +}
It's strange that ra_mmap_miss_inc() returns the new value whereas
ra_mmap_miss_dec() returns void.
* Re: [PATCH 4/8] readahead: record readahead patterns
2011-11-21 9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
@ 2011-11-21 23:19 ` Andrew Morton
2011-11-29 2:40 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:19 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
On Mon, 21 Nov 2011 17:18:23 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Record the readahead pattern in ra_flags and extend the ra_submit()
> parameters, to be used by the next readahead tracing/stats patches.
>
> 7 patterns are defined:
>
> pattern readahead for
> -----------------------------------------------------------
> RA_PATTERN_INITIAL start-of-file read
> RA_PATTERN_SUBSEQUENT trivial sequential read
> RA_PATTERN_CONTEXT interleaved sequential read
> RA_PATTERN_OVERSIZE oversize read
> RA_PATTERN_MMAP_AROUND mmap fault
> RA_PATTERN_FADVISE posix_fadvise()
> RA_PATTERN_RANDOM random read
It would be useful to spell out in full detail what an "interleaved
sequential read" is, and why a read is considered "oversized", etc.
The 'enum readahead_pattern' definition site would be a good place for
this.
> Note that random reads will be recorded in file_ra_state now.
> This won't deteriorate cache bouncing because the ra->prev_pos update
> in do_generic_file_read() already pollutes the data cache, and
> filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
>
> --- linux-next.orig/include/linux/fs.h 2011-11-20 20:10:48.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-20 20:18:29.000000000 +0800
> @@ -951,6 +951,39 @@ struct file_ra_state {
>
> /* ra_flags bits */
> #define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
> +#define READAHEAD_MMAP 0x00010000
Why leave a gap?
And what is READAHEAD_MMAP anyway?
> +#define READAHEAD_PATTERN_SHIFT 28
Why 28?
> +#define READAHEAD_PATTERN 0xf0000000
> +
> +/*
> + * Which policy makes decision to do the current read-ahead IO?
> + */
> +enum readahead_pattern {
> + RA_PATTERN_INITIAL,
> + RA_PATTERN_SUBSEQUENT,
> + RA_PATTERN_CONTEXT,
> + RA_PATTERN_MMAP_AROUND,
> + RA_PATTERN_FADVISE,
> + RA_PATTERN_OVERSIZE,
> + RA_PATTERN_RANDOM,
> + RA_PATTERN_ALL, /* for summary stats */
> + RA_PATTERN_MAX
> +};
Again, the behaviour is all undocumented. I see from the code that
multiple flags can be set at the same time. So afaict a file can be
marked RANDOM and SUBSEQUENT at the same time, which seems oxymoronic.
This reader wants to know what the implications of this are - how the
code chooses, prioritises and acts. But this code doesn't tell me.
> +static inline unsigned int ra_pattern(unsigned int ra_flags)
> +{
> + unsigned int pattern = ra_flags >> READAHEAD_PATTERN_SHIFT;
OK, no masking is needed because the code silently assumes that arg
`ra_flags' came out of an ra_state.ra_flags and it also silently
assumes that no higher bits are used in ra_state.ra_flags.
That's a bit of a handgrenade - if someone redoes the flags
enumeration, the code will explode.
> + return min_t(unsigned int, pattern, RA_PATTERN_ALL);
> +}
<scratches head>
What the heck is that min_t() doing in there?
> +static inline void ra_set_pattern(struct file_ra_state *ra,
> + unsigned int pattern)
> +{
> + ra->ra_flags = (ra->ra_flags & ~READAHEAD_PATTERN) |
> + (pattern << READAHEAD_PATTERN_SHIFT);
> +}
>
> /*
> * Don't do ra_flags++ directly to avoid possible overflow:
>
> ...
>
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-21 9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
2011-11-21 14:17 ` Andi Kleen
@ 2011-11-21 23:29 ` Andrew Morton
2011-11-21 23:32 ` Andi Kleen
2011-11-29 3:23 ` Wu Fengguang
1 sibling, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:29 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
On Mon, 21 Nov 2011 17:18:24 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
> and will remain inactive unless enabled explicitly with either boot option
>
> readahead_stats=1
>
> or through the debugfs interface
>
> echo 1 > /debug/readahead/stats_enable
It's unfortunate that these two things have different names.
I'd have thought that the debugfs knob was sufficient - no need for the
boot option.
> The added overhead is two readahead_stats() calls per readahead,
> which is a trivial cost unless there are concurrent random reads on
> super fast SSDs, which may lead to cache bouncing when updating the
> global ra_stats[][]. Considering that normal users won't need this
> except when debugging performance problems, it's disabled by default.
> So it looks reasonable to keep this debug code simple rather than trying
> to improve its scalability.
I may be wrong, but I don't think the CPU cost of this code matters a
lot. People will rarely turn it on and disk IO is a lot slower than
CPU actions and it's waaaaaaay more important to get high-quality info
about readahead than it is to squeeze out a few CPU cycles.
>
> ...
>
> @@ -51,6 +62,182 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
>
> #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
>
> +#ifdef CONFIG_READAHEAD_STATS
> +#include <linux/seq_file.h>
> +#include <linux/debugfs.h>
> +
> +static u32 readahead_stats_enable __read_mostly;
> +
> +static int __init config_readahead_stats(char *str)
> +{
> + int enable = 1;
> + get_option(&str, &enable);
> + readahead_stats_enable = enable;
> + return 0;
> +}
> +early_param("readahead_stats", config_readahead_stats);
Why use early_param() rather than plain old __setup()?
> +enum ra_account {
> + /* number of readaheads */
> + RA_ACCOUNT_COUNT, /* readahead request */
> + RA_ACCOUNT_EOF, /* readahead request covers EOF */
> + RA_ACCOUNT_CHIT, /* readahead request covers some cached pages */
I don't like chit :) "cache_hit" would be better. Or just "hit".
> + RA_ACCOUNT_IOCOUNT, /* readahead IO */
> + RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
> + RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
> + /* number of readahead pages */
> + RA_ACCOUNT_SIZE, /* readahead size */
> + RA_ACCOUNT_ASIZE, /* readahead async size */
> + RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
> + /* end mark */
> + RA_ACCOUNT_MAX,
> +};
> +
>
> ...
>
> +static void readahead_event(struct address_space *mapping,
> + pgoff_t offset,
> + unsigned long req_size,
> + unsigned int ra_flags,
> + pgoff_t start,
> + unsigned int size,
> + unsigned int async_size,
> + unsigned int actual)
> +{
> +#ifdef CONFIG_READAHEAD_STATS
> + if (readahead_stats_enable) {
> + readahead_stats(mapping, offset, req_size, ra_flags,
> + start, size, async_size, actual);
> + readahead_stats(mapping, offset, req_size,
> + RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> + start, size, async_size, actual);
> + }
> +#endif
> +}
The stub should be inlined, methinks. The overhead of evaluating and
preparing eight arguments is significant. I don't think the compiler
is yet smart enough to save us.
>
> ...
>
> --- linux-next.orig/Documentation/kernel-parameters.txt 2011-11-21 17:08:38.000000000 +0800
> +++ linux-next/Documentation/kernel-parameters.txt 2011-11-21 17:08:51.000000000 +0800
> @@ -2251,6 +2251,12 @@ bytes respectively. Such letter suffixes
> This default max readahead size may be overridden
> in some cases, notably NFS, btrfs and software RAID.
>
> + readahead_stats[=0|1]
> + Enable/disable readahead stats accounting.
> +
> + It's also possible to enable/disable it after boot:
> + echo 1 > /sys/kernel/debug/readahead/stats_enable
Can the current setting be read back?
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-21 23:29 ` Andrew Morton
@ 2011-11-21 23:32 ` Andi Kleen
2011-11-29 3:23 ` Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2011-11-21 23:32 UTC (permalink / raw)
To: Andrew Morton
Cc: Wu Fengguang, Linux Memory Management List, linux-fsdevel,
Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML,
Andi Kleen
> I may be wrong, but I don't think the CPU cost of this code matters a
> lot. People will rarely turn it on and disk IO is a lot slower than
> CPU actions and it's waaaaaaay more important to get high-quality info
> about readahead than it is to squeeze out a few CPU cycles.
In its current form it would cache-line bounce, which tends to be
extremely slow. But the solution is probably to make it per CPU.
-Andi
* Re: [PATCH 7/8] readahead: basic support for backwards prefetching
2011-11-21 9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
@ 2011-11-21 23:33 ` Andrew Morton
2011-11-29 3:08 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:33 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Li Shaohua, LKML
On Mon, 21 Nov 2011 17:18:26 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Add the backwards prefetching feature. It's pretty simple if we don't
> support async prefetching and interleaved reads.
Well OK, but I wonder how many applications out there read files in
reverse order. Is it common enough to bother special-casing in the
kernel like this?
* Re: [PATCH 8/8] readahead: dont do start-of-file readahead after lseek()
2011-11-21 9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
@ 2011-11-21 23:36 ` Andrew Morton
2011-11-22 14:18 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-21 23:36 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Rik van Riel,
Linus Torvalds, LKML, Andi Kleen
On Mon, 21 Nov 2011 17:18:27 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Some applications (eg. blkid, id3tool etc.) seek around the file
> to get information. For example, blkid does
>
> seek to 0
> read 1024
> seek to 1536
> read 16384
>
> The start-of-file readahead heuristic is wrong for them, whose
> access pattern can be identified by lseek() calls.
ah, there we are.
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/read_write.c | 4 ++++
> include/linux/fs.h | 1 +
> mm/readahead.c | 3 +++
> 3 files changed, 8 insertions(+)
>
> --- linux-next.orig/mm/readahead.c 2011-11-20 22:02:01.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-20 22:02:03.000000000 +0800
> @@ -629,6 +629,8 @@ ondemand_readahead(struct address_space
> * start of file
> */
> if (!offset) {
> + if ((ra->ra_flags & READAHEAD_LSEEK) && req_size < max)
> + goto random_read;
> ra_set_pattern(ra, RA_PATTERN_INITIAL);
> goto initial_readahead;
> }
> @@ -707,6 +709,7 @@ ondemand_readahead(struct address_space
> if (try_context_readahead(mapping, ra, offset, req_size, max))
> goto readit;
>
> +random_read:
> /*
> * standalone, small random read
> */
> --- linux-next.orig/fs/read_write.c 2011-11-20 22:02:01.000000000 +0800
> +++ linux-next/fs/read_write.c 2011-11-20 22:02:03.000000000 +0800
> @@ -47,6 +47,10 @@ static loff_t lseek_execute(struct file
> file->f_pos = offset;
> file->f_version = 0;
> }
> +
> + if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
> + file->f_ra.ra_flags |= READAHEAD_LSEEK;
> +
> return offset;
> }
Confused. How does READAHEAD_LSEEK get cleared again?
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-21 14:17 ` Andi Kleen
@ 2011-11-22 14:14 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-22 14:14 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Linux Memory Management List, linux-fsdevel,
Ingo Molnar, Jens Axboe, Peter Zijlstra, Rik van Riel, LKML
Andi,
On Mon, Nov 21, 2011 at 10:17:59PM +0800, Andi Kleen wrote:
> > +static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
>
> Why not make it per cpu? That should get the overhead down, probably
> even enough that it can be enabled by default.
>
> BTW I have an older framework to make it really easy to add per
> cpu stats counters to debugfs. Will repost, that would simplify
> it even more.
That's definitely a good facility to have. I would be happy to become
its first user :)
Thanks,
Fengguang
* Re: [PATCH 8/8] readahead: dont do start-of-file readahead after lseek()
2011-11-21 23:36 ` Andrew Morton
@ 2011-11-22 14:18 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-22 14:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Rik van Riel,
Linus Torvalds, LKML, Andi Kleen
> > --- linux-next.orig/fs/read_write.c 2011-11-20 22:02:01.000000000 +0800
> > +++ linux-next/fs/read_write.c 2011-11-20 22:02:03.000000000 +0800
> > @@ -47,6 +47,10 @@ static loff_t lseek_execute(struct file
> > file->f_pos = offset;
> > file->f_version = 0;
> > }
> > +
> > + if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
> > + file->f_ra.ra_flags |= READAHEAD_LSEEK;
> > +
> > return offset;
> > }
>
> Confused. How does READAHEAD_LSEEK get cleared again?
I thought it wasn't necessary (at least for this case). But yeah, it's
good to clear it to keep the behavior reasonable and avoid surprises.
And it would be simple to do, in ra_submit():
- ra->ra_flags &= ~READAHEAD_MMAP;
+ ra->ra_flags &= ~(READAHEAD_MMAP | READAHEAD_LSEEK);
Thanks,
Fengguang
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 22:52 ` Andrew Morton
@ 2011-11-22 14:23 ` Jeff Moyer
2011-11-23 12:18 ` Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Jeff Moyer @ 2011-11-22 14:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Wu Fengguang, Linux Memory Management List, linux-fsdevel,
Li Shaohua, Clemens Ladisch, Jens Axboe, Rik van Riel, LKML,
Andi Kleen
Andrew Morton <akpm@linux-foundation.org> writes:
> In a better world, userspace would run a
> work-out-what-readahead-size-to-use script each time a distro is
> installed and when new storage devices are added/detected. Userspace
> would then remember that readahead size for subsequent bootups.
I'd be interested to hear what factors you think should be taken into
account by such a script. I agree that there are certain things, like
timing of reads of different sizes, or heuristics based on the size of
installed memory, which could contribute to the default readahead size.
However, other things, like memory pressure while running the desired
workload, can't really be measured by an installer or one-time script.
> In the real world, we shovel guaranteed-to-be-wrong guesswork into the
> kernel and everyone just uses the results. Sigh.
I'm not sure a userspace tool is the panacea you paint. However, if you
can provide some guidance on what you think could make things better,
I'm happy to give it a go.
Cheers,
Jeff
* Re: [PATCH 1/8] block: limit default readahead size for small devices
2011-11-21 22:52 ` Andrew Morton
2011-11-22 14:23 ` Jeff Moyer
@ 2011-11-23 12:18 ` Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-23 12:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Li Shaohua,
Clemens Ladisch, Jens Axboe, Rik van Riel, LKML, Andi Kleen
On Mon, Nov 21, 2011 at 02:52:47PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:20 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> > on which blkid runs unpleasantly slow. He manages to optimize the blkid
> > reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
> >
> > lseek 0, read 1024 => readahead 4 pages (start of file)
>
> I'm disturbed that the code did a 4 page (16kbyte?) readahead after an
> lseek. Given the high probability that the next read will occur after
> a second lseek, that's a mistake.
Agreed.
> Was an lseek to offset 0 special-cased?
Yup, as you see in the last patch :)
> > lseek 1536, read 16384 => readahead 8 pages (page contiguous)
> >
> > The readahead heuristics involved here are reasonable ones in general.
> > So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
> >
> > For the kernel part, Linus suggests:
> > So maybe we could be less aggressive about read-ahead when the size of
> > the device is small? Turning a 16kB read into a 64kB one is a big deal,
> > when it's about 15% of the whole device!
> >
> > This looks reasonable: smaller devices tend to be slower (USB sticks as
> > well as micro/mobile/old hard disks).
>
> Spose so. Obviously there are other characteristics which should be
> considered when choosing a readaahead size, but one of them can be disk
> size and that's what this change does.
The other characteristics include memory size, disk throughput,
latency, and (unfortunately) above all the real workload and its
sensitivity to IO latencies.
Disk throughput and latency would be better hints than disk size;
however, they are not that easy to measure. Disk size is readily
available at no cost, hence this patch.
> In a better world, userspace would run a
> work-out-what-readahead-size-to-use script each time a distro is
> installed and when new storage devices are added/detected. Userspace
> would then remember that readahead size for subsequent bootups.
>
> In the real world, we shovel guaranteed-to-be-wrong guesswork into the
> kernel and everyone just uses the results. Sigh.
Heh. Disk characteristics should be measurable with some effort.
However I won't even try to guess the workload and user demands...
Practically, I would choose a meaningful default size (512KB or 1MB)
for the majority and reduce it on small memory/slow disk systems (this
patch does the easy work based on disk size).
> > --- linux-next.orig/block/genhd.c 2011-10-31 00:13:51.000000000 +0800
> > +++ linux-next/block/genhd.c 2011-11-18 11:27:08.000000000 +0800
> > @@ -623,6 +623,26 @@ void add_disk(struct gendisk *disk)
> > WARN_ON(retval);
> >
> > disk_add_events(disk);
> > +
> > + /*
> > + * Limit default readahead size for small devices.
> > + * disk size readahead size
> > + * 1M 8k
> > + * 4M 16k
> > + * 16M 32k
> > + * 64M 64k
> > + * 256M 128k
> > + * 1G 256k
> > + * 4G 512k
> > + * 16G 1024k
> > + * 64G 2048k
> > + * 256G 4096k
> > + */
> > + if (get_capacity(disk)) {
> > + unsigned long size = get_capacity(disk) >> 9;
>
> get_capacity() returns sector_t. This expression will overflow with a
> 2T disk. I'm not sure if we successfully support 2T disks on 32-bit
> machines, but changes like this will guarantee that we don't :)
Good catch! I'll change the type to size_t.
> > + size = 1UL << (ilog2(size) / 2);
>
> I think there's a rounddown_pow_of_two() hiding in that expression?
Yeah, added a line of comment for that.
> > + bdi->ra_pages = min(bdi->ra_pages, size);
>
> I don't have a clue why that min() is in there. It needs a comment,
> please.
It's actually explained by the word "Limit" in the comment; I'll change
it to "Scale down" to make it more obvious.
Thanks,
Fengguang
---
Subject: block: limit default readahead size for small devices
Date: Fri Nov 18 11:27:12 CST 2011
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slow. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
--------------------------- (*)
1G 256k
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 128k, this limit only takes
effect for devices whose size is less than 256M.
The formula is determined from the following data, collected by this script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula must not limit the readahead size to a
degree that would hurt any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However it may take years for Linux to raise its
default readahead size to 2MB, so we don't weight it heavily in the
formula.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
==> 64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
An unnamed SD card (Yakui):
4k 195.873 s, 5.5 MB/s
8k 123.425 s, 8.7 MB/s
16k 86.6425 s, 12.4 MB/s
32k 66.7519 s, 16.1 MB/s
==> 64k 58.5262 s, 18.3 MB/s
128k 59.3847 s, 18.1 MB/s
256k 59.3188 s, 18.1 MB/s
512k 59.0218 s, 18.2 MB/s
CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
--- linux-next.orig/block/genhd.c 2011-11-21 20:32:56.000000000 +0800
+++ linux-next/block/genhd.c 2011-11-23 20:11:23.000000000 +0800
@@ -578,6 +578,7 @@ exit:
void add_disk(struct gendisk *disk)
{
struct backing_dev_info *bdi;
+ size_t size;
dev_t devt;
int retval;
@@ -623,6 +624,25 @@ void add_disk(struct gendisk *disk)
WARN_ON(retval);
disk_add_events(disk);
+
+ /*
+ * Scale down default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 255M 64k (the round down effect)
+ * 256M 128k
+ * 1G 256k
+ * 4G 512k
+ * 16G 1024k
+ */
+ size = get_capacity(disk);
+ if (size) {
+ size = 1 << (ilog2(size >> 9) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
+ }
}
EXPORT_SYMBOL(add_disk);
* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-21 23:01 ` Andrew Morton
@ 2011-11-23 12:47 ` Wu Fengguang
2011-11-23 20:31 ` Andrew Morton
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-23 12:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Steven Whitehouse, Rik van Riel, LKML
On Mon, Nov 21, 2011 at 03:01:16PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:22 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Introduce a readahead flags field and embed the existing mmap_miss in it
> > (mainly to save space).
>
> What an ugly patch.
Indeed..
> > It will be possible to lose the flags in race conditions, however the
> > impact should be limited. For the race to happen, there must be two
> > threads sharing the same file descriptor to be in page fault or
> > readahead at the same time.
> >
> > Note that it has always been racy for "page faults" at the same time.
> >
> > And if ever the race happen, we'll lose one mmap_miss++ or mmap_miss--.
> > Which may change some concrete readahead behavior, but won't really
> > impact overall I/O performance.
> >
> > CC: Andi Kleen <andi@firstfloor.org>
> > CC: Steven Whitehouse <swhiteho@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > include/linux/fs.h | 31 ++++++++++++++++++++++++++++++-
> > mm/filemap.c | 9 ++-------
> > 2 files changed, 32 insertions(+), 8 deletions(-)
> >
> > --- linux-next.orig/include/linux/fs.h 2011-11-20 11:30:55.000000000 +0800
> > +++ linux-next/include/linux/fs.h 2011-11-20 11:48:53.000000000 +0800
> > @@ -945,10 +945,39 @@ struct file_ra_state {
> > there are only # of pages ahead */
> >
> > unsigned int ra_pages; /* Maximum readahead window */
> > - unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
> > + unsigned int ra_flags;
>
> And it doesn't actually save any space, unless ra_flags gets used for
> something else in a subsequent patch. And if it does, perhaps ra_flags
Because it's a preparation patch. There will be more fields defined later.
> should be ulong, which is compatible with the bitops.h code.
> Or perhaps we should use a bitfield and let the compiler do the work.
What if we do
u16 mmap_miss;
u16 ra_flags;
That would get rid of this patch. I'd still like to pack the various
flags as well as pattern into one single ra_flags, which makes it
convenient to pass things around (as one single parameter).
> > loff_t prev_pos; /* Cache last read() position */
> > };
> >
> > +/* ra_flags bits */
> > +#define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
> > +
> > +/*
> > + * Don't do ra_flags++ directly to avoid possible overflow:
> > + * the ra fields can be accessed concurrently in a racy way.
> > + */
> > +static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
> > +{
> > + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > + /* the upper bound avoids banging the cache line unnecessarily */
> > + if (miss < READAHEAD_MMAP_MISS) {
> > + miss++;
> > + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > + }
> > + return miss;
> > +}
> > +
> > +static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
> > +{
> > + unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
> > +
> > + if (miss) {
> > + miss--;
> > + ra->ra_flags = miss | (ra->ra_flags & ~READAHEAD_MMAP_MISS);
> > + }
> > +}
>
> It's strange that ra_mmap_miss_inc() returns the new value whereas
> ra_mmap_miss_dec() returns void.
Simply because no one needs to check the return value of ra_mmap_miss_dec()...
But yeah, it's good to make them symmetric.
Thanks,
Fengguang
* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-23 12:47 ` Wu Fengguang
@ 2011-11-23 20:31 ` Andrew Morton
2011-11-29 3:42 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-23 20:31 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Steven Whitehouse, Rik van Riel, LKML
On Wed, 23 Nov 2011 20:47:45 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> > should be ulong, which is compatible with the bitops.h code.
> > Or perhaps we should use a bitfield and let the compiler do the work.
>
> What if we do
>
> u16 mmap_miss;
> u16 ra_flags;
>
> That would get rid of this patch. I'd still like to pack the various
> flags as well as pattern into one single ra_flags, which makes it
> convenient to pass things around (as one single parameter).
I'm not sure that this will improve things much...
Again, how does the code look if you use a bitfield and let the
compiler do the work?
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-21 11:35 ` Wu Fengguang
@ 2011-11-24 22:28 ` Jan Kara
2011-11-25 0:36 ` Dave Chinner
0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-11-24 22:28 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, Andrew Morton, Linux Memory Management List,
linux-fsdevel, Ankit Jain, Dave Chinner, Christian Ehrhardt,
Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen
On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> > On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > > From: Nikanth Karthikesan <knikanth@suse.de>
> > >
> > > Add new kernel parameter "readahead=", which allows user to override
> > > the static VM_MAX_READAHEAD=128kb.
> >
> > Is a boot-time paramter really such a good idea? I would at least
>
> It's most convenient to set at boot time, because the default size
> will be used to initialize all the block devices.
>
> > make it a sysctl so that it's run-time controllable, including
> > beeing able to set it from initscripts.
>
> Once boot up, it's more natural to set the size one by one, for
> example
>
> blockdev --setra 1024 /dev/sda2
> or
> echo 512 > /sys/block/sda/queue/read_ahead_kb
>
> And you still have the chance to modify the global default, but the
> change will only be inherited by newly created devices thereafter:
>
> echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
>
> The above command is very suitable for use in initscripts. However
> there are no natural way to do sysctl as there is no such a global
> value.
Well, you can always have an udev rule to set read_ahead_kb to whatever
you want. In some respect that looks like a nicer solution to me...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
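For reference, a rule of the kind Jan describes might look like this (file
name and values are illustrative, not from this thread; the sysfs attribute
is the same read_ahead_kb used elsewhere in the discussion, and a later
matching rule overrides an earlier one):

```
# /etc/udev/rules.d/60-readahead.rules -- illustrative only
# Generic default for all sd disks, smaller value for USB-attached ones.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="512"
ACTION=="add|change", KERNEL=="sd[a-z]", SUBSYSTEMS=="usb", ATTR{queue/read_ahead_kb}="128"
```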
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-24 22:28 ` Jan Kara
@ 2011-11-25 0:36 ` Dave Chinner
2011-11-28 2:39 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2011-11-25 0:36 UTC (permalink / raw)
To: Jan Kara
Cc: Wu Fengguang, Christoph Hellwig, Andrew Morton,
Linux Memory Management List, linux-fsdevel, Ankit Jain,
Christian Ehrhardt, Rik van Riel, Nikanth Karthikesan, LKML,
Andi Kleen
On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
> On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> > On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> > > On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > > > From: Nikanth Karthikesan <knikanth@suse.de>
> > > >
> > > > Add new kernel parameter "readahead=", which allows user to override
> > > > the static VM_MAX_READAHEAD=128kb.
> > >
> > > Is a boot-time paramter really such a good idea? I would at least
> >
> > It's most convenient to set at boot time, because the default size
> > will be used to initialize all the block devices.
> >
> > > make it a sysctl so that it's run-time controllable, including
> > > beeing able to set it from initscripts.
> >
> > Once boot up, it's more natural to set the size one by one, for
> > example
> >
> > blockdev --setra 1024 /dev/sda2
> > or
> > echo 512 > /sys/block/sda/queue/read_ahead_kb
> >
> > And you still have the chance to modify the global default, but the
> > change will only be inherited by newly created devices thereafter:
> >
> > echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
> >
> > The above command is very suitable for use in initscripts. However
> > there are no natural way to do sysctl as there is no such a global
> > value.
> Well, you can always have an udev rule to set read_ahead_kb to whatever
> you want. In some respect that looks like a nicer solution to me...
And one that has already been in use for exactly this purpose for
years. Indeed, it's far more flexible because you can give different
types of devices different default readahead settings quite easily,
and you can set different defaults for just about any tunable
parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
way.
Hence I don't think we should treat default readahead any
differently from any other configurable storage parameter - we've
already got places to change the per-device defaults to something
sensible at boot/discovery time....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-25 0:36 ` Dave Chinner
@ 2011-11-28 2:39 ` Wu Fengguang
2011-11-30 13:04 ` Christian Ehrhardt
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-28 2:39 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Christoph Hellwig, Andrew Morton,
Linux Memory Management List, linux-fsdevel, Ankit Jain,
Christian Ehrhardt, Rik van Riel, Nikanth Karthikesan, LKML,
Andi Kleen
On Fri, Nov 25, 2011 at 08:36:33AM +0800, Dave Chinner wrote:
> On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
> > On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> > > On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> > > > On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> > > > > From: Nikanth Karthikesan <knikanth@suse.de>
> > > > >
> > > > > Add new kernel parameter "readahead=", which allows user to override
> > > > > the static VM_MAX_READAHEAD=128kb.
> > > >
> > > > Is a boot-time paramter really such a good idea? I would at least
> > >
> > > It's most convenient to set at boot time, because the default size
> > > will be used to initialize all the block devices.
> > >
> > > > make it a sysctl so that it's run-time controllable, including
> > > > beeing able to set it from initscripts.
> > >
> > > Once boot up, it's more natural to set the size one by one, for
> > > example
> > >
> > > blockdev --setra 1024 /dev/sda2
> > > or
> > > echo 512 > /sys/block/sda/queue/read_ahead_kb
> > >
> > > And you still have the chance to modify the global default, but the
> > > change will only be inherited by newly created devices thereafter:
> > >
> > > echo 512 > /sys/devices/virtual/bdi/default/read_ahead_kb
> > >
> > > The above command is very suitable for use in initscripts. However
> > > there are no natural way to do sysctl as there is no such a global
> > > value.
> > Well, you can always have an udev rule to set read_ahead_kb to whatever
> > you want. In some respect that looks like a nicer solution to me...
>
> And one that has already been in use for exactly this purpose for
> years. Indeed, it's far more flexible because you can give different
> types of devices different default readahead settings quite easily,
> and it you can set different defaults for just about any tunable
> parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
> way.
I'm interested in this usage, too. Would you share some of your rules?
> Hence I don't think we should treat default readahead any
> differently from any other configurable storage parameter - we've
> already got places to change the per-device defaults to something
> sensible at boot/discovery time....
OK, I'll drop this patch.
Thanks,
Fengguang
* Re: [PATCH 4/8] readahead: record readahead patterns
2011-11-21 23:19 ` Andrew Morton
@ 2011-11-29 2:40 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 2:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
On Mon, Nov 21, 2011 at 03:19:19PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:23 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Record the readahead pattern in ra_flags and extend the ra_submit()
> > parameters, to be used by the next readahead tracing/stats patches.
> >
> > 7 patterns are defined:
> >
> > pattern readahead for
> > -----------------------------------------------------------
> > RA_PATTERN_INITIAL start-of-file read
> > RA_PATTERN_SUBSEQUENT trivial sequential read
> > RA_PATTERN_CONTEXT interleaved sequential read
> > RA_PATTERN_OVERSIZE oversize read
> > RA_PATTERN_MMAP_AROUND mmap fault
> > RA_PATTERN_FADVISE posix_fadvise()
> > RA_PATTERN_RANDOM random read
>
> It would be useful to spell out in full detail what an "interleaved
> sequential read" is, and why a read is considered "oversized", etc.
> The 'enum readahead_pattern' definition site would be a good place for
> this.
Good point, here are the added comments:
/*
* Which policy makes decision to do the current read-ahead IO?
*
* RA_PATTERN_INITIAL readahead window is initially opened,
* normally when reading from start of file
* RA_PATTERN_SUBSEQUENT readahead window is pushed forward
* RA_PATTERN_CONTEXT no readahead window available, querying the
* page cache to decide readahead start/size.
* This typically happens on interleaved reads (eg.
* reading pages 0, 1000, 1, 1001, 2, 1002, ...)
* where one file_ra_state struct is not enough
* for recording 2+ interleaved sequential read
* streams.
* RA_PATTERN_MMAP_AROUND read-around on mmap page faults
* (w/o any sequential/random hints)
* RA_PATTERN_BACKWARDS reverse reading detected
* RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
* RA_PATTERN_OVERSIZE a random read larger than max readahead size,
* do max readahead to break down the read size
* RA_PATTERN_RANDOM a small random read
*/
> > Note that random reads will be recorded in file_ra_state now.
> > This won't deteriorate cache bouncing because the ra->prev_pos update
> > in do_generic_file_read() already pollutes the data cache, and
> > filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
> >
> > --- linux-next.orig/include/linux/fs.h 2011-11-20 20:10:48.000000000 +0800
> > +++ linux-next/include/linux/fs.h 2011-11-20 20:18:29.000000000 +0800
> > @@ -951,6 +951,39 @@ struct file_ra_state {
> >
> > /* ra_flags bits */
> > #define READAHEAD_MMAP_MISS 0x000003ff /* cache misses for mmap access */
> > +#define READAHEAD_MMAP 0x00010000
>
> Why leave a gap?
Never mind, it's now converted to a bit field :)
> And what is READAHEAD_MMAP anyway?
READAHEAD_MMAP will be set for mmap page faults.
> > +#define READAHEAD_PATTERN_SHIFT 28
>
> Why 28?
Bits 28-31 are for READAHEAD_PATTERN.
Anyway it will be gone when breaking down the ra_flags fields into
individual variables.
> > +#define READAHEAD_PATTERN 0xf0000000
> > +
> > +/*
> > + * Which policy makes decision to do the current read-ahead IO?
> > + */
> > +enum readahead_pattern {
> > + RA_PATTERN_INITIAL,
> > + RA_PATTERN_SUBSEQUENT,
> > + RA_PATTERN_CONTEXT,
> > + RA_PATTERN_MMAP_AROUND,
> > + RA_PATTERN_FADVISE,
> > + RA_PATTERN_OVERSIZE,
> > + RA_PATTERN_RANDOM,
> > + RA_PATTERN_ALL, /* for summary stats */
> > + RA_PATTERN_MAX
> > +};
>
> Again, the behaviour is all undocumented. I see from the code that
> multiple flags can be set at the same time. So afacit a file can be
> marked RANDOM and SUBSEQUENT at the same time, which seems oxymoronic.
Nope, it will be classified into one "pattern" exclusively.
> This reader wants to know what the implications of this are - how the
> code chooses, prioritises and acts. But this code doesn't tell me.
Hope the comment addresses this issue. The precise logic happens
mainly inside ondemand_readahead().
> > +static inline unsigned int ra_pattern(unsigned int ra_flags)
> > +{
> > + unsigned int pattern = ra_flags >> READAHEAD_PATTERN_SHIFT;
>
> OK, no masking is needed because the code silently assumes that arg
> `ra_flags' came out of an ra_state.ra_flags and it also silently
> assumes that no higher bits are used in ra_state.ra_flags.
>
> That's a bit of a handgrenade - if someone redoes the flags
> enumeration, the code will explode.
Yeah, sorry for playing with such tricks. Will get rid of this function
entirely and use a plain assignment to ra->pattern.
> > + return min_t(unsigned int, pattern, RA_PATTERN_ALL);
> > +}
>
> <scratches head>
>
> What the heck is that min_t() doing in there?
Just for safety... not really necessary given correct code.
Thanks,
Fengguang
* Re: [PATCH 7/8] readahead: basic support for backwards prefetching
2011-11-21 23:33 ` Andrew Morton
@ 2011-11-29 3:08 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 3:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Li Shaohua, LKML
On Mon, Nov 21, 2011 at 03:33:09PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:26 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Add the backwards prefetching feature. It's pretty simple if we don't
> > support async prefetching and interleaved reads.
>
> Well OK, but I wonder how many applications out there read files in
> reverse order. Is it common enough to bother special-casing in the
> kernel like this?
Maybe not so many applications, but there are surely some real cases out
there. I remember an IBM paper on databases (from many years ago, so I
cannot recall the exact title) that shows a graph with backwards-reading
curves among the others.
Recently Shaohua even ran into a performance regression caused by glibc
optimizing memcpy to access pages in reverse order (15, 14, 13, ... 0).
Well, this patch may not be the most pertinent fix to that particular
issue. But you can see how such access patterns arise from surprising
areas.
Thanks,
Fengguang
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-21 23:29 ` Andrew Morton
2011-11-21 23:32 ` Andi Kleen
@ 2011-11-29 3:23 ` Wu Fengguang
2011-11-29 4:49 ` Andrew Morton
1 sibling, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 3:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
On Mon, Nov 21, 2011 at 03:29:58PM -0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 17:18:24 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
> > and will remain inactive unless enabled explicitly with either boot option
> >
> > readahead_stats=1
> >
> > or through the debugfs interface
> >
> > echo 1 > /debug/readahead/stats_enable
>
> It's unfortunate that these two things have different names.
Yes unfortunately.
> I'd have thought that the debugfs knob was sufficient - no need for the
> boot option.
The boot option is intended to catch boot-time readaheads.
However it's not that big a deal; I'll drop the boot option.
> > The added overheads are two readahead_stats() calls per readahead.
> > Which is trivial costs unless there are concurrent random reads on
> > super fast SSDs, which may lead to cache bouncing when updating the
> > global ra_stats[][]. Considering that normal users won't need this
> > except when debugging performance problems, it's disabled by default.
> > So it looks reasonable to keep this debug code simple rather than trying
> > to improve its scalability.
>
> I may be wrong, but I don't think the CPU cost of this code matters a
> lot. People will rarely turn it on and disk IO is a lot slower than
> CPU actions and it's waaaaaaay more important to get high-quality info
> about readahead than it is to squeeze out a few CPU cycles.
Agreed in general.
> > @@ -51,6 +62,182 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
> >
> > #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
> >
> > +#ifdef CONFIG_READAHEAD_STATS
> > +#include <linux/seq_file.h>
> > +#include <linux/debugfs.h>
> > +
> > +static u32 readahead_stats_enable __read_mostly;
> > +
> > +static int __init config_readahead_stats(char *str)
> > +{
> > + int enable = 1;
> > + get_option(&str, &enable);
> > + readahead_stats_enable = enable;
> > + return 0;
> > +}
> > +early_param("readahead_stats", config_readahead_stats);
>
> Why use early_param() rather than plain old __setup()?
Heh it's a no-brain copy from other code ;)
Anyway, the readahead_stats boot parameter will be dropped.
> > +enum ra_account {
> > + /* number of readaheads */
> > + RA_ACCOUNT_COUNT, /* readahead request */
> > + RA_ACCOUNT_EOF, /* readahead request covers EOF */
> > + RA_ACCOUNT_CHIT, /* readahead request covers some cached pages */
>
> I don't like chit :) "cache_hit" would be better. Or just "hit".
Yeah it's not good. I renamed it to RA_ACCOUNT_CACHE_HIT.
> > + RA_ACCOUNT_ASIZE, /* readahead async size */
Also renamed that to RA_ACCOUNT_ASYNC_SIZE.
> > + RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
> > + /* end mark */
> > + RA_ACCOUNT_MAX,
> > +};
> > +
> >
> > ...
> >
> > +static void readahead_event(struct address_space *mapping,
> > + pgoff_t offset,
> > + unsigned long req_size,
> > + unsigned int ra_flags,
> > + pgoff_t start,
> > + unsigned int size,
> > + unsigned int async_size,
> > + unsigned int actual)
> > +{
> > +#ifdef CONFIG_READAHEAD_STATS
> > + if (readahead_stats_enable) {
> > + readahead_stats(mapping, offset, req_size, ra_flags,
> > + start, size, async_size, actual);
> > + readahead_stats(mapping, offset, req_size,
> > + RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> > + start, size, async_size, actual);
> > + }
> > +#endif
> > +}
>
> The stub should be inlined, methinks. The overhead of evaluating and
> preparing eight arguments is significant. I don't think the compiler
> is yet smart enough to save us.
The parameter list actually gets even more out of control when using the
bit fields:
+ readahead_event(mapping, offset, req_size,
+ ra->pattern, ra->for_mmap, ra->for_metadata,
+ ra->start + ra->size >= eof,
+ ra->start, ra->size, ra->async_size, actual);
So I end up passing file_ra_state around. The added cost is that I'll
have to dynamically create a file_ra_state for the fadvise case, which
should be acceptable since it's a cold path.
> >
> > ...
> >
> > --- linux-next.orig/Documentation/kernel-parameters.txt 2011-11-21 17:08:38.000000000 +0800
> > +++ linux-next/Documentation/kernel-parameters.txt 2011-11-21 17:08:51.000000000 +0800
> > @@ -2251,6 +2251,12 @@ bytes respectively. Such letter suffixes
> > This default max readahead size may be overrode
> > in some cases, notably NFS, btrfs and software RAID.
> >
> > + readahead_stats[=0|1]
> > + Enable/disable readahead stats accounting.
> > +
> > + It's also possible to enable/disable it after boot:
> > + echo 1 > /sys/kernel/debug/readahead/stats_enable
>
> Can the current setting be read back?
Yes, the current setting can be read back through the same file:
cat /sys/kernel/debug/readahead/stats_enable
Thanks,
Fengguang
* Re: [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags
2011-11-23 20:31 ` Andrew Morton
@ 2011-11-29 3:42 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 3:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Andi Kleen,
Steven Whitehouse, Rik van Riel, LKML
On Wed, Nov 23, 2011 at 12:31:50PM -0800, Andrew Morton wrote:
> On Wed, 23 Nov 2011 20:47:45 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > > should be ulong, which is compatible with the bitops.h code.
> > > Or perhaps we should use a bitfield and let the compiler do the work.
> >
> > What if we do
> >
> > u16 mmap_miss;
> > u16 ra_flags;
> >
> > That would get rid of this patch. I'd still like to pack the various
> > flags as well as pattern into one single ra_flags, which makes it
> > convenient to pass things around (as one single parameter).
>
> I'm not sure that this will improve things much...
>
> Again, how does the code look if you use a bitfield and let the
> compiler do the worK?
It results in much cleaner code, as you may find in the V2 patches :-)
Thanks,
Fengguang
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-29 3:23 ` Wu Fengguang
@ 2011-11-29 4:49 ` Andrew Morton
2011-11-29 6:41 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Andrew Morton @ 2011-11-29 4:49 UTC (permalink / raw)
To: Wu Fengguang
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
On Tue, 29 Nov 2011 11:23:23 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > +{
> > > +#ifdef CONFIG_READAHEAD_STATS
> > > + if (readahead_stats_enable) {
> > > + readahead_stats(mapping, offset, req_size, ra_flags,
> > > + start, size, async_size, actual);
> > > + readahead_stats(mapping, offset, req_size,
> > > + RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> > > + start, size, async_size, actual);
> > > + }
> > > +#endif
> > > +}
> >
> > The stub should be inlined, methinks. The overhead of evaluating and
> > preparing eight arguments is significant. I don't think the compiler
> > is yet smart enough to save us.
>
> The parameter list actually becomes even out of control when doing the
> bit fields:
>
> + readahead_event(mapping, offset, req_size,
> + ra->pattern, ra->for_mmap, ra->for_metadata,
> + ra->start + ra->size >= eof,
> + ra->start, ra->size, ra->async_size, actual);
>
> So I end up passing file_ra_state around. The added cost is, I'll have
> to dynamically create a file_ra_state for the fadvise case, which
> should be acceptable since it's a cold path.
That will reduce the cost of something which would have zero cost by
making this function a static inline when CONFIG_READAHEAD_STATS=n.
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-29 4:49 ` Andrew Morton
@ 2011-11-29 6:41 ` Wu Fengguang
2011-11-29 12:29 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 6:41 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
On Tue, Nov 29, 2011 at 12:49:50PM +0800, Andrew Morton wrote:
> On Tue, 29 Nov 2011 11:23:23 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > > > +{
> > > > +#ifdef CONFIG_READAHEAD_STATS
> > > > + if (readahead_stats_enable) {
> > > > + readahead_stats(mapping, offset, req_size, ra_flags,
> > > > + start, size, async_size, actual);
> > > > + readahead_stats(mapping, offset, req_size,
> > > > + RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
> > > > + start, size, async_size, actual);
> > > > + }
> > > > +#endif
> > > > +}
> > >
> > > The stub should be inlined, methinks. The overhead of evaluating and
> > > preparing eight arguments is significant. I don't think the compiler
> > > is yet smart enough to save us.
> >
> > The parameter list actually gets even more out of control when using
> > bit fields:
> >
> > + readahead_event(mapping, offset, req_size,
> > + ra->pattern, ra->for_mmap, ra->for_metadata,
> > + ra->start + ra->size >= eof,
> > + ra->start, ra->size, ra->async_size, actual);
> >
> > So I end up passing file_ra_state around. The added cost is that I'll
> > have to dynamically create a file_ra_state for the fadvise case, which
> > should be acceptable since it's a cold path.
>
> That will merely reduce the cost of something which would have zero
> cost if this function were made a static inline when
> CONFIG_READAHEAD_STATS=n.
What I've done now is remove the readahead_event() function altogether,
as in the patch below.
Or do you suggest removing fadvise_ra and still passing the many raw
values to readahead_stats()? (That would require restoring the inline
function readahead_event(), because there would be two call sites.)
Thanks,
Fengguang
---
Subject: readahead: add /debug/readahead/stats
Date: Sun Nov 20 11:25:50 CST 2011
The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
and will remain inactive by default.
It can be runtime enabled/disabled through the debugfs interface
echo 1 > /debug/readahead/stats_enable
echo 0 > /debug/readahead/stats_enable
The added overhead is two readahead_stats() calls per readahead, which
is a trivial cost unless there are concurrent random reads on super
fast SSDs, where it may lead to cache bouncing when updating the global
ra_stats[][]. Considering that normal users won't need this except when
debugging performance problems, it's disabled by default. So it looks
reasonable to keep this debug code simple rather than trying to improve
its scalability.
Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)
$ cat /debug/readahead/stats
pattern     readahead    eof_hit  cache_hit         io    sync_io    mmap_io    meta_io       size async_size    io_size
initial           702        511          0        692        692          0          0          2          0          2
subsequent          7          0          1          7          1          1          0         23         22         23
context           160        161          0          2          0          1          0          0          0         16
around            184        184        177        184        184        184          0         58          0         53
backwards           2          0          2          2          2          0          0          4          0          3
fadvise          2593         47          8       2588       2588          0          0          1          0          1
oversize            0          0          0          0          0          0          0          0          0          0
random             45         20          0         44         44          0          0          1          0          1
all              3697        923        188       3519       3511        186          0          4          0          4
The two most important columns are
- io number of readahead IO
- io_size average readahead IO size
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/Kconfig | 15 +++
mm/readahead.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 196 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/readahead.c 2011-11-29 14:14:36.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 14:24:25.000000000 +0800
@@ -18,6 +18,17 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+static const char * const ra_pattern_names[] = {
+ [RA_PATTERN_INITIAL] = "initial",
+ [RA_PATTERN_SUBSEQUENT] = "subsequent",
+ [RA_PATTERN_CONTEXT] = "context",
+ [RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_BACKWARDS] = "backwards",
+ [RA_PATTERN_FADVISE] = "fadvise",
+ [RA_PATTERN_OVERSIZE] = "oversize",
+ [RA_PATTERN_RANDOM] = "random",
+ [RA_PATTERN_ALL] = "all",
+};
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -32,6 +43,167 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+enum ra_account {
+ /* number of readaheads */
+ RA_ACCOUNT_COUNT, /* readahead request */
+ RA_ACCOUNT_EOF, /* readahead request covers EOF */
+ RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */
+ RA_ACCOUNT_IOCOUNT, /* readahead IO */
+ RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
+ RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
+ RA_ACCOUNT_METADATA, /* readahead IO on metadata */
+ /* number of readahead pages */
+ RA_ACCOUNT_SIZE, /* readahead size */
+ RA_ACCOUNT_ASYNC_SIZE, /* readahead async size */
+ RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
+ /* end mark */
+ RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static inline void readahead_stats(struct file_ra_state *ra,
+ struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ pgoff_t eof,
+ int actual)
+{
+ enum readahead_pattern pattern = ra->pattern;
+
+recount:
+ ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+ ra_stats[pattern][RA_ACCOUNT_SIZE] += ra->size;
+ ra_stats[pattern][RA_ACCOUNT_ASYNC_SIZE] += ra->async_size;
+ ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+ if (ra->start + ra->size >= eof)
+ ra_stats[pattern][RA_ACCOUNT_EOF]++;
+ if (actual < ra->size)
+ ra_stats[pattern][RA_ACCOUNT_CACHE_HIT]++;
+
+ if (actual) {
+ ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+ if (ra->start <= offset && offset < ra->start + ra->size)
+ ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+ if (ra->for_mmap)
+ ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+ if (ra->for_metadata)
+ ra_stats[pattern][RA_ACCOUNT_METADATA]++;
+ }
+
+ if (pattern != RA_PATTERN_ALL) {
+ pattern = RA_PATTERN_ALL;
+ goto recount;
+ }
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+ unsigned long i;
+
+ seq_printf(s,
+ "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+ "pattern", "readahead", "eof_hit", "cache_hit",
+ "io", "sync_io", "mmap_io", "meta_io",
+ "size", "async_size", "io_size");
+
+ for (i = 0; i < RA_PATTERN_MAX; i++) {
+ unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+ unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+ /*
+ * avoid division-by-zero
+ */
+ if (count == 0)
+ count = 1;
+ if (iocount == 0)
+ iocount = 1;
+
+ seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
+ "%10lu %10lu %10lu %10lu %10lu\n",
+ ra_pattern_names[i],
+ ra_stats[i][RA_ACCOUNT_COUNT],
+ ra_stats[i][RA_ACCOUNT_EOF],
+ ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+ ra_stats[i][RA_ACCOUNT_IOCOUNT],
+ ra_stats[i][RA_ACCOUNT_SYNC],
+ ra_stats[i][RA_ACCOUNT_MMAP],
+ ra_stats[i][RA_ACCOUNT_METADATA],
+ ra_stats[i][RA_ACCOUNT_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+ }
+
+ return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ memset(ra_stats, 0, sizeof(ra_stats));
+ return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = readahead_stats_open,
+ .write = readahead_stats_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+ struct dentry *root;
+ struct dentry *entry;
+
+ root = debugfs_create_dir("readahead", NULL);
+ if (!root)
+ goto out;
+
+ entry = debugfs_create_file("stats", 0644, root,
+ NULL, &readahead_stats_fops);
+ if (!entry)
+ goto out;
+
+ entry = debugfs_create_bool("stats_enable", 0644, root,
+ &readahead_stats_enable);
+ if (!entry)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+ return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#else
+#define readahead_stats_enable 0
+static inline void readahead_stats(struct file_ra_state *ra,
+ struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ pgoff_t eof,
+ int actual)
+{
+}
+#endif
+
/*
* see if a page needs releasing upon read_cache_pages() failure
* - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -209,6 +381,9 @@ out:
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read)
{
+ struct file_ra_state fadvise_ra = {
+ .pattern = RA_PATTERN_FADVISE,
+ };
int ret = 0;
if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
@@ -222,8 +397,9 @@ int force_page_cache_readahead(struct ad
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
- err = __do_page_cache_readahead(mapping, filp,
- offset, this_chunk, 0);
+ fadvise_ra.start = offset;
+ fadvise_ra.size = this_chunk;
+ err = ra_submit(&fadvise_ra, mapping, filp, offset, nr_to_read);
if (err < 0) {
ret = err;
break;
@@ -267,6 +443,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ if (readahead_stats_enable)
+ readahead_stats(ra, mapping, offset, req_size, eof, actual);
+
ra->for_mmap = 0;
ra->for_metadata = 0;
return actual;
--- linux-next.orig/mm/Kconfig 2011-11-29 14:14:25.000000000 +0800
+++ linux-next/mm/Kconfig 2011-11-29 14:14:37.000000000 +0800
@@ -373,3 +373,18 @@ config CLEANCACHE
in a negligible performance hit.
If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+ bool "Collect page cache readahead stats"
+ depends on DEBUG_FS
+ default y
+ help
+ This provides the readahead events accounting facilities.
+
+ To do readahead accounting for a workload:
+
+ echo 1 > /sys/kernel/debug/readahead/stats_enable
+ echo 0 > /sys/kernel/debug/readahead/stats # reset counters
+ # run the workload
+ cat /sys/kernel/debug/readahead/stats # check counters
+ echo 0 > /sys/kernel/debug/readahead/stats_enable
* Re: [PATCH 5/8] readahead: add /debug/readahead/stats
2011-11-29 6:41 ` Wu Fengguang
@ 2011-11-29 12:29 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 12:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Linux Memory Management List, linux-fsdevel, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel, LKML, Andi Kleen
> int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
> pgoff_t offset, unsigned long nr_to_read)
> {
> + struct file_ra_state fadvise_ra = {
> + .pattern = RA_PATTERN_FADVISE,
> + };
> int ret = 0;
>
> if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
> @@ -222,8 +397,9 @@ int force_page_cache_readahead(struct ad
>
> if (this_chunk > nr_to_read)
> this_chunk = nr_to_read;
> - err = __do_page_cache_readahead(mapping, filp,
> - offset, this_chunk, 0);
> + fadvise_ra.start = offset;
> + fadvise_ra.size = this_chunk;
> + err = ra_submit(&fadvise_ra, mapping, filp, offset, nr_to_read);
> if (err < 0) {
> ret = err;
> break;
It looks like we can safely use filp->f_ra:
@@ -214,6 +386,7 @@ int force_page_cache_readahead(struct ad
if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
return -EINVAL;
+ filp->f_ra.pattern = RA_PATTERN_FADVISE;
nr_to_read = max_sane_readahead(nr_to_read);
while (nr_to_read) {
int err;
@@ -222,8 +395,9 @@ int force_page_cache_readahead(struct ad
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
- err = __do_page_cache_readahead(mapping, filp,
- offset, this_chunk, 0);
+ filp->f_ra.start = offset;
+ filp->f_ra.size = this_chunk;
+ err = ra_submit(&filp->f_ra, mapping, filp, offset, nr_to_read);
if (err < 0) {
ret = err;
break;
But still, it adds one more function call to the fadvise path.
Thanks,
Fengguang
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-28 2:39 ` Wu Fengguang
@ 2011-11-30 13:04 ` Christian Ehrhardt
2011-11-30 13:29 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Christian Ehrhardt @ 2011-11-30 13:04 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dave Chinner, Jan Kara, Christoph Hellwig, Andrew Morton,
Linux Memory Management List, linux-fsdevel, Ankit Jain,
Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen
On 11/28/2011 03:39 AM, Wu Fengguang wrote:
> On Fri, Nov 25, 2011 at 08:36:33AM +0800, Dave Chinner wrote:
>> On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
>>> On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
>>>> On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
>>>>> On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
>>>>>> From: Nikanth Karthikesan<knikanth@suse.de>
>>>>>>
[...]
>>
>> And one that has already been in use for exactly this purpose for
>> years. Indeed, it's far more flexible because you can give different
>> types of devices different default readahead settings quite easily,
>> and it you can set different defaults for just about any tunable
>> parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
>> way.
>
> I'm interested in this usage, too. Would you share some of your rules?
>
FYI - this is an example of the rules SUSE has been delivering in SLES
on s390 for a while now. With small modifications it could be used for
everything Dave mentioned above.
cat /etc/udev/rules.d/60-readahead.rules
#
# Rules to set an increased default max readahead size for s390 disk devices
# This file should be installed in /etc/udev/rules.d
#
SUBSYSTEM!="block", GOTO="ra_end"
ACTION!="add", GOTO="ra_end"
# on device add set initial readahead to 512 (instead of in kernel 128)
KERNEL=="sd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
KERNEL=="dasd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
LABEL="ra_end"
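For reference, the same mechanism extends naturally to the other queue
tunables Dave mentioned (queue depth, max IO sizes). An untested sketch
along the same lines - the attribute names are standard block-queue
sysfs files, but the values are arbitrary examples, not recommendations:

```
# Hypothetical companion rule: tune other queue attributes on device add
SUBSYSTEM!="block", GOTO="tune_end"
ACTION!="add", GOTO="tune_end"
# deeper request queue and larger max IO size for scsi disks
KERNEL=="sd*[!0-9]", ATTR{queue/nr_requests}="256"
KERNEL=="sd*[!0-9]", ATTR{queue/max_sectors_kb}="1024"
LABEL="tune_end"
```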
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-30 13:04 ` Christian Ehrhardt
@ 2011-11-30 13:29 ` Wu Fengguang
2011-11-30 16:09 ` Jan Kara
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 13:29 UTC (permalink / raw)
To: Christian Ehrhardt
Cc: Dave Chinner, Jan Kara, Christoph Hellwig, Andrew Morton,
Linux Memory Management List, linux-fsdevel, Ankit Jain,
Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen
On Wed, Nov 30, 2011 at 09:04:11PM +0800, Christian Ehrhardt wrote:
>
>
> On 11/28/2011 03:39 AM, Wu Fengguang wrote:
> > On Fri, Nov 25, 2011 at 08:36:33AM +0800, Dave Chinner wrote:
> >> On Thu, Nov 24, 2011 at 11:28:22PM +0100, Jan Kara wrote:
> >>> On Mon 21-11-11 19:35:40, Wu Fengguang wrote:
> >>>> On Mon, Nov 21, 2011 at 06:01:37PM +0800, Christoph Hellwig wrote:
> >>>>> On Mon, Nov 21, 2011 at 05:18:21PM +0800, Wu Fengguang wrote:
> >>>>>> From: Nikanth Karthikesan<knikanth@suse.de>
> >>>>>>
> [...]
>
> >>
> >> And one that has already been in use for exactly this purpose for
> >> years. Indeed, it's far more flexible because you can give different
> >> types of devices different default readahead settings quite easily,
> >> and it you can set different defaults for just about any tunable
> >> parameter (e.g. readahead, ctq depth, max IO sizes, etc) in the same
> >> way.
> >
> > I'm interested in this usage, too. Would you share some of your rules?
> >
>
> FYI - this is an example of the rules SUSE has been delivering in SLES
> on s390 for a while now. With small modifications it could be used for
> everything Dave mentioned above.
It's a really good example, thank you!
> cat /etc/udev/rules.d/60-readahead.rules
> #
> # Rules to set an increased default max readahead size for s390 disk devices
> # This file should be installed in /etc/udev/rules.d
> #
> SUBSYSTEM!="block", GOTO="ra_end"
> ACTION!="add", GOTO="ra_end"
> # on device add set initial readahead to 512 (instead of in kernel 128)
> KERNEL=="sd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
> KERNEL=="dasd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
So SLES (@s390 and maybe more) is already shipping with 512kb
readahead size? Good to know this!
Thanks,
Fengguang
>
> LABEL="ra_end"
>
> --
> Grüsse / regards, Christian Ehrhardt
> IBM Linux Technology Center, System z Linux Performance
* Re: [PATCH 2/8] readahead: make default readahead size a kernel parameter
2011-11-30 13:29 ` Wu Fengguang
@ 2011-11-30 16:09 ` Jan Kara
0 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-30 16:09 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christian Ehrhardt, Dave Chinner, Jan Kara, Christoph Hellwig,
Andrew Morton, Linux Memory Management List, linux-fsdevel,
Ankit Jain, Rik van Riel, Nikanth Karthikesan, LKML, Andi Kleen
On Wed 30-11-11 21:29:28, Wu Fengguang wrote:
> > cat /etc/udev/rules.d/60-readahead.rules
> > #
> > # Rules to set an increased default max readahead size for s390 disk devices
> > # This file should be installed in /etc/udev/rules.d
> > #
> > SUBSYSTEM!="block", GOTO="ra_end"
> > ACTION!="add", GOTO="ra_end"
> > # on device add set initial readahead to 512 (instead of in kernel 128)
> > KERNEL=="sd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
> > KERNEL=="dasd*[!0-9]", ATTR{queue/read_ahead_kb}="512"
>
> So SLES (@s390 and maybe more) is already shipping with 512kb
> readahead size? Good to know this!
SLES (and openSUSE) has been shipping with 512kb readahead on
everything since about 2.6.16 times... With some types of storage it
makes a significant difference.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
Thread overview: 47+ messages
2011-11-21 9:18 [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Wu Fengguang
2011-11-21 9:18 ` [PATCH 1/8] block: limit default readahead size for small devices Wu Fengguang
2011-11-21 10:00 ` Christoph Hellwig
2011-11-21 11:24 ` Wu Fengguang
2011-11-21 12:47 ` Andi Kleen
2011-11-21 14:46 ` Jeff Moyer
2011-11-21 22:52 ` Andrew Morton
2011-11-22 14:23 ` Jeff Moyer
2011-11-23 12:18 ` Wu Fengguang
2011-11-21 9:18 ` [PATCH 2/8] readahead: make default readahead size a kernel parameter Wu Fengguang
2011-11-21 10:01 ` Christoph Hellwig
2011-11-21 11:35 ` Wu Fengguang
2011-11-24 22:28 ` Jan Kara
2011-11-25 0:36 ` Dave Chinner
2011-11-28 2:39 ` Wu Fengguang
2011-11-30 13:04 ` Christian Ehrhardt
2011-11-30 13:29 ` Wu Fengguang
2011-11-30 16:09 ` Jan Kara
2011-11-21 9:18 ` [PATCH 3/8] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
2011-11-21 11:04 ` Steven Whitehouse
2011-11-21 11:42 ` Wu Fengguang
2011-11-21 23:01 ` Andrew Morton
2011-11-23 12:47 ` Wu Fengguang
2011-11-23 20:31 ` Andrew Morton
2011-11-29 3:42 ` Wu Fengguang
2011-11-21 9:18 ` [PATCH 4/8] readahead: record readahead patterns Wu Fengguang
2011-11-21 23:19 ` Andrew Morton
2011-11-29 2:40 ` Wu Fengguang
2011-11-21 9:18 ` [PATCH 5/8] readahead: add /debug/readahead/stats Wu Fengguang
2011-11-21 14:17 ` Andi Kleen
2011-11-22 14:14 ` Wu Fengguang
2011-11-21 23:29 ` Andrew Morton
2011-11-21 23:32 ` Andi Kleen
2011-11-29 3:23 ` Wu Fengguang
2011-11-29 4:49 ` Andrew Morton
2011-11-29 6:41 ` Wu Fengguang
2011-11-29 12:29 ` Wu Fengguang
2011-11-21 9:18 ` [PATCH 6/8] readahead: add debug tracing event Wu Fengguang
2011-11-21 14:01 ` Steven Rostedt
2011-11-21 9:18 ` [PATCH 7/8] readahead: basic support for backwards prefetching Wu Fengguang
2011-11-21 23:33 ` Andrew Morton
2011-11-29 3:08 ` Wu Fengguang
2011-11-21 9:18 ` [PATCH 8/8] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
2011-11-21 23:36 ` Andrew Morton
2011-11-22 14:18 ` Wu Fengguang
2011-11-21 9:56 ` [PATCH 0/8] readahead stats/tracing, backwards prefetching and more Christoph Hellwig
2011-11-21 12:00 ` Wu Fengguang