* [PATCH 1/9] block: limit default readahead size for small devices
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 13:09 ` [PATCH 2/9] readahead: snap readahead request to EOF Wu Fengguang
` (7 subsequent siblings)
8 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Li Shaohua, Clemens Ladisch, Jens Axboe,
Rik van Riel, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 6992 bytes --]
Linus reports a _really_ small and slow (505kB, 15kB/s) USB device on
which blkid runs unpleasantly slowly. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel readahead still turns that into 48kB:
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
--------------------------- (*)
1G 256k
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 128k, this limit only takes
effect for devices whose size is less than 256M.
The formula was determined from the following data, collected by this script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
	echo 3 > /proc/sys/vm/drop_caches
	echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
	time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula shall not limit the readahead size to
a degree that would impact any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to increase
its default readahead size to 2MB, so that device is not weighted
heavily in the formula.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
==> 64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
An unnamed SD card (Yakui):
4k 195.873 s, 5.5 MB/s
8k 123.425 s, 8.7 MB/s
16k 86.6425 s, 12.4 MB/s
32k 66.7519 s, 16.1 MB/s
==> 64k 58.5262 s, 18.3 MB/s
128k 59.3847 s, 18.1 MB/s
256k 59.3188 s, 18.1 MB/s
512k 59.0218 s, 18.2 MB/s
CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
--- linux-next.orig/block/genhd.c 2011-11-29 11:47:10.000000000 +0800
+++ linux-next/block/genhd.c 2011-11-29 11:47:30.000000000 +0800
@@ -577,6 +577,7 @@ exit:
void add_disk(struct gendisk *disk)
{
struct backing_dev_info *bdi;
+ size_t size;
dev_t devt;
int retval;
@@ -622,6 +623,25 @@ void add_disk(struct gendisk *disk)
WARN_ON(retval);
disk_add_events(disk);
+
+ /*
+ * Scale down default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 255M 64k (the round down effect)
+ * 256M 128k
+ * 1G 256k
+ * 4G 512k
+ * 16G 1024k
+ */
+ size = get_capacity(disk);
+ if (size) {
+ size = 1 << (ilog2(size >> 9) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
+ }
}
EXPORT_SYMBOL(add_disk);
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 2/9] readahead: snap readahead request to EOF
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
2011-11-29 13:09 ` [PATCH 1/9] block: limit default readahead size for small devices Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 14:29 ` Jan Kara
2011-11-29 13:09 ` [PATCH 3/9] readahead: record readahead patterns Wu Fengguang
` (6 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-eof --]
[-- Type: text/plain, Size: 1221 bytes --]
If the file size is 20kb and readahead request is [0, 16kb),
it's better to expand the readahead request to [0, 20kb), which will
likely save one followup I/O for [16kb, 20kb).
If the readahead request already covers EOF, trim it down to EOF.
Also don't set the PG_readahead mark to avoid an unnecessary future
invocation of the readahead code.
This special handling looks worthwhile because small- to medium-sized
files are pretty common.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- linux-next.orig/mm/readahead.c 2011-11-29 11:28:56.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 11:29:05.000000000 +0800
@@ -251,8 +251,16 @@ unsigned long max_sane_readahead(unsigne
unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping, struct file *filp)
{
+ pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+ pgoff_t start = ra->start;
int actual;
+ /* snap to EOF */
+ if (start + ra->size + ra->size / 2 > eof) {
+ ra->size = eof - start;
+ ra->async_size = 0;
+ }
+
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
* Re: [PATCH 2/9] readahead: snap readahead request to EOF
2011-11-29 13:09 ` [PATCH 2/9] readahead: snap readahead request to EOF Wu Fengguang
@ 2011-11-29 14:29 ` Jan Kara
2011-11-30 1:06 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-11-29 14:29 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Linux Memory Management List,
linux-fsdevel, LKML
On Tue 29-11-11 21:09:02, Wu Fengguang wrote:
> If the file size is 20kb and readahead request is [0, 16kb),
> it's better to expand the readahead request to [0, 20kb), which will
> likely save one followup I/O for [16kb, 20kb).
>
> If the readahead request already covers EOF, trim it down to EOF.
> Also don't set the PG_readahead mark to avoid an unnecessary future
> invocation of the readahead code.
>
> This special handling looks worthwhile because small to medium sized
> files are pretty common.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> mm/readahead.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- linux-next.orig/mm/readahead.c 2011-11-29 11:28:56.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 11:29:05.000000000 +0800
> @@ -251,8 +251,16 @@ unsigned long max_sane_readahead(unsigne
> unsigned long ra_submit(struct file_ra_state *ra,
> struct address_space *mapping, struct file *filp)
> {
> + pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
> + pgoff_t start = ra->start;
> int actual;
>
> + /* snap to EOF */
> + if (start + ra->size + ra->size / 2 > eof) {
> + ra->size = eof - start;
> + ra->async_size = 0;
> + }
> +
> actual = __do_page_cache_readahead(mapping, filp,
> ra->start, ra->size, ra->async_size);
Hmm, wouldn't it be cleaner to do this already in ondemand_readahead()?
All other updates of readahead window seem to be there. Also shouldn't we
take maximum readahead size into account? Reading 3/2 of max readahead
window seems like a relatively big deal for large files...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/9] readahead: snap readahead request to EOF
2011-11-29 14:29 ` Jan Kara
@ 2011-11-30 1:06 ` Wu Fengguang
2011-11-30 11:37 ` Jan Kara
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 1:06 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Linux Memory Management List,
linux-fsdevel, LKML
> Hmm, wouldn't it be cleaner to do this already in ondemand_readahead()?
> All other updates of readahead window seem to be there.
Yeah, it's not that clean; however, the intention is to also cover the
other call site -- mmap read-around.
> Also shouldn't we
> take maximum readahead size into account? Reading 3/2 of max readahead
> window seems like a relatively big deal for large files...
Good point, the max readahead size is actually a must, in order to
prevent expanding the readahead size forever in the backwards
reading case.
This limits the size expansion to 1/4 of the max readahead. That means,
if the next expected readahead would be smaller than 1/4 of the max
size, it is merged into the current readahead window to avoid one small IO.
The backwards reading is not special cased here because it's not
frequent anyway.
unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping, struct file *filp)
{
+ pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+ pgoff_t start = ra->start;
+ unsigned long size = ra->size;
int actual;
+ /* snap to EOF */
+ size += min(size, ra->ra_pages / 4);
+ if (start + size > eof) {
+ ra->size = eof - start;
+ ra->async_size = 0;
+ }
+
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
Thanks,
Fengguang
* Re: [PATCH 2/9] readahead: snap readahead request to EOF
2011-11-30 1:06 ` Wu Fengguang
@ 2011-11-30 11:37 ` Jan Kara
2011-11-30 12:06 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-11-30 11:37 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Andi Kleen,
Linux Memory Management List, linux-fsdevel, LKML
On Wed 30-11-11 09:06:04, Wu Fengguang wrote:
> > Hmm, wouldn't it be cleaner to do this already in ondemand_readahead()?
> > All other updates of readahead window seem to be there.
>
> Yeah it's not that clean, however the intention is to cover the other
> call site -- mmap read-around, too.
Ah, OK.
> > Also shouldn't we
> > take maximum readahead size into account? Reading 3/2 of max readahead
> > window seems like a relatively big deal for large files...
>
> Good point, the max readahead size is actually a must, in order to
> prevent it expanding the readahead size for ever in the backwards
> reading case.
>
> This limits the size expansion to 1/4 max readahead. That means, if
> the next expected readahead size will be less than 1/4 max size, it
> will be merged into the current readahead window to avoid one small IO.
>
> The backwards reading is not special cased here because it's not
> frequent anyway.
>
> unsigned long ra_submit(struct file_ra_state *ra,
> struct address_space *mapping, struct file *filp)
> {
> + pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
> + pgoff_t start = ra->start;
> + unsigned long size = ra->size;
> int actual;
>
> + /* snap to EOF */
> + size += min(size, ra->ra_pages / 4);
I'd probably choose:
size += min(size / 2, ra->ra_pages / 4);
to increase the current window only to 3/2 and not twice, but I don't have a
strong opinion. Otherwise I think the code is fine now, so you can add:
Acked-by: Jan Kara <jack@suse.cz>
> + if (start + size > eof) {
> + ra->size = eof - start;
> + ra->async_size = 0;
> + }
> +
> actual = __do_page_cache_readahead(mapping, filp,
> ra->start, ra->size, ra->async_size);
Honza
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/9] readahead: snap readahead request to EOF
2011-11-30 11:37 ` Jan Kara
@ 2011-11-30 12:06 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 12:06 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Linux Memory Management List,
linux-fsdevel, LKML
> > + /* snap to EOF */
> > + size += min(size, ra->ra_pages / 4);
> I'd probably choose:
> size += min(size / 2, ra->ra_pages / 4);
> to increase current window only to 3/2 and not twice but I don't have a
OK it looks good on large ra_pages. I'll use this form.
> strong opinion. Otherwise I think the code is fine now so you can add:
> Acked-by: Jan Kara <jack@suse.cz>
Thanks!
Fengguang
* [PATCH 3/9] readahead: record readahead patterns
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
2011-11-29 13:09 ` [PATCH 1/9] block: limit default readahead size for small devices Wu Fengguang
2011-11-29 13:09 ` [PATCH 2/9] readahead: snap readahead request to EOF Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 14:40 ` Jan Kara
2011-11-29 17:57 ` Andi Kleen
2011-11-29 13:09 ` [PATCH 4/9] readahead: tag mmap page fault call sites Wu Fengguang
` (5 subsequent siblings)
8 siblings, 2 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
Rik van Riel, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-tracepoints.patch --]
[-- Type: text/plain, Size: 6922 bytes --]
Record the readahead pattern in ra->pattern and extend the ra_submit()
parameters, to be used by the next readahead tracing/stats patches.
7 patterns are defined:
pattern readahead for
-----------------------------------------------------------
RA_PATTERN_INITIAL start-of-file read
RA_PATTERN_SUBSEQUENT trivial sequential read
RA_PATTERN_CONTEXT interleaved sequential read
RA_PATTERN_OVERSIZE oversize read
RA_PATTERN_MMAP_AROUND mmap fault
RA_PATTERN_FADVISE posix_fadvise()
RA_PATTERN_RANDOM random read
Note that random reads will now be recorded in file_ra_state.
This won't worsen cache bouncing, because the ra->prev_pos update
in do_generic_file_read() already pollutes the data cache, and
filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 36 +++++++++++++++++++++++++++++++++++-
include/linux/mm.h | 4 +++-
mm/filemap.c | 3 ++-
mm/readahead.c | 29 ++++++++++++++++++++++-------
4 files changed, 62 insertions(+), 10 deletions(-)
--- linux-next.orig/include/linux/fs.h 2011-11-28 21:21:05.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-29 10:23:41.000000000 +0800
@@ -945,11 +945,45 @@ struct file_ra_state {
there are only # of pages ahead */
unsigned int ra_pages; /* Maximum readahead window */
- unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
+ u16 mmap_miss; /* Cache miss stat for mmap accesses */
+ u8 pattern; /* one of RA_PATTERN_* */
+
loff_t prev_pos; /* Cache last read() position */
};
/*
+ * Which policy makes decision to do the current read-ahead IO?
+ *
+ * RA_PATTERN_INITIAL readahead window is initially opened,
+ * normally when reading from start of file
+ * RA_PATTERN_SUBSEQUENT readahead window is pushed forward
+ * RA_PATTERN_CONTEXT no readahead window available, querying the
+ * page cache to decide readahead start/size.
+ * This typically happens on interleaved reads (eg.
+ * reading pages 0, 1000, 1, 1001, 2, 1002, ...)
+ * where one file_ra_state struct is not enough
+ * for recording 2+ interleaved sequential read
+ * streams.
+ * RA_PATTERN_MMAP_AROUND read-around on mmap page faults
+ * (w/o any sequential/random hints)
+ * RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
+ * RA_PATTERN_OVERSIZE a random read larger than max readahead size,
+ * do max readahead to break down the read size
+ * RA_PATTERN_RANDOM a small random read
+ */
+enum readahead_pattern {
+ RA_PATTERN_INITIAL,
+ RA_PATTERN_SUBSEQUENT,
+ RA_PATTERN_CONTEXT,
+ RA_PATTERN_MMAP_AROUND,
+ RA_PATTERN_FADVISE,
+ RA_PATTERN_OVERSIZE,
+ RA_PATTERN_RANDOM,
+ RA_PATTERN_ALL, /* for summary stats */
+ RA_PATTERN_MAX
+};
+
+/*
* Check if @index falls in the readahead windows.
*/
static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
--- linux-next.orig/mm/readahead.c 2011-11-28 22:24:16.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 10:17:14.000000000 +0800
@@ -249,7 +249,10 @@ unsigned long max_sane_readahead(unsigne
* Submit IO for the read-ahead request in file_ra_state.
*/
unsigned long ra_submit(struct file_ra_state *ra,
- struct address_space *mapping, struct file *filp)
+ struct address_space *mapping,
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long req_size)
{
pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
pgoff_t start = ra->start;
@@ -390,6 +393,7 @@ static int try_context_readahead(struct
if (size >= offset)
size *= 2;
+ ra->pattern = RA_PATTERN_CONTEXT;
ra->start = offset;
ra->size = get_init_ra_size(size + req_size, max);
ra->async_size = ra->size;
@@ -411,8 +415,10 @@ ondemand_readahead(struct address_space
/*
* start of file
*/
- if (!offset)
+ if (!offset) {
+ ra->pattern = RA_PATTERN_INITIAL;
goto initial_readahead;
+ }
/*
* It's the expected callback offset, assume sequential access.
@@ -420,6 +426,7 @@ ondemand_readahead(struct address_space
*/
if ((offset == (ra->start + ra->size - ra->async_size) ||
offset == (ra->start + ra->size))) {
+ ra->pattern = RA_PATTERN_SUBSEQUENT;
ra->start += ra->size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
@@ -442,6 +449,7 @@ ondemand_readahead(struct address_space
if (!start || start - offset > max)
return 0;
+ ra->pattern = RA_PATTERN_CONTEXT;
ra->start = start;
ra->size = start - offset; /* old async_size */
ra->size += req_size;
@@ -453,14 +461,18 @@ ondemand_readahead(struct address_space
/*
* oversize read
*/
- if (req_size > max)
+ if (req_size > max) {
+ ra->pattern = RA_PATTERN_OVERSIZE;
goto initial_readahead;
+ }
/*
* sequential cache miss
*/
- if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
+ if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) {
+ ra->pattern = RA_PATTERN_INITIAL;
goto initial_readahead;
+ }
/*
* Query the page cache and look for the traces(cached history pages)
@@ -471,9 +483,12 @@ ondemand_readahead(struct address_space
/*
* standalone, small random read
- * Read as is, and do not pollute the readahead state.
*/
- return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+ ra->pattern = RA_PATTERN_RANDOM;
+ ra->start = offset;
+ ra->size = req_size;
+ ra->async_size = 0;
+ goto readit;
initial_readahead:
ra->start = offset;
@@ -491,7 +506,7 @@ readit:
ra->size += ra->async_size;
}
- return ra_submit(ra, mapping, filp);
+ return ra_submit(ra, mapping, filp, offset, req_size);
}
/**
--- linux-next.orig/include/linux/mm.h 2011-11-28 21:21:05.000000000 +0800
+++ linux-next/include/linux/mm.h 2011-11-28 22:24:16.000000000 +0800
@@ -1456,7 +1456,9 @@ void page_cache_async_readahead(struct a
unsigned long max_sane_readahead(unsigned long nr);
unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping,
- struct file *filp);
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long req_size);
/* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-next.orig/mm/filemap.c 2011-11-28 21:21:05.000000000 +0800
+++ linux-next/mm/filemap.c 2011-11-29 10:17:14.000000000 +0800
@@ -1611,11 +1611,12 @@ static void do_sync_mmap_readahead(struc
/*
* mmap read-around
*/
+ ra->pattern = RA_PATTERN_MMAP_AROUND;
ra_pages = max_sane_readahead(ra->ra_pages);
ra->start = max_t(long, 0, offset - ra_pages / 2);
ra->size = ra_pages;
ra->async_size = ra_pages / 4;
- ra_submit(ra, mapping, file);
+ ra_submit(ra, mapping, file, offset, 1);
}
/*
* Re: [PATCH 3/9] readahead: record readahead patterns
2011-11-29 13:09 ` [PATCH 3/9] readahead: record readahead patterns Wu Fengguang
@ 2011-11-29 14:40 ` Jan Kara
2011-11-29 17:57 ` Andi Kleen
1 sibling, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-29 14:40 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Tue 29-11-11 21:09:03, Wu Fengguang wrote:
> Record the readahead pattern in ra->pattern and extend the ra_submit()
> parameters, to be used by the next readahead tracing/stats patches.
>
> 7 patterns are defined:
>
> pattern readahead for
> -----------------------------------------------------------
> RA_PATTERN_INITIAL start-of-file read
> RA_PATTERN_SUBSEQUENT trivial sequential read
> RA_PATTERN_CONTEXT interleaved sequential read
> RA_PATTERN_OVERSIZE oversize read
> RA_PATTERN_MMAP_AROUND mmap fault
> RA_PATTERN_FADVISE posix_fadvise()
> RA_PATTERN_RANDOM random read
>
> Note that random reads will be recorded in file_ra_state now.
> This won't deteriorate cache bouncing because the ra->prev_pos update
> in do_generic_file_read() already pollutes the data cache, and
> filemap_fault() will stop calling into us after MMAP_LOTSAMISS.
>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Jens Axboe <axboe@kernel.dk>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The patch looks OK. You can add:
Acked-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 36 +++++++++++++++++++++++++++++++++++-
> include/linux/mm.h | 4 +++-
> mm/filemap.c | 3 ++-
> mm/readahead.c | 29 ++++++++++++++++++++++-------
> 4 files changed, 62 insertions(+), 10 deletions(-)
>
> --- linux-next.orig/include/linux/fs.h 2011-11-28 21:21:05.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-29 10:23:41.000000000 +0800
> @@ -945,11 +945,45 @@ struct file_ra_state {
> there are only # of pages ahead */
>
> unsigned int ra_pages; /* Maximum readahead window */
> - unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
> + u16 mmap_miss; /* Cache miss stat for mmap accesses */
> + u8 pattern; /* one of RA_PATTERN_* */
> +
> loff_t prev_pos; /* Cache last read() position */
> };
>
> /*
> + * Which policy makes decision to do the current read-ahead IO?
> + *
> + * RA_PATTERN_INITIAL readahead window is initially opened,
> + * normally when reading from start of file
> + * RA_PATTERN_SUBSEQUENT readahead window is pushed forward
> + * RA_PATTERN_CONTEXT no readahead window available, querying the
> + * page cache to decide readahead start/size.
> + * This typically happens on interleaved reads (eg.
> + * reading pages 0, 1000, 1, 1001, 2, 1002, ...)
> + * where one file_ra_state struct is not enough
> + * for recording 2+ interleaved sequential read
> + * streams.
> + * RA_PATTERN_MMAP_AROUND read-around on mmap page faults
> + * (w/o any sequential/random hints)
> + * RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
> + * RA_PATTERN_OVERSIZE a random read larger than max readahead size,
> + * do max readahead to break down the read size
> + * RA_PATTERN_RANDOM a small random read
> + */
> +enum readahead_pattern {
> + RA_PATTERN_INITIAL,
> + RA_PATTERN_SUBSEQUENT,
> + RA_PATTERN_CONTEXT,
> + RA_PATTERN_MMAP_AROUND,
> + RA_PATTERN_FADVISE,
> + RA_PATTERN_OVERSIZE,
> + RA_PATTERN_RANDOM,
> + RA_PATTERN_ALL, /* for summary stats */
> + RA_PATTERN_MAX
> +};
> +
> +/*
> * Check if @index falls in the readahead windows.
> */
> static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
> --- linux-next.orig/mm/readahead.c 2011-11-28 22:24:16.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 10:17:14.000000000 +0800
> @@ -249,7 +249,10 @@ unsigned long max_sane_readahead(unsigne
> * Submit IO for the read-ahead request in file_ra_state.
> */
> unsigned long ra_submit(struct file_ra_state *ra,
> - struct address_space *mapping, struct file *filp)
> + struct address_space *mapping,
> + struct file *filp,
> + pgoff_t offset,
> + unsigned long req_size)
> {
> pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
> pgoff_t start = ra->start;
> @@ -390,6 +393,7 @@ static int try_context_readahead(struct
> if (size >= offset)
> size *= 2;
>
> + ra->pattern = RA_PATTERN_CONTEXT;
> ra->start = offset;
> ra->size = get_init_ra_size(size + req_size, max);
> ra->async_size = ra->size;
> @@ -411,8 +415,10 @@ ondemand_readahead(struct address_space
> /*
> * start of file
> */
> - if (!offset)
> + if (!offset) {
> + ra->pattern = RA_PATTERN_INITIAL;
> goto initial_readahead;
> + }
>
> /*
> * It's the expected callback offset, assume sequential access.
> @@ -420,6 +426,7 @@ ondemand_readahead(struct address_space
> */
> if ((offset == (ra->start + ra->size - ra->async_size) ||
> offset == (ra->start + ra->size))) {
> + ra->pattern = RA_PATTERN_SUBSEQUENT;
> ra->start += ra->size;
> ra->size = get_next_ra_size(ra, max);
> ra->async_size = ra->size;
> @@ -442,6 +449,7 @@ ondemand_readahead(struct address_space
> if (!start || start - offset > max)
> return 0;
>
> + ra->pattern = RA_PATTERN_CONTEXT;
> ra->start = start;
> ra->size = start - offset; /* old async_size */
> ra->size += req_size;
> @@ -453,14 +461,18 @@ ondemand_readahead(struct address_space
> /*
> * oversize read
> */
> - if (req_size > max)
> + if (req_size > max) {
> + ra->pattern = RA_PATTERN_OVERSIZE;
> goto initial_readahead;
> + }
>
> /*
> * sequential cache miss
> */
> - if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
> + if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) {
> + ra->pattern = RA_PATTERN_INITIAL;
> goto initial_readahead;
> + }
>
> /*
> * Query the page cache and look for the traces(cached history pages)
> @@ -471,9 +483,12 @@ ondemand_readahead(struct address_space
>
> /*
> * standalone, small random read
> - * Read as is, and do not pollute the readahead state.
> */
> - return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
> + ra->pattern = RA_PATTERN_RANDOM;
> + ra->start = offset;
> + ra->size = req_size;
> + ra->async_size = 0;
> + goto readit;
>
> initial_readahead:
> ra->start = offset;
> @@ -491,7 +506,7 @@ readit:
> ra->size += ra->async_size;
> }
>
> - return ra_submit(ra, mapping, filp);
> + return ra_submit(ra, mapping, filp, offset, req_size);
> }
>
> /**
> --- linux-next.orig/include/linux/mm.h 2011-11-28 21:21:05.000000000 +0800
> +++ linux-next/include/linux/mm.h 2011-11-28 22:24:16.000000000 +0800
> @@ -1456,7 +1456,9 @@ void page_cache_async_readahead(struct a
> unsigned long max_sane_readahead(unsigned long nr);
> unsigned long ra_submit(struct file_ra_state *ra,
> struct address_space *mapping,
> - struct file *filp);
> + struct file *filp,
> + pgoff_t offset,
> + unsigned long req_size);
>
> /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
> extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
> --- linux-next.orig/mm/filemap.c 2011-11-28 21:21:05.000000000 +0800
> +++ linux-next/mm/filemap.c 2011-11-29 10:17:14.000000000 +0800
> @@ -1611,11 +1611,12 @@ static void do_sync_mmap_readahead(struc
> /*
> * mmap read-around
> */
> + ra->pattern = RA_PATTERN_MMAP_AROUND;
> ra_pages = max_sane_readahead(ra->ra_pages);
> ra->start = max_t(long, 0, offset - ra_pages / 2);
> ra->size = ra_pages;
> ra->async_size = ra_pages / 4;
> - ra_submit(ra, mapping, file);
> + ra_submit(ra, mapping, file, offset, 1);
> }
>
> /*
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 3/9] readahead: record readahead patterns
2011-11-29 13:09 ` [PATCH 3/9] readahead: record readahead patterns Wu Fengguang
2011-11-29 14:40 ` Jan Kara
@ 2011-11-29 17:57 ` Andi Kleen
2011-11-30 1:18 ` Wu Fengguang
2011-12-15 8:55 ` [PATCH] proc: show readahead state in fdinfo Wu Fengguang
1 sibling, 2 replies; 47+ messages in thread
From: Andi Kleen @ 2011-11-29 17:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Tue, Nov 29, 2011 at 09:09:03PM +0800, Wu Fengguang wrote:
> Record the readahead pattern in ra->pattern and extend the ra_submit()
> parameters, to be used by the next readahead tracing/stats patches.
I like this. Could it be exported a bit more formally in /proc for
each file descriptor?
I could imagine a monitoring tool that you run on a process that
tells you what pattern state the various file descriptors are in and how
large the window is. That would be similar to the tools for
monitoring network connections, which are extremely useful
in practice.
-Andi
* Re: [PATCH 3/9] readahead: record readahead patterns
2011-11-29 17:57 ` Andi Kleen
@ 2011-11-30 1:18 ` Wu Fengguang
2011-12-15 8:55 ` [PATCH] proc: show readahead state in fdinfo Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 1:18 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Ingo Molnar, Jens Axboe, Peter Zijlstra,
Rik van Riel, Linux Memory Management List, linux-fsdevel, LKML
On Wed, Nov 30, 2011 at 01:57:43AM +0800, Andi Kleen wrote:
> On Tue, Nov 29, 2011 at 09:09:03PM +0800, Wu Fengguang wrote:
> > Record the readahead pattern in ra->pattern and extend the ra_submit()
> > parameters, to be used by the next readahead tracing/stats patches.
>
> I like this; could it be exported a bit more formally in /proc for
> each file descriptor?
Something like this?
$ cat /proc/self/fdinfo/2
pos: 0
flags: 0100002
+ ra_pattern: initial
+ ra_size: 4
This information can change rapidly, but in practice it should remain
stable unless the file's access pattern changes a lot.
> I could imagine a monitoring tool that you run on a process that
> tells you what pattern state the various file descriptors are in and how
> large the window is. That would be similar to the tools for
> monitoring network connections, which are extremely useful
> in practice.
Yeah, the simplest form may be
watch "head /proc/self/fdinfo/*"
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH] proc: show readahead state in fdinfo
2011-11-29 17:57 ` Andi Kleen
2011-11-30 1:18 ` Wu Fengguang
@ 2011-12-15 8:55 ` Wu Fengguang
2011-12-15 9:49 ` Ingo Molnar
1 sibling, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-12-15 8:55 UTC (permalink / raw)
To: Andi Kleen
Cc: Andrew Morton, Ingo Molnar, Jens Axboe, Peter Zijlstra,
Rik van Riel, Linux Memory Management List, linux-fsdevel, LKML
On Wed, Nov 30, 2011 at 01:57:43AM +0800, Andi Kleen wrote:
> On Tue, Nov 29, 2011 at 09:09:03PM +0800, Wu Fengguang wrote:
> > Record the readahead pattern in ra->pattern and extend the ra_submit()
> > parameters, to be used by the next readahead tracing/stats patches.
>
> I like this; could it be exported a bit more formally in /proc for
> each file descriptor?
How about this?
---
Subject: proc: show readahead state in fdinfo
Date: Thu Dec 15 14:35:56 CST 2011
Append three readahead states to /proc/<PID>/fdinfo/<FD>:
# cat /proc/self/fdinfo/0
pos: 0
flags: 0100002
+ ra_pattern: initial
+ ra_start: 0 # pages
+ ra_size: 0 # pages
As proposed by Andi: I could imagine a monitoring tool that you run on a
process that tells you what pattern state the various file descriptors
are in and how large the window is. That would be similar to the tools
for monitoring network connections, which are extremely useful in practice.
CC: Andi Kleen <andi@firstfloor.org>
CC: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/proc/base.c | 14 ++++++++++----
include/linux/fs.h | 1 +
mm/readahead.c | 2 +-
3 files changed, 12 insertions(+), 5 deletions(-)
--- linux-next.orig/fs/proc/base.c 2011-12-15 14:36:04.000000000 +0800
+++ linux-next/fs/proc/base.c 2011-12-15 15:51:35.000000000 +0800
@@ -1885,7 +1885,7 @@ out:
return ~0U;
}
-#define PROC_FDINFO_MAX 64
+#define PROC_FDINFO_MAX 128
static int proc_fd_info(struct inode *inode, struct path *path, char *info)
{
@@ -1920,10 +1920,16 @@ static int proc_fd_info(struct inode *in
}
if (info)
snprintf(info, PROC_FDINFO_MAX,
- "pos:\t%lli\n"
- "flags:\t0%o\n",
+ "pos:\t\t%lli\n"
+ "flags:\t\t0%o\n"
+ "ra_pattern:\t%s\n"
+ "ra_start:\t%lu\n"
+ "ra_size:\t%u\n",
(long long) file->f_pos,
- f_flags);
+ f_flags,
+ ra_pattern_names[file->f_ra.pattern],
+ file->f_ra.start,
+ file->f_ra.size);
spin_unlock(&files->file_lock);
put_files_struct(files);
return 0;
--- linux-next.orig/include/linux/fs.h 2011-12-15 14:36:41.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-12-15 14:36:57.000000000 +0800
@@ -953,6 +953,7 @@ struct file_ra_state {
loff_t prev_pos; /* Cache last read() position */
};
+extern const char * const ra_pattern_names[];
/*
* Which policy makes decision to do the current read-ahead IO?
--- linux-next.orig/mm/readahead.c 2011-12-15 14:36:28.000000000 +0800
+++ linux-next/mm/readahead.c 2011-12-15 14:36:33.000000000 +0800
@@ -19,7 +19,7 @@
#include <linux/pagemap.h>
#include <trace/events/vfs.h>
-static const char * const ra_pattern_names[] = {
+const char * const ra_pattern_names[] = {
[RA_PATTERN_INITIAL] = "initial",
[RA_PATTERN_SUBSEQUENT] = "subsequent",
[RA_PATTERN_CONTEXT] = "context",
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH] proc: show readahead state in fdinfo
2011-12-15 8:55 ` [PATCH] proc: show readahead state in fdinfo Wu Fengguang
@ 2011-12-15 9:49 ` Ingo Molnar
0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2011-12-15 9:49 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andi Kleen, Andrew Morton, Jens Axboe, Peter Zijlstra,
Rik van Riel, Linux Memory Management List, linux-fsdevel, LKML,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Nov 30, 2011 at 01:57:43AM +0800, Andi Kleen wrote:
> > On Tue, Nov 29, 2011 at 09:09:03PM +0800, Wu Fengguang wrote:
> > > Record the readahead pattern in ra->pattern and extend the ra_submit()
> > > parameters, to be used by the next readahead tracing/stats patches.
> >
> > I like this; could it be exported a bit more formally in /proc for
> > each file descriptor?
>
> How about this?
> ---
> Subject: proc: show readahead state in fdinfo
> Date: Thu Dec 15 14:35:56 CST 2011
>
> Append three readahead states to /proc/<PID>/fdinfo/<FD>:
Not a very good idea - please keep debug info under /debug as
much as possible (as your original series did), instead of
creating an ad-hoc insta-ABI in /proc.
In the long run we'd really like to retrieve this kind of
information not even via ad-hoc exported info in /debug but via
the standard event facilities: the tracepoints, if they are
versatile enough, could be used to collect these stats and more.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 4/9] readahead: tag mmap page fault call sites
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
` (2 preceding siblings ...)
2011-11-29 13:09 ` [PATCH 3/9] readahead: record readahead patterns Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 14:41 ` Jan Kara
2011-11-29 13:09 ` [PATCH 5/9] readahead: tag metadata " Wu Fengguang
` (4 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-for-mmap --]
[-- Type: text/plain, Size: 2034 bytes --]
Introduce a bit field ra->for_mmap for tagging mmap reads.
The tag will be cleared immediately after submitting the IO.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 1 +
mm/filemap.c | 6 +++++-
mm/readahead.c | 1 +
3 files changed, 7 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/fs.h 2011-11-29 10:12:19.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-29 10:13:08.000000000 +0800
@@ -947,6 +947,7 @@ struct file_ra_state {
unsigned int ra_pages; /* Maximum readahead window */
u16 mmap_miss; /* Cache miss stat for mmap accesses */
u8 pattern; /* one of RA_PATTERN_* */
+ unsigned int for_mmap:1; /* readahead for mmap accesses */
loff_t prev_pos; /* Cache last read() position */
};
--- linux-next.orig/mm/filemap.c 2011-11-29 09:48:49.000000000 +0800
+++ linux-next/mm/filemap.c 2011-11-29 10:13:08.000000000 +0800
@@ -1592,6 +1592,7 @@ static void do_sync_mmap_readahead(struc
return;
if (VM_SequentialReadHint(vma)) {
+ ra->for_mmap = 1;
page_cache_sync_readahead(mapping, ra, file, offset,
ra->ra_pages);
return;
@@ -1611,6 +1612,7 @@ static void do_sync_mmap_readahead(struc
/*
* mmap read-around
*/
+ ra->for_mmap = 1;
ra->pattern = RA_PATTERN_MMAP_AROUND;
ra_pages = max_sane_readahead(ra->ra_pages);
ra->start = max_t(long, 0, offset - ra_pages / 2);
@@ -1636,9 +1638,11 @@ static void do_async_mmap_readahead(stru
return;
if (ra->mmap_miss > 0)
ra->mmap_miss--;
- if (PageReadahead(page))
+ if (PageReadahead(page)) {
+ ra->for_mmap = 1;
page_cache_async_readahead(mapping, ra, file,
page, offset, ra->ra_pages);
+ }
}
/**
--- linux-next.orig/mm/readahead.c 2011-11-29 09:48:49.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 10:13:08.000000000 +0800
@@ -267,6 +267,7 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ ra->for_mmap = 0;
return actual;
}
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 4/9] readahead: tag mmap page fault call sites
2011-11-29 13:09 ` [PATCH 4/9] readahead: tag mmap page fault call sites Wu Fengguang
@ 2011-11-29 14:41 ` Jan Kara
0 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-29 14:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Linux Memory Management List,
linux-fsdevel, LKML
On Tue 29-11-11 21:09:04, Wu Fengguang wrote:
> Introduce a bit field ra->for_mmap for tagging mmap reads.
> The tag will be cleared immediately after submitting the IO.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Looks OK.
Acked-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/fs.h | 1 +
> mm/filemap.c | 6 +++++-
> mm/readahead.c | 1 +
> 3 files changed, 7 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/include/linux/fs.h 2011-11-29 10:12:19.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-29 10:13:08.000000000 +0800
> @@ -947,6 +947,7 @@ struct file_ra_state {
> unsigned int ra_pages; /* Maximum readahead window */
> u16 mmap_miss; /* Cache miss stat for mmap accesses */
> u8 pattern; /* one of RA_PATTERN_* */
> + unsigned int for_mmap:1; /* readahead for mmap accesses */
>
> loff_t prev_pos; /* Cache last read() position */
> };
> --- linux-next.orig/mm/filemap.c 2011-11-29 09:48:49.000000000 +0800
> +++ linux-next/mm/filemap.c 2011-11-29 10:13:08.000000000 +0800
> @@ -1592,6 +1592,7 @@ static void do_sync_mmap_readahead(struc
> return;
>
> if (VM_SequentialReadHint(vma)) {
> + ra->for_mmap = 1;
> page_cache_sync_readahead(mapping, ra, file, offset,
> ra->ra_pages);
> return;
> @@ -1611,6 +1612,7 @@ static void do_sync_mmap_readahead(struc
> /*
> * mmap read-around
> */
> + ra->for_mmap = 1;
> ra->pattern = RA_PATTERN_MMAP_AROUND;
> ra_pages = max_sane_readahead(ra->ra_pages);
> ra->start = max_t(long, 0, offset - ra_pages / 2);
> @@ -1636,9 +1638,11 @@ static void do_async_mmap_readahead(stru
> return;
> if (ra->mmap_miss > 0)
> ra->mmap_miss--;
> - if (PageReadahead(page))
> + if (PageReadahead(page)) {
> + ra->for_mmap = 1;
> page_cache_async_readahead(mapping, ra, file,
> page, offset, ra->ra_pages);
> + }
> }
>
> /**
> --- linux-next.orig/mm/readahead.c 2011-11-29 09:48:49.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 10:13:08.000000000 +0800
> @@ -267,6 +267,7 @@ unsigned long ra_submit(struct file_ra_s
> actual = __do_page_cache_readahead(mapping, filp,
> ra->start, ra->size, ra->async_size);
>
> + ra->for_mmap = 0;
> return actual;
> }
>
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 5/9] readahead: tag metadata call sites
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
` (3 preceding siblings ...)
2011-11-29 13:09 ` [PATCH 4/9] readahead: tag mmap page fault call sites Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 14:45 ` Jan Kara
2011-11-29 13:09 ` [PATCH 6/9] readahead: add /debug/readahead/stats Wu Fengguang
` (3 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-for-metadata --]
[-- Type: text/plain, Size: 1944 bytes --]
We may be doing more metadata readahead in the future.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/ext3/dir.c | 1 +
fs/ext4/dir.c | 1 +
include/linux/fs.h | 1 +
mm/readahead.c | 1 +
4 files changed, 4 insertions(+)
--- linux-next.orig/fs/ext3/dir.c 2011-11-29 09:48:49.000000000 +0800
+++ linux-next/fs/ext3/dir.c 2011-11-29 10:13:13.000000000 +0800
@@ -136,6 +136,7 @@ static int ext3_readdir(struct file * fi
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
+ filp->f_ra.for_metadata = 1;
page_cache_sync_readahead(
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
--- linux-next.orig/fs/ext4/dir.c 2011-11-29 09:48:49.000000000 +0800
+++ linux-next/fs/ext4/dir.c 2011-11-29 10:13:13.000000000 +0800
@@ -153,6 +153,7 @@ static int ext4_readdir(struct file *fil
pgoff_t index = map.m_pblk >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
+ filp->f_ra.for_metadata = 1;
page_cache_sync_readahead(
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
--- linux-next.orig/include/linux/fs.h 2011-11-29 10:13:08.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-29 10:13:13.000000000 +0800
@@ -948,6 +948,7 @@ struct file_ra_state {
u16 mmap_miss; /* Cache miss stat for mmap accesses */
u8 pattern; /* one of RA_PATTERN_* */
unsigned int for_mmap:1; /* readahead for mmap accesses */
+ unsigned int for_metadata:1; /* readahead for meta data */
loff_t prev_pos; /* Cache last read() position */
};
--- linux-next.orig/mm/readahead.c 2011-11-29 10:13:08.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 10:13:13.000000000 +0800
@@ -268,6 +268,7 @@ unsigned long ra_submit(struct file_ra_s
ra->start, ra->size, ra->async_size);
ra->for_mmap = 0;
+ ra->for_metadata = 0;
return actual;
}
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 5/9] readahead: tag metadata call sites
2011-11-29 13:09 ` [PATCH 5/9] readahead: tag metadata " Wu Fengguang
@ 2011-11-29 14:45 ` Jan Kara
0 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-29 14:45 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Linux Memory Management List,
linux-fsdevel, LKML
On Tue 29-11-11 21:09:05, Wu Fengguang wrote:
> We may be doing more metadata readahead in the future.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Looks OK.
Acked-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/ext3/dir.c | 1 +
> fs/ext4/dir.c | 1 +
> include/linux/fs.h | 1 +
> mm/readahead.c | 1 +
> 4 files changed, 4 insertions(+)
>
> --- linux-next.orig/fs/ext3/dir.c 2011-11-29 09:48:49.000000000 +0800
> +++ linux-next/fs/ext3/dir.c 2011-11-29 10:13:13.000000000 +0800
> @@ -136,6 +136,7 @@ static int ext3_readdir(struct file * fi
> pgoff_t index = map_bh.b_blocknr >>
> (PAGE_CACHE_SHIFT - inode->i_blkbits);
> if (!ra_has_index(&filp->f_ra, index))
> + filp->f_ra.for_metadata = 1;
> page_cache_sync_readahead(
> sb->s_bdev->bd_inode->i_mapping,
> &filp->f_ra, filp,
> --- linux-next.orig/fs/ext4/dir.c 2011-11-29 09:48:49.000000000 +0800
> +++ linux-next/fs/ext4/dir.c 2011-11-29 10:13:13.000000000 +0800
> @@ -153,6 +153,7 @@ static int ext4_readdir(struct file *fil
> pgoff_t index = map.m_pblk >>
> (PAGE_CACHE_SHIFT - inode->i_blkbits);
> if (!ra_has_index(&filp->f_ra, index))
> + filp->f_ra.for_metadata = 1;
> page_cache_sync_readahead(
> sb->s_bdev->bd_inode->i_mapping,
> &filp->f_ra, filp,
> --- linux-next.orig/include/linux/fs.h 2011-11-29 10:13:08.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-29 10:13:13.000000000 +0800
> @@ -948,6 +948,7 @@ struct file_ra_state {
> u16 mmap_miss; /* Cache miss stat for mmap accesses */
> u8 pattern; /* one of RA_PATTERN_* */
> unsigned int for_mmap:1; /* readahead for mmap accesses */
> + unsigned int for_metadata:1; /* readahead for meta data */
>
> loff_t prev_pos; /* Cache last read() position */
> };
> --- linux-next.orig/mm/readahead.c 2011-11-29 10:13:08.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 10:13:13.000000000 +0800
> @@ -268,6 +268,7 @@ unsigned long ra_submit(struct file_ra_s
> ra->start, ra->size, ra->async_size);
>
> ra->for_mmap = 0;
> + ra->for_metadata = 0;
> return actual;
> }
>
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 6/9] readahead: add /debug/readahead/stats
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
` (4 preceding siblings ...)
2011-11-29 13:09 ` [PATCH 5/9] readahead: tag metadata " Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 15:21 ` Jan Kara
2011-11-29 13:09 ` [PATCH 7/9] readahead: add vfs/readahead tracing event Wu Fengguang
` (2 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
Rik van Riel, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 9635 bytes --]
The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
and will remain inactive by default.
It can be runtime enabled/disabled through the debugfs interface
echo 1 > /debug/readahead/stats_enable
echo 0 > /debug/readahead/stats_enable
The added overhead is two readahead_stats() calls per readahead,
which is a trivial cost unless there are concurrent random reads on
super-fast SSDs, where updating the global ra_stats[][] may lead to
cache bouncing. Considering that normal users won't need this except
when debugging performance problems, it's disabled by default.
So it looks reasonable to keep this debug code simple rather than
trying to improve its scalability.
Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)
$ cat /debug/readahead/stats
pattern readahead eof_hit cache_hit io sync_io mmap_io meta_io size async_size io_size
initial 702 511 0 692 692 0 0 2 0 2
subsequent 7 0 1 7 1 1 0 23 22 23
context 160 161 0 2 0 1 0 0 0 16
around 184 184 177 184 184 184 0 58 0 53
backwards 2 0 2 2 2 0 0 4 0 3
fadvise 2593 47 8 2588 2588 0 0 1 0 1
oversize 0 0 0 0 0 0 0 0 0 0
random 45 20 0 44 44 0 0 1 0 1
all 3697 923 188 3519 3511 186 0 4 0 4
The two most important columns are
- io number of readahead IO
- io_size average readahead IO size
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/Kconfig | 15 +++
mm/readahead.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 209 insertions(+)
--- linux-next.orig/mm/readahead.c 2011-11-29 20:48:05.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 20:58:53.000000000 +0800
@@ -18,6 +18,17 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+static const char * const ra_pattern_names[] = {
+ [RA_PATTERN_INITIAL] = "initial",
+ [RA_PATTERN_SUBSEQUENT] = "subsequent",
+ [RA_PATTERN_CONTEXT] = "context",
+ [RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_FADVISE] = "fadvise",
+ [RA_PATTERN_OVERSIZE] = "oversize",
+ [RA_PATTERN_RANDOM] = "random",
+ [RA_PATTERN_ALL] = "all",
+};
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -32,6 +43,181 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+enum ra_account {
+ /* number of readaheads */
+ RA_ACCOUNT_COUNT, /* readahead request */
+ RA_ACCOUNT_EOF, /* readahead request covers EOF */
+ RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */
+ RA_ACCOUNT_IOCOUNT, /* readahead IO */
+ RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
+ RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
+ RA_ACCOUNT_METADATA, /* readahead IO on metadata */
+ /* number of readahead pages */
+ RA_ACCOUNT_SIZE, /* readahead size */
+ RA_ACCOUNT_ASYNC_SIZE, /* readahead async size */
+ RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
+ /* end mark */
+ RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static void readahead_stats(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ bool for_mmap,
+ bool for_metadata,
+ enum readahead_pattern pattern,
+ pgoff_t start,
+ unsigned long size,
+ unsigned long async_size,
+ int actual)
+{
+ pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+recount:
+ ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+ ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
+ ra_stats[pattern][RA_ACCOUNT_ASYNC_SIZE] += async_size;
+ ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+ if (start + size >= eof)
+ ra_stats[pattern][RA_ACCOUNT_EOF]++;
+ if (actual < size)
+ ra_stats[pattern][RA_ACCOUNT_CACHE_HIT]++;
+
+ if (actual) {
+ ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+ if (start <= offset && offset < start + size)
+ ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+ if (for_mmap)
+ ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+ if (for_metadata)
+ ra_stats[pattern][RA_ACCOUNT_METADATA]++;
+ }
+
+ if (pattern != RA_PATTERN_ALL) {
+ pattern = RA_PATTERN_ALL;
+ goto recount;
+ }
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+ unsigned long i;
+
+ seq_printf(s,
+ "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+ "pattern", "readahead", "eof_hit", "cache_hit",
+ "io", "sync_io", "mmap_io", "meta_io",
+ "size", "async_size", "io_size");
+
+ for (i = 0; i < RA_PATTERN_MAX; i++) {
+ unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+ unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+ /*
+ * avoid division-by-zero
+ */
+ if (count == 0)
+ count = 1;
+ if (iocount == 0)
+ iocount = 1;
+
+ seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
+ "%10lu %10lu %10lu %10lu %10lu\n",
+ ra_pattern_names[i],
+ ra_stats[i][RA_ACCOUNT_COUNT],
+ ra_stats[i][RA_ACCOUNT_EOF],
+ ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+ ra_stats[i][RA_ACCOUNT_IOCOUNT],
+ ra_stats[i][RA_ACCOUNT_SYNC],
+ ra_stats[i][RA_ACCOUNT_MMAP],
+ ra_stats[i][RA_ACCOUNT_METADATA],
+ ra_stats[i][RA_ACCOUNT_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+ }
+
+ return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ memset(ra_stats, 0, sizeof(ra_stats));
+ return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = readahead_stats_open,
+ .write = readahead_stats_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+ struct dentry *root;
+ struct dentry *entry;
+
+ root = debugfs_create_dir("readahead", NULL);
+ if (!root)
+ goto out;
+
+ entry = debugfs_create_file("stats", 0644, root,
+ NULL, &readahead_stats_fops);
+ if (!entry)
+ goto out;
+
+ entry = debugfs_create_bool("stats_enable", 0644, root,
+ &readahead_stats_enable);
+ if (!entry)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+ return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
+static inline void readahead_event(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ bool for_mmap,
+ bool for_metadata,
+ enum readahead_pattern pattern,
+ pgoff_t start,
+ unsigned long size,
+ unsigned long async_size,
+ int actual)
+{
+#ifdef CONFIG_READAHEAD_STATS
+ if (readahead_stats_enable)
+ readahead_stats(mapping, offset, req_size,
+ for_mmap, for_metadata,
+ pattern, start, size, async_size, actual);
+#endif
+}
+
+
/*
* see if a page needs releasing upon read_cache_pages() failure
* - the caller of read_cache_pages() may have set PG_private or PG_fscache
@@ -228,6 +414,9 @@ int force_page_cache_readahead(struct ad
ret = err;
break;
}
+ readahead_event(mapping, offset, nr_to_read, 0, 0,
+ RA_PATTERN_FADVISE, offset, this_chunk, 0,
+ err);
ret += err;
offset += this_chunk;
nr_to_read -= this_chunk;
@@ -267,6 +456,11 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ readahead_event(mapping, offset, req_size,
+ ra->for_mmap, ra->for_metadata,
+ ra->pattern, ra->start, ra->size, ra->async_size,
+ actual);
+
ra->for_mmap = 0;
ra->for_metadata = 0;
return actual;
--- linux-next.orig/mm/Kconfig 2011-11-29 20:48:05.000000000 +0800
+++ linux-next/mm/Kconfig 2011-11-29 20:48:05.000000000 +0800
@@ -373,3 +373,18 @@ config CLEANCACHE
in a negligible performance hit.
If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+ bool "Collect page cache readahead stats"
+ depends on DEBUG_FS
+ default y
+ help
+ This provides the readahead events accounting facilities.
+
+ To do readahead accounting for a workload:
+
+ echo 1 > /sys/kernel/debug/readahead/stats_enable
+ echo 0 > /sys/kernel/debug/readahead/stats # reset counters
+ # run the workload
+ cat /sys/kernel/debug/readahead/stats # check counters
+ echo 0 > /sys/kernel/debug/readahead/stats_enable
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-11-29 13:09 ` [PATCH 6/9] readahead: add /debug/readahead/stats Wu Fengguang
@ 2011-11-29 15:21 ` Jan Kara
2011-11-30 0:44 ` Wu Fengguang
2011-12-14 6:36 ` Wu Fengguang
0 siblings, 2 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-29 15:21 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Tue 29-11-11 21:09:06, Wu Fengguang wrote:
> The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
> and will remain inactive by default.
>
> It can be runtime enabled/disabled through the debugfs interface
>
> echo 1 > /debug/readahead/stats_enable
> echo 0 > /debug/readahead/stats_enable
>
> The added overhead is two readahead_stats() calls per readahead,
> which is a trivial cost unless there are concurrent random reads on
> super-fast SSDs, where updating the global ra_stats[][] may lead to
> cache bouncing. Considering that normal users won't need this except
> when debugging performance problems, it's disabled by default.
> So it looks reasonable to keep this debug code simple rather than
> trying to improve its scalability.
>
> Example output:
> (taken from a fresh booted NFS-ROOT console box with rsize=524288)
>
> $ cat /debug/readahead/stats
> pattern readahead eof_hit cache_hit io sync_io mmap_io meta_io size async_size io_size
> initial 702 511 0 692 692 0 0 2 0 2
> subsequent 7 0 1 7 1 1 0 23 22 23
> context 160 161 0 2 0 1 0 0 0 16
> around 184 184 177 184 184 184 0 58 0 53
> backwards 2 0 2 2 2 0 0 4 0 3
> fadvise 2593 47 8 2588 2588 0 0 1 0 1
> oversize 0 0 0 0 0 0 0 0 0 0
> random 45 20 0 44 44 0 0 1 0 1
> all 3697 923 188 3519 3511 186 0 4 0 4
>
> The two most important columns are
> - io number of readahead IO
> - io_size average readahead IO size
>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Jens Axboe <axboe@kernel.dk>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
This looks all inherently racy (which doesn't matter much as you suggest)
so I just wanted to suggest that if you used per-cpu counters you'd get
race-free and faster code at the cost of larger data structures and using
percpu_counter_add() instead of ++ (which doesn't seem like a big
complication to me).
Honza
> ---
> mm/Kconfig | 15 +++
> mm/readahead.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 209 insertions(+)
>
> --- linux-next.orig/mm/readahead.c 2011-11-29 20:48:05.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 20:58:53.000000000 +0800
> @@ -18,6 +18,17 @@
> #include <linux/pagevec.h>
> #include <linux/pagemap.h>
>
> +static const char * const ra_pattern_names[] = {
> + [RA_PATTERN_INITIAL] = "initial",
> + [RA_PATTERN_SUBSEQUENT] = "subsequent",
> + [RA_PATTERN_CONTEXT] = "context",
> + [RA_PATTERN_MMAP_AROUND] = "around",
> + [RA_PATTERN_FADVISE] = "fadvise",
> + [RA_PATTERN_OVERSIZE] = "oversize",
> + [RA_PATTERN_RANDOM] = "random",
> + [RA_PATTERN_ALL] = "all",
> +};
> +
> /*
> * Initialise a struct file's readahead state. Assumes that the caller has
> * memset *ra to zero.
> @@ -32,6 +43,181 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
>
> #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
>
> +#ifdef CONFIG_READAHEAD_STATS
> +#include <linux/seq_file.h>
> +#include <linux/debugfs.h>
> +
> +static u32 readahead_stats_enable __read_mostly;
> +
> +enum ra_account {
> + /* number of readaheads */
> + RA_ACCOUNT_COUNT, /* readahead request */
> + RA_ACCOUNT_EOF, /* readahead request covers EOF */
> + RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */
> + RA_ACCOUNT_IOCOUNT, /* readahead IO */
> + RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
> + RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
> + RA_ACCOUNT_METADATA, /* readahead IO on metadata */
> + /* number of readahead pages */
> + RA_ACCOUNT_SIZE, /* readahead size */
> + RA_ACCOUNT_ASYNC_SIZE, /* readahead async size */
> + RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
> + /* end mark */
> + RA_ACCOUNT_MAX,
> +};
> +
> +static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
> +
> +static void readahead_stats(struct address_space *mapping,
> + pgoff_t offset,
> + unsigned long req_size,
> + bool for_mmap,
> + bool for_metadata,
> + enum readahead_pattern pattern,
> + pgoff_t start,
> + unsigned long size,
> + unsigned long async_size,
> + int actual)
> +{
> + pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
> +
> +recount:
> + ra_stats[pattern][RA_ACCOUNT_COUNT]++;
> + ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
> + ra_stats[pattern][RA_ACCOUNT_ASYNC_SIZE] += async_size;
> + ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
> +
> + if (start + size >= eof)
> + ra_stats[pattern][RA_ACCOUNT_EOF]++;
> + if (actual < size)
> + ra_stats[pattern][RA_ACCOUNT_CACHE_HIT]++;
> +
> + if (actual) {
> + ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
> +
> + if (start <= offset && offset < start + size)
> + ra_stats[pattern][RA_ACCOUNT_SYNC]++;
> +
> + if (for_mmap)
> + ra_stats[pattern][RA_ACCOUNT_MMAP]++;
> + if (for_metadata)
> + ra_stats[pattern][RA_ACCOUNT_METADATA]++;
> + }
> +
> + if (pattern != RA_PATTERN_ALL) {
> + pattern = RA_PATTERN_ALL;
> + goto recount;
> + }
> +}
> +
> +static int readahead_stats_show(struct seq_file *s, void *_)
> +{
> + unsigned long i;
> +
> + seq_printf(s,
> + "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
> + "pattern", "readahead", "eof_hit", "cache_hit",
> + "io", "sync_io", "mmap_io", "meta_io",
> + "size", "async_size", "io_size");
> +
> + for (i = 0; i < RA_PATTERN_MAX; i++) {
> + unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
> + unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
> + /*
> + * avoid division-by-zero
> + */
> + if (count == 0)
> + count = 1;
> + if (iocount == 0)
> + iocount = 1;
> +
> + seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
> + "%10lu %10lu %10lu %10lu %10lu\n",
> + ra_pattern_names[i],
> + ra_stats[i][RA_ACCOUNT_COUNT],
> + ra_stats[i][RA_ACCOUNT_EOF],
> + ra_stats[i][RA_ACCOUNT_CACHE_HIT],
> + ra_stats[i][RA_ACCOUNT_IOCOUNT],
> + ra_stats[i][RA_ACCOUNT_SYNC],
> + ra_stats[i][RA_ACCOUNT_MMAP],
> + ra_stats[i][RA_ACCOUNT_METADATA],
> + ra_stats[i][RA_ACCOUNT_SIZE] / count,
> + ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
> + ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
> + }
> +
> + return 0;
> +}
> +
> +static int readahead_stats_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, readahead_stats_show, NULL);
> +}
> +
> +static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
> + size_t size, loff_t *offset)
> +{
> + memset(ra_stats, 0, sizeof(ra_stats));
> + return size;
> +}
> +
> +static const struct file_operations readahead_stats_fops = {
> + .owner = THIS_MODULE,
> + .open = readahead_stats_open,
> + .write = readahead_stats_write,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> +static int __init readahead_create_debugfs(void)
> +{
> + struct dentry *root;
> + struct dentry *entry;
> +
> + root = debugfs_create_dir("readahead", NULL);
> + if (!root)
> + goto out;
> +
> + entry = debugfs_create_file("stats", 0644, root,
> + NULL, &readahead_stats_fops);
> + if (!entry)
> + goto out;
> +
> + entry = debugfs_create_bool("stats_enable", 0644, root,
> + &readahead_stats_enable);
> + if (!entry)
> + goto out;
> +
> + return 0;
> +out:
> + printk(KERN_ERR "readahead: failed to create debugfs entries\n");
> + return -ENOMEM;
> +}
> +
> +late_initcall(readahead_create_debugfs);
> +#endif
> +
> +static inline void readahead_event(struct address_space *mapping,
> + pgoff_t offset,
> + unsigned long req_size,
> + bool for_mmap,
> + bool for_metadata,
> + enum readahead_pattern pattern,
> + pgoff_t start,
> + unsigned long size,
> + unsigned long async_size,
> + int actual)
> +{
> +#ifdef CONFIG_READAHEAD_STATS
> + if (readahead_stats_enable)
> + readahead_stats(mapping, offset, req_size,
> + for_mmap, for_metadata,
> + pattern, start, size, async_size, actual);
> +#endif
> +}
> +
> +
> /*
> * see if a page needs releasing upon read_cache_pages() failure
> * - the caller of read_cache_pages() may have set PG_private or PG_fscache
> @@ -228,6 +414,9 @@ int force_page_cache_readahead(struct ad
> ret = err;
> break;
> }
> + readahead_event(mapping, offset, nr_to_read, 0, 0,
> + RA_PATTERN_FADVISE, offset, this_chunk, 0,
> + err);
> ret += err;
> offset += this_chunk;
> nr_to_read -= this_chunk;
> @@ -267,6 +456,11 @@ unsigned long ra_submit(struct file_ra_s
> actual = __do_page_cache_readahead(mapping, filp,
> ra->start, ra->size, ra->async_size);
>
> + readahead_event(mapping, offset, req_size,
> + ra->for_mmap, ra->for_metadata,
> + ra->pattern, ra->start, ra->size, ra->async_size,
> + actual);
> +
> ra->for_mmap = 0;
> ra->for_metadata = 0;
> return actual;
> --- linux-next.orig/mm/Kconfig 2011-11-29 20:48:05.000000000 +0800
> +++ linux-next/mm/Kconfig 2011-11-29 20:48:05.000000000 +0800
> @@ -373,3 +373,18 @@ config CLEANCACHE
> in a negligible performance hit.
>
> If unsure, say Y to enable cleancache
> +
> +config READAHEAD_STATS
> + bool "Collect page cache readahead stats"
> + depends on DEBUG_FS
> + default y
> + help
> + This provides the readahead events accounting facilities.
> +
> + To do readahead accounting for a workload:
> +
> + echo 1 > /sys/kernel/debug/readahead/stats_enable
> + echo 0 > /sys/kernel/debug/readahead/stats # reset counters
> + # run the workload
> + cat /sys/kernel/debug/readahead/stats # check counters
> + echo 0 > /sys/kernel/debug/readahead/stats_enable
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-11-29 15:21 ` Jan Kara
@ 2011-11-30 0:44 ` Wu Fengguang
2011-12-14 6:36 ` Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 0:44 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
> This looks all inherently racy (which doesn't matter much as you suggest)
> so I just wanted to suggest that if you used per-cpu counters you'd get
> race-free and faster code at the cost of larger data structures and using
> percpu_counter_add() instead of ++ (which doesn't seem like a big
> complication to me).
No problem. I'll switch to per-cpu counters in next post.
Thanks,
Fengguang
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-11-29 15:21 ` Jan Kara
2011-11-30 0:44 ` Wu Fengguang
@ 2011-12-14 6:36 ` Wu Fengguang
2011-12-19 16:32 ` Jan Kara
1 sibling, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-12-14 6:36 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
> This looks all inherently racy (which doesn't matter much as you suggest)
> so I just wanted to suggest that if you used per-cpu counters you'd get
> race-free and faster code at the cost of larger data structures and using
> percpu_counter_add() instead of ++ (which doesn't seem like a big
> complication to me).
OK, here is the incremental patch to use per-cpu counters :)
---
mm/readahead.c | 61 +++++++++++++++++++++++++++++++++--------------
1 file changed, 44 insertions(+), 17 deletions(-)
--- linux-next.orig/mm/readahead.c 2011-12-14 09:50:37.000000000 +0800
+++ linux-next/mm/readahead.c 2011-12-14 14:16:15.000000000 +0800
@@ -68,7 +68,7 @@ enum ra_account {
RA_ACCOUNT_MAX,
};
-static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+static DEFINE_PER_CPU(unsigned long[RA_PATTERN_ALL][RA_ACCOUNT_MAX], ra_stat);
static void readahead_stats(struct address_space *mapping,
pgoff_t offset,
@@ -83,38 +83,62 @@ static void readahead_stats(struct addre
{
pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
-recount:
- ra_stats[pattern][RA_ACCOUNT_COUNT]++;
- ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
- ra_stats[pattern][RA_ACCOUNT_ASYNC_SIZE] += async_size;
- ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+ preempt_disable();
+
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_COUNT]);
+ __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_SIZE], size);
+ __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ASYNC_SIZE], async_size);
+ __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ACTUAL], actual);
if (start + size >= eof)
- ra_stats[pattern][RA_ACCOUNT_EOF]++;
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_EOF]);
if (actual < size)
- ra_stats[pattern][RA_ACCOUNT_CACHE_HIT]++;
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_CACHE_HIT]);
if (actual) {
- ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_IOCOUNT]);
if (start <= offset && offset < start + size)
- ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_SYNC]);
if (for_mmap)
- ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_MMAP]);
if (for_metadata)
- ra_stats[pattern][RA_ACCOUNT_METADATA]++;
+ __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_METADATA]);
}
- if (pattern != RA_PATTERN_ALL) {
- pattern = RA_PATTERN_ALL;
- goto recount;
- }
+ preempt_enable();
+}
+
+static void ra_stats_clear(void)
+{
+ int cpu;
+ int i, j;
+
+ for_each_online_cpu(cpu)
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ per_cpu(ra_stat[i][j], cpu) = 0;
+}
+
+static void ra_stats_sum(unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+ int cpu;
+ int i, j;
+
+ for_each_online_cpu(cpu)
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+ unsigned long n = per_cpu(ra_stat[i][j], cpu);
+ ra_stats[i][j] += n;
+ ra_stats[RA_PATTERN_ALL][j] += n;
+ }
}
static int readahead_stats_show(struct seq_file *s, void *_)
{
unsigned long i;
+ unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
seq_printf(s,
"%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
@@ -122,6 +146,9 @@ static int readahead_stats_show(struct s
"io", "sync_io", "mmap_io", "meta_io",
"size", "async_size", "io_size");
+ memset(ra_stats, 0, sizeof(ra_stats));
+ ra_stats_sum(ra_stats);
+
for (i = 0; i < RA_PATTERN_MAX; i++) {
unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
@@ -159,7 +186,7 @@ static int readahead_stats_open(struct i
static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
size_t size, loff_t *offset)
{
- memset(ra_stats, 0, sizeof(ra_stats));
+ ra_stats_clear();
return size;
}
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-12-14 6:36 ` Wu Fengguang
@ 2011-12-19 16:32 ` Jan Kara
2011-12-21 1:29 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-12-19 16:32 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Wed 14-12-11 14:36:25, Wu Fengguang wrote:
> > This looks all inherently racy (which doesn't matter much as you suggest)
> > so I just wanted to suggest that if you used per-cpu counters you'd get
> > race-free and faster code at the cost of larger data structures and using
> > percpu_counter_add() instead of ++ (which doesn't seem like a big
> > complication to me).
>
> OK, here is the incremental patch to use per-cpu counters :)
Thanks! This looks better. I just thought you would use per-cpu counters
as defined in include/linux/percpu_counter.h and are used e.g. by bdi
stats. This is more standard for statistics in the kernel than using
per-cpu variables directly.
Honza
> ---
> mm/readahead.c | 61 +++++++++++++++++++++++++++++++++--------------
> 1 file changed, 44 insertions(+), 17 deletions(-)
>
> --- linux-next.orig/mm/readahead.c 2011-12-14 09:50:37.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-12-14 14:16:15.000000000 +0800
> @@ -68,7 +68,7 @@ enum ra_account {
> RA_ACCOUNT_MAX,
> };
>
> -static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
> +static DEFINE_PER_CPU(unsigned long[RA_PATTERN_ALL][RA_ACCOUNT_MAX], ra_stat);
>
> static void readahead_stats(struct address_space *mapping,
> pgoff_t offset,
> @@ -83,38 +83,62 @@ static void readahead_stats(struct addre
> {
> pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
>
> -recount:
> - ra_stats[pattern][RA_ACCOUNT_COUNT]++;
> - ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
> - ra_stats[pattern][RA_ACCOUNT_ASYNC_SIZE] += async_size;
> - ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
> + preempt_disable();
> +
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_COUNT]);
> + __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_SIZE], size);
> + __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ASYNC_SIZE], async_size);
> + __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ACTUAL], actual);
>
> if (start + size >= eof)
> - ra_stats[pattern][RA_ACCOUNT_EOF]++;
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_EOF]);
> if (actual < size)
> - ra_stats[pattern][RA_ACCOUNT_CACHE_HIT]++;
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_CACHE_HIT]);
>
> if (actual) {
> - ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_IOCOUNT]);
>
> if (start <= offset && offset < start + size)
> - ra_stats[pattern][RA_ACCOUNT_SYNC]++;
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_SYNC]);
>
> if (for_mmap)
> - ra_stats[pattern][RA_ACCOUNT_MMAP]++;
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_MMAP]);
> if (for_metadata)
> - ra_stats[pattern][RA_ACCOUNT_METADATA]++;
> + __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_METADATA]);
> }
>
> - if (pattern != RA_PATTERN_ALL) {
> - pattern = RA_PATTERN_ALL;
> - goto recount;
> - }
> + preempt_enable();
> +}
> +
> +static void ra_stats_clear(void)
> +{
> + int cpu;
> + int i, j;
> +
> + for_each_online_cpu(cpu)
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++)
> + per_cpu(ra_stat[i][j], cpu) = 0;
> +}
> +
> +static void ra_stats_sum(unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> +{
> + int cpu;
> + int i, j;
> +
> + for_each_online_cpu(cpu)
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> + unsigned long n = per_cpu(ra_stat[i][j], cpu);
> + ra_stats[i][j] += n;
> + ra_stats[RA_PATTERN_ALL][j] += n;
> + }
> }
>
> static int readahead_stats_show(struct seq_file *s, void *_)
> {
> unsigned long i;
> + unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
>
> seq_printf(s,
> "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
> @@ -122,6 +146,9 @@ static int readahead_stats_show(struct s
> "io", "sync_io", "mmap_io", "meta_io",
> "size", "async_size", "io_size");
>
> + memset(ra_stats, 0, sizeof(ra_stats));
> + ra_stats_sum(ra_stats);
> +
> for (i = 0; i < RA_PATTERN_MAX; i++) {
> unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
> unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
> @@ -159,7 +186,7 @@ static int readahead_stats_open(struct i
> static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
> size_t size, loff_t *offset)
> {
> - memset(ra_stats, 0, sizeof(ra_stats));
> + ra_stats_clear();
> return size;
> }
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-12-19 16:32 ` Jan Kara
@ 2011-12-21 1:29 ` Wu Fengguang
2011-12-21 4:06 ` Dave Chinner
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-12-21 1:29 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Tue, Dec 20, 2011 at 12:32:41AM +0800, Jan Kara wrote:
> On Wed 14-12-11 14:36:25, Wu Fengguang wrote:
> > > This looks all inherently racy (which doesn't matter much as you suggest)
> > > so I just wanted to suggest that if you used per-cpu counters you'd get
> > > race-free and faster code at the cost of larger data structures and using
> > > percpu_counter_add() instead of ++ (which doesn't seem like a big
> > > complication to me).
> >
> > OK, here is the incremental patch to use per-cpu counters :)
> Thanks! This looks better. I just thought you would use per-cpu counters
> as defined in include/linux/percpu_counter.h and are used e.g. by bdi
> stats. This is more standard for statistics in the kernel than using
> per-cpu variables directly.
Ah yes, I overlooked that facility! However the percpu_counter's
ability to maintain and quickly retrieve the global value seems an
unnecessary feature (and overhead) for readahead stats, because here we
only need to sum up the global value when the user requests it. If
switching to percpu_counter, I'm afraid every readahead(1MB) event
will lead to the update of percpu_counter global value (grabbing the
spinlock) due to 1MB > some small batch size. This actually performs
worse than the plain global array of values in the v1 patch.
Thanks,
Fengguang
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-12-21 1:29 ` Wu Fengguang
@ 2011-12-21 4:06 ` Dave Chinner
2011-12-23 3:33 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2011-12-21 4:06 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Wed, Dec 21, 2011 at 09:29:36AM +0800, Wu Fengguang wrote:
> On Tue, Dec 20, 2011 at 12:32:41AM +0800, Jan Kara wrote:
> > On Wed 14-12-11 14:36:25, Wu Fengguang wrote:
> > > > This looks all inherently racy (which doesn't matter much as you suggest)
> > > > so I just wanted to suggest that if you used per-cpu counters you'd get
> > > > race-free and faster code at the cost of larger data structures and using
> > > > percpu_counter_add() instead of ++ (which doesn't seem like a big
> > > > complication to me).
> > >
> > > OK, here is the incremental patch to use per-cpu counters :)
> > Thanks! This looks better. I just thought you would use per-cpu counters
> > as defined in include/linux/percpu_counter.h and are used e.g. by bdi
> > stats. This is more standard for statistics in the kernel than using
> > per-cpu variables directly.
>
> Ah yes, I overlooked that facility! However the percpu_counter's
> ability to maintain and quickly retrieve the global value seems an
> unnecessary feature (and overhead) for readahead stats, because here we
> only need to sum up the global value when the user requests it. If
> switching to percpu_counter, I'm afraid every readahead(1MB) event
> will lead to the update of percpu_counter global value (grabbing the
> spinlock) due to 1MB > some small batch size. This actually performs
> worse than the plain global array of values in the v1 patch.
So use a custom batch size so that typical increments don't require
locking for every add. The bdi stat counters are an example of this
sort of setup to reduce lock contention on typical IO workloads as
concurrency increases.
All these stats have is a requirement for a different batch size to
avoid frequent lock grabs. The stats don't have to update the global
counter very often (only to prevent overflow!) so you could get away
with a batch size in the order of 2^30 without any issues....
We have a general per-cpu counter infrastructure - we should be
using it and improving it and not reinventing it a different way
every time we need a per-cpu counter.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-12-21 4:06 ` Dave Chinner
@ 2011-12-23 3:33 ` Wu Fengguang
2011-12-23 11:16 ` Jan Kara
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-12-23 3:33 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Wed, Dec 21, 2011 at 12:06:56PM +0800, Dave Chinner wrote:
> On Wed, Dec 21, 2011 at 09:29:36AM +0800, Wu Fengguang wrote:
> > On Tue, Dec 20, 2011 at 12:32:41AM +0800, Jan Kara wrote:
> > > On Wed 14-12-11 14:36:25, Wu Fengguang wrote:
> > > > > This looks all inherently racy (which doesn't matter much as you suggest)
> > > > > so I just wanted to suggest that if you used per-cpu counters you'd get
> > > > > race-free and faster code at the cost of larger data structures and using
> > > > > percpu_counter_add() instead of ++ (which doesn't seem like a big
> > > > > complication to me).
> > > >
> > > > OK, here is the incremental patch to use per-cpu counters :)
> > > Thanks! This looks better. I just thought you would use per-cpu counters
> > > as defined in include/linux/percpu_counter.h and are used e.g. by bdi
> > > stats. This is more standard for statistics in the kernel than using
> > > per-cpu variables directly.
> >
> > Ah yes, I overlooked that facility! However the percpu_counter's
> > ability to maintain and quickly retrieve the global value seems an
> > unnecessary feature (and overhead) for readahead stats, because here we
> > only need to sum up the global value when the user requests it. If
> > switching to percpu_counter, I'm afraid every readahead(1MB) event
> > will lead to the update of percpu_counter global value (grabbing the
> > spinlock) due to 1MB > some small batch size. This actually performs
> > worse than the plain global array of values in the v1 patch.
>
> So use a custom batch size so that typical increments don't require
> locking for every add. The bdi stat counters are an example of this
> sort of setup to reduce lock contention on typical IO workloads as
> concurrency increases.
>
> All these stats have is a requirement for a different batch size to
> avoid frequent lock grabs. The stats don't have to update the global
> counter very often (only to prevent overflow!) so you could get away
> with a batch size in the order of 2^30 without any issues....
>
> We have a general per-cpu counter infrastructure - we should be
> using it and improving it and not reinventing it a different way
> every time we need a per-cpu counter.
OK, let's try using percpu_counter, with a huge batch size.
It actually adds a little code size and runtime overhead.
Are you sure you like this incremental patch?
Thanks,
Fengguang
---
mm/readahead.c | 74 ++++++++++++++++++++++++++---------------------
1 file changed, 41 insertions(+), 33 deletions(-)
--- linux-next.orig/mm/readahead.c 2011-12-23 10:04:32.000000000 +0800
+++ linux-next/mm/readahead.c 2011-12-23 11:18:35.000000000 +0800
@@ -61,7 +61,18 @@ enum ra_account {
RA_ACCOUNT_MAX,
};
-static DEFINE_PER_CPU(unsigned long[RA_PATTERN_ALL][RA_ACCOUNT_MAX], ra_stat);
+#define RA_STAT_BATCH (INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+ __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+ add_ra_stat(i, j, 1);
+}
static void readahead_stats(struct address_space *mapping,
pgoff_t offset,
@@ -76,62 +87,54 @@ static void readahead_stats(struct addre
{
pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
- preempt_disable();
-
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_COUNT]);
- __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_SIZE], size);
- __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ASYNC_SIZE], async_size);
- __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ACTUAL], actual);
+ inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+ add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+ add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+ add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
if (start + size >= eof)
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_EOF]);
+ inc_ra_stat(pattern, RA_ACCOUNT_EOF);
if (actual < size)
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_CACHE_HIT]);
+ inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
if (actual) {
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_IOCOUNT]);
+ inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
if (start <= offset && offset < start + size)
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_SYNC]);
+ inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
if (for_mmap)
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_MMAP]);
+ inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
if (for_metadata)
- __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_METADATA]);
+ inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
}
-
- preempt_enable();
}
static void ra_stats_clear(void)
{
- int cpu;
int i, j;
- for_each_online_cpu(cpu)
- for (i = 0; i < RA_PATTERN_ALL; i++)
- for (j = 0; j < RA_ACCOUNT_MAX; j++)
- per_cpu(ra_stat[i][j], cpu) = 0;
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ percpu_counter_set(&ra_stat[i][j], 0);
}
-static void ra_stats_sum(unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+static void ra_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
{
- int cpu;
int i, j;
- for_each_online_cpu(cpu)
- for (i = 0; i < RA_PATTERN_ALL; i++)
- for (j = 0; j < RA_ACCOUNT_MAX; j++) {
- unsigned long n = per_cpu(ra_stat[i][j], cpu);
- ra_stats[i][j] += n;
- ra_stats[RA_PATTERN_ALL][j] += n;
- }
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+ s64 n = percpu_counter_sum(&ra_stat[i][j]);
+ ra_stats[i][j] += n;
+ ra_stats[RA_PATTERN_ALL][j] += n;
+ }
}
static int readahead_stats_show(struct seq_file *s, void *_)
{
- unsigned long i;
- unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+ long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+ int i;
seq_printf(s,
"%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
@@ -153,8 +156,8 @@ static int readahead_stats_show(struct s
if (iocount == 0)
iocount = 1;
- seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
- "%10lu %10lu %10lu %10lu %10lu\n",
+ seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+ "%10lld %10lld %10lld %10lld %10lld\n",
ra_pattern_names[i].name,
ra_stats[i][RA_ACCOUNT_COUNT],
ra_stats[i][RA_ACCOUNT_EOF],
@@ -196,6 +199,7 @@ static int __init readahead_create_debug
{
struct dentry *root;
struct dentry *entry;
+ int i, j;
root = debugfs_create_dir("readahead", NULL);
if (!root)
@@ -211,6 +215,10 @@ static int __init readahead_create_debug
if (!entry)
goto out;
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ percpu_counter_init(&ra_stat[i][j], 0);
+
return 0;
out:
printk(KERN_ERR "readahead: failed to create debugfs entries\n");
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2011-12-23 3:33 ` Wu Fengguang
@ 2011-12-23 11:16 ` Jan Kara
0 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-12-23 11:16 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dave Chinner, Jan Kara, Andrew Morton, Andi Kleen, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML
On Fri 23-12-11 11:33:20, Wu Fengguang wrote:
> On Wed, Dec 21, 2011 at 12:06:56PM +0800, Dave Chinner wrote:
> > On Wed, Dec 21, 2011 at 09:29:36AM +0800, Wu Fengguang wrote:
> > > On Tue, Dec 20, 2011 at 12:32:41AM +0800, Jan Kara wrote:
> > > > On Wed 14-12-11 14:36:25, Wu Fengguang wrote:
> > > > > > This looks all inherently racy (which doesn't matter much as you suggest)
> > > > > > so I just wanted to suggest that if you used per-cpu counters you'd get
> > > > > > race-free and faster code at the cost of larger data structures and using
> > > > > > percpu_counter_add() instead of ++ (which doesn't seem like a big
> > > > > > complication to me).
> > > > >
> > > > > OK, here is the incremental patch to use per-cpu counters :)
> > > > Thanks! This looks better. I just thought you would use per-cpu counters
> > > > as defined in include/linux/percpu_counter.h and are used e.g. by bdi
> > > > stats. This is more standard for statistics in the kernel than using
> > > > per-cpu variables directly.
> > >
> > > Ah yes, I overlooked that facility! However the percpu_counter's
> > > ability to maintain and quickly retrieve the global value seems an
> > > unnecessary feature (and overhead) for readahead stats, because here we
> > > only need to sum up the global value when the user requests it. If
> > > switching to percpu_counter, I'm afraid every readahead(1MB) event
> > > will lead to the update of percpu_counter global value (grabbing the
> > > spinlock) due to 1MB > some small batch size. This actually performs
> > > worse than the plain global array of values in the v1 patch.
> >
> > So use a custom batch size so that typical increments don't require
> > locking for every add. The bdi stat counters are an example of this
> > sort of setup to reduce lock contention on typical IO workloads as
> > concurrency increases.
> >
> > All these stats have is a requirement for a different batch size to
> > avoid frequent lock grabs. The stats don't have to update the global
> > counter very often (only to prevent overflow!) so you could get away
> > with a batch size in the order of 2^30 without any issues....
> >
> > We have a general per-cpu counter infrastructure - we should be
> > using it and improving it and not reinventing it a different way
> > every time we need a per-cpu counter.
>
> OK, let's try using percpu_counter, with a huge batch size.
>
> It actually adds a little code size and runtime overhead.
> Are you sure you like this incremental patch?
Well, I like it because it's easier to see the code is doing the right
thing when it's using standard kernel infrastructure...
Honza
> ---
> mm/readahead.c | 74 ++++++++++++++++++++++++++---------------------
> 1 file changed, 41 insertions(+), 33 deletions(-)
>
> --- linux-next.orig/mm/readahead.c 2011-12-23 10:04:32.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-12-23 11:18:35.000000000 +0800
> @@ -61,7 +61,18 @@ enum ra_account {
> RA_ACCOUNT_MAX,
> };
>
> -static DEFINE_PER_CPU(unsigned long[RA_PATTERN_ALL][RA_ACCOUNT_MAX], ra_stat);
> +#define RA_STAT_BATCH (INT_MAX / 2)
> +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
> +
> +static inline void add_ra_stat(int i, int j, s64 amount)
> +{
> + __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
> +}
> +
> +static inline void inc_ra_stat(int i, int j)
> +{
> + add_ra_stat(i, j, 1);
> +}
>
> static void readahead_stats(struct address_space *mapping,
> pgoff_t offset,
> @@ -76,62 +87,54 @@ static void readahead_stats(struct addre
> {
> pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
>
> - preempt_disable();
> -
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_COUNT]);
> - __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_SIZE], size);
> - __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ASYNC_SIZE], async_size);
> - __this_cpu_add(ra_stat[pattern][RA_ACCOUNT_ACTUAL], actual);
> + inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
> + add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
> + add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
> + add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
>
> if (start + size >= eof)
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_EOF]);
> + inc_ra_stat(pattern, RA_ACCOUNT_EOF);
> if (actual < size)
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_CACHE_HIT]);
> + inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
>
> if (actual) {
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_IOCOUNT]);
> + inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
>
> if (start <= offset && offset < start + size)
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_SYNC]);
> + inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
>
> if (for_mmap)
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_MMAP]);
> + inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
> if (for_metadata)
> - __this_cpu_inc(ra_stat[pattern][RA_ACCOUNT_METADATA]);
> + inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
> }
> -
> - preempt_enable();
> }
>
> static void ra_stats_clear(void)
> {
> - int cpu;
> int i, j;
>
> - for_each_online_cpu(cpu)
> - for (i = 0; i < RA_PATTERN_ALL; i++)
> - for (j = 0; j < RA_ACCOUNT_MAX; j++)
> - per_cpu(ra_stat[i][j], cpu) = 0;
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++)
> + percpu_counter_set(&ra_stat[i][j], 0);
> }
>
> -static void ra_stats_sum(unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> +static void ra_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> {
> - int cpu;
> int i, j;
>
> - for_each_online_cpu(cpu)
> - for (i = 0; i < RA_PATTERN_ALL; i++)
> - for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> - unsigned long n = per_cpu(ra_stat[i][j], cpu);
> - ra_stats[i][j] += n;
> - ra_stats[RA_PATTERN_ALL][j] += n;
> - }
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> + s64 n = percpu_counter_sum(&ra_stat[i][j]);
> + ra_stats[i][j] += n;
> + ra_stats[RA_PATTERN_ALL][j] += n;
> + }
> }
>
> static int readahead_stats_show(struct seq_file *s, void *_)
> {
> - unsigned long i;
> - unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
> + long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
> + int i;
>
> seq_printf(s,
> "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
> @@ -153,8 +156,8 @@ static int readahead_stats_show(struct s
> if (iocount == 0)
> iocount = 1;
>
> - seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu "
> - "%10lu %10lu %10lu %10lu %10lu\n",
> + seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
> + "%10lld %10lld %10lld %10lld %10lld\n",
> ra_pattern_names[i].name,
> ra_stats[i][RA_ACCOUNT_COUNT],
> ra_stats[i][RA_ACCOUNT_EOF],
> @@ -196,6 +199,7 @@ static int __init readahead_create_debug
> {
> struct dentry *root;
> struct dentry *entry;
> + int i, j;
>
> root = debugfs_create_dir("readahead", NULL);
> if (!root)
> @@ -211,6 +215,10 @@ static int __init readahead_create_debug
> if (!entry)
> goto out;
>
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++)
> + percpu_counter_init(&ra_stat[i][j], 0);
> +
> return 0;
> out:
> printk(KERN_ERR "readahead: failed to create debugfs entries\n");
^ permalink raw reply [flat|nested] 47+ messages in thread
* [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
` (5 preceding siblings ...)
2011-11-29 13:09 ` [PATCH 6/9] readahead: add /debug/readahead/stats Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 15:22 ` Jan Kara
2011-12-06 15:30 ` Christoph Hellwig
2011-11-29 13:09 ` [PATCH 8/9] readahead: basic support for backwards prefetching Wu Fengguang
2011-11-29 13:09 ` [PATCH 9/9] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
8 siblings, 2 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Steven Rostedt,
Peter Zijlstra, Rik van Riel, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 3550 bytes --]
This is very useful for verifying whether the readahead algorithms are
working as expected.
Example output:
# echo 1 > /debug/tracing/events/vfs/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/vfs.h | 64 +++++++++++++++++++++++++++++++++++
mm/readahead.c | 5 ++
2 files changed, 69 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/vfs.h 2011-11-29 20:58:59.000000000 +0800
@@ -0,0 +1,64 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ enum readahead_pattern pattern,
+ pgoff_t start,
+ unsigned long size,
+ unsigned long async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size, pattern, start, size, async_size,
+ actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->pattern = pattern;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ ra_pattern_names[__entry->pattern],
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux-next.orig/mm/readahead.c 2011-11-29 20:58:53.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 20:59:20.000000000 +0800
@@ -29,6 +29,9 @@ static const char * const ra_pattern_nam
[RA_PATTERN_ALL] = "all",
};
+#define CREATE_TRACE_POINTS
+#include <trace/events/vfs.h>
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -215,6 +218,8 @@ static inline void readahead_event(struc
for_mmap, for_metadata,
pattern, start, size, async_size, actual);
#endif
+ trace_readahead(mapping, offset, req_size,
+ pattern, start, size, async_size, actual);
}
* Re: [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-11-29 13:09 ` [PATCH 7/9] readahead: add vfs/readahead tracing event Wu Fengguang
@ 2011-11-29 15:22 ` Jan Kara
2011-11-30 0:42 ` Wu Fengguang
2011-12-06 15:30 ` Christoph Hellwig
1 sibling, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-11-29 15:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML
On Tue 29-11-11 21:09:07, Wu Fengguang wrote:
> This is very useful for verifying whether the readahead algorithms are
> working to the expectation.
>
> Example output:
>
> # echo 1 > /debug/tracing/events/vfs/readahead/enable
> # cp test-file /dev/null
> # cat /debug/tracing/trace # trimmed output
> readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Jens Axboe <axboe@kernel.dk>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Looks OK.
Acked-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/trace/events/vfs.h | 64 +++++++++++++++++++++++++++++++++++
> mm/readahead.c | 5 ++
> 2 files changed, 69 insertions(+)
>
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-next/include/trace/events/vfs.h 2011-11-29 20:58:59.000000000 +0800
> @@ -0,0 +1,64 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM vfs
> +
> +#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_VFS_H
> +
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(readahead,
> + TP_PROTO(struct address_space *mapping,
> + pgoff_t offset,
> + unsigned long req_size,
> + enum readahead_pattern pattern,
> + pgoff_t start,
> + unsigned long size,
> + unsigned long async_size,
> + unsigned int actual),
> +
> + TP_ARGS(mapping, offset, req_size, pattern, start, size, async_size,
> + actual),
> +
> + TP_STRUCT__entry(
> + __field( dev_t, dev )
> + __field( ino_t, ino )
> + __field( pgoff_t, offset )
> + __field( unsigned long, req_size )
> + __field( unsigned int, pattern )
> + __field( pgoff_t, start )
> + __field( unsigned int, size )
> + __field( unsigned int, async_size )
> + __field( unsigned int, actual )
> + ),
> +
> + TP_fast_assign(
> + __entry->dev = mapping->host->i_sb->s_dev;
> + __entry->ino = mapping->host->i_ino;
> + __entry->offset = offset;
> + __entry->req_size = req_size;
> + __entry->pattern = pattern;
> + __entry->start = start;
> + __entry->size = size;
> + __entry->async_size = async_size;
> + __entry->actual = actual;
> + ),
> +
> + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> + ra_pattern_names[__entry->pattern],
> + MAJOR(__entry->dev),
> + MINOR(__entry->dev),
> + __entry->ino,
> + __entry->offset,
> + __entry->req_size,
> + __entry->start,
> + __entry->size,
> + __entry->async_size,
> + __entry->start > __entry->offset,
> + __entry->actual)
> +);
> +
> +#endif /* _TRACE_VFS_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> --- linux-next.orig/mm/readahead.c 2011-11-29 20:58:53.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 20:59:20.000000000 +0800
> @@ -29,6 +29,9 @@ static const char * const ra_pattern_nam
> [RA_PATTERN_ALL] = "all",
> };
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/vfs.h>
> +
> /*
> * Initialise a struct file's readahead state. Assumes that the caller has
> * memset *ra to zero.
> @@ -215,6 +218,8 @@ static inline void readahead_event(struc
> for_mmap, for_metadata,
> pattern, start, size, async_size, actual);
> #endif
> + trace_readahead(mapping, offset, req_size,
> + pattern, start, size, async_size, actual);
> }
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-11-29 15:22 ` Jan Kara
@ 2011-11-30 0:42 ` Wu Fengguang
2011-11-30 11:44 ` Jan Kara
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 0:42 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML,
Christoph Hellwig, Dave Chinner
On Tue, Nov 29, 2011 at 11:22:28PM +0800, Jan Kara wrote:
> On Tue 29-11-11 21:09:07, Wu Fengguang wrote:
> > This is very useful for verifying whether the readahead algorithms are
> > working to the expectation.
> >
> > Example output:
> >
> > # echo 1 > /debug/tracing/events/vfs/readahead/enable
> > # cp test-file /dev/null
> > # cat /debug/tracing/trace # trimmed output
> > readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> > readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> > readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> > readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> > readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> > readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
> >
> > CC: Ingo Molnar <mingo@elte.hu>
> > CC: Jens Axboe <axboe@kernel.dk>
> > CC: Steven Rostedt <rostedt@goodmis.org>
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> Looks OK.
>
> Acked-by: Jan Kara <jack@suse.cz>
Thank you.
> > + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > + ra_pattern_names[__entry->pattern],
> > + MAJOR(__entry->dev),
> > + MINOR(__entry->dev),
One thing I'm not certain about is the dev=MAJOR:MINOR. The other option
used in many trace events is bdi=BDI_NAME_OR_NUMBER. Would bdi be more
suitable here?
Thanks,
Fengguang
* Re: [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-11-30 0:42 ` Wu Fengguang
@ 2011-11-30 11:44 ` Jan Kara
2011-11-30 12:06 ` Wu Fengguang
0 siblings, 1 reply; 47+ messages in thread
From: Jan Kara @ 2011-11-30 11:44 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML,
Christoph Hellwig, Dave Chinner
On Wed 30-11-11 08:42:35, Wu Fengguang wrote:
> On Tue, Nov 29, 2011 at 11:22:28PM +0800, Jan Kara wrote:
> > On Tue 29-11-11 21:09:07, Wu Fengguang wrote:
> > > This is very useful for verifying whether the readahead algorithms are
> > > working to the expectation.
> > >
> > > Example output:
> > >
> > > # echo 1 > /debug/tracing/events/vfs/readahead/enable
> > > # cp test-file /dev/null
> > > # cat /debug/tracing/trace # trimmed output
> > > readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> > > readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> > > readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> > > readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> > > readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> > > readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
> > >
> > > CC: Ingo Molnar <mingo@elte.hu>
> > > CC: Jens Axboe <axboe@kernel.dk>
> > > CC: Steven Rostedt <rostedt@goodmis.org>
> > > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > Looks OK.
> >
> > Acked-by: Jan Kara <jack@suse.cz>
>
> Thank you.
>
> > > + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> > > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > > + ra_pattern_names[__entry->pattern],
> > > + MAJOR(__entry->dev),
> > > + MINOR(__entry->dev),
>
> One thing I'm not certain is the dev=MAJOR:MINOR. The other option
> used in many trace events are bdi=BDI_NAME_OR_NUMBER. Will bdi be more
> suitable here?
Probably the bdi name will be more consistent (e.g. with writeback), but I
don't think it makes a big difference in practice.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-11-30 11:44 ` Jan Kara
@ 2011-11-30 12:06 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 12:06 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML,
Christoph Hellwig, Dave Chinner
On Wed, Nov 30, 2011 at 07:44:38PM +0800, Jan Kara wrote:
> On Wed 30-11-11 08:42:35, Wu Fengguang wrote:
> > On Tue, Nov 29, 2011 at 11:22:28PM +0800, Jan Kara wrote:
> > > On Tue 29-11-11 21:09:07, Wu Fengguang wrote:
> > > > This is very useful for verifying whether the readahead algorithms are
> > > > working to the expectation.
> > > >
> > > > Example output:
> > > >
> > > > # echo 1 > /debug/tracing/events/vfs/readahead/enable
> > > > # cp test-file /dev/null
> > > > # cat /debug/tracing/trace # trimmed output
> > > > readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
> > > > readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
> > > > readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
> > > > readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
> > > > readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
> > > > readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
> > > >
> > > > CC: Ingo Molnar <mingo@elte.hu>
> > > > CC: Jens Axboe <axboe@kernel.dk>
> > > > CC: Steven Rostedt <rostedt@goodmis.org>
> > > > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > > Acked-by: Rik van Riel <riel@redhat.com>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > Looks OK.
> > >
> > > Acked-by: Jan Kara <jack@suse.cz>
> >
> > Thank you.
> >
> > > > + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> > > > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > > > + ra_pattern_names[__entry->pattern],
> > > > + MAJOR(__entry->dev),
> > > > + MINOR(__entry->dev),
> >
> > One thing I'm not certain is the dev=MAJOR:MINOR. The other option
> > used in many trace events are bdi=BDI_NAME_OR_NUMBER. Will bdi be more
> > suitable here?
> Probably bdi name will be more consistent (e.g. with writeback) but I
> don't think it makes a big difference in practice.
Yeah, so I'll change to bdi name.
Thanks,
Fengguang
* Re: [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-11-29 13:09 ` [PATCH 7/9] readahead: add vfs/readahead tracing event Wu Fengguang
2011-11-29 15:22 ` Jan Kara
@ 2011-12-06 15:30 ` Christoph Hellwig
2011-12-07 9:18 ` Wu Fengguang
2011-12-08 9:03 ` [PATCH] writeback: show writeback reason with __print_symbolic Wu Fengguang
1 sibling, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2011-12-06 15:30 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML
> + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
please don't duplicate the tracepoint name in the output string.
Also don't use braces, as it just complicates parsing.
> + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> + ra_pattern_names[__entry->pattern],
Instead of doing a manual array lookup please use __print_symbolic so
that users of the binary interface (like trace-cmd) also get the
right output.
> --- linux-next.orig/mm/readahead.c 2011-11-29 20:58:53.000000000 +0800
> +++ linux-next/mm/readahead.c 2011-11-29 20:59:20.000000000 +0800
> @@ -29,6 +29,9 @@ static const char * const ra_pattern_nam
> [RA_PATTERN_ALL] = "all",
> };
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/vfs.h>
Maybe we should create a new fs/trace.c just for this instead of sticking
it into the first file that created a tracepoint in the "vfs" namespace.
* Re: [PATCH 7/9] readahead: add vfs/readahead tracing event
2011-12-06 15:30 ` Christoph Hellwig
@ 2011-12-07 9:18 ` Wu Fengguang
2011-12-08 9:03 ` [PATCH] writeback: show writeback reason with __print_symbolic Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-12-07 9:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML, Jan Kara
On Tue, Dec 06, 2011 at 11:30:25PM +0800, Christoph Hellwig wrote:
> > + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
>
> please don't duplicate the tracepoint name in the output string.
> Also don't use braces, as it jsut complicates parsing.
OK. Changed to this format:
TP_printk("pattern=%s bdi=%s ino=%lu "
"req=%lu+%lu ra=%lu+%d-%d async=%d actual=%d",
> > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > + ra_pattern_names[__entry->pattern],
>
> Instead of doing a manual array lookup please use __print_symbolic so
> that users of the binary interface (like trace-cmd) also get the
> right output.
The patch actually started with
+#define show_pattern_name(val) \
+ __print_symbolic(val, \
+ { RA_PATTERN_INITIAL, "initial" }, \
+ { RA_PATTERN_SUBSEQUENT, "subsequent" }, \
+ { RA_PATTERN_CONTEXT, "context" }, \
+ { RA_PATTERN_THRASH, "thrash" }, \
+ { RA_PATTERN_MMAP_AROUND, "around" }, \
+ { RA_PATTERN_FADVISE, "fadvise" }, \
+ { RA_PATTERN_RANDOM, "random" }, \
+ { RA_PATTERN_ALL, "all" })
It's then converted to the current form so as to avoid duplicating the
number<->string mapping in two places.
The recently added writeback reason shares the same problem:
TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
"kupdate=%d range_cyclic=%d background=%d reason=%s",
...
wb_reason_name[__entry->reason]
)
Fortunately that's newly introduced in 3.2-rc1, so it's still a good
time to fix the writeback traces.
However the problem is: are we going to keep adding duplicate mappings
like this in the future?
> > --- linux-next.orig/mm/readahead.c 2011-11-29 20:58:53.000000000 +0800
> > +++ linux-next/mm/readahead.c 2011-11-29 20:59:20.000000000 +0800
> > @@ -29,6 +29,9 @@ static const char * const ra_pattern_nam
> > [RA_PATTERN_ALL] = "all",
> > };
> >
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/vfs.h>
>
> Maybe we should create a new fs/trace.c just for this instead of stickin
> it into the first file that created a tracepoint in the "vfs" namespace.
Yeah, it looks better to move it to a more general place.
Thanks,
Fengguang
* [PATCH] writeback: show writeback reason with __print_symbolic
2011-12-06 15:30 ` Christoph Hellwig
2011-12-07 9:18 ` Wu Fengguang
@ 2011-12-08 9:03 ` Wu Fengguang
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-12-08 9:03 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Curt Wohlgemuth, Andrew Morton, Andi Kleen, Ingo Molnar,
Jens Axboe, Steven Rostedt, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML
> > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > + ra_pattern_names[__entry->pattern],
>
> Instead of doing a manual array lookup please use __print_symbolic so
> that users of the binary interface (like trace-cmd) also get the
> right output.
FYI, here is the related fix on writeback traces.
---
This makes the traces trace-cmd friendly, at the cost of a bit of code duplication.
CC: Curt Wohlgemuth <curtw@google.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
--- linux-next.orig/include/trace/events/writeback.h 2011-12-08 16:44:38.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-12-08 16:53:41.000000000 +0800
@@ -21,6 +21,18 @@
{I_REFERENCED, "I_REFERENCED"} \
)
+#define show_work_reason(reason) \
+ __print_symbolic(reason, \
+ {WB_REASON_BACKGROUND, "background"}, \
+ {WB_REASON_TRY_TO_FREE_PAGES, "try_to_free_pages"}, \
+ {WB_REASON_SYNC, "sync"}, \
+ {WB_REASON_PERIODIC, "periodic"}, \
+ {WB_REASON_LAPTOP_TIMER, "laptop_timer"}, \
+ {WB_REASON_FREE_MORE_MEM, "free_more_memory"}, \
+ {WB_REASON_FS_FREE_SPACE, "fs_free_space"}, \
+ {WB_REASON_FORKER_THREAD, "forker_thread"} \
+ )
+
struct wb_writeback_work;
DECLARE_EVENT_CLASS(writeback_work_class,
@@ -55,7 +67,7 @@ DECLARE_EVENT_CLASS(writeback_work_class
__entry->for_kupdate,
__entry->range_cyclic,
__entry->for_background,
- wb_reason_name[__entry->reason]
+ show_work_reason(__entry->reason)
)
);
#define DEFINE_WRITEBACK_WORK_EVENT(name) \
@@ -184,7 +196,8 @@ TRACE_EVENT(writeback_queue_io,
__entry->older, /* older_than_this in jiffies */
__entry->age, /* older_than_this in relative milliseconds */
__entry->moved,
- wb_reason_name[__entry->reason])
+ show_work_reason(__entry->reason)
+ )
);
TRACE_EVENT(global_dirty_state,
* [PATCH 8/9] readahead: basic support for backwards prefetching
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
` (6 preceding siblings ...)
2011-11-29 13:09 ` [PATCH 7/9] readahead: add vfs/readahead tracing event Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
2011-11-29 15:35 ` Jan Kara
2011-11-29 13:09 ` [PATCH 9/9] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
8 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Li Shaohua, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-backwards.patch --]
[-- Type: text/plain, Size: 4300 bytes --]
Add the backwards prefetching feature. It's pretty simple if we don't
support async prefetching and interleaved reads.
Here is the behavior with an 8-page read sequence from 10000 down to 0.
(The readahead size is a bit large since it's an NFS mount.)
readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8
readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32
readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128
readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368
And a simple 1-page read sequence from 10000 down to 0.
readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1
readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4
readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16
readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64
readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256
readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512
readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024
readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920
readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 444
CC: Andi Kleen <andi@firstfloor.org>
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 2 ++
mm/readahead.c | 15 +++++++++++++++
2 files changed, 17 insertions(+)
--- linux-next.orig/include/linux/fs.h 2011-11-29 20:55:27.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-29 20:57:07.000000000 +0800
@@ -968,6 +968,7 @@ struct file_ra_state {
* streams.
* RA_PATTERN_MMAP_AROUND read-around on mmap page faults
* (w/o any sequential/random hints)
+ * RA_PATTERN_BACKWARDS reverse reading detected
* RA_PATTERN_FADVISE triggered by POSIX_FADV_WILLNEED or FMODE_RANDOM
* RA_PATTERN_OVERSIZE a random read larger than max readahead size,
* do max readahead to break down the read size
@@ -978,6 +979,7 @@ enum readahead_pattern {
RA_PATTERN_SUBSEQUENT,
RA_PATTERN_CONTEXT,
RA_PATTERN_MMAP_AROUND,
+ RA_PATTERN_BACKWARDS,
RA_PATTERN_FADVISE,
RA_PATTERN_OVERSIZE,
RA_PATTERN_RANDOM,
--- linux-next.orig/mm/readahead.c 2011-11-29 20:57:03.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 20:57:07.000000000 +0800
@@ -23,6 +23,7 @@ static const char * const ra_pattern_nam
[RA_PATTERN_SUBSEQUENT] = "subsequent",
[RA_PATTERN_CONTEXT] = "context",
[RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_BACKWARDS] = "backwards",
[RA_PATTERN_FADVISE] = "fadvise",
[RA_PATTERN_OVERSIZE] = "oversize",
[RA_PATTERN_RANDOM] = "random",
@@ -676,6 +677,20 @@ ondemand_readahead(struct address_space
}
/*
+ * backwards reading
+ */
+ if (offset < ra->start && offset + req_size >= ra->start) {
+ ra->pattern = RA_PATTERN_BACKWARDS;
+ ra->size = get_next_ra_size(ra, max);
+ max = ra->start;
+ if (ra->size > max)
+ ra->size = max;
+ ra->async_size = 0;
+ ra->start -= ra->size;
+ goto readit;
+ }
+
+ /*
* Query the page cache and look for the traces(cached history pages)
* that a sequential stream would leave behind.
*/
* Re: [PATCH 8/9] readahead: basic support for backwards prefetching
2011-11-29 13:09 ` [PATCH 8/9] readahead: basic support for backwards prefetching Wu Fengguang
@ 2011-11-29 15:35 ` Jan Kara
2011-11-29 16:37 ` Pádraig Brady
2011-11-30 0:37 ` Wu Fengguang
0 siblings, 2 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-29 15:35 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Li Shaohua,
Linux Memory Management List, linux-fsdevel, LKML
On Tue 29-11-11 21:09:08, Wu Fengguang wrote:
> Add the backwards prefetching feature. It's pretty simple if we don't
> support async prefetching and interleaved reads.
>
> Here is the behavior with an 8-page read sequence from 10000 down to 0.
> (The readahead size is a bit large since it's an NFS mount.)
>
> readahead-random(dev=0:16, ino=3948605, req=10000+8, ra=10000+8-0, async=0) = 8
> readahead-backwards(dev=0:16, ino=3948605, req=9992+8, ra=9968+32-0, async=0) = 32
> readahead-backwards(dev=0:16, ino=3948605, req=9960+8, ra=9840+128-0, async=0) = 128
> readahead-backwards(dev=0:16, ino=3948605, req=9832+8, ra=9584+256-0, async=0) = 256
> readahead-backwards(dev=0:16, ino=3948605, req=9576+8, ra=9072+512-0, async=0) = 512
> readahead-backwards(dev=0:16, ino=3948605, req=9064+8, ra=8048+1024-0, async=0) = 1024
> readahead-backwards(dev=0:16, ino=3948605, req=8040+8, ra=6128+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=6120+8, ra=4208+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=4200+8, ra=2288+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=2280+8, ra=368+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=360+8, ra=0+368-0, async=0) = 368
>
> And a simple 1-page read sequence from 10000 down to 0.
>
> readahead-random(dev=0:16, ino=3948605, req=10000+1, ra=10000+1-0, async=0) = 1
> readahead-backwards(dev=0:16, ino=3948605, req=9999+1, ra=9996+4-0, async=0) = 4
> readahead-backwards(dev=0:16, ino=3948605, req=9995+1, ra=9980+16-0, async=0) = 16
> readahead-backwards(dev=0:16, ino=3948605, req=9979+1, ra=9916+64-0, async=0) = 64
> readahead-backwards(dev=0:16, ino=3948605, req=9915+1, ra=9660+256-0, async=0) = 256
> readahead-backwards(dev=0:16, ino=3948605, req=9659+1, ra=9148+512-0, async=0) = 512
> readahead-backwards(dev=0:16, ino=3948605, req=9147+1, ra=8124+1024-0, async=0) = 1024
> readahead-backwards(dev=0:16, ino=3948605, req=8123+1, ra=6204+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=6203+1, ra=4284+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=4283+1, ra=2364+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=2363+1, ra=444+1920-0, async=0) = 1920
> readahead-backwards(dev=0:16, ino=3948605, req=443+1, ra=0+444-0, async=0) = 444
>
> CC: Andi Kleen <andi@firstfloor.org>
> CC: Li Shaohua <shaohua.li@intel.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Someone already mentioned this earlier and I don't think I've seen a
response: do you have a realistic use case for this? I don't think I've ever
seen an application reading a file backwards...
> --- linux-next.orig/include/linux/fs.h 2011-11-29 20:55:27.000000000 +0800
> +++ linux-next/include/linux/fs.h 2011-11-29 20:57:07.000000000 +0800
...
> @@ -676,6 +677,20 @@ ondemand_readahead(struct address_space
> }
>
> /*
> + * backwards reading
> + */
> + if (offset < ra->start && offset + req_size >= ra->start) {
> + ra->pattern = RA_PATTERN_BACKWARDS;
> + ra->size = get_next_ra_size(ra, max);
> + max = ra->start;
> + if (ra->size > max)
> + ra->size = max;
> + ra->async_size = 0;
> + ra->start -= ra->size;
IMHO much more obvious way to write this is:
ra->size = get_next_ra_size(ra, max);
if (ra->size > ra->start) {
ra->size = ra->start;
ra->start = 0;
} else
ra->start -= ra->size;
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH 8/9] readahead: basic support for backwards prefetching
2011-11-29 15:35 ` Jan Kara
@ 2011-11-29 16:37 ` Pádraig Brady
2011-11-30 0:24 ` Wu Fengguang
2011-11-30 0:37 ` Wu Fengguang
1 sibling, 1 reply; 47+ messages in thread
From: Pádraig Brady @ 2011-11-29 16:37 UTC (permalink / raw)
To: Jan Kara
Cc: Wu Fengguang, Andrew Morton, Andi Kleen, Li Shaohua,
Linux Memory Management List, linux-fsdevel, LKML
On 11/29/2011 03:35 PM, Jan Kara wrote:
> Someone already mentioned this earlier and I don't think I've seen a
> response: Do you have a realistic use case for this? I don't think I've ever
> seen an application reading a file backwards...
tac, tail -n$large, ...
cheers,
Pádraig.
* Re: [PATCH 8/9] readahead: basic support for backwards prefetching
2011-11-29 16:37 ` Pádraig Brady
@ 2011-11-30 0:24 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 0:24 UTC (permalink / raw)
To: Pádraig Brady
Cc: Jan Kara, Andrew Morton, Andi Kleen, Li Shaohua,
Linux Memory Management List, linux-fsdevel, LKML
On Wed, Nov 30, 2011 at 12:37:55AM +0800, Pádraig Brady wrote:
> On 11/29/2011 03:35 PM, Jan Kara wrote:
> > Someone already mentioned this earlier and I don't think I've seen a
> > response: Do you have a realistic use case for this? I don't think I've ever
> > seen an application reading a file backwards...
>
> tac, tail -n$large, ...
Indeed!
tac-4425 [000] 73358.419777: readahead: readahead-random(dev=0:16, ino=1548445, req=750+1, ra=750+1-0, async=0) = 1
tac-4425 [004] 73358.442030: readahead: readahead-backwards(dev=0:16, ino=1548445, req=748+2, ra=746+5-0, async=0) = 4
tac-4425 [004] 73358.443312: readahead: readahead-backwards(dev=0:16, ino=1548445, req=744+2, ra=726+25-0, async=0) = 20
tail-4369 [000] 72633.696307: readahead: readahead-random(dev=0:16, ino=1548450, req=750+1, ra=750+1-0, async=0) = 1
tail-4369 [004] 72634.042106: readahead: readahead-backwards(dev=0:16, ino=1548450, req=748+2, ra=746+5-0, async=0) = 4
tail-4369 [004] 72634.043231: readahead: readahead-backwards(dev=0:16, ino=1548450, req=744+2, ra=726+25-0, async=0) = 20
tail-4369 [004] 72634.176216: readahead: readahead-backwards(dev=0:16, ino=1548450, req=724+2, ra=626+125-0, async=0) = 100
However, I see that the readahead requests always get snapped to EOF.
So it's obvious the "snap to EOF" logic needs some limiting based on
the max readahead size.
Thanks,
Fengguang
* Re: [PATCH 8/9] readahead: basic support for backwards prefetching
2011-11-29 15:35 ` Jan Kara
2011-11-29 16:37 ` Pádraig Brady
@ 2011-11-30 0:37 ` Wu Fengguang
2011-11-30 11:21 ` Jan Kara
1 sibling, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2011-11-30 0:37 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Andi Kleen, Li Shaohua,
Linux Memory Management List, linux-fsdevel, LKML
(snip)
> > @@ -676,6 +677,20 @@ ondemand_readahead(struct address_space
> > }
> >
> > /*
> > + * backwards reading
> > + */
> > + if (offset < ra->start && offset + req_size >= ra->start) {
> > + ra->pattern = RA_PATTERN_BACKWARDS;
> > + ra->size = get_next_ra_size(ra, max);
> > + max = ra->start;
> > + if (ra->size > max)
> > + ra->size = max;
> > + ra->async_size = 0;
> > + ra->start -= ra->size;
> IMHO much more obvious way to write this is:
> ra->size = get_next_ra_size(ra, max);
> if (ra->size > ra->start) {
> ra->size = ra->start;
> ra->start = 0;
> } else
> ra->start -= ra->size;
Good idea! Here is the updated code:
/*
* backwards reading
*/
if (offset < ra->start && offset + req_size >= ra->start) {
ra->pattern = RA_PATTERN_BACKWARDS;
ra->size = get_next_ra_size(ra, max);
if (ra->size > ra->start) {
/*
* ra->start may be concurrently set to some huge
* value, the min() at least avoids submitting huge IO
* in this race condition
*/
ra->size = min(ra->start, max);
ra->start = 0;
} else
ra->start -= ra->size;
ra->async_size = 0;
goto readit;
}
Thanks,
Fengguang
* Re: [PATCH 8/9] readahead: basic support for backwards prefetching
2011-11-30 0:37 ` Wu Fengguang
@ 2011-11-30 11:21 ` Jan Kara
0 siblings, 0 replies; 47+ messages in thread
From: Jan Kara @ 2011-11-30 11:21 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Andi Kleen, Li Shaohua,
Linux Memory Management List, linux-fsdevel, LKML
On Wed 30-11-11 08:37:16, Wu Fengguang wrote:
> (snip)
> > > @@ -676,6 +677,20 @@ ondemand_readahead(struct address_space
> > > }
> > >
> > > /*
> > > + * backwards reading
> > > + */
> > > + if (offset < ra->start && offset + req_size >= ra->start) {
> > > + ra->pattern = RA_PATTERN_BACKWARDS;
> > > + ra->size = get_next_ra_size(ra, max);
> > > + max = ra->start;
> > > + if (ra->size > max)
> > > + ra->size = max;
> > > + ra->async_size = 0;
> > > + ra->start -= ra->size;
> > IMHO much more obvious way to write this is:
> > ra->size = get_next_ra_size(ra, max);
> > if (ra->size > ra->start) {
> > ra->size = ra->start;
> > ra->start = 0;
> > } else
> > ra->start -= ra->size;
>
> Good idea! Here is the updated code:
>
> /*
> * backwards reading
> */
> if (offset < ra->start && offset + req_size >= ra->start) {
> ra->pattern = RA_PATTERN_BACKWARDS;
> ra->size = get_next_ra_size(ra, max);
> if (ra->size > ra->start) {
> /*
> * ra->start may be concurrently set to some huge
> * value, the min() at least avoids submitting huge IO
> * in this race condition
> */
> ra->size = min(ra->start, max);
> ra->start = 0;
> } else
> ra->start -= ra->size;
> ra->async_size = 0;
> goto readit;
> }
Looks good. You can add:
Acked-by: Jan Kara <jack@suse.cz>
to the patch.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* [PATCH 9/9] readahead: dont do start-of-file readahead after lseek()
2011-11-29 13:09 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v2) Wu Fengguang
` (7 preceding siblings ...)
2011-11-29 13:09 ` [PATCH 8/9] readahead: basic support for backwards prefetching Wu Fengguang
@ 2011-11-29 13:09 ` Wu Fengguang
8 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2011-11-29 13:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Rik van Riel, Linus Torvalds, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2250 bytes --]
Some applications (e.g. blkid, id3tool) seek around a file
to get information. For example, blkid does
seek to 0
read 1024
seek to 1536
read 16384
The start-of-file readahead heuristic is wrong for them; their
access pattern can be identified by the lseek() calls.
So test-and-set a READAHEAD_LSEEK flag on lseek() and skip the
start-of-file readahead when it is seen. Proposed by Linus.
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/read_write.c | 3 +++
include/linux/fs.h | 1 +
mm/readahead.c | 4 ++++
3 files changed, 8 insertions(+)
--- linux-next.orig/mm/readahead.c 2011-11-29 20:57:07.000000000 +0800
+++ linux-next/mm/readahead.c 2011-11-29 20:57:09.000000000 +0800
@@ -467,6 +467,7 @@ unsigned long ra_submit(struct file_ra_s
ra->pattern, ra->start, ra->size, ra->async_size,
actual);
+ ra->lseek = 0;
ra->for_mmap = 0;
ra->for_metadata = 0;
return actual;
@@ -618,6 +619,8 @@ ondemand_readahead(struct address_space
* start of file
*/
if (!offset) {
+ if (ra->lseek && req_size < max)
+ goto random_read;
ra->pattern = RA_PATTERN_INITIAL;
goto initial_readahead;
}
@@ -697,6 +700,7 @@ ondemand_readahead(struct address_space
if (try_context_readahead(mapping, ra, offset, req_size, max))
goto readit;
+random_read:
/*
* standalone, small random read
*/
--- linux-next.orig/fs/read_write.c 2011-11-29 20:55:27.000000000 +0800
+++ linux-next/fs/read_write.c 2011-11-29 20:57:09.000000000 +0800
@@ -47,6 +47,9 @@ static loff_t lseek_execute(struct file
file->f_pos = offset;
file->f_version = 0;
}
+
+ file->f_ra.lseek = 1;
+
return offset;
}
--- linux-next.orig/include/linux/fs.h 2011-11-29 20:57:07.000000000 +0800
+++ linux-next/include/linux/fs.h 2011-11-29 20:57:09.000000000 +0800
@@ -949,6 +949,7 @@ struct file_ra_state {
u8 pattern; /* one of RA_PATTERN_* */
unsigned int for_mmap:1; /* readahead for mmap accesses */
unsigned int for_metadata:1; /* readahead for meta data */
+ unsigned int lseek:1; /* this read has a leading lseek */
loff_t prev_pos; /* Cache last read() position */
};
* [PATCH 6/9] readahead: add /debug/readahead/stats
2012-01-27 3:05 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v4) Wu Fengguang
@ 2012-01-27 3:05 ` Wu Fengguang
2012-01-27 16:21 ` Christoph Lameter
0 siblings, 1 reply; 47+ messages in thread
From: Wu Fengguang @ 2012-01-27 3:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
Rik van Riel, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 8832 bytes --]
The accounting code will be compiled in by default (CONFIG_READAHEAD_STATS=y),
and will remain inactive by default.
It can be runtime enabled/disabled through the debugfs interface
echo 1 > /debug/readahead/stats_enable
echo 0 > /debug/readahead/stats_enable
Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)
$ cat /debug/readahead/stats
pattern readahead eof_hit cache_hit io sync_io mmap_io meta_io size async_size io_size
initial 702 511 0 692 692 0 0 2 0 2
subsequent 7 0 1 7 1 1 0 23 22 23
context 160 161 0 2 0 1 0 0 0 16
around 184 184 177 184 184 184 0 58 0 53
backwards 2 0 2 2 2 0 0 4 0 3
fadvise 2593 47 8 2588 2588 0 0 1 0 1
oversize 0 0 0 0 0 0 0 0 0 0
random 45 20 0 44 44 0 0 1 0 1
all 3697 923 188 3519 3511 186 0 4 0 4
The two most important columns are
- io number of readahead IO
- io_size average readahead IO size
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/Kconfig | 15 +++
mm/readahead.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 217 insertions(+)
--- linux-next.orig/mm/readahead.c 2012-01-25 15:57:52.000000000 +0800
+++ linux-next/mm/readahead.c 2012-01-25 15:57:53.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/ftrace_event.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+ READAHEAD_PATTERNS
+};
+
+enum ra_account {
+ /* number of readaheads */
+ RA_ACCOUNT_COUNT, /* readahead request */
+ RA_ACCOUNT_EOF, /* readahead request covers EOF */
+ RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */
+ RA_ACCOUNT_IOCOUNT, /* readahead IO */
+ RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
+ RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
+ RA_ACCOUNT_METADATA, /* readahead IO on metadata */
+ /* number of readahead pages */
+ RA_ACCOUNT_SIZE, /* readahead size */
+ RA_ACCOUNT_ASYNC_SIZE, /* readahead async size */
+ RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
+ /* end mark */
+ RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH (INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+ __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+ add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ bool for_mmap,
+ bool for_metadata,
+ enum readahead_pattern pattern,
+ pgoff_t start,
+ unsigned long size,
+ unsigned long async_size,
+ int actual)
+{
+ pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+ inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+ add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+ add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+ add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+ if (start + size >= eof)
+ inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+ if (actual < size)
+ inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+ if (actual) {
+ inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+ if (start <= offset && offset < start + size)
+ inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+ if (for_mmap)
+ inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+ if (for_metadata)
+ inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+ }
+}
+
+static void readahead_stats_reset(void)
+{
+ int i, j;
+
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+ int i, j;
+
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+ s64 n = percpu_counter_sum(&ra_stat[i][j]);
+ ra_stats[i][j] += n;
+ ra_stats[RA_PATTERN_ALL][j] += n;
+ }
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+ long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+ int i;
+
+ seq_printf(s,
+ "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+ "pattern", "readahead", "eof_hit", "cache_hit",
+ "io", "sync_io", "mmap_io", "meta_io",
+ "size", "async_size", "io_size");
+
+ memset(ra_stats, 0, sizeof(ra_stats));
+ readahead_stats_sum(ra_stats);
+
+ for (i = 0; i < RA_PATTERN_MAX; i++) {
+ unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+ unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+ /*
+ * avoid division-by-zero
+ */
+ if (count == 0)
+ count = 1;
+ if (iocount == 0)
+ iocount = 1;
+
+ seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+ "%10lld %10lld %10lld %10lld %10lld\n",
+ ra_pattern_names[i].name,
+ ra_stats[i][RA_ACCOUNT_COUNT],
+ ra_stats[i][RA_ACCOUNT_EOF],
+ ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+ ra_stats[i][RA_ACCOUNT_IOCOUNT],
+ ra_stats[i][RA_ACCOUNT_SYNC],
+ ra_stats[i][RA_ACCOUNT_MMAP],
+ ra_stats[i][RA_ACCOUNT_METADATA],
+ ra_stats[i][RA_ACCOUNT_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+ }
+
+ return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ readahead_stats_reset();
+ return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = readahead_stats_open,
+ .write = readahead_stats_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+ struct dentry *root;
+ struct dentry *entry;
+ int i, j;
+
+ root = debugfs_create_dir("readahead", NULL);
+ if (!root)
+ goto out;
+
+ entry = debugfs_create_file("stats", 0644, root,
+ NULL, &readahead_stats_fops);
+ if (!entry)
+ goto out;
+
+ entry = debugfs_create_bool("stats_enable", 0644, root,
+ &readahead_stats_enable);
+ if (!entry)
+ goto out;
+
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ percpu_counter_init(&ra_stat[i][j], 0);
+
+ return 0;
+out:
+ printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+ return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
static inline void readahead_event(struct address_space *mapping,
pgoff_t offset,
unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
unsigned long async_size,
int actual)
{
+#ifdef CONFIG_READAHEAD_STATS
+ if (readahead_stats_enable)
+ readahead_stats(mapping, offset, req_size,
+ for_mmap, for_metadata,
+ pattern, start, size, async_size, actual);
+#endif
trace_readahead(mapping, offset, req_size,
pattern, start, size, async_size, actual);
}
--- linux-next.orig/mm/Kconfig 2012-01-25 15:57:46.000000000 +0800
+++ linux-next/mm/Kconfig 2012-01-25 15:57:53.000000000 +0800
@@ -379,3 +379,18 @@ config CLEANCACHE
in a negligible performance hit.
If unsure, say Y to enable cleancache
+
+config READAHEAD_STATS
+ bool "Collect page cache readahead stats"
+ depends on DEBUG_FS
+ default n
+ help
+ This provides the readahead events accounting facilities.
+
+ To do readahead accounting for a workload:
+
+ echo 1 > /sys/kernel/debug/readahead/stats_enable
+ echo 0 > /sys/kernel/debug/readahead/stats # reset counters
+ # run the workload
+ cat /sys/kernel/debug/readahead/stats # check counters
+ echo 0 > /sys/kernel/debug/readahead/stats_enable
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2012-01-27 3:05 ` [PATCH 6/9] readahead: add /debug/readahead/stats Wu Fengguang
@ 2012-01-27 16:21 ` Christoph Lameter
2012-01-27 20:15 ` Andrew Morton
0 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2012-01-27 16:21 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Fri, 27 Jan 2012, Wu Fengguang wrote:
> +
> +#define RA_STAT_BATCH (INT_MAX / 2)
> +static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
Why use a percpu counter here? The stats structures are not dynamically
allocated, so you can just use a DECLARE_PER_CPU statement. That way you do
not have the overhead of the percpu counter calls; simple instructions
are generated to deal with the counter instead.
There are also no calls to any of the fast access functions for percpu
counters, so percpu_counter always has to loop over all the per-CPU
counters anyway to get the results. The batching of the percpu_counters
is therefore not used.
It's simpler to just do a loop that sums over all counters when displaying
the results.
> +static inline void add_ra_stat(int i, int j, s64 amount)
> +{
> + __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
__this_cpu_add(ra_stat[i][j], amount);
> +}
> +
> +static void readahead_stats_reset(void)
> +{
> + int i, j;
> +
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++)
> + percpu_counter_set(&ra_stat[i][j], 0);
for_each_online(cpu)
memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));
> +}
> +
> +static void
> +readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
> +{
> + int i, j;
> +
> + for (i = 0; i < RA_PATTERN_ALL; i++)
> + for (j = 0; j < RA_ACCOUNT_MAX; j++) {
> + s64 n = percpu_counter_sum(&ra_stat[i][j]);
> + ra_stats[i][j] += n;
> + ra_stats[RA_PATTERN_ALL][j] += n;
> + }
> +}
Define a function stats instead?
static long get_stat_sum(long __percpu *x)
{
	int cpu;
	long sum = 0;

	for_each_online_cpu(cpu)
		sum += *per_cpu_ptr(x, cpu);
	return sum;
}
> +
> +static int readahead_stats_show(struct seq_file *s, void *_)
> +{
> + readahead_stats_sum(ra_stats);
> +
> + for (i = 0; i < RA_PATTERN_MAX; i++) {
> + unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
= get_stat_sum(&ra_stats[i][RA_ACCOUNT_COUNT]);
...
?
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2012-01-27 16:21 ` Christoph Lameter
@ 2012-01-27 20:15 ` Andrew Morton
2012-01-29 5:07 ` Wu Fengguang
2012-01-30 4:02 ` Dave Chinner
0 siblings, 2 replies; 47+ messages in thread
From: Andrew Morton @ 2012-01-27 20:15 UTC (permalink / raw)
To: Christoph Lameter
Cc: Wu Fengguang, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Fri, 27 Jan 2012 10:21:36 -0600 (CST)
Christoph Lameter <cl@linux.com> wrote:
> > +
> > +static void readahead_stats_reset(void)
> > +{
> > + int i, j;
> > +
> > + for (i = 0; i < RA_PATTERN_ALL; i++)
> > + for (j = 0; j < RA_ACCOUNT_MAX; j++)
> > + percpu_counter_set(&ra_stat[i][j], 0);
>
> for_each_online(cpu)
> memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));
for_each_possible_cpu(). And that's one reason to not open-code the
operation. Another is so we don't have tiresome open-coded loops all
over the place.
But before doing either of those things we should choose boring old
atomic_inc(). Has it been shown that the cost of doing so is
unacceptable? Bearing this in mind:
> The accounting code will be compiled in by default
> (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.
I agree with those choices. They effectively mean that the stats will
be a developer-only/debugger-only thing. So even if the atomic_inc()
costs are measurable during these develop/debug sessions, is anyone
likely to care?
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2012-01-27 20:15 ` Andrew Morton
@ 2012-01-29 5:07 ` Wu Fengguang
2012-01-30 4:02 ` Dave Chinner
1 sibling, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2012-01-29 5:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Andi Kleen, Ingo Molnar, Jens Axboe,
Peter Zijlstra, Rik van Riel, Linux Memory Management List,
linux-fsdevel, LKML
On Fri, Jan 27, 2012 at 12:15:51PM -0800, Andrew Morton wrote:
> > The accounting code will be compiled in by default
> > (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.
>
> I agree with those choices. They effectively mean that the stats will
> be a developer-only/debugger-only thing. So even if the atomic_inc()
> costs are measurable during these develop/debug sessions, is anyone
> likely to care?
Sorry, I have changed the default to CONFIG_READAHEAD_STATS=n to avoid
bloating the kernel (and forgot to edit the changelog accordingly).
I'm not sure how many people are going to check the readahead stats.
Thanks,
Fengguang
* Re: [PATCH 6/9] readahead: add /debug/readahead/stats
2012-01-27 20:15 ` Andrew Morton
2012-01-29 5:07 ` Wu Fengguang
@ 2012-01-30 4:02 ` Dave Chinner
1 sibling, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2012-01-30 4:02 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Wu Fengguang, Andi Kleen, Ingo Molnar,
Jens Axboe, Peter Zijlstra, Rik van Riel,
Linux Memory Management List, linux-fsdevel, LKML
On Fri, Jan 27, 2012 at 12:15:51PM -0800, Andrew Morton wrote:
> On Fri, 27 Jan 2012 10:21:36 -0600 (CST)
> Christoph Lameter <cl@linux.com> wrote:
>
> > > +
> > > +static void readahead_stats_reset(void)
> > > +{
> > > + int i, j;
> > > +
> > > + for (i = 0; i < RA_PATTERN_ALL; i++)
> > > + for (j = 0; j < RA_ACCOUNT_MAX; j++)
> > > + percpu_counter_set(&ra_stat[i][j], 0);
> >
> > for_each_online(cpu)
> > memset(per_cpu_ptr(&ra_stat, cpu), 0, sizeof(ra_stat));
>
> for_each_possible_cpu(). And that's one reason to not open-code the
> operation. Another is so we don't have tiresome open-coded loops all
> over the place.
Amen, brother!
> But before doing either of those things we should choose boring old
> atomic_inc(). Has it been shown that the cost of doing so is
> unacceptable? Bearing this in mind:
atomics for stats in the IO path have long been known not to scale
well enough - especially now we have PCIe SSDs that can do hundreds
of thousands of reads per second if you have enough CPU concurrency
to drive them that hard. Under that sort of workload, atomics won't
scale.
>
> > The accounting code will be compiled in by default
> > (CONFIG_READAHEAD_STATS=y), and will remain inactive by default.
>
> I agree with those choices. They effectively mean that the stats will
> be a developer-only/debugger-only thing. So even if the atomic_inc()
> costs are measurable during these develop/debug sessions, is anyone
> likely to care?
I do. If I need the debugging stats, the overhead must not perturb
the behaviour I'm trying to understand/debug for them to be
useful....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* [PATCH 6/9] readahead: add /debug/readahead/stats
2012-02-11 4:31 [PATCH 0/9] readahead stats/tracing, backwards prefetching and more (v5) Wu Fengguang
@ 2012-02-11 4:31 ` Wu Fengguang
0 siblings, 0 replies; 47+ messages in thread
From: Wu Fengguang @ 2012-02-11 4:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Andi Kleen, Ingo Molnar, Jens Axboe, Peter Zijlstra,
Rik van Riel, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 8890 bytes --]
This accounting code won't be compiled by default (CONFIG_READAHEAD_STATS=n).
It's expected to be reset and enabled at runtime before use:
echo 0 > /debug/readahead/stats # reset counters
echo 1 > /debug/readahead/stats_enable
# run test workload
echo 0 > /debug/readahead/stats_enable
Example output:
(taken from a fresh booted NFS-ROOT console box with rsize=524288)
$ cat /debug/readahead/stats
pattern readahead eof_hit cache_hit io sync_io mmap_io meta_io size async_size io_size
initial 702 511 0 692 692 0 0 2 0 2
subsequent 7 0 1 7 1 1 0 23 22 23
context 160 161 0 2 0 1 0 0 0 16
around 184 184 177 184 184 184 0 58 0 53
backwards 2 0 2 2 2 0 0 4 0 3
fadvise 2593 47 8 2588 2588 0 0 1 0 1
oversize 0 0 0 0 0 0 0 0 0 0
random 45 20 0 44 44 0 0 1 0 1
all 3697 923 188 3519 3511 186 0 4 0 4
The two most important columns are
- io number of readahead IO
- io_size average readahead IO size
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <axboe@kernel.dk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/Kconfig | 15 +++
mm/readahead.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 217 insertions(+)
--- linux-next.orig/mm/readahead.c 2012-02-11 12:02:02.000000000 +0800
+++ linux-next/mm/readahead.c 2012-02-11 12:02:08.000000000 +0800
@@ -33,6 +33,202 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/ftrace_event.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+
+static u32 readahead_stats_enable __read_mostly;
+
+static const struct trace_print_flags ra_pattern_names[] = {
+ READAHEAD_PATTERNS
+};
+
+enum ra_account {
+ /* number of readaheads */
+ RA_ACCOUNT_COUNT, /* readahead request */
+ RA_ACCOUNT_EOF, /* readahead request covers EOF */
+ RA_ACCOUNT_CACHE_HIT, /* readahead request covers some cached pages */
+ RA_ACCOUNT_IOCOUNT, /* readahead IO */
+ RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
+ RA_ACCOUNT_MMAP, /* readahead IO by mmap page faults */
+ RA_ACCOUNT_METADATA, /* readahead IO on metadata */
+ /* number of readahead pages */
+ RA_ACCOUNT_SIZE, /* readahead size */
+ RA_ACCOUNT_ASYNC_SIZE, /* readahead async size */
+ RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
+ /* end mark */
+ RA_ACCOUNT_MAX,
+};
+
+#define RA_STAT_BATCH (INT_MAX / 2)
+static struct percpu_counter ra_stat[RA_PATTERN_ALL][RA_ACCOUNT_MAX];
+
+static inline void add_ra_stat(int i, int j, s64 amount)
+{
+ __percpu_counter_add(&ra_stat[i][j], amount, RA_STAT_BATCH);
+}
+
+static inline void inc_ra_stat(int i, int j)
+{
+ add_ra_stat(i, j, 1);
+}
+
+static void readahead_stats(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ bool for_mmap,
+ bool for_metadata,
+ enum readahead_pattern pattern,
+ pgoff_t start,
+ unsigned long size,
+ unsigned long async_size,
+ int actual)
+{
+ pgoff_t eof = ((i_size_read(mapping->host)-1) >> PAGE_CACHE_SHIFT) + 1;
+
+ inc_ra_stat(pattern, RA_ACCOUNT_COUNT);
+ add_ra_stat(pattern, RA_ACCOUNT_SIZE, size);
+ add_ra_stat(pattern, RA_ACCOUNT_ASYNC_SIZE, async_size);
+ add_ra_stat(pattern, RA_ACCOUNT_ACTUAL, actual);
+
+ if (start + size >= eof)
+ inc_ra_stat(pattern, RA_ACCOUNT_EOF);
+ if (actual < size)
+ inc_ra_stat(pattern, RA_ACCOUNT_CACHE_HIT);
+
+ if (actual) {
+ inc_ra_stat(pattern, RA_ACCOUNT_IOCOUNT);
+
+ if (start <= offset && offset < start + size)
+ inc_ra_stat(pattern, RA_ACCOUNT_SYNC);
+
+ if (for_mmap)
+ inc_ra_stat(pattern, RA_ACCOUNT_MMAP);
+ if (for_metadata)
+ inc_ra_stat(pattern, RA_ACCOUNT_METADATA);
+ }
+}
+
+static void readahead_stats_reset(void)
+{
+ int i, j;
+
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ percpu_counter_set(&ra_stat[i][j], 0);
+}
+
+static void
+readahead_stats_sum(long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX])
+{
+ int i, j;
+
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++) {
+ s64 n = percpu_counter_sum(&ra_stat[i][j]);
+ ra_stats[i][j] += n;
+ ra_stats[RA_PATTERN_ALL][j] += n;
+ }
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+ long long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+ int i;
+
+ seq_printf(s,
+ "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+ "pattern", "readahead", "eof_hit", "cache_hit",
+ "io", "sync_io", "mmap_io", "meta_io",
+ "size", "async_size", "io_size");
+
+ memset(ra_stats, 0, sizeof(ra_stats));
+ readahead_stats_sum(ra_stats);
+
+ for (i = 0; i < RA_PATTERN_MAX; i++) {
+ unsigned long count = ra_stats[i][RA_ACCOUNT_COUNT];
+ unsigned long iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+ /*
+ * avoid division-by-zero
+ */
+ if (count == 0)
+ count = 1;
+ if (iocount == 0)
+ iocount = 1;
+
+ seq_printf(s, "%-10s %10lld %10lld %10lld %10lld %10lld "
+ "%10lld %10lld %10lld %10lld %10lld\n",
+ ra_pattern_names[i].name,
+ ra_stats[i][RA_ACCOUNT_COUNT],
+ ra_stats[i][RA_ACCOUNT_EOF],
+ ra_stats[i][RA_ACCOUNT_CACHE_HIT],
+ ra_stats[i][RA_ACCOUNT_IOCOUNT],
+ ra_stats[i][RA_ACCOUNT_SYNC],
+ ra_stats[i][RA_ACCOUNT_MMAP],
+ ra_stats[i][RA_ACCOUNT_METADATA],
+ ra_stats[i][RA_ACCOUNT_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ASYNC_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+ }
+
+ return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ readahead_stats_reset();
+ return size;
+}
+
+static const struct file_operations readahead_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = readahead_stats_open,
+ .write = readahead_stats_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init readahead_create_debugfs(void)
+{
+ struct dentry *root;
+ struct dentry *entry;
+ int i, j;
+
+ root = debugfs_create_dir("readahead", NULL);
+ if (!root)
+ goto out;
+
+ entry = debugfs_create_file("stats", 0644, root,
+ NULL, &readahead_stats_fops);
+ if (!entry)
+ goto out;
+
+ entry = debugfs_create_bool("stats_enable", 0644, root,
+ &readahead_stats_enable);
+ if (!entry)
+ goto out;
+
+ for (i = 0; i < RA_PATTERN_ALL; i++)
+ for (j = 0; j < RA_ACCOUNT_MAX; j++)
+ percpu_counter_init(&ra_stat[i][j], 0);
+
+ return 0;
+out:
+ printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+ return -ENOMEM;
+}
+
+late_initcall(readahead_create_debugfs);
+#endif
+
static inline void readahead_event(struct address_space *mapping,
pgoff_t offset,
unsigned long req_size,
@@ -44,6 +240,12 @@ static inline void readahead_event(struc
unsigned long async_size,
int actual)
{
+#ifdef CONFIG_READAHEAD_STATS
+ if (readahead_stats_enable)
+ readahead_stats(mapping, offset, req_size,
+ for_mmap, for_metadata,
+ pattern, start, size, async_size, actual);
+#endif
trace_readahead(mapping, offset, req_size,
pattern, start, size, async_size, actual);
}
--- linux-next.orig/mm/Kconfig 2012-02-08 18:46:04.000000000 +0800
+++ linux-next/mm/Kconfig 2012-02-11 12:03:14.000000000 +0800
@@ -396,3 +396,18 @@ config FRONTSWAP
and swap data is stored as normal on the matching swap device.
If unsure, say Y to enable frontswap.
+
+config READAHEAD_STATS
+ bool "Collect page cache readahead stats"
+ depends on DEBUG_FS
+ default n
+ help
+ This provides the readahead events accounting facilities.
+
+ To do readahead accounting for a workload:
+
+ echo 0 > /sys/kernel/debug/readahead/stats # reset counters
+ echo 1 > /sys/kernel/debug/readahead/stats_enable
+ # run the workload
+ echo 0 > /sys/kernel/debug/readahead/stats_enable
+ cat /sys/kernel/debug/readahead/stats # check counters