linux-kernel.vger.kernel.org archive mirror
* [RFC PATCHSET RESEND] ioblame: statistical IO analyzer
@ 2012-01-05 23:42 Tejun Heo
  2012-01-05 23:42 ` [PATCH 01/11] trace_event_filter: factorize filter creation Tejun Heo
                   ` (11 more replies)
  0 siblings, 12 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel

This is a re-post.  The original posting was on Dec 15th but it was
missing the cc to LKML.  I got some responses on that thread so I
didn't suspect LKML was missing.  If you're getting this for the
second time, my apologies.

Issues pointed out in the original thread are...

* Is the quick variant of backtrace gathering really necessary? -
  Still need to get performance numbers.

* TRACE_EVENT_CONDITION() can be used in some places. - Will be
  updated.

Original message follows.  Thanks.

Hello, guys.

Even with blktrace and tracepoints, getting insight into the IOs going
on in a system is very challenging.  A lot of IO operations happen
long after the action which triggered them has finished, and the
overall asynchronous nature of IO operations makes it difficult to
trace back the origin of a given IO.

ioblame is an attempt at providing better visibility into overall IO
behavior.  ioblame hooks into various tracepoints, tries to determine
who caused any given IO and how, and charges the IO accordingly.

On each IO completion, ioblame knows whom to charge the IO to (task),
how the IO got triggered (stack trace at the point of triggering, be
it page or inode dirtying or direct IO issue) and various information
about the IO itself (offset, size, how long it took and so on).
ioblame collects this information into histograms which are
configurable from userland through a debugfs interface.

For example, using ioblame, a user can acquire information like "this
task triggered IO with this stack trace on this file with the
following offset distribution".

For more details, please read Documentation/trace/ioblame.txt, which
I'll append to this message too for discussion.

This patchset contains the following 11 patches.

  0001-trace_event_filter-factorize-filter-creation.patch
  0002-trace_event_filter-add-trace_event_filter_-interface.patch
  0003-block-block_bio_complete-tracepoint-was-missing.patch
  0004-block-add-req-to-bio_-front-back-_merge-tracepoints.patch
  0005-block-abstract-disk-iteration-into-disk_iter.patch
  0006-writeback-move-struct-wb_writeback_work-to-writeback.patch
  0007-writeback-add-more-tracepoints.patch
  0008-block-add-block_touch_buffer-tracepoint.patch
  0009-vfs-add-fcheck-tracepoint.patch
  0010-stacktrace-implement-save_stack_trace_quick.patch
  0011-block-trace-implement-ioblame-IO-statistical-analyze.patch

0001-0002 export trace_event_filter so that ioblame can use it too.

0003 adds back the block_bio_complete TP invocation, which got lost
somehow.  This probably makes sense as a fix patch for 3.2.

0004-0006 update block layer in preparation.  0005 probably makes
sense as a standalone patch too.

0007-0009 add more tracepoints along the IO stack.

0010 adds a nimbler backtrace dump function, as ioblame dumps stack
traces extremely frequently.

0011 implements ioblame.

This is still in an early stage and I haven't done much performance
analysis yet.  Tentative testing shows it adds ~20% CPU overhead when
used on a memory-backed loopback device.

The patches are on top of mainline (42ebfc61cf "Merge branch
'stable...git/konrad/xen'") and perf/core (74eec26fac "perf tools: Add
ability to synthesize event according to a sample").

It's also available in the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-ioblame

diffstat follows.

 Documentation/trace/ioblame.txt    |  646 ++++++
 arch/x86/include/asm/stacktrace.h  |    2 
 arch/x86/kernel/stacktrace.c       |   40 
 block/blk-core.c                   |    5 
 block/genhd.c                      |   98 -
 fs/bio.c                           |    3 
 fs/fs-writeback.c                  |   34 
 fs/super.c                         |    2 
 include/linux/blk_types.h          |    7 
 include/linux/buffer_head.h        |    7 
 include/linux/fdtable.h            |    3 
 include/linux/fs.h                 |    3 
 include/linux/genhd.h              |   14 
 include/linux/ioblame.h            |   95 +
 include/linux/stacktrace.h         |    6 
 include/linux/writeback.h          |   18 
 include/trace/events/block.h       |   70 
 include/trace/events/vfs.h         |   40 
 include/trace/events/writeback.h   |  113 +
 kernel/stacktrace.c                |    6 
 kernel/trace/Kconfig               |   11 
 kernel/trace/Makefile              |    1 
 kernel/trace/blktrace.c            |    2 
 kernel/trace/ioblame.c             | 3479 +++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    6 
 kernel/trace/trace_events_filter.c |  379 ++--
 mm/page-writeback.c                |    2 
 27 files changed, 4872 insertions(+), 220 deletions(-)

Thanks.

--
tejun


ioblame - statistical IO analyzer with origin tracking

December, 2011		Tejun Heo <tj@kernel.org>


CONTENTS

1. Introduction
2. Overall design
3. Debugfs interface
3-1. Configuration
3-2. Monitoring
3-3. Data acquisition
4. Notes
5. Overheads


1. Introduction

In many workloads, IO throughput and latency have a large effect on
overall performance; however, due to their complexity and asynchronous
nature, it is very difficult to characterize what's going on.
blktrace and various tracepoints provide visibility into individual IO
operations but it is still extremely difficult to trace back to the
origin of those IO operations.

ioblame is a statistical IO analyzer which can track and record the
origin of IOs.  It keeps track of who dirtied pages and inodes, and,
on an actual IO, attributes it to the originator of the IO.

The design goals of ioblame are

* Minimally invasive - The analyzer shouldn't be invasive.  Except for
  adding some fields, mostly to block layer data structures, for
  tracking, ioblame gathers all information through well defined
  tracepoints and all tracking logic is contained in ioblame proper.

* Generic and detailed - There are many different IO paths and
  filesystems, which also go through changes regularly.  The analyzer
  should be able to report detailed enough results covering most cases
  without requiring frequent adaptation.  ioblame uses stack traces at
  key points combined with information from generic layers to
  categorize IOs.  This gives detailed enough information into varying
  IO paths without requiring specific adaptations.

* Low overhead - Overhead both in terms of memory and processor cycles
  should be low enough so that the analyzer can be used in IO-heavy
  production environments.  ioblame keeps hot data structures compact
  and mostly read-only and avoids synchronization on hot paths by
  using RCU and taking advantage of the fact that statistics don't
  have to be completely accurate.

* Dynamic and customizable - There are many different aspects of IOs
  which can be irrelevant or interesting depending on the situation.
  From an analysis point of view, always recording all collected data
  would be ideal but is very wasteful in most situations.  ioblame
  lets users decide what information to gather so that they can
  acquire relevant information without wasting resources
  unnecessarily.  ioblame also allows dynamic configuration while the
  analyzer is online which enables dynamic drill down of IO behaviors.


2. Overall design

ioblame tracks the following three object types.

* Role: This tracks 'who' is taking an action.  Normally corresponds
  to a thread.  Can also be specified by userland (not implemented
  yet).

* Intent: Stack trace + modifier.  An intent groups actions of the
  same type.  As the name suggests, modifier modifies the intent and
  there can be multiple intents with the same stack trace but
  different modifiers.  Currently, only writeback modifiers are
  implemented, which denote why the writeback action is occurring -
  i.e. wb_reason.

* Act: This is the combination of role, intent and the inode being
  operated on, and is the ultimate tracking unit ioblame uses.  IOs
  are charged to and statistics are gathered by acts.

ioblame uses the same indexing data structure for all three types of
objects.  Objects are never linked directly using pointers and every
access goes through the index.  This avoids expensive strict object
lifetime management.  Objects are located either by their content via
hash table or by id, which contains a generation number.

To attribute data writebacks to the originator, ioblame maintains a
table indexed by page frame number which keeps track of which act
dirtied which pages.  For each IO, the target pages are looked up in
the table and the dirtying act is charged for the IO.  Note that,
currently, each IO is charged as a whole to a single act - e.g. all of
an IO for writeback encompassing multiple dirtiers will be charged to
the first dirtying act found.  This simplifies data collection and
reporting while not losing too much information - writebacks tend to
be naturally grouped and IOPS (IO operations per second) is often
more significant than the length of each IO.
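
In rough pseudo-C (a simplified sketch, not the actual implementation
- the real table is built and grown dynamically and sized as described
in section 5), the data writeback tracking above boils down to:

  #include <stdint.h>

  static uint16_t *pgtbl;	/* [pfn] -> dirtying act NR, 0 == unknown */

  /* called when an act dirties a page */
  static void note_page_dirtied(unsigned long pfn, uint16_t act_nr)
  {
	pgtbl[pfn] = act_nr;
  }

  /* on IO completion, charge the whole IO to the first dirtying act found */
  static uint16_t act_for_io(const unsigned long *pfns, int nr_pages)
  {
	int i;

	for (i = 0; i < nr_pages; i++)
		if (pgtbl[pfns[i]])
			return pgtbl[pfns[i]];
	return 0;	/* unknown - ends up in the "lost" accounting */
  }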

inode writeback tracking is more involved as different filesystems
handle metadata updates and writebacks differently.  ioblame uses
per-inode and buffer_head operation tracking to attribute inode
writebacks to the originator.

After all the tracking, on each IO completion, ioblame knows the
offset and size of the IO, the act to be charged, and how long the IO
spent in the queue and on the device.  From this information, ioblame
produces fields which can be recorded.

All statistics are recorded in histograms, called counters, which have
eight slots.  Userland can specify the number of counters, the IO
directions to consider, what each counter records, the boundary values
which decide histogram slots and an optional filter for more complex
filtering conditions.

All interactions including configuration and data acquisition happen
via files under /sys/kernel/debug/ioblame/.


3. Debugfs interface

3-1. Configuration

* devs				- can be changed anytime

  Specifies the devices ioblame is enabled for.  ioblame will only
  track operations on devices which are explicitly enabled in this
  file.

  It accepts a whitespace-separated list of MAJ:MINs or block device
  names, each with an optional preceding '!' for negation.  Opening
  with O_TRUNC clears all existing entries.  For example,

  $ echo sda sdb > devs		# disables all devices and then enables sd[ab]
  $ echo sdc >> devs		# sd[abc] enabled
  $ echo !8:0 >> devs		# sd[bc] enabled
  $ cat devs
  8:16 sdb
  8:32 sdc

* max_{role|intent|act}s	- can be changed while disabled

  Specifies the maximum number of each object type.  If the number of
  a certain object type exceeds the limit, IOs will be attributed to a
  special NOMEM object.

* ttl_secs			- can be changed anytime

  Specifies the TTL of roles and acts.  Roles are reclaimed after at
  least TTL has passed since the matching thread exited or execed and
  assumed another tid.  Acts are reclaimed after being unused for at
  least TTL.

  Note that reclaiming is tied to userland reading counters data.  If
  userland doesn't read counters, reclaiming won't happen.

* nr_counters			- can be changed while disabled

  Specifies the number of counters.  Each act will have the specified
  number of histograms associated with it.  Individual counters can be
  configured using files under the counters subdirectory.  Any write
  to this file clears all counter settings.

* counters/NR			- can be changed anytime

  Specifies each counter.  Its format is

    DIR FIELD B0 B1 B2 B3 B4 B5 B6 B7 B8

  DIR is any combination of letters 'R', 'A', and 'W', each
  representing reads (sans readaheads), readaheads and writes.

  FIELD is the field to record in the histogram and is one of the
  following.

    offset	: IO offset scaled to 0-65535
    size	: IO size
    wait_time	: time spent queued in usecs
    io_time	: time spent on device in usecs
    seek_dist	: seek dist from IO completed right before, scaled to 0-65536

  B[0-8] are the boundaries for the histogram.  Histograms have eight
  slots.  If (FIELD < B[0] || (B[8] != 0 && FIELD >= B[8])), it's not
  recorded; otherwise, FIELD is counted in the slot whose boundaries
  enclose it.  e.g. if FIELD is >= B[2] and < B[3], it's recorded in
  the third slot (slot 2).

  B8 can be zero indicating no upper limit but all other boundaries
  must be equal to or larger than the boundary before them.
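
  As a concrete reading of the slot rules above, the following sketch
  (illustrative C, not taken from the patch) picks the slot for a
  value:

  static int pick_slot(unsigned long long field,
		       const unsigned long long b[9])
  {
	int slot;

	if (field < b[0] || (b[8] != 0 && field >= b[8]))
		return -1;		/* not recorded */

	for (slot = 7; slot > 0; slot--)
		if (field >= b[slot])
			return slot;	/* slot covers [b[slot], b[slot+1]) */
	return 0;			/* [b[0], b[1]) */
  }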

  e.g. To record offsets of reads and readaheads in counter 0,

  $ echo RA offset 0 8192 16384 24576 32768 40960 49152 57344 0 > counters/0

  If higher resolution than 8 slots is necessary, two counters can be
  used.

  $ echo RA offset 0 4096 8192 12288 16384 20480 24576 28672 32768 > counters/0
  $ echo RA offset 32768 36864 40960 45056 49152 53248 57344 61440 0 \
								   > counters/1

  Writing an empty string disables the counter.

  $ echo > 1
  $ cat 1
  --- disabled

* counters/NR_filter		- can be changed anytime

  Specifies a trace event type filter for more complex conditions.
  For example, it allows conditions like "if the IO is in the latter
  half of the device, the size is smaller than 128k and the IO time is
  equal to or longer than 10ms".

  To record IO time in counter 0 with the above condition,

  $ echo 'offset >= 16384 && size < 131072 && io_time >= 10000' > 0_filter
  $ echo RAW io_time 10000 25000 50000 100000 500000 1000000 2500000 \
							5000000 0 > 0

  Any FIELD can be used in the filter specification.  For more details
  about the filter format, please read the "Event filtering" section
  in Documentation/trace/events.txt.

  Writing '0' to the filter file removes the filter.  Note that
  writing a malformed filter disables the filter and reading it back
  afterwards returns an error message explaining why parsing failed.


3-2. Monitoring (read only)

* nr_{roles|intents|acts}

  Returns the number of objects of the type.  The number of roles and
  acts can decrease after reclaiming but nr_intents only increases
  while ioblame is enabled.

* stats/idx_nomem

  How many times role, intent or act creation failed because memory
  allocation failed while extending the index to accommodate a new
  object.

* stats/idx_nospc

  How many times role, intent or act creation failed because the limit
  specified by max_{role|intent|act}s was reached.

* stats/node_nomem

  How many times allocation of the role, intent or act object itself
  failed.

* stats/pgtree_nomem

  How many times the page tree, which maps page frame numbers to
  dirtying acts, failed to expand due to memory allocation failure.

* stats/cnts_nomem

  How many times per-act counter allocation failed.

* stats/iolog_overflow

  How many iolog entries have been lost due to overflow.


3-3. Data acquisition (read only)

* iolog

  iolog is primarily a debug feature and dumps IOs as they complete.

  $ cat iolog
  R 4096 @ 74208 pid-5806 (ls) dev=0x800010 ino=0x2 gen=0x0
    #39 modifier=0x0
    [ffffffff811a0696] submit_bh+0xe6/0x120
    [ffffffff811a1f56] ll_rw_block+0xa6/0xb0
    [ffffffff81239a43] ext4_bread+0x43/0x80
    [ffffffff8123f4e3] htree_dirblock_to_tree+0x33/0x190
    [ffffffff8123f79a] ext4_htree_fill_tree+0x15a/0x250
    [ffffffff8122e26e] ext4_readdir+0x10e/0x5d0
    [ffffffff811832d0] vfs_readdir+0xa0/0xc0
    [ffffffff81183450] sys_getdents+0x80/0xe0
    [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
  W 4096 @ 0 pid-20 (sync_supers) dev=0x800010 ino=0x0 gen=0x0
    #44 modifier=0x0
    [ffffffff811a0696] submit_bh+0xe6/0x120
    [ffffffff811a371d] __sync_dirty_buffer+0x4d/0xd0
    [ffffffff811a37ae] sync_dirty_buffer+0xe/0x10
    [ffffffff81250ee8] ext4_commit_super+0x188/0x230
    [ffffffff81250fae] ext4_write_super+0x1e/0x30
    [ffffffff811738fa] sync_supers+0xfa/0x100
    [ffffffff8113d3e1] bdi_sync_supers+0x41/0x60
    [ffffffff810ad4c6] kthread+0x96/0xa0
    [ffffffff81a3dcb4] kernel_thread_helper+0x4/0x10
  W 4096 @ 8512 pid-5813 dev=0x800010 ino=0x74 gen=0x4cc12c59
    #45 modifier=0x10000002
    [ffffffff812342cb] ext4_setattr+0x25b/0x4c0
    [ffffffff8118b9ba] notify_change+0x10a/0x2b0
    [ffffffff8119ef87] utimes_common+0xd7/0x180
    [ffffffff8119f0c9] do_utimes+0x99/0xf0
    [ffffffff8119f21d] sys_utimensat+0x2d/0x90
    [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
  ...

  The first entry is a 4k read at sector 74208 (unscaled) on /dev/sdb
  issued by ls.  The second is sync_supers writing out a dirty super
  block.  The third is an inode writeback from "touch FILE; sync".
  Note that the modifier is set (it's indicating WB_REASON_SYNC).

  Here is another example from "cp FILE FILE1" and then waiting.

  W 4096 @ 0 pid-20 (sync_supers) dev=0x800010 ino=0x0 gen=0x0
    #16 modifier=0x0
    [ffffffff8139cd94] submit_bio+0x74/0x100
    [ffffffff811cba3b] submit_bh+0xeb/0x130
    [ffffffff811cecd2] __sync_dirty_buffer+0x52/0xd0
    [ffffffff811ced63] sync_dirty_buffer+0x13/0x20
    [ffffffff81281fa8] ext4_commit_super+0x188/0x230
    [ffffffff81282073] ext4_write_super+0x23/0x40
    [ffffffff8119c8d2] sync_supers+0x102/0x110
    [ffffffff81162c99] bdi_sync_supers+0x49/0x60
    [ffffffff810bc216] kthread+0xb6/0xc0
    [ffffffff81ab13b4] kernel_thread_helper+0x4/0x10
  ...
  W 4096 @ 8512 pid-668 dev=0x800010 ino=0x73 gen=0x17b5119d
    #23 modifier=0x10000003
    [ffffffff811c55b0] __mark_inode_dirty+0x220/0x330
    [ffffffff811cccfb] generic_write_end+0x6b/0xa0
    [ffffffff81268b10] ext4_da_write_end+0x150/0x360
    [ffffffff811444bb] generic_file_buffered_write+0x18b/0x290
    [ffffffff81146938] __generic_file_aio_write+0x238/0x460
    [ffffffff81146bd8] generic_file_aio_write+0x78/0xf0
    [ffffffff8125ef9f] ext4_file_write+0x6f/0x2a0
    [ffffffff811997f2] do_sync_write+0xe2/0x120
    [ffffffff8119a428] vfs_write+0xc8/0x180
    [ffffffff8119a5e1] sys_write+0x51/0x90
    [ffffffff81aafe2b] system_call_fastpath+0x16/0x1b
  ...
  W 524288 @ 3276800 pid-668 dev=0x800010 ino=0x73 gen=0x17b5119d
    #25 modifier=0x10000003
    [ffffffff811cc86c] __set_page_dirty+0x4c/0xd0
    [ffffffff811cc956] mark_buffer_dirty+0x66/0xa0
    [ffffffff811cca39] __block_commit_write+0xa9/0xe0
    [ffffffff811ccc42] block_write_end+0x42/0x90
    [ffffffff811cccc3] generic_write_end+0x33/0xa0
    [ffffffff81268b10] ext4_da_write_end+0x150/0x360
    [ffffffff811444bb] generic_file_buffered_write+0x18b/0x290
    [ffffffff81146938] __generic_file_aio_write+0x238/0x460
    [ffffffff81146bd8] generic_file_aio_write+0x78/0xf0
    [ffffffff8125ef9f] ext4_file_write+0x6f/0x2a0
    [ffffffff811997f2] do_sync_write+0xe2/0x120
    [ffffffff8119a428] vfs_write+0xc8/0x180
    [ffffffff8119a5e1] sys_write+0x51/0x90
    [ffffffff81aafe2b] system_call_fastpath+0x16/0x1b
  ...

  The first entry is ext4 marking the super block dirty.  After a
  while, periodic writeback kicks in (modifier 0x10000003) and the
  inode dirtied by cp is written back, followed by the dirty data
  pages.

  At this point, iolog is mostly a debug feature.  The output format
  is quite verbose and it isn't particularly performant.  If
  necessary, it can be extended to use the trace ringbuffer and grow a
  per-cpu binary interface.

* intents

  Dump of intents in human readable form.

  $ cat intents
  #0 modifier=0x0
  #1 modifier=0x0
  #2 modifier=0x0
  [ffffffff81189a6a] file_update_time+0xca/0x150
  [ffffffff81122030] __generic_file_aio_write+0x200/0x460
  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
  [ffffffff8122ea94] ext4_file_write+0x64/0x280
  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
  [ffffffff811b7401] aio_run_iocb+0x61/0x190
  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
  [ffffffff811b867b] sys_io_submit+0xb/0x10
  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
  #3 modifier=0x0
  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
  [ffffffff812353b2] ext4_direct_IO+0x1b2/0x3f0
  [ffffffff81121d6a] generic_file_direct_write+0xba/0x180
  [ffffffff8112210b] __generic_file_aio_write+0x2db/0x460
  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
  [ffffffff8122ea94] ext4_file_write+0x64/0x280
  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
  [ffffffff811b7401] aio_run_iocb+0x61/0x190
  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
  [ffffffff811b867b] sys_io_submit+0xb/0x10
  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
  #4 modifier=0x0
  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
  [ffffffff8126da71] ext4_ind_direct_IO+0x121/0x460
  [ffffffff81235436] ext4_direct_IO+0x236/0x3f0
  [ffffffff81122db2] generic_file_aio_read+0x6b2/0x740
  ...

  The # prefixed number is the NR of the intent, used to link intents
  from statistics.  The modifier and stack trace follow.  The first
  two entries are special - 0 is the nomem intent and 1 is the lost
  intent.  The former is used when an intent can't be created because
  allocation failed or max_intents is reached.  The latter is used
  when reclaiming resulted in loss of tracking info and the IO can't
  be reported exactly.

  This file can be seeked by intent NR, i.e. seeking to 3 and reading
  will return intent #3 and after.  Because intents are never
  destroyed while ioblame is enabled, this allows a userland tool to
  discover new intents since the last reading.  Seeking to the number
  of currently known intents and reading returns only the newly
  created intents.

* intents_bin

  Identical to intents but in a compact binary format and likely to be
  much more performant.  Each entry in the file is in the following
  format as defined in include/linux/ioblame.h.

  #define IOB_INTENTS_BIN_VER	1

  /* intent modifier */
  #define IOB_MODIFIER_TYPE_SHIFT	28
  #define IOB_MODIFIER_TYPE_MASK	0xf0000000U
  #define IOB_MODIFIER_VAL_MASK		(~IOB_MODIFIER_TYPE_MASK)

  /* val contains wb_reason */
  #define IOB_MODIFIER_WB		(1 << IOB_MODIFIER_TYPE_SHIFT)

  /* record struct for /sys/kernel/debug/ioblame/intents_bin */
  struct iob_intent_bin_record {
	uint16_t	len;
	uint16_t	ver;
	uint32_t	nr;
	uint32_t	modifier;
	uint32_t	__pad;
	uint64_t	trace[];
  } __packed;
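
  As a sketch (not a shipped tool), a userland reader could walk the
  record stream like this, assuming len is the total record length in
  bytes and records are packed back to back:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  struct iob_intent_hdr {	/* fixed 16-byte header of a record */
	uint16_t	len;
	uint16_t	ver;
	uint32_t	nr;
	uint32_t	modifier;
	uint32_t	__pad;
  };

  static void dump_intents(const char *buf, size_t size)
  {
	size_t pos = 0;

	while (pos + sizeof(struct iob_intent_hdr) <= size) {
		struct iob_intent_hdr hdr;
		unsigned int i, nr_trace;

		memcpy(&hdr, buf + pos, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || pos + hdr.len > size)
			break;
		nr_trace = (hdr.len - sizeof(hdr)) / sizeof(uint64_t);

		printf("#%u modifier=0x%x\n", hdr.nr, hdr.modifier);
		for (i = 0; i < nr_trace; i++) {
			uint64_t ip;

			memcpy(&ip, buf + pos + sizeof(hdr) + i * sizeof(ip),
			       sizeof(ip));
			printf("  [%016llx]\n", (unsigned long long)ip);
		}
		pos += hdr.len;
	}
  }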

* counters_pipe

  Dumps counters and triggers reclaim.  Opening and reading this file
  returns counters of all acts which have been used since the last
  open.

  Because roles and acts shouldn't be reclaimed before the updated
  counters are reported, reclaiming is tied to counters_pipe access.
  Opening counters_pipe prepares for reclaiming and closing executes
  it.  Act reclaiming works at ttl_secs / 2 granularity.  ioblame
  tries to stay close to the lifetime timings requested by ttl_secs
  but note that reclaim happens only on counters_pipe open/close.

  There can only be one user of counters_pipe at any given moment;
  otherwise, file operations will fail and the output and reclaiming
  timings are undefined.

  All reported histogram counters are u32 and are never reset.  It's
  the user's responsibility to calculate the delta if necessary.  Note
  that counters_pipe reports all acts used since the last open and the
  counters are not guaranteed to have been updated - i.e. there can be
  spurious acts in the output.
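
  Since the counters are u32 and monotonically increasing, a plain
  unsigned subtraction gives a wrap-safe delta between two samples,
  e.g.

  static inline uint32_t cnt_delta(uint32_t prev, uint32_t curr)
  {
	return curr - prev;	/* well defined modulo 2^32 */
  }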

  counters_pipe is seekable by act NR.

  In the following example, two counters are configured - the first
  one counts read offsets and the second write offsets.  A file is
  copied using dd with direct flags.

  $ cat counters_pipe
  pid-20 (sync_supers) int=66 dev=0x800010 ino=0x0 gen=0x0
	  0       0       0       0       0       0       0       0
	  2       0       0       0       0       0       0       0
  pid-1708 int=58 dev=0x800010 ino=0x71 gen=0x3e0d99f2
	 11       0       0       0       0       0       0       0
	  0       0       0       0       0       0       0       0
  pid-1708 int=59 dev=0x800010 ino=0x71 gen=0x3e0d99f2
	 11       0       0       0       0       0       0       0
	  0       0       0       0       0       0       0       0
  pid-1708 int=62 dev=0x800010 ino=0x2727 gen=0xf4739822
	  0       0       0       0       0       0       0       0
	 10       0       0       0       0       0       0       0
  pid-1708 int=63 dev=0x800010 ino=0x2727 gen=0xf4739822
	  0       0       0       0       0       0       0       0
	 10       0       0       0       0       0       0       0
  pid-1708 int=31 dev=0x800010 ino=0x2727 gen=0xf4739822
	  0       0       0       0       0       0       0       0
	  2       0       0       0       0       0       0       0
  pid-1708 int=65 dev=0x800010 ino=0x2727 gen=0xf4739822
	  0       0       0       0       0       0       0       0
	  1       0       0       0       0       0       0       0

  pid-1708 is the dd which copied the file.  The output is separated
  by pid-* lines and each section corresponds to a single act, which
  has an intent NR and a file (dev:ino:gen) associated with it.  One
  8-slot histogram is printed per line in ascending order.

  The filesystem is mostly empty and, from the output, both files seem
  to be located in the first 1/8th of the disk.

  All reads happened through intents 58 and 59.  From the intents
  file, they are,

  #58 modifier=0x0
  [ffffffff8139d974] submit_bio+0x74/0x100
  [ffffffff811d5dba] __blockdev_direct_IO+0xc2a/0x3830
  [ffffffff8129fe51] ext4_ind_direct_IO+0x121/0x470
  [ffffffff8126678e] ext4_direct_IO+0x23e/0x400
  [ffffffff81147b05] generic_file_aio_read+0x6d5/0x760
  [ffffffff81199932] do_sync_read+0xe2/0x120
  [ffffffff8119a5c5] vfs_read+0xc5/0x180
  [ffffffff8119a781] sys_read+0x51/0x90
  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b
  #59 modifier=0x0
  [ffffffff8139d974] submit_bio+0x74/0x100
  [ffffffff811d7345] __blockdev_direct_IO+0x21b5/0x3830
  [ffffffff8129fe51] ext4_ind_direct_IO+0x121/0x470
  [ffffffff8126678e] ext4_direct_IO+0x23e/0x400
  [ffffffff81147b05] generic_file_aio_read+0x6d5/0x760
  [ffffffff81199932] do_sync_read+0xe2/0x120
  [ffffffff8119a5c5] vfs_read+0xc5/0x180
  [ffffffff8119a781] sys_read+0x51/0x90
  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b

  Except for hitting slightly different paths in __blockdev_direct_IO,
  they both are ext4 direct reads as expected.  Writes seem more
  diversified and upon examination, #62 and #63 are ext4 direct
  writes.  #31 and #65 are more interesting.

  #31 modifier=0x0
  [ffffffff811cd0cc] __set_page_dirty+0x4c/0xd0
  [ffffffff811cd1b6] mark_buffer_dirty+0x66/0xa0
  [ffffffff811cd299] __block_commit_write+0xa9/0xe0
  [ffffffff811cd4a2] block_write_end+0x42/0x90
  [ffffffff811cd523] generic_write_end+0x33/0xa0
  [ffffffff81269720] ext4_da_write_end+0x150/0x360
  [ffffffff81144878] generic_file_buffered_write+0x188/0x2b0
  [ffffffff81146d18] __generic_file_aio_write+0x238/0x460
  [ffffffff81146fb8] generic_file_aio_write+0x78/0xf0
  [ffffffff8125fbaf] ext4_file_write+0x6f/0x2a0
  [ffffffff81199812] do_sync_write+0xe2/0x120
  [ffffffff8119a308] vfs_write+0xc8/0x180
  [ffffffff8119a4c1] sys_write+0x51/0x90
  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b

  This is a buffered write.  It turns out the file size didn't match
  the specified blocksize, so dd turned off O_DIRECT for the last IO
  and issued a buffered write for the remainder.

  Note that the actual IO submission is not visible in the stack trace
  as the IOs are charged to the dirtying act.  Actual IOs are likely
  to be executed from fsync(2).

  #65 modifier=0x0
  [ffffffff811c5e10] __mark_inode_dirty+0x220/0x330
  [ffffffff81267edd] ext4_da_update_reserve_space+0x13d/0x230
  [ffffffff8129006d] ext4_ext_map_blocks+0x13dd/0x1dc0
  [ffffffff81268a31] ext4_map_blocks+0x1b1/0x260
  [ffffffff81269c52] mpage_da_map_and_submit+0xb2/0x480
  [ffffffff8126a84a] ext4_da_writepages+0x30a/0x6e0
  [ffffffff8114f584] do_writepages+0x24/0x40
  [ffffffff811468cb] __filemap_fdatawrite_range+0x5b/0x60
  [ffffffff8114692a] filemap_write_and_wait_range+0x5a/0x80
  [ffffffff8125ff64] ext4_sync_file+0x74/0x440
  [ffffffff811ca31b] vfs_fsync_range+0x2b/0x40
  [ffffffff811ca34c] vfs_fsync+0x1c/0x20
  [ffffffff811ca58a] do_fsync+0x3a/0x60
  [ffffffff811ca5e0] sys_fsync+0x10/0x20
  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b

  And this is dd fsync(2)ing and marking the inode for writeback.  As
  with data writeback, IO submission is not visible in the stack
  trace.

* counters_pipe_bin

  Identical to counters_pipe but in a compact binary format and likely
  to be much more performant.  Each entry in the file is in the
  following format as defined in include/linux/ioblame.h.

  #define IOBC_PIPE_BIN_VER	1

  /* record struct for /sys/kernel/debug/ioblame/counters_pipe_bin */
  struct iobc_pipe_bin_record {
	  uint16_t	len;
	  uint16_t	ver;
	  int32_t	id;		/* >0 pid or negated user id */
	  uint32_t	intent_nr;	/* associated intent */
	  uint32_t	dev;
	  uint64_t	ino;
	  uint32_t	gen;
	  uint32_t	__pad;
	  uint32_t	cnts[];		/* [type][slot] */
  } __packed;
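
  Assuming len is the total record length in bytes, the fixed part of
  the record is 32 bytes, so cnts[] holds (len - 32) / 4 u32 entries,
  laid out [type][slot] as noted in the struct.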

  Note that counters_pipe and counters_pipe_bin can't be used in
  parallel.  Only one opener is allowed across the two files at any
  given moment.


4. Notes

* By the time ioblame reports IOs or counters, the task which gets
  charged might have already exited; this is why ioblame prints the
  task command in some reports but not in others.  A userland tool is
  advised to use a combination of live task listing and process
  accounting to match pids to commands.

* dev:ino:gen can be mapped to a filename without scanning the whole
  filesystem by constructing an FS-specific filehandle, opening it
  with open_by_handle_at(2) and then readlink(2)ing /proc/self/fd/FD.
  This will return the full path as long as the dentry is in cache,
  which is likely if data acquisition and mapping don't happen too
  long after the IOs.
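
  As a rough sketch (not part of ioblame; it assumes an ext4-style
  FILEID_INO32_GEN handle, i.e. handle_type 1 with a u32 ino + u32 gen
  payload, and it needs CAP_DAC_READ_SEARCH; mount_fd is an fd on the
  filesystem's mount point derived from the dev part), the mapping
  could look like:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  static int ino_gen_to_path(int mount_fd, uint32_t ino, uint32_t gen,
			     char *path, size_t len)
  {
	union {
		struct file_handle fh;
		char buf[sizeof(struct file_handle) + 2 * sizeof(uint32_t)];
	} h = { .fh = { .handle_bytes = 2 * sizeof(uint32_t),
			.handle_type = 1 /* FILEID_INO32_GEN */ } };
	uint32_t *payload = (uint32_t *)h.fh.f_handle;
	char proc[64];
	ssize_t ret;
	int fd;

	payload[0] = ino;
	payload[1] = gen;

	fd = open_by_handle_at(mount_fd, &h.fh, O_RDONLY);
	if (fd < 0)
		return -1;

	snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
	ret = readlink(proc, path, len - 1);
	close(fd);
	if (ret < 0)
		return -1;
	path[ret] = '\0';
	return 0;
  }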

* Mechanism to specify userland role ID is not implemented yet.


5. Overheads

On x86_64, a role is 104 bytes, an intent 32 + 8 * stack_depth bytes
and an act 72 bytes.  Intents are allocated using kzalloc() and there
shouldn't be too many of them.  Both roles and acts have their own
kmem_cache and can be monitored via /proc/slabinfo.

The counters for each act occupy 32 * nr_counters bytes and are
aligned to a cacheline.  Counters are allocated only as necessary.
The iob_counters kmem_cache is created dynamically on enable.

The size of the page frame number -> dirtier mapping table is
proportional to the amount of available physical memory.  If
max_acts <= 65536, 2 bytes are used per PAGE_SIZE page.  With 4k
pages, at most ~0.049% of memory can be used.  If max_acts > 65536,
4 bytes are used, doubling the percentage to ~0.098%.  The table also
grows dynamically.
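
For example, on a machine with 8GiB of memory and 4k pages, that is
2,097,152 page frames, so a fully populated table would take about
4MiB with 2-byte entries or 8MiB with 4-byte entries.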

There are also indexing data structures used - hash tables, id[ra]s
and a radix tree.  There are three hash tables, each sized according
to max_{roles|intents|acts}.  The maximum memory usage by hash tables
is sizeof(void *) * (max_roles + max_intents + max_acts).  Memory used
by other indexing structures should be negligible.

Preliminary tests w/ fio ssd-test on a loopback device on tmpfs, which
is purely CPU cycle bound, show a ~20% throughput hit.

*** TODO: add performance testing results and explain involved CPU
    overheads.



* [PATCH 01/11] trace_event_filter: factorize filter creation
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 02/11] trace_event_filter: add trace_event_filter_*() interface Tejun Heo
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

There are four places where a new filter for a given filter string is
created, each involving several different steps.  This patch factors
those steps into create_[system_]filter() functions which in turn make
use of create_filter_{start|finish}() for the common parts.

The only functional change is that if replace_filter_string() is
requested and fails, creation fails without any side effect instead of
being ignored.

Note that the system filter is now installed after the processing is
complete, which makes freeing before and then restoring the filter
string on error unnecessary.

-v2: Rebased to resolve conflict with 49aa29513e and updated both
     create_filter() functions to always set *filterp instead of
     requiring the caller to clear it to %NULL on entry.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/trace/trace_events_filter.c |  283 ++++++++++++++++++------------------
 1 files changed, 142 insertions(+), 141 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index f04cc31..24aee71 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1738,11 +1738,121 @@ static int replace_system_preds(struct event_subsystem *system,
 	return -ENOMEM;
 }
 
+static int create_filter_start(char *filter_str, bool set_str,
+			       struct filter_parse_state **psp,
+			       struct event_filter **filterp)
+{
+	struct event_filter *filter;
+	struct filter_parse_state *ps = NULL;
+	int err = 0;
+
+	WARN_ON_ONCE(*psp || *filterp);
+
+	/* allocate everything, and if any fails, free all and fail */
+	filter = __alloc_filter();
+	if (filter && set_str)
+		err = replace_filter_string(filter, filter_str);
+
+	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
+
+	if (!filter || !ps || err) {
+		kfree(ps);
+		__free_filter(filter);
+		return -ENOMEM;
+	}
+
+	/* we're committed to creating a new filter */
+	*filterp = filter;
+	*psp = ps;
+
+	parse_init(ps, filter_ops, filter_str);
+	err = filter_parse(ps);
+	if (err && set_str)
+		append_filter_err(ps, filter);
+	return err;
+}
+
+static void create_filter_finish(struct filter_parse_state *ps)
+{
+	if (ps) {
+		filter_opstack_clear(ps);
+		postfix_clear(ps);
+		kfree(ps);
+	}
+}
+
+/**
+ * create_filter - create a filter for a ftrace_event_call
+ * @call: ftrace_event_call to create a filter for
+ * @filter_str: filter string
+ * @set_str: remember @filter_str and enable detailed error in filter
+ * @filterp: out param for created filter (always updated on return)
+ *
+ * Creates a filter for @call with @filter_str.  If @set_str is %true,
+ * @filter_str is copied and recorded in the new filter.
+ *
+ * On success, returns 0 and *@filterp points to the new filter.  On
+ * failure, returns -errno and *@filterp may point to %NULL or to a new
+ * filter.  In the latter case, the returned filter contains error
+ * information if @set_str is %true and the caller is responsible for
+ * freeing it.
+ */
+static int create_filter(struct ftrace_event_call *call,
+			 char *filter_str, bool set_str,
+			 struct event_filter **filterp)
+{
+	struct event_filter *filter = NULL;
+	struct filter_parse_state *ps = NULL;
+	int err;
+
+	err = create_filter_start(filter_str, set_str, &ps, &filter);
+	if (!err) {
+		err = replace_preds(call, filter, ps, filter_str, false);
+		if (err && set_str)
+			append_filter_err(ps, filter);
+	}
+	create_filter_finish(ps);
+
+	*filterp = filter;
+	return err;
+}
+
+/**
+ * create_system_filter - create a filter for an event_subsystem
+ * @system: event_subsystem to create a filter for
+ * @filter_str: filter string
+ * @filterp: out param for created filter (always updated on return)
+ *
+ * Identical to create_filter() except that it creates a subsystem filter
+ * and always remembers @filter_str.
+ */
+static int create_system_filter(struct event_subsystem *system,
+				char *filter_str, struct event_filter **filterp)
+{
+	struct event_filter *filter = NULL;
+	struct filter_parse_state *ps = NULL;
+	int err;
+
+	err = create_filter_start(filter_str, true, &ps, &filter);
+	if (!err) {
+		err = replace_system_preds(system, ps, filter_str);
+		if (!err) {
+			/* System filters just show a default message */
+			kfree(filter->filter_string);
+			filter->filter_string = NULL;
+		} else {
+			append_filter_err(ps, filter);
+		}
+	}
+	create_filter_finish(ps);
+
+	*filterp = filter;
+	return err;
+}
+
 int apply_event_filter(struct ftrace_event_call *call, char *filter_string)
 {
-	struct filter_parse_state *ps;
 	struct event_filter *filter;
-	struct event_filter *tmp;
 	int err = 0;
 
 	mutex_lock(&event_mutex);
@@ -1759,49 +1869,30 @@ int apply_event_filter(struct ftrace_event_call *call, char *filter_string)
 		goto out_unlock;
 	}
 
-	err = -ENOMEM;
-	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
-	if (!ps)
-		goto out_unlock;
-
-	filter = __alloc_filter();
-	if (!filter) {
-		kfree(ps);
-		goto out_unlock;
-	}
-
-	replace_filter_string(filter, filter_string);
-
-	parse_init(ps, filter_ops, filter_string);
-	err = filter_parse(ps);
-	if (err) {
-		append_filter_err(ps, filter);
-		goto out;
-	}
+	err = create_filter(call, filter_string, true, &filter);
 
-	err = replace_preds(call, filter, ps, filter_string, false);
-	if (err) {
-		filter_disable(call);
-		append_filter_err(ps, filter);
-	} else
-		call->flags |= TRACE_EVENT_FL_FILTERED;
-out:
 	/*
 	 * Always swap the call filter with the new filter
 	 * even if there was an error. If there was an error
 	 * in the filter, we disable the filter and show the error
 	 * string
 	 */
-	tmp = call->filter;
-	rcu_assign_pointer(call->filter, filter);
-	if (tmp) {
-		/* Make sure the call is done with the filter */
-		synchronize_sched();
-		__free_filter(tmp);
+	if (filter) {
+		struct event_filter *tmp = call->filter;
+
+		if (!err)
+			call->flags |= TRACE_EVENT_FL_FILTERED;
+		else
+			filter_disable(call);
+
+		rcu_assign_pointer(call->filter, filter);
+
+		if (tmp) {
+			/* Make sure the call is done with the filter */
+			synchronize_sched();
+			__free_filter(tmp);
+		}
 	}
-	filter_opstack_clear(ps);
-	postfix_clear(ps);
-	kfree(ps);
 out_unlock:
 	mutex_unlock(&event_mutex);
 
@@ -1811,7 +1902,6 @@ out_unlock:
 int apply_subsystem_event_filter(struct event_subsystem *system,
 				 char *filter_string)
 {
-	struct filter_parse_state *ps;
 	struct event_filter *filter;
 	int err = 0;
 
@@ -1835,48 +1925,19 @@ int apply_subsystem_event_filter(struct event_subsystem *system,
 		goto out_unlock;
 	}
 
-	err = -ENOMEM;
-	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
-	if (!ps)
-		goto out_unlock;
-
-	filter = __alloc_filter();
-	if (!filter)
-		goto out;
-
-	/* System filters just show a default message */
-	kfree(filter->filter_string);
-	filter->filter_string = NULL;
-
-	/*
-	 * No event actually uses the system filter
-	 * we can free it without synchronize_sched().
-	 */
-	__free_filter(system->filter);
-	system->filter = filter;
-
-	parse_init(ps, filter_ops, filter_string);
-	err = filter_parse(ps);
-	if (err)
-		goto err_filter;
-
-	err = replace_system_preds(system, ps, filter_string);
-	if (err)
-		goto err_filter;
-
-out:
-	filter_opstack_clear(ps);
-	postfix_clear(ps);
-	kfree(ps);
+	err = create_system_filter(system, filter_string, &filter);
+	if (filter) {
+		/*
+		 * No event actually uses the system filter
+		 * we can free it without synchronize_sched().
+		 */
+		__free_filter(system->filter);
+		system->filter = filter;
+	}
 out_unlock:
 	mutex_unlock(&event_mutex);
 
 	return err;
-
-err_filter:
-	replace_filter_string(filter, filter_string);
-	append_filter_err(ps, system->filter);
-	goto out;
 }
 
 #ifdef CONFIG_PERF_EVENTS
@@ -1894,7 +1955,6 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 {
 	int err;
 	struct event_filter *filter;
-	struct filter_parse_state *ps;
 	struct ftrace_event_call *call;
 
 	mutex_lock(&event_mutex);
@@ -1909,33 +1969,10 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 	if (event->filter)
 		goto out_unlock;
 
-	filter = __alloc_filter();
-	if (!filter) {
-		err = PTR_ERR(filter);
-		goto out_unlock;
-	}
-
-	err = -ENOMEM;
-	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
-	if (!ps)
-		goto free_filter;
-
-	parse_init(ps, filter_ops, filter_str);
-	err = filter_parse(ps);
-	if (err)
-		goto free_ps;
-
-	err = replace_preds(call, filter, ps, filter_str, false);
+	err = create_filter(call, filter_str, false, &filter);
 	if (!err)
 		event->filter = filter;
-
-free_ps:
-	filter_opstack_clear(ps);
-	postfix_clear(ps);
-	kfree(ps);
-
-free_filter:
-	if (err)
+	else
 		__free_filter(filter);
 
 out_unlock:
@@ -1954,43 +1991,6 @@ out_unlock:
 #define CREATE_TRACE_POINTS
 #include "trace_events_filter_test.h"
 
-static int test_get_filter(char *filter_str, struct ftrace_event_call *call,
-			   struct event_filter **pfilter)
-{
-	struct event_filter *filter;
-	struct filter_parse_state *ps;
-	int err = -ENOMEM;
-
-	filter = __alloc_filter();
-	if (!filter)
-		goto out;
-
-	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
-	if (!ps)
-		goto free_filter;
-
-	parse_init(ps, filter_ops, filter_str);
-	err = filter_parse(ps);
-	if (err)
-		goto free_ps;
-
-	err = replace_preds(call, filter, ps, filter_str, false);
-	if (!err)
-		*pfilter = filter;
-
- free_ps:
-	filter_opstack_clear(ps);
-	postfix_clear(ps);
-	kfree(ps);
-
- free_filter:
-	if (err)
-		__free_filter(filter);
-
- out:
-	return err;
-}
-
 #define DATA_REC(m, va, vb, vc, vd, ve, vf, vg, vh, nvisit) \
 { \
 	.filter = FILTER, \
@@ -2109,12 +2109,13 @@ static __init int ftrace_test_event_filter(void)
 		struct test_filter_data_t *d = &test_filter_data[i];
 		int err;
 
-		err = test_get_filter(d->filter, &event_ftrace_test_filter,
-				      &filter);
+		err = create_filter(&event_ftrace_test_filter, d->filter,
+				    false, &filter);
 		if (err) {
 			printk(KERN_INFO
 			       "Failed to get filter for '%s', err %d\n",
 			       d->filter, err);
+			__free_filter(filter);
 			break;
 		}
 
-- 
1.7.3.1



* [PATCH 02/11] trace_event_filter: add trace_event_filter_*() interface
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
  2012-01-05 23:42 ` [PATCH 01/11] trace_event_filter: factorize filter creation Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 03/11] block: block_bio_complete tracepoint was missing Tejun Heo
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

The trace event filter is generic enough to be useful in other use
cases.  This patch updates replace_preds() so that it takes a field
lookup function instead of a struct ftrace_event_call and uses it to
implement trace_event_filter_create() and friends.

The fields which can be used in a filter are described by a list of
struct ftrace_event_field's, trace_event_filter_string() can be used
to access the filter string and _destroy() frees the filter.

The currently planned external user is another tracer and this is good
enough for that, but it might be a good idea to further generalize the
interface if it gets used more widely.
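
For illustration, a hypothetical external user might use the interface
roughly as follows (sketch only; everything except the
trace_event_filter_*() calls is made up):

  /* sketch: my_fields / my_filter / my_set_filter are hypothetical */
  static struct event_filter *my_filter;

  static int my_set_filter(struct list_head *my_fields, char *str)
  {
	struct event_filter *filter;
	int err;

	err = trace_event_filter_create(my_fields, str, &filter);
	if (err) {
		/* on failure the returned filter, if any, carries the
		 * error string and must still be freed by the caller */
		if (filter) {
			pr_warn("filter error: %s\n",
				trace_event_filter_string(filter));
			trace_event_filter_destroy(filter);
		}
		return err;
	}

	if (my_filter)
		trace_event_filter_destroy(my_filter);
	my_filter = filter;
	return 0;
  }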

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/trace/trace.h               |    6 ++
 kernel/trace/trace_events_filter.c |  100 +++++++++++++++++++++++++++++++----
 2 files changed, 94 insertions(+), 12 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2c26574..11c4754 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -811,6 +811,12 @@ filter_check_discard(struct ftrace_event_call *call, void *rec,
 
 extern void trace_event_enable_cmd_record(bool enable);
 
+extern int trace_event_filter_create(struct list_head *event_fields,
+				     char *filter_str,
+				     struct event_filter **filterp);
+extern const char *trace_event_filter_string(struct event_filter *filter);
+extern void trace_event_filter_destroy(struct event_filter *filter);
+
 extern struct mutex event_mutex;
 extern struct list_head ftrace_events;
 
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 24aee71..e944b9f 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -656,9 +656,9 @@ void print_subsystem_event_filter(struct event_subsystem *system,
 	mutex_unlock(&event_mutex);
 }
 
-static struct ftrace_event_field *
-__find_event_field(struct list_head *head, char *name)
+static struct ftrace_event_field *__find_event_field(void *data, char *name)
 {
+	struct list_head *head = data;
 	struct ftrace_event_field *field;
 
 	list_for_each_entry(field, head, link) {
@@ -669,9 +669,9 @@ __find_event_field(struct list_head *head, char *name)
 	return NULL;
 }
 
-static struct ftrace_event_field *
-find_event_field(struct ftrace_event_call *call, char *name)
+static struct ftrace_event_field *find_event_field(void *data, char *name)
 {
+	struct ftrace_event_call *call = data;
 	struct ftrace_event_field *field;
 	struct list_head *head;
 
@@ -1308,8 +1308,10 @@ parse_operand:
 	return 0;
 }
 
+typedef struct ftrace_event_field *find_field_fn_t(void *data, char *name);
+
 static struct filter_pred *create_pred(struct filter_parse_state *ps,
-				       struct ftrace_event_call *call,
+				       find_field_fn_t ff_fn, void *ff_data,
 				       int op, char *operand1, char *operand2)
 {
 	struct ftrace_event_field *field;
@@ -1326,7 +1328,7 @@ static struct filter_pred *create_pred(struct filter_parse_state *ps,
 		return NULL;
 	}
 
-	field = find_event_field(call, operand1);
+	field = ff_fn(ff_data, operand1);
 	if (!field) {
 		parse_error(ps, FILT_ERR_FIELD_NOT_FOUND, 0);
 		return NULL;
@@ -1525,11 +1527,11 @@ static int fold_pred_tree(struct event_filter *filter,
 			      filter->preds);
 }
 
-static int replace_preds(struct ftrace_event_call *call,
-			 struct event_filter *filter,
-			 struct filter_parse_state *ps,
-			 char *filter_string,
-			 bool dry_run)
+static int __replace_preds(find_field_fn_t ff_fn, void *ff_data,
+			   struct event_filter *filter,
+			   struct filter_parse_state *ps,
+			   char *filter_string,
+			   bool dry_run)
 {
 	char *operand1 = NULL, *operand2 = NULL;
 	struct filter_pred *pred;
@@ -1579,7 +1581,8 @@ static int replace_preds(struct ftrace_event_call *call,
 			goto fail;
 		}
 
-		pred = create_pred(ps, call, elt->op, operand1, operand2);
+		pred = create_pred(ps, ff_fn, ff_data,
+				   elt->op, operand1, operand2);
 		if (!pred) {
 			err = -EINVAL;
 			goto fail;
@@ -1628,6 +1631,16 @@ fail:
 	return err;
 }
 
+static int replace_preds(struct ftrace_event_call *call,
+			 struct event_filter *filter,
+			 struct filter_parse_state *ps,
+			 char *filter_string,
+			 bool dry_run)
+{
+	return __replace_preds(find_event_field, call, filter, ps,
+			       filter_string, dry_run);
+}
+
 struct filter_list {
 	struct list_head	list;
 	struct event_filter	*filter;
@@ -1940,6 +1953,69 @@ out_unlock:
 	return err;
 }
 
+/**
+ * trace_event_filter_create - create an event filter for external user
+ * @event_fields: linked list of struct ftrace_event_field's
+ * @filter_str: filter string
+ * @filterp: out param for created filter (always updated on return)
+ *
+ * Creates a filter for an external user.  @event_fields lists struct
+ * ftrace_event_field's for all possible fields @filter_str may use.
+ *
+ * On success, returns 0 and *@filterp points to the new filter.  On
+ * failure, returns -errno and *@filterp may point to %NULL or to a new
+ * filter.  In the latter case, the returned filter contains error
+ * information if @set_str is %true and the caller is responsible for
+ * freeing it by calling trace_event_filter_destroy().
+ */
+int trace_event_filter_create(struct list_head *event_fields,
+			      char *filter_str, struct event_filter **filterp)
+{
+	struct event_filter *filter = NULL;
+	struct filter_parse_state *ps = NULL;
+	int err;
+
+	err = create_filter_start(filter_str, true, &ps, &filter);
+	if (!err) {
+		err = __replace_preds(__find_event_field, event_fields,
+				      filter, ps, filter_str, false);
+		if (err)
+			append_filter_err(ps, filter);
+	}
+	create_filter_finish(ps);
+
+	*filterp = filter;
+	return err;
+}
+EXPORT_SYMBOL_GPL(trace_event_filter_create);
+
+/**
+ * trace_event_filter_string - access filter string of an event filter
+ * @filter: target filter to access filter string for
+ *
+ * Returns pointer to filter string of @filter.  Note that the caller is
+ * responsible for ensuring @filter is not destroyed while the returned
+ * filter string is used.
+ */
+const char *trace_event_filter_string(struct event_filter *filter)
+{
+	return filter->filter_string;
+}
+EXPORT_SYMBOL_GPL(trace_event_filter_string);
+
+/**
+ * trace_event_filter_destroy - destroy an event filter
+ * @filter: filter to destroy
+ *
+ * Destroy @filter.  All filters returned by trace_event_filter_create()
+ * should be freed using this function.
+ */
+void trace_event_filter_destroy(struct event_filter *filter)
+{
+	__free_filter(filter);
+}
+EXPORT_SYMBOL_GPL(trace_event_filter_destroy);
+
 #ifdef CONFIG_PERF_EVENTS
 
 void ftrace_profile_free_filter(struct perf_event *event)
-- 
1.7.3.1



* [PATCH 03/11] block: block_bio_complete tracepoint was missing
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
  2012-01-05 23:42 ` [PATCH 01/11] trace_event_filter: factorize filter creation Tejun Heo
  2012-01-05 23:42 ` [PATCH 02/11] trace_event_filter: add trace_event_filter_*() interface Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-09  1:30   ` Namhyung Kim
  2012-01-05 23:42 ` [PATCH 04/11] block: add @req to bio_{front|back}_merge tracepoints Tejun Heo
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

block_bio_complete tracepoint was defined but not invoked anywhere.
Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/bio.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index b1fe82c..96548da 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1447,6 +1447,9 @@ void bio_endio(struct bio *bio, int error)
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, error);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
-- 
1.7.3.1



* [PATCH 04/11] block: add @req to bio_{front|back}_merge tracepoints
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (2 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 03/11] block: block_bio_complete tracepoint was missing Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 05/11] block: abstract disk iteration into disk_iter Tejun Heo
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

The bio_{front|back}_merge tracepoints report a bio merging into an
existing request but don't specify which request the bio is being
merged into.  Add @req to them.  This makes it impossible to share the
event template with block_bio_queue - split it out.

@req isn't used or exported to userland at this point and there is no
userland visible behavior change.  Later changes will make use of the
extra parameter.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c             |    4 +-
 include/trace/events/block.h |   45 +++++++++++++++++++++++++++++++----------
 kernel/trace/blktrace.c      |    2 +
 3 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ea70e6c..b0b5411 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1180,7 +1180,7 @@ static bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 	if (!ll_back_merge_fn(q, req, bio))
 		return false;
 
-	trace_block_bio_backmerge(q, bio);
+	trace_block_bio_backmerge(q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
@@ -1203,7 +1203,7 @@ static bool bio_attempt_front_merge(struct request_queue *q,
 	if (!ll_front_merge_fn(q, req, bio))
 		return false;
 
-	trace_block_bio_frontmerge(q, bio);
+	trace_block_bio_frontmerge(q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 05c5e61..983f8a8 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -241,11 +241,11 @@ TRACE_EVENT(block_bio_complete,
 		  __entry->nr_sector, __entry->error)
 );
 
-DECLARE_EVENT_CLASS(block_bio,
+DECLARE_EVENT_CLASS(block_bio_merge,
 
-	TP_PROTO(struct request_queue *q, struct bio *bio),
+	TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),
 
-	TP_ARGS(q, bio),
+	TP_ARGS(q, rq, bio),
 
 	TP_STRUCT__entry(
 		__field( dev_t,		dev			)
@@ -272,31 +272,33 @@ DECLARE_EVENT_CLASS(block_bio,
 /**
  * block_bio_backmerge - merging block operation to the end of an existing operation
  * @q: queue holding operation
+ * @rq: request bio is being merged into
  * @bio: new block operation to merge
  *
  * Merging block request @bio to the end of an existing block request
  * in queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_backmerge,
+DEFINE_EVENT(block_bio_merge, block_bio_backmerge,
 
-	TP_PROTO(struct request_queue *q, struct bio *bio),
+	TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),
 
-	TP_ARGS(q, bio)
+	TP_ARGS(q, rq, bio)
 );
 
 /**
  * block_bio_frontmerge - merging block operation to the beginning of an existing operation
  * @q: queue holding operation
+ * @rq: request bio is being merged into
  * @bio: new block operation to merge
  *
  * Merging block IO operation @bio to the beginning of an existing block
  * operation in queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_frontmerge,
+DEFINE_EVENT(block_bio_merge, block_bio_frontmerge,
 
-	TP_PROTO(struct request_queue *q, struct bio *bio),
+	TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),
 
-	TP_ARGS(q, bio)
+	TP_ARGS(q, rq, bio)
 );
 
 /**
@@ -306,11 +308,32 @@ DEFINE_EVENT(block_bio, block_bio_frontmerge,
  *
  * About to place the block IO operation @bio into queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_queue,
+TRACE_EVENT(block_bio_queue,
 
 	TP_PROTO(struct request_queue *q, struct bio *bio),
 
-	TP_ARGS(q, bio)
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	RWBS_LEN	)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		blk_fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->comm)
 );
 
 DECLARE_EVENT_CLASS(block_get_rq,
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 16fc34a..7d00bcf 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -797,6 +797,7 @@ static void blk_add_trace_bio_complete(void *ignore,
 
 static void blk_add_trace_bio_backmerge(void *ignore,
 					struct request_queue *q,
+					struct request *rq,
 					struct bio *bio)
 {
 	blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
@@ -804,6 +805,7 @@ static void blk_add_trace_bio_backmerge(void *ignore,
 
 static void blk_add_trace_bio_frontmerge(void *ignore,
 					 struct request_queue *q,
+					 struct request *rq,
 					 struct bio *bio)
 {
 	blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
-- 
1.7.3.1



* [PATCH 05/11] block: abstract disk iteration into disk_iter
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (3 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 04/11] block: add @req to bio_{front|back}_merge tracepoints Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 06/11] writeback: move struct wb_writeback_work to writeback.h Tejun Heo
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

Instead of using class_dev_iter directly, abstract disk iteration into
disk_iter and helpers which are exported.  This simplifies the callers
a bit and allows external users to iterate over disks.
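
For reference, an external user would iterate roughly as follows; a
minimal sketch using only the helpers added below:

  struct disk_iter diter;
  struct gendisk *disk;

  disk_iter_init(&diter);
  while ((disk = disk_iter_next(&diter))) {
  	/* examine each disk, e.g. */
  	pr_info("disk %s\n", disk->disk_name);
  }
  disk_iter_exit(&diter);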

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/genhd.c         |   98 +++++++++++++++++++++++++++++++++----------------
 include/linux/genhd.h |   10 ++++-
 2 files changed, 75 insertions(+), 33 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 02e9fca..aa13775 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -41,6 +41,47 @@ static void disk_del_events(struct gendisk *disk);
 static void disk_release_events(struct gendisk *disk);
 
 /**
+ * disk_iter_init - initialize disk iterator
+ * @diter: iterator to initialize
+ *
+ * Initialize @diter so that it iterates over all disks
+ */
+void disk_iter_init(struct disk_iter *diter)
+{
+	class_dev_iter_init(&diter->cdev_iter, &block_class, NULL, &disk_type);
+}
+EXPORT_SYMBOL_GPL(disk_iter_init);
+
+/**
+ * disk_iter_next - proceed iterator to the next disk and return it
+ * @diter: iterator to proceed
+ *
+ * Proceed @diter to the next disk and return it.
+ */
+struct gendisk *disk_iter_next(struct disk_iter *diter)
+{
+	struct device *dev;
+
+	dev = class_dev_iter_next(&diter->cdev_iter);
+	if (dev)
+		return dev_to_disk(dev);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(disk_iter_next);
+
+/**
+ * disk_iter_exit - finish up disk iteration
+ * @diter: iter to exit
+ *
+ * Called when iteration is over.  Cleans up @diter.
+ */
+void disk_iter_exit(struct disk_iter *diter)
+{
+	class_dev_iter_exit(&diter->cdev_iter);
+}
+EXPORT_SYMBOL_GPL(disk_iter_exit);
+
+/**
  * disk_get_part - get partition
  * @disk: disk to look partition from
  * @partno: partition number
@@ -731,12 +772,11 @@ EXPORT_SYMBOL(bdget_disk);
  */
 void __init printk_all_partitions(void)
 {
-	struct class_dev_iter iter;
-	struct device *dev;
+	struct disk_iter diter;
+	struct gendisk *disk;
 
-	class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
-	while ((dev = class_dev_iter_next(&iter))) {
-		struct gendisk *disk = dev_to_disk(dev);
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter))) {
 		struct disk_part_iter piter;
 		struct hd_struct *part;
 		char name_buf[BDEVNAME_SIZE];
@@ -780,7 +820,7 @@ void __init printk_all_partitions(void)
 		}
 		disk_part_iter_exit(&piter);
 	}
-	class_dev_iter_exit(&iter);
+	disk_iter_exit(&diter);
 }
 
 #ifdef CONFIG_PROC_FS
@@ -788,44 +828,38 @@ void __init printk_all_partitions(void)
 static void *disk_seqf_start(struct seq_file *seqf, loff_t *pos)
 {
 	loff_t skip = *pos;
-	struct class_dev_iter *iter;
-	struct device *dev;
+	struct disk_iter *diter;
+	struct gendisk *disk;
 
-	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
-	if (!iter)
+	diter = kmalloc(sizeof(*diter), GFP_KERNEL);
+	if (!diter)
 		return ERR_PTR(-ENOMEM);
 
-	seqf->private = iter;
-	class_dev_iter_init(iter, &block_class, NULL, &disk_type);
+	seqf->private = diter;
+	disk_iter_init(diter);
 	do {
-		dev = class_dev_iter_next(iter);
-		if (!dev)
+		disk = disk_iter_next(diter);
+		if (!disk)
 			return NULL;
 	} while (skip--);
 
-	return dev_to_disk(dev);
+	return disk;
 }
 
 static void *disk_seqf_next(struct seq_file *seqf, void *v, loff_t *pos)
 {
-	struct device *dev;
-
 	(*pos)++;
-	dev = class_dev_iter_next(seqf->private);
-	if (dev)
-		return dev_to_disk(dev);
-
-	return NULL;
+	return disk_iter_next(seqf->private);
 }
 
 static void disk_seqf_stop(struct seq_file *seqf, void *v)
 {
-	struct class_dev_iter *iter = seqf->private;
+	struct disk_iter *diter = seqf->private;
 
 	/* stop is called even after start failed :-( */
-	if (iter) {
-		class_dev_iter_exit(iter);
-		kfree(iter);
+	if (diter) {
+		disk_iter_exit(diter);
+		kfree(diter);
 	}
 }
 
@@ -1207,12 +1241,12 @@ module_init(proc_genhd_init);
 dev_t blk_lookup_devt(const char *name, int partno)
 {
 	dev_t devt = MKDEV(0, 0);
-	struct class_dev_iter iter;
-	struct device *dev;
+	struct disk_iter diter;
+	struct gendisk *disk;
 
-	class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
-	while ((dev = class_dev_iter_next(&iter))) {
-		struct gendisk *disk = dev_to_disk(dev);
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter))) {
+		struct device *dev = disk_to_dev(disk);
 		struct hd_struct *part;
 
 		if (strcmp(dev_name(dev), name))
@@ -1234,7 +1268,7 @@ dev_t blk_lookup_devt(const char *name, int partno)
 		}
 		disk_put_part(part);
 	}
-	class_dev_iter_exit(&iter);
+	disk_iter_exit(&diter);
 	return devt;
 }
 EXPORT_SYMBOL(blk_lookup_devt);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6d18f35..aefa6ba 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -260,8 +260,16 @@ static inline void disk_put_part(struct hd_struct *part)
 }
 
 /*
- * Smarter partition iterator without context limits.
+ * Smarter disk and partition iterators without context limits.
  */
+struct disk_iter {
+	struct class_dev_iter	cdev_iter;
+};
+
+extern void disk_iter_init(struct disk_iter *diter);
+extern struct gendisk *disk_iter_next(struct disk_iter *diter);
+extern void disk_iter_exit(struct disk_iter *diter);
+
 #define DISK_PITER_REVERSE	(1 << 0) /* iterate in the reverse direction */
 #define DISK_PITER_INCL_EMPTY	(1 << 1) /* include 0-sized parts */
 #define DISK_PITER_INCL_PART0	(1 << 2) /* include partition 0 */
-- 
1.7.3.1



* [PATCH 06/11] writeback: move struct wb_writeback_work to writeback.h
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (4 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 05/11] block: abstract disk iteration into disk_iter Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 07/11] writeback: add more tracepoints Tejun Heo
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

Move definition of struct wb_writeback_work from fs/fs-writeback.c to
include/linux/writeback.h.  This is to allow accessing fields from
writeback tracepoint probes which live outside fs-writeback.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c         |   18 ------------------
 include/linux/writeback.h |   18 ++++++++++++++++++
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ac86f8b..68cba85 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -29,24 +29,6 @@
 #include <linux/tracepoint.h>
 #include "internal.h"
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
 const char *wb_reason_name[] = {
 	[WB_REASON_BACKGROUND]		= "background",
 	[WB_REASON_TRY_TO_FREE_PAGES]	= "try_to_free_pages",
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a378c29..10d22d1 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -82,6 +82,24 @@ struct writeback_control {
 };
 
 /*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+	long nr_pages;
+	struct super_block *sb;
+	unsigned long *older_than_this;
+	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
+	unsigned int for_kupdate:1;
+	unsigned int range_cyclic:1;
+	unsigned int for_background:1;
+	enum wb_reason reason;		/* why was writeback initiated? */
+
+	struct list_head list;		/* pending work list */
+	struct completion *done;	/* set if the caller waits */
+};
+
+/*
  * fs/fs-writeback.c
  */	
 struct bdi_writeback;
-- 
1.7.3.1



* [PATCH 07/11] writeback: add more tracepoints
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (5 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 06/11] writeback: move struct wb_writeback_work to writeback.h Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 08/11] block: add block_touch_buffer tracepoint Tejun Heo
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

Add tracepoints for page dirtying, writeback_single_inode start, inode
dirtying and writeback.  For the latter two inode events, a pair of
events are defined to denote start and end of the operations (the
starting one has _start suffix and the one w/o suffix happens after
the operation is complete).  These inode ops are FS-specific and can
be non-trivial, so having enclosing tracepoints is useful for external
tracers.
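
For example, an external tracer can pair the two events using the
regular tracepoint probe interface; a minimal sketch (probe bodies are
placeholders, error handling omitted):

  static void probe_dirty_inode_start(void *ignore, struct inode *inode,
  				      int flags)
  {
  	/* ->dirty_inode() is about to run, e.g. note a timestamp */
  }

  static void probe_dirty_inode(void *ignore, struct inode *inode,
  				int flags)
  {
  	/* ->dirty_inode() has finished, pair up with the _start probe */
  }

  /* in the tracer's init path */
  register_trace_writeback_dirty_inode_start(probe_dirty_inode_start, NULL);
  register_trace_writeback_dirty_inode(probe_dirty_inode, NULL);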

This is part of tracepoint additions to improve visibility into
dirtying / writeback operations for io tracer and userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c                |   16 +++++-
 include/trace/events/writeback.h |  113 ++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |    2 +
 3 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 68cba85..7d29370 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -310,8 +310,14 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode))
-		return inode->i_sb->s_op->write_inode(inode, wbc);
+	int ret;
+
+	if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode)) {
+		trace_writeback_write_inode_start(inode, wbc);
+		ret = inode->i_sb->s_op->write_inode(inode, wbc);
+		trace_writeback_write_inode(inode, wbc);
+		return ret;
+	}
 	return 0;
 }
 
@@ -392,6 +398,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 
+	trace_writeback_single_inode_start(inode, wbc, nr_to_write);
+
 	ret = do_writepages(mapping, wbc);
 
 	/*
@@ -1051,8 +1059,12 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	 * dirty the inode itself
 	 */
 	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+		trace_writeback_dirty_inode_start(inode, flags);
+
 		if (sb->s_op->dirty_inode)
 			sb->s_op->dirty_inode(inode, flags);
+
+		trace_writeback_dirty_inode(inode, flags);
 	}
 
 	/*
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index b99caa8..a35effa 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -23,6 +23,112 @@
 
 struct wb_writeback_work;
 
+TRACE_EVENT(writeback_dirty_page,
+
+	TP_PROTO(struct page *page, struct address_space *mapping),
+
+	TP_ARGS(page, mapping),
+
+	TP_STRUCT__entry (
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(pgoff_t, index)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			mapping ? dev_name(mapping->backing_dev_info->dev) : "(unknown)", 32);
+		__entry->ino = mapping ? mapping->host->i_ino : 0;
+		__entry->index = page->index;
+	),
+
+	TP_printk("bdi %s: ino=%lu index=%lu",
+		__entry->name,
+		__entry->ino,
+		__entry->index
+	)
+);
+
+DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
+
+	TP_PROTO(struct inode *inode, int flags),
+
+	TP_ARGS(inode, flags),
+
+	TP_STRUCT__entry (
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, flags)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->flags		= flags;
+	),
+
+	TP_printk("bdi %s: ino=%lu flags=%s",
+		__entry->name,
+		__entry->ino,
+		show_inode_state(__entry->flags)
+	)
+);
+
+DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode_start,
+
+	TP_PROTO(struct inode *inode, int flags),
+
+	TP_ARGS(inode, flags)
+);
+
+DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode,
+
+	TP_PROTO(struct inode *inode, int flags),
+
+	TP_ARGS(inode, flags)
+);
+
+DECLARE_EVENT_CLASS(writeback_write_inode_template,
+
+	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc),
+
+	TP_STRUCT__entry (
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(int, sync_mode)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->sync_mode	= wbc->sync_mode;
+	),
+
+	TP_printk("bdi %s: ino=%lu mode=%d",
+		__entry->name,
+		__entry->ino,
+		__entry->sync_mode
+	)
+);
+
+DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode_start,
+
+	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc)
+);
+
+DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode,
+
+	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc)
+);
+
 DECLARE_EVENT_CLASS(writeback_work_class,
 	TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
 	TP_ARGS(bdi, work),
@@ -436,6 +542,13 @@ DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
 	TP_ARGS(inode, wbc, nr_to_write)
 );
 
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_start,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
 DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
 	TP_PROTO(struct inode *inode,
 		 struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 50f0824..0bc5085 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1735,6 +1735,8 @@ int __set_page_dirty_no_writeback(struct page *page)
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
+	trace_writeback_dirty_page(page, mapping);
+
 	if (mapping_cap_account_dirty(mapping)) {
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
-- 
1.7.3.1



* [PATCH 08/11] block: add block_touch_buffer tracepoint
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (6 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 07/11] writeback: add more tracepoints Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 09/11] vfs: add fcheck tracepoint Tejun Heo
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

Add block_touch_buffer tracepoint which gets triggered on
touch_buffer().

Because touch_buffer() is defined as a macro in linux/buffer_head.h,
this creates a circular dependency between linux/buffer_head.h and
events/block.h.  As the event header needs buffer_head details only
when the tracepoints are actually created (ie. CREATE_TRACE_POINTS is
defined), this is easily solved by including buffer_head.h before
setting CREATE_TRACE_POINTS and including the event header to create
the tracepoints.
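
In other words, only the compilation unit which instantiates the
tracepoints needs the buffer_head details, and it simply does (as
blk-core.c does below):

  #include <linux/buffer_head.h>

  #define CREATE_TRACE_POINTS
  #include <trace/events/block.h>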

This is part of tracepoint additions to improve visibility into
dirtying / writeback operations for io tracer and userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c             |    1 +
 include/linux/buffer_head.h  |    7 ++++++-
 include/trace/events/block.h |   25 +++++++++++++++++++++++++
 3 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b0b5411..365ab8f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -29,6 +29,7 @@
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
 #include <linux/delay.h>
+#include <linux/buffer_head.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..245caed 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -13,6 +13,7 @@
 #include <linux/pagemap.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <trace/events/block.h>
 
 #ifdef CONFIG_BLOCK
 
@@ -126,7 +127,11 @@ BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
-#define touch_buffer(bh)	mark_page_accessed(bh->b_page)
+
+#define touch_buffer(bh)	do {				\
+		trace_block_touch_buffer(bh);			\
+		mark_page_accessed(bh->b_page);			\
+	} while (0)
 
 /* If we *know* page->private refers to buffer_heads */
 #define page_buffers(page)					\
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 983f8a8..4fcc09d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -6,10 +6,35 @@
 
 #include <linux/blktrace_api.h>
 #include <linux/blkdev.h>
+#include <linux/buffer_head.h>
 #include <linux/tracepoint.h>
 
 #define RWBS_LEN	8
 
+TRACE_EVENT(block_touch_buffer,
+
+	TP_PROTO(struct buffer_head *bh),
+
+	TP_ARGS(bh),
+
+	TP_STRUCT__entry (
+		__field(  dev_t,	dev			)
+		__field(  sector_t,	sector			)
+		__field(  size_t,	size			)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bh->b_bdev->bd_dev;
+		__entry->sector		= bh->b_blocknr;
+		__entry->size		= bh->b_size;
+	),
+
+	TP_printk("%d,%d get_bh sector=%llu size=%zu",
+		MAJOR(__entry->dev), MINOR(__entry->dev),
+		(unsigned long long)__entry->sector, __entry->size
+	)
+);
+
 DECLARE_EVENT_CLASS(block_rq_with_error,
 
 	TP_PROTO(struct request_queue *q, struct request *rq),
-- 
1.7.3.1



* [PATCH 09/11] vfs: add fcheck tracepoint
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (7 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 08/11] block: add block_touch_buffer tracepoint Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 10/11] stacktrace: implement save_stack_trace_quick() Tejun Heo
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

All file accesses from userland go through fcheck to map fd to struct
file, making it a very good location for peeking at what files
userland is accessing.  Add a tracepoint there.

This is part of tracepoint additions to improve visibility into
dirtying / writeback operations for io tracer and userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/super.c                 |    2 ++
 include/linux/fdtable.h    |    3 +++
 include/trace/events/vfs.h |   40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/vfs.h

diff --git a/fs/super.c b/fs/super.c
index afd0f1a..7b70a2e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,8 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vfs.h>
 
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 82163c4..72df04b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <trace/events/vfs.h>
 
 #include <linux/atomic.h>
 
@@ -87,6 +88,8 @@ static inline struct file * fcheck_files(struct files_struct *files, unsigned in
 
 	if (fd < fdt->max_fds)
 		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
+
+	trace_vfs_fcheck(files, fd, file);
 	return file;
 }
 
diff --git a/include/trace/events/vfs.h b/include/trace/events/vfs.h
new file mode 100644
index 0000000..9a9bae4
--- /dev/null
+++ b/include/trace/events/vfs.h
@@ -0,0 +1,40 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include <linux/tracepoint.h>
+#include <linux/fs.h>
+
+TRACE_EVENT(vfs_fcheck,
+
+	TP_PROTO(struct files_struct *files, unsigned int fd,
+		 struct file *file),
+
+	TP_ARGS(files, fd, file),
+
+	TP_STRUCT__entry(
+		__field(unsigned int,	fd)
+		__field(umode_t,	mode)
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+	),
+
+	TP_fast_assign(
+		__entry->fd = fd;
+		__entry->mode = file ? file->f_path.dentry->d_inode->i_mode : 0;
+		__entry->dev = file ? file->f_path.dentry->d_inode->i_sb->s_dev : 0;
+		__entry->ino = file ? file->f_path.dentry->d_inode->i_ino : 0;
+	),
+
+	TP_printk("fd %u mode 0x%x dev %d,%d ino %lu",
+		  __entry->fd, __entry->mode,
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long)__entry->ino)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.3.1



* [PATCH 10/11] stacktrace: implement save_stack_trace_quick()
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (8 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 09/11] vfs: add fcheck tracepoint Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-05 23:42 ` [PATCH 11/11] block, trace: implement ioblame IO statistical analyzer Tejun Heo
  2012-01-06  9:00 ` [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Namhyung Kim
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo, H. Peter Anvin

Implement save_stack_trace_quick() which only considers the usual
contexts (ie. thread and irq) and doesn't handle links between
different contexts - if %current is in irq context, only the backtrace
in the irq stack is considered.

This is a subset of dump_trace() done in a much simpler way.  It's
intended to be used in hot paths where the overhead of dump_trace()
can be too heavy.
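
A typical hot path caller would look something like the following
sketch (depth and storage are illustrative, not part of this patch):

  void record_backtrace(void)
  {
  	unsigned long entries[16];		/* illustrative depth */
  	struct stack_trace trace = {
  		.entries	= entries,
  		.max_entries	= ARRAY_SIZE(entries),
  		.skip		= 1,	/* skip record_backtrace() itself */
  	};

  	save_stack_trace_quick(&trace);
  	/* entries[0 .. trace.nr_entries - 1] now hold return addresses */
  }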

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/stacktrace.h |    2 +
 arch/x86/kernel/stacktrace.c      |   40 +++++++++++++++++++++++++++++++++++++
 include/linux/stacktrace.h        |    6 +++++
 kernel/stacktrace.c               |    6 +++++
 4 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 70bbe39..06bbdfc 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -50,9 +50,11 @@ void dump_trace(struct task_struct *tsk, struct pt_regs *regs,
 #ifdef CONFIG_X86_32
 #define STACKSLOTS_PER_LINE 8
 #define get_bp(bp) asm("movl %%ebp, %0" : "=r" (bp) :)
+#define get_irq_stack_end()	0
 #else
 #define STACKSLOTS_PER_LINE 4
 #define get_bp(bp) asm("movq %%rbp, %0" : "=r" (bp) :)
+#define get_irq_stack_end()	(unsigned long)this_cpu_read(irq_stack_ptr)
 #endif
 
 #ifdef CONFIG_FRAME_POINTER
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index fdd0c64..f53ec547 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -81,6 +81,46 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
+#ifdef CONFIG_FRAME_POINTER
+void save_stack_trace_quick(struct stack_trace *trace)
+{
+	const unsigned long stk_sz = THREAD_SIZE - sizeof(struct stack_frame);
+	unsigned long tstk = (unsigned long)current_thread_info();
+	unsigned long istk = get_irq_stack_end();
+	unsigned long last_bp = 0;
+	unsigned long bp, stk;
+
+	get_bp(bp);
+
+	if (bp > tstk && bp <= tstk + stk_sz)
+		stk = tstk;
+	else if (istk && (bp > istk && bp <= stk_sz))
+		stk = istk;
+	else
+		goto out;
+
+	while (bp > last_bp && bp <= stk + stk_sz) {
+		struct stack_frame *frame = (struct stack_frame *)bp;
+		unsigned long ret_addr = frame->return_address;
+
+		if (!trace->skip) {
+			if (trace->nr_entries >= trace->max_entries)
+				return;
+			trace->entries[trace->nr_entries++] = ret_addr;
+		} else {
+			trace->skip--;
+		}
+
+		last_bp = bp;
+		bp = (unsigned long)frame->next_frame;
+	}
+out:
+	if (trace->nr_entries < trace->max_entries)
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
+EXPORT_SYMBOL_GPL(save_stack_trace_quick);
+#endif
+
 /* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
 
 struct stack_frame_user {
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 115b570..d5b16c4 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -19,6 +19,12 @@ extern void save_stack_trace_regs(struct pt_regs *regs,
 extern void save_stack_trace_tsk(struct task_struct *tsk,
 				struct stack_trace *trace);
 
+/*
+ * Saves only trace from the current context.  Doesn't handle exception
+ * stacks or verify text address.
+ */
+extern void save_stack_trace_quick(struct stack_trace *trace);
+
 extern void print_stack_trace(struct stack_trace *trace, int spaces);
 
 #ifdef CONFIG_USER_STACKTRACE_SUPPORT
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 00fe55c..4760949 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -31,6 +31,12 @@ EXPORT_SYMBOL_GPL(print_stack_trace);
  * (whenever this facility is utilized - for example by procfs):
  */
 __weak void
+save_stack_trace_quick(struct stack_trace *trace)
+{
+	WARN_ONCE(1, KERN_INFO "save_stack_trace_quick() not implemented yet.\n");
+}
+
+__weak void
 save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 {
 	WARN_ONCE(1, KERN_INFO "save_stack_trace_tsk() not implemented yet.\n");
-- 
1.7.3.1



* [PATCH 11/11] block, trace: implement ioblame IO statistical analyzer
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (9 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 10/11] stacktrace: implement save_stack_trace_quick() Tejun Heo
@ 2012-01-05 23:42 ` Tejun Heo
  2012-01-06  9:00 ` [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Namhyung Kim
  11 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-05 23:42 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott, dsharp
  Cc: linux-kernel, Tejun Heo

Implement ioblame, which can attribute each IO to its origin and
record user configurable histograms.

Operations which may eventually cause IOs and IO operations themselves
are identified and tracked primarily by their stack traces along with
the task and the target file (dev:ino:gen).  On each IO completion,
ioblame knows why that specific IO happened and records statistics in
user-configurable histograms.

ioblame aims to deliver insight into overall system IO behavior with
manageable overhead.  Also, to enable follow-the-breadcrumbs type
investigation, a lot of information gathering configurations can be
changed on the fly.

While ioblame adds fields to a few fs and block layer objects, all
logic is well insulated inside ioblame proper and all hooking goes
through well defined tracepoints and doesn't add any significant
maintenance overhead.

For details, please read Documentation/trace/ioblame.txt.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Justin TerAvest <teravest@google.com>
Cc: Slava Pestov <slavapestov@google.com>
---
 Documentation/trace/ioblame.txt |  646 ++++++++
 include/linux/blk_types.h       |    7 +-
 include/linux/fs.h              |    3 +
 include/linux/genhd.h           |    4 +
 include/linux/ioblame.h         |   95 ++
 kernel/trace/Kconfig            |   11 +
 kernel/trace/Makefile           |    1 +
 kernel/trace/ioblame.c          | 3479 +++++++++++++++++++++++++++++++++++++++
 8 files changed, 4244 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/trace/ioblame.txt
 create mode 100644 include/linux/ioblame.h
 create mode 100644 kernel/trace/ioblame.c

diff --git a/Documentation/trace/ioblame.txt b/Documentation/trace/ioblame.txt
new file mode 100644
index 0000000..4541184
--- /dev/null
+++ b/Documentation/trace/ioblame.txt
@@ -0,0 +1,646 @@
+
+ioblame - statistical IO analyzer with origin tracking
+
+December, 2011		Tejun Heo <tj@kernel.org>
+
+
+CONTENTS
+
+1. Introduction
+2. Overall design
+3. Debugfs interface
+3-1. Configuration
+3-2. Monitoring
+3-3. Data acquisition
+4. Notes
+5. Overheads
+
+
+1. Introduction
+
+In many workloads, IO throughput and latency have a large effect on
+overall performance; however, due to the complexity and asynchronous
+nature, it is very difficult to characterize what's going on.
+blktrace and various tracepoints provide visibility into individual IO
+operations but it is still extremely difficult to trace back to the
+origin of those IO operations.
+
+ioblame is a statistical IO analyzer which can track and record the
+origin of IOs.  It keeps track of who dirtied pages and inodes, and,
+on an actual IO, attributes it to the originator of the IO.
+
+The design goals of ioblame are
+
+* Minimally invasive - The analyzer shouldn't be invasive.  Except for
+  adding some fields, mostly to block layer data structures, for
+  tracking, ioblame gathers all information through well defined
+  tracepoints and all tracking logic is contained in ioblame proper.
+
+* Generic and detailed - There are many different IO paths and
+  filesystems, which also go through changes regularly.  The analyzer
+  should be able to report detailed enough results covering most cases
+  without requiring frequent adaptation.  ioblame uses stack traces at
+  key points combined with information from generic layers to
+  categorize IOs.  This gives detailed enough insight into varying IO
+  paths without requiring specific adaptations.
+
+* Low overhead - Overhead both in terms of memory and processor cycles
+  should be low enough so that the analyzer can be used in IO-heavy
+  production environments.  ioblame keeps hot data structures compact
+  and mostly read-only and avoids synchronization on hot paths by
+  using RCU and taking advantage of the fact that statistics don't
+  have to be completely accurate.
+
+* Dynamic and customizable - There are many different aspects of IOs
+  which can be irrelevant or interesting depending on the situation.
+  From an analysis point of view, always recording all collected data
+  would be ideal but is very wasteful in most situations.  ioblame
+  lets users decide what information to gather so that they can
+  acquire relevant information without wasting resources
+  unnecessarily.  ioblame also allows dynamic configuration while the
+  analyzer is online, which enables dynamic drill-down of IO behaviors.
+
+
+2. Overall design
+
+ioblame tracks the following three object types.
+
+* Role: This tracks 'who' is taking an action.  Normally corresponds
+  to a thread.  Can also be specified by userland (not implemented
+  yet).
+
+* Intent: Stack trace + modifier.  An intent groups actions of the
+  same type.  As the name suggests, modifier modifies the intent and
+  there can be multiple intents with the same stack trace but
+  different modifiers.  Currently, only writeback modifiers are
+  implemented which denote why the writeback action is occurring -
+  ie. wb_reason.
+
+* Act: This is the combination of role, intent and the inode being
+  operated on, and is the ultimate tracking unit ioblame uses.  IOs
+  are charged to and statistics are gathered by acts.
+
+ioblame uses the same indexing data structure for all three types of
+objects.  Objects are never linked directly using pointers and every
+access goes through the index.  This avoids expensive strict object
+lifetime management.  Objects are located either by their content via
+a hash table or by id, which contains a generation number.
+
+To attribute data writebacks to the originator, ioblame maintains a
+table indexed by page frame number which keeps track of which act
+dirtied which pages.  For each IO, the target pages are looked up in
+the table and the dirtying act is charged for the IO.  Note that,
+currently, each IO is charged as a whole to a single act - e.g. all of
+an IO for writeback encompassing multiple dirtiers will be charged to
+the first dirtying act found.  This simplifies data collection and
+reporting while not losing too much information - writebacks tend to
+be naturally grouped and IOPS (IO operations per second) are often
+more significant than the length of each IO.
+
+Inode writeback tracking is more involved as different filesystems
+handle metadata updates and writebacks differently.  ioblame uses
+per-inode and buffer_head operation tracking to attribute inode
+writebacks to the originator.
+
+After all the tracking, on each IO completion, ioblame knows the
+offset and size of the IO, the act to be charged, and how long it took
+in the queue and on the device.  From this information, ioblame
+produces fields which can be recorded.
+
+All statistics are recorded in histograms, called counters, which have
+eight slots.  Userland can specify the number of counters, IO
+directions to consider, what each counter records, the boundary values
+to decide histogram slots and optional filter for more complex
+filtering conditions.
+
+All interactions including configuration and data acquisition happen
+via files under /sys/kernel/debug/ioblame/.
+
+
+3. Debugfs interface
+
+3-1. Configuration
+
+* devs				- can be changed anytime
+
+  Specifies the devices ioblame is enabled for.  ioblame will only
+  track operations on devices which are explicitly enabled in this
+  file.
+
+  It accepts a white space separated list of MAJ:MINs or block device
+  names with an optional preceding '!' for negation.  Opening with
+  O_TRUNC clears all existing entries.  For example,
+
+  $ echo sda sdb > devs		# disables all devices and then enable sd[ab]
+  $ echo sdc >> devs		# sd[abc] enabled
+  $ echo !8:0 >> devs		# sd[bc] enabled
+  $ cat devs
+  8:16 sdb
+  8:32 sdc
+
+* max_{role|intent|act}s	- can be changed while disabled
+
+  Specifies the maximum number of each object type.  If the number of
+  a certain object type exceeds the limit, IOs will be attributed to
+  the special NOMEM object.
+
+* ttl_secs			- can be changed anytime
+
+  Specifies TTL of roles and acts.  Roles are reclaimed after at least
+  TTL has passed after the matching thread has exited or execed and
+  assumed another tid.  Acts are reclaimed after being unused for at
+  least TTL.
+
+  Note that reclaiming is tied to userland reading counters data.  If
+  userland doesn't read counters, reclaiming won't happen.
+
+* nr_counters			- can be changed while disabled
+
+  Specifies the number of counters.  Each act will have the specified
+  number of histograms associated with it.  Individual counters can be
+  configured using files under the counters subdirectory.  Any write
+  to this file clears all counter settings.
+
+* counters/NR			- can be changed anytime
+
+  Specifies each counter.  Its format is
+
+    DIR FIELD B0 B1 B2 B3 B4 B5 B6 B7 B8
+
+  DIR is any combination of letters 'R', 'A', and 'W', each
+  representing reads (sans readaheads), readaheads and writes.
+
+  FIELD is the field to record in the histogram and is one of the
+  following.
+
+    offset	: IO offset scaled to 0-65535
+    size	: IO size
+    wait_time	: time spent queued in usecs
+    io_time	: time spent on device in usecs
+    seek_dist	: seek dist from IO completed right before, scaled to 0-65536
+
+  B[0-8] are the boundaries for the histogram.  Histograms have eight
+  slots.  If (FIELD < B[0] || (B[8] != 0 && FIELD >= B[8])), it's not
+  recorded; otherwise, FIELD is counted in the slot with enclosing
+  boundaries.  e.g. If FIELD is >= B2 and < B3, it's recorded in the
+  third slot (slot 2).
+
+  B8 can be zero indicating no upper limit but all other boundaries
+  must be equal to or larger than the boundary before them.
+
+  e.g. To record offsets of reads and read aheads in counter 0,
+
+  $ echo RA offset 0 8192 16384 24576 32768 40960 49152 57344 0 > counters/0
+
+  If higher resolution than 8 slots is necessary, two counters can be
+  used.
+
+  $ echo RA offset 0 4096 8192 12288 16384 20480 24576 28672 32768 > counters/0
+  $ echo RA offset 32768 36864 40960 45056 49152 53248 57344 61440 0 \
+								   > counters/1
+
+  Writing empty string disables the counter.
+
+  $ echo > 1
+  $ cat 1
+  --- disabled
+
+* counters/NR_filter		- can be changed anytime
+
+  Specifies trace event type filter for more complex conditions.  For
+  example, it allows conditions like "if IO is in the latter half of
+  the device, size is smaller than 128k and IO time is equal to or
+  longer than 10ms".
+
+  To record IO time in counter 0 with the above condition,
+
+  $ echo 'offset >= 16384 && size < 131072 && io_time >= 10000' > 0_filter
+  $ echo RAW io_time 10000 25000 50000 100000 500000 1000000 2500000 \
+							5000000 0 > 0
+
+  Any FIELD can be used in filter specification.  For more details
+  about filter format, please read "Event filtering" section in
+  Documentation/trace/events.txt.
+
+  Writing '0' to the filter file removes the filter.  Note that writing
+  a malformed filter disables the filter and reading it back afterwards
+  returns an error message explaining why parsing failed.
+
+
+3-2. Monitoring (read only)
+
+* nr_{roles|intents|acts}
+
+  Returns the number of objects of the type.  The number of roles and
+  acts can decrease after reclaiming but nr_intents only increases
+  while ioblame is enabled.
+
+* stats/idx_nomem
+
+  How many times role, intent or act creation failed because memory
+  allocation failed while extending the index to accommodate a new
+  object.
+
+* stats/idx_nospc
+
+  How many times role, intent or act creation failed because the limit
+  specified by max_{role|intent|act}s was reached.
+
+* stats/node_nomem
+
+  How many times allocation of a role, intent or act node failed.
+
+* stats/pgtree_nomem
+
+  How many times the page tree, which maps page frame numbers to
+  dirtying acts, failed to expand due to memory allocation failure.
+
+* stats/cnts_nomem
+
+  How many times per-act counter allocation failed.
+
+* stats/iolog_overflow
+
+  How many iolog entries are lost due to overflow.
+
+
+3-3. Data acquisition (read only)
+
+* iolog
+
+  iolog is primarily a debug feature and dumps IOs as they complete.
+
+  $ cat iolog
+  R 4096 @ 74208 pid-5806 (ls) dev=0x800010 ino=0x2 gen=0x0
+    #39 modifier=0x0
+    [ffffffff811a0696] submit_bh+0xe6/0x120
+    [ffffffff811a1f56] ll_rw_block+0xa6/0xb0
+    [ffffffff81239a43] ext4_bread+0x43/0x80
+    [ffffffff8123f4e3] htree_dirblock_to_tree+0x33/0x190
+    [ffffffff8123f79a] ext4_htree_fill_tree+0x15a/0x250
+    [ffffffff8122e26e] ext4_readdir+0x10e/0x5d0
+    [ffffffff811832d0] vfs_readdir+0xa0/0xc0
+    [ffffffff81183450] sys_getdents+0x80/0xe0
+    [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  W 4096 @ 0 pid-20 (sync_supers) dev=0x800010 ino=0x0 gen=0x0
+    #44 modifier=0x0
+    [ffffffff811a0696] submit_bh+0xe6/0x120
+    [ffffffff811a371d] __sync_dirty_buffer+0x4d/0xd0
+    [ffffffff811a37ae] sync_dirty_buffer+0xe/0x10
+    [ffffffff81250ee8] ext4_commit_super+0x188/0x230
+    [ffffffff81250fae] ext4_write_super+0x1e/0x30
+    [ffffffff811738fa] sync_supers+0xfa/0x100
+    [ffffffff8113d3e1] bdi_sync_supers+0x41/0x60
+    [ffffffff810ad4c6] kthread+0x96/0xa0
+    [ffffffff81a3dcb4] kernel_thread_helper+0x4/0x10
+  W 4096 @ 8512 pid-5813 dev=0x800010 ino=0x74 gen=0x4cc12c59
+    #45 modifier=0x10000002
+    [ffffffff812342cb] ext4_setattr+0x25b/0x4c0
+    [ffffffff8118b9ba] notify_change+0x10a/0x2b0
+    [ffffffff8119ef87] utimes_common+0xd7/0x180
+    [ffffffff8119f0c9] do_utimes+0x99/0xf0
+    [ffffffff8119f21d] sys_utimensat+0x2d/0x90
+    [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  ...
+
+  The first entry is a 4k read at sector 74208 (unscaled) on /dev/sdb
+  issued by ls.  The second is sync_supers writing out a dirty super
+  block.  The third is an inode writeback from "touch FILE; sync".
+  Note that the modifier is set (it indicates WB_REASON_SYNC).
+
+  Here is another example from "cp FILE FILE1" and then waiting.
+
+  W 4096 @ 0 pid-20 (sync_supers) dev=0x800010 ino=0x0 gen=0x0
+    #16 modifier=0x0
+    [ffffffff8139cd94] submit_bio+0x74/0x100
+    [ffffffff811cba3b] submit_bh+0xeb/0x130
+    [ffffffff811cecd2] __sync_dirty_buffer+0x52/0xd0
+    [ffffffff811ced63] sync_dirty_buffer+0x13/0x20
+    [ffffffff81281fa8] ext4_commit_super+0x188/0x230
+    [ffffffff81282073] ext4_write_super+0x23/0x40
+    [ffffffff8119c8d2] sync_supers+0x102/0x110
+    [ffffffff81162c99] bdi_sync_supers+0x49/0x60
+    [ffffffff810bc216] kthread+0xb6/0xc0
+    [ffffffff81ab13b4] kernel_thread_helper+0x4/0x10
+  ...
+  W 4096 @ 8512 pid-668 dev=0x800010 ino=0x73 gen=0x17b5119d
+    #23 modifier=0x10000003
+    [ffffffff811c55b0] __mark_inode_dirty+0x220/0x330
+    [ffffffff811cccfb] generic_write_end+0x6b/0xa0
+    [ffffffff81268b10] ext4_da_write_end+0x150/0x360
+    [ffffffff811444bb] generic_file_buffered_write+0x18b/0x290
+    [ffffffff81146938] __generic_file_aio_write+0x238/0x460
+    [ffffffff81146bd8] generic_file_aio_write+0x78/0xf0
+    [ffffffff8125ef9f] ext4_file_write+0x6f/0x2a0
+    [ffffffff811997f2] do_sync_write+0xe2/0x120
+    [ffffffff8119a428] vfs_write+0xc8/0x180
+    [ffffffff8119a5e1] sys_write+0x51/0x90
+    [ffffffff81aafe2b] system_call_fastpath+0x16/0x1b
+  ...
+  W 524288 @ 3276800 pid-668 dev=0x800010 ino=0x73 gen=0x17b5119d
+    #25 modifier=0x10000003
+    [ffffffff811cc86c] __set_page_dirty+0x4c/0xd0
+    [ffffffff811cc956] mark_buffer_dirty+0x66/0xa0
+    [ffffffff811cca39] __block_commit_write+0xa9/0xe0
+    [ffffffff811ccc42] block_write_end+0x42/0x90
+    [ffffffff811cccc3] generic_write_end+0x33/0xa0
+    [ffffffff81268b10] ext4_da_write_end+0x150/0x360
+    [ffffffff811444bb] generic_file_buffered_write+0x18b/0x290
+    [ffffffff81146938] __generic_file_aio_write+0x238/0x460
+    [ffffffff81146bd8] generic_file_aio_write+0x78/0xf0
+    [ffffffff8125ef9f] ext4_file_write+0x6f/0x2a0
+    [ffffffff811997f2] do_sync_write+0xe2/0x120
+    [ffffffff8119a428] vfs_write+0xc8/0x180
+    [ffffffff8119a5e1] sys_write+0x51/0x90
+    [ffffffff81aafe2b] system_call_fastpath+0x16/0x1b
+  ...
+
+  The first entry is ext4 marking the super block dirty.  After a
+  while, periodic writeback kicks in (modifier 0x10000003) and the
+  inode dirtied by cp is written back, followed by the dirty data
+  pages.
+
+  At this point, iolog is mostly a debug feature.  The output format
+  is quite verbose and it isn't particularly performant.  If
+  necessary, it can be extended to use the trace ringbuffer and grow a
+  per-cpu binary interface.
+
+* intents
+
+  Dump of intents in human readable form.
+
+  $ cat intents
+  #0 modifier=0x0
+  #1 modifier=0x0
+  #2 modifier=0x0
+  [ffffffff81189a6a] file_update_time+0xca/0x150
+  [ffffffff81122030] __generic_file_aio_write+0x200/0x460
+  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
+  [ffffffff8122ea94] ext4_file_write+0x64/0x280
+  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
+  [ffffffff811b7401] aio_run_iocb+0x61/0x190
+  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
+  [ffffffff811b867b] sys_io_submit+0xb/0x10
+  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  #3 modifier=0x0
+  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
+  [ffffffff812353b2] ext4_direct_IO+0x1b2/0x3f0
+  [ffffffff81121d6a] generic_file_direct_write+0xba/0x180
+  [ffffffff8112210b] __generic_file_aio_write+0x2db/0x460
+  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
+  [ffffffff8122ea94] ext4_file_write+0x64/0x280
+  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
+  [ffffffff811b7401] aio_run_iocb+0x61/0x190
+  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
+  [ffffffff811b867b] sys_io_submit+0xb/0x10
+  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  #4 modifier=0x0
+  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
+  [ffffffff8126da71] ext4_ind_direct_IO+0x121/0x460
+  [ffffffff81235436] ext4_direct_IO+0x236/0x3f0
+  [ffffffff81122db2] generic_file_aio_read+0x6b2/0x740
+  ...
+
+  The # prefixed number is the NR of the intent and is used to link to
+  the intent from statistics.  Modifier and stack trace follow.  The
+  first two entries are special - 0 is the nomem intent and 1 is the
+  lost intent.  The former is used when an intent can't be created
+  because allocation failed or max_intents is reached.  The latter is
+  used when reclaiming resulted in loss of tracking info and the IO
+  can't be reported exactly.
+
+  This file is seekable by intent NR, ie. seeking to 3 and reading
+  will return intent #3 and after.  Because intents are never
+  destroyed while ioblame is enabled, this allows a userland tool to
+  discover new intents since the last reading.  Seeking to the number of
+  currently known intents and reading returns only the newly created
+  intents.
+
+* intents_bin
+
+  Identical to intents but in compact binary format and likely to be
+  much more performant.  Each entry in the file is in the following
+  format as defined in include/linux/ioblame.h.
+
+  #define IOB_INTENTS_BIN_VER	1
+
+  /* intent modifier */
+  #define IOB_MODIFIER_TYPE_SHIFT	28
+  #define IOB_MODIFIER_TYPE_MASK	0xf0000000U
+  #define IOB_MODIFIER_VAL_MASK		(~IOB_MODIFIER_TYPE_MASK)
+
+  /* val contains wb_reason */
+  #define IOB_MODIFIER_WB		(1 << IOB_MODIFIER_TYPE_SHIFT)
+
+  /* record struct for /sys/kernel/debug/ioblame/intents_bin */
+  struct iob_intent_bin_record {
+	uint16_t	len;
+	uint16_t	ver;
+	uint32_t	nr;
+	uint32_t	modifier;
+	uint32_t	__pad;
+	uint64_t	trace[];
+  } __packed;
+
+* counters_pipe
+
+  Dumps counters and triggers reclaim.  Opening and reading this file
+  returns counters of all acts which have been used since the last
+  open.
+
+  Because roles and acts shouldn't be reclaimed before the updated
+  counters are reported, reclaiming is tied to counters_pipe access.
+  Opening counters_pipe prepares for reclaiming and closing executes
+  it.  Act reclaiming works at ttl_secs / 2 granularity.  ioblame
+  tries to stay close to the lifetime timings requested by ttl_secs
+  but note that reclaim happens only on counters_pipe open/close.
+
+  There can only be one user of counters_pipe at any given moment;
+  otherwise, file operations will fail and the output and reclaiming
+  timings are undefined.
+
+  All reported histogram counters are u32 and never reset.  It's the
+  user's responsibility to calculate the delta if necessary.  Note
+  that counters_pipe reports all used acts since last open and
+  counters are not guaranteed to have been updated - ie. there can be
+  spurious acts in the output.
+
+  counters_pipe is seekable by act NR.
+
+  In the following example, two counters are configured - the first
+  one counts read offsets and the second write offsets.  A file is
+  copied using dd w/ direct flags.
+
+  $ cat counters_pipe
+  pid-20 (sync_supers) int=66 dev=0x800010 ino=0x0 gen=0x0
+	  0       0       0       0       0       0       0       0
+	  2       0       0       0       0       0       0       0
+  pid-1708 int=58 dev=0x800010 ino=0x71 gen=0x3e0d99f2
+	 11       0       0       0       0       0       0       0
+	  0       0       0       0       0       0       0       0
+  pid-1708 int=59 dev=0x800010 ino=0x71 gen=0x3e0d99f2
+	 11       0       0       0       0       0       0       0
+	  0       0       0       0       0       0       0       0
+  pid-1708 int=62 dev=0x800010 ino=0x2727 gen=0xf4739822
+	  0       0       0       0       0       0       0       0
+	 10       0       0       0       0       0       0       0
+  pid-1708 int=63 dev=0x800010 ino=0x2727 gen=0xf4739822
+	  0       0       0       0       0       0       0       0
+	 10       0       0       0       0       0       0       0
+  pid-1708 int=31 dev=0x800010 ino=0x2727 gen=0xf4739822
+	  0       0       0       0       0       0       0       0
+	  2       0       0       0       0       0       0       0
+  pid-1708 int=65 dev=0x800010 ino=0x2727 gen=0xf4739822
+	  0       0       0       0       0       0       0       0
+	  1       0       0       0       0       0       0       0
+
+  pid-1708 is the dd which copied the file.  The output is separated
+  by pid-* lines and each section corresponds to a single act, which
+  has an intent NR and a file (dev:ino:gen) associated with it.  One
+  8-slot histogram is printed per line in ascending order.
+
+  The filesystem is mostly empty and, from the output, both files seem
+  to be located in the first 1/8th of the disk.
+
+  All reads happened through intent 58 and 59.  From intents file,
+  they are,
+
+  #58 modifier=0x0
+  [ffffffff8139d974] submit_bio+0x74/0x100
+  [ffffffff811d5dba] __blockdev_direct_IO+0xc2a/0x3830
+  [ffffffff8129fe51] ext4_ind_direct_IO+0x121/0x470
+  [ffffffff8126678e] ext4_direct_IO+0x23e/0x400
+  [ffffffff81147b05] generic_file_aio_read+0x6d5/0x760
+  [ffffffff81199932] do_sync_read+0xe2/0x120
+  [ffffffff8119a5c5] vfs_read+0xc5/0x180
+  [ffffffff8119a781] sys_read+0x51/0x90
+  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b
+  #59 modifier=0x0
+  [ffffffff8139d974] submit_bio+0x74/0x100
+  [ffffffff811d7345] __blockdev_direct_IO+0x21b5/0x3830
+  [ffffffff8129fe51] ext4_ind_direct_IO+0x121/0x470
+  [ffffffff8126678e] ext4_direct_IO+0x23e/0x400
+  [ffffffff81147b05] generic_file_aio_read+0x6d5/0x760
+  [ffffffff81199932] do_sync_read+0xe2/0x120
+  [ffffffff8119a5c5] vfs_read+0xc5/0x180
+  [ffffffff8119a781] sys_read+0x51/0x90
+  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b
+
+  Except for hitting slightly different paths in __blockdev_direct_IO,
+  they both are ext4 direct reads as expected.  Writes seem more
+  diversified and upon examination, #62 and #63 are ext4 direct
+  writes.  #31 and #65 are more interesting.
+
+  #31 modifier=0x0
+  [ffffffff811cd0cc] __set_page_dirty+0x4c/0xd0
+  [ffffffff811cd1b6] mark_buffer_dirty+0x66/0xa0
+  [ffffffff811cd299] __block_commit_write+0xa9/0xe0
+  [ffffffff811cd4a2] block_write_end+0x42/0x90
+  [ffffffff811cd523] generic_write_end+0x33/0xa0
+  [ffffffff81269720] ext4_da_write_end+0x150/0x360
+  [ffffffff81144878] generic_file_buffered_write+0x188/0x2b0
+  [ffffffff81146d18] __generic_file_aio_write+0x238/0x460
+  [ffffffff81146fb8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125fbaf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81199812] do_sync_write+0xe2/0x120
+  [ffffffff8119a308] vfs_write+0xc8/0x180
+  [ffffffff8119a4c1] sys_write+0x51/0x90
+  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b
+
+  This is a buffered write.  It turns out the file size didn't match
+  the specified blocksize, so dd turns off O_DIRECT for the last IO
+  and issues a buffered write for the remainder.
+
+  Note that the actual IO submission is not visible in the stack trace
+  as the IOs are charged to the dirtying act.  Actual IOs are likely
+  to be executed from fsync(2).
+
+  #65 modifier=0x0
+  [ffffffff811c5e10] __mark_inode_dirty+0x220/0x330
+  [ffffffff81267edd] ext4_da_update_reserve_space+0x13d/0x230
+  [ffffffff8129006d] ext4_ext_map_blocks+0x13dd/0x1dc0
+  [ffffffff81268a31] ext4_map_blocks+0x1b1/0x260
+  [ffffffff81269c52] mpage_da_map_and_submit+0xb2/0x480
+  [ffffffff8126a84a] ext4_da_writepages+0x30a/0x6e0
+  [ffffffff8114f584] do_writepages+0x24/0x40
+  [ffffffff811468cb] __filemap_fdatawrite_range+0x5b/0x60
+  [ffffffff8114692a] filemap_write_and_wait_range+0x5a/0x80
+  [ffffffff8125ff64] ext4_sync_file+0x74/0x440
+  [ffffffff811ca31b] vfs_fsync_range+0x2b/0x40
+  [ffffffff811ca34c] vfs_fsync+0x1c/0x20
+  [ffffffff811ca58a] do_fsync+0x3a/0x60
+  [ffffffff811ca5e0] sys_fsync+0x10/0x20
+  [ffffffff81ab1fab] system_call_fastpath+0x16/0x1b
+
+  And this is dd fsync(2)ing and marking the inode for writeback.  As
+  with data writeback, IO submission is not visible in the stack trace.
+
+* counters_pipe_bin
+
+  Identical to counters_pipe but in compact binary format and likely
+  to be much more performant.  Each entry in the file is in the
+  following format as defined in include/linux/ioblame.h.
+
+  #define IOBC_PIPE_BIN_VER	1
+
+  /* record struct for /sys/kernel/debug/ioblame/counters_pipe_bin */
+  struct iobc_pipe_bin_record {
+	  uint16_t	len;
+	  uint16_t	ver;
+	  int32_t	id;		/* >0 pid or negated user id */
+	  uint32_t	intent_nr;	/* associated intent */
+	  uint32_t	dev;
+	  uint64_t	ino;
+	  uint32_t	gen;
+	  uint32_t	__pad;
+	  uint32_t	cnts[];		/* [type][slot] */
+  } __packed;
+
+  Note that counters_pipe and counters_pipe_bin can't be used in
+  parallel.  Only one opener is allowed across the two files at any
+  given moment.
+
+
+4. Notes
+
+* By the time ioblame reports IOs or counters, the task which gets
+  charged might have already exited; this is why ioblame prints the
+  task command in some reports but not in others.  Userland tools are
+  advised to use a combination of live task listing and process
+  accounting to match pids to commands.
+
+* dev:ino:gen can be mapped to a filename without scanning the whole
+  filesystem by constructing an FS-specific filehandle, opening it with
+  open_by_handle_at(2) and then readlink(2)ing /proc/self/fd/N.  This
+  will return the full path as long as the dentry is in cache, which is
+  likely if data acquisition and mapping don't happen too long after
+  the IOs (see the sketch after this list).
+
+* Mechanism to specify userland role ID is not implemented yet.
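+
+For example, on ext4, which exports FILEID_INO32_GEN style file
+handles (a 32bit inode number followed by a 32bit generation), the
+dev:ino:gen -> path mapping mentioned above can be done along the
+following lines.  This is an illustrative userland sketch only - error
+handling and the CAP_DAC_READ_SEARCH requirement of
+open_by_handle_at(2) are glossed over.
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+  #include <limits.h>
+  #include <stdio.h>
+  #include <stdint.h>
+  #include <stdlib.h>
+  #include <unistd.h>
+
+  /* @mount_fd is an open fd on the target filesystem's mount point */
+  static void print_path(int mount_fd, uint32_t ino, uint32_t gen)
+  {
+	struct file_handle *fh;
+	uint32_t *fid;
+	char proc[64], path[PATH_MAX];
+	ssize_t len;
+	int fd;
+
+	fh = malloc(sizeof(*fh) + 2 * sizeof(uint32_t));
+	fh->handle_bytes = 2 * sizeof(uint32_t);
+	fh->handle_type = 1;			/* FILEID_INO32_GEN */
+	fid = (uint32_t *)fh->f_handle;
+	fid[0] = ino;
+	fid[1] = gen;
+
+	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
+	if (fd >= 0) {
+		snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
+		len = readlink(proc, path, sizeof(path) - 1);
+		if (len > 0)
+			printf("%.*s\n", (int)len, path);
+		close(fd);
+	}
+	free(fh);
+  }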
+
+
+5. Overheads
+
+On x86_64, a role is 104 bytes, an intent 32 + 8 * stack_depth bytes
+and an act 72 bytes.  Intents are allocated using kzalloc() and there
+shouldn't be too many of them.  Both roles and acts have their own
+kmem_cache and can be monitored via /proc/slabinfo.
+
+Each act's counters occupy 32 * nr_counters bytes and are aligned to
+a cacheline.  Counters are allocated only as necessary.  The
+iob_counters kmem_cache is dynamically created on enable.
+
+The size of the page frame number -> dirtier mapping table is
+proportional to the amount of available physical memory.  If
+max_acts <= 65536, 2 bytes are used per PAGE_SIZE.  With 4k pages, at
+most ~0.049% of memory can be used.  If max_acts > 65536, 4 bytes are
+used, doubling the percentage to ~0.098%.  The table also grows
+dynamically.
+
+There are also indexing data structures used - hash tables, id[ra]s
+and a radix tree.  There are three hash tables, each sized according
+to max_{roles|intents|acts}.  The maximum memory usage by hash tables
+is sizeof(void *) * (max_roles + max_intents + max_acts).  Memory used
+by other indexing structures should be negligible.
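+
+For example, with max_roles = max_intents = 16384 and max_acts = 65536
+(illustrative values only), the hash tables consume at most
+8 * (16384 + 16384 + 65536) bytes = 768KiB on a 64bit machine.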
+
+Preliminary tests w/ fio ssd-test on a loopback device on tmpfs, which
+is purely CPU cycle bound, show a ~20% throughput hit.
+
+*** TODO: add performance testing results and explain involved CPU
+    overheads.
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4053cbd..4f42174 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -8,6 +8,7 @@
 #ifdef CONFIG_BLOCK
 
 #include <linux/types.h>
+#include <linux/ioblame.h>
 
 struct bio_set;
 struct bio;
@@ -66,10 +67,12 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 
 	void			*bi_private;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
-
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	struct iob_io_info	bi_iob_info;
+#endif
 	bio_destructor_t	*bi_destructor;	/* destructor */
 
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e0bc4ff..950b2b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -835,6 +835,9 @@ struct inode {
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
 	void			*i_private; /* fs or device private pointer */
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	union iob_id		i_iob_act;
+#endif
 };
 
 static inline int inode_unhashed(struct inode *inode)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index aefa6ba..7d02c88 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -190,6 +190,10 @@ struct gendisk {
 #ifdef  CONFIG_BLK_DEV_INTEGRITY
 	struct blk_integrity *integrity;
 #endif
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	bool iob_enabled;
+	u16 iob_scaled_last_sect;
+#endif
 	int node_id;
 };
 
diff --git a/include/linux/ioblame.h b/include/linux/ioblame.h
new file mode 100644
index 0000000..689b722
--- /dev/null
+++ b/include/linux/ioblame.h
@@ -0,0 +1,95 @@
+/*
+ * include/linux/ioblame.h - statistical IO analyzer with origin tracking
+ *
+ * Copyright (C) 2011 Google, Inc.
+ * Copyright (C) 2011 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _IOBLAME_H
+#define _IOBLAME_H
+
+#ifdef __KERNEL__
+
+#include <linux/rcupdate.h>
+
+struct page;
+struct inode;
+struct buffer_head;
+
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+
+/*
+ * Each iob_node is identified by 64bit id, which packs three fields in it
+ * - @type, @nr and @gen.  @nr is ida allocated index in @type.  It is
+ * always allocated from the lowest available slot, which allows efficient
+ * use of pgtree and idr; however, this means @nr is likely to be recycled.
+ * @gen is used to disambiguate recycled @nr's.
+ */
+#define IOB_NR_BITS			31
+#define IOB_GEN_BITS			31
+#define IOB_TYPE_BITS			2
+
+union iob_id {
+	u64				v;
+	struct {
+		u64			nr:IOB_NR_BITS;
+		u64			gen:IOB_GEN_BITS;
+		u64			type:IOB_TYPE_BITS;
+	} f;
+};
+
+struct iob_io_info {
+	unsigned long			rw;
+	sector_t			sector;
+
+	unsigned long			queued_at;
+	unsigned long			issued_at;
+
+	union iob_id			act;
+	unsigned int			size;
+};
+
+#endif	/* CONFIG_IO_BLAME[_MODULE] */
+#endif	/* __KERNEL__ */
+
+enum iob_special_nr {
+	IOB_NOMEM_NR,
+	IOB_LOST_NR,
+	IOB_BASE_NR,
+};
+
+#define IOB_INTENTS_BIN_VER	1
+
+/* intent modifier */
+#define IOB_MODIFIER_TYPE_SHIFT	28
+#define IOB_MODIFIER_TYPE_MASK	0xf0000000U
+#define IOB_MODIFIER_VAL_MASK	(~IOB_MODIFIER_TYPE_MASK)
+
+/* val contains wb_reason */
+#define IOB_MODIFIER_WB		(1 << IOB_MODIFIER_TYPE_SHIFT)
+
+/* record struct for /sys/kernel/debug/ioblame/intents_bin */
+struct iob_intent_bin_record {
+	uint16_t	len;
+	uint16_t	ver;
+	uint32_t	nr;
+	uint32_t	modifier;
+	uint32_t	__pad;
+	uint64_t	trace[];
+} __packed;
+
+#define IOBC_PIPE_BIN_VER	1
+
+/* record struct for /sys/kernel/debug/ioblame/counters_pipe_bin */
+struct iobc_pipe_bin_record {
+	uint16_t	len;
+	uint16_t	ver;
+	int32_t		id;		/* >0 pid or negated user id */
+	uint32_t	intent_nr;	/* associated intent */
+	uint32_t	dev;
+	uint64_t	ino;
+	uint32_t	gen;
+	uint32_t	__pad;
+	uint32_t	cnts[];		/* [type][slot] */
+} __packed;
+
+#endif	/* _IOBLAME_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..551d8fb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -368,6 +368,17 @@ config BLK_DEV_IO_TRACE
 
 	  If unsure, say N.
 
+config IO_BLAME
+	tristate "Enable io-blame tracer"
+	depends on SYSFS
+	depends on BLOCK
+	select TRACEPOINTS
+	select STACKTRACE
+	help
+	  Say Y here if you want to enable the end-to-end IO tracer.
+
+	  If unsure, say N.
+
 config KPROBE_EVENT
 	depends on KPROBES
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 5f39a07..408cd1a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
 ifeq ($(CONFIG_BLOCK),y)
 obj-$(CONFIG_EVENT_TRACING) += blktrace.o
 endif
+obj-$(CONFIG_IO_BLAME) += ioblame.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events.o
 obj-$(CONFIG_EVENT_TRACING) += trace_export.o
 obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
diff --git a/kernel/trace/ioblame.c b/kernel/trace/ioblame.c
new file mode 100644
index 0000000..9083675
--- /dev/null
+++ b/kernel/trace/ioblame.c
@@ -0,0 +1,3479 @@
+/*
+ * kernel/trace/ioblame.c - statistical IO analyzer with origin tracking
+ *
+ * Copyright (C) 2011 Google, Inc.
+ * Copyright (C) 2011 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/idr.h>
+#include <linux/bitmap.h>
+#include <linux/radix-tree.h>
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/stacktrace.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/log2.h>
+#include <linux/jhash.h>
+#include <linux/genhd.h>
+#include <linux/string.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/mm_types.h>
+#include <linux/fs.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/log2.h>
+#include <asm/div64.h>
+
+#include <trace/events/sched.h>
+#include <trace/events/vfs.h>
+#include <trace/events/writeback.h>
+#include <trace/events/block.h>
+
+#include "trace.h"
+
+#include <linux/ioblame.h>
+
+#define IOB_ROLE_NAMELEN	32
+#define IOB_STACK_MAX_DEPTH	32
+
+#define IOB_DFL_MAX_ROLES	(1 << 16)
+#define IOB_DFL_MAX_INTENTS	(1 << 10)
+#define IOB_DFL_MAX_ACTS	(1 << 16)
+#define IOB_DFL_TTL_SECS	120
+#define IOB_IOLOG_CNT		512
+
+#define IOB_LAST_INO_DURATION	(5 * HZ)	/* last_ino is valid for 5s */
+
+/*
+ * Each type represents different type of entities tracked by ioblame and
+ * has its own iob_idx.
+ *
+ * role		: "who" - either a task or custom id from userland.
+ *
+ * intent	: The who's intention - backtrace + modifier.
+ *
+ * act		: Product of role, intent and the target inode.  "who"
+ *		  acts on a target inode with certain backtrace.
+ */
+enum iob_type {
+	IOB_INVALID,
+	IOB_ROLE,
+	IOB_INTENT,
+	IOB_ACT,
+
+	IOB_NR_TYPES,
+};
+
+#define IOB_PACK_ID(_type, _nr, _gen)	\
+	(union iob_id){ .f = { .type = (_type), .nr = (_nr), .gen = (_gen) }}
+
+/* stats */
+struct iob_stats {
+	u64 idx_nomem;
+	u64 idx_nospc;
+	u64 node_nomem;
+	u64 pgtree_nomem;
+	u64 cnts_nomem;
+	u64 iolog_overflow;
+};
+
+/* iob_node is what iob_idx indexes and embedded in every iob_type */
+struct iob_node {
+	struct hlist_node	hash_node;
+	union iob_id		id;
+};
+
+/* describes properties and operations of an iob_type for iob_idx */
+struct iob_idx_type {
+	enum iob_type		type;
+
+	/* calculate hash value from key */
+	unsigned long		(*hash)(void *key);
+	/* return %true if @node matches @key */
+	bool			(*match)(struct iob_node *node, void *key);
+	/* create a new node which matches @key w/ alloc mask @gfp_mask */
+	struct iob_node		*(*create)(void *key, gfp_t gfp_mask);
+	/* destroy @node */
+	void			(*destroy)(struct iob_node *node);
+
+	/* keys for fallback nodes */
+	void			*nomem_key;
+	void			*lost_key;
+};
+
+/*
+ * iob_idx indexes iob_nodes.  iob_nodes can either be found via hash table
+ * or by id.f.nr.  Hash calculation and matching are determined by
+ * iob_idx_type.  If a node is missing during hash lookup, new one is
+ * automatically created.
+ */
+struct iob_idx {
+	const struct iob_idx_type *type;
+
+	/* hash */
+	struct hlist_head	*hash;
+	unsigned int		hash_mask;
+
+	/* id index */
+	struct ida		ida;		/* used for allocation */
+	struct idr		idr;		/* record node or gen */
+
+	/* fallback nodes */
+	struct iob_node		*nomem_node;
+	struct iob_node		*lost_node;
+
+	/* stats */
+	unsigned int		nr_nodes;
+	unsigned int		max_nodes;
+};
+
+/*
+ * Functions to encode and decode pointer and generation for iob_idx->idr.
+ *
+ * id.f.gen is used to disambiguate recycled id.f.nr.  When there's no
+ * active node, iob_idx->idr slot carries the last generation number.
+ */
+static void *iob_idr_encode_node(struct iob_node *node)
+{
+	BUG_ON((unsigned long)node & 1);
+	return node;
+}
+
+static void *iob_idr_encode_gen(u32 gen)
+{
+	unsigned long v = (unsigned long)gen;
+	return (void *)((v << 1) | 1);
+}
+
+static struct iob_node *iob_idr_node(void *p)
+{
+	unsigned long v = (unsigned long)p;
+	return (v & 1) ? NULL : (void *)v;
+}
+
+static u32 iob_idr_gen(void *p)
+{
+	unsigned long v = (unsigned long)p;
+	return (v & 1) ? v >> 1 : 0;
+}
+
+/* IOB_ROLE */
+struct iob_role {
+	struct iob_node		node;
+
+	/*
+	 * For task roles, because a task can change its pid during exec
+	 * and we want exact match for removal on task exit, we use task
+	 * pointer as key and ->pid contains the pid.  For userland specified
+	 * roles, ->task is %NULL and ->pid is the negated user id and is
+	 * used as the key.
+	 */
+	int			pid;	/* pid for troles, -id for uroles */
+	struct task_struct	*task;	/* %NULL for uroles */
+	union iob_id		user_role;
+
+	/* modifier currently in effect */
+	u32			modifier;
+
+	/* last file this trole has operated on */
+	struct {
+		dev_t			dev;
+		u32			gen;
+		ino_t			ino;
+	} last_ino;
+	unsigned long		last_ino_jiffies;
+
+	/* act for inode dirtying/writing in progress */
+	union iob_id		inode_act;
+
+	/* for reclaiming */
+	struct list_head	free_list;
+};
+
+/* IOB_INTENT - uses separate key struct to use struct stack_trace directly */
+struct iob_intent_key {
+	u32			modifier;
+	int			depth;
+	unsigned long		*trace;
+};
+
+struct iob_intent {
+	struct iob_node		node;
+
+	u32			modifier;
+	int			depth;
+	unsigned long		trace[];
+};
+
+/* IOB_ACT */
+struct iob_act {
+	struct iob_node		node;
+
+	u32			*cnts;	/* [slot][type] */
+	struct iob_act		*free_next;
+
+	/* key fields follow - paddings, if any, should be zero filled */
+	union iob_id		role;	/* must be the first field of keys */
+	union iob_id		intent;
+	dev_t			dev;
+	u32			gen;
+	ino_t			ino;
+};
+
+#define IOB_ACT_KEY_OFFSET	offsetof(struct iob_act, role)
+
+static DEFINE_MUTEX(iob_mutex);		/* enable/disable and userland access */
+static DEFINE_SPINLOCK(iob_lock);	/* write access to all int structures */
+
+static bool iob_enabled __read_mostly = false;
+
+/* temp buffer used for parsing/printing, user must be holding iob_mutex */
+static char __iob_page_buf[PAGE_SIZE];
+#define iob_page_buf	({ lockdep_assert_held(&iob_mutex); __iob_page_buf; })
+
+/* userland tunable knobs */
+static unsigned int iob_max_roles __read_mostly = IOB_DFL_MAX_ROLES;
+static unsigned int iob_max_intents __read_mostly = IOB_DFL_MAX_INTENTS;
+static unsigned int iob_max_acts __read_mostly = IOB_DFL_MAX_ACTS;
+static unsigned int iob_ttl_secs __read_mostly = IOB_DFL_TTL_SECS;
+static bool iob_ignore_ino __read_mostly;
+
+/* pgtree params, determined by iob_max_acts */
+static unsigned long iob_pgtree_shift __read_mostly;
+static unsigned long iob_pgtree_pfn_shift __read_mostly;
+static unsigned long iob_pgtree_pfn_mask __read_mostly;
+
+/* role and act caches, intent is variable size and allocated using kzalloc */
+static struct kmem_cache *iob_role_cache;
+static struct kmem_cache *iob_act_cache;
+
+/* iob_idx for each iob_type */
+static struct iob_idx *iob_role_idx __read_mostly;
+static struct iob_idx *iob_intent_idx __read_mostly;
+static struct iob_idx *iob_act_idx __read_mostly;
+
+/* for iob_role reclaiming */
+static unsigned long iob_role_reclaim_tstmp;
+
+static struct list_head iob_role_to_free_heads[2] = {
+	LIST_HEAD_INIT(iob_role_to_free_heads[0]),
+	LIST_HEAD_INIT(iob_role_to_free_heads[1]),
+};
+static struct list_head *iob_role_to_free_front = &iob_role_to_free_heads[0];
+static struct list_head *iob_role_to_free_back = &iob_role_to_free_heads[1];
+
+/* for iob_act reclaiming */
+static unsigned long iob_act_reclaim_tstmp;
+static unsigned long *iob_act_used_bitmaps[4];
+
+struct iob_act_used {
+	unsigned long	*cur;
+	unsigned long	*staging;
+	unsigned long	*front;
+	unsigned long	*back;
+} iob_act_used;
+
+/* pgtree - maps pfn to act id.f.nr */
+static RADIX_TREE(iob_pgtree, GFP_NOWAIT);
+
+/* stats and /sys/kernel/debug/ioblame */
+static struct iob_stats iob_stats;
+static struct dentry *iob_dir;
+
+/*
+ * Counters - counts io events in histograms.  What events are counted how
+ * is userland configurable.
+ */
+
+/* number of histogram slots, there are places which assume it to be 8 */
+#define IOBC_NR_SLOTS			8
+
+/* for data direction predicate */
+enum iobc_dir {
+	IOBC_READ			= 1 << 0, /* reads (sans read aheads) */
+	IOBC_RAHEAD			= 1 << 1, /* readaheads */
+	IOBC_WRITE			= 1 << 2, /* writes */
+};
+
+/* fields which can be counted, can also be used in filters */
+enum iobc_field {
+	IOBC_OFFSET,			/* request offset (scaled to 0-65535) */
+	IOBC_SIZE,			/* size of request */
+	IOBC_WAIT_TIME,			/* wait time in usecs */
+	IOBC_IO_TIME,			/* service time in usecs */
+	IOBC_SEEK_DIST,			/* scaled seek distance */
+
+	IOBC_NR_FIELDS,
+};
+
+/* and their userland visible names for counter and filter specification */
+static char *iobc_field_strs[] = {
+	[IOBC_OFFSET]			= "offset",
+	[IOBC_SIZE]			= "size",
+	[IOBC_WAIT_TIME]		= "wait_time",
+	[IOBC_IO_TIME]			= "io_time",
+	[IOBC_SEEK_DIST]		= "seek_dist",
+};
+
+/* max len of field name, 32 gotta be enough */
+#define IOBC_FIELD_MAX_LEN		32
+
+/* struct to RCU free event filter */
+struct iobc_filter_rcu {
+	struct event_filter		*filter;
+	struct rcu_head			rcu_head;
+};
+
+/*
+ * Describes a counter type.  There are @iobc_nr_types types and each
+ * iob_act has matching set of histograms.
+ */
+struct iobc_type {
+	u16				dir;	/* data direction predicate */
+	u16				field;	/* field to count */
+
+	/*
+	 * Histogram boundaries.  bounds[N] <= bounds[N+1] should hold
+	 * except for the last entry.  The first and last entries are
+	 * cutoff conditions and the last can be zero denoting no limit.
+	 * Internal entries are used to decide the histogram slot to use.
+	 */
+	u32				bounds[IOBC_NR_SLOTS + 1];
+
+	/* optional filter, all fields can be used in event filter format */
+	struct event_filter __rcu	*filter;
+	struct iobc_filter_rcu		*filter_rcu;
+};
+
+/* constructed during module init and fed to trace_event_filter_create() */
+static LIST_HEAD(iobc_event_field_list);
+static struct ftrace_event_field iobc_event_fields[IOBC_NR_FIELDS];
+
+/* configured counters types and kmem cache */
+static int iobc_nr_types;
+static struct iobc_type *iobc_types;
+static struct kmem_cache *iobc_cnts_cache;
+
+/* iobc_pipe in use? */
+static bool iobc_pipe_opened;
+
+/* ioblame/counters directory */
+static struct dentry *iobc_dir;
+
+static void iob_count(struct iob_io_info *io, struct gendisk *disk);
+
+/* ioblame/iolog for slow verbose per-io output */
+static DEFINE_SPINLOCK(iob_iolog_lock);
+static DECLARE_WAIT_QUEUE_HEAD(iob_iolog_wait);
+static struct iob_io_info *iob_iolog;
+static char *iob_iolog_buf;
+static int iob_iolog_head, iob_iolog_tail;
+
+static void iob_iolog_fill(struct iob_io_info *io);
+
+
+/*
+ * IOB_IDX
+ *
+ * This is the main indexing facility used to maintain and access all
+ * iob_type objects.  iob_idx operates on iob_node which each iob_type
+ * object embeds.
+ *
+ * Each iob_idx is associated with iob_idx_type on creation, which
+ * describes which type it is, methods used during hash lookup and two keys
+ * for fallback node creation.
+ *
+ * Objects can be accessed either by hash table or id.  Hash table lookup
+ * uses iob_idx_type->hash() and ->match() methods for lookup and
+ * ->create() and ->destroy() to create new object if missing and
+ * requested.  Note that the hash key is opaque to iob_idx.  Key handling
+ * is defined completely by iob_idx_type methods.
+ *
+ * When a new object is created, iob_idx automatically assigns an id, which
+ * is combination of type enum, object number (nr), and generation number.
+ * Object number is ida allocated and always packed towards 0.  Generation
+ * number starts at 1 and gets incremented each time the nr is recycled.
+ *
+ * Access by id is either by whole id or nr part of it.  Objects are not
+ * created through id lookups.
+ *
+ * Read accesses are protected by sched_rcu.  Using sched_rcu allows
+ * avoiding extra rcu locking operations in tracepoint probes.  Write
+ * accesses are expected to be infrequent and synchronized with single
+ * spinlock - iob_lock.
+ */
+
+static int iob_idx_install_node(struct iob_node *node, struct iob_idx *idx,
+				gfp_t gfp_mask)
+{
+	const struct iob_idx_type *type = idx->type;
+	int nr = -1, idr_nr = -1, ret;
+	void *p;
+
+	INIT_HLIST_NODE(&node->hash_node);
+
+	/* allocate nr and make sure it's under the limit */
+	do {
+		if (unlikely(!ida_pre_get(&idx->ida, gfp_mask)))
+			goto enomem;
+		ret = ida_get_new(&idx->ida, &nr);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (unlikely(ret < 0 || nr >= idx->max_nodes))
+		goto enospc;
+
+	/* if @nr was used before, idr would have last_gen recorded, look up */
+	p = idr_find(&idx->idr, nr);
+	if (p) {
+		WARN_ON_ONCE(iob_idr_node(p));
+		/* set id with gen before replacing the idr entry */
+		node->id = IOB_PACK_ID(type->type, nr, iob_idr_gen(p) + 1);
+		idr_replace(&idx->idr, node, nr);
+		return 0;
+	}
+
+	/* create a new idr entry, it must match ida allocation */
+	node->id = IOB_PACK_ID(type->type, nr, 1);
+	do {
+		if (unlikely(!idr_pre_get(&idx->idr, gfp_mask)))
+			goto enomem;
+		ret = idr_get_new_above(&idx->idr, iob_idr_encode_node(node),
+					nr, &idr_nr);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (unlikely(ret < 0) || WARN_ON_ONCE(idr_nr != nr))
+		goto enospc;
+
+	return 0;
+
+enomem:
+	iob_stats.idx_nomem++;
+	ret = -ENOMEM;
+	goto fail;
+enospc:
+	iob_stats.idx_nospc++;
+	ret = -ENOSPC;
+fail:
+	if (idr_nr >= 0)
+		idr_remove(&idx->idr, idr_nr);
+	if (nr >= 0)
+		ida_remove(&idx->ida, nr);
+	return ret;
+}
+
+/**
+ * iob_idx_destroy - destroy iob_idx
+ * @idx: iob_idx to destroy
+ *
+ * Free all nodes indexed by @idx and @idx itself.  The caller is
+ * responsible for ensuring nobody is accessing @idx.
+ */
+static void iob_idx_destroy(struct iob_idx *idx)
+{
+	const struct iob_idx_type *type = idx->type;
+	void *ptr;
+	int pos = 0;
+
+	while ((ptr = idr_get_next(&idx->idr, &pos))) {
+		struct iob_node *node = iob_idr_node(ptr);
+		if (node)
+			type->destroy(node);
+		pos++;
+	}
+
+	idr_remove_all(&idx->idr);
+	idr_destroy(&idx->idr);
+	ida_destroy(&idx->ida);
+
+	vfree(idx->hash);
+	kfree(idx);
+}
+
+/**
+ * iob_idx_create - create a new iob_idx
+ * @type: type of new iob_idx
+ * @max_nodes: maximum number of nodes allowed
+ *
+ * Create a new @type iob_idx.  Newly created iob_idx has two fallback
+ * nodes pre-allocated - one for nomem and the other for lost nodes, each
+ * occupying IOB_NOMEM_NR and IOB_LOST_NR slot respectively.
+ *
+ * Returns pointer to the new iob_idx on success, %NULL on failure.
+ */
+static struct iob_idx *iob_idx_create(const struct iob_idx_type *type,
+				      unsigned int max_nodes)
+{
+	unsigned int hash_sz = rounddown_pow_of_two(max_nodes);
+	struct iob_idx *idx;
+	struct iob_node *node;
+
+	if (max_nodes < 2)
+		return NULL;
+
+	/* alloc and init */
+	idx = kzalloc(sizeof(*idx), GFP_KERNEL);
+	if (!idx)
+		return NULL;
+
+	ida_init(&idx->ida);
+	idr_init(&idx->idr);
+	idx->type = type;
+	idx->max_nodes = max_nodes;
+	idx->hash_mask = hash_sz - 1;
+
+	idx->hash = vzalloc(hash_sz * sizeof(idx->hash[0]));
+	if (!idx->hash)
+		goto fail;
+
+	/* create and install nomem_node */
+	node = type->create(type->nomem_key, GFP_KERNEL);
+	if (!node)
+		goto fail;
+	if (iob_idx_install_node(node, idx, GFP_KERNEL) < 0) {
+		type->destroy(node);
+		goto fail;
+	}
+	idx->nomem_node = node;
+	idx->nr_nodes++;
+
+	/* create and install lost_node */
+	node = type->create(type->lost_key, GFP_KERNEL);
+	if (!node)
+		goto fail;
+	if (iob_idx_install_node(node, idx, GFP_KERNEL) < 0) {
+		type->destroy(node);
+		goto fail;
+	}
+	idx->lost_node = node;
+	idx->nr_nodes++;
+
+	/* verify both fallback nodes have the correct id.f.nr */
+	if (idx->nomem_node->id.f.nr != IOB_NOMEM_NR ||
+	    idx->lost_node->id.f.nr != IOB_LOST_NR)
+		goto fail;
+
+	return idx;
+fail:
+	iob_idx_destroy(idx);
+	return NULL;
+}
+
+/**
+ * iob_node_by_nr_raw - lookup node by nr
+ * @nr: nr to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Lookup node occupying slot @nr.  If such node doesn't exist, %NULL is
+ * returned.
+ */
+static struct iob_node *iob_node_by_nr_raw(int nr, struct iob_idx *idx)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+	return iob_idr_node(idr_find(&idx->idr, nr));
+}
+
+/**
+ * iob_node_by_id_raw - lookup node by id
+ * @id: id to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Lookup node with @id.  @id's type should match @idx's type and all three
+ * id fields should match for successful lookup - type, id and generation.
+ * Returns %NULL on failure.
+ */
+static struct iob_node *iob_node_by_id_raw(union iob_id id, struct iob_idx *idx)
+{
+	struct iob_node *node;
+
+	WARN_ON_ONCE(id.f.type != idx->type->type);
+
+	node = iob_node_by_nr_raw(id.f.nr, idx);
+	if (likely(node && node->id.v == id.v))
+		return node;
+	return NULL;
+}
+
+static struct iob_node *iob_hash_head_lookup(void *key,
+					     struct hlist_head *hash_head,
+					     const struct iob_idx_type *type)
+{
+	struct hlist_node *pos;
+	struct iob_node *node;
+
+	hlist_for_each_entry_rcu(node, pos, hash_head, hash_node)
+		if (type->match(node, key))
+			return node;
+	return NULL;
+}
+
+/**
+ * iob_get_node_raw - lookup node from hash table and create if missing
+ * @key: key to lookup hash table with
+ * @idx: iob_idx to lookup from
+ * @create: whether to create a new node if lookup fails
+ *
+ * Look up node which matches @key in @idx.  If no such node exists and
+ * @create is %true, create a new one.  A newly created node will have
+ * unique id assigned to it as long as generation number doesn't overflow.
+ *
+ * This function should be called under rcu sched read lock and returns
+ * %NULL on failure.
+ */
+static struct iob_node *iob_get_node_raw(void *key, struct iob_idx *idx,
+					 bool create)
+{
+	const struct iob_idx_type *type = idx->type;
+	struct iob_node *node, *new_node;
+	struct hlist_head *hash_head;
+	unsigned long hash, flags;
+
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	/* lookup hash */
+	hash = type->hash(key);
+	hash_head = &idx->hash[hash & idx->hash_mask];
+
+	node = iob_hash_head_lookup(key, hash_head, type);
+	if (node || !create)
+		return node;
+
+	/* non-existent && @create, create new one */
+	new_node = type->create(key, GFP_NOWAIT);
+	if (!new_node) {
+		iob_stats.node_nomem++;
+		return NULL;
+	}
+
+	spin_lock_irqsave(&iob_lock, flags);
+
+	/* someone might have inserted it in between, look up again */
+	node = iob_hash_head_lookup(key, hash_head, type);
+	if (node)
+		goto out_unlock;
+
+	/* install the node and add to the hash table */
+	if (iob_idx_install_node(new_node, idx, GFP_NOWAIT))
+		goto out_unlock;
+
+	hlist_add_head_rcu(&new_node->hash_node, hash_head);
+	idx->nr_nodes++;
+
+	node = new_node;
+	new_node = NULL;
+out_unlock:
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	if (unlikely(new_node))
+		type->destroy(new_node);
+	return node;
+}
+
+/**
+ * iob_node_by_nr - lookup node by nr with fallback
+ * @nr: nr to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_node_by_nr_raw() but returns @idx->lost_node instead of
+ * %NULL if lookup fails.  The lost_node is returned as nr/id lookup
+ * failure indicates the target node has already been reclaimed.
+ */
+static struct iob_node *iob_node_by_nr(int nr, struct iob_idx *idx)
+{
+	return iob_node_by_nr_raw(nr, idx) ?: idx->lost_node;
+}
+
+/**
+ * iob_node_by_id - lookup node by id with fallback
+ * @id: id to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_node_by_id_raw() but returns @idx->lost_node instead of
+ * %NULL if lookup fails.  The lost_node is returned as nr/id lookup
+ * failure indicates the target node has already been reclaimed.
+ */
+static struct iob_node *iob_node_by_id(union iob_id id, struct iob_idx *idx)
+{
+	return iob_node_by_id_raw(id, idx) ?: idx->lost_node;
+}
+
+/**
+ * iob_get_node - lookup node from hash table, creating if missing, w/ fallback
+ * @key: key to lookup hash table with
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_get_node_raw(@key, @idx, %true) but returns @idx->nomem_node
+ * instead of %NULL on failure, as the only failure mode is allocation
+ * failure.
+ */
+static struct iob_node *iob_get_node(void *key, struct iob_idx *idx)
+{
+	return iob_get_node_raw(key, idx, true) ?: idx->nomem_node;
+}
+
+/**
+ * iob_unhash_node - unhash an iob_node
+ * @node: node to unhash
+ * @idx: iob_idx @node is hashed on
+ *
+ * Make @node invisible from hash lookup.  It will still be visible from
+ * id/nr lookup.
+ *
+ * Must be called holding iob_lock and returns %true if unhashed
+ * successfully, %false if someone else already unhashed it.
+ */
+static bool iob_unhash_node(struct iob_node *node, struct iob_idx *idx)
+{
+	lockdep_assert_held(&iob_lock);
+
+	if (hlist_unhashed(&node->hash_node))
+		return false;
+	hlist_del_init_rcu(&node->hash_node);
+	return true;
+}
+
+/**
+ * iob_remove_node - remove an iob_node
+ * @node: node to remove
+ * @idx: iob_idx @node is on
+ *
+ * Remove @node from @idx.  The caller is responsible for calling
+ * iob_unhash_node() before.  Note that removed nodes should be freed only
+ * after RCU grace period has passed.
+ *
+ * Must be called holding iob_lock.
+ */
+static void iob_remove_node(struct iob_node *node, struct iob_idx *idx)
+{
+	lockdep_assert_held(&iob_lock);
+
+	/* don't remove idr slot, record current generation there */
+	idr_replace(&idx->idr, iob_idr_encode_gen(node->id.f.gen),
+		    node->id.f.nr);
+	ida_remove(&idx->ida, node->id.f.nr);
+	idx->nr_nodes--;
+}
+
+
+/*
+ * IOB_ROLE
+ *
+ * There are two types of roles - task and user specified.  task_role
+ * represents a task and keyed by its task pointer.  task_role is created
+ * when the matching task first enters iob tracking, unhashed on task exit
+ * and destroyed after reclaim period has passed.
+ *
+ * The reason task_roles are keyed by task pointer instead of pid is
+ * because pid can change across exec(2) and we need reliable match on task
+ * exit to avoid leaking task_roles.  A task_role is unhashed and scheduled
+ * for removal on task exit or if the pid no longer matches after exec.
+ *
+ * These life-cycle rules guarantee that any task is given one id across
+ * its lifetime and avoids resource leaks.
+ *
+ * task_roles also carry context information for the task, e.g. whether the
+ * task is currently assuming a user specified role, the last file the task
+ * operated on, currently on-going inode operation and so on.
+ *
+ * User specified roles are identified by positive integers, which are
+ * stored negated in role->pid, and managed by userland.  Userland can
+ * request the current task to assume a user specified role, in which case
+ * all actions taken by the task are attributed to the user specified role.
+ */
+
+static struct iob_role *iob_node_to_role(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_role, node) : NULL;
+}
+
+static unsigned long iob_role_hash(void *key)
+{
+	struct iob_role *rkey = key;
+
+	/* task_roles are keyed by task ptr, user roles by id */
+	if (rkey->pid >= 0)
+		return jhash(rkey->task, sizeof(rkey->task), JHASH_INITVAL);
+	else
+		return jhash_1word(rkey->pid, JHASH_INITVAL);
+}
+
+static bool iob_role_match(struct iob_node *node, void *key)
+{
+	struct iob_role *role = iob_node_to_role(node);
+	struct iob_role *rkey = key;
+
+	if (rkey->pid >= 0)
+		return rkey->task == role->task;
+	else
+		return rkey->pid == role->pid;
+}
+
+static struct iob_node *iob_role_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_role *rkey = key;
+	struct iob_role *role;
+
+	role = kmem_cache_alloc(iob_role_cache, gfp_mask);
+	if (!role)
+		return NULL;
+	*role = *rkey;
+	INIT_LIST_HEAD(&role->free_list);
+	return &role->node;
+}
+
+static void iob_role_destroy(struct iob_node *node)
+{
+	kmem_cache_free(iob_role_cache, iob_node_to_role(node));
+}
+
+static struct iob_role iob_role_null_key = { };
+
+static const struct iob_idx_type iob_role_idx_type = {
+	.type		= IOB_ROLE,
+
+	.hash		= iob_role_hash,
+	.match		= iob_role_match,
+	.create		= iob_role_create,
+	.destroy	= iob_role_destroy,
+
+	.nomem_key	= &iob_role_null_key,
+	.lost_key	= &iob_role_null_key,
+};
+
+static struct iob_role *iob_role_by_id(union iob_id id)
+{
+	return iob_node_to_role(iob_node_by_id(id, iob_role_idx));
+}
+
+/**
+ * iob_reclaim_current_task_role - reclaim the current task_role
+ *
+ * Unhash task_role.  This function guarantees that the %current task_role
+ * won't be visible to hash table lookup by itself.
+ */
+static void iob_reclaim_current_task_role(void)
+{
+	struct iob_role rkey = { };
+	struct iob_role *trole;
+	unsigned long flags;
+
+	/*
+	 * A task_role is always created by %current and thus guaranteed to
+	 * be visible to %current.  Negative result from lockless lookup
+	 * can be trusted.
+	 */
+	rkey.task = current;
+	rkey.pid = task_pid_nr(current);
+	trole = iob_node_to_role(iob_get_node_raw(&rkey, iob_role_idx, false));
+	if (!trole)
+		return;
+
+	/* unhash and queue on reclaim list */
+	spin_lock_irqsave(&iob_lock, flags);
+	WARN_ON_ONCE(!iob_unhash_node(&trole->node, iob_role_idx));
+	WARN_ON_ONCE(!list_empty(&trole->free_list));
+	list_add_tail(&trole->free_list, iob_role_to_free_front);
+	spin_unlock_irqrestore(&iob_lock, flags);
+}
+
+/**
+ * iob_current_task_role - lookup task_role for %current
+ *
+ * Return task_role for %current.  May return nomem node under memory
+ * pressure.
+ */
+static struct iob_role *iob_current_task_role(void)
+{
+	struct iob_role rkey = { };
+	struct iob_role *trole;
+	bool retried = false;
+
+	rkey.task = current;
+	rkey.pid = task_pid_nr(current);
+retry:
+	trole = iob_node_to_role(iob_get_node(&rkey, iob_role_idx));
+
+	/*
+	 * If %current exec'd, its pid may have changed.  In such cases,
+	 * shoot down the current task_role and retry.
+	 */
+	if (trole->pid == rkey.pid || trole->node.id.f.nr < IOB_BASE_NR)
+		return trole;
+
+	iob_reclaim_current_task_role();
+
+	/* this shouldn't happen more than once */
+	WARN_ON_ONCE(retried);
+	retried = true;
+	goto retry;
+}
+
+/**
+ * iob_task_role_to_role - return the role to use for IO blaming
+ * @trole: task_role of interest
+ *
+ * If @trole has a user role, return it; otherwise, return @trole.
+ */
+static struct iob_role *iob_task_role_to_role(struct iob_role *trole)
+{
+	struct iob_role *urole;
+
+	if (!trole || !trole->user_role.v)
+		return trole;
+
+	/* look up user role */
+	urole = iob_role_by_id(trole->user_role);
+	if (urole) {
+		WARN_ON_ONCE(urole->pid >= 0);
+		return urole;
+	}
+
+	/* user_role is dangling, clear it */
+	trole->user_role.v = 0;
+	return trole;
+}
+
+/**
+ * iob_current_role - lookup role for %current
+ *
+ * Return role to use for IO blaming.
+ */
+static struct iob_role *iob_current_role(void)
+{
+	return iob_task_role_to_role(iob_current_task_role());
+}
+
+
+/*
+ * IOB_INTENT
+ *
+ * An intent represents a category of actions a task can take.  It
+ * currently consists of stack trace at the point of action and modifier.
+ * The number of unique backtraces is expected to be limited and no
+ * reclaiming is implemented.
+ */
+
+static struct iob_intent *iob_node_to_intent(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_intent, node) : NULL;
+}
+
+static unsigned long iob_intent_hash(void *key)
+{
+	struct iob_intent_key *ikey = key;
+
+	return jhash(ikey->trace, ikey->depth * sizeof(ikey->trace[0]),
+		     JHASH_INITVAL + ikey->modifier);
+}
+
+static bool iob_intent_match(struct iob_node *node, void *key)
+{
+	struct iob_intent *intent = iob_node_to_intent(node);
+	struct iob_intent_key *ikey = key;
+
+	if (intent->modifier == ikey->modifier &&
+	    intent->depth == ikey->depth)
+		return !memcmp(intent->trace, ikey->trace,
+			       intent->depth * sizeof(intent->trace[0]));
+	return false;
+}
+
+static struct iob_node *iob_intent_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_intent_key *ikey = key;
+	struct iob_intent *intent;
+	size_t trace_sz = sizeof(intent->trace[0]) * ikey->depth;
+
+	intent = kzalloc(sizeof(*intent) + trace_sz, gfp_mask);
+	if (!intent)
+		return NULL;
+
+	intent->modifier = ikey->modifier;
+	intent->depth = ikey->depth;
+	memcpy(intent->trace, ikey->trace, trace_sz);
+	return &intent->node;
+}
+
+static void iob_intent_destroy(struct iob_node *node)
+{
+	kfree(iob_node_to_intent(node));
+}
+
+static struct iob_intent_key iob_intent_null_key = { };
+
+static const struct iob_idx_type iob_intent_idx_type = {
+	.type		= IOB_INTENT,
+
+	.hash		= iob_intent_hash,
+	.match		= iob_intent_match,
+	.create		= iob_intent_create,
+	.destroy	= iob_intent_destroy,
+
+	.nomem_key	= &iob_intent_null_key,
+	.lost_key	= &iob_intent_null_key,
+};
+
+static struct iob_intent *iob_intent_by_nr(int nr)
+{
+	return iob_node_to_intent(iob_node_by_nr(nr, iob_intent_idx));
+}
+
+static struct iob_intent *iob_intent_by_id(union iob_id id)
+{
+	return iob_node_to_intent(iob_node_by_id(id, iob_intent_idx));
+}
+
+static struct iob_intent *iob_get_intent(unsigned long *trace, int depth,
+					 u32 modifier)
+{
+	struct iob_intent_key ikey = { .modifier = modifier, .depth = depth,
+				       .trace = trace };
+
+	return iob_node_to_intent(iob_get_node(&ikey, iob_intent_idx));
+}
+
+static DEFINE_PER_CPU(unsigned long [IOB_STACK_MAX_DEPTH], iob_trace_buf_pcpu);
+
+/**
+ * iob_current_intent - return intent for %current
+ * @skip: number of stack frames to skip
+ *
+ * Acquire stack trace after skipping @skip frames and return matching
+ * iob_intent.  The stack trace never includes iob_current_intent() itself;
+ * @skip of 1 skips the caller, not iob_current_intent().  May return nomem
+ * node under memory pressure.
+ */
+static noinline struct iob_intent *iob_current_intent(int skip)
+{
+	unsigned long *trace = *this_cpu_ptr(&iob_trace_buf_pcpu);
+	struct stack_trace st = { .max_entries = IOB_STACK_MAX_DEPTH,
+				  .entries = trace, .skip = skip + 1 };
+	struct iob_intent *intent;
+	unsigned long flags;
+
+	/* disable IRQ to make trace_pcpu array access exclusive */
+	local_irq_save(flags);
+
+	/* acquire stack trace, ignore -1LU end of stack marker */
+	save_stack_trace_quick(&st);
+	if (st.nr_entries && trace[st.nr_entries - 1] == ULONG_MAX)
+		st.nr_entries--;
+
+	/* get matching iob_intent */
+	intent = iob_get_intent(trace, st.nr_entries, 0);
+
+	local_irq_restore(flags);
+	return intent;
+}
+
+
+/*
+ * IOB_ACT
+ *
+ * Represents a specific action an iob_role took.  Consists of an iob_role,
+ * an iob_intent and the target inode.  iob_act is what ioblame tracks.  For
+ * each operation which needs to be blamed, iob_act is acquired and
+ * recorded (either by id or id.f.nr) and used for data gathering and
+ * reporting.
+ *
+ * Because this is the product of three different entities, the number can
+ * grow quite large.  Each successful lookup updates the used bitmap, and
+ * iob_acts which haven't been used for iob_ttl_secs are reclaimed after
+ * data is collected by userland.
+ */
+
+static void iob_act_mark_used(struct iob_act *act)
+{
+	if (!test_bit(act->node.id.f.nr, iob_act_used.cur))
+		set_bit(act->node.id.f.nr, iob_act_used.cur);
+}
+
+static struct iob_act *iob_node_to_act(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_act, node) : NULL;
+}
+
+static unsigned long iob_act_hash(void *key)
+{
+	return jhash(key + IOB_ACT_KEY_OFFSET,
+		     sizeof(struct iob_act) - IOB_ACT_KEY_OFFSET,
+		     JHASH_INITVAL);
+}
+
+static bool iob_act_match(struct iob_node *node, void *key)
+{
+	return !memcmp((void *)node + IOB_ACT_KEY_OFFSET,
+		       key + IOB_ACT_KEY_OFFSET,
+		       sizeof(struct iob_act) - IOB_ACT_KEY_OFFSET);
+}
+
+static struct iob_node *iob_act_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_act *akey = key;
+	struct iob_act *act;
+
+	act = kmem_cache_alloc(iob_act_cache, gfp_mask);
+	if (!act)
+		return NULL;
+	*act = *akey;
+	return &act->node;
+}
+
+static void iob_act_destroy(struct iob_node *node)
+{
+	struct iob_act *act = iob_node_to_act(node);
+
+	if (act->cnts)
+		kmem_cache_free(iobc_cnts_cache, act->cnts);
+	kmem_cache_free(iob_act_cache, act);
+}
+
+static struct iob_act iob_act_nomem_key = {
+	.role		= IOB_PACK_ID(IOB_ROLE, IOB_NOMEM_NR, 1),
+	.intent		= IOB_PACK_ID(IOB_INTENT, IOB_NOMEM_NR, 1),
+};
+
+static struct iob_act iob_act_lost_key = {
+	.role		= IOB_PACK_ID(IOB_ROLE, IOB_LOST_NR, 1),
+	.intent		= IOB_PACK_ID(IOB_INTENT, IOB_LOST_NR, 1),
+};
+
+static const struct iob_idx_type iob_act_idx_type = {
+	.type		= IOB_ACT,
+
+	.hash		= iob_act_hash,
+	.match		= iob_act_match,
+	.create		= iob_act_create,
+	.destroy	= iob_act_destroy,
+
+	.nomem_key	= &iob_act_nomem_key,
+	.lost_key	= &iob_act_lost_key,
+};
+
+static struct iob_act *iob_act_by_nr(int nr)
+{
+	return iob_node_to_act(iob_node_by_nr(nr, iob_act_idx));
+}
+
+static struct iob_act *iob_act_by_id(union iob_id id)
+{
+	return iob_node_to_act(iob_node_by_id(id, iob_act_idx));
+}
+
+/**
+ * iob_current_act - return the current iob_act
+ * @stack_skip: number of stack frames to skip when acquiring iob_intent
+ * @dev: dev_t of the inode being operated on
+ * @ino: ino of the inode being operated on
+ * @gen: generation of the inode being operated on
+ *
+ * Return iob_act for %current with the current backtrace.
+ * iob_current_act() is never included in the backtrace.  May return nomem
+ * node under memory pressure.
+ */
+static __always_inline struct iob_act *iob_current_act(int stack_skip,
+						dev_t dev, ino_t ino, u32 gen)
+{
+	struct iob_role *role = iob_current_role();
+	struct iob_intent *intent = iob_current_intent(stack_skip);
+	struct iob_act akey = { .role = role->node.id,
+				.intent = intent->node.id, .dev = dev };
+	struct iob_act *act;
+	int min_nr;
+
+	/* if either role or intent is special, return matching special role */
+	min_nr = min_t(int, role->node.id.f.nr, intent->node.id.f.nr);
+	if (unlikely(min_nr < IOB_BASE_NR)) {
+		if (min_nr == IOB_NOMEM_NR)
+			return iob_node_to_act(iob_act_idx->nomem_node);
+		else
+			return iob_node_to_act(iob_act_idx->lost_node);
+	}
+
+	/* if ignore_ino is set, use the same act for all files on the dev */
+	if (!iob_ignore_ino) {
+		akey.ino = ino;
+		akey.gen = gen;
+	}
+
+	act = iob_node_to_act(iob_get_node(&akey, iob_act_idx));
+	if (act)
+		iob_act_mark_used(act);
+	return act;
+}
+
+/**
+ * iob_modified_act - determine the modified act
+ * @act: the base act
+ * @modifier: modifier to apply
+ *
+ * Return iob_act which is identical to @act except that its intent
+ * modifier is @modifier.  @act is allowed to have no or any modifier on
+ * entry.  May return nomem node under memory pressure.
+ */
+static struct iob_act *iob_modified_act(struct iob_act *act, u32 modifier)
+{
+	struct iob_intent *intent = iob_intent_by_id(act->intent);
+	struct iob_act akey = { .role = act->role, .dev = act->dev };
+
+	/* if ignore_ino is set, use the same act for all files on the dev */
+	if (!iob_ignore_ino) {
+		akey.ino = act->ino;
+		akey.gen = act->gen;
+	}
+
+	intent = iob_get_intent(intent->trace, intent->depth, modifier);
+	akey.intent = intent->node.id;
+
+	return iob_node_to_act(iob_get_node(&akey, iob_act_idx));
+}
+
+
+/*
+ * RECLAIM
+ *
+ * iob_act can only be reclaimed once data is collected by userland, so we
+ * run reclaimer together with data acquisition.
+ *
+ * iob_act uses bitmaps to collect and track used state history.  Used bits
+ * are tracked every half ttl period and iob_acts which haven't been used
+ * for two half ttl periods are reclaimed.  As reclaiming is regulated by
+ * data acquisition, the code doesn't have full control over reclaim
+ * timing.  It tries to stay close to the behavior specified by
+ * @iob_ttl_secs.
+ *
+ * iob_role goes through reclaiming mostly to delay freeing so that roles
+ * are still available when async IO events fire after the original tasks
+ * exit.  iob_role reclaiming is simpler and always happens after at least
+ * one ttl period has passed.
+ */
+
+/**
+ * iob_switch_act_used_cur - switch iob_act_used->cur and ->staging
+ *
+ * Switch the cur and staging bitmaps and wait for all current users to
+ * finish.  ->staging must be clear on entry.  On return, ->staging points
+ * to used bitmap collected since the previous switch and is guaranteed to
+ * be quiescent.
+ *
+ * Must be called under iob_mutex.
+ */
+static void iob_switch_act_used_cur(void)
+{
+	struct iob_act_used *u = &iob_act_used;
+
+	lockdep_assert_held(&iob_mutex);
+	swap(u->cur, u->staging);
+	synchronize_sched();
+}
+
+/**
+ * iob_reclaim - reclaim iob_roles and iob_acts
+ *
+ * This function looks at iob_act_used->staging, ->front, ->back, and
+ * iob_role_to_free_front and reclaims unused nodes.  On entry,
+ * iob_act_used->staging should contain used bitmap from the previous
+ * period - IOW, the caller should have called iob_switch_act_used_cur()
+ * before.
+ *
+ * Must be called under iob_mutex.
+ */
+static void iob_reclaim(void)
+{
+	LIST_HEAD(role_todo);
+	unsigned long ttl = iob_ttl_secs * HZ;
+	unsigned long role_delta = jiffies - iob_role_reclaim_tstmp;
+	unsigned long act_delta = jiffies - iob_act_reclaim_tstmp;
+	struct iob_role *role, *role_pos;
+	struct iob_act *free_head = NULL, *act;
+	struct iob_act_used *u = &iob_act_used;
+	unsigned long *reclaim;
+	unsigned long flags;
+	int i;
+
+	lockdep_assert_held(&iob_mutex);
+
+	/* collect staging into front and clear staging */
+	bitmap_or(u->front, u->front, u->staging, iob_max_acts);
+	bitmap_clear(u->staging, 0, iob_max_acts);
+
+	/* if less than ttl/2 has passed, collecting is enough */
+	if (act_delta < ttl / 2)
+		return;
+
+	/* >= ttl/2 has passed, let's see if we can kill anything */
+	spin_lock_irqsave(&iob_lock, flags);
+
+	/* determine which roles to reclaim */
+	if (role_delta >= ttl) {
+		/* roles in the other free_head are now older than ttl */
+		list_splice_init(iob_role_to_free_back, &role_todo);
+		swap(iob_role_to_free_front, iob_role_to_free_back);
+		iob_role_reclaim_tstmp = jiffies;
+
+		/*
+		 * All roles to be reclaimed should have been unhashed
+		 * already.  Removing is enough.
+		 */
+		list_for_each_entry(role, &role_todo, free_list) {
+			WARN_ON_ONCE(!hlist_unhashed(&role->node.hash_node));
+			iob_remove_node(&role->node, iob_role_idx);
+		}
+	}
+
+	/*
+	 * Determine the bitmap to use for act reclaim.  Ideally, we want
+	 * to be invoked every ttl/2 for reclaim granularity but don't have
+	 * control over that.  We handle [ttl/2,ttl) as ttl/2 - acts which
+	 * are marked unused in both front and back bitmaps are reclaimed.
+	 * If >=ttl, we ignore back bitmap and reclaim any which is marked
+	 * unused in the front bitmap.
+	 */
+	if (act_delta < ttl) {
+		bitmap_or(u->back, u->back, u->front, iob_max_acts);
+		reclaim = u->back;
+	} else {
+		reclaim = u->front;
+	}
+
+	/* unhash and remove all acts which don't have bit set in @reclaim */
+	for (i = find_next_zero_bit(reclaim, iob_max_acts, IOB_BASE_NR);
+	     i < iob_max_acts;
+	     i = find_next_zero_bit(reclaim, iob_max_acts, i + 1)) {
+		act = iob_node_to_act(iob_node_by_nr_raw(i, iob_act_idx));
+		if (act) {
+			WARN_ON_ONCE(!iob_unhash_node(&act->node, iob_act_idx));
+			iob_remove_node(&act->node, iob_act_idx);
+			act->free_next = free_head;
+			free_head = act;
+		}
+	}
+
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	/* reclaim complete, front<->back and clear front */
+	swap(u->front, u->back);
+	bitmap_clear(u->front, 0, iob_max_acts);
+
+	iob_act_reclaim_tstmp = jiffies;
+
+	/* before freeing reclaimed nodes, wait for in-flight users to finish */
+	synchronize_sched();
+
+	list_for_each_entry_safe(role, role_pos, &role_todo, free_list)
+		iob_role_destroy(&role->node);
+
+	while ((act = free_head)) {
+		free_head = act->free_next;
+		iob_act_destroy(&act->node);
+	}
+}
+
+/*
+ * PGTREE
+ *
+ * Radix tree to map pfn to iob_act.  This is used to track which iob_act
+ * dirtied the page.  When a bio is issued, each page in the iovec is
+ * consulted against pgtree to find out which act caused it.
+ *
+ * Because the size of pgtree is proportional to total available memory, it
+ * uses id.f.nr instead of full id and may occassionally give stale result.
+ * Also, it uses u16 array if ACT_MAX is <= USHRT_MAX; otherwise, u32.
+ */
+
+static void *iob_pgtree_slot(unsigned long pfn)
+{
+	unsigned long idx = pfn >> iob_pgtree_pfn_shift;
+	unsigned long offset = pfn & iob_pgtree_pfn_mask;
+	void *p;
+
+	p = radix_tree_lookup(&iob_pgtree, idx);
+	if (p)
+		return p + (offset << iob_pgtree_shift);
+	return NULL;
+}
+
+/**
+ * iob_pgtree_set_nr - map pfn to nr
+ * @pfn: pfn to map
+ * @nr: id.f.nr to be mapped
+ *
+ * Map @pfn to @nr, which can later be retrieved using
+ * iob_pgtree_get_and_clear_nr().  This function is opportunistic - it may
+ * fail under memory pressure, and racing pgtree ops may clobber each
+ * other's mappings.
+ */
+static int iob_pgtree_set_nr(unsigned long pfn, int nr)
+{
+	void *slot, *p;
+	unsigned long flags;
+	int ret;
+retry:
+	slot = iob_pgtree_slot(pfn);
+	if (likely(slot)) {
+		/*
+		 * We're playing with pointer casts and racy accesses.  Use
+		 * ACCESS_ONCE() to avoid compiler surprises.
+		 */
+		switch (iob_pgtree_shift) {
+		case 1:
+			ACCESS_ONCE(*(u16 *)slot) = nr;
+			break;
+		case 2:
+			ACCESS_ONCE(*(u32 *)slot) = nr;
+			break;
+		default:
+			BUG();
+		}
+		return 0;
+	}
+
+	/* slot missing, create and insert new page and retry */
+	p = (void *)get_zeroed_page(GFP_NOWAIT);
+	if (!p) {
+		iob_stats.pgtree_nomem++;
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&iob_lock, flags);
+	ret = radix_tree_insert(&iob_pgtree, pfn >> iob_pgtree_pfn_shift, p);
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	if (ret) {
+		free_page((unsigned long)p);
+		if (ret != -EEXIST) {
+			iob_stats.pgtree_nomem++;
+			return ret;
+		}
+	}
+	goto retry;
+}
+
+/**
+ * iob_pgtree_get_and_clear_nr - read back pfn to nr mapping and clear it
+ * @pfn: pfn to read mapping for
+ *
+ * Read back the mapping set by iob_pgtree_set_nr().  This function is
+ * opportunistic - racing pgtree ops may clobber each other's mappings.
+ */
+static int iob_pgtree_get_and_clear_nr(unsigned long pfn)
+{
+	void *slot;
+	int nr;
+
+	slot = iob_pgtree_slot(pfn);
+	if (unlikely(!slot))
+		return 0;
+
+	/*
+	 * We're playing with pointer casts and racy accesses.  Use
+	 * ACCESS_ONCE() to avoid compiler surprises.
+	 */
+	switch (iob_pgtree_shift) {
+	case 1:
+		nr = ACCESS_ONCE(*(u16 *)slot);
+		if (nr)
+			ACCESS_ONCE(*(u16 *)slot) = 0;
+		break;
+	case 2:
+		nr = ACCESS_ONCE(*(u32 *)slot);
+		if (nr)
+			ACCESS_ONCE(*(u32 *)slot) = 0;
+		break;
+	default:
+		BUG();
+	}
+	return nr;
+}
+
+
+/*
+ * PROBES
+ *
+ * Tracepoint probes.  This is how ioblame learns what's going on in the
+ * system.  TP probes are always called with preemption disabled, so we
+ * don't need explicit rcu_read_lock_sched().
+ */
+
+static bool iob_enabled_inode(struct inode *inode)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && inode->i_sb->s_bdev &&
+		inode->i_sb->s_bdev->bd_disk->iob_enabled;
+}
+
+static bool iob_enabled_bh(struct buffer_head *bh)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && bh->b_bdev->bd_disk->iob_enabled;
+}
+
+static bool iob_enabled_bio(struct bio *bio)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && bio->bi_bdev &&
+		bio->bi_bdev->bd_disk->iob_enabled;
+}
+
+/* current timestamp in usecs, base is unknown and may jump backwards */
+static unsigned long iob_now_usecs(void)
+{
+	u64 now = local_clock();
+
+	/*
+	 * We don't worry about @now itself wrapping.  On 32bit, the
+	 * divided ulong result will wrap in orderly manner and
+	 * time_before/after() should work as expected.
+	 */
+	do_div(now, 1000);
+	return now;
+}
+
+static void iob_set_last_ino(struct inode *inode)
+{
+	struct iob_role *trole = iob_current_task_role();
+
+	trole->last_ino.dev = inode->i_sb->s_dev;
+	trole->last_ino.ino = inode->i_ino;
+	trole->last_ino.gen = inode->i_generation;
+	trole->last_ino_jiffies = jiffies;
+}
+
+/*
+ * Mark the last inode accessed by this task role.  This is used to
+ * attribute IOs to files.
+ */
+static void iob_probe_vfs_fcheck(void *data, struct files_struct *files,
+				 unsigned int fd, struct file *file)
+{
+	if (file) {
+		struct inode *inode = file->f_dentry->d_inode;
+
+		if (iob_enabled_inode(inode))
+			iob_set_last_ino(inode);
+	}
+}
+
+/* called after a page is dirtied - record the dirtying act in pgtree */
+static void iob_probe_wb_dirty_page(void *data, struct page *page,
+				    struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+
+	if (iob_enabled_inode(inode)) {
+		struct iob_act *act = iob_current_act(2, inode->i_sb->s_dev,
+						      inode->i_ino,
+						      inode->i_generation);
+
+		iob_pgtree_set_nr(page_to_pfn(page), act->node.id.f.nr);
+	}
+}
+
+/*
+ * Writeback is starting, record wb_reason in trole->modifier.  This will
+ * be applied to any IOs issued from this task until writeback is finished.
+ */
+static void iob_probe_wb_start(void *data, struct backing_dev_info *bdi,
+			       struct wb_writeback_work *work)
+{
+	struct iob_role *trole = iob_current_task_role();
+
+	trole->modifier = work->reason | IOB_MODIFIER_WB;
+}
+
+/* writeback done, clear modifier */
+static void iob_probe_wb_written(void *data, struct backing_dev_info *bdi,
+				 struct wb_writeback_work *work)
+{
+	struct iob_role *trole = iob_current_task_role();
+
+	trole->modifier = 0;
+}
+
+/*
+ * An inode is about to be written back.  Will be followed by data and
+ * inode writeback.  In case dirtier data is not recorded in pgtree or
+ * inode, remember the inode in trole->last_ino.
+ */
+static void iob_probe_wb_single_inode_start(void *data, struct inode *inode,
+					    struct writeback_control *wbc,
+					    unsigned long nr_to_write)
+{
+	if (iob_enabled_inode(inode))
+		iob_set_last_ino(inode);
+}
+
+/*
+ * Called when an inode is about to be dirtied, right before fs
+ * dirty_inode() method.  Different filesystems implement inode dirtying
+ * and writeback differently.  Some may allocate bh on dirtying, some might
+ * do it during write_inode() and others might not use bh at all.
+ *
+ * To cover most cases, two tracking mechanisms are used - trole->inode_act
+ * and inode->i_iob_act.  The former marks the current task as performing
+ * inode dirtying act and any IOs issued or bhs touched are attributed to
+ * the act.  The latter records the dirtying act on the inode itself so
+ * that if the filesystem takes action for the inode from write_inode(),
+ * the acting task can take on the dirtying act.
+ */
+static void iob_probe_wb_dirty_inode_start(void *data, struct inode *inode,
+					   int flags)
+{
+	if (iob_enabled_inode(inode)) {
+		struct iob_role *trole = iob_current_task_role();
+		struct iob_act *act = iob_current_act(1, inode->i_sb->s_dev,
+						      inode->i_ino,
+						      inode->i_generation);
+		trole->inode_act = act->node.id;
+		inode->i_iob_act = act->node.id;
+	}
+}
+
+/* inode dirtying complete */
+static void iob_probe_wb_dirty_inode(void *data, struct inode *inode, int flags)
+{
+	if (iob_enabled_inode(inode))
+		iob_current_task_role()->inode_act.v = 0;
+}
+
+/*
+ * Called when an inode is being written back, right before fs
+ * write_inode() method.  Inode writeback is starting, take on the act
+ * which dirtied the inode.
+ */
+static void iob_probe_wb_write_inode_start(void *data, struct inode *inode,
+					   struct writeback_control *wbc)
+{
+	if (iob_enabled_inode(inode) && inode->i_iob_act.v) {
+		struct iob_role *trole = iob_current_task_role();
+
+		trole->inode_act = inode->i_iob_act;
+	}
+}
+
+/* inode writing complete */
+static void iob_probe_wb_write_inode(void *data, struct inode *inode,
+				     struct writeback_control *wbc)
+{
+	if (iob_enabled_inode(inode))
+		iob_current_task_role()->inode_act.v = 0;
+}
+
+/*
+ * Called on touch_buffer().  Transfer inode act to pgtree.  This catches
+ * most inode operations for filesystems which use bh for metadata.
+ */
+static void iob_probe_block_touch_buffer(void *data, struct buffer_head *bh)
+{
+	if (iob_enabled_bh(bh)) {
+		struct iob_role *trole = iob_current_task_role();
+
+		if (trole->inode_act.v)
+			iob_pgtree_set_nr(page_to_pfn(bh->b_page),
+					  trole->inode_act.f.nr);
+	}
+}
+
+/* bio is being queued, collect all info into bio->bi_iob_info */
+static void iob_probe_block_bio_queue(void *data, struct request_queue *q,
+				      struct bio *bio)
+{
+	struct iob_io_info *io = &bio->bi_iob_info;
+	struct iob_act *act = NULL;
+	struct iob_role *trole;
+	int i;
+
+	if (!iob_enabled_bio(bio))
+		return;
+
+	trole = iob_current_task_role();
+
+	io->rw = bio->bi_rw;
+	io->sector = bio->bi_sector;
+	io->size = bio->bi_size;
+	io->issued_at = io->queued_at = iob_now_usecs();
+
+	/* trole's inode_act has the highest priority */
+	if (trole->inode_act.v)
+		io->act = trole->inode_act;
+
+	/* always walk pgtree and clear matching pages */
+	for (i = 0; i < bio->bi_vcnt; i++) {
+		struct bio_vec *bv = &bio->bi_io_vec[i];
+		int nr;
+
+		if (!bv->bv_len)
+			continue;
+
+		nr = iob_pgtree_get_and_clear_nr(page_to_pfn(bv->bv_page));
+		if (!nr || io->act.v)
+			continue;
+
+		/* this is the first act, charge everything to it */
+		act = iob_act_by_nr(nr);
+		io->act = act->node.id;
+	}
+
+	/*
+	 * If act is still not set, charge it to the IO issuer.  When
+	 * acquiring stack trace, skip this function and
+	 * generic_make_request[_checks]()
+	 */
+	if (!io->act.v) {
+		unsigned long now = jiffies;
+		dev_t dev = bio->bi_bdev->bd_dev;
+		ino_t ino = 0;
+		u32 gen = 0;
+
+		/*
+		 * Charge IOs to the last file this task initiated RW or
+		 * writeback on, which is highly likely to be the file this
+		 * IO is for.  As a sanity check, trust last_ino only for
+		 * pre-defined duration.
+		 */
+		if (time_before_eq(trole->last_ino_jiffies, now) &&
+		    now - trole->last_ino_jiffies <= IOB_LAST_INO_DURATION) {
+			dev = trole->last_ino.dev;
+			ino = trole->last_ino.ino;
+			gen = trole->last_ino.gen;
+		}
+
+		act = iob_current_act(3, dev, ino, gen);
+		io->act = act->node.id;
+	}
+
+	/* if %current has modifier set, apply it */
+	if (trole->modifier) {
+		if (!act)
+			act = iob_act_by_id(io->act);
+		act = iob_modified_act(act, trole->modifier);
+		io->act = act->node.id;
+	}
+}
+
+/* when bios get merged, charge everything to the first bio */
+static void iob_probe_block_bio_backmerge(void *data, struct request_queue *q,
+					  struct request *rq, struct bio *bio)
+{
+	struct bio *mbio = rq->bio;
+	struct iob_io_info *mio = &mbio->bi_iob_info;
+	struct iob_io_info *sio = &bio->bi_iob_info;
+
+	mio->size += sio->size;
+	sio->size = 0;
+}
+
+/* when bios get merged, charge everything to the first bio */
+static void iob_probe_block_bio_frontmerge(void *data, struct request_queue *q,
+					   struct request *rq, struct bio *bio)
+{
+	struct bio *mbio = rq->bio;
+	struct iob_io_info *mio = &mbio->bi_iob_info;
+	struct iob_io_info *sio = &bio->bi_iob_info;
+
+	mio->sector = sio->sector;
+	mio->size += sio->size;
+	mio->act = sio->act;
+	sio->size = 0;
+}
+
+/* record issue timestamp, this may not happen for bio based drivers */
+static void iob_probe_block_rq_issue(void *data, struct request_queue *q,
+				     struct request *rq)
+{
+	if (rq->bio && rq->bio->bi_iob_info.size)
+		rq->bio->bi_iob_info.issued_at = iob_now_usecs();
+}
+
+/* bio is complete, report and accumulate statistics */
+static void iob_probe_block_bio_complete(void *data, struct request_queue *q,
+					 struct bio *bio, int error)
+{
+	struct iob_io_info *io = &bio->bi_iob_info;
+
+	if (!io->size)
+		return;
+
+	if (!iob_enabled_bio(bio))
+		return;
+
+	if (iobc_nr_types)
+		iob_count(io, bio->bi_bdev->bd_disk);
+
+	if (iob_iolog)
+		iob_iolog_fill(io);
+}
+
+/* %current is exiting, shoot down its task_role */
+static void iob_probe_block_sched_process_exit(void *data,
+					       struct task_struct *task)
+{
+	WARN_ON_ONCE(task != current);
+	iob_reclaim_current_task_role();
+}
+
+
+/*
+ * Counters.
+ *
+ * Collects IO stats to be reported to userland.  Each act is associated
+ * with a set of counters as determined by counter types.
+ *
+ * Each counter type consists of histogram of eight u32's, the field to
+ * record, boundary values used to determine the slot in the histogram and
+ * optional filter.  It describes whether the counter should be activated
+ * for a specific IO (filter), if so, which value to record (field) and to
+ * which slot (boundaries).
+ *
+ * Counter types are userland configurable via ioblame/nr_counters and
+ * ioblame/counters/NR[_filter].
+ */
+
+/*
+ * Helper to grab iob_lock and get iobc_type associated with
+ * ioblame/counters/NR[_filter] @file.
+ */
+static struct iobc_type *iobc_lock_and_get_type(struct file *file)
+	__acquires(&iob_mutex)
+{
+	int i;
+
+	mutex_lock(&iob_mutex);
+
+	i = (long)file->f_dentry->d_inode->i_private;
+
+	/* raced nr_counters reduction? */
+	if (i >= iobc_nr_types) {
+		mutex_unlock(&iob_mutex);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return &iobc_types[i];
+}
+
+/*
+ * ioblame/counters/NR - read and set counter type.  Its format is
+ *
+ *   DIR FIELD_NAME B0 B1 B2 B3 B4 B5 B6 B7 B8
+ *
+ * DIR is any combination of letters 'r', 'a', and 'w', each representing
+ * reads, readaheads and writes.  FIELD_NAME is one of iobc_field_strs[]
+ * and B[0-8] are u32 values delimiting histogram slots - ie. a value >= B3
+ * and < B4 would be recorded in histogram slot 3.  Values < B0 or >=
+ * non-zero B8 are ignored.
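+ *
+ * For example (with FIELD standing in for any name in iobc_field_strs[],
+ * purely for illustration), "rw FIELD 0 1 2 4 8 16 32 64 0" counts reads
+ * and writes whose FIELD value falls in [0,1) into slot 0, [1,2) into
+ * slot 1, ... [32,64) into slot 6 and >= 64 into slot 7 (B8 == 0 means
+ * no upper bound).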
+ *
+ * Note that counter type can be updated while iob is enabled.
+ */
+static ssize_t iobc_type_read(struct file *file, char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	struct iobc_type *type;
+	char *buf;
+	u32 *b;
+	ssize_t ret;
+
+	type = iobc_lock_and_get_type(file);
+	if (IS_ERR(type))
+		return PTR_ERR(type);
+
+	buf = iob_page_buf;
+	b = type->bounds;
+
+	if (type->dir) {
+		char dir[4] = "---";
+
+		if (type->dir & IOBC_READ)
+			dir[0] = 'R';
+		if (type->dir & IOBC_RAHEAD)
+			dir[1] = 'A';
+		if (type->dir & IOBC_WRITE)
+			dir[2] = 'W';
+
+		snprintf(buf, PAGE_SIZE, "%s %s %u %u %u %u %u %u %u %u %u\n",
+			 dir, iobc_field_strs[type->field],
+			 b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7], b[8]);
+	} else {
+		snprintf(buf, PAGE_SIZE, "--- disabled\n");
+	}
+
+	ret = simple_read_from_buffer(ubuf, count, ppos, buf, strlen(buf));
+
+	mutex_unlock(&iob_mutex);
+
+	return ret;
+}
+
+static ssize_t iobc_type_write(struct file *file, const char __user *ubuf,
+			       size_t cnt, loff_t *ppos)
+{
+	char field_str[IOBC_FIELD_MAX_LEN + 1];
+	char dir_buf[4];
+	char *buf, *p;
+	struct iobc_type *type;
+	unsigned dir, b[IOBC_NR_SLOTS + 1];
+	int i, field, ret;
+
+	if (cnt >= PAGE_SIZE)
+		return -EOVERFLOW;
+
+	type = iobc_lock_and_get_type(file);
+	if (IS_ERR(type))
+		return PTR_ERR(type);
+
+	buf = iob_page_buf;
+
+	ret = -EFAULT;
+	if (copy_from_user(buf, ubuf, cnt))
+		goto out;
+	buf[cnt] = '\0';
+
+	p = strim(buf);
+	if (!strlen(p)) {
+		type->dir = 0;
+		ret = 0;
+		goto out;
+	}
+
+	/* start parsing */
+	ret = -EINVAL;
+
+	if (sscanf(p, "%3s %"__stringify(IOBC_FIELD_MAX_LEN)"s %u %u %u %u %u %u %u %u %u",
+		   dir_buf, field_str, &b[0], &b[1], &b[2], &b[3],
+		   &b[4], &b[5], &b[6], &b[7], &b[8]) != 11)
+		goto out;
+
+	/* parse direction */
+	dir = 0;
+	if (strchr(dir_buf, 'r') || strchr(dir_buf, 'R'))
+		dir |= IOBC_READ;
+	if (strchr(dir_buf, 'a') || strchr(dir_buf, 'A'))
+		dir |= IOBC_RAHEAD;
+	if (strchr(dir_buf, 'w') || strchr(dir_buf, 'W'))
+		dir |= IOBC_WRITE;
+
+	/* match field */
+	field = IOBC_NR_FIELDS;
+	for (i = 0; i < ARRAY_SIZE(iobc_field_strs); i++)
+		if (!strcmp(field_str, iobc_field_strs[i]))
+			field = i;
+	if (field == IOBC_NR_FIELDS)
+		goto out;
+
+	/*
+	 * Make sure boundary values don't decrease, the last entry can be
+	 * zero meaning no limit.
+	 */
+	for (i = 0; i < IOBC_NR_SLOTS - 1; i++)
+		if (b[i] > b[i + 1])
+			goto out;
+
+	if (b[IOBC_NR_SLOTS] &&
+	    b[IOBC_NR_SLOTS - 1] > b[IOBC_NR_SLOTS])
+		goto out;
+
+	/* alright, commit - if iob is enabled, just let the users race */
+	type->dir = dir;
+	type->field = field;
+	for (i = 0; i < ARRAY_SIZE(type->bounds); i++)
+		type->bounds[i] = b[i];
+	ret = 0;
+out:
+	mutex_unlock(&iob_mutex);
+	return ret ?: cnt;
+}
+
+static const struct file_operations iobc_type_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iobc_type_read,
+	.write		= iobc_type_write,
+};
+
+/*
+ * ioblame/counters/NR_filter - read and set counter filters.  Filters are
+ * the same as trace event filters and all counter fields can be used.  If
+ * no filter is set, the counter is always enabled.
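+ *
+ * Writing "0" clears the filter.  As a purely illustrative example,
+ * assuming FIELD is one of the counter field names, "FIELD > 4096 &&
+ * FIELD < 1048576" would restrict the counter to IOs whose FIELD value
+ * falls in that range.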
+ *
+ * Note that counter filter can be updated while iob is enabled.
+ */
+static ssize_t iobc_filter_read(struct file *file, char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	struct iobc_type *type;
+	char *buf;
+	ssize_t ret = 0;
+
+	type = iobc_lock_and_get_type(file);
+	if (IS_ERR(type))
+		return PTR_ERR(type);
+
+	buf = iob_page_buf;
+
+	if (type->filter) {
+		const char *s = trace_event_filter_string(type->filter);
+
+		if (s) {
+			snprintf(buf, PAGE_SIZE, "%s\n", s);
+			ret = simple_read_from_buffer(ubuf, count, ppos,
+						      buf, strlen(buf));
+		}
+	}
+
+	mutex_unlock(&iob_mutex);
+
+	return ret;
+}
+
+static void iobc_filter_free_rcu(struct rcu_head *head)
+{
+	struct iobc_filter_rcu *rcu = container_of(head, struct iobc_filter_rcu,
+						   rcu_head);
+	trace_event_filter_destroy(rcu->filter);
+	kfree(rcu);
+}
+
+static void iobc_free_filter(struct iobc_type *type)
+{
+	if (!type->filter)
+		return;
+
+	type->filter_rcu->filter = type->filter;
+	rcu_assign_pointer(type->filter, NULL);
+	call_rcu_sched(&type->filter_rcu->rcu_head, iobc_filter_free_rcu);
+	type->filter_rcu = NULL;
+}
+
+static ssize_t iobc_filter_write(struct file *file, const char __user *ubuf,
+				 size_t cnt, loff_t *ppos)
+{
+	struct iobc_filter_rcu *filter_rcu = NULL;
+	struct event_filter *filter = NULL;
+	struct iobc_type *type;
+	char *buf;
+	int ret;
+
+	if (cnt >= PAGE_SIZE)
+		return -EOVERFLOW;
+
+	type = iobc_lock_and_get_type(file);
+	if (IS_ERR(type))
+		return PTR_ERR(type);
+
+	buf = iob_page_buf;
+
+	ret = -EFAULT;
+	if (copy_from_user(buf, ubuf, cnt))
+		goto out;
+	buf[cnt] = '\0';
+
+	ret = 0;
+	buf = strim(buf);
+	if (buf[0] == '0' && buf[1] == '\0') {
+		iobc_free_filter(type);
+		goto out;
+	}
+
+	ret = -ENOMEM;
+	filter_rcu = kzalloc(sizeof(*filter_rcu), GFP_KERNEL);
+	if (!filter_rcu)
+		goto out;
+
+	ret = trace_event_filter_create(&iobc_event_field_list, buf, &filter);
+	/*
+	 * If we have a filter, whether error one or not, install it.
+	 * type->filter is RCU managed so that it can be modified while iob
+	 * is enabled.
+	 */
+	if (filter) {
+		iobc_free_filter(type);
+		type->filter_rcu = filter_rcu;
+		rcu_assign_pointer(type->filter, filter);
+	} else {
+		kfree(filter_rcu);
+		kfree(filter);
+	}
+out:
+	mutex_unlock(&iob_mutex);
+	return ret ?: cnt;
+}
+
+static const struct file_operations iobc_filter_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iobc_filter_read,
+	.write		= iobc_filter_write,
+};
+
+/*
+ * ioblame/nr_counters - the number of counter types.  Can be set only
+ * while iob is disabled.  Write resets all counter types and filters.
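+ *
+ * E.g. writing 4 creates ioblame/counters/{0,1,2,3} and the matching
+ * {0,1,2,3}_filter files.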
+ */
+static int iobc_nr_types_get(void *data, u64 *val)
+{
+	*val = iobc_nr_types;
+	return 0;
+}
+
+static int iobc_nr_types_set(void *data, u64 val)
+{
+	struct iobc_type *tmp_types = NULL;
+	int i, ret;
+
+	if (val > INT_MAX)
+		return -EINVAL;
+
+	mutex_lock(&iob_mutex);
+
+	ret = -EBUSY;
+	if (iob_enabled)
+		goto out_unlock;
+
+	/* destroy old counters/ dir */
+	if (iobc_dir) {
+		debugfs_remove_recursive(iobc_dir);
+		iobc_dir = NULL;
+	}
+
+	if (!val)
+		goto done;
+
+	/* create new ones */
+	ret = -ENOMEM;
+	tmp_types = kzalloc(sizeof(tmp_types[0]) * val, GFP_KERNEL);
+	if (!tmp_types)
+		goto out_unlock;
+
+	iobc_dir = debugfs_create_dir("counters", iob_dir);
+	if (!iobc_dir)
+		goto out_unlock;
+
+	for (i = 0; i < val; i++) {
+		char cnt_name[16], filter_name[32];
+
+		snprintf(cnt_name, sizeof(cnt_name), "%d", i);
+		snprintf(filter_name, sizeof(filter_name), "%d_filter", i);
+
+		if (!debugfs_create_file(cnt_name, 0600, iobc_dir,
+					 (void *)(long)i, &iobc_type_fops) ||
+		    !debugfs_create_file(filter_name, 0600, iobc_dir,
+					 (void *)(long)i, &iobc_filter_fops)) {
+			debugfs_remove_recursive(iobc_dir);
+			iobc_dir = NULL;
+			goto out_unlock;
+		}
+	}
+done:
+	/* destroy old type and commit new one */
+	for (i = 0; i < iobc_nr_types; i++)
+		iobc_free_filter(&iobc_types[i]);
+	swap(iobc_types, tmp_types);
+	iobc_nr_types = val;
+	ret = 0;
+out_unlock:
+	mutex_unlock(&iob_mutex);
+	kfree(tmp_types);
+	return ret;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(iobc_nr_types_fops,
+			iobc_nr_types_get, iobc_nr_types_set, "%llu\n");
+
+/*
+ * Actual counting.  iob_count() is called on each io completion and
+ * responsible for updating the corresponding act->cnts[].
+ *
+ * Note that act->cnts is indexed by slot first and then type.  This is to
+ * increase chance of updates to different counters falling in the same
+ * cache line.  We update one out of eight histogram counters belonging to
+ * each type.  If the counters are organized by type and then slot, it'll
+ * always touch every cacheline the counters occupy.
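+ *
+ * E.g. with three counter types the layout is t0s0 t1s0 t2s0 t0s1 t1s1
+ * t2s1 ... - the eight slots of one type are spread across the array
+ * rather than sitting next to each other.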
+ */
+static int iobc_idx(int type, int slot)
+{
+	return slot * iobc_nr_types + type;
+}
+
+/*
+ * Scale @sect / @capa to [0,65536).  IOW, @sect * 65536 / @capa.  Shift
+ * bits around so that we don't lose precision unnecessarily while still
+ * doing single 64bit/32bit division.
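+ *
+ * E.g. a 1TiB disk has a capacity of 2^31 sectors, so shift is 0 and the
+ * result is simply @sect * 65536 / @capa; for larger devices @capa is
+ * shifted down until it fits in 32 bits and @sect is shifted so that the
+ * @sect * 65536 / @capa ratio is preserved for do_div().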
+ */
+static u16 iobc_scale_sect(u64 sect, u64 capa)
+{
+	int shift = fls64(capa) - 32;
+
+	if (shift <= 0) {
+		sect <<= 16;
+	} else {
+		if (shift <= 16)
+			sect <<= 16 - shift;
+		else
+			sect >>= shift - 16;
+		capa >>= shift;
+	}
+
+	do_div(sect, capa);
+
+	return sect;
+}
+
+static void iob_count(struct iob_io_info *io, struct gendisk *disk)
+{
+	struct iob_act *act = iob_act_by_id(io->act);
+	unsigned long now = iob_now_usecs();
+	u16 sect = iobc_scale_sect(io->sector, get_capacity(disk));
+	u32 fields[IOBC_NR_FIELDS];
+	u32 *cnts;
+	int i, dir;
+
+	iob_act_mark_used(act);
+
+	/* timestamps may jump backwards, fix up */
+	if (time_before(now, io->issued_at))
+		io->issued_at = now;
+	if (time_before(io->issued_at, io->queued_at))
+		io->queued_at = io->issued_at;
+
+	if (!(io->rw & REQ_WRITE)) {
+		if (io->rw & REQ_RAHEAD)
+			dir = IOBC_RAHEAD;
+		else
+			dir = IOBC_READ;
+	} else
+		dir = IOBC_WRITE;
+
+	fields[IOBC_OFFSET] = sect;
+	fields[IOBC_SIZE] = io->size;
+	fields[IOBC_WAIT_TIME] = io->issued_at - io->queued_at;
+	fields[IOBC_IO_TIME] = now - io->issued_at;
+	fields[IOBC_SEEK_DIST] = abs(sect - disk->iob_scaled_last_sect);
+
+	disk->iob_scaled_last_sect = sect;
+
+	/* all fields ready, find or allocate cnts to update */
+	cnts = act->cnts;
+	if (!cnts) {
+		unsigned long flags;
+
+		cnts = kmem_cache_zalloc(iobc_cnts_cache, GFP_NOIO);
+		if (!cnts) {
+			iob_stats.cnts_nomem++;
+			return;
+		}
+
+		spin_lock_irqsave(&iob_lock, flags);
+		if (!act->cnts) {
+			act->cnts = cnts;
+		} else {
+			kmem_cache_free(iobc_cnts_cache, cnts);
+			cnts = act->cnts;
+		}
+		spin_unlock_irqrestore(&iob_lock, flags);
+	}
+
+	/* let's count */
+	for (i = 0; i < iobc_nr_types; i++) {
+		struct iobc_type *type = &iobc_types[i];
+		struct event_filter *filter = rcu_dereference(type->filter);
+		u32 *b = type->bounds;
+		u32 v = fields[type->field];
+		int slot = -1;
+
+		if (!(type->dir & dir))
+			continue;
+
+		/* if there's filter, run it */
+		if (filter && !filter_match_preds(filter, fields))
+			continue;
+
+		/* open coded binary search to determine histogram slot */
+		if (v < b[4]) {
+			if (v < b[2]) {
+				if (v < b[1]) {
+					if (v >= b[0])
+						slot = 0;
+				} else {
+					slot = 1;
+				}
+			} else {
+				if (v < b[3])
+					slot = 2;
+				else
+					slot = 3;
+			}
+		} else {
+			if (v < b[6]) {
+				if (v < b[5])
+					slot = 4;
+				else
+					slot = 5;
+			} else {
+				if (v < b[7]) {
+					slot = 6;
+				} else {
+					if (!b[8] || v < b[8])
+						slot = 7;
+				}
+			}
+		}
+
+		/*
+		 * Yeah, finally.  Histogram increment is opportunistic and
+		 * racing updates may clobber each other.  Given how act is
+		 * determined, this isn't too likely to happen.  Even when
+		 * it does, since the only updates to the histogram are increments,
+		 * the deviation should be small.
+		 */
+		if (slot >= 0)
+			cnts[iobc_idx(i, slot)]++;
+	}
+}
+
+/**
+ * iob_disable - disable ioblame
+ *
+ * Master disable.  Stop ioblame, unregister all hooks and free all
+ * resources.
+ */
+static void iob_disable(void)
+{
+	const int gang_nr = 16;
+	unsigned long indices[gang_nr];
+	void **slots[gang_nr];
+	unsigned long base_idx = 0;
+	int i, nr;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled) {
+		/* if enabled, unregister all hooks */
+		iob_enabled = false;
+		iobc_pipe_opened = false;
+		unregister_trace_vfs_fcheck(iob_probe_vfs_fcheck, NULL);
+		unregister_trace_writeback_dirty_page(iob_probe_wb_dirty_page, NULL);
+		unregister_trace_writeback_start(iob_probe_wb_start, NULL);
+		unregister_trace_writeback_written(iob_probe_wb_written, NULL);
+		unregister_trace_writeback_single_inode_start(iob_probe_wb_single_inode_start, NULL);
+		unregister_trace_writeback_dirty_inode_start(iob_probe_wb_dirty_inode_start, NULL);
+		unregister_trace_writeback_dirty_inode(iob_probe_wb_dirty_inode, NULL);
+		unregister_trace_writeback_write_inode_start(iob_probe_wb_write_inode_start, NULL);
+		unregister_trace_writeback_write_inode(iob_probe_wb_write_inode, NULL);
+		unregister_trace_block_touch_buffer(iob_probe_block_touch_buffer, NULL);
+		unregister_trace_block_bio_queue(iob_probe_block_bio_queue, NULL);
+		unregister_trace_block_bio_backmerge(iob_probe_block_bio_backmerge, NULL);
+		unregister_trace_block_bio_frontmerge(iob_probe_block_bio_frontmerge, NULL);
+		unregister_trace_block_rq_issue(iob_probe_block_rq_issue, NULL);
+		unregister_trace_block_bio_complete(iob_probe_block_bio_complete, NULL);
+		unregister_trace_sched_process_exit(iob_probe_block_sched_process_exit, NULL);
+		/* and drain all in-flight users */
+		tracepoint_synchronize_unregister();
+	}
+
+	/*
+	 * At this point, we're sure that nobody is executing iob hooks.
+	 * Free all resources.
+	 */
+	for (i = 0; i < ARRAY_SIZE(iob_act_used_bitmaps); i++) {
+		vfree(iob_act_used_bitmaps[i]);
+		iob_act_used_bitmaps[i] = NULL;
+	}
+
+	if (iob_role_idx)
+		iob_idx_destroy(iob_role_idx);
+	if (iob_intent_idx)
+		iob_idx_destroy(iob_intent_idx);
+	if (iob_act_idx)
+		iob_idx_destroy(iob_act_idx);
+	iob_role_idx = iob_intent_idx = iob_act_idx = NULL;
+
+	while ((nr = radix_tree_gang_lookup_slot(&iob_pgtree, slots, indices,
+						 base_idx, gang_nr))) {
+		for (i = 0; i < nr; i++) {
+			free_page((unsigned long)*slots[i]);
+			radix_tree_delete(&iob_pgtree, indices[i]);
+		}
+		base_idx = indices[nr - 1] + 1;
+	}
+
+	if (iobc_cnts_cache) {
+		kmem_cache_destroy(iobc_cnts_cache);
+		iobc_cnts_cache = NULL;
+	}
+
+	mutex_unlock(&iob_mutex);
+}
+
+/**
+ * iob_enable - enable ioblame
+ *
+ * Master enable.  Set up all resources and enable ioblame.  Returns 0 on
+ * success, -errno on failure.
+ */
+static int iob_enable(void)
+{
+	int i, err;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled)
+		goto out;
+
+	/* allocate iobc_cnts cache */
+	err = -ENOMEM;
+	iobc_cnts_cache = kmem_cache_create("iob_counters",
+				iobc_nr_types * IOBC_NR_SLOTS * sizeof(u32),
+				__alignof__(u32), SLAB_HWCACHE_ALIGN, NULL);
+	if (!iobc_cnts_cache)
+		goto out;
+
+	/* determine pgtree params from iob_max_acts */
+	iob_pgtree_shift = iob_max_acts <= USHRT_MAX ? 1 : 2;
+	iob_pgtree_pfn_shift = PAGE_SHIFT - iob_pgtree_shift;
+	iob_pgtree_pfn_mask = (1 << iob_pgtree_pfn_shift) - 1;
+
+	/* create iob_idx'es and allocate act used bitmaps */
+	err = -ENOMEM;
+	iob_role_idx = iob_idx_create(&iob_role_idx_type, iob_max_roles);
+	iob_intent_idx = iob_idx_create(&iob_intent_idx_type, iob_max_intents);
+	iob_act_idx = iob_idx_create(&iob_act_idx_type, iob_max_acts);
+
+	if (!iob_role_idx || !iob_intent_idx || !iob_act_idx)
+		goto out;
+
+	for (i = 0; i < ARRAY_SIZE(iob_act_used_bitmaps); i++) {
+		iob_act_used_bitmaps[i] = vzalloc(sizeof(unsigned long) *
+						  BITS_TO_LONGS(iob_max_acts));
+		if (!iob_act_used_bitmaps[i])
+			goto out;
+	}
+
+	iob_role_reclaim_tstmp = jiffies;
+	iob_act_reclaim_tstmp = jiffies;
+	iob_act_used.cur = iob_act_used_bitmaps[0];
+	iob_act_used.staging = iob_act_used_bitmaps[1];
+	iob_act_used.front = iob_act_used_bitmaps[2];
+	iob_act_used.back = iob_act_used_bitmaps[3];
+
+	/* register hooks */
+	err = register_trace_vfs_fcheck(iob_probe_vfs_fcheck, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_page(iob_probe_wb_dirty_page, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_start(iob_probe_wb_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_written(iob_probe_wb_written, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_single_inode_start(iob_probe_wb_single_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_inode_start(iob_probe_wb_dirty_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_inode(iob_probe_wb_dirty_inode, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_write_inode_start(iob_probe_wb_write_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_write_inode(iob_probe_wb_write_inode, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_touch_buffer(iob_probe_block_touch_buffer, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_queue(iob_probe_block_bio_queue, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_backmerge(iob_probe_block_bio_backmerge, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_frontmerge(iob_probe_block_bio_frontmerge, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_rq_issue(iob_probe_block_rq_issue, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_complete(iob_probe_block_bio_complete, NULL);
+	if (err)
+		goto out;
+	err = register_trace_sched_process_exit(iob_probe_block_sched_process_exit, NULL);
+	if (err)
+		goto out;
+
+	/* wait until everything becomes visible */
+	synchronize_sched();
+	/* and go... */
+	iob_enabled = true;
+out:
+	mutex_unlock(&iob_mutex);
+
+	if (iob_enabled)
+		return 0;
+	iob_disable();
+	return err;
+}
+
+/* ioblame/{*_max|ttl_secs} - uint tunables */
+static int iob_uint_get(void *data, u64 *val)
+{
+	*val = *(unsigned int *)data;
+	return 0;
+}
+
+static int __iob_uint_set(void *data, u64 val, bool must_be_disabled)
+{
+	if (val > INT_MAX)
+		return -EINVAL;
+
+	mutex_lock(&iob_mutex);
+	if (must_be_disabled && iob_enabled) {
+		mutex_unlock(&iob_mutex);
+		return -EBUSY;
+	}
+
+	*(unsigned int *)data = val;
+
+	mutex_unlock(&iob_mutex);
+
+	return 0;
+}
+
+/* max params must not be manipulated while enabled */
+static int iob_uint_set_disabled(void *data, u64 val)
+{
+	return __iob_uint_set(data, val, true);
+}
+
+/* ttl can be changed anytime */
+static int iob_uint_set(void *data, u64 val)
+{
+	return __iob_uint_set(data, val, false);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(iob_uint_fops_disabled, iob_uint_get,
+			iob_uint_set_disabled, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(iob_uint_fops, iob_uint_get, iob_uint_set, "%llu\n");
+
+/* bool - ioblame/ignore_ino, also used for ioblame/enable */
+static ssize_t iob_bool_read(struct file *file, char __user *ubuf,
+			     size_t count, loff_t *ppos)
+{
+	bool *boolp = file->f_dentry->d_inode->i_private;
+	const char *str = *boolp ? "Y\n" : "N\n";
+
+	return simple_read_from_buffer(ubuf, count, ppos, str, strlen(str));
+}
+
+static ssize_t __iob_bool_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos, bool *boolp)
+{
+	char buf[32] = { };
+	int err;
+
+	if (copy_from_user(buf, ubuf, min(count, sizeof(buf) - 1)))
+		return -EFAULT;
+
+	err = strtobool(buf, boolp);
+	if (err)
+		return err;
+
+	return err ?: count;
+}
+
+static ssize_t iob_bool_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	return __iob_bool_write(file, ubuf, count, ppos,
+				file->f_dentry->d_inode->i_private);
+}
+
+static const struct file_operations iob_bool_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iob_bool_read,
+	.write		= iob_bool_write,
+};
+
+/* u64 fops, used for stats */
+static int iob_u64_get(void *data, u64 *val)
+{
+	*val = *(u64 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(iob_stats_fops, iob_u64_get, NULL, "%llu\n");
+
+/* used to export nr_nodes of each iob_idx */
+static int iob_nr_nodes_get(void *data, u64 *val)
+{
+	struct iob_idx **idxp = data;
+
+	*val = 0;
+	mutex_lock(&iob_mutex);
+	if (*idxp)
+		*val = (*idxp)->nr_nodes;
+	mutex_unlock(&iob_mutex);
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(iob_nr_nodes_fops, iob_nr_nodes_get, NULL, "%llu\n");
+
+/*
+ * ioblame/devs - per device enable switch, accepts block device kernel
+ * name, "maj:min" or "*" for all devices.  Prefix '!' to disable.  Opening
+ * w/ O_TRUNC also disables ioblame for all devices.
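+ *
+ * E.g. (with made-up device names) writing "sda 8:16 !sdc" enables
+ * ioblame for sda and the device at 8:16 and disables it for sdc, while
+ * writing "!*" disables it for every device.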
+ */
+static void iob_enable_all_devs(bool enable)
+{
+	struct disk_iter diter;
+	struct gendisk *disk;
+
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter)))
+		disk->iob_enabled = enable;
+	disk_iter_exit(&diter);
+}
+
+static void *iob_devs_seq_start(struct seq_file *seqf, loff_t *pos)
+{
+	loff_t skip = *pos;
+	struct disk_iter *diter;
+	struct gendisk *disk;
+
+	diter = kmalloc(sizeof(*diter), GFP_KERNEL);
+	if (!diter)
+		return ERR_PTR(-ENOMEM);
+
+	seqf->private = diter;
+	disk_iter_init(diter);
+
+	/* skip to the current *pos */
+	do {
+		disk = disk_iter_next(diter);
+		if (!disk)
+			return NULL;
+	} while (skip--);
+
+	/* skip to the first iob_enabled disk */
+	while (disk && !disk->iob_enabled) {
+		(*pos)++;
+		disk = disk_iter_next(diter);
+	}
+
+	return disk;
+}
+
+static void *iob_devs_seq_next(struct seq_file *seqf, void *v, loff_t *pos)
+{
+	/* skip to the next iob_enabled disk */
+	while (true) {
+		struct gendisk *disk;
+
+		(*pos)++;
+		disk = disk_iter_next(seqf->private);
+		if (!disk)
+			return NULL;
+
+		if (disk->iob_enabled)
+			return disk;
+	}
+}
+
+static int iob_devs_seq_show(struct seq_file *seqf, void *v)
+{
+	struct gendisk *disk = v;
+	dev_t dev = disk_devt(disk);
+
+	seq_printf(seqf, "%u:%u %s\n", MAJOR(dev), MINOR(dev),
+		   disk->disk_name);
+	return 0;
+}
+
+static void iob_devs_seq_stop(struct seq_file *seqf, void *v)
+{
+	struct disk_iter *diter = seqf->private;
+
+	/* stop is called even after start failed :-( */
+	if (diter) {
+		disk_iter_exit(diter);
+		kfree(diter);
+	}
+}
+
+static ssize_t iob_devs_write(struct file *file, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	char *buf = NULL, *p = NULL, *last_tok = NULL, *tok;
+	int err;
+
+	if (!cnt)
+		return 0;
+
+	err = -ENOMEM;
+	buf = vmalloc(cnt + 1);
+	if (!buf)
+		goto out;
+
+	err = -EFAULT;
+	if (copy_from_user(buf, ubuf, cnt))
+		goto out;
+	buf[cnt] = '\0';
+
+	err = 0;
+	p = buf;
+	while ((tok = strsep(&p, " \t\r\n"))) {
+		bool enable = true;
+		int partno = 0;
+		struct gendisk *disk;
+		unsigned maj, min;
+		dev_t devt;
+
+		tok = strim(tok);
+		if (!strlen(tok))
+			continue;
+
+		if (tok[0] == '!') {
+			enable = false;
+			tok++;
+		}
+
+		if (!strcmp(tok, "*")) {
+			iob_enable_all_devs(enable);
+			last_tok = tok;
+			continue;
+		}
+
+		if (sscanf(tok, "%u:%u", &maj, &min) == 2)
+			devt = MKDEV(maj, min);
+		else
+			devt = blk_lookup_devt(tok, 0);
+
+		disk = get_gendisk(devt, &partno);
+		if (!disk || partno) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		disk->iob_enabled = enable;
+		put_disk(disk);
+		last_tok = tok;
+	}
+out:
+	vfree(buf);
+	if (!err)
+		return cnt;
+	if (last_tok)
+		return last_tok + strlen(last_tok) - buf;
+	return err;
+}
+
+static const struct seq_operations iob_devs_sops = {
+	.start		= iob_devs_seq_start,
+	.next		= iob_devs_seq_next,
+	.show		= iob_devs_seq_show,
+	.stop		= iob_devs_seq_stop,
+};
+
+static int iob_devs_seq_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
+		iob_enable_all_devs(false);
+
+	return seq_open(file, &iob_devs_sops);
+}
+
+static const struct file_operations iob_devs_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iob_devs_seq_open,
+	.read		= seq_read,
+	.write		= iob_devs_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+/*
+ * ioblame/enable - master enable switch
+ */
+static ssize_t iob_enable_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	bool enable;
+	ssize_t ret;
+	int err = 0;
+
+	ret = __iob_bool_write(file, ubuf, count, ppos, &enable);
+	if (ret < 0)
+		return ret;
+
+	if (enable)
+		err = iob_enable();
+	else
+		iob_disable();
+
+	return err ?: ret;
+}
+
+static const struct file_operations iob_enable_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iob_bool_read,
+	.write		= iob_enable_write,
+};
+
+/*
+ * Print helpers.
+ */
+#define iob_print(p, e, fmt, args...)	(p + scnprintf(p, e - p, fmt , ##args))
+
+static char *iob_print_role(char *p, char *e, union iob_id role_id)
+{
+	struct iob_role *role = iob_role_by_id(role_id);
+
+	if (role->pid < 0) {
+		p = iob_print(p, e, "user-%d", -role->pid);
+	} else {
+		struct task_struct *task;
+		int no = role->node.id.f.nr;
+
+		rcu_read_lock_sched();
+		task = pid_task(find_pid_ns(role->pid, &init_pid_ns),
+				PIDTYPE_PID);
+		if (task)
+			p = iob_print(p, e, "pid-%d (%s)",
+				      role->pid, task->comm);
+		else if (no >= 2)
+			p = iob_print(p, e, "pid-%d", role->pid);
+		else
+			p = iob_print(p, e, "%s", no ? "lost" : "nomem");
+		rcu_read_unlock_sched();
+	}
+
+	return p;
+}
+
+static char *iob_print_intent(char *p, char *e, struct iob_intent *intent,
+			      const char *header)
+{
+	int i;
+
+	p = iob_print(p, e, "%s#%d modifier=0x%x\n", header,
+		      intent->node.id.f.nr, intent->modifier);
+	for (i = 0; i < intent->depth; i++)
+		p = iob_print(p, e, "%s[%p] %pF\n", header,
+			      (void *)intent->trace[i],
+			      (void *)intent->trace[i]);
+	return p;
+}
+
+
+/*
+ * ioblame/intents[_bin] - export intents to userland.
+ *
+ * Userland can acquire intents by reading either ioblame/intents or
+ * intents_bin, where the former is human readable text and the latter in
+ * binary format.
+ *
+ * While iob is enabled, intents are never reclaimed, intent nr is
+ * guaranteed to be allocated consecutively in ascending order and both
+ * intents files are lseekable by intent nr, so userland tools which want
+ * to learn about new intents since last reading can simply seek to the
+ * number of currently known intents and start reading from there.
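+ *
+ * Each text record is a "#NR modifier=0xMOD" line followed by one stack
+ * frame per line (see iob_print_intent()).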
+ */
+static loff_t iob_intents_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t ret = -EIO;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled) {
+		/*
+		 * We seek by intent nr and don't care about i_size.
+		 * Temporarily set i_size to nr_nodes and hitch on generic
+		 * llseek.
+		 */
+		i_size_write(file->f_dentry->d_inode, iob_intent_idx->nr_nodes);
+		ret = generic_file_llseek(file, offset, origin);
+		i_size_write(file->f_dentry->d_inode, 0);
+	}
+
+	mutex_unlock(&iob_mutex);
+	return ret;
+}
+
+static ssize_t iob_intents_read(struct file *file, char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	char *buf, *p, *e;
+	int err;
+
+	if (count < PAGE_SIZE)
+		return -EINVAL;
+
+	err = -EIO;
+	mutex_lock(&iob_mutex);
+	if (!iob_enabled)
+		goto out;
+
+	p = buf = iob_page_buf;
+	e = p + PAGE_SIZE;
+
+	err = 0;
+	if (*ppos >= iob_intent_idx->nr_nodes)
+		goto out;
+
+	/* print to buf */
+	rcu_read_lock_sched();
+	p = iob_print_intent(p, e, iob_intent_by_nr(*ppos), "");
+	rcu_read_unlock_sched();
+	WARN_ON_ONCE(p == e);
+
+	/* copy out */
+	err = -EFAULT;
+	if (copy_to_user(ubuf, buf, p - buf))
+		goto out;
+
+	(*ppos)++;
+	err = 0;
+out:
+	mutex_unlock(&iob_mutex);
+	return err ?: p - buf;
+}
+
+static ssize_t iob_intents_read_bin(struct file *file, char __user *ubuf,
+				    size_t count, loff_t *ppos)
+{
+	static struct {
+		struct iob_intent_bin_record r;
+		uint64_t s[IOB_STACK_MAX_DEPTH];
+	} rec_buf = { .r.ver = IOB_INTENTS_BIN_VER };
+	struct iob_intent_bin_record *rec = &rec_buf.r;
+	char __user *up = ubuf, __user *ue = ubuf + count;
+	int nr, err = 0;
+
+	mutex_lock(&iob_mutex);
+	if (!iob_enabled) {
+		err = -EIO;
+		goto out;
+	}
+
+	/* for each intent */
+	for (nr = *ppos; nr < iob_intent_idx->nr_nodes; nr++) {
+		struct iob_intent *intent;
+		size_t tlen;
+
+		/* print to buf */
+		rcu_read_lock_sched();
+
+		intent = iob_intent_by_nr(nr);
+		tlen = sizeof(rec->trace[0]) * intent->depth;
+
+		rec->len = offsetof(struct iob_intent_bin_record, trace) + tlen;
+		rec->nr = intent->node.id.f.nr;
+		rec->modifier = intent->modifier;
+		memcpy(rec->trace, intent->trace, tlen);
+
+		rcu_read_unlock_sched();
+
+		/* copy out */
+		if (ue - up < rec->len)
+			break;
+
+		if (copy_to_user(up, rec, rec->len)) {
+			err = -EFAULT;
+			break;
+		}
+		up += rec->len;
+		*ppos = nr + 1;
+	}
+out:
+	mutex_unlock(&iob_mutex);
+
+	if (err && up == ubuf)
+		return err;
+	return up - ubuf;
+}
+
+static const struct file_operations iob_intents_fops = {
+	.owner		= THIS_MODULE,
+	.open		= generic_file_open,
+	.llseek		= iob_intents_llseek,
+	.read		= iob_intents_read,
+};
+
+static const struct file_operations iob_intents_bin_fops = {
+	.owner		= THIS_MODULE,
+	.open		= generic_file_open,
+	.llseek		= iob_intents_llseek,
+	.read		= iob_intents_read_bin,
+};
+
+/*
+ * ioblame/counters_pipe[_bin] - export counters to userland and reclaim acts.
+ *
+ * Userland can acquire dirty counters by reading either
+ * ioblame/counters_pipe or counters_pipe_bin, where the former is human
+ * readable text and the latter in binary format.
+ *
+ * As acts can't be reclaimed with dirty counters, accessing counters also
+ * triggers reclaim.  Opening either of the two counters_pipe files switches
+ * the current used bitmap with staging, and closing folds staging into the
+ * front bitmap and starts the rest of reclaim.
+ *
+ * Each open-(N*read)-close cycle clears dirtiness on all counters whether
+ * all the counters were read or not and concurrent accesses to
+ * counters_pipe files aren't allowed.
+ *
+ * Note that cnts of all acts which have been used are reported whether
+ * cnts themselves have been updated or not.  ie. Counters which haven't
+ * changed since last read might be reported again.
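+ *
+ * A typical consumer therefore opens one of the two files, reads until
+ * read() returns 0 and closes it; the close is what kicks off act
+ * reclaim.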
+ */
+
+static int iobc_pipe_open(struct inode *inode, struct file *filp)
+{
+	int ret = -EIO;
+
+	mutex_lock(&iob_mutex);
+
+	/* only one opener, opened is cleared on release or iob_disable() */
+	if (iob_enabled && !iobc_pipe_opened) {
+		/* switch used and staging */
+		iob_switch_act_used_cur();
+		iobc_pipe_opened = true;
+		ret = 0;
+	}
+
+	mutex_unlock(&iob_mutex);
+	return ret;
+}
+
+static loff_t iobc_pipe_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t ret;
+
+	/*
+	 * We seek by act nr and don't care about i_size.  Temporarily set
+	 * i_size to iob_max_acts and hitch on generic llseek.
+	 */
+	i_size_write(file->f_dentry->d_inode, iob_max_acts);
+	ret = generic_file_llseek(file, offset, origin);
+	i_size_write(file->f_dentry->d_inode, 0);
+
+	return ret;
+}
+
+static ssize_t iobc_pipe_read(struct file *file, char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	unsigned long *bitmap = iob_act_used.staging;
+	unsigned long bit = *ppos;
+	struct iob_act *act;
+	struct iob_intent *intent;
+	char *buf, *p, *e;
+	int i, j, err;
+
+	if (count < PAGE_SIZE)
+		return -EINVAL;
+
+	err = -EIO;
+	mutex_lock(&iob_mutex);
+	if (!iobc_pipe_opened)
+		goto out;
+
+	p = buf = iob_page_buf;
+	e = p + PAGE_SIZE;
+
+	rcu_read_lock_sched();
+
+	/* find the next used act w/ cnts */
+	while (true) {
+		err = 0;
+		bit = find_next_bit(bitmap, iob_max_acts, bit);
+		if (bit >= iob_max_acts) {
+			rcu_read_unlock_sched();
+			goto out;
+		}
+		act = iob_act_by_nr(bit);
+		if (act->cnts)
+			break;
+		bit++;
+	}
+
+	/* print to buf */
+	intent = iob_intent_by_id(act->intent);
+
+	p = iob_print_role(p, e, act->role);
+	p = iob_print(p, e, " int=%u dev=0x%x ino=0x%lx gen=0x%x\n",
+		      intent->node.id.f.nr, act->dev, act->ino, act->gen);
+
+	for (i = 0; i < iobc_nr_types; i++) {
+		p = iob_print(p, e, " ");
+		for (j = 0; j < IOBC_NR_SLOTS; j++)
+			p = iob_print(p, e, " %7d", act->cnts[iobc_idx(i, j)]);
+		p = iob_print(p, e, "\n");
+	}
+
+	rcu_read_unlock_sched();
+
+	/* copy out */
+	err = -EFAULT;
+	if (copy_to_user(ubuf, buf, p - buf))
+		goto out;
+
+	*ppos = bit + 1;
+	err = 0;
+out:
+	mutex_unlock(&iob_mutex);
+	return err ?: p - buf;
+}
+
+static ssize_t iobc_pipe_read_bin(struct file *file, char __user *ubuf,
+				  size_t count, loff_t *ppos)
+{
+	unsigned long *bitmap = iob_act_used.staging;
+	char __user *up = ubuf, __user *ue = ubuf + count;
+	struct iobc_pipe_bin_record *rec;
+	size_t reclen;
+	int i, j, bit, err;
+
+	/* sanity checks */
+	reclen = sizeof(struct iobc_pipe_bin_record) +
+		iobc_nr_types * IOBC_NR_SLOTS * sizeof(u32);
+	if (reclen > PAGE_SIZE) {
+		pr_err_once("ioblame: doesn't support bin counter reads larger than PAGE_SIZE");
+		return -EINVAL;
+	}
+	if (reclen > count)
+		return -EOVERFLOW;
+
+	err = -EIO;
+	mutex_lock(&iob_mutex);
+	if (!iobc_pipe_opened)
+		goto out;
+
+	rec = (void *)iob_page_buf;
+	memset(rec, 0, sizeof(*rec));
+	rec->ver = IOBC_PIPE_BIN_VER;
+	rec->len = reclen;
+
+	bit = *ppos;
+	do {
+		struct iob_act *act;
+		struct iob_role *role;
+
+		/* for each used act w/ cnts */
+		bit = find_next_bit(bitmap, iob_max_acts, bit);
+		if (bit >= iob_max_acts)
+			break;
+
+		rcu_read_lock_sched();
+
+		act = iob_act_by_nr(bit);
+		if (!act->cnts) {
+			rcu_read_unlock_sched();
+			goto next;
+		}
+
+		role = iob_role_by_id(act->role);
+
+		/* fill in @rec */
+		rec->id = role->pid;
+		rec->intent_nr = act->intent.f.nr;
+		rec->dev = act->dev;
+		rec->ino = act->ino;
+		rec->gen = act->gen;
+
+		/* @act->cnts is transposed, transpose it back for userland */
+		for (i = 0; i < iobc_nr_types; i++)
+			for (j = 0; j < IOBC_NR_SLOTS; j++)
+				rec->cnts[i * IOBC_NR_SLOTS + j] =
+					act->cnts[iobc_idx(i, j)];
+		rcu_read_unlock_sched();
+
+		/* copy out */
+		err = -EFAULT;
+		if (copy_to_user(up, rec, rec->len))
+			goto out;
+		up += reclen;
+	next:
+		*ppos = ++bit;
+	} while (up + reclen <= ue);
+
+	err = 0;
+out:
+	mutex_unlock(&iob_mutex);
+
+	if (err && up == ubuf)
+		return err;
+	return up - ubuf;
+}
+
+static int iobc_pipe_release(struct inode *inode, struct file *file)
+{
+	mutex_lock(&iob_mutex);
+	if (iobc_pipe_opened) {
+		/* all used acts are reported, trigger reclaim */
+		iob_reclaim();
+		iobc_pipe_opened = false;
+	}
+	mutex_unlock(&iob_mutex);
+	return 0;
+}
+
+static const struct file_operations iobc_pipe_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iobc_pipe_open,
+	.llseek		= iobc_pipe_llseek,
+	.read		= iobc_pipe_read,
+	.release	= iobc_pipe_release,
+};
+
+static const struct file_operations iobc_pipe_bin_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iobc_pipe_open,
+	.llseek		= iobc_pipe_llseek,
+	.read		= iobc_pipe_read_bin,
+	.release	= iobc_pipe_release,
+};
+
+/*
+ * ioblame/iolog - debug pipe which dumps every iob_io_info on bio completion
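+ *
+ * iob_iolog is a fixed-size ring buffer - head == tail means empty and
+ * when the producer catches up with the consumer the oldest entry is
+ * dropped and iolog_overflow is bumped.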
+ */
+static void iob_iolog_fill(struct iob_io_info *io)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&iob_iolog_lock, flags);
+
+	iob_iolog[iob_iolog_head] = *io;
+
+	/* if was empty, wake up consumer */
+	if (iob_iolog_head == iob_iolog_tail)
+		wake_up(&iob_iolog_wait);
+
+	iob_iolog_head = (iob_iolog_head + 1) % IOB_IOLOG_CNT;
+
+	/* if full, forget the oldest entry */
+	if (iob_iolog_head == iob_iolog_tail) {
+		iob_iolog_tail = (iob_iolog_tail + 1) % IOB_IOLOG_CNT;
+		iob_stats.iolog_overflow++;
+	}
+
+	spin_unlock_irqrestore(&iob_iolog_lock, flags);
+}
+
+static int iob_iolog_consume(struct iob_io_info *io)
+{
+	unsigned long flags;
+	int ret;
+retry:
+	ret = wait_event_interruptible(iob_iolog_wait,
+				       iob_iolog_head != iob_iolog_tail);
+	if (ret)
+		return ret;
+
+	spin_lock_irqsave(&iob_iolog_lock, flags);
+
+	if (iob_iolog_head == iob_iolog_tail) {
+		spin_unlock_irqrestore(&iob_iolog_lock, flags);
+		goto retry;
+	}
+
+	*io = iob_iolog[iob_iolog_tail];
+	iob_iolog_tail = (iob_iolog_tail + 1) % IOB_IOLOG_CNT;
+
+	spin_unlock_irqrestore(&iob_iolog_lock, flags);
+
+	return 0;
+}
+
+static int iob_iolog_open(struct inode *inode, struct file *file)
+{
+	int ret;
+
+	mutex_lock(&iob_mutex);
+
+	ret = nonseekable_open(inode, file);
+	if (ret)
+		goto out_unlock;
+
+	ret = -EBUSY;
+	if (iob_iolog)
+		goto out_unlock;
+
+	ret = -ENOMEM;
+	iob_iolog = vzalloc(sizeof(iob_iolog[0]) * IOB_IOLOG_CNT);
+	if (!iob_iolog)
+		goto out_unlock;
+
+	iob_iolog_buf = (void *)__get_free_page(GFP_KERNEL);
+	if (!iob_iolog_buf) {
+		vfree(iob_iolog);
+		iob_iolog = NULL;
+		goto out_unlock;
+	}
+
+	ret = 0;
+out_unlock:
+	mutex_unlock(&iob_mutex);
+	return ret;
+}
+
+static ssize_t iob_iolog_read(struct file *file, char __user *ubuf,
+			      size_t len, loff_t *ppos)
+{
+	char *p = iob_iolog_buf;
+	char *e = p + PAGE_SIZE;
+	struct iob_io_info io;
+	struct iob_act *act;
+	int ret;
+
+	if (len < PAGE_SIZE)
+		return -EINVAL;
+
+	ret = iob_iolog_consume(&io);
+	if (ret)
+		return ret;
+
+	rcu_read_lock_sched();
+	if (!iob_enabled) {
+		rcu_read_unlock_sched();
+		return -EIO;
+	}
+
+	act = iob_act_by_id(io.act);
+
+	p = iob_print(p, e, "%c %u @ %llu ", io.rw & REQ_WRITE ? 'W' : 'R',
+		      io.size, (unsigned long long)io.sector);
+	p = iob_print_role(p, e, act->role);
+	p = iob_print(p, e, " dev=0x%x ino=0x%lx gen=0x%x\n",
+		      act->dev, act->ino, act->gen);
+	p = iob_print_intent(p, e, iob_intent_by_id(act->intent), "  ");
+
+	rcu_read_unlock_sched();
+
+	ret = p - iob_iolog_buf;
+	if (copy_to_user(ubuf, iob_iolog_buf, ret))
+		return -EFAULT;
+	return ret;
+}
+
+static int iob_iolog_release(struct inode *inode, struct file *file)
+{
+	struct iob_io_info *iolog = iob_iolog;
+
+	mutex_lock(&iob_mutex);
+
+	iob_iolog = NULL;
+	synchronize_sched();
+
+	vfree(iolog);
+	free_page((unsigned long)iob_iolog_buf);
+	iob_iolog_head = iob_iolog_tail = 0;
+	iob_iolog_buf = NULL;
+
+	mutex_unlock(&iob_mutex);
+	return 0;
+}
+
+static const struct file_operations iob_iolog_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iob_iolog_open,
+	.read		= iob_iolog_read,
+	.release	= iob_iolog_release,
+};
+
+static int __init ioblame_init(void)
+{
+	struct dentry *stats_dir;
+	int i;
+
+	BUILD_BUG_ON((1 << IOB_TYPE_BITS) < IOB_NR_TYPES);
+	BUILD_BUG_ON(IOB_NR_BITS + IOB_GEN_BITS + IOB_TYPE_BITS != 64);
+
+	iob_role_cache = KMEM_CACHE(iob_role, 0);
+	iob_act_cache = KMEM_CACHE(iob_act, 0);
+	if (!iob_role_cache || !iob_act_cache)
+		goto fail;
+
+	/* build iobc_event_fields list, used to parse filters */
+	for (i = 0; i < IOBC_NR_FIELDS; i++) {
+		struct ftrace_event_field *f = &iobc_event_fields[i];
+
+		f->name = iobc_field_strs[i];
+		f->filter_type = FILTER_OTHER;
+		f->offset = i * sizeof(u32);
+		f->size = sizeof(u32);
+		f->is_signed = 0;
+
+		list_add_tail(&f->link, &iobc_event_field_list);
+	}
+
+	/* create ioblame/ dirs and files */
+	iob_dir = debugfs_create_dir("ioblame", NULL);
+	if (!iob_dir)
+		goto fail;
+
+	if (!debugfs_create_file("max_roles", 0600, iob_dir, &iob_max_roles, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("max_intents", 0600, iob_dir, &iob_max_intents, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("max_acts", 0600, iob_dir, &iob_max_acts, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("ttl_secs", 0600, iob_dir, &iob_ttl_secs, &iob_uint_fops) ||
+	    !debugfs_create_file("ignore_ino", 0600, iob_dir, &iob_ignore_ino, &iob_bool_fops) ||
+	    !debugfs_create_file("devs", 0600, iob_dir, NULL, &iob_devs_fops) ||
+	    !debugfs_create_file("intents", 0400, iob_dir, NULL, &iob_intents_fops) ||
+	    !debugfs_create_file("intents_bin", 0400, iob_dir, NULL, &iob_intents_bin_fops) ||
+	    !debugfs_create_file("nr_counters", 0600, iob_dir, NULL, &iobc_nr_types_fops) ||
+	    !debugfs_create_file("counters_pipe", 0200, iob_dir, NULL, &iobc_pipe_fops) ||
+	    !debugfs_create_file("counters_pipe_bin", 0200, iob_dir, NULL, &iobc_pipe_bin_fops) ||
+	    !debugfs_create_file("iolog", 0600, iob_dir, NULL, &iob_iolog_fops) ||
+	    !debugfs_create_file("enable", 0600, iob_dir, &iob_enabled, &iob_enable_fops) ||
+	    !debugfs_create_file("nr_roles", 0400, iob_dir, &iob_role_idx, &iob_nr_nodes_fops) ||
+	    !debugfs_create_file("nr_intents", 0400, iob_dir, &iob_intent_idx, &iob_nr_nodes_fops) ||
+	    !debugfs_create_file("nr_acts", 0400, iob_dir, &iob_act_idx, &iob_nr_nodes_fops))
+		goto fail;
+
+	stats_dir = debugfs_create_dir("stats", iob_dir);
+	if (!stats_dir)
+		goto fail;
+	if (!debugfs_create_file("idx_nomem", 0400, stats_dir, &iob_stats.idx_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("idx_nospc", 0400, stats_dir, &iob_stats.idx_nospc, &iob_stats_fops) ||
+	    !debugfs_create_file("node_nomem", 0400, stats_dir, &iob_stats.node_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("pgtree_nomem", 0400, stats_dir, &iob_stats.pgtree_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("cnts_nomem", 0400, stats_dir, &iob_stats.cnts_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("iolog_overflow", 0400, stats_dir, &iob_stats.iolog_overflow, &iob_stats_fops))
+		goto fail;
+
+	return 0;
+
+fail:
+	if (iob_role_cache)
+		kmem_cache_destroy(iob_role_cache);
+	if (iob_act_cache)
+		kmem_cache_destroy(iob_act_cache);
+	if (iob_dir)
+		debugfs_remove_recursive(iob_dir);
+	return -ENOMEM;
+}
+
+static void __exit ioblame_exit(void)
+{
+	iob_disable();
+	debugfs_remove_recursive(iob_dir);
+	kmem_cache_destroy(iob_role_cache);
+	kmem_cache_destroy(iob_act_cache);
+}
+
+module_init(ioblame_init);
+module_exit(ioblame_exit);
+
+MODULE_AUTHOR("Tejun Heo <tj@kernel.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("IO monitor with dirtier and issuer tracking");
-- 
1.7.3.1


* Re: [RFC PATCHSET RESEND] ioblame: statistical IO analyzer
  2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
                   ` (10 preceding siblings ...)
  2012-01-05 23:42 ` [PATCH 11/11] block, trace: implement ioblame IO statistical analyzer Tejun Heo
@ 2012-01-06  9:00 ` Namhyung Kim
  2012-01-06 16:02   ` Tejun Heo
  11 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2012-01-06  9:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dsharp, linux-kernel

Hi, Tejun

2012-01-06 AM 8:42, Tejun Heo Wrote:
> This is re-post.  The original posting was on Dec 15th but it was
> missing cc to LKML.  I got some responses on that thread so didn't
> suspect LKML was missing.  If you're getting it the second time.  My
> apologies.
>
> Stuff pointed out in the original thread are...
>
> * Is the quick variant of backtrace gathering really necessary? -
>    Still need to get performance numbers.
>
> * TRACE_EVENT_CONDITION() can be used in some places. - Will be
>    updated.
>
> Original message follows.  Thanks.
>
> Hello, guys.
>
> Even with blktrace and tracepoints, getting insight into the IOs going
> on a system is very challenging.  A lot of IO operations happen long
> after the action which triggered the IO finished and the overall
> asynchronous nature of IO operations make it difficult to trace back
> the origin of a given IO.
>
> ioblame is an attempt at providing better visibility into overall IO
> behavior.  ioblame hooks into various tracepoints and tries to
> determine who caused any given IO how and charges the IO accordingly.
>
> On each IO completion, ioblame knows who to charge the IO (task), how
> the IO got triggered (stack trace at the point of triggering, be it
> page, inode dirtying or direct IO issue) and various information about
> the IO itself (offset, size, how long it took and so on).  ioblame
> collects this information into histograms which is configurable from
> userland using debugfs interface.
>
> For example, using ioblame, user can acquire information like "this
> task triggered IO with this stack trace on this file with the
> following offset distribution".
>
> For more details, please read Documentation/trace/ioblame.txt, which
> I'll append to this message too for discussion.
>
> This patchset contains the following 11 patches.
>
>    0001-trace_event_filter-factorize-filter-creation.patch
>    0002-trace_event_filter-add-trace_event_filter_-interface.patch
>    0003-block-block_bio_complete-tracepoint-was-missing.patch
>    0004-block-add-req-to-bio_-front-back-_merge-tracepoints.patch
>    0005-block-abstract-disk-iteration-into-disk_iter.patch
>    0006-writeback-move-struct-wb_writeback_work-to-writeback.patch
>    0007-writeback-add-more-tracepoints.patch
>    0008-block-add-block_touch_buffer-tracepoint.patch
>    0009-vfs-add-fcheck-tracepoint.patch
>    0010-stacktrace-implement-save_stack_trace_quick.patch
>    0011-block-trace-implement-ioblame-IO-statistical-analyze.patch
>
> 0001-0002 export trace_event_filter so that ioblame can use it too.
>
> 0003 adds back block_bio_complete TP invocation, which got lost
> somehow.  This probably makes sense as fix patch for 3.2.
>
> 0004-0006 update block layer in preparation.  0005 probably makes
> sense as a standalone patch too.
>
> 0007-0009 add more tracepoints along the IO stack.
>
> 0010 adds nimbler backtrace dump function as ioblame dumps stacktrace
> extremely frequently.
>
> 0011 implements ioblame.
>
> This is still in early stage and I haven't done much performance
> analysis yet.  Tentative testing shows it adds ~20% CPU overhead when
> used on memory backed loopback device.
>
> The patches are on top of mainline (42ebfc61cf "Merge branch
> 'stable...git/konrad/xen'") and perf/core (74eec26fac "perf tools: Add
> ability to synthesize event according to a sample").
>
> It's also available in the following git branch.
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-ioblame
>

Very interesting. It should help with analyzing and improving IO
performance a lot.

BTW, it seems ioblame is based on the event tracing feature, so couldn't
it be implemented in userspace with the help of the tracepoints and
additional information (e.g. intent, ...) you add? perf can deal with
them and extend post-processing capability easily, and it might also
reduce some work in the kernel, I guess.

Thanks
Namhyung Kim

* Re: [RFC PATCHSET RESEND] ioblame: statistical IO analyzer
  2012-01-06  9:00 ` [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Namhyung Kim
@ 2012-01-06 16:02   ` Tejun Heo
  2012-01-06 16:33     ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2012-01-06 16:02 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dsharp, linux-kernel

Hello, Namhyung.

On Fri, Jan 06, 2012 at 06:00:04PM +0900, Namhyung Kim wrote:
> BTW, it seems ioblame is based on the event tracing feature, so
> couldn't it be implemented in userspace with the help of the
> tracepoints and additional information (e.g. intent, ...) you add?
> perf can deal with them and extend post-processing capability
> easily, and it might also reduce some work in the kernel, I guess.

Yeah, it uses tracepoints to gather information it needs, but
producing relevant information (like the intent id) requires
nontrivial state tracking.  The point where it would make sense to
push to userland is the iolog, where all the relevant information has
been gathered for each IO.  Currently, the export interface there is
pretty dumb and slow.

Hmmm... originally, I had variable length data structure there but now
it's fixed so exposing them using tracepoint shouldn't be too
difficult and could actually be better (previously it didn't really
fit TP and ringbuffer should be used directly).  Yeah, that's a
thought.  Generating a TP event per IO shouldn't be taxing and it
would give much better visibility to userland and we can drop the
whole statistics configuration and stuff.  Enticing.  I'll think more
about it.

Thanks.

-- 
tejun

* Re: [RFC PATCHSET RESEND] ioblame: statistical IO analyzer
  2012-01-06 16:02   ` Tejun Heo
@ 2012-01-06 16:33     ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2012-01-06 16:33 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp, linux-kernel

Hello, again.

On Fri, Jan 06, 2012 at 08:02:35AM -0800, Tejun Heo wrote:
> Hmmm... originally, I had variable length data structure there but now
> it's fixed so exposing them using tracepoint shouldn't be too
> difficult and could actually be better (previously it didn't really
> fit TP and ringbuffer should be used directly).  Yeah, that's a
> thought.  Generating a TP event per IO shouldn't be taxing and it
> would give much better visibility to userland and we can drop the
> whole statistics configuration and stuff.  Enticing.  I'll think more
> about it.

The more I think about it, the better it seems.  I ended up putting
the statistics gathering inside the kernel mostly because the
information I had at IO completion wasn't too fit to export to
userland verbatim, but while iterating the implementation those parts
got chopped (it was necessary for statistics too anyway) and now it
seems I can just replace the whole thing with a single tracepoint and
drop the whole configurable statistics stuff along with act and its
reclaiming, and the silly _bin interface.  That's just sweeet.  Somehow I
completely missed the transition of the information available at IO
completion.

I might be missing something important but will give it a shot and, if
it goes as expected, post updated series soon.

Thanks a lot for bringing it up. :)

-- 
tejun

* Re: [PATCH 03/11] block: block_bio_complete tracepoint was missing
  2012-01-05 23:42 ` [PATCH 03/11] block: block_bio_complete tracepoint was missing Tejun Heo
@ 2012-01-09  1:30   ` Namhyung Kim
  2012-01-09  1:49     ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Namhyung Kim @ 2012-01-09  1:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dsharp, linux-kernel

2012-01-06 8:42 AM, Tejun Heo wrote:
> block_bio_complete tracepoint was defined but not invoked anywhere.
> Fix it.
>
> Signed-off-by: Tejun Heo<tj@kernel.org>
> ---
>   fs/bio.c |    3 +++
>   1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/bio.c b/fs/bio.c
> index b1fe82c..96548da 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1447,6 +1447,9 @@ void bio_endio(struct bio *bio, int error)
>   	else if (!test_bit(BIO_UPTODATE,&bio->bi_flags))
>   		error = -EIO;
>
> +	if (bio->bi_bdev)
> +		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
> +					 bio, error);
>   	if (bio->bi_end_io)
>   		bio->bi_end_io(bio, error);
>   }

Hi,

Just adding the TP unconditionally will produce duplicated (in some 
sense) reports about the completion. For example, normal request based 
IO reports whole request completion prior to its bio's, and further, 
some of nested block IO handling routines - bounced bio and btrfs with 
compression, etc - call bio_endio() more than once. Also there are cases 
that bio fails before it's enqueued for some reason.

I have no idea whether ioblame can deal with all such corner cases. 
However it might confuse blktrace somewhat, I guess.

I already posted a similar patch a couple of weeks ago, but haven't 
received a comment yet. [1] Please take a look at this too :)

After a quick glance, the ioblame seems to carry all IO accounting info 
through the first bio in the request. If so, why don't you use the 
request structure for that?


Thanks,
Namhyung Kim


[1] https://lkml.org/lkml/2011/12/27/111

* Re: [PATCH 03/11] block: block_bio_complete tracepoint was missing
  2012-01-09  1:30   ` Namhyung Kim
@ 2012-01-09  1:49     ` Tejun Heo
  2012-01-09  2:33       ` Namhyung Kim
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2012-01-09  1:49 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dsharp, linux-kernel

Hello,

On Mon, Jan 09, 2012 at 10:30:06AM +0900, Namhyung Kim wrote:
> Just adding the TP unconditionally will produce duplicated (in some
> sense) reports about the completion. For example, normal request
> based IO reports whole request completion prior to its bio's, and
> further

Request and bio completions are separate events.  There's nothing
wrong with reporting them separately.  In fact, I think they should be
reported separately.

>, some of nested block IO handling routines - bounced bio and
> btrfs with compression, etc - call bio_endio() more than once. Also
> there are cases that bio fails before it's enqueued for some reason.

They are actually separate bio's being completed.  I don't think
trying to put extra semantics on TP itself is a good idea.  In
general, TP signals that such event happened with sufficient
information and it's the consumers' responsibility to make sense of
what's going on.  BIO_CLONED/BOUNCED are there.

> I have no idea whether ioblame can deal with all such corner
> cases. However it might confuse blktrace somewhat, I guess.

Unless someone is doing memcpy() on bio's, ioblame should be okay.  It
only considers bio's which went through actual submission.

> I already posted a similar patch a couple of weeks ago, but haven't
> received a comment yet. [1] Please take a look at this too :)

I'll reply there but don't think imposing such extra logic on TP is a
good idea.

> After a quick glance, the ioblame seems to carry all IO accounting
> info through the first bio in the request. If so, why don't you use
> the request structure for that?

Because there are bio based drivers which don't use requests at all.

Thanks.

--
tejun

* Re: [PATCH 03/11] block: block_bio_complete tracepoint was missing
  2012-01-09  1:49     ` Tejun Heo
@ 2012-01-09  2:33       ` Namhyung Kim
  0 siblings, 0 replies; 18+ messages in thread
From: Namhyung Kim @ 2012-01-09  2:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, linux-kernel

Hi,

2012-01-09 10:49 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jan 09, 2012 at 10:30:06AM +0900, Namhyung Kim wrote:
>> Just adding the TP unconditionally will produce duplicated (in some
>> sense) reports about the completion. For example, normal request
>> based IO reports whole request completion prior to its bio's, and
>> further
>
> Request and bio completions are separate events.  There's nothing
> wrong with reporting them separately.  In fact, I think they should be
> reported separately.
>
>> , some of nested block IO handling routines - bounced bio and
>> btrfs with compression, etc - call bio_endio() more than once. Also
>> there are cases that bio fails before it's enqueued for some reason.
>
> They are actually separate bio's being completed.  I don't think
> trying to put extra semantics on TP itself is a good idea.  In
> general, TP signals that such event happened with sufficient
> information and it's the consumers' responsibility to make sense of
> what's going on.  BIO_CLONED/BOUNCED are there.

I see.


>> I have no idea whether ioblame can deal with all such corner
>> cases. However it might confuse blktrace somewhat, I guess.
>
> Unless someone is doing memcpy() on bio's, ioblame should be okay.  It
> only considers bio's which went through actual submission.
>
>> I already posted a similar patch a couple of weeks ago, but haven't
>> received a comment yet. [1] Please take a look at this too :)
>
> I'll reply there but don't think imposing such extra logic on TP is a
> good idea.

I'll reply on that thread too. :)


>> After a quick glance, the ioblame seems to carry all IO accounting
>> info through the first bio in the request. If so, why don't you use
>> the request structure for that?
>
> Because there are bio based drivers which don't use requests at all.

What I thought of for such drivers was dynamic allocation in their 
->make_request_fn, but since we don't have a block_bio_issue TP... 
never mind. :)


Thanks,
Namhyung Kim

end of thread, other threads:[~2012-01-09  2:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-05 23:42 [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Tejun Heo
2012-01-05 23:42 ` [PATCH 01/11] trace_event_filter: factorize filter creation Tejun Heo
2012-01-05 23:42 ` [PATCH 02/11] trace_event_filter: add trace_event_filter_*() interface Tejun Heo
2012-01-05 23:42 ` [PATCH 03/11] block: block_bio_complete tracepoint was missing Tejun Heo
2012-01-09  1:30   ` Namhyung Kim
2012-01-09  1:49     ` Tejun Heo
2012-01-09  2:33       ` Namhyung Kim
2012-01-05 23:42 ` [PATCH 04/11] block: add @req to bio_{front|back}_merge tracepoints Tejun Heo
2012-01-05 23:42 ` [PATCH 05/11] block: abstract disk iteration into disk_iter Tejun Heo
2012-01-05 23:42 ` [PATCH 06/11] writeback: move struct wb_writeback_work to writeback.h Tejun Heo
2012-01-05 23:42 ` [PATCH 07/11] writeback: add more tracepoints Tejun Heo
2012-01-05 23:42 ` [PATCH 08/11] block: add block_touch_buffer tracepoint Tejun Heo
2012-01-05 23:42 ` [PATCH 09/11] vfs: add fcheck tracepoint Tejun Heo
2012-01-05 23:42 ` [PATCH 10/11] stacktrace: implement save_stack_trace_quick() Tejun Heo
2012-01-05 23:42 ` [PATCH 11/11] block, trace: implement ioblame IO statistical analyzer Tejun Heo
2012-01-06  9:00 ` [RFC PATCHSET RESEND] ioblame: statistical IO analyzer Namhyung Kim
2012-01-06 16:02   ` Tejun Heo
2012-01-06 16:33     ` Tejun Heo
