* [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking
@ 2012-01-10 18:28 Tejun Heo
  2012-01-10 18:28 ` [PATCH 1/9] block: abstract disk iteration into disk_iter Tejun Heo
                   ` (9 more replies)
  0 siblings, 10 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung

Hello, guys.

Even with blktrace and tracepoints, getting insight into the IOs going
on in a system is very challenging.  A lot of IO operations happen long
after the action which triggered them has finished, and the overall
asynchronous nature of IO operations makes it difficult to trace back
the origin of a given IO.

ioblame is an attempt at providing better visibility into overall IO
behavior.  ioblame hooks into various tracepoints, tries to determine
who caused any given IO and how, and charges the IO accordingly.

On each IO completion, ioblame knows whom to charge the IO to (task),
how the IO got triggered (stack trace at the point of triggering, be it
page dirtying, inode dirtying or direct IO issue) and various
information about the IO itself (offset, size, how long it took and so
on).  ioblame exports this information via the ioblame:ioblame_io
tracepoint.

For more details, please read Documentation/trace/ioblame.txt.

Changes from the last take[L] are,

* Per Namhyung's suggestion, in-kernel statistics gathering has been
  stripped out.  All information is now exported through a tracepoint
  event per IO.  This makes a lot of machinery unnecessary and over
  1500 lines of code have been removed.

* The block_bio_complete tracepoint patch will result in duplicate
  BLK_TA_COMPLETE notifications.  Namhyung is working on a proper
  solution.  For now, the SOB is removed from that patch.

* Trace filter is no longer used and patches dropped from the series.

* Rebased on top of v3.2.

This patchset contains the following 9 patches.

  0001-block-abstract-disk-iteration-into-disk_iter.patch
  0002-block-block_bio_complete-tracepoint-was-missing.patch
  0003-block-add-req-to-bio_-front-back-_merge-tracepoints.patch
  0004-writeback-move-struct-wb_writeback_work-to-writeback.patch
  0005-writeback-add-more-tracepoints.patch
  0006-block-add-block_touch_buffer-tracepoint.patch
  0007-vfs-add-fcheck-tracepoint.patch
  0008-stacktrace-implement-save_stack_trace_quick.patch
  0009-block-trace-implement-ioblame-IO-tracer-with-origin-.patch

0001-0004 update block layer in preparation.

0005-0007 add more tracepoints along the IO stack.

0008 adds a nimbler backtrace dump function, as ioblame dumps stack
traces extremely frequently.

0009 implements ioblame.

This is still at an early stage and I haven't done much performance
analysis yet.  Tentative testing shows it adds ~20% CPU overhead when
used on a memory-backed loopback device.

The patches are on top of v3.2 and available in the following git
branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-ioblame

diffstat follows.

 Documentation/trace/ioblame.txt   |  476 +++++++
 arch/x86/include/asm/stacktrace.h |    2 
 arch/x86/kernel/stacktrace.c      |   40 
 block/blk-core.c                  |    5 
 block/genhd.c                     |   98 +
 fs/bio.c                          |    3 
 fs/fs-writeback.c                 |   34 
 fs/super.c                        |    2 
 include/linux/blk_types.h         |    4 
 include/linux/buffer_head.h       |    7 
 include/linux/fdtable.h           |    3 
 include/linux/fs.h                |    3 
 include/linux/genhd.h             |   13 
 include/linux/ioblame.h           |   72 +
 include/linux/stacktrace.h        |    6 
 include/linux/writeback.h         |   18 
 include/trace/events/block.h      |   70 -
 include/trace/events/vfs.h        |   40 
 include/trace/events/writeback.h  |  113 +
 kernel/stacktrace.c               |    6 
 kernel/trace/Kconfig              |   12 
 kernel/trace/Makefile             |    1 
 kernel/trace/blktrace.c           |    2 
 kernel/trace/ioblame.c            | 2279 ++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c               |    2 
 25 files changed, 3244 insertions(+), 67 deletions(-)

Thanks.

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/1235937


* [PATCH 1/9] block: abstract disk iteration into disk_iter
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-10 18:28 ` [PATCH 2/9] block: block_bio_complete tracepoint was missing Tejun Heo
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

Instead of using class_dev_iter directly, abstract disk iteration into
disk_iter and helpers which are exported.  This simplifies the callers
a bit and allows external users to iterate over disks.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/genhd.c         |   98 +++++++++++++++++++++++++++++++++----------------
 include/linux/genhd.h |   10 ++++-
 2 files changed, 75 insertions(+), 33 deletions(-)
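
As a usage illustration (not part of this patch), an external caller
walking all disks with the new helpers might look roughly like this:

  struct disk_iter diter;
  struct gendisk *disk;

  /* iterate over all registered disks and log their names */
  disk_iter_init(&diter);
  while ((disk = disk_iter_next(&diter)))
          pr_info("disk %s\n", disk->disk_name);
  disk_iter_exit(&diter);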

diff --git a/block/genhd.c b/block/genhd.c
index 83e7c04..7c811ff 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -40,6 +40,47 @@ static void disk_del_events(struct gendisk *disk);
 static void disk_release_events(struct gendisk *disk);
 
 /**
+ * disk_iter_init - initialize disk iterator
+ * @diter: iterator to initialize
+ *
+ * Initialize @diter so that it iterates over all disks
+ */
+void disk_iter_init(struct disk_iter *diter)
+{
+	class_dev_iter_init(&diter->cdev_iter, &block_class, NULL, &disk_type);
+}
+EXPORT_SYMBOL_GPL(disk_iter_init);
+
+/**
+ * disk_iter_next - proceed iterator to the next disk and return it
+ * @diter: iterator to proceed
+ *
+ * Proceed @diter to the next disk and return it.
+ */
+struct gendisk *disk_iter_next(struct disk_iter *diter)
+{
+	struct device *dev;
+
+	dev = class_dev_iter_next(&diter->cdev_iter);
+	if (dev)
+		return dev_to_disk(dev);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(disk_iter_next);
+
+/**
+ * disk_iter_exit - finish up disk iteration
+ * @diter: iter to exit
+ *
+ * Called when iteration is over.  Cleans up @diter.
+ */
+void disk_iter_exit(struct disk_iter *diter)
+{
+	class_dev_iter_exit(&diter->cdev_iter);
+}
+EXPORT_SYMBOL_GPL(disk_iter_exit);
+
+/**
  * disk_get_part - get partition
  * @disk: disk to look partition from
  * @partno: partition number
@@ -730,12 +771,11 @@ EXPORT_SYMBOL(bdget_disk);
  */
 void __init printk_all_partitions(void)
 {
-	struct class_dev_iter iter;
-	struct device *dev;
+	struct disk_iter diter;
+	struct gendisk *disk;
 
-	class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
-	while ((dev = class_dev_iter_next(&iter))) {
-		struct gendisk *disk = dev_to_disk(dev);
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter))) {
 		struct disk_part_iter piter;
 		struct hd_struct *part;
 		char name_buf[BDEVNAME_SIZE];
@@ -779,7 +819,7 @@ void __init printk_all_partitions(void)
 		}
 		disk_part_iter_exit(&piter);
 	}
-	class_dev_iter_exit(&iter);
+	disk_iter_exit(&diter);
 }
 
 #ifdef CONFIG_PROC_FS
@@ -787,44 +827,38 @@ void __init printk_all_partitions(void)
 static void *disk_seqf_start(struct seq_file *seqf, loff_t *pos)
 {
 	loff_t skip = *pos;
-	struct class_dev_iter *iter;
-	struct device *dev;
+	struct disk_iter *diter;
+	struct gendisk *disk;
 
-	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
-	if (!iter)
+	diter = kmalloc(sizeof(*diter), GFP_KERNEL);
+	if (!diter)
 		return ERR_PTR(-ENOMEM);
 
-	seqf->private = iter;
-	class_dev_iter_init(iter, &block_class, NULL, &disk_type);
+	seqf->private = diter;
+	disk_iter_init(diter);
 	do {
-		dev = class_dev_iter_next(iter);
-		if (!dev)
+		disk = disk_iter_next(diter);
+		if (!disk)
 			return NULL;
 	} while (skip--);
 
-	return dev_to_disk(dev);
+	return disk;
 }
 
 static void *disk_seqf_next(struct seq_file *seqf, void *v, loff_t *pos)
 {
-	struct device *dev;
-
 	(*pos)++;
-	dev = class_dev_iter_next(seqf->private);
-	if (dev)
-		return dev_to_disk(dev);
-
-	return NULL;
+	return disk_iter_next(seqf->private);
 }
 
 static void disk_seqf_stop(struct seq_file *seqf, void *v)
 {
-	struct class_dev_iter *iter = seqf->private;
+	struct disk_iter *diter = seqf->private;
 
 	/* stop is called even after start failed :-( */
-	if (iter) {
-		class_dev_iter_exit(iter);
-		kfree(iter);
+	if (diter) {
+		disk_iter_exit(diter);
+		kfree(diter);
 	}
 }
 
@@ -1206,12 +1240,12 @@ module_init(proc_genhd_init);
 dev_t blk_lookup_devt(const char *name, int partno)
 {
 	dev_t devt = MKDEV(0, 0);
-	struct class_dev_iter iter;
-	struct device *dev;
+	struct disk_iter diter;
+	struct gendisk *disk;
 
-	class_dev_iter_init(&iter, &block_class, NULL, &disk_type);
-	while ((dev = class_dev_iter_next(&iter))) {
-		struct gendisk *disk = dev_to_disk(dev);
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter))) {
+		struct device *dev = disk_to_dev(disk);
 		struct hd_struct *part;
 
 		if (strcmp(dev_name(dev), name))
@@ -1233,7 +1267,7 @@ dev_t blk_lookup_devt(const char *name, int partno)
 		}
 		disk_put_part(part);
 	}
-	class_dev_iter_exit(&iter);
+	disk_iter_exit(&diter);
 	return devt;
 }
 EXPORT_SYMBOL(blk_lookup_devt);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index fe23ee7..9d0e0b5 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -260,8 +260,16 @@ static inline void disk_put_part(struct hd_struct *part)
 }
 
 /*
- * Smarter partition iterator without context limits.
+ * Smarter disk and partition iterators without context limits.
  */
+struct disk_iter {
+	struct class_dev_iter	cdev_iter;
+};
+
+extern void disk_iter_init(struct disk_iter *diter);
+extern struct gendisk *disk_iter_next(struct disk_iter *diter);
+extern void disk_iter_exit(struct disk_iter *diter);
+
 #define DISK_PITER_REVERSE	(1 << 0) /* iterate in the reverse direction */
 #define DISK_PITER_INCL_EMPTY	(1 << 1) /* include 0-sized parts */
 #define DISK_PITER_INCL_PART0	(1 << 2) /* include partition 0 */
-- 
1.7.3.1



* [PATCH 2/9] block: block_bio_complete tracepoint was missing
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
  2012-01-10 18:28 ` [PATCH 1/9] block: abstract disk iteration into disk_iter Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-11 17:25   ` Steven Rostedt
  2012-01-10 18:28 ` [PATCH 3/9] block: add @req to bio_{front|back}_merge tracepoints Tejun Heo
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

The block_bio_complete tracepoint was defined but not invoked anywhere.
Fix it.

-tj: This will generate duplicate BLK_TA_COMPLETEs.  Namhyung is
     working on a proper solution.

DO_NOT_APPLY
Cc: Namhyung Kim <namhyung@gmail.com>
---
 fs/bio.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
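
For illustration only (not part of this patch), a tracer could attach
a probe to the now-functional tracepoint; the probe signature mirrors
the event's TP_PROTO plus the leading private data pointer:

  /* illustrative probe; registration and teardown omitted */
  static void probe_bio_complete(void *ignore, struct request_queue *q,
                                 struct bio *bio, int error)
  {
          if (error)
                  pr_debug("bio %llu+%u completed with error %d\n",
                           (unsigned long long)bio->bi_sector,
                           bio->bi_size >> 9, error);
  }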

diff --git a/fs/bio.c b/fs/bio.c
index b1fe82c..96548da 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1447,6 +1447,9 @@ void bio_endio(struct bio *bio, int error)
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
 		error = -EIO;
 
+	if (bio->bi_bdev)
+		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
+					 bio, error);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio, error);
 }
-- 
1.7.3.1



* [PATCH 3/9] block: add @req to bio_{front|back}_merge tracepoints
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
  2012-01-10 18:28 ` [PATCH 1/9] block: abstract disk iteration into disk_iter Tejun Heo
  2012-01-10 18:28 ` [PATCH 2/9] block: block_bio_complete tracepoint was missing Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-10 18:28 ` [PATCH 4/9] writeback: move struct wb_writeback_work to writeback.h Tejun Heo
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

The bio_{front|back}_merge tracepoints report a bio merging into an
existing request but don't specify which request the bio is being
merged into.  Add @req to them.  This makes it impossible to share the
event template with block_bio_queue - split it out.

@req isn't used or exported to userland at this point and there is no
userland-visible behavior change.  Later changes will make use of the
extra parameter.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c             |    4 +-
 include/trace/events/block.h |   45 +++++++++++++++++++++++++++++++----------
 kernel/trace/blktrace.c      |    2 +
 3 files changed, 38 insertions(+), 13 deletions(-)
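
For illustration only (not part of this patch), a probe hooking the
extended tracepoint uses the new three-argument signature and can
inspect the request being merged into:

  /* illustrative probe; registration omitted */
  static void probe_bio_backmerge(void *ignore, struct request_queue *q,
                                  struct request *rq, struct bio *bio)
  {
          /* @rq is the request @bio is being merged into */
          pr_debug("backmerge into request at sector %llu\n",
                   (unsigned long long)blk_rq_pos(rq));
  }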

diff --git a/block/blk-core.c b/block/blk-core.c
index 15de223..dd45d6e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1179,7 +1179,7 @@ static bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 	if (!ll_back_merge_fn(q, req, bio))
 		return false;
 
-	trace_block_bio_backmerge(q, bio);
+	trace_block_bio_backmerge(q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
@@ -1202,7 +1202,7 @@ static bool bio_attempt_front_merge(struct request_queue *q,
 	if (!ll_front_merge_fn(q, req, bio))
 		return false;
 
-	trace_block_bio_frontmerge(q, bio);
+	trace_block_bio_frontmerge(q, req, bio);
 
 	if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
 		blk_rq_set_mixed_merge(req);
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 05c5e61..983f8a8 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -241,11 +241,11 @@ TRACE_EVENT(block_bio_complete,
 		  __entry->nr_sector, __entry->error)
 );
 
-DECLARE_EVENT_CLASS(block_bio,
+DECLARE_EVENT_CLASS(block_bio_merge,
 
-	TP_PROTO(struct request_queue *q, struct bio *bio),
+	TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),
 
-	TP_ARGS(q, bio),
+	TP_ARGS(q, rq, bio),
 
 	TP_STRUCT__entry(
 		__field( dev_t,		dev			)
@@ -272,31 +272,33 @@ DECLARE_EVENT_CLASS(block_bio,
 /**
  * block_bio_backmerge - merging block operation to the end of an existing operation
  * @q: queue holding operation
+ * @rq: request bio is being merged into
  * @bio: new block operation to merge
  *
  * Merging block request @bio to the end of an existing block request
  * in queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_backmerge,
+DEFINE_EVENT(block_bio_merge, block_bio_backmerge,
 
-	TP_PROTO(struct request_queue *q, struct bio *bio),
+	TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),
 
-	TP_ARGS(q, bio)
+	TP_ARGS(q, rq, bio)
 );
 
 /**
  * block_bio_frontmerge - merging block operation to the beginning of an existing operation
  * @q: queue holding operation
+ * @rq: request bio is being merged into
  * @bio: new block operation to merge
  *
  * Merging block IO operation @bio to the beginning of an existing block
  * operation in queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_frontmerge,
+DEFINE_EVENT(block_bio_merge, block_bio_frontmerge,
 
-	TP_PROTO(struct request_queue *q, struct bio *bio),
+	TP_PROTO(struct request_queue *q, struct request *rq, struct bio *bio),
 
-	TP_ARGS(q, bio)
+	TP_ARGS(q, rq, bio)
 );
 
 /**
@@ -306,11 +308,32 @@ DEFINE_EVENT(block_bio, block_bio_frontmerge,
  *
  * About to place the block IO operation @bio into queue @q.
  */
-DEFINE_EVENT(block_bio, block_bio_queue,
+TRACE_EVENT(block_bio_queue,
 
 	TP_PROTO(struct request_queue *q, struct bio *bio),
 
-	TP_ARGS(q, bio)
+	TP_ARGS(q, bio),
+
+	TP_STRUCT__entry(
+		__field( dev_t,		dev			)
+		__field( sector_t,	sector			)
+		__field( unsigned int,	nr_sector		)
+		__array( char,		rwbs,	RWBS_LEN	)
+		__array( char,		comm,	TASK_COMM_LEN	)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bio->bi_bdev->bd_dev;
+		__entry->sector		= bio->bi_sector;
+		__entry->nr_sector	= bio->bi_size >> 9;
+		blk_fill_rwbs(__entry->rwbs, bio->bi_rw, bio->bi_size);
+		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+	),
+
+	TP_printk("%d,%d %s %llu + %u [%s]",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->rwbs,
+		  (unsigned long long)__entry->sector,
+		  __entry->nr_sector, __entry->comm)
 );
 
 DECLARE_EVENT_CLASS(block_get_rq,
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index cdea7b5..c1c8c97 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -797,6 +797,7 @@ static void blk_add_trace_bio_complete(void *ignore,
 
 static void blk_add_trace_bio_backmerge(void *ignore,
 					struct request_queue *q,
+					struct request *rq,
 					struct bio *bio)
 {
 	blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
@@ -804,6 +805,7 @@ static void blk_add_trace_bio_backmerge(void *ignore,
 
 static void blk_add_trace_bio_frontmerge(void *ignore,
 					 struct request_queue *q,
+					 struct request *rq,
 					 struct bio *bio)
 {
 	blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
-- 
1.7.3.1



* [PATCH 4/9] writeback: move struct wb_writeback_work to writeback.h
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (2 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 3/9] block: add @req to bio_{front|back}_merge tracepoints Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-10 18:28 ` [PATCH 5/9] writeback: add more tracepoints Tejun Heo
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

Move the definition of struct wb_writeback_work from fs/fs-writeback.c
to include/linux/writeback.h.  This allows writeback tracepoint probes
which live outside fs-writeback.c to access its fields.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c         |   18 ------------------
 include/linux/writeback.h |   18 ++++++++++++++++++
 2 files changed, 18 insertions(+), 18 deletions(-)
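
For illustration only (not part of this patch), with the definition
visible via <linux/writeback.h>, a probe attached to one of the
writeback_work_class events (e.g. writeback_queue) can now inspect the
work item:

  /* illustrative external probe; registration omitted */
  static void probe_writeback_queue(void *ignore,
                                    struct backing_dev_info *bdi,
                                    struct wb_writeback_work *work)
  {
          pr_debug("writeback queued: nr_pages=%ld reason=%d\n",
                   work->nr_pages, (int)work->reason);
  }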

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e295150..a97cb49 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -29,24 +29,6 @@
 #include "internal.h"
 
 /*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
-/*
  * Include the creation of the trace points after defining the
  * wb_writeback_work structure so that the definition remains local to this
  * file.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a378c29..10d22d1 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -82,6 +82,24 @@ struct writeback_control {
 };
 
 /*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+	long nr_pages;
+	struct super_block *sb;
+	unsigned long *older_than_this;
+	enum writeback_sync_modes sync_mode;
+	unsigned int tagged_writepages:1;
+	unsigned int for_kupdate:1;
+	unsigned int range_cyclic:1;
+	unsigned int for_background:1;
+	enum wb_reason reason;		/* why was writeback initiated? */
+
+	struct list_head list;		/* pending work list */
+	struct completion *done;	/* set if the caller waits */
+};
+
+/*
  * fs/fs-writeback.c
  */	
 struct bdi_writeback;
-- 
1.7.3.1



* [PATCH 5/9] writeback: add more tracepoints
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (3 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 4/9] writeback: move struct wb_writeback_work to writeback.h Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-10 18:28 ` [PATCH 6/9] block: add block_touch_buffer tracepoint Tejun Heo
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

Add tracepoints for page dirtying, writeback_single_inode start, inode
dirtying and writeback.  For the latter two inode events, a pair of
events is defined to denote the start and end of the operations (the
starting one has the _start suffix and the one w/o suffix fires after
the operation is complete).  These inode ops are FS-specific and can be
non-trivial, so having enclosing tracepoints is useful for external
tracers.

This is part of the tracepoint additions to improve visibility into
dirtying / writeback operations for the IO tracer and userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c                |   16 +++++-
 include/trace/events/writeback.h |  113 ++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |    2 +
 3 files changed, 129 insertions(+), 2 deletions(-)
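
For illustration only (not part of this patch), an external tracer can
pair the *_start and plain events to time the FS operation; a rough
per-cpu sketch (per-task state would be more precise):

  static DEFINE_PER_CPU(u64, dirty_inode_start_ns);

  static void probe_dirty_inode_start(void *ignore, struct inode *inode,
                                      int flags)
  {
          __this_cpu_write(dirty_inode_start_ns, local_clock());
  }

  static void probe_dirty_inode(void *ignore, struct inode *inode, int flags)
  {
          u64 delta = local_clock() - __this_cpu_read(dirty_inode_start_ns);

          pr_debug("->dirty_inode(%lu) took %llu ns\n",
                   inode->i_ino, (unsigned long long)delta);
  }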

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a97cb49..ace4a45 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -298,8 +298,14 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode))
-		return inode->i_sb->s_op->write_inode(inode, wbc);
+	int ret;
+
+	if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode)) {
+		trace_writeback_write_inode_start(inode, wbc);
+		ret = inode->i_sb->s_op->write_inode(inode, wbc);
+		trace_writeback_write_inode(inode, wbc);
+		return ret;
+	}
 	return 0;
 }
 
@@ -380,6 +386,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 
+	trace_writeback_single_inode_start(inode, wbc, nr_to_write);
+
 	ret = do_writepages(mapping, wbc);
 
 	/*
@@ -1037,8 +1045,12 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	 * dirty the inode itself
 	 */
 	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+		trace_writeback_dirty_inode_start(inode, flags);
+
 		if (sb->s_op->dirty_inode)
 			sb->s_op->dirty_inode(inode, flags);
+
+		trace_writeback_dirty_inode(inode, flags);
 	}
 
 	/*
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 99d1d0d..c8fc9d9 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -33,6 +33,112 @@
 
 struct wb_writeback_work;
 
+TRACE_EVENT(writeback_dirty_page,
+
+	TP_PROTO(struct page *page, struct address_space *mapping),
+
+	TP_ARGS(page, mapping),
+
+	TP_STRUCT__entry (
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(pgoff_t, index)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			mapping ? dev_name(mapping->backing_dev_info->dev) : "(unknown)", 32);
+		__entry->ino = mapping ? mapping->host->i_ino : 0;
+		__entry->index = page->index;
+	),
+
+	TP_printk("bdi %s: ino=%lu index=%lu",
+		__entry->name,
+		__entry->ino,
+		__entry->index
+	)
+);
+
+DECLARE_EVENT_CLASS(writeback_dirty_inode_template,
+
+	TP_PROTO(struct inode *inode, int flags),
+
+	TP_ARGS(inode, flags),
+
+	TP_STRUCT__entry (
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, flags)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->flags		= flags;
+	),
+
+	TP_printk("bdi %s: ino=%lu flags=%s",
+		__entry->name,
+		__entry->ino,
+		show_inode_state(__entry->flags)
+	)
+);
+
+DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode_start,
+
+	TP_PROTO(struct inode *inode, int flags),
+
+	TP_ARGS(inode, flags)
+);
+
+DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode,
+
+	TP_PROTO(struct inode *inode, int flags),
+
+	TP_ARGS(inode, flags)
+);
+
+DECLARE_EVENT_CLASS(writeback_write_inode_template,
+
+	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc),
+
+	TP_STRUCT__entry (
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(int, sync_mode)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->sync_mode	= wbc->sync_mode;
+	),
+
+	TP_printk("bdi %s: ino=%lu mode=%d",
+		__entry->name,
+		__entry->ino,
+		__entry->sync_mode
+	)
+);
+
+DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode_start,
+
+	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc)
+);
+
+DEFINE_EVENT(writeback_write_inode_template, writeback_write_inode,
+
+	TP_PROTO(struct inode *inode, struct writeback_control *wbc),
+
+	TP_ARGS(inode, wbc)
+);
+
 DECLARE_EVENT_CLASS(writeback_work_class,
 	TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
 	TP_ARGS(bdi, work),
@@ -447,6 +553,13 @@ DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
 	TP_ARGS(inode, wbc, nr_to_write)
 );
 
+DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_start,
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long nr_to_write),
+	TP_ARGS(inode, wbc, nr_to_write)
+);
+
 DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
 	TP_PROTO(struct inode *inode,
 		 struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8616ef3..4cf8c6d9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1735,6 +1735,8 @@ int __set_page_dirty_no_writeback(struct page *page)
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
+	trace_writeback_dirty_page(page, mapping);
+
 	if (mapping_cap_account_dirty(mapping)) {
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
-- 
1.7.3.1



* [PATCH 6/9] block: add block_touch_buffer tracepoint
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (4 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 5/9] writeback: add more tracepoints Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-11 17:42   ` Steven Rostedt
  2012-01-10 18:28 ` [PATCH 7/9] vfs: add fcheck tracepoint Tejun Heo
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

Add a block_touch_buffer tracepoint which gets triggered on
touch_buffer().

Because touch_buffer() is defined as a macro in linux/buffer_head.h,
this creates a circular dependency between linux/buffer_head.h and
events/block.h.  As the event header needs buffer_head details only
when the tracepoints are actually created (CREATE_TRACE_POINTS is
defined), this can easily be solved by including buffer_head.h before
setting CREATE_TRACE_POINTS and including the event header to create
the tracepoints.

This is part of the tracepoint additions to improve visibility into
dirtying / writeback operations for the IO tracer and userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c             |    1 +
 include/linux/buffer_head.h  |    7 ++++++-
 include/trace/events/block.h |   25 +++++++++++++++++++++++++
 3 files changed, 32 insertions(+), 1 deletions(-)
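
For illustration only (not part of this patch), a probe on the new
tracepoint sees the buffer_head being touched; it should stay cheap as
touch_buffer() is a hot path:

  /* illustrative probe; registration omitted */
  static void probe_touch_buffer(void *ignore, struct buffer_head *bh)
  {
          pr_debug("touch_buffer: %u:%u block %llu size %zu\n",
                   MAJOR(bh->b_bdev->bd_dev), MINOR(bh->b_bdev->bd_dev),
                   (unsigned long long)bh->b_blocknr, bh->b_size);
  }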

diff --git a/block/blk-core.c b/block/blk-core.c
index dd45d6e..8f59db3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -29,6 +29,7 @@
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
 #include <linux/delay.h>
+#include <linux/buffer_head.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..245caed 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -13,6 +13,7 @@
 #include <linux/pagemap.h>
 #include <linux/wait.h>
 #include <linux/atomic.h>
+#include <trace/events/block.h>
 
 #ifdef CONFIG_BLOCK
 
@@ -126,7 +127,11 @@ BUFFER_FNS(Write_EIO, write_io_error)
 BUFFER_FNS(Unwritten, unwritten)
 
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
-#define touch_buffer(bh)	mark_page_accessed(bh->b_page)
+
+#define touch_buffer(bh)	do {				\
+		trace_block_touch_buffer(bh);			\
+		mark_page_accessed(bh->b_page);			\
+	} while (0)
 
 /* If we *know* page->private refers to buffer_heads */
 #define page_buffers(page)					\
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 983f8a8..4fcc09d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -6,10 +6,35 @@
 
 #include <linux/blktrace_api.h>
 #include <linux/blkdev.h>
+#include <linux/buffer_head.h>
 #include <linux/tracepoint.h>
 
 #define RWBS_LEN	8
 
+TRACE_EVENT(block_touch_buffer,
+
+	TP_PROTO(struct buffer_head *bh),
+
+	TP_ARGS(bh),
+
+	TP_STRUCT__entry (
+		__field(  dev_t,	dev			)
+		__field(  sector_t,	sector			)
+		__field(  size_t,	size			)
+	),
+
+	TP_fast_assign(
+		__entry->dev		= bh->b_bdev->bd_dev;
+		__entry->sector		= bh->b_blocknr;
+		__entry->size		= bh->b_size;
+	),
+
+	TP_printk("%d,%d get_bh sector=%llu size=%zu",
+		MAJOR(__entry->dev), MINOR(__entry->dev),
+		(unsigned long long)__entry->sector, __entry->size
+	)
+);
+
 DECLARE_EVENT_CLASS(block_rq_with_error,
 
 	TP_PROTO(struct request_queue *q, struct request *rq),
-- 
1.7.3.1



* [PATCH 7/9] vfs: add fcheck tracepoint
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (5 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 6/9] block: add block_touch_buffer tracepoint Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-10 18:28 ` [PATCH 8/9] stacktrace: implement save_stack_trace_quick() Tejun Heo
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

All file accesses from userland go through fcheck to map an fd to a
struct file, making it a very good location for peeking at which files
userland is accessing.  Add a tracepoint there.

This is part of the tracepoint additions to improve visibility into
dirtying / writeback operations for the IO tracer and userland.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/super.c                 |    2 ++
 include/linux/fdtable.h    |    3 +++
 include/trace/events/vfs.h |   40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/vfs.h
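
For illustration only (not part of this patch), a probe on the new
tracepoint can peek at the fd-to-file mapping; since fcheck_files()
runs under RCU and is extremely hot, a real probe must stay cheap:

  /* illustrative probe; @file may be NULL if the fd isn't in use */
  static void probe_vfs_fcheck(void *ignore, struct files_struct *files,
                               unsigned int fd, struct file *file)
  {
          if (file)
                  pr_debug("fd %u -> ino %lu\n", fd,
                           file->f_path.dentry->d_inode->i_ino);
  }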

diff --git a/fs/super.c b/fs/super.c
index de41e1e..3055f32 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,8 @@
 #include <linux/cleancache.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vfs.h>
 
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 82163c4..72df04b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <trace/events/vfs.h>
 
 #include <linux/atomic.h>
 
@@ -87,6 +88,8 @@ static inline struct file * fcheck_files(struct files_struct *files, unsigned in
 
 	if (fd < fdt->max_fds)
 		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
+
+	trace_vfs_fcheck(files, fd, file);
 	return file;
 }
 
diff --git a/include/trace/events/vfs.h b/include/trace/events/vfs.h
new file mode 100644
index 0000000..9a9bae4
--- /dev/null
+++ b/include/trace/events/vfs.h
@@ -0,0 +1,40 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vfs
+
+#if !defined(_TRACE_VFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VFS_H
+
+#include <linux/tracepoint.h>
+#include <linux/fs.h>
+
+TRACE_EVENT(vfs_fcheck,
+
+	TP_PROTO(struct files_struct *files, unsigned int fd,
+		 struct file *file),
+
+	TP_ARGS(files, fd, file),
+
+	TP_STRUCT__entry(
+		__field(unsigned int,	fd)
+		__field(umode_t,	mode)
+		__field(dev_t,		dev)
+		__field(ino_t,		ino)
+	),
+
+	TP_fast_assign(
+		__entry->fd = fd;
+		__entry->mode = file ? file->f_path.dentry->d_inode->i_mode : 0;
+		__entry->dev = file ? file->f_path.dentry->d_inode->i_sb->s_dev : 0;
+		__entry->ino = file ? file->f_path.dentry->d_inode->i_ino : 0;
+	),
+
+	TP_printk("fd %u mode 0x%x dev %d,%d ino %lu",
+		  __entry->fd, __entry->mode,
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  (unsigned long)__entry->ino)
+);
+
+#endif /* _TRACE_VFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.3.1



* [PATCH 8/9] stacktrace: implement save_stack_trace_quick()
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (6 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 7/9] vfs: add fcheck tracepoint Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-11 16:26   ` Frederic Weisbecker
  2012-01-10 18:28 ` [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking Tejun Heo
  2012-01-11 14:40 ` [RFC PATCHSET take#2] ioblame: " Frederic Weisbecker
  9 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo, H. Peter Anvin

Implement save_stack_trace_quick(), which only considers the usual
contexts (ie. thread and irq) and doesn't handle links between
different contexts - if %current is in irq context, only the backtrace
in the irq stack is considered.

This is a subset of dump_trace() done in a much simpler way.  It's
intended to be used in hot paths where the overhead of dump_trace()
can be too heavy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/include/asm/stacktrace.h |    2 +
 arch/x86/kernel/stacktrace.c      |   40 +++++++++++++++++++++++++++++++++++++
 include/linux/stacktrace.h        |    6 +++++
 kernel/stacktrace.c               |    6 +++++
 4 files changed, 54 insertions(+), 0 deletions(-)
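
For illustration only (not part of this patch), a hot-path caller fills
a struct stack_trace the same way as with save_stack_trace(), just with
lower overhead:

  /* illustrative caller */
  unsigned long entries[16];
  struct stack_trace trace = {
          .entries        = entries,
          .max_entries    = ARRAY_SIZE(entries),
          .skip           = 1,    /* skip the immediate caller */
  };

  save_stack_trace_quick(&trace);
  /*
   * entries[0 .. trace.nr_entries - 1] now holds return addresses,
   * terminated by ULONG_MAX if there was room left.
   */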

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 70bbe39..06bbdfc 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -50,9 +50,11 @@ void dump_trace(struct task_struct *tsk, struct pt_regs *regs,
 #ifdef CONFIG_X86_32
 #define STACKSLOTS_PER_LINE 8
 #define get_bp(bp) asm("movl %%ebp, %0" : "=r" (bp) :)
+#define get_irq_stack_end()	0
 #else
 #define STACKSLOTS_PER_LINE 4
 #define get_bp(bp) asm("movq %%rbp, %0" : "=r" (bp) :)
+#define get_irq_stack_end()	(unsigned long)this_cpu_read(irq_stack_ptr)
 #endif
 
 #ifdef CONFIG_FRAME_POINTER
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index fdd0c64..f53ec547 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -81,6 +81,46 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
+#ifdef CONFIG_FRAME_POINTER
+void save_stack_trace_quick(struct stack_trace *trace)
+{
+	const unsigned long stk_sz = THREAD_SIZE - sizeof(struct stack_frame);
+	unsigned long tstk = (unsigned long)current_thread_info();
+	unsigned long istk = get_irq_stack_end();
+	unsigned long last_bp = 0;
+	unsigned long bp, stk;
+
+	get_bp(bp);
+
+	if (bp > tstk && bp <= tstk + stk_sz)
+		stk = tstk;
+	else if (istk && (bp > istk && bp <= stk_sz))
+		stk = istk;
+	else
+		goto out;
+
+	while (bp > last_bp && bp <= stk + stk_sz) {
+		struct stack_frame *frame = (struct stack_frame *)bp;
+		unsigned long ret_addr = frame->return_address;
+
+		if (!trace->skip) {
+			if (trace->nr_entries >= trace->max_entries)
+				return;
+			trace->entries[trace->nr_entries++] = ret_addr;
+		} else {
+			trace->skip--;
+		}
+
+		last_bp = bp;
+		bp = (unsigned long)frame->next_frame;
+	}
+out:
+	if (trace->nr_entries < trace->max_entries)
+		trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
+EXPORT_SYMBOL_GPL(save_stack_trace_quick);
+#endif
+
 /* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
 
 struct stack_frame_user {
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 115b570..d5b16c4 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -19,6 +19,12 @@ extern void save_stack_trace_regs(struct pt_regs *regs,
 extern void save_stack_trace_tsk(struct task_struct *tsk,
 				struct stack_trace *trace);
 
+/*
+ * Saves only trace from the current context.  Doesn't handle exception
+ * stacks or verify text address.
+ */
+extern void save_stack_trace_quick(struct stack_trace *trace);
+
 extern void print_stack_trace(struct stack_trace *trace, int spaces);
 
 #ifdef CONFIG_USER_STACKTRACE_SUPPORT
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 00fe55c..4760949 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -31,6 +31,12 @@ EXPORT_SYMBOL_GPL(print_stack_trace);
  * (whenever this facility is utilized - for example by procfs):
  */
 __weak void
+save_stack_trace_quick(struct stack_trace *trace)
+{
+	WARN_ONCE(1, KERN_INFO "save_stack_trace_quick() not implemented yet.\n");
+}
+
+__weak void
 save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 {
 	WARN_ONCE(1, KERN_INFO "save_stack_trace_tsk() not implemented yet.\n");
-- 
1.7.3.1



* [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (7 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 8/9] stacktrace: implement save_stack_trace_quick() Tejun Heo
@ 2012-01-10 18:28 ` Tejun Heo
  2012-01-11  0:25   ` Chanho Park
  2012-01-11  1:32   ` [PATCH RESEND " Tejun Heo
  2012-01-11 14:40 ` [RFC PATCHSET take#2] ioblame: " Frederic Weisbecker
  9 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-10 18:28 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Tejun Heo

Implement ioblame, which can attribute each IO to its origin and
export the information using a tracepoint.

Operations which may eventually cause IOs and IO operations themselves
are identified and tracked primarily by their stack traces along with
the task and the target file (dev:ino:gen).  On each IO completion,
ioblame knows why that specific IO happened and exports the
information via ioblame:ioblame_io tracepoint.

While ioblame adds fields to a few fs and block layer objects, all
logic is well insulated inside ioblame proper; all hooking goes
through well-defined tracepoints and doesn't add any significant
maintenance overhead.

For details, please read Documentation/trace/ioblame.txt.

-v2: Namhyung pointed out that all the information available at IO
     completion can be exported via tracepoint and letting userland do
     whatever it wants to do with that would be better.  Stripped out
     in-kernel statistics gathering.

     Now that everything is exported through the tracepoint, iolog and
     counters_pipe[_pipe] are unnecessary and have been removed, as has
     intents_bin.

     As data collection no longer requires polling, ioblame/intents is
     updated to generate an inotify IN_MODIFY event after a new intent
     is created.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Justin TerAvest <teravest@google.com>
Cc: Slava Pestov <slavapestov@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Jim Winget <winget@google.com>
---
 Documentation/trace/ioblame.txt |  476 ++++++++
 include/linux/blk_types.h       |    4 +
 include/linux/fs.h              |    3 +
 include/linux/genhd.h           |    3 +
 include/linux/ioblame.h         |   72 ++
 kernel/trace/Kconfig            |   12 +
 kernel/trace/Makefile           |    1 +
 kernel/trace/ioblame.c          | 2279 +++++++++++++++++++++++++++++++++++++++
 8 files changed, 2850 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/ioblame.txt
 create mode 100644 include/linux/ioblame.h
 create mode 100644 kernel/trace/ioblame.c
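
ioblame.txt below notes that dev:ino:gen can be mapped back to a path
via open_by_handle_at(2).  A rough userland sketch of that recipe
(illustrative only - it assumes a filesystem using the common
FILEID_INO32_GEN handle layout such as ext4, with ino, gen and mount_fd
supplied by the caller, and error handling omitted):

  struct file_handle *fh = malloc(sizeof(*fh) + 2 * sizeof(unsigned int));
  unsigned int *f = (unsigned int *)fh->f_handle;
  char proc[64], path[PATH_MAX + 1];
  ssize_t len;
  int fd;

  fh->handle_bytes = 2 * sizeof(unsigned int);
  fh->handle_type = 1;                    /* FILEID_INO32_GEN */
  f[0] = ino;                             /* ino from the ioblame_io event */
  f[1] = gen;                             /* gen from the ioblame_io event */

  fd = open_by_handle_at(mount_fd, fh, O_RDONLY); /* mount_fd: any fd on the FS */
  snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
  len = readlink(proc, path, PATH_MAX);
  path[len > 0 ? len : 0] = '\0';         /* readlink() doesn't NUL-terminate */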

diff --git a/Documentation/trace/ioblame.txt b/Documentation/trace/ioblame.txt
new file mode 100644
index 0000000..cd72f29
--- /dev/null
+++ b/Documentation/trace/ioblame.txt
@@ -0,0 +1,476 @@
+
+ioblame - IO tracer with origin tracking
+
+December, 2011		Tejun Heo <tj@kernel.org>
+
+
+CONTENTS
+
+1. Introduction
+2. Overall design
+3. Debugfs interface
+3-1. Configuration
+3-2. Stats and intents
+4. Trace examples
+5. Notes
+6. Overheads
+
+
+1. Introduction
+
+In many workloads, IO throughput and latency have a large effect on
+overall performance; however, due to the complexity and asynchronous
+nature, it is very difficult to characterize what's going on.
+blktrace and various tracepoints provide visibility into individual IO
+operations but it is still extremely difficult to trace back to the
+origin of those IO operations.
+
+ioblame is an IO tracer which tracks the origin of each IO.  It keeps track
+of who dirtied pages and inodes, and, on an actual IO, attributes it
+to the originator of the IO.  All the information ioblame collects is
+exported via ioblame:ioblame_io tracepoint on each IO completion.
+
+The design goals of ioblame are
+
+* Minimally invasive - Tracer shouldn't be invasive.  Except for
+  adding some tracking fields, mostly to block layer data
+  structures, ioblame gathers all information through well defined
+  tracepoints and all tracking logic is contained in ioblame proper.
+
+* Generic and detailed - There are many different IO paths and
+  filesystems which also go through changes regularly.  Tracer should
+  be able to report detailed enough result covering most cases without
+  requiring frequent adaptation.  ioblame uses stack traces at key
+  points, combined with information from generic layers, to categorize
+  IOs.  This gives detailed enough insight into varying IO paths without
+  requiring specific adaptations.
+
+* Low overhead - Overhead both in terms of memory and processor cycles
+  should be low enough so that the analyzer can be used in IO-heavy
+  production environments.  ioblame keeps hot data structures compact
+  and mostly read-only and avoids synchronization on hot paths by
+  using RCU and taking advantage of the fact that statistics don't
+  have to be completely accurate.
+
+
+2. Overall design
+
+ioblame tracks the following three object types.
+
+* Role: This tracks 'who' is taking an action.  Corresponds to a
+  thread.
+
+* Intent: Stack trace + modifier.  An intent groups actions of the
+  same type.  As the name suggests, modifier modifies the intent and
+  there can be multiple intents with the same stack trace but
+  different modifiers.  Currently, only writeback modifiers are
+  implemented which denote why the writeback action is occurring -
+  ie. wb_reason.
+
+* Act: This is combination of role, intent and the inode being
+  operated.  This is not visible to userland and used internally to
+  track dirtier and its intent in compact form.
+
+ioblame uses the same indexing data structure for all three types of
+objects.  Objects are never linked directly using pointers and every
+access goes through the index.  This allows avoiding expensive strict
+object lifetime management.  Objects are located either by their
+content via a hash table or by id, which contains a generation number.
+
+To attribute data writebacks to the originator, ioblame maintains a
+table indexed by page frame number which keeps track of which act
+dirtied which pages.  For each IO, the target pages are looked up in
+the table and the dirtying act is charged for the IO.  Note that,
+currently, each IO is charged as whole to a single act - e.g. all of
+an IO for writeback encompassing multiple dirtiers will be charged to
+the first found dirtying act.  This simplifies data collection and
+reporting while not losing too much information - writebacks tend to
+be naturally grouped and IOPS (IO operations per second) are often
+more significant than length of each IO.
+
+inode writeback tracking is more involved as different filesystems
+handle metadata updates and writebacks differently.  ioblame uses
+per-inode and buffer_head operation tracking to identify inode
+writebacks to the originator.
+
+On each IO completion, ioblame knows the offset and size of the IO,
+who's responsible and its intent, how long it took in the queue and
+the target file.  This information is reported via ioblame:ioblame_io
+tracepoint.
+
+Except for the tracepoint, all interactions happen using files under
+/sys/kernel/debug/ioblame/.
+
+
+3. Debugfs interface
+
+3-1. Configuration
+
+* enable			- can be changed anytime
+
+  Master enable.  Write [Yy1] to enable, [Nn0] to disable.
+
+* devs				- can be changed anytime
+
+  Specifies the devices ioblame is enabled for.  ioblame will only
+  track operations on devices which are explicitly enabled in this
+  file.
+
+  It accepts a whitespace-separated list of MAJ:MINs or block device
+  names with an optional preceding '!' for negation.  Opening with
+  O_TRUNC clears all existing entries.  For example,
+
+  $ echo sda sdb > devs		# disables all devices and then enable sd[ab]
+  $ echo sdc >> devs		# sd[abc] enabled
+  $ echo !8:0 >> devs		# sd[bc] enabled
+  $ cat devs
+  8:16 sdb
+  8:32 sdc
+
+* max_{role|intent|act}s	- can be changed while disabled
+
+  Specifies the maximum number of each object type.  If the number of
+  a certain object type exceeds the limit, IOs will be attributed to a
+  special NOMEM object.
+
+* ttl_secs			- can be changed anytime
+
+  Specifies TTL of roles and acts.  Roles are reclaimed after at least
+  TTL has passed after the matching thread has exited or execed and
+  assumed another tid.  Acts are reclaimed after being unused for at
+  least TTL.
+
+
+3-2. Stats and intents (read only)
+
+* nr_{roles|intents|acts}
+
+  Returns the number of objects of the type.  The number of roles and
+  acts can decrease after reclaiming but nr_intents only increases
+  while ioblame is enabled.
+
+* stats/idx_nomem
+
+  How many times role, intent or act creation failed because memory
+  allocation failed while extending the index to accommodate a new object.
+
+* stats/idx_nospc
+
+  How many times role, intent or act creation failed because limit
+  specified by {role|intent|act}_max is reached.
+
+* stats/node_nomem
+
+  How many times allocation of a role, intent or act node failed.
+
+* stats/pgtree_nomem
+
+  How many times the page tree, which maps page frame numbers to
+  dirtying acts, failed to expand due to memory allocation failure.
+
+* intents
+
+  Dump of intents.
+
+  $ cat intents
+  #0 modifier=0x0
+  #1 modifier=0x0
+  #2 modifier=0x0
+  [ffffffff81189a6a] file_update_time+0xca/0x150
+  [ffffffff81122030] __generic_file_aio_write+0x200/0x460
+  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
+  [ffffffff8122ea94] ext4_file_write+0x64/0x280
+  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
+  [ffffffff811b7401] aio_run_iocb+0x61/0x190
+  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
+  [ffffffff811b867b] sys_io_submit+0xb/0x10
+  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  #3 modifier=0x0
+  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
+  [ffffffff812353b2] ext4_direct_IO+0x1b2/0x3f0
+  [ffffffff81121d6a] generic_file_direct_write+0xba/0x180
+  [ffffffff8112210b] __generic_file_aio_write+0x2db/0x460
+  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
+  [ffffffff8122ea94] ext4_file_write+0x64/0x280
+  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
+  [ffffffff811b7401] aio_run_iocb+0x61/0x190
+  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
+  [ffffffff811b867b] sys_io_submit+0xb/0x10
+  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  #4 modifier=0x0
+  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
+  [ffffffff8126da71] ext4_ind_direct_IO+0x121/0x460
+  [ffffffff81235436] ext4_direct_IO+0x236/0x3f0
+  [ffffffff81122db2] generic_file_aio_read+0x6b2/0x740
+  ...
+
+  The #-prefixed number is the NR of the intent, used to link intents
+  from statistics.  The modifier and stack trace follow.  The first two
+  entries are special - 0 is the nomem intent, 1 the lost intent.  The
+  former is used when an intent can't be created because allocation
+  failed or intent_max is reached.  The latter is used when reclaiming
+  resulted in loss of tracking info and the IO can't be reported
+  exactly.
+
+  This file can be seeked by intent NR.  ie. seeking to 3 and reading
+  will return intent #3 and after.  Because intents are never
+  destroyed while ioblame is enabled, this allows userland tool to
+  discover new intents since last reading.  Seeking to the number of
+  currently known intents and reading returns only the newly created
+  intents.
+
+  At least one inotify IN_MODIFY event is generated after a new intent
+  is created.
+
+
+4. Trace examples
+
+All information ioblame gathers is available through
+ioblame:ioblame_io tracing event.  The outputs in the following
+examples are reformatted and annotated.
+
+4-1. ls, touch and sync - on an ext4 FS w/o journal
+
+- sector=69896 size=4096 rw=META|PRIO wait_nsec=45244 io_nsec=11263878
+  pid=952 intent=8 dev=8:17 ino=2 gen=0
+
+  pid 952 (ls) issues 4k META|PRIO read on /dev/sdb1's root directory
+  with intent 8 to read directory entries.
+
+  #8 modifier=0x0
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811c6b9b] submit_bh+0xeb/0x130
+  [ffffffff811c851e] ll_rw_block+0xae/0xb0
+  [ffffffff81265703] ext4_bread+0x43/0x80
+  [ffffffff8126b458] htree_dirblock_to_tree+0x38/0x190
+  [ffffffff8126b655] ext4_htree_fill_tree+0xa5/0x260
+  [ffffffff81259c76] ext4_readdir+0x116/0x5e0
+  [ffffffff811a7ec0] vfs_readdir+0xb0/0xd0
+  [ffffffff811a8049] sys_getdents+0x89/0xf0
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=4232 size=4096 rw= wait_nsec=69052 io_nsec=475710
+  pid=953 intent=14 dev=8:16 ino=0 gen=0
+
+  pid 953 (touch) issues 4k read with intent 14 during open(2).
+
+  #14 modifier=0x0
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811c6b9b] submit_bh+0xeb/0x130
+  [ffffffff811c8425] bh_submit_read+0x35/0x80
+  [ffffffff8125b29b] ext4_read_inode_bitmap+0x18b/0x3f0
+  [ffffffff8125bf85] ext4_new_inode+0x355/0x10b0
+  [ffffffff81269a7a] ext4_create+0x9a/0x120
+  [ffffffff811a366c] vfs_create+0x8c/0xe0
+  [ffffffff811a4616] do_last+0x776/0x8e0
+  [ffffffff811a4858] path_openat+0xd8/0x410
+  [ffffffff811a4ca9] do_filp_open+0x49/0xa0
+  [ffffffff811926a7] do_sys_open+0x107/0x1e0
+  [ffffffff811927c0] sys_open+0x20/0x30
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=4360 size=4096 rw=WRITE wait_nsec=28998035 io_nsec=768370
+  pid=953 intent=11 dev=8:17 ino=14 gen=3151897938
+
+  touch dirtied inode 14 and the following sync forces writeback.
+  The IO is attributed to the dirtier.  Note the non-zero modifier
+  indicating WB_REASON_SYNC.
+
+  #11 modifier=0x10000002
+  [ffffffff811c0710] __mark_inode_dirty+0x220/0x330
+  [ffffffff8125feeb] ext4_setattr+0x26b/0x4d0
+  [ffffffff811b0f2a] notify_change+0x10a/0x2b0
+  [ffffffff811c52de] utimes_common+0xde/0x190
+  [ffffffff811c5431] do_utimes+0xa1/0xf0
+  [ffffffff811c55a6] sys_utimensat+0x36/0xb0
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+
+4-2. copying a 1M file from another filesystem and waiting a bit
+
+- sector=2056 size=4096 rw=WRITE wait_nsec=151425 io_nsec=584466
+  pid=1004 intent=24 dev=8:16 ino=0 gen=0
+
+  flush-8:16 starting writeback w/ WB_REASON_BACKGROUND.  This
+  repeats a couple times.
+
+  #24 modifier=0x10000000
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811c6b9b] submit_bh+0xeb/0x130
+  [ffffffff811ca410] __block_write_full_page+0x210/0x3b0
+  [ffffffff811ca6a0] block_write_full_page_endio+0xf0/0x140
+  [ffffffff811ca705] block_write_full_page+0x15/0x20
+  [ffffffff811ce438] blkdev_writepage+0x18/0x20
+  [ffffffff81148f1a] __writepage+0x1a/0x50
+  [ffffffff81149ae6] write_cache_pages+0x206/0x4f0
+  [ffffffff81149e24] generic_writepages+0x54/0x80
+  [ffffffff81149e74] do_writepages+0x24/0x40
+  [ffffffff811bf301] writeback_single_inode+0x1a1/0x600
+  [ffffffff811c01db] writeback_sb_inodes+0x1ab/0x280
+  [ffffffff811c0b8e] __writeback_inodes_wb+0x9e/0xd0
+  [ffffffff811c0ea3] wb_writeback+0x243/0x3a0
+  [ffffffff811c115a] wb_do_writeback+0x15a/0x2b0
+  [ffffffff811c138a] bdi_writeback_thread+0xda/0x330
+  [ffffffff810bc286] kthread+0xb6/0xc0
+  [ffffffff81aadff4] kernel_thread_helper+0x4/0x10
+
+- sector=4360 size=4096 rw=WRITE wait_nsec=781396 io_nsec=894147
+  pid=1017 intent=25 dev=8:17 ino=12 gen=3151897939
+
+  Writeback got to inode 12 which was created and written to by cp.
+  This is inode writeback.
+
+  #25 modifier=0x10000000
+  [ffffffff811c0710] __mark_inode_dirty+0x220/0x330
+  [ffffffff811c7e5b] generic_write_end+0x6b/0xa0
+  [ffffffff8126191a] ext4_da_write_end+0xfa/0x350
+  [ffffffff8113f168] generic_file_buffered_write+0x188/0x2b0
+  [ffffffff81141608] __generic_file_aio_write+0x238/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=268288 size=524288 rw=WRITE wait_nsec=461543 io_nsec=3180190
+  pid=1017 intent=27 dev=8:17 ino=12 gen=3151897939
+
+  The first half of data.
+
+  #27 modifier=0x10000000
+  [ffffffff811c79cc] __set_page_dirty+0x4c/0xd0
+  [ffffffff811c7ab6] mark_buffer_dirty+0x66/0xa0
+  [ffffffff811c7b99] __block_commit_write+0xa9/0xe0
+  [ffffffff811c7da2] block_write_end+0x42/0x90
+  [ffffffff811c7e23] generic_write_end+0x33/0xa0
+  [ffffffff8126191a] ext4_da_write_end+0xfa/0x350
+  [ffffffff8113f168] generic_file_buffered_write+0x188/0x2b0
+  [ffffffff81141608] __generic_file_aio_write+0x238/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=269312 size=524288 rw=WRITE wait_nsec=364198 io_nsec=5667553
+  pid=1017 intent=27 dev=8:17 ino=12 gen=3151897939
+
+  And the second half.
+
+
+4-3. dd if=/dev/zero of=testfile bs=128k count=4 oflag=direct
+
+- sector=266496 size=131072 rw=WRITE|SYNC wait_nsec=48180 io_nsec=1066758
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  First chunk.
+
+  #34 modifier=0x0
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811d1c45] __blockdev_direct_IO+0x21b5/0x3830
+  [ffffffff8129a7a1] ext4_ind_direct_IO+0x121/0x470
+  [ffffffff812612ee] ext4_direct_IO+0x23e/0x400
+  [ffffffff81141308] generic_file_direct_write+0xc8/0x190
+  [ffffffff811416ab] __generic_file_aio_write+0x2db/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=266752 size=131072 rw=WRITE|SYNC wait_nsec=15155 io_nsec=1086987
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  Second.
+
+- sector=267008 size=131072 rw=WRITE|SYNC wait_nsec=22694 io_nsec=1092836
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  Third.
+
+- sector=267264 size=131072 rw=WRITE|SYNC wait_nsec=15852 io_nsec=1021868
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  Fourth.
+
+...
+
+- sector=4360 size=4096 rw=WRITE wait_nsec=1378342 io_nsec=828771
+  pid=1042 intent=35 dev=8:17 ino=12 gen=3151897940
+
+  After a while, inode is written back with WB_REASON_PERIODIC.
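+  The modifier 0x10000003 in intent #35 below decodes, per the intent
+  modifier encoding in ioblame.h, to IOB_MODIFIER_WB with val 3 - the
+  wb_reason of this writeback.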
+
+  #35 modifier=0x10000003
+  [ffffffff811c0710] __mark_inode_dirty+0x220/0x330
+  [ffffffff8129611a] ext4_mb_new_blocks+0xea/0x5a0
+  [ffffffff8128b22e] ext4_ext_map_blocks+0x1c0e/0x1d80
+  [ffffffff812638d1] ext4_map_blocks+0x1b1/0x260
+  [ffffffff81263a28] _ext4_get_block+0xa8/0x160
+  [ffffffff81263b46] ext4_get_block+0x16/0x20
+  [ffffffff811d0460] __blockdev_direct_IO+0x9d0/0x3830
+  [ffffffff8129a7a1] ext4_ind_direct_IO+0x121/0x470
+  [ffffffff812612ee] ext4_direct_IO+0x23e/0x400
+  [ffffffff81141308] generic_file_direct_write+0xc8/0x190
+  [ffffffff811416ab] __generic_file_aio_write+0x2db/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+
+5. Notes
+
+* By the time ioblame reports IOs or counters, the task being charged
+  might have already exited, which is why ioblame prints the task
+  command in some reports but not in others.  Userland tools are
+  advised to combine live task listing with process accounting to
+  match pids to commands.
+
+* dev:ino:gen can be mapped to a filename without scanning the whole
+  filesystem by constructing an FS-specific filehandle, opening it with
+  open_by_handle_at(2) and then readlink(2)ing /proc/self/FD.  This
+  returns the full path as long as the dentry is in cache, which is
+  likely if data acquisition and mapping don't happen too long after
+  the IOs; a sketch follows these notes.
+
+* At this point, it's mostly tested with ext4 w/o journal.  Metadata
+  dirtier tracking w/ journal needs improvements.
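+
+  A minimal userland sketch of the filehandle approach follows.  It is
+  an illustration only (not part of ioblame) and assumes the target
+  filesystem uses the common 32bit ino+gen handle layout (e.g. ext4);
+  open_by_handle_at(2) also requires CAP_DAC_READ_SEARCH.
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+
+    /*
+     * mount_fd is an fd for any file on the filesystem identified by
+     * dev.  Handle type 1 (FILEID_INO32_GEN) is an assumption about
+     * the target filesystem.
+     */
+    static int ino_gen_to_path(int mount_fd, uint32_t ino, uint32_t gen,
+                               char *buf, size_t len)
+    {
+        struct file_handle *fh;
+        uint32_t *fid;
+        char proc[64];
+        ssize_t ret;
+        int fd;
+
+        fh = malloc(sizeof(*fh) + 2 * sizeof(uint32_t));
+        if (!fh)
+            return -1;
+        fh->handle_bytes = 2 * sizeof(uint32_t);
+        fh->handle_type = 1;                    /* FILEID_INO32_GEN */
+        fid = (uint32_t *)fh->f_handle;
+        fid[0] = ino;
+        fid[1] = gen;
+
+        /* open by handle, then resolve the path via /proc/self/fd */
+        fd = open_by_handle_at(mount_fd, fh, O_PATH);
+        free(fh);
+        if (fd < 0)
+            return -1;
+
+        snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
+        ret = readlink(proc, buf, len - 1);
+        close(fd);
+        if (ret < 0)
+            return -1;
+        buf[ret] = '\0';
+        return 0;
+    }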
+
+
+6. Overheads
+
+On x86_64, a role is 104 bytes, an intent is 32 + 8 * stack_depth bytes
+and an act is 72 bytes.  Intents are allocated using kzalloc() and there
+shouldn't be too many of them.  Both roles and acts have their own
+kmem_cache and can be monitored via /proc/slabinfo.
+
+Each counter occupies 32 * nr_counters bytes and is aligned to a
+cacheline.  Counters are allocated only as necessary.  The iob_counters
+kmem_cache is created dynamically on enable.
+
+The size of the page frame number -> dirtier mapping table is
+proportional to the amount of available physical memory.  If
+max_acts <= 65536, 2 bytes are used per PAGE_SIZE.  With 4k pages, at
+most ~0.049% of memory (2 / 4096) can be used.  If max_acts > 65536,
+4 bytes are used, doubling the percentage to ~0.098%.  The table also
+grows dynamically.
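+For example, a machine with 4GiB of memory has 1,048,576 4k page
+frames, so a fully populated table takes about 2MiB with max_acts <=
+65536.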
+
+There are also indexing data structures used - hash tables, id[ra]s
+and a radix tree.  There are three hash tables, each sized according
+to max_{roles|intents|acts}.  The maximum memory usage by hash tables
+is sizeof(void *) * (max_roles + max_intents + max_acts).  Memory used
+by other indexing structures should be negligible.
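+With the default max_roles=65536, max_intents=1024 and max_acts=65536
+on 64bit, for example, the hash table bound comes to
+8 * (65536 + 1024 + 65536) bytes, or slightly over 1MiB.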
+
+Preliminary tests w/ fio ssd-test on a loopback device on tmpfs, which
+is purely CPU cycle bound, show a ~20% throughput hit.
+
+*** TODO: add performance testing results and explain involved CPU
+    overheads.
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4053cbd..2ee4e3b 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -8,6 +8,7 @@
 #ifdef CONFIG_BLOCK
 
 #include <linux/types.h>
+#include <linux/ioblame.h>
 
 struct bio_set;
 struct bio;
@@ -69,6 +70,9 @@ struct bio {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	struct iob_io_info	bi_iob_info;
+#endif
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7aacf31..7a43f9a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -835,6 +835,9 @@ struct inode {
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
 	void			*i_private; /* fs or device private pointer */
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	union iob_id		i_iob_act;
+#endif
 };
 
 static inline int inode_unhashed(struct inode *inode)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 9d0e0b5..237db65 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -190,6 +190,9 @@ struct gendisk {
 #ifdef  CONFIG_BLK_DEV_INTEGRITY
 	struct blk_integrity *integrity;
 #endif
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	bool iob_enabled;
+#endif
 	int node_id;
 };
 
diff --git a/include/linux/ioblame.h b/include/linux/ioblame.h
new file mode 100644
index 0000000..06c7f3a
--- /dev/null
+++ b/include/linux/ioblame.h
@@ -0,0 +1,72 @@
+/*
+ * include/linux/ioblame.h - statistical IO analyzer with origin tracking
+ *
+ * Copyright (C) 2011 Google, Inc.
+ * Copyright (C) 2011 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _IOBLAME_H
+#define _IOBLAME_H
+
+#ifdef __KERNEL__
+
+#include <linux/rcupdate.h>
+
+struct page;
+struct inode;
+struct buffer_head;
+
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+
+/*
+ * Each iob_node is identified by 64bit id, which packs three fields in it
+ * - @type, @nr and @gen.  @nr is ida allocated index in @type.  It is
+ * always allocated from the lowest available slot, which allows efficient
+ * use of pgtree and idr; however, this means @nr is likely to be recycled.
+ * @gen is used to disambiguate recycled @nr's.
+ */
+#define IOB_NR_BITS			31
+#define IOB_GEN_BITS			31
+#define IOB_TYPE_BITS			2
+
+union iob_id {
+	u64				v;
+	struct {
+		u64			nr:IOB_NR_BITS;
+		u64			gen:IOB_GEN_BITS;
+		u64			type:IOB_TYPE_BITS;
+	} f;
+};
+
+struct iob_io_info {
+	sector_t			sector;
+	size_t				size;
+	unsigned long			rw;
+
+	u64				queued_at;
+	u64				issued_at;
+
+	pid_t				pid;
+	int				intent;
+	dev_t				dev;
+	u32				gen;
+	ino_t				ino;
+};
+
+#endif	/* CONFIG_IO_BLAME[_MODULE] */
+#endif	/* __KERNEL__ */
+
+enum iob_special_nr {
+	IOB_NOMEM_NR,
+	IOB_LOST_NR,
+	IOB_BASE_NR,
+};
+
+/* intent modifier */
+#define IOB_MODIFIER_TYPE_SHIFT	28
+#define IOB_MODIFIER_TYPE_MASK	0xf0000000U
+#define IOB_MODIFIER_VAL_MASK	(~IOB_MODIFIER_TYPE_MASK)
+
+/* val contains wb_reason */
+#define IOB_MODIFIER_WB		(1 << IOB_MODIFIER_TYPE_SHIFT)
+
+#endif	/* _IOBLAME_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..ccc7c12 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -368,6 +368,18 @@ config BLK_DEV_IO_TRACE
 
 	  If unsure, say N.
 
+config IO_BLAME
+	tristate "Enable io-blame tracer"
+	depends on SYSFS
+	depends on BLOCK
+	select TRACEPOINTS
+	select STACKTRACE
+	help
+	  Say Y here if you want to enable the IO tracer with dirtier
+	  tracking.  See Documentation/trace/ioblame.txt.
+
+	  If unsure, say N.
+
 config KPROBE_EVENT
 	depends on KPROBES
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 5f39a07..408cd1a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
 ifeq ($(CONFIG_BLOCK),y)
 obj-$(CONFIG_EVENT_TRACING) += blktrace.o
 endif
+obj-$(CONFIG_IO_BLAME) += ioblame.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events.o
 obj-$(CONFIG_EVENT_TRACING) += trace_export.o
 obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
diff --git a/kernel/trace/ioblame.c b/kernel/trace/ioblame.c
new file mode 100644
index 0000000..ae46abe
--- /dev/null
+++ b/kernel/trace/ioblame.c
@@ -0,0 +1,2279 @@
+/*
+ * kernel/trace/ioblame.c - IO tracer with origin tracking
+ *
+ * Copyright (C) 2011 Google, Inc.
+ * Copyright (C) 2011 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/idr.h>
+#include <linux/bitmap.h>
+#include <linux/radix-tree.h>
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/stacktrace.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/log2.h>
+#include <linux/jhash.h>
+#include <linux/genhd.h>
+#include <linux/string.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/mm_types.h>
+#include <linux/fs.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/log2.h>
+#include <asm/div64.h>
+
+#include <trace/events/sched.h>
+#include <trace/events/vfs.h>
+#include <trace/events/writeback.h>
+#include <trace/events/block.h>
+
+#include "trace.h"
+
+#include <linux/ioblame.h>
+
+#define IOB_ROLE_NAMELEN	32
+#define IOB_STACK_MAX_DEPTH	32
+
+#define IOB_DFL_MAX_ROLES	(1 << 16)
+#define IOB_DFL_MAX_INTENTS	(1 << 10)
+#define IOB_DFL_MAX_ACTS	(1 << 16)
+#define IOB_DFL_TTL_SECS	120
+
+#define IOB_LAST_INO_DURATION	(5 * HZ)	/* last_ino is valid for 5s */
+
+/*
+ * Each type represents different type of entities tracked by ioblame and
+ * has its own iob_idx.
+ *
+ * role		: "who" - either a task or custom id from userland.
+ *
+ * intent	: The who's intention - backtrace + modifier.
+ *
+ * act		: Product of role, intent and the target inode.  "who"
+ *		  acts on a target inode with certain backtrace.
+ */
+enum iob_type {
+	IOB_INVALID,
+	IOB_ROLE,
+	IOB_INTENT,
+	IOB_ACT,
+
+	IOB_NR_TYPES,
+};
+
+#define IOB_PACK_ID(_type, _nr, _gen)	\
+	(union iob_id){ .f = { .type = (_type), .nr = (_nr), .gen = (_gen) }}
+
+/* stats */
+struct iob_stats {
+	u64 idx_nomem;
+	u64 idx_nospc;
+	u64 node_nomem;
+	u64 pgtree_nomem;
+};
+
+/* iob_node is what iob_idx indexes and embedded in every iob_type */
+struct iob_node {
+	struct hlist_node	hash_node;
+	union iob_id		id;
+};
+
+/* describes properties and operations of an iob_type for iob_idx */
+struct iob_idx_type {
+	enum iob_type		type;
+
+	/* calculate hash value from key */
+	unsigned long		(*hash)(void *key);
+	/* return %true if @node matches @key */
+	bool			(*match)(struct iob_node *node, void *key);
+	/* create a new node which matches @key w/ alloc mask @gfp_mask */
+	struct iob_node		*(*create)(void *key, gfp_t gfp_mask);
+	/* destroy @node */
+	void			(*destroy)(struct iob_node *node);
+
+	/* keys for fallback nodes */
+	void			*nomem_key;
+	void			*lost_key;
+};
+
+/*
+ * iob_idx indexes iob_nodes.  iob_nodes can either be found via hash table
+ * or by id.f.nr.  Hash calculation and matching are determined by
+ * iob_idx_type.  If a node is missing during hash lookup, new one is
+ * automatically created.
+ */
+struct iob_idx {
+	const struct iob_idx_type *type;
+
+	/* hash */
+	struct hlist_head	*hash;
+	unsigned int		hash_mask;
+
+	/* id index */
+	struct ida		ida;		/* used for allocation */
+	struct idr		idr;		/* record node or gen */
+
+	/* fallback nodes */
+	struct iob_node		*nomem_node;
+	struct iob_node		*lost_node;
+
+	/* stats */
+	unsigned int		nr_nodes;
+	unsigned int		max_nodes;
+};
+
+/*
+ * Functions to encode and decode pointer and generation for iob_idx->idr.
+ *
+ * id.f.gen is used to disambiguate recycled id.f.nr.  When there's no
+ * active node, iob_idx->idr slot carries the last generation number.
+ */
+static void *iob_idr_encode_node(struct iob_node *node)
+{
+	BUG_ON((unsigned long)node & 1);
+	return node;
+}
+
+static void *iob_idr_encode_gen(u32 gen)
+{
+	unsigned long v = (unsigned long)gen;
+	return (void *)((v << 1) | 1);
+}
+
+static struct iob_node *iob_idr_node(void *p)
+{
+	unsigned long v = (unsigned long)p;
+	return (v & 1) ? NULL : (void *)v;
+}
+
+static u32 iob_idr_gen(void *p)
+{
+	unsigned long v = (unsigned long)p;
+	return (v & 1) ? v >> 1 : 0;
+}
+
+/* IOB_ROLE */
+struct iob_role {
+	struct iob_node		node;
+
+	/*
+	 * Because a task can change its pid during exec and we want exact
+	 * match for removal on task exit, we use task pointer as key.
+	 */
+	struct task_struct	*task;
+	int			pid;
+
+	/* modifier currently in effect */
+	u32			modifier;
+
+	/* last file this role has operated on */
+	struct {
+		dev_t			dev;
+		u32			gen;
+		ino_t			ino;
+	} last_ino;
+	unsigned long		last_ino_jiffies;
+
+	/* act for inode dirtying/writing in progress */
+	union iob_id		inode_act;
+
+	/* for reclaiming */
+	struct list_head	free_list;
+};
+
+/* IOB_INTENT - uses separate key struct to use struct stack_trace directly */
+struct iob_intent_key {
+	u32			modifier;
+	int			depth;
+	unsigned long		*trace;
+};
+
+struct iob_intent {
+	struct iob_node		node;
+
+	u32			modifier;
+	int			depth;
+	unsigned long		trace[];
+};
+
+/* IOB_ACT */
+struct iob_act {
+	struct iob_node		node;
+
+	struct iob_act		*free_next;
+
+	/* key fields follow - paddings, if any, should be zero filled */
+	union iob_id		role;	/* must be the first field of keys */
+	union iob_id		intent;
+	dev_t			dev;
+	u32			gen;
+	ino_t			ino;
+};
+
+#define IOB_ACT_KEY_OFFSET	offsetof(struct iob_act, role)
+
+static DEFINE_MUTEX(iob_mutex);		/* enable/disable and userland access */
+static DEFINE_SPINLOCK(iob_lock);	/* write access to all int structures */
+
+static bool iob_enabled __read_mostly = false;
+
+/* temp buffer used for parsing/printing, user must be holding iob_mutex */
+static char __iob_page_buf[PAGE_SIZE];
+#define iob_page_buf	({ lockdep_assert_held(&iob_mutex); __iob_page_buf; })
+
+/* userland tunable knobs */
+static unsigned int iob_max_roles __read_mostly = IOB_DFL_MAX_ROLES;
+static unsigned int iob_max_intents __read_mostly = IOB_DFL_MAX_INTENTS;
+static unsigned int iob_max_acts __read_mostly = IOB_DFL_MAX_ACTS;
+static unsigned int iob_ttl_secs __read_mostly = IOB_DFL_TTL_SECS;
+static bool iob_ignore_ino __read_mostly;
+
+/* pgtree params, determined by iob_max_acts */
+static unsigned long iob_pgtree_shift __read_mostly;
+static unsigned long iob_pgtree_pfn_shift __read_mostly;
+static unsigned long iob_pgtree_pfn_mask __read_mostly;
+
+/* role and act caches, intent is variable size and allocated using kzalloc */
+static struct kmem_cache *iob_role_cache;
+static struct kmem_cache *iob_act_cache;
+
+/* iob_idx for each iob_type */
+static struct iob_idx *iob_role_idx __read_mostly;
+static struct iob_idx *iob_intent_idx __read_mostly;
+static struct iob_idx *iob_act_idx __read_mostly;
+
+/* for reclaiming */
+static void iob_reclaim_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(iob_reclaim_work, iob_reclaim_workfn);
+
+static unsigned int iob_role_reclaim_seq;
+
+static struct list_head iob_role_to_free_heads[2] = {
+	LIST_HEAD_INIT(iob_role_to_free_heads[0]),
+	LIST_HEAD_INIT(iob_role_to_free_heads[1]),
+};
+static struct list_head *iob_role_to_free_front = &iob_role_to_free_heads[0];
+static struct list_head *iob_role_to_free_back = &iob_role_to_free_heads[1];
+
+static unsigned long *iob_act_used_bitmaps[2];
+
+struct iob_act_used {
+	unsigned long	*front;
+	unsigned long	*back;
+} iob_act_used;
+
+/* pgtree - maps pfn to act nr */
+static RADIX_TREE(iob_pgtree, GFP_NOWAIT);
+
+/* stats and /sys/kernel/debug/ioblame */
+static struct iob_stats iob_stats;
+static struct dentry *iob_dir;
+static struct dentry *iob_intents_dentry;
+
+static void iob_intent_notify_workfn(struct work_struct *work);
+static DECLARE_WORK(iob_intent_notify_work, iob_intent_notify_workfn);
+
+static bool iob_enabled_inode(struct inode *inode)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && inode->i_sb->s_bdev &&
+		inode->i_sb->s_bdev->bd_disk->iob_enabled;
+}
+
+static bool iob_enabled_bh(struct buffer_head *bh)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && bh->b_bdev->bd_disk->iob_enabled;
+}
+
+static bool iob_enabled_bio(struct bio *bio)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && bio->bi_bdev &&
+		bio->bi_bdev->bd_disk->iob_enabled;
+}
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/ioblame.h>
+
+/*
+ * IOB_IDX
+ *
+ * This is the main indexing facility used to maintain and access all
+ * iob_type objects.  iob_idx operates on iob_node which each iob_type
+ * object embeds.
+ *
+ * Each iob_idx is associated with iob_idx_type on creation, which
+ * describes which type it is, methods used during hash lookup and two keys
+ * for fallback node creation.
+ *
+ * Objects can be accessed either by hash table or id.  Hash table lookup
+ * uses iob_idx_type->hash() and ->match() methods for lookup and
+ * ->create() and ->destroy() to create new object if missing and
+ * requested.  Note that the hash key is opaque to iob_idx.  Key handling
+ * is defined completely by iob_idx_type methods.
+ *
+ * When a new object is created, iob_idx automatically assigns an id, which
+ * is combination of type enum, object number (nr), and generation number.
+ * Object number is ida allocated and always packed towards 0.  Generation
+ * number starts at 1 and gets incremented each time the nr is recycled.
+ *
+ * Access by id is either by whole id or nr part of it.  Objects are not
+ * created through id lookups.
+ *
+ * Read accesses are protected by sched_rcu.  Using sched_rcu allows
+ * avoiding extra rcu locking operations in tracepoint probes.  Write
+ * accesses are expected to be infrequent and synchronized with single
+ * spinlock - iob_lock.
+ */
+
+static int iob_idx_install_node(struct iob_node *node, struct iob_idx *idx,
+				gfp_t gfp_mask)
+{
+	const struct iob_idx_type *type = idx->type;
+	int nr = -1, idr_nr = -1, ret;
+	void *p;
+
+	INIT_HLIST_NODE(&node->hash_node);
+
+	/* allocate nr and make sure it's under the limit */
+	do {
+		if (unlikely(!ida_pre_get(&idx->ida, gfp_mask)))
+			goto enomem;
+		ret = ida_get_new(&idx->ida, &nr);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (unlikely(ret < 0 || nr >= idx->max_nodes))
+		goto enospc;
+
+	/* if @nr was used before, idr would have last_gen recorded, look up */
+	p = idr_find(&idx->idr, nr);
+	if (p) {
+		WARN_ON_ONCE(iob_idr_node(p));
+		/* set id with gen before replacing the idr entry */
+		node->id = IOB_PACK_ID(type->type, nr, iob_idr_gen(p) + 1);
+		idr_replace(&idx->idr, node, nr);
+		return 0;
+	}
+
+	/* create a new idr entry, it must match ida allocation */
+	node->id = IOB_PACK_ID(type->type, nr, 1);
+	do {
+		if (unlikely(!idr_pre_get(&idx->idr, gfp_mask)))
+			goto enomem;
+		ret = idr_get_new_above(&idx->idr, iob_idr_encode_node(node),
+					nr, &idr_nr);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (unlikely(ret < 0) || WARN_ON_ONCE(idr_nr != nr))
+		goto enospc;
+
+	return 0;
+
+enomem:
+	iob_stats.idx_nomem++;
+	ret = -ENOMEM;
+	goto fail;
+enospc:
+	iob_stats.idx_nospc++;
+	ret = -ENOSPC;
+fail:
+	if (idr_nr >= 0)
+		idr_remove(&idx->idr, idr_nr);
+	if (nr >= 0)
+		ida_remove(&idx->ida, nr);
+	return ret;
+}
+
+/**
+ * iob_idx_destroy - destroy iob_idx
+ * @idx: iob_idx to destroy
+ *
+ * Free all nodes indexed by @idx and @idx itself.  The caller is
+ * responsible for ensuring nobody is accessing @idx.
+ */
+static void iob_idx_destroy(struct iob_idx *idx)
+{
+	const struct iob_idx_type *type = idx->type;
+	void *ptr;
+	int pos = 0;
+
+	while ((ptr = idr_get_next(&idx->idr, &pos))) {
+		struct iob_node *node = iob_idr_node(ptr);
+		if (node)
+			type->destroy(node);
+		pos++;
+	}
+
+	idr_remove_all(&idx->idr);
+	idr_destroy(&idx->idr);
+	ida_destroy(&idx->ida);
+
+	vfree(idx->hash);
+	kfree(idx);
+}
+
+/**
+ * iob_idx_create - create a new iob_idx
+ * @type: type of new iob_idx
+ * @max_nodes: maximum number of nodes allowed
+ *
+ * Create a new @type iob_idx.  Newly created iob_idx has two fallback
+ * nodes pre-allocated - one for nomem and the other for lost nodes, each
+ * occupying IOB_NOMEM_NR and IOB_LOST_NR slot respectively.
+ *
+ * Returns pointer to the new iob_idx on success, %NULL on failure.
+ */
+static struct iob_idx *iob_idx_create(const struct iob_idx_type *type,
+				      unsigned int max_nodes)
+{
+	unsigned int hash_sz = rounddown_pow_of_two(max_nodes);
+	struct iob_idx *idx;
+	struct iob_node *node;
+
+	if (max_nodes < 2)
+		return NULL;
+
+	/* alloc and init */
+	idx = kzalloc(sizeof(*idx), GFP_KERNEL);
+	if (!idx)
+		return NULL;
+
+	ida_init(&idx->ida);
+	idr_init(&idx->idr);
+	idx->type = type;
+	idx->max_nodes = max_nodes;
+	idx->hash_mask = hash_sz - 1;
+
+	idx->hash = vzalloc(hash_sz * sizeof(idx->hash[0]));
+	if (!idx->hash)
+		goto fail;
+
+	/* create and install nomem_node */
+	node = type->create(type->nomem_key, GFP_KERNEL);
+	if (!node)
+		goto fail;
+	if (iob_idx_install_node(node, idx, GFP_KERNEL) < 0) {
+		type->destroy(node);
+		goto fail;
+	}
+	idx->nomem_node = node;
+	idx->nr_nodes++;
+
+	/* create and install lost_node */
+	node = type->create(type->lost_key, GFP_KERNEL);
+	if (!node)
+		goto fail;
+	if (iob_idx_install_node(node, idx, GFP_KERNEL) < 0) {
+		type->destroy(node);
+		goto fail;
+	}
+	idx->lost_node = node;
+	idx->nr_nodes++;
+
+	/* verify both fallback nodes have the correct id.f.nr */
+	if (idx->nomem_node->id.f.nr != IOB_NOMEM_NR ||
+	    idx->lost_node->id.f.nr != IOB_LOST_NR)
+		goto fail;
+
+	return idx;
+fail:
+	iob_idx_destroy(idx);
+	return NULL;
+}
+
+/**
+ * iob_node_by_nr_raw - lookup node by nr
+ * @nr: nr to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Lookup node occupying slot @nr.  If such node doesn't exist, %NULL is
+ * returned.
+ */
+static struct iob_node *iob_node_by_nr_raw(int nr, struct iob_idx *idx)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+	return iob_idr_node(idr_find(&idx->idr, nr));
+}
+
+/**
+ * iob_node_by_id_raw - lookup node by id
+ * @id: id to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Lookup node with @id.  @id's type should match @idx's type and all three
+ * id fields should match for successful lookup - type, id and generation.
+ * Returns %NULL on failure.
+ */
+static struct iob_node *iob_node_by_id_raw(union iob_id id, struct iob_idx *idx)
+{
+	struct iob_node *node;
+
+	WARN_ON_ONCE(id.f.type != idx->type->type);
+
+	node = iob_node_by_nr_raw(id.f.nr, idx);
+	if (likely(node && node->id.v == id.v))
+		return node;
+	return NULL;
+}
+
+static struct iob_node *iob_hash_head_lookup(void *key,
+					     struct hlist_head *hash_head,
+					     const struct iob_idx_type *type)
+{
+	struct hlist_node *pos;
+	struct iob_node *node;
+
+	hlist_for_each_entry_rcu(node, pos, hash_head, hash_node)
+		if (type->match(node, key))
+			return node;
+	return NULL;
+}
+
+/**
+ * iob_get_node_raw - lookup node from hash table and create if missing
+ * @key: key to lookup hash table with
+ * @idx: iob_idx to lookup from
+ * @create: whether to create a new node if lookup fails
+ *
+ * Look up node which matches @key in @idx.  If no such node exists and
+ * @create is %true, create a new one.  A newly created node will have
+ * unique id assigned to it as long as generation number doesn't overflow.
+ *
+ * This function should be called under rcu sched read lock and returns
+ * %NULL on failure.
+ */
+static struct iob_node *iob_get_node_raw(void *key, struct iob_idx *idx,
+					 bool create)
+{
+	const struct iob_idx_type *type = idx->type;
+	struct iob_node *node, *new_node;
+	struct hlist_head *hash_head;
+	unsigned long hash, flags;
+
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	/* lookup hash */
+	hash = type->hash(key);
+	hash_head = &idx->hash[hash & idx->hash_mask];
+
+	node = iob_hash_head_lookup(key, hash_head, type);
+	if (node || !create)
+		return node;
+
+	/* non-existent && @create, create new one */
+	new_node = type->create(key, GFP_NOWAIT);
+	if (!new_node) {
+		iob_stats.node_nomem++;
+		return NULL;
+	}
+
+	spin_lock_irqsave(&iob_lock, flags);
+
+	/* someone might have inserted it in between, look up again */
+	node = iob_hash_head_lookup(key, hash_head, type);
+	if (node)
+		goto out_unlock;
+
+	/* install the node and add to the hash table */
+	if (iob_idx_install_node(new_node, idx, GFP_NOWAIT))
+		goto out_unlock;
+
+	hlist_add_head_rcu(&new_node->hash_node, hash_head);
+	idx->nr_nodes++;
+
+	node = new_node;
+	new_node = NULL;
+out_unlock:
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	if (unlikely(new_node))
+		type->destroy(new_node);
+	return node;
+}
+
+/**
+ * iob_node_by_nr - lookup node by nr with fallback
+ * @nr: nr to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_node_by_nr_raw() but returns @idx->lost_node instead of
+ * %NULL if lookup fails.  The lost_node is returned as nr/id lookup
+ * failure indicates the target node has already been reclaimed.
+ */
+static struct iob_node *iob_node_by_nr(int nr, struct iob_idx *idx)
+{
+	return iob_node_by_nr_raw(nr, idx) ?: idx->lost_node;
+}
+
+/**
+ * iob_node_by_id - lookup node by id with fallback
+ * @id: id to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_node_by_id_raw() but returns @idx->lost_node instead of
+ * %NULL if lookup fails.  The lost_node is returned as nr/id lookup
+ * failure indicates the target node has already been reclaimed.
+ */
+static struct iob_node *iob_node_by_id(union iob_id id, struct iob_idx *idx)
+{
+	return iob_node_by_id_raw(id, idx) ?: idx->lost_node;
+}
+
+/**
+ * iob_get_node - lookup node from hash table and create if missing w/ fallback
+ * @key: key to lookup hash table with
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_get_node_raw(@key, @idx, %true) but returns @idx->nomem_node
+ * instead of %NULL on failure as the only reason is alloc failure.
+ */
+static struct iob_node *iob_get_node(void *key, struct iob_idx *idx)
+{
+	return iob_get_node_raw(key, idx, true) ?: idx->nomem_node;
+}
+
+/**
+ * iob_unhash_node - unhash an iob_node
+ * @node: node to unhash
+ * @idx: iob_idx @node is hashed on
+ *
+ * Make @node invisible from hash lookup.  It will still be visible from
+ * id/nr lookup.
+ *
+ * Must be called holding iob_lock and returns %true if unhashed
+ * successfully, %false if someone else already unhashed it.
+ */
+static bool iob_unhash_node(struct iob_node *node, struct iob_idx *idx)
+{
+	lockdep_assert_held(&iob_lock);
+
+	if (hlist_unhashed(&node->hash_node))
+		return false;
+	hlist_del_init_rcu(&node->hash_node);
+	return true;
+}
+
+/**
+ * iob_remove_node - remove an iob_node
+ * @node: node to remove
+ * @idx: iob_idx @node is on
+ *
+ * Remove @node from @idx.  The caller is responsible for calling
+ * iob_unhash_node() before.  Note that removed nodes should be freed only
+ * after RCU grace period has passed.
+ *
+ * Must be called holding iob_lock.
+ */
+static void iob_remove_node(struct iob_node *node, struct iob_idx *idx)
+{
+	lockdep_assert_held(&iob_lock);
+
+	/* don't remove idr slot, record current generation there */
+	idr_replace(&idx->idr, iob_idr_encode_gen(node->id.f.gen),
+		    node->id.f.nr);
+	ida_remove(&idx->ida, node->id.f.nr);
+	idx->nr_nodes--;
+}
+
+
+/*
+ * IOB_ROLE
+ *
+ * A role represents a task and is keyed by its task pointer.  It is
+ * created when the matching task first enters iob tracking, unhashed on
+ * task exit and destroyed after reclaim period has passed.
+ *
+ * The reason why task_roles are keyed by task pointer instead of pid is
+ * that pid can change across exec(2) and we need reliable match on task
+ * exit to avoid leaking task_roles.  A task_role is unhashed and scheduled
+ * for removal on task exit or if the pid no longer matches after exec.
+ *
+ * These life-cycle rules guarantee that any task is given one id across
+ * its lifetime and avoid resource leaks.
+ *
+ * A role also carries context information for the task, e.g. the last file
+ * the task operated on, currently on-going inode operation and so on.
+ */
+
+static struct iob_role *iob_node_to_role(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_role, node) : NULL;
+}
+
+static unsigned long iob_role_hash(void *key)
+{
+	struct iob_role *rkey = key;
+
+	return jhash(rkey->task, sizeof(rkey->task), JHASH_INITVAL);
+}
+
+static bool iob_role_match(struct iob_node *node, void *key)
+{
+	struct iob_role *role = iob_node_to_role(node);
+	struct iob_role *rkey = key;
+
+	return rkey->task == role->task;
+}
+
+static struct iob_node *iob_role_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_role *rkey = key;
+	struct iob_role *role;
+
+	role = kmem_cache_alloc(iob_role_cache, gfp_mask);
+	if (!role)
+		return NULL;
+	*role = *rkey;
+	INIT_LIST_HEAD(&role->free_list);
+	return &role->node;
+}
+
+static void iob_role_destroy(struct iob_node *node)
+{
+	kmem_cache_free(iob_role_cache, iob_node_to_role(node));
+}
+
+static struct iob_role iob_role_null_key = { };
+
+static const struct iob_idx_type iob_role_idx_type = {
+	.type		= IOB_ROLE,
+
+	.hash		= iob_role_hash,
+	.match		= iob_role_match,
+	.create		= iob_role_create,
+	.destroy	= iob_role_destroy,
+
+	.nomem_key	= &iob_role_null_key,
+	.lost_key	= &iob_role_null_key,
+};
+
+static struct iob_role *iob_role_by_id(union iob_id id)
+{
+	return iob_node_to_role(iob_node_by_id(id, iob_role_idx));
+}
+
+/**
+ * iob_reclaim_current_role - reclaim role for %current
+ *
+ * This function guarantees that the self role won't be visible to hash
+ * table lookup by %current itself.
+ */
+static void iob_reclaim_current_role(void)
+{
+	struct iob_role rkey = { };
+	struct iob_role *role;
+	unsigned long flags;
+
+	/*
+	 * A role is always created by %current and thus guaranteed to be
+	 * visible to %current.  Negative result from lockless lookup can
+	 * be trusted.
+	 */
+	rkey.task = current;
+	rkey.pid = task_pid_nr(current);
+	role = iob_node_to_role(iob_get_node_raw(&rkey, iob_role_idx, false));
+	if (!role)
+		return;
+
+	/* unhash and queue on reclaim list */
+	spin_lock_irqsave(&iob_lock, flags);
+	WARN_ON_ONCE(!iob_unhash_node(&role->node, iob_role_idx));
+	WARN_ON_ONCE(!list_empty(&role->free_list));
+	list_add_tail(&role->free_list, iob_role_to_free_front);
+	spin_unlock_irqrestore(&iob_lock, flags);
+}
+
+/**
+ * iob_current_role - lookup role for %current
+ *
+ * Return role for %current.  May return nomem node under memory pressure.
+ */
+static struct iob_role *iob_current_role(void)
+{
+	struct iob_role rkey = { };
+	struct iob_role *role;
+	bool retried = false;
+
+	rkey.task = current;
+	rkey.pid = task_pid_nr(current);
+retry:
+	role = iob_node_to_role(iob_get_node(&rkey, iob_role_idx));
+
+	/*
+	 * If %current exec'd, its pid may have changed.  In such cases,
+	 * shoot down the current role and retry.
+	 */
+	if (role->pid == rkey.pid || role->node.id.f.nr < IOB_BASE_NR)
+		return role;
+
+	iob_reclaim_current_role();
+
+	/* this shouldn't happen more than once */
+	WARN_ON_ONCE(retried);
+	retried = true;
+	goto retry;
+}
+
+
+/*
+ * IOB_INTENT
+ *
+ * An intent represents a category of actions a task can take.  It
+ * currently consists of the stack trace at the point of action and an
+ * optional modifier.  The number of unique backtraces is expected to be
+ * limited and no reclaiming is implemented.
+ */
+
+static struct iob_intent *iob_node_to_intent(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_intent, node) : NULL;
+}
+
+static unsigned long iob_intent_hash(void *key)
+{
+	struct iob_intent_key *ikey = key;
+
+	return jhash(ikey->trace, ikey->depth * sizeof(ikey->trace[0]),
+		     JHASH_INITVAL + ikey->modifier);
+}
+
+static bool iob_intent_match(struct iob_node *node, void *key)
+{
+	struct iob_intent *intent = iob_node_to_intent(node);
+	struct iob_intent_key *ikey = key;
+
+	if (intent->modifier == ikey->modifier &&
+	    intent->depth == ikey->depth)
+		return !memcmp(intent->trace, ikey->trace,
+			       intent->depth * sizeof(intent->trace[0]));
+	return false;
+}
+
+static struct iob_node *iob_intent_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_intent_key *ikey = key;
+	struct iob_intent *intent;
+	size_t trace_sz = sizeof(intent->trace[0]) * ikey->depth;
+
+	intent = kzalloc(sizeof(*intent) + trace_sz, gfp_mask);
+	if (!intent)
+		return NULL;
+
+	intent->modifier = ikey->modifier;
+	intent->depth = ikey->depth;
+	memcpy(intent->trace, ikey->trace, trace_sz);
+
+	return &intent->node;
+}
+
+static void iob_intent_destroy(struct iob_node *node)
+{
+	kfree(iob_node_to_intent(node));
+}
+
+static struct iob_intent_key iob_intent_null_key = { };
+
+static const struct iob_idx_type iob_intent_idx_type = {
+	.type		= IOB_INTENT,
+
+	.hash		= iob_intent_hash,
+	.match		= iob_intent_match,
+	.create		= iob_intent_create,
+	.destroy	= iob_intent_destroy,
+
+	.nomem_key	= &iob_intent_null_key,
+	.lost_key	= &iob_intent_null_key,
+};
+
+static struct iob_intent *iob_intent_by_nr(int nr)
+{
+	return iob_node_to_intent(iob_node_by_nr(nr, iob_intent_idx));
+}
+
+static struct iob_intent *iob_intent_by_id(union iob_id id)
+{
+	return iob_node_to_intent(iob_node_by_id(id, iob_intent_idx));
+}
+
+static struct iob_intent *iob_get_intent(unsigned long *trace, int depth,
+					 u32 modifier)
+{
+	struct iob_intent_key ikey = { .modifier = modifier, .depth = depth,
+				       .trace = trace };
+	struct iob_intent *intent;
+	int nr_nodes;
+
+	nr_nodes = iob_intent_idx->nr_nodes;
+
+	intent = iob_node_to_intent(iob_get_node(&ikey, iob_intent_idx));
+
+	/*
+	 * If nr_nodes changed across get_node, we probably have created a
+	 * new entry.  Notify change on intent files.  This may be spurious
+	 * but won't miss an event, which is good enough.
+	 */
+	if (nr_nodes != iob_intent_idx->nr_nodes)
+		schedule_work(&iob_intent_notify_work);
+
+	return intent;
+}
+
+static DEFINE_PER_CPU(unsigned long [IOB_STACK_MAX_DEPTH], iob_trace_buf_pcpu);
+
+/**
+ * iob_current_intent - return intent for %current
+ * @skip: number of stack frames to skip
+ *
+ * Acquire stack trace after skipping @skip frames and return matching
+ * iob_intent.  The stack trace never includes iob_current_intent() and
+ * @skip of 1 skips the caller not iob_current_intent().  May return nomem
+ * node under memory pressure.
+ */
+static noinline struct iob_intent *iob_current_intent(int skip)
+{
+	unsigned long *trace = *this_cpu_ptr(&iob_trace_buf_pcpu);
+	struct stack_trace st = { .max_entries = IOB_STACK_MAX_DEPTH,
+				  .entries = trace, .skip = skip + 1 };
+	struct iob_intent *intent;
+	unsigned long flags;
+
+	/* disable IRQ to make trace_pcpu array access exclusive */
+	local_irq_save(flags);
+
+	/* acquire stack trace, ignore -1LU end of stack marker */
+	save_stack_trace_quick(&st);
+	if (st.nr_entries && trace[st.nr_entries - 1] == ULONG_MAX)
+		st.nr_entries--;
+
+	/* get matching iob_intent */
+	intent = iob_get_intent(trace, st.nr_entries, 0);
+
+	local_irq_restore(flags);
+	return intent;
+}
+
+/**
+ * iob_modified_intent - determine modified intent
+ * @intent: the base intent
+ * @modifier: modifier to apply
+ *
+ * Return iob_intent which is identical to @intent except that its modifier
+ * is @modifier.  @intent is allowed to have any modifier including zero on
+ * entry.  May return nomem node under memory pressure.
+ */
+static struct iob_intent *iob_modified_intent(struct iob_intent *intent,
+					      u32 modifier)
+{
+	if (intent->modifier == modifier ||
+	    unlikely(intent->node.id.f.nr < IOB_BASE_NR))
+		return intent;
+	return iob_get_intent(intent->trace, intent->depth, modifier);
+}
+
+
+/*
+ * IOB_ACT
+ *
+ * Represents a specific action an iob_role took.  Consists of an iob_role,
+ * an iob_intent and the target inode.  iob_act is used to track dirtiers.  For
+ * each dirtying operation, iob_act is acquired and recorded (either by id
+ * or id.f.nr) and used for reporting later.
+ *
+ * Because this is a product of three different entities, the number can grow
+ * quite large.  Each successful lookup sets the used bitmap, and iob_acts which
+ * haven't been used for iob_ttl_secs are reclaimed.
+ */
+
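+/*
+ * Mark @act as used in the current ttl/2 window.  Test first to avoid
+ * redundant atomic writes to the shared bitmap.
+ */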
+static void iob_act_mark_used(struct iob_act *act)
+{
+	if (!test_bit(act->node.id.f.nr, iob_act_used.front))
+		set_bit(act->node.id.f.nr, iob_act_used.front);
+}
+
+static struct iob_act *iob_node_to_act(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_act, node) : NULL;
+}
+
+static unsigned long iob_act_hash(void *key)
+{
+	return jhash(key + IOB_ACT_KEY_OFFSET,
+		     sizeof(struct iob_act) - IOB_ACT_KEY_OFFSET,
+		     JHASH_INITVAL);
+}
+
+static bool iob_act_match(struct iob_node *node, void *key)
+{
+	return !memcmp((void *)node + IOB_ACT_KEY_OFFSET,
+		       key + IOB_ACT_KEY_OFFSET,
+		       sizeof(struct iob_act) - IOB_ACT_KEY_OFFSET);
+}
+
+static struct iob_node *iob_act_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_act *akey = key;
+	struct iob_act *act;
+
+	act = kmem_cache_alloc(iob_act_cache, gfp_mask);
+	if (!act)
+		return NULL;
+	*act = *akey;
+	return &act->node;
+}
+
+static void iob_act_destroy(struct iob_node *node)
+{
+	kmem_cache_free(iob_act_cache, iob_node_to_act(node));
+}
+
+static struct iob_act iob_act_nomem_key = {
+	.role		= IOB_PACK_ID(IOB_ROLE, IOB_NOMEM_NR, 1),
+	.intent		= IOB_PACK_ID(IOB_INTENT, IOB_NOMEM_NR, 1),
+};
+
+static struct iob_act iob_act_lost_key = {
+	.role		= IOB_PACK_ID(IOB_ROLE, IOB_LOST_NR, 1),
+	.intent		= IOB_PACK_ID(IOB_INTENT, IOB_LOST_NR, 1),
+};
+
+static const struct iob_idx_type iob_act_idx_type = {
+	.type		= IOB_ACT,
+
+	.hash		= iob_act_hash,
+	.match		= iob_act_match,
+	.create		= iob_act_create,
+	.destroy	= iob_act_destroy,
+
+	.nomem_key	= &iob_act_nomem_key,
+	.lost_key	= &iob_act_lost_key,
+};
+
+static struct iob_act *iob_act_by_nr(int nr)
+{
+	return iob_node_to_act(iob_node_by_nr(nr, iob_act_idx));
+}
+
+static struct iob_act *iob_act_by_id(union iob_id id)
+{
+	return iob_node_to_act(iob_node_by_id(id, iob_act_idx));
+}
+
+/**
+ * iob_current_act - return the current iob_act
+ * @stack_skip: number of stack frames to skip when acquiring iob_intent
+ * @dev: dev_t of the inode being operated on
+ * @ino: ino of the inode being operated on
+ * @gen: generation of the inode being operated on
+ *
+ * Return iob_act for %current with the current backtrace.
+ * iob_current_act() is never included in the backtrace.  May return nomem
+ * node under memory pressure.
+ */
+static __always_inline struct iob_act *iob_current_act(int stack_skip,
+						dev_t dev, ino_t ino, u32 gen)
+{
+	struct iob_role *role = iob_current_role();
+	struct iob_intent *intent = iob_current_intent(stack_skip);
+	struct iob_act akey = { .role = role->node.id,
+				.intent = intent->node.id, .dev = dev };
+	struct iob_act *act;
+	int min_nr;
+
+	/* if either role or intent is special, return matching special role */
+	min_nr = min_t(int, role->node.id.f.nr, intent->node.id.f.nr);
+	if (unlikely(min_nr < IOB_BASE_NR)) {
+		if (min_nr == IOB_NOMEM_NR)
+			return iob_node_to_act(iob_act_idx->nomem_node);
+		else
+			return iob_node_to_act(iob_act_idx->lost_node);
+	}
+
+	/* if ignore_ino is set, use the same act for all files on the dev */
+	if (!iob_ignore_ino) {
+		akey.ino = ino;
+		akey.gen = gen;
+	}
+
+	act = iob_node_to_act(iob_get_node(&akey, iob_act_idx));
+	if (act)
+		iob_act_mark_used(act);
+	return act;
+}
+
+
+/*
+ * RECLAIM
+ */
+
+/**
+ * iob_reclaim - reclaim iob_roles and iob_acts
+ *
+ * This function is called from workqueue every ttl/2 and looks at
+ * iob_act_used->front/back and iob_role_to_free_front/back to reclaim
+ * unused nodes.
+ *
+ * iob_act uses bitmaps to collect and track used history.  Used bits are
+ * examined every ttl/2 period and iob_acts which haven't been used for two
+ * half periods are reclaimed.
+ *
+ * iob_role goes through reclaiming mostly to delay freeing so that roles
+ * are still available when async IO events fire after the original tasks
+ * exit.  iob_role reclaiming is simpler and happens every ttl.
+ */
+static void iob_reclaim_workfn(struct work_struct *work)
+{
+	LIST_HEAD(role_todo);
+	struct iob_act_used *u = &iob_act_used;
+	struct iob_act *free_head = NULL;
+	struct iob_act *act;
+	struct iob_role *role, *role_pos;
+	unsigned long flags;
+	int i;
+
+	/*
+	 * We're gonna reclaim acts which don't have bit set in both front
+	 * and back used bitmaps - IOW, the ones which weren't used in the
+	 * last and this ttl/2 periods.
+	 */
+	bitmap_or(u->back, u->front, u->back, iob_max_acts);
+
+	spin_lock_irqsave(&iob_lock, flags);
+
+	/*
+	 * Determine which roles to reclaim.  This function is executed
+	 * every ttl/2 but we want ttl.  Skip every other time.
+	 */
+	if (!(++iob_role_reclaim_seq % 2)) {
+		/* roles in the other free_head are now older than ttl */
+		list_splice_init(iob_role_to_free_back, &role_todo);
+		swap(iob_role_to_free_front, iob_role_to_free_back);
+
+		/*
+		 * All roles to be reclaimed should have been unhashed
+		 * already.  Removing is enough.
+		 */
+		list_for_each_entry(role, &role_todo, free_list) {
+			WARN_ON_ONCE(!hlist_unhashed(&role->node.hash_node));
+			iob_remove_node(&role->node, iob_role_idx);
+		}
+	}
+
+	/* unhash and remove all acts which don't have bit set in @u->back */
+	for (i = find_next_zero_bit(u->back, iob_max_acts, IOB_BASE_NR);
+	     i < iob_max_acts;
+	     i = find_next_zero_bit(u->back, iob_max_acts, i + 1)) {
+		act = iob_node_to_act(iob_node_by_nr_raw(i, iob_act_idx));
+		if (act) {
+			WARN_ON_ONCE(!iob_unhash_node(&act->node, iob_act_idx));
+			iob_remove_node(&act->node, iob_act_idx);
+			act->free_next = free_head;
+			free_head = act;
+		}
+	}
+
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	/* reclaim complete, front<->back and clear front */
+	swap(u->front, u->back);
+	bitmap_clear(u->front, 0, iob_max_acts);
+
+	/* before freeing reclaimed nodes, wait for in-flight users to finish */
+	synchronize_sched();
+
+	list_for_each_entry_safe(role, role_pos, &role_todo, free_list)
+		iob_role_destroy(&role->node);
+
+	while ((act = free_head)) {
+		free_head = act->free_next;
+		iob_act_destroy(&act->node);
+	}
+
+	queue_delayed_work(system_nrt_wq, &iob_reclaim_work,
+			   iob_ttl_secs * HZ / 2);
+}
+
+
+/*
+ * PGTREE
+ *
+ * Radix tree to map pfn to iob_act.  This is used to track which iob_act
+ * dirtied the page.  When a bio is issued, each page in the iovec is
+ * consulted against pgtree to find out which act caused it.
+ *
+ * Because the size of pgtree is proportional to total available memory, it
+ * uses id.f.nr instead of the full id and may occasionally give a stale
+ * result.  Also, it uses a u16 array if max_acts <= USHRT_MAX; otherwise, u32.
+ */
+
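+/**
+ * iob_pgtree_slot - look up the pgtree slot for a pfn
+ * @pfn: pfn of interest
+ *
+ * Returns pointer to the slot recording @pfn's act nr, or %NULL if the
+ * backing page hasn't been allocated yet.
+ */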
+static void *iob_pgtree_slot(unsigned long pfn)
+{
+	unsigned long idx = pfn >> iob_pgtree_pfn_shift;
+	unsigned long offset = pfn & iob_pgtree_pfn_mask;
+	void *p;
+
+	p = radix_tree_lookup(&iob_pgtree, idx);
+	if (p)
+		return p + (offset << iob_pgtree_shift);
+	return NULL;
+}
+
+/**
+ * iob_pgtree_set_nr - map pfn to nr
+ * @pfn: pfn to map
+ * @nr: id.f.nr to be mapped
+ *
+ * Map @pfn to @nr, which can later be retrieved using
+ * iob_pgtree_get_and_clear_nr().  This function is opportunistic - it may
+ * fail under memory pressure, and racing pgtree ops may clobber each
+ * other's mappings.
+ */
+static int iob_pgtree_set_nr(unsigned long pfn, int nr)
+{
+	void *slot, *p;
+	unsigned long flags;
+	int ret;
+retry:
+	slot = iob_pgtree_slot(pfn);
+	if (likely(slot)) {
+		/*
+		 * We're playing with pointer casts and racy accesses.  Use
+		 * ACCESS_ONCE() to avoid compiler surprises.
+		 */
+		switch (iob_pgtree_shift) {
+		case 1:
+			ACCESS_ONCE(*(u16 *)slot) = nr;
+			break;
+		case 2:
+			ACCESS_ONCE(*(u32 *)slot) = nr;
+			break;
+		default:
+			BUG();
+		}
+		return 0;
+	}
+
+	/* slot missing, create and insert new page and retry */
+	p = (void *)get_zeroed_page(GFP_NOWAIT);
+	if (!p) {
+		iob_stats.pgtree_nomem++;
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&iob_lock, flags);
+	ret = radix_tree_insert(&iob_pgtree, pfn >> iob_pgtree_pfn_shift, p);
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	if (ret) {
+		free_page((unsigned long)p);
+		if (ret != -EEXIST) {
+			iob_stats.pgtree_nomem++;
+			return ret;
+		}
+	}
+	goto retry;
+}
+
+/**
+ * iob_pgtree_get_and_clear_nr - read back pfn to nr mapping and clear it
+ * @pfn: pfn to read mapping for
+ *
+ * Read back the mapping set by iob_pgtree_set_nr().  This function is
+ * opportunistic - racing pgtree ops may clobber each other's mappings.
+ */
+static int iob_pgtree_get_and_clear_nr(unsigned long pfn)
+{
+	void *slot;
+	int nr;
+
+	slot = iob_pgtree_slot(pfn);
+	if (unlikely(!slot))
+		return 0;
+
+	/*
+	 * We're playing with pointer casts and racy accesses.  Use
+	 * ACCESS_ONCE() to avoid compiler surprises.
+	 */
+	switch (iob_pgtree_shift) {
+	case 1:
+		nr = ACCESS_ONCE(*(u16 *)slot);
+		if (nr)
+			ACCESS_ONCE(*(u16 *)slot) = 0;
+		break;
+	case 2:
+		nr = ACCESS_ONCE(*(u32 *)slot);
+		if (nr)
+			ACCESS_ONCE(*(u32 *)slot) = 0;
+		break;
+	default:
+		BUG();
+	}
+	return nr;
+}
+
+
+/*
+ * PROBES
+ *
+ * Tracepoint probes.  This is how ioblame learns what's going on in the
+ * system.  TP probes are always called with preemption disabled, so we
+ * don't need explicit rcu_read_lock_sched().
+ */
+
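+/* remember @inode as the last file %current operated on (role->last_ino) */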
+static void iob_set_last_ino(struct inode *inode)
+{
+	struct iob_role *role = iob_current_role();
+
+	role->last_ino.dev = inode->i_sb->s_dev;
+	role->last_ino.ino = inode->i_ino;
+	role->last_ino.gen = inode->i_generation;
+	role->last_ino_jiffies = jiffies;
+}
+
+/*
+ * Mark the last inode accessed by this task role.  This is used to
+ * attribute IOs to files.
+ */
+static void iob_probe_vfs_fcheck(void *data, struct files_struct *files,
+				 unsigned int fd, struct file *file)
+{
+	if (file) {
+		struct inode *inode = file->f_dentry->d_inode;
+
+		if (iob_enabled_inode(inode))
+			iob_set_last_ino(inode);
+	}
+}
+
+/* called after a page is dirtied - record the dirtying act in pgtree */
+static void iob_probe_wb_dirty_page(void *data, struct page *page,
+				    struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+
+	if (iob_enabled_inode(inode)) {
+		struct iob_act *act = iob_current_act(2, inode->i_sb->s_dev,
+						      inode->i_ino,
+						      inode->i_generation);
+
+		iob_pgtree_set_nr(page_to_pfn(page), act->node.id.f.nr);
+	}
+}
+
+/*
+ * Writeback is starting, record wb_reason in role->modifier.  This will
+ * be applied to any IOs issued from this task until writeback is finished.
+ */
+static void iob_probe_wb_start(void *data, struct backing_dev_info *bdi,
+			       struct wb_writeback_work *work)
+{
+	struct iob_role *role = iob_current_role();
+
+	role->modifier = work->reason | IOB_MODIFIER_WB;
+}
+
+/* writeback done, clear modifier */
+static void iob_probe_wb_written(void *data, struct backing_dev_info *bdi,
+				 struct wb_writeback_work *work)
+{
+	struct iob_role *role = iob_current_role();
+
+	role->modifier = 0;
+}
+
+/*
+ * An inode is about to be written back.  Will be followed by data and
+ * inode writeback.  In case dirtier data is not recorded in pgtree or
+ * inode, remember the inode in role->last_ino.
+ */
+static void iob_probe_wb_single_inode_start(void *data, struct inode *inode,
+					    struct writeback_control *wbc,
+					    unsigned long nr_to_write)
+{
+	if (iob_enabled_inode(inode))
+		iob_set_last_ino(inode);
+}
+
+/*
+ * Called when an inode is about to be dirtied, right before fs
+ * dirty_inode() method.  Different filesystems implement inode dirtying
+ * and writeback differently.  Some may allocate bh on dirtying, some might
+ * do it during write_inode() and others might not use bh at all.
+ *
+ * To cover most cases, two tracking mechanisms are used - role->inode_act
+ * and inode->i_iob_act.  The former marks the current task as performing
+ * inode dirtying act and any IOs issued or bhs touched are attributed to
+ * the act.  The latter records the dirtying act on the inode itself so
+ * that if the filesystem takes action for the inode from write_inode(),
+ * the acting task can take on the dirtying act.
+ */
+static void iob_probe_wb_dirty_inode_start(void *data, struct inode *inode,
+					   int flags)
+{
+	if (iob_enabled_inode(inode)) {
+		struct iob_role *role = iob_current_role();
+		struct iob_act *act = iob_current_act(1, inode->i_sb->s_dev,
+						      inode->i_ino,
+						      inode->i_generation);
+		role->inode_act = act->node.id;
+		inode->i_iob_act = act->node.id;
+	}
+}
+
+/* inode dirtying complete */
+static void iob_probe_wb_dirty_inode(void *data, struct inode *inode, int flags)
+{
+	if (iob_enabled_inode(inode))
+		iob_current_role()->inode_act.v = 0;
+}
+
+/*
+ * Called when an inode is being written back, right before fs
+ * write_inode() method.  Inode writeback is starting, take on the act
+ * which dirtied the inode.
+ */
+static void iob_probe_wb_write_inode_start(void *data, struct inode *inode,
+					   struct writeback_control *wbc)
+{
+	if (iob_enabled_inode(inode) && inode->i_iob_act.v) {
+		struct iob_role *role = iob_current_role();
+
+		role->inode_act = inode->i_iob_act;
+	}
+}
+
+/* inode writing complete */
+static void iob_probe_wb_write_inode(void *data, struct inode *inode,
+				     struct writeback_control *wbc)
+{
+	if (iob_enabled_inode(inode))
+		iob_current_role()->inode_act.v = 0;
+}
+
+/*
+ * Called on touch_buffer().  Transfer inode act to pgtree.  This catches
+ * most inode operations for filesystems which use bh for metadata.
+ */
+static void iob_probe_block_touch_buffer(void *data, struct buffer_head *bh)
+{
+	if (iob_enabled_bh(bh)) {
+		struct iob_role *role = iob_current_role();
+
+		if (role->inode_act.v)
+			iob_pgtree_set_nr(page_to_pfn(bh->b_page),
+					  role->inode_act.f.nr);
+	}
+}
+
+/* bio is being queued, collect all info into bio->bi_iob_info */
+static void iob_probe_block_bio_queue(void *data, struct request_queue *q,
+				      struct bio *bio)
+{
+	struct iob_io_info *io = &bio->bi_iob_info;
+	struct iob_act *act = NULL;
+	struct iob_role *role;
+	struct iob_intent *intent;
+	int i;
+
+	if (!iob_enabled_bio(bio))
+		return;
+
+	role = iob_current_role();
+
+	io->sector = bio->bi_sector;
+	io->size = bio->bi_size;
+	io->rw = bio->bi_rw;
+
+	/* usec duration will be calculated on completion */
+	io->queued_at = io->issued_at = local_clock();
+
+	/* role's inode_act has the highest priority */
+	if (role->inode_act.v)
+		act = iob_act_by_id(role->inode_act);
+
+	/* always walk pgtree and clear matching pages */
+	for (i = 0; i < bio->bi_vcnt; i++) {
+		struct bio_vec *bv = &bio->bi_io_vec[i];
+		int nr;
+
+		if (!bv->bv_len)
+			continue;
+
+		nr = iob_pgtree_get_and_clear_nr(page_to_pfn(bv->bv_page));
+		if (!nr || act)
+			continue;
+
+		/* this is the first act, charge everything to it */
+		act = iob_act_by_nr(nr);
+	}
+
+	if (act) {
+		/* charge it to async dirtier */
+		io->pid = iob_role_by_id(act->role)->pid;
+		io->dev = act->dev;
+		io->ino = act->ino;
+		io->gen = act->gen;
+
+		intent = iob_intent_by_id(act->intent);
+	} else {
+		/*
+		 * Charge it to the IO issuer and the last file this task
+		 * initiated RW or writeback on, which is highly likely to
+		 * be the file this IO is for.  As a sanity check, trust
+		 * last_ino only for pre-defined duration.
+		 *
+		 * When acquiring stack trace, skip this function and
+		 * generic_make_request[_checks]()
+		 */
+		unsigned long now = jiffies;
+
+		io->pid = role->pid;
+
+		if (!iob_ignore_ino &&
+		    time_before_eq(role->last_ino_jiffies, now) &&
+		    now - role->last_ino_jiffies <= IOB_LAST_INO_DURATION) {
+			io->dev = role->last_ino.dev;
+			io->ino = role->last_ino.ino;
+			io->gen = role->last_ino.gen;
+		} else {
+			io->dev = bio->bi_bdev->bd_dev;
+			io->ino = 0;
+			io->gen = 0;
+		}
+
+		intent = iob_current_intent(2);
+	}
+
+	/* apply intent modifier and store nr */
+	intent = iob_modified_intent(intent, role->modifier);
+	io->intent = intent->node.id.f.nr;
+}
+
+/* when bios get merged, charge everything to the first bio */
+static void iob_probe_block_bio_backmerge(void *data, struct request_queue *q,
+					  struct request *rq, struct bio *bio)
+{
+	struct bio *mbio = rq->bio;
+	struct iob_io_info *mio = &mbio->bi_iob_info;
+	struct iob_io_info *sio = &bio->bi_iob_info;
+
+	mio->size += sio->size;
+	sio->size = 0;
+}
+
+/* when bios get merged, charge everything to the first bio */
+static void iob_probe_block_bio_frontmerge(void *data, struct request_queue *q,
+					   struct request *rq, struct bio *bio)
+{
+	struct bio *mbio = rq->bio;
+	struct iob_io_info *mio = &mbio->bi_iob_info;
+	struct iob_io_info *sio = &bio->bi_iob_info;
+	size_t msize = mio->size;
+
+	*mio = *sio;
+	mio->size += msize;
+	sio->size = 0;
+}
+
+/* record issue timestamp, this may not happen for bio based drivers */
+static void iob_probe_block_rq_issue(void *data, struct request_queue *q,
+				     struct request *rq)
+{
+	if (rq->bio && rq->bio->bi_iob_info.size)
+		rq->bio->bi_iob_info.issued_at = local_clock();
+}
+
+/* bio is complete, report and accumulate statistics */
+static void iob_probe_block_bio_complete(void *data, struct request_queue *q,
+					 struct bio *bio, int error)
+{
+	/* kick the TP */
+	trace_ioblame_io(bio);
+}
+
+/* %current is exiting, shoot down its role */
+static void iob_probe_block_sched_process_exit(void *data,
+					       struct task_struct *task)
+{
+	WARN_ON_ONCE(task != current);
+	iob_reclaim_current_role();
+}
+
+
+/**
+ * iob_disable - disable ioblame
+ *
+ * Master disable.  Stop ioblame, unregister all hooks and free all
+ * resources.
+ */
+static void iob_disable(void)
+{
+	const int gang_nr = 16;
+	unsigned long indices[gang_nr];
+	void **slots[gang_nr];
+	unsigned long base_idx = 0;
+	int i, nr;
+
+	mutex_lock(&iob_mutex);
+
+	/* if enabled, disable reclaim and unregister all hooks */
+	if (iob_enabled) {
+		cancel_delayed_work_sync(&iob_reclaim_work);
+		cancel_work_sync(&iob_intent_notify_work);
+		iob_enabled = false;
+
+		unregister_trace_vfs_fcheck(iob_probe_vfs_fcheck, NULL);
+		unregister_trace_writeback_dirty_page(iob_probe_wb_dirty_page, NULL);
+		unregister_trace_writeback_start(iob_probe_wb_start, NULL);
+		unregister_trace_writeback_written(iob_probe_wb_written, NULL);
+		unregister_trace_writeback_single_inode_start(iob_probe_wb_single_inode_start, NULL);
+		unregister_trace_writeback_dirty_inode_start(iob_probe_wb_dirty_inode_start, NULL);
+		unregister_trace_writeback_dirty_inode(iob_probe_wb_dirty_inode, NULL);
+		unregister_trace_writeback_write_inode_start(iob_probe_wb_write_inode_start, NULL);
+		unregister_trace_writeback_write_inode(iob_probe_wb_write_inode, NULL);
+		unregister_trace_block_touch_buffer(iob_probe_block_touch_buffer, NULL);
+		unregister_trace_block_bio_queue(iob_probe_block_bio_queue, NULL);
+		unregister_trace_block_bio_backmerge(iob_probe_block_bio_backmerge, NULL);
+		unregister_trace_block_bio_frontmerge(iob_probe_block_bio_frontmerge, NULL);
+		unregister_trace_block_rq_issue(iob_probe_block_rq_issue, NULL);
+		unregister_trace_block_bio_complete(iob_probe_block_bio_complete, NULL);
+		unregister_trace_sched_process_exit(iob_probe_block_sched_process_exit, NULL);
+
+		/* and drain all in-flight users */
+		tracepoint_synchronize_unregister();
+	}
+
+	/*
+	 * At this point, we're sure that nobody is executing iob hooks.
+	 * Free all resources.
+	 */
+	for (i = 0; i < ARRAY_SIZE(iob_act_used_bitmaps); i++) {
+		vfree(iob_act_used_bitmaps[i]);
+		iob_act_used_bitmaps[i] = NULL;
+	}
+
+	if (iob_role_idx)
+		iob_idx_destroy(iob_role_idx);
+	if (iob_intent_idx)
+		iob_idx_destroy(iob_intent_idx);
+	if (iob_act_idx)
+		iob_idx_destroy(iob_act_idx);
+	iob_role_idx = iob_intent_idx = iob_act_idx = NULL;
+
+	while ((nr = radix_tree_gang_lookup_slot(&iob_pgtree, slots, indices,
+						 base_idx, gang_nr))) {
+		for (i = 0; i < nr; i++) {
+			free_page((unsigned long)*slots[i]);
+			radix_tree_delete(&iob_pgtree, indices[i]);
+		}
+		base_idx = indices[nr - 1] + 1;
+	}
+
+	mutex_unlock(&iob_mutex);
+}
+
+/**
+ * iob_enable - enable ioblame
+ *
+ * Master enable.  Set up all resources and enable ioblame.  Returns 0 on
+ * success, -errno on failure.
+ */
+static int iob_enable(void)
+{
+	int i, err;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled)
+		goto out;
+
+	/* determine pgtree params from iob_max_acts */
+	iob_pgtree_shift = iob_max_acts <= USHRT_MAX ? 1 : 2;
+	iob_pgtree_pfn_shift = PAGE_SHIFT - iob_pgtree_shift;
+	iob_pgtree_pfn_mask = (1 << iob_pgtree_pfn_shift) - 1;
+
+	/* create iob_idx'es and allocate act used bitmaps */
+	err = -ENOMEM;
+	iob_role_idx = iob_idx_create(&iob_role_idx_type, iob_max_roles);
+	iob_intent_idx = iob_idx_create(&iob_intent_idx_type, iob_max_intents);
+	iob_act_idx = iob_idx_create(&iob_act_idx_type, iob_max_acts);
+
+	if (!iob_role_idx || !iob_intent_idx || !iob_act_idx)
+		goto out;
+
+	for (i = 0; i < ARRAY_SIZE(iob_act_used_bitmaps); i++) {
+		iob_act_used_bitmaps[i] = vzalloc(sizeof(unsigned long) *
+						  BITS_TO_LONGS(iob_max_acts));
+		if (!iob_act_used_bitmaps[i])
+			goto out;
+	}
+
+	iob_role_reclaim_seq = 0;
+	iob_act_used.front = iob_act_used_bitmaps[0];
+	iob_act_used.back = iob_act_used_bitmaps[1];
+
+	/* register hooks */
+	err = register_trace_vfs_fcheck(iob_probe_vfs_fcheck, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_page(iob_probe_wb_dirty_page, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_start(iob_probe_wb_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_written(iob_probe_wb_written, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_single_inode_start(iob_probe_wb_single_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_inode_start(iob_probe_wb_dirty_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_inode(iob_probe_wb_dirty_inode, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_write_inode_start(iob_probe_wb_write_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_write_inode(iob_probe_wb_write_inode, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_touch_buffer(iob_probe_block_touch_buffer, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_queue(iob_probe_block_bio_queue, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_backmerge(iob_probe_block_bio_backmerge, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_frontmerge(iob_probe_block_bio_frontmerge, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_rq_issue(iob_probe_block_rq_issue, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_complete(iob_probe_block_bio_complete, NULL);
+	if (err)
+		goto out;
+	err = register_trace_sched_process_exit(iob_probe_block_sched_process_exit, NULL);
+	if (err)
+		goto out;
+
+	/* wait until everything becomes visible */
+	synchronize_sched();
+	/* and go... */
+	iob_enabled = true;
+	queue_delayed_work(system_nrt_wq, &iob_reclaim_work,
+			   iob_ttl_secs * HZ / 2);
+out:
+	mutex_unlock(&iob_mutex);
+
+	if (iob_enabled)
+		return 0;
+	iob_disable();
+	return err;
+}
+
+/* ioblame/{max_*|ttl_secs} - uint tunables */
+static int iob_uint_get(void *data, u64 *val)
+{
+	*val = *(unsigned int *)data;
+	return 0;
+}
+
+static int __iob_uint_set(void *data, u64 val, bool must_be_disabled)
+{
+	if (val > INT_MAX)
+		return -EINVAL;
+
+	mutex_lock(&iob_mutex);
+	if (must_be_disabled && iob_enabled) {
+		mutex_unlock(&iob_mutex);
+		return -EBUSY;
+	}
+
+	*(unsigned int *)data = val;
+
+	mutex_unlock(&iob_mutex);
+
+	return 0;
+}
+
+/* max params must not be manipulated while enabled */
+static int iob_uint_set_disabled(void *data, u64 val)
+{
+	return __iob_uint_set(data, val, true);
+}
+
+/* ttl can be changed anytime */
+static int iob_uint_set(void *data, u64 val)
+{
+	return __iob_uint_set(data, val, false);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(iob_uint_fops_disabled, iob_uint_get,
+			iob_uint_set_disabled, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(iob_uint_fops, iob_uint_get, iob_uint_set, "%llu\n");
+
+/* bool - ioblame/ignore_ino, also used for ioblame/enable */
+static ssize_t iob_bool_read(struct file *file, char __user *ubuf,
+			     size_t count, loff_t *ppos)
+{
+	bool *boolp = file->f_dentry->d_inode->i_private;
+	const char *str = *boolp ? "Y\n" : "N\n";
+
+	return simple_read_from_buffer(ubuf, count, ppos, str, strlen(str));
+}
+
+static ssize_t __iob_bool_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos, bool *boolp)
+{
+	char buf[32] = { };
+	int err;
+
+	if (copy_from_user(buf, ubuf, min(count, sizeof(buf) - 1)))
+		return -EFAULT;
+
+	err = strtobool(buf, boolp);
+	if (err)
+		return err;
+
+	return err ?: count;
+}
+
+static ssize_t iob_bool_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	return __iob_bool_write(file, ubuf, count, ppos,
+				file->f_dentry->d_inode->i_private);
+}
+
+static const struct file_operations iob_bool_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iob_bool_read,
+	.write		= iob_bool_write,
+};
+
+/* u64 fops, used for stats */
+static int iob_u64_get(void *data, u64 *val)
+{
+	*val = *(u64 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(iob_stats_fops, iob_u64_get, NULL, "%llu\n");
+
+/* used to export nr_nodes of each iob_idx */
+static int iob_nr_nodes_get(void *data, u64 *val)
+{
+	struct iob_idx **idxp = data;
+
+	*val = 0;
+	mutex_lock(&iob_mutex);
+	if (*idxp)
+		*val = (*idxp)->nr_nodes;
+	mutex_unlock(&iob_mutex);
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(iob_nr_nodes_fops, iob_nr_nodes_get, NULL, "%llu\n");
+
+/*
+ * ioblame/devs - per device enable switch, accepts block device kernel
+ * name, "maj:min" or "*" for all devices.  Prefix '!' to disable.  Opening
+ * w/ O_TRUNC also disables ioblame for all devices.
+ */
+static void iob_enable_all_devs(bool enable)
+{
+	struct disk_iter diter;
+	struct gendisk *disk;
+
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter)))
+		disk->iob_enabled = enable;
+	disk_iter_exit(&diter);
+}
+
+static void *iob_devs_seq_start(struct seq_file *seqf, loff_t *pos)
+{
+	loff_t skip = *pos;
+	struct disk_iter *diter;
+	struct gendisk *disk;
+
+	diter = kmalloc(sizeof(*diter), GFP_KERNEL);
+	if (!diter)
+		return ERR_PTR(-ENOMEM);
+
+	seqf->private = diter;
+	disk_iter_init(diter);
+
+	/* skip to the current *pos */
+	do {
+		disk = disk_iter_next(diter);
+		if (!disk)
+			return NULL;
+	} while (skip--);
+
+	/* skip to the first iob_enabled disk */
+	while (disk && !disk->iob_enabled) {
+		(*pos)++;
+		disk = disk_iter_next(diter);
+	}
+
+	return disk;
+}
+
+static void *iob_devs_seq_next(struct seq_file *seqf, void *v, loff_t *pos)
+{
+	/* skip to the next iob_enabled disk */
+	while (true) {
+		struct gendisk *disk;
+
+		(*pos)++;
+		disk = disk_iter_next(seqf->private);
+		if (!disk)
+			return NULL;
+
+		if (disk->iob_enabled)
+			return disk;
+	}
+}
+
+static int iob_devs_seq_show(struct seq_file *seqf, void *v)
+{
+	struct gendisk *disk = v;
+	dev_t dev = disk_devt(disk);
+
+	seq_printf(seqf, "%u:%u %s\n", MAJOR(dev), MINOR(dev),
+		   disk->disk_name);
+	return 0;
+}
+
+static void iob_devs_seq_stop(struct seq_file *seqf, void *v)
+{
+	struct disk_iter *diter = seqf->private;
+
+	/* stop is called even after start failed :-( */
+	if (diter) {
+		disk_iter_exit(diter);
+		kfree(diter);
+	}
+}
+
+static ssize_t iob_devs_write(struct file *file, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	char *buf = NULL, *p = NULL, *last_tok = NULL, *tok;
+	int err;
+
+	if (!cnt)
+		return 0;
+
+	err = -ENOMEM;
+	buf = vmalloc(cnt + 1);
+	if (!buf)
+		goto out;
+
+	err = -EFAULT;
+	if (copy_from_user(buf, ubuf, cnt))
+		goto out;
+	buf[cnt] = '\0';
+
+	err = 0;
+	p = buf;
+	while ((tok = strsep(&p, " \t\r\n"))) {
+		bool enable = true;
+		int partno = 0;
+		struct gendisk *disk;
+		unsigned maj, min;
+		dev_t devt;
+
+		tok = strim(tok);
+		if (!strlen(tok))
+			continue;
+
+		if (tok[0] == '!') {
+			enable = false;
+			tok++;
+		}
+
+		if (!strcmp(tok, "*")) {
+			iob_enable_all_devs(enable);
+			last_tok = tok;
+			continue;
+		}
+
+		if (sscanf(tok, "%u:%u", &maj, &min) == 2)
+			devt = MKDEV(maj, min);
+		else
+			devt = blk_lookup_devt(tok, 0);
+
+		disk = get_gendisk(devt, &partno);
+		if (!disk || partno) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		disk->iob_enabled = enable;
+		put_disk(disk);
+		last_tok = tok;
+	}
+out:
+	vfree(buf);
+	if (!err)
+		return cnt;
+	if (last_tok)
+		return last_tok + strlen(last_tok) - buf;
+	return err;
+}
+
+static const struct seq_operations iob_devs_sops = {
+	.start		= iob_devs_seq_start,
+	.next		= iob_devs_seq_next,
+	.show		= iob_devs_seq_show,
+	.stop		= iob_devs_seq_stop,
+};
+
+static int iob_devs_seq_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
+		iob_enable_all_devs(false);
+
+	return seq_open(file, &iob_devs_sops);
+}
+
+static const struct file_operations iob_devs_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iob_devs_seq_open,
+	.read		= seq_read,
+	.write		= iob_devs_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+/*
+ * ioblame/enable - master enable switch
+ */
+static ssize_t iob_enable_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	bool enable;
+	ssize_t ret;
+	int err = 0;
+
+	ret = __iob_bool_write(file, ubuf, count, ppos, &enable);
+	if (ret < 0)
+		return ret;
+
+	if (enable)
+		err = iob_enable();
+	else
+		iob_disable();
+
+	return err ?: ret;
+}
+
+static const struct file_operations iob_enable_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iob_bool_read,
+	.write		= iob_enable_write,
+};
+
+/*
+ * Print helpers.
+ */
+#define iob_print(p, e, fmt, args...)	(p + scnprintf(p, e - p, fmt , ##args))
+
+static char *iob_print_intent(char *p, char *e, struct iob_intent *intent,
+			      const char *header)
+{
+	int i;
+
+	p = iob_print(p, e, "%s#%d modifier=0x%x\n", header,
+		      intent->node.id.f.nr, intent->modifier);
+	for (i = 0; i < intent->depth; i++)
+		p = iob_print(p, e, "%s[%p] %pF\n", header,
+			      (void *)intent->trace[i],
+			      (void *)intent->trace[i]);
+	return p;
+}
+
+
+/*
+ * ioblame/intents - export intents to userland.
+ *
+ * Userland can acquire intents by reading ioblame/intents.
+ *
+ * While iob is enabled, intents are never reclaimed, intent nr is
+ * guaranteed to be allocated consecutively in ascending order and the
+ * intents file is lseekable by intent nr, so userland tools which want
+ * to learn about new intents since last reading can simply seek to the
+ * number of currently known intents and start reading from there.
+ *
+ * The file generates at least one size changed notification after a new
+ * intent is created.
+ */
+static void iob_intent_notify_workfn(struct work_struct *work)
+{
+	struct iattr iattr = (struct iattr){ .ia_valid = ATTR_SIZE };
+
+	/*
+	 * Invoked after new intent is created, kick bogus size changed
+	 * notification.
+	 */
+	notify_change(iob_intents_dentry, &iattr);
+}
+
+static loff_t iob_intents_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t ret = -EIO;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled) {
+		/*
+		 * We seek by intent nr and don't care about i_size.
+		 * Temporarily set i_size to nr_nodes and hitch on generic
+		 * llseek.
+		 */
+		i_size_write(file->f_dentry->d_inode, iob_intent_idx->nr_nodes);
+		ret = generic_file_llseek(file, offset, origin);
+		i_size_write(file->f_dentry->d_inode, 0);
+	}
+
+	mutex_unlock(&iob_mutex);
+	return ret;
+}
+
+static ssize_t iob_intents_read(struct file *file, char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	char *buf, *p, *e;
+	int err;
+
+	if (count < PAGE_SIZE)
+		return -EINVAL;
+
+	err = -EIO;
+	mutex_lock(&iob_mutex);
+	if (!iob_enabled)
+		goto out;
+
+	p = buf = iob_page_buf;
+	e = p + PAGE_SIZE;
+
+	err = 0;
+	if (*ppos >= iob_intent_idx->nr_nodes)
+		goto out;
+
+	/* print to buf */
+	rcu_read_lock_sched();
+	p = iob_print_intent(p, e, iob_intent_by_nr(*ppos), "");
+	rcu_read_unlock_sched();
+	WARN_ON_ONCE(p == e);
+
+	/* copy out */
+	err = -EFAULT;
+	if (copy_to_user(ubuf, buf, p - buf))
+		goto out;
+
+	(*ppos)++;
+	err = 0;
+out:
+	mutex_unlock(&iob_mutex);
+	return err ?: p - buf;
+}
+
+static const struct file_operations iob_intents_fops = {
+	.owner		= THIS_MODULE,
+	.open		= generic_file_open,
+	.llseek		= iob_intents_llseek,
+	.read		= iob_intents_read,
+};
+
+
+static int __init ioblame_init(void)
+{
+	struct dentry *stats_dir;
+
+	BUILD_BUG_ON((1 << IOB_TYPE_BITS) < IOB_NR_TYPES);
+	BUILD_BUG_ON(IOB_NR_BITS + IOB_GEN_BITS + IOB_TYPE_BITS != 64);
+
+	iob_role_cache = KMEM_CACHE(iob_role, 0);
+	iob_act_cache = KMEM_CACHE(iob_act, 0);
+	if (!iob_role_cache || !iob_act_cache)
+		goto fail;
+
+	/* create ioblame/ dirs and files */
+	iob_dir = debugfs_create_dir("ioblame", NULL);
+	if (!iob_dir)
+		goto fail;
+
+	if (!debugfs_create_file("max_roles", 0600, iob_dir, &iob_max_roles, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("max_intents", 0600, iob_dir, &iob_max_intents, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("max_acts", 0600, iob_dir, &iob_max_acts, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("ttl_secs", 0600, iob_dir, &iob_ttl_secs, &iob_uint_fops) ||
+	    !debugfs_create_file("ignore_ino", 0600, iob_dir, &iob_ignore_ino, &iob_bool_fops) ||
+	    !debugfs_create_file("devs", 0600, iob_dir, NULL, &iob_devs_fops) ||
+	    !debugfs_create_file("enable", 0600, iob_dir, &iob_enabled, &iob_enable_fops) ||
+	    !debugfs_create_file("nr_roles", 0400, iob_dir, &iob_role_idx, &iob_nr_nodes_fops) ||
+	    !debugfs_create_file("nr_intents", 0400, iob_dir, &iob_intent_idx, &iob_nr_nodes_fops) ||
+	    !debugfs_create_file("nr_acts", 0400, iob_dir, &iob_act_idx, &iob_nr_nodes_fops))
+		goto fail;
+
+	iob_intents_dentry = debugfs_create_file("intents", 0400, iob_dir, NULL, &iob_intents_fops);
+	if (!iob_intents_dentry)
+		goto fail;
+
+	stats_dir = debugfs_create_dir("stats", iob_dir);
+	if (!stats_dir)
+		goto fail;
+
+	if (!debugfs_create_file("idx_nomem", 0400, stats_dir, &iob_stats.idx_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("idx_nospc", 0400, stats_dir, &iob_stats.idx_nospc, &iob_stats_fops) ||
+	    !debugfs_create_file("node_nomem", 0400, stats_dir, &iob_stats.node_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("pgtree_nomem", 0400, stats_dir, &iob_stats.pgtree_nomem, &iob_stats_fops))
+		goto fail;
+
+	return 0;
+
+fail:
+	if (iob_role_cache)
+		kmem_cache_destroy(iob_role_cache);
+	if (iob_act_cache)
+		kmem_cache_destroy(iob_act_cache);
+	if (iob_dir)
+		debugfs_remove_recursive(iob_dir);
+	return -ENOMEM;
+}
+
+static void __exit ioblame_exit(void)
+{
+	iob_disable();
+	debugfs_remove_recursive(iob_dir);
+	kmem_cache_destroy(iob_role_cache);
+	kmem_cache_destroy(iob_act_cache);
+}
+
+module_init(ioblame_init);
+module_exit(ioblame_exit);
+
+MODULE_AUTHOR("Tejun Heo <tj@kernel.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("IO monitor with dirtier and issuer tracking");
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* RE: [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-10 18:28 ` [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking Tejun Heo
@ 2012-01-11  0:25   ` Chanho Park
  2012-01-11  1:04     ` Tejun Heo
  2012-01-11  1:32   ` [PATCH RESEND " Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Chanho Park @ 2012-01-11  0:25 UTC (permalink / raw)
  To: 'Tejun Heo',
	axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung

Hello,

> +#define CREATE_TRACE_POINTS
> +#include <trace/events/ioblame.h>

I can't find it in the patchset, so when compiling I get the error below.

CC      kernel/trace/power-traces.o
kernel/trace/ioblame.c:308: fatal error: trace/events/ioblame.h: No such
file or directory
compilation terminated.
make[2]: *** [kernel/trace/ioblame.o] Error 1
make[2]: *** Waiting for unfinished jobs....

Did you forget to attach it, or is it intentionally left out of the RFC?

Best regards,
Chanho Park


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-11  0:25   ` Chanho Park
@ 2012-01-11  1:04     ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-11  1:04 UTC (permalink / raw)
  To: Chanho Park
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp, linux-kernel, winget, namhyung

Hello,

On Tue, Jan 10, 2012 at 4:25 PM, Chanho Park <chanho61.park@samsung.com> wrote:
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/ioblame.h>
>
> I can't find it in the patchset, so when compiling I get the error below.
>
> CC      kernel/trace/power-traces.o
> kernel/trace/ioblame.c:308: fatal error: trace/events/ioblame.h: No such
> file or directory
> compilation terminated.
> make[2]: *** [kernel/trace/ioblame.o] Error 1
> make[2]: *** Waiting for unfinished jobs....
>
> Did you forget to attach it, or is it intentionally left out of the RFC?

Eh... crap. That's me forgetting to do quilt add. :( I'll resend shortly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-10 18:28 ` [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking Tejun Heo
  2012-01-11  0:25   ` Chanho Park
@ 2012-01-11  1:32   ` Tejun Heo
  2012-01-11  6:15     ` Namhyung Kim
  2012-01-11 18:08     ` Tejun Heo
  1 sibling, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-11  1:32 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Chanho Park

Implement ioblame, which can attribute each IO to its origin and
export the information using a tracepoint.

Operations which may eventually cause IOs and IO operations themselves
are identified and tracked primarily by their stack traces along with
the task and the target file (dev:ino:gen).  On each IO completion,
ioblame knows why that specific IO happened and exports the
information via ioblame:ioblame_io tracepoint.

While ioblame adds fields to a few fs and block layer objects, all
logic is well insulated inside ioblame proper and all hooking goes
through well defined tracepoints and doesn't add any significant
maintenance overhead.

For details, please read Documentation/trace/ioblame.txt.

-v2: Namhyung pointed out that all the information available at IO
     completion can be exported via tracepoint and letting userland do
     whatever it wants to do with that would be better.  Stripped out
     in-kernel statistics gathering.

     Now that everything is exported through tracepoint, iolog and
     counters_pipe[_pipe] are unnecessary.  Removed.  intents_bin too
     is removed.

     As data collection no longer requires polling, ioblame/intents is
     updated to generate an inotify IN_MODIFY event after a new intent
     is created.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Justin TerAvest <teravest@google.com>
Cc: Slava Pestov <slavapestov@google.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Jim Winget <winget@google.com>
---
include/trace/events/ioblame.h was missing from the patch.  git branch
also updated.

Thanks.

 Documentation/trace/ioblame.txt |  476 ++++++++
 include/linux/blk_types.h       |    4 +
 include/linux/fs.h              |    3 +
 include/linux/genhd.h           |    3 +
 include/linux/ioblame.h         |   72 ++
 include/trace/events/ioblame.h  |   94 ++
 kernel/trace/Kconfig            |   12 +
 kernel/trace/Makefile           |    1 +
 kernel/trace/ioblame.c          | 2279 +++++++++++++++++++++++++++++++++++++++
 9 files changed, 2944 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/ioblame.txt
 create mode 100644 include/linux/ioblame.h
 create mode 100644 include/trace/events/ioblame.h
 create mode 100644 kernel/trace/ioblame.c

diff --git a/Documentation/trace/ioblame.txt b/Documentation/trace/ioblame.txt
new file mode 100644
index 0000000..cd72f29
--- /dev/null
+++ b/Documentation/trace/ioblame.txt
@@ -0,0 +1,476 @@
+
+ioblame - IO tracer with origin tracking
+
+December, 2011		Tejun Heo <tj@kernel.org>
+
+
+CONTENTS
+
+1. Introduction
+2. Overall design
+3. Debugfs interface
+3-1. Configuration
+3-2. Stats and intents
+4. Trace examples
+5. Notes
+6. Overheads
+
+
+1. Introduction
+
+In many workloads, IO throughput and latency have a large effect on
+overall performance; however, due to the complexity and asynchronous
+nature of IO, it is very difficult to characterize what's going on.
+blktrace and various tracepoints provide visibility into individual IO
+operations but it is still extremely difficult to trace back to the
+origin of those IO operations.
+
+ioblame is an IO tracer which tracks the origin of each IO.  It keeps
+track of who dirtied pages and inodes, and, on an actual IO, attributes
+it to the originator of the IO.  All the information ioblame collects
+is exported via the ioblame:ioblame_io tracepoint on each IO
+completion.
+
+The design goals of ioblame are
+
+* Minimally invasive - The tracer shouldn't be invasive.  Except for
+  adding some fields, mostly to block layer data structures, for
+  tracking, ioblame gathers all information through well defined
+  tracepoints and all tracking logic is contained in ioblame proper.
+
+* Generic and detailed - There are many different IO paths and
+  filesystems, which also go through changes regularly.  The tracer
+  should be able to report detailed enough results covering most cases
+  without requiring frequent adaptation.  ioblame uses stack traces at
+  key points combined with information from generic layers to
+  categorize IOs.  This gives detailed enough insight into varying IO
+  paths without requiring specific adaptations.
+
+* Low overhead - Overhead, both in terms of memory and processor
+  cycles, should be low enough that the analyzer can be used in
+  IO-heavy production environments.  ioblame keeps hot data structures
+  compact and mostly read-only and avoids synchronization on hot paths
+  by using RCU and taking advantage of the fact that the statistics
+  don't have to be completely accurate.
+
+
+2. Overall design
+
+ioblame tracks the following three object types.
+
+* Role: This tracks 'who' is taking an action.  Corresponds to a
+  thread.
+
+* Intent: Stack trace + modifier.  An intent groups actions of the
+  same type.  As the name suggests, modifier modifies the intent and
+  there can be multiple intents with the same stack trace but
+  different modifiers.  Currently, only writeback modifiers are
+  implemented which denote why the writeback action is occurring -
+  ie. wb_reason.
+
+* Act: This is a combination of role, intent and the inode being
+  operated on.  It is not visible to userland and is used internally
+  to track the dirtier and its intent in compact form.
+
+ioblame uses the same indexing data structure for all three types of
+objects.  Objects are never linked directly using pointers and every
+access goes through the index.  This avoids expensive strict object
+lifetime management.  Objects are located either by their content via
+hash table or by id, which contains a generation number.
+
+To attribute data writebacks to the originator, ioblame maintains a
+table indexed by page frame number which keeps track of which act
+dirtied which pages.  For each IO, the target pages are looked up in
+the table and the dirtying act is charged for the IO.  Note that,
+currently, each IO is charged as a whole to a single act - e.g. all of
+an IO for writeback encompassing multiple dirtiers will be charged to
+the first found dirtying act.  This simplifies data collection and
+reporting while not losing too much information - writebacks tend to
+be naturally grouped and IOPS (IO operations per second) are often
+more significant than the length of each IO.
+
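+For example, if task A dirties the first half of a file's pages and
+task B the second half, a single writeback IO spanning all of them is
+charged entirely to whichever of the two dirtying acts is found first.
+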
+inode writeback tracking is more involved as different filesystems
+handle metadata updates and writebacks differently.  ioblame uses
+per-inode and buffer_head operation tracking to attribute inode
+writebacks to their originators.
+
+On each IO completion, ioblame knows the offset and size of the IO,
+who's responsible and its intent, how long it took in the queue and
+the target file.  This information is reported via ioblame:ioblame_io
+tracepoint.
+
+Except for the tracepoint, all interactions happen using files under
+/sys/kernel/debug/ioblame/.
+
+
+3. Debugfs interface
+
+3-1. Configuration
+
+* enable			- can be changed anytime
+
+  Master enable.  Write [Yy1] to enable, [Nn0] to disable.
+
+* devs				- can be changed anytime
+
+  Specifies the devices ioblame is enabled for.  ioblame will only
+  track operations on devices which are explicitly enabled in this
+  file.
+
+  It accepts a whitespace-separated list of MAJ:MINs or block device
+  names, each with an optional preceding '!' for negation.  Opening
+  with O_TRUNC clears all existing entries.  For example,
+
+  $ echo sda sdb > devs		# disables all devices and then enable sd[ab]
+  $ echo sdc >> devs		# sd[abc] enabled
+  $ echo !8:0 >> devs		# sd[bc] enabled
+  $ cat devs
+  8:16 sdb
+  8:32 sdc
+
+* max_{role|intent|act}s	- can be changed while disabled
+
+  Specifies the maximum number of each object type.  If the number of
+  objects of a certain type exceeds the limit, IOs will be attributed
+  to the special NOMEM object.
+
+* ttl_secs			- can be changed anytime
+
+  Specifies TTL of roles and acts.  Roles are reclaimed after at least
+  TTL has passed after the matching thread has exited or execed and
+  assumed another tid.  Acts are reclaimed after being unused for at
+  least TTL.
+
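+For example, assuming debugfs is mounted at /sys/kernel/debug and the
+tracepoint is consumed through the standard ftrace event interface, a
+typical session might look like the following.
+
+  $ cd /sys/kernel/debug
+  $ echo sdb > ioblame/devs
+  $ echo 1 > ioblame/enable
+  $ echo 1 > tracing/events/ioblame/ioblame_io/enable
+  $ cat tracing/trace_pipe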
+
+3-2. Stats and intents (read only)
+
+* nr_{roles|intents|acts}
+
+  Returns the number of objects of the type.  The number of roles and
+  acts can decrease after reclaiming but nr_intents only increases
+  while ioblame is enabled.
+
+* stats/idx_nomem
+
+  How many times role, intent or act creation failed because memory
+  allocation failed while extending the index to accommodate a new
+  object.
+
+* stats/idx_nospc
+
+  How many times role, intent or act creation failed because the limit
+  specified by max_{roles|intents|acts} was reached.
+
+* stats/node_nomem
+
+  How many times allocation of a role, intent or act object itself
+  failed.
+
+* stats/pgtree_nomem
+
+  How many times page tree, which maps page frame number to dirtying
+  act, failed to expand due to memory allocation failure.
+
+* intents
+
+  Dump of intents.
+
+  $ cat intents
+  #0 modifier=0x0
+  #1 modifier=0x0
+  #2 modifier=0x0
+  [ffffffff81189a6a] file_update_time+0xca/0x150
+  [ffffffff81122030] __generic_file_aio_write+0x200/0x460
+  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
+  [ffffffff8122ea94] ext4_file_write+0x64/0x280
+  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
+  [ffffffff811b7401] aio_run_iocb+0x61/0x190
+  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
+  [ffffffff811b867b] sys_io_submit+0xb/0x10
+  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  #3 modifier=0x0
+  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
+  [ffffffff812353b2] ext4_direct_IO+0x1b2/0x3f0
+  [ffffffff81121d6a] generic_file_direct_write+0xba/0x180
+  [ffffffff8112210b] __generic_file_aio_write+0x2db/0x460
+  [ffffffff81122301] generic_file_aio_write+0x71/0xe0
+  [ffffffff8122ea94] ext4_file_write+0x64/0x280
+  [ffffffff811b5d24] aio_rw_vect_retry+0x74/0x1d0
+  [ffffffff811b7401] aio_run_iocb+0x61/0x190
+  [ffffffff811b81c8] do_io_submit+0x648/0xaf0
+  [ffffffff811b867b] sys_io_submit+0xb/0x10
+  [ffffffff81a3c8bb] system_call_fastpath+0x16/0x1b
+  #4 modifier=0x0
+  [ffffffff811aaf2e] __blockdev_direct_IO+0x1f1e/0x37c0
+  [ffffffff8126da71] ext4_ind_direct_IO+0x121/0x460
+  [ffffffff81235436] ext4_direct_IO+0x236/0x3f0
+  [ffffffff81122db2] generic_file_aio_read+0x6b2/0x740
+  ...
+
+  The # prefixed number is the NR of the intent, which matches the
+  intent field in ioblame:ioblame_io output.  Modifier and stack trace
+  follow.  The first two entries are special - 0 is the nomem intent
+  and 1 is the lost intent.  The former is used when an intent can't
+  be created because allocation failed or max_intents is reached.  The
+  latter is used when reclaiming resulted in loss of tracking info and
+  the IO can't be reported exactly.
+
+  This file is seekable by intent NR, ie. seeking to 3 and reading
+  will return intent #3 and after.  Because intents are never
+  destroyed while ioblame is enabled, this allows userland tools to
+  discover new intents since the last reading.  Seeking to the number
+  of currently known intents and reading returns only the newly
+  created intents.
+
+  At least one inotify IN_MODIFY event is generated after a new intent
+  is created.
+
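+  For illustration, a minimal userland reader might look like the
+  following sketch.  It assumes the interface described above plus one
+  detail visible in the implementation: each read(2) returns a single
+  intent and advances the file position, which is the intent NR, by
+  one.
+
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <unistd.h>
+  #include <sys/inotify.h>
+
+  int main(void)
+  {
+          const char *path = "/sys/kernel/debug/ioblame/intents";
+          char buf[65536];                /* read size must be >= PAGE_SIZE */
+          long known = 0;                 /* number of intents already seen */
+          int fd = open(path, O_RDONLY);
+          int ifd = inotify_init();
+
+          if (fd < 0 || ifd < 0 ||
+              inotify_add_watch(ifd, path, IN_MODIFY) < 0)
+                  return 1;
+
+          for (;;) {
+                  ssize_t len;
+
+                  /* seek to the first unseen intent and read the new ones */
+                  lseek(fd, known, SEEK_SET);
+                  while ((len = read(fd, buf, sizeof(buf) - 1)) > 0) {
+                          buf[len] = '\0';
+                          fputs(buf, stdout);
+                          known++;
+                  }
+                  /* block until the next IN_MODIFY notification */
+                  read(ifd, buf, sizeof(buf));
+          }
+  }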
+
+4. Trace examples
+
+All information ioblame gathers is available through the
+ioblame:ioblame_io tracing event.  The outputs in the following
+examples are reformatted and annotated.
+
+4-1. ls, touch and sync - on an ext4 FS w/o journal
+
+- sector=69896 size=4096 rw=META|PRIO wait_nsec=45244 io_nsec=11263878
+  pid=952 intent=8 dev=8:17 ino=2 gen=0
+
+  pid 952 (ls) issues 4k META|PRIO read on /dev/sdb1's root directory
+  with intent 8 to read directory entries.
+
+  #8 modifier=0x0
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811c6b9b] submit_bh+0xeb/0x130
+  [ffffffff811c851e] ll_rw_block+0xae/0xb0
+  [ffffffff81265703] ext4_bread+0x43/0x80
+  [ffffffff8126b458] htree_dirblock_to_tree+0x38/0x190
+  [ffffffff8126b655] ext4_htree_fill_tree+0xa5/0x260
+  [ffffffff81259c76] ext4_readdir+0x116/0x5e0
+  [ffffffff811a7ec0] vfs_readdir+0xb0/0xd0
+  [ffffffff811a8049] sys_getdents+0x89/0xf0
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=4232 size=4096 rw= wait_nsec=69052 io_nsec=475710
+  pid=953 intent=14 dev=8:16 ino=0 gen=0
+
+  pid 953 (touch) issues 4k read with intent 14 during open(2).
+
+  #14 modifier=0x0
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811c6b9b] submit_bh+0xeb/0x130
+  [ffffffff811c8425] bh_submit_read+0x35/0x80
+  [ffffffff8125b29b] ext4_read_inode_bitmap+0x18b/0x3f0
+  [ffffffff8125bf85] ext4_new_inode+0x355/0x10b0
+  [ffffffff81269a7a] ext4_create+0x9a/0x120
+  [ffffffff811a366c] vfs_create+0x8c/0xe0
+  [ffffffff811a4616] do_last+0x776/0x8e0
+  [ffffffff811a4858] path_openat+0xd8/0x410
+  [ffffffff811a4ca9] do_filp_open+0x49/0xa0
+  [ffffffff811926a7] do_sys_open+0x107/0x1e0
+  [ffffffff811927c0] sys_open+0x20/0x30
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=4360 size=4096 rw=WRITE wait_nsec=28998035 io_nsec=768370
+  pid=953 intent=11 dev=8:17 ino=14 gen=3151897938
+
+  touch dirtied inode 14 and the following sync forces writeback.
+  The IO is attributed to the dirtier.  Note the non-zero modifier is
+  indicating WB_REASON_SYNC.
+
+  #11 modifier=0x10000002
+  [ffffffff811c0710] __mark_inode_dirty+0x220/0x330
+  [ffffffff8125feeb] ext4_setattr+0x26b/0x4d0
+  [ffffffff811b0f2a] notify_change+0x10a/0x2b0
+  [ffffffff811c52de] utimes_common+0xde/0x190
+  [ffffffff811c5431] do_utimes+0xa1/0xf0
+  [ffffffff811c55a6] sys_utimensat+0x36/0xb0
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+
+4-2. copying a 1M file from another filesystem and waiting a bit
+
+- sector=2056 size=4096 rw=WRITE wait_nsec=151425 io_nsec=584466
+  pid=1004 intent=24 dev=8:16 ino=0 gen=0
+
+  flush-8:16 starting writeback w/ WB_REASON_BACKGROUND.  This
+  repeats a couple times.
+
+  #24 modifier=0x10000000
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811c6b9b] submit_bh+0xeb/0x130
+  [ffffffff811ca410] __block_write_full_page+0x210/0x3b0
+  [ffffffff811ca6a0] block_write_full_page_endio+0xf0/0x140
+  [ffffffff811ca705] block_write_full_page+0x15/0x20
+  [ffffffff811ce438] blkdev_writepage+0x18/0x20
+  [ffffffff81148f1a] __writepage+0x1a/0x50
+  [ffffffff81149ae6] write_cache_pages+0x206/0x4f0
+  [ffffffff81149e24] generic_writepages+0x54/0x80
+  [ffffffff81149e74] do_writepages+0x24/0x40
+  [ffffffff811bf301] writeback_single_inode+0x1a1/0x600
+  [ffffffff811c01db] writeback_sb_inodes+0x1ab/0x280
+  [ffffffff811c0b8e] __writeback_inodes_wb+0x9e/0xd0
+  [ffffffff811c0ea3] wb_writeback+0x243/0x3a0
+  [ffffffff811c115a] wb_do_writeback+0x15a/0x2b0
+  [ffffffff811c138a] bdi_writeback_thread+0xda/0x330
+  [ffffffff810bc286] kthread+0xb6/0xc0
+  [ffffffff81aadff4] kernel_thread_helper+0x4/0x10
+
+- sector=4360 size=4096 rw=WRITE wait_nsec=781396 io_nsec=894147
+  pid=1017 intent=25 dev=8:17 ino=12 gen=3151897939
+
+  Writeback got to inode 12 which was created and written to by cp.
+  This is inode writeback.
+
+  #25 modifier=0x10000000
+  [ffffffff811c0710] __mark_inode_dirty+0x220/0x330
+  [ffffffff811c7e5b] generic_write_end+0x6b/0xa0
+  [ffffffff8126191a] ext4_da_write_end+0xfa/0x350
+  [ffffffff8113f168] generic_file_buffered_write+0x188/0x2b0
+  [ffffffff81141608] __generic_file_aio_write+0x238/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=268288 size=524288 rw=WRITE wait_nsec=461543 io_nsec=3180190
+  pid=1017 intent=27 dev=8:17 ino=12 gen=3151897939
+
+  The first half of data.
+
+  #27 modifier=0x10000000
+  [ffffffff811c79cc] __set_page_dirty+0x4c/0xd0
+  [ffffffff811c7ab6] mark_buffer_dirty+0x66/0xa0
+  [ffffffff811c7b99] __block_commit_write+0xa9/0xe0
+  [ffffffff811c7da2] block_write_end+0x42/0x90
+  [ffffffff811c7e23] generic_write_end+0x33/0xa0
+  [ffffffff8126191a] ext4_da_write_end+0xfa/0x350
+  [ffffffff8113f168] generic_file_buffered_write+0x188/0x2b0
+  [ffffffff81141608] __generic_file_aio_write+0x238/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=269312 size=524288 rw=WRITE wait_nsec=364198 io_nsec=5667553
+  pid=1017 intent=27 dev=8:17 ino=12 gen=3151897939
+
+  And the second half.
+
+
+4-3. dd if=/dev/zero of=testfile bs=128k count=4 oflag=direct
+
+- sector=266496 size=131072 rw=WRITE|SYNC wait_nsec=48180 io_nsec=1066758
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  First chunk.
+
+  #34 modifier=0x0
+  [ffffffff813981b8] generic_make_request+0x18/0x100
+  [ffffffff81398314] submit_bio+0x74/0x100
+  [ffffffff811d1c45] __blockdev_direct_IO+0x21b5/0x3830
+  [ffffffff8129a7a1] ext4_ind_direct_IO+0x121/0x470
+  [ffffffff812612ee] ext4_direct_IO+0x23e/0x400
+  [ffffffff81141308] generic_file_direct_write+0xc8/0x190
+  [ffffffff811416ab] __generic_file_aio_write+0x2db/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+- sector=266752 size=131072 rw=WRITE|SYNC wait_nsec=15155 io_nsec=1086987
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  Second.
+
+- sector=267008 size=131072 rw=WRITE|SYNC wait_nsec=22694 io_nsec=1092836
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  Third.
+
+- sector=267264 size=131072 rw=WRITE|SYNC wait_nsec=15852 io_nsec=1021868
+  pid=1042 intent=34 dev=8:17 ino=12 gen=3151897940
+
+  Fourth.
+
+...
+
+- sector=4360 size=4096 rw=WRITE wait_nsec=1378342 io_nsec=828771
+  pid=1042 intent=35 dev=8:17 ino=12 gen=3151897940
+
+  After a while, inode is written back with WB_REASON_PERIODIC.
+
+  #35 modifier=0x10000003
+  [ffffffff811c0710] __mark_inode_dirty+0x220/0x330
+  [ffffffff8129611a] ext4_mb_new_blocks+0xea/0x5a0
+  [ffffffff8128b22e] ext4_ext_map_blocks+0x1c0e/0x1d80
+  [ffffffff812638d1] ext4_map_blocks+0x1b1/0x260
+  [ffffffff81263a28] _ext4_get_block+0xa8/0x160
+  [ffffffff81263b46] ext4_get_block+0x16/0x20
+  [ffffffff811d0460] __blockdev_direct_IO+0x9d0/0x3830
+  [ffffffff8129a7a1] ext4_ind_direct_IO+0x121/0x470
+  [ffffffff812612ee] ext4_direct_IO+0x23e/0x400
+  [ffffffff81141308] generic_file_direct_write+0xc8/0x190
+  [ffffffff811416ab] __generic_file_aio_write+0x2db/0x460
+  [ffffffff811418a8] generic_file_aio_write+0x78/0xf0
+  [ffffffff8125a4cf] ext4_file_write+0x6f/0x2a0
+  [ffffffff81194112] do_sync_write+0xe2/0x120
+  [ffffffff81194c08] vfs_write+0xc8/0x180
+  [ffffffff81194dc1] sys_write+0x51/0x90
+  [ffffffff81aaca6b] system_call_fastpath+0x16/0x1b
+
+
+5. Notes
+
+* By the time ioblame reports an IO, the task which gets charged might
+  have already exited, which is why only the pid is reported.  Userland
+  tools are advised to use a combination of live task listing and
+  process accounting to match pids to commands.
+
+* dev:ino:gen can be mapped to a filename without scanning the whole
+  filesystem by constructing a FS-specific filehandle, opening it with
+  open_by_handle_at(2) and then readlink(2)ing /proc/self/fd/N.  This
+  returns the full path as long as the dentry is in cache, which is
+  likely if data acquisition and mapping don't happen too long after
+  the IOs.  See the sketch at the end of this section.
+
+* At this point, it's mostly tested with ext4 w/o journal.  Metadata
+  dirtier tracking w/ journal needs improvements.
+
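+The dev:ino:gen -> path mapping mentioned above could be done with a
+helper along the following lines.  This is only a sketch: it assumes
+an ext4-style FILEID_INO32_GEN handle (32bit inode number followed by
+32bit generation), needs CAP_DAC_READ_SEARCH and takes an open fd on
+the target filesystem's mount point.
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <unistd.h>
+
+  static void print_path(int mount_fd, unsigned int ino, unsigned int gen)
+  {
+          struct file_handle *fh;
+          unsigned int *ids;
+          char proc[64], path[4096];
+          ssize_t len;
+          int fd;
+
+          fh = malloc(sizeof(*fh) + 2 * sizeof(unsigned int));
+          if (!fh)
+                  return;
+          fh->handle_bytes = 2 * sizeof(unsigned int);
+          fh->handle_type = 1;                    /* FILEID_INO32_GEN */
+          ids = (unsigned int *)fh->f_handle;
+          ids[0] = ino;
+          ids[1] = gen;
+
+          fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
+          if (fd >= 0) {
+                  snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
+                  len = readlink(proc, path, sizeof(path) - 1);
+                  if (len > 0)
+                          printf("ino=%u gen=%u -> %.*s\n",
+                                 ino, gen, (int)len, path);
+                  close(fd);
+          }
+          free(fh);
+  }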
+
+6. Overheads
+
+On x86_64, role is 104 bytes, intent 32 + 8 * stack_depth and act 72
+bytes.  Intents are allocated using kzalloc() and there shouldn't be
+too many of them.  Both roles and acts have their own kmem_cache and
+can be monitored via /proc/slabinfo.
+
+The size of the page frame number -> dirtier mapping table is
+proportional to the amount of available physical memory.  If
+max_acts <= 65536, 2 bytes are used per page.  With 4k pages, at most
+~0.049% of memory can be used.  If max_acts > 65536, 4 bytes are
+used, doubling the percentage to ~0.098%.  The table also grows
+dynamically.
+
+There are also indexing data structures used - hash tables, id[ra]s
+and a radix tree.  There are three hash tables, each sized according
+to max_{roles|intents|acts}.  The maximum memory usage by hash tables
+is sizeof(void *) * (max_roles + max_intents + max_acts).  Memory used
+by other indexing structures should be negligible.
+
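+For example, with the default limits (max_roles = max_acts = 65536,
+max_intents = 1024) on a 64bit machine with 16GiB of RAM and 4k pages,
+the hash tables use at most 8 * (65536 + 1024 + 65536) bytes, ie. a
+bit over 1MiB, while the pgtree tops out at 2 bytes for each of the
+~4.2 million page frames, ie. about 8MiB.
+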
+Preliminary tests w/ fio ssd-test on a loopback device on tmpfs, which
+is purely CPU cycle bound, show a ~20% throughput hit.
+
+*** TODO: add performance testing results and explain involved CPU
+    overheads.
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4053cbd..2ee4e3b 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -8,6 +8,7 @@
 #ifdef CONFIG_BLOCK
 
 #include <linux/types.h>
+#include <linux/ioblame.h>
 
 struct bio_set;
 struct bio;
@@ -69,6 +70,9 @@ struct bio {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	struct iob_io_info	bi_iob_info;
+#endif
 
 	bio_destructor_t	*bi_destructor;	/* destructor */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7aacf31..7a43f9a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -835,6 +835,9 @@ struct inode {
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
 	void			*i_private; /* fs or device private pointer */
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	union iob_id		i_iob_act;
+#endif
 };
 
 static inline int inode_unhashed(struct inode *inode)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 9d0e0b5..237db65 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -190,6 +190,9 @@ struct gendisk {
 #ifdef  CONFIG_BLK_DEV_INTEGRITY
 	struct blk_integrity *integrity;
 #endif
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+	bool iob_enabled;
+#endif
 	int node_id;
 };
 
diff --git a/include/linux/ioblame.h b/include/linux/ioblame.h
new file mode 100644
index 0000000..06c7f3a
--- /dev/null
+++ b/include/linux/ioblame.h
@@ -0,0 +1,72 @@
+/*
+ * include/linux/ioblame.h - IO tracer with origin tracking
+ *
+ * Copyright (C) 2011 Google, Inc.
+ * Copyright (C) 2011 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _IOBLAME_H
+#define _IOBLAME_H
+
+#ifdef __KERNEL__
+
+#include <linux/rcupdate.h>
+
+struct page;
+struct inode;
+struct buffer_head;
+
+#if defined(CONFIG_IO_BLAME) || defined(CONFIG_IO_BLAME_MODULE)
+
+/*
+ * Each iob_node is identified by 64bit id, which packs three fields in it
+ * - @type, @nr and @gen.  @nr is ida allocated index in @type.  It is
+ * always allocated from the lowest available slot, which allows efficient
+ * use of pgtree and idr; however, this means @nr is likely to be recycled.
+ * @gen is used to disambiguate recycled @nr's.
+ */
+#define IOB_NR_BITS			31
+#define IOB_GEN_BITS			31
+#define IOB_TYPE_BITS			2
+
+union iob_id {
+	u64				v;
+	struct {
+		u64			nr:IOB_NR_BITS;
+		u64			gen:IOB_GEN_BITS;
+		u64			type:IOB_TYPE_BITS;
+	} f;
+};
+
+struct iob_io_info {
+	sector_t			sector;
+	size_t				size;
+	unsigned long			rw;
+
+	u64				queued_at;
+	u64				issued_at;
+
+	pid_t				pid;
+	int				intent;
+	dev_t				dev;
+	u32				gen;
+	ino_t				ino;
+};
+
+#endif	/* CONFIG_IO_BLAME[_MODULE] */
+#endif	/* __KERNEL__ */
+
+enum iob_special_nr {
+	IOB_NOMEM_NR,
+	IOB_LOST_NR,
+	IOB_BASE_NR,
+};
+
+/* intent modifier */
+#define IOB_MODIFIER_TYPE_SHIFT	28
+#define IOB_MODIFIER_TYPE_MASK	0xf0000000U
+#define IOB_MODIFIER_VAL_MASK	(~IOB_MODIFIER_TYPE_MASK)
+
+/* val contains wb_reason */
+#define IOB_MODIFIER_WB		(1 << IOB_MODIFIER_TYPE_SHIFT)
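+/* e.g. sync(2)-forced writeback: IOB_MODIFIER_WB | WB_REASON_SYNC == 0x10000002 */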
+
+#endif	/* _IOBLAME_H */
diff --git a/include/trace/events/ioblame.h b/include/trace/events/ioblame.h
new file mode 100644
index 0000000..2d17055
--- /dev/null
+++ b/include/trace/events/ioblame.h
@@ -0,0 +1,94 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ioblame
+
+#if !defined(_TRACE_IOBLAME_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_IOBLAME_H
+
+#include <linux/tracepoint.h>
+#include <linux/ioblame.h>
+
+/*
+ * We care only about common and bio flags.  Request flags are much more
+ * volatile and don't have much meaning outside block core proper anyway.
+ */
+#define show_io_rw(rw)							\
+	__print_flags(rw, "|",						\
+		{ REQ_WRITE,			"WRITE" },		\
+		{ REQ_FAILFAST_DEV,		"FAILFAST_DEV" },	\
+		{ REQ_FAILFAST_TRANSPORT,	"FAILFAST_TRANSPORT" },	\
+		{ REQ_FAILFAST_DRIVER,		"FAILFAST_DRIVER" },	\
+									\
+		{ REQ_SYNC,			"SYNC" },		\
+		{ REQ_META,			"META" },		\
+		{ REQ_PRIO,			"PRIO" },		\
+		{ REQ_DISCARD,			"DISCARD" },		\
+		{ REQ_SECURE,			"SECURE" },		\
+									\
+		{ REQ_NOIDLE,			"NOIDLE" },		\
+		{ REQ_FUA,			"FUA" },		\
+		{ REQ_FLUSH,			"FLUSH" },		\
+									\
+		{ REQ_RAHEAD,			"RAHEAD" },		\
+		{ REQ_THROTTLED,		"THROTTLED" })
+
+DECLARE_EVENT_CLASS(ioblame_io_class,
+
+	TP_PROTO(struct bio *bio),
+
+	TP_ARGS(bio),
+
+	TP_STRUCT__entry(
+		__field( sector_t,		sector )
+		__field( size_t,		size )
+		__field( unsigned long,		rw )
+		__field( unsigned int,		wait_nsec )
+		__field( unsigned int,		io_nsec )
+		__field( pid_t,			pid )
+		__field( int,			intent )
+		__field( dev_t,			dev )
+		__field( u32,			gen )
+		__field( ino_t,			ino )
+	),
+
+	TP_fast_assign(
+		struct iob_io_info *io = &bio->bi_iob_info;
+		u64 now = local_clock();
+		u64 queued_at = io->queued_at;
+		u64 issued_at = io->issued_at;
+
+		__entry->sector			= io->sector;
+		__entry->size			= io->size;
+		__entry->rw			= io->rw;
+		__entry->pid			= io->pid;
+		__entry->intent			= io->intent;
+		__entry->dev			= io->dev;
+		__entry->gen			= io->gen;
+		__entry->ino			= io->ino;
+
+		if (time_before64(now, issued_at))
+			issued_at = now;
+		if (time_before64(issued_at, queued_at))
+			queued_at = issued_at;
+
+		__entry->wait_nsec = issued_at - queued_at;
+		__entry->io_nsec = now - issued_at;
+	),
+
+	TP_printk("io sector=%llu size=%zu rw=%s wait_nsec=%u io_nsec=%u pid=%d intent=%d dev=%d:%d ino=%llu gen=%u",
+		(unsigned long long)__entry->sector, __entry->size,
+		show_io_rw(__entry->rw), __entry->wait_nsec, __entry->io_nsec,
+		__entry->pid, __entry->intent,
+		MAJOR(__entry->dev), MINOR(__entry->dev),
+		(unsigned long long)__entry->ino, __entry->gen)
+);
+
+DEFINE_EVENT_CONDITION(ioblame_io_class, ioblame_io,
+	TP_PROTO(struct bio *bio),
+	TP_ARGS(bio),
+	TP_CONDITION(bio->bi_iob_info.size && iob_enabled_bio(bio))
+);
+
+#endif
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..ccc7c12 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -368,6 +368,18 @@ config BLK_DEV_IO_TRACE
 
 	  If unsure, say N.
 
+config IO_BLAME
+	tristate "Enable io-blame tracer"
+	depends on SYSFS
+	depends on BLOCK
+	select TRACEPOINTS
+	select STACKTRACE
+	help
+	  Say Y here if you want to enable IO tracer with dirtier
+	  tracking.  See Documentation/trace/ioblame.txt.
+
+	  If unsure, say N.
+
 config KPROBE_EVENT
 	depends on KPROBES
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 5f39a07..408cd1a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
 ifeq ($(CONFIG_BLOCK),y)
 obj-$(CONFIG_EVENT_TRACING) += blktrace.o
 endif
+obj-$(CONFIG_IO_BLAME) += ioblame.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events.o
 obj-$(CONFIG_EVENT_TRACING) += trace_export.o
 obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
diff --git a/kernel/trace/ioblame.c b/kernel/trace/ioblame.c
new file mode 100644
index 0000000..ae46abe
--- /dev/null
+++ b/kernel/trace/ioblame.c
@@ -0,0 +1,2279 @@
+/*
+ * kernel/trace/ioblame.c - IO tracer with origin tracking
+ *
+ * Copyright (C) 2011 Google, Inc.
+ * Copyright (C) 2011 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/idr.h>
+#include <linux/bitmap.h>
+#include <linux/radix-tree.h>
+#include <linux/rculist.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/stacktrace.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/log2.h>
+#include <linux/jhash.h>
+#include <linux/genhd.h>
+#include <linux/string.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/mm_types.h>
+#include <linux/fs.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <asm/div64.h>
+
+#include <trace/events/sched.h>
+#include <trace/events/vfs.h>
+#include <trace/events/writeback.h>
+#include <trace/events/block.h>
+
+#include "trace.h"
+
+#include <linux/ioblame.h>
+
+#define IOB_ROLE_NAMELEN	32
+#define IOB_STACK_MAX_DEPTH	32
+
+#define IOB_DFL_MAX_ROLES	(1 << 16)
+#define IOB_DFL_MAX_INTENTS	(1 << 10)
+#define IOB_DFL_MAX_ACTS	(1 << 16)
+#define IOB_DFL_TTL_SECS	120
+
+#define IOB_LAST_INO_DURATION	(5 * HZ)	/* last_ino is valid for 5s */
+
+/*
+ * Each type represents different type of entities tracked by ioblame and
+ * has its own iob_idx.
+ *
+ * role		: "who" - either a task or custom id from userland.
+ *
+ * intent	: The who's intention - backtrace + modifier.
+ *
+ * act		: Product of role, intent and the target inode.  "who"
+ *		  acts on a target inode with certain backtrace.
+ */
+enum iob_type {
+	IOB_INVALID,
+	IOB_ROLE,
+	IOB_INTENT,
+	IOB_ACT,
+
+	IOB_NR_TYPES,
+};
+
+#define IOB_PACK_ID(_type, _nr, _gen)	\
+	(union iob_id){ .f = { .type = (_type), .nr = (_nr), .gen = (_gen) }}
+
+/* stats */
+struct iob_stats {
+	u64 idx_nomem;
+	u64 idx_nospc;
+	u64 node_nomem;
+	u64 pgtree_nomem;
+};
+
+/* iob_node is what iob_idx indexes and embedded in every iob_type */
+struct iob_node {
+	struct hlist_node	hash_node;
+	union iob_id		id;
+};
+
+/* describes properties and operations of an iob_type for iob_idx */
+struct iob_idx_type {
+	enum iob_type		type;
+
+	/* calculate hash value from key */
+	unsigned long		(*hash)(void *key);
+	/* return %true if @node matches @key */
+	bool			(*match)(struct iob_node *node, void *key);
+	/* create a new node which matches @key w/ alloc mask @gfp_mask */
+	struct iob_node		*(*create)(void *key, gfp_t gfp_mask);
+	/* destroy @node */
+	void			(*destroy)(struct iob_node *node);
+
+	/* keys for fallback nodes */
+	void			*nomem_key;
+	void			*lost_key;
+};
+
+/*
+ * iob_idx indexes iob_nodes.  iob_nodes can either be found via hash table
+ * or by id.f.nr.  Hash calculation and matching are determined by
+ * iob_idx_type.  If a node is missing during hash lookup, new one is
+ * automatically created.
+ */
+struct iob_idx {
+	const struct iob_idx_type *type;
+
+	/* hash */
+	struct hlist_head	*hash;
+	unsigned int		hash_mask;
+
+	/* id index */
+	struct ida		ida;		/* used for allocation */
+	struct idr		idr;		/* record node or gen */
+
+	/* fallback nodes */
+	struct iob_node		*nomem_node;
+	struct iob_node		*lost_node;
+
+	/* stats */
+	unsigned int		nr_nodes;
+	unsigned int		max_nodes;
+};
+
+/*
+ * Functions to encode and decode pointer and generation for iob_idx->idr.
+ *
+ * id.f.gen is used to disambiguate recycled id.f.nr.  When there's no
+ * active node, iob_idx->idr slot carries the last generation number.
+ */
+static void *iob_idr_encode_node(struct iob_node *node)
+{
+	BUG_ON((unsigned long)node & 1);
+	return node;
+}
+
+static void *iob_idr_encode_gen(u32 gen)
+{
+	unsigned long v = (unsigned long)gen;
+	return (void *)((v << 1) | 1);
+}
+
+static struct iob_node *iob_idr_node(void *p)
+{
+	unsigned long v = (unsigned long)p;
+	return (v & 1) ? NULL : (void *)v;
+}
+
+static u32 iob_idr_gen(void *p)
+{
+	unsigned long v = (unsigned long)p;
+	return (v & 1) ? v >> 1 : 0;
+}
+
+/* IOB_ROLE */
+struct iob_role {
+	struct iob_node		node;
+
+	/*
+	 * Because a task can change its pid during exec and we want exact
+	 * match for removal on task exit, we use task pointer as key.
+	 */
+	struct task_struct	*task;
+	int			pid;
+
+	/* modifier currently in effect */
+	u32			modifier;
+
+	/* last file this role has operated on */
+	struct {
+		dev_t			dev;
+		u32			gen;
+		ino_t			ino;
+	} last_ino;
+	unsigned long		last_ino_jiffies;
+
+	/* act for inode dirtying/writing in progress */
+	union iob_id		inode_act;
+
+	/* for reclaiming */
+	struct list_head	free_list;
+};
+
+/* IOB_INTENT - uses separate key struct to use struct stack_trace directly */
+struct iob_intent_key {
+	u32			modifier;
+	int			depth;
+	unsigned long		*trace;
+};
+
+struct iob_intent {
+	struct iob_node		node;
+
+	u32			modifier;
+	int			depth;
+	unsigned long		trace[];
+};
+
+/* IOB_ACT */
+struct iob_act {
+	struct iob_node		node;
+
+	struct iob_act		*free_next;
+
+	/* key fields follow - paddings, if any, should be zero filled */
+	union iob_id		role;	/* must be the first field of keys */
+	union iob_id		intent;
+	dev_t			dev;
+	u32			gen;
+	ino_t			ino;
+};
+
+#define IOB_ACT_KEY_OFFSET	offsetof(struct iob_act, role)
+
+static DEFINE_MUTEX(iob_mutex);		/* enable/disable and userland access */
+static DEFINE_SPINLOCK(iob_lock);	/* write access to all int structures */
+
+static bool iob_enabled __read_mostly = false;
+
+/* temp buffer used for parsing/printing, user must be holding iob_mutex */
+static char __iob_page_buf[PAGE_SIZE];
+#define iob_page_buf	({ lockdep_assert_held(&iob_mutex); __iob_page_buf; })
+
+/* userland tunable knobs */
+static unsigned int iob_max_roles __read_mostly = IOB_DFL_MAX_ROLES;
+static unsigned int iob_max_intents __read_mostly = IOB_DFL_MAX_INTENTS;
+static unsigned int iob_max_acts __read_mostly = IOB_DFL_MAX_ACTS;
+static unsigned int iob_ttl_secs __read_mostly = IOB_DFL_TTL_SECS;
+static bool iob_ignore_ino __read_mostly;
+
+/* pgtree params, determined by iob_max_acts */
+static unsigned long iob_pgtree_shift __read_mostly;
+static unsigned long iob_pgtree_pfn_shift __read_mostly;
+static unsigned long iob_pgtree_pfn_mask __read_mostly;
+
+/* role and act caches, intent is variable size and allocated using kzalloc */
+static struct kmem_cache *iob_role_cache;
+static struct kmem_cache *iob_act_cache;
+
+/* iob_idx for each iob_type */
+static struct iob_idx *iob_role_idx __read_mostly;
+static struct iob_idx *iob_intent_idx __read_mostly;
+static struct iob_idx *iob_act_idx __read_mostly;
+
+/* for reclaiming */
+static void iob_reclaim_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(iob_reclaim_work, iob_reclaim_workfn);
+
+static unsigned int iob_role_reclaim_seq;
+
+static struct list_head iob_role_to_free_heads[2] = {
+	LIST_HEAD_INIT(iob_role_to_free_heads[0]),
+	LIST_HEAD_INIT(iob_role_to_free_heads[1]),
+};
+static struct list_head *iob_role_to_free_front = &iob_role_to_free_heads[0];
+static struct list_head *iob_role_to_free_back = &iob_role_to_free_heads[1];
+
+static unsigned long *iob_act_used_bitmaps[2];
+
+struct iob_act_used {
+	unsigned long	*front;
+	unsigned long	*back;
+} iob_act_used;
+
+/* pgtree - maps pfn to act nr */
+static RADIX_TREE(iob_pgtree, GFP_NOWAIT);
+
+/* stats and /sys/kernel/debug/ioblame */
+static struct iob_stats iob_stats;
+static struct dentry *iob_dir;
+static struct dentry *iob_intents_dentry;
+
+static void iob_intent_notify_workfn(struct work_struct *work);
+static DECLARE_WORK(iob_intent_notify_work, iob_intent_notify_workfn);
+
+static bool iob_enabled_inode(struct inode *inode)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && inode->i_sb->s_bdev &&
+		inode->i_sb->s_bdev->bd_disk->iob_enabled;
+}
+
+static bool iob_enabled_bh(struct buffer_head *bh)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && bh->b_bdev->bd_disk->iob_enabled;
+}
+
+static bool iob_enabled_bio(struct bio *bio)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	return iob_enabled && bio->bi_bdev &&
+		bio->bi_bdev->bd_disk->iob_enabled;
+}
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/ioblame.h>
+
+/*
+ * IOB_IDX
+ *
+ * This is the main indexing facility used to maintain and access all
+ * iob_type objects.  iob_idx operates on iob_node which each iob_type
+ * object embeds.
+ *
+ * Each iob_idx is associated with iob_idx_type on creation, which
+ * describes which type it is, methods used during hash lookup and two keys
+ * for fallback node creation.
+ *
+ * Objects can be accessed either by hash table or id.  Hash table lookup
+ * uses iob_idx_type->hash() and ->match() methods for lookup and
+ * ->create() and ->destroy() to create new object if missing and
+ * requested.  Note that the hash key is opaque to iob_idx.  Key handling
+ * is defined completely by iob_idx_type methods.
+ *
+ * When a new object is created, iob_idx automatically assigns an id, which
+ * is combination of type enum, object number (nr), and generation number.
+ * Object number is ida allocated and always packed towards 0.  Generation
+ * number starts at 1 and gets incremented each time the nr is recycled.
+ *
+ * Access by id is either by whole id or nr part of it.  Objects are not
+ * created through id lookups.
+ *
+ * Read accesses are protected by sched_rcu.  Using sched_rcu allows
+ * avoiding extra rcu locking operations in tracepoint probes.  Write
+ * accesses are expected to be infrequent and synchronized with single
+ * spinlock - iob_lock.
+ */
+
+static int iob_idx_install_node(struct iob_node *node, struct iob_idx *idx,
+				gfp_t gfp_mask)
+{
+	const struct iob_idx_type *type = idx->type;
+	int nr = -1, idr_nr = -1, ret;
+	void *p;
+
+	INIT_HLIST_NODE(&node->hash_node);
+
+	/* allocate nr and make sure it's under the limit */
+	do {
+		if (unlikely(!ida_pre_get(&idx->ida, gfp_mask)))
+			goto enomem;
+		ret = ida_get_new(&idx->ida, &nr);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (unlikely(ret < 0 || nr >= idx->max_nodes))
+		goto enospc;
+
+	/* if @nr was used before, idr would have last_gen recorded, look up */
+	p = idr_find(&idx->idr, nr);
+	if (p) {
+		WARN_ON_ONCE(iob_idr_node(p));
+		/* set id with gen before replacing the idr entry */
+		node->id = IOB_PACK_ID(type->type, nr, iob_idr_gen(p) + 1);
+		idr_replace(&idx->idr, node, nr);
+		return 0;
+	}
+
+	/* create a new idr entry, it must match ida allocation */
+	node->id = IOB_PACK_ID(type->type, nr, 1);
+	do {
+		if (unlikely(!idr_pre_get(&idx->idr, gfp_mask)))
+			goto enomem;
+		ret = idr_get_new_above(&idx->idr, iob_idr_encode_node(node),
+					nr, &idr_nr);
+	} while (unlikely(ret == -EAGAIN));
+
+	if (unlikely(ret < 0) || WARN_ON_ONCE(idr_nr != nr))
+		goto enospc;
+
+	return 0;
+
+enomem:
+	iob_stats.idx_nomem++;
+	ret = -ENOMEM;
+	goto fail;
+enospc:
+	iob_stats.idx_nospc++;
+	ret = -ENOSPC;
+fail:
+	if (idr_nr >= 0)
+		idr_remove(&idx->idr, idr_nr);
+	if (nr >= 0)
+		ida_remove(&idx->ida, nr);
+	return ret;
+}
+
+/**
+ * iob_idx_destroy - destroy iob_idx
+ * @idx: iob_idx to destroy
+ *
+ * Free all nodes indexed by @idx and @idx itself.  The caller is
+ * responsible for ensuring nobody is accessing @idx.
+ */
+static void iob_idx_destroy(struct iob_idx *idx)
+{
+	const struct iob_idx_type *type = idx->type;
+	void *ptr;
+	int pos = 0;
+
+	while ((ptr = idr_get_next(&idx->idr, &pos))) {
+		struct iob_node *node = iob_idr_node(ptr);
+		if (node)
+			type->destroy(node);
+		pos++;
+	}
+
+	idr_remove_all(&idx->idr);
+	idr_destroy(&idx->idr);
+	ida_destroy(&idx->ida);
+
+	vfree(idx->hash);
+	kfree(idx);
+}
+
+/**
+ * iob_idx_create - create a new iob_idx
+ * @type: type of new iob_idx
+ * @max_nodes: maximum number of nodes allowed
+ *
+ * Create a new @type iob_idx.  Newly created iob_idx has two fallback
+ * nodes pre-allocated - one for nomem and the other for lost nodes,
+ * occupying the IOB_NOMEM_NR and IOB_LOST_NR slots respectively.
+ *
+ * Returns pointer to the new iob_idx on success, %NULL on failure.
+ */
+static struct iob_idx *iob_idx_create(const struct iob_idx_type *type,
+				      unsigned int max_nodes)
+{
+	unsigned int hash_sz = rounddown_pow_of_two(max_nodes);
+	struct iob_idx *idx;
+	struct iob_node *node;
+
+	if (max_nodes < 2)
+		return NULL;
+
+	/* alloc and init */
+	idx = kzalloc(sizeof(*idx), GFP_KERNEL);
+	if (!idx)
+		return NULL;
+
+	ida_init(&idx->ida);
+	idr_init(&idx->idr);
+	idx->type = type;
+	idx->max_nodes = max_nodes;
+	idx->hash_mask = hash_sz - 1;
+
+	idx->hash = vzalloc(hash_sz * sizeof(idx->hash[0]));
+	if (!idx->hash)
+		goto fail;
+
+	/* create and install nomem_node */
+	node = type->create(type->nomem_key, GFP_KERNEL);
+	if (!node)
+		goto fail;
+	if (iob_idx_install_node(node, idx, GFP_KERNEL) < 0) {
+		type->destroy(node);
+		goto fail;
+	}
+	idx->nomem_node = node;
+	idx->nr_nodes++;
+
+	/* create and install lost_node */
+	node = type->create(type->lost_key, GFP_KERNEL);
+	if (!node)
+		goto fail;
+	if (iob_idx_install_node(node, idx, GFP_KERNEL) < 0) {
+		type->destroy(node);
+		goto fail;
+	}
+	idx->lost_node = node;
+	idx->nr_nodes++;
+
+	/* verify both fallback nodes have the correct id.f.nr */
+	if (idx->nomem_node->id.f.nr != IOB_NOMEM_NR ||
+	    idx->lost_node->id.f.nr != IOB_LOST_NR)
+		goto fail;
+
+	return idx;
+fail:
+	iob_idx_destroy(idx);
+	return NULL;
+}
+
+/**
+ * iob_node_by_nr_raw - lookup node by nr
+ * @nr: nr to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Lookup node occupying slot @nr.  If such node doesn't exist, %NULL is
+ * returned.
+ */
+static struct iob_node *iob_node_by_nr_raw(int nr, struct iob_idx *idx)
+{
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+	return iob_idr_node(idr_find(&idx->idr, nr));
+}
+
+/**
+ * iob_node_by_id_raw - lookup node by id
+ * @id: id to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Lookup node with @id.  @id's type should match @idx's type and all three
+ * id fields should match for a successful lookup - type, nr and generation.
+ * Returns %NULL on failure.
+ */
+static struct iob_node *iob_node_by_id_raw(union iob_id id, struct iob_idx *idx)
+{
+	struct iob_node *node;
+
+	WARN_ON_ONCE(id.f.type != idx->type->type);
+
+	node = iob_node_by_nr_raw(id.f.nr, idx);
+	if (likely(node && node->id.v == id.v))
+		return node;
+	return NULL;
+}
+
+static struct iob_node *iob_hash_head_lookup(void *key,
+					     struct hlist_head *hash_head,
+					     const struct iob_idx_type *type)
+{
+	struct hlist_node *pos;
+	struct iob_node *node;
+
+	hlist_for_each_entry_rcu(node, pos, hash_head, hash_node)
+		if (type->match(node, key))
+			return node;
+	return NULL;
+}
+
+/**
+ * iob_get_node_raw - lookup node from hash table and create if missing
+ * @key: key to lookup hash table with
+ * @idx: iob_idx to lookup from
+ * @create: whether to create a new node if lookup fails
+ *
+ * Look up node which matches @key in @idx.  If no such node exists and
+ * @create is %true, create a new one.  A newly created node will have
+ * unique id assigned to it as long as generation number doesn't overflow.
+ *
+ * This function should be called under rcu sched read lock and returns
+ * %NULL on failure.
+ */
+static struct iob_node *iob_get_node_raw(void *key, struct iob_idx *idx,
+					 bool create)
+{
+	const struct iob_idx_type *type = idx->type;
+	struct iob_node *node, *new_node;
+	struct hlist_head *hash_head;
+	unsigned long hash, flags;
+
+	WARN_ON_ONCE(!rcu_read_lock_sched_held());
+
+	/* lookup hash */
+	hash = type->hash(key);
+	hash_head = &idx->hash[hash & idx->hash_mask];
+
+	node = iob_hash_head_lookup(key, hash_head, type);
+	if (node || !create)
+		return node;
+
+	/* non-existent && @create, create new one */
+	new_node = type->create(key, GFP_NOWAIT);
+	if (!new_node) {
+		iob_stats.node_nomem++;
+		return NULL;
+	}
+
+	spin_lock_irqsave(&iob_lock, flags);
+
+	/* someone might have inserted it in between, lookup again */
+	node = iob_hash_head_lookup(key, hash_head, type);
+	if (node)
+		goto out_unlock;
+
+	/* install the node and add to the hash table */
+	if (iob_idx_install_node(new_node, idx, GFP_NOWAIT))
+		goto out_unlock;
+
+	hlist_add_head_rcu(&new_node->hash_node, hash_head);
+	idx->nr_nodes++;
+
+	node = new_node;
+	new_node = NULL;
+out_unlock:
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	if (unlikely(new_node))
+		type->destroy(new_node);
+	return node;
+}
+
+/**
+ * iob_node_by_nr - lookup node by nr with fallback
+ * @nr: nr to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_node_by_nr_raw() but returns @idx->lost_node instead of
+ * %NULL if lookup fails.  The lost_node is returned as nr/id lookup
+ * failure indicates the target node has already been reclaimed.
+ */
+static struct iob_node *iob_node_by_nr(int nr, struct iob_idx *idx)
+{
+	return iob_node_by_nr_raw(nr, idx) ?: idx->lost_node;
+}
+
+/**
+ * iob_node_by_id - lookup node by id with fallback
+ * @id: id to lookup
+ * @idx: iob_idx to lookup from
+ *
+ * Same as iob_node_by_id_raw() but returns @idx->lost_node instead of
+ * %NULL if lookup fails.  The lost_node is returned as nr/id lookup
+ * failure indicates the target node has already been reclaimed.
+ */
+static struct iob_node *iob_node_by_id(union iob_id id, struct iob_idx *idx)
+{
+	return iob_node_by_id_raw(id, idx) ?: idx->lost_node;
+}
+
+/**
+ * iob_get_node - lookup node from hash table and create if missing w/ fallback
+ * @key: key to lookup hash table with
+ * @idx: iob_idx to lookup from
+ * @create: whether to create a new node if lookup fails
+ *
+ * Same as iob_get_node_raw(@key, @idx, %true) but returns @idx->nomem_node
+ * instead of %NULL on failure as the only reason is alloc failure.
+ */
+static struct iob_node *iob_get_node(void *key, struct iob_idx *idx)
+{
+	return iob_get_node_raw(key, idx, true) ?: idx->nomem_node;
+}
+
+/**
+ * iob_unhash_node - unhash an iob_node
+ * @node: node to unhash
+ * @idx: iob_idx @node is hashed on
+ *
+ * Make @node invisible from hash lookup.  It will still be visible from
+ * id/nr lookup.
+ *
+ * Must be called holding iob_lock and returns %true if unhashed
+ * successfully, %false if someone else already unhashed it.
+ */
+static bool iob_unhash_node(struct iob_node *node, struct iob_idx *idx)
+{
+	lockdep_assert_held(&iob_lock);
+
+	if (hlist_unhashed(&node->hash_node))
+		return false;
+	hlist_del_init_rcu(&node->hash_node);
+	return true;
+}
+
+/**
+ * iob_remove_node - remove an iob_node
+ * @node: node to remove
+ * @idx: iob_idx @node is on
+ *
+ * Remove @node from @idx.  The caller is responsible for calling
+ * iob_unhash_node() before.  Note that removed nodes should be freed only
+ * after RCU grace period has passed.
+ *
+ * Must be called holding iob_lock.
+ */
+static void iob_remove_node(struct iob_node *node, struct iob_idx *idx)
+{
+	lockdep_assert_held(&iob_lock);
+
+	/* don't remove idr slot, record current generation there */
+	idr_replace(&idx->idr, iob_idr_encode_gen(node->id.f.gen),
+		    node->id.f.nr);
+	ida_remove(&idx->ida, node->id.f.nr);
+	idx->nr_nodes--;
+}
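
To make the nr/generation recycling above concrete, one possible lifecycle
of a slot looks like this (the numbers are made up):

  create A: ida hands out nr 7, idr[7] is empty          -> A.id = {type, nr 7, gen 1}
  remove A: idr[7] := encoded gen 1, ida_remove(7)
  create B: ida hands out nr 7 again, idr[7] holds gen 1 -> B.id = {type, nr 7, gen 2}

A lookup with A's now stale id finds B in slot 7, but B.id.v != A.id.v, so
iob_node_by_id_raw() returns NULL and iob_node_by_id() falls back to
lost_node.
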
+
+
+/*
+ * IOB_ROLE
+ *
+ * A role represents a task and is keyed by its task pointer.  It is
+ * created when the matching task first enters iob tracking, unhashed on
+ * task exit and destroyed after reclaim period has passed.
+ *
+ * The reason why task_roles are keyed by task pointer instead of pid is
+ * that the pid can change across exec(2) and we need a reliable match on
+ * task exit to avoid leaking task_roles.  A task_role is unhashed and
+ * scheduled for removal on task exit or if the pid no longer matches after
+ * exec.
+ *
+ * These life-cycle rules guarantee that any task is given one id across
+ * its lifetime and avoid resource leaks.
+ *
+ * A role also carries context information for the task, e.g. the last file
+ * the task operated on, currently on-going inode operation and so on.
+ */
+
+static struct iob_role *iob_node_to_role(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_role, node) : NULL;
+}
+
+static unsigned long iob_role_hash(void *key)
+{
+	struct iob_role *rkey = key;
+
+	return jhash(rkey->task, sizeof(rkey->task), JHASH_INITVAL);
+}
+
+static bool iob_role_match(struct iob_node *node, void *key)
+{
+	struct iob_role *role = iob_node_to_role(node);
+	struct iob_role *rkey = key;
+
+	return rkey->task == role->task;
+}
+
+static struct iob_node *iob_role_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_role *rkey = key;
+	struct iob_role *role;
+
+	role = kmem_cache_alloc(iob_role_cache, gfp_mask);
+	if (!role)
+		return NULL;
+	*role = *rkey;
+	INIT_LIST_HEAD(&role->free_list);
+	return &role->node;
+}
+
+static void iob_role_destroy(struct iob_node *node)
+{
+	kmem_cache_free(iob_role_cache, iob_node_to_role(node));
+}
+
+static struct iob_role iob_role_null_key = { };
+
+static const struct iob_idx_type iob_role_idx_type = {
+	.type		= IOB_ROLE,
+
+	.hash		= iob_role_hash,
+	.match		= iob_role_match,
+	.create		= iob_role_create,
+	.destroy	= iob_role_destroy,
+
+	.nomem_key	= &iob_role_null_key,
+	.lost_key	= &iob_role_null_key,
+};
+
+static struct iob_role *iob_role_by_id(union iob_id id)
+{
+	return iob_node_to_role(iob_node_by_id(id, iob_role_idx));
+}
+
+/**
+ * iob_reclaim_current_role - reclaim role for %current
+ *
+ * This function guarantees that the self role won't be visible to hash
+ * table lookup by %current itself.
+ */
+static void iob_reclaim_current_role(void)
+{
+	struct iob_role rkey = { };
+	struct iob_role *role;
+	unsigned long flags;
+
+	/*
+	 * A role is always created by %current and thus guaranteed to be
+	 * visible to %current.  Negative result from lockless lookup can
+	 * be trusted.
+	 */
+	rkey.task = current;
+	rkey.pid = task_pid_nr(current);
+	role = iob_node_to_role(iob_get_node_raw(&rkey, iob_role_idx, false));
+	if (!role)
+		return;
+
+	/* unhash and queue on reclaim list */
+	spin_lock_irqsave(&iob_lock, flags);
+	WARN_ON_ONCE(!iob_unhash_node(&role->node, iob_role_idx));
+	WARN_ON_ONCE(!list_empty(&role->free_list));
+	list_add_tail(&role->free_list, iob_role_to_free_front);
+	spin_unlock_irqrestore(&iob_lock, flags);
+}
+
+/**
+ * iob_current_role - lookup role for %current
+ *
+ * Return role for %current.  May return nomem node under memory pressure.
+ */
+static struct iob_role *iob_current_role(void)
+{
+	struct iob_role rkey = { };
+	struct iob_role *role;
+	bool retried = false;
+
+	rkey.task = current;
+	rkey.pid = task_pid_nr(current);
+retry:
+	role = iob_node_to_role(iob_get_node(&rkey, iob_role_idx));
+
+	/*
+	 * If %current exec'd, its pid may have changed.  In such cases,
+	 * shoot down the current role and retry.
+	 */
+	if (role->pid == rkey.pid || role->node.id.f.nr < IOB_BASE_NR)
+		return role;
+
+	iob_reclaim_current_role();
+
+	/* this shouldn't happen more than once */
+	WARN_ON_ONCE(retried);
+	retried = true;
+	goto retry;
+}
+
+
+/*
+ * IOB_INTENT
+ *
+ * An intent represents a category of actions a task can take.  It
+ * currently consists of the stack trace at the point of action and an
+ * optional modifier.  The number of unique backtraces is expected to be
+ * limited and no reclaiming is implemented.
+ */
+
+static struct iob_intent *iob_node_to_intent(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_intent, node) : NULL;
+}
+
+static unsigned long iob_intent_hash(void *key)
+{
+	struct iob_intent_key *ikey = key;
+
+	return jhash(ikey->trace, ikey->depth * sizeof(ikey->trace[0]),
+		     JHASH_INITVAL + ikey->modifier);
+}
+
+static bool iob_intent_match(struct iob_node *node, void *key)
+{
+	struct iob_intent *intent = iob_node_to_intent(node);
+	struct iob_intent_key *ikey = key;
+
+	if (intent->modifier == ikey->modifier &&
+	    intent->depth == ikey->depth)
+		return !memcmp(intent->trace, ikey->trace,
+			       intent->depth * sizeof(intent->trace[0]));
+	return false;
+}
+
+static struct iob_node *iob_intent_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_intent_key *ikey = key;
+	struct iob_intent *intent;
+	size_t trace_sz = sizeof(intent->trace[0]) * ikey->depth;
+
+	intent = kzalloc(sizeof(*intent) + trace_sz, gfp_mask);
+	if (!intent)
+		return NULL;
+
+	intent->modifier = ikey->modifier;
+	intent->depth = ikey->depth;
+	memcpy(intent->trace, ikey->trace, trace_sz);
+
+	return &intent->node;
+}
+
+static void iob_intent_destroy(struct iob_node *node)
+{
+	kfree(iob_node_to_intent(node));
+}
+
+static struct iob_intent_key iob_intent_null_key = { };
+
+static const struct iob_idx_type iob_intent_idx_type = {
+	.type		= IOB_INTENT,
+
+	.hash		= iob_intent_hash,
+	.match		= iob_intent_match,
+	.create		= iob_intent_create,
+	.destroy	= iob_intent_destroy,
+
+	.nomem_key	= &iob_intent_null_key,
+	.lost_key	= &iob_intent_null_key,
+};
+
+static struct iob_intent *iob_intent_by_nr(int nr)
+{
+	return iob_node_to_intent(iob_node_by_nr(nr, iob_intent_idx));
+}
+
+static struct iob_intent *iob_intent_by_id(union iob_id id)
+{
+	return iob_node_to_intent(iob_node_by_id(id, iob_intent_idx));
+}
+
+static struct iob_intent *iob_get_intent(unsigned long *trace, int depth,
+					 u32 modifier)
+{
+	struct iob_intent_key ikey = { .modifier = modifier, .depth = depth,
+				       .trace = trace };
+	struct iob_intent *intent;
+	int nr_nodes;
+
+	nr_nodes = iob_intent_idx->nr_nodes;
+
+	intent = iob_node_to_intent(iob_get_node(&ikey, iob_intent_idx));
+
+	/*
+	 * If nr_nodes changed across get_node, we probably have created a
+	 * new entry.  Notify change on intent files.  This may be spurious
+	 * but won't miss an event, which is good enough.
+	 */
+	if (nr_nodes != iob_intent_idx->nr_nodes)
+		schedule_work(&iob_intent_notify_work);
+
+	return intent;
+}
+
+static DEFINE_PER_CPU(unsigned long [IOB_STACK_MAX_DEPTH], iob_trace_buf_pcpu);
+
+/**
+ * iob_current_intent - return intent for %current
+ * @skip: number of stack frames to skip
+ *
+ * Acquire stack trace after skipping @skip frames and return matching
+ * iob_intent.  The stack trace never includes iob_current_intent() and
+ * a @skip of 1 skips the caller, not iob_current_intent().  May return nomem
+ * node under memory pressure.
+ */
+static noinline struct iob_intent *iob_current_intent(int skip)
+{
+	unsigned long *trace = *this_cpu_ptr(&iob_trace_buf_pcpu);
+	struct stack_trace st = { .max_entries = IOB_STACK_MAX_DEPTH,
+				  .entries = trace, .skip = skip + 1 };
+	struct iob_intent *intent;
+	unsigned long flags;
+
+	/* disable IRQ to make trace_pcpu array access exclusive */
+	local_irq_save(flags);
+
+	/* acquire stack trace, ignore -1LU end of stack marker */
+	save_stack_trace_quick(&st);
+	if (st.nr_entries && trace[st.nr_entries - 1] == ULONG_MAX)
+		st.nr_entries--;
+
+	/* get matching iob_intent */
+	intent = iob_get_intent(trace, st.nr_entries, 0);
+
+	local_irq_restore(flags);
+	return intent;
+}
+
+/**
+ * iob_modified_intent - determine modified intent
+ * @intent: the base intent
+ * @modifier: modifier to apply
+ *
+ * Return iob_intent which is identical to @intent except that its modifier
+ * is @modifier.  @intent is allowed to have any modifier including zero on
+ * entry.  May return nomem node under memory pressure.
+ */
+static struct iob_intent *iob_modified_intent(struct iob_intent *intent,
+					      u32 modifier)
+{
+	if (intent->modifier == modifier ||
+	    unlikely(intent->node.id.f.nr < IOB_BASE_NR))
+		return intent;
+	return iob_get_intent(intent->trace, intent->depth, modifier);
+}
+
+
+/*
+ * IOB_ACT
+ *
+ * Represents a specific action an iob_role took.  Consists of an iob_role,
+ * an iob_intent and the target inode.  iob_act is used to track dirtiers.
+ * For each dirtying operation, an iob_act is acquired and recorded (either
+ * by id or by id.f.nr) and used for reporting later.
+ *
+ * Because this is the product of three different entities, the number can
+ * grow quite large.  Each successful lookup sets a bit in the used bitmap
+ * and iob_acts which haven't been used for iob_ttl_secs are reclaimed.
+ */
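
A sketch of what the key amounts to - the actual struct iob_act is defined
elsewhere in the patch and its exact layout may differ, so treat this only
as an illustration of why iob_act_hash()/iob_act_match() below can operate
on the raw bytes starting at IOB_ACT_KEY_OFFSET:

  /* illustrative only - not the actual struct iob_act */
  struct example_act {
          struct iob_node         node;           /* bookkeeping, not hashed */
          struct example_act      *free_next;     /* reclaim chaining */
          /* IOB_ACT_KEY_OFFSET: everything below is the lookup key */
          union iob_id            role;           /* who */
          union iob_id            intent;         /* how: backtrace + modifier */
          dev_t                   dev;            /* on which device */
          ino_t                   ino;            /* which file, 0 if ignore_ino */
          u32                     gen;            /* inode generation */
  };
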
+
+static void iob_act_mark_used(struct iob_act *act)
+{
+	if (!test_bit(act->node.id.f.nr, iob_act_used.front))
+		set_bit(act->node.id.f.nr, iob_act_used.front);
+}
+
+static struct iob_act *iob_node_to_act(struct iob_node *node)
+{
+	return node ? container_of(node, struct iob_act, node) : NULL;
+}
+
+static unsigned long iob_act_hash(void *key)
+{
+	return jhash(key + IOB_ACT_KEY_OFFSET,
+		     sizeof(struct iob_act) - IOB_ACT_KEY_OFFSET,
+		     JHASH_INITVAL);
+}
+
+static bool iob_act_match(struct iob_node *node, void *key)
+{
+	return !memcmp((void *)node + IOB_ACT_KEY_OFFSET,
+		       key + IOB_ACT_KEY_OFFSET,
+		       sizeof(struct iob_act) - IOB_ACT_KEY_OFFSET);
+}
+
+static struct iob_node *iob_act_create(void *key, gfp_t gfp_mask)
+{
+	struct iob_act *akey = key;
+	struct iob_act *act;
+
+	act = kmem_cache_alloc(iob_act_cache, gfp_mask);
+	if (!act)
+		return NULL;
+	*act = *akey;
+	return &act->node;
+}
+
+static void iob_act_destroy(struct iob_node *node)
+{
+	kmem_cache_free(iob_act_cache, iob_node_to_act(node));
+}
+
+static struct iob_act iob_act_nomem_key = {
+	.role		= IOB_PACK_ID(IOB_ROLE, IOB_NOMEM_NR, 1),
+	.intent		= IOB_PACK_ID(IOB_INTENT, IOB_NOMEM_NR, 1),
+};
+
+static struct iob_act iob_act_lost_key = {
+	.role		= IOB_PACK_ID(IOB_ROLE, IOB_LOST_NR, 1),
+	.intent		= IOB_PACK_ID(IOB_INTENT, IOB_LOST_NR, 1),
+};
+
+static const struct iob_idx_type iob_act_idx_type = {
+	.type		= IOB_ACT,
+
+	.hash		= iob_act_hash,
+	.match		= iob_act_match,
+	.create		= iob_act_create,
+	.destroy	= iob_act_destroy,
+
+	.nomem_key	= &iob_act_nomem_key,
+	.lost_key	= &iob_act_lost_key,
+};
+
+static struct iob_act *iob_act_by_nr(int nr)
+{
+	return iob_node_to_act(iob_node_by_nr(nr, iob_act_idx));
+}
+
+static struct iob_act *iob_act_by_id(union iob_id id)
+{
+	return iob_node_to_act(iob_node_by_id(id, iob_act_idx));
+}
+
+/**
+ * iob_current_act - return the current iob_act
+ * @stack_skip: number of stack frames to skip when acquiring iob_intent
+ * @dev: dev_t of the inode being operated on
+ * @ino: ino of the inode being operated on
+ * @gen: generation of the inode being operated on
+ *
+ * Return iob_act for %current with the current backtrace.
+ * iob_current_act() is never included in the backtrace.  May return nomem
+ * node under memory pressure.
+ */
+static __always_inline struct iob_act *iob_current_act(int stack_skip,
+						dev_t dev, ino_t ino, u32 gen)
+{
+	struct iob_role *role = iob_current_role();
+	struct iob_intent *intent = iob_current_intent(stack_skip);
+	struct iob_act akey = { .role = role->node.id,
+				.intent = intent->node.id, .dev = dev };
+	struct iob_act *act;
+	int min_nr;
+
+	/* if either role or intent is special, return matching special role */
+	min_nr = min_t(int, role->node.id.f.nr, intent->node.id.f.nr);
+	if (unlikely(min_nr < IOB_BASE_NR)) {
+		if (min_nr == IOB_NOMEM_NR)
+			return iob_node_to_act(iob_act_idx->nomem_node);
+		else
+			return iob_node_to_act(iob_act_idx->lost_node);
+	}
+
+	/* if ignore_ino is set, use the same act for all files on the dev */
+	if (!iob_ignore_ino) {
+		akey.ino = ino;
+		akey.gen = gen;
+	}
+
+	act = iob_node_to_act(iob_get_node(&akey, iob_act_idx));
+	if (act)
+		iob_act_mark_used(act);
+	return act;
+}
+
+
+/*
+ * RECLAIM
+ */
+
+/**
+ * iob_reclaim - reclaim iob_roles and iob_acts
+ *
+ * This function is called from workqueue every ttl/2 and looks at
+ * iob_act_used->front/back and iob_role_to_free_front/back to reclaim
+ * unused nodes.
+ *
+ * iob_act uses bitmaps to collect and track used history.  Used bits are
+ * examined every ttl/2 period and iob_acts which haven't been used for two
+ * half periods are reclaimed.
+ *
+ * iob_role goes through reclaiming mostly to delay freeing so that roles
+ * are still available when async IO events fire after the original tasks
+ * exit.  iob_role reclaiming is simpler and happens every ttl.
+ */
+static void iob_reclaim_workfn(struct work_struct *work)
+{
+	LIST_HEAD(role_todo);
+	struct iob_act_used *u = &iob_act_used;
+	struct iob_act *free_head = NULL;
+	struct iob_act *act;
+	struct iob_role *role, *role_pos;
+	unsigned long flags;
+	int i;
+
+	/*
+	 * We're gonna reclaim acts which don't have bit set in both front
+	 * and back used bitmaps - IOW, the ones which weren't used in the
+	 * last and this ttl/2 periods.
+	 */
+	bitmap_or(u->back, u->front, u->back, iob_max_acts);
+
+	spin_lock_irqsave(&iob_lock, flags);
+
+	/*
+	 * Determine which roles to reclaim.  This function is executed
+	 * every ttl/2 but we want ttl.  Skip every other time.
+	 */
+	if (!(++iob_role_reclaim_seq % 2)) {
+		/* roles in the other free_head are now older than ttl */
+		list_splice_init(iob_role_to_free_back, &role_todo);
+		swap(iob_role_to_free_front, iob_role_to_free_back);
+
+		/*
+		 * All roles to be reclaimed should have been unhashed
+		 * already.  Removing is enough.
+		 */
+		list_for_each_entry(role, &role_todo, free_list) {
+			WARN_ON_ONCE(!hlist_unhashed(&role->node.hash_node));
+			iob_remove_node(&role->node, iob_role_idx);
+		}
+	}
+
+	/* unhash and remove all acts which don't have bit set in @u->back */
+	for (i = find_next_zero_bit(u->back, iob_max_acts, IOB_BASE_NR);
+	     i < iob_max_acts;
+	     i = find_next_zero_bit(u->back, iob_max_acts, i + 1)) {
+		act = iob_node_to_act(iob_node_by_nr_raw(i, iob_act_idx));
+		if (act) {
+			WARN_ON_ONCE(!iob_unhash_node(&act->node, iob_act_idx));
+			iob_remove_node(&act->node, iob_act_idx);
+			act->free_next = free_head;
+			free_head = act;
+		}
+	}
+
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	/* reclaim complete, swap front<->back and clear front */
+	swap(u->front, u->back);
+	bitmap_clear(u->front, 0, iob_max_acts);
+
+	/* before freeing reclaimed nodes, wait for in-flight users to finish */
+	synchronize_sched();
+
+	list_for_each_entry_safe(role, role_pos, &role_todo, free_list)
+		iob_role_destroy(&role->node);
+
+	while ((act = free_head)) {
+		free_head = act->free_next;
+		iob_act_destroy(&act->node);
+	}
+
+	queue_delayed_work(system_nrt_wq, &iob_reclaim_work,
+			   iob_ttl_secs * HZ / 2);
+}
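
The two-bitmap aging above boils down to: an act survives a reclaim pass
iff its bit is set in either bitmap, i.e. it was marked used during at
least one of the last two ttl/2 periods.  A conceptual, userland-style
sketch of that predicate (not the kernel code above):

  static inline int act_survives(const unsigned long *front,
                                 const unsigned long *back, unsigned int nr)
  {
          unsigned int word = nr / (8 * sizeof(unsigned long));
          unsigned long mask = 1UL << (nr % (8 * sizeof(unsigned long)));

          return ((front[word] | back[word]) & mask) != 0;
  }
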
+
+
+/*
+ * PGTREE
+ *
+ * Radix tree to map pfn to iob_act.  This is used to track which iob_act
+ * dirtied the page.  When a bio is issued, each page in the iovec is
+ * consulted against pgtree to find out which act caused it.
+ *
+ * Because the size of pgtree is proportional to total available memory, it
+ * uses id.f.nr instead of the full id and may occasionally give a stale
+ * result.  Also, it uses a u16 array if iob_max_acts <= USHRT_MAX;
+ * otherwise, u32.
+ */
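
A worked example of the mapping implemented below, assuming 4KiB pages and
u16 entries (iob_pgtree_shift == 1, iob_pgtree_pfn_shift == 11, so each
radix tree leaf page covers 2048 pfns):

  pfn 0x12345:  idx    = 0x12345 >> 11   = 0x24    (radix tree index)
                offset = 0x12345 & 0x7ff = 0x345   (entry within that page)
                slot   = page + (0x345 << 1)       (byte offset 0x68a)
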
+
+void *iob_pgtree_slot(unsigned long pfn)
+{
+	unsigned long idx = pfn >> iob_pgtree_pfn_shift;
+	unsigned long offset = pfn & iob_pgtree_pfn_mask;
+	void *p;
+
+	p = radix_tree_lookup(&iob_pgtree, idx);
+	if (p)
+		return p + (offset << iob_pgtree_shift);
+	return NULL;
+}
+
+/**
+ * iob_pgtree_set_nr - map pfn to nr
+ * @pfn: pfn to map
+ * @nr: id.f.nr to be mapped
+ *
+ * Map @pfn to @nr, which can later be retrieved using
+ * iob_pgtree_get_and_clear_nr().  This function is opportunistic - it may
+ * fail under memory pressure, and concurrent pgtree ops may clobber each
+ * other's mappings.
+ */
+static int iob_pgtree_set_nr(unsigned long pfn, int nr)
+{
+	void *slot, *p;
+	unsigned long flags;
+	int ret;
+retry:
+	slot = iob_pgtree_slot(pfn);
+	if (likely(slot)) {
+		/*
+		 * We're playing with pointer casts and racy accesses.  Use
+		 * ACCESS_ONCE() to avoid compiler surprises.
+		 */
+		switch (iob_pgtree_shift) {
+		case 1:
+			ACCESS_ONCE(*(u16 *)slot) = nr;
+			break;
+		case 2:
+			ACCESS_ONCE(*(u32 *)slot) = nr;
+			break;
+		default:
+			BUG();
+		}
+		return 0;
+	}
+
+	/* slot missing, create and insert new page and retry */
+	p = (void *)get_zeroed_page(GFP_NOWAIT);
+	if (!p) {
+		iob_stats.pgtree_nomem++;
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&iob_lock, flags);
+	ret = radix_tree_insert(&iob_pgtree, pfn >> iob_pgtree_pfn_shift, p);
+	spin_unlock_irqrestore(&iob_lock, flags);
+
+	if (ret) {
+		free_page((unsigned long)p);
+		if (ret != -EEXIST) {
+			iob_stats.pgtree_nomem++;
+			return ret;
+		}
+	}
+	goto retry;
+}
+
+/**
+ * iob_pgtree_get_and_clear_nr - read back pfn to nr mapping and clear it
+ * @pfn: pfn to read mapping for
+ *
+ * Read back the mapping set by iob_pgtree_set_nr().  This function is
+ * opportunistic and concurrent pgtree ops may clobber each other's
+ * mappings.
+ */
+static int iob_pgtree_get_and_clear_nr(unsigned long pfn)
+{
+	void *slot;
+	int nr;
+
+	slot = iob_pgtree_slot(pfn);
+	if (unlikely(!slot))
+		return 0;
+
+	/*
+	 * We're playing with pointer casts and racy accesses.  Use
+	 * ACCESS_ONCE() to avoid compiler surprises.
+	 */
+	switch (iob_pgtree_shift) {
+	case 1:
+		nr = ACCESS_ONCE(*(u16 *)slot);
+		if (nr)
+			ACCESS_ONCE(*(u16 *)slot) = 0;
+		break;
+	case 2:
+		nr = ACCESS_ONCE(*(u32 *)slot);
+		if (nr)
+			ACCESS_ONCE(*(u32 *)slot) = 0;
+		break;
+	default:
+		BUG();
+	}
+	return nr;
+}
+
+
+/*
+ * PROBES
+ *
+ * Tracepoint probes.  This is how ioblame learns what's going on in the
+ * system.  TP probes are always called with preemption disabled, so we
+ * don't need explicit rcu_read_lock_sched().
+ */
+
+static void iob_set_last_ino(struct inode *inode)
+{
+	struct iob_role *role = iob_current_role();
+
+	role->last_ino.dev = inode->i_sb->s_dev;
+	role->last_ino.ino = inode->i_ino;
+	role->last_ino.gen = inode->i_generation;
+	role->last_ino_jiffies = jiffies;
+}
+
+/*
+ * Mark the last inode accessed by this task role.  This is used to
+ * attribute IOs to files.
+ */
+static void iob_probe_vfs_fcheck(void *data, struct files_struct *files,
+				 unsigned int fd, struct file *file)
+{
+	if (file) {
+		struct inode *inode = file->f_dentry->d_inode;
+
+		if (iob_enabled_inode(inode))
+			iob_set_last_ino(inode);
+	}
+}
+
+/* called after a page is dirtied - record the dirtying act in pgtree */
+static void iob_probe_wb_dirty_page(void *data, struct page *page,
+				    struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+
+	if (iob_enabled_inode(inode)) {
+		struct iob_act *act = iob_current_act(2, inode->i_sb->s_dev,
+						      inode->i_ino,
+						      inode->i_generation);
+
+		iob_pgtree_set_nr(page_to_pfn(page), act->node.id.f.nr);
+	}
+}
+
+/*
+ * Writeback is starting, record wb_reason in role->modifier.  This will
+ * be applied to any IOs issued from this task until writeback is finished.
+ */
+static void iob_probe_wb_start(void *data, struct backing_dev_info *bdi,
+			       struct wb_writeback_work *work)
+{
+	struct iob_role *role = iob_current_role();
+
+	role->modifier = work->reason | IOB_MODIFIER_WB;
+}
+
+/* writeback done, clear modifier */
+static void iob_probe_wb_written(void *data, struct backing_dev_info *bdi,
+				 struct wb_writeback_work *work)
+{
+	struct iob_role *role = iob_current_role();
+
+	role->modifier = 0;
+}
+
+/*
+ * An inode is about to be written back.  Will be followed by data and
+ * inode writeback.  In case dirtier data is not recorded in pgtree or
+ * inode, remember the inode in role->last_ino.
+ */
+static void iob_probe_wb_single_inode_start(void *data, struct inode *inode,
+					    struct writeback_control *wbc,
+					    unsigned long nr_to_write)
+{
+	if (iob_enabled_inode(inode))
+		iob_set_last_ino(inode);
+}
+
+/*
+ * Called when an inode is about to be dirtied, right before fs
+ * dirty_inode() method.  Different filesystems implement inode dirtying
+ * and writeback differently.  Some may allocate bh on dirtying, some might
+ * do it during write_inode() and others might not use bh at all.
+ *
+ * To cover most cases, two tracking mechanisms are used - role->inode_act
+ * and inode->i_iob_act.  The former marks the current task as performing
+ * inode dirtying act and any IOs issued or bhs touched are attributed to
+ * the act.  The latter records the dirtying act on the inode itself so
+ * that if the filesystem takes action for the inode from write_inode(),
+ * the acting task can take on the dirtying act.
+ */
+static void iob_probe_wb_dirty_inode_start(void *data, struct inode *inode,
+					   int flags)
+{
+	if (iob_enabled_inode(inode)) {
+		struct iob_role *role = iob_current_role();
+		struct iob_act *act = iob_current_act(1, inode->i_sb->s_dev,
+						      inode->i_ino,
+						      inode->i_generation);
+		role->inode_act = act->node.id;
+		inode->i_iob_act = act->node.id;
+	}
+}
+
+/* inode dirtying complete */
+static void iob_probe_wb_dirty_inode(void *data, struct inode *inode, int flags)
+{
+	if (iob_enabled_inode(inode))
+		iob_current_role()->inode_act.v = 0;
+}
+
+/*
+ * Called when an inode is being written back, right before fs
+ * write_inode() method.  Inode writeback is starting, take on the act
+ * which dirtied the inode.
+ */
+static void iob_probe_wb_write_inode_start(void *data, struct inode *inode,
+					   struct writeback_control *wbc)
+{
+	if (iob_enabled_inode(inode) && inode->i_iob_act.v) {
+		struct iob_role *role = iob_current_role();
+
+		role->inode_act = inode->i_iob_act;
+	}
+}
+
+/* inode writing complete */
+static void iob_probe_wb_write_inode(void *data, struct inode *inode,
+				     struct writeback_control *wbc)
+{
+	if (iob_enabled_inode(inode))
+		iob_current_role()->inode_act.v = 0;
+}
+
+/*
+ * Called on touch_buffer().  Transfer inode act to pgtree.  This catches
+ * most inode operations for filesystems which use bh for metadata.
+ */
+static void iob_probe_block_touch_buffer(void *data, struct buffer_head *bh)
+{
+	if (iob_enabled_bh(bh)) {
+		struct iob_role *role = iob_current_role();
+
+		if (role->inode_act.v)
+			iob_pgtree_set_nr(page_to_pfn(bh->b_page),
+					  role->inode_act.f.nr);
+	}
+}
+
+/* bio is being queued, collect all info into bio->bi_iob_info */
+static void iob_probe_block_bio_queue(void *data, struct request_queue *q,
+				      struct bio *bio)
+{
+	struct iob_io_info *io = &bio->bi_iob_info;
+	struct iob_act *act = NULL;
+	struct iob_role *role;
+	struct iob_intent *intent;
+	int i;
+
+	if (!iob_enabled_bio(bio))
+		return;
+
+	role = iob_current_role();
+
+	io->sector = bio->bi_sector;
+	io->size = bio->bi_size;
+	io->rw = bio->bi_rw;
+
+	/* usec duration will be calculated on completion */
+	io->queued_at = io->issued_at = local_clock();
+
+	/* role's inode_act has the highest priority */
+	if (role->inode_act.v)
+		act = iob_act_by_id(role->inode_act);
+
+	/* always walk pgtree and clear matching pages */
+	for (i = 0; i < bio->bi_vcnt; i++) {
+		struct bio_vec *bv = &bio->bi_io_vec[i];
+		int nr;
+
+		if (!bv->bv_len)
+			continue;
+
+		nr = iob_pgtree_get_and_clear_nr(page_to_pfn(bv->bv_page));
+		if (!nr || act)
+			continue;
+
+		/* this is the first act, charge everything to it */
+		act = iob_act_by_nr(nr);
+	}
+
+	if (act) {
+		/* charge it to async dirtier */
+		io->pid = iob_role_by_id(act->role)->pid;
+		io->dev = act->dev;
+		io->ino = act->ino;
+		io->gen = act->gen;
+
+		intent = iob_intent_by_id(act->intent);
+	} else {
+		/*
+		 * Charge it to the IO issuer and the last file this task
+		 * initiated RW or writeback on, which is highly likely to
+		 * be the file this IO is for.  As a sanity check, trust
+		 * last_ino only for pre-defined duration.
+		 *
+		 * When acquiring stack trace, skip this function and
+		 * generic_make_request[_checks]()
+		 */
+		unsigned long now = jiffies;
+
+		io->pid = role->pid;
+
+		if (!iob_ignore_ino &&
+		    time_before_eq(role->last_ino_jiffies, now) &&
+		    now - role->last_ino_jiffies <= IOB_LAST_INO_DURATION) {
+			io->dev = role->last_ino.dev;
+			io->ino = role->last_ino.ino;
+			io->gen = role->last_ino.gen;
+		} else {
+			io->dev = bio->bi_bdev->bd_dev;
+			io->ino = 0;
+			io->gen = 0;
+		}
+
+		intent = iob_current_intent(2);
+	}
+
+	/* apply intent modifier and store nr */
+	intent = iob_modified_intent(intent, role->modifier);
+	io->intent = intent->node.id.f.nr;
+}
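
To summarize the charging priority implemented above:

  1. role->inode_act          - the task is in the middle of dirtying or
                                writing back an inode
  2. first pgtree hit         - async writeback of pages dirtied earlier
  3. issuer + role->last_ino  - everything else (reads, direct IO), falling
                                back to the bare device if last_ino is stale
                                or ignore_ino is set
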
+
+/* when bios get merged, charge everything to the first bio */
+static void iob_probe_block_bio_backmerge(void *data, struct request_queue *q,
+					  struct request *rq, struct bio *bio)
+{
+	struct bio *mbio = rq->bio;
+	struct iob_io_info *mio = &mbio->bi_iob_info;
+	struct iob_io_info *sio = &bio->bi_iob_info;
+
+	mio->size += sio->size;
+	sio->size = 0;
+}
+
+/* when bios get merged, charge everything to the first bio */
+static void iob_probe_block_bio_frontmerge(void *data, struct request_queue *q,
+					   struct request *rq, struct bio *bio)
+{
+	struct bio *mbio = rq->bio;
+	struct iob_io_info *mio = &mbio->bi_iob_info;
+	struct iob_io_info *sio = &bio->bi_iob_info;
+	size_t msize = mio->size;
+
+	*mio = *sio;
+	mio->size += msize;
+	sio->size = 0;
+}
+
+/* record issue timestamp, this may not happen for bio based drivers */
+static void iob_probe_block_rq_issue(void *data, struct request_queue *q,
+				     struct request *rq)
+{
+	if (rq->bio && rq->bio->bi_iob_info.size)
+		rq->bio->bi_iob_info.issued_at = local_clock();
+}
+
+/* bio is complete, report and accumulate statistics */
+static void iob_probe_block_bio_complete(void *data, struct request_queue *q,
+					 struct bio *bio, int error)
+{
+	/* kick the TP */
+	trace_ioblame_io(bio);
+}
+
+/* %current is exiting, shoot down its role */
+static void iob_probe_block_sched_process_exit(void *data,
+					       struct task_struct *task)
+{
+	WARN_ON_ONCE(task != current);
+	iob_reclaim_current_role();
+}
+
+
+/**
+ * iob_disable - disable ioblame
+ *
+ * Master disable.  Stop ioblame, unregister all hooks and free all
+ * resources.
+ */
+static void iob_disable(void)
+{
+	const int gang_nr = 16;
+	unsigned long indices[gang_nr];
+	void **slots[gang_nr];
+	unsigned long base_idx = 0;
+	int i, nr;
+
+	mutex_lock(&iob_mutex);
+
+	/* if enabled, disable reclaim and unregister all hooks */
+	if (iob_enabled) {
+		cancel_delayed_work_sync(&iob_reclaim_work);
+		cancel_work_sync(&iob_intent_notify_work);
+		iob_enabled = false;
+
+		unregister_trace_vfs_fcheck(iob_probe_vfs_fcheck, NULL);
+		unregister_trace_writeback_dirty_page(iob_probe_wb_dirty_page, NULL);
+		unregister_trace_writeback_start(iob_probe_wb_start, NULL);
+		unregister_trace_writeback_written(iob_probe_wb_written, NULL);
+		unregister_trace_writeback_single_inode_start(iob_probe_wb_single_inode_start, NULL);
+		unregister_trace_writeback_dirty_inode_start(iob_probe_wb_dirty_inode_start, NULL);
+		unregister_trace_writeback_dirty_inode(iob_probe_wb_dirty_inode, NULL);
+		unregister_trace_writeback_write_inode_start(iob_probe_wb_write_inode_start, NULL);
+		unregister_trace_writeback_write_inode(iob_probe_wb_write_inode, NULL);
+		unregister_trace_block_touch_buffer(iob_probe_block_touch_buffer, NULL);
+		unregister_trace_block_bio_queue(iob_probe_block_bio_queue, NULL);
+		unregister_trace_block_bio_backmerge(iob_probe_block_bio_backmerge, NULL);
+		unregister_trace_block_bio_frontmerge(iob_probe_block_bio_frontmerge, NULL);
+		unregister_trace_block_rq_issue(iob_probe_block_rq_issue, NULL);
+		unregister_trace_block_bio_complete(iob_probe_block_bio_complete, NULL);
+		unregister_trace_sched_process_exit(iob_probe_block_sched_process_exit, NULL);
+
+		/* and drain all in-flight users */
+		tracepoint_synchronize_unregister();
+	}
+
+	/*
+	 * At this point, we're sure that nobody is executing iob hooks.
+	 * Free all resources.
+	 */
+	for (i = 0; i < ARRAY_SIZE(iob_act_used_bitmaps); i++) {
+		vfree(iob_act_used_bitmaps[i]);
+		iob_act_used_bitmaps[i] = NULL;
+	}
+
+	if (iob_role_idx)
+		iob_idx_destroy(iob_role_idx);
+	if (iob_intent_idx)
+		iob_idx_destroy(iob_intent_idx);
+	if (iob_act_idx)
+		iob_idx_destroy(iob_act_idx);
+	iob_role_idx = iob_intent_idx = iob_act_idx = NULL;
+
+	while ((nr = radix_tree_gang_lookup_slot(&iob_pgtree, slots, indices,
+						 base_idx, gang_nr))) {
+		for (i = 0; i < nr; i++) {
+			free_page((unsigned long)*slots[i]);
+			radix_tree_delete(&iob_pgtree, indices[i]);
+		}
+		base_idx = indices[nr - 1] + 1;
+	}
+
+	mutex_unlock(&iob_mutex);
+}
+
+/**
+ * iob_enable - enable ioblame
+ *
+ * Master enable.  Set up all resources and enable ioblame.  Returns 0 on
+ * success, -errno on failure.
+ */
+static int iob_enable(void)
+{
+	int i, err;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled)
+		goto out;
+
+	/* determine pgtree params from iob_max_acts */
+	iob_pgtree_shift = iob_max_acts <= USHRT_MAX ? 1 : 2;
+	iob_pgtree_pfn_shift = PAGE_SHIFT - iob_pgtree_shift;
+	iob_pgtree_pfn_mask = (1 << iob_pgtree_pfn_shift) - 1;
+
+	/* create iob_idx'es and allocate act used bitmaps */
+	err = -ENOMEM;
+	iob_role_idx = iob_idx_create(&iob_role_idx_type, iob_max_roles);
+	iob_intent_idx = iob_idx_create(&iob_intent_idx_type, iob_max_intents);
+	iob_act_idx = iob_idx_create(&iob_act_idx_type, iob_max_acts);
+
+	if (!iob_role_idx || !iob_intent_idx || !iob_act_idx)
+		goto out;
+
+	for (i = 0; i < ARRAY_SIZE(iob_act_used_bitmaps); i++) {
+		iob_act_used_bitmaps[i] = vzalloc(sizeof(unsigned long) *
+						  BITS_TO_LONGS(iob_max_acts));
+		if (!iob_act_used_bitmaps[i])
+			goto out;
+	}
+
+	iob_role_reclaim_seq = 0;
+	iob_act_used.front = iob_act_used_bitmaps[0];
+	iob_act_used.back = iob_act_used_bitmaps[1];
+
+	/* register hooks */
+	err = register_trace_vfs_fcheck(iob_probe_vfs_fcheck, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_page(iob_probe_wb_dirty_page, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_start(iob_probe_wb_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_written(iob_probe_wb_written, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_single_inode_start(iob_probe_wb_single_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_inode_start(iob_probe_wb_dirty_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_dirty_inode(iob_probe_wb_dirty_inode, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_write_inode_start(iob_probe_wb_write_inode_start, NULL);
+	if (err)
+		goto out;
+	err = register_trace_writeback_write_inode(iob_probe_wb_write_inode, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_touch_buffer(iob_probe_block_touch_buffer, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_queue(iob_probe_block_bio_queue, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_backmerge(iob_probe_block_bio_backmerge, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_frontmerge(iob_probe_block_bio_frontmerge, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_rq_issue(iob_probe_block_rq_issue, NULL);
+	if (err)
+		goto out;
+	err = register_trace_block_bio_complete(iob_probe_block_bio_complete, NULL);
+	if (err)
+		goto out;
+	err = register_trace_sched_process_exit(iob_probe_block_sched_process_exit, NULL);
+	if (err)
+		goto out;
+
+	/* wait until everything becomes visible */
+	synchronize_sched();
+	/* and go... */
+	iob_enabled = true;
+	queue_delayed_work(system_nrt_wq, &iob_reclaim_work,
+			   iob_ttl_secs * HZ / 2);
+out:
+	mutex_unlock(&iob_mutex);
+
+	if (iob_enabled)
+		return 0;
+	iob_disable();
+	return err;
+}
+
+/* ioblame/{*_max|ttl_secs} - uint tunables */
+static int iob_uint_get(void *data, u64 *val)
+{
+	*val = *(unsigned int *)data;
+	return 0;
+}
+
+static int __iob_uint_set(void *data, u64 val, bool must_be_disabled)
+{
+	if (val > INT_MAX)
+		return -EINVAL;
+
+	mutex_lock(&iob_mutex);
+	if (must_be_disabled && iob_enabled) {
+		mutex_unlock(&iob_mutex);
+		return -EBUSY;
+	}
+
+	*(unsigned int *)data = val;
+
+	mutex_unlock(&iob_mutex);
+
+	return 0;
+}
+
+/* max params must not be manipulated while enabled */
+static int iob_uint_set_disabled(void *data, u64 val)
+{
+	return __iob_uint_set(data, val, true);
+}
+
+/* ttl can be changed anytime */
+static int iob_uint_set(void *data, u64 val)
+{
+	return __iob_uint_set(data, val, false);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(iob_uint_fops_disabled, iob_uint_get,
+			iob_uint_set_disabled, "%llu\n");
+DEFINE_SIMPLE_ATTRIBUTE(iob_uint_fops, iob_uint_get, iob_uint_set, "%llu\n");
+
+/* bool - ioblame/ignore_ino, also used for ioblame/enable */
+static ssize_t iob_bool_read(struct file *file, char __user *ubuf,
+			     size_t count, loff_t *ppos)
+{
+	bool *boolp = file->f_dentry->d_inode->i_private;
+	const char *str = *boolp ? "Y\n" : "N\n";
+
+	return simple_read_from_buffer(ubuf, count, ppos, str, strlen(str));
+}
+
+static ssize_t __iob_bool_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos, bool *boolp)
+{
+	char buf[32] = { };
+	int err;
+
+	if (copy_from_user(buf, ubuf, min(count, sizeof(buf) - 1)))
+		return -EFAULT;
+
+	err = strtobool(buf, boolp);
+	if (err)
+		return err;
+
+	return err ?: count;
+}
+
+static ssize_t iob_bool_write(struct file *file, const char __user *ubuf,
+			      size_t count, loff_t *ppos)
+{
+	return __iob_bool_write(file, ubuf, count, ppos,
+				file->f_dentry->d_inode->i_private);
+}
+
+static const struct file_operations iob_bool_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iob_bool_read,
+	.write		= iob_bool_write,
+};
+
+/* u64 fops, used for stats */
+static int iob_u64_get(void *data, u64 *val)
+{
+	*val = *(u64 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(iob_stats_fops, iob_u64_get, NULL, "%llu\n");
+
+/* used to export nr_nodes of each iob_idx */
+static int iob_nr_nodes_get(void *data, u64 *val)
+{
+	struct iob_idx **idxp = data;
+
+	*val = 0;
+	mutex_lock(&iob_mutex);
+	if (*idxp)
+		*val = (*idxp)->nr_nodes;
+	mutex_unlock(&iob_mutex);
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(iob_nr_nodes_fops, iob_nr_nodes_get, NULL, "%llu\n");
+
+/*
+ * ioblame/devs - per device enable switch, accepts block device kernel
+ * name, "maj:min" or "*" for all devices.  Prefix '!' to disable.  Opening
+ * w/ O_TRUNC also disables ioblame for all devices.
+ */
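
For illustration (the device names are made up), writing "* !sdb" to this
file would enable every known disk and then turn sdb back off, in that
order.  Note that a plain ">" shell redirection opens the file with
O_TRUNC and therefore starts by disabling all devices before the new
tokens are applied; appending with ">>" modifies the current set instead.
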
+static void iob_enable_all_devs(bool enable)
+{
+	struct disk_iter diter;
+	struct gendisk *disk;
+
+	disk_iter_init(&diter);
+	while ((disk = disk_iter_next(&diter)))
+		disk->iob_enabled = enable;
+	disk_iter_exit(&diter);
+}
+
+static void *iob_devs_seq_start(struct seq_file *seqf, loff_t *pos)
+{
+	loff_t skip = *pos;
+	struct disk_iter *diter;
+	struct gendisk *disk;
+
+	diter = kmalloc(sizeof(*diter), GFP_KERNEL);
+	if (!diter)
+		return ERR_PTR(-ENOMEM);
+
+	seqf->private = diter;
+	disk_iter_init(diter);
+
+	/* skip to the current *pos */
+	do {
+		disk = disk_iter_next(diter);
+		if (!disk)
+			return NULL;
+	} while (skip--);
+
+	/* skip to the first iob_enabled disk */
+	while (disk && !disk->iob_enabled) {
+		(*pos)++;
+		disk = disk_iter_next(diter);
+	}
+
+	return disk;
+}
+
+static void *iob_devs_seq_next(struct seq_file *seqf, void *v, loff_t *pos)
+{
+	/* skip to the next iob_enabled disk */
+	while (true) {
+		struct gendisk *disk;
+
+		(*pos)++;
+		disk = disk_iter_next(seqf->private);
+		if (!disk)
+			return NULL;
+
+		if (disk->iob_enabled)
+			return disk;
+	}
+}
+
+static int iob_devs_seq_show(struct seq_file *seqf, void *v)
+{
+	struct gendisk *disk = v;
+	dev_t dev = disk_devt(disk);
+
+	seq_printf(seqf, "%u:%u %s\n", MAJOR(dev), MINOR(dev),
+		   disk->disk_name);
+	return 0;
+}
+
+static void iob_devs_seq_stop(struct seq_file *seqf, void *v)
+{
+	struct disk_iter *diter = seqf->private;
+
+	/* stop is called even after start failed :-( */
+	if (diter) {
+		disk_iter_exit(diter);
+		kfree(diter);
+	}
+}
+
+static ssize_t iob_devs_write(struct file *file, const char __user *ubuf,
+			      size_t cnt, loff_t *ppos)
+{
+	char *buf = NULL, *p = NULL, *last_tok = NULL, *tok;
+	int err;
+
+	if (!cnt)
+		return 0;
+
+	err = -ENOMEM;
+	buf = vmalloc(cnt + 1);
+	if (!buf)
+		goto out;
+
+	err = -EFAULT;
+	if (copy_from_user(buf, ubuf, cnt))
+		goto out;
+	buf[cnt] = '\0';
+
+	err = 0;
+	p = buf;
+	while ((tok = strsep(&p, " \t\r\n"))) {
+		bool enable = true;
+		int partno = 0;
+		struct gendisk *disk;
+		unsigned maj, min;
+		dev_t devt;
+
+		tok = strim(tok);
+		if (!strlen(tok))
+			continue;
+
+		if (tok[0] == '!') {
+			enable = false;
+			tok++;
+		}
+
+		if (!strcmp(tok, "*")) {
+			iob_enable_all_devs(enable);
+			last_tok = tok;
+			continue;
+		}
+
+		if (sscanf(tok, "%u:%u", &maj, &min) == 2)
+			devt = MKDEV(maj, min);
+		else
+			devt = blk_lookup_devt(tok, 0);
+
+		disk = get_gendisk(devt, &partno);
+		if (!disk || partno) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		disk->iob_enabled = enable;
+		put_disk(disk);
+		last_tok = tok;
+	}
+out:
+	vfree(buf);
+	if (!err)
+		return cnt;
+	if (last_tok)
+		return last_tok + strlen(last_tok) - buf;
+	return err;
+}
+
+static const struct seq_operations iob_devs_sops = {
+	.start		= iob_devs_seq_start,
+	.next		= iob_devs_seq_next,
+	.show		= iob_devs_seq_show,
+	.stop		= iob_devs_seq_stop,
+};
+
+static int iob_devs_seq_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
+		iob_enable_all_devs(false);
+
+	return seq_open(file, &iob_devs_sops);
+}
+
+static const struct file_operations iob_devs_fops = {
+	.owner		= THIS_MODULE,
+	.open		= iob_devs_seq_open,
+	.read		= seq_read,
+	.write		= iob_devs_write,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+/*
+ * ioblame/enable - master enable switch
+ */
+static ssize_t iob_enable_write(struct file *file, const char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	bool enable;
+	ssize_t ret;
+	int err = 0;
+
+	ret = __iob_bool_write(file, ubuf, count, ppos, &enable);
+	if (ret < 0)
+		return ret;
+
+	if (enable)
+		err = iob_enable();
+	else
+		iob_disable();
+
+	return err ?: ret;
+}
+
+static const struct file_operations iob_enable_fops = {
+	.owner		= THIS_MODULE,
+	.open		= nonseekable_open,
+	.read		= iob_bool_read,
+	.write		= iob_enable_write,
+};
+
+/*
+ * Print helpers.
+ */
+#define iob_print(p, e, fmt, args...)	(p + scnprintf(p, e - p, fmt , ##args))
+
+static char *iob_print_intent(char *p, char *e, struct iob_intent *intent,
+			      const char *header)
+{
+	int i;
+
+	p = iob_print(p, e, "%s#%d modifier=0x%x\n", header,
+		      intent->node.id.f.nr, intent->modifier);
+	for (i = 0; i < intent->depth; i++)
+		p = iob_print(p, e, "%s[%p] %pF\n", header,
+			      (void *)intent->trace[i],
+			      (void *)intent->trace[i]);
+	return p;
+}
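
Given the format strings above, a record in ioblame/intents (where the
header is empty) would look something like the following - the nr, the
modifier value, the addresses and the symbols are all hypothetical:

  #122 modifier=0x0
  [ffffffff811a0ba0] submit_bh+0x120/0x230
  [ffffffff8121f9e0] ext4_writepage+0x150/0x3a0
  ...
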
+
+
+/*
+ * ioblame/intents - export intents to userland.
+ *
+ * Userland can acquire intents by reading ioblame/intents.
+ *
+ * While iob is enabled, intents are never reclaimed, intent nrs are
+ * guaranteed to be allocated consecutively in ascending order and the
+ * intents file is lseekable by intent nr, so userland tools which want to
+ * learn about new intents since the last read can simply seek to the
+ * number of currently known intents and start reading from there.
+ *
+ * The file generates at least one size changed notification after a new
+ * intent is created.
+ */
+static void iob_intent_notify_workfn(struct work_struct *work)
+{
+	struct iattr iattr = (struct iattr){ .ia_valid = ATTR_SIZE };
+
+	/*
+	 * Invoked after new intent is created, kick bogus size changed
+	 * notification.
+	 */
+	notify_change(iob_intents_dentry, &iattr);
+}
+
+static loff_t iob_intents_llseek(struct file *file, loff_t offset, int origin)
+{
+	loff_t ret = -EIO;
+
+	mutex_lock(&iob_mutex);
+
+	if (iob_enabled) {
+		/*
+		 * We seek by intent nr and don't care about i_size.
+		 * Temporarily set i_size to nr_nodes and hitch on generic
+		 * llseek.
+		 */
+		i_size_write(file->f_dentry->d_inode, iob_intent_idx->nr_nodes);
+		ret = generic_file_llseek(file, offset, origin);
+		i_size_write(file->f_dentry->d_inode, 0);
+	}
+
+	mutex_unlock(&iob_mutex);
+	return ret;
+}
+
+static ssize_t iob_intents_read(struct file *file, char __user *ubuf,
+				size_t count, loff_t *ppos)
+{
+	char *buf, *p, *e;
+	int err;
+
+	if (count < PAGE_SIZE)
+		return -EINVAL;
+
+	err = -EIO;
+	mutex_lock(&iob_mutex);
+	if (!iob_enabled)
+		goto out;
+
+	p = buf = iob_page_buf;
+	e = p + PAGE_SIZE;
+
+	err = 0;
+	if (*ppos >= iob_intent_idx->nr_nodes)
+		goto out;
+
+	/* print to buf */
+	rcu_read_lock_sched();
+	p = iob_print_intent(p, e, iob_intent_by_nr(*ppos), "");
+	rcu_read_unlock_sched();
+	WARN_ON_ONCE(p == e);
+
+	/* copy out */
+	err = -EFAULT;
+	if (copy_to_user(ubuf, buf, p - buf))
+		goto out;
+
+	(*ppos)++;
+	err = 0;
+out:
+	mutex_unlock(&iob_mutex);
+	return err ?: p - buf;
+}
+
+static const struct file_operations iob_intents_fops = {
+	.owner		= THIS_MODULE,
+	.open		= generic_file_open,
+	.llseek		= iob_intents_llseek,
+	.read		= iob_intents_read,
+};
+
+
+static int __init ioblame_init(void)
+{
+	struct dentry *stats_dir;
+
+	BUILD_BUG_ON((1 << IOB_TYPE_BITS) < IOB_NR_TYPES);
+	BUILD_BUG_ON(IOB_NR_BITS + IOB_GEN_BITS + IOB_TYPE_BITS != 64);
+
+	iob_role_cache = KMEM_CACHE(iob_role, 0);
+	iob_act_cache = KMEM_CACHE(iob_act, 0);
+	if (!iob_role_cache || !iob_act_cache)
+		goto fail;
+
+	/* create ioblame/ dirs and files */
+	iob_dir = debugfs_create_dir("ioblame", NULL);
+	if (!iob_dir)
+		goto fail;
+
+	if (!debugfs_create_file("max_roles", 0600, iob_dir, &iob_max_roles, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("max_intents", 0600, iob_dir, &iob_max_intents, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("max_acts", 0600, iob_dir, &iob_max_acts, &iob_uint_fops_disabled) ||
+	    !debugfs_create_file("ttl_secs", 0600, iob_dir, &iob_ttl_secs, &iob_uint_fops) ||
+	    !debugfs_create_file("ignore_ino", 0600, iob_dir, &iob_ignore_ino, &iob_bool_fops) ||
+	    !debugfs_create_file("devs", 0600, iob_dir, NULL, &iob_devs_fops) ||
+	    !debugfs_create_file("enable", 0600, iob_dir, &iob_enabled, &iob_enable_fops) ||
+	    !debugfs_create_file("nr_roles", 0400, iob_dir, &iob_role_idx, &iob_nr_nodes_fops) ||
+	    !debugfs_create_file("nr_intents", 0400, iob_dir, &iob_intent_idx, &iob_nr_nodes_fops) ||
+	    !debugfs_create_file("nr_acts", 0400, iob_dir, &iob_act_idx, &iob_nr_nodes_fops))
+		goto fail;
+
+	iob_intents_dentry = debugfs_create_file("intents", 0400, iob_dir, NULL, &iob_intents_fops);
+	if (!iob_intents_dentry)
+		goto fail;
+
+	stats_dir = debugfs_create_dir("stats", iob_dir);
+	if (!stats_dir)
+		goto fail;
+
+	if (!debugfs_create_file("idx_nomem", 0400, stats_dir, &iob_stats.idx_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("idx_nospc", 0400, stats_dir, &iob_stats.idx_nospc, &iob_stats_fops) ||
+	    !debugfs_create_file("node_nomem", 0400, stats_dir, &iob_stats.node_nomem, &iob_stats_fops) ||
+	    !debugfs_create_file("pgtree_nomem", 0400, stats_dir, &iob_stats.pgtree_nomem, &iob_stats_fops))
+		goto fail;
+
+	return 0;
+
+fail:
+	if (iob_role_cache)
+		kmem_cache_destroy(iob_role_cache);
+	if (iob_act_cache)
+		kmem_cache_destroy(iob_act_cache);
+	if (iob_dir)
+		debugfs_remove_recursive(iob_dir);
+	return -ENOMEM;
+}
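
For reference, assuming debugfs is mounted at /sys/kernel/debug, the
layout created above ends up as:

  /sys/kernel/debug/ioblame/
      enable  devs  ignore_ino  ttl_secs
      max_roles  max_intents  max_acts
      nr_roles  nr_intents  nr_acts
      intents
      stats/
          idx_nomem  idx_nospc  node_nomem  pgtree_nomem
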
+
+static void __exit ioblame_exit(void)
+{
+	iob_disable();
+	debugfs_remove_recursive(iob_dir);
+	kmem_cache_destroy(iob_role_cache);
+	kmem_cache_destroy(iob_act_cache);
+}
+
+module_init(ioblame_init);
+module_exit(ioblame_exit);
+
+MODULE_AUTHOR("Tejun Heo <tj@kernel.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("IO monitor with dirtier and issuer tracking");
-- 
1.7.3.1



* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-11  1:32   ` [PATCH RESEND " Tejun Heo
@ 2012-01-11  6:15     ` Namhyung Kim
  2012-01-11 17:06       ` Tejun Heo
  2012-01-11 18:08     ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Namhyung Kim @ 2012-01-11  6:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp, linux-kernel, winget, Chanho Park, namhyung

2012-01-11 10:32 AM, Tejun Heo wrote:
> Implement ioblame, which can attribute each IO to its origin and
> export the information using a tracepoint.
>
> Operations which may eventually cause IOs and IO operations themselves
> are identified and tracked primarily by their stack traces along with
> the task and the target file (dev:ino:gen).  On each IO completion,
> ioblame knows why that specific IO happened and exports the
> information via ioblame:ioblame_io tracepoint.
>
> While ioblame adds fields to a few fs and block layer objects, all
> logic is well insulated inside ioblame proper and all hooking goes
> through well defined tracepoints and doesn't add any significant
> maintenance overhead.
>
> For details, please read Documentation/trace/ioblame.txt.
>
> -v2: Namhyung pointed out that all the information available at IO
>       completion can be exported via tracepoint and letting userland do
>       whatever it wants to do with that would be better.  Stripped out
>       in-kernel statistics gathering.
>
>       Now that everything is exported through tracepoint, iolog and
>       counters_pipe[_pipe] are unnecessary.  Removed.  intents_bin too
>       is removed.
>
>       As data collection no longer requires polling, ioblame/intents is
>       updated to generate inotify IN_MODIFY event after a new intent is
>       created.
>

Hi Tejun,

How about adding another tracepoint for intent creation to provide raw 
data as well, somewhere in iob_get_intent() or iob_intent_create() 
maybe? It can be useful to get those data for further processing IMHO.

Thanks,
Namhyung Kim


* Re: [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking
  2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
                   ` (8 preceding siblings ...)
  2012-01-10 18:28 ` [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking Tejun Heo
@ 2012-01-11 14:40 ` Frederic Weisbecker
  2012-01-11 17:02   ` Tejun Heo
  9 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2012-01-11 14:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, rostedt, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung

Hi Tejun,

On Tue, Jan 10, 2012 at 10:28:17AM -0800, Tejun Heo wrote:
> Hello, guys.
> 
> Even with blktrace and tracepoints, getting insight into the IOs going
> on a system is very challenging.  A lot of IO operations happen long
> after the action which triggered the IO finished and the overall
> asynchronous nature of IO operations make it difficult to trace back
> the origin of a given IO.
> 
> ioblame is an attempt at providing better visibility into overall IO
> behavior.  ioblame hooks into various tracepoints and tries to
> determine who caused any given IO how and charges the IO accordingly.
> 
> On each IO completion, ioblame knows who to charge the IO (task), how
> the IO got triggered (stack trace at the point of triggering, be it
> page, inode dirtying or direct IO issue) and various information about
> the IO itself (offset, size, how long it took and so on).  ioblame
> exports this information via ioblame:ioblame_io tracepoint.
> 
> For more details, please read Documentation/trace/ioblame.txt.
> 
> Changes from the last take[L] are,
> 
> * Per Namhyung's suggestion, in-kernel statistics gathering stripped
>   out.  All information is now exported through a tracepoint per each
>   IO.  This makes a lot of stuff unnecessary and over 1500 lines of
>   code have been removed.
> 
> * block_bio_complete tracepoint patch will result in duplicate
>   BLK_TA_COMPLETE notifications.  Namhyung is working on proper
>   solution.  For now, SOB is removed from the patch.
> 
> * Trace filter is no longer used and patches dropped from the series.
> 
> * Rebased on top of v3.2.
> 
> This patchset contains the following 9 patches.
> 
>   0001-block-abstract-disk-iteration-into-disk_iter.patch
>   0002-block-block_bio_complete-tracepoint-was-missing.patch
>   0003-block-add-req-to-bio_-front-back-_merge-tracepoints.patch
>   0004-writeback-move-struct-wb_writeback_work-to-writeback.patch
>   0005-writeback-add-more-tracepoints.patch
>   0006-block-add-block_touch_buffer-tracepoint.patch
>   0007-vfs-add-fcheck-tracepoint.patch
>   0008-stacktrace-implement-save_stack_trace_quick.patch
>   0009-block-trace-implement-ioblame-IO-tracer-with-origin-.patch
> 
> 0001-0004 update block layer in preparation.
> 
> 0005-0007 add more tracepoints along the IO stack.
> 
> 0008 adds nimbler backtrace dump function as ioblame dumps stacktrace
> extremely frequently.
> 
> 0009 implements ioblame.
> 
> This is still in early stage and I haven't done much performance
> analysis yet.  Tentative testing shows it adds ~20% CPU overhead when
> used on memory backed loopback device.
> 
> The patches are on top of v3.2 and available in the following git
> branch.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-ioblame
> 
> diffstat follows.
> 
>  Documentation/trace/ioblame.txt   |  476 +++++++
>  arch/x86/include/asm/stacktrace.h |    2 
>  arch/x86/kernel/stacktrace.c      |   40 
>  block/blk-core.c                  |    5 
>  block/genhd.c                     |   98 +
>  fs/bio.c                          |    3 
>  fs/fs-writeback.c                 |   34 
>  fs/super.c                        |    2 
>  include/linux/blk_types.h         |    4 
>  include/linux/buffer_head.h       |    7 
>  include/linux/fdtable.h           |    3 
>  include/linux/fs.h                |    3 
>  include/linux/genhd.h             |   13 
>  include/linux/ioblame.h           |   72 +
>  include/linux/stacktrace.h        |    6 
>  include/linux/writeback.h         |   18 
>  include/trace/events/block.h      |   70 -
>  include/trace/events/vfs.h        |   40 
>  include/trace/events/writeback.h  |  113 +
>  kernel/stacktrace.c               |    6 
>  kernel/trace/Kconfig              |   12 
>  kernel/trace/Makefile             |    1 
>  kernel/trace/blktrace.c           |    2 
>  kernel/trace/ioblame.c            | 2279 ++++++++++++++++++++++++++++++++++++++

I think this has been asked before. So sorry for asking twice.

But I'm wondering why the post processing is done in the kernel. Do you
think it would be possible to pull that out into userspace? We have a
nice scripting framework for post processing of trace events in perf
tools, for example.

If it's not possible, please tell us why. We really would like to avoid
adding such a big piece of code to the tracing subsystem if possible.

Thanks.

>  mm/page-writeback.c               |    2 
>  25 files changed, 3244 insertions(+), 67 deletions(-)
> 
> Thanks.
> 
> --
> tejun
> 
> [L] http://thread.gmane.org/gmane.linux.kernel/1235937

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] stacktrace: implement save_stack_trace_quick()
  2012-01-10 18:28 ` [PATCH 8/9] stacktrace: implement save_stack_trace_quick() Tejun Heo
@ 2012-01-11 16:26   ` Frederic Weisbecker
  2012-01-11 16:38     ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Frederic Weisbecker @ 2012-01-11 16:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, rostedt, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung, H. Peter Anvin

On Tue, Jan 10, 2012 at 10:28:25AM -0800, Tejun Heo wrote:
> Implement save_stack_trace_quick() which only considers the usual
> contexts (ie. thread and irq) and doesn't handle links between
> different contexts - if %current is in irq context, only backtrace in
> the irq stack is considered.

The thing I don't like is the duplication, which involves not only the
stack unwinding but also the safety checks.

What about making struct stacktrace_ops::stack() return a value
that either stops or continues the trace? In your case EOE/EOI would
be the triggering condition.
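
Something in this direction, say (completely hypothetical sketch - the
names and return values below are made up, not the in-tree interface):

enum bt_walk { BT_WALK_CONT, BT_WALK_STOP };

struct stacktrace_ops_v2 {
        void            (*address)(void *data, unsigned long addr,
                                   int reliable);
        /* called whenever the walker crosses into another stack */
        enum bt_walk    (*stack)(void *data, const char *name);
        walk_stack_t    walk_stack;
};

/* an ioblame-style user stops at the first boundary it sees */
static enum bt_walk iob_stack(void *data, const char *name)
{
        return BT_WALK_STOP;
}

A perf-style user could instead look at @name and skip or keep
specific contexts.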

Filtering stack contexts might in fact be a desirable generic feature
overall.

At least in perf we could be interested in filtering kernel/user contexts.
And in your case in stopping after the first context. I also don't know if
we will be interested in filtering irq/exception/process stacks in the future
but I prefer to ensure we have a flexible enough interface to allow that.

So it may be a good idea to reuse the existing code for your needs, like
a stack() return value as above. And if the post processing will be done
from userspace (which I really hope), then extend the ftrace/perf
interface to allow your quick filtering, something that can later be
extended to allow more fine-grained stacktrace filtering.

> This is a subset of dump_trace() done in a much simpler way.  It's
> intended to be used in hot paths where the overhead of dump_trace()
> can be too heavy.

Is it? Have you found a measurable impact (outside of the fact that you
record only one stack)?

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] stacktrace: implement save_stack_trace_quick()
  2012-01-11 16:26   ` Frederic Weisbecker
@ 2012-01-11 16:38     ` Tejun Heo
  2012-01-11 17:37       ` Tejun Heo
  2012-01-17  2:22       ` Frederic Weisbecker
  0 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 16:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: axboe, mingo, rostedt, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung, H. Peter Anvin

Hello, Frederic.

On Wed, Jan 11, 2012 at 05:26:44PM +0100, Frederic Weisbecker wrote:
> On Tue, Jan 10, 2012 at 10:28:25AM -0800, Tejun Heo wrote:
> > Implement save_stack_trace_quick() which only considers the usual
> > contexts (ie. thread and irq) and doesn't handle links between
> > different contexts - if %current is in irq context, only backtrace in
> > the irq stack is considered.
> 
> The thing I don't like is the duplication, which involves not only the
> stack unwinding but also the safety checks.

I'm not entirely convinced whether this is necessary or whether we can
just add more features to the existing backtrace facility (and maybe
make that more efficient) and be done with it.

> > This is a subset of dump_trace() done in a much simpler way.  It's
> > intended to be used in hot paths where the overhead of dump_trace()
> > can be too heavy.
> 
> Is it? Have you found a measurable impact (outside of the fact that you
> record only one stack)?

As I wrote in the head message, I haven't done comparative tests yet,
but in the preliminary tests the CPU overhead against a memory backed
device is quite visible (roughly ~20%), so I expect it to matter.
Note that testing against a memory backed device is actually relevant;
on faster SSDs, the CPU is already a bottleneck.

It would be best if we can extend the existing one to cover all the
cases with acceptable overhead.  I needed to write this minimal
version anyway for comparison so it's posted together but no matter
how it turns out switching them isn't difficult.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking
  2012-01-11 14:40 ` [RFC PATCHSET take#2] ioblame: " Frederic Weisbecker
@ 2012-01-11 17:02   ` Tejun Heo
  2012-01-11 22:45     ` David Sharp
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 17:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: axboe, mingo, rostedt, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung

Hello, Frederic.

On Wed, Jan 11, 2012 at 03:40:14PM +0100, Frederic Weisbecker wrote:
> I think this has been asked before. So sorry for asking twice.

I thought Namhyung was primarily asking about stat gathering which is
chopped now.

> But I'm wondering why the post processing is done in the kernel. Do you think
> it would be possible to pull that out into userspace? We have a nice scripting
> framework for post processing of trace events in perf tools, for example.
>
> If it's not possible, please tell us why. We really would like to avoid adding such
> a big piece of code to the tracing subsystem if possible.

I suppose you're talking about the state tracking by post-processing,
right?

* ioblame tracks stack trace for each dirtying operation.  If we don't
  want further state tracking in kernel, we would have to export the
  whole stack trace on each dirtying operation which can be high
  frequency.  Also, is there an efficient way to export variable
  length data via TPs?  If so, it can be somewhat better but still not
  very good.
  
* Even if we track dirtying state in userland, when an io is issued,
  it needs to be mapped back to the dirtying actions.  If the dirtier
  state is in userland, we have to export all physaddrs of pages in
  the IO so that userland can match them up and clear dirtied states.
  Again, the same problem.

* As implemented, most of state tracking should be fairly stable and
  shouldn't require much modification as code base evolves but it's
  still trying to extract pretty high level semantics from disjoint
  events across multiple layers.  It's reasonable to expect future
  changes would require updates to how those semantics are
  established.  Exporting higher level semantics, we don't get tied to
  keeping the relevant raw tracepoints and, more importantly, their
  exact interactions stable.

* It isn't trivial but still pretty straightforward.  Most of what it
  does is abbreviating stack traces to identifiers (which BTW could
  be useful for other tracing purposes and may be worthwhile to
  generalize) and tracking page and inode dirtiers using those
  identifiers - rough sketch below.  It stays mostly out of the way
  and doesn't noticeably harm maintainability.  It fits the role of
  in-kernel tracers - building information from domain knowledge and
  states and exporting to userland in sensible form.
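
To give a feel for how much state that actually is, the core boils
down to something like the following (simplified sketch with locking
and allocation left out - not the actual ioblame code):

#define IOB_INTENT_HASH_SIZE    512

/* simplified - the real struct carries more fields */
struct iob_intent {
        struct iob_intent       *hash_next;
        int                     nr;     /* id shown in ioblame/intents */
        unsigned int            depth;
        unsigned long           trace[];
};

static struct iob_intent *intent_hash[IOB_INTENT_HASH_SIZE];

/* allocates a new intent and assigns the next nr (not shown) */
int iob_intent_create(unsigned long *trace, unsigned int depth, u32 key);

/* map a stack trace to its small identifier */
static int iob_intent_nr(unsigned long *trace, unsigned int depth)
{
        u32 key = jhash(trace, depth * sizeof(trace[0]), 0) %
                  IOB_INTENT_HASH_SIZE;
        struct iob_intent *intent;

        for (intent = intent_hash[key]; intent; intent = intent->hash_next)
                if (intent->depth == depth &&
                    !memcmp(intent->trace, trace,
                            depth * sizeof(trace[0])))
                        return intent->nr;

        return iob_intent_create(trace, depth, key);
}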

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-11  6:15     ` Namhyung Kim
@ 2012-01-11 17:06       ` Tejun Heo
  2012-01-12  1:05         ` Namhyung Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 17:06 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp, linux-kernel, winget, Chanho Park, namhyung

Hello,

On Wed, Jan 11, 2012 at 03:15:54PM +0900, Namhyung Kim wrote:
> How about adding another tracepoint for intent creation to provide
> raw data as well, somewhere in iob_get_intent() or
> iob_intent_create() maybe? It can be useful to get those data for
> further processing IMHO.

While I don't particularly object to that, information and
notification (via inotify) for that is already available via
ioblame/intents file which we need regardless of the new tracepoint,
so it's kinda redundant, isn't it?
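
IOW, a tool only needs something like the following to stay in sync
(userland sketch; the path is just an example and depends on where
debugfs is mounted, and a real tool would parse instead of print):

#include <sys/inotify.h>
#include <stdio.h>
#include <unistd.h>

static const char *intents = "/sys/kernel/debug/ioblame/intents";

static void reload_intents(void)
{
        char line[256];
        FILE *fp = fopen(intents, "r");

        while (fp && fgets(line, sizeof(line), fp))
                fputs(line, stdout);
        if (fp)
                fclose(fp);
}

int main(void)
{
        char buf[4096];
        int fd = inotify_init();

        inotify_add_watch(fd, intents, IN_MODIFY);
        reload_intents();               /* pick up the existing entries */
        while (read(fd, buf, sizeof(buf)) > 0)
                reload_intents();       /* a new intent was added */
        return 0;
}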

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/9] block: block_bio_complete tracepoint was missing
  2012-01-10 18:28 ` [PATCH 2/9] block: block_bio_complete tracepoint was missing Tejun Heo
@ 2012-01-11 17:25   ` Steven Rostedt
  2012-01-11 17:30     ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Steven Rostedt @ 2012-01-11 17:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, fweisbec, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung

On Tue, 2012-01-10 at 10:28 -0800, Tejun Heo wrote:
> block_bio_complete tracepoint was defined but not invoked anywhere.
> Fix it.
> 
> -tj: This will generate duplicate BLK_TA_COMPLETEs.  Namhyung is
>      working on proper solution.
> 
> DO_NOT_APPLY
> Cc: Namhyung Kim <namhyung@gmail.com>
> ---
>  fs/bio.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index b1fe82c..96548da 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -1447,6 +1447,9 @@ void bio_endio(struct bio *bio, int error)
>  	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
>  		error = -EIO;
>  
> +	if (bio->bi_bdev)
> +		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
> +					 bio, error);

I thought I commented before about using TRACE_EVENT_CONDITION() here.
To remove that open coded branch.
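
Something like this, i.e. (rough and untested sketch, fields trimmed;
note the TP_PROTO change means the call site and the existing blktrace
probe would need matching updates):

TRACE_EVENT_CONDITION(block_bio_complete,

        TP_PROTO(struct bio *bio, int error),

        TP_ARGS(bio, error),

        /* the open coded "if (bio->bi_bdev)" moves in here */
        TP_CONDITION(bio->bi_bdev),

        TP_STRUCT__entry(
                __field( dev_t,         dev             )
                __field( sector_t,      sector          )
                __field( unsigned,      nr_sector       )
                __field( int,           error           )
        ),

        TP_fast_assign(
                __entry->dev            = bio->bi_bdev->bd_dev;
                __entry->sector         = bio->bi_sector;
                __entry->nr_sector      = bio->bi_size >> 9;
                __entry->error          = error;
        ),

        TP_printk("%d,%d sector=%llu nr_sector=%u error=%d",
                  MAJOR(__entry->dev), MINOR(__entry->dev),
                  (unsigned long long)__entry->sector,
                  __entry->nr_sector, __entry->error)
);

with bio_endio() then calling trace_block_bio_complete(bio, error)
unconditionally.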

-- Steve

>  	if (bio->bi_end_io)
>  		bio->bi_end_io(bio, error);
>  }



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/9] block: block_bio_complete tracepoint was missing
  2012-01-11 17:25   ` Steven Rostedt
@ 2012-01-11 17:30     ` Tejun Heo
  2012-01-12  0:24       ` Namhyung Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 17:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: axboe, mingo, fweisbec, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung

Hello,

On Wed, Jan 11, 2012 at 12:25:32PM -0500, Steven Rostedt wrote:
> On Tue, 2012-01-10 at 10:28 -0800, Tejun Heo wrote:
> > block_bio_complete tracepoint was defined but not invoked anywhere.
> > Fix it.
> > 
> > -tj: This will generate duplicate BLK_TA_COMPLETEs.  Namhyung is
> >      working on proper solution.
> > 
> > DO_NOT_APPLY
> > Cc: Namhyung Kim <namhyung@gmail.com>
> > ---
> >  fs/bio.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/bio.c b/fs/bio.c
> > index b1fe82c..96548da 100644
> > --- a/fs/bio.c
> > +++ b/fs/bio.c
> > @@ -1447,6 +1447,9 @@ void bio_endio(struct bio *bio, int error)
> >  	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
> >  		error = -EIO;
> >  
> > +	if (bio->bi_bdev)
> > +		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
> > +					 bio, error);
> 
> I thought I commented before about using TRACE_EVENT_CONDITION() here.
> To remove that open coded branch.

Yeah but this particular patch is dead now so it's a bit pointless.
ioblame:ioblame_io uses it FWIW.  Namhyung, can you please consider
using TRACE_EVENT_CONDITION() for your patches if applicable?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] stacktrace: implement save_stack_trace_quick()
  2012-01-11 16:38     ` Tejun Heo
@ 2012-01-11 17:37       ` Tejun Heo
  2012-01-17  2:22       ` Frederic Weisbecker
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 17:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: axboe, mingo, rostedt, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung, H. Peter Anvin

Hello,

On Wed, Jan 11, 2012 at 08:38:26AM -0800, Tejun Heo wrote:
> It would be best if we can extend the existing one to cover all the
> cases with acceptable overhead.  I needed to write this minimal
> version anyway for comparison so it's posted together but no matter
> how it turns out switching them isn't difficult.

To add a bit, it would probably be better to restructure the backtrace
code such that the common code provides discrete steps for backtracing
(maybe somewhat like iterators) and the interface functions assemble
what they need.  Such a structure would be much more malleable for
various purposes.
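
IOW, something in the vein of the following (hypothetical interface
just to show the shape - none of these bt_iter functions exist):

struct bt_iter {
        struct task_struct      *task;
        unsigned long           *sp;            /* current stack position */
        unsigned long           bp;             /* current frame pointer */
        const void              *stack_end;     /* end of the current stack */
};

void bt_iter_init(struct bt_iter *it, struct task_struct *task,
                  struct pt_regs *regs);
bool bt_iter_next_frame(struct bt_iter *it, unsigned long *addr);
bool bt_iter_next_stack(struct bt_iter *it);    /* hop to irq/exception stack */

/* a "quick" variant would then simply never hop to the next stack */
static void save_stack_trace_quick_sketch(struct stack_trace *trace)
{
        struct bt_iter it;
        unsigned long addr;

        bt_iter_init(&it, current, NULL);
        while (trace->nr_entries < trace->max_entries &&
               bt_iter_next_frame(&it, &addr))
                trace->entries[trace->nr_entries++] = addr;
}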

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/9] block: add block_touch_buffer tracepoint
  2012-01-10 18:28 ` [PATCH 6/9] block: add block_touch_buffer tracepoint Tejun Heo
@ 2012-01-11 17:42   ` Steven Rostedt
  2012-01-11 17:58     ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Steven Rostedt @ 2012-01-11 17:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, fweisbec, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung

On Tue, 2012-01-10 at 10:28 -0800, Tejun Heo wrote:
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/block.h>
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 458f497..245caed 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -13,6 +13,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/wait.h>
>  #include <linux/atomic.h>
> +#include <trace/events/block.h>

trace/events/x.h in a header can cause issues if this header is ever
included in another trace/events/x.h header. I think I'll need to write
a patch that allows this, but it won't be pretty.


-- Steve



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/9] block: add block_touch_buffer tracepoint
  2012-01-11 17:42   ` Steven Rostedt
@ 2012-01-11 17:58     ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 17:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: axboe, mingo, fweisbec, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung

On Wed, Jan 11, 2012 at 12:42:01PM -0500, Steven Rostedt wrote:
> On Tue, 2012-01-10 at 10:28 -0800, Tejun Heo wrote:
> >  
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/block.h>
> > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > index 458f497..245caed 100644
> > --- a/include/linux/buffer_head.h
> > +++ b/include/linux/buffer_head.h
> > @@ -13,6 +13,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/wait.h>
> >  #include <linux/atomic.h>
> > +#include <trace/events/block.h>
> 
> trace/events/x.h in a header can cause issues if this header is ever
> included in another trace/events/x.h header. I think I'll need to write
> a patch that allows this, but it won't be pretty.

Hmmm... I see.  I'll make touch_buffer() a function.
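
IOW, move it out of line so that buffer_head.h only needs a declaration
(roughly, untested):

/* include/linux/buffer_head.h: no trace header needed anymore */
void touch_buffer(struct buffer_head *bh);

/* fs/buffer.c: include <trace/events/block.h> here instead */
void touch_buffer(struct buffer_head *bh)
{
        trace_block_touch_buffer(bh);
        mark_page_accessed(bh->b_page);
}
EXPORT_SYMBOL(touch_buffer);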

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-11  1:32   ` [PATCH RESEND " Tejun Heo
  2012-01-11  6:15     ` Namhyung Kim
@ 2012-01-11 18:08     ` Tejun Heo
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-11 18:08 UTC (permalink / raw)
  To: axboe, mingo, rostedt, fweisbec, teravest, slavapestov, ctalbott,
	dhsharp
  Cc: linux-kernel, winget, namhyung, Chanho Park

On Tue, Jan 10, 2012 at 05:32:12PM -0800, Tejun Heo wrote:
> Implement ioblame, which can attribute each IO to its origin and
> export the information using a tracepoint.
> 
> Operations which may eventually cause IOs and IO operations themselves
> are identified and tracked primarily by their stack traces along with
> the task and the target file (dev:ino:gen).  On each IO completion,
> ioblame knows why that specific IO happened and exports the
> information via ioblame:ioblame_io tracepoint.
> 
> While ioblame adds fields to a few fs and block layer objects, all
> logic is well insulated inside ioblame proper and all hooking goes
> through well defined tracepoints and doesn't add any significant
> maintenance overhead.
> 
> For details, please read Documentation/trace/ioblame.txt.
> 
> -v2: Namhyung pointed out that all the information available at IO
>      completion can be exported via tracepoint and letting userland do
>      whatever it wants to do with that would be better.  Stripped out
>      in-kernel statistics gathering.
> 
>      Now that everything is exported through tracepoint, iolog and
>      counters_pipe[_pipe] are unnecessary.  Removed.  intents_bin too
>      is removed.
> 
>      As data collection no longer requires polling, ioblame/intents is
>      updated to generate inotify IN_MODIFY event after a new intent is
>      created.

One planned update is exporting issuer and dirtier separately.
Currently the dirtier, if it exists, simply overrides the issuer as it wasn't
useful for in-kernel statistics anyway.  With that gone, I think it
makes much more sense to expose both of them.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking
  2012-01-11 17:02   ` Tejun Heo
@ 2012-01-11 22:45     ` David Sharp
  0 siblings, 0 replies; 37+ messages in thread
From: David Sharp @ 2012-01-11 22:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, axboe, mingo, rostedt, teravest,
	slavapestov, ctalbott, linux-kernel, winget, namhyung

On Wed, Jan 11, 2012 at 9:02 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Frederic.
>
> On Wed, Jan 11, 2012 at 03:40:14PM +0100, Frederic Weisbecker wrote:
>> I think this has been asked before. So sorry for asking twice.
>
> I thought Namhyung was primarily asking about stat gathering which is
> chopped now.
>
>> But I'm wondering why the post processing is done in the kernel. Do you think
>> it would be possible to pull that out into userspace? We have a nice scripting
>> framework for post processing of trace events in perf tools, for example.
>>
>> If it's not possible, please tell us why. We really would like to avoid adding such
>> a big piece of code to the tracing subsystem if possible.
>
> I suppose you're talking about the state tracking by post-processing,
> right?
>
> * ioblame tracks stack trace for each dirtying operation.  If we don't
>  want further state tracking in kernel, we would have to export the
>  whole stack trace on each dirtying operation which can be high
>  frequency.  Also, is there an efficient way to export variable
>  length data via TPs?  If so, it can be somewhat better but still not
>  very good.

See __dynamic_array. It imposes a 4-byte overhead to store the offset
and length of data within the trace event.
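
For example (sketch only, the event and its fields are made up):

TRACE_EVENT(ioblame_dirty_stack,

        TP_PROTO(unsigned long *trace, unsigned int depth),

        TP_ARGS(trace, depth),

        TP_STRUCT__entry(
                __field(        unsigned int,   depth   )
                __dynamic_array(unsigned long,  trace,  depth   )
        ),

        TP_fast_assign(
                __entry->depth = depth;
                memcpy(__get_dynamic_array(trace), trace,
                       depth * sizeof(unsigned long));
        ),

        TP_printk("depth=%u", __entry->depth)
);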

That said, I'm always very wary of adding large amounts of data to
tracepoints, especially if they are high frequency, as that just leads
to faster ring buffer exhaustion.

>
> * Even if we track dirtying state in userland, when an io is issued,
>  it needs to be mapped back to the dirtying actions.  If the dirtier
>  state is in userland, we have to export all physaddrs of pages in
>  the IO so that userland can match them up and clear dirtied states.
>  Again, the same problem.
>
> * As implemented, most of state tracking should be fairly stable and
>  shouldn't require much modification as code base evolves but it's
>  still trying to extract pretty high level semantics from disjoint
>  events across multiple layers.  It's reasonable to expect future
>  changes would require updates to how those semantics are
>  established.  Exporting higher level semantics, we don't get tied to
>  keeping the relevant raw tracepoints and, more importantly, their
>  exact interactions stable.
>
> * It isn't trivial but still pretty straight-forward.  Most of what it
>  does is abbreviating stack trace to an identifier (which BTW could
>  be useful for other tracing purposes and may be worthwhile to
>  generalize) and tracking page and inode dirtiers using those
>  identifiers.  It stays mostly out of the way and doesn't noticeably
>  harm maintainability.  It fits the role of in-kernel tracers -
>  building information from domain knowledge and states and exporting
>  to userland in sensible form.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/9] block: block_bio_complete tracepoint was missing
  2012-01-11 17:30     ` Tejun Heo
@ 2012-01-12  0:24       ` Namhyung Kim
  0 siblings, 0 replies; 37+ messages in thread
From: Namhyung Kim @ 2012-01-12  0:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Steven Rostedt, axboe, mingo, fweisbec, teravest, slavapestov,
	ctalbott, dhsharp, linux-kernel, winget, namhyung

2012-01-12 2:30 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 11, 2012 at 12:25:32PM -0500, Steven Rostedt wrote:
>> On Tue, 2012-01-10 at 10:28 -0800, Tejun Heo wrote:
>>> block_bio_complete tracepoint was defined but not invoked anywhere.
>>> Fix it.
>>>
>>> -tj: This will generate duplicate BLK_TA_COMPLETEs.  Namhyung is
>>>       working on proper solution.
>>>
>>> DO_NOT_APPLY
>>> Cc: Namhyung Kim<namhyung@gmail.com>
>>> ---
>>>   fs/bio.c |    3 +++
>>>   1 files changed, 3 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/fs/bio.c b/fs/bio.c
>>> index b1fe82c..96548da 100644
>>> --- a/fs/bio.c
>>> +++ b/fs/bio.c
>>> @@ -1447,6 +1447,9 @@ void bio_endio(struct bio *bio, int error)
>>>   	else if (!test_bit(BIO_UPTODATE,&bio->bi_flags))
>>>   		error = -EIO;
>>>
>>> +	if (bio->bi_bdev)
>>> +		trace_block_bio_complete(bdev_get_queue(bio->bi_bdev),
>>> +					 bio, error);
>>
>> I thought I commented before about using TRACE_EVENT_CONDITION() here.
>> To remove that open coded branch.
>
> Yeah but this particular patch is dead now so it's a bit pointless.
> ioblame:ioblame_io uses it FWIW.  Namhyung, can you please consider
> using TRACE_EVENT_CONDITION() for your patches if applicable?
>
> Thanks.
>

Sure, I'll apply it.

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-11 17:06       ` Tejun Heo
@ 2012-01-12  1:05         ` Namhyung Kim
  2012-01-12  1:14           ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Namhyung Kim @ 2012-01-12  1:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

Hi,

2012-01-12 2:06 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 11, 2012 at 03:15:54PM +0900, Namhyung Kim wrote:
>> How about adding another tracepoint for intent creation to provide
>> raw data as well, somewhere in iob_get_intent() or
>> iob_intent_create() maybe? It can be useful to get those data for
>> further processing IMHO.
>
> While I don't particularly object to that, information and
> notification (via inotify) for that is already available via
> ioblame/intents file which we need regardless of the new tracepoint,
> so it's kinda redundant, isn't it?
>
> Thanks.
>

Yes. But that's text-based, so it might fit simple use cases better.
If we need further post processing based on intents, we could be better
off having a binary interface IMHO. And since we already use tracepoints
anyway, wouldn't it be good to avoid adding another layer of interface
or complexity?

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:05         ` Namhyung Kim
@ 2012-01-12  1:14           ` Tejun Heo
  2012-01-12  1:35             ` Namhyung Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-12  1:14 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

Hello,

On Wed, Jan 11, 2012 at 5:05 PM, Namhyung Kim <namhyung.kim@lge.com> wrote:
> Yes. But that's text-based, so it might fit simple use cases better. If
> we need further post processing based on intents, we could be better off
> having a binary interface IMHO. And since we already use tracepoints anyway,
> wouldn't it be good to avoid adding another layer of interface or
> complexity?

The thing is that all entries are needed for any post processing, not
only the new ones. To use a TP, either there needs to be a special
"trigger the TP for all existing entries" switch somewhere or the
ioblame/intents file needs to be read for the existing entries. Even
then, TPs aren't guaranteed to be reliable. There's no way to detect
overflow and re-emit the event. It just isn't the right interface. The
previous version had an intents_bin file in binary format but given that
there aren't too many intents, a binary interface didn't seem
necessary and I ripped it out. Adding it back isn't difficult at all but
I'm not sure that's a good idea. It's not like parsing the intents
file is difficult.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:14           ` Tejun Heo
@ 2012-01-12  1:35             ` Namhyung Kim
  2012-01-12  1:37               ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Namhyung Kim @ 2012-01-12  1:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

Hello,

2012-01-12 10:14 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 11, 2012 at 5:05 PM, Namhyung Kim<namhyung.kim@lge.com>  wrote:
>> Yes. But that's text-based, so it might fit simple use cases better. If
>> we need further post processing based on intents, we could be better off
>> having a binary interface IMHO. And since we already use tracepoints anyway,
>> wouldn't it be good to avoid adding another layer of interface or
>> complexity?
>
> The thing is that all entries are needed for any post processing, not
> only the new ones. To use a TP, either there needs to be a special
> "trigger the TP for all existing entries" switch somewhere or the
> ioblame/intents file needs to be read for the existing entries. Even then,
> TPs aren't guaranteed to be reliable. There's no way to detect
> overflow and re-emit the event. It just isn't the right interface. The
> previous version had an intents_bin file in binary format but given that
> there aren't too many intents, a binary interface didn't seem
> necessary and I ripped it out. Adding it back isn't difficult at all but
> I'm not sure that's a good idea. It's not like parsing the intents
> file is difficult.
>
> Thanks.
>

Why do we need to trigger the TP for existing ones as we keep each entry 
at its creation? Maybe I'm missing something?

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:35             ` Namhyung Kim
@ 2012-01-12  1:37               ` Tejun Heo
  2012-01-12  1:40                 ` Namhyung Kim
  2012-01-12  1:41                 ` Namhyung Kim
  0 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-12  1:37 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

Hello,

On Wed, Jan 11, 2012 at 5:35 PM, Namhyung Kim <namhyung.kim@lge.com> wrote:
> Why do we need to trigger the TP for existing ones as we keep each entry at
> its creation? Maybe I'm missing something?

If a userland program starts to watch the TP after some intents are
created, how is it gonna find out the existing ones?

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:37               ` Tejun Heo
@ 2012-01-12  1:40                 ` Namhyung Kim
  2012-01-12  1:41                 ` Namhyung Kim
  1 sibling, 0 replies; 37+ messages in thread
From: Namhyung Kim @ 2012-01-12  1:40 UTC (permalink / raw)
  To: linux-kernel
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

2012-01-12 10:37 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 11, 2012 at 5:35 PM, Namhyung Kim<namhyung.kim@lge.com>  wrote:
>> Why do we need to trigger the TP for existing ones as we keep each entry at
>> its creation? Maybe I'm missing something?
>
> If a userland program starts to watch the TP after some intents are
> created, how is it gonna find out the existing ones?
>

Hi,

OK, but that can be controlled from userspace, I guess? Do we have to 
care about that?

Thanks,
Namhyung Kim


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:37               ` Tejun Heo
  2012-01-12  1:40                 ` Namhyung Kim
@ 2012-01-12  1:41                 ` Namhyung Kim
  2012-01-12  1:44                   ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Namhyung Kim @ 2012-01-12  1:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

2012-01-12 10:37 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 11, 2012 at 5:35 PM, Namhyung Kim<namhyung.kim@lge.com>  wrote:
>> Why do we need to trigger the TP for existing ones as we keep each entry at
>> its creation? Maybe I'm missing something?
>
> If a userland program starts to watch the TP after some intents are
> created, how is it gonna find out the existing ones?
>

Hi,

OK, but that can be controlled from userspace, I guess? Do we have to 
care about that?

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:41                 ` Namhyung Kim
@ 2012-01-12  1:44                   ` Tejun Heo
  2012-01-12  2:19                     ` Namhyung Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2012-01-12  1:44 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

On Wed, Jan 11, 2012 at 5:41 PM, Namhyung Kim <namhyung.kim@lge.com> wrote:
> OK, but that can be controlled from userspace, I guess? Do we have to care
> about that?

Otherwise, it's gonna be a pretty silly interface. If you want to
watch the TP, you have to restart the whole thing, which also implies
intent numbers would change. And there still is the problem of losing
new intent events due to overflow. It just isn't the right interface
for the task.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  1:44                   ` Tejun Heo
@ 2012-01-12  2:19                     ` Namhyung Kim
  2012-01-12  2:24                       ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Namhyung Kim @ 2012-01-12  2:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

2012-01-12 10:44 AM, Tejun Heo wrote:
> On Wed, Jan 11, 2012 at 5:41 PM, Namhyung Kim<namhyung.kim@lge.com>  wrote:
>> OK, but that can be controlled from userspace, I guess? Do we have to care
>> about that?
>
> Otherwise, it's gonna be a pretty silly interface. If you want to
> watch the TP, you have to restart the whole thing, which also implies
> intent numbers would change. And there still is the problem of losing
> new intent events due to overflow. It just isn't the right interface
> for the task.
>

I understood it's unreliable. BTW I think that a sane userland tool must
start both the ioblame_io and (say) ioblame_create_intent TPs at once -
prior to triggering ioblame/enable, obviously. The ioblame/intents file
can be used as a backup just in case.

Thanks,
Namhyung Kim

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RESEND 9/9] block, trace: implement ioblame - IO tracer with origin tracking
  2012-01-12  2:19                     ` Namhyung Kim
@ 2012-01-12  2:24                       ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2012-01-12  2:24 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Namhyung Kim, axboe, mingo, rostedt, fweisbec, teravest,
	slavapestov, ctalbott, dhsharp, linux-kernel, winget,
	Chanho Park

Hello,

On Wed, Jan 11, 2012 at 6:19 PM, Namhyung Kim <namhyung.kim@lge.com> wrote:
> I understood it's unreliable. BTW I think that a sane userland tool must start
> both the ioblame_io and (say) ioblame_create_intent TPs at once - prior to
> triggering ioblame/enable, obviously. The ioblame/intents file can be used as a
> backup just in case.

So, regardless of create_intent TP, the tool has to have code to read
ioblame/intents, right? I just don't see what the benefit of having an
extra interface which can't even be used by itself would be.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] stacktrace: implement save_stack_trace_quick()
  2012-01-11 16:38     ` Tejun Heo
  2012-01-11 17:37       ` Tejun Heo
@ 2012-01-17  2:22       ` Frederic Weisbecker
  1 sibling, 0 replies; 37+ messages in thread
From: Frederic Weisbecker @ 2012-01-17  2:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, mingo, rostedt, teravest, slavapestov, ctalbott, dhsharp,
	linux-kernel, winget, namhyung, H. Peter Anvin

On Wed, Jan 11, 2012 at 08:38:26AM -0800, Tejun Heo wrote:
> Hello, Frederic.
> 
> On Wed, Jan 11, 2012 at 05:26:44PM +0100, Frederic Weisbecker wrote:
> > On Tue, Jan 10, 2012 at 10:28:25AM -0800, Tejun Heo wrote:
> > > Implement save_stack_trace_quick() which only considers the usual
> > > contexts (ie. thread and irq) and doesn't handle links between
> > > different contexts - if %current is in irq context, only backtrace in
> > > the irq stack is considered.
> > 
> > The thing I don't like is the duplication, which involves not only the
> > stack unwinding but also the safety checks.
> 
> I'm not entirely convinced whether this is necessary or whether we can
> just add more features to the existing backtrace facility (and maybe
> make that more efficient) and be done with it.

Yeah probably we can do that.

> 
> > > This is a subset of dump_trace() done in a much simpler way.  It's
> > > intended to be used in hot paths where the overhead of dump_trace()
> > > can be too heavy.
> > 
> > Is it? Have you found a measurable impact (outside of the fact that you
> > record only one stack)?
> 
> As I wrote in the head message, I haven't done comparative tests yet,
> but in the preliminary tests the CPU overhead against a memory backed
> device is quite visible (roughly ~20%), so I expect it to matter.
> Note that testing against a memory backed device is actually relevant;
> on faster SSDs, the CPU is already a bottleneck.
> 
> It would be best if we can extend the existing one to cover all the
> cases with acceptable overhead.  I needed to write this minimal
> version anyway for comparison so it's posted together but no matter
> how it turns out switching them isn't difficult.

Right. So there are a few differences that may affect performance
between save_stack_trace() and save_stack_trace_quick():

- save_stack_trace() does a full walk through the stack, but it rejects
unreliable entries. So to begin with, it should use print_context_stack_bp(),
which does a frame pointer walk only (in the CONFIG_FRAME_POINTER case) -
see the sketch below.

- It links between stacks. Having ->stack() return a value, as discussed
above, should help in this regard.

- And dump_trace() does various other checks; perhaps we can simplify
it a bit.
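
For the first point that should be roughly a one-liner in the ops
(from memory, untested):

/* arch/x86/kernel/stacktrace.c */
static const struct stacktrace_ops save_stack_ops = {
        .stack          = save_stack_stack,
        .address        = save_stack_address,
        .walk_stack     = print_context_stack_bp, /* was print_context_stack */
};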

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-01-17  2:22 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-10 18:28 [RFC PATCHSET take#2] ioblame: IO tracer with origin tracking Tejun Heo
2012-01-10 18:28 ` [PATCH 1/9] block: abstract disk iteration into disk_iter Tejun Heo
2012-01-10 18:28 ` [PATCH 2/9] block: block_bio_complete tracepoint was missing Tejun Heo
2012-01-11 17:25   ` Steven Rostedt
2012-01-11 17:30     ` Tejun Heo
2012-01-12  0:24       ` Namhyung Kim
2012-01-10 18:28 ` [PATCH 3/9] block: add @req to bio_{front|back}_merge tracepoints Tejun Heo
2012-01-10 18:28 ` [PATCH 4/9] writeback: move struct wb_writeback_work to writeback.h Tejun Heo
2012-01-10 18:28 ` [PATCH 5/9] writeback: add more tracepoints Tejun Heo
2012-01-10 18:28 ` [PATCH 6/9] block: add block_touch_buffer tracepoint Tejun Heo
2012-01-11 17:42   ` Steven Rostedt
2012-01-11 17:58     ` Tejun Heo
2012-01-10 18:28 ` [PATCH 7/9] vfs: add fcheck tracepoint Tejun Heo
2012-01-10 18:28 ` [PATCH 8/9] stacktrace: implement save_stack_trace_quick() Tejun Heo
2012-01-11 16:26   ` Frederic Weisbecker
2012-01-11 16:38     ` Tejun Heo
2012-01-11 17:37       ` Tejun Heo
2012-01-17  2:22       ` Frederic Weisbecker
2012-01-10 18:28 ` [PATCH 9/9] block, trace: implement ioblame - IO tracer with origin tracking Tejun Heo
2012-01-11  0:25   ` Chanho Park
2012-01-11  1:04     ` Tejun Heo
2012-01-11  1:32   ` [PATCH RESEND " Tejun Heo
2012-01-11  6:15     ` Namhyung Kim
2012-01-11 17:06       ` Tejun Heo
2012-01-12  1:05         ` Namhyung Kim
2012-01-12  1:14           ` Tejun Heo
2012-01-12  1:35             ` Namhyung Kim
2012-01-12  1:37               ` Tejun Heo
2012-01-12  1:40                 ` Namhyung Kim
2012-01-12  1:41                 ` Namhyung Kim
2012-01-12  1:44                   ` Tejun Heo
2012-01-12  2:19                     ` Namhyung Kim
2012-01-12  2:24                       ` Tejun Heo
2012-01-11 18:08     ` Tejun Heo
2012-01-11 14:40 ` [RFC PATCHSET take#2] ioblame: " Frederic Weisbecker
2012-01-11 17:02   ` Tejun Heo
2012-01-11 22:45     ` David Sharp

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).