* [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only
@ 2010-04-09  4:13 Divyesh Shah
  2010-04-09  4:14 ` [PATCH 1/3] blkio: Add io_merged stat Divyesh Shah
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Divyesh Shah @ 2010-04-09  4:13 UTC (permalink / raw)
  To: jens.axboe, vgoyal; +Cc: linux-kernel, nauman, ctalbott

The following patchset implements additional per-cgroup IO controller stats.
Some of these should be enabled for debugging only.

These stats have helped us debug issues with earlier IO controller
versions and should be useful now as well.
We've been using these stats to monitor and debug problems after the fact,
since they can be collected and stored for later use.

The stats added in this patchset are:
1) io_merged - This includes both the number of bios merged into requests
   belonging to this cgroup and the number of requests merged together.
   In the past, we've observed different merging behavior across upstream
   kernels, some by design, some actual bugs. This stat helps a lot in debugging
   such problems when applications report decreased throughput with a new kernel
   version.
2) io_queued
3) avg_queue_size
   These 2 stats are useful for getting a feel for the IO pattern of the cgroup,
   i.e., how filled up its queues are at a given instant and over the existence
   of the cgroup. This ability is useful when debugging problems in the wild as
   it helps understand the application's IO pattern w/o having to read through
   the userspace code (because it is tedious or just not available) or w/o the ability
   to run blktrace (since you may not have root access and/or not want to
   disturb performance).
4) group_wait_time - This is the amount of time the cgroup had to wait since it
   became busy (i.e., went from 0 to 1 request queued) to get a timeslice for
   one of its queues. This is different from the io_wait_time which is the
   cumulative total of the amount of time spent by each IO in that cgroup
   waiting in the scheduler queue. This stat is a great way to find jobs in the
   fleet that are being starved or waiting for longer than expected (due to an
   IO controller bug or any other issue).
5) empty_time - This is the amount of time a cgroup spends w/o any pending
   requests. This stat is useful when a job does not seem to be able to use its
   assigned disk share by helping check if that is happening due to an IO
   controller bug or because the job is not submitting enough IOs.
6) idle_time - This is the amount of time spent by the IO scheduler idling
   for a given cgroup in anticipation of a better request than the existing ones
   from other queues/cgroups.

The last 4 stats are enabled only when CONFIG_DEBUG_CFQ_IOSCHED=y, which in turn
enables CONFIG_DEBUG_BLK_CGROUP. A small userspace sketch for reading these stat
files follows below.
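
As a rough illustration (not part of the patchset), the new files can be read
like any other per-cgroup stat file. The mount point, cgroup name and the exact
"<device> <operation type> <value>" line layout in the sketch below are only
assumptions for illustration; the Documentation updates in the individual
patches describe each file authoritatively.

#include <stdio.h>

/*
 * Minimal sketch: dump the per-device, per-operation counters of one of the
 * new stat files, e.g. blkio.io_merged, for a hypothetical cgroup "testgrp".
 */
int main(void)
{
	const char *path = "/cgroup/blkio/testgrp/blkio.io_merged";
	char dev[32], op[32];
	unsigned long long val;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	while (fscanf(f, "%31s %31s %llu", dev, op, &val) == 3)
		printf("%-8s %-6s %llu\n", dev, op, val);
	fclose(f);
	return 0;
}
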
---

Divyesh Shah (3):
      Add more debug-only per-cgroup stats for IO controller
      Add io_queued and avg_queue_size stats for IO controller.
      Add io_merged stat. This includes both the number of bios merged into requests


 Documentation/cgroups/blkio-controller.txt |   45 +++++
 block/blk-cgroup.c                         |  264 ++++++++++++++++++++++++++++
 block/blk-cgroup.h                         |   80 ++++++++
 block/blk-core.c                           |    2 
 block/cfq-iosched.c                        |   72 ++++++--
 block/elevator.c                           |    9 +
 include/linux/elevator.h                   |    6 +
 7 files changed, 456 insertions(+), 22 deletions(-)

-- 
Divyesh


* [PATCH 1/3] blkio: Add io_merged stat
  2010-04-09  4:13 [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Divyesh Shah
@ 2010-04-09  4:14 ` Divyesh Shah
  2010-04-09  4:15 ` [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats Divyesh Shah
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Divyesh Shah @ 2010-04-09  4:14 UTC (permalink / raw)
  To: jens.axboe, vgoyal; +Cc: linux-kernel, nauman, ctalbott

This includes both the number of bios merged into requests belonging to this
cgroup as well as the number of requests merged together.
In the past, we've observed different merging behavior across upstream kernels,
some by design, some actual bugs. This stat helps a lot in debugging such
problems when applications report decreased throughput with a new kernel
version.

This required adding an extra elevator function to capture bios being merged,
as I did not want to pollute the elevator code with blkiocg knowledge and hence
wanted the accounting invocation to come from CFQ.

Signed-off-by: Divyesh Shah<dpshah@google.com>
---

 Documentation/cgroups/blkio-controller.txt |    5 +++++
 block/blk-cgroup.c                         |   17 +++++++++++++++++
 block/blk-cgroup.h                         |    8 +++++++-
 block/blk-core.c                           |    2 ++
 block/cfq-iosched.c                        |   11 +++++++++++
 block/elevator.c                           |    9 +++++++++
 include/linux/elevator.h                   |    6 ++++++
 7 files changed, 57 insertions(+), 1 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index ed04fe9..810e301 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -134,6 +134,11 @@ Details of cgroup files
 	  minor number of the device, third field specifies the operation type
 	  and the fourth field specifies the io_wait_time in ns.
 
+- blkio.io_merged
+	- Total number of bios/requests merged into requests belonging to this
+	  cgroup. This is further divided by the type of operation - read or
+	  write, sync or async.
+
 - blkio.dequeue
 	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
 	  gives the statistics about how many a times a group was dequeued
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6797df5..d23b538 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -127,6 +127,18 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_completion_stats);
 
+void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
+					bool sync)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_MERGED], 1, direction,
+			sync);
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
+
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 			struct blkio_group *blkg, void *key, dev_t dev)
 {
@@ -363,6 +375,7 @@ SHOW_FUNCTION_PER_GROUP(io_service_bytes, BLKIO_STAT_SERVICE_BYTES, 1);
 SHOW_FUNCTION_PER_GROUP(io_serviced, BLKIO_STAT_SERVICED, 1);
 SHOW_FUNCTION_PER_GROUP(io_service_time, BLKIO_STAT_SERVICE_TIME, 1);
 SHOW_FUNCTION_PER_GROUP(io_wait_time, BLKIO_STAT_WAIT_TIME, 1);
+SHOW_FUNCTION_PER_GROUP(io_merged, BLKIO_STAT_MERGED, 1);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
 #endif
@@ -408,6 +421,10 @@ struct cftype blkio_files[] = {
 		.read_map = blkiocg_io_wait_time_read,
 	},
 	{
+		.name = "io_merged",
+		.read_map = blkiocg_io_merged_read,
+	},
+	{
 		.name = "reset_stats",
 		.write_u64 = blkiocg_reset_stats,
 	},
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index b22e553..470a29d 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -34,6 +34,8 @@ enum stat_type {
 	BLKIO_STAT_SERVICED,
 	/* Total time spent waiting in scheduler queue in ns */
 	BLKIO_STAT_WAIT_TIME,
+	/* Number of IOs merged */
+	BLKIO_STAT_MERGED,
 	/* All the single valued stats go below this */
 	BLKIO_STAT_TIME,
 	BLKIO_STAT_SECTORS,
@@ -61,7 +63,7 @@ struct blkio_group_stats {
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
 	uint64_t sectors;
-	uint64_t stat_arr[BLKIO_STAT_WAIT_TIME + 1][BLKIO_STAT_TOTAL];
+	uint64_t stat_arr[BLKIO_STAT_MERGED + 1][BLKIO_STAT_TOTAL];
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	/* How many times this group has been removed from service tree */
 	unsigned long dequeue;
@@ -148,6 +150,8 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg, uint64_t bytes,
 						bool direction, bool sync);
 void blkiocg_update_completion_stats(struct blkio_group *blkg,
 	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
+void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
+					bool sync);
 #else
 struct cgroup;
 static inline struct blkio_cgroup *
@@ -169,5 +173,7 @@ static inline void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
 		uint64_t start_time, uint64_t io_start_time, bool direction,
 		bool sync) {}
+static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
+						bool direction, bool sync) {}
 #endif
 #endif /* _BLK_CGROUP_H */
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b1b29e..e9a5ae2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1202,6 +1202,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 		if (!blk_rq_cpu_valid(req))
 			req->cpu = bio->bi_comp_cpu;
 		drive_stat_acct(req, 0);
+		elv_bio_merged(q, req, bio);
 		if (!attempt_back_merge(q, req))
 			elv_merged_request(q, req, el_ret);
 		goto out;
@@ -1235,6 +1236,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 		if (!blk_rq_cpu_valid(req))
 			req->cpu = bio->bi_comp_cpu;
 		drive_stat_acct(req, 0);
+		elv_bio_merged(q, req, bio);
 		if (!attempt_front_merge(q, req))
 			elv_merged_request(q, req, el_ret);
 		goto out;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5617ae0..4eb1906 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1467,6 +1467,14 @@ static void cfq_merged_request(struct request_queue *q, struct request *req,
 	}
 }
 
+static void cfq_bio_merged(struct request_queue *q, struct request *req,
+				struct bio *bio)
+{
+	struct cfq_queue *cfqq = RQ_CFQQ(req);
+	blkiocg_update_io_merged_stats(&cfqq->cfqg->blkg, bio_data_dir(bio),
+					cfq_bio_sync(bio));
+}
+
 static void
 cfq_merged_requests(struct request_queue *q, struct request *rq,
 		    struct request *next)
@@ -1484,6 +1492,8 @@ cfq_merged_requests(struct request_queue *q, struct request *rq,
 	if (cfqq->next_rq == next)
 		cfqq->next_rq = rq;
 	cfq_remove_request(next);
+	blkiocg_update_io_merged_stats(&cfqq->cfqg->blkg, rq_data_dir(next),
+					rq_is_sync(next));
 }
 
 static int cfq_allow_merge(struct request_queue *q, struct request *rq,
@@ -3861,6 +3871,7 @@ static struct elevator_type iosched_cfq = {
 		.elevator_merged_fn =		cfq_merged_request,
 		.elevator_merge_req_fn =	cfq_merged_requests,
 		.elevator_allow_merge_fn =	cfq_allow_merge,
+		.elevator_bio_merged_fn =	cfq_bio_merged,
 		.elevator_dispatch_fn =		cfq_dispatch_requests,
 		.elevator_add_req_fn =		cfq_insert_request,
 		.elevator_activate_req_fn =	cfq_activate_request,
diff --git a/block/elevator.c b/block/elevator.c
index 76e3702..5e73459 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -539,6 +539,15 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
 	q->last_merge = rq;
 }
 
+void elv_bio_merged(struct request_queue *q, struct request *rq,
+			struct bio *bio)
+{
+	struct elevator_queue *e = q->elevator;
+
+	if (e->ops->elevator_bio_merged_fn)
+		e->ops->elevator_bio_merged_fn(q, rq, bio);
+}
+
 void elv_requeue_request(struct request_queue *q, struct request *rq)
 {
 	/*
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 1cb3372..2c958f4 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -14,6 +14,9 @@ typedef void (elevator_merged_fn) (struct request_queue *, struct request *, int
 
 typedef int (elevator_allow_merge_fn) (struct request_queue *, struct request *, struct bio *);
 
+typedef void (elevator_bio_merged_fn) (struct request_queue *,
+						struct request *, struct bio *);
+
 typedef int (elevator_dispatch_fn) (struct request_queue *, int);
 
 typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
@@ -36,6 +39,7 @@ struct elevator_ops
 	elevator_merged_fn *elevator_merged_fn;
 	elevator_merge_req_fn *elevator_merge_req_fn;
 	elevator_allow_merge_fn *elevator_allow_merge_fn;
+	elevator_bio_merged_fn *elevator_bio_merged_fn;
 
 	elevator_dispatch_fn *elevator_dispatch_fn;
 	elevator_add_req_fn *elevator_add_req_fn;
@@ -103,6 +107,8 @@ extern int elv_merge(struct request_queue *, struct request **, struct bio *);
 extern void elv_merge_requests(struct request_queue *, struct request *,
 			       struct request *);
 extern void elv_merged_request(struct request_queue *, struct request *, int);
+extern void elv_bio_merged(struct request_queue *q, struct request *,
+				struct bio *);
 extern void elv_requeue_request(struct request_queue *, struct request *);
 extern int elv_queue_empty(struct request_queue *);
 extern struct request *elv_former_request(struct request_queue *, struct request *);



* [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats
  2010-04-09  4:13 [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Divyesh Shah
  2010-04-09  4:14 ` [PATCH 1/3] blkio: Add io_merged stat Divyesh Shah
@ 2010-04-09  4:15 ` Divyesh Shah
  2010-04-09 15:32   ` Vivek Goyal
  2010-04-09  4:15 ` [PATCH 3/3] blkio: Add more debug-only per-cgroup stats Divyesh Shah
  2010-04-09  6:36 ` [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Jens Axboe
  3 siblings, 1 reply; 12+ messages in thread
From: Divyesh Shah @ 2010-04-09  4:15 UTC (permalink / raw)
  To: jens.axboe, vgoyal; +Cc: linux-kernel, nauman, ctalbott

These stats are useful for getting a feel for the queue depth of the cgroup,
i.e., how filled up its queues are at a given instant and over the existence of
the cgroup. This ability is useful when debugging problems in the wild as it
helps understand the application's IO pattern w/o having to read through the
userspace code (because it is tedious or just not available) or w/o the ability
to run blktrace (since you may not have root access and/or not want to disturb
performance).

Signed-off-by: Divyesh Shah<dpshah@google.com>
---

 Documentation/cgroups/blkio-controller.txt |   11 +++
 block/blk-cgroup.c                         |   98 +++++++++++++++++++++++++++-
 block/blk-cgroup.h                         |   20 +++++-
 block/cfq-iosched.c                        |   11 +++
 4 files changed, 134 insertions(+), 6 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 810e301..6e52e7c 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -139,6 +139,17 @@ Details of cgroup files
 	  cgroup. This is further divided by the type of operation - read or
 	  write, sync or async.
 
+- blkio.io_queued
+	- Total number of requests queued up at any given instant for this
+	  cgroup. This is further divided by the type of operation - read or
+	  write, sync or async.
+
+- blkio.avg_queue_size
+	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
+	  The average queue size for this cgroup over the entire time of this
+	  cgroup's existence. Queue size samples are taken each time one of the
+	  queues of this cgroup gets a timeslice.
+
 - blkio.dequeue
 	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
 	  gives the statistics about how many a times a group was dequeued
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d23b538..1e0c497 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -81,6 +81,71 @@ static void blkio_add_stat(uint64_t *stat, uint64_t add, bool direction,
 		stat[BLKIO_STAT_ASYNC] += add;
 }
 
+/*
+ * Decrements the appropriate stat variable if non-zero depending on the
+ * request type. Panics on value being zero.
+ * This should be called with the blkg->stats_lock held.
+ */
+static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
+{
+	if (direction) {
+		BUG_ON(stat[BLKIO_STAT_WRITE] == 0);
+		stat[BLKIO_STAT_WRITE]--;
+	} else {
+		BUG_ON(stat[BLKIO_STAT_READ] == 0);
+		stat[BLKIO_STAT_READ]--;
+	}
+	if (sync) {
+		BUG_ON(stat[BLKIO_STAT_SYNC] == 0);
+		stat[BLKIO_STAT_SYNC]--;
+	} else {
+		BUG_ON(stat[BLKIO_STAT_ASYNC] == 0);
+		stat[BLKIO_STAT_ASYNC]--;
+	}
+}
+
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
+{
+	unsigned long flags;
+	struct blkio_group_stats *stats;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	stats = &blkg->stats;
+	stats->avg_queue_size_sum +=
+			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
+			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
+	stats->avg_queue_size_samples++;
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_set_active_queue_stats);
+#endif
+
+void blkiocg_update_request_add_stats(struct blkio_group *blkg,
+			struct blkio_group *curr_blkg, bool direction,
+			bool sync)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
+			sync);
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_request_add_stats);
+
+void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
+						bool direction, bool sync)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	blkio_check_and_dec_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED],
+					direction, sync);
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_request_remove_stats);
+
 void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
 {
 	unsigned long flags;
@@ -253,14 +318,18 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 	struct blkio_cgroup *blkcg;
 	struct blkio_group *blkg;
 	struct hlist_node *n;
-	struct blkio_group_stats *stats;
+	uint64_t queued[BLKIO_STAT_TOTAL];
+	int i;
 
 	blkcg = cgroup_to_blkio_cgroup(cgroup);
 	spin_lock_irq(&blkcg->lock);
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		spin_lock(&blkg->stats_lock);
-		stats = &blkg->stats;
-		memset(stats, 0, sizeof(struct blkio_group_stats));
+		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
+			queued[i] = blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i];
+		memset(&blkg->stats, 0, sizeof(struct blkio_group_stats));
+		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
+			blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
 		spin_unlock(&blkg->stats_lock);
 	}
 	spin_unlock_irq(&blkcg->lock);
@@ -323,6 +392,15 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
 					blkg->stats.sectors, cb, dev);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+	if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
+		uint64_t sum = blkg->stats.avg_queue_size_sum;
+		uint64_t samples = blkg->stats.avg_queue_size_samples;
+		if (samples)
+			do_div(sum, samples);
+		else
+			sum = 0;
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
+	}
 	if (type == BLKIO_STAT_DEQUEUE)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
 					blkg->stats.dequeue, cb, dev);
@@ -376,8 +454,10 @@ SHOW_FUNCTION_PER_GROUP(io_serviced, BLKIO_STAT_SERVICED, 1);
 SHOW_FUNCTION_PER_GROUP(io_service_time, BLKIO_STAT_SERVICE_TIME, 1);
 SHOW_FUNCTION_PER_GROUP(io_wait_time, BLKIO_STAT_WAIT_TIME, 1);
 SHOW_FUNCTION_PER_GROUP(io_merged, BLKIO_STAT_MERGED, 1);
+SHOW_FUNCTION_PER_GROUP(io_queued, BLKIO_STAT_QUEUED, 1);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
+SHOW_FUNCTION_PER_GROUP(avg_queue_size, BLKIO_STAT_AVG_QUEUE_SIZE, 0);
 #endif
 #undef SHOW_FUNCTION_PER_GROUP
 
@@ -425,14 +505,22 @@ struct cftype blkio_files[] = {
 		.read_map = blkiocg_io_merged_read,
 	},
 	{
+		.name = "io_queued",
+		.read_map = blkiocg_io_queued_read,
+	},
+	{
 		.name = "reset_stats",
 		.write_u64 = blkiocg_reset_stats,
 	},
 #ifdef CONFIG_DEBUG_BLK_CGROUP
-       {
+	{
+		.name = "avg_queue_size",
+		.read_map = blkiocg_avg_queue_size_read,
+	},
+	{
 		.name = "dequeue",
 		.read_map = blkiocg_dequeue_read,
-       },
+	},
 #endif
 };
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 470a29d..bea7f3b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -36,10 +36,13 @@ enum stat_type {
 	BLKIO_STAT_WAIT_TIME,
 	/* Number of IOs merged */
 	BLKIO_STAT_MERGED,
+	/* Number of IOs queued up */
+	BLKIO_STAT_QUEUED,
 	/* All the single valued stats go below this */
 	BLKIO_STAT_TIME,
 	BLKIO_STAT_SECTORS,
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+	BLKIO_STAT_AVG_QUEUE_SIZE,
 	BLKIO_STAT_DEQUEUE
 #endif
 };
@@ -63,8 +66,12 @@ struct blkio_group_stats {
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
 	uint64_t sectors;
-	uint64_t stat_arr[BLKIO_STAT_MERGED + 1][BLKIO_STAT_TOTAL];
+	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+	/* Sum of number of IOs queued across all samples */
+	uint64_t avg_queue_size_sum;
+	/* Count of samples taken for average */
+	uint64_t avg_queue_size_samples;
 	/* How many times this group has been removed from service tree */
 	unsigned long dequeue;
 #endif
@@ -127,10 +134,13 @@ static inline char *blkg_path(struct blkio_group *blkg)
 {
 	return blkg->path;
 }
+void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg);
 void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 				unsigned long dequeue);
 #else
 static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
+static inline void blkiocg_update_set_active_queue_stats(
+						struct blkio_group *blkg) {}
 static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 						unsigned long dequeue) {}
 #endif
@@ -152,6 +162,10 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
 	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
 void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
 					bool sync);
+void blkiocg_update_request_add_stats(struct blkio_group *blkg,
+		struct blkio_group *curr_blkg, bool direction, bool sync);
+void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
+					bool direction, bool sync);
 #else
 struct cgroup;
 static inline struct blkio_cgroup *
@@ -175,5 +189,9 @@ static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
 		bool sync) {}
 static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
 						bool direction, bool sync) {}
+static inline void blkiocg_update_request_add_stats(struct blkio_group *blkg,
+		struct blkio_group *curr_blkg, bool direction, bool sync) {}
+static inline void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
+						bool direction, bool sync) {}
 #endif
 #endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4eb1906..8e0b86a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1380,7 +1380,12 @@ static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
 {
 	elv_rb_del(&cfqq->sort_list, rq);
 	cfqq->queued[rq_is_sync(rq)]--;
+	blkiocg_update_request_remove_stats(&cfqq->cfqg->blkg, rq_data_dir(rq),
+						rq_is_sync(rq));
 	cfq_add_rq_rb(rq);
+	blkiocg_update_request_add_stats(
+			&cfqq->cfqg->blkg, &cfqq->cfqd->serving_group->blkg,
+			rq_data_dir(rq), rq_is_sync(rq));
 }
 
 static struct request *
@@ -1436,6 +1441,8 @@ static void cfq_remove_request(struct request *rq)
 	cfq_del_rq_rb(rq);
 
 	cfqq->cfqd->rq_queued--;
+	blkiocg_update_request_remove_stats(&cfqq->cfqg->blkg, rq_data_dir(rq),
+						rq_is_sync(rq));
 	if (rq_is_meta(rq)) {
 		WARN_ON(!cfqq->meta_pending);
 		cfqq->meta_pending--;
@@ -1527,6 +1534,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 	if (cfqq) {
 		cfq_log_cfqq(cfqd, cfqq, "set_active wl_prio:%d wl_type:%d",
 				cfqd->serving_prio, cfqd->serving_type);
+		blkiocg_update_set_active_queue_stats(&cfqq->cfqg->blkg);
 		cfqq->slice_start = 0;
 		cfqq->dispatch_start = jiffies;
 		cfqq->allocated_slice = 0;
@@ -3213,6 +3221,9 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
 	list_add_tail(&rq->queuelist, &cfqq->fifo);
 	cfq_add_rq_rb(rq);
 
+	blkiocg_update_request_add_stats(&cfqq->cfqg->blkg,
+			&cfqd->serving_group->blkg, rq_data_dir(rq),
+			rq_is_sync(rq));
 	cfq_rq_enqueued(cfqd, cfqq, rq);
 }
 



* [PATCH 3/3] blkio: Add more debug-only per-cgroup stats
  2010-04-09  4:13 [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Divyesh Shah
  2010-04-09  4:14 ` [PATCH 1/3] blkio: Add io_merged stat Divyesh Shah
  2010-04-09  4:15 ` [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats Divyesh Shah
@ 2010-04-09  4:15 ` Divyesh Shah
  2010-04-09 19:21   ` Vivek Goyal
  2010-04-09  6:36 ` [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Jens Axboe
  3 siblings, 1 reply; 12+ messages in thread
From: Divyesh Shah @ 2010-04-09  4:15 UTC (permalink / raw)
  To: jens.axboe, vgoyal; +Cc: linux-kernel, nauman, ctalbott

1) group_wait_time - This is the amount of time the cgroup had to wait to get a
  timeslice for one of its queues from when it became busy, i.e., went from 0
  to 1 request queued. This is different from the io_wait_time which is the
  cumulative total of the amount of time spent by each IO in that cgroup waiting
  in the scheduler queue. This stat is a great way to find jobs in the fleet
  that are being starved or waiting for longer than expected (due to an IO
  controller bug or any other issue).
2) empty_time - This is the amount of time a cgroup spends w/o any pending
   requests. This stat is useful when a job does not seem to be able to use its
   assigned disk share by helping check if that is happening due to an IO
   controller bug or because the job is not submitting enough IOs.
3) idle_time - This is the amount of time spent by the IO scheduler idling
   for a given cgroup in anticipation of a better request than the existing ones
   from other queues/cgroups.

All these stats are recorded using start and stop events. When reading these
stats, we do not add the delta between the current time and the last start time
if we're between the start and stop events. We avoid doing this to make sure
that these numbers are always monotonically increasing when read. Since we're
using sched_clock(), which may use the TSC as its source, including the current
delta could induce some inconsistency (due to TSC resync across CPUs).
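
As a simplified userspace illustration of that start/stop pattern (not the
patch code; the struct and helper names here are made up), the cumulative
counter is only advanced at the stop event, so a reader that samples it while
an interval is still open just sees the value accumulated so far:

#include <stdio.h>

struct demo_stats {
	unsigned long long wait_time;       /* cumulative value exported to readers */
	unsigned long long start_wait_time; /* valid only while waiting */
	int waiting;
};

static void demo_start_wait(struct demo_stats *s, unsigned long long now)
{
	if (s->waiting)
		return;
	s->start_wait_time = now;
	s->waiting = 1;
}

static void demo_stop_wait(struct demo_stats *s, unsigned long long now)
{
	if (!s->waiting)
		return;
	/* Guard against a clock that is not perfectly consistent across CPUs. */
	if (now > s->start_wait_time)
		s->wait_time += now - s->start_wait_time;
	s->waiting = 0;
}

int main(void)
{
	struct demo_stats s = { 0, 0, 0 };

	demo_start_wait(&s, 100);
	/* A read here reports 0 - the open interval is not folded in. */
	printf("read between start and stop: %llu\n", s.wait_time);
	demo_stop_wait(&s, 250);
	printf("read after stop: %llu\n", s.wait_time);
	return 0;
}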

Signed-off-by: Divyesh Shah<dpshah@google.com>
---

 Documentation/cgroups/blkio-controller.txt |   29 +++++
 block/blk-cgroup.c                         |  159 +++++++++++++++++++++++++++-
 block/blk-cgroup.h                         |   54 ++++++++++
 block/cfq-iosched.c                        |   50 ++++++---
 4 files changed, 271 insertions(+), 21 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 6e52e7c..db054ea 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -150,6 +150,35 @@ Details of cgroup files
 	  cgroup's existence. Queue size samples are taken each time one of the
 	  queues of this cgroup gets a timeslice.
 
+- blkio.group_wait_time
+	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
+	  This is the amount of time the cgroup had to wait since it became busy
+	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
+	  its queues. This is different from the io_wait_time which is the
+	  cumulative total of the amount of time spent by each IO in that cgroup
+	  waiting in the scheduler queue. This is in nanoseconds. If this is
+	  read when the cgroup is in a waiting (for timeslice) state, the stat
+	  will only report the group_wait_time accumulated till the last time it
+	  got a timeslice and will not include the current delta.
+
+- blkio.empty_time
+	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
+	  This is the amount of time a cgroup spends without any pending
+	  requests when not being served, i.e., it does not include any time
+	  spent idling for one of the queues of the cgroup. This is in
+	  nanoseconds. If this is read when the cgroup is in an empty state,
+	  the stat will only report the empty_time accumulated till the last
+	  time it had a pending request and will not include the current delta.
+
+- blkio.idle_time
+	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
+	  This is the amount of time spent by the IO scheduler idling for a
+	  given cgroup in anticipation of a better request than the exising ones
+	  from other queues/cgroups. This is in nanoseconds. If this is read
+	  when the cgroup is in an idling state, the stat will only report the
+	  idle_time accumulated till the last idle period and will not include
+	  the current delta.
+
 - blkio.dequeue
 	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
 	  gives the statistics about how many a times a group was dequeued
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1e0c497..1ecff7a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -105,6 +105,76 @@ static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
 }
 
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+/* This should be called with the blkg->stats_lock held. */
+static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
+						struct blkio_group *curr_blkg)
+{
+	if (blkio_blkg_waiting(&blkg->stats))
+		return;
+	if (blkg == curr_blkg)
+		return;
+	blkg->stats.start_group_wait_time = sched_clock();
+	blkio_mark_blkg_waiting(&blkg->stats);
+}
+
+/* This should be called with the blkg->stats_lock held. */
+static void blkio_update_group_wait_time(struct blkio_group_stats *stats)
+{
+	unsigned long long now;
+
+	if (!blkio_blkg_waiting(stats))
+		return;
+
+	now = sched_clock();
+	if (time_after64(now, stats->start_group_wait_time))
+		stats->group_wait_time += now - stats->start_group_wait_time;
+	blkio_clear_blkg_waiting(stats);
+}
+
+/* This should be called with the blkg->stats_lock held. */
+static void blkio_end_empty_time(struct blkio_group_stats *stats)
+{
+	unsigned long long now;
+
+	if (!blkio_blkg_empty(stats))
+		return;
+
+	now = sched_clock();
+	if (time_after64(now, stats->start_empty_time))
+		stats->empty_time += now - stats->start_empty_time;
+	blkio_clear_blkg_empty(stats);
+}
+
+void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	BUG_ON(blkio_blkg_idling(&blkg->stats));
+	blkg->stats.start_idle_time = sched_clock();
+	blkio_mark_blkg_idling(&blkg->stats);
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_set_idle_time_stats);
+
+void blkiocg_update_idle_time_stats(struct blkio_group *blkg)
+{
+	unsigned long flags;
+	unsigned long long now;
+	struct blkio_group_stats *stats;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	stats = &blkg->stats;
+	if (blkio_blkg_idling(stats)) {
+		now = sched_clock();
+		if (time_after64(now, stats->start_idle_time))
+			stats->idle_time += now - stats->start_idle_time;
+		blkio_clear_blkg_idling(stats);
+	}
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_update_idle_time_stats);
+
 void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
 {
 	unsigned long flags;
@@ -116,9 +186,14 @@ void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
 			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
 			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
 	stats->avg_queue_size_samples++;
+	blkio_update_group_wait_time(stats);
 	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_set_active_queue_stats);
+#else
+static inline void blkio_set_start_group_wait_time(struct blkio_group *blkg,
+					struct blkio_group *curr_blkg) {}
+static inline void blkio_end_empty_time(struct blkio_group_stats *stats) {}
 #endif
 
 void blkiocg_update_request_add_stats(struct blkio_group *blkg,
@@ -130,6 +205,8 @@ void blkiocg_update_request_add_stats(struct blkio_group *blkg,
 	spin_lock_irqsave(&blkg->stats_lock, flags);
 	blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
 			sync);
+	blkio_end_empty_time(&blkg->stats);
+	blkio_set_start_group_wait_time(blkg, curr_blkg);
 	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_request_add_stats);
@@ -156,6 +233,33 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
 
+void blkiocg_set_start_empty_time(struct blkio_group *blkg, bool ignore)
+{
+	unsigned long flags;
+	struct blkio_group_stats *stats;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	stats = &blkg->stats;
+
+	if (stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] ||
+			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE]) {
+		spin_unlock_irqrestore(&blkg->stats_lock, flags);
+		return;
+	}
+
+	/*
+	 * If ignore is set, we do not panic on the empty flag being set
+	 * already. This is to avoid cases where there are superfluous timeslice
+	 * complete events (for eg., forced_dispatch in CFQ) when no IOs are
+	 * served which could result in triggering the empty check incorrectly.
+	 */
+	BUG_ON(!ignore && blkio_blkg_empty(stats));
+	stats->start_empty_time = sched_clock();
+	blkio_mark_blkg_empty(stats);
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blkiocg_set_start_empty_time);
+
 void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 				uint64_t bytes, bool direction, bool sync)
 {
@@ -317,19 +421,44 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 {
 	struct blkio_cgroup *blkcg;
 	struct blkio_group *blkg;
+	struct blkio_group_stats *stats;
 	struct hlist_node *n;
 	uint64_t queued[BLKIO_STAT_TOTAL];
 	int i;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	bool idling, waiting, empty;
+	unsigned long long now = sched_clock();
+#endif
 
 	blkcg = cgroup_to_blkio_cgroup(cgroup);
 	spin_lock_irq(&blkcg->lock);
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		spin_lock(&blkg->stats_lock);
+		stats = &blkg->stats;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+		idling = blkio_blkg_idling(stats);
+		waiting = blkio_blkg_waiting(stats);
+		empty = blkio_blkg_empty(stats);
+#endif
 		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
-			queued[i] = blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i];
-		memset(&blkg->stats, 0, sizeof(struct blkio_group_stats));
+			queued[i] = stats->stat_arr[BLKIO_STAT_QUEUED][i];
+		memset(stats, 0, sizeof(struct blkio_group_stats));
 		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
-			blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
+			stats->stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+		if (idling) {
+			blkio_mark_blkg_idling(stats);
+			stats->start_idle_time = now;
+		}
+		if (waiting) {
+			blkio_mark_blkg_waiting(stats);
+			stats->start_group_wait_time = now;
+		}
+		if (empty) {
+			blkio_mark_blkg_empty(stats);
+			stats->start_empty_time = now;
+		}
+#endif
 		spin_unlock(&blkg->stats_lock);
 	}
 	spin_unlock_irq(&blkcg->lock);
@@ -401,6 +530,15 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 			sum = 0;
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
 	}
+	if (type == BLKIO_STAT_GROUP_WAIT_TIME)
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+					blkg->stats.group_wait_time, cb, dev);
+	if (type == BLKIO_STAT_IDLE_TIME)
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+					blkg->stats.idle_time, cb, dev);
+	if (type == BLKIO_STAT_EMPTY_TIME)
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
+					blkg->stats.empty_time, cb, dev);
 	if (type == BLKIO_STAT_DEQUEUE)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
 					blkg->stats.dequeue, cb, dev);
@@ -458,6 +596,9 @@ SHOW_FUNCTION_PER_GROUP(io_queued, BLKIO_STAT_QUEUED, 1);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
 SHOW_FUNCTION_PER_GROUP(avg_queue_size, BLKIO_STAT_AVG_QUEUE_SIZE, 0);
+SHOW_FUNCTION_PER_GROUP(group_wait_time, BLKIO_STAT_GROUP_WAIT_TIME, 0);
+SHOW_FUNCTION_PER_GROUP(idle_time, BLKIO_STAT_IDLE_TIME, 0);
+SHOW_FUNCTION_PER_GROUP(empty_time, BLKIO_STAT_EMPTY_TIME, 0);
 #endif
 #undef SHOW_FUNCTION_PER_GROUP
 
@@ -518,6 +659,18 @@ struct cftype blkio_files[] = {
 		.read_map = blkiocg_avg_queue_size_read,
 	},
 	{
+		.name = "group_wait_time",
+		.read_map = blkiocg_group_wait_time_read,
+	},
+	{
+		.name = "idle_time",
+		.read_map = blkiocg_idle_time_read,
+	},
+	{
+		.name = "empty_time",
+		.read_map = blkiocg_empty_time_read,
+	},
+	{
 		.name = "dequeue",
 		.read_map = blkiocg_dequeue_read,
 	},
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index bea7f3b..bfce085 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -43,6 +43,9 @@ enum stat_type {
 	BLKIO_STAT_SECTORS,
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	BLKIO_STAT_AVG_QUEUE_SIZE,
+	BLKIO_STAT_IDLE_TIME,
+	BLKIO_STAT_EMPTY_TIME,
+	BLKIO_STAT_GROUP_WAIT_TIME,
 	BLKIO_STAT_DEQUEUE
 #endif
 };
@@ -55,6 +58,13 @@ enum stat_sub_type {
 	BLKIO_STAT_TOTAL
 };
 
+/* blkg state flags */
+enum blkg_state_flags {
+	BLKG_waiting = 0,
+	BLKG_idling,
+	BLKG_empty,
+};
+
 struct blkio_cgroup {
 	struct cgroup_subsys_state css;
 	unsigned int weight;
@@ -74,6 +84,21 @@ struct blkio_group_stats {
 	uint64_t avg_queue_size_samples;
 	/* How many times this group has been removed from service tree */
 	unsigned long dequeue;
+
+	/* Total time spent waiting for it to be assigned a timeslice. */
+	uint64_t group_wait_time;
+	uint64_t start_group_wait_time;
+
+	/* Time spent idling for this blkio_group */
+	uint64_t idle_time;
+	uint64_t start_idle_time;
+	/*
+	 * Total time when we have requests queued and do not contain the
+	 * current active queue.
+	 */
+	uint64_t empty_time;
+	uint64_t start_empty_time;
+	uint16_t flags;
 #endif
 };
 
@@ -137,12 +162,41 @@ static inline char *blkg_path(struct blkio_group *blkg)
 void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg);
 void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 				unsigned long dequeue);
+void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg);
+void blkiocg_update_idle_time_stats(struct blkio_group *blkg);
+void blkiocg_set_start_empty_time(struct blkio_group *blkg, bool ignore);
+
+#define BLKG_FLAG_FNS(name)						\
+static inline void blkio_mark_blkg_##name(				\
+		struct blkio_group_stats *stats)			\
+{									\
+	stats->flags |= (1 << BLKG_##name);				\
+}									\
+static inline void blkio_clear_blkg_##name(				\
+		struct blkio_group_stats *stats)			\
+{									\
+	stats->flags &= ~(1 << BLKG_##name);				\
+}									\
+static inline int blkio_blkg_##name(struct blkio_group_stats *stats)	\
+{									\
+	return (stats->flags & (1 << BLKG_##name)) != 0;		\
+}									\
+
+BLKG_FLAG_FNS(waiting)
+BLKG_FLAG_FNS(idling)
+BLKG_FLAG_FNS(empty)
+#undef BLKG_FLAG_FNS
 #else
 static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
 static inline void blkiocg_update_set_active_queue_stats(
 						struct blkio_group *blkg) {}
 static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
 						unsigned long dequeue) {}
+static inline void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
+{}
+static inline void blkiocg_update_idle_time_stats(struct blkio_group *blkg) {}
+static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg,
+						bool ignore) {}
 #endif
 
 #if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8e0b86a..b6e095c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -886,7 +886,7 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
 }
 
 static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
-				struct cfq_queue *cfqq)
+				struct cfq_queue *cfqq, bool forced)
 {
 	struct cfq_rb_root *st = &cfqd->grp_service_tree;
 	unsigned int used_sl, charge_sl;
@@ -916,6 +916,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
 					st->min_vdisktime);
 	blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
+	blkiocg_set_start_empty_time(&cfqg->blkg, forced);
 }
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
@@ -1528,6 +1529,12 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
 	return cfqq == RQ_CFQQ(rq);
 }
 
+static inline void cfq_del_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	del_timer(&cfqd->idle_slice_timer);
+	blkiocg_update_idle_time_stats(&cfqq->cfqg->blkg);
+}
+
 static void __cfq_set_active_queue(struct cfq_data *cfqd,
 				   struct cfq_queue *cfqq)
 {
@@ -1547,7 +1554,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 		cfq_clear_cfqq_fifo_expire(cfqq);
 		cfq_mark_cfqq_slice_new(cfqq);
 
-		del_timer(&cfqd->idle_slice_timer);
+		cfq_del_timer(cfqd, cfqq);
 	}
 
 	cfqd->active_queue = cfqq;
@@ -1558,12 +1565,12 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
  */
 static void
 __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-		    bool timed_out)
+		    bool timed_out, bool forced)
 {
 	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
 
 	if (cfq_cfqq_wait_request(cfqq))
-		del_timer(&cfqd->idle_slice_timer);
+		cfq_del_timer(cfqd, cfqq);
 
 	cfq_clear_cfqq_wait_request(cfqq);
 	cfq_clear_cfqq_wait_busy(cfqq);
@@ -1585,7 +1592,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
 	}
 
-	cfq_group_served(cfqd, cfqq->cfqg, cfqq);
+	cfq_group_served(cfqd, cfqq->cfqg, cfqq, forced);
 
 	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
 		cfq_del_cfqq_rr(cfqd, cfqq);
@@ -1604,12 +1611,13 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	}
 }
 
-static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
+static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out,
+					bool forced)
 {
 	struct cfq_queue *cfqq = cfqd->active_queue;
 
 	if (cfqq)
-		__cfq_slice_expired(cfqd, cfqq, timed_out);
+		__cfq_slice_expired(cfqd, cfqq, timed_out, forced);
 }
 
 /*
@@ -1865,6 +1873,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	sl = cfqd->cfq_slice_idle;
 
 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
+	blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
 }
 
@@ -2176,7 +2185,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	}
 
 expire:
-	cfq_slice_expired(cfqd, 0);
+	cfq_slice_expired(cfqd, 0, false);
 new_queue:
 	/*
 	 * Current queue expired. Check if we have to switch to a new
@@ -2202,7 +2211,7 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
 	BUG_ON(!list_empty(&cfqq->fifo));
 
 	/* By default cfqq is not expired if it is empty. Do it explicitly */
-	__cfq_slice_expired(cfqq->cfqd, cfqq, 0);
+	__cfq_slice_expired(cfqq->cfqd, cfqq, 0, true);
 	return dispatched;
 }
 
@@ -2218,7 +2227,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 	while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	cfq_slice_expired(cfqd, 0);
+	cfq_slice_expired(cfqd, 0, true);
 	BUG_ON(cfqd->busy_queues);
 
 	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
@@ -2382,10 +2391,15 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
 	    cfq_class_idle(cfqq))) {
 		cfqq->slice_end = jiffies + 1;
-		cfq_slice_expired(cfqd, 0);
+		cfq_slice_expired(cfqd, 0, false);
 	}
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
+	/*
+	 * This is needed since we don't exactly match the mod_timer() and
+	 * del_timer() calls in CFQ.
+	 */
+	blkiocg_update_idle_time_stats(&cfqq->cfqg->blkg);
 	return 1;
 }
 
@@ -2413,7 +2427,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
 	orig_cfqg = cfqq->orig_cfqg;
 
 	if (unlikely(cfqd->active_queue == cfqq)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
+		__cfq_slice_expired(cfqd, cfqq, 0, false);
 		cfq_schedule_dispatch(cfqd);
 	}
 
@@ -2514,7 +2528,7 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	struct cfq_queue *__cfqq, *next;
 
 	if (unlikely(cfqq == cfqd->active_queue)) {
-		__cfq_slice_expired(cfqd, cfqq, 0);
+		__cfq_slice_expired(cfqd, cfqq, 0, false);
 		cfq_schedule_dispatch(cfqd);
 	}
 
@@ -3143,7 +3157,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);
+	cfq_slice_expired(cfqd, 1, false);
 
 	/*
 	 * Put the new queue at the front of the of the current list,
@@ -3191,7 +3205,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		if (cfq_cfqq_wait_request(cfqq)) {
 			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
 			    cfqd->busy_queues > 1) {
-				del_timer(&cfqd->idle_slice_timer);
+				cfq_del_timer(cfqd, cfqq);
 				cfq_clear_cfqq_wait_request(cfqq);
 				__blk_run_queue(cfqd->queue);
 			} else
@@ -3352,7 +3366,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 		 * - when there is a close cooperator
 		 */
 		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
-			cfq_slice_expired(cfqd, 1);
+			cfq_slice_expired(cfqd, 1, false);
 		else if (sync && cfqq_empty &&
 			 !cfq_close_cooperator(cfqd, cfqq)) {
 			cfqd->noidle_tree_requires_idle |= !rq_noidle(rq);
@@ -3612,7 +3626,7 @@ static void cfq_idle_slice_timer(unsigned long data)
 		cfq_clear_cfqq_deep(cfqq);
 	}
 expire:
-	cfq_slice_expired(cfqd, timed_out);
+	cfq_slice_expired(cfqd, timed_out, false);
 out_kick:
 	cfq_schedule_dispatch(cfqd);
 out_cont:
@@ -3655,7 +3669,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	spin_lock_irq(q->queue_lock);
 
 	if (cfqd->active_queue)
-		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
+		__cfq_slice_expired(cfqd, cfqd->active_queue, 0, false);
 
 	while (!list_empty(&cfqd->cic_list)) {
 		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,



* Re: [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only
  2010-04-09  4:13 [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Divyesh Shah
                   ` (2 preceding siblings ...)
  2010-04-09  4:15 ` [PATCH 3/3] blkio: Add more debug-only per-cgroup stats Divyesh Shah
@ 2010-04-09  6:36 ` Jens Axboe
  3 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2010-04-09  6:36 UTC (permalink / raw)
  To: Divyesh Shah; +Cc: vgoyal, linux-kernel, nauman, ctalbott

On Thu, Apr 08 2010, Divyesh Shah wrote:
> The following patchset implements additional per-cgroup IO controller stats.
> Some of these should be enabled for debugging only.
> 
> These stats have helped us debug issues with earlier IO controller
> versions and should be useful now as well.
> We've been using these stats for monitoring and debugging problems after the
> fact as these stats can be collected and stored for later use.

Applied, thanks Divyesh.

-- 
Jens Axboe



* Re: [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats
  2010-04-09  4:15 ` [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats Divyesh Shah
@ 2010-04-09 15:32   ` Vivek Goyal
  2010-04-09 20:44     ` Divyesh Shah
  2010-04-10  2:34     ` Divyesh Shah
  0 siblings, 2 replies; 12+ messages in thread
From: Vivek Goyal @ 2010-04-09 15:32 UTC (permalink / raw)
  To: Divyesh Shah; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Thu, Apr 08, 2010 at 09:15:10PM -0700, Divyesh Shah wrote:
> These stats are useful for getting a feel for the queue depth of the cgroup,
> i.e., how filled up its queues are at a given instant and over the existence of
> the cgroup. This ability is useful when debugging problems in the wild as it
> helps understand the application's IO pattern w/o having to read through the
> userspace code (coz its tedious or just not available) or w/o the ability
> to run blktrace (since you may not have root access and/or not want to disturb
> performance).
> 
> Signed-off-by: Divyesh Shah<dpshah@google.com>
> ---
> 
>  Documentation/cgroups/blkio-controller.txt |   11 +++
>  block/blk-cgroup.c                         |   98 +++++++++++++++++++++++++++-
>  block/blk-cgroup.h                         |   20 +++++-
>  block/cfq-iosched.c                        |   11 +++
>  4 files changed, 134 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 810e301..6e52e7c 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -139,6 +139,17 @@ Details of cgroup files
>  	  cgroup. This is further divided by the type of operation - read or
>  	  write, sync or async.
>  
> +- blkio.io_queued
> +	- Total number of requests queued up at any given instant for this
> +	  cgroup. This is further divided by the type of operation - read or
> +	  write, sync or async.
> +
> +- blkio.avg_queue_size
> +	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
> +	  The average queue size for this cgroup over the entire time of this
> +	  cgroup's existence. Queue size samples are taken each time one of the
> +	  queues of this cgroup gets a timeslice.
> +
>  - blkio.dequeue
>  	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
>  	  gives the statistics about how many a times a group was dequeued
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d23b538..1e0c497 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -81,6 +81,71 @@ static void blkio_add_stat(uint64_t *stat, uint64_t add, bool direction,
>  		stat[BLKIO_STAT_ASYNC] += add;
>  }
>  
> +/*
> + * Decrements the appropriate stat variable if non-zero depending on the
> + * request type. Panics on value being zero.
> + * This should be called with the blkg->stats_lock held.
> + */
> +static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
> +{
> +	if (direction) {
> +		BUG_ON(stat[BLKIO_STAT_WRITE] == 0);
> +		stat[BLKIO_STAT_WRITE]--;
> +	} else {
> +		BUG_ON(stat[BLKIO_STAT_READ] == 0);
> +		stat[BLKIO_STAT_READ]--;
> +	}
> +	if (sync) {
> +		BUG_ON(stat[BLKIO_STAT_SYNC] == 0);
> +		stat[BLKIO_STAT_SYNC]--;
> +	} else {
> +		BUG_ON(stat[BLKIO_STAT_ASYNC] == 0);
> +		stat[BLKIO_STAT_ASYNC]--;
> +	}
> +}
> +
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> +void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
> +{

How about blkiocg_update_avg_queue_size_stats()? A different io policy can
choose to take this sample at a higher rate and not necessarily when the
queue is set active.

Can we group this function near blkiocg_update_blkio_group_dequeue_stats()
so that we introduce one less #ifdef CONFIG_DEBUG_BLK_CGROUP?

> +	unsigned long flags;
> +	struct blkio_group_stats *stats;
> +
> +	spin_lock_irqsave(&blkg->stats_lock, flags);
> +	stats = &blkg->stats;
> +	stats->avg_queue_size_sum +=
> +			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
> +			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
> +	stats->avg_queue_size_samples++;

What happens when, over a period of time, "avg_queue_size_sum" or
"avg_queue_size_samples" overflows? All the stats will look weird.

> +	spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_update_set_active_queue_stats);
> +#endif
> +
> +void blkiocg_update_request_add_stats(struct blkio_group *blkg,
> +			struct blkio_group *curr_blkg, bool direction,
> +			bool sync)
> +{

curr_blkg is redundant?

How about just using the "io" keyword instead of "request", as you have been
doing all along? blkiocg_update_io_add_stats().

> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&blkg->stats_lock, flags);
> +	blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
> +			sync);
> +	spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_update_request_add_stats);
> +
> +void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
> +						bool direction, bool sync)
> +{

blkiocg_update_io_remove_stats().

> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&blkg->stats_lock, flags);
> +	blkio_check_and_dec_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED],
> +					direction, sync);
> +	spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_update_request_remove_stats);
> +
>  void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>  {
>  	unsigned long flags;
> @@ -253,14 +318,18 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>  	struct blkio_cgroup *blkcg;
>  	struct blkio_group *blkg;
>  	struct hlist_node *n;
> -	struct blkio_group_stats *stats;
> +	uint64_t queued[BLKIO_STAT_TOTAL];
> +	int i;
>  
>  	blkcg = cgroup_to_blkio_cgroup(cgroup);
>  	spin_lock_irq(&blkcg->lock);
>  	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>  		spin_lock(&blkg->stats_lock);
> -		stats = &blkg->stats;
> -		memset(stats, 0, sizeof(struct blkio_group_stats));
> +		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
> +			queued[i] = blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i];
> +		memset(&blkg->stats, 0, sizeof(struct blkio_group_stats));
> +		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
> +			blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];

During reset_stats, why are you not resetting the stats for average queue
depth? This reset_stat is meant to reset all the debug stats we have
collected so far for the cgroup?

Thanks
Vivek

>  		spin_unlock(&blkg->stats_lock);
>  	}
>  	spin_unlock_irq(&blkcg->lock);
> @@ -323,6 +392,15 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
>  		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>  					blkg->stats.sectors, cb, dev);
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
> +	if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
> +		uint64_t sum = blkg->stats.avg_queue_size_sum;
> +		uint64_t samples = blkg->stats.avg_queue_size_samples;
> +		if (samples)
> +			do_div(sum, samples);
> +		else
> +			sum = 0;
> +		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
> +	}
>  	if (type == BLKIO_STAT_DEQUEUE)
>  		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>  					blkg->stats.dequeue, cb, dev);
> @@ -376,8 +454,10 @@ SHOW_FUNCTION_PER_GROUP(io_serviced, BLKIO_STAT_SERVICED, 1);
>  SHOW_FUNCTION_PER_GROUP(io_service_time, BLKIO_STAT_SERVICE_TIME, 1);
>  SHOW_FUNCTION_PER_GROUP(io_wait_time, BLKIO_STAT_WAIT_TIME, 1);
>  SHOW_FUNCTION_PER_GROUP(io_merged, BLKIO_STAT_MERGED, 1);
> +SHOW_FUNCTION_PER_GROUP(io_queued, BLKIO_STAT_QUEUED, 1);
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
> +SHOW_FUNCTION_PER_GROUP(avg_queue_size, BLKIO_STAT_AVG_QUEUE_SIZE, 0);
>  #endif
>  #undef SHOW_FUNCTION_PER_GROUP
>  
> @@ -425,14 +505,22 @@ struct cftype blkio_files[] = {
>  		.read_map = blkiocg_io_merged_read,
>  	},
>  	{
> +		.name = "io_queued",
> +		.read_map = blkiocg_io_queued_read,
> +	},
> +	{
>  		.name = "reset_stats",
>  		.write_u64 = blkiocg_reset_stats,
>  	},
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
> -       {
> +	{
> +		.name = "avg_queue_size",
> +		.read_map = blkiocg_avg_queue_size_read,
> +	},
> +	{
>  		.name = "dequeue",
>  		.read_map = blkiocg_dequeue_read,
> -       },
> +	},
>  #endif
>  };
>  
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 470a29d..bea7f3b 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -36,10 +36,13 @@ enum stat_type {
>  	BLKIO_STAT_WAIT_TIME,
>  	/* Number of IOs merged */
>  	BLKIO_STAT_MERGED,
> +	/* Number of IOs queued up */
> +	BLKIO_STAT_QUEUED,
>  	/* All the single valued stats go below this */
>  	BLKIO_STAT_TIME,
>  	BLKIO_STAT_SECTORS,
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
> +	BLKIO_STAT_AVG_QUEUE_SIZE,
>  	BLKIO_STAT_DEQUEUE
>  #endif
>  };
> @@ -63,8 +66,12 @@ struct blkio_group_stats {
>  	/* total disk time and nr sectors dispatched by this group */
>  	uint64_t time;
>  	uint64_t sectors;
> -	uint64_t stat_arr[BLKIO_STAT_MERGED + 1][BLKIO_STAT_TOTAL];
> +	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
> +	/* Sum of number of IOs queued across all samples */
> +	uint64_t avg_queue_size_sum;
> +	/* Count of samples taken for average */
> +	uint64_t avg_queue_size_samples;
>  	/* How many times this group has been removed from service tree */
>  	unsigned long dequeue;
>  #endif
> @@ -127,10 +134,13 @@ static inline char *blkg_path(struct blkio_group *blkg)
>  {
>  	return blkg->path;
>  }
> +void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg);
>  void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>  				unsigned long dequeue);
>  #else
>  static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
> +static inline void blkiocg_update_set_active_queue_stats(
> +						struct blkio_group *blkg) {}
>  static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>  						unsigned long dequeue) {}
>  #endif
> @@ -152,6 +162,10 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
>  	uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
>  void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
>  					bool sync);
> +void blkiocg_update_request_add_stats(struct blkio_group *blkg,
> +		struct blkio_group *curr_blkg, bool direction, bool sync);
> +void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
> +					bool direction, bool sync);
>  #else
>  struct cgroup;
>  static inline struct blkio_cgroup *
> @@ -175,5 +189,9 @@ static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
>  		bool sync) {}
>  static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
>  						bool direction, bool sync) {}
> +static inline void blkiocg_update_request_add_stats(struct blkio_group *blkg,
> +		struct blkio_group *curr_blkg, bool direction, bool sync) {}
> +static inline void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
> +						bool direction, bool sync) {}
>  #endif
>  #endif /* _BLK_CGROUP_H */
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 4eb1906..8e0b86a 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1380,7 +1380,12 @@ static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
>  {
>  	elv_rb_del(&cfqq->sort_list, rq);
>  	cfqq->queued[rq_is_sync(rq)]--;
> +	blkiocg_update_request_remove_stats(&cfqq->cfqg->blkg, rq_data_dir(rq),
> +						rq_is_sync(rq));
>  	cfq_add_rq_rb(rq);
> +	blkiocg_update_request_add_stats(
> +			&cfqq->cfqg->blkg, &cfqq->cfqd->serving_group->blkg,
> +			rq_data_dir(rq), rq_is_sync(rq));
>  }
>  
>  static struct request *
> @@ -1436,6 +1441,8 @@ static void cfq_remove_request(struct request *rq)
>  	cfq_del_rq_rb(rq);
>  
>  	cfqq->cfqd->rq_queued--;
> +	blkiocg_update_request_remove_stats(&cfqq->cfqg->blkg, rq_data_dir(rq),
> +						rq_is_sync(rq));
>  	if (rq_is_meta(rq)) {
>  		WARN_ON(!cfqq->meta_pending);
>  		cfqq->meta_pending--;
> @@ -1527,6 +1534,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>  	if (cfqq) {
>  		cfq_log_cfqq(cfqd, cfqq, "set_active wl_prio:%d wl_type:%d",
>  				cfqd->serving_prio, cfqd->serving_type);
> +		blkiocg_update_set_active_queue_stats(&cfqq->cfqg->blkg);
>  		cfqq->slice_start = 0;
>  		cfqq->dispatch_start = jiffies;
>  		cfqq->allocated_slice = 0;
> @@ -3213,6 +3221,9 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
>  	list_add_tail(&rq->queuelist, &cfqq->fifo);
>  	cfq_add_rq_rb(rq);
>  
> +	blkiocg_update_request_add_stats(&cfqq->cfqg->blkg,
> +			&cfqd->serving_group->blkg, rq_data_dir(rq),
> +			rq_is_sync(rq));
>  	cfq_rq_enqueued(cfqd, cfqq, rq);
>  }
>  

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] blkio: Add more debug-only per-cgroup stats
  2010-04-09  4:15 ` [PATCH 3/3] blkio: Add more debug-only per-cgroup stats Divyesh Shah
@ 2010-04-09 19:21   ` Vivek Goyal
  2010-04-10  0:09     ` Divyesh Shah
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2010-04-09 19:21 UTC (permalink / raw)
  To: Divyesh Shah; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Thu, Apr 08, 2010 at 09:15:35PM -0700, Divyesh Shah wrote:
> 1) group_wait_time - This is the amount of time the cgroup had to wait to get a
>   timeslice for one of its queues from when it became busy, i.e., went from 0
>   to 1 request queued. This is different from the io_wait_time which is the
>   cumulative total of the amount of time spent by each IO in that cgroup waiting
>   in the scheduler queue. This stat is a great way to find out any jobs in the
>   fleet that are being starved or waiting for longer than what is expected (due
>   to an IO controller bug or any other issue).
> 2) empty_time - This is the amount of time a cgroup spends w/o any pending
>    requests. This stat is useful when a job does not seem to be able to use its
>    assigned disk share by helping check if that is happening due to an IO
>    controller bug or because the job is not submitting enough IOs.

Divyesh,

This empty_time looks like a redundant stat. One can just look at
group_wait_time. If group_wait_time is not increasing, then the
application in the cgroup is not doing any IO. If it is increasing,
then one can look at so many other stats like blkio.time, blkio.sectors,
etc. to find out whether the cgroup is making any progress or not.

So IMHO, we really don't need anything like empty_time.
  

> 3) idle_time - This is the amount of time spent by the IO scheduler idling
>    for a given cgroup in anticipation of a better request than the exising ones
>    from other queues/cgroups.
> 
> All these stats are recorded using start and stop events. When reading these
> stats, we do not add the delta between the current time and the last start time
> if we're between the start and stop events. We avoid doing this to make sure
> that these numbers are always monotonically increasing when read. Since we're
> using sched_clock() which may use the tsc as its source, it may induce some
> inconsistency (due to tsc resync across cpus) if we included the current delta.

So what is that inconsistency? Time goes backwards? If yes, can't we
just check that the stop time is after the start time, and only treat
a time sample as valid in that case; otherwise we ignore it?

Why can't the same inconsistency in time (due to tsc resync) be
introduced within the start and stop event window?

In the case of group_wait_time, there does not seem to be any stop
event. It is just a start and an update event?

> 
> Signed-off-by: Divyesh Shah<dpshah@google.com>
> ---
> 
>  Documentation/cgroups/blkio-controller.txt |   29 +++++
>  block/blk-cgroup.c                         |  159 +++++++++++++++++++++++++++-
>  block/blk-cgroup.h                         |   54 ++++++++++
>  block/cfq-iosched.c                        |   50 ++++++---
>  4 files changed, 271 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 6e52e7c..db054ea 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -150,6 +150,35 @@ Details of cgroup files
>  	  cgroup's existence. Queue size samples are taken each time one of the
>  	  queues of this cgroup gets a timeslice.
>  
> +- blkio.group_wait_time
> +	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
> +	  This is the amount of time the cgroup had to wait since it became busy
> +	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
> +	  its queues. This is different from the io_wait_time which is the
> +	  cumulative total of the amount of time spent by each IO in that cgroup
> +	  waiting in the scheduler queue. This is in nanoseconds. If this is
> +	  read when the cgroup is in a waiting (for timeslice) state, the stat
> +	  will only report the group_wait_time accumulated till the last time it
> +	  got a timeslice and will not include the current delta.
> +
> +- blkio.empty_time
> +	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
> +	  This is the amount of time a cgroup spends without any pending
> +	  requests when not being served, i.e., it does not include any time
> +	  spent idling for one of the queues of the cgroup. This is in
> +	  nanoseconds. If this is read when the cgroup is in an empty state,
> +	  the stat will only report the empty_time accumulated till the last
> +	  time it had a pending request and will not include the current delta.
> +
> +- blkio.idle_time
> +	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
> +	  This is the amount of time spent by the IO scheduler idling for a
> +	  given cgroup in anticipation of a better request than the exising ones
> +	  from other queues/cgroups. This is in nanoseconds. If this is read
> +	  when the cgroup is in an idling state, the stat will only report the
> +	  idle_time accumulated till the last idle period and will not include
> +	  the current delta.
> +
>  - blkio.dequeue
>  	- Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
>  	  gives the statistics about how many a times a group was dequeued
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 1e0c497..1ecff7a 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -105,6 +105,76 @@ static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
>  }
>  
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
> +/* This should be called with the blkg->stats_lock held. */
> +static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
> +						struct blkio_group *curr_blkg)
> +{
> +	if (blkio_blkg_waiting(&blkg->stats))
> +		return;
> +	if (blkg == curr_blkg)
> +		return;
> +	blkg->stats.start_group_wait_time = sched_clock();
> +	blkio_mark_blkg_waiting(&blkg->stats);

I am hoping that you have a good reason to maintain all these stats in ns
and not in us.

My rough calculation says that it will take 584 years to overflow this
:-). So probably ns is not bad.  

2^64/10^9 = 18446744073 sec
18446744073/60 = 307445734 mins
307445734/60 = 5124095 hrs
5124095/24= 213503 days
213503/365= 584 years 

Hopefully this calculation is right.
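
(As a quick user-space sanity check of the same arithmetic -- purely
illustrative, not part of the patch:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* seconds until a 64-bit nanosecond counter wraps */
		uint64_t secs = UINT64_MAX / 1000000000ULL;

		printf("%llu years\n",
		       (unsigned long long)(secs / 60 / 60 / 24 / 365));
		return 0;
	}

this prints 584.)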

> +}
> +
> +/* This should be called with the blkg->stats_lock held. */
> +static void blkio_update_group_wait_time(struct blkio_group_stats *stats)
> +{
> +	unsigned long long now;
> +
> +	if (!blkio_blkg_waiting(stats))
> +		return;
> +
> +	now = sched_clock();
> +	if (time_after64(now, stats->start_group_wait_time))
> +		stats->group_wait_time += now - stats->start_group_wait_time;
> +	blkio_clear_blkg_waiting(stats);
> +}
> +
> +/* This should be called with the blkg->stats_lock held. */
> +static void blkio_end_empty_time(struct blkio_group_stats *stats)
> +{
> +	unsigned long long now;
> +
> +	if (!blkio_blkg_empty(stats))
> +		return;
> +
> +	now = sched_clock();
> +	if (time_after64(now, stats->start_empty_time))
> +		stats->empty_time += now - stats->start_empty_time;
> +	blkio_clear_blkg_empty(stats);
> +}
> +
> +void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&blkg->stats_lock, flags);
> +	BUG_ON(blkio_blkg_idling(&blkg->stats));
> +	blkg->stats.start_idle_time = sched_clock();
> +	blkio_mark_blkg_idling(&blkg->stats);
> +	spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_update_set_idle_time_stats);
> +
> +void blkiocg_update_idle_time_stats(struct blkio_group *blkg)
> +{
> +	unsigned long flags;
> +	unsigned long long now;
> +	struct blkio_group_stats *stats;
> +
> +	spin_lock_irqsave(&blkg->stats_lock, flags);
> +	stats = &blkg->stats;
> +	if (blkio_blkg_idling(stats)) {
> +		now = sched_clock();
> +		if (time_after64(now, stats->start_idle_time))
> +			stats->idle_time += now - stats->start_idle_time;
> +		blkio_clear_blkg_idling(stats);
> +	}
> +	spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_update_idle_time_stats);
> +
>  void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>  {
>  	unsigned long flags;
> @@ -116,9 +186,14 @@ void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>  			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
>  			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
>  	stats->avg_queue_size_samples++;
> +	blkio_update_group_wait_time(stats);

Would cfq_choose_cfqg() be a better place to call this function? Why
should we call it when an active queue is set? Instead, when a group is
chosen to dispatch requests from, we can update the wait time.

That way the number of times this function is called will be reduced,
as we can dispatch multiple cfqqs from a single group.

>  	spin_unlock_irqrestore(&blkg->stats_lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(blkiocg_update_set_active_queue_stats);
> +#else
> +static inline void blkio_set_start_group_wait_time(struct blkio_group *blkg,
> +					struct blkio_group *curr_blkg) {}
> +static inline void blkio_end_empty_time(struct blkio_group_stats *stats) {}
>  #endif
>  
>  void blkiocg_update_request_add_stats(struct blkio_group *blkg,
> @@ -130,6 +205,8 @@ void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>  	spin_lock_irqsave(&blkg->stats_lock, flags);
>  	blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
>  			sync);
> +	blkio_end_empty_time(&blkg->stats);
> +	blkio_set_start_group_wait_time(blkg, curr_blkg);
>  	spin_unlock_irqrestore(&blkg->stats_lock, flags);
>  }
>  EXPORT_SYMBOL_GPL(blkiocg_update_request_add_stats);
> @@ -156,6 +233,33 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>  }
>  EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
>  
> +void blkiocg_set_start_empty_time(struct blkio_group *blkg, bool ignore)
> +{
> +	unsigned long flags;
> +	struct blkio_group_stats *stats;
> +
> +	spin_lock_irqsave(&blkg->stats_lock, flags);
> +	stats = &blkg->stats;
> +
> +	if (stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] ||
> +			stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE]) {
> +		spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +		return;
> +	}
> +
> +	/*
> +	 * If ignore is set, we do not panic on the empty flag being set
> +	 * already. This is to avoid cases where there are superfluous timeslice
> +	 * complete events (for eg., forced_dispatch in CFQ) when no IOs are
> +	 * served which could result in triggering the empty check incorrectly.
> +	 */
> +	BUG_ON(!ignore && blkio_blkg_empty(stats));
> +	stats->start_empty_time = sched_clock();
> +	blkio_mark_blkg_empty(stats);
> +	spin_unlock_irqrestore(&blkg->stats_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(blkiocg_set_start_empty_time);
> +
>  void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
>  				uint64_t bytes, bool direction, bool sync)
>  {
> @@ -317,19 +421,44 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>  {
>  	struct blkio_cgroup *blkcg;
>  	struct blkio_group *blkg;
> +	struct blkio_group_stats *stats;
>  	struct hlist_node *n;
>  	uint64_t queued[BLKIO_STAT_TOTAL];
>  	int i;
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> +	bool idling, waiting, empty;
> +	unsigned long long now = sched_clock();
> +#endif
>  
>  	blkcg = cgroup_to_blkio_cgroup(cgroup);
>  	spin_lock_irq(&blkcg->lock);
>  	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>  		spin_lock(&blkg->stats_lock);
> +		stats = &blkg->stats;
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> +		idling = blkio_blkg_idling(stats);
> +		waiting = blkio_blkg_waiting(stats);
> +		empty = blkio_blkg_empty(stats);
> +#endif
>  		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
> -			queued[i] = blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i];
> -		memset(&blkg->stats, 0, sizeof(struct blkio_group_stats));
> +			queued[i] = stats->stat_arr[BLKIO_STAT_QUEUED][i];
> +		memset(stats, 0, sizeof(struct blkio_group_stats));
>  		for (i = 0; i < BLKIO_STAT_TOTAL; i++)
> -			blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
> +			stats->stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
> +#ifdef CONFIG_DEBUG_BLK_CGROUP
> +		if (idling) {
> +			blkio_mark_blkg_idling(stats);
> +			stats->start_idle_time = now;
> +		}
> +		if (waiting) {
> +			blkio_mark_blkg_waiting(stats);
> +			stats->start_group_wait_time = now;
> +		}
> +		if (empty) {
> +			blkio_mark_blkg_empty(stats);
> +			stats->start_empty_time = now;
> +		}
> +#endif
>  		spin_unlock(&blkg->stats_lock);
>  	}
>  	spin_unlock_irq(&blkcg->lock);
> @@ -401,6 +530,15 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
>  			sum = 0;
>  		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
>  	}
> +	if (type == BLKIO_STAT_GROUP_WAIT_TIME)
> +		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
> +					blkg->stats.group_wait_time, cb, dev);
> +	if (type == BLKIO_STAT_IDLE_TIME)
> +		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
> +					blkg->stats.idle_time, cb, dev);
> +	if (type == BLKIO_STAT_EMPTY_TIME)
> +		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
> +					blkg->stats.empty_time, cb, dev);
>  	if (type == BLKIO_STAT_DEQUEUE)
>  		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>  					blkg->stats.dequeue, cb, dev);
> @@ -458,6 +596,9 @@ SHOW_FUNCTION_PER_GROUP(io_queued, BLKIO_STAT_QUEUED, 1);
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
>  SHOW_FUNCTION_PER_GROUP(avg_queue_size, BLKIO_STAT_AVG_QUEUE_SIZE, 0);
> +SHOW_FUNCTION_PER_GROUP(group_wait_time, BLKIO_STAT_GROUP_WAIT_TIME, 0);
> +SHOW_FUNCTION_PER_GROUP(idle_time, BLKIO_STAT_IDLE_TIME, 0);
> +SHOW_FUNCTION_PER_GROUP(empty_time, BLKIO_STAT_EMPTY_TIME, 0);
>  #endif
>  #undef SHOW_FUNCTION_PER_GROUP
>  
> @@ -518,6 +659,18 @@ struct cftype blkio_files[] = {
>  		.read_map = blkiocg_avg_queue_size_read,
>  	},
>  	{
> +		.name = "group_wait_time",
> +		.read_map = blkiocg_group_wait_time_read,
> +	},
> +	{
> +		.name = "idle_time",
> +		.read_map = blkiocg_idle_time_read,
> +	},
> +	{
> +		.name = "empty_time",
> +		.read_map = blkiocg_empty_time_read,
> +	},
> +	{
>  		.name = "dequeue",
>  		.read_map = blkiocg_dequeue_read,
>  	},
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index bea7f3b..bfce085 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -43,6 +43,9 @@ enum stat_type {
>  	BLKIO_STAT_SECTORS,
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  	BLKIO_STAT_AVG_QUEUE_SIZE,
> +	BLKIO_STAT_IDLE_TIME,
> +	BLKIO_STAT_EMPTY_TIME,
> +	BLKIO_STAT_GROUP_WAIT_TIME,
>  	BLKIO_STAT_DEQUEUE
>  #endif
>  };
> @@ -55,6 +58,13 @@ enum stat_sub_type {
>  	BLKIO_STAT_TOTAL
>  };
>  
> +/* blkg state flags */
> +enum blkg_state_flags {
> +	BLKG_waiting = 0,
> +	BLKG_idling,
> +	BLKG_empty,
> +};
> +
>  struct blkio_cgroup {
>  	struct cgroup_subsys_state css;
>  	unsigned int weight;
> @@ -74,6 +84,21 @@ struct blkio_group_stats {
>  	uint64_t avg_queue_size_samples;
>  	/* How many times this group has been removed from service tree */
>  	unsigned long dequeue;
> +
> +	/* Total time spent waiting for it to be assigned a timeslice. */
> +	uint64_t group_wait_time;
> +	uint64_t start_group_wait_time;
> +
> +	/* Time spent idling for this blkio_group */
> +	uint64_t idle_time;
> +	uint64_t start_idle_time;
> +	/*
> +	 * Total time when we have requests queued and do not contain the
> +	 * current active queue.
> +	 */
> +	uint64_t empty_time;
> +	uint64_t start_empty_time;
> +	uint16_t flags;

Does this flags field have to be inside blkio_group_stats? Maybe a
better place is inside blkio_group, as it represents the blkio_group
status. It's a different matter that, as of today, all three flags are
used only for stats purposes, but in the future we could add more flags?

>  #endif
>  };
>  
> @@ -137,12 +162,41 @@ static inline char *blkg_path(struct blkio_group *blkg)
>  void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg);
>  void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>  				unsigned long dequeue);
> +void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg);
> +void blkiocg_update_idle_time_stats(struct blkio_group *blkg);
> +void blkiocg_set_start_empty_time(struct blkio_group *blkg, bool ignore);
> +
> +#define BLKG_FLAG_FNS(name)						\
> +static inline void blkio_mark_blkg_##name(				\
> +		struct blkio_group_stats *stats)			\
> +{									\
> +	stats->flags |= (1 << BLKG_##name);				\
> +}									\
> +static inline void blkio_clear_blkg_##name(				\
> +		struct blkio_group_stats *stats)			\
> +{									\
> +	stats->flags &= ~(1 << BLKG_##name);				\
> +}									\
> +static inline int blkio_blkg_##name(struct blkio_group_stats *stats)	\
> +{									\
> +	return (stats->flags & (1 << BLKG_##name)) != 0;		\
> +}									\
> +
> +BLKG_FLAG_FNS(waiting)
> +BLKG_FLAG_FNS(idling)
> +BLKG_FLAG_FNS(empty)
> +#undef BLKG_FLAG_FNS
>  #else
>  static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
>  static inline void blkiocg_update_set_active_queue_stats(
>  						struct blkio_group *blkg) {}
>  static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>  						unsigned long dequeue) {}
> +static inline void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
> +{}
> +static inline void blkiocg_update_idle_time_stats(struct blkio_group *blkg) {}
> +static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg,
> +						bool ignore) {}
>  #endif
>  
>  #if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 8e0b86a..b6e095c 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -886,7 +886,7 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>  }
>  
>  static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> -				struct cfq_queue *cfqq)
> +				struct cfq_queue *cfqq, bool forced)
>  {
>  	struct cfq_rb_root *st = &cfqd->grp_service_tree;
>  	unsigned int used_sl, charge_sl;
> @@ -916,6 +916,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>  	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
>  					st->min_vdisktime);
>  	blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
> +	blkiocg_set_start_empty_time(&cfqg->blkg, forced);

Why does this need to be called from CFQ? Can't we just call this
internally from blkiocg_update_request_remove_stats()? So when we remove
a request, we update the stats in blkio; at that point blkio can check
whether the group became empty and record the start of the empty time?

>  }
>  
>  #ifdef CONFIG_CFQ_GROUP_IOSCHED
> @@ -1528,6 +1529,12 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>  	return cfqq == RQ_CFQQ(rq);
>  }
>  
> +static inline void cfq_del_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> +{
> +	del_timer(&cfqd->idle_slice_timer);
> +	blkiocg_update_idle_time_stats(&cfqq->cfqg->blkg);
> +}
> +
>  static void __cfq_set_active_queue(struct cfq_data *cfqd,
>  				   struct cfq_queue *cfqq)
>  {
> @@ -1547,7 +1554,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>  		cfq_clear_cfqq_fifo_expire(cfqq);
>  		cfq_mark_cfqq_slice_new(cfqq);
>  
> -		del_timer(&cfqd->idle_slice_timer);
> +		cfq_del_timer(cfqd, cfqq);
>  	}
>  
>  	cfqd->active_queue = cfqq;
> @@ -1558,12 +1565,12 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>   */
>  static void
>  __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> -		    bool timed_out)
> +		    bool timed_out, bool forced)
>  {
>  	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
>  
>  	if (cfq_cfqq_wait_request(cfqq))
> -		del_timer(&cfqd->idle_slice_timer);
> +		cfq_del_timer(cfqd, cfqq);
>  
>  	cfq_clear_cfqq_wait_request(cfqq);
>  	cfq_clear_cfqq_wait_busy(cfqq);
> @@ -1585,7 +1592,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  		cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
>  	}
>  
> -	cfq_group_served(cfqd, cfqq->cfqg, cfqq);
> +	cfq_group_served(cfqd, cfqq->cfqg, cfqq, forced);
>  
>  	if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
>  		cfq_del_cfqq_rr(cfqd, cfqq);
> @@ -1604,12 +1611,13 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  	}
>  }
>  
> -static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
> +static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out,
> +					bool forced)
>  {
>  	struct cfq_queue *cfqq = cfqd->active_queue;
>  
>  	if (cfqq)
> -		__cfq_slice_expired(cfqd, cfqq, timed_out);
> +		__cfq_slice_expired(cfqd, cfqq, timed_out, forced);
>  }
>  
>  /*
> @@ -1865,6 +1873,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  	sl = cfqd->cfq_slice_idle;
>  
>  	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
> +	blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
>  	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
>  }
>  
> @@ -2176,7 +2185,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  	}
>  
>  expire:
> -	cfq_slice_expired(cfqd, 0);
> +	cfq_slice_expired(cfqd, 0, false);
>  new_queue:
>  	/*
>  	 * Current queue expired. Check if we have to switch to a new
> @@ -2202,7 +2211,7 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
>  	BUG_ON(!list_empty(&cfqq->fifo));
>  
>  	/* By default cfqq is not expired if it is empty. Do it explicitly */
> -	__cfq_slice_expired(cfqq->cfqd, cfqq, 0);
> +	__cfq_slice_expired(cfqq->cfqd, cfqq, 0, true);
>  	return dispatched;
>  }
>  
> @@ -2218,7 +2227,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>  	while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL)
>  		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>  
> -	cfq_slice_expired(cfqd, 0);
> +	cfq_slice_expired(cfqd, 0, true);
>  	BUG_ON(cfqd->busy_queues);
>  
>  	cfq_log(cfqd, "forced_dispatch=%d", dispatched);
> @@ -2382,10 +2391,15 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>  	    cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
>  	    cfq_class_idle(cfqq))) {
>  		cfqq->slice_end = jiffies + 1;
> -		cfq_slice_expired(cfqd, 0);
> +		cfq_slice_expired(cfqd, 0, false);
>  	}
>  
>  	cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
> +	/*
> +	 * This is needed since we don't exactly match the mod_timer() and
> +	 * del_timer() calls in CFQ.
> +	 */
> +	blkiocg_update_idle_time_stats(&cfqq->cfqg->blkg);

Shouldn't we try to call cfq_del_timer() wherever appropriate, instead
of directly trying to update idle_time_stats?

>  	return 1;
>  }
>  
> @@ -2413,7 +2427,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
>  	orig_cfqg = cfqq->orig_cfqg;
>  
>  	if (unlikely(cfqd->active_queue == cfqq)) {
> -		__cfq_slice_expired(cfqd, cfqq, 0);
> +		__cfq_slice_expired(cfqd, cfqq, 0, false);
>  		cfq_schedule_dispatch(cfqd);
>  	}
>  
> @@ -2514,7 +2528,7 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  	struct cfq_queue *__cfqq, *next;
>  
>  	if (unlikely(cfqq == cfqd->active_queue)) {
> -		__cfq_slice_expired(cfqd, cfqq, 0);
> +		__cfq_slice_expired(cfqd, cfqq, 0, false);
>  		cfq_schedule_dispatch(cfqd);
>  	}
>  
> @@ -3143,7 +3157,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
> -	cfq_slice_expired(cfqd, 1);
> +	cfq_slice_expired(cfqd, 1, false);
>  
>  	/*
>  	 * Put the new queue at the front of the of the current list,
> @@ -3191,7 +3205,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  		if (cfq_cfqq_wait_request(cfqq)) {
>  			if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
>  			    cfqd->busy_queues > 1) {
> -				del_timer(&cfqd->idle_slice_timer);
> +				cfq_del_timer(cfqd, cfqq);
>  				cfq_clear_cfqq_wait_request(cfqq);
>  				__blk_run_queue(cfqd->queue);
>  			} else
> @@ -3352,7 +3366,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  		 * - when there is a close cooperator
>  		 */
>  		if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
> -			cfq_slice_expired(cfqd, 1);
> +			cfq_slice_expired(cfqd, 1, false);
>  		else if (sync && cfqq_empty &&
>  			 !cfq_close_cooperator(cfqd, cfqq)) {
>  			cfqd->noidle_tree_requires_idle |= !rq_noidle(rq);
> @@ -3612,7 +3626,7 @@ static void cfq_idle_slice_timer(unsigned long data)
>  		cfq_clear_cfqq_deep(cfqq);
>  	}
>  expire:
> -	cfq_slice_expired(cfqd, timed_out);
> +	cfq_slice_expired(cfqd, timed_out, false);
>  out_kick:
>  	cfq_schedule_dispatch(cfqd);
>  out_cont:
> @@ -3655,7 +3669,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
>  	spin_lock_irq(q->queue_lock);
>  
>  	if (cfqd->active_queue)
> -		__cfq_slice_expired(cfqd, cfqd->active_queue, 0);
> +		__cfq_slice_expired(cfqd, cfqd->active_queue, 0, false);
>  
>  	while (!list_empty(&cfqd->cic_list)) {
>  		struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats
  2010-04-09 15:32   ` Vivek Goyal
@ 2010-04-09 20:44     ` Divyesh Shah
  2010-04-10  2:34     ` Divyesh Shah
  1 sibling, 0 replies; 12+ messages in thread
From: Divyesh Shah @ 2010-04-09 20:44 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Fri, Apr 9, 2010 at 8:32 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Apr 08, 2010 at 09:15:10PM -0700, Divyesh Shah wrote:
>> These stats are useful for getting a feel for the queue depth of the cgroup,
>> i.e., how filled up its queues are at a given instant and over the existence of
>> the cgroup. This ability is useful when debugging problems in the wild as it
>> helps understand the application's IO pattern w/o having to read through the
>> userspace code (coz its tedious or just not available) or w/o the ability
>> to run blktrace (since you may not have root access and/or not want to disturb
>> performance).
>>
>> Signed-off-by: Divyesh Shah<dpshah@google.com>
>> ---
>>
>>  Documentation/cgroups/blkio-controller.txt |   11 +++
>>  block/blk-cgroup.c                         |   98 +++++++++++++++++++++++++++-
>>  block/blk-cgroup.h                         |   20 +++++-
>>  block/cfq-iosched.c                        |   11 +++
>>  4 files changed, 134 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
>> index 810e301..6e52e7c 100644
>> --- a/Documentation/cgroups/blkio-controller.txt
>> +++ b/Documentation/cgroups/blkio-controller.txt
>> @@ -139,6 +139,17 @@ Details of cgroup files
>>         cgroup. This is further divided by the type of operation - read or
>>         write, sync or async.
>>
>> +- blkio.io_queued
>> +     - Total number of requests queued up at any given instant for this
>> +       cgroup. This is further divided by the type of operation - read or
>> +       write, sync or async.
>> +
>> +- blkio.avg_queue_size
>> +     - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
>> +       The average queue size for this cgroup over the entire time of this
>> +       cgroup's existence. Queue size samples are taken each time one of the
>> +       queues of this cgroup gets a timeslice.
>> +
>>  - blkio.dequeue
>>       - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
>>         gives the statistics about how many a times a group was dequeued
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index d23b538..1e0c497 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -81,6 +81,71 @@ static void blkio_add_stat(uint64_t *stat, uint64_t add, bool direction,
>>               stat[BLKIO_STAT_ASYNC] += add;
>>  }
>>
>> +/*
>> + * Decrements the appropriate stat variable if non-zero depending on the
>> + * request type. Panics on value being zero.
>> + * This should be called with the blkg->stats_lock held.
>> + */
>> +static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
>> +{
>> +     if (direction) {
>> +             BUG_ON(stat[BLKIO_STAT_WRITE] == 0);
>> +             stat[BLKIO_STAT_WRITE]--;
>> +     } else {
>> +             BUG_ON(stat[BLKIO_STAT_READ] == 0);
>> +             stat[BLKIO_STAT_READ]--;
>> +     }
>> +     if (sync) {
>> +             BUG_ON(stat[BLKIO_STAT_SYNC] == 0);
>> +             stat[BLKIO_STAT_SYNC]--;
>> +     } else {
>> +             BUG_ON(stat[BLKIO_STAT_ASYNC] == 0);
>> +             stat[BLKIO_STAT_ASYNC]--;
>> +     }
>> +}
>> +
>> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> +void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>> +{
>
> How about blkiocg_update_avg_queue_size_stats()? A different io policy can
> choose to take this sample at higher rate and not necessarily when
> queue is set active.
Done. I'll send a patch fixing all review comments from the different patches.

>
> Can we group this function near blkiocg_update_blkio_group_dequeue_stats()
> so that we introduce on less #ifdef CONFIG_DEBUG_BLK_CGROUP.

I pulled up blkiocg_update_blkio_group_dequeue_stats() to where this
function is.

>
>> +     unsigned long flags;
>> +     struct blkio_group_stats *stats;
>> +
>> +     spin_lock_irqsave(&blkg->stats_lock, flags);
>> +     stats = &blkg->stats;
>> +     stats->avg_queue_size_sum +=
>> +                     stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
>> +                     stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
>> +     stats->avg_queue_size_samples++;
>
> What happens when over a period of time "avg_queue_size_sum" or "avg_queue_size_samples" overflow? All the stats will look weird.

Using your calculations from the other thread, since this is at a much
coarser granularity than ns, I'm not sure we should be worrying about
overflow here.
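
For reference, a back-of-the-envelope check along the same lines as the
one in the other thread (assuming, generously, on the order of 10^4
queue-size samples per second -- one sample per timeslice would normally
be far fewer):

2^64/10^4 = 1844674407370955 sec
1844674407370955/(60*60*24*365) ~= 58 million years

so avg_queue_size_samples, and the sum which grows by at most the queue
depth per sample, should not realistically overflow.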

>
>> +     spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_update_set_active_queue_stats);
>> +#endif
>> +
>> +void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>> +                     struct blkio_group *curr_blkg, bool direction,
>> +                     bool sync)
>> +{
>
> curr_blkg is redundant?
>
> How about just use "io" keyword instead of "request" as you have been
> doing all along. blkiocg_update_io_add_stats().
Done

>
>> +     unsigned long flags;
>> +
>> +     spin_lock_irqsave(&blkg->stats_lock, flags);
>> +     blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
>> +                     sync);
>> +     spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_update_request_add_stats);
>> +
>> +void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
>> +                                             bool direction, bool sync)
>> +{
>
> blkiocg_update_io_remove_stats().
Done

>
>> +     unsigned long flags;
>> +
>> +     spin_lock_irqsave(&blkg->stats_lock, flags);
>> +     blkio_check_and_dec_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED],
>> +                                     direction, sync);
>> +     spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_update_request_remove_stats);
>> +
>>  void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>>  {
>>       unsigned long flags;
>> @@ -253,14 +318,18 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>>       struct blkio_cgroup *blkcg;
>>       struct blkio_group *blkg;
>>       struct hlist_node *n;
>> -     struct blkio_group_stats *stats;
>> +     uint64_t queued[BLKIO_STAT_TOTAL];
>> +     int i;
>>
>>       blkcg = cgroup_to_blkio_cgroup(cgroup);
>>       spin_lock_irq(&blkcg->lock);
>>       hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>>               spin_lock(&blkg->stats_lock);
>> -             stats = &blkg->stats;
>> -             memset(stats, 0, sizeof(struct blkio_group_stats));
>> +             for (i = 0; i < BLKIO_STAT_TOTAL; i++)
>> +                     queued[i] = blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i];
>> +             memset(&blkg->stats, 0, sizeof(struct blkio_group_stats));
>> +             for (i = 0; i < BLKIO_STAT_TOTAL; i++)
>> +                     blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
>
> During reset_stats, why are you not resetting the stats for average queue
> depth? This reset_stat is mean to reset all the debug stats we have
> collected so far for the cgroup?

The avg_queue_size stats (sum and samples) are being reset - see the
memset. The current count of how many IOs are queued should not be
reset since that will mess up the accounting and hit the BUG_ONs when
decrementing counts when the requests are dispatched.
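
Concretely, the failure mode being avoided looks like this (a sketch of
the call chain; the BUG_ON is the one in blkio_check_and_dec_stat()):

	blkiocg_reset_stats()        /* if it also zeroed BLKIO_STAT_QUEUED... */
	...
	cfq_remove_request(rq)
	  -> blkiocg_update_request_remove_stats()
	       -> blkio_check_and_dec_stat()
	            BUG_ON(stat[BLKIO_STAT_WRITE] == 0);
	            /* fires: the request was counted before the reset,
	               but the counter is already zero */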

>
> Thanks
> Vivek
>
>>               spin_unlock(&blkg->stats_lock);
>>       }
>>       spin_unlock_irq(&blkcg->lock);
>> @@ -323,6 +392,15 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
>>               return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>>                                       blkg->stats.sectors, cb, dev);
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>> +     if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
>> +             uint64_t sum = blkg->stats.avg_queue_size_sum;
>> +             uint64_t samples = blkg->stats.avg_queue_size_samples;
>> +             if (samples)
>> +                     do_div(sum, samples);
>> +             else
>> +                     sum = 0;
>> +             return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
>> +     }
>>       if (type == BLKIO_STAT_DEQUEUE)
>>               return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>>                                       blkg->stats.dequeue, cb, dev);
>> @@ -376,8 +454,10 @@ SHOW_FUNCTION_PER_GROUP(io_serviced, BLKIO_STAT_SERVICED, 1);
>>  SHOW_FUNCTION_PER_GROUP(io_service_time, BLKIO_STAT_SERVICE_TIME, 1);
>>  SHOW_FUNCTION_PER_GROUP(io_wait_time, BLKIO_STAT_WAIT_TIME, 1);
>>  SHOW_FUNCTION_PER_GROUP(io_merged, BLKIO_STAT_MERGED, 1);
>> +SHOW_FUNCTION_PER_GROUP(io_queued, BLKIO_STAT_QUEUED, 1);
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>>  SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
>> +SHOW_FUNCTION_PER_GROUP(avg_queue_size, BLKIO_STAT_AVG_QUEUE_SIZE, 0);
>>  #endif
>>  #undef SHOW_FUNCTION_PER_GROUP
>>
>> @@ -425,14 +505,22 @@ struct cftype blkio_files[] = {
>>               .read_map = blkiocg_io_merged_read,
>>       },
>>       {
>> +             .name = "io_queued",
>> +             .read_map = blkiocg_io_queued_read,
>> +     },
>> +     {
>>               .name = "reset_stats",
>>               .write_u64 = blkiocg_reset_stats,
>>       },
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>> -       {
>> +     {
>> +             .name = "avg_queue_size",
>> +             .read_map = blkiocg_avg_queue_size_read,
>> +     },
>> +     {
>>               .name = "dequeue",
>>               .read_map = blkiocg_dequeue_read,
>> -       },
>> +     },
>>  #endif
>>  };
>>
>> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
>> index 470a29d..bea7f3b 100644
>> --- a/block/blk-cgroup.h
>> +++ b/block/blk-cgroup.h
>> @@ -36,10 +36,13 @@ enum stat_type {
>>       BLKIO_STAT_WAIT_TIME,
>>       /* Number of IOs merged */
>>       BLKIO_STAT_MERGED,
>> +     /* Number of IOs queued up */
>> +     BLKIO_STAT_QUEUED,
>>       /* All the single valued stats go below this */
>>       BLKIO_STAT_TIME,
>>       BLKIO_STAT_SECTORS,
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>> +     BLKIO_STAT_AVG_QUEUE_SIZE,
>>       BLKIO_STAT_DEQUEUE
>>  #endif
>>  };
>> @@ -63,8 +66,12 @@ struct blkio_group_stats {
>>       /* total disk time and nr sectors dispatched by this group */
>>       uint64_t time;
>>       uint64_t sectors;
>> -     uint64_t stat_arr[BLKIO_STAT_MERGED + 1][BLKIO_STAT_TOTAL];
>> +     uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>> +     /* Sum of number of IOs queued across all samples */
>> +     uint64_t avg_queue_size_sum;
>> +     /* Count of samples taken for average */
>> +     uint64_t avg_queue_size_samples;
>>       /* How many times this group has been removed from service tree */
>>       unsigned long dequeue;
>>  #endif
>> @@ -127,10 +134,13 @@ static inline char *blkg_path(struct blkio_group *blkg)
>>  {
>>       return blkg->path;
>>  }
>> +void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg);
>>  void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>>                               unsigned long dequeue);
>>  #else
>>  static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
>> +static inline void blkiocg_update_set_active_queue_stats(
>> +                                             struct blkio_group *blkg) {}
>>  static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>>                                               unsigned long dequeue) {}
>>  #endif
>> @@ -152,6 +162,10 @@ void blkiocg_update_completion_stats(struct blkio_group *blkg,
>>       uint64_t start_time, uint64_t io_start_time, bool direction, bool sync);
>>  void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
>>                                       bool sync);
>> +void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>> +             struct blkio_group *curr_blkg, bool direction, bool sync);
>> +void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
>> +                                     bool direction, bool sync);
>>  #else
>>  struct cgroup;
>>  static inline struct blkio_cgroup *
>> @@ -175,5 +189,9 @@ static inline void blkiocg_update_completion_stats(struct blkio_group *blkg,
>>               bool sync) {}
>>  static inline void blkiocg_update_io_merged_stats(struct blkio_group *blkg,
>>                                               bool direction, bool sync) {}
>> +static inline void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>> +             struct blkio_group *curr_blkg, bool direction, bool sync) {}
>> +static inline void blkiocg_update_request_remove_stats(struct blkio_group *blkg,
>> +                                             bool direction, bool sync) {}
>>  #endif
>>  #endif /* _BLK_CGROUP_H */
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 4eb1906..8e0b86a 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1380,7 +1380,12 @@ static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
>>  {
>>       elv_rb_del(&cfqq->sort_list, rq);
>>       cfqq->queued[rq_is_sync(rq)]--;
>> +     blkiocg_update_request_remove_stats(&cfqq->cfqg->blkg, rq_data_dir(rq),
>> +                                             rq_is_sync(rq));
>>       cfq_add_rq_rb(rq);
>> +     blkiocg_update_request_add_stats(
>> +                     &cfqq->cfqg->blkg, &cfqq->cfqd->serving_group->blkg,
>> +                     rq_data_dir(rq), rq_is_sync(rq));
>>  }
>>
>>  static struct request *
>> @@ -1436,6 +1441,8 @@ static void cfq_remove_request(struct request *rq)
>>       cfq_del_rq_rb(rq);
>>
>>       cfqq->cfqd->rq_queued--;
>> +     blkiocg_update_request_remove_stats(&cfqq->cfqg->blkg, rq_data_dir(rq),
>> +                                             rq_is_sync(rq));
>>       if (rq_is_meta(rq)) {
>>               WARN_ON(!cfqq->meta_pending);
>>               cfqq->meta_pending--;
>> @@ -1527,6 +1534,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>>       if (cfqq) {
>>               cfq_log_cfqq(cfqd, cfqq, "set_active wl_prio:%d wl_type:%d",
>>                               cfqd->serving_prio, cfqd->serving_type);
>> +             blkiocg_update_set_active_queue_stats(&cfqq->cfqg->blkg);
>>               cfqq->slice_start = 0;
>>               cfqq->dispatch_start = jiffies;
>>               cfqq->allocated_slice = 0;
>> @@ -3213,6 +3221,9 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
>>       list_add_tail(&rq->queuelist, &cfqq->fifo);
>>       cfq_add_rq_rb(rq);
>>
>> +     blkiocg_update_request_add_stats(&cfqq->cfqg->blkg,
>> +                     &cfqd->serving_group->blkg, rq_data_dir(rq),
>> +                     rq_is_sync(rq));
>>       cfq_rq_enqueued(cfqd, cfqq, rq);
>>  }
>>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/3] blkio: Add more debug-only per-cgroup stats
  2010-04-09 19:21   ` Vivek Goyal
@ 2010-04-10  0:09     ` Divyesh Shah
  2010-04-12 14:04       ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Divyesh Shah @ 2010-04-10  0:09 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Fri, Apr 9, 2010 at 12:21 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Apr 08, 2010 at 09:15:35PM -0700, Divyesh Shah wrote:
>> 1) group_wait_time - This is the amount of time the cgroup had to wait to get a
>>   timeslice for one of its queues from when it became busy, i.e., went from 0
>>   to 1 request queued. This is different from the io_wait_time which is the
>>   cumulative total of the amount of time spent by each IO in that cgroup waiting
>>   in the scheduler queue. This stat is a great way to find out any jobs in the
>>   fleet that are being starved or waiting for longer than what is expected (due
>>   to an IO controller bug or any other issue).
>> 2) empty_time - This is the amount of time a cgroup spends w/o any pending
>>    requests. This stat is useful when a job does not seem to be able to use its
>>    assigned disk share by helping check if that is happening due to an IO
>>    controller bug or because the job is not submitting enough IOs.
>
> Divyesh,
>
> This empty_time looks like a redundant stats. One can just look at
> group_wait_time. If group_wait_time is  not increasing, then application
> in cgroup is not doing any IO. If it is increasing then one can look
> at so many other stas like blkio.time, blkio.sectors etc to find out
> if cgroup is making any progress or not.

Vivek,
        group_wait_time may not be increasing even when the cgroup
is dominating the disk by sending tons of IO all the time, so you
can't conclude that the cgroup is not doing any IO. Even when it is
increasing, none of the other stats give what we are looking for
with this stat, which is how busy the application is really able to
keep its IO queues over a given period of time. I do realize that
these 3 stats might not be of interest to everyone, which is why I
enable them only in debug mode. I do think this stat adds value and
is not completely redundant.

>
> So IMHO, we really don't need anything like empty_time.
>
>
>> 3) idle_time - This is the amount of time spent by the IO scheduler idling
>>    for a given cgroup in anticipation of a better request than the exising ones
>>    from other queues/cgroups.
>>
>> All these stats are recorded using start and stop events. When reading these
>> stats, we do not add the delta between the current time and the last start time
>> if we're between the start and stop events. We avoid doing this to make sure
>> that these numbers are always monotonically increasing when read. Since we're
>> using sched_clock() which may use the tsc as its source, it may induce some
>> inconsistency (due to tsc resync across cpus) if we included the current delta.
>
> So what is that inconsistency? Time goes backwards? If yes, can't we just
> check that stop time is start time and only then any time sample is
> valid otherwise we ignore it?

Looks like I did a bad job of explaining the problem. We have
start-stop events, and we do a good job of discarding any intervals
where time went backwards and hence stop_time is earlier than
start_time.
Say we are in the middle of the start and stop events and userspace
reads this value. Suppose we computed the delta between the start event
and now (ignoring the case where now is before the start event) and
added that value to the accumulated time we export to userspace. If the
next read happens right away and, in between these 2 reads, time goes
backward (due to a tsc resync or any other issue), we will get a current
delta which may be smaller than before, and hence the stat that we
export to userspace may actually decrease in value. We want to avoid
that and always export monotonically increasing values to userspace
(except when we roll over... which takes a really long time). That is
why the current delta is not included in the reported value.
If this does not clarify it, I can try again with an example too.
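
For instance, a minimal sketch of the two read policies being contrasted
(the helper names here are illustrative; the fields and flag test are the
ones from the patch):

	/* include the in-flight delta: can appear to go backwards */
	u64 read_idle_time_with_delta(struct blkio_group_stats *stats)
	{
		u64 v = stats->idle_time;

		if (blkio_blkg_idling(stats))
			/* this delta can shrink between two reads if the
			 * clock steps back, so v can decrease */
			v += sched_clock() - stats->start_idle_time;
		return v;
	}

	/* what the patch does: report only the accumulated time */
	u64 read_idle_time(struct blkio_group_stats *stats)
	{
		return stats->idle_time;	/* only ever grows */
	}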

> Why same inconsistency in time (due to tsc resync) can't be introduced between start and stop event window?

The above explanation should have hopefully explained this too.

>
> In case of group_wait_time, there does not seem to be any stop event. It
> is just start and update event?

the stop event is when the first queue from this group gets a
timeslice after the group became busy. We clear the blkg_waiting flag
on this event.
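
Roughly, the pairing is (a sketch of the call chain as implemented in
this patch, simplified):

	/* start: a request is added while this group is not the one
	 * currently being served */
	blkiocg_update_request_add_stats()
	  -> blkio_set_start_group_wait_time(blkg, curr_blkg)
	       /* records start_group_wait_time, marks BLKG_waiting */

	/* update/stop: one of the group's queues gets a timeslice */
	blkiocg_update_set_active_queue_stats()
	  -> blkio_update_group_wait_time(stats)
	       /* adds now - start_group_wait_time, clears BLKG_waiting */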

>
>>
>> Signed-off-by: Divyesh Shah<dpshah@google.com>
>> ---
>>
>>  Documentation/cgroups/blkio-controller.txt |   29 +++++
>>  block/blk-cgroup.c                         |  159 +++++++++++++++++++++++++++-
>>  block/blk-cgroup.h                         |   54 ++++++++++
>>  block/cfq-iosched.c                        |   50 ++++++---
>>  4 files changed, 271 insertions(+), 21 deletions(-)
>>
>> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
>> index 6e52e7c..db054ea 100644
>> --- a/Documentation/cgroups/blkio-controller.txt
>> +++ b/Documentation/cgroups/blkio-controller.txt
>> @@ -150,6 +150,35 @@ Details of cgroup files
>>         cgroup's existence. Queue size samples are taken each time one of the
>>         queues of this cgroup gets a timeslice.
>>
>> +- blkio.group_wait_time
>> +     - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
>> +       This is the amount of time the cgroup had to wait since it became busy
>> +       (i.e., went from 0 to 1 request queued) to get a timeslice for one of
>> +       its queues. This is different from the io_wait_time which is the
>> +       cumulative total of the amount of time spent by each IO in that cgroup
>> +       waiting in the scheduler queue. This is in nanoseconds. If this is
>> +       read when the cgroup is in a waiting (for timeslice) state, the stat
>> +       will only report the group_wait_time accumulated till the last time it
>> +       got a timeslice and will not include the current delta.
>> +
>> +- blkio.empty_time
>> +     - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
>> +       This is the amount of time a cgroup spends without any pending
>> +       requests when not being served, i.e., it does not include any time
>> +       spent idling for one of the queues of the cgroup. This is in
>> +       nanoseconds. If this is read when the cgroup is in an empty state,
>> +       the stat will only report the empty_time accumulated till the last
>> +       time it had a pending request and will not include the current delta.
>> +
>> +- blkio.idle_time
>> +     - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y.
>> +       This is the amount of time spent by the IO scheduler idling for a
>> +       given cgroup in anticipation of a better request than the exising ones
>> +       from other queues/cgroups. This is in nanoseconds. If this is read
>> +       when the cgroup is in an idling state, the stat will only report the
>> +       idle_time accumulated till the last idle period and will not include
>> +       the current delta.
>> +
>>  - blkio.dequeue
>>       - Debugging aid only enabled if CONFIG_DEBUG_CFQ_IOSCHED=y. This
>>         gives the statistics about how many a times a group was dequeued
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 1e0c497..1ecff7a 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -105,6 +105,76 @@ static void blkio_check_and_dec_stat(uint64_t *stat, bool direction, bool sync)
>>  }
>>
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>> +/* This should be called with the blkg->stats_lock held. */
>> +static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
>> +                                             struct blkio_group *curr_blkg)
>> +{
>> +     if (blkio_blkg_waiting(&blkg->stats))
>> +             return;
>> +     if (blkg == curr_blkg)
>> +             return;
>> +     blkg->stats.start_group_wait_time = sched_clock();
>> +     blkio_mark_blkg_waiting(&blkg->stats);
>
> I am hoping that you have a good reason to maintain all these stats in ns
> and not in us.

Is trying to be consistent (w/ the exception of blkio.time) good enough? ;)
It would suck to have different stats export time in different units.

>
> My rough calculation says that it will take 584 years to overflow this
> :-). So probably ns is not bad.

Yep.. ns is not that bad

>
> 2^64/10^9 = 18446744073 sec
> 18446744073/60 = 307445734 mins
> 307445734/60 = 5124095 hrs
> 5124095/24= 213503 days
> 213503/365= 584 years
>
> Hopefully this calculation is right.
>
>> +}
>> +
>> +/* This should be called with the blkg->stats_lock held. */
>> +static void blkio_update_group_wait_time(struct blkio_group_stats *stats)
>> +{
>> +     unsigned long long now;
>> +
>> +     if (!blkio_blkg_waiting(stats))
>> +             return;
>> +
>> +     now = sched_clock();
>> +     if (time_after64(now, stats->start_group_wait_time))
>> +             stats->group_wait_time += now - stats->start_group_wait_time;
>> +     blkio_clear_blkg_waiting(stats);
>> +}
>> +
>> +/* This should be called with the blkg->stats_lock held. */
>> +static void blkio_end_empty_time(struct blkio_group_stats *stats)
>> +{
>> +     unsigned long long now;
>> +
>> +     if (!blkio_blkg_empty(stats))
>> +             return;
>> +
>> +     now = sched_clock();
>> +     if (time_after64(now, stats->start_empty_time))
>> +             stats->empty_time += now - stats->start_empty_time;
>> +     blkio_clear_blkg_empty(stats);
>> +}
>> +
>> +void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
>> +{
>> +     unsigned long flags;
>> +
>> +     spin_lock_irqsave(&blkg->stats_lock, flags);
>> +     BUG_ON(blkio_blkg_idling(&blkg->stats));
>> +     blkg->stats.start_idle_time = sched_clock();
>> +     blkio_mark_blkg_idling(&blkg->stats);
>> +     spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_update_set_idle_time_stats);
>> +
>> +void blkiocg_update_idle_time_stats(struct blkio_group *blkg)
>> +{
>> +     unsigned long flags;
>> +     unsigned long long now;
>> +     struct blkio_group_stats *stats;
>> +
>> +     spin_lock_irqsave(&blkg->stats_lock, flags);
>> +     stats = &blkg->stats;
>> +     if (blkio_blkg_idling(stats)) {
>> +             now = sched_clock();
>> +             if (time_after64(now, stats->start_idle_time))
>> +                     stats->idle_time += now - stats->start_idle_time;
>> +             blkio_clear_blkg_idling(stats);
>> +     }
>> +     spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_update_idle_time_stats);
>> +
>>  void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>>  {
>>       unsigned long flags;
>> @@ -116,9 +186,14 @@ void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>>                       stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
>>                       stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
>>       stats->avg_queue_size_samples++;
>> +     blkio_update_group_wait_time(stats);
>
> Will cfq_choose_cfqg() be a better place to call this function. Why should
> we call it when an active queue is set. Instead when a group is chosen
> to dispatch request from, we can update the wait time.

Does that cover pre-emption cases too?

> That way the number of times this function is called will reduce as we
> can dispatch multiple cfqqs from single group.
>
>>       spin_unlock_irqrestore(&blkg->stats_lock, flags);
>>  }
>>  EXPORT_SYMBOL_GPL(blkiocg_update_set_active_queue_stats);
>> +#else
>> +static inline void blkio_set_start_group_wait_time(struct blkio_group *blkg,
>> +                                     struct blkio_group *curr_blkg) {}
>> +static inline void blkio_end_empty_time(struct blkio_group_stats *stats) {}
>>  #endif
>>
>>  void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>> @@ -130,6 +205,8 @@ void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>>       spin_lock_irqsave(&blkg->stats_lock, flags);
>>       blkio_add_stat(blkg->stats.stat_arr[BLKIO_STAT_QUEUED], 1, direction,
>>                       sync);
>> +     blkio_end_empty_time(&blkg->stats);
>> +     blkio_set_start_group_wait_time(blkg, curr_blkg);
>>       spin_unlock_irqrestore(&blkg->stats_lock, flags);
>>  }
>>  EXPORT_SYMBOL_GPL(blkiocg_update_request_add_stats);
>> @@ -156,6 +233,33 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
>>  }
>>  EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
>>
>> +void blkiocg_set_start_empty_time(struct blkio_group *blkg, bool ignore)
>> +{
>> +     unsigned long flags;
>> +     struct blkio_group_stats *stats;
>> +
>> +     spin_lock_irqsave(&blkg->stats_lock, flags);
>> +     stats = &blkg->stats;
>> +
>> +     if (stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] ||
>> +                     stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE]) {
>> +             spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +             return;
>> +     }
>> +
>> +     /*
>> +      * If ignore is set, we do not panic on the empty flag being set
>> +      * already. This is to avoid cases where there are superfluous timeslice
>> +      * complete events (for eg., forced_dispatch in CFQ) when no IOs are
>> +      * served which could result in triggering the empty check incorrectly.
>> +      */
>> +     BUG_ON(!ignore && blkio_blkg_empty(stats));
>> +     stats->start_empty_time = sched_clock();
>> +     blkio_mark_blkg_empty(stats);
>> +     spin_unlock_irqrestore(&blkg->stats_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(blkiocg_set_start_empty_time);
>> +
>>  void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
>>                               uint64_t bytes, bool direction, bool sync)
>>  {
>> @@ -317,19 +421,44 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
>>  {
>>       struct blkio_cgroup *blkcg;
>>       struct blkio_group *blkg;
>> +     struct blkio_group_stats *stats;
>>       struct hlist_node *n;
>>       uint64_t queued[BLKIO_STAT_TOTAL];
>>       int i;
>> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> +     bool idling, waiting, empty;
>> +     unsigned long long now = sched_clock();
>> +#endif
>>
>>       blkcg = cgroup_to_blkio_cgroup(cgroup);
>>       spin_lock_irq(&blkcg->lock);
>>       hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>>               spin_lock(&blkg->stats_lock);
>> +             stats = &blkg->stats;
>> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> +             idling = blkio_blkg_idling(stats);
>> +             waiting = blkio_blkg_waiting(stats);
>> +             empty = blkio_blkg_empty(stats);
>> +#endif
>>               for (i = 0; i < BLKIO_STAT_TOTAL; i++)
>> -                     queued[i] = blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i];
>> -             memset(&blkg->stats, 0, sizeof(struct blkio_group_stats));
>> +                     queued[i] = stats->stat_arr[BLKIO_STAT_QUEUED][i];
>> +             memset(stats, 0, sizeof(struct blkio_group_stats));
>>               for (i = 0; i < BLKIO_STAT_TOTAL; i++)
>> -                     blkg->stats.stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
>> +                     stats->stat_arr[BLKIO_STAT_QUEUED][i] = queued[i];
>> +#ifdef CONFIG_DEBUG_BLK_CGROUP
>> +             if (idling) {
>> +                     blkio_mark_blkg_idling(stats);
>> +                     stats->start_idle_time = now;
>> +             }
>> +             if (waiting) {
>> +                     blkio_mark_blkg_waiting(stats);
>> +                     stats->start_group_wait_time = now;
>> +             }
>> +             if (empty) {
>> +                     blkio_mark_blkg_empty(stats);
>> +                     stats->start_empty_time = now;
>> +             }
>> +#endif
>>               spin_unlock(&blkg->stats_lock);
>>       }
>>       spin_unlock_irq(&blkcg->lock);
>> @@ -401,6 +530,15 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
>>                       sum = 0;
>>               return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
>>       }
>> +     if (type == BLKIO_STAT_GROUP_WAIT_TIME)
>> +             return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>> +                                     blkg->stats.group_wait_time, cb, dev);
>> +     if (type == BLKIO_STAT_IDLE_TIME)
>> +             return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>> +                                     blkg->stats.idle_time, cb, dev);
>> +     if (type == BLKIO_STAT_EMPTY_TIME)
>> +             return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>> +                                     blkg->stats.empty_time, cb, dev);
>>       if (type == BLKIO_STAT_DEQUEUE)
>>               return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
>>                                       blkg->stats.dequeue, cb, dev);
>> @@ -458,6 +596,9 @@ SHOW_FUNCTION_PER_GROUP(io_queued, BLKIO_STAT_QUEUED, 1);
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>>  SHOW_FUNCTION_PER_GROUP(dequeue, BLKIO_STAT_DEQUEUE, 0);
>>  SHOW_FUNCTION_PER_GROUP(avg_queue_size, BLKIO_STAT_AVG_QUEUE_SIZE, 0);
>> +SHOW_FUNCTION_PER_GROUP(group_wait_time, BLKIO_STAT_GROUP_WAIT_TIME, 0);
>> +SHOW_FUNCTION_PER_GROUP(idle_time, BLKIO_STAT_IDLE_TIME, 0);
>> +SHOW_FUNCTION_PER_GROUP(empty_time, BLKIO_STAT_EMPTY_TIME, 0);
>>  #endif
>>  #undef SHOW_FUNCTION_PER_GROUP
>>
>> @@ -518,6 +659,18 @@ struct cftype blkio_files[] = {
>>               .read_map = blkiocg_avg_queue_size_read,
>>       },
>>       {
>> +             .name = "group_wait_time",
>> +             .read_map = blkiocg_group_wait_time_read,
>> +     },
>> +     {
>> +             .name = "idle_time",
>> +             .read_map = blkiocg_idle_time_read,
>> +     },
>> +     {
>> +             .name = "empty_time",
>> +             .read_map = blkiocg_empty_time_read,
>> +     },
>> +     {
>>               .name = "dequeue",
>>               .read_map = blkiocg_dequeue_read,
>>       },
>> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
>> index bea7f3b..bfce085 100644
>> --- a/block/blk-cgroup.h
>> +++ b/block/blk-cgroup.h
>> @@ -43,6 +43,9 @@ enum stat_type {
>>       BLKIO_STAT_SECTORS,
>>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>>       BLKIO_STAT_AVG_QUEUE_SIZE,
>> +     BLKIO_STAT_IDLE_TIME,
>> +     BLKIO_STAT_EMPTY_TIME,
>> +     BLKIO_STAT_GROUP_WAIT_TIME,
>>       BLKIO_STAT_DEQUEUE
>>  #endif
>>  };
>> @@ -55,6 +58,13 @@ enum stat_sub_type {
>>       BLKIO_STAT_TOTAL
>>  };
>>
>> +/* blkg state flags */
>> +enum blkg_state_flags {
>> +     BLKG_waiting = 0,
>> +     BLKG_idling,
>> +     BLKG_empty,
>> +};
>> +
>>  struct blkio_cgroup {
>>       struct cgroup_subsys_state css;
>>       unsigned int weight;
>> @@ -74,6 +84,21 @@ struct blkio_group_stats {
>>       uint64_t avg_queue_size_samples;
>>       /* How many times this group has been removed from service tree */
>>       unsigned long dequeue;
>> +
>> +     /* Total time spent waiting for it to be assigned a timeslice. */
>> +     uint64_t group_wait_time;
>> +     uint64_t start_group_wait_time;
>> +
>> +     /* Time spent idling for this blkio_group */
>> +     uint64_t idle_time;
>> +     uint64_t start_idle_time;
>> +     /*
>> +      * Total time when we have requests queued and do not contain the
>> +      * current active queue.
>> +      */
>> +     uint64_t empty_time;
>> +     uint64_t start_empty_time;
>> +     uint16_t flags;
>
> Does this flags field have to be inside blkio_group_stats? Maybe a better
> place is inside blkio_group, as it represents the blkio_group status.
> That's a different thing; as of today all three flags are helping out
> only for stats purposes, but in future we could add more flags?

I agree with you in principle. Would it make sense to move this to
blkg when we have a use-case for it beyond stats?

>
>>  #endif
>>  };
>>
>> @@ -137,12 +162,41 @@ static inline char *blkg_path(struct blkio_group *blkg)
>>  void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg);
>>  void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>>                               unsigned long dequeue);
>> +void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg);
>> +void blkiocg_update_idle_time_stats(struct blkio_group *blkg);
>> +void blkiocg_set_start_empty_time(struct blkio_group *blkg, bool ignore);
>> +
>> +#define BLKG_FLAG_FNS(name)                                          \
>> +static inline void blkio_mark_blkg_##name(                           \
>> +             struct blkio_group_stats *stats)                        \
>> +{                                                                    \
>> +     stats->flags |= (1 << BLKG_##name);                             \
>> +}                                                                    \
>> +static inline void blkio_clear_blkg_##name(                          \
>> +             struct blkio_group_stats *stats)                        \
>> +{                                                                    \
>> +     stats->flags &= ~(1 << BLKG_##name);                            \
>> +}                                                                    \
>> +static inline int blkio_blkg_##name(struct blkio_group_stats *stats) \
>> +{                                                                    \
>> +     return (stats->flags & (1 << BLKG_##name)) != 0;                \
>> +}                                                                    \
>> +
>> +BLKG_FLAG_FNS(waiting)
>> +BLKG_FLAG_FNS(idling)
>> +BLKG_FLAG_FNS(empty)
>> +#undef BLKG_FLAG_FNS
>>  #else
>>  static inline char *blkg_path(struct blkio_group *blkg) { return NULL; }
>>  static inline void blkiocg_update_set_active_queue_stats(
>>                                               struct blkio_group *blkg) {}
>>  static inline void blkiocg_update_dequeue_stats(struct blkio_group *blkg,
>>                                               unsigned long dequeue) {}
>> +static inline void blkiocg_update_set_idle_time_stats(struct blkio_group *blkg)
>> +{}
>> +static inline void blkiocg_update_idle_time_stats(struct blkio_group *blkg) {}
>> +static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg,
>> +                                             bool ignore) {}
>>  #endif
>>
>>  #if defined(CONFIG_BLK_CGROUP) || defined(CONFIG_BLK_CGROUP_MODULE)
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 8e0b86a..b6e095c 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -886,7 +886,7 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>>  }
>>
>>  static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> -                             struct cfq_queue *cfqq)
>> +                             struct cfq_queue *cfqq, bool forced)
>>  {
>>       struct cfq_rb_root *st = &cfqd->grp_service_tree;
>>       unsigned int used_sl, charge_sl;
>> @@ -916,6 +916,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>       cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
>>                                       st->min_vdisktime);
>>       blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
>> +     blkiocg_set_start_empty_time(&cfqg->blkg, forced);
>
> Why does this need to be called from CFQ? Can't we just call this
> from blkiocg_update_request_remove_stats() internally? So when we remove
> a request, we update the stats in blkio. At that time blkio can check if
> the group became empty and start the clock?

From Documentation/cgroups/blkio-controller.txt:
This is the amount of time a cgroup spends without any pending
requests when not being served, i.e., it does not include any time
spent idling for one of the queues of the cgroup.

Calling this from remove_stats will add the idle time in here too.
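
To make that concrete (hypothetical numbers): suppose a group's last queued
request completes, CFQ then idles ~8 ms on that group's queue hoping for more
IO, and the slice finally expires. Those 8 ms belong in idle_time; starting
the empty clock at request removal would fold them into empty_time as well,
and the two stats would no longer be separable.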

>
>>  }
>>
>>  #ifdef CONFIG_CFQ_GROUP_IOSCHED
>> @@ -1528,6 +1529,12 @@ static int cfq_allow_merge(struct request_queue *q, struct request *rq,
>>       return cfqq == RQ_CFQQ(rq);
>>  }
>>
>> +static inline void cfq_del_timer(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> +{
>> +     del_timer(&cfqd->idle_slice_timer);
>> +     blkiocg_update_idle_time_stats(&cfqq->cfqg->blkg);
>> +}
>> +
>>  static void __cfq_set_active_queue(struct cfq_data *cfqd,
>>                                  struct cfq_queue *cfqq)
>>  {
>> @@ -1547,7 +1554,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>>               cfq_clear_cfqq_fifo_expire(cfqq);
>>               cfq_mark_cfqq_slice_new(cfqq);
>>
>> -             del_timer(&cfqd->idle_slice_timer);
>> +             cfq_del_timer(cfqd, cfqq);
>>       }
>>
>>       cfqd->active_queue = cfqq;
>> @@ -1558,12 +1565,12 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
>>   */
>>  static void
>>  __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> -                 bool timed_out)
>> +                 bool timed_out, bool forced)
>>  {
>>       cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
>>
>>       if (cfq_cfqq_wait_request(cfqq))
>> -             del_timer(&cfqd->idle_slice_timer);
>> +             cfq_del_timer(cfqd, cfqq);
>>
>>       cfq_clear_cfqq_wait_request(cfqq);
>>       cfq_clear_cfqq_wait_busy(cfqq);
>> @@ -1585,7 +1592,7 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>               cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
>>       }
>>
>> -     cfq_group_served(cfqd, cfqq->cfqg, cfqq);
>> +     cfq_group_served(cfqd, cfqq->cfqg, cfqq, forced);
>>
>>       if (cfq_cfqq_on_rr(cfqq) && RB_EMPTY_ROOT(&cfqq->sort_list))
>>               cfq_del_cfqq_rr(cfqd, cfqq);
>> @@ -1604,12 +1611,13 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>       }
>>  }
>>
>> -static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
>> +static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out,
>> +                                     bool forced)
>>  {
>>       struct cfq_queue *cfqq = cfqd->active_queue;
>>
>>       if (cfqq)
>> -             __cfq_slice_expired(cfqd, cfqq, timed_out);
>> +             __cfq_slice_expired(cfqd, cfqq, timed_out, forced);
>>  }
>>
>>  /*
>> @@ -1865,6 +1873,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>>       sl = cfqd->cfq_slice_idle;
>>
>>       mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
>> +     blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
>>       cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
>>  }
>>
>> @@ -2176,7 +2185,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>>       }
>>
>>  expire:
>> -     cfq_slice_expired(cfqd, 0);
>> +     cfq_slice_expired(cfqd, 0, false);
>>  new_queue:
>>       /*
>>        * Current queue expired. Check if we have to switch to a new
>> @@ -2202,7 +2211,7 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
>>       BUG_ON(!list_empty(&cfqq->fifo));
>>
>>       /* By default cfqq is not expired if it is empty. Do it explicitly */
>> -     __cfq_slice_expired(cfqq->cfqd, cfqq, 0);
>> +     __cfq_slice_expired(cfqq->cfqd, cfqq, 0, true);
>>       return dispatched;
>>  }
>>
>> @@ -2218,7 +2227,7 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>>       while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL)
>>               dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>>
>> -     cfq_slice_expired(cfqd, 0);
>> +     cfq_slice_expired(cfqd, 0, true);
>>       BUG_ON(cfqd->busy_queues);
>>
>>       cfq_log(cfqd, "forced_dispatch=%d", dispatched);
>> @@ -2382,10 +2391,15 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>>           cfqq->slice_dispatch >= cfq_prio_to_maxrq(cfqd, cfqq)) ||
>>           cfq_class_idle(cfqq))) {
>>               cfqq->slice_end = jiffies + 1;
>> -             cfq_slice_expired(cfqd, 0);
>> +             cfq_slice_expired(cfqd, 0, false);
>>       }
>>
>>       cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
>> +     /*
>> +      * This is needed since we don't exactly match the mod_timer() and
>> +      * del_timer() calls in CFQ.
>> +      */
>> +     blkiocg_update_idle_time_stats(&cfqq->cfqg->blkg);
>
> Shouldn't we try to call cfq_del_timer() wherever appropriate, instead of
> directly trying to update idle_time_stats?

Done.

>
>>       return 1;
>>  }
>>
>> @@ -2413,7 +2427,7 @@ static void cfq_put_queue(struct cfq_queue *cfqq)
>>       orig_cfqg = cfqq->orig_cfqg;
>>
>>       if (unlikely(cfqd->active_queue == cfqq)) {
>> -             __cfq_slice_expired(cfqd, cfqq, 0);
>> +             __cfq_slice_expired(cfqd, cfqq, 0, false);
>>               cfq_schedule_dispatch(cfqd);
>>       }
>>
>> @@ -2514,7 +2528,7 @@ static void cfq_exit_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>       struct cfq_queue *__cfqq, *next;
>>
>>       if (unlikely(cfqq == cfqd->active_queue)) {
>> -             __cfq_slice_expired(cfqd, cfqq, 0);
>> +             __cfq_slice_expired(cfqd, cfqq, 0, false);
>>               cfq_schedule_dispatch(cfqd);
>>       }
>>
>> @@ -3143,7 +3157,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>  {
>>       cfq_log_cfqq(cfqd, cfqq, "preempt");
>> -     cfq_slice_expired(cfqd, 1);
>> +     cfq_slice_expired(cfqd, 1, false);
>>
>>       /*
>>        * Put the new queue at the front of the of the current list,
>> @@ -3191,7 +3205,7 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>               if (cfq_cfqq_wait_request(cfqq)) {
>>                       if (blk_rq_bytes(rq) > PAGE_CACHE_SIZE ||
>>                           cfqd->busy_queues > 1) {
>> -                             del_timer(&cfqd->idle_slice_timer);
>> +                             cfq_del_timer(cfqd, cfqq);
>>                               cfq_clear_cfqq_wait_request(cfqq);
>>                               __blk_run_queue(cfqd->queue);
>>                       } else
>> @@ -3352,7 +3366,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>>                * - when there is a close cooperator
>>                */
>>               if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
>> -                     cfq_slice_expired(cfqd, 1);
>> +                     cfq_slice_expired(cfqd, 1, false);
>>               else if (sync && cfqq_empty &&
>>                        !cfq_close_cooperator(cfqd, cfqq)) {
>>                       cfqd->noidle_tree_requires_idle |= !rq_noidle(rq);
>> @@ -3612,7 +3626,7 @@ static void cfq_idle_slice_timer(unsigned long data)
>>               cfq_clear_cfqq_deep(cfqq);
>>       }
>>  expire:
>> -     cfq_slice_expired(cfqd, timed_out);
>> +     cfq_slice_expired(cfqd, timed_out, false);
>>  out_kick:
>>       cfq_schedule_dispatch(cfqd);
>>  out_cont:
>> @@ -3655,7 +3669,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
>>       spin_lock_irq(q->queue_lock);
>>
>>       if (cfqd->active_queue)
>> -             __cfq_slice_expired(cfqd, cfqd->active_queue, 0);
>> +             __cfq_slice_expired(cfqd, cfqd->active_queue, 0, false);
>>
>>       while (!list_empty(&cfqd->cic_list)) {
>>               struct cfq_io_context *cic = list_entry(cfqd->cic_list.next,
>


* Re: [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats
  2010-04-09 15:32   ` Vivek Goyal
  2010-04-09 20:44     ` Divyesh Shah
@ 2010-04-10  2:34     ` Divyesh Shah
  1 sibling, 0 replies; 12+ messages in thread
From: Divyesh Shah @ 2010-04-10  2:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Fri, Apr 9, 2010 at 8:32 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> +void blkiocg_update_request_add_stats(struct blkio_group *blkg,
>> +                     struct blkio_group *curr_blkg, bool direction,
>> +                     bool sync)
>> +{
>
> curr_blkg is redundant?

Missed this comment in my earlier response. No, curr_blkg is not
redundant since it is used for determining whether we should set
group_wait_time or not. This argument should really have been added in
the 3rd patch instead of this one. Unless you really want me to, I'd
just let it be.
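
For context, the helper ends up using curr_blkg roughly along these lines
(sketch from memory, not a verbatim copy of the patch; the real body may
differ slightly):

/*
 * Sketch: called with blkg->stats_lock held when a request is added.
 * The group that is currently being served should not start
 * accumulating group_wait_time for itself.
 */
static void blkio_set_start_group_wait_time(struct blkio_group *blkg,
					struct blkio_group *curr_blkg)
{
	struct blkio_group_stats *stats = &blkg->stats;

	if (blkio_blkg_waiting(stats))
		return; /* already waiting; keep the original start time */
	if (blkg == curr_blkg)
		return; /* this is the group being served right now */
	stats->start_group_wait_time = sched_clock();
	blkio_mark_blkg_waiting(stats);
}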

-Divyesh


* Re: [PATCH 3/3] blkio: Add more debug-only per-cgroup stats
  2010-04-10  0:09     ` Divyesh Shah
@ 2010-04-12 14:04       ` Vivek Goyal
  2010-04-12 18:37         ` Divyesh Shah
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2010-04-12 14:04 UTC (permalink / raw)
  To: Divyesh Shah; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Fri, Apr 09, 2010 at 05:09:07PM -0700, Divyesh Shah wrote:
> On Fri, Apr 9, 2010 at 12:21 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Apr 08, 2010 at 09:15:35PM -0700, Divyesh Shah wrote:
> >> 1) group_wait_time - This is the amount of time the cgroup had to wait to get a
> >>   timeslice for one of its queues from when it became busy, i.e., went from 0
> >>   to 1 request queued. This is different from the io_wait_time which is the
> >>   cumulative total of the amount of time spent by each IO in that cgroup waiting
> >>   in the scheduler queue. This stat is a great way to find out any jobs in the
> >>   fleet that are being starved or waiting for longer than what is expected (due
> >>   to an IO controller bug or any other issue).
> >> 2) empty_time - This is the amount of time a cgroup spends w/o any pending
> >>    requests. This stat is useful when a job does not seem to be able to use its
> >>    assigned disk share by helping check if that is happening due to an IO
> >>    controller bug or because the job is not submitting enough IOs.
> >
> > Divyesh,
> >
> > This empty_time looks like a redundant stat. One can just look at
> > group_wait_time. If group_wait_time is not increasing, then the application
> > in the cgroup is not doing any IO. If it is increasing then one can look
> > at so many other stats like blkio.time, blkio.sectors etc. to find out
> > if the cgroup is making any progress or not.
> 
> Vivek,
>         group_wait_time might not be increasing even when the cgroup
> is dominating the disk by sending tons of IO all the time, so you
> can't conclude that the cgroup is not doing any IO.

If group is dominating the disk, then blkio.time and blkio.sectors and
other stats are increasing continuously and we know group is being served
which in turn implies that application is sending tons of IO.

> Even when it is
> increasing, none of the other stats give what we are looking for
> with this stat, which is how busy the application is really able to
> keep its IO queues over a given period of time.

Will blkio.avg_queue_size not tell you how an application is keeping
a group busy over a period of time?

When group_wait_time is increasing (that means the application is doing IO
but the group has not yet been scheduled in), won't blkio.io_queued give
you a good idea of how busy a group is?

If due to IO controller bug, group is not being scheduled, then
avg_queue_size will not have enough samples, but blkio.io_queued will
still be increasing or will be a big number and will tell you that
group is busy but is not making any progress.

But I do realize that the rest of the stats don't exactly tell how long the
group was empty. So let's keep it, as you seem to be finding it useful. Also,
avg_queue_size does not account for time when you are not doing any IO.

[..]
> >>  void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
> >>  {
> >>       unsigned long flags;
> >> @@ -116,9 +186,14 @@ void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
> >>                       stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
> >>                       stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
> >>       stats->avg_queue_size_samples++;
> >> +     blkio_update_group_wait_time(stats);
> >
> > Will cfq_choose_cfqg() be a better place to call this function? Why should
> > we call it when an active queue is set? Instead, when a group is chosen
> > to dispatch requests from, we can update the wait time.
> 
> Does that cover pre-emption cases too?
> 

That's a good point. In preemption path we don't use cfq_choose_cfqg(). So
it is fine.

[..]
> >>  struct blkio_cgroup {
> >>       struct cgroup_subsys_state css;
> >>       unsigned int weight;
> >> @@ -74,6 +84,21 @@ struct blkio_group_stats {
> >>       uint64_t avg_queue_size_samples;
> >>       /* How many times this group has been removed from service tree */
> >>       unsigned long dequeue;
> >> +
> >> +     /* Total time spent waiting for it to be assigned a timeslice. */
> >> +     uint64_t group_wait_time;
> >> +     uint64_t start_group_wait_time;
> >> +
> >> +     /* Time spent idling for this blkio_group */
> >> +     uint64_t idle_time;
> >> +     uint64_t start_idle_time;
> >> +     /*
> >> +      * Total time when we have requests queued and do not contain the
> >> +      * current active queue.
> >> +      */
> >> +     uint64_t empty_time;
> >> +     uint64_t start_empty_time;
> >> +     uint16_t flags;
> >
> > Does this flags field have to be inside blkio_group_stats? Maybe a better
> > place is inside blkio_group, as it represents the blkio_group status.
> > That's a different thing; as of today all three flags are helping out
> > only for stats purposes, but in future we could add more flags?
> 
> I agree with you in principle. Would it make sense to move this to
> blkg when we have a use-case for it beyond stats?
> 

Ok, that's fine with me.

[..]
> >>  static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> >> -                             struct cfq_queue *cfqq)
> >> +                             struct cfq_queue *cfqq, bool forced)
> >>  {
> >>       struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >>       unsigned int used_sl, charge_sl;
> >> @@ -916,6 +916,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> >>       cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
> >>                                       st->min_vdisktime);
> >>       blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
> >> +     blkiocg_set_start_empty_time(&cfqg->blkg, forced);
> >
> > Why does this need to be called from CFQ? Can't we just call this
> > from blkiocg_update_request_remove_stats() internally? So when we remove
> > a request, we update the stats in blkio. At that time blkio can check if
> > the group became empty and start the clock?
> 
> From Documentation/cgroups/blkio-controller.txt:
> This is the amount of time a cgroup spends without any pending
> requests when not being served, i.e., it does not include any time
> spent idling for one of the queues of the cgroup.
> 
> Calling this from remove_stats will add the idle time in here too.

Ok, so to get a real sense of how long a group was empty, I need to add up
empty_time and idle_time.

Thanks
Vivek


* Re: [PATCH 3/3] blkio: Add more debug-only per-cgroup stats
  2010-04-12 14:04       ` Vivek Goyal
@ 2010-04-12 18:37         ` Divyesh Shah
  0 siblings, 0 replies; 12+ messages in thread
From: Divyesh Shah @ 2010-04-12 18:37 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: jens.axboe, linux-kernel, nauman, ctalbott

On Mon, Apr 12, 2010 at 7:04 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Apr 09, 2010 at 05:09:07PM -0700, Divyesh Shah wrote:
>> On Fri, Apr 9, 2010 at 12:21 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Thu, Apr 08, 2010 at 09:15:35PM -0700, Divyesh Shah wrote:
>> >> 1) group_wait_time - This is the amount of time the cgroup had to wait to get a
>> >>   timeslice for one of its queues from when it became busy, i.e., went from 0
>> >>   to 1 request queued. This is different from the io_wait_time which is the
>> >>   cumulative total of the amount of time spent by each IO in that cgroup waiting
>> >>   in the scheduler queue. This stat is a great way to find out any jobs in the
>> >>   fleet that are being starved or waiting for longer than what is expected (due
>> >>   to an IO controller bug or any other issue).
>> >> 2) empty_time - This is the amount of time a cgroup spends w/o any pending
>> >>    requests. This stat is useful when a job does not seem to be able to use its
>> >>    assigned disk share by helping check if that is happening due to an IO
>> >>    controller bug or because the job is not submitting enough IOs.
>> >
>> > Divyesh,
>> >
>> > This empty_time looks like a redundant stat. One can just look at
>> > group_wait_time. If group_wait_time is not increasing, then the application
>> > in the cgroup is not doing any IO. If it is increasing then one can look
>> > at so many other stats like blkio.time, blkio.sectors etc. to find out
>> > if the cgroup is making any progress or not.
>>
>> Vivek,
>>         group_wait_time might not be increasing even when the cgroup
>> is dominating the disk by sending tons of IO all the time, so you
>> can't conclude that the cgroup is not doing any IO.
>
> If group is dominating the disk, then blkio.time and blkio.sectors and
> other stats are increasing continuously and we know group is being served
> which in turn implies that application is sending tons of IO.
>
>> Even when it is
>> increasing, none of the other stats give what we are looking for
>> with this stat, which is how busy the application is really able to
>> keep its IO queues over a given period of time.
>
> will blkio.avg_queue_size not tell you how an application is keeping
> a group busy over a period of time?

It's an indicator of how deeply the application is able to drive its IO
queues, but samples are taken only at timeslice start so it doesn't
cover the empty case.
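
(For reference, the exported value is just the running sum divided by the
number of samples taken at slice start -- a sketch of the read side below;
the sum field name is from memory and may differ slightly:)

	if (type == BLKIO_STAT_AVG_QUEUE_SIZE) {
		uint64_t sum = blkg->stats.avg_queue_size_sum;
		uint64_t samples = blkg->stats.avg_queue_size_samples;

		if (samples)
			sum = div64_u64(sum, samples);
		else
			sum = 0;
		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, sum, cb, dev);
	}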

> When group_wait_time is increasing (that means the application is doing IO
> but the group has not yet been scheduled in), won't blkio.io_queued give
> you a good idea of how busy a group is?
It will tell you how deep the IO queue is.

>
> If due to IO controller bug, group is not being scheduled, then
> avg_queue_size will not have enough samples, but blkio.io_queued will
> still be increasing or will be a big number and will tell you that
> group is busy but is not making any progress.
>
> But I do realize that the rest of the stats don't exactly tell how long the
> group was empty. So let's keep it, as you seem to be finding it useful. Also,
> avg_queue_size does not account for time when you are not doing any IO.

Thanks! Yes, you're right. I wanted to avoid any periodic timer-based
accounting for all cgroups and only do event-based accounting.

>
> [..]
>> >>  void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>> >>  {
>> >>       unsigned long flags;
>> >> @@ -116,9 +186,14 @@ void blkiocg_update_set_active_queue_stats(struct blkio_group *blkg)
>> >>                       stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_READ] +
>> >>                       stats->stat_arr[BLKIO_STAT_QUEUED][BLKIO_STAT_WRITE];
>> >>       stats->avg_queue_size_samples++;
>> >> +     blkio_update_group_wait_time(stats);
>> >
>> > Will cfq_choose_cfqg() be a better place to call this function? Why should
>> > we call it when an active queue is set? Instead, when a group is chosen
>> > to dispatch requests from, we can update the wait time.
>>
>> Does that cover pre-emption cases too?
>>
>
> That's a good point. In preemption path we don't use cfq_choose_cfqg(). So
> it is fine.
>
> [..]
>> >>  struct blkio_cgroup {
>> >>       struct cgroup_subsys_state css;
>> >>       unsigned int weight;
>> >> @@ -74,6 +84,21 @@ struct blkio_group_stats {
>> >>       uint64_t avg_queue_size_samples;
>> >>       /* How many times this group has been removed from service tree */
>> >>       unsigned long dequeue;
>> >> +
>> >> +     /* Total time spent waiting for it to be assigned a timeslice. */
>> >> +     uint64_t group_wait_time;
>> >> +     uint64_t start_group_wait_time;
>> >> +
>> >> +     /* Time spent idling for this blkio_group */
>> >> +     uint64_t idle_time;
>> >> +     uint64_t start_idle_time;
>> >> +     /*
>> >> +      * Total time when we have requests queued and do not contain the
>> >> +      * current active queue.
>> >> +      */
>> >> +     uint64_t empty_time;
>> >> +     uint64_t start_empty_time;
>> >> +     uint16_t flags;
>> >
>> > Does this flags field have to be inside blkio_group_stats? Maybe a better
>> > place is inside blkio_group, as it represents the blkio_group status.
>> > That's a different thing; as of today all three flags are helping out
>> > only for stats purposes, but in future we could add more flags?
>>
>> I agree with you in principle. Would it make sense to move this to
>> blkg when we have a use-case for it beyond stats?
>>
>
> Ok, that's fine with me.
>
> [..]
>> >>  static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> >> -                             struct cfq_queue *cfqq)
>> >> +                             struct cfq_queue *cfqq, bool forced)
>> >>  {
>> >>       struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> >>       unsigned int used_sl, charge_sl;
>> >> @@ -916,6 +916,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> >>       cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
>> >>                                       st->min_vdisktime);
>> >>       blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
>> >> +     blkiocg_set_start_empty_time(&cfqg->blkg, forced);
>> >
>> > Why does this need to be called from CFQ? Can't we just call this
>> > from blkiocg_update_request_remove_stats() internally? So when we remove
>> > a request, we update the stats in blkio. At that time blkio can check if
>> > the group became empty and start the clock?
>>
>> From Documentation/cgroups/blkio-controller.txt:
>> This is the amount of time a cgroup spends without any pending
>> requests when not being served, i.e., it does not include any time
>> spent idling for one of the queues of the cgroup.
>>
>> Calling this from remove_stats will add the idle time in here too.
>
> Ok, so to get a real sense of how long a group was empty, I need to add up
> empty_time and idle_time.

Yes that is correct.
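
As a made-up example: if blkio.empty_time grows by 400 ms and blkio.idle_time
grows by 100 ms over some window, the group had no requests pending for
roughly 400 + 100 = 500 ms of that window.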

>
> Thanks
> Vivek
>


end of thread

Thread overview: 12+ messages
2010-04-09  4:13 [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Divyesh Shah
2010-04-09  4:14 ` [PATCH 1/3] blkio: Add io_merged stat Divyesh Shah
2010-04-09  4:15 ` [PATCH 2/3] blkio: Add io_queued and avg_queue_size stats Divyesh Shah
2010-04-09 15:32   ` Vivek Goyal
2010-04-09 20:44     ` Divyesh Shah
2010-04-10  2:34     ` Divyesh Shah
2010-04-09  4:15 ` [PATCH 3/3] blkio: Add more debug-only per-cgroup stats Divyesh Shah
2010-04-09 19:21   ` Vivek Goyal
2010-04-10  0:09     ` Divyesh Shah
2010-04-12 14:04       ` Vivek Goyal
2010-04-12 18:37         ` Divyesh Shah
2010-04-09  6:36 ` [PATCH 0/3] blkio: More IO controller stats - always-on and debug-only Jens Axboe
