linux-block.vger.kernel.org archive mirror
* [RFC 0/3] blkcg: add blk-iotrack
@ 2020-03-21  1:20 Weiping Zhang
  2020-03-21  1:21 ` [RFC 1/3] update the real issue size when bio_split Weiping Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Weiping Zhang @ 2020-03-21  1:20 UTC (permalink / raw)
  To: axboe, tj; +Cc: linux-block, cgroups

Hi all,

This patchset adds a monitor-only module, blk-iotrack, for the block
cgroup.

It contains the kernel-space blk-iotrack and the user-space tool
iotrack, and you can also write your own tool to do more data analysis.

blk-iotrack is designed to track various io statistics of a block cgroup;
it is based on the rq_qos framework. It only tracks io and does not do
any throttling.

Compared to blk-iolatency, it provides 8 configurable latency buckets via
/sys/fs/cgroup/io.iotrack.lat_thresh; blk-iotrack accounts the number of
IOs whose latency is less than the corresponding threshold. In this way
we can get the cgroup's latency distribution. The default latency buckets
are 50us, 100us, 200us, 400us, 1ms, 2ms, 4ms, 8ms.
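
For illustration, a minimal sketch of configuring the buckets from user
space; the cgroup path and the 259:0 device number are just examples, and
the "MAJ:MIN t0 .. t7" format with strictly increasing nanosecond values
follows the parser in patch 3:

#include <stdio.h>

/* Illustrative sketch only: write 8 strictly increasing nanosecond
 * thresholds for device 259:0 to one cgroup's io.iotrack.lat_thresh.
 * The cgroup path and device number are assumptions for the example.
 */
int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/test1/io.iotrack.lat_thresh", "w");

	if (!f)
		return 1;
	fprintf(f, "259:0 25000 50000 100000 250000 500000 1000000 2000000 4000000\n");
	return fclose(f) != 0;
}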

Compared to io.stat.{rbytes,wbytes,rios,wios,dbytes,dios}, it accounts
IOs when they complete instead of when they are submitted. If an IO is
throttled by the io scheduler or another throttling policy, there is a
gap: these IOs have been submitted but not yet completed.

The previous patch records a timestamp for each bio when it is issued
to the disk driver. Then we can get the disk latency in rq_qos_done_bio;
this is also called the D2C time. In rq_qos_done_bio, blk-iotrack also
records the total latency (now - bio_issue_time), which can be treated
as the Q2C time. In this way, we can get the percentage %d2c = D2C/Q2C
for each cgroup. It is very useful for detecting whether the main
latency comes from the disk or from software, e.g. the io scheduler or
another block cgroup throttling policy.
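
For illustration only, a rough sketch of that math, using the per-cgroup
counters exported by io.iotrack.stat in patch 3 (rtms/wtms/otms hold the
accumulated Q2C time, rdtms/wdtms/odtms the accumulated D2C time); this
is just a sketch, not part of the iotrack tool itself:

#include <stdint.h>

/* Illustrative sketch: %d2c for one cgroup, i.e. the share of the total
 * (Q2C) latency that was spent on the device (D2C).
 */
static double pct_d2c(const uint64_t tms[3], const uint64_t dtms[3])
{
	uint64_t q2c = tms[0] + tms[1] + tms[2];	/* read + write + other */
	uint64_t d2c = dtms[0] + dtms[1] + dtms[2];

	return q2c ? 100.0 * (double)d2c / (double)q2c : 0.0;
}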

The user space tool, called iotrack, collects these basic io statistics
and then generates more valuable metrics at the cgroup level.
From iotrack, you can get a cgroup's percentage of ios, bytes,
total_time and disk_time relative to the whole disk. That makes it easy
to evaluate the real weight under a weight-based policy (bfq, blk-iocost).
iotrack generates many more metrics for reads and writes;
for more details, please visit: https://github.com/dublio/iotrack.

Test result for two fio jobs doing 4K randread:
test1 cgroup bfq weight = 800
test2 cgroup bfq weight = 100

Device      io/s   MB/s    %io    %MB    %tm   %dtm  %d2c %hit0 %hit1 %hit2 %hit3 %hit4 %hit5  %hit6  %hit7 cgroup
nvme1n1 44588.00 174.17 100.00 100.00 100.00 100.00 38.46  0.25 45.27 95.90 98.33 99.47 99.85  99.92  99.95 /
nvme1n1 30206.00 117.99  67.74  67.74  29.44  67.29 87.90  0.35 47.82 99.22 99.98 99.99 99.99 100.00 100.00 /test1
nvme1n1 14370.00  56.13  32.23  32.23  70.55  32.69 17.82  0.03 39.89 88.92 94.88 98.37 99.53  99.77  99.85 /test2

* The root block cgroup "/" shows the io statistics for the whole ssd disk.

* test1 uses 67% of the disk's iops and bps.

* %dtm stands for on-disk time; the test1 cgroup gets 67% of the whole
	disk's time, which means test1 gets more disk time than test2.

* For test2's %d2c, only 17% of the latency is spent on the hardware
	disk, which means the main latency comes from software; it was
	throttled by software.


Patch 1 and patch 2 are preparation patches.
The last patch implements blk-iotrack.

Weiping Zhang (3):
  update the real issue size when bio_split
  bio: track timestamp of submitting bio to the disk driver
  blkcg: add blk-iotrack

 block/Kconfig              |   6 +
 block/Makefile             |   1 +
 block/bio.c                |  13 ++
 block/blk-cgroup.c         |   4 +
 block/blk-iotrack.c        | 436 +++++++++++++++++++++++++++++++++++++
 block/blk-mq.c             |   3 +
 block/blk-rq-qos.h         |   3 +
 block/blk.h                |   7 +
 include/linux/blk-cgroup.h |   6 +
 include/linux/blk_types.h  |  38 ++++
 10 files changed, 517 insertions(+)
 create mode 100644 block/blk-iotrack.c

-- 
2.18.1



* [RFC 1/3] update the real issue size when bio_split
  2020-03-21  1:20 [RFC 0/3] blkcg: add blk-iotrack Weiping Zhang
@ 2020-03-21  1:21 ` Weiping Zhang
  2020-03-21  1:21 ` [RFC 2/3] bio: track timestamp of submitting bio to the disk driver Weiping Zhang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Weiping Zhang @ 2020-03-21  1:21 UTC (permalink / raw)
  To: axboe, tj; +Cc: linux-block, cgroups

split.bi_iter.bi_size is copied from @bio, and bi_issue is initialized
in this flow:
bio_clone_fast->__bio_clone_fast->blkcg_bio_issue_init

So split->bi_issue carries the wrong size; update the size here.

Change-Id: I1f9c8c973ac1d41f4aea17a9a766b4c4d532f642
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 block/bio.c               | 13 +++++++++++++
 include/linux/blk_types.h |  9 +++++++++
 2 files changed, 22 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 0985f3422556..8654c4d692e5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1911,6 +1911,19 @@ struct bio *bio_split(struct bio *bio, int sectors,
 
 	split->bi_iter.bi_size = sectors << 9;
 
+	/*
+	 * Reinit bio->bi_issue: split.bi_iter.bi_size was copied
+	 * from @bio, and bi_issue was initialized in this flow:
+	 * bio_clone_fast->__bio_clone_fast->blkcg_bio_issue_init
+	 *
+	 * So split->bi_issue carries the wrong size; update the size
+	 * here.
+	 *
+	 * Actually, we could just use blkcg_bio_issue_init; the only
+	 * difference would be a slightly different issue_time.
+	 */
+	bio_issue_update_size(&split->bi_issue, bio_sectors(split));
+
 	if (bio_integrity(split))
 		bio_integrity_trim(split);
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..56e41ef3e827 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -128,6 +128,15 @@ static inline sector_t bio_issue_size(struct bio_issue *issue)
 	return ((issue->value & BIO_ISSUE_SIZE_MASK) >> BIO_ISSUE_SIZE_SHIFT);
 }
 
+static inline void bio_issue_update_size(struct bio_issue *issue, sector_t size)
+{
+	size &= (1ULL << BIO_ISSUE_SIZE_BITS) - 1;
+	/* set all _issue_size bits to 1 */
+	issue->value |= (u64)BIO_ISSUE_SIZE_MASK;
+	/* set new size */
+	issue->value &= ((u64)size << BIO_ISSUE_SIZE_SHIFT);
+}
+
 static inline void bio_issue_init(struct bio_issue *issue,
 				       sector_t size)
 {
-- 
2.18.1



* [RFC 2/3] bio: track timestamp of submitting bio to the disk driver
  2020-03-21  1:20 [RFC 0/3] blkcg: add blk-iotrack Weiping Zhang
  2020-03-21  1:21 ` [RFC 1/3] update the real issue size when bio_split Weiping Zhang
@ 2020-03-21  1:21 ` Weiping Zhang
  2020-03-21  1:21 ` [RFC 3/3] blkcg: add blk-iotrack Weiping Zhang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Weiping Zhang @ 2020-03-21  1:21 UTC (permalink / raw)
  To: axboe, tj; +Cc: linux-block, cgroups

Record a timestamp in the new bio->bi_start field when the request is
started by the disk driver (blk_mq_start_request), so that the D2C
(disk) latency can be derived when the bio completes.

Change-Id: Ibb9caf20616f83e111113ab5c824c05930c0e523
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 block/blk-mq.c             |  3 +++
 include/linux/blk-cgroup.h |  6 ++++++
 include/linux/blk_types.h  | 29 +++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b2e6550e0b6..53db008ac8d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -652,6 +652,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct bio *bio;
 
 	trace_block_rq_issue(q, rq);
 
@@ -660,6 +661,8 @@ void blk_mq_start_request(struct request *rq)
 		rq->stats_sectors = blk_rq_sectors(rq);
 		rq->rq_flags |= RQF_STATS;
 		rq_qos_issue(q, rq);
+		__rq_for_each_bio(bio, rq)
+			blkcg_bio_start_init(bio);
 	}
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index e4a6949fd171..9720f04a9523 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -579,6 +579,11 @@ static inline void blkcg_bio_issue_init(struct bio *bio)
 	bio_issue_init(&bio->bi_issue, bio_sectors(bio));
 }
 
+static inline void blkcg_bio_start_init(struct bio *bio)
+{
+	bio_start_init(&bio->bi_start);
+}
+
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio)
 {
@@ -738,6 +743,7 @@ static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
 
 static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; }
+static inline void blkcg_bio_start_init(struct bio *bio) { }
 static inline void blkcg_bio_issue_init(struct bio *bio) { }
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio) { return true; }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 56e41ef3e827..30dc3a73235f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -109,10 +109,38 @@ static inline bool blk_path_error(blk_status_t error)
 /* Reserved bit for blk-throtl */
 #define BIO_ISSUE_THROTL_SKIP_LATENCY (1ULL << 63)
 
+/* submit bio to block layer */
 struct bio_issue {
 	u64 value;
 };
 
+/*
+ * submit bio to the disk driver layer
+ *
+ * 63:51	reserved
+ * 50:0		bits: start time of bio
+ *
+ * same bitmask as bi_issue
+ */
+struct bio_start {
+	u64 value;
+};
+
+static inline u64 __bio_start_time(u64 time)
+{
+	return time & BIO_ISSUE_TIME_MASK;
+}
+
+static inline u64 bio_start_time(struct bio_start *start)
+{
+	return __bio_start_time(start->value);
+}
+
+static inline void bio_start_init(struct bio_start *start)
+{
+	start->value = ktime_get_ns() & BIO_ISSUE_TIME_MASK;
+}
+
 static inline u64 __bio_issue_time(u64 time)
 {
 	return time & BIO_ISSUE_TIME_MASK;
@@ -178,6 +206,7 @@ struct bio {
 	 */
 	struct blkcg_gq		*bi_blkg;
 	struct bio_issue	bi_issue;
+	struct bio_start	bi_start;
 #ifdef CONFIG_BLK_CGROUP_IOCOST
 	u64			bi_iocost_cost;
 #endif
-- 
2.18.1



* [RFC 3/3] blkcg: add blk-iotrack
  2020-03-21  1:20 [RFC 0/3] blkcg: add blk-iotrack Weiping Zhang
  2020-03-21  1:21 ` [RFC 1/3] update the real issue size when bio_split Weiping Zhang
  2020-03-21  1:21 ` [RFC 2/3] bio: track timestamp of submitting bio to the disk driver Weiping Zhang
@ 2020-03-21  1:21 ` Weiping Zhang
  2020-03-24 18:27 ` [RFC 0/3] " Tejun Heo
  2020-03-27  6:27 ` [RFC PATCH v2 " Weiping Zhang
  4 siblings, 0 replies; 13+ messages in thread
From: Weiping Zhang @ 2020-03-21  1:21 UTC (permalink / raw)
  To: axboe, tj; +Cc: linux-block, cgroups

blk-iotrack is designed to track various io statistics of a block cgroup;
it is based on the rq_qos framework. It only tracks io and does not do
any throttling.

Compared to blk-iolatency, it provides 8 configurable latency buckets via
/sys/fs/cgroup/io.iotrack.lat_thresh; blk-iotrack accounts the number of
IOs whose latency is less than the corresponding threshold. In this way
we can get the cgroup's latency distribution. The default latency buckets
are 50us, 100us, 200us, 400us, 1ms, 2ms, 4ms, 8ms.

Compared to io.stat.{rbytes,wbytes,rios,wios,dbytes,dios}, it accounts
IOs when they complete instead of when they are submitted. If an IO is
throttled by the io scheduler or another throttling policy, there is a
gap: these IOs have been submitted but not yet completed.

The previous patch records a timestamp for each bio when it is issued
to the disk driver. Then we can get the disk latency in rq_qos_done_bio;
this is also called the D2C time. In rq_qos_done_bio, blk-iotrack also
records the total latency (now - bio_issue_time), which can be treated
as the Q2C time. In this way, we can get the percentage %d2c = D2C/Q2C
for each cgroup. It is very useful for detecting whether the main
latency comes from the disk or from software, e.g. the io scheduler or
another block cgroup throttling policy.

The user space tool, called iotrack, collects these basic io statistics
and then generates more valuable metrics at the cgroup level.
From iotrack, you can get a cgroup's percentage of ios, bytes,
total_time and disk_time relative to the whole disk. That makes it easy
to evaluate the real weight under a weight-based policy (bfq, blk-iocost).
For more details, please visit: https://github.com/dublio/iotrack.

Change-Id: I17b12b309709eb3eca3b3ff75a1f636981c70ce5
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
---
 block/Kconfig       |   6 +
 block/Makefile      |   1 +
 block/blk-cgroup.c  |   4 +
 block/blk-iotrack.c | 436 ++++++++++++++++++++++++++++++++++++++++++++
 block/blk-rq-qos.h  |   3 +
 block/blk.h         |   7 +
 6 files changed, 457 insertions(+)
 create mode 100644 block/blk-iotrack.c

diff --git a/block/Kconfig b/block/Kconfig
index 3bc76bb113a0..d3073e4b048f 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -132,6 +132,12 @@ config BLK_WBT
 	dynamically on an algorithm loosely based on CoDel, factoring in
 	the realtime performance of the disk.
 
+config BLK_CGROUP_IOTRACK
+	bool "Enable support for tracking io latency buckets"
+	depends on BLK_CGROUP=y
+	---help---
+	Track the io latency of block cgroups in several latency buckets.
+
 config BLK_CGROUP_IOLATENCY
 	bool "Enable support for latency based cgroup IO protection"
 	depends on BLK_CGROUP=y
diff --git a/block/Makefile b/block/Makefile
index 1a43750f4b01..b0fc4d6f3cda 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
 obj-$(CONFIG_BLK_CGROUP_RWSTAT)	+= blk-cgroup-rwstat.o
 obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
+obj-$(CONFIG_BLK_CGROUP_IOTRACK)	+= blk-iotrack.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a229b94d5390..85825f663a53 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1045,6 +1045,10 @@ int blkcg_init_queue(struct request_queue *q)
 	if (ret)
 		goto err_destroy_all;
 
+	ret = blk_iotrack_init(q);
+	if (ret)
+		goto err_destroy_all;
+
 	ret = blk_throtl_init(q);
 	if (ret)
 		goto err_destroy_all;
diff --git a/block/blk-iotrack.c b/block/blk-iotrack.c
new file mode 100644
index 000000000000..e6b783fa668d
--- /dev/null
+++ b/block/blk-iotrack.c
@@ -0,0 +1,436 @@
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/backing-dev.h>
+#include <linux/module.h>
+#include <linux/timer.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/loadavg.h>
+#include <linux/sched/signal.h>
+#include <trace/events/block.h>
+#include <linux/blk-mq.h>
+#include "blk-rq-qos.h"
+#include "blk-stat.h"
+
+
+static struct blkcg_policy blkcg_policy_iotrack;
+
+struct blk_iotrack {
+	struct rq_qos rqos;
+};
+
+
+#define LAT_BUCKET_NR 8
+/* default latency bucket(ns) */
+uint64_t def_latb_thresh[LAT_BUCKET_NR] = {
+	50000,		/* 50 us */
+	100000,		/* 100 us */
+	200000,		/* 200 us */
+	400000,		/* 400 us */
+	1000000,	/* 1 ms */
+	2000000,	/* 2 ms */
+	4000000,	/* 4 ms */
+	8000000,	/* 8 ms */
+};
+
+enum {
+	IOT_READ,
+	IOT_WRITE,
+	IOT_OTHER,
+	IOT_NR,
+};
+
+struct iotrack_stat {
+	struct blk_rq_stat  rqs;
+	uint64_t ios[IOT_NR];
+	uint64_t sts[IOT_NR];
+	uint64_t tms[IOT_NR];
+	uint64_t dtms[IOT_NR];
+	uint64_t hit[IOT_NR][LAT_BUCKET_NR];
+};
+
+struct iotrack_grp {
+	struct blkg_policy_data pd;
+	struct iotrack_stat __percpu *stat_pcpu;
+	uint64_t thresh_ns[LAT_BUCKET_NR];
+	struct iotrack_stat stat;
+};
+
+static inline struct blk_iotrack *BLKIOTIME(struct rq_qos *rqos)
+{
+	return container_of(rqos, struct blk_iotrack, rqos);
+}
+
+static inline struct iotrack_grp *pd_to_iot(struct blkg_policy_data *pd)
+{
+	return pd ? container_of(pd, struct iotrack_grp, pd) : NULL;
+}
+
+static inline struct iotrack_grp *blkg_to_iot(struct blkcg_gq *blkg)
+{
+	return pd_to_iot(blkg_to_pd(blkg, &blkcg_policy_iotrack));
+}
+
+static inline struct blkcg_gq *iot_to_blkg(struct iotrack_grp *iot)
+{
+	return pd_to_blkg(&iot->pd);
+}
+
+static struct blkg_policy_data *iotrack_pd_alloc(gfp_t gfp,
+			struct request_queue *q, struct blkcg *blkcg)
+{
+	struct iotrack_grp *iot;
+
+	iot = kzalloc_node(sizeof(*iot), gfp, q->node);
+	if (!iot)
+		return NULL;
+
+	iot->stat_pcpu = __alloc_percpu_gfp(sizeof(struct iotrack_stat),
+				__alignof__(struct iotrack_stat), gfp);
+	if (!iot->stat_pcpu) {
+		kfree(iot);
+		return NULL;
+	}
+
+	return &iot->pd;
+}
+
+static void iotrack_pd_init(struct blkg_policy_data *pd)
+{
+	struct iotrack_grp *iot = pd_to_iot(pd);
+	int i, j, cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct iotrack_stat *stat;
+		stat = per_cpu_ptr(iot->stat_pcpu, cpu);
+		blk_rq_stat_init(&stat->rqs);
+		for (i = 0; i < IOT_NR; i++) {
+			stat->ios[i] = stat->sts[i] = 0;
+			stat->tms[i] = stat->dtms[i] = 0;
+			for (j = 0; j < LAT_BUCKET_NR; j++)
+				stat->hit[i][j] = 0;
+		}
+	}
+
+	blk_rq_stat_init(&iot->stat.rqs);
+	for (i = 0; i < IOT_NR; i++) {
+		iot->stat.ios[i] = iot->stat.sts[i] = 0;
+		iot->stat.tms[i] = iot->stat.dtms[i] = 0;
+		for (j = 0; j < LAT_BUCKET_NR; j++)
+			iot->stat.hit[i][j] = 0;
+	}
+
+	for (i = 0; i < LAT_BUCKET_NR; i++)
+		iot->thresh_ns[i] = def_latb_thresh[i];
+}
+
+static void iotrack_pd_offline(struct blkg_policy_data *pd)
+{
+}
+
+static void iotrack_pd_free(struct blkg_policy_data *pd)
+{
+	struct iotrack_grp *iot = pd_to_iot(pd);
+
+	free_percpu(iot->stat_pcpu);
+	kfree(iot);
+}
+
+static u64 iotrack_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd,
+			       int off)
+{
+	struct iotrack_grp *iot = pd_to_iot(pd);
+	struct iotrack_stat *stat = &iot->stat;
+	struct blk_rq_stat *rqs = &stat->rqs;
+	const char *dname = blkg_dev_name(pd->blkg);
+	int cpu, i, j;
+
+	if (!dname)
+		return 0;
+
+	/* collect per cpu data */
+	preempt_disable();
+	for_each_online_cpu(cpu) {
+		struct iotrack_stat* s;
+		s = per_cpu_ptr(iot->stat_pcpu, cpu);
+		blk_rq_stat_sum(rqs, &s->rqs);
+		blk_rq_stat_init(&s->rqs);
+		for (i = 0; i < IOT_NR; i++) {
+			stat->ios[i] += s->ios[i];
+			stat->sts[i] += s->sts[i];
+			stat->tms[i] += s->tms[i];
+			stat->dtms[i] += s->dtms[i];
+			s->ios[i] = 0;
+			s->sts[i] = 0;
+			s->tms[i] = 0;
+			s->dtms[i] = 0;
+			for (j = 0; j < LAT_BUCKET_NR; j++) {
+				stat->hit[i][j] += s->hit[i][j];
+				s->hit[i][j] = 0;
+			}
+		}
+	}
+	preempt_enable();
+
+	seq_printf(sf, "%s mean: %llu min: %llu max: %llu sum: %llu "
+			"rios: %llu wios: %llu oios:%llu "
+			"rsts: %llu wsts: %llu osts: %llu "
+			"rtms: %llu wtms: %llu otms: %llu "
+			"rdtms: %llu wdtms: %llu odtms: %llu",
+		dname, rqs->mean, rqs->min, rqs->max, rqs->batch,
+		stat->ios[IOT_READ], stat->ios[IOT_WRITE], stat->ios[IOT_OTHER],
+		stat->sts[IOT_READ], stat->sts[IOT_WRITE], stat->sts[IOT_OTHER],
+		stat->tms[IOT_READ], stat->tms[IOT_WRITE], stat->tms[IOT_OTHER],
+		stat->dtms[IOT_READ], stat->dtms[IOT_WRITE], stat->dtms[IOT_OTHER]);
+
+	/* read hit */
+	seq_printf(sf, " rhit:");
+	for (i = 0; i < LAT_BUCKET_NR; i++)
+		seq_printf(sf, " %llu",  stat->hit[IOT_READ][i]);
+
+	/* write hit */
+	seq_printf(sf, " whit:");
+	for (i = 0; i < LAT_BUCKET_NR; i++)
+		seq_printf(sf, " %llu",  stat->hit[IOT_WRITE][i]);
+
+	/* other hit */
+	seq_printf(sf, " ohit:");
+	for (i = 0; i < LAT_BUCKET_NR; i++)
+		seq_printf(sf, " %llu",  stat->hit[IOT_OTHER][i]);
+
+	seq_printf(sf, "\n");
+
+	return 0;
+}
+
+static int iotrack_print_stat(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), iotrack_prfill_stat,
+			  &blkcg_policy_iotrack, seq_cft(sf)->private, false);
+	return 0;
+}
+
+static u64 iotrack_prfill_lat_thresh(struct seq_file *sf,
+			struct blkg_policy_data *pd, int off)
+{
+	struct iotrack_grp *iot = pd_to_iot(pd);
+	const char *dname = blkg_dev_name(pd->blkg);
+	int i;
+
+	if (!dname)
+		return 0;
+
+	seq_printf(sf, "%s", dname);
+	for (i = 0; i < LAT_BUCKET_NR; i++)
+		seq_printf(sf, " %llu",  iot->thresh_ns[i]);
+
+	seq_printf(sf, "\n");
+
+	return 0;
+}
+
+static int iotrack_print_lat_thresh(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+		iotrack_prfill_lat_thresh, &blkcg_policy_iotrack,
+		seq_cft(sf)->private, false);
+	return 0;
+}
+
+static ssize_t iotrack_set_lat_thresh(struct kernfs_open_file *of, char *buf,
+			     size_t nbytes, loff_t off)
+{
+	struct blkcg *blkcg = css_to_blkcg(of_css(of));
+	struct blkg_conf_ctx ctx;
+	struct iotrack_grp *iot;
+	uint64_t tmp[LAT_BUCKET_NR];
+	int i, ret;
+	char *p;
+
+	ret = blkg_conf_prep(blkcg, &blkcg_policy_iotrack, buf, &ctx);
+	if (ret)
+		return ret;
+
+	iot = blkg_to_iot(ctx.blkg);
+	p = ctx.body;
+
+	ret = -EINVAL;
+	if (LAT_BUCKET_NR != sscanf(p, "%llu %llu %llu %llu %llu %llu %llu %llu",
+			&tmp[0], &tmp[1], &tmp[2], &tmp[3],
+			&tmp[4], &tmp[5], &tmp[6], &tmp[7]))
+		goto out;
+
+	/* make sure threshold in order */
+	for (i = 0; i < LAT_BUCKET_NR - 1; i++) {
+		if (tmp[i] >= tmp[i + 1])
+			goto out;
+	}
+
+	/* update threshold for each bucket */
+	for (i = 0; i < LAT_BUCKET_NR; i++)
+		iot->thresh_ns[i] = tmp[i];
+
+	ret = 0;
+out:
+	blkg_conf_finish(&ctx);
+	return ret ?: nbytes;
+}
+
+static struct cftype iotrack_files[] = {
+	{
+		.name = "iotrack.stat",
+		.seq_show = iotrack_print_stat,
+	},
+	{
+		.name = "iotrack.lat_thresh",
+		.seq_show = iotrack_print_lat_thresh,
+		.write = iotrack_set_lat_thresh,
+	},
+	{}
+};
+
+static struct cftype iotrack_def_files[] = {
+	{
+		.name = "iotrack.stat",
+		.seq_show = iotrack_print_stat,
+	},
+	{
+		.name = "iotrack.lat_thresh",
+		.seq_show = iotrack_print_lat_thresh,
+		.write = iotrack_set_lat_thresh,
+	},
+	{}
+};
+
+static struct blkcg_policy blkcg_policy_iotrack = {
+	.dfl_cftypes	= iotrack_def_files,
+	.legacy_cftypes = iotrack_files,
+	.pd_alloc_fn	= iotrack_pd_alloc,
+	.pd_init_fn	= iotrack_pd_init,
+	.pd_offline_fn	= iotrack_pd_offline,
+	.pd_free_fn	= iotrack_pd_free,
+};
+
+static void iotrack_account_bio(struct iotrack_grp *iot, struct bio *bio,
+		u64 now)
+{
+	u64 delta, start = bio_issue_time(&bio->bi_issue);
+	u64 delta_disk, start_disk = bio_start_time(&bio->bi_start);
+	struct iotrack_stat *stat;
+	int i, t;
+
+	now = __bio_issue_time(now);
+
+	if (now <= start)
+		return;
+
+	switch (bio_op(bio)) {
+	case REQ_OP_READ:
+		t = IOT_READ;
+		break;
+	case REQ_OP_WRITE:
+		t = IOT_WRITE;
+		break;
+	default:
+		t = IOT_OTHER;
+		break;
+	}
+
+	delta = now - start;
+	stat = get_cpu_ptr(iot->stat_pcpu);
+	blk_rq_stat_add(&stat->rqs, delta);
+	stat->ios[t]++;
+	stat->sts[t] += (bio_issue_size(&bio->bi_issue));
+	stat->tms[t] += delta;
+	if (start_disk && (start_disk > start) && (now > start_disk))
+		delta_disk = now - start_disk;
+	else
+		delta_disk = 0;
+	stat->dtms[t] += delta_disk;
+	for (i = 0; i < LAT_BUCKET_NR; i++) {
+		if (delta < iot->thresh_ns[i])
+			stat->hit[t][i]++;
+	}
+	put_cpu_ptr(stat);
+}
+
+static void blkcg_iotrack_done_bio(struct rq_qos *rqos, struct bio *bio)
+{
+	struct blkcg_gq *blkg;
+	struct iotrack_grp *iot;
+	u64 now = ktime_to_ns(ktime_get());
+
+	blkg = bio->bi_blkg;
+	if (!blkg)
+		return;
+
+	iot = blkg_to_iot(bio->bi_blkg);
+	if (!iot)
+		return;
+
+	/* account io statistics */
+	while (blkg) {
+		iot = blkg_to_iot(blkg);
+		if (!iot) {
+			blkg = blkg->parent;
+			continue;
+		}
+
+		iotrack_account_bio(iot, bio, now);
+		blkg = blkg->parent;
+	}
+}
+
+static void blkcg_iotrack_exit(struct rq_qos *rqos)
+{
+	struct blk_iotrack *blkiotrack = BLKIOTIME(rqos);
+
+	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iotrack);
+	kfree(blkiotrack);
+}
+
+static struct rq_qos_ops blkcg_iotrack_ops = {
+	.done_bio = blkcg_iotrack_done_bio,
+	.exit = blkcg_iotrack_exit,
+};
+
+int blk_iotrack_init(struct request_queue *q)
+{
+	struct blk_iotrack *blkiotrack;
+	struct rq_qos *rqos;
+	int ret;
+
+	blkiotrack = kzalloc(sizeof(*blkiotrack), GFP_KERNEL);
+	if (!blkiotrack)
+		return -ENOMEM;
+
+	rqos = &blkiotrack->rqos;
+	rqos->id = RQ_QOS_IOTRACK;
+	rqos->ops = &blkcg_iotrack_ops;
+	rqos->q = q;
+
+	rq_qos_add(q, rqos);
+
+	ret = blkcg_activate_policy(q, &blkcg_policy_iotrack);
+	if (ret) {
+		rq_qos_del(q, rqos);
+		kfree(blkiotrack);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int __init iotrack_init(void)
+{
+	return blkcg_policy_register(&blkcg_policy_iotrack);
+}
+
+static void __exit iotrack_exit(void)
+{
+	return blkcg_policy_unregister(&blkcg_policy_iotrack);
+}
+
+module_init(iotrack_init);
+module_exit(iotrack_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("weiping zhang <zhangweiping@didichuxing.com>");
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 2bc43e94f4c4..3066d3afe77d 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -16,6 +16,7 @@ enum rq_qos_id {
 	RQ_QOS_WBT,
 	RQ_QOS_LATENCY,
 	RQ_QOS_COST,
+	RQ_QOS_IOTRACK,
 };
 
 struct rq_wait {
@@ -87,6 +88,8 @@ static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
 		return "latency";
 	case RQ_QOS_COST:
 		return "cost";
+	case RQ_QOS_IOTRACK:
+		return "iotrack";
 	}
 	return "unknown";
 }
diff --git a/block/blk.h b/block/blk.h
index 670337b7cfa0..3520b5ab7971 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -346,6 +346,13 @@ extern int blk_iolatency_init(struct request_queue *q);
 static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
 #endif
 
+#ifdef CONFIG_BLK_CGROUP_IOTRACK
+extern int blk_iotrack_init(struct request_queue *q);
+#else
+static inline int blk_iotrack_init(struct request_queue *q) { return 0; }
+#endif
+
+
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
 
 #ifdef CONFIG_BLK_DEV_ZONED
-- 
2.18.1



* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-21  1:20 [RFC 0/3] blkcg: add blk-iotrack Weiping Zhang
                   ` (2 preceding siblings ...)
  2020-03-21  1:21 ` [RFC 3/3] blkcg: add blk-iotrack Weiping Zhang
@ 2020-03-24 18:27 ` Tejun Heo
  2020-03-25 12:49   ` Weiping Zhang
  2020-03-27  6:27 ` [RFC PATCH v2 " Weiping Zhang
  4 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2020-03-24 18:27 UTC (permalink / raw)
  To: axboe, linux-block, cgroups

On Sat, Mar 21, 2020 at 09:20:36AM +0800, Weiping Zhang wrote:
> The user space tool, called iotrack, collects these basic io statistics
> and then generates more valuable metrics at the cgroup level.
> From iotrack, you can get a cgroup's percentage of ios, bytes,
> total_time and disk_time relative to the whole disk. That makes it easy
> to evaluate the real weight under a weight-based policy (bfq, blk-iocost).
> iotrack generates many more metrics for reads and writes;
> for more details, please visit: https://github.com/dublio/iotrack.
> 
> Test result for two fio jobs doing 4K randread:
> test1 cgroup bfq weight = 800
> test2 cgroup bfq weight = 100
> 
> Device      io/s   MB/s    %io    %MB    %tm   %dtm  %d2c %hit0 %hit1 %hit2 %hit3 %hit4 %hit5  %hit6  %hit7 cgroup
> nvme1n1 44588.00 174.17 100.00 100.00 100.00 100.00 38.46  0.25 45.27 95.90 98.33 99.47 99.85  99.92  99.95 /
> nvme1n1 30206.00 117.99  67.74  67.74  29.44  67.29 87.90  0.35 47.82 99.22 99.98 99.99 99.99 100.00 100.00 /test1
> nvme1n1 14370.00  56.13  32.23  32.23  70.55  32.69 17.82  0.03 39.89 88.92 94.88 98.37 99.53  99.77  99.85 /test2

Maybe this'd be better done with bpf?

-- 
tejun


* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-24 18:27 ` [RFC 0/3] " Tejun Heo
@ 2020-03-25 12:49   ` Weiping Zhang
  2020-03-25 14:12     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Weiping Zhang @ 2020-03-25 12:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, cgroups

Tejun Heo <tj@kernel.org> wrote on Wed, Mar 25, 2020 at 2:27 AM:
>
> On Sat, Mar 21, 2020 at 09:20:36AM +0800, Weiping Zhang wrote:
> > The user space tool, called iotrack, collects these basic io statistics
> > and then generates more valuable metrics at the cgroup level.
> > From iotrack, you can get a cgroup's percentage of ios, bytes,
> > total_time and disk_time relative to the whole disk. That makes it easy
> > to evaluate the real weight under a weight-based policy (bfq, blk-iocost).
> > iotrack generates many more metrics for reads and writes;
> > for more details, please visit: https://github.com/dublio/iotrack.
> >
> > Test result for two fio jobs doing 4K randread:
> > test1 cgroup bfq weight = 800
> > test2 cgroup bfq weight = 100
> >
> > Device      io/s   MB/s    %io    %MB    %tm   %dtm  %d2c %hit0 %hit1 %hit2 %hit3 %hit4 %hit5  %hit6  %hit7 cgroup
> > nvme1n1 44588.00 174.17 100.00 100.00 100.00 100.00 38.46  0.25 45.27 95.90 98.33 99.47 99.85  99.92  99.95 /
> > nvme1n1 30206.00 117.99  67.74  67.74  29.44  67.29 87.90  0.35 47.82 99.22 99.98 99.99 99.99 100.00 100.00 /test1
> > nvme1n1 14370.00  56.13  32.23  32.23  70.55  32.69 17.82  0.03 39.89 88.92 94.88 98.37 99.53  99.77  99.85 /test2
>
> Maybe this'd be better done with bpf?
>
Hi Tejun,

How about supporting both iotrack and bpf?
If that's OK, I'd like to add bpf support in another patchset. I saw
that iocost_monitor.py is based on drgn; maybe I can write a new script,
"biotrack", based on drgn.

For this patchset, iotrack works well; I'm using it to monitor block
cgroups in order to select a proper io isolation policy.

Thanks a ton
Weiping
> --
> tejun


* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-25 12:49   ` Weiping Zhang
@ 2020-03-25 14:12     ` Tejun Heo
  2020-03-25 16:45       ` Weiping Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2020-03-25 14:12 UTC (permalink / raw)
  To: Weiping Zhang; +Cc: Jens Axboe, linux-block, cgroups

On Wed, Mar 25, 2020 at 08:49:24PM +0800, Weiping Zhang wrote:
> For this patchset, iotrack works well; I'm using it to monitor block
> cgroups in order to select a proper io isolation policy.

Yeah, I get that, but monitoring needs tend to diverge quite a bit
depending on the use case, so detailed monitoring often needs a fair
bit of flexibility, and I'm a bit skeptical about adding a fixed
controller for that. I think a better approach may be implementing
features which make dynamic monitoring, whether that's through bpf,
drgn or plain tracepoints, easier and more efficient.

Thanks.

-- 
tejun


* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-25 14:12     ` Tejun Heo
@ 2020-03-25 16:45       ` Weiping Zhang
  2020-03-26 15:08         ` Weiping Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Weiping Zhang @ 2020-03-25 16:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, cgroups

Tejun Heo <tj@kernel.org> wrote on Wed, Mar 25, 2020 at 10:12 PM:
>
> On Wed, Mar 25, 2020 at 08:49:24PM +0800, Weiping Zhang wrote:
> > For this patchset, iotrack works well; I'm using it to monitor block
> > cgroups in order to select a proper io isolation policy.
>
> Yeah, I get that, but monitoring needs tend to diverge quite a bit
> depending on the use case, so detailed monitoring often needs a fair
> bit of flexibility, and I'm a bit skeptical about adding a fixed
> controller for that. I think a better approach may be implementing
> features which make dynamic monitoring, whether that's through bpf,
> drgn or plain tracepoints, easier and more efficient.
>
I agree with you; there are lots of io metrics needed in real
production systems. The more flexible way is to export all bio
structure members over the bio's whole life to userspace without
recompiling the kernel, like what bpf does.

The main block cgroup isolation policies today:
blk-iocost and bfq are weight based, blk-iolatency is latency based.
blk-iotrack can track the real percentage of IOs, kB, on-disk time
(d2c) and total time, which is a good indicator of the real weight.
For blk-iolatency, blk-iotrack has 8 latency thresholds to show the
latency distribution, so if we set these thresholds around
blk-iolatency's target latency, we can tune the target latency to a
more proper value.

blk-iotrack extends the basic io.stat. It just exports the important
basic io statistics for a cgroup, like what /proc/diskstats does for a
block device. And it is easy to program against; iotrack works just
like iostat, but focuses on cgroups.

blk-iotrack works well alongside these block cgroup isolation policies,
as an indicator of cgroup weight and latency.

Thanks


* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-25 16:45       ` Weiping Zhang
@ 2020-03-26 15:08         ` Weiping Zhang
  2020-03-26 16:14           ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Weiping Zhang @ 2020-03-26 15:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, cgroups

Weiping Zhang <zwp10758@gmail.com> wrote on Thu, Mar 26, 2020 at 12:45 AM:
>
> Tejun Heo <tj@kernel.org> wrote on Wed, Mar 25, 2020 at 10:12 PM:
> >
> > On Wed, Mar 25, 2020 at 08:49:24PM +0800, Weiping Zhang wrote:
> > > For this patchset, iotrack works well; I'm using it to monitor block
> > > cgroups in order to select a proper io isolation policy.
> >
> > Yeah, I get that, but monitoring needs tend to diverge quite a bit
> > depending on the use case, so detailed monitoring often needs a fair
> > bit of flexibility, and I'm a bit skeptical about adding a fixed
> > controller for that. I think a better approach may be implementing
> > features which make dynamic monitoring, whether that's through bpf,
> > drgn or plain tracepoints, easier and more efficient.
> >
> I agree with you; there are lots of io metrics needed in real
> production systems. The more flexible way is to export all bio
> structure members over the bio's whole life to userspace without
> recompiling the kernel, like what bpf does.
>
> The main block cgroup isolation policies today:
> blk-iocost and bfq are weight based, blk-iolatency is latency based.
> blk-iotrack can track the real percentage of IOs, kB, on-disk time
> (d2c) and total time, which is a good indicator of the real weight.
> For blk-iolatency, blk-iotrack has 8 latency thresholds to show the
> latency distribution, so if we set these thresholds around
> blk-iolatency's target latency, we can tune the target latency to a
> more proper value.
>
> blk-iotrack extends the basic io.stat. It just exports the important
> basic io statistics for a cgroup, like what /proc/diskstats does for a
> block device. And it is easy to program against; iotrack works just
> like iostat, but focuses on cgroups.
>
> blk-iotrack works well alongside these block cgroup isolation policies,
> as an indicator of cgroup weight and latency.
>

Hi Tejun,

I did a test on cgroup v2 and monitored it with iotrack; then I
compared fio's output with iotrack's, and they match well.

cgroup weight test:
/sys/fs/cgroup/test1
/sys/fs/cgroup/test2
test1.weight : test2.weight = 8 : 1

I ran the 4K-randread fio test under these three policies:
iocost, bfq, and nvme-wrr.
For blk-iocost I ran "iocost_coef_gen.py" and wrote the result to "io.cost.model":
259:0 ctrl=user model=linear rbps=3286476297 rseqiops=547837
rrandiops=793881 wbps=2001272356 wseqiops=482243 wrandiops=483037

But iocost_test1 cannot get 8/(8+1) of the iops, and the total disk iops
is 737559 < 793881 (rrandiops), even if I change rrandiops to 637000.

test case     bw(KiB/s)  iops       rd_avg_lat(us)  rd_p99_lat(us)
==================================================================
iocost_test1  1550478    387619     659.76      1662.00
iocost_test2  1399761    349940     730.83      1712.00
wrr_test1     2618185    654546     390.59      1187.00
wrr_test2     362613     90653      2822.62     4358.00
bfq_test1     714127     178531     1432.43     489.00
bfq_test2     178821     44705      5721.76     552.00


The detailed test report can be found at:
https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test


> Thanks


* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-26 15:08         ` Weiping Zhang
@ 2020-03-26 16:14           ` Tejun Heo
  2020-03-26 16:27             ` Weiping Zhang
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2020-03-26 16:14 UTC (permalink / raw)
  To: Weiping Zhang; +Cc: Jens Axboe, linux-block, cgroups

On Thu, Mar 26, 2020 at 11:08:45PM +0800, Weiping Zhang wrote:
> But iocost_test1 cannot get 8/(8+1) of the iops, and the total disk iops
> is 737559 < 793881 (rrandiops), even if I change rrandiops to 637000.

iocost needs QoS targets set, especially for deep-queue devices. W/o QoS targets,
it only throttles when the QD is saturated, which might not happen at all depending
on the fio job params.

Can you try with sth like the following in io.cost.qos?

  259:0 enable=1 ctrl=user rpct=95.00 rlat=5000 wpct=50.00 wlat=10000

In case you see significant bw loss, step up the r/wlat params.

Thanks.

-- 
tejun


* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-26 16:14           ` Tejun Heo
@ 2020-03-26 16:27             ` Weiping Zhang
  2020-03-31 14:19               ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Weiping Zhang @ 2020-03-26 16:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-block, cgroups

Tejun Heo <tj@kernel.org> wrote on Fri, Mar 27, 2020 at 12:15 AM:
>
> On Thu, Mar 26, 2020 at 11:08:45PM +0800, Weiping Zhang wrote:
> > But iocost_test1 cannot get 8/(8+1) of the iops, and the total disk iops
> > is 737559 < 793881 (rrandiops), even if I change rrandiops to 637000.
>
> iocost needs QoS targets set, especially for deep-queue devices. W/o QoS targets,
> it only throttles when the QD is saturated, which might not happen at all depending
> on the fio job params.
>
> Can you try with sth like the following in io.cost.qos?
>
>   259:0 enable=1 ctrl=user rpct=95.00 rlat=5000 wpct=50.00 wlat=10000
>
> In case you see significant bw loss, step up the r/wlat params.
>
OK, I'll try it.

I would really appreciate it if you could help review blk-iotrack.c. Or
should we just drop io.iotrack.stat and append these statistics to
io.stat? I think these metrics are useful, and they just extend the
io.stat output.

Thanks a ton


* [RFC PATCH v2 0/3] blkcg: add blk-iotrack
  2020-03-21  1:20 [RFC 0/3] blkcg: add blk-iotrack Weiping Zhang
                   ` (3 preceding siblings ...)
  2020-03-24 18:27 ` [RFC 0/3] " Tejun Heo
@ 2020-03-27  6:27 ` Weiping Zhang
  4 siblings, 0 replies; 13+ messages in thread
From: Weiping Zhang @ 2020-03-27  6:27 UTC (permalink / raw)
  To: axboe, tj; +Cc: linux-block, cgroups

Hi all,

This patchset adds a monitor-only module, blk-iotrack, for the block
cgroup.

It contains the kernel-space blk-iotrack and the user-space tool
iotrack, and you can also write your own tool to do more data analysis.

blk-iotrack is designed to track various io statistics of a block cgroup;
it is based on the rq_qos framework. It only tracks io and does not do
any throttling.

Compared to blk-iolatency, it provides 8 configurable latency buckets via
/sys/fs/cgroup/io.iotrack.lat_thresh; blk-iotrack accounts the number of
IOs whose latency is less than the corresponding threshold. In this way
we can get the cgroup's latency distribution. The default latency buckets
are 50us, 100us, 200us, 400us, 1ms, 2ms, 4ms, 8ms.

Compared to io.stat.{rbytes,wbytes,rios,wios,dbytes,dios}, it accounts
IOs when they complete instead of when they are submitted. If an IO is
throttled by the io scheduler or another throttling policy, there is a
gap: these IOs have been submitted but not yet completed.

The previous patch records a timestamp for each bio when it is issued
to the disk driver. Then we can get the disk latency in rq_qos_done_bio;
this is also called the D2C time. In rq_qos_done_bio, blk-iotrack also
records the total latency (now - bio_issue_time), which can be treated
as the Q2C time. In this way, we can get the percentage %d2c = D2C/Q2C
for each cgroup. It is very useful for detecting whether the main
latency comes from the disk or from software, e.g. the io scheduler or
another block cgroup throttling policy.

The user space tool, called iotrack, collects these basic io statistics
and then generates more valuable metrics at the cgroup level.
From iotrack, you can get a cgroup's percentage of ios (%io), bytes
(%byte), total_time (%tm) and disk_time (%dtm) relative to the whole
disk. That makes it easy to evaluate the real weight under a
weight-based policy (bfq, blk-iocost).
Besides the basic io/s and MB/s, iotrack also shows %io, %byte, %tm,
%dtm, %d2c, ad2c, aq2c and the per-bucket hit ratios (%hit0..%hit7).
iotrack generates many more metrics for reads and writes;
for more details, please visit: https://github.com/dublio/iotrack.

Test result for two fio jobs doing 4K randread:
test1 cgroup bfq weight = 800
test2 cgroup bfq weight = 100
numjobs=8, iodepth=32

Device   rrqm/s wrqm/s r/s    w/s    rMB/s  wMB/s  avgrqkb  avgqu-sz await    r_await  w_await  svctm    %util    conc
nvme1n1  0      0      217341 0      848.98 0.00   4.00     475.03   2.28     2.28     0.00     0.00     100.20   474.08

Device   io/s   MB/s   %io    %byte  %tm    %dtm   %d2c ad2c aq2c   %hit0 %hit1 %hit2 %hit3 %hit4 %hit5 %hit6 %hit7 cgroup
nvme1n1  217345 849.00 100.00 100.00 100.00 100.00 4.09 0.09 2.28   23.97 62.43 89.88 98.44 99.88 99.88 99.88 99.88 /
nvme1n1  193183 754.62 88.88  88.88  45.91  84.54  7.52 0.09 1.18   26.85 64.87 90.71 98.40 99.88 99.88 99.88 99.88 /test1
nvme1n1  24235  94.67  11.15  11.15  54.09  15.48  1.17 0.13 11.06  0.98  43.00 83.31 98.77 99.87 99.87 99.87 99.87 /test2

* The root block cgroup "/" shows the io statistics for the whole ssd disk.

* test1 uses 88% of the disk's iops and bps.

* %dtm stands for on-disk time; the test1 cgroup gets 85% of the whole
	disk's time, which means test1 gets more disk time than test2.

* For test2's %d2c, only 1.17% of the latency is spent on the hardware
	disk, which means the main latency comes from software; it was
	throttled by software.

* aq2c: average Q2C latency; test2's aq2c (11ms) > test1's aq2c (1ms).
	A sketch of how ad2c/aq2c can be derived follows this list.

* For the latency distribution, see hit1 (<= 100us): 64% of test1's ios
	complete within 100us while only 43% of test2's do, so test1's
	latency is better than test2's.
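
As a rough illustration, ad2c and aq2c can be derived from the counters
in io.iotrack.stat as sketched below (an assumption about how the
iotrack tool computes them, not something defined by the kernel patch):

#include <stdint.h>

/* Illustrative sketch: average latency in milliseconds from the
 * per-cgroup counters exported by io.iotrack.stat. For aq2c pass the
 * Q2C time counters (rtms/wtms/otms); for ad2c pass the D2C time
 * counters (rdtms/wdtms/odtms); ios[] holds rios/wios/oios.
 */
static double avg_lat_ms(const uint64_t time_ns[3], const uint64_t ios[3])
{
	uint64_t t = time_ns[0] + time_ns[1] + time_ns[2];
	uint64_t n = ios[0] + ios[1] + ios[2];

	return n ? (double)t / (double)n / 1e6 : 0.0;	/* ns -> ms */
}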

For a more detailed test report, please visit:
https://github.com/dublio/iotrack/wiki/cgroup-io-weight-test

Patch 1 and patch 2 are preparation patches.
The last patch implements blk-iotrack.

Changes since v1:
* fix the bio issue_size update when splitting a bio; the v1 patch
  cleared issue_time.

Weiping Zhang (3):
  update the real issue size when bio_split
  bio: track timestamp of submitting bio to the disk driver
  blkcg: add blk-iotrack

 block/Kconfig              |   6 +
 block/Makefile             |   1 +
 block/bio.c                |  13 ++
 block/blk-cgroup.c         |   4 +
 block/blk-iotrack.c        | 436 +++++++++++++++++++++++++++++++++++++
 block/blk-mq.c             |   3 +
 block/blk-rq-qos.h         |   3 +
 block/blk.h                |   7 +
 include/linux/blk-cgroup.h |   6 +
 include/linux/blk_types.h  |  38 ++++
 10 files changed, 517 insertions(+)
 create mode 100644 block/blk-iotrack.c

-- 
2.18.1



* Re: [RFC 0/3] blkcg: add blk-iotrack
  2020-03-26 16:27             ` Weiping Zhang
@ 2020-03-31 14:19               ` Tejun Heo
  0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2020-03-31 14:19 UTC (permalink / raw)
  To: Weiping Zhang; +Cc: Jens Axboe, linux-block, cgroups

Hello, Weiping.

On Fri, Mar 27, 2020 at 12:27:11AM +0800, Weiping Zhang wrote:
> I would really appreciate it if you could help review blk-iotrack.c. Or
> should we just drop io.iotrack.stat and append these statistics to
> io.stat? I think these metrics are useful,

So, the problem is that you can get the exact same information, and easily
more, using bpf. There definitely are benefits to baking in some statistics
in terms of overhead and accessibility, but I'm not sure what's being
proposed is generic and/or flexible enough to bake into the interface at
this point.

Something which could be immediately useful would be cgroup-aware bpf progs
which expose these statistics. Can you please take a look at the following?

  https://github.com/iovisor/bcc/blob/master/tools/biolatency.py
  https://github.com/iovisor/bcc/blob/master/tools/biolatency_example.txt

They aren't cgroup aware but can be made so and can provide a lot more detailed
statistics than something we can hardcode into the kernel.

Thanks.

-- 
tejun



Thread overview: 13+ messages
2020-03-21  1:20 [RFC 0/3] blkcg: add blk-iotrack Weiping Zhang
2020-03-21  1:21 ` [RFC 1/3] update the real issue size when bio_split Weiping Zhang
2020-03-21  1:21 ` [RFC 2/3] bio: track timestamp of submitting bio to the disk driver Weiping Zhang
2020-03-21  1:21 ` [RFC 3/3] blkcg: add blk-iotrack Weiping Zhang
2020-03-24 18:27 ` [RFC 0/3] " Tejun Heo
2020-03-25 12:49   ` Weiping Zhang
2020-03-25 14:12     ` Tejun Heo
2020-03-25 16:45       ` Weiping Zhang
2020-03-26 15:08         ` Weiping Zhang
2020-03-26 16:14           ` Tejun Heo
2020-03-26 16:27             ` Weiping Zhang
2020-03-31 14:19               ` Tejun Heo
2020-03-27  6:27 ` [RFC PATCH v2 " Weiping Zhang
