* [PATCHSET] block: IO polling improvements
@ 2016-11-01 21:05 Jens Axboe
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-01 21:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-block; +Cc: hch

This builds on top of Christoph's simplified bdev O_DIRECT code,
posted earlier today [1].

This patchset adds support for a hybrid polling mode, where a poll
cycle can be split into an upfront sleep, then a busy poll. On the
devices where we care about IO polling, generally we have fairly
deterministic completion latencies. So this patchset pulls in
the stats code from the buffered writeback code, and uses that to
more intelligently poll a device.

Let's assume a device completes an IO in 8 usecs. With classic
polling, we busy loop for those 8 usecs. If we know the rough
completion time of the IO, it's a lot more efficient to sleep for
a bit and then only poll the last few usecs.

This adds a sysfs file, /sys/block/<dev>/queue/io_poll_delay, which
makes the polling behave as follows:

-1	Never enter hybrid sleep, always poll
 0	Use half of the completion mean for this request type for the
	sleep delay
>0	Use this specific value as the sleep delay

We default to -1, which is the old behavior. Initial testing with '0'
has been positive, retaining 93% of the performance at roughly 50%
CPU instead of the 100% with classic polling.
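
As a tiny illustration of the knob (the device name below is just an
example, and this program is not part of the series), switching a
queue to the adaptive mode from user space boils down to:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* example device; pick whatever you are testing on */
            const char *path = "/sys/block/nvme0n1/queue/io_poll_delay";
            const char *val = "0";  /* 0 == half of the mean completion time */
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, val, strlen(val)) < 0)
                    perror("write");
            close(fd);
            return 0;
    }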

You can also find this code in my for-4.10/dio branch.

[1] https://marc.info/?l=linux-kernel&m=147793678521108&w=2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-01 21:05 [PATCHSET] block: IO polling improvements Jens Axboe
@ 2016-11-01 21:05 ` Jens Axboe
  2016-11-01 22:25   ` Johannes Thumshirn
                     ` (3 more replies)
  2016-11-01 21:05 ` [PATCH 2/4] block: move poll code to blk-mq Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 4 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-01 21:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-block; +Cc: hch, Jens Axboe

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
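A quick worked example of the bookkeeping below (all numbers are made
up for illustration): a completion timestamp t is bucketed into the
window starting at t & ~(BLK_STAT_NSEC - 1), i.e. ~134 msec buckets,
since BLK_STAT_NSEC is 2^27 nsec. When blk_stat_sum() folds one set of
stats into another, the means are merged sample-weighted, mean =
(mean_a * nr_a + mean_b * nr_b) / (nr_a + nr_b), so merging mean=8000
over 100 samples with mean=10000 over 50 samples yields
(8000 * 100 + 10000 * 50) / 150 = 8666. The new 'stats' sysfs files
print one line per direction in the print_stat() format, e.g.:

    read : samples=150, mean=8666, min=5120, max=20480
    write: samples=0, mean=0, min=-1, max=0
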
 block/Makefile            |   2 +-
 block/blk-core.c          |   4 +
 block/blk-mq-sysfs.c      |  47 ++++++++++
 block/blk-mq.c            |  14 +++
 block/blk-mq.h            |   3 +
 block/blk-stat.c          | 226 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-stat.h          |  37 ++++++++
 block/blk-sysfs.c         |  26 ++++++
 include/linux/blk_types.h |  16 ++++
 include/linux/blkdev.h    |   4 +
 10 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 934dac73fb37..2528c596f7ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			blk-lib.o blk-mq.o blk-mq-tag.o \
+			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
+	blk_stat_set_issue_time(&req->issue_stat);
+
 	/*
 	 * We are now handing the request to the hardware, initialize
 	 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
 	trace_block_rq_complete(req->q, req, nr_bytes);
 
+	blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
 	if (!req->bio)
 		return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 01fb455d3377..633c79a538ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i;
+
+	hctx_for_each_ctx(hctx, ctx, i) {
+		blk_stat_init(&ctx->stat[0]);
+		blk_stat_init(&ctx->stat[1]);
+	}
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+					  const char *page, size_t count)
+{
+	blk_mq_stat_clear(hctx);
+	return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_stat_init(&stat[0]);
+	blk_stat_init(&stat[1]);
+
+	blk_hctx_stat_get(hctx, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.show = blk_mq_hw_sysfs_poll_show,
 	.store = blk_mq_hw_sysfs_poll_store,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+	.attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_stat_show,
+	.store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_stat.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2da1a0ee3318..4555a76d22a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -376,10 +377,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static void blk_mq_stat_add(struct request *rq)
+{
+	struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
+
+	blk_stat_add(stat, rq);
+}
+
 static void __blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_mq_stat_add(rq);
+
 	if (!q->softirq_done_fn)
 		blk_mq_end_request(rq, rq->errors);
 	else
@@ -423,6 +433,8 @@ void blk_mq_start_request(struct request *rq)
 	if (unlikely(blk_bidi_rq(rq)))
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
+	blk_stat_set_issue_time(&rq->issue_stat);
+
 	blk_add_timer(rq);
 
 	/*
@@ -1708,6 +1720,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		spin_lock_init(&__ctx->lock);
 		INIT_LIST_HEAD(&__ctx->rq_list);
 		__ctx->queue = q;
+		blk_stat_init(&__ctx->stat[0]);
+		blk_stat_init(&__ctx->stat[1]);
 
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
 		if (!cpu_online(i))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e5d25249028c..8cf16cb69f64 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -1,6 +1,8 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include "blk-stat.h"
+
 struct blk_mq_tag_set;
 
 struct blk_mq_ctx {
@@ -18,6 +20,7 @@ struct blk_mq_ctx {
 
 	/* incremented at completion time */
 	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
+	struct blk_rq_stat	stat[2];
 
 	struct request_queue	*queue;
 	struct kobject		kobj;
diff --git a/block/blk-stat.c b/block/blk-stat.c
new file mode 100644
index 000000000000..642afdc6d0f8
--- /dev/null
+++ b/block/blk-stat.c
@@ -0,0 +1,226 @@
+/*
+ * Block stat tracking code
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/blk-mq.h>
+
+#include "blk-stat.h"
+#include "blk-mq.h"
+
+static void blk_stat_flush_batch(struct blk_rq_stat *stat)
+{
+	if (!stat->nr_batch)
+		return;
+	if (!stat->nr_samples)
+		stat->mean = div64_s64(stat->batch, stat->nr_batch);
+	else {
+		stat->mean = div64_s64((stat->mean * stat->nr_samples) +
+					stat->batch,
+					stat->nr_samples + stat->nr_batch);
+	}
+
+	stat->nr_samples += stat->nr_batch;
+	stat->nr_batch = stat->batch = 0;
+}
+
+void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
+{
+	if (!src->nr_samples)
+		return;
+
+	blk_stat_flush_batch(src);
+
+	dst->min = min(dst->min, src->min);
+	dst->max = max(dst->max, src->max);
+
+	if (!dst->nr_samples)
+		dst->mean = src->mean;
+	else {
+		dst->mean = div64_s64((src->mean * src->nr_samples) +
+					(dst->mean * dst->nr_samples),
+					dst->nr_samples + src->nr_samples);
+	}
+	dst->nr_samples += src->nr_samples;
+}
+
+static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	uint64_t latest = 0;
+	int i, j, nr;
+
+	blk_stat_init(&dst[0]);
+	blk_stat_init(&dst[1]);
+
+	nr = 0;
+	do {
+		uint64_t newest = 0;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				if (!ctx->stat[0].nr_samples &&
+				    !ctx->stat[1].nr_samples)
+					continue;
+				if (ctx->stat[0].time > newest)
+					newest = ctx->stat[0].time;
+				if (ctx->stat[1].time > newest)
+					newest = ctx->stat[1].time;
+			}
+		}
+
+		/*
+		 * No samples
+		 */
+		if (!newest)
+			break;
+
+		if (newest > latest)
+			latest = newest;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				if (ctx->stat[0].time == newest) {
+					blk_stat_sum(&dst[0], &ctx->stat[0]);
+					nr++;
+				}
+				if (ctx->stat[1].time == newest) {
+					blk_stat_sum(&dst[1], &ctx->stat[1]);
+					nr++;
+				}
+			}
+		}
+		/*
+		 * If we race on finding an entry, just loop back again.
+		 * Should be very rare.
+		 */
+	} while (!nr);
+
+	dst[0].time = dst[1].time = latest;
+}
+
+void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+	if (q->mq_ops)
+		blk_mq_stat_get(q, dst);
+	else {
+		memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
+		memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
+	}
+}
+
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i, nr;
+
+	nr = 0;
+	do {
+		uint64_t newest = 0;
+
+		hctx_for_each_ctx(hctx, ctx, i) {
+			if (!ctx->stat[0].nr_samples &&
+			    !ctx->stat[1].nr_samples)
+				continue;
+
+			if (ctx->stat[0].time > newest)
+				newest = ctx->stat[0].time;
+			if (ctx->stat[1].time > newest)
+				newest = ctx->stat[1].time;
+		}
+
+		if (!newest)
+			break;
+
+		hctx_for_each_ctx(hctx, ctx, i) {
+			if (ctx->stat[0].time == newest) {
+				blk_stat_sum(&dst[0], &ctx->stat[0]);
+				nr++;
+			}
+			if (ctx->stat[1].time == newest) {
+				blk_stat_sum(&dst[1], &ctx->stat[1]);
+				nr++;
+			}
+		}
+		/*
+		 * If we race on finding an entry, just loop back again.
+		 * Should be very rare, as the window is only updated
+		 * occasionally
+		 */
+	} while (!nr);
+}
+
+static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
+{
+	stat->min = -1ULL;
+	stat->max = stat->nr_samples = stat->mean = 0;
+	stat->batch = stat->nr_batch = 0;
+	stat->time = time_now & BLK_STAT_NSEC_MASK;
+}
+
+void blk_stat_init(struct blk_rq_stat *stat)
+{
+	__blk_stat_init(stat, ktime_to_ns(ktime_get()));
+}
+
+static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
+{
+	return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
+}
+
+bool blk_stat_is_current(struct blk_rq_stat *stat)
+{
+	return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
+}
+
+void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
+{
+	s64 now, value;
+
+	now = __blk_stat_time(ktime_to_ns(ktime_get()));
+	if (now < blk_stat_time(&rq->issue_stat))
+		return;
+
+	if (!__blk_stat_is_current(stat, now))
+		__blk_stat_init(stat, now);
+
+	value = now - blk_stat_time(&rq->issue_stat);
+	if (value > stat->max)
+		stat->max = value;
+	if (value < stat->min)
+		stat->min = value;
+
+	if (stat->batch + value < stat->batch ||
+	    stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
+		blk_stat_flush_batch(stat);
+
+	stat->batch += value;
+	stat->nr_batch++;
+}
+
+void blk_stat_clear(struct request_queue *q)
+{
+	if (q->mq_ops) {
+		struct blk_mq_hw_ctx *hctx;
+		struct blk_mq_ctx *ctx;
+		int i, j;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				blk_stat_init(&ctx->stat[0]);
+				blk_stat_init(&ctx->stat[1]);
+			}
+		}
+	} else {
+		blk_stat_init(&q->rq_stats[0]);
+		blk_stat_init(&q->rq_stats[1]);
+	}
+}
+
+void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+{
+	stat->time = (stat->time & BLK_STAT_MASK) |
+			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+}
diff --git a/block/blk-stat.h b/block/blk-stat.h
new file mode 100644
index 000000000000..26b1f45dff26
--- /dev/null
+++ b/block/blk-stat.h
@@ -0,0 +1,37 @@
+#ifndef BLK_STAT_H
+#define BLK_STAT_H
+
+/*
+ * ~0.13s window as a power-of-2 (2^27 nsecs)
+ */
+#define BLK_STAT_NSEC		134217728ULL
+#define BLK_STAT_NSEC_MASK	~(BLK_STAT_NSEC - 1)
+
+/*
+ * Upper 3 bits can be used elsewhere
+ */
+#define BLK_STAT_RES_BITS	3
+#define BLK_STAT_SHIFT		(64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
+#define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
+
+void blk_stat_add(struct blk_rq_stat *, struct request *);
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
+void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
+void blk_stat_clear(struct request_queue *q);
+void blk_stat_init(struct blk_rq_stat *);
+void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
+bool blk_stat_is_current(struct blk_rq_stat *);
+void blk_stat_set_issue_time(struct blk_issue_stat *);
+
+static inline u64 __blk_stat_time(u64 time)
+{
+	return time & BLK_STAT_TIME_MASK;
+}
+
+static inline u64 blk_stat_time(struct blk_issue_stat *stat)
+{
+	return __blk_stat_time(stat->time);
+}
+
+#endif
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 488c2e28feb8..5bb4648f434a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -401,6 +401,26 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
 	return queue_var_show(blk_queue_dax(q), page);
 }
 
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t queue_stats_show(struct request_queue *q, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_queue_stat_get(q, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -553,6 +573,11 @@ static struct queue_sysfs_entry queue_dax_entry = {
 	.show = queue_dax_show,
 };
 
+static struct queue_sysfs_entry queue_stats_entry = {
+	.attr = {.name = "stats", .mode = S_IRUGO },
+	.show = queue_stats_show,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -582,6 +607,7 @@ static struct attribute *default_attrs[] = {
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
+	&queue_stats_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bb921028e7c5..a59a214c39ae 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -248,4 +248,20 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
 }
 
+struct blk_issue_stat {
+	u64 time;
+};
+
+#define BLK_RQ_STAT_BATCH	64
+
+struct blk_rq_stat {
+	s64 mean;
+	u64 min;
+	u64 max;
+	s32 nr_samples;
+	s32 nr_batch;
+	u64 batch;
+	s64 time;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8396da2bb698..dcd8d6e8801f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -197,6 +197,7 @@ struct request {
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
 	unsigned long start_time;
+	struct blk_issue_stat issue_stat;
 #ifdef CONFIG_BLK_CGROUP
 	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
@@ -490,6 +491,9 @@ struct request_queue {
 
 	unsigned int		nr_sorted;
 	unsigned int		in_flight[2];
+
+	struct blk_rq_stat	rq_stats[2];
+
 	/*
 	 * Number of active block driver functions for which blk_drain_queue()
 	 * must wait. Must be incremented around functions that unlock the
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/4] block: move poll code to blk-mq
  2016-11-01 21:05 [PATCHSET] block: IO polling improvements Jens Axboe
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
@ 2016-11-01 21:05 ` Jens Axboe
  2016-11-02 14:54   ` Christoph Hellwig
  2016-11-01 21:05 ` [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT Jens Axboe
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2016-11-01 21:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-block; +Cc: hch, Jens Axboe

The poll code is blk-mq specific, so let's move it to blk-mq.c. This
is a prep patch for improving the polling code.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-core.c | 36 +++++-------------------------------
 block/blk-mq.c   | 33 +++++++++++++++++++++++++++++++++
 block/blk-mq.h   |  2 ++
 3 files changed, 40 insertions(+), 31 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ca77c725b4e5..7728562d77d9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3293,47 +3293,21 @@ EXPORT_SYMBOL(blk_finish_plug);
 
 bool blk_poll(struct request_queue *q, blk_qc_t cookie)
 {
-	struct blk_plug *plug;
-	long state;
-	unsigned int queue_num;
 	struct blk_mq_hw_ctx *hctx;
+	struct blk_plug *plug;
+	struct request *rq;
 
 	if (!q->mq_ops || !q->mq_ops->poll || !blk_qc_t_valid(cookie) ||
 	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		return false;
 
-	queue_num = blk_qc_t_to_queue_num(cookie);
-	hctx = q->queue_hw_ctx[queue_num];
-	hctx->poll_considered++;
-
 	plug = current->plug;
 	if (plug)
 		blk_flush_plug_list(plug, false);
 
-	state = current->state;
-	while (!need_resched()) {
-		int ret;
-
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx, blk_qc_t_to_tag(cookie));
-		if (ret > 0) {
-			hctx->poll_success++;
-			set_current_state(TASK_RUNNING);
-			return true;
-		}
-
-		if (signal_pending_state(state, current))
-			set_current_state(TASK_RUNNING);
-
-		if (current->state == TASK_RUNNING)
-			return true;
-		if (ret < 0)
-			break;
-		cpu_relax();
-	}
-
-	return false;
+	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+	rq = blk_mq_tag_to_rq(hctx->tags, blk_qc_t_to_tag(cookie));
+	return blk_mq_poll(hctx, rq);
 }
 EXPORT_SYMBOL_GPL(blk_poll);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4555a76d22a7..4ef35588c299 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2352,6 +2352,39 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	struct request_queue *q = hctx->queue;
+	long state;
+
+	hctx->poll_considered++;
+
+	state = current->state;
+	while (!need_resched()) {
+		int ret;
+
+		hctx->poll_invoked++;
+
+		ret = q->mq_ops->poll(hctx, rq->tag);
+		if (ret > 0) {
+			hctx->poll_success++;
+			set_current_state(TASK_RUNNING);
+			return true;
+		}
+
+		if (signal_pending_state(state, current))
+			set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return true;
+		if (ret < 0)
+			break;
+		cpu_relax();
+	}
+
+	return false;
+}
+
 void blk_mq_disable_hotplug(void)
 {
 	mutex_lock(&all_q_mutex);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 8cf16cb69f64..79ea86e0ed49 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -61,6 +61,8 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
 void blk_mq_release(struct request_queue *q);
 
+extern bool blk_mq_poll(struct blk_mq_hw_ctx *, struct request *);
+
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
 					   unsigned int cpu)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
  2016-11-01 21:05 [PATCHSET] block: IO polling improvements Jens Axboe
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
  2016-11-01 21:05 ` [PATCH 2/4] block: move poll code to blk-mq Jens Axboe
@ 2016-11-01 21:05 ` Jens Axboe
  2016-11-02 14:54   ` Christoph Hellwig
                     ` (2 more replies)
  2016-11-01 21:05 ` [PATCH 4/4] blk-mq: make the polling code adaptive Jens Axboe
  2016-11-02 14:51 ` [PATCHSET] block: IO polling improvements Christoph Hellwig
  4 siblings, 3 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-01 21:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-block; +Cc: hch, Jens Axboe

This patch enables a hybrid polling mode. Instead of polling after IO
submission, we can induce an artificial delay, and then poll after that.
For example, if the IO is presumed to complete in 8 usecs from now, we
can sleep for 4 usecs, wake up, and then do our polling. This still puts
a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
after the IO has completed, it'll happen before. With this hybrid
scheme, we can achieve big latency reductions while still using the same
(or less) amount of CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
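Rough numbers for the above (illustrative only): writing 4 to the
io_poll_delay file added below stores poll_nsec = 4 * 1000 = 4000 nsec,
so for a device that typically completes an IO in ~8 usec,
blk_mq_poll_hybrid_sleep() arms the hrtimer for 4000 nsec and
io_schedule()s, leaving only the last ~4 usec to be busy polled in
blk_mq_poll(). The REQ_ATOM_POLL_SLEPT flag ensures a request pays
that sleep at most once, even if blk_poll() is invoked on it
repeatedly.
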
 block/blk-mq.c         | 38 ++++++++++++++++++++++++++++++++++++++
 block/blk-sysfs.c      | 29 +++++++++++++++++++++++++++++
 block/blk.h            |  1 +
 include/linux/blkdev.h |  1 +
 4 files changed, 69 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4ef35588c299..caa55bec9411 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -302,6 +302,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 	rq->rq_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+	clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, ctx, tag);
 	blk_queue_exit(q);
 }
@@ -2352,11 +2353,48 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
+				     struct request *rq)
+{
+	struct hrtimer_sleeper hs;
+	ktime_t kt;
+
+	if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
+		return;
+
+	set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
+
+	/*
+	 * This will be replaced with the stats tracking code, using
+	 * 'avg_completion_time / 2' as the pre-sleep target.
+	 */
+	kt = ktime_set(0, q->poll_nsec);
+
+	hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_set_expires(&hs.timer, kt);
+
+	hrtimer_init_sleeper(&hs, current);
+	do {
+		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
+			break;
+		set_current_state(TASK_INTERRUPTIBLE);
+		hrtimer_start_expires(&hs.timer, HRTIMER_MODE_REL);
+		if (hs.task)
+			io_schedule();
+		hrtimer_cancel(&hs.timer);
+	} while (hs.task && !signal_pending(current));
+
+	__set_current_state(TASK_RUNNING);
+	destroy_hrtimer_on_stack(&hs.timer);
+}
+
 bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 {
 	struct request_queue *q = hctx->queue;
 	long state;
 
+	blk_mq_poll_hybrid_sleep(q, rq);
+
 	hctx->poll_considered++;
 
 	state = current->state;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5bb4648f434a..467b81c6713c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -336,6 +336,28 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 	return ret;
 }
 
+static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
+{
+	return queue_var_show(q->poll_nsec / 1000, page);
+}
+
+static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
+				size_t count)
+{
+	unsigned long poll_usec;
+	ssize_t ret;
+
+	if (!q->mq_ops || !q->mq_ops->poll)
+		return -EINVAL;
+
+	ret = queue_var_store(&poll_usec, page, count);
+	if (ret < 0)
+		return ret;
+
+	q->poll_nsec = poll_usec * 1000;
+	return ret;
+}
+
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
@@ -562,6 +584,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
 	.store = queue_poll_store,
 };
 
+static struct queue_sysfs_entry queue_poll_delay_entry = {
+	.attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_poll_delay_show,
+	.store = queue_poll_delay_store,
+};
+
 static struct queue_sysfs_entry queue_wc_entry = {
 	.attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_wc_show,
@@ -608,6 +636,7 @@ static struct attribute *default_attrs[] = {
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
 	&queue_stats_entry.attr,
+	&queue_poll_delay_entry.attr,
 	NULL,
 };
 
diff --git a/block/blk.h b/block/blk.h
index aa132dea598c..041185e5f129 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -111,6 +111,7 @@ void blk_account_io_done(struct request *req);
 enum rq_atomic_flags {
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
+	REQ_ATOM_POLL_SLEPT,
 };
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dcd8d6e8801f..6acd220dc3f3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -502,6 +502,7 @@ struct request_queue {
 	unsigned int		request_fn_active;
 
 	unsigned int		rq_timeout;
+	unsigned int		poll_nsec;
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] blk-mq: make the polling code adaptive
  2016-11-01 21:05 [PATCHSET] block: IO polling improvements Jens Axboe
                   ` (2 preceding siblings ...)
  2016-11-01 21:05 ` [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT Jens Axboe
@ 2016-11-01 21:05 ` Jens Axboe
  2016-11-02 14:51 ` [PATCHSET] block: IO polling improvements Christoph Hellwig
  4 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-01 21:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-block; +Cc: hch, Jens Axboe

The previous commit introduced the hybrid sleep/poll mode. Take
that one step further, and use the completion latencies to
automatically sleep for half the mean completion time. This is
a good approximation.

This changes the 'io_poll_delay' sysfs file a bit to expose the
various options. Depending on the value, the polling code will
behave differently:

-1	Never enter hybrid sleep mode
 0	Use half of the completion mean for the sleep delay
>0	Use this specific value as the sleep delay

Signed-off-by: Jens Axboe <axboe@fb.com>
---
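A small worked example of the adaptive path (numbers are made up):
with io_poll_delay set to 0 and the current stats window showing a
read mean of 7900 nsec, blk_mq_poll_nsecs() returns
(7900 + 1) / 2 = 3950 nsec, so the request sleeps ~3.95 usec before
the busy poll loop starts. Writing 20 to io_poll_delay instead forces
a fixed 20 * 1000 = 20000 nsec pre-sleep, and writing -1 disables the
hybrid sleep entirely, matching the old always-poll behavior.
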
 block/blk-mq.c         | 50 +++++++++++++++++++++++++++++++++++++++++++++++---
 block/blk-sysfs.c      | 28 ++++++++++++++++++++--------
 include/linux/blkdev.h |  2 +-
 3 files changed, 68 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index caa55bec9411..2af75b087ebd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2353,13 +2353,57 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
+static unsigned long blk_mq_poll_nsecs(struct blk_mq_hw_ctx *hctx,
+				       struct request *rq)
+{
+	struct blk_rq_stat stat[2];
+	unsigned long ret = 0;
+
+	/*
+	 * We don't have to do this once per IO, should optimize this
+	 * to just use the current window of stats until it changes
+	 */
+	memset(&stat, 0, sizeof(stat));
+	blk_hctx_stat_get(hctx, stat);
+
+	/*
+	 * As an optimistic guess, use half of the mean service time
+	 * for this type of request
+	 */
+	if (req_op(rq) == REQ_OP_READ && stat[0].nr_samples)
+		ret = (stat[0].mean + 1) / 2;
+	else if (req_op(rq) == REQ_OP_WRITE && stat[1].nr_samples)
+		ret = (stat[1].mean + 1) / 2;
+
+	return ret;
+}
+
 static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
+				     struct blk_mq_hw_ctx *hctx,
 				     struct request *rq)
 {
 	struct hrtimer_sleeper hs;
+	unsigned int nsecs;
 	ktime_t kt;
 
-	if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
+	if (test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
+		return;
+
+	/*
+	 * poll_nsec can be:
+	 *
+	 * -1:	don't ever hybrid sleep
+	 *  0:	use half of prev avg
+	 * >0:	use this specific value
+	 */
+	if (q->poll_nsec == -1)
+		return;
+	else if (q->poll_nsec > 0)
+		nsecs = q->poll_nsec;
+	else
+		nsecs = blk_mq_poll_nsecs(hctx, rq);
+
+	if (!nsecs)
 		return;
 
 	set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
@@ -2368,7 +2412,7 @@ static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
 	 * This will be replaced with the stats tracking code, using
 	 * 'avg_completion_time / 2' as the pre-sleep target.
 	 */
-	kt = ktime_set(0, q->poll_nsec);
+	kt = ktime_set(0, nsecs);
 
 	hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	hrtimer_set_expires(&hs.timer, kt);
@@ -2393,7 +2437,7 @@ bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
 	struct request_queue *q = hctx->queue;
 	long state;
 
-	blk_mq_poll_hybrid_sleep(q, rq);
+	blk_mq_poll_hybrid_sleep(q, hctx, rq);
 
 	hctx->poll_considered++;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 467b81c6713c..c668af57197b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -338,24 +338,36 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
 {
-	return queue_var_show(q->poll_nsec / 1000, page);
+	int val;
+
+	if (q->poll_nsec == -1)
+		val = -1;
+	else
+		val = q->poll_nsec / 1000;
+
+	return sprintf(page, "%d\n", val);
 }
 
 static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
 				size_t count)
 {
-	unsigned long poll_usec;
-	ssize_t ret;
+	int err, val;
 
 	if (!q->mq_ops || !q->mq_ops->poll)
 		return -EINVAL;
 
-	ret = queue_var_store(&poll_usec, page, count);
-	if (ret < 0)
-		return ret;
+	err = kstrtoint(page, 10, &val);
+	if (err < 0)
+		return err;
 
-	q->poll_nsec = poll_usec * 1000;
-	return ret;
+	printk(KERN_ERR "val=%d\n", val);
+
+	if (val == -1)
+		q->poll_nsec = -1;
+	else
+		q->poll_nsec = val * 1000;
+
+	return count;
 }
 
 static ssize_t queue_poll_show(struct request_queue *q, char *page)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6acd220dc3f3..857f866d2751 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -502,7 +502,7 @@ struct request_queue {
 	unsigned int		request_fn_active;
 
 	unsigned int		rq_timeout;
-	unsigned int		poll_nsec;
+	int			poll_nsec;
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
@ 2016-11-01 22:25   ` Johannes Thumshirn
  2016-11-02  5:37     ` Jens Axboe
  2016-11-02 14:52   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Johannes Thumshirn @ 2016-11-01 22:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-block, hch

On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
> 
> The stats are tracked in, roughly, 0.1s interval windows.
> 
> Add sysfs files to display the stats.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---

[...]

>  
>  	/* incremented at completion time */
>  	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
> +	struct blk_rq_stat	stat[2];

Can you add an enum or define for the directions? Just 0 and 1 aren't very
intuitive.

	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-01 22:25   ` Johannes Thumshirn
@ 2016-11-02  5:37     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-02  5:37 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-kernel, linux-block, hch

On Tue, Nov 01 2016, Johannes Thumshirn wrote:
> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> > For legacy block, we simply track them in the request queue. For
> > blk-mq, we track them on a per-sw queue basis, which we can then
> > sum up through the hardware queues and finally to a per device
> > state.
> > 
> > The stats are tracked in, roughly, 0.1s interval windows.
> > 
> > Add sysfs files to display the stats.
> > 
> > Signed-off-by: Jens Axboe <axboe@fb.com>
> > ---
> 
> [...]
> 
> >  
> >  	/* incremented at completion time */
> >  	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
> > +	struct blk_rq_stat	stat[2];
> 
> Can you add an enum or define for the directions? Just 0 and 1 aren't very
> intuitive.

Good point, I added that and updated both the stats patch and the
subsequent blk-mq poll code.

-- 
Jens Axboe
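
For what it's worth, a sketch of what such an enum could look like
(the names here are invented, the actual follow-up may differ):

    /* hypothetical names for indexing the stat[2] arrays by direction */
    enum {
            BLK_STAT_READ   = 0,
            BLK_STAT_WRITE,
    };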

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET] block: IO polling improvements
  2016-11-01 21:05 [PATCHSET] block: IO polling improvements Jens Axboe
                   ` (3 preceding siblings ...)
  2016-11-01 21:05 ` [PATCH 4/4] blk-mq: make the polling code adaptive Jens Axboe
@ 2016-11-02 14:51 ` Christoph Hellwig
  2016-11-02 14:54   ` Jens Axboe
  4 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-block, Dave Chinner

On Tue, Nov 01, 2016 at 03:05:21PM -0600, Jens Axboe wrote:
> This builds on top of Christoph's simplified bdev O_DIRECT code,
> posted earlier today [1].

Btw, can you create a stable branch just for the
"block: add bio_iov_iter_get_pages()" so that we can use it as
a baseline for the iomap dio work?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
  2016-11-01 22:25   ` Johannes Thumshirn
@ 2016-11-02 14:52   ` Christoph Hellwig
  2016-11-02 14:55     ` Jens Axboe
  2016-11-03 11:17   ` Ming Lei
  2016-11-03 14:10   ` Bart Van Assche
  3 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-block, hch

On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.

what is the use case for the legacy request tracking?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] block: move poll code to blk-mq
  2016-11-01 21:05 ` [PATCH 2/4] block: move poll code to blk-mq Jens Axboe
@ 2016-11-02 14:54   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-block

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCHSET] block: IO polling improvements
  2016-11-02 14:51 ` [PATCHSET] block: IO polling improvements Christoph Hellwig
@ 2016-11-02 14:54   ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-02 14:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-kernel, linux-block, Dave Chinner

On 11/02/2016 08:51 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:05:21PM -0600, Jens Axboe wrote:
>> This builds on top of Christoph's simplified bdev O_DIRECT code,
>> posted earlier today [1].
>
> Btw, can you create a stable branch just for the
> "block: add bio_iov_iter_get_pages()" so that we can use it as
> a baseline for the iomap dio work?

Sure

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
  2016-11-01 21:05 ` [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT Jens Axboe
@ 2016-11-02 14:54   ` Christoph Hellwig
  2016-11-03 12:27   ` Ming Lei
  2016-11-03 14:01   ` Bart Van Assche
  2 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-block, hch

On Tue, Nov 01, 2016 at 03:05:24PM -0600, Jens Axboe wrote:
> This patch enables a hybrid polling mode. Instead of polling after IO
> submission, we can induce an artificial delay, and then poll after that.
> For example, if the IO is presumed to complete in 8 usecs from now, we
> can sleep for 4 usecs, wake up, and then do our polling. This still puts
> a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
> after the IO has completed, it'll happen before. With this hybrid
> scheme, we can achieve big latency reductions while still using the same
> (or less) amount of CPU.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

This looks very nice.  I'll need to run some tests before giving the
formal ACK, though.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-02 14:52   ` Christoph Hellwig
@ 2016-11-02 14:55     ` Jens Axboe
  2016-11-02 14:59       ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2016-11-02 14:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-kernel, linux-block

On 11/02/2016 08:52 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
>> For legacy block, we simply track them in the request queue. For
>> blk-mq, we track them on a per-sw queue basis, which we can then
>> sum up through the hardware queues and finally to a per device
>> state.
>
> what is the use case for the legacy request tracking?

Buffered writeback code uses the same base. Additionally, it could
replace some user space tracking for latency outliers that people are
currently running.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-02 14:55     ` Jens Axboe
@ 2016-11-02 14:59       ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, axboe, linux-kernel, linux-block

On Wed, Nov 02, 2016 at 08:55:16AM -0600, Jens Axboe wrote:
> On 11/02/2016 08:52 AM, Christoph Hellwig wrote:
>> On Tue, Nov 01, 2016 at 03:05:22PM -0600, Jens Axboe wrote:
>>> For legacy block, we simply track them in the request queue. For
>>> blk-mq, we track them on a per-sw queue basis, which we can then
>>> sum up through the hardware queues and finally to a per device
>>> state.
>>
>> what is the use case for the legacy request tracking?
>
> Buffered writeback code uses the same base. Additionally, it could
> replace some user space tracking for latency outliers that people are
> currently running.

Ok.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
  2016-11-01 22:25   ` Johannes Thumshirn
  2016-11-02 14:52   ` Christoph Hellwig
@ 2016-11-03 11:17   ` Ming Lei
  2016-11-03 13:38     ` Jens Axboe
  2016-11-03 14:10   ` Bart Van Assche
  3 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2016-11-03 11:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On Wed, Nov 2, 2016 at 5:05 AM, Jens Axboe <axboe@fb.com> wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
>
> The stats are tracked in, roughly, 0.1s interval windows.
>
> Add sysfs files to display the stats.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/Makefile            |   2 +-
>  block/blk-core.c          |   4 +
>  block/blk-mq-sysfs.c      |  47 ++++++++++
>  block/blk-mq.c            |  14 +++
>  block/blk-mq.h            |   3 +
>  block/blk-stat.c          | 226 ++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-stat.h          |  37 ++++++++
>  block/blk-sysfs.c         |  26 ++++++
>  include/linux/blk_types.h |  16 ++++
>  include/linux/blkdev.h    |   4 +
>  10 files changed, 378 insertions(+), 1 deletion(-)
>  create mode 100644 block/blk-stat.c
>  create mode 100644 block/blk-stat.h
>
> diff --git a/block/Makefile b/block/Makefile
> index 934dac73fb37..2528c596f7ec 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -5,7 +5,7 @@
>  obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>                         blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
>                         blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
> -                       blk-lib.o blk-mq.o blk-mq-tag.o \
> +                       blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
>                         blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
>                         genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
>                         badblocks.o partitions/
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0bfaa54d3e9f..ca77c725b4e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>  {
>         blk_dequeue_request(req);
>
> +       blk_stat_set_issue_time(&req->issue_stat);
> +
>         /*
>          * We are now handing the request to the hardware, initialize
>          * resid_len to full count and add the timeout handler.
> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>
>         trace_block_rq_complete(req->q, req, nr_bytes);
>
> +       blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);

blk_update_request() is often called lockless, so it isn't good to
do it here.

> +
>         if (!req->bio)
>                 return false;
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 01fb455d3377..633c79a538ea 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>         return ret;
>  }
>
> +static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
> +{
> +       struct blk_mq_ctx *ctx;
> +       unsigned int i;
> +
> +       hctx_for_each_ctx(hctx, ctx, i) {
> +               blk_stat_init(&ctx->stat[0]);
> +               blk_stat_init(&ctx->stat[1]);
> +       }
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
> +                                         const char *page, size_t count)
> +{
> +       blk_mq_stat_clear(hctx);
> +       return count;
> +}
> +
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +       return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +                       pre, (long long) stat->nr_samples,
> +                       (long long) stat->mean, (long long) stat->min,
> +                       (long long) stat->max);
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
> +{
> +       struct blk_rq_stat stat[2];
> +       ssize_t ret;
> +
> +       blk_stat_init(&stat[0]);
> +       blk_stat_init(&stat[1]);
> +
> +       blk_hctx_stat_get(hctx, stat);
> +
> +       ret = print_stat(page, &stat[0], "read :");
> +       ret += print_stat(page + ret, &stat[1], "write:");
> +       return ret;
> +}
> +
>  static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
>         .attr = {.name = "dispatched", .mode = S_IRUGO },
>         .show = blk_mq_sysfs_dispatched_show,
> @@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
>         .show = blk_mq_hw_sysfs_poll_show,
>         .store = blk_mq_hw_sysfs_poll_store,
>  };
> +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
> +       .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
> +       .show = blk_mq_hw_sysfs_stat_show,
> +       .store = blk_mq_hw_sysfs_stat_store,
> +};
>
>  static struct attribute *default_hw_ctx_attrs[] = {
>         &blk_mq_hw_sysfs_queued.attr,
> @@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
>         &blk_mq_hw_sysfs_cpus.attr,
>         &blk_mq_hw_sysfs_active.attr,
>         &blk_mq_hw_sysfs_poll.attr,
> +       &blk_mq_hw_sysfs_stat.attr,
>         NULL,
>  };
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2da1a0ee3318..4555a76d22a7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -30,6 +30,7 @@
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-tag.h"
> +#include "blk-stat.h"
>
>  static DEFINE_MUTEX(all_q_mutex);
>  static LIST_HEAD(all_q_list);
> @@ -376,10 +377,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
>         put_cpu();
>  }
>
> +static void blk_mq_stat_add(struct request *rq)
> +{
> +       struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
> +
> +       blk_stat_add(stat, rq);
> +}
> +
>  static void __blk_mq_complete_request(struct request *rq)
>  {
>         struct request_queue *q = rq->q;
>
> +       blk_mq_stat_add(rq);

It is still possible for rqs belonging to the same sw queue to
complete on different CPUs, so this has the same issue as the legacy
queue path.

> +
>         if (!q->softirq_done_fn)
>                 blk_mq_end_request(rq, rq->errors);
>         else
> @@ -423,6 +433,8 @@ void blk_mq_start_request(struct request *rq)
>         if (unlikely(blk_bidi_rq(rq)))
>                 rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
>
> +       blk_stat_set_issue_time(&rq->issue_stat);
> +
>         blk_add_timer(rq);
>
>         /*
> @@ -1708,6 +1720,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>                 spin_lock_init(&__ctx->lock);
>                 INIT_LIST_HEAD(&__ctx->rq_list);
>                 __ctx->queue = q;
> +               blk_stat_init(&__ctx->stat[0]);
> +               blk_stat_init(&__ctx->stat[1]);
>
>                 /* If the cpu isn't online, the cpu is mapped to first hctx */
>                 if (!cpu_online(i))
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e5d25249028c..8cf16cb69f64 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -1,6 +1,8 @@
>  #ifndef INT_BLK_MQ_H
>  #define INT_BLK_MQ_H
>
> +#include "blk-stat.h"
> +
>  struct blk_mq_tag_set;
>
>  struct blk_mq_ctx {
> @@ -18,6 +20,7 @@ struct blk_mq_ctx {
>
>         /* incremented at completion time */
>         unsigned long           ____cacheline_aligned_in_smp rq_completed[2];
> +       struct blk_rq_stat      stat[2];
>
>         struct request_queue    *queue;
>         struct kobject          kobj;
> diff --git a/block/blk-stat.c b/block/blk-stat.c
> new file mode 100644
> index 000000000000..642afdc6d0f8
> --- /dev/null
> +++ b/block/blk-stat.c
> @@ -0,0 +1,226 @@
> +/*
> + * Block stat tracking code
> + *
> + * Copyright (C) 2016 Jens Axboe
> + */
> +#include <linux/kernel.h>
> +#include <linux/blk-mq.h>
> +
> +#include "blk-stat.h"
> +#include "blk-mq.h"
> +
> +static void blk_stat_flush_batch(struct blk_rq_stat *stat)
> +{
> +       if (!stat->nr_batch)
> +               return;
> +       if (!stat->nr_samples)
> +               stat->mean = div64_s64(stat->batch, stat->nr_batch);
> +       else {
> +               stat->mean = div64_s64((stat->mean * stat->nr_samples) +
> +                                       stat->batch,
> +                                       stat->nr_samples + stat->nr_batch);
> +       }
> +
> +       stat->nr_samples += stat->nr_batch;
> +       stat->nr_batch = stat->batch = 0;
> +}
> +
> +void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
> +{
> +       if (!src->nr_samples)
> +               return;
> +
> +       blk_stat_flush_batch(src);
> +
> +       dst->min = min(dst->min, src->min);
> +       dst->max = max(dst->max, src->max);
> +
> +       if (!dst->nr_samples)
> +               dst->mean = src->mean;
> +       else {
> +               dst->mean = div64_s64((src->mean * src->nr_samples) +
> +                                       (dst->mean * dst->nr_samples),
> +                                       dst->nr_samples + src->nr_samples);
> +       }
> +       dst->nr_samples += src->nr_samples;
> +}
> +
> +static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
> +{
> +       struct blk_mq_hw_ctx *hctx;
> +       struct blk_mq_ctx *ctx;
> +       uint64_t latest = 0;
> +       int i, j, nr;
> +
> +       blk_stat_init(&dst[0]);
> +       blk_stat_init(&dst[1]);
> +
> +       nr = 0;
> +       do {
> +               uint64_t newest = 0;
> +
> +               queue_for_each_hw_ctx(q, hctx, i) {
> +                       hctx_for_each_ctx(hctx, ctx, j) {
> +                               if (!ctx->stat[0].nr_samples &&
> +                                   !ctx->stat[1].nr_samples)
> +                                       continue;
> +                               if (ctx->stat[0].time > newest)
> +                                       newest = ctx->stat[0].time;
> +                               if (ctx->stat[1].time > newest)
> +                                       newest = ctx->stat[1].time;
> +                       }
> +               }
> +
> +               /*
> +                * No samples
> +                */
> +               if (!newest)
> +                       break;
> +
> +               if (newest > latest)
> +                       latest = newest;
> +
> +               queue_for_each_hw_ctx(q, hctx, i) {
> +                       hctx_for_each_ctx(hctx, ctx, j) {
> +                               if (ctx->stat[0].time == newest) {
> +                                       blk_stat_sum(&dst[0], &ctx->stat[0]);
> +                                       nr++;
> +                               }
> +                               if (ctx->stat[1].time == newest) {
> +                                       blk_stat_sum(&dst[1], &ctx->stat[1]);
> +                                       nr++;
> +                               }
> +                       }
> +               }
> +               /*
> +                * If we race on finding an entry, just loop back again.
> +                * Should be very rare.
> +                */
> +       } while (!nr);
> +
> +       dst[0].time = dst[1].time = latest;
> +}
> +
> +void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
> +{
> +       if (q->mq_ops)
> +               blk_mq_stat_get(q, dst);
> +       else {
> +               memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
> +               memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
> +       }
> +}
> +
> +void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
> +{
> +       struct blk_mq_ctx *ctx;
> +       unsigned int i, nr;
> +
> +       nr = 0;
> +       do {
> +               uint64_t newest = 0;
> +
> +               hctx_for_each_ctx(hctx, ctx, i) {
> +                       if (!ctx->stat[0].nr_samples &&
> +                           !ctx->stat[1].nr_samples)
> +                               continue;
> +
> +                       if (ctx->stat[0].time > newest)
> +                               newest = ctx->stat[0].time;
> +                       if (ctx->stat[1].time > newest)
> +                               newest = ctx->stat[1].time;
> +               }
> +
> +               if (!newest)
> +                       break;
> +
> +               hctx_for_each_ctx(hctx, ctx, i) {
> +                       if (ctx->stat[0].time == newest) {
> +                               blk_stat_sum(&dst[0], &ctx->stat[0]);
> +                               nr++;
> +                       }
> +                       if (ctx->stat[1].time == newest) {
> +                               blk_stat_sum(&dst[1], &ctx->stat[1]);
> +                               nr++;
> +                       }
> +               }
> +               /*
> +                * If we race on finding an entry, just loop back again.
> +                * Should be very rare, as the window is only updated
> +                * occasionally
> +                */
> +       } while (!nr);
> +}
> +
> +static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
> +{
> +       stat->min = -1ULL;
> +       stat->max = stat->nr_samples = stat->mean = 0;
> +       stat->batch = stat->nr_batch = 0;
> +       stat->time = time_now & BLK_STAT_NSEC_MASK;
> +}
> +
> +void blk_stat_init(struct blk_rq_stat *stat)
> +{
> +       __blk_stat_init(stat, ktime_to_ns(ktime_get()));
> +}
> +
> +static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
> +{
> +       return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
> +}
> +
> +bool blk_stat_is_current(struct blk_rq_stat *stat)
> +{
> +       return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
> +}
> +
> +void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
> +{
> +       s64 now, value;
> +
> +       now = __blk_stat_time(ktime_to_ns(ktime_get()));
> +       if (now < blk_stat_time(&rq->issue_stat))
> +               return;
> +
> +       if (!__blk_stat_is_current(stat, now))
> +               __blk_stat_init(stat, now);
> +
> +       value = now - blk_stat_time(&rq->issue_stat);
> +       if (value > stat->max)
> +               stat->max = value;
> +       if (value < stat->min)
> +               stat->min = value;
> +
> +       if (stat->batch + value < stat->batch ||
> +           stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
> +               blk_stat_flush_batch(stat);
> +
> +       stat->batch += value;
> +       stat->nr_batch++;
> +}
> +
> +void blk_stat_clear(struct request_queue *q)
> +{
> +       if (q->mq_ops) {
> +               struct blk_mq_hw_ctx *hctx;
> +               struct blk_mq_ctx *ctx;
> +               int i, j;
> +
> +               queue_for_each_hw_ctx(q, hctx, i) {
> +                       hctx_for_each_ctx(hctx, ctx, j) {
> +                               blk_stat_init(&ctx->stat[0]);
> +                               blk_stat_init(&ctx->stat[1]);
> +                       }
> +               }
> +       } else {
> +               blk_stat_init(&q->rq_stats[0]);
> +               blk_stat_init(&q->rq_stats[1]);
> +       }
> +}
> +
> +void blk_stat_set_issue_time(struct blk_issue_stat *stat)
> +{
> +       stat->time = (stat->time & BLK_STAT_MASK) |
> +                       (ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
> +}
> diff --git a/block/blk-stat.h b/block/blk-stat.h
> new file mode 100644
> index 000000000000..26b1f45dff26
> --- /dev/null
> +++ b/block/blk-stat.h
> @@ -0,0 +1,37 @@
> +#ifndef BLK_STAT_H
> +#define BLK_STAT_H
> +
> +/*
> + * ~0.13s window as a power-of-2 (2^27 nsecs)
> + */
> +#define BLK_STAT_NSEC          134217728ULL
> +#define BLK_STAT_NSEC_MASK     ~(BLK_STAT_NSEC - 1)
> +
> +/*
> + * Upper 3 bits can be used elsewhere
> + */
> +#define BLK_STAT_RES_BITS      3
> +#define BLK_STAT_SHIFT         (64 - BLK_STAT_RES_BITS)
> +#define BLK_STAT_TIME_MASK     ((1ULL << BLK_STAT_SHIFT) - 1)
> +#define BLK_STAT_MASK          ~BLK_STAT_TIME_MASK
> +
> +void blk_stat_add(struct blk_rq_stat *, struct request *);
> +void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
> +void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
> +void blk_stat_clear(struct request_queue *q);
> +void blk_stat_init(struct blk_rq_stat *);
> +void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
> +bool blk_stat_is_current(struct blk_rq_stat *);
> +void blk_stat_set_issue_time(struct blk_issue_stat *);
> +
> +static inline u64 __blk_stat_time(u64 time)
> +{
> +       return time & BLK_STAT_TIME_MASK;
> +}
> +
> +static inline u64 blk_stat_time(struct blk_issue_stat *stat)
> +{
> +       return __blk_stat_time(stat->time);
> +}
> +
> +#endif
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 488c2e28feb8..5bb4648f434a 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -401,6 +401,26 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
>         return queue_var_show(blk_queue_dax(q), page);
>  }
>
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +       return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +                       pre, (long long) stat->nr_samples,
> +                       (long long) stat->mean, (long long) stat->min,
> +                       (long long) stat->max);
> +}
> +
> +static ssize_t queue_stats_show(struct request_queue *q, char *page)
> +{
> +       struct blk_rq_stat stat[2];
> +       ssize_t ret;
> +
> +       blk_queue_stat_get(q, stat);
> +
> +       ret = print_stat(page, &stat[0], "read :");
> +       ret += print_stat(page + ret, &stat[1], "write:");
> +       return ret;
> +}
> +
>  static struct queue_sysfs_entry queue_requests_entry = {
>         .attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
>         .show = queue_requests_show,
> @@ -553,6 +573,11 @@ static struct queue_sysfs_entry queue_dax_entry = {
>         .show = queue_dax_show,
>  };
>
> +static struct queue_sysfs_entry queue_stats_entry = {
> +       .attr = {.name = "stats", .mode = S_IRUGO },
> +       .show = queue_stats_show,
> +};
> +
>  static struct attribute *default_attrs[] = {
>         &queue_requests_entry.attr,
>         &queue_ra_entry.attr,
> @@ -582,6 +607,7 @@ static struct attribute *default_attrs[] = {
>         &queue_poll_entry.attr,
>         &queue_wc_entry.attr,
>         &queue_dax_entry.attr,
> +       &queue_stats_entry.attr,
>         NULL,
>  };
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index bb921028e7c5..a59a214c39ae 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -248,4 +248,20 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
>         return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
>  }
>
> +struct blk_issue_stat {
> +       u64 time;
> +};
> +
> +#define BLK_RQ_STAT_BATCH      64
> +
> +struct blk_rq_stat {
> +       s64 mean;
> +       u64 min;
> +       u64 max;
> +       s32 nr_samples;
> +       s32 nr_batch;
> +       u64 batch;
> +       s64 time;
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8396da2bb698..dcd8d6e8801f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -197,6 +197,7 @@ struct request {
>         struct gendisk *rq_disk;
>         struct hd_struct *part;
>         unsigned long start_time;
> +       struct blk_issue_stat issue_stat;
>  #ifdef CONFIG_BLK_CGROUP
>         struct request_list *rl;                /* rl this rq is alloced from */
>         unsigned long long start_time_ns;
> @@ -490,6 +491,9 @@ struct request_queue {
>
>         unsigned int            nr_sorted;
>         unsigned int            in_flight[2];
> +
> +       struct blk_rq_stat      rq_stats[2];
> +
>         /*
>          * Number of active block driver functions for which blk_drain_queue()
>          * must wait. Must be incremented around functions that unlock the
> --
> 2.7.4
>



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
  2016-11-01 21:05 ` [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT Jens Axboe
  2016-11-02 14:54   ` Christoph Hellwig
@ 2016-11-03 12:27   ` Ming Lei
  2016-11-03 13:41     ` Jens Axboe
  2016-11-03 14:01   ` Bart Van Assche
  2 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2016-11-03 12:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On Wed, Nov 2, 2016 at 5:05 AM, Jens Axboe <axboe@fb.com> wrote:
> This patch enables a hybrid polling mode. Instead of polling after IO
> submission, we can induce an artificial delay, and then poll after that.
> For example, if the IO is presumed to complete in 8 usecs from now, we
> can sleep for 4 usecs, wake up, and then do our polling. This still puts

I guess in reality it isn't easy to figure out a perfect poll time:

- for a given driver, different CPUs and different drives/disks may see
different completion times

- for requests of different sizes, the completion time can be different too

Is there a way to figure out the poll time automatically?

> a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
> after the IO has completed, it'll happen before. With this hybrid
> scheme, we can achieve big latency reductions while still using the same
> (or less) amount of CPU.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/blk-mq.c         | 38 ++++++++++++++++++++++++++++++++++++++
>  block/blk-sysfs.c      | 29 +++++++++++++++++++++++++++++
>  block/blk.h            |  1 +
>  include/linux/blkdev.h |  1 +
>  4 files changed, 69 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4ef35588c299..caa55bec9411 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -302,6 +302,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
>         rq->rq_flags = 0;
>
>         clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> +       clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
>         blk_mq_put_tag(hctx, ctx, tag);
>         blk_queue_exit(q);
>  }
> @@ -2352,11 +2353,48 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
>
> +static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
> +                                    struct request *rq)
> +{
> +       struct hrtimer_sleeper hs;
> +       ktime_t kt;
> +
> +       if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
> +               return;
> +
> +       set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
> +
> +       /*
> +        * This will be replaced with the stats tracking code, using
> +        * 'avg_completion_time / 2' as the pre-sleep target.
> +        */
> +       kt = ktime_set(0, q->poll_nsec);
> +
> +       hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +       hrtimer_set_expires(&hs.timer, kt);
> +
> +       hrtimer_init_sleeper(&hs, current);
> +       do {
> +               if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> +                       break;
> +               set_current_state(TASK_INTERRUPTIBLE);
> +               hrtimer_start_expires(&hs.timer, HRTIMER_MODE_REL);
> +               if (hs.task)
> +                       io_schedule();
> +               hrtimer_cancel(&hs.timer);
> +       } while (hs.task && !signal_pending(current));
> +
> +       __set_current_state(TASK_RUNNING);
> +       destroy_hrtimer_on_stack(&hs.timer);
> +}
> +
>  bool blk_mq_poll(struct blk_mq_hw_ctx *hctx, struct request *rq)
>  {
>         struct request_queue *q = hctx->queue;
>         long state;
>
> +       blk_mq_poll_hybrid_sleep(q, rq);
> +
>         hctx->poll_considered++;
>
>         state = current->state;
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 5bb4648f434a..467b81c6713c 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -336,6 +336,28 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
>         return ret;
>  }
>
> +static ssize_t queue_poll_delay_show(struct request_queue *q, char *page)
> +{
> +       return queue_var_show(q->poll_nsec / 1000, page);
> +}
> +
> +static ssize_t queue_poll_delay_store(struct request_queue *q, const char *page,
> +                               size_t count)
> +{
> +       unsigned long poll_usec;
> +       ssize_t ret;
> +
> +       if (!q->mq_ops || !q->mq_ops->poll)
> +               return -EINVAL;
> +
> +       ret = queue_var_store(&poll_usec, page, count);
> +       if (ret < 0)
> +               return ret;
> +
> +       q->poll_nsec = poll_usec * 1000;
> +       return ret;
> +}
> +
>  static ssize_t queue_poll_show(struct request_queue *q, char *page)
>  {
>         return queue_var_show(test_bit(QUEUE_FLAG_POLL, &q->queue_flags), page);
> @@ -562,6 +584,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
>         .store = queue_poll_store,
>  };
>
> +static struct queue_sysfs_entry queue_poll_delay_entry = {
> +       .attr = {.name = "io_poll_delay", .mode = S_IRUGO | S_IWUSR },
> +       .show = queue_poll_delay_show,
> +       .store = queue_poll_delay_store,
> +};
> +
>  static struct queue_sysfs_entry queue_wc_entry = {
>         .attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
>         .show = queue_wc_show,
> @@ -608,6 +636,7 @@ static struct attribute *default_attrs[] = {
>         &queue_wc_entry.attr,
>         &queue_dax_entry.attr,
>         &queue_stats_entry.attr,
> +       &queue_poll_delay_entry.attr,
>         NULL,
>  };
>
> diff --git a/block/blk.h b/block/blk.h
> index aa132dea598c..041185e5f129 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -111,6 +111,7 @@ void blk_account_io_done(struct request *req);
>  enum rq_atomic_flags {
>         REQ_ATOM_COMPLETE = 0,
>         REQ_ATOM_STARTED,
> +       REQ_ATOM_POLL_SLEPT,
>  };
>
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index dcd8d6e8801f..6acd220dc3f3 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -502,6 +502,7 @@ struct request_queue {
>         unsigned int            request_fn_active;
>
>         unsigned int            rq_timeout;
> +       unsigned int            poll_nsec;
>         struct timer_list       timeout;
>         struct work_struct      timeout_work;
>         struct list_head        timeout_list;
> --
> 2.7.4
>



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-03 11:17   ` Ming Lei
@ 2016-11-03 13:38     ` Jens Axboe
  2016-11-03 14:57       ` Ming Lei
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2016-11-03 13:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On 11/03/2016 05:17 AM, Ming Lei wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 0bfaa54d3e9f..ca77c725b4e5 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>>  {
>>         blk_dequeue_request(req);
>>
>> +       blk_stat_set_issue_time(&req->issue_stat);
>> +
>>         /*
>>          * We are now handing the request to the hardware, initialize
>>          * resid_len to full count and add the timeout handler.
>> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>>
>>         trace_block_rq_complete(req->q, req, nr_bytes);
>>
>> +       blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
>
> blk_update_request() is often called lockless, so it isn't good to
> do it here.

It's not really a concern, not for the legacy path here nor the mq one
where it is per sw context. The collisions are rare enough that it'll
skew the latencies a bit for that short window, but then go away again.
I'd much rather take that, than adding locking for this part.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
  2016-11-03 12:27   ` Ming Lei
@ 2016-11-03 13:41     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-03 13:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On 11/03/2016 06:27 AM, Ming Lei wrote:
> On Wed, Nov 2, 2016 at 5:05 AM, Jens Axboe <axboe@fb.com> wrote:
>> This patch enables a hybrid polling mode. Instead of polling after IO
>> submission, we can induce an artificial delay, and then poll after that.
>> For example, if the IO is presumed to complete in 8 usecs from now, we
>> can sleep for 4 usecs, wake up, and then do our polling. This still puts
>
> I guess in reality it isn't easy to figure out a perfect poll time:
>
> - for a given driver, different CPUs and different drives/disks may see
> different completion times
>
> - for requests of different sizes, the completion time can be different too
>
> Is there a way to figure out the poll time automatically?

Yes, it's not easy to figure out the perfect time; the point is to try
and make a guess that's a bit better than the current "let's poll the
whole time". I suspect that for most real-world cases, you are going to
be polling for smallish IO of roughly the same size. Hence the stats
should be useful.

But we could extend the tracking a bit and make it smarter.
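
For illustration, a rough sketch of the direction hinted at by the comment in
the patch ('avg_completion_time / 2' as the pre-sleep target). The helper name
is made up; it assumes the blk_queue_stat_get() helper and the struct
blk_rq_stat fields from patch 1, and ignores details like stale stat windows:

static u64 blk_mq_poll_sleep_nsecs(struct request_queue *q,
				   struct request *rq)
{
	struct blk_rq_stat stat[2];

	/* an explicit delay was configured via the io_poll_delay file */
	if (q->poll_nsec)
		return q->poll_nsec;

	/*
	 * Otherwise derive the delay from the completion stats: sleep
	 * for half of the mean completion time observed for this data
	 * direction, then busy poll for the remainder.
	 */
	blk_queue_stat_get(q, stat);
	if (!stat[rq_data_dir(rq)].nr_samples)
		return 0;

	return stat[rq_data_dir(rq)].mean / 2;
}

blk_mq_poll_hybrid_sleep() would then feed the returned value into ktime_set()
instead of reading q->poll_nsec directly, and fall back to pure polling when
it returns 0.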


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
  2016-11-01 21:05 ` [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT Jens Axboe
  2016-11-02 14:54   ` Christoph Hellwig
  2016-11-03 12:27   ` Ming Lei
@ 2016-11-03 14:01   ` Bart Van Assche
  2016-11-03 14:15     ` Jens Axboe
  2 siblings, 1 reply; 27+ messages in thread
From: Bart Van Assche @ 2016-11-03 14:01 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-kernel, linux-block; +Cc: hch

On 11/01/2016 03:05 PM, Jens Axboe wrote:
> +static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
> +				     struct request *rq)
> +{
> +	struct hrtimer_sleeper hs;
> +	ktime_t kt;
> +
> +	if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags))
> +		return;
> +
> +	set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
> +
> +	/*
> +	 * This will be replaced with the stats tracking code, using
> +	 * 'avg_completion_time / 2' as the pre-sleep target.
> +	 */
> +	kt = ktime_set(0, q->poll_nsec);
> +
> +	hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hrtimer_set_expires(&hs.timer, kt);
> +
> +	hrtimer_init_sleeper(&hs, current);
> +	do {
> +		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
> +			break;
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		hrtimer_start_expires(&hs.timer, HRTIMER_MODE_REL);
> +		if (hs.task)
> +			io_schedule();
> +		hrtimer_cancel(&hs.timer);
> +	} while (hs.task && !signal_pending(current));
> +
> +	__set_current_state(TASK_RUNNING);
> +	destroy_hrtimer_on_stack(&hs.timer);
> +}

Hello Jens,

Will avg_completion_time/2 be a good choice for a polling interval if an 
application submits requests of varying sizes, e.g. 4 KB and 64 KB?

Can avg_completion_time be smaller than the context switch time? If so, 
do you think any other thread will be able to do any useful work after 
the timer has been started and before the timer has expired?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
                     ` (2 preceding siblings ...)
  2016-11-03 11:17   ` Ming Lei
@ 2016-11-03 14:10   ` Bart Van Assche
  2016-11-03 14:18     ` Jens Axboe
  3 siblings, 1 reply; 27+ messages in thread
From: Bart Van Assche @ 2016-11-03 14:10 UTC (permalink / raw)
  To: Jens Axboe, axboe, linux-kernel, linux-block; +Cc: hch

On 11/01/2016 03:05 PM, Jens Axboe wrote:
> +void blk_stat_init(struct blk_rq_stat *stat)
> +{
> +	__blk_stat_init(stat, ktime_to_ns(ktime_get()));
> +}
> +
> +static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
> +{
> +	return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
> +}
> +
> +bool blk_stat_is_current(struct blk_rq_stat *stat)
> +{
> +	return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
> +}

Hello Jens,

What is the performance impact of these patches? My experience is that 
introducing ktime_get() in the I/O path of high-performance I/O devices 
measurably slows down I/O. On https://lkml.org/lkml/2016/4/21/107 I read 
that a single ktime_get() call takes about 100 ns.

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT
  2016-11-03 14:01   ` Bart Van Assche
@ 2016-11-03 14:15     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-03 14:15 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-kernel, linux-block; +Cc: hch

On 11/03/2016 08:01 AM, Bart Van Assche wrote:
> On 11/01/2016 03:05 PM, Jens Axboe wrote:
>> +static void blk_mq_poll_hybrid_sleep(struct request_queue *q,
>> +                     struct request *rq)
>> +{
>> +    struct hrtimer_sleeper hs;
>> +    ktime_t kt;
>> +
>> +    if (!q->poll_nsec || test_bit(REQ_ATOM_POLL_SLEPT,
>> &rq->atomic_flags))
>> +        return;
>> +
>> +    set_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
>> +
>> +    /*
>> +     * This will be replaced with the stats tracking code, using
>> +     * 'avg_completion_time / 2' as the pre-sleep target.
>> +     */
>> +    kt = ktime_set(0, q->poll_nsec);
>> +
>> +    hrtimer_init_on_stack(&hs.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +    hrtimer_set_expires(&hs.timer, kt);
>> +
>> +    hrtimer_init_sleeper(&hs, current);
>> +    do {
>> +        if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
>> +            break;
>> +        set_current_state(TASK_INTERRUPTIBLE);
>> +        hrtimer_start_expires(&hs.timer, HRTIMER_MODE_REL);
>> +        if (hs.task)
>> +            io_schedule();
>> +        hrtimer_cancel(&hs.timer);
>> +    } while (hs.task && !signal_pending(current));
>> +
>> +    __set_current_state(TASK_RUNNING);
>> +    destroy_hrtimer_on_stack(&hs.timer);
>> +}
>
> Hello Jens,
>
> Will avg_completion_time/2 be a good choice for a polling interval if an
> application submits requests of varying sizes, e.g. 4 KB and 64 KB?

Not necessarily. This is a first implementation to demonstrate what is
possible; we can definitely make it more clever. As mentioned in the
previous email, I believe most cases will be small(ish) IO of roughly
the same size, and it'll work fine for that.

One possible improvement could be to factor in the minimum times as
well. Since we're assuming some level of predictability here,
incorporating the 'min' time could help too. This is one of the devices
I used for testing:

# cat stats
read : samples=15199, mean=5185, min=5068, max=25616
write: samples=0, mean=0, min=-1, max=0

and when it looks like that, then using mean/2 works well. If I do a
512/64k 50/50 split, it looks like this:

# cat stats
read : samples=4896, mean=20996, min=5250, max=49721
write: samples=0, mean=0, min=-1, max=0

Using classic polling with this workload, I get 39900 IOPS at 100%
CPU load. With hybrid polling enabled, I get 37700 IOPS at 68% CPU.
With the default settings, hybrid polling ends up being a bit slower
than pure polling (not unexpected), but at a higher level of
efficiency. Even for this 512b/64k split, that still holds.

For comparison, without polling, we run at 25800 IOPS (at 39% CPU).

> Can avg_completion_time be smaller than the context switch time? If so,
> do you think any other thread will be able to do any useful work after
> the timer has been started and before the timer has expired?

There are definitely cases where we want to just keep busy polling. I
didn't add any minimum yet, but yes, you can imagine cases where we can
blow our budget by doing the sleep up front. We just want to do the busy
poll for that case.
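
For illustration, such a lower bound could be a simple guard at the top of
blk_mq_poll_hybrid_sleep(); the constant is made up here, roughly the cost of
a sleep/wakeup cycle:

	/*
	 * If the expected wait is shorter than what the sleep/wakeup
	 * cycle itself costs, skip the hybrid sleep and let the caller
	 * busy poll the whole time instead.
	 */
	if (q->poll_nsec < BLK_MQ_POLL_MIN_SLEEP_NSEC)
		return;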

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-03 14:10   ` Bart Van Assche
@ 2016-11-03 14:18     ` Jens Axboe
  0 siblings, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-03 14:18 UTC (permalink / raw)
  To: Bart Van Assche, axboe, linux-kernel, linux-block; +Cc: hch

On 11/03/2016 08:10 AM, Bart Van Assche wrote:
> On 11/01/2016 03:05 PM, Jens Axboe wrote:
>> +void blk_stat_init(struct blk_rq_stat *stat)
>> +{
>> +    __blk_stat_init(stat, ktime_to_ns(ktime_get()));
>> +}
>> +
>> +static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
>> +{
>> +    return (now & BLK_STAT_NSEC_MASK) == (stat->time &
>> BLK_STAT_NSEC_MASK);
>> +}
>> +
>> +bool blk_stat_is_current(struct blk_rq_stat *stat)
>> +{
>> +    return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
>> +}
>
> Hello Jens,
>
> What is the performance impact of these patches? My experience is that
> introducing ktime_get() in the I/O path of high-performance I/O devices
> measurably slows down I/O. On https://lkml.org/lkml/2016/4/21/107 I read
> that a single ktime_get() call takes about 100 ns.

Hmm, on the testing I did, it didn't seem to have any noticeable
slowdown. If we do see a slowdown, we can look into enabling it only
when we need it.

Outside of the polling, my buffered writeback throttling patches also
use this stat tracking. For that patchset, it's easy enough to enable it
if we have wbt enabled. For polling, it's a bit more difficult. One easy
way would be to have a queue flag for it, and the first poll would
enable it unless it has been explicitly turned off.
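
For illustration, the lazy-enable idea could look roughly like this; both flag
names are made up for the sketch, and locking around queue_flags is glossed
over:

	/* in blk_mq_poll(), before any hybrid sleep or busy polling */
	if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags) &&
	    !test_bit(QUEUE_FLAG_NO_STATS, &q->queue_flags))
		set_bit(QUEUE_FLAG_STATS, &q->queue_flags);

	/*
	 * The completion side would then gate the blk_mq_stat_add()
	 * call on QUEUE_FLAG_STATS being set.
	 */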

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-03 13:38     ` Jens Axboe
@ 2016-11-03 14:57       ` Ming Lei
  2016-11-03 16:55         ` Jens Axboe
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2016-11-03 14:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On Thu, Nov 3, 2016 at 9:38 PM, Jens Axboe <axboe@fb.com> wrote:
> On 11/03/2016 05:17 AM, Ming Lei wrote:
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 0bfaa54d3e9f..ca77c725b4e5 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>>>  {
>>>         blk_dequeue_request(req);
>>>
>>> +       blk_stat_set_issue_time(&req->issue_stat);
>>> +
>>>         /*
>>>          * We are now handing the request to the hardware, initialize
>>>          * resid_len to full count and add the timeout handler.
>>> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int
>>> error, unsigned int nr_bytes)
>>>
>>>         trace_block_rq_complete(req->q, req, nr_bytes);
>>>
>>> +       blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
>>
>>
>> blk_update_request() is often called lockless, so it isn't good to
>> do it here.
>
>
> It's not really a concern, not for the legacy path here nor the mq one
> where it is per sw context. The collisions are rare enough that it'll

How do you reach the conclusion that the collisions are rare enough
when the counting becomes completely lockless?

Even if that is true, the statistics may still become a mess with rare
collisions.

> skew the latencies a bit for that short window, but then go away again.
> I'd much rather take that, than adding locking for this part.

For the legacy case, blk_stat_add() could be moved into blk_finish_request()
to avoid the collision.
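
For illustration, that would amount to moving the call into
blk_finish_request(), e.g. next to the existing blk_account_io_done() call
(sketch only):

	/*
	 * Account completion latency once per request, with the queue
	 * lock held, instead of in the lockless blk_update_request().
	 */
	blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);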


-- 
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-03 14:57       ` Ming Lei
@ 2016-11-03 16:55         ` Jens Axboe
  2016-11-04 23:13           ` Ming Lei
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2016-11-03 16:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On 11/03/2016 08:57 AM, Ming Lei wrote:
> On Thu, Nov 3, 2016 at 9:38 PM, Jens Axboe <axboe@fb.com> wrote:
>> On 11/03/2016 05:17 AM, Ming Lei wrote:
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 0bfaa54d3e9f..ca77c725b4e5 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>>>>  {
>>>>         blk_dequeue_request(req);
>>>>
>>>> +       blk_stat_set_issue_time(&req->issue_stat);
>>>> +
>>>>         /*
>>>>          * We are now handing the request to the hardware, initialize
>>>>          * resid_len to full count and add the timeout handler.
>>>> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int
>>>> error, unsigned int nr_bytes)
>>>>
>>>>         trace_block_rq_complete(req->q, req, nr_bytes);
>>>>
>>>> +       blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
>>>
>>>
>>> blk_update_request() is often called lockless, so it isn't good to
>>> do it here.
>>
>>
>> It's not really a concern, not for the legacy path here nor the mq one
>> where it is per sw context. The collisions are rare enough that it'll
>
> How do you reach the conclusion that the collisions are rare enough
> when the counting becomes completely lockless?

Of all the races I can spot, it basically boils down to accounting one
IO too few or too many.

> Even if that is true, the statistics may still become a mess with rare
> collisions.

How so? Not saying we could not improve it, but we're trading off
precision for scalability. My claim is that the existing code is good
enough. I've run a TON of testing on it, since I've used it for multiple
projects, and it's been solid.

>> skew the latencies a bit for that short window, but then go away again.
>> I'd much rather take that, than adding locking for this part.
>
> For the legacy case, blk_stat_add() could be moved into blk_finish_request()
> to avoid the collision.

Yes, that might be a good idea, since it doesn't cost us anything. For
the mq case, I'm hard pressed to think of areas where we could complete
IO in parallel on the same software queue. You'll never have a software
queue mapped to multiple hardware queues. So we should essentially be
serialized.

In short, I don't see any problems with this.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-03 16:55         ` Jens Axboe
@ 2016-11-04 23:13           ` Ming Lei
  2016-11-05 20:49             ` Jens Axboe
  2016-11-05 20:59             ` Jens Axboe
  0 siblings, 2 replies; 27+ messages in thread
From: Ming Lei @ 2016-11-04 23:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On Fri, Nov 4, 2016 at 12:55 AM, Jens Axboe <axboe@fb.com> wrote:
> On 11/03/2016 08:57 AM, Ming Lei wrote:
>>
>> On Thu, Nov 3, 2016 at 9:38 PM, Jens Axboe <axboe@fb.com> wrote:
>>>
>>> On 11/03/2016 05:17 AM, Ming Lei wrote:
>>>>>
>>>>>
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index 0bfaa54d3e9f..ca77c725b4e5 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>>>>>  {
>>>>>         blk_dequeue_request(req);
>>>>>
>>>>> +       blk_stat_set_issue_time(&req->issue_stat);
>>>>> +
>>>>>         /*
>>>>>          * We are now handing the request to the hardware, initialize
>>>>>          * resid_len to full count and add the timeout handler.
>>>>> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int
>>>>> error, unsigned int nr_bytes)
>>>>>
>>>>>         trace_block_rq_complete(req->q, req, nr_bytes);
>>>>>
>>>>> +       blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
>>>>
>>>>
>>>>
>>>> blk_update_request() is often called lockless, so it isn't good to
>>>> do it here.
>>>
>>>
>>>
>>> It's not really a concern, not for the legacy path here nor the mq one
>>> where it is per sw context. The collisions are rare enough that it'll
>>
>>
>> How do you reach the conclusion that the collisions are rare enough
>> when the counting becomes completely lockless?
>
>
> Of all the races I can spot, it basically boils down to accounting one
> IO too few or too many.
>
>> Even if that is true, the statistics may still become a mess with rare
>> collisions.
>
>
> How so? Not saying we could not improve it, but we're trading off
> precision for scalability. My claim is that the existing code is good
> enough. I've run a TON of testing on it, since I've used it for multiple
> projects, and it's been solid.

+static void blk_stat_flush_batch(struct blk_rq_stat *stat)
+{
+       if (!stat->nr_batch)
+               return;
+       if (!stat->nr_samples)
+               stat->mean = div64_s64(stat->batch, stat->nr_batch);

For example, suppose two requests (A and B) complete at the same time,
with A on CPU0 and B on CPU1.

If the last two writes in the function are reordered as observed from
CPU1, then while handling B, CPU1 can enter the branch above with
nr_samples still reading as zero while stat->nr_batch has just been
cleared to zero by CPU0, so the division by zero is triggered.


+       else {
+               stat->mean = div64_s64((stat->mean * stat->nr_samples) +
+                                       stat->batch,
+                                       stat->nr_samples + stat->nr_batch);
+       }

BTW, the 'if/else' above can be removed and 'stat->mean' can always be
computed the second way.

+
+       stat->nr_samples += stat->nr_batch;

one addition of stat->nr_batch here can be overwritten by a concurrent
update, and that batch may then never be accounted into stat->mean at all.

+       stat->nr_batch = stat->batch = 0;
+}

>
>>> skew the latencies a bit for that short window, but then go away again.
>>> I'd much rather take that, than adding locking for this part.
>>
>>
>> For the legacy case, blk_stat_add() could be moved into blk_finish_request()
>> to avoid the collision.
>
>
> Yes, that might be a good idea, since it doesn't cost us anything. For
> the mq case, I'm hard pressed to think of areas where we could complete
> IO in parallel on the same software queue. You'll never have a software
> queue mapped to multiple hardware queues. So we should essentially be
> serialized.

For blk-mq, blk_mq_stat_add() is called from __blk_mq_complete_request(),
which often runs from an interrupt handler, and the CPU serving the
interrupt can be different from the CPU that submitted on rq->mq_ctx.
There can also be several CPUs handling interrupts that originate from
the same sw queue.

BTW, I don't actually object to this patch, but maybe we should add a
comment about this kind of race, so that if someone later finds the
statistics are not accurate, they will understand why or can try to
improve it.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-04 23:13           ` Ming Lei
@ 2016-11-05 20:49             ` Jens Axboe
  2016-11-05 20:59             ` Jens Axboe
  1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-05 20:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On 11/04/2016 05:13 PM, Ming Lei wrote:
>>> Even if that is true, the statistics may still become a mess with rare
>>> collisions.
>>
>>
>> How so? Not saying we could not improve it, but we're trading off
>> precision for scalability. My claim is that the existing code is good
>> enough. I've run a TON of testing on it, since I've used it for multiple
>> projects, and it's been solid.
>
> +static void blk_stat_flush_batch(struct blk_rq_stat *stat)
> +{
> +       if (!stat->nr_batch)
> +               return;
> +       if (!stat->nr_samples)
> +               stat->mean = div64_s64(stat->batch, stat->nr_batch);
>
> For example, suppose two requests (A and B) complete at the same time,
> with A on CPU0 and B on CPU1.
>
> If the last two writes in the function are reordered as observed from
> CPU1, then while handling B, CPU1 can enter the branch above with
> nr_samples still reading as zero while stat->nr_batch has just been
> cleared to zero by CPU0, so the division by zero is triggered.

We should probably just have the nr_batch be a READ_ONCE(). I'm fine
with the stats being a bit off in the rare case of a collision, but we
can't have a divide-by-zero, obviously.

>
> +       else {
> +               stat->mean = div64_s64((stat->mean * stat->nr_samples) +
> +                                       stat->batch,
> +                                       stat->nr_samples + stat->nr_batch);
> +       }
>
> BTW, the 'if/else' above can be removed and 'stat->mean' can always be
> computed the second way.

True, they could be collapsed.
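
Putting the two points together, the flush helper could end up looking
something like this (a sketch, not a posted patch):

static void blk_stat_flush_batch(struct blk_rq_stat *stat)
{
	/*
	 * Snapshot the counters once, so a racing flush can't zero
	 * nr_batch between the check below and the division.
	 */
	const s32 nr_batch = READ_ONCE(stat->nr_batch);
	const s32 nr_samples = READ_ONCE(stat->nr_samples);

	if (!nr_batch)
		return;

	/*
	 * Single expression: with nr_samples == 0 this reduces to
	 * batch / nr_batch, so the separate branch isn't needed.
	 */
	stat->mean = div64_s64((stat->mean * nr_samples) + stat->batch,
			       nr_batch + nr_samples);

	stat->nr_samples = nr_samples + nr_batch;
	stat->nr_batch = stat->batch = 0;
}

That doesn't make the accounting precise under concurrent updates; it just
avoids the divide-by-zero.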

>> Yes, that might be a good idea, since it doesn't cost us anything. For
>> the mq case, I'm hard pressed to think of areas where we could complete
>> IO in parallel on the same software queue. You'll never have a software
>> queue mapped to multiple hardware queues. So we should essentially be
>> serialized.
>
> For blk-mq, blk_mq_stat_add() is called from __blk_mq_complete_request(),
> which often runs from an interrupt handler, and the CPU serving the
> interrupt can be different from the CPU that submitted on rq->mq_ctx.
> There can also be several CPUs handling interrupts that originate from
> the same sw queue.
>
> BTW, I don't actually object to this patch, but maybe we should add a
> comment about this kind of race, so that if someone later finds the
> statistics are not accurate, they will understand why or can try to
> improve it.

I'm fine with documenting it. For the two use cases I have, I'm fine
with it not being 100% stable. For by far the majority of the windows,
it'll be just fine. I'll fix the divide-by-zero, though.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] block: add scalable completion tracking of requests
  2016-11-04 23:13           ` Ming Lei
  2016-11-05 20:49             ` Jens Axboe
@ 2016-11-05 20:59             ` Jens Axboe
  1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2016-11-05 20:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block, Christoph Hellwig

On 11/04/2016 05:13 PM, Ming Lei wrote:
>> Yes, that might be a good idea, since it doesn't cost us anything.
>> For the mq case, I'm hard pressed to think of areas where we could
>> complete IO in parallel on the same software queue. You'll never have
>> a software queue mapped to multiple hardware queues. So we should
>> essentially be serialized.
>
> For blk-mq, blk_mq_stat_add() is called from __blk_mq_complete_request(),
> which often runs from an interrupt handler, and the CPU serving the
> interrupt can be different from the CPU that submitted on rq->mq_ctx.
> There can also be several CPUs handling interrupts that originate from
> the same sw queue.

BTW, one small improvement might be to call blk_mq_stat_add() on the
current ctx, in case it's different from rq->mq_ctx. That can happen if
we have multiple CPUs per hardware queue. In reality, even for that
case, a real race is rare. You'd have to rebalance interrupt masks
basically, at least on x86 where multiple CPUs in the IRQ affinity mask
still always trigger on the first one.

So while we could just grab the current ctx instead, I don't think it's
going to make a difference in practice.
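
For illustration, the completion-side accounting would then look roughly like
this (a sketch; it assumes the per-ctx stat[] array added by this patch and
the blk_mq_get_ctx()/blk_mq_put_ctx() helpers from blk-mq.h):

	struct blk_mq_ctx *ctx;

	/*
	 * Account on the ctx of the CPU handling the completion rather
	 * than on rq->mq_ctx, so two CPUs completing requests that were
	 * submitted from the same ctx don't update the same blk_rq_stat.
	 */
	ctx = blk_mq_get_ctx(rq->q);
	blk_stat_add(&ctx->stat[rq_data_dir(rq)], rq);
	blk_mq_put_ctx(ctx);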


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2016-11-01 21:05 [PATCHSET] block: IO polling improvements Jens Axboe
2016-11-01 21:05 ` [PATCH 1/4] block: add scalable completion tracking of requests Jens Axboe
2016-11-01 22:25   ` Johannes Thumshirn
2016-11-02  5:37     ` Jens Axboe
2016-11-02 14:52   ` Christoph Hellwig
2016-11-02 14:55     ` Jens Axboe
2016-11-02 14:59       ` Christoph Hellwig
2016-11-03 11:17   ` Ming Lei
2016-11-03 13:38     ` Jens Axboe
2016-11-03 14:57       ` Ming Lei
2016-11-03 16:55         ` Jens Axboe
2016-11-04 23:13           ` Ming Lei
2016-11-05 20:49             ` Jens Axboe
2016-11-05 20:59             ` Jens Axboe
2016-11-03 14:10   ` Bart Van Assche
2016-11-03 14:18     ` Jens Axboe
2016-11-01 21:05 ` [PATCH 2/4] block: move poll code to blk-mq Jens Axboe
2016-11-02 14:54   ` Christoph Hellwig
2016-11-01 21:05 ` [PATCH 3/4] blk-mq: implement hybrid poll mode for sync O_DIRECT Jens Axboe
2016-11-02 14:54   ` Christoph Hellwig
2016-11-03 12:27   ` Ming Lei
2016-11-03 13:41     ` Jens Axboe
2016-11-03 14:01   ` Bart Van Assche
2016-11-03 14:15     ` Jens Axboe
2016-11-01 21:05 ` [PATCH 4/4] blk-mq: make the polling code adaptive Jens Axboe
2016-11-02 14:51 ` [PATCHSET] block: IO polling improvements Christoph Hellwig
2016-11-02 14:54   ` Jens Axboe
