linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET] Throttled buffered writeback
@ 2016-11-01 21:08 Jens Axboe
  2016-11-01 21:08 ` [PATCH 1/8] block: add WRITE_BACKGROUND Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch

I have addressed the (small) review comments from Christoph, and
rebased it on top of for-4.10/block, since that now has the
flag unification and fs side cleanups as well. This impacted the
prep patches, and the wbt code.

I'd really like to get this merged for 4.10. It's block specific
at this point, and defaults to just being enabled for blk-mq
managed devices.

Let me know if there are any objections.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/8] block: add WRITE_BACKGROUND
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-02 14:55   ` Christoph Hellwig
  2016-11-05 22:27   ` Jan Kara
  2016-11-01 21:08 ` [PATCH 2/8] writeback: add wbc_to_write_flags() Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

This adds a new request flag, REQ_BACKGROUND, that callers can use to
tell the block layer that this is background (non-urgent) IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index bb921028e7c5..562ac46cb790 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -177,6 +177,7 @@ enum req_flag_bits {
 	__REQ_FUA,		/* forced unit access */
 	__REQ_PREFLUSH,		/* request for cache flush */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
+	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -192,6 +193,7 @@ enum req_flag_bits {
 #define REQ_FUA			(1ULL << __REQ_FUA)
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
+#define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/8] writeback: add wbc_to_write_flags()
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
  2016-11-01 21:08 ` [PATCH 1/8] block: add WRITE_BACKGROUND Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-02 14:56   ` Christoph Hellwig
  2016-11-01 21:08 ` [PATCH 3/8] writeback: mark background writeback as such Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c               | 2 +-
 fs/f2fs/data.c            | 2 +-
 fs/f2fs/node.c            | 2 +-
 fs/gfs2/meta_io.c         | 3 +--
 fs/mpage.c                | 2 +-
 fs/xfs/xfs_aops.c         | 8 ++------
 include/linux/writeback.h | 9 +++++++++
 mm/page_io.c              | 5 +----
 8 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index bc7c2bb30a9b..af5776da814a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr_underway = 0;
-	int write_flags = (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
+	int write_flags = wbc_to_write_flags(wbc);
 
 	head = create_page_buffers(page, inode,
 					(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b80bf10603d7..9e5561fa4cb6 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1249,7 +1249,7 @@ static int f2fs_write_data_page(struct page *page,
 		.sbi = sbi,
 		.type = DATA,
 		.op = REQ_OP_WRITE,
-		.op_flags = (wbc->sync_mode == WB_SYNC_ALL) ? REQ_SYNC : 0,
+		.op_flags = wbc_to_write_flags(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 932f3f8bb57b..d1e29deb4598 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1570,7 +1570,7 @@ static int f2fs_write_node_page(struct page *page,
 		.sbi = sbi,
 		.type = NODE,
 		.op = REQ_OP_WRITE,
-		.op_flags = (wbc->sync_mode == WB_SYNC_ALL) ? REQ_SYNC : 0,
+		.op_flags = wbc_to_write_flags(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index e562b1191c9c..49db8ef13fdf 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 {
 	struct buffer_head *bh, *head;
 	int nr_underway = 0;
-	int write_flags = REQ_META | REQ_PRIO |
-		(wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
+	int write_flags = REQ_META | REQ_PRIO | wbc_to_write_flags(wbc);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index f35e2819d0c6..98fc11aa7e0b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -489,7 +489,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
 	int ret = 0;
-	int op_flags = (wbc->sync_mode == WB_SYNC_ALL ? REQ_SYNC : 0);
+	int op_flags = wbc_to_write_flags(wbc);
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 594e02c485b2..6be5204a06d3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -495,9 +495,7 @@ xfs_submit_ioend(
 
 	ioend->io_bio->bi_private = ioend;
 	ioend->io_bio->bi_end_io = xfs_end_bio;
-	ioend->io_bio->bi_opf = REQ_OP_WRITE;
-	if (wbc->sync_mode == WB_SYNC_ALL)
-		ioend->io_bio->bi_opf |= REQ_SYNC;
+	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 
 	/*
 	 * If we are failing the IO now, just mark the ioend with an
@@ -569,9 +567,7 @@ xfs_chain_bio(
 
 	bio_chain(ioend->io_bio, new);
 	bio_get(ioend->io_bio);		/* for xfs_destroy_ioend */
-	ioend->io_bio->bi_opf = REQ_OP_WRITE;
-	if (wbc->sync_mode == WB_SYNC_ALL)
-		ioend->io_bio->bi_opf |= REQ_SYNC;
+	ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e4c38703bf4e..50c96ee8108f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,6 +9,7 @@
 #include <linux/fs.h>
 #include <linux/flex_proportions.h>
 #include <linux/backing-dev-defs.h>
+#include <linux/blk_types.h>
 
 struct bio;
 
@@ -102,6 +103,14 @@ struct writeback_control {
 #endif
 };
 
+static inline int wbc_to_write_flags(struct writeback_control *wbc)
+{
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return REQ_SYNC;
+
+	return 0;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in.  There always is one global
diff --git a/mm/page_io.c b/mm/page_io.c
index a2651f58c86a..23f6d0d3470f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -320,10 +320,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		ret = -ENOMEM;
 		goto out;
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
-		bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);
-	else
-		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+	bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 	count_vm_event(PSWPOUT);
 	set_page_writeback(page);
 	unlock_page(page);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/8] writeback: mark background writeback as such
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
  2016-11-01 21:08 ` [PATCH 1/8] block: add WRITE_BACKGROUND Jens Axboe
  2016-11-01 21:08 ` [PATCH 2/8] writeback: add wbc_to_write_flags() Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-02 14:56   ` Christoph Hellwig
  2016-11-05 22:26   ` Jan Kara
  2016-11-01 21:08 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

If we're doing background type writes, then use the appropriate
background write flags for that.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/writeback.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 50c96ee8108f..c78f9f0920b5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -107,6 +107,8 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
 {
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		return REQ_SYNC;
+	else if (wbc->for_kupdate || wbc->for_background)
+		return REQ_BACKGROUND;
 
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
                   ` (2 preceding siblings ...)
  2016-11-01 21:08 ` [PATCH 3/8] writeback: mark background writeback as such Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-02 14:57   ` Christoph Hellwig
  2016-11-08 13:02   ` Jan Kara
  2016-11-01 21:08 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

Note in the bdi_writeback structure whenever a task ends up sleeping
waiting for progress. We can use that information in the lower layers
to increase the priority of writes.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/backing-dev-defs.h | 2 ++
 mm/backing-dev.c                 | 1 +
 mm/page-writeback.c              | 1 +
 3 files changed, 4 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index c357f27d5483..dc5f76d7f648 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	unsigned long dirty_sleep;	/* last wait */
+
 	struct list_head bdi_node;	/* anchored at bdi->wb_list */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fde443f36d7..3bfed5ab2475 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	spin_lock_init(&wb->work_lock);
 	INIT_LIST_HEAD(&wb->work_list);
 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+	wb->dirty_sleep = jiffies;
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
 	if (!wb->congested)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 439cc63ad903..52e2f8e3b472 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1778,6 +1778,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 					  pause,
 					  start_time);
 		__set_current_state(TASK_KILLABLE);
+		wb->dirty_sleep = now;
 		io_schedule_timeout(pause);
 
 		current->dirty_paused_when = now + pause;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/8] block: add code to track actual device queue depth
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
                   ` (3 preceding siblings ...)
  2016-11-01 21:08 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-02 14:59   ` Christoph Hellwig
  2016-11-05 22:37   ` Jan Kara
  2016-11-01 21:08 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request with a 'queue_depth' member, that
drivers can call to more closely inform the block layer of the
real queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-settings.c   | 12 ++++++++++++
 drivers/scsi/scsi.c    |  3 +++
 include/linux/blkdev.h | 11 +++++++++++
 3 files changed, 26 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 55369a65dea2..9cf053759363 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -837,6 +837,18 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
 EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 
 /**
+ * blk_set_queue_depth - tell the block layer about the device queue depth
+ * @q:		the request queue for the device
+ * @depth:		queue depth
+ *
+ */
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+	q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
+/**
  * blk_queue_write_cache - configure queue's write cache
  * @q:		the request queue for the device
  * @wc:		write back cache on or off
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 1deb6adc411f..75455d4dab68 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
 		wmb();
 	}
 
+	if (sdev->request_queue)
+		blk_set_queue_depth(sdev->request_queue, depth);
+
 	return sdev->queue_depth;
 }
 EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8396da2bb698..0c677fb35ce4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -405,6 +405,8 @@ struct request_queue {
 	struct blk_mq_ctx __percpu	*queue_ctx;
 	unsigned int		nr_queues;
 
+	unsigned int		queue_depth;
+
 	/* hw dispatch queues */
 	struct blk_mq_hw_ctx	**queue_hw_ctx;
 	unsigned int		nr_hw_queues;
@@ -777,6 +779,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
 	return false;
 }
 
+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+	if (q->queue_depth)
+		return q->queue_depth;
+
+	return q->nr_requests;
+}
+
 /*
  * q->prep_rq_fn return values
  */
@@ -1093,6 +1103,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
 extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
                   ` (4 preceding siblings ...)
  2016-11-01 21:08 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-08 13:30   ` Jan Kara
  2016-11-01 21:08 ` [PATCH 7/8] blk-wbt: add general throttling mechanism Jens Axboe
  2016-11-01 21:08 ` [PATCH 8/8] block: hook up writeback throttling Jens Axboe
  7 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile            |   2 +-
 block/blk-core.c          |   4 +
 block/blk-mq-sysfs.c      |  47 ++++++++++
 block/blk-mq.c            |  14 +++
 block/blk-mq.h            |   3 +
 block/blk-stat.c          | 226 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-stat.h          |  37 ++++++++
 block/blk-sysfs.c         |  26 ++++++
 include/linux/blk_types.h |  16 ++++
 include/linux/blkdev.h    |   4 +
 10 files changed, 378 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 934dac73fb37..2528c596f7ec 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			blk-lib.o blk-mq.o blk-mq-tag.o \
+			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 0bfaa54d3e9f..ca77c725b4e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
+	blk_stat_set_issue_time(&req->issue_stat);
+
 	/*
 	 * We are now handing the request to the hardware, initialize
 	 * resid_len to full count and add the timeout handler.
@@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
 	trace_block_rq_complete(req->q, req, nr_bytes);
 
+	blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
 	if (!req->bio)
 		return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 01fb455d3377..633c79a538ea 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i;
+
+	hctx_for_each_ctx(hctx, ctx, i) {
+		blk_stat_init(&ctx->stat[0]);
+		blk_stat_init(&ctx->stat[1]);
+	}
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+					  const char *page, size_t count)
+{
+	blk_mq_stat_clear(hctx);
+	return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_stat_init(&stat[0]);
+	blk_stat_init(&stat[1]);
+
+	blk_hctx_stat_get(hctx, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.show = blk_mq_hw_sysfs_poll_show,
 	.store = blk_mq_hw_sysfs_poll_store,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+	.attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_stat_show,
+	.store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
@@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_stat.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2da1a0ee3318..4555a76d22a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -30,6 +30,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -376,10 +377,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static void blk_mq_stat_add(struct request *rq)
+{
+	struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
+
+	blk_stat_add(stat, rq);
+}
+
 static void __blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_mq_stat_add(rq);
+
 	if (!q->softirq_done_fn)
 		blk_mq_end_request(rq, rq->errors);
 	else
@@ -423,6 +433,8 @@ void blk_mq_start_request(struct request *rq)
 	if (unlikely(blk_bidi_rq(rq)))
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
+	blk_stat_set_issue_time(&rq->issue_stat);
+
 	blk_add_timer(rq);
 
 	/*
@@ -1708,6 +1720,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		spin_lock_init(&__ctx->lock);
 		INIT_LIST_HEAD(&__ctx->rq_list);
 		__ctx->queue = q;
+		blk_stat_init(&__ctx->stat[0]);
+		blk_stat_init(&__ctx->stat[1]);
 
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
 		if (!cpu_online(i))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e5d25249028c..8cf16cb69f64 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -1,6 +1,8 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include "blk-stat.h"
+
 struct blk_mq_tag_set;
 
 struct blk_mq_ctx {
@@ -18,6 +20,7 @@ struct blk_mq_ctx {
 
 	/* incremented at completion time */
 	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
+	struct blk_rq_stat	stat[2];
 
 	struct request_queue	*queue;
 	struct kobject		kobj;
diff --git a/block/blk-stat.c b/block/blk-stat.c
new file mode 100644
index 000000000000..642afdc6d0f8
--- /dev/null
+++ b/block/blk-stat.c
@@ -0,0 +1,226 @@
+/*
+ * Block stat tracking code
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/blk-mq.h>
+
+#include "blk-stat.h"
+#include "blk-mq.h"
+
+static void blk_stat_flush_batch(struct blk_rq_stat *stat)
+{
+	if (!stat->nr_batch)
+		return;
+	if (!stat->nr_samples)
+		stat->mean = div64_s64(stat->batch, stat->nr_batch);
+	else {
+		stat->mean = div64_s64((stat->mean * stat->nr_samples) +
+					stat->batch,
+					stat->nr_samples + stat->nr_batch);
+	}
+
+	stat->nr_samples += stat->nr_batch;
+	stat->nr_batch = stat->batch = 0;
+}
+
+void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
+{
+	if (!src->nr_samples)
+		return;
+
+	blk_stat_flush_batch(src);
+
+	dst->min = min(dst->min, src->min);
+	dst->max = max(dst->max, src->max);
+
+	if (!dst->nr_samples)
+		dst->mean = src->mean;
+	else {
+		dst->mean = div64_s64((src->mean * src->nr_samples) +
+					(dst->mean * dst->nr_samples),
+					dst->nr_samples + src->nr_samples);
+	}
+	dst->nr_samples += src->nr_samples;
+}
+
+static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	uint64_t latest = 0;
+	int i, j, nr;
+
+	blk_stat_init(&dst[0]);
+	blk_stat_init(&dst[1]);
+
+	nr = 0;
+	do {
+		uint64_t newest = 0;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				if (!ctx->stat[0].nr_samples &&
+				    !ctx->stat[1].nr_samples)
+					continue;
+				if (ctx->stat[0].time > newest)
+					newest = ctx->stat[0].time;
+				if (ctx->stat[1].time > newest)
+					newest = ctx->stat[1].time;
+			}
+		}
+
+		/*
+		 * No samples
+		 */
+		if (!newest)
+			break;
+
+		if (newest > latest)
+			latest = newest;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				if (ctx->stat[0].time == newest) {
+					blk_stat_sum(&dst[0], &ctx->stat[0]);
+					nr++;
+				}
+				if (ctx->stat[1].time == newest) {
+					blk_stat_sum(&dst[1], &ctx->stat[1]);
+					nr++;
+				}
+			}
+		}
+		/*
+		 * If we race on finding an entry, just loop back again.
+		 * Should be very rare.
+		 */
+	} while (!nr);
+
+	dst[0].time = dst[1].time = latest;
+}
+
+void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+	if (q->mq_ops)
+		blk_mq_stat_get(q, dst);
+	else {
+		memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
+		memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
+	}
+}
+
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i, nr;
+
+	nr = 0;
+	do {
+		uint64_t newest = 0;
+
+		hctx_for_each_ctx(hctx, ctx, i) {
+			if (!ctx->stat[0].nr_samples &&
+			    !ctx->stat[1].nr_samples)
+				continue;
+
+			if (ctx->stat[0].time > newest)
+				newest = ctx->stat[0].time;
+			if (ctx->stat[1].time > newest)
+				newest = ctx->stat[1].time;
+		}
+
+		if (!newest)
+			break;
+
+		hctx_for_each_ctx(hctx, ctx, i) {
+			if (ctx->stat[0].time == newest) {
+				blk_stat_sum(&dst[0], &ctx->stat[0]);
+				nr++;
+			}
+			if (ctx->stat[1].time == newest) {
+				blk_stat_sum(&dst[1], &ctx->stat[1]);
+				nr++;
+			}
+		}
+		/*
+		 * If we race on finding an entry, just loop back again.
+		 * Should be very rare, as the window is only updated
+		 * occasionally
+		 */
+	} while (!nr);
+}
+
+static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
+{
+	stat->min = -1ULL;
+	stat->max = stat->nr_samples = stat->mean = 0;
+	stat->batch = stat->nr_batch = 0;
+	stat->time = time_now & BLK_STAT_NSEC_MASK;
+}
+
+void blk_stat_init(struct blk_rq_stat *stat)
+{
+	__blk_stat_init(stat, ktime_to_ns(ktime_get()));
+}
+
+static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
+{
+	return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
+}
+
+bool blk_stat_is_current(struct blk_rq_stat *stat)
+{
+	return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
+}
+
+void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
+{
+	s64 now, value;
+
+	now = __blk_stat_time(ktime_to_ns(ktime_get()));
+	if (now < blk_stat_time(&rq->issue_stat))
+		return;
+
+	if (!__blk_stat_is_current(stat, now))
+		__blk_stat_init(stat, now);
+
+	value = now - blk_stat_time(&rq->issue_stat);
+	if (value > stat->max)
+		stat->max = value;
+	if (value < stat->min)
+		stat->min = value;
+
+	if (stat->batch + value < stat->batch ||
+	    stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
+		blk_stat_flush_batch(stat);
+
+	stat->batch += value;
+	stat->nr_batch++;
+}
+
+void blk_stat_clear(struct request_queue *q)
+{
+	if (q->mq_ops) {
+		struct blk_mq_hw_ctx *hctx;
+		struct blk_mq_ctx *ctx;
+		int i, j;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				blk_stat_init(&ctx->stat[0]);
+				blk_stat_init(&ctx->stat[1]);
+			}
+		}
+	} else {
+		blk_stat_init(&q->rq_stats[0]);
+		blk_stat_init(&q->rq_stats[1]);
+	}
+}
+
+void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+{
+	stat->time = (stat->time & BLK_STAT_MASK) |
+			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+}
diff --git a/block/blk-stat.h b/block/blk-stat.h
new file mode 100644
index 000000000000..26b1f45dff26
--- /dev/null
+++ b/block/blk-stat.h
@@ -0,0 +1,37 @@
+#ifndef BLK_STAT_H
+#define BLK_STAT_H
+
+/*
+ * ~0.13s window as a power-of-2 (2^27 nsecs)
+ */
+#define BLK_STAT_NSEC		134217728ULL
+#define BLK_STAT_NSEC_MASK	~(BLK_STAT_NSEC - 1)
+
+/*
+ * Upper 3 bits can be used elsewhere
+ */
+#define BLK_STAT_RES_BITS	3
+#define BLK_STAT_SHIFT		(64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
+#define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
+
+void blk_stat_add(struct blk_rq_stat *, struct request *);
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
+void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
+void blk_stat_clear(struct request_queue *q);
+void blk_stat_init(struct blk_rq_stat *);
+void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
+bool blk_stat_is_current(struct blk_rq_stat *);
+void blk_stat_set_issue_time(struct blk_issue_stat *);
+
+static inline u64 __blk_stat_time(u64 time)
+{
+	return time & BLK_STAT_TIME_MASK;
+}
+
+static inline u64 blk_stat_time(struct blk_issue_stat *stat)
+{
+	return __blk_stat_time(stat->time);
+}
+
+#endif
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 488c2e28feb8..5bb4648f434a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -401,6 +401,26 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
 	return queue_var_show(blk_queue_dax(q), page);
 }
 
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t queue_stats_show(struct request_queue *q, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_queue_stat_get(q, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -553,6 +573,11 @@ static struct queue_sysfs_entry queue_dax_entry = {
 	.show = queue_dax_show,
 };
 
+static struct queue_sysfs_entry queue_stats_entry = {
+	.attr = {.name = "stats", .mode = S_IRUGO },
+	.show = queue_stats_show,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -582,6 +607,7 @@ static struct attribute *default_attrs[] = {
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
+	&queue_stats_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 562ac46cb790..4d0044d09984 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -250,4 +250,20 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
 }
 
+struct blk_issue_stat {
+	u64 time;
+};
+
+#define BLK_RQ_STAT_BATCH	64
+
+struct blk_rq_stat {
+	s64 mean;
+	u64 min;
+	u64 max;
+	s32 nr_samples;
+	s32 nr_batch;
+	u64 batch;
+	s64 time;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0c677fb35ce4..6bd5eb56894e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -197,6 +197,7 @@ struct request {
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
 	unsigned long start_time;
+	struct blk_issue_stat issue_stat;
 #ifdef CONFIG_BLK_CGROUP
 	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
@@ -492,6 +493,9 @@ struct request_queue {
 
 	unsigned int		nr_sorted;
 	unsigned int		in_flight[2];
+
+	struct blk_rq_stat	rq_stats[2];
+
 	/*
 	 * Number of active block driver functions for which blk_drain_queue()
 	 * must wait. Must be incremented around functions that unlock the
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
                   ` (5 preceding siblings ...)
  2016-11-01 21:08 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-08 13:39   ` Jan Kara
  2016-11-01 21:08 ` [PATCH 8/8] block: hook up writeback throttling Jens Axboe
  7 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

We can hook this up to the block layer, to help throttle buffered
writes.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
               wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile             |   1 +
 block/blk-wbt.c            | 704 +++++++++++++++++++++++++++++++++++++++++++++
 block/blk-wbt.h            | 166 +++++++++++
 include/trace/events/wbt.h | 153 ++++++++++
 4 files changed, 1024 insertions(+)
 create mode 100644 block/blk-wbt.c
 create mode 100644 block/blk-wbt.h
 create mode 100644 include/trace/events/wbt.h

diff --git a/block/Makefile b/block/Makefile
index 2528c596f7ec..a827f988c4e6 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
+obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
new file mode 100644
index 000000000000..1b1d67aae1d3
--- /dev/null
+++ b/block/blk-wbt.c
@@ -0,0 +1,704 @@
+/*
+ * buffered writeback throttling. loosely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ *   scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ *   look like, retain status quo.
+ * - If latencies look good, decrement scaling step.
+ * - If we're only doing writes, allow the scaling step to go negative. This
+ *   will temporarily boost write performance, snapping back to a stable
+ *   scaling step of 0 if reads show up or the heavy writers finish. Unlike
+ *   positive scaling steps where we shrink the monitoring window, a negative
+ *   scaling step retains the default step==0 window size.
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/slab.h>
+#include <linux/backing-dev.h>
+#include <linux/swap.h>
+
+#include "blk-wbt.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/wbt.h>
+
+enum {
+	/*
+	 * Default setting, we'll scale up (to 75% of QD max) or down (min 1)
+	 * from here depending on device stats
+	 */
+	RWB_DEF_DEPTH	= 16,
+
+	/*
+	 * 100msec window
+	 */
+	RWB_WINDOW_NSEC		= 100 * 1000 * 1000ULL,
+
+	/*
+	 * Disregard stats, if we don't meet this minimum
+	 */
+	RWB_MIN_WRITE_SAMPLES	= 3,
+
+	/*
+	 * If we have this number of consecutive windows with not enough
+	 * information to scale up or down, scale up.
+	 */
+	RWB_UNKNOWN_BUMP	= 5,
+};
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+	return rwb && rwb->wb_normal != 0;
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+	int cur = atomic_read(v);
+
+	for (;;) {
+		int old;
+
+		if (cur >= below)
+			return false;
+		old = atomic_cmpxchg(v, cur, cur + 1);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+
+	return true;
+}
+
+static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
+{
+	if (rwb_enabled(rwb)) {
+		const unsigned long cur = jiffies;
+
+		if (cur != *var)
+			*var = cur;
+	}
+}
+
+/*
+ * If a task was rate throttled in balance_dirty_pages() within the last
+ * second or so, use that to indicate a higher cleaning rate.
+ */
+static bool wb_recent_wait(struct rq_wb *rwb)
+{
+	struct bdi_writeback *wb = &rwb->bdi->wb;
+
+	return time_before(jiffies, wb->dirty_sleep + HZ);
+}
+
+static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
+{
+	return &rwb->rq_wait[is_kswapd];
+}
+
+static void rwb_wake_all(struct rq_wb *rwb)
+{
+	int i;
+
+	for (i = 0; i < WBT_NUM_RWQ; i++) {
+		struct rq_wait *rqw = &rwb->rq_wait[i];
+
+		if (waitqueue_active(&rqw->wait))
+			wake_up_all(&rqw->wait);
+	}
+}
+
+void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
+{
+	struct rq_wait *rqw;
+	int inflight, limit;
+
+	if (!(wb_acct & WBT_TRACKED))
+		return;
+
+	rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
+	inflight = atomic_dec_return(&rqw->inflight);
+
+	/*
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
+	 */
+	if (unlikely(!rwb_enabled(rwb))) {
+		rwb_wake_all(rwb);
+		return;
+	}
+
+	/*
+	 * If the device does write back caching, drop further down
+	 * before we wake people up.
+	 */
+	if (rwb->wc && !wb_recent_wait(rwb))
+		limit = 0;
+	else
+		limit = rwb->wb_normal;
+
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
+	if (waitqueue_active(&rqw->wait)) {
+		int diff = limit - inflight;
+
+		if (!inflight || diff >= rwb->wb_background / 2)
+			wake_up(&rqw->wait);
+	}
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void wbt_done(struct rq_wb *rwb, struct blk_issue_stat *stat)
+{
+	if (!rwb)
+		return;
+
+	if (!wbt_is_tracked(stat)) {
+		if (rwb->sync_cookie == stat) {
+			rwb->sync_issue = 0;
+			rwb->sync_cookie = NULL;
+		}
+
+		if (wbt_is_read(stat))
+			wb_timestamp(rwb, &rwb->last_comp);
+		wbt_clear_state(stat);
+	} else {
+		WARN_ON_ONCE(stat == rwb->sync_cookie);
+		__wbt_done(rwb, wbt_stat_to_mask(stat));
+		wbt_clear_state(stat);
+	}
+}
+
+/*
+ * Return true, if we can't increase the depth further by scaling
+ */
+static bool calc_wb_limits(struct rq_wb *rwb)
+{
+	unsigned int depth;
+	bool ret = false;
+
+	if (!rwb->min_lat_nsec) {
+		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
+		return false;
+	}
+
+	/*
+	 * For QD=1 devices, this is a special case. It's important for those
+	 * to have one request ready when one completes, so force a depth of
+	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+	 * since the device can't have more than that in flight. If we're
+	 * scaling down, then keep a setting of 1/1/1.
+	 */
+	if (rwb->queue_depth == 1) {
+		if (rwb->scale_step > 0)
+			rwb->wb_max = rwb->wb_normal = 1;
+		else {
+			rwb->wb_max = rwb->wb_normal = 2;
+			ret = true;
+		}
+		rwb->wb_background = 1;
+	} else {
+		/*
+		 * scale_step == 0 is our default state. If we have suffered
+		 * latency spikes, step will be > 0, and we shrink the
+		 * allowed write depths. If step is < 0, we're only doing
+		 * writes, and we allow a temporarily higher depth to
+		 * increase performance.
+		 */
+		depth = min_t(unsigned int, RWB_DEF_DEPTH, rwb->queue_depth);
+		if (rwb->scale_step > 0)
+			depth = 1 + ((depth - 1) >> min(31, rwb->scale_step));
+		else if (rwb->scale_step < 0) {
+			unsigned int maxd = 3 * rwb->queue_depth / 4;
+
+			depth = 1 + ((depth - 1) << -rwb->scale_step);
+			if (depth > maxd) {
+				depth = maxd;
+				ret = true;
+			}
+		}
+
+		/*
+		 * Set our max/normal/bg queue depths based on how far
+		 * we have scaled down (->scale_step).
+		 */
+		rwb->wb_max = depth;
+		rwb->wb_normal = (rwb->wb_max + 1) / 2;
+		rwb->wb_background = (rwb->wb_max + 3) / 4;
+	}
+
+	return ret;
+}
+
+static bool inline stat_sample_valid(struct blk_rq_stat *stat)
+{
+	/*
+	 * We need at least one read sample, and a minimum of
+	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
+	 * that it's writes impacting us, and not just some sole read on
+	 * a device that is in a lower power state.
+	 */
+	return stat[0].nr_samples >= 1 &&
+		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
+}
+
+static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
+{
+	u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
+
+	if (!issue || !rwb->sync_cookie)
+		return 0;
+
+	now = ktime_to_ns(ktime_get());
+	return now - issue;
+}
+
+enum {
+	LAT_OK = 1,
+	LAT_UNKNOWN,
+	LAT_UNKNOWN_WRITES,
+	LAT_EXCEEDED,
+};
+
+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
+{
+	u64 thislat;
+
+	/*
+	 * If our stored sync issue exceeds the window size, or it
+	 * exceeds our min target AND we haven't logged any entries,
+	 * flag the latency as exceeded. wbt works off completion latencies,
+	 * but for a flooded device, a single sync IO can take a long time
+	 * to complete after being issued. If this time exceeds our
+	 * monitoring window AND we didn't see any other completions in that
+	 * window, then count that sync IO as a violation of the latency.
+	 */
+	thislat = rwb_sync_issue_lat(rwb);
+	if (thislat > rwb->cur_win_nsec ||
+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
+		trace_wbt_lat(rwb->bdi, thislat);
+		return LAT_EXCEEDED;
+	}
+
+	/*
+	 * No read/write mix, if stat isn't valid
+	 */
+	if (!stat_sample_valid(stat)) {
+		/*
+		 * If we had writes in this stat window and the window is
+		 * current, we're only doing writes. If a task recently
+		 * waited or still has writes in flights, consider us doing
+		 * just writes as well.
+		 */
+		if ((stat[1].nr_samples && rwb->stat_ops->is_current(stat)) ||
+		    wb_recent_wait(rwb) || wbt_inflight(rwb))
+			return LAT_UNKNOWN_WRITES;
+		return LAT_UNKNOWN;
+	}
+
+	/*
+	 * If the 'min' latency exceeds our target, step down.
+	 */
+	if (stat[0].min > rwb->min_lat_nsec) {
+		trace_wbt_lat(rwb->bdi, stat[0].min);
+		trace_wbt_stat(rwb->bdi, stat);
+		return LAT_EXCEEDED;
+	}
+
+	if (rwb->scale_step)
+		trace_wbt_stat(rwb->bdi, stat);
+
+	return LAT_OK;
+}
+
+static int latency_exceeded(struct rq_wb *rwb)
+{
+	struct blk_rq_stat stat[2];
+
+	rwb->stat_ops->get(rwb->ops_data, stat);
+	return __latency_exceeded(rwb, stat);
+}
+
+static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
+{
+	trace_wbt_step(rwb->bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
+			rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+}
+
+static void scale_up(struct rq_wb *rwb)
+{
+	/*
+	 * Hit max in previous round, stop here
+	 */
+	if (rwb->scaled_max)
+		return;
+
+	rwb->scale_step--;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+
+	rwb->scaled_max = calc_wb_limits(rwb);
+
+	rwb_wake_all(rwb);
+
+	rwb_trace_step(rwb, "step up");
+}
+
+/*
+ * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
+ * had a latency violation.
+ */
+static void scale_down(struct rq_wb *rwb, bool hard_throttle)
+{
+	/*
+	 * Stop scaling down when we've hit the limit. This also prevents
+	 * ->scale_step from going to crazy values, if the device can't
+	 * keep up.
+	 */
+	if (rwb->wb_max == 1)
+		return;
+
+	if (rwb->scale_step < 0 && hard_throttle)
+		rwb->scale_step = 0;
+	else
+		rwb->scale_step++;
+
+	rwb->scaled_max = false;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+	calc_wb_limits(rwb);
+	rwb_trace_step(rwb, "step down");
+}
+
+static void rwb_arm_timer(struct rq_wb *rwb)
+{
+	unsigned long expires;
+
+	if (rwb->scale_step > 0) {
+		/*
+		 * We should speed this up, using some variant of a fast
+		 * integer inverse square root calculation. Since we only do
+		 * this for every window expiration, it's not a huge deal,
+		 * though.
+		 */
+		rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
+					int_sqrt((rwb->scale_step + 1) << 8));
+	} else {
+		/*
+		 * For step < 0, we don't want to increase/decrease the
+		 * window size.
+		 */
+		rwb->cur_win_nsec = rwb->win_nsec;
+	}
+
+	expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec);
+	mod_timer(&rwb->window_timer, expires);
+}
+
+static void wb_timer_fn(unsigned long data)
+{
+	struct rq_wb *rwb = (struct rq_wb *) data;
+	unsigned int inflight = wbt_inflight(rwb);
+	int status;
+
+	status = latency_exceeded(rwb);
+
+	trace_wbt_timer(rwb->bdi, status, rwb->scale_step, inflight);
+
+	/*
+	 * If we exceeded the latency target, step down. If we did not,
+	 * step one level up. If we don't know enough to say either exceeded
+	 * or ok, then don't do anything.
+	 */
+	switch (status) {
+	case LAT_EXCEEDED:
+		scale_down(rwb, true);
+		break;
+	case LAT_OK:
+		scale_up(rwb);
+		break;
+	case LAT_UNKNOWN_WRITES:
+		scale_up(rwb);
+		break;
+	case LAT_UNKNOWN:
+		if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
+			break;
+		/*
+		 * We get here for two reasons:
+		 *
+		 * 1) We previously scaled reduced depth, and we currently
+		 *    don't have a valid read/write sample. For that case,
+		 *    slowly return to center state (step == 0).
+		 * 2) We started a the center step, but don't have a valid
+		 *    read/write sample, but we do have writes going on.
+		 *    Allow step to go negative, to increase write perf.
+		 */
+		if (rwb->scale_step > 0)
+			scale_up(rwb);
+		else if (rwb->scale_step < 0)
+			scale_down(rwb, false);
+		break;
+	default:
+		break;
+	}
+
+	/*
+	 * Re-arm timer, if we have IO in flight
+	 */
+	if (rwb->scale_step || inflight)
+		rwb_arm_timer(rwb);
+}
+
+void wbt_update_limits(struct rq_wb *rwb)
+{
+	rwb->scale_step = 0;
+	rwb->scaled_max = false;
+	calc_wb_limits(rwb);
+
+	rwb_wake_all(rwb);
+}
+
+static bool close_io(struct rq_wb *rwb)
+{
+	const unsigned long now = jiffies;
+
+	return time_before(now, rwb->last_issue + HZ / 10) ||
+		time_before(now, rwb->last_comp + HZ / 10);
+}
+
+#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO)
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
+{
+	unsigned int limit;
+
+	/*
+	 * At this point we know it's a buffered write. If this is
+	 * kswapd trying to free memory, or REQ_SYNC is set, set, then
+	 * it's WB_SYNC_ALL writeback, and we'll use the max limit for
+	 * that. If the write is marked as a background write, then use
+	 * the idle limit, or go to normal if we haven't had competing
+	 * IO for a bit.
+	 */
+	if ((rw & REQ_HIPRIO) || wb_recent_wait(rwb) || current_is_kswapd())
+		limit = rwb->wb_max;
+	else if ((rw & REQ_BACKGROUND) || close_io(rwb)) {
+		/*
+		 * If less than 100ms since we completed unrelated IO,
+		 * limit us to half the depth for background writeback.
+		 */
+		limit = rwb->wb_background;
+	} else
+		limit = rwb->wb_normal;
+
+	return limit;
+}
+
+static inline bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
+			     unsigned long rw)
+{
+	/*
+	 * inc it here even if disabled, since we'll dec it at completion.
+	 * this only happens if the task was sleeping in __wbt_wait(),
+	 * and someone turned it off at the same time.
+	 */
+	if (!rwb_enabled(rwb)) {
+		atomic_inc(&rqw->inflight);
+		return true;
+	}
+
+	return atomic_inc_below(&rqw->inflight, get_limit(rwb, rw));
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+{
+	struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
+	DEFINE_WAIT(wait);
+
+	if (may_queue(rwb, rqw, rw))
+		return;
+
+	do {
+		prepare_to_wait_exclusive(&rqw->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+
+		if (may_queue(rwb, rqw, rw))
+			break;
+
+		if (lock)
+			spin_unlock_irq(lock);
+
+		io_schedule();
+
+		if (lock)
+			spin_lock_irq(lock);
+	} while (1);
+
+	finish_wait(&rqw->wait, &wait);
+}
+
+static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
+{
+	const int op = bio_op(bio);
+
+	/*
+	 * If not a WRITE (or a discard), do nothing
+	 */
+	if (!(op == REQ_OP_WRITE || op == REQ_OP_DISCARD))
+		return false;
+
+	/*
+	 * Don't throttle WRITE_ODIRECT
+	 */
+	if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
+		return false;
+
+	return true;
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+unsigned int wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+{
+	unsigned int ret = 0;
+
+	if (!rwb_enabled(rwb))
+		return 0;
+
+	if (bio_op(bio) == REQ_OP_READ)
+		ret = WBT_READ;
+
+	if (!wbt_should_throttle(rwb, bio)) {
+		if (ret & WBT_READ)
+			wb_timestamp(rwb, &rwb->last_issue);
+		return ret;
+	}
+
+	__wbt_wait(rwb, bio->bi_opf, lock);
+
+	if (!timer_pending(&rwb->window_timer))
+		rwb_arm_timer(rwb);
+
+	if (current_is_kswapd())
+		ret |= WBT_KSWAPD;
+
+	return ret | WBT_TRACKED;
+}
+
+void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+
+	/*
+	 * Track sync issue, in case it takes a long time to complete. Allows
+	 * us to react quicker, if a sync IO takes a long time to complete.
+	 * Note that this is just a hint. 'stat' can go away when the
+	 * request completes, so it's important we never dereference it. We
+	 * only use the address to compare with, which is why we store the
+	 * sync_issue time locally.
+	 */
+	if (wbt_is_read(stat) && !rwb->sync_issue) {
+		rwb->sync_cookie = stat;
+		rwb->sync_issue = blk_stat_time(stat);
+	}
+}
+
+void wbt_requeue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+	if (stat == rwb->sync_cookie) {
+		rwb->sync_issue = 0;
+		rwb->sync_cookie = NULL;
+	}
+}
+
+void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+{
+	if (rwb) {
+		rwb->queue_depth = depth;
+		wbt_update_limits(rwb);
+	}
+}
+
+void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
+{
+	if (rwb)
+		rwb->wc = write_cache_on;
+}
+
+void wbt_disable(struct rq_wb *rwb)
+{
+	if (rwb) {
+		del_timer_sync(&rwb->window_timer);
+		rwb->win_nsec = rwb->min_lat_nsec = 0;
+		wbt_update_limits(rwb);
+	}
+}
+EXPORT_SYMBOL_GPL(wbt_disable);
+
+struct rq_wb *wbt_init(struct backing_dev_info *bdi, struct wb_stat_ops *ops,
+		       void *ops_data)
+{
+	struct rq_wb *rwb;
+	int i;
+
+	BUILD_BUG_ON(WBT_NR_BITS > BLK_STAT_RES_BITS);
+
+	if (!ops->get || !ops->is_current || !ops->clear)
+		return ERR_PTR(-EINVAL);
+
+	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+	if (!rwb)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < WBT_NUM_RWQ; i++) {
+		atomic_set(&rwb->rq_wait[i].inflight, 0);
+		init_waitqueue_head(&rwb->rq_wait[i].wait);
+	}
+
+	setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb);
+	rwb->wc = 1;
+	rwb->queue_depth = RWB_DEF_DEPTH;
+	rwb->last_comp = rwb->last_issue = jiffies;
+	rwb->bdi = bdi;
+	rwb->win_nsec = RWB_WINDOW_NSEC;
+	rwb->stat_ops = ops;
+	rwb->ops_data = ops_data;
+	wbt_update_limits(rwb);
+	return rwb;
+}
+
+void wbt_exit(struct rq_wb *rwb)
+{
+	if (rwb) {
+		del_timer_sync(&rwb->window_timer);
+		kfree(rwb);
+	}
+}
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
new file mode 100644
index 000000000000..784e392b20e1
--- /dev/null
+++ b/block/blk-wbt.h
@@ -0,0 +1,166 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include <linux/kernel.h>
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+#include <linux/ktime.h>
+
+#include "blk-stat.h"
+
+enum wbt_flags {
+	WBT_TRACKED		= 1,	/* write, tracked for throttling */
+	WBT_READ		= 2,	/* read */
+	WBT_KSWAPD		= 4,	/* write, from kswapd */
+
+	WBT_NR_BITS		= 3,	/* number of bits */
+};
+
+enum {
+	WBT_NUM_RWQ		= 2,
+};
+
+static inline void wbt_clear_state(struct blk_issue_stat *stat)
+{
+	stat->time &= BLK_STAT_TIME_MASK;
+}
+
+static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
+{
+	return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+}
+
+static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
+{
+	stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+}
+
+static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
+{
+	return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
+}
+
+static inline bool wbt_is_read(struct blk_issue_stat *stat)
+{
+	return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
+}
+
+struct wb_stat_ops {
+	void (*get)(void *, struct blk_rq_stat *);
+	bool (*is_current)(struct blk_rq_stat *);
+	void (*clear)(void *);
+};
+
+struct rq_wait {
+	wait_queue_head_t wait;
+	atomic_t inflight;
+};
+
+struct rq_wb {
+	/*
+	 * Settings that govern how we throttle
+	 */
+	unsigned int wb_background;		/* background writeback */
+	unsigned int wb_normal;			/* normal writeback */
+	unsigned int wb_max;			/* max throughput writeback */
+	int scale_step;
+	bool scaled_max;
+
+	/*
+	 * Number of consecutive periods where we don't have enough
+	 * information to make a firm scale up/down decision.
+	 */
+	unsigned int unknown_cnt;
+
+	u64 win_nsec;				/* default window size */
+	u64 cur_win_nsec;			/* current window size */
+
+	struct timer_list window_timer;
+
+	s64 sync_issue;
+	void *sync_cookie;
+
+	unsigned int wc;
+	unsigned int queue_depth;
+
+	unsigned long last_issue;		/* last non-throttled issue */
+	unsigned long last_comp;		/* last non-throttled comp */
+	unsigned long min_lat_nsec;
+	struct backing_dev_info *bdi;
+	struct rq_wait rq_wait[WBT_NUM_RWQ];
+
+	struct wb_stat_ops *stat_ops;
+	void *ops_data;
+};
+
+static inline unsigned int wbt_inflight(struct rq_wb *rwb)
+{
+	unsigned int i, ret = 0;
+
+	for (i = 0; i < WBT_NUM_RWQ; i++)
+		ret += atomic_read(&rwb->rq_wait[i].inflight);
+
+	return ret;
+}
+
+struct backing_dev_info;
+
+#ifdef CONFIG_BLK_WBT
+
+void __wbt_done(struct rq_wb *, enum wbt_flags);
+void wbt_done(struct rq_wb *, struct blk_issue_stat *);
+enum wbt_flags wbt_wait(struct rq_wb *, struct bio *, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void *);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct blk_issue_stat *);
+void wbt_issue(struct rq_wb *, struct blk_issue_stat *);
+void wbt_disable(struct rq_wb *);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#else
+
+static inline void __wbt_done(struct rq_wb *rwb, enum wbt_flags flags)
+{
+}
+static inline void wbt_done(struct rq_wb *rwb, struct blk_issue_stat *stat)
+{
+}
+static inline enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio,
+				      spinlock_t *lock)
+{
+	return 0;
+}
+static inline struct rq_wb *wbt_init(struct backing_dev_info *bdi,
+				     struct wb_stat_ops *ops, void *ops_data)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline void wbt_exit(struct rq_wb *rbw)
+{
+}
+static inline void wbt_update_limits(struct rq_wb *rwb)
+{
+}
+static inline void wbt_requeue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+{
+}
+static inline void wbt_issue(struct rq_wb *rwb, struct blk_issue_stat *stat)
+{
+}
+static inline void wbt_disable(struct rq_wb *rwb)
+{
+}
+static inline void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+{
+}
+static inline void wbt_set_write_cache(struct rq_wb *rwb, bool wc)
+{
+}
+
+#endif /* CONFIG_BLK_WBT */
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index 000000000000..3c518e455680
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,153 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt
+
+#if !defined(_TRACE_WBT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WBT_H
+
+#include <linux/tracepoint.h>
+#include "../../../block/blk-wbt.h"
+
+/**
+ * wbt_stat - trace stats for blk_wb
+ * @stat: array of read/write stats
+ */
+TRACE_EVENT(wbt_stat,
+
+	TP_PROTO(struct backing_dev_info *bdi, struct blk_rq_stat *stat),
+
+	TP_ARGS(bdi, stat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(s64, rmean)
+		__field(u64, rmin)
+		__field(u64, rmax)
+		__field(s64, rnr_samples)
+		__field(s64, rtime)
+		__field(s64, wmean)
+		__field(u64, wmin)
+		__field(u64, wmax)
+		__field(s64, wnr_samples)
+		__field(s64, wtime)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->rmean		= stat[0].mean;
+		__entry->rmin		= stat[0].min;
+		__entry->rmax		= stat[0].max;
+		__entry->rnr_samples	= stat[0].nr_samples;
+		__entry->wmean		= stat[1].mean;
+		__entry->wmin		= stat[1].min;
+		__entry->wmax		= stat[1].max;
+		__entry->wnr_samples	= stat[1].nr_samples;
+	),
+
+	TP_printk("%s: rmean=%llu, rmin=%llu, rmax=%llu, rsamples=%llu, "
+		  "wmean=%llu, wmin=%llu, wmax=%llu, wsamples=%llu\n",
+		  __entry->name, __entry->rmean, __entry->rmin, __entry->rmax,
+		  __entry->rnr_samples, __entry->wmean, __entry->wmin,
+		  __entry->wmax, __entry->wnr_samples)
+);
+
+/**
+ * wbt_lat - trace latency event
+ * @lat: latency trigger
+ */
+TRACE_EVENT(wbt_lat,
+
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long lat),
+
+	TP_ARGS(bdi, lat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, lat)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->lat = div_u64(lat, 1000);
+	),
+
+	TP_printk("%s: latency %lluus\n", __entry->name,
+			(unsigned long long) __entry->lat)
+);
+
+/**
+ * wbt_step - trace wb event step
+ * @msg: context message
+ * @step: the current scale step count
+ * @window: the current monitoring window
+ * @bg: the current background queue limit
+ * @normal: the current normal writeback limit
+ * @max: the current max throughput writeback limit
+ */
+TRACE_EVENT(wbt_step,
+
+	TP_PROTO(struct backing_dev_info *bdi, const char *msg,
+		 int step, unsigned long window, unsigned int bg,
+		 unsigned int normal, unsigned int max),
+
+	TP_ARGS(bdi, msg, step, window, bg, normal, max),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(const char *, msg)
+		__field(int, step)
+		__field(unsigned long, window)
+		__field(unsigned int, bg)
+		__field(unsigned int, normal)
+		__field(unsigned int, max)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->msg	= msg;
+		__entry->step	= step;
+		__entry->window	= div_u64(window, 1000);
+		__entry->bg	= bg;
+		__entry->normal	= normal;
+		__entry->max	= max;
+	),
+
+	TP_printk("%s: %s: step=%d, window=%luus, background=%u, normal=%u, max=%u\n",
+		  __entry->name, __entry->msg, __entry->step, __entry->window,
+		  __entry->bg, __entry->normal, __entry->max)
+);
+
+/**
+ * wbt_timer - trace wb timer event
+ * @status: timer state status
+ * @step: the current scale step count
+ * @inflight: tracked writes inflight
+ */
+TRACE_EVENT(wbt_timer,
+
+	TP_PROTO(struct backing_dev_info *bdi, unsigned int status,
+		 int step, unsigned int inflight),
+
+	TP_ARGS(bdi, status, step, inflight),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned int, status)
+		__field(int, step)
+		__field(unsigned int, inflight)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->status		= status;
+		__entry->step		= step;
+		__entry->inflight	= inflight;
+	),
+
+	TP_printk("%s: status=%u, step=%d, inflight=%u\n", __entry->name,
+		  __entry->status, __entry->step, __entry->inflight)
+);
+
+#endif /* _TRACE_WBT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 8/8] block: hook up writeback throttling
  2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
                   ` (6 preceding siblings ...)
  2016-11-01 21:08 ` [PATCH 7/8] blk-wbt: add general throttling mechanism Jens Axboe
@ 2016-11-01 21:08 ` Jens Axboe
  2016-11-08 13:42   ` Jan Kara
  7 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-01 21:08 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, hch, Jens Axboe

Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers two sysfs entries. The first one, 'wb_window_usec',
defines the window of monitoring. The second one, 'wb_lat_usec',
sets the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting
this value to '0' disables blk-wb. Generally, a user would not have
to touch these settings.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 Documentation/block/queue-sysfs.txt |  13 ++++
 block/Kconfig                       |  24 +++++++
 block/blk-core.c                    |  18 ++++-
 block/blk-mq.c                      |  27 +++++++-
 block/blk-settings.c                |   4 ++
 block/blk-sysfs.c                   | 134 ++++++++++++++++++++++++++++++++++++
 block/cfq-iosched.c                 |  14 ++++
 include/linux/blkdev.h              |   3 +
 8 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2a3904030dea..2847219ebd8c 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -169,5 +169,18 @@ This is the number of bytes the device can write in a single write-same
 command.  A value of '0' means write-same is not supported by this
 device.
 
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes.
+
+wb_window_usec (RW)
+-------------------
+If the device is registered for writeback throttling, then this file shows
+the value of the monitoring window in which we'll look at the target
+latency. See wb_lat_usec.
+
 
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/Kconfig b/block/Kconfig
index 6b0ad08f0677..9f5d4dd7d751 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -120,6 +120,30 @@ config BLK_CMDLINE_PARSER
 
 	See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_WBT
+	bool "Enable support for block device writeback throttling"
+	default n
+	---help---
+	Enabling this option enables the block layer to throttle buffered
+	writeback from the VM, making it more smooth and having less
+	impact on foreground operations.
+
+config BLK_WBT_SQ
+	bool "Single queue writeback throttling"
+	default n
+	depends on BLK_WBT
+	---help---
+	Enable writeback throttling by default on legacy single queue devices
+
+config BLK_WBT_MQ
+	bool "Multiqueue writeback throttling"
+	default y
+	depends on BLK_WBT
+	---help---
+	Enable writeback throttling by default on multiqueue devices.
+	Multiqueue currently doesn't have support for IO scheduling,
+	enabling this option is recommended.
+
 menu "Partition Types"
 
 source "block/partitions/Kconfig"
diff --git a/block/blk-core.c b/block/blk-core.c
index ca77c725b4e5..c68e92acf21a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -882,6 +883,8 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 
 fail:
 	blk_free_flush_queue(q->fq);
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
 	return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1344,6 +1347,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 	blk_delete_timer(rq);
 	blk_clear_rq_complete(rq);
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
 	if (rq->rq_flags & RQF_QUEUED)
 		blk_queue_end_tag(q, rq);
@@ -1436,6 +1440,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	/* this is a bio leak */
 	WARN_ON(req->bio != NULL);
 
+	wbt_done(q->rq_wb, &req->issue_stat);
+
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
 	 * it didn't come out of our reserved rq pools
@@ -1663,6 +1669,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	int el_ret, where = ELEVATOR_INSERT_SORT;
 	struct request *req;
 	unsigned int request_count = 0;
+	unsigned int wb_acct;
 
 	/*
 	 * low level driver can indicate that it wants pages above a
@@ -1715,17 +1722,22 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	}
 
 get_rq:
+	wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);
+
 	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
 	req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		__wbt_done(q->rq_wb, wb_acct);
 		bio->bi_error = PTR_ERR(req);
 		bio_endio(bio);
 		goto out_unlock;
 	}
 
+	wbt_track(&req->issue_stat, wb_acct);
+
 	/*
 	 * After dropping the lock and possibly sleeping here, our request
 	 * may now be mergeable after it had proven unmergeable (above).
@@ -2463,6 +2475,7 @@ void blk_start_request(struct request *req)
 	blk_dequeue_request(req);
 
 	blk_stat_set_issue_time(&req->issue_stat);
+	wbt_issue(req->q->rq_wb, &req->issue_stat);
 
 	/*
 	 * We are now handing the request to the hardware, initialize
@@ -2700,9 +2713,10 @@ void blk_finish_request(struct request *req, int error)
 
 	blk_account_io_done(req);
 
-	if (req->end_io)
+	if (req->end_io) {
+		wbt_done(req->q->rq_wb, &req->issue_stat);
 		req->end_io(req, error);
-	else {
+	} else {
 		if (blk_bidi_rq(req))
 			__blk_put_request(req->next_rq->q, req->next_rq);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4555a76d22a7..11e461c026dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -31,6 +31,7 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
+#include "blk-wbt.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -299,6 +300,8 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+
+	wbt_done(q->rq_wb, &rq->issue_stat);
 	rq->rq_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -327,6 +330,7 @@ inline void __blk_mq_end_request(struct request *rq, int error)
 	blk_account_io_done(rq);
 
 	if (rq->end_io) {
+		wbt_done(rq->q->rq_wb, &rq->issue_stat);
 		rq->end_io(rq, error);
 	} else {
 		if (unlikely(blk_bidi_rq(rq)))
@@ -434,6 +438,7 @@ void blk_mq_start_request(struct request *rq)
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
 	blk_stat_set_issue_time(&rq->issue_stat);
+	wbt_issue(q->rq_wb, &rq->issue_stat);
 
 	blk_add_timer(rq);
 
@@ -470,6 +475,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	struct request_queue *q = rq->q;
 
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1270,6 +1276,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
+	unsigned int wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1284,9 +1291,15 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
+	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
+	}
+
+	wbt_track(&rq->issue_stat, wb_acct);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1363,6 +1376,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_mq_alloc_data data;
 	struct request *rq;
 	blk_qc_t cookie;
+	unsigned int wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1379,9 +1393,15 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	wb_acct = wbt_wait(q->rq_wb, bio, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
+	}
+
+	wbt_track(&rq->issue_stat, wb_acct);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -2052,6 +2072,9 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
+
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 9cf053759363..c7ccabc0ec3e 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -13,6 +13,7 @@
 #include <linux/gfp.h>
 
 #include "blk.h"
+#include "blk-wbt.h"
 
 unsigned long blk_max_low_pfn;
 EXPORT_SYMBOL(blk_max_low_pfn);
@@ -845,6 +846,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
+	wbt_set_queue_depth(q->rq_wb, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
@@ -868,6 +870,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 	else
 		queue_flag_clear(QUEUE_FLAG_FUA, q);
 	spin_unlock_irq(q->queue_lock);
+
+	wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5bb4648f434a..70644bf87582 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-wbt.h"
 
 struct queue_sysfs_entry {
 	struct attribute attr;
@@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
 	return count;
 }
 
+static ssize_t queue_var_store64(u64 *var, const char *page)
+{
+	int err;
+	u64 v;
+
+	err = kstrtou64(page, 10, &v);
+	if (err < 0)
+		return err;
+
+	*var = v;
+	return 0;
+}
+
 static ssize_t queue_requests_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->nr_requests, (page));
@@ -364,6 +378,58 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wb_win_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->win_nsec, 1000));
+}
+
+static ssize_t queue_wb_win_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->win_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
+static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+}
+
+static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->min_lat_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
 static ssize_t queue_wc_show(struct request_queue *q, char *page)
 {
 	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -578,6 +644,18 @@ static struct queue_sysfs_entry queue_stats_entry = {
 	.show = queue_stats_show,
 };
 
+static struct queue_sysfs_entry queue_wb_lat_entry = {
+	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_lat_show,
+	.store = queue_wb_lat_store,
+};
+
+static struct queue_sysfs_entry queue_wb_win_entry = {
+	.attr = {.name = "wbt_window_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_win_show,
+	.store = queue_wb_win_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -608,6 +686,8 @@ static struct attribute *default_attrs[] = {
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
 	&queue_stats_entry.attr,
+	&queue_wb_lat_entry.attr,
+	&queue_wb_win_entry.attr,
 	NULL,
 };
 
@@ -722,6 +802,58 @@ struct kobj_type blk_queue_ktype = {
 	.release	= blk_release_queue,
 };
 
+static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
+{
+	blk_queue_stat_get(data, stat);
+}
+
+static void blk_wb_stat_clear(void *data)
+{
+	blk_stat_clear(data);
+}
+
+static bool blk_wb_stat_is_current(struct blk_rq_stat *stat)
+{
+	return blk_stat_is_current(stat);
+}
+
+static struct wb_stat_ops wb_stat_ops = {
+	.get		= blk_wb_stat_get,
+	.is_current	= blk_wb_stat_is_current,
+	.clear		= blk_wb_stat_clear,
+};
+
+static void blk_wb_init(struct request_queue *q)
+{
+	struct rq_wb *rwb;
+
+#ifndef CONFIG_BLK_WBT_MQ
+	if (q->mq_ops)
+		return;
+#endif
+#ifndef CONFIG_BLK_WBT_SQ
+	if (q->request_fn)
+		return;
+#endif
+
+	rwb = wbt_init(&q->backing_dev_info, &wb_stat_ops, q);
+
+	/*
+	 * If this fails, we don't get throttling
+	 */
+	if (IS_ERR_OR_NULL(rwb))
+		return;
+
+	if (blk_queue_nonrot(q))
+		rwb->min_lat_nsec = 2000000ULL;
+	else
+		rwb->min_lat_nsec = 75000000ULL;
+
+	wbt_set_queue_depth(rwb, blk_queue_depth(q));
+	wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+	q->rq_wb = rwb;
+}
+
 int blk_register_queue(struct gendisk *disk)
 {
 	int ret;
@@ -761,6 +893,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (q->mq_ops)
 		blk_mq_register_dev(dev, q);
 
+	blk_wb_init(q);
+
 	if (!q->request_fn)
 		return 0;
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index dcbed8c9c82c..0ef5ef5b5ed2 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -16,6 +16,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>
 #include "blk.h"
+#include "blk-wbt.h"
 
 /*
  * tunables
@@ -3762,9 +3763,11 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
 	uint64_t serial_nr;
+	bool nonroot_cg;
 
 	rcu_read_lock();
 	serial_nr = bio_blkcg(bio)->css.serial_nr;
+	nonroot_cg = bio_blkcg(bio) != &blkcg_root;
 	rcu_read_unlock();
 
 	/*
@@ -3775,6 +3778,17 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 		return;
 
 	/*
+	 * If we have a non-root cgroup, we can depend on that to
+	 * do proper throttling of writes. Turn off wbt for that
+	 * case.
+	 */
+	if (nonroot_cg) {
+		struct request_queue *q = cfqd->queue;
+
+		wbt_disable(q->rq_wb);
+	}
+
+	/*
 	 * Drop reference to queues.  New queues will be assigned in new
 	 * group upon arrival of fresh requests.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6bd5eb56894e..294ee64e7f06 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -38,6 +38,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
+struct rq_wb;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -381,6 +382,8 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct rq_wb		*rq_wb;
+
 	/*
 	 * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
 	 * is used, root blkg allocates from @q->root_rl and all other
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] block: add WRITE_BACKGROUND
  2016-11-01 21:08 ` [PATCH 1/8] block: add WRITE_BACKGROUND Jens Axboe
@ 2016-11-02 14:55   ` Christoph Hellwig
  2016-11-02 16:22     ` Jens Axboe
  2016-11-05 22:27   ` Jan Kara
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue, Nov 01, 2016 at 03:08:44PM -0600, Jens Axboe wrote:
> This adds a new request flag, REQ_BACKGROUND, that callers can use to
> tell the block layer that this is background (non-urgent) IO.

The subject still says WRITE_BACKGROUND :)

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/8] writeback: add wbc_to_write_flags()
  2016-11-01 21:08 ` [PATCH 2/8] writeback: add wbc_to_write_flags() Jens Axboe
@ 2016-11-02 14:56   ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue, Nov 01, 2016 at 03:08:45PM -0600, Jens Axboe wrote:
> Add wbc_to_write_flags(), which returns the write modifier flags to use,
> based on a struct writeback_control. No functional changes in this
> patch, but it prepares us for factoring other wbc fields for write type.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/8] writeback: mark background writeback as such
  2016-11-01 21:08 ` [PATCH 3/8] writeback: mark background writeback as such Jens Axboe
@ 2016-11-02 14:56   ` Christoph Hellwig
  2016-11-05 22:26   ` Jan Kara
  1 sibling, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-11-01 21:08 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
@ 2016-11-02 14:57   ` Christoph Hellwig
  2016-11-02 14:59     ` Jens Axboe
  2016-11-08 13:02   ` Jan Kara
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue, Nov 01, 2016 at 03:08:47PM -0600, Jens Axboe wrote:
> Note in the bdi_writeback structure whenever a task ends up sleeping
> waiting for progress. We can use that information in the lower layers
> to increase the priority of writes.

Do we need to care about atomicy of multiple threads updating the value
?

Otherwise this looks fine.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] block: add code to track actual device queue depth
  2016-11-01 21:08 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
@ 2016-11-02 14:59   ` Christoph Hellwig
  2016-11-02 15:02     ` Jens Axboe
  2016-11-05 22:37   ` Jan Kara
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2016-11-02 14:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
> For blk-mq, ->nr_requests does track queue depth, at least at init
> time. But for the older queue paths, it's simply a soft setting.
> On top of that, it's generally larger than the hardware setting
> on purpose, to allow backup of requests for merging.
> 
> Fill a hole in struct request with a 'queue_depth' member, that

That would be struct request_queue..

>  /**
> + * blk_set_queue_depth - tell the block layer about the device queue depth
> + * @q:		the request queue for the device
> + * @depth:		queue depth
> + *
> + */
> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> +{
> +	q->queue_depth = depth;
> +}
> +EXPORT_SYMBOL(blk_set_queue_depth);

Do we really need this wrapper?

> +++ b/drivers/scsi/scsi.c
> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
>  		wmb();
>  	}
>  
> +	if (sdev->request_queue)
> +		blk_set_queue_depth(sdev->request_queue, depth);
> +
>  	return sdev->queue_depth;

Can we kill the scsi_device queue_depth member and just always use
the request_queue one?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-11-02 14:57   ` Christoph Hellwig
@ 2016-11-02 14:59     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-02 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack

On 11/02/2016 08:57 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:08:47PM -0600, Jens Axboe wrote:
>> Note in the bdi_writeback structure whenever a task ends up sleeping
>> waiting for progress. We can use that information in the lower layers
>> to increase the priority of writes.
>
> Do we need to care about atomicy of multiple threads updating the value
> ?
>
> Otherwise this looks fine.

Don't think we have to care about it too much, it's a soft hint similar
wb->dirty_exceeded.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] block: add code to track actual device queue depth
  2016-11-02 14:59   ` Christoph Hellwig
@ 2016-11-02 15:02     ` Jens Axboe
  2016-11-02 16:40       ` Johannes Thumshirn
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-02 15:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack

On 11/02/2016 08:59 AM, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
>> For blk-mq, ->nr_requests does track queue depth, at least at init
>> time. But for the older queue paths, it's simply a soft setting.
>> On top of that, it's generally larger than the hardware setting
>> on purpose, to allow backup of requests for merging.
>>
>> Fill a hole in struct request with a 'queue_depth' member, that
>
> That would be struct request_queue..

Good catch, will fix.

>>  /**
>> + * blk_set_queue_depth - tell the block layer about the device queue depth
>> + * @q:		the request queue for the device
>> + * @depth:		queue depth
>> + *
>> + */
>> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
>> +{
>> +	q->queue_depth = depth;
>> +}
>> +EXPORT_SYMBOL(blk_set_queue_depth);
>
> Do we really need this wrapper?

Not necessarily, just seems like a nicer API than manually setting the
field. Not a big deal to me, though.

>> +++ b/drivers/scsi/scsi.c
>> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
>>  		wmb();
>>  	}
>>
>> +	if (sdev->request_queue)
>> +		blk_set_queue_depth(sdev->request_queue, depth);
>> +
>>  	return sdev->queue_depth;
>
> Can we kill the scsi_device queue_depth member and just always use
> the request_queue one?

Yes. I'd prefer that we do that through the SCSI tree, once this is in
though.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] block: add WRITE_BACKGROUND
  2016-11-02 14:55   ` Christoph Hellwig
@ 2016-11-02 16:22     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-02 16:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, linux-block, jack

On Wed, Nov 02 2016, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 03:08:44PM -0600, Jens Axboe wrote:
> > This adds a new request flag, REQ_BACKGROUND, that callers can use to
> > tell the block layer that this is background (non-urgent) IO.
> 
> The subject still says WRITE_BACKGROUND :)

Gah - will fix that up.

> Otherwise looks fine:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Added, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] block: add code to track actual device queue depth
  2016-11-02 15:02     ` Jens Axboe
@ 2016-11-02 16:40       ` Johannes Thumshirn
  0 siblings, 0 replies; 39+ messages in thread
From: Johannes Thumshirn @ 2016-11-02 16:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, axboe, linux-kernel, linux-fsdevel, linux-block, jack

On Wed, Nov 02, 2016 at 09:02:08AM -0600, Jens Axboe wrote:
> On 11/02/2016 08:59 AM, Christoph Hellwig wrote:
> > On Tue, Nov 01, 2016 at 03:08:48PM -0600, Jens Axboe wrote:
> > > For blk-mq, ->nr_requests does track queue depth, at least at init
> > > time. But for the older queue paths, it's simply a soft setting.
> > > On top of that, it's generally larger than the hardware setting
> > > on purpose, to allow backup of requests for merging.
> > > 
> > > Fill a hole in struct request with a 'queue_depth' member, that
> > 
> > That would be struct request_queue..
> 
> Good catch, will fix.
> 
> > >  /**
> > > + * blk_set_queue_depth - tell the block layer about the device queue depth
> > > + * @q:		the request queue for the device
> > > + * @depth:		queue depth
> > > + *
> > > + */
> > > +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> > > +{
> > > +	q->queue_depth = depth;
> > > +}
> > > +EXPORT_SYMBOL(blk_set_queue_depth);
> > 
> > Do we really need this wrapper?
> 
> Not necessarily, just seems like a nicer API than manually setting the
> field. Not a big deal to me, though.

A lot of block code uses this kind of setters so I _think_ it complies with
the overall style. But I have no strong opinion on this either...

Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 3/8] writeback: mark background writeback as such
  2016-11-01 21:08 ` [PATCH 3/8] writeback: mark background writeback as such Jens Axboe
  2016-11-02 14:56   ` Christoph Hellwig
@ 2016-11-05 22:26   ` Jan Kara
  1 sibling, 0 replies; 39+ messages in thread
From: Jan Kara @ 2016-11-05 22:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:46, Jens Axboe wrote:
> If we're doing background type writes, then use the appropriate
> background write flags for that.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/writeback.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 50c96ee8108f..c78f9f0920b5 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -107,6 +107,8 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
>  {
>  	if (wbc->sync_mode == WB_SYNC_ALL)
>  		return REQ_SYNC;
> +	else if (wbc->for_kupdate || wbc->for_background)
> +		return REQ_BACKGROUND;
>  
>  	return 0;
>  }
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/8] block: add WRITE_BACKGROUND
  2016-11-01 21:08 ` [PATCH 1/8] block: add WRITE_BACKGROUND Jens Axboe
  2016-11-02 14:55   ` Christoph Hellwig
@ 2016-11-05 22:27   ` Jan Kara
  1 sibling, 0 replies; 39+ messages in thread
From: Jan Kara @ 2016-11-05 22:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:44, Jens Axboe wrote:
> This adds a new request flag, REQ_BACKGROUND, that callers can use to
> tell the block layer that this is background (non-urgent) IO.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/blk_types.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index bb921028e7c5..562ac46cb790 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -177,6 +177,7 @@ enum req_flag_bits {
>  	__REQ_FUA,		/* forced unit access */
>  	__REQ_PREFLUSH,		/* request for cache flush */
>  	__REQ_RAHEAD,		/* read ahead, can fail anytime */
> +	__REQ_BACKGROUND,	/* background IO */
>  	__REQ_NR_BITS,		/* stops here */
>  };
>  
> @@ -192,6 +193,7 @@ enum req_flag_bits {
>  #define REQ_FUA			(1ULL << __REQ_FUA)
>  #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
> +#define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>  
>  #define REQ_FAILFAST_MASK \
>  	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/8] block: add code to track actual device queue depth
  2016-11-01 21:08 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
  2016-11-02 14:59   ` Christoph Hellwig
@ 2016-11-05 22:37   ` Jan Kara
  1 sibling, 0 replies; 39+ messages in thread
From: Jan Kara @ 2016-11-05 22:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:48, Jens Axboe wrote:
> For blk-mq, ->nr_requests does track queue depth, at least at init
> time. But for the older queue paths, it's simply a soft setting.
> On top of that, it's generally larger than the hardware setting
> on purpose, to allow backup of requests for merging.
> 
> Fill a hole in struct request with a 'queue_depth' member, that
> drivers can call to more closely inform the block layer of the
> real queue depth.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  block/blk-settings.c   | 12 ++++++++++++
>  drivers/scsi/scsi.c    |  3 +++
>  include/linux/blkdev.h | 11 +++++++++++
>  3 files changed, 26 insertions(+)
> 
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 55369a65dea2..9cf053759363 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -837,6 +837,18 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
>  EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
>  
>  /**
> + * blk_set_queue_depth - tell the block layer about the device queue depth
> + * @q:		the request queue for the device
> + * @depth:		queue depth
> + *
> + */
> +void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
> +{
> +	q->queue_depth = depth;
> +}
> +EXPORT_SYMBOL(blk_set_queue_depth);
> +
> +/**
>   * blk_queue_write_cache - configure queue's write cache
>   * @q:		the request queue for the device
>   * @wc:		write back cache on or off
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 1deb6adc411f..75455d4dab68 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
>  		wmb();
>  	}
>  
> +	if (sdev->request_queue)
> +		blk_set_queue_depth(sdev->request_queue, depth);
> +
>  	return sdev->queue_depth;
>  }
>  EXPORT_SYMBOL(scsi_change_queue_depth);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8396da2bb698..0c677fb35ce4 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -405,6 +405,8 @@ struct request_queue {
>  	struct blk_mq_ctx __percpu	*queue_ctx;
>  	unsigned int		nr_queues;
>  
> +	unsigned int		queue_depth;
> +
>  	/* hw dispatch queues */
>  	struct blk_mq_hw_ctx	**queue_hw_ctx;
>  	unsigned int		nr_hw_queues;
> @@ -777,6 +779,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
>  	return false;
>  }
>  
> +static inline unsigned int blk_queue_depth(struct request_queue *q)
> +{
> +	if (q->queue_depth)
> +		return q->queue_depth;
> +
> +	return q->nr_requests;
> +}
> +
>  /*
>   * q->prep_rq_fn return values
>   */
> @@ -1093,6 +1103,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
>  extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
>  extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
>  extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
> +extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
>  extern void blk_set_default_limits(struct queue_limits *lim);
>  extern void blk_set_stacking_limits(struct queue_limits *lim);
>  extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-11-01 21:08 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
  2016-11-02 14:57   ` Christoph Hellwig
@ 2016-11-08 13:02   ` Jan Kara
  1 sibling, 0 replies; 39+ messages in thread
From: Jan Kara @ 2016-11-08 13:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:47, Jens Axboe wrote:
> Note in the bdi_writeback structure whenever a task ends up sleeping
> waiting for progress. We can use that information in the lower layers
> to increase the priority of writes.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/backing-dev-defs.h | 2 ++
>  mm/backing-dev.c                 | 1 +
>  mm/page-writeback.c              | 1 +
>  3 files changed, 4 insertions(+)
> 
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index c357f27d5483..dc5f76d7f648 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -116,6 +116,8 @@ struct bdi_writeback {
>  	struct list_head work_list;
>  	struct delayed_work dwork;	/* work item used for writeback */
>  
> +	unsigned long dirty_sleep;	/* last wait */
> +
>  	struct list_head bdi_node;	/* anchored at bdi->wb_list */
>  
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 8fde443f36d7..3bfed5ab2475 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
>  	spin_lock_init(&wb->work_lock);
>  	INIT_LIST_HEAD(&wb->work_list);
>  	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
> +	wb->dirty_sleep = jiffies;
>  
>  	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
>  	if (!wb->congested)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 439cc63ad903..52e2f8e3b472 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1778,6 +1778,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>  					  pause,
>  					  start_time);
>  		__set_current_state(TASK_KILLABLE);
> +		wb->dirty_sleep = now;
>  		io_schedule_timeout(pause);
>  
>  		current->dirty_paused_when = now + pause;
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-01 21:08 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
@ 2016-11-08 13:30   ` Jan Kara
  2016-11-08 15:25     ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2016-11-08 13:30 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:49, Jens Axboe wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
> 
> The stats are tracked in, roughly, 0.1s interval windows.
> 
> Add sysfs files to display the stats.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>

This patch looks mostly good to me but I have one concern: You track
statistics in a fixed 134ms window, stats get cleared at the beginning of
each window. Now this can interact with the writeback window and latency
settings which are dynamic and settable from userspace - so if the
writeback code observation window gets set larger than the stats window,
things become strange since you'll likely miss quite some observations
about read latencies. So I think you need to make sure stats window is
always larger than writeback window. Or actually, why do you have something
like stats window and don't leave clearing of statistics completely to the
writeback tracking code?

Also as a side note - nobody currently uses the mean value of the
statistics. It may be faster to track just sum and count so that mean can
be computed on request which will be presumably much more rare than current
situation where we recompute the mean on each batch update. Actually, that
way you could get rid of the batching as well I assume.

								Honza
> ---
>  block/Makefile            |   2 +-
>  block/blk-core.c          |   4 +
>  block/blk-mq-sysfs.c      |  47 ++++++++++
>  block/blk-mq.c            |  14 +++
>  block/blk-mq.h            |   3 +
>  block/blk-stat.c          | 226 ++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-stat.h          |  37 ++++++++
>  block/blk-sysfs.c         |  26 ++++++
>  include/linux/blk_types.h |  16 ++++
>  include/linux/blkdev.h    |   4 +
>  10 files changed, 378 insertions(+), 1 deletion(-)
>  create mode 100644 block/blk-stat.c
>  create mode 100644 block/blk-stat.h
> 
> diff --git a/block/Makefile b/block/Makefile
> index 934dac73fb37..2528c596f7ec 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -5,7 +5,7 @@
>  obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>  			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
>  			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
> -			blk-lib.o blk-mq.o blk-mq-tag.o \
> +			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
>  			blk-mq-sysfs.o blk-mq-cpumap.o ioctl.o \
>  			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
>  			badblocks.o partitions/
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0bfaa54d3e9f..ca77c725b4e5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2462,6 +2462,8 @@ void blk_start_request(struct request *req)
>  {
>  	blk_dequeue_request(req);
>  
> +	blk_stat_set_issue_time(&req->issue_stat);
> +
>  	/*
>  	 * We are now handing the request to the hardware, initialize
>  	 * resid_len to full count and add the timeout handler.
> @@ -2529,6 +2531,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>  
>  	trace_block_rq_complete(req->q, req, nr_bytes);
>  
> +	blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
> +
>  	if (!req->bio)
>  		return false;
>  
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 01fb455d3377..633c79a538ea 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -259,6 +259,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>  	return ret;
>  }
>  
> +static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_ctx *ctx;
> +	unsigned int i;
> +
> +	hctx_for_each_ctx(hctx, ctx, i) {
> +		blk_stat_init(&ctx->stat[0]);
> +		blk_stat_init(&ctx->stat[1]);
> +	}
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
> +					  const char *page, size_t count)
> +{
> +	blk_mq_stat_clear(hctx);
> +	return count;
> +}
> +
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +			pre, (long long) stat->nr_samples,
> +			(long long) stat->mean, (long long) stat->min,
> +			(long long) stat->max);
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
> +{
> +	struct blk_rq_stat stat[2];
> +	ssize_t ret;
> +
> +	blk_stat_init(&stat[0]);
> +	blk_stat_init(&stat[1]);
> +
> +	blk_hctx_stat_get(hctx, stat);
> +
> +	ret = print_stat(page, &stat[0], "read :");
> +	ret += print_stat(page + ret, &stat[1], "write:");
> +	return ret;
> +}
> +
>  static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
>  	.attr = {.name = "dispatched", .mode = S_IRUGO },
>  	.show = blk_mq_sysfs_dispatched_show,
> @@ -317,6 +358,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
>  	.show = blk_mq_hw_sysfs_poll_show,
>  	.store = blk_mq_hw_sysfs_poll_store,
>  };
> +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
> +	.attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
> +	.show = blk_mq_hw_sysfs_stat_show,
> +	.store = blk_mq_hw_sysfs_stat_store,
> +};
>  
>  static struct attribute *default_hw_ctx_attrs[] = {
>  	&blk_mq_hw_sysfs_queued.attr,
> @@ -327,6 +373,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
>  	&blk_mq_hw_sysfs_cpus.attr,
>  	&blk_mq_hw_sysfs_active.attr,
>  	&blk_mq_hw_sysfs_poll.attr,
> +	&blk_mq_hw_sysfs_stat.attr,
>  	NULL,
>  };
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2da1a0ee3318..4555a76d22a7 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -30,6 +30,7 @@
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-tag.h"
> +#include "blk-stat.h"
>  
>  static DEFINE_MUTEX(all_q_mutex);
>  static LIST_HEAD(all_q_list);
> @@ -376,10 +377,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
>  	put_cpu();
>  }
>  
> +static void blk_mq_stat_add(struct request *rq)
> +{
> +	struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
> +
> +	blk_stat_add(stat, rq);
> +}
> +
>  static void __blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
>  
> +	blk_mq_stat_add(rq);
> +
>  	if (!q->softirq_done_fn)
>  		blk_mq_end_request(rq, rq->errors);
>  	else
> @@ -423,6 +433,8 @@ void blk_mq_start_request(struct request *rq)
>  	if (unlikely(blk_bidi_rq(rq)))
>  		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
>  
> +	blk_stat_set_issue_time(&rq->issue_stat);
> +
>  	blk_add_timer(rq);
>  
>  	/*
> @@ -1708,6 +1720,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>  		spin_lock_init(&__ctx->lock);
>  		INIT_LIST_HEAD(&__ctx->rq_list);
>  		__ctx->queue = q;
> +		blk_stat_init(&__ctx->stat[0]);
> +		blk_stat_init(&__ctx->stat[1]);
>  
>  		/* If the cpu isn't online, the cpu is mapped to first hctx */
>  		if (!cpu_online(i))
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index e5d25249028c..8cf16cb69f64 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -1,6 +1,8 @@
>  #ifndef INT_BLK_MQ_H
>  #define INT_BLK_MQ_H
>  
> +#include "blk-stat.h"
> +
>  struct blk_mq_tag_set;
>  
>  struct blk_mq_ctx {
> @@ -18,6 +20,7 @@ struct blk_mq_ctx {
>  
>  	/* incremented at completion time */
>  	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
> +	struct blk_rq_stat	stat[2];
>  
>  	struct request_queue	*queue;
>  	struct kobject		kobj;
> diff --git a/block/blk-stat.c b/block/blk-stat.c
> new file mode 100644
> index 000000000000..642afdc6d0f8
> --- /dev/null
> +++ b/block/blk-stat.c
> @@ -0,0 +1,226 @@
> +/*
> + * Block stat tracking code
> + *
> + * Copyright (C) 2016 Jens Axboe
> + */
> +#include <linux/kernel.h>
> +#include <linux/blk-mq.h>
> +
> +#include "blk-stat.h"
> +#include "blk-mq.h"
> +
> +static void blk_stat_flush_batch(struct blk_rq_stat *stat)
> +{
> +	if (!stat->nr_batch)
> +		return;
> +	if (!stat->nr_samples)
> +		stat->mean = div64_s64(stat->batch, stat->nr_batch);
> +	else {
> +		stat->mean = div64_s64((stat->mean * stat->nr_samples) +
> +					stat->batch,
> +					stat->nr_samples + stat->nr_batch);
> +	}
> +
> +	stat->nr_samples += stat->nr_batch;
> +	stat->nr_batch = stat->batch = 0;
> +}
> +
> +void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
> +{
> +	if (!src->nr_samples)
> +		return;
> +
> +	blk_stat_flush_batch(src);
> +
> +	dst->min = min(dst->min, src->min);
> +	dst->max = max(dst->max, src->max);
> +
> +	if (!dst->nr_samples)
> +		dst->mean = src->mean;
> +	else {
> +		dst->mean = div64_s64((src->mean * src->nr_samples) +
> +					(dst->mean * dst->nr_samples),
> +					dst->nr_samples + src->nr_samples);
> +	}
> +	dst->nr_samples += src->nr_samples;
> +}
> +
> +static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +	struct blk_mq_ctx *ctx;
> +	uint64_t latest = 0;
> +	int i, j, nr;
> +
> +	blk_stat_init(&dst[0]);
> +	blk_stat_init(&dst[1]);
> +
> +	nr = 0;
> +	do {
> +		uint64_t newest = 0;
> +
> +		queue_for_each_hw_ctx(q, hctx, i) {
> +			hctx_for_each_ctx(hctx, ctx, j) {
> +				if (!ctx->stat[0].nr_samples &&
> +				    !ctx->stat[1].nr_samples)
> +					continue;
> +				if (ctx->stat[0].time > newest)
> +					newest = ctx->stat[0].time;
> +				if (ctx->stat[1].time > newest)
> +					newest = ctx->stat[1].time;
> +			}
> +		}
> +
> +		/*
> +		 * No samples
> +		 */
> +		if (!newest)
> +			break;
> +
> +		if (newest > latest)
> +			latest = newest;
> +
> +		queue_for_each_hw_ctx(q, hctx, i) {
> +			hctx_for_each_ctx(hctx, ctx, j) {
> +				if (ctx->stat[0].time == newest) {
> +					blk_stat_sum(&dst[0], &ctx->stat[0]);
> +					nr++;
> +				}
> +				if (ctx->stat[1].time == newest) {
> +					blk_stat_sum(&dst[1], &ctx->stat[1]);
> +					nr++;
> +				}
> +			}
> +		}
> +		/*
> +		 * If we race on finding an entry, just loop back again.
> +		 * Should be very rare.
> +		 */
> +	} while (!nr);
> +
> +	dst[0].time = dst[1].time = latest;
> +}
> +
> +void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
> +{
> +	if (q->mq_ops)
> +		blk_mq_stat_get(q, dst);
> +	else {
> +		memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
> +		memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
> +	}
> +}
> +
> +void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
> +{
> +	struct blk_mq_ctx *ctx;
> +	unsigned int i, nr;
> +
> +	nr = 0;
> +	do {
> +		uint64_t newest = 0;
> +
> +		hctx_for_each_ctx(hctx, ctx, i) {
> +			if (!ctx->stat[0].nr_samples &&
> +			    !ctx->stat[1].nr_samples)
> +				continue;
> +
> +			if (ctx->stat[0].time > newest)
> +				newest = ctx->stat[0].time;
> +			if (ctx->stat[1].time > newest)
> +				newest = ctx->stat[1].time;
> +		}
> +
> +		if (!newest)
> +			break;
> +
> +		hctx_for_each_ctx(hctx, ctx, i) {
> +			if (ctx->stat[0].time == newest) {
> +				blk_stat_sum(&dst[0], &ctx->stat[0]);
> +				nr++;
> +			}
> +			if (ctx->stat[1].time == newest) {
> +				blk_stat_sum(&dst[1], &ctx->stat[1]);
> +				nr++;
> +			}
> +		}
> +		/*
> +		 * If we race on finding an entry, just loop back again.
> +		 * Should be very rare, as the window is only updated
> +		 * occasionally
> +		 */
> +	} while (!nr);
> +}
> +
> +static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
> +{
> +	stat->min = -1ULL;
> +	stat->max = stat->nr_samples = stat->mean = 0;
> +	stat->batch = stat->nr_batch = 0;
> +	stat->time = time_now & BLK_STAT_NSEC_MASK;
> +}
> +
> +void blk_stat_init(struct blk_rq_stat *stat)
> +{
> +	__blk_stat_init(stat, ktime_to_ns(ktime_get()));
> +}
> +
> +static bool __blk_stat_is_current(struct blk_rq_stat *stat, s64 now)
> +{
> +	return (now & BLK_STAT_NSEC_MASK) == (stat->time & BLK_STAT_NSEC_MASK);
> +}
> +
> +bool blk_stat_is_current(struct blk_rq_stat *stat)
> +{
> +	return __blk_stat_is_current(stat, ktime_to_ns(ktime_get()));
> +}
> +
> +void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
> +{
> +	s64 now, value;
> +
> +	now = __blk_stat_time(ktime_to_ns(ktime_get()));
> +	if (now < blk_stat_time(&rq->issue_stat))
> +		return;
> +
> +	if (!__blk_stat_is_current(stat, now))
> +		__blk_stat_init(stat, now);
> +
> +	value = now - blk_stat_time(&rq->issue_stat);
> +	if (value > stat->max)
> +		stat->max = value;
> +	if (value < stat->min)
> +		stat->min = value;
> +
> +	if (stat->batch + value < stat->batch ||
> +	    stat->nr_batch + 1 == BLK_RQ_STAT_BATCH)
> +		blk_stat_flush_batch(stat);
> +
> +	stat->batch += value;
> +	stat->nr_batch++;
> +}
> +
> +void blk_stat_clear(struct request_queue *q)
> +{
> +	if (q->mq_ops) {
> +		struct blk_mq_hw_ctx *hctx;
> +		struct blk_mq_ctx *ctx;
> +		int i, j;
> +
> +		queue_for_each_hw_ctx(q, hctx, i) {
> +			hctx_for_each_ctx(hctx, ctx, j) {
> +				blk_stat_init(&ctx->stat[0]);
> +				blk_stat_init(&ctx->stat[1]);
> +			}
> +		}
> +	} else {
> +		blk_stat_init(&q->rq_stats[0]);
> +		blk_stat_init(&q->rq_stats[1]);
> +	}
> +}
> +
> +void blk_stat_set_issue_time(struct blk_issue_stat *stat)
> +{
> +	stat->time = (stat->time & BLK_STAT_MASK) |
> +			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
> +}
> diff --git a/block/blk-stat.h b/block/blk-stat.h
> new file mode 100644
> index 000000000000..26b1f45dff26
> --- /dev/null
> +++ b/block/blk-stat.h
> @@ -0,0 +1,37 @@
> +#ifndef BLK_STAT_H
> +#define BLK_STAT_H
> +
> +/*
> + * ~0.13s window as a power-of-2 (2^27 nsecs)
> + */
> +#define BLK_STAT_NSEC		134217728ULL
> +#define BLK_STAT_NSEC_MASK	~(BLK_STAT_NSEC - 1)
> +
> +/*
> + * Upper 3 bits can be used elsewhere
> + */
> +#define BLK_STAT_RES_BITS	3
> +#define BLK_STAT_SHIFT		(64 - BLK_STAT_RES_BITS)
> +#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
> +#define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
> +
> +void blk_stat_add(struct blk_rq_stat *, struct request *);
> +void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
> +void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
> +void blk_stat_clear(struct request_queue *q);
> +void blk_stat_init(struct blk_rq_stat *);
> +void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
> +bool blk_stat_is_current(struct blk_rq_stat *);
> +void blk_stat_set_issue_time(struct blk_issue_stat *);
> +
> +static inline u64 __blk_stat_time(u64 time)
> +{
> +	return time & BLK_STAT_TIME_MASK;
> +}
> +
> +static inline u64 blk_stat_time(struct blk_issue_stat *stat)
> +{
> +	return __blk_stat_time(stat->time);
> +}
> +
> +#endif
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 488c2e28feb8..5bb4648f434a 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -401,6 +401,26 @@ static ssize_t queue_dax_show(struct request_queue *q, char *page)
>  	return queue_var_show(blk_queue_dax(q), page);
>  }
>  
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +			pre, (long long) stat->nr_samples,
> +			(long long) stat->mean, (long long) stat->min,
> +			(long long) stat->max);
> +}
> +
> +static ssize_t queue_stats_show(struct request_queue *q, char *page)
> +{
> +	struct blk_rq_stat stat[2];
> +	ssize_t ret;
> +
> +	blk_queue_stat_get(q, stat);
> +
> +	ret = print_stat(page, &stat[0], "read :");
> +	ret += print_stat(page + ret, &stat[1], "write:");
> +	return ret;
> +}
> +
>  static struct queue_sysfs_entry queue_requests_entry = {
>  	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
>  	.show = queue_requests_show,
> @@ -553,6 +573,11 @@ static struct queue_sysfs_entry queue_dax_entry = {
>  	.show = queue_dax_show,
>  };
>  
> +static struct queue_sysfs_entry queue_stats_entry = {
> +	.attr = {.name = "stats", .mode = S_IRUGO },
> +	.show = queue_stats_show,
> +};
> +
>  static struct attribute *default_attrs[] = {
>  	&queue_requests_entry.attr,
>  	&queue_ra_entry.attr,
> @@ -582,6 +607,7 @@ static struct attribute *default_attrs[] = {
>  	&queue_poll_entry.attr,
>  	&queue_wc_entry.attr,
>  	&queue_dax_entry.attr,
> +	&queue_stats_entry.attr,
>  	NULL,
>  };
>  
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 562ac46cb790..4d0044d09984 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -250,4 +250,20 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
>  	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
>  }
>  
> +struct blk_issue_stat {
> +	u64 time;
> +};
> +
> +#define BLK_RQ_STAT_BATCH	64
> +
> +struct blk_rq_stat {
> +	s64 mean;
> +	u64 min;
> +	u64 max;
> +	s32 nr_samples;
> +	s32 nr_batch;
> +	u64 batch;
> +	s64 time;
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 0c677fb35ce4..6bd5eb56894e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -197,6 +197,7 @@ struct request {
>  	struct gendisk *rq_disk;
>  	struct hd_struct *part;
>  	unsigned long start_time;
> +	struct blk_issue_stat issue_stat;
>  #ifdef CONFIG_BLK_CGROUP
>  	struct request_list *rl;		/* rl this rq is alloced from */
>  	unsigned long long start_time_ns;
> @@ -492,6 +493,9 @@ struct request_queue {
>  
>  	unsigned int		nr_sorted;
>  	unsigned int		in_flight[2];
> +
> +	struct blk_rq_stat	rq_stats[2];
> +
>  	/*
>  	 * Number of active block driver functions for which blk_drain_queue()
>  	 * must wait. Must be incremented around functions that unlock the
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-01 21:08 ` [PATCH 7/8] blk-wbt: add general throttling mechanism Jens Axboe
@ 2016-11-08 13:39   ` Jan Kara
  2016-11-08 15:41     ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2016-11-08 13:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:50, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>                wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.

Just one serious question and one nit below:

> +void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
> +{
> +	struct rq_wait *rqw;
> +	int inflight, limit;
> +
> +	if (!(wb_acct & WBT_TRACKED))
> +		return;
> +
> +	rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> +	inflight = atomic_dec_return(&rqw->inflight);
> +
> +	/*
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
> +	 */
> +	if (unlikely(!rwb_enabled(rwb))) {
> +		rwb_wake_all(rwb);
> +		return;
> +	}
> +
> +	/*
> +	 * If the device does write back caching, drop further down
> +	 * before we wake people up.
> +	 */
> +	if (rwb->wc && !wb_recent_wait(rwb))
> +		limit = 0;
> +	else
> +		limit = rwb->wb_normal;

So for devices with write cache, you will completely drain the device
before waking anybody waiting to issue new requests. Isn't it too strict?
In particular may_queue() will allow new writers to issue new writes once
we drop below the limit so it can happen that some processes will be
effectively starved waiting in may_queue?

> +static void wb_timer_fn(unsigned long data)
> +{
> +	struct rq_wb *rwb = (struct rq_wb *) data;
> +	unsigned int inflight = wbt_inflight(rwb);
> +	int status;
> +
> +	status = latency_exceeded(rwb);
> +
> +	trace_wbt_timer(rwb->bdi, status, rwb->scale_step, inflight);
> +
> +	/*
> +	 * If we exceeded the latency target, step down. If we did not,
> +	 * step one level up. If we don't know enough to say either exceeded
> +	 * or ok, then don't do anything.
> +	 */
> +	switch (status) {
> +	case LAT_EXCEEDED:
> +		scale_down(rwb, true);
> +		break;
> +	case LAT_OK:
> +		scale_up(rwb);
> +		break;
> +	case LAT_UNKNOWN_WRITES:
> +		scale_up(rwb);
> +		break;
> +	case LAT_UNKNOWN:
> +		if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
> +			break;
> +		/*
> +		 * We get here for two reasons:
> +		 *
> +		 * 1) We previously scaled reduced depth, and we currently
> +		 *    don't have a valid read/write sample. For that case,
> +		 *    slowly return to center state (step == 0).
> +		 * 2) We started a the center step, but don't have a valid
> +		 *    read/write sample, but we do have writes going on.
> +		 *    Allow step to go negative, to increase write perf.
> +		 */

I think part 2) of the comment now belongs to LAT_UNKNOWN_WRITES label.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] block: hook up writeback throttling
  2016-11-01 21:08 ` [PATCH 8/8] block: hook up writeback throttling Jens Axboe
@ 2016-11-08 13:42   ` Jan Kara
  2016-11-08 15:16     ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2016-11-08 13:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, jack, hch

On Tue 01-11-16 15:08:51, Jens Axboe wrote:
> Enable throttling of buffered writeback to make it a lot
> more smooth, and has way less impact on other system activity.
> Background writeback should be, by definition, background
> activity. The fact that we flush huge bundles of it at the time
> means that it potentially has heavy impacts on foreground workloads,
> which isn't ideal. We can't easily limit the sizes of writes that
> we do, since that would impact file system layout in the presence
> of delayed allocation. So just throttle back buffered writeback,
> unless someone is waiting for it.
> 
> The algorithm for when to throttle takes its inspiration in the
> CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
> the minimum latencies of requests over a window of time. In that
> window of time, if the minimum latency of any request exceeds a
> given target, then a scale count is incremented and the queue depth
> is shrunk. The next monitoring window is shrunk accordingly. Unlike
> CoDel, if we hit a window that exhibits good behavior, then we
> simply increment the scale count and re-calculate the limits for that
> scale value. This prevents us from oscillating between a
> close-to-ideal value and max all the time, instead remaining in the
> windows where we get good behavior.
> 
> Unlike CoDel, blk-wb allows the scale count to to negative. This
> happens if we primarily have writes going on. Unlike positive
> scale counts, this doesn't change the size of the monitoring window.
> When the heavy writers finish, blk-bw quickly snaps back to it's
> stable state of a zero scale count.
> 
> The patch registers two sysfs entries. The first one, 'wb_window_usec',
> defines the window of monitoring. The second one, 'wb_lat_usec',
> sets the latency target for the window. It defaults to 2 msec for
> non-rotational storage, and 75 msec for rotational storage. Setting
> this value to '0' disables blk-wb. Generally, a user would not have
> to touch these settings.
> 
> We don't enable WBT on devices that are managed with CFQ, and have
> a non-root block cgroup attached. If we have a proportional share setup
> on this particular disk, then the wbt throttling will interfere with
> that. We don't have a strong need for wbt for that case, since we will
> rely on CFQ doing that for us.

Just one nit: Don't you miss wbt_exit() call for legacy block layer? I
don't see where that happens.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 8/8] block: hook up writeback throttling
  2016-11-08 13:42   ` Jan Kara
@ 2016-11-08 15:16     ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-08 15:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, hch

On 11/08/2016 06:42 AM, Jan Kara wrote:
> On Tue 01-11-16 15:08:51, Jens Axboe wrote:
>> Enable throttling of buffered writeback to make it a lot
>> more smooth, and has way less impact on other system activity.
>> Background writeback should be, by definition, background
>> activity. The fact that we flush huge bundles of it at the time
>> means that it potentially has heavy impacts on foreground workloads,
>> which isn't ideal. We can't easily limit the sizes of writes that
>> we do, since that would impact file system layout in the presence
>> of delayed allocation. So just throttle back buffered writeback,
>> unless someone is waiting for it.
>>
>> The algorithm for when to throttle takes its inspiration in the
>> CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
>> the minimum latencies of requests over a window of time. In that
>> window of time, if the minimum latency of any request exceeds a
>> given target, then a scale count is incremented and the queue depth
>> is shrunk. The next monitoring window is shrunk accordingly. Unlike
>> CoDel, if we hit a window that exhibits good behavior, then we
>> simply increment the scale count and re-calculate the limits for that
>> scale value. This prevents us from oscillating between a
>> close-to-ideal value and max all the time, instead remaining in the
>> windows where we get good behavior.
>>
>> Unlike CoDel, blk-wb allows the scale count to to negative. This
>> happens if we primarily have writes going on. Unlike positive
>> scale counts, this doesn't change the size of the monitoring window.
>> When the heavy writers finish, blk-bw quickly snaps back to it's
>> stable state of a zero scale count.
>>
>> The patch registers two sysfs entries. The first one, 'wb_window_usec',
>> defines the window of monitoring. The second one, 'wb_lat_usec',
>> sets the latency target for the window. It defaults to 2 msec for
>> non-rotational storage, and 75 msec for rotational storage. Setting
>> this value to '0' disables blk-wb. Generally, a user would not have
>> to touch these settings.
>>
>> We don't enable WBT on devices that are managed with CFQ, and have
>> a non-root block cgroup attached. If we have a proportional share setup
>> on this particular disk, then the wbt throttling will interfere with
>> that. We don't have a strong need for wbt for that case, since we will
>> rely on CFQ doing that for us.
>
> Just one nit: Don't you miss wbt_exit() call for legacy block layer? I
> don't see where that happens.

Huh yes, good point, that must have been lost along the way. I'll readd
it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-08 13:30   ` Jan Kara
@ 2016-11-08 15:25     ` Jens Axboe
  2016-11-09  9:01       ` Jan Kara
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-08 15:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, hch

On 11/08/2016 06:30 AM, Jan Kara wrote:
> On Tue 01-11-16 15:08:49, Jens Axboe wrote:
>> For legacy block, we simply track them in the request queue. For
>> blk-mq, we track them on a per-sw queue basis, which we can then
>> sum up through the hardware queues and finally to a per device
>> state.
>>
>> The stats are tracked in, roughly, 0.1s interval windows.
>>
>> Add sysfs files to display the stats.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>
> This patch looks mostly good to me but I have one concern: You track
> statistics in a fixed 134ms window, stats get cleared at the beginning of
> each window. Now this can interact with the writeback window and latency
> settings which are dynamic and settable from userspace - so if the
> writeback code observation window gets set larger than the stats window,
> things become strange since you'll likely miss quite some observations
> about read latencies. So I think you need to make sure stats window is
> always larger than writeback window. Or actually, why do you have something
> like stats window and don't leave clearing of statistics completely to the
> writeback tracking code?

That's a good point, and there actually used to be a comment to that
effect in the code. I think the best solution here would be to make the
stats code mask available somewhere, and allow a consumer of the stats
to request a larger window.

Similarly, we could make the stat window be driven by the consumer, as
you suggest.

Currently there are two pending submissions that depend on the stats
code. One is this writeback series, and the other one is the hybrid
polling code. The latter does not really care about the window size as
such, since it has no monitoring window of its own, and it wants the
auto-clearing as well.

I don't mind working on additions for this, but I'd prefer if we could
layer them on top of the existing series instead of respinning it.
There's considerable test time on the existing patchset. Would that work
for you? Especially collapsing the stats and wbt windows would require
some re-architecting.

> Also as a side note - nobody currently uses the mean value of the
> statistics. It may be faster to track just sum and count so that mean can
> be computed on request which will be presumably much more rare than current
> situation where we recompute the mean on each batch update. Actually, that
> way you could get rid of the batching as well I assume.

That could be opt-in as well. The poll code uses it. And fwiw, it is
exposed through sysfs as well.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-08 13:39   ` Jan Kara
@ 2016-11-08 15:41     ` Jens Axboe
  2016-11-09  8:40       ` Jan Kara
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-08 15:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, hch

On Tue, Nov 08 2016, Jan Kara wrote:
> On Tue 01-11-16 15:08:50, Jens Axboe wrote:
> > We can hook this up to the block layer, to help throttle buffered
> > writes.
> > 
> > wbt registers a few trace points that can be used to track what is
> > happening in the system:
> > 
> > wbt_lat: 259:0: latency 2446318
> > wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
> >                wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> > wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
> > 
> > This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
> > dumps the current read/write stats for that window, and wbt_step shows a
> > step down event where we now scale back writes. Each trace includes the
> > device, 259:0 in this case.
> 
> Just one serious question and one nit below:
> 
> > +void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
> > +{
> > +	struct rq_wait *rqw;
> > +	int inflight, limit;
> > +
> > +	if (!(wb_acct & WBT_TRACKED))
> > +		return;
> > +
> > +	rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> > +	inflight = atomic_dec_return(&rqw->inflight);
> > +
> > +	/*
> > +	 * wbt got disabled with IO in flight. Wake up any potential
> > +	 * waiters, we don't have to do more than that.
> > +	 */
> > +	if (unlikely(!rwb_enabled(rwb))) {
> > +		rwb_wake_all(rwb);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If the device does write back caching, drop further down
> > +	 * before we wake people up.
> > +	 */
> > +	if (rwb->wc && !wb_recent_wait(rwb))
> > +		limit = 0;
> > +	else
> > +		limit = rwb->wb_normal;
> 
> So for devices with write cache, you will completely drain the device
> before waking anybody waiting to issue new requests. Isn't it too strict?
> In particular may_queue() will allow new writers to issue new writes once
> we drop below the limit so it can happen that some processes will be
> effectively starved waiting in may_queue?

It is strict, and perhaps too strict. In testing, it's the only method
that's proven to keep the writeback caching devices in check. It will
round robin the writers, if we have more, which isn't necessarily a bad
thing. Each will get to do a burst of depth writes, then wait for a new
one.

> > +	case LAT_UNKNOWN:
> > +		if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
> > +			break;
> > +		/*
> > +		 * We get here for two reasons:
> > +		 *
> > +		 * 1) We previously scaled reduced depth, and we currently
> > +		 *    don't have a valid read/write sample. For that case,
> > +		 *    slowly return to center state (step == 0).
> > +		 * 2) We started a the center step, but don't have a valid
> > +		 *    read/write sample, but we do have writes going on.
> > +		 *    Allow step to go negative, to increase write perf.
> > +		 */
> 
> I think part 2) of the comment now belongs to LAT_UNKNOWN_WRITES label.

Indeed, that got moved around a bit, I'll fix that up.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-08 15:41     ` Jens Axboe
@ 2016-11-09  8:40       ` Jan Kara
  2016-11-09 16:07         ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2016-11-09  8:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, hch

On Tue 08-11-16 08:41:09, Jens Axboe wrote:
> On Tue, Nov 08 2016, Jan Kara wrote:
> > On Tue 01-11-16 15:08:50, Jens Axboe wrote:
> > > We can hook this up to the block layer, to help throttle buffered
> > > writes.
> > > 
> > > wbt registers a few trace points that can be used to track what is
> > > happening in the system:
> > > 
> > > wbt_lat: 259:0: latency 2446318
> > > wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
> > >                wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> > > wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
> > > 
> > > This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
> > > dumps the current read/write stats for that window, and wbt_step shows a
> > > step down event where we now scale back writes. Each trace includes the
> > > device, 259:0 in this case.
> > 
> > Just one serious question and one nit below:
> > 
> > > +void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
> > > +{
> > > +	struct rq_wait *rqw;
> > > +	int inflight, limit;
> > > +
> > > +	if (!(wb_acct & WBT_TRACKED))
> > > +		return;
> > > +
> > > +	rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> > > +	inflight = atomic_dec_return(&rqw->inflight);
> > > +
> > > +	/*
> > > +	 * wbt got disabled with IO in flight. Wake up any potential
> > > +	 * waiters, we don't have to do more than that.
> > > +	 */
> > > +	if (unlikely(!rwb_enabled(rwb))) {
> > > +		rwb_wake_all(rwb);
> > > +		return;
> > > +	}
> > > +
> > > +	/*
> > > +	 * If the device does write back caching, drop further down
> > > +	 * before we wake people up.
> > > +	 */
> > > +	if (rwb->wc && !wb_recent_wait(rwb))
> > > +		limit = 0;
> > > +	else
> > > +		limit = rwb->wb_normal;
> > 
> > So for devices with write cache, you will completely drain the device
> > before waking anybody waiting to issue new requests. Isn't it too strict?
> > In particular may_queue() will allow new writers to issue new writes once
> > we drop below the limit so it can happen that some processes will be
> > effectively starved waiting in may_queue?
> 
> It is strict, and perhaps too strict. In testing, it's the only method
> that's proven to keep the writeback caching devices in check. It will
> round robin the writers, if we have more, which isn't necessarily a bad
> thing. Each will get to do a burst of depth writes, then wait for a new
> one.

Well, I'm more concerned about a situation where one writer does a bursty
write and blocks sleeping in may_queue(). Another writer produces a steady
flow of write requests so that never causes the write queue to completely
drain but that writer also never blocks in may_queue() when it starts
queueing after write queue has somewhat drained because it never submits
many requests in parallel. In such case the first writer would get starved
AFAIU.

Also I'm not sure why such logic for devices with writeback cache is
needed. Sure the disk is fast to accept writes but if that causes long read
latencies, we should scale down the writeback limits so that we eventually
end up submitting only one write request anyway - effectively the same
thing as limit=0 - won't we?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-08 15:25     ` Jens Axboe
@ 2016-11-09  9:01       ` Jan Kara
  2016-11-09 16:09         ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2016-11-09  9:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, axboe, linux-kernel, linux-fsdevel, linux-block, hch

On Tue 08-11-16 08:25:52, Jens Axboe wrote:
> On 11/08/2016 06:30 AM, Jan Kara wrote:
> >On Tue 01-11-16 15:08:49, Jens Axboe wrote:
> >>For legacy block, we simply track them in the request queue. For
> >>blk-mq, we track them on a per-sw queue basis, which we can then
> >>sum up through the hardware queues and finally to a per device
> >>state.
> >>
> >>The stats are tracked in, roughly, 0.1s interval windows.
> >>
> >>Add sysfs files to display the stats.
> >>
> >>Signed-off-by: Jens Axboe <axboe@fb.com>
> >
> >This patch looks mostly good to me but I have one concern: You track
> >statistics in a fixed 134ms window, stats get cleared at the beginning of
> >each window. Now this can interact with the writeback window and latency
> >settings which are dynamic and settable from userspace - so if the
> >writeback code observation window gets set larger than the stats window,
> >things become strange since you'll likely miss quite some observations
> >about read latencies. So I think you need to make sure stats window is
> >always larger than writeback window. Or actually, why do you have something
> >like stats window and don't leave clearing of statistics completely to the
> >writeback tracking code?
> 
> That's a good point, and there actually used to be a comment to that
> effect in the code. I think the best solution here would be to make the
> stats code mask available somewhere, and allow a consumer of the stats
> to request a larger window.
> 
> Similarly, we could make the stat window be driven by the consumer, as
> you suggest.
> 
> Currently there are two pending submissions that depend on the stats
> code. One is this writeback series, and the other one is the hybrid
> polling code. The latter does not really care about the window size as
> such, since it has no monitoring window of its own, and it wants the
> auto-clearing as well.
> 
> I don't mind working on additions for this, but I'd prefer if we could
> layer them on top of the existing series instead of respinning it.
> There's considerable test time on the existing patchset. Would that work
> for you? Especially collapsing the stats and wbt windows would require
> some re-architecting.

OK, that works for me. Actually, when thinking about this, I have one more
suggestion: Do we really want to expose the wbt window as a sysfs tunable?
I guess it is good for initial experiments but longer term having the wbt
window length be a function of target read latency might be better.
Generally you want the window length to be considerably larger than the
target latency but OTOH not too large so that the algorithm can react
reasonably quickly so that suggests it could really be autotuned (and we
scale the window anyway to adapt it to current situation).

> >Also as a side note - nobody currently uses the mean value of the
> >statistics. It may be faster to track just sum and count so that mean can
> >be computed on request which will be presumably much more rare than current
> >situation where we recompute the mean on each batch update. Actually, that
> >way you could get rid of the batching as well I assume.
> 
> That could be opt-in as well. The poll code uses it. And fwiw, it is
> exposed through sysfs as well.

Yeah, my point was that just doing the division in response to sysfs read
or actual request to read the average is likely going to be less expensive
than having to do it on each batch completion (actually, you seem to have
that batching code only so that you don't have to do the division too
often). Whether my suggestion is right depends on how often polling code
actually needs to read the average...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-09  8:40       ` Jan Kara
@ 2016-11-09 16:07         ` Jens Axboe
  2016-11-09 19:52           ` Jens Axboe
  2016-11-10  0:00           ` Dave Chinner
  0 siblings, 2 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-09 16:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, hch

On 11/09/2016 01:40 AM, Jan Kara wrote:
>>> So for devices with write cache, you will completely drain the device
>>> before waking anybody waiting to issue new requests. Isn't it too strict?
>>> In particular may_queue() will allow new writers to issue new writes once
>>> we drop below the limit so it can happen that some processes will be
>>> effectively starved waiting in may_queue?
>>
>> It is strict, and perhaps too strict. In testing, it's the only method
>> that's proven to keep the writeback caching devices in check. It will
>> round robin the writers, if we have more, which isn't necessarily a bad
>> thing. Each will get to do a burst of depth writes, then wait for a new
>> one.
>
> Well, I'm more concerned about a situation where one writer does a
> bursty write and blocks sleeping in may_queue(). Another writer
> produces a steady flow of write requests so that never causes the
> write queue to completely drain but that writer also never blocks in
> may_queue() when it starts queueing after write queue has somewhat
> drained because it never submits many requests in parallel. In such
> case the first writer would get starved AFAIU.

I see what you are saying. I can modify the logic to ensure that if we
do have a waiter, we queue up others behind it. That should get rid of
that concern.

> Also I'm not sure why such logic for devices with writeback cache is
> needed. Sure the disk is fast to accept writes but if that causes long
> read latencies, we should scale down the writeback limits so that we
> eventually end up submitting only one write request anyway -
> effectively the same thing as limit=0 - won't we?

Basically we want to avoid getting into that situation. The problem with
write caching is that it takes a while for you to notice that anything
is wrong, and when you do, you are way down in the hole. That causes the
first violations to be pretty bad.

I'm fine with playing with this logic and improving it, but I'd rather
wait for a 2nd series for that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-09  9:01       ` Jan Kara
@ 2016-11-09 16:09         ` Jens Axboe
  2016-11-09 19:52           ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-09 16:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block, hch

On 11/09/2016 02:01 AM, Jan Kara wrote:
> On Tue 08-11-16 08:25:52, Jens Axboe wrote:
>> On 11/08/2016 06:30 AM, Jan Kara wrote:
>>> On Tue 01-11-16 15:08:49, Jens Axboe wrote:
>>>> For legacy block, we simply track them in the request queue. For
>>>> blk-mq, we track them on a per-sw queue basis, which we can then
>>>> sum up through the hardware queues and finally to a per device
>>>> state.
>>>>
>>>> The stats are tracked in, roughly, 0.1s interval windows.
>>>>
>>>> Add sysfs files to display the stats.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>
>>> This patch looks mostly good to me but I have one concern: You track
>>> statistics in a fixed 134ms window, stats get cleared at the beginning of
>>> each window. Now this can interact with the writeback window and latency
>>> settings which are dynamic and settable from userspace - so if the
>>> writeback code observation window gets set larger than the stats window,
>>> things become strange since you'll likely miss quite some observations
>>> about read latencies. So I think you need to make sure stats window is
>>> always larger than writeback window. Or actually, why do you have something
>>> like stats window and don't leave clearing of statistics completely to the
>>> writeback tracking code?
>>
>> That's a good point, and there actually used to be a comment to that
>> effect in the code. I think the best solution here would be to make the
>> stats code mask available somewhere, and allow a consumer of the stats
>> to request a larger window.
>>
>> Similarly, we could make the stat window be driven by the consumer, as
>> you suggest.
>>
>> Currently there are two pending submissions that depend on the stats
>> code. One is this writeback series, and the other one is the hybrid
>> polling code. The latter does not really care about the window size as
>> such, since it has no monitoring window of its own, and it wants the
>> auto-clearing as well.
>>
>> I don't mind working on additions for this, but I'd prefer if we could
>> layer them on top of the existing series instead of respinning it.
>> There's considerable test time on the existing patchset. Would that work
>> for you? Especially collapsing the stats and wbt windows would require
>> some re-architecting.
>
> OK, that works for me. Actually, when thinking about this, I have one more
> suggestion: Do we really want to expose the wbt window as a sysfs tunable?
> I guess it is good for initial experiments but longer term having the wbt
> window length be a function of target read latency might be better.
> Generally you want the window length to be considerably larger than the
> target latency but OTOH not too large so that the algorithm can react
> reasonably quickly so that suggests it could really be autotuned (and we
> scale the window anyway to adapt it to current situation).

That's not a bad idea, I have thought about that as well before. We
don't need the window tunable, and you are right, it can be a function
of the desired latency.

I'll hardwire the 100msec latency window for now and get rid of the
exposed tunable. It's harder to remove sysfs files once they have made
it into the kernel...

>>> Also as a side note - nobody currently uses the mean value of the
>>> statistics. It may be faster to track just sum and count so that mean can
>>> be computed on request which will be presumably much more rare than current
>>> situation where we recompute the mean on each batch update. Actually, that
>>> way you could get rid of the batching as well I assume.
>>
>> That could be opt-in as well. The poll code uses it. And fwiw, it is
>> exposed through sysfs as well.
>
> Yeah, my point was that just doing the division in response to sysfs read
> or actual request to read the average is likely going to be less expensive
> than having to do it on each batch completion (actually, you seem to have
> that batching code only so that you don't have to do the division too
> often). Whether my suggestion is right depends on how often polling code
> actually needs to read the average...

The polling code currently does it for every IO... That is not ideal for
other purposes, I think I'm going to work on changing that to just keep
the previous window available, so we only need to read it when the stats
window changes.

With the batching, I don't see the division as a problem in micro
benchmarks. That's why I added the batching, because it did show up
before.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-09 16:09         ` Jens Axboe
@ 2016-11-09 19:52           ` Jens Axboe
  2016-11-10 19:38             ` Jan Kara
  0 siblings, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-09 19:52 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, hch

On 11/09/2016 09:09 AM, Jens Axboe wrote:
> On 11/09/2016 02:01 AM, Jan Kara wrote:
>> On Tue 08-11-16 08:25:52, Jens Axboe wrote:
>>> On 11/08/2016 06:30 AM, Jan Kara wrote:
>>>> On Tue 01-11-16 15:08:49, Jens Axboe wrote:
>>>>> For legacy block, we simply track them in the request queue. For
>>>>> blk-mq, we track them on a per-sw queue basis, which we can then
>>>>> sum up through the hardware queues and finally to a per device
>>>>> state.
>>>>>
>>>>> The stats are tracked in, roughly, 0.1s interval windows.
>>>>>
>>>>> Add sysfs files to display the stats.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>>
>>>> This patch looks mostly good to me but I have one concern: You track
>>>> statistics in a fixed 134ms window, stats get cleared at the
>>>> beginning of
>>>> each window. Now this can interact with the writeback window and
>>>> latency
>>>> settings which are dynamic and settable from userspace - so if the
>>>> writeback code observation window gets set larger than the stats
>>>> window,
>>>> things become strange since you'll likely miss quite some observations
>>>> about read latencies. So I think you need to make sure stats window is
>>>> always larger than writeback window. Or actually, why do you have
>>>> something
>>>> like stats window and don't leave clearing of statistics completely
>>>> to the
>>>> writeback tracking code?
>>>
>>> That's a good point, and there actually used to be a comment to that
>>> effect in the code. I think the best solution here would be to make the
>>> stats code mask available somewhere, and allow a consumer of the stats
>>> to request a larger window.
>>>
>>> Similarly, we could make the stat window be driven by the consumer, as
>>> you suggest.
>>>
>>> Currently there are two pending submissions that depend on the stats
>>> code. One is this writeback series, and the other one is the hybrid
>>> polling code. The latter does not really care about the window size as
>>> such, since it has no monitoring window of its own, and it wants the
>>> auto-clearing as well.
>>>
>>> I don't mind working on additions for this, but I'd prefer if we could
>>> layer them on top of the existing series instead of respinning it.
>>> There's considerable test time on the existing patchset. Would that work
>>> for you? Especially collapsing the stats and wbt windows would require
>>> some re-architecting.
>>
>> OK, that works for me. Actually, when thinking about this, I have one
>> more
>> suggestion: Do we really want to expose the wbt window as a sysfs
>> tunable?
>> I guess it is good for initial experiments but longer term having the wbt
>> window length be a function of target read latency might be better.
>> Generally you want the window length to be considerably larger than the
>> target latency but OTOH not too large so that the algorithm can react
>> reasonably quickly so that suggests it could really be autotuned (and we
>> scale the window anyway to adapt it to current situation).
>
> That's not a bad idea, I have thought about that as well before. We
> don't need the window tunable, and you are right, it can be a function
> of the desired latency.
>
> I'll hardwire the 100msec latency window for now and get rid of the
> exposed tunable. It's harder to remove sysfs files once they have made
> it into the kernel...

Killed the sysfs variable, so for now it'll be a 100msec window by
default.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-09 16:07         ` Jens Axboe
@ 2016-11-09 19:52           ` Jens Axboe
  2016-11-10 19:36             ` Jan Kara
  2016-11-10  0:00           ` Dave Chinner
  1 sibling, 1 reply; 39+ messages in thread
From: Jens Axboe @ 2016-11-09 19:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, hch

On 11/09/2016 09:07 AM, Jens Axboe wrote:
> On 11/09/2016 01:40 AM, Jan Kara wrote:
>>>> So for devices with write cache, you will completely drain the device
>>>> before waking anybody waiting to issue new requests. Isn't it too
>>>> strict?
>>>> In particular may_queue() will allow new writers to issue new writes
>>>> once
>>>> we drop below the limit so it can happen that some processes will be
>>>> effectively starved waiting in may_queue?
>>>
>>> It is strict, and perhaps too strict. In testing, it's the only method
>>> that's proven to keep the writeback caching devices in check. It will
>>> round robin the writers, if we have more, which isn't necessarily a bad
>>> thing. Each will get to do a burst of depth writes, then wait for a new
>>> one.
>>
>> Well, I'm more concerned about a situation where one writer does a
>> bursty write and blocks sleeping in may_queue(). Another writer
>> produces a steady flow of write requests so that never causes the
>> write queue to completely drain but that writer also never blocks in
>> may_queue() when it starts queueing after write queue has somewhat
>> drained because it never submits many requests in parallel. In such
>> case the first writer would get starved AFAIU.
>
> I see what you are saying. I can modify the logic to ensure that if we
> do have a waiter, we queue up others behind it. That should get rid of
> that concern.

I added that - if we currently have a waiter, we'll add ourselves to the
back of the waitqueue and wait.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-09 16:07         ` Jens Axboe
  2016-11-09 19:52           ` Jens Axboe
@ 2016-11-10  0:00           ` Dave Chinner
  1 sibling, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2016-11-10  0:00 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, hch

On Wed, Nov 09, 2016 at 09:07:08AM -0700, Jens Axboe wrote:
> On 11/09/2016 01:40 AM, Jan Kara wrote:
> >Also I'm not sure why such logic for devices with writeback cache is
> >needed. Sure the disk is fast to accept writes but if that causes long
> >read latencies, we should scale down the writeback limits so that we
> >eventually end up submitting only one write request anyway -
> >effectively the same thing as limit=0 - won't we?
> 
> Basically we want to avoid getting into that situation. The problem with
> write caching is that it takes a while for you to notice that anything
> is wrong, and when you do, you are way down in the hole. That causes the
> first violations to be pretty bad.

Yeah, slow RAID devices with a large BBWC in front of them are
notorious for doing this. You won't notice the actual IO performance
until the write cache is filled (can be GB in size) and by then it's
way too late to fix up with OS level queuing...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 7/8] blk-wbt: add general throttling mechanism
  2016-11-09 19:52           ` Jens Axboe
@ 2016-11-10 19:36             ` Jan Kara
  0 siblings, 0 replies; 39+ messages in thread
From: Jan Kara @ 2016-11-10 19:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, hch

On Wed 09-11-16 12:52:59, Jens Axboe wrote:
> On 11/09/2016 09:07 AM, Jens Axboe wrote:
> >On 11/09/2016 01:40 AM, Jan Kara wrote:
> >>>>So for devices with write cache, you will completely drain the device
> >>>>before waking anybody waiting to issue new requests. Isn't it too
> >>>>strict?
> >>>>In particular may_queue() will allow new writers to issue new writes
> >>>>once
> >>>>we drop below the limit so it can happen that some processes will be
> >>>>effectively starved waiting in may_queue?
> >>>
> >>>It is strict, and perhaps too strict. In testing, it's the only method
> >>>that's proven to keep the writeback caching devices in check. It will
> >>>round robin the writers, if we have more, which isn't necessarily a bad
> >>>thing. Each will get to do a burst of depth writes, then wait for a new
> >>>one.
> >>
> >>Well, I'm more concerned about a situation where one writer does a
> >>bursty write and blocks sleeping in may_queue(). Another writer
> >>produces a steady flow of write requests so that never causes the
> >>write queue to completely drain but that writer also never blocks in
> >>may_queue() when it starts queueing after write queue has somewhat
> >>drained because it never submits many requests in parallel. In such
> >>case the first writer would get starved AFAIU.
> >
> >I see what you are saying. I can modify the logic to ensure that if we
> >do have a waiter, we queue up others behind it. That should get rid of
> >that concern.
> 
> I added that - if we currently have a waiter, we'll add ourselves to the
> back of the waitqueue and wait.

OK, sounds good to me. If the write queue draining will show to be an
issue, it will be at least clearly visible with this logic.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-09 19:52           ` Jens Axboe
@ 2016-11-10 19:38             ` Jan Kara
  2016-11-12  5:19               ` Jens Axboe
  0 siblings, 1 reply; 39+ messages in thread
From: Jan Kara @ 2016-11-10 19:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jens Axboe, Jan Kara, linux-kernel, linux-fsdevel, linux-block, hch

On Wed 09-11-16 12:52:25, Jens Axboe wrote:
> On 11/09/2016 09:09 AM, Jens Axboe wrote:
> >On 11/09/2016 02:01 AM, Jan Kara wrote:
> >>On Tue 08-11-16 08:25:52, Jens Axboe wrote:
> >>>On 11/08/2016 06:30 AM, Jan Kara wrote:
> >>>>On Tue 01-11-16 15:08:49, Jens Axboe wrote:
> >>>>>For legacy block, we simply track them in the request queue. For
> >>>>>blk-mq, we track them on a per-sw queue basis, which we can then
> >>>>>sum up through the hardware queues and finally to a per device
> >>>>>state.
> >>>>>
> >>>>>The stats are tracked in, roughly, 0.1s interval windows.
> >>>>>
> >>>>>Add sysfs files to display the stats.
> >>>>>
> >>>>>Signed-off-by: Jens Axboe <axboe@fb.com>
> >>>>
> >>>>This patch looks mostly good to me but I have one concern: You track
> >>>>statistics in a fixed 134ms window, stats get cleared at the
> >>>>beginning of
> >>>>each window. Now this can interact with the writeback window and
> >>>>latency
> >>>>settings which are dynamic and settable from userspace - so if the
> >>>>writeback code observation window gets set larger than the stats
> >>>>window,
> >>>>things become strange since you'll likely miss quite some observations
> >>>>about read latencies. So I think you need to make sure stats window is
> >>>>always larger than writeback window. Or actually, why do you have
> >>>>something
> >>>>like stats window and don't leave clearing of statistics completely
> >>>>to the
> >>>>writeback tracking code?
> >>>
> >>>That's a good point, and there actually used to be a comment to that
> >>>effect in the code. I think the best solution here would be to make the
> >>>stats code mask available somewhere, and allow a consumer of the stats
> >>>to request a larger window.
> >>>
> >>>Similarly, we could make the stat window be driven by the consumer, as
> >>>you suggest.
> >>>
> >>>Currently there are two pending submissions that depend on the stats
> >>>code. One is this writeback series, and the other one is the hybrid
> >>>polling code. The latter does not really care about the window size as
> >>>such, since it has no monitoring window of its own, and it wants the
> >>>auto-clearing as well.
> >>>
> >>>I don't mind working on additions for this, but I'd prefer if we could
> >>>layer them on top of the existing series instead of respinning it.
> >>>There's considerable test time on the existing patchset. Would that work
> >>>for you? Especially collapsing the stats and wbt windows would require
> >>>some re-architecting.
> >>
> >>OK, that works for me. Actually, when thinking about this, I have one
> >>more
> >>suggestion: Do we really want to expose the wbt window as a sysfs
> >>tunable?
> >>I guess it is good for initial experiments but longer term having the wbt
> >>window length be a function of target read latency might be better.
> >>Generally you want the window length to be considerably larger than the
> >>target latency but OTOH not too large so that the algorithm can react
> >>reasonably quickly so that suggests it could really be autotuned (and we
> >>scale the window anyway to adapt it to current situation).
> >
> >That's not a bad idea, I have thought about that as well before. We
> >don't need the window tunable, and you are right, it can be a function
> >of the desired latency.
> >
> >I'll hardwire the 100msec latency window for now and get rid of the
> >exposed tunable. It's harder to remove sysfs files once they have made
> >it into the kernel...
> 
> Killed the sysfs variable, so for now it'll be a 100msec window by
> default.

OK, I guess good enough to get this merged.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-11-10 19:38             ` Jan Kara
@ 2016-11-12  5:19               ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2016-11-12  5:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jens Axboe, linux-kernel, linux-fsdevel, linux-block, hch

On 11/10/2016 12:38 PM, Jan Kara wrote:
> On Wed 09-11-16 12:52:25, Jens Axboe wrote:
>> On 11/09/2016 09:09 AM, Jens Axboe wrote:
>>> On 11/09/2016 02:01 AM, Jan Kara wrote:
>>>> On Tue 08-11-16 08:25:52, Jens Axboe wrote:
>>>>> On 11/08/2016 06:30 AM, Jan Kara wrote:
>>>>>> On Tue 01-11-16 15:08:49, Jens Axboe wrote:
>>>>>>> For legacy block, we simply track them in the request queue. For
>>>>>>> blk-mq, we track them on a per-sw queue basis, which we can then
>>>>>>> sum up through the hardware queues and finally to a per device
>>>>>>> state.
>>>>>>>
>>>>>>> The stats are tracked in, roughly, 0.1s interval windows.
>>>>>>>
>>>>>>> Add sysfs files to display the stats.
>>>>>>>
>>>>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>>>>
>>>>>> This patch looks mostly good to me but I have one concern: You track
>>>>>> statistics in a fixed 134ms window, stats get cleared at the
>>>>>> beginning of
>>>>>> each window. Now this can interact with the writeback window and
>>>>>> latency
>>>>>> settings which are dynamic and settable from userspace - so if the
>>>>>> writeback code observation window gets set larger than the stats
>>>>>> window,
>>>>>> things become strange since you'll likely miss quite some observations
>>>>>> about read latencies. So I think you need to make sure stats window is
>>>>>> always larger than writeback window. Or actually, why do you have
>>>>>> something
>>>>>> like stats window and don't leave clearing of statistics completely
>>>>>> to the
>>>>>> writeback tracking code?
>>>>>
>>>>> That's a good point, and there actually used to be a comment to that
>>>>> effect in the code. I think the best solution here would be to make the
>>>>> stats code mask available somewhere, and allow a consumer of the stats
>>>>> to request a larger window.
>>>>>
>>>>> Similarly, we could make the stat window be driven by the consumer, as
>>>>> you suggest.
>>>>>
>>>>> Currently there are two pending submissions that depend on the stats
>>>>> code. One is this writeback series, and the other one is the hybrid
>>>>> polling code. The latter does not really care about the window size as
>>>>> such, since it has no monitoring window of its own, and it wants the
>>>>> auto-clearing as well.
>>>>>
>>>>> I don't mind working on additions for this, but I'd prefer if we could
>>>>> layer them on top of the existing series instead of respinning it.
>>>>> There's considerable test time on the existing patchset. Would that work
>>>>> for you? Especially collapsing the stats and wbt windows would require
>>>>> some re-architecting.
>>>>
>>>> OK, that works for me. Actually, when thinking about this, I have one
>>>> more
>>>> suggestion: Do we really want to expose the wbt window as a sysfs
>>>> tunable?
>>>> I guess it is good for initial experiments but longer term having the wbt
>>>> window length be a function of target read latency might be better.
>>>> Generally you want the window length to be considerably larger than the
>>>> target latency but OTOH not too large so that the algorithm can react
>>>> reasonably quickly so that suggests it could really be autotuned (and we
>>>> scale the window anyway to adapt it to current situation).
>>>
>>> That's not a bad idea, I have thought about that as well before. We
>>> don't need the window tunable, and you are right, it can be a function
>>> of the desired latency.
>>>
>>> I'll hardwire the 100msec latency window for now and get rid of the
>>> exposed tunable. It's harder to remove sysfs files once they have made
>>> it into the kernel...
>>
>> Killed the sysfs variable, so for now it'll be a 100msec window by
>> default.
>
> OK, I guess good enough to get this merged.

Thanks Jan!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 8/8] block: hook up writeback throttling
  2016-10-26 20:52 [PATCHSET] block: buffered " Jens Axboe
@ 2016-10-26 20:52 ` Jens Axboe
  0 siblings, 0 replies; 39+ messages in thread
From: Jens Axboe @ 2016-10-26 20:52 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: jack, kernel, Jens Axboe

Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.

The patch registers two sysfs entries. The first one, 'wb_window_usec',
defines the window of monitoring. The second one, 'wb_lat_usec',
sets the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting
this value to '0' disables blk-wb. Generally, a user would not have
to touch these settings.

We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 Documentation/block/queue-sysfs.txt |  13 ++++
 block/Kconfig                       |  24 +++++++
 block/blk-core.c                    |  18 ++++-
 block/blk-mq.c                      |  27 +++++++-
 block/blk-settings.c                |   4 ++
 block/blk-sysfs.c                   | 134 ++++++++++++++++++++++++++++++++++++
 block/cfq-iosched.c                 |  14 ++++
 include/linux/blkdev.h              |   3 +
 8 files changed, 233 insertions(+), 4 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 2a3904030dea..2847219ebd8c 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -169,5 +169,18 @@ This is the number of bytes the device can write in a single write-same
 command.  A value of '0' means write-same is not supported by this
 device.
 
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes.
+
+wb_window_usec (RW)
+-------------------
+If the device is registered for writeback throttling, then this file shows
+the value of the monitoring window in which we'll look at the target
+latency. See wb_lat_usec.
+
 
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/Kconfig b/block/Kconfig
index 1d4d624492fc..7c8523e97ede 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -112,6 +112,30 @@ config BLK_CMDLINE_PARSER
 
 	See Documentation/block/cmdline-partition.txt for more information.
 
+config BLK_WBT
+	bool "Enable support for block device writeback throttling"
+	default n
+	---help---
+	Enabling this option enables the block layer to throttle buffered
+	writeback from the VM, making it more smooth and having less
+	impact on foreground operations.
+
+config BLK_WBT_SQ
+	bool "Single queue writeback throttling"
+	default n
+	depends on BLK_WBT
+	---help---
+	Enable writeback throttling by default on legacy single queue devices
+
+config BLK_WBT_MQ
+	bool "Multiqueue writeback throttling"
+	default y
+	depends on BLK_WBT
+	---help---
+	Enable writeback throttling by default on multiqueue devices.
+	Multiqueue currently doesn't have support for IO scheduling,
+	enabling this option is recommended.
+
 menu "Partition Types"
 
 source "block/partitions/Kconfig"
diff --git a/block/blk-core.c b/block/blk-core.c
index a822383e858e..165511278fa9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-wbt.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -882,6 +883,8 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 
 fail:
 	blk_free_flush_queue(q->fq);
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
 	return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1346,6 +1349,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 	blk_delete_timer(rq);
 	blk_clear_rq_complete(rq);
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
 	if (rq->cmd_flags & REQ_QUEUED)
 		blk_queue_end_tag(q, rq);
@@ -1436,6 +1440,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	/* this is a bio leak */
 	WARN_ON(req->bio != NULL);
 
+	wbt_done(q->rq_wb, &req->issue_stat);
+
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
 	 * it didn't come out of our reserved rq pools
@@ -1667,6 +1673,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	int el_ret, rw_flags = 0, where = ELEVATOR_INSERT_SORT;
 	struct request *req;
 	unsigned int request_count = 0;
+	unsigned int wb_acct;
 
 	/*
 	 * low level driver can indicate that it wants pages above a
@@ -1719,6 +1726,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	}
 
 get_rq:
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, q->queue_lock);
+
 	/*
 	 * This sync check and mask will be re-done in init_request_from_bio(),
 	 * but we need to set it earlier to expose the sync flag to the
@@ -1738,11 +1747,14 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 */
 	req = get_request(q, bio_data_dir(bio), rw_flags, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		__wbt_done(q->rq_wb, wb_acct);
 		bio->bi_error = PTR_ERR(req);
 		bio_endio(bio);
 		goto out_unlock;
 	}
 
+	wbt_track(&req->issue_stat, wb_acct);
+
 	/*
 	 * After dropping the lock and possibly sleeping here, our request
 	 * may now be mergeable after it had proven unmergeable (above).
@@ -2476,6 +2488,7 @@ void blk_start_request(struct request *req)
 	blk_dequeue_request(req);
 
 	blk_stat_set_issue_time(&req->issue_stat);
+	wbt_issue(req->q->rq_wb, &req->issue_stat);
 
 	/*
 	 * We are now handing the request to the hardware, initialize
@@ -2713,9 +2726,10 @@ void blk_finish_request(struct request *req, int error)
 
 	blk_account_io_done(req);
 
-	if (req->end_io)
+	if (req->end_io) {
+		wbt_done(req->q->rq_wb, &req->issue_stat);
 		req->end_io(req, error);
-	else {
+	} else {
 		if (blk_bidi_rq(req))
 			__blk_put_request(req->next_rq->q, req->next_rq);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index dc668353e228..2e97213bf27e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -31,6 +31,7 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
+#include "blk-wbt.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -301,6 +302,8 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+
+	wbt_done(q->rq_wb, &rq->issue_stat);
 	rq->cmd_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -329,6 +332,7 @@ inline void __blk_mq_end_request(struct request *rq, int error)
 	blk_account_io_done(rq);
 
 	if (rq->end_io) {
+		wbt_done(rq->q->rq_wb, &rq->issue_stat);
 		rq->end_io(rq, error);
 	} else {
 		if (unlikely(blk_bidi_rq(rq)))
@@ -436,6 +440,7 @@ void blk_mq_start_request(struct request *rq)
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
 	blk_stat_set_issue_time(&rq->issue_stat);
+	wbt_issue(q->rq_wb, &rq->issue_stat);
 
 	blk_add_timer(rq);
 
@@ -472,6 +477,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	struct request_queue *q = rq->q;
 
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->issue_stat);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1285,6 +1291,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
+	unsigned int wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1299,9 +1306,15 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
+	}
+
+	wbt_track(&rq->issue_stat, wb_acct);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1378,6 +1391,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_map_ctx data;
 	struct request *rq;
 	blk_qc_t cookie;
+	unsigned int wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1394,9 +1408,15 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		__wbt_done(q->rq_wb, wb_acct);
 		return BLK_QC_T_NONE;
+	}
+
+	wbt_track(&rq->issue_stat, wb_acct);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -2067,6 +2087,9 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
+
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f7e122e717e8..b51ad190c989 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -13,6 +13,7 @@
 #include <linux/gfp.h>
 
 #include "blk.h"
+#include "blk-wbt.h"
 
 unsigned long blk_max_low_pfn;
 EXPORT_SYMBOL(blk_max_low_pfn);
@@ -840,6 +841,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
+	wbt_set_queue_depth(q->rq_wb, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
@@ -863,6 +865,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 	else
 		queue_flag_clear(QUEUE_FLAG_FUA, q);
 	spin_unlock_irq(q->queue_lock);
+
+	wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3228ae396e3e..57d08ae81ace 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-wbt.h"
 
 struct queue_sysfs_entry {
 	struct attribute attr;
@@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
 	return count;
 }
 
+static ssize_t queue_var_store64(u64 *var, const char *page)
+{
+	int err;
+	u64 v;
+
+	err = kstrtou64(page, 10, &v);
+	if (err < 0)
+		return err;
+
+	*var = v;
+	return 0;
+}
+
 static ssize_t queue_requests_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->nr_requests, (page));
@@ -347,6 +361,58 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wb_win_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->win_nsec, 1000));
+}
+
+static ssize_t queue_wb_win_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->win_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
+static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+}
+
+static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->min_lat_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
 static ssize_t queue_wc_show(struct request_queue *q, char *page)
 {
 	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -551,6 +617,18 @@ static struct queue_sysfs_entry queue_stats_entry = {
 	.show = queue_stats_show,
 };
 
+static struct queue_sysfs_entry queue_wb_lat_entry = {
+	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_lat_show,
+	.store = queue_wb_lat_store,
+};
+
+static struct queue_sysfs_entry queue_wb_win_entry = {
+	.attr = {.name = "wbt_window_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_win_show,
+	.store = queue_wb_win_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -579,6 +657,8 @@ static struct attribute *default_attrs[] = {
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
 	&queue_stats_entry.attr,
+	&queue_wb_lat_entry.attr,
+	&queue_wb_win_entry.attr,
 	NULL,
 };
 
@@ -693,6 +773,58 @@ struct kobj_type blk_queue_ktype = {
 	.release	= blk_release_queue,
 };
 
+static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
+{
+	blk_queue_stat_get(data, stat);
+}
+
+static void blk_wb_stat_clear(void *data)
+{
+	blk_stat_clear(data);
+}
+
+static bool blk_wb_stat_is_current(struct blk_rq_stat *stat)
+{
+	return blk_stat_is_current(stat);
+}
+
+static struct wb_stat_ops wb_stat_ops = {
+	.get		= blk_wb_stat_get,
+	.is_current	= blk_wb_stat_is_current,
+	.clear		= blk_wb_stat_clear,
+};
+
+static void blk_wb_init(struct request_queue *q)
+{
+	struct rq_wb *rwb;
+
+#ifndef CONFIG_BLK_WBT_MQ
+	if (q->mq_ops)
+		return;
+#endif
+#ifndef CONFIG_BLK_WBT_SQ
+	if (q->request_fn)
+		return;
+#endif
+
+	rwb = wbt_init(&q->backing_dev_info, &wb_stat_ops, q);
+
+	/*
+	 * If this fails, we don't get throttling
+	 */
+	if (IS_ERR(rwb))
+		return;
+
+	if (blk_queue_nonrot(q))
+		rwb->min_lat_nsec = 2000000ULL;
+	else
+		rwb->min_lat_nsec = 75000000ULL;
+
+	wbt_set_queue_depth(rwb, blk_queue_depth(q));
+	wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+	q->rq_wb = rwb;
+}
+
 int blk_register_queue(struct gendisk *disk)
 {
 	int ret;
@@ -732,6 +864,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (q->mq_ops)
 		blk_mq_register_dev(dev, q);
 
+	blk_wb_init(q);
+
 	if (!q->request_fn)
 		return 0;
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5e24d880306c..b879a1a5f2d5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -16,6 +16,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-cgroup.h>
 #include "blk.h"
+#include "blk-wbt.h"
 
 /*
  * tunables
@@ -3771,9 +3772,11 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
 	uint64_t serial_nr;
+	bool nonroot_cg;
 
 	rcu_read_lock();
 	serial_nr = bio_blkcg(bio)->css.serial_nr;
+	nonroot_cg = bio_blkcg(bio) != &blkcg_root;
 	rcu_read_unlock();
 
 	/*
@@ -3784,6 +3787,17 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 		return;
 
 	/*
+	 * If we have a non-root cgroup, we can depend on that to
+	 * do proper throttling of writes. Turn off wbt for that
+	 * case.
+	 */
+	if (nonroot_cg) {
+		struct request_queue *q = cfqd->queue;
+
+		wbt_disable(q->rq_wb);
+	}
+
+	/*
 	 * Drop reference to queues.  New queues will be assigned in new
 	 * group upon arrival of fresh requests.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a0056b5772e..250df1ea853d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -37,6 +37,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
+struct rq_wb;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -303,6 +304,8 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct rq_wb		*rq_wb;
+
 	/*
 	 * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
 	 * is used, root blkg allocates from @q->root_rl and all other
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2016-11-12  5:25 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-01 21:08 [PATCHSET] Throttled buffered writeback Jens Axboe
2016-11-01 21:08 ` [PATCH 1/8] block: add WRITE_BACKGROUND Jens Axboe
2016-11-02 14:55   ` Christoph Hellwig
2016-11-02 16:22     ` Jens Axboe
2016-11-05 22:27   ` Jan Kara
2016-11-01 21:08 ` [PATCH 2/8] writeback: add wbc_to_write_flags() Jens Axboe
2016-11-02 14:56   ` Christoph Hellwig
2016-11-01 21:08 ` [PATCH 3/8] writeback: mark background writeback as such Jens Axboe
2016-11-02 14:56   ` Christoph Hellwig
2016-11-05 22:26   ` Jan Kara
2016-11-01 21:08 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
2016-11-02 14:57   ` Christoph Hellwig
2016-11-02 14:59     ` Jens Axboe
2016-11-08 13:02   ` Jan Kara
2016-11-01 21:08 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
2016-11-02 14:59   ` Christoph Hellwig
2016-11-02 15:02     ` Jens Axboe
2016-11-02 16:40       ` Johannes Thumshirn
2016-11-05 22:37   ` Jan Kara
2016-11-01 21:08 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
2016-11-08 13:30   ` Jan Kara
2016-11-08 15:25     ` Jens Axboe
2016-11-09  9:01       ` Jan Kara
2016-11-09 16:09         ` Jens Axboe
2016-11-09 19:52           ` Jens Axboe
2016-11-10 19:38             ` Jan Kara
2016-11-12  5:19               ` Jens Axboe
2016-11-01 21:08 ` [PATCH 7/8] blk-wbt: add general throttling mechanism Jens Axboe
2016-11-08 13:39   ` Jan Kara
2016-11-08 15:41     ` Jens Axboe
2016-11-09  8:40       ` Jan Kara
2016-11-09 16:07         ` Jens Axboe
2016-11-09 19:52           ` Jens Axboe
2016-11-10 19:36             ` Jan Kara
2016-11-10  0:00           ` Dave Chinner
2016-11-01 21:08 ` [PATCH 8/8] block: hook up writeback throttling Jens Axboe
2016-11-08 13:42   ` Jan Kara
2016-11-08 15:16     ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2016-10-26 20:52 [PATCHSET] block: buffered " Jens Axboe
2016-10-26 20:52 ` [PATCH 8/8] block: hook up " Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).