linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v5] Make background writeback great again for the first time
@ 2016-04-26 15:55 Jens Axboe
  2016-04-26 15:55 ` [PATCH 1/8] block: add WRITE_BG Jens Axboe
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: jack, dchinner, sedat.dilek

Hi,

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

I have posted plenty of results previously, I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async for
reading a 5g file, and rewriting it. You can find this test program
in the fio git repo.

4.6-rc3:

$ t/read-to-pipe-async -f ~/5g > 5g-new

Latency percentiles (usec) (READERS)
	50.0000th: 2
	75.0000th: 3
	90.0000th: 5
	95.0000th: 7
	99.0000th: 43
	99.5000th: 77
	99.9000th: 9008
	99.9900th: 91008
	99.9990th: 286208
	99.9999th: 347648
	Over=1251, min=0, max=358081
Latency percentiles (usec) (WRITERS)
	50.0000th: 4
	75.0000th: 8
	90.0000th: 13
	95.0000th: 15
	99.0000th: 32
	99.5000th: 43
	99.9000th: 81
	99.9900th: 2372
	99.9990th: 104320
	99.9999th: 349696
	Over=63, min=1, max=358321
Read rate (KB/sec) : 91859
Write rate (KB/sec): 91859

4.6-rc3 + wb-buf-throttle

Latency percentiles (usec) (READERS)
	50.0000th: 2
	75.0000th: 3
	90.0000th: 5
	95.0000th: 8
	99.0000th: 48
	99.5000th: 79
	99.9000th: 5304
	99.9900th: 22496
	99.9990th: 29408
	99.9999th: 33728
	Over=860, min=0, max=37599
Latency percentiles (usec) (WRITERS)
	50.0000th: 4
	75.0000th: 9
	90.0000th: 14
	95.0000th: 16
	99.0000th: 34
	99.5000th: 45
	99.9000th: 87
	99.9900th: 1342
	99.9990th: 13648
	99.9999th: 21280
	Over=29, min=1, max=30457
Read rate (KB/sec) : 95832
Write rate (KB/sec): 95832

Better throughput and tighter latencies, for both reads and writes.
That's hard not to like.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that keeps limits how much buffered
writeback we keep in flight on the device end. It's all about managing
the queues on the hardware side. The big change in this version is that
it should be pretty much auto-tuning - you no longer have to set a
given percentage of writeback bandwidth. I've implemented something
similar to CoDel to manage the writeback queue. See the last patch
for a full description, but the tldr is that we monitor min latencies
over a window of time, and scale up/down the queue based on that. This
needs a minimum of tunables, and it stays out of the way, if your device
is fast enough. There's a single tunable now, wb_last_usec, that simply
sets this latency target. Most people won't have to touch this, it'll
work pretty well just being in the ballpark.

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable, I have not observed problems. It passes full
xfstest runs, and a variety of benchmarks as well. It works equally well
on blk-mq/scsi-mq, and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. The
wb-buf-throttle-v5 will remain the same as this version. I've folded
the device write cache changes into my 4.7 branches, so they are not
a part of this posting. Get the full wb-buf-throttle branch, or apply
the patches here on top of my for-next. A full patch against Linus'
current tree can also be downloaded here:

http://brick.kernel.dk/snaps/wb-buf-throttle-v5.patch

Changes since v4

- Add some documentation for the two queue sysfs files
- Kill off wb_stats sysfs file. Use the trace points to get this info
  now.
- Various work around making this block layer agnostic. The main code
  now resides in lib/wbt.c and can be plugged into NFS as well, for
  instance.
- Fix an issue with double completions on the block layer side.
- Fix an issue where a long sync issue was disregarded, if the stat
  sample weren't valid.
- Speed up the division in rwb_arm_timer().
- Add logic to scale back up for 'unknown' latency events.
- Don't track sync issue timestamp of wbt is disabled.
- Drop the dirty/writeback page inc/dec patch. We don't need it, and
  it was racy.
- Move block/blk-wb.c to lib/wbt.c

Changes since v3

- Re-do the mm/ writheback parts. Add REQ_BG for background writes,
  and don't overload the wbc 'reason' for writeback decisions.
- Add tracking for when apps are sleeping waiting for a page to complete.
- Change wbc_to_write() to wbc_to_write_cmd().
- Use atomic_t for the balance_dirty_pages() sleep count.
- Add a basic scalable block stats tracking framework.
- Rewrite blk-wb core as described above, to dynamically adapt. This is
  a big change, see the last patch for a full description of it.
- Add tracing to blk-wb, instead of using debug printk's.
- Rebased to 4.6-rc3 (ish)

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change
  between wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in
  the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups


 Documentation/block/queue-sysfs.txt             |   22 +
 Documentation/block/writeback_cache_control.txt |    4 
 arch/um/drivers/ubd_kern.c                      |    2 
 block/Kconfig                                   |    1 
 block/Makefile                                  |    2 
 block/blk-core.c                                |   26 +
 block/blk-flush.c                               |   11 
 block/blk-mq-sysfs.c                            |   47 ++
 block/blk-mq.c                                  |   44 +-
 block/blk-mq.h                                  |    3 
 block/blk-settings.c                            |   59 +-
 block/blk-stat.c                                |  185 ++++++++
 block/blk-stat.h                                |   17 
 block/blk-sysfs.c                               |  184 ++++++++
 drivers/block/drbd/drbd_main.c                  |    2 
 drivers/block/loop.c                            |    2 
 drivers/block/mtip32xx/mtip32xx.c               |    6 
 drivers/block/nbd.c                             |    4 
 drivers/block/osdblk.c                          |    2 
 drivers/block/ps3disk.c                         |    2 
 drivers/block/skd_main.c                        |    2 
 drivers/block/virtio_blk.c                      |    6 
 drivers/block/xen-blkback/xenbus.c              |    2 
 drivers/block/xen-blkfront.c                    |    3 
 drivers/ide/ide-disk.c                          |    6 
 drivers/md/bcache/super.c                       |    2 
 drivers/md/dm-table.c                           |   20 
 drivers/md/md.c                                 |    2 
 drivers/md/raid5-cache.c                        |    3 
 drivers/mmc/card/block.c                        |    2 
 drivers/mtd/mtd_blkdevs.c                       |    2 
 drivers/nvme/host/core.c                        |    7 
 drivers/scsi/scsi.c                             |    3 
 drivers/scsi/sd.c                               |    8 
 drivers/target/target_core_iblock.c             |    6 
 fs/block_dev.c                                  |    2 
 fs/buffer.c                                     |    2 
 fs/f2fs/data.c                                  |    2 
 fs/f2fs/node.c                                  |    2 
 fs/gfs2/meta_io.c                               |    3 
 fs/mpage.c                                      |    9 
 fs/xfs/xfs_aops.c                               |    2 
 include/linux/backing-dev-defs.h                |    2 
 include/linux/blk_types.h                       |   12 
 include/linux/blkdev.h                          |   28 +
 include/linux/fs.h                              |    4 
 include/linux/wbt.h                             |   95 ++++
 include/linux/writeback.h                       |   10 
 include/trace/events/wbt.h                      |  122 +++++
 lib/Kconfig                                     |    3 
 lib/Makefile                                    |    1 
 lib/wbt.c                                       |  524 ++++++++++++++++++++++++
 mm/backing-dev.c                                |    1 
 mm/page-writeback.c                             |    2 
 54 files changed, 1429 insertions(+), 96 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/8] block: add WRITE_BG
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-26 15:55 ` [PATCH 2/8] writeback: add wbc_to_write_cmd() Jens Axboe
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

This adds a new request flag, REQ_BG, that callers can use to tell
the block layer that this is background (non-urgent) IO.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/blk_types.h | 4 +++-
 include/linux/fs.h        | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..223012451c7a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -161,6 +161,7 @@ enum rq_flag_bits {
 	__REQ_INTEGRITY,	/* I/O includes block integrity payload */
 	__REQ_FUA,		/* forced unit access */
 	__REQ_FLUSH,		/* request for cache flush */
+	__REQ_BG,		/* background activity */
 
 	/* bio only flags */
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
@@ -208,7 +209,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_PRIO | \
 	 REQ_DISCARD | REQ_WRITE_SAME | REQ_NOIDLE | REQ_FLUSH | REQ_FUA | \
-	 REQ_SECURE | REQ_INTEGRITY)
+	 REQ_SECURE | REQ_INTEGRITY | REQ_BG)
 #define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define BIO_NO_ADVANCE_ITER_MASK	(REQ_DISCARD|REQ_WRITE_SAME)
@@ -235,6 +236,7 @@ enum rq_flag_bits {
 #define REQ_COPY_USER		(1ULL << __REQ_COPY_USER)
 #define REQ_FLUSH		(1ULL << __REQ_FLUSH)
 #define REQ_FLUSH_SEQ		(1ULL << __REQ_FLUSH_SEQ)
+#define REQ_BG			(1ULL << __REQ_BG)
 #define REQ_IO_STAT		(1ULL << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1ULL << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1ULL << __REQ_SECURE)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b58baaf..bb8f951cc619 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -192,6 +192,9 @@ typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
  * WRITE_FLUSH_FUA	Combination of WRITE_FLUSH and FUA. The IO is preceded
  *			by a cache flush and data is guaranteed to be on
  *			non-volatile media on completion.
+ * WRITE_BG		Background write. This is for background activity like
+ *			the periodic flush and background threshold writeback
+ *
  *
  */
 #define RW_MASK			REQ_WRITE
@@ -207,6 +210,7 @@ typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
 #define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH)
 #define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FUA)
 #define WRITE_FLUSH_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
+#define WRITE_BG		(WRITE | REQ_NOIDLE | REQ_BG)
 
 /*
  * Attribute flags.  These should be or-ed together to figure out what
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 2/8] writeback: add wbc_to_write_cmd()
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
  2016-04-26 15:55 ` [PATCH 1/8] block: add WRITE_BG Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-26 15:55 ` [PATCH 3/8] writeback: use WRITE_BG for kupdate and background writeback Jens Axboe
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

Add wbc_to_write_cmd(), which returns the write type to use, based on a
struct writeback_control. No functional changes in this patch, but it
prepares us for factoring other wbc fields for write type.

Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c            | 2 +-
 fs/buffer.c               | 2 +-
 fs/f2fs/data.c            | 2 +-
 fs/f2fs/node.c            | 2 +-
 fs/gfs2/meta_io.c         | 3 +--
 fs/mpage.c                | 9 ++++-----
 fs/xfs/xfs_aops.c         | 2 +-
 include/linux/writeback.h | 8 ++++++++
 8 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02b77c4..8662da6aa07c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 			struct page *page, struct writeback_control *wbc)
 {
 	int result;
-	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+	int rw = wbc_to_write_cmd(wbc);
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	if (!ops->rw_page || bdev_get_integrity(bdev))
diff --git a/fs/buffer.c b/fs/buffer.c
index af0d9a82a8ed..46763c58e786 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr_underway = 0;
-	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = wbc_to_write_cmd(wbc);
 
 	head = create_page_buffers(page, inode,
 					(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 5dafb9cef12e..e4e81ce663c5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1153,7 +1153,7 @@ static int f2fs_write_data_page(struct page *page,
 	struct f2fs_io_info fio = {
 		.sbi = sbi,
 		.type = DATA,
-		.rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+		.rw = wbc_to_write_cmd(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 1a33de9d84b1..3b377258dc09 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
 	struct f2fs_io_info fio = {
 		.sbi = sbi,
 		.type = NODE,
-		.rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+		.rw = wbc_to_write_cmd(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 0448524c11bc..3fdfa3848f18 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 {
 	struct buffer_head *bh, *head;
 	int nr_underway = 0;
-	int write_op = REQ_META | REQ_PRIO |
-		(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = REQ_META | REQ_PRIO | wbc_to_write_cmd(wbc);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index eedc644b78d7..bcbdb61b24f1 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
 	int ret = 0;
-	int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -595,7 +594,7 @@ page_is_mapped:
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
 	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
 
 alloc_new:
 	if (bio == NULL) {
@@ -622,7 +621,7 @@ alloc_new:
 	wbc_account_io(wbc, page, PAGE_SIZE);
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
 		goto alloc_new;
 	}
 
@@ -632,7 +631,7 @@ alloc_new:
 	set_page_writeback(page);
 	unlock_page(page);
 	if (boundary || (first_unmapped != blocks_per_page)) {
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
@@ -644,7 +643,7 @@ alloc_new:
 
 confused:
 	if (bio)
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write_cmd(wbc), bio);
 
 	if (mpd->use_writepage) {
 		ret = mapping->a_ops->writepage(page, wbc);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e49b2406d15d..e6c721f4153b 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
 	atomic_inc(&ioend->io_remaining);
 	bio->bi_private = ioend;
 	bio->bi_end_io = xfs_end_bio;
-	submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
+	submit_bio(wbc_to_write_cmd(wbc), bio);
 }
 
 STATIC struct bio *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5d4e08..aa66fa05ff0d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,14 @@ struct writeback_control {
 #endif
 };
 
+static inline int wbc_to_write_cmd(struct writeback_control *wbc)
+{
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return WRITE_SYNC;
+
+	return WRITE;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in.  There always is one global
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 3/8] writeback: use WRITE_BG for kupdate and background writeback
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
  2016-04-26 15:55 ` [PATCH 1/8] block: add WRITE_BG Jens Axboe
  2016-04-26 15:55 ` [PATCH 2/8] writeback: add wbc_to_write_cmd() Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-26 15:55 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

If we're doing background type writes, then use the appropriate
write command for that.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/writeback.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index aa66fa05ff0d..6e4a35acaa3e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,8 @@ static inline int wbc_to_write_cmd(struct writeback_control *wbc)
 {
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		return WRITE_SYNC;
+	else if (wbc->for_kupdate || wbc->for_background)
+		return WRITE_BG;
 
 	return WRITE;
 }
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
                   ` (2 preceding siblings ...)
  2016-04-26 15:55 ` [PATCH 3/8] writeback: use WRITE_BG for kupdate and background writeback Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-26 15:55 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

Note in the bdi_writeback structure if a task is currently being
limited in balance_dirty_pages(), waiting for writeback to
proceed.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/backing-dev-defs.h | 2 ++
 mm/backing-dev.c                 | 1 +
 mm/page-writeback.c              | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 3f103076d0bf..1212c374b928 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	atomic_t dirty_sleeping;	/* waiting on dirty limit exceeded */
+
 	struct list_head bdi_node;	/* anchored at bdi->wb_list */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0c6317b7db38..41db7dff11d0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -310,6 +310,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
 	spin_lock_init(&wb->work_lock);
 	INIT_LIST_HEAD(&wb->work_list);
 	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+	atomic_set(&wb->dirty_sleeping, 0);
 
 	wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp);
 	if (!wb->congested)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d35ccc..028a3d4d7129 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
 					  pause,
 					  start_time);
 		__set_current_state(TASK_KILLABLE);
+		atomic_inc(&wb->dirty_sleeping);
 		io_schedule_timeout(pause);
+		atomic_dec(&wb->dirty_sleeping);
 
 		current->dirty_paused_when = now + pause;
 		current->nr_dirtied = 0;
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 5/8] block: add code to track actual device queue depth
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
                   ` (3 preceding siblings ...)
  2016-04-26 15:55 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-26 15:55 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request with a 'queue_depth' member, that
drivers can call to more closely inform the block layer of the
real queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-settings.c   | 12 ++++++++++++
 drivers/scsi/scsi.c    |  3 +++
 include/linux/blkdev.h | 11 +++++++++++
 3 files changed, 26 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index f679ae122843..f7e122e717e8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -832,6 +832,18 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
 EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 
 /**
+ * blk_set_queue_depth - tell the block layer about the device queue depth
+ * @q:		the request queue for the device
+ * @depth:		queue depth
+ *
+ */
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+	q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
+/**
  * blk_queue_write_cache - configure queue's write cache
  * @q:		the request queue for the device
  * @wc:		write back cache on or off
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 1deb6adc411f..75455d4dab68 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
 		wmb();
 	}
 
+	if (sdev->request_queue)
+		blk_set_queue_depth(sdev->request_queue, depth);
+
 	return sdev->queue_depth;
 }
 EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fc1894996b12..eee94bd6de52 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -315,6 +315,8 @@ struct request_queue {
 	struct blk_mq_ctx __percpu	*queue_ctx;
 	unsigned int		nr_queues;
 
+	unsigned int		queue_depth;
+
 	/* hw dispatch queues */
 	struct blk_mq_hw_ctx	**queue_hw_ctx;
 	unsigned int		nr_hw_queues;
@@ -681,6 +683,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
 	return false;
 }
 
+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+	if (q->queue_depth)
+		return q->queue_depth;
+
+	return q->nr_requests;
+}
+
 /*
  * q->prep_rq_fn return values
  */
@@ -984,6 +994,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
 extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 6/8] block: add scalable completion tracking of requests
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
                   ` (4 preceding siblings ...)
  2016-04-26 15:55 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-05-05  7:52   ` Ming Lei
  2016-04-26 15:55 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.

The stats are tracked in, roughly, 0.1s interval windows.

Add sysfs files to display the stats.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile            |   2 +-
 block/blk-core.c          |   4 +
 block/blk-mq-sysfs.c      |  47 ++++++++++++
 block/blk-mq.c            |  14 ++++
 block/blk-mq.h            |   3 +
 block/blk-stat.c          | 184 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-stat.h          |  17 +++++
 block/blk-sysfs.c         |  26 +++++++
 include/linux/blk_types.h |   8 ++
 include/linux/blkdev.h    |   4 +
 10 files changed, 308 insertions(+), 1 deletion(-)
 create mode 100644 block/blk-stat.c
 create mode 100644 block/blk-stat.h

diff --git a/block/Makefile b/block/Makefile
index 9eda2322b2d4..3446e0472df0 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			blk-lib.o blk-mq.o blk-mq-tag.o \
+			blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
 			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 74c16fd8995d..40b57bf4852c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2514,6 +2514,8 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
+	req->issue_time = ktime_to_ns(ktime_get());
+
 	/*
 	 * We are now handing the request to the hardware, initialize
 	 * resid_len to full count and add the timeout handler.
@@ -2581,6 +2583,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
 
 	trace_block_rq_complete(req->q, req, nr_bytes);
 
+	blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);
+
 	if (!req->bio)
 		return false;
 
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 4ea4dd8a1eed..2f68015f8616 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,6 +247,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i;
+
+	hctx_for_each_ctx(hctx, ctx, i) {
+		blk_stat_init(&ctx->stat[0]);
+		blk_stat_init(&ctx->stat[1]);
+	}
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
+					  const char *page, size_t count)
+{
+	blk_mq_stat_clear(hctx);
+	return count;
+}
+
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_stat_init(&stat[0]);
+	blk_stat_init(&stat[1]);
+
+	blk_hctx_stat_get(hctx, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -304,6 +345,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.attr = {.name = "io_poll", .mode = S_IRUGO },
 	.show = blk_mq_hw_sysfs_poll_show,
 };
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
+	.attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_stat_show,
+	.store = blk_mq_hw_sysfs_stat_store,
+};
 
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
@@ -314,6 +360,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_stat.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1699baf39b78..71b4a13fbf94 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-stat.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -356,10 +357,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static void blk_mq_stat_add(struct request *rq)
+{
+	struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
+
+	blk_stat_add(stat, rq);
+}
+
 static void __blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 
+	blk_mq_stat_add(rq);
+
 	if (!q->softirq_done_fn)
 		blk_mq_end_request(rq, rq->errors);
 	else
@@ -403,6 +413,8 @@ void blk_mq_start_request(struct request *rq)
 	if (unlikely(blk_bidi_rq(rq)))
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
+	rq->issue_time = ktime_to_ns(ktime_get());
+
 	blk_add_timer(rq);
 
 	/*
@@ -1761,6 +1773,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		spin_lock_init(&__ctx->lock);
 		INIT_LIST_HEAD(&__ctx->rq_list);
 		__ctx->queue = q;
+		blk_stat_init(&__ctx->stat[0]);
+		blk_stat_init(&__ctx->stat[1]);
 
 		/* If the cpu isn't online, the cpu is mapped to first hctx */
 		if (!cpu_online(i))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11037b7..e107f700ff17 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -1,6 +1,8 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include "blk-stat.h"
+
 struct blk_mq_tag_set;
 
 struct blk_mq_ctx {
@@ -20,6 +22,7 @@ struct blk_mq_ctx {
 
 	/* incremented at completion time */
 	unsigned long		____cacheline_aligned_in_smp rq_completed[2];
+	struct blk_rq_stat	stat[2];
 
 	struct request_queue	*queue;
 	struct kobject		kobj;
diff --git a/block/blk-stat.c b/block/blk-stat.c
new file mode 100644
index 000000000000..b38776a83173
--- /dev/null
+++ b/block/blk-stat.c
@@ -0,0 +1,184 @@
+/*
+ * Block stat tracking code
+ *
+ * Copyright (C) 2016 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/blk-mq.h>
+
+#include "blk-stat.h"
+#include "blk-mq.h"
+
+void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
+{
+	if (!src->nr_samples)
+		return;
+
+	dst->min = min(dst->min, src->min);
+	dst->max = max(dst->max, src->max);
+
+	if (!dst->nr_samples)
+		dst->mean = src->mean;
+	else {
+		dst->mean = div64_s64((src->mean * src->nr_samples) +
+					(dst->mean * dst->nr_samples),
+					dst->nr_samples + src->nr_samples);
+	}
+	dst->nr_samples += src->nr_samples;
+}
+
+static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_ctx *ctx;
+	int i, j, nr;
+
+	blk_stat_init(&dst[0]);
+	blk_stat_init(&dst[1]);
+
+	nr = 0;
+	do {
+		uint64_t newest = 0;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				if (!ctx->stat[0].nr_samples &&
+				    !ctx->stat[1].nr_samples)
+					continue;
+				if (ctx->stat[0].time > newest)
+					newest = ctx->stat[0].time;
+				if (ctx->stat[1].time > newest)
+					newest = ctx->stat[1].time;
+			}
+		}
+
+		/*
+		 * No samples
+		 */
+		if (!newest)
+			break;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				if (ctx->stat[0].time == newest) {
+					blk_stat_sum(&dst[0], &ctx->stat[0]);
+					nr++;
+				}
+				if (ctx->stat[1].time == newest) {
+					blk_stat_sum(&dst[1], &ctx->stat[1]);
+					nr++;
+				}
+			}
+		}
+		/*
+		 * If we race on finding an entry, just loop back again.
+		 * Should be very rare.
+		 */
+	} while (!nr);
+}
+
+void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
+{
+	if (q->mq_ops)
+		blk_mq_stat_get(q, dst);
+	else {
+		memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
+		memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
+	}
+}
+
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
+{
+	struct blk_mq_ctx *ctx;
+	unsigned int i, nr;
+
+	nr = 0;
+	do {
+		uint64_t newest = 0;
+
+		hctx_for_each_ctx(hctx, ctx, i) {
+			if (!ctx->stat[0].nr_samples &&
+			    !ctx->stat[1].nr_samples)
+				continue;
+
+			if (ctx->stat[0].time > newest)
+				newest = ctx->stat[0].time;
+			if (ctx->stat[1].time > newest)
+				newest = ctx->stat[1].time;
+		}
+
+		if (!newest)
+			break;
+
+		hctx_for_each_ctx(hctx, ctx, i) {
+			if (ctx->stat[0].time == newest) {
+				blk_stat_sum(&dst[0], &ctx->stat[0]);
+				nr++;
+			}
+			if (ctx->stat[1].time == newest) {
+				blk_stat_sum(&dst[1], &ctx->stat[1]);
+				nr++;
+			}
+		}
+		/*
+		 * If we race on finding an entry, just loop back again.
+		 * Should be very rare, as the window is only updated
+		 * occasionally
+		 */
+	} while (!nr);
+}
+
+static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
+{
+	stat->min = -1ULL;
+	stat->max = stat->nr_samples = stat->mean = 0;
+	stat->time = time_now & BLK_STAT_MASK;
+}
+
+void blk_stat_init(struct blk_rq_stat *stat)
+{
+	__blk_stat_init(stat, ktime_to_ns(ktime_get()));
+}
+
+void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
+{
+	s64 delta, now, value;
+
+	now = ktime_to_ns(ktime_get());
+	if (now < rq->issue_time)
+		return;
+
+	if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK))
+		__blk_stat_init(stat, now);
+
+	value = now - rq->issue_time;
+	if (value > stat->max)
+		stat->max = value;
+	if (value < stat->min)
+		stat->min = value;
+
+	delta = value - stat->mean;
+	if (delta)
+		stat->mean += div64_s64(delta, stat->nr_samples + 1);
+
+	stat->nr_samples++;
+}
+
+void blk_stat_clear(struct request_queue *q)
+{
+	if (q->mq_ops) {
+		struct blk_mq_hw_ctx *hctx;
+		struct blk_mq_ctx *ctx;
+		int i, j;
+
+		queue_for_each_hw_ctx(q, hctx, i) {
+			hctx_for_each_ctx(hctx, ctx, j) {
+				blk_stat_init(&ctx->stat[0]);
+				blk_stat_init(&ctx->stat[1]);
+			}
+		}
+	} else {
+		blk_stat_init(&q->rq_stats[0]);
+		blk_stat_init(&q->rq_stats[1]);
+	}
+}
diff --git a/block/blk-stat.h b/block/blk-stat.h
new file mode 100644
index 000000000000..d77548dbf196
--- /dev/null
+++ b/block/blk-stat.h
@@ -0,0 +1,17 @@
+#ifndef BLK_STAT_H
+#define BLK_STAT_H
+
+/*
+ * ~0.13s window as a power-of-2 (2^27 nsecs)
+ */
+#define BLK_STAT_NSEC	134217728ULL
+#define BLK_STAT_MASK	~(BLK_STAT_NSEC - 1)
+
+void blk_stat_add(struct blk_rq_stat *, struct request *);
+void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
+void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
+void blk_stat_clear(struct request_queue *q);
+void blk_stat_init(struct blk_rq_stat *);
+void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
+
+#endif
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 99205965f559..6e516cc0d3d0 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -379,6 +379,26 @@ static ssize_t queue_wc_store(struct request_queue *q, const char *page,
 	return count;
 }
 
+static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
+{
+	return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
+			pre, (long long) stat->nr_samples,
+			(long long) stat->mean, (long long) stat->min,
+			(long long) stat->max);
+}
+
+static ssize_t queue_stats_show(struct request_queue *q, char *page)
+{
+	struct blk_rq_stat stat[2];
+	ssize_t ret;
+
+	blk_queue_stat_get(q, stat);
+
+	ret = print_stat(page, &stat[0], "read :");
+	ret += print_stat(page + ret, &stat[1], "write:");
+	return ret;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -516,6 +536,11 @@ static struct queue_sysfs_entry queue_wc_entry = {
 	.store = queue_wc_store,
 };
 
+static struct queue_sysfs_entry queue_stats_entry = {
+	.attr = {.name = "stats", .mode = S_IRUGO },
+	.show = queue_stats_show,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -542,6 +567,7 @@ static struct attribute *default_attrs[] = {
 	&queue_random_entry.attr,
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
+	&queue_stats_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 223012451c7a..2b4414fb4d8e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -268,4 +268,12 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
 }
 
+struct blk_rq_stat {
+	s64 mean;
+	u64 min;
+	u64 max;
+	s64 nr_samples;
+	s64 time;
+};
+
 #endif /* __LINUX_BLK_TYPES_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eee94bd6de52..87f6703ced71 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -153,6 +153,7 @@ struct request {
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
 	unsigned long start_time;
+	s64 issue_time;
 #ifdef CONFIG_BLK_CGROUP
 	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
@@ -402,6 +403,9 @@ struct request_queue {
 
 	unsigned int		nr_sorted;
 	unsigned int		in_flight[2];
+
+	struct blk_rq_stat	rq_stats[2];
+
 	/*
 	 * Number of active block driver functions for which blk_drain_queue()
 	 * must wait. Must be incremented around functions that unlock the
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
                   ` (5 preceding siblings ...)
  2016-04-26 15:55 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-27 12:06   ` xiakaixu
  2016-04-28 11:05   ` Jan Kara
  2016-04-26 15:55 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
  2016-04-27 18:01 ` [PATCHSET v5] Make background writeback great again for the first time Jan Kara
  8 siblings, 2 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

We can hook this up to the block layer, to help throttle buffered
writes. Or NFS can tap into it, to accomplish the same.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
               wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/wbt.h        |  95 ++++++++
 include/trace/events/wbt.h | 122 +++++++++++
 lib/Kconfig                |   3 +
 lib/Makefile               |   1 +
 lib/wbt.c                  | 524 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 745 insertions(+)
 create mode 100644 include/linux/wbt.h
 create mode 100644 include/trace/events/wbt.h
 create mode 100644 lib/wbt.c

diff --git a/include/linux/wbt.h b/include/linux/wbt.h
new file mode 100644
index 000000000000..c8a12795416b
--- /dev/null
+++ b/include/linux/wbt.h
@@ -0,0 +1,95 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+#include <linux/ktime.h>
+
+#define ISSUE_STAT_MASK		(1ULL << 63)
+#define ISSUE_STAT_TIME_MASK	~ISSUE_STAT_MASK
+
+struct wb_issue_stat {
+	u64 time;
+};
+
+static inline void wbt_issue_stat_set_time(struct wb_issue_stat *stat)
+{
+	stat->time = (stat->time & ISSUE_STAT_MASK) |
+			(ktime_to_ns(ktime_get()) & ISSUE_STAT_TIME_MASK);
+}
+
+static inline u64 wbt_issue_stat_get_time(struct wb_issue_stat *stat)
+{
+	return stat->time & ISSUE_STAT_TIME_MASK;
+}
+
+static inline void wbt_mark_tracked(struct wb_issue_stat *stat)
+{
+	stat->time |= ISSUE_STAT_MASK;
+}
+
+static inline void wbt_clear_tracked(struct wb_issue_stat *stat)
+{
+	stat->time &= ~ISSUE_STAT_MASK;
+}
+
+static inline bool wbt_tracked(struct wb_issue_stat *stat)
+{
+	return (stat->time & ISSUE_STAT_MASK) != 0;
+}
+
+struct wb_stat_ops {
+	void (*get)(void *, struct blk_rq_stat *);
+	void (*clear)(void *);
+};
+
+struct rq_wb {
+	/*
+	 * Settings that govern how we throttle
+	 */
+	unsigned int wb_background;		/* background writeback */
+	unsigned int wb_normal;			/* normal writeback */
+	unsigned int wb_max;			/* max throughput writeback */
+	unsigned int scale_step;
+
+	u64 win_nsec;				/* default window size */
+	u64 cur_win_nsec;			/* current window size */
+
+	unsigned int unknown_cnt;
+
+	struct timer_list window_timer;
+
+	s64 sync_issue;
+	void *sync_cookie;
+
+	unsigned int wc;
+	unsigned int queue_depth;
+
+	unsigned long last_issue;		/* last non-throttled issue */
+	unsigned long last_comp;		/* last non-throttled comp */
+	unsigned long min_lat_nsec;
+	struct backing_dev_info *bdi;
+	struct request_queue *q;
+	wait_queue_head_t wait;
+	atomic_t inflight;
+
+	struct wb_stat_ops *stat_ops;
+	void *ops_data;
+};
+
+struct backing_dev_info;
+
+void __wbt_done(struct rq_wb *);
+void wbt_done(struct rq_wb *, struct wb_issue_stat *);
+bool wbt_wait(struct rq_wb *, unsigned int, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void *);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_issue(struct rq_wb *, struct wb_issue_stat *);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index 000000000000..a4b8b2e57bb1
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,122 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt
+
+#if !defined(_TRACE_WBT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WBT_H
+
+#include <linux/tracepoint.h>
+#include <linux/wbt.h>
+
+/**
+ * wbt_stat - trace stats for blk_wb
+ * @stat: array of read/write stats
+ */
+TRACE_EVENT(wbt_stat,
+
+	TP_PROTO(struct backing_dev_info *bdi, struct blk_rq_stat *stat),
+
+	TP_ARGS(bdi, stat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(s64, rmean)
+		__field(u64, rmin)
+		__field(u64, rmax)
+		__field(s64, rnr_samples)
+		__field(s64, rtime)
+		__field(s64, wmean)
+		__field(u64, wmin)
+		__field(u64, wmax)
+		__field(s64, wnr_samples)
+		__field(s64, wtime)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->rmean		= stat[0].mean;
+		__entry->rmin		= stat[0].min;
+		__entry->rmax		= stat[0].max;
+		__entry->rnr_samples	= stat[0].nr_samples;
+		__entry->wmean		= stat[1].mean;
+		__entry->wmin		= stat[1].min;
+		__entry->wmax		= stat[1].max;
+		__entry->wnr_samples	= stat[1].nr_samples;
+	),
+
+	TP_printk("%s: rmean=%llu, rmin=%llu, rmax=%llu, rsamples=%llu, "
+		  "wmean=%llu, wmin=%llu, wmax=%llu, wsamples=%llu\n",
+		  __entry->name, __entry->rmean, __entry->rmin, __entry->rmax,
+		  __entry->rnr_samples, __entry->wmean, __entry->wmin,
+		  __entry->wmax, __entry->wnr_samples)
+);
+
+/**
+ * wbt_lat - trace latency event
+ * @lat: latency trigger
+ */
+TRACE_EVENT(wbt_lat,
+
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long lat),
+
+	TP_ARGS(bdi, lat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, lat)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->lat = lat;
+	),
+
+	TP_printk("%s: latency %llu\n", __entry->name,
+			(unsigned long long) __entry->lat)
+);
+
+/**
+ * wbt_step - trace wb event step
+ * @msg: context message
+ * @step: the current scale step count
+ * @window: the current monitoring window
+ * @bg: the current background queue limit
+ * @normal: the current normal writeback limit
+ * @max: the current max throughput writeback limit
+ */
+TRACE_EVENT(wbt_step,
+
+	TP_PROTO(struct backing_dev_info *bdi, const char *msg,
+		 unsigned int step, unsigned long window, unsigned int bg,
+		 unsigned int normal, unsigned int max),
+
+	TP_ARGS(bdi, msg, step, window, bg, normal, max),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(const char *, msg)
+		__field(unsigned int, step)
+		__field(unsigned long, window)
+		__field(unsigned int, bg)
+		__field(unsigned int, normal)
+		__field(unsigned int, max)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->msg	= msg;
+		__entry->step	= step;
+		__entry->window	= window;
+		__entry->bg	= bg;
+		__entry->normal	= normal;
+		__entry->max	= max;
+	),
+
+	TP_printk("%s: %s: step=%u, window=%lu, background=%u, normal=%u, max=%u\n",
+		  __entry->name, __entry->msg, __entry->step, __entry->window,
+		  __entry->bg, __entry->normal, __entry->max)
+);
+
+#endif /* _TRACE_WBT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/lib/Kconfig b/lib/Kconfig
index 3cca1222578e..01da47cb9766 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -540,4 +540,7 @@ config STACKDEPOT
 	bool
 	select STACKTRACE
 
+config WBT
+	bool
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 7bd6fd436c97..15366777a1d4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -180,6 +180,7 @@ obj-$(CONFIG_GENERIC_NET_UTILS) += net_utils.o
 obj-$(CONFIG_SG_SPLIT) += sg_split.o
 obj-$(CONFIG_STMP_DEVICE) += stmp_device.o
 obj-$(CONFIG_IRQ_POLL) += irq_poll.o
+obj-$(CONFIG_WBT) += wbt.o
 
 obj-$(CONFIG_STACKDEPOT) += stackdepot.o
 KASAN_SANITIZE_stackdepot.o := n
diff --git a/lib/wbt.c b/lib/wbt.c
new file mode 100644
index 000000000000..650da911f24f
--- /dev/null
+++ b/lib/wbt.c
@@ -0,0 +1,524 @@
+/*
+ * buffered writeback throttling. losely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ *   scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ *   look like, retain status quo.
+ * - If latencies look good, decrement scaling step.
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that (may) need changing:
+ *
+ *	- Different scaling of background/normal/high priority writeback.
+ *	  We may have to violate guarantees for max.
+ *	- We can have mismatches between the stat window and our window.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/slab.h>
+#include <linux/backing-dev.h>
+#include <linux/wbt.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/wbt.h>
+
+enum {
+	/*
+	 * Might need to be higher
+	 */
+	RWB_MAX_DEPTH	= 64,
+
+	/*
+	 * 100msec window
+	 */
+	RWB_WINDOW_NSEC		= 100 * 1000 * 1000ULL,
+
+	/*
+	 * Disregard stats, if we don't meet these minimums
+	 */
+	RWB_MIN_WRITE_SAMPLES	= 3,
+	RWB_MIN_READ_SAMPLES	= 1,
+
+	RWB_UNKNOWN_BUMP	= 5,
+};
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+	return rwb && rwb->wb_normal != 0;
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+	int cur = atomic_read(v);
+
+	for (;;) {
+		int old;
+
+		if (cur >= below)
+			return false;
+		old = atomic_cmpxchg(v, cur, cur + 1);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+
+	return true;
+}
+
+static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
+{
+	if (rwb_enabled(rwb)) {
+		const unsigned long cur = jiffies;
+
+		if (cur != *var)
+			*var = cur;
+	}
+}
+
+void __wbt_done(struct rq_wb *rwb)
+{
+	int inflight, limit = rwb->wb_normal;
+
+	/*
+	 * If the device does write back caching, drop further down
+	 * before we wake people up.
+	 */
+	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
+		limit = 0;
+	else
+		limit = rwb->wb_normal;
+
+	/*
+	 * Don't wake anyone up if we are above the normal limit. If
+	 * throttling got disabled (limit == 0) with waiters, ensure
+	 * that we wake them up.
+	 */
+	inflight = atomic_dec_return(&rwb->inflight);
+	if (limit && inflight >= limit) {
+		if (!rwb->wb_max)
+			wake_up_all(&rwb->wait);
+		return;
+	}
+
+	if (waitqueue_active(&rwb->wait)) {
+		int diff = limit - inflight;
+
+		if (!inflight || diff >= rwb->wb_background / 2)
+			wake_up_nr(&rwb->wait, 1);
+	}
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void wbt_done(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb)
+		return;
+
+	if (!wbt_tracked(stat)) {
+		if (rwb->sync_cookie == stat) {
+			rwb->sync_issue = 0;
+			rwb->sync_cookie = NULL;
+		}
+
+		wb_timestamp(rwb, &rwb->last_comp);
+	} else {
+		WARN_ON_ONCE(stat == rwb->sync_cookie);
+		__wbt_done(rwb);
+		wbt_clear_tracked(stat);
+	}
+}
+
+static void calc_wb_limits(struct rq_wb *rwb)
+{
+	unsigned int depth;
+
+	if (!rwb->min_lat_nsec) {
+		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
+		return;
+	}
+
+	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
+
+	/*
+	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
+	 */
+	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
+	rwb->wb_normal = (rwb->wb_max + 1) / 2;
+	rwb->wb_background = (rwb->wb_max + 3) / 4;
+}
+
+static bool inline stat_sample_valid(struct blk_rq_stat *stat)
+{
+	/*
+	 * We need at least one read sample, and a minimum of
+	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
+	 * that it's writes impacting us, and not just some sole read on
+	 * a device that is in a lower power state.
+	 */
+	return stat[0].nr_samples >= 1 &&
+		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
+}
+
+static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
+{
+	u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
+
+	if (!issue || !rwb->sync_cookie)
+		return 0;
+
+	now = ktime_to_ns(ktime_get());
+	return now - issue;
+}
+
+enum {
+	LAT_OK,
+	LAT_UNKNOWN,
+	LAT_EXCEEDED,
+};
+
+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
+{
+	u64 thislat;
+
+	/*
+	 * If our stored sync issue exceeds the window size, or it
+	 * exceeds our min target AND we haven't logged any entries,
+	 * flag the latency as exceeded.
+	 */
+	thislat = rwb_sync_issue_lat(rwb);
+	if (thislat > rwb->cur_win_nsec ||
+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
+		trace_wbt_lat(rwb->bdi, thislat);
+		return LAT_EXCEEDED;
+	}
+
+	if (!stat_sample_valid(stat))
+		return LAT_UNKNOWN;
+
+	/*
+	 * If the 'min' latency exceeds our target, step down.
+	 */
+	if (stat[0].min > rwb->min_lat_nsec) {
+		trace_wbt_lat(rwb->bdi, stat[0].min);
+		trace_wbt_stat(rwb->bdi, stat);
+		return LAT_EXCEEDED;
+	}
+
+	if (rwb->scale_step)
+		trace_wbt_stat(rwb->bdi, stat);
+
+	return LAT_OK;
+}
+
+static int latency_exceeded(struct rq_wb *rwb)
+{
+	struct blk_rq_stat stat[2];
+
+	rwb->stat_ops->get(rwb->ops_data, stat);
+	return __latency_exceeded(rwb, stat);
+}
+
+static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
+{
+	trace_wbt_step(rwb->bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
+			rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+}
+
+static void scale_up(struct rq_wb *rwb)
+{
+	/*
+	 * If we're at 0, we can't go lower.
+	 */
+	if (!rwb->scale_step)
+		return;
+
+	rwb->scale_step--;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+	calc_wb_limits(rwb);
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_all(&rwb->wait);
+
+	rwb_trace_step(rwb, "step up");
+}
+
+static void scale_down(struct rq_wb *rwb)
+{
+	/*
+	 * Stop scaling down when we've hit the limit. This also prevents
+	 * ->scale_step from going to crazy values, if the device can't
+	 * keep up.
+	 */
+	if (rwb->wb_max == 1)
+		return;
+
+	rwb->scale_step++;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+	calc_wb_limits(rwb);
+	rwb_trace_step(rwb, "step down");
+}
+
+static void rwb_arm_timer(struct rq_wb *rwb)
+{
+	unsigned long expires;
+
+	/*
+	 * We should speed this up, using some variant of a fast integer
+	 * inverse square root calculation. Since we only do this for
+	 * every window expiration, it's not a huge deal, though.
+	 */
+	rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
+					int_sqrt((rwb->scale_step + 1) << 8));
+	expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec);
+	mod_timer(&rwb->window_timer, expires);
+}
+
+static void wb_timer_fn(unsigned long data)
+{
+	struct rq_wb *rwb = (struct rq_wb *) data;
+	int status;
+
+	/*
+	 * If we exceeded the latency target, step down. If we did not,
+	 * step one level up. If we don't know enough to say either exceeded
+	 * or ok, then don't do anything.
+	 */
+	status = latency_exceeded(rwb);
+	switch (status) {
+	case LAT_EXCEEDED:
+		scale_down(rwb);
+		break;
+	case LAT_OK:
+		scale_up(rwb);
+		break;
+	case LAT_UNKNOWN:
+		/*
+		 * We had no read samples, start bumping up the write
+		 * depth slowly
+		 */
+		if (++rwb->unknown_cnt >= RWB_UNKNOWN_BUMP)
+			scale_up(rwb);
+		break;
+	default:
+		break;
+	}
+
+	/*
+	 * Re-arm timer, if we have IO in flight
+	 */
+	if (rwb->scale_step || atomic_read(&rwb->inflight))
+		rwb_arm_timer(rwb);
+}
+
+void wbt_update_limits(struct rq_wb *rwb)
+{
+	rwb->scale_step = 0;
+	calc_wb_limits(rwb);
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_all(&rwb->wait);
+}
+
+static bool close_io(struct rq_wb *rwb)
+{
+	const unsigned long now = jiffies;
+
+	return time_before(now, rwb->last_issue + HZ / 10) ||
+		time_before(now, rwb->last_comp + HZ / 10);
+}
+
+#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO)
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
+{
+	unsigned int limit;
+
+	/*
+	 * At this point we know it's a buffered write. If REQ_SYNC is
+	 * set, then it's WB_SYNC_ALL writeback, and we'll use the max
+	 * limit for that. If the write is marked as a background write,
+	 * then use the idle limit, or go to normal if we haven't had
+	 * competing IO for a bit.
+	 */
+	if ((rw & REQ_HIPRIO) || atomic_read(&rwb->bdi->wb.dirty_sleeping))
+		limit = rwb->wb_max;
+	else if ((rw & REQ_BG) || close_io(rwb)) {
+		/*
+		 * If less than 100ms since we completed unrelated IO,
+		 * limit us to half the depth for background writeback.
+		 */
+		limit = rwb->wb_background;
+	} else
+		limit = rwb->wb_normal;
+
+	return limit;
+}
+
+static inline bool may_queue(struct rq_wb *rwb, unsigned long rw)
+{
+	/*
+	 * inc it here even if disabled, since we'll dec it at completion.
+	 * this only happens if the task was sleeping in __wbt_wait(),
+	 * and someone turned it off at the same time.
+	 */
+	if (!rwb_enabled(rwb)) {
+		atomic_inc(&rwb->inflight);
+		return true;
+	}
+
+	return atomic_inc_below(&rwb->inflight, get_limit(rwb, rw));
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+{
+	DEFINE_WAIT(wait);
+
+	if (may_queue(rwb, rw))
+		return;
+
+	do {
+		prepare_to_wait_exclusive(&rwb->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+
+		if (may_queue(rwb, rw))
+			break;
+
+		if (lock)
+			spin_unlock_irq(lock);
+
+		io_schedule();
+
+		if (lock)
+			spin_lock_irq(lock);
+	} while (1);
+
+	finish_wait(&rwb->wait, &wait);
+}
+
+static inline bool wbt_should_throttle(struct rq_wb *rwb, unsigned int rw)
+{
+	/*
+	 * If not a WRITE (or a discard), do nothing
+	 */
+	if (!(rw & REQ_WRITE) || (rw & REQ_DISCARD))
+		return false;
+
+	/*
+	 * Don't throttle WRITE_ODIRECT
+	 */
+	if ((rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+		return false;
+
+	return true;
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+bool wbt_wait(struct rq_wb *rwb, unsigned int rw, spinlock_t *lock)
+{
+	if (!rwb_enabled(rwb))
+		return false;
+
+	if (!wbt_should_throttle(rwb, rw)) {
+		wb_timestamp(rwb, &rwb->last_issue);
+		return false;
+	}
+
+	__wbt_wait(rwb, rw, lock);
+
+	if (!timer_pending(&rwb->window_timer))
+		rwb_arm_timer(rwb);
+
+	return true;
+}
+
+void wbt_issue(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+
+	wbt_issue_stat_set_time(stat);
+
+	if (!wbt_tracked(stat) && !rwb->sync_issue) {
+		rwb->sync_cookie = stat;
+		rwb->sync_issue = wbt_issue_stat_get_time(stat);
+	}
+}
+
+void wbt_requeue(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+	if (stat == rwb->sync_cookie) {
+		rwb->sync_issue = 0;
+		rwb->sync_cookie = NULL;
+	}
+}
+
+void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+{
+	if (rwb) {
+		rwb->queue_depth = depth;
+		wbt_update_limits(rwb);
+	}
+}
+
+void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
+{
+	if (rwb)
+		rwb->wc = write_cache_on;
+}
+
+struct rq_wb *wbt_init(struct backing_dev_info *bdi, struct wb_stat_ops *ops,
+		       void *ops_data)
+{
+	struct rq_wb *rwb;
+
+	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+	if (!rwb)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&rwb->inflight, 0);
+	init_waitqueue_head(&rwb->wait);
+	setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb);
+	rwb->wc = 1;
+	rwb->queue_depth = RWB_MAX_DEPTH;
+	rwb->last_comp = rwb->last_issue = jiffies;
+	rwb->bdi = bdi;
+	rwb->win_nsec = RWB_WINDOW_NSEC;
+	rwb->stat_ops = ops,
+	rwb->ops_data = ops_data;
+	wbt_update_limits(rwb);
+	return rwb;
+}
+
+void wbt_exit(struct rq_wb *rwb)
+{
+	if (rwb) {
+		del_timer_sync(&rwb->window_timer);
+		kfree(rwb);
+	}
+}
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 8/8] writeback: throttle buffered writeback
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
                   ` (6 preceding siblings ...)
  2016-04-26 15:55 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  2016-04-27 18:01 ` [PATCHSET v5] Make background writeback great again for the first time Jan Kara
  8 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

Test patch that throttles buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.

The patch registers two sysfs entries. The first one, 'wb_window_usec',
defines the window of monitoring. The second one, 'wb_lat_usec',
sets the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting
this value to '0' disables blk-wb. Generally, a user would not have
to touch these settings.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 Documentation/block/queue-sysfs.txt |  13 ++++
 block/Kconfig                       |   1 +
 block/blk-core.c                    |  21 ++++++-
 block/blk-mq.c                      |  32 +++++++++-
 block/blk-settings.c                |   3 +
 block/blk-stat.c                    |   5 +-
 block/blk-sysfs.c                   | 119 ++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h              |   6 +-
 8 files changed, 191 insertions(+), 9 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index dce25d848d92..9bc990abef4d 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -151,5 +151,18 @@ device state. This means that it might not be safe to toggle the
 setting from "write back" to "write through", since that will also
 eliminate cache flushes issued by the kernel.
 
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes.
+
+wb_window_usec (RW)
+-------------------
+If the device is registered for writeback throttling, then this file shows
+the value of the monitoring window in which we'll look at the target
+latency. See wb_lat_usec.
+
 
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/Kconfig b/block/Kconfig
index 0363cd731320..d4c2ff4b9b2c 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -4,6 +4,7 @@
 menuconfig BLOCK
        bool "Enable the block layer" if EXPERT
        default y
+       select WBT
        help
 	 Provide block layer support for the kernel.
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 40b57bf4852c..c166d46a09d1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -33,6 +33,7 @@
 #include <linux/ratelimit.h>
 #include <linux/pm_runtime.h>
 #include <linux/blk-cgroup.h>
+#include <linux/wbt.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -880,6 +881,8 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 
 fail:
 	blk_free_flush_queue(q->fq);
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
 	return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1395,6 +1398,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 	blk_delete_timer(rq);
 	blk_clear_rq_complete(rq);
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->wb_stat);
 
 	if (rq->cmd_flags & REQ_QUEUED)
 		blk_queue_end_tag(q, rq);
@@ -1485,6 +1489,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	/* this is a bio leak */
 	WARN_ON(req->bio != NULL);
 
+	wbt_done(q->rq_wb, &req->wb_stat);
+
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
 	 * it didn't come out of our reserved rq pools
@@ -1714,6 +1720,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
 	struct request *req;
 	unsigned int request_count = 0;
+	bool wb_acct;
 
 	/*
 	 * low level driver can indicate that it wants pages above a
@@ -1766,6 +1773,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	}
 
 get_rq:
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_rw, q->queue_lock);
+
 	/*
 	 * This sync check and mask will be re-done in init_request_from_bio(),
 	 * but we need to set it earlier to expose the sync flag to the
@@ -1781,11 +1790,16 @@ get_rq:
 	 */
 	req = get_request(q, rw_flags, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		if (wb_acct)
+			__wbt_done(q->rq_wb);
 		bio->bi_error = PTR_ERR(req);
 		bio_endio(bio);
 		goto out_unlock;
 	}
 
+	if (wb_acct)
+		wbt_mark_tracked(&req->wb_stat);
+
 	/*
 	 * After dropping the lock and possibly sleeping here, our request
 	 * may now be mergeable after it had proven unmergeable (above).
@@ -2514,7 +2528,7 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
-	req->issue_time = ktime_to_ns(ktime_get());
+	wbt_issue(req->q->rq_wb, &req->wb_stat);
 
 	/*
 	 * We are now handing the request to the hardware, initialize
@@ -2752,9 +2766,10 @@ void blk_finish_request(struct request *req, int error)
 
 	blk_account_io_done(req);
 
-	if (req->end_io)
+	if (req->end_io) {
+		wbt_done(req->q->rq_wb, &req->wb_stat);
 		req->end_io(req, error);
-	else {
+	} else {
 		if (blk_bidi_rq(req))
 			__blk_put_request(req->next_rq->q, req->next_rq);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 71b4a13fbf94..556229e4da92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/wbt.h>
 
 #include <trace/events/block.h>
 
@@ -275,6 +276,8 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+
+	wbt_done(q->rq_wb, &rq->wb_stat);
 	rq->cmd_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -307,6 +310,7 @@ inline void __blk_mq_end_request(struct request *rq, int error)
 	blk_account_io_done(rq);
 
 	if (rq->end_io) {
+		wbt_done(rq->q->rq_wb, &rq->wb_stat);
 		rq->end_io(rq, error);
 	} else {
 		if (unlikely(blk_bidi_rq(rq)))
@@ -413,7 +417,7 @@ void blk_mq_start_request(struct request *rq)
 	if (unlikely(blk_bidi_rq(rq)))
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
-	rq->issue_time = ktime_to_ns(ktime_get());
+	wbt_issue(q->rq_wb, &rq->wb_stat);
 
 	blk_add_timer(rq);
 
@@ -450,6 +454,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	struct request_queue *q = rq->q;
 
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->wb_stat);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1265,6 +1270,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
+	bool wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1282,9 +1288,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_rw, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct)
+			__wbt_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	if (wb_acct)
+		wbt_mark_tracked(&rq->wb_stat);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1361,6 +1375,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_map_ctx data;
 	struct request *rq;
 	blk_qc_t cookie;
+	bool wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1375,9 +1390,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return BLK_QC_T_NONE;
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_rw, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct)
+			__wbt_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	if (wb_acct)
+		wbt_mark_tracked(&rq->wb_stat);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -2111,6 +2134,9 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
+
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f7e122e717e8..746dc9fee1ac 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -840,6 +840,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
+	wbt_set_queue_depth(q->rq_wb, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
@@ -863,6 +864,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 	else
 		queue_flag_clear(QUEUE_FLAG_FUA, q);
 	spin_unlock_irq(q->queue_lock);
+
+	wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
diff --git a/block/blk-stat.c b/block/blk-stat.c
index b38776a83173..8e3974d87c1f 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -143,15 +143,16 @@ void blk_stat_init(struct blk_rq_stat *stat)
 void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
 {
 	s64 delta, now, value;
+	u64 rq_time = wbt_issue_stat_get_time(&rq->wb_stat);
 
 	now = ktime_to_ns(ktime_get());
-	if (now < rq->issue_time)
+	if (now < rq_time)
 		return;
 
 	if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK))
 		__blk_stat_init(stat, now);
 
-	value = now - rq->issue_time;
+	value = now - rq_time;
 	if (value > stat->max)
 		stat->max = value;
 	if (value < stat->min)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6e516cc0d3d0..df194bf93598 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
 #include <linux/blk-cgroup.h>
+#include <linux/wbt.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
 	return count;
 }
 
+static ssize_t queue_var_store64(u64 *var, const char *page)
+{
+	int err;
+	u64 v;
+
+	err = kstrtou64(page, 10, &v);
+	if (err < 0)
+		return err;
+
+	*var = v;
+	return 0;
+}
+
 static ssize_t queue_requests_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->nr_requests, (page));
@@ -347,6 +361,58 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wb_win_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->win_nsec, 1000));
+}
+
+static ssize_t queue_wb_win_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->win_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
+static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+}
+
+static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->min_lat_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
 static ssize_t queue_wc_show(struct request_queue *q, char *page)
 {
 	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -541,6 +607,18 @@ static struct queue_sysfs_entry queue_stats_entry = {
 	.show = queue_stats_show,
 };
 
+static struct queue_sysfs_entry queue_wb_lat_entry = {
+	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_lat_show,
+	.store = queue_wb_lat_store,
+};
+
+static struct queue_sysfs_entry queue_wb_win_entry = {
+	.attr = {.name = "wbt_window_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_win_show,
+	.store = queue_wb_win_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -568,6 +646,8 @@ static struct attribute *default_attrs[] = {
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
 	&queue_stats_entry.attr,
+	&queue_wb_lat_entry.attr,
+	&queue_wb_win_entry.attr,
 	NULL,
 };
 
@@ -682,6 +762,43 @@ struct kobj_type blk_queue_ktype = {
 	.release	= blk_release_queue,
 };
 
+static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
+{
+	blk_queue_stat_get(data, stat);
+}
+
+static void blk_wb_stat_clear(void *data)
+{
+	blk_stat_clear(data);
+}
+
+static struct wb_stat_ops wb_stat_ops = {
+	.get	= blk_wb_stat_get,
+	.clear	= blk_wb_stat_clear,
+};
+
+static void blk_wb_init(struct request_queue *q)
+{
+	struct rq_wb *rwb;
+
+	rwb = wbt_init(&q->backing_dev_info, &wb_stat_ops, q);
+
+	/*
+	 * If this fails, we don't get throttling
+	 */
+	if (IS_ERR(rwb))
+		return;
+
+	if (blk_queue_nonrot(q))
+		rwb->min_lat_nsec = 2000000ULL;
+	else
+		rwb->min_lat_nsec = 75000000ULL;
+
+	wbt_set_queue_depth(rwb, blk_queue_depth(q));
+	wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+	q->rq_wb = rwb;
+}
+
 int blk_register_queue(struct gendisk *disk)
 {
 	int ret;
@@ -721,6 +838,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (q->mq_ops)
 		blk_mq_register_disk(disk);
 
+	blk_wb_init(q);
+
 	if (!q->request_fn)
 		return 0;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87f6703ced71..a89f46c58d5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -24,6 +24,7 @@
 #include <linux/rcupdate.h>
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
+#include <linux/wbt.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -37,6 +38,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
+struct rq_wb;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -153,7 +155,7 @@ struct request {
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
 	unsigned long start_time;
-	s64 issue_time;
+	struct wb_issue_stat wb_stat;
 #ifdef CONFIG_BLK_CGROUP
 	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
@@ -291,6 +293,8 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct rq_wb		*rq_wb;
+
 	/*
 	 * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
 	 * is used, root blkg allocates from @q->root_rl and all other
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-26 15:55 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
@ 2016-04-27 12:06   ` xiakaixu
  2016-04-27 15:21     ` Jens Axboe
  2016-04-28 11:05   ` Jan Kara
  1 sibling, 1 reply; 45+ messages in thread
From: xiakaixu @ 2016-04-27 12:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner,
	sedat.dilek, miaoxie (A),
	Huxinwei, Bintian


> +	return rwb && rwb->wb_normal != 0;
> +}
> +
> +/*
> + * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
> + * false if 'v' + 1 would be bigger than 'below'.
> + */
> +static bool atomic_inc_below(atomic_t *v, int below)
> +{
> +	int cur = atomic_read(v);
> +
> +	for (;;) {
> +		int old;
> +
> +		if (cur >= below)
> +			return false;
> +		old = atomic_cmpxchg(v, cur, cur + 1);
> +		if (old == cur)
> +			break;
> +		cur = old;
> +	}
> +
> +	return true;
> +}
> +
> +static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
> +{
> +	if (rwb_enabled(rwb)) {
> +		const unsigned long cur = jiffies;
> +
> +		if (cur != *var)
> +			*var = cur;
> +	}
> +}
> +
> +void __wbt_done(struct rq_wb *rwb)
> +{
> +	int inflight, limit = rwb->wb_normal;
> +
> +	/*
> +	 * If the device does write back caching, drop further down
> +	 * before we wake people up.
> +	 */
> +	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> +		limit = 0;
> +	else
> +		limit = rwb->wb_normal;
> +
> +	/*
> +	 * Don't wake anyone up if we are above the normal limit. If
> +	 * throttling got disabled (limit == 0) with waiters, ensure
> +	 * that we wake them up.
> +	 */
> +	inflight = atomic_dec_return(&rwb->inflight);
> +	if (limit && inflight >= limit) {
> +		if (!rwb->wb_max)
> +			wake_up_all(&rwb->wait);
> +		return;
> +	}
> +
Hi Jens,

Just a little confused about this. The rwb->wb_max can't be 0 if the variable
'limit' does not equal to 0. So the if (!rwb->wb_max) branch maybe does not
make sense.


> +	if (waitqueue_active(&rwb->wait)) {
> +		int diff = limit - inflight;
> +
> +		if (!inflight || diff >= rwb->wb_background / 2)
> +			wake_up_nr(&rwb->wait, 1);
> +	}
> +}
> +
> +/*
> + * Called on completion of a request. Note that it's also called when
> + * a request is merged, when the request gets freed.
> + */
> +void wbt_done(struct rq_wb *rwb, struct wb_issue_stat *stat)
> +{
> +	if (!rwb)
> +		return;
> +
> +	if (!wbt_tracked(stat)) {
> +		if (rwb->sync_cookie == stat) {
> +			rwb->sync_issue = 0;
> +			rwb->sync_cookie = NULL;
> +		}
> +
> +		wb_timestamp(rwb, &rwb->last_comp);
> +	} else {
> +		WARN_ON_ONCE(stat == rwb->sync_cookie);
> +		__wbt_done(rwb);
> +		wbt_clear_tracked(stat);
> +	}
> +}
> +
> +static void calc_wb_limits(struct rq_wb *rwb)
> +{
> +	unsigned int depth;
> +
> +	if (!rwb->min_lat_nsec) {
> +		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
> +		return;
> +	}
> +
> +	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> +	/*
> +	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +	 */
> +	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> +	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> +	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +}
> +
> +static bool inline stat_sample_valid(struct blk_rq_stat *stat)
> +{
> +	/*
> +	 * We need at least one read sample, and a minimum of
> +	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
> +	 * that it's writes impacting us, and not just some sole read on
> +	 * a device that is in a lower power state.
> +	 */
> +	return stat[0].nr_samples >= 1 &&
> +		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
> +}
> +

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-27 12:06   ` xiakaixu
@ 2016-04-27 15:21     ` Jens Axboe
  2016-04-28  3:29       ` xiakaixu
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-04-27 15:21 UTC (permalink / raw)
  To: xiakaixu
  Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner,
	sedat.dilek, miaoxie (A),
	Huxinwei, Bintian

[-- Attachment #1: Type: text/plain, Size: 1144 bytes --]

On 04/27/2016 06:06 AM, xiakaixu wrote:
>> +void __wbt_done(struct rq_wb *rwb)
>> +{
>> +	int inflight, limit = rwb->wb_normal;
>> +
>> +	/*
>> +	 * If the device does write back caching, drop further down
>> +	 * before we wake people up.
>> +	 */
>> +	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
>> +		limit = 0;
>> +	else
>> +		limit = rwb->wb_normal;
>> +
>> +	/*
>> +	 * Don't wake anyone up if we are above the normal limit. If
>> +	 * throttling got disabled (limit == 0) with waiters, ensure
>> +	 * that we wake them up.
>> +	 */
>> +	inflight = atomic_dec_return(&rwb->inflight);
>> +	if (limit && inflight >= limit) {
>> +		if (!rwb->wb_max)
>> +			wake_up_all(&rwb->wait);
>> +		return;
>> +	}
>> +
> Hi Jens,
>
> Just a little confused about this. The rwb->wb_max can't be 0 if the variable
> 'limit' does not equal to 0. So the if (!rwb->wb_max) branch maybe does not
> make sense.

You are right, it doesn't make a lot of sense. I think it suffers from 
code shuffling. How about the attached? The important part is that we 
wake up waiters, if wbt got disabled while we had tracked IO in flight.

-- 
Jens Axboe


[-- Attachment #2: wbt-disable-wakeup.patch --]
[-- Type: text/x-patch, Size: 912 bytes --]

diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..a6b80c135510 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
 	else
 		limit = rwb->wb_normal;
 
+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}
 
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
                   ` (7 preceding siblings ...)
  2016-04-26 15:55 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
@ 2016-04-27 18:01 ` Jan Kara
  2016-04-27 18:17   ` Jens Axboe
  8 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-04-27 18:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner, sedat.dilek

Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> Since the dawn of time, our background buffered writeback has sucked.
> When we do background buffered writeback, it should have little impact
> on foreground activity. That's the definition of background activity...
> But for as long as I can remember, heavy buffered writers have not
> behaved like that. For instance, if I do something like this:
> 
> $ dd if=/dev/zero of=foo bs=1M count=10k
> 
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get people
> yelling at me.
> 
> I have posted plenty of results previously, I'll keep it shorter
> this time. Here's a run on my laptop, using read-to-pipe-async for
> reading a 5g file, and rewriting it. You can find this test program
> in the fio git repo.

I have tested your patchset on my test system. Generally I have observed
noticeable drop in average throughput for heavy background writes without
any other disk activity and also somewhat increased variance in the
runtimes. It is most visible on this simple testcases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000

and

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync

The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ
which is used for the disk. When I switched the disk to deadline, I still
get some variance although, the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is rotational SATA drive with writeback cache, queue depth of the
disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-27 18:01 ` [PATCHSET v5] Make background writeback great again for the first time Jan Kara
@ 2016-04-27 18:17   ` Jens Axboe
  2016-04-27 20:37     ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-04-27 18:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 04/27/2016 12:01 PM, Jan Kara wrote:
> Hi,
>
> On Tue 26-04-16 09:55:23, Jens Axboe wrote:
>> Since the dawn of time, our background buffered writeback has sucked.
>> When we do background buffered writeback, it should have little impact
>> on foreground activity. That's the definition of background activity...
>> But for as long as I can remember, heavy buffered writers have not
>> behaved like that. For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop, and then try and start chrome, it basically won't start
>> before the buffered writeback is done. Or, for server oriented
>> workloads, where installation of a big RPM (or similar) adversely
>> impacts database reads or sync writes. When that happens, I get people
>> yelling at me.
>>
>> I have posted plenty of results previously, I'll keep it shorter
>> this time. Here's a run on my laptop, using read-to-pipe-async for
>> reading a 5g file, and rewriting it. You can find this test program
>> in the fio git repo.
>
> I have tested your patchset on my test system. Generally I have observed
> noticeable drop in average throughput for heavy background writes without
> any other disk activity and also somewhat increased variance in the
> runtimes. It is most visible on this simple testcases:
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>
> and
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>
> The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> created before each dd run on a dedicated disk.
>
> Without your patches I get pretty stable dd runtimes for both cases:
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000
> Runtimes: 87.9611 87.3279 87.2554
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> Runtimes: 93.3502 93.2086 93.541
>
> With your patches the numbers look like:
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000
> Runtimes: 108.183, 97.184, 99.9587
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> Runtimes: 104.9, 102.775, 102.892
>
> I have checked whether the variance is due to some interaction with CFQ
> which is used for the disk. When I switched the disk to deadline, I still
> get some variance although, the throughput is still ~10% lower:
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000
> Runtimes: 100.417 100.643 100.866
>
> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> Runtimes: 104.208 106.341 105.483
>
> The disk is rotational SATA drive with writeback cache, queue depth of the
> disk reported in /sys/block/sdb/device/queue_depth is 1.
>
> So I think we still need some tweaking on the low end of the storage
> spectrum so that we don't lose 10% of throughput for simple cases like
> this.

Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if you 
are seeing smaller requests, and that is why it both varies and you get 
lower throughput? I'll try and setup a test here similar to yours.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-27 18:17   ` Jens Axboe
@ 2016-04-27 20:37     ` Jens Axboe
  2016-04-27 20:59       ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-04-27 20:37 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On Wed, Apr 27 2016, Jens Axboe wrote:
> On 04/27/2016 12:01 PM, Jan Kara wrote:
> >Hi,
> >
> >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> >>Since the dawn of time, our background buffered writeback has sucked.
> >>When we do background buffered writeback, it should have little impact
> >>on foreground activity. That's the definition of background activity...
> >>But for as long as I can remember, heavy buffered writers have not
> >>behaved like that. For instance, if I do something like this:
> >>
> >>$ dd if=/dev/zero of=foo bs=1M count=10k
> >>
> >>on my laptop, and then try and start chrome, it basically won't start
> >>before the buffered writeback is done. Or, for server oriented
> >>workloads, where installation of a big RPM (or similar) adversely
> >>impacts database reads or sync writes. When that happens, I get people
> >>yelling at me.
> >>
> >>I have posted plenty of results previously, I'll keep it shorter
> >>this time. Here's a run on my laptop, using read-to-pipe-async for
> >>reading a 5g file, and rewriting it. You can find this test program
> >>in the fio git repo.
> >
> >I have tested your patchset on my test system. Generally I have observed
> >noticeable drop in average throughput for heavy background writes without
> >any other disk activity and also somewhat increased variance in the
> >runtimes. It is most visible on this simple testcases:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> >
> >and
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >
> >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> >created before each dd run on a dedicated disk.
> >
> >Without your patches I get pretty stable dd runtimes for both cases:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> >Runtimes: 87.9611 87.3279 87.2554
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >Runtimes: 93.3502 93.2086 93.541
> >
> >With your patches the numbers look like:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> >Runtimes: 108.183, 97.184, 99.9587
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >Runtimes: 104.9, 102.775, 102.892
> >
> >I have checked whether the variance is due to some interaction with CFQ
> >which is used for the disk. When I switched the disk to deadline, I still
> >get some variance although, the throughput is still ~10% lower:
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> >Runtimes: 100.417 100.643 100.866
> >
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >Runtimes: 104.208 106.341 105.483
> >
> >The disk is rotational SATA drive with writeback cache, queue depth of the
> >disk reported in /sys/block/sdb/device/queue_depth is 1.
> >
> >So I think we still need some tweaking on the low end of the storage
> >spectrum so that we don't lose 10% of throughput for simple cases like
> >this.
> 
> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> you are seeing smaller requests, and that is why it both varies and
> you get lower throughput? I'll try and setup a test here similar to
> yours.

Jan, care to try the below patch? I can't fully reproduce your issue on
a SCSI disk limited to QD=1, but I have a feeling this might help. It's
a bit of a hack, but the general idea is to allow one more request to
build up for QD=1 devices. That eliminates wait time between one request
finishing, and the next being submitted.


diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..6b24c8525ace 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -93,23 +93,30 @@ void __wbt_done(struct rq_wb *rwb)
 	 * If the device does write back caching, drop further down
 	 * before we wake people up.
 	 */
-	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
+	if (rwb->queue_depth == 1)
+		limit = 2;
+	else if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
 		limit = 0;
 	else
 		limit = rwb->wb_normal;
 
+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}
 
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;
 
@@ -366,6 +373,9 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 	} else
 		limit = rwb->wb_normal;
 
+	if (rwb->queue_depth == 1)
+		limit = 2;
+
 	return limit;
 }
 

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-27 20:37     ` Jens Axboe
@ 2016-04-27 20:59       ` Jens Axboe
  2016-04-28  4:06         ` xiakaixu
  2016-04-28 11:54         ` Jan Kara
  0 siblings, 2 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-27 20:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On Wed, Apr 27 2016, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > On 04/27/2016 12:01 PM, Jan Kara wrote:
> > >Hi,
> > >
> > >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> > >>Since the dawn of time, our background buffered writeback has sucked.
> > >>When we do background buffered writeback, it should have little impact
> > >>on foreground activity. That's the definition of background activity...
> > >>But for as long as I can remember, heavy buffered writers have not
> > >>behaved like that. For instance, if I do something like this:
> > >>
> > >>$ dd if=/dev/zero of=foo bs=1M count=10k
> > >>
> > >>on my laptop, and then try and start chrome, it basically won't start
> > >>before the buffered writeback is done. Or, for server oriented
> > >>workloads, where installation of a big RPM (or similar) adversely
> > >>impacts database reads or sync writes. When that happens, I get people
> > >>yelling at me.
> > >>
> > >>I have posted plenty of results previously, I'll keep it shorter
> > >>this time. Here's a run on my laptop, using read-to-pipe-async for
> > >>reading a 5g file, and rewriting it. You can find this test program
> > >>in the fio git repo.
> > >
> > >I have tested your patchset on my test system. Generally I have observed
> > >noticeable drop in average throughput for heavy background writes without
> > >any other disk activity and also somewhat increased variance in the
> > >runtimes. It is most visible on this simple testcases:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >
> > >and
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >
> > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> > >created before each dd run on a dedicated disk.
> > >
> > >Without your patches I get pretty stable dd runtimes for both cases:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >Runtimes: 87.9611 87.3279 87.2554
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtimes: 93.3502 93.2086 93.541
> > >
> > >With your patches the numbers look like:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >Runtimes: 108.183, 97.184, 99.9587
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtimes: 104.9, 102.775, 102.892
> > >
> > >I have checked whether the variance is due to some interaction with CFQ
> > >which is used for the disk. When I switched the disk to deadline, I still
> > >get some variance although, the throughput is still ~10% lower:
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > >Runtimes: 100.417 100.643 100.866
> > >
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtimes: 104.208 106.341 105.483
> > >
> > >The disk is rotational SATA drive with writeback cache, queue depth of the
> > >disk reported in /sys/block/sdb/device/queue_depth is 1.
> > >
> > >So I think we still need some tweaking on the low end of the storage
> > >spectrum so that we don't lose 10% of throughput for simple cases like
> > >this.
> > 
> > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> > you are seeing smaller requests, and that is why it both varies and
> > you get lower throughput? I'll try and setup a test here similar to
> > yours.
> 
> Jan, care to try the below patch? I can't fully reproduce your issue on
> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> a bit of a hack, but the general idea is to allow one more request to
> build up for QD=1 devices. That eliminates wait time between one request
> finishing, and the next being submitted.

That accidentally added a potentially stall, this one is both cleaner
and should have that fixed.

diff --git a/lib/wbt.c b/lib/wbt.c
index 650da911f24f..322f5e04e994 100644
--- a/lib/wbt.c
+++ b/lib/wbt.c
@@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
 	else
 		limit = rwb->wb_normal;
 
+	inflight = atomic_dec_return(&rwb->inflight);
+
 	/*
-	 * Don't wake anyone up if we are above the normal limit. If
-	 * throttling got disabled (limit == 0) with waiters, ensure
-	 * that we wake them up.
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
 	 */
-	inflight = atomic_dec_return(&rwb->inflight);
-	if (limit && inflight >= limit) {
-		if (!rwb->wb_max)
-			wake_up_all(&rwb->wait);
+	if (!rwb_enabled(rwb)) {
+		wake_up_all(&rwb->wait);
 		return;
 	}
 
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
 	if (waitqueue_active(&rwb->wait)) {
 		int diff = limit - inflight;
 
@@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
 		return;
 	}
 
-	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
-
 	/*
-	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
+	 * For QD=1 devices, this is a special case. It's important for those
+	 * to have one request ready when one completes, so force a depth of
+	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+	 * since the device can't have more than that in flight.
 	 */
-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
-	rwb->wb_background = (rwb->wb_max + 3) / 4;
+	if (rwb->queue_depth == 1) {
+		rwb->wb_max = rwb->wb_normal = 2;
+		rwb->wb_background = 1;
+	} else {
+		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
+
+		/*
+		 * Reduce max depth by 50%, and re-calculate normal/bg based on
+		 * that.
+		 */
+		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
+		rwb->wb_normal = (rwb->wb_max + 1) / 2;
+		rwb->wb_background = (rwb->wb_max + 3) / 4;
+	}
 }
 
 static bool inline stat_sample_valid(struct blk_rq_stat *stat)

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-27 15:21     ` Jens Axboe
@ 2016-04-28  3:29       ` xiakaixu
  0 siblings, 0 replies; 45+ messages in thread
From: xiakaixu @ 2016-04-28  3:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner,
	sedat.dilek, miaoxie (A),
	Huxinwei, Bintian

于 2016/4/27 23:21, Jens Axboe 写道:
> On 04/27/2016 06:06 AM, xiakaixu wrote:
>>> +void __wbt_done(struct rq_wb *rwb)
>>> +{
>>> +    int inflight, limit = rwb->wb_normal;
>>> +
>>> +    /*
>>> +     * If the device does write back caching, drop further down
>>> +     * before we wake people up.
>>> +     */
>>> +    if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
>>> +        limit = 0;
>>> +    else
>>> +        limit = rwb->wb_normal;
>>> +
>>> +    /*
>>> +     * Don't wake anyone up if we are above the normal limit. If
>>> +     * throttling got disabled (limit == 0) with waiters, ensure
>>> +     * that we wake them up.
>>> +     */
>>> +    inflight = atomic_dec_return(&rwb->inflight);
>>> +    if (limit && inflight >= limit) {
>>> +        if (!rwb->wb_max)
>>> +            wake_up_all(&rwb->wait);
>>> +        return;
>>> +    }
>>> +
>> Hi Jens,
>>
>> Just a little confused about this. The rwb->wb_max can't be 0 if the variable
>> 'limit' does not equal to 0. So the if (!rwb->wb_max) branch maybe does not
>> make sense.
> 
> You are right, it doesn't make a lot of sense. I think it suffers from code shuffling. How about the attached? The important part is that we wake up waiters, if wbt got disabled while we had tracked IO in flight.
>
Hi Jens,

The modified patch in another mail looks better. Maybe there are still
some places coube be improved. You can find them in that mail.



-- 
Regards
Kaixu Xia

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-27 20:59       ` Jens Axboe
@ 2016-04-28  4:06         ` xiakaixu
  2016-04-28 18:36           ` Jens Axboe
  2016-04-28 11:54         ` Jan Kara
  1 sibling, 1 reply; 45+ messages in thread
From: xiakaixu @ 2016-04-28  4:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek, Huxinwei, miaoxie (A),
	Bintian, Xia Kaixu

于 2016/4/28 4:59, Jens Axboe 写道:
> On Wed, Apr 27 2016, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
>>>> Hi,
>>>>
>>>> On Tue 26-04-16 09:55:23, Jens Axboe wrote:
>>>>> Since the dawn of time, our background buffered writeback has sucked.
>>>>> When we do background buffered writeback, it should have little impact
>>>>> on foreground activity. That's the definition of background activity...
>>>>> But for as long as I can remember, heavy buffered writers have not
>>>>> behaved like that. For instance, if I do something like this:
>>>>>
>>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>>
>>>>> on my laptop, and then try and start chrome, it basically won't start
>>>>> before the buffered writeback is done. Or, for server oriented
>>>>> workloads, where installation of a big RPM (or similar) adversely
>>>>> impacts database reads or sync writes. When that happens, I get people
>>>>> yelling at me.
>>>>>
>>>>> I have posted plenty of results previously, I'll keep it shorter
>>>>> this time. Here's a run on my laptop, using read-to-pipe-async for
>>>>> reading a 5g file, and rewriting it. You can find this test program
>>>>> in the fio git repo.
>>>>
>>>> I have tested your patchset on my test system. Generally I have observed
>>>> noticeable drop in average throughput for heavy background writes without
>>>> any other disk activity and also somewhat increased variance in the
>>>> runtimes. It is most visible on this simple testcases:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>
>>>> and
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>
>>>> The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
>>>> created before each dd run on a dedicated disk.
>>>>
>>>> Without your patches I get pretty stable dd runtimes for both cases:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 87.9611 87.3279 87.2554
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 93.3502 93.2086 93.541
>>>>
>>>> With your patches the numbers look like:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 108.183, 97.184, 99.9587
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 104.9, 102.775, 102.892
>>>>
>>>> I have checked whether the variance is due to some interaction with CFQ
>>>> which is used for the disk. When I switched the disk to deadline, I still
>>>> get some variance although, the throughput is still ~10% lower:
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>> Runtimes: 100.417 100.643 100.866
>>>>
>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>> Runtimes: 104.208 106.341 105.483
>>>>
>>>> The disk is rotational SATA drive with writeback cache, queue depth of the
>>>> disk reported in /sys/block/sdb/device/queue_depth is 1.
>>>>
>>>> So I think we still need some tweaking on the low end of the storage
>>>> spectrum so that we don't lose 10% of throughput for simple cases like
>>>> this.
>>>
>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>> you are seeing smaller requests, and that is why it both varies and
>>> you get lower throughput? I'll try and setup a test here similar to
>>> yours.
>>
>> Jan, care to try the below patch? I can't fully reproduce your issue on
>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>> a bit of a hack, but the general idea is to allow one more request to
>> build up for QD=1 devices. That eliminates wait time between one request
>> finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
> diff --git a/lib/wbt.c b/lib/wbt.c
> index 650da911f24f..322f5e04e994 100644
> --- a/lib/wbt.c
> +++ b/lib/wbt.c
> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>  	else
>  		limit = rwb->wb_normal;
Hi Jens,

This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
enough. It is not a big deal anyway :)


Another question about this if branch:

   if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
	limit = 0;

I can't follow the logic of this if branch. why set limit equal to 0
when the device supports write back caches and there are tasks being
limited in balance_dirty_pages(). Could you pelase give more info
about this ?  Thanks!
>  
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
>  	/*
> -	 * Don't wake anyone up if we are above the normal limit. If
> -	 * throttling got disabled (limit == 0) with waiters, ensure
> -	 * that we wake them up.
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
>  	 */
> -	inflight = atomic_dec_return(&rwb->inflight);
> -	if (limit && inflight >= limit) {
> -		if (!rwb->wb_max)
> -			wake_up_all(&rwb->wait);
> +	if (!rwb_enabled(rwb)) {
> +		wake_up_all(&rwb->wait);
>  		return;
>  	}

Maybe it is better that executing this if branch earlier. So we can wake up
potential waiters in time when wbt got disabled.
>  
> +	/*
> +	 * Don't wake anyone up if we are above the normal limit.
> +	 */
> +	if (inflight && inflight >= limit)
> +		return;
> +
>  	if (waitqueue_active(&rwb->wait)) {
>  		int diff = limit - inflight;
>  
> @@ -150,14 +155,26 @@ static void calc_wb_limits(struct rq_wb *rwb)
>  		return;
>  	}
>  
> -	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> -
>  	/*
> -	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +	 * For QD=1 devices, this is a special case. It's important for those
> +	 * to have one request ready when one completes, so force a depth of
> +	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
> +	 * since the device can't have more than that in flight.
>  	 */
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;
> +	} else {
> +		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> +		/*
> +		 * Reduce max depth by 50%, and re-calculate normal/bg based on
> +		 * that.
> +		 */
> +		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> +		rwb->wb_normal = (rwb->wb_max + 1) / 2;
> +		rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	}
>  }
>  
>  static bool inline stat_sample_valid(struct blk_rq_stat *stat)
> 


-- 
Regards
Kaixu Xia

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-26 15:55 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
  2016-04-27 12:06   ` xiakaixu
@ 2016-04-28 11:05   ` Jan Kara
  2016-04-28 18:53     ` Jens Axboe
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-04-28 11:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner, sedat.dilek

On Tue 26-04-16 09:55:30, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes. Or NFS can tap into it, to accomplish the same.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>                wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.

I have some comments below...

> +struct rq_wb {
> +	/*
> +	 * Settings that govern how we throttle
> +	 */
> +	unsigned int wb_background;		/* background writeback */
> +	unsigned int wb_normal;			/* normal writeback */
> +	unsigned int wb_max;			/* max throughput writeback */
> +	unsigned int scale_step;
> +
> +	u64 win_nsec;				/* default window size */
> +	u64 cur_win_nsec;			/* current window size */
> +
> +	unsigned int unknown_cnt;

It would be useful to have a comment here explaining that 'unknown_cnt' is
a number of consecutive periods in which we didn't have enough data to
decide about queue scaling (at least this is what I understood from the
code).

> +
> +	struct timer_list window_timer;
> +
> +	s64 sync_issue;
> +	void *sync_cookie;

So I'm somewhat wondering: What is protecting consistency of this
structure? The limits, scale_step, cur_win_nsec, unknown_cnt are updated only
from timer so those should be safe. However sync_issue & sync_cookie are
accessed from IO submission and completion path and there we need some
protection to keep those two in sync. It seems q->queue_lock should mostly
achieve those except for blk-mq submission path calling wbt_wait() which
doesn't hold queue_lock.

It seems you were aware of the possible races and the code handles them
mostly fine (although I wouldn't bet too much there is not some weird
corner case). However it would be good to comment on this somewhere and
explain what the rules for these two fields are.

> +
> +	unsigned int wc;
> +	unsigned int queue_depth;
> +
> +	unsigned long last_issue;		/* last non-throttled issue */
> +	unsigned long last_comp;		/* last non-throttled comp */
> +	unsigned long min_lat_nsec;
> +	struct backing_dev_info *bdi;
> +	struct request_queue *q;
> +	wait_queue_head_t wait;
> +	atomic_t inflight;
> +
> +	struct wb_stat_ops *stat_ops;
> +	void *ops_data;
> +};
...
> diff --git a/lib/wbt.c b/lib/wbt.c
> new file mode 100644
> index 000000000000..650da911f24f
> --- /dev/null
> +++ b/lib/wbt.c
> @@ -0,0 +1,524 @@
> +/*
> + * buffered writeback throttling. losely based on CoDel. We can't drop
> + * packets for IO scheduling, so the logic is something like this:
> + *
> + * - Monitor latencies in a defined window of time.
> + * - If the minimum latency in the above window exceeds some target, increment
> + *   scaling step and scale down queue depth by a factor of 2x. The monitoring
> + *   window is then shrunk to 100 / sqrt(scaling step + 1).
> + * - For any window where we don't have solid data on what the latencies
> + *   look like, retain status quo.
> + * - If latencies look good, decrement scaling step.

I'm wondering about two things:

1) There is a logic somewhat in this direction in blk_queue_start_tag().
   Probably it should be removed after your patches land?

2) As far as I can see in patch 8/8, you have plugged the throttling above
   the IO scheduler. When there are e.g. multiple cgroups with different IO
   limits operating, this throttling can lead to strange results (like a
   cgroup with low limit using up all available background "slots" and thus
   effectively stopping background writeback for other cgroups)? So won't
   it make more sense to plug this below the IO scheduler? Now I understand
   there may be other problems with this but I think we should put more
   though to that and provide some justification in changelogs.

> +static void calc_wb_limits(struct rq_wb *rwb)
> +{
> +	unsigned int depth;
> +
> +	if (!rwb->min_lat_nsec) {
> +		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
> +		return;
> +	}
> +
> +	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
> +
> +	/*
> +	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
> +	 */

The comment looks a bit out of place here since we don't reduce max depth
here. We just use whatever is set in scale_step...

> +	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> +	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> +	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +}
> +
> +static bool inline stat_sample_valid(struct blk_rq_stat *stat)
> +{
> +	/*
> +	 * We need at least one read sample, and a minimum of
> +	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
> +	 * that it's writes impacting us, and not just some sole read on
> +	 * a device that is in a lower power state.
> +	 */
> +	return stat[0].nr_samples >= 1 &&
> +		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
> +}
> +
> +static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
> +{
> +	u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
> +
> +	if (!issue || !rwb->sync_cookie)
> +		return 0;
> +
> +	now = ktime_to_ns(ktime_get());
> +	return now - issue;
> +}
> +
> +enum {
> +	LAT_OK,
> +	LAT_UNKNOWN,
> +	LAT_EXCEEDED,
> +};
> +
> +static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> +{
> +	u64 thislat;
> +
> +	/*
> +	 * If our stored sync issue exceeds the window size, or it
> +	 * exceeds our min target AND we haven't logged any entries,
> +	 * flag the latency as exceeded.
> +	 */
> +	thislat = rwb_sync_issue_lat(rwb);
> +	if (thislat > rwb->cur_win_nsec ||
> +	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> +		trace_wbt_lat(rwb->bdi, thislat);
> +		return LAT_EXCEEDED;
> +	}

So I'm trying to wrap my head around this. If I read the code right,
rwb_sync_issue_lat() this returns time that has passed since issuing sync
request that is still running. We basically randomly pick which sync
request we track as we always start tracking a sync request when some is
issued and we are not tracking any at that moment. This is to detect the
case when latency of sync IO is very large compared to measurement window
so we would not get enough samples to make it valid?

Probably the comment could explain more of "why we do this?" than pure
"what we do".

> +
> +	if (!stat_sample_valid(stat))
> +		return LAT_UNKNOWN;
> +
> +	/*
> +	 * If the 'min' latency exceeds our target, step down.
> +	 */
> +	if (stat[0].min > rwb->min_lat_nsec) {
> +		trace_wbt_lat(rwb->bdi, stat[0].min);
> +		trace_wbt_stat(rwb->bdi, stat);
> +		return LAT_EXCEEDED;
> +	}
> +
> +	if (rwb->scale_step)
> +		trace_wbt_stat(rwb->bdi, stat);
> +
> +	return LAT_OK;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-27 20:59       ` Jens Axboe
  2016-04-28  4:06         ` xiakaixu
@ 2016-04-28 11:54         ` Jan Kara
  2016-04-28 18:46           ` Jens Axboe
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-04-28 11:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Wed 27-04-16 14:59:15, Jens Axboe wrote:
> On Wed, Apr 27 2016, Jens Axboe wrote:
> > On Wed, Apr 27 2016, Jens Axboe wrote:
> > > On 04/27/2016 12:01 PM, Jan Kara wrote:
> > > >Hi,
> > > >
> > > >On Tue 26-04-16 09:55:23, Jens Axboe wrote:
> > > >>Since the dawn of time, our background buffered writeback has sucked.
> > > >>When we do background buffered writeback, it should have little impact
> > > >>on foreground activity. That's the definition of background activity...
> > > >>But for as long as I can remember, heavy buffered writers have not
> > > >>behaved like that. For instance, if I do something like this:
> > > >>
> > > >>$ dd if=/dev/zero of=foo bs=1M count=10k
> > > >>
> > > >>on my laptop, and then try and start chrome, it basically won't start
> > > >>before the buffered writeback is done. Or, for server oriented
> > > >>workloads, where installation of a big RPM (or similar) adversely
> > > >>impacts database reads or sync writes. When that happens, I get people
> > > >>yelling at me.
> > > >>
> > > >>I have posted plenty of results previously, I'll keep it shorter
> > > >>this time. Here's a run on my laptop, using read-to-pipe-async for
> > > >>reading a 5g file, and rewriting it. You can find this test program
> > > >>in the fio git repo.
> > > >
> > > >I have tested your patchset on my test system. Generally I have observed
> > > >noticeable drop in average throughput for heavy background writes without
> > > >any other disk activity and also somewhat increased variance in the
> > > >runtimes. It is most visible on this simple testcases:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >
> > > >and
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >
> > > >The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
> > > >created before each dd run on a dedicated disk.
> > > >
> > > >Without your patches I get pretty stable dd runtimes for both cases:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >Runtimes: 87.9611 87.3279 87.2554
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtimes: 93.3502 93.2086 93.541
> > > >
> > > >With your patches the numbers look like:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >Runtimes: 108.183, 97.184, 99.9587
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtimes: 104.9, 102.775, 102.892
> > > >
> > > >I have checked whether the variance is due to some interaction with CFQ
> > > >which is used for the disk. When I switched the disk to deadline, I still
> > > >get some variance although, the throughput is still ~10% lower:
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000
> > > >Runtimes: 100.417 100.643 100.866
> > > >
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtimes: 104.208 106.341 105.483
> > > >
> > > >The disk is rotational SATA drive with writeback cache, queue depth of the
> > > >disk reported in /sys/block/sdb/device/queue_depth is 1.
> > > >
> > > >So I think we still need some tweaking on the low end of the storage
> > > >spectrum so that we don't lose 10% of throughput for simple cases like
> > > >this.
> > > 
> > > Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
> > > you are seeing smaller requests, and that is why it both varies and
> > > you get lower throughput? I'll try and setup a test here similar to
> > > yours.
> > 
> > Jan, care to try the below patch? I can't fully reproduce your issue on
> > a SCSI disk limited to QD=1, but I have a feeling this might help. It's
> > a bit of a hack, but the general idea is to allow one more request to
> > build up for QD=1 devices. That eliminates wait time between one request
> > finishing, and the next being submitted.
> 
> That accidentally added a potentially stall, this one is both cleaner
> and should have that fixed.
> 
..
> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
> +	if (rwb->queue_depth == 1) {
> +		rwb->wb_max = rwb->wb_normal = 2;
> +		rwb->wb_background = 1;

This breaks the detection of too big scale_step in scale_up() where we key
of wb_max == 1 value. However even with that fixed no luck :(:

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-28  4:06         ` xiakaixu
@ 2016-04-28 18:36           ` Jens Axboe
  0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-28 18:36 UTC (permalink / raw)
  To: xiakaixu
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek, Huxinwei, miaoxie (A),
	Bintian

On 04/27/2016 10:06 PM, xiakaixu wrote:
>> diff --git a/lib/wbt.c b/lib/wbt.c
>> index 650da911f24f..322f5e04e994 100644
>> --- a/lib/wbt.c
>> +++ b/lib/wbt.c
>> @@ -98,18 +98,23 @@ void __wbt_done(struct rq_wb *rwb)
>>   	else
>>   		limit = rwb->wb_normal;
> Hi Jens,
>
> This statement 'limit = rwb->wb_normal' is executed twice, maybe once is
> enough. It is not a big deal anyway :)

I'll clean that up, thanks for noticing. No functional difference.

> Another question about this if branch:
>
>     if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> 	limit = 0;
>
> I can't follow the logic of this if branch. why set limit equal to 0
> when the device supports write back caches and there are tasks being
> limited in balance_dirty_pages(). Could you pelase give more info
> about this ?  Thanks!

Sure. So for write back caching, we have to try a bit harder to ensure 
that the device doesn't build up long internal queues with a lot of 
dirty data in the cache. So for the case where we have write back 
caching AND we don't have anyone waiting for the IO, allow the queue 
depth to drain to zero before building it back up again.

Does that make sense?

>>
>> +	inflight = atomic_dec_return(&rwb->inflight);
>> +
>>   	/*
>> -	 * Don't wake anyone up if we are above the normal limit. If
>> -	 * throttling got disabled (limit == 0) with waiters, ensure
>> -	 * that we wake them up.
>> +	 * wbt got disabled with IO in flight. Wake up any potential
>> +	 * waiters, we don't have to do more than that.
>>   	 */
>> -	inflight = atomic_dec_return(&rwb->inflight);
>> -	if (limit && inflight >= limit) {
>> -		if (!rwb->wb_max)
>> -			wake_up_all(&rwb->wait);
>> +	if (!rwb_enabled(rwb)) {
>> +		wake_up_all(&rwb->wait);
>>   		return;
>>   	}
>
> Maybe it is better that executing this if branch earlier. So we can wake up
> potential waiters in time when wbt got disabled.

The !rwb_enabled() case will only happen if someone disabled wbt while 
we had tracked IO in flight. We have to it below the 
atomic_dec_return(), so we could reorder that to be at the front. 
Ideally we just want it out-of-line instead, as it's the unexpected 
slower path.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-28 11:54         ` Jan Kara
@ 2016-04-28 18:46           ` Jens Axboe
  2016-05-03 12:17             ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-04-28 18:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 04/28/2016 05:54 AM, Jan Kara wrote:
> On Wed 27-04-16 14:59:15, Jens Axboe wrote:
>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>> On Wed, Apr 27 2016, Jens Axboe wrote:
>>>> On 04/27/2016 12:01 PM, Jan Kara wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue 26-04-16 09:55:23, Jens Axboe wrote:
>>>>>> Since the dawn of time, our background buffered writeback has sucked.
>>>>>> When we do background buffered writeback, it should have little impact
>>>>>> on foreground activity. That's the definition of background activity...
>>>>>> But for as long as I can remember, heavy buffered writers have not
>>>>>> behaved like that. For instance, if I do something like this:
>>>>>>
>>>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>>>
>>>>>> on my laptop, and then try and start chrome, it basically won't start
>>>>>> before the buffered writeback is done. Or, for server oriented
>>>>>> workloads, where installation of a big RPM (or similar) adversely
>>>>>> impacts database reads or sync writes. When that happens, I get people
>>>>>> yelling at me.
>>>>>>
>>>>>> I have posted plenty of results previously, I'll keep it shorter
>>>>>> this time. Here's a run on my laptop, using read-to-pipe-async for
>>>>>> reading a 5g file, and rewriting it. You can find this test program
>>>>>> in the fio git repo.
>>>>>
>>>>> I have tested your patchset on my test system. Generally I have observed
>>>>> noticeable drop in average throughput for heavy background writes without
>>>>> any other disk activity and also somewhat increased variance in the
>>>>> runtimes. It is most visible on this simple testcases:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>>
>>>>> and
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>>
>>>>> The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
>>>>> created before each dd run on a dedicated disk.
>>>>>
>>>>> Without your patches I get pretty stable dd runtimes for both cases:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>> Runtimes: 87.9611 87.3279 87.2554
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>> Runtimes: 93.3502 93.2086 93.541
>>>>>
>>>>> With your patches the numbers look like:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>> Runtimes: 108.183, 97.184, 99.9587
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>> Runtimes: 104.9, 102.775, 102.892
>>>>>
>>>>> I have checked whether the variance is due to some interaction with CFQ
>>>>> which is used for the disk. When I switched the disk to deadline, I still
>>>>> get some variance although, the throughput is still ~10% lower:
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000
>>>>> Runtimes: 100.417 100.643 100.866
>>>>>
>>>>> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
>>>>> Runtimes: 104.208 106.341 105.483
>>>>>
>>>>> The disk is rotational SATA drive with writeback cache, queue depth of the
>>>>> disk reported in /sys/block/sdb/device/queue_depth is 1.
>>>>>
>>>>> So I think we still need some tweaking on the low end of the storage
>>>>> spectrum so that we don't lose 10% of throughput for simple cases like
>>>>> this.
>>>>
>>>> Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
>>>> you are seeing smaller requests, and that is why it both varies and
>>>> you get lower throughput? I'll try and setup a test here similar to
>>>> yours.
>>>
>>> Jan, care to try the below patch? I can't fully reproduce your issue on
>>> a SCSI disk limited to QD=1, but I have a feeling this might help. It's
>>> a bit of a hack, but the general idea is to allow one more request to
>>> build up for QD=1 devices. That eliminates wait time between one request
>>> finishing, and the next being submitted.
>>
>> That accidentally added a potentially stall, this one is both cleaner
>> and should have that fixed.
>>
> ..
>> -	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
>> -	rwb->wb_normal = (rwb->wb_max + 1) / 2;
>> -	rwb->wb_background = (rwb->wb_max + 3) / 4;
>> +	if (rwb->queue_depth == 1) {
>> +		rwb->wb_max = rwb->wb_normal = 2;
>> +		rwb->wb_background = 1;
>
> This breaks the detection of too big scale_step in scale_up() where we key
> of wb_max == 1 value. However even with that fixed no luck :(:

Yeah, I need to look at that. For QD=1, I think the only sensible values 
for max/normal/bg is 2/2/1 and 1/1/1 if we step down.

> dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> Runtime: 105.126 107.125 105.641
>
> So about the same as before. I'll try to debug this later today...

Thanks, I'm very interested in what you find!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-28 11:05   ` Jan Kara
@ 2016-04-28 18:53     ` Jens Axboe
  2016-04-28 19:03       ` Jens Axboe
  2016-05-03  9:34       ` Jan Kara
  0 siblings, 2 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-28 18:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 04/28/2016 05:05 AM, Jan Kara wrote:
> I have some comments below...
>
>> +struct rq_wb {
>> +	/*
>> +	 * Settings that govern how we throttle
>> +	 */
>> +	unsigned int wb_background;		/* background writeback */
>> +	unsigned int wb_normal;			/* normal writeback */
>> +	unsigned int wb_max;			/* max throughput writeback */
>> +	unsigned int scale_step;
>> +
>> +	u64 win_nsec;				/* default window size */
>> +	u64 cur_win_nsec;			/* current window size */
>> +
>> +	unsigned int unknown_cnt;
>
> It would be useful to have a comment here explaining that 'unknown_cnt' is
> a number of consecutive periods in which we didn't have enough data to
> decide about queue scaling (at least this is what I understood from the
> code).

Agree, I'll add that comment.

>> +
>> +	struct timer_list window_timer;
>> +
>> +	s64 sync_issue;
>> +	void *sync_cookie;
>
> So I'm somewhat wondering: What is protecting consistency of this
> structure? The limits, scale_step, cur_win_nsec, unknown_cnt are updated only
> from timer so those should be safe. However sync_issue & sync_cookie are
> accessed from IO submission and completion path and there we need some
> protection to keep those two in sync. It seems q->queue_lock should mostly
> achieve those except for blk-mq submission path calling wbt_wait() which
> doesn't hold queue_lock.

Right, it's designed such that only the timer will be updating these 
values, and that part is serialized. For sync_issue and sync_cookie, the 
important part there is that we never dereference sync_cookie. That's 
why it's a void * now. So we just use it as a hint. And yes, if the IO 
happens to complete at just the time we are looking at it, we could get 
a false positive or false negative. That's going to be noise, and 
nothing we need to worry about. It's deliberate that I don't do any 
locking for that, the only reason we pass in the queue_lock is to be 
able to drop it for sleeping.

> It seems you were aware of the possible races and the code handles them
> mostly fine (although I wouldn't bet too much there is not some weird
> corner case). However it would be good to comment on this somewhere and
> explain what the rules for these two fields are.

Agree, it does warrant a good code comment. If we look at the edge 
cases, one would be:

We look at sync_issue and decide that we're now too late, at the same 
time as the sync_cookie gets cleared. For this case, we'll count it as 
an exceed and scale down. In reality we were late, so it doesn't matter. 
Even if it was the exact time, it's still prudent to scale down as we're 
going to miss soon.

A more worrying case would be two issues that happen at the same time, 
and only one gets set. Let's assume the one that doesn't get set is the 
one that ends up taking a long time to complete. We'll miss scaling down 
in this case, we'll only notice when it completes and shows up in the 
stats. Not idea, but it's still being handled in the fashion that was 
originally intended, at completion time.

>> diff --git a/lib/wbt.c b/lib/wbt.c
>> new file mode 100644
>> index 000000000000..650da911f24f
>> --- /dev/null
>> +++ b/lib/wbt.c
>> @@ -0,0 +1,524 @@
>> +/*
>> + * buffered writeback throttling. losely based on CoDel. We can't drop
>> + * packets for IO scheduling, so the logic is something like this:
>> + *
>> + * - Monitor latencies in a defined window of time.
>> + * - If the minimum latency in the above window exceeds some target, increment
>> + *   scaling step and scale down queue depth by a factor of 2x. The monitoring
>> + *   window is then shrunk to 100 / sqrt(scaling step + 1).
>> + * - For any window where we don't have solid data on what the latencies
>> + *   look like, retain status quo.
>> + * - If latencies look good, decrement scaling step.
>
> I'm wondering about two things:
>
> 1) There is a logic somewhat in this direction in blk_queue_start_tag().
>     Probably it should be removed after your patches land?

You're referring to the read/write separation in the legacy tagging? Yes 
agree, we can kill that once this goes in.

> 2) As far as I can see in patch 8/8, you have plugged the throttling above
>     the IO scheduler. When there are e.g. multiple cgroups with different IO
>     limits operating, this throttling can lead to strange results (like a
>     cgroup with low limit using up all available background "slots" and thus
>     effectively stopping background writeback for other cgroups)? So won't
>     it make more sense to plug this below the IO scheduler? Now I understand
>     there may be other problems with this but I think we should put more
>     though to that and provide some justification in changelogs.

One complexity is that we have to do this early for blk-mq, since once 
you get a request, you're already sitting on the hw tag. CoDel should 
actually work fine at each hop, so hopefully this will as well.

But yes, fairness is something that we have to pay attention to. Right 
now the wait queue has no priority associated with it, that should 
probably be improved to be able to wakeup in a more appropriate order.
Needs testing, but hopefully it works out since if you do run into 
starvation, then you'll go to the back of the queue for the next attempt.

>> +static void calc_wb_limits(struct rq_wb *rwb)
>> +{
>> +	unsigned int depth;
>> +
>> +	if (!rwb->min_lat_nsec) {
>> +		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
>> +		return;
>> +	}
>> +
>> +	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
>> +
>> +	/*
>> +	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
>> +	 */
>
> The comment looks a bit out of place here since we don't reduce max depth
> here. We just use whatever is set in scale_step...

True, it does get called for both scaling up and down now. I'll update 
the comment.

>> +static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
>> +{
>> +	u64 thislat;
>> +
>> +	/*
>> +	 * If our stored sync issue exceeds the window size, or it
>> +	 * exceeds our min target AND we haven't logged any entries,
>> +	 * flag the latency as exceeded.
>> +	 */
>> +	thislat = rwb_sync_issue_lat(rwb);
>> +	if (thislat > rwb->cur_win_nsec ||
>> +	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
>> +		trace_wbt_lat(rwb->bdi, thislat);
>> +		return LAT_EXCEEDED;
>> +	}
>
> So I'm trying to wrap my head around this. If I read the code right,
> rwb_sync_issue_lat() this returns time that has passed since issuing sync
> request that is still running. We basically randomly pick which sync
> request we track as we always start tracking a sync request when some is
> issued and we are not tracking any at that moment. This is to detect the
> case when latency of sync IO is very large compared to measurement window
> so we would not get enough samples to make it valid?

Right, that's pretty close. Since wbt uses the completion latencies to 
make decisions, if an IO hasn't completed, we don't know about it. If 
the device is flooded with writes, and we then issue a read, maybe that 
read won't complete for multiple monitoring windows. During that time, 
we keep thinking everything is fine. But in reality, it's not completing 
because of the write load. So this logic attempts to track the single 
sync IO request case. If that exceeds a monitoring window of time and we 
saw no other sync IO in that window, then treat that case as if it had 
completed but exceeded the min latency. And then scale back.

We'll always treat a state sample with 1 read as valuable, but for this 
case, we don't have that sample until it completes.

Does that make more sense?

> Probably the comment could explain more of "why we do this?" than pure
> "what we do".

Agree, if you find it confusing, then it needs updating. I'll update the 
comment.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-28 18:53     ` Jens Axboe
@ 2016-04-28 19:03       ` Jens Axboe
  2016-05-03  9:34       ` Jan Kara
  1 sibling, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-04-28 19:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek,
	Kaixu Xia

On 04/28/2016 12:53 PM, Jens Axboe wrote:
>
>> Probably the comment could explain more of "why we do this?" than pure
>> "what we do".
>
> Agree, if you find it confusing, then it needs updating. I'll update the
> comment.

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle

This should address your review comments, I believe.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-04-28 18:53     ` Jens Axboe
  2016-04-28 19:03       ` Jens Axboe
@ 2016-05-03  9:34       ` Jan Kara
  2016-05-03 14:23         ` Jens Axboe
  2016-05-03 15:40         ` Jan Kara
  1 sibling, 2 replies; 45+ messages in thread
From: Jan Kara @ 2016-05-03  9:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >2) As far as I can see in patch 8/8, you have plugged the throttling above
> >    the IO scheduler. When there are e.g. multiple cgroups with different IO
> >    limits operating, this throttling can lead to strange results (like a
> >    cgroup with low limit using up all available background "slots" and thus
> >    effectively stopping background writeback for other cgroups)? So won't
> >    it make more sense to plug this below the IO scheduler? Now I understand
> >    there may be other problems with this but I think we should put more
> >    though to that and provide some justification in changelogs.
> 
> One complexity is that we have to do this early for blk-mq, since once you
> get a request, you're already sitting on the hw tag. CoDel should actually
> work fine at each hop, so hopefully this will as well.

OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?

> But yes, fairness is something that we have to pay attention to. Right now
> the wait queue has no priority associated with it, that should probably be
> improved to be able to wakeup in a more appropriate order.
> Needs testing, but hopefully it works out since if you do run into
> starvation, then you'll go to the back of the queue for the next attempt.

Yeah, once I'll hunt down that regression with old disk, I can have a look
into how writeback throttling plays together with blkio-controller.

> >>+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> >>+{
> >>+	u64 thislat;
> >>+
> >>+	/*
> >>+	 * If our stored sync issue exceeds the window size, or it
> >>+	 * exceeds our min target AND we haven't logged any entries,
> >>+	 * flag the latency as exceeded.
> >>+	 */
> >>+	thislat = rwb_sync_issue_lat(rwb);
> >>+	if (thislat > rwb->cur_win_nsec ||
> >>+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> >>+		trace_wbt_lat(rwb->bdi, thislat);
> >>+		return LAT_EXCEEDED;
> >>+	}
> >
> >So I'm trying to wrap my head around this. If I read the code right,
> >rwb_sync_issue_lat() this returns time that has passed since issuing sync
> >request that is still running. We basically randomly pick which sync
> >request we track as we always start tracking a sync request when some is
> >issued and we are not tracking any at that moment. This is to detect the
> >case when latency of sync IO is very large compared to measurement window
> >so we would not get enough samples to make it valid?
> 
> Right, that's pretty close. Since wbt uses the completion latencies to make
> decisions, if an IO hasn't completed, we don't know about it. If the device
> is flooded with writes, and we then issue a read, maybe that read won't
> complete for multiple monitoring windows. During that time, we keep thinking
> everything is fine. But in reality, it's not completing because of the write
> load. So this logic attempts to track the single sync IO request case. If
> that exceeds a monitoring window of time and we saw no other sync IO in that
> window, then treat that case as if it had completed but exceeded the min
> latency. And then scale back.
> 
> We'll always treat a state sample with 1 read as valuable, but for this
> case, we don't have that sample until it completes.
> 
> Does that make more sense?

OK, makes sense. Thanks for explanation.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-04-28 18:46           ` Jens Axboe
@ 2016-05-03 12:17             ` Jan Kara
  2016-05-03 12:40               ` Chris Mason
  2016-05-11 16:36               ` Jan Kara
  0 siblings, 2 replies; 45+ messages in thread
From: Jan Kara @ 2016-05-03 12:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> >>+	if (rwb->queue_depth == 1) {
> >>+		rwb->wb_max = rwb->wb_normal = 2;
> >>+		rwb->wb_background = 1;
> >
> >This breaks the detection of too big scale_step in scale_up() where we key
> >of wb_max == 1 value. However even with that fixed no luck :(:
> 
> Yeah, I need to look at that. For QD=1, I think the only sensible values for
> max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> 
> >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> >Runtime: 105.126 107.125 105.641
> >
> >So about the same as before. I'll try to debug this later today...
> 
> Thanks, I'm very interested in what you find!

OK, so the reason was relatively standard in the end. I was using ext3 (or
more exactly ext4 without delayed allocation) for the test. The throttling
of background writes gave more priority to writes from the journalling
thread which happen with WRITE_SYNC and thus are not throttled. Thus the
journalling thread ended up having to do more data writeback to be able to
commit a transaction (due to requirements of data=ordered mode) and it is
less efficient at that than the normal flusher thread.

So this is an example where throttling background writeback effectively
just pushes more work into another context which does it less efficiently
and indirectly makes everyone wait for it. ext3 has been always sensitive to
issues like this. ext4 is using delayed allocation and thus only data
writes into holes end up being part of a transaction -> simple dd test case
doesn't hit that path. And indeed when I repeat the same test with ext4,
the numbers with and without your patch are exactly the same.

The question remains how common a pattern where throttling of background
writeback delays also something else is. I'll schedule a couple of
benchmarks to measure impact of your patches for a wider range of workloads
(but sadly pretty limited set of hw). If ext3 is the only one seeing
issues, I would be willing to accept that ext3 takes the hit since it is
doing something rather stupid (but inherent in its journal design) and we
have a way to deal with this either by enabling delayed allocation or by
turning off the writeback throttling...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-03 12:17             ` Jan Kara
@ 2016-05-03 12:40               ` Chris Mason
  2016-05-03 13:06                 ` Jan Kara
  2016-05-11 16:36               ` Jan Kara
  1 sibling, 1 reply; 45+ messages in thread
From: Chris Mason @ 2016-05-03 12:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > >>+	if (rwb->queue_depth == 1) {
> > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > >>+		rwb->wb_background = 1;
> > >
> > >This breaks the detection of too big scale_step in scale_up() where we key
> > >of wb_max == 1 value. However even with that fixed no luck :(:
> > 
> > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > 
> > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > >Runtime: 105.126 107.125 105.641
> > >
> > >So about the same as before. I'll try to debug this later today...
> > 
> > Thanks, I'm very interested in what you find!
> 
> OK, so the reason was relatively standard in the end. I was using ext3 (or
> more exactly ext4 without delayed allocation) for the test. The throttling
> of background writes gave more priority to writes from the journalling
> thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> journalling thread ended up having to do more data writeback to be able to
> commit a transaction (due to requirements of data=ordered mode) and it is
> less efficient at that than the normal flusher thread.
> 
> So this is an example where throttling background writeback effectively
> just pushes more work into another context which does it less efficiently
> and indirectly makes everyone wait for it. ext3 has been always sensitive to
> issues like this. ext4 is using delayed allocation and thus only data
> writes into holes end up being part of a transaction -> simple dd test case
> doesn't hit that path. And indeed when I repeat the same test with ext4,
> the numbers with and without your patch are exactly the same.
> 
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

At least in the case of io that we know is going to be data=ordered, we
can bump the prio of those pages?

-chris

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-03 12:40               ` Chris Mason
@ 2016-05-03 13:06                 ` Jan Kara
  2016-05-03 13:42                   ` Chris Mason
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-05-03 13:06 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jan Kara, Jens Axboe, linux-kernel, linux-fsdevel, linux-block,
	dchinner, sedat.dilek

On Tue 03-05-16 08:40:11, Chris Mason wrote:
> On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > >>+	if (rwb->queue_depth == 1) {
> > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > >>+		rwb->wb_background = 1;
> > > >
> > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > 
> > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > 
> > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > >Runtime: 105.126 107.125 105.641
> > > >
> > > >So about the same as before. I'll try to debug this later today...
> > > 
> > > Thanks, I'm very interested in what you find!
> > 
> > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > more exactly ext4 without delayed allocation) for the test. The throttling
> > of background writes gave more priority to writes from the journalling
> > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > journalling thread ended up having to do more data writeback to be able to
> > commit a transaction (due to requirements of data=ordered mode) and it is
> > less efficient at that than the normal flusher thread.
> > 
> > So this is an example where throttling background writeback effectively
> > just pushes more work into another context which does it less efficiently
> > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > issues like this. ext4 is using delayed allocation and thus only data
> > writes into holes end up being part of a transaction -> simple dd test case
> > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > the numbers with and without your patch are exactly the same.
> > 
> > The question remains how common a pattern where throttling of background
> > writeback delays also something else is. I'll schedule a couple of
> > benchmarks to measure impact of your patches for a wider range of workloads
> > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > issues, I would be willing to accept that ext3 takes the hit since it is
> > doing something rather stupid (but inherent in its journal design) and we
> > have a way to deal with this either by enabling delayed allocation or by
> > turning off the writeback throttling...
> 
> At least in the case of io that we know is going to be data=ordered, we
> can bump the prio of those pages?

But how would flusher thread, which is submitting IO, know that? We would
have to somehow mark inodes that are part of the running transaction and
flusher thread could give more priority to such writeback - e.g. by using
WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
it could be doable.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-03 13:06                 ` Jan Kara
@ 2016-05-03 13:42                   ` Chris Mason
  2016-05-03 13:57                     ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Chris Mason @ 2016-05-03 13:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > >>+	if (rwb->queue_depth == 1) {
> > > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > > >>+		rwb->wb_background = 1;
> > > > >
> > > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > > 
> > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > > 
> > > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > > >Runtime: 105.126 107.125 105.641
> > > > >
> > > > >So about the same as before. I'll try to debug this later today...
> > > > 
> > > > Thanks, I'm very interested in what you find!
> > > 
> > > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > > more exactly ext4 without delayed allocation) for the test. The throttling
> > > of background writes gave more priority to writes from the journalling
> > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > journalling thread ended up having to do more data writeback to be able to
> > > commit a transaction (due to requirements of data=ordered mode) and it is
> > > less efficient at that than the normal flusher thread.
> > > 
> > > So this is an example where throttling background writeback effectively
> > > just pushes more work into another context which does it less efficiently
> > > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > > issues like this. ext4 is using delayed allocation and thus only data
> > > writes into holes end up being part of a transaction -> simple dd test case
> > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > the numbers with and without your patch are exactly the same.
> > > 
> > > The question remains how common a pattern where throttling of background
> > > writeback delays also something else is. I'll schedule a couple of
> > > benchmarks to measure impact of your patches for a wider range of workloads
> > > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > doing something rather stupid (but inherent in its journal design) and we
> > > have a way to deal with this either by enabling delayed allocation or by
> > > turning off the writeback throttling...
> > 
> > At least in the case of io that we know is going to be data=ordered, we
> > can bump the prio of those pages?
> 
> But how would flusher thread, which is submitting IO, know that? We would
> have to somehow mark inodes that are part of the running transaction and
> flusher thread could give more priority to such writeback - e.g. by using
> WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> it could be doable.

This would be specific to the data=ordered code in the FS.  If there's
some way to test for an inode or a page's status in the data=ordered
list, the FS writepages call could flag the IO as higher prio?

-chris

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-03 13:42                   ` Chris Mason
@ 2016-05-03 13:57                     ` Jan Kara
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2016-05-03 13:57 UTC (permalink / raw)
  To: Chris Mason
  Cc: Jan Kara, Jens Axboe, linux-kernel, linux-fsdevel, linux-block,
	dchinner, sedat.dilek

On Tue 03-05-16 09:42:40, Chris Mason wrote:
> On Tue, May 03, 2016 at 03:06:09PM +0200, Jan Kara wrote:
> > On Tue 03-05-16 08:40:11, Chris Mason wrote:
> > > On Tue, May 03, 2016 at 02:17:19PM +0200, Jan Kara wrote:
> > > > On Thu 28-04-16 12:46:41, Jens Axboe wrote:
> > > > > >>-	rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
> > > > > >>-	rwb->wb_normal = (rwb->wb_max + 1) / 2;
> > > > > >>-	rwb->wb_background = (rwb->wb_max + 3) / 4;
> > > > > >>+	if (rwb->queue_depth == 1) {
> > > > > >>+		rwb->wb_max = rwb->wb_normal = 2;
> > > > > >>+		rwb->wb_background = 1;
> > > > > >
> > > > > >This breaks the detection of too big scale_step in scale_up() where we key
> > > > > >of wb_max == 1 value. However even with that fixed no luck :(:
> > > > > 
> > > > > Yeah, I need to look at that. For QD=1, I think the only sensible values for
> > > > > max/normal/bg is 2/2/1 and 1/1/1 if we step down.
> > > > > 
> > > > > >dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
> > > > > >Runtime: 105.126 107.125 105.641
> > > > > >
> > > > > >So about the same as before. I'll try to debug this later today...
> > > > > 
> > > > > Thanks, I'm very interested in what you find!
> > > > 
> > > > OK, so the reason was relatively standard in the end. I was using ext3 (or
> > > > more exactly ext4 without delayed allocation) for the test. The throttling
> > > > of background writes gave more priority to writes from the journalling
> > > > thread which happen with WRITE_SYNC and thus are not throttled. Thus the
> > > > journalling thread ended up having to do more data writeback to be able to
> > > > commit a transaction (due to requirements of data=ordered mode) and it is
> > > > less efficient at that than the normal flusher thread.
> > > > 
> > > > So this is an example where throttling background writeback effectively
> > > > just pushes more work into another context which does it less efficiently
> > > > and indirectly makes everyone wait for it. ext3 has been always sensitive to
> > > > issues like this. ext4 is using delayed allocation and thus only data
> > > > writes into holes end up being part of a transaction -> simple dd test case
> > > > doesn't hit that path. And indeed when I repeat the same test with ext4,
> > > > the numbers with and without your patch are exactly the same.
> > > > 
> > > > The question remains how common a pattern where throttling of background
> > > > writeback delays also something else is. I'll schedule a couple of
> > > > benchmarks to measure impact of your patches for a wider range of workloads
> > > > (but sadly pretty limited set of hw). If ext3 is the only one seeing
> > > > issues, I would be willing to accept that ext3 takes the hit since it is
> > > > doing something rather stupid (but inherent in its journal design) and we
> > > > have a way to deal with this either by enabling delayed allocation or by
> > > > turning off the writeback throttling...
> > > 
> > > At least in the case of io that we know is going to be data=ordered, we
> > > can bump the prio of those pages?
> > 
> > But how would flusher thread, which is submitting IO, know that? We would
> > have to somehow mark inodes that are part of the running transaction and
> > flusher thread could give more priority to such writeback - e.g. by using
> > WRITE_SYNC or at least plain writes. Hmm, if we use an inode flag for that,
> > it could be doable.
> 
> This would be specific to the data=ordered code in the FS.  If there's
> some way to test for an inode or a page's status in the data=ordered
> list, the FS writepages call could flag the IO as higher prio?

Oh, right, we could do that. I can experiment with that later.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03  9:34       ` Jan Kara
@ 2016-05-03 14:23         ` Jens Axboe
  2016-05-03 15:22           ` Jan Kara
  2016-05-03 15:40         ` Jan Kara
  1 sibling, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-05-03 14:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 05/03/2016 03:34 AM, Jan Kara wrote:
> On Thu 28-04-16 12:53:50, Jens Axboe wrote:
>>> 2) As far as I can see in patch 8/8, you have plugged the throttling above
>>>     the IO scheduler. When there are e.g. multiple cgroups with different IO
>>>     limits operating, this throttling can lead to strange results (like a
>>>     cgroup with low limit using up all available background "slots" and thus
>>>     effectively stopping background writeback for other cgroups)? So won't
>>>     it make more sense to plug this below the IO scheduler? Now I understand
>>>     there may be other problems with this but I think we should put more
>>>     though to that and provide some justification in changelogs.
>>
>> One complexity is that we have to do this early for blk-mq, since once you
>> get a request, you're already sitting on the hw tag. CoDel should actually
>> work fine at each hop, so hopefully this will as well.
>
> OK, I see. But then this suggests that any IO scheduling and / or
> cgroup-related throttling should happen before we get a request for blk-mq
> as well? And then we can still do writeback throttling below that layer?

Not necessarily. For IO scheduling, basically we care about two parts:

1) Are you allowed to allocate the resources to queue some IO
2) Are you allowed to dispatch

The latter part can still be handled independently, and the former as 
well of course, wbt just deals with throttling back #1 for buffered writes.

>> But yes, fairness is something that we have to pay attention to. Right now
>> the wait queue has no priority associated with it, that should probably be
>> improved to be able to wakeup in a more appropriate order.
>> Needs testing, but hopefully it works out since if you do run into
>> starvation, then you'll go to the back of the queue for the next attempt.
>
> Yeah, once I'll hunt down that regression with old disk, I can have a look
> into how writeback throttling plays together with blkio-controller.

Thanks!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03 14:23         ` Jens Axboe
@ 2016-05-03 15:22           ` Jan Kara
  2016-05-03 15:32             ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-05-03 15:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Tue 03-05-16 08:23:27, Jens Axboe wrote:
> On 05/03/2016 03:34 AM, Jan Kara wrote:
> >On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >>>2) As far as I can see in patch 8/8, you have plugged the throttling above
> >>>    the IO scheduler. When there are e.g. multiple cgroups with different IO
> >>>    limits operating, this throttling can lead to strange results (like a
> >>>    cgroup with low limit using up all available background "slots" and thus
> >>>    effectively stopping background writeback for other cgroups)? So won't
> >>>    it make more sense to plug this below the IO scheduler? Now I understand
> >>>    there may be other problems with this but I think we should put more
> >>>    though to that and provide some justification in changelogs.
> >>
> >>One complexity is that we have to do this early for blk-mq, since once you
> >>get a request, you're already sitting on the hw tag. CoDel should actually
> >>work fine at each hop, so hopefully this will as well.
> >
> >OK, I see. But then this suggests that any IO scheduling and / or
> >cgroup-related throttling should happen before we get a request for blk-mq
> >as well? And then we can still do writeback throttling below that layer?
> 
> Not necessarily. For IO scheduling, basically we care about two parts:
> 
> 1) Are you allowed to allocate the resources to queue some IO
> 2) Are you allowed to dispatch

But then it seems suboptimal to waste a relatively scarce resource (which
HW tag is AFAIU) just because you happen to run from a cgroup that is
bandwidth limited and thus are not allowed to dispatch?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03 15:22           ` Jan Kara
@ 2016-05-03 15:32             ` Jens Axboe
  0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-05-03 15:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 05/03/2016 09:22 AM, Jan Kara wrote:
> On Tue 03-05-16 08:23:27, Jens Axboe wrote:
>> On 05/03/2016 03:34 AM, Jan Kara wrote:
>>> On Thu 28-04-16 12:53:50, Jens Axboe wrote:
>>>>> 2) As far as I can see in patch 8/8, you have plugged the throttling above
>>>>>     the IO scheduler. When there are e.g. multiple cgroups with different IO
>>>>>     limits operating, this throttling can lead to strange results (like a
>>>>>     cgroup with low limit using up all available background "slots" and thus
>>>>>     effectively stopping background writeback for other cgroups)? So won't
>>>>>     it make more sense to plug this below the IO scheduler? Now I understand
>>>>>     there may be other problems with this but I think we should put more
>>>>>     though to that and provide some justification in changelogs.
>>>>
>>>> One complexity is that we have to do this early for blk-mq, since once you
>>>> get a request, you're already sitting on the hw tag. CoDel should actually
>>>> work fine at each hop, so hopefully this will as well.
>>>
>>> OK, I see. But then this suggests that any IO scheduling and / or
>>> cgroup-related throttling should happen before we get a request for blk-mq
>>> as well? And then we can still do writeback throttling below that layer?
>>
>> Not necessarily. For IO scheduling, basically we care about two parts:
>>
>> 1) Are you allowed to allocate the resources to queue some IO
>> 2) Are you allowed to dispatch
>
> But then it seems suboptimal to waste a relatively scarce resource (which
> HW tag is AFAIU) just because you happen to run from a cgroup that is
> bandwidth limited and thus are not allowed to dispatch?

For some cases, you are absolutely right, and #1 is the main one. For 
your case of QD=1, that's obviously the case. For SATA, it's a bit more 
grey zone, and for others (nvme, scsi, etc), it's not really a scarce 
resource so #2 is the bigger part of it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03  9:34       ` Jan Kara
  2016-05-03 14:23         ` Jens Axboe
@ 2016-05-03 15:40         ` Jan Kara
  2016-05-03 15:48           ` Jan Kara
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-05-03 15:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Tue 03-05-16 11:34:10, Jan Kara wrote:
> Yeah, once I'll hunt down that regression with old disk, I can have a look
> into how writeback throttling plays together with blkio-controller.

So I've tried the following script (note that you need cgroup v2 for
writeback IO to be throttled):

---
mkdir /sys/fs/cgroup/group1
echo 1000 >/sys/fs/cgroup/group1/io.weight
dd if=/dev/zero of=/mnt/file1 bs=1M count=10000&
DD1=$!
echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs

mkdir /sys/fs/cgroup/group2
echo 100 >/sys/fs/cgroup/group2/io.weight
#echo "259:65536 wbps=5000000" >/sys/fs/cgroup/group2/io.max
echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
dd if=/dev/zero of=/mnt/file2 bs=1M count=10000&
DD2=$!
echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs

while true; do
        sleep 1
        kill -USR1 $DD1
        kill -USR1 $DD2
        echo  '======================================================='
done
---

and watched the progress of the dd processes in different cgroups. The 1/10
weight difference has no effect with your writeback patches - the situation
after one minute:

3120+1 records in
3120+1 records out
3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
3217+1 records in
3217+1 records out
3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s

I should add that even without your patches the progress doesn't quite
correspond to the weight ratio:
...

but still there is noticeable difference to cgroups with different weights.

OTOH blk-throttle combines well with your patches: Limiting one cgroup to
5 M/s results in numbers like:

3883+2 records in
3883+2 records out
4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
413+0 records in
413+0 records out
433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s

which is fine and comparable with unpatched kernel. Higher throughput
number is because we do buffered writes and dd reports what it wrote into
page cache. And there is no wonder blk-throttle combines fine - it
throttles bios which happens before we reach writeback throttling
mechanism.

So I belive this demonstrates that your writeback throttling just doesn't
work well with selective scheduling policy that happens below it because it
can essentially lead to IO priority inversion issues...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03 15:40         ` Jan Kara
@ 2016-05-03 15:48           ` Jan Kara
  2016-05-03 16:59             ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-05-03 15:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Tue 03-05-16 17:40:32, Jan Kara wrote:
> On Tue 03-05-16 11:34:10, Jan Kara wrote:
> > Yeah, once I'll hunt down that regression with old disk, I can have a look
> > into how writeback throttling plays together with blkio-controller.
> 
> So I've tried the following script (note that you need cgroup v2 for
> writeback IO to be throttled):
> 
> ---
> mkdir /sys/fs/cgroup/group1
> echo 1000 >/sys/fs/cgroup/group1/io.weight
> dd if=/dev/zero of=/mnt/file1 bs=1M count=10000&
> DD1=$!
> echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs
> 
> mkdir /sys/fs/cgroup/group2
> echo 100 >/sys/fs/cgroup/group2/io.weight
> #echo "259:65536 wbps=5000000" >/sys/fs/cgroup/group2/io.max
> echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
> dd if=/dev/zero of=/mnt/file2 bs=1M count=10000&
> DD2=$!
> echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs
> 
> while true; do
>         sleep 1
>         kill -USR1 $DD1
>         kill -USR1 $DD2
>         echo  '======================================================='
> done
> ---
> 
> and watched the progress of the dd processes in different cgroups. The 1/10
> weight difference has no effect with your writeback patches - the situation
> after one minute:
> 
> 3120+1 records in
> 3120+1 records out
> 3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
> 3217+1 records in
> 3217+1 records out
> 3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s
> 
> I should add that even without your patches the progress doesn't quite
> correspond to the weight ratio:

Forgot to fill in corresponding data for unpatched kernel here:

5962+2 records in
5962+2 records out
6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
1502+0 records in
1502+0 records out
1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s

> but still there is noticeable difference to cgroups with different weights.
> 
> OTOH blk-throttle combines well with your patches: Limiting one cgroup to
> 5 M/s results in numbers like:
> 
> 3883+2 records in
> 3883+2 records out
> 4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
> 413+0 records in
> 413+0 records out
> 433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s
> 
> which is fine and comparable with unpatched kernel. Higher throughput
> number is because we do buffered writes and dd reports what it wrote into
> page cache. And there is no wonder blk-throttle combines fine - it
> throttles bios which happens before we reach writeback throttling
> mechanism.
> 
> So I belive this demonstrates that your writeback throttling just doesn't
> work well with selective scheduling policy that happens below it because it
> can essentially lead to IO priority inversion issues...
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03 15:48           ` Jan Kara
@ 2016-05-03 16:59             ` Jens Axboe
  2016-05-03 18:14               ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-05-03 16:59 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 05/03/2016 09:48 AM, Jan Kara wrote:
> On Tue 03-05-16 17:40:32, Jan Kara wrote:
>> On Tue 03-05-16 11:34:10, Jan Kara wrote:
>>> Yeah, once I'll hunt down that regression with old disk, I can have a look
>>> into how writeback throttling plays together with blkio-controller.
>>
>> So I've tried the following script (note that you need cgroup v2 for
>> writeback IO to be throttled):
>>
>> ---
>> mkdir /sys/fs/cgroup/group1
>> echo 1000 >/sys/fs/cgroup/group1/io.weight
>> dd if=/dev/zero of=/mnt/file1 bs=1M count=10000&
>> DD1=$!
>> echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs
>>
>> mkdir /sys/fs/cgroup/group2
>> echo 100 >/sys/fs/cgroup/group2/io.weight
>> #echo "259:65536 wbps=5000000" >/sys/fs/cgroup/group2/io.max
>> echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
>> dd if=/dev/zero of=/mnt/file2 bs=1M count=10000&
>> DD2=$!
>> echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs
>>
>> while true; do
>>          sleep 1
>>          kill -USR1 $DD1
>>          kill -USR1 $DD2
>>          echo  '======================================================='
>> done
>> ---
>>
>> and watched the progress of the dd processes in different cgroups. The 1/10
>> weight difference has no effect with your writeback patches - the situation
>> after one minute:
>>
>> 3120+1 records in
>> 3120+1 records out
>> 3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
>> 3217+1 records in
>> 3217+1 records out
>> 3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s
>>
>> I should add that even without your patches the progress doesn't quite
>> correspond to the weight ratio:
>
> Forgot to fill in corresponding data for unpatched kernel here:
>
> 5962+2 records in
> 5962+2 records out
> 6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
> 1502+0 records in
> 1502+0 records out
> 1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s

Thanks for testing this, I'll see what we can do about that. It stands 
to reason that we'll throttle a heavier writer more, statistically. But 
I'm assuming this above test was run basically with just the writes 
going, so no real competition? And hence we end up throttling them 
equally much, destroying the weighting in the process. But for both 
cases, we basically don't pay any attention to cgroup weights.

>> but still there is noticeable difference to cgroups with different weights.
>>
>> OTOH blk-throttle combines well with your patches: Limiting one cgroup to
>> 5 M/s results in numbers like:
>>
>> 3883+2 records in
>> 3883+2 records out
>> 4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
>> 413+0 records in
>> 413+0 records out
>> 433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s
>>
>> which is fine and comparable with unpatched kernel. Higher throughput
>> number is because we do buffered writes and dd reports what it wrote into
>> page cache. And there is no wonder blk-throttle combines fine - it
>> throttles bios which happens before we reach writeback throttling
>> mechanism.

OK, that's good, at least that part works fine. And yes, the throttle 
path is hit before we end up in the make_request_fn, which is where wbt 
drops in.

>> So I belive this demonstrates that your writeback throttling just doesn't
>> work well with selective scheduling policy that happens below it because it
>> can essentially lead to IO priority inversion issues...

It this testing still done on the QD=1 ATA disk? Not too surprising that 
this falls apart, since we have very little room to maneuver. I wonder 
if a normal SATA with NCQ would behave better in this regard. I'll have 
to test a bit and think about how we can best handle this case.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03 16:59             ` Jens Axboe
@ 2016-05-03 18:14               ` Jens Axboe
  2016-05-03 19:07                 ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-05-03 18:14 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 05/03/2016 10:59 AM, Jens Axboe wrote:
> On 05/03/2016 09:48 AM, Jan Kara wrote:
>> On Tue 03-05-16 17:40:32, Jan Kara wrote:
>>> On Tue 03-05-16 11:34:10, Jan Kara wrote:
>>>> Yeah, once I'll hunt down that regression with old disk, I can have
>>>> a look
>>>> into how writeback throttling plays together with blkio-controller.
>>>
>>> So I've tried the following script (note that you need cgroup v2 for
>>> writeback IO to be throttled):
>>>
>>> ---
>>> mkdir /sys/fs/cgroup/group1
>>> echo 1000 >/sys/fs/cgroup/group1/io.weight
>>> dd if=/dev/zero of=/mnt/file1 bs=1M count=10000&
>>> DD1=$!
>>> echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs
>>>
>>> mkdir /sys/fs/cgroup/group2
>>> echo 100 >/sys/fs/cgroup/group2/io.weight
>>> #echo "259:65536 wbps=5000000" >/sys/fs/cgroup/group2/io.max
>>> echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
>>> dd if=/dev/zero of=/mnt/file2 bs=1M count=10000&
>>> DD2=$!
>>> echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs
>>>
>>> while true; do
>>>          sleep 1
>>>          kill -USR1 $DD1
>>>          kill -USR1 $DD2
>>>          echo  '======================================================='
>>> done
>>> ---
>>>
>>> and watched the progress of the dd processes in different cgroups.
>>> The 1/10
>>> weight difference has no effect with your writeback patches - the
>>> situation
>>> after one minute:
>>>
>>> 3120+1 records in
>>> 3120+1 records out
>>> 3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
>>> 3217+1 records in
>>> 3217+1 records out
>>> 3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s
>>>
>>> I should add that even without your patches the progress doesn't quite
>>> correspond to the weight ratio:
>>
>> Forgot to fill in corresponding data for unpatched kernel here:
>>
>> 5962+2 records in
>> 5962+2 records out
>> 6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
>> 1502+0 records in
>> 1502+0 records out
>> 1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s
>
> Thanks for testing this, I'll see what we can do about that. It stands
> to reason that we'll throttle a heavier writer more, statistically. But
> I'm assuming this above test was run basically with just the writes
> going, so no real competition? And hence we end up throttling them
> equally much, destroying the weighting in the process. But for both
> cases, we basically don't pay any attention to cgroup weights.
>
>>> but still there is noticeable difference to cgroups with different
>>> weights.
>>>
>>> OTOH blk-throttle combines well with your patches: Limiting one
>>> cgroup to
>>> 5 M/s results in numbers like:
>>>
>>> 3883+2 records in
>>> 3883+2 records out
>>> 4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
>>> 413+0 records in
>>> 413+0 records out
>>> 433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s
>>>
>>> which is fine and comparable with unpatched kernel. Higher throughput
>>> number is because we do buffered writes and dd reports what it wrote
>>> into
>>> page cache. And there is no wonder blk-throttle combines fine - it
>>> throttles bios which happens before we reach writeback throttling
>>> mechanism.
>
> OK, that's good, at least that part works fine. And yes, the throttle
> path is hit before we end up in the make_request_fn, which is where wbt
> drops in.
>
>>> So I belive this demonstrates that your writeback throttling just
>>> doesn't
>>> work well with selective scheduling policy that happens below it
>>> because it
>>> can essentially lead to IO priority inversion issues...
>
> It this testing still done on the QD=1 ATA disk? Not too surprising that
> this falls apart, since we have very little room to maneuver. I wonder
> if a normal SATA with NCQ would behave better in this regard. I'll have
> to test a bit and think about how we can best handle this case.

I think what we'll do for now is just disable wbt IFF we have a non-root 
cgroup attached to CFQ. Done here:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=7315756efe76bbdf83076fc9dbc569bbb4da5d32

We don't have a strong need for wbt (supposedly) since CFQ should take 
care of most of it, if you have policies set for proportional sharing.

Longer term it's not a concern either, as we'll move away from that 
model anyway.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-05-03 18:14               ` Jens Axboe
@ 2016-05-03 19:07                 ` Jens Axboe
  0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-05-03 19:07 UTC (permalink / raw)
  To: Jan Kara, Jens Axboe
  Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 05/03/2016 12:14 PM, Jens Axboe wrote:
> On 05/03/2016 10:59 AM, Jens Axboe wrote:
>> On 05/03/2016 09:48 AM, Jan Kara wrote:
>>> On Tue 03-05-16 17:40:32, Jan Kara wrote:
>>>> On Tue 03-05-16 11:34:10, Jan Kara wrote:
>>>>> Yeah, once I'll hunt down that regression with old disk, I can have
>>>>> a look
>>>>> into how writeback throttling plays together with blkio-controller.
>>>>
>>>> So I've tried the following script (note that you need cgroup v2 for
>>>> writeback IO to be throttled):
>>>>
>>>> ---
>>>> mkdir /sys/fs/cgroup/group1
>>>> echo 1000 >/sys/fs/cgroup/group1/io.weight
>>>> dd if=/dev/zero of=/mnt/file1 bs=1M count=10000&
>>>> DD1=$!
>>>> echo $DD1 >/sys/fs/cgroup/group1/cgroup.procs
>>>>
>>>> mkdir /sys/fs/cgroup/group2
>>>> echo 100 >/sys/fs/cgroup/group2/io.weight
>>>> #echo "259:65536 wbps=5000000" >/sys/fs/cgroup/group2/io.max
>>>> echo "259:65536 wbps=max" >/sys/fs/cgroup/group2/io.max
>>>> dd if=/dev/zero of=/mnt/file2 bs=1M count=10000&
>>>> DD2=$!
>>>> echo $DD2 >/sys/fs/cgroup/group2/cgroup.procs
>>>>
>>>> while true; do
>>>>          sleep 1
>>>>          kill -USR1 $DD1
>>>>          kill -USR1 $DD2
>>>>          echo
>>>> '======================================================='
>>>> done
>>>> ---
>>>>
>>>> and watched the progress of the dd processes in different cgroups.
>>>> The 1/10
>>>> weight difference has no effect with your writeback patches - the
>>>> situation
>>>> after one minute:
>>>>
>>>> 3120+1 records in
>>>> 3120+1 records out
>>>> 3272392704 bytes (3.3 GB) copied, 63.7119 s, 51.4 MB/s
>>>> 3217+1 records in
>>>> 3217+1 records out
>>>> 3374010368 bytes (3.4 GB) copied, 63.5819 s, 53.1 MB/s
>>>>
>>>> I should add that even without your patches the progress doesn't quite
>>>> correspond to the weight ratio:
>>>
>>> Forgot to fill in corresponding data for unpatched kernel here:
>>>
>>> 5962+2 records in
>>> 5962+2 records out
>>> 6252281856 bytes (6.3 GB) copied, 64.1719 s, 97.4 MB/s
>>> 1502+0 records in
>>> 1502+0 records out
>>> 1574961152 bytes (1.6 GB) copied, 64.207 s, 24.5 MB/s
>>
>> Thanks for testing this, I'll see what we can do about that. It stands
>> to reason that we'll throttle a heavier writer more, statistically. But
>> I'm assuming this above test was run basically with just the writes
>> going, so no real competition? And hence we end up throttling them
>> equally much, destroying the weighting in the process. But for both
>> cases, we basically don't pay any attention to cgroup weights.
>>
>>>> but still there is noticeable difference to cgroups with different
>>>> weights.
>>>>
>>>> OTOH blk-throttle combines well with your patches: Limiting one
>>>> cgroup to
>>>> 5 M/s results in numbers like:
>>>>
>>>> 3883+2 records in
>>>> 3883+2 records out
>>>> 4072091648 bytes (4.1 GB) copied, 36.6713 s, 111 MB/s
>>>> 413+0 records in
>>>> 413+0 records out
>>>> 433061888 bytes (433 MB) copied, 36.8939 s, 11.7 MB/s
>>>>
>>>> which is fine and comparable with unpatched kernel. Higher throughput
>>>> number is because we do buffered writes and dd reports what it wrote
>>>> into
>>>> page cache. And there is no wonder blk-throttle combines fine - it
>>>> throttles bios which happens before we reach writeback throttling
>>>> mechanism.
>>
>> OK, that's good, at least that part works fine. And yes, the throttle
>> path is hit before we end up in the make_request_fn, which is where wbt
>> drops in.
>>
>>>> So I belive this demonstrates that your writeback throttling just
>>>> doesn't
>>>> work well with selective scheduling policy that happens below it
>>>> because it
>>>> can essentially lead to IO priority inversion issues...
>>
>> It this testing still done on the QD=1 ATA disk? Not too surprising that
>> this falls apart, since we have very little room to maneuver. I wonder
>> if a normal SATA with NCQ would behave better in this regard. I'll have
>> to test a bit and think about how we can best handle this case.
>
> I think what we'll do for now is just disable wbt IFF we have a non-root
> cgroup attached to CFQ. Done here:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=7315756efe76bbdf83076fc9dbc569bbb4da5d32

That was a bit too untested.. This should be better, it taps into where 
cfq normally notices a difference in blkcg:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=9b89e1bb666bd036a4cb1313479435087fb86ba0


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 6/8] block: add scalable completion tracking of requests
  2016-04-26 15:55 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
@ 2016-05-05  7:52   ` Ming Lei
  0 siblings, 0 replies; 45+ messages in thread
From: Ming Lei @ 2016-05-05  7:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux Kernel Mailing List, Linux FS Devel, linux-block, Jan Kara,
	dchinner, sedat.dilek

On Tue, Apr 26, 2016 at 11:55 PM, Jens Axboe <axboe@fb.com> wrote:
> For legacy block, we simply track them in the request queue. For
> blk-mq, we track them on a per-sw queue basis, which we can then
> sum up through the hardware queues and finally to a per device
> state.
>
> The stats are tracked in, roughly, 0.1s interval windows.
>
> Add sysfs files to display the stats.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  block/Makefile            |   2 +-
>  block/blk-core.c          |   4 +
>  block/blk-mq-sysfs.c      |  47 ++++++++++++
>  block/blk-mq.c            |  14 ++++
>  block/blk-mq.h            |   3 +
>  block/blk-stat.c          | 184 ++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-stat.h          |  17 +++++
>  block/blk-sysfs.c         |  26 +++++++
>  include/linux/blk_types.h |   8 ++
>  include/linux/blkdev.h    |   4 +
>  10 files changed, 308 insertions(+), 1 deletion(-)
>  create mode 100644 block/blk-stat.c
>  create mode 100644 block/blk-stat.h
>
> diff --git a/block/Makefile b/block/Makefile
> index 9eda2322b2d4..3446e0472df0 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -5,7 +5,7 @@
>  obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
>                         blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
>                         blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
> -                       blk-lib.o blk-mq.o blk-mq-tag.o \
> +                       blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
>                         blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
>                         genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
>                         badblocks.o partitions/
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 74c16fd8995d..40b57bf4852c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2514,6 +2514,8 @@ void blk_start_request(struct request *req)
>  {
>         blk_dequeue_request(req);
>
> +       req->issue_time = ktime_to_ns(ktime_get());
> +
>         /*
>          * We are now handing the request to the hardware, initialize
>          * resid_len to full count and add the timeout handler.
> @@ -2581,6 +2583,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
>
>         trace_block_rq_complete(req->q, req, nr_bytes);
>
> +       blk_stat_add(&req->q->rq_stats[rq_data_dir(req)], req);

blk_update_request() is often run lockless, so it might be a problem
to add into queue's status here in case of non-blk-mq. Maybe it is
better to do it in blk_finish_request()?

For blk-mq, blk_stat_add() should be avoided here.

> +
>         if (!req->bio)
>                 return false;
>
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 4ea4dd8a1eed..2f68015f8616 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -247,6 +247,47 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>         return ret;
>  }
>
> +static void blk_mq_stat_clear(struct blk_mq_hw_ctx *hctx)
> +{
> +       struct blk_mq_ctx *ctx;
> +       unsigned int i;
> +
> +       hctx_for_each_ctx(hctx, ctx, i) {
> +               blk_stat_init(&ctx->stat[0]);
> +               blk_stat_init(&ctx->stat[1]);
> +       }
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_store(struct blk_mq_hw_ctx *hctx,
> +                                         const char *page, size_t count)
> +{
> +       blk_mq_stat_clear(hctx);
> +       return count;
> +}
> +
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +       return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +                       pre, (long long) stat->nr_samples,
> +                       (long long) stat->mean, (long long) stat->min,
> +                       (long long) stat->max);
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_stat_show(struct blk_mq_hw_ctx *hctx, char *page)
> +{
> +       struct blk_rq_stat stat[2];
> +       ssize_t ret;
> +
> +       blk_stat_init(&stat[0]);
> +       blk_stat_init(&stat[1]);
> +
> +       blk_hctx_stat_get(hctx, stat);
> +
> +       ret = print_stat(page, &stat[0], "read :");
> +       ret += print_stat(page + ret, &stat[1], "write:");
> +       return ret;
> +}
> +
>  static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
>         .attr = {.name = "dispatched", .mode = S_IRUGO },
>         .show = blk_mq_sysfs_dispatched_show,
> @@ -304,6 +345,11 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
>         .attr = {.name = "io_poll", .mode = S_IRUGO },
>         .show = blk_mq_hw_sysfs_poll_show,
>  };
> +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_stat = {
> +       .attr = {.name = "stats", .mode = S_IRUGO | S_IWUSR },
> +       .show = blk_mq_hw_sysfs_stat_show,
> +       .store = blk_mq_hw_sysfs_stat_store,
> +};
>
>  static struct attribute *default_hw_ctx_attrs[] = {
>         &blk_mq_hw_sysfs_queued.attr,
> @@ -314,6 +360,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
>         &blk_mq_hw_sysfs_cpus.attr,
>         &blk_mq_hw_sysfs_active.attr,
>         &blk_mq_hw_sysfs_poll.attr,
> +       &blk_mq_hw_sysfs_stat.attr,
>         NULL,
>  };
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1699baf39b78..71b4a13fbf94 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -29,6 +29,7 @@
>  #include "blk.h"
>  #include "blk-mq.h"
>  #include "blk-mq-tag.h"
> +#include "blk-stat.h"
>
>  static DEFINE_MUTEX(all_q_mutex);
>  static LIST_HEAD(all_q_list);
> @@ -356,10 +357,19 @@ static void blk_mq_ipi_complete_request(struct request *rq)
>         put_cpu();
>  }
>
> +static void blk_mq_stat_add(struct request *rq)
> +{
> +       struct blk_rq_stat *stat = &rq->mq_ctx->stat[rq_data_dir(rq)];
> +
> +       blk_stat_add(stat, rq);
> +}
> +
>  static void __blk_mq_complete_request(struct request *rq)
>  {
>         struct request_queue *q = rq->q;
>
> +       blk_mq_stat_add(rq);
> +
>         if (!q->softirq_done_fn)
>                 blk_mq_end_request(rq, rq->errors);
>         else
> @@ -403,6 +413,8 @@ void blk_mq_start_request(struct request *rq)
>         if (unlikely(blk_bidi_rq(rq)))
>                 rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
>
> +       rq->issue_time = ktime_to_ns(ktime_get());
> +
>         blk_add_timer(rq);
>
>         /*
> @@ -1761,6 +1773,8 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
>                 spin_lock_init(&__ctx->lock);
>                 INIT_LIST_HEAD(&__ctx->rq_list);
>                 __ctx->queue = q;
> +               blk_stat_init(&__ctx->stat[0]);
> +               blk_stat_init(&__ctx->stat[1]);
>
>                 /* If the cpu isn't online, the cpu is mapped to first hctx */
>                 if (!cpu_online(i))
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9087b11037b7..e107f700ff17 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -1,6 +1,8 @@
>  #ifndef INT_BLK_MQ_H
>  #define INT_BLK_MQ_H
>
> +#include "blk-stat.h"
> +
>  struct blk_mq_tag_set;
>
>  struct blk_mq_ctx {
> @@ -20,6 +22,7 @@ struct blk_mq_ctx {
>
>         /* incremented at completion time */
>         unsigned long           ____cacheline_aligned_in_smp rq_completed[2];
> +       struct blk_rq_stat      stat[2];
>
>         struct request_queue    *queue;
>         struct kobject          kobj;
> diff --git a/block/blk-stat.c b/block/blk-stat.c
> new file mode 100644
> index 000000000000..b38776a83173
> --- /dev/null
> +++ b/block/blk-stat.c
> @@ -0,0 +1,184 @@
> +/*
> + * Block stat tracking code
> + *
> + * Copyright (C) 2016 Jens Axboe
> + */
> +#include <linux/kernel.h>
> +#include <linux/blk-mq.h>
> +
> +#include "blk-stat.h"
> +#include "blk-mq.h"
> +
> +void blk_stat_sum(struct blk_rq_stat *dst, struct blk_rq_stat *src)
> +{
> +       if (!src->nr_samples)
> +               return;
> +
> +       dst->min = min(dst->min, src->min);
> +       dst->max = max(dst->max, src->max);
> +
> +       if (!dst->nr_samples)
> +               dst->mean = src->mean;
> +       else {
> +               dst->mean = div64_s64((src->mean * src->nr_samples) +
> +                                       (dst->mean * dst->nr_samples),
> +                                       dst->nr_samples + src->nr_samples);
> +       }
> +       dst->nr_samples += src->nr_samples;
> +}
> +
> +static void blk_mq_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
> +{
> +       struct blk_mq_hw_ctx *hctx;
> +       struct blk_mq_ctx *ctx;
> +       int i, j, nr;
> +
> +       blk_stat_init(&dst[0]);
> +       blk_stat_init(&dst[1]);
> +
> +       nr = 0;
> +       do {
> +               uint64_t newest = 0;
> +
> +               queue_for_each_hw_ctx(q, hctx, i) {
> +                       hctx_for_each_ctx(hctx, ctx, j) {
> +                               if (!ctx->stat[0].nr_samples &&
> +                                   !ctx->stat[1].nr_samples)
> +                                       continue;
> +                               if (ctx->stat[0].time > newest)
> +                                       newest = ctx->stat[0].time;
> +                               if (ctx->stat[1].time > newest)
> +                                       newest = ctx->stat[1].time;
> +                       }
> +               }
> +
> +               /*
> +                * No samples
> +                */
> +               if (!newest)
> +                       break;
> +
> +               queue_for_each_hw_ctx(q, hctx, i) {
> +                       hctx_for_each_ctx(hctx, ctx, j) {
> +                               if (ctx->stat[0].time == newest) {
> +                                       blk_stat_sum(&dst[0], &ctx->stat[0]);
> +                                       nr++;
> +                               }
> +                               if (ctx->stat[1].time == newest) {
> +                                       blk_stat_sum(&dst[1], &ctx->stat[1]);
> +                                       nr++;
> +                               }
> +                       }
> +               }
> +               /*
> +                * If we race on finding an entry, just loop back again.
> +                * Should be very rare.
> +                */
> +       } while (!nr);
> +}
> +
> +void blk_queue_stat_get(struct request_queue *q, struct blk_rq_stat *dst)
> +{
> +       if (q->mq_ops)
> +               blk_mq_stat_get(q, dst);
> +       else {
> +               memcpy(&dst[0], &q->rq_stats[0], sizeof(struct blk_rq_stat));
> +               memcpy(&dst[1], &q->rq_stats[1], sizeof(struct blk_rq_stat));
> +       }
> +}
> +
> +void blk_hctx_stat_get(struct blk_mq_hw_ctx *hctx, struct blk_rq_stat *dst)
> +{
> +       struct blk_mq_ctx *ctx;
> +       unsigned int i, nr;
> +
> +       nr = 0;
> +       do {
> +               uint64_t newest = 0;
> +
> +               hctx_for_each_ctx(hctx, ctx, i) {
> +                       if (!ctx->stat[0].nr_samples &&
> +                           !ctx->stat[1].nr_samples)
> +                               continue;
> +
> +                       if (ctx->stat[0].time > newest)
> +                               newest = ctx->stat[0].time;
> +                       if (ctx->stat[1].time > newest)
> +                               newest = ctx->stat[1].time;
> +               }
> +
> +               if (!newest)
> +                       break;
> +
> +               hctx_for_each_ctx(hctx, ctx, i) {
> +                       if (ctx->stat[0].time == newest) {
> +                               blk_stat_sum(&dst[0], &ctx->stat[0]);
> +                               nr++;
> +                       }
> +                       if (ctx->stat[1].time == newest) {
> +                               blk_stat_sum(&dst[1], &ctx->stat[1]);
> +                               nr++;
> +                       }
> +               }
> +               /*
> +                * If we race on finding an entry, just loop back again.
> +                * Should be very rare, as the window is only updated
> +                * occasionally
> +                */
> +       } while (!nr);
> +}
> +
> +static void __blk_stat_init(struct blk_rq_stat *stat, s64 time_now)
> +{
> +       stat->min = -1ULL;
> +       stat->max = stat->nr_samples = stat->mean = 0;
> +       stat->time = time_now & BLK_STAT_MASK;
> +}
> +
> +void blk_stat_init(struct blk_rq_stat *stat)
> +{
> +       __blk_stat_init(stat, ktime_to_ns(ktime_get()));
> +}
> +
> +void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
> +{
> +       s64 delta, now, value;
> +
> +       now = ktime_to_ns(ktime_get());
> +       if (now < rq->issue_time)
> +               return;
> +
> +       if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK))
> +               __blk_stat_init(stat, now);
> +
> +       value = now - rq->issue_time;
> +       if (value > stat->max)
> +               stat->max = value;
> +       if (value < stat->min)
> +               stat->min = value;
> +
> +       delta = value - stat->mean;
> +       if (delta)
> +               stat->mean += div64_s64(delta, stat->nr_samples + 1);
> +
> +       stat->nr_samples++;
> +}
> +
> +void blk_stat_clear(struct request_queue *q)
> +{
> +       if (q->mq_ops) {
> +               struct blk_mq_hw_ctx *hctx;
> +               struct blk_mq_ctx *ctx;
> +               int i, j;
> +
> +               queue_for_each_hw_ctx(q, hctx, i) {
> +                       hctx_for_each_ctx(hctx, ctx, j) {
> +                               blk_stat_init(&ctx->stat[0]);
> +                               blk_stat_init(&ctx->stat[1]);
> +                       }
> +               }
> +       } else {
> +               blk_stat_init(&q->rq_stats[0]);
> +               blk_stat_init(&q->rq_stats[1]);
> +       }
> +}
> diff --git a/block/blk-stat.h b/block/blk-stat.h
> new file mode 100644
> index 000000000000..d77548dbf196
> --- /dev/null
> +++ b/block/blk-stat.h
> @@ -0,0 +1,17 @@
> +#ifndef BLK_STAT_H
> +#define BLK_STAT_H
> +
> +/*
> + * ~0.13s window as a power-of-2 (2^27 nsecs)
> + */
> +#define BLK_STAT_NSEC  134217728ULL
> +#define BLK_STAT_MASK  ~(BLK_STAT_NSEC - 1)
> +
> +void blk_stat_add(struct blk_rq_stat *, struct request *);
> +void blk_hctx_stat_get(struct blk_mq_hw_ctx *, struct blk_rq_stat *);
> +void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
> +void blk_stat_clear(struct request_queue *q);
> +void blk_stat_init(struct blk_rq_stat *);
> +void blk_stat_sum(struct blk_rq_stat *, struct blk_rq_stat *);
> +
> +#endif
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 99205965f559..6e516cc0d3d0 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -379,6 +379,26 @@ static ssize_t queue_wc_store(struct request_queue *q, const char *page,
>         return count;
>  }
>
> +static ssize_t print_stat(char *page, struct blk_rq_stat *stat, const char *pre)
> +{
> +       return sprintf(page, "%s samples=%llu, mean=%lld, min=%lld, max=%lld\n",
> +                       pre, (long long) stat->nr_samples,
> +                       (long long) stat->mean, (long long) stat->min,
> +                       (long long) stat->max);
> +}
> +
> +static ssize_t queue_stats_show(struct request_queue *q, char *page)
> +{
> +       struct blk_rq_stat stat[2];
> +       ssize_t ret;
> +
> +       blk_queue_stat_get(q, stat);
> +
> +       ret = print_stat(page, &stat[0], "read :");
> +       ret += print_stat(page + ret, &stat[1], "write:");
> +       return ret;
> +}
> +
>  static struct queue_sysfs_entry queue_requests_entry = {
>         .attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
>         .show = queue_requests_show,
> @@ -516,6 +536,11 @@ static struct queue_sysfs_entry queue_wc_entry = {
>         .store = queue_wc_store,
>  };
>
> +static struct queue_sysfs_entry queue_stats_entry = {
> +       .attr = {.name = "stats", .mode = S_IRUGO },
> +       .show = queue_stats_show,
> +};
> +
>  static struct attribute *default_attrs[] = {
>         &queue_requests_entry.attr,
>         &queue_ra_entry.attr,
> @@ -542,6 +567,7 @@ static struct attribute *default_attrs[] = {
>         &queue_random_entry.attr,
>         &queue_poll_entry.attr,
>         &queue_wc_entry.attr,
> +       &queue_stats_entry.attr,
>         NULL,
>  };
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 223012451c7a..2b4414fb4d8e 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -268,4 +268,12 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
>         return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
>  }
>
> +struct blk_rq_stat {
> +       s64 mean;
> +       u64 min;
> +       u64 max;
> +       s64 nr_samples;
> +       s64 time;
> +};
> +
>  #endif /* __LINUX_BLK_TYPES_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index eee94bd6de52..87f6703ced71 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -153,6 +153,7 @@ struct request {
>         struct gendisk *rq_disk;
>         struct hd_struct *part;
>         unsigned long start_time;
> +       s64 issue_time;

io_start_time_ns may be reused for same purpose.

>  #ifdef CONFIG_BLK_CGROUP
>         struct request_list *rl;                /* rl this rq is alloced from */
>         unsigned long long start_time_ns;
> @@ -402,6 +403,9 @@ struct request_queue {
>
>         unsigned int            nr_sorted;
>         unsigned int            in_flight[2];
> +
> +       struct blk_rq_stat      rq_stats[2];
> +
>         /*
>          * Number of active block driver functions for which blk_drain_queue()
>          * must wait. Must be incremented around functions that unlock the
> --
> 2.8.0.rc4.6.g7e4ba36
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-03 12:17             ` Jan Kara
  2016-05-03 12:40               ` Chris Mason
@ 2016-05-11 16:36               ` Jan Kara
  2016-05-13 18:29                 ` Jens Axboe
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kara @ 2016-05-11 16:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Tue 03-05-16 14:17:19, Jan Kara wrote:
> The question remains how common a pattern where throttling of background
> writeback delays also something else is. I'll schedule a couple of
> benchmarks to measure impact of your patches for a wider range of workloads
> (but sadly pretty limited set of hw). If ext3 is the only one seeing
> issues, I would be willing to accept that ext3 takes the hit since it is
> doing something rather stupid (but inherent in its journal design) and we
> have a way to deal with this either by enabling delayed allocation or by
> turning off the writeback throttling...

So I've run some benchmarks on a machine with 6 GB of RAM and SSD with
queue depth 32. The filesystem on the disk was XFS this time. I've found
couple of regressions. A clear one is with dbench (version 4). The average
throughput numbers look like:

			Baseline		WBT
Hmean    mb/sec-1         30.26 (  0.00%)       18.67 (-38.28%)
Hmean    mb/sec-2         40.71 (  0.00%)       31.25 (-23.23%)
Hmean    mb/sec-4         52.67 (  0.00%)       46.83 (-11.09%)
Hmean    mb/sec-8         69.51 (  0.00%)       64.35 ( -7.42%)
Hmean    mb/sec-16        91.07 (  0.00%)       86.46 ( -5.07%)
Hmean    mb/sec-32       115.10 (  0.00%)      110.29 ( -4.18%)
Hmean    mb/sec-64       145.14 (  0.00%)      134.97 ( -7.00%)
Hmean    mb/sec-512       93.99 (  0.00%)      133.85 ( 42.41%)

There were also some losses in a filebench webproxy workload (I can give
you exact details of the settings if you want to reproduce it).

Also, and this really puzzles me, I've seen higher read latencies in some
cases (I've verified they are not just noise by rerunning the test for
kernel with writeback throttling patches). For example with the following
fio job file:

[global]
direct=0
ioengine=sync
runtime=300
time_based
invalidate=1
blocksize=4096
size=10g        # Just random value, we are running time based workload
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=1g
fdatasync=256
readwrite=randwrite
numjobs=4

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

I get the following results:

Throughput			Baseline		WBT
Hmean    kb/sec-writer-write      591.60 (  0.00%)      507.00 (-14.30%)
Hmean    kb/sec-reader-read       211.81 (  0.00%)      137.53 (-35.07%)

So both read and write throughput have suffered. And latencies don't offset
for the loss either:

FIO read latency
Min         latency-read     1383.00 (  0.00%)     1519.00 ( -9.83%)
1st-qrtle   latency-read     3485.00 (  0.00%)     5235.00 (-50.22%)
2nd-qrtle   latency-read     4708.00 (  0.00%)    15028.00 (-219.20%)
3rd-qrtle   latency-read    10286.00 (  0.00%)    57622.00 (-460.20%)
Max-90%     latency-read   195834.00 (  0.00%)   167149.00 ( 14.65%)
Max-93%     latency-read   273145.00 (  0.00%)   200319.00 ( 26.66%)
Max-95%     latency-read   335434.00 (  0.00%)   220695.00 ( 34.21%)
Max-99%     latency-read   537017.00 (  0.00%)   347174.00 ( 35.35%)
Max         latency-read   991101.00 (  0.00%)   485835.00 ( 50.98%)
Mean        latency-read    51282.79 (  0.00%)    49953.95 (  2.59%)

So we have reduced the extra high read latencies which is nice but on
average there is no change.

And another fio jobfile which doesn't look great:

[global]
direct=0
ioengine=sync
runtime=300
blocksize=4096
invalidate=1
time_based
ramp_time=5     # Let the flusher thread start before taking measurements
log_avg_msec=10
group_reporting=1

[writer]
nrfiles=1
filesize=$((MEMTOTAL_BYTES*2))
readwrite=randwrite

[reader]
# Simulate random reading from different files, switching to different file
# after 16 ios. This somewhat simulates application startup.
new_group
filesize=100m
nrfiles=20
file_service_type=random:16
readwrite=randread

The throughput numbers look like:
Hmean    kb/sec-writer-write    24707.22 (  0.00%)    19912.23 (-19.41%)
Hmean    kb/sec-reader-read       886.65 (  0.00%)      905.71 (  2.15%)

So we've got significant hit in writes not really offset by a big increase
in reads. Read latency numbers look like (I show the WBT numbers for two runs
just so that one can see how variable the latency numbers are because I was
puzzled by very high max latency for WBT kernels - quartiles seem rather
stable higher percentiles and min/max are rather variable):

			   Baseline		WBT			WBT
Min         latency-read     1230.00 (  0.00%)     1560.00 (-26.83%)	1100.00 ( 10.57%)
1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (  0.18%)	3351.00 (  0.18%)
2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (  0.44%)	4022.00 (  1.28%)
3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (  1.02%)	5095.00 (  1.98%)
Max-90%     latency-read     6594.00 (  0.00%)     6370.00 (  3.40%)	6130.00 (  7.04%)
Max-93%     latency-read    11251.00 (  0.00%)     9410.00 ( 16.36%)	6654.00 ( 40.86%)
Max-95%     latency-read    14769.00 (  0.00%)    13231.00 ( 10.41%)	10306.00 ( 30.22%)
Max-99%     latency-read    27826.00 (  0.00%)    28728.00 ( -3.24%)	25077.00 (  9.88%)
Max         latency-read    80202.00 (  0.00%)   186491.00 (-132.53%)	141346.00 (-76.24%)
Mean        latency-read     5356.12 (  0.00%)     5229.00 (  2.37%)	4927.23 (  8.01%)

I have run also other tests but they have mostly shown no significant
difference.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-11 16:36               ` Jan Kara
@ 2016-05-13 18:29                 ` Jens Axboe
  2016-05-16  7:47                   ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-05-13 18:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block, dchinner, sedat.dilek

On 05/11/2016 10:36 AM, Jan Kara wrote:
> On Tue 03-05-16 14:17:19, Jan Kara wrote:
>> The question remains how common a pattern where throttling of background
>> writeback delays also something else is. I'll schedule a couple of
>> benchmarks to measure impact of your patches for a wider range of workloads
>> (but sadly pretty limited set of hw). If ext3 is the only one seeing
>> issues, I would be willing to accept that ext3 takes the hit since it is
>> doing something rather stupid (but inherent in its journal design) and we
>> have a way to deal with this either by enabling delayed allocation or by
>> turning off the writeback throttling...
>
> So I've run some benchmarks on a machine with 6 GB of RAM and SSD with
> queue depth 32. The filesystem on the disk was XFS this time. I've found
> couple of regressions. A clear one is with dbench (version 4). The average
> throughput numbers look like:
>
> 			Baseline		WBT
> Hmean    mb/sec-1         30.26 (  0.00%)       18.67 (-38.28%)
> Hmean    mb/sec-2         40.71 (  0.00%)       31.25 (-23.23%)
> Hmean    mb/sec-4         52.67 (  0.00%)       46.83 (-11.09%)
> Hmean    mb/sec-8         69.51 (  0.00%)       64.35 ( -7.42%)
> Hmean    mb/sec-16        91.07 (  0.00%)       86.46 ( -5.07%)
> Hmean    mb/sec-32       115.10 (  0.00%)      110.29 ( -4.18%)
> Hmean    mb/sec-64       145.14 (  0.00%)      134.97 ( -7.00%)
> Hmean    mb/sec-512       93.99 (  0.00%)      133.85 ( 42.41%)
>
> There were also some losses in a filebench webproxy workload (I can give
> you exact details of the settings if you want to reproduce it).
>
> Also, and this really puzzles me, I've seen higher read latencies in some
> cases (I've verified they are not just noise by rerunning the test for
> kernel with writeback throttling patches). For example with the following
> fio job file:
>
> [global]
> direct=0
> ioengine=sync
> runtime=300
> time_based
> invalidate=1
> blocksize=4096
> size=10g        # Just random value, we are running time based workload
> log_avg_msec=10
> group_reporting=1
>
> [writer]
> nrfiles=1
> filesize=1g
> fdatasync=256
> readwrite=randwrite
> numjobs=4
>
> [reader]
> # Simulate random reading from different files, switching to different file
> # after 16 ios. This somewhat simulates application startup.
> new_group
> filesize=100m
> nrfiles=20
> file_service_type=random:16
> readwrite=randread
>
> I get the following results:
>
> Throughput			Baseline		WBT
> Hmean    kb/sec-writer-write      591.60 (  0.00%)      507.00 (-14.30%)
> Hmean    kb/sec-reader-read       211.81 (  0.00%)      137.53 (-35.07%)
>
> So both read and write throughput have suffered. And latencies don't offset
> for the loss either:
>
> FIO read latency
> Min         latency-read     1383.00 (  0.00%)     1519.00 ( -9.83%)
> 1st-qrtle   latency-read     3485.00 (  0.00%)     5235.00 (-50.22%)
> 2nd-qrtle   latency-read     4708.00 (  0.00%)    15028.00 (-219.20%)
> 3rd-qrtle   latency-read    10286.00 (  0.00%)    57622.00 (-460.20%)
> Max-90%     latency-read   195834.00 (  0.00%)   167149.00 ( 14.65%)
> Max-93%     latency-read   273145.00 (  0.00%)   200319.00 ( 26.66%)
> Max-95%     latency-read   335434.00 (  0.00%)   220695.00 ( 34.21%)
> Max-99%     latency-read   537017.00 (  0.00%)   347174.00 ( 35.35%)
> Max         latency-read   991101.00 (  0.00%)   485835.00 ( 50.98%)
> Mean        latency-read    51282.79 (  0.00%)    49953.95 (  2.59%)
>
> So we have reduced the extra high read latencies which is nice but on
> average there is no change.
>
> And another fio jobfile which doesn't look great:
>
> [global]
> direct=0
> ioengine=sync
> runtime=300
> blocksize=4096
> invalidate=1
> time_based
> ramp_time=5     # Let the flusher thread start before taking measurements
> log_avg_msec=10
> group_reporting=1
>
> [writer]
> nrfiles=1
> filesize=$((MEMTOTAL_BYTES*2))
> readwrite=randwrite
>
> [reader]
> # Simulate random reading from different files, switching to different file
> # after 16 ios. This somewhat simulates application startup.
> new_group
> filesize=100m
> nrfiles=20
> file_service_type=random:16
> readwrite=randread
>
> The throughput numbers look like:
> Hmean    kb/sec-writer-write    24707.22 (  0.00%)    19912.23 (-19.41%)
> Hmean    kb/sec-reader-read       886.65 (  0.00%)      905.71 (  2.15%)
>
> So we've got significant hit in writes not really offset by a big increase
> in reads. Read latency numbers look like (I show the WBT numbers for two runs
> just so that one can see how variable the latency numbers are because I was
> puzzled by very high max latency for WBT kernels - quartiles seem rather
> stable higher percentiles and min/max are rather variable):
>
> 			   Baseline		WBT			WBT
> Min         latency-read     1230.00 (  0.00%)     1560.00 (-26.83%)	1100.00 ( 10.57%)
> 1st-qrtle   latency-read     3357.00 (  0.00%)     3351.00 (  0.18%)	3351.00 (  0.18%)
> 2nd-qrtle   latency-read     4074.00 (  0.00%)     4056.00 (  0.44%)	4022.00 (  1.28%)
> 3rd-qrtle   latency-read     5198.00 (  0.00%)     5145.00 (  1.02%)	5095.00 (  1.98%)
> Max-90%     latency-read     6594.00 (  0.00%)     6370.00 (  3.40%)	6130.00 (  7.04%)
> Max-93%     latency-read    11251.00 (  0.00%)     9410.00 ( 16.36%)	6654.00 ( 40.86%)
> Max-95%     latency-read    14769.00 (  0.00%)    13231.00 ( 10.41%)	10306.00 ( 30.22%)
> Max-99%     latency-read    27826.00 (  0.00%)    28728.00 ( -3.24%)	25077.00 (  9.88%)
> Max         latency-read    80202.00 (  0.00%)   186491.00 (-132.53%)	141346.00 (-76.24%)
> Mean        latency-read     5356.12 (  0.00%)     5229.00 (  2.37%)	4927.23 (  8.01%)
>
> I have run also other tests but they have mostly shown no significant
> difference.

Thanks Jan, this is great and super useful! I'm revamping certain parts 
of it to deal with write back caching better, and I'll take a look at 
the regressions that you reported.

What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it 
would probably be a safe assumption that it's flagging itself as having 
a volatile write back cache, would that be a correct assumption?

Are you using scsi-mq, or do you have an IO scheduler attached to it?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCHSET v5] Make background writeback great again for the first time
  2016-05-13 18:29                 ` Jens Axboe
@ 2016-05-16  7:47                   ` Jan Kara
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2016-05-16  7:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, linux-block, dchinner,
	sedat.dilek

On Fri 13-05-16 12:29:10, Jens Axboe wrote:
> Thanks Jan, this is great and super useful! I'm revamping certain parts of
> it to deal with write back caching better, and I'll take a look at the
> regressions that you reported.
> 
> What kind of SSD is this? I'm assuming it's SATA (QD=32), and then it would
> probably be a safe assumption that it's flagging itself as having a volatile
> write back cache, would that be a correct assumption?

Yes, it is SATA with writeback cache.

> Are you using scsi-mq, or do you have an IO scheduler attached to it?

The disk was using IO scheduler, however at this point I'm not 100% sure
which scheduler (deadline or cfq) was the default one for the distro that
was installed. The machine is currently testing something else so I cannot
reinstall it and check. Maybe I can rerun some tests later in the week when
the machine gets freed with scsi-mq or deadline IO scheduler so that we
have 100% certain config.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 7/8] wbt: add general throttling mechanism
  2016-09-07 14:46 [PATCH 0/8] Throttled background buffered writeback v7 Jens Axboe
@ 2016-09-07 14:46 ` Jens Axboe
  0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-09-07 14:46 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

We can hook this up to the block layer, to help throttle buffered
writes. Or NFS can tap into it, to accomplish the same.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
               wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/wbt.h        | 120 ++++++++
 include/trace/events/wbt.h | 153 ++++++++++
 lib/Kconfig                |   3 +
 lib/Makefile               |   1 +
 lib/wbt.c                  | 679 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 956 insertions(+)
 create mode 100644 include/linux/wbt.h
 create mode 100644 include/trace/events/wbt.h
 create mode 100644 lib/wbt.c

diff --git a/include/linux/wbt.h b/include/linux/wbt.h
new file mode 100644
index 000000000000..5ffcd1409c2f
--- /dev/null
+++ b/include/linux/wbt.h
@@ -0,0 +1,120 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+#include <linux/ktime.h>
+
+enum {
+	ISSUE_STAT_TRACKED	= 1ULL << 63,
+	ISSUE_STAT_READ		= 1ULL << 62,
+	ISSUE_STAT_MASK 	= ISSUE_STAT_TRACKED | ISSUE_STAT_READ,
+	ISSUE_STAT_TIME_MASK	= ~ISSUE_STAT_MASK,
+
+	WBT_TRACKED		= 1,
+	WBT_READ		= 2,
+};
+
+struct wb_issue_stat {
+	u64 time;
+};
+
+static inline void wbt_issue_stat_set_time(struct wb_issue_stat *stat)
+{
+	stat->time = (stat->time & ISSUE_STAT_MASK) |
+			(ktime_to_ns(ktime_get()) & ISSUE_STAT_TIME_MASK);
+}
+
+static inline u64 wbt_issue_stat_get_time(struct wb_issue_stat *stat)
+{
+	return stat->time & ISSUE_STAT_TIME_MASK;
+}
+
+static inline void wbt_mark_tracked(struct wb_issue_stat *stat)
+{
+	stat->time |= ISSUE_STAT_TRACKED;
+}
+
+static inline void wbt_clear_state(struct wb_issue_stat *stat)
+{
+	stat->time &= ~(ISSUE_STAT_TRACKED | ISSUE_STAT_READ);
+}
+
+static inline bool wbt_tracked(struct wb_issue_stat *stat)
+{
+	return (stat->time & ISSUE_STAT_TRACKED) != 0;
+}
+
+static inline void wbt_mark_read(struct wb_issue_stat *stat)
+{
+	stat->time |= ISSUE_STAT_READ;
+}
+
+static inline bool wbt_is_read(struct wb_issue_stat *stat)
+{
+	return (stat->time & ISSUE_STAT_READ) != 0;
+}
+
+struct wb_stat_ops {
+	void (*get)(void *, struct blk_rq_stat *);
+	bool (*is_current)(struct blk_rq_stat *);
+	void (*clear)(void *);
+};
+
+struct rq_wb {
+	/*
+	 * Settings that govern how we throttle
+	 */
+	unsigned int wb_background;		/* background writeback */
+	unsigned int wb_normal;			/* normal writeback */
+	unsigned int wb_max;			/* max throughput writeback */
+	int scale_step;
+	bool scaled_max;
+
+	u64 win_nsec;				/* default window size */
+	u64 cur_win_nsec;			/* current window size */
+
+	/*
+	 * Number of consecutive periods where we don't have enough
+	 * information to make a firm scale up/down decision.
+	 */
+	unsigned int unknown_cnt;
+
+	struct timer_list window_timer;
+
+	s64 sync_issue;
+	void *sync_cookie;
+
+	unsigned int wc;
+	unsigned int queue_depth;
+
+	unsigned long last_issue;		/* last non-throttled issue */
+	unsigned long last_comp;		/* last non-throttled comp */
+	unsigned long min_lat_nsec;
+	struct backing_dev_info *bdi;
+	struct request_queue *q;
+	wait_queue_head_t wait;
+	atomic_t inflight;
+
+	struct wb_stat_ops *stat_ops;
+	void *ops_data;
+};
+
+struct backing_dev_info;
+
+void __wbt_done(struct rq_wb *);
+void wbt_done(struct rq_wb *, struct wb_issue_stat *);
+unsigned int wbt_wait(struct rq_wb *, unsigned int, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void *);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_issue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_disable(struct rq_wb *);
+void wbt_track(struct wb_issue_stat *, unsigned int);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index 000000000000..926c7ee0ef4e
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,153 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt
+
+#if !defined(_TRACE_WBT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WBT_H
+
+#include <linux/tracepoint.h>
+#include <linux/wbt.h>
+
+/**
+ * wbt_stat - trace stats for blk_wb
+ * @stat: array of read/write stats
+ */
+TRACE_EVENT(wbt_stat,
+
+	TP_PROTO(struct backing_dev_info *bdi, struct blk_rq_stat *stat),
+
+	TP_ARGS(bdi, stat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(s64, rmean)
+		__field(u64, rmin)
+		__field(u64, rmax)
+		__field(s64, rnr_samples)
+		__field(s64, rtime)
+		__field(s64, wmean)
+		__field(u64, wmin)
+		__field(u64, wmax)
+		__field(s64, wnr_samples)
+		__field(s64, wtime)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->rmean		= stat[0].mean;
+		__entry->rmin		= stat[0].min;
+		__entry->rmax		= stat[0].max;
+		__entry->rnr_samples	= stat[0].nr_samples;
+		__entry->wmean		= stat[1].mean;
+		__entry->wmin		= stat[1].min;
+		__entry->wmax		= stat[1].max;
+		__entry->wnr_samples	= stat[1].nr_samples;
+	),
+
+	TP_printk("%s: rmean=%llu, rmin=%llu, rmax=%llu, rsamples=%llu, "
+		  "wmean=%llu, wmin=%llu, wmax=%llu, wsamples=%llu\n",
+		  __entry->name, __entry->rmean, __entry->rmin, __entry->rmax,
+		  __entry->rnr_samples, __entry->wmean, __entry->wmin,
+		  __entry->wmax, __entry->wnr_samples)
+);
+
+/**
+ * wbt_lat - trace latency event
+ * @lat: latency trigger
+ */
+TRACE_EVENT(wbt_lat,
+
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long lat),
+
+	TP_ARGS(bdi, lat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, lat)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->lat = div_u64(lat, 1000);
+	),
+
+	TP_printk("%s: latency %lluus\n", __entry->name,
+			(unsigned long long) __entry->lat)
+);
+
+/**
+ * wbt_step - trace wb event step
+ * @msg: context message
+ * @step: the current scale step count
+ * @window: the current monitoring window
+ * @bg: the current background queue limit
+ * @normal: the current normal writeback limit
+ * @max: the current max throughput writeback limit
+ */
+TRACE_EVENT(wbt_step,
+
+	TP_PROTO(struct backing_dev_info *bdi, const char *msg,
+		 int step, unsigned long window, unsigned int bg,
+		 unsigned int normal, unsigned int max),
+
+	TP_ARGS(bdi, msg, step, window, bg, normal, max),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(const char *, msg)
+		__field(int, step)
+		__field(unsigned long, window)
+		__field(unsigned int, bg)
+		__field(unsigned int, normal)
+		__field(unsigned int, max)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->msg	= msg;
+		__entry->step	= step;
+		__entry->window	= div_u64(window, 1000);
+		__entry->bg	= bg;
+		__entry->normal	= normal;
+		__entry->max	= max;
+	),
+
+	TP_printk("%s: %s: step=%d, window=%luus, background=%u, normal=%u, max=%u\n",
+		  __entry->name, __entry->msg, __entry->step, __entry->window,
+		  __entry->bg, __entry->normal, __entry->max)
+);
+
+/**
+ * wbt_timer - trace wb timer event
+ * @status: timer state status
+ * @step: the current scale step count
+ * @inflight: tracked writes inflight
+ */
+TRACE_EVENT(wbt_timer,
+
+	TP_PROTO(struct backing_dev_info *bdi, unsigned int status,
+		 int step, unsigned int inflight),
+
+	TP_ARGS(bdi, status, step, inflight),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned int, status)
+		__field(int, step)
+		__field(unsigned int, inflight)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->status		= status;
+		__entry->step		= step;
+		__entry->inflight	= inflight;
+	),
+
+	TP_printk("%s: status=%u, step=%d, inflight=%u\n", __entry->name,
+		  __entry->status, __entry->step, __entry->inflight)
+);
+
+#endif /* _TRACE_WBT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/lib/Kconfig b/lib/Kconfig
index d79909dc01ec..c585e4c40143 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -550,4 +550,7 @@ config STACKDEPOT
 	bool
 	select STACKTRACE
 
+config WBT
+	bool
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8ec297..23afd6329c33 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -177,6 +177,7 @@ obj-$(CONFIG_SG_SPLIT) += sg_split.o
 obj-$(CONFIG_SG_POOL) += sg_pool.o
 obj-$(CONFIG_STMP_DEVICE) += stmp_device.o
 obj-$(CONFIG_IRQ_POLL) += irq_poll.o
+obj-$(CONFIG_WBT) += wbt.o
 
 obj-$(CONFIG_STACKDEPOT) += stackdepot.o
 KASAN_SANITIZE_stackdepot.o := n
diff --git a/lib/wbt.c b/lib/wbt.c
new file mode 100644
index 000000000000..88b1f884f0f0
--- /dev/null
+++ b/lib/wbt.c
@@ -0,0 +1,679 @@
+/*
+ * buffered writeback throttling. losely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ *   scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ *   look like, retain status quo.
+ * - If latencies look good, decrement scaling step.
+ * - If we're only doing writes, allow the scaling step to go negative. This
+ *   will temporarily boost write performance, snapping back to a stable
+ *   scaling step of 0 if reads show up or the heavy writers finish. Unlike
+ *   positive scaling steps where we shrink the monitoring window, a negative
+ *   scaling step retains the default step==0 window size.
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/slab.h>
+#include <linux/backing-dev.h>
+#include <linux/wbt.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/wbt.h>
+
+enum {
+	/*
+	 * Default setting, we'll scale up (to 75% of QD max) or down (min 1)
+	 * from here depending on device stats
+	 */
+	RWB_DEF_DEPTH	= 16,
+
+	/*
+	 * 100msec window
+	 */
+	RWB_WINDOW_NSEC		= 100 * 1000 * 1000ULL,
+
+	/*
+	 * Disregard stats, if we don't meet this minimum
+	 */
+	RWB_MIN_WRITE_SAMPLES	= 3,
+
+	/*
+	 * If we have this number of consecutive windows with not enough
+	 * information to scale up or down, scale up.
+	 */
+	RWB_UNKNOWN_BUMP	= 5,
+};
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+	return rwb && rwb->wb_normal != 0;
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+	int cur = atomic_read(v);
+
+	for (;;) {
+		int old;
+
+		if (cur >= below)
+			return false;
+		old = atomic_cmpxchg(v, cur, cur + 1);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+
+	return true;
+}
+
+static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
+{
+	if (rwb_enabled(rwb)) {
+		const unsigned long cur = jiffies;
+
+		if (cur != *var)
+			*var = cur;
+	}
+}
+
+/*
+ * If a task was rate throttled in balance_dirty_pages() within the last
+ * second or so, use that to indicate a higher cleaning rate.
+ */
+static bool wb_recent_wait(struct rq_wb *rwb)
+{
+	struct bdi_writeback *wb = &rwb->bdi->wb;
+
+	return time_before(jiffies, wb->dirty_sleep + HZ);
+}
+
+void __wbt_done(struct rq_wb *rwb)
+{
+	int inflight, limit;
+
+	inflight = atomic_dec_return(&rwb->inflight);
+
+	/*
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
+	 */
+	if (unlikely(!rwb_enabled(rwb))) {
+		wake_up_all(&rwb->wait);
+		return;
+	}
+
+	/*
+	 * If the device does write back caching, drop further down
+	 * before we wake people up.
+	 */
+	if (rwb->wc && !wb_recent_wait(rwb))
+		limit = 0;
+	else
+		limit = rwb->wb_normal;
+
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
+	if (waitqueue_active(&rwb->wait)) {
+		int diff = limit - inflight;
+
+		if (!inflight || diff >= rwb->wb_background / 2)
+			wake_up(&rwb->wait);
+	}
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void wbt_done(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb)
+		return;
+
+	if (!wbt_tracked(stat)) {
+		if (rwb->sync_cookie == stat) {
+			rwb->sync_issue = 0;
+			rwb->sync_cookie = NULL;
+		}
+
+		if (wbt_is_read(stat))
+			wb_timestamp(rwb, &rwb->last_comp);
+		wbt_clear_state(stat);
+	} else {
+		WARN_ON_ONCE(stat == rwb->sync_cookie);
+		__wbt_done(rwb);
+		wbt_clear_state(stat);
+	}
+}
+
+/*
+ * Return true, if we can't increase the depth further by scaling
+ */
+static bool calc_wb_limits(struct rq_wb *rwb)
+{
+	unsigned int depth;
+	bool ret = false;
+
+	if (!rwb->min_lat_nsec) {
+		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
+		return false;
+	}
+
+	/*
+	 * For QD=1 devices, this is a special case. It's important for those
+	 * to have one request ready when one completes, so force a depth of
+	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+	 * since the device can't have more than that in flight. If we're
+	 * scaling down, then keep a setting of 1/1/1.
+	 */
+	if (rwb->queue_depth == 1) {
+		if (rwb->scale_step > 0)
+			rwb->wb_max = rwb->wb_normal = 1;
+		else {
+			rwb->wb_max = rwb->wb_normal = 2;
+			ret = true;
+		}
+		rwb->wb_background = 1;
+	} else {
+		/*
+		 * scale_step == 0 is our default state. If we have suffered
+		 * latency spikes, step will be > 0, and we shrink the
+		 * allowed write depths. If step is < 0, we're only doing
+		 * writes, and we allow a temporarily higher depth to
+		 * increase performance.
+		 */
+		depth = min_t(unsigned int, RWB_DEF_DEPTH, rwb->queue_depth);
+		if (rwb->scale_step > 0)
+			depth = 1 + ((depth - 1) >> min(31, rwb->scale_step));
+		else if (rwb->scale_step < 0) {
+			unsigned int maxd = 3 * rwb->queue_depth / 4;
+
+			depth = 1 + ((depth - 1) << -rwb->scale_step);
+			if (depth > maxd) {
+				depth = maxd;
+				ret = true;
+			}
+		}
+
+		/*
+		 * Set our max/normal/bg queue depths based on how far
+		 * we have scaled down (->scale_step).
+		 */
+		rwb->wb_max = depth;
+		rwb->wb_normal = (rwb->wb_max + 1) / 2;
+		rwb->wb_background = (rwb->wb_max + 3) / 4;
+	}
+
+	return ret;
+}
+
+static bool inline stat_sample_valid(struct blk_rq_stat *stat)
+{
+	/*
+	 * We need at least one read sample, and a minimum of
+	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
+	 * that it's writes impacting us, and not just some sole read on
+	 * a device that is in a lower power state.
+	 */
+	return stat[0].nr_samples >= 1 &&
+		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
+}
+
+static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
+{
+	u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
+
+	if (!issue || !rwb->sync_cookie)
+		return 0;
+
+	now = ktime_to_ns(ktime_get());
+	return now - issue;
+}
+
+enum {
+	LAT_OK = 1,
+	LAT_UNKNOWN,
+	LAT_UNKNOWN_WRITES,
+	LAT_EXCEEDED,
+};
+
+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
+{
+	u64 thislat;
+
+	/*
+	 * If our stored sync issue exceeds the window size, or it
+	 * exceeds our min target AND we haven't logged any entries,
+	 * flag the latency as exceeded. wbt works off completion latencies,
+	 * but for a flooded device, a single sync IO can take a long time
+	 * to complete after being issued. If this time exceeds our
+	 * monitoring window AND we didn't see any other completions in that
+	 * window, then count that sync IO as a violation of the latency.
+	 */
+	thislat = rwb_sync_issue_lat(rwb);
+	if (thislat > rwb->cur_win_nsec ||
+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
+		trace_wbt_lat(rwb->bdi, thislat);
+		return LAT_EXCEEDED;
+	}
+
+	/*
+	 * No read/write mix, if stat isn't valid
+	 */
+	if (!stat_sample_valid(stat)) {
+		/*
+		 * If we had writes in this stat window and the window is
+		 * current, we're only doing writes. If a task recently
+		 * waited or still has writes in flights, consider us doing
+		 * just writes as well.
+		 */
+		if ((stat[1].nr_samples && rwb->stat_ops->is_current(stat)) ||
+		    wb_recent_wait(rwb) || atomic_read(&rwb->inflight))
+			return LAT_UNKNOWN_WRITES;
+		return LAT_UNKNOWN;
+	}
+
+	/*
+	 * If the 'min' latency exceeds our target, step down.
+	 */
+	if (stat[0].min > rwb->min_lat_nsec) {
+		trace_wbt_lat(rwb->bdi, stat[0].min);
+		trace_wbt_stat(rwb->bdi, stat);
+		return LAT_EXCEEDED;
+	}
+
+	if (rwb->scale_step)
+		trace_wbt_stat(rwb->bdi, stat);
+
+	return LAT_OK;
+}
+
+static int latency_exceeded(struct rq_wb *rwb)
+{
+	struct blk_rq_stat stat[2];
+
+	rwb->stat_ops->get(rwb->ops_data, stat);
+	return __latency_exceeded(rwb, stat);
+}
+
+static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
+{
+	trace_wbt_step(rwb->bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
+			rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+}
+
+static void scale_up(struct rq_wb *rwb)
+{
+	/*
+	 * Hit max in previous round, stop here
+	 */
+	if (rwb->scaled_max)
+		return;
+
+	rwb->scale_step--;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+
+	rwb->scaled_max = calc_wb_limits(rwb);
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_all(&rwb->wait);
+
+	rwb_trace_step(rwb, "step up");
+}
+
+/*
+ * Scale rwb down. If 'hard_throttle' is set, do it quicker, since we
+ * had a latency violation.
+ */
+static void scale_down(struct rq_wb *rwb, bool hard_throttle)
+{
+	/*
+	 * Stop scaling down when we've hit the limit. This also prevents
+	 * ->scale_step from going to crazy values, if the device can't
+	 * keep up.
+	 */
+	if (rwb->wb_max == 1)
+		return;
+
+	if (rwb->scale_step < 0 && hard_throttle)
+		rwb->scale_step = 0;
+	else
+		rwb->scale_step++;
+
+	rwb->scaled_max = false;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+	calc_wb_limits(rwb);
+	rwb_trace_step(rwb, "step down");
+}
+
+static void rwb_arm_timer(struct rq_wb *rwb)
+{
+	unsigned long expires;
+
+	if (rwb->scale_step > 0) {
+		/*
+		 * We should speed this up, using some variant of a fast
+		 * integer inverse square root calculation. Since we only do
+		 * this for every window expiration, it's not a huge deal,
+		 * though.
+		 */
+		rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
+					int_sqrt((rwb->scale_step + 1) << 8));
+	} else {
+		/*
+		 * For step < 0, we don't want to increase/decrease the
+		 * window size.
+		 */
+		rwb->cur_win_nsec = rwb->win_nsec;
+	}
+
+	expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec);
+	mod_timer(&rwb->window_timer, expires);
+}
+
+static void wb_timer_fn(unsigned long data)
+{
+	struct rq_wb *rwb = (struct rq_wb *) data;
+	int status, inflight;
+
+	inflight = atomic_read(&rwb->inflight);
+
+	status = latency_exceeded(rwb);
+
+	trace_wbt_timer(rwb->bdi, status, rwb->scale_step, inflight);
+
+	/*
+	 * If we exceeded the latency target, step down. If we did not,
+	 * step one level up. If we don't know enough to say either exceeded
+	 * or ok, then don't do anything.
+	 */
+	switch (status) {
+	case LAT_EXCEEDED:
+		scale_down(rwb, true);
+		break;
+	case LAT_OK:
+		scale_up(rwb);
+		break;
+	case LAT_UNKNOWN_WRITES:
+		scale_up(rwb);
+		break;
+	case LAT_UNKNOWN:
+		if (++rwb->unknown_cnt < RWB_UNKNOWN_BUMP)
+			break;
+		/*
+		 * We get here for two reasons:
+		 *
+		 * 1) We previously scaled reduced depth, and we currently
+		 *    don't have a valid read/write sample. For that case,
+		 *    slowly return to center state (step == 0).
+		 * 2) We started a the center step, but don't have a valid
+		 *    read/write sample, but we do have writes going on.
+		 *    Allow step to go negative, to increase write perf.
+		 */
+		if (rwb->scale_step > 0)
+			scale_up(rwb);
+		else if (rwb->scale_step < 0)
+			scale_down(rwb, false);
+		break;
+	default:
+		break;
+	}
+
+	/*
+	 * Re-arm timer, if we have IO in flight
+	 */
+	if (rwb->scale_step || inflight)
+		rwb_arm_timer(rwb);
+}
+
+void wbt_update_limits(struct rq_wb *rwb)
+{
+	rwb->scale_step = 0;
+	rwb->scaled_max = false;
+	calc_wb_limits(rwb);
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_all(&rwb->wait);
+}
+
+static bool close_io(struct rq_wb *rwb)
+{
+	const unsigned long now = jiffies;
+
+	return time_before(now, rwb->last_issue + HZ / 10) ||
+		time_before(now, rwb->last_comp + HZ / 10);
+}
+
+#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO)
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
+{
+	unsigned int limit;
+
+	/*
+	 * At this point we know it's a buffered write. If REQ_SYNC is
+	 * set, then it's WB_SYNC_ALL writeback, and we'll use the max
+	 * limit for that. If the write is marked as a background write,
+	 * then use the idle limit, or go to normal if we haven't had
+	 * competing IO for a bit.
+	 */
+	if ((rw & REQ_HIPRIO) || wb_recent_wait(rwb))
+		limit = rwb->wb_max;
+	else if ((rw & REQ_BG) || close_io(rwb)) {
+		/*
+		 * If less than 100ms since we completed unrelated IO,
+		 * limit us to half the depth for background writeback.
+		 */
+		limit = rwb->wb_background;
+	} else
+		limit = rwb->wb_normal;
+
+	return limit;
+}
+
+static inline bool may_queue(struct rq_wb *rwb, unsigned long rw)
+{
+	/*
+	 * inc it here even if disabled, since we'll dec it at completion.
+	 * this only happens if the task was sleeping in __wbt_wait(),
+	 * and someone turned it off at the same time.
+	 */
+	if (!rwb_enabled(rwb)) {
+		atomic_inc(&rwb->inflight);
+		return true;
+	}
+
+	return atomic_inc_below(&rwb->inflight, get_limit(rwb, rw));
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+{
+	DEFINE_WAIT(wait);
+
+	if (may_queue(rwb, rw))
+		return;
+
+	do {
+		prepare_to_wait_exclusive(&rwb->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+
+		if (may_queue(rwb, rw))
+			break;
+
+		if (lock)
+			spin_unlock_irq(lock);
+
+		io_schedule();
+
+		if (lock)
+			spin_lock_irq(lock);
+	} while (1);
+
+	finish_wait(&rwb->wait, &wait);
+}
+
+static inline bool wbt_should_throttle(struct rq_wb *rwb, unsigned int rw)
+{
+	const int op = rw >> BIO_OP_SHIFT;
+
+	/*
+	 * If not a WRITE (or a discard), do nothing
+	 */
+	if (!(op == REQ_OP_WRITE || op == REQ_OP_DISCARD))
+		return false;
+
+	/*
+	 * Don't throttle WRITE_ODIRECT
+	 */
+	if ((rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+		return false;
+
+	return true;
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+unsigned int wbt_wait(struct rq_wb *rwb, unsigned int rw, spinlock_t *lock)
+{
+	unsigned int ret;
+
+	if (!rwb_enabled(rwb))
+		return 0;
+
+	if ((rw >> BIO_OP_SHIFT) == REQ_OP_READ)
+		ret = WBT_READ;
+
+	if (!wbt_should_throttle(rwb, rw)) {
+		if (ret & WBT_READ)
+			wb_timestamp(rwb, &rwb->last_issue);
+		return ret;
+	}
+
+	__wbt_wait(rwb, rw, lock);
+
+	if (!timer_pending(&rwb->window_timer))
+		rwb_arm_timer(rwb);
+
+	return ret | WBT_TRACKED;
+}
+
+void wbt_issue(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+
+	wbt_issue_stat_set_time(stat);
+
+	/*
+	 * Track sync issue, in case it takes a long time to complete. Allows
+	 * us to react quicker, if a sync IO takes a long time to complete.
+	 * Note that this is just a hint. 'stat' can go away when the
+	 * request completes, so it's important we never dereference it. We
+	 * only use the address to compare with, which is why we store the
+	 * sync_issue time locally.
+	 */
+	if (wbt_is_read(stat) && !rwb->sync_issue) {
+		rwb->sync_cookie = stat;
+		rwb->sync_issue = wbt_issue_stat_get_time(stat);
+	}
+}
+
+void wbt_track(struct wb_issue_stat *stat, unsigned int wb_acct)
+{
+	if (wb_acct & WBT_TRACKED)
+		wbt_mark_tracked(stat);
+	else if (wb_acct & WBT_READ)
+		wbt_mark_read(stat);
+}
+
+void wbt_requeue(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+	if (stat == rwb->sync_cookie) {
+		rwb->sync_issue = 0;
+		rwb->sync_cookie = NULL;
+	}
+}
+
+void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+{
+	if (rwb) {
+		rwb->queue_depth = depth;
+		wbt_update_limits(rwb);
+	}
+}
+
+void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
+{
+	if (rwb)
+		rwb->wc = write_cache_on;
+}
+
+void wbt_disable(struct rq_wb *rwb)
+{
+	del_timer_sync(&rwb->window_timer);
+	rwb->win_nsec = rwb->min_lat_nsec = 0;
+	wbt_update_limits(rwb);
+}
+EXPORT_SYMBOL_GPL(wbt_disable);
+
+struct rq_wb *wbt_init(struct backing_dev_info *bdi, struct wb_stat_ops *ops,
+		       void *ops_data)
+{
+	struct rq_wb *rwb;
+
+	if (!ops->get || !ops->is_current || !ops->clear)
+		return ERR_PTR(-EINVAL);
+
+	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+	if (!rwb)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&rwb->inflight, 0);
+	init_waitqueue_head(&rwb->wait);
+	setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb);
+	rwb->wc = 1;
+	rwb->queue_depth = RWB_DEF_DEPTH;
+	rwb->last_comp = rwb->last_issue = jiffies;
+	rwb->bdi = bdi;
+	rwb->win_nsec = RWB_WINDOW_NSEC;
+	rwb->stat_ops = ops,
+	rwb->ops_data = ops_data;
+	wbt_update_limits(rwb);
+	return rwb;
+}
+
+void wbt_exit(struct rq_wb *rwb)
+{
+	if (rwb) {
+		del_timer_sync(&rwb->window_timer);
+		kfree(rwb);
+	}
+}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-09-01 18:05   ` Omar Sandoval
@ 2016-09-01 18:51     ` Jens Axboe
  0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2016-09-01 18:51 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block

On 09/01/2016 12:05 PM, Omar Sandoval wrote:
>> diff --git a/lib/Kconfig b/lib/Kconfig
>> index d79909dc01ec..5a65a1f91889 100644
>> --- a/lib/Kconfig
>> +++ b/lib/Kconfig
>> @@ -550,4 +550,8 @@ config STACKDEPOT
>>  	bool
>>  	select STACKTRACE
>>
>> +config WBT
>> +	bool
>> +	select SCALE_BITMAP
>
> Looks like this snuck in from your experiments to get this to work on
> top of scale_bitmap?

Oops yes, it is indeed. Killed, thanks.

>> +	if (waitqueue_active(&rwb->wait)) {
>> +		int diff = limit - inflight;
>> +
>> +		if (!inflight || diff >= rwb->wb_background / 2)
>> +			wake_up_nr(&rwb->wait, 1);
>
> wake_up(&rwb->wait)?

Yeah, that'd be cleaner. I think this is a leftover from when I 
experimented with batched wakeups, with nr != 1. I'll change it to just 
wake_up().

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 7/8] wbt: add general throttling mechanism
  2016-08-31 17:05 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
@ 2016-09-01 18:05   ` Omar Sandoval
  2016-09-01 18:51     ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Omar Sandoval @ 2016-09-01 18:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: axboe, linux-kernel, linux-fsdevel, linux-block

On Wed, Aug 31, 2016 at 11:05:50AM -0600, Jens Axboe wrote:
> We can hook this up to the block layer, to help throttle buffered
> writes. Or NFS can tap into it, to accomplish the same.
> 
> wbt registers a few trace points that can be used to track what is
> happening in the system:
> 
> wbt_lat: 259:0: latency 2446318
> wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
>                wmean=518866, wmin=15522, wmax=5330353, wsamples=57
> wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32
> 
> This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
> dumps the current read/write stats for that window, and wbt_step shows a
> step down event where we now scale back writes. Each trace includes the
> device, 259:0 in this case.
> 
> Signed-off-by: Jens Axboe <axboe@fb.com>
> ---
>  include/linux/wbt.h        | 118 +++++++++
>  include/trace/events/wbt.h | 122 ++++++++++
>  lib/Kconfig                |   4 +
>  lib/Makefile               |   1 +
>  lib/wbt.c                  | 587 +++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 832 insertions(+)
>  create mode 100644 include/linux/wbt.h
>  create mode 100644 include/trace/events/wbt.h
>  create mode 100644 lib/wbt.c
> 

[snip]

> diff --git a/lib/Kconfig b/lib/Kconfig
> index d79909dc01ec..5a65a1f91889 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -550,4 +550,8 @@ config STACKDEPOT
>  	bool
>  	select STACKTRACE
>  
> +config WBT
> +	bool
> +	select SCALE_BITMAP

Looks like this snuck in from your experiments to get this to work on
top of scale_bitmap?

[snip]

> +void __wbt_done(struct rq_wb *rwb)
> +{
> +	int inflight, limit;
> +
> +	inflight = atomic_dec_return(&rwb->inflight);
> +
> +	/*
> +	 * wbt got disabled with IO in flight. Wake up any potential
> +	 * waiters, we don't have to do more than that.
> +	 */
> +	if (unlikely(!rwb_enabled(rwb))) {
> +		wake_up_all(&rwb->wait);
> +		return;
> +	}
> +
> +	/*
> +	 * If the device does write back caching, drop further down
> +	 * before we wake people up.
> +	 */
> +	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
> +		limit = 0;
> +	else
> +		limit = rwb->wb_normal;
> +
> +	/*
> +	 * Don't wake anyone up if we are above the normal limit.
> +	 */
> +	if (inflight && inflight >= limit)
> +		return;
> +
> +	if (waitqueue_active(&rwb->wait)) {
> +		int diff = limit - inflight;
> +
> +		if (!inflight || diff >= rwb->wb_background / 2)
> +			wake_up_nr(&rwb->wait, 1);

wake_up(&rwb->wait)?

-- 
Omar

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 7/8] wbt: add general throttling mechanism
  2016-08-31 17:05 [PATCHSET v6] Throttled background buffered writeback Jens Axboe
@ 2016-08-31 17:05 ` Jens Axboe
  2016-09-01 18:05   ` Omar Sandoval
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2016-08-31 17:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

We can hook this up to the block layer, to help throttle buffered
writes. Or NFS can tap into it, to accomplish the same.

wbt registers a few trace points that can be used to track what is
happening in the system:

wbt_lat: 259:0: latency 2446318
wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
               wmean=518866, wmin=15522, wmax=5330353, wsamples=57
wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

This shows a sync issue event (wbt_lat) that exceeded it's time. wbt_stat
dumps the current read/write stats for that window, and wbt_step shows a
step down event where we now scale back writes. Each trace includes the
device, 259:0 in this case.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/wbt.h        | 118 +++++++++
 include/trace/events/wbt.h | 122 ++++++++++
 lib/Kconfig                |   4 +
 lib/Makefile               |   1 +
 lib/wbt.c                  | 587 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 832 insertions(+)
 create mode 100644 include/linux/wbt.h
 create mode 100644 include/trace/events/wbt.h
 create mode 100644 lib/wbt.c

diff --git a/include/linux/wbt.h b/include/linux/wbt.h
new file mode 100644
index 000000000000..14473d550a18
--- /dev/null
+++ b/include/linux/wbt.h
@@ -0,0 +1,118 @@
+#ifndef WB_THROTTLE_H
+#define WB_THROTTLE_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+#include <linux/timer.h>
+#include <linux/ktime.h>
+
+enum {
+	ISSUE_STAT_TRACKED	= 1ULL << 63,
+	ISSUE_STAT_READ		= 1ULL << 62,
+	ISSUE_STAT_MASK 	= ISSUE_STAT_TRACKED | ISSUE_STAT_READ,
+	ISSUE_STAT_TIME_MASK	= ~ISSUE_STAT_MASK,
+
+	WBT_TRACKED		= 1,
+	WBT_READ		= 2,
+};
+
+struct wb_issue_stat {
+	u64 time;
+};
+
+static inline void wbt_issue_stat_set_time(struct wb_issue_stat *stat)
+{
+	stat->time = (stat->time & ISSUE_STAT_MASK) |
+			(ktime_to_ns(ktime_get()) & ISSUE_STAT_TIME_MASK);
+}
+
+static inline u64 wbt_issue_stat_get_time(struct wb_issue_stat *stat)
+{
+	return stat->time & ISSUE_STAT_TIME_MASK;
+}
+
+static inline void wbt_mark_tracked(struct wb_issue_stat *stat)
+{
+	stat->time |= ISSUE_STAT_TRACKED;
+}
+
+static inline void wbt_clear_state(struct wb_issue_stat *stat)
+{
+	stat->time &= ~(ISSUE_STAT_TRACKED | ISSUE_STAT_READ);
+}
+
+static inline bool wbt_tracked(struct wb_issue_stat *stat)
+{
+	return (stat->time & ISSUE_STAT_TRACKED) != 0;
+}
+
+static inline void wbt_mark_read(struct wb_issue_stat *stat)
+{
+	stat->time |= ISSUE_STAT_READ;
+}
+
+static inline bool wbt_is_read(struct wb_issue_stat *stat)
+{
+	return (stat->time & ISSUE_STAT_READ) != 0;
+}
+
+struct wb_stat_ops {
+	void (*get)(void *, struct blk_rq_stat *);
+	void (*clear)(void *);
+};
+
+struct rq_wb {
+	/*
+	 * Settings that govern how we throttle
+	 */
+	unsigned int wb_background;		/* background writeback */
+	unsigned int wb_normal;			/* normal writeback */
+	unsigned int wb_max;			/* max throughput writeback */
+	unsigned int scale_step;
+
+	u64 win_nsec;				/* default window size */
+	u64 cur_win_nsec;			/* current window size */
+
+	/*
+	 * Number of consecutive periods where we don't have enough
+	 * information to make a firm scale up/down decision.
+	 */
+	unsigned int unknown_cnt;
+
+	struct timer_list window_timer;
+
+	s64 sync_issue;
+	void *sync_cookie;
+
+	unsigned int wc;
+	unsigned int queue_depth;
+
+	unsigned long last_issue;		/* last non-throttled issue */
+	unsigned long last_comp;		/* last non-throttled comp */
+	unsigned long min_lat_nsec;
+	struct backing_dev_info *bdi;
+	struct request_queue *q;
+	wait_queue_head_t wait;
+	atomic_t inflight;
+
+	struct wb_stat_ops *stat_ops;
+	void *ops_data;
+};
+
+struct backing_dev_info;
+
+void __wbt_done(struct rq_wb *);
+void wbt_done(struct rq_wb *, struct wb_issue_stat *);
+unsigned int wbt_wait(struct rq_wb *, unsigned int, spinlock_t *);
+struct rq_wb *wbt_init(struct backing_dev_info *, struct wb_stat_ops *, void *);
+void wbt_exit(struct rq_wb *);
+void wbt_update_limits(struct rq_wb *);
+void wbt_requeue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_issue(struct rq_wb *, struct wb_issue_stat *);
+void wbt_disable(struct rq_wb *);
+void wbt_track(struct wb_issue_stat *, unsigned int);
+
+void wbt_set_queue_depth(struct rq_wb *, unsigned int);
+void wbt_set_write_cache(struct rq_wb *, bool);
+
+#endif
diff --git a/include/trace/events/wbt.h b/include/trace/events/wbt.h
new file mode 100644
index 000000000000..a4b8b2e57bb1
--- /dev/null
+++ b/include/trace/events/wbt.h
@@ -0,0 +1,122 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM wbt
+
+#if !defined(_TRACE_WBT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_WBT_H
+
+#include <linux/tracepoint.h>
+#include <linux/wbt.h>
+
+/**
+ * wbt_stat - trace stats for blk_wb
+ * @stat: array of read/write stats
+ */
+TRACE_EVENT(wbt_stat,
+
+	TP_PROTO(struct backing_dev_info *bdi, struct blk_rq_stat *stat),
+
+	TP_ARGS(bdi, stat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(s64, rmean)
+		__field(u64, rmin)
+		__field(u64, rmax)
+		__field(s64, rnr_samples)
+		__field(s64, rtime)
+		__field(s64, wmean)
+		__field(u64, wmin)
+		__field(u64, wmax)
+		__field(s64, wnr_samples)
+		__field(s64, wtime)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->rmean		= stat[0].mean;
+		__entry->rmin		= stat[0].min;
+		__entry->rmax		= stat[0].max;
+		__entry->rnr_samples	= stat[0].nr_samples;
+		__entry->wmean		= stat[1].mean;
+		__entry->wmin		= stat[1].min;
+		__entry->wmax		= stat[1].max;
+		__entry->wnr_samples	= stat[1].nr_samples;
+	),
+
+	TP_printk("%s: rmean=%llu, rmin=%llu, rmax=%llu, rsamples=%llu, "
+		  "wmean=%llu, wmin=%llu, wmax=%llu, wsamples=%llu\n",
+		  __entry->name, __entry->rmean, __entry->rmin, __entry->rmax,
+		  __entry->rnr_samples, __entry->wmean, __entry->wmin,
+		  __entry->wmax, __entry->wnr_samples)
+);
+
+/**
+ * wbt_lat - trace latency event
+ * @lat: latency trigger
+ */
+TRACE_EVENT(wbt_lat,
+
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long lat),
+
+	TP_ARGS(bdi, lat),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, lat)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->lat = lat;
+	),
+
+	TP_printk("%s: latency %llu\n", __entry->name,
+			(unsigned long long) __entry->lat)
+);
+
+/**
+ * wbt_step - trace wb event step
+ * @msg: context message
+ * @step: the current scale step count
+ * @window: the current monitoring window
+ * @bg: the current background queue limit
+ * @normal: the current normal writeback limit
+ * @max: the current max throughput writeback limit
+ */
+TRACE_EVENT(wbt_step,
+
+	TP_PROTO(struct backing_dev_info *bdi, const char *msg,
+		 unsigned int step, unsigned long window, unsigned int bg,
+		 unsigned int normal, unsigned int max),
+
+	TP_ARGS(bdi, msg, step, window, bg, normal, max),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(const char *, msg)
+		__field(unsigned int, step)
+		__field(unsigned long, window)
+		__field(unsigned int, bg)
+		__field(unsigned int, normal)
+		__field(unsigned int, max)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->msg	= msg;
+		__entry->step	= step;
+		__entry->window	= window;
+		__entry->bg	= bg;
+		__entry->normal	= normal;
+		__entry->max	= max;
+	),
+
+	TP_printk("%s: %s: step=%u, window=%lu, background=%u, normal=%u, max=%u\n",
+		  __entry->name, __entry->msg, __entry->step, __entry->window,
+		  __entry->bg, __entry->normal, __entry->max)
+);
+
+#endif /* _TRACE_WBT_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/lib/Kconfig b/lib/Kconfig
index d79909dc01ec..5a65a1f91889 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -550,4 +550,8 @@ config STACKDEPOT
 	bool
 	select STACKTRACE
 
+config WBT
+	bool
+	select SCALE_BITMAP
+
 endmenu
diff --git a/lib/Makefile b/lib/Makefile
index cfa68eb269e4..c42f0eccd700 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -178,6 +178,7 @@ obj-$(CONFIG_SG_SPLIT) += sg_split.o
 obj-$(CONFIG_SG_POOL) += sg_pool.o
 obj-$(CONFIG_STMP_DEVICE) += stmp_device.o
 obj-$(CONFIG_IRQ_POLL) += irq_poll.o
+obj-$(CONFIG_WBT) += wbt.o
 
 obj-$(CONFIG_STACKDEPOT) += stackdepot.o
 KASAN_SANITIZE_stackdepot.o := n
diff --git a/lib/wbt.c b/lib/wbt.c
new file mode 100644
index 000000000000..7da087700eb1
--- /dev/null
+++ b/lib/wbt.c
@@ -0,0 +1,587 @@
+/*
+ * buffered writeback throttling. losely based on CoDel. We can't drop
+ * packets for IO scheduling, so the logic is something like this:
+ *
+ * - Monitor latencies in a defined window of time.
+ * - If the minimum latency in the above window exceeds some target, increment
+ *   scaling step and scale down queue depth by a factor of 2x. The monitoring
+ *   window is then shrunk to 100 / sqrt(scaling step + 1).
+ * - For any window where we don't have solid data on what the latencies
+ *   look like, retain status quo.
+ * - If latencies look good, decrement scaling step.
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that (may) need changing:
+ *
+ *	- Different scaling of background/normal/high priority writeback.
+ *	  We may have to violate guarantees for max.
+ *	- We can have mismatches between the stat window and our window.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/slab.h>
+#include <linux/backing-dev.h>
+#include <linux/wbt.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/wbt.h>
+
+enum {
+	/*
+	 * Might need to be higher
+	 */
+	RWB_MAX_DEPTH	= 64,
+
+	/*
+	 * 100msec window
+	 */
+	RWB_WINDOW_NSEC		= 100 * 1000 * 1000ULL,
+
+	/*
+	 * Disregard stats, if we don't meet these minimums
+	 */
+	RWB_MIN_WRITE_SAMPLES	= 3,
+	RWB_MIN_READ_SAMPLES	= 1,
+
+	/*
+	 * If we have this number of consecutive windows with not enough
+	 * information to scale up or down, scale up.
+	 */
+	RWB_UNKNOWN_BUMP	= 5,
+};
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+	return rwb && rwb->wb_normal != 0;
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+	int cur = atomic_read(v);
+
+	for (;;) {
+		int old;
+
+		if (cur >= below)
+			return false;
+		old = atomic_cmpxchg(v, cur, cur + 1);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+
+	return true;
+}
+
+static void wb_timestamp(struct rq_wb *rwb, unsigned long *var)
+{
+	if (rwb_enabled(rwb)) {
+		const unsigned long cur = jiffies;
+
+		if (cur != *var)
+			*var = cur;
+	}
+}
+
+void __wbt_done(struct rq_wb *rwb)
+{
+	int inflight, limit;
+
+	inflight = atomic_dec_return(&rwb->inflight);
+
+	/*
+	 * wbt got disabled with IO in flight. Wake up any potential
+	 * waiters, we don't have to do more than that.
+	 */
+	if (unlikely(!rwb_enabled(rwb))) {
+		wake_up_all(&rwb->wait);
+		return;
+	}
+
+	/*
+	 * If the device does write back caching, drop further down
+	 * before we wake people up.
+	 */
+	if (rwb->wc && !atomic_read(&rwb->bdi->wb.dirty_sleeping))
+		limit = 0;
+	else
+		limit = rwb->wb_normal;
+
+	/*
+	 * Don't wake anyone up if we are above the normal limit.
+	 */
+	if (inflight && inflight >= limit)
+		return;
+
+	if (waitqueue_active(&rwb->wait)) {
+		int diff = limit - inflight;
+
+		if (!inflight || diff >= rwb->wb_background / 2)
+			wake_up_nr(&rwb->wait, 1);
+	}
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void wbt_done(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb)
+		return;
+
+	if (!wbt_tracked(stat)) {
+		if (rwb->sync_cookie == stat) {
+			rwb->sync_issue = 0;
+			rwb->sync_cookie = NULL;
+		}
+
+		if (wbt_is_read(stat))
+			wb_timestamp(rwb, &rwb->last_comp);
+		wbt_clear_state(stat);
+	} else {
+		WARN_ON_ONCE(stat == rwb->sync_cookie);
+		__wbt_done(rwb);
+		wbt_clear_state(stat);
+	}
+}
+
+static void calc_wb_limits(struct rq_wb *rwb)
+{
+	unsigned int depth;
+
+	if (!rwb->min_lat_nsec) {
+		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
+		return;
+	}
+
+	/*
+	 * For QD=1 devices, this is a special case. It's important for those
+	 * to have one request ready when one completes, so force a depth of
+	 * 2 for those devices. On the backend, it'll be a depth of 1 anyway,
+	 * since the device can't have more than that in flight. If we're
+	 * scaling down, then keep a setting of 1/1/1.
+	 */
+	if (rwb->queue_depth == 1) {
+		if (rwb->scale_step)
+			rwb->wb_max = rwb->wb_normal = 1;
+		else
+			rwb->wb_max = rwb->wb_normal = 2;
+		rwb->wb_background = 1;
+	} else {
+		depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
+
+		/*
+		 * Set our max/normal/bg queue depths based on how far
+		 * we have scaled down (->scale_step).
+		 */
+		rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
+		rwb->wb_normal = (rwb->wb_max + 1) / 2;
+		rwb->wb_background = (rwb->wb_max + 3) / 4;
+	}
+}
+
+static bool inline stat_sample_valid(struct blk_rq_stat *stat)
+{
+	/*
+	 * We need at least one read sample, and a minimum of
+	 * RWB_MIN_WRITE_SAMPLES. We require some write samples to know
+	 * that it's writes impacting us, and not just some sole read on
+	 * a device that is in a lower power state.
+	 */
+	return stat[0].nr_samples >= 1 &&
+		stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES;
+}
+
+static u64 rwb_sync_issue_lat(struct rq_wb *rwb)
+{
+	u64 now, issue = ACCESS_ONCE(rwb->sync_issue);
+
+	if (!issue || !rwb->sync_cookie)
+		return 0;
+
+	now = ktime_to_ns(ktime_get());
+	return now - issue;
+}
+
+enum {
+	LAT_OK,
+	LAT_UNKNOWN,
+	LAT_EXCEEDED,
+};
+
+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
+{
+	u64 thislat;
+
+	/*
+	 * If our stored sync issue exceeds the window size, or it
+	 * exceeds our min target AND we haven't logged any entries,
+	 * flag the latency as exceeded. wbt works off completion latencies,
+	 * but for a flooded device, a single sync IO can take a long time
+	 * to complete after being issued. If this time exceeds our
+	 * monitoring window AND we didn't see any other completions in that
+	 * window, then count that sync IO as a violation of the latency.
+	 */
+	thislat = rwb_sync_issue_lat(rwb);
+	if (thislat > rwb->cur_win_nsec ||
+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
+		trace_wbt_lat(rwb->bdi, thislat);
+		return LAT_EXCEEDED;
+	}
+
+	if (!stat_sample_valid(stat))
+		return LAT_UNKNOWN;
+
+	/*
+	 * If the 'min' latency exceeds our target, step down.
+	 */
+	if (stat[0].min > rwb->min_lat_nsec) {
+		trace_wbt_lat(rwb->bdi, stat[0].min);
+		trace_wbt_stat(rwb->bdi, stat);
+		return LAT_EXCEEDED;
+	}
+
+	if (rwb->scale_step)
+		trace_wbt_stat(rwb->bdi, stat);
+
+	return LAT_OK;
+}
+
+static int latency_exceeded(struct rq_wb *rwb)
+{
+	struct blk_rq_stat stat[2];
+
+	rwb->stat_ops->get(rwb->ops_data, stat);
+	return __latency_exceeded(rwb, stat);
+}
+
+static void rwb_trace_step(struct rq_wb *rwb, const char *msg)
+{
+	trace_wbt_step(rwb->bdi, msg, rwb->scale_step, rwb->cur_win_nsec,
+			rwb->wb_background, rwb->wb_normal, rwb->wb_max);
+}
+
+static void scale_up(struct rq_wb *rwb)
+{
+	/*
+	 * If we're at 0, we can't go lower.
+	 */
+	if (!rwb->scale_step)
+		return;
+
+	rwb->scale_step--;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+	calc_wb_limits(rwb);
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_all(&rwb->wait);
+
+	rwb_trace_step(rwb, "step up");
+}
+
+static void scale_down(struct rq_wb *rwb)
+{
+	/*
+	 * Stop scaling down when we've hit the limit. This also prevents
+	 * ->scale_step from going to crazy values, if the device can't
+	 * keep up.
+	 */
+	if (rwb->wb_max == 1)
+		return;
+
+	rwb->scale_step++;
+	rwb->unknown_cnt = 0;
+	rwb->stat_ops->clear(rwb->ops_data);
+	calc_wb_limits(rwb);
+	rwb_trace_step(rwb, "step down");
+}
+
+static void rwb_arm_timer(struct rq_wb *rwb)
+{
+	unsigned long expires;
+
+	/*
+	 * We should speed this up, using some variant of a fast integer
+	 * inverse square root calculation. Since we only do this for
+	 * every window expiration, it's not a huge deal, though.
+	 */
+	rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4,
+					int_sqrt((rwb->scale_step + 1) << 8));
+	expires = jiffies + nsecs_to_jiffies(rwb->cur_win_nsec);
+	mod_timer(&rwb->window_timer, expires);
+}
+
+static void wb_timer_fn(unsigned long data)
+{
+	struct rq_wb *rwb = (struct rq_wb *) data;
+	int status;
+
+	/*
+	 * If we exceeded the latency target, step down. If we did not,
+	 * step one level up. If we don't know enough to say either exceeded
+	 * or ok, then don't do anything.
+	 */
+	status = latency_exceeded(rwb);
+	switch (status) {
+	case LAT_EXCEEDED:
+		scale_down(rwb);
+		break;
+	case LAT_OK:
+		scale_up(rwb);
+		break;
+	case LAT_UNKNOWN:
+		/*
+		 * We had no read samples, start bumping up the write
+		 * depth slowly
+		 */
+		if (++rwb->unknown_cnt >= RWB_UNKNOWN_BUMP)
+			scale_up(rwb);
+		break;
+	default:
+		break;
+	}
+
+	/*
+	 * Re-arm timer, if we have IO in flight
+	 */
+	if (rwb->scale_step || atomic_read(&rwb->inflight))
+		rwb_arm_timer(rwb);
+}
+
+void wbt_update_limits(struct rq_wb *rwb)
+{
+	rwb->scale_step = 0;
+	calc_wb_limits(rwb);
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_all(&rwb->wait);
+}
+
+static bool close_io(struct rq_wb *rwb)
+{
+	const unsigned long now = jiffies;
+
+	return time_before(now, rwb->last_issue + HZ / 10) ||
+		time_before(now, rwb->last_comp + HZ / 10);
+}
+
+#define REQ_HIPRIO	(REQ_SYNC | REQ_META | REQ_PRIO)
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
+{
+	unsigned int limit;
+
+	/*
+	 * At this point we know it's a buffered write. If REQ_SYNC is
+	 * set, then it's WB_SYNC_ALL writeback, and we'll use the max
+	 * limit for that. If the write is marked as a background write,
+	 * then use the idle limit, or go to normal if we haven't had
+	 * competing IO for a bit.
+	 */
+	if ((rw & REQ_HIPRIO) || atomic_read(&rwb->bdi->wb.dirty_sleeping))
+		limit = rwb->wb_max;
+	else if ((rw & REQ_BG) || close_io(rwb)) {
+		/*
+		 * If less than 100ms since we completed unrelated IO,
+		 * limit us to half the depth for background writeback.
+		 */
+		limit = rwb->wb_background;
+	} else
+		limit = rwb->wb_normal;
+
+	return limit;
+}
+
+static inline bool may_queue(struct rq_wb *rwb, unsigned long rw)
+{
+	/*
+	 * inc it here even if disabled, since we'll dec it at completion.
+	 * this only happens if the task was sleeping in __wbt_wait(),
+	 * and someone turned it off at the same time.
+	 */
+	if (!rwb_enabled(rwb)) {
+		atomic_inc(&rwb->inflight);
+		return true;
+	}
+
+	return atomic_inc_below(&rwb->inflight, get_limit(rwb, rw));
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
+{
+	DEFINE_WAIT(wait);
+
+	if (may_queue(rwb, rw))
+		return;
+
+	do {
+		prepare_to_wait_exclusive(&rwb->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+
+		if (may_queue(rwb, rw))
+			break;
+
+		if (lock)
+			spin_unlock_irq(lock);
+
+		io_schedule();
+
+		if (lock)
+			spin_lock_irq(lock);
+	} while (1);
+
+	finish_wait(&rwb->wait, &wait);
+}
+
+static inline bool wbt_should_throttle(struct rq_wb *rwb, unsigned int rw)
+{
+	const int op = rw >> BIO_OP_SHIFT;
+
+	/*
+	 * If not a WRITE (or a discard), do nothing
+	 */
+	if (!(op == REQ_OP_WRITE || op == REQ_OP_DISCARD))
+		return false;
+
+	/*
+	 * Don't throttle WRITE_ODIRECT
+	 */
+	if ((rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+		return false;
+
+	return true;
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+unsigned int wbt_wait(struct rq_wb *rwb, unsigned int rw, spinlock_t *lock)
+{
+	unsigned int ret;
+
+	if (!rwb_enabled(rwb))
+		return 0;
+
+	if ((rw >> BIO_OP_SHIFT) == REQ_OP_READ)
+		ret = WBT_READ;
+
+	if (!wbt_should_throttle(rwb, rw)) {
+		if (ret & WBT_READ)
+			wb_timestamp(rwb, &rwb->last_issue);
+		return ret;
+	}
+
+	__wbt_wait(rwb, rw, lock);
+
+	if (!timer_pending(&rwb->window_timer))
+		rwb_arm_timer(rwb);
+
+	return ret | WBT_TRACKED;
+}
+
+void wbt_issue(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+
+	wbt_issue_stat_set_time(stat);
+
+	/*
+	 * Track sync issue, in case it takes a long time to complete. Allows
+	 * us to react quicker, if a sync IO takes a long time to complete.
+	 * Note that this is just a hint. 'stat' can go away when the
+	 * request completes, so it's important we never dereference it. We
+	 * only use the address to compare with, which is why we store the
+	 * sync_issue time locally.
+	 */
+	if (wbt_is_read(stat) && !rwb->sync_issue) {
+		rwb->sync_cookie = stat;
+		rwb->sync_issue = wbt_issue_stat_get_time(stat);
+	}
+}
+
+void wbt_track(struct wb_issue_stat *stat, unsigned int wb_acct)
+{
+	if (wb_acct & WBT_TRACKED)
+		wbt_mark_tracked(stat);
+	else if (wb_acct & WBT_READ)
+		wbt_mark_read(stat);
+}
+
+void wbt_requeue(struct rq_wb *rwb, struct wb_issue_stat *stat)
+{
+	if (!rwb_enabled(rwb))
+		return;
+	if (stat == rwb->sync_cookie) {
+		rwb->sync_issue = 0;
+		rwb->sync_cookie = NULL;
+	}
+}
+
+void wbt_set_queue_depth(struct rq_wb *rwb, unsigned int depth)
+{
+	if (rwb) {
+		rwb->queue_depth = depth;
+		wbt_update_limits(rwb);
+	}
+}
+
+void wbt_set_write_cache(struct rq_wb *rwb, bool write_cache_on)
+{
+	if (rwb)
+		rwb->wc = write_cache_on;
+}
+
+void wbt_disable(struct rq_wb *rwb)
+{
+	del_timer_sync(&rwb->window_timer);
+	rwb->win_nsec = rwb->min_lat_nsec = 0;
+	wbt_update_limits(rwb);
+}
+EXPORT_SYMBOL_GPL(wbt_disable);
+
+struct rq_wb *wbt_init(struct backing_dev_info *bdi, struct wb_stat_ops *ops,
+		       void *ops_data)
+{
+	struct rq_wb *rwb;
+
+	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+	if (!rwb)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&rwb->inflight, 0);
+	init_waitqueue_head(&rwb->wait);
+	setup_timer(&rwb->window_timer, wb_timer_fn, (unsigned long) rwb);
+	rwb->wc = 1;
+	rwb->queue_depth = RWB_MAX_DEPTH;
+	rwb->last_comp = rwb->last_issue = jiffies;
+	rwb->bdi = bdi;
+	rwb->win_nsec = RWB_WINDOW_NSEC;
+	rwb->stat_ops = ops,
+	rwb->ops_data = ops_data;
+	wbt_update_limits(rwb);
+	return rwb;
+}
+
+void wbt_exit(struct rq_wb *rwb)
+{
+	if (rwb) {
+		del_timer_sync(&rwb->window_timer);
+		kfree(rwb);
+	}
+}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2016-09-07 14:47 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
2016-04-26 15:55 ` [PATCH 1/8] block: add WRITE_BG Jens Axboe
2016-04-26 15:55 ` [PATCH 2/8] writeback: add wbc_to_write_cmd() Jens Axboe
2016-04-26 15:55 ` [PATCH 3/8] writeback: use WRITE_BG for kupdate and background writeback Jens Axboe
2016-04-26 15:55 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
2016-04-26 15:55 ` [PATCH 5/8] block: add code to track actual device queue depth Jens Axboe
2016-04-26 15:55 ` [PATCH 6/8] block: add scalable completion tracking of requests Jens Axboe
2016-05-05  7:52   ` Ming Lei
2016-04-26 15:55 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
2016-04-27 12:06   ` xiakaixu
2016-04-27 15:21     ` Jens Axboe
2016-04-28  3:29       ` xiakaixu
2016-04-28 11:05   ` Jan Kara
2016-04-28 18:53     ` Jens Axboe
2016-04-28 19:03       ` Jens Axboe
2016-05-03  9:34       ` Jan Kara
2016-05-03 14:23         ` Jens Axboe
2016-05-03 15:22           ` Jan Kara
2016-05-03 15:32             ` Jens Axboe
2016-05-03 15:40         ` Jan Kara
2016-05-03 15:48           ` Jan Kara
2016-05-03 16:59             ` Jens Axboe
2016-05-03 18:14               ` Jens Axboe
2016-05-03 19:07                 ` Jens Axboe
2016-04-26 15:55 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
2016-04-27 18:01 ` [PATCHSET v5] Make background writeback great again for the first time Jan Kara
2016-04-27 18:17   ` Jens Axboe
2016-04-27 20:37     ` Jens Axboe
2016-04-27 20:59       ` Jens Axboe
2016-04-28  4:06         ` xiakaixu
2016-04-28 18:36           ` Jens Axboe
2016-04-28 11:54         ` Jan Kara
2016-04-28 18:46           ` Jens Axboe
2016-05-03 12:17             ` Jan Kara
2016-05-03 12:40               ` Chris Mason
2016-05-03 13:06                 ` Jan Kara
2016-05-03 13:42                   ` Chris Mason
2016-05-03 13:57                     ` Jan Kara
2016-05-11 16:36               ` Jan Kara
2016-05-13 18:29                 ` Jens Axboe
2016-05-16  7:47                   ` Jan Kara
2016-08-31 17:05 [PATCHSET v6] Throttled background buffered writeback Jens Axboe
2016-08-31 17:05 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe
2016-09-01 18:05   ` Omar Sandoval
2016-09-01 18:51     ` Jens Axboe
2016-09-07 14:46 [PATCH 0/8] Throttled background buffered writeback v7 Jens Axboe
2016-09-07 14:46 ` [PATCH 7/8] wbt: add general throttling mechanism Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).