* [PATCHSET v3][RFC] Make background writeback not suck
@ 2016-03-30 15:07 Jens Axboe
  2016-03-30 15:07 ` [PATCH 1/9] writeback: propagate the various reasons for writeback Jens Axboe
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block

Hi,

This patchset isn't so much a final solution as it is a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try to start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

Last time I posted this, I used flash storage as the example. But
this works equally well on rotating storage. Let's run a test case
that writes a lot. This test writes 50 files, each 100M, on XFS on
a regular hard drive. While this happens, we attempt to read
another file with fio.
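
The write-files helper isn't included in this posting; as a minimal
sketch of a workload along these lines - the script contents and the
fio options here are illustrative, not the actual test harness:

$ cat write-files
#!/bin/bash
# write 50 files, 100M each, concurrently through the page cache
for i in $(seq 0 49); do
	dd if=/dev/zero of=file-$i bs=1M count=100 &
done
wait

$ fio --name=reader --filename=some-other-file --rw=randread --bs=4k --size=1g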

Writers:

$ time (./write-files ; sync)
real	1m6.304s
user	0m0.020s
sys	0m12.210s

Fio reader:

  read : io=35580KB, bw=550868B/s, iops=134, runt= 66139msec
    clat (usec): min=40, max=654204, avg=7432.37, stdev=43872.83
     lat (usec): min=40, max=654204, avg=7432.70, stdev=43872.83
    clat percentiles (usec):
     |  1.00th=[   41],  5.00th=[   41], 10.00th=[   41], 20.00th=[   42],
     | 30.00th=[   42], 40.00th=[   42], 50.00th=[   43], 60.00th=[   52],
     | 70.00th=[   59], 80.00th=[   65], 90.00th=[   87], 95.00th=[ 1192],
     | 99.00th=[254976], 99.50th=[358400], 99.90th=[444416], 99.95th=[468992],
     | 99.99th=[651264]


Let's run the same test, but with the patches applied, and wb_percent
set to 10%:
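
wb_percent is the queue sysfs tunable added in the last patch; setting
it looks like this, with the device name just being an example:

# echo 10 > /sys/block/sda/queue/wb_percent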

Writers:

$ time (./write-files ; sync)
real	1m29.384s
user	0m0.040s
sys	0m10.810s

Fio reader:

  read : io=1024.0MB, bw=18640KB/s, iops=4660, runt= 56254msec
    clat (usec): min=39, max=408400, avg=212.05, stdev=2982.44
     lat (usec): min=39, max=408400, avg=212.30, stdev=2982.44
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   41], 10.00th=[   41], 20.00th=[   41],
     | 30.00th=[   42], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
     | 70.00th=[   43], 80.00th=[   45], 90.00th=[   56], 95.00th=[   60],
     | 99.00th=[  454], 99.50th=[ 8768], 99.90th=[36608], 99.95th=[43264],
     | 99.99th=[69120]


Much better, both in the P99.x percentiles and of course on the
bandwidth front. It's the difference between this:

---io---- -system-- ------cpu-----
 bi    bo   in   cs us sy id wa st
 20636 45056 5593 10833  0  0 94  6  0
 16416 46080 4484 8666  0  0 94  6  0
 16960 47104 5183 8936  0  0 94  6  0

and this

---io---- -system-- ------cpu-----
 bi    bo   in   cs us sy id wa st
   384 73728  571  558  0  0 95  5  0
   384 73728  548  545  0  0 95  5  0
   388 73728  575  763  0  0 96  4  0

in the vmstat output. It's not quite as bad as on deeper queue depth
devices, where we get hugely bursty IO, but it's still very slow.

If we don't run the competing reader, the dirty data writeback proceeds
at normal rates:

# time (./write-files ; sync)
real	1m6.919s
user	0m0.010s
sys	0m10.900s


The above was run without scsi-mq, and using the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO
scheduling is in place for this test; it's not pure blk-mq without
scheduling.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases
that need to clean memory at or near device speed get to do that,
and they still don't need thousands of requests to accomplish it. For
the cases where we don't need to be near device limits, we can clean
at a more reasonable pace. See the last patch in the series for a more
detailed description of the change, and the tunable.
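
Condensed, the limit selection per buffered write ends up looking
roughly like this - a paraphrase of get_limit() from the last patch,
not the verbatim code:

	if ((rw & REQ_SYNC) || dirtier_sleeping_in_balance_dirty_pages)
		limit = wb_max;		/* someone is waiting, clean at full speed */
	else if (time_before(jiffies, last_unrelated_comp + HZ / 10))
		limit = wb_idle;	/* recent foreground IO, back off */
	else
		limit = wb_normal;	/* plain background writeback */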

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable; I have not observed any problems. It passes
full xfstest runs, and a variety of benchmarks as well. It works
equally well on blk-mq/scsi-mq and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. The patches
are against current Linus' git, 4.6.0-rc1; I can make them available
against 4.5 as well, if there's any interest in that for testing
purposes.

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change
  between wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in
  the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups


 block/Makefile                   |    2 
 block/blk-core.c                 |   15 ++
 block/blk-mq.c                   |   31 ++++-
 block/blk-settings.c             |   20 +++
 block/blk-sysfs.c                |  128 ++++++++++++++++++++
 block/blk-wb.c                   |  238 +++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h                   |   33 +++++
 drivers/nvme/host/core.c         |    1 
 drivers/scsi/scsi.c              |    3 
 drivers/scsi/sd.c                |    5 
 fs/block_dev.c                   |    2 
 fs/buffer.c                      |    2 
 fs/f2fs/data.c                   |    2 
 fs/f2fs/node.c                   |    2 
 fs/fs-writeback.c                |   14 ++
 fs/gfs2/meta_io.c                |    3 
 fs/mpage.c                       |    9 -
 fs/xfs/xfs_aops.c                |    2 
 include/linux/backing-dev-defs.h |    2 
 include/linux/blk_types.h        |    2 
 include/linux/blkdev.h           |   18 ++
 include/linux/writeback.h        |    8 +
 mm/page-writeback.c              |    2 
 23 files changed, 528 insertions(+), 16 deletions(-)


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/9] writeback: propagate the various reasons for writeback
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:07 ` [PATCH 2/9] writeback: add wbc_to_write() Jens Axboe
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Avoid losing context by propagating the various reasons why we
initiate writeback. If we are doing more important reclaim or
synchronous writeback, the lower levels should know about it.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/fs-writeback.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fee81e8768c9..4300ee7b1139 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int for_reclaim:1;	/* for mem reclaim */
 	unsigned int auto_free:1;	/* free on completion */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
@@ -944,6 +945,18 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	work->reason	= reason;
 	work->auto_free	= 1;
 
+	switch (reason) {
+	case WB_REASON_TRY_TO_FREE_PAGES:
+	case WB_REASON_FREE_MORE_MEM:
+		work->for_reclaim = 1;
+		break;
+	case WB_REASON_SYNC:
+		work->for_sync = 1;
+		break;
+	default:
+		break;
+	}
+
 	wb_queue_work(wb, work);
 }
 
@@ -1446,6 +1458,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.for_sync		= work->for_sync,
+		.for_reclaim		= work->for_reclaim,
 		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/9] writeback: add wbc_to_write()
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
  2016-03-30 15:07 ` [PATCH 1/9] writeback: propagate the various reasons for writeback Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:07 ` [PATCH 3/9] writeback: use WRITE_SYNC for reclaim or sync writeback Jens Axboe
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Add wbc_to_write(), which returns the write type to use, based on a
struct writeback_control. No functional changes in this patch, but it
prepares us for factoring other wbc fields into the write type.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/block_dev.c            | 2 +-
 fs/buffer.c               | 2 +-
 fs/f2fs/data.c            | 2 +-
 fs/f2fs/node.c            | 2 +-
 fs/gfs2/meta_io.c         | 3 +--
 fs/mpage.c                | 9 ++++-----
 fs/xfs/xfs_aops.c         | 2 +-
 include/linux/writeback.h | 8 ++++++++
 8 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3172c4e2f502..b11d4e08b9a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 			struct page *page, struct writeback_control *wbc)
 {
 	int result;
-	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+	int rw = wbc_to_write(wbc);
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	if (!ops->rw_page || bdev_get_integrity(bdev))
diff --git a/fs/buffer.c b/fs/buffer.c
index 33be29675358..28273caaf2b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr_underway = 0;
-	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = wbc_to_write(wbc);
 
 	head = create_page_buffers(page, inode,
 					(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index e5c762b37239..dca5d43c67a3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1143,7 +1143,7 @@ static int f2fs_write_data_page(struct page *page,
 	struct f2fs_io_info fio = {
 		.sbi = sbi,
 		.type = DATA,
-		.rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+		.rw = wbc_to_write(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 118321bd1a7f..db9201f45bf1 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
 	struct f2fs_io_info fio = {
 		.sbi = sbi,
 		.type = NODE,
-		.rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+		.rw = wbc_to_write(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index e137d96f1b17..ede87306caa5 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 {
 	struct buffer_head *bh, *head;
 	int nr_underway = 0;
-	int write_op = REQ_META | REQ_PRIO |
-		(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = REQ_META | REQ_PRIO | wbc_to_write(wbc);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index 6bd9fd90964e..9986c752f7bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
 	int ret = 0;
-	int wr = (wbc->sync_mode == WB_SYNC_ALL ?  WRITE_SYNC : WRITE);
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -595,7 +594,7 @@ page_is_mapped:
 	 * This page will go to BIO.  Do we need to send this BIO off first?
 	 */
 	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 
 alloc_new:
 	if (bio == NULL) {
@@ -622,7 +621,7 @@ alloc_new:
 	wbc_account_io(wbc, page, PAGE_SIZE);
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 		goto alloc_new;
 	}
 
@@ -632,7 +631,7 @@ alloc_new:
 	set_page_writeback(page);
 	unlock_page(page);
 	if (boundary || (first_unmapped != blocks_per_page)) {
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
@@ -644,7 +643,7 @@ alloc_new:
 
 confused:
 	if (bio)
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 
 	if (mpd->use_writepage) {
 		ret = mapping->a_ops->writepage(page, wbc);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d445a64b979e..239a612ea1d6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
 	atomic_inc(&ioend->io_remaining);
 	bio->bi_private = ioend;
 	bio->bi_end_io = xfs_end_bio;
-	submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
+	submit_bio(wbc_to_write(wbc), bio);
 }
 
 STATIC struct bio *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5d4e08..719c255e105a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,14 @@ struct writeback_control {
 #endif
 };
 
+static inline int wbc_to_write(struct writeback_control *wbc)
+{
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return WRITE_SYNC;
+
+	return WRITE;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in.  There always is one global
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/9] writeback: use WRITE_SYNC for reclaim or sync writeback
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
  2016-03-30 15:07 ` [PATCH 1/9] writeback: propagate the various reasons for writeback Jens Axboe
  2016-03-30 15:07 ` [PATCH 2/9] writeback: add wbc_to_write() Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:07 ` [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

If we're doing reclaim or sync IO, use WRITE_SYNC to inform the lower
levels of the importance of this IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/writeback.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 719c255e105a..b2c75b8901da 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -102,7 +102,7 @@ struct writeback_control {
 
 static inline int wbc_to_write(struct writeback_control *wbc)
 {
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->for_reclaim || wbc->for_sync)
 		return WRITE_SYNC;
 
 	return WRITE;
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (2 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 3/9] writeback: use WRITE_SYNC for reclaim or sync writeback Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-04-13 13:08   ` Jan Kara
  2016-03-30 15:07 ` [PATCH 5/9] block: add ability to flag write back caching on a device Jens Axboe
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Note in the bdi_writeback structure if a task is currently being
limited in balance_dirty_pages(), waiting for writeback to
proceed.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/backing-dev-defs.h | 2 ++
 mm/page-writeback.c              | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1b4d69f68c33..f702309216b4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	int dirty_sleeping;		/* waiting on dirty limit exceeded */
+
 	struct list_head bdi_node;	/* anchored at bdi->wb_list */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..15e696bc5d14 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
 					  pause,
 					  start_time);
 		__set_current_state(TASK_KILLABLE);
+		wb->dirty_sleeping = 1;
 		io_schedule_timeout(pause);
+		wb->dirty_sleeping = 0;
 
 		current->dirty_paused_when = now + pause;
 		current->nr_dirtied = 0;
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/9] block: add ability to flag write back caching on a device
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (3 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:42   ` Christoph Hellwig
  2016-03-30 15:07 ` [PATCH 6/9] sd: inform block layer of write cache state Jens Axboe
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
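
For instance, with this patch applied the state can be inspected and
overridden like so (the device name is just an example):

$ cat /sys/block/sda/queue/write_cache
write back
# echo write through > /sys/block/sda/queue/write_cache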

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-settings.c   | 11 +++++++++++
 block/blk-sysfs.c      | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |  4 ++++
 3 files changed, 54 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index c7bb666aafd1..4dbd511a9889 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -846,6 +846,17 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
 }
 EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 
+void blk_queue_write_cache(struct request_queue *q, bool enabled)
+{
+	spin_lock_irq(q->queue_lock);
+	if (enabled)
+		queue_flag_set(QUEUE_FLAG_WC, q);
+	else
+		queue_flag_clear(QUEUE_FLAG_WC, q);
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(blk_queue_write_cache);
+
 static int __init blk_settings_init(void)
 {
 	blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index dd93763057ce..954e510452d7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -347,6 +347,38 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wc_show(struct request_queue *q, char *page)
+{
+	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+		return sprintf(page, "write back\n");
+
+	return sprintf(page, "write through\n");
+}
+
+static ssize_t queue_wc_store(struct request_queue *q, const char *page,
+			      size_t count)
+{
+	int set = -1;
+
+	if (!strncmp(page, "write back", 10))
+		set = 1;
+	else if (!strncmp(page, "write through", 13) ||
+		 !strncmp(page, "none", 4))
+		set = 0;
+
+	if (set == -1)
+		return -EINVAL;
+
+	spin_lock_irq(q->queue_lock);
+	if (set)
+		queue_flag_set(QUEUE_FLAG_WC, q);
+	else
+		queue_flag_clear(QUEUE_FLAG_WC, q);
+	spin_unlock_irq(q->queue_lock);
+
+	return count;
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -478,6 +510,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
 	.store = queue_poll_store,
 };
 
+static struct queue_sysfs_entry queue_wc_entry = {
+	.attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wc_show,
+	.store = queue_wc_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -503,6 +541,7 @@ static struct attribute *default_attrs[] = {
 	&queue_iostats_entry.attr,
 	&queue_random_entry.attr,
 	&queue_poll_entry.attr,
+	&queue_wc_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7e5d7e018bea..76e875159e52 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -491,15 +491,18 @@ struct request_queue {
 #define QUEUE_FLAG_INIT_DONE   20	/* queue is initialized */
 #define QUEUE_FLAG_NO_SG_MERGE 21	/* don't attempt to merge SG segments*/
 #define QUEUE_FLAG_POLL	       22	/* IO polling enabled if set */
+#define QUEUE_FLAG_WC	       23	/* Write back caching */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
+				 (1 << QUEUE_FLAG_WC)		|	\
 				 (1 << QUEUE_FLAG_ADD_RANDOM))
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
+				 (1 << QUEUE_FLAG_WC)		|	\
 				 (1 << QUEUE_FLAG_POLL))
 
 static inline void queue_lockdep_assert_held(struct request_queue *q)
@@ -1009,6 +1012,7 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
 extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
+extern void blk_queue_write_cache(struct request_queue *q, bool enabled);
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/9] sd: inform block layer of write cache state
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (4 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 5/9] block: add ability to flag write back caching on a device Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:07 ` [PATCH 7/9] NVMe: " Jens Axboe
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 drivers/scsi/sd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a5457ac9cdb..049f424fb4ad 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -192,6 +192,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
 		sdkp->WCE = wce;
 		sdkp->RCD = rcd;
 		sd_set_flush_flag(sdkp);
+		blk_queue_write_cache(sdp->request_queue, wce != 0);
 		return count;
 	}
 
@@ -2571,7 +2572,7 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 				  sdkp->DPOFUA ? "supports DPO and FUA"
 				  : "doesn't support DPO or FUA");
 
-		return;
+		goto done;
 	}
 
 bad_sense:
@@ -2596,6 +2597,8 @@ defaults:
 	}
 	sdkp->RCD = 0;
 	sdkp->DPOFUA = 0;
+done:
+	blk_queue_write_cache(sdp->request_queue, sdkp->WCE != 0);
 }
 
 /*
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/9] NVMe: inform block layer of write cache state
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (5 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 6/9] sd: inform block layer of write cache state Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:07 ` [PATCH 8/9] block: add code to track actual device queue depth Jens Axboe
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

This isn't quite correct, since the VWC merely states whether a
potential write back cache is volatile or not. But for the purpose
of write absorption, it's good enough.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 drivers/nvme/host/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 643f457131c2..05c8edfb7611 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -906,6 +906,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
 		blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
 	blk_queue_virt_boundary(q, ctrl->page_size - 1);
+	blk_queue_write_cache(q, ctrl->vwc & NVME_CTRL_VWC_PRESENT);
 }
 
 /*
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/9] block: add code to track actual device queue depth
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (6 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 7/9] NVMe: " Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-30 15:07 ` [PATCH 9/9] writeback: throttle buffered writeback Jens Axboe
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.

Fill a hole in struct request_queue with a 'queue_depth' member that
drivers can set to more closely inform the block layer of the real
queue depth.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/blk-settings.c   |  6 ++++++
 drivers/scsi/scsi.c    |  3 +++
 include/linux/blkdev.h | 11 +++++++++++
 3 files changed, 20 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dbd511a9889..06e01682f827 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -820,6 +820,12 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+	q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
 /**
  * blk_queue_flush - configure queue's cache flush capability
  * @q:		the request queue for the device
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index b1bf42b93fcc..6503724865e7 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
 		wmb();
 	}
 
+	if (sdev->request_queue)
+		blk_set_queue_depth(sdev->request_queue, depth);
+
 	return sdev->queue_depth;
 }
 EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 76e875159e52..08b897b159d1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -315,6 +315,8 @@ struct request_queue {
 	struct blk_mq_ctx __percpu	*queue_ctx;
 	unsigned int		nr_queues;
 
+	unsigned int		queue_depth;
+
 	/* hw dispatch queues */
 	struct blk_mq_hw_ctx	**queue_hw_ctx;
 	unsigned int		nr_hw_queues;
@@ -683,6 +685,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
 	return false;
 }
 
+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+	if (q->queue_depth)
+		return q->queue_depth;
+
+	return q->nr_requests;
+}
+
 /*
  * q->prep_rq_fn return values
  */
@@ -986,6 +996,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
 extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 9/9] writeback: throttle buffered writeback
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (7 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 8/9] block: add code to track actual device queue depth Jens Axboe
@ 2016-03-30 15:07 ` Jens Axboe
  2016-03-31  8:24 ` [PATCHSET v3][RFC] Make background writeback not suck Dave Chinner
  2016-03-31 22:09 ` Holger Hoffstätte
  10 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:07 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Test patch that throttles buffered writeback to make it a lot
smoother, with way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has a heavy impact on foreground workloads,
which isn't ideal. We can't easily limit the sizes of the writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

This is just a test patch, and as such, it registers a queue sysfs
entry to both monitor the current state:

$ cat /sys/block/nvme0n1/queue/wb_stats
idle=16, normal=32, max=64, inflight=0, wait=0, timer=0, bdp_wait=0

'idle' denotes how many requests we will allow inflight for idle
buffered writeback, 'normal' for higher priority writeback, and 'max'
for when it's urgent that we clean pages. The values are calculated
based on the queue depth of the device and the 'wb_percent' setting.
If 'wb_percent' is set to zero, the functionality is turned off.

'inflight' shows how many requests are currently inflight for buffered
writeback, 'wait' shows if anyone is currently waiting for access,
'timer' shows if we have processes being deferred by the write back
cache timeout, and 'bdp_wait' shows if someone is currently throttled
on this device in balance_dirty_pages().

Finally, if the device has write back caching, 'wb_cache_usecs' delays
further background writeback by that many usecs after a write completes.

It'd be nice to auto-tune 'wb_percent' based on device response. Flash
is less picky than rotating storage, but still needs throttling. For
flash storage, a wb_percent setting of 50% gives good read latencies
while still having good write bandwidth. For rotating storage, lower
settings (like 10-15%) are more reasonable.
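
As a worked example of how the limits scale (using the calc_wb_limits()
logic below; the device numbers are illustrative): for a drive with a
queue depth of 32 and wb_percent=50, we get

	wb_max    = (50 * 32 + 1) / 100 = 16;	/* urgent: sync/reclaim/bdp waiters */
	wb_normal = (16 + 1) / 2 = 8;		/* higher priority writeback */
	wb_idle   = (16 + 3) / 4 = 4;		/* idle background writeback */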

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 block/Makefile            |   2 +-
 block/blk-core.c          |  15 +++
 block/blk-mq.c            |  31 +++++-
 block/blk-settings.c      |   3 +
 block/blk-sysfs.c         |  89 +++++++++++++++++
 block/blk-wb.c            | 238 ++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h            |  33 +++++++
 include/linux/blk_types.h |   2 +
 include/linux/blkdev.h    |   3 +
 9 files changed, 413 insertions(+), 3 deletions(-)
 create mode 100644 block/blk-wb.c
 create mode 100644 block/blk-wb.h

diff --git a/block/Makefile b/block/Makefile
index 9eda2322b2d4..9df911a3b569 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
 obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
-			blk-lib.o blk-mq.o blk-mq-tag.o \
+			blk-lib.o blk-mq.o blk-mq-tag.o blk-wb.o \
 			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 827f8badd143..85a92cd6047b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-wb.h"
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -863,6 +864,9 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 	 */
 	blk_queue_make_request(q, blk_queue_bio);
 
+	if (blk_wb_init(q))
+		goto fail;
+
 	q->sg_reserved_size = INT_MAX;
 
 	/* Protect q->elevator from elevator_change */
@@ -880,6 +884,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 
 fail:
 	blk_free_flush_queue(q->fq);
+	blk_wb_exit(q);
 	return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1485,6 +1490,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	/* this is a bio leak */
 	WARN_ON(req->bio != NULL);
 
+	blk_wb_done(q->rq_wb, req);
+
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
 	 * it didn't come out of our reserved rq pools
@@ -1714,6 +1721,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
 	struct request *req;
 	unsigned int request_count = 0;
+	bool wb_acct;
 
 	/*
 	 * low level driver can indicate that it wants pages above a
@@ -1766,6 +1774,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	}
 
 get_rq:
+	wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
+
 	/*
 	 * This sync check and mask will be re-done in init_request_from_bio(),
 	 * but we need to set it earlier to expose the sync flag to the
@@ -1781,11 +1791,16 @@ get_rq:
 	 */
 	req = get_request(q, rw_flags, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		if (wb_acct)
+			__blk_wb_done(q->rq_wb);
 		bio->bi_error = PTR_ERR(req);
 		bio_endio(bio);
 		goto out_unlock;
 	}
 
+	if (wb_acct)
+		req->cmd_flags |= REQ_BUF_INFLIGHT;
+
 	/*
 	 * After dropping the lock and possibly sleeping here, our request
 	 * may now be mergeable after it had proven unmergeable (above).
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1699baf39b78..437cdc9b429c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
+#include "blk-wb.h"
 
 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
@@ -274,6 +275,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+
+	blk_wb_done(q->rq_wb, rq);
+
 	rq->cmd_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -1253,6 +1257,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
+	bool wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1270,9 +1275,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct)
+			__blk_wb_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	if (wb_acct)
+		rq->cmd_flags |= REQ_BUF_INFLIGHT;
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1349,6 +1362,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_map_ctx data;
 	struct request *rq;
 	blk_qc_t cookie;
+	bool wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1363,9 +1377,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return BLK_QC_T_NONE;
 
+	wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct)
+			__blk_wb_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	if (wb_acct)
+		rq->cmd_flags |= REQ_BUF_INFLIGHT;
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -2062,6 +2084,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	 */
 	q->nr_requests = set->queue_depth;
 
+	if (blk_wb_init(q))
+		goto err_hctxs;
+
 	if (set->ops->complete)
 		blk_queue_softirq_done(q, set->ops->complete);
 
@@ -2097,6 +2122,8 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
+	blk_wb_exit(q);
+
 	blk_mq_del_queue_tag_set(q);
 
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 06e01682f827..bd713a8aa755 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -13,6 +13,7 @@
 #include <linux/gfp.h>
 
 #include "blk.h"
+#include "blk-wb.h"
 
 unsigned long blk_max_low_pfn;
 EXPORT_SYMBOL(blk_max_low_pfn);
@@ -823,6 +824,8 @@ EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
+	if (q->rq_wb)
+		blk_wb_update_limits(q->rq_wb, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 954e510452d7..2afd5cb8f003 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@
 
 #include "blk.h"
 #include "blk-mq.h"
+#include "blk-wb.h"
 
 struct queue_sysfs_entry {
 	struct attribute attr;
@@ -347,6 +348,76 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wb_stats_show(struct request_queue *q, char *page)
+{
+	struct rq_wb *wb = q->rq_wb;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "idle=%d, normal=%d, max=%d, inflight=%d, wait=%d,"
+				" timer=%d, bdp_wait=%d\n", wb->wb_idle,
+					wb->wb_normal, wb->wb_max,
+					atomic_read(&wb->inflight),
+					waitqueue_active(&wb->wait),
+					timer_pending(&wb->timer),
+					*wb->bdp_wait);
+}
+
+static ssize_t queue_wb_perc_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return queue_var_show(q->rq_wb->perc, page);
+}
+
+static ssize_t queue_wb_perc_store(struct request_queue *q, const char *page,
+				   size_t count)
+{
+	unsigned long perc;
+	ssize_t ret;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store(&perc, page, count);
+	if (ret < 0)
+		return ret;
+	if (perc > 100)
+		return -EINVAL;
+
+	q->rq_wb->perc = perc;
+	blk_wb_update_limits(q->rq_wb, blk_queue_depth(q));
+	return ret;
+}
+
+static ssize_t queue_wb_cache_delay_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return queue_var_show(q->rq_wb->cache_delay_usecs, page);
+}
+
+static ssize_t queue_wb_cache_delay_store(struct request_queue *q,
+					  const char *page, size_t count)
+{
+	unsigned long var;
+	ssize_t ret;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store(&var, page, count);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->cache_delay_usecs = var;
+	q->rq_wb->cache_delay = usecs_to_jiffies(var);
+	return ret;
+}
+
 static ssize_t queue_wc_show(struct request_queue *q, char *page)
 {
 	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -516,6 +587,21 @@ static struct queue_sysfs_entry queue_wc_entry = {
 	.store = queue_wc_store,
 };
 
+static struct queue_sysfs_entry queue_wb_stats_entry = {
+	.attr = {.name = "wb_stats", .mode = S_IRUGO },
+	.show = queue_wb_stats_show,
+};
+static struct queue_sysfs_entry queue_wb_cache_delay_entry = {
+	.attr = {.name = "wb_cache_usecs", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_cache_delay_show,
+	.store = queue_wb_cache_delay_store,
+};
+static struct queue_sysfs_entry queue_wb_perc_entry = {
+	.attr = {.name = "wb_percent", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_perc_show,
+	.store = queue_wb_perc_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -542,6 +628,9 @@ static struct attribute *default_attrs[] = {
 	&queue_random_entry.attr,
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
+	&queue_wb_stats_entry.attr,
+	&queue_wb_cache_delay_entry.attr,
+	&queue_wb_perc_entry.attr,
 	NULL,
 };
 
diff --git a/block/blk-wb.c b/block/blk-wb.c
new file mode 100644
index 000000000000..d93dd1ccf16a
--- /dev/null
+++ b/block/blk-wb.c
@@ -0,0 +1,238 @@
+/*
+ * buffered writeback throttling
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that need changing:
+ *
+ *	- Auto-detection of optimal wb_percent setting. A lower setting
+ *	  is appropriate on rotating storage (wb_percent=15 gives good
+ *	  separation, while still getting full bandwidth with wb cache).
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+
+#include "blk.h"
+#include "blk-wb.h"
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+	return rwb->wb_normal != 0;
+}
+
+void __blk_wb_done(struct rq_wb *rwb)
+{
+	int inflight, limit = rwb->wb_normal;
+
+	inflight = atomic_dec_return(&rwb->inflight);
+	if (inflight >= limit)
+		return;
+
+	/*
+	 * If the device does caching, we can still flood it with IO
+	 * even at a low depth. If caching is on, delay a bit before
+	 * submitting the next, if we're still purely background
+	 * activity.
+	 */
+	if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) && !*rwb->bdp_wait &&
+	    time_before(jiffies, rwb->last_comp + rwb->cache_delay)) {
+		if (!timer_pending(&rwb->timer))
+			mod_timer(&rwb->timer, jiffies + rwb->cache_delay);
+		return;
+	}
+
+	if (waitqueue_active(&rwb->wait)) {
+		int diff = limit - inflight;
+
+		if (diff >= rwb->wb_idle / 2)
+			wake_up_nr(&rwb->wait, 1);
+	}
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void blk_wb_done(struct rq_wb *rwb, struct request *rq)
+{
+	if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) {
+		if (rwb_enabled(rwb)) {
+			const unsigned long cur = jiffies;
+
+			if (cur != rwb->last_comp)
+				rwb->last_comp = cur;
+		}
+	} else
+		__blk_wb_done(rwb);
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+	int cur = atomic_read(v);
+
+	for (;;) {
+		int old;
+
+		if (cur >= below)
+			return false;
+		old = atomic_cmpxchg(v, cur, cur + 1);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+
+	return true;
+}
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned int rw)
+{
+	unsigned int limit;
+
+	/*
+	 * At this point we know it's a buffered write. If REQ_SYNC is
+	 * set, then it's WB_SYNC_ALL writeback. Bump the limit 4x for
+	 * those, since someone is (or will be) waiting on that.
+	 */
+	if ((rw & REQ_SYNC) || *rwb->bdp_wait)
+		limit = rwb->wb_max;
+	else if (time_before(jiffies, rwb->last_comp + HZ / 10)) {
+		/*
+		 * If less than 100ms since we completed unrelated IO,
+		 * limit us to half the depth for background writeback.
+		 */
+		limit = rwb->wb_idle;
+	} else
+		limit = rwb->wb_normal;
+
+	return limit;
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __blk_wb_wait(struct rq_wb *rwb, unsigned int rw, spinlock_t *lock)
+{
+	DEFINE_WAIT(wait);
+
+	if (!timer_pending(&rwb->timer) &&
+	    atomic_inc_below(&rwb->inflight, get_limit(rwb, rw)))
+		return;
+
+	do {
+		prepare_to_wait_exclusive(&rwb->wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+
+		if (!timer_pending(&rwb->timer) &&
+		    atomic_inc_below(&rwb->inflight, get_limit(rwb, rw)))
+			break;
+
+		if (lock)
+			spin_unlock_irq(lock);
+
+		io_schedule();
+
+		if (lock)
+			spin_lock_irq(lock);
+	} while (1);
+
+	finish_wait(&rwb->wait, &wait);
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+bool blk_wb_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+{
+	/*
+	 * If disabled, or not a WRITE (or a discard), do nothing
+	 */
+	if (!rwb_enabled(rwb) || !(bio->bi_rw & REQ_WRITE) ||
+	    (bio->bi_rw & REQ_DISCARD))
+		return false;
+
+	/*
+	 * Don't throttle WRITE_ODIRECT
+	 */
+	if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+		return false;
+
+	__blk_wb_wait(rwb, bio->bi_rw, lock);
+	return true;
+}
+
+static void calc_wb_limits(struct rq_wb *rwb, unsigned int depth,
+			   unsigned int perc)
+{
+	/*
+	 * We'll use depth==64 as a reasonable max limit that should be able
+	 * to achieve full device bandwidth anywhere.
+	 */
+	depth = min(64U, depth);
+
+	/*
+	 * Full perf writes are max 'perc' percentage of the depth
+	 */
+	rwb->wb_max = (perc * depth + 1) / 100;
+	if (!rwb->wb_max && perc)
+		rwb->wb_max = 1;
+	rwb->wb_normal = (rwb->wb_max + 1) / 2;
+	rwb->wb_idle = (rwb->wb_max + 3) / 4;
+}
+
+void blk_wb_update_limits(struct rq_wb *rwb, unsigned int depth)
+{
+	calc_wb_limits(rwb, depth, rwb->perc);
+	wake_up_all(&rwb->wait);
+}
+
+static void blk_wb_timer(unsigned long data)
+{
+	struct rq_wb *rwb = (struct rq_wb *) data;
+
+	if (waitqueue_active(&rwb->wait))
+		wake_up_nr(&rwb->wait, 1);
+}
+
+#define DEF_WB_PERC		50
+#define DEF_WB_CACHE_DELAY	10000
+
+int blk_wb_init(struct request_queue *q)
+{
+	struct rq_wb *rwb;
+
+	rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+	if (!rwb)
+		return -ENOMEM;
+
+	atomic_set(&rwb->inflight, 0);
+	init_waitqueue_head(&rwb->wait);
+	rwb->last_comp = jiffies;
+	rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping;
+	setup_timer(&rwb->timer, blk_wb_timer, (unsigned long) rwb);
+	rwb->perc = DEF_WB_PERC;
+	rwb->cache_delay_usecs = DEF_WB_CACHE_DELAY;
+	rwb->cache_delay = usecs_to_jiffies(rwb->cache_delay_usecs);
+	rwb->q = q;
+	blk_wb_update_limits(rwb, blk_queue_depth(q));
+	q->rq_wb = rwb;
+	return 0;
+}
+
+void blk_wb_exit(struct request_queue *q)
+{
+	if (q->rq_wb)
+		del_timer_sync(&q->rq_wb->timer);
+
+	kfree(q->rq_wb);
+	q->rq_wb = NULL;
+}
diff --git a/block/blk-wb.h b/block/blk-wb.h
new file mode 100644
index 000000000000..201bc00ac7a7
--- /dev/null
+++ b/block/blk-wb.h
@@ -0,0 +1,33 @@
+#ifndef BLK_WB_H
+#define BLK_WB_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+
+struct rq_wb {
+	/*
+	 * Settings that govern how we throttle
+	 */
+	unsigned int perc;			/* INPUT */
+	unsigned int wb_idle;			/* idle writeback */
+	unsigned int wb_normal;			/* normal writeback */
+	unsigned int wb_max;			/* max throughput writeback */
+
+	unsigned int cache_delay;
+	unsigned int cache_delay_usecs;
+	unsigned long last_comp;
+	int *bdp_wait;
+	struct request_queue *q;
+	atomic_t inflight;
+	wait_queue_head_t wait;
+	struct timer_list timer;
+};
+
+void __blk_wb_done(struct rq_wb *);
+void blk_wb_done(struct rq_wb *, struct request *);
+bool blk_wb_wait(struct rq_wb *, struct bio *, spinlock_t *);
+int blk_wb_init(struct request_queue *);
+void blk_wb_exit(struct request_queue *);
+void blk_wb_update_limits(struct rq_wb *, unsigned int);
+
+#endif
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..6f2a174b771c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,6 +188,7 @@ enum rq_flag_bits {
 	__REQ_PM,		/* runtime pm request */
 	__REQ_HASHED,		/* on IO scheduler merge hash */
 	__REQ_MQ_INFLIGHT,	/* track inflight for MQ */
+	__REQ_BUF_INFLIGHT,	/* track inflight for buffered */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -241,6 +242,7 @@ enum rq_flag_bits {
 #define REQ_PM			(1ULL << __REQ_PM)
 #define REQ_HASHED		(1ULL << __REQ_HASHED)
 #define REQ_MQ_INFLIGHT		(1ULL << __REQ_MQ_INFLIGHT)
+#define REQ_BUF_INFLIGHT	(1ULL << __REQ_BUF_INFLIGHT)
 
 typedef unsigned int blk_qc_t;
 #define BLK_QC_T_NONE	-1U
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08b897b159d1..ee9b90ff4fde 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -37,6 +37,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
+struct rq_wb;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -290,6 +291,8 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct rq_wb		*rq_wb;
+
 	/*
 	 * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
 	 * is used, root blkg allocates from @q->root_rl and all other
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] block: add ability to flag write back caching on a device
  2016-03-30 15:07 ` [PATCH 5/9] block: add ability to flag write back caching on a device Jens Axboe
@ 2016-03-30 15:42   ` Christoph Hellwig
  2016-03-30 15:46     ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2016-03-30 15:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Wed, Mar 30, 2016 at 09:07:53AM -0600, Jens Axboe wrote:
> Add an internal helper and flag for setting whether a queue has
> write back caching, or write through (or none). Add a sysfs file
> to show this as well, and make it changeable from user space.

As per previous discussion:  I don't really care about the way we
tell the block layer we have a writeback cache, but a NAK to having
each driver call two interfaces to convey the same information.

Please just look at q->flush_flags & REQ_FLUSH for now, and then
improve the interface if you don't like it (I don't particularly like
it, but not to the point that I'm motivated enough to fix it :))

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] block: add ability to flag write back caching on a device
  2016-03-30 15:42   ` Christoph Hellwig
@ 2016-03-30 15:46     ` Jens Axboe
  2016-03-30 16:23       ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 15:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/30/2016 09:42 AM, Christoph Hellwig wrote:
> On Wed, Mar 30, 2016 at 09:07:53AM -0600, Jens Axboe wrote:
>> Add an internal helper and flag for setting whether a queue has
>> write back caching, or write through (or none). Add a sysfs file
>> to show this as well, and make it changeable from user space.
>
> As per previous discussion:  I don't really care about the way we
> tell the block layer we have a writeback cache, but a NAK to having
> each driver call two interfaces to convey the same information.
>
> Please just look at q->flush_flags & REQ_FLUSH for now, and then
> improve the interface if you don't like it (I don't particularly like
> it, but not to the point that I'm motivated enough to fix it :))

That's fine, I don't mind making that change; I just didn't do it for
this version. I'd prefer that we change the cache flagging to be the
primary way to signal it, as it's more intuitive than REQ_FLUSH.

It'll be in the next version.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] block: add ability to flag write back caching on a device
  2016-03-30 15:46     ` Jens Axboe
@ 2016-03-30 16:23       ` Jens Axboe
  2016-03-30 17:29         ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-03-30 16:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/30/2016 09:46 AM, Jens Axboe wrote:
> On 03/30/2016 09:42 AM, Christoph Hellwig wrote:
>> On Wed, Mar 30, 2016 at 09:07:53AM -0600, Jens Axboe wrote:
>>> Add an internal helper and flag for setting whether a queue has
>>> write back caching, or write through (or none). Add a sysfs file
>>> to show this as well, and make it changeable from user space.
>>
>> As per previous discussion:  I don't really care about the way we
>> tell the block layer we have a writeback cache, but a NAK to having
>> each driver call two interfaces to convey the same information.
>>
>> Please just look at q->flush_flags & REQ_FLUSH for now, and then
>> improve the interface if you don't like it (I don't particularly like
>> it, but not to the point that I'm motivated enough to fix it :))
>
> That's fine, I don't mind making that change; I just didn't do it for
> this version. I'd prefer that we change the cache flagging to be the
> primary way to signal it, as it's more intuitive than REQ_FLUSH.
>
> It'll be in the next version.

Conversion series in here now:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/9] block: add ability to flag write back caching on a device
  2016-03-30 16:23       ` Jens Axboe
@ 2016-03-30 17:29         ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2016-03-30 17:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-block

On Wed, Mar 30, 2016 at 10:23:50AM -0600, Jens Axboe wrote:
> Conversion series in here now:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The new API looks reasonable to me.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (8 preceding siblings ...)
  2016-03-30 15:07 ` [PATCH 9/9] writeback: throttle buffered writeback Jens Axboe
@ 2016-03-31  8:24 ` Dave Chinner
  2016-03-31 14:29   ` Jens Axboe
  2016-03-31 22:09 ` Holger Hoffstätte
  10 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2016-03-31  8:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> Hi,
> 
> This patchset isn't so much a final solution as it is a demonstration
> of what I believe is a huge issue. Since the dawn of time, our
> background buffered writeback has sucked. When we do background
> buffered writeback, it should have little impact on foreground
> activity. That's the definition of background activity... But for as
> long as I can remember, heavy buffered writers have not behaved like
> that. For instance, if I do something like this:
> 
> $ dd if=/dev/zero of=foo bs=1M count=10k
> 
> on my laptop, and then try to start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts database reads or sync writes. When that happens, I get people
> yelling at me.
> 
> Last time I posted this, I used flash storage as the example. But
> this works equally well on rotating storage. Let's run a test case
> that writes a lot. This test writes 50 files, each 100M, on XFS on
> a regular hard drive. While this happens, we attempt to read
> another file with fio.
> 
> Writers:
> 
> $ time (./write-files ; sync)
> real	1m6.304s
> user	0m0.020s
> sys	0m12.210s

Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test.
(first google hit https://lkml.org/lkml/2013/9/10/46)

This generates thousands of REQ_WRITE metadata IOs every second, so
if I understand how the throttle works correctly, these would be
classified as background writeback by the block layer throttle.
And....

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     255845.0         10796891
     0      3200000            0     261348.8         10842349
     0      4800000            0     249172.3         14121232
     0      6400000            0     245172.8         12453759
     0      8000000            0     201249.5         14293100
     0      9600000            0     200417.5         29496551
>>>> 0     11200000            0      90399.6         40665397
     0     12800000            0     212265.6         21839031
     0     14400000            0     206398.8         32598378
     0     16000000            0     197589.7         26266552
     0     17600000            0     206405.2         16447795
>>>> 0     19200000            0      99189.6         87650540
     0     20800000            0     249720.8         12294862
     0     22400000            0     138523.8         47330007
>>>> 0     24000000            0      85486.2         14271096
     0     25600000            0     157538.1         64430611
     0     27200000            0     109677.8         47835961
     0     28800000            0     207230.5         31301031
     0     30400000            0     188739.6         33750424
     0     32000000            0     174197.9         41402526
     0     33600000            0     139152.0        100838085
     0     35200000            0     203729.7         34833764
     0     36800000            0     228277.4         12459062
>>>> 0     38400000            0      94962.0         30189182
     0     40000000            0     166221.9         40564922
>>>> 0     41600000            0      62902.5         80098461
     0     43200000            0     217932.6         22539354
     0     44800000            0     189594.6         24692209
     0     46400000            0     137834.1         39822038
     0     48000000            0     240043.8         12779453
     0     49600000            0     176830.8         16604133
     0     51200000            0     180771.8         32860221

real    5m35.967s
user    3m57.054s
sys     48m53.332s

In those highlighted report points, the performance has dropped
significantly. The typical range I expect to see once memory has
filled (a bit over 8m inodes) is 180k-220k.  Runtime on a vanilla
kernel was 4m40s and there were no performance drops, so this
workload runs almost a minute slower with the block layer throttling
code.

What I see in these performance dips is the XFS transaction
subsystem stalling *completely* - instead of running at a steady
state of around 350,000 transactions/s, there are *zero*
transactions running for periods of up to ten seconds.  This
coincides with the CPU usage falling to almost zero as well.
AFAICT, the only thing that is running when the filesystem stalls
like this is memory reclaim.

Without the block throttling patches, the workload quickly finds a
steady state of around 7.5-8.5 million cached inodes, and it doesn't
vary much outside those bounds. With the block throttling patches,
on every transaction subsystem stall that occurs, the inode cache
gets 3-4 million inodes trimmed out of it (i.e. half the
cache), and in a couple of cases I saw it trim 6+ million inodes from
the cache before the transactions started up and the cache started
growing again.

> The above was run without scsi-mq, and with using the deadline scheduler,
> results with CFQ are similarly depressing for this test. So IO scheduling
> is in place for this test, it's not pure blk-mq without scheduling.

virtio in guest, XFS direct IO -> no-op -> scsi in host.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-31  8:24 ` [PATCHSET v3][RFC] Make background writeback not suck Dave Chinner
@ 2016-03-31 14:29   ` Jens Axboe
  2016-03-31 16:21     ` Jens Axboe
  2016-04-01  0:46     ` Dave Chinner
  0 siblings, 2 replies; 34+ messages in thread
From: Jens Axboe @ 2016-03-31 14:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/31/2016 02:24 AM, Dave Chinner wrote:
> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>> Hi,
>>
>> This patchset isn't as much a final solution, as it's demonstration
>> of what I believe is a huge issue. Since the dawn of time, our
>> background buffered writeback has sucked. When we do background
>> buffered writeback, it should have little impact on foreground
>> activity. That's the definition of background activity... But for as
>> long as I can remember, heavy buffered writers has not behaved like
>> that. For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop, and then try and start chrome, it basically won't start
>> before the buffered writeback is done. Or, for server oriented
>> workloads, where installation of a big RPM (or similar) adversely
>> impacts data base reads or sync writes. When that happens, I get people
>> yelling at me.
>>
>> Last time I posted this, I used flash storage as the example. But
>> this works equally well on rotating storage. Let's run a test case
>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>> a regular hard drive. While this happens, we attempt to read
>> another file with fio.
>>
>> Writers:
>>
>> $ time (./write-files ; sync)
>> real	1m6.304s
>> user	0m0.020s
>> sys	0m12.210s
>
> Great. So a basic IO test looks good - let's throw something more
> complex at it. Say, a benchmark I've been using for years to stress
> the IO subsystem, the filesystem and memory reclaim all at the same
> time: a concurrent fsmark inode creation test.
> (first google hit https://lkml.org/lkml/2013/9/10/46)

Is that how you are invoking it as well, with the same arguments?

> This generates thousands of REQ_WRITE metadata IOs every second, so
> if I understand how the throttle works correctly, these would be
> classified as background writeback by the block layer throttle.
> And....
>
> FSUse%        Count         Size    Files/sec     App Overhead
>       0      1600000            0     255845.0         10796891
>       0      3200000            0     261348.8         10842349
>       0      4800000            0     249172.3         14121232
>       0      6400000            0     245172.8         12453759
>       0      8000000            0     201249.5         14293100
>       0      9600000            0     200417.5         29496551
>>>>> 0     11200000            0      90399.6         40665397
>       0     12800000            0     212265.6         21839031
>       0     14400000            0     206398.8         32598378
>       0     16000000            0     197589.7         26266552
>       0     17600000            0     206405.2         16447795
>>>>> 0     19200000            0      99189.6         87650540
>       0     20800000            0     249720.8         12294862
>       0     22400000            0     138523.8         47330007
>>>>> 0     24000000            0      85486.2         14271096
>       0     25600000            0     157538.1         64430611
>       0     27200000            0     109677.8         47835961
>       0     28800000            0     207230.5         31301031
>       0     30400000            0     188739.6         33750424
>       0     32000000            0     174197.9         41402526
>       0     33600000            0     139152.0        100838085
>       0     35200000            0     203729.7         34833764
>       0     36800000            0     228277.4         12459062
>>>>> 0     38400000            0      94962.0         30189182
>       0     40000000            0     166221.9         40564922
>>>>> 0     41600000            0      62902.5         80098461
>       0     43200000            0     217932.6         22539354
>       0     44800000            0     189594.6         24692209
>       0     46400000            0     137834.1         39822038
>       0     48000000            0     240043.8         12779453
>       0     49600000            0     176830.8         16604133
>       0     51200000            0     180771.8         32860221
>
> real    5m35.967s
> user    3m57.054s
> sys     48m53.332s
>
> In those highlighted report points, the performance has dropped
> significantly. The typical range I expect to see once memory has
> filled (a bit over 8m inodes) is 180k-220k.  Runtime on a vanilla
> kernel was 4m40s and there were no performance drops, so this
> workload runs almost a minute slower with the block layer throttling
> code.
>
> What I see in these performance dips is the XFS transaction
> subsystem stalling *completely* - instead of running at a steady
> state of around 350,000 transactions/s, there are *zero*
> transactions running for periods of up to ten seconds.  This
> coincides with the CPU usage falling to almost zero as well.
> AFAICT, the only thing that is running when the filesystem stalls
> like this is memory reclaim.

I'll take a look at this, stalls should definitely not be occurring. How 
much memory does the box have?

> Without the block throttling patches, the workload quickly finds a
> steady state of around 7.5-8.5 million cached inodes, and it doesn't
> vary much outside those bounds. With the block throttling patches,
> on every transaction subsystem stall that occurs, the inode cache
> gets 3-4 million inodes trimmed out of it (i.e. half the
> cache), and in a couple of cases I saw it trim 6+ million inodes from
> the cache before the transactions started up and the cache started
> growing again.
>
>> The above was run without scsi-mq, and with using the deadline scheduler,
>> results with CFQ are similarly depressing for this test. So IO scheduling
>> is in place for this test, it's not pure blk-mq without scheduling.
>
> virtio in guest, XFS direct IO -> no-op -> scsi in host.

That has write back caching enabled on the guest, correct?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-31 14:29   ` Jens Axboe
@ 2016-03-31 16:21     ` Jens Axboe
  2016-04-01  0:56       ` Dave Chinner
  2016-04-01  0:46     ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-03-31 16:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/31/2016 08:29 AM, Jens Axboe wrote:
>> What I see in these performance dips is the XFS transaction
>> subsystem stalling *completely* - instead of running at a steady
>> state of around 350,000 transactions/s, there are *zero*
>> transactions running for periods of up to ten seconds.  This
>> coincides with the CPU usage falling to almost zero as well.
>> AFAICT, the only thing that is running when the filesystem stalls
>> like this is memory reclaim.
>
> I'll take a look at this, stalls should definitely not be occurring. How
> much memory does the box have?

I can't seem to reproduce this at all. On an nvme device, I get a fairly 
steady 60K/sec file creation rate, and we're nowhere near being IO 
bound. So the throttling has no effect at all.

On a raid0 on 4 flash devices, I get something that looks more IO bound, 
for some reason. Still no impact of the throttling, however. But given 
that your setup is this:

	virtio in guest, XFS direct IO -> no-op -> scsi in host.

we do potentially have two throttling points, which we don't want. Are 
both the guest and the host running the new code, or just the guest?

In any case, can I talk you into trying with two patches on top of the 
current code? It's the two newest patches here:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The first treats REQ_META|REQ_PRIO like they should be treated, like 
high priority IO. The second disables throttling for virtual devices, so 
we only throttle on the backend. The latter should probably be the other 
way around, but we need some way of conveying that information to the 
backend.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
                   ` (9 preceding siblings ...)
  2016-03-31  8:24 ` [PATCHSET v3][RFC] Make background writeback not suck Dave Chinner
@ 2016-03-31 22:09 ` Holger Hoffstätte
  2016-04-01  1:01     ` Dave Chinner
  10 siblings, 1 reply; 34+ messages in thread
From: Holger Hoffstätte @ 2016-03-31 22:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel


Hi,

Jens mentioned on Twitter I should post my experience here as well,
so here we go.

I've backported this series (incl. updates) to stable-4.4.x - not too
difficult, minus the NVM part which I don't need anyway - and have been
running it for the past few days without any problem whatsoever, with
GREAT success.

My use case is primarily larger amounts of stuff (transcoded movies,
finished downloads, built Gentoo packages) that gets copied from tmpfs
to SSD (or disk) and every time that happens, the system noticeably
strangles readers (desktop, interactive shell). It does not really matter
how I tune writeback via the write_expire/dirty_bytes knobs or the
scheduler (and yes, I understand how they work); lowering the writeback
limits helped a bit but the system is still overwhelmed. Jacking up
deadline's writes_starved to unreasonable levels helps a bit, but in turn
makes all writes suffer. Anything else - even tried BFQ for a while,
which has its own unrelated problems - didn't really help either.
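
For the record, that kind of tuning looks something like this - device
name and values are illustrative, and none of it really helped:

$ echo 1500 > /sys/block/sda/queue/iosched/write_expire
$ echo 16 > /sys/block/sda/queue/iosched/writes_starved
$ echo 268435456 > /proc/sys/vm/dirty_bytes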

With this patchset the buffered writeback in these situations is much
improved, and copying several GBs at once to a SATA-3 SSD (or even an
external USB-2 disk with measly 40 MB/s) doodles along in the background
like it always should have, and desktop work is not noticeably affected.

I guess the effect will be even more noticeable on slower block devices
(laptops, old SSDs or disks).

So: +1 would apply again!

cheers
Holger

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-31 14:29   ` Jens Axboe
  2016-03-31 16:21     ` Jens Axboe
@ 2016-04-01  0:46     ` Dave Chinner
  2016-04-01  3:25       ` Jens Axboe
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2016-04-01  0:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
> On 03/31/2016 02:24 AM, Dave Chinner wrote:
> >On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> >>Hi,
> >>
> >>This patchset isn't as much a final solution, as it's demonstration
> >>of what I believe is a huge issue. Since the dawn of time, our
> >>background buffered writeback has sucked. When we do background
> >>buffered writeback, it should have little impact on foreground
> >>activity. That's the definition of background activity... But for as
> >>long as I can remember, heavy buffered writers has not behaved like
> >>that. For instance, if I do something like this:
> >>
> >>$ dd if=/dev/zero of=foo bs=1M count=10k
> >>
> >>on my laptop, and then try and start chrome, it basically won't start
> >>before the buffered writeback is done. Or, for server oriented
> >>workloads, where installation of a big RPM (or similar) adversely
> >>impacts data base reads or sync writes. When that happens, I get people
> >>yelling at me.
> >>
> >>Last time I posted this, I used flash storage as the example. But
> >>this works equally well on rotating storage. Let's run a test case
> >>that writes a lot. This test writes 50 files, each 100M, on XFS on
> >>a regular hard drive. While this happens, we attempt to read
> >>another file with fio.
> >>
> >>Writers:
> >>
> >>$ time (./write-files ; sync)
> >>real	1m6.304s
> >>user	0m0.020s
> >>sys	0m12.210s
> >
> >Great. So a basic IO test looks good - let's throw something more
> >complex at it. Say, a benchmark I've been using for years to stress
> >the IO subsystem, the filesystem and memory reclaim all at the same
> >time: a concurrent fsmark inode creation test.
> >(first google hit https://lkml.org/lkml/2013/9/10/46)
> 
> Is that how you are invoking it as well, with the same arguments?

Yes. And the VM is exactly the same, too - 16p/16GB RAM. Cut down
version of the script I use:

#!/bin/bash

QUOTA=
MKFSOPTS=
NFILES=100000
DEV=/dev/vdc
LOGBSIZE=256k
FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
MNT=/mnt/scratch

while [ $# -gt 0 ]; do
        case "$1" in
        -q)     QUOTA="uquota,gquota,pquota" ;;
        -N)     NFILES=$2 ; shift ;;
        -d)     DEV=$2 ; shift ;;
        -l)     LOGBSIZE=$2; shift ;;
        --)     shift ; break ;;
        esac
        shift
done
MKFSOPTS="$MKFSOPTS $*"

echo QUOTA=$QUOTA
echo MKFSOPTS=$MKFSOPTS
echo DEV=$DEV

sudo umount $MNT > /dev/null 2>&1
sudo mkfs.xfs -f $MKFSOPTS $DEV
sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
sudo chmod 777 $MNT
sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
time $FSMARK  -D  10000  -S0  -n  $NFILES  -s  0  -L  32 \
        -d  $MNT/0  -d  $MNT/1 \
        -d  $MNT/2  -d  $MNT/3 \
        -d  $MNT/4  -d  $MNT/5 \
        -d  $MNT/6  -d  $MNT/7 \
        -d  $MNT/8  -d  $MNT/9 \
        -d  $MNT/10  -d  $MNT/11 \
        -d  $MNT/12  -d  $MNT/13 \
        -d  $MNT/14  -d  $MNT/15 \
        | tee >(stats --trim-outliers | tail -1 1>&2)
sync
sudo umount /mnt/scratch
$
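
A hedged usage sketch - the script name is hypothetical; anything after
-- is appended to MKFSOPTS and passed through to mkfs.xfs by the getopt
loop above:

$ ./fsmark-test.sh -d /dev/vdc -- -d agcount=32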

> >>The above was run without scsi-mq, and with using the deadline scheduler,
> >>results with CFQ are similarly depressing for this test. So IO scheduling
> >>is in place for this test, it's not pure blk-mq without scheduling.
> >
> >virtio in guest, XFS direct IO -> no-op -> scsi in host.
> 
> That has write back caching enabled on the guest, correct?

No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
Sorry for not being clear about that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-31 16:21     ` Jens Axboe
@ 2016-04-01  0:56       ` Dave Chinner
  2016-04-01  3:29         ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2016-04-01  0:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
> On 03/31/2016 08:29 AM, Jens Axboe wrote:
> >>What I see in these performance dips is the XFS transaction
> >>subsystem stalling *completely* - instead of running at a steady
> >>state of around 350,000 transactions/s, there are *zero*
> >>transactions running for periods of up to ten seconds.  This
> >>coincides with the CPU usage falling to almost zero as well.
> >>AFAICT, the only thing that is running when the filesystem stalls
> >>like this is memory reclaim.
> >
> >I'll take a look at this, stalls should definitely not be occurring. How
> >much memory does the box have?
> 
> I can't seem to reproduce this at all. On an nvme device, I get a
> fairly steady 60K/sec file creation rate, and we're nowhere near
> being IO bound. So the throttling has no effect at all.

That's too slow to show the stalls - you're likely concurrency-bound
in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
agcount=32 so that every thread works in its own AG.
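
i.e. something like this (device path illustrative):

$ sudo mkfs.xfs -f -d agcount=32 /dev/vdc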

> On a raid0 on 4 flash devices, I get something that looks more IO
> bound, for some reason. Still no impact of the throttling, however.
> But given that your setup is this:
> 
> 	virtio in guest, XFS direct IO -> no-op -> scsi in host.
> 
> we do potentially have two throttling points, which we don't want.
> Are both the guest and the host running the new code, or just the
> guest?

Just the guest. Host is running a 4.2.x kernel, IIRC.

> In any case, can I talk you into trying with two patches on top of
> the current code? It's the two newest patches here:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
> 
> The first treats REQ_META|REQ_PRIO like they should be treated, like
> high priority IO. The second disables throttling for virtual
> devices, so we only throttle on the backend. The latter should
> probably be the other way around, but we need some way of conveying
> that information to the backend.

I'm not changing the host kernels - it's a production machine and so
it runs long uptime testing of stable kernels.  (e.g. catch slow
memory leaks, etc). So if you've disabled throttling in the guest, I
can't test the throttling changes.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-03-31 22:09 ` Holger Hoffstätte
@ 2016-04-01  1:01     ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2016-04-01  1:01 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-kernel, linux-fsdevel

On Thu, Mar 31, 2016 at 10:09:56PM +0000, Holger Hoffstätte wrote:
> 
> Hi,
> 
> Jens mentioned on Twitter I should post my experience here as well,
> so here we go.
> 
> I've backported this series (incl. updates) to stable-4.4.x - not too
> difficult, minus the NVM part which I don't need anyway - and have been
> running it for the past few days without any problem whatsoever, with
> GREAT success.
> 
> My use case is primarily larger amounts of stuff (transcoded movies,
> finished downloads, built Gentoo packages) that gets copied from tmpfs
> to SSD (or disk) and every time that happens, the system noticeably
> strangles readers (desktop, interactive shell). It does not really matter
> how I tune writeback via the write_expire/dirty_bytes knobs or the
> scheduler (and yes, I understand how they work); lowering the writeback
> limits helped a bit but the system is still overwhelmed. Jacking up
> deadline's writes_starved to unreasonable levels helps a bit, but in turn
> makes all writes suffer. Anything else - even tried BFQ for a while,
> which has its own unrelated problems - didn't really help either.

Can you go back to your original kernel, and lower nr_requests to 8?

Essentially all I see the block throttle doing is keeping the
request queue depth to somewhere between 8-12 requests, rather than
letting it blow out to near nr_requests (around 105-115), so it
would be interesting to note whether the block throttling has any
noticeable difference in behaviour when compared to just having a
very shallow request queue....
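
e.g. something like (device name illustrative):

$ echo 8 > /sys/block/sda/queue/nr_requests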

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  0:46     ` Dave Chinner
@ 2016-04-01  3:25       ` Jens Axboe
  2016-04-01  6:27         ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-04-01  3:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/31/2016 06:46 PM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
>> On 03/31/2016 02:24 AM, Dave Chinner wrote:
>>> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> This patchset isn't as much a final solution, as it's demonstration
>>>> of what I believe is a huge issue. Since the dawn of time, our
>>>> background buffered writeback has sucked. When we do background
>>>> buffered writeback, it should have little impact on foreground
>>>> activity. That's the definition of background activity... But for as
>>>> long as I can remember, heavy buffered writers has not behaved like
>>>> that. For instance, if I do something like this:
>>>>
>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>
>>>> on my laptop, and then try and start chrome, it basically won't start
>>>> before the buffered writeback is done. Or, for server oriented
>>>> workloads, where installation of a big RPM (or similar) adversely
>>>> impacts data base reads or sync writes. When that happens, I get people
>>>> yelling at me.
>>>>
>>>> Last time I posted this, I used flash storage as the example. But
>>>> this works equally well on rotating storage. Let's run a test case
>>>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>>>> a regular hard drive. While this happens, we attempt to read
>>>> another file with fio.
>>>>
>>>> Writers:
>>>>
>>>> $ time (./write-files ; sync)
>>>> real	1m6.304s
>>>> user	0m0.020s
>>>> sys	0m12.210s
>>>
> >>> Great. So a basic IO test looks good - let's throw something more
> >>> complex at it. Say, a benchmark I've been using for years to stress
> >>> the IO subsystem, the filesystem and memory reclaim all at the same
> >>> time: a concurrent fsmark inode creation test.
>>> (first google hit https://lkml.org/lkml/2013/9/10/46)
>>
>> Is that how you are invoking it as well, with the same arguments?
>
> Yes. And the VM is exactly the same, too - 16p/16GB RAM. Cut down
> version of the script I use:
>
> #!/bin/bash
>
> QUOTA=
> MKFSOPTS=
> NFILES=100000
> DEV=/dev/vdc
> LOGBSIZE=256k
> FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
> MNT=/mnt/scratch
>
> while [ $# -gt 0 ]; do
>          case "$1" in
>          -q)     QUOTA="uquota,gquota,pquota" ;;
>          -N)     NFILES=$2 ; shift ;;
>          -d)     DEV=$2 ; shift ;;
>          -l)     LOGBSIZE=$2; shift ;;
>          --)     shift ; break ;;
>          esac
>          shift
> done
> MKFSOPTS="$MKFSOPTS $*"
>
> echo QUOTA=$QUOTA
> echo MKFSOPTS=$MKFSOPTS
> echo DEV=$DEV
>
> sudo umount $MNT > /dev/null 2>&1
> sudo mkfs.xfs -f $MKFSOPTS $DEV
> sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
> sudo chmod 777 $MNT
> sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
> time $FSMARK  -D  10000  -S0  -n  $NFILES  -s  0  -L  32 \
>          -d  $MNT/0  -d  $MNT/1 \
>          -d  $MNT/2  -d  $MNT/3 \
>          -d  $MNT/4  -d  $MNT/5 \
>          -d  $MNT/6  -d  $MNT/7 \
>          -d  $MNT/8  -d  $MNT/9 \
>          -d  $MNT/10  -d  $MNT/11 \
>          -d  $MNT/12  -d  $MNT/13 \
>          -d  $MNT/14  -d  $MNT/15 \
>          | tee >(stats --trim-outliers | tail -1 1>&2)
> sync
> sudo umount /mnt/scratch

Perfect, thanks!

>>>> The above was run without scsi-mq, and with using the deadline scheduler,
>>>> results with CFQ are similarly depressing for this test. So IO scheduling
>>>> is in place for this test, it's not pure blk-mq without scheduling.
>>>
>>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>
>> That has write back caching enabled on the guest, correct?
>
> No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
> Sorry for not being clear about that.

That's fine, it's one less worry if that's not the case. So if you cat 
the 'write_cache' file in the virtioblk sysfs block queue/ directory, it 
says 'write through'? Just want to confirm that we got that propagated 
correctly.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  0:56       ` Dave Chinner
@ 2016-04-01  3:29         ` Jens Axboe
  2016-04-01  3:33           ` Jens Axboe
                             ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jens Axboe @ 2016-04-01  3:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/31/2016 06:56 PM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
>> On 03/31/2016 08:29 AM, Jens Axboe wrote:
>>>> What I see in these performance dips is the XFS transaction
>>>> subsystem stalling *completely* - instead of running at a steady
>>>> state of around 350,000 transactions/s, there are *zero*
>>>> transactions running for periods of up to ten seconds.  This
>>>> coincides with the CPU usage falling to almost zero as well.
>>>> AFAICT, the only thing that is running when the filesystem stalls
>>>> like this is memory reclaim.
>>>
>>> I'll take a look at this, stalls should definitely not be occurring. How
>>> much memory does the box have?
>>
>> I can't seem to reproduce this at all. On an nvme device, I get a
>> fairly steady 60K/sec file creation rate, and we're nowhere near
>> being IO bound. So the throttling has no effect at all.
>
> That's too slow to show the stalls - you're likely concurrency-bound
> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> agcount=32 so that every thread works in its own AG.

That's the key, with that I get 300-400K ops/sec instead. I'll run some 
testing with this tomorrow and see what I can find. It did one full run 
now and I didn't see any issues, but I need to run it at various 
settings to see if I can find the issue.

>> On a raid0 on 4 flash devices, I get something that looks more IO
>> bound, for some reason. Still no impact of the throttling, however.
>> But given that your setup is this:
>>
>> 	virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>
>> we do potentially have two throttling points, which we don't want.
>> Are both the guest and the host running the new code, or just the
>> guest?
>
> Just the guest. Host is running a 4.2.x kernel, IIRC.

OK

>> In any case, can I talk you into trying with two patches on top of
>> the current code? It's the two newest patches here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
>>
>> The first treats REQ_META|REQ_PRIO like they should be treated, like
>> high priority IO. The second disables throttling for virtual
>> devices, so we only throttle on the backend. The latter should
>> probably be the other way around, but we need some way of conveying
>> that information to the backend.
>
> I'm not changing the host kernels - it's a production machine and so
> it runs long uptime testing of stable kernels.  (e.g. catch slow
> memory leaks, etc). So if you've disabled throttling in the guest, I
> can't test the throttling changes.

Right, that'd definitely hide the problem for you. I'll see if I can get 
it in a reproducible state and take it from there.

On your host, you said it's SCSI backed, but what does the device look like?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  3:29         ` Jens Axboe
@ 2016-04-01  3:33           ` Jens Axboe
  2016-04-01  3:39           ` Jens Axboe
  2016-04-01  5:04           ` Dave Chinner
  2 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-04-01  3:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/31/2016 09:29 PM, Jens Axboe wrote:
>> I'm not changing the host kernels - it's a production machine and so
>> it runs long uptime testing of stable kernels.  (e.g. catch slow
>> memory leaks, etc). So if you've disabled throttling in the guest, I
>> can't test the throttling changes.
>
> Right, that'd definitely hide the problem for you. I'll see if I can get
> it in a reproducible state and take it from there.

Though on the guest, if you could try with just this one applied:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=f21fb0e42c7347bd639a17341dcd3f72c1a30d29

I'd appreciate it. It won't disable the throttling in the guest, just 
treat META and PRIO a bit differently.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  3:29         ` Jens Axboe
  2016-04-01  3:33           ` Jens Axboe
@ 2016-04-01  3:39           ` Jens Axboe
  2016-04-01  6:16             ` Dave Chinner
  2016-04-01  5:04           ` Dave Chinner
  2 siblings, 1 reply; 34+ messages in thread
From: Jens Axboe @ 2016-04-01  3:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 03/31/2016 09:29 PM, Jens Axboe wrote:
>>> I can't seem to reproduce this at all. On an nvme device, I get a
>>> fairly steady 60K/sec file creation rate, and we're nowhere near
>>> being IO bound. So the throttling has no effect at all.
>>
>> That's too slow to show the stalls - you're likely concurrency-bound
>> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
>> agcount=32 so that every thread works in its own AG.
>
> That's the key, with that I get 300-400K ops/sec instead. I'll run some
> testing with this tomorrow and see what I can find. It did one full run
> now and I didn't see any issues, but I need to run it at various
> settings to see if I can find the issue.

No stalls seen, I get the same performance with it disabled and with it 
enabled, at both default settings, and lower ones (wb_percent=20). 
Looking at iostat, we don't drive a lot of depth, so it makes sense, 
even with the throttling we're doing essentially the same amount of IO.
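
For reference, I'm setting it with something like this - assuming this
series' sysfs knob is exposed as queue/wb_percent (device name
illustrative):

$ echo 20 > /sys/block/nvme0n1/queue/wb_percent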

What does 'nr_requests' say for your virtio_blk device? Looks like 
virtio_blk has a queue_depth setting, but it's not set by default, and 
then it uses the free entries in the ring. But I don't know what that is...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  3:29         ` Jens Axboe
  2016-04-01  3:33           ` Jens Axboe
  2016-04-01  3:39           ` Jens Axboe
@ 2016-04-01  5:04           ` Dave Chinner
  2 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2016-04-01  5:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Thu, Mar 31, 2016 at 09:29:30PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:56 PM, Dave Chinner wrote:
> >I'm not changing the host kernels - it's a production machine and so
> >it runs long uptime testing of stable kernels.  (e.g. catch slow
> >memory leaks, etc). So if you've disabled throttling in the guest, I
> >can't test the throttling changes.
> 
> Right, that'd definitely hide the problem for you. I'll see if I can
> get it in a reproducible state and take it from there.
> 
> On your host, you said it's SCSI backed, but what does the device look like?

HW RAID 0 w/ 1GB FBWC (dell h710, IIRC) of 2x200GB SATA SSDs
(actually 256GB, but 25% of each is left as spare, unused space).
Sustains about 35,000 random 4k write IOPS, up to 70k read IOPS.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  3:39           ` Jens Axboe
@ 2016-04-01  6:16             ` Dave Chinner
  2016-04-01 14:33               ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2016-04-01  6:16 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
> On 03/31/2016 09:29 PM, Jens Axboe wrote:
> >>>I can't seem to reproduce this at all. On an nvme device, I get a
> >>>fairly steady 60K/sec file creation rate, and we're nowhere near
> >>>being IO bound. So the throttling has no effect at all.
> >>
> >>That's too slow to show the stalls - you're likely concurrency-bound
> >>in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> >>agcount=32 so that every thread works in its own AG.
> >
> >That's the key, with that I get 300-400K ops/sec instead. I'll run some
> >testing with this tomorrow and see what I can find. It did one full run
> >now and I didn't see any issues, but I need to run it at various
> >settings to see if I can find the issue.
> 
> No stalls seen, I get the same performance with it disabled and with
> it enabled, at both default settings, and lower ones
> (wb_percent=20). Looking at iostat, we don't drive a lot of depth,
> so it makes sense, even with the throttling we're doing essentially
> the same amount of IO.

Try appending numa=fake=4 to your guest's kernel command line.

(that's what I'm using)
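
e.g. a hedged sketch of the guest's boot entry - everything except the
numa=fake=4 bit is illustrative:

    linux /vmlinuz root=/dev/vda1 ro numa=fake=4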

> 
> What does 'nr_requests' say for your virtio_blk device? Looks like
> virtio_blk has a queue_depth setting, but it's not set by default,
> and then it uses the free entries in the ring. But I don't know what
> that is...

$ cat /sys/block/vdc/queue/nr_requests 
128
$

Without the block throttling, guest IO (measured within the guest)
looks like this over a fair proportion of the test (5s sample time)

# iostat -d -x -m 5 /dev/vdc

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00 20443.00    6.20  436.60     0.05   269.89  1248.48    73.83  146.11  486.58  141.27   1.64  72.40
vdc               0.00 11567.60   19.20  161.40     0.05   146.08  1657.12   119.17  704.57  707.25  704.25   5.34  96.48
vdc               0.00 12723.20    3.20  437.40     0.05   193.65   900.38    29.46   57.12    1.75   57.52   0.78  34.56
vdc               0.00  1739.80   22.40  426.80     0.05   123.62   563.86    23.44   62.51   79.89   61.59   1.01  45.28
vdc               0.00 12553.80    0.00  521.20     0.00   210.86   828.54    34.38   65.96    0.00   65.96   0.97  50.80
vdc               0.00 12523.60   25.60  529.60     0.10   201.94   745.29    52.24   77.73    0.41   81.47   1.14  63.20
vdc               0.00  5419.80   22.40  502.60     0.05   158.34   617.90    24.42   63.81   30.96   65.27   1.31  68.80
vdc               0.00 12059.00    0.00  439.60     0.00   174.85   814.59    30.91   70.27    0.00   70.27   0.72  31.76
vdc               0.00  7578.00   25.60  397.00     0.10   139.18   675.00    15.72   37.26   61.19   35.72   0.73  30.72
vdc               0.00  9156.00    0.00  537.40     0.00   173.57   661.45    17.08   29.62    0.00   29.62   0.53  28.72
vdc               0.00  5274.80   22.40  377.60     0.05   136.42   698.77    26.17   68.33  186.96   61.30   1.53  61.36
vdc               0.00  9407.00    3.20  541.00     0.05   174.28   656.05    36.10   66.33    3.00   66.71   0.87  47.60
vdc               0.00  8687.20   22.40  410.40     0.05   150.98   714.70    39.91   92.21   93.82   92.12   1.39  60.32
vdc               0.00  8872.80    0.00  422.60     0.00   139.28   674.96    25.01   33.03    0.00   33.03   0.91  38.40
vdc               0.00  1081.60   22.40  241.00     0.05    68.88   535.97    10.78   82.89  137.86   77.79   2.25  59.20
vdc               0.00  9826.80    0.00  445.00     0.00   167.42   770.49    45.16  101.49    0.00  101.49   1.80  79.92
vdc               0.00  7394.00   22.40  447.60     0.05   157.34   685.83    18.06   38.42   77.64   36.46   1.46  68.48
vdc               0.00  9984.80    3.20  252.00     0.05   108.46   870.82    85.68  293.73   16.75  297.24   3.00  76.64
vdc               0.00     0.00   22.40  454.20     0.05   117.67   505.86     8.11   39.51   35.71   39.70   1.17  55.76
vdc               0.00 10273.20    0.00  418.80     0.00   156.76   766.57    90.52  179.40    0.00  179.40   1.85  77.52
vdc               0.00  5650.00   22.40  185.00     0.05    84.12   831.20   103.90  575.15   60.82  637.42   4.21  87.36
vdc               0.00  7193.00    0.00  308.80     0.00   120.71   800.56    63.77  194.35    0.00  194.35   2.24  69.12
vdc               0.00  4460.80    9.80  211.00     0.03    69.52   645.07    72.35  154.81  269.39  149.49   4.42  97.60
vdc               0.00   683.00   14.00  374.60     0.05    99.13   522.69    25.38  167.61  603.14  151.33   1.45  56.24
vdc               0.00  7140.20    1.80  275.20     0.03   104.53   773.06    85.25  202.67   32.44  203.79   2.80  77.68
vdc               0.00  6916.00    0.00  164.00     0.00    82.59  1031.33   126.20  813.60    0.00  813.60   6.10 100.00
vdc               0.00  2255.60   22.40  359.00     0.05   107.41   577.06    42.97  170.03   92.79  174.85   2.17  82.64
vdc               0.00  7580.40    3.20  370.40     0.05   128.32   703.70    60.19  134.11   15.00  135.14   1.64  61.36
vdc               0.00  6438.40   18.80  159.20     0.04    78.04   898.36   126.80  706.27  639.15  714.19   5.62 100.00
vdc               0.00  5420.00    3.60  315.40     0.01   108.87   699.07    20.80   78.54  580.00   72.81   1.03  32.72
vdc               0.00  9444.00    2.60  242.40     0.00   118.72   992.38   126.21  488.66  146.15  492.33   4.08 100.00
vdc               0.00     0.00   19.80  434.60     0.05   110.14   496.65    12.74   57.56  313.78   45.89   1.10  49.84
vdc               0.00 14108.20    3.20  549.60     0.05   207.84   770.17    42.32   69.66   72.75   69.64   1.40  77.20
vdc               0.00  1306.40   35.20  268.20     0.08    78.74   532.08    30.84  114.22  175.07  106.24   2.02  61.20
vdc               0.00 14999.40    0.00  458.60     0.00   192.03   857.57    61.48  134.02    0.00  134.02   1.67  76.80
vdc               0.00     1.40   22.40  331.80     0.05    82.11   475.11     1.74    4.87   22.68    3.66   0.76  26.96
vdc               0.00 13971.80    0.00  670.20     0.00   248.26   758.63    34.45   51.37    0.00   51.37   1.04  69.52
vdc               0.00  7033.00   22.60  205.80     0.06    87.81   787.86    40.95  128.53  244.64  115.78   2.90  66.24
vdc               0.00  1282.00    3.20  456.00     0.05   123.21   549.74    14.56   46.99   21.00   47.17   1.42  65.20
vdc               0.00  9475.80   22.40  248.60     0.05   107.66   814.02   123.94  412.61  376.64  415.86   3.69 100.00
vdc               0.00  3603.60    0.00  418.80     0.00   133.32   651.94    71.28  210.08    0.00  210.08   1.77  74.00

You can see that there are periods where it drives the request queue
depth to congestion, but most of the time the device is only 60-70%
utilised and the queue depths are only 30-40 deep. There's quite a
lot of idle time in the request queue.

Note that there are a couple of points where merging stops
completely - that's when memory reclaim is directly flushing dirty
inodes, because if all we have is cached inodes then we have to
throttle memory allocation back to the rate at which we can clean
dirty inodes.

Throughput does drop when this happens, but because the device has
idle overhead and spare request queue space, these less-than-optimal
IO dispatch spikes get soaked up without really affecting overall
performance.


An equivalent trace from the middle of a run with block throttling
enabled:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdc               0.00  5143.40   22.40  188.00     0.05    81.04   789.38    19.17   89.09  237.86   71.37   4.75  99.92
vdc               0.00  7182.60    0.00  272.60     0.00   116.74   877.06    15.50   57.91    0.00   57.91   3.66  99.84
vdc               0.00  3732.80   11.60  102.20     0.01    53.17   957.05    19.35  151.47  514.00  110.32   8.79 100.00
vdc               0.00     0.00   10.80 1007.20     0.04   104.33   209.98    10.22   12.49  457.85    7.71   0.92  93.44
vdc               0.00     0.00    0.40  822.80     0.01    47.58   118.40     9.39   11.21   10.00   11.21   1.22 100.24
vdc               0.00     0.00    1.00  227.00     0.02     3.55    32.00     0.22    1.54  224.00    0.56   0.48  11.04
vdc               0.00 11100.40    3.20  437.40     0.05   125.91   585.49     7.95   17.77   47.25   17.55   1.56  68.56
vdc               0.00 14134.20   22.40  453.20     0.05   191.26   823.85    15.99   32.91   73.07   30.92   2.09  99.28
vdc               0.00  6667.00    0.00  265.20     0.00   105.11   811.70    15.71   58.98    0.00   58.98   3.77 100.00
vdc               0.00  6243.40   22.40  259.00     0.05   101.23   737.11    17.53   62.97  115.21   58.45   3.55  99.92
vdc               0.00  5590.80    0.00  278.20     0.00   105.30   775.18    18.09   65.55    0.00   65.55   3.59 100.00
vdc               0.00     0.00   14.20  714.80     0.02    97.81   274.86    11.61   12.86  260.85    7.93   1.23  89.44
vdc               0.00     0.00    9.80 1555.00     0.05   126.19   165.22     5.41    4.91  267.02    3.26   0.53  82.96
vdc               0.00     0.00    3.00  816.80     0.05    22.32    55.89     6.07    7.39  256.00    6.48   1.05  85.84
vdc               0.00 11172.80    0.20  463.00     0.00   125.77   556.10     6.13   13.23  260.00   13.13   0.93  43.28
vdc               0.00  9563.00   22.40  324.60     0.05   119.66   706.55    15.50   38.45   10.39   40.39   2.88  99.84
vdc               0.00  5333.60    0.00  218.00     0.00    83.71   786.46    15.57   80.46    0.00   80.46   4.59 100.00
vdc               0.00  5128.00   24.80  216.60     0.06    85.31   724.28    19.18   79.84  193.13   66.87   4.12  99.52
vdc               0.00  2746.40    0.00  257.40     0.00    81.13   645.49    11.16   43.70    0.00   43.70   3.87  99.68
vdc               0.00     0.00    0.00  418.80     0.00   104.68   511.92     5.33   12.74    0.00   12.74   1.93  80.96
vdc               0.00  8102.00    0.20  291.60     0.00   108.79   763.59     3.09   10.60   20.00   10.59   0.87  25.44


The first thing to note is that the device utilisation is almost always
above 80%, and often at 100%, meaning with throttling the device
always has IO in flight. It's got no real idle time to soak up peaks
of IO activity - throttling means the device is running at close to
100% utilisation all the time under workloads like this, and there's
no elasticity in the pipeline to handle changes in IO dispatch
behaviour.

So when memory reclaim does direct inode writeback, we see merging
stop, but the request queue is not able to soak up all the IO being
dispatched, even though there is very little read IO demand. Hence
changes in the dispatch patterns that would drive deeper queues and
maintain performance will now get throttled, resulting in things
like memory reclaim backing up a lot and everything on the machine
suffering.

I'll try the "don't throttle REQ_META" patch, but this seems like a
fragile way to solve this problem - it shuts up the messenger, but
doesn't solve the problem for any other subsystem that might have a
similar issue. e.g. next we're going to have to make sure direct IO
(which is also REQ_WRITE dispatch) does not get throttled, and so
on....

It seems to me that the right thing to do here is add a separate
classification flag for IO that can be throttled. e.g. as
REQ_WRITEBACK and only background writeback work sets this flag.
That would ensure that when the IO is being dispatched from other
sources (e.g. fsync, sync_file_range(), direct IO, filesystem
metadata, etc) it is clear that it is not a target for throttling.
This would also allow us to easily switch off throttling if
writeback is occurring for memory reclaim reasons, and so on.
Throttling policy decisions belong above the block layer, even
though the throttle mechanism itself is in the block layer.

FWIW, this is analogous to REQ_READA, which tells the block layer
that a read is not important and can be discarded if there is too
much load. Policy is set at the layer that knows whether the IO can
be discarded safely; the mechanism is implemented at a lower layer
that knows about load, scheduling and other things the higher layers
know nothing about.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  3:25       ` Jens Axboe
@ 2016-04-01  6:27         ` Dave Chinner
  2016-04-01 14:34           ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2016-04-01  6:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:46 PM, Dave Chinner wrote:
> >>>virtio in guest, XFS direct IO -> no-op -> scsi in host.
> >>
> >>That has write back caching enabled on the guest, correct?
> >
> >No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
> >Sorry for not being clear about that.
> 
> That's fine, it's one less worry if that's not the case. So if you
> cat the 'write_cache' file in the virtioblk sysfs block queue/
> directory, it says 'write through'? Just want to confirm that we got
> that propagated correctly.

No such file. But I did find:

$ cat /sys/block/vdc/cache_type 
write back

Which is what I'd expect it to say given the man page description
of cache=none:

	Note that this is considered a writeback mode and the guest
	OS must handle the disk write cache correctly in order to
	avoid data corruption on host crashes.

To make it say "write through" I need to use cache=directsync, but
I have no need for such integrity guarantees on a volatile test
device...
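
For comparison, a hedged sketch of the two qemu drive configs (file
path illustrative):

  -drive file=/images/scratch.img,if=virtio,cache=none        # O_DIRECT, guest sees "write back"
  -drive file=/images/scratch.img,if=virtio,cache=directsync  # O_DIRECT|O_DSYNC, guest sees "write through"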

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  6:16             ` Dave Chinner
@ 2016-04-01 14:33               ` Jens Axboe
  0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-04-01 14:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 04/01/2016 12:16 AM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
>> On 03/31/2016 09:29 PM, Jens Axboe wrote:
>>>>> I can't seem to reproduce this at all. On an nvme device, I get a
>>>>> fairly steady 60K/sec file creation rate, and we're nowhere near
>>>>> being IO bound. So the throttling has no effect at all.
>>>>
>>>> That's too slow to show the stalls - you're likely concurrency-bound
>>>> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
>>>> agcount=32 so that every thread works in its own AG.
>>>
>>> That's the key, with that I get 300-400K ops/sec instead. I'll run some
>>> testing with this tomorrow and see what I can find. It did one full run
>>> now and I didn't see any issues, but I need to run it at various
>>> settings to see if I can find the issue.
>>
>> No stalls seen, I get the same performance with it disabled and with
>> it enabled, at both default settings, and lower ones
>> (wb_percent=20). Looking at iostat, we don't drive a lot of depth,
>> so it makes sense, even with the throttling we're doing essentially
>> the same amount of IO.
>
> Try appending numa=fake=4 to your guest's kernel command line.
>
> (that's what I'm using)

Sure, I can give that a go.

>> What does 'nr_requests' say for your virtio_blk device? Looks like
>> virtio_blk has a queue_depth setting, but it's not set by default,
>> and then it uses the free entries in the ring. But I don't know what
>> that is...
>
> $ cat /sys/block/vdc/queue/nr_requests
> 128

OK, so that would put you in the 16/32/64 category for idle/normal/high 
priority writeback. Which fits with the iostat below, which is in the 
~16 range.

So the META thing should help, it'll bump it up a bit. But we're also 
seeing smaller requests, and I think that could be because after we 
throttle, we could potentially have a merge candidate. The code doesn't 
check for merges post-sleeping, though it'll allow any merges before. 
That part is a little harder to read from the iostat numbers, but there 
does seem to be a correlation between your higher queue depths and 
bigger request sizes.

> I'll try the "don't throttle REQ_META" patch, but this seems like a
> fragile way to solve this problem - it shuts up the messenger, but
> doesn't solve the problem for any other subsystem that might have a
> similar issue. e.g. next we're going to have to make sure direct IO
> (which is also REQ_WRITE dispatch) does not get throttled, and so
> on....

I don't think there's anything wrong with the REQ_META patch. Sure, we 
could have better classifications (like discussed below), but that's 
mainly tweaking. As long as we get the same answers, it's fine. There's 
no throttling of O_DIRECT writes in the current code, it specifically 
doesn't include those. It's only for the unbounded writes, which 
writeback tends to be.

> It seems to me that the right thing to do here is add a separate
> classification flag for IO that can be throttled. e.g. as
> REQ_WRITEBACK and only background writeback work sets this flag.
> That would ensure that when the IO is being dispatched from other
> sources (e.g. fsync, sync_file_range(), direct IO, filesystem
> metadata, etc) it is clear that it is not a target for throttling.
> This would also allow us to easily switch off throttling if
> writeback is occurring for memory reclaim reasons, and so on.
> Throttling policy decisions belong above the block layer, even
> though the throttle mechanism itself is in the block layer.

We're already doing all of that, it just doesn't include a specific 
REQ_WRITEBACK flag. And yeah, that would clean up the checking for the 
request type, but functionally it should be the same as it is now. It'll 
be a bit more robust and easier to read if we just have a REQ_WRITEBACK; 
right now it's WRITE_SYNC vs WRITE for important vs not-important, with 
a check for write vs O_DIRECT write as well.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  6:27         ` Dave Chinner
@ 2016-04-01 14:34           ` Jens Axboe
  0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-04-01 14:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-block

On 04/01/2016 12:27 AM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
>> On 03/31/2016 06:46 PM, Dave Chinner wrote:
>>>>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>>>
>>>> That has write back caching enabled on the guest, correct?
>>>
>>> No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
>>> Sorry for not being clear about that.
>>
>> That's fine, it's one less worry if that's not the case. So if you
>> cat the 'write_cache' file in the virtio_blk sysfs block queue/
>> directory, does it say 'write through'? Just want to confirm that we
>> got that propagated correctly.
>
> No such file. But I did find:
>
> $ cat /sys/block/vdc/cache_type
> write back
>
> Which is what I'd expect it to say, given the man page description
> of cache=none:
>
> 	Note that this is considered a writeback mode and the guest
> 	OS must handle the disk write cache correctly in order to
> 	avoid data corruption on host crashes.
>
> To make it say "write through" I need to use cache=directsync, but
> I have no need for such integrity guarantees on a volatile test
> device...

I wasn't as concerned about the integrity side; it's more that if the 
device is flagged as write back, we induce further throttling. But I'll 
see if I can get your test case reproduced, and then I don't see why it 
can't get fixed. I'm off all of next week though, so it probably won't 
be until the week after...

-- 
Jens Axboe


* Re: [PATCHSET v3][RFC] Make background writeback not suck
  2016-04-01  1:01     ` Dave Chinner
@ 2016-04-01 16:58     ` Holger Hoffstätte
  -1 siblings, 0 replies; 34+ messages in thread
From: Holger Hoffstätte @ 2016-04-01 16:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel

On 04/01/16 03:01, Dave Chinner wrote:
> Can you go back to your original kernel, and lower nr_requests to 8?

Sure, did that, and as expected it didn't help much. Under prolonged
stress it was actually even a bit worse than writeback throttling. IMHO
that's not really surprising either, since small queues now punish
everyone, and for interactive use I really want to e.g. start loading
hundreds of small thumbnails at once, or du a directory.
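
For reference, I dropped the queue depth with the usual sysfs knob
(the device name here is just an example):

$ echo 8 > /sys/block/sda/queue/nr_requests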

Instead of randomized aka manual/interactive testing I created a simple
stress tester:

#!/bin/sh
# Loop forever, rewriting a big file to generate steady buffered writeback.
while true
do
    cp bigfile bigfile.out
done

and running that in the background turns the system into a tar pit,
which is laughable when you consider that I have 24G and 8 cores.

With the writeback patchset and wb_percent=1 (yes, really!) the
background load is almost unnoticeable, yet according to nmon it still
writes ~250-280 MB/s. This is with deadline on ext4 on an older SATA-3
SSD that can still do a peak of ~465 MB/s (with dd).
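
For completeness, I set the throttling through the sysfs file the
patchset adds; the path below is from my setup, so treat the device
name (and the exact location of the knob) as an example:

$ echo 1 > /sys/block/sda/queue/wb_percent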

cheers,
Holger


* Re: [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-03-30 15:07 ` [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
@ 2016-04-13 13:08   ` Jan Kara
  2016-04-13 14:20     ` Jens Axboe
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2016-04-13 13:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, linux-fsdevel, linux-block

On Wed 30-03-16 09:07:52, Jens Axboe wrote:
> Note in the bdi_writeback structure if a task is currently being
> limited in balance_dirty_pages(), waiting for writeback to
> proceed.
...
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 11ff8f758631..15e696bc5d14 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1746,7 +1746,9 @@ pause:
>  					  pause,
>  					  start_time);
>  		__set_current_state(TASK_KILLABLE);
> +		wb->dirty_sleeping = 1;
>  		io_schedule_timeout(pause);
> +		wb->dirty_sleeping = 0;

Huh, but wb->dirty_sleeping is shared by all the processes in the system.
So this is seriously racy, isn't it? You'd need a counter for this to
work.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-04-13 13:08   ` Jan Kara
@ 2016-04-13 14:20     ` Jens Axboe
  0 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2016-04-13 14:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-kernel, linux-fsdevel, linux-block

On 04/13/2016 07:08 AM, Jan Kara wrote:
> On Wed 30-03-16 09:07:52, Jens Axboe wrote:
>> Note in the bdi_writeback structure if a task is currently being
>> limited in balance_dirty_pages(), waiting for writeback to
>> proceed.
> ...
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 11ff8f758631..15e696bc5d14 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1746,7 +1746,9 @@ pause:
>>   					  pause,
>>   					  start_time);
>>   		__set_current_state(TASK_KILLABLE);
>> +		wb->dirty_sleeping = 1;
>>   		io_schedule_timeout(pause);
>> +		wb->dirty_sleeping = 0;
>
> Huh, but wb->dirty_sleeping is shared by all the processes in the system.
> So this is seriously racy, isn't it? You'd need a counter for this to
> work.

Sure, but it's not _that_ important. It's like wb->dirty_exceeded; we 
have an equally relaxed relationship with that one.

I don't mind making it more solid, but I can't make it a counter 
without making it atomic, which is why I left it as just a basic 
assignment. But since we only fiddle with it when going to sleep, we 
can make it an atomic and not have to worry about the potential impact.
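
Roughly like this, keeping the field name from the patch but switching 
it to an atomic counter (just a sketch):

	/* in struct bdi_writeback: atomic_t dirty_sleeping; */

	__set_current_state(TASK_KILLABLE);
	atomic_inc(&wb->dirty_sleeping);	/* one more task parked here */
	io_schedule_timeout(pause);
	atomic_dec(&wb->dirty_sleeping);

with anyone that cares doing atomic_read(&wb->dirty_sleeping) to see if 
tasks are currently waiting.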

-- 
Jens Axboe


end of thread

Thread overview: 34+ messages
2016-03-30 15:07 [PATCHSET v3][RFC] Make background writeback not suck Jens Axboe
2016-03-30 15:07 ` [PATCH 1/9] writeback: propagate the various reasons for writeback Jens Axboe
2016-03-30 15:07 ` [PATCH 2/9] writeback: add wbc_to_write() Jens Axboe
2016-03-30 15:07 ` [PATCH 3/9] writeback: use WRITE_SYNC for reclaim or sync writeback Jens Axboe
2016-03-30 15:07 ` [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
2016-04-13 13:08   ` Jan Kara
2016-04-13 14:20     ` Jens Axboe
2016-03-30 15:07 ` [PATCH 5/9] block: add ability to flag write back caching on a device Jens Axboe
2016-03-30 15:42   ` Christoph Hellwig
2016-03-30 15:46     ` Jens Axboe
2016-03-30 16:23       ` Jens Axboe
2016-03-30 17:29         ` Christoph Hellwig
2016-03-30 15:07 ` [PATCH 6/9] sd: inform block layer of write cache state Jens Axboe
2016-03-30 15:07 ` [PATCH 7/9] NVMe: " Jens Axboe
2016-03-30 15:07 ` [PATCH 8/9] block: add code to track actual device queue depth Jens Axboe
2016-03-30 15:07 ` [PATCH 9/9] writeback: throttle buffered writeback Jens Axboe
2016-03-31  8:24 ` [PATCHSET v3][RFC] Make background writeback not suck Dave Chinner
2016-03-31 14:29   ` Jens Axboe
2016-03-31 16:21     ` Jens Axboe
2016-04-01  0:56       ` Dave Chinner
2016-04-01  3:29         ` Jens Axboe
2016-04-01  3:33           ` Jens Axboe
2016-04-01  3:39           ` Jens Axboe
2016-04-01  6:16             ` Dave Chinner
2016-04-01 14:33               ` Jens Axboe
2016-04-01  5:04           ` Dave Chinner
2016-04-01  0:46     ` Dave Chinner
2016-04-01  3:25       ` Jens Axboe
2016-04-01  6:27         ` Dave Chinner
2016-04-01 14:34           ` Jens Axboe
2016-03-31 22:09 ` Holger Hoffstätte
2016-04-01  1:01   ` Dave Chinner
2016-04-01  1:01     ` Dave Chinner
2016-04-01 16:58     ` Holger Hoffstätte
