* [PATCHSET v2][RFC] Make background writeback not suck @ 2016-03-23 15:25 Jens Axboe
  2016-03-23 15:25 ` [PATCH 1/8] writeback: propagate the various reasons for writeback Jens Axboe
  ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block

This patchset isn't as much a final solution as it is a demonstration of
what I believe is a huge issue. Since the dawn of time, our background
buffered writeback has sucked. When we do background buffered writeback,
it should have little impact on foreground activity. That's the
definition of background activity... But for as long as I can remember,
heavy buffered writers have not behaved like that. For instance, if I do
something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try to start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

A quick demonstration - a fio job that reads a file, while someone else
issues the above 'dd'. Run on a flash device, using XFS. The vmstat
output looks something like this:

--io---- -system-- ------cpu-----
   bi      bo     in    cs  us sy  id wa st
  156    4648     58   151   0  1  98  1  0
    0       0     64    83   0  0 100  0  0
    0      32     76   119   0  0 100  0  0
26616       0   7574 13907   7  0  91  2  0
41992       0  10811 21395   0  2  95  3  0
46040       0  11836 23395   0  3  94  3  0
19376 1310736   5894 10080   0  4  93  3  0
  116 1974296   1858   455   0  4  93  3  0
  124 2020372   1964   545   0  4  92  4  0
  112 1678356   1955   620   0  3  93  3  0
 8560  405508   3759  4756   0  1  96  3  0
42496       0  10798 21566   0  0  97  3  0
42476       0  10788 21524   0  0  97  3  0

The read starts out fine, but goes to shit when we start background
flushing. The reader experiences latency spikes in the seconds range.
On flash.
With this set of patches applied, the situation looks like this instead:

--io---- -system-- ------cpu-----
   bi      bo     in    cs  us sy  id wa st
33544       0   8650 17204   0  1  97  2  0
42488       0  10856 21756   0  0  97  3  0
42032       0  10719 21384   0  0  97  3  0
42544      12  10838 21631   0  0  97  3  0
42620       0  10982 21727   0  3  95  3  0
46392       0  11923 23597   0  3  94  3  0
36268  512000   9907 20044   0  3  91  5  0
31572  696324   8840 18248   0  1  91  7  0
30748  626692   8617 17636   0  2  91  6  0
31016  618504   8679 17736   0  3  91  6  0
30612  648196   8625 17624   0  3  91  6  0
30992  650296   8738 17859   0  3  91  6  0
30680  604075   8614 17605   0  3  92  6  0
30592  595040   8572 17564   0  2  92  6  0
31836  539656   8819 17962   0  2  92  5  0

and the reader never sees latency spikes above a few milliseconds.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases that
need to clean memory at or near device speed get to do that. We still
don't need thousands of requests to accomplish that. And for the cases
where we don't need to be near device limits, we can clean at a more
reasonable pace.

Currently there are two tunables associated with this; see the last
patch for descriptions of those. I welcome testing. The end goal here
would be having much of this auto-tuned, so that we don't lose
substantial bandwidth for background writes, while still maintaining
decent non-wb performance and latencies.
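The core idea described above (cap in-flight background writeback at a shallow depth, and raise the cap when someone actually needs device-speed cleaning) can be sketched in plain C. This is a userspace simulation, not the kernel code; the names and the 4x bump factor for sync/reclaim writeback are illustrative stand-ins for the tunables the last patch introduces.

```c
#include <stdbool.h>

struct wb_throttle {
	int limit;	/* normal background writeback depth limit */
	int inflight;	/* background writeback currently in flight */
};

/*
 * Try to issue one buffered writeback request. Returns false if the
 * caller must wait: the device-side limit has been reached. Sync and
 * reclaim writeback get a higher limit so they can clean at or near
 * device speed; plain background writeback stays shallow.
 */
bool wb_try_issue(struct wb_throttle *wt, bool sync_or_reclaim)
{
	int limit = sync_or_reclaim ? wt->limit * 4 : wt->limit;

	if (wt->inflight >= limit)
		return false;
	wt->inflight++;
	return true;
}

/* Called when a tracked writeback request completes. */
void wb_complete(struct wb_throttle *wt)
{
	if (wt->inflight > 0)
		wt->inflight--;
}
```

With a default limit of 4, background writers queue at most 4 requests at the device while a reclaim or sync writer could still reach depth 16, which matches the "up the limits" behavior described above.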
The patchset should be fully stable; I have not observed problems. It
passes full xfstest runs, and a variety of benchmarks as well. It should
work equally well on blk-mq/scsi-mq, and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle-v2

Patches are against current Linus' git, 4.5.0+.

Changes since v1
- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the
  writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

 block/Makefile                   |    2
 block/blk-core.c                 |   15 ++
 block/blk-mq.c                   |   32 +++++
 block/blk-settings.c             |   11 +
 block/blk-sysfs.c                |  123 +++++++++++++++++++++
 block/blk-wb.c                   |  219 +++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h                   |   27 ++++
 drivers/nvme/host/core.c         |    1
 drivers/scsi/sd.c                |    5
 fs/block_dev.c                   |    2
 fs/buffer.c                      |    2
 fs/f2fs/data.c                   |    2
 fs/f2fs/node.c                   |    2
 fs/fs-writeback.c                |   17 +++
 fs/gfs2/meta_io.c                |    3
 fs/mpage.c                       |    9 -
 fs/xfs/xfs_aops.c                |    2
 include/linux/backing-dev-defs.h |    2
 include/linux/blk_types.h        |    2
 include/linux/blkdev.h           |    7 +
 include/linux/writeback.h        |    8 +
 mm/page-writeback.c              |    2
 22 files changed, 479 insertions(+), 16 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH 1/8] writeback: propagate the various reasons for writeback
  2016-03-23 15:25 Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:25 ` [PATCH 2/8] writeback: add wbc_to_write() Jens Axboe
  ` (7 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Avoid losing context by propagating the various reasons why we initiate
writeback.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/fs-writeback.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c46ed9f3e14..387610cf4f7f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int for_reclaim:1;	/* for mem reclaim */
 	unsigned int auto_free:1;	/* free on completion */
 	enum wb_reason reason;		/* why was writeback initiated? */
@@ -942,6 +943,22 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	work->reason	= reason;
 	work->auto_free	= 1;
 
+	switch (reason) {
+	case WB_REASON_BACKGROUND:
+	case WB_REASON_PERIODIC:
+		work->for_background = 1;
+		break;
+	case WB_REASON_TRY_TO_FREE_PAGES:
+	case WB_REASON_FREE_MORE_MEM:
+		work->for_reclaim = 1;
+		break;
+	case WB_REASON_SYNC:
+		work->for_sync = 1;
+		break;
+	default:
+		break;
+	}
+
 	wb_queue_work(wb, work);
 }
 
@@ -1443,6 +1460,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.for_sync		= work->for_sync,
+		.for_reclaim		= work->for_reclaim,
 		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
-- 
2.4.1.168.g1ea28e1

^ permalink raw reply	related	[flat|nested] 19+ messages in thread
* [PATCH 2/8] writeback: add wbc_to_write()
  2016-03-23 15:25 Jens Axboe
  2016-03-23 15:25 ` [PATCH 1/8] writeback: propagate the various reasons for writeback Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:25 ` [PATCH 3/8] writeback: use WRITE_SYNC for reclaim or sync writeback Jens Axboe
  ` (6 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Add wbc_to_write(), which returns the write type to use, based on a
struct writeback_control. No functional changes in this patch, but it
prepares us for factoring other wbc fields for write type.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 fs/block_dev.c            | 2 +-
 fs/buffer.c               | 2 +-
 fs/f2fs/data.c            | 2 +-
 fs/f2fs/node.c            | 2 +-
 fs/gfs2/meta_io.c         | 3 +--
 fs/mpage.c                | 9 ++++-----
 fs/xfs/xfs_aops.c         | 2 +-
 include/linux/writeback.h | 8 ++++++++
 8 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3172c4e2f502..b11d4e08b9a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 			struct page *page, struct writeback_control *wbc)
 {
 	int result;
-	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+	int rw = wbc_to_write(wbc);
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
 
 	if (!ops->rw_page || bdev_get_integrity(bdev))
diff --git a/fs/buffer.c b/fs/buffer.c
index 33be29675358..28273caaf2b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	struct buffer_head *bh, *head;
 	unsigned int blocksize, bbits;
 	int nr_underway = 0;
-	int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = wbc_to_write(wbc);
 
 	head = create_page_buffers(page, inode,
 					(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index e5c762b37239..dca5d43c67a3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1143,7 +1143,7 @@ static int f2fs_write_data_page(struct page *page,
 	struct f2fs_io_info fio = {
 		.sbi = sbi,
 		.type = DATA,
-		.rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+		.rw = wbc_to_write(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 118321bd1a7f..db9201f45bf1 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
 	struct f2fs_io_info fio = {
 		.sbi = sbi,
 		.type = NODE,
-		.rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+		.rw = wbc_to_write(wbc),
 		.page = page,
 		.encrypted_page = NULL,
 	};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index e137d96f1b17..ede87306caa5 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
 {
 	struct buffer_head *bh, *head;
 	int nr_underway = 0;
-	int write_op = REQ_META | REQ_PRIO |
-		(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+	int write_op = REQ_META | REQ_PRIO | wbc_to_write(wbc);
 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index 6bd9fd90964e..9986c752f7bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct buffer_head map_bh;
 	loff_t i_size = i_size_read(inode);
 	int ret = 0;
-	int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
 
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -595,7 +594,7 @@ page_is_mapped:
 	 * This page will go to BIO. Do we need to send this BIO off first?
 	 */
 	if (bio && mpd->last_block_in_bio != blocks[0] - 1)
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 
 alloc_new:
 	if (bio == NULL) {
@@ -622,7 +621,7 @@ alloc_new:
 	wbc_account_io(wbc, page, PAGE_SIZE);
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 		goto alloc_new;
 	}
 
@@ -632,7 +631,7 @@ alloc_new:
 	set_page_writeback(page);
 	unlock_page(page);
 	if (boundary || (first_unmapped != blocks_per_page)) {
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
@@ -644,7 +643,7 @@ alloc_new:
 
 confused:
 	if (bio)
-		bio = mpage_bio_submit(wr, bio);
+		bio = mpage_bio_submit(wbc_to_write(wbc), bio);
 
 	if (mpd->use_writepage) {
 		ret = mapping->a_ops->writepage(page, wbc);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d445a64b979e..239a612ea1d6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
 	atomic_inc(&ioend->io_remaining);
 	bio->bi_private = ioend;
 	bio->bi_end_io = xfs_end_bio;
-	submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
+	submit_bio(wbc_to_write(wbc), bio);
 }
 
 STATIC struct bio *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5d4e08..719c255e105a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,14 @@ struct writeback_control {
 #endif
 };
 
+static inline int wbc_to_write(struct writeback_control *wbc)
+{
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return WRITE_SYNC;
+
+	return WRITE;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in. There always is one global
-- 
2.4.1.168.g1ea28e1

^ permalink raw reply	related	[flat|nested] 19+ messages in thread
* [PATCH 3/8] writeback: use WRITE_SYNC for reclaim or sync writeback
  2016-03-23 15:25 Jens Axboe
  2016-03-23 15:25 ` [PATCH 1/8] writeback: propagate the various reasons for writeback Jens Axboe
  2016-03-23 15:25 ` [PATCH 2/8] writeback: add wbc_to_write() Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:25 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
  ` (5 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

If we're doing reclaim or sync IO, use WRITE_SYNC to inform the lower
levels of the importance of this IO.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/writeback.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 719c255e105a..b2c75b8901da 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -102,7 +102,7 @@ struct writeback_control {
 
 static inline int wbc_to_write(struct writeback_control *wbc)
 {
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL || wbc->for_reclaim || wbc->for_sync)
 		return WRITE_SYNC;
 
 	return WRITE;
-- 
2.4.1.168.g1ea28e1

^ permalink raw reply	related	[flat|nested] 19+ messages in thread
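The payoff of centralizing the check in patch 2 is that this patch can widen it in exactly one place. A userspace model of the widened helper (stand-in enum, flag values, and `_model` suffixes are mine, not the kernel's):

```c
/* Stand-in types modeling struct writeback_control and the write flags. */
enum sync_mode_model { WB_SYNC_NONE_M, WB_SYNC_ALL_M };

#define WRITE_M		0
#define WRITE_SYNC_M	1

struct wbc_model {
	enum sync_mode_model sync_mode;
	unsigned int for_reclaim:1;	/* set by patch 1 for reclaim work */
	unsigned int for_sync:1;	/* set for sync(2) writeback */
};

/* After this patch, reclaim and sync writeback also get WRITE_SYNC. */
int wbc_to_write_model(const struct wbc_model *wbc)
{
	if (wbc->sync_mode == WB_SYNC_ALL_M || wbc->for_reclaim ||
	    wbc->for_sync)
		return WRITE_SYNC_M;

	return WRITE_M;
}
```

So even WB_SYNC_NONE writeback gets the sync hint once it is driven by memory reclaim, which is what lets the block layer later give it a higher depth.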
* [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages()
  2016-03-23 15:25 Jens Axboe
  ` (2 preceding siblings ...)
  2016-03-23 15:25 ` [PATCH 3/8] writeback: use WRITE_SYNC for reclaim or sync writeback Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:25 ` [PATCH 5/8] block: add ability to flag write back caching on a device Jens Axboe
  ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Note in the bdi_writeback structure if a task is currently being limited
in balance_dirty_pages(), waiting for writeback to proceed.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 include/linux/backing-dev-defs.h | 2 ++
 mm/page-writeback.c              | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1b4d69f68c33..f702309216b4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 
+	int dirty_sleeping;		/* waiting on dirty limit exceeded */
+
 	struct list_head bdi_node;	/* anchored at bdi->wb_list */
 
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..15e696bc5d14 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
 					  pause, start_time);
 		__set_current_state(TASK_KILLABLE);
+		wb->dirty_sleeping = 1;
 		io_schedule_timeout(pause);
+		wb->dirty_sleeping = 0;
 		current->dirty_paused_when = now + pause;
 		current->nr_dirtied = 0;
-- 
2.4.1.168.g1ea28e1

^ permalink raw reply	related	[flat|nested] 19+ messages in thread
* [PATCH 5/8] block: add ability to flag write back caching on a device 2016-03-23 15:25 Jens Axboe ` (3 preceding siblings ...) 2016-03-23 15:25 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe @ 2016-03-23 15:25 ` Jens Axboe 2016-03-23 15:25 ` [PATCH 6/8] sd: inform block layer of write cache state Jens Axboe ` (3 subsequent siblings) 8 siblings, 0 replies; 19+ messages in thread From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe Add an internal helper and flag for setting whether a queue has write back caching, or write through (or none). Add a sysfs file to show this as well, and make it changeable from user space. Signed-off-by: Jens Axboe <axboe@fb.com> --- block/blk-settings.c | 11 +++++++++++ block/blk-sysfs.c | 39 +++++++++++++++++++++++++++++++++++++++ include/linux/blkdev.h | 4 ++++ 3 files changed, 54 insertions(+) diff --git a/block/blk-settings.c b/block/blk-settings.c index c7bb666aafd1..4dbd511a9889 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -846,6 +846,17 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable) } EXPORT_SYMBOL_GPL(blk_queue_flush_queueable); +void blk_queue_write_cache(struct request_queue *q, bool enabled) +{ + spin_lock_irq(q->queue_lock); + if (enabled) + queue_flag_set(QUEUE_FLAG_WC, q); + else + queue_flag_clear(QUEUE_FLAG_WC, q); + spin_unlock_irq(q->queue_lock); +} +EXPORT_SYMBOL_GPL(blk_queue_write_cache); + static int __init blk_settings_init(void) { blk_max_low_pfn = max_low_pfn - 1; diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index dd93763057ce..954e510452d7 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -347,6 +347,38 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page, return ret; } +static ssize_t queue_wc_show(struct request_queue *q, char *page) +{ + if (test_bit(QUEUE_FLAG_WC, &q->queue_flags)) + return sprintf(page, 
"write back\n"); + + return sprintf(page, "write through\n"); +} + +static ssize_t queue_wc_store(struct request_queue *q, const char *page, + size_t count) +{ + int set = -1; + + if (!strncmp(page, "write back", 10)) + set = 1; + else if (!strncmp(page, "write through", 13) || + !strncmp(page, "none", 4)) + set = 0; + + if (set == -1) + return -EINVAL; + + spin_lock_irq(q->queue_lock); + if (set) + queue_flag_set(QUEUE_FLAG_WC, q); + else + queue_flag_clear(QUEUE_FLAG_WC, q); + spin_unlock_irq(q->queue_lock); + + return count; +} + static struct queue_sysfs_entry queue_requests_entry = { .attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR }, .show = queue_requests_show, @@ -478,6 +510,12 @@ static struct queue_sysfs_entry queue_poll_entry = { .store = queue_poll_store, }; +static struct queue_sysfs_entry queue_wc_entry = { + .attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR }, + .show = queue_wc_show, + .store = queue_wc_store, +}; + static struct attribute *default_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, @@ -503,6 +541,7 @@ static struct attribute *default_attrs[] = { &queue_iostats_entry.attr, &queue_random_entry.attr, &queue_poll_entry.attr, + &queue_wc_entry.attr, NULL, }; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 7e5d7e018bea..76e875159e52 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -491,15 +491,18 @@ struct request_queue { #define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */ #define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments*/ #define QUEUE_FLAG_POLL 22 /* IO polling enabled if set */ +#define QUEUE_FLAG_WC 23 /* Write back caching */ #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_STACKABLE) | \ (1 << QUEUE_FLAG_SAME_COMP) | \ + (1 << QUEUE_FLAG_WC) | \ (1 << QUEUE_FLAG_ADD_RANDOM)) #define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_STACKABLE) | \ (1 << QUEUE_FLAG_SAME_COMP) | \ + (1 << 
QUEUE_FLAG_WC) | \ (1 << QUEUE_FLAG_POLL)) static inline void queue_lockdep_assert_held(struct request_queue *q) @@ -1009,6 +1012,7 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *); extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); extern void blk_queue_flush(struct request_queue *q, unsigned int flush); extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable); +extern void blk_queue_write_cache(struct request_queue *q, bool enabled); extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev); extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *); -- 2.4.1.168.g1ea28e1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH 6/8] sd: inform block layer of write cache state
  2016-03-23 15:25 Jens Axboe
  ` (4 preceding siblings ...)
  2016-03-23 15:25 ` [PATCH 5/8] block: add ability to flag write back caching on a device Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:25 ` [PATCH 7/8] NVMe: " Jens Axboe
  ` (2 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 drivers/scsi/sd.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a5457ac9cdb..049f424fb4ad 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -192,6 +192,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
 	sdkp->WCE = wce;
 	sdkp->RCD = rcd;
 	sd_set_flush_flag(sdkp);
+	blk_queue_write_cache(sdp->request_queue, wce != 0);
 
 	return count;
 }
@@ -2571,7 +2572,7 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 			  sdkp->DPOFUA ? "supports DPO and FUA" :
 			  "doesn't support DPO or FUA");
 
-		return;
+		goto done;
 	}
 
 bad_sense:
@@ -2596,6 +2597,8 @@ defaults:
 	}
 	sdkp->RCD = 0;
 	sdkp->DPOFUA = 0;
+done:
+	blk_queue_write_cache(sdp->request_queue, sdkp->WCE != 0);
 }
 
 /*
-- 
2.4.1.168.g1ea28e1

^ permalink raw reply	related	[flat|nested] 19+ messages in thread
* [PATCH 7/8] NVMe: inform block layer of write cache state
  2016-03-23 15:25 Jens Axboe
  ` (5 preceding siblings ...)
  2016-03-23 15:25 ` [PATCH 6/8] sd: inform block layer of write cache state Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:25 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
  2016-03-23 15:39 ` [PATCHSET v2][RFC] Make background writeback not suck Jens Axboe
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

This isn't quite correct, since the VWC merely states whether a
potential write back cache is volatile or not. But for the purpose of
write absorption, it's good enough.

Signed-off-by: Jens Axboe <axboe@fb.com>
---
 drivers/nvme/host/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 643f457131c2..05c8edfb7611 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -906,6 +906,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
 		blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
 	blk_queue_virt_boundary(q, ctrl->page_size - 1);
+	blk_queue_write_cache(q, ctrl->vwc & NVME_CTRL_VWC_PRESENT);
 }
 
 /*
-- 
2.4.1.168.g1ea28e1

^ permalink raw reply	related	[flat|nested] 19+ messages in thread
* [PATCH 8/8] writeback: throttle buffered writeback
  2016-03-23 15:25 Jens Axboe
  ` (6 preceding siblings ...)
  2016-03-23 15:25 ` [PATCH 7/8] NVMe: " Jens Axboe
@ 2016-03-23 15:25 ` Jens Axboe
  2016-03-23 15:39 ` [PATCHSET v2][RFC] Make background writeback not suck Jens Axboe
  8 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-03-23 15:25 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Test patch that throttles buffered writeback to make it a lot smoother,
and have way less impact on other system activity. Background writeback
should be, by definition, background activity. The fact that we flush
huge bundles of it at a time means that it potentially has heavy impacts
on foreground workloads, which isn't ideal. We can't easily limit the
sizes of writes that we do, since that would impact file system layout
in the presence of delayed allocation. So just throttle back buffered
writeback, unless someone is waiting for it.

This would likely need dynamic adaptation to the current device; this
one has only been tested on NVMe. But it brings background activity
impact down from 1-2s to tens of milliseconds instead.

This is just a test patch, and as such, it registers a queue sysfs entry
to monitor the current state:

$ cat /sys/block/nvme0n1/queue/wb_stats
limit=4, batch=2, inflight=0, wait=0, timer=0

'limit' denotes how many requests we will allow in flight for buffered
writeback; this setting can be tweaked by writing to the 'wb_depth'
file. Writing '0' turns this off completely. 'inflight' shows how many
requests are currently in flight for buffered writeback, 'wait' shows if
anyone is currently waiting for access, and 'timer' shows if we have
processes being deferred in write back cache timeout.

Background buffered writeback will be throttled at depth 'wb_depth', and
even lower (QD=1) if the device recently completed "competing" IO.
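The "back off if the device recently completed competing IO" decision can be sketched as a pure function. This is a userspace model under stated assumptions: the struct and field names are hypothetical stand-ins for the rq_wb state, and time is an abstract tick counter rather than jiffies.

```c
#include <stdbool.h>

/* Hypothetical mirror of the bits of rq_wb state this decision reads. */
struct wb_cache_state {
	bool write_cache;	/* device has a write back cache */
	bool bdp_wait;		/* a task is throttled in balance_dirty_pages() */
	long last_comp;		/* tick of the last "competing" completion */
	long cache_delay;	/* how long to defer after competing IO */
};

/* Should the next purely-background write be deferred at time 'now'? */
bool defer_background(const struct wb_cache_state *s, long now)
{
	if (!s->write_cache)
		return false;	/* write through: no cache absorbing writes */
	if (s->bdp_wait)
		return false;	/* a dirtier is waiting, keep cleaning */
	/* competing IO completed recently: hold off, roughly QD=1 behavior */
	return now - s->last_comp < s->cache_delay;
}
```

The point of the deferral is that a write back cache can absorb bursts even at low queue depth, so spacing out background submissions costs little throughput while keeping foreground latency low.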
If we are doing reclaim or otherwise sync buffered writeback, the limit is increased 4x to achieve full device bandwidth. Finally, if the device has write back caching, 'wb_cache_delay' delays by this amount of usecs when a write completes before allowing more. Signed-off-by: Jens Axboe <axboe@fb.com> --- block/Makefile | 2 +- block/blk-core.c | 15 ++++ block/blk-mq.c | 32 ++++++- block/blk-sysfs.c | 84 ++++++++++++++++++ block/blk-wb.c | 219 ++++++++++++++++++++++++++++++++++++++++++++++ block/blk-wb.h | 27 ++++++ include/linux/blk_types.h | 2 + include/linux/blkdev.h | 3 + 8 files changed, 381 insertions(+), 3 deletions(-) create mode 100644 block/blk-wb.c create mode 100644 block/blk-wb.h diff --git a/block/Makefile b/block/Makefile index 9eda2322b2d4..9df911a3b569 100644 --- a/block/Makefile +++ b/block/Makefile @@ -5,7 +5,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ - blk-lib.o blk-mq.o blk-mq-tag.o \ + blk-lib.o blk-mq.o blk-mq-tag.o blk-wb.o \ blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ badblocks.o partitions/ diff --git a/block/blk-core.c b/block/blk-core.c index 827f8badd143..887a9e64c6ef 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -39,6 +39,7 @@ #include "blk.h" #include "blk-mq.h" +#include "blk-wb.h" EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap); EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap); @@ -848,6 +849,9 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, if (blk_init_rl(&q->root_rl, q, GFP_KERNEL)) goto fail; + if (blk_buffered_writeback_init(q)) + goto fail; + INIT_WORK(&q->timeout_work, blk_timeout_work); q->request_fn = rfn; q->prep_rq_fn = NULL; @@ -880,6 +884,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, fail: blk_free_flush_queue(q->fq); + blk_buffered_writeback_exit(q); return 
NULL; } EXPORT_SYMBOL(blk_init_allocated_queue); @@ -1485,6 +1490,8 @@ void __blk_put_request(struct request_queue *q, struct request *req) /* this is a bio leak */ WARN_ON(req->bio != NULL); + blk_buffered_writeback_done(q->rq_wb, req); + /* * Request may not have originated from ll_rw_blk. if not, * it didn't come out of our reserved rq pools @@ -1714,6 +1721,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT; struct request *req; unsigned int request_count = 0; + bool wb_acct; /* * low level driver can indicate that it wants pages above a @@ -1766,6 +1774,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) } get_rq: + wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, q->queue_lock); + /* * This sync check and mask will be re-done in init_request_from_bio(), * but we need to set it earlier to expose the sync flag to the @@ -1781,11 +1791,16 @@ get_rq: */ req = get_request(q, rw_flags, bio, GFP_NOIO); if (IS_ERR(req)) { + if (wb_acct) + __blk_buffered_writeback_done(q->rq_wb); bio->bi_error = PTR_ERR(req); bio_endio(bio); goto out_unlock; } + if (wb_acct) + req->cmd_flags |= REQ_BUF_INFLIGHT; + /* * After dropping the lock and possibly sleeping here, our request * may now be mergeable after it had proven unmergeable (above). 
diff --git a/block/blk-mq.c b/block/blk-mq.c index 050f7a13021b..55aace97fd35 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -29,6 +29,7 @@ #include "blk.h" #include "blk-mq.h" #include "blk-mq-tag.h" +#include "blk-wb.h" static DEFINE_MUTEX(all_q_mutex); static LIST_HEAD(all_q_list); @@ -274,6 +275,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, if (rq->cmd_flags & REQ_MQ_INFLIGHT) atomic_dec(&hctx->nr_active); + + blk_buffered_writeback_done(q->rq_wb, rq); + rq->cmd_flags = 0; clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); @@ -1253,6 +1257,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) struct blk_plug *plug; struct request *same_queue_rq = NULL; blk_qc_t cookie; + bool wb_acct; blk_queue_bounce(q, &bio); @@ -1270,9 +1275,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) } else request_count = blk_plug_queued_count(q); + wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, NULL); + rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (wb_acct) + __blk_buffered_writeback_done(q->rq_wb); return BLK_QC_T_NONE; + } + + if (wb_acct) + rq->cmd_flags |= REQ_BUF_INFLIGHT; cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -1349,6 +1362,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) struct blk_map_ctx data; struct request *rq; blk_qc_t cookie; + bool wb_acct; blk_queue_bounce(q, &bio); @@ -1363,9 +1377,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) blk_attempt_plug_merge(q, bio, &request_count, NULL)) return BLK_QC_T_NONE; + wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, NULL); + rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (wb_acct) + __blk_buffered_writeback_done(q->rq_wb); return BLK_QC_T_NONE; + } + + if (wb_acct) + rq->cmd_flags |= REQ_BUF_INFLIGHT; cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ 
-2018,6 +2040,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, /* mark the queue as mq asap */ q->mq_ops = set->ops; + if (blk_buffered_writeback_init(q)) + return ERR_PTR(-ENOMEM); + q->queue_ctx = alloc_percpu(struct blk_mq_ctx); if (!q->queue_ctx) return ERR_PTR(-ENOMEM); @@ -2084,6 +2109,7 @@ err_map: kfree(q->queue_hw_ctx); err_percpu: free_percpu(q->queue_ctx); + blk_buffered_writeback_exit(q); return ERR_PTR(-ENOMEM); } EXPORT_SYMBOL(blk_mq_init_allocated_queue); @@ -2096,6 +2122,8 @@ void blk_mq_free_queue(struct request_queue *q) list_del_init(&q->all_q_node); mutex_unlock(&all_q_mutex); + blk_buffered_writeback_exit(q); + blk_mq_del_queue_tag_set(q); blk_mq_exit_hw_queues(q, set, set->nr_hw_queues); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 954e510452d7..9ac9be23e700 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -13,6 +13,7 @@ #include "blk.h" #include "blk-mq.h" +#include "blk-wb.h" struct queue_sysfs_entry { struct attribute attr; @@ -347,6 +348,71 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page, return ret; } +static ssize_t queue_wb_stats_show(struct request_queue *q, char *page) +{ + struct rq_wb *wb = q->rq_wb; + + if (!q->rq_wb) + return -EINVAL; + + return sprintf(page, "limit=%d, batch=%d, inflight=%d, wait=%d, timer=%d\n", + wb->limit, wb->batch, atomic_read(&wb->inflight), + waitqueue_active(&wb->wait), timer_pending(&wb->timer)); +} + +static ssize_t queue_wb_depth_show(struct request_queue *q, char *page) +{ + if (!q->rq_wb) + return -EINVAL; + + return queue_var_show(q->rq_wb->limit, page); +} + +static ssize_t queue_wb_depth_store(struct request_queue *q, const char *page, + size_t count) +{ + unsigned long var; + ssize_t ret; + + if (!q->rq_wb) + return -EINVAL; + + ret = queue_var_store(&var, page, count); + if (ret < 0) + return ret; + if (var != (unsigned int) var) + return -EINVAL; + + blk_update_wb_limit(q->rq_wb, var); + return ret; +} + +static 
ssize_t queue_wb_cache_delay_show(struct request_queue *q, char *page) +{ + if (!q->rq_wb) + return -EINVAL; + + return queue_var_show(q->rq_wb->cache_delay_usecs, page); +} + +static ssize_t queue_wb_cache_delay_store(struct request_queue *q, + const char *page, size_t count) +{ + unsigned long var; + ssize_t ret; + + if (!q->rq_wb) + return -EINVAL; + + ret = queue_var_store(&var, page, count); + if (ret < 0) + return ret; + + q->rq_wb->cache_delay_usecs = var; + q->rq_wb->cache_delay = usecs_to_jiffies(var); + return ret; +} + static ssize_t queue_wc_show(struct request_queue *q, char *page) { if (test_bit(QUEUE_FLAG_WC, &q->queue_flags)) @@ -516,6 +582,21 @@ static struct queue_sysfs_entry queue_wc_entry = { .store = queue_wc_store, }; +static struct queue_sysfs_entry queue_wb_stats_entry = { + .attr = {.name = "wb_stats", .mode = S_IRUGO }, + .show = queue_wb_stats_show, +}; +static struct queue_sysfs_entry queue_wb_cache_delay_entry = { + .attr = {.name = "wb_cache_usecs", .mode = S_IRUGO | S_IWUSR }, + .show = queue_wb_cache_delay_show, + .store = queue_wb_cache_delay_store, +}; +static struct queue_sysfs_entry queue_wb_depth_entry = { + .attr = {.name = "wb_depth", .mode = S_IRUGO | S_IWUSR }, + .show = queue_wb_depth_show, + .store = queue_wb_depth_store, +}; + static struct attribute *default_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, @@ -542,6 +623,9 @@ static struct attribute *default_attrs[] = { &queue_random_entry.attr, &queue_poll_entry.attr, &queue_wc_entry.attr, + &queue_wb_stats_entry.attr, + &queue_wb_cache_delay_entry.attr, + &queue_wb_depth_entry.attr, NULL, }; diff --git a/block/blk-wb.c b/block/blk-wb.c new file mode 100644 index 000000000000..2aa3753a8e1e --- /dev/null +++ b/block/blk-wb.c @@ -0,0 +1,219 @@ +/* + * buffered writeback throttling + * + * Copyright (C) 2016 Jens Axboe + * + * Things that need changing: + * + * - Auto-detection of most of this, no tunables. 
Cache type we can get, + * and most other settings we can tweak/gather based on time. + * - Better solution for rwb->bdp_wait? + * - Higher depth for WB_SYNC_ALL? + * + */ +#include <linux/kernel.h> +#include <linux/bio.h> +#include <linux/blkdev.h> + +#include "blk.h" +#include "blk-wb.h" + +void __blk_buffered_writeback_done(struct rq_wb *rwb) +{ + int inflight; + + inflight = atomic_dec_return(&rwb->inflight); + if (inflight >= rwb->limit) + return; + + /* + * If the device does caching, we can still flood it with IO + * even at a low depth. If caching is on, delay a bit before + * submitting the next, if we're still purely background + * activity. + */ + if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) && !*rwb->bdp_wait && + time_before(jiffies, rwb->last_comp + rwb->cache_delay)) { + if (!timer_pending(&rwb->timer)) + mod_timer(&rwb->timer, jiffies + rwb->cache_delay); + return; + } + + if (waitqueue_active(&rwb->wait)) { + int diff = rwb->limit - inflight; + + if (diff >= rwb->batch) + wake_up_nr(&rwb->wait, 1); + } +} + +/* + * Called on completion of a request. Note that it's also called when + * a request is merged, when the request gets freed. + */ +void blk_buffered_writeback_done(struct rq_wb *rwb, struct request *rq) +{ + if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) { + const unsigned long cur = jiffies; + + if (rwb->limit && cur != rwb->last_comp) + rwb->last_comp = cur; + } else + __blk_buffered_writeback_done(rwb); +} + +/* + * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded, + * false if 'v' + 1 would be bigger than 'below'. + */ +static bool atomic_inc_below(atomic_t *v, int below) +{ + int cur = atomic_read(v); + + for (;;) { + int old; + + if (cur >= below) + return false; + old = atomic_cmpxchg(v, cur, cur + 1); + if (old == cur) + break; + cur = old; + } + + return true; +} + +/* + * Block if we will exceed our limit, or if we are currently waiting for + * the timer to kick off queuing again. 
+ */ +static void __blk_buffered_writeback_wait(struct rq_wb *rwb, unsigned int limit, + spinlock_t *lock) +{ + DEFINE_WAIT(wait); + + if (!timer_pending(&rwb->timer) && + atomic_inc_below(&rwb->inflight, limit)) + return; + + do { + prepare_to_wait_exclusive(&rwb->wait, &wait, + TASK_UNINTERRUPTIBLE); + + if (!timer_pending(&rwb->timer) && + atomic_inc_below(&rwb->inflight, limit)) + break; + + if (lock) + spin_unlock_irq(lock); + + io_schedule(); + + if (lock) + spin_lock_irq(lock); + } while (1); + + finish_wait(&rwb->wait, &wait); +} + +/* + * Returns true if the IO request should be accounted, false if not. + * May sleep, if we have exceeded the writeback limits. Caller can pass + * in an irq held spinlock, if it holds one when calling this function. + * If we do sleep, we'll release and re-grab it. + */ +bool blk_buffered_writeback_wait(struct rq_wb *rwb, struct bio *bio, + spinlock_t *lock) +{ + unsigned int limit; + + /* + * If disabled, or not a WRITE (or a discard), do nothing + */ + if (!rwb->limit || !(bio->bi_rw & REQ_WRITE) || + (bio->bi_rw & REQ_DISCARD)) + return false; + + /* + * Don't throttle WRITE_ODIRECT + */ + if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC) + return false; + + /* + * At this point we know it's a buffered write. If REQ_SYNC is + * set, then it's WB_SYNC_ALL writeback. Bump the limit 4x for + * those, since someone is (or will be) waiting on that. + */ + limit = rwb->limit; + if (bio->bi_rw & REQ_SYNC) + limit <<= 2; + else if (limit != 1) { + /* + * If less than 100ms since we completed unrelated IO, + * limit us to a depth of 1 for background writeback. 
+ */ + if (time_before(jiffies, rwb->last_comp + HZ / 10)) + limit = 1; + else if (!*rwb->bdp_wait) + limit >>= 1; + } + + __blk_buffered_writeback_wait(rwb, limit, lock); + return true; +} + +void blk_update_wb_limit(struct rq_wb *rwb, unsigned int limit) +{ + rwb->limit = limit; + rwb->batch = rwb->limit / 2; + if (!rwb->batch && rwb->limit) + rwb->batch = 1; + else if (rwb->batch > 4) + rwb->batch = 4; + + wake_up_all(&rwb->wait); +} + +static void blk_buffered_writeback_timer(unsigned long data) +{ + struct rq_wb *rwb = (struct rq_wb *) data; + + if (waitqueue_active(&rwb->wait)) + wake_up_nr(&rwb->wait, 1); +} + +#define DEF_WB_LIMIT 4 +#define DEF_WB_CACHE_DELAY 10000 + +int blk_buffered_writeback_init(struct request_queue *q) +{ + struct rq_wb *rwb; + + rwb = kzalloc(sizeof(*rwb), GFP_KERNEL); + if (!rwb) + return -ENOMEM; + + atomic_set(&rwb->inflight, 0); + init_waitqueue_head(&rwb->wait); + rwb->last_comp = jiffies; + rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping; + setup_timer(&rwb->timer, blk_buffered_writeback_timer, + (unsigned long) rwb); + rwb->cache_delay_usecs = DEF_WB_CACHE_DELAY; + rwb->cache_delay = usecs_to_jiffies(rwb->cache_delay); + rwb->q = q; + blk_update_wb_limit(rwb, DEF_WB_LIMIT); + q->rq_wb = rwb; + return 0; +} + +void blk_buffered_writeback_exit(struct request_queue *q) +{ + if (q->rq_wb) + del_timer_sync(&q->rq_wb->timer); + + kfree(q->rq_wb); + q->rq_wb = NULL; +} diff --git a/block/blk-wb.h b/block/blk-wb.h new file mode 100644 index 000000000000..f3b4cd139815 --- /dev/null +++ b/block/blk-wb.h @@ -0,0 +1,27 @@ +#ifndef BLK_WB_H +#define BLK_WB_H + +#include <linux/atomic.h> +#include <linux/wait.h> + +struct rq_wb { + unsigned int limit; + unsigned int batch; + unsigned int cache_delay; + unsigned int cache_delay_usecs; + unsigned long last_comp; + unsigned int *bdp_wait; + struct request_queue *q; + atomic_t inflight; + wait_queue_head_t wait; + struct timer_list timer; +}; + +void __blk_buffered_writeback_done(struct 
rq_wb *); +void blk_buffered_writeback_done(struct rq_wb *, struct request *); +bool blk_buffered_writeback_wait(struct rq_wb *, struct bio *, spinlock_t *); +int blk_buffered_writeback_init(struct request_queue *); +void blk_buffered_writeback_exit(struct request_queue *); +void blk_update_wb_limit(struct rq_wb *, unsigned int); + +#endif diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 86a38ea1823f..6f2a174b771c 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -188,6 +188,7 @@ enum rq_flag_bits { __REQ_PM, /* runtime pm request */ __REQ_HASHED, /* on IO scheduler merge hash */ __REQ_MQ_INFLIGHT, /* track inflight for MQ */ + __REQ_BUF_INFLIGHT, /* track inflight for buffered */ __REQ_NR_BITS, /* stops here */ }; @@ -241,6 +242,7 @@ enum rq_flag_bits { #define REQ_PM (1ULL << __REQ_PM) #define REQ_HASHED (1ULL << __REQ_HASHED) #define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT) +#define REQ_BUF_INFLIGHT (1ULL << __REQ_BUF_INFLIGHT) typedef unsigned int blk_qc_t; #define BLK_QC_T_NONE -1U diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 76e875159e52..8586685bf7b2 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -37,6 +37,7 @@ struct bsg_job; struct blkcg_gq; struct blk_flush_queue; struct pr_ops; +struct rq_wb; #define BLKDEV_MIN_RQ 4 #define BLKDEV_MAX_RQ 128 /* Default maximum */ @@ -290,6 +291,8 @@ struct request_queue { int nr_rqs[2]; /* # allocated [a]sync rqs */ int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */ + struct rq_wb *rq_wb; + /* * If blkcg is not used, @q->root_rl serves all requests. If blkcg * is used, root blkg allocates from @q->root_rl and all other -- 2.4.1.168.g1ea28e1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCHSET v2][RFC] Make background writeback not suck 2016-03-23 15:25 Jens Axboe ` (7 preceding siblings ...) 2016-03-23 15:25 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe @ 2016-03-23 15:39 ` Jens Axboe 2016-03-24 17:42 ` Jens Axboe 8 siblings, 1 reply; 19+ messages in thread From: Jens Axboe @ 2016-03-23 15:39 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-block Hi, Apparently I dropped the subject on this one, it's of course v2 of the writeback not sucking patchset... -- Jens Axboe ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCHSET v2][RFC] Make background writeback not suck 2016-03-23 15:39 ` [PATCHSET v2][RFC] Make background writeback not suck Jens Axboe @ 2016-03-24 17:42 ` Jens Axboe 0 siblings, 0 replies; 19+ messages in thread From: Jens Axboe @ 2016-03-24 17:42 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-block On 03/23/2016 09:39 AM, Jens Axboe wrote: > Hi, > > Apparently I dropped the subject on this one, it's of course v2 of the > writeback not sucking patchset... Some test results. I've run a lot of them, on various types of storage, and performance seems good with the default settings. This case reads in a file and writes it to stdout. It targets a certain latency for the reads - by default it's 10ms. If a read isn't done by 10ms, it'll queue the next read. This avoids the coordinated omission problem, where one long latency is in fact many of them, you just don't know since you don't issue more while one is stuck. The test case reads a compressed file, and writes it over a pipe to gzip to decompress it. The input file is around 9G, uncompresses to 20G. At the end of the run, latency results are shown. Every time the target latency is exceeded during the run, it's output. To keep the system busy, 75% (24G) of the memory is taken up by CPU hogs. This is intended to make the case worse for the throttled depth, as Dave pointed out. Out-of-the-box results: # time (./read-to-pipe-async -f randfile.gz | gzip -dc > outfile; sync) read latency=11790 usec read latency=82697 usec [...]
Latency percentiles (usec) (READERS) 50.0000th: 4 75.0000th: 5 90.0000th: 6 95.0000th: 7 99.0000th: 54 99.5000th: 64 99.9000th: 334 99.9900th: 17952 99.9990th: 101504 99.9999th: 203520 Over=333, min=0, max=215367 Latency percentiles (usec) (WRITERS) 50.0000th: 3 75.0000th: 5 90.0000th: 454 95.0000th: 473 99.0000th: 615 99.5000th: 625 99.9000th: 815 99.9900th: 1142 99.9990th: 2244 99.9999th: 10032 Over=3, min=0, max=10811 Read rate (KB/sec) : 88988 Write rate (KB/sec): 60019 real 2m38.701s user 2m33.030s sys 1m31.540s 215ms worst case latency, 333 cases of being above the 10ms target. And with the patchset applied: # time (./read-to-pipe-async -f randfile.gz | gzip -dc > outfile; sync) write latency=15394 usec [...] Latency percentiles (usec) (READERS) 50.0000th: 4 75.0000th: 5 90.0000th: 6 95.0000th: 8 99.0000th: 55 99.5000th: 64 99.9000th: 338 99.9900th: 2652 99.9990th: 3964 99.9999th: 7464 Over=1, min=0, max=10221 Latency percentiles (usec) (WRITERS) 50.0000th: 4 75.0000th: 5 90.0000th: 450 95.0000th: 471 99.0000th: 611 99.5000th: 623 99.9000th: 703 99.9900th: 1106 99.9990th: 2010 99.9999th: 10448 Over=6, min=1, max=15394 Read rate (KB/sec) : 95506 Write rate (KB/sec): 59970 real 2m39.014s user 2m33.800s sys 1m35.210s I won't bore you with vmstat output, it's pretty messy for the default case. -- Jens Axboe ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCHSET v4 0/8] Make background writeback not suck @ 2016-04-18 4:24 Jens Axboe 2016-04-18 4:24 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe 0 siblings, 1 reply; 19+ messages in thread From: Jens Axboe @ 2016-04-18 4:24 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-block; +Cc: jack, dchinner Hi, Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this: $ dd if=/dev/zero of=foo bs=1M count=10k on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done. Or, for server oriented workloads, where installation of a big RPM (or similar) adversely impacts database reads or sync writes. When that happens, I get people yelling at me. I have posted plenty of results previously, I'll keep it shorter this time. Here's a run on my laptop, using read-to-pipe-async for reading a 5g file, and rewriting it. 
4.6-rc3: $ t/read-to-pipe-async -f ~/5g > 5g-new Latency percentiles (usec) (READERS) 50.0000th: 2 75.0000th: 3 90.0000th: 5 95.0000th: 7 99.0000th: 43 99.5000th: 77 99.9000th: 9008 99.9900th: 91008 99.9990th: 286208 99.9999th: 347648 Over=1251, min=0, max=358081 Latency percentiles (usec) (WRITERS) 50.0000th: 4 75.0000th: 8 90.0000th: 13 95.0000th: 15 99.0000th: 32 99.5000th: 43 99.9000th: 81 99.9900th: 2372 99.9990th: 104320 99.9999th: 349696 Over=63, min=1, max=358321 Read rate (KB/sec) : 91859 Write rate (KB/sec): 91859 4.6-rc3 + wb-buf-throttle Latency percentiles (usec) (READERS) 50.0000th: 2 75.0000th: 3 90.0000th: 5 95.0000th: 8 99.0000th: 48 99.5000th: 79 99.9000th: 5304 99.9900th: 22496 99.9990th: 29408 99.9999th: 33728 Over=860, min=0, max=37599 Latency percentiles (usec) (WRITERS) 50.0000th: 4 75.0000th: 9 90.0000th: 14 95.0000th: 16 99.0000th: 34 99.5000th: 45 99.9000th: 87 99.9900th: 1342 99.9990th: 13648 99.9999th: 21280 Over=29, min=1, max=30457 Read rate (KB/sec) : 95832 Write rate (KB/sec): 95832 Better throughput and tighter latencies, for both reads and writes. That's hard not to like. The above was the why. The how is basically throttling background writeback. We still want to issue big writes from the vm side of things, so we get nice and big extents on the file system end. But we don't need to flood the device with THOUSANDS of requests for background writeback. For most devices, we don't need a whole lot to get decent throughput. This adds some simple blk-wb code that limits how much buffered writeback we keep in flight on the device end. It's all about managing the queues on the hardware side. The big change in this version is that it should be pretty much auto-tuning - you no longer have to set a given percentage of writeback bandwidth. I've implemented something similar to CoDel to manage the writeback queue.
See the last patch for a full description, but the tldr is that we monitor min latencies over a window of time, and scale up/down the queue based on that. This needs a minimum of tunables, and it stays out of the way if your device is fast enough. There's a single tunable now, wb_lat_usec, that simply sets this latency target. Most people won't have to touch this, it'll work pretty well just being in the ballpark. I welcome testing. If you are sick of Linux bogging down when buffered writes are happening, then this is for you, laptop or server. The patchset is fully stable, I have not observed problems. It passes full xfstest runs, and a variety of benchmarks as well. It works equally well on blk-mq/scsi-mq, and "classic" setups. You can also find this in a branch in the block git repo: git://git.kernel.dk/linux-block.git wb-buf-throttle Note that I rebase this branch when I collapse patches. The wb-buf-throttle-v4 will remain the same as this version. I've folded the device write cache changes into my 4.7 branches, so they are not a part of this posting. Get the full wb-buf-throttle branch, or apply the patches here on top of my for-next. A full patch against Linus' current tree can also be downloaded here: http://brick.kernel.dk/snaps/wb-buf-throttle-v4.patch Changes since v3 - Re-do the mm/ writeback parts. Add REQ_BG for background writes, and don't overload the wbc 'reason' for writeback decisions. - Add tracking for when apps are sleeping waiting for a page to complete. - Change wbc_to_write() to wbc_to_write_cmd(). - Use atomic_t for the balance_dirty_pages() sleep count. - Add a basic scalable block stats tracking framework. - Rewrite blk-wb core as described above, to dynamically adapt. This is a big change, see the last patch for a full description of it. - Add tracing to blk-wb, instead of using debug printk's. - Rebased to 4.6-rc3 (ish) Changes since v2 - Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side. - Cleanup the limiting code. - Don't use a fixed limit in the wb wait, since it can change between wakeups. - Minor tweaks, fixups, cleanups. Changes since v1 - Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change - wb_start_writeback() fills in background/reclaim/sync info in the writeback work, based on writeback reason. - Use WRITE_SYNC for reclaim/sync IO - Split balance_dirty_pages() sleep change into separate patch - Drop get_request() u64 flag change, set the bit on the request directly after-the-fact. - Fix wrong sysfs return value - Various small cleanups Documentation/block/queue-sysfs.txt | 9 Documentation/block/writeback_cache_control.txt | 4 arch/um/drivers/ubd_kern.c | 2 block/Makefile | 2 block/blk-core.c | 22 + block/blk-flush.c | 11 block/blk-mq-sysfs.c | 47 ++ block/blk-mq.c | 45 ++ block/blk-mq.h | 3 block/blk-settings.c | 58 +- block/blk-stat.c | 184 ++++++++ block/blk-stat.h | 17 block/blk-sysfs.c | 122 +++++ block/blk-wb.c | 495 ++++++++++++++++++++++++ block/blk-wb.h | 42 ++ drivers/block/drbd/drbd_main.c | 2 drivers/block/loop.c | 2 drivers/block/mtip32xx/mtip32xx.c | 6 drivers/block/nbd.c | 4 drivers/block/osdblk.c | 2 drivers/block/ps3disk.c | 2 drivers/block/skd_main.c | 2 drivers/block/virtio_blk.c | 6 drivers/block/xen-blkback/xenbus.c | 2 drivers/block/xen-blkfront.c | 3 drivers/ide/ide-disk.c | 6 drivers/md/bcache/super.c | 2 drivers/md/dm-table.c | 20 drivers/md/md.c | 2 drivers/md/raid5-cache.c | 3 drivers/mmc/card/block.c | 2 drivers/mtd/mtd_blkdevs.c | 2 drivers/nvme/host/core.c | 7 drivers/scsi/scsi.c | 3 drivers/scsi/sd.c | 8 drivers/target/target_core_iblock.c | 6 fs/block_dev.c | 2 fs/buffer.c | 2 fs/f2fs/data.c | 2 fs/f2fs/node.c | 2 fs/gfs2/meta_io.c | 3 fs/mpage.c | 9 fs/xfs/xfs_aops.c | 2 include/linux/backing-dev-defs.h | 2 include/linux/blk_types.h | 14 include/linux/blkdev.h | 27 + include/linux/fs.h | 4 include/linux/writeback.h | 10 include/trace/events/block.h 
| 98 ++++ mm/backing-dev.c | 1 mm/filemap.c | 42 +- mm/page-writeback.c | 2 52 files changed, 1281 insertions(+), 96 deletions(-) -- Jens Axboe ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 8/8] writeback: throttle buffered writeback 2016-04-18 4:24 [PATCHSET v4 0/8] " Jens Axboe @ 2016-04-18 4:24 ` Jens Axboe 2016-04-23 8:21 ` xiakaixu 0 siblings, 1 reply; 19+ messages in thread From: Jens Axboe @ 2016-04-18 4:24 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-block; +Cc: jack, dchinner, Jens Axboe Test patch that throttles buffered writeback to make it a lot smoother, with way less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at a time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it. The algorithm for when to throttle takes its inspiration from the CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply decrement the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior. The patch registers two sysfs entries. The first one, 'wb_lat_usec', sets the latency target for the window. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb.
The second entry, 'wb_stats', is a debug entry that simply shows the current internal state of the throttling machine: $ cat /sys/block/nvme0n1/queue/wb_stats background=16, normal=32, max=64, inflight=0, wait=0, bdp_wait=0 'background' denotes how many requests we will allow in-flight for idle background buffered writeback, 'normal' for higher priority writeback, and 'max' for when it's urgent we clean pages. 'inflight' shows how many requests are currently in-flight for buffered writeback, 'wait' shows if anyone is currently waiting for access, and 'bdp_wait' shows if someone is currently throttled on this device in balance_dirty_pages(). blk-wb also registers a few trace events that can be used to monitor the state changes: block_wb_lat: Latency 2446318 block_wb_stat: read lat: mean=2446318, min=2446318, max=2446318, samples=1, write lat: mean=518866, min=15522, max=5330353, samples=57 block_wb_step: step down: step=1, background=8, normal=16, max=32 'block_wb_lat' logs a violation in sync issue latency, 'block_wb_stat' logs a window violation of latencies and dumps the stats that lead to that, and finally, 'block_wb_step' logs a step up/down and the new limits associated with that state.
Signed-off-by: Jens Axboe <axboe@fb.com> --- block/Makefile | 2 +- block/blk-core.c | 15 ++ block/blk-mq.c | 31 ++- block/blk-settings.c | 4 + block/blk-sysfs.c | 57 +++++ block/blk-wb.c | 495 +++++++++++++++++++++++++++++++++++++++++++ block/blk-wb.h | 42 ++++ include/linux/blk_types.h | 2 + include/linux/blkdev.h | 3 + include/trace/events/block.h | 98 +++++++++ 10 files changed, 746 insertions(+), 3 deletions(-) create mode 100644 block/blk-wb.c create mode 100644 block/blk-wb.h diff --git a/block/Makefile b/block/Makefile index 3446e0472df0..7e4be7a56a59 100644 --- a/block/Makefile +++ b/block/Makefile @@ -5,7 +5,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \ blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \ - blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \ + blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o blk-wb.o \ blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \ genhd.o scsi_ioctl.o partition-generic.o ioprio.o \ badblocks.o partitions/ diff --git a/block/blk-core.c b/block/blk-core.c index 40b57bf4852c..d941f69dfb4b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -39,6 +39,7 @@ #include "blk.h" #include "blk-mq.h" +#include "blk-wb.h" EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap); EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap); @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, fail: blk_free_flush_queue(q->fq); + blk_wb_exit(q); return NULL; } EXPORT_SYMBOL(blk_init_allocated_queue); @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq) blk_delete_timer(rq); blk_clear_rq_complete(rq); trace_block_rq_requeue(q, rq); + blk_wb_requeue(q->rq_wb, rq); if (rq->cmd_flags & REQ_QUEUED) blk_queue_end_tag(q, rq); @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req) /* this is a bio leak */ WARN_ON(req->bio != NULL); + blk_wb_done(q->rq_wb, req); + /* * Request may 
not have originated from ll_rw_blk. if not, * it didn't come out of our reserved rq pools @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT; struct request *req; unsigned int request_count = 0; + bool wb_acct; /* * low level driver can indicate that it wants pages above a @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) } get_rq: + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock); + /* * This sync check and mask will be re-done in init_request_from_bio(), * but we need to set it earlier to expose the sync flag to the @@ -1781,11 +1789,16 @@ get_rq: */ req = get_request(q, rw_flags, bio, GFP_NOIO); if (IS_ERR(req)) { + if (wb_acct) + __blk_wb_done(q->rq_wb); bio->bi_error = PTR_ERR(req); bio_endio(bio); goto out_unlock; } + if (wb_acct) + req->cmd_flags |= REQ_BUF_INFLIGHT; + /* * After dropping the lock and possibly sleeping here, our request * may now be mergeable after it had proven unmergeable (above). 
@@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req) blk_dequeue_request(req); req->issue_time = ktime_to_ns(ktime_get()); + blk_wb_issue(req->q->rq_wb, req); /* * We are now handing the request to the hardware, initialize @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error) blk_unprep_request(req); blk_account_io_done(req); + blk_wb_done(req->q->rq_wb, req); if (req->end_io) req->end_io(req, error); diff --git a/block/blk-mq.c b/block/blk-mq.c index 71b4a13fbf94..c0c5207fe7fd 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -30,6 +30,7 @@ #include "blk-mq.h" #include "blk-mq-tag.h" #include "blk-stat.h" +#include "blk-wb.h" static DEFINE_MUTEX(all_q_mutex); static LIST_HEAD(all_q_list); @@ -275,6 +276,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, if (rq->cmd_flags & REQ_MQ_INFLIGHT) atomic_dec(&hctx->nr_active); + + blk_wb_done(q->rq_wb, rq); + rq->cmd_flags = 0; clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); @@ -305,6 +309,7 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request); inline void __blk_mq_end_request(struct request *rq, int error) { blk_account_io_done(rq); + blk_wb_done(rq->q->rq_wb, rq); if (rq->end_io) { rq->end_io(rq, error); @@ -414,6 +419,7 @@ void blk_mq_start_request(struct request *rq) rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq); rq->issue_time = ktime_to_ns(ktime_get()); + blk_wb_issue(q->rq_wb, rq); blk_add_timer(rq); @@ -450,6 +456,7 @@ static void __blk_mq_requeue_request(struct request *rq) struct request_queue *q = rq->q; trace_block_rq_requeue(q, rq); + blk_wb_requeue(q->rq_wb, rq); if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) { if (q->dma_drain_size && blk_rq_bytes(rq)) @@ -1265,6 +1272,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) struct blk_plug *plug; struct request *same_queue_rq = NULL; blk_qc_t cookie; + bool wb_acct; blk_queue_bounce(q, &bio); @@ -1282,9 +1290,17 @@ static blk_qc_t blk_mq_make_request(struct 
request_queue *q, struct bio *bio) } else request_count = blk_plug_queued_count(q); + wb_acct = blk_wb_wait(q->rq_wb, bio, NULL); + rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (wb_acct) + __blk_wb_done(q->rq_wb); return BLK_QC_T_NONE; + } + + if (wb_acct) + rq->cmd_flags |= REQ_BUF_INFLIGHT; cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -1361,6 +1377,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) struct blk_map_ctx data; struct request *rq; blk_qc_t cookie; + bool wb_acct; blk_queue_bounce(q, &bio); @@ -1375,9 +1392,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) blk_attempt_plug_merge(q, bio, &request_count, NULL)) return BLK_QC_T_NONE; + wb_acct = blk_wb_wait(q->rq_wb, bio, NULL); + rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (wb_acct) + __blk_wb_done(q->rq_wb); return BLK_QC_T_NONE; + } + + if (wb_acct) + rq->cmd_flags |= REQ_BUF_INFLIGHT; cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -2111,6 +2136,8 @@ void blk_mq_free_queue(struct request_queue *q) list_del_init(&q->all_q_node); mutex_unlock(&all_q_mutex); + blk_wb_exit(q); + blk_mq_del_queue_tag_set(q); blk_mq_exit_hw_queues(q, set, set->nr_hw_queues); diff --git a/block/blk-settings.c b/block/blk-settings.c index f7e122e717e8..84bcfc22e020 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -13,6 +13,7 @@ #include <linux/gfp.h> #include "blk.h" +#include "blk-wb.h" unsigned long blk_max_low_pfn; EXPORT_SYMBOL(blk_max_low_pfn); @@ -840,6 +841,9 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable); void blk_set_queue_depth(struct request_queue *q, unsigned int depth) { q->queue_depth = depth; + + if (q->rq_wb) + blk_wb_update_limits(q->rq_wb); } EXPORT_SYMBOL(blk_set_queue_depth); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 6e516cc0d3d0..13f325deffa1 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c 
@@ -13,6 +13,7 @@ #include "blk.h" #include "blk-mq.h" +#include "blk-wb.h" struct queue_sysfs_entry { struct attribute attr; @@ -347,6 +348,47 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page, return ret; } +static ssize_t queue_wb_stats_show(struct request_queue *q, char *page) +{ + struct rq_wb *rwb = q->rq_wb; + + if (!rwb) + return -EINVAL; + + return sprintf(page, "background=%d, normal=%d, max=%d, inflight=%d," + " wait=%d, bdp_wait=%d\n", rwb->wb_background, + rwb->wb_normal, rwb->wb_max, + atomic_read(&rwb->inflight), + waitqueue_active(&rwb->wait), + atomic_read(rwb->bdp_wait)); +} + +static ssize_t queue_wb_lat_show(struct request_queue *q, char *page) +{ + if (!q->rq_wb) + return -EINVAL; + + return sprintf(page, "%llu\n", q->rq_wb->min_lat_nsec / 1000ULL); +} + +static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page, + size_t count) +{ + u64 val; + int err; + + if (!q->rq_wb) + return -EINVAL; + + err = kstrtou64(page, 10, &val); + if (err < 0) + return err; + + q->rq_wb->min_lat_nsec = val * 1000ULL; + blk_wb_update_limits(q->rq_wb); + return count; +} + static ssize_t queue_wc_show(struct request_queue *q, char *page) { if (test_bit(QUEUE_FLAG_WC, &q->queue_flags)) @@ -541,6 +583,17 @@ static struct queue_sysfs_entry queue_stats_entry = { .show = queue_stats_show, }; +static struct queue_sysfs_entry queue_wb_stats_entry = { + .attr = {.name = "wb_stats", .mode = S_IRUGO }, + .show = queue_wb_stats_show, +}; + +static struct queue_sysfs_entry queue_wb_lat_entry = { + .attr = {.name = "wb_lat_usec", .mode = S_IRUGO | S_IWUSR }, + .show = queue_wb_lat_show, + .store = queue_wb_lat_store, +}; + static struct attribute *default_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, @@ -568,6 +621,8 @@ static struct attribute *default_attrs[] = { &queue_poll_entry.attr, &queue_wc_entry.attr, &queue_stats_entry.attr, + &queue_wb_stats_entry.attr, + &queue_wb_lat_entry.attr, NULL, }; @@ -721,6 +776,8 @@ 
int blk_register_queue(struct gendisk *disk) if (q->mq_ops) blk_mq_register_disk(disk); + blk_wb_init(q); + if (!q->request_fn) return 0; diff --git a/block/blk-wb.c b/block/blk-wb.c new file mode 100644 index 000000000000..1b1d80876930 --- /dev/null +++ b/block/blk-wb.c @@ -0,0 +1,495 @@ +/* + * buffered writeback throttling. losely based on CoDel. We can't drop + * packets for IO scheduling, so the logic is something like this: + * + * - Monitor latencies in a defined window of time. + * - If the minimum latency in the above window exceeds some target, increment + * scaling step and scale down queue depth by a factor of 2x. The monitoring + * window is then shrunk to 100 / sqrt(scaling step + 1). + * - For any window where we don't have solid data on what the latencies + * look like, retain status quo. + * - If latencies look good, decrement scaling step. + * + * Copyright (C) 2016 Jens Axboe + * + * Things that (may) need changing: + * + * - Different scaling of background/normal/high priority writeback. + * We may have to violate guarantees for max. + * - We can have mismatches between the stat window and our window. + * + */ +#include <linux/kernel.h> +#include <linux/bio.h> +#include <linux/blkdev.h> +#include <trace/events/block.h> + +#include "blk.h" +#include "blk-wb.h" +#include "blk-stat.h" + +enum { + /* + * Might need to be higher + */ + RWB_MAX_DEPTH = 64, + + /* + * 100msec window + */ + RWB_WINDOW_NSEC = 100 * 1000 * 1000ULL, + + /* + * Disregard stats, if we don't meet these minimums + */ + RWB_MIN_WRITE_SAMPLES = 3, + RWB_MIN_READ_SAMPLES = 1, + + /* + * Target min latencies, in nsecs + */ + RWB_ROT_LAT = 75000000ULL, /* 75 msec */ + RWB_NONROT_LAT = 2000000ULL, /* 2 msec */ +}; + +static inline bool rwb_enabled(struct rq_wb *rwb) +{ + return rwb && rwb->wb_normal != 0; +} + +/* + * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded, + * false if 'v' + 1 would be bigger than 'below'. 
+ */ +static bool atomic_inc_below(atomic_t *v, int below) +{ + int cur = atomic_read(v); + + for (;;) { + int old; + + if (cur >= below) + return false; + old = atomic_cmpxchg(v, cur, cur + 1); + if (old == cur) + break; + cur = old; + } + + return true; +} + +static void wb_timestamp(struct rq_wb *rwb, unsigned long *var) +{ + if (rwb_enabled(rwb)) { + const unsigned long cur = jiffies; + + if (cur != *var) + *var = cur; + } +} + +void __blk_wb_done(struct rq_wb *rwb) +{ + int inflight, limit = rwb->wb_normal; + + /* + * If the device does write back caching, drop further down + * before we wake people up. + */ + if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) && + !atomic_read(rwb->bdp_wait)) + limit = 0; + else + limit = rwb->wb_normal; + + /* + * Don't wake anyone up if we are above the normal limit. If + * throttling got disabled (limit == 0) with waiters, ensure + * that we wake them up. + */ + inflight = atomic_dec_return(&rwb->inflight); + if (limit && inflight >= limit) { + if (!rwb->wb_max) + wake_up_all(&rwb->wait); + return; + } + + if (waitqueue_active(&rwb->wait)) { + int diff = limit - inflight; + + if (!inflight || diff >= rwb->wb_background / 2) + wake_up_nr(&rwb->wait, 1); + } +} + +/* + * Called on completion of a request. Note that it's also called when + * a request is merged, when the request gets freed. 
+ */ +void blk_wb_done(struct rq_wb *rwb, struct request *rq) +{ + if (!rwb) + return; + + if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) { + if (rwb->sync_cookie == rq) { + rwb->sync_issue = 0; + rwb->sync_cookie = NULL; + } + + wb_timestamp(rwb, &rwb->last_comp); + } else { + WARN_ON_ONCE(rq == rwb->sync_cookie); + __blk_wb_done(rwb); + rq->cmd_flags &= ~REQ_BUF_INFLIGHT; + } +} + +static void calc_wb_limits(struct rq_wb *rwb) +{ + unsigned int depth; + + if (!rwb->min_lat_nsec) { + rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0; + return; + } + + depth = min_t(unsigned int, RWB_MAX_DEPTH, blk_queue_depth(rwb->q)); + + /* + * Reduce max depth by 50%, and re-calculate normal/bg based on that + */ + rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step)); + rwb->wb_normal = (rwb->wb_max + 1) / 2; + rwb->wb_background = (rwb->wb_max + 3) / 4; +} + +static bool inline stat_sample_valid(struct blk_rq_stat *stat) +{ + /* + * We need at least one read sample, and a minimum of + * RWB_MIN_WRITE_SAMPLES. We require some write samples to know + * that it's writes impacting us, and not just some sole read on + * a device that is in a lower power state. + */ + return stat[0].nr_samples >= 1 && + stat[1].nr_samples >= RWB_MIN_WRITE_SAMPLES; +} + +static u64 rwb_sync_issue_lat(struct rq_wb *rwb) +{ + u64 now, issue = ACCESS_ONCE(rwb->sync_issue); + + if (!issue || !rwb->sync_cookie) + return 0; + + now = ktime_to_ns(ktime_get()); + return now - issue; +} + +enum { + LAT_OK, + LAT_UNKNOWN, + LAT_EXCEEDED, +}; + +static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat) +{ + u64 thislat; + + if (!stat_sample_valid(stat)) + return LAT_UNKNOWN; + + /* + * If the 'min' latency exceeds our target, step down. 
+ */ + if (stat[0].min > rwb->min_lat_nsec) { + trace_block_wb_lat(stat[0].min); + trace_block_wb_stat(stat); + return LAT_EXCEEDED; + } + + /* + * If our stored sync issue exceeds the window size, or it + * exceeds our min target AND we haven't logged any entries, + * flag the latency as exceeded. + */ + thislat = rwb_sync_issue_lat(rwb); + if (thislat > rwb->win_nsec || + (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) { + trace_block_wb_lat(thislat); + return LAT_EXCEEDED; + } + + if (rwb->scale_step) + trace_block_wb_stat(stat); + + return LAT_OK; +} + +static int latency_exceeded(struct rq_wb *rwb) +{ + struct blk_rq_stat stat[2]; + + blk_queue_stat_get(rwb->q, stat); + + return __latency_exceeded(rwb, stat); +} + +static void rwb_trace_step(struct rq_wb *rwb, const char *msg) +{ + trace_block_wb_step(msg, rwb->scale_step, rwb->wb_background, + rwb->wb_normal, rwb->wb_max); +} + +static void scale_up(struct rq_wb *rwb) +{ + /* + * If we're at 0, we can't go lower. + */ + if (!rwb->scale_step) + return; + + rwb->scale_step--; + calc_wb_limits(rwb); + + if (waitqueue_active(&rwb->wait)) + wake_up_all(&rwb->wait); + + rwb_trace_step(rwb, "step up"); +} + +static void scale_down(struct rq_wb *rwb) +{ + /* + * Stop scaling down when we've hit the limit. This also prevents + * ->scale_step from going to crazy values, if the device can't + * keep up. + */ + if (rwb->wb_max == 1) + return; + + rwb->scale_step++; + blk_stat_clear(rwb->q); + calc_wb_limits(rwb); + rwb_trace_step(rwb, "step down"); +} + +static void rwb_arm_timer(struct rq_wb *rwb) +{ + unsigned long expires; + + rwb->win_nsec = 1000000000ULL / int_sqrt((rwb->scale_step + 1) * 100); + expires = jiffies + nsecs_to_jiffies(rwb->win_nsec); + mod_timer(&rwb->window_timer, expires); +} + +static void blk_wb_timer_fn(unsigned long data) +{ + struct rq_wb *rwb = (struct rq_wb *) data; + int status; + + /* + * If we exceeded the latency target, step down. If we did not, + * step one level up. 
If we don't know enough to say either exceeded + * or ok, then don't do anything. + */ + status = latency_exceeded(rwb); + switch (status) { + case LAT_EXCEEDED: + scale_down(rwb); + break; + case LAT_OK: + scale_up(rwb); + break; + default: + break; + } + + /* + * Re-arm timer, if we have IO in flight + */ + if (rwb->scale_step || atomic_read(&rwb->inflight)) + rwb_arm_timer(rwb); +} + +void blk_wb_update_limits(struct rq_wb *rwb) +{ + rwb->scale_step = 0; + calc_wb_limits(rwb); + + if (waitqueue_active(&rwb->wait)) + wake_up_all(&rwb->wait); +} + +static bool close_io(struct rq_wb *rwb) +{ + const unsigned long now = jiffies; + + return time_before(now, rwb->last_issue + HZ / 10) || + time_before(now, rwb->last_comp + HZ / 10); +} + +#define REQ_HIPRIO (REQ_SYNC | REQ_META | REQ_PRIO) + +static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw) +{ + unsigned int limit; + + /* + * At this point we know it's a buffered write. If REQ_SYNC is + * set, then it's WB_SYNC_ALL writeback, and we'll use the max + * limit for that. If the write is marked as a background write, + * then use the idle limit, or go to normal if we haven't had + * competing IO for a bit. + */ + if ((rw & REQ_HIPRIO) || atomic_read(rwb->bdp_wait)) + limit = rwb->wb_max; + else if ((rw & REQ_BG) || close_io(rwb)) { + /* + * If less than 100ms since we completed unrelated IO, + * limit us to half the depth for background writeback. + */ + limit = rwb->wb_background; + } else + limit = rwb->wb_normal; + + return limit; +} + +static inline bool may_queue(struct rq_wb *rwb, unsigned long rw) +{ + /* + * inc it here even if disabled, since we'll dec it at completion. + * this only happens if the task was sleeping in __blk_wb_wait(), + * and someone turned it off at the same time. 
+ */ + if (!rwb_enabled(rwb)) { + atomic_inc(&rwb->inflight); + return true; + } + + return atomic_inc_below(&rwb->inflight, get_limit(rwb, rw)); +} + +/* + * Block if we will exceed our limit, or if we are currently waiting for + * the timer to kick off queuing again. + */ +static void __blk_wb_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock) +{ + DEFINE_WAIT(wait); + + if (may_queue(rwb, rw)) + return; + + do { + prepare_to_wait_exclusive(&rwb->wait, &wait, + TASK_UNINTERRUPTIBLE); + + if (may_queue(rwb, rw)) + break; + + if (lock) + spin_unlock_irq(lock); + + io_schedule(); + + if (lock) + spin_lock_irq(lock); + } while (1); + + finish_wait(&rwb->wait, &wait); +} + +/* + * Returns true if the IO request should be accounted, false if not. + * May sleep, if we have exceeded the writeback limits. Caller can pass + * in an irq held spinlock, if it holds one when calling this function. + * If we do sleep, we'll release and re-grab it. + */ +bool blk_wb_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock) +{ + /* + * If disabled, or not a WRITE (or a discard), do nothing + */ + if (!rwb_enabled(rwb) || !(bio->bi_rw & REQ_WRITE) || + (bio->bi_rw & REQ_DISCARD)) + goto no_q; + + /* + * Don't throttle WRITE_ODIRECT + */ + if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC) + goto no_q; + + __blk_wb_wait(rwb, bio->bi_rw, lock); + + if (!timer_pending(&rwb->window_timer)) + rwb_arm_timer(rwb); + + return true; + +no_q: + wb_timestamp(rwb, &rwb->last_issue); + return false; +} + +void blk_wb_issue(struct rq_wb *rwb, struct request *rq) +{ + if (!rwb_enabled(rwb)) + return; + if (!(rq->cmd_flags & REQ_BUF_INFLIGHT) && !rwb->sync_issue) { + rwb->sync_cookie = rq; + rwb->sync_issue = rq->issue_time; + } +} + +void blk_wb_requeue(struct rq_wb *rwb, struct request *rq) +{ + if (!rwb_enabled(rwb)) + return; + if (rq == rwb->sync_cookie) { + rwb->sync_issue = 0; + rwb->sync_cookie = NULL; + } +} + +void blk_wb_init(struct request_queue *q) +{ + struct rq_wb 
*rwb; + + /* + * If this fails, we don't get throttling + */ + rwb = kzalloc(sizeof(*rwb), GFP_KERNEL); + if (!rwb) + return; + + atomic_set(&rwb->inflight, 0); + init_waitqueue_head(&rwb->wait); + setup_timer(&rwb->window_timer, blk_wb_timer_fn, (unsigned long) rwb); + rwb->last_comp = rwb->last_issue = jiffies; + rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping; + rwb->q = q; + + if (blk_queue_nonrot(q)) + rwb->min_lat_nsec = RWB_NONROT_LAT; + else + rwb->min_lat_nsec = RWB_ROT_LAT; + + blk_wb_update_limits(rwb); + q->rq_wb = rwb; +} + +void blk_wb_exit(struct request_queue *q) +{ + struct rq_wb *rwb = q->rq_wb; + + if (rwb) { + del_timer_sync(&rwb->window_timer); + kfree(q->rq_wb); + q->rq_wb = NULL; + } +} diff --git a/block/blk-wb.h b/block/blk-wb.h new file mode 100644 index 000000000000..6ad47195bc87 --- /dev/null +++ b/block/blk-wb.h @@ -0,0 +1,42 @@ +#ifndef BLK_WB_H +#define BLK_WB_H + +#include <linux/atomic.h> +#include <linux/wait.h> +#include <linux/timer.h> + +struct rq_wb { + /* + * Settings that govern how we throttle + */ + unsigned int wb_background; /* background writeback */ + unsigned int wb_normal; /* normal writeback */ + unsigned int wb_max; /* max throughput writeback */ + unsigned int scale_step; + + u64 win_nsec; + + struct timer_list window_timer; + + s64 sync_issue; + void *sync_cookie; + + unsigned long last_issue; /* last non-throttled issue */ + unsigned long last_comp; /* last non-throttled comp */ + unsigned long min_lat_nsec; + atomic_t *bdp_wait; + struct request_queue *q; + atomic_t inflight; + wait_queue_head_t wait; +}; + +void __blk_wb_done(struct rq_wb *); +void blk_wb_done(struct rq_wb *, struct request *); +bool blk_wb_wait(struct rq_wb *, struct bio *, spinlock_t *); +void blk_wb_init(struct request_queue *); +void blk_wb_exit(struct request_queue *); +void blk_wb_update_limits(struct rq_wb *); +void blk_wb_requeue(struct rq_wb *, struct request *); +void blk_wb_issue(struct rq_wb *, struct request *); + +#endif 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 2b4414fb4d8e..c41f8a303804 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -189,6 +189,7 @@ enum rq_flag_bits { __REQ_PM, /* runtime pm request */ __REQ_HASHED, /* on IO scheduler merge hash */ __REQ_MQ_INFLIGHT, /* track inflight for MQ */ + __REQ_BUF_INFLIGHT, /* track inflight for buffered */ __REQ_NR_BITS, /* stops here */ }; @@ -243,6 +244,7 @@ enum rq_flag_bits { #define REQ_PM (1ULL << __REQ_PM) #define REQ_HASHED (1ULL << __REQ_HASHED) #define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT) +#define REQ_BUF_INFLIGHT (1ULL << __REQ_BUF_INFLIGHT) typedef unsigned int blk_qc_t; #define BLK_QC_T_NONE -1U diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 87f6703ced71..230c55dc95ae 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -37,6 +37,7 @@ struct bsg_job; struct blkcg_gq; struct blk_flush_queue; struct pr_ops; +struct rq_wb; #define BLKDEV_MIN_RQ 4 #define BLKDEV_MAX_RQ 128 /* Default maximum */ @@ -291,6 +292,8 @@ struct request_queue { int nr_rqs[2]; /* # allocated [a]sync rqs */ int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */ + struct rq_wb *rq_wb; + /* * If blkcg is not used, @q->root_rl serves all requests. 
If blkcg * is used, root blkg allocates from @q->root_rl and all other diff --git a/include/trace/events/block.h b/include/trace/events/block.h index e8a5eca1dbe5..8ae9f47d5287 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -667,6 +667,104 @@ TRACE_EVENT(block_rq_remap, (unsigned long long)__entry->old_sector, __entry->nr_bios) ); +/** + * block_wb_stat - trace stats for blk_wb + * @stat: array of read/write stats + */ +TRACE_EVENT(block_wb_stat, + + TP_PROTO(struct blk_rq_stat *stat), + + TP_ARGS(stat), + + TP_STRUCT__entry( + __field( s64, rmean ) + __field( u64, rmin ) + __field( u64, rmax ) + __field( s64, rnr_samples ) + __field( s64, rtime ) + __field( s64, wmean ) + __field( u64, wmin ) + __field( u64, wmax ) + __field( s64, wnr_samples ) + __field( s64, wtime ) + ), + + TP_fast_assign( + __entry->rmean = stat[0].mean; + __entry->rmin = stat[0].min; + __entry->rmax = stat[0].max; + __entry->rnr_samples = stat[0].nr_samples; + __entry->wmean = stat[1].mean; + __entry->wmin = stat[1].min; + __entry->wmax = stat[1].max; + __entry->wnr_samples = stat[1].nr_samples; + ), + + TP_printk("read lat: mean=%llu, min=%llu, max=%llu, samples=%llu," + "write lat: mean=%llu, min=%llu, max=%llu, samples=%llu\n", + __entry->rmean, __entry->rmin, __entry->rmax, + __entry->rnr_samples, __entry->wmean, __entry->wmin, + __entry->wmax, __entry->wnr_samples) +); + +/** + * block_wb_lat - trace latency event + * @lat: latency trigger + */ +TRACE_EVENT(block_wb_lat, + + TP_PROTO(unsigned long lat), + + TP_ARGS(lat), + + TP_STRUCT__entry( + __field( unsigned long, lat ) + ), + + TP_fast_assign( + __entry->lat = lat; + ), + + TP_printk("Latency %llu\n", (unsigned long long) __entry->lat) +); + +/** + * block_wb_step - trace wb event step + * @msg: context message + * @step: the current scale step count + * @bg: the current background queue limit + * @normal: the current normal writeback limit + * @max: the current max throughput writeback limit + */ 
+TRACE_EVENT(block_wb_step, + + TP_PROTO(const char *msg, unsigned int step, unsigned int bg, + unsigned int normal, unsigned int max), + + TP_ARGS(msg, step, bg, normal, max), + + TP_STRUCT__entry( + __field( const char *, msg ) + __field( unsigned int, step ) + __field( unsigned int, bg ) + __field( unsigned int, normal ) + __field( unsigned int, max ) + ), + + TP_fast_assign( + __entry->msg = msg; + __entry->step = step; + __entry->bg = bg; + __entry->normal = normal; + __entry->max = max; + ), + + TP_printk("%s: step=%u, background=%u, normal=%u, max=%u\n", + __entry->msg, __entry->step, __entry->bg, __entry->normal, + __entry->max) +); + #endif /* _TRACE_BLOCK_H */ /* This part must be outside protection */ -- 2.8.0.rc4.6.g7e4ba36 ^ permalink raw reply related [flat|nested] 19+ messages in thread
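[Editor's illustration] The heart of the throttling above is the depth scaling in calc_wb_limits(): each scale step halves the max queue depth, and the normal/background limits are derived from it. The arithmetic can be modeled in plain userspace C — this is a simplified sketch of the calculation only, with the queue depth passed in as an assumed value rather than read from a live request queue:

```c
#include <assert.h>

#define RWB_MAX_DEPTH 64

struct wb_limits {
	unsigned int wb_max;
	unsigned int wb_normal;
	unsigned int wb_background;
};

/* Mirrors the patch's calc_wb_limits(): clamp to RWB_MAX_DEPTH,
 * halve the max depth once per scale step, then derive the normal
 * and background limits as roughly 1/2 and 1/4 of the max. */
static struct wb_limits calc_limits(unsigned int queue_depth,
				    unsigned int scale_step)
{
	struct wb_limits l;
	unsigned int depth = queue_depth < RWB_MAX_DEPTH ?
				queue_depth : RWB_MAX_DEPTH;
	unsigned int shift = scale_step < 31 ? scale_step : 31;

	l.wb_max = 1 + ((depth - 1) >> shift);
	l.wb_normal = (l.wb_max + 1) / 2;
	l.wb_background = (l.wb_max + 3) / 4;
	return l;
}
```

So with a 128-deep device at scale step 0 the limits come out as 64/32/16, and one scale-down step drops them to roughly 32/16/8 — which is why a few steps are enough to get background writeback out of the way of foreground IO.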
* Re: [PATCH 8/8] writeback: throttle buffered writeback 2016-04-18 4:24 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe @ 2016-04-23 8:21 ` xiakaixu 2016-04-23 21:37 ` Jens Axboe 0 siblings, 1 reply; 19+ messages in thread From: xiakaixu @ 2016-04-23 8:21 UTC (permalink / raw) To: Jens Axboe Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner, miaoxie (A), Bintian, Huxinwei, Xia Kaixu > diff --git a/block/blk-core.c b/block/blk-core.c > index 40b57bf4852c..d941f69dfb4b 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -39,6 +39,7 @@ > > #include "blk.h" > #include "blk-mq.h" > +#include "blk-wb.h" > > EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap); > EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap); > @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, > > fail: > blk_free_flush_queue(q->fq); > + blk_wb_exit(q); > return NULL; > } > EXPORT_SYMBOL(blk_init_allocated_queue); > @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq) > blk_delete_timer(rq); > blk_clear_rq_complete(rq); > trace_block_rq_requeue(q, rq); > + blk_wb_requeue(q->rq_wb, rq); > > if (rq->cmd_flags & REQ_QUEUED) > blk_queue_end_tag(q, rq); > @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req) > /* this is a bio leak */ > WARN_ON(req->bio != NULL); > > + blk_wb_done(q->rq_wb, req); > + > /* > * Request may not have originated from ll_rw_blk. 
if not, > * it didn't come out of our reserved rq pools > @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) > int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT; > struct request *req; > unsigned int request_count = 0; > + bool wb_acct; > > /* > * low level driver can indicate that it wants pages above a > @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) > } > > get_rq: > + wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock); > + > /* > * This sync check and mask will be re-done in init_request_from_bio(), > * but we need to set it earlier to expose the sync flag to the > @@ -1781,11 +1789,16 @@ get_rq: > */ > req = get_request(q, rw_flags, bio, GFP_NOIO); > if (IS_ERR(req)) { > + if (wb_acct) > + __blk_wb_done(q->rq_wb); > bio->bi_error = PTR_ERR(req); > bio_endio(bio); > goto out_unlock; > } > > + if (wb_acct) > + req->cmd_flags |= REQ_BUF_INFLIGHT; > + > /* > * After dropping the lock and possibly sleeping here, our request > * may now be mergeable after it had proven unmergeable (above). > @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req) > blk_dequeue_request(req); > > req->issue_time = ktime_to_ns(ktime_get()); > + blk_wb_issue(req->q->rq_wb, req); > > /* > * We are now handing the request to the hardware, initialize > @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error) > blk_unprep_request(req); > > blk_account_io_done(req); > + blk_wb_done(req->q->rq_wb, req); Hi Jens, Seems the function blk_wb_done() will be executed twice even if the end_io callback is set. Maybe the same thing would happen in blk-mq.c. 
> > if (req->end_io) > req->end_io(req, error); > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 71b4a13fbf94..c0c5207fe7fd 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -30,6 +30,7 @@ > #include "blk-mq.h" > #include "blk-mq-tag.h" > #include "blk-stat.h" > +#include "blk-wb.h" > > static DEFINE_MUTEX(all_q_mutex); > static LIST_HEAD(all_q_list); > @@ -275,6 +276,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, > > if (rq->cmd_flags & REQ_MQ_INFLIGHT) > atomic_dec(&hctx->nr_active); > + > + blk_wb_done(q->rq_wb, rq); > + > rq->cmd_flags = 0; > > clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); > @@ -305,6 +309,7 @@ EXPORT_SYMBOL_GPL(blk_mq_free_request); > inline void __blk_mq_end_request(struct request *rq, int error) > { > blk_account_io_done(rq); > + blk_wb_done(rq->q->rq_wb, rq); > > if (rq->end_io) { > rq->end_io(rq, error); > @@ -414,6 +419,7 @@ void blk_mq_start_request(struct request *rq) > rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq); > > rq->issue_time = ktime_to_ns(ktime_get()); > + blk_wb_issue(q->rq_wb, rq); > > blk_add_timer(rq); > > @@ -450,6 +456,7 @@ static void __blk_mq_requeue_request(struct request *rq) > struct request_queue *q = rq->q; > > trace_block_rq_requeue(q, rq); > + blk_wb_requeue(q->rq_wb, rq); > > if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) { > if (q->dma_drain_size && blk_rq_bytes(rq)) > @@ -1265,6 +1272,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) > struct blk_plug *plug; > struct request *same_queue_rq = NULL; > blk_qc_t cookie; > + bool wb_acct; > > blk_queue_bounce(q, &bio); > > @@ -1282,9 +1290,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) > } else > request_count = blk_plug_queued_count(q); > > + wb_acct = blk_wb_wait(q->rq_wb, bio, NULL); > + > rq = blk_mq_map_request(q, bio, &data); > - if (unlikely(!rq)) > + if (unlikely(!rq)) { > + if (wb_acct) > + __blk_wb_done(q->rq_wb); > return 
BLK_QC_T_NONE; > + } > + > + if (wb_acct) > + rq->cmd_flags |= REQ_BUF_INFLIGHT; > > cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); > > @@ -1361,6 +1377,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) > struct blk_map_ctx data; > struct request *rq; > blk_qc_t cookie; > + bool wb_acct; > > blk_queue_bounce(q, &bio); > > @@ -1375,9 +1392,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) > blk_attempt_plug_merge(q, bio, &request_count, NULL)) > return BLK_QC_T_NONE; > > + wb_acct = blk_wb_wait(q->rq_wb, bio, NULL); > + > rq = blk_mq_map_request(q, bio, &data); > - if (unlikely(!rq)) > + if (unlikely(!rq)) { > + if (wb_acct) > + __blk_wb_done(q->rq_wb); > return BLK_QC_T_NONE; > + } > + > + if (wb_acct) > + rq->cmd_flags |= REQ_BUF_INFLIGHT; > > cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); > > @@ -2111,6 +2136,8 @@ void blk_mq_free_queue(struct request_queue *q) > list_del_init(&q->all_q_node); -- Regards Kaixu Xia ^ permalink raw reply [flat|nested] 19+ messages in thread
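[Editor's illustration] The admission accounting quoted above (blk_wb_wait() / may_queue()) ultimately hinges on the patch's atomic_inc_below() primitive: increment the in-flight count only if the result stays within the current limit. The same compare-and-swap loop can be reproduced outside the kernel with C11 atomics — a sketch, not the kernel code; `atomic_int` here stands in for the kernel's `atomic_t`:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v only if its current value is below 'below'; return
 * false (leaving *v unchanged) once the limit has been reached. */
static bool inc_below(atomic_int *v, int below)
{
	int cur = atomic_load(v);

	while (cur < below) {
		/* on failure, compare_exchange reloads 'cur' for us,
		 * so we simply re-test against the limit and retry */
		if (atomic_compare_exchange_weak(v, &cur, cur + 1))
			return true;
	}
	return false;
}
```

A caller that gets `false` back goes to sleep on the waitqueue instead of issuing, which is exactly how the write-side throttling is enforced without a lock in the hot path.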
* Re: [PATCH 8/8] writeback: throttle buffered writeback 2016-04-23 8:21 ` xiakaixu @ 2016-04-23 21:37 ` Jens Axboe 2016-04-25 11:41 ` xiakaixu 0 siblings, 1 reply; 19+ messages in thread From: Jens Axboe @ 2016-04-23 21:37 UTC (permalink / raw) To: xiakaixu Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner, miaoxie (A), Bintian, Huxinwei

On 04/23/2016 02:21 AM, xiakaixu wrote:
>> [...]
>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>>  	blk_unprep_request(req);
>>
>>  	blk_account_io_done(req);
>> +	blk_wb_done(req->q->rq_wb, req);
>
> Hi Jens,
>
> Seems the function blk_wb_done() will be executed twice even if the end_io
> callback is set. Maybe the same thing would happen in blk-mq.c.

Yeah, that was a mistake, the current version has it fixed. It was inadvertently added when I discovered that the flush request didn't work properly. Now it just duplicates the call inside the check for if it has an ->end_io() defined, since we don't use the normal path for that.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH 8/8] writeback: throttle buffered writeback 2016-04-23 21:37 ` Jens Axboe @ 2016-04-25 11:41 ` xiakaixu 2016-04-25 14:37 ` Jens Axboe 0 siblings, 1 reply; 19+ messages in thread From: xiakaixu @ 2016-04-25 11:41 UTC (permalink / raw) To: Jens Axboe Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner, miaoxie (A), Bintian, Huxinwei

On 2016/4/24 5:37, Jens Axboe wrote:
> On 04/23/2016 02:21 AM, xiakaixu wrote:
>>> [...]
>>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>>>  	blk_unprep_request(req);
>>>
>>>  	blk_account_io_done(req);
>>> +	blk_wb_done(req->q->rq_wb, req);
>>
>> Hi Jens,
>>
>> Seems the function blk_wb_done() will be executed twice even if the end_io
>> callback is set. Maybe the same thing would happen in blk-mq.c.
>
> Yeah, that was a mistake, the current version has it fixed. It was inadvertently
> added when I discovered that the flush request didn't work properly. Now it just
> duplicates the call inside the check for if it has an ->end_io() defined, since
> we don't use the normal path for that.

Hi Jens,

I have checked the wb-buf-throttle branch in your block git repo; I am not sure it is the complete version. It seems only the blk-mq.c problem is fixed — the function blk_wb_done() would still be executed twice in blk-core.c (in blk_finish_request() and __blk_put_request()). Maybe we can add a flag to mark whether blk_wb_done() has been done or not.

-- 
Regards
Kaixu Xia

^ permalink raw reply [flat|nested] 19+ messages in thread
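[Editor's illustration] The flag-based guard Kaixu suggests can be sketched in userspace C: if the completion helper atomically clears the tracking bit and only decrements when it observed the bit set, then being called from both blk_finish_request() and __blk_put_request() is harmless. Names here are hypothetical stand-ins; the actual patch keeps the bit as REQ_BUF_INFLIGHT in rq->cmd_flags:

```c
#include <assert.h>
#include <stdatomic.h>

#define BUF_INFLIGHT (1u << 0)	/* stand-in for REQ_BUF_INFLIGHT */

struct fake_request {
	atomic_uint flags;
};

static atomic_int inflight;

/* Atomically clear the tracking bit; only the caller that actually
 * observed the bit set performs the decrement, so every completion
 * path may call this and the count is decremented exactly once. */
static void wb_done_once(struct fake_request *rq)
{
	unsigned int old = atomic_fetch_and(&rq->flags, ~BUF_INFLIGHT);

	if (old & BUF_INFLIGHT)
		atomic_fetch_sub(&inflight, 1);
}
```

The fetch-and makes the clear-and-test a single atomic step, so even two paths racing on the same request cannot both see the bit set.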
* Re: [PATCH 8/8] writeback: throttle buffered writeback
  2016-04-25 11:41 ` xiakaixu
@ 2016-04-25 14:37   ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-04-25 14:37 UTC (permalink / raw)
  To: xiakaixu
  Cc: linux-kernel, linux-fsdevel, linux-block, jack, dchinner,
	miaoxie (A), Bintian, Huxinwei

On 04/25/2016 05:41 AM, xiakaixu wrote:
> On 2016/4/24 5:37, Jens Axboe wrote:
>> On 04/23/2016 02:21 AM, xiakaixu wrote:
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 40b57bf4852c..d941f69dfb4b 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -39,6 +39,7 @@
>>>>
>>>>   #include "blk.h"
>>>>   #include "blk-mq.h"
>>>> +#include "blk-wb.h"
>>>>
>>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
>>>>   EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
>>>> @@ -880,6 +881,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
>>>>
>>>>   fail:
>>>>       blk_free_flush_queue(q->fq);
>>>> +    blk_wb_exit(q);
>>>>       return NULL;
>>>>   }
>>>>   EXPORT_SYMBOL(blk_init_allocated_queue);
>>>> @@ -1395,6 +1397,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
>>>>       blk_delete_timer(rq);
>>>>       blk_clear_rq_complete(rq);
>>>>       trace_block_rq_requeue(q, rq);
>>>> +    blk_wb_requeue(q->rq_wb, rq);
>>>>
>>>>       if (rq->cmd_flags & REQ_QUEUED)
>>>>           blk_queue_end_tag(q, rq);
>>>> @@ -1485,6 +1488,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
>>>>       /* this is a bio leak */
>>>>       WARN_ON(req->bio != NULL);
>>>>
>>>> +    blk_wb_done(q->rq_wb, req);
>>>> +
>>>>       /*
>>>>        * Request may not have originated from ll_rw_blk. if not,
>>>>        * it didn't come out of our reserved rq pools
>>>> @@ -1714,6 +1719,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>>>>       int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
>>>>       struct request *req;
>>>>       unsigned int request_count = 0;
>>>> +    bool wb_acct;
>>>>
>>>>       /*
>>>>        * low level driver can indicate that it wants pages above a
>>>> @@ -1766,6 +1772,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
>>>>       }
>>>>
>>>>   get_rq:
>>>> +    wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
>>>> +
>>>>       /*
>>>>        * This sync check and mask will be re-done in init_request_from_bio(),
>>>>        * but we need to set it earlier to expose the sync flag to the
>>>> @@ -1781,11 +1789,16 @@ get_rq:
>>>>        */
>>>>       req = get_request(q, rw_flags, bio, GFP_NOIO);
>>>>       if (IS_ERR(req)) {
>>>> +        if (wb_acct)
>>>> +            __blk_wb_done(q->rq_wb);
>>>>           bio->bi_error = PTR_ERR(req);
>>>>           bio_endio(bio);
>>>>           goto out_unlock;
>>>>       }
>>>>
>>>> +    if (wb_acct)
>>>> +        req->cmd_flags |= REQ_BUF_INFLIGHT;
>>>> +
>>>>       /*
>>>>        * After dropping the lock and possibly sleeping here, our request
>>>>        * may now be mergeable after it had proven unmergeable (above).
>>>> @@ -2515,6 +2528,7 @@ void blk_start_request(struct request *req)
>>>>       blk_dequeue_request(req);
>>>>
>>>>       req->issue_time = ktime_to_ns(ktime_get());
>>>> +    blk_wb_issue(req->q->rq_wb, req);
>>>>
>>>>       /*
>>>>        * We are now handing the request to the hardware, initialize
>>>> @@ -2751,6 +2765,7 @@ void blk_finish_request(struct request *req, int error)
>>>>           blk_unprep_request(req);
>>>>
>>>>       blk_account_io_done(req);
>>>> +    blk_wb_done(req->q->rq_wb, req);
>>>
>>> Hi Jens,
>>>
>>> Seems the function blk_wb_done() will be executed twice even if the
>>> end_io callback is set.
>>> Maybe the same thing would happen in blk-mq.c.
>>
>> Yeah, that was a mistake, the current version has it fixed. It was
>> inadvertently added when I discovered that the flush request didn't
>> work properly. Now it just duplicates the call inside the check for if
>> it has an ->end_io() defined, since we don't use the normal path for
>> that.
>>
> Hi Jens,
>
> I have checked the wb-buf-throttle branch in your block git repo. I am
> not sure it is the complete version. It seems the problem is only fixed
> in blk-mq.c. The function blk_wb_done() would still be executed twice in
> blk-core.c (in blk_finish_request() and __blk_put_request()).
> Maybe we can add a flag to mark whether blk_wb_done() has been done or
> not.

Good catch, it looks like I only patched up the mq bits. It's still not
perfect, since we could potentially double account a request that has a
private end_io(), if it was allocated through the normal block rq
allocator. It'll skew the unrelated-io-timestamp a bit, but it's not a
big deal. The count for inflight will be consistent, which is the
important part. We currently have just 1 bit to tell if the request is
tracked or not, so we don't know if it was tracked but already seen.

I'll fix up the blk-core part to be identical to the blk-mq fix.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread
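The fix discussed here, knowing not only whether a request is tracked but whether its completion accounting has already run, can be sketched in miniature. All names below are illustrative, not the actual kernel fix:

```c
#include <stdbool.h>

/* Hypothetical per-request accounting state. The real code packs this
 * into request flags; two booleans keep the sketch simple. */
struct wb_track {
	bool tracked;	/* request was throttled/accounted at issue time */
	bool completed;	/* completion accounting already ran */
};

struct rq_wb_model {
	int inflight;	/* tracked writeback requests on the device */
};

/* Idempotent completion accounting: safe to reach from both the
 * blk_finish_request() and __blk_put_request() paths, because only the
 * first invocation for a tracked request decrements the count. */
static void wb_done_once(struct rq_wb_model *rwb, struct wb_track *t)
{
	if (!t->tracked || t->completed)
		return;
	t->completed = true;
	rwb->inflight--;
}
```

With a guard like this, a double call is a harmless no-op and the inflight count stays consistent, which is the property the reply above calls out as the important one.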
* [PATCHSET v5] Make background writeback great again for the first time
@ 2016-04-26 15:55 Jens Axboe
  2016-04-26 15:55 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block; +Cc: jack, dchinner, sedat.dilek

Hi,

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, installation of a big RPM (or similar) adversely impacts
database reads or sync writes. When that happens, I get people yelling
at me.

I have posted plenty of results previously, I'll keep it shorter this
time. Here's a run on my laptop, using read-to-pipe-async for reading a
5g file, and rewriting it. You can find this test program in the fio git
repo.
4.6-rc3:

$ t/read-to-pipe-async -f ~/5g > 5g-new

Latency percentiles (usec) (READERS)
	50.0000th: 2
	75.0000th: 3
	90.0000th: 5
	95.0000th: 7
	99.0000th: 43
	99.5000th: 77
	99.9000th: 9008
	99.9900th: 91008
	99.9990th: 286208
	99.9999th: 347648
	Over=1251, min=0, max=358081
Latency percentiles (usec) (WRITERS)
	50.0000th: 4
	75.0000th: 8
	90.0000th: 13
	95.0000th: 15
	99.0000th: 32
	99.5000th: 43
	99.9000th: 81
	99.9900th: 2372
	99.9990th: 104320
	99.9999th: 349696
	Over=63, min=1, max=358321
Read rate (KB/sec) : 91859
Write rate (KB/sec): 91859

4.6-rc3 + wb-buf-throttle

Latency percentiles (usec) (READERS)
	50.0000th: 2
	75.0000th: 3
	90.0000th: 5
	95.0000th: 8
	99.0000th: 48
	99.5000th: 79
	99.9000th: 5304
	99.9900th: 22496
	99.9990th: 29408
	99.9999th: 33728
	Over=860, min=0, max=37599
Latency percentiles (usec) (WRITERS)
	50.0000th: 4
	75.0000th: 9
	90.0000th: 14
	95.0000th: 16
	99.0000th: 34
	99.5000th: 45
	99.9000th: 87
	99.9900th: 1342
	99.9990th: 13648
	99.9999th: 21280
	Over=29, min=1, max=30457
Read rate (KB/sec) : 95832
Write rate (KB/sec): 95832

Better throughput and tighter latencies, for both reads and writes.
That's hard not to like.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. It's all about managing
the queues on the hardware side. The big change in this version is that
it should be pretty much auto-tuning - you no longer have to set a given
percentage of writeback bandwidth. I've implemented something similar to
CoDel to manage the writeback queue.
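The core throttling idea, a deliberately small in-flight budget for background writeback and a bigger one when someone is actually waiting on the I/O, can be modeled in a few lines. All names here are made up for illustration; the real logic lives in the patchset's wbt code:

```c
#include <stdbool.h>

enum wb_kind {
	WB_KIND_BACKGROUND,	/* plain background flushing */
	WB_KIND_SYNC_OR_RECLAIM	/* reclaim, sync, balance_dirty_pages waiters */
};

struct wb_depths {
	int background;	/* small: don't flood the device queue */
	int max;	/* near device limits, for urgent cleaning */
};

/* Pick the in-flight budget for this kind of writeback. */
static int wb_limit(const struct wb_depths *d, enum wb_kind kind)
{
	return kind == WB_KIND_BACKGROUND ? d->background : d->max;
}

/* A writer may issue another request only while under its budget;
 * in the kernel it would otherwise sleep until a completion wakes it. */
static bool wb_may_issue(const struct wb_depths *d, enum wb_kind kind,
			 int inflight)
{
	return inflight < wb_limit(d, kind);
}
```

The point of the split budget is exactly what the cover letter describes: callers that need to clean memory at device speed still can, while pure background flushing never monopolizes the hardware queue.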
See the last patch for a full description, but the tldr is that we
monitor min latencies over a window of time, and scale up/down the queue
based on that. This needs a minimum of tunables, and it stays out of the
way, if your device is fast enough. There's a single tunable now,
wb_lat_usec, that simply sets this latency target. Most people won't
have to touch this, it'll work pretty well just being in the ballpark.

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server.

The patchset is fully stable, I have not observed problems. It passes
full xfstest runs, and a variety of benchmarks as well. It works equally
well on blk-mq/scsi-mq, and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. The
wb-buf-throttle-v5 branch will remain the same as this version. I've
folded the device write cache changes into my 4.7 branches, so they are
not a part of this posting. Get the full wb-buf-throttle branch, or
apply the patches here on top of my for-next. A full patch against
Linus' current tree can also be downloaded here:

http://brick.kernel.dk/snaps/wb-buf-throttle-v5.patch

Changes since v4

- Add some documentation for the two queue sysfs files
- Kill off wb_stats sysfs file. Use the trace points to get this info
  now.
- Various work around making this block layer agnostic. The main code
  now resides in lib/wbt.c and can be plugged into NFS as well, for
  instance.
- Fix an issue with double completions on the block layer side.
- Fix an issue where a long sync issue was disregarded, if the stat
  sample wasn't valid.
- Speed up the division in rwb_arm_timer().
- Add logic to scale back up for 'unknown' latency events.
- Don't track sync issue timestamp if wbt is disabled.
- Drop the dirty/writeback page inc/dec patch. We don't need it, and it
  was racy.
- Move block/blk-wb.c to lib/wbt.c

Changes since v3

- Re-do the mm/ writeback parts. Add REQ_BG for background writes, and
  don't overload the wbc 'reason' for writeback decisions.
- Add tracking for when apps are sleeping waiting for a page to
  complete.
- Change wbc_to_write() to wbc_to_write_cmd().
- Use atomic_t for the balance_dirty_pages() sleep count.
- Add a basic scalable block stats tracking framework.
- Rewrite blk-wb core as described above, to dynamically adapt. This is
  a big change, see the last patch for a full description of it.
- Add tracing to blk-wb, instead of using debug printk's.
- Rebased to 4.6-rc3 (ish)

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change between
  wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the
  writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

 Documentation/block/queue-sysfs.txt             |  22 +
 Documentation/block/writeback_cache_control.txt |   4
 arch/um/drivers/ubd_kern.c                      |   2
 block/Kconfig                                   |   1
 block/Makefile                                  |   2
 block/blk-core.c                                |  26 +
 block/blk-flush.c                               |  11
 block/blk-mq-sysfs.c                            |  47 ++
 block/blk-mq.c                                  |  44 +-
 block/blk-mq.h                                  |   3
 block/blk-settings.c                            |  59 +-
 block/blk-stat.c                                | 185 ++++++++
 block/blk-stat.h                                |  17
 block/blk-sysfs.c                               | 184 ++++++++
 drivers/block/drbd/drbd_main.c                  |   2
 drivers/block/loop.c                            |   2
 drivers/block/mtip32xx/mtip32xx.c               |   6
 drivers/block/nbd.c                             |   4
 drivers/block/osdblk.c                          |   2
 drivers/block/ps3disk.c                         |   2
 drivers/block/skd_main.c                        |   2
 drivers/block/virtio_blk.c                      |   6
 drivers/block/xen-blkback/xenbus.c              |   2
 drivers/block/xen-blkfront.c                    |   3
 drivers/ide/ide-disk.c                          |   6
 drivers/md/bcache/super.c                       |   2
 drivers/md/dm-table.c                           |  20
 drivers/md/md.c                                 |   2
 drivers/md/raid5-cache.c                        |   3
 drivers/mmc/card/block.c                        |   2
 drivers/mtd/mtd_blkdevs.c                       |   2
 drivers/nvme/host/core.c                        |   7
 drivers/scsi/scsi.c                             |   3
 drivers/scsi/sd.c                               |   8
 drivers/target/target_core_iblock.c             |   6
 fs/block_dev.c                                  |   2
 fs/buffer.c                                     |   2
 fs/f2fs/data.c                                  |   2
 fs/f2fs/node.c                                  |   2
 fs/gfs2/meta_io.c                               |   3
 fs/mpage.c                                      |   9
 fs/xfs/xfs_aops.c                               |   2
 include/linux/backing-dev-defs.h                |   2
 include/linux/blk_types.h                       |  12
 include/linux/blkdev.h                          |  28 +
 include/linux/fs.h                              |   4
 include/linux/wbt.h                             |  95 ++++
 include/linux/writeback.h                       |  10
 include/trace/events/wbt.h                      | 122 +++++
 lib/Kconfig                                     |   3
 lib/Makefile                                    |   1
 lib/wbt.c                                       | 524 ++++++++++++++++++++++++
 mm/backing-dev.c                                |   1
 mm/page-writeback.c                             |   2
 54 files changed, 1429 insertions(+), 96 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread
* [PATCH 8/8] writeback: throttle buffered writeback
  2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
@ 2016-04-26 15:55 ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-04-26 15:55 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block
  Cc: jack, dchinner, sedat.dilek, Jens Axboe

Test patch that throttles buffered writeback to make it a lot more
smooth, and has way less impact on other system activity. Background
writeback should be, by definition, background activity. The fact that
we flush huge bundles of it at a time means that it potentially has
heavy impacts on foreground workloads, which isn't ideal.

We can't easily limit the sizes of writes that we do, since that would
impact file system layout in the presence of delayed allocation. So just
throttle back buffered writeback, unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration from the CoDel
network scheduling algorithm. Like CoDel, blk-wb monitors the minimum
latencies of requests over a window of time. In that window of time, if
the minimum latency of any request exceeds a given target, then a scale
count is incremented and the queue depth is shrunk. The next monitoring
window is shrunk accordingly. Unlike CoDel, if we hit a window that
exhibits good behavior, then we simply increment the scale count and
re-calculate the limits for that scale value. This prevents us from
oscillating between a close-to-ideal value and max all the time, instead
remaining in the windows where we get good behavior.

The patch registers two sysfs entries. The first one, 'wb_window_usec',
defines the window of monitoring. The second one, 'wb_lat_usec', sets
the latency target for the window. It defaults to 2 msec for
non-rotational storage, and 75 msec for rotational storage. Setting this
value to '0' disables blk-wb. Generally, a user would not have to touch
these settings.
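The window-based control loop described above can be sketched roughly as follows. This is a toy model under assumed names, not the patch's implementation; the key property it shares with the description is that the scale step moves gradually in both directions instead of snapping back to full depth, which is what avoids oscillating between a close-to-ideal value and max:

```c
#include <stdint.h>

#define WB_NO_SAMPLE UINT64_MAX

struct wb_ctrl {
	uint64_t target_nsec;	/* the latency target, in nsec */
	uint64_t win_min_nsec;	/* minimum request latency this window */
	unsigned int scale_step;/* 0 = full depth; higher = more throttled */
	unsigned int max_depth;
};

/* Feed one completed request's latency into the current window. */
static void wb_sample(struct wb_ctrl *c, uint64_t lat_nsec)
{
	if (lat_nsec < c->win_min_nsec)
		c->win_min_nsec = lat_nsec;
}

/* Allowed writeback queue depth for the current scale step. */
static unsigned int wb_depth(const struct wb_ctrl *c)
{
	unsigned int depth = c->max_depth >> c->scale_step;
	return depth ? depth : 1;
}

/* Called at the end of each monitoring window. */
static void wb_window_done(struct wb_ctrl *c)
{
	if (c->win_min_nsec != WB_NO_SAMPLE &&
	    c->win_min_nsec > c->target_nsec)
		c->scale_step++;	/* min latency exceeded target */
	else if (c->scale_step)
		c->scale_step--;	/* good window: step back up */
	c->win_min_nsec = WB_NO_SAMPLE;	/* reset for the next window */
}
```

Using the minimum (rather than mean) latency of the window follows the CoDel idea: if even the best request in the window was slow, the queue itself is the problem.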
Signed-off-by: Jens Axboe <axboe@fb.com>
---
 Documentation/block/queue-sysfs.txt |  13 ++++
 block/Kconfig                       |   1 +
 block/blk-core.c                    |  21 ++++-
 block/blk-mq.c                      |  32 +++++++-
 block/blk-settings.c                |   3 +
 block/blk-stat.c                    |   5 +-
 block/blk-sysfs.c                   | 119 ++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h              |   6 +-
 8 files changed, 191 insertions(+), 9 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index dce25d848d92..9bc990abef4d 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -151,5 +151,18 @@ device state. This means that it might not be safe to toggle the
 setting from "write back" to "write through", since that will also
 eliminate cache flushes issued by the kernel.
 
+wb_lat_usec (RW)
+----------------
+If the device is registered for writeback throttling, then this file shows
+the target minimum read latency. If this latency is exceeded in a given
+window of time (see wb_window_usec), then the writeback throttling will start
+scaling back writes.
+
+wb_window_usec (RW)
+-------------------
+If the device is registered for writeback throttling, then this file shows
+the value of the monitoring window in which we'll look at the target
+latency. See wb_lat_usec.
+
 Jens Axboe <jens.axboe@oracle.com>, February 2009

diff --git a/block/Kconfig b/block/Kconfig
index 0363cd731320..d4c2ff4b9b2c 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -4,6 +4,7 @@
 menuconfig BLOCK
 	bool "Enable the block layer" if EXPERT
 	default y
+	select WBT
 	help
 	 Provide block layer support for the kernel.

diff --git a/block/blk-core.c b/block/blk-core.c
index 40b57bf4852c..c166d46a09d1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -33,6 +33,7 @@
 #include <linux/ratelimit.h>
 #include <linux/pm_runtime.h>
 #include <linux/blk-cgroup.h>
+#include <linux/wbt.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -880,6 +881,8 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 
 fail:
 	blk_free_flush_queue(q->fq);
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
 	return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1395,6 +1398,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 	blk_delete_timer(rq);
 	blk_clear_rq_complete(rq);
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->wb_stat);
 
 	if (rq->cmd_flags & REQ_QUEUED)
 		blk_queue_end_tag(q, rq);
@@ -1485,6 +1489,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	/* this is a bio leak */
 	WARN_ON(req->bio != NULL);
 
+	wbt_done(q->rq_wb, &req->wb_stat);
+
 	/*
	 * Request may not have originated from ll_rw_blk. if not,
	 * it didn't come out of our reserved rq pools
@@ -1714,6 +1720,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
 	struct request *req;
 	unsigned int request_count = 0;
+	bool wb_acct;
 
 	/*
	 * low level driver can indicate that it wants pages above a
@@ -1766,6 +1773,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	}
 
 get_rq:
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_rw, q->queue_lock);
+
 	/*
	 * This sync check and mask will be re-done in init_request_from_bio(),
	 * but we need to set it earlier to expose the sync flag to the
@@ -1781,11 +1790,16 @@ get_rq:
 	 */
 	req = get_request(q, rw_flags, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		if (wb_acct)
+			__wbt_done(q->rq_wb);
 		bio->bi_error = PTR_ERR(req);
 		bio_endio(bio);
 		goto out_unlock;
 	}
 
+	if (wb_acct)
+		wbt_mark_tracked(&req->wb_stat);
+
 	/*
	 * After dropping the lock and possibly sleeping here, our request
	 * may now be mergeable after it had proven unmergeable (above).
@@ -2514,7 +2528,7 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
-	req->issue_time = ktime_to_ns(ktime_get());
+	wbt_issue(req->q->rq_wb, &req->wb_stat);
 
 	/*
	 * We are now handing the request to the hardware, initialize
@@ -2752,9 +2766,10 @@ void blk_finish_request(struct request *req, int error)
 		blk_unprep_request(req);
 
 	blk_account_io_done(req);
 
-	if (req->end_io)
+	if (req->end_io) {
+		wbt_done(req->q->rq_wb, &req->wb_stat);
 		req->end_io(req, error);
-	else {
+	} else {
 		if (blk_bidi_rq(req))
 			__blk_put_request(req->next_rq->q, req->next_rq);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 71b4a13fbf94..556229e4da92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/wbt.h>
 
 #include <trace/events/block.h>
@@ -275,6 +276,8 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+
+	wbt_done(q->rq_wb, &rq->wb_stat);
 	rq->cmd_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -307,6 +310,7 @@ inline void __blk_mq_end_request(struct request *rq, int error)
 	blk_account_io_done(rq);
 
 	if (rq->end_io) {
+		wbt_done(rq->q->rq_wb, &rq->wb_stat);
 		rq->end_io(rq, error);
 	} else {
 		if (unlikely(blk_bidi_rq(rq)))
@@ -413,7 +417,7 @@ void blk_mq_start_request(struct request *rq)
 	if (unlikely(blk_bidi_rq(rq)))
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
-	rq->issue_time = ktime_to_ns(ktime_get());
+	wbt_issue(q->rq_wb, &rq->wb_stat);
 
 	blk_add_timer(rq);
@@ -450,6 +454,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	struct request_queue *q = rq->q;
 
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->wb_stat);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1265,6 +1270,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
+	bool wb_acct;
 
 	blk_queue_bounce(q, &bio);
@@ -1282,9 +1288,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_rw, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct)
+			__wbt_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	if (wb_acct)
+		wbt_mark_tracked(&rq->wb_stat);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
@@ -1361,6 +1375,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_map_ctx data;
 	struct request *rq;
 	blk_qc_t cookie;
+	bool wb_acct;
 
 	blk_queue_bounce(q, &bio);
@@ -1375,9 +1390,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return BLK_QC_T_NONE;
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_rw, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct)
+			__wbt_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	if (wb_acct)
+		wbt_mark_tracked(&rq->wb_stat);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
@@ -2111,6 +2134,9 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
+
 	blk_mq_del_queue_tag_set(q);
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);

diff --git a/block/blk-settings.c b/block/blk-settings.c
index f7e122e717e8..746dc9fee1ac 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -840,6 +840,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
+	wbt_set_queue_depth(q->rq_wb, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
@@ -863,6 +864,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 	else
 		queue_flag_clear(QUEUE_FLAG_FUA, q);
 	spin_unlock_irq(q->queue_lock);
+
+	wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);

diff --git a/block/blk-stat.c b/block/blk-stat.c
index b38776a83173..8e3974d87c1f 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -143,15 +143,16 @@ void blk_stat_init(struct blk_rq_stat *stat)
 void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
 {
 	s64 delta, now, value;
+	u64 rq_time = wbt_issue_stat_get_time(&rq->wb_stat);
 
 	now = ktime_to_ns(ktime_get());
-	if (now < rq->issue_time)
+	if (now < rq_time)
 		return;
 
 	if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK))
 		__blk_stat_init(stat, now);
 
-	value = now - rq->issue_time;
+	value = now - rq_time;
 	if (value > stat->max)
 		stat->max = value;
 	if (value < stat->min)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6e516cc0d3d0..df194bf93598 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
 #include <linux/blk-cgroup.h>
+#include <linux/wbt.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
 	return count;
 }
 
+static ssize_t queue_var_store64(u64 *var, const char *page)
+{
+	int err;
+	u64 v;
+
+	err = kstrtou64(page, 10, &v);
+	if (err < 0)
+		return err;
+
+	*var = v;
+	return 0;
+}
+
 static ssize_t queue_requests_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->nr_requests, (page));
@@ -347,6 +361,58 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wb_win_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->win_nsec, 1000));
+}
+
+static ssize_t queue_wb_win_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->win_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
+static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+}
+
+static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->min_lat_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
 static ssize_t queue_wc_show(struct request_queue *q, char *page)
 {
 	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -541,6 +607,18 @@ static struct queue_sysfs_entry queue_stats_entry = {
 	.show = queue_stats_show,
 };
 
+static struct queue_sysfs_entry queue_wb_lat_entry = {
+	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_lat_show,
+	.store = queue_wb_lat_store,
+};
+
+static struct queue_sysfs_entry queue_wb_win_entry = {
+	.attr = {.name = "wbt_window_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_win_show,
+	.store = queue_wb_win_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -568,6 +646,8 @@ static struct attribute *default_attrs[] = {
 	&queue_poll_entry.attr,
 	&queue_wc_entry.attr,
 	&queue_stats_entry.attr,
+	&queue_wb_lat_entry.attr,
+	&queue_wb_win_entry.attr,
 	NULL,
 };
@@ -682,6 +762,43 @@ struct kobj_type blk_queue_ktype = {
 	.release = blk_release_queue,
 };
 
+static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
+{
+	blk_queue_stat_get(data, stat);
+}
+
+static void blk_wb_stat_clear(void *data)
+{
+	blk_stat_clear(data);
+}
+
+static struct wb_stat_ops wb_stat_ops = {
+	.get	= blk_wb_stat_get,
+	.clear	= blk_wb_stat_clear,
+};
+
+static void blk_wb_init(struct request_queue *q)
+{
+	struct rq_wb *rwb;
+
+	rwb = wbt_init(&q->backing_dev_info, &wb_stat_ops, q);
+
+	/*
+	 * If this fails, we don't get throttling
+	 */
+	if (IS_ERR(rwb))
+		return;
+
+	if (blk_queue_nonrot(q))
+		rwb->min_lat_nsec = 2000000ULL;
+	else
+		rwb->min_lat_nsec = 75000000ULL;
+
+	wbt_set_queue_depth(rwb, blk_queue_depth(q));
+	wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+	q->rq_wb = rwb;
+}
+
 int blk_register_queue(struct gendisk *disk)
 {
 	int ret;
@@ -721,6 +838,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (q->mq_ops)
 		blk_mq_register_disk(disk);
 
+	blk_wb_init(q);
+
 	if (!q->request_fn)
 		return 0;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 87f6703ced71..a89f46c58d5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -24,6 +24,7 @@
 #include <linux/rcupdate.h>
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
+#include <linux/wbt.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -37,6 +38,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
+struct rq_wb;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -153,7 +155,7 @@ struct request {
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
 	unsigned long start_time;
-	s64 issue_time;
+	struct wb_issue_stat wb_stat;
#ifdef CONFIG_BLK_CGROUP
 	struct request_list *rl;	/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
@@ -291,6 +293,8 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct rq_wb		*rq_wb;
+
 	/*
	 * If blkcg is not used, @q->root_rl serves all requests. If blkcg
	 * is used, root blkg allocates from @q->root_rl and all other
-- 
2.8.0.rc4.6.g7e4ba36

^ permalink raw reply related	[flat|nested] 19+ messages in thread
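The wb_lat_usec/wb_window_usec handlers in the patch convert between the microseconds shown in sysfs and the nanoseconds stored in struct rq_wb. A userspace-style sketch of that round trip (the kernel uses kstrtou64() and div_u64(); strtoull() stands in here):

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal usec string, as written to the sysfs file, into the
 * nanosecond value the throttling core works with. */
static int usec_str_to_nsec(const char *page, uint64_t *nsec)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(page, &end, 10);
	if (errno || end == page)
		return -EINVAL;

	*nsec = (uint64_t)v * 1000ULL;
	return 0;
}

/* The show side: nanoseconds back to the microseconds users see. */
static uint64_t nsec_to_usec(uint64_t nsec)
{
	return nsec / 1000;
}
```

So the 2 msec non-rotational default (2000000 nsec internally) reads back as 2000 from wbt_lat_usec, and writing 75000 yields the 75 msec rotational default.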
* [PATCHSET v6] Throttled background buffered writeback
@ 2016-08-31 17:05 Jens Axboe
  2016-08-31 17:05 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2016-08-31 17:05 UTC (permalink / raw)
  To: axboe, linux-kernel, linux-fsdevel, linux-block

Hi,

Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, installation of a big RPM (or similar) adversely impacts
database reads or sync writes. When that happens, I get people yelling
at me.

I have posted plenty of results previously, I'll keep it shorter this
time. Here's a run on my laptop, using read-to-pipe-async for reading a
5g file, and rewriting it. You can find this test program in the fio git
repo.
4.6-rc3:

$ t/read-to-pipe-async -f ~/5g > 5g-new

Latency percentiles (usec) (READERS)
	50.0000th: 2
	75.0000th: 3
	90.0000th: 5
	95.0000th: 7
	99.0000th: 43
	99.5000th: 77
	99.9000th: 9008
	99.9900th: 91008
	99.9990th: 286208
	99.9999th: 347648
	Over=1251, min=0, max=358081
Latency percentiles (usec) (WRITERS)
	50.0000th: 4
	75.0000th: 8
	90.0000th: 13
	95.0000th: 15
	99.0000th: 32
	99.5000th: 43
	99.9000th: 81
	99.9900th: 2372
	99.9990th: 104320
	99.9999th: 349696
	Over=63, min=1, max=358321
Read rate (KB/sec) : 91859
Write rate (KB/sec): 91859

4.6-rc3 + wb-buf-throttle

Latency percentiles (usec) (READERS)
	50.0000th: 2
	75.0000th: 3
	90.0000th: 5
	95.0000th: 8
	99.0000th: 48
	99.5000th: 79
	99.9000th: 5304
	99.9900th: 22496
	99.9990th: 29408
	99.9999th: 33728
	Over=860, min=0, max=37599
Latency percentiles (usec) (WRITERS)
	50.0000th: 4
	75.0000th: 9
	90.0000th: 14
	95.0000th: 16
	99.0000th: 34
	99.5000th: 45
	99.9000th: 87
	99.9900th: 1342
	99.9990th: 13648
	99.9999th: 21280
	Over=29, min=1, max=30457
Read rate (KB/sec) : 95832
Write rate (KB/sec): 95832

Better throughput and tighter latencies, for both reads and writes.
That's hard not to like.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. It's all about managing
the queues on the hardware side. The big change in this version is that
it should be pretty much auto-tuning - you no longer have to set a given
percentage of writeback bandwidth. I've implemented something similar to
CoDel to manage the writeback queue.
See the last patch for a full description, but the tldr is that we
monitor min latencies over a window of time, and scale up/down the queue
based on that. This needs a minimum of tunables, and it stays out of the
way, if your device is fast enough. There's a single tunable now,
wb_lat_usec, that simply sets this latency target. Most people won't
have to touch this, it'll work pretty well just being in the ballpark.

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server.

The patchset is fully stable, I have not observed problems. It passes
full xfstest runs, and a variety of benchmarks as well. It works equally
well on blk-mq/scsi-mq, and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. The
wb-buf-throttle-v6 branch will remain the same as this version.

I know there are a bunch of folks running this patchset with success. If
there's any interest in a version that applies cleanly to Linux v4.7,
let me know, and I can provide one. Apply the patches here on top of my
for-next. A full patch against Linus' current tree can also be
downloaded here:

http://brick.kernel.dk/snaps/wb-buf-throttle-v6.patch

Changes since v5

- Rebased on top of 4.8-rc4, drop parts of the series that are now in
  mainline.
- Fixes for QD=1 devices, should make them perform better.
- Fix for hang with disabling WBT with IO in flight
- Change in the sync issue/completion logic. Previously we used whether
  this IO was tracked or not (eg was a buffered write), this has now
  been changed to just look at reads. This is a better metric, and
  should improve behavior.
- Add some more comments to the code, explaining how it works.

Changes since v4

- Add some documentation for the two queue sysfs files
- Kill off wb_stats sysfs file. Use the trace points to get this info
  now.
- Various work around making this block layer agnostic. The main code
  now resides in lib/wbt.c and can be plugged into NFS as well, for
  instance.
- Fix an issue with double completions on the block layer side.
- Fix an issue where a long sync issue was disregarded, if the stat
  sample wasn't valid.
- Speed up the division in rwb_arm_timer().
- Add logic to scale back up for 'unknown' latency events.
- Don't track sync issue timestamp if wbt is disabled.
- Drop the dirty/writeback page inc/dec patch. We don't need it, and it
  was racy.
- Move block/blk-wb.c to lib/wbt.c

Changes since v3

- Re-do the mm/ writeback parts. Add REQ_BG for background writes, and
  don't overload the wbc 'reason' for writeback decisions.
- Add tracking for when apps are sleeping waiting for a page to
  complete.
- Change wbc_to_write() to wbc_to_write_cmd().
- Use atomic_t for the balance_dirty_pages() sleep count.
- Add a basic scalable block stats tracking framework.
- Rewrite blk-wb core as described above, to dynamically adapt. This is
  a big change, see the last patch for a full description of it.
- Add tracing to blk-wb, instead of using debug printk's.
- Rebased to 4.6-rc3 (ish)

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change between
  wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the
  writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

Documentation/block/queue-sysfs.txt | 13
block/Kconfig | 1
block/Makefile | 2
block/blk-core.c | 22 +
block/blk-mq-sysfs.c | 47 ++
block/blk-mq.c | 42 ++
block/blk-mq.h | 3
block/blk-settings.c | 15
block/blk-stat.c | 205 ++++++++++++
block/blk-stat.h | 17 +
block/blk-sysfs.c | 145 ++++++++
block/cfq-iosched.c | 12
drivers/scsi/scsi.c | 3
fs/buffer.c | 2
fs/f2fs/data.c | 2
fs/f2fs/node.c | 2
fs/gfs2/meta_io.c | 3
fs/mpage.c | 2
fs/xfs/xfs_aops.c | 7
include/linux/backing-dev-defs.h | 2
include/linux/blk_types.h | 16
include/linux/blkdev.h | 19 +
include/linux/fs.h | 3
include/linux/wbt.h | 118 +++++++
include/linux/writeback.h | 10
include/trace/events/wbt.h | 122 +++++++
lib/Kconfig | 4
lib/Makefile | 1
lib/wbt.c | 587 ++++++++++++++++++++++++++++++++++++
mm/backing-dev.c | 1
mm/page-writeback.c | 2
31 files changed, 1414 insertions(+), 16 deletions(-)

--
Jens Axboe
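The cover letter's "monitor min latencies over a window of time" idea can be sketched in plain userspace C. This is an illustrative model only, not the kernel implementation; all names here (`struct lat_window`, `window_add_sample()`) are made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of per-window minimum-latency tracking: completions feed
 * their latency into the current window; when the window expires, the
 * tracked minimum is reported and the window is reset.
 */
struct lat_window {
	uint64_t win_start_ns;	/* when the current window opened */
	uint64_t win_len_ns;	/* monitoring window length */
	uint64_t min_ns;	/* min latency seen, UINT64_MAX == none */
};

static void window_reset(struct lat_window *w, uint64_t now_ns)
{
	w->win_start_ns = now_ns;
	w->min_ns = UINT64_MAX;
}

/*
 * Record one completion latency.  Returns true if a window just
 * closed, in which case *win_min_ns holds that window's minimum.
 */
static bool window_add_sample(struct lat_window *w, uint64_t now_ns,
			      uint64_t lat_ns, uint64_t *win_min_ns)
{
	bool closed = false;

	if (now_ns - w->win_start_ns >= w->win_len_ns) {
		*win_min_ns = w->min_ns;
		window_reset(w, now_ns);
		closed = true;
	}
	if (lat_ns < w->min_ns)
		w->min_ns = lat_ns;
	return closed;
}
```

Tracking only the minimum (rather than an average) is the CoDel-style trick: if even the *best* request in a window missed the target, the queue is genuinely backed up.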
* [PATCH 8/8] writeback: throttle buffered writeback
  2016-08-31 17:05 [PATCHSET v6] Throttled background " Jens Axboe
@ 2016-08-31 17:05 ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread

From: Jens Axboe @ 2016-08-31 17:05 UTC (permalink / raw)
To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Test patch that throttles buffered writeback to make it a lot smoother, with far less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at a time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of the writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration from the CoDel network scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply decrement the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior.

The patch registers two sysfs entries. The first one, 'wb_window_usec', defines the window of monitoring. The second one, 'wb_lat_usec', sets the latency target for the window. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch these settings.
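The scaling policy in the commit message above can be modeled in a few lines of userspace C. This is a sketch under assumptions, not the kernel code: the names and the power-of-two depth reduction are illustrative. A window whose minimum latency missed the target throttles harder (and shrinks the next window); a good window eases off one step instead of snapping straight back to full depth:

```c
#include <assert.h>

/* Toy model of the CoDel-like scale step; all names are made up. */
struct wb_scale {
	int scale_step;			/* 0 == baseline */
	unsigned int queue_depth;	/* device queue depth */
	unsigned int wb_max;		/* allowed background wb depth */
	unsigned long base_win_usec;	/* baseline monitoring window */
	unsigned long win_usec;		/* current monitoring window */
};

static void wb_calc_limits(struct wb_scale *s)
{
	/* each step halves the allowed depth, never below 1 */
	unsigned int depth = s->queue_depth >> s->scale_step;

	s->wb_max = depth ? depth : 1;
	/* the monitoring window shrinks while we are scaled down */
	s->win_usec = s->base_win_usec >> s->scale_step;
}

static void wb_window_done(struct wb_scale *s, unsigned long win_min_usec,
			   unsigned long target_usec)
{
	if (win_min_usec > target_usec)
		s->scale_step++;	/* target missed: throttle harder */
	else if (s->scale_step > 0)
		s->scale_step--;	/* good window: ease off one step */
	wb_calc_limits(s);
}
```

Feeding each closed window's measured minimum into `wb_window_done()` is conceptually what the timer-driven kernel side does; the real code additionally handles windows with no valid samples and, in v7, negative scale steps.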
We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <axboe@fb.com> --- Documentation/block/queue-sysfs.txt | 13 ++++ block/Kconfig | 1 + block/blk-core.c | 20 +++++- block/blk-mq.c | 30 ++++++++- block/blk-settings.c | 3 + block/blk-stat.c | 5 +- block/blk-sysfs.c | 119 ++++++++++++++++++++++++++++++++++++ block/cfq-iosched.c | 12 ++++ include/linux/blkdev.h | 6 +- 9 files changed, 200 insertions(+), 9 deletions(-) diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index 2a3904030dea..2847219ebd8c 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -169,5 +169,18 @@ This is the number of bytes the device can write in a single write-same command. A value of '0' means write-same is not supported by this device. +wb_lat_usec (RW) +---------------- +If the device is registered for writeback throttling, then this file shows +the target minimum read latency. If this latency is exceeded in a given +window of time (see wb_window_usec), then the writeback throttling will start +scaling back writes. + +wb_window_usec (RW) +------------------- +If the device is registered for writeback throttling, then this file shows +the value of the monitoring window in which we'll look at the target +latency. See wb_lat_usec. + Jens Axboe <jens.axboe@oracle.com>, February 2009 diff --git a/block/Kconfig b/block/Kconfig index 161491d0a879..6da79e670709 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -4,6 +4,7 @@ menuconfig BLOCK bool "Enable the block layer" if EXPERT default y + select WBT help Provide block layer support for the kernel. 
diff --git a/block/blk-core.c b/block/blk-core.c index 4075cbeb720e..4f4ce050290c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -33,6 +33,7 @@ #include <linux/ratelimit.h> #include <linux/pm_runtime.h> #include <linux/blk-cgroup.h> +#include <linux/wbt.h> #define CREATE_TRACE_POINTS #include <trace/events/block.h> @@ -882,6 +883,8 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, fail: blk_free_flush_queue(q->fq); + wbt_exit(q->rq_wb); + q->rq_wb = NULL; return NULL; } EXPORT_SYMBOL(blk_init_allocated_queue); @@ -1346,6 +1349,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq) blk_delete_timer(rq); blk_clear_rq_complete(rq); trace_block_rq_requeue(q, rq); + wbt_requeue(q->rq_wb, &rq->wb_stat); if (rq->cmd_flags & REQ_QUEUED) blk_queue_end_tag(q, rq); @@ -1436,6 +1440,8 @@ void __blk_put_request(struct request_queue *q, struct request *req) /* this is a bio leak */ WARN_ON(req->bio != NULL); + wbt_done(q->rq_wb, &req->wb_stat); + /* * Request may not have originated from ll_rw_blk. 
if not, * it didn't come out of our reserved rq pools @@ -1667,6 +1673,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) int el_ret, rw_flags = 0, where = ELEVATOR_INSERT_SORT; struct request *req; unsigned int request_count = 0; + unsigned int wb_acct; /* * low level driver can indicate that it wants pages above a @@ -1719,6 +1726,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) } get_rq: + wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, q->queue_lock); + /* * This sync check and mask will be re-done in init_request_from_bio(), * but we need to set it earlier to expose the sync flag to the @@ -1738,11 +1747,15 @@ get_rq: */ req = get_request(q, bio_data_dir(bio), rw_flags, bio, GFP_NOIO); if (IS_ERR(req)) { + if (wb_acct & WBT_TRACKED) + __wbt_done(q->rq_wb); bio->bi_error = PTR_ERR(req); bio_endio(bio); goto out_unlock; } + wbt_track(&req->wb_stat, wb_acct); + /* * After dropping the lock and possibly sleeping here, our request * may now be mergeable after it had proven unmergeable (above). 
@@ -2475,7 +2488,7 @@ void blk_start_request(struct request *req) { blk_dequeue_request(req); - req->issue_time = ktime_to_ns(ktime_get()); + wbt_issue(req->q->rq_wb, &req->wb_stat); /* * We are now handing the request to the hardware, initialize @@ -2713,9 +2726,10 @@ void blk_finish_request(struct request *req, int error) blk_account_io_done(req); - if (req->end_io) + if (req->end_io) { + wbt_done(req->q->rq_wb, &req->wb_stat); req->end_io(req, error); - else { + } else { if (blk_bidi_rq(req)) __blk_put_request(req->next_rq->q, req->next_rq); diff --git a/block/blk-mq.c b/block/blk-mq.c index 712f141a6f1a..511289a4626a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -22,6 +22,7 @@ #include <linux/sched/sysctl.h> #include <linux/delay.h> #include <linux/crash_dump.h> +#include <linux/wbt.h> #include <trace/events/block.h> @@ -319,6 +320,8 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, if (rq->cmd_flags & REQ_MQ_INFLIGHT) atomic_dec(&hctx->nr_active); + + wbt_done(q->rq_wb, &rq->wb_stat); rq->cmd_flags = 0; clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); @@ -351,6 +354,7 @@ inline void __blk_mq_end_request(struct request *rq, int error) blk_account_io_done(rq); if (rq->end_io) { + wbt_done(rq->q->rq_wb, &rq->wb_stat); rq->end_io(rq, error); } else { if (unlikely(blk_bidi_rq(rq))) @@ -457,7 +461,7 @@ void blk_mq_start_request(struct request *rq) if (unlikely(blk_bidi_rq(rq))) rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq); - rq->issue_time = ktime_to_ns(ktime_get()); + wbt_issue(q->rq_wb, &rq->wb_stat); blk_add_timer(rq); @@ -494,6 +498,7 @@ static void __blk_mq_requeue_request(struct request *rq) struct request_queue *q = rq->q; trace_block_rq_requeue(q, rq); + wbt_requeue(q->rq_wb, &rq->wb_stat); if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) { if (q->dma_drain_size && blk_rq_bytes(rq)) @@ -1312,6 +1317,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) struct blk_plug *plug; struct request 
*same_queue_rq = NULL; blk_qc_t cookie; + unsigned int wb_acct; blk_queue_bounce(q, &bio); @@ -1326,9 +1332,16 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq)) return BLK_QC_T_NONE; + wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, NULL); + rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (wb_acct & WBT_TRACKED) + __wbt_done(q->rq_wb); return BLK_QC_T_NONE; + } + + wbt_track(&rq->wb_stat, wb_acct); cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -1405,6 +1418,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) struct blk_map_ctx data; struct request *rq; blk_qc_t cookie; + unsigned int wb_acct; blk_queue_bounce(q, &bio); @@ -1421,9 +1435,16 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio) } else request_count = blk_plug_queued_count(q); + wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, NULL); + rq = blk_mq_map_request(q, bio, &data); - if (unlikely(!rq)) + if (unlikely(!rq)) { + if (wb_acct & WBT_TRACKED) + __wbt_done(q->rq_wb); return BLK_QC_T_NONE; + } + + wbt_track(&rq->wb_stat, wb_acct); cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num); @@ -2147,6 +2168,9 @@ void blk_mq_free_queue(struct request_queue *q) list_del_init(&q->all_q_node); mutex_unlock(&all_q_mutex); + wbt_exit(q->rq_wb); + q->rq_wb = NULL; + blk_mq_del_queue_tag_set(q); blk_mq_exit_hw_queues(q, set, set->nr_hw_queues); diff --git a/block/blk-settings.c b/block/blk-settings.c index f7e122e717e8..746dc9fee1ac 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -840,6 +840,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable); void blk_set_queue_depth(struct request_queue *q, unsigned int depth) { q->queue_depth = depth; + wbt_set_queue_depth(q->rq_wb, depth); } EXPORT_SYMBOL(blk_set_queue_depth); @@ -863,6 +864,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua) 
else queue_flag_clear(QUEUE_FLAG_FUA, q); spin_unlock_irq(q->queue_lock); + + wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags)); } EXPORT_SYMBOL_GPL(blk_queue_write_cache); diff --git a/block/blk-stat.c b/block/blk-stat.c index 76cf2e2092c1..d8cb9b56fced 100644 --- a/block/blk-stat.c +++ b/block/blk-stat.c @@ -162,15 +162,16 @@ void blk_stat_init(struct blk_rq_stat *stat) void blk_stat_add(struct blk_rq_stat *stat, struct request *rq) { s64 now, value; + u64 rq_time = wbt_issue_stat_get_time(&rq->wb_stat); now = ktime_to_ns(ktime_get()); - if (now < rq->issue_time) + if (now < rq_time) return; if ((now & BLK_STAT_MASK) != (stat->time & BLK_STAT_MASK)) __blk_stat_init(stat, now); - value = now - rq->issue_time; + value = now - rq_time; if (value > stat->max) stat->max = value; if (value < stat->min) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 0b9e435fec97..7fcf02c9bfa7 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -10,6 +10,7 @@ #include <linux/blktrace_api.h> #include <linux/blk-mq.h> #include <linux/blk-cgroup.h> +#include <linux/wbt.h> #include "blk.h" #include "blk-mq.h" @@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count) return count; } +static ssize_t queue_var_store64(u64 *var, const char *page) +{ + int err; + u64 v; + + err = kstrtou64(page, 10, &v); + if (err < 0) + return err; + + *var = v; + return 0; +} + static ssize_t queue_requests_show(struct request_queue *q, char *page) { return queue_var_show(q->nr_requests, (page)); @@ -347,6 +361,58 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page, return ret; } +static ssize_t queue_wb_win_show(struct request_queue *q, char *page) +{ + if (!q->rq_wb) + return -EINVAL; + + return sprintf(page, "%llu\n", div_u64(q->rq_wb->win_nsec, 1000)); +} + +static ssize_t queue_wb_win_store(struct request_queue *q, const char *page, + size_t count) +{ + ssize_t ret; + u64 val; + + if (!q->rq_wb) + return -EINVAL; + 
+ ret = queue_var_store64(&val, page); + if (ret < 0) + return ret; + + q->rq_wb->win_nsec = val * 1000ULL; + wbt_update_limits(q->rq_wb); + return count; +} + +static ssize_t queue_wb_lat_show(struct request_queue *q, char *page) +{ + if (!q->rq_wb) + return -EINVAL; + + return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000)); +} + +static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page, + size_t count) +{ + ssize_t ret; + u64 val; + + if (!q->rq_wb) + return -EINVAL; + + ret = queue_var_store64(&val, page); + if (ret < 0) + return ret; + + q->rq_wb->min_lat_nsec = val * 1000ULL; + wbt_update_limits(q->rq_wb); + return count; +} + static ssize_t queue_wc_show(struct request_queue *q, char *page) { if (test_bit(QUEUE_FLAG_WC, &q->queue_flags)) @@ -551,6 +617,18 @@ static struct queue_sysfs_entry queue_stats_entry = { .show = queue_stats_show, }; +static struct queue_sysfs_entry queue_wb_lat_entry = { + .attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR }, + .show = queue_wb_lat_show, + .store = queue_wb_lat_store, +}; + +static struct queue_sysfs_entry queue_wb_win_entry = { + .attr = {.name = "wbt_window_usec", .mode = S_IRUGO | S_IWUSR }, + .show = queue_wb_win_show, + .store = queue_wb_win_store, +}; + static struct attribute *default_attrs[] = { &queue_requests_entry.attr, &queue_ra_entry.attr, @@ -579,6 +657,8 @@ static struct attribute *default_attrs[] = { &queue_wc_entry.attr, &queue_dax_entry.attr, &queue_stats_entry.attr, + &queue_wb_lat_entry.attr, + &queue_wb_win_entry.attr, NULL, }; @@ -693,6 +773,43 @@ struct kobj_type blk_queue_ktype = { .release = blk_release_queue, }; +static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat) +{ + blk_queue_stat_get(data, stat); +} + +static void blk_wb_stat_clear(void *data) +{ + blk_stat_clear(data); +} + +static struct wb_stat_ops wb_stat_ops = { + .get = blk_wb_stat_get, + .clear = blk_wb_stat_clear, +}; + +static void blk_wb_init(struct request_queue *q) +{ + 
struct rq_wb *rwb; + + rwb = wbt_init(&q->backing_dev_info, &wb_stat_ops, q); + + /* + * If this fails, we don't get throttling + */ + if (IS_ERR(rwb)) + return; + + if (blk_queue_nonrot(q)) + rwb->min_lat_nsec = 2000000ULL; + else + rwb->min_lat_nsec = 75000000ULL; + + wbt_set_queue_depth(rwb, blk_queue_depth(q)); + wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags)); + q->rq_wb = rwb; +} + int blk_register_queue(struct gendisk *disk) { int ret; @@ -732,6 +849,8 @@ int blk_register_queue(struct gendisk *disk) if (q->mq_ops) blk_mq_register_disk(disk); + blk_wb_init(q); + if (!q->request_fn) return 0; diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index cc2f6dbd4303..ef61bda76317 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -3777,6 +3777,18 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio) return; /* + * If we have a non-root cgroup, we can depend on that to + * do proper throttling of writes. Turn off wbt for that + * case. + */ + if (bio_blkcg(bio) != &blkcg_root) { + struct request_queue *q = cfqd->queue; + + if (q->rq_wb) + wbt_disable(q->rq_wb); + } + + /* * Drop reference to queues. New queues will be assigned in new * group upon arrival of fresh requests. 
*/ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 259eba88f991..45256d75c4b7 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -24,6 +24,7 @@ #include <linux/rcupdate.h> #include <linux/percpu-refcount.h> #include <linux/scatterlist.h> +#include <linux/wbt.h> struct module; struct scsi_ioctl_command; @@ -37,6 +38,7 @@ struct bsg_job; struct blkcg_gq; struct blk_flush_queue; struct pr_ops; +struct rq_wb; #define BLKDEV_MIN_RQ 4 #define BLKDEV_MAX_RQ 128 /* Default maximum */ @@ -151,7 +153,7 @@ struct request { struct gendisk *rq_disk; struct hd_struct *part; unsigned long start_time; - s64 issue_time; + struct wb_issue_stat wb_stat; #ifdef CONFIG_BLK_CGROUP struct request_list *rl; /* rl this rq is alloced from */ unsigned long long start_time_ns; @@ -303,6 +305,8 @@ struct request_queue { int nr_rqs[2]; /* # allocated [a]sync rqs */ int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */ + struct rq_wb *rq_wb; + /* * If blkcg is not used, @q->root_rl serves all requests. If blkcg * is used, root blkg allocates from @q->root_rl and all other -- 2.7.4 ^ permalink raw reply related [flat|nested] 19+ messages in thread
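The sysfs hunks above expose the two tunables in microseconds while `rq_wb` stores nanoseconds. The round-trip can be sketched in userspace C; `parse_u64()` and `wb_lat_store()` below are made-up stand-ins for `queue_var_store64()`/`kstrtou64()` and `queue_wb_lat_store()`, not kernel functions:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* stand-in for queue_var_store64()/kstrtou64(): parse a decimal u64 */
static int parse_u64(const char *page, uint64_t *var)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(page, &end, 10);
	if (errno || end == page)
		return -EINVAL;
	*var = v;
	return 0;
}

/* store path: microseconds from userspace -> nanoseconds internally */
static int wb_lat_store(const char *page, uint64_t *min_lat_nsec)
{
	uint64_t usec;
	int err = parse_u64(page, &usec);

	if (err)
		return err;
	*min_lat_nsec = usec * 1000ULL;
	return 0;
}

/* show path: nanoseconds back to microseconds, as div_u64(nsec, 1000) */
static uint64_t wb_lat_show(uint64_t min_lat_nsec)
{
	return min_lat_nsec / 1000;
}
```

In the patch itself, a successful store is additionally followed by `wbt_update_limits()` so the new target takes effect immediately.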
* [PATCH 0/8] Throttled background buffered writeback v7
@ 2016-09-07 14:46 Jens Axboe
  2016-09-07 14:46 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread

From: Jens Axboe @ 2016-09-07 14:46 UTC (permalink / raw)
To: axboe, linux-kernel, linux-fsdevel, linux-block

Hi,

Since the dawn of time, our background buffered writeback has sucked. When we do background buffered writeback, it should have little impact on foreground activity. That's the definition of background activity... But for as long as I can remember, heavy buffered writers have not behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start before the buffered writeback is done. Or, for server-oriented workloads, where installation of a big RPM (or similar) adversely impacts database reads or sync writes. When that happens, I get people yelling at me.

Results from some recent testing can be found here:

https://www.facebook.com/axboe/posts/10154074651342933

See previous postings for a bigger description of the patchset. Find the code here:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. The wb-buf-throttle-v7 branch will remain the same as this version. I know there are a bunch of folks running this patchset with success. If there's any interest in a version that applies cleanly to Linux v4.7, let me know, and I can provide one. A full patch against 4.8-rc5 can be found here:

http://brick.kernel.dk/snaps/wb-buf-throttle-v7.patch

Changes since v6
- Improve performance of the stats tracking, by reducing int divisions through batching.
- Make blk_mq_stat_get() correctly set the right stat time window. Use this through the ->is_current() stat op.
- Change the balance_dirty_pages() triggered 'dirty_sleeping' atomic into a time stamp.
Use this in the throttling code to know if someone has slept in bdp() recently, instead of only knowing if a task is blocked there right now.
- Allow negative scaling. This allows us to have a tighter baseline setting for better latencies, while allowing us to go a bit deeper in queue depth for improved write performance for cases where we don't have a mixed workload.
- Add a wbt timer trace point.
- Change tracing from nanoseconds to microseconds, with the base noted.
- Added/improved code commenting.
- Fix the bug in wbc_to_write_flags(). Spotted by Omar.
- Kill the unused SCALE_BITMAP Kconfig setting. Spotted by Omar.
- Rebased to v4.8-rc5

Changes since v5
- Rebased on top of 4.8-rc4, dropping the parts of the series that are now in mainline.
- Fixes for QD=1 devices; should make them perform better.
- Fix for a hang when disabling WBT with IO in flight.
- Change in the sync issue/completion logic. Previously we used whether this IO was tracked or not (e.g. was a buffered write); this has now been changed to just look at reads. This is a better metric, and should improve behavior.
- Add some more comments to the code, explaining how it works.

Changes since v4
- Add some documentation for the two queue sysfs files.
- Kill off the wb_stats sysfs file. Use the trace points to get this info now.
- Various work around making this block-layer agnostic. The main code now resides in lib/wbt.c and can be plugged into NFS as well, for instance.
- Fix an issue with double completions on the block layer side.
- Fix an issue where a long sync issue was disregarded, if the stat sample wasn't valid.
- Speed up the division in rwb_arm_timer().
- Add logic to scale back up for 'unknown' latency events.
- Don't track sync issue timestamp if wbt is disabled.
- Drop the dirty/writeback page inc/dec patch. We don't need it, and it was racy.
- Move block/blk-wb.c to lib/wbt.c

Changes since v3
- Re-do the mm/ writeback parts. Add REQ_BG for background writes, and don't overload the wbc 'reason' for writeback decisions.
- Add tracking for when apps are sleeping waiting for a page to complete.
- Change wbc_to_write() to wbc_to_write_cmd().
- Use atomic_t for the balance_dirty_pages() sleep count.
- Add a basic scalable block stats tracking framework.
- Rewrite blk-wb core as described above, to dynamically adapt. This is a big change, see the last patch for a full description of it.
- Add tracing to blk-wb, instead of using debug printk's.
- Rebased to 4.6-rc3 (ish)

Changes since v2
- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change between wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1
- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

Documentation/block/queue-sysfs.txt | 13
block/Kconfig | 1
block/Makefile | 2
block/blk-core.c | 22 +
block/blk-mq-sysfs.c | 47 ++
block/blk-mq.c | 42 ++
block/blk-mq.h | 3
block/blk-settings.c | 15
block/blk-stat.c | 221 +++++++++++
block/blk-stat.h | 18
block/blk-sysfs.c | 151 ++++++++
block/cfq-iosched.c | 12
drivers/scsi/scsi.c | 3
fs/buffer.c | 2
fs/f2fs/data.c | 2
fs/f2fs/node.c | 2
fs/gfs2/meta_io.c | 3
fs/mpage.c | 2
fs/xfs/xfs_aops.c | 7
include/linux/backing-dev-defs.h | 2
include/linux/blk_types.h | 16
include/linux/blkdev.h | 19 +
include/linux/fs.h | 3
include/linux/wbt.h | 120 ++++++
include/linux/writeback.h | 10
include/trace/events/wbt.h | 153 ++++++++
lib/Kconfig | 3
lib/Makefile | 1
lib/wbt.c | 679 ++++++++++++++++++++++++++++++++++++
mm/backing-dev.c | 1
mm/page-writeback.c | 1
31 files changed, 1560 insertions(+), 16 deletions(-)

--
Jens Axboe
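The v6 changelog item above, turning the balance_dirty_pages() 'dirty_sleeping' atomic into a time stamp, can be sketched as follows. This is a userspace model under assumptions: the names and the 100 msec cutoff are illustrative, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Instead of an atomic "is someone blocked right now" count, record a
 * timestamp when a task sleeps in balance_dirty_pages(); the throttle
 * can then ask "did anyone sleep there recently?".
 */
#define BDP_RECENT_NSEC (100ULL * 1000 * 1000)	/* 100 msec, assumed */

struct bdp_state {
	uint64_t last_sleep_ns;	/* 0 == never slept */
};

/* called when a dirtying task is put to sleep in balance_dirty_pages() */
static void bdp_note_sleep(struct bdp_state *s, uint64_t now_ns)
{
	s->last_sleep_ns = now_ns;
}

/* writeback stays aggressive while this returns true */
static bool bdp_slept_recently(const struct bdp_state *s, uint64_t now_ns)
{
	return s->last_sleep_ns != 0 &&
	       now_ns - s->last_sleep_ns <= BDP_RECENT_NSEC;
}
```

The point of the change is visible in the model: a flag only says whether a task is blocked at this instant, while a timestamp lets the throttle keep cleaning at full speed for a short grace period after the last sleeper woke up.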
* [PATCH 8/8] writeback: throttle buffered writeback
  2016-09-07 14:46 [PATCH 0/8] Throttled background buffered writeback v7 Jens Axboe
@ 2016-09-07 14:46 ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread

From: Jens Axboe @ 2016-09-07 14:46 UTC (permalink / raw)
To: axboe, linux-kernel, linux-fsdevel, linux-block; +Cc: Jens Axboe

Test patch that throttles buffered writeback to make it a lot smoother, with far less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at a time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration from the CoDel network scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply decrement the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to go negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-wb quickly snaps back to its stable state of a zero scale count.

The patch registers two sysfs entries. The first one, 'wb_window_usec', defines the window of monitoring.
The second one, 'wb_lat_usec', sets the latency target for the window. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch these settings. We don't enable WBT on devices that are managed with CFQ, and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <axboe@fb.com> --- Documentation/block/queue-sysfs.txt | 13 ++++ block/Kconfig | 1 + block/blk-core.c | 20 +++++- block/blk-mq.c | 30 ++++++++- block/blk-settings.c | 3 + block/blk-stat.c | 5 +- block/blk-sysfs.c | 125 ++++++++++++++++++++++++++++++++++++ block/cfq-iosched.c | 12 ++++ include/linux/blkdev.h | 6 +- 9 files changed, 206 insertions(+), 9 deletions(-) diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt index 2a3904030dea..2847219ebd8c 100644 --- a/Documentation/block/queue-sysfs.txt +++ b/Documentation/block/queue-sysfs.txt @@ -169,5 +169,18 @@ This is the number of bytes the device can write in a single write-same command. A value of '0' means write-same is not supported by this device. +wb_lat_usec (RW) +---------------- +If the device is registered for writeback throttling, then this file shows +the target minimum read latency. If this latency is exceeded in a given +window of time (see wb_window_usec), then the writeback throttling will start +scaling back writes. + +wb_window_usec (RW) +------------------- +If the device is registered for writeback throttling, then this file shows +the value of the monitoring window in which we'll look at the target +latency. See wb_lat_usec. 
+ Jens Axboe <jens.axboe@oracle.com>, February 2009 diff --git a/block/Kconfig b/block/Kconfig index 161491d0a879..6da79e670709 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -4,6 +4,7 @@ menuconfig BLOCK bool "Enable the block layer" if EXPERT default y + select WBT help Provide block layer support for the kernel. diff --git a/block/blk-core.c b/block/blk-core.c index 4075cbeb720e..4f4ce050290c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -33,6 +33,7 @@ #include <linux/ratelimit.h> #include <linux/pm_runtime.h> #include <linux/blk-cgroup.h> +#include <linux/wbt.h> #define CREATE_TRACE_POINTS #include <trace/events/block.h> @@ -882,6 +883,8 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn, fail: blk_free_flush_queue(q->fq); + wbt_exit(q->rq_wb); + q->rq_wb = NULL; return NULL; } EXPORT_SYMBOL(blk_init_allocated_queue); @@ -1346,6 +1349,7 @@ void blk_requeue_request(struct request_queue *q, struct request *rq) blk_delete_timer(rq); blk_clear_rq_complete(rq); trace_block_rq_requeue(q, rq); + wbt_requeue(q->rq_wb, &rq->wb_stat); if (rq->cmd_flags & REQ_QUEUED) blk_queue_end_tag(q, rq); @@ -1436,6 +1440,8 @@ void __blk_put_request(struct request_queue *q, struct request *req) /* this is a bio leak */ WARN_ON(req->bio != NULL); + wbt_done(q->rq_wb, &req->wb_stat); + /* * Request may not have originated from ll_rw_blk. 
if not, * it didn't come out of our reserved rq pools @@ -1667,6 +1673,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) int el_ret, rw_flags = 0, where = ELEVATOR_INSERT_SORT; struct request *req; unsigned int request_count = 0; + unsigned int wb_acct; /* * low level driver can indicate that it wants pages above a @@ -1719,6 +1726,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) } get_rq: + wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, q->queue_lock); + /* * This sync check and mask will be re-done in init_request_from_bio(), * but we need to set it earlier to expose the sync flag to the @@ -1738,11 +1747,15 @@ get_rq: */ req = get_request(q, bio_data_dir(bio), rw_flags, bio, GFP_NOIO); if (IS_ERR(req)) { + if (wb_acct & WBT_TRACKED) + __wbt_done(q->rq_wb); bio->bi_error = PTR_ERR(req); bio_endio(bio); goto out_unlock; } + wbt_track(&req->wb_stat, wb_acct); + /* * After dropping the lock and possibly sleeping here, our request * may now be mergeable after it had proven unmergeable (above). 
@@ -2475,7 +2488,7 @@ void blk_start_request(struct request *req)
 {
 	blk_dequeue_request(req);
 
-	req->issue_time = ktime_to_ns(ktime_get());
+	wbt_issue(req->q->rq_wb, &req->wb_stat);
 
 	/*
 	 * We are now handing the request to the hardware, initialize
@@ -2713,9 +2726,10 @@ void blk_finish_request(struct request *req, int error)
 
 	blk_account_io_done(req);
 
-	if (req->end_io)
+	if (req->end_io) {
+		wbt_done(req->q->rq_wb, &req->wb_stat);
 		req->end_io(req, error);
-	else {
+	} else {
 		if (blk_bidi_rq(req))
 			__blk_put_request(req->next_rq->q, req->next_rq);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 712f141a6f1a..511289a4626a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/wbt.h>
 
 #include <trace/events/block.h>
 
@@ -319,6 +320,8 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->cmd_flags & REQ_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+
+	wbt_done(q->rq_wb, &rq->wb_stat);
 	rq->cmd_flags = 0;
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -351,6 +354,7 @@ inline void __blk_mq_end_request(struct request *rq, int error)
 	blk_account_io_done(rq);
 
 	if (rq->end_io) {
+		wbt_done(rq->q->rq_wb, &rq->wb_stat);
 		rq->end_io(rq, error);
 	} else {
 		if (unlikely(blk_bidi_rq(rq)))
@@ -457,7 +461,7 @@ void blk_mq_start_request(struct request *rq)
 	if (unlikely(blk_bidi_rq(rq)))
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
-	rq->issue_time = ktime_to_ns(ktime_get());
+	wbt_issue(q->rq_wb, &rq->wb_stat);
 
 	blk_add_timer(rq);
 
@@ -494,6 +498,7 @@ static void __blk_mq_requeue_request(struct request *rq)
 	struct request_queue *q = rq->q;
 
 	trace_block_rq_requeue(q, rq);
+	wbt_requeue(q->rq_wb, &rq->wb_stat);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		if (q->dma_drain_size && blk_rq_bytes(rq))
@@ -1312,6 +1317,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
 	blk_qc_t cookie;
+	unsigned int wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1326,9 +1332,16 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return BLK_QC_T_NONE;
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct & WBT_TRACKED)
+			__wbt_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	wbt_track(&rq->wb_stat, wb_acct);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -1405,6 +1418,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	struct blk_map_ctx data;
 	struct request *rq;
 	blk_qc_t cookie;
+	unsigned int wb_acct;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1421,9 +1435,16 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	} else
 		request_count = blk_plug_queued_count(q);
 
+	wb_acct = wbt_wait(q->rq_wb, bio->bi_opf, NULL);
+
 	rq = blk_mq_map_request(q, bio, &data);
-	if (unlikely(!rq))
+	if (unlikely(!rq)) {
+		if (wb_acct & WBT_TRACKED)
+			__wbt_done(q->rq_wb);
 		return BLK_QC_T_NONE;
+	}
+
+	wbt_track(&rq->wb_stat, wb_acct);
 
 	cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
 
@@ -2147,6 +2168,9 @@ void blk_mq_free_queue(struct request_queue *q)
 	list_del_init(&q->all_q_node);
 	mutex_unlock(&all_q_mutex);
 
+	wbt_exit(q->rq_wb);
+	q->rq_wb = NULL;
+
 	blk_mq_del_queue_tag_set(q);
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index f7e122e717e8..746dc9fee1ac 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -840,6 +840,7 @@ EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
 void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
 {
 	q->queue_depth = depth;
+	wbt_set_queue_depth(q->rq_wb, depth);
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
@@ -863,6 +864,8 @@ void blk_queue_write_cache(struct request_queue *q, bool wc, bool fua)
 	else
 		queue_flag_clear(QUEUE_FLAG_FUA, q);
 	spin_unlock_irq(q->queue_lock);
+
+	wbt_set_write_cache(q->rq_wb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 3965e8a258c8..bdb16d84b914 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -178,15 +178,16 @@ bool blk_stat_is_current(struct blk_rq_stat *stat)
 void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
 {
 	s64 now, value;
+	u64 rq_time = wbt_issue_stat_get_time(&rq->wb_stat);
 
 	now = ktime_to_ns(ktime_get());
-	if (now < rq->issue_time)
+	if (now < rq_time)
 		return;
 
 	if (!__blk_stat_is_current(stat, now))
 		__blk_stat_init(stat, now);
 
-	value = now - rq->issue_time;
+	value = now - rq_time;
 	if (value > stat->max)
 		stat->max = value;
 	if (value < stat->min)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0b9e435fec97..85c3dc22307b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -10,6 +10,7 @@
 #include <linux/blktrace_api.h>
 #include <linux/blk-mq.h>
 #include <linux/blk-cgroup.h>
+#include <linux/wbt.h>
 
 #include "blk.h"
 #include "blk-mq.h"
@@ -41,6 +42,19 @@ queue_var_store(unsigned long *var, const char *page, size_t count)
 	return count;
 }
 
+static ssize_t queue_var_store64(u64 *var, const char *page)
+{
+	int err;
+	u64 v;
+
+	err = kstrtou64(page, 10, &v);
+	if (err < 0)
+		return err;
+
+	*var = v;
+	return 0;
+}
+
 static ssize_t queue_requests_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->nr_requests, (page));
@@ -347,6 +361,58 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_wb_win_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->win_nsec, 1000));
+}
+
+static ssize_t queue_wb_win_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->win_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
+static ssize_t queue_wb_lat_show(struct request_queue *q, char *page)
+{
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	return sprintf(page, "%llu\n", div_u64(q->rq_wb->min_lat_nsec, 1000));
+}
+
+static ssize_t queue_wb_lat_store(struct request_queue *q, const char *page,
+				  size_t count)
+{
+	ssize_t ret;
+	u64 val;
+
+	if (!q->rq_wb)
+		return -EINVAL;
+
+	ret = queue_var_store64(&val, page);
+	if (ret < 0)
+		return ret;
+
+	q->rq_wb->min_lat_nsec = val * 1000ULL;
+	wbt_update_limits(q->rq_wb);
+	return count;
+}
+
 static ssize_t queue_wc_show(struct request_queue *q, char *page)
 {
 	if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -551,6 +617,18 @@ static struct queue_sysfs_entry queue_stats_entry = {
 	.show = queue_stats_show,
 };
 
+static struct queue_sysfs_entry queue_wb_lat_entry = {
+	.attr = {.name = "wbt_lat_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_lat_show,
+	.store = queue_wb_lat_store,
+};
+
+static struct queue_sysfs_entry queue_wb_win_entry = {
+	.attr = {.name = "wbt_window_usec", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_wb_win_show,
+	.store = queue_wb_win_store,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -579,6 +657,8 @@ static struct attribute *default_attrs[] = {
 	&queue_wc_entry.attr,
 	&queue_dax_entry.attr,
 	&queue_stats_entry.attr,
+	&queue_wb_lat_entry.attr,
+	&queue_wb_win_entry.attr,
 	NULL,
 };
 
@@ -693,6 +773,49 @@ struct kobj_type blk_queue_ktype = {
 	.release = blk_release_queue,
 };
 
+static void blk_wb_stat_get(void *data, struct blk_rq_stat *stat)
+{
+	blk_queue_stat_get(data, stat);
+}
+
+static void blk_wb_stat_clear(void *data)
+{
+	blk_stat_clear(data);
+}
+
+static bool blk_wb_stat_is_current(struct blk_rq_stat *stat)
+{
+	return blk_stat_is_current(stat);
+}
+
+static struct wb_stat_ops wb_stat_ops = {
+	.get		= blk_wb_stat_get,
+	.is_current	= blk_wb_stat_is_current,
+	.clear		= blk_wb_stat_clear,
+};
+
+static void blk_wb_init(struct request_queue *q)
+{
+	struct rq_wb *rwb;
+
+	rwb = wbt_init(&q->backing_dev_info, &wb_stat_ops, q);
+
+	/*
+	 * If this fails, we don't get throttling
+	 */
+	if (IS_ERR(rwb))
+		return;
+
+	if (blk_queue_nonrot(q))
+		rwb->min_lat_nsec = 2000000ULL;
+	else
+		rwb->min_lat_nsec = 75000000ULL;
+
+	wbt_set_queue_depth(rwb, blk_queue_depth(q));
+	wbt_set_write_cache(rwb, test_bit(QUEUE_FLAG_WC, &q->queue_flags));
+	q->rq_wb = rwb;
+}
+
 int blk_register_queue(struct gendisk *disk)
 {
 	int ret;
@@ -732,6 +855,8 @@ int blk_register_queue(struct gendisk *disk)
 	if (q->mq_ops)
 		blk_mq_register_disk(disk);
 
+	blk_wb_init(q);
+
 	if (!q->request_fn)
 		return 0;
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cc2f6dbd4303..ef61bda76317 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3777,6 +3777,18 @@ static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 		return;
 
 	/*
+	 * If we have a non-root cgroup, we can depend on that to
+	 * do proper throttling of writes. Turn off wbt for that
+	 * case.
+	 */
+	if (bio_blkcg(bio) != &blkcg_root) {
+		struct request_queue *q = cfqd->queue;
+
+		if (q->rq_wb)
+			wbt_disable(q->rq_wb);
+	}
+
+	/*
 	 * Drop reference to queues. New queues will be assigned in new
 	 * group upon arrival of fresh requests.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 259eba88f991..45256d75c4b7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -24,6 +24,7 @@
 #include <linux/rcupdate.h>
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
+#include <linux/wbt.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -37,6 +38,7 @@ struct bsg_job;
 struct blkcg_gq;
 struct blk_flush_queue;
 struct pr_ops;
+struct rq_wb;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
@@ -151,7 +153,7 @@ struct request {
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
 	unsigned long start_time;
-	s64 issue_time;
+	struct wb_issue_stat wb_stat;
 #ifdef CONFIG_BLK_CGROUP
 	struct request_list *rl;		/* rl this rq is alloced from */
 	unsigned long long start_time_ns;
@@ -303,6 +305,8 @@ struct request_queue {
 	int			nr_rqs[2];	/* # allocated [a]sync rqs */
 	int			nr_rqs_elvpriv;	/* # allocated rqs w/ elvpriv */
 
+	struct rq_wb		*rq_wb;
+
 	/*
 	 * If blkcg is not used, @q->root_rl serves all requests. If blkcg
 	 * is used, root blkg allocates from @q->root_rl and all other
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread
end of thread, other threads: [~2016-09-07 14:47 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-23 15:25 Jens Axboe
2016-03-23 15:25 ` [PATCH 1/8] writeback: propagate the various reasons for writeback Jens Axboe
2016-03-23 15:25 ` [PATCH 2/8] writeback: add wbc_to_write() Jens Axboe
2016-03-23 15:25 ` [PATCH 3/8] writeback: use WRITE_SYNC for reclaim or sync writeback Jens Axboe
2016-03-23 15:25 ` [PATCH 4/8] writeback: track if we're sleeping on progress in balance_dirty_pages() Jens Axboe
2016-03-23 15:25 ` [PATCH 5/8] block: add ability to flag write back caching on a device Jens Axboe
2016-03-23 15:25 ` [PATCH 6/8] sd: inform block layer of write cache state Jens Axboe
2016-03-23 15:25 ` [PATCH 7/8] NVMe: " Jens Axboe
2016-03-23 15:25 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
2016-03-23 15:39 ` [PATCHSET v2][RFC] Make background writeback not suck Jens Axboe
2016-03-24 17:42   ` Jens Axboe
2016-04-18  4:24 [PATCHSET v4 0/8] " Jens Axboe
2016-04-18  4:24 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
2016-04-23  8:21   ` xiakaixu
2016-04-23 21:37     ` Jens Axboe
2016-04-25 11:41       ` xiakaixu
2016-04-25 14:37         ` Jens Axboe
2016-04-26 15:55 [PATCHSET v5] Make background writeback great again for the first time Jens Axboe
2016-04-26 15:55 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe
2016-08-31 17:05 [PATCHSET v6] Throttled background " Jens Axboe
2016-08-31 17:05 ` [PATCH 8/8] writeback: throttle " Jens Axboe
2016-09-07 14:46 [PATCH 0/8] Throttled background buffered writeback v7 Jens Axboe
2016-09-07 14:46 ` [PATCH 8/8] writeback: throttle buffered writeback Jens Axboe