* [PATCH 1/2] [v2] blk-mq: fix plugging in blk_sq_make_request
@ 2015-04-06 19:14 Jeff Moyer
From: Jeff Moyer @ 2015-04-06 19:14 UTC
  To: axboe, linux-kernel

The following appears in blk_sq_make_request:

	/*
	 * If we have multiple hardware queues, just go directly to
	 * one of those for sync IO.
	 */

We clearly don't have multiple hardware queues here!  This comment was
introduced by commit 07068d5b8e (blk-mq: split make request handler for
multi and single queue):

    We want slightly different behavior from them:

    - On single queue devices, we currently use the per-process plug
      for deferred IO and for merging.

    - On multi queue devices, we don't use the per-process plug, but
      we want to go straight to hardware for SYNC IO.

The old code had this:

        use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

	use_plug = !is_flush_fua && !is_sync;

which is not equivalent.  For the single queue case, the second half of
the old && expression is always true.  So, what I think was actually
intended follows (and this more closely matches what is done in
blk_queue_bio).
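
To spell the single queue (nr_hw_queues == 1) case out:

	/* original */
	use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);
	         = !is_flush_fua && (true || !is_sync)
	         = !is_flush_fua;

	/* after 07068d5b8e */
	use_plug = !is_flush_fua && !is_sync;

In other words, the conversion stopped plugging all sync I/O on single
queue devices, not just flush/fua requests.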

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

Changelog
v1->v2: removed the accidental addition of merging for flush/fua requests
---
 block/blk-mq.c | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b7b8933..cca5097 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1306,16 +1306,11 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = rw_is_sync(bio->bi_rw);
 	const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
-	unsigned int use_plug, request_count = 0;
+	struct blk_plug *plug;
+	unsigned int request_count = 0;
 	struct blk_map_ctx data;
 	struct request *rq;
 
-	/*
-	 * If we have multiple hardware queues, just go directly to
-	 * one of those for sync IO.
-	 */
-	use_plug = !is_flush_fua && !is_sync;
-
 	blk_queue_bounce(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
@@ -1323,7 +1318,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
-	if (use_plug && !blk_queue_nomerges(q) &&
+	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count))
 		return;
 
@@ -1342,21 +1337,18 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	 * utilize that to temporarily store requests until the task is
 	 * either done or scheduled away.
 	 */
-	if (use_plug) {
-		struct blk_plug *plug = current->plug;
-
-		if (plug) {
-			blk_mq_bio_to_request(rq, bio);
-			if (list_empty(&plug->mq_list))
-				trace_block_plug(q);
-			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
-				trace_block_plug(q);
-			}
-			list_add_tail(&rq->queuelist, &plug->mq_list);
-			blk_mq_put_ctx(data.ctx);
-			return;
+	plug = current->plug;
+	if (plug) {
+		blk_mq_bio_to_request(rq, bio);
+		if (list_empty(&plug->mq_list))
+			trace_block_plug(q);
+		else if (request_count >= BLK_MAX_REQUEST_COUNT) {
+			blk_flush_plug_list(plug, false);
+			trace_block_plug(q);
 		}
+		list_add_tail(&rq->queuelist, &plug->mq_list);
+		blk_mq_put_ctx(data.ctx);
+		return;
 	}
 
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
-- 
1.8.3.1



* [PATCH 2/2] blk-plug: don't flush nested plug lists
From: Jeff Moyer @ 2015-04-06 19:14 UTC
  To: axboe, linux-kernel

The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os.  This can be less than optimal (read
awful) for certain workloads.  For example, consider an application
that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug's instantiated both at the upper
call site in do_io_submit and down in do_direct_IO.  The latter will
submit as few as one I/O at a time (if you have a small enough I/O
size) instead of performing the batching that the plugging
infrastructure is supposed to provide.
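
Schematically (simplified, with the intermediate calls elided):

	do_io_submit()
		blk_start_plug()		<- outer plug
		...
		do_blockdev_direct_IO()
			blk_start_plug()	<- nested plug
			do_direct_IO()		<- may queue as few as one I/O
			blk_finish_plug()	<- flushes right here
		...
		blk_finish_plug()		<- nothing left to batch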

Now, for the case where there is an elevator involved, this doesn't
really matter too much.  The elevator will keep the I/O around long
enough for it to be merged.  However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example (note I'm using a virtio-blk device, so it's
using the blk-mq single queue path, though I've also reproduced this
with the Micron P320h):

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges.  Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test rig:

Unpatched kernel:
Read B/W:    283,638 KB/s
Read Merges: 0

Patched kernel:
Read B/W:    873,224 KB/s
Read Merges: 2,046K

I considered several approaches to solving the problem:
1)  get rid of the innermost plugs
2)  handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
    code a bit at the expense of memory footprint

Option 1 would be tricky or impossible, since innermost plug lists
are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements.  Option 2a is perhaps a better
idea, but since I already implemented option 2, I figured I'd post it
for comments and opinions before rewriting it.
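
To illustrate the semantics of option 2 as implemented here (a sketch,
not a real call site):

	struct blk_plug *plug, outer_stack, inner_stack;

	plug = blk_start_plug(&outer_stack);	/* depth = 1, installed */
	...
	plug = blk_start_plug(&inner_stack);	/* depth = 2; returns the
						   outer plug, inner_stack
						   is never used */
	submit_bio(rw, bio);			/* queued on the one plug */
	blk_finish_plug(plug);			/* depth = 1, no flush */
	...
	blk_finish_plug(plug);			/* depth = 0, flush and
						   clear current->plug */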

Much of the patch involves modifying call sites to blk_start_plug,
since its signature is changed.  The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h.  The only tricky bits were places where plugs
were finished and then restarted to flush out I/O.  There, I went
ahead and exported blk_flush_plug_list and called that directly.
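
For example, the unplug-and-replug pattern in fs/btrfs/volumes.c:

	blk_finish_plug(&plug);
	blk_start_plug(&plug);

becomes:

	blk_flush_plug_list(plug, false);

which pushes out the pending I/O without tearing down a plug that may
now be shared with a caller higher up the stack.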

Comments would be greatly appreciated.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
---
 block/blk-core.c                    | 33 +++++++++++++++++++--------------
 block/blk-lib.c                     |  6 +++---
 block/blk-throttle.c                |  6 +++---
 drivers/block/xen-blkback/blkback.c |  6 +++---
 drivers/md/dm-bufio.c               | 15 +++++++--------
 drivers/md/dm-kcopyd.c              |  6 +++---
 drivers/md/dm-thin.c                |  6 +++---
 drivers/md/md.c                     |  6 +++---
 drivers/md/raid1.c                  |  6 +++---
 drivers/md/raid10.c                 |  6 +++---
 drivers/md/raid5.c                  | 12 ++++++------
 drivers/target/target_core_iblock.c |  6 +++---
 fs/aio.c                            |  6 +++---
 fs/block_dev.c                      |  6 +++---
 fs/btrfs/scrub.c                    |  6 +++---
 fs/btrfs/transaction.c              |  6 +++---
 fs/btrfs/tree-log.c                 | 16 ++++++++--------
 fs/btrfs/volumes.c                  | 12 +++++-------
 fs/buffer.c                         |  6 +++---
 fs/direct-io.c                      |  8 +++++---
 fs/ext4/file.c                      |  6 +++---
 fs/ext4/inode.c                     | 12 +++++-------
 fs/f2fs/checkpoint.c                |  6 +++---
 fs/f2fs/gc.c                        |  6 +++---
 fs/f2fs/node.c                      |  6 +++---
 fs/jbd/checkpoint.c                 |  6 +++---
 fs/jbd/commit.c                     | 10 +++++-----
 fs/jbd2/checkpoint.c                |  6 +++---
 fs/jbd2/commit.c                    |  6 +++---
 fs/mpage.c                          |  6 +++---
 fs/xfs/xfs_buf.c                    | 12 ++++++------
 fs/xfs/xfs_dir2_readdir.c           |  6 +++---
 fs/xfs/xfs_itable.c                 |  6 +++---
 include/linux/blkdev.h              |  3 ++-
 mm/madvise.c                        |  6 +++---
 mm/page-writeback.c                 |  6 +++---
 mm/readahead.c                      |  6 +++---
 mm/swap_state.c                     |  6 +++---
 mm/vmscan.c                         |  6 +++---
 39 files changed, 155 insertions(+), 152 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7..64f3f2a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3002,7 +3002,7 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
 
 /**
  * blk_start_plug - initialize blk_plug and track it inside the task_struct
- * @plug:	The &struct blk_plug that needs to be initialized
+ * @plug:	The on-stack &struct blk_plug that needs to be initialized
  *
  * Description:
  *   Tracking blk_plug inside the task_struct will help with auto-flushing the
@@ -3013,26 +3013,29 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
  *   page belonging to that request that is currently residing in our private
  *   plug. By flushing the pending I/O when the process goes to sleep, we avoid
  *   this kind of deadlock.
+ *
+ * Returns: a pointer to the active &struct blk_plug
  */
-void blk_start_plug(struct blk_plug *plug)
+struct blk_plug *blk_start_plug(struct blk_plug *plug)
 {
 	struct task_struct *tsk = current;
 
+	if (tsk->plug) {
+		tsk->plug->depth++;
+		return tsk->plug;
+	}
+
+	plug->depth = 1;
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
 
 	/*
-	 * If this is a nested plug, don't actually assign it. It will be
-	 * flushed on its own.
+	 * Store ordering should not be needed here, since a potential
+	 * preempt will imply a full memory barrier
 	 */
-	if (!tsk->plug) {
-		/*
-		 * Store ordering should not be needed here, since a potential
-		 * preempt will imply a full memory barrier
-		 */
-		tsk->plug = plug;
-	}
+	tsk->plug = plug;
+	return tsk->plug;
 }
 EXPORT_SYMBOL(blk_start_plug);
 
@@ -3176,13 +3179,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	local_irq_restore(flags);
 }
+EXPORT_SYMBOL_GPL(blk_flush_plug_list);
 
 void blk_finish_plug(struct blk_plug *plug)
 {
-	blk_flush_plug_list(plug, false);
+	if (--plug->depth > 0)
+		return;
 
-	if (plug == current->plug)
-		current->plug = NULL;
+	blk_flush_plug_list(plug, false);
+	current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..e2d2448 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -48,7 +48,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	struct bio_batch bb;
 	struct bio *bio;
 	int ret = 0;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	if (!q)
 		return -ENXIO;
@@ -81,7 +81,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 	bb.flags = 1 << BIO_UPTODATE;
 	bb.wait = &wait;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	while (nr_sects) {
 		unsigned int req_sects;
 		sector_t end_sect, tmp;
@@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		 */
 		cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5b9c6d5..f57bbd3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1266,7 +1266,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 	struct request_queue *q = td->queue;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	int rw;
 
 	bio_list_init(&bio_list_on_stack);
@@ -1278,10 +1278,10 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 	spin_unlock_irq(q->queue_lock);
 
 	if (!bio_list_empty(&bio_list_on_stack)) {
-		blk_start_plug(&plug);
+		plug = blk_start_plug(&onstack_plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 	}
 }
 
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 2a04d34..f075182 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1207,7 +1207,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	struct bio **biolist = pending_req->biolist;
 	int i, nbio = 0;
 	int operation;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	bool drain = false;
 	struct grant_page **pages = pending_req->segments;
 	unsigned short req_operation;
@@ -1368,13 +1368,13 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 	}
 
 	atomic_set(&pending_req->pendcnt, nbio);
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	for (i = 0; i < nbio; i++)
 		submit_bio(operation, biolist[i]);
 
 	/* Let the I/Os go.. */
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	if (operation == READ)
 		blkif->st_rd_sect += preq.nr_sects;
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 86dbbc7..a6cbc6f 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -706,8 +706,8 @@ static void __write_dirty_buffer(struct dm_buffer *b,
 
 static void __flush_write_list(struct list_head *write_list)
 {
-	struct blk_plug plug;
-	blk_start_plug(&plug);
+	struct blk_plug *plug, onstack_plug;
+	plug = blk_start_plug(&onstack_plug);
 	while (!list_empty(write_list)) {
 		struct dm_buffer *b =
 			list_entry(write_list->next, struct dm_buffer, write_list);
@@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
 		submit_io(b, WRITE, b->block, write_endio);
 		dm_bufio_cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 /*
@@ -1110,13 +1110,13 @@ EXPORT_SYMBOL_GPL(dm_bufio_new);
 void dm_bufio_prefetch(struct dm_bufio_client *c,
 		       sector_t block, unsigned n_blocks)
 {
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	LIST_HEAD(write_list);
 
 	BUG_ON(dm_bufio_in_request());
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	dm_bufio_lock(c);
 
 	for (; n_blocks--; block++) {
@@ -1126,9 +1126,8 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 				&write_list);
 		if (unlikely(!list_empty(&write_list))) {
 			dm_bufio_unlock(c);
-			blk_finish_plug(&plug);
 			__flush_write_list(&write_list);
-			blk_start_plug(&plug);
+			blk_flush_plug_list(plug, false);
 			dm_bufio_lock(c);
 		}
 		if (unlikely(b != NULL)) {
@@ -1149,7 +1148,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 	dm_bufio_unlock(c);
 
 flush_plug:
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
 
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade..fed371f 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -580,7 +580,7 @@ static void do_work(struct work_struct *work)
 {
 	struct dm_kcopyd_client *kc = container_of(work,
 					struct dm_kcopyd_client, kcopyd_work);
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	/*
 	 * The order that these are called is *very* important.
@@ -589,11 +589,11 @@ static void do_work(struct work_struct *work)
 	 * list.  io jobs call wake when they complete and it all
 	 * starts again.
 	 */
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 /*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 921aafd..6a7459a 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1775,7 +1775,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 	unsigned long flags;
 	struct bio *bio;
 	struct bio_list bios;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	unsigned count = 0;
 
 	if (tc->requeue_mode) {
@@ -1799,7 +1799,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 
 	spin_unlock_irqrestore(&tc->lock, flags);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	while ((bio = bio_list_pop(&bios))) {
 		/*
 		 * If we've got no free new_mapping structs, and processing
@@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 			dm_pool_issue_prefetches(pool->pmd);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 static int cmp_cells(const void *lhs, const void *rhs)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 717daad..9f24719 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7402,7 +7402,7 @@ void md_do_sync(struct md_thread *thread)
 	int skipped = 0;
 	struct md_rdev *rdev;
 	char *desc, *action = NULL;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	/* just incase thread restarts... */
 	if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
@@ -7572,7 +7572,7 @@ void md_do_sync(struct md_thread *thread)
 	md_new_event(mddev);
 	update_time = jiffies;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	while (j < max_sectors) {
 		sector_t sectors;
 
@@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
 	/*
 	 * this also signals 'finished resyncing' to md_stop
 	 */
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	/* tell personality that we are finished */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..2608bc3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2399,11 +2399,11 @@ static void raid1d(struct md_thread *thread)
 	unsigned long flags;
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	md_check_recovery(mddev);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (;;) {
 
 		flush_pending_writes(conf);
@@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 static int init_resync(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c4..7bea5c7 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2791,11 +2791,11 @@ static void raid10d(struct md_thread *thread)
 	unsigned long flags;
 	struct r10conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	md_check_recovery(mddev);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (;;) {
 
 		flush_pending_writes(conf);
@@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 static int init_resync(struct r10conf *conf)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..59e7090 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5259,11 +5259,11 @@ static void raid5_do_work(struct work_struct *work)
 	struct r5conf *conf = group->conf;
 	int group_id = group - conf->worker_groups;
 	int handled;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	pr_debug("+++ raid5worker active\n");
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
@@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
 	pr_debug("%d stripes handled\n", handled);
 
 	spin_unlock_irq(&conf->device_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	pr_debug("--- raid5worker inactive\n");
 }
@@ -5298,13 +5298,13 @@ static void raid5d(struct md_thread *thread)
 	struct mddev *mddev = thread->mddev;
 	struct r5conf *conf = mddev->private;
 	int handled;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	pr_debug("+++ raid5d active\n");
 
 	md_check_recovery(mddev);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	handled = 0;
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
@@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
 	spin_unlock_irq(&conf->device_lock);
 
 	async_tx_issue_pending_all();
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	pr_debug("--- raid5d inactive\n");
 }
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d4a4b0f..248a6bb 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -361,13 +361,13 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
 
 static void iblock_submit_bios(struct bio_list *list, int rw)
 {
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	struct bio *bio;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	while ((bio = bio_list_pop(list)))
 		submit_bio(rw, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 static void iblock_end_io_flush(struct bio *bio, int err)
diff --git a/fs/aio.c b/fs/aio.c
index f8e52a1..b1c3583 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1575,7 +1575,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 	struct kioctx *ctx;
 	long ret = 0;
 	int i = 0;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	if (unlikely(nr < 0))
 		return -EINVAL;
@@ -1592,7 +1592,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		return -EINVAL;
 	}
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	/*
 	 * AKPM: should this return a partial result if some of the IOs were
@@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..928d3e0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1598,10 +1598,10 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	ssize_t ret;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	ret = __generic_file_write_iter(iocb, from);
 	if (ret > 0) {
 		ssize_t err;
@@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err < 0)
 			ret = err;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec57687..768bac3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2967,7 +2967,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_root *csum_root = fs_info->csum_root;
 	struct btrfs_extent_item *extent;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	u64 flags;
 	int ret;
 	int slot;
@@ -3088,7 +3088,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 * collect all data csums for the stripe to avoid seeking during
 	 * the scrub. This might currently (crc32) end up to be about 1MB
 	 */
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	/*
 	 * now find all extents for each stripe and scrub them
@@ -3316,7 +3316,7 @@ out:
 	scrub_wr_submit(sctx);
 	mutex_unlock(&sctx->wr_ctx.wr_lock);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8be4278..b5a8078 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -979,11 +979,11 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 {
 	int ret;
 	int ret2;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
 
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c5b8ba3..1623924 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2519,7 +2519,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
 	int log_transid = 0;
 	struct btrfs_log_ctx root_log_ctx;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	mutex_lock(&root->log_mutex);
 	log_transid = ctx->log_transid;
@@ -2571,10 +2571,10 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	/* we start IO on  all the marked extents here, but we don't actually
 	 * wait for them until later.
 	 */
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
 	if (ret) {
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 		btrfs_abort_transaction(trans, root, ret);
 		btrfs_free_logged_extents(log, log_transid);
 		btrfs_set_log_full_commit(root->fs_info, trans);
@@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		if (!list_empty(&root_log_ctx.list))
 			list_del_init(&root_log_ctx.list);
 
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 		btrfs_set_log_full_commit(root->fs_info, trans);
 
 		if (ret != -ENOSPC) {
@@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	}
 
 	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = root_log_ctx.log_ret;
 		goto out;
@@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
 						mark);
 		btrfs_wait_logged_extents(trans, log, log_transid);
@@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * check the full commit flag again
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	ret = btrfs_write_marked_extents(log_root_tree,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	if (ret) {
 		btrfs_set_log_full_commit(root->fs_info, trans);
 		btrfs_abort_transaction(trans, root, ret);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..0e215ff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -261,7 +261,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
 	unsigned long last_waited = 0;
 	int force_reg = 0;
 	int sync_pending = 0;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	/*
 	 * this function runs all the bios we've collected for
@@ -269,7 +269,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
 	 * another device without first sending all of these down.
 	 * So, setup a plug here and finish it off before we return
 	 */
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	bdi = blk_get_backing_dev_info(device->bdev);
 	fs_info = device->dev_root->fs_info;
@@ -358,8 +358,7 @@ loop_lock:
 		if (pending_bios == &device->pending_sync_bios) {
 			sync_pending = 1;
 		} else if (sync_pending) {
-			blk_finish_plug(&plug);
-			blk_start_plug(&plug);
+			blk_flush_plug_list(plug, false);
 			sync_pending = 0;
 		}
 
@@ -415,8 +414,7 @@ loop_lock:
 		}
 		/* unplug every 64 requests just for good measure */
 		if (batch_run % 64 == 0) {
-			blk_finish_plug(&plug);
-			blk_start_plug(&plug);
+			blk_flush_plug_list(plug, false);
 			sync_pending = 0;
 		}
 	}
@@ -431,7 +429,7 @@ loop_lock:
 	spin_unlock(&device->io_lock);
 
 done:
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 static void pending_bios_fn(struct btrfs_work *work)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..727d642 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -717,10 +717,10 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	struct list_head tmp;
 	struct address_space *mapping;
 	int err = 0, err2;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	INIT_LIST_HEAD(&tmp);
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	spin_lock(lock);
 	while (!list_empty(list)) {
@@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	}
 
 	spin_unlock(lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	spin_lock(lock);
 
 	while (!list_empty(&tmp)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..e79c0c6 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -488,6 +488,8 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
 static void dio_await_completion(struct dio *dio)
 {
 	struct bio *bio;
+
+	blk_flush_plug(current);
 	do {
 		bio = dio_await_one(dio);
 		if (bio)
@@ -1108,7 +1110,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct dio *dio;
 	struct dio_submit sdio = { 0, };
 	struct buffer_head map_bh = { 0, };
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	unsigned long align = offset | iov_iter_alignment(iter);
 
 	if (rw & WRITE)
@@ -1231,7 +1233,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 
 	sdio.pages_in_io += iov_iter_npages(iter, INT_MAX);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	retval = do_direct_IO(dio, &sdio, &map_bh);
 	if (retval)
@@ -1262,7 +1264,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	/*
 	 * It is possible that, we return short IO due to end of file.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..fd0cf21 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -94,7 +94,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct mutex *aio_mutex = NULL;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	int o_direct = io_is_direct(file);
 	int overwrite = 0;
 	size_t length = iov_iter_count(from);
@@ -139,7 +139,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	iocb->private = &overwrite;
 	if (o_direct) {
-		blk_start_plug(&plug);
+		plug = blk_start_plug(&onstack_plug);
 
 
 		/* check whether we do a DIO overwrite or not */
@@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			ret = err;
 	}
 	if (o_direct)
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 
 errout:
 	if (aio_mutex)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..d4b645b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2284,7 +2284,7 @@ static int ext4_writepages(struct address_space *mapping,
 	int needed_blocks, rsv_blocks = 0, ret = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
 	bool done;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	bool give_up_on_write = false;
 
 	trace_ext4_writepages(inode, wbc);
@@ -2298,11 +2298,9 @@ static int ext4_writepages(struct address_space *mapping,
 		goto out_writepages;
 
 	if (ext4_should_journal_data(inode)) {
-		struct blk_plug plug;
-
-		blk_start_plug(&plug);
+		plug = blk_start_plug(&onstack_plug);
 		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-		blk_finish_plug(&plug);
+		blk_finish_plug(plug);
 		goto out_writepages;
 	}
 
@@ -2368,7 +2366,7 @@ retry:
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
 	done = false;
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	while (!done && mpd.first_page <= mpd.last_page) {
 		/* For each extent of pages we use new io_end */
 		mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
@@ -2438,7 +2436,7 @@ retry:
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	if (!ret && !cycled && wbc->nr_to_write > 0) {
 		cycled = 1;
 		mpd.last_page = writeback_index - 1;
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7f794b7..de2b522 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -810,10 +810,10 @@ static int block_operations(struct f2fs_sb_info *sbi)
 		.nr_to_write = LONG_MAX,
 		.for_reclaim = 0,
 	};
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	int err = 0;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 retry_flush_dents:
 	f2fs_lock_all(sbi);
@@ -846,7 +846,7 @@ retry_flush_nodes:
 		goto retry_flush_nodes;
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	return err;
 }
 
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 76adbc3..082e961 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -661,12 +661,12 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
 {
 	struct page *sum_page;
 	struct f2fs_summary_block *sum;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	/* read segment summary of victim */
 	sum_page = get_sum_page(sbi, segno);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	sum = page_address(sum_page);
 
@@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
 		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
 		break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
 	stat_inc_call_count(sbi->stat_info);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..e1fc81e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1066,7 +1066,7 @@ repeat:
 struct page *get_node_page_ra(struct page *parent, int start)
 {
 	struct f2fs_sb_info *sbi = F2FS_P_SB(parent);
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	struct page *page;
 	int err, i, end;
 	nid_t nid;
@@ -1086,7 +1086,7 @@ repeat:
 	else if (err == LOCKED_PAGE)
 		goto page_hit;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	/* Then, try readahead for siblings of the desired node */
 	end = start + MAX_RA_NODE;
@@ -1098,7 +1098,7 @@ repeat:
 		ra_node_page(sbi, nid);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	lock_page(page);
 	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 08c0304..29b550d 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -258,12 +258,12 @@ static void
 __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
 {
 	int i;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = bhs[i];
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index bb217dc..ccb32f0 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -311,7 +311,7 @@ void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	int write_op = WRITE;
 
 	/*
@@ -444,10 +444,10 @@ void journal_commit_transaction(journal_t *journal)
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 */
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	err = journal_submit_data_buffers(journal, commit_transaction,
 					  write_op);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -503,7 +503,7 @@ void journal_commit_transaction(journal_t *journal)
 		err = 0;
 	}
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	journal_write_revoke_records(journal, commit_transaction, write_op);
 
@@ -697,7 +697,7 @@ start_journal_io:
 		}
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 988b32e..31b5e73 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -182,12 +182,12 @@ static void
 __flush_batch(journal_t *journal, int *batch_count)
 {
 	int i;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..713d26a 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -390,7 +390,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	/* Tail of the journal */
 	unsigned long first_block;
 	tid_t first_tid;
@@ -555,7 +555,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	if (err)
 		jbd2_journal_abort(journal, err);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	jbd2_journal_write_revoke_records(journal, commit_transaction,
 					  &log_bufs, WRITE_SYNC);
 
@@ -805,7 +805,7 @@ start_journal_io:
 			__jbd2_journal_abort_hard(journal);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..926dc42 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -676,10 +676,10 @@ int
 mpage_writepages(struct address_space *mapping,
 		struct writeback_control *wbc, get_block_t get_block)
 {
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	int ret;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	if (!get_block)
 		ret = generic_writepages(mapping, wbc);
@@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
 		if (mpd.bio)
 			mpage_bio_submit(WRITE, mpd.bio);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1790b00..776ac5a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1208,7 +1208,7 @@ STATIC void
 _xfs_buf_ioapply(
 	struct xfs_buf	*bp)
 {
-	struct blk_plug	plug;
+	struct blk_plug	*plug, onstack_plug;
 	int		rw;
 	int		offset;
 	int		size;
@@ -1281,7 +1281,7 @@ _xfs_buf_ioapply(
 	 */
 	offset = bp->b_offset;
 	size = BBTOB(bp->b_io_length);
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (i = 0; i < bp->b_map_count; i++) {
 		xfs_buf_ioapply_map(bp, i, &offset, &size, rw);
 		if (bp->b_error)
@@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
 		if (size <= 0)
 			break;	/* all done */
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 /*
@@ -1772,7 +1772,7 @@ __xfs_buf_delwri_submit(
 	struct list_head	*io_list,
 	bool			wait)
 {
-	struct blk_plug		plug;
+	struct blk_plug		*plug, onstack_plug;
 	struct xfs_buf		*bp, *n;
 	int			pinned = 0;
 
@@ -1806,7 +1806,7 @@ __xfs_buf_delwri_submit(
 
 	list_sort(NULL, io_list, xfs_buf_cmp);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	list_for_each_entry_safe(bp, n, io_list, b_list) {
 		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
 		bp->b_flags |= XBF_WRITE | XBF_ASYNC;
@@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
 
 		xfs_buf_submit(bp);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	return pinned;
 }
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 098cd78..6074671 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -275,7 +275,7 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_inode	*dp = args->dp;
 	struct xfs_buf		*bp = *bpp;
 	struct xfs_bmbt_irec	*map = mip->map;
-	struct blk_plug		plug;
+	struct blk_plug		*plug, onstack_plug;
 	int			error = 0;
 	int			length;
 	int			i;
@@ -404,7 +404,7 @@ xfs_dir2_leaf_readbuf(
 	/*
 	 * Do we need more readahead?
 	 */
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (mip->ra_index = mip->ra_offset = i = 0;
 	     mip->ra_want > mip->ra_current && i < mip->map_blocks;
 	     i += geo->fsbcount) {
@@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
 			}
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 out:
 	*bpp = bp;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..b890aa4 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -179,7 +179,7 @@ xfs_bulkstat_ichunk_ra(
 	struct xfs_inobt_rec_incore	*irec)
 {
 	xfs_agblock_t			agbno;
-	struct blk_plug			plug;
+	struct blk_plug			*plug, onstack_plug;
 	int				blks_per_cluster;
 	int				inodes_per_cluster;
 	int				i;	/* inode chunk index */
@@ -188,7 +188,7 @@ xfs_bulkstat_ichunk_ra(
 	blks_per_cluster = xfs_icluster_size_fsb(mp);
 	inodes_per_cluster = blks_per_cluster << mp->m_sb.sb_inopblog;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (i = 0; i < XFS_INODES_PER_CHUNK;
 	     i += inodes_per_cluster, agbno += blks_per_cluster) {
 		if (xfs_inobt_maskn(i, inodes_per_cluster) & ~irec->ir_free) {
@@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
 					     &xfs_inode_buf_ops);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..4617fce 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
  * schedule() where blk_schedule_flush_plug() is called.
  */
 struct blk_plug {
+	int depth; /* number of nested plugs */
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
@@ -1106,7 +1107,7 @@ struct blk_plug_cb {
 };
 extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
 					     void *data, int size);
-extern void blk_start_plug(struct blk_plug *);
+extern struct blk_plug *blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
 extern void blk_flush_plug_list(struct blk_plug *, bool);
 
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..1f7d5ad 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -463,7 +463,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	int error = -EINVAL;
 	int write;
 	size_t len;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 #ifdef CONFIG_MEMORY_FAILURE
 	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
@@ -503,7 +503,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 	if (vma && start > vma->vm_start)
 		prev = vma;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (;;) {
 		/* Still start < end. */
 		error = -ENOMEM;
@@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			vma = find_vma(current->mm, start);
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	if (write)
 		up_write(&current->mm->mmap_sem);
 	else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb6..9369d5e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2011,16 +2011,16 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc)
 {
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	int ret;
 
 	/* deal with chardevs and other special file */
 	if (!mapping->a_ops->writepage)
 		return 0;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..c3350b5 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -111,11 +111,11 @@ EXPORT_SYMBOL(read_cache_pages);
 static int read_pages(struct address_space *mapping, struct file *filp,
 		struct list_head *pages, unsigned nr_pages)
 {
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	unsigned page_idx;
 	int ret;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 
 	if (mapping->a_ops->readpages) {
 		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
@@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	ret = 0;
 
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	return ret;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f..3f6b8ec 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -455,7 +455,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long offset = entry_offset;
 	unsigned long start_offset, end_offset;
 	unsigned long mask;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
@@ -467,7 +467,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	if (!start_offset)	/* First page is swap header. */
 		start_offset++;
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
@@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			SetPageReadahead(page);
 		page_cache_release(page);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..fd29974 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2130,7 +2130,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
-	struct blk_plug plug;
+	struct blk_plug *plug, onstack_plug;
 	bool scan_adjusted;
 
 	get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
@@ -2152,7 +2152,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
 			 sc->priority == DEF_PRIORITY);
 
-	blk_start_plug(&plug);
+	plug = blk_start_plug(&onstack_plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		unsigned long nr_anon, nr_file, percentage;
@@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 
 		scan_adjusted = true;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug(plug);
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
-- 
1.8.3.1



* Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
From: Ming Lei @ 2015-04-07  9:19 UTC
  To: Jeff Moyer; +Cc: Jens Axboe, Linux Kernel Mailing List

Hi Jeff,

On Tue, Apr 7, 2015 at 3:14 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plug's instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as few as one I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.
>
> Now, for the case where there is an elevator involved, this doesn't
> really matter too much.  The elevator will keep the I/O around long
> enough for it to be merged.  However, in cases where there is no
> elevator (like blk-mq), I/Os are simply dispatched immediately.
>
> Try this, for example (note I'm using a virtio-blk device, so it's
> using the blk-mq single queue path, though I've also reproduced this
> with the Micron P320h):
>
> fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
>
> If you run that on a current kernel, you will get zero merges.  Zero!
> After this patch, you will get many merges (the actual number depends
> on how fast your storage is, obviously), and much better throughput.
> Here are results from my test rig:
>
> Unpatched kernel:
> Read B/W:    283,638 KB/s
> Read Merges: 0
>
> Patched kernel:
> Read B/W:    873,224 KB/s
> Read Merges: 2,046K

The data is impressive, but it might be better to provide some latency
numbers as well.

>
> I considered several approaches to solving the problem:
> 1)  get rid of the innermost plugs
> 2)  handle nesting by using only one on-stack plug
> 2a) #2, except use a per-cpu blk_plug struct, which may clean up the
>     code a bit at the expense of memory footprint
>
> Option 1 would be tricky or impossible, since innermost plug lists
> are sometimes the only plug lists, depending on the call path.
> Option 2 is what this patch implements.  Option 2a is perhaps a better
> idea, but since I already implemented option 2, I figured I'd post it
> for comments and opinions before rewriting it.
>
> Much of the patch involves modifying call sites to blk_start_plug,
> since its signature is changed.  The meat of the patch is actually

I am wondering whether the signature of blk_start_plug() really has to
change, since with your patch the active plug is always the top-level
plug, and blk_finish_plug() can find the active plug from current->plug.
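
Something like the following (just an untested sketch of what I mean,
keeping the void signature):

	void blk_start_plug(struct blk_plug *plug)
	{
		if (current->plug) {
			current->plug->depth++;
			return;
		}

		plug->depth = 1;
		INIT_LIST_HEAD(&plug->list);
		INIT_LIST_HEAD(&plug->mq_list);
		INIT_LIST_HEAD(&plug->cb_list);
		current->plug = plug;
	}

	void blk_finish_plug(struct blk_plug *plug)
	{
		/* the passed-in plug may be a never-installed nested
		 * copy, so always operate on the active one */
		plug = current->plug;

		if (--plug->depth > 0)
			return;

		blk_flush_plug_list(plug, false);
		current->plug = NULL;
	}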

> pretty simple and constrained to block/blk-core.c and
> include/linux/blkdev.h.  The only tricky bits were places where plugs
> were finished and then restarted to flush out I/O.  There, I went
> ahead and exported blk_flush_plug_list and called that directly.
>
> Comments would be greatly appreciated.
>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
> ---
>  block/blk-core.c                    | 33 +++++++++++++++++++--------------
>  block/blk-lib.c                     |  6 +++---
>  block/blk-throttle.c                |  6 +++---
>  drivers/block/xen-blkback/blkback.c |  6 +++---
>  drivers/md/dm-bufio.c               | 15 +++++++--------
>  drivers/md/dm-kcopyd.c              |  6 +++---
>  drivers/md/dm-thin.c                |  6 +++---
>  drivers/md/md.c                     |  6 +++---
>  drivers/md/raid1.c                  |  6 +++---
>  drivers/md/raid10.c                 |  6 +++---
>  drivers/md/raid5.c                  | 12 ++++++------
>  drivers/target/target_core_iblock.c |  6 +++---
>  fs/aio.c                            |  6 +++---
>  fs/block_dev.c                      |  6 +++---
>  fs/btrfs/scrub.c                    |  6 +++---
>  fs/btrfs/transaction.c              |  6 +++---
>  fs/btrfs/tree-log.c                 | 16 ++++++++--------
>  fs/btrfs/volumes.c                  | 12 +++++-------
>  fs/buffer.c                         |  6 +++---
>  fs/direct-io.c                      |  8 +++++---
>  fs/ext4/file.c                      |  6 +++---
>  fs/ext4/inode.c                     | 12 +++++-------
>  fs/f2fs/checkpoint.c                |  6 +++---
>  fs/f2fs/gc.c                        |  6 +++---
>  fs/f2fs/node.c                      |  6 +++---
>  fs/jbd/checkpoint.c                 |  6 +++---
>  fs/jbd/commit.c                     | 10 +++++-----
>  fs/jbd2/checkpoint.c                |  6 +++---
>  fs/jbd2/commit.c                    |  6 +++---
>  fs/mpage.c                          |  6 +++---
>  fs/xfs/xfs_buf.c                    | 12 ++++++------
>  fs/xfs/xfs_dir2_readdir.c           |  6 +++---
>  fs/xfs/xfs_itable.c                 |  6 +++---
>  include/linux/blkdev.h              |  3 ++-
>  mm/madvise.c                        |  6 +++---
>  mm/page-writeback.c                 |  6 +++---
>  mm/readahead.c                      |  6 +++---
>  mm/swap_state.c                     |  6 +++---
>  mm/vmscan.c                         |  6 +++---
>  39 files changed, 155 insertions(+), 152 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 794c3e7..64f3f2a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3002,7 +3002,7 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
>
>  /**
>   * blk_start_plug - initialize blk_plug and track it inside the task_struct
> - * @plug:      The &struct blk_plug that needs to be initialized
> + * @plug:      The on-stack &struct blk_plug that needs to be initialized
>   *
>   * Description:
>   *   Tracking blk_plug inside the task_struct will help with auto-flushing the
> @@ -3013,26 +3013,29 @@ EXPORT_SYMBOL(kblockd_schedule_delayed_work_on);
>   *   page belonging to that request that is currently residing in our private
>   *   plug. By flushing the pending I/O when the process goes to sleep, we avoid
>   *   this kind of deadlock.
> + *
> + * Returns: a pointer to the active &struct blk_plug
>   */
> -void blk_start_plug(struct blk_plug *plug)
> +struct blk_plug *blk_start_plug(struct blk_plug *plug)

The change might not be necessary.

>  {
>         struct task_struct *tsk = current;
>
> +       if (tsk->plug) {
> +               tsk->plug->depth++;
> +               return tsk->plug;
> +       }
> +
> +       plug->depth = 1;
>         INIT_LIST_HEAD(&plug->list);
>         INIT_LIST_HEAD(&plug->mq_list);
>         INIT_LIST_HEAD(&plug->cb_list);
>
>         /*
> -        * If this is a nested plug, don't actually assign it. It will be
> -        * flushed on its own.
> +        * Store ordering should not be needed here, since a potential
> +        * preempt will imply a full memory barrier
>          */
> -       if (!tsk->plug) {
> -               /*
> -                * Store ordering should not be needed here, since a potential
> -                * preempt will imply a full memory barrier
> -                */
> -               tsk->plug = plug;
> -       }
> +       tsk->plug = plug;
> +       return tsk->plug;

tsk->plug is always returned from this function, which means tsk->plug
is always the active plug.

>  }
>  EXPORT_SYMBOL(blk_start_plug);
>
> @@ -3176,13 +3179,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>
>         local_irq_restore(flags);
>  }
> +EXPORT_SYMBOL_GPL(blk_flush_plug_list);
>
>  void blk_finish_plug(struct blk_plug *plug)
>  {
> -       blk_flush_plug_list(plug, false);
> +       if (--plug->depth > 0)
> +               return;

The active plug should be current->plug, assuming blk_finish_plug()
is always paired with blk_start_plug().

>
> -       if (plug == current->plug)
> -               current->plug = NULL;
> +       blk_flush_plug_list(plug, false);
> +       current->plug = NULL;
>  }
>  EXPORT_SYMBOL(blk_finish_plug);
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 7688ee3..e2d2448 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -48,7 +48,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>         struct bio_batch bb;
>         struct bio *bio;
>         int ret = 0;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         if (!q)
>                 return -ENXIO;
> @@ -81,7 +81,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>         bb.flags = 1 << BIO_UPTODATE;
>         bb.wait = &wait;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         while (nr_sects) {
>                 unsigned int req_sects;
>                 sector_t end_sect, tmp;
> @@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>                  */
>                 cond_resched();
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);

As in the comment above, blk_finish_plug() can figure out the
active plug on its own, so it looks unnecessary to introduce a
return value for blk_start_plug().
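
Something like this (untested sketch) should be enough, without
changing the prototype:

void blk_start_plug(struct blk_plug *plug)
{
	struct task_struct *tsk = current;

	if (tsk->plug) {
		tsk->plug->depth++;	/* nested: reuse the outer plug */
		return;
	}

	plug->depth = 1;
	INIT_LIST_HEAD(&plug->list);
	INIT_LIST_HEAD(&plug->mq_list);
	INIT_LIST_HEAD(&plug->cb_list);
	tsk->plug = plug;
}

void blk_finish_plug(struct blk_plug *plug)
{
	/* assumes start/finish are paired; ignore the argument */
	struct blk_plug *active = current->plug;

	if (--active->depth > 0)
		return;

	blk_flush_plug_list(active, false);
	current->plug = NULL;
}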

>
>         /* Wait for bios in-flight */
>         if (!atomic_dec_and_test(&bb.done))
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 5b9c6d5..f57bbd3 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1266,7 +1266,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>         struct request_queue *q = td->queue;
>         struct bio_list bio_list_on_stack;
>         struct bio *bio;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int rw;
>
>         bio_list_init(&bio_list_on_stack);
> @@ -1278,10 +1278,10 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>         spin_unlock_irq(q->queue_lock);
>
>         if (!bio_list_empty(&bio_list_on_stack)) {
> -               blk_start_plug(&plug);
> +               plug = blk_start_plug(&onstack_plug);
>                 while((bio = bio_list_pop(&bio_list_on_stack)))
>                         generic_make_request(bio);
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>         }
>  }
>
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 2a04d34..f075182 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1207,7 +1207,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>         struct bio **biolist = pending_req->biolist;
>         int i, nbio = 0;
>         int operation;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         bool drain = false;
>         struct grant_page **pages = pending_req->segments;
>         unsigned short req_operation;
> @@ -1368,13 +1368,13 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>         }
>
>         atomic_set(&pending_req->pendcnt, nbio);
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         for (i = 0; i < nbio; i++)
>                 submit_bio(operation, biolist[i]);
>
>         /* Let the I/Os go.. */
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         if (operation == READ)
>                 blkif->st_rd_sect += preq.nr_sects;
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 86dbbc7..a6cbc6f 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -706,8 +706,8 @@ static void __write_dirty_buffer(struct dm_buffer *b,
>
>  static void __flush_write_list(struct list_head *write_list)
>  {
> -       struct blk_plug plug;
> -       blk_start_plug(&plug);
> +       struct blk_plug *plug, onstack_plug;
> +       plug = blk_start_plug(&onstack_plug);
>         while (!list_empty(write_list)) {
>                 struct dm_buffer *b =
>                         list_entry(write_list->next, struct dm_buffer, write_list);
> @@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
>                 submit_io(b, WRITE, b->block, write_endio);
>                 dm_bufio_cond_resched();
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  /*
> @@ -1110,13 +1110,13 @@ EXPORT_SYMBOL_GPL(dm_bufio_new);
>  void dm_bufio_prefetch(struct dm_bufio_client *c,
>                        sector_t block, unsigned n_blocks)
>  {
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         LIST_HEAD(write_list);
>
>         BUG_ON(dm_bufio_in_request());
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         dm_bufio_lock(c);
>
>         for (; n_blocks--; block++) {
> @@ -1126,9 +1126,8 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>                                 &write_list);
>                 if (unlikely(!list_empty(&write_list))) {
>                         dm_bufio_unlock(c);
> -                       blk_finish_plug(&plug);
>                         __flush_write_list(&write_list);
> -                       blk_start_plug(&plug);
> +                       blk_flush_plug_list(plug, false);
>                         dm_bufio_lock(c);
>                 }
>                 if (unlikely(b != NULL)) {
> @@ -1149,7 +1148,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>         dm_bufio_unlock(c);
>
>  flush_plug:
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>  EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
>
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade..fed371f 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -580,7 +580,7 @@ static void do_work(struct work_struct *work)
>  {
>         struct dm_kcopyd_client *kc = container_of(work,
>                                         struct dm_kcopyd_client, kcopyd_work);
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         /*
>          * The order that these are called is *very* important.
> @@ -589,11 +589,11 @@ static void do_work(struct work_struct *work)
>          * list.  io jobs call wake when they complete and it all
>          * starts again.
>          */
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         process_jobs(&kc->complete_jobs, kc, run_complete_job);
>         process_jobs(&kc->pages_jobs, kc, run_pages_job);
>         process_jobs(&kc->io_jobs, kc, run_io_job);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  /*
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 921aafd..6a7459a 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1775,7 +1775,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>         unsigned long flags;
>         struct bio *bio;
>         struct bio_list bios;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         unsigned count = 0;
>
>         if (tc->requeue_mode) {
> @@ -1799,7 +1799,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>
>         spin_unlock_irqrestore(&tc->lock, flags);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         while ((bio = bio_list_pop(&bios))) {
>                 /*
>                  * If we've got no free new_mapping structs, and processing
> @@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>                         dm_pool_issue_prefetches(pool->pmd);
>                 }
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  static int cmp_cells(const void *lhs, const void *rhs)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 717daad..9f24719 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7402,7 +7402,7 @@ void md_do_sync(struct md_thread *thread)
>         int skipped = 0;
>         struct md_rdev *rdev;
>         char *desc, *action = NULL;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         /* just incase thread restarts... */
>         if (test_bit(MD_RECOVERY_DONE, &mddev->recovery))
> @@ -7572,7 +7572,7 @@ void md_do_sync(struct md_thread *thread)
>         md_new_event(mddev);
>         update_time = jiffies;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         while (j < max_sectors) {
>                 sector_t sectors;
>
> @@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
>         /*
>          * this also signals 'finished resyncing' to md_stop
>          */
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>
>         /* tell personality that we are finished */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d34e238..2608bc3 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2399,11 +2399,11 @@ static void raid1d(struct md_thread *thread)
>         unsigned long flags;
>         struct r1conf *conf = mddev->private;
>         struct list_head *head = &conf->retry_list;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         md_check_recovery(mddev);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (;;) {
>
>                 flush_pending_writes(conf);
> @@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
>                 if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>                         md_check_recovery(mddev);
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  static int init_resync(struct r1conf *conf)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a7196c4..7bea5c7 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2791,11 +2791,11 @@ static void raid10d(struct md_thread *thread)
>         unsigned long flags;
>         struct r10conf *conf = mddev->private;
>         struct list_head *head = &conf->retry_list;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         md_check_recovery(mddev);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (;;) {
>
>                 flush_pending_writes(conf);
> @@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
>                 if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>                         md_check_recovery(mddev);
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  static int init_resync(struct r10conf *conf)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cd2f96b..59e7090 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5259,11 +5259,11 @@ static void raid5_do_work(struct work_struct *work)
>         struct r5conf *conf = group->conf;
>         int group_id = group - conf->worker_groups;
>         int handled;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         pr_debug("+++ raid5worker active\n");
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         handled = 0;
>         spin_lock_irq(&conf->device_lock);
>         while (1) {
> @@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
>         pr_debug("%d stripes handled\n", handled);
>
>         spin_unlock_irq(&conf->device_lock);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         pr_debug("--- raid5worker inactive\n");
>  }
> @@ -5298,13 +5298,13 @@ static void raid5d(struct md_thread *thread)
>         struct mddev *mddev = thread->mddev;
>         struct r5conf *conf = mddev->private;
>         int handled;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         pr_debug("+++ raid5d active\n");
>
>         md_check_recovery(mddev);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         handled = 0;
>         spin_lock_irq(&conf->device_lock);
>         while (1) {
> @@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
>         spin_unlock_irq(&conf->device_lock);
>
>         async_tx_issue_pending_all();
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         pr_debug("--- raid5d inactive\n");
>  }
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d4a4b0f..248a6bb 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -361,13 +361,13 @@ iblock_get_bio(struct se_cmd *cmd, sector_t lba, u32 sg_num)
>
>  static void iblock_submit_bios(struct bio_list *list, int rw)
>  {
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         struct bio *bio;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         while ((bio = bio_list_pop(list)))
>                 submit_bio(rw, bio);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  static void iblock_end_io_flush(struct bio *bio, int err)
> diff --git a/fs/aio.c b/fs/aio.c
> index f8e52a1..b1c3583 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1575,7 +1575,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>         struct kioctx *ctx;
>         long ret = 0;
>         int i = 0;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         if (unlikely(nr < 0))
>                 return -EINVAL;
> @@ -1592,7 +1592,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>                 return -EINVAL;
>         }
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         /*
>          * AKPM: should this return a partial result if some of the IOs were
> @@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>                 if (ret)
>                         break;
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         percpu_ref_put(&ctx->users);
>         return i ? i : ret;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 975266b..928d3e0 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1598,10 +1598,10 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
>  ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  {
>         struct file *file = iocb->ki_filp;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         ssize_t ret;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         ret = __generic_file_write_iter(iocb, from);
>         if (ret > 0) {
>                 ssize_t err;
> @@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                 if (err < 0)
>                         ret = err;
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_write_iter);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ec57687..768bac3 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2967,7 +2967,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>         struct btrfs_root *root = fs_info->extent_root;
>         struct btrfs_root *csum_root = fs_info->csum_root;
>         struct btrfs_extent_item *extent;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         u64 flags;
>         int ret;
>         int slot;
> @@ -3088,7 +3088,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>          * collect all data csums for the stripe to avoid seeking during
>          * the scrub. This might currently (crc32) end up to be about 1MB
>          */
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         /*
>          * now find all extents for each stripe and scrub them
> @@ -3316,7 +3316,7 @@ out:
>         scrub_wr_submit(sctx);
>         mutex_unlock(&sctx->wr_ctx.wr_lock);
>
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         btrfs_free_path(path);
>         btrfs_free_path(ppath);
>         return ret < 0 ? ret : 0;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8be4278..b5a8078 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -979,11 +979,11 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  {
>         int ret;
>         int ret2;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         ret = btrfs_write_marked_extents(root, dirty_pages, mark);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>
>         if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c5b8ba3..1623924 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2519,7 +2519,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         struct btrfs_root *log_root_tree = root->fs_info->log_root_tree;
>         int log_transid = 0;
>         struct btrfs_log_ctx root_log_ctx;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         mutex_lock(&root->log_mutex);
>         log_transid = ctx->log_transid;
> @@ -2571,10 +2571,10 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         /* we start IO on  all the marked extents here, but we don't actually
>          * wait for them until later.
>          */
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
>         if (ret) {
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>                 btrfs_abort_transaction(trans, root, ret);
>                 btrfs_free_logged_extents(log, log_transid);
>                 btrfs_set_log_full_commit(root->fs_info, trans);
> @@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>                 if (!list_empty(&root_log_ctx.list))
>                         list_del_init(&root_log_ctx.list);
>
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>                 btrfs_set_log_full_commit(root->fs_info, trans);
>
>                 if (ret != -ENOSPC) {
> @@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         }
>
>         if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>                 mutex_unlock(&log_root_tree->log_mutex);
>                 ret = root_log_ctx.log_ret;
>                 goto out;
> @@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>
>         index2 = root_log_ctx.log_transid % 2;
>         if (atomic_read(&log_root_tree->log_commit[index2])) {
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>                 ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>                                                 mark);
>                 btrfs_wait_logged_extents(trans, log, log_transid);
> @@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>          * check the full commit flag again
>          */
>         if (btrfs_need_log_full_commit(root->fs_info, trans)) {
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>                 btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>                 btrfs_free_logged_extents(log, log_transid);
>                 mutex_unlock(&log_root_tree->log_mutex);
> @@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>         ret = btrfs_write_marked_extents(log_root_tree,
>                                          &log_root_tree->dirty_log_pages,
>                                          EXTENT_DIRTY | EXTENT_NEW);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         if (ret) {
>                 btrfs_set_log_full_commit(root->fs_info, trans);
>                 btrfs_abort_transaction(trans, root, ret);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..0e215ff 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -261,7 +261,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
>         unsigned long last_waited = 0;
>         int force_reg = 0;
>         int sync_pending = 0;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         /*
>          * this function runs all the bios we've collected for
> @@ -269,7 +269,7 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
>          * another device without first sending all of these down.
>          * So, setup a plug here and finish it off before we return
>          */
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         bdi = blk_get_backing_dev_info(device->bdev);
>         fs_info = device->dev_root->fs_info;
> @@ -358,8 +358,7 @@ loop_lock:
>                 if (pending_bios == &device->pending_sync_bios) {
>                         sync_pending = 1;
>                 } else if (sync_pending) {
> -                       blk_finish_plug(&plug);
> -                       blk_start_plug(&plug);
> +                       blk_flush_plug_list(plug, false);
>                         sync_pending = 0;
>                 }
>
> @@ -415,8 +414,7 @@ loop_lock:
>                 }
>                 /* unplug every 64 requests just for good measure */
>                 if (batch_run % 64 == 0) {
> -                       blk_finish_plug(&plug);
> -                       blk_start_plug(&plug);
> +                       blk_flush_plug_list(plug, false);
>                         sync_pending = 0;
>                 }
>         }
> @@ -431,7 +429,7 @@ loop_lock:
>         spin_unlock(&device->io_lock);
>
>  done:
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  static void pending_bios_fn(struct btrfs_work *work)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..727d642 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -717,10 +717,10 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>         struct list_head tmp;
>         struct address_space *mapping;
>         int err = 0, err2;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         INIT_LIST_HEAD(&tmp);
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         spin_lock(lock);
>         while (!list_empty(list)) {
> @@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>         }
>
>         spin_unlock(lock);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         spin_lock(lock);
>
>         while (!list_empty(&tmp)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..e79c0c6 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -488,6 +488,8 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
>  static void dio_await_completion(struct dio *dio)
>  {
>         struct bio *bio;
> +
> +       blk_flush_plug(current);
>         do {
>                 bio = dio_await_one(dio);
>                 if (bio)
> @@ -1108,7 +1110,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>         struct dio *dio;
>         struct dio_submit sdio = { 0, };
>         struct buffer_head map_bh = { 0, };
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         unsigned long align = offset | iov_iter_alignment(iter);
>
>         if (rw & WRITE)
> @@ -1231,7 +1233,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>
>         sdio.pages_in_io += iov_iter_npages(iter, INT_MAX);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         retval = do_direct_IO(dio, &sdio, &map_bh);
>         if (retval)
> @@ -1262,7 +1264,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>         if (sdio.bio)
>                 dio_bio_submit(dio, &sdio);
>
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         /*
>          * It is possible that, we return short IO due to end of file.
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..fd0cf21 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -94,7 +94,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>         struct file *file = iocb->ki_filp;
>         struct inode *inode = file_inode(iocb->ki_filp);
>         struct mutex *aio_mutex = NULL;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int o_direct = io_is_direct(file);
>         int overwrite = 0;
>         size_t length = iov_iter_count(from);
> @@ -139,7 +139,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
>         iocb->private = &overwrite;
>         if (o_direct) {
> -               blk_start_plug(&plug);
> +               plug = blk_start_plug(&onstack_plug);
>
>
>                 /* check whether we do a DIO overwrite or not */
> @@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                         ret = err;
>         }
>         if (o_direct)
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>
>  errout:
>         if (aio_mutex)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..d4b645b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2284,7 +2284,7 @@ static int ext4_writepages(struct address_space *mapping,
>         int needed_blocks, rsv_blocks = 0, ret = 0;
>         struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
>         bool done;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         bool give_up_on_write = false;
>
>         trace_ext4_writepages(inode, wbc);
> @@ -2298,11 +2298,9 @@ static int ext4_writepages(struct address_space *mapping,
>                 goto out_writepages;
>
>         if (ext4_should_journal_data(inode)) {
> -               struct blk_plug plug;
> -
> -               blk_start_plug(&plug);
> +               plug = blk_start_plug(&onstack_plug);
>                 ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -               blk_finish_plug(&plug);
> +               blk_finish_plug(plug);
>                 goto out_writepages;
>         }
>
> @@ -2368,7 +2366,7 @@ retry:
>         if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
>                 tag_pages_for_writeback(mapping, mpd.first_page, mpd.last_page);
>         done = false;
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         while (!done && mpd.first_page <= mpd.last_page) {
>                 /* For each extent of pages we use new io_end */
>                 mpd.io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
> @@ -2438,7 +2436,7 @@ retry:
>                 if (ret)
>                         break;
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         if (!ret && !cycled && wbc->nr_to_write > 0) {
>                 cycled = 1;
>                 mpd.last_page = writeback_index - 1;
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 7f794b7..de2b522 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -810,10 +810,10 @@ static int block_operations(struct f2fs_sb_info *sbi)
>                 .nr_to_write = LONG_MAX,
>                 .for_reclaim = 0,
>         };
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int err = 0;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>  retry_flush_dents:
>         f2fs_lock_all(sbi);
> @@ -846,7 +846,7 @@ retry_flush_nodes:
>                 goto retry_flush_nodes;
>         }
>  out:
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         return err;
>  }
>
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 76adbc3..082e961 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -661,12 +661,12 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
>  {
>         struct page *sum_page;
>         struct f2fs_summary_block *sum;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         /* read segment summary of victim */
>         sum_page = get_sum_page(sbi, segno);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         sum = page_address(sum_page);
>
> @@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
>                 gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
>                 break;
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
>         stat_inc_call_count(sbi->stat_info);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 97bd9d3..e1fc81e 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1066,7 +1066,7 @@ repeat:
>  struct page *get_node_page_ra(struct page *parent, int start)
>  {
>         struct f2fs_sb_info *sbi = F2FS_P_SB(parent);
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         struct page *page;
>         int err, i, end;
>         nid_t nid;
> @@ -1086,7 +1086,7 @@ repeat:
>         else if (err == LOCKED_PAGE)
>                 goto page_hit;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         /* Then, try readahead for siblings of the desired node */
>         end = start + MAX_RA_NODE;
> @@ -1098,7 +1098,7 @@ repeat:
>                 ra_node_page(sbi, nid);
>         }
>
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         lock_page(page);
>         if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index 08c0304..29b550d 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -258,12 +258,12 @@ static void
>  __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
>  {
>         int i;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (i = 0; i < *batch_count; i++)
>                 write_dirty_buffer(bhs[i], WRITE_SYNC);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         for (i = 0; i < *batch_count; i++) {
>                 struct buffer_head *bh = bhs[i];
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index bb217dc..ccb32f0 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -311,7 +311,7 @@ void journal_commit_transaction(journal_t *journal)
>         int first_tag = 0;
>         int tag_flag;
>         int i;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int write_op = WRITE;
>
>         /*
> @@ -444,10 +444,10 @@ void journal_commit_transaction(journal_t *journal)
>          * Now start flushing things to disk, in the order they appear
>          * on the transaction lists.  Data blocks go first.
>          */
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         err = journal_submit_data_buffers(journal, commit_transaction,
>                                           write_op);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         /*
>          * Wait for all previously submitted IO to complete.
> @@ -503,7 +503,7 @@ void journal_commit_transaction(journal_t *journal)
>                 err = 0;
>         }
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         journal_write_revoke_records(journal, commit_transaction, write_op);
>
> @@ -697,7 +697,7 @@ start_journal_io:
>                 }
>         }
>
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         /* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 988b32e..31b5e73 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -182,12 +182,12 @@ static void
>  __flush_batch(journal_t *journal, int *batch_count)
>  {
>         int i;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (i = 0; i < *batch_count; i++)
>                 write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         for (i = 0; i < *batch_count; i++) {
>                 struct buffer_head *bh = journal->j_chkpt_bhs[i];
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..713d26a 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -390,7 +390,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>         int tag_bytes = journal_tag_bytes(journal);
>         struct buffer_head *cbh = NULL; /* For transactional checksums */
>         __u32 crc32_sum = ~0;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         /* Tail of the journal */
>         unsigned long first_block;
>         tid_t first_tid;
> @@ -555,7 +555,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>         if (err)
>                 jbd2_journal_abort(journal, err);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         jbd2_journal_write_revoke_records(journal, commit_transaction,
>                                           &log_bufs, WRITE_SYNC);
>
> @@ -805,7 +805,7 @@ start_journal_io:
>                         __jbd2_journal_abort_hard(journal);
>         }
>
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         /* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..926dc42 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -676,10 +676,10 @@ int
>  mpage_writepages(struct address_space *mapping,
>                 struct writeback_control *wbc, get_block_t get_block)
>  {
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int ret;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         if (!get_block)
>                 ret = generic_writepages(mapping, wbc);
> @@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
>                 if (mpd.bio)
>                         mpage_bio_submit(WRITE, mpd.bio);
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepages);
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1790b00..776ac5a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1208,7 +1208,7 @@ STATIC void
>  _xfs_buf_ioapply(
>         struct xfs_buf  *bp)
>  {
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int             rw;
>         int             offset;
>         int             size;
> @@ -1281,7 +1281,7 @@ _xfs_buf_ioapply(
>          */
>         offset = bp->b_offset;
>         size = BBTOB(bp->b_io_length);
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (i = 0; i < bp->b_map_count; i++) {
>                 xfs_buf_ioapply_map(bp, i, &offset, &size, rw);
>                 if (bp->b_error)
> @@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
>                 if (size <= 0)
>                         break;  /* all done */
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  /*
> @@ -1772,7 +1772,7 @@ __xfs_buf_delwri_submit(
>         struct list_head        *io_list,
>         bool                    wait)
>  {
> -       struct blk_plug         plug;
> +       struct blk_plug         *plug, onstack_plug;
>         struct xfs_buf          *bp, *n;
>         int                     pinned = 0;
>
> @@ -1806,7 +1806,7 @@ __xfs_buf_delwri_submit(
>
>         list_sort(NULL, io_list, xfs_buf_cmp);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         list_for_each_entry_safe(bp, n, io_list, b_list) {
>                 bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC | XBF_WRITE_FAIL);
>                 bp->b_flags |= XBF_WRITE | XBF_ASYNC;
> @@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
>
>                 xfs_buf_submit(bp);
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         return pinned;
>  }
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 098cd78..6074671 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -275,7 +275,7 @@ xfs_dir2_leaf_readbuf(
>         struct xfs_inode        *dp = args->dp;
>         struct xfs_buf          *bp = *bpp;
>         struct xfs_bmbt_irec    *map = mip->map;
> -       struct blk_plug         plug;
> +       struct blk_plug         *plug, onstack_plug;
>         int                     error = 0;
>         int                     length;
>         int                     i;
> @@ -404,7 +404,7 @@ xfs_dir2_leaf_readbuf(
>         /*
>          * Do we need more readahead?
>          */
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (mip->ra_index = mip->ra_offset = i = 0;
>              mip->ra_want > mip->ra_current && i < mip->map_blocks;
>              i += geo->fsbcount) {
> @@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
>                         }
>                 }
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>  out:
>         *bpp = bp;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 82e3142..b890aa4 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -179,7 +179,7 @@ xfs_bulkstat_ichunk_ra(
>         struct xfs_inobt_rec_incore     *irec)
>  {
>         xfs_agblock_t                   agbno;
> -       struct blk_plug                 plug;
> +       struct blk_plug                 *plug, onstack_plug;
>         int                             blks_per_cluster;
>         int                             inodes_per_cluster;
>         int                             i;      /* inode chunk index */
> @@ -188,7 +188,7 @@ xfs_bulkstat_ichunk_ra(
>         blks_per_cluster = xfs_icluster_size_fsb(mp);
>         inodes_per_cluster = blks_per_cluster << mp->m_sb.sb_inopblog;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (i = 0; i < XFS_INODES_PER_CHUNK;
>              i += inodes_per_cluster, agbno += blks_per_cluster) {
>                 if (xfs_inobt_maskn(i, inodes_per_cluster) & ~irec->ir_free) {
> @@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
>                                              &xfs_inode_buf_ops);
>                 }
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>  }
>
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..4617fce 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
>   * schedule() where blk_schedule_flush_plug() is called.
>   */
>  struct blk_plug {
> +       int depth; /* number of nested plugs */
>         struct list_head list; /* requests */
>         struct list_head mq_list; /* blk-mq requests */
>         struct list_head cb_list; /* md requires an unplug callback */
> @@ -1106,7 +1107,7 @@ struct blk_plug_cb {
>  };
>  extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
>                                              void *data, int size);
> -extern void blk_start_plug(struct blk_plug *);
> +extern struct blk_plug *blk_start_plug(struct blk_plug *);
>  extern void blk_finish_plug(struct blk_plug *);
>  extern void blk_flush_plug_list(struct blk_plug *, bool);
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d551475..1f7d5ad 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -463,7 +463,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>         int error = -EINVAL;
>         int write;
>         size_t len;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>  #ifdef CONFIG_MEMORY_FAILURE
>         if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> @@ -503,7 +503,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>         if (vma && start > vma->vm_start)
>                 prev = vma;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (;;) {
>                 /* Still start < end. */
>                 error = -ENOMEM;
> @@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>                         vma = find_vma(current->mm, start);
>         }
>  out:
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         if (write)
>                 up_write(&current->mm->mmap_sem);
>         else
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 644bcb6..9369d5e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2011,16 +2011,16 @@ static int __writepage(struct page *page, struct writeback_control *wbc,
>  int generic_writepages(struct address_space *mapping,
>                        struct writeback_control *wbc)
>  {
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         int ret;
>
>         /* deal with chardevs and other special file */
>         if (!mapping->a_ops->writepage)
>                 return 0;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         return ret;
>  }
>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9356758..c3350b5 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -111,11 +111,11 @@ EXPORT_SYMBOL(read_cache_pages);
>  static int read_pages(struct address_space *mapping, struct file *filp,
>                 struct list_head *pages, unsigned nr_pages)
>  {
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         unsigned page_idx;
>         int ret;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>
>         if (mapping->a_ops->readpages) {
>                 ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> @@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>         ret = 0;
>
>  out:
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         return ret;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 405923f..3f6b8ec 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -455,7 +455,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         unsigned long offset = entry_offset;
>         unsigned long start_offset, end_offset;
>         unsigned long mask;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>
>         mask = swapin_nr_pages(offset) - 1;
>         if (!mask)
> @@ -467,7 +467,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>         if (!start_offset)      /* First page is swap header. */
>                 start_offset++;
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         for (offset = start_offset; offset <= end_offset ; offset++) {
>                 /* Ok, do the async read-ahead now */
>                 page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> @@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>                         SetPageReadahead(page);
>                 page_cache_release(page);
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>
>         lru_add_drain();        /* Push any new pages onto the LRU now */
>  skip:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd..fd29974 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2130,7 +2130,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>         enum lru_list lru;
>         unsigned long nr_reclaimed = 0;
>         unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> -       struct blk_plug plug;
> +       struct blk_plug *plug, onstack_plug;
>         bool scan_adjusted;
>
>         get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
> @@ -2152,7 +2152,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>         scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
>                          sc->priority == DEF_PRIORITY);
>
> -       blk_start_plug(&plug);
> +       plug = blk_start_plug(&onstack_plug);
>         while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>                                         nr[LRU_INACTIVE_FILE]) {
>                 unsigned long nr_anon, nr_file, percentage;
> @@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>
>                 scan_adjusted = true;
>         }
> -       blk_finish_plug(&plug);
> +       blk_finish_plug(plug);
>         sc->nr_reclaimed += nr_reclaimed;
>
>         /*
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
  2015-04-07  9:19   ` Ming Lei
@ 2015-04-07 14:47     ` Jeff Moyer
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-07 14:47 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, Linux Kernel Mailing List

Ming Lei <tom.leiming@gmail.com> writes:

> Hi Jeff,

Hi, Ming.  Thanks for reviewing!

>> Unpatched kernel:
>> Read B/W:    283,638 KB/s
>> Read Merges: 0
>>
>> Patched kernel:
>> Read B/W:    873,224 KB/s
>> Read Merges: 2,046K
>
> The data is amazing, but it might be better to provide some latency
> data.

Yes, good point.  I'll include the full fio output in the next posting.

>> Much of the patch involves modifying call sites to blk_start_plug,
>> since its signature is changed.  The meat of the patch is actually
>
> I am wondering if the return type of blk_start_plug() has to be changed
> since the active plug is always the top plug with your patch, and
> blk_finish_plug() can find the active plug from current->plug.

That is a much simpler idea, thank you.  I think blk_finish_plug
shouldn't take the plug argument, though, since it won't actually use
it.
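
Something along these lines (untested):

void blk_finish_plug(void)
{
	struct blk_plug *plug = current->plug;

	if (--plug->depth > 0)
		return;

	blk_flush_plug_list(plug, false);
	current->plug = NULL;
}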

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
  2015-04-08 17:38   ` [PATCH 2/2] " Jens Axboe
@ 2015-04-07 16:09     ` Shaohua Li
  2015-04-08 17:56     ` Jeff Moyer
  2015-04-16 15:47     ` Jeff Moyer
  2 siblings, 0 replies; 36+ messages in thread
From: Shaohua Li @ 2015-04-07 16:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel

On Wed, Apr 08, 2015 at 11:38:14AM -0600, Jens Axboe wrote:
> On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> >The way the on-stack plugging currently works, each nesting level
> >flushes its own list of I/Os.  This can be less than optimal (read
> >awful) for certain workloads.  For example, consider an application
> >that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> >I/Os together in a single io_submit call, only to have each of them
> >dispatched individually down in the bowels of the direct I/O code.
> >The reason is that there are blk_plug's instantiated both at the upper
> >call site in do_io_submit and down in do_direct_IO.  The latter will
> >submit as little as 1 I/O at a time (if you have a small enough I/O
> >size) instead of performing the batching that the plugging
> >infrastructure is supposed to provide.
> >
> >Now, for the case where there is an elevator involved, this doesn't
> >really matter too much.  The elevator will keep the I/O around long
> >enough for it to be merged.  However, in cases where there is no
> >elevator (like blk-mq), I/Os are simply dispatched immediately.
> >
> >Try this, for example (note I'm using a virtio-blk device, so it's
> >using the blk-mq single queue path, though I've also reproduced this
> >with the micron p320h):
> >
> >fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
> >
> >If you run that on a current kernel, you will get zero merges.  Zero!
> >After this patch, you will get many merges (the actual number depends
> >on how fast your storage is, obviously), and much better throughput.
> >Here are results from my test rig:
> >
> >Unpatched kernel:
> >Read B/W:    283,638 KB/s
> >Read Merges: 0
> >
> >Patched kernel:
> >Read B/W:    873,224 KB/s
> >Read Merges: 2,046K
> >
> >I considered several approaches to solving the problem:
> >1)  get rid of the inner-most plugs
> >2)  handle nesting by using only one on-stack plug
> >2a) #2, except use a per-cpu blk_plug struct, which may clean up the
> >     code a bit at the expense of memory footprint
> >
> >Option 1 will be tricky or impossible to do, since inner most plug
> >lists are sometimes the only plug lists, depending on the call path.
> >Option 2 is what this patch implements.  Option 2a is perhaps a better
> >idea, but since I already implemented option 2, I figured I'd post it
> >for comments and opinions before rewriting it.
> >
> >Much of the patch involves modifying call sites to blk_start_plug,
> >since its signature is changed.  The meat of the patch is actually
> >pretty simple and constrained to block/blk-core.c and
> >include/linux/blkdev.h.  The only tricky bits were places where plugs
> >were finished and then restarted to flush out I/O.  There, I went
> >ahead and exported blk_flush_plug_list and called that directly.
> >
> >Comments would be greatly appreciated.
> 
> It's hard to argue with the increased merging for your case. The task plugs
> did originally work like you changed them to, not flushing until the
> outermost plug was flushed. Unfortunately I don't quite remember why I
> changed them, will have to do a bit of digging to refresh my memory.

The behavior never changed. The current code only flushes the outermost
plug. blk_start_plug() doesn't assign the plug to the current task for an
inner plug, so requests are all added to the outermost plug.

maybe the code can be cleaned up as:

start_plug(plug)
{
	if (current->plug)
		return;
	current->plug = plug;
}

end_plug(plug)
{
	if (plug != current->plug)
		return;
	flush_plug(plug);
	current->plug = NULL;
}
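
Note this version doesn't even need a depth counter: a nested
start_plug() never installs its plug, so the matching end_plug() sees
plug != current->plug and returns early; only the outermost pair
flushes.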

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
  2015-04-06 19:14 ` [PATCH 2/2] blk-plug: don't flush nested plug lists Jeff Moyer
                       ` (2 preceding siblings ...)
  2015-04-08 17:38   ` [PATCH 2/2] " Jens Axboe
@ 2015-04-07 18:55     ` Jeff Moyer
  2 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-07 18:55 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: Konrad Rzeszutek Wilk, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, Neil Brown, Nicholas A. Bellinger,
	Alexander Viro, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Jaegeuk Kim, Changman Lee,
	Steven Whitehouse, Mikulas Patocka, Andrew Morton, Rik van Riel,
	Johannes Weiner, Mel Gorman, Trond Myklebust

The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os.  This can be less than optimal (read
awful) for certain workloads.  For example, consider an application
that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug structs instantiated both at the upper
call site in do_io_submit and down in do_direct_IO.  The latter will
submit as little as 1 I/O at a time (if you have a small enough I/O
size) instead of performing the batching that the plugging
infrastructure is supposed to provide.
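
Pared down to just the plug calls, the nesting looks roughly like this:

	do_io_submit()
	    blk_start_plug(&plug)		/* outer plug, whole batch */
	    for each iocb:
		io_submit_one()
		    ...
		    do_blockdev_direct_IO()
			blk_start_plug(&plug)	/* inner, nested plug */
			do_direct_IO()
			blk_finish_plug(&plug)
	    blk_finish_plug(&plug)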

Now, for the case where there is an elevator involved, this doesn't
really matter too much.  The elevator will keep the I/O around long
enough for it to be merged.  However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example:

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges.  Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test systems:

First, I tested in a VM using a virtio-blk device:

Unpatched kernel:
 Throughput: 280,262 KB/s
avg latency: 14,587.72 usec

Patched kernel:
 Throughput: 832,158 KB/s
avg latency:   4,901.95 usec

Next, I tested using a micron p320h on bare metal:

Unpatched kernel:
 Throughput: 688,967 KB/s
avg latency: 5,933.92 usec

Patched kernel:
 Throughput: 1,160.6 MB/s
avg latency: 3,437.01 usec

As you can see, both throughput and latency improved dramatically.
I've included the full fio output below, so you can also see the
marked improvement in standard deviation as well.

I considered several approaches to solving the problem:
1)  get rid of the inner-most plugs
2)  handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
    code a bit at the expense of memory footprint

Option 1 will be tricky or impossible to do, since inner most plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements.  Option 2a may add unneeded
complexity.

Much of the patch involves modifying call sites of blk_finish_plug,
since its signature has changed.  The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h.  The only tricky bits were places where plugs
were finished and then restarted to flush out I/O.  There, I left things
as-is.  So long as they are the outer-most plugs, they should continue
to function as before.

NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
would always flush the plug list.  After this patch, this is only the
case for the outer-most plug.  If you require the plug list to be
flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
maintainers should take a close look at this patch and ensure they get
the right behavior in the end.
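
For example, a driver that today forces its pending I/O out with a
finish/start pair:

	blk_finish_plug(&plug);
	blk_start_plug(&plug);

would now need:

	blk_flush_plug(current);

since the finish/start pair is a no-op for a nested plug.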

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

---
Changelog:
v1->v2: Keep the blk_start_plug interface the same, suggested by Ming Lei.

Test results
------------
Virtio-blk:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
  read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
    slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
    clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
     lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
    clat percentiles (usec):
     |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
     | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
     | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
     | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
     | 99.99th=[36096]
    bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
    lat (usec) : 250=0.01%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
  cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%


patched:
job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
  read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
    slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
    clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
     lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
    clat percentiles (usec):
     |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
     | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
     | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
     | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
     | 99.99th=[19840]
    bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
  cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%

micron p320h:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
  read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
    slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
    clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
     lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
    clat percentiles (usec):
     |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
     | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
     | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
     | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
     | 99.99th=[12224]
    bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
    lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
  cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%

patched:

job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
  read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
    slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
    clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
     lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
    clat percentiles (usec):
     |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
     | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
     | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
     | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
     | 99.99th=[ 9792]
    bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
  cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
---
 block/blk-core.c                    | 29 ++++++++++++++++-------------
 block/blk-lib.c                     |  2 +-
 block/blk-throttle.c                |  2 +-
 drivers/block/xen-blkback/blkback.c |  2 +-
 drivers/md/dm-bufio.c               |  6 +++---
 drivers/md/dm-crypt.c               |  2 +-
 drivers/md/dm-kcopyd.c              |  2 +-
 drivers/md/dm-thin.c                |  2 +-
 drivers/md/md.c                     |  2 +-
 drivers/md/raid1.c                  |  2 +-
 drivers/md/raid10.c                 |  2 +-
 drivers/md/raid5.c                  |  4 ++--
 drivers/target/target_core_iblock.c |  2 +-
 fs/aio.c                            |  2 +-
 fs/block_dev.c                      |  2 +-
 fs/btrfs/scrub.c                    |  2 +-
 fs/btrfs/transaction.c              |  2 +-
 fs/btrfs/tree-log.c                 | 12 ++++++------
 fs/btrfs/volumes.c                  |  6 +++---
 fs/buffer.c                         |  2 +-
 fs/direct-io.c                      |  2 +-
 fs/ext4/file.c                      |  2 +-
 fs/ext4/inode.c                     |  4 ++--
 fs/f2fs/checkpoint.c                |  2 +-
 fs/f2fs/gc.c                        |  2 +-
 fs/f2fs/node.c                      |  2 +-
 fs/gfs2/log.c                       |  2 +-
 fs/hpfs/buffer.c                    |  2 +-
 fs/jbd/checkpoint.c                 |  2 +-
 fs/jbd/commit.c                     |  4 ++--
 fs/jbd2/checkpoint.c                |  2 +-
 fs/jbd2/commit.c                    |  2 +-
 fs/mpage.c                          |  2 +-
 fs/nfs/blocklayout/blocklayout.c    |  4 ++--
 fs/xfs/xfs_buf.c                    |  4 ++--
 fs/xfs/xfs_dir2_readdir.c           |  2 +-
 fs/xfs/xfs_itable.c                 |  2 +-
 include/linux/blkdev.h              |  5 +++--
 mm/madvise.c                        |  2 +-
 mm/page-writeback.c                 |  2 +-
 mm/readahead.c                      |  2 +-
 mm/swap_state.c                     |  2 +-
 mm/vmscan.c                         |  2 +-
 43 files changed, 74 insertions(+), 70 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7..fcd9c2f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3018,21 +3018,21 @@ void blk_start_plug(struct blk_plug *plug)
 {
 	struct task_struct *tsk = current;
 
+	if (tsk->plug) {
+		tsk->plug->depth++;
+		return;
+	}
+
+	plug->depth = 1;
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
 
 	/*
-	 * If this is a nested plug, don't actually assign it. It will be
-	 * flushed on its own.
+	 * Store ordering should not be needed here, since a potential
+	 * preempt will imply a full memory barrier
 	 */
-	if (!tsk->plug) {
-		/*
-		 * Store ordering should not be needed here, since a potential
-		 * preempt will imply a full memory barrier
-		 */
-		tsk->plug = plug;
-	}
+	tsk->plug = plug;
 }
 EXPORT_SYMBOL(blk_start_plug);
 
@@ -3177,12 +3177,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	local_irq_restore(flags);
 }
 
-void blk_finish_plug(struct blk_plug *plug)
+void blk_finish_plug(void)
 {
-	blk_flush_plug_list(plug, false);
+	struct blk_plug *plug = current->plug;
 
-	if (plug == current->plug)
-		current->plug = NULL;
+	if (--plug->depth > 0)
+		return;
+
+	blk_flush_plug_list(plug, false);
+	current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..ac347d3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		 */
 		cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5b9c6d5..222a77a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1281,7 +1281,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 	}
 }
 
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 2a04d34..74bea21 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1374,7 +1374,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		submit_bio(operation, biolist[i]);
 
 	/* Let the I/Os go.. */
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	if (operation == READ)
 		blkif->st_rd_sect += preq.nr_sects;
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 86dbbc7..502c63b 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
 		submit_io(b, WRITE, b->block, write_endio);
 		dm_bufio_cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1126,7 +1126,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 				&write_list);
 		if (unlikely(!list_empty(&write_list))) {
 			dm_bufio_unlock(c);
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			__flush_write_list(&write_list);
 			blk_start_plug(&plug);
 			dm_bufio_lock(c);
@@ -1149,7 +1149,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 	dm_bufio_unlock(c);
 
 flush_plug:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 713a962..65d7b72 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1224,7 +1224,7 @@ pop_from_list:
 			rb_erase(&io->rb_node, &write_tree);
 			kcryptd_io_write(io);
 		} while (!RB_EMPTY_ROOT(&write_tree));
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 	}
 	return 0;
 }
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade..4a76e42 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -593,7 +593,7 @@ static void do_work(struct work_struct *work)
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 921aafd..be42bf5 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 			dm_pool_issue_prefetches(pool->pmd);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int cmp_cells(const void *lhs, const void *rhs)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 717daad..c4ec179 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
 	/*
 	 * this also signals 'finished resyncing' to md_stop
 	 */
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	/* tell personality that we are finished */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..4f8fad4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int init_resync(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c4..92bb5dd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int init_resync(struct r10conf *conf)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..695bf0f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
 	pr_debug("%d stripes handled\n", handled);
 
 	spin_unlock_irq(&conf->device_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	pr_debug("--- raid5worker inactive\n");
 }
@@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
 	spin_unlock_irq(&conf->device_lock);
 
 	async_tx_issue_pending_all();
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	pr_debug("--- raid5d inactive\n");
 }
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d4a4b0f..17d8730 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
 	blk_start_plug(&plug);
 	while ((bio = bio_list_pop(list)))
 		submit_bio(rw, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void iblock_end_io_flush(struct bio *bio, int err)
diff --git a/fs/aio.c b/fs/aio.c
index f8e52a1..b873698 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..f5848de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err < 0)
 			ret = err;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec57687..f314cfb8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3316,7 +3316,7 @@ out:
 	scrub_wr_submit(sctx);
 	mutex_unlock(&sctx->wr_ctx.wr_lock);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8be4278..fee10af 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
 
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c5b8ba3..879c7fd 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
 	if (ret) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_abort_transaction(trans, root, ret);
 		btrfs_free_logged_extents(log, log_transid);
 		btrfs_set_log_full_commit(root->fs_info, trans);
@@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		if (!list_empty(&root_log_ctx.list))
 			list_del_init(&root_log_ctx.list);
 
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_set_log_full_commit(root->fs_info, trans);
 
 		if (ret != -ENOSPC) {
@@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	}
 
 	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = root_log_ctx.log_ret;
 		goto out;
@@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
 						mark);
 		btrfs_wait_logged_extents(trans, log, log_transid);
@@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * check the full commit flag again
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	ret = btrfs_write_marked_extents(log_root_tree,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (ret) {
 		btrfs_set_log_full_commit(root->fs_info, trans);
 		btrfs_abort_transaction(trans, root, ret);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..16db068 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -358,7 +358,7 @@ loop_lock:
 		if (pending_bios == &device->pending_sync_bios) {
 			sync_pending = 1;
 		} else if (sync_pending) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -415,7 +415,7 @@ loop_lock:
 		}
 		/* unplug every 64 requests just for good measure */
 		if (batch_run % 64 == 0) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -431,7 +431,7 @@ loop_lock:
 	spin_unlock(&device->io_lock);
 
 done:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void pending_bios_fn(struct btrfs_work *work)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..8181c44 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	}
 
 	spin_unlock(lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	spin_lock(lock);
 
 	while (!list_empty(&tmp)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..16f16ed 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * It is possible that, we return short IO due to end of file.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..3a293eb 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			ret = err;
 	}
 	if (o_direct)
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 
 errout:
 	if (aio_mutex)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..90ce0cb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
 
 		blk_start_plug(&plug);
 		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		goto out_writepages;
 	}
 
@@ -2438,7 +2438,7 @@ retry:
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (!ret && !cycled && wbc->nr_to_write > 0) {
 		cycled = 1;
 		mpd.last_page = writeback_index - 1;
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7f794b7..86ba453 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -846,7 +846,7 @@ retry_flush_nodes:
 		goto retry_flush_nodes;
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return err;
 }
 
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 76adbc3..abeef77 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
 		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
 		break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
 	stat_inc_call_count(sbi->stat_info);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..c4aa9e2 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1098,7 +1098,7 @@ repeat:
 		ra_node_page(sbi, nid);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lock_page(page);
 	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 536e7a6..06f25d17 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -159,7 +159,7 @@ restart:
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	trace_gfs2_ail_flush(sdp, wbc, 0);
 }
 
diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
index 8057fe4..138462d 100644
--- a/fs/hpfs/buffer.c
+++ b/fs/hpfs/buffer.c
@@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
 		secno++;
 		n--;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /* Map a sector into a buffer and return pointers to it and to the buffer. */
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 08c0304..cd6b09f 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = bhs[i];
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index bb217dc..e1046c3 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
 	blk_start_plug(&plug);
 	err = journal_submit_data_buffers(journal, commit_transaction,
 					  write_op);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -697,7 +697,7 @@ start_journal_io:
 		}
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 988b32e..6aa0039 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..8f532c8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -805,7 +805,7 @@ start_journal_io:
 			__jbd2_journal_abort_hard(journal);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..bf7d6c3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
 		if (mpd.bio)
 			mpage_bio_submit(WRITE, mpd.bio);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1cac3c1..e93b6a8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
 	}
 out:
 	bl_submit_bio(READ, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
@@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
 	header->res.count = header->args.count;
 out:
 	bl_submit_bio(WRITE, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1790b00..2f89ca2 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
 		if (size <= 0)
 			break;	/* all done */
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
 
 		xfs_buf_submit(bp);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return pinned;
 }
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 098cd78..7e8fa3f 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
 			}
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 out:
 	*bpp = bp;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..c3ac5ec 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
 					     &xfs_inode_buf_ops);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..188133f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
  * schedule() where blk_schedule_flush_plug() is called.
  */
 struct blk_plug {
+	int depth; /* number of nested plugs */
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
@@ -1107,7 +1108,7 @@ struct blk_plug_cb {
 extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
 					     void *data, int size);
 extern void blk_start_plug(struct blk_plug *);
-extern void blk_finish_plug(struct blk_plug *);
+extern void blk_finish_plug(void);
 extern void blk_flush_plug_list(struct blk_plug *, bool);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
@@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
 {
 }
 
-static inline void blk_finish_plug(struct blk_plug *plug)
+static inline void blk_finish_plug(void)
 {
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..18a34ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			vma = find_vma(current->mm, start);
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (write)
 		up_write(&current->mm->mmap_sem);
 	else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb6..4570f6e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
 
 	blk_start_plug(&plug);
 	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..64182a2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	ret = 0;
 
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return ret;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f..5721f64 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			SetPageReadahead(page);
 		page_cache_release(page);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..56bb274 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 
 		scan_adjusted = true;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: aart@kvack.org

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
@ 2015-04-07 18:55     ` Jeff Moyer
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-07 18:55 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: Konrad Rzeszutek Wilk, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, Neil Brown, Nicholas A. Bellinger,
	Alexander Viro, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Jaegeuk Kim, Changman Lee,
	Steven Whitehouse, Mikulas Patocka, Andrew Morton, Rik van Riel,
	Johannes Weiner, Mel Gorman, Trond Myklebust

The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os.  This can be less than optimal (read
awful) for certain workloads.  For example, consider an application
that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug-s instantiated both at the upper
call site in do_io_submit and down in do_blockdev_direct_IO.  The
latter will submit as few as one I/O at a time (if the I/O size is
small enough) instead of performing the batching that the plugging
infrastructure is supposed to provide.

Now, for the case where there is an elevator involved, this doesn't
really matter too much.  The elevator will keep the I/O around long
enough for it to be merged.  However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example:

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges.  Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test systems:

First, I tested in a VM using a virtio-blk device:

Unpatched kernel:
 Throughput: 280,262 KB/s
avg latency: 14,587.72 usec

Patched kernel:
 Throughput: 832,158 KB/s
avg latency:   4,901.95 usec

Next, I tested using a micron p320h on bare metal:

Unpatched kernel:
 Throughput: 688,967 KB/s
avg latency: 5,933.92 usec

Patched kernel:
 Throughput: 1,160.6 MB/s
avg latency: 3,437.01 usec

As you can see, both throughput and latency improved dramatically.
I've included the full fio output below, so you can also see the
marked improvement in standard deviation.

I considered several approaches to solving the problem:
1)  get rid of the innermost plugs
2)  handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
    code a bit at the expense of memory footprint

Option 1 would be tricky or impossible, since the innermost plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements.  Option 2a may add unneeded
complexity.

Much of the patch involves modifying the call sites of
blk_finish_plug, since its signature has changed.  The meat of the
patch is actually pretty simple and constrained to block/blk-core.c
and include/linux/blkdev.h.  The only tricky bits were places where
plugs were finished and then restarted to flush out I/O.  There, I
left things as-is.  So long as they are the outermost plugs, they
should continue to function as before.

NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
would always flush the plug list.  After this patch, that is only the
case for the outermost plug.  If you require the plug list to be
flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
maintainers should take a close look at this patch and ensure they get
the right behavior in the end.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

---
Changelog:
v1->v2: Keep the blk_start_plug interface the same, suggested by Ming Lei.

Test results
------------
Virtio-blk:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
  read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
    slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
    clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
     lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
    clat percentiles (usec):
     |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
     | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
     | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
     | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
     | 99.99th=[36096]
    bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
    lat (usec) : 250=0.01%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
  cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%


patched:
job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
  read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
    slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
    clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
     lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
    clat percentiles (usec):
     |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
     | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
     | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
     | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
     | 99.99th=[19840]
    bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
  cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%

micron p320h:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
  read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
    slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
    clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
     lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
    clat percentiles (usec):
     |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
     | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
     | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
     | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
     | 99.99th=[12224]
    bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
    lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
  cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%

patched:

job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
  read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
    slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
    clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
     lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
    clat percentiles (usec):
     |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
     | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
     | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
     | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
     | 99.99th=[ 9792]
    bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
  cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
---
 block/blk-core.c                    | 29 ++++++++++++++++-------------
 block/blk-lib.c                     |  2 +-
 block/blk-throttle.c                |  2 +-
 drivers/block/xen-blkback/blkback.c |  2 +-
 drivers/md/dm-bufio.c               |  6 +++---
 drivers/md/dm-crypt.c               |  2 +-
 drivers/md/dm-kcopyd.c              |  2 +-
 drivers/md/dm-thin.c                |  2 +-
 drivers/md/md.c                     |  2 +-
 drivers/md/raid1.c                  |  2 +-
 drivers/md/raid10.c                 |  2 +-
 drivers/md/raid5.c                  |  4 ++--
 drivers/target/target_core_iblock.c |  2 +-
 fs/aio.c                            |  2 +-
 fs/block_dev.c                      |  2 +-
 fs/btrfs/scrub.c                    |  2 +-
 fs/btrfs/transaction.c              |  2 +-
 fs/btrfs/tree-log.c                 | 12 ++++++------
 fs/btrfs/volumes.c                  |  6 +++---
 fs/buffer.c                         |  2 +-
 fs/direct-io.c                      |  2 +-
 fs/ext4/file.c                      |  2 +-
 fs/ext4/inode.c                     |  4 ++--
 fs/f2fs/checkpoint.c                |  2 +-
 fs/f2fs/gc.c                        |  2 +-
 fs/f2fs/node.c                      |  2 +-
 fs/gfs2/log.c                       |  2 +-
 fs/hpfs/buffer.c                    |  2 +-
 fs/jbd/checkpoint.c                 |  2 +-
 fs/jbd/commit.c                     |  4 ++--
 fs/jbd2/checkpoint.c                |  2 +-
 fs/jbd2/commit.c                    |  2 +-
 fs/mpage.c                          |  2 +-
 fs/nfs/blocklayout/blocklayout.c    |  4 ++--
 fs/xfs/xfs_buf.c                    |  4 ++--
 fs/xfs/xfs_dir2_readdir.c           |  2 +-
 fs/xfs/xfs_itable.c                 |  2 +-
 include/linux/blkdev.h              |  5 +++--
 mm/madvise.c                        |  2 +-
 mm/page-writeback.c                 |  2 +-
 mm/readahead.c                      |  2 +-
 mm/swap_state.c                     |  2 +-
 mm/vmscan.c                         |  2 +-
 43 files changed, 74 insertions(+), 70 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7..fcd9c2f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3018,21 +3018,21 @@ void blk_start_plug(struct blk_plug *plug)
 {
 	struct task_struct *tsk = current;
 
+	if (tsk->plug) {
+		tsk->plug->depth++;
+		return;
+	}
+
+	plug->depth = 1;
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
 
 	/*
-	 * If this is a nested plug, don't actually assign it. It will be
-	 * flushed on its own.
+	 * Store ordering should not be needed here, since a potential
+	 * preempt will imply a full memory barrier
 	 */
-	if (!tsk->plug) {
-		/*
-		 * Store ordering should not be needed here, since a potential
-		 * preempt will imply a full memory barrier
-		 */
-		tsk->plug = plug;
-	}
+	tsk->plug = plug;
 }
 EXPORT_SYMBOL(blk_start_plug);
 
@@ -3177,12 +3177,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	local_irq_restore(flags);
 }
 
-void blk_finish_plug(struct blk_plug *plug)
+void blk_finish_plug(void)
 {
-	blk_flush_plug_list(plug, false);
+	struct blk_plug *plug = current->plug;
 
-	if (plug == current->plug)
-		current->plug = NULL;
+	if (--plug->depth > 0)
+		return;
+
+	blk_flush_plug_list(plug, false);
+	current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..ac347d3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		 */
 		cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5b9c6d5..222a77a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1281,7 +1281,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 	}
 }
 
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 2a04d34..74bea21 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1374,7 +1374,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		submit_bio(operation, biolist[i]);
 
 	/* Let the I/Os go.. */
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	if (operation == READ)
 		blkif->st_rd_sect += preq.nr_sects;
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 86dbbc7..502c63b 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
 		submit_io(b, WRITE, b->block, write_endio);
 		dm_bufio_cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1126,7 +1126,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 				&write_list);
 		if (unlikely(!list_empty(&write_list))) {
 			dm_bufio_unlock(c);
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			__flush_write_list(&write_list);
 			blk_start_plug(&plug);
 			dm_bufio_lock(c);
@@ -1149,7 +1149,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 	dm_bufio_unlock(c);
 
 flush_plug:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 713a962..65d7b72 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1224,7 +1224,7 @@ pop_from_list:
 			rb_erase(&io->rb_node, &write_tree);
 			kcryptd_io_write(io);
 		} while (!RB_EMPTY_ROOT(&write_tree));
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 	}
 	return 0;
 }
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade..4a76e42 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -593,7 +593,7 @@ static void do_work(struct work_struct *work)
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 921aafd..be42bf5 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 			dm_pool_issue_prefetches(pool->pmd);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int cmp_cells(const void *lhs, const void *rhs)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 717daad..c4ec179 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
 	/*
 	 * this also signals 'finished resyncing' to md_stop
 	 */
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	/* tell personality that we are finished */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..4f8fad4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int init_resync(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c4..92bb5dd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int init_resync(struct r10conf *conf)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..695bf0f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
 	pr_debug("%d stripes handled\n", handled);
 
 	spin_unlock_irq(&conf->device_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	pr_debug("--- raid5worker inactive\n");
 }
@@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
 	spin_unlock_irq(&conf->device_lock);
 
 	async_tx_issue_pending_all();
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	pr_debug("--- raid5d inactive\n");
 }
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d4a4b0f..17d8730 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
 	blk_start_plug(&plug);
 	while ((bio = bio_list_pop(list)))
 		submit_bio(rw, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void iblock_end_io_flush(struct bio *bio, int err)
diff --git a/fs/aio.c b/fs/aio.c
index f8e52a1..b873698 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..f5848de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err < 0)
 			ret = err;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec57687..f314cfb8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3316,7 +3316,7 @@ out:
 	scrub_wr_submit(sctx);
 	mutex_unlock(&sctx->wr_ctx.wr_lock);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8be4278..fee10af 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
 
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c5b8ba3..879c7fd 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
 	if (ret) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_abort_transaction(trans, root, ret);
 		btrfs_free_logged_extents(log, log_transid);
 		btrfs_set_log_full_commit(root->fs_info, trans);
@@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		if (!list_empty(&root_log_ctx.list))
 			list_del_init(&root_log_ctx.list);
 
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_set_log_full_commit(root->fs_info, trans);
 
 		if (ret != -ENOSPC) {
@@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	}
 
 	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = root_log_ctx.log_ret;
 		goto out;
@@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
 						mark);
 		btrfs_wait_logged_extents(trans, log, log_transid);
@@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * check the full commit flag again
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	ret = btrfs_write_marked_extents(log_root_tree,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (ret) {
 		btrfs_set_log_full_commit(root->fs_info, trans);
 		btrfs_abort_transaction(trans, root, ret);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..16db068 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -358,7 +358,7 @@ loop_lock:
 		if (pending_bios == &device->pending_sync_bios) {
 			sync_pending = 1;
 		} else if (sync_pending) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -415,7 +415,7 @@ loop_lock:
 		}
 		/* unplug every 64 requests just for good measure */
 		if (batch_run % 64 == 0) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -431,7 +431,7 @@ loop_lock:
 	spin_unlock(&device->io_lock);
 
 done:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void pending_bios_fn(struct btrfs_work *work)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..8181c44 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	}
 
 	spin_unlock(lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	spin_lock(lock);
 
 	while (!list_empty(&tmp)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..16f16ed 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * It is possible that, we return short IO due to end of file.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..3a293eb 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			ret = err;
 	}
 	if (o_direct)
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 
 errout:
 	if (aio_mutex)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..90ce0cb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
 
 		blk_start_plug(&plug);
 		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		goto out_writepages;
 	}
 
@@ -2438,7 +2438,7 @@ retry:
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (!ret && !cycled && wbc->nr_to_write > 0) {
 		cycled = 1;
 		mpd.last_page = writeback_index - 1;
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7f794b7..86ba453 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -846,7 +846,7 @@ retry_flush_nodes:
 		goto retry_flush_nodes;
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return err;
 }
 
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 76adbc3..abeef77 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
 		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
 		break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
 	stat_inc_call_count(sbi->stat_info);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..c4aa9e2 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1098,7 +1098,7 @@ repeat:
 		ra_node_page(sbi, nid);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lock_page(page);
 	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 536e7a6..06f25d17 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -159,7 +159,7 @@ restart:
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	trace_gfs2_ail_flush(sdp, wbc, 0);
 }
 
diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
index 8057fe4..138462d 100644
--- a/fs/hpfs/buffer.c
+++ b/fs/hpfs/buffer.c
@@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
 		secno++;
 		n--;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /* Map a sector into a buffer and return pointers to it and to the buffer. */
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 08c0304..cd6b09f 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = bhs[i];
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index bb217dc..e1046c3 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
 	blk_start_plug(&plug);
 	err = journal_submit_data_buffers(journal, commit_transaction,
 					  write_op);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -697,7 +697,7 @@ start_journal_io:
 		}
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 988b32e..6aa0039 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..8f532c8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -805,7 +805,7 @@ start_journal_io:
 			__jbd2_journal_abort_hard(journal);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..bf7d6c3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
 		if (mpd.bio)
 			mpage_bio_submit(WRITE, mpd.bio);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1cac3c1..e93b6a8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
 	}
 out:
 	bl_submit_bio(READ, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
@@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
 	header->res.count = header->args.count;
 out:
 	bl_submit_bio(WRITE, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1790b00..2f89ca2 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
 		if (size <= 0)
 			break;	/* all done */
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
 
 		xfs_buf_submit(bp);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return pinned;
 }
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 098cd78..7e8fa3f 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
 			}
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 out:
 	*bpp = bp;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..c3ac5ec 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
 					     &xfs_inode_buf_ops);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..188133f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
  * schedule() where blk_schedule_flush_plug() is called.
  */
 struct blk_plug {
+	int depth; /* number of nested plugs */
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
@@ -1107,7 +1108,7 @@ struct blk_plug_cb {
 extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
 					     void *data, int size);
 extern void blk_start_plug(struct blk_plug *);
-extern void blk_finish_plug(struct blk_plug *);
+extern void blk_finish_plug(void);
 extern void blk_flush_plug_list(struct blk_plug *, bool);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
@@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
 {
 }
 
-static inline void blk_finish_plug(struct blk_plug *plug)
+static inline void blk_finish_plug(void)
 {
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..18a34ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			vma = find_vma(current->mm, start);
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (write)
 		up_write(&current->mm->mmap_sem);
 	else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb6..4570f6e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
 
 	blk_start_plug(&plug);
 	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..64182a2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	ret = 0;
 
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return ret;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f..5721f64 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			SetPageReadahead(page);
 		page_cache_release(page);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..56bb274 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 
 		scan_adjusted = true;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
@ 2015-04-07 18:55     ` Jeff Moyer
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-07 18:55 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: Vladimir Davydov, linux-aio, Miklos Szeredi, Mike Snitzer,
	Neil Brown, Ming Lei, Trond Myklebust, Jianyu Zhan,
	Nicholas A. Bellinger, linux-kernel, Sagi Grimberg, Chris Mason,
	dm-devel, target-devel, Andreas Dilger, Mikulas Patocka,
	Mark Rustad, Christoph Hellwig, Alasdair Kergon, Matthew Wilcox,
	linux-scsi, Namjae Jeon, linux-raid, cluster-devel, Mel Gorman,
	Suleiman Souhlal, linux-ext4, linux-mm, Changman Lee,
	Rik van Riel, Konrad Rzeszutek Wilk, xfs, Fabian Frederick,
	Joe Perches, Alexander Viro, xen-devel, Jaegeuk Kim,
	Steven Whitehouse, Vlastimil Babka, Michal Hocko, linux-nfs,
	Fengguang Wu, Theodore Ts'o, Martin K. Petersen,
	Wang Sheng-Hui, Josef Bacik, David Sterba, linux-f2fs-devel,
	linux-btrfs, Johannes Weiner, Tejun Heo, linux-fsdevel,
	Andrew Morton, Weston Andros Adamson, Anna Schumaker,
	Kirill A. Shutemov, Roger Pau Monné

The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os.  This can be less than optimal (read
awful) for certain workloads.  For example, consider an application
that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug-s instantiated both at the upper
call site in do_io_submit and down in do_direct_IO.  The latter will
submit as few as one I/O at a time (if you have a small enough I/O
size) instead of performing the batching that the plugging
infrastructure is supposed to provide.
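
To make the shape of the problem concrete, here is a minimal sketch
of how the plugs nest (illustrative only: the function names stand in
for do_io_submit and the direct I/O path, and submit_one_bio is a
hypothetical helper), annotated with the behavior after this patch:

	static void direct_io_path(void)
	{
		struct blk_plug inner;

		blk_start_plug(&inner);	/* nested: just bumps the depth count */
		submit_one_bio();	/* request queues on the outer plug */
		blk_finish_plug();	/* depth still > 0, so no flush here */
	}

	static void io_submit_path(int nr)
	{
		struct blk_plug outer;
		int i;

		blk_start_plug(&outer);	/* outermost plug, depth = 1 */
		for (i = 0; i < nr; i++)
			direct_io_path();
		blk_finish_plug();	/* depth hits 0: all nr I/Os go out together */
	}

Before this patch, the inner blk_finish_plug flushed right away, so
the batching across the outer plug never happened.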

Now, for the case where there is an elevator involved, this doesn't
really matter too much.  The elevator will keep the I/O around long
enough for it to be merged.  However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example:

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges.  Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test systems:

First, I tested in a VM using a virtio-blk device:

Unpatched kernel:
 Throughput: 280,262 KB/s
avg latency: 14,587.72 usec

Patched kernel:
 Throughput: 832,158 KB/s
avg latency:   4,901.95 usec

Next, I tested using a micron p320h on bare metal:

Unpatched kernel:
 Throughput: 688,967 KB/s
avg latency: 5,933.92 usec

Patched kernel:
 Throughput: 1,160.6 MB/s
avg latency: 3,437.01 usec

As you can see, both throughput and latency improved dramatically.
I've included the full fio output below, so you can also see the
marked improvement in standard deviation.

I considered several approaches to solving the problem:
1)  get rid of the innermost plugs
2)  handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
    code a bit at the expense of memory footprint

Option 1 will be tricky or impossible to do, since innermost plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements.  Option 2a may add unneeded
complexity.
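
For reference, here is the new start/finish pair condensed from the
blk-core.c hunk below (the same logic, just shown straight-line
rather than as a diff):

	void blk_start_plug(struct blk_plug *plug)
	{
		/* Nested call: reuse the task's existing plug. */
		if (current->plug) {
			current->plug->depth++;
			return;
		}

		plug->depth = 1;
		INIT_LIST_HEAD(&plug->list);
		INIT_LIST_HEAD(&plug->mq_list);
		INIT_LIST_HEAD(&plug->cb_list);
		current->plug = plug;
	}

	void blk_finish_plug(void)
	{
		struct blk_plug *plug = current->plug;

		/* Only the outermost finish actually flushes. */
		if (--plug->depth > 0)
			return;

		blk_flush_plug_list(plug, false);
		current->plug = NULL;
	}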

Much of the patch involves modifying callers of blk_finish_plug,
since its signature has changed.  The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h.  The only tricky bits were places where plugs
were finished and then restarted to flush out I/O.  There, I left things
as-is.  So long as they are the outer-most plugs, they should continue
to function as before.

NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
would always flush the plug list.  After this patch, this is only the
case for the outer-most plug.  If you require the plug list to be
flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
maintainers should take a close look at this patch and ensure they get
the right behavior in the end.
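
As an illustration (this sketch is not part of the patch; "plug" is
whatever on-stack plug the caller already started), a site that
forces queued I/O out mid-stream looks like this before and after:

	/*
	 * Old idiom, left as-is by this patch: under a nested plug the
	 * pair below just decrements and re-increments the depth count,
	 * so nothing is flushed.
	 */
	blk_finish_plug();
	blk_start_plug(&plug);

	/*
	 * Explicit alternative that flushes at any nesting depth, and
	 * is a no-op when the task has no active plug:
	 */
	blk_flush_plug(current);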

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

---
Changelog:
v1->v2: Keep the blk_start_plug interface the same, as suggested by Ming Lei.

Test results
------------
Virtio-blk:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
  read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
    slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
    clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
     lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
    clat percentiles (usec):
     |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
     | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
     | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
     | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
     | 99.99th=[36096]
    bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
    lat (usec) : 250=0.01%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
  cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%


patched:
job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
  read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
    slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
    clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
     lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
    clat percentiles (usec):
     |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
     | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
     | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
     | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
     | 99.99th=[19840]
    bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
  cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%

micron p320h:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
  read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
    slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
    clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
     lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
    clat percentiles (usec):
     |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
     | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
     | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
     | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
     | 99.99th=[12224]
    bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
    lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
  cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%

patched:

job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
  read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
    slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
    clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
     lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
    clat percentiles (usec):
     |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
     | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
     | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
     | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
     | 99.99th=[ 9792]
    bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
  cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
---
 block/blk-core.c                    | 29 ++++++++++++++++-------------
 block/blk-lib.c                     |  2 +-
 block/blk-throttle.c                |  2 +-
 drivers/block/xen-blkback/blkback.c |  2 +-
 drivers/md/dm-bufio.c               |  6 +++---
 drivers/md/dm-crypt.c               |  2 +-
 drivers/md/dm-kcopyd.c              |  2 +-
 drivers/md/dm-thin.c                |  2 +-
 drivers/md/md.c                     |  2 +-
 drivers/md/raid1.c                  |  2 +-
 drivers/md/raid10.c                 |  2 +-
 drivers/md/raid5.c                  |  4 ++--
 drivers/target/target_core_iblock.c |  2 +-
 fs/aio.c                            |  2 +-
 fs/block_dev.c                      |  2 +-
 fs/btrfs/scrub.c                    |  2 +-
 fs/btrfs/transaction.c              |  2 +-
 fs/btrfs/tree-log.c                 | 12 ++++++------
 fs/btrfs/volumes.c                  |  6 +++---
 fs/buffer.c                         |  2 +-
 fs/direct-io.c                      |  2 +-
 fs/ext4/file.c                      |  2 +-
 fs/ext4/inode.c                     |  4 ++--
 fs/f2fs/checkpoint.c                |  2 +-
 fs/f2fs/gc.c                        |  2 +-
 fs/f2fs/node.c                      |  2 +-
 fs/gfs2/log.c                       |  2 +-
 fs/hpfs/buffer.c                    |  2 +-
 fs/jbd/checkpoint.c                 |  2 +-
 fs/jbd/commit.c                     |  4 ++--
 fs/jbd2/checkpoint.c                |  2 +-
 fs/jbd2/commit.c                    |  2 +-
 fs/mpage.c                          |  2 +-
 fs/nfs/blocklayout/blocklayout.c    |  4 ++--
 fs/xfs/xfs_buf.c                    |  4 ++--
 fs/xfs/xfs_dir2_readdir.c           |  2 +-
 fs/xfs/xfs_itable.c                 |  2 +-
 include/linux/blkdev.h              |  5 +++--
 mm/madvise.c                        |  2 +-
 mm/page-writeback.c                 |  2 +-
 mm/readahead.c                      |  2 +-
 mm/swap_state.c                     |  2 +-
 mm/vmscan.c                         |  2 +-
 43 files changed, 74 insertions(+), 70 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 794c3e7..fcd9c2f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3018,21 +3018,21 @@ void blk_start_plug(struct blk_plug *plug)
 {
 	struct task_struct *tsk = current;
 
+	if (tsk->plug) {
+		tsk->plug->depth++;
+		return;
+	}
+
+	plug->depth = 1;
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
 
 	/*
-	 * If this is a nested plug, don't actually assign it. It will be
-	 * flushed on its own.
+	 * Store ordering should not be needed here, since a potential
+	 * preempt will imply a full memory barrier
 	 */
-	if (!tsk->plug) {
-		/*
-		 * Store ordering should not be needed here, since a potential
-		 * preempt will imply a full memory barrier
-		 */
-		tsk->plug = plug;
-	}
+	tsk->plug = plug;
 }
 EXPORT_SYMBOL(blk_start_plug);
 
@@ -3177,12 +3177,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	local_irq_restore(flags);
 }
 
-void blk_finish_plug(struct blk_plug *plug)
+void blk_finish_plug(void)
 {
-	blk_flush_plug_list(plug, false);
+	struct blk_plug *plug = current->plug;
 
-	if (plug == current->plug)
-		current->plug = NULL;
+	if (--plug->depth > 0)
+		return;
+
+	blk_flush_plug_list(plug, false);
+	current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 7688ee3..ac347d3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		 */
 		cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Wait for bios in-flight */
 	if (!atomic_dec_and_test(&bb.done))
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5b9c6d5..222a77a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1281,7 +1281,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 	}
 }
 
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 2a04d34..74bea21 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -1374,7 +1374,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
 		submit_bio(operation, biolist[i]);
 
 	/* Let the I/Os go.. */
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	if (operation == READ)
 		blkif->st_rd_sect += preq.nr_sects;
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 86dbbc7..502c63b 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
 		submit_io(b, WRITE, b->block, write_endio);
 		dm_bufio_cond_resched();
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1126,7 +1126,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 				&write_list);
 		if (unlikely(!list_empty(&write_list))) {
 			dm_bufio_unlock(c);
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			__flush_write_list(&write_list);
 			blk_start_plug(&plug);
 			dm_bufio_lock(c);
@@ -1149,7 +1149,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 	dm_bufio_unlock(c);
 
 flush_plug:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 713a962..65d7b72 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1224,7 +1224,7 @@ pop_from_list:
 			rb_erase(&io->rb_node, &write_tree);
 			kcryptd_io_write(io);
 		} while (!RB_EMPTY_ROOT(&write_tree));
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 	}
 	return 0;
 }
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade..4a76e42 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -593,7 +593,7 @@ static void do_work(struct work_struct *work)
 	process_jobs(&kc->complete_jobs, kc, run_complete_job);
 	process_jobs(&kc->pages_jobs, kc, run_pages_job);
 	process_jobs(&kc->io_jobs, kc, run_io_job);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 921aafd..be42bf5 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
 			dm_pool_issue_prefetches(pool->pmd);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int cmp_cells(const void *lhs, const void *rhs)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 717daad..c4ec179 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
 	/*
 	 * this also signals 'finished resyncing' to md_stop
 	 */
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
 
 	/* tell personality that we are finished */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index d34e238..4f8fad4 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int init_resync(struct r1conf *conf)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a7196c4..92bb5dd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
 			md_check_recovery(mddev);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static int init_resync(struct r10conf *conf)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..695bf0f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
 	pr_debug("%d stripes handled\n", handled);
 
 	spin_unlock_irq(&conf->device_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	pr_debug("--- raid5worker inactive\n");
 }
@@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
 	spin_unlock_irq(&conf->device_lock);
 
 	async_tx_issue_pending_all();
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	pr_debug("--- raid5d inactive\n");
 }
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index d4a4b0f..17d8730 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
 	blk_start_plug(&plug);
 	while ((bio = bio_list_pop(list)))
 		submit_bio(rw, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void iblock_end_io_flush(struct bio *bio, int err)
diff --git a/fs/aio.c b/fs/aio.c
index f8e52a1..b873698 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..f5848de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err < 0)
 			ret = err;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec57687..f314cfb8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3316,7 +3316,7 @@ out:
 	scrub_wr_submit(sctx);
 	mutex_unlock(&sctx->wr_ctx.wr_lock);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8be4278..fee10af 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
 
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c5b8ba3..879c7fd 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
 	if (ret) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_abort_transaction(trans, root, ret);
 		btrfs_free_logged_extents(log, log_transid);
 		btrfs_set_log_full_commit(root->fs_info, trans);
@@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		if (!list_empty(&root_log_ctx.list))
 			list_del_init(&root_log_ctx.list);
 
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_set_log_full_commit(root->fs_info, trans);
 
 		if (ret != -ENOSPC) {
@@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	}
 
 	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = root_log_ctx.log_ret;
 		goto out;
@@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
 						mark);
 		btrfs_wait_logged_extents(trans, log, log_transid);
@@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * check the full commit flag again
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	ret = btrfs_write_marked_extents(log_root_tree,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (ret) {
 		btrfs_set_log_full_commit(root->fs_info, trans);
 		btrfs_abort_transaction(trans, root, ret);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..16db068 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -358,7 +358,7 @@ loop_lock:
 		if (pending_bios == &device->pending_sync_bios) {
 			sync_pending = 1;
 		} else if (sync_pending) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -415,7 +415,7 @@ loop_lock:
 		}
 		/* unplug every 64 requests just for good measure */
 		if (batch_run % 64 == 0) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -431,7 +431,7 @@ loop_lock:
 	spin_unlock(&device->io_lock);
 
 done:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void pending_bios_fn(struct btrfs_work *work)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..8181c44 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	}
 
 	spin_unlock(lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	spin_lock(lock);
 
 	while (!list_empty(&tmp)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..16f16ed 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * It is possible that, we return short IO due to end of file.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..3a293eb 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			ret = err;
 	}
 	if (o_direct)
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 
 errout:
 	if (aio_mutex)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..90ce0cb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
 
 		blk_start_plug(&plug);
 		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		goto out_writepages;
 	}
 
@@ -2438,7 +2438,7 @@ retry:
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (!ret && !cycled && wbc->nr_to_write > 0) {
 		cycled = 1;
 		mpd.last_page = writeback_index - 1;
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7f794b7..86ba453 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -846,7 +846,7 @@ retry_flush_nodes:
 		goto retry_flush_nodes;
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return err;
 }
 
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 76adbc3..abeef77 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
 		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
 		break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
 	stat_inc_call_count(sbi->stat_info);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..c4aa9e2 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1098,7 +1098,7 @@ repeat:
 		ra_node_page(sbi, nid);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lock_page(page);
 	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 536e7a6..06f25d17 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -159,7 +159,7 @@ restart:
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	trace_gfs2_ail_flush(sdp, wbc, 0);
 }
 
diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
index 8057fe4..138462d 100644
--- a/fs/hpfs/buffer.c
+++ b/fs/hpfs/buffer.c
@@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
 		secno++;
 		n--;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /* Map a sector into a buffer and return pointers to it and to the buffer. */
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 08c0304..cd6b09f 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = bhs[i];
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index bb217dc..e1046c3 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
 	blk_start_plug(&plug);
 	err = journal_submit_data_buffers(journal, commit_transaction,
 					  write_op);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -697,7 +697,7 @@ start_journal_io:
 		}
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 988b32e..6aa0039 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..8f532c8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -805,7 +805,7 @@ start_journal_io:
 			__jbd2_journal_abort_hard(journal);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..bf7d6c3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
 		if (mpd.bio)
 			mpage_bio_submit(WRITE, mpd.bio);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1cac3c1..e93b6a8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
 	}
 out:
 	bl_submit_bio(READ, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
@@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
 	header->res.count = header->args.count;
 out:
 	bl_submit_bio(WRITE, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1790b00..2f89ca2 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
 		if (size <= 0)
 			break;	/* all done */
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
 
 		xfs_buf_submit(bp);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return pinned;
 }
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 098cd78..7e8fa3f 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
 			}
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 out:
 	*bpp = bp;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..c3ac5ec 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
 					     &xfs_inode_buf_ops);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..188133f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
  * schedule() where blk_schedule_flush_plug() is called.
  */
 struct blk_plug {
+	int depth; /* number of nested plugs */
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
@@ -1107,7 +1108,7 @@ struct blk_plug_cb {
 extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
 					     void *data, int size);
 extern void blk_start_plug(struct blk_plug *);
-extern void blk_finish_plug(struct blk_plug *);
+extern void blk_finish_plug(void);
 extern void blk_flush_plug_list(struct blk_plug *, bool);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
@@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
 {
 }
 
-static inline void blk_finish_plug(struct blk_plug *plug)
+static inline void blk_finish_plug(void)
 {
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..18a34ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			vma = find_vma(current->mm, start);
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (write)
 		up_write(&current->mm->mmap_sem);
 	else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb6..4570f6e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
 
 	blk_start_plug(&plug);
 	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..64182a2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	ret = 0;
 
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return ret;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f..5721f64 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			SetPageReadahead(page);
 		page_cache_release(page);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..56bb274 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 
 		scan_adjusted = true;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

+	blk_finish_plug();
 }
 
 static void iblock_end_io_flush(struct bio *bio, int err)
diff --git a/fs/aio.c b/fs/aio.c
index f8e52a1..b873698 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..f5848de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err < 0)
 			ret = err;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ec57687..f314cfb8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3316,7 +3316,7 @@ out:
 	scrub_wr_submit(sctx);
 	mutex_unlock(&sctx->wr_ctx.wr_lock);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
 	return ret < 0 ? ret : 0;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8be4278..fee10af 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
 
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
 
 	if (ret)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c5b8ba3..879c7fd 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	blk_start_plug(&plug);
 	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
 	if (ret) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_abort_transaction(trans, root, ret);
 		btrfs_free_logged_extents(log, log_transid);
 		btrfs_set_log_full_commit(root->fs_info, trans);
@@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 		if (!list_empty(&root_log_ctx.list))
 			list_del_init(&root_log_ctx.list);
 
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_set_log_full_commit(root->fs_info, trans);
 
 		if (ret != -ENOSPC) {
@@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	}
 
 	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		mutex_unlock(&log_root_tree->log_mutex);
 		ret = root_log_ctx.log_ret;
 		goto out;
@@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 
 	index2 = root_log_ctx.log_transid % 2;
 	if (atomic_read(&log_root_tree->log_commit[index2])) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
 						mark);
 		btrfs_wait_logged_extents(trans, log, log_transid);
@@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	 * check the full commit flag again
 	 */
 	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 		btrfs_free_logged_extents(log, log_transid);
 		mutex_unlock(&log_root_tree->log_mutex);
@@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	ret = btrfs_write_marked_extents(log_root_tree,
 					 &log_root_tree->dirty_log_pages,
 					 EXTENT_DIRTY | EXTENT_NEW);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (ret) {
 		btrfs_set_log_full_commit(root->fs_info, trans);
 		btrfs_abort_transaction(trans, root, ret);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8222f6f..16db068 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -358,7 +358,7 @@ loop_lock:
 		if (pending_bios == &device->pending_sync_bios) {
 			sync_pending = 1;
 		} else if (sync_pending) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -415,7 +415,7 @@ loop_lock:
 		}
 		/* unplug every 64 requests just for good measure */
 		if (batch_run % 64 == 0) {
-			blk_finish_plug(&plug);
+			blk_finish_plug();
 			blk_start_plug(&plug);
 			sync_pending = 0;
 		}
@@ -431,7 +431,7 @@ loop_lock:
 	spin_unlock(&device->io_lock);
 
 done:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 static void pending_bios_fn(struct btrfs_work *work)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..8181c44 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
 	}
 
 	spin_unlock(lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	spin_lock(lock);
 
 	while (!list_empty(&tmp)) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..16f16ed 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * It is possible that, we return short IO due to end of file.
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 33a09da..3a293eb 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			ret = err;
 	}
 	if (o_direct)
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 
 errout:
 	if (aio_mutex)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5cb9a21..90ce0cb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
 
 		blk_start_plug(&plug);
 		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-		blk_finish_plug(&plug);
+		blk_finish_plug();
 		goto out_writepages;
 	}
 
@@ -2438,7 +2438,7 @@ retry:
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (!ret && !cycled && wbc->nr_to_write > 0) {
 		cycled = 1;
 		mpd.last_page = writeback_index - 1;
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 7f794b7..86ba453 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -846,7 +846,7 @@ retry_flush_nodes:
 		goto retry_flush_nodes;
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return err;
 }
 
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 76adbc3..abeef77 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
 		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
 		break;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
 	stat_inc_call_count(sbi->stat_info);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 97bd9d3..c4aa9e2 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1098,7 +1098,7 @@ repeat:
 		ra_node_page(sbi, nid);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lock_page(page);
 	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 536e7a6..06f25d17 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -159,7 +159,7 @@ restart:
 			goto restart;
 	}
 	spin_unlock(&sdp->sd_ail_lock);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	trace_gfs2_ail_flush(sdp, wbc, 0);
 }
 
diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
index 8057fe4..138462d 100644
--- a/fs/hpfs/buffer.c
+++ b/fs/hpfs/buffer.c
@@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
 		secno++;
 		n--;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /* Map a sector into a buffer and return pointers to it and to the buffer. */
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index 08c0304..cd6b09f 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = bhs[i];
diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index bb217dc..e1046c3 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
 	blk_start_plug(&plug);
 	err = journal_submit_data_buffers(journal, commit_transaction,
 					  write_op);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/*
 	 * Wait for all previously submitted IO to complete.
@@ -697,7 +697,7 @@ start_journal_io:
 		}
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 988b32e..6aa0039 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
 	blk_start_plug(&plug);
 	for (i = 0; i < *batch_count; i++)
 		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	for (i = 0; i < *batch_count; i++) {
 		struct buffer_head *bh = journal->j_chkpt_bhs[i];
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b73e021..8f532c8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -805,7 +805,7 @@ start_journal_io:
 			__jbd2_journal_abort_hard(journal);
 	}
 
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	/* Lo and behold: we have just managed to send a transaction to
            the log.  Before we can commit it, wait for the IO so far to
diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..bf7d6c3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
 		if (mpd.bio)
 			mpage_bio_submit(WRITE, mpd.bio);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 EXPORT_SYMBOL(mpage_writepages);
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 1cac3c1..e93b6a8 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
 	}
 out:
 	bl_submit_bio(READ, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
@@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
 	header->res.count = header->args.count;
 out:
 	bl_submit_bio(WRITE, bio);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	put_parallel(par);
 	return PNFS_ATTEMPTED;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1790b00..2f89ca2 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
 		if (size <= 0)
 			break;	/* all done */
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
@@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
 
 		xfs_buf_submit(bp);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return pinned;
 }
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 098cd78..7e8fa3f 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
 			}
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 out:
 	*bpp = bp;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 82e3142..c3ac5ec 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
 					     &xfs_inode_buf_ops);
 		}
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a516..188133f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
  * schedule() where blk_schedule_flush_plug() is called.
  */
 struct blk_plug {
+	int depth; /* number of nested plugs */
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
@@ -1107,7 +1108,7 @@ struct blk_plug_cb {
 extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
 					     void *data, int size);
 extern void blk_start_plug(struct blk_plug *);
-extern void blk_finish_plug(struct blk_plug *);
+extern void blk_finish_plug(void);
 extern void blk_flush_plug_list(struct blk_plug *, bool);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
@@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
 {
 }
 
-static inline void blk_finish_plug(struct blk_plug *plug)
+static inline void blk_finish_plug(void)
 {
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..18a34ee 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 			vma = find_vma(current->mm, start);
 	}
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	if (write)
 		up_write(&current->mm->mmap_sem);
 	else
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 644bcb6..4570f6e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
 
 	blk_start_plug(&plug);
 	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	return ret;
 }
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..64182a2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
 	ret = 0;
 
 out:
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	return ret;
 }
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 405923f..5721f64 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			SetPageReadahead(page);
 		page_cache_release(page);
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 skip:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..56bb274 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 
 		scan_adjusted = true;
 	}
-	blk_finish_plug(&plug);
+	blk_finish_plug();
 	sc->nr_reclaimed += nr_reclaimed;
 
 	/*
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Cluster-devel] [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
@ 2015-04-07 18:55     ` Jeff Moyer
  0 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-07 18:55 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os.  This can be less than optimal (read
awful) for certain workloads.  For example, consider an application
that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug structs instantiated both at the upper
call site in do_io_submit and down in do_direct_IO.  The latter will
submit as few as one I/O at a time (if you have a small enough I/O
size) instead of performing the batching that the plugging
infrastructure is supposed to provide.
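
To make the failure mode concrete, here is a toy userspace model of
today's per-nesting-level flushing (all toy_* names are invented for
illustration; this is the shape of the problem, not the kernel code):

	#include <stdio.h>

	static int pending;	/* I/Os sitting on the "plug list" */

	static void toy_start_plug(void) { }

	static void toy_finish_plug(void)
	{
		if (pending)
			printf("dispatching %d request(s)\n", pending);
		pending = 0;
	}

	/* inner call site, in the spirit of do_direct_IO */
	static void toy_direct_io(void)
	{
		toy_start_plug();
		pending++;		/* queue a single bio ... */
		toy_finish_plug();	/* ... and immediately flush it */
	}

	int main(void)
	{
		int i;

		toy_start_plug();	/* outer call site, like do_io_submit */
		for (i = 0; i < 16; i++)
			toy_direct_io();
		toy_finish_plug();	/* nothing left here to batch */
		return 0;
	}

That prints "dispatching 1 request(s)" sixteen times: every inner flush
goes straight to the device, so there is nothing for the outer plug to
batch or merge.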

Now, for the case where there is an elevator involved, this doesn't
really matter too much.  The elevator will keep the I/O around long
enough for it to be merged.  However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example:

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges.  Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test systems:

First, I tested in a VM using a virtio-blk device:

Unpatched kernel:
 Throughput: 280,262 KB/s
avg latency: 14,587.72 usec

Patched kernel:
 Throughput: 832,158 KB/s
avg latency:   4,901.95 usec

Next, I tested using a Micron p320h on bare metal:

Unpatched kernel:
 Throughput: 688,967 KB/s
avg latency: 5,933.92 usec

Patched kernel:
 Throughput: 1,160.6 MB/s
avg latency: 3,437.01 usec

As you can see, both throughput and latency improved dramatically.
I've included the full fio output below, so you can also see the
marked improvement in standard deviation as well.

I considered several approaches to solving the problem:
1)  get rid of the inner-most plugs
2)  handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
    code a bit at the expense of memory footprint

Option 1 will be tricky or impossible to do, since inner-most plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements.  Option 2a may add unneeded
complexity.

Much of the patch involves modifying the call sites of blk_finish_plug,
since its signature has changed.  The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h.  The only tricky bits were places where plugs
were finished and then restarted to flush out I/O.  There, I left things
as-is.  So long as they are the outer-most plugs, they should continue
to function as before.
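
For example, the flush-and-restart pattern in btrfs' volumes.c boils down
to the following, which keeps working because the finish only flushes when
it belongs to the outer-most plug:

	blk_finish_plug();	/* flushes only at the outer-most level */
	blk_start_plug(&plug);	/* re-arm for the next batch */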

NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
would always flush the plug list.  After this patch, this is only the
case for the outer-most plug.  If you require the plug list to be
flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
maintainers should take a close look at this patch and ensure they get
the right behavior in the end.
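
As a minimal sketch, a call site that must push its I/O out regardless of
nesting depth would do:

	/* force a flush even from within a nested plug */
	blk_flush_plug(current);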

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

---
Changelog:
v1->v2: Keep the blk_start_plug interface the same, suggested by Ming Lei.

Test results
------------
Virtio-blk:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
  read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
    slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
    clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
     lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
    clat percentiles (usec):
     |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
     | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
     | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
     | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
     | 99.99th=[36096]
    bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
    lat (usec) : 250=0.01%, 1000=0.01%
    lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
  cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%


patched:
job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
  read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
    slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
    clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
     lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
    clat percentiles (usec):
     |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
     | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
     | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
     | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
     | 99.99th=[19840]
    bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
  cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec

Disk stats (read/write):
  vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%

micron p320h:

unpatched:

job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
  read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
    slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
    clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
     lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
    clat percentiles (usec):
     |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
     | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
     | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
     | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
     | 99.99th=[12224]
    bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
    lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
  cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%

patched:

job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
  read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
    slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
    clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
     lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
    clat percentiles (usec):
     |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
     | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
     | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
     | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
     | 99.99th=[ 9792]
    bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
  cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1024

Run status group 0 (all jobs):
   READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec

Disk stats (read/write):
  rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
---
 block/blk-core.c                    | 29 ++++++++++++++++-------------
 block/blk-lib.c                     |  2 +-
 block/blk-throttle.c                |  2 +-
 drivers/block/xen-blkback/blkback.c |  2 +-
 drivers/md/dm-bufio.c               |  6 +++---
 drivers/md/dm-crypt.c               |  2 +-
 drivers/md/dm-kcopyd.c              |  2 +-
 drivers/md/dm-thin.c                |  2 +-
 drivers/md/md.c                     |  2 +-
 drivers/md/raid1.c                  |  2 +-
 drivers/md/raid10.c                 |  2 +-
 drivers/md/raid5.c                  |  4 ++--
 drivers/target/target_core_iblock.c |  2 +-
 fs/aio.c                            |  2 +-
 fs/block_dev.c                      |  2 +-
 fs/btrfs/scrub.c                    |  2 +-
 fs/btrfs/transaction.c              |  2 +-
 fs/btrfs/tree-log.c                 | 12 ++++++------
 fs/btrfs/volumes.c                  |  6 +++---
 fs/buffer.c                         |  2 +-
 fs/direct-io.c                      |  2 +-
 fs/ext4/file.c                      |  2 +-
 fs/ext4/inode.c                     |  4 ++--
 fs/f2fs/checkpoint.c                |  2 +-
 fs/f2fs/gc.c                        |  2 +-
 fs/f2fs/node.c                      |  2 +-
 fs/gfs2/log.c                       |  2 +-
 fs/hpfs/buffer.c                    |  2 +-
 fs/jbd/checkpoint.c                 |  2 +-
 fs/jbd/commit.c                     |  4 ++--
 fs/jbd2/checkpoint.c                |  2 +-
 fs/jbd2/commit.c                    |  2 +-
 fs/mpage.c                          |  2 +-
 fs/nfs/blocklayout/blocklayout.c    |  4 ++--
 fs/xfs/xfs_buf.c                    |  4 ++--
 fs/xfs/xfs_dir2_readdir.c           |  2 +-
 fs/xfs/xfs_itable.c                 |  2 +-
 include/linux/blkdev.h              |  5 +++--
 mm/madvise.c                        |  2 +-
 mm/page-writeback.c                 |  2 +-
 mm/readahead.c                      |  2 +-
 mm/swap_state.c                     |  2 +-
 mm/vmscan.c                         |  2 +-
 43 files changed, 74 insertions(+), 70 deletions(-)


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
  2015-04-07 18:55     ` Jeff Moyer
@ 2015-04-08 16:13       ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2015-04-08 16:13 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jens Axboe, Ming Lei, Konrad Rzeszutek Wilk, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, Neil Brown, Nicholas A. Bellinger,
	Alexander Viro, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Jaegeuk Kim, Changman Lee,
	Steven Whitehouse, Mikulas Patocka, Andrew Morton, Rik van Riel,
	Johannes Weiner, Mel Gorman, Trond Myklebust

This looks good, but without the blk_finish_plug argument we're bound
to grow programming mistakes where people forget it.  Any chance we
could have annotations similar to, say, rcu_read_lock/rcu_read_unlock
or the spinlocks so that sparse warns about it?
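
(Something like sparse's context annotations might do it -- a rough sketch
only, the context name here is made up and this hasn't been run through
sparse:

	extern void blk_start_plug(struct blk_plug *) __acquires(blk_plug_ctx);
	extern void blk_finish_plug(void) __releases(blk_plug_ctx);

sparse should then complain about functions that start a plug and return
without finishing it.)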


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
  2015-04-06 19:14 ` [PATCH 2/2] blk-plug: don't flush nested plug lists Jeff Moyer
  2015-04-07  9:19   ` Ming Lei
  2015-04-07 18:55     ` Jeff Moyer
@ 2015-04-08 17:38   ` Jens Axboe
  2015-04-07 16:09     ` Shaohua Li
                       ` (2 more replies)
  2 siblings, 3 replies; 36+ messages in thread
From: Jens Axboe @ 2015-04-08 17:38 UTC (permalink / raw)
  To: Jeff Moyer, linux-kernel

On 04/06/2015 01:14 PM, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plug structs instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as little as 1 I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.
>
> Now, for the case where there is an elevator involved, this doesn't
> really matter too much.  The elevator will keep the I/O around long
> enough for it to be merged.  However, in cases where there is no
> elevator (like blk-mq), I/Os are simply dispatched immediately.
>
> Try this, for example (note I'm using a virtio-blk device, so it's
> using the blk-mq single queue path, though I've also reproduced this
> with the micron p320h):
>
> fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
>
> If you run that on a current kernel, you will get zero merges.  Zero!
> After this patch, you will get many merges (the actual number depends
> on how fast your storage is, obviously), and much better throughput.
> Here are results from my test rig:
>
> Unpatched kernel:
> Read B/W:    283,638 KB/s
> Read Merges: 0
>
> Patched kernel:
> Read B/W:    873,224 KB/s
> Read Merges: 2,046K
>
> I considered several approaches to solving the problem:
> 1)  get rid of the inner-most plugs
> 2)  handle nesting by using only one on-stack plug
> 2a) #2, except use a per-cpu blk_plug struct, which may clean up the
>      code a bit at the expense of memory footprint
>
> Option 1 will be tricky or impossible to do, since innermost plug
> lists are sometimes the only plug lists, depending on the call path.
> Option 2 is what this patch implements.  Option 2a is perhaps a better
> idea, but since I already implemented option 2, I figured I'd post it
> for comments and opinions before rewriting it.
>
> Much of the patch involves modifying call sites to blk_start_plug,
> since its signature is changed.  The meat of the patch is actually
> pretty simple and constrained to block/blk-core.c and
> include/linux/blkdev.h.  The only tricky bits were places where plugs
> were finished and then restarted to flush out I/O.  There, I went
> ahead and exported blk_flush_plug_list and called that directly.
>
> Comments would be greatly appreciated.

It's hard to argue with the increased merging for your case. The task 
plugs did originally work like you changed them to, not flushing until 
the outermost plug was flushed. Unfortunately I don't quite remember why 
I changed them; I'll have to do a bit of digging to refresh my memory.
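
To spell out the semantics that behavior gives us (a sketch assuming
the depth-counting blk_start_plug()/blk_finish_plug() from the v2
posting elsewhere in this thread):

	static void nested_plug_example(struct bio *bio)
	{
		struct blk_plug outer, inner;

		blk_start_plug(&outer);	/* depth 1, becomes current->plug */
		blk_start_plug(&inner);	/* nested: only bumps outer's depth */
		submit_bio(READ, bio);	/* queued on the outer plug's list */
		blk_finish_plug();	/* depth 2 -> 1, nothing flushed yet */
		blk_finish_plug();	/* depth 1 -> 0, flushes the whole list */
	}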

For cases where we don't do any merging (like nvme), we always want to 
flush. Well, almost: if we start to utilize batched submission, then 
the plug would still potentially help (just for reasons other than merging).

And agree with Ming, this can be cleaned up substantially. I'd also like 
to see some test results from the other end of the spectrum. Your posted 
case is clearly a best case (we missed tons of merging, now we don't); 
I'd like to see a normal case and a worst case result as well so we have 
an idea of what this would do to latencies.
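
For the no-merge end of the spectrum, something like a queue depth 1
random read job should approximate the worst case (the parameters
below are just a guess, against the same device as in the posting):

	fio --rw=randread --bs=4k --iodepth=1 --runtime=10s --direct=1 \
	    --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based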

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
  2015-04-08 17:38   ` [PATCH 2/2] " Jens Axboe
  2015-04-07 16:09     ` Shaohua Li
@ 2015-04-08 17:56     ` Jeff Moyer
  2015-04-16 15:47     ` Jeff Moyer
  2 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-08 17:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

Jens Axboe <axboe@kernel.dk> writes:

>> Comments would be greatly appreciated.
>
> It's hard to argue with the increased merging for your case. The task
> plugs did originally work like you changed them to, not flushing until
> the outermost plug was flushed. Unfortunately I don't quite remember
> why I changed them, will have to do a bit of digging to refresh my
> memory.

Let me know what you dig up.  

> For cases where we don't do any merging (like nvme), we always want to
> flush. Well almost, if we start do utilize the batched submission,
> then the plug would still potentially help (just for other reasons
> than merging).

It's never straight-forward.  :)

> And agree with Ming, this can be cleaned up substantially. I'd also

Let me know if you have any issues with the v2 posting.

> like to see some test results from the other end of the spectrum. Your
> posted case is clearly a best case (we missed tons of merging, now we
> don't); I'd like to see a normal case and a worst case result as well
> so we have an idea of what this would do to latencies.

As for other benchmark numbers, this work was inspired by two reports of
lower performance after converting virtio_blk to use blk-mq in RHEL.
The workloads were both synthetic, though.  I'd be happy to run a
battery of tests on the patch set, but if there is anything specific you
want to see (besides a workload that will have no merges), let me know.

I guess I should also mention that another solution to the problem might
be the mythical mq I/O scheduler.  :)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
  2015-04-07 18:55     ` Jeff Moyer
                         ` (2 preceding siblings ...)
  (?)
@ 2015-04-08 23:02       ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-04-08 23:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jens Axboe, Ming Lei, Konrad Rzeszutek Wilk, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, Neil Brown, Nicholas A. Bellinger,
	Alexander Viro, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Jaegeuk Kim, Changman Lee,
	Steven Whitehouse, Mikulas Patocka, Andrew Morton, Rik van Riel,
	Johannes Weiner, Mel Gorman, Trond Myklebust

On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plug structs instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as little as 1 I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.

I'm wondering what impact this will have on filesystem metadata IO
that needs to be issued immediately. e.g. we are doing writeback, so
there is a high level plug in place and we need to page in btree
blocks to do extent allocation. We do readahead at this point,
but it looks like this change will prevent the readahead from being
issued by the unplug in xfs_buf_iosubmit().
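
i.e. the usual pattern where the unplug has to actually issue the IO
because we're about to block on it.  Schematically (submit_readahead()
and wait_on_buffer_read() are hypothetical stand-ins for the real
buffer cache calls):

	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		submit_readahead(bufs[i]);	/* async read-ahead */
	blk_finish_plug();		/* the reads must hit the disk here... */
	wait_on_buffer_read(bufs[0]);	/* ...because we block on one now */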

So while I can see how this can make your single microbenchmark
better (because it's only doing concurrent direct IO to the block
device and hence there are no dependencies between individual IOs),
I have significant reservations that it's actually a win for
filesystem-based workloads where we need direct control of flushing
to minimise IO latency due to IO dependencies...

Patches like this one:

https://lkml.org/lkml/2015/3/20/442

show similar real-world workload improvements to your patchset by
being smarter about using high level plugging to enable cross-file
merging of IO, but it still relies on the lower layers of plugging
to resolve latency bubbles caused by IO dependencies in the
filesystems.

> NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
> would always flush the plug list.  After this patch, this is only the
> case for the outer-most plug.  If you require the plug list to be
> flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
> maintainers should take a close look at this patch and ensure they get
> the right behavior in the end.

IOWs, you are saying we need to change all our current unplugs to
blk_flush_plug(current) to maintain the same behaviour as we
currently have?
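
i.e. every such site would become something like (a sketch against the
v2 interface):

	blk_flush_plug(current);	/* force the IO out even when nested */
	blk_finish_plug();		/* just drops our nesting level */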

If that is the case, shouldn't you actually be trying to fix the
specific plugging problem you've identified (i.e. do_direct_IO() is
flushing far too frequently) rather than making a sweeping
generalisation that the IO stack plugging infrastructure
needs to be fundamentally changed?
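
The narrower fix would presumably look more like this in
do_blockdev_direct_IO() (an untested sketch against the existing,
unmodified plug interface):

	/*
	 * Hypothetical: only plug here when no higher-level plug (e.g.
	 * the one in do_io_submit()) is already active, and let the
	 * outer plug do the batching across the whole submission.
	 */
	struct blk_plug plug;
	bool use_plug = (current->plug == NULL);

	if (use_plug)
		blk_start_plug(&plug);
	/* ... map and submit the bios ... */
	if (use_plug)
		blk_finish_plug(&plug);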

Cheers,

Dave.

> 
> ---
> Changelog:
> v1->v2: Keep the blk_start_plug interface the same, suggested by Ming Lei.
> 
> Test results
> ------------
> Virtio-blk:
> 
> unpatched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
>   read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
>     slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
>     clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
>      lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
>     clat percentiles (usec):
>      |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
>      | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
>      | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
>      | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
>      | 99.99th=[36096]
>     bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
>     lat (usec) : 250=0.01%, 1000=0.01%
>     lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
>   cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec
> 
> Disk stats (read/write):
>   vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%
> 
> 
> patched:
> job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
>   read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
>     slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
>     clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
>      lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
>     clat percentiles (usec):
>      |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
>      | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
>      | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
>      | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
>      | 99.99th=[19840]
>     bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
>     lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
>   cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec
> 
> Disk stats (read/write):
>   vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%
> 
> micron p320h:
> 
> unpatched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
>   read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
>     slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
>     clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
>      lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
>     clat percentiles (usec):
>      |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
>      | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
>      | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
>      | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
>      | 99.99th=[12224]
>     bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
>     lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
>   cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec
> 
> Disk stats (read/write):
>   rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%
> 
> patched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
>   read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
>     slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
>     clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
>      lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
>     clat percentiles (usec):
>      |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
>      | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
>      | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
>      | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
>      | 99.99th=[ 9792]
>     bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
>     lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
>   cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec
> 
> Disk stats (read/write):
>   rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
> ---
>  block/blk-core.c                    | 29 ++++++++++++++++-------------
>  block/blk-lib.c                     |  2 +-
>  block/blk-throttle.c                |  2 +-
>  drivers/block/xen-blkback/blkback.c |  2 +-
>  drivers/md/dm-bufio.c               |  6 +++---
>  drivers/md/dm-crypt.c               |  2 +-
>  drivers/md/dm-kcopyd.c              |  2 +-
>  drivers/md/dm-thin.c                |  2 +-
>  drivers/md/md.c                     |  2 +-
>  drivers/md/raid1.c                  |  2 +-
>  drivers/md/raid10.c                 |  2 +-
>  drivers/md/raid5.c                  |  4 ++--
>  drivers/target/target_core_iblock.c |  2 +-
>  fs/aio.c                            |  2 +-
>  fs/block_dev.c                      |  2 +-
>  fs/btrfs/scrub.c                    |  2 +-
>  fs/btrfs/transaction.c              |  2 +-
>  fs/btrfs/tree-log.c                 | 12 ++++++------
>  fs/btrfs/volumes.c                  |  6 +++---
>  fs/buffer.c                         |  2 +-
>  fs/direct-io.c                      |  2 +-
>  fs/ext4/file.c                      |  2 +-
>  fs/ext4/inode.c                     |  4 ++--
>  fs/f2fs/checkpoint.c                |  2 +-
>  fs/f2fs/gc.c                        |  2 +-
>  fs/f2fs/node.c                      |  2 +-
>  fs/gfs2/log.c                       |  2 +-
>  fs/hpfs/buffer.c                    |  2 +-
>  fs/jbd/checkpoint.c                 |  2 +-
>  fs/jbd/commit.c                     |  4 ++--
>  fs/jbd2/checkpoint.c                |  2 +-
>  fs/jbd2/commit.c                    |  2 +-
>  fs/mpage.c                          |  2 +-
>  fs/nfs/blocklayout/blocklayout.c    |  4 ++--
>  fs/xfs/xfs_buf.c                    |  4 ++--
>  fs/xfs/xfs_dir2_readdir.c           |  2 +-
>  fs/xfs/xfs_itable.c                 |  2 +-
>  include/linux/blkdev.h              |  5 +++--
>  mm/madvise.c                        |  2 +-
>  mm/page-writeback.c                 |  2 +-
>  mm/readahead.c                      |  2 +-
>  mm/swap_state.c                     |  2 +-
>  mm/vmscan.c                         |  2 +-
>  43 files changed, 74 insertions(+), 70 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 794c3e7..fcd9c2f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3018,21 +3018,21 @@ void blk_start_plug(struct blk_plug *plug)
>  {
>  	struct task_struct *tsk = current;
>  
> +	if (tsk->plug) {
> +		tsk->plug->depth++;
> +		return;
> +	}
> +
> +	plug->depth = 1;
>  	INIT_LIST_HEAD(&plug->list);
>  	INIT_LIST_HEAD(&plug->mq_list);
>  	INIT_LIST_HEAD(&plug->cb_list);
>  
>  	/*
> -	 * If this is a nested plug, don't actually assign it. It will be
> -	 * flushed on its own.
> +	 * Store ordering should not be needed here, since a potential
> +	 * preempt will imply a full memory barrier
>  	 */
> -	if (!tsk->plug) {
> -		/*
> -		 * Store ordering should not be needed here, since a potential
> -		 * preempt will imply a full memory barrier
> -		 */
> -		tsk->plug = plug;
> -	}
> +	tsk->plug = plug;
>  }
>  EXPORT_SYMBOL(blk_start_plug);
>  
> @@ -3177,12 +3177,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  	local_irq_restore(flags);
>  }
>  
> -void blk_finish_plug(struct blk_plug *plug)
> +void blk_finish_plug(void)
>  {
> -	blk_flush_plug_list(plug, false);
> +	struct blk_plug *plug = current->plug;
>  
> -	if (plug == current->plug)
> -		current->plug = NULL;
> +	if (--plug->depth > 0)
> +		return;
> +
> +	blk_flush_plug_list(plug, false);
> +	current->plug = NULL;
>  }
>  EXPORT_SYMBOL(blk_finish_plug);
>  
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 7688ee3..ac347d3 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		 */
>  		cond_resched();
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Wait for bios in-flight */
>  	if (!atomic_dec_and_test(&bb.done))
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 5b9c6d5..222a77a 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1281,7 +1281,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>  		blk_start_plug(&plug);
>  		while((bio = bio_list_pop(&bio_list_on_stack)))
>  			generic_make_request(bio);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  	}
>  }
>  
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 2a04d34..74bea21 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1374,7 +1374,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  		submit_bio(operation, biolist[i]);
>  
>  	/* Let the I/Os go.. */
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	if (operation == READ)
>  		blkif->st_rd_sect += preq.nr_sects;
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 86dbbc7..502c63b 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
>  		submit_io(b, WRITE, b->block, write_endio);
>  		dm_bufio_cond_resched();
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1126,7 +1126,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>  				&write_list);
>  		if (unlikely(!list_empty(&write_list))) {
>  			dm_bufio_unlock(c);
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			__flush_write_list(&write_list);
>  			blk_start_plug(&plug);
>  			dm_bufio_lock(c);
> @@ -1149,7 +1149,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>  	dm_bufio_unlock(c);
>  
>  flush_plug:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
>  
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 713a962..65d7b72 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1224,7 +1224,7 @@ pop_from_list:
>  			rb_erase(&io->rb_node, &write_tree);
>  			kcryptd_io_write(io);
>  		} while (!RB_EMPTY_ROOT(&write_tree));
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  	}
>  	return 0;
>  }
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade..4a76e42 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -593,7 +593,7 @@ static void do_work(struct work_struct *work)
>  	process_jobs(&kc->complete_jobs, kc, run_complete_job);
>  	process_jobs(&kc->pages_jobs, kc, run_pages_job);
>  	process_jobs(&kc->io_jobs, kc, run_io_job);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 921aafd..be42bf5 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>  			dm_pool_issue_prefetches(pool->pmd);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int cmp_cells(const void *lhs, const void *rhs)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 717daad..c4ec179 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
>  	/*
>  	 * this also signals 'finished resyncing' to md_stop
>  	 */
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>  
>  	/* tell personality that we are finished */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d34e238..4f8fad4 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r1conf *conf)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a7196c4..92bb5dd 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r10conf *conf)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cd2f96b..695bf0f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
>  	pr_debug("%d stripes handled\n", handled);
>  
>  	spin_unlock_irq(&conf->device_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5worker inactive\n");
>  }
> @@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	async_tx_issue_pending_all();
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5d inactive\n");
>  }
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d4a4b0f..17d8730 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
>  	blk_start_plug(&plug);
>  	while ((bio = bio_list_pop(list)))
>  		submit_bio(rw, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void iblock_end_io_flush(struct bio *bio, int err)
> diff --git a/fs/aio.c b/fs/aio.c
> index f8e52a1..b873698 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	percpu_ref_put(&ctx->users);
>  	return i ? i : ret;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 975266b..f5848de 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		if (err < 0)
>  			ret = err;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_write_iter);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ec57687..f314cfb8 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3316,7 +3316,7 @@ out:
>  	scrub_wr_submit(sctx);
>  	mutex_unlock(&sctx->wr_ctx.wr_lock);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	btrfs_free_path(path);
>  	btrfs_free_path(ppath);
>  	return ret < 0 ? ret : 0;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8be4278..fee10af 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>  
>  	if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c5b8ba3..879c7fd 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
>  	if (ret) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_abort_transaction(trans, root, ret);
>  		btrfs_free_logged_extents(log, log_transid);
>  		btrfs_set_log_full_commit(root->fs_info, trans);
> @@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		if (!list_empty(&root_log_ctx.list))
>  			list_del_init(&root_log_ctx.list);
>  
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  
>  		if (ret != -ENOSPC) {
> @@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	}
>  
>  	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = root_log_ctx.log_ret;
>  		goto out;
> @@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  
>  	index2 = root_log_ctx.log_transid % 2;
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>  						mark);
>  		btrfs_wait_logged_extents(trans, log, log_transid);
> @@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	 * check the full commit flag again
>  	 */
>  	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
> @@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	ret = btrfs_write_marked_extents(log_root_tree,
>  					 &log_root_tree->dirty_log_pages,
>  					 EXTENT_DIRTY | EXTENT_NEW);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (ret) {
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  		btrfs_abort_transaction(trans, root, ret);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..16db068 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -358,7 +358,7 @@ loop_lock:
>  		if (pending_bios == &device->pending_sync_bios) {
>  			sync_pending = 1;
>  		} else if (sync_pending) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -415,7 +415,7 @@ loop_lock:
>  		}
>  		/* unplug every 64 requests just for good measure */
>  		if (batch_run % 64 == 0) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -431,7 +431,7 @@ loop_lock:
>  	spin_unlock(&device->io_lock);
>  
>  done:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void pending_bios_fn(struct btrfs_work *work)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..8181c44 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>  	}
>  
>  	spin_unlock(lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	spin_lock(lock);
>  
>  	while (!list_empty(&tmp)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..16f16ed 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	if (sdio.bio)
>  		dio_bio_submit(dio, &sdio);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * It is possible that, we return short IO due to end of file.
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..3a293eb 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			ret = err;
>  	}
>  	if (o_direct)
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  
>  errout:
>  	if (aio_mutex)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..90ce0cb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
>  
>  		blk_start_plug(&plug);
>  		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		goto out_writepages;
>  	}
>  
> @@ -2438,7 +2438,7 @@ retry:
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (!ret && !cycled && wbc->nr_to_write > 0) {
>  		cycled = 1;
>  		mpd.last_page = writeback_index - 1;
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 7f794b7..86ba453 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -846,7 +846,7 @@ retry_flush_nodes:
>  		goto retry_flush_nodes;
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return err;
>  }
>  
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 76adbc3..abeef77 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
>  		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
>  		break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
>  	stat_inc_call_count(sbi->stat_info);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 97bd9d3..c4aa9e2 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1098,7 +1098,7 @@ repeat:
>  		ra_node_page(sbi, nid);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lock_page(page);
>  	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index 536e7a6..06f25d17 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -159,7 +159,7 @@ restart:
>  			goto restart;
>  	}
>  	spin_unlock(&sdp->sd_ail_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	trace_gfs2_ail_flush(sdp, wbc, 0);
>  }
>  
> diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
> index 8057fe4..138462d 100644
> --- a/fs/hpfs/buffer.c
> +++ b/fs/hpfs/buffer.c
> @@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
>  		secno++;
>  		n--;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /* Map a sector into a buffer and return pointers to it and to the buffer. */
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index 08c0304..cd6b09f 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = bhs[i];
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index bb217dc..e1046c3 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
>  	blk_start_plug(&plug);
>  	err = journal_submit_data_buffers(journal, commit_transaction,
>  					  write_op);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * Wait for all previously submitted IO to complete.
> @@ -697,7 +697,7 @@ start_journal_io:
>  		}
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 988b32e..6aa0039 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = journal->j_chkpt_bhs[i];
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..8f532c8 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -805,7 +805,7 @@ start_journal_io:
>  			__jbd2_journal_abort_hard(journal);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..bf7d6c3 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
>  		if (mpd.bio)
>  			mpage_bio_submit(WRITE, mpd.bio);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepages);
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 1cac3c1..e93b6a8 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
>  	}
>  out:
>  	bl_submit_bio(READ, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> @@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
>  	header->res.count = header->args.count;
>  out:
>  	bl_submit_bio(WRITE, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1790b00..2f89ca2 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
>  		if (size <= 0)
>  			break;	/* all done */
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
>  
>  		xfs_buf_submit(bp);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return pinned;
>  }
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 098cd78..7e8fa3f 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
>  			}
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  out:
>  	*bpp = bp;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 82e3142..c3ac5ec 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
>  					     &xfs_inode_buf_ops);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..188133f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
>   * schedule() where blk_schedule_flush_plug() is called.
>   */
>  struct blk_plug {
> +	int depth; /* number of nested plugs */
>  	struct list_head list; /* requests */
>  	struct list_head mq_list; /* blk-mq requests */
>  	struct list_head cb_list; /* md requires an unplug callback */
> @@ -1107,7 +1108,7 @@ struct blk_plug_cb {
>  extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
>  					     void *data, int size);
>  extern void blk_start_plug(struct blk_plug *);
> -extern void blk_finish_plug(struct blk_plug *);
> +extern void blk_finish_plug(void);
>  extern void blk_flush_plug_list(struct blk_plug *, bool);
>  
>  static inline void blk_flush_plug(struct task_struct *tsk)
> @@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
>  {
>  }
>  
> -static inline void blk_finish_plug(struct blk_plug *plug)
> +static inline void blk_finish_plug(void)
>  {
>  }
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d551475..18a34ee 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  			vma = find_vma(current->mm, start);
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (write)
>  		up_write(&current->mm->mmap_sem);
>  	else
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 644bcb6..4570f6e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
>  
>  	blk_start_plug(&plug);
>  	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9356758..64182a2 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>  	ret = 0;
>  
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return ret;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 405923f..5721f64 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			SetPageReadahead(page);
>  		page_cache_release(page);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  skip:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd..56bb274 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>  
>  		scan_adjusted = true;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	sc->nr_reclaimed += nr_reclaimed;
>  
>  	/*
> -- 
> 1.8.3.1
> 
> 

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a7196c4..92bb5dd 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r10conf *conf)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cd2f96b..695bf0f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
>  	pr_debug("%d stripes handled\n", handled);
>  
>  	spin_unlock_irq(&conf->device_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5worker inactive\n");
>  }
> @@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	async_tx_issue_pending_all();
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5d inactive\n");
>  }
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d4a4b0f..17d8730 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
>  	blk_start_plug(&plug);
>  	while ((bio = bio_list_pop(list)))
>  		submit_bio(rw, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void iblock_end_io_flush(struct bio *bio, int err)
> diff --git a/fs/aio.c b/fs/aio.c
> index f8e52a1..b873698 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	percpu_ref_put(&ctx->users);
>  	return i ? i : ret;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 975266b..f5848de 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		if (err < 0)
>  			ret = err;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_write_iter);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ec57687..f314cfb8 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3316,7 +3316,7 @@ out:
>  	scrub_wr_submit(sctx);
>  	mutex_unlock(&sctx->wr_ctx.wr_lock);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	btrfs_free_path(path);
>  	btrfs_free_path(ppath);
>  	return ret < 0 ? ret : 0;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8be4278..fee10af 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>  
>  	if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c5b8ba3..879c7fd 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
>  	if (ret) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_abort_transaction(trans, root, ret);
>  		btrfs_free_logged_extents(log, log_transid);
>  		btrfs_set_log_full_commit(root->fs_info, trans);
> @@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		if (!list_empty(&root_log_ctx.list))
>  			list_del_init(&root_log_ctx.list);
>  
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  
>  		if (ret != -ENOSPC) {
> @@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	}
>  
>  	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = root_log_ctx.log_ret;
>  		goto out;
> @@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  
>  	index2 = root_log_ctx.log_transid % 2;
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>  						mark);
>  		btrfs_wait_logged_extents(trans, log, log_transid);
> @@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	 * check the full commit flag again
>  	 */
>  	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
> @@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	ret = btrfs_write_marked_extents(log_root_tree,
>  					 &log_root_tree->dirty_log_pages,
>  					 EXTENT_DIRTY | EXTENT_NEW);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (ret) {
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  		btrfs_abort_transaction(trans, root, ret);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..16db068 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -358,7 +358,7 @@ loop_lock:
>  		if (pending_bios == &device->pending_sync_bios) {
>  			sync_pending = 1;
>  		} else if (sync_pending) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -415,7 +415,7 @@ loop_lock:
>  		}
>  		/* unplug every 64 requests just for good measure */
>  		if (batch_run % 64 == 0) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -431,7 +431,7 @@ loop_lock:
>  	spin_unlock(&device->io_lock);
>  
>  done:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void pending_bios_fn(struct btrfs_work *work)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..8181c44 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>  	}
>  
>  	spin_unlock(lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	spin_lock(lock);
>  
>  	while (!list_empty(&tmp)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..16f16ed 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	if (sdio.bio)
>  		dio_bio_submit(dio, &sdio);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * It is possible that, we return short IO due to end of file.
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..3a293eb 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			ret = err;
>  	}
>  	if (o_direct)
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  
>  errout:
>  	if (aio_mutex)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..90ce0cb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
>  
>  		blk_start_plug(&plug);
>  		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		goto out_writepages;
>  	}
>  
> @@ -2438,7 +2438,7 @@ retry:
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (!ret && !cycled && wbc->nr_to_write > 0) {
>  		cycled = 1;
>  		mpd.last_page = writeback_index - 1;
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 7f794b7..86ba453 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -846,7 +846,7 @@ retry_flush_nodes:
>  		goto retry_flush_nodes;
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return err;
>  }
>  
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 76adbc3..abeef77 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
>  		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
>  		break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
>  	stat_inc_call_count(sbi->stat_info);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 97bd9d3..c4aa9e2 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1098,7 +1098,7 @@ repeat:
>  		ra_node_page(sbi, nid);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lock_page(page);
>  	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index 536e7a6..06f25d17 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -159,7 +159,7 @@ restart:
>  			goto restart;
>  	}
>  	spin_unlock(&sdp->sd_ail_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	trace_gfs2_ail_flush(sdp, wbc, 0);
>  }
>  
> diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
> index 8057fe4..138462d 100644
> --- a/fs/hpfs/buffer.c
> +++ b/fs/hpfs/buffer.c
> @@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
>  		secno++;
>  		n--;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /* Map a sector into a buffer and return pointers to it and to the buffer. */
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index 08c0304..cd6b09f 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = bhs[i];
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index bb217dc..e1046c3 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
>  	blk_start_plug(&plug);
>  	err = journal_submit_data_buffers(journal, commit_transaction,
>  					  write_op);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * Wait for all previously submitted IO to complete.
> @@ -697,7 +697,7 @@ start_journal_io:
>  		}
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 988b32e..6aa0039 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = journal->j_chkpt_bhs[i];
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..8f532c8 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -805,7 +805,7 @@ start_journal_io:
>  			__jbd2_journal_abort_hard(journal);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..bf7d6c3 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
>  		if (mpd.bio)
>  			mpage_bio_submit(WRITE, mpd.bio);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepages);
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 1cac3c1..e93b6a8 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
>  	}
>  out:
>  	bl_submit_bio(READ, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> @@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
>  	header->res.count = header->args.count;
>  out:
>  	bl_submit_bio(WRITE, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1790b00..2f89ca2 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
>  		if (size <= 0)
>  			break;	/* all done */
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
>  
>  		xfs_buf_submit(bp);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return pinned;
>  }
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 098cd78..7e8fa3f 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
>  			}
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  out:
>  	*bpp = bp;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 82e3142..c3ac5ec 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
>  					     &xfs_inode_buf_ops);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..188133f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
>   * schedule() where blk_schedule_flush_plug() is called.
>   */
>  struct blk_plug {
> +	int depth; /* number of nested plugs */
>  	struct list_head list; /* requests */
>  	struct list_head mq_list; /* blk-mq requests */
>  	struct list_head cb_list; /* md requires an unplug callback */
> @@ -1107,7 +1108,7 @@ struct blk_plug_cb {
>  extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
>  					     void *data, int size);
>  extern void blk_start_plug(struct blk_plug *);
> -extern void blk_finish_plug(struct blk_plug *);
> +extern void blk_finish_plug(void);
>  extern void blk_flush_plug_list(struct blk_plug *, bool);
>  
>  static inline void blk_flush_plug(struct task_struct *tsk)
> @@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
>  {
>  }
>  
> -static inline void blk_finish_plug(struct blk_plug *plug)
> +static inline void blk_finish_plug(void)
>  {
>  }
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d551475..18a34ee 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  			vma = find_vma(current->mm, start);
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (write)
>  		up_write(&current->mm->mmap_sem);
>  	else
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 644bcb6..4570f6e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
>  
>  	blk_start_plug(&plug);
>  	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9356758..64182a2 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>  	ret = 0;
>  
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return ret;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 405923f..5721f64 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			SetPageReadahead(page);
>  		page_cache_release(page);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  skip:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd..56bb274 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>  
>  		scan_adjusted = true;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	sc->nr_reclaimed += nr_reclaimed;
>  
>  	/*
> -- 
> 1.8.3.1
> 
> 

-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
@ 2015-04-08 23:02       ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-04-08 23:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jens Axboe, Ming Lei, Konrad Rzeszutek Wilk, Roger Pau Monné,
	Alasdair Kergon, Mike Snitzer, Neil Brown, Nicholas A. Bellinger,
	Alexander Viro, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Jaegeuk Kim, Changman Lee,
	Steven Whitehouse, Mikulas Patocka, Andrew Morton, Rik van Riel,
	Johannes Weiner, Mel Gorman, Trond Myklebust, Anna Schumaker,
	xfs, Christoph Hellwig, Weston Andros Adamson,
	Martin K. Petersen, Sagi Grimberg, Tejun Heo, Fabian Frederick,
	Matthew Wilcox, Ming Lei, Kirill A. Shutemov, Wang Sheng-Hui,
	Michal Hocko, Joe Perches, Miklos Szeredi, Namjae Jeon,
	Mark Rustad, Jianyu Zhan, Fengguang Wu, Vladimir Davydov,
	Vlastimil Babka, Suleiman Souhlal, linux-kernel, dm-devel,
	xen-devel, linux-raid, linux-scsi, target-devel, linux-fsdevel,
	linux-aio, linux-btrfs, linux-ext4, linux-f2fs-devel,
	cluster-devel, linux-nfs, linux-mm

On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the dirct I/O code.
> The reason is that there are blk_plug-s instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as little as 1 I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.

I'm wondering what impact this will have on filesystem metadata IO
that needs to be issued immediately. e.g. we are doing writeback, so
there is a high level plug in place and we need to page in btree
blocks to do extent allocation. We do readahead at this point,
but it looks like this change will prevent the readahead from being
issued by the unplug in xfs_buf_iosubmit().

So while I can see how this can make your single microbenchmark
better (because it's only doing concurrent direct IO to the block
device and hence there are no dependencies between individual IOs),
I have significant reservations that it's actually a win for
filesystem-based workloads where we need direct control of flushing
to minimise IO latency due to IO dependencies...

Patches like this one:

https://lkml.org/lkml/2015/3/20/442

show similar real-world workload improvements to your patchset by
being smarter about using high level plugging to enable cross-file
merging of IO, but it still relies on the lower layers of plugging
to resolve latency bubbles caused by IO dependencies in the
filesystems.

> NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
> would always flush the plug list.  After this patch, this is only the
> case for the outer-most plug.  If you require the plug list to be
> flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
> maintainers should take a close look at this patch and ensure they get
> the right behavior in the end.

IOWs, you are saying we need to change all our current unplugs to
blk_flush_plug(current) to maintain the same behaviour as we
currently have?

If that is the case, shouldn't you actually be trying to fix the
specific plugging problem you've identified (i.e. do_direct_IO() is
flushing far too frequently) rather than making a sweeping
generalisation that the IO stack plugging infrastructure
needs to be fundamentally changed?

Cheers,

Dave.

> 
> ---
> Changelog:
> v1->v2: Keep the blk_start_plug interface the same, suggested by Ming Lei.
> 
> Test results
> ------------
> Virtio-blk:
> 
> unpatched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
>   read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
>     slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
>     clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
>      lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
>     clat percentiles (usec):
>      |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
>      | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
>      | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
>      | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
>      | 99.99th=[36096]
>     bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
>     lat (usec) : 250=0.01%, 1000=0.01%
>     lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
>   cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec
> 
> Disk stats (read/write):
>   vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%
> 
> 
> patched:
> job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
>   read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
>     slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
>     clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
>      lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
>     clat percentiles (usec):
>      |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
>      | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
>      | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
>      | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
>      | 99.99th=[19840]
>     bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
>     lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
>   cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec
> 
> Disk stats (read/write):
>   vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%
> 
> micron p320h:
> 
> unpatched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
>   read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
>     slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
>     clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
>      lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
>     clat percentiles (usec):
>      |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
>      | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
>      | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
>      | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
>      | 99.99th=[12224]
>     bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
>     lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
>   cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec
> 
> Disk stats (read/write):
>   rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%
> 
> patched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
>   read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
>     slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
>     clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
>      lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
>     clat percentiles (usec):
>      |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
>      | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
>      | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
>      | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
>      | 99.99th=[ 9792]
>     bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
>     lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
>   cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec
> 
> Disk stats (read/write):
>   rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
> ---
>  block/blk-core.c                    | 29 ++++++++++++++++-------------
>  block/blk-lib.c                     |  2 +-
>  block/blk-throttle.c                |  2 +-
>  drivers/block/xen-blkback/blkback.c |  2 +-
>  drivers/md/dm-bufio.c               |  6 +++---
>  drivers/md/dm-crypt.c               |  2 +-
>  drivers/md/dm-kcopyd.c              |  2 +-
>  drivers/md/dm-thin.c                |  2 +-
>  drivers/md/md.c                     |  2 +-
>  drivers/md/raid1.c                  |  2 +-
>  drivers/md/raid10.c                 |  2 +-
>  drivers/md/raid5.c                  |  4 ++--
>  drivers/target/target_core_iblock.c |  2 +-
>  fs/aio.c                            |  2 +-
>  fs/block_dev.c                      |  2 +-
>  fs/btrfs/scrub.c                    |  2 +-
>  fs/btrfs/transaction.c              |  2 +-
>  fs/btrfs/tree-log.c                 | 12 ++++++------
>  fs/btrfs/volumes.c                  |  6 +++---
>  fs/buffer.c                         |  2 +-
>  fs/direct-io.c                      |  2 +-
>  fs/ext4/file.c                      |  2 +-
>  fs/ext4/inode.c                     |  4 ++--
>  fs/f2fs/checkpoint.c                |  2 +-
>  fs/f2fs/gc.c                        |  2 +-
>  fs/f2fs/node.c                      |  2 +-
>  fs/gfs2/log.c                       |  2 +-
>  fs/hpfs/buffer.c                    |  2 +-
>  fs/jbd/checkpoint.c                 |  2 +-
>  fs/jbd/commit.c                     |  4 ++--
>  fs/jbd2/checkpoint.c                |  2 +-
>  fs/jbd2/commit.c                    |  2 +-
>  fs/mpage.c                          |  2 +-
>  fs/nfs/blocklayout/blocklayout.c    |  4 ++--
>  fs/xfs/xfs_buf.c                    |  4 ++--
>  fs/xfs/xfs_dir2_readdir.c           |  2 +-
>  fs/xfs/xfs_itable.c                 |  2 +-
>  include/linux/blkdev.h              |  5 +++--
>  mm/madvise.c                        |  2 +-
>  mm/page-writeback.c                 |  2 +-
>  mm/readahead.c                      |  2 +-
>  mm/swap_state.c                     |  2 +-
>  mm/vmscan.c                         |  2 +-
>  43 files changed, 74 insertions(+), 70 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 794c3e7..fcd9c2f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3018,21 +3018,21 @@ void blk_start_plug(struct blk_plug *plug)
>  {
>  	struct task_struct *tsk = current;
>  
> +	if (tsk->plug) {
> +		tsk->plug->depth++;
> +		return;
> +	}
> +
> +	plug->depth = 1;
>  	INIT_LIST_HEAD(&plug->list);
>  	INIT_LIST_HEAD(&plug->mq_list);
>  	INIT_LIST_HEAD(&plug->cb_list);
>  
>  	/*
> -	 * If this is a nested plug, don't actually assign it. It will be
> -	 * flushed on its own.
> +	 * Store ordering should not be needed here, since a potential
> +	 * preempt will imply a full memory barrier
>  	 */
> -	if (!tsk->plug) {
> -		/*
> -		 * Store ordering should not be needed here, since a potential
> -		 * preempt will imply a full memory barrier
> -		 */
> -		tsk->plug = plug;
> -	}
> +	tsk->plug = plug;
>  }
>  EXPORT_SYMBOL(blk_start_plug);
>  
> @@ -3177,12 +3177,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  	local_irq_restore(flags);
>  }
>  
> -void blk_finish_plug(struct blk_plug *plug)
> +void blk_finish_plug(void)
>  {
> -	blk_flush_plug_list(plug, false);
> +	struct blk_plug *plug = current->plug;
>  
> -	if (plug == current->plug)
> -		current->plug = NULL;
> +	if (--plug->depth > 0)
> +		return;
> +
> +	blk_flush_plug_list(plug, false);
> +	current->plug = NULL;
>  }
>  EXPORT_SYMBOL(blk_finish_plug);
>  
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 7688ee3..ac347d3 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		 */
>  		cond_resched();
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Wait for bios in-flight */
>  	if (!atomic_dec_and_test(&bb.done))
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 5b9c6d5..222a77a 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1281,7 +1281,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>  		blk_start_plug(&plug);
>  		while((bio = bio_list_pop(&bio_list_on_stack)))
>  			generic_make_request(bio);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  	}
>  }
>  
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 2a04d34..74bea21 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1374,7 +1374,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  		submit_bio(operation, biolist[i]);
>  
>  	/* Let the I/Os go.. */
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	if (operation == READ)
>  		blkif->st_rd_sect += preq.nr_sects;
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 86dbbc7..502c63b 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
>  		submit_io(b, WRITE, b->block, write_endio);
>  		dm_bufio_cond_resched();
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1126,7 +1126,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>  				&write_list);
>  		if (unlikely(!list_empty(&write_list))) {
>  			dm_bufio_unlock(c);
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			__flush_write_list(&write_list);
>  			blk_start_plug(&plug);
>  			dm_bufio_lock(c);
> @@ -1149,7 +1149,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>  	dm_bufio_unlock(c);
>  
>  flush_plug:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
>  
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 713a962..65d7b72 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1224,7 +1224,7 @@ pop_from_list:
>  			rb_erase(&io->rb_node, &write_tree);
>  			kcryptd_io_write(io);
>  		} while (!RB_EMPTY_ROOT(&write_tree));
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  	}
>  	return 0;
>  }
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade..4a76e42 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -593,7 +593,7 @@ static void do_work(struct work_struct *work)
>  	process_jobs(&kc->complete_jobs, kc, run_complete_job);
>  	process_jobs(&kc->pages_jobs, kc, run_pages_job);
>  	process_jobs(&kc->io_jobs, kc, run_io_job);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 921aafd..be42bf5 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>  			dm_pool_issue_prefetches(pool->pmd);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int cmp_cells(const void *lhs, const void *rhs)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 717daad..c4ec179 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
>  	/*
>  	 * this also signals 'finished resyncing' to md_stop
>  	 */
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>  
>  	/* tell personality that we are finished */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d34e238..4f8fad4 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r1conf *conf)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a7196c4..92bb5dd 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r10conf *conf)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cd2f96b..695bf0f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
>  	pr_debug("%d stripes handled\n", handled);
>  
>  	spin_unlock_irq(&conf->device_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5worker inactive\n");
>  }
> @@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	async_tx_issue_pending_all();
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5d inactive\n");
>  }
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d4a4b0f..17d8730 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
>  	blk_start_plug(&plug);
>  	while ((bio = bio_list_pop(list)))
>  		submit_bio(rw, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void iblock_end_io_flush(struct bio *bio, int err)
> diff --git a/fs/aio.c b/fs/aio.c
> index f8e52a1..b873698 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	percpu_ref_put(&ctx->users);
>  	return i ? i : ret;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 975266b..f5848de 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		if (err < 0)
>  			ret = err;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_write_iter);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ec57687..f314cfb8 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3316,7 +3316,7 @@ out:
>  	scrub_wr_submit(sctx);
>  	mutex_unlock(&sctx->wr_ctx.wr_lock);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	btrfs_free_path(path);
>  	btrfs_free_path(ppath);
>  	return ret < 0 ? ret : 0;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8be4278..fee10af 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>  
>  	if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c5b8ba3..879c7fd 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
>  	if (ret) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_abort_transaction(trans, root, ret);
>  		btrfs_free_logged_extents(log, log_transid);
>  		btrfs_set_log_full_commit(root->fs_info, trans);
> @@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		if (!list_empty(&root_log_ctx.list))
>  			list_del_init(&root_log_ctx.list);
>  
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  
>  		if (ret != -ENOSPC) {
> @@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	}
>  
>  	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = root_log_ctx.log_ret;
>  		goto out;
> @@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  
>  	index2 = root_log_ctx.log_transid % 2;
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>  						mark);
>  		btrfs_wait_logged_extents(trans, log, log_transid);
> @@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	 * check the full commit flag again
>  	 */
>  	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
> @@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	ret = btrfs_write_marked_extents(log_root_tree,
>  					 &log_root_tree->dirty_log_pages,
>  					 EXTENT_DIRTY | EXTENT_NEW);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (ret) {
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  		btrfs_abort_transaction(trans, root, ret);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..16db068 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -358,7 +358,7 @@ loop_lock:
>  		if (pending_bios == &device->pending_sync_bios) {
>  			sync_pending = 1;
>  		} else if (sync_pending) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -415,7 +415,7 @@ loop_lock:
>  		}
>  		/* unplug every 64 requests just for good measure */
>  		if (batch_run % 64 == 0) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -431,7 +431,7 @@ loop_lock:
>  	spin_unlock(&device->io_lock);
>  
>  done:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void pending_bios_fn(struct btrfs_work *work)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..8181c44 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>  	}
>  
>  	spin_unlock(lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	spin_lock(lock);
>  
>  	while (!list_empty(&tmp)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..16f16ed 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	if (sdio.bio)
>  		dio_bio_submit(dio, &sdio);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * It is possible that, we return short IO due to end of file.
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..3a293eb 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			ret = err;
>  	}
>  	if (o_direct)
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  
>  errout:
>  	if (aio_mutex)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..90ce0cb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
>  
>  		blk_start_plug(&plug);
>  		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		goto out_writepages;
>  	}
>  
> @@ -2438,7 +2438,7 @@ retry:
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (!ret && !cycled && wbc->nr_to_write > 0) {
>  		cycled = 1;
>  		mpd.last_page = writeback_index - 1;
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 7f794b7..86ba453 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -846,7 +846,7 @@ retry_flush_nodes:
>  		goto retry_flush_nodes;
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return err;
>  }
>  
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 76adbc3..abeef77 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
>  		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
>  		break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
>  	stat_inc_call_count(sbi->stat_info);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 97bd9d3..c4aa9e2 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1098,7 +1098,7 @@ repeat:
>  		ra_node_page(sbi, nid);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lock_page(page);
>  	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index 536e7a6..06f25d17 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -159,7 +159,7 @@ restart:
>  			goto restart;
>  	}
>  	spin_unlock(&sdp->sd_ail_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	trace_gfs2_ail_flush(sdp, wbc, 0);
>  }
>  
> diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
> index 8057fe4..138462d 100644
> --- a/fs/hpfs/buffer.c
> +++ b/fs/hpfs/buffer.c
> @@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
>  		secno++;
>  		n--;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /* Map a sector into a buffer and return pointers to it and to the buffer. */
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index 08c0304..cd6b09f 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = bhs[i];
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index bb217dc..e1046c3 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
>  	blk_start_plug(&plug);
>  	err = journal_submit_data_buffers(journal, commit_transaction,
>  					  write_op);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * Wait for all previously submitted IO to complete.
> @@ -697,7 +697,7 @@ start_journal_io:
>  		}
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 988b32e..6aa0039 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = journal->j_chkpt_bhs[i];
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..8f532c8 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -805,7 +805,7 @@ start_journal_io:
>  			__jbd2_journal_abort_hard(journal);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..bf7d6c3 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
>  		if (mpd.bio)
>  			mpage_bio_submit(WRITE, mpd.bio);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepages);
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 1cac3c1..e93b6a8 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
>  	}
>  out:
>  	bl_submit_bio(READ, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> @@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
>  	header->res.count = header->args.count;
>  out:
>  	bl_submit_bio(WRITE, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1790b00..2f89ca2 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
>  		if (size <= 0)
>  			break;	/* all done */
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
>  
>  		xfs_buf_submit(bp);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return pinned;
>  }
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 098cd78..7e8fa3f 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
>  			}
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  out:
>  	*bpp = bp;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 82e3142..c3ac5ec 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
>  					     &xfs_inode_buf_ops);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..188133f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
>   * schedule() where blk_schedule_flush_plug() is called.
>   */
>  struct blk_plug {
> +	int depth; /* number of nested plugs */
>  	struct list_head list; /* requests */
>  	struct list_head mq_list; /* blk-mq requests */
>  	struct list_head cb_list; /* md requires an unplug callback */
> @@ -1107,7 +1108,7 @@ struct blk_plug_cb {
>  extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
>  					     void *data, int size);
>  extern void blk_start_plug(struct blk_plug *);
> -extern void blk_finish_plug(struct blk_plug *);
> +extern void blk_finish_plug(void);
>  extern void blk_flush_plug_list(struct blk_plug *, bool);
>  
>  static inline void blk_flush_plug(struct task_struct *tsk)
> @@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
>  {
>  }
>  
> -static inline void blk_finish_plug(struct blk_plug *plug)
> +static inline void blk_finish_plug(void)
>  {
>  }
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d551475..18a34ee 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  			vma = find_vma(current->mm, start);
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (write)
>  		up_write(&current->mm->mmap_sem);
>  	else
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 644bcb6..4570f6e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
>  
>  	blk_start_plug(&plug);
>  	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9356758..64182a2 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>  	ret = 0;
>  
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return ret;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 405923f..5721f64 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			SetPageReadahead(page);
>  		page_cache_release(page);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  skip:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd..56bb274 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>  
>  		scan_adjusted = true;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	sc->nr_reclaimed += nr_reclaimed;
>  
>  	/*
> -- 
> 1.8.3.1
> 
> 

-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Cluster-devel] [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
@ 2015-04-08 23:02       ` Dave Chinner
  0 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-04-08 23:02 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read:
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plug-s instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as few as one I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.
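
Schematically, the nesting Jeff describes looks like this (a condensed
illustration of the two call sites named above, not the literal
fs/aio.c or fs/direct-io.c code; the helper names are made up):

	/* do_io_submit(): outer plug around the whole io_submit batch */
	struct blk_plug plug;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		submit_one_iocb();	/* eventually reaches do_direct_IO() */
	blk_finish_plug(&plug);

	/* do_blockdev_direct_IO(): inner, nested plug */
	struct blk_plug plug;

	blk_start_plug(&plug);
	issue_queued_bios();		/* possibly a single small I/O */
	blk_finish_plug(&plug);		/* per the description above, this
					   can fire with as few as one I/O
					   queued */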

I'm wondering what impact this will have on filesystem metadata IO
that needs to be issued immediately.  For example: we are doing
writeback, so there is a high-level plug in place, and we need to page
in btree blocks to do extent allocation.  We do readahead at this
point, but it looks like this change will prevent the readahead from
being issued by the unplug in xfs_buf_iosubmit().

So while I can see how this can make your single microbenchmark
better (because it's only doing concurrent direct IO to the block
device and hence there are no dependencies between individual IOs),
I have significant reservations about whether it's actually a win for
filesystem-based workloads where we need direct control of flushing
to minimise IO latency due to IO dependencies...

Patches like this one:

https://lkml.org/lkml/2015/3/20/442

show real-world workload improvements similar to your patchset's, by
being smarter about using high-level plugging to enable cross-file
merging of IO, but they still rely on the lower layers of plugging
to resolve latency bubbles caused by IO dependencies in the
filesystems.

> NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
> would always flush the plug list.  After this patch, this is only the
> case for the outermost plug.  If you require the plug list to be
> flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
> maintainers should take a close look at this patch and ensure they get
> the right behavior in the end.
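
Concretely, a call site that depends on its I/O actually being
dispatched before it waits would, under the new scheme, need something
like the following (a hedged sketch; submit_my_bios() is a made-up
stand-in for whatever the subsystem queues):

	struct blk_plug plug;

	blk_start_plug(&plug);
	submit_my_bios();
	/*
	 * blk_finish_plug() no longer flushes when this plug is nested
	 * inside a caller's plug, so force the queued I/O out before
	 * waiting on it.
	 */
	blk_flush_plug(current);
	blk_finish_plug();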

IOWs, you are saying we need to change all our current unplugs to
blk_flush_plug(current) to maintain the same behaviour as we
currently have?

If that is the case, shouldn't you actually be trying to fix the
specific plugging problem you've identified (i.e. do_direct_IO() is
flushing far too frequently) rather than making a sweeping
generalisation that the IO stack plugging infrastructure
needs to be fundamentally changed?

Cheers,

Dave.
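
For reference, the v2 behaviour quoted below reduces to a depth count
on the outermost plug.  A minimal userspace model of just that logic
(illustrative only; none of this is kernel code):

	#include <stdio.h>

	struct plug { int depth; int queued; };

	static struct plug *cur;	/* models current->plug */

	static void start_plug(struct plug *p)
	{
		if (cur) {		/* nested: only bump the count */
			cur->depth++;
			return;
		}
		p->depth = 1;
		p->queued = 0;
		cur = p;
	}

	static void queue_io(void)
	{
		cur->queued++;		/* the I/O lands on the outer plug */
	}

	static void finish_plug(void)
	{
		if (--cur->depth > 0)	/* inner finish: no flush */
			return;
		printf("flushing %d I/O(s)\n", cur->queued);
		cur = NULL;
	}

	int main(void)
	{
		struct plug outer, inner;
		int i;

		start_plug(&outer);		/* do_io_submit() */
		for (i = 0; i < 4; i++) {
			start_plug(&inner);	/* do_direct_IO() */
			queue_io();
			finish_plug();		/* no longer flushes */
		}
		finish_plug();			/* one flush of all four */
		return 0;
	}

Running it prints a single "flushing 4 I/O(s)" rather than four
single-I/O flushes, which matches the jump in merge counts visible in
the fio results quoted below.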

> 
> ---
> Changelog:
> v1->v2: Keep the blk_start_plug interface the same, suggested by Ming Lei.
> 
> Test results
> ------------
> Virtio-blk:
> 
> unpatched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=8032: Tue Apr  7 13:33:53 2015
>   read : io=2736.1MB, bw=280262KB/s, iops=70065, runt= 10000msec
>     slat (usec): min=40, max=10472, avg=207.82, stdev=364.02
>     clat (usec): min=211, max=35883, avg=14379.83, stdev=2213.95
>      lat (usec): min=862, max=36000, avg=14587.72, stdev=2223.80
>     clat percentiles (usec):
>      |  1.00th=[11328],  5.00th=[12096], 10.00th=[12480], 20.00th=[12992],
>      | 30.00th=[13376], 40.00th=[13760], 50.00th=[14144], 60.00th=[14400],
>      | 70.00th=[14784], 80.00th=[15168], 90.00th=[15936], 95.00th=[16768],
>      | 99.00th=[24448], 99.50th=[25216], 99.90th=[28544], 99.95th=[35072],
>      | 99.99th=[36096]
>     bw (KB  /s): min=265984, max=302720, per=100.00%, avg=280549.84, stdev=10264.36
>     lat (usec) : 250=0.01%, 1000=0.01%
>     lat (msec) : 2=0.02%, 4=0.02%, 10=0.05%, 20=96.57%, 50=3.34%
>   cpu          : usr=7.56%, sys=55.57%, ctx=6174, majf=0, minf=523
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=700656/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=2736.1MB, aggrb=280262KB/s, minb=280262KB/s, maxb=280262KB/s, mint=10000msec, maxt=10000msec
> 
> Disk stats (read/write):
>   vdd: ios=695490/0, merge=0/0, ticks=785741/0, in_queue=785442, util=90.69%
> 
> 
> patched:
> job1: (groupid=0, jobs=1): err= 0: pid=7743: Tue Apr  7 13:19:07 2015
>   read : io=8126.6MB, bw=832158KB/s, iops=208039, runt= 10000msec
>     slat (usec): min=20, max=14351, avg=55.08, stdev=143.47
>     clat (usec): min=283, max=20003, avg=4846.77, stdev=1355.35
>      lat (usec): min=609, max=20074, avg=4901.95, stdev=1362.40
>     clat percentiles (usec):
>      |  1.00th=[ 4016],  5.00th=[ 4048], 10.00th=[ 4080], 20.00th=[ 4128],
>      | 30.00th=[ 4192], 40.00th=[ 4192], 50.00th=[ 4256], 60.00th=[ 4512],
>      | 70.00th=[ 4896], 80.00th=[ 5664], 90.00th=[ 5920], 95.00th=[ 6752],
>      | 99.00th=[11968], 99.50th=[13632], 99.90th=[15552], 99.95th=[17024],
>      | 99.99th=[19840]
>     bw (KB  /s): min=740992, max=896640, per=100.00%, avg=836978.95, stdev=51034.87
>     lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 4=0.50%, 10=97.79%, 20=1.70%, 50=0.01%
>   cpu          : usr=20.28%, sys=69.11%, ctx=879, majf=0, minf=522
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=2080396/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=8126.6MB, aggrb=832158KB/s, minb=832158KB/s, maxb=832158KB/s, mint=10000msec, maxt=10000msec
> 
> Disk stats (read/write):
>   vdd: ios=127877/0, merge=1918166/0, ticks=23118/0, in_queue=23047, util=94.08%
> 
> micron p320h:
> 
> unpatched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=3244: Tue Apr  7 13:29:14 2015
>   read : io=6728.9MB, bw=688968KB/s, iops=172241, runt= 10001msec
>     slat (usec): min=43, max=6273, avg=81.79, stdev=125.96
>     clat (usec): min=78, max=12485, avg=5852.06, stdev=1154.76
>      lat (usec): min=146, max=12572, avg=5933.92, stdev=1163.75
>     clat percentiles (usec):
>      |  1.00th=[ 4192],  5.00th=[ 4384], 10.00th=[ 4576], 20.00th=[ 5600],
>      | 30.00th=[ 5664], 40.00th=[ 5728], 50.00th=[ 5792], 60.00th=[ 5856],
>      | 70.00th=[ 6112], 80.00th=[ 6176], 90.00th=[ 6240], 95.00th=[ 6368],
>      | 99.00th=[11840], 99.50th=[11968], 99.90th=[12096], 99.95th=[12096],
>      | 99.99th=[12224]
>     bw (KB  /s): min=648328, max=859264, per=98.80%, avg=680711.16, stdev=62016.70
>     lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=0.04%, 10=97.07%, 20=2.87%
>   cpu          : usr=10.28%, sys=73.61%, ctx=104436, majf=0, minf=6217
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=1722592/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=6728.9MB, aggrb=688967KB/s, minb=688967KB/s, maxb=688967KB/s, mint=10001msec, maxt=10001msec
> 
> Disk stats (read/write):
>   rssda: ios=1688772/0, merge=0/0, ticks=188820/0, in_queue=188678, util=96.61%
> 
> patched:
> 
> job1: (groupid=0, jobs=1): err= 0: pid=9531: Tue Apr  7 13:22:28 2015
>   read : io=11607MB, bw=1160.6MB/s, iops=297104, runt= 10001msec
>     slat (usec): min=21, max=6376, avg=43.05, stdev=81.82
>     clat (usec): min=116, max=9844, avg=3393.90, stdev=752.57
>      lat (usec): min=167, max=9889, avg=3437.01, stdev=757.02
>     clat percentiles (usec):
>      |  1.00th=[ 2832],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3120],
>      | 30.00th=[ 3152], 40.00th=[ 3248], 50.00th=[ 3280], 60.00th=[ 3344],
>      | 70.00th=[ 3376], 80.00th=[ 3504], 90.00th=[ 3728], 95.00th=[ 3824],
>      | 99.00th=[ 9152], 99.50th=[ 9408], 99.90th=[ 9664], 99.95th=[ 9664],
>      | 99.99th=[ 9792]
>     bw (MB  /s): min= 1139, max= 1183, per=100.00%, avg=1161.07, stdev=10.58
>     lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.01%, 4=98.31%, 10=1.67%
>   cpu          : usr=18.59%, sys=66.65%, ctx=55655, majf=0, minf=6218
>   IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=2971338/w=0/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=1024
> 
> Run status group 0 (all jobs):
>    READ: io=11607MB, aggrb=1160.6MB/s, minb=1160.6MB/s, maxb=1160.6MB/s, mint=10001msec, maxt=10001msec
> 
> Disk stats (read/write):
>   rssda: ios=183005/0, merge=2745105/0, ticks=31972/0, in_queue=31948, util=97.63%
> ---
>  block/blk-core.c                    | 29 ++++++++++++++++-------------
>  block/blk-lib.c                     |  2 +-
>  block/blk-throttle.c                |  2 +-
>  drivers/block/xen-blkback/blkback.c |  2 +-
>  drivers/md/dm-bufio.c               |  6 +++---
>  drivers/md/dm-crypt.c               |  2 +-
>  drivers/md/dm-kcopyd.c              |  2 +-
>  drivers/md/dm-thin.c                |  2 +-
>  drivers/md/md.c                     |  2 +-
>  drivers/md/raid1.c                  |  2 +-
>  drivers/md/raid10.c                 |  2 +-
>  drivers/md/raid5.c                  |  4 ++--
>  drivers/target/target_core_iblock.c |  2 +-
>  fs/aio.c                            |  2 +-
>  fs/block_dev.c                      |  2 +-
>  fs/btrfs/scrub.c                    |  2 +-
>  fs/btrfs/transaction.c              |  2 +-
>  fs/btrfs/tree-log.c                 | 12 ++++++------
>  fs/btrfs/volumes.c                  |  6 +++---
>  fs/buffer.c                         |  2 +-
>  fs/direct-io.c                      |  2 +-
>  fs/ext4/file.c                      |  2 +-
>  fs/ext4/inode.c                     |  4 ++--
>  fs/f2fs/checkpoint.c                |  2 +-
>  fs/f2fs/gc.c                        |  2 +-
>  fs/f2fs/node.c                      |  2 +-
>  fs/gfs2/log.c                       |  2 +-
>  fs/hpfs/buffer.c                    |  2 +-
>  fs/jbd/checkpoint.c                 |  2 +-
>  fs/jbd/commit.c                     |  4 ++--
>  fs/jbd2/checkpoint.c                |  2 +-
>  fs/jbd2/commit.c                    |  2 +-
>  fs/mpage.c                          |  2 +-
>  fs/nfs/blocklayout/blocklayout.c    |  4 ++--
>  fs/xfs/xfs_buf.c                    |  4 ++--
>  fs/xfs/xfs_dir2_readdir.c           |  2 +-
>  fs/xfs/xfs_itable.c                 |  2 +-
>  include/linux/blkdev.h              |  5 +++--
>  mm/madvise.c                        |  2 +-
>  mm/page-writeback.c                 |  2 +-
>  mm/readahead.c                      |  2 +-
>  mm/swap_state.c                     |  2 +-
>  mm/vmscan.c                         |  2 +-
>  43 files changed, 74 insertions(+), 70 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 794c3e7..fcd9c2f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3018,21 +3018,21 @@ void blk_start_plug(struct blk_plug *plug)
>  {
>  	struct task_struct *tsk = current;
>  
> +	if (tsk->plug) {
> +		tsk->plug->depth++;
> +		return;
> +	}
> +
> +	plug->depth = 1;
>  	INIT_LIST_HEAD(&plug->list);
>  	INIT_LIST_HEAD(&plug->mq_list);
>  	INIT_LIST_HEAD(&plug->cb_list);
>  
>  	/*
> -	 * If this is a nested plug, don't actually assign it. It will be
> -	 * flushed on its own.
> +	 * Store ordering should not be needed here, since a potential
> +	 * preempt will imply a full memory barrier
>  	 */
> -	if (!tsk->plug) {
> -		/*
> -		 * Store ordering should not be needed here, since a potential
> -		 * preempt will imply a full memory barrier
> -		 */
> -		tsk->plug = plug;
> -	}
> +	tsk->plug = plug;
>  }
>  EXPORT_SYMBOL(blk_start_plug);
>  
> @@ -3177,12 +3177,15 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  	local_irq_restore(flags);
>  }
>  
> -void blk_finish_plug(struct blk_plug *plug)
> +void blk_finish_plug(void)
>  {
> -	blk_flush_plug_list(plug, false);
> +	struct blk_plug *plug = current->plug;
>  
> -	if (plug == current->plug)
> -		current->plug = NULL;
> +	if (--plug->depth > 0)
> +		return;
> +
> +	blk_flush_plug_list(plug, false);
> +	current->plug = NULL;
>  }
>  EXPORT_SYMBOL(blk_finish_plug);
>  
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 7688ee3..ac347d3 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -128,7 +128,7 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		 */
>  		cond_resched();
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Wait for bios in-flight */
>  	if (!atomic_dec_and_test(&bb.done))
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 5b9c6d5..222a77a 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1281,7 +1281,7 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>  		blk_start_plug(&plug);
>  		while((bio = bio_list_pop(&bio_list_on_stack)))
>  			generic_make_request(bio);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  	}
>  }
>  
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index 2a04d34..74bea21 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -1374,7 +1374,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>  		submit_bio(operation, biolist[i]);
>  
>  	/* Let the I/Os go.. */
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	if (operation == READ)
>  		blkif->st_rd_sect += preq.nr_sects;
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 86dbbc7..502c63b 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -715,7 +715,7 @@ static void __flush_write_list(struct list_head *write_list)
>  		submit_io(b, WRITE, b->block, write_endio);
>  		dm_bufio_cond_resched();
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1126,7 +1126,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>  				&write_list);
>  		if (unlikely(!list_empty(&write_list))) {
>  			dm_bufio_unlock(c);
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			__flush_write_list(&write_list);
>  			blk_start_plug(&plug);
>  			dm_bufio_lock(c);
> @@ -1149,7 +1149,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
>  	dm_bufio_unlock(c);
>  
>  flush_plug:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  EXPORT_SYMBOL_GPL(dm_bufio_prefetch);
>  
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 713a962..65d7b72 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1224,7 +1224,7 @@ pop_from_list:
>  			rb_erase(&io->rb_node, &write_tree);
>  			kcryptd_io_write(io);
>  		} while (!RB_EMPTY_ROOT(&write_tree));
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  	}
>  	return 0;
>  }
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade..4a76e42 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -593,7 +593,7 @@ static void do_work(struct work_struct *work)
>  	process_jobs(&kc->complete_jobs, kc, run_complete_job);
>  	process_jobs(&kc->pages_jobs, kc, run_pages_job);
>  	process_jobs(&kc->io_jobs, kc, run_io_job);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 921aafd..be42bf5 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -1824,7 +1824,7 @@ static void process_thin_deferred_bios(struct thin_c *tc)
>  			dm_pool_issue_prefetches(pool->pmd);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int cmp_cells(const void *lhs, const void *rhs)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 717daad..c4ec179 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7686,7 +7686,7 @@ void md_do_sync(struct md_thread *thread)
>  	/*
>  	 * this also signals 'finished resyncing' to md_stop
>  	 */
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>  
>  	/* tell personality that we are finished */
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index d34e238..4f8fad4 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -2441,7 +2441,7 @@ static void raid1d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r1conf *conf)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a7196c4..92bb5dd 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -2835,7 +2835,7 @@ static void raid10d(struct md_thread *thread)
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
>  			md_check_recovery(mddev);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static int init_resync(struct r10conf *conf)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cd2f96b..695bf0f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5281,7 +5281,7 @@ static void raid5_do_work(struct work_struct *work)
>  	pr_debug("%d stripes handled\n", handled);
>  
>  	spin_unlock_irq(&conf->device_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5worker inactive\n");
>  }
> @@ -5352,7 +5352,7 @@ static void raid5d(struct md_thread *thread)
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	async_tx_issue_pending_all();
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	pr_debug("--- raid5d inactive\n");
>  }
> diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
> index d4a4b0f..17d8730 100644
> --- a/drivers/target/target_core_iblock.c
> +++ b/drivers/target/target_core_iblock.c
> @@ -367,7 +367,7 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
>  	blk_start_plug(&plug);
>  	while ((bio = bio_list_pop(list)))
>  		submit_bio(rw, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void iblock_end_io_flush(struct bio *bio, int err)
> diff --git a/fs/aio.c b/fs/aio.c
> index f8e52a1..b873698 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1616,7 +1616,7 @@ long do_io_submit(aio_context_t ctx_id, long nr,
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	percpu_ref_put(&ctx->users);
>  	return i ? i : ret;
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 975266b..f5848de 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1609,7 +1609,7 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		if (err < 0)
>  			ret = err;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_write_iter);
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ec57687..f314cfb8 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3316,7 +3316,7 @@ out:
>  	scrub_wr_submit(sctx);
>  	mutex_unlock(&sctx->wr_ctx.wr_lock);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	btrfs_free_path(path);
>  	btrfs_free_path(ppath);
>  	return ret < 0 ? ret : 0;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 8be4278..fee10af 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -983,7 +983,7 @@ static int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
>  
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(root, dirty_pages, mark);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	ret2 = btrfs_wait_marked_extents(root, dirty_pages, mark);
>  
>  	if (ret)
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index c5b8ba3..879c7fd 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2574,7 +2574,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	blk_start_plug(&plug);
>  	ret = btrfs_write_marked_extents(log, &log->dirty_log_pages, mark);
>  	if (ret) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_abort_transaction(trans, root, ret);
>  		btrfs_free_logged_extents(log, log_transid);
>  		btrfs_set_log_full_commit(root->fs_info, trans);
> @@ -2619,7 +2619,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  		if (!list_empty(&root_log_ctx.list))
>  			list_del_init(&root_log_ctx.list);
>  
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  
>  		if (ret != -ENOSPC) {
> @@ -2635,7 +2635,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	}
>  
>  	if (log_root_tree->log_transid_committed >= root_log_ctx.log_transid) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = root_log_ctx.log_ret;
>  		goto out;
> @@ -2643,7 +2643,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  
>  	index2 = root_log_ctx.log_transid % 2;
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		ret = btrfs_wait_marked_extents(log, &log->dirty_log_pages,
>  						mark);
>  		btrfs_wait_logged_extents(trans, log, log_transid);
> @@ -2669,7 +2669,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	 * check the full commit flag again
>  	 */
>  	if (btrfs_need_log_full_commit(root->fs_info, trans)) {
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
>  		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
> @@ -2680,7 +2680,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	ret = btrfs_write_marked_extents(log_root_tree,
>  					 &log_root_tree->dirty_log_pages,
>  					 EXTENT_DIRTY | EXTENT_NEW);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (ret) {
>  		btrfs_set_log_full_commit(root->fs_info, trans);
>  		btrfs_abort_transaction(trans, root, ret);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 8222f6f..16db068 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -358,7 +358,7 @@ loop_lock:
>  		if (pending_bios == &device->pending_sync_bios) {
>  			sync_pending = 1;
>  		} else if (sync_pending) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -415,7 +415,7 @@ loop_lock:
>  		}
>  		/* unplug every 64 requests just for good measure */
>  		if (batch_run % 64 == 0) {
> -			blk_finish_plug(&plug);
> +			blk_finish_plug();
>  			blk_start_plug(&plug);
>  			sync_pending = 0;
>  		}
> @@ -431,7 +431,7 @@ loop_lock:
>  	spin_unlock(&device->io_lock);
>  
>  done:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  static void pending_bios_fn(struct btrfs_work *work)
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 20805db..8181c44 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -758,7 +758,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list)
>  	}
>  
>  	spin_unlock(lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	spin_lock(lock);
>  
>  	while (!list_empty(&tmp)) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..16f16ed 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1262,7 +1262,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	if (sdio.bio)
>  		dio_bio_submit(dio, &sdio);
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * It is possible that, we return short IO due to end of file.
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 33a09da..3a293eb 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -183,7 +183,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			ret = err;
>  	}
>  	if (o_direct)
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  
>  errout:
>  	if (aio_mutex)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5cb9a21..90ce0cb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2302,7 +2302,7 @@ static int ext4_writepages(struct address_space *mapping,
>  
>  		blk_start_plug(&plug);
>  		ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -		blk_finish_plug(&plug);
> +		blk_finish_plug();
>  		goto out_writepages;
>  	}
>  
> @@ -2438,7 +2438,7 @@ retry:
>  		if (ret)
>  			break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (!ret && !cycled && wbc->nr_to_write > 0) {
>  		cycled = 1;
>  		mpd.last_page = writeback_index - 1;
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index 7f794b7..86ba453 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -846,7 +846,7 @@ retry_flush_nodes:
>  		goto retry_flush_nodes;
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return err;
>  }
>  
> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> index 76adbc3..abeef77 100644
> --- a/fs/f2fs/gc.c
> +++ b/fs/f2fs/gc.c
> @@ -678,7 +678,7 @@ static void do_garbage_collect(struct f2fs_sb_info *sbi, unsigned int segno,
>  		gc_data_segment(sbi, sum->entries, gc_list, segno, gc_type);
>  		break;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	stat_inc_seg_count(sbi, GET_SUM_TYPE((&sum->footer)));
>  	stat_inc_call_count(sbi->stat_info);
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 97bd9d3..c4aa9e2 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1098,7 +1098,7 @@ repeat:
>  		ra_node_page(sbi, nid);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lock_page(page);
>  	if (unlikely(page->mapping != NODE_MAPPING(sbi))) {
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index 536e7a6..06f25d17 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -159,7 +159,7 @@ restart:
>  			goto restart;
>  	}
>  	spin_unlock(&sdp->sd_ail_lock);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	trace_gfs2_ail_flush(sdp, wbc, 0);
>  }
>  
> diff --git a/fs/hpfs/buffer.c b/fs/hpfs/buffer.c
> index 8057fe4..138462d 100644
> --- a/fs/hpfs/buffer.c
> +++ b/fs/hpfs/buffer.c
> @@ -35,7 +35,7 @@ void hpfs_prefetch_sectors(struct super_block *s, unsigned secno, int n)
>  		secno++;
>  		n--;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /* Map a sector into a buffer and return pointers to it and to the buffer. */
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index 08c0304..cd6b09f 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -263,7 +263,7 @@ __flush_batch(journal_t *journal, struct buffer_head **bhs, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = bhs[i];
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index bb217dc..e1046c3 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -447,7 +447,7 @@ void journal_commit_transaction(journal_t *journal)
>  	blk_start_plug(&plug);
>  	err = journal_submit_data_buffers(journal, commit_transaction,
>  					  write_op);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/*
>  	 * Wait for all previously submitted IO to complete.
> @@ -697,7 +697,7 @@ start_journal_io:
>  		}
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 988b32e..6aa0039 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -187,7 +187,7 @@ __flush_batch(journal_t *journal, int *batch_count)
>  	blk_start_plug(&plug);
>  	for (i = 0; i < *batch_count; i++)
>  		write_dirty_buffer(journal->j_chkpt_bhs[i], WRITE_SYNC);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	for (i = 0; i < *batch_count; i++) {
>  		struct buffer_head *bh = journal->j_chkpt_bhs[i];
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b73e021..8f532c8 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -805,7 +805,7 @@ start_journal_io:
>  			__jbd2_journal_abort_hard(journal);
>  	}
>  
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	/* Lo and behold: we have just managed to send a transaction to
>             the log.  Before we can commit it, wait for the IO so far to
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..bf7d6c3 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -695,7 +695,7 @@ mpage_writepages(struct address_space *mapping,
>  		if (mpd.bio)
>  			mpage_bio_submit(WRITE, mpd.bio);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  EXPORT_SYMBOL(mpage_writepages);
> diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
> index 1cac3c1..e93b6a8 100644
> --- a/fs/nfs/blocklayout/blocklayout.c
> +++ b/fs/nfs/blocklayout/blocklayout.c
> @@ -311,7 +311,7 @@ bl_read_pagelist(struct nfs_pgio_header *header)
>  	}
>  out:
>  	bl_submit_bio(READ, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> @@ -433,7 +433,7 @@ bl_write_pagelist(struct nfs_pgio_header *header, int sync)
>  	header->res.count = header->args.count;
>  out:
>  	bl_submit_bio(WRITE, bio);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	put_parallel(par);
>  	return PNFS_ATTEMPTED;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1790b00..2f89ca2 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1289,7 +1289,7 @@ _xfs_buf_ioapply(
>  		if (size <= 0)
>  			break;	/* all done */
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> @@ -1823,7 +1823,7 @@ __xfs_buf_delwri_submit(
>  
>  		xfs_buf_submit(bp);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return pinned;
>  }
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 098cd78..7e8fa3f 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -455,7 +455,7 @@ xfs_dir2_leaf_readbuf(
>  			}
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  out:
>  	*bpp = bp;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 82e3142..c3ac5ec 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -196,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
>  					     &xfs_inode_buf_ops);
>  		}
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  }
>  
>  /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 7f9a516..188133f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1091,6 +1091,7 @@ static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
>   * schedule() where blk_schedule_flush_plug() is called.
>   */
>  struct blk_plug {
> +	int depth; /* number of nested plugs */
>  	struct list_head list; /* requests */
>  	struct list_head mq_list; /* blk-mq requests */
>  	struct list_head cb_list; /* md requires an unplug callback */
> @@ -1107,7 +1108,7 @@ struct blk_plug_cb {
>  extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
>  					     void *data, int size);
>  extern void blk_start_plug(struct blk_plug *);
> -extern void blk_finish_plug(struct blk_plug *);
> +extern void blk_finish_plug(void);
>  extern void blk_flush_plug_list(struct blk_plug *, bool);
>  
>  static inline void blk_flush_plug(struct task_struct *tsk)
> @@ -1646,7 +1647,7 @@ static inline void blk_start_plug(struct blk_plug *plug)
>  {
>  }
>  
> -static inline void blk_finish_plug(struct blk_plug *plug)
> +static inline void blk_finish_plug(void)
>  {
>  }
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index d551475..18a34ee 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -539,7 +539,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  			vma = find_vma(current->mm, start);
>  	}
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	if (write)
>  		up_write(&current->mm->mmap_sem);
>  	else
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 644bcb6..4570f6e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2020,7 +2020,7 @@ int generic_writepages(struct address_space *mapping,
>  
>  	blk_start_plug(&plug);
>  	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	return ret;
>  }
>  
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9356758..64182a2 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -136,7 +136,7 @@ static int read_pages(struct address_space *mapping, struct file *filp,
>  	ret = 0;
>  
>  out:
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	return ret;
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 405923f..5721f64 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -478,7 +478,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			SetPageReadahead(page);
>  		page_cache_release(page);
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  
>  	lru_add_drain();	/* Push any new pages onto the LRU now */
>  skip:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5e8eadd..56bb274 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2222,7 +2222,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
>  
>  		scan_adjusted = true;
>  	}
> -	blk_finish_plug(&plug);
> +	blk_finish_plug();
>  	sc->nr_reclaimed += nr_reclaimed;
>  
>  	/*
> -- 
> 1.8.3.1
> 
> 

-- 
Dave Chinner
david at fromorbit.com



^ permalink raw reply	[flat|nested] 36+ messages in thread
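
The header hunk above adds a `depth' counter to struct blk_plug and drops the
argument from blk_finish_plug(), which now implicitly acts on current->plug.
The corresponding blk-core.c changes aren't quoted in this message, so here is
a minimal sketch of the depth-counted semantics the patch description implies;
the function bodies are illustrative assumptions, not the actual patch:

	/*
	 * Sketch only: a nested blk_start_plug() reuses the outer plug and
	 * bumps its depth; blk_finish_plug() flushes only when the
	 * outermost level unwinds.
	 */
	void blk_start_plug(struct blk_plug *plug)
	{
		if (current->plug) {
			/* nested: keep batching on the outer plug */
			current->plug->depth++;
			return;
		}
		INIT_LIST_HEAD(&plug->list);
		INIT_LIST_HEAD(&plug->mq_list);
		INIT_LIST_HEAD(&plug->cb_list);
		plug->depth = 1;
		current->plug = plug;
	}

	void blk_finish_plug(void)
	{
		struct blk_plug *plug = current->plug;

		if (WARN_ON_ONCE(!plug))
			return;
		if (--plug->depth > 0)
			return;	/* inner finish: leave the IO queued */
		blk_flush_plug_list(plug, false);
		current->plug = NULL;
	}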

* Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
  2015-04-08 23:02       ` Dave Chinner
@ 2015-04-09  0:54         ` Dave Chinner
  -1 siblings, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2015-04-09  0:54 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, dm-devel, xen-devel, linux-raid, linux-scsi,
	target-devel, linux-fsdevel, linux-aio, linux-btrfs, linux-ext4,
	linux-f2fs-devel, cluster-devel, linux-nfs, linux-mm

[ Sending again with a trimmed CC list to just the lists. Jeff - cc
lists that large get blocked by mailing lists... ]

On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
> The way the on-stack plugging currently works, each nesting level
> flushes its own list of I/Os.  This can be less than optimal (read
> awful) for certain workloads.  For example, consider an application
> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
> I/Os together in a single io_submit call, only to have each of them
> dispatched individually down in the bowels of the direct I/O code.
> The reason is that there are blk_plug-s instantiated both at the upper
> call site in do_io_submit and down in do_direct_IO.  The latter will
> submit as little as 1 I/O at a time (if you have a small enough I/O
> size) instead of performing the batching that the plugging
> infrastructure is supposed to provide.

I'm wondering what impact this will have on filesystem metadata IO
that needs to be issued immediately. e.g. we are doing writeback, so
there is a high level plug in place and we need to page in btree
blocks to do extent allocation. We do readahead at this point,
but it looks like this change will prevent the readahead from being
issued by the unplug in xfs_buf_iosubmit().

So while I can see how this can make your single microbenchmark
better (because it's only doing concurrent direct IO to the block
device and hence there are no dependencies between individual IOs),
I have significant reservations that it's actually a win for
filesystem-based workloads where we need direct control of flushing
to minimise IO latency due to IO dependencies...

Patches like this one:

https://lkml.org/lkml/2015/3/20/442

show similar real-world workload improvements to your patchset by
being smarter about using high level plugging to enable cross-file
merging of IO, but it still relies on the lower layers of plugging
to resolve latency bubbles caused by IO dependencies in the
filesystems.

> NOTE TO SUBSYSTEM MAINTAINERS: Before this patch, blk_finish_plug
> would always flush the plug list.  After this patch, this is only the
> case for the outer-most plug.  If you require the plug list to be
> flushed, you should be calling blk_flush_plug(current).  Btrfs and dm
> maintainers should take a close look at this patch and ensure they get
> the right behavior in the end.

IOWs, you are saying we need to change all our current unplugs to
blk_flush_plug(current) to *try* to maintain the same behaviour as
we currently have? I say *try*, because now instead of just flushing
the readahead IO on the plug, we'll also flush all the queued data
writeback IO on the high level plug. We don't actually want to do
that; we only want to submit the readahead and not the bulk IO that
will delay the latency-sensitive dependent IOs....

If that is the case, shouldn't you actually be trying to fix the
specific plugging problem you've identified (i.e. do_direct_IO() is
flushing far too frequently) rather than making a sweeping
generalisation that the IO stack plugging infrastructure
needs fundamental change?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread
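
Dave's objection above centres on callers that queue IO under a plug and then
immediately block on its completion. Below is a sketch of the caller-side
conversion Jeff's NOTE prescribes; the helpers are hypothetical, and this is
exactly the conversion Dave dislikes, since blk_flush_plug(current) drains the
outer plug's queued writeback along with the readahead:

	static void prefetch_then_wait(struct block_device *bdev)
	{
		struct blk_plug plug;

		blk_start_plug(&plug);
		queue_readahead_bios(bdev);	/* hypothetical helper */

		/*
		 * Before the patch, blk_finish_plug(&plug) always issued
		 * the queued bios.  After it, a nested finish is a no-op,
		 * so a caller that is about to wait on the IO must flush
		 * the task's plug by hand first.
		 */
		blk_flush_plug(current);
		blk_finish_plug();

		wait_on_prefetched_pages(bdev);	/* hypothetical: would stall otherwise */
	}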

* Re: [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
  2015-04-08 23:02       ` Dave Chinner
@ 2015-04-10 21:50         ` Jeff Moyer
  -1 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-10 21:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Ming Lei, Konrad Rzeszutek Wilk, Roger Pau Monn??,
	Alasdair Kergon, Mike Snitzer, Neil Brown, Nicholas A. Bellinger,
	Alexander Viro, Chris Mason, Josef Bacik, David Sterba,
	Theodore Ts'o, Andreas Dilger, Jaegeuk Kim, Changman Lee,
	Steven Whitehouse, Mikulas Patocka, Andrew Morton, Rik van Riel,
	Johannes Weiner

Dave Chinner <david@fromorbit.com> writes:

> On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
>> The way the on-stack plugging currently works, each nesting level
>> flushes its own list of I/Os.  This can be less than optimal (read
>> awful) for certain workloads.  For example, consider an application
>> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
>> I/Os together in a single io_submit call, only to have each of them
>> dispatched individually down in the bowels of the direct I/O code.
>> The reason is that there are blk_plug-s instantiated both at the upper
>> call site in do_io_submit and down in do_direct_IO.  The latter will
>> submit as little as 1 I/O at a time (if you have a small enough I/O
>> size) instead of performing the batching that the plugging
>> infrastructure is supposed to provide.
>
> I'm wondering what impact this will have on filesystem metadata IO
> that needs to be issued immediately. e.g. we are doing writeback, so
> there is a high level plug in place and we need to page in btree
> blocks to do extent allocation. We do readahead at this point,
> but it looks like this change will prevent the readahead from being
> issued by the unplug in xfs_buf_iosubmit().

I'm not ignoring you, Dave, I'm just doing some more investigation and
testing.  It's taking longer than I had hoped.

-Jeff


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [f2fs-dev] [PATCH 2/2][v2] blk-plug: don't flush nested plug lists
  2015-04-10 21:50         ` Jeff Moyer
@ 2015-04-11  4:11           ` nick
  -1 siblings, 0 replies; 36+ messages in thread
From: nick @ 2015-04-11  4:11 UTC (permalink / raw)
  To: Jeff Moyer, Dave Chinner
  Cc: Vladimir Davydov, linux-aio, Miklos Szeredi, Mike Snitzer,
	Ming Lei, Ming Lei, Trond Myklebust, Jianyu Zhan,
	Nicholas A. Bellinger, linux-kernel, Sagi Grimberg, Chris Mason,
	dm-devel, target-devel, Andreas Dilger, Mikulas Patocka,
	Mark Rustad, Christoph Hellwig, Alasdair Kergon, Matthew Wilcox,
	linux-scsi, Namjae Jeon, linux-raid, cluster-devel, Mel Gorman



On 2015-04-10 05:50 PM, Jeff Moyer wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
>> On Tue, Apr 07, 2015 at 02:55:13PM -0400, Jeff Moyer wrote:
>>> The way the on-stack plugging currently works, each nesting level
>>> flushes its own list of I/Os.  This can be less than optimal (read
>>> awful) for certain workloads.  For example, consider an application
>>> that issues asynchronous O_DIRECT I/Os.  It can send down a bunch of
>>> I/Os together in a single io_submit call, only to have each of them
>>> dispatched individually down in the bowels of the direct I/O code.
>>> The reason is that there are blk_plug-s instantiated both at the upper
>>> call site in do_io_submit and down in do_direct_IO.  The latter will
>>> submit as little as 1 I/O at a time (if you have a small enough I/O
>>> size) instead of performing the batching that the plugging
>>> infrastructure is supposed to provide.
>>
>> I'm wondering what impact this will have on filesystem metadata IO
>> that needs to be issued immediately. e.g. we are doing writeback, so
>> there is a high level plug in place and we need to page in btree
>> blocks to do extent allocation. We do readahead at this point,
>> but it looks like this change will prevent the readahead from being
>> issued by the unplug in xfs_buf_iosubmit().
> 
> I'm not ignoring you, Dave, I'm just doing some more investigation and
> testing.  It's taking longer than I had hoped.
> 
> -Jeff
> 
Jeff,
Would you mind sending your test reports to the list so we can see what workloads
and tests you're running your patch under? That way, the rest of us can suggest
other major benchmarks or workloads worth testing, to check whether your patch
introduces any regressions.
Thanks,
Nick


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/2] blk-plug: don't flush nested plug lists
  2015-04-08 17:38   ` [PATCH 2/2] " Jens Axboe
  2015-04-07 16:09     ` Shaohua Li
  2015-04-08 17:56     ` Jeff Moyer
@ 2015-04-16 15:47     ` Jeff Moyer
  2 siblings, 0 replies; 36+ messages in thread
From: Jeff Moyer @ 2015-04-16 15:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

Jens Axboe <axboe@kernel.dk> writes:

> And agree with Ming, this can be cleaned up substantially. I'd also
> like to see some test results from the other end of the spectrum. Your
> posted case is clearly the best case (we missed tons of merging, now we
> don't); I'd like to see a normal case and a worst case result as well,
> so we have an idea of what this would do to latencies.

I re-ran my tests (just sequential reads) to verify where the
performance benefit comes from.  Here are the results:

device|  vanilla   |    patch1      |   dio-noplug  |  noflush-nested
------+------------+----------------+---------------+-----------------
rssda | 701,684    |    1,168,527   |   1,342,177   |   1,297,612
      |     100%   |         +66%   |        +91%   |        +85%
vdb0  | 358,264    |      902,913   |     906,850   |     922,327
      |     100%   |        +152%   |       +153%   |       +157%

Patch1 refers to the first patch in this series, which fixes the merge
logic for single-queue blk-mq devices.  Each column after that includes
that first patch.  In dio-noplug, I removed the blk_plug from the
direct-io code path (so there is no nesting at all).  This is a control,
since it is what I expect the outcome of the noflush-nested column to
actually be.  Then, the noflush-nested column leaves the blk_plug in
place in the dio code, but includes the patch that prevents nested
blk_plug's from being flushed.  All numbers are the average of 5 runs.
With the exception of the vanilla run on rssda (the first run was
faster, causing the average to go up), the standard deviation is very
small.

As you can see, for the micron device (rssda) we get quite a bit of an
improvement from just fixing the plugging, but we could do quite a bit
better.  For the virtio-blk device, the follow-on patches don't
contribute very much to the overall performance.

So, I'm happy to push in just patch 1 at this time.  It is a bugfix,
after all, and could even be marked for stable.  Does that sound
reasonable?

Later, if time permits, I can look into refining patch 2, though I don't
have any firm plans to do that right now.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 36+ messages in thread
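
For anyone wanting to poke at this themselves, below is a minimal sketch of
the kind of submitter the thread is about: a batch of sequential O_DIRECT
reads handed to the kernel in one io_submit() call. The device path, block
size, and batch size are made-up parameters, not the values Jeff used:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <libaio.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define BATCH	32		/* assumed queue depth */
	#define BS	4096		/* assumed block size */

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cbs[BATCH], *cbp[BATCH];
		struct io_event events[BATCH];
		void *buf[BATCH];
		int fd, i;

		fd = open("/dev/sdX", O_RDONLY | O_DIRECT);	/* hypothetical device */
		if (fd < 0 || io_setup(BATCH, &ctx) < 0)
			return 1;

		for (i = 0; i < BATCH; i++) {
			if (posix_memalign(&buf[i], BS, BS))
				return 1;
			/* sequential: one iocb per BS-sized block */
			io_prep_pread(&cbs[i], fd, buf[i], BS, (long long)i * BS);
			cbp[i] = &cbs[i];
		}

		/* one syscall carries the whole batch; whether it reaches
		 * the device as one burst is what the plugging decides */
		if (io_submit(ctx, BATCH, cbp) != BATCH)
			return 1;
		if (io_getevents(ctx, BATCH, BATCH, events, NULL) != BATCH)
			return 1;

		printf("completed %d reads\n", BATCH);
		io_destroy(ctx);
		close(fd);
		return 0;
	}

Build with something like "gcc -O2 -o batch-read batch-read.c -laio" (needs
the libaio development headers installed).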

end of thread

Thread overview: 36+ messages
2015-04-06 19:14 [PATCH 1/2] [v2] blk-mq: fix plugging in blk_sq_make_request Jeff Moyer
2015-04-06 19:14 ` [PATCH 2/2] blk-plug: don't flush nested plug lists Jeff Moyer
2015-04-07  9:19   ` Ming Lei
2015-04-07 14:47     ` Jeff Moyer
2015-04-07 18:55   ` [PATCH 2/2][v2] " Jeff Moyer
2015-04-08 16:13     ` Christoph Hellwig
2015-04-08 23:02     ` Dave Chinner
2015-04-09  0:54       ` Dave Chinner
2015-04-10 21:50       ` Jeff Moyer
2015-04-11  4:11         ` [f2fs-dev] " nick
2015-04-08 17:38   ` [PATCH 2/2] " Jens Axboe
2015-04-07 16:09     ` Shaohua Li
2015-04-08 17:56     ` Jeff Moyer
2015-04-16 15:47     ` Jeff Moyer
