All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET] block: reimplement FLUSH/FUA to support merge
@ 2011-01-21 15:59 ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

Hello,

This patchset reimplements (yeah, once more) so that FLUSH/FUA
requests can be merged.  This is inspired by Darrick's patches to
merge multiple zero-data flushes which helps workloads with highly
concurrent fsync requests.

For more information, please read the patch description on the third
patch.

 0001-block-add-REQ_FLUSH_SEQ.patch
 0002-block-improve-flush-bio-completion.patch
 0003-block-reimplement-FLUSH-FUA-to-support-merge.patch

0001-0002 prepare for the reimplementation.  0003 reimplements.

Tested on FLUSH+FUA, FLUSH, and no FLUSH configurations.  Works pretty
solid for me.  Darrick says that preliminary performance test looks
okay.  Darrick, please use this version for further testing as I fixed
a few problems.

This patchset is based on 2.6.38-rc1 (2b1caf6e) and available in the
following git branch,

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git block-flush-merge

and contains the following changes.

 block/blk-core.c          |   56 ++---
 block/blk-flush.c         |  441 +++++++++++++++++++++++++++++++---------------
 block/blk.h               |   12 -
 block/elevator.c          |    7 
 include/linux/blk_types.h |    2 
 include/linux/blkdev.h    |   18 +
 include/linux/elevator.h  |    1 
 7 files changed, 355 insertions(+), 182 deletions(-)

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCHSET] block: reimplement FLUSH/FUA to support merge
@ 2011-01-21 15:59 ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth

Hello,

This patchset reimplements (yeah, once more) so that FLUSH/FUA
requests can be merged.  This is inspired by Darrick's patches to
merge multiple zero-data flushes which helps workloads with highly
concurrent fsync requests.

For more information, please read the patch description on the third
patch.

 0001-block-add-REQ_FLUSH_SEQ.patch
 0002-block-improve-flush-bio-completion.patch
 0003-block-reimplement-FLUSH-FUA-to-support-merge.patch

0001-0002 prepare for the reimplementation.  0003 reimplements.

Tested on FLUSH+FUA, FLUSH, and no FLUSH configurations.  Works pretty
solid for me.  Darrick says that preliminary performance test looks
okay.  Darrick, please use this version for further testing as I fixed
a few problems.

This patchset is based on 2.6.38-rc1 (2b1caf6e) and available in the
following git branch,

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git block-flush-merge

and contains the following changes.

 block/blk-core.c          |   56 ++---
 block/blk-flush.c         |  441 +++++++++++++++++++++++++++++++---------------
 block/blk.h               |   12 -
 block/elevator.c          |    7 
 include/linux/blk_types.h |    2 
 include/linux/blkdev.h    |   18 +
 include/linux/elevator.h  |    1 
 7 files changed, 355 insertions(+), 182 deletions(-)

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/3] block: add REQ_FLUSH_SEQ
  2011-01-21 15:59 ` Tejun Heo
@ 2011-01-21 15:59   ` Tejun Heo
  -1 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef
  Cc: Tejun Heo

rq == &q->flush_rq was used to determine whether a rq is part of a
flush sequence, which worked because all requests in a flush sequence
were sequenced using the single dedicated request.  This is about to
change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
requests.

This patch doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c          |    4 ++--
 block/blk-flush.c         |    1 +
 block/blk.h               |    2 +-
 include/linux/blk_types.h |    2 ++
 4 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2f4002f..22e0b4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -151,7 +151,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 {
 	struct request_queue *q = rq->q;
 
-	if (&q->flush_rq != rq) {
+	if (!(rq->cmd_flags & REQ_FLUSH_SEQ)) {
 		if (error)
 			clear_bit(BIO_UPTODATE, &bio->bi_flags);
 		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -1804,7 +1804,7 @@ static void blk_account_io_done(struct request *req)
 	 * normal IO on queueing nor completion.  Accounting the
 	 * containing request is enough.
 	 */
-	if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
+	if (blk_do_io_stat(req) && !(req->cmd_flags & REQ_FLUSH_SEQ)) {
 		unsigned long duration = jiffies - req->start_time;
 		const int rw = rq_data_dir(req);
 		struct hd_struct *part;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 54b123d..8592869 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -130,6 +130,7 @@ static struct request *queue_next_fseq(struct request_queue *q)
 		BUG();
 	}
 
+	rq->cmd_flags |= REQ_FLUSH_SEQ;
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 	return rq;
 }
diff --git a/block/blk.h b/block/blk.h
index 2db8f32..9d2ee8f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -61,7 +61,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
 			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    rq == &q->flush_rq)
+			    (rq->cmd_flags & REQ_FLUSH_SEQ))
 				return rq;
 			rq = blk_do_flush(q, rq);
 			if (rq)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..dddedfc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -148,6 +148,7 @@ enum rq_flag_bits {
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_COPY_USER,	/* contains copies of user pages */
 	__REQ_FLUSH,		/* request for cache flush */
+	__REQ_FLUSH_SEQ,	/* request for flush sequence */
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
@@ -188,6 +189,7 @@ enum rq_flag_bits {
 #define REQ_ALLOCED		(1 << __REQ_ALLOCED)
 #define REQ_COPY_USER		(1 << __REQ_COPY_USER)
 #define REQ_FLUSH		(1 << __REQ_FLUSH)
+#define REQ_FLUSH_SEQ		(1 << __REQ_FLUSH_SEQ)
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1 << __REQ_SECURE)
-- 
1.7.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 1/3] block: add REQ_FLUSH_SEQ
@ 2011-01-21 15:59   ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth
  Cc: Tejun Heo

rq == &q->flush_rq was used to determine whether a rq is part of a
flush sequence, which worked because all requests in a flush sequence
were sequenced using the single dedicated request.  This is about to
change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
requests.

This patch doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c          |    4 ++--
 block/blk-flush.c         |    1 +
 block/blk.h               |    2 +-
 include/linux/blk_types.h |    2 ++
 4 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2f4002f..22e0b4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -151,7 +151,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 {
 	struct request_queue *q = rq->q;
 
-	if (&q->flush_rq != rq) {
+	if (!(rq->cmd_flags & REQ_FLUSH_SEQ)) {
 		if (error)
 			clear_bit(BIO_UPTODATE, &bio->bi_flags);
 		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -1804,7 +1804,7 @@ static void blk_account_io_done(struct request *req)
 	 * normal IO on queueing nor completion.  Accounting the
 	 * containing request is enough.
 	 */
-	if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
+	if (blk_do_io_stat(req) && !(req->cmd_flags & REQ_FLUSH_SEQ)) {
 		unsigned long duration = jiffies - req->start_time;
 		const int rw = rq_data_dir(req);
 		struct hd_struct *part;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 54b123d..8592869 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -130,6 +130,7 @@ static struct request *queue_next_fseq(struct request_queue *q)
 		BUG();
 	}
 
+	rq->cmd_flags |= REQ_FLUSH_SEQ;
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 	return rq;
 }
diff --git a/block/blk.h b/block/blk.h
index 2db8f32..9d2ee8f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -61,7 +61,7 @@ static inline struct request *__elv_next_request(struct request_queue *q)
 		while (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
 			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    rq == &q->flush_rq)
+			    (rq->cmd_flags & REQ_FLUSH_SEQ))
 				return rq;
 			rq = blk_do_flush(q, rq);
 			if (rq)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 46ad519..dddedfc 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -148,6 +148,7 @@ enum rq_flag_bits {
 	__REQ_ALLOCED,		/* request came from our alloc pool */
 	__REQ_COPY_USER,	/* contains copies of user pages */
 	__REQ_FLUSH,		/* request for cache flush */
+	__REQ_FLUSH_SEQ,	/* request for flush sequence */
 	__REQ_IO_STAT,		/* account I/O stat */
 	__REQ_MIXED_MERGE,	/* merge of different types, fail separately */
 	__REQ_SECURE,		/* secure discard (used with __REQ_DISCARD) */
@@ -188,6 +189,7 @@ enum rq_flag_bits {
 #define REQ_ALLOCED		(1 << __REQ_ALLOCED)
 #define REQ_COPY_USER		(1 << __REQ_COPY_USER)
 #define REQ_FLUSH		(1 << __REQ_FLUSH)
+#define REQ_FLUSH_SEQ		(1 << __REQ_FLUSH_SEQ)
 #define REQ_IO_STAT		(1 << __REQ_IO_STAT)
 #define REQ_MIXED_MERGE		(1 << __REQ_MIXED_MERGE)
 #define REQ_SECURE		(1 << __REQ_SECURE)
-- 
1.7.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/3] block: improve flush bio completion
  2011-01-21 15:59 ` Tejun Heo
@ 2011-01-21 15:59   ` Tejun Heo
  -1 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef
  Cc: Tejun Heo

bio's for flush are completed twice - once during the data phase and
one more time after the whole sequence is complete.  The first
completion shouldn't notify completion to the issuer.

This was achieved by skipping all bio completion steps in
req_bio_endio() for the first completion; however, this has two
drawbacks.

* Error is not recorded in bio and must be tracked somewhere else.

* Partial completion is not supported.

Both don't cause problems for the current users; however, they make
further improvements difficult.  Change req_bio_endio() such that it
only skips the actual notification part for the first completion.  bio
completion is implemented with partial completions on mind anyway so
this is as simple as moving the REQ_FLUSH_SEQ conditional such that
only calling of bio_endio() is skipped.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c |   48 +++++++++++++++++++++---------------------------
 1 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 22e0b4f..038519b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -151,37 +151,31 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 {
 	struct request_queue *q = rq->q;
 
-	if (!(rq->cmd_flags & REQ_FLUSH_SEQ)) {
-		if (error)
-			clear_bit(BIO_UPTODATE, &bio->bi_flags);
-		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
-			error = -EIO;
-
-		if (unlikely(nbytes > bio->bi_size)) {
-			printk(KERN_ERR "%s: want %u bytes done, %u left\n",
-			       __func__, nbytes, bio->bi_size);
-			nbytes = bio->bi_size;
-		}
+	if (error)
+		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+		error = -EIO;
+
+	if (unlikely(nbytes > bio->bi_size)) {
+		printk(KERN_ERR "%s: want %u bytes done, %u left\n",
+		       __func__, nbytes, bio->bi_size);
+		nbytes = bio->bi_size;
+	}
 
-		if (unlikely(rq->cmd_flags & REQ_QUIET))
-			set_bit(BIO_QUIET, &bio->bi_flags);
+	if (unlikely(rq->cmd_flags & REQ_QUIET))
+		set_bit(BIO_QUIET, &bio->bi_flags);
 
-		bio->bi_size -= nbytes;
-		bio->bi_sector += (nbytes >> 9);
+	bio->bi_size -= nbytes;
+	bio->bi_sector += (nbytes >> 9);
 
-		if (bio_integrity(bio))
-			bio_integrity_advance(bio, nbytes);
+	if (bio_integrity(bio))
+		bio_integrity_advance(bio, nbytes);
 
-		if (bio->bi_size == 0)
-			bio_endio(bio, error);
-	} else {
-		/*
-		 * Okay, this is the sequenced flush request in
-		 * progress, just record the error;
-		 */
-		if (error && !q->flush_err)
-			q->flush_err = error;
-	}
+	/* don't actually finish bio if it's part of flush sequence */
+	if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+		bio_endio(bio, error);
+	else if (error && !q->flush_err)
+		q->flush_err = error;
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
-- 
1.7.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 2/3] block: improve flush bio completion
@ 2011-01-21 15:59   ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth
  Cc: Tejun Heo

bio's for flush are completed twice - once during the data phase and
one more time after the whole sequence is complete.  The first
completion shouldn't notify completion to the issuer.

This was achieved by skipping all bio completion steps in
req_bio_endio() for the first completion; however, this has two
drawbacks.

* Error is not recorded in bio and must be tracked somewhere else.

* Partial completion is not supported.

Both don't cause problems for the current users; however, they make
further improvements difficult.  Change req_bio_endio() such that it
only skips the actual notification part for the first completion.  bio
completion is implemented with partial completions on mind anyway so
this is as simple as moving the REQ_FLUSH_SEQ conditional such that
only calling of bio_endio() is skipped.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c |   48 +++++++++++++++++++++---------------------------
 1 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 22e0b4f..038519b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -151,37 +151,31 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 {
 	struct request_queue *q = rq->q;
 
-	if (!(rq->cmd_flags & REQ_FLUSH_SEQ)) {
-		if (error)
-			clear_bit(BIO_UPTODATE, &bio->bi_flags);
-		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
-			error = -EIO;
-
-		if (unlikely(nbytes > bio->bi_size)) {
-			printk(KERN_ERR "%s: want %u bytes done, %u left\n",
-			       __func__, nbytes, bio->bi_size);
-			nbytes = bio->bi_size;
-		}
+	if (error)
+		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+		error = -EIO;
+
+	if (unlikely(nbytes > bio->bi_size)) {
+		printk(KERN_ERR "%s: want %u bytes done, %u left\n",
+		       __func__, nbytes, bio->bi_size);
+		nbytes = bio->bi_size;
+	}
 
-		if (unlikely(rq->cmd_flags & REQ_QUIET))
-			set_bit(BIO_QUIET, &bio->bi_flags);
+	if (unlikely(rq->cmd_flags & REQ_QUIET))
+		set_bit(BIO_QUIET, &bio->bi_flags);
 
-		bio->bi_size -= nbytes;
-		bio->bi_sector += (nbytes >> 9);
+	bio->bi_size -= nbytes;
+	bio->bi_sector += (nbytes >> 9);
 
-		if (bio_integrity(bio))
-			bio_integrity_advance(bio, nbytes);
+	if (bio_integrity(bio))
+		bio_integrity_advance(bio, nbytes);
 
-		if (bio->bi_size == 0)
-			bio_endio(bio, error);
-	} else {
-		/*
-		 * Okay, this is the sequenced flush request in
-		 * progress, just record the error;
-		 */
-		if (error && !q->flush_err)
-			q->flush_err = error;
-	}
+	/* don't actually finish bio if it's part of flush sequence */
+	if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
+		bio_endio(bio, error);
+	else if (error && !q->flush_err)
+		q->flush_err = error;
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
-- 
1.7.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 15:59 ` Tejun Heo
                   ` (2 preceding siblings ...)
  (?)
@ 2011-01-21 15:59 ` Tejun Heo
  2011-01-21 18:56   ` Vivek Goyal
                     ` (3 more replies)
  -1 siblings, 4 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef
  Cc: Tejun Heo

The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining.  As such, sequencing is done
queue-wide one flush request after another.  However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.

This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually.  The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue.  Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.

This allows arbitrary merging of different type of flushes.  How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.

This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.

* As flush requests are never put on the IO scheduler, request fields
  used for flush share space with rq->rb_node.  rq->completion_data is
  moved out of the union.  This increases the request size by one
  pointer.

  As rq->elevator_private* are used only by the iosched too, it is
  possible to reduce the request size further.  However, to do that,
  we need to modify request allocation path such that iosched data is
  not allocated for flush requests.

* FLUSH/FUA processing happens on insertion now instead of dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c         |   10 +-
 block/blk-flush.c        |  440 ++++++++++++++++++++++++++++++++--------------
 block/blk.h              |   12 +-
 block/elevator.c         |    7 +
 include/linux/blkdev.h   |   18 ++-
 include/linux/elevator.h |    1 +
 6 files changed, 332 insertions(+), 156 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 038519b..72dd23b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -149,8 +149,6 @@ EXPORT_SYMBOL(blk_rq_init);
 static void req_bio_endio(struct request *rq, struct bio *bio,
 			  unsigned int nbytes, int error)
 {
-	struct request_queue *q = rq->q;
-
 	if (error)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -174,8 +172,6 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
 		bio_endio(bio, error);
-	else if (error && !q->flush_err)
-		q->flush_err = error;
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -534,7 +530,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
-	INIT_LIST_HEAD(&q->pending_flushes);
+	INIT_LIST_HEAD(&q->flush_queue[0]);
+	INIT_LIST_HEAD(&q->flush_queue[1]);
+	INIT_LIST_HEAD(&q->flush_data_in_flight);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1213,7 +1211,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 	spin_lock_irq(q->queue_lock);
 
 	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
-		where = ELEVATOR_INSERT_FRONT;
+		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 8592869..cd73e93 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,6 +1,69 @@
 /*
  * Functions to sequence FLUSH and FUA writes.
+ *
+ * Copyright (C) 2011		Max Planck Institute for Gravitational Physics
+ * Copyright (C) 2011		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
+ * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
+ * properties and hardware capability.
+ *
+ * If a request doesn't have data, only REQ_FLUSH makes sense, which
+ * indicates a simple flush request.  If there is data, REQ_FLUSH indicates
+ * that the device cache should be flushed before the data is executed, and
+ * REQ_FUA means that the data must be on non-volatile media on request
+ * completion.
+ *
+ * If the device doesn't have writeback cache, FLUSH and FUA don't make any
+ * difference.  The requests are either completed immediately if there's no
+ * data or executed as normal requests otherwise.
+ *
+ * If the device has writeback cache and supports FUA, REQ_FLUSH is
+ * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
+ *
+ * If the device has writeback cache and doesn't support FUA, REQ_FLUSH is
+ * translated to PREFLUSH and REQ_FUA to POSTFLUSH.
+ *
+ * The actual execution of flush is double buffered.  Whenever a request
+ * needs to execute PRE or POSTFLUSH, it queues at
+ * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
+ * flush is issued and the pending_idx is toggled.  When the flush
+ * completes, all the requests which were pending are proceeded to the next
+ * step.  This allows arbitrary merging of different types of FLUSH/FUA
+ * requests.
+ *
+ * Currently, the following conditions are used to determine when to issue
+ * flush.
+ *
+ * C1. At any given time, only one flush shall be in progress.  This makes
+ *     double buffering sufficient.
+ *
+ * C2. Flush is not deferred if any request is executing DATA of its
+ *     sequence.  This avoids issuing separate POSTFLUSHes for requests
+ *     which shared PREFLUSH.
+ *
+ * C3. The second condition is ignored if there is a request which has
+ *     waited longer than FLUSH_PENDING_TIMEOUT.  This is to avoid
+ *     starvation in the unlikely case where there are continuous stream of
+ *     FUA (without FLUSH) requests.
+ *
+ * For devices which support FUA, it isn't clear whether C2 (and thus C3)
+ * is beneficial.
+ *
+ * Note that a sequenced FLUSH/FUA request with DATA is completed twice.
+ * Once while executing DATA and again after the whole sequence is
+ * complete.  The first completion updates the contained bio but doesn't
+ * finish it so that the bio submitter is notified only after the whole
+ * sequence is complete.  This is implemented by testing REQ_FLUSH_SEQ in
+ * req_bio_endio().
+ *
+ * The above peculiarity requires that each FLUSH/FUA request has only one
+ * bio attached to it, which is guaranteed as they aren't allowed to be
+ * merged in the usual way.
  */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/bio.h>
@@ -11,185 +74,290 @@
 
 /* FLUSH/FUA sequences */
 enum {
-	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
-	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
-	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
-	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
-	QUEUE_FSEQ_DONE		= (1 << 4),
+	REQ_FSEQ_PREFLUSH	= (1 << 0), /* pre-flushing in progress */
+	REQ_FSEQ_DATA		= (1 << 1), /* data write in progress */
+	REQ_FSEQ_POSTFLUSH	= (1 << 2), /* post-flushing in progress */
+	REQ_FSEQ_DONE		= (1 << 3),
+
+	REQ_FSEQ_ACTIONS	= REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA |
+				  REQ_FSEQ_POSTFLUSH,
+
+	/*
+	 * If flush has been pending longer than the following timeout,
+	 * it's issued even if flush_data requests are still in flight.
+	 */
+	FLUSH_PENDING_TIMEOUT	= 5 * HZ,
 };
 
-static struct request *queue_next_fseq(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q);
 
-unsigned blk_flush_cur_seq(struct request_queue *q)
+static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
 {
-	if (!q->flush_seq)
-		return 0;
-	return 1 << ffz(q->flush_seq);
+	unsigned int policy = 0;
+
+	if (fflags & REQ_FLUSH) {
+		if (rq->cmd_flags & REQ_FLUSH)
+			policy |= REQ_FSEQ_PREFLUSH;
+		if (blk_rq_sectors(rq))
+			policy |= REQ_FSEQ_DATA;
+		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
+			policy |= REQ_FSEQ_POSTFLUSH;
+	}
+	return policy;
 }
 
-static struct request *blk_flush_complete_seq(struct request_queue *q,
-					      unsigned seq, int error)
+static unsigned int blk_flush_cur_seq(struct request *rq)
 {
-	struct request *next_rq = NULL;
-
-	if (error && !q->flush_err)
-		q->flush_err = error;
-
-	BUG_ON(q->flush_seq & seq);
-	q->flush_seq |= seq;
-
-	if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
-		/* not complete yet, queue the next flush sequence */
-		next_rq = queue_next_fseq(q);
-	} else {
-		/* complete this flush request */
-		__blk_end_request_all(q->orig_flush_rq, q->flush_err);
-		q->orig_flush_rq = NULL;
-		q->flush_seq = 0;
-
-		/* dispatch the next flush if there's one */
-		if (!list_empty(&q->pending_flushes)) {
-			next_rq = list_entry_rq(q->pending_flushes.next);
-			list_move(&next_rq->queuelist, &q->queue_head);
-		}
-	}
-	return next_rq;
+	return 1 << ffz(rq->flush.seq);
 }
 
-static void blk_flush_complete_seq_end_io(struct request_queue *q,
-					  unsigned seq, int error)
+static void blk_flush_restore_request(struct request *rq)
 {
-	bool was_empty = elv_queue_empty(q);
-	struct request *next_rq;
-
-	next_rq = blk_flush_complete_seq(q, seq, error);
-
 	/*
-	 * Moving a request silently to empty queue_head may stall the
-	 * queue.  Kick the queue in those cases.
+	 * After flush data completion, @rq->bio is %NULL but we need to
+	 * complete the bio again.  @rq->biotail is guaranteed to equal the
+	 * original @rq->bio.  Restore it.
 	 */
-	if (was_empty && next_rq)
-		__blk_run_queue(q);
+	rq->bio = rq->biotail;
+
+	/* make @rq a normal request */
+	rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+	rq->end_io = NULL;
 }
 
-static void pre_flush_end_io(struct request *rq, int error)
+/**
+ * blk_flush_complete_seq - complete flush sequence
+ * @rq: FLUSH/FUA request being sequenced
+ * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
+ * @error: whether an error occurred
+ *
+ * @rq just completed @seq part of its flush sequence, record the
+ * completion and trigger the next step.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if requests were added to the dispatch queue, %false otherwise.
+ */
+static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
+				   int error)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	struct request_queue *q = rq->q;
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	bool queued = false;
+
+	BUG_ON(rq->flush.seq & seq);
+	rq->flush.seq |= seq;
+
+	if (likely(!error))
+		seq = blk_flush_cur_seq(rq);
+	else
+		seq = REQ_FSEQ_DONE;
+
+	switch (seq) {
+	case REQ_FSEQ_PREFLUSH:
+	case REQ_FSEQ_POSTFLUSH:
+		/* queue for flush */
+		if (list_empty(pending))
+			q->flush_pending_since = jiffies;
+		list_move_tail(&rq->flush.list, pending);
+		break;
+
+	case REQ_FSEQ_DATA:
+		list_move_tail(&rq->flush.list, &q->flush_data_in_flight);
+		list_add(&rq->queuelist, &q->queue_head);
+		queued = true;
+		break;
+
+	case REQ_FSEQ_DONE:
+		/*
+		 * @rq was previously adjusted by blk_flush_issue() for
+		 * flush sequencing and may already have gone through the
+		 * flush data request completion path.  Restore @rq for
+		 * normal completion and end it.
+		 */
+		BUG_ON(!list_empty(&rq->queuelist));
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		__blk_end_request_all(rq, error);
+		break;
+
+	default:
+		BUG();
+	}
+
+	return blk_kick_flush(q) | queued;
 }
 
-static void flush_data_end_io(struct request *rq, int error)
+static void flush_end_io(struct request *flush_rq, int error)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
+	struct request_queue *q = flush_rq->q;
+	struct list_head *running = &q->flush_queue[q->flush_running_idx];
+	bool was_empty = elv_queue_empty(q);
+	bool queued = false;
+	struct request *rq, *n;
+
+	BUG_ON(q->flush_pending_idx == q->flush_running_idx);
+
+	/* account completion of the flush request */
+	q->flush_running_idx ^= 1;
+	elv_completed_request(q, flush_rq);
+
+	/* and push the waiting requests to the next stage */
+	list_for_each_entry_safe(rq, n, running, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+		queued |= blk_flush_complete_seq(rq, seq, error);
+	}
+
+	/* after populating an empty queue, kick it to avoid stall */
+	if (queued && was_empty)
+		__blk_run_queue(q);
 }
 
-static void post_flush_end_io(struct request *rq, int error)
+/**
+ * blk_kick_flush - consider issuing flush request
+ * @q: request_queue being kicked
+ *
+ * Flush related states of @q have changed, consider issuing flush request.
+ * Please read the comment at the top of this file for more info.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if flush was issued, %false otherwise.
+ */
+static bool blk_kick_flush(struct request_queue *q)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	struct request *first_rq =
+		list_first_entry(pending, struct request, flush.list);
+
+	/* C1 described at the top of this file */
+	if (q->flush_pending_idx != q->flush_running_idx || list_empty(pending))
+		return false;
+
+	/* C2 and C3 */
+	if (!list_empty(&q->flush_data_in_flight) &&
+	    time_before(jiffies,
+			q->flush_pending_since + FLUSH_PENDING_TIMEOUT))
+		return false;
+
+	/*
+	 * Issue flush and toggle pending_idx.  This makes pending_idx
+	 * different from running_idx, which means flush is in flight.
+	 */
+	blk_rq_init(q, &q->flush_rq);
+	q->flush_rq.cmd_type = REQ_TYPE_FS;
+	q->flush_rq.cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
+	q->flush_rq.rq_disk = first_rq->rq_disk;
+	q->flush_rq.end_io = flush_end_io;
+
+	q->flush_pending_idx ^= 1;
+	elv_insert(q, &q->flush_rq, ELEVATOR_INSERT_FRONT);
+	return true;
 }
 
-static void init_flush_request(struct request *rq, struct gendisk *disk)
+static void flush_data_end_io(struct request *rq, int error)
 {
-	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = WRITE_FLUSH;
-	rq->rq_disk = disk;
+	struct request_queue *q = rq->q;
+	bool was_empty = elv_queue_empty(q);
+
+	/* after populating an empty queue, kick it to avoid stall */
+	if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error) && was_empty)
+		__blk_run_queue(q);
 }
 
-static struct request *queue_next_fseq(struct request_queue *q)
+/**
+ * blk_insert_flush - insert a new FLUSH/FUA request
+ * @rq: request to insert
+ *
+ * To be called from elv_insert() for %ELEVATOR_INSERT_FLUSH insertions.
+ * @rq is being submitted.  Analyze what needs to be done and put it on the
+ * right queue.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_insert_flush(struct request *rq)
 {
-	struct request *orig_rq = q->orig_flush_rq;
-	struct request *rq = &q->flush_rq;
+	struct request_queue *q = rq->q;
+	unsigned int fflags = q->flush_flags;	/* may change, cache */
+	unsigned int policy = blk_flush_policy(fflags, rq);
 
-	blk_rq_init(q, rq);
+	BUG_ON(rq->end_io);
+	BUG_ON(!rq->bio || rq->bio != rq->biotail);
 
-	switch (blk_flush_cur_seq(q)) {
-	case QUEUE_FSEQ_PREFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = pre_flush_end_io;
-		break;
-	case QUEUE_FSEQ_DATA:
-		init_request_from_bio(rq, orig_rq->bio);
-		/*
-		 * orig_rq->rq_disk may be different from
-		 * bio->bi_bdev->bd_disk if orig_rq got here through
-		 * remapping drivers.  Make sure rq->rq_disk points
-		 * to the same one as orig_rq.
-		 */
-		rq->rq_disk = orig_rq->rq_disk;
-		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
-		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
-		rq->end_io = flush_data_end_io;
-		break;
-	case QUEUE_FSEQ_POSTFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = post_flush_end_io;
-		break;
-	default:
-		BUG();
+	/*
+	 * @policy now records what operations need to be done.  Adjust
+	 * REQ_FLUSH and FUA for the driver.
+	 */
+	rq->cmd_flags &= ~REQ_FLUSH;
+	if (!(fflags & REQ_FUA))
+		rq->cmd_flags &= ~REQ_FUA;
+
+	/*
+	 * If there's data but flush is not necessary, the request can be
+	 * processed directly without going through flush machinery.  Queue
+	 * for normal execution.
+	 */
+	if ((policy & REQ_FSEQ_DATA) &&
+	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
+		list_add(&rq->queuelist, &q->queue_head);
+		return;
 	}
 
+	/*
+	 * @rq should go through flush machinery.  Mark it part of flush
+	 * sequence and submit for further processing.
+	 */
+	memset(&rq->flush, 0, sizeof(rq->flush));
+	INIT_LIST_HEAD(&rq->flush.list);
 	rq->cmd_flags |= REQ_FLUSH_SEQ;
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	return rq;
+	rq->end_io = flush_data_end_io;
+
+	blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
 }
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq)
+/**
+ * blk_abort_flush - @q is being aborted, abort flush requests
+ * @q: request_queue being aborted
+ *
+ * To be called from elv_abort_queue().  @q is being aborted.  Prepare all
+ * FLUSH/FUA requests for abortion.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_abort_flushes(struct request_queue *q)
 {
-	unsigned int fflags = q->flush_flags; /* may change, cache it */
-	bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
-	bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
-	bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
-	unsigned skip = 0;
+	struct request *rq, *n;
+	int i;
 
 	/*
-	 * Special case.  If there's data but flush is not necessary,
-	 * the request can be issued directly.
-	 *
-	 * Flush w/o data should be able to be issued directly too but
-	 * currently some drivers assume that rq->bio contains
-	 * non-zero data if it isn't NULL and empty FLUSH requests
-	 * getting here usually have bio's without data.
+	 * Requests in flight for data are already owned by the dispatch
+	 * queue or the device driver.  Just restore for normal completion.
 	 */
-	if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
-		rq->cmd_flags &= ~REQ_FLUSH;
-		if (!has_fua)
-			rq->cmd_flags &= ~REQ_FUA;
-		return rq;
+	list_for_each_entry_safe(rq, n, &q->flush_data_in_flight, flush.list) {
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
 	}
 
 	/*
-	 * Sequenced flushes can't be processed in parallel.  If
-	 * another one is already in progress, queue for later
-	 * processing.
+	 * We need to give away requests on flush queues.  Restore for
+	 * normal completion and put them on the dispatch queue.
 	 */
-	if (q->flush_seq) {
-		list_move_tail(&rq->queuelist, &q->pending_flushes);
-		return NULL;
+	for (i = 0; i < ARRAY_SIZE(q->flush_queue); i++) {
+		list_for_each_entry_safe(rq, n, &q->flush_queue[i],
+					 flush.list) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			list_add_tail(&rq->queuelist, &q->queue_head);
+		}
 	}
-
-	/*
-	 * Start a new flush sequence
-	 */
-	q->flush_err = 0;
-	q->flush_seq |= QUEUE_FSEQ_STARTED;
-
-	/* adjust FLUSH/FUA of the original request and stash it away */
-	rq->cmd_flags &= ~REQ_FLUSH;
-	if (!has_fua)
-		rq->cmd_flags &= ~REQ_FUA;
-	blk_dequeue_request(rq);
-	q->orig_flush_rq = rq;
-
-	/* skip unneded sequences and return the first one */
-	if (!do_preflush)
-		skip |= QUEUE_FSEQ_PREFLUSH;
-	if (!blk_rq_sectors(rq))
-		skip |= QUEUE_FSEQ_DATA;
-	if (!do_postflush)
-		skip |= QUEUE_FSEQ_POSTFLUSH;
-	return blk_flush_complete_seq(q, skip, 0);
 }
 
 static void bio_end_flush(struct bio *bio, int err)
diff --git a/block/blk.h b/block/blk.h
index 9d2ee8f..284b500 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,21 +51,17 @@ static inline void blk_clear_rq_complete(struct request *rq)
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq);
+void blk_insert_flush(struct request *rq);
+void blk_abort_flushes(struct request_queue *q);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
 
 	while (1) {
-		while (!list_empty(&q->queue_head)) {
+		if (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    (rq->cmd_flags & REQ_FLUSH_SEQ))
-				return rq;
-			rq = blk_do_flush(q, rq);
-			if (rq)
-				return rq;
+			return rq;
 		}
 
 		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/block/elevator.c b/block/elevator.c
index 2569512..270e097 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -673,6 +673,11 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
 
+	case ELEVATOR_INSERT_FLUSH:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		blk_insert_flush(rq);
+		break;
+
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -785,6 +790,8 @@ void elv_abort_queue(struct request_queue *q)
 {
 	struct request *rq;
 
+	blk_abort_flushes(q);
+
 	while (!list_empty(&q->queue_head)) {
 		rq = list_entry_rq(q->queue_head.next);
 		rq->cmd_flags |= REQ_QUIET;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..8a082a5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,13 +99,18 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * completion_data share space with the rb_node.
+	 * flush fields share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		void *completion_data;
+		struct {
+			unsigned int			seq;
+			struct list_head		list;
+		} flush;
 	};
 
+	void *completion_data;
+
 	/*
 	 * Three pointers are available for the IO schedulers, if they need
 	 * more they have to dynamically allocate it.
@@ -363,11 +368,12 @@ struct request_queue
 	 * for flush operations
 	 */
 	unsigned int		flush_flags;
-	unsigned int		flush_seq;
-	int			flush_err;
+	unsigned int		flush_pending_idx:1;
+	unsigned int		flush_running_idx:1;
+	unsigned long		flush_pending_since;
+	struct list_head	flush_queue[2];
+	struct list_head	flush_data_in_flight;
 	struct request		flush_rq;
-	struct request		*orig_flush_rq;
-	struct list_head	pending_flushes;
 
 	struct mutex		sysfs_lock;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4d85797..39b68ed 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -167,6 +167,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define ELEVATOR_INSERT_BACK	2
 #define ELEVATOR_INSERT_SORT	3
 #define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_FLUSH	5
 
 /*
  * return values from elevator_may_queue_fn
-- 
1.7.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 15:59 ` Tejun Heo
                   ` (3 preceding siblings ...)
  (?)
@ 2011-01-21 15:59 ` Tejun Heo
  -1 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-21 15:59 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth
  Cc: Tejun Heo

The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining.  As such, sequencing is done
queue-wide one flush request after another.  However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.

This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually.  The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue.  Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.

This allows arbitrary merging of different type of flushes.  How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.

This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.

* As flush requests are never put on the IO scheduler, request fields
  used for flush share space with rq->rb_node.  rq->completion_data is
  moved out of the union.  This increases the request size by one
  pointer.

  As rq->elevator_private* are used only by the iosched too, it is
  possible to reduce the request size further.  However, to do that,
  we need to modify request allocation path such that iosched data is
  not allocated for flush requests.

* FLUSH/FUA processing happens on insertion now instead of dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c         |   10 +-
 block/blk-flush.c        |  440 ++++++++++++++++++++++++++++++++--------------
 block/blk.h              |   12 +-
 block/elevator.c         |    7 +
 include/linux/blkdev.h   |   18 ++-
 include/linux/elevator.h |    1 +
 6 files changed, 332 insertions(+), 156 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 038519b..72dd23b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -149,8 +149,6 @@ EXPORT_SYMBOL(blk_rq_init);
 static void req_bio_endio(struct request *rq, struct bio *bio,
 			  unsigned int nbytes, int error)
 {
-	struct request_queue *q = rq->q;
-
 	if (error)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -174,8 +172,6 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
 		bio_endio(bio, error);
-	else if (error && !q->flush_err)
-		q->flush_err = error;
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -534,7 +530,9 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
-	INIT_LIST_HEAD(&q->pending_flushes);
+	INIT_LIST_HEAD(&q->flush_queue[0]);
+	INIT_LIST_HEAD(&q->flush_queue[1]);
+	INIT_LIST_HEAD(&q->flush_data_in_flight);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1213,7 +1211,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 	spin_lock_irq(q->queue_lock);
 
 	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
-		where = ELEVATOR_INSERT_FRONT;
+		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
 
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 8592869..cd73e93 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,6 +1,69 @@
 /*
  * Functions to sequence FLUSH and FUA writes.
+ *
+ * Copyright (C) 2011		Max Planck Institute for Gravitational Physics
+ * Copyright (C) 2011		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
+ * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
+ * properties and hardware capability.
+ *
+ * If a request doesn't have data, only REQ_FLUSH makes sense, which
+ * indicates a simple flush request.  If there is data, REQ_FLUSH indicates
+ * that the device cache should be flushed before the data is executed, and
+ * REQ_FUA means that the data must be on non-volatile media on request
+ * completion.
+ *
+ * If the device doesn't have writeback cache, FLUSH and FUA don't make any
+ * difference.  The requests are either completed immediately if there's no
+ * data or executed as normal requests otherwise.
+ *
+ * If the device has writeback cache and supports FUA, REQ_FLUSH is
+ * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
+ *
+ * If the device has writeback cache and doesn't support FUA, REQ_FLUSH is
+ * translated to PREFLUSH and REQ_FUA to POSTFLUSH.
+ *
+ * The actual execution of flush is double buffered.  Whenever a request
+ * needs to execute PRE or POSTFLUSH, it queues at
+ * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
+ * flush is issued and the pending_idx is toggled.  When the flush
+ * completes, all the requests which were pending are proceeded to the next
+ * step.  This allows arbitrary merging of different types of FLUSH/FUA
+ * requests.
+ *
+ * Currently, the following conditions are used to determine when to issue
+ * flush.
+ *
+ * C1. At any given time, only one flush shall be in progress.  This makes
+ *     double buffering sufficient.
+ *
+ * C2. Flush is not deferred if any request is executing DATA of its
+ *     sequence.  This avoids issuing separate POSTFLUSHes for requests
+ *     which shared PREFLUSH.
+ *
+ * C3. The second condition is ignored if there is a request which has
+ *     waited longer than FLUSH_PENDING_TIMEOUT.  This is to avoid
+ *     starvation in the unlikely case where there are continuous stream of
+ *     FUA (without FLUSH) requests.
+ *
+ * For devices which support FUA, it isn't clear whether C2 (and thus C3)
+ * is beneficial.
+ *
+ * Note that a sequenced FLUSH/FUA request with DATA is completed twice.
+ * Once while executing DATA and again after the whole sequence is
+ * complete.  The first completion updates the contained bio but doesn't
+ * finish it so that the bio submitter is notified only after the whole
+ * sequence is complete.  This is implemented by testing REQ_FLUSH_SEQ in
+ * req_bio_endio().
+ *
+ * The above peculiarity requires that each FLUSH/FUA request has only one
+ * bio attached to it, which is guaranteed as they aren't allowed to be
+ * merged in the usual way.
  */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/bio.h>
@@ -11,185 +74,290 @@
 
 /* FLUSH/FUA sequences */
 enum {
-	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
-	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
-	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
-	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
-	QUEUE_FSEQ_DONE		= (1 << 4),
+	REQ_FSEQ_PREFLUSH	= (1 << 0), /* pre-flushing in progress */
+	REQ_FSEQ_DATA		= (1 << 1), /* data write in progress */
+	REQ_FSEQ_POSTFLUSH	= (1 << 2), /* post-flushing in progress */
+	REQ_FSEQ_DONE		= (1 << 3),
+
+	REQ_FSEQ_ACTIONS	= REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA |
+				  REQ_FSEQ_POSTFLUSH,
+
+	/*
+	 * If flush has been pending longer than the following timeout,
+	 * it's issued even if flush_data requests are still in flight.
+	 */
+	FLUSH_PENDING_TIMEOUT	= 5 * HZ,
 };
 
-static struct request *queue_next_fseq(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q);
 
-unsigned blk_flush_cur_seq(struct request_queue *q)
+static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
 {
-	if (!q->flush_seq)
-		return 0;
-	return 1 << ffz(q->flush_seq);
+	unsigned int policy = 0;
+
+	if (fflags & REQ_FLUSH) {
+		if (rq->cmd_flags & REQ_FLUSH)
+			policy |= REQ_FSEQ_PREFLUSH;
+		if (blk_rq_sectors(rq))
+			policy |= REQ_FSEQ_DATA;
+		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
+			policy |= REQ_FSEQ_POSTFLUSH;
+	}
+	return policy;
 }
 
-static struct request *blk_flush_complete_seq(struct request_queue *q,
-					      unsigned seq, int error)
+static unsigned int blk_flush_cur_seq(struct request *rq)
 {
-	struct request *next_rq = NULL;
-
-	if (error && !q->flush_err)
-		q->flush_err = error;
-
-	BUG_ON(q->flush_seq & seq);
-	q->flush_seq |= seq;
-
-	if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
-		/* not complete yet, queue the next flush sequence */
-		next_rq = queue_next_fseq(q);
-	} else {
-		/* complete this flush request */
-		__blk_end_request_all(q->orig_flush_rq, q->flush_err);
-		q->orig_flush_rq = NULL;
-		q->flush_seq = 0;
-
-		/* dispatch the next flush if there's one */
-		if (!list_empty(&q->pending_flushes)) {
-			next_rq = list_entry_rq(q->pending_flushes.next);
-			list_move(&next_rq->queuelist, &q->queue_head);
-		}
-	}
-	return next_rq;
+	return 1 << ffz(rq->flush.seq);
 }
 
-static void blk_flush_complete_seq_end_io(struct request_queue *q,
-					  unsigned seq, int error)
+static void blk_flush_restore_request(struct request *rq)
 {
-	bool was_empty = elv_queue_empty(q);
-	struct request *next_rq;
-
-	next_rq = blk_flush_complete_seq(q, seq, error);
-
 	/*
-	 * Moving a request silently to empty queue_head may stall the
-	 * queue.  Kick the queue in those cases.
+	 * After flush data completion, @rq->bio is %NULL but we need to
+	 * complete the bio again.  @rq->biotail is guaranteed to equal the
+	 * original @rq->bio.  Restore it.
 	 */
-	if (was_empty && next_rq)
-		__blk_run_queue(q);
+	rq->bio = rq->biotail;
+
+	/* make @rq a normal request */
+	rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+	rq->end_io = NULL;
 }
 
-static void pre_flush_end_io(struct request *rq, int error)
+/**
+ * blk_flush_complete_seq - complete flush sequence
+ * @rq: FLUSH/FUA request being sequenced
+ * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
+ * @error: whether an error occurred
+ *
+ * @rq just completed @seq part of its flush sequence, record the
+ * completion and trigger the next step.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if requests were added to the dispatch queue, %false otherwise.
+ */
+static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
+				   int error)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	struct request_queue *q = rq->q;
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	bool queued = false;
+
+	BUG_ON(rq->flush.seq & seq);
+	rq->flush.seq |= seq;
+
+	if (likely(!error))
+		seq = blk_flush_cur_seq(rq);
+	else
+		seq = REQ_FSEQ_DONE;
+
+	switch (seq) {
+	case REQ_FSEQ_PREFLUSH:
+	case REQ_FSEQ_POSTFLUSH:
+		/* queue for flush */
+		if (list_empty(pending))
+			q->flush_pending_since = jiffies;
+		list_move_tail(&rq->flush.list, pending);
+		break;
+
+	case REQ_FSEQ_DATA:
+		list_move_tail(&rq->flush.list, &q->flush_data_in_flight);
+		list_add(&rq->queuelist, &q->queue_head);
+		queued = true;
+		break;
+
+	case REQ_FSEQ_DONE:
+		/*
+		 * @rq was previously adjusted by blk_flush_issue() for
+		 * flush sequencing and may already have gone through the
+		 * flush data request completion path.  Restore @rq for
+		 * normal completion and end it.
+		 */
+		BUG_ON(!list_empty(&rq->queuelist));
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		__blk_end_request_all(rq, error);
+		break;
+
+	default:
+		BUG();
+	}
+
+	return blk_kick_flush(q) | queued;
 }
 
-static void flush_data_end_io(struct request *rq, int error)
+static void flush_end_io(struct request *flush_rq, int error)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
+	struct request_queue *q = flush_rq->q;
+	struct list_head *running = &q->flush_queue[q->flush_running_idx];
+	bool was_empty = elv_queue_empty(q);
+	bool queued = false;
+	struct request *rq, *n;
+
+	BUG_ON(q->flush_pending_idx == q->flush_running_idx);
+
+	/* account completion of the flush request */
+	q->flush_running_idx ^= 1;
+	elv_completed_request(q, flush_rq);
+
+	/* and push the waiting requests to the next stage */
+	list_for_each_entry_safe(rq, n, running, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+		queued |= blk_flush_complete_seq(rq, seq, error);
+	}
+
+	/* after populating an empty queue, kick it to avoid stall */
+	if (queued && was_empty)
+		__blk_run_queue(q);
 }
 
-static void post_flush_end_io(struct request *rq, int error)
+/**
+ * blk_kick_flush - consider issuing flush request
+ * @q: request_queue being kicked
+ *
+ * Flush related states of @q have changed, consider issuing flush request.
+ * Please read the comment at the top of this file for more info.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if flush was issued, %false otherwise.
+ */
+static bool blk_kick_flush(struct request_queue *q)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	struct request *first_rq =
+		list_first_entry(pending, struct request, flush.list);
+
+	/* C1 described at the top of this file */
+	if (q->flush_pending_idx != q->flush_running_idx || list_empty(pending))
+		return false;
+
+	/* C2 and C3 */
+	if (!list_empty(&q->flush_data_in_flight) &&
+	    time_before(jiffies,
+			q->flush_pending_since + FLUSH_PENDING_TIMEOUT))
+		return false;
+
+	/*
+	 * Issue flush and toggle pending_idx.  This makes pending_idx
+	 * different from running_idx, which means flush is in flight.
+	 */
+	blk_rq_init(q, &q->flush_rq);
+	q->flush_rq.cmd_type = REQ_TYPE_FS;
+	q->flush_rq.cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
+	q->flush_rq.rq_disk = first_rq->rq_disk;
+	q->flush_rq.end_io = flush_end_io;
+
+	q->flush_pending_idx ^= 1;
+	elv_insert(q, &q->flush_rq, ELEVATOR_INSERT_FRONT);
+	return true;
 }
 
-static void init_flush_request(struct request *rq, struct gendisk *disk)
+static void flush_data_end_io(struct request *rq, int error)
 {
-	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = WRITE_FLUSH;
-	rq->rq_disk = disk;
+	struct request_queue *q = rq->q;
+	bool was_empty = elv_queue_empty(q);
+
+	/* after populating an empty queue, kick it to avoid stall */
+	if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error) && was_empty)
+		__blk_run_queue(q);
 }
 
-static struct request *queue_next_fseq(struct request_queue *q)
+/**
+ * blk_insert_flush - insert a new FLUSH/FUA request
+ * @rq: request to insert
+ *
+ * To be called from elv_insert() for %ELEVATOR_INSERT_FLUSH insertions.
+ * @rq is being submitted.  Analyze what needs to be done and put it on the
+ * right queue.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_insert_flush(struct request *rq)
 {
-	struct request *orig_rq = q->orig_flush_rq;
-	struct request *rq = &q->flush_rq;
+	struct request_queue *q = rq->q;
+	unsigned int fflags = q->flush_flags;	/* may change, cache */
+	unsigned int policy = blk_flush_policy(fflags, rq);
 
-	blk_rq_init(q, rq);
+	BUG_ON(rq->end_io);
+	BUG_ON(!rq->bio || rq->bio != rq->biotail);
 
-	switch (blk_flush_cur_seq(q)) {
-	case QUEUE_FSEQ_PREFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = pre_flush_end_io;
-		break;
-	case QUEUE_FSEQ_DATA:
-		init_request_from_bio(rq, orig_rq->bio);
-		/*
-		 * orig_rq->rq_disk may be different from
-		 * bio->bi_bdev->bd_disk if orig_rq got here through
-		 * remapping drivers.  Make sure rq->rq_disk points
-		 * to the same one as orig_rq.
-		 */
-		rq->rq_disk = orig_rq->rq_disk;
-		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
-		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
-		rq->end_io = flush_data_end_io;
-		break;
-	case QUEUE_FSEQ_POSTFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = post_flush_end_io;
-		break;
-	default:
-		BUG();
+	/*
+	 * @policy now records what operations need to be done.  Adjust
+	 * REQ_FLUSH and FUA for the driver.
+	 */
+	rq->cmd_flags &= ~REQ_FLUSH;
+	if (!(fflags & REQ_FUA))
+		rq->cmd_flags &= ~REQ_FUA;
+
+	/*
+	 * If there's data but flush is not necessary, the request can be
+	 * processed directly without going through flush machinery.  Queue
+	 * for normal execution.
+	 */
+	if ((policy & REQ_FSEQ_DATA) &&
+	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
+		list_add(&rq->queuelist, &q->queue_head);
+		return;
 	}
 
+	/*
+	 * @rq should go through flush machinery.  Mark it part of flush
+	 * sequence and submit for further processing.
+	 */
+	memset(&rq->flush, 0, sizeof(rq->flush));
+	INIT_LIST_HEAD(&rq->flush.list);
 	rq->cmd_flags |= REQ_FLUSH_SEQ;
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	return rq;
+	rq->end_io = flush_data_end_io;
+
+	blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
 }
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq)
+/**
+ * blk_abort_flush - @q is being aborted, abort flush requests
+ * @q: request_queue being aborted
+ *
+ * To be called from elv_abort_queue().  @q is being aborted.  Prepare all
+ * FLUSH/FUA requests for abortion.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_abort_flushes(struct request_queue *q)
 {
-	unsigned int fflags = q->flush_flags; /* may change, cache it */
-	bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
-	bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
-	bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
-	unsigned skip = 0;
+	struct request *rq, *n;
+	int i;
 
 	/*
-	 * Special case.  If there's data but flush is not necessary,
-	 * the request can be issued directly.
-	 *
-	 * Flush w/o data should be able to be issued directly too but
-	 * currently some drivers assume that rq->bio contains
-	 * non-zero data if it isn't NULL and empty FLUSH requests
-	 * getting here usually have bio's without data.
+	 * Requests in flight for data are already owned by the dispatch
+	 * queue or the device driver.  Just restore for normal completion.
 	 */
-	if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
-		rq->cmd_flags &= ~REQ_FLUSH;
-		if (!has_fua)
-			rq->cmd_flags &= ~REQ_FUA;
-		return rq;
+	list_for_each_entry_safe(rq, n, &q->flush_data_in_flight, flush.list) {
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
 	}
 
 	/*
-	 * Sequenced flushes can't be processed in parallel.  If
-	 * another one is already in progress, queue for later
-	 * processing.
+	 * We need to give away requests on flush queues.  Restore for
+	 * normal completion and put them on the dispatch queue.
 	 */
-	if (q->flush_seq) {
-		list_move_tail(&rq->queuelist, &q->pending_flushes);
-		return NULL;
+	for (i = 0; i < ARRAY_SIZE(q->flush_queue); i++) {
+		list_for_each_entry_safe(rq, n, &q->flush_queue[i],
+					 flush.list) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			list_add_tail(&rq->queuelist, &q->queue_head);
+		}
 	}
-
-	/*
-	 * Start a new flush sequence
-	 */
-	q->flush_err = 0;
-	q->flush_seq |= QUEUE_FSEQ_STARTED;
-
-	/* adjust FLUSH/FUA of the original request and stash it away */
-	rq->cmd_flags &= ~REQ_FLUSH;
-	if (!has_fua)
-		rq->cmd_flags &= ~REQ_FUA;
-	blk_dequeue_request(rq);
-	q->orig_flush_rq = rq;
-
-	/* skip unneded sequences and return the first one */
-	if (!do_preflush)
-		skip |= QUEUE_FSEQ_PREFLUSH;
-	if (!blk_rq_sectors(rq))
-		skip |= QUEUE_FSEQ_DATA;
-	if (!do_postflush)
-		skip |= QUEUE_FSEQ_POSTFLUSH;
-	return blk_flush_complete_seq(q, skip, 0);
 }
 
 static void bio_end_flush(struct bio *bio, int err)
diff --git a/block/blk.h b/block/blk.h
index 9d2ee8f..284b500 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -51,21 +51,17 @@ static inline void blk_clear_rq_complete(struct request *rq)
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq);
+void blk_insert_flush(struct request *rq);
+void blk_abort_flushes(struct request_queue *q);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
 
 	while (1) {
-		while (!list_empty(&q->queue_head)) {
+		if (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    (rq->cmd_flags & REQ_FLUSH_SEQ))
-				return rq;
-			rq = blk_do_flush(q, rq);
-			if (rq)
-				return rq;
+			return rq;
 		}
 
 		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
diff --git a/block/elevator.c b/block/elevator.c
index 2569512..270e097 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -673,6 +673,11 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
 
+	case ELEVATOR_INSERT_FLUSH:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		blk_insert_flush(rq);
+		break;
+
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -785,6 +790,8 @@ void elv_abort_queue(struct request_queue *q)
 {
 	struct request *rq;
 
+	blk_abort_flushes(q);
+
 	while (!list_empty(&q->queue_head)) {
 		rq = list_entry_rq(q->queue_head.next);
 		rq->cmd_flags |= REQ_QUIET;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4d18ff3..8a082a5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,13 +99,18 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * completion_data share space with the rb_node.
+	 * flush fields share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		void *completion_data;
+		struct {
+			unsigned int			seq;
+			struct list_head		list;
+		} flush;
 	};
 
+	void *completion_data;
+
 	/*
 	 * Three pointers are available for the IO schedulers, if they need
 	 * more they have to dynamically allocate it.
@@ -363,11 +368,12 @@ struct request_queue
 	 * for flush operations
 	 */
 	unsigned int		flush_flags;
-	unsigned int		flush_seq;
-	int			flush_err;
+	unsigned int		flush_pending_idx:1;
+	unsigned int		flush_running_idx:1;
+	unsigned long		flush_pending_since;
+	struct list_head	flush_queue[2];
+	struct list_head	flush_data_in_flight;
 	struct request		flush_rq;
-	struct request		*orig_flush_rq;
-	struct list_head	pending_flushes;
 
 	struct mutex		sysfs_lock;
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 4d85797..39b68ed 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -167,6 +167,7 @@ extern struct request *elv_rb_find(struct rb_root *, sector_t);
 #define ELEVATOR_INSERT_BACK	2
 #define ELEVATOR_INSERT_SORT	3
 #define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_FLUSH	5
 
 /*
  * return values from elevator_may_queue_fn
-- 
1.7.1


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 15:59 ` [PATCH 3/3] block: reimplement FLUSH/FUA to support merge Tejun Heo
@ 2011-01-21 18:56   ` Vivek Goyal
  2011-01-21 19:19     ` Vivek Goyal
  2011-01-23 10:25     ` Tejun Heo
  2011-01-22  0:49   ` Mike Snitzer
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 48+ messages in thread
From: Vivek Goyal @ 2011-01-21 18:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	Tejun Heo

On Fri, Jan 21, 2011 at 04:59:58PM +0100, Tejun Heo wrote:

[..]
> + * The actual execution of flush is double buffered.  Whenever a request
> + * needs to execute PRE or POSTFLUSH, it queues at
> + * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
> + * flush is issued and the pending_idx is toggled.  When the flush
> + * completes, all the requests which were pending are proceeded to the next
> + * step.  This allows arbitrary merging of different types of FLUSH/FUA
> + * requests.
> + *
> + * Currently, the following conditions are used to determine when to issue
> + * flush.
> + *
> + * C1. At any given time, only one flush shall be in progress.  This makes
> + *     double buffering sufficient.
> + *
> + * C2. Flush is not deferred if any request is executing DATA of its
> + *     sequence.  This avoids issuing separate POSTFLUSHes for requests
> + *     which shared PREFLUSH.

Tejun, did you mean "Flush is deferred" instead of "Flush is not deferred"
above?

IIUC, C2 might help only if requests which contain data are also going to 
issue postflush. Couple of cases come to mind.

- If queue supports FUA, I think we will not issue POSTFLUSH. In that
  case issuing next PREFLUSH which data is in flight might make sense.

- Even if queue does not support FUA and we are only getting requests
  with REQ_FLUSH then also waiting for data requests to finish before
  issuing next FLUSH might not help.

- Even if queue does not support FUA and say we have a mix of REQ_FUA
  and REQ_FLUSH, then this will help only if in a batch we have more
  than 1 request which is going to issue POSTFLUSH and those postflush
  will be merged.

- Ric Wheeler was once mentioning that there are boxes which advertise
  writeback cache but are battery backed so they ignore flush internally and
  signal completion immediately. I am not sure how prevalent those
  cases are but I think waiting for data to finish will delay processing
  of new REQ_FLUSH requests in pending queue for such array. There
  we will not anyway benefit from merging of FLUSH.

Given that C2 is going to benefit primarily only if queue does not support
FUA and we have many requets with REQ_FUA set, will it make sense to 
put additional checks for C2. Atleast a simple queue support FUA
check might help.

In practice does C2 really help or we can get rid of it entirely?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 18:56   ` Vivek Goyal
@ 2011-01-21 19:19     ` Vivek Goyal
  2011-01-23 10:25     ` Tejun Heo
  1 sibling, 0 replies; 48+ messages in thread
From: Vivek Goyal @ 2011-01-21 19:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

On Fri, Jan 21, 2011 at 01:56:17PM -0500, Vivek Goyal wrote:
> On Fri, Jan 21, 2011 at 04:59:58PM +0100, Tejun Heo wrote:
> 
> [..]
> > + * The actual execution of flush is double buffered.  Whenever a request
> > + * needs to execute PRE or POSTFLUSH, it queues at
> > + * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
> > + * flush is issued and the pending_idx is toggled.  When the flush
> > + * completes, all the requests which were pending are proceeded to the next
> > + * step.  This allows arbitrary merging of different types of FLUSH/FUA
> > + * requests.
> > + *
> > + * Currently, the following conditions are used to determine when to issue
> > + * flush.
> > + *
> > + * C1. At any given time, only one flush shall be in progress.  This makes
> > + *     double buffering sufficient.
> > + *
> > + * C2. Flush is not deferred if any request is executing DATA of its
> > + *     sequence.  This avoids issuing separate POSTFLUSHes for requests
> > + *     which shared PREFLUSH.
> 
> Tejun, did you mean "Flush is deferred" instead of "Flush is not deferred"
> above?
> 
> IIUC, C2 might help only if requests which contain data are also going to 
> issue postflush. Couple of cases come to mind.
> 
> - If queue supports FUA, I think we will not issue POSTFLUSH. In that
>   case issuing next PREFLUSH which data is in flight might make sense.
> 
> - Even if queue does not support FUA and we are only getting requests
>   with REQ_FLUSH then also waiting for data requests to finish before
>   issuing next FLUSH might not help.
> 
> - Even if queue does not support FUA and say we have a mix of REQ_FUA
>   and REQ_FLUSH, then this will help only if in a batch we have more
>   than 1 request which is going to issue POSTFLUSH and those postflush
>   will be merged.
> 
> - Ric Wheeler was once mentioning that there are boxes which advertise
>   writeback cache but are battery backed so they ignore flush internally and
>   signal completion immediately. I am not sure how prevalent those
>   cases are but I think waiting for data to finish will delay processing
>   of new REQ_FLUSH requests in pending queue for such array. There
>   we will not anyway benefit from merging of FLUSH.
> 
> Given that C2 is going to benefit primarily only if queue does not support
> FUA and we have many requets with REQ_FUA set, will it make sense to 
> put additional checks for C2. Atleast a simple queue support FUA
> check might help.

Reading through the blk_insert_flush() bit more, looks like pure REQ_FUA
requests will not even show up in data list if queue supports FUA. But
IIUC requests with both REQ_FLUSH and REQ_FUA set will still show up even if
queue supports FUA.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 15:59 ` [PATCH 3/3] block: reimplement FLUSH/FUA to support merge Tejun Heo
  2011-01-21 18:56   ` Vivek Goyal
@ 2011-01-22  0:49   ` Mike Snitzer
  2011-01-23 10:31     ` Tejun Heo
  2011-01-23 10:48     ` Tejun Heo
  2011-01-25 20:41     ` Mike Snitzer
  3 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2011-01-22  0:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

On Fri, Jan 21 2011 at 10:59am -0500,
Tejun Heo <tj@kernel.org> wrote:

> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 8592869..cd73e93 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -1,6 +1,69 @@
>  /*
>   * Functions to sequence FLUSH and FUA writes.
> + *
> + * Copyright (C) 2011		Max Planck Institute for Gravitational Physics
> + * Copyright (C) 2011		Tejun Heo <tj@kernel.org>
> + *
> + * This file is released under the GPLv2.
> + *
> + * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
> + * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
> + * properties and hardware capability.
> + *
> + * If a request doesn't have data, only REQ_FLUSH makes sense, which
> + * indicates a simple flush request.  If there is data, REQ_FLUSH indicates
> + * that the device cache should be flushed before the data is executed, and
> + * REQ_FUA means that the data must be on non-volatile media on request
> + * completion.
> + *
> + * If the device doesn't have writeback cache, FLUSH and FUA don't make any
> + * difference.  The requests are either completed immediately if there's no
> + * data or executed as normal requests otherwise.

For devices without a writeback cache, I'm not seeing where pure flushes
are completed immediately.  But I do see where data is processed
directly in blk_insert_flush().


> -struct request *blk_do_flush(struct request_queue *q, struct request *rq)
> +/**
> + * blk_abort_flush - @q is being aborted, abort flush requests
      ^^^^^^^^^^^^^^^ 
Small comment nit, s/blk_abort_flush/blk_abort_flushes/

> + * @q: request_queue being aborted
> + *
> + * To be called from elv_abort_queue().  @q is being aborted.  Prepare all
> + * FLUSH/FUA requests for abortion.
> + *
> + * CONTEXT:
> + * spin_lock_irq(q->queue_lock)
> + */
> +void blk_abort_flushes(struct request_queue *q)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 18:56   ` Vivek Goyal
  2011-01-21 19:19     ` Vivek Goyal
@ 2011-01-23 10:25     ` Tejun Heo
  2011-01-23 10:29       ` Tejun Heo
  2011-01-24 20:31       ` Darrick J. Wong
  1 sibling, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-23 10:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

Hello,

On Fri, Jan 21, 2011 at 01:56:17PM -0500, Vivek Goyal wrote:
> > + * Currently, the following conditions are used to determine when to issue
> > + * flush.
> > + *
> > + * C1. At any given time, only one flush shall be in progress.  This makes
> > + *     double buffering sufficient.
> > + *
> > + * C2. Flush is not deferred if any request is executing DATA of its
> > + *     sequence.  This avoids issuing separate POSTFLUSHes for requests
> > + *     which shared PREFLUSH.
> 
> Tejun, did you mean "Flush is deferred" instead of "Flush is not deferred"
> above?

Oh yeah, I did.  :-)

> IIUC, C2 might help only if requests which contain data are also going to 
> issue postflush. Couple of cases come to mind.

That's true.  I didn't want to go too advanced on it.  I wanted
something which is fairly mechanical (without intricate parameters)
and effective enough for common cases.

> - If queue supports FUA, I think we will not issue POSTFLUSH. In that
>   case issuing next PREFLUSH which data is in flight might make sense.
>
> - Even if queue does not support FUA and we are only getting requests
>   with REQ_FLUSH then also waiting for data requests to finish before
>   issuing next FLUSH might not help.
> 
> - Even if queue does not support FUA and say we have a mix of REQ_FUA
>   and REQ_FLUSH, then this will help only if in a batch we have more
>   than 1 request which is going to issue POSTFLUSH and those postflush
>   will be merged.

Sure, not applying C2 and 3 if the underlying device supports REQ_FUA
would probably be the most compelling change of the bunch; however,
please keep in mind that issuing flush as soon as possible doesn't
necessarily result in better performance.  It's inherently a balancing
act between latency and throughput.  Even inducing artificial issue
latencies is likely to help if done right (as the ioscheds do).

So, I think it's better to start with something simple and improve it
with actual testing.  If the current simple implementation can match
Darrick's previous numbers, let's first settle the mechanisms.  We can
tune the latency/throughput balance all we want later.  Other than the
double buffering contraint (which can be relaxed too but I don't think
that would be necessary or a good idea) things can be easily adjusted
in blk_kick_flush().  It's intentionally designed that way.

> - Ric Wheeler was once mentioning that there are boxes which advertise
>   writeback cache but are battery backed so they ignore flush internally and
>   signal completion immediately. I am not sure how prevalent those
>   cases are but I think waiting for data to finish will delay processing
>   of new REQ_FLUSH requests in pending queue for such array. There
>   we will not anyway benefit from merging of FLUSH.

I don't really think we should design the whole thing around broken
devices which incorrectly report writeback cache when it need not.
The correct place to work around that is during device identification
not in the flush logic.

> Given that C2 is going to benefit primarily only if queue does not support
> FUA and we have many requets with REQ_FUA set, will it make sense to 
> put additional checks for C2. Atleast a simple queue support FUA
> check might help.
> 
> In practice does C2 really help or we can get rid of it entirely?

Again, issuing flushes as fast as possible isn't necessarily better.
It might feel counter-intuitive but it generally makes sense to delay
flush if there are a lot of concurrent flush activities going on.
Another related interesting point is that with flush merging,
depending on workload, there's a likelihood that FUA, even if the
device supports it, might result in worse performance than merged DATA
+ single POSTFLUSH sequence.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-23 10:25     ` Tejun Heo
@ 2011-01-23 10:29       ` Tejun Heo
  2011-01-24 20:31       ` Darrick J. Wong
  1 sibling, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-23 10:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

On Sun, Jan 23, 2011 at 11:25:26AM +0100, Tejun Heo wrote:
> Again, issuing flushes as fast as possible isn't necessarily better.
> It might feel counter-intuitive but it generally makes sense to delay
> flush if there are a lot of concurrent flush activities going on.
> Another related interesting point is that with flush merging,
> depending on workload, there's a likelihood that FUA, even if the
> device supports it, might result in worse performance than merged DATA
> + single POSTFLUSH sequence.

Let me add a bit.

In general, I'm a bit skeptical about the usefulness of hardware FUA
on a rotating disk.  All it saves is a single command issue overhead.
On storage array or SSDs, the balance might be different tho.  Event
hen, with flush merging, I think it would heavily depend on the
workload which way it would turn out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-22  0:49   ` Mike Snitzer
@ 2011-01-23 10:31     ` Tejun Heo
  2011-01-25 20:46       ` Vivek Goyal
  0 siblings, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2011-01-23 10:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

On Fri, Jan 21, 2011 at 07:49:55PM -0500, Mike Snitzer wrote:
> > + * If the device doesn't have writeback cache, FLUSH and FUA don't make any
> > + * difference.  The requests are either completed immediately if there's no
> > + * data or executed as normal requests otherwise.
> 
> For devices without a writeback cache, I'm not seeing where pure flushes
> are completed immediately.  But I do see where data is processed
> directly in blk_insert_flush().

Yeah, it does.  Pure flushes on a device w/o writeback cache, @policy
is zero and blk_flush_complete_seq() will directly proceed to
REQ_FSEQ_DONE.

> > -struct request *blk_do_flush(struct request_queue *q, struct request *rq)
> > +/**
> > + * blk_abort_flush - @q is being aborted, abort flush requests
>       ^^^^^^^^^^^^^^^ 
> Small comment nit, s/blk_abort_flush/blk_abort_flushes/

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH UPDATED 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-21 15:59 ` [PATCH 3/3] block: reimplement FLUSH/FUA to support merge Tejun Heo
@ 2011-01-23 10:48     ` Tejun Heo
  2011-01-22  0:49   ` Mike Snitzer
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-23 10:48 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef

The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining.  As such, sequencing is done
queue-wide one flush request after another.  However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.

This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually.  The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue.  Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.

This allows arbitrary merging of different type of flushes.  How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.

This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.

* As flush requests are never put on the IO scheduler, request fields
  used for flush share space with rq->rb_node.  rq->completion_data is
  moved out of the union.  This increases the request size by one
  pointer.

  As rq->elevator_private* are used only by the iosched too, it is
  possible to reduce the request size further.  However, to do that,
  we need to modify request allocation path such that iosched data is
  not allocated for flush requests.

* FLUSH/FUA processing happens on insertion now instead of dispatch.

- Comments updated as per Vivek and Mike.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
---
Simple comment update.  Nothing else changed.  Git tree is updated
too.

Thanks.

 block/blk-core.c         |   10 -
 block/blk-flush.c        |  432 ++++++++++++++++++++++++++++++++---------------
 block/blk.h              |   12 -
 block/elevator.c         |    7 
 include/linux/blkdev.h   |   18 +
 include/linux/elevator.h |    1 
 6 files changed, 328 insertions(+), 152 deletions(-)

Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -149,8 +149,6 @@ EXPORT_SYMBOL(blk_rq_init);
 static void req_bio_endio(struct request *rq, struct bio *bio,
 			  unsigned int nbytes, int error)
 {
-	struct request_queue *q = rq->q;
-
 	if (error)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -174,8 +172,6 @@ static void req_bio_endio(struct request
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
 		bio_endio(bio, error);
-	else if (error && !q->flush_err)
-		q->flush_err = error;
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -534,7 +530,9 @@ struct request_queue *blk_alloc_queue_no
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
-	INIT_LIST_HEAD(&q->pending_flushes);
+	INIT_LIST_HEAD(&q->flush_queue[0]);
+	INIT_LIST_HEAD(&q->flush_queue[1]);
+	INIT_LIST_HEAD(&q->flush_data_in_flight);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1213,7 +1211,7 @@ static int __make_request(struct request
 	spin_lock_irq(q->queue_lock);
 
 	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
-		where = ELEVATOR_INSERT_FRONT;
+		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
 
Index: work/block/blk-flush.c
===================================================================
--- work.orig/block/blk-flush.c
+++ work/block/blk-flush.c
@@ -1,6 +1,69 @@
 /*
  * Functions to sequence FLUSH and FUA writes.
+ *
+ * Copyright (C) 2011		Max Planck Institute for Gravitational Physics
+ * Copyright (C) 2011		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
+ * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
+ * properties and hardware capability.
+ *
+ * If a request doesn't have data, only REQ_FLUSH makes sense, which
+ * indicates a simple flush request.  If there is data, REQ_FLUSH indicates
+ * that the device cache should be flushed before the data is executed, and
+ * REQ_FUA means that the data must be on non-volatile media on request
+ * completion.
+ *
+ * If the device doesn't have writeback cache, FLUSH and FUA don't make any
+ * difference.  The requests are either completed immediately if there's no
+ * data or executed as normal requests otherwise.
+ *
+ * If the device has writeback cache and supports FUA, REQ_FLUSH is
+ * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
+ *
+ * If the device has writeback cache and doesn't support FUA, REQ_FLUSH is
+ * translated to PREFLUSH and REQ_FUA to POSTFLUSH.
+ *
+ * The actual execution of flush is double buffered.  Whenever a request
+ * needs to execute PRE or POSTFLUSH, it queues at
+ * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
+ * flush is issued and the pending_idx is toggled.  When the flush
+ * completes, all the requests which were pending are proceeded to the next
+ * step.  This allows arbitrary merging of different types of FLUSH/FUA
+ * requests.
+ *
+ * Currently, the following conditions are used to determine when to issue
+ * flush.
+ *
+ * C1. At any given time, only one flush shall be in progress.  This makes
+ *     double buffering sufficient.
+ *
+ * C2. Flush is deferred if any request is executing DATA of its sequence.
+ *     This avoids issuing separate POSTFLUSHes for requests which shared
+ *     PREFLUSH.
+ *
+ * C3. The second condition is ignored if there is a request which has
+ *     waited longer than FLUSH_PENDING_TIMEOUT.  This is to avoid
+ *     starvation in the unlikely case where there are continuous stream of
+ *     FUA (without FLUSH) requests.
+ *
+ * For devices which support FUA, it isn't clear whether C2 (and thus C3)
+ * is beneficial.
+ *
+ * Note that a sequenced FLUSH/FUA request with DATA is completed twice.
+ * Once while executing DATA and again after the whole sequence is
+ * complete.  The first completion updates the contained bio but doesn't
+ * finish it so that the bio submitter is notified only after the whole
+ * sequence is complete.  This is implemented by testing REQ_FLUSH_SEQ in
+ * req_bio_endio().
+ *
+ * The above peculiarity requires that each FLUSH/FUA request has only one
+ * bio attached to it, which is guaranteed as they aren't allowed to be
+ * merged in the usual way.
  */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/bio.h>
@@ -11,185 +74,290 @@
 
 /* FLUSH/FUA sequences */
 enum {
-	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
-	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
-	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
-	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
-	QUEUE_FSEQ_DONE		= (1 << 4),
+	REQ_FSEQ_PREFLUSH	= (1 << 0), /* pre-flushing in progress */
+	REQ_FSEQ_DATA		= (1 << 1), /* data write in progress */
+	REQ_FSEQ_POSTFLUSH	= (1 << 2), /* post-flushing in progress */
+	REQ_FSEQ_DONE		= (1 << 3),
+
+	REQ_FSEQ_ACTIONS	= REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA |
+				  REQ_FSEQ_POSTFLUSH,
+
+	/*
+	 * If flush has been pending longer than the following timeout,
+	 * it's issued even if flush_data requests are still in flight.
+	 */
+	FLUSH_PENDING_TIMEOUT	= 5 * HZ,
 };
 
-static struct request *queue_next_fseq(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q);
 
-unsigned blk_flush_cur_seq(struct request_queue *q)
+static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
 {
-	if (!q->flush_seq)
-		return 0;
-	return 1 << ffz(q->flush_seq);
+	unsigned int policy = 0;
+
+	if (fflags & REQ_FLUSH) {
+		if (rq->cmd_flags & REQ_FLUSH)
+			policy |= REQ_FSEQ_PREFLUSH;
+		if (blk_rq_sectors(rq))
+			policy |= REQ_FSEQ_DATA;
+		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
+			policy |= REQ_FSEQ_POSTFLUSH;
+	}
+	return policy;
 }
 
-static struct request *blk_flush_complete_seq(struct request_queue *q,
-					      unsigned seq, int error)
+static unsigned int blk_flush_cur_seq(struct request *rq)
 {
-	struct request *next_rq = NULL;
+	return 1 << ffz(rq->flush.seq);
+}
 
-	if (error && !q->flush_err)
-		q->flush_err = error;
+static void blk_flush_restore_request(struct request *rq)
+{
+	/*
+	 * After flush data completion, @rq->bio is %NULL but we need to
+	 * complete the bio again.  @rq->biotail is guaranteed to equal the
+	 * original @rq->bio.  Restore it.
+	 */
+	rq->bio = rq->biotail;
 
-	BUG_ON(q->flush_seq & seq);
-	q->flush_seq |= seq;
+	/* make @rq a normal request */
+	rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+	rq->end_io = NULL;
+}
 
-	if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
-		/* not complete yet, queue the next flush sequence */
-		next_rq = queue_next_fseq(q);
-	} else {
-		/* complete this flush request */
-		__blk_end_request_all(q->orig_flush_rq, q->flush_err);
-		q->orig_flush_rq = NULL;
-		q->flush_seq = 0;
+/**
+ * blk_flush_complete_seq - complete flush sequence
+ * @rq: FLUSH/FUA request being sequenced
+ * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
+ * @error: whether an error occurred
+ *
+ * @rq just completed @seq part of its flush sequence, record the
+ * completion and trigger the next step.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if requests were added to the dispatch queue, %false otherwise.
+ */
+static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
+				   int error)
+{
+	struct request_queue *q = rq->q;
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	bool queued = false;
+
+	BUG_ON(rq->flush.seq & seq);
+	rq->flush.seq |= seq;
+
+	if (likely(!error))
+		seq = blk_flush_cur_seq(rq);
+	else
+		seq = REQ_FSEQ_DONE;
+
+	switch (seq) {
+	case REQ_FSEQ_PREFLUSH:
+	case REQ_FSEQ_POSTFLUSH:
+		/* queue for flush */
+		if (list_empty(pending))
+			q->flush_pending_since = jiffies;
+		list_move_tail(&rq->flush.list, pending);
+		break;
 
-		/* dispatch the next flush if there's one */
-		if (!list_empty(&q->pending_flushes)) {
-			next_rq = list_entry_rq(q->pending_flushes.next);
-			list_move(&next_rq->queuelist, &q->queue_head);
-		}
+	case REQ_FSEQ_DATA:
+		list_move_tail(&rq->flush.list, &q->flush_data_in_flight);
+		list_add(&rq->queuelist, &q->queue_head);
+		queued = true;
+		break;
+
+	case REQ_FSEQ_DONE:
+		/*
+		 * @rq was previously adjusted by blk_flush_issue() for
+		 * flush sequencing and may already have gone through the
+		 * flush data request completion path.  Restore @rq for
+		 * normal completion and end it.
+		 */
+		BUG_ON(!list_empty(&rq->queuelist));
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		__blk_end_request_all(rq, error);
+		break;
+
+	default:
+		BUG();
 	}
-	return next_rq;
+
+	return blk_kick_flush(q) | queued;
 }
 
-static void blk_flush_complete_seq_end_io(struct request_queue *q,
-					  unsigned seq, int error)
+static void flush_end_io(struct request *flush_rq, int error)
 {
+	struct request_queue *q = flush_rq->q;
+	struct list_head *running = &q->flush_queue[q->flush_running_idx];
 	bool was_empty = elv_queue_empty(q);
-	struct request *next_rq;
+	bool queued = false;
+	struct request *rq, *n;
 
-	next_rq = blk_flush_complete_seq(q, seq, error);
+	BUG_ON(q->flush_pending_idx == q->flush_running_idx);
 
-	/*
-	 * Moving a request silently to empty queue_head may stall the
-	 * queue.  Kick the queue in those cases.
-	 */
-	if (was_empty && next_rq)
+	/* account completion of the flush request */
+	q->flush_running_idx ^= 1;
+	elv_completed_request(q, flush_rq);
+
+	/* and push the waiting requests to the next stage */
+	list_for_each_entry_safe(rq, n, running, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+		queued |= blk_flush_complete_seq(rq, seq, error);
+	}
+
+	/* after populating an empty queue, kick it to avoid stall */
+	if (queued && was_empty)
 		__blk_run_queue(q);
 }
 
-static void pre_flush_end_io(struct request *rq, int error)
+/**
+ * blk_kick_flush - consider issuing flush request
+ * @q: request_queue being kicked
+ *
+ * Flush related states of @q have changed, consider issuing flush request.
+ * Please read the comment at the top of this file for more info.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if flush was issued, %false otherwise.
+ */
+static bool blk_kick_flush(struct request_queue *q)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	struct request *first_rq =
+		list_first_entry(pending, struct request, flush.list);
+
+	/* C1 described at the top of this file */
+	if (q->flush_pending_idx != q->flush_running_idx || list_empty(pending))
+		return false;
+
+	/* C2 and C3 */
+	if (!list_empty(&q->flush_data_in_flight) &&
+	    time_before(jiffies,
+			q->flush_pending_since + FLUSH_PENDING_TIMEOUT))
+		return false;
+
+	/*
+	 * Issue flush and toggle pending_idx.  This makes pending_idx
+	 * different from running_idx, which means flush is in flight.
+	 */
+	blk_rq_init(q, &q->flush_rq);
+	q->flush_rq.cmd_type = REQ_TYPE_FS;
+	q->flush_rq.cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
+	q->flush_rq.rq_disk = first_rq->rq_disk;
+	q->flush_rq.end_io = flush_end_io;
+
+	q->flush_pending_idx ^= 1;
+	elv_insert(q, &q->flush_rq, ELEVATOR_INSERT_FRONT);
+	return true;
 }
 
 static void flush_data_end_io(struct request *rq, int error)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
-}
+	struct request_queue *q = rq->q;
+	bool was_empty = elv_queue_empty(q);
 
-static void post_flush_end_io(struct request *rq, int error)
-{
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	/* after populating an empty queue, kick it to avoid stall */
+	if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error) && was_empty)
+		__blk_run_queue(q);
 }
 
-static void init_flush_request(struct request *rq, struct gendisk *disk)
+/**
+ * blk_insert_flush - insert a new FLUSH/FUA request
+ * @rq: request to insert
+ *
+ * To be called from elv_insert() for %ELEVATOR_INSERT_FLUSH insertions.
+ * @rq is being submitted.  Analyze what needs to be done and put it on the
+ * right queue.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_insert_flush(struct request *rq)
 {
-	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = WRITE_FLUSH;
-	rq->rq_disk = disk;
-}
+	struct request_queue *q = rq->q;
+	unsigned int fflags = q->flush_flags;	/* may change, cache */
+	unsigned int policy = blk_flush_policy(fflags, rq);
 
-static struct request *queue_next_fseq(struct request_queue *q)
-{
-	struct request *orig_rq = q->orig_flush_rq;
-	struct request *rq = &q->flush_rq;
+	BUG_ON(rq->end_io);
+	BUG_ON(!rq->bio || rq->bio != rq->biotail);
 
-	blk_rq_init(q, rq);
+	/*
+	 * @policy now records what operations need to be done.  Adjust
+	 * REQ_FLUSH and FUA for the driver.
+	 */
+	rq->cmd_flags &= ~REQ_FLUSH;
+	if (!(fflags & REQ_FUA))
+		rq->cmd_flags &= ~REQ_FUA;
 
-	switch (blk_flush_cur_seq(q)) {
-	case QUEUE_FSEQ_PREFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = pre_flush_end_io;
-		break;
-	case QUEUE_FSEQ_DATA:
-		init_request_from_bio(rq, orig_rq->bio);
-		/*
-		 * orig_rq->rq_disk may be different from
-		 * bio->bi_bdev->bd_disk if orig_rq got here through
-		 * remapping drivers.  Make sure rq->rq_disk points
-		 * to the same one as orig_rq.
-		 */
-		rq->rq_disk = orig_rq->rq_disk;
-		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
-		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
-		rq->end_io = flush_data_end_io;
-		break;
-	case QUEUE_FSEQ_POSTFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = post_flush_end_io;
-		break;
-	default:
-		BUG();
+	/*
+	 * If there's data but flush is not necessary, the request can be
+	 * processed directly without going through flush machinery.  Queue
+	 * for normal execution.
+	 */
+	if ((policy & REQ_FSEQ_DATA) &&
+	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
+		list_add(&rq->queuelist, &q->queue_head);
+		return;
 	}
 
+	/*
+	 * @rq should go through flush machinery.  Mark it part of flush
+	 * sequence and submit for further processing.
+	 */
+	memset(&rq->flush, 0, sizeof(rq->flush));
+	INIT_LIST_HEAD(&rq->flush.list);
 	rq->cmd_flags |= REQ_FLUSH_SEQ;
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	return rq;
+	rq->end_io = flush_data_end_io;
+
+	blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
 }
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq)
+/**
+ * blk_abort_flushes - @q is being aborted, abort flush requests
+ * @q: request_queue being aborted
+ *
+ * To be called from elv_abort_queue().  @q is being aborted.  Prepare all
+ * FLUSH/FUA requests for abortion.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_abort_flushes(struct request_queue *q)
 {
-	unsigned int fflags = q->flush_flags; /* may change, cache it */
-	bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
-	bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
-	bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
-	unsigned skip = 0;
+	struct request *rq, *n;
+	int i;
 
 	/*
-	 * Special case.  If there's data but flush is not necessary,
-	 * the request can be issued directly.
-	 *
-	 * Flush w/o data should be able to be issued directly too but
-	 * currently some drivers assume that rq->bio contains
-	 * non-zero data if it isn't NULL and empty FLUSH requests
-	 * getting here usually have bio's without data.
+	 * Requests in flight for data are already owned by the dispatch
+	 * queue or the device driver.  Just restore for normal completion.
 	 */
-	if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
-		rq->cmd_flags &= ~REQ_FLUSH;
-		if (!has_fua)
-			rq->cmd_flags &= ~REQ_FUA;
-		return rq;
+	list_for_each_entry_safe(rq, n, &q->flush_data_in_flight, flush.list) {
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
 	}
 
 	/*
-	 * Sequenced flushes can't be processed in parallel.  If
-	 * another one is already in progress, queue for later
-	 * processing.
+	 * We need to give away requests on flush queues.  Restore for
+	 * normal completion and put them on the dispatch queue.
 	 */
-	if (q->flush_seq) {
-		list_move_tail(&rq->queuelist, &q->pending_flushes);
-		return NULL;
+	for (i = 0; i < ARRAY_SIZE(q->flush_queue); i++) {
+		list_for_each_entry_safe(rq, n, &q->flush_queue[i],
+					 flush.list) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			list_add_tail(&rq->queuelist, &q->queue_head);
+		}
 	}
-
-	/*
-	 * Start a new flush sequence
-	 */
-	q->flush_err = 0;
-	q->flush_seq |= QUEUE_FSEQ_STARTED;
-
-	/* adjust FLUSH/FUA of the original request and stash it away */
-	rq->cmd_flags &= ~REQ_FLUSH;
-	if (!has_fua)
-		rq->cmd_flags &= ~REQ_FUA;
-	blk_dequeue_request(rq);
-	q->orig_flush_rq = rq;
-
-	/* skip unneded sequences and return the first one */
-	if (!do_preflush)
-		skip |= QUEUE_FSEQ_PREFLUSH;
-	if (!blk_rq_sectors(rq))
-		skip |= QUEUE_FSEQ_DATA;
-	if (!do_postflush)
-		skip |= QUEUE_FSEQ_POSTFLUSH;
-	return blk_flush_complete_seq(q, skip, 0);
 }
 
 static void bio_end_flush(struct bio *bio, int err)
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c
+++ work/block/elevator.c
@@ -673,6 +673,11 @@ void elv_insert(struct request_queue *q,
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
 
+	case ELEVATOR_INSERT_FLUSH:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		blk_insert_flush(rq);
+		break;
+
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -785,6 +790,8 @@ void elv_abort_queue(struct request_queu
 {
 	struct request *rq;
 
+	blk_abort_flushes(q);
+
 	while (!list_empty(&q->queue_head)) {
 		rq = list_entry_rq(q->queue_head.next);
 		rq->cmd_flags |= REQ_QUIET;
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h
+++ work/include/linux/elevator.h
@@ -167,6 +167,7 @@ extern struct request *elv_rb_find(struc
 #define ELEVATOR_INSERT_BACK	2
 #define ELEVATOR_INSERT_SORT	3
 #define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_FLUSH	5
 
 /*
  * return values from elevator_may_queue_fn
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -99,13 +99,18 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * completion_data share space with the rb_node.
+	 * flush fields share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		void *completion_data;
+		struct {
+			unsigned int			seq;
+			struct list_head		list;
+		} flush;
 	};
 
+	void *completion_data;
+
 	/*
 	 * Three pointers are available for the IO schedulers, if they need
 	 * more they have to dynamically allocate it.
@@ -363,11 +368,12 @@ struct request_queue
 	 * for flush operations
 	 */
 	unsigned int		flush_flags;
-	unsigned int		flush_seq;
-	int			flush_err;
+	unsigned int		flush_pending_idx:1;
+	unsigned int		flush_running_idx:1;
+	unsigned long		flush_pending_since;
+	struct list_head	flush_queue[2];
+	struct list_head	flush_data_in_flight;
 	struct request		flush_rq;
-	struct request		*orig_flush_rq;
-	struct list_head	pending_flushes;
 
 	struct mutex		sysfs_lock;
 
Index: work/block/blk.h
===================================================================
--- work.orig/block/blk.h
+++ work/block/blk.h
@@ -51,21 +51,17 @@ static inline void blk_clear_rq_complete
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq);
+void blk_insert_flush(struct request *rq);
+void blk_abort_flushes(struct request_queue *q);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
 
 	while (1) {
-		while (!list_empty(&q->queue_head)) {
+		if (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    (rq->cmd_flags & REQ_FLUSH_SEQ))
-				return rq;
-			rq = blk_do_flush(q, rq);
-			if (rq)
-				return rq;
+			return rq;
 		}
 
 		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH UPDATED 3/3] block: reimplement FLUSH/FUA to support merge
@ 2011-01-23 10:48     ` Tejun Heo
  0 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-23 10:48 UTC (permalink / raw)
  To: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack, snitzer,
	linux-kernel, kmannth

The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining.  As such, sequencing is done
queue-wide one flush request after another.  However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.

This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually.  The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue.  Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.

This allows arbitrary merging of different type of flushes.  How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.

This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.

* As flush requests are never put on the IO scheduler, request fields
  used for flush share space with rq->rb_node.  rq->completion_data is
  moved out of the union.  This increases the request size by one
  pointer.

  As rq->elevator_private* are used only by the iosched too, it is
  possible to reduce the request size further.  However, to do that,
  we need to modify request allocation path such that iosched data is
  not allocated for flush requests.

* FLUSH/FUA processing happens on insertion now instead of dispatch.

- Comments updated as per Vivek and Mike.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Darrick J. Wong" <djwong@us.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
---
Simple comment update.  Nothing else changed.  Git tree is updated
too.

Thanks.

 block/blk-core.c         |   10 -
 block/blk-flush.c        |  432 ++++++++++++++++++++++++++++++++---------------
 block/blk.h              |   12 -
 block/elevator.c         |    7 
 include/linux/blkdev.h   |   18 +
 include/linux/elevator.h |    1 
 6 files changed, 328 insertions(+), 152 deletions(-)

Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -149,8 +149,6 @@ EXPORT_SYMBOL(blk_rq_init);
 static void req_bio_endio(struct request *rq, struct bio *bio,
 			  unsigned int nbytes, int error)
 {
-	struct request_queue *q = rq->q;
-
 	if (error)
 		clear_bit(BIO_UPTODATE, &bio->bi_flags);
 	else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -174,8 +172,6 @@ static void req_bio_endio(struct request
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
 		bio_endio(bio, error);
-	else if (error && !q->flush_err)
-		q->flush_err = error;
 }
 
 void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -534,7 +530,9 @@ struct request_queue *blk_alloc_queue_no
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
-	INIT_LIST_HEAD(&q->pending_flushes);
+	INIT_LIST_HEAD(&q->flush_queue[0]);
+	INIT_LIST_HEAD(&q->flush_queue[1]);
+	INIT_LIST_HEAD(&q->flush_data_in_flight);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1213,7 +1211,7 @@ static int __make_request(struct request
 	spin_lock_irq(q->queue_lock);
 
 	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
-		where = ELEVATOR_INSERT_FRONT;
+		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
 
Index: work/block/blk-flush.c
===================================================================
--- work.orig/block/blk-flush.c
+++ work/block/blk-flush.c
@@ -1,6 +1,69 @@
 /*
  * Functions to sequence FLUSH and FUA writes.
+ *
+ * Copyright (C) 2011		Max Planck Institute for Gravitational Physics
+ * Copyright (C) 2011		Tejun Heo <tj@kernel.org>
+ *
+ * This file is released under the GPLv2.
+ *
+ * REQ_{FLUSH|FUA} requests are decomposed to sequences consisted of three
+ * optional steps - PREFLUSH, DATA and POSTFLUSH - according to the request
+ * properties and hardware capability.
+ *
+ * If a request doesn't have data, only REQ_FLUSH makes sense, which
+ * indicates a simple flush request.  If there is data, REQ_FLUSH indicates
+ * that the device cache should be flushed before the data is executed, and
+ * REQ_FUA means that the data must be on non-volatile media on request
+ * completion.
+ *
+ * If the device doesn't have writeback cache, FLUSH and FUA don't make any
+ * difference.  The requests are either completed immediately if there's no
+ * data or executed as normal requests otherwise.
+ *
+ * If the device has writeback cache and supports FUA, REQ_FLUSH is
+ * translated to PREFLUSH but REQ_FUA is passed down directly with DATA.
+ *
+ * If the device has writeback cache and doesn't support FUA, REQ_FLUSH is
+ * translated to PREFLUSH and REQ_FUA to POSTFLUSH.
+ *
+ * The actual execution of flush is double buffered.  Whenever a request
+ * needs to execute PRE or POSTFLUSH, it queues at
+ * q->flush_queue[q->flush_pending_idx].  Once certain criteria are met, a
+ * flush is issued and the pending_idx is toggled.  When the flush
+ * completes, all the requests which were pending are proceeded to the next
+ * step.  This allows arbitrary merging of different types of FLUSH/FUA
+ * requests.
+ *
+ * Currently, the following conditions are used to determine when to issue
+ * flush.
+ *
+ * C1. At any given time, only one flush shall be in progress.  This makes
+ *     double buffering sufficient.
+ *
+ * C2. Flush is deferred if any request is executing DATA of its sequence.
+ *     This avoids issuing separate POSTFLUSHes for requests which shared
+ *     PREFLUSH.
+ *
+ * C3. The second condition is ignored if there is a request which has
+ *     waited longer than FLUSH_PENDING_TIMEOUT.  This is to avoid
+ *     starvation in the unlikely case where there are continuous stream of
+ *     FUA (without FLUSH) requests.
+ *
+ * For devices which support FUA, it isn't clear whether C2 (and thus C3)
+ * is beneficial.
+ *
+ * Note that a sequenced FLUSH/FUA request with DATA is completed twice.
+ * Once while executing DATA and again after the whole sequence is
+ * complete.  The first completion updates the contained bio but doesn't
+ * finish it so that the bio submitter is notified only after the whole
+ * sequence is complete.  This is implemented by testing REQ_FLUSH_SEQ in
+ * req_bio_endio().
+ *
+ * The above peculiarity requires that each FLUSH/FUA request has only one
+ * bio attached to it, which is guaranteed as they aren't allowed to be
+ * merged in the usual way.
  */
+
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/bio.h>
@@ -11,185 +74,290 @@
 
 /* FLUSH/FUA sequences */
 enum {
-	QUEUE_FSEQ_STARTED	= (1 << 0), /* flushing in progress */
-	QUEUE_FSEQ_PREFLUSH	= (1 << 1), /* pre-flushing in progress */
-	QUEUE_FSEQ_DATA		= (1 << 2), /* data write in progress */
-	QUEUE_FSEQ_POSTFLUSH	= (1 << 3), /* post-flushing in progress */
-	QUEUE_FSEQ_DONE		= (1 << 4),
+	REQ_FSEQ_PREFLUSH	= (1 << 0), /* pre-flushing in progress */
+	REQ_FSEQ_DATA		= (1 << 1), /* data write in progress */
+	REQ_FSEQ_POSTFLUSH	= (1 << 2), /* post-flushing in progress */
+	REQ_FSEQ_DONE		= (1 << 3),
+
+	REQ_FSEQ_ACTIONS	= REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA |
+				  REQ_FSEQ_POSTFLUSH,
+
+	/*
+	 * If flush has been pending longer than the following timeout,
+	 * it's issued even if flush_data requests are still in flight.
+	 */
+	FLUSH_PENDING_TIMEOUT	= 5 * HZ,
 };
 
-static struct request *queue_next_fseq(struct request_queue *q);
+static bool blk_kick_flush(struct request_queue *q);
 
-unsigned blk_flush_cur_seq(struct request_queue *q)
+static unsigned int blk_flush_policy(unsigned int fflags, struct request *rq)
 {
-	if (!q->flush_seq)
-		return 0;
-	return 1 << ffz(q->flush_seq);
+	unsigned int policy = 0;
+
+	if (fflags & REQ_FLUSH) {
+		if (rq->cmd_flags & REQ_FLUSH)
+			policy |= REQ_FSEQ_PREFLUSH;
+		if (blk_rq_sectors(rq))
+			policy |= REQ_FSEQ_DATA;
+		if (!(fflags & REQ_FUA) && (rq->cmd_flags & REQ_FUA))
+			policy |= REQ_FSEQ_POSTFLUSH;
+	}
+	return policy;
 }
 
-static struct request *blk_flush_complete_seq(struct request_queue *q,
-					      unsigned seq, int error)
+static unsigned int blk_flush_cur_seq(struct request *rq)
 {
-	struct request *next_rq = NULL;
+	return 1 << ffz(rq->flush.seq);
+}
 
-	if (error && !q->flush_err)
-		q->flush_err = error;
+static void blk_flush_restore_request(struct request *rq)
+{
+	/*
+	 * After flush data completion, @rq->bio is %NULL but we need to
+	 * complete the bio again.  @rq->biotail is guaranteed to equal the
+	 * original @rq->bio.  Restore it.
+	 */
+	rq->bio = rq->biotail;
 
-	BUG_ON(q->flush_seq & seq);
-	q->flush_seq |= seq;
+	/* make @rq a normal request */
+	rq->cmd_flags &= ~REQ_FLUSH_SEQ;
+	rq->end_io = NULL;
+}
 
-	if (blk_flush_cur_seq(q) != QUEUE_FSEQ_DONE) {
-		/* not complete yet, queue the next flush sequence */
-		next_rq = queue_next_fseq(q);
-	} else {
-		/* complete this flush request */
-		__blk_end_request_all(q->orig_flush_rq, q->flush_err);
-		q->orig_flush_rq = NULL;
-		q->flush_seq = 0;
+/**
+ * blk_flush_complete_seq - complete flush sequence
+ * @rq: FLUSH/FUA request being sequenced
+ * @seq: sequences to complete (mask of %REQ_FSEQ_*, can be zero)
+ * @error: whether an error occurred
+ *
+ * @rq just completed @seq part of its flush sequence, record the
+ * completion and trigger the next step.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if requests were added to the dispatch queue, %false otherwise.
+ */
+static bool blk_flush_complete_seq(struct request *rq, unsigned int seq,
+				   int error)
+{
+	struct request_queue *q = rq->q;
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	bool queued = false;
+
+	BUG_ON(rq->flush.seq & seq);
+	rq->flush.seq |= seq;
+
+	if (likely(!error))
+		seq = blk_flush_cur_seq(rq);
+	else
+		seq = REQ_FSEQ_DONE;
+
+	switch (seq) {
+	case REQ_FSEQ_PREFLUSH:
+	case REQ_FSEQ_POSTFLUSH:
+		/* queue for flush */
+		if (list_empty(pending))
+			q->flush_pending_since = jiffies;
+		list_move_tail(&rq->flush.list, pending);
+		break;
 
-		/* dispatch the next flush if there's one */
-		if (!list_empty(&q->pending_flushes)) {
-			next_rq = list_entry_rq(q->pending_flushes.next);
-			list_move(&next_rq->queuelist, &q->queue_head);
-		}
+	case REQ_FSEQ_DATA:
+		list_move_tail(&rq->flush.list, &q->flush_data_in_flight);
+		list_add(&rq->queuelist, &q->queue_head);
+		queued = true;
+		break;
+
+	case REQ_FSEQ_DONE:
+		/*
+		 * @rq was previously adjusted by blk_flush_issue() for
+		 * flush sequencing and may already have gone through the
+		 * flush data request completion path.  Restore @rq for
+		 * normal completion and end it.
+		 */
+		BUG_ON(!list_empty(&rq->queuelist));
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		__blk_end_request_all(rq, error);
+		break;
+
+	default:
+		BUG();
 	}
-	return next_rq;
+
+	return blk_kick_flush(q) | queued;
 }
 
-static void blk_flush_complete_seq_end_io(struct request_queue *q,
-					  unsigned seq, int error)
+static void flush_end_io(struct request *flush_rq, int error)
 {
+	struct request_queue *q = flush_rq->q;
+	struct list_head *running = &q->flush_queue[q->flush_running_idx];
 	bool was_empty = elv_queue_empty(q);
-	struct request *next_rq;
+	bool queued = false;
+	struct request *rq, *n;
 
-	next_rq = blk_flush_complete_seq(q, seq, error);
+	BUG_ON(q->flush_pending_idx == q->flush_running_idx);
 
-	/*
-	 * Moving a request silently to empty queue_head may stall the
-	 * queue.  Kick the queue in those cases.
-	 */
-	if (was_empty && next_rq)
+	/* account completion of the flush request */
+	q->flush_running_idx ^= 1;
+	elv_completed_request(q, flush_rq);
+
+	/* and push the waiting requests to the next stage */
+	list_for_each_entry_safe(rq, n, running, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		BUG_ON(seq != REQ_FSEQ_PREFLUSH && seq != REQ_FSEQ_POSTFLUSH);
+		queued |= blk_flush_complete_seq(rq, seq, error);
+	}
+
+	/* after populating an empty queue, kick it to avoid stall */
+	if (queued && was_empty)
 		__blk_run_queue(q);
 }
 
-static void pre_flush_end_io(struct request *rq, int error)
+/**
+ * blk_kick_flush - consider issuing flush request
+ * @q: request_queue being kicked
+ *
+ * Flush related states of @q have changed, consider issuing flush request.
+ * Please read the comment at the top of this file for more info.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ *
+ * RETURNS:
+ * %true if flush was issued, %false otherwise.
+ */
+static bool blk_kick_flush(struct request_queue *q)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	struct list_head *pending = &q->flush_queue[q->flush_pending_idx];
+	struct request *first_rq =
+		list_first_entry(pending, struct request, flush.list);
+
+	/* C1 described at the top of this file */
+	if (q->flush_pending_idx != q->flush_running_idx || list_empty(pending))
+		return false;
+
+	/* C2 and C3 */
+	if (!list_empty(&q->flush_data_in_flight) &&
+	    time_before(jiffies,
+			q->flush_pending_since + FLUSH_PENDING_TIMEOUT))
+		return false;
+
+	/*
+	 * Issue flush and toggle pending_idx.  This makes pending_idx
+	 * different from running_idx, which means flush is in flight.
+	 */
+	blk_rq_init(q, &q->flush_rq);
+	q->flush_rq.cmd_type = REQ_TYPE_FS;
+	q->flush_rq.cmd_flags = WRITE_FLUSH | REQ_FLUSH_SEQ;
+	q->flush_rq.rq_disk = first_rq->rq_disk;
+	q->flush_rq.end_io = flush_end_io;
+
+	q->flush_pending_idx ^= 1;
+	elv_insert(q, &q->flush_rq, ELEVATOR_INSERT_FRONT);
+	return true;
 }
 
 static void flush_data_end_io(struct request *rq, int error)
 {
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
-}
+	struct request_queue *q = rq->q;
+	bool was_empty = elv_queue_empty(q);
 
-static void post_flush_end_io(struct request *rq, int error)
-{
-	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	/* after populating an empty queue, kick it to avoid stall */
+	if (blk_flush_complete_seq(rq, REQ_FSEQ_DATA, error) && was_empty)
+		__blk_run_queue(q);
 }
 
-static void init_flush_request(struct request *rq, struct gendisk *disk)
+/**
+ * blk_insert_flush - insert a new FLUSH/FUA request
+ * @rq: request to insert
+ *
+ * To be called from elv_insert() for %ELEVATOR_INSERT_FLUSH insertions.
+ * @rq is being submitted.  Analyze what needs to be done and put it on the
+ * right queue.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_insert_flush(struct request *rq)
 {
-	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = WRITE_FLUSH;
-	rq->rq_disk = disk;
-}
+	struct request_queue *q = rq->q;
+	unsigned int fflags = q->flush_flags;	/* may change, cache */
+	unsigned int policy = blk_flush_policy(fflags, rq);
 
-static struct request *queue_next_fseq(struct request_queue *q)
-{
-	struct request *orig_rq = q->orig_flush_rq;
-	struct request *rq = &q->flush_rq;
+	BUG_ON(rq->end_io);
+	BUG_ON(!rq->bio || rq->bio != rq->biotail);
 
-	blk_rq_init(q, rq);
+	/*
+	 * @policy now records what operations need to be done.  Adjust
+	 * REQ_FLUSH and FUA for the driver.
+	 */
+	rq->cmd_flags &= ~REQ_FLUSH;
+	if (!(fflags & REQ_FUA))
+		rq->cmd_flags &= ~REQ_FUA;
 
-	switch (blk_flush_cur_seq(q)) {
-	case QUEUE_FSEQ_PREFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = pre_flush_end_io;
-		break;
-	case QUEUE_FSEQ_DATA:
-		init_request_from_bio(rq, orig_rq->bio);
-		/*
-		 * orig_rq->rq_disk may be different from
-		 * bio->bi_bdev->bd_disk if orig_rq got here through
-		 * remapping drivers.  Make sure rq->rq_disk points
-		 * to the same one as orig_rq.
-		 */
-		rq->rq_disk = orig_rq->rq_disk;
-		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
-		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
-		rq->end_io = flush_data_end_io;
-		break;
-	case QUEUE_FSEQ_POSTFLUSH:
-		init_flush_request(rq, orig_rq->rq_disk);
-		rq->end_io = post_flush_end_io;
-		break;
-	default:
-		BUG();
+	/*
+	 * If there's data but flush is not necessary, the request can be
+	 * processed directly without going through flush machinery.  Queue
+	 * for normal execution.
+	 */
+	if ((policy & REQ_FSEQ_DATA) &&
+	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
+		list_add(&rq->queuelist, &q->queue_head);
+		return;
 	}
 
+	/*
+	 * @rq should go through flush machinery.  Mark it part of flush
+	 * sequence and submit for further processing.
+	 */
+	memset(&rq->flush, 0, sizeof(rq->flush));
+	INIT_LIST_HEAD(&rq->flush.list);
 	rq->cmd_flags |= REQ_FLUSH_SEQ;
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
-	return rq;
+	rq->end_io = flush_data_end_io;
+
+	blk_flush_complete_seq(rq, REQ_FSEQ_ACTIONS & ~policy, 0);
 }
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq)
+/**
+ * blk_abort_flushes - @q is being aborted, abort flush requests
+ * @q: request_queue being aborted
+ *
+ * To be called from elv_abort_queue().  @q is being aborted.  Prepare all
+ * FLUSH/FUA requests for abortion.
+ *
+ * CONTEXT:
+ * spin_lock_irq(q->queue_lock)
+ */
+void blk_abort_flushes(struct request_queue *q)
 {
-	unsigned int fflags = q->flush_flags; /* may change, cache it */
-	bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
-	bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
-	bool do_postflush = has_flush && !has_fua && (rq->cmd_flags & REQ_FUA);
-	unsigned skip = 0;
+	struct request *rq, *n;
+	int i;
 
 	/*
-	 * Special case.  If there's data but flush is not necessary,
-	 * the request can be issued directly.
-	 *
-	 * Flush w/o data should be able to be issued directly too but
-	 * currently some drivers assume that rq->bio contains
-	 * non-zero data if it isn't NULL and empty FLUSH requests
-	 * getting here usually have bio's without data.
+	 * Requests in flight for data are already owned by the dispatch
+	 * queue or the device driver.  Just restore for normal completion.
 	 */
-	if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
-		rq->cmd_flags &= ~REQ_FLUSH;
-		if (!has_fua)
-			rq->cmd_flags &= ~REQ_FUA;
-		return rq;
+	list_for_each_entry_safe(rq, n, &q->flush_data_in_flight, flush.list) {
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
 	}
 
 	/*
-	 * Sequenced flushes can't be processed in parallel.  If
-	 * another one is already in progress, queue for later
-	 * processing.
+	 * We need to give away requests on flush queues.  Restore for
+	 * normal completion and put them on the dispatch queue.
 	 */
-	if (q->flush_seq) {
-		list_move_tail(&rq->queuelist, &q->pending_flushes);
-		return NULL;
+	for (i = 0; i < ARRAY_SIZE(q->flush_queue); i++) {
+		list_for_each_entry_safe(rq, n, &q->flush_queue[i],
+					 flush.list) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			list_add_tail(&rq->queuelist, &q->queue_head);
+		}
 	}
-
-	/*
-	 * Start a new flush sequence
-	 */
-	q->flush_err = 0;
-	q->flush_seq |= QUEUE_FSEQ_STARTED;
-
-	/* adjust FLUSH/FUA of the original request and stash it away */
-	rq->cmd_flags &= ~REQ_FLUSH;
-	if (!has_fua)
-		rq->cmd_flags &= ~REQ_FUA;
-	blk_dequeue_request(rq);
-	q->orig_flush_rq = rq;
-
-	/* skip unneded sequences and return the first one */
-	if (!do_preflush)
-		skip |= QUEUE_FSEQ_PREFLUSH;
-	if (!blk_rq_sectors(rq))
-		skip |= QUEUE_FSEQ_DATA;
-	if (!do_postflush)
-		skip |= QUEUE_FSEQ_POSTFLUSH;
-	return blk_flush_complete_seq(q, skip, 0);
 }
 
 static void bio_end_flush(struct bio *bio, int err)
Index: work/block/elevator.c
===================================================================
--- work.orig/block/elevator.c
+++ work/block/elevator.c
@@ -673,6 +673,11 @@ void elv_insert(struct request_queue *q,
 		q->elevator->ops->elevator_add_req_fn(q, rq);
 		break;
 
+	case ELEVATOR_INSERT_FLUSH:
+		rq->cmd_flags |= REQ_SOFTBARRIER;
+		blk_insert_flush(rq);
+		break;
+
 	default:
 		printk(KERN_ERR "%s: bad insertion point %d\n",
 		       __func__, where);
@@ -785,6 +790,8 @@ void elv_abort_queue(struct request_queu
 {
 	struct request *rq;
 
+	blk_abort_flushes(q);
+
 	while (!list_empty(&q->queue_head)) {
 		rq = list_entry_rq(q->queue_head.next);
 		rq->cmd_flags |= REQ_QUIET;
Index: work/include/linux/elevator.h
===================================================================
--- work.orig/include/linux/elevator.h
+++ work/include/linux/elevator.h
@@ -167,6 +167,7 @@ extern struct request *elv_rb_find(struc
 #define ELEVATOR_INSERT_BACK	2
 #define ELEVATOR_INSERT_SORT	3
 #define ELEVATOR_INSERT_REQUEUE	4
+#define ELEVATOR_INSERT_FLUSH	5
 
 /*
  * return values from elevator_may_queue_fn
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -99,13 +99,18 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * completion_data share space with the rb_node.
+	 * flush fields share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		void *completion_data;
+		struct {
+			unsigned int			seq;
+			struct list_head		list;
+		} flush;
 	};
 
+	void *completion_data;
+
 	/*
 	 * Three pointers are available for the IO schedulers, if they need
 	 * more they have to dynamically allocate it.
@@ -363,11 +368,12 @@ struct request_queue
 	 * for flush operations
 	 */
 	unsigned int		flush_flags;
-	unsigned int		flush_seq;
-	int			flush_err;
+	unsigned int		flush_pending_idx:1;
+	unsigned int		flush_running_idx:1;
+	unsigned long		flush_pending_since;
+	struct list_head	flush_queue[2];
+	struct list_head	flush_data_in_flight;
 	struct request		flush_rq;
-	struct request		*orig_flush_rq;
-	struct list_head	pending_flushes;
 
 	struct mutex		sysfs_lock;
 
Index: work/block/blk.h
===================================================================
--- work.orig/block/blk.h
+++ work/block/blk.h
@@ -51,21 +51,17 @@ static inline void blk_clear_rq_complete
  */
 #define ELV_ON_HASH(rq)		(!hlist_unhashed(&(rq)->hash))
 
-struct request *blk_do_flush(struct request_queue *q, struct request *rq);
+void blk_insert_flush(struct request *rq);
+void blk_abort_flushes(struct request_queue *q);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
 {
 	struct request *rq;
 
 	while (1) {
-		while (!list_empty(&q->queue_head)) {
+		if (!list_empty(&q->queue_head)) {
 			rq = list_entry_rq(q->queue_head.next);
-			if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-			    (rq->cmd_flags & REQ_FLUSH_SEQ))
-				return rq;
-			rq = blk_do_flush(q, rq);
-			if (rq)
-				return rq;
+			return rq;
 		}
 
 		if (!q->elevator->ops->elevator_dispatch_fn(q, 0))

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-23 10:25     ` Tejun Heo
  2011-01-23 10:29       ` Tejun Heo
@ 2011-01-24 20:31       ` Darrick J. Wong
  2011-01-25 10:21         ` Tejun Heo
  1 sibling, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2011-01-24 20:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, axboe, tytso, shli, neilb, adilger.kernel, jack,
	snitzer, linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch,
	josef

On Sun, Jan 23, 2011 at 11:25:26AM +0100, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 21, 2011 at 01:56:17PM -0500, Vivek Goyal wrote:
> > > + * Currently, the following conditions are used to determine when to issue
> > > + * flush.
> > > + *
> > > + * C1. At any given time, only one flush shall be in progress.  This makes
> > > + *     double buffering sufficient.
> > > + *
> > > + * C2. Flush is not deferred if any request is executing DATA of its
> > > + *     sequence.  This avoids issuing separate POSTFLUSHes for requests
> > > + *     which shared PREFLUSH.
> > 
> > Tejun, did you mean "Flush is deferred" instead of "Flush is not deferred"
> > above?
> 
> Oh yeah, I did.  :-)
> 
> > IIUC, C2 might help only if requests which contain data are also going to 
> > issue postflush. Couple of cases come to mind.
> 
> That's true.  I didn't want to go too advanced on it.  I wanted
> something which is fairly mechanical (without intricate parameters)
> and effective enough for common cases.
> 
> > - If queue supports FUA, I think we will not issue POSTFLUSH. In that
> >   case issuing next PREFLUSH which data is in flight might make sense.
> >
> > - Even if queue does not support FUA and we are only getting requests
> >   with REQ_FLUSH then also waiting for data requests to finish before
> >   issuing next FLUSH might not help.
> > 
> > - Even if queue does not support FUA and say we have a mix of REQ_FUA
> >   and REQ_FLUSH, then this will help only if in a batch we have more
> >   than 1 request which is going to issue POSTFLUSH and those postflush
> >   will be merged.
> 
> Sure, not applying C2 and 3 if the underlying device supports REQ_FUA
> would probably be the most compelling change of the bunch; however,
> please keep in mind that issuing flush as soon as possible doesn't
> necessarily result in better performance.  It's inherently a balancing
> act between latency and throughput.  Even inducing artificial issue
> latencies is likely to help if done right (as the ioscheds do).
> 
> So, I think it's better to start with something simple and improve it
> with actual testing.  If the current simple implementation can match
> Darrick's previous numbers, let's first settle the mechanisms.  We can

Yep, the fsync-happy numbers more or less match... at least for 2.6.37:
http://tinyurl.com/4q2xeao

I'll give 2.6.38-rc2 a try later, though -rc1 didn't boot for me, so these
numbers are based on a backport to .37. :(

In general, the effect of this patchset is to change a 100% drop in fsync-happy
performance into a 20% drop.  As always, the higher the average flush time, the
more the storage system benefits from having flush coordination.  The only
exception to that is elm3b231_ipr, which is a md array of disks that are
attached to a controller that is now throwing errors, so I'm not sure I
entirely trust that machine's numbers.

As for elm3c44_sas, I'm not sure why enabling flushes always increases
performance, other than to say that I suspect it has something to do with
md-raid'ing disk trays together, because elm3a4_sas and elm3c71_extsas consist
of the same configuration of disk trays, only without the md.  I've also been
told by our storage folks that md atop raid trays is not really a recommended
setup anyway.

The long and short of it is that this latest patchset looks and delivers the
behavior that I was aiming for. :)

> tune the latency/throughput balance all we want later.  Other than the
> double buffering contraint (which can be relaxed too but I don't think
> that would be necessary or a good idea) things can be easily adjusted
> in blk_kick_flush().  It's intentionally designed that way.
> 
> > - Ric Wheeler was once mentioning that there are boxes which advertise
> >   writeback cache but are battery backed so they ignore flush internally and
> >   signal completion immediately. I am not sure how prevalent those
> >   cases are but I think waiting for data to finish will delay processing
> >   of new REQ_FLUSH requests in pending queue for such array. There
> >   we will not anyway benefit from merging of FLUSH.
> 
> I don't really think we should design the whole thing around broken
> devices which incorrectly report writeback cache when it need not.
> The correct place to work around that is during device identification
> not in the flush logic.

elm3a4_sas and elm3c71_extsas advertise writeback cache yet the flush completion
times are suspiciously low.  I suppose it could be useful to disable flushes to
squeeze out that last bit of performance, though I don't know how one goes
about querying the disk array to learn if there's a battery behind the cache.
I guess the current mechanism (admin knob that picks a safe default) is good
enough.

> > Given that C2 is going to benefit primarily only if queue does not support
> > FUA and we have many requets with REQ_FUA set, will it make sense to 
> > put additional checks for C2. Atleast a simple queue support FUA
> > check might help.
> > 
> > In practice does C2 really help or we can get rid of it entirely?
> 
> Again, issuing flushes as fast as possible isn't necessarily better.
> It might feel counter-intuitive but it generally makes sense to delay
> flush if there are a lot of concurrent flush activities going on.
> Another related interesting point is that with flush merging,
> depending on workload, there's a likelihood that FUA, even if the
> device supports it, might result in worse performance than merged DATA
> + single POSTFLUSH sequence.
> 
> Thanks.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-24 20:31       ` Darrick J. Wong
@ 2011-01-25 10:21         ` Tejun Heo
  2011-01-25 11:39           ` Jens Axboe
  2011-01-25 22:56           ` Darrick J. Wong
  0 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-25 10:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Vivek Goyal, axboe, tytso, shli, neilb, adilger.kernel, jack,
	snitzer, linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch,
	josef

Hello, Darrick.

On Mon, Jan 24, 2011 at 12:31:55PM -0800, Darrick J. Wong wrote:
> > So, I think it's better to start with something simple and improve it
> > with actual testing.  If the current simple implementation can match
> > Darrick's previous numbers, let's first settle the mechanisms.  We can
> 
> Yep, the fsync-happy numbers more or less match... at least for 2.6.37:
> http://tinyurl.com/4q2xeao

Good to hear.  Thanks for the detailed testing.

> I'll give 2.6.38-rc2 a try later, though -rc1 didn't boot for me, so these
> numbers are based on a backport to .37. :(

Well, there hasn' been any change in the area during the merge window
anyway, so I think testing on 2.6.37 should be fine.

> > I don't really think we should design the whole thing around broken
> > devices which incorrectly report writeback cache when it need not.
> > The correct place to work around that is during device identification
> > not in the flush logic.
> 
> elm3a4_sas and elm3c71_extsas advertise writeback cache yet the
> flush completion times are suspiciously low.  I suppose it could be
> useful to disable flushes to squeeze out that last bit of
> performance, though I don't know how one goes about querying the
> disk array to learn if there's a battery behind the cache.  I guess
> the current mechanism (admin knob that picks a safe default) is good
> enough.

Yeap, that or a blacklist of devices which lie.

Jens, what do you think?  If you don't object, let's put this through
linux-next.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-25 10:21         ` Tejun Heo
@ 2011-01-25 11:39           ` Jens Axboe
  2011-03-23 23:37             ` Darrick J. Wong
  2011-01-25 22:56           ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Jens Axboe @ 2011-01-25 11:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Darrick J. Wong, Vivek Goyal, tytso, shli, neilb, adilger.kernel,
	jack, snitzer, linux-kernel, kmannth, cmm, linux-ext4, rwheeler,
	hch, josef

On 2011-01-25 11:21, Tejun Heo wrote:
> Hello, Darrick.
> 
> On Mon, Jan 24, 2011 at 12:31:55PM -0800, Darrick J. Wong wrote:
>>> So, I think it's better to start with something simple and improve it
>>> with actual testing.  If the current simple implementation can match
>>> Darrick's previous numbers, let's first settle the mechanisms.  We can
>>
>> Yep, the fsync-happy numbers more or less match... at least for 2.6.37:
>> http://tinyurl.com/4q2xeao
> 
> Good to hear.  Thanks for the detailed testing.
> 
>> I'll give 2.6.38-rc2 a try later, though -rc1 didn't boot for me, so these
>> numbers are based on a backport to .37. :(
> 
> Well, there hasn' been any change in the area during the merge window
> anyway, so I think testing on 2.6.37 should be fine.
> 
>>> I don't really think we should design the whole thing around broken
>>> devices which incorrectly report writeback cache when it need not.
>>> The correct place to work around that is during device identification
>>> not in the flush logic.
>>
>> elm3a4_sas and elm3c71_extsas advertise writeback cache yet the
>> flush completion times are suspiciously low.  I suppose it could be
>> useful to disable flushes to squeeze out that last bit of
>> performance, though I don't know how one goes about querying the
>> disk array to learn if there's a battery behind the cache.  I guess
>> the current mechanism (admin knob that picks a safe default) is good
>> enough.
> 
> Yeap, that or a blacklist of devices which lie.
> 
> Jens, what do you think?  If you don't object, let's put this through
> linux-next.

I like the approach, I'll queue it up for 2.6.39.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests
  2011-01-21 15:59 ` [PATCH 3/3] block: reimplement FLUSH/FUA to support merge Tejun Heo
@ 2011-01-25 20:41     ` Mike Snitzer
  2011-01-22  0:49   ` Mike Snitzer
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-01-25 20:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

Hi Tejun,

On Fri, Jan 21 2011 at 10:59am -0500,
Tejun Heo <tj@kernel.org> wrote:
> 
> * As flush requests are never put on the IO scheduler, request fields
>   used for flush share space with rq->rb_node.  rq->completion_data is
>   moved out of the union.  This increases the request size by one
>   pointer.
> 
>   As rq->elevator_private* are used only by the iosched too, it is
>   possible to reduce the request size further.  However, to do that,
>   we need to modify request allocation path such that iosched data is
>   not allocated for flush requests.

I decided to take a crack at using rq->elevator_private* and came up
with the following patch.

Unfortunately, in testing I found that flush requests that have data do
in fact eventually get added to the queue as normal requests, via:
1) "data but flush is not necessary" case in blk_insert_flush
2) REQ_FSEQ_DATA case in blk_flush_complete_seq

I know this because in my following get_request() change to _not_ call
elv_set_request() for flush requests hit cfq_put_request()'s
BUG_ON(!cfqq->allocated[rw]).  cfqq->allocated[rw] gets set via
elv_set_request()'s call to cfq_set_request().

So this seems to call in to question the running theory that flush
requests can share 'struct request' space with elevator-specific members
(via union) -- be it rq->rb_node or rq->elevator_private*.

Please advise, thanks!
Mike



From: Mike Snitzer <snitzer@redhat.com>

Skip elevator initialization for flush requests by passing priv=0 to
blk_alloc_request() in get_request().  As such elv_set_request() is
never called for flush requests.

Move elevator_private* into 'struct elevator' and have the flush fields
share a union with it.

Reclaim the space lost in 'struct request' by moving 'completion_data'
back in to the union with 'rb_node'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c       |   13 +++++++++----
 block/cfq-iosched.c    |   18 +++++++++---------
 block/elevator.c       |    2 +-
 include/linux/blkdev.h |   26 +++++++++++++++-----------
 4 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 72dd23b..f507888 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	int may_queue, priv = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * Skip elevator initialization for flush requests
+	 */
+	if (!(bio && (bio->bi_rw & (REQ_FLUSH | REQ_FUA)))) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 501ffdf..580ae0a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
 #define RQ_CIC(rq)		\
-	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
-#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private3)
+	((struct cfq_io_context *) (rq)->elevator.private)
+#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator.private2)
+#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator.private3)
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -3609,12 +3609,12 @@ static void cfq_put_request(struct request *rq)
 
 		put_io_context(RQ_CIC(rq)->ioc);
 
-		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->elevator.private = NULL;
+		rq->elevator.private2 = NULL;
 
 		/* Put down rq reference on cfqg */
 		cfq_put_cfqg(RQ_CFQG(rq));
-		rq->elevator_private3 = NULL;
+		rq->elevator.private3 = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -3702,9 +3702,9 @@ new_queue:
 
 	cfqq->allocated[rw]++;
 	cfqq->ref++;
-	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
-	rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
+	rq->elevator.private = cic;
+	rq->elevator.private2 = cfqq;
+	rq->elevator.private3 = cfq_ref_get_cfqg(cfqq->cfqg);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
diff --git a/block/elevator.c b/block/elevator.c
index 270e097..02b66be 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -764,7 +764,7 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
-	rq->elevator_private = NULL;
+	rq->elevator.private = NULL;
 	return 0;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..0c569ec 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,25 +99,29 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * flush fields share space with the rb_node.
+	 * completion_data share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		struct {
-			unsigned int			seq;
-			struct list_head		list;
-		} flush;
+		void *completion_data;
 	};
 
-	void *completion_data;
-
 	/*
 	 * Three pointers are available for the IO schedulers, if they need
-	 * more they have to dynamically allocate it.
+	 * more they have to dynamically allocate it.  Let the flush fields
+	 * share space with these three pointers.
 	 */
-	void *elevator_private;
-	void *elevator_private2;
-	void *elevator_private3;
+	union {
+		struct {
+			void *private;
+			void *private2;
+			void *private3;
+		} elevator;
+		struct {
+			unsigned int			seq;
+			struct list_head		list;
+		} flush;
+	};
 
 	struct gendisk *rq_disk;
 	struct hd_struct *part;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests
@ 2011-01-25 20:41     ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-01-25 20:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

Hi Tejun,

On Fri, Jan 21 2011 at 10:59am -0500,
Tejun Heo <tj@kernel.org> wrote:
> 
> * As flush requests are never put on the IO scheduler, request fields
>   used for flush share space with rq->rb_node.  rq->completion_data is
>   moved out of the union.  This increases the request size by one
>   pointer.
> 
>   As rq->elevator_private* are used only by the iosched too, it is
>   possible to reduce the request size further.  However, to do that,
>   we need to modify request allocation path such that iosched data is
>   not allocated for flush requests.

I decided to take a crack at using rq->elevator_private* and came up
with the following patch.

Unfortunately, in testing I found that flush requests that have data do
in fact eventually get added to the queue as normal requests, via:
1) "data but flush is not necessary" case in blk_insert_flush
2) REQ_FSEQ_DATA case in blk_flush_complete_seq

I know this because in my following get_request() change to _not_ call
elv_set_request() for flush requests hit cfq_put_request()'s
BUG_ON(!cfqq->allocated[rw]).  cfqq->allocated[rw] gets set via
elv_set_request()'s call to cfq_set_request().

So this seems to call in to question the running theory that flush
requests can share 'struct request' space with elevator-specific members
(via union) -- be it rq->rb_node or rq->elevator_private*.

Please advise, thanks!
Mike



From: Mike Snitzer <snitzer@redhat.com>

Skip elevator initialization for flush requests by passing priv=0 to
blk_alloc_request() in get_request().  As such elv_set_request() is
never called for flush requests.

Move elevator_private* into 'struct elevator' and have the flush fields
share a union with it.

Reclaim the space lost in 'struct request' by moving 'completion_data'
back in to the union with 'rb_node'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c       |   13 +++++++++----
 block/cfq-iosched.c    |   18 +++++++++---------
 block/elevator.c       |    2 +-
 include/linux/blkdev.h |   26 +++++++++++++++-----------
 4 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 72dd23b..f507888 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	int may_queue, priv = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * Skip elevator initialization for flush requests
+	 */
+	if (!(bio && (bio->bi_rw & (REQ_FLUSH | REQ_FUA)))) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 501ffdf..580ae0a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
 #define RQ_CIC(rq)		\
-	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
-#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private3)
+	((struct cfq_io_context *) (rq)->elevator.private)
+#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator.private2)
+#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator.private3)
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -3609,12 +3609,12 @@ static void cfq_put_request(struct request *rq)
 
 		put_io_context(RQ_CIC(rq)->ioc);
 
-		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->elevator.private = NULL;
+		rq->elevator.private2 = NULL;
 
 		/* Put down rq reference on cfqg */
 		cfq_put_cfqg(RQ_CFQG(rq));
-		rq->elevator_private3 = NULL;
+		rq->elevator.private3 = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -3702,9 +3702,9 @@ new_queue:
 
 	cfqq->allocated[rw]++;
 	cfqq->ref++;
-	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
-	rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
+	rq->elevator.private = cic;
+	rq->elevator.private2 = cfqq;
+	rq->elevator.private3 = cfq_ref_get_cfqg(cfqq->cfqg);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
diff --git a/block/elevator.c b/block/elevator.c
index 270e097..02b66be 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -764,7 +764,7 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
-	rq->elevator_private = NULL;
+	rq->elevator.private = NULL;
 	return 0;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..0c569ec 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,25 +99,29 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * flush fields share space with the rb_node.
+	 * completion_data share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		struct {
-			unsigned int			seq;
-			struct list_head		list;
-		} flush;
+		void *completion_data;
 	};
 
-	void *completion_data;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-23 10:31     ` Tejun Heo
@ 2011-01-25 20:46       ` Vivek Goyal
  2011-01-25 21:04         ` Mike Snitzer
  0 siblings, 1 reply; 48+ messages in thread
From: Vivek Goyal @ 2011-01-25 20:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, axboe, tytso, djwong, shli, neilb, adilger.kernel,
	jack, linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch,
	josef

On Sun, Jan 23, 2011 at 11:31:33AM +0100, Tejun Heo wrote:
> On Fri, Jan 21, 2011 at 07:49:55PM -0500, Mike Snitzer wrote:
> > > + * If the device doesn't have writeback cache, FLUSH and FUA don't make any
> > > + * difference.  The requests are either completed immediately if there's no
> > > + * data or executed as normal requests otherwise.
> > 
> > For devices without a writeback cache, I'm not seeing where pure flushes
> > are completed immediately.  But I do see where data is processed
> > directly in blk_insert_flush().
> 
> Yeah, it does.  Pure flushes on a device w/o writeback cache, @policy
> is zero and blk_flush_complete_seq() will directly proceed to
> REQ_FSEQ_DONE.

I see following code in __generic_make_request(). I am wondering if empty
flushes will be completed here itself if device does not have writeback
cache.

                /*
                 * Filter flush bio's early so that make_request based
                 * drivers without flush support don't have to worry
                 * about them.
                 */
                if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags) {
                        bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
                        if (!nr_sectors) {
                                err = 0;
                                goto end_io;
                        }
                }

Thanks
Vivek

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-25 20:46       ` Vivek Goyal
@ 2011-01-25 21:04         ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-01-25 21:04 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, axboe, tytso, djwong, shli, neilb, adilger.kernel,
	jack, linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch,
	josef

On Tue, Jan 25 2011 at  3:46pm -0500,
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Sun, Jan 23, 2011 at 11:31:33AM +0100, Tejun Heo wrote:
> > On Fri, Jan 21, 2011 at 07:49:55PM -0500, Mike Snitzer wrote:
> > > > + * If the device doesn't have writeback cache, FLUSH and FUA don't make any
> > > > + * difference.  The requests are either completed immediately if there's no
> > > > + * data or executed as normal requests otherwise.
> > > 
> > > For devices without a writeback cache, I'm not seeing where pure flushes
> > > are completed immediately.  But I do see where data is processed
> > > directly in blk_insert_flush().
> > 
> > Yeah, it does.  Pure flushes on a device w/o writeback cache, @policy
> > is zero and blk_flush_complete_seq() will directly proceed to
> > REQ_FSEQ_DONE.
> 
> I see following code in __generic_make_request(). I am wondering if empty
> flushes will be completed here itself if device does not have writeback
> cache.
> 
>                 /*
>                  * Filter flush bio's early so that make_request based
>                  * drivers without flush support don't have to worry
>                  * about them.
>                  */
>                 if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags) {
>                         bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
>                         if (!nr_sectors) {
>                                 err = 0;
>                                 goto end_io;
>                         }
>                 }

Yes, due to the !q->flush_flags check, empty flushes will complete here
if the device's request_queue doesn't advertise support for FLUSH or FUA
via blk_queue_flush().

Mike



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests
  2011-01-25 20:41     ` Mike Snitzer
  (?)
@ 2011-01-25 21:55     ` Mike Snitzer
  2011-01-26  5:27       ` [RFC PATCH 4/3] block: skip elevator initialization for flush requests -- was never BUGGY relative to upstream Mike Snitzer
  -1 siblings, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2011-01-25 21:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

On Tue, Jan 25 2011 at  3:41pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> Hi Tejun,
> 
> On Fri, Jan 21 2011 at 10:59am -0500,
> Tejun Heo <tj@kernel.org> wrote:
> > 
> > * As flush requests are never put on the IO scheduler, request fields
> >   used for flush share space with rq->rb_node.  rq->completion_data is
> >   moved out of the union.  This increases the request size by one
> >   pointer.
> > 
> >   As rq->elevator_private* are used only by the iosched too, it is
> >   possible to reduce the request size further.  However, to do that,
> >   we need to modify request allocation path such that iosched data is
> >   not allocated for flush requests.
> 
> I decided to take a crack at using rq->elevator_private* and came up
> with the following patch.
> 
> Unfortunately, in testing I found that flush requests that have data do
> in fact eventually get added to the queue as normal requests, via:
> 1) "data but flush is not necessary" case in blk_insert_flush
> 2) REQ_FSEQ_DATA case in blk_flush_complete_seq

Vivek helped me understand that adding the request to the queue doesn't
mean it goes to the elevator.  It is inserting the request directly to
the underlying queue.

That embarassing oversight aside, the flush request is still getting to
the elevator somehow -- even though elv_set_request() was clearly not
called.

It is an interesting duality that:
1) REQ_ELVPRIV is never set because priv=0 is passed to blk_alloc_request
2) yet when blk_free_request() checks rq->cmd_flags REQ_ELVPRIV is set;
   resulting in the call to elv_put_request()

> I know this because in my following get_request() change to _not_ call
> elv_set_request() for flush requests hit cfq_put_request()'s
> BUG_ON(!cfqq->allocated[rw]).

FYI, here is the backtrace:

PID: 0      TASK: ffff88007ccd6b30  CPU: 1   COMMAND: "swapper"
 #0 [ffff880002103930] show_trace_log_lvl at ffffffff8100f3ec
 #1 [ffff880002103980] delay_tsc at ffffffff8125e62a
 #2 [ffff8800021039b0] __const_udelay at ffffffff8125e5d6
 #3 [ffff8800021039c0] panic at ffffffff814c3604
 #4 [ffff880002103a40] oops_end at ffffffff814c7622
 #5 [ffff880002103a70] die at ffffffff8100f33b
 #6 [ffff880002103aa0] do_trap at ffffffff814c6ec4
 #7 [ffff880002103b00] do_invalid_op at ffffffff8100cee5
 #8 [ffff880002103ba0] invalid_op at ffffffff8100bf5b
    [exception RIP: cfq_put_request+128]
    RIP: ffffffff8124f000  RSP: ffff880002103c58  RFLAGS: 00010046
    RAX: 0000000000000019  RBX: ffff88007b5668a0  RCX: 0000000000002c7b
    RDX: 0000000000000000  RSI: ffff88007b5668a0  RDI: ffff88007b5668a0
    RBP: ffff880002103c68   R8: 0000000000000003   R9: 0000000000000001
    R10: 0000000000000003  R11: 0000000000000003  R12: ffff88007b566940
    R13: 00000000018c2441  R14: 0000000000000001  R15: 00000000000000a5
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880002103c70] elv_put_request at ffffffff8123208e
#10 [ffff880002103c80] __blk_put_request at ffffffff8123ae23
#11 [ffff880002103cb0] blk_finish_request at ffffffff8123b049
#12 [ffff880002103d00] __blk_end_request_all at ffffffff8123b0fb
#13 [ffff880002103d20] blk_flush_complete_seq at ffffffff8123de4c
#14 [ffff880002103d50] flush_end_io at ffffffff8123e095
#15 [ffff880002103da0] blk_finish_request at ffffffff8123aedb
#16 [ffff880002103df0] __blk_end_request_all at ffffffff8123b0fb
#17 [ffff880002103e10] blk_done at ffffffffa002e085
#18 [ffff880002103e50] vring_interrupt at ffffffffa001f19c
#19 [ffff880002103e70] vp_vring_interrupt at ffffffffa00264bb
#20 [ffff880002103ec0] vp_interrupt at ffffffffa0026544
#21 [ffff880002103ee0] handle_IRQ_event at ffffffff810d10b0
#22 [ffff880002103f30] handle_fasteoi_irq at ffffffff810d38a9
#23 [ffff880002103f60] handle_irq at ffffffff8100dfb9
#24 [ffff880002103f80] do_IRQ at ffffffff814cb32c
--- <IRQ stack> ---
#25 [ffff88007ccf9e28] ret_from_intr at ffffffff8100bad3
    [exception RIP: native_safe_halt+11]
    RIP: ffffffff81033f0b  RSP: ffff88007ccf9ed8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: ffff88007ccf9ed8  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000001  RDI: ffffffff81d0f1e8
    RBP: ffffffff8100bace   R8: 0000000000000000   R9: 0000000000000001
    R10: 0000000000000000  R11: 00000000fffba7c1  R12: 0000000000000001
    R13: ffff88007ccf9e68  R14: 0000000281075d93  R15: ffff88007ccf9e98
    ORIG_RAX: ffffffffffffffc4  CS: 0010  SS: 0018
#26 [ffff88007ccf9ee0] default_idle at ffffffff81013e0d
#27 [ffff88007ccf9f00] cpu_idle at ffffffff81009e96

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-25 10:21         ` Tejun Heo
  2011-01-25 11:39           ` Jens Axboe
@ 2011-01-25 22:56           ` Darrick J. Wong
  1 sibling, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2011-01-25 22:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, axboe, tytso, shli, neilb, adilger.kernel, jack,
	snitzer, linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch,
	josef

On Tue, Jan 25, 2011 at 11:21:28AM +0100, Tejun Heo wrote:
> Hello, Darrick.
> 
> On Mon, Jan 24, 2011 at 12:31:55PM -0800, Darrick J. Wong wrote:
> > > So, I think it's better to start with something simple and improve it
> > > with actual testing.  If the current simple implementation can match
> > > Darrick's previous numbers, let's first settle the mechanisms.  We can
> > 
> > Yep, the fsync-happy numbers more or less match... at least for 2.6.37:
> > http://tinyurl.com/4q2xeao
> 
> Good to hear.  Thanks for the detailed testing.
> 
> > I'll give 2.6.38-rc2 a try later, though -rc1 didn't boot for me, so these
> > numbers are based on a backport to .37. :(
> 
> Well, there hasn' been any change in the area during the merge window
> anyway, so I think testing on 2.6.37 should be fine.

Well, I gave it a spin on -rc2 with no problems and no significant change in
performance, so:

Acked-by: Darrick J. Wong <djwong@us.ibm.com>

> > > I don't really think we should design the whole thing around broken
> > > devices which incorrectly report writeback cache when it need not.
> > > The correct place to work around that is during device identification
> > > not in the flush logic.
> > 
> > elm3a4_sas and elm3c71_extsas advertise writeback cache yet the
> > flush completion times are suspiciously low.  I suppose it could be
> > useful to disable flushes to squeeze out that last bit of
> > performance, though I don't know how one goes about querying the
> > disk array to learn if there's a battery behind the cache.  I guess
> > the current mechanism (admin knob that picks a safe default) is good
> > enough.
> 
> Yeap, that or a blacklist of devices which lie.

Hmm... I don't think a blacklist would work for our arrays, since one can force
them to run with write cache and no battery.  I _do_ have a patch that adds a
sysfs knob to the block layer to drop flush/fua if the admin really really
really wants it, so I'll send that out shortly along with another one to remove
the barrier= mount option from ext4.

(Unless the screams of objection rain from the skies. :))

--D

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 4/3] block: skip elevator initialization for flush requests -- was never BUGGY relative to upstream
  2011-01-25 21:55     ` Mike Snitzer
@ 2011-01-26  5:27       ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-01-26  5:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

On Tue, Jan 25 2011 at  4:55pm -0500,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Tue, Jan 25 2011 at  3:41pm -0500,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > Hi Tejun,
> > 
> > On Fri, Jan 21 2011 at 10:59am -0500,
> > Tejun Heo <tj@kernel.org> wrote:
> > > 
> > > * As flush requests are never put on the IO scheduler, request fields
> > >   used for flush share space with rq->rb_node.  rq->completion_data is
> > >   moved out of the union.  This increases the request size by one
> > >   pointer.
> > > 
> > >   As rq->elevator_private* are used only by the iosched too, it is
> > >   possible to reduce the request size further.  However, to do that,
> > >   we need to modify request allocation path such that iosched data is
> > >   not allocated for flush requests.
> > 
> > I decided to take a crack at using rq->elevator_private* and came up
> > with the following patch.

...

> It is an interesting duality that:
> 1) REQ_ELVPRIV is never set because priv=0 is passed to blk_alloc_request
> 2) yet when blk_free_request() checks rq->cmd_flags REQ_ELVPRIV is set;
>    resulting in the call to elv_put_request()

Turns out priv=0 was never actually being established in get_request()
on the kernel I was testing (see below).

> > I know this because in my following get_request() change to _not_ call
> > elv_set_request() for flush requests hit cfq_put_request()'s
> > BUG_ON(!cfqq->allocated[rw]).
> 
> FYI, here is the backtrace:

That backtrace was from a RHEL6 port of your recent flush merge work.
One RHELism associated with the RHEL6 flush+fua port in general (and now
this flush merge port) is that bio and request flags are _not_ shared.

As such I introduced BIO_FLUSH and BIO_FUA flags in RHEL6.  Long story
short, my 4/3 patch works just fine ontop of your 3 patches that Jens
just staged for 2.6.39.

So sorry for the noise!  I'm kicking myself for having tainted my patch
(and your work indirectly) by making an issue out of nothing.

Taking a step back, what do you think of my proposed patch?

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests
  2011-01-25 20:41     ` Mike Snitzer
  (?)
  (?)
@ 2011-01-26 10:03     ` Tejun Heo
  2011-01-26 10:05       ` Tejun Heo
  2011-02-01 17:38       ` [RFC " Mike Snitzer
  -1 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-26 10:03 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

Hello,

On Tue, Jan 25, 2011 at 03:41:58PM -0500, Mike Snitzer wrote:
> Unfortunately, in testing I found that flush requests that have data do
> in fact eventually get added to the queue as normal requests, via:
> 1) "data but flush is not necessary" case in blk_insert_flush
> 2) REQ_FSEQ_DATA case in blk_flush_complete_seq
> 
> I know this because in my following get_request() change to _not_ call
> elv_set_request() for flush requests hit cfq_put_request()'s
> BUG_ON(!cfqq->allocated[rw]).  cfqq->allocated[rw] gets set via
> elv_set_request()'s call to cfq_set_request().
> 
> So this seems to call in to question the running theory that flush
> requests can share 'struct request' space with elevator-specific members
> (via union) -- be it rq->rb_node or rq->elevator_private*.

As this part seems to have already been solved, I'm skipping this
part.

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 72dd23b..f507888 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	struct request_list *rl = &q->rq;
>  	struct io_context *ioc = NULL;
>  	const bool is_sync = rw_is_sync(rw_flags) != 0;
> -	int may_queue, priv;
> +	int may_queue, priv = 0;
>  
>  	may_queue = elv_may_queue(q, rw_flags);
>  	if (may_queue == ELV_MQUEUE_NO)
> @@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	rl->count[is_sync]++;
>  	rl->starved[is_sync] = 0;
>  
> -	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> -	if (priv)
> -		rl->elvpriv++;
> +	/*
> +	 * Skip elevator initialization for flush requests
> +	 */
> +	if (!(bio && (bio->bi_rw & (REQ_FLUSH | REQ_FUA)))) {
> +		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> +		if (priv)
> +			rl->elvpriv++;
> +	}

I thought about doing it this way but I think we're burying the
REQ_FLUSH|REQ_FUA test logic too deep.  get_request() shouldn't
"magically" know not to allocate elevator data.  The decision should
be made higher in the stack and passed down to get_request().  e.g. if
REQ_SORTED is set in @rw, elevator data is allocated; otherwise, not.

> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8a082a5..0c569ec 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -99,25 +99,29 @@ struct request {
>  	/*
>  	 * The rb_node is only used inside the io scheduler, requests
>  	 * are pruned when moved to the dispatch queue. So let the
> -	 * flush fields share space with the rb_node.
> +	 * completion_data share space with the rb_node.
>  	 */
>  	union {
>  		struct rb_node rb_node;	/* sort/lookup */
> -		struct {
> -			unsigned int			seq;
> -			struct list_head		list;
> -		} flush;
> +		void *completion_data;
>  	};
>  
> -	void *completion_data;
> -
>  	/*
>  	 * Three pointers are available for the IO schedulers, if they need
> -	 * more they have to dynamically allocate it.
> +	 * more they have to dynamically allocate it.  Let the flush fields
> +	 * share space with these three pointers.
>  	 */
> -	void *elevator_private;
> -	void *elevator_private2;
> -	void *elevator_private3;
> +	union {
> +		struct {
> +			void *private;
> +			void *private2;
> +			void *private3;
> +		} elevator;
> +		struct {
> +			unsigned int			seq;
> +			struct list_head		list;
> +		} flush;
> +	};

Another thing is, can we please make private* an array?  The number
postfixes are irksome.  It's even one based instead of zero!

Also, it would be great to better describe the lifetime difference
between the first and the second unions and why it has be organized
this way (rb_node and completion_data can live together but rb_node
and flush can't).

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests
  2011-01-26 10:03     ` [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests Tejun Heo
@ 2011-01-26 10:05       ` Tejun Heo
  2011-02-01 17:38       ` [RFC " Mike Snitzer
  1 sibling, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-01-26 10:05 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

On Wed, Jan 26, 2011 at 11:03:22AM +0100, Tejun Heo wrote:
> Also, it would be great to better describe the lifetime difference
> between the first and the second unions and why it has be organized
> this way (rb_node and completion_data can live together but rb_node
> and flush can't).

Oops, what can't live together are elevator_private* and
completion_data.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 4/3] block: skip elevator initialization for flush requests
  2011-01-26 10:03     ` [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests Tejun Heo
  2011-01-26 10:05       ` Tejun Heo
@ 2011-02-01 17:38       ` Mike Snitzer
  2011-02-01 18:52         ` Tejun Heo
  1 sibling, 1 reply; 48+ messages in thread
From: Mike Snitzer @ 2011-02-01 17:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

On Wed, Jan 26 2011 at  5:03am -0500,
Tejun Heo <tj@kernel.org> wrote:

> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 72dd23b..f507888 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> >  	struct request_list *rl = &q->rq;
> >  	struct io_context *ioc = NULL;
> >  	const bool is_sync = rw_is_sync(rw_flags) != 0;
> > -	int may_queue, priv;
> > +	int may_queue, priv = 0;
> >  
> >  	may_queue = elv_may_queue(q, rw_flags);
> >  	if (may_queue == ELV_MQUEUE_NO)
> > @@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> >  	rl->count[is_sync]++;
> >  	rl->starved[is_sync] = 0;
> >  
> > -	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> > -	if (priv)
> > -		rl->elvpriv++;
> > +	/*
> > +	 * Skip elevator initialization for flush requests
> > +	 */
> > +	if (!(bio && (bio->bi_rw & (REQ_FLUSH | REQ_FUA)))) {
> > +		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> > +		if (priv)
> > +			rl->elvpriv++;
> > +	}
> 
> I thought about doing it this way but I think we're burying the
> REQ_FLUSH|REQ_FUA test logic too deep.  get_request() shouldn't
> "magically" know not to allocate elevator data.

There is already a considerable amount of REQ_FLUSH|REQ_FUA special
casing magic sprinkled though-out the block layer.  Why is this
get_request() change the case that goes too far?

> The decision should
> be made higher in the stack and passed down to get_request().  e.g. if
> REQ_SORTED is set in @rw, elevator data is allocated; otherwise, not.

Considering REQ_SORTED is set in elv_insert(), well after get_request() 
is called, I'm not seeing what you're suggesting.

Anyway, I agree that ideally we'd have a mechanism to explicitly
short-circuit elevator initialization.  But doing so in a meaningful way
would likely require a fair amount of refactoring of get_request* and
its callers.  I'll come back to this and have another look but my gut is
this interface churn wouldn't _really_ help -- all things considered.

> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 8a082a5..0c569ec 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -99,25 +99,29 @@ struct request {
> >  	/*
> >  	 * The rb_node is only used inside the io scheduler, requests
> >  	 * are pruned when moved to the dispatch queue. So let the
> > -	 * flush fields share space with the rb_node.
> > +	 * completion_data share space with the rb_node.
> >  	 */
> >  	union {
> >  		struct rb_node rb_node;	/* sort/lookup */
> > -		struct {
> > -			unsigned int			seq;
> > -			struct list_head		list;
> > -		} flush;
> > +		void *completion_data;
> >  	};
> >  
> > -	void *completion_data;
> > -
> >  	/*
> >  	 * Three pointers are available for the IO schedulers, if they need
> > -	 * more they have to dynamically allocate it.
> > +	 * more they have to dynamically allocate it.  Let the flush fields
> > +	 * share space with these three pointers.
> >  	 */
> > -	void *elevator_private;
> > -	void *elevator_private2;
> > -	void *elevator_private3;
> > +	union {
> > +		struct {
> > +			void *private;
> > +			void *private2;
> > +			void *private3;
> > +		} elevator;
> > +		struct {
> > +			unsigned int			seq;
> > +			struct list_head		list;
> > +		} flush;
> > +	};
> 
> Another thing is, can we please make private* an array?  The number
> postfixes are irksome.  It's even one based instead of zero!

Sure, I can sort that out.

> > Also, it would be great to better describe the lifetime difference
> > between the first and the second unions and why it has be organized
> > this way (rb_node and completion_data can live together but rb_node
> > and flush can't).
>
> Oops, what can't live together are elevator_private* and
> completion_data.

I'll better describe the 2nd union's sharing in the next revision.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 4/3] block: skip elevator initialization for flush requests
  2011-02-01 17:38       ` [RFC " Mike Snitzer
@ 2011-02-01 18:52         ` Tejun Heo
  2011-02-01 22:46           ` [PATCH v2 1/2] " Mike Snitzer
  2011-02-01 22:46             ` Mike Snitzer
  0 siblings, 2 replies; 48+ messages in thread
From: Tejun Heo @ 2011-02-01 18:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer

Hello,

On Tue, Feb 01, 2011 at 12:38:46PM -0500, Mike Snitzer wrote:
> > I thought about doing it this way but I think we're burying the
> > REQ_FLUSH|REQ_FUA test logic too deep.  get_request() shouldn't
> > "magically" know not to allocate elevator data.
> 
> There is already a considerable amount of REQ_FLUSH|REQ_FUA special
> casing magic sprinkled though-out the block layer.  Why is this
> get_request() change the case that goes too far?

After the reimplementation, FLUSH implementation seems to be pretty
well isolated.  Also, having REQ_FLUSH logic in the issue and
completion paths is logical and preventing them from leaking to other
places sounds like a good idea.

> > The decision should
> > be made higher in the stack and passed down to get_request().  e.g. if
> > REQ_SORTED is set in @rw, elevator data is allocated; otherwise, not.
> 
> Considering REQ_SORTED is set in elv_insert(), well after get_request() 
> is called, I'm not seeing what you're suggesting.

I was suggesting using REQ_SORTED in @rw parameter to indicate "this
request may be sorted and thus needs elevator data allocation".

> Anyway, I agree that ideally we'd have a mechanism to explicitly
> short-circuit elevator initialization.  But doing so in a meaningful way
> would likely require a fair amount of refactoring of get_request* and
> its callers.  I'll come back to this and have another look but my gut is
> this interface churn wouldn't _really_ help -- all things considered.

I don't know.  I agree that it's not a critical issue but, to me,
subjectively of course, it feels a bit too subtle.  The sharing of
fields using unions is already subtle enough.  I with that at least
the allocation switching would be obvious and explicit.  The combined
subtleties scare me.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 1/2] block: skip elevator initialization for flush requests
  2011-02-01 18:52         ` Tejun Heo
@ 2011-02-01 22:46           ` Mike Snitzer
  2011-02-02 21:51             ` Vivek Goyal
  2011-02-02 22:55             ` [PATCH v3 1/2] block: skip elevator data " Mike Snitzer
  2011-02-01 22:46             ` Mike Snitzer
  1 sibling, 2 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-01 22:46 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: tytso, djwong, shli, neilb, adilger.kernel, jack, linux-kernel,
	kmannth, cmm, linux-ext4, rwheeler, hch, josef, jmoyer, vgoyal,
	snitzer

Skip elevator initialization during request allocation if REQ_SORTED
is not set in the @rw_flags passed to the request allocator.

Set REQ_SORTED for all requests that may be put on IO scheduler.  Flush
requests are not put on IO scheduler so REQ_SORTED is not set for
them.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c |   24 +++++++++++++++++++-----
 1 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 72dd23b..f6fcc64 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	int may_queue, priv = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * Only initialize elevator data if REQ_SORTED is set.
+	 */
+	if (rw_flags & REQ_SORTED) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
@@ -1197,6 +1202,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 	const unsigned short prio = bio_prio(bio);
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
 	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
+	const bool flush = !!(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;
@@ -1210,7 +1216,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 
 	spin_lock_irq(q->queue_lock);
 
-	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+	if (flush) {
 		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
@@ -1293,6 +1299,14 @@ get_rq:
 		rw_flags |= REQ_SYNC;
 
 	/*
+	 * Set REQ_SORTED for all requests that may be put on IO scheduler.
+	 * The request allocator's IO scheduler initialization will be skipped
+	 * if REQ_SORTED is not set.
+	 */
+	if (!flush)
+		rw_flags |= REQ_SORTED;
+
+	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 2/2] block: share request flush fields with elevator_private
  2011-02-01 18:52         ` Tejun Heo
@ 2011-02-01 22:46             ` Mike Snitzer
  2011-02-01 22:46             ` Mike Snitzer
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-01 22:46 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: tytso, djwong, shli, neilb, adilger.kernel, jack, linux-kernel,
	kmannth, cmm, linux-ext4, rwheeler, hch, josef, jmoyer, vgoyal,
	snitzer

Flush requests are never put on the IO scheduler.  Convert request
structure's elevator_private* into an array and have the flush fields
share a union with it.

Reclaim the space lost in 'struct request' by moving 'completion_data'
back in the union with 'rb_node'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/cfq-iosched.c    |   18 +++++++++---------
 block/elevator.c       |    2 +-
 include/linux/blkdev.h |   23 ++++++++++++-----------
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 501ffdf..8692958 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
 #define RQ_CIC(rq)		\
-	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
-#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private3)
+	((struct cfq_io_context *) (rq)->elevator_private[0])
+#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
+#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private[2])
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -3609,12 +3609,12 @@ static void cfq_put_request(struct request *rq)
 
 		put_io_context(RQ_CIC(rq)->ioc);
 
-		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->elevator_private[0] = NULL;
+		rq->elevator_private[1] = NULL;
 
 		/* Put down rq reference on cfqg */
 		cfq_put_cfqg(RQ_CFQG(rq));
-		rq->elevator_private3 = NULL;
+		rq->elevator_private[2] = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -3702,9 +3702,9 @@ new_queue:
 
 	cfqq->allocated[rw]++;
 	cfqq->ref++;
-	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
-	rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
+	rq->elevator_private[0] = cic;
+	rq->elevator_private[1] = cfqq;
+	rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
diff --git a/block/elevator.c b/block/elevator.c
index 270e097..f98e92e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -764,7 +764,7 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
-	rq->elevator_private = NULL;
+	rq->elevator_private[0] = NULL;
 	return 0;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..e3ee74f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,25 +99,26 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * flush fields share space with the rb_node.
+	 * completion_data share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		struct {
-			unsigned int			seq;
-			struct list_head		list;
-		} flush;
+		void *completion_data;
 	};
 
-	void *completion_data;
-
 	/*
 	 * Three pointers are available for the IO schedulers, if they need
-	 * more they have to dynamically allocate it.
+	 * more they have to dynamically allocate it.  Flush requests are
+	 * never put on the IO scheduler. So let the flush fields share
+	 * space with the three elevator_private pointers.
 	 */
-	void *elevator_private;
-	void *elevator_private2;
-	void *elevator_private3;
+	union {
+		void *elevator_private[3];
+		struct {
+			unsigned int		seq;
+			struct list_head	list;
+		} flush;
+	};
 
 	struct gendisk *rq_disk;
 	struct hd_struct *part;
-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 2/2] block: share request flush fields with elevator_private
@ 2011-02-01 22:46             ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-01 22:46 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: tytso, djwong, shli, neilb, adilger.kernel, jack, linux-kernel,
	kmannth, cmm, linux-ext4, rwheeler, hch, josef, jmoyer, vgoyal,
	snitzer

Flush requests are never put on the IO scheduler.  Convert request
structure's elevator_private* into an array and have the flush fields
share a union with it.

Reclaim the space lost in 'struct request' by moving 'completion_data'
back in the union with 'rb_node'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/cfq-iosched.c    |   18 +++++++++---------
 block/elevator.c       |    2 +-
 include/linux/blkdev.h |   23 ++++++++++++-----------
 3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 501ffdf..8692958 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
 #define RQ_CIC(rq)		\
-	((struct cfq_io_context *) (rq)->elevator_private)
-#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
-#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private3)
+	((struct cfq_io_context *) (rq)->elevator_private[0])
+#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
+#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private[2])
 
 static struct kmem_cache *cfq_pool;
 static struct kmem_cache *cfq_ioc_pool;
@@ -3609,12 +3609,12 @@ static void cfq_put_request(struct request *rq)
 
 		put_io_context(RQ_CIC(rq)->ioc);
 
-		rq->elevator_private = NULL;
-		rq->elevator_private2 = NULL;
+		rq->elevator_private[0] = NULL;
+		rq->elevator_private[1] = NULL;
 
 		/* Put down rq reference on cfqg */
 		cfq_put_cfqg(RQ_CFQG(rq));
-		rq->elevator_private3 = NULL;
+		rq->elevator_private[2] = NULL;
 
 		cfq_put_queue(cfqq);
 	}
@@ -3702,9 +3702,9 @@ new_queue:
 
 	cfqq->allocated[rw]++;
 	cfqq->ref++;
-	rq->elevator_private = cic;
-	rq->elevator_private2 = cfqq;
-	rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
+	rq->elevator_private[0] = cic;
+	rq->elevator_private[1] = cfqq;
+	rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
diff --git a/block/elevator.c b/block/elevator.c
index 270e097..f98e92e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -764,7 +764,7 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	if (e->ops->elevator_set_req_fn)
 		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
 
-	rq->elevator_private = NULL;
+	rq->elevator_private[0] = NULL;
 	return 0;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..e3ee74f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,25 +99,26 @@ struct request {
 	/*
 	 * The rb_node is only used inside the io scheduler, requests
 	 * are pruned when moved to the dispatch queue. So let the
-	 * flush fields share space with the rb_node.
+	 * completion_data share space with the rb_node.
 	 */
 	union {
 		struct rb_node rb_node;	/* sort/lookup */
-		struct {
-			unsigned int			seq;
-			struct list_head		list;
-		} flush;
+		void *completion_data;
 	};
 
-	void *completion_data;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/2] block: skip elevator initialization for flush requests
  2011-02-01 22:46           ` [PATCH v2 1/2] " Mike Snitzer
@ 2011-02-02 21:51             ` Vivek Goyal
  2011-02-02 22:06               ` Mike Snitzer
  2011-02-02 22:55             ` [PATCH v3 1/2] block: skip elevator data " Mike Snitzer
  1 sibling, 1 reply; 48+ messages in thread
From: Vivek Goyal @ 2011-02-02 21:51 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Tejun Heo, Jens Axboe, tytso, djwong, shli, neilb,
	adilger.kernel, jack, linux-kernel, kmannth, cmm, linux-ext4,
	rwheeler, hch, josef, jmoyer

On Tue, Feb 01, 2011 at 05:46:12PM -0500, Mike Snitzer wrote:
> Skip elevator initialization during request allocation if REQ_SORTED
> is not set in the @rw_flags passed to the request allocator.
> 
> Set REQ_SORTED for all requests that may be put on IO scheduler.  Flush
> requests are not put on IO scheduler so REQ_SORTED is not set for
> them.

So we are doing all this so that elevator_private and flush data can
share the space through union and we can avoid increasing the size
of struct rq by 1 pointer (4 or 8 bytes depneding on arch)? 

Looks good to me. One minor comment inline.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/blk-core.c |   24 +++++++++++++++++++-----
>  1 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 72dd23b..f6fcc64 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	struct request_list *rl = &q->rq;
>  	struct io_context *ioc = NULL;
>  	const bool is_sync = rw_is_sync(rw_flags) != 0;
> -	int may_queue, priv;
> +	int may_queue, priv = 0;
>  
>  	may_queue = elv_may_queue(q, rw_flags);
>  	if (may_queue == ELV_MQUEUE_NO)
> @@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	rl->count[is_sync]++;
>  	rl->starved[is_sync] = 0;
>  
> -	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> -	if (priv)
> -		rl->elvpriv++;
> +	/*
> +	 * Only initialize elevator data if REQ_SORTED is set.
> +	 */
> +	if (rw_flags & REQ_SORTED) {
> +		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> +		if (priv)
> +			rl->elvpriv++;
> +	}
>  
>  	if (blk_queue_io_stat(q))
>  		rw_flags |= REQ_IO_STAT;
> @@ -1197,6 +1202,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
>  	const unsigned short prio = bio_prio(bio);
>  	const bool sync = !!(bio->bi_rw & REQ_SYNC);
>  	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
> +	const bool flush = !!(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
>  	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
>  	int where = ELEVATOR_INSERT_SORT;
>  	int rw_flags;
> @@ -1210,7 +1216,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
>  
>  	spin_lock_irq(q->queue_lock);
>  
> -	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
> +	if (flush) {
>  		where = ELEVATOR_INSERT_FLUSH;
>  		goto get_rq;
>  	}
> @@ -1293,6 +1299,14 @@ get_rq:
>  		rw_flags |= REQ_SYNC;
>  
>  	/*
> +	 * Set REQ_SORTED for all requests that may be put on IO scheduler.
> +	 * The request allocator's IO scheduler initialization will be skipped
> +	 * if REQ_SORTED is not set.
> +	 */

Do you want to mention here that why do we want to avoid IO scheduler
initialization. Specifically mention that set_request() is avoided so
that elevator_private[*] are not initialized and that space can be
used by flush request data.

> +	if (!flush)
> +		rw_flags |= REQ_SORTED;
> +
> +	/*
>  	 * Grab a free request. This is might sleep but can not fail.
>  	 * Returns with the queue unlocked.
>  	 */
> -- 
> 1.7.3.4

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/2] block: share request flush fields with elevator_private
  2011-02-01 22:46             ` Mike Snitzer
  (?)
@ 2011-02-02 21:52             ` Vivek Goyal
  -1 siblings, 0 replies; 48+ messages in thread
From: Vivek Goyal @ 2011-02-02 21:52 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Tejun Heo, Jens Axboe, tytso, djwong, shli, neilb,
	adilger.kernel, jack, linux-kernel, kmannth, cmm, linux-ext4,
	rwheeler, hch, josef, jmoyer

On Tue, Feb 01, 2011 at 05:46:13PM -0500, Mike Snitzer wrote:
> Flush requests are never put on the IO scheduler.  Convert request
> structure's elevator_private* into an array and have the flush fields
> share a union with it.
> 
> Reclaim the space lost in 'struct request' by moving 'completion_data'
> back in the union with 'rb_node'.
> 

Looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> ---
>  block/cfq-iosched.c    |   18 +++++++++---------
>  block/elevator.c       |    2 +-
>  include/linux/blkdev.h |   23 ++++++++++++-----------
>  3 files changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 501ffdf..8692958 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -54,9 +54,9 @@ static const int cfq_hist_divisor = 4;
>  #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
>  
>  #define RQ_CIC(rq)		\
> -	((struct cfq_io_context *) (rq)->elevator_private)
> -#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private2)
> -#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private3)
> +	((struct cfq_io_context *) (rq)->elevator_private[0])
> +#define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
> +#define RQ_CFQG(rq)		(struct cfq_group *) ((rq)->elevator_private[2])
>  
>  static struct kmem_cache *cfq_pool;
>  static struct kmem_cache *cfq_ioc_pool;
> @@ -3609,12 +3609,12 @@ static void cfq_put_request(struct request *rq)
>  
>  		put_io_context(RQ_CIC(rq)->ioc);
>  
> -		rq->elevator_private = NULL;
> -		rq->elevator_private2 = NULL;
> +		rq->elevator_private[0] = NULL;
> +		rq->elevator_private[1] = NULL;
>  
>  		/* Put down rq reference on cfqg */
>  		cfq_put_cfqg(RQ_CFQG(rq));
> -		rq->elevator_private3 = NULL;
> +		rq->elevator_private[2] = NULL;
>  
>  		cfq_put_queue(cfqq);
>  	}
> @@ -3702,9 +3702,9 @@ new_queue:
>  
>  	cfqq->allocated[rw]++;
>  	cfqq->ref++;
> -	rq->elevator_private = cic;
> -	rq->elevator_private2 = cfqq;
> -	rq->elevator_private3 = cfq_ref_get_cfqg(cfqq->cfqg);
> +	rq->elevator_private[0] = cic;
> +	rq->elevator_private[1] = cfqq;
> +	rq->elevator_private[2] = cfq_ref_get_cfqg(cfqq->cfqg);
>  
>  	spin_unlock_irqrestore(q->queue_lock, flags);
>  
> diff --git a/block/elevator.c b/block/elevator.c
> index 270e097..f98e92e 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -764,7 +764,7 @@ int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
>  	if (e->ops->elevator_set_req_fn)
>  		return e->ops->elevator_set_req_fn(q, rq, gfp_mask);
>  
> -	rq->elevator_private = NULL;
> +	rq->elevator_private[0] = NULL;
>  	return 0;
>  }
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8a082a5..e3ee74f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -99,25 +99,26 @@ struct request {
>  	/*
>  	 * The rb_node is only used inside the io scheduler, requests
>  	 * are pruned when moved to the dispatch queue. So let the
> -	 * flush fields share space with the rb_node.
> +	 * completion_data share space with the rb_node.
>  	 */
>  	union {
>  		struct rb_node rb_node;	/* sort/lookup */
> -		struct {
> -			unsigned int			seq;
> -			struct list_head		list;
> -		} flush;
> +		void *completion_data;
>  	};
>  
> -	void *completion_data;
> -
>  	/*
>  	 * Three pointers are available for the IO schedulers, if they need
> -	 * more they have to dynamically allocate it.
> +	 * more they have to dynamically allocate it.  Flush requests are
> +	 * never put on the IO scheduler. So let the flush fields share
> +	 * space with the three elevator_private pointers.
>  	 */
> -	void *elevator_private;
> -	void *elevator_private2;
> -	void *elevator_private3;
> +	union {
> +		void *elevator_private[3];
> +		struct {
> +			unsigned int		seq;
> +			struct list_head	list;
> +		} flush;
> +	};
>  
>  	struct gendisk *rq_disk;
>  	struct hd_struct *part;
> -- 
> 1.7.3.4

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/2] block: skip elevator initialization for flush requests
  2011-02-02 21:51             ` Vivek Goyal
@ 2011-02-02 22:06               ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-02 22:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Jens Axboe, tytso, djwong, shli, neilb,
	adilger.kernel, jack, linux-kernel, kmannth, cmm, linux-ext4,
	rwheeler, hch, josef, jmoyer

On Wed, Feb 02 2011 at  4:51pm -0500,
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Feb 01, 2011 at 05:46:12PM -0500, Mike Snitzer wrote:
> > Skip elevator initialization during request allocation if REQ_SORTED
> > is not set in the @rw_flags passed to the request allocator.
> > 
> > Set REQ_SORTED for all requests that may be put on IO scheduler.  Flush
> > requests are not put on IO scheduler so REQ_SORTED is not set for
> > them.
> 
> So we are doing all this so that elevator_private and flush data can
> share the space through union and we can avoid increasing the size
> of struct rq by 1 pointer (4 or 8 bytes depneding on arch)? 

Correct.

> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 72dd23b..f6fcc64 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1197,6 +1202,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
> >  	const unsigned short prio = bio_prio(bio);
> >  	const bool sync = !!(bio->bi_rw & REQ_SYNC);
> >  	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
> > +	const bool flush = !!(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
> >  	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
> >  	int where = ELEVATOR_INSERT_SORT;
> >  	int rw_flags;
> > @@ -1210,7 +1216,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
> >  
> >  	spin_lock_irq(q->queue_lock);
> >  
> > -	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
> > +	if (flush) {
> >  		where = ELEVATOR_INSERT_FLUSH;
> >  		goto get_rq;
> >  	}
> > @@ -1293,6 +1299,14 @@ get_rq:
> >  		rw_flags |= REQ_SYNC;
> >  
> >  	/*
> > +	 * Set REQ_SORTED for all requests that may be put on IO scheduler.
> > +	 * The request allocator's IO scheduler initialization will be skipped
> > +	 * if REQ_SORTED is not set.
> > +	 */
> 
> Do you want to mention here that why do we want to avoid IO scheduler
> initialization. Specifically mention that set_request() is avoided so
> that elevator_private[*] are not initialized and that space can be
> used by flush request data.

Sure, I'll post a v3 for this patch with that edit and your Acked-by.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-01 22:46           ` [PATCH v2 1/2] " Mike Snitzer
  2011-02-02 21:51             ` Vivek Goyal
@ 2011-02-02 22:55             ` Mike Snitzer
  2011-02-03  9:28               ` Tejun Heo
  2011-02-03 13:24               ` [PATCH v3 " Jens Axboe
  1 sibling, 2 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-02 22:55 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: tytso, djwong, shli, neilb, adilger.kernel, jack, linux-kernel,
	kmannth, cmm, linux-ext4, rwheeler, hch, josef, jmoyer, vgoyal

Set REQ_SORTED, in the @rw_flags passed to the request allocator, for
any request that may be put on IO scheduler.  Skip elevator data
initialization during request allocation if REQ_SORTED is not set.

REQ_SORTED is not set for flush requests because they are never put on
the IO scheduler.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c |   26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

v3: edits to patch header and __make_request() comment

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -764,7 +764,7 @@ static struct request *get_request(struc
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	int may_queue, priv = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +808,14 @@ static struct request *get_request(struc
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * Only initialize elevator data if REQ_SORTED is set.
+	 */
+	if (rw_flags & REQ_SORTED) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
@@ -1197,6 +1202,7 @@ static int __make_request(struct request
 	const unsigned short prio = bio_prio(bio);
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
 	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
+	const bool flush = !!(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;
@@ -1210,7 +1216,7 @@ static int __make_request(struct request
 
 	spin_lock_irq(q->queue_lock);
 
-	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+	if (flush) {
 		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
@@ -1293,6 +1299,16 @@ get_rq:
 		rw_flags |= REQ_SYNC;
 
 	/*
+	 * Set REQ_SORTED for all requests that may be put on IO scheduler.
+	 * The request allocator's IO scheduler initialization will be skipped
+	 * if REQ_SORTED is not set -- elv_set_request() is avoided so that
+	 * that the allocated request's elevator_private pointers are not
+	 * initialized and that space can be used by flush request data.
+	 */
+	if (!flush)
+		rw_flags |= REQ_SORTED;
+
+	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/2] block: share request flush fields with elevator_private
  2011-02-01 22:46             ` Mike Snitzer
  (?)
  (?)
@ 2011-02-03  9:24             ` Tejun Heo
  -1 siblings, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-02-03  9:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

On Tue, Feb 01, 2011 at 05:46:13PM -0500, Mike Snitzer wrote:
> Flush requests are never put on the IO scheduler.  Convert request
> structure's elevator_private* into an array and have the flush fields
> share a union with it.
> 
> Reclaim the space lost in 'struct request' by moving 'completion_data'
> back in the union with 'rb_node'.
> 
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-02 22:55             ` [PATCH v3 1/2] block: skip elevator data " Mike Snitzer
@ 2011-02-03  9:28               ` Tejun Heo
  2011-02-03 14:48                 ` [PATCH v4 " Mike Snitzer
  2011-02-03 13:24               ` [PATCH v3 " Jens Axboe
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2011-02-03  9:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

Hello,

On Wed, Feb 02, 2011 at 05:55:49PM -0500, Mike Snitzer wrote:
> @@ -808,9 +808,14 @@ static struct request *get_request(struc
>  	rl->count[is_sync]++;
>  	rl->starved[is_sync] = 0;
>  
> -	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> -	if (priv)
> -		rl->elvpriv++;
> +	/*
> +	 * Only initialize elevator data if REQ_SORTED is set.
> +	 */
> +	if (rw_flags & REQ_SORTED) {
> +		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
> +		if (priv)
> +			rl->elvpriv++;
> +	}

This isn't enough.  Now the allocated requests would have REQ_SORTED
set before going into elevator.  You probably want to filter out
REQ_SORTED to elv_may_queue() too.

Also, it would be great if you update the comment on top of
get_request() to explain what @rw_flags does.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-02 22:55             ` [PATCH v3 1/2] block: skip elevator data " Mike Snitzer
  2011-02-03  9:28               ` Tejun Heo
@ 2011-02-03 13:24               ` Jens Axboe
  2011-02-03 13:38                 ` Tejun Heo
  2011-02-03 14:54                 ` [PATCH v3 " Mike Snitzer
  1 sibling, 2 replies; 48+ messages in thread
From: Jens Axboe @ 2011-02-03 13:24 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Tejun Heo, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

On 2011-02-02 23:55, Mike Snitzer wrote:
> Set REQ_SORTED, in the @rw_flags passed to the request allocator, for
> any request that may be put on IO scheduler.  Skip elevator data
> initialization during request allocation if REQ_SORTED is not set.
> 
> REQ_SORTED is not set for flush requests because they are never put on
> the IO scheduler.

That looks very wrong. REQ_SORTED gets set _when_ the request is sorted
into the IO scheduler. This is gross misuse, a bad hack.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-03 13:24               ` [PATCH v3 " Jens Axboe
@ 2011-02-03 13:38                 ` Tejun Heo
  2011-02-04 15:04                   ` Vivek Goyal
  2011-02-03 14:54                 ` [PATCH v3 " Mike Snitzer
  1 sibling, 1 reply; 48+ messages in thread
From: Tejun Heo @ 2011-02-03 13:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

Hey, Jens.

On Thu, Feb 03, 2011 at 02:24:42PM +0100, Jens Axboe wrote:
> On 2011-02-02 23:55, Mike Snitzer wrote:
> > REQ_SORTED is not set for flush requests because they are never put on
> > the IO scheduler.
> 
> That looks very wrong. REQ_SORTED gets set _when_ the request is sorted
> into the IO scheduler. This is gross misuse, a bad hack.

The rationale behind suggesting was that it indicates to the allocator
that the request may be sorted as how the request will be used is
communicated using @rw_flags to the allocator.  The patch is buggy
that the flag actually ends up on the request.  Any better idea how to
communicate it?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v4 1/2] block: skip elevator data initialization for flush requests
  2011-02-03  9:28               ` Tejun Heo
@ 2011-02-03 14:48                 ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-03 14:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

Set REQ_SORTED, in the @rw_flags passed to the request allocator, for
all requests that may be put on IO scheduler.  REQ_SORTED is not set for
flush requests because they are never put on the IO scheduler.

Skip elevator data initialization during request allocation if
REQ_SORTED is not set in @rw_flags passed to get_request().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c |   33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

v4: fixed bug where REQ_SORTED wasn't cleared on entry to get_request
    - and Jens, yes I agree this is still a hack

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -754,6 +754,9 @@ static void freed_request(struct request
 
 /*
  * Get a free request, queue_lock must be held.
+ * @rw_flags: may be overloaded to convey additional request features;
+ * any overloaded feature flags must be cleared immediately.
+ *
  * Returns NULL on failure, with queue_lock held.
  * Returns !NULL on success, with queue_lock *not held*.
  */
@@ -764,7 +767,11 @@ static struct request *get_request(struc
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	const bool init_elv_data = !!(rw_flags & REQ_SORTED);
+	int may_queue, priv = 0;
+
+	if (init_elv_data)
+		rw_flags &= ~REQ_SORTED;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +815,14 @@ static struct request *get_request(struc
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * Only initialize elevator data if IO scheduler may be used.
+	 */
+	if (init_elv_data) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
@@ -1197,6 +1209,7 @@ static int __make_request(struct request
 	const unsigned short prio = bio_prio(bio);
 	const bool sync = !!(bio->bi_rw & REQ_SYNC);
 	const bool unplug = !!(bio->bi_rw & REQ_UNPLUG);
+	const bool flush = !!(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	const unsigned long ff = bio->bi_rw & REQ_FAILFAST_MASK;
 	int where = ELEVATOR_INSERT_SORT;
 	int rw_flags;
@@ -1210,7 +1223,7 @@ static int __make_request(struct request
 
 	spin_lock_irq(q->queue_lock);
 
-	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+	if (flush) {
 		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;
 	}
@@ -1293,6 +1306,16 @@ get_rq:
 		rw_flags |= REQ_SYNC;
 
 	/*
+	 * Set REQ_SORTED for all requests that may be put on IO scheduler.
+	 * The request allocator's IO scheduler initialization will be skipped
+	 * if REQ_SORTED is not set -- elv_set_request() is avoided so that
+	 * that the allocated request's elevator_private pointers are not
+	 * initialized and that space can be used by flush request data.
+	 */
+	if (!flush)
+		rw_flags |= REQ_SORTED;
+
+	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-03 13:24               ` [PATCH v3 " Jens Axboe
  2011-02-03 13:38                 ` Tejun Heo
@ 2011-02-03 14:54                 ` Mike Snitzer
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-03 14:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

On Thu, Feb 03 2011 at  8:24am -0500,
Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-02-02 23:55, Mike Snitzer wrote:
> > Set REQ_SORTED, in the @rw_flags passed to the request allocator, for
> > any request that may be put on IO scheduler.  Skip elevator data
> > initialization during request allocation if REQ_SORTED is not set.
> > 
> > REQ_SORTED is not set for flush requests because they are never put on
> > the IO scheduler.
> 
> That looks very wrong. REQ_SORTED gets set _when_ the request is sorted
> into the IO scheduler.

Yes, considerable oversight on my part.

> This is gross misuse, a bad hack.

As my recently posted v4 stated, even without the bug it is still a hack.

My initial version didn't get so cute with overloading flags, etc, e.g:

diff --git a/block/blk-core.c b/block/blk-core.c
index 72dd23b..e098598 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -764,7 +764,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	int may_queue, priv = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +808,14 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	/*
+	 * Skip elevator data initialization for flush requests.
+	 */
+	if (!(bio && (bio->bi_rw & (REQ_FLUSH | REQ_FUA)))) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;


I ran with Tejun's suggestion of overloading @rw_flags when it became
clear he wasn't happy with the above (because it spread FLUSH/FUA
special casing too thin).

Regardless, other ideas for how to convey this info to get_request()
would be appreciated.

Mike

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-03 13:38                 ` Tejun Heo
@ 2011-02-04 15:04                   ` Vivek Goyal
  2011-02-04 15:08                     ` Tejun Heo
  2011-02-04 16:58                     ` [PATCH v5 " Mike Snitzer
  0 siblings, 2 replies; 48+ messages in thread
From: Vivek Goyal @ 2011-02-04 15:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Mike Snitzer, tytso, djwong, shli, neilb,
	adilger.kernel, jack, linux-kernel, kmannth, cmm, linux-ext4,
	rwheeler, hch, josef, jmoyer

On Thu, Feb 03, 2011 at 02:38:20PM +0100, Tejun Heo wrote:
> Hey, Jens.
> 
> On Thu, Feb 03, 2011 at 02:24:42PM +0100, Jens Axboe wrote:
> > On 2011-02-02 23:55, Mike Snitzer wrote:
> > > REQ_SORTED is not set for flush requests because they are never put on
> > > the IO scheduler.
> > 
> > That looks very wrong. REQ_SORTED gets set _when_ the request is sorted
> > into the IO scheduler. This is gross misuse, a bad hack.
> 
> The rationale behind suggesting was that it indicates to the allocator
> that the request may be sorted as how the request will be used is
> communicated using @rw_flags to the allocator.  The patch is buggy
> that the flag actually ends up on the request.  Any better idea how to
> communicate it?

Though you did not like the V1 of patch, personally I also liked just parsing
FLUSH or FUA flag in get_request().

Or how about intoducing a helper function blk_rq_should_init_elevator() 
or something like that and this function will parse FLUSH, FUA flags.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3 1/2] block: skip elevator data initialization for flush requests
  2011-02-04 15:04                   ` Vivek Goyal
@ 2011-02-04 15:08                     ` Tejun Heo
  2011-02-04 16:58                     ` [PATCH v5 " Mike Snitzer
  1 sibling, 0 replies; 48+ messages in thread
From: Tejun Heo @ 2011-02-04 15:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Mike Snitzer, tytso, djwong, shli, neilb,
	adilger.kernel, jack, linux-kernel, kmannth, cmm, linux-ext4,
	rwheeler, hch, josef, jmoyer

Hello,

On Fri, Feb 04, 2011 at 10:04:57AM -0500, Vivek Goyal wrote:
> > The rationale behind suggesting was that it indicates to the allocator
> > that the request may be sorted as how the request will be used is
> > communicated using @rw_flags to the allocator.  The patch is buggy
> > that the flag actually ends up on the request.  Any better idea how to
> > communicate it?
> 
> Though you did not like the V1 of patch, personally I also liked just parsing
> FLUSH or FUA flag in get_request().
> 
> Or how about intoducing a helper function blk_rq_should_init_elevator() 
> or something like that and this function will parse FLUSH, FUA flags.

I suppose it's Jens' call, but if the FLUSH/FUA testing goes inside
the alloc function, please decorate with big fat comment and mention
it in the comment for the union definition too.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v5 1/2] block: skip elevator data initialization for flush requests
  2011-02-04 15:04                   ` Vivek Goyal
  2011-02-04 15:08                     ` Tejun Heo
@ 2011-02-04 16:58                     ` Mike Snitzer
  1 sibling, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2011-02-04 16:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Vivek Goyal, tytso, djwong, shli, neilb,
	adilger.kernel, jack, linux-kernel, kmannth, cmm, linux-ext4,
	rwheeler, hch, josef, jmoyer

Skip elevator initialization for flush requests by passing priv=0 to
blk_alloc_request() in get_request().  As such elv_set_request() is
never called for flush requests.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c |   29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

v5: introduces blk_rq_should_init_elevator() to abstract the elevator
    initialization decision to promote future reuse and clarity.

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -753,6 +753,25 @@ static void freed_request(struct request
 }
 
 /*
+ * Determine if elevator data should be initialized when allocating the
+ * request associated with @bio.
+ */
+static bool blk_rq_should_init_elevator(struct bio *bio)
+{
+	if (!bio)
+		return true;
+
+	/*
+	 * Flush requests do not use the elevator so skip initialization.
+	 * This allows a request to share the flush and elevator data.
+	 */
+	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA))
+		return false;
+
+	return true;
+}
+
+/*
  * Get a free request, queue_lock must be held.
  * Returns NULL on failure, with queue_lock held.
  * Returns !NULL on success, with queue_lock *not held*.
@@ -764,7 +783,7 @@ static struct request *get_request(struc
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv;
+	int may_queue, priv = 0;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -808,9 +827,11 @@ static struct request *get_request(struc
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-	if (priv)
-		rl->elvpriv++;
+	if (blk_rq_should_init_elevator(bio)) {
+		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
+		if (priv)
+			rl->elvpriv++;
+	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/2] block: share request flush fields with  elevator_private
  2011-02-01 22:46             ` Mike Snitzer
                               ` (2 preceding siblings ...)
  (?)
@ 2011-02-11 10:08             ` Jens Axboe
  -1 siblings, 0 replies; 48+ messages in thread
From: Jens Axboe @ 2011-02-11 10:08 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Tejun Heo, tytso, djwong, shli, neilb, adilger.kernel, jack,
	linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch, josef,
	jmoyer, vgoyal

On 2011-02-01 23:46, Mike Snitzer wrote:
> Flush requests are never put on the IO scheduler.  Convert request
> structure's elevator_private* into an array and have the flush fields
> share a union with it.
> 
> Reclaim the space lost in 'struct request' by moving 'completion_data'
> back in the union with 'rb_node'.

Thanks Mike, I've merged this one and v5 1/2 as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge
  2011-01-25 11:39           ` Jens Axboe
@ 2011-03-23 23:37             ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2011-03-23 23:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, Vivek Goyal, tytso, shli, neilb, adilger.kernel, jack,
	snitzer, linux-kernel, kmannth, cmm, linux-ext4, rwheeler, hch,
	josef

On Tue, Jan 25, 2011 at 12:39:24PM +0100, Jens Axboe wrote:
> On 2011-01-25 11:21, Tejun Heo wrote:
> > Hello, Darrick.
> > 
> > On Mon, Jan 24, 2011 at 12:31:55PM -0800, Darrick J. Wong wrote:
> >>> So, I think it's better to start with something simple and improve it
> >>> with actual testing.  If the current simple implementation can match
> >>> Darrick's previous numbers, let's first settle the mechanisms.  We can
> >>
> >> Yep, the fsync-happy numbers more or less match... at least for 2.6.37:
> >> http://tinyurl.com/4q2xeao
> > 
> > Good to hear.  Thanks for the detailed testing.
> > 
> >> I'll give 2.6.38-rc2 a try later, though -rc1 didn't boot for me, so these
> >> numbers are based on a backport to .37. :(
> > 
> > Well, there hasn' been any change in the area during the merge window
> > anyway, so I think testing on 2.6.37 should be fine.
> > 
> >>> I don't really think we should design the whole thing around broken
> >>> devices which incorrectly report writeback cache when it need not.
> >>> The correct place to work around that is during device identification
> >>> not in the flush logic.
> >>
> >> elm3a4_sas and elm3c71_extsas advertise writeback cache yet the
> >> flush completion times are suspiciously low.  I suppose it could be
> >> useful to disable flushes to squeeze out that last bit of
> >> performance, though I don't know how one goes about querying the
> >> disk array to learn if there's a battery behind the cache.  I guess
> >> the current mechanism (admin knob that picks a safe default) is good
> >> enough.
> > 
> > Yeap, that or a blacklist of devices which lie.
> > 
> > Jens, what do you think?  If you don't object, let's put this through
> > linux-next.
> 
> I like the approach, I'll queue it up for 2.6.39.

Is this patch set still on the merge list for 2.6.39?

--D
> 
> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2011-03-23 23:37 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-21 15:59 [PATCHSET] block: reimplement FLUSH/FUA to support merge Tejun Heo
2011-01-21 15:59 ` Tejun Heo
2011-01-21 15:59 ` [PATCH 1/3] block: add REQ_FLUSH_SEQ Tejun Heo
2011-01-21 15:59   ` Tejun Heo
2011-01-21 15:59 ` [PATCH 2/3] block: improve flush bio completion Tejun Heo
2011-01-21 15:59   ` Tejun Heo
2011-01-21 15:59 ` [PATCH 3/3] block: reimplement FLUSH/FUA to support merge Tejun Heo
2011-01-21 18:56   ` Vivek Goyal
2011-01-21 19:19     ` Vivek Goyal
2011-01-23 10:25     ` Tejun Heo
2011-01-23 10:29       ` Tejun Heo
2011-01-24 20:31       ` Darrick J. Wong
2011-01-25 10:21         ` Tejun Heo
2011-01-25 11:39           ` Jens Axboe
2011-03-23 23:37             ` Darrick J. Wong
2011-01-25 22:56           ` Darrick J. Wong
2011-01-22  0:49   ` Mike Snitzer
2011-01-23 10:31     ` Tejun Heo
2011-01-25 20:46       ` Vivek Goyal
2011-01-25 21:04         ` Mike Snitzer
2011-01-23 10:48   ` [PATCH UPDATED " Tejun Heo
2011-01-23 10:48     ` Tejun Heo
2011-01-25 20:41   ` [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests Mike Snitzer
2011-01-25 20:41     ` Mike Snitzer
2011-01-25 21:55     ` Mike Snitzer
2011-01-26  5:27       ` [RFC PATCH 4/3] block: skip elevator initialization for flush requests -- was never BUGGY relative to upstream Mike Snitzer
2011-01-26 10:03     ` [KNOWN BUGGY RFC PATCH 4/3] block: skip elevator initialization for flush requests Tejun Heo
2011-01-26 10:05       ` Tejun Heo
2011-02-01 17:38       ` [RFC " Mike Snitzer
2011-02-01 18:52         ` Tejun Heo
2011-02-01 22:46           ` [PATCH v2 1/2] " Mike Snitzer
2011-02-02 21:51             ` Vivek Goyal
2011-02-02 22:06               ` Mike Snitzer
2011-02-02 22:55             ` [PATCH v3 1/2] block: skip elevator data " Mike Snitzer
2011-02-03  9:28               ` Tejun Heo
2011-02-03 14:48                 ` [PATCH v4 " Mike Snitzer
2011-02-03 13:24               ` [PATCH v3 " Jens Axboe
2011-02-03 13:38                 ` Tejun Heo
2011-02-04 15:04                   ` Vivek Goyal
2011-02-04 15:08                     ` Tejun Heo
2011-02-04 16:58                     ` [PATCH v5 " Mike Snitzer
2011-02-03 14:54                 ` [PATCH v3 " Mike Snitzer
2011-02-01 22:46           ` [PATCH v2 2/2] block: share request flush fields with elevator_private Mike Snitzer
2011-02-01 22:46             ` Mike Snitzer
2011-02-02 21:52             ` Vivek Goyal
2011-02-03  9:24             ` Tejun Heo
2011-02-11 10:08             ` Jens Axboe
2011-01-21 15:59 ` [PATCH 3/3] block: reimplement FLUSH/FUA to support merge Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.