* [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2
From: Tejun Heo @ 2010-08-30  9:58 UTC
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

This is the second take of block-dm-finish-REQ_FLUSH-FUA-conversion.
I've put the new patches on top of the previous ones for easier
review.  The series can be trivially reordered so that the order is
more logical.  Jens, please let me know if you want it reordered.

The dm conversion is _lightly_ tested.  Please proceed with caution.
Let's hold off merging these bits until dm people can verify the
conversion is correct.

  0001-block-make-__blk_rq_prep_clone-copy-most-command-fla.patch
  0002-dm-implement-REQ_FLUSH-FUA-support-for-bio-based-dm.patch
  0003-dm-relax-ordering-of-bio-based-flush-implementation.patch
  0004-dm-implement-REQ_FLUSH-FUA-support-for-request-based.patch
  0005-block-remove-the-WRITE_BARRIER-flag.patch

Differences from the previous attempt[1] are:

* The bio-based dm and request-based dm patches are split.  0002-0003
  convert bio-based dm and 0004 converts request-based dm.

* The previous request-based dm conversion was broken in the way it
  sequenced requests.  The new version rips out the special
  multi-target handling for flushes and handles flushes the same way
  as other requests.

This patchset is on top of the "block, fs: replace HARDBARRIER with
FLUSH/FUA" patchset[2], is available in the following git tree

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

and contains the following changes.

 block/blk-core.c                |    4 
 drivers/md/dm-crypt.c           |    2 
 drivers/md/dm-io.c              |   20 --
 drivers/md/dm-log.c             |    2 
 drivers/md/dm-raid1.c           |    8 
 drivers/md/dm-region-hash.c     |   16 -
 drivers/md/dm-snap-persistent.c |    2 
 drivers/md/dm-snap.c            |    6 
 drivers/md/dm-stripe.c          |    2 
 drivers/md/dm.c                 |  394 ++++++++--------------------------------
 include/linux/blk_types.h       |    1 
 include/linux/fs.h              |    3 
 12 files changed, 104 insertions(+), 356 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.raid/29344
[2] http://thread.gmane.org/gmane.linux.kernel/1022363

* [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags
From: Tejun Heo @ 2010-08-30  9:58 UTC
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid
  Cc: Tejun Heo

Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
There's no reason to omit the other command flags, and REQ_FUA needs
to be copied to implement FUA support in request-based dm.

REQ_COMMON_MASK, which specifies the flags to be copied from bio to
request, already identifies all the command flags.  Define
REQ_CLONE_MASK to be the same as REQ_COMMON_MASK for clarity and make
__blk_rq_prep_clone() copy all the flags in the mask.
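
For illustration, here is a minimal sketch (not part of the patch) of
what survives cloning before and after this change.  The two helpers
are hypothetical; the flag macros are the 2.6.36-era ones from
include/linux/blk_types.h.

/* Sketch only, not from the patch: cmd_flags of a cloned request. */
static unsigned int old_clone_flags(struct request *src)
{
	/* only the data direction (and REQ_DISCARD) survived, so
	 * REQ_FUA, REQ_FLUSH, REQ_SYNC etc. were silently dropped */
	unsigned int flags = rq_data_dir(src) | REQ_NOMERGE;

	if (src->cmd_flags & REQ_DISCARD)
		flags |= REQ_DISCARD;
	return flags;
}

static unsigned int new_clone_flags(struct request *src)
{
	/* REQ_CLONE_MASK == REQ_COMMON_MASK, so REQ_FUA and the rest of
	 * the common command flags now reach the clone as well */
	return (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
}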

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c          |    4 +---
 include/linux/blk_types.h |    1 +
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 495bdc4..2a5b192 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2505,9 +2505,7 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
 static void __blk_rq_prep_clone(struct request *dst, struct request *src)
 {
 	dst->cpu = src->cpu;
-	dst->cmd_flags = (rq_data_dir(src) | REQ_NOMERGE);
-	if (src->cmd_flags & REQ_DISCARD)
-		dst->cmd_flags |= REQ_DISCARD;
+	dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
 	dst->cmd_type = src->cmd_type;
 	dst->__sector = blk_rq_pos(src);
 	dst->__data_len = blk_rq_bytes(src);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1797994..36edadf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,6 +168,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
 	 REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
+#define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
-- 
1.7.1


* [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
From: Tejun Heo @ 2010-08-30  9:58 UTC
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid
  Cc: Tejun Heo, dm-devel

This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
the now deprecated REQ_HARDBARRIER.

* -EOPNOTSUPP handling logic dropped.

* Preflush is handled as before, but postflush is dropped and replaced
  with passing REQ_FUA down to the member request_queues.  This
  replaces one array-wide cache flush with member-specific FUA writes
  (see the sketch after this list).

* __split_and_process_bio() now calls __clone_and_map_flush() directly
  for flushes and guarantees that all FLUSH bios going to targets are
  zero length.

* It's now guaranteed that all FLUSH bios which are passed on to dm
  targets are zero length.  bio_empty_barrier() tests are replaced
  with REQ_FLUSH tests.

* Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.

* Dropped unlikely() around REQ_FLUSH tests.  Flushes are not unlikely
  enough to be marked with unlikely().

* The block layer now filters out REQ_FLUSH/FUA bios if the
  request_queue doesn't support cache flushing, so advertise
  REQ_FLUSH | REQ_FUA capability.

* Request-based dm isn't converted yet.  dm_init_request_based_queue()
  resets flush support to 0 for now.  To avoid disturbing the
  request-based dm code, dm->flush_error is added for bio-based dm
  while request-based dm continues to use dm->barrier_error.
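
The resulting flow for a REQ_FLUSH write can be sketched as below.
This is a simplification of process_flush() in the dm.c hunk further
down (it omits the DM_ENDIO_REQUEUE requeue path), and
issue_preflush() is a hypothetical shorthand for cloning a zero-length
REQ_FLUSH bio to every target via __split_and_process_bio().

/* Sketch only, see process_flush() in the actual patch below. */
static void sketch_flush_flow(struct mapped_device *md, struct bio *bio)
{
	/* 1. preflush: flush every member device's write cache */
	issue_preflush(md);				/* hypothetical helper */
	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

	/* 2. empty flush, or the preflush failed: complete right here */
	if (!bio_has_data(bio) || md->flush_error) {
		bio_endio(bio, md->flush_error);
		return;
	}

	/* 3. data phase: clear REQ_FLUSH and pass the bio (with REQ_FUA,
	 * if set) down to the targets; no array-wide postflush needed */
	bio->bi_rw &= ~REQ_FLUSH;
	__split_and_process_bio(md, bio);
}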

Lightly tested linear, stripe, raid1, snap and crypt targets.  Please
proceed with caution as I'm not familiar with the code base.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Christoph Hellwig <hch@lst.de>
---
 drivers/md/dm-crypt.c           |    2 +-
 drivers/md/dm-io.c              |   20 +-----
 drivers/md/dm-log.c             |    2 +-
 drivers/md/dm-raid1.c           |    8 +-
 drivers/md/dm-region-hash.c     |   16 +++---
 drivers/md/dm-snap-persistent.c |    2 +-
 drivers/md/dm-snap.c            |    6 +-
 drivers/md/dm-stripe.c          |    2 +-
 drivers/md/dm.c                 |  119 +++++++++++++++++++--------------------
 9 files changed, 80 insertions(+), 97 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 368e8e9..d5b0e4c 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1278,7 +1278,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio,
 	struct dm_crypt_io *io;
 	struct crypt_config *cc;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		cc = ti->private;
 		bio->bi_bdev = cc->dev->bdev;
 		return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 0590c75..136d4f7 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
  */
 struct io {
 	unsigned long error_bits;
-	unsigned long eopnotsupp_bits;
 	atomic_t count;
 	struct task_struct *sleeper;
 	struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_bio(struct bio *bio, struct io **io,
  *---------------------------------------------------------------*/
 static void dec_count(struct io *io, unsigned int region, int error)
 {
-	if (error) {
+	if (error)
 		set_bit(region, &io->error_bits);
-		if (error == -EOPNOTSUPP)
-			set_bit(region, &io->eopnotsupp_bits);
-	}
 
 	if (atomic_dec_and_test(&io->count)) {
 		if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
 	sector_t remaining = where->count;
 
 	/*
-	 * where->count may be zero if rw holds a write barrier and we
-	 * need to send a zero-sized barrier.
+	 * where->count may be zero if rw holds a flush and we need to
+	 * send a zero-sized flush.
 	 */
 	do {
 		/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned int num_regions,
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}
 
@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
 		return -EIO;
 	}
 
-retry:
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = current;
 	io->client = client;
@@ -412,11 +406,6 @@ retry:
 	}
 	set_current_state(TASK_RUNNING);
 
-	if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
-		rw &= ~REQ_HARDBARRIER;
-		goto retry;
-	}
-
 	if (error_bits)
 		*error_bits = io->error_bits;
 
@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client *client, unsigned int num_regions,
 
 	io = mempool_alloc(client->pool, GFP_NOIO);
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = NULL;
 	io->client = client;
diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c
index 5a08be0..33420e6 100644
--- a/drivers/md/dm-log.c
+++ b/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc)
 		.count = 0,
 	};
 
-	lc->io_req.bi_rw = WRITE_BARRIER;
+	lc->io_req.bi_rw = WRITE_FLUSH;
 
 	return dm_io(&lc->io_req, 1, &null_location, NULL);
 }
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 7c081bc..19a59b0 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target *ti)
 	struct dm_io_region io[ms->nr_mirrors];
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE_BARRIER,
+		.bi_rw = WRITE_FLUSH,
 		.mem.type = DM_IO_KMEM,
 		.mem.ptr.bvec = NULL,
 		.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *ms, struct bio *bio)
 	struct dm_io_region io[ms->nr_mirrors], *dest = io;
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+		.bi_rw = WRITE | (bio->bi_rw & WRITE_FLUSH_FUA),
 		.mem.type = DM_IO_BVEC,
 		.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
 		.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
 	bio_list_init(&requeue);
 
 	while ((bio = bio_list_pop(writes))) {
-		if (unlikely(bio_empty_barrier(bio))) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			bio_list_add(&sync, bio);
 			continue;
 		}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 	 * We need to dec pending if this was a write.
 	 */
 	if (rw == WRITE) {
-		if (likely(!bio_empty_barrier(bio)))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			dm_rh_dec(ms->rh, map_context->ll);
 		return error;
 	}
diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
index bd5c58b..dad011a 100644
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -81,9 +81,9 @@ struct dm_region_hash {
 	struct list_head failed_recovered_regions;
 
 	/*
-	 * If there was a barrier failure no regions can be marked clean.
+	 * If there was a flush failure no regions can be marked clean.
 	 */
-	int barrier_failure;
+	int flush_failure;
 
 	void *context;
 	sector_t target_begin;
@@ -217,7 +217,7 @@ struct dm_region_hash *dm_region_hash_create(
 	INIT_LIST_HEAD(&rh->quiesced_regions);
 	INIT_LIST_HEAD(&rh->recovered_regions);
 	INIT_LIST_HEAD(&rh->failed_recovered_regions);
-	rh->barrier_failure = 0;
+	rh->flush_failure = 0;
 
 	rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS,
 						      sizeof(struct dm_region));
@@ -399,8 +399,8 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
 	region_t region = dm_rh_bio_to_region(rh, bio);
 	int recovering = 0;
 
-	if (bio_empty_barrier(bio)) {
-		rh->barrier_failure = 1;
+	if (bio->bi_rw & REQ_FLUSH) {
+		rh->flush_failure = 1;
 		return;
 	}
 
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
 	struct bio *bio;
 
 	for (bio = bios->head; bio; bio = bio->bi_next) {
-		if (bio_empty_barrier(bio))
+		if (bio->bi_rw & REQ_FLUSH)
 			continue;
 		rh_inc(rh, dm_rh_bio_to_region(rh, bio));
 	}
@@ -555,9 +555,9 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region)
 		 */
 
 		/* do nothing for DM_RH_NOSYNC */
-		if (unlikely(rh->barrier_failure)) {
+		if (unlikely(rh->flush_failure)) {
 			/*
-			 * If a write barrier failed some time ago, we
+			 * If a write flush failed some time ago, we
 			 * don't know whether or not this write made it
 			 * to the disk, so we must resync the device.
 			 */
diff --git a/drivers/md/dm-snap-persistent.c b/drivers/md/dm-snap-persistent.c
index cc2bdb8..0b61792 100644
--- a/drivers/md/dm-snap-persistent.c
+++ b/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(struct dm_exception_store *store,
 	/*
 	 * Commit exceptions to disk.
 	 */
-	if (ps->valid && area_io(ps, WRITE_BARRIER))
+	if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
 		ps->valid = 0;
 
 	/*
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 5974d30..eed2101 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1587,7 +1587,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		bio->bi_bdev = s->cow->bdev;
 		return DM_MAPIO_REMAPPED;
 	}
@@ -1691,7 +1691,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio,
 	int r = DM_MAPIO_REMAPPED;
 	chunk_t chunk;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		if (!map_context->target_request_nr)
 			bio->bi_bdev = s->origin->bdev;
 		else
@@ -2135,7 +2135,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
 	struct dm_dev *dev = ti->private;
 	bio->bi_bdev = dev->bdev;
 
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio->bi_rw & REQ_FLUSH)
 		return DM_MAPIO_REMAPPED;
 
 	/* Only tell snapshots if this is a write */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c297f6d..f0371b4 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -271,7 +271,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio,
 	uint32_t stripe;
 	unsigned target_request_nr;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		target_request_nr = map_context->target_request_nr;
 		BUG_ON(target_request_nr >= sc->stripes);
 		bio->bi_bdev = sc->stripe[target_request_nr].dev->bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b1d92be..32e6622 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -144,15 +144,16 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the barrier request currently being processed.
+	 * An error from the flush request currently being processed.
 	 */
-	int barrier_error;
+	int flush_error;
 
 	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
 	spinlock_t barrier_error_lock;
+	int barrier_error;
 
 	/*
 	 * Processing queue (flush/barriers)
@@ -200,8 +201,8 @@ struct mapped_device {
 	/* sysfs handle */
 	struct kobject kobj;
 
-	/* zero-length barrier that will be cloned and submitted to targets */
-	struct bio barrier_bio;
+	/* zero-length flush that will be cloned and submitted to targets */
+	struct bio flush_bio;
 };
 
 /*
@@ -512,7 +513,7 @@ static void end_io_acct(struct dm_io *io)
 
 	/*
 	 * After this is decremented the bio must not be touched if it is
-	 * a barrier.
+	 * a flush.
 	 */
 	dm_disk(md)->part0.in_flight[rw] = pending =
 		atomic_dec_return(&md->pending[rw]);
@@ -626,7 +627,7 @@ static void dec_pending(struct dm_io *io, int error)
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
 			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+				if (!(io->bio->bi_rw & REQ_FLUSH))
 					bio_list_add_head(&md->deferred,
 							  io->bio);
 			} else
@@ -638,20 +639,14 @@ static void dec_pending(struct dm_io *io, int error)
 		io_error = io->error;
 		bio = io->bio;
 
-		if (bio->bi_rw & REQ_HARDBARRIER) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			/*
-			 * There can be just one barrier request so we use
+			 * There can be just one flush request so we use
 			 * a per-device variable for error reporting.
 			 * Note that you can't touch the bio after end_io_acct
-			 *
-			 * We ignore -EOPNOTSUPP for empty flush reported by
-			 * underlying devices. We assume that if the device
-			 * doesn't support empty barriers, it doesn't need
-			 * cache flushing commands.
 			 */
-			if (!md->barrier_error &&
-			    !(bio_empty_barrier(bio) && io_error == -EOPNOTSUPP))
-				md->barrier_error = io_error;
+			if (!md->flush_error)
+				md->flush_error = io_error;
 			end_io_acct(io);
 			free_io(md, io);
 		} else {
@@ -1119,7 +1114,7 @@ static void dm_bio_destructor(struct bio *bio)
 }
 
 /*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that just does part of a bvec.
  */
 static struct bio *split_bvec(struct bio *bio, sector_t sector,
 			      unsigned short idx, unsigned int offset,
@@ -1134,7 +1129,7 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
 
 	clone->bi_sector = sector;
 	clone->bi_bdev = bio->bi_bdev;
-	clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+	clone->bi_rw = bio->bi_rw;
 	clone->bi_vcnt = 1;
 	clone->bi_size = to_bytes(len);
 	clone->bi_io_vec->bv_offset = offset;
@@ -1161,7 +1156,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
 
 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_rw &= ~REQ_HARDBARRIER;
 	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
@@ -1225,7 +1219,7 @@ static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti,
 		__issue_target_request(ci, ti, request_nr, len);
 }
 
-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
 {
 	unsigned target_nr = 0;
 	struct dm_target *ti;
@@ -1289,9 +1283,6 @@ static int __clone_and_map(struct clone_info *ci)
 	sector_t len = 0, max;
 	struct dm_target_io *tio;
 
-	if (unlikely(bio_empty_barrier(bio)))
-		return __clone_and_map_empty_barrier(ci);
-
 	if (unlikely(bio->bi_rw & REQ_DISCARD))
 		return __clone_and_map_discard(ci);
 
@@ -1383,11 +1374,11 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			bio_io_error(bio);
 		else
-			if (!md->barrier_error)
-				md->barrier_error = -EIO;
+			if (!md->flush_error)
+				md->flush_error = -EIO;
 		return;
 	}
 
@@ -1400,14 +1391,22 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (!(bio->bi_rw & REQ_FLUSH))
+		ci.sector_count = bio_sectors(bio);
+	else {
+		/* all FLUSH bio's reaching here should be empty */
+		WARN_ON_ONCE(bio_has_data(bio));
 		ci.sector_count = 1;
+	}
 	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
-	while (ci.sector_count && !error)
-		error = __clone_and_map(&ci);
+	while (ci.sector_count && !error) {
+		if (!(bio->bi_rw & REQ_FLUSH))
+			error = __clone_and_map(&ci);
+		else
+			error = __clone_and_map_flush(&ci);
+	}
 
 	/* drop the extra reference count */
 	dec_pending(ci.io, error);
@@ -1492,11 +1491,11 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_unlock();
 
 	/*
-	 * If we're suspended or the thread is processing barriers
+	 * If we're suspended or the thread is processing flushes
 	 * we have to queue this io for later.
 	 */
 	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	    (bio->bi_rw & REQ_FLUSH)) {
 		up_read(&md->io_lock);
 
 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1940,6 +1939,7 @@ static void dm_init_md_queue(struct mapped_device *md)
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
+	blk_queue_flush(md->queue, REQ_FLUSH | REQ_FUA);
 }
 
 /*
@@ -2245,7 +2245,8 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_flush(md->queue, REQ_FLUSH);
+	/* no flush support for request based dm yet */
+	blk_queue_flush(md->queue, 0);
 
 	elv_register_queue(md->queue);
 
@@ -2406,41 +2407,35 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
 {
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->barrier_bio);
-	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
-	__split_and_process_bio(md, &md->barrier_bio);
+	md->flush_error = 0;
 
+	/* handle REQ_FLUSH */
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
 
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
-	md->barrier_error = 0;
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+	__split_and_process_bio(md, &md->flush_bio);
 
-	dm_flush(md);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
 
-	if (!bio_empty_barrier(bio)) {
-		__split_and_process_bio(md, bio);
-		/*
-		 * If the request isn't supported, don't waste time with
-		 * the second flush.
-		 */
-		if (md->barrier_error != -EOPNOTSUPP)
-			dm_flush(md);
+	/* if it's an empty flush or the preflush failed, we're done */
+	if (!bio_has_data(bio) || md->flush_error) {
+		if (md->flush_error != DM_ENDIO_REQUEUE)
+			bio_endio(bio, md->flush_error);
+		else {
+			spin_lock_irq(&md->deferred_lock);
+			bio_list_add_head(&md->deferred, bio);
+			spin_unlock_irq(&md->deferred_lock);
+		}
+		return;
 	}
 
-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
-	else {
-		spin_lock_irq(&md->deferred_lock);
-		bio_list_add_head(&md->deferred, bio);
-		spin_unlock_irq(&md->deferred_lock);
-	}
+	/* issue data + REQ_FUA */
+	bio->bi_rw &= ~REQ_FLUSH;
+	__split_and_process_bio(md, bio);
 }
 
 /*
@@ -2469,8 +2464,8 @@ static void dm_wq_work(struct work_struct *work)
 		if (dm_request_based(md))
 			generic_make_request(c);
 		else {
-			if (c->bi_rw & REQ_HARDBARRIER)
-				process_barrier(md, c);
+			if (c->bi_rw & REQ_FLUSH)
+				process_flush(md, c);
 			else
 				__split_and_process_bio(md, c);
 		}
-- 
1.7.1


 
 	/*
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 5974d30..eed2101 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1587,7 +1587,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		bio->bi_bdev = s->cow->bdev;
 		return DM_MAPIO_REMAPPED;
 	}
@@ -1691,7 +1691,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio,
 	int r = DM_MAPIO_REMAPPED;
 	chunk_t chunk;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		if (!map_context->target_request_nr)
 			bio->bi_bdev = s->origin->bdev;
 		else
@@ -2135,7 +2135,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
 	struct dm_dev *dev = ti->private;
 	bio->bi_bdev = dev->bdev;
 
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio->bi_rw & REQ_FLUSH)
 		return DM_MAPIO_REMAPPED;
 
 	/* Only tell snapshots if this is a write */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c297f6d..f0371b4 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -271,7 +271,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio,
 	uint32_t stripe;
 	unsigned target_request_nr;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		target_request_nr = map_context->target_request_nr;
 		BUG_ON(target_request_nr >= sc->stripes);
 		bio->bi_bdev = sc->stripe[target_request_nr].dev->bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b1d92be..32e6622 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -144,15 +144,16 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the barrier request currently being processed.
+	 * An error from the flush request currently being processed.
 	 */
-	int barrier_error;
+	int flush_error;
 
 	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
 	spinlock_t barrier_error_lock;
+	int barrier_error;
 
 	/*
 	 * Processing queue (flush/barriers)
@@ -200,8 +201,8 @@ struct mapped_device {
 	/* sysfs handle */
 	struct kobject kobj;
 
-	/* zero-length barrier that will be cloned and submitted to targets */
-	struct bio barrier_bio;
+	/* zero-length flush that will be cloned and submitted to targets */
+	struct bio flush_bio;
 };
 
 /*
@@ -512,7 +513,7 @@ static void end_io_acct(struct dm_io *io)
 
 	/*
 	 * After this is decremented the bio must not be touched if it is
-	 * a barrier.
+	 * a flush.
 	 */
 	dm_disk(md)->part0.in_flight[rw] = pending =
 		atomic_dec_return(&md->pending[rw]);
@@ -626,7 +627,7 @@ static void dec_pending(struct dm_io *io, int error)
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
 			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+				if (!(io->bio->bi_rw & REQ_FLUSH))
 					bio_list_add_head(&md->deferred,
 							  io->bio);
 			} else
@@ -638,20 +639,14 @@ static void dec_pending(struct dm_io *io, int error)
 		io_error = io->error;
 		bio = io->bio;
 
-		if (bio->bi_rw & REQ_HARDBARRIER) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			/*
-			 * There can be just one barrier request so we use
+			 * There can be just one flush request so we use
 			 * a per-device variable for error reporting.
 			 * Note that you can't touch the bio after end_io_acct
-			 *
-			 * We ignore -EOPNOTSUPP for empty flush reported by
-			 * underlying devices. We assume that if the device
-			 * doesn't support empty barriers, it doesn't need
-			 * cache flushing commands.
 			 */
-			if (!md->barrier_error &&
-			    !(bio_empty_barrier(bio) && io_error == -EOPNOTSUPP))
-				md->barrier_error = io_error;
+			if (!md->flush_error)
+				md->flush_error = io_error;
 			end_io_acct(io);
 			free_io(md, io);
 		} else {
@@ -1119,7 +1114,7 @@ static void dm_bio_destructor(struct bio *bio)
 }
 
 /*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that just does part of a bvec.
  */
 static struct bio *split_bvec(struct bio *bio, sector_t sector,
 			      unsigned short idx, unsigned int offset,
@@ -1134,7 +1129,7 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
 
 	clone->bi_sector = sector;
 	clone->bi_bdev = bio->bi_bdev;
-	clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+	clone->bi_rw = bio->bi_rw;
 	clone->bi_vcnt = 1;
 	clone->bi_size = to_bytes(len);
 	clone->bi_io_vec->bv_offset = offset;
@@ -1161,7 +1156,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
 
 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_rw &= ~REQ_HARDBARRIER;
 	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
@@ -1225,7 +1219,7 @@ static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti,
 		__issue_target_request(ci, ti, request_nr, len);
 }
 
-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
 {
 	unsigned target_nr = 0;
 	struct dm_target *ti;
@@ -1289,9 +1283,6 @@ static int __clone_and_map(struct clone_info *ci)
 	sector_t len = 0, max;
 	struct dm_target_io *tio;
 
-	if (unlikely(bio_empty_barrier(bio)))
-		return __clone_and_map_empty_barrier(ci);
-
 	if (unlikely(bio->bi_rw & REQ_DISCARD))
 		return __clone_and_map_discard(ci);
 
@@ -1383,11 +1374,11 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			bio_io_error(bio);
 		else
-			if (!md->barrier_error)
-				md->barrier_error = -EIO;
+			if (!md->flush_error)
+				md->flush_error = -EIO;
 		return;
 	}
 
@@ -1400,14 +1391,22 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (!(bio->bi_rw & REQ_FLUSH))
+		ci.sector_count = bio_sectors(bio);
+	else {
+		/* all FLUSH bio's reaching here should be empty */
+		WARN_ON_ONCE(bio_has_data(bio));
 		ci.sector_count = 1;
+	}
 	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
-	while (ci.sector_count && !error)
-		error = __clone_and_map(&ci);
+	while (ci.sector_count && !error) {
+		if (!(bio->bi_rw & REQ_FLUSH))
+			error = __clone_and_map(&ci);
+		else
+			error = __clone_and_map_flush(&ci);
+	}
 
 	/* drop the extra reference count */
 	dec_pending(ci.io, error);
@@ -1492,11 +1491,11 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_unlock();
 
 	/*
-	 * If we're suspended or the thread is processing barriers
+	 * If we're suspended or the thread is processing flushes
 	 * we have to queue this io for later.
 	 */
 	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	    (bio->bi_rw & REQ_FLUSH)) {
 		up_read(&md->io_lock);
 
 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1940,6 +1939,7 @@ static void dm_init_md_queue(struct mapped_device *md)
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
+	blk_queue_flush(md->queue, REQ_FLUSH | REQ_FUA);
 }
 
 /*
@@ -2245,7 +2245,8 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_flush(md->queue, REQ_FLUSH);
+	/* no flush support for request based dm yet */
+	blk_queue_flush(md->queue, 0);
 
 	elv_register_queue(md->queue);
 
@@ -2406,41 +2407,35 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
 {
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->barrier_bio);
-	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
-	__split_and_process_bio(md, &md->barrier_bio);
+	md->flush_error = 0;
 
+	/* handle REQ_FLUSH */
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
 
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
-	md->barrier_error = 0;
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+	__split_and_process_bio(md, &md->flush_bio);
 
-	dm_flush(md);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
 
-	if (!bio_empty_barrier(bio)) {
-		__split_and_process_bio(md, bio);
-		/*
-		 * If the request isn't supported, don't waste time with
-		 * the second flush.
-		 */
-		if (md->barrier_error != -EOPNOTSUPP)
-			dm_flush(md);
+	/* if it's an empty flush or the preflush failed, we're done */
+	if (!bio_has_data(bio) || md->flush_error) {
+		if (md->flush_error != DM_ENDIO_REQUEUE)
+			bio_endio(bio, md->flush_error);
+		else {
+			spin_lock_irq(&md->deferred_lock);
+			bio_list_add_head(&md->deferred, bio);
+			spin_unlock_irq(&md->deferred_lock);
+		}
+		return;
 	}
 
-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
-	else {
-		spin_lock_irq(&md->deferred_lock);
-		bio_list_add_head(&md->deferred, bio);
-		spin_unlock_irq(&md->deferred_lock);
-	}
+	/* issue data + REQ_FUA */
+	bio->bi_rw &= ~REQ_FLUSH;
+	__split_and_process_bio(md, bio);
 }
 
 /*
@@ -2469,8 +2464,8 @@ static void dm_wq_work(struct work_struct *work)
 		if (dm_request_based(md))
 			generic_make_request(c);
 		else {
-			if (c->bi_rw & REQ_HARDBARRIER)
-				process_barrier(md, c);
+			if (c->bi_rw & REQ_FLUSH)
+				process_flush(md, c);
 			else
 				__split_and_process_bio(md, c);
 		}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58 ` Tejun Heo
@ 2010-08-30  9:58   ` Tejun Heo
  -1 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid
  Cc: Tejun Heo

Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's.  This patch relaxes ordering around flushes.

* A flush bio is no longer deferred to workqueue directly.  It's
  processed like other bio's but __split_and_process_bio() uses
  md->flush_bio as the clone source.  md->flush_bio is initialized to
  empty flush during md initialization and shared for all flushes.

* When dec_pending() detects that a flush has completed, it checks
  whether the original bio has data.  If so, the bio is queued to the
  deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.

* As flush sequencing is handled in the usual issue/completion path,
  dm_wq_work() no longer needs to handle flushes differently.  Now its
  only responsibility is re-issuing deferred bio's the same way as
  _dm_request() would.  REQ_FLUSH handling logic including
  process_flush() is dropped.

* There's no reason for queue_io() and dm_wq_work() to write-lock
  dm->io_lock.  queue_io() now only uses md->deferred_lock and
  dm_wq_work() read-locks dm->io_lock.

* bio's no longer need to be queued on the deferred list while a flush
  is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.

This avoids stalling the device during flushes and simplifies the
implementation.
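
For illustration, the relaxed two-phase flow described above can be
modeled as a standalone C sketch.  This is not kernel code: struct
model_bio, the REQ_* values and the printf() calls are simplified
stand-ins for the real bio/bi_rw machinery and the clone/requeue paths.

#include <stdio.h>

#define REQ_FLUSH	(1u << 0)
#define REQ_FUA		(1u << 1)

struct model_bio {
	unsigned int rw;	/* request flags (REQ_FLUSH/REQ_FUA)  */
	unsigned int size;	/* payload size; 0 for an empty flush */
};

/* Phase 1: the preflush.  The caller's bio is not sent to the targets;
 * a shared zero-length flush (md->flush_bio above) is cloned instead. */
static void issue_preflush(void)
{
	printf("issue: empty REQ_FLUSH cloned from md->flush_bio\n");
}

/* Phase 2: what dec_pending() does once the preflush has completed. */
static void preflush_done(struct model_bio *bio)
{
	if (bio->size == 0) {
		/* Empty flush: nothing left to write, complete it. */
		printf("complete: empty flush\n");
		return;
	}
	/* Flush with data: clear REQ_FLUSH and requeue so the payload is
	 * written; REQ_FUA, if set, stays on for the data write. */
	bio->rw &= ~REQ_FLUSH;
	printf("requeue: data write%s\n",
	       (bio->rw & REQ_FUA) ? " + FUA" : "");
}

int main(void)
{
	struct model_bio empty = { .rw = REQ_FLUSH, .size = 0 };
	struct model_bio data  = { .rw = REQ_FLUSH | REQ_FUA, .size = 4096 };

	issue_preflush();
	preflush_done(&empty);

	issue_preflush();
	preflush_done(&data);
	return 0;
}

The only point of the sketch is the decision in preflush_done(): an
empty flush completes immediately, while a flush with data is reissued
without REQ_FLUSH once the preflush has finished.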

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  157 ++++++++++++++++---------------------------------------
 1 files changed, 45 insertions(+), 112 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 32e6622..e67c519 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -110,7 +110,6 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
 #define DMF_FREEING 3
 #define DMF_DELETING 4
 #define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_QUEUE_IO_TO_THREAD 6
 
 /*
  * Work processed by per-device workqueue.
@@ -144,11 +143,6 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the flush request currently being processed.
-	 */
-	int flush_error;
-
-	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
@@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
  */
 static void queue_io(struct mapped_device *md, struct bio *bio)
 {
-	down_write(&md->io_lock);
-
 	spin_lock_irq(&md->deferred_lock);
 	bio_list_add(&md->deferred, bio);
 	spin_unlock_irq(&md->deferred_lock);
-
-	if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
-		queue_work(md->wq, &md->work);
-
-	up_write(&md->io_lock);
+	queue_work(md->wq, &md->work);
 }
 
 /*
@@ -626,11 +614,9 @@ static void dec_pending(struct dm_io *io, int error)
 			 * Target requested pushing back the I/O.
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
-			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_FLUSH))
-					bio_list_add_head(&md->deferred,
-							  io->bio);
-			} else
+			if (__noflush_suspending(md))
+				bio_list_add_head(&md->deferred, io->bio);
+			else
 				/* noflush suspend was interrupted. */
 				io->error = -EIO;
 			spin_unlock_irqrestore(&md->deferred_lock, flags);
@@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
 
 		io_error = io->error;
 		bio = io->bio;
+		end_io_acct(io);
+		free_io(md, io);
+
+		if (io_error == DM_ENDIO_REQUEUE)
+			return;
 
-		if (bio->bi_rw & REQ_FLUSH) {
+		if (!(bio->bi_rw & REQ_FLUSH) || !bio->bi_size) {
+			trace_block_bio_complete(md->queue, bio);
+			bio_endio(bio, io_error);
+		} else {
 			/*
-			 * There can be just one flush request so we use
-			 * a per-device variable for error reporting.
-			 * Note that you can't touch the bio after end_io_acct
+			 * Preflush done for flush with data, reissue
+			 * without REQ_FLUSH.
 			 */
-			if (!md->flush_error)
-				md->flush_error = io_error;
-			end_io_acct(io);
-			free_io(md, io);
-		} else {
-			end_io_acct(io);
-			free_io(md, io);
-
-			if (io_error != DM_ENDIO_REQUEUE) {
-				trace_block_bio_complete(md->queue, bio);
-
-				bio_endio(bio, io_error);
-			}
+			bio->bi_rw &= ~REQ_FLUSH;
+			queue_io(md, bio);
 		}
 	}
 }
@@ -1369,21 +1351,17 @@ static int __clone_and_map(struct clone_info *ci)
  */
 static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 {
+	bool is_flush = bio->bi_rw & REQ_FLUSH;
 	struct clone_info ci;
 	int error = 0;
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_FLUSH))
-			bio_io_error(bio);
-		else
-			if (!md->flush_error)
-				md->flush_error = -EIO;
+		bio_io_error(bio);
 		return;
 	}
 
 	ci.md = md;
-	ci.bio = bio;
 	ci.io = alloc_io(md);
 	ci.io->error = 0;
 	atomic_set(&ci.io->io_count, 1);
@@ -1391,18 +1369,19 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	if (!(bio->bi_rw & REQ_FLUSH))
+	ci.idx = bio->bi_idx;
+
+	if (!is_flush) {
+		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-	else {
-		/* all FLUSH bio's reaching here should be empty */
-		WARN_ON_ONCE(bio_has_data(bio));
+	} else {
+		ci.bio = &ci.md->flush_bio;
 		ci.sector_count = 1;
 	}
-	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
 	while (ci.sector_count && !error) {
-		if (!(bio->bi_rw & REQ_FLUSH))
+		if (!is_flush)
 			error = __clone_and_map(&ci);
 		else
 			error = __clone_and_map_flush(&ci);
@@ -1490,22 +1469,14 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_add(cpu, &dm_disk(md)->part0, sectors[rw], bio_sectors(bio));
 	part_stat_unlock();
 
-	/*
-	 * If we're suspended or the thread is processing flushes
-	 * we have to queue this io for later.
-	 */
-	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    (bio->bi_rw & REQ_FLUSH)) {
+	/* if we're suspended, we have to queue this io for later */
+	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
 		up_read(&md->io_lock);
 
-		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
-		    bio_rw(bio) == READA) {
+		if (bio_rw(bio) != READA)
+			queue_io(md, bio);
+		else
 			bio_io_error(bio);
-			return 0;
-		}
-
-		queue_io(md, bio);
-
 		return 0;
 	}
 
@@ -2015,6 +1986,10 @@ static struct mapped_device *alloc_dev(int minor)
 	if (!md->bdev)
 		goto bad_bdev;
 
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+
 	/* Populate the mapping, nobody knows we exist yet */
 	spin_lock(&_minor_lock);
 	old_md = idr_replace(&_minor_idr, md, minor);
@@ -2407,37 +2382,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void process_flush(struct mapped_device *md, struct bio *bio)
-{
-	md->flush_error = 0;
-
-	/* handle REQ_FLUSH */
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->flush_bio);
-	md->flush_bio.bi_bdev = md->bdev;
-	md->flush_bio.bi_rw = WRITE_FLUSH;
-	__split_and_process_bio(md, &md->flush_bio);
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	/* if it's an empty flush or the preflush failed, we're done */
-	if (!bio_has_data(bio) || md->flush_error) {
-		if (md->flush_error != DM_ENDIO_REQUEUE)
-			bio_endio(bio, md->flush_error);
-		else {
-			spin_lock_irq(&md->deferred_lock);
-			bio_list_add_head(&md->deferred, bio);
-			spin_unlock_irq(&md->deferred_lock);
-		}
-		return;
-	}
-
-	/* issue data + REQ_FUA */
-	bio->bi_rw &= ~REQ_FLUSH;
-	__split_and_process_bio(md, bio);
-}
-
 /*
  * Process the deferred bios
  */
@@ -2447,33 +2391,27 @@ static void dm_wq_work(struct work_struct *work)
 						work);
 	struct bio *c;
 
-	down_write(&md->io_lock);
+	down_read(&md->io_lock);
 
 	while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
 		spin_lock_irq(&md->deferred_lock);
 		c = bio_list_pop(&md->deferred);
 		spin_unlock_irq(&md->deferred_lock);
 
-		if (!c) {
-			clear_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
+		if (!c)
 			break;
-		}
 
-		up_write(&md->io_lock);
+		up_read(&md->io_lock);
 
 		if (dm_request_based(md))
 			generic_make_request(c);
-		else {
-			if (c->bi_rw & REQ_FLUSH)
-				process_flush(md, c);
-			else
-				__split_and_process_bio(md, c);
-		}
+		else
+			__split_and_process_bio(md, c);
 
-		down_write(&md->io_lock);
+		down_read(&md->io_lock);
 	}
 
-	up_write(&md->io_lock);
+	up_read(&md->io_lock);
 }
 
 static void dm_queue_flush(struct mapped_device *md)
@@ -2672,17 +2610,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	 *
 	 * To get all processes out of __split_and_process_bio in dm_request,
 	 * we take the write lock. To prevent any process from reentering
-	 * __split_and_process_bio from dm_request, we set
-	 * DMF_QUEUE_IO_TO_THREAD.
-	 *
-	 * To quiesce the thread (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND
-	 * and call flush_workqueue(md->wq). flush_workqueue will wait until
-	 * dm_wq_work exits and DMF_BLOCK_IO_FOR_SUSPEND will prevent any
-	 * further calls to __split_and_process_bio from dm_wq_work.
+	 * __split_and_process_bio from dm_request and quiesce the thread
+	 * (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
+	 * flush_workqueue(md->wq).
 	 */
 	down_write(&md->io_lock);
 	set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
-	set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
 	up_write(&md->io_lock);
 
 	/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 3/5] dm: relax ordering of bio-based flush implementation
@ 2010-08-30  9:58   ` Tejun Heo
  0 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Tejun Heo

Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's.  This patch relaxes ordering around flushes.

* A flush bio is no longer deferred to workqueue directly.  It's
  processed like other bio's but __split_and_process_bio() uses
  md->flush_bio as the clone source.  md->flush_bio is initialized to
  empty flush during md initialization and shared for all flushes.

* When dec_pending() detects that a flush has completed, it checks
  whether the original bio has data.  If so, the bio is queued to the
  deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.

* As flush sequencing is handled in the usual issue/completion path,
  dm_wq_work() no longer needs to handle flushes differently.  Now its
  only responsibility is re-issuing deferred bio's the same way as
  _dm_request() would.  REQ_FLUSH handling logic including
  process_flush() is dropped.

* There's no reason for queue_io() and dm_wq_work() to write-lock
  dm->io_lock.  queue_io() now only uses md->deferred_lock and
  dm_wq_work() read-locks dm->io_lock.

* bio's no longer need to be queued on the deferred list while a flush
  is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.

This avoids stalling the device during flushes and simplifies the
implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  157 ++++++++++++++++---------------------------------------
 1 files changed, 45 insertions(+), 112 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 32e6622..e67c519 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -110,7 +110,6 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
 #define DMF_FREEING 3
 #define DMF_DELETING 4
 #define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_QUEUE_IO_TO_THREAD 6
 
 /*
  * Work processed by per-device workqueue.
@@ -144,11 +143,6 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the flush request currently being processed.
-	 */
-	int flush_error;
-
-	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
@@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
  */
 static void queue_io(struct mapped_device *md, struct bio *bio)
 {
-	down_write(&md->io_lock);
-
 	spin_lock_irq(&md->deferred_lock);
 	bio_list_add(&md->deferred, bio);
 	spin_unlock_irq(&md->deferred_lock);
-
-	if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
-		queue_work(md->wq, &md->work);
-
-	up_write(&md->io_lock);
+	queue_work(md->wq, &md->work);
 }
 
 /*
@@ -626,11 +614,9 @@ static void dec_pending(struct dm_io *io, int error)
 			 * Target requested pushing back the I/O.
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
-			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_FLUSH))
-					bio_list_add_head(&md->deferred,
-							  io->bio);
-			} else
+			if (__noflush_suspending(md))
+				bio_list_add_head(&md->deferred, io->bio);
+			else
 				/* noflush suspend was interrupted. */
 				io->error = -EIO;
 			spin_unlock_irqrestore(&md->deferred_lock, flags);
@@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
 
 		io_error = io->error;
 		bio = io->bio;
+		end_io_acct(io);
+		free_io(md, io);
+
+		if (io_error == DM_ENDIO_REQUEUE)
+			return;
 
-		if (bio->bi_rw & REQ_FLUSH) {
+		if (!(bio->bi_rw & REQ_FLUSH) || !bio->bi_size) {
+			trace_block_bio_complete(md->queue, bio);
+			bio_endio(bio, io_error);
+		} else {
 			/*
-			 * There can be just one flush request so we use
-			 * a per-device variable for error reporting.
-			 * Note that you can't touch the bio after end_io_acct
+			 * Preflush done for flush with data, reissue
+			 * without REQ_FLUSH.
 			 */
-			if (!md->flush_error)
-				md->flush_error = io_error;
-			end_io_acct(io);
-			free_io(md, io);
-		} else {
-			end_io_acct(io);
-			free_io(md, io);
-
-			if (io_error != DM_ENDIO_REQUEUE) {
-				trace_block_bio_complete(md->queue, bio);
-
-				bio_endio(bio, io_error);
-			}
+			bio->bi_rw &= ~REQ_FLUSH;
+			queue_io(md, bio);
 		}
 	}
 }
@@ -1369,21 +1351,17 @@ static int __clone_and_map(struct clone_info *ci)
  */
 static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 {
+	bool is_flush = bio->bi_rw & REQ_FLUSH;
 	struct clone_info ci;
 	int error = 0;
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_FLUSH))
-			bio_io_error(bio);
-		else
-			if (!md->flush_error)
-				md->flush_error = -EIO;
+		bio_io_error(bio);
 		return;
 	}
 
 	ci.md = md;
-	ci.bio = bio;
 	ci.io = alloc_io(md);
 	ci.io->error = 0;
 	atomic_set(&ci.io->io_count, 1);
@@ -1391,18 +1369,19 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	if (!(bio->bi_rw & REQ_FLUSH))
+	ci.idx = bio->bi_idx;
+
+	if (!is_flush) {
+		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-	else {
-		/* all FLUSH bio's reaching here should be empty */
-		WARN_ON_ONCE(bio_has_data(bio));
+	} else {
+		ci.bio = &ci.md->flush_bio;
 		ci.sector_count = 1;
 	}
-	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
 	while (ci.sector_count && !error) {
-		if (!(bio->bi_rw & REQ_FLUSH))
+		if (!is_flush)
 			error = __clone_and_map(&ci);
 		else
 			error = __clone_and_map_flush(&ci);
@@ -1490,22 +1469,14 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_add(cpu, &dm_disk(md)->part0, sectors[rw], bio_sectors(bio));
 	part_stat_unlock();
 
-	/*
-	 * If we're suspended or the thread is processing flushes
-	 * we have to queue this io for later.
-	 */
-	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    (bio->bi_rw & REQ_FLUSH)) {
+	/* if we're suspended, we have to queue this io for later */
+	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
 		up_read(&md->io_lock);
 
-		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
-		    bio_rw(bio) == READA) {
+		if (bio_rw(bio) != READA)
+			queue_io(md, bio);
+		else
 			bio_io_error(bio);
-			return 0;
-		}
-
-		queue_io(md, bio);
-
 		return 0;
 	}
 
@@ -2015,6 +1986,10 @@ static struct mapped_device *alloc_dev(int minor)
 	if (!md->bdev)
 		goto bad_bdev;
 
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+
 	/* Populate the mapping, nobody knows we exist yet */
 	spin_lock(&_minor_lock);
 	old_md = idr_replace(&_minor_idr, md, minor);
@@ -2407,37 +2382,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void process_flush(struct mapped_device *md, struct bio *bio)
-{
-	md->flush_error = 0;
-
-	/* handle REQ_FLUSH */
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->flush_bio);
-	md->flush_bio.bi_bdev = md->bdev;
-	md->flush_bio.bi_rw = WRITE_FLUSH;
-	__split_and_process_bio(md, &md->flush_bio);
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	/* if it's an empty flush or the preflush failed, we're done */
-	if (!bio_has_data(bio) || md->flush_error) {
-		if (md->flush_error != DM_ENDIO_REQUEUE)
-			bio_endio(bio, md->flush_error);
-		else {
-			spin_lock_irq(&md->deferred_lock);
-			bio_list_add_head(&md->deferred, bio);
-			spin_unlock_irq(&md->deferred_lock);
-		}
-		return;
-	}
-
-	/* issue data + REQ_FUA */
-	bio->bi_rw &= ~REQ_FLUSH;
-	__split_and_process_bio(md, bio);
-}
-
 /*
  * Process the deferred bios
  */
@@ -2447,33 +2391,27 @@ static void dm_wq_work(struct work_struct *work)
 						work);
 	struct bio *c;
 
-	down_write(&md->io_lock);
+	down_read(&md->io_lock);
 
 	while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
 		spin_lock_irq(&md->deferred_lock);
 		c = bio_list_pop(&md->deferred);
 		spin_unlock_irq(&md->deferred_lock);
 
-		if (!c) {
-			clear_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
+		if (!c)
 			break;
-		}
 
-		up_write(&md->io_lock);
+		up_read(&md->io_lock);
 
 		if (dm_request_based(md))
 			generic_make_request(c);
-		else {
-			if (c->bi_rw & REQ_FLUSH)
-				process_flush(md, c);
-			else
-				__split_and_process_bio(md, c);
-		}
+		else
+			__split_and_process_bio(md, c);
 
-		down_write(&md->io_lock);
+		down_read(&md->io_lock);
 	}
 
-	up_write(&md->io_lock);
+	up_read(&md->io_lock);
 }
 
 static void dm_queue_flush(struct mapped_device *md)
@@ -2672,17 +2610,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	 *
 	 * To get all processes out of __split_and_process_bio in dm_request,
 	 * we take the write lock. To prevent any process from reentering
-	 * __split_and_process_bio from dm_request, we set
-	 * DMF_QUEUE_IO_TO_THREAD.
-	 *
-	 * To quiesce the thread (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND
-	 * and call flush_workqueue(md->wq). flush_workqueue will wait until
-	 * dm_wq_work exits and DMF_BLOCK_IO_FOR_SUSPEND will prevent any
-	 * further calls to __split_and_process_bio from dm_wq_work.
+	 * __split_and_process_bio from dm_request and quiesce the thread
+	 * (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
+	 * flush_workqueue(md->wq).
 	 */
 	down_write(&md->io_lock);
 	set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
-	set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
 	up_write(&md->io_lock);
 
 	/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58 ` Tejun Heo
                   ` (6 preceding siblings ...)
  (?)
@ 2010-08-30  9:58 ` Tejun Heo
  -1 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel, linux-fsdevel
  Cc: Tejun Heo

Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's.  This patch relaxes ordering around flushes.

* A flush bio is no longer deferred to workqueue directly.  It's
  processed like other bio's but __split_and_process_bio() uses
  md->flush_bio as the clone source.  md->flush_bio is initialized to
  empty flush during md initialization and shared for all flushes.

* When dec_pending() detects that a flush has completed, it checks
  whether the original bio has data.  If so, the bio is queued to the
  deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.

* As flush sequencing is handled in the usual issue/completion path,
  dm_wq_work() no longer needs to handle flushes differently.  Now its
  only responsibility is re-issuing deferred bio's the same way as
  _dm_request() would.  REQ_FLUSH handling logic including
  process_flush() is dropped.

* There's no reason for queue_io() and dm_wq_work() to write-lock
  dm->io_lock.  queue_io() now only uses md->deferred_lock and
  dm_wq_work() read-locks dm->io_lock.

* bio's no longer need to be queued on the deferred list while a flush
  is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.

This avoids stalling the device during flushes and simplifies the
implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  157 ++++++++++++++++---------------------------------------
 1 files changed, 45 insertions(+), 112 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 32e6622..e67c519 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -110,7 +110,6 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
 #define DMF_FREEING 3
 #define DMF_DELETING 4
 #define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_QUEUE_IO_TO_THREAD 6
 
 /*
  * Work processed by per-device workqueue.
@@ -144,11 +143,6 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the flush request currently being processed.
-	 */
-	int flush_error;
-
-	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
@@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
  */
 static void queue_io(struct mapped_device *md, struct bio *bio)
 {
-	down_write(&md->io_lock);
-
 	spin_lock_irq(&md->deferred_lock);
 	bio_list_add(&md->deferred, bio);
 	spin_unlock_irq(&md->deferred_lock);
-
-	if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
-		queue_work(md->wq, &md->work);
-
-	up_write(&md->io_lock);
+	queue_work(md->wq, &md->work);
 }
 
 /*
@@ -626,11 +614,9 @@ static void dec_pending(struct dm_io *io, int error)
 			 * Target requested pushing back the I/O.
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
-			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_FLUSH))
-					bio_list_add_head(&md->deferred,
-							  io->bio);
-			} else
+			if (__noflush_suspending(md))
+				bio_list_add_head(&md->deferred, io->bio);
+			else
 				/* noflush suspend was interrupted. */
 				io->error = -EIO;
 			spin_unlock_irqrestore(&md->deferred_lock, flags);
@@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
 
 		io_error = io->error;
 		bio = io->bio;
+		end_io_acct(io);
+		free_io(md, io);
+
+		if (io_error == DM_ENDIO_REQUEUE)
+			return;
 
-		if (bio->bi_rw & REQ_FLUSH) {
+		if (!(bio->bi_rw & REQ_FLUSH) || !bio->bi_size) {
+			trace_block_bio_complete(md->queue, bio);
+			bio_endio(bio, io_error);
+		} else {
 			/*
-			 * There can be just one flush request so we use
-			 * a per-device variable for error reporting.
-			 * Note that you can't touch the bio after end_io_acct
+			 * Preflush done for flush with data, reissue
+			 * without REQ_FLUSH.
 			 */
-			if (!md->flush_error)
-				md->flush_error = io_error;
-			end_io_acct(io);
-			free_io(md, io);
-		} else {
-			end_io_acct(io);
-			free_io(md, io);
-
-			if (io_error != DM_ENDIO_REQUEUE) {
-				trace_block_bio_complete(md->queue, bio);
-
-				bio_endio(bio, io_error);
-			}
+			bio->bi_rw &= ~REQ_FLUSH;
+			queue_io(md, bio);
 		}
 	}
 }
@@ -1369,21 +1351,17 @@ static int __clone_and_map(struct clone_info *ci)
  */
 static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 {
+	bool is_flush = bio->bi_rw & REQ_FLUSH;
 	struct clone_info ci;
 	int error = 0;
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_FLUSH))
-			bio_io_error(bio);
-		else
-			if (!md->flush_error)
-				md->flush_error = -EIO;
+		bio_io_error(bio);
 		return;
 	}
 
 	ci.md = md;
-	ci.bio = bio;
 	ci.io = alloc_io(md);
 	ci.io->error = 0;
 	atomic_set(&ci.io->io_count, 1);
@@ -1391,18 +1369,19 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	if (!(bio->bi_rw & REQ_FLUSH))
+	ci.idx = bio->bi_idx;
+
+	if (!is_flush) {
+		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-	else {
-		/* all FLUSH bio's reaching here should be empty */
-		WARN_ON_ONCE(bio_has_data(bio));
+	} else {
+		ci.bio = &ci.md->flush_bio;
 		ci.sector_count = 1;
 	}
-	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
 	while (ci.sector_count && !error) {
-		if (!(bio->bi_rw & REQ_FLUSH))
+		if (!is_flush)
 			error = __clone_and_map(&ci);
 		else
 			error = __clone_and_map_flush(&ci);
@@ -1490,22 +1469,14 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_add(cpu, &dm_disk(md)->part0, sectors[rw], bio_sectors(bio));
 	part_stat_unlock();
 
-	/*
-	 * If we're suspended or the thread is processing flushes
-	 * we have to queue this io for later.
-	 */
-	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    (bio->bi_rw & REQ_FLUSH)) {
+	/* if we're suspended, we have to queue this io for later */
+	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
 		up_read(&md->io_lock);
 
-		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
-		    bio_rw(bio) == READA) {
+		if (bio_rw(bio) != READA)
+			queue_io(md, bio);
+		else
 			bio_io_error(bio);
-			return 0;
-		}
-
-		queue_io(md, bio);
-
 		return 0;
 	}
 
@@ -2015,6 +1986,10 @@ static struct mapped_device *alloc_dev(int minor)
 	if (!md->bdev)
 		goto bad_bdev;
 
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+
 	/* Populate the mapping, nobody knows we exist yet */
 	spin_lock(&_minor_lock);
 	old_md = idr_replace(&_minor_idr, md, minor);
@@ -2407,37 +2382,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void process_flush(struct mapped_device *md, struct bio *bio)
-{
-	md->flush_error = 0;
-
-	/* handle REQ_FLUSH */
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->flush_bio);
-	md->flush_bio.bi_bdev = md->bdev;
-	md->flush_bio.bi_rw = WRITE_FLUSH;
-	__split_and_process_bio(md, &md->flush_bio);
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	/* if it's an empty flush or the preflush failed, we're done */
-	if (!bio_has_data(bio) || md->flush_error) {
-		if (md->flush_error != DM_ENDIO_REQUEUE)
-			bio_endio(bio, md->flush_error);
-		else {
-			spin_lock_irq(&md->deferred_lock);
-			bio_list_add_head(&md->deferred, bio);
-			spin_unlock_irq(&md->deferred_lock);
-		}
-		return;
-	}
-
-	/* issue data + REQ_FUA */
-	bio->bi_rw &= ~REQ_FLUSH;
-	__split_and_process_bio(md, bio);
-}
-
 /*
  * Process the deferred bios
  */
@@ -2447,33 +2391,27 @@ static void dm_wq_work(struct work_struct *work)
 						work);
 	struct bio *c;
 
-	down_write(&md->io_lock);
+	down_read(&md->io_lock);
 
 	while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
 		spin_lock_irq(&md->deferred_lock);
 		c = bio_list_pop(&md->deferred);
 		spin_unlock_irq(&md->deferred_lock);
 
-		if (!c) {
-			clear_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
+		if (!c)
 			break;
-		}
 
-		up_write(&md->io_lock);
+		up_read(&md->io_lock);
 
 		if (dm_request_based(md))
 			generic_make_request(c);
-		else {
-			if (c->bi_rw & REQ_FLUSH)
-				process_flush(md, c);
-			else
-				__split_and_process_bio(md, c);
-		}
+		else
+			__split_and_process_bio(md, c);
 
-		down_write(&md->io_lock);
+		down_read(&md->io_lock);
 	}
 
-	up_write(&md->io_lock);
+	up_read(&md->io_lock);
 }
 
 static void dm_queue_flush(struct mapped_device *md)
@@ -2672,17 +2610,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	 *
 	 * To get all processes out of __split_and_process_bio in dm_request,
 	 * we take the write lock. To prevent any process from reentering
-	 * __split_and_process_bio from dm_request, we set
-	 * DMF_QUEUE_IO_TO_THREAD.
-	 *
-	 * To quiesce the thread (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND
-	 * and call flush_workqueue(md->wq). flush_workqueue will wait until
-	 * dm_wq_work exits and DMF_BLOCK_IO_FOR_SUSPEND will prevent any
-	 * further calls to __split_and_process_bio from dm_wq_work.
+	 * __split_and_process_bio from dm_request and quiesce the thread
+	 * (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
+	 * flush_workqueue(md->wq).
 	 */
 	down_write(&md->io_lock);
 	set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
-	set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
 	up_write(&md->io_lock);
 
 	/*
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30  9:58 ` Tejun Heo
                   ` (8 preceding siblings ...)
  (?)
@ 2010-08-30  9:58 ` Tejun Heo
  -1 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid
  Cc: Tejun Heo

This patch converts request-based dm to support the new REQ_FLUSH/FUA.

The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.

In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special, mostly independent path which could
issue flushes to multiple targets and sequence them.  However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.

This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests.  The only
special treatment is that REQ_FLUSH requests use block address 0 when
finding the target, which is enough for now.
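
For illustration, that lookup rule can be sketched as standalone C.
The struct request/target definitions and the linear table scan are
simplified assumptions, not the real dm_table_find_target()
implementation; only the "flushes map through block address 0" rule is
taken from the patch.

#include <stdio.h>

#define REQ_FLUSH (1u << 0)

typedef unsigned long long sector_t;

struct request {
	unsigned int cmd_flags;
	sector_t pos;
};

struct target {
	sector_t begin;
	sector_t len;
	const char *name;
};

/* Simplified stand-in for dm_table_find_target(): linear scan. */
static const struct target *find_target(const struct target *tbl, int n,
					sector_t pos)
{
	for (int i = 0; i < n; i++)
		if (pos >= tbl[i].begin && pos < tbl[i].begin + tbl[i].len)
			return &tbl[i];
	return NULL;
}

int main(void)
{
	const struct target table[] = {
		{ .begin = 0,    .len = 1024, .name = "linear-0" },
		{ .begin = 1024, .len = 1024, .name = "linear-1" },
	};
	struct request flush = { .cmd_flags = REQ_FLUSH, .pos = 1500 };
	struct request write = { .cmd_flags = 0,         .pos = 1500 };
	const struct target *t;

	/* Always use block 0 to find the target for flushes. */
	t = find_target(table, 2,
			(flush.cmd_flags & REQ_FLUSH) ? 0 : flush.pos);
	printf("flush -> %s\n", t ? t->name : "none");

	t = find_target(table, 2,
			(write.cmd_flags & REQ_FLUSH) ? 0 : write.pos);
	printf("write -> %s\n", t ? t->name : "none");
	return 0;
}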

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  204 ++++++-------------------------------------------------
 1 files changed, 20 insertions(+), 184 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e67c519..81a012f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * Protect barrier_error from concurrent endio processing
-	 * in request-based dm.
-	 */
-	spinlock_t barrier_error_lock;
-	int barrier_error;
-
-	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
-
-	/* A pointer to the currently processing pre/post flush request */
-	struct request *flush_request;
 
 	/*
 	 * The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *clone, int error)
 	blk_update_request(tio->orig, 0, nr_bytes);
 }
 
-static void store_barrier_error(struct mapped_device *md, int error)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
-	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
 /*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request *clone)
 static void dm_end_request(struct request *clone, int error)
 {
 	int rw = rq_data_dir(clone);
-	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 
-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;
 
@@ -806,15 +776,8 @@ static void dm_end_request(struct request *clone, int error)
 	}
 
 	free_rq_clone(clone);
-
-	if (unlikely(is_barrier)) {
-		if (unlikely(error))
-			store_barrier_error(md, error);
-		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
-
-	rq_completed(md, rw, run_queue);
+	blk_end_request_all(rq, error);
+	rq_completed(md, rw, true);
 }
 
 static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct request *clone)
 	struct request_queue *q = rq->q;
 	unsigned long flags;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		dm_end_request(clone, DM_ENDIO_REQUEUE);
-		return;
-	}
-
 	dm_unprep_request(rq);
 
 	spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct request *clone, int error)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.  So can't use
-		 * softirq_done with the original.
-		 * Pass the clone to dm_done() directly in this special case.
-		 * It is safe (even if clone->q->queue_lock is held here)
-		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
-		 */
-		dm_done(clone, error, true);
-		return;
-	}
-
 	tio->error = error;
 	rq->completion_data = clone;
 	blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct request *clone, int error)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		BUG_ON(error > 0);
-		dm_end_request(clone, error);
-		return;
-	}
-
 	rq->cmd_flags |= REQ_FAILED;
 	dm_complete_request(clone, error);
 }
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_queue *q, struct bio *bio)
 	return _dm_request(q, bio);
 }
 
-static bool dm_rq_is_flush_request(struct request *rq)
-{
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
-}
-
 void dm_dispatch_request(struct request *rq)
 {
 	int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
 {
 	int r;
 
-	if (dm_rq_is_flush_request(rq)) {
-		blk_rq_init(NULL, clone);
-		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
-	} else {
-		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
-				      dm_rq_bio_constructor, tio);
-		if (r)
-			return r;
-
-		clone->cmd = rq->cmd;
-		clone->cmd_len = rq->cmd_len;
-		clone->sense = rq->sense;
-		clone->buffer = rq->buffer;
-	}
+	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+			      dm_rq_bio_constructor, tio);
+	if (r)
+		return r;
 
+	clone->cmd = rq->cmd;
+	clone->cmd_len = rq->cmd_len;
+	clone->sense = rq->sense;
+	clone->buffer = rq->buffer;
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;
 
@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 	struct mapped_device *md = q->queuedata;
 	struct request *clone;
 
-	if (unlikely(dm_rq_is_flush_request(rq)))
-		return BLKPREP_OK;
-
 	if (unlikely(rq->special)) {
 		DMWARN("Already has something in rq->special.");
 		return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request_queue *q)
 	struct dm_table *map = dm_get_live_table(md);
 	struct dm_target *ti;
 	struct request *rq, *clone;
+	sector_t pos;
 
 	/*
 	 * For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
 		if (!rq)
 			goto plug_and_out;
 
-		if (unlikely(dm_rq_is_flush_request(rq))) {
-			BUG_ON(md->flush_request);
-			md->flush_request = rq;
-			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
-			goto out;
-		}
+		/* always use block 0 to find the target for flushes for now */
+		pos = 0;
+		if (!(rq->cmd_flags & REQ_FLUSH))
+			pos = blk_rq_pos(rq);
 
-		ti = dm_table_find_target(map, blk_rq_pos(rq));
+		ti = dm_table_find_target(map, pos);
 		if (ti->type->busy && ti->type->busy(ti))
 			goto plug_and_out;
 
@@ -1888,7 +1797,6 @@ out:
 static const struct block_device_operations dm_blk_dops;
 
 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
 
 static void dm_init_md_queue(struct mapped_device *md)
 {
@@ -1943,7 +1851,6 @@ static struct mapped_device *alloc_dev(int minor)
 	mutex_init(&md->suspend_lock);
 	mutex_init(&md->type_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1966,7 +1873,6 @@ static struct mapped_device *alloc_dev(int minor)
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
 	init_waitqueue_head(&md->eventq);
 
 	md->disk->major = _major;
@@ -2220,8 +2126,6 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	/* no flush support for request based dm yet */
-	blk_queue_flush(md->queue, 0);
 
 	elv_register_queue(md->queue);
 
@@ -2421,73 +2325,6 @@ static void dm_queue_flush(struct mapped_device *md)
 	queue_work(md->wq, &md->work);
 }
 
-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
-	struct dm_rq_target_io *tio = clone->end_io_data;
-
-	tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
-	int i, j;
-	struct dm_table *map = dm_get_live_table(md);
-	unsigned num_targets = dm_table_get_num_targets(map);
-	struct dm_target *ti;
-	struct request *clone;
-
-	md->barrier_error = 0;
-
-	for (i = 0; i < num_targets; i++) {
-		ti = dm_table_get_target(map, i);
-		for (j = 0; j < ti->num_flush_requests; j++) {
-			clone = clone_rq(md->flush_request, md, GFP_NOIO);
-			dm_rq_set_target_request_nr(clone, j);
-			atomic_inc(&md->pending[rq_data_dir(clone)]);
-			map_request(ti, clone, md);
-		}
-	}
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-	dm_table_put(map);
-
-	return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
-	int error;
-	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
-	struct request_queue *q = md->queue;
-	struct request *rq;
-	unsigned long flags;
-
-	/*
-	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
-	 * completes.
-	 */
-	dm_get(md);
-
-	error = dm_rq_barrier(md);
-
-	rq = md->flush_request;
-	md->flush_request = NULL;
-
-	if (error == DM_ENDIO_REQUEUE) {
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_requeue_request(q, rq);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	} else
-		blk_end_request_all(rq, error);
-
-	blk_run_queue(q);
-
-	dm_put(md);
-}
-
 /*
  * Swap in a new table, returning the old one for the caller to destroy.
  */
@@ -2619,9 +2456,8 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	up_write(&md->io_lock);
 
 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
-	 * can be kicked until md->queue is stopped.  So stop md->queue before
-	 * flushing md->wq.
+	 * Stop md->queue before flushing md->wq in case request-based
+	 * dm defers requests to md->wq from md->queue.
 	 */
 	if (dm_request_based(md))
 		stop_queue(md->queue);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30  9:58 ` Tejun Heo
@ 2010-08-30  9:58   ` Tejun Heo
  -1 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Tejun Heo

This patch converts request-based dm to support the new REQ_FLUSH/FUA.

The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.

In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special, mostly independent path which could
issue flushes to multiple targets and sequence them.  However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.

This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests.  The only
special treatment is that REQ_FLUSH requests use block address 0 when
finding the target, which is enough for now.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  204 ++++++-------------------------------------------------
 1 files changed, 20 insertions(+), 184 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e67c519..81a012f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * Protect barrier_error from concurrent endio processing
-	 * in request-based dm.
-	 */
-	spinlock_t barrier_error_lock;
-	int barrier_error;
-
-	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
-
-	/* A pointer to the currently processing pre/post flush request */
-	struct request *flush_request;
 
 	/*
 	 * The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *clone, int error)
 	blk_update_request(tio->orig, 0, nr_bytes);
 }
 
-static void store_barrier_error(struct mapped_device *md, int error)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
-	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
 /*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request *clone)
 static void dm_end_request(struct request *clone, int error)
 {
 	int rw = rq_data_dir(clone);
-	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 
-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;
 
@@ -806,15 +776,8 @@ static void dm_end_request(struct request *clone, int error)
 	}
 
 	free_rq_clone(clone);
-
-	if (unlikely(is_barrier)) {
-		if (unlikely(error))
-			store_barrier_error(md, error);
-		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
-
-	rq_completed(md, rw, run_queue);
+	blk_end_request_all(rq, error);
+	rq_completed(md, rw, true);
 }
 
 static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct request *clone)
 	struct request_queue *q = rq->q;
 	unsigned long flags;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		dm_end_request(clone, DM_ENDIO_REQUEUE);
-		return;
-	}
-
 	dm_unprep_request(rq);
 
 	spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct request *clone, int error)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.  So can't use
-		 * softirq_done with the original.
-		 * Pass the clone to dm_done() directly in this special case.
-		 * It is safe (even if clone->q->queue_lock is held here)
-		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
-		 */
-		dm_done(clone, error, true);
-		return;
-	}
-
 	tio->error = error;
 	rq->completion_data = clone;
 	blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct request *clone, int error)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		BUG_ON(error > 0);
-		dm_end_request(clone, error);
-		return;
-	}
-
 	rq->cmd_flags |= REQ_FAILED;
 	dm_complete_request(clone, error);
 }
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_queue *q, struct bio *bio)
 	return _dm_request(q, bio);
 }
 
-static bool dm_rq_is_flush_request(struct request *rq)
-{
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
-}
-
 void dm_dispatch_request(struct request *rq)
 {
 	int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
 {
 	int r;
 
-	if (dm_rq_is_flush_request(rq)) {
-		blk_rq_init(NULL, clone);
-		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
-	} else {
-		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
-				      dm_rq_bio_constructor, tio);
-		if (r)
-			return r;
-
-		clone->cmd = rq->cmd;
-		clone->cmd_len = rq->cmd_len;
-		clone->sense = rq->sense;
-		clone->buffer = rq->buffer;
-	}
+	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+			      dm_rq_bio_constructor, tio);
+	if (r)
+		return r;
 
+	clone->cmd = rq->cmd;
+	clone->cmd_len = rq->cmd_len;
+	clone->sense = rq->sense;
+	clone->buffer = rq->buffer;
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;
 
@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 	struct mapped_device *md = q->queuedata;
 	struct request *clone;
 
-	if (unlikely(dm_rq_is_flush_request(rq)))
-		return BLKPREP_OK;
-
 	if (unlikely(rq->special)) {
 		DMWARN("Already has something in rq->special.");
 		return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request_queue *q)
 	struct dm_table *map = dm_get_live_table(md);
 	struct dm_target *ti;
 	struct request *rq, *clone;
+	sector_t pos;
 
 	/*
 	 * For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
 		if (!rq)
 			goto plug_and_out;
 
-		if (unlikely(dm_rq_is_flush_request(rq))) {
-			BUG_ON(md->flush_request);
-			md->flush_request = rq;
-			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
-			goto out;
-		}
+		/* always use block 0 to find the target for flushes for now */
+		pos = 0;
+		if (!(rq->cmd_flags & REQ_FLUSH))
+			pos = blk_rq_pos(rq);
 
-		ti = dm_table_find_target(map, blk_rq_pos(rq));
+		ti = dm_table_find_target(map, pos);
 		if (ti->type->busy && ti->type->busy(ti))
 			goto plug_and_out;
 
@@ -1888,7 +1797,6 @@ out:
 static const struct block_device_operations dm_blk_dops;
 
 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
 
 static void dm_init_md_queue(struct mapped_device *md)
 {
@@ -1943,7 +1851,6 @@ static struct mapped_device *alloc_dev(int minor)
 	mutex_init(&md->suspend_lock);
 	mutex_init(&md->type_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1966,7 +1873,6 @@ static struct mapped_device *alloc_dev(int minor)
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
 	init_waitqueue_head(&md->eventq);
 
 	md->disk->major = _major;
@@ -2220,8 +2126,6 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	/* no flush support for request based dm yet */
-	blk_queue_flush(md->queue, 0);
 
 	elv_register_queue(md->queue);
 
@@ -2421,73 +2325,6 @@ static void dm_queue_flush(struct mapped_device *md)
 	queue_work(md->wq, &md->work);
 }
 
-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
-	struct dm_rq_target_io *tio = clone->end_io_data;
-
-	tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
-	int i, j;
-	struct dm_table *map = dm_get_live_table(md);
-	unsigned num_targets = dm_table_get_num_targets(map);
-	struct dm_target *ti;
-	struct request *clone;
-
-	md->barrier_error = 0;
-
-	for (i = 0; i < num_targets; i++) {
-		ti = dm_table_get_target(map, i);
-		for (j = 0; j < ti->num_flush_requests; j++) {
-			clone = clone_rq(md->flush_request, md, GFP_NOIO);
-			dm_rq_set_target_request_nr(clone, j);
-			atomic_inc(&md->pending[rq_data_dir(clone)]);
-			map_request(ti, clone, md);
-		}
-	}
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-	dm_table_put(map);
-
-	return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
-	int error;
-	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
-	struct request_queue *q = md->queue;
-	struct request *rq;
-	unsigned long flags;
-
-	/*
-	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
-	 * completes.
-	 */
-	dm_get(md);
-
-	error = dm_rq_barrier(md);
-
-	rq = md->flush_request;
-	md->flush_request = NULL;
-
-	if (error == DM_ENDIO_REQUEUE) {
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_requeue_request(q, rq);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	} else
-		blk_end_request_all(rq, error);
-
-	blk_run_queue(q);
-
-	dm_put(md);
-}
-
 /*
  * Swap in a new table, returning the old one for the caller to destroy.
  */
@@ -2619,9 +2456,8 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	up_write(&md->io_lock);
 
 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
-	 * can be kicked until md->queue is stopped.  So stop md->queue before
-	 * flushing md->wq.
+	 * Stop md->queue before flushing md->wq in case request-based
+	 * dm defers requests to md->wq from md->queue.
 	 */
 	if (dm_request_based(md))
 		stop_queue(md->queue);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [PATCH 5/5] block: remove the WRITE_BARRIER flag
  2010-08-30  9:58 ` Tejun Heo
                   ` (12 preceding siblings ...)
  (?)
@ 2010-08-30  9:58 ` Tejun Heo
  -1 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid
  Cc: Christoph Hellwig, Tejun Heo

From: Christoph Hellwig <hch@infradead.org>

It's unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/fs.h |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32703a9..6b0f6e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -135,7 +135,6 @@ struct inodes_stat_t {
  *			immediately after submission. The write equivalent
  *			of READ_SYNC.
  * WRITE_ODIRECT_PLUG	Special case write for O_DIRECT only.
- * WRITE_BARRIER	DEPRECATED. Always fails. Use FLUSH/FUA instead.
  * WRITE_FLUSH		Like WRITE_SYNC but with preceding cache flush.
  * WRITE_FUA		Like WRITE_SYNC but data is guaranteed to be on
  *			non-volatile media on completion.
@@ -157,8 +156,6 @@ struct inodes_stat_t {
 #define WRITE_SYNC		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
 #define WRITE_ODIRECT_PLUG	(WRITE | REQ_SYNC)
 #define WRITE_META		(WRITE | REQ_META)
-#define WRITE_BARRIER		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-				 REQ_HARDBARRIER)
 #define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
 				 REQ_FLUSH)
 #define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30  9:58   ` Tejun Heo
  (?)
@ 2010-08-30 13:28   ` Mike Snitzer
  2010-08-30 13:59     ` Tejun Heo
  2010-08-30 13:59     ` Tejun Heo
  -1 siblings, 2 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-08-30 13:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
...
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests.  The only
> special treatment is that REQ_FLUSH requests use block address 0
> when finding the target, which is enough for now.

Looks very comparable to the patch I prepared, but I have two observations
below (based on my findings from testing my patch).

> @@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
>  {
>  	int r;
>  
> -	if (dm_rq_is_flush_request(rq)) {
> -		blk_rq_init(NULL, clone);
> -		clone->cmd_type = REQ_TYPE_FS;
> -		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> -	} else {
> -		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> -				      dm_rq_bio_constructor, tio);
> -		if (r)
> -			return r;
> -
> -		clone->cmd = rq->cmd;
> -		clone->cmd_len = rq->cmd_len;
> -		clone->sense = rq->sense;
> -		clone->buffer = rq->buffer;
> -	}
> +	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> +			      dm_rq_bio_constructor, tio);
> +	if (r)
> +		return r;
>  
> +	clone->cmd = rq->cmd;
> +	clone->cmd_len = rq->cmd_len;
> +	clone->sense = rq->sense;
> +	clone->buffer = rq->buffer;
>  	clone->end_io = end_clone_request;
>  	clone->end_io_data = tio;

blk_rq_prep_clone() of a REQ_FLUSH request will result in a
rq_data_dir(clone) of read.

I still had the following:

        if (rq->cmd_flags & REQ_FLUSH) {
                blk_rq_init(NULL, clone);
                clone->cmd_type = REQ_TYPE_FS;
                /* without this the clone has a rq_data_dir of 0 */
                clone->cmd_flags |= WRITE_FLUSH;
        } else {
                r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
                                      dm_rq_bio_constructor, tio);
                ...

Request-based DM's REQ_FLUSH still works without this special casing, but
I figured I'd raise this to ask: what is the proper rq_data_dir() for
a REQ_FLUSH?

> @@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
>  		if (!rq)
>  			goto plug_and_out;
>  
> -		if (unlikely(dm_rq_is_flush_request(rq))) {
> -			BUG_ON(md->flush_request);
> -			md->flush_request = rq;
> -			blk_start_request(rq);
> -			queue_work(md->wq, &md->barrier_work);
> -			goto out;
> -		}
> +		/* always use block 0 to find the target for flushes for now */
> +		pos = 0;
> +		if (!(rq->cmd_flags & REQ_FLUSH))
> +			pos = blk_rq_pos(rq);
>  
> -		ti = dm_table_find_target(map, blk_rq_pos(rq));
> +		ti = dm_table_find_target(map, pos);

I added the following here: BUG_ON(!dm_target_is_valid(ti));

>  		if (ti->type->busy && ti->type->busy(ti))
>  			goto plug_and_out;

I also needed to avoid the ->busy call for REQ_FLUSH:

                if (!(rq->cmd_flags & REQ_FLUSH)) {
                        ti = dm_table_find_target(map, blk_rq_pos(rq));
                        BUG_ON(!dm_target_is_valid(ti));
                        if (ti->type->busy && ti->type->busy(ti))
                                goto plug_and_out;
                } else {
                        /* rq-based only ever has one target! leverage this for FLUSH */
                        ti = dm_table_get_target(map, 0);
                }

If I allowed ->busy to be called for REQ_FLUSH it would result in a
deadlock.  I haven't identified where/why yet.

Other than these remaining issues, this patch looks good.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 13:28   ` Mike Snitzer
  2010-08-30 13:59     ` Tejun Heo
@ 2010-08-30 13:59     ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
                         ` (5 more replies)
  1 sibling, 6 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30 13:59 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

Hello,

On 08/30/2010 03:28 PM, Mike Snitzer wrote:
>> +	clone->cmd = rq->cmd;
>> +	clone->cmd_len = rq->cmd_len;
>> +	clone->sense = rq->sense;
>> +	clone->buffer = rq->buffer;
>>  	clone->end_io = end_clone_request;
>>  	clone->end_io_data = tio;
> 
> blk_rq_prep_clone() of a REQ_FLUSH request will result in a
> rq_data_dir(clone) of read.

Hmmm... why?  blk_rq_prep_clone() copies all REQ_* flags in
REQ_CLONE_MASK and REQ_WRITE is definitely there.  I'll check.

> I still had the following:
> 
>         if (rq->cmd_flags & REQ_FLUSH) {
>                 blk_rq_init(NULL, clone);
>                 clone->cmd_type = REQ_TYPE_FS;
>                 /* without this the clone has a rq_data_dir of 0 */
>                 clone->cmd_flags |= WRITE_FLUSH;
>         } else {
>                 r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
>                                       dm_rq_bio_constructor, tio);
>                 ...
> 
> Request-based DM's REQ_FLUSH still works without this special casing, but
> I figured I'd raise this to ask: what is the proper rq_data_dir() for
> a REQ_FLUSH?

Technically the block layer doesn't care one way or the other, but it
should definitely be WRITE.  Maybe it would be a good idea to enforce
that from the block layer.
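
For reference, here is a throwaway user-space sketch of why the
direction matters; the flag values below only mirror my recollection of
blk_types.h (REQ_WRITE being bit 0) and are assumptions for the
illustration, not kernel code:

	#include <stdio.h>

	#define REQ_WRITE	(1u << 0)	/* assumed: bit 0, as in blk_types.h */
	#define REQ_FLUSH	(1u << 8)	/* placeholder bit for this example */

	struct request { unsigned int cmd_flags; };

	/* same shape as the kernel macro: the direction is just the REQ_WRITE bit */
	#define rq_data_dir(rq)	((rq)->cmd_flags & 1)

	int main(void)
	{
		struct request flush_only  = { .cmd_flags = REQ_FLUSH };
		struct request write_flush = { .cmd_flags = REQ_FLUSH | REQ_WRITE };

		printf("flush without REQ_WRITE -> dir %u (reads as READ)\n",
		       rq_data_dir(&flush_only));
		printf("flush with REQ_WRITE    -> dir %u (WRITE)\n",
		       rq_data_dir(&write_flush));
		return 0;
	}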

>> @@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
>>  		if (!rq)
>>  			goto plug_and_out;
>>  
>> -		if (unlikely(dm_rq_is_flush_request(rq))) {
>> -			BUG_ON(md->flush_request);
>> -			md->flush_request = rq;
>> -			blk_start_request(rq);
>> -			queue_work(md->wq, &md->barrier_work);
>> -			goto out;
>> -		}
>> +		/* always use block 0 to find the target for flushes for now */
>> +		pos = 0;
>> +		if (!(rq->cmd_flags & REQ_FLUSH))
>> +			pos = blk_rq_pos(rq);
>>  
>> -		ti = dm_table_find_target(map, blk_rq_pos(rq));
>> +		ti = dm_table_find_target(map, pos);
> 
> I added the following here: BUG_ON(!dm_target_is_valid(ti));

I'll add it.

>>  		if (ti->type->busy && ti->type->busy(ti))
>>  			goto plug_and_out;
> 
> I also needed to avoid the ->busy call for REQ_FLUSH:
> 
>                 if (!(rq->cmd_flags & REQ_FLUSH)) {
>                         ti = dm_table_find_target(map, blk_rq_pos(rq));
>                         BUG_ON(!dm_target_is_valid(ti));
>                         if (ti->type->busy && ti->type->busy(ti))
>                                 goto plug_and_out;
>                 } else {
>                         /* rq-based only ever has one target! leverage this for FLUSH */
>                         ti = dm_table_get_target(map, 0);
>                 }
> 
> If I allowed ->busy to be called for REQ_FLUSH it would result in a
> deadlock.  I haven't identified where/why yet.

Ah... that's probably from the "if (!elv_queue_empty(q))" check below.
Flushes sit on a separate queue, but I forgot to update
elv_queue_empty() to check the flush queue, so elv_queue_empty() can
spuriously return %true, in which case the queue won't be plugged and
restarted later, leading to a queue hang.  I'll fix elv_queue_empty().
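
A toy user-space model of that suspected failure mode (purely
illustrative; the names are made up and nothing here is kernel code):

	#include <stdio.h>
	#include <stdbool.h>

	static int elevator_rqs;	/* requests the elevator knows about */
	static int parked_flushes;	/* flushes sitting on the separate flush queue */

	/* emptiness check; the buggy variant ignores the flush queue */
	static bool queue_empty(bool check_flush_queue)
	{
		return elevator_rqs == 0 &&
		       (!check_flush_queue || parked_flushes == 0);
	}

	static void request_fn_exit_path(bool check_flush_queue)
	{
		/* the driver only plugs/restarts the queue if it looks non-empty */
		if (!queue_empty(check_flush_queue))
			printf("  replug -> parked flush is issued on the next run\n");
		else
			printf("  queue left idle -> parked flush never issued (hang)\n");
	}

	int main(void)
	{
		elevator_rqs = 0;
		parked_flushes = 1;

		printf("emptiness check that ignores the flush queue:\n");
		request_fn_exit_path(false);
		printf("emptiness check that includes the flush queue:\n");
		request_fn_exit_path(true);
		return 0;
	}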

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 13:59     ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
@ 2010-08-30 15:07       ` Tejun Heo
  2010-08-30 19:08         ` Mike Snitzer
  2010-08-30 19:08         ` Mike Snitzer
  2010-08-30 15:42       ` [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH Tejun Heo
                         ` (3 subsequent siblings)
  5 siblings, 2 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30 15:07 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On 08/30/2010 03:59 PM, Tejun Heo wrote:
> Ah... that's probably from the "if (!elv_queue_empty(q))" check below.
> Flushes sit on a separate queue, but I forgot to update
> elv_queue_empty() to check the flush queue, so elv_queue_empty() can
> spuriously return %true, in which case the queue won't be plugged and
> restarted later, leading to a queue hang.  I'll fix elv_queue_empty().

I think I was too quick to blame elv_queue_empty().  Can you please
test whether the following patch fixes the hang?

Thanks.

---
 block/blk-flush.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -28,7 +28,8 @@ unsigned blk_flush_cur_seq(struct reques
 }

 static struct request *blk_flush_complete_seq(struct request_queue *q,
-					      unsigned seq, int error)
+					      unsigned seq, int error,
+					      bool from_end_io)
 {
 	struct request *next_rq = NULL;

@@ -51,6 +52,13 @@ static struct request *blk_flush_complet
 		if (!list_empty(&q->pending_flushes)) {
 			next_rq = list_entry_rq(q->pending_flushes.next);
 			list_move(&next_rq->queuelist, &q->queue_head);
+			/*
+			 * Moving a request silently to queue_head may
+			 * stall the queue, kick the queue if we
+			 * aren't in the issue path already.
+			 */
+			if (from_end_io)
+				__blk_run_queue(q);
 		}
 	}
 	return next_rq;
@@ -59,19 +67,19 @@ static struct request *blk_flush_complet
 static void pre_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error, true);
 }

 static void flush_data_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error, true);
 }

 static void post_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error, true);
 }

 static void init_flush_request(struct request *rq, struct gendisk *disk)
@@ -165,7 +173,7 @@ struct request *blk_do_flush(struct requ
 		skip |= QUEUE_FSEQ_DATA;
 	if (!do_postflush)
 		skip |= QUEUE_FSEQ_POSTFLUSH;
-	return blk_flush_complete_seq(q, skip, 0);
+	return blk_flush_complete_seq(q, skip, 0, false);
 }

 static void bio_end_flush(struct bio *bio, int err)

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH
  2010-08-30 13:59     ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
@ 2010-08-30 15:42       ` Tejun Heo
  2010-08-30 15:42       ` Tejun Heo
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30 15:42 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

init_flush_request() set only REQ_FLUSH when initializing flush
requests, leaving the data direction bit clear and making them READ
requests.  Use WRITE_FLUSH instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Snitzer <snitzer@redhat.com>
---
So, this was the culprit for the incorrect data direction for flush
requests.
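
For reference, rq_data_dir() derives the direction from the low WRITE
bit in cmd_flags, which a bare REQ_FLUSH does not set, so the flush
requests came out as READs.  A minimal sketch of the relevant
definitions (the exact WRITE_FLUSH expansion is an assumption here,
modulo the sync-related flags):

    /* include/linux/blkdev.h */
    #define rq_data_dir(rq)	((rq)->cmd_flags & 1)	/* REQ_WRITE is bit 0 */

    /* include/linux/fs.h, added by the FLUSH/FUA series (assumed expansion) */
    #define WRITE_FLUSH	(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH)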

Thanks.

 block/blk-flush.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -77,7 +77,7 @@ static void post_flush_end_io(struct req
 static void init_flush_request(struct request *rq, struct gendisk *disk)
 {
 	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = REQ_FLUSH;
+	rq->cmd_flags = WRITE_FLUSH;
 	rq->rq_disk = disk;
 }


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 13:59     ` Tejun Heo
                         ` (4 preceding siblings ...)
  2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
@ 2010-08-30 15:45       ` Tejun Heo
  2010-08-30 19:18         ` Mike Snitzer
                           ` (3 more replies)
  5 siblings, 4 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-30 15:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

This patch converts request-based dm to support the new REQ_FLUSH/FUA.

The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.

In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them.  However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.

This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests.  The only
special treatment is that REQ_FLUSH requests use block address 0
when finding the target, which is enough for now.

* added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
  suggested by Mike Snitzer

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
---
Here's a version w/ BUG_ON() added.  Once the queue hang issue is
tracked down, I'll refresh the whole series and repost.

Thanks.

 drivers/md/dm.c |  206 +++++---------------------------------------------------
 1 file changed, 22 insertions(+), 184 deletions(-)

Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
 	spinlock_t deferred_lock;

 	/*
-	 * Protect barrier_error from concurrent endio processing
-	 * in request-based dm.
-	 */
-	spinlock_t barrier_error_lock;
-	int barrier_error;
-
-	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
-
-	/* A pointer to the currently processing pre/post flush request */
-	struct request *flush_request;

 	/*
 	 * The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *cl
 	blk_update_request(tio->orig, 0, nr_bytes);
 }

-static void store_barrier_error(struct mapped_device *md, int error)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
-	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
 /*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request
 static void dm_end_request(struct request *clone, int error)
 {
 	int rw = rq_data_dir(clone);
-	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;

-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;

@@ -806,15 +776,8 @@ static void dm_end_request(struct reques
 	}

 	free_rq_clone(clone);
-
-	if (unlikely(is_barrier)) {
-		if (unlikely(error))
-			store_barrier_error(md, error);
-		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
-
-	rq_completed(md, rw, run_queue);
+	blk_end_request_all(rq, error);
+	rq_completed(md, rw, true);
 }

 static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct
 	struct request_queue *q = rq->q;
 	unsigned long flags;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		dm_end_request(clone, DM_ENDIO_REQUEUE);
-		return;
-	}
-
 	dm_unprep_request(rq);

 	spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct r
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.  So can't use
-		 * softirq_done with the original.
-		 * Pass the clone to dm_done() directly in this special case.
-		 * It is safe (even if clone->q->queue_lock is held here)
-		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
-		 */
-		dm_done(clone, error, true);
-		return;
-	}
-
 	tio->error = error;
 	rq->completion_data = clone;
 	blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct req
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		BUG_ON(error > 0);
-		dm_end_request(clone, error);
-		return;
-	}
-
 	rq->cmd_flags |= REQ_FAILED;
 	dm_complete_request(clone, error);
 }
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_que
 	return _dm_request(q, bio);
 }

-static bool dm_rq_is_flush_request(struct request *rq)
-{
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
-}
-
 void dm_dispatch_request(struct request *rq)
 {
 	int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *c
 {
 	int r;

-	if (dm_rq_is_flush_request(rq)) {
-		blk_rq_init(NULL, clone);
-		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
-	} else {
-		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
-				      dm_rq_bio_constructor, tio);
-		if (r)
-			return r;
-
-		clone->cmd = rq->cmd;
-		clone->cmd_len = rq->cmd_len;
-		clone->sense = rq->sense;
-		clone->buffer = rq->buffer;
-	}
+	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+			      dm_rq_bio_constructor, tio);
+	if (r)
+		return r;

+	clone->cmd = rq->cmd;
+	clone->cmd_len = rq->cmd_len;
+	clone->sense = rq->sense;
+	clone->buffer = rq->buffer;
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;

@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_que
 	struct mapped_device *md = q->queuedata;
 	struct request *clone;

-	if (unlikely(dm_rq_is_flush_request(rq)))
-		return BLKPREP_OK;
-
 	if (unlikely(rq->special)) {
 		DMWARN("Already has something in rq->special.");
 		return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request
 	struct dm_table *map = dm_get_live_table(md);
 	struct dm_target *ti;
 	struct request *rq, *clone;
+	sector_t pos;

 	/*
 	 * For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,14 @@ static void dm_request_fn(struct request
 		if (!rq)
 			goto plug_and_out;

-		if (unlikely(dm_rq_is_flush_request(rq))) {
-			BUG_ON(md->flush_request);
-			md->flush_request = rq;
-			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
-			goto out;
-		}
+		/* always use block 0 to find the target for flushes for now */
+		pos = 0;
+		if (!(rq->cmd_flags & REQ_FLUSH))
+			pos = blk_rq_pos(rq);
+
+		ti = dm_table_find_target(map, pos);
+		BUG_ON(!dm_target_is_valid(ti));

-		ti = dm_table_find_target(map, blk_rq_pos(rq));
 		if (ti->type->busy && ti->type->busy(ti))
 			goto plug_and_out;

@@ -1888,7 +1799,6 @@ out:
 static const struct block_device_operations dm_blk_dops;

 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);

 static void dm_init_md_queue(struct mapped_device *md)
 {
@@ -1943,7 +1853,6 @@ static struct mapped_device *alloc_dev(i
 	mutex_init(&md->suspend_lock);
 	mutex_init(&md->type_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1966,7 +1875,6 @@ static struct mapped_device *alloc_dev(i
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
 	init_waitqueue_head(&md->eventq);

 	md->disk->major = _major;
@@ -2220,8 +2128,6 @@ static int dm_init_request_based_queue(s
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	/* no flush support for request based dm yet */
-	blk_queue_flush(md->queue, 0);

 	elv_register_queue(md->queue);

@@ -2421,73 +2327,6 @@ static void dm_queue_flush(struct mapped
 	queue_work(md->wq, &md->work);
 }

-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
-	struct dm_rq_target_io *tio = clone->end_io_data;
-
-	tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
-	int i, j;
-	struct dm_table *map = dm_get_live_table(md);
-	unsigned num_targets = dm_table_get_num_targets(map);
-	struct dm_target *ti;
-	struct request *clone;
-
-	md->barrier_error = 0;
-
-	for (i = 0; i < num_targets; i++) {
-		ti = dm_table_get_target(map, i);
-		for (j = 0; j < ti->num_flush_requests; j++) {
-			clone = clone_rq(md->flush_request, md, GFP_NOIO);
-			dm_rq_set_target_request_nr(clone, j);
-			atomic_inc(&md->pending[rq_data_dir(clone)]);
-			map_request(ti, clone, md);
-		}
-	}
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-	dm_table_put(map);
-
-	return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
-	int error;
-	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
-	struct request_queue *q = md->queue;
-	struct request *rq;
-	unsigned long flags;
-
-	/*
-	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
-	 * completes.
-	 */
-	dm_get(md);
-
-	error = dm_rq_barrier(md);
-
-	rq = md->flush_request;
-	md->flush_request = NULL;
-
-	if (error == DM_ENDIO_REQUEUE) {
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_requeue_request(q, rq);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	} else
-		blk_end_request_all(rq, error);
-
-	blk_run_queue(q);
-
-	dm_put(md);
-}
-
 /*
  * Swap in a new table, returning the old one for the caller to destroy.
  */
@@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
 	up_write(&md->io_lock);

 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
-	 * can be kicked until md->queue is stopped.  So stop md->queue before
-	 * flushing md->wq.
+	 * Stop md->queue before flushing md->wq in case request-based
+	 * dm defers requests to md->wq from md->queue.
 	 */
 	if (dm_request_based(md))
 		stop_queue(md->queue);

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 15:07       ` Tejun Heo
  2010-08-30 19:08         ` Mike Snitzer
@ 2010-08-30 19:08         ` Mike Snitzer
  2010-08-30 21:28           ` Mike Snitzer
  1 sibling, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-08-30 19:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Mon, Aug 30 2010 at 11:07am -0400,
Tejun Heo <tj@kernel.org> wrote:

> On 08/30/2010 03:59 PM, Tejun Heo wrote:
> > Ah... that's probably from "if (!elv_queue_empty(q))" check below,
> > flushes are on a separate queue but I forgot to update
> > elv_queue_empty() to check the flush queue.  elv_queue_empty() can
> > return %true spuriously in which case the queue won't be plugged and
> > restarted later leading to queue hang.  I'll fix elv_queue_empty().
> 
> I think I was too quick to blame elv_queue_empty().  Can you please
> test whether the following patch fixes the hang?

It does, thanks!

Tested-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 15:45       ` Tejun Heo
@ 2010-08-30 19:18         ` Mike Snitzer
  2010-08-30 19:18         ` Mike Snitzer
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-08-30 19:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: k-ueda, jaxboe, jamie, linux-kernel, linux-raid, linux-fsdevel,
	j-nomura, hch

On Mon, Aug 30 2010 at 11:45am -0400,
Tejun Heo <tj@kernel.org> wrote:

> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
> 
> The original request-based flush implementation depended on
> request_queue blocking other requests while a barrier sequence is in
> progress, which is no longer true for the new REQ_FLUSH/FUA.
> 
> In general, request-based dm doesn't have infrastructure for cloning
> one source request to multiple targets, but the original flush
> implementation had a special mostly independent path which can issue
> flushes to multiple targets and sequence them.  However, the
> capability isn't currently in use and adds a lot of complexity.
> Moreover, it's unlikely to be useful in its current form as it
> doesn't make sense to be able to send out flushes to multiple targets
> when write requests can't be.
> 
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests.  The only
> special treatment is that REQ_FLUSH requests use block address 0
> when finding the target, which is enough for now.
> 
> * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
>   suggested by Mike Snitzer
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good.

Acked-by: Mike Snitzer <snitzer@redhat.com>

Junichi and/or Kiyoshi,
Could you please review this patch and add your Acked-by if it is OK?
(Alasdair will want to see NEC's Ack to accept this patch).

Thanks,
Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 19:08         ` Mike Snitzer
@ 2010-08-30 21:28           ` Mike Snitzer
  2010-08-31 10:29             ` Tejun Heo
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-08-30 21:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Mon, Aug 30 2010 at  3:08pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Aug 30 2010 at 11:07am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
> > On 08/30/2010 03:59 PM, Tejun Heo wrote:
> > > Ah... that's probably from "if (!elv_queue_empty(q))" check below,
> > > flushes are on a separate queue but I forgot to update
> > > elv_queue_empty() to check the flush queue.  elv_queue_empty() can
> > > return %true spuriously in which case the queue won't be plugged and
> > > restarted later leading to queue hang.  I'll fix elv_queue_empty().
> > 
> > I think I was too quick to blame elv_queue_empty().  Can you please
> > test whether the following patch fixes the hang?
> 
> It does, thanks!

Hmm, but unfortunately I was too quick to say the patch fixed the hang.

It is much more rare, but I can still get a hang.  I just got the
following running vgcreate against a DM mpath (rq-based) device:

INFO: task vgcreate:3517 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgcreate      D ffff88003d677a00  5168  3517   3361 0x00000080
 ffff88003d677998 0000000000000046 ffff880000000000 ffff88003d677fd8
 ffff880039c84860 ffff88003d677fd8 00000000001d3880 ffff880039c84c30
 ffff880039c84c28 00000000001d3880 00000000001d3880 ffff88003d677fd8
Call Trace:
 [<ffffffff81389308>] io_schedule+0x73/0xb5
 [<ffffffff811c7304>] get_request_wait+0xef/0x17d
 [<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811c7890>] __make_request+0x333/0x467
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff811c5e91>] generic_make_request+0x342/0x3bf
 [<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81069df2>] ? local_clock+0x41/0x5a
 [<ffffffff811c5fe9>] submit_bio+0xdb/0xf8
 [<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c
 [<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59
 [<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90
 [<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4
 [<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b
 [<ffffffff810e40f8>] ? might_fault+0x5c/0xac
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff8110e131>] do_sync_read+0xcb/0x108
 [<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff81389a99>] ? __mutex_unlock_slowpath+0x120/0x132
 [<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50
 [<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33
 [<ffffffff8110e7a3>] vfs_read+0xab/0x107
 [<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff8110e8c2>] sys_read+0x4d/0x74
 [<ffffffff81002c32>] system_call_fastpath+0x16/0x1b
no locks held by vgcreate/3517.

I was then able to reproduce it after reboot and another ~5 attempts
(all against 2.6.36-rc2 + your latest FLUSH+FUA patchset and DM
patches).

crash> bt -l 3893
PID: 3893   TASK: ffff88003e65a430  CPU: 0   COMMAND: "vgcreate"
 #0 [ffff88003a5298d8] schedule at ffffffff813891d3
    /root/git/linux-2.6/kernel/sched.c: 2873
 #1 [ffff88003a5299a0] io_schedule at ffffffff81389308
    /root/git/linux-2.6/kernel/sched.c: 5128
 #2 [ffff88003a5299c0] get_request_wait at ffffffff811c7304
    /root/git/linux-2.6/block/blk-core.c: 879
 #3 [ffff88003a529a50] __make_request at ffffffff811c7890
    /root/git/linux-2.6/block/blk-core.c: 1301
 #4 [ffff88003a529ac0] generic_make_request at ffffffff811c5e91
    /root/git/linux-2.6/block/blk-core.c: 1536
 #5 [ffff88003a529b70] submit_bio at ffffffff811c5fe9
    /root/git/linux-2.6/block/blk-core.c: 1632
 #6 [ffff88003a529bc0] dio_bio_submit at ffffffff811381a6
    /root/git/linux-2.6/fs/direct-io.c: 375
 #7 [ffff88003a529bf0] __blockdev_direct_IO at ffffffff81138dbe
    /root/git/linux-2.6/fs/direct-io.c: 1087
 #8 [ffff88003a529cd0] blkdev_direct_IO at ffffffff81136d7a
    /root/git/linux-2.6/fs/block_dev.c: 177
 #9 [ffff88003a529d10] generic_file_aio_read at ffffffff810ce301
    /root/git/linux-2.6/mm/filemap.c: 1303
#10 [ffff88003a529df0] do_sync_read at ffffffff8110e131
    /root/git/linux-2.6/fs/read_write.c: 282
#11 [ffff88003a529f00] vfs_read at ffffffff8110e7a3
    /root/git/linux-2.6/fs/read_write.c: 310
#12 [ffff88003a529f40] sys_read at ffffffff8110e8c2
    /root/git/linux-2.6/fs/read_write.c: 388
#13 [ffff88003a529f80] system_call_fastpath at ffffffff81002c32
    /root/git/linux-2.6/arch/x86/kernel/entry_64.S: 488
    RIP: 0000003b602d41a0  RSP: 00007fff55d5b928  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff81002c32  RCX: 00007fff55d5b960
    RDX: 0000000000001000  RSI: 00007fff55d5a000  RDI: 0000000000000005
    RBP: 0000000000000000   R8: 0000000000494ecd   R9: 0000000000001000
    R10: 000000315c41c160  R11: 0000000000000246  R12: 00007fff55d5a000
    R13: 00007fff55d5b0a0  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 21:28           ` Mike Snitzer
@ 2010-08-31 10:29             ` Tejun Heo
  2010-08-31 13:02               ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-08-31 10:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On 08/30/2010 11:28 PM, Mike Snitzer wrote:
> On Mon, Aug 30 2010 at  3:08pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Mon, Aug 30 2010 at 11:07am -0400,
>> Tejun Heo <tj@kernel.org> wrote:
>>
>>> On 08/30/2010 03:59 PM, Tejun Heo wrote:
>>>> Ah... that's probably from "if (!elv_queue_empty(q))" check below,
>>>> flushes are on a separate queue but I forgot to update
>>>> elv_queue_empty() to check the flush queue.  elv_queue_empty() can
>>>> return %true spuriously in which case the queue won't be plugged and
>>>> restarted later leading to queue hang.  I'll fix elv_queue_empty().
>>>
>>> I think I was too quick to blame elv_queue_empty().  Can you please
>>> test whether the following patch fixes the hang?
>>
>> It does, thanks!
> 
> Hmm, but unfortunately I was too quick to say the patch fixed the hang.
> 
> It is much more rare, but I can still get a hang.  I just got the
> following running vgcreate against a DM mpath (rq-based) device:

Can you please try this one instead?

Thanks.

---
 block/blk-flush.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -56,22 +56,38 @@ static struct request *blk_flush_complet
 	return next_rq;
 }

+static void blk_flush_complete_seq_end_io(struct request_queue *q,
+					  unsigned seq, int error)
+{
+	bool was_empty = elv_queue_empty(q);
+	struct request *next_rq;
+
+	next_rq = blk_flush_complete_seq(q, seq, error);
+
+	/*
+	 * Moving a request silently to empty queue_head may stall the
+	 * queue.  Kick the queue in those cases.
+	 */
+	if (next_rq && was_empty)
+		__blk_run_queue(q);
+}
+
 static void pre_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
 }

 static void flush_data_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
 }

 static void post_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
 }

 static void init_flush_request(struct request *rq, struct gendisk *disk)


-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-31 10:29             ` Tejun Heo
@ 2010-08-31 13:02               ` Mike Snitzer
  2010-08-31 13:14                 ` Tejun Heo
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-08-31 13:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Tue, Aug 31 2010 at  6:29am -0400,
Tejun Heo <tj@kernel.org> wrote:

> On 08/30/2010 11:28 PM, Mike Snitzer wrote:
> > Hmm, but unfortunately I was too quick to say the patch fixed the hang.
> > 
> > It is much more rare, but I can still get a hang.  I just got the
> > following running vgcreate against a DM mpath (rq-based) device:
> 
> Can you please try this one instead?

Still hit the hang on the 5th iteration of my test:
while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done

Would you like me to (re)send my test script offlist?

INFO: task vgcreate:2617 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgcreate      D ffff88007bf7ba00  4688  2617   2479 0x00000080
 ffff88007bf7b998 0000000000000046 ffff880000000000 ffff88007bf7bfd8
 ffff88005542a430 ffff88007bf7bfd8 00000000001d3880 ffff88005542a800
 ffff88005542a7f8 00000000001d3880 00000000001d3880 ffff88007bf7bfd8
Call Trace:
 [<ffffffff81389338>] io_schedule+0x73/0xb5
 [<ffffffff811c7304>] get_request_wait+0xef/0x17d
 [<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811c7890>] __make_request+0x333/0x467
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff811c5e91>] generic_make_request+0x342/0x3bf
 [<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81069df2>] ? local_clock+0x41/0x5a
 [<ffffffff811c5fe9>] submit_bio+0xdb/0xf8
 [<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c
 [<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59
 [<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90
 [<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4
 [<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b
 [<ffffffff810e40f8>] ? might_fault+0x5c/0xac
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff8110e131>] do_sync_read+0xcb/0x108
 [<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff81389ac9>] ? __mutex_unlock_slowpath+0x120/0x132
 [<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50
 [<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33
 [<ffffffff8110e7a3>] vfs_read+0xab/0x107
 [<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff8110e8c2>] sys_read+0x4d/0x74
 [<ffffffff81002c32>] system_call_fastpath+0x16/0x1b
no locks held by vgcreate/2617.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-31 13:02               ` Mike Snitzer
@ 2010-08-31 13:14                 ` Tejun Heo
  0 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-08-31 13:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

Hello,

On 08/31/2010 03:02 PM, Mike Snitzer wrote:
> On Tue, Aug 31 2010 at  6:29am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
>> On 08/30/2010 11:28 PM, Mike Snitzer wrote:
>>> Hmm, but unfortunately I was too quick to say the patch fixed the hang.
>>>
>>> It is much more rare, but I can still get a hang.  I just got the
>>> following running vgcreate against an DM mpath (rq-based) device:
>>
>> Can you please try this one instead?
> 
> Still hit the hang on the 5th iteration of my test:
> while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done
> 
> Would you like me to (re)send my test script offlist?

Yes, please.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 15:45       ` Tejun Heo
  2010-08-30 19:18         ` Mike Snitzer
  2010-08-30 19:18         ` Mike Snitzer
@ 2010-09-01  7:15         ` Kiyoshi Ueda
  2010-09-01 12:25           ` Mike Snitzer
                             ` (2 more replies)
       [not found]         ` <20100830194731.GA10702@redhat.com>
  3 siblings, 3 replies; 88+ messages in thread
From: Kiyoshi Ueda @ 2010-09-01  7:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 08/31/2010 12:45 AM +0900, Tejun Heo wrote:
> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
> 
> The original request-based flush implementation depended on
> request_queue blocking other requests while a barrier sequence is in
> progress, which is no longer true for the new REQ_FLUSH/FUA.
> 
> In general, request-based dm doesn't have infrastructure for cloning
> one source request to multiple targets, but the original flush
> implementation had a special mostly independent path which can issue
> flushes to multiple targets and sequence them.  However, the
> capability isn't currently in use and adds a lot of complexity.
> Moreover, it's unlikely to be useful in its current form as it
> doesn't make sense to be able to send out flushes to multiple targets
> when write requests can't be.
> 
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests.  The only
> special treatment is that REQ_FLUSH requests use block address 0
> when finding the target, which is enough for now.
> 
> * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
>   suggested by Mike Snitzer

Thank you for your work.

I don't see any obvious problem with this patch.
However, I hit the NULL pointer dereference below when I use an mpath
device with ext3's barrier option.  I'm investigating the cause now.
(I'm also not yet sure of the cause of the hang Mike is hitting.)

I tried this on commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.

    # mke2fs -j /dev/mapper/mpatha
    # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
    # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
    # sync

BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
PGD 29fd9a067 PUD 2a21ff067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1 
Modules linked in: ext4 jbd2 crc16 ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables autofs4 lockd sunrpc cpufreq_ondemand acpi_cpufreq bridge stp llc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac kvm_intel kvm e1000e sg sr_mod cdrom lpfc scsi_transport_fc piix rtc_cmos rtc_core ioatdma ata_piix button serio_raw rtc_lib libata dca megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]

Pid: 0, comm: kworker/0:0 Not tainted 2.6.36-rc2+ #1 MS-9196/Express5800/120Lj [N8100-1417]
RIP: 0010:[<ffffffffa0070ec3>]  [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
RSP: 0018:ffff880002c83e50  EFLAGS: 00010297
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000007d7c RSI: ffffffff81389c55 RDI: 0000000000000286
RBP: ffff880002c83e70 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802a2acf750
R13: ffff8802a25686c8 R14: ffff8802791f7eb8 R15: 0000000000000100
FS:  0000000000000000(0000) GS:ffff880002c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000078 CR3: 00000002a2ab6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:0 (pid: 0, threadinfo ffff8802a4576000, task ffff8802a4574050)
Stack:
 ffff8802791f7eb8 0000000000002002 000000000000ea60 0000000000000005
<0> ffff880002c83ea0 ffffffffa0079ec8 ffff880002c83eb0 0000000000000020
<0> ffffffff815220a0 0000000000000004 ffff880002c83ed0 ffffffff811c6636
Call Trace:
 <IRQ> 
 [<ffffffffa0079ec8>] scsi_softirq_done+0x138/0x170 [scsi_mod]
 [<ffffffff811c6636>] blk_done_softirq+0x86/0xa0
 [<ffffffff81053036>] __do_softirq+0xd6/0x210
 [<ffffffff81003d9c>] call_softirq+0x1c/0x50
 [<ffffffff81005705>] do_softirq+0x95/0xd0
 [<ffffffff81052f4d>] irq_exit+0x4d/0x60
 [<ffffffff81391668>] do_IRQ+0x78/0xf0
 [<ffffffff8138a053>] ret_from_intr+0x0/0x16
 <EOI> 
 [<ffffffff8100b630>] ? mwait_idle+0x70/0xe0
 [<ffffffff8107cc8d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8100b639>] ? mwait_idle+0x79/0xe0
 [<ffffffff8100b630>] ? mwait_idle+0x70/0xe0
 [<ffffffff81001c36>] cpu_idle+0x66/0xe0
 [<ffffffff81380e81>] ? start_secondary+0x181/0x1f0
 [<ffffffff81380e8f>] start_secondary+0x18f/0x1f0
Code: 0c 83 e0 07 83 f8 04 77 6c 49 8b 86 80 00 00 00 41 8b 5e 68 83 78 44 02 74 27 48 8b 80 b0 00 00 00 48 8b 80 70 02 00 00 48 8b 00 <48> 8b 50 78 89 d8 48 85 d2 74 05 4c 89 f7 ff d2 39 c3 74 21 89 
RIP  [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
 RSP <ffff880002c83e50>
CR2: 0000000000000078



Also, I have one comment below on this patch.

> @@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
>  	up_write(&md->io_lock);
> 
>  	/*
> -	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
> -	 * can be kicked until md->queue is stopped.  So stop md->queue before
> -	 * flushing md->wq.
> +	 * Stop md->queue before flushing md->wq in case request-based
> +	 * dm defers requests to md->wq from md->queue.
>  	 */
>  	if (dm_request_based(md))
>  		stop_queue(md->queue);

Request-based dm doesn't use md->wq now, so you can just remove
the comment above.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
       [not found]         ` <20100830194731.GA10702@redhat.com>
@ 2010-09-01 10:31           ` Mikulas Patocka
  2010-09-01 11:20             ` Tejun Heo
  0 siblings, 1 reply; 88+ messages in thread
From: Mikulas Patocka @ 2010-09-01 10:31 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Tejun Heo, dm-devel

On Mon, 30 Aug 2010, Mike Snitzer wrote:

> Hi,
> 
> On Mon, Aug 30 2010 at 11:45am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
> > Here's a version w/ BUG_ON() added.  Once the queue hang issue is
> > tracked down, I'll refresh the whole series and repost.
> 
> When you next send out the refreshed series, could you cc dm-devel any
> patches that touch DM directly or block patches that are required to
> support DM?
> 
> It'd be great to cc Mikulas Patocka on those patches too (I cc'd him).
> Mikulas will be reviewing these DM patches now that we have something
> that works with your larger FLUSH+FUA patchset.
> 
> Thanks,
> Mike

My recommended approach to this (on non-request-based dm) is to simply let 
the current barrier infrastructure be as it is --- you don't need to 
change it now, you can simply map FUA write to barrier write and FLUSH to 
zero-data barrier --- and it won't cause any data corruption. It will just 
force unneeded I/O queue draining.

Once FLUSH+FUA interface is finalized and committed upstream, we can 
remove that I/O queue draining from dm to improve performance.
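
A minimal sketch of that interim mapping for bio-based dm, in
illustrative pseudocode only (the flag names assume the 2.6.36 bio
interface; this is not the actual dm code):

    /*
     * Route the new flags through the existing drain-based barrier
     * path: a FUA write is mapped to a barrier write, a bare FLUSH
     * to a zero-data barrier.
     */
    static void map_flush_fua_to_barrier(struct bio *bio)
    {
            if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
                    bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
                    bio->bi_rw |= REQ_HARDBARRIER;
            }
    }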

Mikulas

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 10:31           ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Mikulas Patocka
@ 2010-09-01 11:20             ` Tejun Heo
  2010-09-01 12:12               ` Mikulas Patocka
  0 siblings, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-01 11:20 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Mike Snitzer

Hello,

On 09/01/2010 12:31 PM, Mikulas Patocka wrote:
> My recommended approach to this (on non-request-based dm) is to simply let 
> the current barrier infrastructure be as it is --- you don't need to 
> change it now, you can simply map FUA write to barrier write and FLUSH to 
> zero-data barrier --- and it won't cause any data corruption. It will just 
> force unneeded I/O queue draining.
> 
> Once FLUSH+FUA interface is finalized and committed upstream, we can 
> remove that I/O queue draining from dm to improve performance.

Unfortunately, it doesn't work that way.  The current dm
implementation depends on block layer holding the queue while a
barrier sequence is in progress which the new implementation doesn't
do anymore (the whole point of this conversion BTW).

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 11:20             ` Tejun Heo
@ 2010-09-01 12:12               ` Mikulas Patocka
  2010-09-01 12:42                 ` Tejun Heo
  2010-09-01 15:20                 ` Mike Snitzer
  0 siblings, 2 replies; 88+ messages in thread
From: Mikulas Patocka @ 2010-09-01 12:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: dm-devel, Mike Snitzer

On Wed, 1 Sep 2010, Tejun Heo wrote:

> Hello,
> 
> On 09/01/2010 12:31 PM, Mikulas Patocka wrote:
> > My recommended approach to this (on non-request-based dm) is to simply let 
> > the current barrier infrastructure be as it is --- you don't need to 
> > change it now, you can simply map FUA write to barrier write and FLUSH to 
> > zero-data barrier --- and it won't cause any data corruption. It will just 
> > force unneeded I/O queue draining.
> > 
> > Once FLUSH+FUA interface is finalized and committed upstream, we can 
> > remove that I/O queue draining from dm to improve performance.
> 
> Unfortunately, it doesn't work that way.  The current dm
> implementation depends on block layer holding the queue while a
> barrier sequence is in progress which the new implementation doesn't
> do anymore (the whole point of this conversion BTW).

That may be true for request-based dm (I don't know).

But bio-based dm doesn't depend on it, I wrote it and I didn't rely on 
that.

Mikulas

> Thanks.
> 
> -- 
> tejun
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01  7:15         ` Kiyoshi Ueda
@ 2010-09-01 12:25           ` Mike Snitzer
  2010-09-02 13:22           ` Tejun Heo
  2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
  2 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 12:25 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Tejun Heo, jaxboe, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Wed, Sep 01 2010 at  3:15am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Tejun,
> 
> On 08/31/2010 12:45 AM +0900, Tejun Heo wrote:
> > This patch converts request-based dm to support the new REQ_FLUSH/FUA.
> > 
> > The original request-based flush implementation depended on
> > request_queue blocking other requests while a barrier sequence is in
> > progress, which is no longer true for the new REQ_FLUSH/FUA.
> > 
> > In general, request-based dm doesn't have infrastructure for cloning
> > one source request to multiple targets, but the original flush
> > implementation had a special mostly independent path which can issue
> > flushes to multiple targets and sequence them.  However, the
> > capability isn't currently in use and adds a lot of complexity.
> > Moreover, it's unlikely to be useful in its current form as it
> > doesn't make sense to be able to send out flushes to multiple targets
> > when write requests can't be.
> > 
> > This patch rips out the special flush code path and handles
> > REQ_FLUSH/FUA requests the same way as other requests.  The only
> > special treatment is that REQ_FLUSH requests use the block address 0
> > when finding target, which is enough for now.
> > 
> > * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
> >   suggested by Mike Snitzer
> 
> Thank you for your work.
> 
> I don't see any obvious problem on this patch.
> However, I hit a NULL pointer dereference below when I use a mpath
> device with barrier option of ext3.  I'm investigating the cause now.
> (Also I'm not sure the cause of the hang which Mike is hitting yet.)
> 
> I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
> 
>     # mke2fs -j /dev/mapper/mpatha
>     # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
>     # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
>     # sync
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078

FYI, I can't reproduce this using all of Tejun's latest patches (not yet
in the flush-fua git tree).  But I haven't tried the specific flush-fua
commit that you referenced.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 12:12               ` Mikulas Patocka
@ 2010-09-01 12:42                 ` Tejun Heo
  2010-09-01 12:54                   ` Mike Snitzer
  2010-09-01 15:20                 ` Mike Snitzer
  1 sibling, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-01 12:42 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Mike Snitzer

Hello,

On 09/01/2010 02:12 PM, Mikulas Patocka wrote:
> That may be true for request-based dm (I don't know).

Oh, okay, this part of thread was for request based dm, so I assumed
you were talking about it.

> But bio-based dm doesn't depend on it, I wrote it and I didn't rely on 
> that.

If you look at the two patches for the bio-based ones, the first one is
basically what you're talking about w/ s/barrier/flush/ renames and
dropping of -EOPNOTSUPP.  It doesn't really change the mechanism much.
If you don't feel comfortable about the second one, we sure can
postpone it but it's still quite away from the next merge window and
what would be the point of delaying it?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 12:42                 ` Tejun Heo
@ 2010-09-01 12:54                   ` Mike Snitzer
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 12:54 UTC (permalink / raw)
  To: Tejun Heo; +Cc: dm-devel, Mikulas Patocka

On Wed, Sep 01 2010 at  8:42am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 09/01/2010 02:12 PM, Mikulas Patocka wrote:
> > That may be true for request-based dm (I don't know).
> 
> Oh, okay, this part of thread was for request based dm, so I assumed
> you were talking about it.
> 
> > But bio-based dm doesn't depend on it, I wrote it and I didn't rely on 
> > that.
> 
> If you look at the two patches for the bio-based ones, the first one is
> basically what you're talking about w/ s/barrier/flush/ renames and
> dropping of -EOPNOTSUPP.  It doesn't really change the mechanism much.
> If you don't feel comfortable about the second one, we sure can
> postpone it but it's still quite away from the next merge window and
> what would be the point of delaying it?

Right, we have a window of opportunity to sort this out now.  No sense
in wasting it.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-08-30  9:58   ` Tejun Heo
  (?)
@ 2010-09-01 13:43   ` Mike Snitzer
  2010-09-01 13:50     ` Tejun Heo
  -1 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
> now deprecated REQ_HARDBARRIER.
> 
> * -EOPNOTSUPP handling logic dropped.

Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
please add it to the final patch header.

This removal isn't unique to DM's conversion to FLUSH+FUA but I couldn't
easily find the justification for its removal in the larger 30+ patch
patchset either -- other patches are terse on the removal too.

Other than that.

Acked-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-09-01 13:43   ` Mike Snitzer
@ 2010-09-01 13:50     ` Tejun Heo
  2010-09-01 13:54       ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-01 13:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

Hello,

On 09/01/2010 03:43 PM, Mike Snitzer wrote:
> On Mon, Aug 30 2010 at  5:58am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
>> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
>> now deprecated REQ_HARDBARRIER.
>>
>> * -EOPNOTSUPP handling logic dropped.
> 
> Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
> please add it to the final patch header.

It just doesn't happen anymore.  If the underlying device doesn't
support FLUSH/FUA, the block layer simply makes those parts a noop.  IOW,
it no longer distinguishes between a writeback cache which doesn't
support cache flush at all and a writethrough cache.  Devices which have
a WB cache w/o flush are very difficult to come by these days and there's
nothing much we can do anyway, so it doesn't make sense to require
everyone to implement -EOPNOTSUPP.
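
Conceptually the filtering amounts to something like this at bio
submission time --- a sketch of the idea only, not the actual blk-core
code, with q->flush_flags standing in for whatever capability bits the
queue advertises and filter_flush_bits() being a made-up name:

/*
 * Sketch: if the queue advertises no cache flush capability, strip the
 * FLUSH/FUA bits so the bio proceeds as a plain write --- nothing is
 * left to fail with -EOPNOTSUPP.
 */
static void filter_flush_bits(struct request_queue *q, struct bio *bio)
{
	if ((bio->bi_rw & (REQ_FLUSH | REQ_FUA)) && !q->flush_flags)
		bio->bi_rw &= ~(REQ_FLUSH | REQ_FUA);
}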

One scheduled feature is to implement falling back to REQ_FLUSH when
the device advertises REQ_FUA but fails to process it, but one way or
the other, the goal is encapsulating REQ_FLUSH/FUA support in block
layer proper.  If FLUSH/FUA can be retried using a different strategy,
it should be done inside request_queue proper instead of pushing retry
logic to all its users.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58   ` Tejun Heo
@ 2010-09-01 13:51     ` Mike Snitzer
  -1 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: k-ueda, jaxboe, jamie, linux-kernel, linux-raid, linux-fsdevel,
	dm-devel, j-nomura, hch

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
> against other bio's.  This patch relaxes ordering around flushes.
> 
> * A flush bio is no longer deferred to workqueue directly.  It's
>   processed like other bio's but __split_and_process_bio() uses
>   md->flush_bio as the clone source.  md->flush_bio is initialized to
>   empty flush during md initialization and shared for all flushes.
> 
> * When dec_pending() detects that a flush has completed, it checks
>   whether the original bio has data.  If so, the bio is queued to the
>   deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
> 
> * As flush sequencing is handled in the usual issue/completion path,
>   dm_wq_work() no longer needs to handle flushes differently.  Now its
>   only responsibility is re-issuing deferred bio's the same way as
>   _dm_request() would.  REQ_FLUSH handling logic including
>   process_flush() is dropped.
> 
> * There's no reason for queue_io() and dm_wq_work() to write-lock
>   md->io_lock.  queue_io() now only uses md->deferred_lock and
>   dm_wq_work() read-locks md->io_lock.
> 
> * bio's no longer need to be queued on the deferred list while a flush
>   is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.
> 
> This avoids stalling the device during flushes and simplifies the
> implementation.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good overall.

> @@ -144,11 +143,6 @@ struct mapped_device {
>  	spinlock_t deferred_lock;
>  
>  	/*
> -	 * An error from the flush request currently being processed.
> -	 */
> -	int flush_error;
> -
> -	/*
>  	 * Protect barrier_error from concurrent endio processing
>  	 * in request-based dm.
>  	 */

Could you please document why it is OK to remove 'flush_error' in the
patch header?  The -EOPNOTSUPP handling removal (done in patch 2)
obviously helps enable this but it is not clear how the
'num_flush_requests' flushes that __clone_and_map_flush() generates do
not need explicit DM error handling.

Other than that.

Acked-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
@ 2010-09-01 13:51     ` Mike Snitzer
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
> against other bio's.  This patch relaxes ordering around flushes.
> 
> * A flush bio is no longer deferred to workqueue directly.  It's
>   processed like other bio's but __split_and_process_bio() uses
>   md->flush_bio as the clone source.  md->flush_bio is initialized to
>   empty flush during md initialization and shared for all flushes.
> 
> * When dec_pending() detects that a flush has completed, it checks
>   whether the original bio has data.  If so, the bio is queued to the
>   deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
> 
> * As flush sequencing is handled in the usual issue/completion path,
>   dm_wq_work() no longer needs to handle flushes differently.  Now its
>   only responsibility is re-issuing deferred bio's the same way as
>   _dm_request() would.  REQ_FLUSH handling logic including
>   process_flush() is dropped.
> 
> * There's no reason for queue_io() and dm_wq_work() to write-lock
>   md->io_lock.  queue_io() now only uses md->deferred_lock and
>   dm_wq_work() read-locks md->io_lock.
> 
> * bio's no longer need to be queued on the deferred list while a flush
>   is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.
> 
> This avoids stalling the device during flushes and simplifies the
> implementation.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good overall.

> @@ -144,11 +143,6 @@ struct mapped_device {
>  	spinlock_t deferred_lock;
>  
>  	/*
> -	 * An error from the flush request currently being processed.
> -	 */
> -	int flush_error;
> -
> -	/*
>  	 * Protect barrier_error from concurrent endio processing
>  	 * in request-based dm.
>  	 */

Could you please document why it is OK to remove 'flush_error' in the
patch header?  The -EOPNOTSUPP handling removal (done in patch 2)
obviously helps enable this but it is not clear how the
'num_flush_requests' flushes that __clone_and_map_flush() generates do
not need explicit DM error handling.

Other than that.

Acked-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-09-01 13:50     ` Tejun Heo
@ 2010-09-01 13:54       ` Mike Snitzer
  2010-09-01 13:56         ` Tejun Heo
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Wed, Sep 01 2010 at  9:50am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 09/01/2010 03:43 PM, Mike Snitzer wrote:
> > On Mon, Aug 30 2010 at  5:58am -0400,
> > Tejun Heo <tj@kernel.org> wrote:
> > 
> >> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
> >> now deprecated REQ_HARDBARRIER.
> >>
> >> * -EOPNOTSUPP handling logic dropped.
> > 
> > Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
> > please add it to the final patch header.
> 
> It just doesn't happen anymore.  If the underlying device doesn't
> support FLUSH/FUA, the block layer simply makes those parts a noop.  IOW,
> it no longer distinguishes between a writeback cache which doesn't
> support cache flush at all and a writethrough cache.  Devices which have
> a WB cache w/o flush are very difficult to come by these days and there's
> nothing much we can do anyway, so it doesn't make sense to require
> everyone to implement -EOPNOTSUPP.
> 
> One scheduled feature is to implement falling back to REQ_FLUSH when
> the device advertises REQ_FUA but fails to process it, but one way or
> the other, the goal is encapsulating REQ_FLUSH/FUA support in block
> layer proper.  If FLUSH/FUA can be retried using a different strategy,
> it should be done inside request_queue proper instead of pushing retry
> logic to all its users.

OK, so maybe add this info to the patch header of one of the primary
FLUSH+FUA conversion patches?

Thanks for the detailed explanation!

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-09-01 13:54       ` Mike Snitzer
@ 2010-09-01 13:56         ` Tejun Heo
  0 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-09-01 13:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On 09/01/2010 03:54 PM, Mike Snitzer wrote:
>> It just doesn't happen anymore.  If the underlying device doesn't
>> support FLUSH/FUA, the block layer simply makes those parts a noop.  IOW,
>> it no longer distinguishes between a writeback cache which doesn't
>> support cache flush at all and a writethrough cache.  Devices which have
>> a WB cache w/o flush are very difficult to come by these days and there's
>> nothing much we can do anyway, so it doesn't make sense to require
>> everyone to implement -EOPNOTSUPP.
>>
>> One scheduled feature is to implement falling back to REQ_FLUSH when
>> the device advertises REQ_FUA but fails to process it, but one way or
>> the other, the goal is encapsulating REQ_FLUSH/FUA support in block
>> layer proper.  If FLUSH/FUA can be retried using a different strategy,
>> it should be done inside request_queue proper instead of pushing retry
>> logic to all its users.
> 
> OK, so maybe add this info to the patch header of one of the primary
> FLUSH+FUA conversion patches?

Sure.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-09-01 13:51     ` Mike Snitzer
@ 2010-09-01 13:56       ` Tejun Heo
  -1 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-09-01 13:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: k-ueda, jaxboe, jamie, linux-kernel, linux-raid, linux-fsdevel,
	dm-devel, j-nomura, hch

On 09/01/2010 03:51 PM, Mike Snitzer wrote:
> Could you please document why it is OK to remove 'flush_error' in the
> patch header?  The -EOPNOTSUPP handling removal (done in patch 2)
> obviously helps enable this but it is not clear how the
> 'num_flush_requests' flushes that __clone_and_map_flush() generates do
> not need explicit DM error handling.

Sure, I'll.  It's because it now uses the same error handling path in
dec_pending() all other bio's use.  The flush_error thing was there
because flushes got executed/completed in a separate code path to
begin with.  With the special path gone, there's no need for
flush_error path either.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
@ 2010-09-01 13:56       ` Tejun Heo
  0 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-09-01 13:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On 09/01/2010 03:51 PM, Mike Snitzer wrote:
> Could you please document why it is OK to remove 'flush_error' in the
> patch header?  The -EOPNOTSUPP handling removal (done in patch 2)
> obviously helps enable this but it is not clear how the
> 'num_flush_requests' flushes that __clone_and_map_flush() generates do
> not need explicit DM error handling.

Sure, I'll.  It's because it now uses the same error handling path in
dec_pending() all other bio's use.  The flush_error thing was there
because flushes got executed/completed in a separate code path to
begin with.  With the special path gone, there's no need for
flush_error path either.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 12:12               ` Mikulas Patocka
  2010-09-01 12:42                 ` Tejun Heo
@ 2010-09-01 15:20                 ` Mike Snitzer
  2010-09-01 15:35                   ` Mikulas Patocka
  1 sibling, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 15:20 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Tejun Heo, dm-devel

On Wed, Sep 01 2010 at  8:12am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Wed, 1 Sep 2010, Tejun Heo wrote:
> 
> > Hello,
> > 
> > On 09/01/2010 12:31 PM, Mikulas Patocka wrote:
> > > My recommended approach to this (on non-request-based dm) is to simply let 
> > > the current barrier infrastructure be as it is --- you don't need to 
> > > change it now, you can simply map FUA write to barrier write and FLUSH to 
> > > zero-data barrier --- and it won't cause any data corruption. It will just 
> > > force unneeded I/O queue draining.
> > > 
> > > Once FLUSH+FUA interface is finalized and committed upstream, we can 
> > > remove that I/O queue draining from dm to improve performance.
> > 
> > Unfortunately, it doesn't work that way.  The current dm
> > implementation depends on block layer holding the queue while a
> > barrier sequence is in progress which the new implementation doesn't
> > do anymore (the whole point of this conversion BTW).
> 
> That may be true for request-based dm (I don't know).
> 
> But bio-based dm doesn't depend on it, I wrote it and I didn't rely on 
> that.

Mikulas,

Current bio-based barrier support also defers IO if a flush is in
progress.  See _dm_request:

	/*
	 * If we're suspended or the thread is processing barriers
	 * we have to queue this io for later.
	 */

Tejun also shared the following:

"bio based implementation also uses dm_wait_for_completion() and
DMF_QUEUE_IO_TO_THREAD to plug all the follow up bio's while flush is in
progress, which sucks for throughput but successfully avoids starvation."

here:
https://www.redhat.com/archives/dm-devel/2010-August/msg00174.html

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags
  2010-08-30  9:58   ` Tejun Heo
  (?)
@ 2010-09-01 15:30   ` Christoph Hellwig
  -1 siblings, 0 replies; 88+ messages in thread
From: Christoph Hellwig @ 2010-09-01 15:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

On Mon, Aug 30, 2010 at 11:58:12AM +0200, Tejun Heo wrote:
> Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
> There's no reason to omit other command flags and REQ_FUA needs to be
> copied to implement FUA support in request-based dm.
> 
> REQ_COMMON_MASK which specifies flags to be copied from bio to request
> already identifies all the command flags.  Define REQ_CLONE_MASK to be
> the same as REQ_COMMON_MASK for clarity and make __blk_rq_prep_clone()
> copy all flags in the mask.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 15:20                 ` Mike Snitzer
@ 2010-09-01 15:35                   ` Mikulas Patocka
  2010-09-01 17:07                     ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Mikulas Patocka @ 2010-09-01 15:35 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Tejun Heo, dm-devel



On Wed, 1 Sep 2010, Mike Snitzer wrote:

> On Wed, Sep 01 2010 at  8:12am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > On Wed, 1 Sep 2010, Tejun Heo wrote:
> > 
> > > Hello,
> > > 
> > > On 09/01/2010 12:31 PM, Mikulas Patocka wrote:
> > > > My recommended approach to this (on non-request-based dm) is to simply let 
> > > > the current barrier infrastructure be as it is --- you don't need to 
> > > > change it now, you can simply map FUA write to barrier write and FLUSH to 
> > > > zero-data barrier --- and it won't cause any data corruption. It will just 
> > > > force unneeded I/O queue draining.
> > > > 
> > > > Once FLUSH+FUA interface is finalized and committed upstream, we can 
> > > > remove that I/O queue draining from dm to improve performance.
> > > 
> > > Unfortunately, it doesn't work that way.  The current dm
> > > implementation depends on block layer holding the queue while a
> > > barrier sequence is in progress which the new implementation doesn't
> > > do anymore (the whole point of this conversion BTW).
> > 
> > That may be true for request-based dm (I don't know).
> > 
> > But bio-based dm doesn't depend on it, I wrote it and I didn't rely on 
> > that.
> 
> Mikulas,
> 
> Current bio-based barrier support also defers IO if a flush is in
> progress.  See _dm_request:

I know. But it doesn't hurt with flush/fua requests. It just lowers 
performance (it defers i/os when it doesn't have to) but doesn't damage 
data.

So I think that we can let it be this way until flush/fua patch is 
finalized.

Mikulas

> 	/*
> 	 * If we're suspended or the thread is processing barriers
> 	 * we have to queue this io for later.
> 	 */
> 
> Tejun also shared the following:
> 
> "bio based implementation also uses dm_wait_for_completion() and
> DMF_QUEUE_IO_TO_THREAD to plug all the follow up bio's while flush is in
> progress, which sucks for throughput but successfully avoids starvation."
> 
> here:
> https://www.redhat.com/archives/dm-devel/2010-August/msg00174.html
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 15:35                   ` Mikulas Patocka
@ 2010-09-01 17:07                     ` Mike Snitzer
  2010-09-01 18:59                       ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 17:07 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Tejun Heo, dm-devel

On Wed, Sep 01 2010 at 11:35am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 1 Sep 2010, Mike Snitzer wrote:
> 
> > On Wed, Sep 01 2010 at  8:12am -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > On Wed, 1 Sep 2010, Tejun Heo wrote:
> > > 
> > > > Hello,
> > > > 
> > > > On 09/01/2010 12:31 PM, Mikulas Patocka wrote:
> > > > > My recommended approach to this (on non-request-based dm) is to simply let 
> > > > > the current barrier infrastructure be as it is --- you don't need to 
> > > > > change it now, you can simply map FUA write to barrier write and FLUSH to 
> > > > > zero-data barrier --- and it won't cause any data corruption. It will just 
> > > > > force unneeded I/O queue draining.
> > > > > 
> > > > > Once FLUSH+FUA interface is finalized and committed upstream, we can 
> > > > > remove that I/O queue draining from dm to improve performance.
> > > > 
> > > > Unfortunately, it doesn't work that way.  The current dm
> > > > implementation depends on block layer holding the queue while a
> > > > barrier sequence is in progress which the new implementation doesn't
> > > > do anymore (the whole point of this conversion BTW).
> > > 
> > > That may be true for request-based dm (I don't know).
> > > 
> > > But bio-based dm doesn't depend on it, I wrote it and I didn't rely on 
> > > that.
> > 
> > Mikulas,
> > 
> > Current bio-based barrier support also defers IO if a flush is in
> > progress.  See _dm_request:
> 
> I know. But it doesn't hurt with flush/fua requests. It just lowers 
> performance (it defers i/os when it doesn't have to) but doesn't damage 
> data.
> 
> So I think that we can let it be this way until flush/fua patch is 
> finalized.

Neither Tejun nor I see the point in waiting when we have a window of
time to address the issues now.  We want DM to realize the benefit
associated with the kernel-wide FLUSH+FUA conversion too.

You're not explaining your reluctance to tackle review of this DM
FLUSH+FUA conversion.  Converting DM can expose (and already has exposed) some
shortcomings in the kernel-wide FLUSH+FUA conversion.

We can't afford for these kernel-wide changes to go in and have DM left
trying to fix some fundamental kernel issue outside of DM after the
fact.

So why wait?  This debate has distracted you from just reviewing the
code.  The bio-based DM changes are fairly straight-forward.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 17:07                     ` Mike Snitzer
@ 2010-09-01 18:59                       ` Mike Snitzer
  2010-09-02  3:22                         ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-01 18:59 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Tejun Heo, dm-devel

On Wed, Sep 01 2010 at  1:07pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Sep 01 2010 at 11:35am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > On Wed, 1 Sep 2010, Mike Snitzer wrote:
> > > 
> > > Mikulas,
> > > 
> > > Current bio-based barrier support also defers IO if a flush is in
> > > progress.  See _dm_request:
> > 
> > I know. But it doesn't hurt with flush/fua requests. It just lowers 
> > performance (it defers i/os when it doesn't have to) but doesn't damage 
> > data.
> > 
> > So I think that we can let it be this way until flush/fua patch is 
> > finalized.
> 
> Neither Tejun nor I see the point in waiting when we have a window of
> time to address the issues now.  We want DM to realize the benefit
> associated with the kernel-wide FLUSH+FUA conversion too.

But we can meet in the middle.  I've reordered the DM FLUSH+FUA patches
so that the more intrusive bio-based relaxed ordering patch is at the
very end.

My hope was that the request-based deadlock I'm seeing would disappear
if that relaxed ordering patch wasn't applied.  Unfortunately, I still
see the hang.

Anyway, I've made the patches available here:
http://people.redhat.com/msnitzer/patches/flush-fua/

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01 18:59                       ` Mike Snitzer
@ 2010-09-02  3:22                         ` Mike Snitzer
  2010-09-02 10:24                           ` Tejun Heo
  2010-09-09 15:26                           ` [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm] Mike Snitzer
  0 siblings, 2 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-02  3:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: dm-devel, Mikulas Patocka, Vivek Goyal

On Wed, Sep 01 2010 at  2:59pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> But we can meet in the middle.  I've reordered the DM FLUSH+FUA patches
> so that the more intrusive bio-based relaxed ordering patch is at the
> very end.
> 
> My hope was that the request-based deadlock I'm seeing would disappear
> if that relaxed ordering patch wasn't applied.  Unfortunately, I still
> see the hang.

Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
FLUSH+FUA patches)!

I'll try to pin-point the root cause but I think my test is somehow
exposing a bug in my virt setup.

So this hang is definitely starting to look like a red herring.

Tejun,

This news should clear the way for you to re-post your patches.  I think
it would be best if you reordered the DM patches like I did here in this
series: http://people.redhat.com/msnitzer/patches/flush-fua/series

In particular, the dm-relax-ordering-of-bio-based-flush-implementation
patch should go at the end.  I think it makes for a more logical
evolution of the DM code.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-02  3:22                         ` Mike Snitzer
@ 2010-09-02 10:24                           ` Tejun Heo
  2010-09-02 15:11                             ` Mike Snitzer
  2010-09-09 15:26                           ` [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm] Mike Snitzer
  1 sibling, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-02 10:24 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Mikulas Patocka, Vivek Goyal

Hello,

On 09/02/2010 05:22 AM, Mike Snitzer wrote:
>> But we can meet in the middle.  I've reordered the DM FLUSH+FUA patches
>> so that the more intrusive bio-based relaxed ordering patch is at the
>> very end.
>>
>> My hope was that the request-based deadlock I'm seeing would disappear
>> if that relaxed ordering patch wasn't applied.  Unfortunately, I still
>> see the hang.

I don't think it would make any difference.  AFAICS, the patch doesn't
touch anything request-based dm uses.

> Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
> FLUSH+FUA patches)!

Hmmm... that's interesting.

> I'll try to pin-point the root cause but I think my test is somehow
> exposing a bug in my virt setup.
> 
> So this hang is definitely starting to look like a red herring.
> 
> Tejun,
> 
> This news should clear the way for you to re-post your patches.  I think
> it would be best if you reordered the DM patches like I did here in this
> series: http://people.redhat.com/msnitzer/patches/flush-fua/series
> 
> In particular, the dm-relax-ordering-of-bio-based-flush-implementation
> patch should go at the end.  I think it makes for a more logical
> evolution of the DM code.

Sure, I'll.  I still think having the queue kicking mechanism is a
good idea tho.  I'll integrate that into series, reorder and repost
it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01  7:15         ` Kiyoshi Ueda
  2010-09-01 12:25           ` Mike Snitzer
@ 2010-09-02 13:22           ` Tejun Heo
  2010-09-02 13:32             ` Tejun Heo
  2010-09-03  5:46             ` Kiyoshi Ueda
  2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
  2 siblings, 2 replies; 88+ messages in thread
From: Tejun Heo @ 2010-09-02 13:22 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

On 09/01/2010 09:15 AM, Kiyoshi Ueda wrote:
> I don't see any obvious problem on this patch.
> However, I hit a NULL pointer dereference below when I use a mpath
> device with barrier option of ext3.  I'm investigating the cause now.
> (Also I'm not sure the cause of the hang which Mike is hitting yet.)
> 
> I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
> 
>     # mke2fs -j /dev/mapper/mpatha
>     # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
>     # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
>     # sync

Hmm... I'm trying to reproduce this problem but haven't been successful
yet.

> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
> IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]

Can you please ask gdb which source line this is?

> Also, I have one comment below on this patch.
> 
>> @@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
>>  	up_write(&md->io_lock);
>>
>>  	/*
>> -	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
>> -	 * can be kicked until md->queue is stopped.  So stop md->queue before
>> -	 * flushing md->wq.
>> +	 * Stop md->queue before flushing md->wq in case request-based
>> +	 * dm defers requests to md->wq from md->queue.
>>  	 */
>>  	if (dm_request_based(md))
>>  		stop_queue(md->queue);
> 
> Request-based dm doesn't use md->wq now, so you can just remove
> the comment above.

I sure can remove it but md->wq already has most stuff necessary to
process deferred requests and when someone starts using it, having the
comment there about the rather delicate ordering would definitely be
helpful, so I suggest keeping the comment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-02 13:22           ` Tejun Heo
@ 2010-09-02 13:32             ` Tejun Heo
  2010-09-03  5:46             ` Kiyoshi Ueda
  1 sibling, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-09-02 13:32 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

On 09/02/2010 03:22 PM, Tejun Heo wrote:
> Hello,
> 
> On 09/01/2010 09:15 AM, Kiyoshi Ueda wrote:
>> I don't see any obvious problem on this patch.
>> However, I hit a NULL pointer dereference below when I use a mpath
>> device with barrier option of ext3.  I'm investigating the cause now.
>> (Also I'm not sure the cause of the hang which Mike is hitting yet.)
>>
>> I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
>> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
>>
>>     # mke2fs -j /dev/mapper/mpatha
>>     # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
>>     # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
>>     # sync
> 
> Hmm... I'm trying to reproduce this problem but haven't been successful
> yet.
> 
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
>> IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
> 
> Can you please ask gdb which source line this is?

Ooh, never mind.  Reproduced it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-02 10:24                           ` Tejun Heo
@ 2010-09-02 15:11                             ` Mike Snitzer
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-02 15:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: hch, dm-devel, Mikulas Patocka, Vivek Goyal

On Thu, Sep 02 2010 at  6:24am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 09/02/2010 05:22 AM, Mike Snitzer wrote:
> >> But we can meet in the middle.  I've reordered the DM FLUSH+FUA patches
> >> so that the more intrusive bio-based relaxed ordering patch is at the
> >> very end.
> >>
> >> My hope was that the request-based deadlock I'm seeing would disappear
> >> if that relaxed ordering patch wasn't applied.  Unfortunately, I still
> >> see the hang.
> 
> I don't think it would make any difference.  AFAICS, the patch doesn't
> touch anything requested based dm uses.

Right, I was stacking bio-based on request-based so it initially seemed
like they were related (based on traces I had seen).

> > Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
> > FLUSH+FUA patches)!
> 
> Hmmm... that's interesting.

Definitely, and I just tested a recent RHEL6 kernel.. it works perfectly
fine.

So now I'll be doing a git bisect to try to pinpoint where upstream went
wrong.

> > I'll try to pin-point the root cause but I think my test is somehow
> > exposing a bug in my virt setup.
> > 
> > So this hang is definitely starting to look like a red herring.
> > 
> > Tejun,
> > 
> > This news should clear the way for you to re-post your patches.  I think
> > it would be best if you reordered the DM patches like I did here in this
> > series: http://people.redhat.com/msnitzer/patches/flush-fua/series
> > 
> > In particular, the dm-relax-ordering-of-bio-based-flush-implementation
> > patch should go at the end.  I think it makes for a more logical
> > evolution of the DM code.
> 
> Sure, I'll.  I still think having the queue kicking mechanism is a
> good idea tho.  I'll integrate that into series, reorder and repost
> it.

Sounds good.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-01  7:15         ` Kiyoshi Ueda
  2010-09-01 12:25           ` Mike Snitzer
  2010-09-02 13:22           ` Tejun Heo
@ 2010-09-02 17:43           ` Tejun Heo
  2010-09-03  5:47             ` Kiyoshi Ueda
  2 siblings, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-02 17:43 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has
passed through remapping drivers.  FSEQ_DATA request incorrectly
followed bio->bi_bdev->bd_disk ending up being issued w/ mismatching
rq_disk.  Make it follow orig_rq->rq_disk.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
---
Kiyoshi, can you please apply this patch on top and verify that the
problem goes away?  The reason I and Mike didn't see the problem was
because we were using REQ_FUA capable underlying device, which makes
block layer bypass sequencing for FSEQ_DATA request.

Thanks.

 block/blk-flush.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -111,6 +111,13 @@ static struct request *queue_next_fseq(s
 		break;
 	case QUEUE_FSEQ_DATA:
 		init_request_from_bio(rq, orig_rq->bio);
+		/*
+		 * orig_rq->rq_disk may be different from
+		 * bio->bi_bdev->bd_disk if orig_rq got here through
+		 * remapping drivers.  Make sure rq->rq_disk points
+		 * to the same one as orig_rq.
+		 */
+		rq->rq_disk = orig_rq->rq_disk;
 		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
 		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
 		rq->end_io = flush_data_end_io;

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-02 13:22           ` Tejun Heo
  2010-09-02 13:32             ` Tejun Heo
@ 2010-09-03  5:46             ` Kiyoshi Ueda
  1 sibling, 0 replies; 88+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03  5:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/02/2010 10:22 PM +0900, Tejun Heo wrote:
> On 09/01/2010 09:15 AM, Kiyoshi Ueda wrote:
>>> @@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
>>>  	up_write(&md->io_lock);
>>>
>>>  	/*
>>> -	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
>>> -	 * can be kicked until md->queue is stopped.  So stop md->queue before
>>> -	 * flushing md->wq.
>>> +	 * Stop md->queue before flushing md->wq in case request-based
>>> +	 * dm defers requests to md->wq from md->queue.
>>>  	 */
>>>  	if (dm_request_based(md))
>>>  		stop_queue(md->queue);
>> 
>> Request-based dm doesn't use md->wq now, so you can just remove
>> the comment above.
> 
> I sure can remove it but md->wq already has most stuff necessary to
> process deferred requests and when someone starts using it, having the
> comment there about the rather delicate ordering would definitely be
> helpful, so I suggest keeping the comment.

OK, makes sense.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
@ 2010-09-03  5:47             ` Kiyoshi Ueda
  2010-09-03  9:33               ` Tejun Heo
  0 siblings, 1 reply; 88+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03  5:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/03/2010 02:43 AM +0900, Tejun Heo wrote:
> rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has
> passed through remapping drivers.  FSEQ_DATA request incorrectly
> followed bio->bi_bdev->bd_disk ending up being issued w/ mismatching
> rq_disk.  Make it follow orig_rq->rq_disk.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
> ---
> Kiyoshi, can you please apply this patch on top and verify that the
> problem goes away?  The reason I and Mike didn't see the problem was
> because we were using REQ_FUA capable underlying device, which makes
> block layer bypass sequencing for FSEQ_DATA request.
> 
> Thanks.
> 
>  block/blk-flush.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: block/block/blk-flush.c
> ===================================================================
> --- block.orig/block/blk-flush.c
> +++ block/block/blk-flush.c
> @@ -111,6 +111,13 @@ static struct request *queue_next_fseq(s
>  		break;
>  	case QUEUE_FSEQ_DATA:
>  		init_request_from_bio(rq, orig_rq->bio);
> +		/*
> +		 * orig_rq->rq_disk may be different from
> +		 * bio->bi_bdev->bd_disk if orig_rq got here through
> +		 * remapping drivers.  Make sure rq->rq_disk points
> +		 * to the same one as orig_rq.
> +		 */
> +		rq->rq_disk = orig_rq->rq_disk;
>  		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
>  		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
>  		rq->end_io = flush_data_end_io;

Ah, I see, thank you for the quick fix!
I confirmed no panic occurs with this patch.

Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>


By the way, I had been considering a block-layer interface which remaps
struct request and its bios to a block device such as:
void blk_remap_request(struct request *rq, struct block_device *bdev)
{
	struct bio *bio;

	/* point the request at the new device... */
	rq->rq_disk = bdev->bd_disk;

	/* ...and retarget every bio packed into it as well */
	__rq_for_each_bio(bio, rq) {
		bio->bi_bdev = bdev;
	}
}

If there is such an interface and remapping drivers use it, then these
kind of issues may be avoided in the future.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58   ` Tejun Heo
  (?)
  (?)
@ 2010-09-03  6:04   ` Kiyoshi Ueda
  2010-09-03  9:42     ` Tejun Heo
  -1 siblings, 1 reply; 88+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03  6:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, snitzer, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

Hi Tejun,

On 08/30/2010 06:58 PM +0900, Tejun Heo wrote:
> Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
> against other bio's.  This patch relaxes ordering around flushes.
...
> * When dec_pending() detects that a flush has completed, it checks
>   whether the original bio has data.  If so, the bio is queued to the
>   deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
...
> @@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
>   */
>  static void queue_io(struct mapped_device *md, struct bio *bio)
>  {
> -	down_write(&md->io_lock);
> -
>  	spin_lock_irq(&md->deferred_lock);
>  	bio_list_add(&md->deferred, bio);
>  	spin_unlock_irq(&md->deferred_lock);
> -
> -	if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
> -		queue_work(md->wq, &md->work);
> -
> -	up_write(&md->io_lock);
> +	queue_work(md->wq, &md->work);
...
> @@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
...
> -		} else {
> -			end_io_acct(io);
> -			free_io(md, io);
> -
> -			if (io_error != DM_ENDIO_REQUEUE) {
> -				trace_block_bio_complete(md->queue, bio);
> -
> -				bio_endio(bio, io_error);
> -			}
> +			bio->bi_rw &= ~REQ_FLUSH;
> +			queue_io(md, bio);

dec_pending() is called during I/O completion, where the caller may
have interrupts disabled.
So if you use queue_io() inside dec_pending(), the spin_lock must be
taken/released with irqsave/irqrestore, as in the patch below.

BTW, lockdep detects the issue and a warning like below is displayed.
It may break the underlying drivers.

=================================
[ INFO: inconsistent lock state ]
2.6.36-rc2+ #2
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
kworker/0:1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
 (&(&q->__queue_lock)->rlock){?.-...}, at: [<ffffffff811be844>] blk_end_bidi_request+0x44/0x80
{IN-HARDIRQ-W} state was registered at:
  [<ffffffff81080266>] __lock_acquire+0x8c6/0xb30
  [<ffffffff81080570>] lock_acquire+0xa0/0x120
  [<ffffffff8138953e>] _raw_spin_lock_irqsave+0x4e/0x70
  [<ffffffffa00e095a>] ata_qc_schedule_eh+0x5a/0xa0 [libata]
  [<ffffffffa00d37e7>] ata_qc_complete+0x147/0x1f0 [libata]
  [<ffffffffa00e3af2>] ata_hsm_qc_complete+0xc2/0x140 [libata]
  [<ffffffffa00e3d45>] ata_sff_hsm_move+0x1d5/0x700 [libata]
  [<ffffffffa00e4323>] __ata_sff_port_intr+0xb3/0x100 [libata]
  [<ffffffffa00e4bff>] ata_bmdma_port_intr+0x3f/0x120 [libata]
  [<ffffffffa00e2735>] ata_bmdma_interrupt+0x195/0x1e0 [libata]
  [<ffffffff810a6b14>] handle_IRQ_event+0x54/0x170
  [<ffffffff810a8fb8>] handle_edge_irq+0xc8/0x170
  [<ffffffff8100561b>] handle_irq+0x4b/0xa0
  [<ffffffff8139169f>] do_IRQ+0x6f/0xf0
  [<ffffffff8138a093>] ret_from_intr+0x0/0x16
  [<ffffffff81389ca3>] _raw_spin_unlock+0x23/0x40
  [<ffffffff81133ea2>] sys_dup3+0x122/0x1a0
  [<ffffffff81133f43>] sys_dup2+0x23/0xb0
  [<ffffffff81002eb2>] system_call_fastpath+0x16/0x1b
irq event stamp: 14660913
hardirqs last  enabled at (14660912): [<ffffffff81389c65>] _raw_spin_unlock_irqrestore+0x65/0x80
hardirqs last disabled at (14660913): [<ffffffff8138951e>] _raw_spin_lock_irqsave+0x2e/0x70
softirqs last  enabled at (14660874): [<ffffffff810530ae>] __do_softirq+0x14e/0x210
softirqs last disabled at (14660879): [<ffffffff81003d9c>] call_softirq+0x1c/0x50

other info that might help us debug this:
1 lock held by kworker/0:1/0:
 #0:  (&(&q->__queue_lock)->rlock){?.-...}, at: [<ffffffff811be844>] blk_end_bidi_request+0x44/0x80

stack backtrace:
Pid: 0, comm: kworker/0:1 Not tainted 2.6.36-rc2+ #2
Call Trace:
 <IRQ>  [<ffffffff8107c386>] print_usage_bug+0x1a6/0x1f0
 [<ffffffff8107ca31>] mark_lock+0x661/0x690
 [<ffffffff8107de90>] ? check_usage_backwards+0x0/0xf0
 [<ffffffff8107cac0>] mark_held_locks+0x60/0x80
 [<ffffffff81389bf0>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff8107cb63>] trace_hardirqs_on_caller+0x83/0x1a0
 [<ffffffff8107cc8d>] trace_hardirqs_on+0xd/0x10
 [<ffffffff81389bf0>] _raw_spin_unlock_irq+0x30/0x40
 [<ffffffffa0292e0e>] ? queue_io+0x2e/0x90 [dm_mod]
 [<ffffffffa0292e37>] queue_io+0x57/0x90 [dm_mod]
 [<ffffffffa02932fa>] dec_pending+0x22a/0x320 [dm_mod]
 [<ffffffffa0293125>] ? dec_pending+0x55/0x320 [dm_mod]
 [<ffffffffa029366d>] clone_endio+0xad/0xc0 [dm_mod]
 [<ffffffff81150d1d>] bio_endio+0x1d/0x40
 [<ffffffff811bd181>] req_bio_endio+0x81/0xf0
 [<ffffffff811bd42d>] blk_update_request+0x23d/0x460
 [<ffffffff811bd306>] ? blk_update_request+0x116/0x460
 [<ffffffff811bd677>] blk_update_bidi_request+0x27/0x80
 [<ffffffff811be490>] __blk_end_bidi_request+0x20/0x50
 [<ffffffff811be4df>] __blk_end_request_all+0x1f/0x40
 [<ffffffff811c3b40>] blk_flush_complete_seq+0x140/0x1a0
 [<ffffffff811c3c79>] pre_flush_end_io+0x39/0x50
 [<ffffffff811be265>] blk_finish_request+0x85/0x290
 [<ffffffff811be852>] blk_end_bidi_request+0x52/0x80
 [<ffffffff811bfa3f>] blk_end_request_all+0x1f/0x40
 [<ffffffffa02941bd>] dm_softirq_done+0xad/0x120 [dm_mod]
 [<ffffffff811c6646>] blk_done_softirq+0x86/0xa0
 [<ffffffff81053036>] __do_softirq+0xd6/0x210
 [<ffffffff81003d9c>] call_softirq+0x1c/0x50
 [<ffffffff81005705>] do_softirq+0x95/0xd0
 [<ffffffff81052f4d>] irq_exit+0x4d/0x60
 [<ffffffff813916a8>] do_IRQ+0x78/0xf0
 [<ffffffff8138a093>] ret_from_intr+0x0/0x16
 <EOI>  [<ffffffff8100b639>] ? mwait_idle+0x79/0xe0
 [<ffffffff8100b630>] ? mwait_idle+0x70/0xe0
 [<ffffffff81001c36>] cpu_idle+0x66/0xe0
 [<ffffffff81380e91>] ? start_secondary+0x181/0x1f0
 [<ffffffff81380e9f>] start_secondary+0x18f/0x1f0

Thanks,
Kiyoshi Ueda


Now queue_io() is called from dec_pending(), which may be called with
interrupts disabled.
So queue_io() must not enable interrupts unconditionally and must
save/restore the current interrupt status.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 drivers/md/dm.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: misc/drivers/md/dm.c
===================================================================
--- misc.orig/drivers/md/dm.c
+++ misc/drivers/md/dm.c
@@ -512,9 +512,11 @@ static void end_io_acct(struct dm_io *io
  */
 static void queue_io(struct mapped_device *md, struct bio *bio)
 {
-	spin_lock_irq(&md->deferred_lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&md->deferred_lock, flags);
 	bio_list_add(&md->deferred, bio);
-	spin_unlock_irq(&md->deferred_lock);
+	spin_unlock_irqrestore(&md->deferred_lock, flags);
 	queue_work(md->wq, &md->work);
 }
 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03  5:47             ` Kiyoshi Ueda
@ 2010-09-03  9:33               ` Tejun Heo
  2010-09-03 10:28                 ` Kiyoshi Ueda
  0 siblings, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-03  9:33 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

On 09/03/2010 07:47 AM, Kiyoshi Ueda wrote:
> Ah, I see, thank you for the quick fix!
> I confirmed no panic occurs with this patch.
> 
> Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>

Great, thanks for testing.

> By the way, I had been considering a block-layer interface which remaps
> struct request and its bios to a block device such as:
> void blk_remap_request(struct request *rq, struct block_device *bdev)
> {
> 	rq->rq_disk = bdev->bd_disk;
> 
> 	__rq_for_each_bio(bio, rq) {
> 		bio->bi_bdev = bdev;
> 	}
> }
> 
> If there is such an interface and remapping drivers use it, then these
> kind of issues may be avoided in the future.

I think the problem is more with request initialization.  After all,
once bios are packed into a request, they are (or at least should be)
just data containers.  We now have multiple request init paths in
block layer and different ones initialize different subsets and it's
not very clear which fields are supposed to be initialized to what by
whom.

But yeah I agree removing discrepancy between request and bio would be
nice to have too.  It's not really remapping tho.  Maybe just
blk_set_rq_q() or something like that (it should also set rq->q)?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-09-03  6:04   ` Kiyoshi Ueda
@ 2010-09-03  9:42     ` Tejun Heo
  0 siblings, 0 replies; 88+ messages in thread
From: Tejun Heo @ 2010-09-03  9:42 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: jaxboe, snitzer, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

Hello,

On 09/03/2010 08:04 AM, Kiyoshi Ueda wrote:
> Now queue_io() is called from dec_pending(), which may be called with
> interrupts disabled.
> So queue_io() must not enable interrupts unconditionally and must
> save/restore the current interrupt status.
> 
> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>

Patch included into the series.  Thanks a lot!

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03  9:33               ` Tejun Heo
@ 2010-09-03 10:28                 ` Kiyoshi Ueda
  2010-09-03 11:42                   ` Tejun Heo
  0 siblings, 1 reply; 88+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03 10:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/03/2010 06:33 PM +0900, Tejun Heo wrote:
> On 09/03/2010 07:47 AM, Kiyoshi Ueda wrote:
>> By the way, I had been considering a block-layer interface which remaps
>> struct request and its bios to a block device, such as:
>> void blk_remap_request(struct request *rq, struct block_device *bdev)
>> {
>> 	struct bio *bio;
>>
>> 	rq->rq_disk = bdev->bd_disk;
>>
>> 	__rq_for_each_bio(bio, rq) {
>> 		bio->bi_bdev = bdev;
>> 	}
>> }
>>
>> If there is such an interface and remapping drivers use it, then this
>> kind of issue may be avoided in the future.
> 
> I think the problem is more with request initialization.  After all,
> once bios are packed into a request, they are (or at least should be)
> just data containers.  We now have multiple request init paths in the
> block layer, different ones initialize different subsets of the fields,
> and it's not very clear which fields are supposed to be initialized to
> what by whom.
> 
> But yeah, I agree removing the discrepancy between request and bio would
> be nice to have too.  It's not really remapping tho.  Maybe just
> blk_set_rq_q() or something like that (it should also set rq->q)?

Thank you for pointing that out.
Yes, the interface should also set rq->q.

About the naming of the interface, blk_set_<something> sounds
reasonable to me.
But does blk_set_rq_q() take a request and a queue as arguments?
If so, I'm afraid we can't find the bdev for the bios from the given queue.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03 10:28                 ` Kiyoshi Ueda
@ 2010-09-03 11:42                   ` Tejun Heo
  2010-09-03 11:51                     ` Kiyoshi Ueda
  0 siblings, 1 reply; 88+ messages in thread
From: Tejun Heo @ 2010-09-03 11:42 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

On 09/03/2010 12:28 PM, Kiyoshi Ueda wrote:
> Thank you for pointing that out.
> Yes, the interface should also set rq->q.
> 
> About the naming of the interface, blk_set_<something> sounds
> reasonable to me.
> But does blk_set_rq_q() take a request and a queue as arguments?
> If so, I'm afraid we can't find the bdev for the bios from the given queue.

Oh, right.  Maybe blk_set_rq_bdev() then?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03 11:42                   ` Tejun Heo
@ 2010-09-03 11:51                     ` Kiyoshi Ueda
  0 siblings, 0 replies; 88+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03 11:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/03/2010 08:42 PM +0900, Tejun Heo wrote:
> Hello,
> 
> On 09/03/2010 12:28 PM, Kiyoshi Ueda wrote:
>> Thank you for pointing that out.
>> Yes, the interface should also set rq->q.
>>
>> About the naming of the interface, blk_set_<something> sounds
>> reasonable to me.
>> But does blk_set_rq_q() take a request and a queue as arguments?
>> If so, I'm afraid we can't find the bdev for the bios from the given queue.
> 
> Oh, right.  Maybe blk_set_rq_bdev() then?

Yeah, although struct request doesn't have a 'bdev' member,
blk_set_rq_bdev() may be the better name given the arguments it takes.

Thanks,
Kiyoshi Ueda
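
A minimal sketch of such a helper, assuming only the semantics settled on
above (take a request and a block_device, set rq->q and rq->rq_disk, and
remap every bio); no actual patch for it was posted in this thread:

void blk_set_rq_bdev(struct request *rq, struct block_device *bdev)
{
	struct bio *bio;

	/* retarget the request itself */
	rq->q = bdev_get_queue(bdev);
	rq->rq_disk = bdev->bd_disk;

	/* ... and every bio packed into it */
	__rq_for_each_bio(bio, rq)
		bio->bi_bdev = bdev;
}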

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-02  3:22                         ` Mike Snitzer
  2010-09-02 10:24                           ` Tejun Heo
@ 2010-09-09 15:26                           ` Mike Snitzer
  2010-09-09 15:44                             ` Ryan Harper
  1 sibling, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 15:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mikulas Patocka, dm-devel, Vivek Goyal, ryanh, john.cooper,
	rusty, hch, kvm

[-- Attachment #1: Type: text/plain, Size: 4471 bytes --]

On Wed, Sep 01 2010 at 11:22pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, Sep 01 2010 at  2:59pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > My hope was that the request-based deadlock I'm seeing would disappear
> > if that relaxed ordering patch wasn't applied.  Unfortunately, I still
> > see the hang.
> 
> Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
> FLUSH+FUA patches)!
> 
> I'll try to pin-point the root cause but I think my test is somehow
> exposing a bug in my virt setup.

[my virt setup == single kvm guest (RHEL6) with F13 host]

My gut turned out to be correct.  I finally tracked down the regression
point to the following commit (cc'ing appropriate people):

commit a5eb9e4ff18a33e43557d44b205f953b0c1efade
Author: Ryan Harper <ryanh@us.ibm.com>
Date:   Wed Jun 23 22:19:57 2010 -0500

    virtio_blk: Add 'serial' attribute to virtio-blk devices (v2)
    
    Create a new attribute for virtio-blk devices that will fetch the serial number
    of the block device.  This attribute can be used by udev to create disk/by-id
    symlinks for devices that don't have a UUID (filesystem) associated with them.
    
    ATA_IDENTIFY strings are special in that they can be up to 20 chars long
    and aren't required to be nul-terminated.  The buffer is also zero-padded
    meaning that if the serial is 19 chars or less we get a nul-terminated
    string.  When copying this value into a string buffer, we must be careful to
    copy up to the nul (if it is present), or only 20 bytes if it is longer, and
    not attempt to nul-terminate; this isn't needed.
    
    Changes since v1:
    - Added BUILD_BUG_ON() for PAGE_SIZE check
    - Removed min() since BUILD_BUG_ON() handles the check
    - Replaced serial_sysfs() by copying id directly to buffer
    
    Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
    Signed-off-by: john cooper <john.cooper@redhat.com>
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
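
(A minimal sketch of the copy rule described in that commit message,
assuming the driver's 20-byte VIRTIO_BLK_ID_BYTES buffer; this is only an
illustration of the rule, not the driver code itself.  strnlen() stops at
an embedded nul or at 20 bytes, whichever comes first, and no terminator
is appended:

	/* copy a zero-padded, possibly unterminated identify string
	 * without adding a nul terminator of our own */
	size_t len = strnlen(id, VIRTIO_BLK_ID_BYTES);

	memcpy(buf, id, len);
	return len;

where 'id' is the raw identify buffer and 'buf' is the destination.)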

So the first released kernel to have this regression is 2.6.36-rc1.

Some background:
I have been working with Tejun to test the barrier to FLUSH+FUA
conversion patchset.  I crafted the attached script to test the DM
changes that are part of the FLUSH+FUA patchset.

Using this script with:
while true ; do ./test_dm_discard_mpath_scsi_debug.sh ; done

I can reliably trigger the following hang, always on the 5th iteration
in my testing, IFF commit a5eb9e4ff18a33e43557d44b205f953b0c1efade is
applied:

INFO: task lvcreate:2484 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
lvcreate      D 0000000100064871  4960  2484   2350 0x00000080
 ffff88007b87b978 0000000000000046 ffff88007b87b8e8 ffff880000000000
 ffff88007b87bfd8 ffff8800724fa400 00000000001d4040 ffff88007b87bfd8
 00000000001d4040 00000000001d4040 00000000001d4040 00000000001d4040
Call Trace:
 [<ffffffff8136de23>] io_schedule+0x73/0xb5
 [<ffffffff811b6882>] get_request_wait+0xf2/0x180
 [<ffffffff8105d8da>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811b6deb>] __make_request+0x310/0x434
 [<ffffffff811b5442>] generic_make_request+0x2f1/0x36e
 [<ffffffff81062f78>] ? cpu_clock+0x43/0x5e
 [<ffffffff811b559d>] submit_bio+0xde/0xfb
 [<ffffffff8106e459>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81129332>] dio_bio_submit+0x7b/0x9c
 [<ffffffff8112939d>] dio_send_cur_page+0x4a/0xb0
 [<ffffffff81129f1c>] __blockdev_direct_IO_newtrunc+0x7c5/0x97d
 [<ffffffff81127f4f>] blkdev_direct_IO+0x57/0x59
 [<ffffffff81127080>] ? blkdev_get_blocks+0x0/0x90
 [<ffffffff810c2eee>] generic_file_aio_read+0xed/0x5b4
 [<ffffffff810d70d4>] ? might_fault+0x5c/0xac
 [<ffffffff810242bd>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff81100813>] do_sync_read+0xcb/0x108
 [<ffffffff8136e5ad>] ? __mutex_unlock_slowpath+0x119/0x12b
 [<ffffffff8106e428>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff8106e459>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff8118cae7>] ? security_file_permission+0x16/0x18
 [<ffffffff81100e7a>] vfs_read+0xab/0x108
 [<ffffffff8106e428>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81100f97>] sys_read+0x4a/0x6e
 [<ffffffff81002bf2>] system_call_fastpath+0x16/0x1b
no locks held by lvcreate/2484.


lvcreate is just the first victim (sometimes it is vgcreate).  But if
the guest is left running, other new processes get hung with comparable
traces (w/ get_request_wait), until eventually the guest is completely
unresponsive.

Mike

[-- Attachment #2: test_dm_discard_mpath_scsi_debug.sh --]
[-- Type: application/x-sh, Size: 907 bytes --]

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 15:26                           ` [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm] Mike Snitzer
@ 2010-09-09 15:44                             ` Ryan Harper
  2010-09-09 15:57                               ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Ryan Harper @ 2010-09-09 15:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, ryanh,
	john.cooper, rusty, hch, kvm

* Mike Snitzer <snitzer@redhat.com> [2010-09-09 10:29]:
> On Wed, Sep 01 2010 at 11:22pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > On Wed, Sep 01 2010 at  2:59pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > My hope was that the request-based deadlock I'm seeing would disappear
> > > if that relaxed ordering patch wasn't applied.  Unfortunately, I still
> > > see the hang.
> > 
> > Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
> > FLUSH+FUA patches)!
> > 
> > I'll try to pin-point the root cause but I think my test is somehow
> > exposing a bug in my virt setup.
> 
> [my virt setup == single kvm guest (RHEL6) with F13 host]

What's your kvm guest command line?  And the guest is using stock RHEL6
kernel?  What KVM userspace are you using on the host?  What comes with
F13 or some updated version?


> 
> My gut turned out to be correct.  I finally tracked down the regression
> point to the following commit (cc'ing appropriate people):
> 
> commit a5eb9e4ff18a33e43557d44b205f953b0c1efade
> Author: Ryan Harper <ryanh@us.ibm.com>
> Date:   Wed Jun 23 22:19:57 2010 -0500
> 
>     virtio_blk: Add 'serial' attribute to virtio-blk devices (v2)
>     
>     Create a new attribute for virtio-blk devices that will fetch the serial number
>     of the block device.  This attribute can be used by udev to create disk/by-id
>     symlinks for devices that don't have a UUID (filesystem) associated with them.
>     
>     ATA_IDENTIFY strings are special in that they can be up to 20 chars long
>     and aren't required to be nul-terminated.  The buffer is also zero-padded
>     meaning that if the serial is 19 chars or less we get a nul-terminated
>     string.  When copying this value into a string buffer, we must be careful to
>     copy up to the nul (if it is present), or only 20 bytes if it is longer, and
>     not attempt to nul-terminate; this isn't needed.
>     
>     Changes since v1:
>     - Added BUILD_BUG_ON() for PAGE_SIZE check
>     - Removed min() since BUILD_BUG_ON() handles the check
>     - Replaced serial_sysfs() by copying id directly to buffer
>     
>     Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>     Signed-off-by: john cooper <john.cooper@redhat.com>
>     Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 
> So the first released kernel to have this regression is 2.6.36-rc1.
> 
> Some background:
> I have been working with Tejun to test the barrier to FLUSH+FUA
> conversion patchset.  I crafted the attached script to test the DM
> changes that are part of the FLUSH+FUA patchset.
> 
> Using this script with:
> while true ; do ./test_dm_discard_mpath_scsi_debug.sh ; done
> 
> I can reliably trigger the following hang, always on the 5th iteration
> in my testing, IFF commit a5eb9e4ff18a33e43557d44b205f953b0c1efade is
> applied:
> 
> INFO: task lvcreate:2484 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> lvcreate      D 0000000100064871  4960  2484   2350 0x00000080
>  ffff88007b87b978 0000000000000046 ffff88007b87b8e8 ffff880000000000
>  ffff88007b87bfd8 ffff8800724fa400 00000000001d4040 ffff88007b87bfd8
>  00000000001d4040 00000000001d4040 00000000001d4040 00000000001d4040
> Call Trace:
>  [<ffffffff8136de23>] io_schedule+0x73/0xb5
>  [<ffffffff811b6882>] get_request_wait+0xf2/0x180
>  [<ffffffff8105d8da>] ? autoremove_wake_function+0x0/0x39
>  [<ffffffff811b6deb>] __make_request+0x310/0x434
>  [<ffffffff811b5442>] generic_make_request+0x2f1/0x36e
>  [<ffffffff81062f78>] ? cpu_clock+0x43/0x5e
>  [<ffffffff811b559d>] submit_bio+0xde/0xfb
>  [<ffffffff8106e459>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff81129332>] dio_bio_submit+0x7b/0x9c
>  [<ffffffff8112939d>] dio_send_cur_page+0x4a/0xb0
>  [<ffffffff81129f1c>] __blockdev_direct_IO_newtrunc+0x7c5/0x97d
>  [<ffffffff81127f4f>] blkdev_direct_IO+0x57/0x59
>  [<ffffffff81127080>] ? blkdev_get_blocks+0x0/0x90
>  [<ffffffff810c2eee>] generic_file_aio_read+0xed/0x5b4
>  [<ffffffff810d70d4>] ? might_fault+0x5c/0xac
>  [<ffffffff810242bd>] ? pvclock_clocksource_read+0x50/0xb9
>  [<ffffffff81100813>] do_sync_read+0xcb/0x108
>  [<ffffffff8136e5ad>] ? __mutex_unlock_slowpath+0x119/0x12b
>  [<ffffffff8106e428>] ? trace_hardirqs_on_caller+0x11d/0x141
>  [<ffffffff8106e459>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff8118cae7>] ? security_file_permission+0x16/0x18
>  [<ffffffff81100e7a>] vfs_read+0xab/0x108
>  [<ffffffff8106e428>] ? trace_hardirqs_on_caller+0x11d/0x141
>  [<ffffffff81100f97>] sys_read+0x4a/0x6e
>  [<ffffffff81002bf2>] system_call_fastpath+0x16/0x1b
> no locks held by lvcreate/2484.
> 
> 
> lvcreate is just the first victim (sometimes it is vgcreate).  But if
> the guest is left running, other new processes get hung with comparable
> traces (w/ get_request_wait), until eventually the guest is completely
> unresponsive.
> 
> Mike



-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 15:44                             ` Ryan Harper
@ 2010-09-09 15:57                               ` Mike Snitzer
  2010-09-09 16:03                                 ` Ryan Harper
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 15:57 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, john.cooper,
	rusty, hch, kvm

On Thu, Sep 09 2010 at 11:44am -0400,
Ryan Harper <ryanh@us.ibm.com> wrote:

> * Mike Snitzer <snitzer@redhat.com> [2010-09-09 10:29]:
> > On Wed, Sep 01 2010 at 11:22pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > On Wed, Sep 01 2010 at  2:59pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > > My hope was that the request-based deadlock I'm seeing would disappear
> > > > if that relaxed ordering patch wasn't applied.  Unfortunately, I still
> > > > see the hang.
> > > 
> > > Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
> > > FLUSH+FUA patches)!
> > > 
> > > I'll try to pin-point the root cause but I think my test is somehow
> > > exposing a bug in my virt setup.
> > 
> > [my virt setup == single kvm guest (RHEL6) with F13 host]
> 
> What's your kvm guest command line?

I assume you mean qemu-kvm commandline:

/usr/bin/qemu-kvm -S -M pc-0.11 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name rhel6.x86_64 -uuid 9129e4e4-15d3-00e2-e9de-2c28a29feb52 -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/rhel6.x86_64.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot cd -drive file=/var/lib/libvirt/images/rhel6.x86_64.img,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/var/lib/libvirt/images/boot.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -device virtio-net-pci,vlan=0,id=net0,mac=54:52:00:70:e8:23,bus=pci.0,addr=0x6 -net tap,fd=49,vlan=0,name=hostnet0
-chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3

I'm using virtio-blk w/ cache=none for the root device.  virtio-blk
isn't used for any other devices in the guest.

Here is the guest's kernel commandline (not that it is interesting):
ro root=UUID=e0236db2-5a38-4d48-8bf5-55675671dee6 console=ttyS0 rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us rd_plytheme=charge crashkernel=auto

> And the guest is using stock RHEL6 kernel?

No, the guest is using the upstream kernel.org kernel.  Hence my report
of an upstream regression.  The guest is using RHEL6 userspace (udev in
all its glory, etc).

> What KVM userspace are you using on the host?  What comes with
> F13 or some updated version?

Just what comes with F13.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 15:57                               ` Mike Snitzer
@ 2010-09-09 16:03                                 ` Ryan Harper
  2010-09-09 17:55                                   ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Ryan Harper @ 2010-09-09 16:03 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ryan Harper, Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal,
	john.cooper, rusty, hch, kvm

* Mike Snitzer <snitzer@redhat.com> [2010-09-09 10:58]:
> On Thu, Sep 09 2010 at 11:44am -0400,
> Ryan Harper <ryanh@us.ibm.com> wrote:
> 
> > * Mike Snitzer <snitzer@redhat.com> [2010-09-09 10:29]:
> > > On Wed, Sep 01 2010 at 11:22pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > 
> > > > On Wed, Sep 01 2010 at  2:59pm -0400,
> > > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > > 
> > > > > My hope was that the request-based deadlock I'm seeing would disappear
> > > > > if that relaxed ordering patch wasn't applied.  Unfortunately, I still
> > > > > see the hang.
> > > > 
> > > > Turns out I can reproduce the hang on a stock 2.6.36-rc3 (without _any_
> > > > FLUSH+FUA patches)!
> > > > 
> > > > I'll try to pin-point the root cause but I think my test is somehow
> > > > exposing a bug in my virt setup.
> > > 
> > > [my virt setup == single kvm guest (RHEL6) with F13 host]
> > 
> > What's your kvm guest command line?
> 
> I assume you mean qemu-kvm commandline:
> 
> /usr/bin/qemu-kvm -S -M pc-0.11 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name rhel6.x86_64 -uuid 9129e4e4-15d3-00e2-e9de-2c28a29feb52 -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/rhel6.x86_64.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot cd -drive file=/var/lib/libvirt/images/rhel6.x86_64.img,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/var/lib/libvirt/images/boot.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -device virtio-net-pci,vlan=0,id=net0,mac=54:52:00:70:e8:23,bus=pci.0,addr=0x6 -net tap,fd=49,vlan=0,name=hostnet0
> -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
> 
> I'm using virtio-blk w/ cache=none for the root device.  virtio-blk
> isn't used for any other devices in the guest.

And you don't have any other disks in the guest (I see just the root and
the cdrom), so the lv stuff is happening against some sort of dummy target?


> 
> Here is the guest's kernel commandline (not that it is interesting):
> ro root=UUID=e0236db2-5a38-4d48-8bf5-55675671dee6 console=ttyS0 rhgb quiet SYSFONT=latarcyrheb-sun16 LANG=en_US.UTF-8 KEYTABLE=us rd_plytheme=charge crashkernel=auto
> 
> > And the guest is using stock RHEL6 kernel?
> 
> No, the guest is using the upstream kernel.org kernel.  Hence my report
> of an upstream regression.  The guest is using RHEL6 userspace (udev in
> all its glory, etc).


> 
> > What KVM userspace are you using on the host?  What comes with
> > F13 or some updated version?
> 
> Just what comes with F13.
> 
> Mike

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 16:03                                 ` Ryan Harper
@ 2010-09-09 17:55                                   ` Mike Snitzer
  2010-09-09 18:35                                     ` Ryan Harper
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 17:55 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, john.cooper,
	rusty, hch, kvm

On Thu, Sep 09 2010 at 12:03pm -0400,
Ryan Harper <ryanh@us.ibm.com> wrote:

> * Mike Snitzer <snitzer@redhat.com> [2010-09-09 10:58]:

> > I'm using virtio-blk w/ cache=none for the root device.  virtio-blk
> > isn't used for any other devices in the guest.
> 
> And you don't have any other disks in the guest (I see just the root and
> the cdrom), so the lv stuff is happening against some sort of dummy target?

Correct.  I have used variants of the script I provided against both
scsi-debug devices and iscsi devices in the guest.  The script I shared
uses multipath on scsi-debug (ram-based) devices.

That script causes udev to run its various callouts via multipath and
LVM (both packages, upstream and RHEL6, now use udev).

I have verified that I no longer get the hang if I switch the root
device from virtio to ide.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 17:55                                   ` Mike Snitzer
@ 2010-09-09 18:35                                     ` Ryan Harper
  2010-09-09 19:15                                       ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Ryan Harper @ 2010-09-09 18:35 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ryan Harper, Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal,
	john.cooper, rusty, hch, kvm

* Mike Snitzer <snitzer@redhat.com> [2010-09-09 12:56]:
> On Thu, Sep 09 2010 at 12:03pm -0400,
> Ryan Harper <ryanh@us.ibm.com> wrote:
> 
> > * Mike Snitzer <snitzer@redhat.com> [2010-09-09 10:58]:
> 
> > > I'm using virtio-blk w/ cache=none for the root device.  virtio-blk
> > > isn't used for any other devices in the guest.
> > 
> > And you don't have any other disks in the guest (I see just the root and
> > the cdrom), so the lv stuff is happening against some sort of dummy target?
> 
> Correct.  I have used variants of the script I provided against both
> scsi-debug devices and iscsi devices in the guest.  The script I shared
> uses multipath on scsi-debug (ram-based) devices.
> 
> That script causes udev to run its various callouts via multipath and
> LVM (both packages, upstream and RHEL6, now use udev).
> 
> I have verified that I no longer get the hang if I switch the root
> device from virtio to ide.

And in the failing case, do you see:

/sys/block/vda/serial 

attribute in sysfs?


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 18:35                                     ` Ryan Harper
@ 2010-09-09 19:15                                       ` Mike Snitzer
  2010-09-09 19:43                                         ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 19:15 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, john.cooper,
	rusty, hch, kvm

On Thu, Sep 09 2010 at  2:35pm -0400,
Ryan Harper <ryanh@us.ibm.com> wrote:

> * Mike Snitzer <snitzer@redhat.com> [2010-09-09 12:56]:

> > I have verified that I no longer get the hang if I switch the root
> > device from virtio to ide.
> 
> And in the failing case, do you see:
> 
> /sys/block/vda/serial 
> 
> attribute in sysfs?

[root@rhel6 ~]# ls -al /sys/block/vda/serial 
-r--r--r-- 1 root root 4096 Sep  9 15:07 /sys/block/vda/serial
[root@rhel6 ~]# cat /sys/block/vda/serial
[root@rhel6 ~]# 

If I try to access 'serial' once the reproducer script has hung, the cat
also hangs:

cat           D 00000000fffe5d48  5664  2386   2049 0x00000080
 ffff880075129ce8 0000000000000046 ffff880075129c88 ffff880000000000
 ffff880075129fd8 ffff8800758aa400 00000000001d4040 ffff880075129fd8
 00000000001d4040 00000000001d4040 00000000001d4040 00000000001d4040
Call Trace:
 [<ffffffff8136de23>] io_schedule+0x73/0xb5
 [<ffffffff811b6882>] get_request_wait+0xf2/0x180
 [<ffffffff8105d8da>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811b6951>] blk_get_request+0x41/0x71
 [<ffffffff811b69ad>] blk_make_request+0x2c/0x8b
 [<ffffffffa0016073>] virtblk_get_id+0x57/0x93 [virtio_blk]
 [<ffffffffa00160d0>] virtblk_serial_show+0x21/0x4d [virtio_blk]
 [<ffffffff81265f7c>] dev_attr_show+0x27/0x4e
 [<ffffffff81157f0b>] ? sysfs_read_file+0x94/0x17f
 [<ffffffff810c6d6b>] ? __get_free_pages+0x18/0x55
 [<ffffffff81157f34>] sysfs_read_file+0xbd/0x17f
 [<ffffffff81100e7a>] vfs_read+0xab/0x108
 [<ffffffff8106e428>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81100f97>] sys_read+0x4a/0x6e
 [<ffffffff81002bf2>] system_call_fastpath+0x16/0x1b


Taking a step back: why did you make the inability to add the 'serial'
attribute such a hard failure?

Strikes me as worthy of a non-fatal warning at most.  The serial
attribute is added fine in this instance but I'm just curious.

Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 19:15                                       ` Mike Snitzer
@ 2010-09-09 19:43                                         ` Mike Snitzer
  2010-09-09 20:14                                           ` Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 19:43 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, john.cooper,
	rusty, hch, kvm

On Thu, Sep 09 2010 at  3:15pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Sep 09 2010 at  2:35pm -0400,
> Ryan Harper <ryanh@us.ibm.com> wrote:
> 
> > * Mike Snitzer <snitzer@redhat.com> [2010-09-09 12:56]:
> 
> > > I have verified that I no longer get the hang if I switch the root
> > > device from virtio to ide.
> > 
> > And in the failing case, do you see:
> > 
> > /sys/block/vda/serial 
> > 
> > attribute in sysfs?

Interestingly, just this loop:

while true ; do cat /sys/block/vda/serial && date && sleep 1 ; done
Thu Sep  9 15:29:30 EDT 2010
...
Thu Sep  9 15:31:19 EDT 2010

caused the following hang:
INFO: task cat:1825 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cat           D 000000010000e784  5664  1825   1467 0x00000080
 ffff88007bbcdce8 0000000000000046 ffff88007bbcdc88 ffff880000000000
 ffff88007bbcdfd8 ffff88007b42a400 00000000001d4040 ffff88007bbcdfd8
 00000000001d4040 00000000001d4040 00000000001d4040 00000000001d4040
Call Trace:
 [<ffffffff8136de23>] io_schedule+0x73/0xb5
 [<ffffffff811b6882>] get_request_wait+0xf2/0x180
 [<ffffffff8105d8da>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811b6951>] blk_get_request+0x41/0x71
 [<ffffffff811b69ad>] blk_make_request+0x2c/0x8b
 [<ffffffffa0016073>] virtblk_get_id+0x57/0x93 [virtio_blk]
 [<ffffffffa00160d0>] virtblk_serial_show+0x21/0x4d [virtio_blk]
 [<ffffffff81265f7c>] dev_attr_show+0x27/0x4e
 [<ffffffff81157f0b>] ? sysfs_read_file+0x94/0x17f
 [<ffffffff810c6d6b>] ? __get_free_pages+0x18/0x55
 [<ffffffff81157f34>] sysfs_read_file+0xbd/0x17f
 [<ffffffff81100e7a>] vfs_read+0xab/0x108
 [<ffffffff8106e428>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff81100f97>] sys_read+0x4a/0x6e
 [<ffffffff81002bf2>] system_call_fastpath+0x16/0x1b
2 locks held by cat/1825:
 #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff81157eaf>] sysfs_read_file+0x38/0x17f
 #1:  (s_active#14){.+.+.+}, at: [<ffffffff81157f0b>] sysfs_read_file+0x94/0x17f

So it seems like the virtio requests aren't being properly cleaned up?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 19:43                                         ` Mike Snitzer
@ 2010-09-09 20:14                                           ` Mike Snitzer
  2010-09-09 20:30                                             ` Ryan Harper
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 20:14 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, john.cooper,
	rusty, hch, kvm

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

On Thu, Sep 09 2010 at  3:43pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> Interestingly, just this loop:
> 
> while true ; do cat /sys/block/vda/serial && date && sleep 1 ; done
> Thu Sep  9 15:29:30 EDT 2010
> ...
> Thu Sep  9 15:31:19 EDT 2010
> 
> caused the following hang:
...
> So it seems like the virtio requests aren't being properly cleaned up?

Yeap, here is the result with the attached debug patch that Vivek wrote
last week to help chase this issue (which adds 'nr_requests_used').  We
thought the mpath device might be leaking requests; concern for other
devices wasn't on our radar:

# cat /sys/block/vda/queue/nr_requests
128

# while true ; do cat /sys/block/vda/queue/nr_requests_used && cat /sys/block/vda/serial && date && sleep 1 ; done
10
Thu Sep  9 16:04:40 EDT 2010
11
Thu Sep  9 16:04:41 EDT 2010
...
Thu Sep  9 16:06:38 EDT 2010
127
Thu Sep  9 16:06:39 EDT 2010
128

I'll have a quick look at the virtio-blk code to see if I can spot where
the request isn't getting cleaned up.  But I welcome others to have a
look too (I've already spent entirely way too much time on this issue).

Mike

[-- Attachment #2: export-nr-requests-throuth-sysfs.patch --]
[-- Type: text/plain, Size: 1650 bytes --]

---
 block/blk-sysfs.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Index: linux-2.6/block/blk-sysfs.c
===================================================================
--- linux-2.6.orig/block/blk-sysfs.c	2010-09-01 09:23:55.000000000 -0400
+++ linux-2.6/block/blk-sysfs.c	2010-09-01 17:55:50.000000000 -0400
@@ -36,6 +36,19 @@ static ssize_t queue_requests_show(struc
 	return queue_var_show(q->nr_requests, (page));
 }
 
+static ssize_t queue_requests_used_show(struct request_queue *q, char *page)
+{
+	struct request_list *rl = &q->rq;
+
+	printk("Vivek: count[sync]=%d count[async]=%d"
+		" congestion_on_thres=%d queue_congestion_off_threshold=%d\n",
+		rl->count[BLK_RW_SYNC], rl->count[BLK_RW_ASYNC],
+		queue_congestion_on_threshold(q),
+		queue_congestion_off_threshold(q));
+
+	return queue_var_show(rl->count[BLK_RW_SYNC], (page));
+}
+
 static ssize_t
 queue_requests_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -266,6 +279,11 @@ static struct queue_sysfs_entry queue_re
 	.store = queue_requests_store,
 };
 
+static struct queue_sysfs_entry queue_requests_used_entry = {
+	.attr = {.name = "nr_requests_used", .mode = S_IRUGO | S_IWUSR },
+	.show = queue_requests_used_show,
+};
+
 static struct queue_sysfs_entry queue_ra_entry = {
 	.attr = {.name = "read_ahead_kb", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_ra_show,
@@ -371,6 +389,7 @@ static struct queue_sysfs_entry queue_ra
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
+	&queue_requests_used_entry.attr,
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm]
  2010-09-09 20:14                                           ` Mike Snitzer
@ 2010-09-09 20:30                                             ` Ryan Harper
  2010-09-09 21:00                                               ` [PATCH] virtio-blk: put request that was created to retrieve the device id Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Ryan Harper @ 2010-09-09 20:30 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ryan Harper, Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal,
	john.cooper, rusty, hch, kvm

* Mike Snitzer <snitzer@redhat.com> [2010-09-09 15:15]:
> On Thu, Sep 09 2010 at  3:43pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
> > Interestingly, just this loop:
> > 
> > while true ; do cat /sys/block/vda/serial && date && sleep 1 ; done
> > Thu Sep  9 15:29:30 EDT 2010
> > ...
> > Thu Sep  9 15:31:19 EDT 2010
> > 
> > caused the following hang:
> ...
> > So it seems like the virtio requests aren't being properly cleaned up?
> 
> Yeap, here is the result with the attached debug patch that Vivek wrote
> last week to help chase this issue (which adds 'nr_requests_used').  We
> thought the mpath device might be leaking requests; concern for other
> devices wasn't on our radar:
> 
> # cat /sys/block/vda/queue/nr_requests
> 128
> 
> # while true ; do cat /sys/block/vda/queue/nr_requests_used && cat /sys/block/vda/serial && date && sleep 1 ; done
> 10
> Thu Sep  9 16:04:40 EDT 2010
> 11
> Thu Sep  9 16:04:41 EDT 2010
> ...
> Thu Sep  9 16:06:38 EDT 2010
> 127
> Thu Sep  9 16:06:39 EDT 2010
> 128
> 
> I'll have a quick look at the virtio-blk code to see if I can spot where
> the request isn't getting cleaned up.  But I welcome others to have a
> look too (I've already spent entirely way too much time on this issue).

The qemu on the host isn't new enough to handle the request.  This
serial attribute should have had a feature bit with it (it did at one
point in one of the previous forms of the virtio-blk serial patch
series, but it isn't present now), so that we don't expose the attribute
unless the backend can handle the request type.
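
As a rough sketch of that kind of guard (the VIRTIO_BLK_F_GET_ID feature
bit named here is hypothetical; no such bit exists in the driver at this
point), virtblk_probe() would only create the attribute when the host
advertises support for the request:

	/* hypothetical feature bit: only expose 'serial' when the host
	 * side can actually service the identify request */
	if (virtio_has_feature(vdev, VIRTIO_BLK_F_GET_ID))
		err = device_create_file(disk_to_dev(vblk->disk),
					 &dev_attr_serial);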

For immediate relief, it's probably easiest to revert the kernel-side
commit (or comment out the device_create_file() call after add_disk() in
virtblk_probe()).


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [PATCH] virtio-blk: put request that was created to retrieve the device id
  2010-09-09 20:30                                             ` Ryan Harper
@ 2010-09-09 21:00                                               ` Mike Snitzer
  2010-09-09 21:15                                                 ` Christoph Hellwig
  2010-10-09  1:41                                                 ` [PATCH] " Rusty Russell
  0 siblings, 2 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-09-09 21:00 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal, john.cooper,
	rusty, hch, kvm

On Thu, Sep 09 2010 at  4:30pm -0400,
Ryan Harper <ryanh@us.ibm.com> wrote:

> * Mike Snitzer <snitzer@redhat.com> [2010-09-09 15:15]:
> > On Thu, Sep 09 2010 at  3:43pm -0400,
> > Mike Snitzer <snitzer@redhat.com> wrote:
> > 
> > > Interestingly, just this loop:
> > > 
> > > while true ; do cat /sys/block/vda/serial && date && sleep 1 ; done
> > > Thu Sep  9 15:29:30 EDT 2010
> > > ...
> > > Thu Sep  9 15:31:19 EDT 2010
> > > 
> > > caused the following hang:
> > ...
> > > So it seems like the virtio requests aren't being properly cleaned up?
> > 
> > Yeap, here is the result with the attached debug patch that Vivek wrote
> > last week to help chase this issue (which adds 'nr_requests_used').  We
> > thought the mpath device might be leaking requests; concern for other
> > devices wasn't on our radar:
> > 
> > # cat /sys/block/vda/queue/nr_requests
> > 128
> > 
> > # while true ; do cat /sys/block/vda/queue/nr_requests_used && cat /sys/block/vda/serial && date && sleep 1 ; done
> > 10
> > Thu Sep  9 16:04:40 EDT 2010
> > 11
> > Thu Sep  9 16:04:41 EDT 2010
> > ...
> > Thu Sep  9 16:06:38 EDT 2010
> > 127
> > Thu Sep  9 16:06:39 EDT 2010
> > 128
> > 
> > I'll have a quick look at the virtio-blk code to see if I can spot where
> > the request isn't getting cleaned up.  But I welcome others to have a
> > look too (I've already spent entirely way too much time on this issue).
> 
> The qemu on the host isn't new enough to handle the request.  This
> serial attribute should have had a feature bit with it (it did at one
> point in one of the previous forms of the virtio-blk serial patch
> series, but it isn't present now), so that we don't expose the attribute
> unless the backend can handle the request type.

Be that as it may, it doesn't change the fact that the request created
in virtblk_get_id (via blk_make_request) isn't being properly cleaned
up.
 
> For immediate relief, it's probably easiest to revert the kernel-side
> commit (or comment out the device_create_file() call after add_disk() in
> virtblk_probe()).

This patch fixes the issue for me; Rusty and/or Christoph please
review/advise.


From: Mike Snitzer <snitzer@redhat.com>
Subject: virtio-blk: put request that was created to retrieve the device id

Must drop reference taken by blk_make_request().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/block/virtio_blk.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1260628..831e75c 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -199,6 +199,7 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
 	struct virtio_blk *vblk = disk->private_data;
 	struct request *req;
 	struct bio *bio;
+	int err;
 
 	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
 			   GFP_KERNEL);
@@ -212,7 +213,10 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
 	}
 
 	req->cmd_type = REQ_TYPE_SPECIAL;
-	return blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+	err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+	blk_put_request(req);
+
+	return err;
 }
 
 static int virtblk_locked_ioctl(struct block_device *bdev, fmode_t mode,

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [PATCH] virtio-blk: put request that was created to retrieve the device id
  2010-09-09 21:00                                               ` [PATCH] virtio-blk: put request that was created to retrieve the device id Mike Snitzer
@ 2010-09-09 21:15                                                 ` Christoph Hellwig
  2010-09-17 14:58                                                   ` Ryan Harper
  2010-10-09  1:41                                                 ` [PATCH] " Rusty Russell
  1 sibling, 1 reply; 88+ messages in thread
From: Christoph Hellwig @ 2010-09-09 21:15 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Ryan Harper, Tejun Heo, Mikulas Patocka, dm-devel, Vivek Goyal,
	john.cooper, rusty, hch, kvm

On Thu, Sep 09, 2010 at 05:00:42PM -0400, Mike Snitzer wrote:
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 1260628..831e75c 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -199,6 +199,7 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
>  	struct virtio_blk *vblk = disk->private_data;
>  	struct request *req;
>  	struct bio *bio;
> +	int err;
>  
>  	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
>  			   GFP_KERNEL);
> @@ -212,7 +213,10 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
>  	}
>  
>  	req->cmd_type = REQ_TYPE_SPECIAL;
> -	return blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
> +	err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
> +	blk_put_request(req);

This looks correct as far as the request is concerned, but we're still
leaking the bio.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] virtio-blk: put request that was created to retrieve the device id
  2010-09-09 21:15                                                 ` Christoph Hellwig
@ 2010-09-17 14:58                                                   ` Ryan Harper
  2010-09-21 21:00                                                     ` Christoph Hellwig
  0 siblings, 1 reply; 88+ messages in thread
From: Ryan Harper @ 2010-09-17 14:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mike Snitzer, Ryan Harper, Tejun Heo, Mikulas Patocka, dm-devel,
	Vivek Goyal, john.cooper, rusty, kvm

* Christoph Hellwig <hch@infradead.org> [2010-09-09 16:18]:
> On Thu, Sep 09, 2010 at 05:00:42PM -0400, Mike Snitzer wrote:
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 1260628..831e75c 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -199,6 +199,7 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
> >  	struct virtio_blk *vblk = disk->private_data;
> >  	struct request *req;
> >  	struct bio *bio;
> > +	int err;
> >  
> >  	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
> >  			   GFP_KERNEL);
> > @@ -212,7 +213,10 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
> >  	}
> >  
> >  	req->cmd_type = REQ_TYPE_SPECIAL;
> > -	return blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
> > +	err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
> > +	blk_put_request(req);
> 
> This looks correct as far as the request is concerned, but we're still
> leaking the bio.

Since __bio_map_kern() sets up bio->bi_end_io = bio_map_kern_endio
(which does a bio_put(bio)) doesn't that ensure we don't leak?

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] virtio-blk: put request that was created to retrieve the device id
  2010-09-17 14:58                                                   ` Ryan Harper
@ 2010-09-21 21:00                                                     ` Christoph Hellwig
  2010-10-08 16:06                                                       ` [2.6.36 REGRESSION] " Mike Snitzer
  0 siblings, 1 reply; 88+ messages in thread
From: Christoph Hellwig @ 2010-09-21 21:00 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Christoph Hellwig, Mike Snitzer, Tejun Heo, Mikulas Patocka,
	dm-devel, Vivek Goyal, john.cooper, rusty, kvm

On Fri, Sep 17, 2010 at 09:58:48AM -0500, Ryan Harper wrote:
> Since __bio_map_kern() sets up bio->bi_end_io = bio_map_kern_endio
> (which does a bio_put(bio)) doesn't that ensure we don't leak?

Indeed, that should take care of it.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* [2.6.36 REGRESSION] Re: virtio-blk: put request that was created to retrieve the device id
  2010-09-21 21:00                                                     ` Christoph Hellwig
@ 2010-10-08 16:06                                                       ` Mike Snitzer
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Snitzer @ 2010-10-08 16:06 UTC (permalink / raw)
  To: rusty
  Cc: Christoph Hellwig, Ryan Harper, Tejun Heo, Mikulas Patocka,
	dm-devel, Vivek Goyal, john.cooper, kvm, linux-kernel

Hi Rusty,

On Tue, Sep 21 2010 at  5:00pm -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Sep 17, 2010 at 09:58:48AM -0500, Ryan Harper wrote:
> > Since __bio_map_kern() sets up bio->bi_end_io = bio_map_kern_endio
> > (which does a bio_put(bio)) doesn't that ensure we don't leak?
> 
> Indeed, that should take care of it.
> 

We need to fix this regression for 2.6.36.  Not sure what else I need to
do to get this on your radar.  It's a pretty significant show-stopper
for 2.6.36 guests that use virtio-blk with a udev-enabled distro.

Here is a reference to my original patch (which Ryan and hch have both
reviewed): https://patchwork.kernel.org/patch/165571/

Thanks,
Mike

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [PATCH] virtio-blk: put request that was created to retrieve the device id
  2010-09-09 21:00                                               ` [PATCH] virtio-blk: put request that was created to retrieve the device id Mike Snitzer
  2010-09-09 21:15                                                 ` Christoph Hellwig
@ 2010-10-09  1:41                                                 ` Rusty Russell
  1 sibling, 0 replies; 88+ messages in thread
From: Rusty Russell @ 2010-10-09  1:41 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ryan Harper, dm-devel, kvm, john.cooper

On Fri, 10 Sep 2010 06:30:42 am Mike Snitzer wrote:
> On Thu, Sep 09 2010 at  4:30pm -0400,
> Ryan Harper <ryanh@us.ibm.com> wrote:
> 
> > * Mike Snitzer <snitzer@redhat.com> [2010-09-09 15:15]:
> > > On Thu, Sep 09 2010 at  3:43pm -0400,
> > > Mike Snitzer <snitzer@redhat.com> wrote:
> > > # while true ; do cat /sys/block/vda/queue/nr_requests_used && cat /sys/block/vda/serial && date && sleep 1 ; done
> > > 10
> > > Thu Sep  9 16:04:40 EDT 2010
> > > 11
...
> > The qemu on the host isn't new enough to handle the request.  This
> > serial attribute should have had a feature bit with it (it did at one
> > point in one of the previous forms of the virtio-blk serial patch
> > series, but it isn't present now), so that we don't expose the attribute
> > unless the backend can handle the request type.
> 
> Be that as it may, it doesn't change the fact that the request created
> in virtblk_get_id (via blk_make_request) isn't being properly cleaned
> up.

Thanks for re-sending, Mike.

This patch confused me at first, but it's correct.  Took me a few
minutes of checking though.

For those not familiar with the block layer, here are the key points:

1) blk_execute_rq waits for the request to finish.
2) blk_execute_rq grabs its own reference to the req.
3) Once qemu finishes with it and sends an interrupt, blk_done()
   releases that reference via __blk_end_request_all().
4) As the caller of blk_make_request, it is our responsibility to
   free it after it's finished, i.e. after blk_execute_rq.

> This patch fixes the issue for me; Rusty and/or Christoph please
> review/advise.

Thanks, applied, and CC'd stable@kernel.org (it's in 2.6.35 as well).

From: Mike Snitzer <snitzer@redhat.com>
Subject: virtio-blk: fix request leak.

Must drop reference taken by blk_make_request().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org # .35.x
---
 drivers/block/virtio_blk.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 1260628..831e75c 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -199,6 +199,7 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
 	struct virtio_blk *vblk = disk->private_data;
 	struct request *req;
 	struct bio *bio;
+	int err;
 
 	bio = bio_map_kern(vblk->disk->queue, id_str, VIRTIO_BLK_ID_BYTES,
 			   GFP_KERNEL);
@@ -212,7 +213,10 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
 	}
 
 	req->cmd_type = REQ_TYPE_SPECIAL;
-	return blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+	err = blk_execute_rq(vblk->disk->queue, vblk->disk, req, false);
+	blk_put_request(req);
+
+	return err;
 }
 
 static int virtblk_locked_ioctl(struct block_device *bdev, fmode_t mode,

^ permalink raw reply related	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2010-10-09  1:41 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58 ` [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58   ` Tejun Heo
2010-09-01 15:30   ` Christoph Hellwig
2010-08-30  9:58 ` [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58   ` Tejun Heo
2010-09-01 13:43   ` Mike Snitzer
2010-09-01 13:50     ` Tejun Heo
2010-09-01 13:54       ` Mike Snitzer
2010-09-01 13:56         ` Tejun Heo
2010-08-30  9:58 ` [PATCH 3/5] dm: relax ordering of bio-based flush implementation Tejun Heo
2010-08-30  9:58   ` Tejun Heo
2010-09-01 13:51   ` Mike Snitzer
2010-09-01 13:51     ` Mike Snitzer
2010-09-01 13:56     ` Tejun Heo
2010-09-01 13:56       ` Tejun Heo
2010-09-03  6:04   ` Kiyoshi Ueda
2010-09-03  9:42     ` Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58 ` [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
2010-08-30  9:58   ` Tejun Heo
2010-08-30 13:28   ` Mike Snitzer
2010-08-30 13:59     ` Tejun Heo
2010-08-30 13:59     ` Tejun Heo
2010-08-30 15:07       ` Tejun Heo
2010-08-30 15:07       ` Tejun Heo
2010-08-30 19:08         ` Mike Snitzer
2010-08-30 19:08         ` Mike Snitzer
2010-08-30 21:28           ` Mike Snitzer
2010-08-31 10:29             ` Tejun Heo
2010-08-31 13:02               ` Mike Snitzer
2010-08-31 13:14                 ` Tejun Heo
2010-08-30 15:42       ` [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH Tejun Heo
2010-08-30 15:42       ` Tejun Heo
2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
2010-08-30 15:45       ` Tejun Heo
2010-08-30 19:18         ` Mike Snitzer
2010-08-30 19:18         ` Mike Snitzer
2010-09-01  7:15         ` Kiyoshi Ueda
2010-09-01 12:25           ` Mike Snitzer
2010-09-02 13:22           ` Tejun Heo
2010-09-02 13:32             ` Tejun Heo
2010-09-03  5:46             ` Kiyoshi Ueda
2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
2010-09-03  5:47             ` Kiyoshi Ueda
2010-09-03  9:33               ` Tejun Heo
2010-09-03 10:28                 ` Kiyoshi Ueda
2010-09-03 11:42                   ` Tejun Heo
2010-09-03 11:51                     ` Kiyoshi Ueda
     [not found]         ` <20100830194731.GA10702@redhat.com>
2010-09-01 10:31           ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Mikulas Patocka
2010-09-01 11:20             ` Tejun Heo
2010-09-01 12:12               ` Mikulas Patocka
2010-09-01 12:42                 ` Tejun Heo
2010-09-01 12:54                   ` Mike Snitzer
2010-09-01 15:20                 ` Mike Snitzer
2010-09-01 15:35                   ` Mikulas Patocka
2010-09-01 17:07                     ` Mike Snitzer
2010-09-01 18:59                       ` Mike Snitzer
2010-09-02  3:22                         ` Mike Snitzer
2010-09-02 10:24                           ` Tejun Heo
2010-09-02 15:11                             ` Mike Snitzer
2010-09-09 15:26                           ` [REGRESSION][BISECTED] virtio-blk serial attribute causes guest to hang [Was: Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm] Mike Snitzer
2010-09-09 15:44                             ` Ryan Harper
2010-09-09 15:57                               ` Mike Snitzer
2010-09-09 16:03                                 ` Ryan Harper
2010-09-09 17:55                                   ` Mike Snitzer
2010-09-09 18:35                                     ` Ryan Harper
2010-09-09 19:15                                       ` Mike Snitzer
2010-09-09 19:43                                         ` Mike Snitzer
2010-09-09 20:14                                           ` Mike Snitzer
2010-09-09 20:30                                             ` Ryan Harper
2010-09-09 21:00                                               ` [PATCH] virtio-blk: put request that was created to retrieve the device id Mike Snitzer
2010-09-09 21:15                                                 ` Christoph Hellwig
2010-09-17 14:58                                                   ` Ryan Harper
2010-09-21 21:00                                                     ` Christoph Hellwig
2010-10-08 16:06                                                       ` [2.6.36 REGRESSION] " Mike Snitzer
2010-10-09  1:41                                                 ` [PATCH] " Rusty Russell
2010-08-30 13:28   ` [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Mike Snitzer
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58 ` [PATCH 5/5] block: remove the WRITE_BARRIER flag Tejun Heo
2010-08-30  9:58 ` Tejun Heo
2010-08-30  9:58   ` Tejun Heo
2010-08-30  9:58 ` Tejun Heo
