* [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2
@ 2010-08-30  9:58 Tejun Heo
  2010-08-30  9:58 ` [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags Tejun Heo
                   ` (4 more replies)
  0 siblings, 5 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

This is the second take of block-dm-finish-REQ_FLUSH-FUA-conversion.
I've put patches on top of the previous patches for easier review.
The series can be trivially reordered so that the order is more
logical.  Jens, please let me know if you want it to be reordered.

The dm conversion is _lightly_ tested.  Please proceed with caution.
Let's hold off merging these bits until dm people can verify the
conversion is correct.

  0001-block-make-__blk_rq_prep_clone-copy-most-command-fla.patch
  0002-dm-implement-REQ_FLUSH-FUA-support-for-bio-based-dm.patch
  0003-dm-relax-ordering-of-bio-based-flush-implementation.patch
  0004-dm-implement-REQ_FLUSH-FUA-support-for-request-based.patch
  0005-block-remove-the-WRITE_BARRIER-flag.patch

Differences from the previous attempt[1] are,

* bio-based dm and request-based dm patches are split.  0002-0003
  convert bio-based dm and 0004 converts request-based dm.

* The previous request-based dm conversion was broken in the way it
  sequenced requests.  0004 rips out the special multi-target handling
  for flushes and handles them the same way as other requests.

This patchset is on top of the "block, fs: replace HARDBARRIER with
FLUSH/FUA" patchset[2] and is available in the following git tree

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua

and contains the following changes.

 block/blk-core.c                |    4 
 drivers/md/dm-crypt.c           |    2 
 drivers/md/dm-io.c              |   20 --
 drivers/md/dm-log.c             |    2 
 drivers/md/dm-raid1.c           |    8 
 drivers/md/dm-region-hash.c     |   16 -
 drivers/md/dm-snap-persistent.c |    2 
 drivers/md/dm-snap.c            |    6 
 drivers/md/dm-stripe.c          |    2 
 drivers/md/dm.c                 |  394 ++++++++--------------------------------
 include/linux/blk_types.h       |    1 
 include/linux/fs.h              |    3 
 12 files changed, 104 insertions(+), 356 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.raid/29344
[2] http://thread.gmane.org/gmane.linux.kernel/1022363


* [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags
  2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
@ 2010-08-30  9:58 ` Tejun Heo
  2010-09-01 15:30   ` Christoph Hellwig
  2010-08-30  9:58 ` [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm Tejun Heo
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Tejun Heo

Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
There's no reason to omit the other command flags, and REQ_FUA needs to
be copied to implement FUA support in request-based dm.

REQ_COMMON_MASK, which specifies the flags to be copied from bio to
request, already identifies all the command flags.  Define
REQ_CLONE_MASK to be the same as REQ_COMMON_MASK for clarity and make
__blk_rq_prep_clone() copy all flags in the mask.
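
For example, worked through by hand for a FUA write (an illustration of
the change below, not code from the patch):

	/*
	 * src->cmd_flags == REQ_WRITE | REQ_SYNC | REQ_FUA
	 *
	 * before: dst->cmd_flags = rq_data_dir(src) | REQ_NOMERGE
	 *         => REQ_WRITE | REQ_NOMERGE         (REQ_SYNC, REQ_FUA lost)
	 * after:  dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE
	 *         => REQ_WRITE | REQ_SYNC | REQ_FUA | REQ_NOMERGE
	 */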

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c          |    4 +---
 include/linux/blk_types.h |    1 +
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 495bdc4..2a5b192 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2505,9 +2505,7 @@ EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
 static void __blk_rq_prep_clone(struct request *dst, struct request *src)
 {
 	dst->cpu = src->cpu;
-	dst->cmd_flags = (rq_data_dir(src) | REQ_NOMERGE);
-	if (src->cmd_flags & REQ_DISCARD)
-		dst->cmd_flags |= REQ_DISCARD;
+	dst->cmd_flags = (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
 	dst->cmd_type = src->cmd_type;
 	dst->__sector = blk_rq_pos(src);
 	dst->__data_len = blk_rq_bytes(src);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 1797994..36edadf 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,6 +168,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_HARDBARRIER | REQ_SYNC | \
 	 REQ_META | REQ_DISCARD | REQ_NOIDLE | REQ_FLUSH | REQ_FUA)
+#define REQ_CLONE_MASK		REQ_COMMON_MASK
 
 #define REQ_UNPLUG		(1 << __REQ_UNPLUG)
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
-- 
1.7.1



* [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
  2010-08-30  9:58 ` [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags Tejun Heo
@ 2010-08-30  9:58 ` Tejun Heo
  2010-09-01 13:43   ` Mike Snitzer
  2010-08-30  9:58 ` [PATCH 3/5] dm: relax ordering of bio-based flush implementation Tejun Heo
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Tejun Heo, dm-devel

This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
the now-deprecated REQ_HARDBARRIER.

* -EOPNOTSUPP handling logic dropped.

* Preflush is handled as before but postflush is dropped and replaced
  with passing down REQ_FUA to member request_queues.  This replaces
  one array-wide cache flush with member-specific FUA writes (see the
  sequence sketch after this list).

* __split_and_process_bio() now calls __clone_and_map_flush() directly
  for flushes and guarantees all FLUSH bio's going to targets are zero
  length.

* It's now guaranteed that all FLUSH bio's which are passed onto dm
  targets are zero length.  bio_empty_barrier() tests are replaced
  with REQ_FLUSH tests.

* Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.

* Dropped unlikely() around REQ_FLUSH tests.  Flushes are not unlikely
  enough to be marked with unlikely().

* Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
  doesn't support cache flushing.  Advertise REQ_FLUSH | REQ_FUA
  capability.

* Request-based dm isn't converted yet.  dm_init_request_based_queue()
  resets flush support to 0 for now.  To avoid disturbing request-based
  dm code, md->flush_error is added for bio-based dm while
  request-based dm continues to use md->barrier_error.

Lightly tested linear, stripe, raid1, snap and crypt targets.  Please
proceed with caution as I'm not familiar with the code base.
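
In short, the sequence for a flush bio with data becomes (a rough
sketch of process_flush() in the patch below, not the verbatim code):

	/*
	 * REQ_FLUSH bio w/ data:
	 *
	 *   1. wait for in-flight ios to drain
	 *   2. clone and issue a zero-length WRITE_FLUSH to each target
	 *      (the preflush)
	 *   3. wait for the preflush to complete
	 *   4. if the bio is an empty flush or the preflush failed,
	 *      complete the original bio; otherwise clear REQ_FLUSH and
	 *      reissue the data bio, with REQ_FUA passed down to the
	 *      member queues in place of the old postflush
	 */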

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: dm-devel@redhat.com
Cc: Christoph Hellwig <hch@lst.de>
---
 drivers/md/dm-crypt.c           |    2 +-
 drivers/md/dm-io.c              |   20 +-----
 drivers/md/dm-log.c             |    2 +-
 drivers/md/dm-raid1.c           |    8 +-
 drivers/md/dm-region-hash.c     |   16 +++---
 drivers/md/dm-snap-persistent.c |    2 +-
 drivers/md/dm-snap.c            |    6 +-
 drivers/md/dm-stripe.c          |    2 +-
 drivers/md/dm.c                 |  119 +++++++++++++++++++--------------------
 9 files changed, 80 insertions(+), 97 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 368e8e9..d5b0e4c 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1278,7 +1278,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio,
 	struct dm_crypt_io *io;
 	struct crypt_config *cc;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		cc = ti->private;
 		bio->bi_bdev = cc->dev->bdev;
 		return DM_MAPIO_REMAPPED;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 0590c75..136d4f7 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -31,7 +31,6 @@ struct dm_io_client {
  */
 struct io {
 	unsigned long error_bits;
-	unsigned long eopnotsupp_bits;
 	atomic_t count;
 	struct task_struct *sleeper;
 	struct dm_io_client *client;
@@ -130,11 +129,8 @@ static void retrieve_io_and_region_from_bio(struct bio *bio, struct io **io,
  *---------------------------------------------------------------*/
 static void dec_count(struct io *io, unsigned int region, int error)
 {
-	if (error) {
+	if (error)
 		set_bit(region, &io->error_bits);
-		if (error == -EOPNOTSUPP)
-			set_bit(region, &io->eopnotsupp_bits);
-	}
 
 	if (atomic_dec_and_test(&io->count)) {
 		if (io->sleeper)
@@ -310,8 +306,8 @@ static void do_region(int rw, unsigned region, struct dm_io_region *where,
 	sector_t remaining = where->count;
 
 	/*
-	 * where->count may be zero if rw holds a write barrier and we
-	 * need to send a zero-sized barrier.
+	 * where->count may be zero if rw holds a flush and we need to
+	 * send a zero-sized flush.
 	 */
 	do {
 		/*
@@ -364,7 +360,7 @@ static void dispatch_io(int rw, unsigned int num_regions,
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}
 
@@ -393,9 +389,7 @@ static int sync_io(struct dm_io_client *client, unsigned int num_regions,
 		return -EIO;
 	}
 
-retry:
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = current;
 	io->client = client;
@@ -412,11 +406,6 @@ retry:
 	}
 	set_current_state(TASK_RUNNING);
 
-	if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
-		rw &= ~REQ_HARDBARRIER;
-		goto retry;
-	}
-
 	if (error_bits)
 		*error_bits = io->error_bits;
 
@@ -437,7 +426,6 @@ static int async_io(struct dm_io_client *client, unsigned int num_regions,
 
 	io = mempool_alloc(client->pool, GFP_NOIO);
 	io->error_bits = 0;
-	io->eopnotsupp_bits = 0;
 	atomic_set(&io->count, 1); /* see dispatch_io() */
 	io->sleeper = NULL;
 	io->client = client;
diff --git a/drivers/md/dm-log.c b/drivers/md/dm-log.c
index 5a08be0..33420e6 100644
--- a/drivers/md/dm-log.c
+++ b/drivers/md/dm-log.c
@@ -300,7 +300,7 @@ static int flush_header(struct log_c *lc)
 		.count = 0,
 	};
 
-	lc->io_req.bi_rw = WRITE_BARRIER;
+	lc->io_req.bi_rw = WRITE_FLUSH;
 
 	return dm_io(&lc->io_req, 1, &null_location, NULL);
 }
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 7c081bc..19a59b0 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -259,7 +259,7 @@ static int mirror_flush(struct dm_target *ti)
 	struct dm_io_region io[ms->nr_mirrors];
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE_BARRIER,
+		.bi_rw = WRITE_FLUSH,
 		.mem.type = DM_IO_KMEM,
 		.mem.ptr.bvec = NULL,
 		.client = ms->io_client,
@@ -629,7 +629,7 @@ static void do_write(struct mirror_set *ms, struct bio *bio)
 	struct dm_io_region io[ms->nr_mirrors], *dest = io;
 	struct mirror *m;
 	struct dm_io_request io_req = {
-		.bi_rw = WRITE | (bio->bi_rw & WRITE_BARRIER),
+		.bi_rw = WRITE | (bio->bi_rw & WRITE_FLUSH_FUA),
 		.mem.type = DM_IO_BVEC,
 		.mem.ptr.bvec = bio->bi_io_vec + bio->bi_idx,
 		.notify.fn = write_callback,
@@ -670,7 +670,7 @@ static void do_writes(struct mirror_set *ms, struct bio_list *writes)
 	bio_list_init(&requeue);
 
 	while ((bio = bio_list_pop(writes))) {
-		if (unlikely(bio_empty_barrier(bio))) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			bio_list_add(&sync, bio);
 			continue;
 		}
@@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
 	 * We need to dec pending if this was a write.
 	 */
 	if (rw == WRITE) {
-		if (likely(!bio_empty_barrier(bio)))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			dm_rh_dec(ms->rh, map_context->ll);
 		return error;
 	}
diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
index bd5c58b..dad011a 100644
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -81,9 +81,9 @@ struct dm_region_hash {
 	struct list_head failed_recovered_regions;
 
 	/*
-	 * If there was a barrier failure no regions can be marked clean.
+	 * If there was a flush failure no regions can be marked clean.
 	 */
-	int barrier_failure;
+	int flush_failure;
 
 	void *context;
 	sector_t target_begin;
@@ -217,7 +217,7 @@ struct dm_region_hash *dm_region_hash_create(
 	INIT_LIST_HEAD(&rh->quiesced_regions);
 	INIT_LIST_HEAD(&rh->recovered_regions);
 	INIT_LIST_HEAD(&rh->failed_recovered_regions);
-	rh->barrier_failure = 0;
+	rh->flush_failure = 0;
 
 	rh->region_pool = mempool_create_kmalloc_pool(MIN_REGIONS,
 						      sizeof(struct dm_region));
@@ -399,8 +399,8 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
 	region_t region = dm_rh_bio_to_region(rh, bio);
 	int recovering = 0;
 
-	if (bio_empty_barrier(bio)) {
-		rh->barrier_failure = 1;
+	if (bio->bi_rw & REQ_FLUSH) {
+		rh->flush_failure = 1;
 		return;
 	}
 
@@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
 	struct bio *bio;
 
 	for (bio = bios->head; bio; bio = bio->bi_next) {
-		if (bio_empty_barrier(bio))
+		if (bio->bi_rw & REQ_FLUSH)
 			continue;
 		rh_inc(rh, dm_rh_bio_to_region(rh, bio));
 	}
@@ -555,9 +555,9 @@ void dm_rh_dec(struct dm_region_hash *rh, region_t region)
 		 */
 
 		/* do nothing for DM_RH_NOSYNC */
-		if (unlikely(rh->barrier_failure)) {
+		if (unlikely(rh->flush_failure)) {
 			/*
-			 * If a write barrier failed some time ago, we
+			 * If a write flush failed some time ago, we
 			 * don't know whether or not this write made it
 			 * to the disk, so we must resync the device.
 			 */
diff --git a/drivers/md/dm-snap-persistent.c b/drivers/md/dm-snap-persistent.c
index cc2bdb8..0b61792 100644
--- a/drivers/md/dm-snap-persistent.c
+++ b/drivers/md/dm-snap-persistent.c
@@ -687,7 +687,7 @@ static void persistent_commit_exception(struct dm_exception_store *store,
 	/*
 	 * Commit exceptions to disk.
 	 */
-	if (ps->valid && area_io(ps, WRITE_BARRIER))
+	if (ps->valid && area_io(ps, WRITE_FLUSH_FUA))
 		ps->valid = 0;
 
 	/*
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 5974d30..eed2101 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1587,7 +1587,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio,
 	chunk_t chunk;
 	struct dm_snap_pending_exception *pe = NULL;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		bio->bi_bdev = s->cow->bdev;
 		return DM_MAPIO_REMAPPED;
 	}
@@ -1691,7 +1691,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio,
 	int r = DM_MAPIO_REMAPPED;
 	chunk_t chunk;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		if (!map_context->target_request_nr)
 			bio->bi_bdev = s->origin->bdev;
 		else
@@ -2135,7 +2135,7 @@ static int origin_map(struct dm_target *ti, struct bio *bio,
 	struct dm_dev *dev = ti->private;
 	bio->bi_bdev = dev->bdev;
 
-	if (unlikely(bio_empty_barrier(bio)))
+	if (bio->bi_rw & REQ_FLUSH)
 		return DM_MAPIO_REMAPPED;
 
 	/* Only tell snapshots if this is a write */
diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
index c297f6d..f0371b4 100644
--- a/drivers/md/dm-stripe.c
+++ b/drivers/md/dm-stripe.c
@@ -271,7 +271,7 @@ static int stripe_map(struct dm_target *ti, struct bio *bio,
 	uint32_t stripe;
 	unsigned target_request_nr;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio->bi_rw & REQ_FLUSH) {
 		target_request_nr = map_context->target_request_nr;
 		BUG_ON(target_request_nr >= sc->stripes);
 		bio->bi_bdev = sc->stripe[target_request_nr].dev->bdev;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b1d92be..32e6622 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -144,15 +144,16 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the barrier request currently being processed.
+	 * An error from the flush request currently being processed.
 	 */
-	int barrier_error;
+	int flush_error;
 
 	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
 	spinlock_t barrier_error_lock;
+	int barrier_error;
 
 	/*
 	 * Processing queue (flush/barriers)
@@ -200,8 +201,8 @@ struct mapped_device {
 	/* sysfs handle */
 	struct kobject kobj;
 
-	/* zero-length barrier that will be cloned and submitted to targets */
-	struct bio barrier_bio;
+	/* zero-length flush that will be cloned and submitted to targets */
+	struct bio flush_bio;
 };
 
 /*
@@ -512,7 +513,7 @@ static void end_io_acct(struct dm_io *io)
 
 	/*
 	 * After this is decremented the bio must not be touched if it is
-	 * a barrier.
+	 * a flush.
 	 */
 	dm_disk(md)->part0.in_flight[rw] = pending =
 		atomic_dec_return(&md->pending[rw]);
@@ -626,7 +627,7 @@ static void dec_pending(struct dm_io *io, int error)
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
 			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_HARDBARRIER))
+				if (!(io->bio->bi_rw & REQ_FLUSH))
 					bio_list_add_head(&md->deferred,
 							  io->bio);
 			} else
@@ -638,20 +639,14 @@ static void dec_pending(struct dm_io *io, int error)
 		io_error = io->error;
 		bio = io->bio;
 
-		if (bio->bi_rw & REQ_HARDBARRIER) {
+		if (bio->bi_rw & REQ_FLUSH) {
 			/*
-			 * There can be just one barrier request so we use
+			 * There can be just one flush request so we use
 			 * a per-device variable for error reporting.
 			 * Note that you can't touch the bio after end_io_acct
-			 *
-			 * We ignore -EOPNOTSUPP for empty flush reported by
-			 * underlying devices. We assume that if the device
-			 * doesn't support empty barriers, it doesn't need
-			 * cache flushing commands.
 			 */
-			if (!md->barrier_error &&
-			    !(bio_empty_barrier(bio) && io_error == -EOPNOTSUPP))
-				md->barrier_error = io_error;
+			if (!md->flush_error)
+				md->flush_error = io_error;
 			end_io_acct(io);
 			free_io(md, io);
 		} else {
@@ -1119,7 +1114,7 @@ static void dm_bio_destructor(struct bio *bio)
 }
 
 /*
- * Creates a little bio that is just does part of a bvec.
+ * Creates a little bio that just does part of a bvec.
  */
 static struct bio *split_bvec(struct bio *bio, sector_t sector,
 			      unsigned short idx, unsigned int offset,
@@ -1134,7 +1129,7 @@ static struct bio *split_bvec(struct bio *bio, sector_t sector,
 
 	clone->bi_sector = sector;
 	clone->bi_bdev = bio->bi_bdev;
-	clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER;
+	clone->bi_rw = bio->bi_rw;
 	clone->bi_vcnt = 1;
 	clone->bi_size = to_bytes(len);
 	clone->bi_io_vec->bv_offset = offset;
@@ -1161,7 +1156,6 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
 
 	clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs);
 	__bio_clone(clone, bio);
-	clone->bi_rw &= ~REQ_HARDBARRIER;
 	clone->bi_destructor = dm_bio_destructor;
 	clone->bi_sector = sector;
 	clone->bi_idx = idx;
@@ -1225,7 +1219,7 @@ static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti,
 		__issue_target_request(ci, ti, request_nr, len);
 }
 
-static int __clone_and_map_empty_barrier(struct clone_info *ci)
+static int __clone_and_map_flush(struct clone_info *ci)
 {
 	unsigned target_nr = 0;
 	struct dm_target *ti;
@@ -1289,9 +1283,6 @@ static int __clone_and_map(struct clone_info *ci)
 	sector_t len = 0, max;
 	struct dm_target_io *tio;
 
-	if (unlikely(bio_empty_barrier(bio)))
-		return __clone_and_map_empty_barrier(ci);
-
 	if (unlikely(bio->bi_rw & REQ_DISCARD))
 		return __clone_and_map_discard(ci);
 
@@ -1383,11 +1374,11 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_HARDBARRIER))
+		if (!(bio->bi_rw & REQ_FLUSH))
 			bio_io_error(bio);
 		else
-			if (!md->barrier_error)
-				md->barrier_error = -EIO;
+			if (!md->flush_error)
+				md->flush_error = -EIO;
 		return;
 	}
 
@@ -1400,14 +1391,22 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	ci.sector_count = bio_sectors(bio);
-	if (unlikely(bio_empty_barrier(bio)))
+	if (!(bio->bi_rw & REQ_FLUSH))
+		ci.sector_count = bio_sectors(bio);
+	else {
+		/* all FLUSH bio's reaching here should be empty */
+		WARN_ON_ONCE(bio_has_data(bio));
 		ci.sector_count = 1;
+	}
 	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
-	while (ci.sector_count && !error)
-		error = __clone_and_map(&ci);
+	while (ci.sector_count && !error) {
+		if (!(bio->bi_rw & REQ_FLUSH))
+			error = __clone_and_map(&ci);
+		else
+			error = __clone_and_map_flush(&ci);
+	}
 
 	/* drop the extra reference count */
 	dec_pending(ci.io, error);
@@ -1492,11 +1491,11 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_unlock();
 
 	/*
-	 * If we're suspended or the thread is processing barriers
+	 * If we're suspended or the thread is processing flushes
 	 * we have to queue this io for later.
 	 */
 	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    unlikely(bio->bi_rw & REQ_HARDBARRIER)) {
+	    (bio->bi_rw & REQ_FLUSH)) {
 		up_read(&md->io_lock);
 
 		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
@@ -1940,6 +1939,7 @@ static void dm_init_md_queue(struct mapped_device *md)
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
+	blk_queue_flush(md->queue, REQ_FLUSH | REQ_FUA);
 }
 
 /*
@@ -2245,7 +2245,8 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_flush(md->queue, REQ_FLUSH);
+	/* no flush support for request based dm yet */
+	blk_queue_flush(md->queue, 0);
 
 	elv_register_queue(md->queue);
 
@@ -2406,41 +2407,35 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void dm_flush(struct mapped_device *md)
+static void process_flush(struct mapped_device *md, struct bio *bio)
 {
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->barrier_bio);
-	md->barrier_bio.bi_bdev = md->bdev;
-	md->barrier_bio.bi_rw = WRITE_BARRIER;
-	__split_and_process_bio(md, &md->barrier_bio);
+	md->flush_error = 0;
 
+	/* handle REQ_FLUSH */
 	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-}
 
-static void process_barrier(struct mapped_device *md, struct bio *bio)
-{
-	md->barrier_error = 0;
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+	__split_and_process_bio(md, &md->flush_bio);
 
-	dm_flush(md);
+	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
 
-	if (!bio_empty_barrier(bio)) {
-		__split_and_process_bio(md, bio);
-		/*
-		 * If the request isn't supported, don't waste time with
-		 * the second flush.
-		 */
-		if (md->barrier_error != -EOPNOTSUPP)
-			dm_flush(md);
+	/* if it's an empty flush or the preflush failed, we're done */
+	if (!bio_has_data(bio) || md->flush_error) {
+		if (md->flush_error != DM_ENDIO_REQUEUE)
+			bio_endio(bio, md->flush_error);
+		else {
+			spin_lock_irq(&md->deferred_lock);
+			bio_list_add_head(&md->deferred, bio);
+			spin_unlock_irq(&md->deferred_lock);
+		}
+		return;
 	}
 
-	if (md->barrier_error != DM_ENDIO_REQUEUE)
-		bio_endio(bio, md->barrier_error);
-	else {
-		spin_lock_irq(&md->deferred_lock);
-		bio_list_add_head(&md->deferred, bio);
-		spin_unlock_irq(&md->deferred_lock);
-	}
+	/* issue data + REQ_FUA */
+	bio->bi_rw &= ~REQ_FLUSH;
+	__split_and_process_bio(md, bio);
 }
 
 /*
@@ -2469,8 +2464,8 @@ static void dm_wq_work(struct work_struct *work)
 		if (dm_request_based(md))
 			generic_make_request(c);
 		else {
-			if (c->bi_rw & REQ_HARDBARRIER)
-				process_barrier(md, c);
+			if (c->bi_rw & REQ_FLUSH)
+				process_flush(md, c);
 			else
 				__split_and_process_bio(md, c);
 		}
-- 
1.7.1



* [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
  2010-08-30  9:58 ` [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags Tejun Heo
  2010-08-30  9:58 ` [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm Tejun Heo
@ 2010-08-30  9:58 ` Tejun Heo
  2010-09-01 13:51   ` Mike Snitzer
  2010-09-03  6:04   ` Kiyoshi Ueda
  2010-08-30  9:58 ` [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
  2010-08-30  9:58 ` [PATCH 5/5] block: remove the WRITE_BARRIER flag Tejun Heo
  4 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Tejun Heo

Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
against other bio's.  This patch relaxes ordering around flushes.

* A flush bio is no longer deferred to the workqueue directly.  It's
  processed like other bio's but __split_and_process_bio() uses
  md->flush_bio as the clone source.  md->flush_bio is initialized to
  an empty flush during md initialization and shared for all flushes.

* When dec_pending() detects that a flush has completed, it checks
  whether the original bio has data.  If so, the bio is queued to the
  deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.

* As flush sequencing is handled in the usual issue/completion path,
  dm_wq_work() no longer needs to handle flushes differently.  Now its
  only responsibility is re-issuing deferred bio's the same way as
  _dm_request() would.  The REQ_FLUSH handling logic, including
  process_flush(), is dropped.

* There's no reason for queue_io() and dm_wq_work() to write-lock
  md->io_lock.  queue_io() now only uses md->deferred_lock and
  dm_wq_work() read-locks md->io_lock.

* bio's no longer need to be queued on the deferred list while a flush
  is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.

This avoids stalling the device during flushes and simplifies the
implementation.
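
The resulting completion flow for the original bio in dec_pending(),
roughly (a summary of the hunk below, not the verbatim code):

	/*
	 * once all clones of an io have completed:
	 *
	 *   not a flush, or an empty flush -> bio_endio() the original bio
	 *   flush + data (preflush done)   -> clear REQ_FLUSH and queue_io()
	 *                                     the bio so the data part is
	 *                                     reissued via the normal path
	 */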

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  157 ++++++++++++++++---------------------------------------
 1 files changed, 45 insertions(+), 112 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 32e6622..e67c519 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -110,7 +110,6 @@ EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
 #define DMF_FREEING 3
 #define DMF_DELETING 4
 #define DMF_NOFLUSH_SUSPENDING 5
-#define DMF_QUEUE_IO_TO_THREAD 6
 
 /*
  * Work processed by per-device workqueue.
@@ -144,11 +143,6 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * An error from the flush request currently being processed.
-	 */
-	int flush_error;
-
-	/*
 	 * Protect barrier_error from concurrent endio processing
 	 * in request-based dm.
 	 */
@@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
  */
 static void queue_io(struct mapped_device *md, struct bio *bio)
 {
-	down_write(&md->io_lock);
-
 	spin_lock_irq(&md->deferred_lock);
 	bio_list_add(&md->deferred, bio);
 	spin_unlock_irq(&md->deferred_lock);
-
-	if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
-		queue_work(md->wq, &md->work);
-
-	up_write(&md->io_lock);
+	queue_work(md->wq, &md->work);
 }
 
 /*
@@ -626,11 +614,9 @@ static void dec_pending(struct dm_io *io, int error)
 			 * Target requested pushing back the I/O.
 			 */
 			spin_lock_irqsave(&md->deferred_lock, flags);
-			if (__noflush_suspending(md)) {
-				if (!(io->bio->bi_rw & REQ_FLUSH))
-					bio_list_add_head(&md->deferred,
-							  io->bio);
-			} else
+			if (__noflush_suspending(md))
+				bio_list_add_head(&md->deferred, io->bio);
+			else
 				/* noflush suspend was interrupted. */
 				io->error = -EIO;
 			spin_unlock_irqrestore(&md->deferred_lock, flags);
@@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
 
 		io_error = io->error;
 		bio = io->bio;
+		end_io_acct(io);
+		free_io(md, io);
+
+		if (io_error == DM_ENDIO_REQUEUE)
+			return;
 
-		if (bio->bi_rw & REQ_FLUSH) {
+		if (!(bio->bi_rw & REQ_FLUSH) || !bio->bi_size) {
+			trace_block_bio_complete(md->queue, bio);
+			bio_endio(bio, io_error);
+		} else {
 			/*
-			 * There can be just one flush request so we use
-			 * a per-device variable for error reporting.
-			 * Note that you can't touch the bio after end_io_acct
+			 * Preflush done for flush with data, reissue
+			 * without REQ_FLUSH.
 			 */
-			if (!md->flush_error)
-				md->flush_error = io_error;
-			end_io_acct(io);
-			free_io(md, io);
-		} else {
-			end_io_acct(io);
-			free_io(md, io);
-
-			if (io_error != DM_ENDIO_REQUEUE) {
-				trace_block_bio_complete(md->queue, bio);
-
-				bio_endio(bio, io_error);
-			}
+			bio->bi_rw &= ~REQ_FLUSH;
+			queue_io(md, bio);
 		}
 	}
 }
@@ -1369,21 +1351,17 @@ static int __clone_and_map(struct clone_info *ci)
  */
 static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 {
+	bool is_flush = bio->bi_rw & REQ_FLUSH;
 	struct clone_info ci;
 	int error = 0;
 
 	ci.map = dm_get_live_table(md);
 	if (unlikely(!ci.map)) {
-		if (!(bio->bi_rw & REQ_FLUSH))
-			bio_io_error(bio);
-		else
-			if (!md->flush_error)
-				md->flush_error = -EIO;
+		bio_io_error(bio);
 		return;
 	}
 
 	ci.md = md;
-	ci.bio = bio;
 	ci.io = alloc_io(md);
 	ci.io->error = 0;
 	atomic_set(&ci.io->io_count, 1);
@@ -1391,18 +1369,19 @@ static void __split_and_process_bio(struct mapped_device *md, struct bio *bio)
 	ci.io->md = md;
 	spin_lock_init(&ci.io->endio_lock);
 	ci.sector = bio->bi_sector;
-	if (!(bio->bi_rw & REQ_FLUSH))
+	ci.idx = bio->bi_idx;
+
+	if (!is_flush) {
+		ci.bio = bio;
 		ci.sector_count = bio_sectors(bio);
-	else {
-		/* all FLUSH bio's reaching here should be empty */
-		WARN_ON_ONCE(bio_has_data(bio));
+	} else {
+		ci.bio = &ci.md->flush_bio;
 		ci.sector_count = 1;
 	}
-	ci.idx = bio->bi_idx;
 
 	start_io_acct(ci.io);
 	while (ci.sector_count && !error) {
-		if (!(bio->bi_rw & REQ_FLUSH))
+		if (!is_flush)
 			error = __clone_and_map(&ci);
 		else
 			error = __clone_and_map_flush(&ci);
@@ -1490,22 +1469,14 @@ static int _dm_request(struct request_queue *q, struct bio *bio)
 	part_stat_add(cpu, &dm_disk(md)->part0, sectors[rw], bio_sectors(bio));
 	part_stat_unlock();
 
-	/*
-	 * If we're suspended or the thread is processing flushes
-	 * we have to queue this io for later.
-	 */
-	if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) ||
-	    (bio->bi_rw & REQ_FLUSH)) {
+	/* if we're suspended, we have to queue this io for later */
+	if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
 		up_read(&md->io_lock);
 
-		if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) &&
-		    bio_rw(bio) == READA) {
+		if (bio_rw(bio) != READA)
+			queue_io(md, bio);
+		else
 			bio_io_error(bio);
-			return 0;
-		}
-
-		queue_io(md, bio);
-
 		return 0;
 	}
 
@@ -2015,6 +1986,10 @@ static struct mapped_device *alloc_dev(int minor)
 	if (!md->bdev)
 		goto bad_bdev;
 
+	bio_init(&md->flush_bio);
+	md->flush_bio.bi_bdev = md->bdev;
+	md->flush_bio.bi_rw = WRITE_FLUSH;
+
 	/* Populate the mapping, nobody knows we exist yet */
 	spin_lock(&_minor_lock);
 	old_md = idr_replace(&_minor_idr, md, minor);
@@ -2407,37 +2382,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
 	return r;
 }
 
-static void process_flush(struct mapped_device *md, struct bio *bio)
-{
-	md->flush_error = 0;
-
-	/* handle REQ_FLUSH */
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	bio_init(&md->flush_bio);
-	md->flush_bio.bi_bdev = md->bdev;
-	md->flush_bio.bi_rw = WRITE_FLUSH;
-	__split_and_process_bio(md, &md->flush_bio);
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-
-	/* if it's an empty flush or the preflush failed, we're done */
-	if (!bio_has_data(bio) || md->flush_error) {
-		if (md->flush_error != DM_ENDIO_REQUEUE)
-			bio_endio(bio, md->flush_error);
-		else {
-			spin_lock_irq(&md->deferred_lock);
-			bio_list_add_head(&md->deferred, bio);
-			spin_unlock_irq(&md->deferred_lock);
-		}
-		return;
-	}
-
-	/* issue data + REQ_FUA */
-	bio->bi_rw &= ~REQ_FLUSH;
-	__split_and_process_bio(md, bio);
-}
-
 /*
  * Process the deferred bios
  */
@@ -2447,33 +2391,27 @@ static void dm_wq_work(struct work_struct *work)
 						work);
 	struct bio *c;
 
-	down_write(&md->io_lock);
+	down_read(&md->io_lock);
 
 	while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
 		spin_lock_irq(&md->deferred_lock);
 		c = bio_list_pop(&md->deferred);
 		spin_unlock_irq(&md->deferred_lock);
 
-		if (!c) {
-			clear_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
+		if (!c)
 			break;
-		}
 
-		up_write(&md->io_lock);
+		up_read(&md->io_lock);
 
 		if (dm_request_based(md))
 			generic_make_request(c);
-		else {
-			if (c->bi_rw & REQ_FLUSH)
-				process_flush(md, c);
-			else
-				__split_and_process_bio(md, c);
-		}
+		else
+			__split_and_process_bio(md, c);
 
-		down_write(&md->io_lock);
+		down_read(&md->io_lock);
 	}
 
-	up_write(&md->io_lock);
+	up_read(&md->io_lock);
 }
 
 static void dm_queue_flush(struct mapped_device *md)
@@ -2672,17 +2610,12 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	 *
 	 * To get all processes out of __split_and_process_bio in dm_request,
 	 * we take the write lock. To prevent any process from reentering
-	 * __split_and_process_bio from dm_request, we set
-	 * DMF_QUEUE_IO_TO_THREAD.
-	 *
-	 * To quiesce the thread (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND
-	 * and call flush_workqueue(md->wq). flush_workqueue will wait until
-	 * dm_wq_work exits and DMF_BLOCK_IO_FOR_SUSPEND will prevent any
-	 * further calls to __split_and_process_bio from dm_wq_work.
+	 * __split_and_process_bio from dm_request and quiesce the thread
+	 * (dm_wq_work), we set DMF_BLOCK_IO_FOR_SUSPEND and call
+	 * flush_workqueue(md->wq).
 	 */
 	down_write(&md->io_lock);
 	set_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags);
-	set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags);
 	up_write(&md->io_lock);
 
 	/*
-- 
1.7.1



* [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
                   ` (2 preceding siblings ...)
  2010-08-30  9:58 ` [PATCH 3/5] dm: relax ordering of bio-based flush implementation Tejun Heo
@ 2010-08-30  9:58 ` Tejun Heo
  2010-08-30 13:28   ` Mike Snitzer
  2010-08-30  9:58 ` [PATCH 5/5] block: remove the WRITE_BARRIER flag Tejun Heo
  4 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Tejun Heo

This patch converts request-based dm to support the new REQ_FLUSH/FUA.

The original request-based flush implementation depended on the
request_queue blocking other requests while a barrier sequence was in
progress, which is no longer true for the new REQ_FLUSH/FUA.

In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special, mostly independent path which could
issue flushes to multiple targets and sequence them.  However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it doesn't
make sense to be able to send out flushes to multiple targets when
write requests can't be.

This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests.  The only
special treatment is that REQ_FLUSH requests use block address 0 when
finding the target, which is enough for now.
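
The target lookup then becomes, in sketch form (request-based dm tables
only ever have a single target, so any address maps to it):

	/* in dm_request_fn(), per the hunk below */
	pos = 0;				/* flushes map via block 0 */
	if (!(rq->cmd_flags & REQ_FLUSH))
		pos = blk_rq_pos(rq);
	ti = dm_table_find_target(map, pos);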

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/md/dm.c |  204 ++++++-------------------------------------------------
 1 files changed, 20 insertions(+), 184 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e67c519..81a012f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
 	spinlock_t deferred_lock;
 
 	/*
-	 * Protect barrier_error from concurrent endio processing
-	 * in request-based dm.
-	 */
-	spinlock_t barrier_error_lock;
-	int barrier_error;
-
-	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
-
-	/* A pointer to the currently processing pre/post flush request */
-	struct request *flush_request;
 
 	/*
 	 * The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *clone, int error)
 	blk_update_request(tio->orig, 0, nr_bytes);
 }
 
-static void store_barrier_error(struct mapped_device *md, int error)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
-	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
 /*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request *clone)
 static void dm_end_request(struct request *clone, int error)
 {
 	int rw = rq_data_dir(clone);
-	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 
-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;
 
@@ -806,15 +776,8 @@ static void dm_end_request(struct request *clone, int error)
 	}
 
 	free_rq_clone(clone);
-
-	if (unlikely(is_barrier)) {
-		if (unlikely(error))
-			store_barrier_error(md, error);
-		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
-
-	rq_completed(md, rw, run_queue);
+	blk_end_request_all(rq, error);
+	rq_completed(md, rw, true);
 }
 
 static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct request *clone)
 	struct request_queue *q = rq->q;
 	unsigned long flags;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		dm_end_request(clone, DM_ENDIO_REQUEUE);
-		return;
-	}
-
 	dm_unprep_request(rq);
 
 	spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct request *clone, int error)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.  So can't use
-		 * softirq_done with the original.
-		 * Pass the clone to dm_done() directly in this special case.
-		 * It is safe (even if clone->q->queue_lock is held here)
-		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
-		 */
-		dm_done(clone, error, true);
-		return;
-	}
-
 	tio->error = error;
 	rq->completion_data = clone;
 	blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct request *clone, int error)
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;
 
-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		BUG_ON(error > 0);
-		dm_end_request(clone, error);
-		return;
-	}
-
 	rq->cmd_flags |= REQ_FAILED;
 	dm_complete_request(clone, error);
 }
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_queue *q, struct bio *bio)
 	return _dm_request(q, bio);
 }
 
-static bool dm_rq_is_flush_request(struct request *rq)
-{
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
-}
-
 void dm_dispatch_request(struct request *rq)
 {
 	int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
 {
 	int r;
 
-	if (dm_rq_is_flush_request(rq)) {
-		blk_rq_init(NULL, clone);
-		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
-	} else {
-		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
-				      dm_rq_bio_constructor, tio);
-		if (r)
-			return r;
-
-		clone->cmd = rq->cmd;
-		clone->cmd_len = rq->cmd_len;
-		clone->sense = rq->sense;
-		clone->buffer = rq->buffer;
-	}
+	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+			      dm_rq_bio_constructor, tio);
+	if (r)
+		return r;
 
+	clone->cmd = rq->cmd;
+	clone->cmd_len = rq->cmd_len;
+	clone->sense = rq->sense;
+	clone->buffer = rq->buffer;
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;
 
@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_queue *q, struct request *rq)
 	struct mapped_device *md = q->queuedata;
 	struct request *clone;
 
-	if (unlikely(dm_rq_is_flush_request(rq)))
-		return BLKPREP_OK;
-
 	if (unlikely(rq->special)) {
 		DMWARN("Already has something in rq->special.");
 		return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request_queue *q)
 	struct dm_table *map = dm_get_live_table(md);
 	struct dm_target *ti;
 	struct request *rq, *clone;
+	sector_t pos;
 
 	/*
 	 * For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
 		if (!rq)
 			goto plug_and_out;
 
-		if (unlikely(dm_rq_is_flush_request(rq))) {
-			BUG_ON(md->flush_request);
-			md->flush_request = rq;
-			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
-			goto out;
-		}
+		/* always use block 0 to find the target for flushes for now */
+		pos = 0;
+		if (!(rq->cmd_flags & REQ_FLUSH))
+			pos = blk_rq_pos(rq);
 
-		ti = dm_table_find_target(map, blk_rq_pos(rq));
+		ti = dm_table_find_target(map, pos);
 		if (ti->type->busy && ti->type->busy(ti))
 			goto plug_and_out;
 
@@ -1888,7 +1797,6 @@ out:
 static const struct block_device_operations dm_blk_dops;
 
 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);
 
 static void dm_init_md_queue(struct mapped_device *md)
 {
@@ -1943,7 +1851,6 @@ static struct mapped_device *alloc_dev(int minor)
 	mutex_init(&md->suspend_lock);
 	mutex_init(&md->type_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1966,7 +1873,6 @@ static struct mapped_device *alloc_dev(int minor)
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
 	init_waitqueue_head(&md->eventq);
 
 	md->disk->major = _major;
@@ -2220,8 +2126,6 @@ static int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	/* no flush support for request based dm yet */
-	blk_queue_flush(md->queue, 0);
 
 	elv_register_queue(md->queue);
 
@@ -2421,73 +2325,6 @@ static void dm_queue_flush(struct mapped_device *md)
 	queue_work(md->wq, &md->work);
 }
 
-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
-	struct dm_rq_target_io *tio = clone->end_io_data;
-
-	tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
-	int i, j;
-	struct dm_table *map = dm_get_live_table(md);
-	unsigned num_targets = dm_table_get_num_targets(map);
-	struct dm_target *ti;
-	struct request *clone;
-
-	md->barrier_error = 0;
-
-	for (i = 0; i < num_targets; i++) {
-		ti = dm_table_get_target(map, i);
-		for (j = 0; j < ti->num_flush_requests; j++) {
-			clone = clone_rq(md->flush_request, md, GFP_NOIO);
-			dm_rq_set_target_request_nr(clone, j);
-			atomic_inc(&md->pending[rq_data_dir(clone)]);
-			map_request(ti, clone, md);
-		}
-	}
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-	dm_table_put(map);
-
-	return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
-	int error;
-	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
-	struct request_queue *q = md->queue;
-	struct request *rq;
-	unsigned long flags;
-
-	/*
-	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
-	 * completes.
-	 */
-	dm_get(md);
-
-	error = dm_rq_barrier(md);
-
-	rq = md->flush_request;
-	md->flush_request = NULL;
-
-	if (error == DM_ENDIO_REQUEUE) {
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_requeue_request(q, rq);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	} else
-		blk_end_request_all(rq, error);
-
-	blk_run_queue(q);
-
-	dm_put(md);
-}
-
 /*
  * Swap in a new table, returning the old one for the caller to destroy.
  */
@@ -2619,9 +2456,8 @@ int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 	up_write(&md->io_lock);
 
 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
-	 * can be kicked until md->queue is stopped.  So stop md->queue before
-	 * flushing md->wq.
+	 * Stop md->queue before flushing md->wq in case request-based
+	 * dm defers requests to md->wq from md->queue.
 	 */
 	if (dm_request_based(md))
 		stop_queue(md->queue);
-- 
1.7.1



* [PATCH 5/5] block: remove the WRITE_BARRIER flag
  2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
                   ` (3 preceding siblings ...)
  2010-08-30  9:58 ` [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
@ 2010-08-30  9:58 ` Tejun Heo
  4 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-30  9:58 UTC (permalink / raw)
  To: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch
  Cc: Christoph Hellwig, Tejun Heo

From: Christoph Hellwig <hch@infradead.org>

It's unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/fs.h |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 32703a9..6b0f6e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -135,7 +135,6 @@ struct inodes_stat_t {
  *			immediately after submission. The write equivalent
  *			of READ_SYNC.
  * WRITE_ODIRECT_PLUG	Special case write for O_DIRECT only.
- * WRITE_BARRIER	DEPRECATED. Always fails. Use FLUSH/FUA instead.
  * WRITE_FLUSH		Like WRITE_SYNC but with preceding cache flush.
  * WRITE_FUA		Like WRITE_SYNC but data is guaranteed to be on
  *			non-volatile media on completion.
@@ -157,8 +156,6 @@ struct inodes_stat_t {
 #define WRITE_SYNC		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG)
 #define WRITE_ODIRECT_PLUG	(WRITE | REQ_SYNC)
 #define WRITE_META		(WRITE | REQ_META)
-#define WRITE_BARRIER		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-				 REQ_HARDBARRIER)
 #define WRITE_FLUSH		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
 				 REQ_FLUSH)
 #define WRITE_FUA		(WRITE | REQ_SYNC | REQ_NOIDLE | REQ_UNPLUG | \
-- 
1.7.1



* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30  9:58 ` [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
@ 2010-08-30 13:28   ` Mike Snitzer
  2010-08-30 13:59     ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-08-30 13:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
...
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests.  The only
> special treatment is that REQ_FLUSH requests use block address 0 when
> finding the target, which is enough for now.

Looks very comparable to the patch I prepared but I have 2 observations
below (based on my findings from testing my patch).

> @@ -1562,22 +1483,15 @@ static int setup_clone(struct request *clone, struct request *rq,
>  {
>  	int r;
>  
> -	if (dm_rq_is_flush_request(rq)) {
> -		blk_rq_init(NULL, clone);
> -		clone->cmd_type = REQ_TYPE_FS;
> -		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
> -	} else {
> -		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> -				      dm_rq_bio_constructor, tio);
> -		if (r)
> -			return r;
> -
> -		clone->cmd = rq->cmd;
> -		clone->cmd_len = rq->cmd_len;
> -		clone->sense = rq->sense;
> -		clone->buffer = rq->buffer;
> -	}
> +	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
> +			      dm_rq_bio_constructor, tio);
> +	if (r)
> +		return r;
>  
> +	clone->cmd = rq->cmd;
> +	clone->cmd_len = rq->cmd_len;
> +	clone->sense = rq->sense;
> +	clone->buffer = rq->buffer;
>  	clone->end_io = end_clone_request;
>  	clone->end_io_data = tio;

blk_rq_prep_clone() of a REQ_FLUSH request will result in a
rq_data_dir(clone) of read.
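
(rq_data_dir() just tests the REQ_WRITE bit, so a clone that never gets
REQ_WRITE set reads as a READ.  A minimal illustration from my reading
of the headers, not code from either patch:)

	#define rq_data_dir(rq)	((rq)->cmd_flags & 1)	/* bit 0 == REQ_WRITE */

	blk_rq_init(NULL, clone);	/* cmd_flags == 0 */
	clone->cmd_flags |= REQ_FLUSH;	/* still no REQ_WRITE ... */
	/* ... so rq_data_dir(clone) == READ */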

I still had the following:

        if (rq->cmd_flags & REQ_FLUSH) {
                blk_rq_init(NULL, clone);
                clone->cmd_type = REQ_TYPE_FS;
                /* without this the clone has a rq_data_dir of 0 */
                clone->cmd_flags |= WRITE_FLUSH;
        } else {
                r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
                                      dm_rq_bio_constructor, tio);
                ...

Request-based DM's REQ_FLUSH still works without this special casing,
but I figured I'd raise this to ask: what is the proper rq_data_dir()
for a REQ_FLUSH?

> @@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
>  		if (!rq)
>  			goto plug_and_out;
>  
> -		if (unlikely(dm_rq_is_flush_request(rq))) {
> -			BUG_ON(md->flush_request);
> -			md->flush_request = rq;
> -			blk_start_request(rq);
> -			queue_work(md->wq, &md->barrier_work);
> -			goto out;
> -		}
> +		/* always use block 0 to find the target for flushes for now */
> +		pos = 0;
> +		if (!(rq->cmd_flags & REQ_FLUSH))
> +			pos = blk_rq_pos(rq);
>  
> -		ti = dm_table_find_target(map, blk_rq_pos(rq));
> +		ti = dm_table_find_target(map, pos);

I added the following here: BUG_ON(!dm_target_is_valid(ti));

>  		if (ti->type->busy && ti->type->busy(ti))
>  			goto plug_and_out;

I also needed to avoid the ->busy call for REQ_FLUSH:

                if (!(rq->cmd_flags & REQ_FLUSH)) {
                        ti = dm_table_find_target(map, blk_rq_pos(rq));
                        BUG_ON(!dm_target_is_valid(ti));
                        if (ti->type->busy && ti->type->busy(ti))
                                goto plug_and_out;
                } else {
                        /* rq-based only ever has one target! leverage this for FLUSH */
                        ti = dm_table_get_target(map, 0);
                }

If I allowed ->busy to be called for REQ_FLUSH it would result in a
deadlock.  I haven't identified where/why yet.

Other than these remaining issues this patch looks good.

Thanks,
Mike


* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 13:28   ` Mike Snitzer
@ 2010-08-30 13:59     ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-30 13:59 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

Hello,

On 08/30/2010 03:28 PM, Mike Snitzer wrote:
>> +	clone->cmd = rq->cmd;
>> +	clone->cmd_len = rq->cmd_len;
>> +	clone->sense = rq->sense;
>> +	clone->buffer = rq->buffer;
>>  	clone->end_io = end_clone_request;
>>  	clone->end_io_data = tio;
> 
> blk_rq_prep_clone() of a REQ_FLUSH request will result in a
> rq_data_dir(clone) of read.

Hmmm... why?  blk_rq_prep_clone() copies all REQ_* flags in
REQ_CLONE_MASK and REQ_WRITE is definitely there.  I'll check.

> I still had the following:
> 
>         if (rq->cmd_flags & REQ_FLUSH) {
>                 blk_rq_init(NULL, clone);
>                 clone->cmd_type = REQ_TYPE_FS;
>                 /* without this the clone has a rq_data_dir of 0 */
>                 clone->cmd_flags |= WRITE_FLUSH;
>         } else {
>                 r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
>                                       dm_rq_bio_constructor, tio);
>                 ...
> 
> Request-based DM's REQ_FLUSH still works without this special casing,
> but I figured I'd raise this to ask: what is the proper rq_data_dir()
> for a REQ_FLUSH?

Technically the block layer doesn't care one way or the other, but it
should definitely be WRITE.  Maybe it would be a good idea to enforce
that from the block layer.
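
Something like this in the flush machinery, maybe (untested, just to
illustrate the idea):

	/* when a flush request is initialized or issued */
	WARN_ON_ONCE((rq->cmd_flags & REQ_FLUSH) &&
		     rq_data_dir(rq) != WRITE);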

>> @@ -1709,15 +1621,12 @@ static void dm_request_fn(struct request_queue *q)
>>  		if (!rq)
>>  			goto plug_and_out;
>>  
>> -		if (unlikely(dm_rq_is_flush_request(rq))) {
>> -			BUG_ON(md->flush_request);
>> -			md->flush_request = rq;
>> -			blk_start_request(rq);
>> -			queue_work(md->wq, &md->barrier_work);
>> -			goto out;
>> -		}
>> +		/* always use block 0 to find the target for flushes for now */
>> +		pos = 0;
>> +		if (!(rq->cmd_flags & REQ_FLUSH))
>> +			pos = blk_rq_pos(rq);
>>  
>> -		ti = dm_table_find_target(map, blk_rq_pos(rq));
>> +		ti = dm_table_find_target(map, pos);
> 
> I added the following here: BUG_ON(!dm_target_is_valid(ti));

I'll add it.

>>  		if (ti->type->busy && ti->type->busy(ti))
>>  			goto plug_and_out;
> 
> I also needed to avoid the ->busy call for REQ_FLUSH:
> 
>                 if (!(rq->cmd_flags & REQ_FLUSH)) {
>                         ti = dm_table_find_target(map, blk_rq_pos(rq));
>                         BUG_ON(!dm_target_is_valid(ti));
>                         if (ti->type->busy && ti->type->busy(ti))
>                                 goto plug_and_out;
>                 } else {
>                         /* rq-based only ever has one target! leverage this for FLUSH */
>                         ti = dm_table_get_target(map, 0);
>                 }
> 
> If I allowed ->busy to be called for REQ_FLUSH it would result in a
> deadlock.  I haven't identified where/why yet.

Ah... that's probably from the "if (!elv_queue_empty(q))" check below.
Flushes are on a separate queue, but I forgot to update
elv_queue_empty() to check the flush queue, so it can return %true
spuriously, in which case the queue won't be plugged and restarted
later, leading to a queue hang.  I'll fix elv_queue_empty().

Thanks.

-- 
tejun


* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 13:59     ` Tejun Heo
@ 2010-08-30 15:07       ` Tejun Heo
  2010-08-30 19:08         ` Mike Snitzer
  2010-08-30 15:42       ` [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH Tejun Heo
  2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
  2 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-08-30 15:07 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On 08/30/2010 03:59 PM, Tejun Heo wrote:
> Ah... that's probably from the "if (!elv_queue_empty(q))" check below.
> Flushes are on a separate queue, but I forgot to update
> elv_queue_empty() to check the flush queue, so it can return %true
> spuriously, in which case the queue won't be plugged and restarted
> later, leading to a queue hang.  I'll fix elv_queue_empty().

I think I was too quick to blame elv_queue_empty().  Can you please
test whether the following patch fixes the hang?

Thanks.

---
 block/blk-flush.c |   18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -28,7 +28,8 @@ unsigned blk_flush_cur_seq(struct reques
 }

 static struct request *blk_flush_complete_seq(struct request_queue *q,
-					      unsigned seq, int error)
+					      unsigned seq, int error,
+					      bool from_end_io)
 {
 	struct request *next_rq = NULL;

@@ -51,6 +52,13 @@ static struct request *blk_flush_complet
 		if (!list_empty(&q->pending_flushes)) {
 			next_rq = list_entry_rq(q->pending_flushes.next);
 			list_move(&next_rq->queuelist, &q->queue_head);
+			/*
+			 * Moving a request silently to queue_head may
+			 * stall the queue, kick the queue if we
+			 * aren't in the issue path already.
+			 */
+			if (from_end_io)
+				__blk_run_queue(q);
 		}
 	}
 	return next_rq;
@@ -59,19 +67,19 @@ static struct request *blk_flush_complet
 static void pre_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error, true);
 }

 static void flush_data_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error, true);
 }

 static void post_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error, true);
 }

 static void init_flush_request(struct request *rq, struct gendisk *disk)
@@ -165,7 +173,7 @@ struct request *blk_do_flush(struct requ
 		skip |= QUEUE_FSEQ_DATA;
 	if (!do_postflush)
 		skip |= QUEUE_FSEQ_POSTFLUSH;
-	return blk_flush_complete_seq(q, skip, 0);
+	return blk_flush_complete_seq(q, skip, 0, false);
 }

 static void bio_end_flush(struct bio *bio, int err)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH
  2010-08-30 13:59     ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
@ 2010-08-30 15:42       ` Tejun Heo
  2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
  2 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-30 15:42 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

init_flush_request() set only REQ_FLUSH when initializing flush
requests, making them READ requests.  Use WRITE_FLUSH instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Snitzer <snitzer@redhat.com>
---
So, this was the culprit for the incorrect data direction for flush
requests.

Thanks.
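
(For reference, WRITE_FLUSH bundles the write direction with the
flush flag; in include/linux/fs.h of this series it reads roughly as
below, though the exact companion flags are an assumption here:)

        #define WRITE_FLUSH     (WRITE | REQ_SYNC | REQ_NOIDLE | REQ_FLUSH)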

 block/blk-flush.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -77,7 +77,7 @@ static void post_flush_end_io(struct req
 static void init_flush_request(struct request *rq, struct gendisk *disk)
 {
 	rq->cmd_type = REQ_TYPE_FS;
-	rq->cmd_flags = REQ_FLUSH;
+	rq->cmd_flags = WRITE_FLUSH;
 	rq->rq_disk = disk;
 }


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 13:59     ` Tejun Heo
  2010-08-30 15:07       ` Tejun Heo
  2010-08-30 15:42       ` [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH Tejun Heo
@ 2010-08-30 15:45       ` Tejun Heo
  2010-08-30 19:18         ` Mike Snitzer
  2010-09-01  7:15         ` Kiyoshi Ueda
  2 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-30 15:45 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

This patch converts request-based dm to support the new REQ_FLUSH/FUA.

The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.

In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them.  However, the
capability isn't currently in use and adds a lot of complexity.
Moreover, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.

This patch rips out the special flush code path and handles
REQ_FLUSH/FUA requests the same way as other requests.  The only
special treatment is that REQ_FLUSH requests use the block address 0
when finding the target, which is enough for now.

* added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
  suggested by Mike Snitzer

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
---
Here's a version w/ BUG_ON() added.  Once the queue hang issue is
tracked down, I'll refresh the whole series and repost.

Thanks.

 drivers/md/dm.c |  206 +++++---------------------------------------------------
 1 file changed, 22 insertions(+), 184 deletions(-)

Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
 	spinlock_t deferred_lock;

 	/*
-	 * Protect barrier_error from concurrent endio processing
-	 * in request-based dm.
-	 */
-	spinlock_t barrier_error_lock;
-	int barrier_error;
-
-	/*
-	 * Processing queue (flush/barriers)
+	 * Processing queue (flush)
 	 */
 	struct workqueue_struct *wq;
-	struct work_struct barrier_work;
-
-	/* A pointer to the currently processing pre/post flush request */
-	struct request *flush_request;

 	/*
 	 * The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *cl
 	blk_update_request(tio->orig, 0, nr_bytes);
 }

-static void store_barrier_error(struct mapped_device *md, int error)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&md->barrier_error_lock, flags);
-	/*
-	 * Basically, the first error is taken, but:
-	 *   -EOPNOTSUPP supersedes any I/O error.
-	 *   Requeue request supersedes any I/O error but -EOPNOTSUPP.
-	 */
-	if (!md->barrier_error || error == -EOPNOTSUPP ||
-	    (md->barrier_error != -EOPNOTSUPP &&
-	     error == DM_ENDIO_REQUEUE))
-		md->barrier_error = error;
-	spin_unlock_irqrestore(&md->barrier_error_lock, flags);
-}
-
 /*
  * Don't touch any member of the md after calling this function because
  * the md may be freed in dm_put() at the end of this function.
@@ -786,13 +758,11 @@ static void free_rq_clone(struct request
 static void dm_end_request(struct request *clone, int error)
 {
 	int rw = rq_data_dir(clone);
-	int run_queue = 1;
-	bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER;
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;

-	if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) {
+	if (rq->cmd_type == REQ_TYPE_BLOCK_PC) {
 		rq->errors = clone->errors;
 		rq->resid_len = clone->resid_len;

@@ -806,15 +776,8 @@ static void dm_end_request(struct reques
 	}

 	free_rq_clone(clone);
-
-	if (unlikely(is_barrier)) {
-		if (unlikely(error))
-			store_barrier_error(md, error);
-		run_queue = 0;
-	} else
-		blk_end_request_all(rq, error);
-
-	rq_completed(md, rw, run_queue);
+	blk_end_request_all(rq, error);
+	rq_completed(md, rw, true);
 }

 static void dm_unprep_request(struct request *rq)
@@ -839,16 +802,6 @@ void dm_requeue_unmapped_request(struct
 	struct request_queue *q = rq->q;
 	unsigned long flags;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		dm_end_request(clone, DM_ENDIO_REQUEUE);
-		return;
-	}
-
 	dm_unprep_request(rq);

 	spin_lock_irqsave(q->queue_lock, flags);
@@ -938,19 +891,6 @@ static void dm_complete_request(struct r
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.  So can't use
-		 * softirq_done with the original.
-		 * Pass the clone to dm_done() directly in this special case.
-		 * It is safe (even if clone->q->queue_lock is held here)
-		 * because there is no I/O dispatching during the completion
-		 * of barrier clone.
-		 */
-		dm_done(clone, error, true);
-		return;
-	}
-
 	tio->error = error;
 	rq->completion_data = clone;
 	blk_complete_request(rq);
@@ -967,17 +907,6 @@ void dm_kill_unmapped_request(struct req
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	struct request *rq = tio->orig;

-	if (unlikely(clone->cmd_flags & REQ_HARDBARRIER)) {
-		/*
-		 * Barrier clones share an original request.
-		 * Leave it to dm_end_request(), which handles this special
-		 * case.
-		 */
-		BUG_ON(error > 0);
-		dm_end_request(clone, error);
-		return;
-	}
-
 	rq->cmd_flags |= REQ_FAILED;
 	dm_complete_request(clone, error);
 }
@@ -1507,14 +1436,6 @@ static int dm_request(struct request_que
 	return _dm_request(q, bio);
 }

-static bool dm_rq_is_flush_request(struct request *rq)
-{
-	if (rq->cmd_flags & REQ_FLUSH)
-		return true;
-	else
-		return false;
-}
-
 void dm_dispatch_request(struct request *rq)
 {
 	int r;
@@ -1562,22 +1483,15 @@ static int setup_clone(struct request *c
 {
 	int r;

-	if (dm_rq_is_flush_request(rq)) {
-		blk_rq_init(NULL, clone);
-		clone->cmd_type = REQ_TYPE_FS;
-		clone->cmd_flags |= (REQ_HARDBARRIER | WRITE);
-	} else {
-		r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
-				      dm_rq_bio_constructor, tio);
-		if (r)
-			return r;
-
-		clone->cmd = rq->cmd;
-		clone->cmd_len = rq->cmd_len;
-		clone->sense = rq->sense;
-		clone->buffer = rq->buffer;
-	}
+	r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
+			      dm_rq_bio_constructor, tio);
+	if (r)
+		return r;

+	clone->cmd = rq->cmd;
+	clone->cmd_len = rq->cmd_len;
+	clone->sense = rq->sense;
+	clone->buffer = rq->buffer;
 	clone->end_io = end_clone_request;
 	clone->end_io_data = tio;

@@ -1618,9 +1532,6 @@ static int dm_prep_fn(struct request_que
 	struct mapped_device *md = q->queuedata;
 	struct request *clone;

-	if (unlikely(dm_rq_is_flush_request(rq)))
-		return BLKPREP_OK;
-
 	if (unlikely(rq->special)) {
 		DMWARN("Already has something in rq->special.");
 		return BLKPREP_KILL;
@@ -1697,6 +1608,7 @@ static void dm_request_fn(struct request
 	struct dm_table *map = dm_get_live_table(md);
 	struct dm_target *ti;
 	struct request *rq, *clone;
+	sector_t pos;

 	/*
 	 * For suspend, check blk_queue_stopped() and increment
@@ -1709,15 +1621,14 @@ static void dm_request_fn(struct request
 		if (!rq)
 			goto plug_and_out;

-		if (unlikely(dm_rq_is_flush_request(rq))) {
-			BUG_ON(md->flush_request);
-			md->flush_request = rq;
-			blk_start_request(rq);
-			queue_work(md->wq, &md->barrier_work);
-			goto out;
-		}
+		/* always use block 0 to find the target for flushes for now */
+		pos = 0;
+		if (!(rq->cmd_flags & REQ_FLUSH))
+			pos = blk_rq_pos(rq);
+
+		ti = dm_table_find_target(map, pos);
+		BUG_ON(!dm_target_is_valid(ti));

-		ti = dm_table_find_target(map, blk_rq_pos(rq));
 		if (ti->type->busy && ti->type->busy(ti))
 			goto plug_and_out;

@@ -1888,7 +1799,6 @@ out:
 static const struct block_device_operations dm_blk_dops;

 static void dm_wq_work(struct work_struct *work);
-static void dm_rq_barrier_work(struct work_struct *work);

 static void dm_init_md_queue(struct mapped_device *md)
 {
@@ -1943,7 +1853,6 @@ static struct mapped_device *alloc_dev(i
 	mutex_init(&md->suspend_lock);
 	mutex_init(&md->type_lock);
 	spin_lock_init(&md->deferred_lock);
-	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
 	atomic_set(&md->holders, 1);
 	atomic_set(&md->open_count, 0);
@@ -1966,7 +1875,6 @@ static struct mapped_device *alloc_dev(i
 	atomic_set(&md->pending[1], 0);
 	init_waitqueue_head(&md->wait);
 	INIT_WORK(&md->work, dm_wq_work);
-	INIT_WORK(&md->barrier_work, dm_rq_barrier_work);
 	init_waitqueue_head(&md->eventq);

 	md->disk->major = _major;
@@ -2220,8 +2128,6 @@ static int dm_init_request_based_queue(s
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	/* no flush support for request based dm yet */
-	blk_queue_flush(md->queue, 0);

 	elv_register_queue(md->queue);

@@ -2421,73 +2327,6 @@ static void dm_queue_flush(struct mapped
 	queue_work(md->wq, &md->work);
 }

-static void dm_rq_set_target_request_nr(struct request *clone, unsigned request_nr)
-{
-	struct dm_rq_target_io *tio = clone->end_io_data;
-
-	tio->info.target_request_nr = request_nr;
-}
-
-/* Issue barrier requests to targets and wait for their completion. */
-static int dm_rq_barrier(struct mapped_device *md)
-{
-	int i, j;
-	struct dm_table *map = dm_get_live_table(md);
-	unsigned num_targets = dm_table_get_num_targets(map);
-	struct dm_target *ti;
-	struct request *clone;
-
-	md->barrier_error = 0;
-
-	for (i = 0; i < num_targets; i++) {
-		ti = dm_table_get_target(map, i);
-		for (j = 0; j < ti->num_flush_requests; j++) {
-			clone = clone_rq(md->flush_request, md, GFP_NOIO);
-			dm_rq_set_target_request_nr(clone, j);
-			atomic_inc(&md->pending[rq_data_dir(clone)]);
-			map_request(ti, clone, md);
-		}
-	}
-
-	dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
-	dm_table_put(map);
-
-	return md->barrier_error;
-}
-
-static void dm_rq_barrier_work(struct work_struct *work)
-{
-	int error;
-	struct mapped_device *md = container_of(work, struct mapped_device,
-						barrier_work);
-	struct request_queue *q = md->queue;
-	struct request *rq;
-	unsigned long flags;
-
-	/*
-	 * Hold the md reference here and leave it at the last part so that
-	 * the md can't be deleted by device opener when the barrier request
-	 * completes.
-	 */
-	dm_get(md);
-
-	error = dm_rq_barrier(md);
-
-	rq = md->flush_request;
-	md->flush_request = NULL;
-
-	if (error == DM_ENDIO_REQUEUE) {
-		spin_lock_irqsave(q->queue_lock, flags);
-		blk_requeue_request(q, rq);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-	} else
-		blk_end_request_all(rq, error);
-
-	blk_run_queue(q);
-
-	dm_put(md);
-}
-
 /*
  * Swap in a new table, returning the old one for the caller to destroy.
  */
@@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
 	up_write(&md->io_lock);

 	/*
-	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
-	 * can be kicked until md->queue is stopped.  So stop md->queue before
-	 * flushing md->wq.
+	 * Stop md->queue before flushing md->wq in case request-based
+	 * dm defers requests to md->wq from md->queue.
 	 */
 	if (dm_request_based(md))
 		stop_queue(md->queue);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 15:07       ` Tejun Heo
@ 2010-08-30 19:08         ` Mike Snitzer
  2010-08-30 21:28           ` Mike Snitzer
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-08-30 19:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Mon, Aug 30 2010 at 11:07am -0400,
Tejun Heo <tj@kernel.org> wrote:

> On 08/30/2010 03:59 PM, Tejun Heo wrote:
> > Ah... that's probably from the "if (!elv_queue_empty(q))" check
> > below; flushes are on a separate queue but I forgot to update
> > elv_queue_empty() to check the flush queue.  elv_queue_empty() can
> > return %true spuriously, in which case the queue won't be plugged and
> > restarted later, leading to a queue hang.  I'll fix elv_queue_empty().
> 
> I think I was too quick to blame elv_queue_empty().  Can you please
> test whether the following patch fixes the hang?

It does, thanks!

Tested-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
@ 2010-08-30 19:18         ` Mike Snitzer
  2010-09-01  7:15         ` Kiyoshi Ueda
  1 sibling, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2010-08-30 19:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Mon, Aug 30 2010 at 11:45am -0400,
Tejun Heo <tj@kernel.org> wrote:

> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
> 
> The original request-based flush implementation depended on
> request_queue blocking other requests while a barrier sequence is in
> progress, which is no longer true for the new REQ_FLUSH/FUA.
> 
> In general, request-based dm doesn't have infrastructure for cloning
> one source request to multiple targets, but the original flush
> implementation had a special mostly independent path which can issue
> flushes to multiple targets and sequence them.  However, the
> capability isn't currently in use and adds a lot of complexity.
> Moreover, it's unlikely to be useful in its current form as it
> doesn't make sense to be able to send out flushes to multiple targets
> when write requests can't be.
> 
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests.  The only
> special treatment is that REQ_FLUSH requests use the block address 0
> when finding the target, which is enough for now.
> 
> * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
>   suggested by Mike Snitzer
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good.

Acked-by: Mike Snitzer <snitzer@redhat.com>

Junichi and/or Kiyoshi,
Could you please review this patch and add your Acked-by if it is OK?
(Alasdair will want to see NEC's Ack to accept this patch).

Thanks,
Mike


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 19:08         ` Mike Snitzer
@ 2010-08-30 21:28           ` Mike Snitzer
  2010-08-31 10:29             ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-08-30 21:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Mon, Aug 30 2010 at  3:08pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Mon, Aug 30 2010 at 11:07am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
> > On 08/30/2010 03:59 PM, Tejun Heo wrote:
> > > Ah... that's probably from the "if (!elv_queue_empty(q))" check
> > > below; flushes are on a separate queue but I forgot to update
> > > elv_queue_empty() to check the flush queue.  elv_queue_empty() can
> > > return %true spuriously, in which case the queue won't be plugged and
> > > restarted later, leading to a queue hang.  I'll fix elv_queue_empty().
> > 
> > I think I was too quick to blame elv_queue_empty().  Can you please
> > test whether the following patch fixes the hang?
> 
> It does, thanks!

Hmm, but unfortunately I was too quick to say the patch fixed the hang.

It is much more rare, but I can still get a hang.  I just got the
following running vgcreate against a DM mpath (rq-based) device:

INFO: task vgcreate:3517 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgcreate      D ffff88003d677a00  5168  3517   3361 0x00000080
 ffff88003d677998 0000000000000046 ffff880000000000 ffff88003d677fd8
 ffff880039c84860 ffff88003d677fd8 00000000001d3880 ffff880039c84c30
 ffff880039c84c28 00000000001d3880 00000000001d3880 ffff88003d677fd8
Call Trace:
 [<ffffffff81389308>] io_schedule+0x73/0xb5
 [<ffffffff811c7304>] get_request_wait+0xef/0x17d
 [<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811c7890>] __make_request+0x333/0x467
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff811c5e91>] generic_make_request+0x342/0x3bf
 [<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81069df2>] ? local_clock+0x41/0x5a
 [<ffffffff811c5fe9>] submit_bio+0xdb/0xf8
 [<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c
 [<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59
 [<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90
 [<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4
 [<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b
 [<ffffffff810e40f8>] ? might_fault+0x5c/0xac
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff8110e131>] do_sync_read+0xcb/0x108
 [<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff81389a99>] ? __mutex_unlock_slowpath+0x120/0x132
 [<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50
 [<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33
 [<ffffffff8110e7a3>] vfs_read+0xab/0x107
 [<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff8110e8c2>] sys_read+0x4d/0x74
 [<ffffffff81002c32>] system_call_fastpath+0x16/0x1b
no locks held by vgcreate/3517.

I was then able to reproduce it after reboot and another ~5 attempts
(all against 2.6.36-rc2 + your latest FLUSH+FUA patchset and DM
patches).

crash> bt -l 3893
PID: 3893   TASK: ffff88003e65a430  CPU: 0   COMMAND: "vgcreate"
 #0 [ffff88003a5298d8] schedule at ffffffff813891d3
    /root/git/linux-2.6/kernel/sched.c: 2873
 #1 [ffff88003a5299a0] io_schedule at ffffffff81389308
    /root/git/linux-2.6/kernel/sched.c: 5128
 #2 [ffff88003a5299c0] get_request_wait at ffffffff811c7304
    /root/git/linux-2.6/block/blk-core.c: 879
 #3 [ffff88003a529a50] __make_request at ffffffff811c7890
    /root/git/linux-2.6/block/blk-core.c: 1301
 #4 [ffff88003a529ac0] generic_make_request at ffffffff811c5e91
    /root/git/linux-2.6/block/blk-core.c: 1536
 #5 [ffff88003a529b70] submit_bio at ffffffff811c5fe9
    /root/git/linux-2.6/block/blk-core.c: 1632
 #6 [ffff88003a529bc0] dio_bio_submit at ffffffff811381a6
    /root/git/linux-2.6/fs/direct-io.c: 375
 #7 [ffff88003a529bf0] __blockdev_direct_IO at ffffffff81138dbe
    /root/git/linux-2.6/fs/direct-io.c: 1087
 #8 [ffff88003a529cd0] blkdev_direct_IO at ffffffff81136d7a
    /root/git/linux-2.6/fs/block_dev.c: 177
 #9 [ffff88003a529d10] generic_file_aio_read at ffffffff810ce301
    /root/git/linux-2.6/mm/filemap.c: 1303
#10 [ffff88003a529df0] do_sync_read at ffffffff8110e131
    /root/git/linux-2.6/fs/read_write.c: 282
#11 [ffff88003a529f00] vfs_read at ffffffff8110e7a3
    /root/git/linux-2.6/fs/read_write.c: 310
#12 [ffff88003a529f40] sys_read at ffffffff8110e8c2
    /root/git/linux-2.6/fs/read_write.c: 388
#13 [ffff88003a529f80] system_call_fastpath at ffffffff81002c32
    /root/git/linux-2.6/arch/x86/kernel/entry_64.S: 488
    RIP: 0000003b602d41a0  RSP: 00007fff55d5b928  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffffffff81002c32  RCX: 00007fff55d5b960
    RDX: 0000000000001000  RSI: 00007fff55d5a000  RDI: 0000000000000005
    RBP: 0000000000000000   R8: 0000000000494ecd   R9: 0000000000001000
    R10: 000000315c41c160  R11: 0000000000000246  R12: 00007fff55d5a000
    R13: 00007fff55d5b0a0  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 21:28           ` Mike Snitzer
@ 2010-08-31 10:29             ` Tejun Heo
  2010-08-31 13:02               ` Mike Snitzer
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-08-31 10:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On 08/30/2010 11:28 PM, Mike Snitzer wrote:
> On Mon, Aug 30 2010 at  3:08pm -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
> 
>> On Mon, Aug 30 2010 at 11:07am -0400,
>> Tejun Heo <tj@kernel.org> wrote:
>>
>>> On 08/30/2010 03:59 PM, Tejun Heo wrote:
>>>> Ah... that's probably from the "if (!elv_queue_empty(q))" check
>>>> below; flushes are on a separate queue but I forgot to update
>>>> elv_queue_empty() to check the flush queue.  elv_queue_empty() can
>>>> return %true spuriously, in which case the queue won't be plugged and
>>>> restarted later, leading to a queue hang.  I'll fix elv_queue_empty().
>>>
>>> I think I was too quick to blame elv_queue_empty().  Can you please
>>> test whether the following patch fixes the hang?
>>
>> It does, thanks!
> 
> Hmm, but unfortunately I was too quick to say the patch fixed the hang.
> 
> It is much more rare, but I can still get a hang.  I just got the
> following running vgcreate against a DM mpath (rq-based) device:

Can you please try this one instead?

Thanks.

---
 block/blk-flush.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -56,22 +56,38 @@ static struct request *blk_flush_complet
 	return next_rq;
 }

+static void blk_flush_complete_seq_end_io(struct request_queue *q,
+					  unsigned seq, int error)
+{
+	bool was_empty = elv_queue_empty(q);
+	struct request *next_rq;
+
+	next_rq = blk_flush_complete_seq(q, seq, error);
+
+	/*
+	 * Moving a request silently to empty queue_head may stall the
+	 * queue.  Kick the queue in those cases.
+	 */
+	if (next_rq && was_empty)
+		__blk_run_queue(q);
+}
+
 static void pre_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
 }

 static void flush_data_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
 }

 static void post_flush_end_io(struct request *rq, int error)
 {
 	elv_completed_request(rq->q, rq);
-	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+	blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
 }

 static void init_flush_request(struct request *rq, struct gendisk *disk)


-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-31 10:29             ` Tejun Heo
@ 2010-08-31 13:02               ` Mike Snitzer
  2010-08-31 13:14                 ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-08-31 13:02 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Tue, Aug 31 2010 at  6:29am -0400,
Tejun Heo <tj@kernel.org> wrote:

> On 08/30/2010 11:28 PM, Mike Snitzer wrote:
> > Hmm, but unfortunately I was too quick to say the patch fixed the hang.
> > 
> > It is much more rare, but I can still get a hang.  I just got the
> > following running vgcreate against a DM mpath (rq-based) device:
> 
> Can you please try this one instead?

Still hit the hang on the 5th iteration of my test:
while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done

Would you like me to (re)send my test script offlist?

INFO: task vgcreate:2617 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
vgcreate      D ffff88007bf7ba00  4688  2617   2479 0x00000080
 ffff88007bf7b998 0000000000000046 ffff880000000000 ffff88007bf7bfd8
 ffff88005542a430 ffff88007bf7bfd8 00000000001d3880 ffff88005542a800
 ffff88005542a7f8 00000000001d3880 00000000001d3880 ffff88007bf7bfd8
Call Trace:
 [<ffffffff81389338>] io_schedule+0x73/0xb5
 [<ffffffff811c7304>] get_request_wait+0xef/0x17d
 [<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39
 [<ffffffff811c7890>] __make_request+0x333/0x467
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff811c5e91>] generic_make_request+0x342/0x3bf
 [<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff81069df2>] ? local_clock+0x41/0x5a
 [<ffffffff811c5fe9>] submit_bio+0xdb/0xf8
 [<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c
 [<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59
 [<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90
 [<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4
 [<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b
 [<ffffffff810e40f8>] ? might_fault+0x5c/0xac
 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9
 [<ffffffff8110e131>] do_sync_read+0xcb/0x108
 [<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e
 [<ffffffff81389ac9>] ? __mutex_unlock_slowpath+0x120/0x132
 [<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50
 [<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33
 [<ffffffff8110e7a3>] vfs_read+0xab/0x107
 [<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141
 [<ffffffff8110e8c2>] sys_read+0x4d/0x74
 [<ffffffff81002c32>] system_call_fastpath+0x16/0x1b
no locks held by vgcreate/2617.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-31 13:02               ` Mike Snitzer
@ 2010-08-31 13:14                 ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-08-31 13:14 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

Hello,

On 08/31/2010 03:02 PM, Mike Snitzer wrote:
> On Tue, Aug 31 2010 at  6:29am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
>> On 08/30/2010 11:28 PM, Mike Snitzer wrote:
>>> Hmm, but unfortunately I was too quick to say the patch fixed the hang.
>>>
>>> It is much more rare, but I can still get a hang.  I just got the
>>> following running vgcreate against a DM mpath (rq-based) device:
>>
>> Can you please try this one instead?
> 
> Still hit the hang on the 5th iteration of my test:
> while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done
> 
> Would you like me to (re)send my test script offlist?

Yes, please.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
  2010-08-30 19:18         ` Mike Snitzer
@ 2010-09-01  7:15         ` Kiyoshi Ueda
  2010-09-01 12:25           ` Mike Snitzer
                             ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Kiyoshi Ueda @ 2010-09-01  7:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 08/31/2010 12:45 AM +0900, Tejun Heo wrote:
> This patch converts request-based dm to support the new REQ_FLUSH/FUA.
> 
> The original request-based flush implementation depended on
> request_queue blocking other requests while a barrier sequence is in
> progress, which is no longer true for the new REQ_FLUSH/FUA.
> 
> In general, request-based dm doesn't have infrastructure for cloning
> one source request to multiple targets, but the original flush
> implementation had a special mostly independent path which can issue
> flushes to multiple targets and sequence them.  However, the
> capability isn't currently in use and adds a lot of complexity.
> Moreover, it's unlikely to be useful in its current form as it
> doesn't make sense to be able to send out flushes to multiple targets
> when write requests can't be.
> 
> This patch rips out the special flush code path and handles
> REQ_FLUSH/FUA requests the same way as other requests.  The only
> special treatment is that REQ_FLUSH requests use the block address 0
> when finding the target, which is enough for now.
> 
> * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
>   suggested by Mike Snitzer

Thank you for your work.

I don't see any obvious problem with this patch.
However, I hit a NULL pointer dereference below when I use a mpath
device with the barrier option of ext3.  I'm investigating the cause
now.  (Also, I'm not sure of the cause of the hang which Mike is
hitting yet.)

I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.

    # mke2fs -j /dev/mapper/mpatha
    # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
    # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
    # sync

BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
PGD 29fd9a067 PUD 2a21ff067 PMD 0 
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1 
Modules linked in: ext4 jbd2 crc16 ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables autofs4 lockd sunrpc cpufreq_ondemand acpi_cpufreq bridge stp llc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac kvm_intel kvm e1000e sg sr_mod cdrom lpfc scsi_transport_fc piix rtc_cmos rtc_core ioatdma ata_piix button serio_raw rtc_lib libata dca megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]

Pid: 0, comm: kworker/0:0 Not tainted 2.6.36-rc2+ #1 MS-9196/Express5800/120Lj [N8100-1417]
RIP: 0010:[<ffffffffa0070ec3>]  [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
RSP: 0018:ffff880002c83e50  EFLAGS: 00010297
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000007d7c RSI: ffffffff81389c55 RDI: 0000000000000286
RBP: ffff880002c83e70 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802a2acf750
R13: ffff8802a25686c8 R14: ffff8802791f7eb8 R15: 0000000000000100
FS:  0000000000000000(0000) GS:ffff880002c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000078 CR3: 00000002a2ab6000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:0 (pid: 0, threadinfo ffff8802a4576000, task ffff8802a4574050)
Stack:
 ffff8802791f7eb8 0000000000002002 000000000000ea60 0000000000000005
<0> ffff880002c83ea0 ffffffffa0079ec8 ffff880002c83eb0 0000000000000020
<0> ffffffff815220a0 0000000000000004 ffff880002c83ed0 ffffffff811c6636
Call Trace:
 <IRQ> 
 [<ffffffffa0079ec8>] scsi_softirq_done+0x138/0x170 [scsi_mod]
 [<ffffffff811c6636>] blk_done_softirq+0x86/0xa0
 [<ffffffff81053036>] __do_softirq+0xd6/0x210
 [<ffffffff81003d9c>] call_softirq+0x1c/0x50
 [<ffffffff81005705>] do_softirq+0x95/0xd0
 [<ffffffff81052f4d>] irq_exit+0x4d/0x60
 [<ffffffff81391668>] do_IRQ+0x78/0xf0
 [<ffffffff8138a053>] ret_from_intr+0x0/0x16
 <EOI> 
 [<ffffffff8100b630>] ? mwait_idle+0x70/0xe0
 [<ffffffff8107cc8d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8100b639>] ? mwait_idle+0x79/0xe0
 [<ffffffff8100b630>] ? mwait_idle+0x70/0xe0
 [<ffffffff81001c36>] cpu_idle+0x66/0xe0
 [<ffffffff81380e81>] ? start_secondary+0x181/0x1f0
 [<ffffffff81380e8f>] start_secondary+0x18f/0x1f0
Code: 0c 83 e0 07 83 f8 04 77 6c 49 8b 86 80 00 00 00 41 8b 5e 68 83 78 44 02 74 27 48 8b 80 b0 00 00 00 48 8b 80 70 02 00 00 48 8b 00 <48> 8b 50 78 89 d8 48 85 d2 74 05 4c 89 f7 ff d2 39 c3 74 21 89 
RIP  [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
 RSP <ffff880002c83e50>
CR2: 0000000000000078



Also, I have one comment below on this patch.

> @@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
>  	up_write(&md->io_lock);
> 
>  	/*
> -	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
> -	 * can be kicked until md->queue is stopped.  So stop md->queue before
> -	 * flushing md->wq.
> +	 * Stop md->queue before flushing md->wq in case request-based
> +	 * dm defers requests to md->wq from md->queue.
>  	 */
>  	if (dm_request_based(md))
>  		stop_queue(md->queue);

Request-based dm doesn't use md->wq now, so you can just remove
the comment above.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01  7:15         ` Kiyoshi Ueda
@ 2010-09-01 12:25           ` Mike Snitzer
  2010-09-02 13:22           ` Tejun Heo
  2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
  2 siblings, 0 replies; 37+ messages in thread
From: Mike Snitzer @ 2010-09-01 12:25 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Tejun Heo, jaxboe, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

On Wed, Sep 01 2010 at  3:15am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Tejun,
> 
> On 08/31/2010 12:45 AM +0900, Tejun Heo wrote:
> > This patch converts request-based dm to support the new REQ_FLUSH/FUA.
> > 
> > The original request-based flush implementation depended on
> > request_queue blocking other requests while a barrier sequence is in
> > progress, which is no longer true for the new REQ_FLUSH/FUA.
> > 
> > In general, request-based dm doesn't have infrastructure for cloning
> > one source request to multiple targets, but the original flush
> > implementation had a special mostly independent path which can issue
> > flushes to multiple targets and sequence them.  However, the
> > capability isn't currently in use and adds a lot of complexity.
> > Moreover, it's unlikely to be useful in its current form as it
> > doesn't make sense to be able to send out flushes to multiple targets
> > when write requests can't be.
> > 
> > This patch rips out the special flush code path and handles
> > REQ_FLUSH/FUA requests the same way as other requests.  The only
> > special treatment is that REQ_FLUSH requests use the block address 0
> > when finding the target, which is enough for now.
> > 
> > * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
> >   suggested by Mike Snitzer
> 
> Thank you for your work.
> 
> I don't see any obvious problem with this patch.
> However, I hit a NULL pointer dereference below when I use a mpath
> device with the barrier option of ext3.  I'm investigating the cause
> now.  (Also, I'm not sure of the cause of the hang which Mike is
> hitting yet.)
> 
> I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
> 
>     # mke2fs -j /dev/mapper/mpatha
>     # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
>     # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
>     # sync
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078

FYI, I can't reproduce this using all of Tejun's latest patches (not yet
in the flush-fua git tree).  But I haven't tried the specific flush-fua
commit that you referenced.

Mike

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-08-30  9:58 ` [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm Tejun Heo
@ 2010-09-01 13:43   ` Mike Snitzer
  2010-09-01 13:50     ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
> now deprecated REQ_HARDBARRIER.
> 
> * -EOPNOTSUPP handling logic dropped.

Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
please add it to the final patch header.

This removal isn't unique to DM's conversion to FLUSH+FUA but I couldn't
easily find the justification for its removal in the larger 30+ patch
patchset either -- other patches are terse on the removal too.

Other than that.

Acked-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-09-01 13:43   ` Mike Snitzer
@ 2010-09-01 13:50     ` Tejun Heo
  2010-09-01 13:54       ` Mike Snitzer
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-09-01 13:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

Hello,

On 09/01/2010 03:43 PM, Mike Snitzer wrote:
> On Mon, Aug 30 2010 at  5:58am -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
>> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
>> now deprecated REQ_HARDBARRIER.
>>
>> * -EOPNOTSUPP handling logic dropped.
> 
> Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
> > please add it to the final patch header.

It just doesn't happen anymore.  If the underlying device doesn't
support FLUSH/FUA, the block layer simply makes those parts noop.
IOW, it no longer distinguishes between a writeback cache which
doesn't support cache flush at all and a writethrough cache.  Devices
which have a WB cache w/o flush are very difficult to come by these
days and there's nothing much we can do anyway, so it doesn't make
sense to require everyone to implement -EOPNOTSUPP.

One scheduled feature is to implement falling back to REQ_FLUSH when
the device advertises REQ_FUA but fails to process it, but one way or
the other, the goal is encapsulating REQ_FLUSH/FUA support in block
layer proper.  If FLUSH/FUA can be retried using a different strategy,
it should be done inside request_queue proper instead of pushing retry
logic to all its users.
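
(A minimal sketch of the driver side; blk_queue_flush() is the real
interface here, while my_drv_init_queue() and its parameters are made
up for illustration:)

        /*
         * A driver advertises what the device can actually do at
         * queue init time; the block layer simply noops whatever
         * isn't advertised, so -EOPNOTSUPP never reaches submitters.
         */
        static void my_drv_init_queue(struct request_queue *q,
                                      bool wb_cache, bool fua)
        {
                if (wb_cache && fua)
                        blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
                else if (wb_cache)
                        blk_queue_flush(q, REQ_FLUSH);  /* FUA via post-flush */
                else
                        blk_queue_flush(q, 0);  /* flushes become noops */
        }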

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58 ` [PATCH 3/5] dm: relax ordering of bio-based flush implementation Tejun Heo
@ 2010-09-01 13:51   ` Mike Snitzer
  2010-09-01 13:56     ` Tejun Heo
  2010-09-03  6:04   ` Kiyoshi Ueda
  1 sibling, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Mon, Aug 30 2010 at  5:58am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
> against other bio's.  This patch relaxes ordering around flushes.
> 
> * A flush bio is no longer deferred to workqueue directly.  It's
>   processed like other bio's but __split_and_process_bio() uses
>   md->flush_bio as the clone source.  md->flush_bio is initialized to
>   empty flush during md initialization and shared for all flushes.
> 
> * When dec_pending() detects that a flush has completed, it checks
>   whether the original bio has data.  If so, the bio is queued to the
>   deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
> 
> * As flush sequencing is handled in the usual issue/completion path,
>   dm_wq_work() no longer needs to handle flushes differently.  Now its
>   only responsibility is re-issuing deferred bio's the same way as
>   _dm_request() would.  REQ_FLUSH handling logic including
>   process_flush() is dropped.
> 
> * There's no reason for queue_io() and dm_wq_work() to write lock
>   dm->io_lock.  queue_io() now only uses md->deferred_lock and
>   dm_wq_work() read locks dm->io_lock.
> 
> * bio's no longer need to be queued on the deferred list while a flush
>   is in progress, making DMF_QUEUE_IO_TO_THREAD unnecessary.  Drop it.
> 
> This avoids stalling the device during flushes and simplifies the
> implementation.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good overall.

> @@ -144,11 +143,6 @@ struct mapped_device {
>  	spinlock_t deferred_lock;
>  
>  	/*
> -	 * An error from the flush request currently being processed.
> -	 */
> -	int flush_error;
> -
> -	/*
>  	 * Protect barrier_error from concurrent endio processing
>  	 * in request-based dm.
>  	 */

Could you please document why it is OK to remove 'flush_error' in the
patch header?  The -EOPNOTSUPP handling removal (done in patch 2)
obviously helps enable this, but it is not clear why the
'num_flush_requests' flushes that __clone_and_map_flush() generates do
not need explicit DM error handling.

Other than that.

Acked-by: Mike Snitzer <snitzer@redhat.com>
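
(For readers following the sequencing described in the quoted header,
a paraphrased sketch -- not the literal dm.c code -- of the
dec_pending() step that re-queues the data part of a flush once the
empty preflush clone completes:)

        static void dec_pending(struct dm_io *io, int error)
        {
                struct bio *bio = io->bio;

                /* ... error bookkeeping elided ... */
                if (atomic_dec_and_test(&io->io_count)) {
                        if ((bio->bi_rw & REQ_FLUSH) && bio->bi_size) {
                                /* preflush done; reissue payload alone */
                                bio->bi_rw &= ~REQ_FLUSH;
                                queue_io(io->md, bio);
                        } else {
                                /* normal io or empty flush: complete */
                                bio_endio(bio, io->error);
                        }
                }
        }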

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-09-01 13:50     ` Tejun Heo
@ 2010-09-01 13:54       ` Mike Snitzer
  2010-09-01 13:56         ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Mike Snitzer @ 2010-09-01 13:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On Wed, Sep 01 2010 at  9:50am -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 09/01/2010 03:43 PM, Mike Snitzer wrote:
> > On Mon, Aug 30 2010 at  5:58am -0400,
> > Tejun Heo <tj@kernel.org> wrote:
> > 
> >> This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
> >> now deprecated REQ_HARDBARRIER.
> >>
> >> * -EOPNOTSUPP handling logic dropped.
> > 
> > Can you expand on _why_ -EOPNOTSUPP handling is no longer needed?  And
> > please add it to the final patch header.
> 
> It just doesn't happen anymore.  If the underlying device doesn't
> support FLUSH/FUA, the block layer simply makes those parts noop.
> IOW, it no longer distinguishes between a writeback cache which
> doesn't support cache flush at all and a writethrough cache.  Devices
> which have a WB cache w/o flush are very difficult to come by these
> days and there's nothing much we can do anyway, so it doesn't make
> sense to require everyone to implement -EOPNOTSUPP.
> 
> One scheduled feature is to implement falling back to REQ_FLUSH when
> the device advertises REQ_FUA but fails to process it, but one way or
> the other, the goal is encapsulating REQ_FLUSH/FUA support in block
> layer proper.  If FLUSH/FUA can be retried using a different strategy,
> it should be done inside request_queue proper instead of pushing retry
> logic to all its users.

OK, so maybe add this info to the patch header of one of the primary
FLUSH+FUA conversion patches?

Thanks for the detailed explanation!

Mike

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm
  2010-09-01 13:54       ` Mike Snitzer
@ 2010-09-01 13:56         ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-01 13:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On 09/01/2010 03:54 PM, Mike Snitzer wrote:
>> It just doesn't happen anymore.  If the underlying device doesn't
>> support FLUSH/FUA, the block layer simply makes those parts noop.
>> IOW, it no longer distinguishes between a writeback cache which
>> doesn't support cache flush at all and a writethrough cache.  Devices
>> which have a WB cache w/o flush are very difficult to come by these
>> days and there's nothing much we can do anyway, so it doesn't make
>> sense to require everyone to implement -EOPNOTSUPP.
>>
>> One scheduled feature is to implement falling back to REQ_FLUSH when
>> the device advertises REQ_FUA but fails to process it, but one way or
>> the other, the goal is encapsulating REQ_FLUSH/FUA support in block
>> layer proper.  If FLUSH/FUA can be retried using a different strategy,
>> it should be done inside request_queue proper instead of pushing retry
>> logic to all its users.
> 
> OK, so maybe add this info to the patch header of one of the primary
> FLUSH+FUA conversion patches?

Sure.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-09-01 13:51   ` Mike Snitzer
@ 2010-09-01 13:56     ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-01 13:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: jaxboe, k-ueda, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch, dm-devel

On 09/01/2010 03:51 PM, Mike Snitzer wrote:
> Could you please document why it is OK to remove 'flush_error' in the
> patch header?  The -EOPNOTSUPP handling removal (done in patch 2)
> obviously helps enable this, but it is not clear why the
> 'num_flush_requests' flushes that __clone_and_map_flush() generates do
> not need explicit DM error handling.

Sure, I will.  It's because flushes now use the same error handling
path in dec_pending() that all other bio's use.  The flush_error
thing was there because flushes got executed/completed in a separate
code path to begin with.  With the special path gone, there's no need
for the flush_error path either.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags
  2010-08-30  9:58 ` [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags Tejun Heo
@ 2010-09-01 15:30   ` Christoph Hellwig
  0 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2010-09-01 15:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, k-ueda, snitzer, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

On Mon, Aug 30, 2010 at 11:58:12AM +0200, Tejun Heo wrote:
> Currently __blk_rq_prep_clone() copies only REQ_WRITE and REQ_DISCARD.
> There's no reason to omit other command flags and REQ_FUA needs to be
> copied to implement FUA support in request-based dm.
> 
> REQ_COMMON_MASK which specifies flags to be copied from bio to request
> already identifies all the command flags.  Define REQ_CLONE_MASK to be
> the same as REQ_COMMON_MASK for clarity and make __blk_rq_prep_clone()
> copy all flags in the mask.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-01  7:15         ` Kiyoshi Ueda
  2010-09-01 12:25           ` Mike Snitzer
@ 2010-09-02 13:22           ` Tejun Heo
  2010-09-02 13:32             ` Tejun Heo
  2010-09-03  5:46             ` Kiyoshi Ueda
  2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
  2 siblings, 2 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-02 13:22 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

On 09/01/2010 09:15 AM, Kiyoshi Ueda wrote:
> I don't see any obvious problem with this patch.
> However, I hit a NULL pointer dereference below when I use a mpath
> device with the barrier option of ext3.  I'm investigating the cause
> now.  (Also, I'm not sure of the cause of the hang which Mike is
> hitting yet.)
> 
> I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
> 
>     # mke2fs -j /dev/mapper/mpatha
>     # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
>     # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
>     # sync

Hmm... I'm trying to reproduce this problem but haven't been
successful yet.

> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
> IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]

Can you please ask gdb which source line this is?

> Also, I have one comment below on this patch.
> 
>> @@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
>>  	up_write(&md->io_lock);
>>
>>  	/*
>> -	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
>> -	 * can be kicked until md->queue is stopped.  So stop md->queue before
>> -	 * flushing md->wq.
>> +	 * Stop md->queue before flushing md->wq in case request-based
>> +	 * dm defers requests to md->wq from md->queue.
>>  	 */
>>  	if (dm_request_based(md))
>>  		stop_queue(md->queue);
> 
> Request-based dm doesn't use md->wq now, so you can just remove
> the comment above.

I sure can remove it, but md->wq already has most of the stuff
necessary to process deferred requests, and when someone starts using
it, having the comment there about the rather delicate ordering would
definitely be helpful, so I suggest keeping the comment.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-02 13:22           ` Tejun Heo
@ 2010-09-02 13:32             ` Tejun Heo
  2010-09-03  5:46             ` Kiyoshi Ueda
  1 sibling, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-02 13:32 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

On 09/02/2010 03:22 PM, Tejun Heo wrote:
> Hello,
> 
> On 09/01/2010 09:15 AM, Kiyoshi Ueda wrote:
>> I don't see any obvious problem with this patch.
>> However, I hit a NULL pointer dereference below when I use a mpath
>> device with the barrier option of ext3.  I'm investigating the cause
>> now.  (Also, I'm not sure of the cause of the hang which Mike is
>> hitting yet.)
>>
>> I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
>> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
>>
>>     # mke2fs -j /dev/mapper/mpatha
>>     # mount -o barrier=1 /dev/mapper/mpatha /mnt/0
>>     # dd if=/dev/zero of=/mnt/0/a bs=512 count=1
>>     # sync
> 
> Hmm... I'm trying to reproduce this problem but haven't been
> successful yet.
> 
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
>> IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
> 
> Can you please ask gdb which source line this is?

Ooh, never mind.  Reproduced it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-01  7:15         ` Kiyoshi Ueda
  2010-09-01 12:25           ` Mike Snitzer
  2010-09-02 13:22           ` Tejun Heo
@ 2010-09-02 17:43           ` Tejun Heo
  2010-09-03  5:47             ` Kiyoshi Ueda
  2 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-09-02 17:43 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has
passed through remapping drivers.  The FSEQ_DATA request incorrectly
followed bio->bi_bdev->bd_disk, ending up being issued w/ a
mismatching rq_disk.  Make it follow orig_rq->rq_disk.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
---
Kiyoshi, can you please apply this patch on top and verify that the
problem goes away?  The reason Mike and I didn't see the problem was
that we were using a REQ_FUA-capable underlying device, which makes
the block layer bypass sequencing for the FSEQ_DATA request.

Thanks.
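
(A hedged illustration, inferred from the oops above rather than
stated verbatim in the thread -- assume dm-mpath stacked on a SCSI
disk.  The flush sequence runs on the underlying SCSI queue, but the
data bio still points at the dm device it was submitted to, and
init_request_from_bio() derives rq_disk from the bio:)

        /* inside blk_rq_bio_prep(), called via init_request_from_bio() */
        if (bio->bi_bdev)
                rq->rq_disk = bio->bi_bdev->bd_disk;    /* dm disk, not sd */

(so the SCSI layer later completes a request whose rq_disk isn't its
own, matching the NULL dereference in scsi_finish_command().)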

 block/blk-flush.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -111,6 +111,13 @@ static struct request *queue_next_fseq(s
 		break;
 	case QUEUE_FSEQ_DATA:
 		init_request_from_bio(rq, orig_rq->bio);
+		/*
+		 * orig_rq->rq_disk may be different from
+		 * bio->bi_bdev->bd_disk if orig_rq got here through
+		 * remapping drivers.  Make sure rq->rq_disk points
+		 * to the same one as orig_rq.
+		 */
+		rq->rq_disk = orig_rq->rq_disk;
 		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
 		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
 		rq->end_io = flush_data_end_io;

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm
  2010-09-02 13:22           ` Tejun Heo
  2010-09-02 13:32             ` Tejun Heo
@ 2010-09-03  5:46             ` Kiyoshi Ueda
  1 sibling, 0 replies; 37+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03  5:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/02/2010 10:22 PM +0900, Tejun Heo wrote:
> On 09/01/2010 09:15 AM, Kiyoshi Ueda wrote:
>>> @@ -2619,9 +2458,8 @@ int dm_suspend(struct mapped_device *md,
>>>  	up_write(&md->io_lock);
>>>
>>>  	/*
>>> -	 * Request-based dm uses md->wq for barrier (dm_rq_barrier_work) which
>>> -	 * can be kicked until md->queue is stopped.  So stop md->queue before
>>> -	 * flushing md->wq.
>>> +	 * Stop md->queue before flushing md->wq in case request-based
>>> +	 * dm defers requests to md->wq from md->queue.
>>>  	 */
>>>  	if (dm_request_based(md))
>>>  		stop_queue(md->queue);
>> 
>> Request-based dm doesn't use md->wq now, so you can just remove
>> the comment above.
> 
> I sure can remove it, but md->wq already has most of the machinery
> necessary to process deferred requests.  When someone starts using it
> again, having a comment there about the rather delicate ordering will
> definitely be helpful, so I suggest keeping it.

OK, makes sense.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
@ 2010-09-03  5:47             ` Kiyoshi Ueda
  2010-09-03  9:33               ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03  5:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/03/2010 02:43 AM +0900, Tejun Heo wrote:
> rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has
> passed through remapping drivers.  The FSEQ_DATA request incorrectly
> followed bio->bi_bdev->bd_disk and ended up being issued with a
> mismatching rq_disk.  Make it follow orig_rq->rq_disk instead.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
> ---
> Kiyoshi, can you please apply this patch on top and verify that the
> problem goes away?  The reason Mike and I didn't see the problem is
> that we were using a REQ_FUA-capable underlying device, which makes
> the block layer bypass sequencing for the FSEQ_DATA request.
> 
> Thanks.
> 
>  block/blk-flush.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: block/block/blk-flush.c
> ===================================================================
> --- block.orig/block/blk-flush.c
> +++ block/block/blk-flush.c
> @@ -111,6 +111,13 @@ static struct request *queue_next_fseq(s
>  		break;
>  	case QUEUE_FSEQ_DATA:
>  		init_request_from_bio(rq, orig_rq->bio);
> +		/*
> +		 * orig_rq->rq_disk may be different from
> +		 * bio->bi_bdev->bd_disk if orig_rq got here through
> +		 * remapping drivers.  Make sure rq->rq_disk points
> +		 * to the same one as orig_rq.
> +		 */
> +		rq->rq_disk = orig_rq->rq_disk;
>  		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
>  		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
>  		rq->end_io = flush_data_end_io;

Ah, I see, thank you for the quick fix!
I confirmed that no panic occurs with this patch.

Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>


By the way, I had been considering a block-layer interface which remaps
a struct request and its bios to a block device, such as:
void blk_remap_request(struct request *rq, struct block_device *bdev)
{
	struct bio *bio;

	rq->rq_disk = bdev->bd_disk;

	__rq_for_each_bio(bio, rq) {
		bio->bi_bdev = bdev;
	}
}

If there is such an interface and remapping drivers use it, then these
kinds of issues may be avoided in the future.
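
As a hypothetical usage example ('clone' and 'target_bdev' are just
placeholders for a driver's cloned request and destination device), a
remapping driver could then keep rq_disk and the bios' bi_bdev in sync
with a single call:

	blk_remap_request(clone, target_bdev);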

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-08-30  9:58 ` [PATCH 3/5] dm: relax ordering of bio-based flush implementation Tejun Heo
  2010-09-01 13:51   ` Mike Snitzer
@ 2010-09-03  6:04   ` Kiyoshi Ueda
  2010-09-03  9:42     ` Tejun Heo
  1 sibling, 1 reply; 37+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03  6:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: jaxboe, snitzer, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

Hi Tejun,

On 08/30/2010 06:58 PM +0900, Tejun Heo wrote:
> Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering
> against other bio's.  This patch relaxes ordering around flushes.
...
> * When dec_pending() detects that a flush has completed, it checks
>   whether the original bio has data.  If so, the bio is queued to the
>   deferred list w/ REQ_FLUSH cleared; otherwise, it's completed.
...
> @@ -529,16 +523,10 @@ static void end_io_acct(struct dm_io *io)
>   */
>  static void queue_io(struct mapped_device *md, struct bio *bio)
>  {
> -	down_write(&md->io_lock);
> -
>  	spin_lock_irq(&md->deferred_lock);
>  	bio_list_add(&md->deferred, bio);
>  	spin_unlock_irq(&md->deferred_lock);
> -
> -	if (!test_and_set_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))
> -		queue_work(md->wq, &md->work);
> -
> -	up_write(&md->io_lock);
> +	queue_work(md->wq, &md->work);
...
> @@ -638,26 +624,22 @@ static void dec_pending(struct dm_io *io, int error)
...
> -		} else {
> -			end_io_acct(io);
> -			free_io(md, io);
> -
> -			if (io_error != DM_ENDIO_REQUEUE) {
> -				trace_block_bio_complete(md->queue, bio);
> -
> -				bio_endio(bio, io_error);
> -			}
> +			bio->bi_rw &= ~REQ_FLUSH;
> +			queue_io(md, bio);

dec_pending() is called during I/O completion, where the caller may
have interrupts disabled.
So if you use queue_io() inside dec_pending(), the spinlock must be
taken/released with irqsave/irqrestore, as in the patch below.

BTW, lockdep detects the issue and displays a warning like the one
below.  Re-enabling interrupts unexpectedly like this may break the
underlying drivers.
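
A minimal illustration of the hazard (generic code, not dm's):

	spin_lock_irqsave(&a, flags);	/* completion path: irqs now off  */
	spin_lock_irq(&b);		/* queue_io(): disables irqs again */
	spin_unlock_irq(&b);		/* unconditionally re-enables irqs
					 * while 'a' is still held         */
	spin_unlock_irqrestore(&a, flags);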

=================================
[ INFO: inconsistent lock state ]
2.6.36-rc2+ #2
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
kworker/0:1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
 (&(&q->__queue_lock)->rlock){?.-...}, at: [<ffffffff811be844>] blk_end_bidi_request+0x44/0x80
{IN-HARDIRQ-W} state was registered at:
  [<ffffffff81080266>] __lock_acquire+0x8c6/0xb30
  [<ffffffff81080570>] lock_acquire+0xa0/0x120
  [<ffffffff8138953e>] _raw_spin_lock_irqsave+0x4e/0x70
  [<ffffffffa00e095a>] ata_qc_schedule_eh+0x5a/0xa0 [libata]
  [<ffffffffa00d37e7>] ata_qc_complete+0x147/0x1f0 [libata]
  [<ffffffffa00e3af2>] ata_hsm_qc_complete+0xc2/0x140 [libata]
  [<ffffffffa00e3d45>] ata_sff_hsm_move+0x1d5/0x700 [libata]
  [<ffffffffa00e4323>] __ata_sff_port_intr+0xb3/0x100 [libata]
  [<ffffffffa00e4bff>] ata_bmdma_port_intr+0x3f/0x120 [libata]
  [<ffffffffa00e2735>] ata_bmdma_interrupt+0x195/0x1e0 [libata]
  [<ffffffff810a6b14>] handle_IRQ_event+0x54/0x170
  [<ffffffff810a8fb8>] handle_edge_irq+0xc8/0x170
  [<ffffffff8100561b>] handle_irq+0x4b/0xa0
  [<ffffffff8139169f>] do_IRQ+0x6f/0xf0
  [<ffffffff8138a093>] ret_from_intr+0x0/0x16
  [<ffffffff81389ca3>] _raw_spin_unlock+0x23/0x40
  [<ffffffff81133ea2>] sys_dup3+0x122/0x1a0
  [<ffffffff81133f43>] sys_dup2+0x23/0xb0
  [<ffffffff81002eb2>] system_call_fastpath+0x16/0x1b
irq event stamp: 14660913
hardirqs last  enabled at (14660912): [<ffffffff81389c65>] _raw_spin_unlock_irqrestore+0x65/0x80
hardirqs last disabled at (14660913): [<ffffffff8138951e>] _raw_spin_lock_irqsave+0x2e/0x70
softirqs last  enabled at (14660874): [<ffffffff810530ae>] __do_softirq+0x14e/0x210
softirqs last disabled at (14660879): [<ffffffff81003d9c>] call_softirq+0x1c/0x50

other info that might help us debug this:
1 lock held by kworker/0:1/0:
 #0:  (&(&q->__queue_lock)->rlock){?.-...}, at: [<ffffffff811be844>] blk_end_bidi_request+0x44/0x80

stack backtrace:
Pid: 0, comm: kworker/0:1 Not tainted 2.6.36-rc2+ #2
Call Trace:
 <IRQ>  [<ffffffff8107c386>] print_usage_bug+0x1a6/0x1f0
 [<ffffffff8107ca31>] mark_lock+0x661/0x690
 [<ffffffff8107de90>] ? check_usage_backwards+0x0/0xf0
 [<ffffffff8107cac0>] mark_held_locks+0x60/0x80
 [<ffffffff81389bf0>] ? _raw_spin_unlock_irq+0x30/0x40
 [<ffffffff8107cb63>] trace_hardirqs_on_caller+0x83/0x1a0
 [<ffffffff8107cc8d>] trace_hardirqs_on+0xd/0x10
 [<ffffffff81389bf0>] _raw_spin_unlock_irq+0x30/0x40
 [<ffffffffa0292e0e>] ? queue_io+0x2e/0x90 [dm_mod]
 [<ffffffffa0292e37>] queue_io+0x57/0x90 [dm_mod]
 [<ffffffffa02932fa>] dec_pending+0x22a/0x320 [dm_mod]
 [<ffffffffa0293125>] ? dec_pending+0x55/0x320 [dm_mod]
 [<ffffffffa029366d>] clone_endio+0xad/0xc0 [dm_mod]
 [<ffffffff81150d1d>] bio_endio+0x1d/0x40
 [<ffffffff811bd181>] req_bio_endio+0x81/0xf0
 [<ffffffff811bd42d>] blk_update_request+0x23d/0x460
 [<ffffffff811bd306>] ? blk_update_request+0x116/0x460
 [<ffffffff811bd677>] blk_update_bidi_request+0x27/0x80
 [<ffffffff811be490>] __blk_end_bidi_request+0x20/0x50
 [<ffffffff811be4df>] __blk_end_request_all+0x1f/0x40
 [<ffffffff811c3b40>] blk_flush_complete_seq+0x140/0x1a0
 [<ffffffff811c3c79>] pre_flush_end_io+0x39/0x50
 [<ffffffff811be265>] blk_finish_request+0x85/0x290
 [<ffffffff811be852>] blk_end_bidi_request+0x52/0x80
 [<ffffffff811bfa3f>] blk_end_request_all+0x1f/0x40
 [<ffffffffa02941bd>] dm_softirq_done+0xad/0x120 [dm_mod]
 [<ffffffff811c6646>] blk_done_softirq+0x86/0xa0
 [<ffffffff81053036>] __do_softirq+0xd6/0x210
 [<ffffffff81003d9c>] call_softirq+0x1c/0x50
 [<ffffffff81005705>] do_softirq+0x95/0xd0
 [<ffffffff81052f4d>] irq_exit+0x4d/0x60
 [<ffffffff813916a8>] do_IRQ+0x78/0xf0
 [<ffffffff8138a093>] ret_from_intr+0x0/0x16
 <EOI>  [<ffffffff8100b639>] ? mwait_idle+0x79/0xe0
 [<ffffffff8100b630>] ? mwait_idle+0x70/0xe0
 [<ffffffff81001c36>] cpu_idle+0x66/0xe0
 [<ffffffff81380e91>] ? start_secondary+0x181/0x1f0
 [<ffffffff81380e9f>] start_secondary+0x18f/0x1f0

Thanks,
Kiyoshi Ueda


Now queue_io() is called from dec_pending(), which may be called with
interrupts disabled.
So queue_io() must not enable interrupts unconditionally and must
save/restore the current interrupt status.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 drivers/md/dm.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: misc/drivers/md/dm.c
===================================================================
--- misc.orig/drivers/md/dm.c
+++ misc/drivers/md/dm.c
@@ -512,9 +512,11 @@ static void end_io_acct(struct dm_io *io
  */
 static void queue_io(struct mapped_device *md, struct bio *bio)
 {
-	spin_lock_irq(&md->deferred_lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&md->deferred_lock, flags);
 	bio_list_add(&md->deferred, bio);
-	spin_unlock_irq(&md->deferred_lock);
+	spin_unlock_irqrestore(&md->deferred_lock, flags);
 	queue_work(md->wq, &md->work);
 }
 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03  5:47             ` Kiyoshi Ueda
@ 2010-09-03  9:33               ` Tejun Heo
  2010-09-03 10:28                 ` Kiyoshi Ueda
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-09-03  9:33 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

On 09/03/2010 07:47 AM, Kiyoshi Ueda wrote:
> Ah, I see, thank you for the quick fix!
> I confirmed that no panic occurs with this patch.
> 
> Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>

Great, thanks for testing.

> By the way, I had been considering a block-layer interface which remaps
> a struct request and its bios to a block device, such as:
> void blk_remap_request(struct request *rq, struct block_device *bdev)
> {
> 	struct bio *bio;
> 
> 	rq->rq_disk = bdev->bd_disk;
> 
> 	__rq_for_each_bio(bio, rq) {
> 		bio->bi_bdev = bdev;
> 	}
> }
> 
> If there is such an interface and remapping drivers use it, then these
> kinds of issues may be avoided in the future.

I think the problem is more with request initialization.  After all,
once bios are packed into a request, they are (or at least should be)
just data containers.  We now have multiple request init paths in the
block layer, different ones initialize different subsets, and it's not
very clear which fields are supposed to be initialized to what by
whom.

But yeah, I agree that removing the discrepancy between request and
bio would be nice to have too.  It's not really remapping, though.
Maybe just blk_set_rq_q() or something like that (it should also set
rq->q)?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/5] dm: relax ordering of bio-based flush implementation
  2010-09-03  6:04   ` Kiyoshi Ueda
@ 2010-09-03  9:42     ` Tejun Heo
  0 siblings, 0 replies; 37+ messages in thread
From: Tejun Heo @ 2010-09-03  9:42 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: jaxboe, snitzer, j-nomura, jamie, linux-kernel, linux-fsdevel,
	linux-raid, hch

Hello,

On 09/03/2010 08:04 AM, Kiyoshi Ueda wrote:
> Now queue_io() is called from dec_pending(), which may be called with
> interrupts disabled.
> So queue_io() must not enable interrupts unconditionally and must
> save/restore the current interrupts status.
> 
> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>

Patch included into the series.  Thanks a lot!

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03  9:33               ` Tejun Heo
@ 2010-09-03 10:28                 ` Kiyoshi Ueda
  2010-09-03 11:42                   ` Tejun Heo
  0 siblings, 1 reply; 37+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03 10:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/03/2010 06:33 PM +0900, Tejun Heo wrote:
> On 09/03/2010 07:47 AM, Kiyoshi Ueda wrote:
>> By the way, I had been considering a block-layer interface which remaps
>> a struct request and its bios to a block device, such as:
>> void blk_remap_request(struct request *rq, struct block_device *bdev)
>> {
>> 	struct bio *bio;
>>
>> 	rq->rq_disk = bdev->bd_disk;
>>
>> 	__rq_for_each_bio(bio, rq) {
>> 		bio->bi_bdev = bdev;
>> 	}
>> }
>>
>> If there is such an interface and remapping drivers use it, then these
>> kinds of issues may be avoided in the future.
> 
> I think the problem is more with request initialization.  After all,
> once bios are packed into a request, they are (or at least should be)
> just data containers.  We now have multiple request init paths in the
> block layer, different ones initialize different subsets, and it's not
> very clear which fields are supposed to be initialized to what by
> whom.
> 
> But yeah, I agree that removing the discrepancy between request and
> bio would be nice to have too.  It's not really remapping, though.
> Maybe just blk_set_rq_q() or something like that (it should also set
> rq->q)?

Thank you for pointing that out.
Yes, the interface should also set rq->q.

About the naming of the interface, blk_set_<something> sounds
reasonable to me.
But does blk_set_rq_q() take a request and a queue as arguments?
Then I'm afraid we can't find the bdev for the bios from the given
queue.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03 10:28                 ` Kiyoshi Ueda
@ 2010-09-03 11:42                   ` Tejun Heo
  2010-09-03 11:51                     ` Kiyoshi Ueda
  0 siblings, 1 reply; 37+ messages in thread
From: Tejun Heo @ 2010-09-03 11:42 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hello,

On 09/03/2010 12:28 PM, Kiyoshi Ueda wrote:
> Thank you for pointing that out.
> Yes, the interface should also set rq->q.
> 
> About the naming of the interface, blk_set_<something> sounds
> reasonable to me.
> But does blk_set_rq_q() take a request and a queue as arguments?
> Then I'm afraid we can't find the bdev for the bios from the given
> queue.

Oh, right.  Maybe blk_set_rq_bdev() then?
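
Something along these lines, perhaps (a rough, untested sketch; the
name and signature are only what we've discussed here):

void blk_set_rq_bdev(struct request *rq, struct block_device *bdev)
{
	struct bio *bio;

	rq->q = bdev_get_queue(bdev);
	rq->rq_disk = bdev->bd_disk;

	__rq_for_each_bio(bio, rq)
		bio->bi_bdev = bdev;
}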

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original
  2010-09-03 11:42                   ` Tejun Heo
@ 2010-09-03 11:51                     ` Kiyoshi Ueda
  0 siblings, 0 replies; 37+ messages in thread
From: Kiyoshi Ueda @ 2010-09-03 11:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Mike Snitzer, jaxboe, j-nomura, jamie, linux-kernel,
	linux-fsdevel, linux-raid, hch

Hi Tejun,

On 09/03/2010 08:42 PM +0900, Tejun Heo wrote:
> Hello,
> 
> On 09/03/2010 12:28 PM, Kiyoshi Ueda wrote:
>> Thank you for pointing it.
>> Yes, the interface should also set rq->q.
>>
>> About the naming of the interface, blk_set_<something> sounds
>> reasonable to me.
>> But does blk_set_rq_q() take request and queue as arguments?
>> Then, I'm afraid we can't find bdev for bio from given queue.
> 
> Oh, right.  Maybe blk_set_rq_bdev() then?

Yeah, although struct request doesn't have a 'bdev' member,
blk_set_rq_bdev() may be better from the point of view of the
arguments it takes.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2010-09-03 11:53 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-30  9:58 [PATCHSET 2.6.36-rc2] block, dm: finish REQ_FLUSH/FUA conversion, take#2 Tejun Heo
2010-08-30  9:58 ` [PATCH 1/5] block: make __blk_rq_prep_clone() copy most command flags Tejun Heo
2010-09-01 15:30   ` Christoph Hellwig
2010-08-30  9:58 ` [PATCH 2/5] dm: implement REQ_FLUSH/FUA support for bio-based dm Tejun Heo
2010-09-01 13:43   ` Mike Snitzer
2010-09-01 13:50     ` Tejun Heo
2010-09-01 13:54       ` Mike Snitzer
2010-09-01 13:56         ` Tejun Heo
2010-08-30  9:58 ` [PATCH 3/5] dm: relax ordering of bio-based flush implementation Tejun Heo
2010-09-01 13:51   ` Mike Snitzer
2010-09-01 13:56     ` Tejun Heo
2010-09-03  6:04   ` Kiyoshi Ueda
2010-09-03  9:42     ` Tejun Heo
2010-08-30  9:58 ` [PATCH 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
2010-08-30 13:28   ` Mike Snitzer
2010-08-30 13:59     ` Tejun Heo
2010-08-30 15:07       ` Tejun Heo
2010-08-30 19:08         ` Mike Snitzer
2010-08-30 21:28           ` Mike Snitzer
2010-08-31 10:29             ` Tejun Heo
2010-08-31 13:02               ` Mike Snitzer
2010-08-31 13:14                 ` Tejun Heo
2010-08-30 15:42       ` [PATCH] block: initialize flush request with WRITE_FLUSH instead of REQ_FLUSH Tejun Heo
2010-08-30 15:45       ` [PATCH UPDATED 4/5] dm: implement REQ_FLUSH/FUA support for request-based dm Tejun Heo
2010-08-30 19:18         ` Mike Snitzer
2010-09-01  7:15         ` Kiyoshi Ueda
2010-09-01 12:25           ` Mike Snitzer
2010-09-02 13:22           ` Tejun Heo
2010-09-02 13:32             ` Tejun Heo
2010-09-03  5:46             ` Kiyoshi Ueda
2010-09-02 17:43           ` [PATCH] block: make sure FSEQ_DATA request has the same rq_disk as the original Tejun Heo
2010-09-03  5:47             ` Kiyoshi Ueda
2010-09-03  9:33               ` Tejun Heo
2010-09-03 10:28                 ` Kiyoshi Ueda
2010-09-03 11:42                   ` Tejun Heo
2010-09-03 11:51                     ` Kiyoshi Ueda
2010-08-30  9:58 ` [PATCH 5/5] block: remove the WRITE_BARRIER flag Tejun Heo
