All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] dm: request-based dm-multipath
@ 2008-10-03 15:08 Kiyoshi Ueda
  2008-10-03 15:09   ` Kiyoshi Ueda
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:08 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

Hi Alasdair,

This patch-set is the updated version of request-based dm-multipath.
The changes from the previous version (*) are to follow up the change
of the interface to export lld's busy state (PATCH 5).

(*) http://lkml.org/lkml/2008/9/12/100

All necessary interfaces to support request stacking drivers are now
in the for-2.6.28 branch of the Jens' linux-2.6-block git repository.
This patch-set depends on those interfaces.

This patch-set is created on top of 2.6.27-rc8 + your patches shown
between "NEXT_PATCHES_START" and "NEXT_PATCHES_END" of the series file:
  ftp://sources.redhat.com/pub/dm/patches/2.6-unstable/editing-quilt/patches/series.html

Please review and apply.


Summary of the patches:
  1/8: dm core: remove unused DM_WQ_FLUSH_ALL
  2/8: dm core: tidy local_init
  3/8: dm core: add kmem_cache for request-based dm
  4/8: dm core: add target interfaces for request-based dm
  5/8: dm core: add core functions for request-based dm
  6/8: dm core: enable request-based dm
  7/8: dm core: reject I/O violating new queue limits
  8/8: dm-mpath: convert to request-based
  (PATCH 1 is also included in the Milan's barrier support patch-set)

 drivers/md/dm-ioctl.c         |   13
 drivers/md/dm-mpath.c         |  192 +++++---
 drivers/md/dm-table.c         |   82 +++
 drivers/md/dm.c               |  956 +++++++++++++++++++++++++++++++++++++++--- drivers/md/dm.h               |   17
 include/linux/device-mapper.h |   24 +
 6 files changed, 1161 insertions(+), 123 deletions(-)

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/8] dm core: remove unused DM_WQ_FLUSH_ALL
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
@ 2008-10-03 15:09   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:09 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch removes dead codes for the noflush suspend.
No functional change.

This patch is just a clean up of the codes and not functionally
related to request-based dm.  But included here due to literal
dependency.

The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
is never invoked because:
  - The 'goto flush_and_out' is same as 'goto out', because
    the 'goto flush_and_out' is called only when '!noflush'
  - If the 'r && noflush' is true, the interrupt check code above
    is invoked and 'goto out'

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   14 +-------------
 1 files changed, 1 insertion(+), 13 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -76,7 +76,6 @@ union map_info *dm_get_mapinfo(struct bi
  */
 struct dm_wq_req {
 	enum {
-		DM_WQ_FLUSH_ALL,
 		DM_WQ_FLUSH_DEFERRED,
 	} type;
 	struct work_struct work;
@@ -1388,9 +1387,6 @@ static void dm_wq_work(struct work_struc
 
 	down_write(&md->io_lock);
 	switch (req->type) {
-	case DM_WQ_FLUSH_ALL:
-		__merge_pushback_list(md);
-		/* pass through */
 	case DM_WQ_FLUSH_DEFERRED:
 		__flush_deferred_io(md);
 		break;
@@ -1520,7 +1516,7 @@ int dm_suspend(struct mapped_device *md,
 		if (!md->suspended_bdev) {
 			DMWARN("bdget failed in dm_suspend");
 			r = -ENOMEM;
-			goto flush_and_out;
+			goto out;
 		}
 
 		/*
@@ -1571,14 +1567,6 @@ int dm_suspend(struct mapped_device *md,
 
 	set_bit(DMF_SUSPENDED, &md->flags);
 
-flush_and_out:
-	if (r && noflush)
-		/*
-		 * Because there may be already I/Os in the pushback list,
-		 * flush them before return.
-		 */
-		dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL);
-
 out:
 	if (r && md->suspended_bdev) {
 		bdput(md->suspended_bdev);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/8] dm core: remove unused DM_WQ_FLUSH_ALL
@ 2008-10-03 15:09   ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:09 UTC (permalink / raw)
  To: agk; +Cc: k-ueda, linux-scsi, linux-kernel, dm-devel, j-nomura, mbroz

This patch removes dead codes for the noflush suspend.
No functional change.

This patch is just a clean up of the codes and not functionally
related to request-based dm.  But included here due to literal
dependency.

The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
is never invoked because:
  - The 'goto flush_and_out' is same as 'goto out', because
    the 'goto flush_and_out' is called only when '!noflush'
  - If the 'r && noflush' is true, the interrupt check code above
    is invoked and 'goto out'

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   14 +-------------
 1 files changed, 1 insertion(+), 13 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -76,7 +76,6 @@ union map_info *dm_get_mapinfo(struct bi
  */
 struct dm_wq_req {
 	enum {
-		DM_WQ_FLUSH_ALL,
 		DM_WQ_FLUSH_DEFERRED,
 	} type;
 	struct work_struct work;
@@ -1388,9 +1387,6 @@ static void dm_wq_work(struct work_struc
 
 	down_write(&md->io_lock);
 	switch (req->type) {
-	case DM_WQ_FLUSH_ALL:
-		__merge_pushback_list(md);
-		/* pass through */
 	case DM_WQ_FLUSH_DEFERRED:
 		__flush_deferred_io(md);
 		break;
@@ -1520,7 +1516,7 @@ int dm_suspend(struct mapped_device *md,
 		if (!md->suspended_bdev) {
 			DMWARN("bdget failed in dm_suspend");
 			r = -ENOMEM;
-			goto flush_and_out;
+			goto out;
 		}
 
 		/*
@@ -1571,14 +1567,6 @@ int dm_suspend(struct mapped_device *md,
 
 	set_bit(DMF_SUSPENDED, &md->flags);
 
-flush_and_out:
-	if (r && noflush)
-		/*
-		 * Because there may be already I/Os in the pushback list,
-		 * flush them before return.
-		 */
-		dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL);
-
 out:
 	if (r && md->suspended_bdev) {
 		bdput(md->suspended_bdev);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/8] dm core: tidy local_init
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
  2008-10-03 15:09   ` Kiyoshi Ueda
@ 2008-10-03 15:10 ` Kiyoshi Ueda
  2008-10-03 15:17 ` [PATCH 3/8] dm core: add kmem_cache for request-based dm Kiyoshi Ueda
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:10 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch tidies local_init() as preparation for another patch
(PATCH 3), which creates some kmem_cache for request-based dm.
No functional change.

This patch is just a clean up of the codes and not functionally
related to request-based dm.  But included here due to literal
dependency.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   34 +++++++++++++++++-----------------
 1 files changed, 17 insertions(+), 17 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -150,40 +150,40 @@ static struct kmem_cache *_tio_cache;
 
 static int __init local_init(void)
 {
-	int r;
+	int r = -ENOMEM;
 
 	/* allocate a slab for the dm_ios */
 	_io_cache = KMEM_CACHE(dm_io, 0);
 	if (!_io_cache)
-		return -ENOMEM;
+		return r;
 
 	/* allocate a slab for the target ios */
 	_tio_cache = KMEM_CACHE(dm_target_io, 0);
-	if (!_tio_cache) {
-		kmem_cache_destroy(_io_cache);
-		return -ENOMEM;
-	}
+	if (!_tio_cache)
+		goto out_free_io_cache;
 
 	r = dm_uevent_init();
-	if (r) {
-		kmem_cache_destroy(_tio_cache);
-		kmem_cache_destroy(_io_cache);
-		return r;
-	}
+	if (r)
+		goto out_free_tio_cache;
 
 	_major = major;
 	r = register_blkdev(_major, _name);
-	if (r < 0) {
-		kmem_cache_destroy(_tio_cache);
-		kmem_cache_destroy(_io_cache);
-		dm_uevent_exit();
-		return r;
-	}
+	if (r < 0)
+		goto out_uevent_exit;
 
 	if (!_major)
 		_major = r;
 
 	return 0;
+
+out_uevent_exit:
+	dm_uevent_exit();
+out_free_tio_cache:
+	kmem_cache_destroy(_tio_cache);
+out_free_io_cache:
+	kmem_cache_destroy(_io_cache);
+
+	return r;
 }
 
 static void local_exit(void)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/8] dm core: add kmem_cache for request-based dm
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
  2008-10-03 15:09   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
@ 2008-10-03 15:17 ` Kiyoshi Ueda
  2008-10-03 15:18   ` Kiyoshi Ueda
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:17 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch prepares some kmem_cache for request-based dm.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 44 insertions(+), 1 deletion(-)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -32,6 +32,7 @@ static unsigned int _major = 0;
 
 static DEFINE_SPINLOCK(_minor_lock);
 /*
+ * For bio based dm.
  * One of these is allocated per bio.
  */
 struct dm_io {
@@ -43,6 +44,7 @@ struct dm_io {
 };
 
 /*
+ * For bio based dm.
  * One of these is allocated per target within a bio.  Hopefully
  * this will be simplified out one day.
  */
@@ -52,6 +54,31 @@ struct dm_target_io {
 	union map_info info;
 };
 
+/*
+ * For request based dm.
+ * One of these is allocated per request.
+ *
+ * Since assuming "original request : cloned request = 1 : 1" and
+ * a counter for number of clones like struct dm_io.io_count isn't needed,
+ * struct dm_io and struct target_io can be merged.
+ */
+struct dm_rq_target_io {
+	struct mapped_device *md;
+	struct dm_target *ti;
+	struct request *orig, clone;
+	int error;
+	union map_info info;
+};
+
+/*
+ * For request based dm.
+ * One of these is allocated per bio.
+ */
+struct dm_clone_bio_info {
+	struct bio *orig;
+	struct request *rq;
+};
+
 union map_info *dm_get_mapinfo(struct bio *bio)
 {
 	if (bio && bio->bi_private)
@@ -147,6 +174,8 @@ struct mapped_device {
 #define MIN_IOS 256
 static struct kmem_cache *_io_cache;
 static struct kmem_cache *_tio_cache;
+static struct kmem_cache *_rq_tio_cache;
+static struct kmem_cache *_bio_info_cache;
 
 static int __init local_init(void)
 {
@@ -162,9 +191,17 @@ static int __init local_init(void)
 	if (!_tio_cache)
 		goto out_free_io_cache;
 
+	_rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0);
+	if (!_rq_tio_cache)
+		goto out_free_tio_cache;
+
+	_bio_info_cache = KMEM_CACHE(dm_clone_bio_info, 0);
+	if (!_bio_info_cache)
+		goto out_free_rq_tio_cache;
+
 	r = dm_uevent_init();
 	if (r)
-		goto out_free_tio_cache;
+		goto out_free_bio_info_cache;
 
 	_major = major;
 	r = register_blkdev(_major, _name);
@@ -178,6 +215,10 @@ static int __init local_init(void)
 
 out_uevent_exit:
 	dm_uevent_exit();
+out_free_bio_info_cache:
+	kmem_cache_destroy(_bio_info_cache);
+out_free_rq_tio_cache:
+	kmem_cache_destroy(_rq_tio_cache);
 out_free_tio_cache:
 	kmem_cache_destroy(_tio_cache);
 out_free_io_cache:
@@ -188,6 +229,8 @@ out_free_io_cache:
 
 static void local_exit(void)
 {
+	kmem_cache_destroy(_bio_info_cache);
+	kmem_cache_destroy(_rq_tio_cache);
 	kmem_cache_destroy(_tio_cache);
 	kmem_cache_destroy(_io_cache);
 	unregister_blkdev(_major, _name);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/8] dm core: add target interfaces for request-based dm
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
@ 2008-10-03 15:18   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:18 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch adds the following target interfaces for request-based dm.

  map_rq    : for mapping a request

  rq_end_io : for finishing a request

  busy      : for avoiding performance regression from bio-based dm.
              Target can tell dm core not to map requests now, and
              that may help requests in the block layer queue to be
              bigger by I/O merging.
              In bio-based dm, this behavior is done by device
              drivers which managing the block layer queue.
              But in request-based dm, dm core has to do that
              since dm core manages the block layer queue, and
              target drivers help is needed for it.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 include/linux/device-mapper.h |   15 +++++++++++++++
 1 files changed, 15 insertions(+)

Index: 2.6.27-rc8/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc8.orig/include/linux/device-mapper.h
+++ 2.6.27-rc8/include/linux/device-mapper.h
@@ -45,6 +45,8 @@ typedef void (*dm_dtr_fn) (struct dm_tar
  */
 typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio,
 			  union map_info *map_context);
+typedef int (*dm_map_request_fn) (struct dm_target *ti, struct request *clone,
+				  union map_info *map_context);
 
 /*
  * Returns:
@@ -57,6 +59,9 @@ typedef int (*dm_map_fn) (struct dm_targ
 typedef int (*dm_endio_fn) (struct dm_target *ti,
 			    struct bio *bio, int error,
 			    union map_info *map_context);
+typedef int (*dm_request_endio_fn) (struct dm_target *ti,
+				    struct request *clone, int error,
+				    union map_info *map_context);
 
 typedef void (*dm_flush_fn) (struct dm_target *ti);
 typedef void (*dm_presuspend_fn) (struct dm_target *ti);
@@ -76,6 +81,13 @@ typedef int (*dm_ioctl_fn) (struct dm_ta
 typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
 			    struct bio_vec *biovec, int max_size);
 
+/*
+ * Returns:
+ *    0: The target can handle the next I/O immediately.
+ *    1: The target can't handle the next I/O immediately.
+ */
+typedef int (*dm_busy_fn) (struct dm_target *ti);
+
 void dm_error(const char *message);
 
 /*
@@ -108,7 +120,9 @@ struct target_type {
 	dm_ctr_fn ctr;
 	dm_dtr_fn dtr;
 	dm_map_fn map;
+	dm_map_request_fn map_rq;
 	dm_endio_fn end_io;
+	dm_request_endio_fn rq_end_io;
 	dm_flush_fn flush;
 	dm_presuspend_fn presuspend;
 	dm_postsuspend_fn postsuspend;
@@ -118,6 +132,7 @@ struct target_type {
 	dm_message_fn message;
 	dm_ioctl_fn ioctl;
 	dm_merge_fn merge;
+	dm_busy_fn busy;
 };
 
 struct io_restrictions {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/8] dm core: add target interfaces for request-based dm
@ 2008-10-03 15:18   ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:18 UTC (permalink / raw)
  To: agk; +Cc: k-ueda, linux-scsi, linux-kernel, dm-devel, j-nomura, mbroz

This patch adds the following target interfaces for request-based dm.

  map_rq    : for mapping a request

  rq_end_io : for finishing a request

  busy      : for avoiding performance regression from bio-based dm.
              Target can tell dm core not to map requests now, and
              that may help requests in the block layer queue to be
              bigger by I/O merging.
              In bio-based dm, this behavior is done by device
              drivers which managing the block layer queue.
              But in request-based dm, dm core has to do that
              since dm core manages the block layer queue, and
              target drivers help is needed for it.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 include/linux/device-mapper.h |   15 +++++++++++++++
 1 files changed, 15 insertions(+)

Index: 2.6.27-rc8/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc8.orig/include/linux/device-mapper.h
+++ 2.6.27-rc8/include/linux/device-mapper.h
@@ -45,6 +45,8 @@ typedef void (*dm_dtr_fn) (struct dm_tar
  */
 typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio,
 			  union map_info *map_context);
+typedef int (*dm_map_request_fn) (struct dm_target *ti, struct request *clone,
+				  union map_info *map_context);
 
 /*
  * Returns:
@@ -57,6 +59,9 @@ typedef int (*dm_map_fn) (struct dm_targ
 typedef int (*dm_endio_fn) (struct dm_target *ti,
 			    struct bio *bio, int error,
 			    union map_info *map_context);
+typedef int (*dm_request_endio_fn) (struct dm_target *ti,
+				    struct request *clone, int error,
+				    union map_info *map_context);
 
 typedef void (*dm_flush_fn) (struct dm_target *ti);
 typedef void (*dm_presuspend_fn) (struct dm_target *ti);
@@ -76,6 +81,13 @@ typedef int (*dm_ioctl_fn) (struct dm_ta
 typedef int (*dm_merge_fn) (struct dm_target *ti, struct bvec_merge_data *bvm,
 			    struct bio_vec *biovec, int max_size);
 
+/*
+ * Returns:
+ *    0: The target can handle the next I/O immediately.
+ *    1: The target can't handle the next I/O immediately.
+ */
+typedef int (*dm_busy_fn) (struct dm_target *ti);
+
 void dm_error(const char *message);
 
 /*
@@ -108,7 +120,9 @@ struct target_type {
 	dm_ctr_fn ctr;
 	dm_dtr_fn dtr;
 	dm_map_fn map;
+	dm_map_request_fn map_rq;
 	dm_endio_fn end_io;
+	dm_request_endio_fn rq_end_io;
 	dm_flush_fn flush;
 	dm_presuspend_fn presuspend;
 	dm_postsuspend_fn postsuspend;
@@ -118,6 +132,7 @@ struct target_type {
 	dm_message_fn message;
 	dm_ioctl_fn ioctl;
 	dm_merge_fn merge;
+	dm_busy_fn busy;
 };
 
 struct io_restrictions {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 5/8] dm core: add core functions for request-based dm
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
@ 2008-10-03 15:18   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:18 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized as request-based,
md->queue has an I/O scheduler and the following functions are set:
    make_request_fn: __make_request() (existing block layer function)
    request_fn:      dm_request_fn()  (newly added function)
Actual initializations are done in another patch (PATCH 6).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md


bio to request
==============
md->queue->make_request_fn() (__make_request()) is called for a bio
submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.


Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.

dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.

Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)

After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using __elv_add_request().


Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
    bio->bi_end_io(): end_clone_bio()
    rq->end_io():     end_clone_request()

Summary of the request completion flow is below:
blk_end_request() for a clone request
  => __end_that_request_first()
     => bio->bi_end_io() == end_clone_bio() for each clone bio
        => Free the clone bio
        => Success: Complete the original bio (blk_update_request())
           Error:   Don't complete the original bio
  => end_that_request_last()
     => rq->end_io() == end_clone_request()
        => blk_complete_request()
           => dm_softirq_done()
              => Free the clone request
              => Success: Complete the original request (blk_end_request())
                 Error:   Requeue the original request

end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().

end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
    - The submitted request may be mapped to the same device
    - Request submission requires queue lock, but the queue lock
      has been held by itself and it doesn't know that

The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it again even error cases.
Instead, they can ask dm core for requeueing and remapping
the original request in that cases.


suspend
=======
Request-based dm uses stopping md->queue as suspend of the md.
For noflush suspend, just stops md->queue.

For flush suspend, inserts a marker request to the tail of md->queue.
And dispatches all requests in md->queue until the marker comes to
the front of md->queue.  Then, stops dispatching request and waits
for the all dispatched requests to be completed.
After that, completes the marker request, stops md->queue and
wake up the waiter on the suspend queue, md->wait.


resume
======
Starts md->queue.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-table.c         |   14 
 drivers/md/dm.c               |  714 +++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm.h               |    2 
 include/linux/device-mapper.h |    8 
 4 files changed, 734 insertions(+), 4 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -86,6 +86,14 @@ union map_info *dm_get_mapinfo(struct bi
 	return NULL;
 }
 
+union map_info *dm_get_rq_mapinfo(struct request *rq)
+{
+	if (rq && rq->end_io_data)
+		return &((struct dm_rq_target_io *)rq->end_io_data)->info;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
+
 #define MINOR_ALLOCED ((void *)-1)
 
 /*
@@ -169,6 +177,12 @@ struct mapped_device {
 
 	/* forced geometry settings */
 	struct hd_geometry geometry;
+
+	/* marker of flush suspend for request-based dm */
+	struct request suspend_rq;
+
+	/* For saving the address of __make_request for request based dm */
+	make_request_fn *saved_make_request_fn;
 };
 
 #define MIN_IOS 256
@@ -416,6 +430,28 @@ static void free_tio(struct mapped_devic
 	mempool_free(tio, md->tio_pool);
 }
 
+static inline struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md)
+{
+	return mempool_alloc(md->tio_pool, GFP_ATOMIC);
+}
+
+static inline void free_rq_tio(struct mapped_device *md,
+			       struct dm_rq_target_io *tio)
+{
+	mempool_free(tio, md->tio_pool);
+}
+
+static inline struct dm_clone_bio_info *alloc_bio_info(struct mapped_device *md)
+{
+	return mempool_alloc(md->io_pool, GFP_ATOMIC);
+}
+
+static inline void free_bio_info(struct mapped_device *md,
+				 struct dm_clone_bio_info *info)
+{
+	mempool_free(info, md->io_pool);
+}
+
 static void start_io_acct(struct dm_io *io)
 {
 	struct mapped_device *md = io->md;
@@ -604,6 +640,266 @@ static void clone_endio(struct bio *bio,
 	free_tio(md, tio);
 }
 
+/*
+ * Partial completion handling for request-based dm
+ */
+static void end_clone_bio(struct bio *clone, int error)
+{
+	struct dm_clone_bio_info *info = clone->bi_private;
+	struct dm_rq_target_io *tio = info->rq->end_io_data;
+	struct bio *bio = info->orig;
+	unsigned int nr_bytes = info->orig->bi_size;
+
+	free_bio_info(tio->md, info);
+	clone->bi_private = tio->md->bs;
+	bio_put(clone);
+
+	if (tio->error) {
+		/*
+		 * An error has already been detected on the request.
+		 * Once error occurred, just let clone->end_io() handle
+		 * the remainder.
+		 */
+		return;
+	} else if (error) {
+		/*
+		 * Don't notice the error to the upper layer yet.
+		 * The error handling decision is made by the target driver,
+		 * when the request is completed.
+		 */
+		tio->error = error;
+		return;
+	}
+
+	/*
+	 * I/O for the bio successfully completed.
+	 * Notice the data completion to the upper layer.
+	 */
+
+	/*
+	 * bios are processed from the head of the list.
+	 * So the completing bio should always be rq->bio.
+	 * If it's not, something wrong is happening.
+	 */
+	if (tio->orig->bio != bio)
+		DMERR("bio completion is going in the middle of the request");
+
+	/*
+	 * Update the original request.
+	 * Do not use blk_end_request() here, because it may complete
+	 * the original request before the clone, and break the ordering.
+	 */
+	blk_update_request(tio->orig, 0, nr_bytes);
+}
+
+static void free_bio_clone(struct request *clone)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
+	struct bio *bio;
+	struct dm_clone_bio_info *info;
+
+	while ((bio = clone->bio) != NULL) {
+		clone->bio = bio->bi_next;
+
+		info = bio->bi_private;
+		free_bio_info(md, info);
+
+		bio->bi_private = md->bs;
+		bio_put(bio);
+	}
+}
+
+static void dec_rq_pending(struct dm_rq_target_io *tio)
+{
+	if (!atomic_dec_return(&tio->md->pending))
+		/* nudge anyone waiting on suspend queue */
+		wake_up(&tio->md->wait);
+}
+
+static void dm_unprep_request(struct request *rq)
+{
+	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	rq->special = NULL;
+	rq->cmd_flags &= ~REQ_DONTPREP;
+
+	free_bio_clone(clone);
+	dec_rq_pending(tio);
+	free_rq_tio(tio->md, tio);
+}
+
+/*
+ * Requeue the original request of a clone.
+ */
+void dm_requeue_request(struct request *clone)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+	struct request_queue *q = rq->q;
+	unsigned long flags;
+
+	dm_unprep_request(rq);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (elv_queue_empty(q))
+		blk_plug_device(q);
+	blk_requeue_request(q, rq);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(dm_requeue_request);
+
+static inline void __stop_queue(struct request_queue *q)
+{
+	blk_stop_queue(q);
+}
+
+static void stop_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	__stop_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static inline void __start_queue(struct request_queue *q)
+{
+	if (blk_queue_stopped(q))
+		blk_start_queue(q);
+}
+
+static void start_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	__start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+/*
+ * Complete the clone and the original request
+ */
+static void dm_end_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+	struct request_queue *q = rq->q;
+	unsigned int nr_bytes = blk_rq_bytes(rq);
+
+	if (blk_pc_request(rq)) {
+		rq->errors = clone->errors;
+		rq->data_len = clone->data_len;
+
+		if (rq->sense)
+			/*
+			 * We are using the sense buffer of the original
+			 * request.
+			 * So setting the length of the sense data is enough.
+			 */
+			rq->sense_len = clone->sense_len;
+	}
+
+	free_bio_clone(clone);
+	dec_rq_pending(tio);
+	free_rq_tio(tio->md, tio);
+
+	if (unlikely(blk_end_request(rq, error, nr_bytes)))
+		BUG();
+
+	blk_run_queue(q);
+}
+
+/*
+ * Request completion handler for request-based dm
+ */
+static void dm_softirq_done(struct request *rq)
+{
+	struct request *clone = rq->completion_data;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
+	int error = tio->error;
+	int r;
+
+	if (rq->cmd_flags & REQ_FAILED)
+		goto end_request;
+
+	if (rq_end_io) {
+		r = rq_end_io(tio->ti, clone, error, &tio->info);
+		if (r <= 0)
+			/* The target wants to complete the I/O */
+			error = r;
+		else if (r == DM_ENDIO_INCOMPLETE)
+			/* The target will handle the I/O */
+			return;
+		else if (r == DM_ENDIO_REQUEUE) {
+			/*
+			 * The target wants to requeue the I/O.
+			 * Don't invoke blk_run_queue() so that the requeued
+			 * request won't be dispatched again soon.
+			 */
+			dm_requeue_request(clone);
+			return;
+		} else {
+			DMWARN("unimplemented target endio return value: %d",
+			       r);
+			BUG();
+		}
+	}
+
+end_request:
+	dm_end_request(clone, error);
+}
+
+/*
+ * Called with the queue lock held
+ */
+static void end_clone_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+
+	/*
+	 * For just cleaning up the information of the queue in which
+	 * the clone was dispatched.
+	 * The clone is *NOT* freed actually here because it is alloced from
+	 * dm own mempool and REQ_ALLOCED isn't set in clone->cmd_flags.
+	 */
+	__blk_put_request(clone->q, clone);
+
+	/*
+	 * Actual request completion is done in a softirq context which doesn't
+	 * hold the queue lock.  Otherwise, deadlock could occur because:
+	 *     - another request may be submitted by the upper level driver
+	 *       of the stacking during the completion
+	 *     - the submission which requires queue lock may be done
+	 *       against this queue
+	 */
+	tio->error = error;
+	rq->completion_data = clone;
+	blk_complete_request(rq);
+}
+
+/*
+ * Complete the original request of a clone with an error status.
+ * Target's rq_end_io() function isn't called.
+ * This may be used by target's map_rq() function when the mapping fails.
+ */
+void dm_kill_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+
+	tio->error = error;
+	/* Avoid printing "I/O error" message, since we didn't I/O actually */
+	rq->cmd_flags |= (REQ_FAILED | REQ_QUIET);
+	rq->completion_data = clone;
+	blk_complete_request(rq);
+}
+EXPORT_SYMBOL_GPL(dm_kill_request);
+
 static sector_t max_io_len(struct mapped_device *md,
 			   sector_t sector, struct dm_target *ti)
 {
@@ -922,7 +1218,7 @@ out:
  * The request function that just remaps the bio built up by
  * dm_merge_bvec.
  */
-static int dm_request(struct request_queue *q, struct bio *bio)
+static int _dm_request(struct request_queue *q, struct bio *bio)
 {
 	int r = -EIO;
 	int rw = bio_data_dir(bio);
@@ -972,12 +1268,311 @@ out_req:
 	return 0;
 }
 
+static int dm_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct mapped_device *md = (struct mapped_device *)q->queuedata;
+
+	if (unlikely(bio_barrier(bio))) {
+		bio_endio(bio, -EOPNOTSUPP);
+		return 0;
+	}
+
+	if (unlikely(!md->map)) {
+		bio_endio(bio, -EIO);
+		return 0;
+	}
+
+	return md->saved_make_request_fn(q, bio); /* call __make_request() */
+}
+
+static inline int dm_request_based(struct mapped_device *md)
+{
+	return blk_queue_stackable(md->queue);
+}
+
+static int dm_request(struct request_queue *q, struct bio *bio)
+{
+	struct mapped_device *md = q->queuedata;
+
+	if (dm_request_based(md))
+		return dm_make_request(q, bio);
+
+	return _dm_request(q, bio);
+}
+
+void dm_dispatch_request(struct request *rq)
+{
+	int r;
+
+	rq->start_time = jiffies;
+	r = blk_insert_cloned_request(rq->q, rq);
+	if (r)
+		dm_kill_request(rq, r);
+}
+EXPORT_SYMBOL_GPL(dm_dispatch_request);
+
+static void copy_request_info(struct request *clone, struct request *rq)
+{
+	clone->cmd_flags = (rq_data_dir(rq) | REQ_NOMERGE);
+	clone->cmd_type = rq->cmd_type;
+	clone->sector = rq->sector;
+	clone->hard_sector = rq->hard_sector;
+	clone->nr_sectors = rq->nr_sectors;
+	clone->hard_nr_sectors = rq->hard_nr_sectors;
+	clone->current_nr_sectors = rq->current_nr_sectors;
+	clone->hard_cur_sectors = rq->hard_cur_sectors;
+	clone->nr_phys_segments = rq->nr_phys_segments;
+	clone->ioprio = rq->ioprio;
+	clone->buffer = rq->buffer;
+	clone->cmd_len = rq->cmd_len;
+	if (rq->cmd_len)
+		clone->cmd = rq->cmd;
+	clone->data_len = rq->data_len;
+	clone->extra_len = rq->extra_len;
+	clone->sense_len = rq->sense_len;
+	clone->data = rq->data;
+	clone->sense = rq->sense;
+}
+
+static int clone_request_bios(struct request *clone, struct request *rq,
+			      struct mapped_device *md)
+{
+	struct bio *bio, *clone_bio;
+	struct dm_clone_bio_info *info;
+
+	for (bio = rq->bio; bio; bio = bio->bi_next) {
+		info = alloc_bio_info(md);
+		if (!info)
+			goto free_and_out;
+
+		clone_bio = bio_alloc_bioset(GFP_ATOMIC, bio->bi_max_vecs,
+					     md->bs);
+		if (!clone_bio) {
+			free_bio_info(md, info);
+			goto free_and_out;
+		}
+
+		__bio_clone(clone_bio, bio);
+		clone_bio->bi_destructor = dm_bio_destructor;
+		clone_bio->bi_end_io = end_clone_bio;
+		info->rq = clone;
+		info->orig = bio;
+		clone_bio->bi_private = info;
+
+		if (clone->bio) {
+			clone->biotail->bi_next = clone_bio;
+			clone->biotail = clone_bio;
+		} else
+			clone->bio = clone->biotail = clone_bio;
+	}
+
+	return 0;
+
+free_and_out:
+	free_bio_clone(clone);
+
+	return -ENOMEM;
+}
+
+static int setup_clone(struct request *clone, struct request *rq,
+		       struct dm_rq_target_io *tio)
+{
+	int r;
+
+	blk_rq_init(NULL, clone);
+
+	r = clone_request_bios(clone, rq, tio->md);
+	if (r)
+		return r;
+
+	copy_request_info(clone, rq);
+	clone->start_time = jiffies;
+	clone->end_io = end_clone_request;
+	clone->end_io_data = tio;
+
+	return 0;
+}
+
+static inline int dm_flush_suspending(struct mapped_device *md)
+{
+	return !md->suspend_rq.data;
+}
+
+/*
+ * Called with the queue lock held.
+ */
+static int dm_prep_fn(struct request_queue *q, struct request *rq)
+{
+	struct mapped_device *md = (struct mapped_device *)q->queuedata;
+	struct dm_rq_target_io *tio;
+	struct request *clone;
+
+	if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend marker */
+		if (dm_flush_suspending(md)) {
+			if (q->in_flight)
+				return BLKPREP_DEFER;
+			else {
+				/* This device should be quiet now */
+				__stop_queue(q);
+				smp_mb();
+				BUG_ON(atomic_read(&md->pending));
+				wake_up(&md->wait);
+				return BLKPREP_KILL;
+			}
+		} else
+			/*
+			 * The suspend process was interrupted.
+			 * So no need to suspend now.
+			 */
+			return BLKPREP_KILL;
+	}
+
+	if (unlikely(rq->special)) {
+		DMWARN("Already has something in rq->special.");
+		return BLKPREP_KILL;
+	}
+
+	if (unlikely(!dm_request_based(md))) {
+		DMWARN("Request was queued into bio-based device");
+		return BLKPREP_KILL;
+	}
+
+	tio = alloc_rq_tio(md); /* Only one for each original request */
+	if (!tio)
+		/* -ENOMEM */
+		return BLKPREP_DEFER;
+
+	tio->md = md;
+	tio->ti = NULL;
+	tio->orig = rq;
+	tio->error = 0;
+	memset(&tio->info, 0, sizeof(tio->info));
+
+	clone = &tio->clone;
+	if (setup_clone(clone, rq, tio)) {
+		/* -ENOMEM */
+		free_rq_tio(md, tio);
+		return BLKPREP_DEFER;
+	}
+
+	rq->special = clone;
+	rq->cmd_flags |= REQ_DONTPREP;
+
+	return BLKPREP_OK;
+}
+
+static void map_request(struct dm_target *ti, struct request *rq,
+			struct mapped_device *md)
+{
+	int r;
+	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	tio->ti = ti;
+	atomic_inc(&md->pending);
+	r = ti->type->map_rq(ti, clone, &tio->info);
+	switch (r) {
+	case DM_MAPIO_SUBMITTED:
+		/* The target has taken the I/O to submit by itself later */
+		break;
+	case DM_MAPIO_REMAPPED:
+		/* The target has remapped the I/O so dispatch it */
+		dm_dispatch_request(clone);
+		break;
+	case DM_MAPIO_REQUEUE:
+		/* The target wants to requeue the I/O */
+		dm_requeue_request(clone);
+		break;
+	default:
+		if (r > 0) {
+			DMWARN("unimplemented target map return value: %d", r);
+			BUG();
+		}
+
+		/* The target wants to complete the I/O */
+		dm_kill_request(clone, r);
+		break;
+	}
+}
+
+/*
+ * q->request_fn for request-based dm.
+ * Called with the queue lock held.
+ */
+static void dm_request_fn(struct request_queue *q)
+{
+	struct mapped_device *md = (struct mapped_device *)q->queuedata;
+	struct dm_table *map = dm_get_table(md);
+	struct dm_target *ti;
+	struct request *rq;
+
+	/*
+	 * The check for blk_queue_stopped() needs here, because:
+	 *     - device suspend uses blk_stop_queue() and expects that
+	 *       no I/O will be dispatched any more after the queue stop
+	 *     - generic_unplug_device() doesn't call q->request_fn()
+	 *       when the queue is stopped, so no problem
+	 *     - but underlying device drivers may call q->request_fn()
+	 *       without the check through blk_run_queue()
+	 */
+	while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
+		rq = elv_next_request(q);
+		if (!rq)
+			goto plug_and_out;
+
+		ti = dm_table_find_target(map, rq->sector);
+		if (ti->type->busy && ti->type->busy(ti))
+			goto plug_and_out;
+
+		blkdev_dequeue_request(rq);
+		spin_unlock(q->queue_lock);
+		map_request(ti, rq, md);
+		spin_lock_irq(q->queue_lock);
+	}
+
+	goto out;
+
+plug_and_out:
+	if (!elv_queue_empty(q))
+		/* Some requests still remain, retry later */
+		blk_plug_device(q);
+
+out:
+	dm_table_put(map);
+
+	return;
+}
+
+int dm_underlying_device_busy(struct request_queue *q)
+{
+	return blk_lld_busy(q);
+}
+EXPORT_SYMBOL_GPL(dm_underlying_device_busy);
+
+static int dm_lld_busy(struct request_queue *q)
+{
+	int r;
+	struct mapped_device *md = q->queuedata;
+	struct dm_table *map = dm_get_table(md);
+
+	if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
+		r = 1;
+	else
+		r = dm_table_any_busy_target(map);
+
+	dm_table_put(map);
+	return r;
+}
+
 static void dm_unplug_all(struct request_queue *q)
 {
 	struct mapped_device *md = q->queuedata;
 	struct dm_table *map = dm_get_table(md);
 
 	if (map) {
+		if (dm_request_based(md))
+			generic_unplug_device(q);
+
 		dm_table_unplug_all(map);
 		dm_table_put(map);
 	}
@@ -991,6 +1586,12 @@ static int dm_any_congested(void *conges
 
 	if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
 		r = bdi_bits;
+	else if (dm_request_based(md))
+		/*
+		 * Request-based dm cares about only own queue for
+		 * the query about congestion status of request_queue
+		 */
+		r = md->queue->backing_dev_info.state & bdi_bits;
 	else
 		r = dm_table_any_congested(map, bdi_bits);
 
@@ -1382,7 +1983,11 @@ static int dm_wait_for_completion(struct
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		smp_mb();
-		if (!atomic_read(&md->pending))
+		if (dm_request_based(md)) {
+			if (!atomic_read(&md->pending) &&
+			    blk_queue_stopped(md->queue))
+				break;
+		} else if (!atomic_read(&md->pending))
 			break;
 
 		if (signal_pending(current)) {
@@ -1484,6 +2089,88 @@ out:
 	return r;
 }
 
+static inline void dm_invalidate_flush_suspend(struct mapped_device *md)
+{
+	md->suspend_rq.data = (void *)0x1;
+}
+
+static void dm_abort_suspend(struct mapped_device *md, int noflush)
+{
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	/*
+	 * For flush suspend, invalidation and queue restart must be protected
+	 * by a single queue lock to prevent a race with dm_prep_fn().
+	 */
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (!noflush)
+		dm_invalidate_flush_suspend(md);
+	__start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+/*
+ * Additional suspend work for request-based dm.
+ *
+ * In request-based dm, stopping request_queue prevents mapping.
+ * Even after stopping the request_queue, submitted requests from upper-layer
+ * can be inserted to the request_queue.  So original (unmapped) requests are
+ * kept in the request_queue during suspension.
+ */
+static void dm_start_suspend(struct mapped_device *md, int noflush)
+{
+	struct request *rq = &md->suspend_rq;
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	if (noflush) {
+		stop_queue(q);
+		return;
+	}
+
+	/*
+	 * For flush suspend, we need a marker to indicate the border line
+	 * between flush needed I/Os and deferred I/Os, since all I/Os are
+	 * queued in the request_queue during suspension.
+	 *
+	 * This marker must be inserted after setting DMF_BLOCK_IO,
+	 * because dm_prep_fn() considers no DMF_BLOCK_IO to be
+	 * a suspend interruption.
+	 */
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (unlikely(rq->ref_count)) {
+		/*
+		 * This can happen when the previous suspend was interrupted,
+		 * the inserted suspend_rq for the previous suspend has still
+		 * been in the queue and this suspend has been invoked.
+		 *
+		 * We could re-insert the suspend_rq by deleting it from
+		 * the queue forcibly using list_del_init(&rq->queuelist).
+		 * But it would break the block-layer easily.
+		 * So we don't re-insert the suspend_rq again in such a case.
+		 * The suspend_rq should be already invalidated during
+		 * the previous suspend interruption, so just wait for it
+		 * to be completed.
+		 *
+		 * This suspend will never complete, so warn the user to
+		 * interrupt this suspend and retry later.
+		 */
+		BUG_ON(!rq->data);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		DMWARN("Invalidating the previous suspend is still in"
+		       " progress.  This suspend will be never done."
+		       " Please interrupt this suspend and retry later.");
+		return;
+	}
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	/* Now no user of the suspend_rq */
+	blk_rq_init(q, rq);
+	blk_insert_request(q, rq, 0, NULL);
+}
+
 /*
  * Functions to lock and unlock any filesystem running on the
  * device.
@@ -1582,6 +2269,9 @@ int dm_suspend(struct mapped_device *md,
 	add_wait_queue(&md->wait, &wait);
 	up_write(&md->io_lock);
 
+	if (dm_request_based(md))
+		dm_start_suspend(md, noflush);
+
 	/* unplug */
 	if (map)
 		dm_table_unplug_all(map);
@@ -1594,14 +2284,22 @@ int dm_suspend(struct mapped_device *md,
 	down_write(&md->io_lock);
 	remove_wait_queue(&md->wait, &wait);
 
-	if (noflush)
-		__merge_pushback_list(md);
+	if (noflush) {
+		if (dm_request_based(md))
+			/* All requeued requests are already in md->queue */
+			clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
+		else
+			__merge_pushback_list(md);
+	}
 	up_write(&md->io_lock);
 
 	/* were we interrupted ? */
 	if (r < 0) {
 		dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);
 
+		if (dm_request_based(md))
+			dm_abort_suspend(md, noflush);
+
 		unlock_fs(md);
 		goto out; /* pushback list is already flushed, so skip flush */
 	}
@@ -1642,6 +2340,14 @@ int dm_resume(struct mapped_device *md)
 
 	dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);
 
+	/*
+	 * Flushing deferred I/Os must be done after targets are resumed
+	 * so that mapping of targets can work correctly.
+	 * Request-based dm is queueing the deferred I/Os in its request_queue.
+	 */
+	if (dm_request_based(md))
+		start_queue(md->queue);
+
 	unlock_fs(md);
 
 	if (md->suspended_bdev) {
Index: 2.6.27-rc8/drivers/md/dm.h
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.h
+++ 2.6.27-rc8/drivers/md/dm.h
@@ -46,6 +46,7 @@ void dm_table_presuspend_targets(struct 
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_busy_target(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().
@@ -91,6 +92,7 @@ void dm_stripe_exit(void);
 
 int dm_open_count(struct mapped_device *md);
 int dm_lock_for_deletion(struct mapped_device *md);
+union map_info *dm_get_rq_mapinfo(struct request *rq);
 
 void dm_kobject_uevent(struct mapped_device *md);
 
Index: 2.6.27-rc8/drivers/md/dm-table.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-table.c
+++ 2.6.27-rc8/drivers/md/dm-table.c
@@ -956,6 +956,20 @@ int dm_table_any_congested(struct dm_tab
 	return r;
 }
 
+int dm_table_any_busy_target(struct dm_table *t)
+{
+	int i;
+	struct dm_target *ti;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = t->targets + i;
+		if (ti->type->busy && ti->type->busy(ti))
+			return 1;
+	}
+
+	return 0;
+}
+
 void dm_table_unplug_all(struct dm_table *t)
 {
 	struct dm_dev_internal *dd;
Index: 2.6.27-rc8/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc8.orig/include/linux/device-mapper.h
+++ 2.6.27-rc8/include/linux/device-mapper.h
@@ -379,4 +379,12 @@ static inline unsigned long to_bytes(sec
 	return (n << SECTOR_SHIFT);
 }
 
+/*-----------------------------------------------------------------
+ * Helper for block layer and dm core operations
+ *---------------------------------------------------------------*/
+void dm_dispatch_request(struct request *rq);
+void dm_requeue_request(struct request *rq);
+void dm_kill_request(struct request *rq, int error);
+int dm_underlying_device_busy(struct request_queue *q);
+
 #endif	/* _LINUX_DEVICE_MAPPER_H */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 5/8] dm core: add core functions for request-based dm
@ 2008-10-03 15:18   ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:18 UTC (permalink / raw)
  To: agk; +Cc: k-ueda, linux-scsi, linux-kernel, dm-devel, j-nomura, mbroz

This patch adds core functions for request-based dm.

When struct mapped device (md) is initialized as request-based,
md->queue has an I/O scheduler and the following functions are set:
    make_request_fn: __make_request() (existing block layer function)
    request_fn:      dm_request_fn()  (newly added function)
Actual initializations are done in another patch (PATCH 6).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md


bio to request
==============
md->queue->make_request_fn() (__make_request()) is called for a bio
submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.


Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.

dm_request_fn() checks busy state of underlying devices using
target's busy() function and stops dispatching requests to keep them
on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.

Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)

After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using __elv_add_request().


Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error cases, request-based dm clones
both bio and request and hooks both bio->bi_end_io() and rq->end_io():
    bio->bi_end_io(): end_clone_bio()
    rq->end_io():     end_clone_request()

Summary of the request completion flow is below:
blk_end_request() for a clone request
  => __end_that_request_first()
     => bio->bi_end_io() == end_clone_bio() for each clone bio
        => Free the clone bio
        => Success: Complete the original bio (blk_update_request())
           Error:   Don't complete the original bio
  => end_that_request_last()
     => rq->end_io() == end_clone_request()
        => blk_complete_request()
           => dm_softirq_done()
              => Free the clone request
              => Success: Complete the original request (blk_end_request())
                 Error:   Requeue the original request

end_clone_bio() completes the original request on the size of
the original bio in successful cases.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and gives over the error handling to
end_clone_request().

end_clone_request(), which is called with queue lock held, completes
the clone request and the original request in a softirq context
(dm_softirq_done()), which has no queue lock, to avoid a deadlock
issue on submission of another request during the completion:
    - The submitted request may be mapped to the same device
    - Request submission requires queue lock, but the queue lock
      has been held by itself and it doesn't know that

The clone request has no clone bio when dm_softirq_done() is called.
So target drivers can't resubmit it again even error cases.
Instead, they can ask dm core for requeueing and remapping
the original request in that cases.


suspend
=======
Request-based dm uses stopping md->queue as suspend of the md.
For noflush suspend, just stops md->queue.

For flush suspend, inserts a marker request to the tail of md->queue.
And dispatches all requests in md->queue until the marker comes to
the front of md->queue.  Then, stops dispatching request and waits
for the all dispatched requests to be completed.
After that, completes the marker request, stops md->queue and
wake up the waiter on the suspend queue, md->wait.


resume
======
Starts md->queue.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-table.c         |   14 
 drivers/md/dm.c               |  714 +++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm.h               |    2 
 include/linux/device-mapper.h |    8 
 4 files changed, 734 insertions(+), 4 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -86,6 +86,14 @@ union map_info *dm_get_mapinfo(struct bi
 	return NULL;
 }
 
+union map_info *dm_get_rq_mapinfo(struct request *rq)
+{
+	if (rq && rq->end_io_data)
+		return &((struct dm_rq_target_io *)rq->end_io_data)->info;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
+
 #define MINOR_ALLOCED ((void *)-1)
 
 /*
@@ -169,6 +177,12 @@ struct mapped_device {
 
 	/* forced geometry settings */
 	struct hd_geometry geometry;
+
+	/* marker of flush suspend for request-based dm */
+	struct request suspend_rq;
+
+	/* For saving the address of __make_request for request based dm */
+	make_request_fn *saved_make_request_fn;
 };
 
 #define MIN_IOS 256
@@ -416,6 +430,28 @@ static void free_tio(struct mapped_devic
 	mempool_free(tio, md->tio_pool);
 }
 
+static inline struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md)
+{
+	return mempool_alloc(md->tio_pool, GFP_ATOMIC);
+}
+
+static inline void free_rq_tio(struct mapped_device *md,
+			       struct dm_rq_target_io *tio)
+{
+	mempool_free(tio, md->tio_pool);
+}
+
+static inline struct dm_clone_bio_info *alloc_bio_info(struct mapped_device *md)
+{
+	return mempool_alloc(md->io_pool, GFP_ATOMIC);
+}
+
+static inline void free_bio_info(struct mapped_device *md,
+				 struct dm_clone_bio_info *info)
+{
+	mempool_free(info, md->io_pool);
+}
+
 static void start_io_acct(struct dm_io *io)
 {
 	struct mapped_device *md = io->md;
@@ -604,6 +640,266 @@ static void clone_endio(struct bio *bio,
 	free_tio(md, tio);
 }
 
+/*
+ * Partial completion handling for request-based dm
+ */
+static void end_clone_bio(struct bio *clone, int error)
+{
+	struct dm_clone_bio_info *info = clone->bi_private;
+	struct dm_rq_target_io *tio = info->rq->end_io_data;
+	struct bio *bio = info->orig;
+	unsigned int nr_bytes = info->orig->bi_size;
+
+	free_bio_info(tio->md, info);
+	clone->bi_private = tio->md->bs;
+	bio_put(clone);
+
+	if (tio->error) {
+		/*
+		 * An error has already been detected on the request.
+		 * Once error occurred, just let clone->end_io() handle
+		 * the remainder.
+		 */
+		return;
+	} else if (error) {
+		/*
+		 * Don't notice the error to the upper layer yet.
+		 * The error handling decision is made by the target driver,
+		 * when the request is completed.
+		 */
+		tio->error = error;
+		return;
+	}
+
+	/*
+	 * I/O for the bio successfully completed.
+	 * Notice the data completion to the upper layer.
+	 */
+
+	/*
+	 * bios are processed from the head of the list.
+	 * So the completing bio should always be rq->bio.
+	 * If it's not, something wrong is happening.
+	 */
+	if (tio->orig->bio != bio)
+		DMERR("bio completion is going in the middle of the request");
+
+	/*
+	 * Update the original request.
+	 * Do not use blk_end_request() here, because it may complete
+	 * the original request before the clone, and break the ordering.
+	 */
+	blk_update_request(tio->orig, 0, nr_bytes);
+}
+
+static void free_bio_clone(struct request *clone)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
+	struct bio *bio;
+	struct dm_clone_bio_info *info;
+
+	while ((bio = clone->bio) != NULL) {
+		clone->bio = bio->bi_next;
+
+		info = bio->bi_private;
+		free_bio_info(md, info);
+
+		bio->bi_private = md->bs;
+		bio_put(bio);
+	}
+}
+
+static void dec_rq_pending(struct dm_rq_target_io *tio)
+{
+	if (!atomic_dec_return(&tio->md->pending))
+		/* nudge anyone waiting on suspend queue */
+		wake_up(&tio->md->wait);
+}
+
+static void dm_unprep_request(struct request *rq)
+{
+	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	rq->special = NULL;
+	rq->cmd_flags &= ~REQ_DONTPREP;
+
+	free_bio_clone(clone);
+	dec_rq_pending(tio);
+	free_rq_tio(tio->md, tio);
+}
+
+/*
+ * Requeue the original request of a clone.
+ */
+void dm_requeue_request(struct request *clone)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+	struct request_queue *q = rq->q;
+	unsigned long flags;
+
+	dm_unprep_request(rq);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (elv_queue_empty(q))
+		blk_plug_device(q);
+	blk_requeue_request(q, rq);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(dm_requeue_request);
+
+static inline void __stop_queue(struct request_queue *q)
+{
+	blk_stop_queue(q);
+}
+
+static void stop_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	__stop_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static inline void __start_queue(struct request_queue *q)
+{
+	if (blk_queue_stopped(q))
+		blk_start_queue(q);
+}
+
+static void start_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	__start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+/*
+ * Complete the clone and the original request
+ */
+static void dm_end_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+	struct request_queue *q = rq->q;
+	unsigned int nr_bytes = blk_rq_bytes(rq);
+
+	if (blk_pc_request(rq)) {
+		rq->errors = clone->errors;
+		rq->data_len = clone->data_len;
+
+		if (rq->sense)
+			/*
+			 * We are using the sense buffer of the original
+			 * request.
+			 * So setting the length of the sense data is enough.
+			 */
+			rq->sense_len = clone->sense_len;
+	}
+
+	free_bio_clone(clone);
+	dec_rq_pending(tio);
+	free_rq_tio(tio->md, tio);
+
+	if (unlikely(blk_end_request(rq, error, nr_bytes)))
+		BUG();
+
+	blk_run_queue(q);
+}
+
+/*
+ * Request completion handler for request-based dm
+ */
+static void dm_softirq_done(struct request *rq)
+{
+	struct request *clone = rq->completion_data;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
+	int error = tio->error;
+	int r;
+
+	if (rq->cmd_flags & REQ_FAILED)
+		goto end_request;
+
+	if (rq_end_io) {
+		r = rq_end_io(tio->ti, clone, error, &tio->info);
+		if (r <= 0)
+			/* The target wants to complete the I/O */
+			error = r;
+		else if (r == DM_ENDIO_INCOMPLETE)
+			/* The target will handle the I/O */
+			return;
+		else if (r == DM_ENDIO_REQUEUE) {
+			/*
+			 * The target wants to requeue the I/O.
+			 * Don't invoke blk_run_queue() so that the requeued
+			 * request won't be dispatched again soon.
+			 */
+			dm_requeue_request(clone);
+			return;
+		} else {
+			DMWARN("unimplemented target endio return value: %d",
+			       r);
+			BUG();
+		}
+	}
+
+end_request:
+	dm_end_request(clone, error);
+}
+
+/*
+ * Called with the queue lock held
+ */
+static void end_clone_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+
+	/*
+	 * For just cleaning up the information of the queue in which
+	 * the clone was dispatched.
+	 * The clone is *NOT* freed actually here because it is alloced from
+	 * dm own mempool and REQ_ALLOCED isn't set in clone->cmd_flags.
+	 */
+	__blk_put_request(clone->q, clone);
+
+	/*
+	 * Actual request completion is done in a softirq context which doesn't
+	 * hold the queue lock.  Otherwise, deadlock could occur because:
+	 *     - another request may be submitted by the upper level driver
+	 *       of the stacking during the completion
+	 *     - the submission which requires queue lock may be done
+	 *       against this queue
+	 */
+	tio->error = error;
+	rq->completion_data = clone;
+	blk_complete_request(rq);
+}
+
+/*
+ * Complete the original request of a clone with an error status.
+ * Target's rq_end_io() function isn't called.
+ * This may be used by target's map_rq() function when the mapping fails.
+ */
+void dm_kill_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+
+	tio->error = error;
+	/* Avoid printing "I/O error" message, since we didn't I/O actually */
+	rq->cmd_flags |= (REQ_FAILED | REQ_QUIET);
+	rq->completion_data = clone;
+	blk_complete_request(rq);
+}
+EXPORT_SYMBOL_GPL(dm_kill_request);
+
 static sector_t max_io_len(struct mapped_device *md,
 			   sector_t sector, struct dm_target *ti)
 {
@@ -922,7 +1218,7 @@ out:
  * The request function that just remaps the bio built up by
  * dm_merge_bvec.
  */
-static int dm_request(struct request_queue *q, struct bio *bio)
+static int _dm_request(struct request_queue *q, struct bio *bio)
 {
 	int r = -EIO;
 	int rw = bio_data_dir(bio);
@@ -972,12 +1268,311 @@ out_req:
 	return 0;
 }
 
+static int dm_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct mapped_device *md = (struct mapped_device *)q->queuedata;
+
+	if (unlikely(bio_barrier(bio))) {
+		bio_endio(bio, -EOPNOTSUPP);
+		return 0;
+	}
+
+	if (unlikely(!md->map)) {
+		bio_endio(bio, -EIO);
+		return 0;
+	}
+
+	return md->saved_make_request_fn(q, bio); /* call __make_request() */
+}
+
+static inline int dm_request_based(struct mapped_device *md)
+{
+	return blk_queue_stackable(md->queue);
+}
+
+static int dm_request(struct request_queue *q, struct bio *bio)
+{
+	struct mapped_device *md = q->queuedata;
+
+	if (dm_request_based(md))
+		return dm_make_request(q, bio);
+
+	return _dm_request(q, bio);
+}
+
+void dm_dispatch_request(struct request *rq)
+{
+	int r;
+
+	rq->start_time = jiffies;
+	r = blk_insert_cloned_request(rq->q, rq);
+	if (r)
+		dm_kill_request(rq, r);
+}
+EXPORT_SYMBOL_GPL(dm_dispatch_request);
+
+static void copy_request_info(struct request *clone, struct request *rq)
+{
+	clone->cmd_flags = (rq_data_dir(rq) | REQ_NOMERGE);
+	clone->cmd_type = rq->cmd_type;
+	clone->sector = rq->sector;
+	clone->hard_sector = rq->hard_sector;
+	clone->nr_sectors = rq->nr_sectors;
+	clone->hard_nr_sectors = rq->hard_nr_sectors;
+	clone->current_nr_sectors = rq->current_nr_sectors;
+	clone->hard_cur_sectors = rq->hard_cur_sectors;
+	clone->nr_phys_segments = rq->nr_phys_segments;
+	clone->ioprio = rq->ioprio;
+	clone->buffer = rq->buffer;
+	clone->cmd_len = rq->cmd_len;
+	if (rq->cmd_len)
+		clone->cmd = rq->cmd;
+	clone->data_len = rq->data_len;
+	clone->extra_len = rq->extra_len;
+	clone->sense_len = rq->sense_len;
+	clone->data = rq->data;
+	clone->sense = rq->sense;
+}
+
+static int clone_request_bios(struct request *clone, struct request *rq,
+			      struct mapped_device *md)
+{
+	struct bio *bio, *clone_bio;
+	struct dm_clone_bio_info *info;
+
+	for (bio = rq->bio; bio; bio = bio->bi_next) {
+		info = alloc_bio_info(md);
+		if (!info)
+			goto free_and_out;
+
+		clone_bio = bio_alloc_bioset(GFP_ATOMIC, bio->bi_max_vecs,
+					     md->bs);
+		if (!clone_bio) {
+			free_bio_info(md, info);
+			goto free_and_out;
+		}
+
+		__bio_clone(clone_bio, bio);
+		clone_bio->bi_destructor = dm_bio_destructor;
+		clone_bio->bi_end_io = end_clone_bio;
+		info->rq = clone;
+		info->orig = bio;
+		clone_bio->bi_private = info;
+
+		if (clone->bio) {
+			clone->biotail->bi_next = clone_bio;
+			clone->biotail = clone_bio;
+		} else
+			clone->bio = clone->biotail = clone_bio;
+	}
+
+	return 0;
+
+free_and_out:
+	free_bio_clone(clone);
+
+	return -ENOMEM;
+}
+
+static int setup_clone(struct request *clone, struct request *rq,
+		       struct dm_rq_target_io *tio)
+{
+	int r;
+
+	blk_rq_init(NULL, clone);
+
+	r = clone_request_bios(clone, rq, tio->md);
+	if (r)
+		return r;
+
+	copy_request_info(clone, rq);
+	clone->start_time = jiffies;
+	clone->end_io = end_clone_request;
+	clone->end_io_data = tio;
+
+	return 0;
+}
+
+static inline int dm_flush_suspending(struct mapped_device *md)
+{
+	return !md->suspend_rq.data;
+}
+
+/*
+ * Called with the queue lock held.
+ */
+static int dm_prep_fn(struct request_queue *q, struct request *rq)
+{
+	struct mapped_device *md = (struct mapped_device *)q->queuedata;
+	struct dm_rq_target_io *tio;
+	struct request *clone;
+
+	if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend marker */
+		if (dm_flush_suspending(md)) {
+			if (q->in_flight)
+				return BLKPREP_DEFER;
+			else {
+				/* This device should be quiet now */
+				__stop_queue(q);
+				smp_mb();
+				BUG_ON(atomic_read(&md->pending));
+				wake_up(&md->wait);
+				return BLKPREP_KILL;
+			}
+		} else
+			/*
+			 * The suspend process was interrupted.
+			 * So no need to suspend now.
+			 */
+			return BLKPREP_KILL;
+	}
+
+	if (unlikely(rq->special)) {
+		DMWARN("Already has something in rq->special.");
+		return BLKPREP_KILL;
+	}
+
+	if (unlikely(!dm_request_based(md))) {
+		DMWARN("Request was queued into bio-based device");
+		return BLKPREP_KILL;
+	}
+
+	tio = alloc_rq_tio(md); /* Only one for each original request */
+	if (!tio)
+		/* -ENOMEM */
+		return BLKPREP_DEFER;
+
+	tio->md = md;
+	tio->ti = NULL;
+	tio->orig = rq;
+	tio->error = 0;
+	memset(&tio->info, 0, sizeof(tio->info));
+
+	clone = &tio->clone;
+	if (setup_clone(clone, rq, tio)) {
+		/* -ENOMEM */
+		free_rq_tio(md, tio);
+		return BLKPREP_DEFER;
+	}
+
+	rq->special = clone;
+	rq->cmd_flags |= REQ_DONTPREP;
+
+	return BLKPREP_OK;
+}
+
+static void map_request(struct dm_target *ti, struct request *rq,
+			struct mapped_device *md)
+{
+	int r;
+	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	tio->ti = ti;
+	atomic_inc(&md->pending);
+	r = ti->type->map_rq(ti, clone, &tio->info);
+	switch (r) {
+	case DM_MAPIO_SUBMITTED:
+		/* The target has taken the I/O to submit by itself later */
+		break;
+	case DM_MAPIO_REMAPPED:
+		/* The target has remapped the I/O so dispatch it */
+		dm_dispatch_request(clone);
+		break;
+	case DM_MAPIO_REQUEUE:
+		/* The target wants to requeue the I/O */
+		dm_requeue_request(clone);
+		break;
+	default:
+		if (r > 0) {
+			DMWARN("unimplemented target map return value: %d", r);
+			BUG();
+		}
+
+		/* The target wants to complete the I/O */
+		dm_kill_request(clone, r);
+		break;
+	}
+}
+
+/*
+ * q->request_fn for request-based dm.
+ * Called with the queue lock held.
+ */
+static void dm_request_fn(struct request_queue *q)
+{
+	struct mapped_device *md = (struct mapped_device *)q->queuedata;
+	struct dm_table *map = dm_get_table(md);
+	struct dm_target *ti;
+	struct request *rq;
+
+	/*
+	 * The check for blk_queue_stopped() needs here, because:
+	 *     - device suspend uses blk_stop_queue() and expects that
+	 *       no I/O will be dispatched any more after the queue stop
+	 *     - generic_unplug_device() doesn't call q->request_fn()
+	 *       when the queue is stopped, so no problem
+	 *     - but underlying device drivers may call q->request_fn()
+	 *       without the check through blk_run_queue()
+	 */
+	while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
+		rq = elv_next_request(q);
+		if (!rq)
+			goto plug_and_out;
+
+		ti = dm_table_find_target(map, rq->sector);
+		if (ti->type->busy && ti->type->busy(ti))
+			goto plug_and_out;
+
+		blkdev_dequeue_request(rq);
+		spin_unlock(q->queue_lock);
+		map_request(ti, rq, md);
+		spin_lock_irq(q->queue_lock);
+	}
+
+	goto out;
+
+plug_and_out:
+	if (!elv_queue_empty(q))
+		/* Some requests still remain, retry later */
+		blk_plug_device(q);
+
+out:
+	dm_table_put(map);
+
+	return;
+}
+
+int dm_underlying_device_busy(struct request_queue *q)
+{
+	return blk_lld_busy(q);
+}
+EXPORT_SYMBOL_GPL(dm_underlying_device_busy);
+
+static int dm_lld_busy(struct request_queue *q)
+{
+	int r;
+	struct mapped_device *md = q->queuedata;
+	struct dm_table *map = dm_get_table(md);
+
+	if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
+		r = 1;
+	else
+		r = dm_table_any_busy_target(map);
+
+	dm_table_put(map);
+	return r;
+}
+
 static void dm_unplug_all(struct request_queue *q)
 {
 	struct mapped_device *md = q->queuedata;
 	struct dm_table *map = dm_get_table(md);
 
 	if (map) {
+		if (dm_request_based(md))
+			generic_unplug_device(q);
+
 		dm_table_unplug_all(map);
 		dm_table_put(map);
 	}
@@ -991,6 +1586,12 @@ static int dm_any_congested(void *conges
 
 	if (!map || test_bit(DMF_BLOCK_IO, &md->flags))
 		r = bdi_bits;
+	else if (dm_request_based(md))
+		/*
+		 * Request-based dm cares about only own queue for
+		 * the query about congestion status of request_queue
+		 */
+		r = md->queue->backing_dev_info.state & bdi_bits;
 	else
 		r = dm_table_any_congested(map, bdi_bits);
 
@@ -1382,7 +1983,11 @@ static int dm_wait_for_completion(struct
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		smp_mb();
-		if (!atomic_read(&md->pending))
+		if (dm_request_based(md)) {
+			if (!atomic_read(&md->pending) &&
+			    blk_queue_stopped(md->queue))
+				break;
+		} else if (!atomic_read(&md->pending))
 			break;
 
 		if (signal_pending(current)) {
@@ -1484,6 +2089,88 @@ out:
 	return r;
 }
 
+static inline void dm_invalidate_flush_suspend(struct mapped_device *md)
+{
+	md->suspend_rq.data = (void *)0x1;
+}
+
+static void dm_abort_suspend(struct mapped_device *md, int noflush)
+{
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	/*
+	 * For flush suspend, invalidation and queue restart must be protected
+	 * by a single queue lock to prevent a race with dm_prep_fn().
+	 */
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (!noflush)
+		dm_invalidate_flush_suspend(md);
+	__start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+/*
+ * Additional suspend work for request-based dm.
+ *
+ * In request-based dm, stopping request_queue prevents mapping.
+ * Even after stopping the request_queue, submitted requests from upper-layer
+ * can be inserted to the request_queue.  So original (unmapped) requests are
+ * kept in the request_queue during suspension.
+ */
+static void dm_start_suspend(struct mapped_device *md, int noflush)
+{
+	struct request *rq = &md->suspend_rq;
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	if (noflush) {
+		stop_queue(q);
+		return;
+	}
+
+	/*
+	 * For flush suspend, we need a marker to indicate the border line
+	 * between flush needed I/Os and deferred I/Os, since all I/Os are
+	 * queued in the request_queue during suspension.
+	 *
+	 * This marker must be inserted after setting DMF_BLOCK_IO,
+	 * because dm_prep_fn() considers no DMF_BLOCK_IO to be
+	 * a suspend interruption.
+	 */
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (unlikely(rq->ref_count)) {
+		/*
+		 * This can happen when the previous suspend was interrupted,
+		 * the inserted suspend_rq for the previous suspend has still
+		 * been in the queue and this suspend has been invoked.
+		 *
+		 * We could re-insert the suspend_rq by deleting it from
+		 * the queue forcibly using list_del_init(&rq->queuelist).
+		 * But it would break the block-layer easily.
+		 * So we don't re-insert the suspend_rq again in such a case.
+		 * The suspend_rq should be already invalidated during
+		 * the previous suspend interruption, so just wait for it
+		 * to be completed.
+		 *
+		 * This suspend will never complete, so warn the user to
+		 * interrupt this suspend and retry later.
+		 */
+		BUG_ON(!rq->data);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		DMWARN("Invalidating the previous suspend is still in"
+		       " progress.  This suspend will be never done."
+		       " Please interrupt this suspend and retry later.");
+		return;
+	}
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	/* Now no user of the suspend_rq */
+	blk_rq_init(q, rq);
+	blk_insert_request(q, rq, 0, NULL);
+}
+
 /*
  * Functions to lock and unlock any filesystem running on the
  * device.
@@ -1582,6 +2269,9 @@ int dm_suspend(struct mapped_device *md,
 	add_wait_queue(&md->wait, &wait);
 	up_write(&md->io_lock);
 
+	if (dm_request_based(md))
+		dm_start_suspend(md, noflush);
+
 	/* unplug */
 	if (map)
 		dm_table_unplug_all(map);
@@ -1594,14 +2284,22 @@ int dm_suspend(struct mapped_device *md,
 	down_write(&md->io_lock);
 	remove_wait_queue(&md->wait, &wait);
 
-	if (noflush)
-		__merge_pushback_list(md);
+	if (noflush) {
+		if (dm_request_based(md))
+			/* All requeued requests are already in md->queue */
+			clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
+		else
+			__merge_pushback_list(md);
+	}
 	up_write(&md->io_lock);
 
 	/* were we interrupted ? */
 	if (r < 0) {
 		dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);
 
+		if (dm_request_based(md))
+			dm_abort_suspend(md, noflush);
+
 		unlock_fs(md);
 		goto out; /* pushback list is already flushed, so skip flush */
 	}
@@ -1642,6 +2340,14 @@ int dm_resume(struct mapped_device *md)
 
 	dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);
 
+	/*
+	 * Flushing deferred I/Os must be done after targets are resumed
+	 * so that mapping of targets can work correctly.
+	 * Request-based dm is queueing the deferred I/Os in its request_queue.
+	 */
+	if (dm_request_based(md))
+		start_queue(md->queue);
+
 	unlock_fs(md);
 
 	if (md->suspended_bdev) {
Index: 2.6.27-rc8/drivers/md/dm.h
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.h
+++ 2.6.27-rc8/drivers/md/dm.h
@@ -46,6 +46,7 @@ void dm_table_presuspend_targets(struct 
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_busy_target(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().
@@ -91,6 +92,7 @@ void dm_stripe_exit(void);
 
 int dm_open_count(struct mapped_device *md);
 int dm_lock_for_deletion(struct mapped_device *md);
+union map_info *dm_get_rq_mapinfo(struct request *rq);
 
 void dm_kobject_uevent(struct mapped_device *md);
 
Index: 2.6.27-rc8/drivers/md/dm-table.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-table.c
+++ 2.6.27-rc8/drivers/md/dm-table.c
@@ -956,6 +956,20 @@ int dm_table_any_congested(struct dm_tab
 	return r;
 }
 
+int dm_table_any_busy_target(struct dm_table *t)
+{
+	int i;
+	struct dm_target *ti;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = t->targets + i;
+		if (ti->type->busy && ti->type->busy(ti))
+			return 1;
+	}
+
+	return 0;
+}
+
 void dm_table_unplug_all(struct dm_table *t)
 {
 	struct dm_dev_internal *dd;
Index: 2.6.27-rc8/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc8.orig/include/linux/device-mapper.h
+++ 2.6.27-rc8/include/linux/device-mapper.h
@@ -379,4 +379,12 @@ static inline unsigned long to_bytes(sec
 	return (n << SECTOR_SHIFT);
 }
 
+/*-----------------------------------------------------------------
+ * Helper for block layer and dm core operations
+ *---------------------------------------------------------------*/
+void dm_dispatch_request(struct request *rq);
+void dm_requeue_request(struct request *rq);
+void dm_kill_request(struct request *rq, int error);
+int dm_underlying_device_busy(struct request_queue *q);
+
 #endif	/* _LINUX_DEVICE_MAPPER_H */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 6/8] dm core: enable request-based dm
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
@ 2008-10-03 15:19   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:19 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are more fitting to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since bio-based device can't receive struct request,
  there are some limitations on device stacking between
  bio-based and request-based.

                     type of underlying device
                   bio-based      requeset-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     NG                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table loading time.
  Until then, mempool creations are deferred, since mempools for
  request-based dm are different from those for bio-based dm.
  Once the type of a dm device is decided, the type can't be changed.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  needs that all target drivers support merge() function.
  Both will take a time.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-ioctl.c         |   13 ++++
 drivers/md/dm-table.c         |   68 +++++++++++++++++++++++
 drivers/md/dm.c               |  123 ++++++++++++++++++++++++++++++++++--------
 drivers/md/dm.h               |   15 +++++
 include/linux/device-mapper.h |    1 
 5 files changed, 197 insertions(+), 23 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm-table.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-table.c
+++ 2.6.27-rc8/drivers/md/dm-table.c
@@ -108,6 +108,8 @@ static void combine_restrictions_low(str
 	lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn);
 
 	lhs->no_cluster |= rhs->no_cluster;
+
+	lhs->no_request_stacking |= rhs->no_request_stacking;
 }
 
 /*
@@ -526,6 +528,8 @@ void dm_set_device_limits(struct dm_targ
 	rs->bounce_pfn = min_not_zero(rs->bounce_pfn, q->bounce_pfn);
 
 	rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
+
+	rs->no_request_stacking |= !blk_queue_stackable(q);
 }
 EXPORT_SYMBOL_GPL(dm_set_device_limits);
 
@@ -738,6 +742,66 @@ int dm_table_add_target(struct dm_table 
 	return r;
 }
 
+int dm_table_set_type(struct dm_table *t)
+{
+	int i;
+	int bio_based = 0, request_based = 0;
+	struct dm_target *tgt;
+
+	for (i = 0; i < t->num_targets; i++) {
+		tgt = t->targets + i;
+		if (tgt->type->map_rq)
+			request_based = 1;
+		else
+			bio_based = 1;
+
+		if (bio_based && request_based) {
+			DMWARN("Inconsistent table: different target types"
+			       " can't be mixed up");
+			return -EINVAL;
+		}
+	}
+
+	if (bio_based) {
+		/* We must use this table as bio-based */
+		t->limits.no_request_stacking = 1;
+		return 0;
+	}
+
+	BUG_ON(!request_based); /* No targets in this table */
+
+	/* Non-request-stackable devices can't be used for request-based dm */
+	if (t->limits.no_request_stacking) {
+		DMWARN("table load rejected: including non-request-stackable"
+		       " devices");
+		return -EINVAL;
+	}
+
+	/*
+	 * Request-based dm supports only tables that have a single target now.
+	 * To support multiple targets, request splitting support is needed,
+	 * and that needs lots of changes in the block-layer.
+	 * (e.g. request completion process for partial completion.)
+	 */
+	if (t->num_targets > 1) {
+		DMWARN("Request-based dm doesn't support multiple targets yet");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int dm_table_get_type(struct dm_table *t)
+{
+	return t->limits.no_request_stacking ?
+		DM_TYPE_BIO_BASED : DM_TYPE_REQUEST_BASED;
+}
+
+int dm_table_request_based(struct dm_table *t)
+{
+	return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
+}
+
 static int setup_indexes(struct dm_table *t)
 {
 	int i;
@@ -868,6 +932,10 @@ void dm_table_set_restrictions(struct dm
 	else
 		queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, q);
 
+	if (t->limits.no_request_stacking)
+		queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, q);
+	else
+		queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q);
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -160,6 +160,8 @@ struct mapped_device {
 
 	struct bio_set *bs;
 
+	unsigned int mempool_type; /* Type of mempools above. */
+
 	/*
 	 * Event handling.
 	 */
@@ -1712,10 +1714,22 @@ static struct mapped_device *alloc_dev(i
 	INIT_LIST_HEAD(&md->uevent_list);
 	spin_lock_init(&md->uevent_lock);
 
-	md->queue = blk_alloc_queue(GFP_KERNEL);
+	md->queue = blk_init_queue(dm_request_fn, NULL);
 	if (!md->queue)
 		goto bad_queue;
 
+	/*
+	 * Request-based dm devices cannot be stacked on top of bio-based dm
+	 * devices.  The type of this dm device has not been decided yet,
+	 * although we initialized the queue using blk_init_queue().
+	 * The type is decided at the first table loading time.
+	 * To prevent problematic device stacking, clear the queue flag
+	 * for request stacking support until then.
+	 *
+	 * This queue is new, so no concurrency on the queue_flags.
+	 */
+	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
+	md->saved_make_request_fn = md->queue->make_request_fn;
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
@@ -1723,18 +1737,9 @@ static struct mapped_device *alloc_dev(i
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
-
-	md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache);
-	if (!md->io_pool)
-		goto bad_io_pool;
-
-	md->tio_pool = mempool_create_slab_pool(MIN_IOS, _tio_cache);
-	if (!md->tio_pool)
-		goto bad_tio_pool;
-
-	md->bs = bioset_create(16, 16);
-	if (!md->bs)
-		goto bad_no_bioset;
+	blk_queue_softirq_done(md->queue, dm_softirq_done);
+	blk_queue_prep_rq(md->queue, dm_prep_fn);
+	blk_queue_lld_busy(md->queue, dm_lld_busy);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
@@ -1769,12 +1774,6 @@ static struct mapped_device *alloc_dev(i
 bad_thread:
 	put_disk(md->disk);
 bad_disk:
-	bioset_free(md->bs);
-bad_no_bioset:
-	mempool_destroy(md->tio_pool);
-bad_tio_pool:
-	mempool_destroy(md->io_pool);
-bad_io_pool:
 	blk_cleanup_queue(md->queue);
 bad_queue:
 	free_minor(minor);
@@ -1796,9 +1795,12 @@ static void free_dev(struct mapped_devic
 		bdput(md->suspended_bdev);
 	}
 	destroy_workqueue(md->wq);
-	mempool_destroy(md->tio_pool);
-	mempool_destroy(md->io_pool);
-	bioset_free(md->bs);
+	if (md->tio_pool)
+		mempool_destroy(md->tio_pool);
+	if (md->io_pool)
+		mempool_destroy(md->io_pool);
+	if (md->bs)
+		bioset_free(md->bs);
 	del_gendisk(md->disk);
 	free_minor(minor);
 
@@ -1861,6 +1863,16 @@ static int __bind(struct mapped_device *
 	dm_table_get(t);
 	dm_table_event_callback(t, event_callback, md);
 
+	/*
+	 * The queue hasn't been stopped yet, if the old table type wasn't
+	 * for request-based during suspension.  So stop it to prevent
+	 * I/O mapping before resume.
+	 * This must be done before setting the queue restrictions,
+	 * because request-based dm may be run just after the setting.
+	 */
+	if (dm_table_request_based(t) && !blk_queue_stopped(q))
+		stop_queue(q);
+
 	write_lock(&md->map_lock);
 	md->map = t;
 	dm_table_set_restrictions(t, q);
@@ -2010,7 +2022,13 @@ static void __flush_deferred_io(struct m
 	struct bio *c;
 
 	while ((c = bio_list_pop(&md->deferred))) {
-		if (__split_bio(md, c))
+		/*
+		 * Some bios might have been queued here during suspension
+		 * before setting of request-based dm in resume
+		 */
+		if (dm_request_based(md))
+			generic_make_request(c);
+		else if (__split_bio(md, c))
 			bio_io_error(c);
 	}
 
@@ -2428,6 +2446,65 @@ int dm_noflush_suspending(struct dm_targ
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
+int dm_init_md_mempool(struct mapped_device *md, int type)
+{
+	if (unlikely(type == DM_TYPE_NONE)) {
+		DMWARN("no type is specified, can't initialize mempool");
+		return -EINVAL;
+	}
+
+	if (md->mempool_type == type)
+		return 0;
+
+	if (md->map) {
+		/* The md has been using, can't change the mempool type */
+		DMWARN("can't change mempool type after a table is bound");
+		return -EINVAL;
+	}
+
+	/* Not using the md yet, we can still change the mempool type */
+	if (md->mempool_type != DM_TYPE_NONE) {
+		mempool_destroy(md->io_pool);
+		md->io_pool = NULL;
+		mempool_destroy(md->tio_pool);
+		md->tio_pool = NULL;
+		bioset_free(md->bs);
+		md->bs = NULL;
+		md->mempool_type = DM_TYPE_NONE;
+	}
+
+	md->io_pool = (type == DM_TYPE_BIO_BASED) ?
+		      mempool_create_slab_pool(MIN_IOS, _io_cache) :
+		      mempool_create_slab_pool(MIN_IOS, _bio_info_cache);
+	if (!md->io_pool)
+		return -ENOMEM;
+
+	md->tio_pool = (type == DM_TYPE_BIO_BASED) ?
+		       mempool_create_slab_pool(MIN_IOS, _tio_cache) :
+		       mempool_create_slab_pool(MIN_IOS, _rq_tio_cache);
+	if (!md->tio_pool)
+		goto free_io_pool_and_out;
+
+	md->bs = (type == DM_TYPE_BIO_BASED) ?
+		 bioset_create(16, 16) : bioset_create(MIN_IOS, MIN_IOS);
+	if (!md->bs)
+		goto free_tio_pool_and_out;
+
+	md->mempool_type = type;
+
+	return 0;
+
+free_tio_pool_and_out:
+	mempool_destroy(md->tio_pool);
+	md->tio_pool = NULL;
+
+free_io_pool_and_out:
+	mempool_destroy(md->io_pool);
+	md->io_pool = NULL;
+
+	return -ENOMEM;
+}
+
 static struct block_device_operations dm_blk_dops = {
 	.open = dm_blk_open,
 	.release = dm_blk_close,
Index: 2.6.27-rc8/drivers/md/dm.h
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.h
+++ 2.6.27-rc8/drivers/md/dm.h
@@ -23,6 +23,13 @@
 #define DM_SUSPEND_NOFLUSH_FLAG		(1 << 1)
 
 /*
+ * Type of table and mapped_device's mempool
+ */
+#define DM_TYPE_NONE		0
+#define DM_TYPE_BIO_BASED	1
+#define DM_TYPE_REQUEST_BASED	2
+
+/*
  * List of devices that a metadevice uses and should open/close.
  */
 struct dm_dev_internal {
@@ -47,6 +54,9 @@ void dm_table_postsuspend_targets(struct
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
 int dm_table_any_busy_target(struct dm_table *t);
+int dm_table_set_type(struct dm_table *t);
+int dm_table_get_type(struct dm_table *t);
+int dm_table_request_based(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().
@@ -99,4 +109,9 @@ void dm_kobject_uevent(struct mapped_dev
 int dm_kcopyd_init(void);
 void dm_kcopyd_exit(void);
 
+/*
+ * Mempool initializer for a mapped_device
+ */
+int dm_init_md_mempool(struct mapped_device *md, int type);
+
 #endif
Index: 2.6.27-rc8/drivers/md/dm-ioctl.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-ioctl.c
+++ 2.6.27-rc8/drivers/md/dm-ioctl.c
@@ -1045,6 +1045,12 @@ static int populate_table(struct dm_tabl
 		next = spec->next;
 	}
 
+	r = dm_table_set_type(table);
+	if (r) {
+		DMWARN("unable to set table type");
+		return r;
+	}
+
 	return dm_table_complete(table);
 }
 
@@ -1069,6 +1075,13 @@ static int table_load(struct dm_ioctl *p
 		goto out;
 	}
 
+	r = dm_init_md_mempool(md, dm_table_get_type(t));
+	if (r) {
+		DMWARN("unable to initialize the md mempools for this table");
+		dm_table_put(t);
+		goto out;
+	}
+
 	down_write(&_hash_lock);
 	hc = dm_get_mdptr(md);
 	if (!hc || hc->md != md) {
Index: 2.6.27-rc8/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc8.orig/include/linux/device-mapper.h
+++ 2.6.27-rc8/include/linux/device-mapper.h
@@ -145,6 +145,7 @@ struct io_restrictions {
 	unsigned short max_hw_segments;
 	unsigned short max_phys_segments;
 	unsigned char no_cluster; /* inverted so that 0 is default */
+	unsigned char no_request_stacking;
 };
 
 struct dm_target {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 6/8] dm core: enable request-based dm
@ 2008-10-03 15:19   ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:19 UTC (permalink / raw)
  To: agk; +Cc: k-ueda, linux-scsi, linux-kernel, dm-devel, j-nomura, mbroz

This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are more fitting to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since bio-based device can't receive struct request,
  there are some limitations on device stacking between
  bio-based and request-based.

                     type of underlying device
                   bio-based      requeset-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     NG                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table loading time.
  Until then, mempool creations are deferred, since mempools for
  request-based dm are different from those for bio-based dm.
  Once the type of a dm device is decided, the type can't be changed.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  needs that all target drivers support merge() function.
  Both will take a time.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-ioctl.c         |   13 ++++
 drivers/md/dm-table.c         |   68 +++++++++++++++++++++++
 drivers/md/dm.c               |  123 ++++++++++++++++++++++++++++++++++--------
 drivers/md/dm.h               |   15 +++++
 include/linux/device-mapper.h |    1 
 5 files changed, 197 insertions(+), 23 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm-table.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-table.c
+++ 2.6.27-rc8/drivers/md/dm-table.c
@@ -108,6 +108,8 @@ static void combine_restrictions_low(str
 	lhs->bounce_pfn = min_not_zero(lhs->bounce_pfn, rhs->bounce_pfn);
 
 	lhs->no_cluster |= rhs->no_cluster;
+
+	lhs->no_request_stacking |= rhs->no_request_stacking;
 }
 
 /*
@@ -526,6 +528,8 @@ void dm_set_device_limits(struct dm_targ
 	rs->bounce_pfn = min_not_zero(rs->bounce_pfn, q->bounce_pfn);
 
 	rs->no_cluster |= !test_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
+
+	rs->no_request_stacking |= !blk_queue_stackable(q);
 }
 EXPORT_SYMBOL_GPL(dm_set_device_limits);
 
@@ -738,6 +742,66 @@ int dm_table_add_target(struct dm_table 
 	return r;
 }
 
+int dm_table_set_type(struct dm_table *t)
+{
+	int i;
+	int bio_based = 0, request_based = 0;
+	struct dm_target *tgt;
+
+	for (i = 0; i < t->num_targets; i++) {
+		tgt = t->targets + i;
+		if (tgt->type->map_rq)
+			request_based = 1;
+		else
+			bio_based = 1;
+
+		if (bio_based && request_based) {
+			DMWARN("Inconsistent table: different target types"
+			       " can't be mixed up");
+			return -EINVAL;
+		}
+	}
+
+	if (bio_based) {
+		/* We must use this table as bio-based */
+		t->limits.no_request_stacking = 1;
+		return 0;
+	}
+
+	BUG_ON(!request_based); /* No targets in this table */
+
+	/* Non-request-stackable devices can't be used for request-based dm */
+	if (t->limits.no_request_stacking) {
+		DMWARN("table load rejected: including non-request-stackable"
+		       " devices");
+		return -EINVAL;
+	}
+
+	/*
+	 * Request-based dm supports only tables that have a single target now.
+	 * To support multiple targets, request splitting support is needed,
+	 * and that needs lots of changes in the block-layer.
+	 * (e.g. request completion process for partial completion.)
+	 */
+	if (t->num_targets > 1) {
+		DMWARN("Request-based dm doesn't support multiple targets yet");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int dm_table_get_type(struct dm_table *t)
+{
+	return t->limits.no_request_stacking ?
+		DM_TYPE_BIO_BASED : DM_TYPE_REQUEST_BASED;
+}
+
+int dm_table_request_based(struct dm_table *t)
+{
+	return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
+}
+
 static int setup_indexes(struct dm_table *t)
 {
 	int i;
@@ -868,6 +932,10 @@ void dm_table_set_restrictions(struct dm
 	else
 		queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, q);
 
+	if (t->limits.no_request_stacking)
+		queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, q);
+	else
+		queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q);
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -160,6 +160,8 @@ struct mapped_device {
 
 	struct bio_set *bs;
 
+	unsigned int mempool_type; /* Type of mempools above. */
+
 	/*
 	 * Event handling.
 	 */
@@ -1712,10 +1714,22 @@ static struct mapped_device *alloc_dev(i
 	INIT_LIST_HEAD(&md->uevent_list);
 	spin_lock_init(&md->uevent_lock);
 
-	md->queue = blk_alloc_queue(GFP_KERNEL);
+	md->queue = blk_init_queue(dm_request_fn, NULL);
 	if (!md->queue)
 		goto bad_queue;
 
+	/*
+	 * Request-based dm devices cannot be stacked on top of bio-based dm
+	 * devices.  The type of this dm device has not been decided yet,
+	 * although we initialized the queue using blk_init_queue().
+	 * The type is decided at the first table loading time.
+	 * To prevent problematic device stacking, clear the queue flag
+	 * for request stacking support until then.
+	 *
+	 * This queue is new, so no concurrency on the queue_flags.
+	 */
+	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
+	md->saved_make_request_fn = md->queue->make_request_fn;
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
@@ -1723,18 +1737,9 @@ static struct mapped_device *alloc_dev(i
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
-
-	md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache);
-	if (!md->io_pool)
-		goto bad_io_pool;
-
-	md->tio_pool = mempool_create_slab_pool(MIN_IOS, _tio_cache);
-	if (!md->tio_pool)
-		goto bad_tio_pool;
-
-	md->bs = bioset_create(16, 16);
-	if (!md->bs)
-		goto bad_no_bioset;
+	blk_queue_softirq_done(md->queue, dm_softirq_done);
+	blk_queue_prep_rq(md->queue, dm_prep_fn);
+	blk_queue_lld_busy(md->queue, dm_lld_busy);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
@@ -1769,12 +1774,6 @@ static struct mapped_device *alloc_dev(i
 bad_thread:
 	put_disk(md->disk);
 bad_disk:
-	bioset_free(md->bs);
-bad_no_bioset:
-	mempool_destroy(md->tio_pool);
-bad_tio_pool:
-	mempool_destroy(md->io_pool);
-bad_io_pool:
 	blk_cleanup_queue(md->queue);
 bad_queue:
 	free_minor(minor);
@@ -1796,9 +1795,12 @@ static void free_dev(struct mapped_devic
 		bdput(md->suspended_bdev);
 	}
 	destroy_workqueue(md->wq);
-	mempool_destroy(md->tio_pool);
-	mempool_destroy(md->io_pool);
-	bioset_free(md->bs);
+	if (md->tio_pool)
+		mempool_destroy(md->tio_pool);
+	if (md->io_pool)
+		mempool_destroy(md->io_pool);
+	if (md->bs)
+		bioset_free(md->bs);
 	del_gendisk(md->disk);
 	free_minor(minor);
 
@@ -1861,6 +1863,16 @@ static int __bind(struct mapped_device *
 	dm_table_get(t);
 	dm_table_event_callback(t, event_callback, md);
 
+	/*
+	 * The queue hasn't been stopped yet, if the old table type wasn't
+	 * for request-based during suspension.  So stop it to prevent
+	 * I/O mapping before resume.
+	 * This must be done before setting the queue restrictions,
+	 * because request-based dm may be run just after the setting.
+	 */
+	if (dm_table_request_based(t) && !blk_queue_stopped(q))
+		stop_queue(q);
+
 	write_lock(&md->map_lock);
 	md->map = t;
 	dm_table_set_restrictions(t, q);
@@ -2010,7 +2022,13 @@ static void __flush_deferred_io(struct m
 	struct bio *c;
 
 	while ((c = bio_list_pop(&md->deferred))) {
-		if (__split_bio(md, c))
+		/*
+		 * Some bios might have been queued here during suspension
+		 * before setting of request-based dm in resume
+		 */
+		if (dm_request_based(md))
+			generic_make_request(c);
+		else if (__split_bio(md, c))
 			bio_io_error(c);
 	}
 
@@ -2428,6 +2446,65 @@ int dm_noflush_suspending(struct dm_targ
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
+int dm_init_md_mempool(struct mapped_device *md, int type)
+{
+	if (unlikely(type == DM_TYPE_NONE)) {
+		DMWARN("no type is specified, can't initialize mempool");
+		return -EINVAL;
+	}
+
+	if (md->mempool_type == type)
+		return 0;
+
+	if (md->map) {
+		/* The md has been using, can't change the mempool type */
+		DMWARN("can't change mempool type after a table is bound");
+		return -EINVAL;
+	}
+
+	/* Not using the md yet, we can still change the mempool type */
+	if (md->mempool_type != DM_TYPE_NONE) {
+		mempool_destroy(md->io_pool);
+		md->io_pool = NULL;
+		mempool_destroy(md->tio_pool);
+		md->tio_pool = NULL;
+		bioset_free(md->bs);
+		md->bs = NULL;
+		md->mempool_type = DM_TYPE_NONE;
+	}
+
+	md->io_pool = (type == DM_TYPE_BIO_BASED) ?
+		      mempool_create_slab_pool(MIN_IOS, _io_cache) :
+		      mempool_create_slab_pool(MIN_IOS, _bio_info_cache);
+	if (!md->io_pool)
+		return -ENOMEM;
+
+	md->tio_pool = (type == DM_TYPE_BIO_BASED) ?
+		       mempool_create_slab_pool(MIN_IOS, _tio_cache) :
+		       mempool_create_slab_pool(MIN_IOS, _rq_tio_cache);
+	if (!md->tio_pool)
+		goto free_io_pool_and_out;
+
+	md->bs = (type == DM_TYPE_BIO_BASED) ?
+		 bioset_create(16, 16) : bioset_create(MIN_IOS, MIN_IOS);
+	if (!md->bs)
+		goto free_tio_pool_and_out;
+
+	md->mempool_type = type;
+
+	return 0;
+
+free_tio_pool_and_out:
+	mempool_destroy(md->tio_pool);
+	md->tio_pool = NULL;
+
+free_io_pool_and_out:
+	mempool_destroy(md->io_pool);
+	md->io_pool = NULL;
+
+	return -ENOMEM;
+}
+
 static struct block_device_operations dm_blk_dops = {
 	.open = dm_blk_open,
 	.release = dm_blk_close,
Index: 2.6.27-rc8/drivers/md/dm.h
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.h
+++ 2.6.27-rc8/drivers/md/dm.h
@@ -23,6 +23,13 @@
 #define DM_SUSPEND_NOFLUSH_FLAG		(1 << 1)
 
 /*
+ * Type of table and mapped_device's mempool
+ */
+#define DM_TYPE_NONE		0
+#define DM_TYPE_BIO_BASED	1
+#define DM_TYPE_REQUEST_BASED	2
+
+/*
  * List of devices that a metadevice uses and should open/close.
  */
 struct dm_dev_internal {
@@ -47,6 +54,9 @@ void dm_table_postsuspend_targets(struct
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
 int dm_table_any_busy_target(struct dm_table *t);
+int dm_table_set_type(struct dm_table *t);
+int dm_table_get_type(struct dm_table *t);
+int dm_table_request_based(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().
@@ -99,4 +109,9 @@ void dm_kobject_uevent(struct mapped_dev
 int dm_kcopyd_init(void);
 void dm_kcopyd_exit(void);
 
+/*
+ * Mempool initializer for a mapped_device
+ */
+int dm_init_md_mempool(struct mapped_device *md, int type);
+
 #endif
Index: 2.6.27-rc8/drivers/md/dm-ioctl.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-ioctl.c
+++ 2.6.27-rc8/drivers/md/dm-ioctl.c
@@ -1045,6 +1045,12 @@ static int populate_table(struct dm_tabl
 		next = spec->next;
 	}
 
+	r = dm_table_set_type(table);
+	if (r) {
+		DMWARN("unable to set table type");
+		return r;
+	}
+
 	return dm_table_complete(table);
 }
 
@@ -1069,6 +1075,13 @@ static int table_load(struct dm_ioctl *p
 		goto out;
 	}
 
+	r = dm_init_md_mempool(md, dm_table_get_type(t));
+	if (r) {
+		DMWARN("unable to initialize the md mempools for this table");
+		dm_table_put(t);
+		goto out;
+	}
+
 	down_write(&_hash_lock);
 	hc = dm_get_mdptr(md);
 	if (!hc || hc->md != md) {
Index: 2.6.27-rc8/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc8.orig/include/linux/device-mapper.h
+++ 2.6.27-rc8/include/linux/device-mapper.h
@@ -145,6 +145,7 @@ struct io_restrictions {
 	unsigned short max_hw_segments;
 	unsigned short max_phys_segments;
 	unsigned char no_cluster; /* inverted so that 0 is default */
+	unsigned char no_request_stacking;
 };
 
 struct dm_target {

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 7/8] dm core: reject I/O violating new queue limits
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
@ 2008-10-03 15:19   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:19 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch detects requests violating the queue limitations
and rejects them.

The same limitation checks are done when requests are submitted
to the queue by blk_insert_cloned_request().
However, such violation can happen if a table is swapped and
the queue limitations are shrunk while some requests are
in the queue.

Since struct request is a reliable one in the block layer and
device drivers, dispatching such requests is pretty dangerous.
(e.g. it may cause kernel panic easily.)
So avoid to dispatch such problematic requests in request-based dm.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -1472,6 +1472,30 @@ static void map_request(struct dm_target
 
 	tio->ti = ti;
 	atomic_inc(&md->pending);
+
+	/*
+	 * Although submitted requests to the md->queue are checked against
+	 * the table/queue limitations at the submission time, the limitations
+	 * may be changed by a table swapping while those already checked
+	 * requests are in the md->queue.
+	 * If the limitations have been shrunk in such situations, we may be
+	 * dispatching requests violating the current limitations here.
+	 * Since struct request is a reliable one in the block-layer
+	 * and device drivers, dispatching such requests is dangerous.
+	 * (e.g. it may cause kernel panic easily.)
+	 * Avoid to dispatch such problematic requests in request-based dm.
+	 *
+	 * Since dm_kill_request() decrements the md->pending, this have to
+	 * be done after incrementing the md->pending.
+	 */
+	r = blk_rq_check_limits(rq->q, rq);
+	if (unlikely(r)) {
+		DMWARN("violating the queue limitation. the limitation may be"
+		       " shrunk while there are some requests in the queue.");
+		dm_kill_request(clone, r);
+		return;
+	}
+
 	r = ti->type->map_rq(ti, clone, &tio->info);
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 7/8] dm core: reject I/O violating new queue limits
@ 2008-10-03 15:19   ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:19 UTC (permalink / raw)
  To: agk; +Cc: k-ueda, linux-scsi, linux-kernel, dm-devel, j-nomura, mbroz

This patch detects requests violating the queue limitations
and rejects them.

The same limitation checks are done when requests are submitted
to the queue by blk_insert_cloned_request().
However, such violation can happen if a table is swapped and
the queue limitations are shrunk while some requests are
in the queue.

Since struct request is a reliable one in the block layer and
device drivers, dispatching such requests is pretty dangerous.
(e.g. it may cause kernel panic easily.)
So avoid to dispatch such problematic requests in request-based dm.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+)

Index: 2.6.27-rc8/drivers/md/dm.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm.c
+++ 2.6.27-rc8/drivers/md/dm.c
@@ -1472,6 +1472,30 @@ static void map_request(struct dm_target
 
 	tio->ti = ti;
 	atomic_inc(&md->pending);
+
+	/*
+	 * Although submitted requests to the md->queue are checked against
+	 * the table/queue limitations at the submission time, the limitations
+	 * may be changed by a table swapping while those already checked
+	 * requests are in the md->queue.
+	 * If the limitations have been shrunk in such situations, we may be
+	 * dispatching requests violating the current limitations here.
+	 * Since struct request is a reliable one in the block-layer
+	 * and device drivers, dispatching such requests is dangerous.
+	 * (e.g. it may cause kernel panic easily.)
+	 * Avoid to dispatch such problematic requests in request-based dm.
+	 *
+	 * Since dm_kill_request() decrements the md->pending, this have to
+	 * be done after incrementing the md->pending.
+	 */
+	r = blk_rq_check_limits(rq->q, rq);
+	if (unlikely(r)) {
+		DMWARN("violating the queue limitation. the limitation may be"
+		       " shrunk while there are some requests in the queue.");
+		dm_kill_request(clone, r);
+		return;
+	}
+
 	r = ti->type->map_rq(ti, clone, &tio->info);
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 8/8] dm-mpath: convert to request-based
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
@ 2008-10-03 15:19   ` Kiyoshi Ueda
  2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:19 UTC (permalink / raw)
  To: agk; +Cc: linux-kernel, linux-scsi, dm-devel, mbroz, j-nomura, k-ueda

This patch converts dm-multipath target to request-based from bio-based.

Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism.  The change in the I/O queueing is described in details
as follows.

I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O complition, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.

In do_end_io() of bio-based, the clone bio has all needed memory
for resubmission.  So the target driver can queue it and resubmit
it later without memory allocations.
The mechanism has almost no overhead.

On the other hand, in do_end_io() of request-based, the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is.  To resubmit the clone request, memory allocation for
clone bios is needed, and it takes some overheads.
To avoid the overheads just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, the target driver asks dm core for queueing and remapping
the original request of the clone request, since the overhead for
queueing is just a freeing memory for the clone request.

As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request.  So dm_bio_details in dm_mpath_io is removed.


multipath_busy()
---------------------
The target driver returns "busy", only when the following case:
  o The target driver will map I/Os, if map() function is called
  and
  o The mapped I/Os will wait on underlying device's queue due to
    their congestions, if map() function is called now.

In other cases, the target driver doesn't return "busy".
Otherwise, dm core will keep the I/Os and the target driver can't
do what it wants.
(e.g. the target driver can't map I/Os now, so wants to kill I/Os.)


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-mpath.c |  192 +++++++++++++++++++++++++++++++++-----------------
 1 files changed, 127 insertions(+), 65 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm-mpath.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-mpath.c
+++ 2.6.27-rc8/drivers/md/dm-mpath.c
@@ -7,8 +7,6 @@
 
 #include "dm.h"
 #include "dm-path-selector.h"
-#include "dm-bio-list.h"
-#include "dm-bio-record.h"
 #include "dm-uevent.h"
 
 #include <linux/ctype.h>
@@ -83,7 +81,7 @@ struct multipath {
 	unsigned pg_init_count;		/* Number of times pg_init called */
 
 	struct work_struct process_queued_ios;
-	struct bio_list queued_ios;
+	struct list_head queued_ios;
 	unsigned queue_size;
 
 	struct work_struct trigger_event;
@@ -100,7 +98,6 @@ struct multipath {
  */
 struct dm_mpath_io {
 	struct pgpath *pgpath;
-	struct dm_bio_details details;
 };
 
 typedef int (*action_fn) (struct pgpath *pgpath);
@@ -186,6 +183,7 @@ static struct multipath *alloc_multipath
 	m = kzalloc(sizeof(*m), GFP_KERNEL);
 	if (m) {
 		INIT_LIST_HEAD(&m->priority_groups);
+		INIT_LIST_HEAD(&m->queued_ios);
 		spin_lock_init(&m->lock);
 		m->queue_io = 1;
 		INIT_WORK(&m->process_queued_ios, process_queued_ios);
@@ -310,12 +308,13 @@ static int __must_push_back(struct multi
 		dm_noflush_suspending(m->ti));
 }
 
-static int map_io(struct multipath *m, struct bio *bio,
+static int map_io(struct multipath *m, struct request *clone,
 		  struct dm_mpath_io *mpio, unsigned was_queued)
 {
 	int r = DM_MAPIO_REMAPPED;
 	unsigned long flags;
 	struct pgpath *pgpath;
+	struct block_device *bdev;
 
 	spin_lock_irqsave(&m->lock, flags);
 
@@ -332,16 +331,18 @@ static int map_io(struct multipath *m, s
 	if ((pgpath && m->queue_io) ||
 	    (!pgpath && m->queue_if_no_path)) {
 		/* Queue for the daemon to resubmit */
-		bio_list_add(&m->queued_ios, bio);
+		list_add_tail(&clone->queuelist, &m->queued_ios);
 		m->queue_size++;
 		if ((m->pg_init_required && !m->pg_init_in_progress) ||
 		    !m->queue_io)
 			queue_work(kmultipathd, &m->process_queued_ios);
 		pgpath = NULL;
 		r = DM_MAPIO_SUBMITTED;
-	} else if (pgpath)
-		bio->bi_bdev = pgpath->path.dev->bdev;
-	else if (__must_push_back(m))
+	} else if (pgpath) {
+		bdev = pgpath->path.dev->bdev;
+		clone->q = bdev_get_queue(bdev);
+		clone->rq_disk = bdev->bd_disk;
+	} else if (__must_push_back(m))
 		r = DM_MAPIO_REQUEUE;
 	else
 		r = -EIO;	/* Failed */
@@ -384,30 +385,31 @@ static void dispatch_queued_ios(struct m
 {
 	int r;
 	unsigned long flags;
-	struct bio *bio = NULL, *next;
 	struct dm_mpath_io *mpio;
 	union map_info *info;
+	struct request *clone, *n;
+	LIST_HEAD(cl);
 
 	spin_lock_irqsave(&m->lock, flags);
-	bio = bio_list_get(&m->queued_ios);
+	list_splice_init(&m->queued_ios, &cl);
 	spin_unlock_irqrestore(&m->lock, flags);
 
-	while (bio) {
-		next = bio->bi_next;
-		bio->bi_next = NULL;
+	list_for_each_entry_safe(clone, n, &cl, queuelist) {
+		list_del_init(&clone->queuelist);
 
-		info = dm_get_mapinfo(bio);
+		info = dm_get_rq_mapinfo(clone);
 		mpio = info->ptr;
 
-		r = map_io(m, bio, mpio, 1);
-		if (r < 0)
-			bio_endio(bio, r);
-		else if (r == DM_MAPIO_REMAPPED)
-			generic_make_request(bio);
-		else if (r == DM_MAPIO_REQUEUE)
-			bio_endio(bio, -EIO);
-
-		bio = next;
+		r = map_io(m, clone, mpio, 1);
+		if (r < 0) {
+			mempool_free(mpio, m->mpio_pool);
+			dm_kill_request(clone, r);
+		} else if (r == DM_MAPIO_REMAPPED)
+			dm_dispatch_request(clone);
+		else if (r == DM_MAPIO_REQUEUE) {
+			mempool_free(mpio, m->mpio_pool);
+			dm_requeue_request(clone);
+		}
 	}
 }
 
@@ -824,21 +826,24 @@ static void multipath_dtr(struct dm_targ
 }
 
 /*
- * Map bios, recording original fields for later in case we have to resubmit
+ * Map cloned requests
  */
-static int multipath_map(struct dm_target *ti, struct bio *bio,
+static int multipath_map(struct dm_target *ti, struct request *clone,
 			 union map_info *map_context)
 {
 	int r;
 	struct dm_mpath_io *mpio;
 	struct multipath *m = (struct multipath *) ti->private;
 
-	mpio = mempool_alloc(m->mpio_pool, GFP_NOIO);
-	dm_bio_record(&mpio->details, bio);
+	mpio = mempool_alloc(m->mpio_pool, GFP_ATOMIC);
+	if (!mpio)
+		/* ENOMEM, requeue */
+		return DM_MAPIO_REQUEUE;
+	memset(mpio, 0, sizeof(*mpio));
 
 	map_context->ptr = mpio;
-	bio->bi_rw |= (1 << BIO_RW_FAILFAST);
-	r = map_io(m, bio, mpio, 0);
+	clone->cmd_flags |= REQ_FAILFAST;
+	r = map_io(m, clone, mpio, 0);
 	if (r < 0 || r == DM_MAPIO_REQUEUE)
 		mempool_free(mpio, m->mpio_pool);
 
@@ -1119,53 +1124,41 @@ static void activate_path(struct work_st
 /*
  * end_io handling
  */
-static int do_end_io(struct multipath *m, struct bio *bio,
+static int do_end_io(struct multipath *m, struct request *clone,
 		     int error, struct dm_mpath_io *mpio)
 {
+	/*
+	 * We don't queue any clone request inside the multipath target
+	 * during end I/O handling, since those clone requests don't have
+	 * bio clones.  If we queue them inside the multipath target,
+	 * we need to make bio clones, that requires memory allocation.
+	 * (See drivers/md/dm.c:end_clone_bio() about why the clone requests
+	 *  don't have bio clones.)
+	 * Instead of queueing the clone request here, we queue the original
+	 * request into dm core, which will remake a clone request and
+	 * clone bios for it and resubmit it later.
+	 */
+	int r = DM_ENDIO_REQUEUE;
 	unsigned long flags;
 
-	if (!error)
+	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
 
-	if ((error == -EWOULDBLOCK) && bio_rw_ahead(bio))
-		return error;
-
 	if (error == -EOPNOTSUPP)
 		return error;
 
-	spin_lock_irqsave(&m->lock, flags);
-	if (!m->nr_valid_paths) {
-		if (__must_push_back(m)) {
-			spin_unlock_irqrestore(&m->lock, flags);
-			return DM_ENDIO_REQUEUE;
-		} else if (!m->queue_if_no_path) {
-			spin_unlock_irqrestore(&m->lock, flags);
-			return -EIO;
-		} else {
-			spin_unlock_irqrestore(&m->lock, flags);
-			goto requeue;
-		}
-	}
-	spin_unlock_irqrestore(&m->lock, flags);
-
 	if (mpio->pgpath)
 		fail_path(mpio->pgpath);
 
-      requeue:
-	dm_bio_restore(&mpio->details, bio);
-
-	/* queue for the daemon to resubmit or fail */
 	spin_lock_irqsave(&m->lock, flags);
-	bio_list_add(&m->queued_ios, bio);
-	m->queue_size++;
-	if (!m->queue_io)
-		queue_work(kmultipathd, &m->process_queued_ios);
+	if (!m->nr_valid_paths && !m->queue_if_no_path && !__must_push_back(m))
+		r = -EIO;
 	spin_unlock_irqrestore(&m->lock, flags);
 
-	return DM_ENDIO_INCOMPLETE;	/* io not complete */
+	return r;
 }
 
-static int multipath_end_io(struct dm_target *ti, struct bio *bio,
+static int multipath_end_io(struct dm_target *ti, struct request *clone,
 			    int error, union map_info *map_context)
 {
 	struct multipath *m = ti->private;
@@ -1174,14 +1167,13 @@ static int multipath_end_io(struct dm_ta
 	struct path_selector *ps;
 	int r;
 
-	r  = do_end_io(m, bio, error, mpio);
+	r  = do_end_io(m, clone, error, mpio);
 	if (pgpath) {
 		ps = &pgpath->pg->ps;
 		if (ps->type->end_io)
 			ps->type->end_io(ps, &pgpath->path);
 	}
-	if (r != DM_ENDIO_INCOMPLETE)
-		mempool_free(mpio, m->mpio_pool);
+	mempool_free(mpio, m->mpio_pool);
 
 	return r;
 }
@@ -1417,6 +1409,75 @@ static int multipath_ioctl(struct dm_tar
 					 bdev->bd_disk, cmd, arg);
 }
 
+static int __pgpath_busy(struct pgpath *pgpath)
+{
+	struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
+
+	return dm_underlying_device_busy(q);
+}
+
+/*
+ * We return "busy", only when we can map I/Os but underlying devices
+ * are busy (so even if we map I/Os now, the I/Os will wait on
+ * the underlying queue).
+ * In other words, if we want to kill I/Os or queue them inside us
+ * due to map unavailability, we don't return "busy".  Otherwise,
+ * dm core won't give us the I/Os and we can't do what we want.
+ */
+static int multipath_busy(struct dm_target *ti)
+{
+	int busy = 0, has_active = 0;
+	struct multipath *m = (struct multipath *) ti->private;
+	struct priority_group *pg;
+	struct pgpath *pgpath;
+	unsigned long flags;
+
+	spin_lock_irqsave(&m->lock, flags);
+
+	/* Guess which priority_group will be used at next mapping time */
+	if (unlikely(!m->current_pgpath && m->next_pg))
+		pg = m->next_pg;
+	else if (likely(m->current_pg))
+		pg = m->current_pg;
+	else
+		/*
+		 * We don't know which pg will be used at next mapping time.
+		 * We don't call __choose_pgpath() here to avoid to trigger
+		 * pg_init just by busy checking.
+		 * So we don't know whether underlying devices we will be using
+		 * at next mapping time are busy or not. Just try mapping.
+		 */
+		goto out;
+
+	/*
+	 * If there is one non-busy active path at least, the path selector
+	 * will be able to select it. So we consider such a pg as not busy.
+	 */
+	busy = 1;
+	list_for_each_entry(pgpath, &pg->pgpaths, list)
+		if (pgpath->is_active) {
+			has_active = 1;
+
+			if (!__pgpath_busy(pgpath)) {
+				busy = 0;
+				break;
+			}
+		}
+
+	if (!has_active)
+		/*
+		 * No active path in this pg, so this pg won't be used and
+		 * the current_pg will be changed at next mapping time.
+		 * We need to try mapping to determine it.
+		 */
+		busy = 0;
+
+out:
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return busy;
+}
+
 /*-----------------------------------------------------------------
  * Module setup
  *---------------------------------------------------------------*/
@@ -1426,13 +1487,14 @@ static struct target_type multipath_targ
 	.module = THIS_MODULE,
 	.ctr = multipath_ctr,
 	.dtr = multipath_dtr,
-	.map = multipath_map,
-	.end_io = multipath_end_io,
+	.map_rq = multipath_map,
+	.rq_end_io = multipath_end_io,
 	.presuspend = multipath_presuspend,
 	.resume = multipath_resume,
 	.status = multipath_status,
 	.message = multipath_message,
 	.ioctl  = multipath_ioctl,
+	.busy = multipath_busy,
 };
 
 static int __init dm_multipath_init(void)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 8/8] dm-mpath: convert to request-based
@ 2008-10-03 15:19   ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2008-10-03 15:19 UTC (permalink / raw)
  To: agk; +Cc: k-ueda, linux-scsi, linux-kernel, dm-devel, j-nomura, mbroz

This patch converts dm-multipath target to request-based from bio-based.

Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism.  The change in the I/O queueing is described in details
as follows.

I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O complition, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.

In do_end_io() of bio-based, the clone bio has all needed memory
for resubmission.  So the target driver can queue it and resubmit
it later without memory allocations.
The mechanism has almost no overhead.

On the other hand, in do_end_io() of request-based, the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is.  To resubmit the clone request, memory allocation for
clone bios is needed, and it takes some overheads.
To avoid the overheads just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, the target driver asks dm core for queueing and remapping
the original request of the clone request, since the overhead for
queueing is just a freeing memory for the clone request.

As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request.  So dm_bio_details in dm_mpath_io is removed.


multipath_busy()
---------------------
The target driver returns "busy", only when the following case:
  o The target driver will map I/Os, if map() function is called
  and
  o The mapped I/Os will wait on underlying device's queue due to
    their congestions, if map() function is called now.

In other cases, the target driver doesn't return "busy".
Otherwise, dm core will keep the I/Os and the target driver can't
do what it wants.
(e.g. the target driver can't map I/Os now, so wants to kill I/Os.)


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-mpath.c |  192 +++++++++++++++++++++++++++++++++-----------------
 1 files changed, 127 insertions(+), 65 deletions(-)

Index: 2.6.27-rc8/drivers/md/dm-mpath.c
===================================================================
--- 2.6.27-rc8.orig/drivers/md/dm-mpath.c
+++ 2.6.27-rc8/drivers/md/dm-mpath.c
@@ -7,8 +7,6 @@
 
 #include "dm.h"
 #include "dm-path-selector.h"
-#include "dm-bio-list.h"
-#include "dm-bio-record.h"
 #include "dm-uevent.h"
 
 #include <linux/ctype.h>
@@ -83,7 +81,7 @@ struct multipath {
 	unsigned pg_init_count;		/* Number of times pg_init called */
 
 	struct work_struct process_queued_ios;
-	struct bio_list queued_ios;
+	struct list_head queued_ios;
 	unsigned queue_size;
 
 	struct work_struct trigger_event;
@@ -100,7 +98,6 @@ struct multipath {
  */
 struct dm_mpath_io {
 	struct pgpath *pgpath;
-	struct dm_bio_details details;
 };
 
 typedef int (*action_fn) (struct pgpath *pgpath);
@@ -186,6 +183,7 @@ static struct multipath *alloc_multipath
 	m = kzalloc(sizeof(*m), GFP_KERNEL);
 	if (m) {
 		INIT_LIST_HEAD(&m->priority_groups);
+		INIT_LIST_HEAD(&m->queued_ios);
 		spin_lock_init(&m->lock);
 		m->queue_io = 1;
 		INIT_WORK(&m->process_queued_ios, process_queued_ios);
@@ -310,12 +308,13 @@ static int __must_push_back(struct multi
 		dm_noflush_suspending(m->ti));
 }
 
-static int map_io(struct multipath *m, struct bio *bio,
+static int map_io(struct multipath *m, struct request *clone,
 		  struct dm_mpath_io *mpio, unsigned was_queued)
 {
 	int r = DM_MAPIO_REMAPPED;
 	unsigned long flags;
 	struct pgpath *pgpath;
+	struct block_device *bdev;
 
 	spin_lock_irqsave(&m->lock, flags);
 
@@ -332,16 +331,18 @@ static int map_io(struct multipath *m, s
 	if ((pgpath && m->queue_io) ||
 	    (!pgpath && m->queue_if_no_path)) {
 		/* Queue for the daemon to resubmit */
-		bio_list_add(&m->queued_ios, bio);
+		list_add_tail(&clone->queuelist, &m->queued_ios);
 		m->queue_size++;
 		if ((m->pg_init_required && !m->pg_init_in_progress) ||
 		    !m->queue_io)
 			queue_work(kmultipathd, &m->process_queued_ios);
 		pgpath = NULL;
 		r = DM_MAPIO_SUBMITTED;
-	} else if (pgpath)
-		bio->bi_bdev = pgpath->path.dev->bdev;
-	else if (__must_push_back(m))
+	} else if (pgpath) {
+		bdev = pgpath->path.dev->bdev;
+		clone->q = bdev_get_queue(bdev);
+		clone->rq_disk = bdev->bd_disk;
+	} else if (__must_push_back(m))
 		r = DM_MAPIO_REQUEUE;
 	else
 		r = -EIO;	/* Failed */
@@ -384,30 +385,31 @@ static void dispatch_queued_ios(struct m
 {
 	int r;
 	unsigned long flags;
-	struct bio *bio = NULL, *next;
 	struct dm_mpath_io *mpio;
 	union map_info *info;
+	struct request *clone, *n;
+	LIST_HEAD(cl);
 
 	spin_lock_irqsave(&m->lock, flags);
-	bio = bio_list_get(&m->queued_ios);
+	list_splice_init(&m->queued_ios, &cl);
 	spin_unlock_irqrestore(&m->lock, flags);
 
-	while (bio) {
-		next = bio->bi_next;
-		bio->bi_next = NULL;
+	list_for_each_entry_safe(clone, n, &cl, queuelist) {
+		list_del_init(&clone->queuelist);
 
-		info = dm_get_mapinfo(bio);
+		info = dm_get_rq_mapinfo(clone);
 		mpio = info->ptr;
 
-		r = map_io(m, bio, mpio, 1);
-		if (r < 0)
-			bio_endio(bio, r);
-		else if (r == DM_MAPIO_REMAPPED)
-			generic_make_request(bio);
-		else if (r == DM_MAPIO_REQUEUE)
-			bio_endio(bio, -EIO);
-
-		bio = next;
+		r = map_io(m, clone, mpio, 1);
+		if (r < 0) {
+			mempool_free(mpio, m->mpio_pool);
+			dm_kill_request(clone, r);
+		} else if (r == DM_MAPIO_REMAPPED)
+			dm_dispatch_request(clone);
+		else if (r == DM_MAPIO_REQUEUE) {
+			mempool_free(mpio, m->mpio_pool);
+			dm_requeue_request(clone);
+		}
 	}
 }
 
@@ -824,21 +826,24 @@ static void multipath_dtr(struct dm_targ
 }
 
 /*
- * Map bios, recording original fields for later in case we have to resubmit
+ * Map cloned requests
  */
-static int multipath_map(struct dm_target *ti, struct bio *bio,
+static int multipath_map(struct dm_target *ti, struct request *clone,
 			 union map_info *map_context)
 {
 	int r;
 	struct dm_mpath_io *mpio;
 	struct multipath *m = (struct multipath *) ti->private;
 
-	mpio = mempool_alloc(m->mpio_pool, GFP_NOIO);
-	dm_bio_record(&mpio->details, bio);
+	mpio = mempool_alloc(m->mpio_pool, GFP_ATOMIC);
+	if (!mpio)
+		/* ENOMEM, requeue */
+		return DM_MAPIO_REQUEUE;
+	memset(mpio, 0, sizeof(*mpio));
 
 	map_context->ptr = mpio;
-	bio->bi_rw |= (1 << BIO_RW_FAILFAST);
-	r = map_io(m, bio, mpio, 0);
+	clone->cmd_flags |= REQ_FAILFAST;
+	r = map_io(m, clone, mpio, 0);
 	if (r < 0 || r == DM_MAPIO_REQUEUE)
 		mempool_free(mpio, m->mpio_pool);
 
@@ -1119,53 +1124,41 @@ static void activate_path(struct work_st
 /*
  * end_io handling
  */
-static int do_end_io(struct multipath *m, struct bio *bio,
+static int do_end_io(struct multipath *m, struct request *clone,
 		     int error, struct dm_mpath_io *mpio)
 {
+	/*
+	 * We don't queue any clone request inside the multipath target
+	 * during end I/O handling, since those clone requests don't have
+	 * bio clones.  If we queue them inside the multipath target,
+	 * we need to make bio clones, that requires memory allocation.
+	 * (See drivers/md/dm.c:end_clone_bio() about why the clone requests
+	 *  don't have bio clones.)
+	 * Instead of queueing the clone request here, we queue the original
+	 * request into dm core, which will remake a clone request and
+	 * clone bios for it and resubmit it later.
+	 */
+	int r = DM_ENDIO_REQUEUE;
 	unsigned long flags;
 
-	if (!error)
+	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
 
-	if ((error == -EWOULDBLOCK) && bio_rw_ahead(bio))
-		return error;
-
 	if (error == -EOPNOTSUPP)
 		return error;
 
-	spin_lock_irqsave(&m->lock, flags);
-	if (!m->nr_valid_paths) {
-		if (__must_push_back(m)) {
-			spin_unlock_irqrestore(&m->lock, flags);
-			return DM_ENDIO_REQUEUE;
-		} else if (!m->queue_if_no_path) {
-			spin_unlock_irqrestore(&m->lock, flags);
-			return -EIO;
-		} else {
-			spin_unlock_irqrestore(&m->lock, flags);
-			goto requeue;
-		}
-	}
-	spin_unlock_irqrestore(&m->lock, flags);
-
 	if (mpio->pgpath)
 		fail_path(mpio->pgpath);
 
-      requeue:
-	dm_bio_restore(&mpio->details, bio);
-
-	/* queue for the daemon to resubmit or fail */
 	spin_lock_irqsave(&m->lock, flags);
-	bio_list_add(&m->queued_ios, bio);
-	m->queue_size++;
-	if (!m->queue_io)
-		queue_work(kmultipathd, &m->process_queued_ios);
+	if (!m->nr_valid_paths && !m->queue_if_no_path && !__must_push_back(m))
+		r = -EIO;
 	spin_unlock_irqrestore(&m->lock, flags);
 
-	return DM_ENDIO_INCOMPLETE;	/* io not complete */
+	return r;
 }
 
-static int multipath_end_io(struct dm_target *ti, struct bio *bio,
+static int multipath_end_io(struct dm_target *ti, struct request *clone,
 			    int error, union map_info *map_context)
 {
 	struct multipath *m = ti->private;
@@ -1174,14 +1167,13 @@ static int multipath_end_io(struct dm_ta
 	struct path_selector *ps;
 	int r;
 
-	r  = do_end_io(m, bio, error, mpio);
+	r  = do_end_io(m, clone, error, mpio);
 	if (pgpath) {
 		ps = &pgpath->pg->ps;
 		if (ps->type->end_io)
 			ps->type->end_io(ps, &pgpath->path);
 	}
-	if (r != DM_ENDIO_INCOMPLETE)
-		mempool_free(mpio, m->mpio_pool);
+	mempool_free(mpio, m->mpio_pool);
 
 	return r;
 }
@@ -1417,6 +1409,75 @@ static int multipath_ioctl(struct dm_tar
 					 bdev->bd_disk, cmd, arg);
 }
 
+static int __pgpath_busy(struct pgpath *pgpath)
+{
+	struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
+
+	return dm_underlying_device_busy(q);
+}
+
+/*
+ * We return "busy", only when we can map I/Os but underlying devices
+ * are busy (so even if we map I/Os now, the I/Os will wait on
+ * the underlying queue).
+ * In other words, if we want to kill I/Os or queue them inside us
+ * due to map unavailability, we don't return "busy".  Otherwise,
+ * dm core won't give us the I/Os and we can't do what we want.
+ */
+static int multipath_busy(struct dm_target *ti)
+{
+	int busy = 0, has_active = 0;
+	struct multipath *m = (struct multipath *) ti->private;
+	struct priority_group *pg;
+	struct pgpath *pgpath;
+	unsigned long flags;
+
+	spin_lock_irqsave(&m->lock, flags);
+
+	/* Guess which priority_group will be used at next mapping time */
+	if (unlikely(!m->current_pgpath && m->next_pg))
+		pg = m->next_pg;
+	else if (likely(m->current_pg))
+		pg = m->current_pg;
+	else
+		/*
+		 * We don't know which pg will be used at next mapping time.
+		 * We don't call __choose_pgpath() here to avoid to trigger
+		 * pg_init just by busy checking.
+		 * So we don't know whether underlying devices we will be using
+		 * at next mapping time are busy or not. Just try mapping.
+		 */
+		goto out;
+
+	/*
+	 * If there is one non-busy active path at least, the path selector
+	 * will be able to select it. So we consider such a pg as not busy.
+	 */
+	busy = 1;
+	list_for_each_entry(pgpath, &pg->pgpaths, list)
+		if (pgpath->is_active) {
+			has_active = 1;
+
+			if (!__pgpath_busy(pgpath)) {
+				busy = 0;
+				break;
+			}
+		}
+
+	if (!has_active)
+		/*
+		 * No active path in this pg, so this pg won't be used and
+		 * the current_pg will be changed at next mapping time.
+		 * We need to try mapping to determine it.
+		 */
+		busy = 0;
+
+out:
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return busy;
+}
+
 /*-----------------------------------------------------------------
  * Module setup
  *---------------------------------------------------------------*/
@@ -1426,13 +1487,14 @@ static struct target_type multipath_targ
 	.module = THIS_MODULE,
 	.ctr = multipath_ctr,
 	.dtr = multipath_dtr,
-	.map = multipath_map,
-	.end_io = multipath_end_io,
+	.map_rq = multipath_map,
+	.rq_end_io = multipath_end_io,
 	.presuspend = multipath_presuspend,
 	.resume = multipath_resume,
 	.status = multipath_status,
 	.message = multipath_message,
 	.ioctl  = multipath_ioctl,
+	.busy = multipath_busy,
 };
 
 static int __init dm_multipath_init(void)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
                   ` (7 preceding siblings ...)
  2008-10-03 15:19   ` Kiyoshi Ueda
@ 2009-01-28 15:40 ` Alasdair G Kergon
  2009-01-29  7:18   ` Kiyoshi Ueda
  8 siblings, 1 reply; 28+ messages in thread
From: Alasdair G Kergon @ 2009-01-28 15:40 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: j-nomura, dm-devel, mbroz

On Fri, Oct 03, 2008 at 11:08:25AM -0400, Kiyoshi Ueda wrote:
> This patch-set is the updated version of request-based dm-multipath.
> The changes from the previous version (*) are to follow up the change
> of the interface to export lld's busy state (PATCH 5).

I've fixed them up to compile again, but haven't thoroughly checked for
side-effects.

Alasdair
-- 
agk@redhat.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-01-28 15:40 ` [PATCH 0/8] dm: request-based dm-multipath Alasdair G Kergon
@ 2009-01-29  7:18   ` Kiyoshi Ueda
  2009-01-29 10:41     ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-01-29  7:18 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: dm-devel, mbroz

Hi Alasdair,

On 01/29/2009 12:40 AM +0900, Alasdair G Kergon wrote:
> On Fri, Oct 03, 2008 at 11:08:25AM -0400, Kiyoshi Ueda wrote:
>> This patch-set is the updated version of request-based dm-multipath.
>> The changes from the previous version (*) are to follow up the change
>> of the interface to export lld's busy state (PATCH 5).
> 
> I've fixed them up to compile again, but haven't thoroughly checked for
> side-effects.

I found some problems below in my patches and now considering
how to fix them:
  o Unnecessary EIO is returned to application if it submits
    a bio during table swapping.
  o kernel panic occurs by frequent table swapping during heavy I/Os.
 
I'll post the fixed version after my vacation.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-01-29  7:18   ` Kiyoshi Ueda
@ 2009-01-29 10:41     ` Hannes Reinecke
  2009-01-29 14:32       ` Alasdair G Kergon
  2009-01-30  8:05       ` Kiyoshi Ueda
  0 siblings, 2 replies; 28+ messages in thread
From: Hannes Reinecke @ 2009-01-29 10:41 UTC (permalink / raw)
  To: device-mapper development; +Cc: Alasdair Kergon, mbroz

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=utf-8, Size: 2776 bytes --]

On Thu, Jan 29, 2009 at 04:18:59PM +0900, Kiyoshi Ueda wrote:
> Hi Alasdair,
> 
> On 01/29/2009 12:40 AM +0900, Alasdair G Kergon wrote:
> > On Fri, Oct 03, 2008 at 11:08:25AM -0400, Kiyoshi Ueda wrote:
> >> This patch-set is the updated version of request-based dm-multipath.
> >> The changes from the previous version (*) are to follow up the change
> >> of the interface to export lld's busy state (PATCH 5).
> > 
> > I've fixed them up to compile again, but haven't thoroughly checked for
> > side-effects.
> 
> I found some problems below in my patches and now considering
> how to fix them:
>   o Unnecessary EIO is returned to application if it submits
>     a bio during table swapping.
Yes, I've noticed that. Problem is this:

--- linux-2.6.27.orig/drivers/md/dm.c
+++ linux-2.6.27/drivers/md/dm.c
@@ -1304,7 +1304,11 @@ static int dm_make_request(struct reques
                return 0;
        }
 
-       if (unlikely(!md->map)) {
+       /*
+        * Submitting to a stopped queue with no map is okay;
+        * might happen during reconfiguration.
+        */
+       if (unlikely(!md->map) && !blk_queue_stopped(q)) {
                bio_endio(bio, -EIO);
                return 0;
        }

The make_request callback should never return EIO if there's any
chance at all to get this request done.

>   o kernel panic occurs by frequent table swapping during heavy I/Os.
>  
That's probably fixed by this patch:

--- linux-2.6.27/drivers/md/dm.c.orig   2009-01-23 15:59:22.741461315 +0100
+++ linux-2.6.27/drivers/md/dm.c        2009-01-26 09:03:02.787605723 +0100
@@ -714,13 +714,14 @@ static void free_bio_clone(struct reques
        struct dm_rq_target_io *tio = clone->end_io_data;
        struct mapped_device *md = tio->md;
        struct bio *bio;
-       struct dm_clone_bio_info *info;
 
        while ((bio = clone->bio) != NULL) {
                clone->bio = bio->bi_next;
 
-               info = bio->bi_private;
-               free_bio_info(md, info);
+               if (bio->bi_private) {
+                       struct dm_clone_bio_info *info = bio->bi_private;
+                       free_bio_info(md, info);
+               }
 
                bio->bi_private = md->bs;
                bio_put(bio);

The info field is not necessarily filled here, so we have to check for it
explicitly.

With these two patches request-based multipathing have survived all stress-tests
so far. Except on mainframe (zfcp), but that's more a driver-related thing.

Good work!

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-01-29 10:41     ` Hannes Reinecke
@ 2009-01-29 14:32       ` Alasdair G Kergon
  2009-01-30  8:05       ` Kiyoshi Ueda
  1 sibling, 0 replies; 28+ messages in thread
From: Alasdair G Kergon @ 2009-01-29 14:32 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: device-mapper development

On Thu, Jan 29, 2009 at 11:41:47AM +0100, Hannes Reinecke wrote:
> Yes, I've noticed that. Problem is this:

Thanks - I've applied those two fixes.

Alasdair
-- 
agk@redhat.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-01-29 10:41     ` Hannes Reinecke
  2009-01-29 14:32       ` Alasdair G Kergon
@ 2009-01-30  8:05       ` Kiyoshi Ueda
  2009-03-10  6:10         ` Kiyoshi Ueda
  1 sibling, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-01-30  8:05 UTC (permalink / raw)
  To: device-mapper development; +Cc: Alasdair Kergon, mbroz

Hi Hannes,

Thank you for the comments and patches.
See my comments below.

On 01/29/2009 07:41 PM +0900, Hannes Reinecke wrote:
> On Thu, Jan 29, 2009 at 04:18:59PM +0900, Kiyoshi Ueda wrote:
>> Hi Alasdair,
>>
>> On 01/29/2009 12:40 AM +0900, Alasdair G Kergon wrote:
>>> On Fri, Oct 03, 2008 at 11:08:25AM -0400, Kiyoshi Ueda wrote:
>>>> This patch-set is the updated version of request-based dm-multipath.
>>>> The changes from the previous version (*) are to follow up the change
>>>> of the interface to export lld's busy state (PATCH 5).
>>> I've fixed them up to compile again, but haven't thoroughly checked for
>>> side-effects.
>> I found some problems below in my patches and now considering
>> how to fix them:
>>   o Unnecessary EIO is returned to application if it submits
>>     a bio during table swapping.
> Yes, I've noticed that. Problem is this:
> 
> --- linux-2.6.27.orig/drivers/md/dm.c
> +++ linux-2.6.27/drivers/md/dm.c
> @@ -1304,7 +1304,11 @@ static int dm_make_request(struct reques
>                 return 0;
>         }
>  
> -       if (unlikely(!md->map)) {
> +       /*
> +        * Submitting to a stopped queue with no map is okay;
> +        * might happen during reconfiguration.
> +        */
> +       if (unlikely(!md->map) && !blk_queue_stopped(q)) {
>                 bio_endio(bio, -EIO);
>                 return 0;
>         }
> 
> The make_request callback should never return EIO if there's any
> chance at all to get this request done.
 
Exactly this part has a race condition with table swapping.
Although you patch fixes the race condition, I think I need
more careful consideration here.
(e.g. Is it really OK to go bios through __make_request() with
      old queue restrictions during table swapping?)

I'll work on this problem after my vacation.


>>   o kernel panic occurs by frequent table swapping during heavy I/Os.
>>  
> That's probably fixed by this patch:
> 
> --- linux-2.6.27/drivers/md/dm.c.orig   2009-01-23 15:59:22.741461315 +0100
> +++ linux-2.6.27/drivers/md/dm.c        2009-01-26 09:03:02.787605723 +0100
> @@ -714,13 +714,14 @@ static void free_bio_clone(struct reques
>         struct dm_rq_target_io *tio = clone->end_io_data;
>         struct mapped_device *md = tio->md;
>         struct bio *bio;
> -       struct dm_clone_bio_info *info;
>  
>         while ((bio = clone->bio) != NULL) {
>                 clone->bio = bio->bi_next;
>  
> -               info = bio->bi_private;
> -               free_bio_info(md, info);
> +               if (bio->bi_private) {
> +                       struct dm_clone_bio_info *info = bio->bi_private;
> +                       free_bio_info(md, info);
> +               }
>  
>                 bio->bi_private = md->bs;
>                 bio_put(bio);
> 
> The info field is not necessarily filled here, so we have to check for it
> explicitly.
> 
> With these two patches request-based multipathing have survived all stress-tests
> so far. Except on mainframe (zfcp), but that's more a driver-related thing.

Hmm, it seems I'm hitting a different problem from the one you hit,
because I still see my problem even with your patch.
I need more investigation after my vacation.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-01-30  8:05       ` Kiyoshi Ueda
@ 2009-03-10  6:10         ` Kiyoshi Ueda
  2009-03-10  7:17           ` Hannes Reinecke
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-03-10  6:10 UTC (permalink / raw)
  To: device-mapper development

Hi Hannes,

On 2009/01/30 17:05 +0900, Kiyoshi Ueda wrote:
>>>   o kernel panic occurs by frequent table swapping during heavy I/Os.
>>>  
>> That's probably fixed by this patch:
>>
>> --- linux-2.6.27/drivers/md/dm.c.orig   2009-01-23 15:59:22.741461315 +0100
>> +++ linux-2.6.27/drivers/md/dm.c        2009-01-26 09:03:02.787605723 +0100
>> @@ -714,13 +714,14 @@ static void free_bio_clone(struct reques
>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>         struct mapped_device *md = tio->md;
>>         struct bio *bio;
>> -       struct dm_clone_bio_info *info;
>>  
>>         while ((bio = clone->bio) != NULL) {
>>                 clone->bio = bio->bi_next;
>>  
>> -               info = bio->bi_private;
>> -               free_bio_info(md, info);
>> +               if (bio->bi_private) {
>> +                       struct dm_clone_bio_info *info = bio->bi_private;
>> +                       free_bio_info(md, info);
>> +               }
>>  
>>                 bio->bi_private = md->bs;
>>                 bio_put(bio);
>>
>> The info field is not necessarily filled here, so we have to check for it
>> explicitly.
>>
>> With these two patches request-based multipathing have survived all stress-tests
>> so far. Except on mainframe (zfcp), but that's more a driver-related thing.

My problem was different from that one, and I have fixed my problem.

Do you hit some problem without the patch above?
If so, that should be a programming bug and we need to fix it.  Otherwise,
we should be leaking a memory (since all cloned bio should always have
the dm_clone_bio_info structure in ->bi_private).

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-10  6:10         ` Kiyoshi Ueda
@ 2009-03-10  7:17           ` Hannes Reinecke
  2009-03-10  8:17             ` Kiyoshi Ueda
  2009-03-12  8:58             ` Kiyoshi Ueda
  0 siblings, 2 replies; 28+ messages in thread
From: Hannes Reinecke @ 2009-03-10  7:17 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: device-mapper development

[-- Attachment #1: Type: text/plain, Size: 2319 bytes --]

Hi Kiyoshi,

Kiyoshi Ueda wrote:
> Hi Hannes,
> 
> On 2009/01/30 17:05 +0900, Kiyoshi Ueda wrote:
>>>>   o kernel panic occurs by frequent table swapping during heavy I/Os.
>>>>  
>>> That's probably fixed by this patch:
>>>
>>> --- linux-2.6.27/drivers/md/dm.c.orig   2009-01-23 15:59:22.741461315 +0100
>>> +++ linux-2.6.27/drivers/md/dm.c        2009-01-26 09:03:02.787605723 +0100
>>> @@ -714,13 +714,14 @@ static void free_bio_clone(struct reques
>>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>>         struct mapped_device *md = tio->md;
>>>         struct bio *bio;
>>> -       struct dm_clone_bio_info *info;
>>>  
>>>         while ((bio = clone->bio) != NULL) {
>>>                 clone->bio = bio->bi_next;
>>>  
>>> -               info = bio->bi_private;
>>> -               free_bio_info(md, info);
>>> +               if (bio->bi_private) {
>>> +                       struct dm_clone_bio_info *info = bio->bi_private;
>>> +                       free_bio_info(md, info);
>>> +               }
>>>  
>>>                 bio->bi_private = md->bs;
>>>                 bio_put(bio);
>>>
>>> The info field is not necessarily filled here, so we have to check for it
>>> explicitly.
>>>
>>> With these two patches request-based multipathing have survived all stress-tests
>>> so far. Except on mainframe (zfcp), but that's more a driver-related thing.
> 
> My problem was different from that one, and I have fixed my problem.
> 
What was this? Was is something specific to your setup or some within the
request-based multipathing code?
If the latter, I'd be _very_ much interested in seeing the patch. Naturally.

> Do you hit some problem without the patch above?
> If so, that should be a programming bug and we need to fix it.  Otherwise,
> we should be leaking a memory (since all cloned bio should always have
> the dm_clone_bio_info structure in ->bi_private).
> 
Yes, I've found that one later on.
The real problem was in clone_setup_bios(), which might end up calling an
invalid end_io_data pointer. Patch is attached.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

[-- Attachment #2: dm-use-md-for-free_bio_clone --]
[-- Type: text/plain, Size: 2149 bytes --]

From: Hannes Reinecke <hare@suse.de>
Subject: Kernel oops in free_bio_clone()
References: bnc#472360

Bug is here:

static int setup_clone(struct request *clone, struct request *rq,
               struct dm_rq_target_io *tio)
{
    int r;

    blk_rq_init(NULL, clone);

    r = clone_request_bios(clone, rq, tio->md);
    if (r)
        return r;

    copy_request_info(clone, rq);
    clone->start_time = jiffies;
    clone->end_io = end_clone_request;
    clone->end_io_data = tio;

    return 0;
}

clone_request_bios() might end up calling free_bio_clone(), which references:

static void free_bio_clone(struct request *clone)
{
    struct dm_rq_target_io *tio = clone->end_io_data;
    struct mapped_device *md = tio->md;
...

but end_io_data will be set only _after_ the call to clone_request_bios().
So we should be passing the 'md' argument directly here to avoid this
bug and several pointless derefencings.

Signed-off-by: Hannes Reinecke <hare@suse.de>

--- linux-2.6.27-SLE11_BRANCH/drivers/md/dm.c.orig	2009-02-04 10:33:22.656627650 +0100
+++ linux-2.6.27-SLE11_BRANCH/drivers/md/dm.c	2009-02-05 11:03:35.843251773 +0100
@@ -709,10 +709,8 @@ static void end_clone_bio(struct bio *cl
 	blk_update_request(tio->orig, 0, nr_bytes);
 }
 
-static void free_bio_clone(struct request *clone)
+static void free_bio_clone(struct request *clone, struct mapped_device *md)
 {
-	struct dm_rq_target_io *tio = clone->end_io_data;
-	struct mapped_device *md = tio->md;
 	struct bio *bio;
 
 	while ((bio = clone->bio) != NULL) {
@@ -743,7 +741,7 @@ static void dm_unprep_request(struct req
 	rq->special = NULL;
 	rq->cmd_flags &= ~REQ_DONTPREP;
 
-	free_bio_clone(clone);
+	free_bio_clone(clone, tio->md);
 	dec_rq_pending(tio);
 	free_rq_tio(tio->md, tio);
 }
@@ -820,7 +818,7 @@ static void dm_end_request(struct reques
 			rq->sense_len = clone->sense_len;
 	}
 
-	free_bio_clone(clone);
+	free_bio_clone(clone, tio->md);
 	dec_rq_pending(tio);
 	free_rq_tio(tio->md, tio);
 
@@ -1406,7 +1404,7 @@ static int clone_request_bios(struct req
 	return 0;
 
 free_and_out:
-	free_bio_clone(clone);
+	free_bio_clone(clone, md);
 
 	return -ENOMEM;
 }

[-- Attachment #3: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-10  7:17           ` Hannes Reinecke
@ 2009-03-10  8:17             ` Kiyoshi Ueda
  2009-03-11 12:28               ` Hannes Reinecke
  2009-03-12  8:58             ` Kiyoshi Ueda
  1 sibling, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-03-10  8:17 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: device-mapper development

Hi Hannes,

On 03/10/2009 04:17 PM +0900, Hannes Reinecke wrote:
> Hi Kiyoshi,
> 
> Kiyoshi Ueda wrote:
>> Hi Hannes,
>>
>> On 2009/01/30 17:05 +0900, Kiyoshi Ueda wrote:
>>>>>   o kernel panic occurs by frequent table swapping during heavy I/Os.
>>>>>  
>>>> That's probably fixed by this patch:
>>>>
>>>> --- linux-2.6.27/drivers/md/dm.c.orig   2009-01-23
>>>> 15:59:22.741461315 +0100
>>>> +++ linux-2.6.27/drivers/md/dm.c        2009-01-26
>>>> 09:03:02.787605723 +0100
>>>> @@ -714,13 +714,14 @@ static void free_bio_clone(struct reques
>>>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>>>         struct mapped_device *md = tio->md;
>>>>         struct bio *bio;
>>>> -       struct dm_clone_bio_info *info;
>>>>  
>>>>         while ((bio = clone->bio) != NULL) {
>>>>                 clone->bio = bio->bi_next;
>>>>  
>>>> -               info = bio->bi_private;
>>>> -               free_bio_info(md, info);
>>>> +               if (bio->bi_private) {
>>>> +                       struct dm_clone_bio_info *info =
>>>> bio->bi_private;
>>>> +                       free_bio_info(md, info);
>>>> +               }
>>>>  
>>>>                 bio->bi_private = md->bs;
>>>>                 bio_put(bio);
>>>>
>>>> The info field is not necessarily filled here, so we have to check
>>>> for it
>>>> explicitly.
>>>>
>>>> With these two patches request-based multipathing have survived all
>>>> stress-tests
>>>> so far. Except on mainframe (zfcp), but that's more a driver-related
>>>> thing.
>>
>> My problem was different from that one, and I have fixed my problem.
>>
> What was this? Was is something specific to your setup or some within the
> request-based multipathing code?
> If the latter, I'd be _very_ much interested in seeing the patch.
> Naturally.

Suspend was broken.
dm_suspend() recognized that suspend completed while some requests
were still in flight.  So we could swap/free the in-use table while
there was in_flight request.
The patch is like the attached one, although it is not finalized and
I'm testing now.
I'll post an updated patch-set including the attached patch
this week or next week.


>> Do you hit some problem without the patch above?
>> If so, that should be a programming bug and we need to fix it. 
>> Otherwise,
>> we should be leaking a memory (since all cloned bio should always have
>> the dm_clone_bio_info structure in ->bi_private).
>>
> Yes, I've found that one later on.
> The real problem was in clone_setup_bios(), which might end up calling an
> invalid end_io_data pointer. Patch is attached.

Thank you for the patch.
I'll see it soon.

Thanks,
Kiyoshi Ueda


---
 drivers/md/dm.c |  236 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 144 insertions(+), 92 deletions(-)

Index: 2.6.29-rc2/drivers/md/dm.c
===================================================================
--- 2.6.29-rc2.orig/drivers/md/dm.c
+++ 2.6.29-rc2/drivers/md/dm.c
@@ -701,11 +701,17 @@ static void free_bio_clone(struct reques
 	}
 }
 
-static void dec_rq_pending(struct dm_rq_target_io *tio)
+/*
+ * XXX: Not taking queue lock for efficiency.
+ *      For correctness, waiters will check that again with queue lock held.
+ *      No false negative because this function will be called everytime
+ *      in_flight is decremented.
+ */
+static void rq_completed(struct mapped_device *md)
 {
-	if (!atomic_dec_return(&tio->md->pending))
+	if (!md->queue->in_flight)
 		/* nudge anyone waiting on suspend queue */
-		wake_up(&tio->md->wait);
+		wake_up(&md->wait);
 }
 
 static void dm_unprep_request(struct request *rq)
@@ -717,7 +723,6 @@ static void dm_unprep_request(struct req
 	rq->cmd_flags &= ~REQ_DONTPREP;
 
 	free_bio_clone(clone);
-	dec_rq_pending(tio);
 	free_rq_tio(tio->md, tio);
 }
 
@@ -727,6 +732,7 @@ static void dm_unprep_request(struct req
 void dm_requeue_request(struct request *clone)
 {
 	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	struct request_queue *q = rq->q;
 	unsigned long flags;
@@ -738,6 +744,8 @@ void dm_requeue_request(struct request *
 		blk_plug_device(q);
 	blk_requeue_request(q, rq);
 	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	rq_completed(md);
 }
 EXPORT_SYMBOL_GPL(dm_requeue_request);
 
@@ -776,6 +784,7 @@ static void start_queue(struct request_q
 static void dm_end_request(struct request *clone, int error)
 {
 	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
 	struct request *rq = tio->orig;
 	struct request_queue *q = rq->q;
 	unsigned int nr_bytes = blk_rq_bytes(rq);
@@ -794,12 +803,12 @@ static void dm_end_request(struct reques
 	}
 
 	free_bio_clone(clone);
-	dec_rq_pending(tio);
 	free_rq_tio(tio->md, tio);
 
 	if (unlikely(blk_end_request(rq, error, nr_bytes)))
 		BUG();
 
+	rq_completed(md);
 	blk_run_queue(q);
 }
 
@@ -1397,7 +1406,7 @@ static int setup_clone(struct request *c
 	return 0;
 }
 
-static inline int dm_flush_suspending(struct mapped_device *md)
+static inline int dm_rq_flush_suspending(struct mapped_device *md)
 {
 	return !md->suspend_rq.data;
 }
@@ -1411,23 +1420,11 @@ static int dm_prep_fn(struct request_que
 	struct dm_rq_target_io *tio;
 	struct request *clone;
 
-	if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend marker */
-		if (dm_flush_suspending(md)) {
-			if (q->in_flight)
-				return BLKPREP_DEFER;
-			else {
-				/* This device should be quiet now */
-				__stop_queue(q);
-				smp_mb();
-				BUG_ON(atomic_read(&md->pending));
-				wake_up(&md->wait);
-				return BLKPREP_KILL;
-			}
-		} else
-			/*
-			 * The suspend process was interrupted.
-			 * So no need to suspend now.
-			 */
+	if (unlikely(rq == &md->suspend_rq)) {
+		if (dm_rq_flush_suspending(md))
+			return BLKPREP_OK;
+		else
+			/* The flush suspend was interrupted */
 			return BLKPREP_KILL;
 	}
 
@@ -1436,11 +1433,6 @@ static int dm_prep_fn(struct request_que
 		return BLKPREP_KILL;
 	}
 
-	if (unlikely(!dm_request_based(md))) {
-		DMWARN("Request was queued into bio-based device");
-		return BLKPREP_KILL;
-	}
-
 	tio = alloc_rq_tio(md); /* Only one for each original request */
 	if (!tio)
 		/* -ENOMEM */
@@ -1473,7 +1465,6 @@ static void map_request(struct dm_target
 	struct dm_rq_target_io *tio = clone->end_io_data;
 
 	tio->ti = ti;
-	atomic_inc(&md->pending);
 	r = ti->type->map_rq(ti, clone, &tio->info);
 	switch (r) {
 	case DM_MAPIO_SUBMITTED:
@@ -1511,19 +1502,35 @@ static void dm_request_fn(struct request
 	struct request *rq;
 
 	/*
-	 * The check for blk_queue_stopped() needs here, because:
-	 *     - device suspend uses blk_stop_queue() and expects that
-	 *       no I/O will be dispatched any more after the queue stop
-	 *     - generic_unplug_device() doesn't call q->request_fn()
-	 *       when the queue is stopped, so no problem
-	 *     - but underlying device drivers may call q->request_fn()
-	 *       without the check through blk_run_queue()
+	 * The check of blk_queue_stopped() needs here, because we want to
+	 * complete noflush suspend quickly:
+	 *     - noflush suspend stops the queue in dm_suspend() and expects
+	 *       that no I/O will be dispatched any more after the queue stop
+	 *     - but if the queue stop is done while the loop below and
+	 *       there is no check for the queue stop, I/O dispatching
+	 *       may not stop until all remaining I/Os in the queue are
+	 *       dispatched.  For noflush suspend, we shouldn't want
+	 *       this behavior.
 	 */
 	while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
 		rq = elv_next_request(q);
 		if (!rq)
 			goto plug_and_out;
 
+		if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend maker */
+			if (q->in_flight)
+				/* Not quiet yet.  Wait more */
+				goto plug_and_out;
+
+			/* This device should be quiet now */
+			__stop_queue(q);
+			blkdev_dequeue_request(rq);
+			if (unlikely(__blk_end_request(rq, 0, 0)))
+				BUG();
+			wake_up(&md->wait);
+			goto out;
+		}
+
 		ti = dm_table_find_target(map, rq->sector);
 		if (ti->type->busy && ti->type->busy(ti))
 			goto plug_and_out;
@@ -1996,15 +2003,20 @@ EXPORT_SYMBOL_GPL(dm_put);
 static int dm_wait_for_completion(struct mapped_device *md)
 {
 	int r = 0;
+	struct request_queue *q = md->queue;
+	unsigned long flags;
 
 	while (1) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
 		smp_mb();
 		if (dm_request_based(md)) {
-			if (!atomic_read(&md->pending) &&
-			    blk_queue_stopped(md->queue))
+			spin_lock_irqsave(q->queue_lock, flags);
+			if (!q->in_flight && blk_queue_stopped(q)) {
+				spin_unlock_irqrestore(q->queue_lock, flags);
 				break;
+			}
+			spin_unlock_irqrestore(q->queue_lock, flags);
 		} else if (!atomic_read(&md->pending))
 			break;
 
@@ -2107,86 +2119,75 @@ out:
 	return r;
 }
 
-static inline void dm_invalidate_flush_suspend(struct mapped_device *md)
+static inline void dm_rq_invalidate_suspend_marker(struct mapped_device *md)
 {
 	md->suspend_rq.data = (void *)0x1;
 }
 
-static void dm_abort_suspend(struct mapped_device *md, int noflush)
+/*
+ * For noflush suspend, starting the queue is enough because noflush suspend
+ * only stops the queue.
+ *
+ * For flush suspend, we also need to take care of the marker.
+ * We could remove the marker from the queue forcibly using list_del_init(),
+ * but it would break the block-layer.  To follow the block-layer manner,
+ * we just put an invalidated mark on the marker here and wait for it to be
+ * completed by the normal way.
+ */
+static void dm_rq_abort_suspend(struct mapped_device *md, int noflush)
 {
 	struct request_queue *q = md->queue;
 	unsigned long flags;
 
-	/*
-	 * For flush suspend, invalidation and queue restart must be protected
-	 * by a single queue lock to prevent a race with dm_prep_fn().
-	 */
 	spin_lock_irqsave(q->queue_lock, flags);
 	if (!noflush)
-		dm_invalidate_flush_suspend(md);
+		dm_rq_invalidate_suspend_marker(md);
 	__start_queue(q);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
-/*
- * Additional suspend work for request-based dm.
- *
- * In request-based dm, stopping request_queue prevents mapping.
- * Even after stopping the request_queue, submitted requests from upper-layer
- * can be inserted to the request_queue.  So original (unmapped) requests are
- * kept in the request_queue during suspension.
- */
-static void dm_start_suspend(struct mapped_device *md, int noflush)
+static void dm_rq_start_suspend(struct mapped_device *md, int noflush)
 {
 	struct request *rq = &md->suspend_rq;
 	struct request_queue *q = md->queue;
-	unsigned long flags;
 
-	if (noflush) {
+	if (noflush)
 		stop_queue(q);
-		return;
+	else {
+		blk_rq_init(q, rq);
+		blk_insert_request(q, rq, 0, NULL);
 	}
+}
 
-	/*
-	 * For flush suspend, we need a marker to indicate the border line
-	 * between flush needed I/Os and deferred I/Os, since all I/Os are
-	 * queued in the request_queue during suspension.
-	 *
-	 * This marker must be inserted after setting DMF_BLOCK_IO,
-	 * because dm_prep_fn() considers no DMF_BLOCK_IO to be
-	 * a suspend interruption.
-	 */
+static int dm_rq_suspend_unavailable(struct mapped_device *md, int noflush)
+{
+	int r = 0;
+	struct request *rq = &md->suspend_rq;
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	if (noflush)
+		return 0;
+
+	/* The marker must be protected by queue lock if it is in use */
 	spin_lock_irqsave(q->queue_lock, flags);
 	if (unlikely(rq->ref_count)) {
 		/*
-		 * This can happen when the previous suspend was interrupted,
-		 * the inserted suspend_rq for the previous suspend has still
-		 * been in the queue and this suspend has been invoked.
-		 *
-		 * We could re-insert the suspend_rq by deleting it from
-		 * the queue forcibly using list_del_init(&rq->queuelist).
-		 * But it would break the block-layer easily.
-		 * So we don't re-insert the suspend_rq again in such a case.
-		 * The suspend_rq should be already invalidated during
-		 * the previous suspend interruption, so just wait for it
-		 * to be completed.
-		 *
-		 * This suspend will never complete, so warn the user to
-		 * interrupt this suspend and retry later.
+		 * This can happen, when the previous flush suspend was
+		 * interrupted, the marker is still in the queue and
+		 * this flush suspend has been invoked, because we don't
+		 * remove the marker at the time of suspend interruption.
+		 * We have only one marker per mapped_device, so we can't
+		 * start another flush suspend while it is in use.
 		 */
-		BUG_ON(!rq->data);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-
-		DMWARN("Invalidating the previous suspend is still in"
-		       " progress.  This suspend will be never done."
-		       " Please interrupt this suspend and retry later.");
-		return;
+		BUG_ON(!rq->data); /* The marker should be invalidated */
+		DMWARN("Invalidating the previous flush suspend is still in"
+		       " progress.  Please retry later.");
+		r = 1;
 	}
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	/* Now no user of the suspend_rq */
-	blk_rq_init(q, rq);
-	blk_insert_request(q, rq, 0, NULL);
+	return r;
 }
 
 /*
@@ -2231,6 +2232,52 @@ static void unlock_fs(struct mapped_devi
  * dm_bind_table, dm_suspend must be called to flush any in
  * flight bios and ensure that any further io gets deferred.
  */
+/*
+ * Suspend mechanism in request-based dm.
+ *
+ * After the suspend starts, (remaining requests in the request_queue are
+ * flushed if it is flush suspend, and) further incoming requests are kept
+ * in the request_queue and deferred.
+ * The suspend completes when the following conditions have been satisfied,
+ * so wait for it:
+ *    1. q->in_flight is 0 (which means no in_flight request)
+ *    2. queue has been stopped (which means no request dispatching)
+ *
+ *
+ * Noflush suspend
+ * ---------------
+ * Noflush suspend doesn't need to dispatch remaining requests.
+ * So stop the queue immediately.  Then, wait for all in_flight requests
+ * to be completed or requeued.
+ *
+ * To abort noflush suspend, start the queue.
+ *
+ *
+ * Flush suspend
+ * -------------
+ * Flush suspend needs to dispatch remaining requests.  So stop the queue
+ * after the remaining requests are completed. (Requeued request must be also
+ * re-dispatched and completed.  Until then, we can't stop the queue.)
+ *
+ * During flushing the remaining requests, further incoming requests are also
+ * inserted to the same queue.  To distinguish which requests are needed to be
+ * flushed, we insert a marker request to the queue at the time of starting
+ * flush suspend, like a barrier.
+ * The dispatching is blocked when the marker is found on the top of the queue.
+ * And the queue is stopped when all in_flight requests are completed, since
+ * that means the remaining requests are completely flushed.
+ * Then, the marker is removed from the queue.
+ *
+ * To abort flush suspend, we also need to take care of the marker, not only
+ * starting the queue.
+ * We could remove the marker forcibly from the queue, but it would break
+ * the block-layer.  Instead, we put a invalidated mark on the marker.
+ * When the invalidated marker is found on the top of the queue, it is
+ * immediately removed from the queue, so it doesn't block dispatching.
+ * Because we have only one marker per mapped_device, we can't start another
+ * flush suspend until the invalidated marker is removed from the queue.
+ * So fail and return with -EBUSY in such a case.
+ */
 int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 {
 	struct dm_table *map = NULL;
@@ -2246,6 +2293,11 @@ int dm_suspend(struct mapped_device *md,
 		goto out_unlock;
 	}
 
+	if (dm_request_based(md) && dm_rq_suspend_unavailable(md, noflush)) {
+		r = -EBUSY;
+		goto out_unlock;
+	}
+
 	map = dm_get_table(md);
 
 	/*
@@ -2288,7 +2340,7 @@ int dm_suspend(struct mapped_device *md,
 	up_write(&md->io_lock);
 
 	if (dm_request_based(md))
-		dm_start_suspend(md, noflush);
+		dm_rq_start_suspend(md, noflush);
 
 	/* unplug */
 	if (map)
@@ -2316,7 +2368,7 @@ int dm_suspend(struct mapped_device *md,
 		dm_queue_flush(md, DM_WQ_FLUSH_DEFERRED, NULL);
 
 		if (dm_request_based(md))
-			dm_abort_suspend(md, noflush);
+			dm_rq_abort_suspend(md, noflush);
 
 		unlock_fs(md);
 		goto out; /* pushback list is already flushed, so skip flush */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-10  8:17             ` Kiyoshi Ueda
@ 2009-03-11 12:28               ` Hannes Reinecke
  2009-03-12  1:40                 ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2009-03-11 12:28 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: device-mapper development

Hi Kiyoshi,

Kiyoshi Ueda wrote:
> Hi Hannes,
> 
[ .. ]
> 
> Suspend was broken.
> dm_suspend() recognized that suspend completed while some requests
> were still in flight.  So we could swap/free the in-use table while
> there was in_flight request.
> The patch is like the attached one, although it is not finalized and
> I'm testing now.
> I'll post an updated patch-set including the attached patch
> this week or next week.
> 
> 
> ---
>  drivers/md/dm.c |  236 ++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 144 insertions(+), 92 deletions(-)
> 
> Index: 2.6.29-rc2/drivers/md/dm.c
> ===================================================================
> --- 2.6.29-rc2.orig/drivers/md/dm.c
> +++ 2.6.29-rc2/drivers/md/dm.c
> @@ -701,11 +701,17 @@ static void free_bio_clone(struct reques
>  	}
>  }
>  
> -static void dec_rq_pending(struct dm_rq_target_io *tio)
> +/*
> + * XXX: Not taking queue lock for efficiency.
> + *      For correctness, waiters will check that again with queue lock held.
> + *      No false negative because this function will be called everytime
> + *      in_flight is decremented.
> + */
> +static void rq_completed(struct mapped_device *md)
>  {
> -	if (!atomic_dec_return(&tio->md->pending))
> +	if (!md->queue->in_flight)
>  		/* nudge anyone waiting on suspend queue */
> -		wake_up(&tio->md->wait);
> +		wake_up(&md->wait);
>  }
>  
Hmm. Don't think that's a good idea. Either take the spinlock here or
in_flight should be atomic.

Apart from this I'll give it a shot.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-11 12:28               ` Hannes Reinecke
@ 2009-03-12  1:40                 ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-03-12  1:40 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: device-mapper development

Hi Hannes,

On 2009/03/11 21:28 +0900, Hannes Reinecke wrote:
> Hi Kiyoshi,
> 
> Kiyoshi Ueda wrote:
>> Hi Hannes,
>>
> [ .. ]
>>
>> Suspend was broken.
>> dm_suspend() recognized that suspend completed while some requests
>> were still in flight.  So we could swap/free the in-use table while
>> there was in_flight request.
>> The patch is like the attached one, although it is not finalized and
>> I'm testing now.
>> I'll post an updated patch-set including the attached patch
>> this week or next week.
>>
>>
>> ---
>>  drivers/md/dm.c |  236
>> ++++++++++++++++++++++++++++++++++----------------------
>>  1 file changed, 144 insertions(+), 92 deletions(-)
>>
>> Index: 2.6.29-rc2/drivers/md/dm.c
>> ===================================================================
>> --- 2.6.29-rc2.orig/drivers/md/dm.c
>> +++ 2.6.29-rc2/drivers/md/dm.c
>> @@ -701,11 +701,17 @@ static void free_bio_clone(struct reques
>>      }
>>  }
>>  
>> -static void dec_rq_pending(struct dm_rq_target_io *tio)
>> +/*
>> + * XXX: Not taking queue lock for efficiency.
>> + *      For correctness, waiters will check that again with queue
>> lock held.
>> + *      No false negative because this function will be called everytime
>> + *      in_flight is decremented.
>> + */
>> +static void rq_completed(struct mapped_device *md)
>>  {
>> -    if (!atomic_dec_return(&tio->md->pending))
>> +    if (!md->queue->in_flight)
>>          /* nudge anyone waiting on suspend queue */
>> -        wake_up(&tio->md->wait);
>> +        wake_up(&md->wait);
>>  }
>>  
> Hmm. Don't think that's a good idea. Either take the spinlock here or
> in_flight should be atomic.

Thank you for the comment.
OK, I'll change to take queue_lock here for maintenancability now,
although the queue_lock is not needed logically.
Then, I'll have another patch to drop the queue_lock for efficiency
in the future.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-10  7:17           ` Hannes Reinecke
  2009-03-10  8:17             ` Kiyoshi Ueda
@ 2009-03-12  8:58             ` Kiyoshi Ueda
  2009-03-12  9:08               ` Hannes Reinecke
  1 sibling, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-03-12  8:58 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: device-mapper development

Hi Hannes,

On 2009/03/10 16:17 +0900, Hannes Reinecke wrote:
>>>>>   o kernel panic occurs by frequent table swapping during heavy I/Os.
>>>>>  
>>>> That's probably fixed by this patch:
>>>>
>>>> --- linux-2.6.27/drivers/md/dm.c.orig   2009-01-23
>>>> 15:59:22.741461315 +0100
>>>> +++ linux-2.6.27/drivers/md/dm.c        2009-01-26
>>>> 09:03:02.787605723 +0100
>>>> @@ -714,13 +714,14 @@ static void free_bio_clone(struct reques
>>>>         struct dm_rq_target_io *tio = clone->end_io_data;
>>>>         struct mapped_device *md = tio->md;
>>>>         struct bio *bio;
>>>> -       struct dm_clone_bio_info *info;
>>>>  
>>>>         while ((bio = clone->bio) != NULL) {
>>>>                 clone->bio = bio->bi_next;
>>>>  
>>>> -               info = bio->bi_private;
>>>> -               free_bio_info(md, info);
>>>> +               if (bio->bi_private) {
>>>> +                       struct dm_clone_bio_info *info =
>>>> bio->bi_private;
>>>> +                       free_bio_info(md, info);
>>>> +               }
>>>>  
>>>>                 bio->bi_private = md->bs;
>>>>                 bio_put(bio);
>>>>
>>>> The info field is not necessarily filled here, so we have to check
>>>> for it
>>>> explicitly.
>>>>
>>>> With these two patches request-based multipathing have survived all
>>>> stress-tests
>>>> so far. Except on mainframe (zfcp), but that's more a driver-related
>>>> thing.
>>
>> Do you hit some problem without the patch above?
>> If so, that should be a programming bug and we need to fix it. 
>> Otherwise,
>> we should be leaking a memory (since all cloned bio should always have
>> the dm_clone_bio_info structure in ->bi_private).
>>
> Yes, I've found that one later on.
> The real problem was in clone_setup_bios(), which might end up calling an
> invalid end_io_data pointer. Patch is attached.

Nice catch!  Thank you for the patch.

> -static void free_bio_clone(struct request *clone)
> +static void free_bio_clone(struct request *clone, struct mapped_device *md)

I have changed the argument order to match with other free_* functions:
    free_bio_clone(struct mapped_device *md, struct request *clone)

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-12  8:58             ` Kiyoshi Ueda
@ 2009-03-12  9:08               ` Hannes Reinecke
  2009-03-13  1:03                 ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Hannes Reinecke @ 2009-03-12  9:08 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: device-mapper development

Hi Kiyoshi,

Kiyoshi Ueda wrote:
> Hi Hannes,
> 
> On 2009/03/10 16:17 +0900, Hannes Reinecke wrote:
[ .. ]
>> Yes, I've found that one later on.
>> The real problem was in clone_setup_bios(), which might end up calling an
>> invalid end_io_data pointer. Patch is attached.
> 
> Nice catch!  Thank you for the patch.
> 
Oh, nae bother. Took me only a month to track it down :-(

>> -static void free_bio_clone(struct request *clone)
>> +static void free_bio_clone(struct request *clone, struct mapped_device *md)
> 
> I have changed the argument order to match with other free_* functions:
>     free_bio_clone(struct mapped_device *md, struct request *clone)
> 
Sure. I wasn't sure myself which way round the arguments should be.

Do you have an updated patch of your suspend fixes? We've run into an issue
here which looks suspiciously close to that one (I/O is completed on a deleted
pgpath), so we would be happy to test it these out.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/8] dm: request-based dm-multipath
  2009-03-12  9:08               ` Hannes Reinecke
@ 2009-03-13  1:03                 ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2009-03-13  1:03 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: device-mapper development

Hi Hannes,

On 2009/03/12 18:08 +0900, Hannes Reinecke wrote:
> Do you have an updated patch of your suspend fixes? We've run into an issue
> here which looks suspiciously close to that one (I/O is completed on a
> deleted pgpath), so we would be happy to test it these out.

You mean that the issue occurs WITHOUT the suspend fix patch which
I sent.  Is it right?
If so, you can use it, since I haven't added any big change about
suspend fix since then.
Logic changes I added about suspend fix since then are only in
rq_complete() to follow your comment.  The updated rq_complete()
is below:

static void rq_completed(struct mapped_device *md)
{
	struct request_queue *q = md->queue;
	unsigned long flags;

	spin_lock_irqsave(q->queue_lock, flags);
	if (q->in_flight) {
		spin_unlock_irqrestore(q->queue_lock, flags);
		return;
	}
	spin_unlock_irqrestore(q->queue_lock, flags);

	/* nudge anyone waiting on suspend queue */
	wake_up(&md->wait);
}


I merged the previous suspend fix patch into the request-based dm
core patch, and I've been changing the core patch after that.
So I don't have a patch which addresses only suspend fix update.
Sorry about that.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2009-03-13  1:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-03 15:08 [PATCH 0/8] dm: request-based dm-multipath Kiyoshi Ueda
2008-10-03 15:09 ` [PATCH 1/8] dm core: remove unused DM_WQ_FLUSH_ALL Kiyoshi Ueda
2008-10-03 15:09   ` Kiyoshi Ueda
2008-10-03 15:10 ` [PATCH 2/8] dm core: tidy local_init Kiyoshi Ueda
2008-10-03 15:17 ` [PATCH 3/8] dm core: add kmem_cache for request-based dm Kiyoshi Ueda
2008-10-03 15:18 ` [PATCH 4/8] dm core: add target interfaces " Kiyoshi Ueda
2008-10-03 15:18   ` Kiyoshi Ueda
2008-10-03 15:18 ` [PATCH 5/8] dm core: add core functions " Kiyoshi Ueda
2008-10-03 15:18   ` Kiyoshi Ueda
2008-10-03 15:19 ` [PATCH 6/8] dm core: enable " Kiyoshi Ueda
2008-10-03 15:19   ` Kiyoshi Ueda
2008-10-03 15:19 ` [PATCH 7/8] dm core: reject I/O violating new queue limits Kiyoshi Ueda
2008-10-03 15:19   ` Kiyoshi Ueda
2008-10-03 15:19 ` [PATCH 8/8] dm-mpath: convert to request-based Kiyoshi Ueda
2008-10-03 15:19   ` Kiyoshi Ueda
2009-01-28 15:40 ` [PATCH 0/8] dm: request-based dm-multipath Alasdair G Kergon
2009-01-29  7:18   ` Kiyoshi Ueda
2009-01-29 10:41     ` Hannes Reinecke
2009-01-29 14:32       ` Alasdair G Kergon
2009-01-30  8:05       ` Kiyoshi Ueda
2009-03-10  6:10         ` Kiyoshi Ueda
2009-03-10  7:17           ` Hannes Reinecke
2009-03-10  8:17             ` Kiyoshi Ueda
2009-03-11 12:28               ` Hannes Reinecke
2009-03-12  1:40                 ` Kiyoshi Ueda
2009-03-12  8:58             ` Kiyoshi Ueda
2009-03-12  9:08               ` Hannes Reinecke
2009-03-13  1:03                 ` Kiyoshi Ueda

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.