* [PATCH 0/5] request-based dm-multipath (v3)
@ 2009-06-02  6:57 Kiyoshi Ueda
  2009-06-02  6:59 ` [PATCH 1/5] dm core: add core functions for request-based dm Kiyoshi Ueda
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-06-02  6:57 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

Hi,

This patch-set is the updated version of request-based dm-multipath.
Alasdair, please consider including this in 2.6.31.

The patches are based on the combination of the following trees and patches:
  - Jens' block/for-2.6.31 (c143dc903d7c0b15f5052e00b2c7de33a8b4299c)
  - Alasdair's linux-next patches on June 2nd (UTC).
  - Mike Snitzer's dm-topology patch:
    http://marc.info/?l=dm-devel&m=124345842831206&w=2
    (Actually, no functional dependencies on this patch.)

This patch-set doesn't have barrier support yet.

Basic function and performance testing was done with NEC iStorage
(active-active multipath), and no problems were found.
Heavy stress testing with many path up/down events and table
switches was also done to confirm that no panic, stall or memory
leak occurs.
Please review and apply if there are no problems.

NOTE:
Recently, __sector and __data_len of struct request are commented
as 'internal, NEVER access directly' because they reflect the latest
status of request processing and change as BIOs are added/completed.
However, since request-based dm needs to copy the latest status
of the original request to the clone, those fields are accessed
directly in this patch-set.
A further block layer change will be needed to clean up this situation.
I'm going to discuss it separately on linux-kernel.


Summary of the patch-set
========================
  1/5: dm core: add core functions for request-based dm
  2/5: dm core: enable request-based dm
  3/5: dm core: don't set QUEUE_ORDERED_DRAIN for request-based dm
  4/5: dm core: disable interrupt when taking map_lock
  5/5: dm-mpath: convert to request-based

 drivers/md/dm-ioctl.c         |   13 
 drivers/md/dm-mpath.c         |  193 +++++---
 drivers/md/dm-table.c         |  130 +++++
 drivers/md/dm.c               |  947 ++++++++++++++++++++++++++++++++++++++++--
 drivers/md/dm.h               |   27 +
 include/linux/device-mapper.h |    9 
 6 files changed, 1218 insertions(+), 101 deletions(-)


CHANGES
=======
v3:
  - Rebased to
      * Jens' block/for-2.6.31 (c143dc903d7c0b15f5052e00b2c7de33a8b4299c)
      * Alasdair's linux-next patches on June 2nd (UTC)
      * Mike Snitzer's dm-topology patch:
        http://marc.info/?l=dm-devel&m=124345842831206&w=2
        (Actually, no functional dependencies on this patch.)

  - Removed the patch which rejects I/Os violating new queue limits,
    since Hannes pointed out that it seems to cause unexpected
    -EIO in no-path situations.
      * This unexpected -EIO problem is still under investigation.
      * The patch was for dangerous table swapping such as shrinking
        the queue limits.  Generally, such table swapping shouldn't
        be done, so removing the patch for now shouldn't cause
        a problem.

  - Merged the integrity patch into the core function patch (PATCH 1)

  - Fixed types of variables (PATCH 1, PATCH 2)
      * Changed to use 'unsigned' instead of 'int' for 'i' in
        dm_table_any_busy_target() and dm_table_set_type(),
        since 'i' is always a positive value and is compared with
        't->num_targets', which is 'unsigned'.
      * Changed to use 'unsigned' instead of 'int' for 'bio_based'
        and 'request_based' in dm_table_set_type(), since they are
        always positive values.

  - Moved table type from struct io_restrictions to struct dm_table
    (PATCH 2)
      * In v2, table type was stored in struct io_restrictions as
        'no_request_stacking'.  But Mike Snitzer's dm-topology patch
        replaced the dm-specific io_restrictions with the generic I/O
        topology framework (queue_limits).
      * Since table type is dm-specific, it cannot be a part of
        queue_limits.  So 'type' member is added to struct dm_table.

  - Dropped the cast in multipath_busy(), since it is a cast from
    'void *' (PATCH 5)


v2: (http://marc.info/?l=dm-devel&m=124056068208687&w=2)
  - Rebased to 2.6.30-rc3

  - Fixed a memory leak when bio_integrity_clone() fails (PATCH 2)
      * Added the missing bio_free() when bio_integrity_clone() fails.

  - Added a patch not to set QUEUE_ORDERED_DRAIN for request-based dm
    (PATCH 4)
      * This version of request-based dm doesn't have barrier support
        yet, so QUEUE_ORDERED_DRAIN is set only for bio-based dm.
        Since the device type is decided at the first table binding
        time, setting the flag is deferred until then.


v1: (http://marc.info/?l=dm-devel&m=123736590931733&w=2)
  - Rebased to 2.6.29-rc8

  - Fixed oops on table swapping because of broken suspend (PATCH 1)
    http://marc.info/?l=dm-devel&m=123667308218868&w=2
      * dm_suspend() considered the suspend complete while some
        clones were still in flight.  So the in-use table could be
        swapped/freed, which caused an oops
      * The problem was a race condition in the in-flight check:
        new I/O could become in-flight while the suspend code found
        no I/O in flight and went on
      * Fixed it by protecting the in-flight counting and queue
        suspension with the queue lock.  Also switched from md->pending
        to q->in_flight, since that makes the meaning of the in-flight
        clones straightforward to understand

  - Fixed NULL pointer reference when allocation for clone bio fails
    (From Hannes Reinecke) (PATCH 1)
    http://marc.info/?l=dm-devel&m=123666946315335&w=2
      * free_bio_clone() used clone->end_io_data to get md
      * But clone->end_io_data can be NULL when free_bio_clone()
        is called from the error handling path in clone_request_bios()
      * Fixed it by passing md directly to free_bio_clone()
      * Removed the workaround code which checked bio->bi_private:
        http://marc.info/?l=dm-devel&m=123666946315335&w=2

  - Fixed unnecessary -EIO on table swapping by removing
    the map-existence check in dm_make_request() (PATCH 1)
    http://marc.info/?l=dm-devel&m=123322573410050&w=2
      * The check was there to reject BIOs until a table is loaded
        and the device and the queue are initialized correctly
      * But the check is not needed because BIOs are rejected
        in __generic_make_request() by device size checking
        until the device and the queue are initialized

  - Fixed freeing of an in-use md (PATCH 1)
      * The upper layer can close the device just after all BIOs are
        completed by blk_update_request() or blk_end_request(),
        even if the request is still in the process of completion.
        This creates a small window where dm_put() can free the md
        that is still needed to finish the completion
      * Fixed it by incrementing md->holders while I/O is in flight

  - Fixed a memory leak (missing unmap) on failure to dispatch
    a mapped clone (PATCH 1)
      * dm_kill_request() was used in dm_dispatch_request() when
        the dispatch fails
      * But dm_kill_request() is for not-mapped clones, so the target's
        rq_end_io() function was not called.  This caused the memory leak
      * Fixed it by creating dm_complete_request() for mapped clones
        and calling it in dm_dispatch_request()

  - Renamed the exported functions to prevent misuse like the above
    (PATCH 1, PATCH 3, PATCH 5)
      * dm_kill_request()    => dm_kill_unmapped_request()
      * dm_requeue_request() => dm_requeue_unmapped_request()

  - Removed the REQ_QUIET flag in dm_kill_unmapped_request(), since
    with that flag no "I/O error" message is printed for failed I/O
    (PATCH 1)

  - Removed 'inline' from simple static functions, since the compiler
    will inline them anyway (PATCH 1)

  - Removed the if-request-based check in dm_prep_fn(), since the
    device is always request-based when dm_prep_fn() is called (PATCH 1)
      * The check was for the case where the table is switched from
        request-based to bio-based.  But there is currently no plan
        to support such switching

  - Fixed the problem that the device type could unexpectedly be
    switched on table swapping (PATCH 2)
      * table_load() checks md->map for the current table type
        to prevent the device type from being switched
      * But table_load() can run in parallel with dm_swap_table(),
        which has a small no-map window.
        So dm_init_md_mempool() in table_load() could switch
        the mempool type of an in-use device in this window
      * Fixed it by placing the pre-allocated mempools in the table
        and binding them on resume.  The resume is rejected if
        the device is in use with a different type

  - Fixed the problem that a BIO could be submitted to a request-based
    device before all of its queue settings were done (PATCH 2)
      * The QUEUE_FLAG_STACKABLE flag was set in the middle of
        dm_table_set_restrictions()
      * But the flag must be set only after all queue settings are
        done and have become visible to other CPUs, because:
          o BIOs can be submitted during dm_table_set_restrictions()
          o Once the QUEUE_FLAG_STACKABLE flag is set in the queue,
            BIOs are passed to request-based dm and eventually to
            __make_request(), where the queue settings are used
      * Fixed it by re-ordering the flag set and putting a barrier
        before it.

  - Fixed lockdep warnings (From Christof Schmitt) (PATCH 4)
    http://marc.info/?l=dm-devel&m=123501258326935&w=2
      * lockdep warns of possible self-deadlocks since:
          o map_lock is taken without disabling irqs
          o request-based dm takes map_lock after taking
            queue_lock with irqs disabled in dm_request_fn()
          o so logically a deadlock could happen:
              write_lock(map_lock)
              <interrupt>
              spin_lock_irqsave(queue_lock)
              dm_request_fn()
                  => dm_get_table()
                      => read_lock(map_lock)
      * dm_request_fn() is never called in interrupt context, so
        such self-deadlocks don't actually happen
      * But silenced the unneeded warnings by disabling irqs when
        taking map_lock


A summary of the design of request-based dm-multipath is below.
(No changes since the previous post.)

BACKGROUND
==========
Currently, device-mapper (dm) is implemented as a stacking block device
at the bio level.  This bio-based implementation has the following issue
for dm-multipath.

  Because the hook for I/O mapping sits above the block layer's
  __make_request(), contiguous bios can be mapped to different
  underlying devices, and such bios are never merged into a request.
  This situation could easily occur with dynamic load balancing,
  though that has not been implemented yet.
  Therefore, I/O mapping after bio merging is needed for better
  dynamic load balancing.

The basic idea to resolve the issue is to move the multipathing layer
down below the I/O scheduler, as proposed by Mike Christie as
the block layer (request-based) multipath:
  http://marc.info/?l=linux-scsi&m=115520444515914&w=2

Mike's patch added a new block layer device for multipath and had
no dm interface, so I modified his patch so that it can be used from dm.
That is request-based dm-multipath.


DESIGN OVERVIEW
===============
While dm and md currently stack block devices at the bio level,
request-based dm stacks at the request level and submits/completes
struct request instead of struct bio.


Overview of the request-based dm patch:
  - Mapping is done in a unit of struct request, instead of struct bio
  - The hook for I/O mapping is at q->request_fn(), after merging and
    sorting by the I/O scheduler, instead of q->make_request_fn().
  - The hook for I/O completion is at bio->bi_end_io() and rq->end_io(),
    instead of only bio->bi_end_io()
                  bio-based (current)     request-based (this patch)
      ------------------------------------------------------------------
      submission  q->make_request_fn()    q->request_fn()
      completion  bio->bi_end_io()        bio->bi_end_io(), rq->end_io()
  - Whether the dm device is bio-based or request-based is determined
    at table loading time
  - The user interface is kept the same (table/message/status/ioctl)
  - Any bio-based device (like dm/md) can be stacked on a request-based
    dm device.
    A request-based dm device *cannot* be stacked on any bio-based device.
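
For reference, below is a minimal sketch of how the hooks in the
overview above are wired up for a request-based md.  This is not an
exact hunk from the patches (the real initialization is in alloc_dev()
and related code in PATCH 1/2); it just collects the calls in one place:

    md->queue = blk_init_queue(dm_request_fn, NULL);    /* request_fn */
    md->saved_make_request_fn = md->queue->make_request_fn;
    blk_queue_make_request(md->queue, dm_request);      /* bio entry point */
    blk_queue_prep_rq(md->queue, dm_prep_fn);           /* prep_rq_fn */
    blk_queue_softirq_done(md->queue, dm_softirq_done); /* softirq_done_fn */
    blk_queue_lld_busy(md->queue, dm_lld_busy);         /* lld_busy_fn */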


Expected benefit:
  - better load balancing


Additional explanations:

Why does request-based dm hook bio->bi_end_io(), too?
Because:
  - dm needs to keep not only the request but also the bios of the
    request, in case dm target drivers want to retry or do something
    with the request.
    For example, dm-multipath has to check errors and retry on other
    paths if necessary before returning the I/O result to the upper layer.

  - But rq->end_io() is called at a very late stage of completion
    handling, where all bios in the request have been completed and
    the I/O results are already visible to the upper layer.
So request-based dm hooks bio->bi_end_io() and doesn't complete the bio
in error cases, and hands the error handling over to the rq->end_io() hook.
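
As an illustration only (the function name below is made up, this is
not code from the patch-set), a target's rq_end_io() hook can use the
DM_ENDIO_REQUEUE return value to keep an error away from the upper
layer; dm-mpath's real version (PATCH 5) also does its path error
handling first:

    static int example_rq_end_io(struct dm_target *ti, struct request *clone,
                                 int error, union map_info *map_context)
    {
            if (!error)
                    return 0;   /* let dm core complete the original request */

            /*
             * On error, ask dm core to requeue and remap the original
             * request instead of completing it with the error.
             */
            return DM_ENDIO_REQUEUE;
    }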


Thanks,
Kiyoshi Ueda


* [PATCH 1/5] dm core: add core functions for request-based dm
  2009-06-02  6:57 [PATCH 0/5] request-based dm-multipath (v3) Kiyoshi Ueda
@ 2009-06-02  6:59 ` Kiyoshi Ueda
  2009-06-02  7:01 ` [PATCH 2/5] dm core: enable " Kiyoshi Ueda
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-06-02  6:59 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

This patch adds core functions for request-based dm.

When a struct mapped_device (md) is initialized, md->queue gets
an I/O scheduler and the following functions are used as
the queue functions for request-based dm:
    make_request_fn: dm_make_request()
    prep_rq_fn:      dm_prep_fn()
    request_fn:      dm_request_fn()
    softirq_done_fn: dm_softirq_done()
    lld_busy_fn:     dm_lld_busy()
The actual initializations are done in another patch (PATCH 2).

Below is a brief summary of how request-based dm behaves, including:
  - making request from bio
  - cloning, mapping and dispatching request
  - completing request and bio
  - suspending md
  - resuming md


bio to request
==============
md->queue->make_request_fn() (dm_make_request()) calls __make_request()
for a bio submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.


Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.

dm_request_fn() checks the busy state of the underlying devices using
the target's busy() function and, if they are busy, stops dispatching
requests to keep them on the dm device's queue.
This helps I/O merging, since no merging is done for a request
once it has been dispatched to an underlying device.

The actual cloning and mapping are done in dm_prep_fn() and
map_request(), called from dm_request_fn().
dm_prep_fn() clones not only the request but also the bios of the
request, so that dm can hold back bio completion in error cases and
prevent the bio submitter from noticing the error.
(See the "Completion" section below for details.)

After the cloning, the clone is mapped by the target's map_rq() function
and inserted into the underlying device's queue using
blk_insert_cloned_request().
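
For illustration only (hypothetical names, not code from this
patch-set; dm-mpath's real map_rq() is in PATCH 5), a single-device
target's map_rq() would remap the clone roughly like this and let
dm core dispatch it via dm_dispatch_request():

    /* hypothetical per-target context holding one underlying device */
    struct example_ctx {
            struct dm_dev *dev;
    };

    static int example_map_rq(struct dm_target *ti, struct request *clone,
                              union map_info *map_context)
    {
            struct example_ctx *ctx = ti->private;

            /* Point the clone at the underlying device's queue and disk */
            clone->q = bdev_get_queue(ctx->dev->bdev);
            clone->rq_disk = ctx->dev->bdev->bd_disk;

            return DM_MAPIO_REMAPPED;   /* dm core dispatches the clone */
    }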


Completion
==========
Request completion could be hooked with rq->end_io() alone, but by then
all bios in the request have been completed, even in error cases, and
the bio submitter has already noticed the error.
To prevent bio completion in error cases, request-based dm clones
both the bios and the request and hooks both bio->bi_end_io() and rq->end_io():
    bio->bi_end_io(): end_clone_bio()
    rq->end_io():     end_clone_request()

Summary of the request completion flow is below:
blk_end_request() for a clone request
  => blk_update_request()
     => bio->bi_end_io() == end_clone_bio() for each clone bio
        => Free the clone bio
        => Success: Complete the original bio (blk_update_request())
           Error:   Don't complete the original bio
  => blk_finish_request()
     => rq->end_io() == end_clone_request()
        => blk_complete_request()
           => dm_softirq_done()
              => Free the clone request
              => Success: Complete the original request (blk_end_request())
                 Error:   Requeue the original request

In successful cases, end_clone_bio() completes the original request by
the size of the original bio.
Even if all bios in the original request are completed by that
completion, the original request must not be completed yet, to keep
the ordering of request completion for the stacking.
So end_clone_bio() uses blk_update_request() instead of
blk_end_request().
In error cases, end_clone_bio() doesn't complete the original bio.
It just frees the cloned bio and hands the error handling over to
end_clone_request().

end_clone_request(), which is called with the queue lock held, completes
the clone request and the original request in softirq context
(dm_softirq_done()), where no queue lock is held, to avoid a deadlock
on submission of another request during the completion:
    - The submitted request may be mapped to the same device
    - Request submission requires the queue lock, but the queue lock
      is already held by the caller, which doesn't know that

The clone request has no clone bios left when dm_softirq_done() is
called, so target drivers can't resubmit it, even in error cases.
Instead, they can ask dm core to requeue and remap the original
request in such cases.


suspend
=======
Request-based dm suspends the md by stopping md->queue.
For a noflush suspend, md->queue is simply stopped.

For a flush suspend, a marker request is inserted at the tail of
md->queue, and all requests in md->queue are dispatched until the marker
comes to the front of md->queue.  Then dispatching is stopped and we
wait for all dispatched requests to complete.
After that, the marker request is completed, md->queue is stopped and
the waiter on the suspend queue, md->wait, is woken up.
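
Condensed from the dm_wait_for_completion() change below (the helper
name here is made up for illustration), the condition that ends the
suspend wait for a request-based md is:

    static int example_rq_suspend_completed(struct request_queue *q)
    {
            unsigned long flags;
            int done;

            spin_lock_irqsave(q->queue_lock, flags);
            /* suspended once nothing is in flight and the queue is stopped */
            done = !queue_in_flight(q) && blk_queue_stopped(q);
            spin_unlock_irqrestore(q->queue_lock, flags);

            return done;
    }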


resume
======
Starts md->queue.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-table.c         |   14 
 drivers/md/dm.c               |  759 +++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm.h               |    1 
 include/linux/device-mapper.h |    9 
 4 files changed, 780 insertions(+), 3 deletions(-)

Index: linux-2.6-block/drivers/md/dm.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.c
+++ linux-2.6-block/drivers/md/dm.c
@@ -83,6 +83,14 @@ union map_info *dm_get_mapinfo(struct bi
 	return NULL;
 }
 
+union map_info *dm_get_rq_mapinfo(struct request *rq)
+{
+	if (rq && rq->end_io_data)
+		return &((struct dm_rq_target_io *)rq->end_io_data)->info;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(dm_get_rq_mapinfo);
+
 #define MINOR_ALLOCED ((void *)-1)
 
 /*
@@ -164,6 +172,12 @@ struct mapped_device {
 	/* forced geometry settings */
 	struct hd_geometry geometry;
 
+	/* marker of flush suspend for request-based dm */
+	struct request suspend_rq;
+
+	/* For saving the address of __make_request for request based dm */
+	make_request_fn *saved_make_request_fn;
+
 	/* sysfs handle */
 	struct kobject kobj;
 
@@ -401,6 +415,27 @@ static void free_tio(struct mapped_devic
 	mempool_free(tio, md->tio_pool);
 }
 
+static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md)
+{
+	return mempool_alloc(md->tio_pool, GFP_ATOMIC);
+}
+
+static void free_rq_tio(struct mapped_device *md, struct dm_rq_target_io *tio)
+{
+	mempool_free(tio, md->tio_pool);
+}
+
+static struct dm_rq_clone_bio_info *alloc_bio_info(struct mapped_device *md)
+{
+	return mempool_alloc(md->io_pool, GFP_ATOMIC);
+}
+
+static void free_bio_info(struct mapped_device *md,
+			  struct dm_rq_clone_bio_info *info)
+{
+	mempool_free(info, md->io_pool);
+}
+
 static void start_io_acct(struct dm_io *io)
 {
 	struct mapped_device *md = io->md;
@@ -610,6 +645,281 @@ static void clone_endio(struct bio *bio,
 	dec_pending(io, error);
 }
 
+/*
+ * Partial completion handling for request-based dm
+ */
+static void end_clone_bio(struct bio *clone, int error)
+{
+	struct dm_rq_clone_bio_info *info = clone->bi_private;
+	struct dm_rq_target_io *tio = info->rq->end_io_data;
+	struct bio *bio = info->orig;
+	unsigned int nr_bytes = info->orig->bi_size;
+
+	free_bio_info(tio->md, info);
+	clone->bi_private = tio->md->bs;
+	bio_put(clone);
+
+	if (tio->error) {
+		/*
+		 * An error has already been detected on the request.
+		 * Once error occurred, just let clone->end_io() handle
+		 * the remainder.
+		 */
+		return;
+	} else if (error) {
+		/*
+		 * Don't notice the error to the upper layer yet.
+		 * The error handling decision is made by the target driver,
+		 * when the request is completed.
+		 */
+		tio->error = error;
+		return;
+	}
+
+	/*
+	 * I/O for the bio successfully completed.
+	 * Notice the data completion to the upper layer.
+	 */
+
+	/*
+	 * bios are processed from the head of the list.
+	 * So the completing bio should always be rq->bio.
+	 * If it's not, something wrong is happening.
+	 */
+	if (tio->orig->bio != bio)
+		DMERR("bio completion is going in the middle of the request");
+
+	/*
+	 * Update the original request.
+	 * Do not use blk_end_request() here, because it may complete
+	 * the original request before the clone, and break the ordering.
+	 */
+	blk_update_request(tio->orig, 0, nr_bytes);
+}
+
+static void free_bio_clone(struct mapped_device *md, struct request *clone)
+{
+	struct bio *bio;
+	struct dm_rq_clone_bio_info *info;
+
+	while ((bio = clone->bio) != NULL) {
+		clone->bio = bio->bi_next;
+
+		info = bio->bi_private;
+		free_bio_info(md, info);
+
+		bio->bi_private = md->bs;
+		bio_put(bio);
+	}
+}
+
+/*
+ * Don't touch any member of the md after calling this function because
+ * the md may be freed in dm_put() at the end of this function.
+ * Or do dm_get() before calling this function and dm_put() later.
+ */
+static void rq_completed(struct mapped_device *md, int run_queue)
+{
+	int wakeup_waiters = 0;
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (!queue_in_flight(q))
+		wakeup_waiters = 1;
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	/* nudge anyone waiting on suspend queue */
+	if (wakeup_waiters)
+		wake_up(&md->wait);
+
+	if (run_queue)
+		blk_run_queue(q);
+
+	/*
+	 * dm_put() must be at the end of this function. See the comment above
+	 */
+	dm_put(md);
+}
+
+static void dm_unprep_request(struct request *rq)
+{
+	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
+
+	rq->special = NULL;
+	rq->cmd_flags &= ~REQ_DONTPREP;
+
+	free_bio_clone(md, clone);
+	free_rq_tio(md, tio);
+}
+
+/*
+ * Requeue the original request of a clone.
+ */
+void dm_requeue_unmapped_request(struct request *clone)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
+	struct request *rq = tio->orig;
+	struct request_queue *q = rq->q;
+	unsigned long flags;
+
+	dm_unprep_request(rq);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (elv_queue_empty(q))
+		blk_plug_device(q);
+	blk_requeue_request(q, rq);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	rq_completed(md, 0);
+}
+EXPORT_SYMBOL_GPL(dm_requeue_unmapped_request);
+
+static void __stop_queue(struct request_queue *q)
+{
+	blk_stop_queue(q);
+}
+
+static void stop_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	__stop_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void __start_queue(struct request_queue *q)
+{
+	if (blk_queue_stopped(q))
+		blk_start_queue(q);
+}
+
+static void start_queue(struct request_queue *q)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	__start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+/*
+ * Complete the clone and the original request.
+ * Must be called without queue lock.
+ */
+static void dm_end_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct mapped_device *md = tio->md;
+	struct request *rq = tio->orig;
+
+	if (blk_pc_request(rq)) {
+		rq->errors = clone->errors;
+		rq->resid_len = clone->resid_len;
+
+		if (rq->sense)
+			/*
+			 * We are using the sense buffer of the original
+			 * request.
+			 * So setting the length of the sense data is enough.
+			 */
+			rq->sense_len = clone->sense_len;
+	}
+
+	free_bio_clone(md, clone);
+	free_rq_tio(md, tio);
+
+	blk_end_request_all(rq, error);
+
+	rq_completed(md, 1);
+}
+
+/*
+ * Request completion handler for request-based dm
+ */
+static void dm_softirq_done(struct request *rq)
+{
+	struct request *clone = rq->completion_data;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
+	int error = tio->error;
+
+	if (!(rq->cmd_flags & REQ_FAILED) && rq_end_io)
+		error = rq_end_io(tio->ti, clone, error, &tio->info);
+
+	if (error <= 0)
+		/* The target wants to complete the I/O */
+		dm_end_request(clone, error);
+	else if (error == DM_ENDIO_INCOMPLETE)
+		/* The target will handle the I/O */
+		return;
+	else if (error == DM_ENDIO_REQUEUE)
+		/* The target wants to requeue the I/O */
+		dm_requeue_unmapped_request(clone);
+	else {
+		DMWARN("unimplemented target endio return value: %d", error);
+		BUG();
+	}
+}
+
+/*
+ * Complete the clone and the original request with the error status
+ * through softirq context.
+ */
+static void dm_complete_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+
+	tio->error = error;
+	rq->completion_data = clone;
+	blk_complete_request(rq);
+}
+
+/*
+ * Complete the not-mapped clone and the original request with the error status
+ * through softirq context.
+ * Target's rq_end_io() function isn't called.
+ * This may be used when the target's map_rq() function fails.
+ */
+void dm_kill_unmapped_request(struct request *clone, int error)
+{
+	struct dm_rq_target_io *tio = clone->end_io_data;
+	struct request *rq = tio->orig;
+
+	rq->cmd_flags |= REQ_FAILED;
+	dm_complete_request(clone, error);
+}
+EXPORT_SYMBOL_GPL(dm_kill_unmapped_request);
+
+/*
+ * Called with the queue lock held
+ */
+static void end_clone_request(struct request *clone, int error)
+{
+	/*
+	 * For just cleaning up the information of the queue in which
+	 * the clone was dispatched.
+	 * The clone is *NOT* freed actually here because it is alloced from
+	 * dm own mempool and REQ_ALLOCED isn't set in clone->cmd_flags.
+	 */
+	__blk_put_request(clone->q, clone);
+
+	/*
+	 * Actual request completion is done in a softirq context which doesn't
+	 * hold the queue lock.  Otherwise, deadlock could occur because:
+	 *     - another request may be submitted by the upper level driver
+	 *       of the stacking during the completion
+	 *     - the submission which requires queue lock may be done
+	 *       against this queue
+	 */
+	dm_complete_request(clone, error);
+}
+
 static sector_t max_io_len(struct mapped_device *md,
 			   sector_t sector, struct dm_target *ti)
 {
@@ -993,7 +1303,7 @@ out:
  * The request function that just remaps the bio built up by
  * dm_merge_bvec.
  */
-static int dm_request(struct request_queue *q, struct bio *bio)
+static int _dm_request(struct request_queue *q, struct bio *bio)
 {
 	int rw = bio_data_dir(bio);
 	struct mapped_device *md = q->queuedata;
@@ -1030,12 +1340,310 @@ static int dm_request(struct request_que
 	return 0;
 }
 
+static int dm_make_request(struct request_queue *q, struct bio *bio)
+{
+	struct mapped_device *md = q->queuedata;
+
+	if (unlikely(bio_barrier(bio))) {
+		bio_endio(bio, -EOPNOTSUPP);
+		return 0;
+	}
+
+	return md->saved_make_request_fn(q, bio); /* call __make_request() */
+}
+
+static int dm_request_based(struct mapped_device *md)
+{
+	return blk_queue_stackable(md->queue);
+}
+
+static int dm_request(struct request_queue *q, struct bio *bio)
+{
+	struct mapped_device *md = q->queuedata;
+
+	if (dm_request_based(md))
+		return dm_make_request(q, bio);
+
+	return _dm_request(q, bio);
+}
+
+void dm_dispatch_request(struct request *rq)
+{
+	int r;
+
+	if (blk_queue_io_stat(rq->q))
+		rq->cmd_flags |= REQ_IO_STAT;
+
+	rq->start_time = jiffies;
+	r = blk_insert_cloned_request(rq->q, rq);
+	if (r)
+		dm_complete_request(rq, r);
+}
+EXPORT_SYMBOL_GPL(dm_dispatch_request);
+
+static void copy_request_info(struct request *clone, struct request *rq)
+{
+	clone->cpu = rq->cpu;
+	clone->cmd_flags = (rq_data_dir(rq) | REQ_NOMERGE);
+	clone->cmd_type = rq->cmd_type;
+	clone->__sector = blk_rq_pos(rq);
+	clone->__data_len = blk_rq_bytes(rq);
+	clone->nr_phys_segments = rq->nr_phys_segments;
+	clone->ioprio = rq->ioprio;
+	clone->buffer = rq->buffer;
+	clone->cmd_len = rq->cmd_len;
+	if (rq->cmd_len)
+		clone->cmd = rq->cmd;
+	clone->extra_len = rq->extra_len;
+	clone->sense = rq->sense;
+}
+
+static int clone_request_bios(struct request *clone, struct request *rq,
+			      struct mapped_device *md)
+{
+	struct bio *bio, *clone_bio;
+	struct dm_rq_clone_bio_info *info;
+
+	for (bio = rq->bio; bio; bio = bio->bi_next) {
+		info = alloc_bio_info(md);
+		if (!info)
+			goto free_and_out;
+
+		clone_bio = bio_alloc_bioset(GFP_ATOMIC, bio->bi_max_vecs,
+					     md->bs);
+		if (!clone_bio) {
+			free_bio_info(md, info);
+			goto free_and_out;
+		}
+
+		__bio_clone(clone_bio, bio);
+
+		if (bio_integrity(bio) &&
+		    !bio_integrity_clone(clone_bio, bio, GFP_ATOMIC)) {
+			bio_free(clone_bio, md->bs);
+			free_bio_info(md, info);
+			goto free_and_out;
+		}
+
+		clone_bio->bi_destructor = dm_bio_destructor;
+		clone_bio->bi_end_io = end_clone_bio;
+		info->rq = clone;
+		info->orig = bio;
+		clone_bio->bi_private = info;
+
+		if (clone->bio) {
+			clone->biotail->bi_next = clone_bio;
+			clone->biotail = clone_bio;
+		} else
+			clone->bio = clone->biotail = clone_bio;
+	}
+
+	return 0;
+
+free_and_out:
+	free_bio_clone(md, clone);
+
+	return -ENOMEM;
+}
+
+static int setup_clone(struct request *clone, struct request *rq,
+		       struct dm_rq_target_io *tio)
+{
+	int r;
+
+	blk_rq_init(NULL, clone);
+
+	r = clone_request_bios(clone, rq, tio->md);
+	if (r)
+		return r;
+
+	copy_request_info(clone, rq);
+	clone->end_io = end_clone_request;
+	clone->end_io_data = tio;
+
+	return 0;
+}
+
+static int dm_rq_flush_suspending(struct mapped_device *md)
+{
+	return !md->suspend_rq.special;
+}
+
+/*
+ * Called with the queue lock held.
+ */
+static int dm_prep_fn(struct request_queue *q, struct request *rq)
+{
+	struct mapped_device *md = q->queuedata;
+	struct dm_rq_target_io *tio;
+	struct request *clone;
+
+	if (unlikely(rq == &md->suspend_rq)) {
+		if (dm_rq_flush_suspending(md))
+			return BLKPREP_OK;
+		else
+			/* The flush suspend was interrupted */
+			return BLKPREP_KILL;
+	}
+
+	if (unlikely(rq->special)) {
+		DMWARN("Already has something in rq->special.");
+		return BLKPREP_KILL;
+	}
+
+	tio = alloc_rq_tio(md); /* Only one for each original request */
+	if (!tio)
+		/* -ENOMEM */
+		return BLKPREP_DEFER;
+
+	tio->md = md;
+	tio->ti = NULL;
+	tio->orig = rq;
+	tio->error = 0;
+	memset(&tio->info, 0, sizeof(tio->info));
+
+	clone = &tio->clone;
+	if (setup_clone(clone, rq, tio)) {
+		/* -ENOMEM */
+		free_rq_tio(md, tio);
+		return BLKPREP_DEFER;
+	}
+
+	rq->special = clone;
+	rq->cmd_flags |= REQ_DONTPREP;
+
+	return BLKPREP_OK;
+}
+
+static void map_request(struct dm_target *ti, struct request *rq,
+			struct mapped_device *md)
+{
+	int r;
+	struct request *clone = rq->special;
+	struct dm_rq_target_io *tio = clone->end_io_data;
+
+	/*
+	 * Hold the md reference here for the in-flight I/O.
+	 * We can't rely on the reference count by device opener,
+	 * because the device may be closed during the request completion
+	 * when all bios are completed.
+	 * See the comment in rq_completed() too.
+	 */
+	dm_get(md);
+
+	tio->ti = ti;
+	r = ti->type->map_rq(ti, clone, &tio->info);
+	switch (r) {
+	case DM_MAPIO_SUBMITTED:
+		/* The target has taken the I/O to submit by itself later */
+		break;
+	case DM_MAPIO_REMAPPED:
+		/* The target has remapped the I/O so dispatch it */
+		dm_dispatch_request(clone);
+		break;
+	case DM_MAPIO_REQUEUE:
+		/* The target wants to requeue the I/O */
+		dm_requeue_unmapped_request(clone);
+		break;
+	default:
+		if (r > 0) {
+			DMWARN("unimplemented target map return value: %d", r);
+			BUG();
+		}
+
+		/* The target wants to complete the I/O */
+		dm_kill_unmapped_request(clone, r);
+		break;
+	}
+}
+
+/*
+ * q->request_fn for request-based dm.
+ * Called with the queue lock held.
+ */
+static void dm_request_fn(struct request_queue *q)
+{
+	struct mapped_device *md = q->queuedata;
+	struct dm_table *map = dm_get_table(md);
+	struct dm_target *ti;
+	struct request *rq;
+
+	/*
+	 * For noflush suspend, check blk_queue_stopped() to immediately
+	 * quit I/O dispatching.
+	 */
+	while (!blk_queue_plugged(q) && !blk_queue_stopped(q)) {
+		rq = blk_peek_request(q);
+		if (!rq)
+			goto plug_and_out;
+
+		if (unlikely(rq == &md->suspend_rq)) { /* Flush suspend marker */
+			if (queue_in_flight(q))
+				/* Not quiet yet.  Wait more */
+				goto plug_and_out;
+
+			/* This device should be quiet now */
+			__stop_queue(q);
+			blk_start_request(rq);
+			__blk_end_request_all(rq, 0);
+			wake_up(&md->wait);
+			goto out;
+		}
+
+		ti = dm_table_find_target(map, blk_rq_pos(rq));
+		if (ti->type->busy && ti->type->busy(ti))
+			goto plug_and_out;
+
+		blk_start_request(rq);
+		spin_unlock(q->queue_lock);
+		map_request(ti, rq, md);
+		spin_lock_irq(q->queue_lock);
+	}
+
+	goto out;
+
+plug_and_out:
+	if (!elv_queue_empty(q))
+		/* Some requests still remain, retry later */
+		blk_plug_device(q);
+
+out:
+	dm_table_put(map);
+
+	return;
+}
+
+int dm_underlying_device_busy(struct request_queue *q)
+{
+	return blk_lld_busy(q);
+}
+EXPORT_SYMBOL_GPL(dm_underlying_device_busy);
+
+static int dm_lld_busy(struct request_queue *q)
+{
+	int r;
+	struct mapped_device *md = q->queuedata;
+	struct dm_table *map = dm_get_table(md);
+
+	if (!map || test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))
+		r = 1;
+	else
+		r = dm_table_any_busy_target(map);
+
+	dm_table_put(map);
+
+	return r;
+}
+
 static void dm_unplug_all(struct request_queue *q)
 {
 	struct mapped_device *md = q->queuedata;
 	struct dm_table *map = dm_get_table(md);
 
 	if (map) {
+		if (dm_request_based(md))
+			generic_unplug_device(q);
+
 		dm_table_unplug_all(map);
 		dm_table_put(map);
 	}
@@ -1050,7 +1658,16 @@ static int dm_any_congested(void *conges
 	if (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
 		map = dm_get_table(md);
 		if (map) {
-			r = dm_table_any_congested(map, bdi_bits);
+			/*
+			 * Request-based dm cares about only own queue for
+			 * the query about congestion status of request_queue
+			 */
+			if (dm_request_based(md))
+				r = md->queue->backing_dev_info.state &
+				    bdi_bits;
+			else
+				r = dm_table_any_congested(map, bdi_bits);
+
 			dm_table_put(map);
 		}
 	}
@@ -1452,6 +2069,8 @@ static int dm_wait_for_completion(struct
 {
 	int r = 0;
 	DECLARE_WAITQUEUE(wait, current);
+	struct request_queue *q = md->queue;
+	unsigned long flags;
 
 	dm_unplug_all(md->queue);
 
@@ -1461,7 +2080,14 @@ static int dm_wait_for_completion(struct
 		set_current_state(interruptible);
 
 		smp_mb();
-		if (!atomic_read(&md->pending))
+		if (dm_request_based(md)) {
+			spin_lock_irqsave(q->queue_lock, flags);
+			if (!queue_in_flight(q) && blk_queue_stopped(q)) {
+				spin_unlock_irqrestore(q->queue_lock, flags);
+				break;
+			}
+			spin_unlock_irqrestore(q->queue_lock, flags);
+		} else if (!atomic_read(&md->pending))
 			break;
 
 		if (interruptible == TASK_INTERRUPTIBLE &&
@@ -1573,6 +2199,67 @@ out:
 	return r;
 }
 
+static void dm_rq_invalidate_suspend_marker(struct mapped_device *md)
+{
+	md->suspend_rq.special = (void *)0x1;
+}
+
+static void dm_rq_abort_suspend(struct mapped_device *md, int noflush)
+{
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (!noflush)
+		dm_rq_invalidate_suspend_marker(md);
+	__start_queue(q);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+}
+
+static void dm_rq_start_suspend(struct mapped_device *md, int noflush)
+{
+	struct request *rq = &md->suspend_rq;
+	struct request_queue *q = md->queue;
+
+	if (noflush)
+		stop_queue(q);
+	else {
+		blk_rq_init(q, rq);
+		blk_insert_request(q, rq, 0, NULL);
+	}
+}
+
+static int dm_rq_suspend_available(struct mapped_device *md, int noflush)
+{
+	int r = 1;
+	struct request *rq = &md->suspend_rq;
+	struct request_queue *q = md->queue;
+	unsigned long flags;
+
+	if (noflush)
+		return r;
+
+	/* The marker must be protected by queue lock if it is in use */
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (unlikely(rq->ref_count)) {
+		/*
+		 * This can happen, when the previous flush suspend was
+		 * interrupted, the marker is still in the queue and
+		 * this flush suspend has been invoked, because we don't
+		 * remove the marker at the time of suspend interruption.
+		 * We have only one marker per mapped_device, so we can't
+		 * start another flush suspend while it is in use.
+		 */
+		BUG_ON(!rq->special); /* The marker should be invalidated */
+		DMWARN("Invalidating the previous flush suspend is still in"
+		       " progress.  Please retry later.");
+		r = 0;
+	}
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	return r;
+}
+
 /*
  * Functions to lock and unlock any filesystem running on the
  * device.
@@ -1612,6 +2299,53 @@ static void unlock_fs(struct mapped_devi
  * dm_bind_table, dm_suspend must be called to flush any in
  * flight bios and ensure that any further io gets deferred.
  */
+/*
+ * Suspend mechanism in request-based dm.
+ *
+ * After the suspend starts, further incoming requests are kept in
+ * the request_queue and deferred.
+ * Remaining requests in the request_queue at the start of suspend are flushed
+ * if it is flush suspend.
+ * The suspend completes when the following conditions have been satisfied,
+ * so wait for it:
+ *    1. q->in_flight is 0 (which means no in_flight request)
+ *    2. queue has been stopped (which means no request dispatching)
+ *
+ *
+ * Noflush suspend
+ * ---------------
+ * Noflush suspend doesn't need to dispatch remaining requests.
+ * So stop the queue immediately.  Then, wait for all in_flight requests
+ * to be completed or requeued.
+ *
+ * To abort noflush suspend, start the queue.
+ *
+ *
+ * Flush suspend
+ * -------------
+ * Flush suspend needs to dispatch remaining requests.  So stop the queue
+ * after the remaining requests are completed. (Requeued request must be also
+ * re-dispatched and completed.  Until then, we can't stop the queue.)
+ *
+ * During flushing the remaining requests, further incoming requests are also
+ * inserted to the same queue.  To distinguish which requests are to be
+ * flushed, we insert a marker request to the queue at the time of starting
+ * flush suspend, like a barrier.
+ * The dispatching is blocked when the marker is found on the top of the queue.
+ * And the queue is stopped when all in_flight requests are completed, since
+ * that means the remaining requests are completely flushed.
+ * Then, the marker is removed from the queue.
+ *
+ * To abort flush suspend, we also need to take care of the marker, not only
+ * starting the queue.
+ * We don't remove the marker forcibly from the queue since it's against
+ * the block-layer manner.  Instead, we put an invalidated mark on the marker.
+ * When the invalidated marker is found on the top of the queue, it is
+ * immediately removed from the queue, so it doesn't block dispatching.
+ * Because we have only one marker per mapped_device, we can't start another
+ * flush suspend until the invalidated marker is removed from the queue.
+ * So fail and return with -EBUSY in such a case.
+ */
 int dm_suspend(struct mapped_device *md, unsigned suspend_flags)
 {
 	struct dm_table *map = NULL;
@@ -1626,6 +2360,11 @@ int dm_suspend(struct mapped_device *md,
 		goto out_unlock;
 	}
 
+	if (dm_request_based(md) && !dm_rq_suspend_available(md, noflush)) {
+		r = -EBUSY;
+		goto out_unlock;
+	}
+
 	map = dm_get_table(md);
 
 	/*
@@ -1671,6 +2410,9 @@ int dm_suspend(struct mapped_device *md,
 
 	flush_workqueue(md->wq);
 
+	if (dm_request_based(md))
+		dm_rq_start_suspend(md, noflush);
+
 	/*
 	 * At this point no more requests are entering target request routines.
 	 * We call dm_wait_for_completion to wait for all existing requests
@@ -1687,6 +2429,9 @@ int dm_suspend(struct mapped_device *md,
 	if (r < 0) {
 		dm_queue_flush(md);
 
+		if (dm_request_based(md))
+			dm_rq_abort_suspend(md, noflush);
+
 		unlock_fs(md);
 		goto out; /* pushback list is already flushed, so skip flush */
 	}
@@ -1728,6 +2473,14 @@ int dm_resume(struct mapped_device *md)
 
 	dm_queue_flush(md);
 
+	/*
+	 * Flushing deferred I/Os must be done after targets are resumed
+	 * so that mapping of targets can work correctly.
+	 * Request-based dm is queueing the deferred I/Os in its request_queue.
+	 */
+	if (dm_request_based(md))
+		start_queue(md->queue);
+
 	unlock_fs(md);
 
 	clear_bit(DMF_SUSPENDED, &md->flags);
Index: linux-2.6-block/drivers/md/dm.h
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.h
+++ linux-2.6-block/drivers/md/dm.h
@@ -47,6 +47,7 @@ void dm_table_presuspend_targets(struct 
 void dm_table_postsuspend_targets(struct dm_table *t);
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
+int dm_table_any_busy_target(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().
Index: linux-2.6-block/drivers/md/dm-table.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm-table.c
+++ linux-2.6-block/drivers/md/dm-table.c
@@ -1067,6 +1067,20 @@ int dm_table_any_congested(struct dm_tab
 	return r;
 }
 
+int dm_table_any_busy_target(struct dm_table *t)
+{
+	unsigned i;
+	struct dm_target *ti;
+
+	for (i = 0; i < t->num_targets; i++) {
+		ti = t->targets + i;
+		if (ti->type->busy && ti->type->busy(ti))
+			return 1;
+	}
+
+	return 0;
+}
+
 void dm_table_unplug_all(struct dm_table *t)
 {
 	struct dm_dev_internal *dd;
Index: linux-2.6-block/include/linux/device-mapper.h
===================================================================
--- linux-2.6-block.orig/include/linux/device-mapper.h
+++ linux-2.6-block/include/linux/device-mapper.h
@@ -229,6 +229,7 @@ struct gendisk *dm_disk(struct mapped_de
 int dm_suspended(struct mapped_device *md);
 int dm_noflush_suspending(struct dm_target *ti);
 union map_info *dm_get_mapinfo(struct bio *bio);
+union map_info *dm_get_rq_mapinfo(struct request *rq);
 
 /*
  * Geometry functions.
@@ -391,4 +392,12 @@ static inline unsigned long to_bytes(sec
 	return (n << SECTOR_SHIFT);
 }
 
+/*-----------------------------------------------------------------
+ * Helper for block layer and dm core operations
+ *---------------------------------------------------------------*/
+void dm_dispatch_request(struct request *rq);
+void dm_requeue_unmapped_request(struct request *rq);
+void dm_kill_unmapped_request(struct request *rq, int error);
+int dm_underlying_device_busy(struct request_queue *q);
+
 #endif	/* _LINUX_DEVICE_MAPPER_H */


* [PATCH 2/5] dm core: enable request-based dm
  2009-06-02  6:57 [PATCH 0/5] request-based dm-multipath (v3) Kiyoshi Ueda
  2009-06-02  6:59 ` [PATCH 1/5] dm core: add core functions for request-based dm Kiyoshi Ueda
@ 2009-06-02  7:01 ` Kiyoshi Ueda
  2009-06-02  7:01 ` [PATCH 3/5] dm core: don't set QUEUE_ORDERED_DRAIN for " Kiyoshi Ueda
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-06-02  7:01 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

This patch enables request-based dm.

o Request-based dm and bio-based dm coexist, since there are
  some target drivers which are better suited to bio-based dm.
  Also, there are other bio-based devices in the kernel
  (e.g. md, loop).
  Since a bio-based device can't receive a struct request,
  there are some limitations on device stacking between
  bio-based and request-based devices.

                     type of underlying device
                   bio-based      request-based
   ----------------------------------------------
    bio-based         OK                OK
    request-based     NG                OK

  The device type is recognized by the queue flag in the kernel,
  so dm follows that.

o The type of a dm device is decided at the first table binding time.
  Once the type of a dm device is decided, the type can't be changed.

o Mempool allocations are deferred to table loading time, since
  the mempools for request-based dm differ from those for bio-based dm,
  and the required mempool type is determined by the type of the table.

o Currently, request-based dm supports only tables that have a single
  target.  To support multiple targets, we need to support request
  splitting or prevent a bio/request from spanning multiple targets.
  The former needs lots of changes in the block layer, and the latter
  requires that all target drivers support a merge() function.
  Both will take time.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-ioctl.c |   13 ++++
 drivers/md/dm-table.c |  111 ++++++++++++++++++++++++++++++++++
 drivers/md/dm.c       |  162 +++++++++++++++++++++++++++++++++++++++++---------
 drivers/md/dm.h       |   25 +++++++
 4 files changed, 285 insertions(+), 26 deletions(-)

Index: linux-2.6-block/drivers/md/dm-table.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm-table.c
+++ linux-2.6-block/drivers/md/dm-table.c
@@ -41,6 +41,7 @@
 struct dm_table {
 	struct mapped_device *md;
 	atomic_t holders;
+	unsigned type;
 
 	/* btree table */
 	unsigned int depth;
@@ -71,6 +72,8 @@ struct dm_table {
 	/* events get handed up using this callback */
 	void (*event_fn)(void *);
 	void *event_context;
+
+	struct dm_md_mempools *mempools;
 };
 
 /*
@@ -264,6 +267,8 @@ void dm_table_destroy(struct dm_table *t
 	if (t->devices.next != &t->devices)
 		free_devices(&t->devices);
 
+	dm_free_md_mempools(t->mempools);
+
 	kfree(t);
 }
 
@@ -794,6 +799,99 @@ int dm_table_add_target(struct dm_table 
 	return r;
 }
 
+int dm_table_set_type(struct dm_table *t)
+{
+	unsigned i;
+	unsigned bio_based = 0, request_based = 0;
+	struct dm_target *tgt;
+	struct dm_dev_internal *dd;
+	struct list_head *devices;
+
+	for (i = 0; i < t->num_targets; i++) {
+		tgt = t->targets + i;
+		if (dm_target_request_based(tgt))
+			request_based = 1;
+		else
+			bio_based = 1;
+
+		if (bio_based && request_based) {
+			DMWARN("Inconsistent table: different target types"
+			       " can't be mixed up");
+			return -EINVAL;
+		}
+	}
+
+	if (bio_based) {
+		/* We must use this table as bio-based */
+		t->type = DM_TYPE_BIO_BASED;
+		return 0;
+	}
+
+	BUG_ON(!request_based); /* No targets in this table */
+
+	/* Non-request-stackable devices can't be used for request-based dm */
+	devices = dm_table_get_devices(t);
+	list_for_each_entry(dd, devices, list) {
+		if (!blk_queue_stackable(bdev_get_queue(dd->dm_dev.bdev))) {
+			DMWARN("table load rejected: including"
+			       " non-request-stackable devices");
+			return -EINVAL;
+		}
+	}
+
+	/*
+	 * Request-based dm supports only tables that have a single target now.
+	 * To support multiple targets, request splitting support is needed,
+	 * and that needs lots of changes in the block-layer.
+	 * (e.g. request completion process for partial completion.)
+	 */
+	if (t->num_targets > 1) {
+		DMWARN("Request-based dm doesn't support multiple targets yet");
+		return -EINVAL;
+	}
+
+	t->type = DM_TYPE_REQUEST_BASED;
+
+	return 0;
+}
+
+unsigned dm_table_get_type(struct dm_table *t)
+{
+	return t->type;
+}
+
+bool dm_table_request_based(struct dm_table *t)
+{
+	return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
+}
+
+int dm_table_alloc_md_mempools(struct dm_table *t)
+{
+	unsigned type = dm_table_get_type(t);
+
+	if (unlikely(type == DM_TYPE_NONE)) {
+		DMWARN("no table type is set, can't allocate mempools");
+		return -EINVAL;
+	}
+
+	t->mempools = dm_alloc_md_mempools(type);
+	if (!t->mempools)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void dm_table_free_md_mempools(struct dm_table *t)
+{
+	dm_free_md_mempools(t->mempools);
+	t->mempools = NULL;
+}
+
+struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t)
+{
+	return t->mempools;
+}
+
 static int setup_indexes(struct dm_table *t)
 {
 	int i;
@@ -972,6 +1070,19 @@ void dm_table_set_restrictions(struct dm
 		queue_flag_set_unlocked(QUEUE_FLAG_CLUSTER, q);
 
 	dm_table_set_integrity(t);
+
+	/*
+	 * QUEUE_FLAG_STACKABLE must be set after all queue settings are
+	 * visible to other CPUs because, once the flag is set, incoming bios
+	 * are processed by request-based dm, which refers to the queue
+	 * settings.
+	 * Until the flag set, bios are passed to bio-based dm and queued to
+	 * md->deferred where queue settings are not needed yet.
+	 * Those bios are passed to request-based dm at the resume time.
+	 */
+	smp_mb();
+	if (dm_table_request_based(t))
+		queue_flag_set_unlocked(QUEUE_FLAG_STACKABLE, q);
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
Index: linux-2.6-block/drivers/md/dm.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.c
+++ linux-2.6-block/drivers/md/dm.c
@@ -185,6 +185,15 @@ struct mapped_device {
 	struct bio barrier_bio;
 };
 
+/*
+ * For mempools pre-allocation at the table loading time.
+ */
+struct dm_md_mempools {
+	mempool_t *io_pool;
+	mempool_t *tio_pool;
+	struct bio_set *bs;
+};
+
 #define MIN_IOS 256
 static struct kmem_cache *_io_cache;
 static struct kmem_cache *_tio_cache;
@@ -1790,10 +1799,22 @@ static struct mapped_device *alloc_dev(i
 	INIT_LIST_HEAD(&md->uevent_list);
 	spin_lock_init(&md->uevent_lock);
 
-	md->queue = blk_alloc_queue(GFP_KERNEL);
+	md->queue = blk_init_queue(dm_request_fn, NULL);
 	if (!md->queue)
 		goto bad_queue;
 
+	/*
+	 * Request-based dm devices cannot be stacked on top of bio-based dm
+	 * devices.  The type of this dm device has not been decided yet,
+	 * although we initialized the queue using blk_init_queue().
+	 * The type is decided at the first table loading time.
+	 * To prevent problematic device stacking, clear the queue flag
+	 * for request stacking support until then.
+	 *
+	 * This queue is new, so no concurrency on the queue_flags.
+	 */
+	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
+	md->saved_make_request_fn = md->queue->make_request_fn;
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
@@ -1802,18 +1823,9 @@ static struct mapped_device *alloc_dev(i
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
-
-	md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache);
-	if (!md->io_pool)
-		goto bad_io_pool;
-
-	md->tio_pool = mempool_create_slab_pool(MIN_IOS, _tio_cache);
-	if (!md->tio_pool)
-		goto bad_tio_pool;
-
-	md->bs = bioset_create(16, 0);
-	if (!md->bs)
-		goto bad_no_bioset;
+	blk_queue_softirq_done(md->queue, dm_softirq_done);
+	blk_queue_prep_rq(md->queue, dm_prep_fn);
+	blk_queue_lld_busy(md->queue, dm_lld_busy);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
@@ -1855,12 +1867,6 @@ bad_bdev:
 bad_thread:
 	put_disk(md->disk);
 bad_disk:
-	bioset_free(md->bs);
-bad_no_bioset:
-	mempool_destroy(md->tio_pool);
-bad_tio_pool:
-	mempool_destroy(md->io_pool);
-bad_io_pool:
 	blk_cleanup_queue(md->queue);
 bad_queue:
 	free_minor(minor);
@@ -1880,9 +1886,12 @@ static void free_dev(struct mapped_devic
 	unlock_fs(md);
 	bdput(md->bdev);
 	destroy_workqueue(md->wq);
-	mempool_destroy(md->tio_pool);
-	mempool_destroy(md->io_pool);
-	bioset_free(md->bs);
+	if (md->tio_pool)
+		mempool_destroy(md->tio_pool);
+	if (md->io_pool)
+		mempool_destroy(md->io_pool);
+	if (md->bs)
+		bioset_free(md->bs);
 	blk_integrity_unregister(md->disk);
 	del_gendisk(md->disk);
 	free_minor(minor);
@@ -1897,6 +1906,29 @@ static void free_dev(struct mapped_devic
 	kfree(md);
 }
 
+static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
+{
+	struct dm_md_mempools *p;
+
+	if (md->io_pool && md->tio_pool && md->bs)
+		/* the md already has necessary mempools */
+		goto out;
+
+	p = dm_table_get_md_mempools(t);
+	BUG_ON(!p || md->io_pool || md->tio_pool || md->bs);
+
+	md->io_pool = p->io_pool;
+	p->io_pool = NULL;
+	md->tio_pool = p->tio_pool;
+	p->tio_pool = NULL;
+	md->bs = p->bs;
+	p->bs = NULL;
+
+out:
+	/* mempool bind completed, now no need any mempools in the table */
+	dm_table_free_md_mempools(t);
+}
+
 /*
  * Bind a table to the device.
  */
@@ -1947,6 +1979,18 @@ static int __bind(struct mapped_device *
 
 	dm_table_event_callback(t, event_callback, md);
 
+	/*
+	 * The queue hasn't been stopped yet, if the old table type wasn't
+	 * for request-based during suspension.  So stop it to prevent
+	 * I/O mapping before resume.
+	 * This must be done before setting the queue restrictions,
+	 * because request-based dm may be run just after the setting.
+	 */
+	if (dm_table_request_based(t) && !blk_queue_stopped(q))
+		stop_queue(q);
+
+	__bind_mempools(md, t);
+
 	write_lock(&md->map_lock);
 	md->map = t;
 	dm_table_set_restrictions(t, q);
@@ -2160,10 +2204,14 @@ static void dm_wq_work(struct work_struc
 
 		up_write(&md->io_lock);
 
-		if (bio_barrier(c))
-			process_barrier(md, c);
-		else
-			__split_and_process_bio(md, c);
+		if (dm_request_based(md))
+			generic_make_request(c);
+		else {
+			if (bio_barrier(c))
+				process_barrier(md, c);
+			else
+				__split_and_process_bio(md, c);
+		}
 
 		down_write(&md->io_lock);
 	}
@@ -2191,6 +2239,13 @@ int dm_swap_table(struct mapped_device *
 	if (!dm_suspended(md))
 		goto out;
 
+	/* cannot change the device type, once a table is bound */
+	if (md->map &&
+	    (dm_table_get_type(md->map) != dm_table_get_type(table))) {
+		DMWARN("can't change the device type after a table is bound");
+		goto out;
+	}
+
 	__unbind(md);
 	r = __bind(md, table);
 
@@ -2577,6 +2632,61 @@ int dm_noflush_suspending(struct dm_targ
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type)
+{
+	struct dm_md_mempools *pools = kmalloc(sizeof(*pools), GFP_KERNEL);
+
+	if (!pools)
+		return NULL;
+
+	pools->io_pool = (type == DM_TYPE_BIO_BASED) ?
+			 mempool_create_slab_pool(MIN_IOS, _io_cache) :
+			 mempool_create_slab_pool(MIN_IOS, _rq_bio_info_cache);
+	if (!pools->io_pool)
+		goto free_pools_and_out;
+
+	pools->tio_pool = (type == DM_TYPE_BIO_BASED) ?
+			  mempool_create_slab_pool(MIN_IOS, _tio_cache) :
+			  mempool_create_slab_pool(MIN_IOS, _rq_tio_cache);
+	if (!pools->tio_pool)
+		goto free_io_pool_and_out;
+
+	pools->bs = (type == DM_TYPE_BIO_BASED) ?
+		    bioset_create(16, 0) : bioset_create(MIN_IOS, 0);
+	if (!pools->bs)
+		goto free_tio_pool_and_out;
+
+	return pools;
+
+free_tio_pool_and_out:
+	mempool_destroy(pools->tio_pool);
+
+free_io_pool_and_out:
+	mempool_destroy(pools->io_pool);
+
+free_pools_and_out:
+	kfree(pools);
+
+	return NULL;
+}
+
+void dm_free_md_mempools(struct dm_md_mempools *pools)
+{
+	if (!pools)
+		return;
+
+	if (pools->io_pool)
+		mempool_destroy(pools->io_pool);
+
+	if (pools->tio_pool)
+		mempool_destroy(pools->tio_pool);
+
+	if (pools->bs)
+		bioset_free(pools->bs);
+
+	kfree(pools);
+}
+
 static struct block_device_operations dm_blk_dops = {
 	.open = dm_blk_open,
 	.release = dm_blk_close,
Index: linux-2.6-block/drivers/md/dm.h
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.h
+++ linux-2.6-block/drivers/md/dm.h
@@ -23,6 +23,13 @@
 #define DM_SUSPEND_NOFLUSH_FLAG		(1 << 1)
 
 /*
+ * Type of table and mapped_device's mempool
+ */
+#define DM_TYPE_NONE		0
+#define DM_TYPE_BIO_BASED	1
+#define DM_TYPE_REQUEST_BASED	2
+
+/*
  * List of devices that a metadevice uses and should open/close.
  */
 struct dm_dev_internal {
@@ -32,6 +39,7 @@ struct dm_dev_internal {
 };
 
 struct dm_table;
+struct dm_md_mempools;
 
 /*-----------------------------------------------------------------
  * Internal table functions.
@@ -48,12 +56,23 @@ void dm_table_postsuspend_targets(struct
 int dm_table_resume_targets(struct dm_table *t);
 int dm_table_any_congested(struct dm_table *t, int bdi_bits);
 int dm_table_any_busy_target(struct dm_table *t);
+int dm_table_set_type(struct dm_table *t);
+unsigned dm_table_get_type(struct dm_table *t);
+bool dm_table_request_based(struct dm_table *t);
+int dm_table_alloc_md_mempools(struct dm_table *t);
+void dm_table_free_md_mempools(struct dm_table *t);
+struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().
  */
 #define dm_target_is_valid(t) ((t)->table)
 
+/*
+ * To check whether the target type is request-based or not (bio-based).
+ */
+#define dm_target_request_based(t) ((t)->type->map_rq != NULL)
+
 /*-----------------------------------------------------------------
  * A registry of target types.
  *---------------------------------------------------------------*/
@@ -98,4 +117,10 @@ void dm_kobject_uevent(struct mapped_dev
 int dm_kcopyd_init(void);
 void dm_kcopyd_exit(void);
 
+/*
+ * Mempool operations
+ */
+struct dm_md_mempools *dm_alloc_md_mempools(unsigned type);
+void dm_free_md_mempools(struct dm_md_mempools *pools);
+
 #endif
Index: linux-2.6-block/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm-ioctl.c
+++ linux-2.6-block/drivers/md/dm-ioctl.c
@@ -1044,6 +1044,12 @@ static int populate_table(struct dm_tabl
 		next = spec->next;
 	}
 
+	r = dm_table_set_type(table);
+	if (r) {
+		DMWARN("unable to set table type");
+		return r;
+	}
+
 	return dm_table_complete(table);
 }
 
@@ -1089,6 +1095,13 @@ static int table_load(struct dm_ioctl *p
 		goto out;
 	}
 
+	r = dm_table_alloc_md_mempools(t);
+	if (r) {
+		DMWARN("unable to allocate mempools for this table");
+		dm_table_destroy(t);
+		goto out;
+	}
+
 	down_write(&_hash_lock);
 	hc = dm_get_mdptr(md);
 	if (!hc || hc->md != md) {


* [PATCH 3/5] dm core: don't set QUEUE_ORDERED_DRAIN for request-based dm
  2009-06-02  6:57 [PATCH 0/5] request-based dm-multipath (v3) Kiyoshi Ueda
  2009-06-02  6:59 ` [PATCH 1/5] dm core: add core functions for request-based dm Kiyoshi Ueda
  2009-06-02  7:01 ` [PATCH 2/5] dm core: enable " Kiyoshi Ueda
@ 2009-06-02  7:01 ` Kiyoshi Ueda
  2009-06-02  7:02 ` [PATCH 4/5] dm core: disable interrupt when taking map_lock Kiyoshi Ueda
  2009-06-02  7:03 ` [PATCH 5/5] dm-mpath: convert to request-based Kiyoshi Ueda
  4 siblings, 0 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-06-02  7:01 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

Request-based dm doesn't have barrier support yet, so QUEUE_ORDERED_DRAIN
must be set only for bio-based dm.
Since the device type is decided at the first table load, setting
the flag is deferred until then.
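
As a rough illustration of that one-time deferral, here is a minimal
userspace model (the names and types below are illustrative stand-ins,
not the dm code itself):

/*
 * Minimal model: only the very first bio-based table bind sets the
 * ordered-drain flag; a device whose first table is request-based
 * never gets it.
 */
#include <stdbool.h>
#include <stdio.h>

enum table_type { TYPE_NONE, TYPE_BIO_BASED, TYPE_REQUEST_BASED };

struct md_model {
        enum table_type bound;  /* TYPE_NONE until a table is bound */
        bool ordered_drain;     /* models QUEUE_ORDERED_DRAIN */
};

static void swap_table(struct md_model *md, enum table_type t)
{
        if (md->bound == TYPE_NONE && t == TYPE_BIO_BASED)
                md->ordered_drain = true;
        md->bound = t;
}

int main(void)
{
        struct md_model a = { TYPE_NONE, false };
        struct md_model b = { TYPE_NONE, false };

        swap_table(&a, TYPE_BIO_BASED);         /* flag set */
        swap_table(&b, TYPE_REQUEST_BASED);     /* flag stays clear */
        printf("bio-based first: drain=%d, rq-based first: drain=%d\n",
               a.ordered_drain, b.ordered_drain);
        return 0;
}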


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-table.c |    5 +++++
 drivers/md/dm.c       |   11 ++++++++++-
 drivers/md/dm.h       |    1 +
 3 files changed, 16 insertions(+), 1 deletion(-)

Index: linux-2.6-block/drivers/md/dm-table.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm-table.c
+++ linux-2.6-block/drivers/md/dm-table.c
@@ -860,6 +860,11 @@ unsigned dm_table_get_type(struct dm_tab
 	return t->type;
 }
 
+bool dm_table_bio_based(struct dm_table *t)
+{
+	return dm_table_get_type(t) == DM_TYPE_BIO_BASED;
+}
+
 bool dm_table_request_based(struct dm_table *t)
 {
 	return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
Index: linux-2.6-block/drivers/md/dm.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.c
+++ linux-2.6-block/drivers/md/dm.c
@@ -1819,7 +1819,6 @@ static struct mapped_device *alloc_dev(i
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
 	blk_queue_make_request(md->queue, dm_request);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN, NULL);
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
@@ -2246,6 +2245,16 @@ int dm_swap_table(struct mapped_device *
 		goto out;
 	}
 
+	/*
+	 * It is enough that blk_queue_ordered() is called only once when
+	 * the first bio-based table is bound.
+	 *
+	 * This setting should be moved to alloc_dev() when request-based dm
+	 * supports barrier.
+	 */
+	if (!md->map && dm_table_bio_based(table))
+		blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN, NULL);
+
 	__unbind(md);
 	r = __bind(md, table);
 
Index: linux-2.6-block/drivers/md/dm.h
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.h
+++ linux-2.6-block/drivers/md/dm.h
@@ -58,6 +58,7 @@ int dm_table_any_congested(struct dm_tab
 int dm_table_any_busy_target(struct dm_table *t);
 int dm_table_set_type(struct dm_table *t);
 unsigned dm_table_get_type(struct dm_table *t);
+bool dm_table_bio_based(struct dm_table *t);
 bool dm_table_request_based(struct dm_table *t);
 int dm_table_alloc_md_mempools(struct dm_table *t);
 void dm_table_free_md_mempools(struct dm_table *t);


* [PATCH 4/5] dm core: disable interrupt when taking map_lock
  2009-06-02  6:57 [PATCH 0/5] request-based dm-multipath (v3) Kiyoshi Ueda
                   ` (2 preceding siblings ...)
  2009-06-02  7:01 ` [PATCH 3/5] dm core: don't set QUEUE_ORDERED_DRAIN for " Kiyoshi Ueda
@ 2009-06-02  7:02 ` Kiyoshi Ueda
  2009-06-02  7:03 ` [PATCH 5/5] dm-mpath: convert to request-based Kiyoshi Ueda
  4 siblings, 0 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-06-02  7:02 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

This patch disables interrupts when taking map_lock, to prevent
needless lockdep warnings in request-based dm.

Request-based dm takes map_lock after taking queue_lock with
interrupts disabled:
  spin_lock_irqsave(queue_lock)
  q->request_fn() == dm_request_fn()
    => dm_get_table()
         => read_lock(map_lock)
while queue_lock could be taken in interrupt context.

So lockdep warns that a deadlock can happen:
  write_lock(map_lock)
  <interrupt>
  spin_lock_irqsave(queue_lock)
  q->request_fn() == dm_request_fn()
    => dm_get_table()
         => read_lock(map_lock)

Currently there is no code path in request-based dm where
q->request_fn() is called from interrupt context, so this deadlock
cannot actually happen.
But the warning messages confuse users, so prevent them by disabling
interrupts when taking map_lock.


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Index: linux-2.6-block/drivers/md/dm.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm.c
+++ linux-2.6-block/drivers/md/dm.c
@@ -508,12 +508,13 @@ static void queue_io(struct mapped_devic
 struct dm_table *dm_get_table(struct mapped_device *md)
 {
 	struct dm_table *t;
+	unsigned long flags;
 
-	read_lock(&md->map_lock);
+	read_lock_irqsave(&md->map_lock, flags);
 	t = md->map;
 	if (t)
 		dm_table_get(t);
-	read_unlock(&md->map_lock);
+	read_unlock_irqrestore(&md->map_lock, flags);
 
 	return t;
 }
@@ -1960,6 +1961,7 @@ static int __bind(struct mapped_device *
 {
 	struct request_queue *q = md->queue;
 	sector_t size;
+	unsigned long flags;
 
 	size = dm_table_get_size(t);
 
@@ -1990,10 +1992,10 @@ static int __bind(struct mapped_device *
 
 	__bind_mempools(md, t);
 
-	write_lock(&md->map_lock);
+	write_lock_irqsave(&md->map_lock, flags);
 	md->map = t;
 	dm_table_set_restrictions(t, q);
-	write_unlock(&md->map_lock);
+	write_unlock_irqrestore(&md->map_lock, flags);
 
 	return 0;
 }
@@ -2001,14 +2003,15 @@ static int __bind(struct mapped_device *
 static void __unbind(struct mapped_device *md)
 {
 	struct dm_table *map = md->map;
+	unsigned long flags;
 
 	if (!map)
 		return;
 
 	dm_table_event_callback(map, NULL, NULL);
-	write_lock(&md->map_lock);
+	write_lock_irqsave(&md->map_lock, flags);
 	md->map = NULL;
-	write_unlock(&md->map_lock);
+	write_unlock_irqrestore(&md->map_lock, flags);
 	dm_table_destroy(map);
 }
 


* [PATCH 5/5] dm-mpath: convert to request-based
  2009-06-02  6:57 [PATCH 0/5] request-based dm-multipath (v3) Kiyoshi Ueda
                   ` (3 preceding siblings ...)
  2009-06-02  7:02 ` [PATCH 4/5] dm core: disable interrupt when taking map_lock Kiyoshi Ueda
@ 2009-06-02  7:03 ` Kiyoshi Ueda
  2009-08-27 17:54   ` Alasdair G Kergon
  4 siblings, 1 reply; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-06-02  7:03 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

This patch converts dm-multipath target to request-based from bio-based.

Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism, which is described in detail below.

I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O completion, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.

In bio-based do_end_io(), the clone bio has all the memory needed
for resubmission, so the target driver can queue it and resubmit
it later without any memory allocation.
That mechanism has almost no overhead.

On the other hand, in request-based do_end_io(), the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is.  Resubmitting the clone request would require allocating
new clone bios, which adds overhead.
To avoid that overhead just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, it asks dm core to queue and remap the original request of
the clone, since the only cost of queueing there is freeing the
memory of the clone request.

As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request.  So dm_bio_details in dm_mpath_io is removed.
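
As a rough illustration, the new completion decision can be modelled
in plain userspace C as below (the flags are simplified stand-ins for
the multipath state and the -EOPNOTSUPP special case is omitted; this
is a sketch, not the kernel code):

#include <errno.h>
#include <stdio.h>

#define DM_ENDIO_REQUEUE 2      /* stand-in for the dm constant: dm core requeues */

struct mpath_state {
        int nr_valid_paths;
        int queue_if_no_path;
        int must_push_back;     /* stands in for __must_push_back() */
};

/* Decide what to do when a clone request completes with an error. */
static int end_io_decision(const struct mpath_state *m, int error)
{
        if (!error)
                return 0;               /* I/O complete */

        if (!m->nr_valid_paths && !m->queue_if_no_path && !m->must_push_back)
                return -EIO;            /* nothing left to try: fail the I/O */

        return DM_ENDIO_REQUEUE;        /* dm core remaps and resubmits the original */
}

int main(void)
{
        struct mpath_state m = { 0, 0, 0 };

        printf("no paths, no queueing -> %d\n", end_io_decision(&m, -EIO));
        m.queue_if_no_path = 1;
        printf("queue_if_no_path set  -> %d\n", end_io_decision(&m, -EIO));
        return 0;
}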


multipath_busy()
----------------
The target driver returns "busy" only when both of the following hold:
  o the target driver would map I/Os if its map() function were
    called now,
  and
  o those mapped I/Os would then wait on the underlying device's
    queue because it is congested.

In all other cases, the target driver doesn't return "busy".
Otherwise, dm core would hold the I/Os back and the target driver
couldn't do what it wants
(e.g. the target driver can't map I/Os now, so it wants to kill them).
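
As a rough illustration, that decision can be modelled as below
(path_busy() stands in for the underlying-queue congestion check used
in the patch; everything here is a simplified userspace sketch, not
the kernel code):

#include <stdbool.h>
#include <stdio.h>

struct path_model {
        bool is_active;
        bool underlying_busy;
};

static bool path_busy(const struct path_model *p)
{
        return p->underlying_busy;
}

/*
 * Report "busy" only if the priority group we expect to use has at
 * least one active path and every active path is congested underneath.
 */
static bool pg_busy(const struct path_model *paths, int n)
{
        bool busy = true, has_active = false;
        int i;

        for (i = 0; i < n; i++) {
                if (!paths[i].is_active)
                        continue;
                has_active = true;
                if (!path_busy(&paths[i])) {
                        busy = false;   /* the selector can still pick this path */
                        break;
                }
        }

        return has_active ? busy : false;       /* no active path: just try mapping */
}

int main(void)
{
        struct path_model pg[2] = { { true, true }, { true, false } };

        printf("one idle active path       -> busy=%d\n", pg_busy(pg, 2));
        pg[1].underlying_busy = true;
        printf("all active paths congested -> busy=%d\n", pg_busy(pg, 2));
        return 0;
}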


Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm-mpath.c |  193 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 128 insertions(+), 65 deletions(-)

Index: linux-2.6-block/drivers/md/dm-mpath.c
===================================================================
--- linux-2.6-block.orig/drivers/md/dm-mpath.c
+++ linux-2.6-block/drivers/md/dm-mpath.c
@@ -8,7 +8,6 @@
 #include <linux/device-mapper.h>
 
 #include "dm-path-selector.h"
-#include "dm-bio-record.h"
 #include "dm-uevent.h"
 
 #include <linux/ctype.h>
@@ -83,7 +82,7 @@ struct multipath {
 	unsigned pg_init_count;		/* Number of times pg_init called */
 
 	struct work_struct process_queued_ios;
-	struct bio_list queued_ios;
+	struct list_head queued_ios;
 	unsigned queue_size;
 
 	struct work_struct trigger_event;
@@ -100,7 +99,6 @@ struct multipath {
  */
 struct dm_mpath_io {
 	struct pgpath *pgpath;
-	struct dm_bio_details details;
 	size_t nr_bytes;
 };
 
@@ -194,6 +192,7 @@ static struct multipath *alloc_multipath
 	m = kzalloc(sizeof(*m), GFP_KERNEL);
 	if (m) {
 		INIT_LIST_HEAD(&m->priority_groups);
+		INIT_LIST_HEAD(&m->queued_ios);
 		spin_lock_init(&m->lock);
 		m->queue_io = 1;
 		INIT_WORK(&m->process_queued_ios, process_queued_ios);
@@ -318,13 +317,14 @@ static int __must_push_back(struct multi
 		dm_noflush_suspending(m->ti));
 }
 
-static int map_io(struct multipath *m, struct bio *bio,
+static int map_io(struct multipath *m, struct request *clone,
 		  struct dm_mpath_io *mpio, unsigned was_queued)
 {
 	int r = DM_MAPIO_REMAPPED;
-	size_t nr_bytes = bio->bi_size;
+	size_t nr_bytes = blk_rq_bytes(clone);
 	unsigned long flags;
 	struct pgpath *pgpath;
+	struct block_device *bdev;
 
 	spin_lock_irqsave(&m->lock, flags);
 
@@ -341,16 +341,18 @@ static int map_io(struct multipath *m, s
 	if ((pgpath && m->queue_io) ||
 	    (!pgpath && m->queue_if_no_path)) {
 		/* Queue for the daemon to resubmit */
-		bio_list_add(&m->queued_ios, bio);
+		list_add_tail(&clone->queuelist, &m->queued_ios);
 		m->queue_size++;
 		if ((m->pg_init_required && !m->pg_init_in_progress) ||
 		    !m->queue_io)
 			queue_work(kmultipathd, &m->process_queued_ios);
 		pgpath = NULL;
 		r = DM_MAPIO_SUBMITTED;
-	} else if (pgpath)
-		bio->bi_bdev = pgpath->path.dev->bdev;
-	else if (__must_push_back(m))
+	} else if (pgpath) {
+		bdev = pgpath->path.dev->bdev;
+		clone->q = bdev_get_queue(bdev);
+		clone->rq_disk = bdev->bd_disk;
+	} else if (__must_push_back(m))
 		r = DM_MAPIO_REQUEUE;
 	else
 		r = -EIO;	/* Failed */
@@ -398,30 +400,31 @@ static void dispatch_queued_ios(struct m
 {
 	int r;
 	unsigned long flags;
-	struct bio *bio = NULL, *next;
 	struct dm_mpath_io *mpio;
 	union map_info *info;
+	struct request *clone, *n;
+	LIST_HEAD(cl);
 
 	spin_lock_irqsave(&m->lock, flags);
-	bio = bio_list_get(&m->queued_ios);
+	list_splice_init(&m->queued_ios, &cl);
 	spin_unlock_irqrestore(&m->lock, flags);
 
-	while (bio) {
-		next = bio->bi_next;
-		bio->bi_next = NULL;
+	list_for_each_entry_safe(clone, n, &cl, queuelist) {
+		list_del_init(&clone->queuelist);
 
-		info = dm_get_mapinfo(bio);
+		info = dm_get_rq_mapinfo(clone);
 		mpio = info->ptr;
 
-		r = map_io(m, bio, mpio, 1);
-		if (r < 0)
-			bio_endio(bio, r);
-		else if (r == DM_MAPIO_REMAPPED)
-			generic_make_request(bio);
-		else if (r == DM_MAPIO_REQUEUE)
-			bio_endio(bio, -EIO);
-
-		bio = next;
+		r = map_io(m, clone, mpio, 1);
+		if (r < 0) {
+			mempool_free(mpio, m->mpio_pool);
+			dm_kill_unmapped_request(clone, r);
+		} else if (r == DM_MAPIO_REMAPPED)
+			dm_dispatch_request(clone);
+		else if (r == DM_MAPIO_REQUEUE) {
+			mempool_free(mpio, m->mpio_pool);
+			dm_requeue_unmapped_request(clone);
+		}
 	}
 }
 
@@ -863,21 +866,24 @@ static void multipath_dtr(struct dm_targ
 }
 
 /*
- * Map bios, recording original fields for later in case we have to resubmit
+ * Map cloned requests
  */
-static int multipath_map(struct dm_target *ti, struct bio *bio,
+static int multipath_map(struct dm_target *ti, struct request *clone,
 			 union map_info *map_context)
 {
 	int r;
 	struct dm_mpath_io *mpio;
 	struct multipath *m = (struct multipath *) ti->private;
 
-	mpio = mempool_alloc(m->mpio_pool, GFP_NOIO);
-	dm_bio_record(&mpio->details, bio);
+	mpio = mempool_alloc(m->mpio_pool, GFP_ATOMIC);
+	if (!mpio)
+		/* ENOMEM, requeue */
+		return DM_MAPIO_REQUEUE;
+	memset(mpio, 0, sizeof(*mpio));
 
 	map_context->ptr = mpio;
-	bio->bi_rw |= (1 << BIO_RW_FAILFAST_TRANSPORT);
-	r = map_io(m, bio, mpio, 0);
+	clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
+	r = map_io(m, clone, mpio, 0);
 	if (r < 0 || r == DM_MAPIO_REQUEUE)
 		mempool_free(mpio, m->mpio_pool);
 
@@ -1158,53 +1164,41 @@ static void activate_path(struct work_st
 /*
  * end_io handling
  */
-static int do_end_io(struct multipath *m, struct bio *bio,
+static int do_end_io(struct multipath *m, struct request *clone,
 		     int error, struct dm_mpath_io *mpio)
 {
+	/*
+	 * We don't queue any clone request inside the multipath target
+	 * during end I/O handling, since those clone requests don't have
+	 * bio clones.  If we queue them inside the multipath target,
+	 * we need to make bio clones, that requires memory allocation.
+	 * (See drivers/md/dm.c:end_clone_bio() about why the clone requests
+	 *  don't have bio clones.)
+	 * Instead of queueing the clone request here, we queue the original
+	 * request into dm core, which will remake a clone request and
+	 * clone bios for it and resubmit it later.
+	 */
+	int r = DM_ENDIO_REQUEUE;
 	unsigned long flags;
 
-	if (!error)
+	if (!error && !clone->errors)
 		return 0;	/* I/O complete */
 
-	if ((error == -EWOULDBLOCK) && bio_rw_ahead(bio))
-		return error;
-
 	if (error == -EOPNOTSUPP)
 		return error;
 
-	spin_lock_irqsave(&m->lock, flags);
-	if (!m->nr_valid_paths) {
-		if (__must_push_back(m)) {
-			spin_unlock_irqrestore(&m->lock, flags);
-			return DM_ENDIO_REQUEUE;
-		} else if (!m->queue_if_no_path) {
-			spin_unlock_irqrestore(&m->lock, flags);
-			return -EIO;
-		} else {
-			spin_unlock_irqrestore(&m->lock, flags);
-			goto requeue;
-		}
-	}
-	spin_unlock_irqrestore(&m->lock, flags);
-
 	if (mpio->pgpath)
 		fail_path(mpio->pgpath);
 
-      requeue:
-	dm_bio_restore(&mpio->details, bio);
-
-	/* queue for the daemon to resubmit or fail */
 	spin_lock_irqsave(&m->lock, flags);
-	bio_list_add(&m->queued_ios, bio);
-	m->queue_size++;
-	if (!m->queue_io)
-		queue_work(kmultipathd, &m->process_queued_ios);
+	if (!m->nr_valid_paths && !m->queue_if_no_path && !__must_push_back(m))
+		r = -EIO;
 	spin_unlock_irqrestore(&m->lock, flags);
 
-	return DM_ENDIO_INCOMPLETE;	/* io not complete */
+	return r;
 }
 
-static int multipath_end_io(struct dm_target *ti, struct bio *bio,
+static int multipath_end_io(struct dm_target *ti, struct request *clone,
 			    int error, union map_info *map_context)
 {
 	struct multipath *m = ti->private;
@@ -1213,14 +1207,13 @@ static int multipath_end_io(struct dm_ta
 	struct path_selector *ps;
 	int r;
 
-	r  = do_end_io(m, bio, error, mpio);
+	r  = do_end_io(m, clone, error, mpio);
 	if (pgpath) {
 		ps = &pgpath->pg->ps;
 		if (ps->type->end_io)
 			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
 	}
-	if (r != DM_ENDIO_INCOMPLETE)
-		mempool_free(mpio, m->mpio_pool);
+	mempool_free(mpio, m->mpio_pool);
 
 	return r;
 }
@@ -1450,6 +1443,75 @@ static int multipath_ioctl(struct dm_tar
 	return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
 }
 
+static int __pgpath_busy(struct pgpath *pgpath)
+{
+	struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
+
+	return dm_underlying_device_busy(q);
+}
+
+/*
+ * We return "busy", only when we can map I/Os but underlying devices
+ * are busy (so even if we map I/Os now, the I/Os will wait on
+ * the underlying queue).
+ * In other words, if we want to kill I/Os or queue them inside us
+ * due to map unavailability, we don't return "busy".  Otherwise,
+ * dm core won't give us the I/Os and we can't do what we want.
+ */
+static int multipath_busy(struct dm_target *ti)
+{
+	int busy = 0, has_active = 0;
+	struct multipath *m = ti->private;
+	struct priority_group *pg;
+	struct pgpath *pgpath;
+	unsigned long flags;
+
+	spin_lock_irqsave(&m->lock, flags);
+
+	/* Guess which priority_group will be used at next mapping time */
+	if (unlikely(!m->current_pgpath && m->next_pg))
+		pg = m->next_pg;
+	else if (likely(m->current_pg))
+		pg = m->current_pg;
+	else
+		/*
+		 * We don't know which pg will be used at next mapping time.
+		 * We don't call __choose_pgpath() here to avoid triggering
+		 * pg_init just by busy checking.
+		 * So we don't know whether underlying devices we will be using
+		 * at next mapping time are busy or not. Just try mapping.
+		 */
+		goto out;
+
+	/*
+	 * If there is at least one non-busy active path, the path selector
+	 * will be able to select it. So we consider such a pg as not busy.
+	 */
+	busy = 1;
+	list_for_each_entry(pgpath, &pg->pgpaths, list)
+		if (pgpath->is_active) {
+			has_active = 1;
+
+			if (!__pgpath_busy(pgpath)) {
+				busy = 0;
+				break;
+			}
+		}
+
+	if (!has_active)
+		/*
+		 * No active path in this pg, so this pg won't be used and
+		 * the current_pg will be changed at next mapping time.
+		 * We need to try mapping to determine it.
+		 */
+		busy = 0;
+
+out:
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return busy;
+}
+
 /*-----------------------------------------------------------------
  * Module setup
  *---------------------------------------------------------------*/
@@ -1459,13 +1521,14 @@ static struct target_type multipath_targ
 	.module = THIS_MODULE,
 	.ctr = multipath_ctr,
 	.dtr = multipath_dtr,
-	.map = multipath_map,
-	.end_io = multipath_end_io,
+	.map_rq = multipath_map,
+	.rq_end_io = multipath_end_io,
 	.presuspend = multipath_presuspend,
 	.resume = multipath_resume,
 	.status = multipath_status,
 	.message = multipath_message,
 	.ioctl  = multipath_ioctl,
+	.busy = multipath_busy,
 };
 
 static int __init dm_multipath_init(void)


* Re: [PATCH 5/5] dm-mpath: convert to request-based
  2009-06-02  7:03 ` [PATCH 5/5] dm-mpath: convert to request-based Kiyoshi Ueda
@ 2009-08-27 17:54   ` Alasdair G Kergon
  2009-08-28  5:00     ` Kiyoshi Ueda
  0 siblings, 1 reply; 11+ messages in thread
From: Alasdair G Kergon @ 2009-08-27 17:54 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: device-mapper development

On Tue, Jun 02, 2009 at 04:03:25PM +0900, Kiyoshi Ueda wrote:
> This patch converts dm-multipath target to request-based from bio-based.
 
How much effort would it be to retain the old mpath implementation
in parallel?

I'm rather concerned that we're losing some useful functionality in
2.6.31 with this patch - stacking over bio-based devices (test beds use
this and it's helpful for debugging), barrier support - and supporting
both would make it easier for people to compare the two implementations
and stick to the old one if in their particular circumstances it worked
better.

Perhaps, dm-mpath could just register two targets (like snapshot does),
one bio-based, and one rq-based, sharing most of the functions with
wrappers to indicate which is which where necessary?

Alasdair


* Re: [PATCH 5/5] dm-mpath: convert to request-based
  2009-08-27 17:54   ` Alasdair G Kergon
@ 2009-08-28  5:00     ` Kiyoshi Ueda
  2009-08-28 13:36       ` Mike Snitzer
  2009-11-12 10:08       ` reinstate bio-based dm-multipath? (Was: dm-mpath: convert to request-based) Kiyoshi Ueda
  0 siblings, 2 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-08-28  5:00 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

Hi Alasdair,

On 08/28/2009 02:54 AM +0900, Alasdair G Kergon wrote:
> On Tue, Jun 02, 2009 at 04:03:25PM +0900, Kiyoshi Ueda wrote:
>> This patch converts dm-multipath target to request-based from bio-based.
>  
> How much effort would it be to retain the old mpath implementation
> in parallel?
> 
> I'm rather concerned that we're losing some useful functionality in
> 2.6.31 with this patch - stacking over bio-based devices (test beds use
> this and it's helpful for debugging), barrier support - and supporting
> both would make it easier for people to compare the two implementations
> and stick to the old one if in their particular circumstances it worked
> better.
> 
> Perhaps, dm-mpath could just register two targets (like snapshot does),
> one bio-based, and one rq-based, sharing most of the functions with
> wrappers to indicate which is which where necessary?

Such wrappers need to be made very well to share codes as much as
possible.  Otherwise, we won't be able to maintain the non-default
(bio-based?) code, then people won't be able to use it even for
testing/debugging.
Also we need to consider the user interface so that user-space tools
won't be confused.

I'll look into this when I finish the barrier implementation of
request-based dm. (hopefully 2.6.32 timeframe, maybe 2.6.33)

By the way, only for testing/debugging purposes, making request-based
error/linear targets (e.g. named like error_rq/linear_rq) may be an
alternative.

What do you think?

Thanks,
Kiyoshi Ueda


* Re: [PATCH 5/5] dm-mpath: convert to request-based
  2009-08-28  5:00     ` Kiyoshi Ueda
@ 2009-08-28 13:36       ` Mike Snitzer
  2009-08-29 18:23         ` Mike Snitzer
  2009-11-12 10:08       ` reinstate bio-based dm-multipath? (Was: dm-mpath: convert to request-based) Kiyoshi Ueda
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Snitzer @ 2009-08-28 13:36 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: device-mapper development, Alasdair Kergon

On Fri, Aug 28 2009 at  1:00am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Alasdair,
> 
> On 08/28/2009 02:54 AM +0900, Alasdair G Kergon wrote:
> > On Tue, Jun 02, 2009 at 04:03:25PM +0900, Kiyoshi Ueda wrote:
> >> This patch converts dm-multipath target to request-based from bio-based.
> >  
> > How much effort would it be to retain the old mpath implementation
> > in parallel?
> > 
> > I'm rather concerned that we're losing some useful functionality in
> > 2.6.31 with this patch - stacking over bio-based devices (test beds use
> > this and it's helpful for debugging), barrier support - and supporting
> > both would make it easier for people to compare the two implementations
> > and stick to the old one if in their particular circumstances it worked
> > better.
> > 
> > Perhaps, dm-mpath could just register two targets (like snapshot does),
> > one bio-based, and one rq-based, sharing most of the functions with
> > wrappers to indicate which is which where necessary?
> 
> Such wrappers need to be made very well to share codes as much as
> possible.  Otherwise, we won't be able to maintain the non-default
> (bio-based?) code, then people won't be able to use it even for
> testing/debugging.
> Also we need to consider the user interface so that user-space tools
> won't be confused.
> 
> I'll look into this when I finish the barrier implementation of
> request-based dm. (hopefully 2.6.32 timeframe, maybe 2.6.33)
> 
> By the way, only for testing/debugging purposes, making request-based
> error/linear targets (e.g. named like error_rq/linear_rq) may be an
> alternative.

As a stop-gap I'd imagine Linux's 'scsi_debug' could be used for testing
and debugging purposes.

Mike


* Re: Re: [PATCH 5/5] dm-mpath: convert to request-based
  2009-08-28 13:36       ` Mike Snitzer
@ 2009-08-29 18:23         ` Mike Snitzer
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Snitzer @ 2009-08-29 18:23 UTC (permalink / raw)
  To: device-mapper development; +Cc: Kiyoshi Ueda, Alasdair Kergon

On 8/28/09, Mike Snitzer <snitzer@redhat.com> wrote:
> On Fri, Aug 28 2009 at  1:00am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>
>> Hi Alasdair,
>>
>> On 08/28/2009 02:54 AM +0900, Alasdair G Kergon wrote:
>> > On Tue, Jun 02, 2009 at 04:03:25PM +0900, Kiyoshi Ueda wrote:
>> >> This patch converts dm-multipath target to request-based from
>> >> bio-based.
>> >
>> > How much effort would it be to retain the old mpath implementation
>> > in parallel?
>> >
>> > I'm rather concerned that we're losing some useful functionality in
>> > 2.6.31 with this patch - stacking over bio-based devices (test beds use
>> > this and it's helpful for debugging), barrier support - and supporting
>> > both would make it easier for people to compare the two implementations
>> > and stick to the old one if in their particular circumstances it worked
>> > better.
>> >
>> > Perhaps, dm-mpath could just register two targets (like snapshot does),
>> > one bio-based, and one rq-based, sharing most of the functions with
>> > wrappers to indicate which is which where necessary?
>>
>> Such wrappers need to be made very well to share codes as much as
>> possible.  Otherwise, we won't be able to maintain the non-default
>> (bio-based?) code, then people won't be able to use it even for
>> testing/debugging.
>> Also we need to consider the user interface so that user-space tools
>> won't be confused.
>>
>> I'll look into this when I finish the barrier implementation of
>> request-based dm. (hopefully 2.6.32 timeframe, maybe 2.6.33)
>>
>> By the way, only for testing/debugging purposes, making request-based
>> error/linear targets (e.g. named like error_rq/linear_rq) may be an
>> alternative.
>
> As a stop-gap I'd imagine Linux's 'scsi_debug' could be used for testing
> and debugging purposes.

I looked closer at scsi_debug and it doesn't have the ability to
establish a shared WWID for N SCSI targets.  That aside, it still
wouldn't offer anywhere near as much utility as supporting the various
bio-based DM targets for mpath testing purposes.


* reinstate bio-based dm-multipath? (Was: dm-mpath: convert to request-based)
  2009-08-28  5:00     ` Kiyoshi Ueda
  2009-08-28 13:36       ` Mike Snitzer
@ 2009-11-12 10:08       ` Kiyoshi Ueda
  1 sibling, 0 replies; 11+ messages in thread
From: Kiyoshi Ueda @ 2009-11-12 10:08 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development

Hi Alasdair,

Sorry for the late reply.

On 08/28/2009 02:00 PM +0900, Kiyoshi Ueda wrote:
> Hi Alasdair,
> 
> On 08/28/2009 02:54 AM +0900, Alasdair G Kergon wrote:
>> On Tue, Jun 02, 2009 at 04:03:25PM +0900, Kiyoshi Ueda wrote:
>>> This patch converts dm-multipath target to request-based from bio-based.
>>  
>> How much effort would it be to retain the old mpath implementation
>> in parallel?
>>
>> I'm rather concerned that we're losing some useful functionality in
>> 2.6.31 with this patch - stacking over bio-based devices (test beds use
>> this and it's helpful for debugging), barrier support - and supporting
>> both would make it easier for people to compare the two implementations
>> and stick to the old one if in their particular circumstances it worked
>> better.
>>
>> Perhaps, dm-mpath could just register two targets (like snapshot does),
>> one bio-based, and one rq-based, sharing most of the functions with
>> wrappers to indicate which is which where necessary?
> 
> Such wrappers need to be made very well to share codes as much as
> possible.  Otherwise, we won't be able to maintain the non-default
> (bio-based?) code, then people won't be able to use it even for
> testing/debugging.
> Also we need to consider the user interface so that user-space tools
> won't be confused.
> 
> I'll look into this when I finish the barrier implementation of
> request-based dm. (hopefully 2.6.32 timeframe, maybe 2.6.33)

Error handling of ->end_io() in bio-based dm-mpath is different from
that in rq-based dm-mpath:
    bio-based uses the target's internal queue for retrying clones,
    while rq-based uses dm core's queue.

This difference makes code sharing difficult.
So, to build such wrappers properly, we would need to change the
end_io/retry handling of bio-based dm-mpath and dm core:
    bio-based dm core would apply DM_ENDIO_REQUEUE whenever a target
    asks for it, not only during noflush suspend.  (Then bio-based
    dm-mpath could push a clone that needs a retry back to bio-based
    dm core.)
    That means the "deferred" list could be kicked at any time, which
    changes the suspend and barrier processing logic of bio-based
    dm core.
    (By the way, with this change bio-based dm-mpath wouldn't need
     dm_bio_details in dm_mpath_io, so rq-based and bio-based could
     use the same type of structure for per-I/O private data, which
     also makes code sharing easier.)

That will take time, since we must be very careful not to break
the existing bio-based dm core's suspend/barrier code.
I can't come up with any need for it other than testing, and for
testing purposes, using scsi_debug or adding rq-based linear/error
targets are better approaches, I think.
So I don't think this feature should proceed until a strong
requirement is found.

By the way, if we reinstate bio-based dm-multipath without any wrapper,
the patch looks like the one below.  (Don't apply this, since we won't
be able to maintain it well.)

Note: its target name is "bio_multipath", so multipath-tools
      won't work with it and people must use dmsetup manually.

Thanks,
Kiyoshi Ueda
---
 drivers/md/dm-mpath.c |  223 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 220 insertions(+), 3 deletions(-)

Index: 2.6.32-rc6/drivers/md/dm-mpath.c
===================================================================
--- 2.6.32-rc6.orig/drivers/md/dm-mpath.c
+++ 2.6.32-rc6/drivers/md/dm-mpath.c
@@ -8,6 +8,7 @@
 #include <linux/device-mapper.h>
 
 #include "dm-path-selector.h"
+#include "dm-bio-record.h"
 #include "dm-uevent.h"
 
 #include <linux/ctype.h>
@@ -84,6 +85,7 @@ struct multipath {
 
 	struct work_struct process_queued_ios;
 	struct list_head queued_ios;
+	struct bio_list queued_bios;
 	unsigned queue_size;
 
 	struct work_struct trigger_event;
@@ -103,11 +105,17 @@ struct dm_mpath_io {
 	size_t nr_bytes;
 };
 
+struct dm_bio_mpath_io {
+	struct pgpath *pgpath;
+	size_t nr_bytes;
+	struct dm_bio_details details;
+};
+
 typedef int (*action_fn) (struct pgpath *pgpath);
 
 #define MIN_IOS 256	/* Mempool size */
 
-static struct kmem_cache *_mpio_cache;
+static struct kmem_cache *_mpio_cache, *_bio_mpio_cache;
 
 static struct workqueue_struct *kmultipathd, *kmpath_handlerd;
 static void process_queued_ios(struct work_struct *work);
@@ -189,6 +197,7 @@ static void free_priority_group(struct p
 static struct multipath *alloc_multipath(struct dm_target *ti)
 {
 	struct multipath *m;
+	struct kmem_cache *mpio_cache;
 
 	m = kzalloc(sizeof(*m), GFP_KERNEL);
 	if (m) {
@@ -198,7 +207,8 @@ static struct multipath *alloc_multipath
 		m->queue_io = 1;
 		INIT_WORK(&m->process_queued_ios, process_queued_ios);
 		INIT_WORK(&m->trigger_event, trigger_event);
-		m->mpio_pool = mempool_create_slab_pool(MIN_IOS, _mpio_cache);
+		mpio_cache = ti->type->map_rq ? _mpio_cache : _bio_mpio_cache;
+		m->mpio_pool = mempool_create_slab_pool(MIN_IOS, mpio_cache);
 		if (!m->mpio_pool) {
 			kfree(m);
 			return NULL;
@@ -371,6 +381,54 @@ static int map_io(struct multipath *m, s
 	return r;
 }
 
+static int map_bio(struct multipath *m, struct bio *clone,
+		   struct dm_bio_mpath_io *mpio, unsigned was_queued)
+{
+	int r = DM_MAPIO_REMAPPED;
+	size_t nr_bytes = clone->bi_size;
+	unsigned long flags;
+	struct pgpath *pgpath;
+
+	spin_lock_irqsave(&m->lock, flags);
+
+	/* Do we need to select a new pgpath? */
+	if (!m->current_pgpath ||
+	    (!m->queue_io && (m->repeat_count && --m->repeat_count == 0)))
+		__choose_pgpath(m, nr_bytes);
+
+	pgpath = m->current_pgpath;
+
+	if (was_queued)
+		m->queue_size--;
+
+	if ((pgpath && m->queue_io) || (!pgpath && m->queue_if_no_path)) {
+		/* Queue for the daemon to resubmit */
+		bio_list_add(&m->queued_bios, clone);
+		m->queue_size++;
+		if ((m->pg_init_required && !m->pg_init_in_progress) ||
+		    !m->queue_io)
+			queue_work(kmultipathd, &m->process_queued_ios);
+		pgpath = NULL;
+		r = DM_MAPIO_SUBMITTED;
+	} else if (pgpath)
+		clone->bi_bdev = pgpath->path.dev->bdev;
+	else if (__must_push_back(m))
+		r = DM_MAPIO_REQUEUE;
+	else
+		r = -EIO;	/* Failed */
+
+	mpio->pgpath = pgpath;
+	mpio->nr_bytes = nr_bytes;
+
+	if (r == DM_MAPIO_REMAPPED && pgpath->pg->ps.type->start_io)
+		pgpath->pg->ps.type->start_io(&pgpath->pg->ps, &pgpath->path,
+					      nr_bytes);
+
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return r;
+}
+
 /*
  * If we run out of usable paths, should we queue I/O or error it?
  */
@@ -430,6 +488,37 @@ static void dispatch_queued_ios(struct m
 	}
 }
 
+static void dispatch_queued_bios(struct multipath *m)
+{
+	int r;
+	unsigned long flags;
+	struct bio *clone = NULL, *next;
+	struct dm_bio_mpath_io *mpio;
+	union map_info *info;
+
+	spin_lock_irqsave(&m->lock, flags);
+	clone = bio_list_get(&m->queued_bios);
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	while (clone) {
+		next = clone->bi_next;
+		clone->bi_next = NULL;
+
+		info = dm_get_mapinfo(clone);
+		mpio = info->ptr;
+
+		r = map_bio(m, clone, mpio, 1);
+		if (r < 0)
+			bio_endio(clone, r);
+		else if (r == DM_MAPIO_REMAPPED)
+			generic_make_request(clone);
+		else if (r == DM_MAPIO_REQUEUE)
+			bio_endio(clone, -EIO);
+
+		clone = next;
+	}
+}
+
 static void process_queued_ios(struct work_struct *work)
 {
 	struct multipath *m =
@@ -462,8 +551,10 @@ static void process_queued_ios(struct wo
 	}
 out:
 	spin_unlock_irqrestore(&m->lock, flags);
-	if (!must_queue)
+	if (!must_queue) {
 		dispatch_queued_ios(m);
+		dispatch_queued_bios(m);
+	}
 }
 
 /*
@@ -921,6 +1012,28 @@ static int multipath_map(struct dm_targe
 }
 
 /*
+ * Map bios, recording original fields for later in case we have to resubmit
+ */
+static int multipath_map_bio(struct dm_target *ti, struct bio *clone,
+			     union map_info *map_context)
+{
+	int r;
+	struct dm_bio_mpath_io *mpio;
+	struct multipath *m = ti->private;
+
+	mpio = mempool_alloc(m->mpio_pool, GFP_NOIO);
+	dm_bio_record(&mpio->details, clone);
+
+	map_context->ptr = mpio;
+	clone->bi_rw |= (1 << BIO_RW_FAILFAST_TRANSPORT);
+	r = map_bio(m, clone, mpio, 0);
+	if (r < 0 || r == DM_MAPIO_REQUEUE)
+		mempool_free(mpio, m->mpio_pool);
+
+	return r;
+}
+
+/*
  * Take a path out of use.
  */
 static int fail_path(struct pgpath *pgpath)
@@ -1228,6 +1341,52 @@ static int do_end_io(struct multipath *m
 	return r;
 }
 
+static int do_end_bio(struct multipath *m, struct bio *clone,
+		      int error, struct dm_bio_mpath_io *mpio)
+{
+	unsigned long flags;
+
+	if (!error)
+		return 0;	/* I/O complete */
+
+	if ((error == -EWOULDBLOCK) && bio_rw_flagged(clone, BIO_RW_AHEAD))
+		return error;
+
+	if (error == -EOPNOTSUPP)
+		return error;
+
+	spin_lock_irqsave(&m->lock, flags);
+	if (!m->nr_valid_paths) {
+		if (__must_push_back(m)) {
+			spin_unlock_irqrestore(&m->lock, flags);
+			return DM_ENDIO_REQUEUE;
+		} else if (!m->queue_if_no_path) {
+			spin_unlock_irqrestore(&m->lock, flags);
+			return -EIO;
+		} else {
+			spin_unlock_irqrestore(&m->lock, flags);
+			goto requeue;
+		}
+	}
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	if (mpio->pgpath)
+		fail_path(mpio->pgpath);
+
+requeue:
+	dm_bio_restore(&mpio->details, clone);
+
+	/* queue for the daemon to resubmit or fail */
+	spin_lock_irqsave(&m->lock, flags);
+	bio_list_add(&m->queued_bios, clone);
+	m->queue_size++;
+	if (!m->queue_io)
+		queue_work(kmultipathd, &m->process_queued_ios);
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	return DM_ENDIO_INCOMPLETE;	/* io not complete */
+}
+
 static int multipath_end_io(struct dm_target *ti, struct request *clone,
 			    int error, union map_info *map_context)
 {
@@ -1248,6 +1407,27 @@ static int multipath_end_io(struct dm_ta
 	return r;
 }
 
+static int multipath_end_bio(struct dm_target *ti, struct bio *clone,
+			     int error, union map_info *map_context)
+{
+	struct multipath *m = ti->private;
+	struct dm_bio_mpath_io *mpio = map_context->ptr;
+	struct pgpath *pgpath = mpio->pgpath;
+	struct path_selector *ps;
+	int r;
+
+	r  = do_end_bio(m, clone, error, mpio);
+	if (pgpath) {
+		ps = &pgpath->pg->ps;
+		if (ps->type->end_io)
+			ps->type->end_io(ps, &pgpath->path, mpio->nr_bytes);
+	}
+	if (r != DM_ENDIO_INCOMPLETE)
+		mempool_free(mpio, m->mpio_pool);
+
+	return r;
+}
+
 /*
  * Suspend can't complete until all the I/O is processed so if
  * the last path fails we must error any remaining I/O.
@@ -1582,6 +1762,22 @@ static struct target_type multipath_targ
 	.busy = multipath_busy,
 };
 
+static struct target_type bio_multipath_target = {
+	.name = "bio_multipath",
+	.version = {1, 1, 0},
+	.module = THIS_MODULE,
+	.ctr = multipath_ctr,
+	.dtr = multipath_dtr,
+	.map = multipath_map_bio,
+	.end_io = multipath_end_bio,
+	.presuspend = multipath_presuspend,
+	.resume = multipath_resume,
+	.status = multipath_status,
+	.message = multipath_message,
+	.ioctl  = multipath_ioctl,
+	.iterate_devices = multipath_iterate_devices,
+};
+
 static int __init dm_multipath_init(void)
 {
 	int r;
@@ -1591,9 +1787,25 @@ static int __init dm_multipath_init(void
 	if (!_mpio_cache)
 		return -ENOMEM;
 
+	_bio_mpio_cache = KMEM_CACHE(dm_bio_mpath_io, 0);
+	if (!_bio_mpio_cache) {
+		kmem_cache_destroy(_mpio_cache);
+		return -ENOMEM;
+	}
+
 	r = dm_register_target(&multipath_target);
 	if (r < 0) {
 		DMERR("register failed %d", r);
+		kmem_cache_destroy(_bio_mpio_cache);
+		kmem_cache_destroy(_mpio_cache);
+		return -EINVAL;
+	}
+
+	r = dm_register_target(&bio_multipath_target);
+	if (r < 0) {
+		DMERR("register failed %d", r);
+		dm_unregister_target(&multipath_target);
+		kmem_cache_destroy(_bio_mpio_cache);
 		kmem_cache_destroy(_mpio_cache);
 		return -EINVAL;
 	}
@@ -1601,7 +1813,9 @@ static int __init dm_multipath_init(void
 	kmultipathd = create_workqueue("kmpathd");
 	if (!kmultipathd) {
 		DMERR("failed to create workqueue kmpathd");
+		dm_unregister_target(&bio_multipath_target);
 		dm_unregister_target(&multipath_target);
+		kmem_cache_destroy(_bio_mpio_cache);
 		kmem_cache_destroy(_mpio_cache);
 		return -ENOMEM;
 	}
@@ -1616,7 +1830,9 @@ static int __init dm_multipath_init(void
 	if (!kmpath_handlerd) {
 		DMERR("failed to create workqueue kmpath_handlerd");
 		destroy_workqueue(kmultipathd);
+		dm_unregister_target(&bio_multipath_target);
 		dm_unregister_target(&multipath_target);
+		kmem_cache_destroy(_bio_mpio_cache);
 		kmem_cache_destroy(_mpio_cache);
 		return -ENOMEM;
 	}
@@ -1633,6 +1849,7 @@ static void __exit dm_multipath_exit(voi
 	destroy_workqueue(kmpath_handlerd);
 	destroy_workqueue(kmultipathd);
 
+	dm_unregister_target(&bio_multipath_target);
 	dm_unregister_target(&multipath_target);
 	kmem_cache_destroy(_mpio_cache);
 }

