* [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
@ 2017-08-31 6:18 NeilBrown
2017-09-01 2:38 ` Mikulas Patocka
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2017-08-31 6:18 UTC (permalink / raw)
To: Alasdair Kergon, Mike Snitzer, Mikulas Patocka; +Cc: dm-devel
[-- Attachment #1.1: Type: text/plain, Size: 8264 bytes --]
The md->bs bioset currently needs the BIOSET_NEED_RESCUER flag
which I hope to deprecate. It is currently needed because
__split_and_process_bio() can potentially allocate from the
bioset multiple times within the one make_request_fn context.
Subsequent allocations can deadlock waiting for the bio returned
by the first allocation to complete - but it is stuck on the
current->bio_list list.
The bioset is also used in setup_clone() in dm-rq.c, but there a
non-blocking (GFP_ATOMIC) allocation is used, so there is no risk of
deadlock.
dm also provides rescuing for ->map() calls. If a map call ever blocks,
any bios already submitted with generic_make_request() will be passed to
the bioset rescuer thread.
This patch unifies these two separate needs for rescuing bios into a
single mechanism, and uses the md->wq work queue to do the rescuing.
A blk_plug_cb is added to 'struct clone_info' so that it is available
throughout the __split_and_process_bio() process.
Prior to allocating from the bioset, or calling ->map(),
dm_offload_check() is called to ensure that the blk_plug_cb is active.
If either of these operations schedules, the callback is called,
and any queued bios are passed to the wq.
Note that only current->bio_list[0] is offloaded. current->bio_list[1]
contains bios that were scheduled *before* the current one started, so
they must have been submitted from higher up the stack, and we cannot be
waiting for them here. Also, we now rescue *all* bios on the list as
there is nothing to be gained by being more selective.
This allows us to remove BIOSET_NEED_RESCUER when allocating this
bioset.
It also keeps all the mechanics of rescuing inside dm, so dm can
be in full control.
Signed-off-by: NeilBrown <neilb@suse.com>
---
Hi,
I have only tested this lightly as I don't have any test infrastructure
for dm and don't know how to reproduce the expected deadlocks.
I'm keen to know your thoughts on this approach if you find a few spare
moments to have a look.
Thanks,
NeilBrown
drivers/md/dm-core.h | 1 +
drivers/md/dm.c | 99 ++++++++++++++++++++++++++--------------------------
2 files changed, 51 insertions(+), 49 deletions(-)
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 24eddbdf2ab4..cb060dd6d3ca 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -71,6 +71,7 @@ struct mapped_device {
struct work_struct work;
spinlock_t deferred_lock;
struct bio_list deferred;
+ struct bio_list rescued;
/*
* Event handling.
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2edbcc2d7d3f..774dd71cb49a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1134,65 +1134,63 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
}
EXPORT_SYMBOL_GPL(dm_remap_zone_report);
+struct clone_info {
+ struct mapped_device *md;
+ struct dm_table *map;
+ struct bio *bio;
+ struct dm_io *io;
+ sector_t sector;
+ unsigned sector_count;
+ struct blk_plug plug;
+ struct blk_plug_cb cb;
+};
+
/*
* Flush current->bio_list when the target map method blocks.
* This fixes deadlocks in snapshot and possibly in other targets.
*/
-struct dm_offload {
- struct blk_plug plug;
- struct blk_plug_cb cb;
-};
static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
{
- struct dm_offload *o = container_of(cb, struct dm_offload, cb);
+ struct clone_info *ci = container_of(cb, struct clone_info, cb);
struct bio_list list;
- struct bio *bio;
- int i;
- INIT_LIST_HEAD(&o->cb.list);
+ INIT_LIST_HEAD(&cb->list);
- if (unlikely(!current->bio_list))
+ if (unlikely(!current->bio_list || bio_list_empty(&current->bio_list[0])))
return;
- for (i = 0; i < 2; i++) {
- list = current->bio_list[i];
- bio_list_init(&current->bio_list[i]);
-
- while ((bio = bio_list_pop(&list))) {
- struct bio_set *bs = bio->bi_pool;
- if (unlikely(!bs) || bs == fs_bio_set ||
- !bs->rescue_workqueue) {
- bio_list_add(&current->bio_list[i], bio);
- continue;
- }
+ list = current->bio_list[0];
+ bio_list_init(&current->bio_list[0]);
+ spin_lock(&ci->md->deferred_lock);
+ bio_list_merge(&ci->md->rescued, &list);
+ spin_unlock(&ci->md->deferred_lock);
+ queue_work(ci->md->wq, &ci->md->work);
+}
- spin_lock(&bs->rescue_lock);
- bio_list_add(&bs->rescue_list, bio);
- queue_work(bs->rescue_workqueue, &bs->rescue_work);
- spin_unlock(&bs->rescue_lock);
- }
- }
+static void dm_offload_start(struct clone_info *ci)
+{
+ blk_start_plug(&ci->plug);
+ INIT_LIST_HEAD(&ci->cb.list);
+ ci->cb.callback = flush_current_bio_list;
}
-static void dm_offload_start(struct dm_offload *o)
+static void dm_offload_end(struct clone_info *ci)
{
- blk_start_plug(&o->plug);
- o->cb.callback = flush_current_bio_list;
- list_add(&o->cb.list, &current->plug->cb_list);
+ list_del(&ci->cb.list);
+ blk_finish_plug(&ci->plug);
}
-static void dm_offload_end(struct dm_offload *o)
+static void dm_offload_check(struct clone_info *ci)
{
- list_del(&o->cb.list);
- blk_finish_plug(&o->plug);
+ if (list_empty(&ci->cb.list))
+ list_add(&ci->cb.list, &current->plug->cb_list);
}
-static void __map_bio(struct dm_target_io *tio)
+static void __map_bio(struct clone_info *ci, struct dm_target_io *tio)
{
int r;
sector_t sector;
- struct dm_offload o;
struct bio *clone = &tio->clone;
struct dm_target *ti = tio->ti;
@@ -1206,9 +1204,8 @@ static void __map_bio(struct dm_target_io *tio)
atomic_inc(&tio->io->io_count);
sector = clone->bi_iter.bi_sector;
- dm_offload_start(&o);
+ dm_offload_check(ci);
r = ti->type->map(ti, clone);
- dm_offload_end(&o);
switch (r) {
case DM_MAPIO_SUBMITTED:
@@ -1233,15 +1230,6 @@ static void __map_bio(struct dm_target_io *tio)
}
}
-struct clone_info {
- struct mapped_device *md;
- struct dm_table *map;
- struct bio *bio;
- struct dm_io *io;
- sector_t sector;
- unsigned sector_count;
-};
-
static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
{
bio->bi_iter.bi_sector = sector;
@@ -1291,6 +1279,7 @@ static struct dm_target_io *alloc_tio(struct clone_info *ci,
struct dm_target_io *tio;
struct bio *clone;
+ dm_offload_check(ci);
clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs);
tio = container_of(clone, struct dm_target_io, clone);
@@ -1314,7 +1303,7 @@ static void __clone_and_map_simple_bio(struct clone_info *ci,
if (len)
bio_setup_sector(clone, ci->sector, *len);
- __map_bio(tio);
+ __map_bio(ci, tio);
}
static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
@@ -1361,7 +1350,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
free_tio(tio);
break;
}
- __map_bio(tio);
+ __map_bio(ci, tio);
}
return r;
@@ -1493,6 +1482,7 @@ static void __split_and_process_bio(struct mapped_device *md,
bio_io_error(bio);
return;
}
+ dm_offload_start(&ci);
ci.map = map;
ci.md = md;
@@ -1521,6 +1511,7 @@ static void __split_and_process_bio(struct mapped_device *md,
while (ci.sector_count && !error)
error = __split_and_process_non_flush(&ci);
}
+ dm_offload_end(&ci);
/* drop the extra reference count */
dec_pending(ci.io, error);
@@ -2248,6 +2239,16 @@ static void dm_wq_work(struct work_struct *work)
int srcu_idx;
struct dm_table *map;
+ if (!bio_list_empty(&md->rescued)) {
+ struct bio_list list;
+ spin_lock_irq(&md->deferred_lock);
+ list = md->rescued;
+ bio_list_init(&md->rescued);
+ spin_unlock_irq(&md->deferred_lock);
+ while ((c = bio_list_pop(&list)) != NULL)
+ generic_make_request(c);
+ }
+
map = dm_get_live_table(md, &srcu_idx);
while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
@@ -2796,7 +2797,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
BUG();
}
- pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER);
+ pools->bs = bioset_create(pool_size, front_pad, 0);
if (!pools->bs)
goto out;
--
2.14.0.rc0.dirty
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-08-31 6:18 [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations NeilBrown
@ 2017-09-01 2:38 ` Mikulas Patocka
2017-09-01 3:48 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-09-01 2:38 UTC (permalink / raw)
To: NeilBrown; +Cc: dm-devel, Mike Snitzer, Alasdair Kergon
On Thu, 31 Aug 2017, NeilBrown wrote:
>
> The md->bs bioset currently needs the BIOSET_NEED_RESCUER flag
> which I hope to deprecate. It is currently needed because
> __split_and_process_bio() can potentially allocate from the
> bioset multiple times within the one make_request_fn context.
> Subsequent allocations can deadlock waiting for the bio returned
> by the first allocation to complete - but it is stuck on the
> current->bio_list list.
>
> The bioset is also used in setup_clone() in dm-rq.c, but there a
> non-blocking (GFP_ATOMIC) allocation is used, so there is no risk of
> deadlock.
>
> dm also provides rescuing for ->map() calls. If a map call ever blocks,
> any bios already submitted with generic_make_request() will be passed to
> the bioset rescuer thread.
>
> This patch unifies these two separate needs for rescuing bios into a
> single mechanism, and uses the md->wq work queue to do the rescuing.
>
> A blk_plug_cb is added to 'struct clone_info' so that it is available
> throughout the __split_and_process_bio() process.
> Prior to allocating from the bioset, or calling ->map(),
> dm_offload_check() is called to ensure that the blk_plug_cb is active.
> If either of these operations schedules, the callback is called,
> and any queued bios get passed to the wq.
>
> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
> contains bios that were scheduled *before* the current one started, so
These bios need to be offloaded too, otherwise you re-introduce this
deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
Mikulas
> they must have been submitted from higher up the stack, and we cannot be
> waiting for them here. Also, we now rescue *all* bios on the list as
> there is nothing to be gained by being more selective.
>
> This allows us to remove BIOSET_NEED_RESCUER when allocating this
> bioset.
> It also keeps all the mechanics of rescuing inside dm, so dm can
> be in full control.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>
> Hi,
> I have only tested this lightly as I don't have any test infrastructure
> for dm and don't know how to reproduce the expected deadlocks.
> I'm keen to know your thoughts on this approach if you find a few spare
> moments to have a look.
>
> Thanks,
> NeilBrown
>
>
> drivers/md/dm-core.h | 1 +
> drivers/md/dm.c | 99 ++++++++++++++++++++++++++--------------------------
> 2 files changed, 51 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 24eddbdf2ab4..cb060dd6d3ca 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -71,6 +71,7 @@ struct mapped_device {
> struct work_struct work;
> spinlock_t deferred_lock;
> struct bio_list deferred;
> + struct bio_list rescued;
>
> /*
> * Event handling.
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 2edbcc2d7d3f..774dd71cb49a 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1134,65 +1134,63 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
> }
> EXPORT_SYMBOL_GPL(dm_remap_zone_report);
>
> +struct clone_info {
> + struct mapped_device *md;
> + struct dm_table *map;
> + struct bio *bio;
> + struct dm_io *io;
> + sector_t sector;
> + unsigned sector_count;
> + struct blk_plug plug;
> + struct blk_plug_cb cb;
> +};
> +
> /*
> * Flush current->bio_list when the target map method blocks.
> * This fixes deadlocks in snapshot and possibly in other targets.
> */
> -struct dm_offload {
> - struct blk_plug plug;
> - struct blk_plug_cb cb;
> -};
>
> static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
> {
> - struct dm_offload *o = container_of(cb, struct dm_offload, cb);
> + struct clone_info *ci = container_of(cb, struct clone_info, cb);
> struct bio_list list;
> - struct bio *bio;
> - int i;
>
> - INIT_LIST_HEAD(&o->cb.list);
> + INIT_LIST_HEAD(&cb->list);
>
> - if (unlikely(!current->bio_list))
> + if (unlikely(!current->bio_list || bio_list_empty(&current->bio_list[0])))
> return;
>
> - for (i = 0; i < 2; i++) {
> - list = current->bio_list[i];
> - bio_list_init(&current->bio_list[i]);
> -
> - while ((bio = bio_list_pop(&list))) {
> - struct bio_set *bs = bio->bi_pool;
> - if (unlikely(!bs) || bs == fs_bio_set ||
> - !bs->rescue_workqueue) {
> - bio_list_add(&current->bio_list[i], bio);
> - continue;
> - }
> + list = current->bio_list[0];
> + bio_list_init(&current->bio_list[0]);
> + spin_lock(&ci->md->deferred_lock);
> + bio_list_merge(&ci->md->rescued, &list);
> + spin_unlock(&ci->md->deferred_lock);
> + queue_work(ci->md->wq, &ci->md->work);
> +}
>
> - spin_lock(&bs->rescue_lock);
> - bio_list_add(&bs->rescue_list, bio);
> - queue_work(bs->rescue_workqueue, &bs->rescue_work);
> - spin_unlock(&bs->rescue_lock);
> - }
> - }
> +static void dm_offload_start(struct clone_info *ci)
> +{
> + blk_start_plug(&ci->plug);
> + INIT_LIST_HEAD(&ci->cb.list);
> + ci->cb.callback = flush_current_bio_list;
> }
>
> -static void dm_offload_start(struct dm_offload *o)
> +static void dm_offload_end(struct clone_info *ci)
> {
> - blk_start_plug(&o->plug);
> - o->cb.callback = flush_current_bio_list;
> - list_add(&o->cb.list, &current->plug->cb_list);
> + list_del(&ci->cb.list);
> + blk_finish_plug(&ci->plug);
> }
>
> -static void dm_offload_end(struct dm_offload *o)
> +static void dm_offload_check(struct clone_info *ci)
> {
> - list_del(&o->cb.list);
> - blk_finish_plug(&o->plug);
> + if (list_empty(&ci->cb.list))
> + list_add(&ci->cb.list, &current->plug->cb_list);
> }
>
> -static void __map_bio(struct dm_target_io *tio)
> +static void __map_bio(struct clone_info *ci, struct dm_target_io *tio)
> {
> int r;
> sector_t sector;
> - struct dm_offload o;
> struct bio *clone = &tio->clone;
> struct dm_target *ti = tio->ti;
>
> @@ -1206,9 +1204,8 @@ static void __map_bio(struct dm_target_io *tio)
> atomic_inc(&tio->io->io_count);
> sector = clone->bi_iter.bi_sector;
>
> - dm_offload_start(&o);
> + dm_offload_check(ci);
> r = ti->type->map(ti, clone);
> - dm_offload_end(&o);
>
> switch (r) {
> case DM_MAPIO_SUBMITTED:
> @@ -1233,15 +1230,6 @@ static void __map_bio(struct dm_target_io *tio)
> }
> }
>
> -struct clone_info {
> - struct mapped_device *md;
> - struct dm_table *map;
> - struct bio *bio;
> - struct dm_io *io;
> - sector_t sector;
> - unsigned sector_count;
> -};
> -
> static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
> {
> bio->bi_iter.bi_sector = sector;
> @@ -1291,6 +1279,7 @@ static struct dm_target_io *alloc_tio(struct clone_info *ci,
> struct dm_target_io *tio;
> struct bio *clone;
>
> + dm_offload_check(ci);
> clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs);
> tio = container_of(clone, struct dm_target_io, clone);
>
> @@ -1314,7 +1303,7 @@ static void __clone_and_map_simple_bio(struct clone_info *ci,
> if (len)
> bio_setup_sector(clone, ci->sector, *len);
>
> - __map_bio(tio);
> + __map_bio(ci, tio);
> }
>
> static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
> @@ -1361,7 +1350,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
> free_tio(tio);
> break;
> }
> - __map_bio(tio);
> + __map_bio(ci, tio);
> }
>
> return r;
> @@ -1493,6 +1482,7 @@ static void __split_and_process_bio(struct mapped_device *md,
> bio_io_error(bio);
> return;
> }
> + dm_offload_start(&ci);
>
> ci.map = map;
> ci.md = md;
> @@ -1521,6 +1511,7 @@ static void __split_and_process_bio(struct mapped_device *md,
> while (ci.sector_count && !error)
> error = __split_and_process_non_flush(&ci);
> }
> + dm_offload_end(&ci);
>
> /* drop the extra reference count */
> dec_pending(ci.io, error);
> @@ -2248,6 +2239,16 @@ static void dm_wq_work(struct work_struct *work)
> int srcu_idx;
> struct dm_table *map;
>
> + if (!bio_list_empty(&md->rescued)) {
> + struct bio_list list;
> + spin_lock_irq(&md->deferred_lock);
> + list = md->rescued;
> + bio_list_init(&md->rescued);
> + spin_unlock_irq(&md->deferred_lock);
> + while ((c = bio_list_pop(&list)) != NULL)
> + generic_make_request(c);
> + }
> +
> map = dm_get_live_table(md, &srcu_idx);
>
> while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> @@ -2796,7 +2797,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
> BUG();
> }
>
> - pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER);
> + pools->bs = bioset_create(pool_size, front_pad, 0);
> if (!pools->bs)
> goto out;
>
> --
> 2.14.0.rc0.dirty
>
>
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-01 2:38 ` Mikulas Patocka
@ 2017-09-01 3:48 ` NeilBrown
2017-09-02 1:34 ` Mikulas Patocka
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2017-09-01 3:48 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Mike Snitzer, Alasdair Kergon
[-- Attachment #1.1: Type: text/plain, Size: 965 bytes --]
On Thu, Aug 31 2017, Mikulas Patocka wrote:
>>
>> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
>> contains bios that were scheduled *before* the current one started, so
>
> These bios need to be offloaded too, otherwise you re-introduce this
> deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
Thanks for the link.
In the example the bio that is stuck was created in step 4. The call
to generic_make_request() will have placed it on current->bio_list[0].
The top-level generic_make_request call by Process A is still running,
so nothing will have moved the bio to ->bio_list[1]. That only happens
after the ->make_request_fn completes, which must be after step 7.
So the problem bio is on ->bio_list[0] and the code in my patch will
pass it to a workqueue for handling.
So I don't think the deadlock would be reintroduced. Can you see
something that I am missing?
Thanks,
NeilBrown
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-01 3:48 ` NeilBrown
@ 2017-09-02 1:34 ` Mikulas Patocka
2017-09-04 0:12 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-09-02 1:34 UTC (permalink / raw)
To: NeilBrown; +Cc: dm-devel, Mike Snitzer, Alasdair Kergon
On Fri, 1 Sep 2017, NeilBrown wrote:
> On Thu, Aug 31 2017, Mikulas Patocka wrote:
>
> >>
> >> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
> >> contains bios that were scheduled *before* the current one started, so
> >
> > These bios need to be offloaded too, otherwise you re-introduce this
> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
>
> Thanks for the link.
> In the example the bio that is stuck was created in step 4. The call
> to generic_make_request() will have placed it on current->bio_list[0].
> The top-level generic_make_request call by Process A is still running,
> so nothing will have moved the bio to ->bio_list[1]. That only happens
> after the ->make_request_fn completes, which must be after step 7.
> So the problem bio is on ->bio_list[0] and the code in my patch will
> pass it to a workqueue for handling.
>
> So I don't think the deadlock would be reintroduced. Can you see
> something that I am missing?
>
> Thanks,
> NeilBrown
Offloading only current->bio_list[0] will work in the simple case described
above, but not in the general case.
For example, suppose that we have a dm device where the first part is
linear and the second part is snapshot.
* We receive bio A that crosses the linear/snapshot boundary
* DM core creates bio B, submits it to the linear target and adds it to
current->bio_list[0]
* DM core creates bio C, submits it to the snapshot target, the snapshot
target calls track_chunk for this bio and appends the bio to
current->bio_list[0]
* Now, we return back to generic_make_request
* We pop bio B from current->bio_list[0]
* We execute bio_list_on_stack[1] = bio_list_on_stack[0], which moves bio C
to bio_list_on_stack[1] - and now we lose any possibility to offload bio C
to the rescue thread
* The kcopyd thread for the snapshot takes the snapshot lock and waits for
bio C to finish
* We process bio B - and if processing bio B reaches something that takes
the snapshot lock (for example an origin target for the snapshot), a
deadlock will happen. The deadlock could be avoided by offloading bio C to
the rescue thread, but bio C is already on bio_list_on_stack[1] and so it
won't be offloaded
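The sequence above can be reproduced with a toy model of the 4.13-era
generic_make_request() two-list loop. This is plain Python, not kernel code:
names like dm_mrf and TABLE are made up, and the real loop's lower/same-level
sorting is elided since it does not affect this trace.

```python
from collections import deque

current = {"bio_list": None}   # stands in for current->bio_list
seen_by_rescuer = {}           # what a rescuer firing while B is mapped can see

def generic_make_request(bio, table):
    """Toy model of the 4.13-era loop (lower/same-level sorting elided)."""
    if current["bio_list"] is not None:
        current["bio_list"][0].append(bio)   # recursive call: queue on list[0]
        return
    stack = [deque(), deque()]               # bio_list_on_stack[0] and [1]
    current["bio_list"] = stack
    while bio is not None:
        stack[1] = deque(stack[0])           # bio_list_on_stack[1] = [0]
        stack[0] = deque()                   # bio_list_init(&bio_list_on_stack[0])
        table[bio["dev"]](bio)               # ->make_request_fn(), may recurse
        stack[0].extend(stack[1])            # parked bios resume after new ones
        stack[1] = deque()
        bio = stack[0].popleft() if stack[0] else None
    current["bio_list"] = None

def dm_mrf(bio):                             # bio A crosses the boundary:
    generic_make_request({"name": "B", "dev": "linear"}, TABLE)
    generic_make_request({"name": "C", "dev": "snapshot"}, TABLE)

def linear_mrf(bio):                         # record what a rescuer that only
    seen_by_rescuer["list0"] = [b["name"] for b in current["bio_list"][0]]
    seen_by_rescuer["list1"] = [b["name"] for b in current["bio_list"][1]]

TABLE = {"dm": dm_mrf, "linear": linear_mrf, "snapshot": lambda bio: None}
generic_make_request({"name": "A", "dev": "dm"}, TABLE)
# While bio B was being mapped, bio C had already migrated to list[1]:
assert seen_by_rescuer == {"list0": [], "list1": ["C"]}
```

The final assertion captures the problem: by the time bio B's map call could
block, bio C sits on bio_list[1], where an offload of only bio_list[0] cannot
reach it.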
Mikulas
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-02 1:34 ` Mikulas Patocka
@ 2017-09-04 0:12 ` NeilBrown
2017-09-04 16:54 ` Mikulas Patocka
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2017-09-04 0:12 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Alasdair Kergon, Mike Snitzer
[-- Attachment #1.1: Type: text/plain, Size: 3088 bytes --]
On Fri, Sep 01 2017, Mikulas Patocka wrote:
> On Fri, 1 Sep 2017, NeilBrown wrote:
>
>> On Thu, Aug 31 2017, Mikulas Patocka wrote:
>>
>> >>
>> >> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
>> >> contains bios that were scheduled *before* the current one started, so
>> >
>> > These bios need to be offloaded too, otherwise you re-introduce this
>> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
>>
>> Thanks for the link.
>> In the example the bio that is stuck was created in step 4. The call
>> to generic_make_request() will have placed it on current->bio_list[0].
>> The top-level generic_make_request call by Process A is still running,
>> so nothing will have moved the bio to ->bio_list[1]. That only happens
>> after the ->make_request_fn completes, which must be after step 7.
>> So the problem bio is on ->bio_list[0] and the code in my patch will
>> pass it to a workqueue for handling.
>>
>> So I don't think the deadlock would be reintroduced. Can you see
>> something that I am missing?
>>
>> Thanks,
>> NeilBrown
>
> Offloading only current->bio_list[0] will work in a simple case described
> above, but not in the general case.
>
> For example, suppose that we have a dm device where the first part is
> linear and the second part is snapshot.
>
> * We receive bio A that crosses the linear/snapshot boundary
> * DM core creates bio B, submits it to the linear target and adds it to
> current->bio_list[0]
> * DM core creates bio C, submits it to the snapshot target, the snapshot
> target calls track_chunk for this bio and appends the bio to
> current->bio_list[0]
> * Now, we return back to generic_make_request
> * We pop bio B from current->bio_list[0]
> * We execute bio_list_on_stack[1] = bio_list_on_stack[0], that moves bio C
> to bio_list_on_stack[1] - and now we lose any possibility to offload bio C
> to the rescue thread
> * The kcopyd thread for the snapshot takes the snapshot lock and waits for
> bio C to finish
Thanks for the explanation.
I cannot find this last step in the code. "waits for BIO C to finish"
is presumably the call to __check_for_conflicting_io(). This is
called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
What lock is held at that time?
> * We process bio B - and if processing bio B reaches something that takes
> the snapshot lock (for example an origin target for the snapshot), a
> deadlock will happen. The deadlock could be avoided by offloading bio C to
> the rescue thread, but bio C is already on bio_list_on_stack[1] and so it
> won't be offloaded
I think the core issue behind this deadlock is that the same volume
can appear twice in the top-level device, in different regions. An
outstanding request to one region can then interfere with a new request
to a different region. That seems like a reasonable thing to do.
I cannot immediately see a generic way to handle this case that I am
happy with. I'll keep hunting.
Thanks,
NeilBrown
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-04 0:12 ` NeilBrown
@ 2017-09-04 16:54 ` Mikulas Patocka
2017-09-05 23:43 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-09-04 16:54 UTC (permalink / raw)
To: NeilBrown; +Cc: dm-devel, Alasdair Kergon, Mike Snitzer
On Mon, 4 Sep 2017, NeilBrown wrote:
> On Fri, Sep 01 2017, Mikulas Patocka wrote:
>
> > On Fri, 1 Sep 2017, NeilBrown wrote:
> >
> >> On Thu, Aug 31 2017, Mikulas Patocka wrote:
> >>
> >> >>
> >> >> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
> >> >> contains bios that were scheduled *before* the current one started, so
> >> >
> >> > These bios need to be offloaded too, otherwise you re-introduce this
> >> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
> >>
> >> Thanks for the link.
> >> In the example the bio that is stuck was created in step 4. The call
> >> to generic_make_request() will have placed it on current->bio_list[0].
> >> The top-level generic_make_request call by Process A is still running,
> >> so nothing will have moved the bio to ->bio_list[1]. That only happens
> >> after the ->make_request_fn completes, which must be after step 7.
> >> So the problem bio is on ->bio_list[0] and the code in my patch will
> >> pass it to a workqueue for handling.
> >>
> >> So I don't think the deadlock would be reintroduced. Can you see
> >> something that I am missing?
> >>
> >> Thanks,
> >> NeilBrown
> >
> > Offloading only current->bio_list[0] will work in a simple case described
> > above, but not in the general case.
> >
> > For example, suppose that we have a dm device where the first part is
> > linear and the second part is snapshot.
> >
> > * We receive bio A that crosses the linear/snapshot boundary
> > * DM core creates bio B, submits it to the linear target and adds it to
> > current->bio_list[0]
> > * DM core creates bio C, submits it to the snapshot target, the snapshot
> > target calls track_chunk for this bio and appends the bio to
> > current->bio_list[0]
> > * Now, we return back to generic_make_request
> > * We pop bio B from current->bio_list[0]
> > * We execute bio_list_on_stack[1] = bio_list_on_stack[0], that moves bio C
> > to bio_list_on_stack[1] - and now we lose any possibility to offload bio C
> > to the rescue thread
> > * The kcopyd thread for the snapshot takes the snapshot lock and waits for
> > bio C to finish
>
> Thanks for the explanation.
> I cannot find this last step in the code. "waits for BIO C to finish"
> is presumably the call to __check_for_conflicting_io(). This is
> called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
> What lock is held at that time?
The function pending_complete is called from the kcopyd callback. It takes
"down_write(&s->lock)" and calls __check_for_conflicting_io which waits
for I/Os to finish.
> > * We process bio B - and if processing bio B reaches something that takes
> > the snapshot lock (for example an origin target for the snapshot), a
> > deadlock will happen. The deadlock could be avoided by offloading bio C to
> > the rescue thread, but bio C is already on bio_list_on_stack[1] and so it
> > won't be offloaded
>
> I think the core issue behind this deadlock is that the same volume
> can appear twice in the top-level device, in different regions. An
> outstanding request to one region can then interfere with a new request
> to a different region. That seems like a reasonable thing to do.
> I cannot immediately see a generic way to handle this case that I am
> happy with. I'll keep hunting.
You can have a snapshot and an origin for the same device in the same
tree - it is not common, but it is possible.
Offloading bios queued on current->bio_list avoids the deadlock, but your
patch breaks it by offloading only the first list and not the second.
> Thanks,
> NeilBrown
Mikulas
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-04 16:54 ` Mikulas Patocka
@ 2017-09-05 23:43 ` NeilBrown
2017-09-06 22:57 ` Mikulas Patocka
0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2017-09-05 23:43 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Alasdair Kergon, Mike Snitzer
[-- Attachment #1.1: Type: text/plain, Size: 9171 bytes --]
On Mon, Sep 04 2017, Mikulas Patocka wrote:
> On Mon, 4 Sep 2017, NeilBrown wrote:
>
>> On Fri, Sep 01 2017, Mikulas Patocka wrote:
>>
>> > On Fri, 1 Sep 2017, NeilBrown wrote:
>> >
>> >> On Thu, Aug 31 2017, Mikulas Patocka wrote:
>> >>
>> >> >>
>> >> >> Note that only current->bio_list[0] is offloaded. current->bio_list[1]
>> >> >> contains bios that were scheduled *before* the current one started, so
>> >> >
>> >> > These bios need to be offloaded too, otherwise you re-introduce this
>> >> > deadlock: https://www.redhat.com/archives/dm-devel/2014-May/msg00089.html
>> >>
>> >> Thanks for the link.
>> >> In the example the bio that is stuck was created in step 4. The call
>> >> to generic_make_request() will have placed it on current->bio_list[0].
>> >> The top-level generic_make_request call by Process A is still running,
>> >> so nothing will have moved the bio to ->bio_list[1]. That only happens
>> >> after the ->make_request_fn completes, which must be after step 7.
>> >> So the problem bio is on ->bio_list[0] and the code in my patch will
>> >> pass it to a workqueue for handling.
>> >>
>> >> So I don't think the deadlock would be reintroduced. Can you see
>> >> something that I am missing?
>> >>
>> >> Thanks,
>> >> NeilBrown
>> >
>> > Offloading only current->bio_list[0] will work in a simple case described
>> > above, but not in the general case.
>> >
>> > For example, suppose that we have a dm device where the first part is
>> > linear and the second part is snapshot.
>> >
>> > * We receive bio A that crosses the linear/snapshot boundary
>> > * DM core creates bio B, submits it to the linear target and adds it to
>> > current->bio_list[0]
>> > * DM core creates bio C, submits it to the snapshot target, the snapshot
>> > target calls track_chunk for this bio and appends the bio to
>> > current->bio_list[0]
>> > * Now, we return back to generic_make_request
>> > * We pop bio B from current->bio_list[0]
>> > * We execute bio_list_on_stack[1] = bio_list_on_stack[0], that moves bio C
>> > to bio_list_on_stack[1] - and now we lose any possibility to offload bio C
>> > to the rescue thread
>> > * The kcopyd thread for the snapshot takes the snapshot lock and waits for
>> > bio C to finish
>>
>> Thanks for the explanation.
>> I cannot find this last step in the code. "waits for BIO C to finish"
>> is presumably the call to __check_for_conflicting_io(). This is
>> called from kcopyd in merge_callback() -> snapshot_merge_next_chunks().
>> What lock is held at that time?
>
> The function pending_complete is called from the kcopyd callback. It takes
> "down_write(&s->lock)" and calls __check_for_conflicting_io which waits
> for I/Os to finish.
>
>> > * We process bio B - and if processing bio B reaches something that takes
>> > the snapshot lock (for example an origin target for the snapshot), a
>> > deadlock will happen. The deadlock could be avoided by offloading bio C to
>> > the rescue thread, but bio C is already on bio_list_on_stack[1] and so it
>> > won't be offloaded
>>
>> I think the core issue behind this deadlock is that the same volume
>> can appear twice in the top-level device, in different regions. An
>> outstanding request to one region can then interfere with a new request
>> to a different region. That seems like a reasonable thing to do.
>> I cannot immediately see a generic way to handle this case that I am
>> happy with. I'll keep hunting.
>
> You can have a snapshot and an origin for the same device in the same
> tree - it is not common, but it is possible.
>
> Offloading bios queued on current->bio_list avoids the deadlock, but your
> patch breaks it by offloading only the first list and not the second.
>
Thanks for the extra explanation. I've thought some more about this and I
can see a way forward that I am comfortable with. Please let me know what
you think.
I haven't tested this patch yet, but I believe that applying it first will
clear the way for my other patch to work without reintroducing
deadlocks.
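To make the ordering concrete before the patch itself: the depth-first
resubmission it relies on can be modelled in a few lines of userspace C.
This is purely an illustrative sketch of the generic_make_request() queue
discipline — the names echo the kernel's, but every structure and the
8-sector split point are invented for the demo, and none of this is kernel
code:

```c
#include <stdlib.h>
#include <assert.h>

/* Toy userspace model of the generic_make_request() queue discipline. */

struct bio {
	int start, len;		/* sector range */
	struct bio *next;
};

struct bio_list { struct bio *head, *tail; };

static void bio_list_add(struct bio_list *bl, struct bio *b)
{
	b->next = NULL;
	if (bl->tail)
		bl->tail->next = b;
	else
		bl->head = b;
	bl->tail = b;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *b = bl->head;
	if (b) {
		bl->head = b->next;
		if (!bl->head)
			bl->tail = NULL;
	}
	return b;
}

static struct bio_list *current_bio_list;	/* stand-in for current->bio_list */
static int order[16], norder;			/* start sectors in processing order */

#define TARGET_SECTORS 8	/* pretend each target handles 8 sectors */

/* A make_request_fn in the style the patch proposes: process only the
 * first chunk, and hand the remainder back to generic_make_request()
 * instead of looping over further targets ourselves. */
static void toy_dm_make_request(struct bio *b)
{
	if (b->len > TARGET_SECTORS) {
		struct bio *rest = malloc(sizeof(*rest));
		rest->start = b->start + TARGET_SECTORS;
		rest->len = b->len - TARGET_SECTORS;
		b->len = TARGET_SECTORS;
		bio_list_add(current_bio_list, rest); /* resubmit: handled later */
	}
	order[norder++] = b->start;	/* "process" the first chunk now */
}

static void toy_generic_make_request(struct bio *b)
{
	struct bio_list bl = { NULL, NULL };

	current_bio_list = &bl;
	bio_list_add(&bl, b);
	while ((b = bio_list_pop(&bl))) {
		toy_dm_make_request(b);
		free(b);
	}
	current_bio_list = NULL;
}
```

Submitting one 20-sector bio drives three passes through the loop, and the
chunks come out strictly in order (sectors 0, 8, 16): a later chunk can
never end up queued ahead of an earlier one, which is the depth-first
property the commit message below depends on.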
Thanks,
NeilBrown
From: NeilBrown <neilb@suse.com>
Date: Wed, 6 Sep 2017 09:14:52 +1000
Subject: [PATCH] dm: ensure bio submission follows a depth-first tree walk.
A dm device can, in general, represent a tree of targets,
each of which handles a sub-range of the range of blocks handled
by the parent.
The bio sequencing managed by generic_make_request() requires that bios
are generated and handled in a depth-first manner. Each call to a
make_request_fn() may submit bios to a single member device, and may
submit bios for a reduced region of the same device as the
make_request_fn.
In particular, any bios submitted to member devices must be expected to
be processed in order, so a later one must never wait for an earlier
one.
This ordering is usually achieved by using bio_split() to reduce a bio
to a size that can be completely handled by one target, and resubmitting
the remainder to the originating device. bio_queue_split() shows the canonical
approach.
dm doesn't follow this approach, largely because it has needed to split
bios since long before bio_split() was available. It currently can
submit bios to separate targets within the one dm_make_request() call.
Dependencies between these targets, as can happen with dm-snap, can
cause deadlocks if either bio gets stuck behind the other in the queues
managed by generic_make_request(). This requires the 'rescue'
functionality provided by dm_offload*.
Some of this requirement can be removed by changing the order of bio
submission to follow the canonical approach. That is, if dm finds that
it needs to split a bio, the remainder should be sent to
generic_make_request() rather than being handled immediately. This delays
the handling until the first part is completely processed, so the
deadlock problems do not occur.
__split_and_process_bio() can be called both from dm_make_request() and
from dm_wq_work(). When called from dm_wq_work() the current approach
is perfectly satisfactory as each bio will be processed immediately.
When called from dm_make_request, current->bio_list will be non-NULL,
and in this case it is best to create a separate "clone" bio for the
remainder.
To provide the clone bio when splitting, we use q->bio_split. This
was previously being freed to avoid having excess rescuer threads. As
bio_split bio sets no longer create rescuer threads, there is little
cost and much gain from leaving the ->bio_split bio set in place.
Signed-off-by: NeilBrown <neilb@suse.com>
---
drivers/md/dm.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d669fddd9290..81395f7550c0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1007,7 +1007,7 @@ static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
- * allowed for all bio types except REQ_PREFLUSH.
+ * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
*
* dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be
@@ -1508,8 +1508,21 @@ static void __split_and_process_bio(struct mapped_device *md,
} else {
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
- while (ci.sector_count && !error)
+ while (ci.sector_count && !error) {
error = __split_and_process_non_flush(&ci);
+ if (current->bio_list && ci.sector_count && !error) {
+ /* Remainder must be passed to generic_make_request()
+ * so that it gets handled *after* bios already submitted
+ * have been completely processed.
+ */
+ struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
+ md->queue->bio_split);
+ bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
+ bio_chain(b, bio);
+ generic_make_request(b);
+ break;
+ }
+ }
}
/* drop the extra reference count */
@@ -1520,8 +1533,8 @@ static void __split_and_process_bio(struct mapped_device *md,
*---------------------------------------------------------------*/
/*
- * The request function that just remaps the bio built up by
- * dm_merge_bvec.
+ * The request function that remaps the bio to one target and
+ * splits off any remainder.
*/
static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
{
@@ -2056,12 +2069,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
case DM_TYPE_DAX_BIO_BASED:
dm_init_normal_md_queue(md);
blk_queue_make_request(md->queue, dm_make_request);
- /*
- * DM handles splitting bios as needed. Free the bio_split bioset
- * since it won't be used (saves 1 process per bio-based DM device).
- */
- bioset_free(md->queue->bio_split);
- md->queue->bio_split = NULL;
if (type == DM_TYPE_DAX_BIO_BASED)
queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
--
2.14.0
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-05 23:43 ` NeilBrown
@ 2017-09-06 22:57 ` Mikulas Patocka
2017-09-07 3:19 ` NeilBrown
0 siblings, 1 reply; 9+ messages in thread
From: Mikulas Patocka @ 2017-09-06 22:57 UTC (permalink / raw)
To: NeilBrown; +Cc: dm-devel, Alasdair Kergon, Mike Snitzer
On Wed, 6 Sep 2017, NeilBrown wrote:
> Thanks for the extra explanation. I've thought some more about this and I can
> see a way forward that I am comfortable with. Please let me know what
> you think.
> I haven't tested this patch yet, but I believe that applying it first will
> clear the way for my other patch to work without reintroducing
> deadlocks.
>
> Thanks,
> NeilBrown
What's the purpose of this patch? If the current code that offloads bios
on current->bio_list to a rescue thread works, why do you want to change
it?
Mikulas
> From: NeilBrown <neilb@suse.com>
> Date: Wed, 6 Sep 2017 09:14:52 +1000
> Subject: [PATCH] dm: ensure bio submission follows a depth-first tree walk.
>
> A dm device can, in general, represent a tree of targets,
> each of which handles a sub-range of the range of blocks handled
> by the parent.
>
> The bio sequencing managed by generic_make_request() requires that bios
> are generated and handled in a depth-first manner. Each call to a
> make_request_fn() may submit bios to a single member device, and may
> submit bios for a reduced region of the same device as the
> make_request_fn.
>
> In particular, any bios submitted to member devices must be expected to
> be processed in order, so a later one must never wait for an earlier
> one.
>
> This ordering is usually achieved by using bio_split() to reduce a bio
> to a size that can be completely handled by one target, and resubmitting
> the remainder to the originating device. bio_queue_split() shows the canonical
> approach.
>
> dm doesn't follow this approach, largely because it has needed to split
> bios since long before bio_split() was available. It currently can
> submit bios to separate targets within the one dm_make_request() call.
> Dependencies between these targets, as can happen with dm-snap, can
> cause deadlocks if either bio gets stuck behind the other in the queues
> managed by generic_make_request(). This requires the 'rescue'
> functionality provided by dm_offload*.
>
> Some of this requirement can be removed by changing the order of bio
> submission to follow the canonical approach. That is, if dm finds that
> it needs to split a bio, the remainder should be sent to
> generic_make_request() rather than being handled immediately. This delays
> the handling until the first part is completely processed, so the
> deadlock problems do not occur.
>
> __split_and_process_bio() can be called both from dm_make_request() and
> from dm_wq_work(). When called from dm_wq_work() the current approach
> is perfectly satisfactory as each bio will be processed immediately.
> When called from dm_make_request, current->bio_list will be non-NULL,
> and in this case it is best to create a separate "clone" bio for the
> remainder.
>
> To provide the clone bio when splitting, we use q->bio_split. This
> was previously being freed to avoid having excess rescuer threads. As
> bio_split bio sets no longer create rescuer threads, there is little
> cost and much gain from leaving the ->bio_split bio set in place.
>
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> drivers/md/dm.c | 27 +++++++++++++++++----------
> 1 file changed, 17 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index d669fddd9290..81395f7550c0 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1007,7 +1007,7 @@ static void dm_dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>
> /*
> * A target may call dm_accept_partial_bio only from the map routine. It is
> - * allowed for all bio types except REQ_PREFLUSH.
> + * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
> *
> * dm_accept_partial_bio informs the dm that the target only wants to process
> * additional n_sectors sectors of the bio and the rest of the data should be
> @@ -1508,8 +1508,21 @@ static void __split_and_process_bio(struct mapped_device *md,
> } else {
> ci.bio = bio;
> ci.sector_count = bio_sectors(bio);
> - while (ci.sector_count && !error)
> + while (ci.sector_count && !error) {
> error = __split_and_process_non_flush(&ci);
> + if (current->bio_list && ci.sector_count && !error) {
> + /* Remainder must be passed to generic_make_request()
> + * so that it gets handled *after* bios already submitted
> + * have been completely processed.
> + */
> + struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
> + md->queue->bio_split);
> + bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> + bio_chain(b, bio);
> + generic_make_request(b);
> + break;
> + }
> + }
> }
>
> /* drop the extra reference count */
> @@ -1520,8 +1533,8 @@ static void __split_and_process_bio(struct mapped_device *md,
> *---------------------------------------------------------------*/
>
> /*
> - * The request function that just remaps the bio built up by
> - * dm_merge_bvec.
> + * The request function that remaps the bio to one target and
> + * splits off any remainder.
> */
> static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
> {
> @@ -2056,12 +2069,6 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
> case DM_TYPE_DAX_BIO_BASED:
> dm_init_normal_md_queue(md);
> blk_queue_make_request(md->queue, dm_make_request);
> - /*
> - * DM handles splitting bios as needed. Free the bio_split bioset
> - * since it won't be used (saves 1 process per bio-based DM device).
> - */
> - bioset_free(md->queue->bio_split);
> - md->queue->bio_split = NULL;
>
> if (type == DM_TYPE_DAX_BIO_BASED)
> queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
> --
> 2.14.0
>
>
* Re: [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations
2017-09-06 22:57 ` Mikulas Patocka
@ 2017-09-07 3:19 ` NeilBrown
0 siblings, 0 replies; 9+ messages in thread
From: NeilBrown @ 2017-09-07 3:19 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: dm-devel, Alasdair Kergon, Mike Snitzer
[-- Attachment #1.1: Type: text/plain, Size: 2224 bytes --]
On Wed, Sep 06 2017, Mikulas Patocka wrote:
> On Wed, 6 Sep 2017, NeilBrown wrote:
>
>> Thanks for the extra explanation. I've thought some more about this and I can
>> see a way forward that I am comfortable with. Please let me know what
>> you think.
>> I haven't tested this patch yet, but I believe that applying it first will
>> clear the way for my other patch to work without reintroducing
>> deadlocks.
>>
>> Thanks,
>> NeilBrown
>
> What's the purpose of this patch? If the current code that offloads bios
> on current->bio_list to a rescue thread works, why do you want to change
> it?
That is a fair question.
My primary reason is that I don't like the 'rescue' functionality built
into biosets, and I want to get rid of it together with all uses of
BIOSET_NEED_RESCUER. I think this functionality is unnecessary and adds
complexity. It superficially seems like it should fix any deadlock
issue created by the ordering that generic_make_request() requires, but
as you discovered with dm-snap, it isn't sufficient by itself. So it is
an incomplete solution that mostly isn't necessary, and I don't think it
is good design to keep such things around.
I also like uniformity where practical. The block layer provides a
simple coherent infrastructure for splitting requests when needed, and
if simple guidelines are followed concerning how the resulting bios are
submitted to generic_make_request(), then deadlocks can be avoided.
Most devices use this common infrastructure, the exceptions being dm and
bcache (though I really haven't wrapped my mind around bcache yet so I
cannot say much in concrete terms about what it does).
So I would like dm to handle over-large bios by allocating a clone from
the ->bio_split pool, chaining it with bio_chain(), and resubmitting it
via generic_make_request().
On reflection, I think that just this change might address all your
deadlock issues. We might be able to get rid of the dm_offload stuff
completely, though I'm not 100% sure yet.
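The completion-ordering guarantee that bio_chain() gives the split-off
remainder can be sketched as a tiny reference-count model. This is an
illustrative userspace toy, loosely modelled on the kernel's
remaining-count scheme in block/bio.c; all the types and names here are
invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of bio_chain() completion semantics: the parent bio is not
 * observed as complete until the chained child has also completed. */

struct cbio {
	int remaining;		/* outstanding completions */
	int done;		/* observable "fully completed" flag */
	struct cbio *parent;
};

static void cbio_init(struct cbio *b)
{
	b->remaining = 1;	/* the bio's own I/O */
	b->done = 0;
	b->parent = NULL;
}

static void cbio_endio(struct cbio *b)
{
	if (--b->remaining > 0)
		return;			/* still waiting on a chained child */
	b->done = 1;
	if (b->parent)
		cbio_endio(b->parent);	/* propagate completion upwards */
}

/* Chain 'child' to 'parent': the parent additionally waits for the child. */
static void cbio_chain(struct cbio *child, struct cbio *parent)
{
	child->parent = parent;
	parent->remaining++;
}
```

In the dm patch earlier in this thread, the remainder clone plays the
child and the original bio the parent, so the original cannot be completed
before the resubmitted remainder has been fully processed — even though
the remainder is handled later in the depth-first walk.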
So in brief:
1/ use same approach to bio splitting everywhere - an approach that is
resistant to deadlocks.
2/ discard the incomplete and unnecessary BIOSET_NEED_RESCUER
functionality.
Thanks,
NeilBrown
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]
[-- Attachment #2: Type: text/plain, Size: 0 bytes --]
end of thread, other threads:[~2017-09-07 3:19 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-31 6:18 [PATCH/rfc] dm: revise 'rescue' strategy for md->bs allocations NeilBrown
2017-09-01 2:38 ` Mikulas Patocka
2017-09-01 3:48 ` NeilBrown
2017-09-02 1:34 ` Mikulas Patocka
2017-09-04 0:12 ` NeilBrown
2017-09-04 16:54 ` Mikulas Patocka
2017-09-05 23:43 ` NeilBrown
2017-09-06 22:57 ` Mikulas Patocka
2017-09-07 3:19 ` NeilBrown