All of lore.kernel.org
 help / color / mirror / Atom feed
From: NeilBrown <neilb@suse.com>
To: Mikulas Patocka <mpatocka@redhat.com>, Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-block@vger.kernel.org,
	device-mapper development <dm-devel@redhat.com>,
	Zdenek Kabelac <zkabelac@redhat.com>
Subject: Re: [dm-devel] new patchset to eliminate DM's use of	BIOSET_NEED_RESCUER
Date: Wed, 22 Nov 2017 15:00:51 +1100	[thread overview]
Message-ID: <87bmjv0xos.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <alpine.LRH.2.02.1711212009040.28456@file01.intranet.prod.int.rdu2.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 11093 bytes --]

On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  4:23pm -0500,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>> 
>> > This is not correct:
>> > 
>> >    2206 static void dm_wq_work(struct work_struct *work)
>> >    2207 {
>> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >    2209         struct bio *bio;
>> >    2210         int srcu_idx;
>> >    2211         struct dm_table *map;
>> >    2212
>> >    2213         if (!bio_list_empty(&md->rescued)) {
>> >    2214                 struct bio_list list;
>> >    2215                 spin_lock_irq(&md->deferred_lock);
>> >    2216                 list = md->rescued;
>> >    2217                 bio_list_init(&md->rescued);
>> >    2218                 spin_unlock_irq(&md->deferred_lock);
>> >    2219                 while ((bio = bio_list_pop(&list)))
>> >    2220                         generic_make_request(bio);
>> >    2221         }
>> >    2222
>> >    2223         map = dm_get_live_table(md, &srcu_idx);
>> >    2224
>> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >    2226                 spin_lock_irq(&md->deferred_lock);
>> >    2227                 bio = bio_list_pop(&md->deferred);
>> >    2228                 spin_unlock_irq(&md->deferred_lock);
>> >    2229
>> >    2230                 if (!bio)
>> >    2231                         break;
>> >    2232
>> >    2233                 if (dm_request_based(md))
>> >    2234                         generic_make_request(bio);
>> >    2235                 else
>> >    2236                         __split_and_process_bio(md, map, bio);
>> >    2237         }
>> >    2238
>> >    2239         dm_put_live_table(md, srcu_idx);
>> >    2240 }
>> > 
>> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> > will not process md->rescued list.
>> 
>> Can you elaborate further?  We cannot be "in dm_wq_work in
>> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> of scheduling away from __split_and_process_bio?
>> 
>> The more detail you can share the better.
>
> Suppose this scenario:
>
> * dm_wq_work calls __split_and_process_bio
> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
>
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

Yes, I think you are right. 

I think the solution is to get rid of the dm_offload() infrastructure
and make it not necessary.
i.e. discard my patches
    dm: prepare to discontinue use of BIOSET_NEED_RESCUER
and
    dm: revise 'rescue' strategy for bio-based bioset allocations

And build on "dm: ensure bio submission follows a depth-first tree walk"
which was written after those and already makes dm_offload() less
important.

Since that "depth-first" patch, every request to the dm device, after
the initial splitting, allocates just one dm_target_io structure, and
makes just one __map_bio() call, and so will behave exactly the way
generic_make_request() expects and copes with - thus avoiding awkward
dependencies and deadlocks.  Except....

a/ If any target defines ->num_write_bios() to return >1,
   __clone_and_map_data_bio() will make multiple calls to alloc_tio()
   and __map_bio(), which might need rescuing.
   But no target defines num_write_bios, and none have since it was
   removed from dm-cache 4.5 years ago.
   Can we discard num_write_bios??

b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
   to a value > 1, then __send_duplicate_bios() will also make multiple
   calls to alloc_tio() and __map_bio().
   Some do.
     dm-cache-target:  flush=2
     dm-snap: flush=2
     dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.

These will only be a problem if the second (or subsequent) alloc_tio()
blocks waiting for an earlier allocation to complete.  This will only
be a problem if multiple threads are each trying to allocate multiple
dm_target_io from the same bioset at the same time.
This is rare and should be easier to address than the current
dm_offload() approach. 
One possibility would be to copy the approach taken by
crypt_alloc_buffer() which needs to allocate multiple entries from a
mempool.
It first tries the with GFP_NOWAIT.  If that fails it take a mutex and
tries with GFP_NOIO.  This mean only one thread will try to allocate
multiple bios at once, so there can be no deadlock.

Below are two RFC patches.  The first removes num_write_bios.
The second is incomplete and makes a stab are allocating multiple bios
at once safely.
A third would be needed to remove dm_offload() etc... but I cannot quite
fit that in today - must be off.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Wed, 22 Nov 2017 14:25:18 +1100
Subject: [PATCH] DM: remove num_write_bios target interface.

No target provides num_write_bios and none has done
since 2013.
Having the possibility of num_write_bios > 1 complicates
bio allocation.
So remove the interface and assume there is only one bio
needed.
If a target ever needs more, it must provide a suitable
bioset and allocate itself based on its particular needs.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/dm.c               | 22 ++++------------------
 include/linux/device-mapper.h | 15 ---------------
 2 files changed, 4 insertions(+), 33 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b20febd6cbc7..8c1a05609eea 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1323,27 +1323,13 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 {
 	struct bio *bio = ci->bio;
 	struct dm_target_io *tio;
-	unsigned target_bio_nr;
-	unsigned num_target_bios = 1;
 	int r = 0;
 
-	/*
-	 * Does the target want to receive duplicate copies of the bio?
-	 */
-	if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
-		num_target_bios = ti->num_write_bios(ti, bio);
-
-	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
-		tio = alloc_tio(ci, ti, target_bio_nr);
-		tio->len_ptr = len;
-		r = clone_bio(tio, bio, sector, *len);
-		if (r < 0) {
-			free_tio(tio);
-			break;
-		}
+	tio = alloc_tio(ci, ti, 0);
+	tio->len_ptr = len;
+	r = clone_bio(tio, bio, sector, *len);
+	if (r >= 0)
 		__map_bio(tio);
-	}
-
 	return r;
 }
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a5538433c927..5a68b366e664 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -220,14 +220,6 @@ struct target_type {
 #define DM_TARGET_WILDCARD		0x00000008
 #define dm_target_is_wildcard(type)	((type)->features & DM_TARGET_WILDCARD)
 
-/*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices.  This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
 /*
  * A target implements own bio data integrity.
  */
@@ -291,13 +283,6 @@ struct dm_target {
 	 */
 	unsigned per_io_data_size;
 
-	/*
-	 * If defined, this function is called to find out how many
-	 * duplicate bios should be sent to the target when writing
-	 * data.
-	 */
-	dm_num_write_bios_fn num_write_bios;
-
 	/* target specific data */
 	void *private;
 
-- 
2.14.0.rc0.dirty


-----------------------------------
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c1a05609eea..8762661df2ef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1265,8 +1265,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
 }
 
 static struct dm_target_io *alloc_tio(struct clone_info *ci,
-				      struct dm_target *ti,
-				      unsigned target_bio_nr)
+				      struct dm_target *ti)
 {
 	struct dm_target_io *tio;
 	struct bio *clone;
@@ -1276,34 +1275,66 @@ static struct dm_target_io *alloc_tio(struct clone_info *ci,
 
 	tio->io = ci->io;
 	tio->ti = ti;
-	tio->target_bio_nr = target_bio_nr;
+	tio->target_bio_nr = 0;
 
 	return tio;
 }
 
-static void __clone_and_map_simple_bio(struct clone_info *ci,
-				       struct dm_target *ti,
-				       unsigned target_bio_nr, unsigned *len)
+static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
+				struct dm_target *ti, unsigned num_bios)
 {
-	struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
-	struct bio *clone = &tio->clone;
+	int try;
 
-	tio->len_ptr = len;
+	for (try = 0; try < 2; try++) {
+		int bio_nr;
+		struct bio *bio;
+
+		if (try)
+			mutex_lock(&ci->md->table_devices_lock);
+		for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
+			bio = bio_alloc_bioset(try ? GFP_NOIO : GFP_NOWAIT,
+					       0, ci->md->bs);
+			if (bio) {
+				struct dm_target_io *tio;
+				bio_list_add(blist, bio);
+				tio = container_of(bio, struct dm_target_io, clone);
 
-	__bio_clone_fast(clone, ci->bio);
-	if (len)
-		bio_setup_sector(clone, ci->sector, *len);
+				tio->io = ci->io;
+				tio->ti = ti;
+				tio->target_bio_nr = bio_nr;
+			} else
+				break;
+		}
+		if (try)
+			mutex_unlock(&ci->md->table_devices_lock);
+		if (bio_nr == num_bios)
+			return;
 
-	__map_bio(tio);
+		while ((bio = bio_list_pop(blist)) != NULL)
+			bio_put(bio);
+	}
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 				  unsigned num_bios, unsigned *len)
 {
-	unsigned target_bio_nr;
+	struct bio_list blist = BIO_EMPTY_LIST;
+	struct bio *bio;
 
-	for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++)
-		__clone_and_map_simple_bio(ci, ti, target_bio_nr, len);
+	if (num_bios == 1)
+		bio_list_add(&blist, &alloc_tio(ci, ti)->clone);
+	else
+		alloc_multiple_bios(&blist, ci, ti, num_bios);
+
+	while ((bio = bio_list_pop(&blist)) != NULL) {
+		struct dm_target_io *tio = container_of(
+			bio, struct dm_target_io, clone);
+		tio->len_ptr = len;
+		__bio_clone_fast(bio, ci->bio);
+		if (len)
+			bio_setup_sector(bio, ci->sector, *len);
+		__map_bio(tio);
+	}
 }
 
 static int __send_empty_flush(struct clone_info *ci)
@@ -1325,7 +1356,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 	struct dm_target_io *tio;
 	int r = 0;
 
-	tio = alloc_tio(ci, ti, 0);
+	tio = alloc_tio(ci, ti);
 	tio->len_ptr = len;
 	r = clone_bio(tio, bio, sector, *len);
 	if (r >= 0)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

WARNING: multiple messages have this Message-ID (diff)
From: NeilBrown <neilb@suse.com>
To: Mikulas Patocka <mpatocka@redhat.com>, Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	"linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-block@vger.kernel.org,
	device-mapper development <dm-devel@redhat.com>,
	Zdenek Kabelac <zkabelac@redhat.com>
Subject: Re: [dm-devel] new patchset to eliminate DM's use of   BIOSET_NEED_RESCUER
Date: Wed, 22 Nov 2017 15:00:51 +1100	[thread overview]
Message-ID: <87bmjv0xos.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <alpine.LRH.2.02.1711212009040.28456@file01.intranet.prod.int.rdu2.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 11093 bytes --]

On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  4:23pm -0500,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>> 
>> > This is not correct:
>> > 
>> >    2206 static void dm_wq_work(struct work_struct *work)
>> >    2207 {
>> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >    2209         struct bio *bio;
>> >    2210         int srcu_idx;
>> >    2211         struct dm_table *map;
>> >    2212
>> >    2213         if (!bio_list_empty(&md->rescued)) {
>> >    2214                 struct bio_list list;
>> >    2215                 spin_lock_irq(&md->deferred_lock);
>> >    2216                 list = md->rescued;
>> >    2217                 bio_list_init(&md->rescued);
>> >    2218                 spin_unlock_irq(&md->deferred_lock);
>> >    2219                 while ((bio = bio_list_pop(&list)))
>> >    2220                         generic_make_request(bio);
>> >    2221         }
>> >    2222
>> >    2223         map = dm_get_live_table(md, &srcu_idx);
>> >    2224
>> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >    2226                 spin_lock_irq(&md->deferred_lock);
>> >    2227                 bio = bio_list_pop(&md->deferred);
>> >    2228                 spin_unlock_irq(&md->deferred_lock);
>> >    2229
>> >    2230                 if (!bio)
>> >    2231                         break;
>> >    2232
>> >    2233                 if (dm_request_based(md))
>> >    2234                         generic_make_request(bio);
>> >    2235                 else
>> >    2236                         __split_and_process_bio(md, map, bio);
>> >    2237         }
>> >    2238
>> >    2239         dm_put_live_table(md, srcu_idx);
>> >    2240 }
>> > 
>> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> > will not process md->rescued list.
>> 
>> Can you elaborate further?  We cannot be "in dm_wq_work in
>> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> of scheduling away from __split_and_process_bio?
>> 
>> The more detail you can share the better.
>
> Suppose this scenario:
>
> * dm_wq_work calls __split_and_process_bio
> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
>
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

Yes, I think you are right. 

I think the solution is to get rid of the dm_offload() infrastructure
and make it not necessary.
i.e. discard my patches
    dm: prepare to discontinue use of BIOSET_NEED_RESCUER
and
    dm: revise 'rescue' strategy for bio-based bioset allocations

And build on "dm: ensure bio submission follows a depth-first tree walk"
which was written after those and already makes dm_offload() less
important.

Since that "depth-first" patch, every request to the dm device, after
the initial splitting, allocates just one dm_target_io structure, and
makes just one __map_bio() call, and so will behave exactly the way
generic_make_request() expects and copes with - thus avoiding awkward
dependencies and deadlocks.  Except....

a/ If any target defines ->num_write_bios() to return >1,
   __clone_and_map_data_bio() will make multiple calls to alloc_tio()
   and __map_bio(), which might need rescuing.
   But no target defines num_write_bios, and none have since it was
   removed from dm-cache 4.5 years ago.
   Can we discard num_write_bios??

b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
   to a value > 1, then __send_duplicate_bios() will also make multiple
   calls to alloc_tio() and __map_bio().
   Some do.
     dm-cache-target:  flush=2
     dm-snap: flush=2
     dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.

These will only be a problem if the second (or subsequent) alloc_tio()
blocks waiting for an earlier allocation to complete.  This will only
be a problem if multiple threads are each trying to allocate multiple
dm_target_io from the same bioset at the same time.
This is rare and should be easier to address than the current
dm_offload() approach. 
One possibility would be to copy the approach taken by
crypt_alloc_buffer() which needs to allocate multiple entries from a
mempool.
It first tries the with GFP_NOWAIT.  If that fails it take a mutex and
tries with GFP_NOIO.  This mean only one thread will try to allocate
multiple bios at once, so there can be no deadlock.

Below are two RFC patches.  The first removes num_write_bios.
The second is incomplete and makes a stab are allocating multiple bios
at once safely.
A third would be needed to remove dm_offload() etc... but I cannot quite
fit that in today - must be off.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Wed, 22 Nov 2017 14:25:18 +1100
Subject: [PATCH] DM: remove num_write_bios target interface.

No target provides num_write_bios and none has done
since 2013.
Having the possibility of num_write_bios > 1 complicates
bio allocation.
So remove the interface and assume there is only one bio
needed.
If a target ever needs more, it must provide a suitable
bioset and allocate itself based on its particular needs.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/dm.c               | 22 ++++------------------
 include/linux/device-mapper.h | 15 ---------------
 2 files changed, 4 insertions(+), 33 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b20febd6cbc7..8c1a05609eea 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1323,27 +1323,13 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 {
 	struct bio *bio = ci->bio;
 	struct dm_target_io *tio;
-	unsigned target_bio_nr;
-	unsigned num_target_bios = 1;
 	int r = 0;
 
-	/*
-	 * Does the target want to receive duplicate copies of the bio?
-	 */
-	if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
-		num_target_bios = ti->num_write_bios(ti, bio);
-
-	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
-		tio = alloc_tio(ci, ti, target_bio_nr);
-		tio->len_ptr = len;
-		r = clone_bio(tio, bio, sector, *len);
-		if (r < 0) {
-			free_tio(tio);
-			break;
-		}
+	tio = alloc_tio(ci, ti, 0);
+	tio->len_ptr = len;
+	r = clone_bio(tio, bio, sector, *len);
+	if (r >= 0)
 		__map_bio(tio);
-	}
-
 	return r;
 }
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a5538433c927..5a68b366e664 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -220,14 +220,6 @@ struct target_type {
 #define DM_TARGET_WILDCARD		0x00000008
 #define dm_target_is_wildcard(type)	((type)->features & DM_TARGET_WILDCARD)
 
-/*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices.  This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
 /*
  * A target implements own bio data integrity.
  */
@@ -291,13 +283,6 @@ struct dm_target {
 	 */
 	unsigned per_io_data_size;
 
-	/*
-	 * If defined, this function is called to find out how many
-	 * duplicate bios should be sent to the target when writing
-	 * data.
-	 */
-	dm_num_write_bios_fn num_write_bios;
-
 	/* target specific data */
 	void *private;
 
-- 
2.14.0.rc0.dirty


-----------------------------------
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c1a05609eea..8762661df2ef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1265,8 +1265,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
 }
 
 static struct dm_target_io *alloc_tio(struct clone_info *ci,
-				      struct dm_target *ti,
-				      unsigned target_bio_nr)
+				      struct dm_target *ti)
 {
 	struct dm_target_io *tio;
 	struct bio *clone;
@@ -1276,34 +1275,66 @@ static struct dm_target_io *alloc_tio(struct clone_info *ci,
 
 	tio->io = ci->io;
 	tio->ti = ti;
-	tio->target_bio_nr = target_bio_nr;
+	tio->target_bio_nr = 0;
 
 	return tio;
 }
 
-static void __clone_and_map_simple_bio(struct clone_info *ci,
-				       struct dm_target *ti,
-				       unsigned target_bio_nr, unsigned *len)
+static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
+				struct dm_target *ti, unsigned num_bios)
 {
-	struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
-	struct bio *clone = &tio->clone;
+	int try;
 
-	tio->len_ptr = len;
+	for (try = 0; try < 2; try++) {
+		int bio_nr;
+		struct bio *bio;
+
+		if (try)
+			mutex_lock(&ci->md->table_devices_lock);
+		for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
+			bio = bio_alloc_bioset(try ? GFP_NOIO : GFP_NOWAIT,
+					       0, ci->md->bs);
+			if (bio) {
+				struct dm_target_io *tio;
+				bio_list_add(blist, bio);
+				tio = container_of(bio, struct dm_target_io, clone);
 
-	__bio_clone_fast(clone, ci->bio);
-	if (len)
-		bio_setup_sector(clone, ci->sector, *len);
+				tio->io = ci->io;
+				tio->ti = ti;
+				tio->target_bio_nr = bio_nr;
+			} else
+				break;
+		}
+		if (try)
+			mutex_unlock(&ci->md->table_devices_lock);
+		if (bio_nr == num_bios)
+			return;
 
-	__map_bio(tio);
+		while ((bio = bio_list_pop(blist)) != NULL)
+			bio_put(bio);
+	}
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 				  unsigned num_bios, unsigned *len)
 {
-	unsigned target_bio_nr;
+	struct bio_list blist = BIO_EMPTY_LIST;
+	struct bio *bio;
 
-	for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++)
-		__clone_and_map_simple_bio(ci, ti, target_bio_nr, len);
+	if (num_bios == 1)
+		bio_list_add(&blist, &alloc_tio(ci, ti)->clone);
+	else
+		alloc_multiple_bios(&blist, ci, ti, num_bios);
+
+	while ((bio = bio_list_pop(&blist)) != NULL) {
+		struct dm_target_io *tio = container_of(
+			bio, struct dm_target_io, clone);
+		tio->len_ptr = len;
+		__bio_clone_fast(bio, ci->bio);
+		if (len)
+			bio_setup_sector(bio, ci->sector, *len);
+		__map_bio(tio);
+	}
 }
 
 static int __send_empty_flush(struct clone_info *ci)
@@ -1325,7 +1356,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 	struct dm_target_io *tio;
 	int r = 0;
 
-	tio = alloc_tio(ci, ti, 0);
+	tio = alloc_tio(ci, ti);
 	tio->len_ptr = len;
 	r = clone_bio(tio, bio, sector, *len);
 	if (r >= 0)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

WARNING: multiple messages have this Message-ID (diff)
From: NeilBrown <neilb@suse.com>
To: Mikulas Patocka <mpatocka@redhat.com>, Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	linux-block@vger.kernel.org,
	device-mapper development <dm-devel@redhat.com>,
	Zdenek Kabelac <zkabelac@redhat.com>
Subject: Re: [dm-devel] new patchset to eliminate DM's use of   BIOSET_NEED_RESCUER
Date: Wed, 22 Nov 2017 15:00:51 +1100	[thread overview]
Message-ID: <87bmjv0xos.fsf@notabene.neil.brown.name> (raw)
In-Reply-To: <alpine.LRH.2.02.1711212009040.28456@file01.intranet.prod.int.rdu2.redhat.com>

[-- Attachment #1: Type: text/plain, Size: 11093 bytes --]

On Tue, Nov 21 2017, Mikulas Patocka wrote:

> On Tue, 21 Nov 2017, Mike Snitzer wrote:
>
>> On Tue, Nov 21 2017 at  4:23pm -0500,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>> 
>> > This is not correct:
>> > 
>> >    2206 static void dm_wq_work(struct work_struct *work)
>> >    2207 {
>> >    2208         struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >    2209         struct bio *bio;
>> >    2210         int srcu_idx;
>> >    2211         struct dm_table *map;
>> >    2212
>> >    2213         if (!bio_list_empty(&md->rescued)) {
>> >    2214                 struct bio_list list;
>> >    2215                 spin_lock_irq(&md->deferred_lock);
>> >    2216                 list = md->rescued;
>> >    2217                 bio_list_init(&md->rescued);
>> >    2218                 spin_unlock_irq(&md->deferred_lock);
>> >    2219                 while ((bio = bio_list_pop(&list)))
>> >    2220                         generic_make_request(bio);
>> >    2221         }
>> >    2222
>> >    2223         map = dm_get_live_table(md, &srcu_idx);
>> >    2224
>> >    2225         while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >    2226                 spin_lock_irq(&md->deferred_lock);
>> >    2227                 bio = bio_list_pop(&md->deferred);
>> >    2228                 spin_unlock_irq(&md->deferred_lock);
>> >    2229
>> >    2230                 if (!bio)
>> >    2231                         break;
>> >    2232
>> >    2233                 if (dm_request_based(md))
>> >    2234                         generic_make_request(bio);
>> >    2235                 else
>> >    2236                         __split_and_process_bio(md, map, bio);
>> >    2237         }
>> >    2238
>> >    2239         dm_put_live_table(md, srcu_idx);
>> >    2240 }
>> > 
>> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> > will not process md->rescued list.
>> 
>> Can you elaborate further?  We cannot be "in dm_wq_work in
>> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> of scheduling away from __split_and_process_bio?
>> 
>> The more detail you can share the better.
>
> Suppose this scenario:
>
> * dm_wq_work calls __split_and_process_bio
> * __split_and_process_bio eventually reaches the function snapshot_map
> * snapshot_map attempts to take the snapshot lock
>
> * the snapshot lock could be released only if some bios submitted by the 
> snapshot driver to the underlying device complete
> * the bios submitted to the underlying device were already offloaded by 
> some other task and they are waiting on the list md->rescued
> * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> blocked in snapshot_map (called from __split_and_process_bio)

Yes, I think you are right. 

I think the solution is to get rid of the dm_offload() infrastructure
and make it not necessary.
i.e. discard my patches
    dm: prepare to discontinue use of BIOSET_NEED_RESCUER
and
    dm: revise 'rescue' strategy for bio-based bioset allocations

And build on "dm: ensure bio submission follows a depth-first tree walk"
which was written after those and already makes dm_offload() less
important.

Since that "depth-first" patch, every request to the dm device, after
the initial splitting, allocates just one dm_target_io structure, and
makes just one __map_bio() call, and so will behave exactly the way
generic_make_request() expects and copes with - thus avoiding awkward
dependencies and deadlocks.  Except....

a/ If any target defines ->num_write_bios() to return >1,
   __clone_and_map_data_bio() will make multiple calls to alloc_tio()
   and __map_bio(), which might need rescuing.
   But no target defines num_write_bios, and none have since it was
   removed from dm-cache 4.5 years ago.
   Can we discard num_write_bios??

b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
   to a value > 1, then __send_duplicate_bios() will also make multiple
   calls to alloc_tio() and __map_bio().
   Some do.
     dm-cache-target:  flush=2
     dm-snap: flush=2
     dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.

These will only be a problem if the second (or subsequent) alloc_tio()
blocks waiting for an earlier allocation to complete.  This will only
be a problem if multiple threads are each trying to allocate multiple
dm_target_io from the same bioset at the same time.
This is rare and should be easier to address than the current
dm_offload() approach. 
One possibility would be to copy the approach taken by
crypt_alloc_buffer() which needs to allocate multiple entries from a
mempool.
It first tries the with GFP_NOWAIT.  If that fails it take a mutex and
tries with GFP_NOIO.  This mean only one thread will try to allocate
multiple bios at once, so there can be no deadlock.

Below are two RFC patches.  The first removes num_write_bios.
The second is incomplete and makes a stab are allocating multiple bios
at once safely.
A third would be needed to remove dm_offload() etc... but I cannot quite
fit that in today - must be off.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Wed, 22 Nov 2017 14:25:18 +1100
Subject: [PATCH] DM: remove num_write_bios target interface.

No target provides num_write_bios and none has done
since 2013.
Having the possibility of num_write_bios > 1 complicates
bio allocation.
So remove the interface and assume there is only one bio
needed.
If a target ever needs more, it must provide a suitable
bioset and allocate itself based on its particular needs.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 drivers/md/dm.c               | 22 ++++------------------
 include/linux/device-mapper.h | 15 ---------------
 2 files changed, 4 insertions(+), 33 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b20febd6cbc7..8c1a05609eea 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1323,27 +1323,13 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 {
 	struct bio *bio = ci->bio;
 	struct dm_target_io *tio;
-	unsigned target_bio_nr;
-	unsigned num_target_bios = 1;
 	int r = 0;
 
-	/*
-	 * Does the target want to receive duplicate copies of the bio?
-	 */
-	if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
-		num_target_bios = ti->num_write_bios(ti, bio);
-
-	for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
-		tio = alloc_tio(ci, ti, target_bio_nr);
-		tio->len_ptr = len;
-		r = clone_bio(tio, bio, sector, *len);
-		if (r < 0) {
-			free_tio(tio);
-			break;
-		}
+	tio = alloc_tio(ci, ti, 0);
+	tio->len_ptr = len;
+	r = clone_bio(tio, bio, sector, *len);
+	if (r >= 0)
 		__map_bio(tio);
-	}
-
 	return r;
 }
 
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a5538433c927..5a68b366e664 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -220,14 +220,6 @@ struct target_type {
 #define DM_TARGET_WILDCARD		0x00000008
 #define dm_target_is_wildcard(type)	((type)->features & DM_TARGET_WILDCARD)
 
-/*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices.  This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
 /*
  * A target implements own bio data integrity.
  */
@@ -291,13 +283,6 @@ struct dm_target {
 	 */
 	unsigned per_io_data_size;
 
-	/*
-	 * If defined, this function is called to find out how many
-	 * duplicate bios should be sent to the target when writing
-	 * data.
-	 */
-	dm_num_write_bios_fn num_write_bios;
-
 	/* target specific data */
 	void *private;
 
-- 
2.14.0.rc0.dirty


-----------------------------------
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c1a05609eea..8762661df2ef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1265,8 +1265,7 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
 }
 
 static struct dm_target_io *alloc_tio(struct clone_info *ci,
-				      struct dm_target *ti,
-				      unsigned target_bio_nr)
+				      struct dm_target *ti)
 {
 	struct dm_target_io *tio;
 	struct bio *clone;
@@ -1276,34 +1275,66 @@ static struct dm_target_io *alloc_tio(struct clone_info *ci,
 
 	tio->io = ci->io;
 	tio->ti = ti;
-	tio->target_bio_nr = target_bio_nr;
+	tio->target_bio_nr = 0;
 
 	return tio;
 }
 
-static void __clone_and_map_simple_bio(struct clone_info *ci,
-				       struct dm_target *ti,
-				       unsigned target_bio_nr, unsigned *len)
+static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
+				struct dm_target *ti, unsigned num_bios)
 {
-	struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
-	struct bio *clone = &tio->clone;
+	int try;
 
-	tio->len_ptr = len;
+	for (try = 0; try < 2; try++) {
+		int bio_nr;
+		struct bio *bio;
+
+		if (try)
+			mutex_lock(&ci->md->table_devices_lock);
+		for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
+			bio = bio_alloc_bioset(try ? GFP_NOIO : GFP_NOWAIT,
+					       0, ci->md->bs);
+			if (bio) {
+				struct dm_target_io *tio;
+				bio_list_add(blist, bio);
+				tio = container_of(bio, struct dm_target_io, clone);
 
-	__bio_clone_fast(clone, ci->bio);
-	if (len)
-		bio_setup_sector(clone, ci->sector, *len);
+				tio->io = ci->io;
+				tio->ti = ti;
+				tio->target_bio_nr = bio_nr;
+			} else
+				break;
+		}
+		if (try)
+			mutex_unlock(&ci->md->table_devices_lock);
+		if (bio_nr == num_bios)
+			return;
 
-	__map_bio(tio);
+		while ((bio = bio_list_pop(blist)) != NULL)
+			bio_put(bio);
+	}
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 				  unsigned num_bios, unsigned *len)
 {
-	unsigned target_bio_nr;
+	struct bio_list blist = BIO_EMPTY_LIST;
+	struct bio *bio;
 
-	for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++)
-		__clone_and_map_simple_bio(ci, ti, target_bio_nr, len);
+	if (num_bios == 1)
+		bio_list_add(&blist, &alloc_tio(ci, ti)->clone);
+	else
+		alloc_multiple_bios(&blist, ci, ti, num_bios);
+
+	while ((bio = bio_list_pop(&blist)) != NULL) {
+		struct dm_target_io *tio = container_of(
+			bio, struct dm_target_io, clone);
+		tio->len_ptr = len;
+		__bio_clone_fast(bio, ci->bio);
+		if (len)
+			bio_setup_sector(bio, ci->sector, *len);
+		__map_bio(tio);
+	}
 }
 
 static int __send_empty_flush(struct clone_info *ci)
@@ -1325,7 +1356,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 	struct dm_target_io *tio;
 	int r = 0;
 
-	tio = alloc_tio(ci, ti, 0);
+	tio = alloc_tio(ci, ti);
 	tio->len_ptr = len;
 	r = clone_bio(tio, bio, sector, *len);
 	if (r >= 0)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

  parent reply	other threads:[~2017-11-22  4:00 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-18  4:38 [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
2017-06-18  4:38 ` [PATCH 03/13] blk: make the bioset rescue_workqueue optional NeilBrown
2017-06-18  4:38 ` [PATCH 04/13] blk: use non-rescuing bioset for q->bio_split NeilBrown
2017-06-18  4:38 ` [PATCH 01/13] blk: remove bio_set arg from blk_queue_split() NeilBrown
2017-06-18  4:38 ` [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create() NeilBrown
2017-06-18  4:38 ` [PATCH 06/13] rbd: use bio_clone_fast() instead of bio_clone() NeilBrown
2017-06-18  4:38 ` [PATCH 08/13] pktcdvd: " NeilBrown
2017-06-18  4:38 ` [PATCH 05/13] block: Improvements to bounce-buffer handling NeilBrown
2017-06-18  4:38 ` [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast() NeilBrown
2017-06-18  4:38 ` [PATCH 07/13] drbd: use bio_clone_fast() instead of bio_clone() NeilBrown
2017-06-18  4:38 ` [PATCH 12/13] block: remove bio_clone() and all references NeilBrown
2017-06-18  4:38 ` [PATCH 11/13] bcache: use kmalloc to allocate bio in bch_data_verify() NeilBrown
2017-06-18  4:38 ` [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split() NeilBrown
2017-06-18  4:38 ` [PATCH 10/13] xen-blkfront: remove bio splitting NeilBrown
2017-06-18 18:41 ` [PATCH 00/13] block: assorted cleanup for bio splitting and cloning Jens Axboe
2017-06-18 21:36   ` NeilBrown
2017-11-20 16:43     ` Mike Snitzer
2017-11-21  0:34       ` [dm-devel] " NeilBrown
2017-11-21  0:34         ` NeilBrown
2017-11-21  0:34         ` NeilBrown
2017-11-21  1:35         ` Mike Snitzer
2017-11-21  1:35           ` Mike Snitzer
2017-11-21 12:10           ` Mike Snitzer
2017-11-21 12:10             ` Mike Snitzer
2017-11-21 12:43             ` Mike Snitzer
2017-11-21 12:43               ` Mike Snitzer
2017-11-21 19:47               ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.] Mike Snitzer
2017-11-21 21:23                 ` [dm-devel] " Mikulas Patocka
2017-11-21 22:51                   ` new patchset to eliminate DM's use of BIOSET_NEED_RESCUER Mike Snitzer
2017-11-22  1:21                     ` Mikulas Patocka
2017-11-22  2:32                       ` Mike Snitzer
2017-11-22  4:00                       ` NeilBrown [this message]
2017-11-22  4:00                         ` [dm-devel] " NeilBrown
2017-11-22  4:00                         ` NeilBrown
2017-11-22  4:28                         ` Mike Snitzer
2017-11-22  4:28                           ` Mike Snitzer
2017-11-22 21:18                           ` Mike Snitzer
2017-11-22 18:24                         ` [dm-devel] " Mikulas Patocka
2017-11-22 18:49                           ` Mike Snitzer
2017-11-23  5:12                           ` [dm-devel] " NeilBrown
2017-11-23  5:12                             ` NeilBrown
2017-11-23 22:52                           ` [PATCH] dm: use cloned bio as head, not remainder, in __split_and_process_bio() NeilBrown
2017-11-23 22:52                             ` NeilBrown
2017-11-27 14:23                             ` Mike Snitzer
2017-11-28 22:18                               ` [dm-devel] " NeilBrown
2017-11-28 22:18                                 ` NeilBrown
2017-11-21 23:03                   ` [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER [was: Re: [PATCH 00/13] block: assorted cleanup for bio splitting and cloning.] NeilBrown
2017-11-21 23:03                     ` NeilBrown
2017-11-21 19:44             ` [dm-devel] [PATCH 00/13] block: assorted cleanup for bio splitting and cloning NeilBrown
2017-11-21 19:44               ` NeilBrown
2017-11-21 19:44               ` NeilBrown
2017-11-21 19:50               ` Mike Snitzer
2017-11-21 19:50                 ` Mike Snitzer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87bmjv0xos.fsf@notabene.neil.brown.name \
    --to=neilb@suse.com \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpatocka@redhat.com \
    --cc=snitzer@redhat.com \
    --cc=zkabelac@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.