* [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
@ 2016-06-10 11:07 Ming Lei
  2016-06-10 11:37   ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Ming Lei @ 2016-06-10 11:07 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Kent Overstreet, Ming Lei, 4.3+,
	Shaohua Li, Jens Axboe

After arbitrary bio size is supported, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
as bio_clone().

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops: 0000 [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
	- create one raid1 over two virtio-blk
	- build bcache device over the above raid1 and another cache device
	and bucket size is set as 2Mbytes
	- set cache mode as writeback
	- run random write over ext4 on the bcache device

Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: stable@vger.kernel.org (4.3+)
Cc: Shaohua Li <shli@fb.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
V2:
	- don't mark as REQ_NOMERGE in case the bio is splitted
	for reaching the limit of bvecs count
V1:
        - Kent pointed out that using max io size can't cover
        the case of non-full bvecs/pages
 block/blk-merge.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index c265348..839529b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -85,7 +85,8 @@ static inline unsigned get_max_io_size(struct request_queue *q,
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
-					 unsigned *segs)
+					 unsigned *segs,
+					 bool *no_merge)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
@@ -94,9 +95,34 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	bool do_split = true;
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
+	unsigned bvecs = 0;
+
+	*no_merge = true;
 
 	bio_for_each_segment(bv, bio, iter) {
 		/*
+		 * With arbitrary bio size, the incoming bio may be very
+		 * big. We have to split the bio into small bios so that
+		 * each holds at most BIO_MAX_PAGES bvecs because
+		 * bio_clone() can fail to allocate big bvecs.
+		 *
+		 * It should have been better to apply the limit per
+		 * request queue in which bio_clone() is involved,
+		 * instead of globally. The biggest blocker is
+		 * bio_clone() in bio bounce.
+		 *
+		 * If bio is splitted by this reason, we should allow
+		 * to continue bios merging.
+		 *
+		 * TODO: deal with bio bounce's bio_clone() gracefully
+		 * and convert the global limit into per-queue limit.
+		 */
+		if (bvecs++ >= BIO_MAX_PAGES) {
+			*no_merge = false;
+			goto split;
+		}
+
+		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
@@ -171,13 +197,15 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 {
 	struct bio *split, *res;
 	unsigned nsegs;
+	bool no_merge_for_split = true;
 
 	if (bio_op(*bio) == REQ_OP_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
 	else
-		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs,
+				&no_merge_for_split);
 
 	/* physical segments can be figured out during splitting */
 	res = split ? split : *bio;
@@ -186,7 +214,8 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->bi_rw |= REQ_NOMERGE;
+		if (no_merge_for_split)
+			split->bi_rw |= REQ_NOMERGE;
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
  2016-06-10 11:07 [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs Ming Lei
@ 2016-06-10 11:37   ` Hannes Reinecke
  0 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2016-06-10 11:37 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-kernel
  Cc: linux-block, Christoph Hellwig, Kent Overstreet, 4.3+,
	Shaohua Li, Jens Axboe

On 06/10/2016 01:07 PM, Ming Lei wrote:
> After arbitrary bio size is supported, the incoming bio may
> be very big. We have to split the bio into small bios so that
> each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
> as bio_clone().
> 
> This patch fixes the following kernel crash:
> 
>> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
>> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>> [  172.660399] Oops: 0000 [#1] SMP
>> [...]
>> [  172.664780] Call Trace:
>> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
> 
> The issue can be reproduced by the following steps:
> 	- create one raid1 over two virtio-blk
> 	- build bcache device over the above raid1 and another cache device
> 	and bucket size is set as 2Mbytes
> 	- set cache mode as writeback
> 	- run random write over ext4 on the bcache device
> 
> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
> Cc: stable@vger.kernel.org (4.3+)
> Cc: Shaohua Li <shli@fb.com>
> Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
> Signed-off-by: Ming Lei <ming.lei@canonical.com>
> ---
> V2:
> 	- don't mark as REQ_NOMERGE in case the bio is splitted
> 	for reaching the limit of bvecs count
> V1:
>         - Kent pointed out that using max io size can't cover
>         the case of non-full bvecs/pages
>  block/blk-merge.c | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
Hmm. So everybody is suffering because someone _might_ be using bio_clone?
Why can't we fix up bio_clone() (or its callers) to correctly
set the queue limits?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
@ 2016-06-10 14:16     ` Ming Lei
  0 siblings, 0 replies; 11+ messages in thread
From: Ming Lei @ 2016-06-10 14:16 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, Linux Kernel Mailing List, linux-block,
	Christoph Hellwig, Kent Overstreet, 4.3+,
	Shaohua Li, Jens Axboe

On Fri, Jun 10, 2016 at 7:37 PM, Hannes Reinecke <hare@suse.de> wrote:
> On 06/10/2016 01:07 PM, Ming Lei wrote:
>> After arbitrary bio size is supported, the incoming bio may
>> be very big. We have to split the bio into small bios so that
>> each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
>> as bio_clone().
>>
>> This patch fixes the following kernel crash:
>>
>>> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
>>> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
>>> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
>>> [  172.660399] Oops: 0000 [#1] SMP
>>> [...]
>>> [  172.664780] Call Trace:
>>> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
>>> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
>>> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
>>> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
>>> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>>> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
>>> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]
>>
>> The issue can be reproduced by the following steps:
>>       - create one raid1 over two virtio-blk
>>       - build bcache device over the above raid1 and another cache device
>>       and bucket size is set as 2Mbytes
>>       - set cache mode as writeback
>>       - run random write over ext4 on the bcache device
>>
>> Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
>> Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
>> Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
>> Cc: stable@vger.kernel.org (4.3+)
>> Cc: Shaohua Li <shli@fb.com>
>> Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
>> Signed-off-by: Ming Lei <ming.lei@canonical.com>
>> ---
>> V2:
>>       - don't mark as REQ_NOMERGE in case the bio is splitted
>>       for reaching the limit of bvecs count
>> V1:
>>         - Kent pointed out that using max io size can't cover
>>         the case of non-full bvecs/pages
>>  block/blk-merge.c | 35 ++++++++++++++++++++++++++++++++---
>>  1 file changed, 32 insertions(+), 3 deletions(-)
>>
> Hmm. So everybody is suffering because someone _might_ be using bio_clone?

I believe most usages involve <= 256 bvecs in one bio, so
only a few (such as bcache) will 'suffer', not everybody, :-)

> Why can't we fix up bio_clone() (or its callers) to correctly
> set the queue limits?

IMO there isn't a good solution to fix the issue in bio_clone().

Firstly, one page can hold at most 256 bvecs, and it isn't safe to
allocate multiple pages in the I/O path.

Secondly, as said in the comment in the patch, it can't be a per-queue
limit now because bio_clone() is used inside bio bounce.

But it should be possible to use bio splitting to deal with bio bounce;
that can be a follow-up job, and of course such a change would be a bit
too big for backporting.

That is why I suggest fixing the issue with this patch. Or are there
other ideas?


Thanks,
Ming

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Teamlead Storage & Networking
> hare@suse.de                                   +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
  2016-06-10 11:37   ` Hannes Reinecke
@ 2016-06-13 15:46   ` Christoph Hellwig
  -1 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2016-06-13 15:46 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Jens Axboe, linux-kernel, linux-block,
	Christoph Hellwig, Kent Overstreet, 4.3+,
	Shaohua Li, Jens Axboe

On Fri, Jun 10, 2016 at 01:37:17PM +0200, Hannes Reinecke wrote:
> Hmm. So everybody is suffering because someone _might_ be using bio_clone?
> Why can't we fix up bio_clone() (or its callers) to correctly
> set the queue limits?

The only one suffering is bcache.  Everyone else uses bios below the
block layer's arbitrary max size.  The fixup to allow bio_clone to
support a larger size is the same one as for everyone else submitting
larger bios: increase BIO_MAX_PAGES and create the required mempools
to back that new larger size.  Or just go for multipage biovecs.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
  2016-08-11 14:02   ` Christoph Hellwig
  2016-08-12 11:12       ` Ming Lei
@ 2016-08-12 16:36     ` Kent Overstreet
  1 sibling, 0 replies; 11+ messages in thread
From: Kent Overstreet @ 2016-08-12 16:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Wheeler, Ming Lei, Jens Axboe, linux-kernel, linux-block,
	linux-bcache, linux-raid, Sebastian Roesner, 4.3+,
	Shaohua Li

On Thu, Aug 11, 2016 at 07:02:53AM -0700, Christoph Hellwig wrote:
> Please just fix bcache to not submit bios larger than BIO_MAX_PAGES for
> now, until we can support such callers in general and enable commonly
> used code to do so.

Christoph, what's wrong with Ming's patch? Leaving bcache aside, just
considering the block layer, do you think that patch is the wrong approach?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
  2016-08-11 14:02   ` Christoph Hellwig
@ 2016-08-12 11:12       ` Ming Lei
  2016-08-12 16:36     ` Kent Overstreet
  1 sibling, 0 replies; 11+ messages in thread
From: Ming Lei @ 2016-08-12 11:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Wheeler, Jens Axboe, Linux Kernel Mailing List, linux-block,
	open list:BCACHE (BLOCK LAYER CACHE),
	open list:SOFTWARE RAID (Multiple Disks) SUPPORT,
	Kent Overstreet, Sebastian Roesner, 4.3+,
	Shaohua Li

On Thu, Aug 11, 2016 at 10:02 PM, Christoph Hellwig <hch@infradead.org> wrote:
> Please just fix bcache to not submit bios larger than BIO_MAX_PAGES for
> now, until we can support such callers in general and enable commonly
> used code to do so.

IMO it can't be efficient to do that in bcache, because it would need to
figure out how many bvecs one bio includes.

This patch (block: make sure big bio is splitted into at most 256 bvecs)
can support such callers.

Also this kind of usage does simplify drivers.

Thanks,
Ming

> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
  2016-08-11  6:33 ` [PATCH v2] " Eric Wheeler
@ 2016-08-11 14:02   ` Christoph Hellwig
  2016-08-12 11:12       ` Ming Lei
  2016-08-12 16:36     ` Kent Overstreet
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2016-08-11 14:02 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Ming Lei, Jens Axboe, linux-kernel, linux-block, linux-bcache,
	linux-raid, kent.overstreet, Christoph Hellwig,
	Sebastian Roesner, 4.3+,
	Shaohua Li

Please just fix bcache to not submit bios larger than BIO_MAX_PAGES for
now, until we can support such callers in general and enable commonly
used code to do so.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
  2016-04-06  3:43 [PATCH v1] " Ming Lei
@ 2016-08-11  6:33 ` Eric Wheeler
  2016-08-11 14:02   ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Wheeler @ 2016-08-11  6:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-kernel, linux-block, linux-bcache, linux-raid,
	kent.overstreet, Christoph Hellwig, Sebastian Roesner, 4.3+,
	Shaohua Li

On Fri, 10 Jun 2016, Christoph Hellwig wrote: 
> On Wed, 6 Apr 2016, Ming Lei wrote:
> >
> > After arbitrary bio size is supported, the incoming bio may
> > be very big. We have to split the bio into small bios so that
> > each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
> > as bio_clone().
> > 
> > This patch fixes the following kernel crash:
> > [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> > [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> > [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> > [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> > [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
>
> The fixup to allow bio_clone support a larger size is the same one as to 
> allow everyone else submitting larger bios:  increase BIO_MAX_PAGES and 
> create the required mempools to back that new larger size.  Or just go 
> for multipage biovecs..

Hi Christoph, Ming, everyone:

I'm hoping you can help me get this off a list of stability fixes
related to changes around Linux 4.3.  Ming's patch [1] is known to fix an
issue when a bio with bi_vcnt > BIO_MAX_PAGES is passed to
generic_make_request and later hits bio_clone.  (Note that bi_vcnt can't
be trusted since the introduction of immutable biovecs, and the bvec
count needs to be re-derived unless you own the bio, which Ming's patch
does.)

The diffstat, 22 lines of which are commentary, 
seems relatively minor and would land in stable for v4.3+:
 block/blk-merge.c | 35 ++++++++++++++++++++++++++++++++---

I'm not sure I understood Christoph's suggestion; BIO_MAX_PAGES is a
static #define and we don't know what the bi_vcnt from an arbitrary
driver might be.  Wouldn't increasing BIO_MAX_PAGES just push the problem
further into the future, when bi_vcnt might again exceed BIO_MAX_PAGES?

Perhaps you could elaborate if I have misunderstood. Are you suggesting 
that no driver should call generic_make_request when bi_vcnt > BIO_MAX_PAGES?


--
Eric Wheeler

[1] https://patchwork.kernel.org/patch/9169483/
    Pasted below:



After arbitrary bio size is supported, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reasons, such
as bio_clone().

This patch fixes the following kernel crash:

> [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> [  172.660229] IP: [<ffffffff811e53b4>] bio_trim+0xf/0x2a
> [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
> [  172.660399] Oops: 0000 [#1] SMP
> [...]
> [  172.664780] Call Trace:
> [  172.664813]  [<ffffffffa007f3be>] ? raid1_make_request+0x2e8/0xad7 [raid1]
> [  172.664846]  [<ffffffff811f07da>] ? blk_queue_split+0x377/0x3d4
> [  172.664880]  [<ffffffffa005fb5f>] ? md_make_request+0xf6/0x1e9 [md_mod]
> [  172.664912]  [<ffffffff811eb860>] ? generic_make_request+0xb5/0x155
> [  172.664947]  [<ffffffffa0445c89>] ? prio_io+0x85/0x95 [bcache]
> [  172.664981]  [<ffffffffa0448252>] ? register_cache_set+0x355/0x8d0 [bcache]
> [  172.665016]  [<ffffffffa04497d3>] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
	- create one raid1 over two virtio-blk
	- build bcache device over the above raid1 and another cache device
	and bucket size is set as 2Mbytes
	- set cache mode as writeback
	- run random write over ext4 on the bcache device

Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
Reported-by: Sebastian Roesner <sroesner-kernelorg@roesner-online.de>
Reported-by: Eric Wheeler <bcache@lists.ewheeler.net>
Cc: stable@vger.kernel.org (4.3+)
Cc: Shaohua Li <shli@fb.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
V2:
	- don't mark as REQ_NOMERGE in case the bio is splitted
	for reaching the limit of bvecs count
V1:
        - Kent pointed out that using max io size can't cover
        the case of non-full bvecs/pages
 block/blk-merge.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index c265348..839529b 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -85,7 +85,8 @@  static inline unsigned get_max_io_size(struct request_queue *q,
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 					 struct bio *bio,
 					 struct bio_set *bs,
-					 unsigned *segs)
+					 unsigned *segs,
+					 bool *no_merge)
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
@@ -94,9 +95,34 @@  static struct bio *blk_bio_segment_split(struct request_queue *q,
 	bool do_split = true;
 	struct bio *new = NULL;
 	const unsigned max_sectors = get_max_io_size(q, bio);
+	unsigned bvecs = 0;
+
+	*no_merge = true;
 
 	bio_for_each_segment(bv, bio, iter) {
 		/*
+		 * With arbitrary bio size, the incoming bio may be very
+		 * big. We have to split the bio into small bios so that
+		 * each holds at most BIO_MAX_PAGES bvecs because
+		 * bio_clone() can fail to allocate big bvecs.
+		 *
+		 * It should have been better to apply the limit per
+		 * request queue in which bio_clone() is involved,
+		 * instead of globally. The biggest blocker is
+		 * bio_clone() in bio bounce.
+		 *
+		 * If bio is splitted by this reason, we should allow
+		 * to continue bios merging.
+		 *
+		 * TODO: deal with bio bounce's bio_clone() gracefully
+		 * and convert the global limit into per-queue limit.
+		 */
+		if (bvecs++ >= BIO_MAX_PAGES) {
+			*no_merge = false;
+			goto split;
+		}
+
+		/*
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
@@ -171,13 +197,15 @@  void blk_queue_split(struct request_queue *q, struct bio **bio,
 {
 	struct bio *split, *res;
 	unsigned nsegs;
+	bool no_merge_for_split = true;
 
 	if (bio_op(*bio) == REQ_OP_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
 		split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
 	else
-		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
+		split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs,
+				&no_merge_for_split);
 
 	/* physical segments can be figured out during splitting */
 	res = split ? split : *bio;
@@ -186,7 +214,8 @@  void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 	if (split) {
 		/* there isn't chance to merge the splitted bio */
-		split->bi_rw |= REQ_NOMERGE;
+		if (no_merge_for_split)
+			split->bi_rw |= REQ_NOMERGE;
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-08-12 16:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-10 11:07 [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs Ming Lei
2016-06-10 11:37 ` Hannes Reinecke
2016-06-10 11:37   ` Hannes Reinecke
2016-06-10 14:16   ` Ming Lei
2016-06-10 14:16     ` Ming Lei
2016-06-13 15:46   ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2016-04-06  3:43 [PATCH v1] " Ming Lei
2016-08-11  6:33 ` [PATCH v2] " Eric Wheeler
2016-08-11 14:02   ` Christoph Hellwig
2016-08-12 11:12     ` Ming Lei
2016-08-12 11:12       ` Ming Lei
2016-08-12 16:36     ` Kent Overstreet
