All of lore.kernel.org
* [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
@ 2010-02-27 17:35 Dmitry Monakhov
  2010-02-27 17:35 ` [PATCH 2/2] blktrace: perform cleanup after setup error Dmitry Monakhov
  2010-02-28 18:46 ` [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Jens Axboe
  0 siblings, 2 replies; 28+ messages in thread
From: Dmitry Monakhov @ 2010-02-27 17:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: jens.axboe, Dmitry Monakhov

merge_bvec_fn() returns bvec->bv_len on success, so we have to check
against this value. But in the case of the fs_optimization merge we
compare against the wrong value. This fix should have been part of
 b428cd6da7e6559aca69aa2e3a526037d3f20403
but I accidentally forgot to include it in the initial patch.
To make things consistent, let's replace all such checks.
In fact, this makes the code easier to understand.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/bio.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 88094af..c28e5fb 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -557,7 +557,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 					.bi_rw = bio->bi_rw,
 				};
 
-				if (q->merge_bvec_fn(q, &bvm, prev) < len) {
+				if (q->merge_bvec_fn(q, &bvm, prev) != prev->bv_len) {
 					prev->bv_len -= len;
 					return 0;
 				}
@@ -611,7 +611,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 		 * merge_bvec_fn() returns number of bytes it can accept
 		 * at this offset
 		 */
-		if (q->merge_bvec_fn(q, &bvm, bvec) < len) {
+		if (q->merge_bvec_fn(q, &bvm, bvec) != bvec->bv_len) {
 			bvec->bv_page = NULL;
 			bvec->bv_len = 0;
 			bvec->bv_offset = 0;
-- 
1.6.6


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/2] blktrace: perform cleanup after setup error
  2010-02-27 17:35 [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Dmitry Monakhov
@ 2010-02-27 17:35 ` Dmitry Monakhov
  2010-02-28 18:46   ` Jens Axboe
  2010-02-28 18:46 ` [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Jens Axboe
  1 sibling, 1 reply; 28+ messages in thread
From: Dmitry Monakhov @ 2010-02-27 17:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: jens.axboe, Dmitry Monakhov

Currently, even if the BLKTRACESETUP ioctl has failed, the user must
call BLKTRACETEARDOWN to be sure that everything has been cleaned up,
which is counter-intuitive.
Let's make the setup ioctl perform the necessary cleanup by itself.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 kernel/trace/blktrace.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index d9d6206..07f945a 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -540,9 +540,10 @@ int blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	if (ret)
 		return ret;
 
-	if (copy_to_user(arg, &buts, sizeof(buts)))
+	if (copy_to_user(arg, &buts, sizeof(buts))) {
+		blk_trace_remove(q);
 		return -EFAULT;
-
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(blk_trace_setup);
-- 
1.6.6
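
For context, a hedged userspace sketch (not part of the patch; buffer sizes
and mask values below are purely illustrative) of the situation being fixed:
before this change, a careful caller had to issue BLKTRACETEARDOWN even when
BLKTRACESETUP failed, because a failed setup could leave state behind.

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/blktrace_api.h>

static int start_trace(int fd)
{
	struct blk_user_trace_setup buts;

	memset(&buts, 0, sizeof(buts));
	buts.buf_size = 512 * 1024;	/* per-CPU buffer size (example) */
	buts.buf_nr   = 4;		/* number of buffers (example)   */
	buts.act_mask = 0xffff;		/* trace all action types        */

	if (ioctl(fd, BLKTRACESETUP, &buts) < 0) {
		/* Without the fix above, clean up explicitly just in case. */
		ioctl(fd, BLKTRACETEARDOWN);
		return -1;
	}
	return ioctl(fd, BLKTRACESTART);
}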


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-02-27 17:35 [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Dmitry Monakhov
  2010-02-27 17:35 ` [PATCH 2/2] blktrace: perform cleanup after setup error Dmitry Monakhov
@ 2010-02-28 18:46 ` Jens Axboe
  2010-03-03  3:49   ` Dmitry Monakhov
  1 sibling, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2010-02-28 18:46 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel

On Sat, Feb 27 2010, Dmitry Monakhov wrote:
> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
> against this value. But in the case of the fs_optimization merge we
> compare against the wrong value. This fix should have been part of
>  b428cd6da7e6559aca69aa2e3a526037d3f20403
> but I accidentally forgot to include it in the initial patch.
> To make things consistent, let's replace all such checks.
> In fact, this makes the code easier to understand.

Agree, applied.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] blktrace: perform cleanup after setup error
  2010-02-27 17:35 ` [PATCH 2/2] blktrace: perform cleanup after setup error Dmitry Monakhov
@ 2010-02-28 18:46   ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2010-02-28 18:46 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel

On Sat, Feb 27 2010, Dmitry Monakhov wrote:
> Currently, even if the BLKTRACESETUP ioctl has failed, the user must
> call BLKTRACETEARDOWN to be sure that everything has been cleaned up,
> which is counter-intuitive.
> Let's make the setup ioctl perform the necessary cleanup by itself.

This makes more sense; we should return in a known state (enabled
or not). Applied.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-02-28 18:46 ` [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Jens Axboe
@ 2010-03-03  3:49   ` Dmitry Monakhov
  2010-03-03  7:30     ` Jens Axboe
  2010-03-03 18:20     ` Mike Snitzer
  0 siblings, 2 replies; 28+ messages in thread
From: Dmitry Monakhov @ 2010-03-03  3:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 720 bytes --]

Jens Axboe <jens.axboe@oracle.com> writes:

> On Sat, Feb 27 2010, Dmitry Monakhov wrote:
>> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
>> against this value. But in the case of the fs_optimization merge we
>> compare against the wrong value. This fix should have been part of
>>  b428cd6da7e6559aca69aa2e3a526037d3f20403
>> but I accidentally forgot to include it in the initial patch.
>> To make things consistent, let's replace all such checks.
>> In fact, this makes the code easier to understand.
>
> Agree, applied.
Ohh... as you already know, this patch breaks the dm layer. Sorry.
This is because dm->merge may return more than requested, so the correct
check must test for less than requested. Correct patch attached.


[-- Attachment #2: 0001-blkdev-fix-merge_bvec_fn-return-value-checks-v2.patch --]
[-- Type: text/plain, Size: 1509 bytes --]

From 145fb49bf2251f445ca29c5218333367448932d6 Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Wed, 3 Mar 2010 06:28:06 +0300
Subject: [PATCH] blkdev: fix merge_bvec_fn return value checks v2

merge_bvec_fn() returns bvec->bv_len on success, so we have to check
against this value. But in the case of the fs_optimization merge we
compare against the wrong value. This fix should have been part of
 b428cd6da7e6559aca69aa2e3a526037d3f20403
but I accidentally forgot to include it in the initial patch.
To make things consistent, let's replace all such checks.
In fact, this makes the code easier to understand.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/bio.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 88094af..975657a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -557,7 +557,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 					.bi_rw = bio->bi_rw,
 				};
 
-				if (q->merge_bvec_fn(q, &bvm, prev) < len) {
+				if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
 					prev->bv_len -= len;
 					return 0;
 				}
@@ -611,7 +611,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
 		 * merge_bvec_fn() returns number of bytes it can accept
 		 * at this offset
 		 */
-		if (q->merge_bvec_fn(q, &bvm, bvec) < len) {
+		if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
 			bvec->bv_page = NULL;
 			bvec->bv_len = 0;
 			bvec->bv_offset = 0;
-- 
1.6.6


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03  3:49   ` Dmitry Monakhov
@ 2010-03-03  7:30     ` Jens Axboe
  2010-03-03  8:39       ` Dmitry Monakhov
  2010-03-03 18:20     ` Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2010-03-03  7:30 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel

On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Sat, Feb 27 2010, Dmitry Monakhov wrote:
> >> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
> >> against this value. But in the case of the fs_optimization merge we
> >> compare against the wrong value. This fix should have been part of
> >>  b428cd6da7e6559aca69aa2e3a526037d3f20403
> >> but I accidentally forgot to include it in the initial patch.
> >> To make things consistent, let's replace all such checks.
> >> In fact, this makes the code easier to understand.
> >
> > Agree, applied.
> Ohh... as you already know, this patch breaks the dm layer. Sorry.
> This is because dm->merge may return more than requested, so the correct
> check must test for less than requested. Correct patch attached.

Have you tested this with dm and md (ie actual users of the merge_bvec
functionality) this time?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03  7:30     ` Jens Axboe
@ 2010-03-03  8:39       ` Dmitry Monakhov
  2010-03-03 12:21         ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Monakhov @ 2010-03-03  8:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel

Jens Axboe <jens.axboe@oracle.com> writes:

> On Wed, Mar 03 2010, Dmitry Monakhov wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>> 
>> > On Sat, Feb 27 2010, Dmitry Monakhov wrote:
>> >> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
>> >> against this value. But in the case of the fs_optimization merge we
>> >> compare against the wrong value. This fix should have been part of
>> >>  b428cd6da7e6559aca69aa2e3a526037d3f20403
>> >> but I accidentally forgot to include it in the initial patch.
>> >> To make things consistent, let's replace all such checks.
>> >> In fact, this makes the code easier to understand.
>> >
>> > Agree, applied.
>> Ohh... as you already know, this patch breaks the dm layer. Sorry.
>> This is because dm->merge may return more than requested, so the correct
>> check must test for less than requested. Correct patch attached.
>
> Have you tested this with dm and md (ie actual users of the merge_bvec
> functionality) this time?
Yes. This time both md and dm are ok.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03  8:39       ` Dmitry Monakhov
@ 2010-03-03 12:21         ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2010-03-03 12:21 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: linux-kernel

On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
> 
> > On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> >> Jens Axboe <jens.axboe@oracle.com> writes:
> >> 
> >> > On Sat, Feb 27 2010, Dmitry Monakhov wrote:
> >> >> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
> >> >> against this value. But in the case of the fs_optimization merge we
> >> >> compare against the wrong value. This fix should have been part of
> >> >>  b428cd6da7e6559aca69aa2e3a526037d3f20403
> >> >> but I accidentally forgot to include it in the initial patch.
> >> >> To make things consistent, let's replace all such checks.
> >> >> In fact, this makes the code easier to understand.
> >> >
> >> > Agree, applied.
> >> Ohh... as you already know, this patch breaks the dm layer. Sorry.
> >> This is because dm->merge may return more than requested, so the correct
> >> check must test for less than requested. Correct patch attached.
> >
> > Have you tested this with dm and md (ie actual users of the merge_bvec
> > functionality) this time?
> Yes. This time both md and dm are ok.

Good, I'll queue it up for testing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03  3:49   ` Dmitry Monakhov
  2010-03-03  7:30     ` Jens Axboe
@ 2010-03-03 18:20     ` Mike Snitzer
  2010-03-03 18:45       ` Dmitry Monakhov
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-03-03 18:20 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jens Axboe, linux-kernel, dm-devel

On Tue, Mar 2, 2010 at 10:49 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> Jens Axboe <jens.axboe@oracle.com> writes:
>
>> On Sat, Feb 27 2010, Dmitry Monakhov wrote:
>>> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
>>> against this value. But in the case of the fs_optimization merge we
>>> compare against the wrong value. This fix should have been part of
>>>  b428cd6da7e6559aca69aa2e3a526037d3f20403
>>> but I accidentally forgot to include it in the initial patch.
>>> To make things consistent, let's replace all such checks.
>>> In fact, this makes the code easier to understand.
>>
>> Agree, applied.
> Ohh... as you already know, this patch breaks the dm layer. Sorry.
> This is because dm->merge may return more than requested, so the correct
> check must test for less than requested. Correct patch attached.

Yes, it is quite common for dm_merge_bvec() to return greater than the
requested length.

But dm_merge_bvec() returning a maximum length, rather than requested,
isn't special.  All the other blk_queue_merge_bvec() callers' merge
functions appear to return "maximum amount of bytes we can accept at
this offset" too.

__bio_add_page() only needs to care about whether the
'q->merge_bvec_fn' return is _less than_ the requested length.
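
To make that calling convention concrete, here is a minimal, hypothetical
merge function in that style (the boundary below is made up; real examples
are dm_merge_bvec() in drivers/md/dm.c and the md personalities'
*_mergeable_bvec() callbacks):

static int example_merge_bvec(struct request_queue *q,
			      struct bvec_merge_data *bvm,
			      struct bio_vec *biovec)
{
	unsigned int boundary_sectors = 128;	/* made-up device boundary */
	unsigned int offset = bvm->bi_sector & (boundary_sectors - 1);
	unsigned int room = (boundary_sectors - offset) << 9;	/* bytes */

	if (room <= bvm->bi_size)
		return 0;	/* nothing more fits before the boundary */
	/* Report all remaining room; this may well exceed biovec->bv_len,
	 * which is why __bio_add_page() must only treat a return value
	 * smaller than the requested length as a rejection. */
	return room - bvm->bi_size;
}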

> From 145fb49bf2251f445ca29c5218333367448932d6 Mon Sep 17 00:00:00 2001
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> Date: Wed, 3 Mar 2010 06:28:06 +0300
> Subject: [PATCH] blkdev: fix merge_bvec_fn return value checks v2
>
> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
> against this value. But in the case of the fs_optimization merge we
> compare against the wrong value. This fix should have been part of
>  b428cd6da7e6559aca69aa2e3a526037d3f20403
> but I accidentally forgot to include it in the initial patch.
> To make things consistent, let's replace all such checks.
> In fact, this makes the code easier to understand.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/bio.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/bio.c b/fs/bio.c
> index 88094af..975657a 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -557,7 +557,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
>                                        .bi_rw = bio->bi_rw,
>                                };
>
> -                               if (q->merge_bvec_fn(q, &bvm, prev) < len) {
> +                               if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
>                                        prev->bv_len -= len;
>                                        return 0;
>                                }
> @@ -611,7 +611,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
>                 * merge_bvec_fn() returns number of bytes it can accept
>                 * at this offset
>                 */
> -               if (q->merge_bvec_fn(q, &bvm, bvec) < len) {
> +               if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
>                        bvec->bv_page = NULL;
>                        bvec->bv_len = 0;
>                        bvec->bv_offset = 0;

NOTE this 2nd hunk doesn't change anything at all because: bvec->bv_len = len;
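
(Roughly the surrounding context in __bio_add_page(), paraphrased rather
than quoted exactly: the new bvec is filled in before the merge function is
called, so at this point bvec->bv_len == len and the two comparisons are
equivalent.)

	bvec = &bio->bi_io_vec[bio->bi_vcnt];
	bvec->bv_page = page;
	bvec->bv_len = len;
	bvec->bv_offset = offset;
	...
	if (q->merge_bvec_fn(q, &bvm, bvec) < len) {	/* same as < bvec->bv_len */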

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03 18:20     ` Mike Snitzer
@ 2010-03-03 18:45       ` Dmitry Monakhov
  2010-03-03 19:16         ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Monakhov @ 2010-03-03 18:45 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-kernel, dm-devel

Mike Snitzer <snitzer@redhat.com> writes:

> On Tue, Mar 2, 2010 at 10:49 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>> Jens Axboe <jens.axboe@oracle.com> writes:
>>
>>> On Sat, Feb 27 2010, Dmitry Monakhov wrote:
>>>> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
>>>> against this value. But in the case of the fs_optimization merge we
>>>> compare against the wrong value. This fix should have been part of
>>>>  b428cd6da7e6559aca69aa2e3a526037d3f20403
>>>> but I accidentally forgot to include it in the initial patch.
>>>> To make things consistent, let's replace all such checks.
>>>> In fact, this makes the code easier to understand.
>>>
>>> Agree, applied.
>> Ohh... as you already know, this patch breaks the dm layer. Sorry.
>> This is because dm->merge may return more than requested, so the correct
>> check must test for less than requested. Correct patch attached.
>
> Yes, it is quite common for dm_merge_bvec() to return greater than the
> requested length.
>
> But dm_merge_bvec() returning a maximum length, rather than requested,
> isn't special.  All the other blk_queue_merge_bvec() callers' merge
> functions appear to return "maximum amount of bytes we can accept at
> this offset" too.
What for? Does it allow us to make some optimization?
For example, something like the following:
bio_add_pageS(bio, **pages) {
   /* call merge_fn only once, until all space is exhausted */
   ret = merge_fn()  (this returns a huge value (1024*1024))
   while (ret) {
          bio->bi_io_vec[bio->bi_vcnt - 1].bv_page = page;
          ...
          ret -= PAGE_SIZE;
          bio->bi_vcnt++;
   }
}
IMHO the answer is *NO*; this code is unlikely to work.
>
> __bio_add_page() only needs to care about whether the
> 'q->merge_bvec_fn' return is _less than_ the requested length.
>
>> From 145fb49bf2251f445ca29c5218333367448932d6 Mon Sep 17 00:00:00 2001
>> From: Dmitry Monakhov <dmonakhov@openvz.org>
>> Date: Wed, 3 Mar 2010 06:28:06 +0300
>> Subject: [PATCH] blkdev: fix merge_bvec_fn return value checks v2
>>
>> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
>> against this value. But in the case of the fs_optimization merge we
>> compare against the wrong value. This fix should have been part of
>>  b428cd6da7e6559aca69aa2e3a526037d3f20403
>> but I accidentally forgot to include it in the initial patch.
>> To make things consistent, let's replace all such checks.
>> In fact, this makes the code easier to understand.
>>
>> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
>> ---
>>  fs/bio.c |    4 ++--
>>  1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/bio.c b/fs/bio.c
>> index 88094af..975657a 100644
>> --- a/fs/bio.c
>> +++ b/fs/bio.c
>> @@ -557,7 +557,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
>>                                        .bi_rw = bio->bi_rw,
>>                                };
>>
>> -                               if (q->merge_bvec_fn(q, &bvm, prev) < len) {
>> +                               if (q->merge_bvec_fn(q, &bvm, prev) < prev->bv_len) {
>>                                        prev->bv_len -= len;
>>                                        return 0;
>>                                }
>> @@ -611,7 +611,7 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
>>                 * merge_bvec_fn() returns number of bytes it can accept
>>                 * at this offset
>>                 */
>> -               if (q->merge_bvec_fn(q, &bvm, bvec) < len) {
>> +               if (q->merge_bvec_fn(q, &bvm, bvec) < bvec->bv_len) {
>>                        bvec->bv_page = NULL;
>>                        bvec->bv_len = 0;
>>                        bvec->bv_offset = 0;
>
> NOTE this 2nd hunk doesn't change anything at all because: bvec->bv_len = len;
Yes. IMHO this makes the code more readable.
>
> Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03 18:45       ` Dmitry Monakhov
@ 2010-03-03 19:16         ` Mike Snitzer
  2010-03-03 19:42           ` Dmitry Monakhov
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-03-03 19:16 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jens Axboe, linux-kernel, dm-devel

On Wed, Mar 03 2010 at  1:45pm -0500,
Dmitry Monakhov <dmonakhov@openvz.org> wrote:

> Mike Snitzer <snitzer@redhat.com> writes:
> 
> > On Tue, Mar 2, 2010 at 10:49 PM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> >> Jens Axboe <jens.axboe@oracle.com> writes:
> >>
> >>> On Sat, Feb 27 2010, Dmitry Monakhov wrote:
> >>>> merge_bvec_fn() returns bvec->bv_len on success, so we have to check
> >>>> against this value. But in the case of the fs_optimization merge we
> >>>> compare against the wrong value. This fix should have been part of
> >>>>  b428cd6da7e6559aca69aa2e3a526037d3f20403
> >>>> but I accidentally forgot to include it in the initial patch.
> >>>> To make things consistent, let's replace all such checks.
> >>>> In fact, this makes the code easier to understand.
> >>>
> >>> Agree, applied.
> >> Ohh... as you already know, this patch breaks the dm layer. Sorry.
> >> This is because dm->merge may return more than requested, so the correct
> >> check must test for less than requested. Correct patch attached.
> >
> > Yes, it is quite common for dm_merge_bvec() to return greater than the
> > requested length.
> >
> > But dm_merge_bvec() returning a maximum length, rather than requested,
> > isn't special.  All the other blk_queue_merge_bvec() callers' merge
> > functions appear to return "maximum amount of bytes we can accept at
> > this offset" too.
> What for? Does it allow us to make some optimization?

I wasn't suggesting the behavior of the current merge functions
returning maximum is important or useful.  I was just pointing out what
the interface is and that dm_merge_bvec() is no different than the
others.

> For example, something like the following:
> bio_add_pageS(bio, **pages) {
>    /* call merge_fn only once, until all space is exhausted */
>    ret = merge_fn()  (this returns a huge value (1024*1024))
>    while (ret) {
>           bio->bi_io_vec[bio->bi_vcnt - 1].bv_page = page;
>           ...
>           ret -= PAGE_SIZE;
>           bio->bi_vcnt++;
>    }
> }
> IMHO the answer is *NO*; this code is unlikely to work.

Conversely, I see no value in imposing that these 'q->merge_bvec_fn'
functions return at most the requested length.  Can't even really see it
making the __bio_add_page() code more readable.

> > __bio_add_page() only needs to care about whether the
> > 'q->merge_bvec_fn' return is _less than_ the requested length.

Linux has all sorts of internal interfaces that are "odd"... the current
'q->merge_bvec_fn' interface included.  But odd is not a problem (nor is
it "broken") unless you make changes that don't consider how the current
interface is defined.

But I digress...

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03 19:16         ` Mike Snitzer
@ 2010-03-03 19:42           ` Dmitry Monakhov
  2010-03-03 20:07             ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Monakhov @ 2010-03-03 19:42 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-kernel, dm-devel

Mike Snitzer <snitzer@redhat.com> writes:

> Linux has all sorts of internal interfaces that are "odd"... the current
> 'q->merge_bvec_fn' interface included.  But odd is not a problem (nor is
> it "broken") unless you make changes that don't consider how the current
> interface is defined.
OK, then can you please explain a couple more historical questions:
1) Why can bio_add_page() not add less data than requested?
   It seems that this wouldn't make the caller's code much more complicated.
   Of course a barrier bio is a special case; I don't consider it here.

2) What does the statement "bio_add_page() must accept at least one page"
   exactly mean?
   IMHO this means that bio_add_page() must accept at least
   one page with len (PAGE_SIZE - offset), or the more restricted
   statement that the first bio_add_page() must always be successful.

   But currently in some places this rule is treated as meaning that all
   bios whose size is less than PAGE_SIZE are accepted. And on x86 such a
   bio may have up to 8 pages/bvecs.
>
> But I digress...
>
> Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03 19:42           ` Dmitry Monakhov
@ 2010-03-03 20:07             ` Jens Axboe
  2010-03-04 11:47               ` [dm-devel] " Mikulas Patocka
  2010-03-04 17:59               ` Lars Ellenberg
  0 siblings, 2 replies; 28+ messages in thread
From: Jens Axboe @ 2010-03-03 20:07 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Mike Snitzer, linux-kernel, dm-devel

On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> Mike Snitzer <snitzer@redhat.com> writes:
> 
> > Linux has all sorts of internal interfaces that are "odd"... the current
> > 'q->merge_bvec_fn' interface included.  But odd is not a problem (nor is
> > it "broken") unless you make changes that don't consider how the current
> > interface is defined.
> OK, then can you please explain a couple more historical questions:
> 1) Why can bio_add_page() not add less data than requested?
>    It seems that this wouldn't make the caller's code much more complicated.
>    Of course a barrier bio is a special case; I don't consider it here.

Because the caller may not expect that, a partial add may not make any
sense to the caller. The bio code obviously doesn't care. And it
certainly could complicate the caller a lot, if they now need to issue
and wait for several bios instead of just a single one. Now a single
completion queue and wait_for_completion() is not enough.
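
(As an illustration of that pattern -- a minimal sketch, not from this
thread, with the end_io helper name invented -- the common synchronous
caller looks roughly like this, and it only works because the whole buffer
went into one bio:)

static void one_bio_end_io(struct bio *bio, int error)
{
	complete(bio->bi_private);
}

	DECLARE_COMPLETION_ONSTACK(done);

	bio->bi_private = &done;
	bio->bi_end_io = one_bio_end_io;
	submit_bio(READ, bio);
	wait_for_completion(&done);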

> 2) What does the statement "bio_add_page() must accept at least one page"
>    exactly mean?
>    IMHO this means that bio_add_page() must accept at least
>    one page with len (PAGE_SIZE - offset), or the more restricted
>    statement that the first bio_add_page() must always be successful.

It's really 'first add must succeed', the restriction being that you
cannot rely on that first add being more than a single page. So the rule
is that you must accept at least a page at any offset if the bio is
currently empty, since we know that a page is typically our IO
granularity.
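
(Again just a sketch of the caller pattern this rule enables -- bdev,
pages[], nr_pages and first_sector are assumed to be set up elsewhere, and
bi_end_io/error handling is omitted: the first add to an empty bio is
relied on to succeed, and a short add simply means "submit what we have and
start a new bio".)

	struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
	int i;

	bio->bi_bdev = bdev;
	bio->bi_sector = first_sector;

	for (i = 0; i < nr_pages; i++) {
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) < PAGE_SIZE) {
			/* device (or its merge_bvec_fn) refused this page */
			submit_bio(WRITE, bio);
			bio = bio_alloc(GFP_NOIO, nr_pages - i);
			bio->bi_bdev = bdev;
			bio->bi_sector = first_sector + (i << (PAGE_SHIFT - 9));
			i--;		/* retry this page in the fresh bio */
			continue;
		}
	}
	submit_bio(WRITE, bio);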

>    But currently in some places this rule is treated as meaning that all
>    bios whose size is less than PAGE_SIZE are accepted. And on x86 such a
>    bio may have up to 8 pages/bvecs.

Not sure I follow what you are trying to say here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03 20:07             ` Jens Axboe
@ 2010-03-04 11:47               ` Mikulas Patocka
  2010-03-04 12:19                 ` Jens Axboe
  2010-03-04 17:59               ` Lars Ellenberg
  1 sibling, 1 reply; 28+ messages in thread
From: Mikulas Patocka @ 2010-03-04 11:47 UTC (permalink / raw)
  To: device-mapper development, Jens Axboe
  Cc: Dmitry Monakhov, linux-kernel, Mike Snitzer



On Wed, 3 Mar 2010, Jens Axboe wrote:

> On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> > Mike Snitzer <snitzer@redhat.com> writes:
> > 
> > > Linux has all sorts of internal interfaces that are "odd"... the current
> > > 'q->merge_bvec_fn' interface included.  But odd is not a problem (nor is
> > > it "broken") unless you make changes that don't consider how the current
> > > interface is defined.
> > OK, then can you please explain a couple more historical questions:
> > 1) Why can bio_add_page() not add less data than requested?
> >    It seems that this wouldn't make the caller's code much more complicated.
> >    Of course a barrier bio is a special case; I don't consider it here.
> 
> Because the caller may not expect that, a partial add may not make any
> sense to the caller. The bio code obviously doesn't care. And it
> certainly could complicate the caller a lot, if they now need to issue
> and wait for several bios instead of just a single one. Now a single
> completion queue and wait_for_completion() is not enough.

The whole thing is lame by design.

As a consequence, the code for splitting these page-sized bios is being
duplicated in md, dm and request-based devices (and maybe in other drivers).

And there is a long standing unfixable bug --- adding a device to raid1 
may fail if the device has smaller request size than the other raid1 leg.
Think about this:

* thread 1:
bio_add_page() -> returns ok
...
bio_add_page() -> returns ok
* scheduling to thread 2 *
* thread 2:
add a raid1 leg to the block device
* scheduling to thread 1 *
* thread 1:
submit_bio() --- the bio is already oversized for the new raid1 leg and there
is no way to shrink it.


I think it should be redesigned in this way:
* the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
* the code for splitting bios should be just at one place --- in the 
generic block layer and it should work with all drivers. I.e.
q->make_request_fn will return that the request is too big for the given 
driver and the generic code will allocate a few more bios from a
per-request_queue mempool and split the bio.
* then, we would fix the raid1 bug and we would also simplify md and dm 
--- remove the splitting code.
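
Purely to illustrate the shape of that proposal (nothing like this exists in
the current tree, and every helper name below is invented), the generic
layer might loop along these lines:

/* Hypothetical pseudo-code for the redesign sketched above. */
void submit_bio_with_generic_split(struct request_queue *q, struct bio *bio)
{
	while (blk_queue_bio_too_big(q, bio)) {		/* invented helper */
		/* split off as much as the driver can take, using a
		 * per-queue mempool so stacked devices cannot deadlock */
		struct bio *head = blk_queue_split_bio(q, bio);	/* invented helper */

		generic_make_request(head);
	}
	generic_make_request(bio);
}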

Mikulas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-04 11:47               ` [dm-devel] " Mikulas Patocka
@ 2010-03-04 12:19                 ` Jens Axboe
  2010-03-04 21:55                   ` Mikulas Patocka
  0 siblings, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2010-03-04 12:19 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: device-mapper development, Dmitry Monakhov, linux-kernel, Mike Snitzer

On Thu, Mar 04 2010, Mikulas Patocka wrote:
> 
> 
> On Wed, 3 Mar 2010, Jens Axboe wrote:
> 
> > On Wed, Mar 03 2010, Dmitry Monakhov wrote:
> > > Mike Snitzer <snitzer@redhat.com> writes:
> > > 
> > > > Linux has all sorts of internal interfaces that are "odd"... the current
> > > > 'q->merge_bvec_fn' interface included.  But odd is not a problem (nor is
> > > > it "broken") unless you make changes that don't consider how the current
> > > > interface is defined.
> > > OK, then can you please explain a couple more historical questions:
> > > 1) Why can bio_add_page() not add less data than requested?
> > >    It seems that this wouldn't make the caller's code much more complicated.
> > >    Of course a barrier bio is a special case; I don't consider it here.
> > 
> > Because the caller may not expect that, a partial add may not make any
> > sense to the caller. The bio code obviously doesn't care. And it
> > certainly could complicate the caller a lot, if they now need to issue
> > and wait for several bios instead of just a single one. Now a single
> > completion queue and wait_for_completion() is not enough.
> 
> The whole thing is lame by design.
> 
> As a consequence, the code for splitting these page-sized bios is being
> duplicated in md, dm and request-based devices (and maybe in other drivers).
> 
> And there is a long standing unfixable bug --- adding a device to raid1 
> may fail if the device has smaller request size than the other raid1 leg.
> Think about this:
> 
> * thread 1:
> bio_add_page() -> returns ok
> ...
> bio_add_page() -> returns ok
> * scheduling to thread 2 *
> * thread 2:
> add a raid1 leg to the block device
> * scheduling to thread 1 *
> * thread 1:
> submit_bio() --- the bio is already oversized for the new raid1 leg and there
> is no way to shrink it.
> 
> 
> I think it should be redesigned in this way:
> * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> * the code for splitting bios should be just at one place --- in the 
> generic block layer and it should work with all drivers. I.e.
> q->make_request_fn will return that the request is too big for the given 
> driver and the generic code will allocate a few more bios from a
> per-request_queue mempool and split the bio.
> * then, we would fix the raid1 bug and we would also simplify md and dm 
> --- remove the splitting code.

The design certainly isn't perfect, but neither is the one that you
suggest. For instance, what would the point of building bigger bios just
to split them _everytime_ be? That's not good design, and it's certainly
not helping performance much.

Alasdair had an idea for doing a map/unmap like interface instead, which
I think is a lot better. I don't think he ever coded it up, though.
Perhaps talk to him, if you want to improve our situation in this area.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-03 20:07             ` Jens Axboe
  2010-03-04 11:47               ` [dm-devel] " Mikulas Patocka
@ 2010-03-04 17:59               ` Lars Ellenberg
  2010-03-05 17:37                 ` Alasdair G Kergon
  1 sibling, 1 reply; 28+ messages in thread
From: Lars Ellenberg @ 2010-03-04 17:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Dmitry Monakhov, dm-devel, linux-kernel, Mike Snitzer

On Wed, Mar 03, 2010 at 09:07:34PM +0100, Jens Axboe wrote:
> > 2) What does the statement "bio_add_page() must accept at least one page"
> >    exactly mean?
> >    IMHO this means that bio_add_page() must accept at least
> >    one page with len (PAGE_SIZE - offset), or the more restricted
> >    statement that the first bio_add_page() must always be successful.
> 
> It's really 'first add must succeed', the restriction being that you
> cannot rely on that first add being more than a single page. So the rule
> is that you must accept at least a page at any offset if the bio is
> currently empty, since we know that a page is typically our IO
> granularity.

Speaking of...

dm_set_device_limits is still doing things wrong here, I think.

I posted this about two years ago, but somehow it got lost
and I lost it from my focus as well.
Reading this post reminded me ... there was something:

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4b22feb..bc34901 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -519,10 +519,22 @@ int dm_set_device_limits(struct dm_target *ti, struct dm_dev *dev,
 	 * smaller I/O, just to be safe.
 	 */
 
-	if (q->merge_bvec_fn && !ti->type->merge)
+	if (q->merge_bvec_fn && !ti->type->merge) {
 		limits->max_sectors =
 			min_not_zero(limits->max_sectors,
 				     (unsigned int) (PAGE_SIZE >> 9));
+
+		/* Restricting max_sectors is not enough.
+		 * If someone uses bio_add_page to add 8 disjunct 512 byte
+		 * partial pages to a bio, it would succeed,
+		 * but could still cross a border of whatever restrictions
+		 * are below us (raid0 stripe boundary).  An attempted
+		 * bio_split would not succeed, because bi_vcnt is 8.
+		 * E.g. the xen io layer is known to trigger this.
+		 */
+		limits->max_segments = 1;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(dm_set_device_limits);


Thanks,
	Lars

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-04 12:19                 ` Jens Axboe
@ 2010-03-04 21:55                   ` Mikulas Patocka
  2010-03-05  7:30                       ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Mikulas Patocka @ 2010-03-04 21:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: device-mapper development, Dmitry Monakhov, linux-kernel, Mike Snitzer

> > I think it should be redesigned in this way:
> > * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> > * the code for splitting bios should be just at one place --- in the 
> > generic block layer and it should work with all drivers. I.e.
> > q->make_request_fn will return that the request is too big for the given 
> > driver and the generic code will allocate a few more bios from a
> > per-request_queue mempool and split the bio.
> > * then, we would fix the raid1 bug and we would also simplify md and dm 
> > --- remove the splitting code.
> 
> The design certainly isn't perfect, but neither is the one that you
> suggest. For instance, what would the point of building bigger bios just
> to split them _everytime_ be? That's not good design, and it's certainly
> not helping performance much.

The point for building a big bio and splitting it is code size reduction.

You must realize that you need to split the request anyway. If the caller 
needs to read/write M sectors and the device can process at most N 
sectors, where N<M, then the i/o must be split. You can't get rid of that 
split.

Now, the question is, where this splitting happens. Currently, the 
splitting happens in the issuers of the bio (for example dm-io, which is a 
big chunk of code written just to work around the uncertain bio size 
limits) as well as in the consumers of the bio (md, dm and request-based 
drivers, if their maximum transfer size < PAGE_SIZE).

What I'm proposing is to put that bio splitting just in one place and 
remove it from both the issuers and the consumers of the bio.

There is just one case, where it is not desirable to create larger 
requests than the physical transfer size because of performance --- that 
is page cache readahead. That could use the current approach of querying 
the queue for limits and not stuffing more data than the device can 
handle.

In all the other (non-readahead) cases, the caller is waiting for all the 
data anyway, so there is no performance problem with creating larger
requests and splitting them.

> Alasdair had an idea for doing a map/unmap like interface instead, which
> I think is a lot better. I don't think he ever coded it up, though.
> Perhaps talk to him, if you want to improve our situation in this area.

It would increase code size, not reduce it.

> -- 
> Jens Axboe

BTW. see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock 
if the devices depend on each other. You need to use per-queue mempool.

Mikulas


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-04 21:55                   ` Mikulas Patocka
@ 2010-03-05  7:30                       ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2010-03-05  7:30 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: device-mapper development, Dmitry Monakhov, linux-kernel, Mike Snitzer

On Thu, Mar 04 2010, Mikulas Patocka wrote:
> > > I think it should be redesigned in this way:
> > > * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> > > * the code for splitting bios should be just at one place --- in the 
> > > generic block layer and it should work with all drivers. I.e.
> > > q->make_request_fn will return that the request is too big for the given 
> > > driver and the generic code will allocate a few more bios from a
> > > per-request_queue mempool and split the bio.
> > > * then, we would fix the raid1 bug and we would also simplify md and dm 
> > > --- remove the splitting code.
> > 
> > The design certainly isn't perfect, but neither is the one that you
> > suggest. For instance, what would the point of building bigger bios just
> > to split them _everytime_ be? That's not good design, and it's certainly
> > not helping performance much.
> 
> The point for building a big bio and splitting it is code size reduction.
> 
> You must realize that you need to split the request anyway. If the caller 
> needs to read/write M sectors and the device can process at most N 
> sectors, where N<M, then the i/o must be split. You can't get rid of that 
> split.

If we consider the basic (and most used) file system IO path, then
those 'M' sectors will usually be submitted in units that are smaller
than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
(those get big), only to split it up again shortly.

The point is not to build it too big to begin with, and keep the
splitting as the slow path. That's how it was originally designed, and
it'll surely stay that way. Doing the split handling in generic code is
a good idea, it definitely should be. But I'm never going to make
splitting a normal part of IO operations just because we allow
arbitrarily large bios, sorry but that's insane.

> Now, the question is, where this splitting happens. Currently, the 
> splitting happens in the issuers of the bio (for example dm-io, which is a 
> big chunk of code written just to work around the uncertain bio size 
> limits) as well as in the consumers of the bio (md, dm and request-based 
> drivers, if their maximum transfer size < PAGE_SIZE).
> 
> What I'm proposing is to put that bio splitting just in one place and 
> remove it from both the issuers and the consumers of the bio.

Agree on that, we can make that generic (and should).

> There is just one case, where it is not desirable to create larger 
> requests than the physical transfer size because of performance --- that 
> is page cache readahead. That could use the current approach of querying 
> the queue for limits and not stuffing more data than the device can 
> handle.
> 
> In all the other (non-readahead) cases, the caller is waiting for all the 
> data anyway, so there is no performance problem with creating larger
> requests and splitting them.

But this I still think is crazy, sorry.

> > Alasdair had an idea for doing a map/unmap like interface instead, which
> > I think is a lot better. I don't think he ever coded it up, though.
> > Perhaps talk to him, if you want to improve our situation in this area.
> 
> It would increase code size, not reduce it.

What kind of answer is that?

> BTW. see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock 
> if the devices depend on each other. You need to use per-queue mempool.

Only if you need to split the same original bio twice on the same IO
path.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-04 17:59               ` Lars Ellenberg
@ 2010-03-05 17:37                 ` Alasdair G Kergon
  2010-03-05 19:20                   ` Lars Ellenberg
  2010-03-08  5:43                   ` Neil Brown
  0 siblings, 2 replies; 28+ messages in thread
From: Alasdair G Kergon @ 2010-03-05 17:37 UTC (permalink / raw)
  To: device-mapper development

On Thu, Mar 04, 2010 at 06:59:21PM +0100, Lars Ellenberg wrote:
> +		/* Restricting max_sectors is not enough.
> +		 * If someone uses bio_add_page to add 8 disjunct 512 byte
> +		 * partial pages to a bio, it would succeed,
> +		 * but could still cross a border of whatever restrictions
> +		 * are below us (raid0 stripe boundary).  An attempted
> +		 * bio_split would not succeed, because bi_vcnt is 8.
> +		 * E.g. the xen io layer is known to trigger this.
> +		 */

Sounds plausible.

Do you or anyone reading this have example messages demonstrating the failure
when this patch is not applied?

Alasdair.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05 17:37                 ` Alasdair G Kergon
@ 2010-03-05 19:20                   ` Lars Ellenberg
  2010-03-08  5:43                   ` Neil Brown
  1 sibling, 0 replies; 28+ messages in thread
From: Lars Ellenberg @ 2010-03-05 19:20 UTC (permalink / raw)
  To: device-mapper development

On Fri, Mar 05, 2010 at 05:37:16PM +0000, Alasdair G Kergon wrote:
> On Thu, Mar 04, 2010 at 06:59:21PM +0100, Lars Ellenberg wrote:
> > +		/* Restricting max_sectors is not enough.
> > +		 * If someone uses bio_add_page to add 8 disjunct 512 byte
> > +		 * partial pages to a bio, it would succeed,
> > +		 * but could still cross a border of whatever restrictions
> > +		 * are below us (raid0 stripe boundary).  An attempted
> > +		 * bio_split would not succeed, because bi_vcnt is 8.
> > +		 * E.g. the xen io layer is known to trigger this.
> > +		 */
> 
> Sounds plausible.
> 
> Do you or anyone reading this have example messages demonstrating the failure
> when this patch is not applied?

This was already an issue two years ago when using Xen VMs on DRBD clusters.
My original summary post once the issue was understood is
http://archives.free.net.ph/message/20080523.161104.054c63ef.html

(the gmane link therein seems to be broken now and is not really of
interest here anyway, but, just in case, it is equivalent to
http://archives.free.net.ph/message/20080410.212155.d9e0243d.html)

Actual error messages looked like
kernel: drbd3: bio would need to, but cannot, be split: (vcnt=8,idx=0,size=4096,sector=13113083)
kernel: drbd3: bio would need to, but cannot, be split: (vcnt=8,idx=0,size=4096,sector=13113149)
kernel: drbd3: bio would need to, but cannot, be split: (vcnt=8,idx=0,size=4096,sector=13113215)

 bi_size is only 4k, but bi_vcnt is 8.

These had been submitted by Xen virtio on a typical VM image with the
legacy 63 sector offset into the first "partition".

You'd have the exact same issue (with slightly different error printk's)
on md raid0, or any other bio consumer with boundary restrictions not
having "generic" bio split logic, but directly using bio_split(),
as that does
	BUG_ON(bi->bi_vcnt != 1);
	BUG_ON(bi->bi_idx != 0);
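
For reference, the consumer-side pattern that trips those BUG_ONs looks
roughly like this (paraphrased from the raid0-style splitting path;
bio_crosses_boundary() and sectors_to_boundary stand in for the driver's
own chunk arithmetic):

	if (bio_crosses_boundary(bio)) {
		struct bio_pair *bp;

		/* bio_split() only copes with single-bvec bios ... */
		if (bio->bi_vcnt != 1 || bio->bi_idx != 0)
			goto bad_map;	/* or complain, as the drbd messages above do */

		bp = bio_split(bio, sectors_to_boundary);
		generic_make_request(&bp->bio1);
		generic_make_request(&bp->bio2);
		bio_pair_release(bp);
		return 0;
	}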

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05  7:30                       ` Jens Axboe
@ 2010-03-05 21:56                       ` Neil Brown
  2010-03-05 22:27                           ` Alasdair G Kergon
  -1 siblings, 1 reply; 28+ messages in thread
From: Neil Brown @ 2010-03-05 21:56 UTC (permalink / raw)
  To: device-mapper development
  Cc: jens.axboe, Mikulas Patocka, Dmitry Monakhov, linux-kernel, Mike Snitzer

On Fri, 5 Mar 2010 08:30:59 +0100
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Thu, Mar 04 2010, Mikulas Patocka wrote:
> > > > I think it should be redesigned in this way:
> > > > * the bio should be allowed to have arbitrary size, up to BIO_MAX_PAGES
> > > > * the code for splitting bios should be just at one place --- in the 
> > > > generic block layer and it should work with all drivers. I.e.
> > > > q->make_request_fn will return that the request is too big for the given 
> > > > driver and the generic code will allocate a few more bios from a
> > > > per-request_queue mempool and split the bio.
> > > > * then, we would fix the raid1 bug and we would also simplify md and dm 
> > > > --- remove the splitting code.
> > > 
> > > The design certainly isn't perfect, but neither is the one that you
> > > suggest. For instance, what would the point of building bigger bios just
> > > to split them _everytime_ be? That's not good design, and it's certainly
> > > not helping performance much.
> > 
> > The point for building a big bio and splitting it is code size reduction.
> > 
> > You must realize that you need to split the request anyway. If the caller 
> > needs to read/write M sectors and the device can process at most N 
> > sectors, where N<M, then the i/o must be split. You can't get rid of that 
> > split.
> 
> If we consider the basic (and most used) file system IO path, then
> those 'M' sectors will usually be submitted in units that are smaller
> than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
> (those get big), only to split it up again shortly.
> 
> The point is not to build it too big to begin with, and keep the
> splitting as the slow path. That's how it was originally designed, and
> it'll surely stay that way. Doing the split handling in generic code is
> a good idea, it definitely should be. But I'm never going to make
> splitting a normal part of IO operations just because we allow
> arbitrarily large bios, sorry but that's insane.

I think that wherever we put the splitting there is a lot to be gained by
making it easier to split things.

It is a while since I looked closely at this (about 30 months) but last time
I looked, low level devices were allowed to modify the bi_iovec.  This means
that a split has to copy the iovec as well as the bio which significantly
increased the complexity.  If we stored an index into the current bio in
'struct request' rather than using bi_idx plus modifying bv_offset, then I
think we could make bi_iovec read-only and simplify splitting significantly.

I have some patches that did this, but they are very old.  I could probably
refresh them if there was interest.


My preferred end-game would be to allow a bio of any size to be submitted to
any device.  The device would be responsible for cutting it up if necessary.
For the elevator code, I wouldn't bother making new bios, but would teach
'struct request' to be able to reference part of a bio.  So in the general
case, a 'request' would point to a list of bios, but would exclude a prefix
of the first and a suffix of the last. A single bio could then be referenced
by multiple requests if necessary.

It would probably make sense to include a 'max_sectors' concept to discourage
large bios as doing that would probably allow things to run more smoothly
in general.  But if this were a hint rather than a requirement then I think
it would make life a lot easier for a lot of code.

NeilBrown

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05 21:56                       ` [dm-devel] " Neil Brown
@ 2010-03-05 22:27                           ` Alasdair G Kergon
  0 siblings, 0 replies; 28+ messages in thread
From: Alasdair G Kergon @ 2010-03-05 22:27 UTC (permalink / raw)
  To: device-mapper development, Dmitry Monakhov, Mikulas Patocka,
	linux-kernel, Mike Snitzer, jens.axboe

On Sat, Mar 06, 2010 at 08:56:51AM +1100, Neil Brown wrote:
> My preferred end-game would be to allow a bio of any size to be submitted to
> any device.  The device would be responsible for cutting it up if necessary.

From the dm point of view, splitting is undesirable - memory allocations from
separate mempools, submitting the split-off parts could reorder/delay but must
still respect barrier constraints etc.  Splitting is the 'slow and complicated'
path for us.  We support it, but it is simpler and more efficient if the bio is
created a suitable size in the first place - and the merge_bvec_fn does this
for us most of the time.

Alasdair


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05 22:27                           ` Alasdair G Kergon
  (?)
@ 2010-03-05 23:52                           ` Neil Brown
  2010-03-06  2:20                             ` Alasdair G Kergon
  -1 siblings, 1 reply; 28+ messages in thread
From: Neil Brown @ 2010-03-05 23:52 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: device-mapper development, Dmitry Monakhov, Mikulas Patocka,
	linux-kernel, Mike Snitzer, jens.axboe

On Fri, 5 Mar 2010 22:27:48 +0000
Alasdair G Kergon <agk@redhat.com> wrote:

> On Sat, Mar 06, 2010 at 08:56:51AM +1100, Neil Brown wrote:
> > My preferred end-game would be to allow a bio of any size to be submitted to
> > any device.  The device would be responsible for cutting it up if necessary.
> 
> From the dm point of view, splitting is undesirable - memory allocations from
> separate mempools, submitting the split-off parts could reorder/delay but must
> still respect barrier constraints etc.  Splitting is the 'slow and complicated'
> path for us.  We support it, but it is simpler and more efficient if the bio is
> created a suitable size in the first place - and the merge_bvec_fn does this
> for us most of the time.

I hadn't thought about barriers in this context previously, but on reflection
I don't think splitting makes barriers noticeably more complex than they are
already.
If you have multiple devices, then you pretty much have to handle a barrier
as:
  - stall new requests
  - send zero length barrier to all devices
  - write the barrier request, either with the barrier flag set,
    or followed by a zero-length barrier
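
As a rough, purely illustrative sketch of that three-step sequence (the
struct and every helper below are invented for the example; md and dm each
have their own machinery for this):

struct stacked_dev {				/* invented for the example */
	int			  nr_members;
	struct block_device	**members;
};

static void handle_barrier(struct stacked_dev *dev, struct bio *barrier_bio)
{
	int i;

	stall_new_requests(dev);		/* 1. stop accepting new I/O  */

	for (i = 0; i < dev->nr_members; i++)	/* 2. drain what is in flight */
		issue_empty_barrier(dev->members[i]);

	/* 3. write the barrier request itself, either with the barrier flag
	 *    set or followed by another zero-length barrier */
	submit_member_io(dev, barrier_bio);

	resume_requests(dev);
}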

I don't think splitting adds complexity.
I guess if you have a single device, but still want to split a request, there
might be some optimisations you have to forgo...

But my first suggestion, that splitting could be made easier, still stands.
Maybe it doesn't have to be so complicated - then it wouldn't be so slow?

And do you honour merge_bvec_fn's of underlying devices?  A quick grep
suggests you do only for dm-linear and dm-crypt.  This suggests to me that it
is actually a hard interface to support completely in a stacked device, so we
might be better off without it.
In md, I check if merge_bvec_fn is defined and if it is, I just drop
max_sectors to 4K so it will never be needed.  I figure that a performance
drop is better than a correctness error.
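
The md check described above looks roughly like this (reconstructed from the
md code of this era, so treat the exact call names as approximate):

	/*
	 * If a member device exports a merge_bvec_fn, clamp the array's
	 * max_sectors to a single page so the function never needs to be
	 * consulted for a multi-page bio.
	 */
	if (rdev->bdev->bd_disk->queue->merge_bvec_fn &&
	    queue_max_sectors(mddev->queue) > (PAGE_SIZE >> 9))
		blk_queue_max_sectors(mddev->queue, PAGE_SIZE >> 9);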

Yes, splitting is undesirable, and if we can arrange that most requests don't
get split, then that is good.  But there will always be some requests that
have to be split - whether by bio_add_page or something lower - and I think
that it makes sense to only require that the lowest level does the
splitting, as that is where the specifics on what might be required exist.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05 23:52                           ` Neil Brown
@ 2010-03-06  2:20                             ` Alasdair G Kergon
  0 siblings, 0 replies; 28+ messages in thread
From: Alasdair G Kergon @ 2010-03-06  2:20 UTC (permalink / raw)
  To: Neil Brown
  Cc: device-mapper development, Dmitry Monakhov, Mikulas Patocka,
	linux-kernel, Mike Snitzer, jens.axboe

On Sat, Mar 06, 2010 at 10:52:37AM +1100, Neil Brown wrote:
> But my first suggestion, that splitting could be made easier, still stands.

In full agreement we should try to make it easier.

> And do you honour merge_bvec_fn's of underlying devices?  A quick grep
> suggests you do only for dm-linear and dm-crypt.  

We implemented it to address some specific performance problems reported by
people using those targets.  So far there hasn't been any pressing need to
implement it for other targets.

> This suggests to me that it
> is actually a hard interface to support completely in a stacked device, so we
> might be better off without it.

All this code is hard - but very useful.  The direction I want to explore when
I finally get some time to work on this is one which will probably make more
use of, and extend, merge_bvec.

> that it makes sense to only require that the lowest level does the
> splitting, as that is where the specifics on what might be required exist.

We have a hybrid model at the moment - some stuff dealt with top-down, other
stuff bottom-up.  This thread describes just one of several problems related to
this.  In the next incarnation of device-mapper I hope we'll find a way to
eliminate all of them together.  Maybe I'll finally get time later this year to
start working on these ideas properly.

Alasdair


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05 17:37                 ` Alasdair G Kergon
  2010-03-05 19:20                   ` Lars Ellenberg
@ 2010-03-08  5:43                   ` Neil Brown
  1 sibling, 0 replies; 28+ messages in thread
From: Neil Brown @ 2010-03-08  5:43 UTC (permalink / raw)
  To: device-mapper development; +Cc: agk

On Fri, 5 Mar 2010 17:37:16 +0000
Alasdair G Kergon <agk@redhat.com> wrote:

> On Thu, Mar 04, 2010 at 06:59:21PM +0100, Lars Ellenberg wrote:
> > +		/* Restricting max_sectors is not enough.
> > +		 * If someone uses bio_add_page to add 8 disjunct 512 byte
> > +		 * partial pages to a bio, it would succeed,
> > +		 * but could still cross a border of whatever restrictions
> > +		 * are below us (raid0 stripe boundary).  An attempted
> > +		 * bio_split would not succeed, because bi_vcnt is 8.
> > +		 * E.g. the xen io layer is known to trigger this.
> > +		 */
> 
> Sounds plausible.
> 
> Do you or anyone reading this have example messages demonstrating the failure
> when this patch is not applied?

Yes.  This

   http://marc.info/?l=linux-raid&m=126672681521073&w=2

almost certainly refers to that problem.

NeilBrown


> 
> Alasdair.
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-devel] [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks
  2010-03-05  7:30                       ` Jens Axboe
  (?)
  (?)
@ 2010-03-08  9:01                       ` Mikulas Patocka
  -1 siblings, 0 replies; 28+ messages in thread
From: Mikulas Patocka @ 2010-03-08  9:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: device-mapper development, Dmitry Monakhov, linux-kernel, Mike Snitzer

> > The point for building a big bio and splitting it is code size reduction.
> > 
> > You must realize that you need to split the request anyway. If the caller 
> > needs to read/write M sectors and the device can process at most N 
> > sectors, where N<M, then the i/o must be split. You can't get rid of that 
> > split.
> 
> If we consider the basic (and most used) file system IO path, then
> those 'M' sectors will usually be submitted in units that are smaller
> than 'N' anyway. It would be silly to allocate a bio for 'N' sectors
> (those get big), only to split it up again shortly.

I agree that the majority of IOs are page cache reads/writes and they
shouldn't be split.

> The point is not to build it too big to begin with, and keep the
> splitting as the slow path. That's how it was originally designed, and
> it'll surely stay that way. Doing the split handling in generic code is
> a good idea, it definitely should be. But I'm never going to make
> splitting a normal part of IO operations just because we allow
> arbitrarily large bios, sorry but that's insane.
>
> > There is just one case, where it is not desirable to create larger 
> > requests than the physical transfer size because of performance --- that 
> > is page cache readahead. That could use the current approach of querying 
> > the queue for limits and not stuffing more data than the device can 
> > handle.
> > 
> > In all the other (non-readahead) cases, the caller is waiting for all the 
> > data anyway, there is no performance problem with creating a larger 
> > requests and splitting them.
> 
> But this I still think is crazy, sorry.

It doesn't look crazy to me:

Case 1: the caller constructs a request to dm-io. Dm-io allocates its 
internal structure, splits the request into several bios, submits them, 
waits for them, and calls the completion routine when they all finish.

Case 2: the caller constructs a big bio (without caring about device
limits) and submits it to the block layer.  The block layer splits the bio
into several smaller bios (it doesn't even have to allocate the vector if
it splits at page boundaries), submits the smaller bios to the device and,
when they finish, signals the completion of the big bio.

Now tell me, why do you think that "Case 1" is better than "Case 2"? If 
the bio gets split, you get exactly the same number of allocations in both 
cases. If the bio isn't split, "Case 2" is even better (it saves dm-io 
overhead).
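
To make "Case 2" concrete, here is a rough sketch of how a generic splitter
could chop an oversized bio at bvec boundaries and complete the original once
every piece is done.  The split_ctx structure and both functions are invented
for the example (error handling and bi_idx are ignored); this is not the
existing fs/bio.c interface:

struct split_ctx {
	struct bio	*parent;
	atomic_t	 pending;
	int		 error;
};

static void split_end_io(struct bio *piece, int error)
{
	struct split_ctx *ctx = piece->bi_private;

	if (error)
		ctx->error = error;
	bio_put(piece);

	if (atomic_dec_and_test(&ctx->pending)) {
		bio_endio(ctx->parent, ctx->error);	/* complete the big bio */
		kfree(ctx);
	}
}

static void submit_split(struct bio *parent)
{
	struct split_ctx *ctx = kmalloc(sizeof(*ctx), GFP_NOIO);
	sector_t sector = parent->bi_sector;
	int i;

	ctx->parent = parent;
	ctx->error  = 0;
	atomic_set(&ctx->pending, parent->bi_vcnt);

	for (i = 0; i < parent->bi_vcnt; i++) {
		struct bio_vec *bv = &parent->bi_io_vec[i];
		struct bio *piece  = bio_alloc(GFP_NOIO, 1);

		piece->bi_bdev      = parent->bi_bdev;
		piece->bi_sector    = sector;
		piece->bi_rw        = parent->bi_rw;
		piece->bi_io_vec[0] = *bv;
		piece->bi_vcnt      = 1;
		piece->bi_size      = bv->bv_len;
		piece->bi_end_io    = split_end_io;
		piece->bi_private   = ctx;

		generic_make_request(piece);
		sector += bv->bv_len >> 9;
	}
}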

> > > Alasdair had an idea for doing a map/unmap like interface instead, which
> > > I think is a lot better. I don't think he ever coded it up, though.
> > > Perhaps talk to him, if you want to improve our situation in this area.
> > 
> > It would increase code size, not reduce it.
> 
> What kind of answer is that?

If you increase code size in the interface, you make the interface more 
complex and any further changes will be harder. It kills the software 
slowly.

> > BTW. see fs/bio.c:bio_split --- it uses a shared mempool. It can deadlock 
> > if the devices depend on each other. You need to use per-queue mempool.
> 
> Only if you need to split the same original bio twice on the same IO
> path.

It may happen if you use stacked MD devices with different small stripe 
sizes. But it's not common because people usually set stripe size as a 
multiple of page size.
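
A sketch of the per-queue pool idea (the split_pool field is hypothetical and
not part of struct request_queue; the stock bio_split() of this era draws from
a single global bio_split_pool in fs/bio.c):

	/* in the setup path of a queue that may need to split bios */
	q->split_pool = mempool_create_kmalloc_pool(2, sizeof(struct bio_pair));
	if (!q->split_pool)
		return -ENOMEM;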

Mikulas

> -- 
> Jens Axboe
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-03-08  9:01 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-27 17:35 [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Dmitry Monakhov
2010-02-27 17:35 ` [PATCH 2/2] blktrace: perform cleanup after setup error Dmitry Monakhov
2010-02-28 18:46   ` Jens Axboe
2010-02-28 18:46 ` [PATCH 1/2] blkdev: fix merge_bvec_fn return value checks Jens Axboe
2010-03-03  3:49   ` Dmitry Monakhov
2010-03-03  7:30     ` Jens Axboe
2010-03-03  8:39       ` Dmitry Monakhov
2010-03-03 12:21         ` Jens Axboe
2010-03-03 18:20     ` Mike Snitzer
2010-03-03 18:45       ` Dmitry Monakhov
2010-03-03 19:16         ` Mike Snitzer
2010-03-03 19:42           ` Dmitry Monakhov
2010-03-03 20:07             ` Jens Axboe
2010-03-04 11:47               ` [dm-devel] " Mikulas Patocka
2010-03-04 12:19                 ` Jens Axboe
2010-03-04 21:55                   ` Mikulas Patocka
2010-03-05  7:30                     ` Jens Axboe
2010-03-05  7:30                       ` Jens Axboe
2010-03-05 21:56                       ` [dm-devel] " Neil Brown
2010-03-05 22:27                         ` Alasdair G Kergon
2010-03-05 22:27                           ` Alasdair G Kergon
2010-03-05 23:52                           ` Neil Brown
2010-03-06  2:20                             ` Alasdair G Kergon
2010-03-08  9:01                       ` Mikulas Patocka
2010-03-04 17:59               ` Lars Ellenberg
2010-03-05 17:37                 ` Alasdair G Kergon
2010-03-05 19:20                   ` Lars Ellenberg
2010-03-08  5:43                   ` Neil Brown
