linux-kernel.vger.kernel.org archive mirror
* data corruption using raid0+lvm2+jfs with 2.6.0-test3
@ 2003-08-12 20:45 Tupshin Harper
  2003-08-12 21:08 ` Christophe Saout
  2003-08-12 21:28 ` Andrew Morton
  0 siblings, 2 replies; 14+ messages in thread
From: Tupshin Harper @ 2003-08-12 20:45 UTC (permalink / raw)
  To: Kernel List

I have an LVM2 setup with four lvm groups. One of those groups sits on 
top of a two disk raid 0 array. When writing to JFS partitions on that 
lvm group, I get frequent, reproducible data corruption. This same setup 
works fine with 2.4.22-pre kernels. The JFS may or may not be relevant, 
since I haven't had a chance to use other filesystems as a control. 
There are a number of instances of the following message associated with 
the data corruption:

raid0_make_request bug: can't convert block across chunks or bigger than 
8k 12436792 8

The 12436792 varies widely; the rest is always the same. The error is 
coming from drivers/md/raid0.c.

-Tupshin



* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-12 20:45 data corruption using raid0+lvm2+jfs with 2.6.0-test3 Tupshin Harper
@ 2003-08-12 21:08 ` Christophe Saout
  2003-08-12 21:10   ` Tupshin Harper
  2003-08-12 21:28 ` Andrew Morton
  1 sibling, 1 reply; 14+ messages in thread
From: Christophe Saout @ 2003-08-12 21:08 UTC (permalink / raw)
  To: Tupshin Harper; +Cc: Kernel List

On Tue, 2003-08-12 at 22:45, Tupshin Harper wrote:

> I have an LVM2 setup with four lvm groups. One of those groups sits on 
> top of a two disk raid 0 array. When writing to JFS partitions on that 
> lvm group, I get frequent, reproducible data corruption. This same setup 
> works fine with 2.4.22-pre kernels. The JFS may or may not be relevant, 
> since I haven't had a chance to use other filesystems as a control. 
> There are a number of instances of the following message associated with 
> the data corruption:
> 
> raid0_make_request bug: can't convert block across chunks or bigger than 
> 8k 12436792 8
> 
> The 12436792 varies widely; the rest is always the same. The error is 
> coming from drivers/md/raid0.c.

Why don't you try using an LVM2 stripe? That does the same thing raid0
does, and I'm sure it doesn't suffer from this problem because it handles
bios in a very generic and flexible manner.

Looking at the code (drivers/md/raid0.c):

	chunk_size = mddev->chunk_size >> 10;
	block = bio->bi_sector >> 1;

	if (unlikely(chunk_size < (block & (chunk_size - 1))
				  + (bio->bi_size >> 10))) {
		...

	/* Sanity check -- queue functions should prevent this happening */
	if (bio->bi_vcnt != 1 ||
	    bio->bi_idx != 0)
		goto bad_map;	/* -> error message */

So, it looks like queue functions don't prevent this from happening.

md.c:
blk_queue_max_sectors(mddev->queue, chunk_size >> 9);
blk_queue_segment_boundary(mddev->queue, (chunk_size>>1) - 1);
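
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code, with made-up numbers) of why a request that respects
max_sectors can still straddle a chunk boundary: the limit bounds the
request's size, not its alignment.

	#include <stdio.h>

	int main(void)
	{
		/* Illustrative: 8K chunks, an 8K request starting 4K into a
		 * chunk.  max_sectors allows it (it is no bigger than one
		 * chunk), but the raid0 check quoted above still fires
		 * because the request spills into the next chunk. */
		unsigned int chunk_kb = 8;          /* chunk_size >> 10   */
		unsigned int size_kb  = 8;          /* bio->bi_size >> 10 */
		unsigned long long block = 1000004; /* 1K blocks, 4K into a chunk */
		unsigned int offset_kb = block & (chunk_kb - 1);

		if (chunk_kb < offset_kb + size_kb) /* 8 < 4 + 8 */
			printf("request crosses a chunk boundary\n");
		return 0;
	}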

I'm wondering, why can't there be more than one bvec?

--
Christophe Saout <christophe@saout.de>
Please avoid sending me Word or PowerPoint attachments.
See http://www.fsf.org/philosophy/no-word-attachments.html



* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-12 21:08 ` Christophe Saout
@ 2003-08-12 21:10   ` Tupshin Harper
  0 siblings, 0 replies; 14+ messages in thread
From: Tupshin Harper @ 2003-08-12 21:10 UTC (permalink / raw)
  To: Christophe Saout; +Cc: Kernel List

Christophe Saout wrote:

>On Tue, 2003-08-12 at 22:45, Tupshin Harper wrote:
>
>>I have an LVM2 setup with four lvm groups. One of those groups sits on 
>>top of a two disk raid 0 array. When writing to JFS partitions on that 
>>lvm group, I get frequent, reproducible data corruption. This same setup 
>>works fine with 2.4.22-pre kernels. The JFS may or may not be relevant, 
>>since I haven't had a chance to use other filesystems as a control. 
>>There are a number of instances of the following message associated with 
>>the data corruption:
>>
>>raid0_make_request bug: can't convert block across chunks or bigger than 
>>8k 12436792 8
>>
>>The 12436792 varies widely; the rest is always the same. The error is 
>>coming from drivers/md/raid0.c.
>
>Why don't you try using an LVM2 stripe? That does the same thing raid0
>does, and I'm sure it doesn't suffer from this problem because it handles
>bios in a very generic and flexible manner.
>
Yes, I'm already converting to such a setup as I type this. I thought 
that a data corruption issue was worth mentioning, however. ;-)

-Tupshin



* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-12 20:45 data corruption using raid0+lvm2+jfs with 2.6.0-test3 Tupshin Harper
  2003-08-12 21:08 ` Christophe Saout
@ 2003-08-12 21:28 ` Andrew Morton
  2003-08-12 23:05   ` Neil Brown
  1 sibling, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2003-08-12 21:28 UTC (permalink / raw)
  To: Tupshin Harper; +Cc: linux-kernel

Tupshin Harper <tupshin@tupshin.com> wrote:
>
> raid0_make_request bug: can't convert block across chunks or bigger than 
> 8k 12436792 8

There is a fix for this at

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test3/2.6.0-test3-mm1/broken-out/bio-too-big-fix.patch

Results of testing are always appreciated...


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-12 21:28 ` Andrew Morton
@ 2003-08-12 23:05   ` Neil Brown
  2003-08-13  7:25     ` Joe Thornber
  2003-08-15 21:27     ` Mike Fedyk
  0 siblings, 2 replies; 14+ messages in thread
From: Neil Brown @ 2003-08-12 23:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Tupshin Harper, linux-kernel, Jens Axboe

On Tuesday August 12, akpm@osdl.org wrote:
> Tupshin Harper <tupshin@tupshin.com> wrote:
> >
> > raid0_make_request bug: can't convert block across chunks or bigger than 
> > 8k 12436792 8
> 
> There is a fix for this at
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test3/2.6.0-test3-mm1/broken-out/bio-too-big-fix.patch
> 
> Results of testing are always appreciated...

I don't think this will help.  It is a different problem.

As far as I can tell, the problem is that dm doesn't honour the
merge_bvec_fn of the underlying device (neither does md for that
matter).
I think it does honour the max_sectors restriction, so it will only
allow a request as big as one chunk, but it will allow such a request
to span a chunk boundary.

Probably the simplest solution to this is to put in calls to
bio_split, which will need to be strengthened to handle multi-page bios.

The policy would be:
  "a client of a block device *should* honour the various bio size 
   restrictions, and may suffer performance loss if it doesn't;
   a block device driver *must* handle any bio it is passed, and may
   call bio_split to help out".
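
As a rough illustration (plain C with made-up numbers, not a patch): the
first fragment such a split would keep is simply the run of sectors up to
the next chunk boundary, which is the length a caller would hand to
bio_split for the first half.

	#include <stdio.h>

	/* Sectors left before the next chunk boundary, i.e. the size of the
	 * first fragment a splitting driver would keep.  chunk_sects must be
	 * a power of two, as md chunk sizes are. */
	static unsigned int first_half_sectors(unsigned long long sector,
					       unsigned int chunk_sects)
	{
		return chunk_sects - (unsigned int)(sector & (chunk_sects - 1));
	}

	int main(void)
	{
		/* 8K chunks = 16 sectors; a bio starting 10 sectors into a
		 * chunk would be split after 6 sectors. */
		printf("split after %u sectors\n",
		       first_half_sectors(1000010ULL, 16));
		return 0;
	}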

A better solution, which is too much for 2.6.0, is to have a cleaner
interface wherein the client of the block device uses a two-stage
process to submit requests.
Firstly it says:
  I want to do IO at this location, what is the max number of sectors
  allowed?
Then it adds pages to the bio up to that limit.
Finally it says:
  OK, here is the request.

The first and final stages have to be properly paired so that a
device knows if there are any pending requests and can hold off any
device reconfiguration until all pending requests have completed.
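
A purely hypothetical sketch of that interface (none of these names exist
in the kernel; this is just the description above turned into toy C to
show the reserve/submit pairing):

	#include <stdio.h>

	struct toy_dev {
		unsigned int max_sectors; /* limit the device advertises    */
		int pending;              /* reservations not yet submitted */
	};

	/* Stage 1: "I want to do IO at this location -- how much may I send?"
	 * A real device would also account for chunk boundaries at 'sector'. */
	static unsigned int toy_reserve(struct toy_dev *d, unsigned long long sector)
	{
		(void)sector;
		d->pending++;
		return d->max_sectors;
	}

	/* Stage 2, paired with stage 1: "OK, here is the request." */
	static void toy_submit(struct toy_dev *d, unsigned long long sector,
			       unsigned int nr_sects)
	{
		printf("submitting %u sectors at %llu\n", nr_sects, sector);
		d->pending--; /* reconfiguration may proceed once this hits 0 */
	}

	int main(void)
	{
		struct toy_dev d = { .max_sectors = 16, .pending = 0 };
		unsigned long long sector = 1000000;

		unsigned int max = toy_reserve(&d, sector);
		/* ...the client would add pages to its bio up to 'max' sectors... */
		toy_submit(&d, sector, max);
		return 0;
	}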

NeilBrown


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-12 23:05   ` Neil Brown
@ 2003-08-13  7:25     ` Joe Thornber
  2003-08-15 21:27     ` Mike Fedyk
  1 sibling, 0 replies; 14+ messages in thread
From: Joe Thornber @ 2003-08-13  7:25 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> A better solution, which is too much for 2.6.0, is to have a cleaner
> interface wherein the client of the block device uses a two-stage
> process to submit requests.
> Firstly it says:
>   I want to do IO at this location, what is the max number of sectors
>   allowed?
> Then it adds pages to the bio up to that limit.
> Finally it says:
>   OK, here is the request.
> 
> The first and final stages have to be properly paired so that a
> device knows if there are any pending requests and can hold off any
> device reconfiguration until all pending requests have completed.

This is exactly what I'd like to do.  The merge_bvec_fn is unusable by
dm (and probably md) because this function is mapping-specific, so
dm_suspend/dm_resume need to be lifted above it.

- Joe


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-12 23:05   ` Neil Brown
  2003-08-13  7:25     ` Joe Thornber
@ 2003-08-15 21:27     ` Mike Fedyk
  2003-08-16  8:00       ` Neil Brown
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Fedyk @ 2003-08-15 21:27 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> On Tuesday August 12, akpm@osdl.org wrote:
> > Tupshin Harper <tupshin@tupshin.com> wrote:
> > >
> > > raid0_make_request bug: can't convert block across chunks or bigger than 
> > > 8k 12436792 8
> > 
> > There is a fix for this at
> > 
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.0-test3/2.6.0-test3-mm1/broken-out/bio-too-big-fix.patch
> > 
> > Results of testing are always appreciated...
> 
> I don't think this will help.  It is a different problem.
> 
> As far as I can tell, the problem is that dm doesn't honour the
> merge_bvec_fn of the underlying device (neither does md for that
> matter).
> I think it does honour the max_sectors restriction, so it will only
> allow a request as big as one chunk, but it will allow such a request 
> to span a chunk boundary.
> 
> Probably the simplest solution to this is to put in calls to
> bio_split, which will need to be strengthened to handle multi-page bios.
> 
> The policy would be:
>   "a client of a block device *should* honour the various bio size 
>    restrictions, and may suffer performance loss if it doesn't;
>    a block device driver *must* handle any bio it is passed, and may
>    call bio_split to help out".
> 

Any progress on this?


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-15 21:27     ` Mike Fedyk
@ 2003-08-16  8:00       ` Neil Brown
  2003-08-16 14:18         ` Lars Marowsky-Bree
  2003-08-16 23:52         ` Mike Fedyk
  0 siblings, 2 replies; 14+ messages in thread
From: Neil Brown @ 2003-08-16  8:00 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Friday August 15, mfedyk@matchmail.com wrote:
> On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> > On Tuesday August 12, akpm@osdl.org wrote:
> > > Tupshin Harper <tupshin@tupshin.com> wrote:
> > > >
> > > > raid0_make_request bug: can't convert block across chunks or bigger than 
> > > > 8k 12436792 8

> > 
> > Probably the simplest solution to this is to put in calls to
> > bio_split, which will need to be strengthened to handle multi-page bios.
> > 
> > The policy would be:
> >   "a client of a block device *should* honour the various bio size 
> >    restrictions, and may suffer performance loss if it doesn't;
> >    a block device driver *must* handle any bio it is passed, and may
> >    call bio_split to help out".
> > 
> 
> Any progress on this?

No, and I doubt there will be in a big hurry, unless I come up with an
easy way to make lvm-over-raid0 break instantly instead of eventually.

I think that for now you should assume that lvm over raid0 (or raid0
over lvm) simply isn't supported.  As lvm (aka dm) supports striping,
it shouldn't be needed.

NeilBrown


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-16  8:00       ` Neil Brown
@ 2003-08-16 14:18         ` Lars Marowsky-Bree
  2003-08-16 23:52         ` Mike Fedyk
  1 sibling, 0 replies; 14+ messages in thread
From: Lars Marowsky-Bree @ 2003-08-16 14:18 UTC (permalink / raw)
  To: Neil Brown, Mike Fedyk
  Cc: Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe


On 2003-08-16T18:00:21,
   Neil Brown <neilb@cse.unsw.edu.au> said:

> I think that for now you should assume that lvm over raid0 (or raid0
> over lvm) simply isn't supported.  As lvm (aka dm) supports striping,
> it shouldn't be needed.

Can raid0 detect that it is being accessed via DM, 'fail fast', and
refuse to ever come up?

This probably also suggests that the lvm2 and evms2 folks should refuse
to set this up in their tools...


Sincerely,
    Lars Marowsky-Brée <lmb@suse.de>

-- 
SuSE Labs - Research & Development, SuSE Linux AG
 
High Availability, n.: Patching up complex systems with even more complexity.



* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-16  8:00       ` Neil Brown
  2003-08-16 14:18         ` Lars Marowsky-Bree
@ 2003-08-16 23:52         ` Mike Fedyk
  2003-08-17  0:12           ` Neil Brown
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Fedyk @ 2003-08-16 23:52 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Sat, Aug 16, 2003 at 06:00:21PM +1000, Neil Brown wrote:
> On Friday August 15, mfedyk@matchmail.com wrote:
> > On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> > > On Tuesday August 12, akpm@osdl.org wrote:
> > > > Tupshin Harper <tupshin@tupshin.com> wrote:
> > > > >
> > > > > raid0_make_request bug: can't convert block across chunks or bigger than 
> > > > > 8k 12436792 8
> 
> > > 
> > > Probably the simplest solution to this is to put in calls to
> > > bio_split, which will need to be strengthened to handle multi-page bios.
> > > 
> > > The policy would be:
> > >   "a client of a block device *should* honour the various bio size 
> > >    restrictions, and may suffer performance loss if it doesn't;
> > >    a block device driver *must* handle any bio it is passed, and may
> > >    call bio_split to help out".
> > > 
> > 
> > Any progress on this?
> 
> No, and I doubt there will be in a big hurry, unless I come up with an
> easy way to make lvm-over-raid0 break instantly instead of eventually.
> 
> I think that for now you should assume that lvm over raid0 (or raid0
> over lvm) simply isn't supported.  As lvm (aka dm) supports striping,
> it shouldn't be needed.

I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
drives in a linear md "array".

I'm guessing this will hit this bug too?

I have a couple of systems that use software raid5, and I'll avoid putting
2.6-test on them until I know the raid code is more reliable (or is this
only an issue with md+lvm?)



* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-16 23:52         ` Mike Fedyk
@ 2003-08-17  0:12           ` Neil Brown
  2003-08-17 17:50             ` Mike Fedyk
  0 siblings, 1 reply; 14+ messages in thread
From: Neil Brown @ 2003-08-17  0:12 UTC (permalink / raw)
  To: Mike Fedyk
  Cc: Joe Thornber, Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Saturday August 16, mfedyk@matchmail.com wrote:
> On Sat, Aug 16, 2003 at 06:00:21PM +1000, Neil Brown wrote:
> > On Friday August 15, mfedyk@matchmail.com wrote:
> > > On Wed, Aug 13, 2003 at 09:05:58AM +1000, Neil Brown wrote:
> > > > On Tuesday August 12, akpm@osdl.org wrote:
> > > > > Tupshin Harper <tupshin@tupshin.com> wrote:
> > > > > >
> > > > > > raid0_make_request bug: can't convert block across chunks or bigger than 
> > > > > > 8k 12436792 8
> > 
> > > > 
> > > > Probably the simplest solution to this is to put in calls to
> > > > bio_split, which will need to be strengthened to handle multi-page bios.
> > > > 
> > > > The policy would be:
> > > >   "a client of a block device *should* honour the various bio size 
> > > >    restrictions, and may suffer performance loss if it doesn't;
> > > >    a block device driver *must* handle any bio it is passed, and may
> > > >    call bio_split to help out".
> > > > 
> > > 
> > > Any progress on this?
> > 
> > No, and I doubt there will be in a big hurry, unless I come up with an
> > easy way to make lvm-over-raid0 break instantly instead of eventually.
> > 
> > I think that for now you should assume that lvm over raid0 (or raid0
> > over lvm) simply isn't supported.  As lvm (aka dm) supports striping,
> > it shouldn't be needed.
> 
> I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> drives in a linear md "array".
> 
> I'm guessing this will hit this bug too?

This should be safe.  raid5 only ever submits 1-page (4K) requests
that are page aligned, and linear arrays will have the boundary
between drives 4k aligned (actually "chunksize" aligned, and chunksize
is at least 4k). 

So raid5 should be safe over everything (unless dm allows striping
with a chunk size less than pagesize).

Thinks: as an interim solution for other raid levels - if the
underlying device has a merge_bvec_function which is being ignored, we
could set max_sectors to PAGE_SIZE/512.  This should be safe, though
possibly not optimal (but "safe" trumps "optimal" any day).
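
A rough, untested sketch of that interim idea (kernel-flavoured pseudocode,
not a patch; the queue field and helper names are assumptions based on the
calls already quoted earlier in this thread):

	/* If a component device defines a merge_bvec_fn that this stacking
	 * driver is going to ignore, never accept bios larger than one page. */
	static void clamp_when_ignoring_merge_bvec(request_queue_t *q,
						   struct block_device *component)
	{
		request_queue_t *cq = bdev_get_queue(component);

		if (cq && cq->merge_bvec_fn)
			blk_queue_max_sectors(q, PAGE_SIZE >> 9); /* 8 sectors on 4K pages */
	}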

NeilBrown
> 
> I have a couple of systems that use software raid5, and I'll avoid putting
> 2.6-test on them until I know the raid code is more reliable (or is this
> only an issue with md+lvm?)


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-17  0:12           ` Neil Brown
@ 2003-08-17 17:50             ` Mike Fedyk
  2003-08-17 23:14               ` Neil Brown
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Fedyk @ 2003-08-17 17:50 UTC (permalink / raw)
  To: Neil Brown
  Cc: Joe Thornber, Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> On Saturday August 16, mfedyk@matchmail.com wrote:
> > I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> > drives in a linear md "array".
> > 
> > I'm guessing this will hit this bug too?
> 
> This should be safe.  raid5 only ever submits 1-page (4K) requests
> that are page aligned, and linear arrays will have the boundary
> between drives 4k aligned (actually "chunksize" aligned, and chunksize
> is at least 4k). 
> 

So why does this hit with raid0?  Is lvm2 on top of md the problem, while
md on top of lvm2 is OK?

> So raid5 should be safe over everything (unless dm allows striping
> with a chunk size less than pagesize).
> 
> Thinks: as an interim solution for other raid levels - if the
> underlying device has a merge_bvec_function which is being ignored, we
> could set max_sectors to PAGE_SIZE/512.  This should be safe, though
> possibly not optimal (but "safe" trumps "optimal" any day).

Assuming that sectors are always 512 bytes (true for any hard drive I've
seen) that will be 512 * 8 = one 4k page.

Any chance sector != 512?


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-17 17:50             ` Mike Fedyk
@ 2003-08-17 23:14               ` Neil Brown
  2003-08-18  0:28                 ` Mike Fedyk
  0 siblings, 1 reply; 14+ messages in thread
From: Neil Brown @ 2003-08-17 23:14 UTC (permalink / raw)
  To: Mike Fedyk
  Cc: Joe Thornber, Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Sunday August 17, mfedyk@matchmail.com wrote:
> On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> > On Saturday August 16, mfedyk@matchmail.com wrote:
> > > I have a raid5 with "4" 18gb drives, and one of the "drives" is two 9gb
> > > drives in a linear md "array".
> > > 
> > > I'm guessing this will hit this bug too?
> > 
> > This should be safe.  raid5 only ever submits 1-page (4K) requests
> > that are page aligned, and linear arrays will have the boundary
> > between drives 4k aligned (actually "chunksize" aligned, and chunksize
> > is at least 4k). 
> > 
> 
> So why does this hit with raid0?  Is lvm2 on top of md the problem, while
> md on top of lvm2 is OK?
> 

The various raid levels under md are in many ways quite independent.
You cannot generalise about "md works" or "md doesn't work"; you have
to talk about the specific raid levels.

The problem happens when 
 1/ The underlying device defines a merge_bvec_fn, and 
 2/ the driver (meta-device) that uses that device
     2a/  does not honour the merge_bvec_fn, and 
     2b/  passes on requests larger than one page.

md/linear, md/raid0, and lvm all define a merge_bvec_fn, and so can be
a problem as an underlying device by point (1).

md/* and lvm do not honour merge_bvec_fn and so can be a problem as a
meta-device by 2a.
However md/raid5 never passes on requests larger than one page, so it
escapes being a problem by point 2b.

So the problem can happen with
  md/linear, md/raid0, or lvm being a component of 
  md/linear, md/raid0, md/raid1, md/multipath, lvm.

(I have possibly oversimplified the situation with lvm due to lack of
precise knowledge).

I hope that clarifies the situation.
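
Restated as a tiny predicate (plain C, invented names, purely
illustrative):

	/* True when a stacked setup can hit the problem described above:
	 * the component advertises a merge_bvec_fn (1), the driver on top
	 * ignores it (2a) and passes on bios larger than one page (2b). */
	static int stack_can_corrupt(int component_has_merge_bvec_fn,
				     int meta_honours_merge_bvec_fn,
				     int meta_sends_multi_page_bios)
	{
		return component_has_merge_bvec_fn &&  /* point 1  */
		       !meta_honours_merge_bvec_fn &&  /* point 2a */
		       meta_sends_multi_page_bios;     /* point 2b */
	}

	/* e.g. raid5 over raid0: (1, 0, 0) -> 0, safe as argued above;
	 *      lvm over raid0:   (1, 0, 1) -> 1, the reported corruption. */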

> > So raid5 should be safe over everything (unless dm allows striping
> > with a chunk size less than pagesize).
> > 
> > Thinks: as an interim solution for other raid levels - if the
> > underlying device has a merge_bvec_function which is being ignored, we
> > could set max_sectors to PAGE_SIZE/512.  This should be safe, though
> > possibly not optimal (but "safe" trumps "optimal" any day).
> 
> Assuming that sectors are always 512 bytes (true for any hard drive I've
> seen) that will be 512 * 8 = one 4k page.
> 
> Any chance sector != 512?

No.  'sector' in the kernel means '512 bytes'.
Some devices might require requests to be at least 2 sectors long and
have even sector addresses because they have physical sectors that are
1K, but the parameters like max_sectors are still in multiples of 512
bytes.

NeilBrown


* Re: data corruption using raid0+lvm2+jfs with 2.6.0-test3
  2003-08-17 23:14               ` Neil Brown
@ 2003-08-18  0:28                 ` Mike Fedyk
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Fedyk @ 2003-08-18  0:28 UTC (permalink / raw)
  To: Neil Brown
  Cc: Joe Thornber, Andrew Morton, Tupshin Harper, linux-kernel, Jens Axboe

On Mon, Aug 18, 2003 at 09:14:07AM +1000, Neil Brown wrote:
> The various raid levels under md are in many ways quite independent.
> You cannot generalise about "md works" or "md doesn't work"; you have
> to talk about the specific raid levels.
> 
> The problem happens when 
>  1/ The underlying device defines a merge_bvec_fn, and 
>  2/ the driver (meta-device) that uses that device
>      2a/  does not honour the merge_bvec_fn, and 
>      2b/  passes on requests larger than one page.
> 
> md/linear, md/raid0, and lvm all define a merge_bvec_fn, and so can be
> a problem as an underlying device by point (1).
> 
> md/* and lvm do not honour merge_bvec_fn and so can be a problem as a
> meta-device by 2a.
> However md/raid5 never passes on requests larger than one page, so it
> escapes being a problem by point 2b.
> 
> So the problem can happen with
>   md/linear, md/raid0, or lvm being a component of 
>   md/linear, md/raid0, md/raid1, md/multipath, lvm.
> 
> (I have possibly oversimplified the situation with lvm due to lack of
> precise knowledge).
> 
> I hope that clarifies the situation.

Thanks Neil.  That was very informative. :)

> > On Sun, Aug 17, 2003 at 10:12:27AM +1000, Neil Brown wrote:
> > > So raid5 should be safe over everything (unless dm allows striping
> > > with a chunk size less than pagesize).
> > > 
> > > Thinks: as an interim solution for other raid levels - if the
> > > underlying device has a merge_bvec_function which is being ignored, we
> > > could set max_sectors to PAGE_SIZE/512.  This should be safe, though
> > > possibly not optimal (but "safe" trumps "optimal" any day).
> > 
> > Assuming that sectors are always 512 bytes (true for any hard drive I've
> > seen) that will be 512 * 8 = one 4k page.
> > 
> > Any chance sector != 512?
> 
> No.  'sector' in the kernel means '512 bytes'.
> Some devices might require requests to be at least 2 sectors long and
> have even sector addresses because they have physical sectors that are
> 1K, but the parameters like max_sectors are still in multiples of 512
> bytes.
> 
> NeilBrown

Any idea of the ETA for that nice interim patch?

And if there is already a merge_bvec_function defined and coded, why is it
not being used?!  Isn't that supposed to be detected by the BIO subsystem,
and used automatically when it's defined?  Or were there some bugs found in
it and it was disabled temporarily?  Maybe not everyone agrees on bio
splitting/merging?  (I seem to recall a thread about that a while back, but I
thought it was resolved...)

