* bio too big - in nested raid setup
@ 2010-01-24 18:49 "Ing. Daniel Rozsnyó"
  2010-01-25 15:25 ` Marti Raudsepp
  0 siblings, 1 reply; 9+ messages in thread
From: "Ing. Daniel Rozsnyó" @ 2010-01-24 18:49 UTC (permalink / raw)
  To: linux-kernel

Hello,
   I am having trouble with nested RAID - when one array is added to
the other, "bio too big device md0" messages start appearing:

bio too big device md0 (144 > 8)
bio too big device md0 (248 > 8)
bio too big device md0 (32 > 8)

   From internet searches I've found no solution or report of an error
like mine, just a note that data corruption can occur when this happens.

Description:

   My setup is the following - one 2TB drive and four 500GB drives. The
goal is to mirror the 2TB drive against a linear array of the four
500GB drives.

   So.. the state without the error above is this:

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active linear sdb1[0] sde1[3] sdd1[2] sdc1[1]
       1953535988 blocks super 1.1 0k rounding

md0 : active raid1 sda2[0]
       1953447680 blocks [2/1] [U_]
       bitmap: 233/233 pages [932KB], 4096KB chunk

unused devices: <none>

   With these block request sizes:

# cat /sys/block/md{0,1}/queue/max_{,hw_}sectors_kb
127
127
127
127

   Now, I add the four-drive array to the mirror - and the system starts
showing the bio error on any significant disk activity (probably
writes only). The reboot/shutdown process is full of these errors.

   The step which messes up the system (ignore the "re-added"; it
happened the very first time I constructed the four-drive array an
hour ago):

# mdadm /dev/md0 --add /dev/md1
mdadm: re-added /dev/md1

# cat /sys/block/md{0,1}/queue/max_{,hw_}sectors_kb
4
4
127
127

The dmesg just shows this:

md: bind<md1>
RAID1 conf printout:
  --- wd:1 rd:2
  disk 0, wo:0, o:1, dev:sda2
  disk 1, wo:1, o:1, dev:md1
md: recovery of RAID array md0
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 
KB/sec) for recovery.
md: using 128k window, over a total of 1953447680 blocks.


   And as soon as a write occurs to the array:

bio too big device md0 (40 > 8)

   Removing md1 from md0 does not help the situation; I need to reboot
the machine.

   The md0 array carries LVM, with root / swap / portage / distfiles
and home logical volumes inside it.
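
   The per-layer limits can be compared in one go with something like
the following rough sketch (the dm-* entries only exist while LVM is
active, and their numbering depends on the volume layout):

# for d in /sys/block/{sda,md0,md1,dm-*}; do
    echo "${d##*/}: $(cat $d/queue/max_sectors_kb)/$(cat $d/queue/max_hw_sectors_kb) KB"
  done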

   My system is:

# uname -a
Linux desktop 2.6.32-gentoo-r1 #2 SMP PREEMPT Sun Jan 24 12:06:13 CET 
2010 i686 Intel(R) Xeon(R) CPU X3220 @ 2.40GHz GenuineIntel GNU/Linux


Thanks for any help,

Daniel



* Re: bio too big - in nested raid setup
  2010-01-24 18:49 bio too big - in nested raid setup "Ing. Daniel Rozsnyó"
@ 2010-01-25 15:25 ` Marti Raudsepp
  2010-01-25 18:27   ` Milan Broz
  0 siblings, 1 reply; 9+ messages in thread
From: Marti Raudsepp @ 2010-01-25 15:25 UTC (permalink / raw)
  To: Ing. Daniel Rozsnyó; +Cc: linux-kernel

2010/1/24 "Ing. Daniel Rozsnyó" <daniel@rozsnyo.com>:
> Hello,
>  I am having troubles with nested RAID - when one array is added to the
> other, the "bio too big device md0" messages are appearing:
>
> bio too big device md0 (144 > 8)
> bio too big device md0 (248 > 8)
> bio too big device md0 (32 > 8)

I *think* this is the same bug that I hit years ago when mixing
different disks and 'pvmove'.

It's a design flaw in the DM/MD frameworks; see comment #3 from Milan Broz:
http://bugzilla.kernel.org/show_bug.cgi?id=9401#c3

Regards,
Marti


* Re: bio too big - in nested raid setup
  2010-01-25 15:25 ` Marti Raudsepp
@ 2010-01-25 18:27   ` Milan Broz
  2010-01-28  2:28     ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Milan Broz @ 2010-01-25 18:27 UTC (permalink / raw)
  To: Marti Raudsepp
  Cc: "Ing. Daniel Rozsnyó", linux-kernel, Neil Brown

On 01/25/2010 04:25 PM, Marti Raudsepp wrote:
> 2010/1/24 "Ing. Daniel Rozsnyó" <daniel@rozsnyo.com>:
>> Hello,
>>  I am having troubles with nested RAID - when one array is added to the
>> other, the "bio too big device md0" messages are appearing:
>>
>> bio too big device md0 (144 > 8)
>> bio too big device md0 (248 > 8)
>> bio too big device md0 (32 > 8)
> 
> I *think* this is the same bug that I hit years ago when mixing
> different disks and 'pvmove'
> 
> It's a design flaw in the DM/MD frameworks; see comment #3 from Milan Broz:
> http://bugzilla.kernel.org/show_bug.cgi?id=9401#c3

Hm. I don't think it is the same problem; you are only adding a device to an md array...
(adding cc: Neil, this seems to me like an MD bug).

(original report for reference is here http://lkml.org/lkml/2010/1/24/60 )

Milan
--
mbroz@redhat.com


* Re: bio too big - in nested raid setup
  2010-01-25 18:27   ` Milan Broz
@ 2010-01-28  2:28     ` Neil Brown
  2010-01-28  9:24       ` "Ing. Daniel Rozsnyó"
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2010-01-28  2:28 UTC (permalink / raw)
  To: Milan Broz; +Cc: Marti Raudsepp, Ing. Daniel Rozsnyó, linux-kernel

On Mon, 25 Jan 2010 19:27:53 +0100
Milan Broz <mbroz@redhat.com> wrote:

> On 01/25/2010 04:25 PM, Marti Raudsepp wrote:
> > 2010/1/24 "Ing. Daniel Rozsnyó" <daniel@rozsnyo.com>:
> >> Hello,
> >>  I am having troubles with nested RAID - when one array is added to the
> >> other, the "bio too big device md0" messages are appearing:
> >>
> >> bio too big device md0 (144 > 8)
> >> bio too big device md0 (248 > 8)
> >> bio too big device md0 (32 > 8)
> > 
> > I *think* this is the same bug that I hit years ago when mixing
> > different disks and 'pvmove'
> > 
> > It's a design flaw in the DM/MD frameworks; see comment #3 from Milan Broz:
> > http://bugzilla.kernel.org/show_bug.cgi?id=9401#c3
> 
> Hm. I don't think it is the same problem, you are only adding device to md array...
> (adding cc: Neil, this seems to me like MD bug).
> 
> (original report for reference is here http://lkml.org/lkml/2010/1/24/60 )

No, I think it is the same problem.

When you have a stack of devices, the top-level client needs to know the
maximum restrictions imposed by the lower-level devices to ensure it doesn't
violate them.
However, there is no mechanism for a device to report that its restrictions
have changed.
So when md0 gains a linear leg and therefore needs to reduce the maximum
request size, there is no way to tell DM, so DM doesn't know.  And as the
filesystem only asks DM about restrictions, it never finds out about the
new ones.
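
Seen from sysfs, the stale view looks roughly like this (a sketch only; the
dm-* names depend on how LVM numbered the logical volumes):

# cat /sys/block/md0/queue/max_sectors_kb    # drops to 4 once md1 is added
# cat /sys/block/dm-*/queue/max_sectors_kb   # LVs keep the limit seen at activation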

This should be fixed by having the filesystem not care about restrictions,
and the lower levels just split requests as needed, but that just hasn't
happened....

If you completely assemble md0 before activating the LVM stuff on top of it,
this should work.
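
In practice that ordering could look roughly like the sketch below (done from
a rescue environment, since root itself lives on these LVs, and assuming the
volume group is called "vg0" - the point is only that md1 is a member of md0
before LVM scans the stack):

# vgchange -an vg0                              # deactivate the LVs on md0
# mdadm --stop /dev/md0
# mdadm --assemble /dev/md1 /dev/sd[b-e]1       # the linear array first
# mdadm --assemble /dev/md0 /dev/sda2 /dev/md1  # md0 gets its reduced limit up front
# vgchange -ay vg0                              # LVM now inherits that limit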

NeilBrown


* Re: bio too big - in nested raid setup
  2010-01-28  2:28     ` Neil Brown
@ 2010-01-28  9:24       ` "Ing. Daniel Rozsnyó"
  2010-01-28 10:50         ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: "Ing. Daniel Rozsnyó" @ 2010-01-28  9:24 UTC (permalink / raw)
  To: Neil Brown; +Cc: Milan Broz, Marti Raudsepp, linux-kernel

Neil Brown wrote:
> On Mon, 25 Jan 2010 19:27:53 +0100
> Milan Broz <mbroz@redhat.com> wrote:
> 
>> On 01/25/2010 04:25 PM, Marti Raudsepp wrote:
>>> 2010/1/24 "Ing. Daniel Rozsnyó" <daniel@rozsnyo.com>:
>>>> Hello,
>>>>  I am having troubles with nested RAID - when one array is added to the
>>>> other, the "bio too big device md0" messages are appearing:
>>>>
>>>> bio too big device md0 (144 > 8)
>>>> bio too big device md0 (248 > 8)
>>>> bio too big device md0 (32 > 8)
>>> I *think* this is the same bug that I hit years ago when mixing
>>> different disks and 'pvmove'
>>>
>>> It's a design flaw in the DM/MD frameworks; see comment #3 from Milan Broz:
>>> http://bugzilla.kernel.org/show_bug.cgi?id=9401#c3
>> Hm. I don't think it is the same problem, you are only adding device to md array...
>> (adding cc: Neil, this seems to me like MD bug).
>>
>> (original report for reference is here http://lkml.org/lkml/2010/1/24/60 )
> 
> No, I think it is the same problem.
> 
> When you have a stack of devices, the top level client needs to know the
> maximum restrictions imposed by lower level devices to ensure it doesn't
> violate them.
> However there is no mechanism for a device to report that its restrictions
> have changed.
> So when md0 gains a linear leg and so needs to reduce the max size for
> requests, there is no way to tell DM, so DM doesn't know.  And as the
> filesystem only asks DM for restrictions, it never finds out about the
> new restrictions.

Neil, why does it even reduce its block size? I've tried with both
"linear" and "raid0" (as they are the only ways to get 2T from 4x500G)
and both behave the same (sda has 512, md0 127, linear 127 and raid0 has
a 512 KB block size).

I do not see the mechanism by which 512:127 or 512:512 leads to a 4 KB limit.

Is it because:
  - of rebuilding the array?
  - of a non-multiplicative max block size?
  - of a non-multiplicative total device size?
  - of nesting?
  - of some other fallback to 1 page?

I ask because I can not believe that a pre-assembled nested stack would
result in a 4 KB max limit. But I haven't tried it yet (e.g. from a live CD).

The block device should not do this kind of "magic", unless the higher 
layers support it. Which one has proper support then?
  - standard partition table?
  - LVM?
  - filesystem drivers?

> This should be fixed by having the filesystem not care about restrictions,
> and the lower levels just split requests as needed, but that just hasn't
> happened....
> 
> If you completely assemble md0 before activating the LVM stuff on top of it,
> this should work.
> 
> NeilBrown



* Re: bio too big - in nested raid setup
  2010-01-28  9:24       ` "Ing. Daniel Rozsnyó"
@ 2010-01-28 10:50         ` Neil Brown
  2010-01-28 12:07           ` Boaz Harrosh
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2010-01-28 10:50 UTC (permalink / raw)
  To: Ing. Daniel Rozsnyó; +Cc: Milan Broz, Marti Raudsepp, linux-kernel

On Thu, 28 Jan 2010 10:24:43 +0100
"Ing. Daniel Rozsnyó" <daniel@rozsnyo.com> wrote:

> Neil Brown wrote:
> > On Mon, 25 Jan 2010 19:27:53 +0100
> > Milan Broz <mbroz@redhat.com> wrote:
> > 
> >> On 01/25/2010 04:25 PM, Marti Raudsepp wrote:
> >>> 2010/1/24 "Ing. Daniel Rozsnyó" <daniel@rozsnyo.com>:
> >>>> Hello,
> >>>>  I am having troubles with nested RAID - when one array is added to the
> >>>> other, the "bio too big device md0" messages are appearing:
> >>>>
> >>>> bio too big device md0 (144 > 8)
> >>>> bio too big device md0 (248 > 8)
> >>>> bio too big device md0 (32 > 8)
> >>> I *think* this is the same bug that I hit years ago when mixing
> >>> different disks and 'pvmove'
> >>>
> >>> It's a design flaw in the DM/MD frameworks; see comment #3 from Milan Broz:
> >>> http://bugzilla.kernel.org/show_bug.cgi?id=9401#c3
> >> Hm. I don't think it is the same problem, you are only adding device to md array...
> >> (adding cc: Neil, this seems to me like MD bug).
> >>
> >> (original report for reference is here http://lkml.org/lkml/2010/1/24/60 )
> > 
> > No, I think it is the same problem.
> > 
> > When you have a stack of devices, the top level client needs to know the
> > maximum restrictions imposed by lower level devices to ensure it doesn't
> > violate them.
> > However there is no mechanism for a device to report that its restrictions
> > have changed.
> > So when md0 gains a linear leg and so needs to reduce the max size for
> > requests, there is no way to tell DM, so DM doesn't know.  And as the
> > filesystem only asks DM for restrictions, it never finds out about the
> > new restrictions.
> 
> Neil, why does it even reduce its block size? I've tried with both 
> "linear" and "raid0" (as they are the only way to get 2T from 4x500G) 
> and both behave the same (sda has 512, md0 127, linear 127 and raid0 has 
> 512 kb block size).
> 
> I do not see the mechanism how 512:127 or 512:512 leads to 4 kb limit

Both raid0 and linear register a 'bvec_mergeable' function (or whatever it is
called today).
This allows for the fact that these devices have restrictions that cannot be
expressed simply with request sizes.  In particular they only handle requests
that don't cross a chunk boundary.

As raid1 never calls the bvec_mergeable function of its components (it would
be very hard to get that to work reliably, maybe impossible), it treats any
device with a bvec_mergeable function as though max_sectors were one page.
This is because the interface guarantees that a one-page request will always
be handled.

> 
> Is it because:
>   - of rebuilding the array?
>   - of non-multiplicative max block size
>   - of non-multiplicative total device size
>   - of nesting?
>   - of some other fallback to 1 page?

The last I guess.

> 
> I ask because I can not believe that a pre-assembled nested stack would 
> result in 4kb max limit. But I haven't tried yet (e.g. from a live cd).

When people say "I can not believe" I always chuckle to myself.  You just
aren't trying hard enough.  There is adequate evidence that people can
believe whatever they want to believe :-)

> 
> The block device should not do this kind of "magic", unless the higher 
> layers support it. Which one has proper support then?
>   - standard partition table?
>   - LVM?
>   - filesystem drivers?
> 

I don't understand this question, sorry.

Yes, there is definitely something broken here.  Unfortunately fixing it is
non-trivial.

NeilBrown


* Re: bio too big - in nested raid setup
  2010-01-28 10:50         ` Neil Brown
@ 2010-01-28 12:07           ` Boaz Harrosh
  2010-01-28 22:14             ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Boaz Harrosh @ 2010-01-28 12:07 UTC (permalink / raw)
  To: Neil Brown
  Cc: "Ing. Daniel Rozsnyó",
	Milan Broz, Marti Raudsepp, linux-kernel, Trond Myklebust,
	Andrew Morton

On 01/28/2010 12:50 PM, Neil Brown wrote:
> 
> Both raid0 and linear register a 'bvec_mergeable' function (or whatever it is
> called today).
> This allows for the fact that these devices have restrictions that cannot be
> expressed simply with request sizes.  In particular they only handle requests
> that don't cross a chunk boundary.
> 
> As raid1 never calls the bvec_mergeable function of it's components (it would
> be very hard to get that to work reliably, maybe impossible), it treats any
> device with a bvec_mergeable function as though the max_sectors were one page.
> This is because the interface guarantees that a one page request will always
> be handled.
> 

I'm also guilty of doing some mirror work, in exofs, over osd objects.

I was thinking about that reliability problem with mirrors, which is also
related to that infamous problem of copying the mirrored buffers so they do
not change while being written, at the page cache level.

So what if we don't fight it? What if we just keep a journal of the mirror's
unbalanced state and do not set page_uptodate until the mirror is finally
balanced? Only then can pages be dropped from the cache and the journal cleared.

(A balanced mirror page is one that has participated in an IO to all devices
 without being marked dirty from the start of the IO to its completion.)

I think Trond's latest work adding that un_updated-but-committed state to
pages could facilitate that, though I do understand that it is a major
conceptual change to the VFS-block-layer relationship to let block devices
participate in the page state machine (and to have md keep a journal). Sigh.

??
Boaz


* Re: bio too big - in nested raid setup
  2010-01-28 12:07           ` Boaz Harrosh
@ 2010-01-28 22:14             ` Neil Brown
  2010-01-31 15:42               ` Boaz Harrosh
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2010-01-28 22:14 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Ing. Daniel Rozsnyó,
	Milan Broz, Marti Raudsepp, linux-kernel, Trond Myklebust,
	Andrew Morton

On Thu, 28 Jan 2010 14:07:31 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> On 01/28/2010 12:50 PM, Neil Brown wrote:
> > 
> > Both raid0 and linear register a 'bvec_mergeable' function (or whatever it is
> > called today).
> > This allows for the fact that these devices have restrictions that cannot be
> > expressed simply with request sizes.  In particular they only handle requests
> > that don't cross a chunk boundary.
> > 
> > As raid1 never calls the bvec_mergeable function of it's components (it would
> > be very hard to get that to work reliably, maybe impossible), it treats any
> > device with a bvec_mergeable function as though the max_sectors were one page.
> > This is because the interface guarantees that a one page request will always
> > be handled.
> > 
> 
> I'm also guilty of doing some mirror work, in exofs, over osd objects.
> 
> I was thinking about that reliability problem with mirrors, also related
> to that infamous problem of coping the mirrored buffers so they do not
> change while writing at the page cache level.

So this is a totally new topic, right?

> 
> So what if we don't fight it? what if we just keep a journal of the mirror
> unbalanced state and do not page_uptodate until the mirror is finally balanced.
> Only then pages can be dropped from the cache, and journal cleared.

I cannot see what you are suggesting, but it seems like a layering violation.
The block device level cannot see anything about whether the page is up to
date or not.  The page it has may not even be in the page cache.

The only thing that the block device can do is make a copy of the page and
write that out twice.

If we could have a flag which the filesystem can send to say "I promise not
to change this page until the IO completes", then that copy could be
optimised away in lots of common cases.


> 
> (Balanced-mirror-page is when a page has participated in an IO to all devices
>  without being marked dirty from the get-go to the completion of IO)
> 

Block device cannot see the 'dirty' flag.


> I think Trond's last work with adding that un_updated-but-committed state to
> pages can facilitate in doing that, though I do understand that it is a major
> conceptual change to the the VFS-BLOCKS relationship in letting the block devices
> participate in the pages state machine (And md keeping a journal). Sigh
> 
> ??
> Boaz

NeilBrown


* Re: bio too big - in nested raid setup
  2010-01-28 22:14             ` Neil Brown
@ 2010-01-31 15:42               ` Boaz Harrosh
  0 siblings, 0 replies; 9+ messages in thread
From: Boaz Harrosh @ 2010-01-31 15:42 UTC (permalink / raw)
  To: Neil Brown
  Cc: "Ing. Daniel Rozsnyó",
	Milan Broz, Marti Raudsepp, linux-kernel, Trond Myklebust,
	Andrew Morton

On 01/29/2010 12:14 AM, Neil Brown wrote:
> On Thu, 28 Jan 2010 14:07:31 +0200
> Boaz Harrosh <bharrosh@panasas.com> wrote:
> 
>> On 01/28/2010 12:50 PM, Neil Brown wrote:
>>>

I'm speaking purely theoretically here, so feel free to ignore me if this
gets boring.

>>> Both raid0 and linear register a 'bvec_mergeable' function (or whatever it is
>>> called today).
>>> This allows for the fact that these devices have restrictions that cannot be
>>> expressed simply with request sizes.  In particular they only handle requests
>>> that don't cross a chunk boundary.
>>>
>>> As raid1 never calls the bvec_mergeable function of it's components (it would
>>> be very hard to get that to work reliably, maybe impossible), it treats any
>>> device with a bvec_mergeable function as though the max_sectors were one page.
>>> This is because the interface guarantees that a one page request will always
>>> be handled.
>>>
>>
>> I'm also guilty of doing some mirror work, in exofs, over osd objects.
>>
>> I was thinking about that reliability problem with mirrors, also related
>> to that infamous problem of coping the mirrored buffers so they do not
>> change while writing at the page cache level.
> 
> So this is a totally new topic, right?
> 

Not new - I'm talking about that lack of a guarantee that a page will not
change while in flight, as you mention below, which is why we need to copy
the to-be-mirrored page.

>>
>> So what if we don't fight it? what if we just keep a journal of the mirror
>> unbalanced state and do not page_uptodate until the mirror is finally balanced.
>> Only then pages can be dropped from the cache, and journal cleared.
> 
> I cannot see what you are suggesting, but it seems like a layering violation.
> The block device level cannot see anything about whether the page is up to
> date or not.  The page it has may not even be in the page cache.
> 

It is certainly a layering violation today, but theoretically speaking, it
does not have to be. An abstract API could be made so that block devices
notify when a page's IO is done; at that point the VFS can decide whether it
must resubmit, because the page changed during the IO, or whether the IO is
actually valid.

> The only thing that the block device can do is make a copy of the page and
> write that out twice.
> 

That is the copy I was referring to.

> If we could have a flag which the filesystem can send to say "I promise not
> to change this page until the IO completes", then that copy could be
> optimised away in lots of common cases.
> 

What I meant is: what if we only have that knowledge at the end of the IO,
so we can decide at that point whether the page is up to date and allowed to
be evicted from the cache? It's the same as when we have a crash/power
failure during IO: surely the mirrors are not balanced, and each device's
file content cannot be determined - some of the last written buffers are
old, some new, and some undefined. It is the role of the filesystem to keep
a journal and decide what data can be guaranteed and what data must be
reverted to a last known good state. Now what I'm wondering is: what if we
prolong this window until we know the mirrors match? The window for disaster
is wider, but that should never matter in normal use. Most setups could
tolerate the worse statistics, and could use the extra bandwidth.

> 
>>
>> (Balanced-mirror-page is when a page has participated in an IO to all devices
>>  without being marked dirty from the get-go to the completion of IO)
>>
> 
> Block device cannot see the 'dirty' flag.
> 

Right, but is there some additional information a block device should communicate
to the FS so it can make a decision?

> 
>> I think Trond's last work with adding that un_updated-but-committed state to
>> pages can facilitate in doing that, though I do understand that it is a major
>> conceptual change to the the VFS-BLOCKS relationship in letting the block devices
>> participate in the pages state machine (And md keeping a journal). Sigh
>>
>> ??
>> Boaz
> 
> NeilBrown

Thanks
Boaz

