All of lore.kernel.org
* filesystem stripe parameters
@ 2009-06-18 19:08 Wil Reichert
  2009-06-19  9:15 ` Michael Tokarev
  0 siblings, 1 reply; 7+ messages in thread
From: Wil Reichert @ 2009-06-18 19:08 UTC (permalink / raw)
  To: linux raid

When using LVM on top of RAID 5, is it still worthwhile to pass RAID
stripe information to the filesystem on creation?  Or do the PE's in
LVM blur the specific stripe sizes & I'd want to use some multiple of
those instead?

Wil

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: filesystem stripe parameters
  2009-06-18 19:08 filesystem stripe parameters Wil Reichert
@ 2009-06-19  9:15 ` Michael Tokarev
  2009-06-19  9:36   ` Robin Hill
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Michael Tokarev @ 2009-06-19  9:15 UTC (permalink / raw)
  To: Wil Reichert; +Cc: linux raid

Wil Reichert wrote:
> When using LVM on top of RAID 5, is it still worthwhile to pass RAID
> stripe information to the filesystem on creation?  Or do the PE's in
> LVM blur the specific stripe sizes & I'd want to use some multiple of
> those instead?

It's a very good question, especially in context of RAID5.

Yes, it is still a good idea to pass that info, because it is still
RAID5, which requires proper treatment wrt unaligned writes and
maintaining redundancy.

But the thing is that RAID5 and LVM do not play well together UNLESS
the RAID5 consists of 3, 5 or 9 (or 17, etc.) drives -- i.e. 2^N+1, so
that there are 2^N data drives.

This is because LVM can only use a block size that is a power of two,
and to be useful that block size should be a multiple of the RAID5
data row size (the stripe width).

This is only possible when the RAID5 has 2^N data drives, i.e. 2^N+1
total drives.  The same holds for RAID4; for RAID6 it's 2^N+2, since
RAID6 has 2 parity drives.
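A quick shell check makes the divisibility constraint concrete (the
64 KiB chunk and 4 MiB extent below are assumed defaults, not numbers
from this thread):

```shell
# Assumed geometry: 64 KiB chunk, 4 MiB LVM extent.
chunk=65536
extent=$((4 * 1024 * 1024))

for data_drives in 2 3 4; do
    row=$((chunk * data_drives))   # data per RAID5 row (stripe width)
    if [ $((extent % row)) -eq 0 ]; then
        echo "$data_drives data drives: row $row divides the extent -> aligned"
    else
        echo "$data_drives data drives: row $row does not divide the extent"
    fi
done
```

Only the power-of-two data-drive counts (2 and 4) come out aligned; 3
data drives (a 4-drive array) does not.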

But if you can't match the LVM block size to the RAID stripe size,
there's *almost* no point in telling the raid parameters to the
filesystem: no matter how hard you try, LVM will make the whole thing
non-optimal.

Ok, depending on the number of drives, *some* logical volumes will
be properly aligned, but definitely not all of them.  For example
on a 4-drive RAID5 array, only every 3rd volume will be ok, --
provided the volumes are allocated in full one after another,
without holes and fragmentation.
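The 4-drive case can be verified with shell arithmetic (again assuming
a 64 KiB chunk and 4 MiB extents, with volumes allocated back to back):

```shell
# 4-drive RAID5 -> 3 data drives per row; assumed 64 KiB chunk, 4 MiB extents.
chunk=65536
row=$((chunk * 3))
extent=$((4 * 1024 * 1024))

for n in 0 1 2 3 4 5; do
    off=$((n * extent))   # byte offset where a volume at extent n starts
    [ $((off % row)) -eq 0 ] && s="aligned" || s="misaligned"
    echo "volume starting at extent $n: $s"
done
```

Only extents 0 and 3 (every 3rd) start on a row boundary.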

When everything is properly aligned, it's still worth the effort
IMHO to tell the filesystem about true raid properties.

/mjt


* Re: filesystem stripe parameters
  2009-06-19  9:15 ` Michael Tokarev
@ 2009-06-19  9:36   ` Robin Hill
  2009-06-19 20:59   ` Justin Perreault
  2009-06-20  0:26   ` Wil Reichert
  2 siblings, 0 replies; 7+ messages in thread
From: Robin Hill @ 2009-06-19  9:36 UTC (permalink / raw)
  To: linux raid


On Fri Jun 19, 2009 at 01:15:41PM +0400, Michael Tokarev wrote:

> Wil Reichert wrote:
>> When using LVM on top of RAID 5, is it still worthwhile to pass RAID
>> stripe information to the filesystem on creation?  Or do the PE's in
>> LVM blur the specific stripe sizes & I'd want to use some multiple of
>> those instead?
>
> It's a very good question, especially in context of RAID5.
>
> Yes it is still a good idea to pass that info because it is still a
> RAID5 which requires proper treatment wrt unaligned writes and keeping
> redundancy.
>
> But the thing is that RAID5 and LVM are not good to each other UNLESS
> RAID5 consists of 3, 5 or 9 (or 17 etc) drives -- i.e. 2^N+1, so that
> there's 2^N data drives.
>
> This is because LVM can only have blocksize as a power of two and in
> order to be useful that blocksize should be a multiple of RAID5 data
> row size (stripe size etc).
>
> This is only possible when RAID5 has 2^N data drives or 2^N+1 total
> drives.  The same is for RAID4, and for RAID6 it's 2^N+2 since RAID6
> has 2 parity drives.
>
> But if you can't match LVM blocksize and RAID strip size, there's
> *almost* no point at telling raid parameters to the filesystem: no
> matter how hard you'll try, LVM will make the whole thing non-optimal.
>
> Ok, depending on the number of drives, *some* logical volumes will
> be properly aligned, but definitely not all of them.  For example
> on a 4-drive RAID5 array, only every 3rd volume will be ok, --
> provided the volumes are allocated in full one after another,
> without holes and fragmentation.
>
> When everything is properly aligned, it's still worth the effort
> IMHO to tell the filesystem about true raid properties.
>
You'll also need to get the LVM header padding right, otherwise the
filesystem won't start on a stripe boundary and the alignment will all
go wrong.  I've no idea of the syntax but I recall seeing it discussed
on here several times.  A quick search throws up:
    http://www.issociate.de/board/goto/1859627/stride_/_stripe_alignment_on_LVM_?.html
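As an illustration of the header problem (both numbers are assumptions:
a 192 KiB default LVM data offset, roughly what LVM used at the time,
and a 256 KiB stripe width):

```shell
# Assumed: LVM places data at a 192 KiB offset (pe_start) by default,
# on an array whose data row (stripe width) is 256 KiB.
pe_start=$((192 * 1024))
row=$((256 * 1024))

if [ $((pe_start % row)) -eq 0 ]; then
    echo "pe_start is stripe-aligned"
else
    echo "pe_start is NOT stripe-aligned -> every LV inherits the skew"
fi
```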

Cheers,
    Robin

-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |



* Re: filesystem stripe parameters
  2009-06-19  9:15 ` Michael Tokarev
  2009-06-19  9:36   ` Robin Hill
@ 2009-06-19 20:59   ` Justin Perreault
  2009-06-20  6:35     ` Michael Tokarev
  2009-06-20  0:26   ` Wil Reichert
  2 siblings, 1 reply; 7+ messages in thread
From: Justin Perreault @ 2009-06-19 20:59 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Wil Reichert, linux raid

Still learning, please be gentle.

On Fri, 2009-06-19 at 13:15 +0400, Michael Tokarev wrote:
> Wil Reichert wrote:
> > When using LVM on top of RAID 5, is it still worthwhile to pass RAID
> > stripe information to the filesystem on creation?  Or do the PE's in
> > LVM blur the specific stripe sizes & I'd want to use some multiple of
> > those instead?
> Yes it is still a good idea to pass that info because it is still a
> RAID5 which requires proper treatment wrt unaligned writes and keeping
> redundancy.
> 
> But the thing is that RAID5 and LVM are not good to each other UNLESS
> RAID5 consists of 3, 5 or 9 (or 17 etc) drives -- i.e. 2^N+1, so that
> there's 2^N data drives.
> 
> This is because LVM can only have blocksize as a power of two and in
> order to be useful that blocksize should be a multiple of RAID5 data
> row size (stripe size etc).
> 
> This is only possible when RAID5 has 2^N data drives or 2^N+1 total
> drives.  The same is for RAID4, and for RAID6 it's 2^N+2 since RAID6
> has 2 parity drives.
> 
> But if you can't match LVM blocksize and RAID strip size, there's
> *almost* no point at telling raid parameters to the filesystem: no
> matter how hard you'll try, LVM will make the whole thing non-optimal.

2.5 questions:

1) Will this same issue affect a 5+0 raid array?

2) It is implied that one can choose not to tell the filesystem the
raid parameters; what negative effect does omitting them have?
Conversely, what positive effect does providing them have?

Thanks,
Justin





* Re: filesystem stripe parameters
  2009-06-19  9:15 ` Michael Tokarev
  2009-06-19  9:36   ` Robin Hill
  2009-06-19 20:59   ` Justin Perreault
@ 2009-06-20  0:26   ` Wil Reichert
  2009-06-20  6:19     ` Michael Tokarev
  2 siblings, 1 reply; 7+ messages in thread
From: Wil Reichert @ 2009-06-20  0:26 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: linux raid

On Fri, Jun 19, 2009 at 2:15 AM, Michael Tokarev<mjt@tls.msk.ru> wrote:
> Wil Reichert wrote:
>>
>> When using LVM on top of RAID 5, is it still worthwhile to pass RAID
>> stripe information to the filesystem on creation?  Or do the PE's in
>> LVM blur the specific stripe sizes & I'd want to use some multiple of
>> those instead?
>
> It's a very good question, especially in context of RAID5.
>
> Yes it is still a good idea to pass that info because it is still a
> RAID5 which requires proper treatment wrt unaligned writes and keeping
> redundancy.
>
> But the thing is that RAID5 and LVM are not good to each other UNLESS
> RAID5 consists of 3, 5 or 9 (or 17 etc) drives -- i.e. 2^N+1, so that
> there's 2^N data drives.
>
> This is because LVM can only have blocksize as a power of two and in
> order to be useful that blocksize should be a multiple of RAID5 data
> row size (stripe size etc).
>
> This is only possible when RAID5 has 2^N data drives or 2^N+1 total
> drives.  The same is for RAID4, and for RAID6 it's 2^N+2 since RAID6
> has 2 parity drives.
>
> But if you can't match LVM blocksize and RAID strip size, there's
> *almost* no point at telling raid parameters to the filesystem: no
> matter how hard you'll try, LVM will make the whole thing non-optimal.
>
> Ok, depending on the number of drives, *some* logical volumes will
> be properly aligned, but definitely not all of them.  For example
> on a 4-drive RAID5 array, only every 3rd volume will be ok, --
> provided the volumes are allocated in full one after another,
> without holes and fragmentation.
>
> When everything is properly aligned, it's still worth the effort
> IMHO to tell the filesystem about true raid properties.

Several questions answered, more questions arise =)

I'm using 3 1TB disks, so it seems I'm in luck.  My chunk size is
128k, my PE size is the default 4M.  Using mkfs.ext4 as an example, it
takes stride (chunk) and stripe-width (chunk * (N-1)) parameters.  So
which would be optimal when creating the filesystem -- the RAID values
(i.e. 128k, 256k) or the LVM values (i.e. 4M, 8M) -- or is there no
right answer, and it just depends on the usage pattern?

Wil
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: filesystem stripe parameters
  2009-06-20  0:26   ` Wil Reichert
@ 2009-06-20  6:19     ` Michael Tokarev
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Tokarev @ 2009-06-20  6:19 UTC (permalink / raw)
  To: Wil Reichert; +Cc: linux raid

Wil Reichert wrote:
> On Fri, Jun 19, 2009 at 2:15 AM, Michael Tokarev<mjt@tls.msk.ru> wrote:
[]
>> When everything is properly aligned, it's still worth the effort
>> IMHO to tell the filesystem about true raid properties.
> 
> Several questions answered, more questions arise =)
> 
> I'm using 3 1T discs, so it seems I'm in luck.  My chunk size is 128k,
> my PE size is the default 4M.  Using mkfs.ext4 as an example, it takes
> stride (chunk) and stripe-width ( chunk * (N-1) ) parameters.  So
> which would be optimal - using the RAID values (i.e. 128k, 256k) or
> the LVM values (i.e. 4M, 8M) when creating the filesystem or is there
> no right answer and it just depends on the usage pattern?

No -- see the last statement of my initial email, quoted above: tell
the fs about your raid.  Even if raid stripes are combined in some
other way, it's still raid, and it's still the stripe size that
matters most.  After all, the fs takes two parameters (stride +
stripe-width), not one (lvm block size) -- that fact alone is
telling :)
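Concretely, for the setup above, a sketch of the conversion into
mkfs.ext4's units (the 4 KiB block size is an assumption, and
/dev/vg/lv is a placeholder device):

```shell
# 3-drive RAID5 -> 2 data drives; 128 KiB chunk (from the thread).
# Assumed: 4 KiB ext4 block size; /dev/vg/lv is a placeholder.
chunk_kb=128
block_kb=4
data_drives=2

stride=$((chunk_kb / block_kb))          # chunk size in fs blocks
stripe_width=$((stride * data_drives))   # full data row in fs blocks

echo "mkfs.ext4 -E stride=$stride,stripe-width=$stripe_width /dev/vg/lv"
```

That prints stride=32, stripe-width=64 -- i.e. the RAID values
(128k / 256k), not the LVM ones.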

/mjt


* Re: filesystem stripe parameters
  2009-06-19 20:59   ` Justin Perreault
@ 2009-06-20  6:35     ` Michael Tokarev
  0 siblings, 0 replies; 7+ messages in thread
From: Michael Tokarev @ 2009-06-20  6:35 UTC (permalink / raw)
  To: Justin Perreault; +Cc: Wil Reichert, linux raid

Justin Perreault wrote:
> Still learning, please be gentle.
> 
> On Fri, 2009-06-19 at 13:15 +0400, Michael Tokarev wrote:
>> Wil Reichert wrote:
>>> When using LVM on top of RAID 5, is it still worthwhile to pass RAID
>>> stripe information to the filesystem on creation?  Or do the PE's in
>>> LVM blur the specific stripe sizes & I'd want to use some multiple of
>>> those instead?
>> Yes it is still a good idea to pass that info because it is still a
>> RAID5 which requires proper treatment wrt unaligned writes and keeping
>> redundancy.
>>
>> But the thing is that RAID5 and LVM are not good to each other UNLESS
>> RAID5 consists of 3, 5 or 9 (or 17 etc) drives -- i.e. 2^N+1, so that
>> there's 2^N data drives.
>>
>> This is because LVM can only have blocksize as a power of two and in
>> order to be useful that blocksize should be a multiple of RAID5 data
>> row size (stripe size etc).
>>
>> This is only possible when RAID5 has 2^N data drives or 2^N+1 total
>> drives.  The same is for RAID4, and for RAID6 it's 2^N+2 since RAID6
>> has 2 parity drives.
>>
>> But if you can't match LVM blocksize and RAID strip size, there's
>> *almost* no point at telling raid parameters to the filesystem: no
>> matter how hard you'll try, LVM will make the whole thing non-optimal.
> 
> 2.5 questions:
> 
> 1) Will this same issue affect a 5+0 raid array?

Yes, definitely.  But with 5+0 it's a bit more complicated.  In that
case each raid5 should have 3, 5, 9 etc (2^N+1) drives, and by
combining the two into raid0 you'll have a "combined stripe size" of
2*2^N chunks, which is still a power of two and hence can be used with
lvm.  You'd still tell the fs about the raid5 properties, not the
raid0 ones, though this part is debatable.
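The arithmetic, as a sketch (64 KiB chunk assumed):

```shell
# Two 3-drive RAID5 legs (2 data drives each) striped together as
# RAID5+0; 64 KiB chunk assumed.
chunk=65536
legs=2
data_per_leg=2

combined=$((chunk * data_per_leg * legs))  # data per combined raid0 row
echo "combined stripe: $combined bytes"

# A power of two has exactly one bit set, so x & (x-1) is 0.
if [ $((combined & (combined - 1))) -eq 0 ]; then
    echo "power of two -> usable with lvm"
fi
```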

> 2) It is implied that one can choose not to tell the filesystem the
> raid parameters; what negative effect does omitting them have?
> Conversely, what positive effect does providing them have?

It's covered by the mkfs.ext3 and mkfs.xfs manpages.  Telling the fs
about your raid properties serves two purposes: the filesystem tries
to avoid the read-modify-write cycle for raid5 (the most expensive
thing, unavoidable if partitions/volumes are not aligned to the raid
stripe width), and it tries to place different kinds of data on
different disks.

The most expensive thing is the read-modify-write cycle for writes on
raid[456].  Basically, if you write only a "small" amount of data,
raid5 needs to recalculate and rewrite the parity block, which is a
function of your new data and of all the other data in that stripe.
So it has to read either all the other data blocks from that raid row,
or at least the previous content of the blocks you're writing AND the
previous parity block, in order to calculate the new parity.

On the other hand if you write whole stripe (or more), there's
no need to read anything, all the data needed to calculate new
parity is already here.

So basically read-modify-write (for small/unaligned writes) is 3x
more operations (plus seeks!) than direct write (for large and
aligned writes).
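In I/O counts, a minimal sketch of that cost model (the 4-operation
figure assumes the read-old-data + read-old-parity variant):

```shell
# Assumed cost model for one small (sub-stripe) write on RAID5:
# read old data + read old parity + write new data + write new parity.
rmw_ios=4
# The same write inside a full, aligned stripe needs no reads at all:
plain_ios=1

echo "small write: $rmw_ios I/Os; aligned write: $plain_ios I/O"
echo "extra operations: $((rmw_ios - plain_ios))x"
```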

But note that by telling the filesystem about the raid properties we
don't affect the file data itself -- or rather, how our applications
will access it.  The filesystem can change metadata location and file
placement, but not the way userspace writes.  Ok, the fs can also
perform smarter buffering, so that buffered writes are sent to raid5
in multiples of the raid stripe width.

Note also that for reads, especially "large enough" reads, all this
alignment has little effect.

/mjt


end of thread, other threads:[~2009-06-20  6:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-06-18 19:08 filesystem stripe parameters Wil Reichert
2009-06-19  9:15 ` Michael Tokarev
2009-06-19  9:36   ` Robin Hill
2009-06-19 20:59   ` Justin Perreault
2009-06-20  6:35     ` Michael Tokarev
2009-06-20  0:26   ` Wil Reichert
2009-06-20  6:19     ` Michael Tokarev
