* Wiki, raid 10, and my new system :-)
@ 2017-10-16 13:54 Wols Lists
  2017-10-16 14:26 ` Reindl Harald
  2017-10-17  0:42 ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Wols Lists @ 2017-10-16 13:54 UTC (permalink / raw)
  To: linux-raid

Raid 10 is a complicated subject what with near and far, and whether it
will grow, etc etc.

I'm planning to raid-10 my swap partition, and while it doesn't matter
in the slightest because destroying and recreating will be no hassle for
swap, I'd like to understand what's going on.

If I remember correctly, there was a thread a little while back on
growing a raid-10? And you can't (for certain values of "can't" :-) do it?

Where's the best place to find info about near, far and offset layouts?
I seem to remember "man md", but is there anywhere better?

To give you an idea of what I'm planning, I've currently got 2 x 4TB
drives that will have a swap partition. I know if I raid-10 that it's
effectively just a raid-1 mirror, but I intend to add a third, and then
probably a fourth, drive. Can you do that? What will the result be? And
for swap especially, does anybody know if I should optimise for read or
write - common sense says they're equally important but as a scientist I
know common sense is not to be trusted :-)
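
Concretely, this is roughly what I have in mind - a sketch with made-up
device names, not a recipe:

	mdadm --create /dev/md1 --level=10 --layout=n2 --raid-devices=2 \
		/dev/sda2 /dev/sdb2
	mkswap /dev/md1
	swapon /dev/md1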

Of course, all this will end up on the wiki :-)

Cheers,
Wol


* Re: Wiki, raid 10, and my new system :-)
  2017-10-16 13:54 Wiki, raid 10, and my new system :-) Wols Lists
@ 2017-10-16 14:26 ` Reindl Harald
  2017-10-17  0:33   ` NeilBrown
  2017-10-17  0:42 ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Reindl Harald @ 2017-10-16 14:26 UTC (permalink / raw)
  To: Wols Lists, linux-raid


On 16.10.2017 at 15:54, Wols Lists wrote:
> Raid 10 is a complicated subject what with near and far, and whether it
> will grow, etc etc.
> 
> I'm planning to raid-10 my swap partition, and while it doesn't matter
> in the slightest because destroying and recreating will be no hassle for
> swap, I'd like to understand what's going on.
> 
> If I remember correctly, there was a thread a little while back on
> growing a raid-10? And you can't (for certain values of "can't" :-) do it?
> 
> Where's the best place to find info about near, far and offset layouts?

http://xmodulo.com/setup-raid10-linux.html


* Re: Wiki, raid 10, and my new system :-)
  2017-10-16 14:26 ` Reindl Harald
@ 2017-10-17  0:33   ` NeilBrown
  2017-10-17 18:32     ` Anthony Youngman
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2017-10-17  0:33 UTC (permalink / raw)
  To: Reindl Harald, Wols Lists, linux-raid


On Mon, Oct 16 2017, Reindl Harald wrote:

> On 16.10.2017 at 15:54, Wols Lists wrote:
>> Raid 10 is a complicated subject what with near and far, and whether it
>> will grow, etc etc.
>> 
>> I'm planning to raid-10 my swap partition, and while it doesn't matter
>> in the slightest because destroying and recreating will be no hassle for
>> swap, I'd like to understand what's going on.
>> 
>> If I remember correctly, there was a thread a little while back on
>> growing a raid-10? And you can't (for certain values of "can't" :-) do it?
>> 
>> Where's the best place to find info about near, far and offset layouts?
>
> http://xmodulo.com/setup-raid10-linux.html

This page has some good stuff, but

  Chunk Size, as per the Linux RAID wiki, is the smallest unit of data
  that can be written to the devices

I wonder where the Linux RAID wiki says that.  It is wrong.
The smallest unit of data that can be written to the devices is the
block size, which is hardware dependent and typically 512 bytes or 4K.
This is not the same as the chunk size.
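
For what it's worth, both sizes can be read straight off the device; a
quick sketch (the device name is just an example):

	blockdev --getss /dev/sda     # logical sector size, typically 512
	blockdev --getpbsz /dev/sda   # physical sector size, often 4096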

NeilBrown


* Re: Wiki, raid 10, and my new system :-)
  2017-10-16 13:54 Wiki, raid 10, and my new system :-) Wols Lists
  2017-10-16 14:26 ` Reindl Harald
@ 2017-10-17  0:42 ` NeilBrown
  2017-10-20  0:55   ` Wols Lists
  1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2017-10-17  0:42 UTC (permalink / raw)
  To: Wols Lists, linux-raid


On Mon, Oct 16 2017, Wols Lists wrote:

> Raid 10 is a complicated subject what with near and far, and whether it
> will grow, etc etc.
>
> I'm planning to raid-10 my swap partition, and while it doesn't matter
> in the slightest because destroying and recreating will be no hassle for
> swap, I'd like to understand what's going on.
>
> If I remember correctly, there was a thread a little while back on
> growing a raid-10? And you can't (for certain values of "can't" :-) do it?

You can for sufficiently recent kernels, and for layouts that support
reshape :-)

>
> Where's the best place to find info about near, far and offset layouts?
> I seem to remember "man md", but is there anywhere better?

Use the source, Luke.

	 * Currently we reject any reshape of a 'far' mode array,
	 * allow chunk size to change if new is generally acceptable,
	 * allow raid_disks to increase, and allow
	 * a switch between 'near' mode and 'offset' mode.

though the code seems to allow raid_disks to decrease as well.

I recommend testing different changes on different configurations and
seeing which ones work.
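
A loop-device sandbox makes that cheap; a rough sketch (file paths and
sizes are arbitrary, not a recommendation):

	for i in 0 1 2; do truncate -s 1G /tmp/r10-$i; done
	for i in 0 1 2; do losetup /dev/loop$i /tmp/r10-$i; done
	mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=2 \
		/dev/loop0 /dev/loop1
	mdadm /dev/md0 --add /dev/loop2
	mdadm --grow /dev/md0 --raid-devices=3
	cat /proc/mdstat              # watch the reshape progress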

>
> To give you an idea of what I'm planning, I've currently got 2 x 4TB
> drives that will have a swap partition. I know if I raid-10 that it's
> effectively just a raid-1 mirror, but I intend to add a third, and then
> probably a fourth, drive. Can you do that? What will the result be? And
> for swap especially, does anybody know if I should optimise for read or
> write - common sense says they're equally important but as a scientist I
> know common sense is not to be trusted :-)

As long as you don't choose the 'far' layout, you will be able to
increase the number of devices in the raid10.
If the performance of swap ever becomes an issue, you have lost
already (I'm more of the philosopher than the scientist today).

>
> Of course, all this will end up on the wiki :-)

Thanks!

NeilBrown


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17  0:33   ` NeilBrown
@ 2017-10-17 18:32     ` Anthony Youngman
  2017-10-17 19:04       ` Phil Turmel
  2017-10-17 20:14       ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Anthony Youngman @ 2017-10-17 18:32 UTC (permalink / raw)
  To: NeilBrown, linux-raid

On 17/10/17 01:33, NeilBrown wrote:
> On Mon, Oct 16 2017, Reindl Harald wrote:
> 
>> On 16.10.2017 at 15:54, Wols Lists wrote:
>>> Raid 10 is a complicated subject what with near and far, and whether it
>>> will grow, etc etc.
>>>
>>> I'm planning to raid-10 my swap partition, and while it doesn't matter
>>> in the slightest because destroying and recreating will be no hassle for
>>> swap, I'd like to understand what's going on.
>>>
>>> If I remember correctly, there was a thread a little while back on
>>> growing a raid-10? And you can't (for certain values of "can't" :-) do it?
>>>
>>> Where's the best place to find info about near, far and offset layouts?
>>
>> http://xmodulo.com/setup-raid10-linux.html
> 
> This page has some good stuff, but
> 
>    Chunk Size, as per the Linux RAID wiki, is the smallest unit of data
>    that can be written to the devices
> 
> I wonder where the Linux RAID wiki says that.  It is wrong.
> The smallest unit of data that can be written to the devices is the
> block size, which is hardware dependent and typically 512 bytes or 4K.
> This is not the same as the chunk size.
> 
https://raid.wiki.kernel.org/index.php/RAID_setup#Chunk_sizes

I understand what it's saying - this is the smallest size passed down to 
the disk, not the smallest size that the disk can write ... :-)

This page has made it into the archaeology section so I don't plan to 
update it. And in context, it's reasonably clear what it means.

Cheers,
Wol


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17 18:32     ` Anthony Youngman
@ 2017-10-17 19:04       ` Phil Turmel
  2017-10-17 20:43         ` Anthony Youngman
  2017-10-17 20:14       ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Phil Turmel @ 2017-10-17 19:04 UTC (permalink / raw)
  To: Anthony Youngman, NeilBrown, linux-raid

On 10/17/2017 02:32 PM, Anthony Youngman wrote:
> On 17/10/17 01:33, NeilBrown wrote:

>> This page has some good stuff, but
>>
>>    Chunk Size, as per the Linux RAID wiki, is the smallest unit of data
>>    that can be written to the devices
>>
>> I wonder where the Linux RAID wiki says that.  It is wrong.
>> The smallest unit of data that can be written to the devices is the
>> block size, which is hardware dependent and typically 512 bytes or 4K.
>> This is not the same as the chunk size.
>>
> https://raid.wiki.kernel.org/index.php/RAID_setup#Chunk_sizes
> 
> I understand what it's saying - this is the smallest size passed down to
> the disk, not the smallest size that the disk can write ... :-)
> 
> This page has made it into the archaeology section so I don't plan to
> update it. And in context, it's reasonably clear what it means.

No, it is *wrong*.  Writes in multiples of 4k and entirely within a
chunk are passed as-is to the devices.  For mirrors, all affected
devices get a copy of the request.  For parity raid, the 4k stripes
corresponding to those 4k blocks will be pulled into the stripe cache
for recalculation.  Not whole chunk-size stripes.  The stripe cache is
multiples of 4k, not multiples of the chunk size!

Writes smaller than 4k, or not aligned to 4k, will generate a
read-modify-write cycle of the 4k block involved.  Not the whole chunk.

It is more accurate to say that a chunk may be the *largest* that a
request can be before it is split between devices.

Phil


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17 18:32     ` Anthony Youngman
  2017-10-17 19:04       ` Phil Turmel
@ 2017-10-17 20:14       ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: NeilBrown @ 2017-10-17 20:14 UTC (permalink / raw)
  To: Anthony Youngman, linux-raid


On Tue, Oct 17 2017, Anthony Youngman wrote:

> On 17/10/17 01:33, NeilBrown wrote:
>> On Mon, Oct 16 2017, Reindl Harald wrote:
>> 
>>> On 16.10.2017 at 15:54, Wols Lists wrote:
>>>> Raid 10 is a complicated subject what with near and far, and whether it
>>>> will grow, etc etc.
>>>>
>>>> I'm planning to raid-10 my swap partition, and while it doesn't matter
>>>> in the slightest because destroying and recreating will be no hassle for
>>>> swap, I'd like to understand what's going on.
>>>>
>>>> If I remember correctly, there was a thread a little while back on
>>>> growing a raid-10? And you can't (for certain values of "can't" :-) do it?
>>>>
>>>> Where's the best place to find info about near, far and offset layouts?
>>>
>>> http://xmodulo.com/setup-raid10-linux.html
>> 
>> This page has some good stuff, but
>> 
>>    Chunk Size, as per the Linux RAID wiki, is the smallest unit of data
>>    that can be written to the devices
>> 
>> I wonder where the Linux RAID wiki says that.  It is wrong.
>> The smallest unit of data that can be written to the devices is the
>> block size, which is hardware dependent and typically 512 bytes or 4K.
>> This is not the same as the chunk size.
>> 
> https://raid.wiki.kernel.org/index.php/RAID_setup#Chunk_sizes
>
> I understand what it's saying - this is the smallest size passed down to 
> the disk, not the smallest size that the disk can write ... :-)

Well that is wrong too.  It is not the smallest size passed down to the
disk. That is also 512b or 4K.  Maybe it is the largest size that might
be passed down to a single disk.

It is the granularity at which the address-space of the array is divided
up before being shared around the address space of the member devices.
It has (almost) nothing to do with the size of IO requests.

NeilBrown

>
> This page has made it into the archaeology section so I don't plan to 
> update it. And in context, it's reasonably clear what it means.
>
> Cheers,
> Wol


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17 19:04       ` Phil Turmel
@ 2017-10-17 20:43         ` Anthony Youngman
  2017-10-17 20:57           ` Phil Turmel
  2017-10-17 21:01           ` NeilBrown
  0 siblings, 2 replies; 11+ messages in thread
From: Anthony Youngman @ 2017-10-17 20:43 UTC (permalink / raw)
  To: Phil Turmel, NeilBrown, linux-raid

On 17/10/17 20:04, Phil Turmel wrote:
> No, it is *wrong*.  Writes in multiples of 4k and entirely within a
> chunk are passed as-is to the devices.  For mirrors, all affected
> devices get a copy of the request.  For parity raid, the 4k stripes
> corresponding to those 4k blocks will be pulled into the stripe cache
> for recalculation.  Not whole chunk-size stripes.  The stripe cache is
> multiples of 4k, not multiples of the chunk size!
> 
> Writes smaller than 4k, or not aligned to 4k, will generate a
> read-modify-write cycle of the 4k block involved.  Not the whole chunk.
> 
> It is more accurate to say that a chunk may be the *largest* that a
> request can be before it is split between devices.

Okay, I think I need to update my understanding on this ... :-)

Let's say a chunk is 12K. That's three 4K blocks to drive 1, followed by 
three to drive 2 etc. Does that mean that each chunk is split across 
three stripes, or is the stripe all the 12K chunks one per drive?

In other words, does a stripe consist of one block per drive, or one 
chunk per drive?

(I'll put a "sic" on that page then, just to point out it's a 
misunderstanding by the original author. As I said, I'd rather not mess 
around with the page now.)

Cheers,
Wol


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17 20:43         ` Anthony Youngman
@ 2017-10-17 20:57           ` Phil Turmel
  2017-10-17 21:01           ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: Phil Turmel @ 2017-10-17 20:57 UTC (permalink / raw)
  To: Anthony Youngman, NeilBrown, linux-raid

On 10/17/2017 04:43 PM, Anthony Youngman wrote:
> On 17/10/17 20:04, Phil Turmel wrote:
>> No, it is *wrong*.  Writes in multiples of 4k and entirely within a
>> chunk are passed as-is to the devices.  For mirrors, all affected
>> devices get a copy of the request.  For parity raid, the 4k stripes
>> corresponding to those 4k blocks will be pulled into the stripe cache
>> for recalculation.  Not whole chunk-size stripes.  The stripe cache is
>> multiples of 4k, not multiples of the chunk size!
>>
>> Writes smaller than 4k, or not aligned to 4k, will generate a
>> read-modify-write cycle of the 4k block involved.  Not the whole chunk.
>>
>> It is more accurate to say that a chunk may be the *largest* that a
>> request can be before it is split between devices.
> 
> Okay, I think I need to update my understanding on this ... :-)
> 
> Let's say a chunk is 12K. That's three 4K blocks to drive 1, followed by
> three to drive 2 etc. Does that mean that each chunk is split across
> three stripes, or is the stripe all the 12K chunks one per drive?

A stripe is still a chunk on each drive.  You are confusing the chunk
as part of the layout with the read/write operations on the underlying
devices.  Read/write operations to the array obviously have to be split
at chunk boundaries because that's where the layout divides the
underlying space between devices.  But *within* a single chunk, the
operation simply passes through (plus mirrors or write recalc as
needed).  Minimum-size operations on underlying devices are 4k.

> In other words, does a stripe consist of one block per drive, or one
> chunk per drive?

The *stripe* is one chunk per drive.

But the stripe *cache* is one block per drive per cache entry, because
operations on devices are multiples of blocks, not chunks.

> (I'll put a "sic" on that page then, just to point out it's a
> misunderstanding by the original author. As I said, I'd rather not mess
> around with the page now.)

The number one selling point for wikis is that they can be edited, and the
wiki keeps a history.  I don't get why you don't want to fix it.

{ But then, I'm not really a fan of wikis .... }

Phil


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17 20:43         ` Anthony Youngman
  2017-10-17 20:57           ` Phil Turmel
@ 2017-10-17 21:01           ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: NeilBrown @ 2017-10-17 21:01 UTC (permalink / raw)
  To: Anthony Youngman, Phil Turmel, linux-raid


On Tue, Oct 17 2017, Anthony Youngman wrote:

> On 17/10/17 20:04, Phil Turmel wrote:
>> No, it is *wrong*.  Writes in multiples of 4k and entirely within a
>> chunk are passed as-is to the devices.  For mirrors, all affected
>> devices get a copy of the request.  For parity raid, the 4k stripes
>> corresponding to those 4k blocks will be pulled into the stripe cache
>> for recalculation.  Not whole chunk-size stripes.  The stripe cache is
>> multiples of 4k, not multiples of the chunk size!
>> 
>> Writes smaller than 4k, or not aligned to 4k, will generate a
>> read-modify-write cycle of the 4k block involved.  Not the whole chunk.
>> 
>> It is more accurate to say that a chunk may be the *largest* that a
>> request can be before it is split between devices.
>
> Okay, I think I need to update my understanding on this ... :-)
>
> Let's say a chunk is 12K. That's three 4K blocks to drive 1, followed by 
> three to drive 2 etc. Does that mean that each chunk is split across 
> three stripes, or is the stripe all the 12K chunks one per drive?

RAID5 would not allow a 12K chunk size (must be power of 2) but RAID0
would.  Not sure about RAID10.

I interpret "stripe" to mean "a set of chunks, one from each device".
So if you had a RAID10 with a 12K chunk size and 3 devices, then a
stripe would be 36K of space, 12K per device.

This is primarily an address-space mapping.  Think of it as a function
from "array-address" to "device-index, device-address".

0 -> 0,0
512 -> 0,512
1024 -> 0,1024
....
3072 -> 1,0
3584 -> 1,512
....

Now imagine that the application always sends 512-byte I/O requests.  Each I/O
request is mapped through the above function and sent to the appropriate
device with the new address.

In practice, larger requests are allowed and they are split into
sub-requests if the function isn't contiguous for the whole range of a
particular request.
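
If it helps, here is that mapping as a small shell function - a sketch
built from the example numbers above (3072-byte chunks, 2 devices), not
anything lifted from the md code:

	chunk=3072 ndev=2
	map() {
		cn=$(( $1 / chunk )); off=$(( $1 % chunk ))
		echo "$1 -> $(( cn % ndev )),$(( cn / ndev * chunk + off ))"
	}
	for a in 0 512 1024 3072 3584; do map $a; done
	# 0 -> 0,0  512 -> 0,512  1024 -> 0,1024  3072 -> 1,0  3584 -> 1,512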


>
> In other words, does a stripe consist of one block per drive, or one 
> chunk per drive?

One chunk per drive.

Note that inside the md/raid5 code the word "stripe" usually means one
PAGE per drive.  This is an unfortunate historical accident.  I
sometimes use the word "strip" (no 'e') to mean one page (or one block)
per device.  A strip is not contiguous in the array address space.  A
stripe is.

Thanks,
NeilBrown

>
> (I'll put a "sic" on that page then, just to point out it's a 
> misunderstanding by the original author. As I said, I'd rather not mess 
> around with the page now.)
>
> Cheers,
> Wol


* Re: Wiki, raid 10, and my new system :-)
  2017-10-17  0:42 ` NeilBrown
@ 2017-10-20  0:55   ` Wols Lists
  0 siblings, 0 replies; 11+ messages in thread
From: Wols Lists @ 2017-10-20  0:55 UTC (permalink / raw)
  To: NeilBrown, linux-raid

On 17/10/17 01:42, NeilBrown wrote:
>> Of course, all this will end up on the wiki :-)

> Thanks!
> 

I've now written a nice pile of bumf about all the raid levels that
tries to explain everything in simple language.

https://raid.wiki.kernel.org/index.php/A_guide_to_mdadm#Array_internals_and_how_it_affects_mdadm

I'm sure experts will be able to poke some holes in it - let me know if
there are any mistakes and I'll correct them - but I've tried to make it
easily understandable for noobs.

Cheers,
Wol

