* The chunk size paradox
@ 2013-12-30 18:48 Phillip Susi
From: Phillip Susi @ 2013-12-30 18:48 UTC (permalink / raw)
  To: Linux RAID

I believe that using a single "chunk size" causes a lose-lose tradeoff
when creating raid 5/6/10 arrays.  Too small of a chunk size and you
waste too much time seeking to skip over the redundant data ( I think
this is why the default was changed from 64k to 512k ), but too large of
a chunk size, and you lose parallelism since your requests won't be
large enough to span a whole stripe, and in the case of raid5 you run
into problems with the stripe cache.

I believe that what is needed is to drop back down to 64k chunk size,
and deal with the seek problem by grouping stripes.  Instead of rotating
between every stripe, you only rotate between groups of stripes.  An
example of a three disk raid5 would look like this with a group factor
of 3:

1     2     1+2'
3     4     3+4'
5     6     5+6'
7+8'  7     8
9+10' 9    10

And a raid10-offset:

1     2     3
4     5     6
7     8     9
3'    1'    2'
6'    4'    5'
9'    7'    8'

And raid10-near:

1     1'    2
3     3'    4
5     5'    6
2'    7     7'
4'    8     8'
6'    9     9'

This gets you the benefit of reduced seeks without hindering
parallelism.  In the case of the raid10-offset, you can use a relatively
large ( ~1GB ) group size to get sequential read performance nearly
identical to that of raid0, only needing to seek once every 1 GB * n
read, while not requiring requests of at least 1 GB * n to keep all
disks busy.
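
To make the layout concrete, here is a rough sketch of the chunk-to-disk
mapping I have in mind (Python, purely illustrative; the group factor and
the rotation direction are just read off the diagrams above, this is not
md code):

def grouped_raid5_map(chunk, n_disks=3, group=3):
    # Map a 0-based logical data chunk to (disk, stripe) when parity
    # stays on one disk for `group` consecutive stripes before rotating.
    data_per_stripe = n_disks - 1
    stripe = chunk // data_per_stripe
    g = stripe // group                          # which group of stripes
    parity_disk = (n_disks - 1 + g) % n_disks    # rotates once per group
    # data chunks fill the non-parity disks in increasing disk order
    data_disks = [d for d in range(n_disks) if d != parity_disk]
    return data_disks[chunk % data_per_stripe], stripe, parity_disk

# reproduce the 3-disk, group-factor-3 table above; 'P' marks the
# parity chunk of each row, data chunks are numbered from 1
for row in range(5):
    cells = [''] * 3
    for c in range(2):
        disk, stripe, pdisk = grouped_raid5_map(row * 2 + c)
        cells[disk] = str(row * 2 + c + 1)
        cells[pdisk] = 'P'
    print('\t'.join(cells))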


* Re: The chunk size paradox
@ 2013-12-30 23:38 ` Peter Grandi
From: Peter Grandi @ 2013-12-30 23:38 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> I believe that using a single "chunk size" causes a lose-lose
> tradeoff when creating raid 5/6/10 arrays. Too small of a
> chunk size and you waste too much time seeking to skip over
> the redundant data ( I think this is why the default was
> changed from 64k to 512k ), but too large of a chunk size, and
> you lose parallelism since your requests won't be large enough
> to span a whole stripe,

That seems to me a very peculiar way of looking at it. I used to
think that the biggest tradeoff as to chunk size is due to the
devices in a RAID set being, as a rule, not synchronized, so it
may happen that if they are disk drives their angular positions
might be up to nearly a full rotation apart across the RAID
set members.

This can result in some significant wait to collect data that
spans multiple chunks from all the devices involved, and the
more drives there are, the greater the chances that at least one
disk will have an angular position nearly a full rotation apart
from another disk drive...

  http://www.sabi.co.uk/blog/12-thr.html#120310
  http://www.sabi.co.uk/blog/12-two.html#120221

Therefore a larger chunk size increases the amount of data that
can be fetched from each device without waiting for the other
devices to get to the desired angular position. It has of course
the advantage that you mention, but also the advantage that
random IO might be improved.
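
To put rough numbers on it: if each drive's angular offset is modelled
as independent and uniform over one rotation (an idealization, of
course), the expected worst-case offset among n drives is n/(n+1) of a
rotation:

# expected wait for the slowest of n unsynchronized spindles
RPM = 7200
rev_ms = 60000.0 / RPM            # ~8.33 ms per rotation
for n in (2, 4, 8, 16):
    # E[max of n uniform offsets on [0, rev)] = rev * n / (n + 1)
    print("%2d drives: ~%.2f ms of %.2f ms/rev"
          % (n, rev_ms * n / (n + 1), rev_ms))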

> and in the case of raid5 you run into problems with the stripe
> cache.

IIRC the stripe cache can be up to 32MB per RAID device, and
that's a lot of stripes for any sensibly-sized RAID set. But that
has never stopped people who "know better" from doing very wide
RAID5 or RAID6 sets :-).

[ ... ]


* Re: The chunk size paradox
@ 2013-12-31  0:01   ` Wolfgang Denk
From: Wolfgang Denk @ 2013-12-31  0:01 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID

Dear Peter,

In message <21186.996.238486.690328@tree.ty.sabi.co.uk> you wrote:
> 
> Therefore a larger chunk size increases the amount of data that
> can be fetched from each device without waiting for the other
> devices to get to the desired angular position. It has of course
> the advantage that you mention, but also the advantage that
> random IO might be improved.

Hm... does it make sense to discuss any of this without considering
the actual work load of the storage system?

For example, we have some RAID 6 arrays that store mostly source code
and the resulting object files when compiling that code.  In this
environment, we have the following distribution of file sizes:

	65% 	are smaller than 4 kB
	80% 	are smaller than 8 kB
	90% 	are smaller than 16 kB
	96% 	are smaller than 32 kB
	98.4% 	are smaller than 64 kB 

It appears to me that your argument is valid only for large (or
rather huge), strictly sequential file accesses.  Random access to
a large number of small files, as in the environment shown above,
will need quite different settings for optimal performance.

I think we should not conceal such dependencies.  There is no "one
size fits all" solution.

Just my $ 0.02.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Man is the best computer we can put aboard a spacecraft ...  and  the
only one that can be mass produced with unskilled labor.
                                                  - Wernher von Braun


* Re: The chunk size paradox
@ 2013-12-31 13:51     ` David Brown
From: David Brown @ 2013-12-31 13:51 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Peter Grandi, Linux RAID

On 31/12/13 01:01, Wolfgang Denk wrote:
> Dear Peter,
>
> In message <21186.996.238486.690328@tree.ty.sabi.co.uk> you wrote:
>>
>> Therefore a larger chunk size increases the amount of data that
>> can be fetched from each device without waiting for the other
>> devices to get to the desired angular position. It has of course
>> the advantage that you mention, but also the advantage that
>> random IO might be improved.
>
> Hm... does it make sense to discuss any of this without considering
> the actual work load of the storage system?
>
> For example, we have some RAID 6 arrays that store mostly source code
> and the resulting object files when compiling that code.  In this
> environment, we have the following distribution of file sizes:
>
> 	65% 	are smaller than 4 kB
> 	80% 	are smaller than 8 kB
> 	90% 	are smaller than 16 kB
> 	96% 	are smaller than 32 kB
> 	98.4% 	are smaller than 64 kB
>
> It appears to me that your argument is valid only for large (or
> rather huge), strictly sequential file accesses.  Random access to
> a large number of small files, as in the environment shown above,
> will need quite different settings for optimal performance.
>
> I think we should not conceal such dependencies.  There is no "one
> size fits all" solution.
>
> Just my $ 0.02.
>
> Best regards,
>
> Wolfgang Denk
>

While that's true, it would be my guess that for most large raid 6 
arrays, there /are/ many large files.  It takes a great many small files 
to justify having raid 6 rather than raid 1, but you don't need too many 
large media files.

But it's important that any new options be optional - we don't want to 
reduce performance for existing users, even if it is for less common usage.



* Re: The chunk size paradox
@ 2014-01-02 14:49 ` joystick
From: joystick @ 2014-01-02 14:49 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-raid

On 30/12/2013 19:48, Phillip Susi wrote:
> I believe that using a single "chunk size" causes a lose-lose tradeoff
> when creating raid 5/6/10 arrays.

I don't think your analysis is correct.

Firstly, you are forgetting that multiple requests are issued 
simultaneously to one disk by the kernel, and they can be served 
out-of-order via NCQ / TCQ by the disks.  The kernel does not wait for 
sector N to be read before issuing the read for sector N+1; it issues a 
lot of them together, since it knows how much data it has to read (via 
readahead, most of the time).  The disk reorders read/write requests 
according to its angular position, so you almost never pay for the 
angular offset during a sequential read/write, not even when skipping 
redundant data from one component disk of the RAID.

Secondly, for writes, I suspect you are assuming that a whole stripe has 
to be read and rewritten in order for one small write to be performed, 
but it is not so.  For a 4k write in raid5, two 4k sectors are read, then 
two 4k sectors are written, and this is completely independent from the 
chunk size.  It already behaves mostly like your "groups", which are the 
stripes, actually.
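
The whole stripe is not needed because parity is a plain XOR:
new parity = old parity xor old data xor new data.  A toy demonstration
(Python, purely illustrative):

import os

BLK = 4096
xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))

d0, d1, d2 = (os.urandom(BLK) for _ in range(3))   # three data blocks
parity = xor(xor(d0, d1), d2)                      # full-stripe parity

new_d1 = os.urandom(BLK)                           # a small write hits d1
new_parity = xor(xor(parity, d1), new_d1)          # read old data + old parity only

assert new_parity == xor(xor(d0, new_d1), d2)      # same as recomputing it all
print("RMW parity update verified")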




* Re: The chunk size paradox
@ 2014-01-02 15:24   ` Phillip Susi
From: Phillip Susi @ 2014-01-02 15:24 UTC (permalink / raw)
  To: joystick; +Cc: linux-raid

On 1/2/2014 9:49 AM, joystick wrote:
> lot of them together, since it knows how much data it has to read
> (via readahead, most of the time). The disk reorders read/write
> requests

That's the problem; if you want a 1 GB stripe size then you need 2 GB
of readahead to keep the disks streaming.  That's not really
reasonable.  The idea is to get back to ~128k being enough readahead
to keep the disks busy, without causing loads of seeking.  Reordering
doesn't come into play here since I'm talking about sequential reads.
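
The arithmetic, for concreteness (illustrative; two data disks, as in a
3-disk raid5):

KiB, MiB, GiB = 1024, 1024 ** 2, 1024 ** 3
data_disks = 2
for chunk in (64 * KiB, 512 * KiB, 1 * GiB):
    # sequential readahead must cover one full data stripe to keep
    # every disk busy at once
    need = chunk * data_disks
    print("chunk %8d KiB -> readahead %g MiB" % (chunk // KiB, need / MiB))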

> Secondly, for writes, I suspect you are assuming that a whole stripe
> has to be read and rewritten in order for one small write to be
> performed, but it is not so. For a 4k write in raid5, two 4k
> sectors are read, then two 4k sectors are written, and this is
> completely independent from the chunk size. It already behaves mostly
> like your "groups", which are the stripes, actually.

I don't believe this is correct.  raid5 caches an entire stripe at a
time in the stripe cache.  Even if it were so, it still wouldn't be
related to what I'm talking about, which is the fact that when doing
large streaming reads, the disk head has to seek to skip the redundant
data.  How often it has to do that is related to the chunk size, and
is something you want to minimize since it is bad for throughput.



* Re: The chunk size paradox
@ 2014-01-02 15:41   ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-02 15:41 UTC (permalink / raw)
  To: joystick, Phillip Susi; +Cc: linux-raid

On 1/2/2014 8:49 AM, joystick wrote:
> For a 4k write in raid5, two 4k sectors are read, then
> two 4k sectors are written, and this is completely independent from
> chunk size.

First, there is no such thing as a 4K sector in Linux.  Sectors are 512
bytes.  Filesystem blocks and memory pages are 4K.

I'm no expert WRT raid5.c/raid6.c, but I'm pretty sure it doesn't work
as you state.  I'm pretty sure it works like this:

Redundancy is maintained at the chunk level, not the filesystem block
level or page level.  If modifying a single filesystem block, md will
read the data chunk of the stripe in which the 4 sectors of the 4KB
block resides, write back the chunk incorporating the changes to the 4
sectors, read the parity chunk, recalculate the parity chunk based on
the new data chunk, and then write back the parity chunk.

This is precisely why many folks, including myself, consider the current
512KB chunk default to be way too high.  Modifying a single 4KB
filesystem block requires reading 1MB from disk and writing 1MB, a total
of 2MB of IO just to modify a single 4KB page.  And AFAIK this is the
best case scenario.  According to past posts by Neil, IIRC, the current
RAID5/6 code may read more than just two chunks during RMW depending on
certain factors.  With RAID6 you have at least one extra chunk write, if
not an extra chunk read, so your IO is at least 2.5MB for a single 4K
write with RAID6.
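
The arithmetic, under that chunk-level RMW assumption (whether md really
works at chunk granularity is disputed elsewhere in this thread):

chunk_MB = 512 / 1024.0
# RAID5: read data chunk + read parity chunk, write both back
print("RAID5: %.1f MB" % (4 * chunk_MB))           # 2.0 MB for one 4KB write
# RAID6: at least one extra parity chunk write (2.5 MB), and an extra
# parity chunk read in the worst case (3.0 MB)
print("RAID6: %.1f to %.1f MB" % (5 * chunk_MB, 6 * chunk_MB))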

-- 
Stan




* Re: The chunk size paradox
@ 2014-01-02 16:31     ` Phillip Susi
From: Phillip Susi @ 2014-01-02 16:31 UTC (permalink / raw)
  To: stan, joystick; +Cc: linux-raid

On 1/2/2014 10:41 AM, Stan Hoeppner wrote:
> First, there is no such thing as a 4K sector in Linux.  Sectors are
> 512 bytes.  Filesystem blocks and memory pages are 4K.

Of course there is.  Disks with 4k sectors are becoming more and more
popular.  CD-ROM type drives have always used 2k sectors.  Also
filesystem blocks and memory pages aren't necessarily 4K, though that
is the most common size.

> read the data chunk of the stripe in which the 4 sectors of the
> 4KB

You mean 8 sectors, assuming you're still talking about 512 byte sectors.



* Re: The chunk size paradox
@ 2014-01-02 18:02       ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-02 18:02 UTC (permalink / raw)
  To: Phillip Susi, joystick; +Cc: linux-raid

On 1/2/2014 10:31 AM, Phillip Susi wrote:
> On 1/2/2014 10:41 AM, Stan Hoeppner wrote:
>> First, there is no such thing as a 4K sector in Linux.  Sectors are
>> 512 bytes.  Filesystem blocks and memory pages are 4K.
> 
> Of course there is.  Disks with 4k sectors are becoming more and more
> popular.  

Please read:  https://en.wikipedia.org/wiki/Advanced_Format

There are no native 4K sector drives on the market.  Linux does not
support a native 4K sector size, only 512 bytes, unless this has changed
in recent kernels and I'm simply not aware of it yet.

> CD-ROM type drives have always used 2k sectors.  Also

This is not relevant to this discussion.

> filesystem blocks and memory pages aren't necessarily 4K, though that
> is the most common size.

Yes, they are necessarily 4K in Linux.  Linux only supports page sized
BIO for consistency across the memory manager and IO subsystems.  Most
architectures which Linux currently supports have hardware page sizes
greater than 4K, for instance IA64 supports 4k/8k/16k, even a 4GB page
size.  But it was decided long ago to stick with 4K for a number of
reasons, one of these is stated above.  For background on this Google is
your friend.

>> read the data chunk of the stripe in which the 4 sectors of the
>> 4KB
> 
> You mean 8 sectors, assuming you're still talking about 512 byte sectors.

Yes, 8 sectors, thanks for catching my brain-to-finger error.  The ONLY
thing to talk about is 512 byte sectors, because this is the only sector
size Linux supports.

-- 
Stan



* Re: The chunk size paradox
@ 2014-01-02 19:10         ` Phillip Susi
From: Phillip Susi @ 2014-01-02 19:10 UTC (permalink / raw)
  To: stan, joystick; +Cc: linux-raid

On 1/2/2014 1:02 PM, Stan Hoeppner wrote:
> There are no native 4K sector drives on the market.  Linux does
> not support a native 4K sector size, only 512 bytes, unless this
> has changed in recent kernels and I'm simply not aware of it yet.

Linux has supported 4k sectors for several years.  You can test it
with the scsi_debug module and its sector_size argument.  The parted
test suite has been doing this for a few years to test that parted
correctly handles 1k, 2k, and 4k sector sizes.  You can also set up
qemu to emulate such a drive.

While most consumer level sata drives that use 4k hardware sectors
have 512 byte logical sector emulation, there are at least a few
drives out there that do not, and are pure 4k sector drives.

>> CD-ROM type drives have always used 2k sectors.  Also
> 
> This is not relevant to this discussion.

Sure it is; it's a non-512-byte sector that Linux has handled for many
years, and so disproves your assertion that a sector is always 512 bytes.

> Yes, they are necessarily 4K in Linux.  Linux only supports page
> sized BIO for consistency across the memory manager and IO
> subsystems.  Most architectures which Linux currently supports have
> hardware page sizes greater than 4K, for instance IA64 supports
> 4k/8k/16k, even a 4GB page size.  But it was decided long ago to
> stick with 4K for a number of reasons, one of these is stated
> above.  For background on this Google is your friend.

Wrong, wrong, wrong.  Linux has always supported ext[234] filesystems
using 1k, 2k, or 4k filesystem block sizes.  Basically nobody has
used the smaller sizes for quite a few years ( they were originally
useful on 1-100 MB disks ), but they are still supported.  It can use
larger sizes than that if your platform has a > 4k page size; the page
cache limits the block size to the page size.  I believe it was the
Drobo box that used a larger block size, and people often run into the
page cache problem when they try to pull the drive from the Drobo and
mount it in their PC, which can't handle > 4k blocks; but the Drobo's
CPU uses 32k pages, so it could use 32k blocks just fine.  Several CPU
archs give you the option to choose between different page sizes when
building the kernel, so yes, you can choose to use the larger sizes
rather than the default 4k.


* Re: The chunk size paradox
@ 2014-01-02 19:21         ` Joe Landman
From: Joe Landman @ 2014-01-02 19:21 UTC (permalink / raw)
  To: stan, Phillip Susi, joystick; +Cc: linux-raid

On 01/02/2014 01:02 PM, Stan Hoeppner wrote:
> On 1/2/2014 10:31 AM, Phillip Susi wrote:
>> On 1/2/2014 10:41 AM, Stan Hoeppner wrote:
>>> First, there is no such thing as a 4K sector in Linux.  Sectors are
>>> 512 bytes.  Filesystem blocks and memory pages are 4K.
>>
>> Of course there is.  Disks with 4k sectors are becoming more and more
>> popular.
>
> Please read:  https://en.wikipedia.org/wiki/Advanced_Format
>
> There are no native 4K sector drives on the market.  Linux does not

Untrue.

http://storage.toshiba.eu/cms/en/hdd/hard_disk_drives/product_detail.jsp?productid=452

http://www.seagate.com/www-content/product-content/constellation-fam/constellation-cs/en-us/docs/terascale-hdd-data-sheet-ds1793-1-1306us.pdf

and many others.

512 byte sector native drives are now far less common, with many of the 
drives being native 4096 bytes with a translation layer for legacy 
systems that require 512 bytes.  This is infamous for wreaking havoc 
with alignment.




* Re: The chunk size paradox
@ 2014-01-02 20:08   ` Phillip Susi
From: Phillip Susi @ 2014-01-02 20:08 UTC (permalink / raw)
  To: Peter Grandi, Linux RAID

Peter, please make sure to use your mail client's reply-to-all feature
and avoid any BS reply-to-list command, which breaks cross-posted
threads and delays my seeing your reply, since I didn't get a copy.

On 12/30/2013 6:38 PM, Peter Grandi wrote:

> Therefore a larger chunk size increases the amount of data that can
> be fetched from each device without waiting for the other devices to
> get to the desired angular position. It has of course the advantage
> that you mention, but also the advantage that random IO might be
> improved.

Yes, and that is a good reason not to use 4k chunk size.  I believe
that 64k is plenty large enough for this purpose though.

>> and in the case of raid5 you run into problems with the stripe 
>> cache.
> 
> IIRC the stripe cache can be up to 32MB per RAID device, and that's
> a lot of stripes for any sensibly-sized RAID set. But that has never
> stopped people who "know better" from doing very wide RAID5 or RAID6
> sets :-).

That's kind of my point: you don't want a *very* large stripe cache,
but limiting the chunk size means you get seek overhead to skip the
redundant data, so you are stuck between a rock and a hard place.  It
isn't as big of a deal on a 5 disk raid5/6 but on a 2 or 3 disk
raid10, a 512k chunk size has a hefty seek overhead.



* Re: The chunk size paradox
@ 2014-01-02 22:32         ` Wolfgang Denk
From: Wolfgang Denk @ 2014-01-02 22:32 UTC (permalink / raw)
  To: stan; +Cc: Phillip Susi, joystick, linux-raid

Dear Stan,

In message <52C5A9AA.9090300@hardwarefreak.com> you wrote:
>
> > filesystem blocks and memory pages aren't necessarily 4K, though that
> > is the most common size.
> 
> Yes, they are necessarily 4K in Linux.  Linux only supports page sized
> BIO for consistency across the memory manager and IO subsystems.  Most
> architectures which Linux currently supports have hardware page sizes
> greater than 4K, for instance IA64 supports 4k/8k/16k, even a 4GB page
> size.  But it was decided long ago to stick with 4K for a number of
> reasons, one of these is stated above.  For background on this Google is
> your friend.

Well, you can tune the page size - and if you need no file system
support (like when implementing a RAID controller card) making the
page size exactly the same as the chunk size will allow for some nice
performance optimizations (as you can avoid a lot of large memcpy()
operations).

We did this (some 6 years ago) for the (then AMCC) PPC440SPe
processors; I can't find the old document any longer on APM's web
site, but there is still a copy here ([1]) which shows the effect.

So yes, tuning the system page size can have considerable impact, but
only for special-purpose applications, and when optimising for large
sequential I/O (which appears to be how the RAID controller
manufacturers are testing / optimizing their systems).


Fact is, with a file system on top of the RAID array, and with our
typical work load of many very small files, A RAID6 with chunk size
16k will give much better results that with chunk size 64k - and
anything even bigger will be worse.


[1] ftp://ftp.denx.de/pub/demos/RAID-demo/doc/RAIDinLinux_PB_0529a.pdf

Best regards,

Wolfgang Denk

--
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
"The first rule of magic is simple. Don't waste your time waving your
hands and hoping when a rock or a club will do."
                                               - McCloctnik the Lucid


* Re: The chunk size paradox
@ 2014-01-02 22:42           ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-02 22:42 UTC (permalink / raw)
  To: Joe Landman, Phillip Susi, joystick; +Cc: linux-raid

On 1/2/2014 1:21 PM, Joe Landman wrote:
> On 01/02/2014 01:02 PM, Stan Hoeppner wrote:
>> On 1/2/2014 10:31 AM, Phillip Susi wrote:
>>> On 1/2/2014 10:41 AM, Stan Hoeppner wrote:
>>>> First, there is no such thing as a 4K sector in Linux.  Sectors are
>>>> 512 bytes.  Filesystem blocks and memory pages are 4K.
>>>
>>> Of course there is.  Disks with 4k sectors are becoming more and more
>>> popular.
>>
>> Please read:  https://en.wikipedia.org/wiki/Advanced_Format
>>
>> There are no native 4K sector drives on the market.  Linux does not
> 
> Untrue.
> 
> http://storage.toshiba.eu/cms/en/hdd/hard_disk_drives/product_detail.jsp?productid=452

From that page:

Physical parameters	
Bytes/sector (Host)		512
Bytes/sector (Disk)		4096 kByte

> 
> http://www.seagate.com/www-content/product-content/constellation-fam/constellation-cs/en-us/docs/terascale-hdd-data-sheet-ds1793-1-1306us.pdf

From that page:

Configuration
Heads/Disks				8/4
Bytes per Sector (512-byte emulation)	4096

Damn, you had me salivating Joe.  These are both AF 512e drives, not
native 4K.

> and many others.

I've not yet seen an announcement from anyone.  I'm not all seeing, but
I'd think such an announcement would cross my RADAR.

> 512 byte sector native drives are now far less common, with many of the
> drives being native 4096 bytes with a translation layer for legacy
> systems that require 512 bytes.  This is infamous for wreaking havoc
> with alignment.

Exactly.  Which is why so many of us wish native 4K drives would be
released.  And, again, AFAIK, nobody is shipping a native drive.
They're all still 512e.  I'd love for someone to prove me wrong here by
pointing one out that is available, preferably more than one.

-- 
Stan




* Re: The chunk size paradox
@ 2014-01-02 22:49           ` Peter Grandi
From: Peter Grandi @ 2014-01-02 22:49 UTC (permalink / raw)
  To: Linux RAID

[ .... ]

>> There are no native 4K sector drives on the market. Linux
>> does not support a native 4K sector size, only 512 bytes,
>> unless this has changed in recent kernels and I'm simply not
>> aware of it yet.

> Linux has supported 4k sectors for several years. You can test
> it with the scsi_debug module and its sector_size argument.
> The parted test suite has been doing this for a few years to
> test that parted correctly handles 1k, 2k, and 4k sector
> sizes.

Indeed, but 2KiB is a bit theoretical; I don't think any
hard disks have 2KiB sectors. The 4KiB sector size transition
was a problem for 2-3 years after 2009, so 4KiB
hardware sector support is now approximately 5 years old:

  http://lwn.net/Articles/322777/

    «Linux and 4K disk sectors March 11, 2009

    [ ... ] Matthew Wilcox recently posted a patch to support 4K
    sectors according to the ATA-8 standard (PDF). [ ... ]»

as the kernel and tools got updated slowly, and it is one reason
for the new 1MB partition alignment default.

In the MS-Windows world it is better known as "Advanced Format":

  http://en.wikipedia.org/wiki/Advanced_Format

One of the annoying consequences is that parity-RAID (when not
using MD) with less than 8 data drives would have stripelet
sizes under the common 4KiB filesystem block size, which was
exploited cleverly by DDN with their standard 8+2 RAID3 (sort of
RAID3...) products. They have switched to another arrangement
now that 4KiB drives are fairly common.

For MD RAID this did not matter because MD RAID does IO in page
cache units, that is base VM pages, which on IA32/AMD64 CPUs are
4KiB anyhow, regardless of a smaller physical sector size.

* Re: The chunk size paradox
@ 2014-01-02 22:56             ` Carsten Aulbert
From: Carsten Aulbert @ 2014-01-02 22:56 UTC (permalink / raw)
  To: stan; +Cc: Joe Landman, Phillip Susi, joystick, linux-raid

Hi

sorry for joining the thread late

On 01/02/2014 11:42 PM, Stan Hoeppner wrote:
> 
> Damn, you had me salivating Joe.  These are both AF 512e drives, not
> native 4K.
> 
>> and many others.
> 
> I've not yet seen an announcement from anyone.  I'm not all seeing, but
> I'd think such an announcement would cross my RADAR.
> 

Just look around:


http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771475.pdf

and then the FAQ about "Advanced Format":

https://wdc.custhelp.com/app/answers/detail/a_id/5655

Why would one need to align to 4k if the on-disk layout were 512 bytes?

Or did I miss something?

Cheers

Carsten



* Re: The chunk size paradox
@ 2014-01-02 23:16           ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-02 23:16 UTC (permalink / raw)
  To: Phillip Susi, joystick; +Cc: linux-raid

On 1/2/2014 1:10 PM, Phillip Susi wrote:
> On 1/2/2014 1:02 PM, Stan Hoeppner wrote:
>> There are no native 4K sector drives on the market.  Linux does
>> not support a native 4K sector size, only 512 bytes, unless this
>> has changed in recent kernels and I'm simply not aware of it yet.
> 
> Linux has supported 4k sectors for several years.  You can test it
> with the scsi_debug module and its sector_size argument.  The parted
> test suite has been doing this for a few years to test that parted
> correctly handles 1k, 2k, and 4k sector sizes.  You can also set up
> qemu to emulate such a drive.

Thank you for this information.  Now, if I actually had a 4K drive in my
hands, and plugged it in, directly formatted it with XFS, no partitions,
would the LBA addressing be 4K or 512B?  Or would I need to tweak kernel
parameters?  Or possibly rebuild my kernel to support 4K sectors?

> While most consumer level sata drives that use 4k hardware sectors
> have 512 byte logical sector emulation, there are at least a few
> drives out there that do not, and are pure 4k sector drives.

I'm still waiting to hear an announcement from a vendor, or see a link
from someone claiming this to be true.

>>> CD-ROM type drives have always used 2k sectors.  Also
>>
>> This is not relevant to this discussion.
> 
> Sure it is; it's a non-512-byte sector that Linux has handled for many
> years, and so disproves your assertion that a sector is always 512 bytes.

It's not relevant because you don't create an md RAID set from CD-ROM.

>> Yes, they are necessarily 4K in Linux.  Linux only supports page
>> sized BIO for consistency across the memory manager and IO
>> subsystems.  Most architectures which Linux currently supports have
>> hardware page sizes greater than 4K, for instance IA64 supports
>> 4k/8k/16k, even a 4GB page size.  But it was decided long ago to
>> stick with 4K for a number of reasons, one of these is stated
>> above.  For background on this Google is your friend.
> 
> Wrong, wrong, wrong.  

If you say it 3 times does that make it 3x more likely to be true? :)

> Linux has always supported ext[234] filesystems
> using 1k, 2k, or 4k filesystem block sizes.  Basically nobody has
> used the smaller sizes for quite a few years ( they were originally
> useful on 1-100 MB disks ), but they are still supported.  

I'll take your word that it is still supported, FSVO 'supported'.

> It can use
> larger sizes than that if your platform has a > 4k page size.  The page

IIRC, there was a lengthy discussion about this on mm back when some
folks wanted to use 16K-4GB pages on Itanium, and later 2M pages on
x86-64, to cut down on the amount of memory required for page tables and
to increase performance for big memory workloads.  As I recall the
arguments for continuing to use 4K pages across the world of Linux,
regardless of architecture capability, and to NOT make it configurable
as in HP-UX, were, paraphrasing:

1.  The kernel manipulates "everything" in pages so we need consistency
2.  While larger pages save page table space and increase throughput
    for large memory intensive workloads, they cause more waste in other
    structures and increase bandwidth demands for data that are
    smaller than the page size

So, IIRC, it was decided that the page size would remain 4K basically
forever.  So while it is *technically* possible to have a larger page
size in Linux, it is absolutely not supported by the kernel team, nor
any distro kernel, AFAIK.

> Several CPU
> archs give you the option to choose between different page sizes when
> building the kernel, so yes, you can choose to use the larger sizes
> rather than the default 4k.

And I'd guess a whole host of things will likely break as a result if
you don't correctly modify much of the kernel source before running
make.  See above.

-- 
Stan


* Re: The chunk size paradox
@ 2014-01-02 23:22           ` Peter Grandi
From: Peter Grandi @ 2014-01-02 23:22 UTC (permalink / raw)
  To: Linux RAID

[ ... ]

> 512 byte sector native drives are now far less common,

Most smaller/faster SAS drives still have 512 byte physical sectors,
but >= 2TB drives, especially those with < 7200RPM, tend to have
4KiB physical sectors (because 4KiB sector drives are cheaper).

> with many of the drives being native 4096 bytes with a
> translation layer for legacy systems that require 512 bytes.

On a PC here I have half-and-half:

  # lsscsi | grep sd.
  [0:0:0:0]    disk    ATA      WDC WD2002FAEX-0 05.0  /dev/sda
  [1:0:0:0]    disk    ATA      ST2000DM001-1CH1 CC44  /dev/sdb
  [2:0:0:0]    disk    ATA      ST2000DM001-9YN1 CC4C  /dev/sdc
  [3:0:0:0]    disk    ATA      Hitachi HDS72202 JKAO  /dev/sdd
  [6:0:0:0]    disk    ATA      WDC WD20EARX-32P AB51  /dev/sde
  [6:0:1:0]    disk    ATA      ST2000DL003-9VT1 CC32  /dev/sdf

  # grep . /sys/block/sd?/queue/physical_block_size
  /sys/block/sda/queue/physical_block_size:512
  /sys/block/sdb/queue/physical_block_size:4096
  /sys/block/sdc/queue/physical_block_size:4096
  /sys/block/sdd/queue/physical_block_size:512
  /sys/block/sde/queue/physical_block_size:4096
  /sys/block/sdf/queue/physical_block_size:512

They are all 2TB "consumer" drives, mostly recent ones. I am
slightly surprised that half still have 512 physical sectors.

[ ... ]


* Re: The chunk size paradox
@ 2014-01-03  0:19               ` Phillip Susi
From: Phillip Susi @ 2014-01-03  0:19 UTC (permalink / raw)
  To: Carsten Aulbert, stan; +Cc: Joe Landman, joystick, linux-raid

On 01/02/2014 05:56 PM, Carsten Aulbert wrote:
> Why would one need to align to 4k if the on disk layout would be
> 512bytes?
> 
> Or did I miss something?

Because they are providing 512 byte sector emulation on top of 4k
physical sectors.  If they were pure 4k sector drives, there would be
no need for special alignment.  I believe that all WD drives do this,
but I've seen a few people report that some Seagate drives skip the 512
emulation.  For that matter, I have a first-generation WD "Green" AF
drive that I am retiring because it has been throwing uncorrectable
errors but not reallocating on write, and often having no problem
reading those sectors later.  That drive isn't even nice enough to
report that it's really using 4k physical sectors ( despite my requests
for them to fix this firmware bug, they never released an updated
firmware ), though my new Blue drives do at least indicate they are 4k
physical.
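
An easy way to see what a given drive claims is to compare the logical
and physical block sizes the kernel exposes in sysfs (these are standard
block-queue attributes; the labels below are just my shorthand for the
three cases):

from pathlib import Path

for disk in sorted(Path("/sys/block").glob("sd*")):
    q = disk / "queue"
    logical = int((q / "logical_block_size").read_text())
    physical = int((q / "physical_block_size").read_text())
    kind = {(512, 512): "512 native",
            (512, 4096): "512e (AF, emulated)",
            (4096, 4096): "4k native"}.get((logical, physical), "other")
    print("%s: logical=%d physical=%d -> %s"
          % (disk.name, logical, physical, kind))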


* Re: The chunk size paradox
@ 2014-01-03  1:02             ` Phillip Susi
From: Phillip Susi @ 2014-01-03  1:02 UTC (permalink / raw)
  To: stan, joystick; +Cc: linux-raid

On 01/02/2014 06:16 PM, Stan Hoeppner wrote:
> Thank you for this information.  Now, if I actually had a 4K drive
> in my hands, and plugged it in, directly formatted it with XFS, no
> partitions, would the LBA addressing be 4K or 512B?  Or would I
> need to tweak kernel parameters?  Or possibly rebuild my kernel to
> support 4K sectors?

Addressing would be in 4k sectors, no need for tweaking.  You can see
it in action with the scsi_debug module if you don't have the actual
hardware.

> It's not relevant because you don't create an md RAID set from
> CD-ROM.

Well, the context wasn't specifically md; you claimed Linux did not
support non-512-byte sectors, full stop.  And you *can* build an md
raid on top of cd/dvd-rw, though obviously that's a little nutty.

> IIRC, there was a lengthy discussion about this on mm back when
> some folks wanted to use 16K-4GB pages on Itanium, and later 2M
> pages on x86-64, to cut down on the amount of memory required for
> page tables and to increase performance for big memory workloads.
> As I recall the arguments for continuing to use 4K pages across the
> world of Linux, regardless of architecture capability, and to NOT
> make it configurable as in HP-UX, were, paraphrasing:

Some archs don't support 4k pages at all.

> 1.  The kernel manipulates "everything" in pages so we need consistency
> 2.  While larger pages save page table space and increase throughput
>     for large memory intensive workloads, they cause more waste in
>     other structures and increase bandwidth demands for data that are
>     smaller than the page size
> 
> So, IIRC, it was decided that the page size would remain 4K
> basically forever.  So while it is *technically* possible to have a
> larger page size in Linux, it is absolutely not supported by the
> kernel team, nor any distro kernel, AFAIK.

Yep, that's why the default on those archs that have a choice is still
4k, but some do give the option for larger.  If you scan the ext4 and
md mailing lists you should find a few discussions of people doing
this, or just using larger block sizes on archs that only support
larger page sizes.

> And I'd guess a whole host of things will likely break as a result
> if you don't correctly modify much of the kernel source before
> running make.  See above.

Nope; otherwise it wouldn't be a Kconfig option.  Remember, the kernel
had to grow support for non-4k pages to work at all on archs that
don't support them.


* Re: The chunk size paradox
@ 2014-01-03  1:24               ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-03  1:24 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: Joe Landman, Phillip Susi, joystick, linux-raid

On 1/2/2014 4:56 PM, Carsten Aulbert wrote:

> http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771475.pdf

This is also a 512e drive.

> Or did I miss something?

You missed something.

-- 
Stan


* Re: The chunk size paradox
@ 2014-01-03  3:09             ` Joe Landman
From: Joe Landman @ 2014-01-03  3:09 UTC (permalink / raw)
  To: Peter Grandi, Linux RAID

On 01/02/2014 06:22 PM, Peter Grandi wrote:
> [ ... ]
>
>> 512 byte sector native drives are now far less common,
>
> Most smaller/faster SAS drivers still have 512 physical sectors,
> but >= 2TB drives, especially those with < 7200RPM, tend to have
> 4KiB physical sectors (because 4KiB sector drives are cheaper).

There was a reason for the 512 byte sectors, and that has something to 
do with the extended integrity sector length you can use if you format 
your drives for this.  There are a fair number of these enterprise-class 
drives which might not play well with translation firmware.  It's easier 
to keep producing these and gradually wean those customers off as they 
migrate to 12g scenarios.

>
>> with many of the drives being native 4096 bytes with a
>> translation layer for legacy systems that require 512 bytes.
>
> On a PC here I have half-and-half:
>
>    # lsscsi | grep sd.
>    [0:0:0:0]    disk    ATA      WDC WD2002FAEX-0 05.0  /dev/sda
>    [1:0:0:0]    disk    ATA      ST2000DM001-1CH1 CC44  /dev/sdb
>    [2:0:0:0]    disk    ATA      ST2000DM001-9YN1 CC4C  /dev/sdc
>    [3:0:0:0]    disk    ATA      Hitachi HDS72202 JKAO  /dev/sdd
>    [6:0:0:0]    disk    ATA      WDC WD20EARX-32P AB51  /dev/sde
>    [6:0:1:0]    disk    ATA      ST2000DL003-9VT1 CC32  /dev/sdf
>
>    # grep . /sys/block/sd?/queue/physical_block_size
>    /sys/block/sda/queue/physical_block_size:512
>    /sys/block/sdb/queue/physical_block_size:4096
>    /sys/block/sdc/queue/physical_block_size:4096
>    /sys/block/sdd/queue/physical_block_size:512
>    /sys/block/sde/queue/physical_block_size:4096
>    /sys/block/sdf/queue/physical_block_size:512

Yeah, it depends upon which disks you use.  The HGSTs we use annoyingly 
all register as 512, but the Seagates we use are now mostly 4k.

>
> They are all 2TB "consumer" drives, mostly recent ones. I am
> slightly surprised that half still have 512 physical sectors.

Some could be the translation layer ... I don't have evidence of this, 
but I am guessing that HGST is keeping the layer in place to minimize 
problems with old OSes.  That's the entire purpose of that translation 
layer ... pay a slight performance price for compatibility.




* Re: The chunk size paradox
@ 2014-01-03  3:14               ` Joe Landman
From: Joe Landman @ 2014-01-03  3:14 UTC (permalink / raw)
  To: Carsten Aulbert, stan; +Cc: Phillip Susi, joystick, linux-raid

On 01/02/2014 05:56 PM, Carsten Aulbert wrote:
> Hi
>
> sorry for joining the thread late

Hey Carsten, good to see you here!

>
> On 01/02/2014 11:42 PM, Stan Hoeppner wrote:
>>
>> Damn, you had me salivating Joe.  These are both AF 512e drives, not
>> native 4K.
>>
>>> and many others.
>>
>> I've not yet seen an announcement from anyone.  I'm not all seeing, but
>> I'd think such an announcement would cross my RADAR.
>>
>
> Just look around:
>
>
> http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771475.pdf
>
> and then the FAQ about "Advanced Format":
>
> https://wdc.custhelp.com/app/answers/detail/a_id/5655
>
> Why would one need to align to 4k if the on-disk layout were 512 bytes?
>
> Or did I miss something?

If you align to a 1k size (2x 512 byte emulated sectors), then you will 
be offset from the start of the real 4k sector.  As you can see from the 
WD document, Advanced Format drives are 4k with a 512-byte translation 
layer.  They are there specifically to provide backwards compatibility 
for old OSes.  For best performance, they recommend using the units in 
their native 4k format.

But they are 4k native and you can use them that way.   The 512 byte 
layer is emulated.
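
In numbers (an illustrative helper, not a real tool):

LOG, PHYS = 512, 4096     # emulated logical / real physical sector size

def clean(start_lba, n_lbas):
    # a request avoids the drive's internal read-modify-write only if
    # both its start and its length fall on physical sector boundaries
    return (start_lba * LOG) % PHYS == 0 and (n_lbas * LOG) % PHYS == 0

print(clean(63, 8))     # False - the old 63-sector partition start
print(clean(2048, 8))   # True  - the modern 1 MiB alignment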




* Re: The chunk size paradox
@ 2014-01-03  3:19                 ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-03  3:19 UTC (permalink / raw)
  To: Joe Landman, Carsten Aulbert; +Cc: Phillip Susi, joystick, linux-raid

On 1/2/2014 9:14 PM, Joe Landman wrote:

> But they are 4k native and you can use them that way.   The 512 byte
> layer is emulated.

Which series is this?  Is the mode switched via a jumper?

-- 
Stan




* Re: The chunk size paradox
@ 2014-01-03  4:24                   ` Stan Hoeppner
From: Stan Hoeppner @ 2014-01-03  4:24 UTC (permalink / raw)
  To: Joe Landman, Carsten Aulbert; +Cc: Phillip Susi, joystick, linux-raid

On 1/2/2014 9:19 PM, Stan Hoeppner wrote:
> On 1/2/2014 9:14 PM, Joe Landman wrote:
> 
>> But they are 4k native and you can use them that way.   The 512 byte
>> layer is emulated.
> 
> Which series is this?  Is the mode switched via a jumper?

According to WD's web site, all of their "datacenter" drives are 512n or
512e; none are listed as native 4K.  Same for their consumer drives.

I have just located about a dozen Seagate enterprise SSHD and Savvio 15K
models that are offered in native 4K.  These must be relatively new.  I
find no 4Kn high-capacity Seagate drives.

All of the HGST enterprise drives are 512n/512e; a select few support
520/528 byte sectors.

Toshiba's drive spec sheets don't even list sector size...

-- 
Stan


* Re: The chunk size paradox
@ 2014-01-03  4:58             ` Joe Landman
From: Joe Landman @ 2014-01-03  4:58 UTC (permalink / raw)
  To: Peter Grandi, Linux RAID

On 01/02/2014 06:22 PM, Peter Grandi wrote:

>    # grep . /sys/block/sd?/queue/physical_block_size
>    /sys/block/sda/queue/physical_block_size:512
>    /sys/block/sdb/queue/physical_block_size:4096
>    /sys/block/sdc/queue/physical_block_size:4096
>    /sys/block/sdd/queue/physical_block_size:512
>    /sys/block/sde/queue/physical_block_size:4096
>    /sys/block/sdf/queue/physical_block_size:512
>
> They are all 2TB "consumer" drives, mostly recent ones. I am
> slightly surprised that half still have 512 physical sectors.

Gaak ... our HGSTs are showing this as well.  I think I had an old 
sdparm when I looked before.  Trust that the kernel will rarely lie to 
you even if the tools that work with it do.


[root@unison ~]# grep . /sys/block/sd*/queue/physical_block_size
/sys/block/sdaa/queue/physical_block_size:4096
/sys/block/sdab/queue/physical_block_size:4096
/sys/block/sdac/queue/physical_block_size:512
/sys/block/sdad/queue/physical_block_size:4096
/sys/block/sdae/queue/physical_block_size:4096
/sys/block/sdaf/queue/physical_block_size:4096
/sys/block/sdag/queue/physical_block_size:4096
/sys/block/sdah/queue/physical_block_size:4096
/sys/block/sdai/queue/physical_block_size:4096
/sys/block/sdaj/queue/physical_block_size:4096
/sys/block/sdak/queue/physical_block_size:512
/sys/block/sdal/queue/physical_block_size:4096
/sys/block/sdam/queue/physical_block_size:4096
/sys/block/sdan/queue/physical_block_size:4096
/sys/block/sdao/queue/physical_block_size:4096
/sys/block/sdap/queue/physical_block_size:4096
/sys/block/sdaq/queue/physical_block_size:4096
/sys/block/sda/queue/physical_block_size:4096
/sys/block/sdar/queue/physical_block_size:4096
/sys/block/sdas/queue/physical_block_size:512
/sys/block/sdat/queue/physical_block_size:512
/sys/block/sdau/queue/physical_block_size:4096
/sys/block/sdav/queue/physical_block_size:4096
/sys/block/sdaw/queue/physical_block_size:4096
/sys/block/sdax/queue/physical_block_size:4096
/sys/block/sday/queue/physical_block_size:512
/sys/block/sdaz/queue/physical_block_size:512
/sys/block/sdba/queue/physical_block_size:4096
/sys/block/sdbb/queue/physical_block_size:4096
/sys/block/sdbc/queue/physical_block_size:4096
/sys/block/sdbd/queue/physical_block_size:4096
/sys/block/sdbe/queue/physical_block_size:4096
/sys/block/sdbf/queue/physical_block_size:4096
/sys/block/sdbg/queue/physical_block_size:512
/sys/block/sdbh/queue/physical_block_size:512
/sys/block/sdbi/queue/physical_block_size:4096
/sys/block/sdbj/queue/physical_block_size:4096
/sys/block/sdbk/queue/physical_block_size:512
/sys/block/sdbl/queue/physical_block_size:512
/sys/block/sdb/queue/physical_block_size:4096
/sys/block/sdc/queue/physical_block_size:4096
/sys/block/sdd/queue/physical_block_size:4096
/sys/block/sde/queue/physical_block_size:512
/sys/block/sdf/queue/physical_block_size:4096
/sys/block/sdg/queue/physical_block_size:4096
/sys/block/sdh/queue/physical_block_size:4096
/sys/block/sdi/queue/physical_block_size:4096
/sys/block/sdj/queue/physical_block_size:4096
/sys/block/sdk/queue/physical_block_size:4096
/sys/block/sdl/queue/physical_block_size:4096
/sys/block/sdm/queue/physical_block_size:512
/sys/block/sdn/queue/physical_block_size:4096
/sys/block/sdo/queue/physical_block_size:512
/sys/block/sdp/queue/physical_block_size:4096
/sys/block/sdq/queue/physical_block_size:4096
/sys/block/sdr/queue/physical_block_size:4096
/sys/block/sds/queue/physical_block_size:4096
/sys/block/sdt/queue/physical_block_size:4096
/sys/block/sdu/queue/physical_block_size:512
/sys/block/sdv/queue/physical_block_size:4096
/sys/block/sdw/queue/physical_block_size:4096
/sys/block/sdx/queue/physical_block_size:4096
/sys/block/sdy/queue/physical_block_size:4096
/sys/block/sdz/queue/physical_block_size:4096



This is one of the day job's large Ceph boxen.  The 4k units are HGST 
4TB enterprise drives, and the 512B units are SSDs.



* Re: The chunk size paradox
@ 2014-01-03 14:51           ` Benjamin ESTRABAUD
From: Benjamin ESTRABAUD @ 2014-01-03 14:51 UTC (permalink / raw)
  To: Wolfgang Denk, stan; +Cc: Phillip Susi, joystick, linux-raid

On 02/01/14 22:32, Wolfgang Denk wrote:
> Dear Stan,
>
> In message <52C5A9AA.9090300@hardwarefreak.com> you wrote:
>>
>>> filesystem blocks and memory pages aren't necessarily 4K, though that
>>> is the most common size.
>>
>> Yes, they are necessarily 4K in Linux.  Linux only supports page sized
>> BIO for consistency across the memory manager and IO subsystems.  Most
>> architectures which Linux currently supports have hardware page sizes
>> greater than 4K, for instance IA64 supports 4k/8k/16k, even a 4GB page
>> size.  But it was decided long ago to stick with 4K for a number of
>> reasons, one of these is stated above.  For background on this Google is
>> your friend.
>
> Well, you can tune the page size - and if you need no file system
> support (like when implementing a RAID controller card) making the
> page size exactly the same as the chunk size will allow for some nice
> performance optimizations (as you can avoid a lot of large memcpy()
> operations).
>
> We did this (some 6 years ago) for the (then AMCC) PPC440SPe
> processors; I can't find the old document any longer on APM's web
> site, but there is still a copy here ([1]) which shows the effect.
>
I had the chance to work with those AMCC boards and I confirm that 
aligning the chunk size with the (larger) PPC440 page size yielded some 
impressive performance results.  We were hitting the limits of the 
hardware.  On the other hand, the entire kernel was built for this 
particular hardware, and I'm not sure how well tuning the system page 
size would work on an Intel platform.

> So yes, tuning the system page size can have considerable impact, but
> only for special-purpose applications, and when optimising for large
> sequential I/O (which appears to be how the RAID controller
> manufacturers are testing / optimizing their systems).
>
>
> Fact is, with a file system on top of the RAID array, and with our
> typical work load of many very small files, a RAID6 with chunk size
> 16k will give much better results than with chunk size 64k - and
> anything even bigger will be worse.
>
>
> [1] ftp://ftp.denx.de/pub/demos/RAID-demo/doc/RAIDinLinux_PB_0529a.pdf
>
> Best regards,
>
> Wolfgang Denk
>
Regards,
Ben.
> --
> DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
> "The first rule of magic is simple. Don't waste your time waving your
> hands and hoping when a rock or a club will do."
>                                                 - McCloctnik the Lucid


