* mkfs.xfs states log stripe unit is too large
@ 2012-06-23 12:50 Ingo Jürgensmann
  2012-06-23 23:44 ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-23 12:50 UTC (permalink / raw)
  To: xfs

Hi!

I already brought this one up yesterday on #xfs@freenode, where it was suggested that I bring it to this ML. Here I go... 

I'm running Debian unstable on my desktop and lately added a new RAID set consisting of 3x 4 TB disks (namely Hitachi HDS724040ALE640). My partition layout is: 

Model: ATA Hitachi HDS72404 (scsi)
Disk /dev/sdd: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      17.4kB  1018kB  1000kB                     bios_grub
 2      2097kB  212MB   210MB   ext3               raid
 3      212MB   1286MB  1074MB  xfs                raid
 4      1286MB  4001GB  4000GB                     raid

Partition #2 is intended as /boot disk (RAID1), partition #3 as small rescue disk or swap (RAID1), partition #4 will be used as physical device for LVM (RAID5). 
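
On top of /dev/md7 (details shown below) the LVM layout is roughly the following - the exact lvcreate size here is just an example:

muaddib:~# pvcreate /dev/md7
muaddib:~# vgcreate lv /dev/md7
muaddib:~# lvcreate -L 20G -n usr lv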

muaddib:~# mdadm --detail /dev/md7
/dev/md7:
        Version : 1.2
  Creation Time : Fri Jun 22 22:47:15 2012
     Raid Level : raid5
     Array Size : 7811261440 (7449.40 GiB 7998.73 GB)
  Used Dev Size : 3905630720 (3724.70 GiB 3999.37 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Sat Jun 23 13:47:19 2012
          State : clean 
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : muaddib:7  (local to host muaddib)
           UUID : 0be7f76d:90fe734e:ac190ee4:9b5f7f34
         Events : 20

    Number   Major   Minor   RaidDevice State
       0       8       68        0      active sync   /dev/sde4
       1       8       52        1      active sync   /dev/sdd4
       3       8       84        2      active sync   /dev/sdf4


So, a cat /proc/mdstat shows all of my RAID devices: 

muaddib:~# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
      7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      
md6 : active raid1 sdd3[0] sdf3[2] sde3[1]
      1048564 blocks super 1.2 [3/3] [UUU]
      
md5 : active (auto-read-only) raid1 sdd2[0] sdf2[2] sde2[1]
      204788 blocks super 1.2 [3/3] [UUU]
      
md4 : active raid5 sdc6[0] sda6[2] sdb6[1]
      1938322304 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      
md3 : active (auto-read-only) raid1 sdc5[0] sda5[2] sdb5[1]
      1052160 blocks [3/3] [UUU]
      
md2 : active raid1 sdc3[0] sda3[2] sdb3[1]
      4192896 blocks [3/3] [UUU]
      
md1 : active (auto-read-only) raid1 sdc2[0] sda2[2] sdb2[1]
      2096384 blocks [3/3] [UUU]
      
md0 : active raid1 sdc1[0] sda1[2] sdb1[1]
      256896 blocks [3/3] [UUU]
      
unused devices: <none>

The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB Seagate disks. Anyway, to finally come to the problem, when I try to create a filesystem on the new RAID5 I get the following:  

muaddib:~# mkfs.xfs /dev/lv/usr
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/lv/usr            isize=256    agcount=16, agsize=327552 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=5240832, imaxpct=25
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


As you can see, I follow the "mkfs.xfs knows best, don't fiddle around with options unless you know what you're doing!" advice. But apparently mkfs.xfs wanted to create a log stripe unit of 512 kiB, most likely because that's the chunk size of the underlying RAID device. 
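
(mkfs.xfs picks this geometry up from the I/O topology that md exports and 
LVM passes through. Assuming the tools report what I'd expect here, this 
can be seen with e.g.:

muaddib:~# cat /sys/block/md7/md/chunk_size
muaddib:~# blockdev --getiomin --getioopt /dev/lv/usr

which should show 524288 bytes for the chunk / minimum I/O size and 
1048576 bytes for the optimal I/O size, i.e. the chunk times the two 
data disks.)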

The problem seems to be related to RAID5, because when I try to make a filesystem on /dev/md6 (RAID1), there's no error message: 

muaddib:~# mkfs.xfs /dev/md6
meta-data=/dev/md6               isize=256    agcount=8, agsize=32768 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=262141, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=1200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Additional info: 
I first bought two 4 TB disks and ran them for about 6 weeks as a RAID1 and already did some tests (because the 4 TB Hitachis were sold out in the meantime). I can't remember seeing the log stripe error message during those tests while working with a RAID1. 

So, the question is: 
- is this a bug somewhere in XFS, LVM or Linux's software RAID implementation?
- will performance suffer from the log stripe size being adjusted down to just 32 kiB? Some of my logical volumes will just store data, but one or another will see some real workload, acting as storage for BackupPC. 
- would it be worth the effort to raise log stripe to at least 256 kiB?
- or would it be better to run with external log on the old 1 TB RAID?

End note: the 4 TB disks are not yet "in production", so I can run tests with both RAID setup as well as mkfs.xfs. Reshaping the RAID will take up to 10 hours, though... 

-- 
Ciao...            //      Fon: 0381-2744150
      Ingo       \X/       http://blog.windfluechter.net


gpg pubkey:  http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-23 12:50 mkfs.xfs states log stripe unit is too large Ingo Jürgensmann
@ 2012-06-23 23:44 ` Dave Chinner
  2012-06-24  2:20   ` Eric Sandeen
  2012-06-25 10:33   ` Ingo Jürgensmann
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2012-06-23 23:44 UTC (permalink / raw)
  To: Ingo Jürgensmann; +Cc: xfs

On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
> muaddib:~# cat /proc/mdstat 
> Personalities : [raid1] [raid6] [raid5] [raid4] 
> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
>       7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
.....

> The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB
> Seagate disks. Anyway, to finally come to the problem, when I try
> to create a filesystem on the new RAID5 I get the following:  
> 
> muaddib:~# mkfs.xfs /dev/lv/usr
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/lv/usr            isize=256    agcount=16, agsize=327552 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=5240832, imaxpct=25
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> As you can see I follow the "mkfs.xfs knows best, don't fiddle
> around with options unless you know what you're doing!"-advice.
> But apparently mkfs.xfs wanted to create a log stripe unit of 512
> kiB, most likely because it's the same chunk size as the
> underlying RAID device. 

Exactly. Best thing in general is to align all log writes to the
underlying stripe unit of the array. That way as multiple frequent
log writes occur, it is guaranteed to form full stripe writes and
basically have no RMW overhead. 32k is chosen by default because
that's the default log buffer size and hence the typical size of
log writes.

If you increase the log stripe unit, you also increase the minimum
log buffer size that the filesystem supports. The filesystem can
support up to 256k log buffers, and hence the limit on maximum log
stripe alignment.
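
If you really wanted a larger log stripe unit, the knobs are the log
stripe unit at mkfs time and the log buffer size at mount time, e.g.
something like:

  # mkfs.xfs -l su=256k /dev/lv/usr
  # mount -o logbsize=256k /dev/lv/usr /mnt

(just an illustration - whether that's a good idea depends on your
workload, see below).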

> The problem seems to be related to RAID5, because when I try to
> make a filesystem on /dev/md6 (RAID1), there's no error message:

They don't have a stripe unit/stripe width, so no alignment is
needed or configured.

> So, the question is: 
> - is this a bug somewhere in XFS, LVM or Linux's software RAID
> implementation?

Not a bug at all.

> - will performance suffer from log stripe size adjusted to just 32
> kiB? Some of my logical volumes will just store data, but one or
> the other will have some workload acting as storage for BackupPC.

For data volumes, no. For backupPC, it depends on whether the MD
RAID stripe cache can turn all the sequential log writes into a full
stripe write. In general, this is not a problem, and is almost never
a problem for HW RAID with BBWC....
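
(The md stripe cache is tunable if you want to experiment, e.g.:

  # cat /sys/block/md7/md/stripe_cache_size
  # echo 4096 > /sys/block/md7/md/stripe_cache_size

where 4096 is only an example value, and larger values cost more memory.)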

> - would it be worth the effort to raise log stripe to at least 256
> kiB?

Depends on your workload. If it is fsync heavy, I'd advise against
it, as every log write will be padded out to 256k, even if you only
write 500 bytes worth of transaction data....

> - or would it be better to run with external log on the old 1 TB
> RAID?

External logs provide much less benefit with delayed logging than they
used to. As it is, your external log needs to have the same
reliability characteristics as the main volume - lose the log,
corrupt the filesystem. Hence for RAID5 volumes, you need a RAID1
log, and for RAID6 you either need RAID6 or a 3-way mirror to
provide the same reliability....
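
For illustration only - the log device name is made up - an external
log would look something like:

  # mkfs.xfs -l logdev=/dev/md/xfslog,size=128m /dev/lv/usr
  # mount -o logdev=/dev/md/xfslog /dev/lv/usr /mnt

where /dev/md/xfslog would have to be that RAID1 (or better) device.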

> End note: the 4 TB disks are not yet "in production", so I can run
> tests with both RAID setup as well as mkfs.xfs. Reshaping the RAID
> will take up to 10 hours, though... 

IMO, RAID reshaping is just a bad idea - it changes the alignment
characteristic of the volume, hence everything that the
filesystem laid down in an aligned fashion is now unaligned, and you
have to tell the filesystem the new alignment before new files will be
correctly aligned. Also, it's usually faster to back up, recreate
and restore than reshape and that puts a lot less load on your
disks, too...
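
(If you do reshape anyway, the new alignment can be passed in at mount
time, e.g. something like:

  # mount -o sunit=1024,swidth=3072 /dev/lv/usr /mnt

with both values in 512 byte sectors and matched to the reshaped
geometry - the numbers above are only an example for a 512k chunk,
4-disk RAID5.)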

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-23 23:44 ` Dave Chinner
@ 2012-06-24  2:20   ` Eric Sandeen
  2012-06-24 13:05     ` Stan Hoeppner
  2012-06-25 10:33   ` Ingo Jürgensmann
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Sandeen @ 2012-06-24  2:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ingo Jürgensmann, xfs

On 6/23/12 6:44 PM, Dave Chinner wrote:
> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
>> muaddib:~# cat /proc/mdstat 
>> Personalities : [raid1] [raid6] [raid5] [raid4] 
>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
>>       7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
> .....
> 
>> The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB
>> Seagate disks. Anyway, to finally come to the problem, when I try
>> to create a filesystem on the new RAID5 I get the following:  
>>
>> muaddib:~# mkfs.xfs /dev/lv/usr
>> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
>> log stripe unit adjusted to 32KiB

...

> 
>> So, the question is: 
>> - is this a bug somewhere in XFS, LVM or Linux's software RAID
>> implementation?
> 
> Not a bug at all.

Dave, I'd suggest that we should remove the warning though, if XFS picks
the wrong defaults and then overrides itself.

Rule of Silence: When a program has nothing surprising to say, it should say nothing.

;)

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24  2:20   ` Eric Sandeen
@ 2012-06-24 13:05     ` Stan Hoeppner
  2012-06-24 13:17       ` Ingo Jürgensmann
  2012-06-24 15:03       ` Ingo Jürgensmann
  0 siblings, 2 replies; 24+ messages in thread
From: Stan Hoeppner @ 2012-06-24 13:05 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Ingo Jürgensmann, xfs

On 6/23/2012 9:20 PM, Eric Sandeen wrote:
> On 6/23/12 6:44 PM, Dave Chinner wrote:
>> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
>>> muaddib:~# cat /proc/mdstat 
>>> Personalities : [raid1] [raid6] [raid5] [raid4] 
>>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
>>>       7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
                                               ^^^^^^^^^^

The log stripe unit mismatch error is a direct result of Ingo
manually choosing a rather large chunk size for his two stripe spindle
md array, yielding a 1MB stripe, and using an internal log with it.
Maybe there is a good reason for this, but I'm going to challenge it.

The default md chunk size IIRC is 64KB, 8x smaller than Ingo's chunk.
With the default it would require 16 stripe spindles to reach a 1MB
stripe.  Ingo has TWO stripe spindles.

In the default case with a 1MB stripe and 16 spindles, each aligned
aggregated stripe write out will be 256 XFS blocks, or 16 blocks to each
spindle, 128 sectors (512 byte).  In Ingo's case, it will be 128 XFS
blocks, 1024 sectors.

Does BackupPC perform better writing 2048 sectors per stripe write,
1024 per spindle, with two spindles, than 256 sectors per stripe write,
128 per spindle, using two spindles?
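
(For clarity, the arithmetic I'm using: full stripe = chunk size x data
spindles.  Default case, 64KB x 16 spindles: 1MB = 256 XFS blocks = 2048
sectors, i.e. 16 blocks / 128 sectors per spindle.  Ingo's case, 512KB x
2 spindles: also 1MB = 256 blocks = 2048 sectors, but 128 blocks / 1024
sectors per spindle.  A 64KB chunk with 2 spindles: 128KB = 32 blocks =
256 sectors, again 16 blocks / 128 sectors per spindle.)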

>> .....
>>
>>> The RAID devices /dev/md0 to /dev/md4 are on my old 3x 1 TB
>>> Seagate disks. Anyway, to finally come to the problem, when I try
>>> to create a filesystem on the new RAID5 I get the following:  
>>>
>>> muaddib:~# mkfs.xfs /dev/lv/usr
>>> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
>>> log stripe unit adjusted to 32KiB
> 
> ...
> 
>>
>>> So, the question is: 
>>> - is this a bug somewhere in XFS, LVM or Linux's software RAID
>>> implementation?
>>
>> Not a bug at all.
> 
> Dave, I'd suggest that we should remove the warning though, if XFS picks
> the wrong defaults and then overrides itself.
> 
> Rule of Silence: When a program has nothing surprising to say, it should say nothing.

I think this goes to the heart of the matter.  Ingo chose an arbitrarily
large chunk size apparently without understanding the ramifications.
mkfs.xfs was written to read md parameters I believe with the assumption
the parameters were md defaults.  It obviously wasn't written to
gracefully deal with a manually configured, arbitrarily large md chunk size.

Maybe a better solution than silence here would be education.  Flag the
mismatch as we do now, and provide a URL to a new FAQ entry that
explains why this occurs, and possible solutions to the problem, the
first recommendation being to choose a sane chunk size.

Question:  does this occur with hardware RAID when entering all the same
parameters manually on the command line?  Or is this error limited to
the md interrogation path?

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 13:05     ` Stan Hoeppner
@ 2012-06-24 13:17       ` Ingo Jürgensmann
  2012-06-24 19:28         ` Stan Hoeppner
  2012-06-24 15:03       ` Ingo Jürgensmann
  1 sibling, 1 reply; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-24 13:17 UTC (permalink / raw)
  To: xfs; +Cc: stan

Am 24.06.2012 um 15:05 schrieb Stan Hoeppner:

> On 6/23/2012 9:20 PM, Eric Sandeen wrote:
>> On 6/23/12 6:44 PM, Dave Chinner wrote:
>>> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
>>>> muaddib:~# cat /proc/mdstat 
>>>> Personalities : [raid1] [raid6] [raid5] [raid4] 
>>>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
>>>>      7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>                                               ^^^^^^^^^^
> 
> The log stripe unit mismatch error is a direct result of Ingo
> manually choosing a rather large chunk size for his two stripe spindle
> md array, yielding a 1MB stripe, and using an internal log with it.
> Maybe there is a good reason for this, but I'm going to challenge it.

Correction: I did not manually choose that chunk size, but it was automatically chosen by mdadm when creating the RAID5. 

> The default md chunk size IIRC is 64KB, 8x smaller than Ingo's chunk.

64k is the default for creating RAIDs with 0.90 format superblock. My RAID5 has a 1.2 format superblock. 
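
In other words, the array was created with nothing beyond the defaults, roughly: 

muaddib:~# mdadm --create /dev/md7 --level=5 --raid-devices=3 /dev/sdd4 /dev/sde4 /dev/sdf4

No --chunk= and no --metadata= option was given, so mdadm picked the 512k chunk and the 1.2 superblock on its own. 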

> Does backup PC perform better writing 2048 sectors per stripe write,
> 1024 per spindle, with two spindles, than 256 sectors per stripe write,
> 128 per spindle, using two spindles?

I don't know how BackupPC actually writes the data, but it does make extensive use of hardlinks to save some disk space. A sort of deduplication, if you want to call it that. 

>> Rule of Silence: When a program has nothing surprising to say, it should say nothing.
> I think this goes to the heart of the matter.  Ingo chose an arbitrarily
> large chunk size apparently without understanding the ramifications.

That's wrong! I've just worked with the defaults.  

-- 
Ciao...            //      Fon: 0381-2744150
      Ingo       \X/       http://blog.windfluechter.net


gpg pubkey:  http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 13:05     ` Stan Hoeppner
  2012-06-24 13:17       ` Ingo Jürgensmann
@ 2012-06-24 15:03       ` Ingo Jürgensmann
  2012-06-26  2:30         ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-24 15:03 UTC (permalink / raw)
  To: xfs

On 2012-06-24 15:05, Stan Hoeppner wrote:

> The log stripe unit mismatch error is a direct result of Ingo
> manually choosing a rather large chunk size for his two stripe 
> spindle
> md array, yielding a 1MB stripe, and using an internal log with it.
> Maybe there is a good reason for this, but I'm going to challenge it.

To cite man mdadm:

        -c, --chunk=
               Specify chunk size of kibibytes.  The  default  when
               creating an array is 512KB.  To ensure compatibility
               with earlier versions, the default when Building and
               array  with no persistent metadata is 64KB.  This is
               only meaningful for RAID0, RAID4, RAID5, RAID6,  and
               RAID10.

So, actually there's a mismatch between the defaults of mdadm and mkfs.xfs. 
Maybe it's worthwhile to think of raising the log stripe maximum size to 
at least 512 kiB? I don't know what implications this could have, 
though...

-- 
Ciao...          //    Fon: 0381-2744150
.     Ingo     \X/     http://blog.windfluechter.net

gpg pubkey: http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 13:17       ` Ingo Jürgensmann
@ 2012-06-24 19:28         ` Stan Hoeppner
  2012-06-24 19:51           ` Ingo Jürgensmann
  0 siblings, 1 reply; 24+ messages in thread
From: Stan Hoeppner @ 2012-06-24 19:28 UTC (permalink / raw)
  To: Ingo Jürgensmann; +Cc: xfs

On 6/24/2012 8:17 AM, Ingo Jürgensmann wrote:
> Am 24.06.2012 um 15:05 schrieb Stan Hoeppner:
> 
>> On 6/23/2012 9:20 PM, Eric Sandeen wrote:
>>> On 6/23/12 6:44 PM, Dave Chinner wrote:
>>>> On Sat, Jun 23, 2012 at 02:50:49PM +0200, Ingo Jürgensmann wrote:
>>>>> muaddib:~# cat /proc/mdstat 
>>>>> Personalities : [raid1] [raid6] [raid5] [raid4] 
>>>>> md7 : active raid5 sdf4[3] sdd4[1] sde4[0]
>>>>>      7811261440 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
>>                                               ^^^^^^^^^^
>>
>> The log stripe unit mismatch error is a direct result of Ingo
>> manually choosing a rather large chunk size for his two stripe spindle
>> md array, yielding a 1MB stripe, and using an internal log with it.
>> Maybe there is a good reason for this, but I'm going to challenge it.
> 
> Correction: I did not manually choose that chunk size, but it was automatically chosen by mdadm when creating the RAID5. 
> 
>> The default md chunk size IIRC is 64KB, 8x smaller than Ingo's chunk.
> 
> 64k is the default for creating RAIDs with 0.90 format superblock. My RAID5 has a 1.2 format superblock. 
> 
>> Does backup PC perform better writing 2048 sectors per stripe write,
>> 1024 per spindle, with two spindles, than 256 sectors per stripe write,
>> 128 per spindle, using two spindles?
> 
> Don't know how BackupPC actually writes the data, but it does make extensive use of hardlinks to save some diskspace. Some sort of deduplicating, if you like to say it that way. 
> 
>>> Rule of Silence: When a program has nothing surprising to say, it should say nothing.
>> I think this goes to the heart of the matter.  Ingo chose an arbitrarily
>> large chunk size apparently without understanding the ramifications.
> 
> That's wrong! I've just worked with the defaults.  

At this point I get the feeling you're sandbagging us, Ingo.  AFAIK you
have the distinction of being the very first person on earth to report
this problem.  This would suggest you're the first XFS user with an
internal log to use the mdadm defaults.  Do you think that's likely?

Thus, I'd guess that the metadata format changed from 0.90 to 1.2 with a
very recent release of mdadm.  Are you using distro supplied mdadm, a
backported more recent mdadm, or did you build mdadm from the most
recent source?

If either of the latter two, don't you think it would have been wise to
inform us that "hey, I'm using the bleeding edge mdadm just released"?
Or if you're using a brand new distro release?

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 19:28         ` Stan Hoeppner
@ 2012-06-24 19:51           ` Ingo Jürgensmann
  2012-06-24 22:15             ` Stan Hoeppner
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-24 19:51 UTC (permalink / raw)
  To: stan; +Cc: xfs

Am 24.06.2012 um 21:28 schrieb Stan Hoeppner:

> Thus, I'd guess that the metadata format changed from 0.90 to 1.2 with a
> very recent release of mdadm.  Are you using distro supplied mdadm, a
> backported more recent mdadm, or did you build mdadm from the most
> recent source?

As I already wrote, I'm using Debian unstable, therefore distro supplied mdadm. Otherwise I'd have said this. 

> If either of the latter two, don't you think it would have been wise to
> inform us that "hay, I'm using the bleeding edge mdadm just released"?
> Or if you're using a brand new distro release?

I don't think that Debian unstable is bleeding edge. 

I find it strange that you've misinterpreted citing the mdadm man page as "sandbagging us". =:-O

-- 
Ciao...            //      Fon: 0381-2744150
      Ingo       \X/       http://blog.windfluechter.net


gpg pubkey:  http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 19:51           ` Ingo Jürgensmann
@ 2012-06-24 22:15             ` Stan Hoeppner
  2012-06-25  5:25               ` Ingo Jürgensmann
  0 siblings, 1 reply; 24+ messages in thread
From: Stan Hoeppner @ 2012-06-24 22:15 UTC (permalink / raw)
  To: Ingo Jürgensmann; +Cc: xfs

On 6/24/2012 2:51 PM, Ingo Jürgensmann wrote:
> Am 24.06.2012 um 21:28 schrieb Stan Hoeppner:
> 
>> Thus, I'd guess that the metadata format changed from 0.90 to 1.2 with a
>> very recent release of mdadm.  Are you using distro supplied mdadm, a
>> backported more recent mdadm, or did you build mdadm from the most
>> recent source?
> 
> As I already wrote, I'm using Debian unstable, therefore distro supplied mdadm. Otherwise I'd have said this. 

Yes, you did mention SID, and I missed it.

SID is the problem here, or I should say, the cause of the error
message.  SID is leading (better?) edge, and is obviously using a recent
mdadm release, which defaults to metadata 1.2, and chunk of 512KB.

As more distros adopt newer mdadm, reports of this will be more
prevalent.  So Eric's idea is likely preferable to mine.  XFS making a
recommendation against an md default would fly like a lead balloon...

> I don't think that Debian unstable is bleeding edge. 

It's apparently close enough in the case of mdadm, given you're the
first to report this, AFAIK.

> I find it strange that you've misinterpreted citing the mdadm man page as "sandbagging us". =:-O

Sandbagging simply means holding something back, withholding
information.  Had you actually not mentioned your OS/version, this would
have been an accurate take on the situation.  But again, you did, and I
simply missed it.  So again, my apologies for missing your mention of
SID in your opening email.  That would have prevented my skeptical demeanor.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 22:15             ` Stan Hoeppner
@ 2012-06-25  5:25               ` Ingo Jürgensmann
       [not found]                 ` <4FE8CEED.7070505@hardwarefreak.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-25  5:25 UTC (permalink / raw)
  To: stan; +Cc: xfs

On 2012-06-25 00:15, Stan Hoeppner wrote:

>> As I already wrote, I'm using Debian unstable, therefore distro 
>> supplied mdadm. Otherwise I'd have said this.
> SID is the problem here, or I should say, the cause of the error
> message.  SID is leading (better?) edge, and is obviously using a 
> recent
> mdadm release, which defaults to metadata 1.2, and chunk of 512KB.
> As more distros adopt newer mdadm, reports of this will be more
> prevalent.  So Eric's idea is likely preferable than mine.  XFS 
> making a
> recommendation against an md default would fly like a lead balloon...

Actually, even man page of Debian stable (Squeeze) mentions:

        -c, --chunk=
               Specify chunk size of kibibytes.  The default when
               creating an array is 512KB.  To ensure compatibility
               with earlier versions, the default when Building and
               array with no persistent metadata is 64KB.  This is
               only meaningful for RAID0, RAID4, RAID5, RAID6, and
               RAID10.

So, the question is: why did mdadm choose 1.2 format superblock this 
time? My guess is, that's because of GPT labelled disks instead of MBR, 
but it's only a guess. Maybe it's because the new md device is bigger in 
size. All of my other md devices on MBR labelled disks do have 0.90 
format superblock, all md devices on the GPT disks are of 1.2 format.
Although it doesn't seem to be a new default in mdadm to me, your assumption 
would still stand if the cause turned out to be the GPT label. More 
and more people will start using GPT labelled disks.
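
(For reference, the superblock version and chunk size can also be checked 
per member device, e.g.:

muaddib:~# mdadm --examine /dev/sdd4 | egrep 'Version|Chunk Size'

which should show the 1.2 superblock and the 512K chunk for the new 
array's members.)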

>> I find it strange that you've misinterpreted citing the mdadm man 
>> page as "sandbagging us". =:-O
> Sandbagging simply means holding something back, withholding
> information.

Ah ok, I misread "sandbagging us" as "sandboxing us", as in boxing us 
in like in a sandbox. So, my apologies here. :-)

-- 
Ciao...          //    Fon: 0381-2744150
.     Ingo     \X/     http://blog.windfluechter.net

gpg pubkey: http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-23 23:44 ` Dave Chinner
  2012-06-24  2:20   ` Eric Sandeen
@ 2012-06-25 10:33   ` Ingo Jürgensmann
  1 sibling, 0 replies; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-25 10:33 UTC (permalink / raw)
  To: xfs

On 2012-06-24 01:44, Dave Chinner wrote:

> If you increase the log stripe unit, you also increase the minimum
> log buffer size that the filesystem supports. The filesystem can
> support up to 256k log buffers, and hence the limit on maximum log
> stripe alignment.

So, no way to increase log buffers to match the 1.2 format superblock's 
default chunk size of 512 kiB, I guess, because it would change the 
on-disk format?

>> - will performance suffer from log stripe size adjusted to just 32
>> kiB? Some of my logical volumes will just store data, but one or
>> the other will have some workload acting as storage for BackupPC.
> For data volumes, no. For backupPC, it depends on whether the MD
> RAID stripe cache can turn all the sequential log writes into a full
> stripe write. In general, this is not a problem, and is almost never
> a problem for HW RAID with BBWC....

Well, the external log would have been on my other RAID disks. Having a 
RAID1 for this would be doable, but I decided not to go that way. It 
would limit me too much when replacing those 1 TB disks with bigger ones 
sometime in the future.
Regarding BackupPC: it might more likely benefit from a smaller log 
stripe size, because BackupPC makes extensive use of hardlinks, so I 
guess the overhead will be smaller when using 32 kiB log stripe size, as 
you suggest as well below:

>> - would it be worth the effort to raise log stripe to at least 256
>> kiB?
> Depends on your workload. If it is fsync heavy, I'd advise against
> it, as every log write will be padded out to 256k, even if you only
> write 500 bytes worth of transaction data....

BackupPC will check against its pool of files, whether a file is 
already in it (by comparing md5sum or shaXXXsum) or not. If it's in the 
pool already it will hardlink to it, if it's not it will copy the file 
and hardlink then. Therefore I assume that the workload will mainly be 
fsyncs.

>> - or would it be better to run with external log on the old 1 TB
>> RAID?
> External logs provide much less benefit with delayed logging than they
> used to. As it is, your external log needs to have the same
> reliability characteristics as the main volume - lose the log,
> corrupt the filesystem. Hence for RAID5 volumes, you need a RAID1
> log, and for RAID6 you either need RAID6 or a 3-way mirror to
> provide the same reliability....

That would be possible. But as stated above, I won't go that way for 
practical reasons.

>> End note: the 4 TB disks are not yet "in production", so I can run
>> tests with both RAID setup as well as mkfs.xfs. Reshaping the RAID
>> will take up to 10 hours, though...
> IMO, RAID reshaping is just a bad idea - it changes the alignment
> characteristic of the volume, hence everything that the
> filesystem laid down in an aligned fashion is now unaligned, and you
> have to tell the filesystem the new alignment before new files will be
> correctly aligned. Also, it's usually faster to back up, recreate
> and restore than reshape and that puts a lot less load on your
> disks, too...

True. Therefore I've re-created the RAID from scratch instead of keeping 
the one reshaped from RAID1 to RAID5. Anyway, reshaping is only an issue as 
long as there's already a FS on it. But a bad feeling still persists... 
;)

Thanks for your explanation, Dave!

-- 
Ciao...          //    Fon: 0381-2744150
.     Ingo     \X/     http://blog.windfluechter.net

gpg pubkey: http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
       [not found]                 ` <4FE8CEED.7070505@hardwarefreak.com>
@ 2012-06-25 21:18                   ` Ingo Jürgensmann
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-25 21:18 UTC (permalink / raw)
  To: stan; +Cc: xfs

On 2012-06-25 22:49, Stan Hoeppner wrote:

> I've never understood exactly what this means, but it's apparently
> involved with some of the arrays you've built with Stable and SID:
>
> "To ensure compatibility with earlier versions, the default when
> Building and array with no persistent metadata is 64KB."
>
> How does one "build an array with no persistent metadata"?  Does this
> simply mean forcing metadata .90 on the command line?

IIRC, the metadata in 1.2 is populated over the RAID whereas in 0.90 it 
was only at the beginning. But take that with care. I've no source for 
that assumption. It's only somewhere in my mind that I think I might 
have read about this somewhere, somewhen...
Someone else will know better and correct me for sure. :-)

>> So, the question is: why did mdadm choose 1.2 format superblock this
>> time? My guess is, that's because of GPT labelled disks instead of 
>> MBR,
>> but it's only a guess. Maybe it's because the new md device is 
>> bigger in
>> size. All of my other md devices on MBR labelled disks do have 0.90
>> format superblock, all md devices on the GPT disks are of 1.2 
>> format.
>> Although it doesn't seem a new default in mdadm for me, your 
>> assumption
>> would still stand if the cause would turn out to be the GPT label. 
>> More
>> and more people will start using GPT labelled disks.
> Ok this is really interesting as this is undocumented behavior, if
> indeed this is occurring.  Would you mind firing up a thread about 
> this
> on the linux-raid list?

I've talked to some guys on #debian.de in the meantime. I don't think 
now that this has anything to do with GPT labels. According to 
#debian.de the default behaviour in mdadm was changed after release of 
Squeeze. Already before Squeeze, metadata format 0.90 was obsolete and 
was only kept for Squeeze for backward compatibility reasons.

So, it's indeed a changed default within Debian, but nothing new for 
upstream mdadm and it's likely that other distros have adopted the 
upstream default way before Debian did.

-- 
Ciao...          //    Fon: 0381-2744150
.     Ingo     \X/     http://blog.windfluechter.net

gpg pubkey: http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-24 15:03       ` Ingo Jürgensmann
@ 2012-06-26  2:30         ` Dave Chinner
  2012-06-26  8:02             ` Christoph Hellwig
                             ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Dave Chinner @ 2012-06-26  2:30 UTC (permalink / raw)
  To: Ingo Jürgensmann; +Cc: xfs

On Sun, Jun 24, 2012 at 05:03:47PM +0200, Ingo Jürgensmann wrote:
> On 2012-06-24 15:05, Stan Hoeppner wrote:
> 
> >The log stripe unit mismatch error is a direct result of Ingo
> >manually choosing a rather large chunk size for his two stripe
> >spindle
> >md array, yielding a 1MB stripe, and using an internal log with it.
> >Maybe there is a good reason for this, but I'm going to challenge it.
> 
> To cite man mdadm:
> 
>        -c, --chunk=
>               Specify chunk size of kibibytes.  The  default  when
>               creating an array is 512KB.  To ensure compatibility
>               with earlier versions, the default when Building and
>               array  with no persistent metadata is 64KB.  This is
>               only meaningful for RAID0, RAID4, RAID5, RAID6,  and
>               RAID10.
> 
> So, actually there's a mismatch between the defaults of mdadm and
> mkfs.xfs. Maybe it's worthwhile to think of raising the log stripe
> maximum size to at least 512 kiB? I don't know what implications
> this could have, though...

You can't, simple as that. The maximum supported is 256k. As it is,
a default chunk size of 512k is probably harmful to most workloads -
large chunk sizes mean that just about every write will trigger a
RMW cycle in the RAID because it is pretty much impossible to issue
full stripe writes. Writeback doesn't do any alignment of IO (the
generic page cache writeback path is the problem here), so we will
almost always be doing unaligned IO to the RAID, and there will be
little opportunity for sequential IOs to merge and form full stripe
writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).

IOWs, every time you do a small isolated write, the MD RAID volume
will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
Given that most workloads are not doing lots and lots of large
sequential writes this is, IMO, a pretty bad default given typical
RAID5/6 volume configurations we see....
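
If you are setting up RAID5/6 for this kind of mixed workload, explicitly
choosing a smaller chunk at array creation time is probably a better
starting point, e.g. something like:

  # mdadm --create /dev/md7 --level=5 --raid-devices=3 --chunk=64 \
        /dev/sdd4 /dev/sde4 /dev/sdf4

where 64k is just an example of a saner value, not a number tuned to
your workload.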

Without the warning, nobody would have noticed this. I think the
warning has value - even if it is just to indicate MD now uses a
bad default value for common workloads..

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-26  2:30         ` Dave Chinner
@ 2012-06-26  8:02             ` Christoph Hellwig
  2012-06-26 19:34           ` Ingo Jürgensmann
  2012-06-27  2:06           ` Eric Sandeen
  2 siblings, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2012-06-26  8:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ingo Jürgensmann, xfs, linux-raid

On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote:
> You can't, simple as that. The maximum supported is 256k. As it is,
> a default chunk size of 512k is probably harmful to most workloads -
> large chunk sizes mean that just about every write will trigger a
> RMW cycle in the RAID because it is pretty much impossible to issue
> full stripe writes. Writeback doesn't do any alignment of IO (the
> generic page cache writeback path is the problem here), so we will
> almost always be doing unaligned IO to the RAID, and there will be
> little opportunity for sequential IOs to merge and form full stripe
> writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).
> 
> IOWs, every time you do a small isolated write, the MD RAID volume
> will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> Given that most workloads are not doing lots and lots of large
> sequential writes this is, IMO, a pretty bad default given typical
> RAID5/6 volume configurations we see....

Not too long ago I benchmarked out mdraid stripe sizes, and at least
for XFS 32kb was a clear winner, anything larger decreased performance.

ext4 didn't get hit that badly with larger stripe sizes, probably
because they still internally bump the writeback size like crazy, but
they did not actually get faster with larger stripes either.

This was streaming data heavy workloads, anything more metadata heavy
probably will suffer from larger stripes even more.

Ccing the linux-raid list if there actually is any reason for these
defaults, something I wanted to ask for a long time but never really got
back to.

Also I'm pretty sure back then the md default was 256kb writes, not 512
so it seems the defaults further increased.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-26  2:30         ` Dave Chinner
  2012-06-26  8:02             ` Christoph Hellwig
@ 2012-06-26 19:34           ` Ingo Jürgensmann
  2012-06-27  2:06           ` Eric Sandeen
  2 siblings, 0 replies; 24+ messages in thread
From: Ingo Jürgensmann @ 2012-06-26 19:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Am 26.06.2012 um 04:30 schrieb Dave Chinner:

> On Sun, Jun 24, 2012 at 05:03:47PM +0200, Ingo Jürgensmann wrote:
>> On 2012-06-24 15:05, Stan Hoeppner wrote:
>> 
>>> The log stripe unit mismatch error is a direct result of Ingo
>>> manually choosing a rather large chunk size for his two stripe
>>> spindle
>>> md array, yielding a 1MB stripe, and using an internal log with it.
>>> Maybe there is a good reason for this, but I'm going to challenge it.
>> 
>> To cite man mdadm:
>> 
>>       -c, --chunk=
>>              Specify chunk size of kibibytes.  The  default  when
>>              creating an array is 512KB.  To ensure compatibility
>>              with earlier versions, the default when Building and
>>              array  with no persistent metadata is 64KB.  This is
>>              only meaningful for RAID0, RAID4, RAID5, RAID6,  and
>>              RAID10.
>> 
>> So, actually there's a mismatch between the defaults of mdadm and
>> mkfs.xfs. Maybe it's worthwhile to think of raising the log stripe
>> maximum size to at least 512 kiB? I don't know what implications
>> this could have, though...
> 
> You can't, simple as that. The maximum supported is 256k. As it is,
> a default chunk size of 512k is probably harmful to most workloads -
> large chunk sizes mean that just about every write will trigger a
> RMW cycle in the RAID because it is pretty much impossible to issue
> full stripe writes. Writeback doesn't do any alignment of IO (the
> generic page cache writeback path is the problem here), so we will
> almost always be doing unaligned IO to the RAID, and there will be
> little opportunity for sequential IOs to merge and form full stripe
> writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).
> 
> IOWs, every time you do a small isolated write, the MD RAID volume
> will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> Given that most workloads are not doing lots and lots of large
> sequential writes this is, IMO, a pretty bad default given typical
> RAID5/6 volume configurations we see....
> 
> Without the warning, nobody would have noticed this. I think the
> warning has value - even if it is just to indicate MD now uses a
> bad default value for common workloads..

Seconded. But I think the warning, as it is, can confuse the user - like me. ;)

Maybe you can add a URL to this warning message and point it to a detailed explanation: 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Q: mkfs.xfs states log stripe unit is too large

A: On RAID devices created with mdadm and a 1.2 format superblock,
the default chunk size is 512 kiB. When creating a filesystem with 
mkfs.xfs on top of such a device, mkfs.xfs will use the chunk size
of the underlying RAID device to set some parameters of the file-
system, e.g. log stripe size. XFS is limited to 256 kiB of log stripe
size, so mkfs.xfs falls back to its default value of 32 kiB when
it can't use larger values from underlying chunk sizes. This is, in
general, a good decision for your filesystem.  

Best thing in general is to align all log writes to the
underlying stripe unit of the array. That way as multiple frequent
log writes occur, it is guaranteed to form full stripe writes and
basically have no RMW overhead. 32k is chosen by default because
that's the default log buffer size and hence the typical size of
log writes.

If you increase the log stripe unit, you also increase the minimum
log buffer size that the filesystem supports. The filesystem can
support up to 256k log buffers, and hence the limit on maximum log
stripe alignment.

The maximum supported log stripe size in XFS is 256k. As it is,
a default chunk size of 512k is probably harmful to most workloads -
large chunk sizes mean that just about every write will trigger a
RMW cycle in the RAID because it is pretty much impossible to issue
full stripe writes. Writeback doesn't do any alignment of IO (the
generic page cache writeback path is the problem here), so we will
almost always be doing unaligned IO to the RAID, and there will be
little opportunity for sequential IOs to merge and form full stripe
writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).

IOWs, every time you do a small isolated write, the MD RAID volume
will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
Given that most workloads are not doing lots and lots of large
sequential writes this is, IMO, a pretty bad default given typical
RAID5/6 volume configurations we see....

When benchmarking mdraid chunk sizes, 32kb for XFS was a clear winner;
anything larger decreased performance.
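
Possible workarounds if the defaults don't fit your workload: create the
array with a smaller chunk size (for example mdadm --create ... --chunk=64),
or set the log stripe unit explicitly at mkfs time (for example
mkfs.xfs -l su=32k ...). Both values are examples, not tuned recommendations.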
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

As you can see, I've collected some answers from Dave and Chris that helped me to understand the issue and the implications of log stripe size. 

I would welcome a FAQ entry and a URL to it included in the already existing warning message. Regardless of whether you do so, I've blogged today about this issue and the "solution": http://blog.windfluechter.net/content/blog/2012/06/26/1475-confusion-about-mkfsxfs-and-log-stripe-size-being-too-big
Maybe this helps other people to not come up with the same question... :-) 

Many thanks to all who helped me to understand this "issue"! :-) 

-- 
Ciao...            //      Fon: 0381-2744150
      Ingo       \X/       http://blog.windfluechter.net


gpg pubkey:  http://www.juergensmann.de/ij_public_key.asc

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-26  2:30         ` Dave Chinner
  2012-06-26  8:02             ` Christoph Hellwig
  2012-06-26 19:34           ` Ingo Jürgensmann
@ 2012-06-27  2:06           ` Eric Sandeen
  2 siblings, 0 replies; 24+ messages in thread
From: Eric Sandeen @ 2012-06-27  2:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ingo Jürgensmann, xfs

On 6/25/12 10:30 PM, Dave Chinner wrote:

...

> Without the warning, nobody would have noticed this. I think the
> warning has value - even if it is just to indicate MD now uses a
> bad default value for common workloads..

Fair enough.

log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB

It just tweaked me a little to complain about something the user didn't specify, but thinking about it from the perspective of letting the user know that the _device_ has a stripe unit larger than xfs can handle makes sense.

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-06-26  8:02             ` Christoph Hellwig
  (?)
@ 2012-07-02  6:18             ` Christoph Hellwig
  2012-07-02  6:41                 ` NeilBrown
  -1 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2012-07-02  6:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Ingo Jürgensmann, xfs, linux-raid

Ping to Neil / the raid list.

On Tue, Jun 26, 2012 at 04:02:17AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote:
> > You can't, simple as that. The maximum supported is 256k. As it is,
> > a default chunk size of 512k is probably harmful to most workloads -
> > large chunk sizes mean that just about every write will trigger a
> > RMW cycle in the RAID because it is pretty much impossible to issue
> > full stripe writes. Writeback doesn't do any alignment of IO (the
> > generic page cache writeback path is the problem here), so we will
> > almost always be doing unaligned IO to the RAID, and there will be
> > little opportunity for sequential IOs to merge and form full stripe
> > writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).
> > 
> > IOWs, every time you do a small isolated write, the MD RAID volume
> > will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> > Given that most workloads are not doing lots and lots of large
> > sequential writes this is, IMO, a pretty bad default given typical
> > RAID5/6 volume configurations we see....
> 
> Not too long ago I benchmarked out mdraid stripe sizes, and at least
> for XFS 32kb was a clear winner, anything larger decreased performance.
> 
> ext4 didn't get hit that badly with larger stripe sizes, probably
> because they still internally bump the writeback size like crazy, but
> they did not actually get faster with larger stripes either.
> 
> This was streaming data heavy workloads, anything more metadata heavy
> probably will suffer from larger stripes even more.
> 
> Ccing the linux-raid list if there actually is any reason for these
> defaults, something I wanted to ask for a long time but never really got
> back to.
> 
> Also I'm pretty sure back then the md default was 256kb writes, not 512
> so it seems the defaults further increased.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-07-02  6:18             ` Christoph Hellwig
@ 2012-07-02  6:41                 ` NeilBrown
  0 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2012-07-02  6:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, Ingo Jürgensmann, xfs, linux-raid

[-- Attachment #1: Type: text/plain, Size: 3438 bytes --]

On Mon, 2 Jul 2012 02:18:27 -0400 Christoph Hellwig <hch@infradead.org> wrote:

> Ping to Neil / the raid list.

Thanks for the reminder.

> 
> On Tue, Jun 26, 2012 at 04:02:17AM -0400, Christoph Hellwig wrote:
> > On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote:
> > > You can't, simple as that. The maximum supported is 256k. As it is,
> > > a default chunk size of 512k is probably harmful to most workloads -
> > > large chunk sizes mean that just about every write will trigger a
> > > RMW cycle in the RAID because it is pretty much impossible to issue
> > > full stripe writes. Writeback doesn't do any alignment of IO (the
> > > generic page cache writeback path is the problem here), so we will
> > > almost always be doing unaligned IO to the RAID, and there will be
> > > little opportunity for sequential IOs to merge and form full stripe
> > > writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).
> > > 
> > > IOWs, every time you do a small isolated write, the MD RAID volume
> > > will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> > > Given that most workloads are not doing lots and lots of large
> > > sequential writes this is, IMO, a pretty bad default given typical
> > > RAID5/6 volume configurations we see....
> > 
> > Not too long ago I benchmarked out mdraid stripe sizes, and at least
> > for XFS 32kb was a clear winner, anything larger decreased performance.
> > 
> > ext4 didn't get hit that badly with larger stripe sizes, probably
> > because they still internally bump the writeback size like crazy, but
> > they did not actually get faster with larger stripes either.
> > 
> > This was streaming data heavy workloads, anything more metadata heavy
> > probably will suffer from larger stripes even more.
> > 
> > Ccing the linux-raid list if there actually is any reason for these
> > defaults, something I wanted to ask for a long time but never really got
> > back to.
> > 
> > Also I'm pretty sure back then the md default was 256kb writes, not 512
> > so it seems the defaults further increased.

"originally" the default chunksize was 64K.
It was changed in late 2009 to 512K - this first appeared in mdadm 3.1.1

I don't recall the details of why it was changed but I'm fairly sure that
it was based on measurements that I had made and measurements that others had
made.  I suspect the tests were largely run on ext3.

I don't think there is anything close to a truly optimal chunk size.  What
works best really depends on your hardware, your filesystem, and your
workload.

If 512K is always suboptimal for XFS then that is unfortunate but I don't
think it is really possible to choose a default that everyone will be happy
with.  Maybe we just need more documentation and warning emitted by various
tools.  Maybe mkfs.xfs could augment the "stripe unit too large" message with
some text about choosing a smaller chunk size?

NeilBrown



> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ---end quoted text---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: mkfs.xfs states log stripe unit is too large
  2012-07-02  6:41                 ` NeilBrown
@ 2012-07-02  8:08                   ` Dave Chinner
  -1 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2012-07-02  8:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Ingo Jürgensmann, xfs, linux-raid

On Mon, Jul 02, 2012 at 04:41:13PM +1000, NeilBrown wrote:
> On Mon, 2 Jul 2012 02:18:27 -0400 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > Ping to Neil / the raid list.
> 
> Thanks for the reminder.
> 
> > 
> > On Tue, Jun 26, 2012 at 04:02:17AM -0400, Christoph Hellwig wrote:
> > > On Tue, Jun 26, 2012 at 12:30:59PM +1000, Dave Chinner wrote:
> > > > You can't, simple as that. The maximum supported is 256k. As it is,
> > > > a default chunk size of 512k is probably harmful to most workloads -
> > > > large chunk sizes mean that just about every write will trigger a
> > > > RMW cycle in the RAID because it is pretty much impossible to issue
> > > > full stripe writes. Writeback doesn't do any alignment of IO (the
> > > > generic page cache writeback path is the problem here), so we will
> > > > almost always be doing unaligned IO to the RAID, and there will be
> > > > little opportunity for sequential IOs to merge and form full stripe
> > > > writes (24 disks @ 512k each on RAID6 is an 11MB full stripe write).
> > > > 
> > > > IOWs, every time you do a small isolated write, the MD RAID volume
> > > > will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> > > > Given that most workloads are not doing lots and lots of large
> > > > sequential writes this is, IMO, a pretty bad default given typical
> > > > RAID5/6 volume configurations we see....
> > > 
> > > Not too long ago I benchmarked out mdraid stripe sizes, and at least
> > > for XFS 32kb was a clear winner, anything larger decreased performance.
> > > 
> > > ext4 didn't get hit that badly with larger stripe sizes, probably
> > > because they still internally bump the writeback size like crazy, but
> > > they did not actually get faster with larger stripes either.
> > > 
> > > This was streaming data heavy workloads, anything more metadata heavy
> > > probably will suffer from larger stripes even more.
> > > 
> > > Ccing the linux-raid list if there actually is any reason for these
> > > defaults, something I wanted to ask for a long time but never really got
> > > back to.
> > > 
> > > Also I'm pretty sure back then the md default was 256kb writes, not 512
> > > so it seems the defaults further increased.
> 
> "originally" the default chunksize was 64K.
> It was changed in late 2009 to 512K - this first appeared in mdadm 3.1.1
> 
> I don't recall the details of why it was changed but I'm fairly sure that
> it was based on measurements that I had made and measurements that others had
> made.  I suspect the tests were largely run on ext3.
> 
> I don't think there is anything close to a truly optimal chunk size.  What
> works best really depends on your hardware, your filesystem, and your work
> load. 

That's true, but the characteristics of spinning disks have not
changed in the past 20 years, nor have the typical file size
distributions in filesystems, nor have the RAID5/6 algorithms. So
it's not really clear to me why you'd even consider changing
the default; the downsides of large chunk sizes on RAID5/6 volumes are
well known. This may well explain the apparent increase in "XFS has
hung but it's really just waiting for lots of really slow IO on MD"
cases I've seen over the past couple of years.
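
A hedged way to tell the two apart while the filesystem looks stuck: reads
showing up on the member disks during a write-only workload are the RMW
signature (a sketch only; iostat is from the sysstat package):

# per-device utilisation, await and read/write throughput, once a second
iostat -x 1
# resync/recovery activity on the array would also show up here
cat /proc/mdstat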

The only time I'd ever consider stripe -widths- of more than 512k or
1MB with RAID5/6 is if I knew my workload is almost exclusively
using large files and sequential access with little metadata load,
and there are relatively few workloads where that is the case.
Typically those workloads measure throughput in GB/s and everyone
uses hardware RAID for them because MD simply doesn't scale to this
sort of usage.
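
For reference, stripe width is just the chunk size times the number of data
disks, and for those large-file streaming setups the geometry can be spelled
out to mkfs.xfs explicitly (a sketch with made-up values for a 24-disk RAID6
with 512KiB chunks; /dev/mdX is a placeholder):

# su = chunk size, sw = number of data disks (24 - 2 parity = 22)
mkfs.xfs -d su=512k,sw=22 /dev/mdX
# equivalent form: sunit/swidth are given in 512-byte sectors
mkfs.xfs -d sunit=1024,swidth=22528 /dev/mdX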

> If 512K is always suboptimal for XFS then that is unfortunate but I don't

I think 512k chunk sizes are suboptimal for most users, regardless
of the filesystem or workload....

> think it is really possible to choose a default that everyone will be happy
> with.  Maybe we just need more documentation and warning emitted by various
> tools.  Maybe mkfs.xfs could augment the "stripe unit too large" message with
> some text about choosing a smaller chunk size?

We work to the mantra that XFS should always choose the defaults
that give the best overall performance and aging characteristics so
users don't need to be storage experts to get the best the
filesystem can offer. The XFS warning is there to indicate that the
user might be doing something wrong. If that's being emitted with a
default MD configuration, then that indicates that the MD defaults
need to be revised....
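
A low-effort way to see what geometry mkfs.xfs would pick for a given device,
without writing anything, is a dry run (sketch; /dev/md7 stands in for
whatever the array is):

# -N prints the parameters mkfs.xfs would use without creating the filesystem
mkfs.xfs -N /dev/md7
# on a mounted filesystem the chosen sunit/swidth can be read back with
xfs_info /mount/point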

If you know what a stripe unit or chunk size is, then you know how
to deal with the problem. But for the majority of people, that's way
more knowledge than they are prepared to learn about or should be
forced to learn about.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Re: mkfs.xfs states log stripe unit is too large
  2012-07-02  8:08                   ` Dave Chinner
@ 2012-07-09 12:02                     ` kedacomkernel
  -1 siblings, 0 replies; 24+ messages in thread
From: kedacomkernel @ 2012-07-09 12:02 UTC (permalink / raw)
  To: Dave Chinner, Neil Brown
  Cc: Christoph Hellwig, Ingo Jürgensmann, xfs, linux-raid

On 2012-07-02 16:08 Dave Chinner <david@fromorbit.com> Wrote:
>On Mon, Jul 02, 2012 at 04:41:13PM +1000, NeilBrown wrote:
>> On Mon, 2 Jul 2012 02:18:27 -0400 Christoph Hellwig <hch@infradead.org> wrote:
>> 
>> > Ping to Neil / the raid list.
>> 
>> Thanks for the reminder.
>> 
>> > 
[snip]
>
>That's true, but the characteristics of spinning disks have not
>changed in the past 20 years, nor have the typical file size
>distributions in filesystems, nor have the RAID5/6 algorithms. So
>it's not really clear to me why you'd even consider changing
>the default; the downsides of large chunk sizes on RAID5/6 volumes are
>well known. This may well explain the apparent increase in "XFS has
>hung but it's really just waiting for lots of really slow IO on MD"
>cases I've seen over the past couple of years.
>
At present, cat /sys/block/sdb/queue/max_sectors_kb
reports 512, i.e. 512KB per request. Maybe that is why.
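
That limit is easy to inspect and, within the hardware ceiling, to raise
(a sketch; sdb is just an example device):

# largest single request the block layer will issue, in KiB
cat /sys/block/sdb/queue/max_sectors_kb
# hard ceiling imposed by the hardware/driver
cat /sys/block/sdb/queue/max_hw_sectors_kb
# raise the soft limit at runtime (must stay <= the hardware ceiling)
echo 1024 > /sys/block/sdb/queue/max_sectors_kb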

>The only time I'd ever consider stripe -widths- of more than 512k or
>1MB with RAID5/6 is if I knew my workload is almost exclusively
>using large files and sequential access with little metadata load,
>and there's relatively few workloads where that is the case.
>Typically those workloads measure throughput in GB/s and everyone
>uses hardware RAID for them because MD simply doesn't scale to this
>sort of usage.
>
>> If 512K is always suboptimal for XFS then that is unfortunate but I don't
>
>I think 512k chunk sizes are suboptimal for most users, regardless
>of the filesystem or workload....
>
>> think it is really possible to choose a default that everyone will be happy
>> with.  Maybe we just need more documentation and warning emitted by various
>> tools.  Maybe mkfs.xfs could augment the "stripe unit too large" message with
>> some text about choosing a smaller chunk size?
>
>We work to the mantra that XFS should always choose the defaults
>that give the best overall performance and aging characteristics so
>users don't need to be a storage expert to get the best the
>filesystem can offer. The XFS warning is there to indicate that the
>user might be doing something wrong. If that's being emitted with a
>default MD configuration, then that indicates that the MD defaults
>need to be revised....
>
>If you know what a stripe unit or chunk size is, then you know how
>to deal with the problem. But for the majority of people, that's way
>more knowledge than they are prepared to learn about or should be
>forced to learn about.
>
>Cheers,
>
>Dave.
>-- 
>Dave Chinner
>david@fromorbit.com
>--
>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2012-07-09 12:02 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-23 12:50 mkfs.xfs states log stripe unit is too large Ingo Jürgensmann
2012-06-23 23:44 ` Dave Chinner
2012-06-24  2:20   ` Eric Sandeen
2012-06-24 13:05     ` Stan Hoeppner
2012-06-24 13:17       ` Ingo Jürgensmann
2012-06-24 19:28         ` Stan Hoeppner
2012-06-24 19:51           ` Ingo Jürgensmann
2012-06-24 22:15             ` Stan Hoeppner
2012-06-25  5:25               ` Ingo Jürgensmann
     [not found]                 ` <4FE8CEED.7070505@hardwarefreak.com>
2012-06-25 21:18                   ` Ingo Jürgensmann
2012-06-24 15:03       ` Ingo Jürgensmann
2012-06-26  2:30         ` Dave Chinner
2012-06-26  8:02           ` Christoph Hellwig
2012-06-26  8:02             ` Christoph Hellwig
2012-07-02  6:18             ` Christoph Hellwig
2012-07-02  6:41               ` NeilBrown
2012-07-02  6:41                 ` NeilBrown
2012-07-02  8:08                 ` Dave Chinner
2012-07-02  8:08                   ` Dave Chinner
2012-07-09 12:02                   ` kedacomkernel
2012-07-09 12:02                     ` kedacomkernel
2012-06-26 19:34           ` Ingo Jürgensmann
2012-06-27  2:06           ` Eric Sandeen
2012-06-25 10:33   ` Ingo Jürgensmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.