* raid10n2/xfs setup guidance on write-cache/barrier
@ 2012-03-15  0:30 Jessie Evangelista
  2012-03-15  5:38 ` Stan Hoeppner
  2012-03-17 22:10 ` Zdenek Kaspar
  0 siblings, 2 replies; 65+ messages in thread
From: Jessie Evangelista @ 2012-03-15  0:30 UTC (permalink / raw)
  To: linux-raid

I want to create a raid10,n2 using 3 1TB SATA drives.
I want to create an xfs filesystem on top of it.
The filesystem will be used as NFS/Samba storage.

mdadm --zero /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean
--level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1
/dev/sdd1
mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0
mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0
/mnt/raid10xfs

Will my files be safe even on sudden power loss? Is barrier=1 enough?
Do I need to disable the write cache, e.g. with:
hdparm -W0 /dev/sdb /dev/sdc /dev/sdd

I tried it but performance is horrendous.

Am I better off with ext4? Data safety/integrity is the priority, and
any optimization that compromises it is not acceptable.

Thanks and any advice/guidance would be appreciated

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15  0:30 raid10n2/xfs setup guidance on write-cache/barrier Jessie Evangelista
@ 2012-03-15  5:38 ` Stan Hoeppner
  2012-03-15 12:06   ` Jessie Evangelista
  2012-03-17 22:10 ` Zdenek Kaspar
  1 sibling, 1 reply; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-15  5:38 UTC (permalink / raw)
  To: Jessie Evangelista; +Cc: linux-raid

On 3/14/2012 7:30 PM, Jessie Evangelista wrote:
> I want to create a raid10,n2 using 3 1TB SATA drives.
> I want to create an xfs filesystem on top of it.
> The filesystem will be used as NFS/Samba storage.
> 
> mdadm --zero /dev/sdb1 /dev/sdc1 /dev/sdd1
> mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean
> --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1
> /dev/sdd1

Why 256KB for chunk size?


Looks like you've been reading a very outdated/inaccurate "XFS guide" on
the web...

What kernel version?  This can make a significant difference in XFS
metadata performance.  You should use 2.6.39+ if possible.  What
xfsprogs version?

> mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0

lazy-count=1 is currently the default with recent xfsprogs so no need to
specify it.  Why are you manually specifying the size of the internal
journal log file?  This is unnecessary.  In fact, unless you have
profiled your workload and testing shows that alternate XFS settings
perform better, it is always best to stick with the defaults.  They
exist for a reason, and are well considered.

> mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0
> /mnt/raid10xfs

Barrier has no value, it's either on or off.  XFS mounts with barriers
enabled by default so remove 'barrier=1'.  You do not have a RAID card
with persistent write cache (BBWC), so you should leave barriers
enabled.  Barriers mitigate journal log corruption due to power failure
and crashes, which seem to be of concern to you.

logbsize=256k and logbufs=8 are the defaults in recent kernels, so no
need to specify them.  Your NFS/Samba workload on 3 slow disks isn't
sufficient to need that much in-memory journal buffer space anyway.  XFS
uses relatime by default, which is equivalent to noatime WRT IO
reduction, so don't specify 'noatime'.

In fact, it appears you don't need to specify anything in mkfs.xfs or
fstab, but just use the defaults.  Fancy that.  And the one thing that
might actually increase your performance a little bit you didn't
specify--sunit/swidth.  However, since you're using mdraid, mkfs.xfs
will calculate these for you (which is nice as mdraid10 with odd disk
count can be a tricky calculation).  Again, defaults work for a reason.
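
For example, something along these lines should show you what mkfs.xfs
picked up on its own (a rough sketch, with the device and mount point
taken from your commands above):

mkfs -t xfs /dev/md0              # all defaults, geometry read from md
mount /dev/md0 /mnt/raid10xfs
xfs_info /mnt/raid10xfs           # sunit/swidth are reported in the data section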

> Will my files be safe even on sudden power loss?

Are you unwilling to purchase a UPS and implement shutdown scripts?  If
so you have no business running a server, frankly.  Any system will lose
data due to power loss, it's just a matter of how much based on the
quantity of in-flight writes at the time the juice dies.  This problem
is mostly filesystem independent.  Application write behavior does play a
role.  UPS with shutdown scripts, and persistent write cache prevent
this problem.  A cheap UPS suitable for this purpose is less money than
a 1TB 7.2k drive, currently.

You say this is an NFS/Samba server.  That would imply that multiple
people or other systems directly rely on it.  Implement a good UPS
solution and eliminate this potential problem.

> Is barrier=1 enough?
> Do i need to disable the write cache?
> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd

Disabling drive write caches does decrease the likelihood of data loss.

> I tried it but performance is horrendous.

And this is why you should leave them enabled and use barriers.  Better
yet, use a RAID card with BBWC and disable the drive caches.
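
A quick sanity check of the current state looks something like this
(a sketch only, device names taken from your mail; the exact dmesg
wording varies by kernel):

hdparm -W /dev/sdb /dev/sdc /dev/sdd    # -W with no value just reports write-cache state
hdparm -W1 /dev/sdb /dev/sdc /dev/sdd   # re-enable the caches if you had turned them off
dmesg | grep -i barrier                 # any "Disabling barriers" warning means they are not in effect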

> Am I better off with ext4? Data safety/integrity is the priority and
> optimization affecting it is not acceptable.

You're better off using a UPS.  Filesystem makes little difference WRT
data safety/integrity.  All will suffer some damage if you throw a
grenade at them.  So don't throw grenades.  Speaking of which, what is
your backup/restore procedure/hardware for this array?

> Thanks and any advice/guidance would be appreciated

I'll appreciate your response stating "Yes, I have a UPS and
tested/working shutdown scripts" or "I'll be implementing a UPS very
soon." :)

-- 
Stan


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15  5:38 ` Stan Hoeppner
@ 2012-03-15 12:06   ` Jessie Evangelista
  2012-03-15 14:07       ` Peter Grandi
  2012-03-16 12:25     ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner
  0 siblings, 2 replies; 65+ messages in thread
From: Jessie Evangelista @ 2012-03-15 12:06 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Thu, Mar 15, 2012 at 1:38 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 3/14/2012 7:30 PM, Jessie Evangelista wrote:
>> I want to create a raid10,n2 using 3 1TB SATA drives.
>> I want to create an xfs filesystem on top of it.
>> The filesystem will be used as NFS/Samba storage.
>>
>> mdadm --zero /dev/sdb1 /dev/sdc1 /dev/sdd1
>> mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean
>> --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1
>> /dev/sdd1
>
> Why 256KB for chunk size?
>
For reference, the machine has 16GB of memory.

I've run some benchmarks with dd trying different chunk sizes, and 256k
seems like the sweet spot.
dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct
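
Roughly what the benchmark amounted to (a sketch, not the exact
commands; --create is destructive, so only run this on empty disks):

for chunk in 64 128 256 512; do
    mdadm --stop /dev/md0
    mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean --level=raid10 \
          --chunk=$chunk --raid-devices=3 --run /dev/sdb1 /dev/sdc1 /dev/sdd1
    dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct
done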

>
> Looks like you've been reading a very outdated/inaccurate "XFS guide" on
> the web...
>
> What kernel version?  This can make a significant difference in XFS
> metadata performance.  You should use 2.6.39+ if possible.  What
> xfsprogs version?
>

Testing was done on Ubuntu 10.04 LTS with kernel 2.6.32-33-server and
xfsprogs 3.1.0ubuntu1.

>> mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0
>
> lazy-count=1 is currently the default with recent xfsprogs so no need to
> specify it.  Why are you manually specifying the size of the internal
> journal log file?  This is unnecessary.  In fact, unless you have
> profiled your workload and testing shows that alternate XFS settings
> perform better, it is always best to stick with the defaults.  They
> exist for a reason, and are well considered.

I'll probably forgo setting the journal log file size. It seemed like
a safe optimization from what I've read.

>> mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0
>> /mnt/raid10xfs
>
> Barrier has no value, it's either on or off.  XFS mounts with barriers
> enabled by default so remove 'barrier=1'.  You do not have a RAID card
> with persistent write cache (BBWC), so you should leave barriers
> enabled.  Barriers mitigate journal log corruption due to power failure
> and crashes, which seem to be of concern to you.
>
> logbsize=256k and logbufs=8 are the defaults in recent kernels so no
> need to specify them.  Your NFS/Samba workload on 3 slow disks isn't
> sufficient to need that much in memory journal buffer space anyway.  XFS
> uses relatime which is equivalent to noatime WRT IO reduction
> performance, so don't specify 'noatime'.

I just wanted to be explicit about it so that I know what is set, just
in case the defaults change.

>
> In fact, it appears you don't need to specify anything in mkfs.xfs or
> fstab, but just use the defaults.  Fancy that.  And the one thing that
> might actually increase your performance a little bit you didn't
> specify--sunit/swidth.  However, since you're using mdraid, mkfs.xfs
> will calculate these for you (which is nice as mdraid10 with odd disk
> count can be a tricky calculation).  Again, defaults work for a reason.
>
The reason I did not set sunit/swidth is that I read somewhere that
mkfs.xfs will calculate them based on the mdraid geometry.

>> Will my files be safe even on sudden power loss?
>
> Are you unwilling to purchase a UPS and implement shutdown scripts?  If
> so you have no business running a server, frankly.  Any system will lose
> data due to power loss, it's just a matter of how much based on the
> quantity of inflight writes at the time the juice dies.  This problem is
> mostly filesystem independent.  Application write behavior does play a
> role.  UPS with shutdown scripts, and persistent write cache prevent
> this problem.  A cheap UPS suitable for this purpose is less money than
> a 1TB 7.2k drive, currently.
>

The server is for a non-profit org that I am helping out.
I think an APC Smart-UPS SC 420VA 230V may fit their shoestring budget.

> You say this is an NFS/Samba server.  That would imply that multiple
> people or other systems directly rely on it.  Implement a good UPS
> solution and eliminate this potential problem.
>
>> Is barrier=1 enough?
>> Do i need to disable the write cache?
>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd
>
> Disabling drive write caches does decrease the likelihood of data loss.
>
>> I tried it but performance is horrendous.
>
> And this is why you should leave them enabled and use barriers.  Better
> yet, use a RAID card with BBWC and disable the drive caches.

The budget does not allow for a RAID card with BBWC.
>
>> Am I better off with ext4? Data safety/integrity is the priority and
>> optimization affecting it is not acceptable.
>
> You're better off using a UPS.  Filesystem makes little difference WRT
> data safety/integrity.  All will suffer some damage if you throw a
> grenade at them.  So don't throw grenades.  Speaking of which, what is
> your backup/restore procedure/hardware for this array?

Nightly backups will be stored on an external USB disk.
Is XFS going to be more prone to data loss if the non-redundant
power supply goes out?

>
>> Thanks and any advice/guidance would be appreciated
>
> I'll appreciate your response stating "Yes, I have a UPS and
> tested/working shutdown scripts" or "I'll be implementing a UPS very
> soon." :)

I don't have shutdown scripts yet but will look into it.
Meatware will have to do for now, as the server will probably be ON
only when there are people at the office. And yes, I will be asking them
not to go into production without a UPS.

>
> --
> Stan
>

Thanks for your input, Stan.

I just updated the kernel to 3.0.0-16.
Did they take out barrier support in mdraid, or was the implementation
replaced with FUA?
Is there a definitive test to determine whether off-the-shelf consumer
SATA drives honor barrier or cache flush requests?

I think I'd like to go with device cache turned ON and barrier enabled.

I am still torn between ext4 and XFS, i.e. which will be safer in this
particular setup.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 12:06   ` Jessie Evangelista
@ 2012-03-15 14:07       ` Peter Grandi
  2012-03-16 12:25     ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner
  1 sibling, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-15 14:07 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

>>> I want to create a raid10,n2 using 3 1TB SATA drives.
>>> I want to create an xfs filesystem on top of it. The
>>> filesystem will be used as NFS/Samba storage.

Consider also an 'o2' layout (it is probably the same thing for a
3-drive RAID10) or even RAID5, as 3 drives and this usage seem to be
one of the few cases where RAID5 may be plausible.

> [ ... ] I've run some benchmarks with dd trying the different
> chunks and 256k seems like the sweetspot.  dd if=/dev/zero
> of=/dev/md0 bs=64k count=655360 oflag=direct

That's for bulk sequential transfers. Random-ish access, as in a
fileserver perhaps with many smaller files, may not be the same,
but probably larger chunks are good.

>> [ ... ] What kernel version?  This can make a significant
>> difference in XFS metadata performance.

As an aside, that's a myth that has been propagandized by DaveC
in his entertaining presentation not long ago.

There have been decent but no major improvements in XFS metadata
*performance*, but weaker implicit *semantics* have been made an
option, and these have a different safety/performance tradeoff
(less implicit safety, somewhat more performance), not "just"
better performance.

http://lwn.net/Articles/476267/
 «In other words, instead of there only being a maximum of 2MB of
  transaction changes not written to the log at any point in time,
  there may be a much greater amount being accumulated in memory.

  Hence the potential for loss of metadata on a crash is much
  greater than for the existing logging mechanism.

  It should be noted that this does not change the guarantee that
  log recovery will result in a consistent filesystem.

  What it does mean is that as far as the recovered filesystem is
  concerned, there may be many thousands of transactions that
  simply did not occur as a result of the crash.

  This makes it even more important that applications that care
  about their data use fsync() where they need to ensure
  application level data integrity is maintained.»

>>  Your NFS/Samba workload on 3 slow disks isn't sufficient to
>> need that much in memory journal buffer space anyway.

That's probably true, but does no harm.

>>  XFS uses relatime which is equivalent to noatime WRT IO
>> reduction performance, so don't specify 'noatime'.

Uhm, not so sure, and 'noatime' does not hurt either.

> I just wanted to be explicit about it so that I know what is
> set just in case the defaults change

That's what I do as well, because relying on remembering exactly
what the defaults are can sometimes cause confusion. But it is a
matter of taste to a large degree, like 'noatime'.

>> In fact, it appears you don't need to specify anything in
>> mkfs.xfs or fstab, but just use the defaults.  Fancy that.

For NFS/Samba, especially with ACLs (SMB protocol), and
especially if one expects largish directories, I would in general
recommend a larger inode size, at least 1024B, if not even
2048B.

Also, as a rule I want to make sure that the sector size is set
to 4096B, for future proofing (and recent drives not only have
4096B sectors but usually lie about it).
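
As a concrete (purely illustrative, untested here) example of those two
settings on the array from the original post:

  mkfs.xfs -i size=1024 -s size=4096 /dev/md0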

>>  And the one thing that might actually increase your
>> performance a little bit you didn't specify--sunit/swidth.

Especially 'sunit', as XFS ideally would align metadata on chunk
boundaries.

>>  However, since you're using mdraid, mkfs.xfs will calculate
>> these for you (which is nice as mdraid10 with odd disk count
>> can be a tricky calculation).

Ambiguous more than tricky, and not very useful, except the chunk
size.

>>> Will my files be safe even on sudden power loss?

The answer is NO, if you mean "absolutely safe". But see the
discussion at the end.

>> [ ... ]  Application write behavior does play a role.

Indeed, see the discussion at the end and ways to mitigate.

>>  UPS with shutdown scripts, and persistent write cache prevent
>> this problem. [ ... ]

There is always the problem of system crashes that don't depend
on power....

>>> Is barrier=1 enough?  Do i need to disable the write cache?
>>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd

>> Disabling drive write caches does decrease the likelihood of
>> data loss.

>>> I tried it but performance is horrendous.

>> And this is why you should leave them enabled and use
>> barriers.  Better yet, use a RAID card with BBWC and disable
>> the drive caches.

> Budget does not allow for RAID card with BBWC

You'd be surprised by how cheap you can get one. But many HW host
adapters with builtin cache have bad performance or horrid bugs,
so you'd have to be careful.

In any case that's not the major problem you have.

>>> Am I better off with ext4? Data safety/integrity is the
>>> priority and optimization affecting it is not acceptable.

XFS is the filesystem of the future ;-). I would choose it over
'ext4' in every plausible case.

> nightly backups will be stored on an external USB disk

USB is an unreliable, buggy, and slow transport; eSATA is
enormously better and faster.

> is xfs going to be prone to more data loss in case the
> non-redundant power supply goes out?

That's the wrong question entirely. Data loss can happen for many
other reasons, and XFS is probably one of the safest designs, if
properly used and configured. The problems are elsewhere.

> I just updated the kernel to 3.0.0-16.  Did they take out
> barrier support in mdraid? or was the implementation replaced
> with FUA?  Is there a definitive test to determine if the off
> the shelf consumer sata drives honor barrier or cache flush
> requests?

Usually they do, but that's the least of your worries. Anyhow a
test that occurs to me is to write a known pattern to a file,
let's say 1GiB, then 'fsync', and as soon as 'fsync' completes,
power off. Then check whether the whole 1GiB is the known pattern.
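
As a rough sketch of that test (paths are illustrative; keep the
reference pattern on a different filesystem than the one under test):

  dd if=/dev/urandom of=/root/pattern.bin bs=1M count=1024
  md5sum /root/pattern.bin
  dd if=/root/pattern.bin of=/mnt/raid10xfs/testfile bs=1M conv=fsync
  # cut power the instant the second dd returns; after reboot and mount:
  md5sum /mnt/raid10xfs/testfile   # should match the checksum printed above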

> I think I'd like to go with device cache turned ON and barrier
> enabled.

That's how it is supposed to work.

As to general safety issues, there seems to be some misunderstanding,
and I'll try to be more explicit than the "lob the grenade" notion.

It matters a great deal what "safety" means in your mind and that
of your users. As a previous comment pointed out, that usually
involves backups, that is data that has already been stored.

But your insistence on power off and disk caches etc. seems to
indicate that "safety" in your mind means "when I click the
'Save' button it is really saved and not partially".

As to that, there are quite a lot of qualifiers:

  * Most users don't understand that even in the best scenario a
    file is really saved not when they *click* the 'Save' button,
    but when they get the "Saved!" message. In between anything
    can happen. Also, work in progress (not yet saved explicitly)
    is fair game.

  * "Really saved" is an *application* concern first and foremost.
    The application *must* say (via 'fsync') that it wants the
    data really saved. Unfortunately most applications don't do
    that because "really saved" is a very expensive operation, and
    usually systems don't crash, so the application writer looks
    like a genius if he has an "optimistic" attitude. If you do a
    web search look for various O_PONIES discussions. Some intros:

      http://lwn.net/Articles/351422/
      http://lwn.net/Articles/322823/

  * XFS (and to a point 'ext4') is designed for applications that
    work correctly and issue 'fsync' appropriately, and if they do
    it is very safe, because it tries hard to ensure that either
    'fsync' means "really saved" or you know that it does not. XFS
    takes advantage of the assumption that applications do the
    right thing to do various latency-based optimizations between
    calls to 'fsync'.

  * Unfortunately most GUI applications don't do the right thing,
    but fortunately you can compensate for that. The key here is
    to make sure that the flusher's parameters are set for rather
    more frequent flushing than the default, which is equivalent
    to issuing 'fsync' systemwide fairly frequently. Ideally set
    'vm/dirty_bytes' to something like 1-3 seconds of IO transfer
    rate (and, in reversal of some of my previous advice, leave
    'vm/dirty_background_bytes' at something quite large unless
    you *really* want safety), and shorten significantly
    'vm/dirty_expire_centisecs' and 'vm/dirty_writeback_centisecs'.
    This defeats some XFS optimizations, but that's inevitable.
    See the sketch after this list.

  * In any case you are using NFS/Samba, and that opens a much
    bigger set of issues, because caching happens on the clients
    too: http://www.sabi.co.uk/0707jul.html#070701b
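
A sketch of that flusher tuning (the numbers are only examples, sized
for very roughly 100MB/s of array throughput; scale them to your
hardware, and see the note above about 'vm/dirty_background_bytes'):

  sysctl -w vm.dirty_bytes=209715200          # ~2 seconds of writeback at 100MB/s
  sysctl -w vm.dirty_expire_centisecs=500     # expire dirty data after 5 seconds
  sysctl -w vm.dirty_writeback_centisecs=100  # wake the flusher every second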

Then Von Neumann help you if your users or you decide to store lots
of messages in MH/Maildir style mailstores, or VM images on
"growable" virtual disks.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 14:07       ` Peter Grandi
@ 2012-03-15 15:25         ` keld
  -1 siblings, 0 replies; 65+ messages in thread
From: keld @ 2012-03-15 15:25 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS

On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote:
> >>> I want to create a raid10,n2 using 3 1TB SATA drives.
> >>> I want to create an xfs filesystem on top of it. The
> >>> filesystem will be used as NFS/Samba storage.
> 
> Consider also an 'o2' layout (it is probably the same thing for a
> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems
> one of the few cases where RAID5 may be plausible.

Well, for a file server like NFS/Samba, you could also consider raid10,f2.
I would think you could get about double the read performance compared to the n2 and o2
layouts, and also for individual read transfers on a running system
you would get something like double the read performance.
Write performance could be somewhat slower (0 to 10 %), but as users
are not waiting for writes to complete, they will probably not notice.
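
Creating that layout is just a matter of the --layout parameter,
something like this (a sketch, device names taken from the original post):

mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=256 \
      --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1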
 
best regards
keld

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 14:07       ` Peter Grandi
@ 2012-03-15 16:18         ` Jessie Evangelista
  -1 siblings, 0 replies; 65+ messages in thread
From: Jessie Evangelista @ 2012-03-15 16:18 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

Hey Peter,

On Thu, Mar 15, 2012 at 10:07 PM, Peter Grandi <pg@lxra2.to.sabi.co.uk> wrote:
>>>> I want to create a raid10,n2 using 3 1TB SATA drives.
>>>> I want to create an xfs filesystem on top of it. The
>>>> filesystem will be used as NFS/Samba storage.
>
> Consider also an 'o2' layout (it is probably the same thing for a
> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems
> one of the few cases where RAID5 may be plausible.

Thanks for reminding me about raid5. I'll probably give it a try and
do some benchmarks. I'd also like to try raid10,f2.

>> [ ... ] I've run some benchmarks with dd trying the different
>> chunks and 256k seems like the sweetspot.  dd if=/dev/zero
>> of=/dev/md0 bs=64k count=655360 oflag=direct
>
> That's for bulk sequential transfers. Random-ish, as in a
> fileserver perhaps with many smaller files, may not be the same,
> but probably larger chunks are good.
>>> [ ... ] What kernel version?  This can make a significant
>>> difference in XFS metadata performance.
>
> As an aside, that's a myth that has been propagandaized by DaveC
> in his entertaining presentation not long ago.
>
> There have been decent but no major improvements in XFS metadata
> *performance*, but weaker implicit *semantics* have been made an
> option, and these have a different safety/performance tradeoff
> (less implicit safety, somewhat more performance), not "just"
> better performance.
>
> http://lwn.net/Articles/476267/
>  «In other words, instead of there only being a maximum of 2MB of
>  transaction changes not written to the log at any point in time,
>  there may be a much greater amount being accumulated in memory.
>
>  Hence the potential for loss of metadata on a crash is much
>  greater than for the existing logging mechanism.
>
>  It should be noted that this does not change the guarantee that
>  log recovery will result in a consistent filesystem.
>
>  What it does mean is that as far as the recovered filesystem is
>  concerned, there may be many thousands of transactions that
>  simply did not occur as a result of the crash.
>
>  This makes it even more important that applications that care
>  about their data use fsync() where they need to ensure
>  application level data integrity is maintained.»
>
>>>  Your NFS/Samba workload on 3 slow disks isn't sufficient to
>>> need that much in memory journal buffer space anyway.
>
> That's probably true, but does no harm.
>
>>>  XFS uses relatime which is equivalent to noatime WRT IO
>>> reduction performance, so don't specify 'noatime'.
>
> Uhm, not so sure, and 'noatime' does not hurt either.
>
>> I just wanted to be explicit about it so that I know what is
>> set just in case the defaults change
>
> That's what I do as well, because relying on remembering exactly
> what the defaults are can cause sometimes confusion. But it is a
> matter of taste to a large degree, like 'noatime'.
>
>>> In fact, it appears you don't need to specify anything in
>>> mkfs.xfs or fstab, but just use the defaults.  Fancy that.
>
> For NFS/Samba, especially with ACLs (SMB protocol), and
> especially if one expects largish directories, and in general I
> would recommend a larger inode size, at least 1024B, if not even
> 2048B.

Thanks for this tip. I will look into adjusting the inode size.

>
> Also, as a rule I want to make sure that the sector size is set
> to 4096B, for future proofing (and recent drives not only have
> 4096B sectors but usually lie).
>

It seems the 1TB drives that I have still have 512-byte sectors.

>>>  And the one thing that might actually increase your
>>> performance a little bit you didn't specify--sunit/swidth.
>
> Especially 'sunit', as XFS ideally would align metadata on chunk
> boundaries.
>
>>>  However, since you're using mdraid, mkfs.xfs will calculate
>>> these for you (which is nice as mdraid10 with odd disk count
>>> can be a tricky calculation).
>
> Ambiguous more than tricky, and not very useful, except the chunk
> size.
>
>>>> Will my files be safe even on sudden power loss?
>
> The answer is NO, if you mean "absolutely safe". But see the
> discussion at the end.
>
>>> [ ... ]  Application write behavior does play a role.
>
> Indeed, see the discussion at the end and ways to mitigate.
>
>>>  UPS with shutdown scripts, and persistent write cache prevent
>>> this problem. [ ... ]
>
> There is always the problem of system crashes that don't depend
> on power....
>
>>>> Is barrier=1 enough?  Do i need to disable the write cache?
>>>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd
>
>>> Disabling drive write caches does decrease the likelihood of
>>> data loss.
>
>>>> I tried it but performance is horrendous.
>
>>> And this is why you should leave them enabled and use
>>> barriers.  Better yet, use a RAID card with BBWC and disable
>>> the drive caches.
>
>> Budget does not allow for RAID card with BBWC
>
> You'd be surprised by how cheap you can get one. But many HW host
> adapters with builtin cache have bad performance or horrid bugs,
> so you'd have to be careful.

Could you please suggest a hardware RAID card with BBU that's cheap?

>
> In any case that's not the major problem you have.
>
>>>> Am I better off with ext4? Data safety/integrity is the
>>>> priority and optimization affecting it is not acceptable.
>
> XFS is the filesystem of the future ;-). I would choose it over
> 'ext4' in every plausible case.
>
>> nightly backups will be stored on an external USB disk
>
> USB is an unreliable, buggy transport, and slow, eSATA is
> enormously better and faster.
>
>> is xfs going to be prone to more data loss in case the
>> non-redundant power supply goes out?
>
> That's the wrong question entirely. Data loss can happen for many
> other reasons, and XFS is probably one of the safest designs, if
> properly used and configured. The problems are elsewhere.

Can you please elaborate on how XFS can be properly used and configured?
>
>> I just updated the kernel to 3.0.0-16.  Did they take out
>> barrier support in mdraid? or was the implementation replaced
>> with FUA?  Is there a definitive test to determine if the off
>> the shelf consumer sata drives honor barrier or cache flush
>> requests?
>
> Usually they do, but that's the least of your worries. Anyhow a
> test that occurs to me is to write a know pattern to a file,
> let's say 1GiB, then 'fsync', and as soon as 'fsync' completes,
> power off. Then check whether the whole 1GiB is the known pattern.
>
>> I think I'd like to go with device cache turned ON and barrier
>> enabled.
>
> That's how it is supposed to work.
>
> As to general safety issues, there seem to be some misunderstanding,
> and I'll try to be more explicit than "lob the grenade" notion.
>
> It matters a great deal what "safety" means in your mind and that
> of your users. As a previous comment pointed out, that usually
> involves backups, that is data that has already been stored.
>
> But your insistence on power off and disk caches etc. seems to
> indicate that "safety" in your mind means "when I click the
> 'Save' button it is really saved and not partially".
>
Let me define safety as needed by the use case:
fileA is a 2MB OpenOffice document that already exists on the filesystem.
userA opens fileA locally, modifies a lot of lines, and attempts to save it.
As the save operation is proceeding, the PSU goes haywire and power
is cut abruptly.
When the system is turned back on, I expect some sort of recovery process
to bring the filesystem to a consistent state.
I expect fileA to be as it was before the save operation and
not corrupted in any way.
Am I asking/expecting too much?
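
From the fsync discussion above, my understanding is that a well-behaved
application gets exactly that guarantee by writing a temporary file,
syncing it, and then renaming it over the original. Just a sketch of the
idea, not tested:

  cp fileA fileA.new   # stand-in for the application writing the new contents to a temp file
  sync                 # crude shell stand-in for the application calling fsync() on the new file
  mv fileA.new fileA   # rename is atomic: readers see the old or the new file, never half of each
  sync                 # flush the directory update as well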

> As to that there quite a lot of qualifiers:
>
>  * Most users don't understand that even in the best scenario a
>    file is really saved not when they *click* the 'Save' button,
>    but when they get the "Saved!" message. In between anything
>    can happen. Also, work in progress (not yet saved explicitly)
>    is fair game.
>
>  * "Really saved" is an *application* concern first and foremost.
>    The application *must* say (via 'fsync') that it wants the
>    data really saved. Unfortunately most applications don't do
>    that because "really saved" is a very expensive operation, and
>    usually sytems don't crash, so the application writer looks
>    like a genius if he has an "optimistic" attitude. If you do a
>    web search look for various O_PONIES discussions. Some intros:
>
>      http://lwn.net/Articles/351422/
>      http://lwn.net/Articles/322823/
>
>  * XFS (and to a point 'ext4') is designed for applications that
>    work correctly and issue 'fsync' appropriately, and if they do
>    it is very safe, because it tries hard to ensure that either
>    'fsync' means "really saved" or you know that it does not. XFS
>    takes advantage of the assumption that applications do the
>    right thing to do various latency-based optimizations between
>    calls to 'fsync'.
>
>  * Unfortunately most GUI applications don't do the right thing,
>    but fortunately you can compensate for that. The key here is
>    to make sure that the flusher's parameter are set for rather
>    more frequent flushing than the default, which is equivalent
>    to issuing 'fsync' systemwide fairly frequently. Ideally set
>    'vm/dirty_bytes' to something like 1-3 seconds of IO transfer
>    rate (and in reversal on some of my previous advice leave
>    'vm/dirty_background_bytes' to something quite large unless
>    you *really* want safety), and to shorten significantly
>    'vm/dirty_expire_centisecs', 'vm/dirty_writeback_centisecs'.
>    This defeats some XFS optimizations, but that's inevitable.
>
>  * In any case you are using NFS/Samba, and that opens a much
>    bigger set of issues, because caching happens on the clients
>    too: http://www.sabi.co.uk/0707jul.html#070701b
>
> Then Von Neuman help you if your users or you decide to store lots
> of messages in MH/Maildir style mailstores, or VM images on
> "growable" virtual disks.

What's wrong with VM images on "growable" virtual disks? Are you
saying not to rely on lvm2 volumes?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 15:25         ` keld
@ 2012-03-15 16:52           ` Jessie Evangelista
  -1 siblings, 0 replies; 65+ messages in thread
From: Jessie Evangelista @ 2012-03-15 16:52 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

Hi keld,

On Thu, Mar 15, 2012 at 11:25 PM,  <keld@keldix.com> wrote:
> On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote:
>> >>> I want to create a raid10,n2 using 3 1TB SATA drives.
>> >>> I want to create an xfs filesystem on top of it. The
>> >>> filesystem will be used as NFS/Samba storage.
>>
>> Consider also an 'o2' layout (it is probably the same thing for a
>> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems
>> one of the few cases where RAID5 may be plausible.
>
> Well, for a file server like NFS/Samba, you could also consider raid10,f2.
> I would think you could get about double the read performance compared to n2 and o2
> layouts, and also for individual read transfers on a running system
> you would get somthing like double the read performance.
> Write performance could be somewhat slower (0 to 10 %) bot as users
> are not waiting for writes to complete, they will probably not notice.

I also plan to try raid10,f2. Did you do your own benchmarks, or are you
quoting someone else's?
>
> best regards
> keld

Thanks for chiming in. Have a nice day.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 16:52           ` Jessie Evangelista
@ 2012-03-15 17:15             ` keld
  -1 siblings, 0 replies; 65+ messages in thread
From: keld @ 2012-03-15 17:15 UTC (permalink / raw)
  To: Jessie Evangelista; +Cc: Linux RAID, Linux fs XFS

On Fri, Mar 16, 2012 at 12:52:19AM +0800, Jessie Evangelista wrote:
> Hi keld,
> 
> On Thu, Mar 15, 2012 at 11:25 PM,  <keld@keldix.com> wrote:
> > On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote:
> >> >>> I want to create a raid10,n2 using 3 1TB SATA drives.
> >> >>> I want to create an xfs filesystem on top of it. The
> >> >>> filesystem will be used as NFS/Samba storage.
> >>
> >> Consider also an 'o2' layout (it is probably the same thing for a
> >> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems
> >> one of the few cases where RAID5 may be plausible.
> >
> > Well, for a file server like NFS/Samba, you could also consider raid10,f2.
> > I would think you could get about double the read performance compared to n2 and o2
> > layouts, and also for individual read transfers on a running system
> > you would get something like double the read performance.
> > Write performance could be somewhat slower (0 to 10 %) but as users
> > are not waiting for writes to complete, they will probably not notice.
> 
> I also plan to try raid10f2. Did you do your own benchmarks or are you
> quoting someone else's?

Both, look at our wiki: https://raid.wiki.kernel.org/articles/p/e/r/Performance.html

Best regards
keld

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 17:15             ` keld
@ 2012-03-15 17:40               ` keld
  -1 siblings, 0 replies; 65+ messages in thread
From: keld @ 2012-03-15 17:40 UTC (permalink / raw)
  To: Jessie Evangelista; +Cc: Linux RAID, Linux fs XFS

On Thu, Mar 15, 2012 at 06:15:49PM +0100, keld@keldix.com wrote:
> On Fri, Mar 16, 2012 at 12:52:19AM +0800, Jessie Evangelista wrote:
> > Hi keld,
> > 
> > On Thu, Mar 15, 2012 at 11:25 PM,  <keld@keldix.com> wrote:
> > > On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote:
> > >> >>> I want to create a raid10,n2 using 3 1TB SATA drives.
> > >> >>> I want to create an xfs filesystem on top of it. The
> > >> >>> filesystem will be used as NFS/Samba storage.
> > >>
> > >> Consider also an 'o2' layout (it is probably the same thing for a
> > >> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems
> > >> one of the few cases where RAID5 may be plausible.
> > >
> > > Well, for a file server like NFS/Samba, you could also consider raid10,f2.
> > > I would think you could get about double the read performance compared to n2 and o2
> > > layouts, and also for individual read transfers on a running system
> > > you would get something like double the read performance.
> > > Write performance could be somewhat slower (0 to 10 %) but as users
> > > are not waiting for writes to complete, they will probably not notice.
> > 
> > I also plan to try raid10f2. Did you do your own benchmarks or are you
> > quoting someone else's?
> 
> Both, look at our wiki: https://raid.wiki.kernel.org/articles/p/e/r/Performance.html

I think it would be interesting to include your figures on the wiki page, if
you publish them here on the list. 

Maybe we should rearrange the wiki page a little. I am not so happy about the
data reported in the section "New benchmarks from 2011", as it only illustrates
what happens with the CPU 100% busy. I would like to move it to a separate page.
Also the really old data in the section "Old performance benchmark" should be
moved to a separate page, IMHO. The text on the wiki page should be giving info
of general interest for systems running today (still IMHO). Comments?

Best regards
keld

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 16:18         ` Jessie Evangelista
@ 2012-03-15 23:00           ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-15 23:00 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

[ ... ]

>> Also, as a rule I want to make sure that the sector size is
>> set to 4096B, for future proofing (and recent drives not only
>> have 4096B sectors but usually lie).

> it seems the 1TB drives that I have still have 512-byte sectors

But usually you can still set the XFS idea of sector size to 4096,
which is probably a good idea in general.
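
For example, something along these lines should do it (a sketch only;
'-s size' sets the XFS sector size and everything else is left at the
mkfs defaults):

  mkfs.xfs -s size=4096 /dev/md0
  # after mounting, xfs_info on the mount point should report sectsz=4096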

[ ... ]

>>> is xfs going to be prone to more data loss in case the
>>> non-redundant power supply goes out?

>> That's the wrong question entirely. Data loss can happen for
>> many other reasons, and XFS is probably one of the safest
>> designs, if properly used and configured. The problems are
>> elsewhere.

> Can you please elaborate how xfs can be properly used and
> configured?

I did that in the following bits of the reply. You must be in a
real hurry if you cannot trim down the quoting or write your
comments after reading through once...

[ ... ]

>> But your insistence on power off and disk caches etc. seems to
>> indicate that "safety" in your mind means "when I click the
>> 'Save' button it is really saved and not partially".

> let me define safety as needed by the usecase: fileA is a 2MB
> open office document file already existing on the file system.
> userA opens fileA locally, modifies a lot of lines and attempts
> to save it. as the saving operation is proceeding, the PSU goes
> haywire and power is cut abruptly.

To add to your worries: if the PSU goes haywire, the disk data may
become subtly corrupted:

https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta
 «Another user, also running a Tyan 2885 dual-Opteron workstation
  like mine, had experienced data corruption with SATA disks. The
  root cause? A faulty power supply.»

Even so, that is an argument not so much for filesystem-provided
checksums, as the ZFS (and other) people say, but for end-to-end
(application-level) ones.

> When the system is turned on, i expect some sort of recovery
> process to bring the filesystem to a consistent state.

The XFS design really cares about that and unless the hardware is
very broken metadata consistency will be good.

> I expect fileA should be as it was before the save operation and
> should not be corrupted in anyway.  Am I asking/expecting too much?

That is too much to expect of the filesystem and at the same time
too little.

It is too much because it is strictly the responsibility of the
application, and it is very expensive, because it can only happen
by simulating copy-on-write (app makes a copy of the document,
updates the copy, and then atomically renames it, and then makes
another copy). Some applications like OOo/LibreO/VIM instead use a
log file to record updates, and then merge those on save (copy,
merge, rename), which is better. Some filesystems like NILFS2 or
BTRFS or Next3/Next4 use COW to provide builtin versioning, but
that's expensive too. The original UNIX insight to provide a very
simple file abstraction layer should not be lightly discarded (but
I like NILFS2 in particular).

It is too little because of what happens if you have dozens to
thousands of modified but not yet fully persisted files, such as
newly created mail folders, 'tar' unpacks, source-tree checkins,
...

As I tried to show in my previous reply, and in the NFS blog entry
mentioned in it too, on a crudely practical level relying on
applications doing the right thing is optimistic, and it may be
regrettably expedient to complement barriers with frequent system
driven flushing, which partially simulates (at a price) O_PONIES.

[ ... ]

>> Then Von Neuman help you if your users or you decide to store
>> lots of messages in MH/Maildir style mailstores, or VM images
>> on "growable" virtual disks.

> what's wrong with VM images on "growable" virtual disks. are you
> saying not to rely on lvm2 volumes?

By "growable" I mean that the virtual disk is allocated sparsely.

As to LVM2, it is very rarely needed. The only really valuable
feature it has is snapshot LVs, and those are very expensive. XFS,
which can routinely allocate 2GiB (or bigger) files as single
extents, can be used as a volume manager too.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 23:00           ` Peter Grandi
@ 2012-03-16  3:36             ` Jessie Evangelista
  -1 siblings, 0 replies; 65+ messages in thread
From: Jessie Evangelista @ 2012-03-16  3:36 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS

> But usually you can still set the XFS idea of sector size to 4096,
> which is probably a good idea in general.

I'm now running kernel 3.0.0-16-server Ubuntu 10.04LTS
cat /sys/block/sd[b-d]/queue/physical_block_size shows 512
cat /sys/block/sd[b-d]/device/model shows ST31000524AS
Looking up the model at Seagate, the specs page does not mention
512-byte sectors, but it does list 1,953,525,168 guaranteed sectors;
multiplying by 512 bytes gives 1,000,204,886,016 bytes (about 1TB).
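
A quick cross-check of what the kernel reports (assuming a reasonably
recent util-linux; the device name is just the first array member):

  blockdev --getss --getpbsz /dev/sdb   # logical then physical sector size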

Anyway, I'll have a look at setting the sector size for xfs

> I did that in the following bits of the reply. You must be in a
> real hurry if you cannot trim down the quoting or write your
> comments after reading through once...

I did read through your comments several times and I really appreciate them.
Will look into setting vm/dirty_bytes, vm/dirty_background_bytes,
vm/dirty_expire_centisecs, vm/dirty_writeback_centisecs.

I'm still scouring the internet for a best practice recipe for
implementing xfs/mdraid.
I am open to writing one and including the inputs everyone is contributing here.
In my search, I also saw some references of alignment issues for partitions.
this is what I used to setup the partitions for the md device

sfdisk /dev/sdb <<EOF
unit: sectors

63,104872257,fd
0,0,0
0,0,0
0,0,0
EOF

I've read a recommendation to start the partition on the 1MB mark.
Does this make sense?
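
The recommendation I saw looks roughly like this (a sketch using parted;
1MiB corresponds to sector 2048 with 512-byte logical sectors, the device
name is just the first array member, and align-check needs a reasonably
recent parted):

  parted -s /dev/sdb mklabel msdos
  parted -s /dev/sdb mkpart primary 1MiB 100%   # start on the 1MiB boundary
  parted -s /dev/sdb set 1 raid on              # same intent as the 'fd' type above
  parted /dev/sdb align-check optimal 1         # sanity-check the alignment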

>> let me define safety as needed by the usecase: fileA is a 2MB
>> open office document file already existing on the file system.
>> userA opens fileA locally, modifies a lot of lines and attempts
>> to save it. as the saving operation is proceeding, the PSU goes
>> haywire and power is cut abruptly.
>
> To worry you, if the PSU goes haywire, the disk data may become
> subtly corrupted:
>
> https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta
>  «Another user, also running a Tyan 2885 dual-Opteron workstation
>  like mine, had experienced data corruption with SATA disks. The
>  root cause? A faulty power supply.»
>
> Even if that is not an argument for filesystem provided checksums,
> as the ZFS (and other) people say, but for end-to-end (application
> level) ones.

Mmmm, I've also been reading up on ZFS but haven't put it through its paces.

>> I expect fileA should be as it was before the save operation and
>> should not be corrupted in anyway.  Am I asking/expecting too much?
>
> That is too much to expect of the filesystem and at the same time
> too little.
>
> It is too much because it is strictly the responsibility of the
> application, and it is very expensive, because it can only happen
> by simulating copy-on-write (app makes a copy of the document,
> updates the copy, and then atomically renames it, and then makes
> another copy). Some applications like OOo/LibreO/VIM instead use a
> log file to record updates, and then merge those on save (copy,
> merge, rename), which is better. Some filesystems like NILFS2 or
> BTRFS or Next3/Next4 use COW to provide builtin versioning, but
> that's expensive too. The original UNIX insight to provide a very
> simple file abstraction layer should not be lightly discarded (but
> I like NILFS2 in particular).
>
> It is too little because of what happens if you have dozens to
> thousands of modified but not yet fully persisted files, sych as
> newly created mail folders, 'tar' unpacks , source tree checkins,
> ...
>
> As I tried to show in my previous reply, and in the NFS blog entry
> mentioned in it too, on a creduly practical level relying on
> applications doing the right thing is optimistic, and it may be
> regrettably expedient to complement barriers with frequent system
> driven flushing, which partially simulates (at a price) O_PONIES.

I'd like to read about the NFS blog entry but the link you included
results in a 404.
I forgot to mention in my last reply.

Based on what I understood from your thoughts above, if an
application issues a flush/fsync and it does not complete due to
some catastrophic crash, xfs on its own cannot roll back to the
previous version of the file left by the unfinished write operation.
Disabling the device caches wouldn't help either, right?
Only filesystems that do COW can do this, at the expense of
performance? (btrfs and zfs, please hurry and grow up!)

> As to to LVM2 it is very rarely needed. The only really valuable
> feature it has is snapshot LVs, and those are very expensive. XFS,
> which can allocate routinely 2GiB (or bigger) files as a single
> extents, can be used as a volume manager too.

If you were in my place with the resource constraints, you'd go with:
xfs with barriers on top of mdraid10 with device cache ON and setting
vm/dirty_bytes, vm/dirty_background_bytes, vm/dirty_expire_centisecs,
vm/dirty_writeback_centisecs to safe values
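
Something like this is what I have in mind (illustrative values only,
sized for very roughly 100MB/s of array throughput; these are the sysctl
spellings of the vm/* knobs above):

  # /etc/sysctl.conf
  vm.dirty_bytes = 200000000            # cap dirty data at ~2 seconds of IO
  vm.dirty_expire_centisecs = 500       # treat dirty pages as old after 5s (default 3000)
  vm.dirty_writeback_centisecs = 100    # wake the flusher every 1s (default 500)

  sysctl -p                             # apply without a reboot
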
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16  3:36             ` Jessie Evangelista
@ 2012-03-16 11:06               ` Michael Monnerie
  -1 siblings, 0 replies; 65+ messages in thread
From: Michael Monnerie @ 2012-03-16 11:06 UTC (permalink / raw)
  To: xfs; +Cc: Jessie Evangelista, Peter Grandi, Linux RAID

[-- Attachment #1: Type: text/plain, Size: 1370 bytes --]

On Friday, 16 March 2012, at 11:36:07, Jessie Evangelista wrote:
> If you were in my place with the resource constraints, you'd go with:
> xfs with barriers on top of mdraid10 with device cache ON and setting
> vm/dirty_bytes, vm/dirty_background_bytes, vm/dirty_expire_centisecs,
> vm/dirty_writeback_centisecs to safe values

If you had ever experienced a crash where lots of sensitive and
important data were lost, you would not even think about "device cache ON".

> could you please suggest a hardware raid card with BBU that's cheap?

"Cheap" is a varying definition. How much is your data worth? How much 
does one day of blackout cost? 

I've been very happy with Areca Controllers, like the 12x0 and 1680 
series, and now there's the newer 1882 series like 
http://geizhals.at/eu/721745
plus a BBU for about 100€.

You can even mix RAID levels on the same disks: for example, with 8x1TB
drives, define a 500G RAID0 and use the rest as a RAID6. Online expansion
is possible, plus scheduled background verify, e-mail notification on
everything, logging, NTP time sync, out-of-band management via its own
network interface, ...
Very reliable; I never had a problem. And they have a good support team.

-- 
mit freundlichen Grüssen,
Michael Monnerie, Ing. BSc

it-management Internet Services: Protéger
http://proteger.at [gesprochen: Prot-e-schee]
Tel: +43 660 / 415 6531

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16 11:06               ` Michael Monnerie
@ 2012-03-16 12:21                 ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-16 12:21 UTC (permalink / raw)
  To: Linux fs XFS, Linux RAID

[ ... ]

>> If you were in my place with the resource constraints, you'd
>> go with: xfs with barriers on top of mdraid10 with device
>> cache ON and setting vm/dirty_bytes, vm/dirty_background_bytes,
>> vm/dirty_expire_centisecs, vm/dirty_writeback_centisecs to
>> safe values

> If you ever experienced a crash where lots of sensible and
> important data were lost, you would not even think about
> "device cache ON".

It is not as simple as that... *If* hw barriers are implemented
*and* applications do the right things, that is not a concern.
Disabling the device cache is just a way to turn barriers on for
everything. Indeed the whole rationale for having the 'barrier'
option is to leave the device caches on, and the OP did ask how to
test that barriers actually work.

Since even most consumer-level drives currently implement
barriers correctly, the biggest problem today, as per the
O_PONIES discussions, is applications that don't do the right
thing, and therefore the biggest risk is large amounts of dirty
pages in system memory (either NFS client or server), not in the
drive caches.

Since the Linux flusher parameters are/have been demented, I
have seen one or more GiB of dirty pages in system memory (on
hosts I didn't configure...), which also causes performance
problems.
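
An easy way to see how much is pending at any given moment:

  grep -E 'Dirty|Writeback' /proc/meminfo   # dirty pages and pages under writeback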

Again, as a crassly expedient thing, working around the lack of
"do the right thing" in applications by letting only a few
seconds of dirty pages accumulate in system memory seems to fool
enough users (and many system administrators and application
developers) into thinking that stuff is "safe". It worked well
enough for 'ext3' for many years, quite regrettably.

  Note: 'ext3' has also had the "helpful" issue of excessive
  impact of flushing, which made 'fsync' performance terrible,
  but improved the apparent safety for "optimistic" applications.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 12:06   ` Jessie Evangelista
  2012-03-15 14:07       ` Peter Grandi
@ 2012-03-16 12:25     ` Stan Hoeppner
  2012-03-16 18:01       ` Jon Nelson
  1 sibling, 1 reply; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-16 12:25 UTC (permalink / raw)
  To: Jessie Evangelista; +Cc: linux-raid

On 3/15/2012 7:06 AM, Jessie Evangelista wrote:
> On Thu, Mar 15, 2012 at 1:38 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:

>> Why 256KB for chunk size?
>>
> For reference, the machine has 16GB memory
> 
> I've run some benchmarks with dd trying the different chunks and 256k
> seems like the sweetspot.
> dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct

Using dd in this manner is precisely analogous to taking your daily
driver Toyota to the local drag strip, making a few runs, and observing
your car can accelerate from 0-92 mph in 1320 ft.  This has no
correlation to daily driving on public roads.
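
If you want numbers that mean something for this server, a sketch along
these lines is closer to daily driving (assuming fio is installed; the
read/write mix, block size and job count are guesses to be tuned to the
real NFS/Samba traffic):

  fio --name=smbmix --directory=/mnt/raid10xfs --rw=randrw --rwmixread=70 \
      --bs=64k --size=1g --numjobs=4 --runtime=60 --time_based \
      --direct=1 --group_reporting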

> I'll probably forgo setting the journal log file size. It seemed like
> a safe optimization from what I've read.

See:
http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

> I just wanted to be explicit about it so that I know what is set just
> in case the defaults change

See:
http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

Even if the XFS mount defaults change you won't notice a difference, not
on this server, except for possibly delaylog if you do a lot of 'rm -rf'
operations on directories containing tens of thousands of files.
Delaylog is the only mount default change in many years.  It occurred in
2.6.39, which is why I recommended this rev as the minimum you should
choose.

>> In fact, it appears you don't need to specify anything in mkfs.xfs or
>> fstab, but just use the defaults.  Fancy that.  And the one thing that
>> might actually increase your performance a little bit you didn't
>> specify--sunit/swidth.  However, since you're using mdraid, mkfs.xfs
>> will calculate these for you (which is nice as mdraid10 with odd disk
>> count can be a tricky calculation).  Again, defaults work for a reason.
>>
> The reason I did not set sunit/swidth is because I read somewhere that
> mkfs.xfs will calculate based on mdraid.

I guess my stating of the same got lost in the rest of that paragraph. ;)

> The server is for a non-profit org that I am helping out.
> I think a APC Smart-UPS SC 420VA 230V may fit their shoe string budget.

Given the rough server specs you presented, a 420 (260 watts) should be
fine, assuming you're not running seti@home, folding@home, etc, which
can double average system power draw.  A 420 won't yield much battery
run time but will give more than enough time for a clean shutdown.  Are
you sure you want a 230v model?  If so I'd guess you're outside the
States.  Also:  http://www.apcupsd.com/  APC control daemon with auto
shutdown.
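
A minimal sketch of the apcupsd side (Debian/Ubuntu package name; the
config keys are documented in apcupsd.conf and the values here are only
illustrative for a USB-connected Smart-UPS):

  apt-get install apcupsd

  # /etc/apcupsd/apcupsd.conf
  UPSCABLE usb
  UPSTYPE usb
  # DEVICE is left empty so the USB UPS is autodetected
  DEVICE
  BATTERYLEVEL 10          # shut down at 10% charge remaining...
  MINUTES 5                # ...or when ~5 minutes of runtime remain

  apcaccess status         # confirm the daemon is talking to the UPS

IIRC the Debian/Ubuntu package also wants ISCONFIGURED=yes set in
/etc/default/apcupsd before the daemon will start.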

> nightly backups will be stored on an external USB disk
> is xfs going to be prone to more data loss in case the non-redundant
> power supply goes out?

There are bigger issues here WRT XFS and USB connected disks IIRC from
some list posts.  USB is prone to random device/bus disconnections due
to power management in various USB controllers.  XFS design assumes
storage devices are persistently connected, and it does frequent
background reads/writes to the device.  If the USB drive is offline long
enough, lots of errors are logged, and if XFS can't access it, it may/will
shut down the filesystem as a safety precaution.  If you want to use XFS
on that external USB drive, you need to do some research first--I don't
have solid answers for you here.  Or simply use EXT3/4.  XFS isn't going
to yield any advantage with single threaded backup anyway, so maybe just
going with EXT is the smart move.

>> I'll appreciate your response stating "Yes, I have a UPS and
>> tested/working shutdown scripts" or "I'll be implementing a UPS very
>> soon." :)
> 
> I don't have shutdown scripts yet but will look into it.

Again:  http://www.apcupsd.com/  There may be others available.

> Meatware would have to do for now as the server will probably be ON
> only when there's people at the office. 

I just hope they do proper shutdowns when they power it off. ;)

> And yes I will be asking them
> to not go into production without a UPS

If it's a hard sell, simply explain that every laptop has a built in
UPS, and that the server and its data are obviously as important, if not
more, than any laptop.

> Thanks for you input Stan.

You're welcome.

> I just updated the kernel to 3.0.0-16.
> Did they take out barrier support in mdraid? or was the implementation
> replaced with FUA?

Write barriers, in one form or another, are there.  These will never be
removed or broken--too critical.  The implementation may have changed.
Neil can answer this much better than me.

> Is there a definitive test to determine if the off the shelf consumer
> sata drives honor barrier or cache flush requests?

Just connect the drive and boot up.  You'll see this in dmesg:

sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA

And:
$ hdparm -I /dev/sda
...
Commands/features:
        Enabled Supported:
                ...
           *    Write cache
                ...
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
                ...
These are good indicators that the drive supports barrier operations.

> I think I'd like to go with device cache turned ON and barrier enabled.

You just stated the Linux defaults.  Do note that XFS write barriers
will ensure journal and thus filesystem integrity in a crash/power fail
event.  They do NOT guarantee file data integrity as file data isn't
journaled.  No filesystem (Linux anyway) journals data, only metadata.
To prevent file data loss due to a crash/power fail, you must disable
the drive write caches completely and/or use a BBWC RAID card.

As you know performance is horrible with caches disabled.  With so few
users and so little data writes, you're safe running with barriers and
write cache enabled.  This is how most people on this list with plain
HBAs run their systems.

> Am still torn between ext4 and xfs i.e. which will be safer in this
> particular setup.

Neither is "safer" than the other.  That's up to your hardware and power
configuration.  Pick the one you are most comfortable working with and
have the most experience supporting.  For this non-prof SOHO workload,
XFS' advanced features will yield little, if any, performance
advantage--you have too few users, disks, and too little IO.

If this box had, say, 24 cores, 128GB RAM, and 192 15k SAS drives across
4 dual port SAS RAID HBAs, 8x24 drive hardware RAID10s in an mdraid
linear array, with a user IO load demanding such a system--multiple GB/s
of concurrent IO, then the only choice is XFS.  EXT4 simply can't scale
close to anything like this.

All things considered, for your system, EXT4 is probably the best choice.

-- 
Stan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16  3:36             ` Jessie Evangelista
@ 2012-03-16 17:15               ` Brian Candler
  -1 siblings, 0 replies; 65+ messages in thread
From: Brian Candler @ 2012-03-16 17:15 UTC (permalink / raw)
  To: Jessie Evangelista; +Cc: Peter Grandi, Linux RAID, Linux fs XFS

On Fri, Mar 16, 2012 at 11:36:07AM +0800, Jessie Evangelista wrote:
> I'm still scouring the internet for a best practice recipe for
> implementing xfs/mdraid.
> I am open to writing one and including the inputs everyone is contributing here.
> In my search, I also saw some references of alignment issues for partitions.
> this is what I used to setup the partitions for the md device
> 
> sfdisk /dev/sdb <<EOF
> unit: sectors
> 
> 63,104872257,fd
> 0,0,0
> 0,0,0
> 0,0,0
> EOF
> 
> I've read a recommendation to start the partition on the 1MB mark.
> Does this make sense?

I would just make the raw disks members of the RAID array, e.g.
/dev/sdb, /dev/sdc etc and not partition them.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16 12:25     ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner
@ 2012-03-16 18:01       ` Jon Nelson
  2012-03-16 18:03         ` Jon Nelson
  0 siblings, 1 reply; 65+ messages in thread
From: Jon Nelson @ 2012-03-16 18:01 UTC (permalink / raw)
  To: stan; +Cc: linux-raid

On Fri, Mar 16, 2012 at 7:25 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
..
> You just stated the Linux defaults.  Do note that XFS write barriers
> will ensure journal and thus filesystem integrity in a crash/power fail
> event.  They do NOT guarantee file data integrity as file data isn't
> journaled.  No filesystem (Linux anyway) journals data, only metadata.

..

That's not true, is it? ext3 and ext4 support data=journal.

-- 
Jon
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16 18:01       ` Jon Nelson
@ 2012-03-16 18:03         ` Jon Nelson
  2012-03-16 19:28             ` Peter Grandi
  0 siblings, 1 reply; 65+ messages in thread
From: Jon Nelson @ 2012-03-16 18:03 UTC (permalink / raw)
  To: stan, LinuxRaid

On Fri, Mar 16, 2012 at 1:01 PM, Jon Nelson
<jnelson-linux-raid@jamponi.net> wrote:
> On Fri, Mar 16, 2012 at 7:25 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> ..
>> You just stated the Linux defaults.  Do note that XFS write barriers
>> will ensure journal and thus filesystem integrity in a crash/power fail
>> event.  They do NOT guarantee file data integrity as file data isn't
>> journaled.  No filesystem (Linux anyway) journals data, only metadata.
>
> ..
>
> That's not true, is it? ext3 and ext4 support data=journal.

And btrfs supports COW (as does nilfs2) with "transactions", which
should/could be similar?

-- 
Jon
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16 18:03         ` Jon Nelson
@ 2012-03-16 19:28             ` Peter Grandi
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-16 19:28 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

[ ... ]

>>> write barriers will ensure journal and thus filesystem
>>> integrity in a crash/power fail event.  They do NOT guarantee
>>> file data integrity as file data isn't journaled.

Not well expressed, as XFS barriers do ensure file data integrity,
*if the applications uses them* (and uses them in exactly the
right way).

The difference between metadata and data with XFS is that XFS
itself will use barriers on metadata at the right times, because
that's data to XFS, but it won't use barriers on data, leaving
that entirely to the application.

>>>  No filesystem (Linux anyway) journals data, only metadata.

>> That's not true, is it? ext3 and ext4 support data=journal.

They do, because they journal blocks, which is not generally a
great choice, but it does make journaling data blocks as well
easier than other designs do. It is, however, a very special case
that few people use.
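
For reference, turning it on looks like this (a sketch; device and mount
point are purely illustrative):

  mount -t ext4 -o data=journal /dev/md0 /mnt/data   # per mount
  tune2fs -o journal_data /dev/md0                   # or as a superblock default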

Also, there are significant issues with 'ext3' and 'fsync' and
journaling:

http://lwn.net/Articles/328363/
 «There is one other important change needed to get a truly
  quick fsync() with ext3, though: the filesystem must be
  mounted in data=writeback mode. This mode eliminates the
  requirement that data blocks be flushed to disk ahead of
  metadata; in data=ordered mode, instead, the amount of data to
  be written guarantees that fsync() will always be slower.
  Switching to data=writeback eliminates those writes, but, in
  the process, it also turns off the feature which made ext3
  seem more robust than ext4.»

On a more general note, journaling and barriers are sort of
distinct issues.

The real purpose of barriers is to ensure that updates are
actually on the recording medium, whether in the journal or
directly at their final destination.
That is, barriers are used to ensure that the data or metadata on
the persistent layer is current.

The purpose of a journal is not to ensure that the state on the
persistent layer is *current*, but rather *consistent* (at a
lower cost than synchronous updates), without having to be
careful about the order in which the updates are made current.
The updates are made consistent by writing them to the log as
they are needed (not necessarily immediately), and then on
recovery the order gets sorted out spatially.

Currency does not imply consistency (if the updates are made
current in some arbitrary order) and consistency does not imply
currency (if the recording medium is kept consistent but updates
are applied to it infrequently).

The BSD FFS does not need a journal because it is designed to be
very careful as to the order in which updates are made current,
and log file systems don't aim for spatial currency.

> And btrfs supports COW (as does nilfs2) with "transactions",
> which should/could be similar?

Not quite. They are more like "checkpoints", that is alternate
root inodes that "snapshot" the state of the whole filetree at
some point.

These are not entirely inexpensive, and as a result, as I learned
from a talk about some recent updates to the BSD FFS:

  http://www.sabi.co.uk/blog/12-two.html#120222

COW filesystems like ZFS/BTRFS/... need to have a journal too to
support 'fsync' in between checkpoints.

BTW there are now COW versions of 'ext3' and 'ext4', with
snapshotting too:

  http://www.sabi.co.uk/blog/12-two.html#120218b

The 'freeze' feature of XFS does not rely on snapshotting; it
relies on suspending all processes that are writing to the
filetree, so updates are avoided for the duration.

As the XFS team have been adding or planning to add various "new"
features like checksums, maybe one day they will add COW to XFS
too (not such an easy task when considering how large XFS extents
can be, but the hole punching code can help there).
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
@ 2012-03-16 19:28             ` Peter Grandi
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-16 19:28 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

[ ... ]

>>> write barriers will ensure journal and thus filesystem
>>> integrity in a crash/power fail event.  They do NOT guarantee
>>> file data integrity as file data isn't journaled.

Not well expressed, as XFS barriers do ensure file data integrity,
*if the applications uses them* (and uses them in exactly the
right way).

The difference between metadata and data with XFS is that XFS
itself will use barriers on metadata at the right times, because
that's data to XFS, but it won't use barriers on data, leaving
that entirely to the application.

>>>  No filesystem (Linux anyway) journals data, only metadata.

>> That's not true, is it? ext3 and ext4 support journal=data.

They do, because they journal blocks, which is not generally a
great choice, but gives the option to journal data blocks too more
easily than other choices. But it is a very special case that few
people use.

Also, there are significant issues with 'ext3' and 'fsync' and
journaling:

http://lwn.net/Articles/328363/
 «There is one other important change needed to get a truly
  quick fsync() with ext3, though: the filesystem must be
  mounted in data=writeback mode. This mode eliminates the
  requirement that data blocks be flushed to disk ahead of
  metadata; in data=ordered mode, instead, the amount of data to
  be written guarantees that fsync() will always be slower.
  Switching to data=writeback eliminates those writes, but, in
  the process, it also turns off the feature which made ext3
  seem more robust than ext4.»

On a more general note, journaling and barriers are sort of
distinct issues.

The real purpose of barriers is to ensure that updates are
actually on the recording medium, whether in the journal or
directly on final destination.
That is barriers are used to ensure that data or metadata on the
persistent layer is current.

The purpose of a journal is not to ensure that the state on the
persistent layer are *current*, but rather *consistent* (at a
lower cost than synchronous updates), without having to be
careful about the order in which the updates are made current.
The updates are made consistent by writing them to the log as
they are needed (not necessarily immediately), and then on
recovery the order gets sorted out spatially.

Currency does not imply consistency (if the updates are made
current in some arbitrary order) and consistency does not imply
currency (if the recording medium is kept consistent but updates
are applied to it infrequently).

The BSD FFS does not need a journal because it is designed to be
very careful as to the order in which updates are made current,
and log file systems don't aim for spatial currency.

> And btrfs supports COW (as does nilfs2) with "transactions",
> which should/could be similar?

Not quite. They are more like "checkpoints", that is alternate
root inodes that "snapshot" the state of the whole filetree at
some point.

These are not entirely inexpensive, and as a result as I learned
from a talk about some recent updates about the BSD FFS:

  http://www.sabi.co.uk/blog/12-two.html#120222

COW filesystems like ZFS/BTRFS/... need to have a journal too to
support 'fsync' in between checkpoints.

BTW there are now COW versions of 'ext3' and 'ext4', with
snapshotting too:

  http://www.sabi.co.uk/blog/12-two.html#120218b

The 'freeze' feature of XFS does not rely on snapshotting; it
relies on suspending all processes that are writing to the
filetree, so updates are avoided for the duration.

As the XFS team have been adding or planning to add various "new"
features like checksums, maybe one day they will add COW to XFS
too (not such an easy task when considering how large XFS extents
can be, but the hole punching code can help there).

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16 19:28             ` Peter Grandi
@ 2012-03-17  0:02               ` Stan Hoeppner
  -1 siblings, 0 replies; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-17  0:02 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS

On 3/16/2012 2:28 PM, Peter Grandi wrote:
> [ ... ]
> 
>>>> write barriers will ensure journal and thus filesystem
>>>> integrity in a crash/power fail event.  They do NOT guarantee
>>>> file data integrity as file data isn't journaled.
> 
> Not well expressed, 

Given the audience, the OP, I was simply avoiding getting too deep in
the weeds, Peter.  This thread is on the linux-raid list, not xfs@oss.
You know I have a tendency to get too deep in the weeds.  I think I
struck a nice balance here. ;)

> as XFS barriers do ensure file data integrity,
> *if the application uses them* (and uses them in exactly the
> right way).

How will the OP know which, if any, of his users' desktop applications
do fsyncs properly?  He won't.  Which is why I made the general
statement, which is correct, if neither elaborate nor down in the weeds.

> The difference between metadata and data with XFS is that XFS
> itself will use barriers on metadata at the right times, because
> that's data to XFS, but it won't use barriers on data[1], leaving
> that entirely to the application.

[1]File data, just to be clear

>>>>  No filesystem (Linux anyway) journals data, only metadata.
> 
>>> That's not true, is it? ext3 and ext4 support journal=data.
> 
> They do, because they journal blocks, which is not generally a
> great choice, but gives the option to journal data blocks too more
> easily than other choices. But it is a very special case that few
> people use.

Few use it because the performance is absolutely horrible.  data=journal
disables delayed allocation (which seriously contributes to any modern
filesystem's performance--EXT devs stole/borrowed delayed allocation
from XFS BTW) and it disables O_DIRECT.  It also doubles the number of
data writes to media, once to the journal, once to the filesystem, for
every block of every file written.

> On a more general note, journaling and barriers are sort of
> distinct issues.
> 
> The real purpose of barriers is to ensure that updates are
> actually on the recording medium, whether in the journal or
> directly on final destination.
> That is barriers are used to ensure that data or metadata on the
> persistent layer is current.

Correct.  Again, trying to stay out of the weeds.  I'd established that
XFS uses barriers on journal writes for metadata consistency, which
prevents filesystem corruption after a crash, but not necessarily file
corruption.  Making the statement that XFS doesn't journal data gets the
point across more quickly, while staying out of the weeds.

[...]

> The 'freeze' feature of XFS does not rely on snapshotting; it
> relies on suspending all processes that are writing to the
> filetree, so updates are avoided for the duration.

xfs_freeze was moved into the VFS in 2.6.29 and is called automatically
when doing an LVM snapshot of any Linux FS supporting such.  Thus,
snapshotting relies on xfs_freeze, not the other way round.  And
xfs_freeze doesn't suspend all processes that are writing to the
filesystem.  All write system calls to the filesystem are simply halted,
and the process blocks on IO until the filesystem is unfrozen.
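
For reference, a minimal sketch (mine; the mount point is invented,
and it needs root) of the VFS freeze interface that fsfreeze(8),
xfs_freeze and the LVM snapshot path sit on top of: FIFREEZE flushes
dirty data and blocks new write system calls against the filesystem,
FITHAW lets the blocked writers continue.

    #include <fcntl.h>
    #include <linux/fs.h>      /* FIFREEZE, FITHAW */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/mnt/raid10xfs", O_RDONLY);  /* the mount point */
        if (fd < 0) { perror("open"); return 1; }

        if (ioctl(fd, FIFREEZE, 0) != 0) {  /* write() callers now block */
            perror("FIFREEZE");
            return 1;
        }
        /* ... take the snapshot here ... */
        if (ioctl(fd, FITHAW, 0) != 0) {    /* blocked writers resume */
            perror("FITHAW");
            return 1;
        }
        close(fd);
        return 0;
    }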

> As the XFS team have been adding or planning to add various "new"
> features like checksums, maybe one day they will add COW to XFS
> too (not such an easy task when considering how large XFS extents
> can be, but the hole punching code can help there).

Not at all an easy rewrite of XFS.  And that's what COW would be, a
massive rewrite.  Copy on write definitely has some advantages for some
usage scenarios, but it's not yet been proven the holy grail of
filesystem design.

-- 
Stan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* NOW:Peter goading Dave over delaylog -  WAS: Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15 14:07       ` Peter Grandi
                         ` (2 preceding siblings ...)
  (?)
@ 2012-03-17  4:21       ` Stan Hoeppner
  2012-03-17 22:34         ` Dave Chinner
  -1 siblings, 1 reply; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-17  4:21 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS

On 3/15/2012 9:07 AM, Peter Grandi wrote:
>>>> I want to create a raid10,n2 using 3 1TB SATA drives.
>>>> I want to create an xfs filesystem on top of it. The
>>>> filesystem will be used as NFS/Samba storage.
> 
> Consider also an 'o2' layout (it is probably the same thing for a
> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems
> one of the few cases where RAID5 may be plausible.

It's customary to note in your message body when you decide to CC
another mailing list, and why.  I just got to my XFS folder and realized
you'd silently CC'd XFS.  This was unnecessary and simply added noise to
XFS.  Given some of your comments in this post I suspect you did so in
an effort to goad Dave into some kind of argument WRT delayed logging
performance, and his linux.conf.au presentation claims in general.
Doing this via subterfuge simply reduces people's level of respect for
you Peter.

If you want to have the delayed logging performance discussion/argument
with Dave, it should be its own thread on xfs@oss, not slipped into a
thread started on another list and CC'ed here.  I'm removing linux-raid
from the CC list of this message, as anything further in this discussion
topic is only relevant to XFS.

I'm guessing either Dave chose not to take your bait, or simply didn't
read your message.  If the former this thread will likely die now.  If
the latter, and Dave decides to respond, I'm grabbing some popcorn, a
beer, and a lawn chair. ;)

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-16  3:36             ` Jessie Evangelista
@ 2012-03-17 15:35               ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-17 15:35 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

[ ... ]

> I've read a recommendation to start the partition on the 1MB
> mark. Does this make sense?

As a general principle it is good, as it has almost no cost.
Indeed, recent versions of some partitioning tools do that by
default.

I often recommend aligning partitions to 1GiB, also because I
like to have 1GiB or so of empty space at the very beginning and
end of a drive.

> I'd like to read about the NFS blog entry but the link you
> included results in a 404.  I forgot to mention in my last
> reply.

Oops I forgot a bit of the URL:
  http://www.sabi.co.uk/blog/0707jul.html#070701b

Note that currently I suggest different values from:

 «vm/dirty_ratio                  =4
  vm/dirty_background_ratio       =2»

Because:

  * 4% of memory "dirty" today is often a gigantic amount.
    I had provided an elegant patch to specify the same in
    absolute terms in
      http://www.sabi.co.uk/blog/0707jul.html#070701
    but now the official way is the "_bytes" alternative.

  * 2% as the level at which writing becomes uncached is too
    low, and the system becomes unresponsive when that level is
    crossed. Sure it is risky, but, regretfully, I think that
    maintaining responsiveness is usually better than limiting
    outstanding background writes (a sketch of the "_bytes"
    settings follows below).
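
A minimal sketch (mine; the byte values are only examples, not a
recommendation) of setting the "_bytes" variants mentioned above.
Normally one would use sysctl(8) or /etc/sysctl.conf; this just shows
the /proc files involved, and it has to run as root.

    #include <stdio.h>

    static int write_sysctl(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        if (fputs(value, f) == EOF) { perror(path); fclose(f); return -1; }
        return fclose(f);
    }

    int main(void)
    {
        /* Example figures only: start background writeback at 64MiB of
         * dirty pages, make writers block on writeback past 256MiB. */
        int rc = 0;
        rc |= write_sysctl("/proc/sys/vm/dirty_background_bytes", "67108864");
        rc |= write_sysctl("/proc/sys/vm/dirty_bytes", "268435456");
        return rc ? 1 : 0;
    }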

> Based on what I understood from your thoughts above, if an
> application issues a flush/fsync and it does not complete due
> to some catastrophic crash, xfs on its own can not roll back
> to the prev version of the file in case of unfinished write
> operation. disabling the device caches wouldn't help either
> right?

If your goal is to make sure incomplete updates don't get
persisted, disabling device caches might help with that, in a
very perverse way (if the whole partial update is still in the
device cache, it just vanishes). Forget that of course :-).

The main message is that filesystems in UNIX-like systems should
not provide atomic transactions, just the means to do them at
the application level, because they are both difficult and very
expensive.
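
As an illustration of "the means to do them at the application level",
a minimal sketch (mine; the paths are invented, error handling is
abbreviated) of the usual pattern: write a temporary file, fsync() it,
rename() it over the old name, then fsync() the containing directory
so the rename itself is durable. Until the rename, the old version
stays intact; after it, readers only ever see the new one.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int replace_file(const char *dir, const char *tmp,
                            const char *final, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, strlen(data)) < 0 || fsync(fd) != 0) {
            close(fd);
            return -1;                   /* old version is still intact */
        }
        close(fd);

        if (rename(tmp, final) != 0)     /* atomic switch of the name */
            return -1;

        int dfd = open(dir, O_RDONLY);   /* make the rename durable too */
        if (dfd < 0) return -1;
        int rc = fsync(dfd);
        close(dfd);
        return rc;
    }

    int main(void)
    {
        return replace_file("/mnt/raid10xfs", "/mnt/raid10xfs/cfg.tmp",
                            "/mnt/raid10xfs/cfg", "key=value\n") ? 1 : 0;
    }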

The secondary message is that some applications and the firmware
of some host adapters and drives don't do the right thing, and
if you really want to make sure about atomic transactions it is
an expensive and difficult system integration challenge.

> [ ... ] only filesystems that do COW can do this at the
> expense of performance? (btrfs and zfs, please hurry and grow
> up!)

Filesystems that do COW sort-of do *global* "rolling" updates,
that is, filetree-level snapshots, but that's a side effect of a
choice made for other reasons (consistency more than currency).

> [ ... ] If you were in my place with the resource constraints,
> you'd go with: xfs with barriers on top of mdraid10 with
> device cache ON and setting vm/dirty_bytes, [ ... ]

Yes, that seems a reasonable overall tradeoff, because XFS is
implemented to provide well defined (and documented) semantics,
to check whether the underlying storage layer actually does
barriers, and to perform decently even if "delayed" writing is
not that delayed.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier (GiB alignment)
  2012-03-17 15:35               ` Peter Grandi
@ 2012-03-17 21:39                 ` Zdenek Kaspar
  -1 siblings, 0 replies; 65+ messages in thread
From: Zdenek Kaspar @ 2012-03-17 21:39 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS

Dne 17.3.2012 16:35, Peter Grandi napsal(a):
> I often recommend aligning partitions to 1GiB, also because I
> like to have 1GiB or so of empty space at the very beginning and
> end of a drive.

I'm really curious why you use such alignment. I can think of a few
reasons, but the most practical, I think, is that you like to slice in
gigabyte sizes.

Z.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-15  0:30 raid10n2/xfs setup guidance on write-cache/barrier Jessie Evangelista
  2012-03-15  5:38 ` Stan Hoeppner
@ 2012-03-17 22:10 ` Zdenek Kaspar
  1 sibling, 0 replies; 65+ messages in thread
From: Zdenek Kaspar @ 2012-03-17 22:10 UTC (permalink / raw)
  To: Jessie Evangelista; +Cc: linux-raid

Dne 15.3.2012 1:30, Jessie Evangelista napsal(a):
> I want to create a raid10,n2 using 3 1TB SATA drives.
> I want to create an xfs filesystem on top of it.
> The filesystem will be used as NFS/Samba storage.
> 
> mdadm --zero /dev/sdb1 /dev/sdc1 /dev/sdd1
> mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean
> --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1
> /dev/sdd1
> mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0
> mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0
> /mnt/raid10xfs
> 
> Will my files be safe even on sudden power loss? Is barrier=1 enough?
> Do i need to disable the write cache?
> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd
> 
> I tried it but performance is horrendous.
> 
> Am I better of with ext4? Data safety/integrity is the priority and
> optimization affecting it is not acceptable.
> 
> Thanks and any advice/guidance would be appreciated
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I think today you're safest with old ext3. Maybe data=journal is not
the best idea, because it has a small user base. The more aggressive
caching and features that lead to awesome performance will bite you
harder on power loss or software bugs.

Limit power loss exposure with a UPS unit. You need to cleanly shut
down the system when the UPS reaches its low capacity level. That's
mandatory IMO and still inexpensive. Next you can think about multiple
PSU/UPS units...

But just accept that you will never make it 100% safe, so use
damn good backups!

HTH, Z.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: NOW:Peter goading Dave over delaylog -  WAS: Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-17  4:21       ` NOW:Peter goading Dave over delaylog - WAS: " Stan Hoeppner
@ 2012-03-17 22:34         ` Dave Chinner
  2012-03-18  2:09             ` Peter Grandi
  0 siblings, 1 reply; 65+ messages in thread
From: Dave Chinner @ 2012-03-17 22:34 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Peter Grandi, Linux fs XFS

On Fri, Mar 16, 2012 at 11:21:49PM -0500, Stan Hoeppner wrote:
> I'm guessing either Dave chose not to take your bait, or simply didn't
> read your message.  If the former this thread will likely die now.  If
> the latter, and Dave decides to respond, I'm grabbing some popcorn, a
> beer, and a lawn chair. ;)

Just ignore the troll, Stan.  I've got code to write and bugs to fix
- I don't have time to waste on irrelevant semantic arguments about
the definition of "performance optimisation" for improvements that
are done, dusted and widely deployed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier (GiB alignment)
  2012-03-17 21:39                 ` Zdenek Kaspar
@ 2012-03-18  0:08                   ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-18  0:08 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

>> I often recommend aligning partitions to 1GiB, also because I
>> like to have 1GiB or so of empty space at the very beginning
>> and end of a drive.

> I'm really curious why you use such alignment. I can think
> of a few reasons, but the most practical, I think, is that you
> like to slice in gigabyte sizes.

Indeed, and to summarize:

* As mentioned before, I usually leave a chunk of unused space
  at the very start and end of a drive. This also because:
  - Many head landing accidents happen at the start or end of a
    drive.
  - Free space at the start: sometimes it is useful to have a
    few seconds of grace at the start when duplicating a drive
    to realize one has mistyped the name, and many partitioning
    or booting schemes can use a bit of free space at the start.
    Consider XFS and its use of sector 0.
  - Free space at the end: many drives have slightly different
    sizes, and this can cause problems when for example
    rebuilding arrays, or doing backups, and leaving some bit
    unused can avoid a lot of trouble.

* Having even sizes for partitions means that it may be easier
  to image copy them from one drive to another. I often do
  that. Indeed I usually create partitions of a few "standard"
  sizes, usually tailored to fit drives that tend also to come
  in fairly standard increments, because drive manufacturers in
  each new platter generation usually aim at a fairly standard
  factor of improvement. Standard drive sizes tend to be, in
  gigabytes: 80 160 250 500 1000 1500 2000 3000. Since 80 and
  160 are somewhat old and no longer used, I currently tend to
  do partitions in sizes like 230GiB, 460GiB, 920GiB etc. SSDs
  have somewhat complicated this.

* On a contemporary drive 1GiB is a rather small fraction of the
  capacity of a drive, so why not just align everything to 1GiB,
  even if it seems pretty large in absolute terms? And if you
  align the first partition to start at 1GiB, and leave free
  space at the end, it is farily natural to align everything in
  between on 1GiB boundaries.

In this as in many other cases I like to buy myself some extra
degrees of freedom if they are cheap.

Another example I have written about previously is specifying
advisedly chosen 'sunit' and 'swidth' even on non-RAID volumes,
or non-parity RAID setups, not because they really improve
things, but because the cost is minimal and it might come useful
later.

  Note: I *really* like to be able to do partition image copies,
    because they are so awesomely faster than treewise ones.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-17 22:34         ` Dave Chinner
@ 2012-03-18  2:09             ` Peter Grandi
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-18  2:09 UTC (permalink / raw)
  To: Linux fs XFS, Linux RAID

[ ... ]

> Just ignore the troll, Stan.

It is noticeable that Stan and you have chosen to write offtopic
"contributions" that contain purely personal attacks in reply to a
technical point about «guidance on write-cache/barrier», but I'll
try to keep ontopic:

> [ ... ] irrelevant semantic arguments about the definition of
> "performance optimisation" [ ... ]

Oops, here is instead a (handwaving) technical argument, so I
partially retract the above.

  Note: I have 'grep'ed for «"performance optimisation"» and it
    seems to me a made-up quote for this thread, and no argument
    has been made by me about the «definition of "performance
    optimisation"», and the above point seem to me a strong
    misrepresentation.

The (handwaving) technical argument above seems to me a laughable
attempt to lend respectability to the disregard for how important
the difference is between improving speed at the same (implicit)
safety level vs. doing so at a lower one, even more so as
(implicit) safety is an important theme in this thread, and my
argument (quite different from the above misrepresentation) was in
essence:

 «There have been decent but no major improvements in XFS metadata
  *performance*, but weaker implicit *semantics* have been made an
  option, and these have a different safety/performance tradeoff
  (less implicit safety, somewhat more performance), not "just"
  better performance.»

The relevance of pointing out that there is a big tradeoff is
demonstrated by the honest mention in 'delaylog.txt' that «the
potential for loss of metadata on a crash is much greater than for
the existing logging mechanism», which seems far from merely
«semantic arguments» as the potential for «many thousands of
transactions that simply did not occur as a result of the crash» is
not purely a matter of «semantic arguments», and indeed mattered a
lot to the topic of the thread, where the 'Subject:' is:

  «raid10n2/xfs setup guidance on write-cache/barrier»
                      ===============================

It seems to me that http://packages.debian.org/sid/eatmydata could
also be described boldly and barely as making a «significant
difference in [XFS] metadata performance» because its description
says «This has two side-effects: making software that writes data
safely to disk a lot quicker» even if it continues «and making this
software no longer crash safe.»

If considering both the speed and safety aspect is irrelevant
semantics, then it seems to me that:

  http://sandeen.net/wordpress/computers/fsync-sigh/

would be about «irrelevant semantic arguments» too, instead of
being a sensible discussion of tradeoffs between speed and safety.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-18  2:09             ` Peter Grandi
@ 2012-03-18 11:25               ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-18 11:25 UTC (permalink / raw)
  To: Linux fs XFS, Linux RAID

>  «There have been decent but no major improvements in XFS metadata
>   *performance*, but weaker implicit *semantics* have been made an
>   option, and these have a different safety/performance tradeoff
>   (less implicit safety, somewhat more performance), not "just"
>   better performance.»

I have left implicit a point that perhaps should be explicit: I
think that XFS metadata performance before 'delaylog' was pretty
good, and that it has remained pretty good with 'delaylog'.

People who complained about slow metadata performance with XFS
before 'delaylog' were in effect complaining that XFS was
implementing overly (in some sense) safe metadata semantics, and
in effect were demanding less (implicit) safety, without
probably realizing they were asking for that.

Accordingly, 'delaylog' offers less (implicit) safety, and it is
a good and legitimate option to have, in the same way that
'nobarrier' is also a good and legitimate option to have.

So in my view 'delaylog' cannot be boldly and barely described,
especially in this thread, as an improvement in XFS performance,
as it is an improvement in XFS's unsafety to obtain greater
speed, similar to but not as extensive as 'nobarrier'.
In the same way that 'eatmydata':

> The relevance of pointing out that there is a big tradeoff [ ... ]
> It seems to me that http://packages.debian.org/sid/eatmydata
> could also be described boldly and barely as making a
> «significant difference in [XFS] metadata performance» [ ... ]

is a massive improvement in unsafety as the name says.

Since the thread is about maximizing safety and implicit safety
too, technical arguments about changes in operational semantics
as to safety are entirely appropriate here, even if there are
those who don't "get" them.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-18 11:25               ` Peter Grandi
@ 2012-03-18 14:00                 ` Christoph Hellwig
  -1 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2012-03-18 14:00 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS, Linux RAID

On Sun, Mar 18, 2012 at 11:25:14AM +0000, Peter Grandi wrote:
> >  «There have been decent but no major improvements in XFS metadata
> >   *performance*, but weaker implicit *semantics* have been made an
> >   option, and these have a different safety/performance tradeoff
> >   (less implicit safety, somewhat more performance), not "just"
> >   better performance.»
> 
> I have left implicit a point that perhaps should be explicit: I
> think that XFS metadata performance before 'delaylog' was pretty
> good, and that it has remained pretty good with 'delaylog'.

For many workloads it absolutely wasn't.

> People who complained about slow metadata performance with XFS
> before 'delaylog' were in effect complaining that XFS was
> implementing overly (in some sense) safe metadata semantics, and
> in effect were demanding less (implicit) safety, without
> probably realizing they were asking for that.

No, they weren't, and as with most posts to the XFS and RAID lists
you are completely off the track.

Please read through Documentation/filesystems/xfs-delayed-logging-design.txt
and if you have any actual technical questions that you don't understand
feel free to come back and ask.

But please stop giving advice taken out of thin air to people on the
lists that might actually believe whatever madness you just dreamed up.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-18 11:25               ` Peter Grandi
@ 2012-03-18 18:08                 ` Stan Hoeppner
  -1 siblings, 0 replies; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-18 18:08 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS, Linux RAID

On 3/18/2012 6:25 AM, Peter Grandi wrote:

> So in my view 'delaylog' cannot be boldly and barely described,
> especially in this thread, as an improvement in XFS performance,
> as it is an improvement in XFS's unsafety to obtain greater
> speed, similar to but not as extensive as 'nobarrier'.

You have recommended in various past posts on multiple lists that users
should max out logbsize and logbufs to increase metadata performance.
You made no mention in those posts about safety as you have here.
Logbufs are in-memory journal write buffers and are volatile.  Delaylog
uses in-memory structures that are volatile.  So, why do you consider
logbufs to be inherently safer than delaylog?  Following the logic
you've used in this thread, both should be considered equally unsafe.
Yet I don't recall you ever preaching against logbufs in the past.  Is
it because logbufs can 'only' potentially lose 2MB worth of metadata
transactions, and delaylog can potentially lose more than 2MB?

> In the same way that 'eatmydata':

Hardly.  From: http://packages.debian.org/sid/eatmydata

"This package ... transparently disable fsync ... two side-effects: ...
writes data ... quicker ... no longer crash safe ... useful if
particular software calls fsync(), sync() etc. frequently but *the data
it stores is not that valuable to you* and you may *afford losing it in
case of system crash*."

So you're comparing delaylog's volatile buffer architecture to software
that *intentionally and transparently disables fsync*?  So do you
believe a similar warning should be attached to the docs for delaylog?
And thus to the use of logbufs as well?  How about all write
buffers/caches in the Linux kernel?

Where exactly do you draw the line Peter, between unsafe/safe use of
in-memory write buffers?  Is there some magical demarcation point
between synchronous serial IO, and having gigabytes of inflight write
data in memory buffers?

-- 
Stan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-18 14:00                 ` Christoph Hellwig
@ 2012-03-18 19:17                   ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-18 19:17 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

[ ... ]

>>>  «There have been decent but no major improvements in XFS
>>>  metadata *performance*, but weaker implicit *semantics*
>>>  have been made an option, and these have a different
>>>  safety/performance tradeoff (less implicit safety, somewhat
>>>  more performance), not "just" better performance.»

>> I have left implicit a point that perhaps should be explicit: I
>> think that XFS metadata performance before 'delaylog' was pretty
>> good, and that it has remained pretty good with 'delaylog'.

> For many workloads it absolutely wasn't.

My self-importance is not quite as huge as feeling able to just
say «absolutely wasn't» to settle points once and for all.

So I would rather argue (and I did in a different form) that for
some workloads 'nobarrier'+'hdparm -W1' or 'eatmydata' have the
most desirable tradeoffs, and for many others the safety/speed
tradeoff of 'delaylog' is more appropriate (so for example I
think that making it the default is reasonable if a bit edgy).

But also that, as the already quoted document makes very clear,
overall 'delaylog' improves unsafety, and only thanks to this are
latency and time to completion better:

http://lwn.net/Articles/476267/
http://www.mjmwired.net/kernel/Documentation/filesystems/xfs-delayed-logging-design.txt
124     [ ... ] In other
125	words, instead of there only being a maximum of 2MB of transaction changes not
126	written to the log at any point in time, there may be a much greater amount
127	being accumulated in memory. Hence the potential for loss of metadata on a
128	crash is much greater than for the existing logging mechanism.

That's why my argument was that performance without 'delaylog'
was good: given the safer semantics, it was quite good.

Just perhaps not the semantics tradeoff that some people wanted
in some cases, and I think that it is cheeky marketing to
describe something involving a much greater «potential for loss
of metadata» as better performance boldly and barely, as then
one could argue that 'eatmydata' gives the best "performance".

Note: the work on multithreading the journaling path is an
  authentic (and I guess amazingly tricky) performance
  improvement instead, not merely a new safety/latency/speed
  tradeoff similar to 'nobarrier' or 'eatmydata'.

>> People who complained about slow metadata performance with XFS
>> before 'delaylog' were in effect complaining that XFS was
>> implementing overly (in some sense) safe metadata semantics, and
>> in effect were demanding less (implicit) safety, without
>> probably realizing they were asking for that.

> No, they weren't,

Again my self-importance is not quite as huge as feeling able to
just say «No, they weren't» to settle points once and for all.

Here it is not clear to me what you mean by «they weren't», but
as the quote above shows, even if complainers weren't in effect
«demanding less (implicit) safety», that's what they got anyhow,
because that's the main (unavoidable) way to improve latency
massively given how expensive barriers are (at least on disk
devices). That's how the O_PONIES story goes...

> [ ... personal attacks ... ]

It is noticeable that 90% of your post is pure malicious
offtopic personal attack, and the rest is "from on high",
and the whole is entirely devoid of technical content.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-18 19:17                   ` Peter Grandi
@ 2012-03-19  9:07                     ` Stan Hoeppner
  -1 siblings, 0 replies; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-19  9:07 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS

On 3/18/2012 2:17 PM, Peter Grandi wrote:

> It is noticeable that 90% of your post is pure malicious
> offtopic personal attack, and the rest is "from on high",
> and the whole is entirely devoid of technical content.

It is noticeable that 100% of my post was technical content that
directly asked questions of you, yet you chose to respond to Christoph's
"personal attacks" while avoiding answering my purely technical questions.

I guess we can assume your silence, your unwillingness to answer my
questions, is a sign of capitulation.

-- 
Stan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-19  9:07                     ` Stan Hoeppner
@ 2012-03-20 12:34                       ` Jessie Evangelista
  -1 siblings, 0 replies; 65+ messages in thread
From: Jessie Evangelista @ 2012-03-20 12:34 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

Thank you everyone for your insights and comments.

I made a post of how I proceeded here:
http://blog.henyo.com/2012/03/cheap-and-safe-file-storage-on-linux.html

In summary, I ran some benchmarks with bonnie++ using different RAID
levels (5, 10n2, 10f2, 10o2), different chunk sizes (64, 128, 256,
512, 1024), different file systems (xfs, ext4) and different settings
for each file system.

I decided to go with ext4 because I wanted to make use of the
data=journal option which IMHO is safer albeit slower.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-18 18:08                 ` Stan Hoeppner
@ 2012-03-22 21:26                   ` Peter Grandi
  -1 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-22 21:26 UTC (permalink / raw)
  To: Linux RAID, Linux fs XFS

[ ... ]

>> So in my view 'delaylog' cannot be boldly and barely described,
>> especially in this thread, as an
>> improvement in XFS performance, as it is an improvement in
>> XFS's unsafety to obtain greater speed, similar to but not as
>> extensive as 'nobarrier'.

> You have recommended in various past posts on multiple lists
> that users should max out logbsize and logbufs to increase
> metadata performance.

Perhaps you confuse me with DaveC (or, see later, the XFS FAQ),
for example:

http://oss.sgi.com/archives/xfs/2010-09/msg00113.html
 «> Why isn't logbsize=256k default, when it's suggested most
  > of the time anyway?
  It's suggested when people are asking about performance
  tuning. When the performance is acceptible with the default
  value, then you don't hear about it, do you?»
http://oss.sgi.com/archives/xfs/2007-11/msg00918.html
 «# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 -d agcount=4 <dev>
  # mount -o logbsize=256k <dev> <mtpt>
  And if you don't care about filsystem corruption on power loss:
  # mount -o logbsize=256k,nobarrier <dev> <mtpt>»

> You made no mention in those posts about safety as you have
> here.

As to safety, this thread, by the explicit request of the
original poster, is about safety before speed. But I already
made this point above as in «especially in this thread».

Also, "logbufs" have been known for a long time to have an
unsafety aspect, for example there is a clear mention from 2001,
but also see the quote from the XFS FAQ below:

http://oss.sgi.com/archives/xfs/2001-05/msg03391.html
 «logbufs=4 or logbufs=8, this increases (from 2) the number
  of in memory log buffers. This means you can have more active
  transactions at once, and can still perform metadata changes
  while the log is being synced to disk. The flip side of this is
  that the amount of metadata changes which may be lost on crash
  is greater.»

That's "news" from over 10 years ago...

> Logbufs are in-memory journal write buffers and are volatile.
> Delaylog uses in-memory structures that are volatile. So, why do
> you consider logbufs to be inherently safer than delaylog?

That's a quote from the 'delaylog' documentation: «the potential
for loss of metadata on a crash is much greater than for the
existing logging mechanism».

> Following the logic you've used in this thread, both should be
> considered equally unsafe.

They are both unsafe (at least with applications that do not use
'fsync' appropriately), but not equally, as they have quite
different semantics and behaviour, as the quote above from the
'delaylog' docs states (and see the quote from the XFS FAQ below).

> Yet I don't recall you ever preaching against logbufs in the
> past.

Why should I preach against any of the safety/speed tradeoffs?
Each of them has a domain of usability, including 'nobarrier' or
'eatmydata', or even 'sync'.
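
To make that line of tradeoffs concrete, a minimal sketch, roughly
from safest/slowest to fastest/least safe (device, mount point and
archive name are only placeholders):

  mount -o sync      /dev/md0 /mnt/raid10xfs  # every write synchronous
  mount -o barrier   /dev/md0 /mnt/raid10xfs  # default: flush cache on commit
  mount -o nobarrier /dev/md0 /mnt/raid10xfs  # no flushes: only sane with persistent write cache
  eatmydata tar -x -f /tmp/some.tar           # LD_PRELOAD turns fsync() into a no-op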

> Is it because logbufs can 'only' potentially lose 2MB worth of
> metadata transactions, and delaylog can potentially lose more
> than 2MB?

That's a quote from the 'delaylog' documentation: «In other words,
instead of there only being a maximum of 2MB of transaction
changes not written to the log at any point in time, there may be
a much greater amount being accumulated in memory.» «What it does
mean is that as far as the recovered filesystem is concerned,
there may be many thousands of transactions that simply did not
occur as a result of the crash.»

> So you're comparing delaylog's volatile buffer architecture to
> software that *intentionally and transparently disables fsync*?

They are both speed-enhancing options. If 'delaylog' can be
compared with 'nobarrier' or 'sync' as to their effects on
performance, so can 'eatmydata'.

The point of comparing 'sync' or 'delaylog' to 'nobarrier' or to
'eatmydata' is to justify why I think that 'delaylog' «cannot be
boldly and barely described, especially in this thread, as an
improvement in XFS performance», because if the only thing
that matters is the improvement in speed, then 'nobarrier' or
'eatmydata' can give better performance than 'delaylog', and to me
that is an absurd argument.

> So do you believe a similar warning should be attached to the
> docs for delaylog?

You seem unaware that a similar warning is already part of the doc
for 'delaylog', and I have quoted it prominently before (and above).

> And thus to the use of logbufs as well?

You seem unaware that the XFS FAQ already states:

http://www.xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
 «For mount options, the only thing that will change metadata
  performance considerably are the logbsize and delaylog mount
  options.

  Increasing logbsize reduces the number of journal IOs for a
  given workload, and delaylog will reduce them even further.

  The trade off for this increase in metadata performance is
  that more operations may be "missing" after recovery if the
  system crashes while actively making modifications.»
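
For completeness, a sketch of how that could be toggled on the
2.6.35-2.6.38 kernels where delayed logging was still opt-in (it
became the default in 2.6.39, and the switch was later removed;
device and mount point are only placeholders):

  mount -o remount,delaylog,logbsize=256k /dev/md0 /mnt/raid10xfs
  mount -o remount,nodelaylog /dev/md0 /mnt/raid10xfs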

> How about all write buffers/caches in the Linux kernel?

Indeed it would be a very good idea given the poor level of
awareness of the downsides of more buffering/caching (not just
less safety, also higher latency and even lower overall
throughput in many cases).

But that discussion has already happened a few times, in the
various 'O_PONIES' discussions, as to that I have mentioned this
page for a weary summary of the story at some point in time:

http://sandeen.net/wordpress/computers/fsync-sigh/

 «So now we are faced with some decisions.  Should the filesystem
  put in hacks that offer more data safety than posix guarantees?
  Possibly. Probably. But there are tradeoffs. XFS, after giving
  up on the fsync-education fight long ago (note; fsync is pretty
  well-behaved on XFS) put in some changes to essentially fsync
  under the covers on close, if a file has been truncated (think
  file overwrite).»

Note the sad «XFS, after giving up on the fsync-education fight
long ago» statement. Also related to this, about defaulting to
safer implicit semantics:

 «But now we’ve taken that control away from the apps (did they
  want it?) and introduced behavior which may slow down some
  other workloads. And, perhaps worse, encouraged sloppy app
  writing because the filesystem has taken care of pushing stuff
  to disk when the application forgets (or never knew). I dunno
  how to resolve this right now.»

> Where exactly do you draw the line Peter, between unsafe/safe
> use of in-memory write buffers?

At the point where the application requirements draw it (or
perhaps a bit safer than that, "just in case").

For some applications it must be tight, for others it can be
loose. Quoting again from the 'delaylog' docs: «This makes it even
more important that applications that care about their data use
fsync() where they need to ensure application level data integrity
is maintained.» which seems a straight statement that the level of
safety is application-dependent.

For me 'delaylog' is just a point on a line of tradeoffs going
from 'sync' to 'nobarrier', it is useful as different point, but
it cannot be boldly and barely described as giving better
performance, anymore than 'nobarrier' can be boldly and barely
described as giving better performance than 'sync'.

Unless one boldly ignores the very different semantics, something
that the 'delaylog' documentation and the XFS FAQ don't do.

Overselling 'delaylog' with cheeky propaganda glossing over the
heavy tradeoffs involved is understandable, but quite wrong.

Again, XFS metadata performance without 'delaylog' was pretty
decent, even if speed was slow due to unusually safe semantics.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-22 21:26                   ` Peter Grandi
@ 2012-03-23  5:10                     ` Stan Hoeppner
  -1 siblings, 0 replies; 65+ messages in thread
From: Stan Hoeppner @ 2012-03-23  5:10 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS

On 3/22/2012 4:26 PM, Peter Grandi wrote:

[snipped 2-3 pages of redundant nonsense, linked docs, and filesystem
concepts everyone is already familiar with]

> Overselling 'delaylog' with cheeky propaganda glossing over the
> heavy tradeoffs involved is understandable, but quite wrong.

And now we come full circle to what started this mess of a discussion:
Peter's dislike of Dave's presentation of delaylog, and XFS in general,
at linux.conf.au.

Peter, if *you* had been giving Dave's presentation at linux.conf.au,
how would *you* have presented delayed logging differently?  How much
time would you have spent warning of the dangers of potential data loss
upon a crash and how would you have presented it?  Note I'm not asking
you to re-critique Dave's presentation.  I'm asking you to write your
own short presentation of the delayed logging feature, so we can all see
it done the right way, without "cheeky propaganda" and without "glossing
over the heavy tradeoffs".

We're all on the edge of our seats, eagerly awaiting your expert XFS
presentation Peter.

-- 
Stan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-22 21:26                   ` Peter Grandi
  (?)
  (?)
@ 2012-03-23 22:48                   ` Martin Steigerwald
  2012-03-24  1:27                     ` Peter Grandi
  -1 siblings, 1 reply; 65+ messages in thread
From: Martin Steigerwald @ 2012-03-23 22:48 UTC (permalink / raw)
  To: xfs

On Thursday, 22 March 2012, Peter Grandi wrote:
> Overselling 'delaylog' with cheeky propaganda glossing over the
> heavy tradeoffs involved is understandable, but quite wrong.

The thing is, as far as I understand Dave's slides and recent entries in 
Kernelnewbies' Linux Changes as well as the Heise Open kernel log, there 
have been - beside delaylog - quite a few other metadata-related 
performance improvements.

Thus IMHO reducing the recent improvements in metadata performance to 
delaylog is underselling XFS and overselling delaylog. Unless of course all those 
recent performance improvements could not have been done without the 
delaylog mode.

That said, this is just my interpretation. If all recent improvements are 
only due to delaylog, then I am obviously off track.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-23 22:48                   ` Martin Steigerwald
@ 2012-03-24  1:27                     ` Peter Grandi
  2012-03-24 16:27                       ` GNU 'tar', Schilling's 'tar', write-cache/barrier Peter Grandi
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Grandi @ 2012-03-24  1:27 UTC (permalink / raw)
  To: Linux fs XFS

>> Overselling 'delaylog' with cheeky propaganda glossing over
>> the heavy tradeoffs involved is understandable, but quite
>> wrong.

> [ ... ] there has been quite some other metadata related
> performance improvements. Thus IMHO reducing the recent
> improvements in metadata performance is underselling XFS and
> overselling delaylog. [ ... ]

That's a good way of putting it, and I am pleased that I finally
get a reasonable comment on this story, and one that agrees with
one of my previous points in this thread:

http://www.spinics.net/lists/raid/msg37931.html
 «Note: the work on multithreading the journaling path is an
    authentic (and I guess amazingly tricky) performance
    improvement instead, not merely a new safety/latency/speed
    tradeoff similar to 'nobarrier' or 'eatmydata'.»

There are two reasons why I rate the multithreading work as more
important than the 'delaylog' work:

  * It is a *net* improvement, as it increases the potential and
    actual retirement rate of metadata operations without adverse
    impact.

  * It improves XFS in the area where it is strongest, which is
    massive and multithreaded workloads, on reliable storage
    systems with large IOPS.

Conversely, 'delaylog' does not improve the XFS performance
envelope; it seems a crowd-pleasing yet useful intermediate
tradeoff between 'sync' and 'nobarrier', and the standard
documents about XFS tuning make it clear that XFS is really meant
to run on reliable and massive storage layers with 'nobarrier',
and it is/was not aimed at «untarring kernel tarballs» with
'barrier' on.

My suspicion is that 'delaylog' therefore is in large part a
marketing device to match 'ext4' in unsafety and therefore in
apparent speed for "popular" systems, as an argument to stop
investing in 'ext4' and continue to invest in XFS.

Consider DaveC's famous presentation (the one in which he makes
absolutely no mention of the safety/speed tradeoff of 'delaylog'):

http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
 «There's a White Elephant in the Room....
  * With the speed, performance and capability of XFS and the
    maturing of BTRFS, why do we need EXT4 anymore?»

That's a pretty big tell :-). I agree with it BTW.

In the same presentation earlier there are also these other
interesting points:

http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
 «* Ext4 can be up 20-50x times than XFS when data is also being
    written as well (e.g. untarring kernel tarballs).
  * This is XFS @ 2009-2010.
  * Unless you have seriously fast storage, XFS just won't
    perform well on metadata modification heavy workloads.»

It is never mentioned that 'ext4' is 20-50x faster on metadata
modification workloads because it implements much weaker
semantics than «XFS @ 2009-2010», and that 'delaylog' matches
'ext4' because it implements similarly weaker semantics, by
reducing the frequency of commits, as the XFS FAQ briefly
summarizes:

http://www.xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
 «Increasing logbsize reduces the number of journal IOs for a
  given workload, and delaylog will reduce them even further.
  The trade off for this increase in metadata performance is
  that more operations may be "missing" after recovery if the
  system crashes while actively making modifications.»

As should be obvious by now, I think that is an outrageously
cheeky omission from the «filesystem of the future» presentation,
an omission that makes «XFS @ 2009-2010» seem much worse than it
really was/is, making 'delaylog' then seem a more significant
improvement than it is, or as you wrote «underselling XFS and
overselling delaylog».

Note: I wrote «improvement» above because 'delaylog' is indeed
  an improvement, but not to the performance of XFS, but to its
  functionality/flexibility: it is significant as an additional
  and useful speed/safety tradeoff, not as a speed improvement.

The last point above «Unless you have seriously fast storage»
gives away the main story: metadata intensive workloads are
mostly random access workloads, and random access workloads get
out of typical disk drives around 1-2MB/s, which means that if
you play it safe and commit modifications frequently, you need a
storage layer with massive IOPS indeed.

For what I think are essentially marketing reasons, 'ext3' and
'ext4' try to be "popular" filesystems (consider the quote from
Eric Sandeen's blog about the O_PONIES issue), and this has
caused a lot of problems, and 'delaylog' seems to be an attempt
to compete with 'ext4' in "popular" appeal.

It may be good salesmanship for whoever claims the credit for
'delaylog', but advertising a massive speed improvement with
colourful graphs without ever mentioning the massive improvement
in unsafety seems quite cheeky to me, and I guess to you too.

BTW some other interesting quotes from DaveC, the first about the
aim of 'delaylog' to compete with 'ext4' on low end systems:

http://lwn.net/Articles/477278/
 «That's *exactly* the point of my talk - to smash this silly
  stereotype that XFS is only for massive, expensive servers and
  storage arrays. It is simply not true - there are more consumer
  NAS devices running XFS in the world than there are servers
  running XFS. Not to mention DVRs, or the fact that even TVs
  these days run XFS.»

Another one instead on the impact of the locking improvements,
where metadata operations now can use many CPUs instead of the
previous limit of one:

http://oss.sgi.com/archives/xfs/2010-08/msg00345.html
 «I'm getting a 8core/16thread server being CPU bound with
  multithreaded unlink workloads using delaylog, so it's entirely
  possible that all CPU cores are fully utilised on your machine.»
http://lwn.net/Articles/476617/
 «I even pointed out in the talk some performance artifacts in
  the distribution plots that were a result of separate threads
  lock-stepping at times on AG resources, and that increasing the
  number of AGs solves the problem (and makes XFS even faster!)
  e.g. at 8 threads, XFS unlink is about 20% faster when I
  increase the number of AGs from 17 to 32 on teh same test rig.

  If you have a workload that has a heavy concurrent metadata
  modification workload, then increasing the number of AGs might
  be a good thing. I tend to use 2x the number of CPU cores as a
  general rule of thumb for such workloads but the best tunings
  are highly depended on the workload so you should start just by
  using the defaults. :)»
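
As a hedged illustration of that rule of thumb (the core count and
device name are made up):

  # hypothetical 16-core box with a metadata-modification-heavy
  # workload: roughly 2 allocation groups per core
  mkfs.xfs -f -d agcount=32 /dev/md0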

An interesting quote from an old (1996) design document for XFS
where the metadata locking issue was acknowledged:

http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html
 «In order to support the parallelism of such a machine, XFS has
  only one centralized resource: the transaction log. All other
  resources in the file system are made independent either across
  allocation groups or across individual inodes. This allows
  inodes and blocks to be allocated and freed in parallel
  throughout the file system. The transaction log is the most
  contentious resource in XFS.»

 «As long as the log can be written fast enough to keep up with
  the transaction load, the fact that it is centralized is not a
  problem. However, under workloads which modify large amount of
  metadata without pausing to do anything else, like a program
  constantly linking and unlinking a file in a directory, the
  metadata update rate will be limited to the speed at which we
  can write the log to disk.»

It is remarkable that it has taken ~15 years before the
implementation needed improving.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* GNU 'tar', Schilling's 'tar', write-cache/barrier
  2012-03-24  1:27                     ` Peter Grandi
@ 2012-03-24 16:27                       ` Peter Grandi
  2012-03-24 17:11                         ` Brian Candler
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Grandi @ 2012-03-24 16:27 UTC (permalink / raw)
  To: Linux fs XFS

>> [ ... ] there has been quite some other metadata related
>> performance improvements. Thus IMHO reducing the recent
>> improvements in metadata performance is underselling XFS and
>> overselling delaylog. [ ... ]

> That's a good way of putting it, and I am pleased that I finally
> get a reasonable comment on this story, and one that agrees with
> one of my previous points in this thread: [ ... ]
[ ... ]
> http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
>  «* Ext4 can be up 20-50x times than XFS when data is also being
>     written as well (e.g. untarring kernel tarballs).
>   * This is XFS @ 2009-2010.
>   * Unless you have seriously fast storage, XFS just won't
>     perform well on metadata modification heavy workloads.»

> It is never mentioned that 'ext4' is 20-50x faster on metadata
> modification workloads because it implements much weaker
> semantics than «XFS @ 2009-2010», and that 'delaylog' matches
> 'ext4' because it implements similarly weaker semantics, by
> reducing the frequency of commits, as the XFS FAQ briefly
> summarizes: [ ... ]

As to this, I have realized that there is a very big detail that
I have left implicit but that perhaps at this point should be
made explicit, as to the deliberately misleading propaganda
that «Ext4 can be up 20-50x times than XFS when data is also
being written as well (e.g. untarring kernel tarballs).»:

  Almost all «untarring kernel tarballs» "benchmarks" are done
  with GNU 'tar', and it does not 'fsync'.

This matters because XFS has done the "right thing" with 'fsync'
for a long time, and if the application does 'fsync' then 'ext4',
XFS without and with 'delaylog' are mostly equivalent.

Conversely Schilling's 'tar' does 'fsync' and as a result it is
often considered (by the gullible crowd to which the presentation
propaganda referred to above is addressed) to have less
"performance" than GNU 'tar'.

To illustrate I have done a tiny test '.tar' file with a
directory and two files within, and this is what happens with
Schilling's 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write star xf d.tar
  open("d.tar", O_RDONLY)                 = 7
  read(7, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
  Process 8201 attached
  [ ... ]
  [pid  8200] lstat("d/", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid  8200] lstat("d/", 0x7fff174d9330) = -1 ENOENT (No such file or directory)
  [pid  8200] access("d", F_OK)           = -1 ENOENT (No such file or directory)
  [pid  8200] mkdir("d", 0700)            = 0
  [pid  8200] lstat("d/", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
  [pid  8200] lstat("d/f1", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid  8200] open("d/f1", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  [pid  8200] write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  [pid  8200] fsync(4 <unfinished ...>
  [pid  8201] <... write resumed> )       = 1
  [pid  8201] read(7, "", 10240)          = 0
  Process 8201 detached
  <... fsync resumed> )                   = 0
  --- SIGCHLD (Child exited) @ 0 (0) ---
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  lstat("d/f2", 0x7fff174d9490)           = -1 ENOENT (No such file or directory)
  open("d/f2", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  fsync(4)                                = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d", {{1332588242, 0}, {1332588242, 0}}) = 0
  write(2, "star: 1 blocks + 0 bytes (total "..., 58star: 1 blocks + 0 bytes (total of 10240 bytes = 10.00k).
  ) = 58

Compare with GNU 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write tar xf d.tar
  [ ... ]
  open("d.tar", O_RDONLY)                 = 3
  read(3, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240
  [ ... ]
  mkdir("d", 0700)                        = -1 EEXIST (File exists)
  stat("d", {st_mode=S_IFDIR|0700, st_size=24, ...}) = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f1")                          = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  close(4)                                = 0
  utimensat(AT_FDCWD, "d/f1", {{1332589368, 193330071}, {1332588240, 0}}, 0) = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f2")                          = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  close(4)                                = 0
  utimensat(AT_FDCWD, "d/f2", {{1332589368, 193330071}, {1332588257, 0}}, 0) = 0
  close(3)                                = 0
  utimensat(AT_FDCWD, "d", {{1332589368, 193330071}, {1332588242, 0}}, 0) = 0
  close(1)                                = 0
  close(2)                                = 0

In effect running 'tar x' (GNU 'tar') is the same as running
'eatmydata tar x ...'; and indeed as its documentation says,
'eatmydata' is designed to achieve higher "performance" by
turning programs that behave like Schilling's 'tar' into programs
that behave like GNU 'tar'.
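
One can check this directly by letting 'strace' count the calls (a
sketch; the archive name is just the one used in the timings below):

  # '-c' prints a per-syscall summary: star issues one fsync() per
  # extracted file, GNU tar issues none
  strace -f -c -e trace=fsync,fdatasync star -x -f /tmp/linux-2.6.32.tar
  strace -f -c -e trace=fsync,fdatasync tar -x -f /tmp/linux-2.6.32.tar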

When GNU 'tar' is used as a "benchmark" for 'delaylog' and there
are no 'fsync's, the longer the interval between commits (and
thus the greater the implicit unsafety), the higher the "performance", or at
least that's the argument I think propagandists and buffoons may
be using.

That's one important reason why I mentioned 'eatmydata' as one
performance enhancing technique in a group with 'nobarrier' and
'delaylog'; and why I was amused by this buffoonery:

 «So you're comparing delaylog's volatile buffer architecture to
  software that *intentionally and transparently disables fsync*?»

Because when the 'delaylog' propagandists write that:

  «Ext4 can be up 20-50x times than XFS when data is also being
   written as well (e.g. untarring kernel tarballs).»

it is them who are comparing "performance" using GNU 'tar' which
intentionally and transparently does not use at all 'fsync'.

To illustrate here are some "benchmarks", which hopefully should
be revealing as to the merit of the posturings of some of the
buffoons or propagandists that have been discontributing to this
discussion (note that there are somewhat subtle details both as
to the setup and the results):

--------------------------------------------------------------
#  uname -a
Linux base.ty.sabi.co.uk 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw 0 0
/dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
/dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 500000000
vm.dirty_expire_centisecs = 2000
vm.dirty_writeback_centisecs = 1000
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m1.027s
user    0m0.105s
sys     0m0.922s
Dirty:          419700 kB
Writeback:           0 kB

real    0m5.163s
user    0m0.000s
sys     0m0.473s
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    0m1.204s
user    0m0.139s
sys     0m1.270s
Dirty:          419456 kB
Writeback:           0 kB

real    0m5.012s
user    0m0.000s
sys     0m0.458s
--------------------------------------------------------------
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    23m29.346s
user    0m0.327s
sys     0m2.280s
Dirty:             108 kB
Writeback:           0 kB

real    0m0.236s
user    0m0.000s
sys     0m0.199s
--------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m46.554s
user    0m0.107s
sys     0m1.271s
Dirty:          415168 kB
Writeback:           0 kB

real    1m54.913s
user    0m0.000s
sys     0m0.325s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    60m15.723s
user    0m0.442s
sys     0m7.009s
Dirty:               4 kB
Writeback:           0 kB

real    0m0.222s
user    0m0.000s
sys     0m0.194s
----------------------------------------------------------------

From the above my conclusion is that «XFS @ 2009-2010» has half the
performance of 'ext4' on this workload, and that «Ext4 can be up
20-50x times than XFS when data is also being written as well
(e.g. untarring kernel tarballs).» holds only when both data and
metadata are written to RAM by 'ext4'.

One can spend a lot of time changing parameters, as in using
'delaylog' or 'nobarrier' etc.

I have tried, with my favourite rather "tighter" flusher
parameters, some comparisons that I find interesting:

----------------------------------------------------------------
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw 0 0
/dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
/dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 100000
vm.dirty_expire_centisecs = 200
vm.dirty_writeback_centisecs = 100
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m6.776s
user    0m0.107s
sys     0m1.260s
Dirty:            1776 kB
Writeback:           0 kB

real    0m0.231s
user    0m0.000s
sys     0m0.197s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    2m25.805s
user    0m0.135s
sys     0m1.812s
Dirty:            2372 kB
Writeback:          84 kB

real    0m1.683s
user    0m0.000s
sys     0m0.196s
----------------------------------------------------------------

That's a bit of a surprise, because when the flusher parameters
allowed writing entirely to memory, the time to completion with
'eatmydata tar' was the same on both. It looks like, when
flushing, 'xfs' still does a fair bit of implicit metadata
commits, as switching off barriers shows:

----------------------------------------------------------------
#  mount -o remount,barrier=0 /dev/sdd3 /tmp/ext4
#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m7.388s
user    0m0.127s
sys     0m1.235s
Dirty:             508 kB
Writeback:           0 kB

real    0m0.243s
user    0m0.000s
sys     0m0.199s
----------------------------------------------------------------
#  mount -o remount,nobarrier /dev/sdd8 /tmp/xfs
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m31.047s
user    0m0.124s
sys     0m1.880s
Dirty:            2324 kB
Writeback:          24 kB

real    0m0.269s
user    0m0.000s
sys     0m0.195s
----------------------------------------------------------------

It seems likely that 'ext4' runs headlong without commits on
either metadata or data ('ext4' and 'ext3' in effect have a
rather loose 'delaylog'). XFS, however, seems to be a bit at a
disadvantage, as with 'nobarrier' and 'eatmydata tar' the
time to completion should be the same. The partition for XFS is
on inner tracks, but that does not make that much of a
difference.

Also compare 'ext4' with no barriers and 'data=writeback', first
using GNU 'tar' (in effect 'eatmydata tar') and then using 'star':

----------------------------------------------------------------
  base#  umount /tmp/ext4; mount -t ext4 -o defaults,barrier=0,data=writeback /dev/sdd3 /tmp/ext4
  base#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m6.158s
user    0m0.123s
sys     0m1.233s
Dirty:            1704 kB
Writeback:           0 kB

real    0m0.247s
user    0m0.001s
sys     0m0.194s
----------------------------------------------------------------
  base#  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    0m32.101s
user    0m0.196s
sys     0m1.718s
Dirty:              24 kB
Writeback:          48 kB

real    0m0.217s
user    0m0.000s
sys     0m0.193s
----------------------------------------------------------------

Finally, here is XFS with 'delaylog', on a system with a
3.x kernel and a rather fast (especially on small random writes)
SSD drive (and my usual tighter flusher parameters):

----------------------------------------------------------------
#  uname -a
Linux.ty.sabi.co.UK 3.0.0-15-generic #26~lucid1-Ubuntu SMP Wed Jan 25 15:37:10 UTC 2012 x86_64 GNU/Linux
#  egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl -a 2>/dev/null | egrep '_(bytes|centisecs)' | sort
none /tmp tmpfs rw,relatime,size=1024000k 0 0
/dev/sda6 /tmp/xfs xfs rw,noatime,nodiratime,attr2,delaylog,discard,inode64,logbsize=256k,sunit=16,swidth=8192,noquota 0 0
/dev/sda3 /tmp/ext4 ext4 rw,nodiratime,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered,discard 0 0
fs.xfs.age_buffer_centisecs = 1500
fs.xfs.filestream_centisecs = 3000
fs.xfs.xfsbufd_centisecs = 100
fs.xfs.xfssyncd_centisecs = 3000
vm.dirty_background_bytes = 900000000
vm.dirty_bytes = 100000000
vm.dirty_expire_centisecs = 200
vm.dirty_writeback_centisecs = 100
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

real    0m5.148s
user    0m0.300s
sys     0m2.876s
Dirty:             50052 kB
Writeback:             0 kB
WritebackTmp:          0 kB

real    0m0.784s
user    0m0.000s
sys     0m0.100s
----------------------------------------------------------------
#  (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

real    6m21.946s
user    0m0.808s
sys     0m11.321s
Dirty:                 0 kB
Writeback:             0 kB
WritebackTmp:          0 kB

real    0m0.097s
user    0m0.000s
sys     0m0.044s
----------------------------------------------------------------

The effect of 'delaylog' is pretty obvious there.

The numbers above with their wide variation depending on changes
in the level of safety requested amply demonstrate that it takes
the skills of a propagandist or a buffoon to boast about the
"performance" of 'delaylog' and comparisons with 'ext4' without
prominently mentioning the big safety tradeoffs involved.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: GNU 'tar', Schilling's 'tar', write-cache/barrier
  2012-03-24 16:27                       ` GNU 'tar', Schilling's 'tar', write-cache/barrier Peter Grandi
@ 2012-03-24 17:11                         ` Brian Candler
  2012-03-24 18:35                           ` Peter Grandi
  0 siblings, 1 reply; 65+ messages in thread
From: Brian Candler @ 2012-03-24 17:11 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs XFS

On Sat, Mar 24, 2012 at 04:27:19PM +0000, Peter Grandi wrote:
> --------------------------------------------------------------
> #  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
> star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).
> 
> real    0m1.204s
> user    0m0.139s
> sys     0m1.270s
> Dirty:          419456 kB
> Writeback:           0 kB
> 
> real    0m5.012s
> user    0m0.000s
> sys     0m0.458s
> --------------------------------------------------------------
> #  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
> star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).
> 
> real    23m29.346s
> user    0m0.327s
> sys     0m2.280s
> Dirty:             108 kB
> Writeback:           0 kB
> 
> real    0m0.236s
> user    0m0.000s
> sys     0m0.199s

But as a user, what guarantees do I *want* from tar?

I think the only meaningful guarantee I might want is: "if the tar returns
successfully, I want to know that all the files are persisted to disk".  And
of course that's what your final "sync" does, although with the unfortunate
side-effect of syncing all other dirty blocks in the system too.

Calling fsync() after every single file is unpacked does also achieve the
desired guarantee, but at a very high cost.  This is partly because you have
to wait for each fsync() to return [although I guess you could spawn threads
to do them] but also because the disk can't aggregate lots of small writes
into one larger write, even when the filesystem has carefully allocated them
in adjacent blocks.

I think what's needed is a group fsync which says "please ensure this set of
files is all persisted to disk", which is done at the end, or after every N
files.  If such an API exists I don't know of it.
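
The closest thing I know of (just a sketch, not something I have
measured; the paths are examples, and 'sync --file-system' needs a
newish coreutils) is syncfs(2), which flushes a single filesystem
rather than everything -- still coarser than a per-file-set sync,
but narrower than a global sync:

  tar -x -f big.tar -C /mnt/raid10xfs
  # calls syncfs(2) on that filesystem; with older tools, plain 'sync'
  sync --file-system /mnt/raid10xfs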

On the flip side, does fsync()ing each individual file buy you anything over
and above the desired guarantee?  Possibly - in theory you could safely
restart an aborted untar even through a system crash.  You would have to be
aware that the last file which was unpacked may only have been partially
written to disk, so you'd have to restart by overwriting the last item in
the archive which already exists on disk.  Maybe star has this feature, I
don't know.  And unlike zip, I don't think tarfiles are indexed, so you'd
still have to read it from the beginning.

If the above benchmark is typical, it suggests that fsyncing after every
file is 4 times slower than untar followed by sync.  So I reckon you would
be better off using the fast/unsafe version, and simply restarting it from
the beginning if the system crashed while you were running it.  That's
unless you expect the system to crash 4 or more times while you untar this
single file.

Just my 2¢, as a user and definitely not a filesystem expert.

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: GNU 'tar', Schilling's 'tar', write-cache/barrier
  2012-03-24 17:11                         ` Brian Candler
@ 2012-03-24 18:35                           ` Peter Grandi
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Grandi @ 2012-03-24 18:35 UTC (permalink / raw)
  To: Linux fs XFS

[ ... ]

>> #  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
>> real    0m1.204s
>> Dirty:          419456 kB
>> real    0m5.012s

>> #  (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
>> real    23m29.346s
>> Dirty:             108 kB
>> real    0m0.236s

> But as a user, what guarantees do I *want* from tar?

Ahhhh, but that depends *a lot* on the application, which may or
may not be 'tar', and on what you are using 'tar' for. Consider for
example restoring a backup using RSYNC instead of 'tar'.

> I think the only meaningful guarantee I might want is: "if the
> tar returns successfully, I want to know that all the files
> are persisted to disk".

Perhaps in some cases, but perhaps in others not. For example if
you are restoring 20TB, having to redo the whole 20TB or a
significant fraction may be undesirable, and you would like to
change the guarantee, as you write later:

  > On the flip side, does fsync()ing each individual file [
  > ... ] you could safely restart an aborted untar [ ... ] the
  > last file which was unpacked may only have been partially
  > written to disk [ ... ]

to add "if the tar does not return successfully, I want to know
that most or all the files are persisted, except the last one
that was only partially written, which I want to disappear, so I
can rerun 'tar -x -k' and only restore the rest of the files".

> And of course that's what your final "sync" does, although
> with the unfortunate side-effect of syncing all other dirty
> blocks in the system too.

Just to be sure: that was on a quiescent system, so in the
particular case of my tests it was just on the 'tar'.

[ ... ]

> I think what's needed is a group fsync which says "please
> ensure this set of files is all persisted to disk", which is
> done at the end, or after every N files.  If such an API
> exists I don't know of it.

That's in part what is mentioned here:

[ ... ]

> If the above benchmark is typical, it suggests that fsyncing
> after every file is 4 times slower than untar followed by
> sync.

Depends on how often and how aggressively the flusher runs, and
how much memory you have. In the comparison quoted above, GNU
'tar' on 'ext4' dumps 410MB into RAM in just over 1 second plus
5 seconds for 'sync', and Schilling's 'tar' persists the lot to
disk, incrementally, in 1409 seconds. The ratio is 227 times.

Because that's a typical disk drive that can either do around
100MB/s with bulk sequential IO (thus the 5 seconds 'sync') or
around 0.5-4MB/s with small random IO.

> So I reckon you would be better off using the fast/unsafe
> version, and simply restarting it from the beginning if the
> system crashed while you were running it. [ ... ]

That's in one very specific example with one application in one
context. As to this, for a better discussion, let's go back to
your original and very appropriate question:

  > But as a user, what guarantees do I *want* from tar?

The question is very sensible as far as it goes, but it does not
go far enough, because «from tar» and small 'tar' archives are
just happenstance: what you should ask yourself is:

  But as a user, what guarantees do I *want* from filesystems
  and the applications that use them?

That's in essence the O_PONIES question.

That question can have many answers each of them addressing a
different aspect of normative and positive situation, and I'll
try to list some.

The first answer is that you want to be able to choose different
guarantees and costs, and know which they are. In this respect
'delaylog', properly described as an improvement in both
unsafety and speed, is a good thing to have, because it is often
a useful option. So are 'sync', 'nobarrier', and 'eatmydata'.

The second answer is that as a rule users don't have the
knowledge or the desire to understand the tradeoffs offered by
filesystems and how they relate to the behavior of the programs
(including 'tar') that they use, so there needs to be a default
guarantee that most users would have chosen if they could, and
this should be about more safety rather than more speed, and
this was what «XFS @ 2009-2010» was doing.

More to follow...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: raid10n2/xfs setup guidance on write-cache/barrier
  2012-03-17 15:35               ` Peter Grandi
  (?)
  (?)
@ 2012-03-26 19:50               ` Martin Steigerwald
  -1 siblings, 0 replies; 65+ messages in thread
From: Martin Steigerwald @ 2012-03-26 19:50 UTC (permalink / raw)
  To: xfs

On Saturday, 17 March 2012, Peter Grandi wrote:
> > I'd like to read about the NFS blog entry but the link you
> > included results in a 404.  I forgot to mention in my last
> > reply.
> 
> Oops I forgot a bit of the URL:
>   http://www.sabi.co.uk/blog/0707jul.html#070701b
> 
> Note that currently I suggest different values from:
> 
>  «vm/dirty_ratio                  =4
>   vm/dirty_background_ratio       =2»

Consider dirty_background_bytes, as that's independent of the amount of 
installed memory.
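
A minimal sketch (the byte values are purely illustrative, not a
recommendation):

  # absolute limits, independent of RAM size; setting a *_bytes knob
  # zeroes the corresponding *_ratio knob
  sysctl -w vm.dirty_background_bytes=100000000
  sysctl -w vm.dirty_bytes=200000000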

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2012-03-26 19:50 UTC | newest]

Thread overview: 65+ messages
2012-03-15  0:30 raid10n2/xfs setup guidance on write-cache/barrier Jessie Evangelista
2012-03-15  5:38 ` Stan Hoeppner
2012-03-15 12:06   ` Jessie Evangelista
2012-03-15 14:07     ` Peter Grandi
2012-03-15 14:07       ` Peter Grandi
2012-03-15 15:25       ` keld
2012-03-15 15:25         ` keld
2012-03-15 16:52         ` Jessie Evangelista
2012-03-15 16:52           ` Jessie Evangelista
2012-03-15 17:15           ` keld
2012-03-15 17:15             ` keld
2012-03-15 17:40             ` keld
2012-03-15 17:40               ` keld
2012-03-15 16:18       ` Jessie Evangelista
2012-03-15 16:18         ` Jessie Evangelista
2012-03-15 23:00         ` Peter Grandi
2012-03-15 23:00           ` Peter Grandi
2012-03-16  3:36           ` Jessie Evangelista
2012-03-16  3:36             ` Jessie Evangelista
2012-03-16 11:06             ` Michael Monnerie
2012-03-16 11:06               ` Michael Monnerie
2012-03-16 12:21               ` Peter Grandi
2012-03-16 12:21                 ` Peter Grandi
2012-03-16 17:15             ` Brian Candler
2012-03-16 17:15               ` Brian Candler
2012-03-17 15:35             ` Peter Grandi
2012-03-17 15:35               ` Peter Grandi
2012-03-17 21:39               ` raid10n2/xfs setup guidance on write-cache/barrier (GiB alignment) Zdenek Kaspar
2012-03-17 21:39                 ` Zdenek Kaspar
2012-03-18  0:08                 ` Peter Grandi
2012-03-18  0:08                   ` Peter Grandi
2012-03-26 19:50               ` raid10n2/xfs setup guidance on write-cache/barrier Martin Steigerwald
2012-03-17  4:21       ` NOW:Peter goading Dave over delaylog - WAS: " Stan Hoeppner
2012-03-17 22:34         ` Dave Chinner
2012-03-18  2:09           ` Peter Grandi
2012-03-18  2:09             ` Peter Grandi
2012-03-18 11:25             ` Peter Grandi
2012-03-18 11:25               ` Peter Grandi
2012-03-18 14:00               ` Christoph Hellwig
2012-03-18 14:00                 ` Christoph Hellwig
2012-03-18 19:17                 ` Peter Grandi
2012-03-18 19:17                   ` Peter Grandi
2012-03-19  9:07                   ` Stan Hoeppner
2012-03-19  9:07                     ` Stan Hoeppner
2012-03-20 12:34                     ` Jessie Evangelista
2012-03-20 12:34                       ` Jessie Evangelista
2012-03-18 18:08               ` Stan Hoeppner
2012-03-18 18:08                 ` Stan Hoeppner
2012-03-22 21:26                 ` Peter Grandi
2012-03-22 21:26                   ` Peter Grandi
2012-03-23  5:10                   ` Stan Hoeppner
2012-03-23  5:10                     ` Stan Hoeppner
2012-03-23 22:48                   ` Martin Steigerwald
2012-03-24  1:27                     ` Peter Grandi
2012-03-24 16:27                       ` GNU 'tar', Schilling's 'tar', write-cache/barrier Peter Grandi
2012-03-24 17:11                         ` Brian Candler
2012-03-24 18:35                           ` Peter Grandi
2012-03-16 12:25     ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner
2012-03-16 18:01       ` Jon Nelson
2012-03-16 18:03         ` Jon Nelson
2012-03-16 19:28           ` Peter Grandi
2012-03-16 19:28             ` Peter Grandi
2012-03-17  0:02             ` Stan Hoeppner
2012-03-17  0:02               ` Stan Hoeppner
2012-03-17 22:10 ` Zdenek Kaspar
