* raid10n2/xfs setup guidance on write-cache/barrier @ 2012-03-15 0:30 Jessie Evangelista 2012-03-15 5:38 ` Stan Hoeppner 2012-03-17 22:10 ` Zdenek Kaspar 0 siblings, 2 replies; 65+ messages in thread From: Jessie Evangelista @ 2012-03-15 0:30 UTC (permalink / raw) To: linux-raid I want to create a raid10,n2 using 3 1TB SATA drives. I want to create an xfs filesystem on top of it. The filesystem will be used as NFS/Samba storage. mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1 mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1 mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0 mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0 /mnt/raid10xfs Will my files be safe even on sudden power loss? Is barrier=1 enough? Do I need to disable the write cache? With: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd I tried it but performance is horrendous. Am I better off with ext4? Data safety/integrity is the priority and optimization affecting it is not acceptable. Thanks, and any advice/guidance would be appreciated. ^ permalink raw reply [flat|nested] 65+ messages in thread
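[As an aside for readers: the three-drive near-2 layout the OP is creating is easier to reason about with a toy model. This is only a sketch of the placement idea and the capacity arithmetic, not md's actual implementation; the function names are illustrative.]

```python
# Toy model of md raid10 "near-2" on three drives -- illustrative only,
# not md's exact on-disk layout. Copies of each chunk go to adjacent
# devices, wrapping around, which is how an odd drive count still works.

def near2_disks(chunk_index, ndisks=3, copies=2):
    """Devices holding the copies of a given data chunk."""
    start = (chunk_index * copies) % ndisks
    return [(start + r) % ndisks for r in range(copies)]

def usable_capacity(ndisks, disk_size, copies=2):
    """Usable space: total raw capacity divided by the copy count."""
    return ndisks * disk_size // copies

print(usable_capacity(3, 1000))            # three 1 TB drives -> 1500 GB usable
print([near2_disks(i) for i in range(4)])  # [[0, 1], [2, 0], [1, 2], [0, 1]]
```

Note how the copy pairs rotate through the devices, so every drive carries data and mirror chunks, and 3 x 1 TB yields roughly 1.5 TB usable.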
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 0:30 raid10n2/xfs setup guidance on write-cache/barrier Jessie Evangelista @ 2012-03-15 5:38 ` Stan Hoeppner 2012-03-15 12:06 ` Jessie Evangelista 2012-03-17 22:10 ` Zdenek Kaspar 1 sibling, 1 reply; 65+ messages in thread From: Stan Hoeppner @ 2012-03-15 5:38 UTC (permalink / raw) To: Jessie Evangelista; +Cc: linux-raid On 3/14/2012 7:30 PM, Jessie Evangelista wrote: > I want to create a raid10,n2 using 3 1TB SATA drives. > I want to create an xfs filesystem on top of it. > The filesystem will be used as NFS/Samba storage. > > mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1 > mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean > --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1 > /dev/sdd1 Why 256KB for chunk size? Looks like you've been reading a very outdated/inaccurate "XFS guide" on the web... What kernel version? This can make a significant difference in XFS metadata performance. You should use 2.6.39+ if possible. What xfsprogs version? > mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0 lazy-count=1 is currently the default with recent xfsprogs so no need to specify it. Why are you manually specifying the size of the internal journal log? This is unnecessary. In fact, unless you have profiled your workload and testing shows that alternate XFS settings perform better, it is always best to stick with the defaults. They exist for a reason, and are well considered. > mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0 > /mnt/raid10xfs 'barrier' takes no value; it's either on or off. XFS mounts with barriers enabled by default so remove 'barrier=1'. You do not have a RAID card with persistent write cache (BBWC), so you should leave barriers enabled. Barriers mitigate journal log corruption due to power failure and crashes, which seem to be of concern to you. logbsize=256k and logbufs=8 are the defaults in recent kernels so no need to specify them. 
Your NFS/Samba workload on 3 slow disks isn't sufficient to need that much in-memory journal buffer space anyway. XFS uses relatime, which is equivalent to noatime WRT IO reduction, so don't specify 'noatime'. In fact, it appears you don't need to specify anything in mkfs.xfs or fstab, but just use the defaults. Fancy that. And the one thing that might actually increase your performance a little bit you didn't specify--sunit/swidth. However, since you're using mdraid, mkfs.xfs will calculate these for you (which is nice as mdraid10 with an odd disk count can be a tricky calculation). Again, defaults work for a reason. > Will my files be safe even on sudden power loss? Are you unwilling to purchase a UPS and implement shutdown scripts? If so you have no business running a server, frankly. Any system will lose data due to power loss; it's just a matter of how much, based on the quantity of in-flight writes at the time the juice dies. This problem is mostly filesystem independent. Application write behavior does play a role. UPS with shutdown scripts, and persistent write cache, prevent this problem. A cheap UPS suitable for this purpose is less money than a 1TB 7.2k drive, currently. You say this is an NFS/Samba server. That would imply that multiple people or other systems directly rely on it. Implement a good UPS solution and eliminate this potential problem. > Is barrier=1 enough? > Do I need to disable the write cache? > with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd Disabling drive write caches does decrease the likelihood of data loss. > I tried it but performance is horrendous. And this is why you should leave them enabled and use barriers. Better yet, use a RAID card with BBWC and disable the drive caches. > Am I better off with ext4? Data safety/integrity is the priority and > optimization affecting it is not acceptable. You're better off using a UPS. Filesystem makes little difference WRT data safety/integrity. 
All will suffer some damage if you throw a grenade at them. So don't throw grenades. Speaking of which, what is your backup/restore procedure/hardware for this array? > Thanks and any advice/guidance would be appreciated I'll appreciate your response stating "Yes, I have a UPS and tested/working shutdown scripts" or "I'll be implementing a UPS very soon." :) -- Stan
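[For readers wondering what sunit/swidth values are in play here: the arithmetic below is only a back-of-the-envelope sketch with illustrative function names, not mkfs.xfs's actual code. In practice, as Stan says, let mkfs.xfs read the geometry from md itself.]

```python
# Rough sketch of XFS stripe-unit/stripe-width arithmetic for an md
# array -- illustrative figures only. mkfs.xfs derives the real values
# from the md device; do not set these by hand for mdraid.

SECTOR = 512  # bytes

def sunit_sectors(chunk_kib):
    """Stripe unit: one md chunk, expressed in 512-byte sectors."""
    return chunk_kib * 1024 // SECTOR

def swidth_sectors(chunk_kib, ndisks, copies=2):
    """Stripe width: data-bearing disks per stripe times the stripe
    unit. With 3 disks and 2 copies the disk count is fractional (1.5),
    which is why the odd-disk-count raid10 case is called ambiguous."""
    data_disks = ndisks / copies
    return int(sunit_sectors(chunk_kib) * data_disks)

print(sunit_sectors(256))      # 512 sectors = 256 KiB
print(swidth_sectors(256, 3))  # 768 sectors = 384 KiB
```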
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 5:38 ` Stan Hoeppner @ 2012-03-15 12:06 ` Jessie Evangelista 2012-03-15 14:07 ` Peter Grandi 2012-03-16 12:25 ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner 0 siblings, 2 replies; 65+ messages in thread From: Jessie Evangelista @ 2012-03-15 12:06 UTC (permalink / raw) To: stan; +Cc: linux-raid On Thu, Mar 15, 2012 at 1:38 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 3/14/2012 7:30 PM, Jessie Evangelista wrote: >> I want to create a raid10,n2 using 3 1TB SATA drives. >> I want to create an xfs filesystem on top of it. >> The filesystem will be used as NFS/Samba storage. >> >> mdadm --zero-superblock /dev/sdb1 /dev/sdc1 /dev/sdd1 >> mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean >> --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1 >> /dev/sdd1 > > Why 256KB for chunk size? > For reference, the machine has 16GB of memory. I've run some benchmarks with dd trying the different chunks and 256k seems like the sweet spot: dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct > > Looks like you've been reading a very outdated/inaccurate "XFS guide" on > the web... > > What kernel version? This can make a significant difference in XFS > metadata performance. You should use 2.6.39+ if possible. What > xfsprogs version? > Testing was done with Ubuntu 10.04 LTS, with the kernel at 2.6.32-33-server and xfsprogs at 3.1.0ubuntu1. >> mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0 > > lazy-count=1 is currently the default with recent xfsprogs so no need to > specify it. Why are you manually specifying the size of the internal > journal log file? This is unnecessary. In fact, unless you have > profiled your workload and testing shows that alternate XFS settings > perform better, it is always best to stick with the defaults. They > exist for a reason, and are well considered. I'll probably forgo setting the journal log file size. 
It seemed like a safe optimization from what I've read. >> mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0 >> /mnt/raid10xfs > > Barrier has no value, it's either on or off. XFS mounts with barriers > enabled by default so remove 'barrier=1'. You do not have a RAID card > with persistent write cache (BBWC), so you should leave barriers > enabled. Barriers mitigate journal log corruption due to power failure > and crashes, which seem to be of concern to you. > > logbsize=256k and logbufs=8 are the defaults in recent kernels so no > need to specify them. Your NFS/Samba workload on 3 slow disks isn't > sufficient to need that much in memory journal buffer space anyway. XFS > uses relatime which is equivalent to noatime WRT IO reduction > performance, so don't specify 'noatime'. I just wanted to be explicit about it so that I know what is set, just in case the defaults change. > > In fact, it appears you don't need to specify anything in mkfs.xfs or > fstab, but just use the defaults. Fancy that. And the one thing that > might actually increase your performance a little bit you didn't > specify--sunit/swidth. However, since you're using mdraid, mkfs.xfs > will calculate these for you (which is nice as mdraid10 with odd disk > count can be a tricky calculation). Again, defaults work for a reason. > The reason I did not set sunit/swidth is because I read somewhere that mkfs.xfs will calculate based on mdraid. >> Will my files be safe even on sudden power loss? > > Are you unwilling to purchase a UPS and implement shutdown scripts? If > so you have no business running a server, frankly. Any system will lose > data due to power loss, it's just a matter of how much based on the > quantity of inflight writes at the time the juice dies. This problem is > mostly filesystem independent. Application write behavior does play a > role. UPS with shutdown scripts, and persistent write cache prevent > this problem. 
A cheap UPS suitable for this purpose is less money than > a 1TB 7.2k drive, currently. > The server is for a non-profit org that I am helping out. I think an APC Smart-UPS SC 420VA 230V may fit their shoestring budget. > You say this is an NFS/Samba server. That would imply that multiple > people or other systems directly rely on it. Implement a good UPS > solution and eliminate this potential problem. > >> Is barrier=1 enough? >> Do i need to disable the write cache? >> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd > > Disabling drive write caches does decrease the likelihood of data loss. > >> I tried it but performance is horrendous. > > And this is why you should leave them enabled and use barriers. Better > yet, use a RAID card with BBWC and disable the drive caches. Budget does not allow for a RAID card with BBWC. > >> Am I better off with ext4? Data safety/integrity is the priority and >> optimization affecting it is not acceptable. > > You're better off using a UPS. Filesystem makes little difference WRT > data safety/integrity. All will suffer some damage if you throw a > grenade at them. So don't throw grenades. Speaking of which, what is > your backup/restore procedure/hardware for this array? Nightly backups will be stored on an external USB disk. Is XFS going to be prone to more data loss in case the non-redundant power supply goes out? > >> Thanks and any advice/guidance would be appreciated > > I'll appreciate your response stating "Yes, I have a UPS and > tested/working shutdown scripts" or "I'll be implementing a UPS very > soon." :) I don't have shutdown scripts yet but will look into it. Meatware would have to do for now, as the server will probably be ON only when there are people at the office. And yes, I will be asking them to not go into production without a UPS. > > -- > Stan > Thanks for your input, Stan. I just updated the kernel to 3.0.0-16. Did they take out barrier support in mdraid? Or was the implementation replaced with FUA? 
Is there a definitive test to determine if off-the-shelf consumer SATA drives honor barrier or cache flush requests? I think I'd like to go with the device cache turned ON and barriers enabled. I'm still torn between ext4 and XFS, i.e. which will be safer in this particular setup. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
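[There is no single definitive software-only test, but one common heuristic, in the spirit of tools like diskchecker.pl, is to time fsync(): a 7200 rpm drive that really flushes its cache physically cannot complete more than a couple of hundred fsyncs per second, so thousands per second suggest the flush is being absorbed by a volatile cache. A rough sketch; the threshold and iteration count here are illustrative assumptions:]

```python
import os
import tempfile
import time

def fsyncs_per_second(directory=None, iterations=50):
    """Time small write+fsync cycles on the filesystem under test.
    On a spinning disk that honours flushes, expect well under ~200/s;
    far higher rates suggest flushes are landing in a volatile cache
    (or the file is on a fast SSD / virtual device)."""
    fd, name = tempfile.mkstemp(dir=directory)
    try:
        start = time.time()
        for _ in range(iterations):
            os.write(fd, b"x" * 512)
            os.fsync(fd)  # ask the kernel to push through to stable storage
        elapsed = time.time() - start
        return iterations / elapsed if elapsed > 0 else float("inf")
    finally:
        os.close(fd)
        os.unlink(name)

print(round(fsyncs_per_second(), 1))
```

Point it at a directory on the md array (e.g. `fsyncs_per_second("/mnt/raid10xfs")`) and compare the rate against the drive's rotational limit.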
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 12:06 ` Jessie Evangelista @ 2012-03-15 14:07 ` Peter Grandi 2012-03-16 12:25 ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner 1 sibling, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-15 14:07 UTC (permalink / raw) To: Linux RAID, Linux fs XFS >>> I want to create a raid10,n2 using 3 1TB SATA drives. >>> I want to create an xfs filesystem on top of it. The >>> filesystem will be used as NFS/Samba storage. Consider also an 'o2' layout (it is probably the same thing for a 3-drive RAID10) or even a RAID5, as 3 drives and this usage seems one of the few cases where RAID5 may be plausible. > [ ... ] I've run some benchmarks with dd trying the different > chunks and 256k seems like the sweet spot. dd if=/dev/zero > of=/dev/md0 bs=64k count=655360 oflag=direct That's for bulk sequential transfers. Random-ish, as in a fileserver perhaps with many smaller files, may not be the same, but probably larger chunks are good. >> [ ... ] What kernel version? This can make a significant >> difference in XFS metadata performance. As an aside, that's a myth that has been propagandized by DaveC in his entertaining presentation not long ago. There have been decent but no major improvements in XFS metadata *performance*, but weaker implicit *semantics* have been made an option, and these have a different safety/performance tradeoff (less implicit safety, somewhat more performance), not "just" better performance. http://lwn.net/Articles/476267/ «In other words, instead of there only being a maximum of 2MB of transaction changes not written to the log at any point in time, there may be a much greater amount being accumulated in memory. Hence the potential for loss of metadata on a crash is much greater than for the existing logging mechanism. It should be noted that this does not change the guarantee that log recovery will result in a consistent filesystem. 
What it does mean is that as far as the recovered filesystem is concerned, there may be many thousands of transactions that simply did not occur as a result of the crash. This makes it even more important that applications that care about their data use fsync() where they need to ensure application level data integrity is maintained.» >> Your NFS/Samba workload on 3 slow disks isn't sufficient to >> need that much in memory journal buffer space anyway. That's probably true, but does no harm. >> XFS uses relatime which is equivalent to noatime WRT IO >> reduction performance, so don't specify 'noatime'. Uhm, not so sure, and 'noatime' does not hurt either. > I just wanted to be explicit about it so that I know what is > set just in case the defaults change That's what I do as well, because relying on remembering exactly what the defaults are can sometimes cause confusion. But it is a matter of taste to a large degree, like 'noatime'. >> In fact, it appears you don't need to specify anything in >> mkfs.xfs or fstab, but just use the defaults. Fancy that. For NFS/Samba, especially with ACLs (SMB protocol), and especially if one expects largish directories, and in general, I would recommend a larger inode size, at least 1024B, if not even 2048B. Also, as a rule I want to make sure that the sector size is set to 4096B, for future proofing (and recent drives not only have 4096B sectors but usually lie). >> And the one thing that might actually increase your >> performance a little bit you didn't specify--sunit/swidth. Especially 'sunit', as XFS ideally would align metadata on chunk boundaries. >> However, since you're using mdraid, mkfs.xfs will calculate >> these for you (which is nice as mdraid10 with odd disk count >> can be a tricky calculation). Ambiguous more than tricky, and not very useful, except the chunk size. >>> Will my files be safe even on sudden power loss? The answer is NO, if you mean "absolutely safe". But see the discussion at the end. >> [ ... 
] Application write behavior does play a role. Indeed, see the discussion at the end and ways to mitigate. >> UPS with shutdown scripts, and persistent write cache prevent >> this problem. [ ... ] There is always the problem of system crashes that don't depend on power.... >>> Is barrier=1 enough? Do i need to disable the write cache? >>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd >> Disabling drive write caches does decrease the likelihood of >> data loss. >>> I tried it but performance is horrendous. >> And this is why you should leave them enabled and use >> barriers. Better yet, use a RAID card with BBWC and disable >> the drive caches. > Budget does not allow for RAID card with BBWC You'd be surprised by how cheap you can get one. But many HW host adapters with builtin cache have bad performance or horrid bugs, so you'd have to be careful. In any case that's not the major problem you have. >>> Am I better off with ext4? Data safety/integrity is the >>> priority and optimization affecting it is not acceptable. XFS is the filesystem of the future ;-). I would choose it over 'ext4' in every plausible case. > nightly backups will be stored on an external USB disk USB is an unreliable, buggy, and slow transport; eSATA is enormously better and faster. > is xfs going to be prone to more data loss in case the > non-redundant power supply goes out? That's the wrong question entirely. Data loss can happen for many other reasons, and XFS is probably one of the safest designs, if properly used and configured. The problems are elsewhere. > I just updated the kernel to 3.0.0-16. Did they take out > barrier support in mdraid? or was the implementation replaced > with FUA? Is there a definitive test to determine if the off > the shelf consumer sata drives honor barrier or cache flush > requests? Usually they do, but that's the least of your worries. 
Anyhow a test that occurs to me is to write a known pattern to a file, let's say 1GiB, then 'fsync', and as soon as 'fsync' completes, power off. Then check whether the whole 1GiB is the known pattern. > I think I'd like to go with device cache turned ON and barrier > enabled. That's how it is supposed to work. As to general safety issues, there seems to be some misunderstanding, and I'll try to be more explicit than the "lob the grenade" notion. It matters a great deal what "safety" means in your mind and that of your users. As a previous comment pointed out, that usually involves backups, that is data that has already been stored. But your insistence on power off and disk caches etc. seems to indicate that "safety" in your mind means "when I click the 'Save' button it is really saved and not partially". As to that, there are quite a lot of qualifiers: * Most users don't understand that even in the best scenario a file is really saved not when they *click* the 'Save' button, but when they get the "Saved!" message. In between anything can happen. Also, work in progress (not yet saved explicitly) is fair game. * "Really saved" is an *application* concern first and foremost. The application *must* say (via 'fsync') that it wants the data really saved. Unfortunately most applications don't do that because "really saved" is a very expensive operation, and usually systems don't crash, so the application writer looks like a genius if he has an "optimistic" attitude. If you do a web search look for various O_PONIES discussions. Some intros: http://lwn.net/Articles/351422/ http://lwn.net/Articles/322823/ * XFS (and to a point 'ext4') is designed for applications that work correctly and issue 'fsync' appropriately, and if they do it is very safe, because it tries hard to ensure that either 'fsync' means "really saved" or you know that it does not. 
XFS takes advantage of the assumption that applications do the right thing to do various latency-based optimizations between calls to 'fsync'. * Unfortunately most GUI applications don't do the right thing, but fortunately you can compensate for that. The key here is to make sure that the flusher's parameters are set for rather more frequent flushing than the default, which is equivalent to issuing 'fsync' systemwide fairly frequently. Ideally set 'vm/dirty_bytes' to something like 1-3 seconds of IO transfer rate (and in reversal of some of my previous advice leave 'vm/dirty_background_bytes' to something quite large unless you *really* want safety), and shorten significantly 'vm/dirty_expire_centisecs' and 'vm/dirty_writeback_centisecs'. This defeats some XFS optimizations, but that's inevitable. * In any case you are using NFS/Samba, and that opens a much bigger set of issues, because caching happens on the clients too: http://www.sabi.co.uk/0707jul.html#070701b Then Von Neumann help you if your users or you decide to store lots of messages in MH/Maildir style mailstores, or VM images on "growable" virtual disks. 
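[Peter's known-pattern test could be scripted roughly like this. Sizes are shortened here for illustration, the file name is arbitrary, and the power-cut between the two steps is still manual:]

```python
import os

PATTERN = b"\xA5" * 4096   # one repeating 4 KiB block
BLOCKS = 256               # 1 MiB here; scale up toward 1 GiB for a real test

def write_pattern(path):
    """Write the pattern and fsync; cut power as soon as this returns."""
    with open(path, "wb") as f:
        for _ in range(BLOCKS):
            f.write(PATTERN)
        f.flush()
        os.fsync(f.fileno())  # the durability guarantee under test

def verify_pattern(path):
    """After reboot: True only if every block survived intact."""
    with open(path, "rb") as f:
        return all(f.read(len(PATTERN)) == PATTERN for _ in range(BLOCKS))
```

Run `write_pattern` on the array, pull the plug the moment it returns, then run `verify_pattern` after reboot; a False result (or a short/missing file) means a completed fsync was lost, i.e. the stack is not honoring flushes.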
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 14:07 ` Peter Grandi @ 2012-03-15 15:25 ` keld 0 siblings, 0 replies; 65+ messages in thread From: keld @ 2012-03-15 15:25 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote: > >>> I want to create a raid10,n2 using 3 1TB SATA drives. > >>> I want to create an xfs filesystem on top of it. The > >>> filesystem will be used as NFS/Samba storage. > > Consider also an 'o2' layout (it is probably the same thing for a > 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems > one of the few cases where RAID5 may be plausible. Well, for a file server like NFS/Samba, you could also consider raid10,f2. I would think you could get about double the read performance compared to the n2 and o2 layouts, and also for individual read transfers on a running system you would get something like double the read performance. Write performance could be somewhat slower (0 to 10%), but as users are not waiting for writes to complete, they will probably not notice. best regards keld
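[keld's point about f2 read performance is easier to see with a toy placement model. This is a sketch of the idea only, not md's exact offset arithmetic: the first copy of the data is packed RAID0-style at the start (fast outer tracks) of every disk, and the second copy lives in the far half, shifted by one device.]

```python
# Toy model of md raid10 "far-2" with three drives -- illustrative only.
# First copies stripe across the start of every disk; second copies sit
# in the far half of each disk, rotated by one device.

def far2_copies(chunk_index, ndisks=3, chunks_per_half=1000):
    """(disk, disk-offset-in-chunks) for the near and far copy."""
    near = (chunk_index % ndisks, chunk_index // ndisks)
    far = ((chunk_index + 1) % ndisks,
           chunks_per_half + chunk_index // ndisks)
    return [near, far]

# Sequential reads can be served from the first halves alone, striping
# across all three spindles like a 3-disk RAID0:
print([far2_copies(i)[0] for i in range(6)])
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

The flip side, as keld notes, is on writes: every write must also seek to the far half, which is where the modest write penalty comes from.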
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 15:25 ` keld @ 2012-03-15 16:52 ` Jessie Evangelista 0 siblings, 1 reply; 65+ messages in thread From: Jessie Evangelista @ 2012-03-15 16:52 UTC (permalink / raw) To: Linux RAID, Linux fs XFS Hi keld, On Thu, Mar 15, 2012 at 11:25 PM, <keld@keldix.com> wrote: > On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote: >> >>> I want to create a raid10,n2 using 3 1TB SATA drives. >> >>> I want to create an xfs filesystem on top of it. The >> >>> filesystem will be used as NFS/Samba storage. >> >> Consider also an 'o2' layout (it is probably the same thing for a >> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems >> one of the few cases where RAID5 may be plausible. > > Well, for a file server like NFS/Samba, you could also consider raid10,f2. > I would think you could get about double the read performance compared to n2 and o2 > layouts, and also for individual read transfers on a running system > you would get something like double the read performance. > Write performance could be somewhat slower (0 to 10 %) but as users > are not waiting for writes to complete, they will probably not notice. I also plan to try raid10f2. Did you do your own benchmarks or are you quoting someone else's? > > best regards > keld thanks for chiming in. have a nice day -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 16:52 ` Jessie Evangelista @ 2012-03-15 17:15 ` keld 0 siblings, 1 reply; 65+ messages in thread From: keld @ 2012-03-15 17:15 UTC (permalink / raw) To: Jessie Evangelista; +Cc: Linux RAID, Linux fs XFS On Fri, Mar 16, 2012 at 12:52:19AM +0800, Jessie Evangelista wrote: > Hi keld, > > On Thu, Mar 15, 2012 at 11:25 PM, <keld@keldix.com> wrote: > > On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote: > >> >>> I want to create a raid10,n2 using 3 1TB SATA drives. > >> >>> I want to create an xfs filesystem on top of it. The > >> >>> filesystem will be used as NFS/Samba storage. > >> > >> Consider also an 'o2' layout (it is probably the same thing for a > >> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems > >> one of the few cases where RAID5 may be plausible. > > > > Well, for a file server like NFS/Samba, you could also consider raid10,f2. > > I would think you could get about double the read performance compared to n2 and o2 > > layouts, and also for individual read transfers on a running system > > you would get something like double the read performance. > > Write performance could be somewhat slower (0 to 10 %) but as users > > are not waiting for writes to complete, they will probably not notice. > > I also plan to try raid10f2. Did you do your own benchmarks or are you > quoting someone else's? Both, look at our wiki: https://raid.wiki.kernel.org/articles/p/e/r/Performance.html Best regards keld ^ permalink raw reply [flat|nested] 65+ messages in thread
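[Editor's note: the layout comparison discussed here can be reproduced with the kind of dd run quoted earlier in the thread. A hedged sketch (destructive: it writes directly to /dev/md0, so only run it on a scratch array before mkfs; block size and count are the figures from the original post):

```shell
# Sequential write, then sequential read, 40 GiB each, bypassing the
# page cache; repeat after recreating the array with each layout
# (n2, o2, f2) to compare throughput on your own drives.
dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct
dd if=/dev/md0 of=/dev/null bs=64k count=655360 iflag=direct
```

Sequential dd numbers mainly reflect bulk-transfer behaviour; a fileserver workload with many small files may rank the layouts differently.]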
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 17:15 ` keld @ 2012-03-15 17:40 ` keld 0 siblings, 0 replies; 65+ messages in thread From: keld @ 2012-03-15 17:40 UTC (permalink / raw) To: Jessie Evangelista; +Cc: Linux RAID, Linux fs XFS On Thu, Mar 15, 2012 at 06:15:49PM +0100, keld@keldix.com wrote: > On Fri, Mar 16, 2012 at 12:52:19AM +0800, Jessie Evangelista wrote: > > Hi keld, > > > > On Thu, Mar 15, 2012 at 11:25 PM, <keld@keldix.com> wrote: > > > On Thu, Mar 15, 2012 at 02:07:25PM +0000, Peter Grandi wrote: > > >> >>> I want to create a raid10,n2 using 3 1TB SATA drives. > > >> >>> I want to create an xfs filesystem on top of it. The > > >> >>> filesystem will be used as NFS/Samba storage. > > >> > > >> Consider also an 'o2' layout (it is probably the same thing for a > > >> 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems > > >> one of the few cases where RAID5 may be plausible. > > > > > > Well, for a file server like NFS/Samba, you could also consider raid10,f2. > > > I would think you could get about double the read performance compared to n2 and o2 > > > layouts, and also for individual read transfers on a running system > > > you would get something like double the read performance. > > > Write performance could be somewhat slower (0 to 10 %) but as users > > > are not waiting for writes to complete, they will probably not notice. > > > > I also plan to try raid10f2. Did you do your own benchmarks or are you > > quoting someone else's? > > Both, look at our wiki: https://raid.wiki.kernel.org/articles/p/e/r/Performance.html I think it would be interesting to include your figures on the wiki page, if you publish them here on the list. Maybe we should rearrange the wiki page a little. I am not so happy about the data reported in the section "New benchmarks from 2011" as it only illustrates what is happening with a 100 % used CPU. I would like to move it to a separate page. 
Also the really old data in section "Old performance benchmark" should be moved to a separate page, IMHO. The text on the wiki page should be giving info of general interest for systems running today (still IMHO). Comments? Best regards keld ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 14:07 ` Peter Grandi @ 2012-03-15 16:18 ` Jessie Evangelista 0 siblings, 1 reply; 65+ messages in thread From: Jessie Evangelista @ 2012-03-15 16:18 UTC (permalink / raw) To: Linux RAID, Linux fs XFS Hey Peter, On Thu, Mar 15, 2012 at 10:07 PM, Peter Grandi <pg@lxra2.to.sabi.co.uk> wrote: >>>> I want to create a raid10,n2 using 3 1TB SATA drives. >>>> I want to create an xfs filesystem on top of it. The >>>> filesystem will be used as NFS/Samba storage. > > Consider also an 'o2' layout (it is probably the same thing for a > 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems > one of the few cases where RAID5 may be plausible. Thanks for reminding me about raid5. I'll probably give it a try and do some benchmarks. I'd also like to try raid10f2. >> [ ... ] I've run some benchmarks with dd trying the different >> chunks and 256k seems like the sweet spot. dd if=/dev/zero >> of=/dev/md0 bs=64k count=655360 oflag=direct > > That's for bulk sequential transfers. Random-ish, as in a > fileserver perhaps with many smaller files, may not be the same, > but probably larger chunks are good. >>> [ ... ] What kernel version? This can make a significant >>> difference in XFS metadata performance. > > As an aside, that's a myth that has been propagandized by DaveC > in his entertaining presentation not long ago. > > There have been decent but no major improvements in XFS metadata > *performance*, but weaker implicit *semantics* have been made an > option, and these have a different safety/performance tradeoff > (less implicit safety, somewhat more performance), not "just" > better performance. > > http://lwn.net/Articles/476267/ > «In other words, instead of there only being a maximum of 2MB of > transaction changes not written to the log at any point in time, > there may be a much greater amount being accumulated in memory. 
> > Hence the potential for loss of metadata on a crash is much > greater than for the existing logging mechanism. > > It should be noted that this does not change the guarantee that > log recovery will result in a consistent filesystem. > > What it does mean is that as far as the recovered filesystem is > concerned, there may be many thousands of transactions that > simply did not occur as a result of the crash. > > This makes it even more important that applications that care > about their data use fsync() where they need to ensure > application level data integrity is maintained.» > >>> Your NFS/Samba workload on 3 slow disks isn't sufficient to >>> need that much in memory journal buffer space anyway. > > That's probably true, but does no harm. > >>> XFS uses relatime which is equivalent to noatime WRT IO >>> reduction performance, so don't specify 'noatime'. > > Uhm, not so sure, and 'noatime' does not hurt either. > >> I just wanted to be explicit about it so that I know what is >> set just in case the defaults change > > That's what I do as well, because relying on remembering exactly > what the defaults are can sometimes cause confusion. But it is a > matter of taste to a large degree, like 'noatime'. > >>> In fact, it appears you don't need to specify anything in >>> mkfs.xfs or fstab, but just use the defaults. Fancy that. > > For NFS/Samba, especially with ACLs (SMB protocol), and > especially if one expects largish directories, and in general I > would recommend a larger inode size, at least 1024B, if not even > 2048B. thanks for this tip. will look into adjusting inode size. > > Also, as a rule I want to make sure that the sector size is set > to 4096B, for future proofing (and recent drives not only have > 4096B sectors but usually lie). > it seems the 1TB drives that I have still have 512byte sectors >>> And the one thing that might actually increase your >>> performance a little bit you didn't specify--sunit/swidth. 
> > Especially 'sunit', as XFS ideally would align metadata on chunk > boundaries. > >>> However, since you're using mdraid, mkfs.xfs will calculate >>> these for you (which is nice as mdraid10 with odd disk count >>> can be a tricky calculation). > > Ambiguous more than tricky, and not very useful, except the chunk > size. > >>>> Will my files be safe even on sudden power loss? > > The answer is NO, if you mean "absolutely safe". But see the > discussion at the end. > >>> [ ... ] Application write behavior does play a role. > > Indeed, see the discussion at the end and ways to mitigate. > >>> UPS with shutdown scripts, and persistent write cache prevent >>> this problem. [ ... ] > > There is always the problem of system crashes that don't depend > on power.... > >>>> Is barrier=1 enough? Do I need to disable the write cache? >>>> with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd > >>> Disabling drive write caches does decrease the likelihood of >>> data loss. > >>>> I tried it but performance is horrendous. > >>> And this is why you should leave them enabled and use >>> barriers. Better yet, use a RAID card with BBWC and disable >>> the drive caches. > >> Budget does not allow for RAID card with BBWC > > You'd be surprised by how cheap you can get one. But many HW host > adapters with builtin cache have bad performance or horrid bugs, > so you'd have to be careful. Could you please suggest a hardware raid card with BBU that's cheap? > > In any case that's not the major problem you have. > >>>> Am I better off with ext4? Data safety/integrity is the >>>> priority and optimization affecting it is not acceptable. > > XFS is the filesystem of the future ;-). I would choose it over > 'ext4' in every plausible case. > >> nightly backups will be stored on an external USB disk > > USB is an unreliable, buggy transport, and slow, eSATA is > enormously better and faster. > >> is xfs going to be prone to more data loss in case the >> non-redundant power supply goes out? 
> > That's the wrong question entirely. Data loss can happen for many > other reasons, and XFS is probably one of the safest designs, if > properly used and configured. The problems are elsewhere. Can you please elaborate how xfs can be properly used and configured? > >> I just updated the kernel to 3.0.0-16. Did they take out >> barrier support in mdraid? or was the implementation replaced >> with FUA? Is there a definitive test to determine if the off >> the shelf consumer sata drives honor barrier or cache flush >> requests? > > Usually they do, but that's the least of your worries. Anyhow a > test that occurs to me is to write a known pattern to a file, > let's say 1GiB, then 'fsync', and as soon as 'fsync' completes, > power off. Then check whether the whole 1GiB is the known pattern. > >> I think I'd like to go with device cache turned ON and barrier >> enabled. > > That's how it is supposed to work. > > As to general safety issues, there seems to be some misunderstanding, > and I'll try to be more explicit than "lob the grenade" notion. > > It matters a great deal what "safety" means in your mind and that > of your users. As a previous comment pointed out, that usually > involves backups, that is data that has already been stored. > > But your insistence on power off and disk caches etc. seems to > indicate that "safety" in your mind means "when I click the > 'Save' button it is really saved and not partially". > Let me define safety as needed by the use case: fileA is a 2MB open office document file already existing on the file system. userA opens fileA locally, modifies a lot of lines and attempts to save it. As the saving operation is proceeding, the PSU goes haywire and power is cut abruptly. When the system is turned on, I expect some sort of recovery process to bring the filesystem to a consistent state. I expect fileA should be as it was before the save operation and should not be corrupted in any way. Am I asking/expecting too much? 
> As to that there are quite a lot of qualifiers: > > * Most users don't understand that even in the best scenario a > file is really saved not when they *click* the 'Save' button, > but when they get the "Saved!" message. In between anything > can happen. Also, work in progress (not yet saved explicitly) > is fair game. > > * "Really saved" is an *application* concern first and foremost. > The application *must* say (via 'fsync') that it wants the > data really saved. Unfortunately most applications don't do > that because "really saved" is a very expensive operation, and > usually systems don't crash, so the application writer looks > like a genius if he has an "optimistic" attitude. If you do a > web search look for various O_PONIES discussions. Some intros: > > http://lwn.net/Articles/351422/ > http://lwn.net/Articles/322823/ > > * XFS (and to a point 'ext4') is designed for applications that > work correctly and issue 'fsync' appropriately, and if they do > it is very safe, because it tries hard to ensure that either > 'fsync' means "really saved" or you know that it does not. XFS > takes advantage of the assumption that applications do the > right thing to do various latency-based optimizations between > calls to 'fsync'. > > * Unfortunately most GUI applications don't do the right thing, > but fortunately you can compensate for that. The key here is > to make sure that the flusher's parameters are set for rather > more frequent flushing than the default, which is equivalent > to issuing 'fsync' systemwide fairly frequently. Ideally set > 'vm/dirty_bytes' to something like 1-3 seconds of IO transfer > rate (and in reversal on some of my previous advice leave > 'vm/dirty_background_bytes' to something quite large unless > you *really* want safety), and to shorten significantly > 'vm/dirty_expire_centisecs', 'vm/dirty_writeback_centisecs'. > This defeats some XFS optimizations, but that's inevitable. 
> > * In any case you are using NFS/Samba, and that opens a much > bigger set of issues, because caching happens on the clients > too: http://www.sabi.co.uk/0707jul.html#070701b > > Then Von Neumann help you if your users or you decide to store lots > of messages in MH/Maildir style mailstores, or VM images on > "growable" virtual disks. What's wrong with VM images on "growable" virtual disks? Are you saying not to rely on lvm2 volumes? -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
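[Editor's note: the "fileA" expectation above is exactly what the write-to-temporary, fsync, atomic-rename pattern gives: after a crash the file is either the old version or the new one, never a torn mix. A minimal sketch in shell (file names and contents are illustrative; `sync FILE` fsyncing a single file needs GNU coreutils 8.24+, older systems would need a small fsync helper):

```shell
#!/bin/sh
# Crash-safe save of doc.odt: write the new version to a temp file,
# push it to stable storage, then atomically replace the old file.
set -e
printf 'new document contents\n' > doc.odt.tmp  # write the new version
sync doc.odt.tmp                                # fsync the temp file to disk
mv doc.odt.tmp doc.odt                          # rename() is atomic within one fs
sync .                                          # fsync the directory entry too
```

This is the copy/modify/rename behaviour Peter attributes to well-behaved applications below; the filesystem alone cannot provide it.]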
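[Editor's note: the flusher tuning described above can be applied with sysctl. A hedged sketch with illustrative numbers assuming an array that sustains roughly 100 MB/s; the exact values are assumptions to be tuned per Peter's 1-3 seconds guideline, not figures from the thread:

```shell
# Cap dirty data at ~2 seconds of IO and shorten the writeback timers,
# trading some XFS write-combining for a smaller window of unsynced data.
sysctl -w vm.dirty_bytes=209715200          # ~200 MB, about 2 s at 100 MB/s
sysctl -w vm.dirty_expire_centisecs=500     # flush dirty data older than 5 s
sysctl -w vm.dirty_writeback_centisecs=100  # wake the flusher every 1 s
```

Put the same `vm.*` keys in /etc/sysctl.conf to make them persistent; note `vm.dirty_bytes` and `vm.dirty_ratio` are mutually exclusive, setting one zeroes the other.]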
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 16:18 ` Jessie Evangelista @ 2012-03-15 23:00 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-15 23:00 UTC (permalink / raw) To: Linux RAID, Linux fs XFS [ ... ] >> Also, as a rule I want to make sure that the sector size is >> set to 4096B, for future proofing (and recent drives not only >> have 4096B sectors but usually lie). > it seems the 1TB drives that I have still have 512-byte sectors But usually you can still set the XFS idea of sector size to 4096, which is probably a good idea in general. [ ... ] >>> is xfs going to be prone to more data loss in case the >>> non-redundant power supply goes out? >> That's the wrong question entirely. Data loss can happen for >> many other reasons, and XFS is probably one of the safest >> designs, if properly used and configured. The problems are >> elsewhere. > Can you please elaborate how xfs can be properly used and > configured? I did that in the following bits of the reply. You must be in a real hurry if you cannot trim down the quoting or write your comments after reading through once... [ ... ] >> But your insistence on power off and disk caches etc. seems to >> indicate that "safety" in your mind means "when I click the >> 'Save' button it is really saved and not partially". > Let me define safety as needed by the use case: fileA is a 2MB > OpenOffice document file already existing on the file system. > userA opens fileA locally, modifies a lot of lines and attempts > to save it. As the saving operation is proceeding, the PSU goes > haywire and power is cut abruptly. To worry you, if the PSU goes haywire, the disk data may become subtly corrupted: https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta «Another user, also running a Tyan 2885 dual-Opteron workstation like mine, had experienced data corruption with SATA disks. The root cause? 
A faulty power supply.» Even then, that is not so much an argument for filesystem-provided checksums, as the ZFS (and other) people say, but for end-to-end (application-level) ones. > When the system is turned on, I expect some sort of recovery > process to bring the filesystem to a consistent state. The XFS design really cares about that and unless the hardware is very broken metadata consistency will be good. > I expect fileA should be as it was before the save operation and > should not be corrupted in any way. Am I asking/expecting too much? That is too much to expect of the filesystem and at the same time too little. It is too much because it is strictly the responsibility of the application, and it is very expensive, because it can only happen by simulating copy-on-write (the app makes a copy of the document, updates the copy, and then atomically renames it, and then makes another copy). Some applications like OOo/LibreO/VIM instead use a log file to record updates, and then merge those on save (copy, merge, rename), which is better. Some filesystems like NILFS2 or BTRFS or Next3/Next4 use COW to provide builtin versioning, but that's expensive too. The original UNIX insight to provide a very simple file abstraction layer should not be lightly discarded (but I like NILFS2 in particular). It is too little because of what happens if you have dozens to thousands of modified but not yet fully persisted files, such as newly created mail folders, 'tar' unpacks, source-tree checkins, ... As I tried to show in my previous reply, and in the NFS blog entry mentioned in it too, on a crudely practical level relying on applications doing the right thing is optimistic, and it may be regrettably expedient to complement barriers with frequent system-driven flushing, which partially simulates (at a price) O_PONIES. [ ... ] >> Then Von Neumann help you if your users or you decide to store >> lots of messages in MH/Maildir style mailstores, or VM images >> on "growable" virtual disks. 
> What's wrong with VM images on "growable" virtual disks? Are you > saying not to rely on lvm2 volumes? By "growable" I mean that the virtual disk is allocated sparsely. As to LVM2, it is very rarely needed. The only really valuable feature it has is snapshot LVs, and those are very expensive. XFS, which can routinely allocate 2GiB (or bigger) files as single extents, can be used as a volume manager too. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
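[Editor's note: the copy/update/rename save pattern Peter describes can be sketched in shell. Filenames are illustrative; coreutils 'sync FILE' (available since coreutils 8.24, which fsyncs a single file) stands in for the fsync(2) call an application would make.]

```shell
# Sketch of the application-level atomic save described above: write
# the complete new version to a temp file, force it to stable storage,
# then atomically rename it over the original. A crash at any point
# leaves either the old file or the new one, never a mix of the two.
printf 'old contents\n' > doc.txt        # the pre-existing document

printf 'new contents\n' > doc.txt.tmp    # 1. write the full new version
sync doc.txt.tmp                         # 2. fsync it (coreutils >= 8.24)
mv doc.txt.tmp doc.txt                   # 3. atomic rename over the original
```

The rename is the commit point: readers see the old content right up until the mv, and the new content immediately after.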
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 23:00 ` Peter Grandi @ 2012-03-16 3:36 ` Jessie Evangelista -1 siblings, 0 replies; 65+ messages in thread From: Jessie Evangelista @ 2012-03-16 3:36 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS > But usually you can still set the XFS idea of sector size to 4096, > which is probably a good idea in general. I'm now running kernel 3.0.0-16-server on Ubuntu 10.04 LTS cat /sys/block/sd[b-d]/queue/physical_block_size shows 512 cat /sys/block/sd[b-d]/device/model shows ST31000524AS Looking up the model at Seagate, the specs page does not mention 512-byte sectors but it did mention guaranteed sectors of 1,953,525,168; multiplying by 512 bytes we do get 1000204886016 (1TB-ish). Anyway, I'll have a look at setting the sector size for xfs > I did that in the following bits of the reply. You must be in a > real hurry if you cannot trim down the quoting or write your > comments after reading through once... I did read through your comments several times and I really appreciate them. Will look into setting vm/dirty_bytes, vm/dirty_background_bytes, vm/dirty_expire_centisecs, vm/dirty_writeback_centisecs. I'm still scouring the internet for a best-practice recipe for implementing xfs/mdraid. I am open to writing one and including the inputs everyone is contributing here. In my search, I also saw some references to alignment issues for partitions. This is what I used to set up the partitions for the md device sfdisk /dev/sdb <<EOF unit: sectors 63,104872257,fd 0,0,0 0,0,0 0,0,0 EOF I've read a recommendation to start the partition on the 1MB mark. Does this make sense? >> Let me define safety as needed by the use case: fileA is a 2MB >> OpenOffice document file already existing on the file system. >> userA opens fileA locally, modifies a lot of lines and attempts >> to save it. As the saving operation is proceeding, the PSU goes >> haywire and power is cut abruptly. 
> > To worry you, if the PSU goes haywire, the disk data may become > subtly corrupted: > > https://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta > «Another user, also running a Tyan 2885 dual-Opteron workstation > like mine, had experienced data corruption with SATA disks. The > root cause? A faulty power supply.» > > Even then, that is not so much an argument for filesystem-provided checksums, > as the ZFS (and other) people say, but for end-to-end (application > level) ones. Mmmm, I've also been reading up on ZFS but haven't put it through its paces. >> I expect fileA should be as it was before the save operation and >> should not be corrupted in any way. Am I asking/expecting too much? > > That is too much to expect of the filesystem and at the same time > too little. > > It is too much because it is strictly the responsibility of the > application, and it is very expensive, because it can only happen > by simulating copy-on-write (app makes a copy of the document, > updates the copy, and then atomically renames it, and then makes > another copy). Some applications like OOo/LibreO/VIM instead use a > log file to record updates, and then merge those on save (copy, > merge, rename), which is better. Some filesystems like NILFS2 or > BTRFS or Next3/Next4 use COW to provide builtin versioning, but > that's expensive too. The original UNIX insight to provide a very > simple file abstraction layer should not be lightly discarded (but > I like NILFS2 in particular). > > It is too little because of what happens if you have dozens to > thousands of modified but not yet fully persisted files, such as > newly created mail folders, 'tar' unpacks, source-tree checkins, > ... 
> > As I tried to show in my previous reply, and in the NFS blog entry > mentioned in it too, on a crudely practical level relying on > applications doing the right thing is optimistic, and it may be > regrettably expedient to complement barriers with frequent system > driven flushing, which partially simulates (at a price) O_PONIES. I'd like to read the NFS blog entry but the link you included results in a 404. I forgot to mention this in my last reply. Based on what I understood from your thoughts above, if an application issues a flush/fsync and it does not complete due to some catastrophic crash, xfs on its own cannot roll back to the previous version of the file in case of an unfinished write operation. Disabling the device caches wouldn't help either, right? Only filesystems that do COW can do this, at the expense of performance? (btrfs and zfs, please hurry and grow up!) > As to LVM2, it is very rarely needed. The only really valuable > feature it has is snapshot LVs, and those are very expensive. XFS, > which can routinely allocate 2GiB (or bigger) files as single > extents, can be used as a volume manager too. If you were in my place with the resource constraints, you'd go with: xfs with barriers on top of mdraid10 with device cache ON and setting vm/dirty_bytes, vm/dirty_background_bytes, vm/dirty_expire_centisecs, vm/dirty_writeback_centisecs to safe values -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
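[Editor's note: on the 1 MiB alignment question — a partition starting at sector 63 is misaligned on 4096B-sector drives, while sector 2048 is the 1 MiB mark. A sketch of recomputing the partition line from the sfdisk input above so the start moves to 2048 but the end sector stays the same:]

```shell
# Sketch: move the partition start from sector 63 to sector 2048
# (1 MiB with 512B sectors) while keeping the same end sector.
OLD_START=63; OLD_SIZE=104872257    # layout from the sfdisk input above
NEW_START=2048                      # 1 MiB / 512 B = 2048 sectors
NEW_SIZE=$(( OLD_START + OLD_SIZE - NEW_START ))

# This line replaces '63,104872257,fd' in the sfdisk here-document:
echo "$NEW_START,$NEW_SIZE,fd"      # -> 2048,104870272,fd
```

Since 2048 is a multiple of 8, every 4096-byte physical sector then maps onto whole logical sectors, and md chunk boundaries stay aligned too.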
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 3:36 ` Jessie Evangelista @ 2012-03-16 11:06 ` Michael Monnerie -1 siblings, 0 replies; 65+ messages in thread From: Michael Monnerie @ 2012-03-16 11:06 UTC (permalink / raw) To: xfs; +Cc: Jessie Evangelista, Peter Grandi, Linux RAID [-- Attachment #1: Type: text/plain, Size: 1370 bytes --] On Friday, 16 March 2012, 11:36:07, Jessie Evangelista wrote: > If you were in my place with the resource constraints, you'd go with: > xfs with barriers on top of mdraid10 with device cache ON and setting > vm/dirty_bytes, vm/dirty_background_bytes, vm/dirty_expire_centisecs, > vm/dirty_writeback_centisecs to safe values If you ever experienced a crash where lots of sensitive and important data were lost, you would not even think about "device cache ON". > could you please suggest a hardware raid card with BBU that's cheap? "Cheap" is a varying definition. How much is your data worth? How much does one day of blackout cost? I've been very happy with Areca controllers, like the 12x0 and 1680 series, and now there's the newer 1882 series like http://geizhals.at/eu/721745 plus a BBU for about 100€. You can even mix RAID levels on the same disks; for example, with 8x1TB, define a RAID0 of 500G and the rest as a RAID6. Online expansion possible, scheduled background verify, e-mail notification on everything, logging, ntp times, oob-mgmt via its own network interface, ... Very reliable, I never had a problem. And they have a good support team. -- with kind regards, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [pronounced: Prot-e-schee] Tel: +43 660 / 415 6531 [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 11:06 ` Michael Monnerie @ 2012-03-16 12:21 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-16 12:21 UTC (permalink / raw) To: Linux fs XFS, Linux RAID [ ... ] >> If you were in my place with the resource constraints, you'd >> go with: xfs with barriers on top of mdraid10 with device >> cache ON and setting vm/dirty_bytes, vm/dirty_background_bytes, >> vm/dirty_expire_centisecs, vm/dirty_writeback_centisecs to >> safe values > If you ever experienced a crash where lots of sensitive and > important data were lost, you would not even think about > "device cache ON". It is not as simple as that... *If* hw barriers are implemented *and* applications do the right things, that is not a concern. Disabling the device cache is just a way to turn barriers on for everything. Indeed the whole rationale for having the 'barrier' option is to leave the device caches on, and the OP did ask how to test that barriers actually work. Since even most consumer-level drives currently implement barriers correctly, the biggest problem today, as per the O_PONIES discussions, is applications that don't do the right thing, and therefore the biggest risk is large amounts of dirty pages in system memory (either NFS client or server), not in the drive caches. Since the Linux flusher parameters are/have been demented, I have seen one or more GiB of dirty pages in system memory (on hosts I didn't configure...), which also causes performance problems. Again, as a crassly expedient thing, working around the lack of "do the right thing" in applications by letting only a few seconds of dirty pages accumulate in system memory seems to fool enough users (and many system administrators and application developers) into thinking that stuff is "safe". It worked well enough for 'ext3' for many years, quite regrettably. 
Note: 'ext3' has also had the "helpful" issue of excessive impact of flushing, which made 'fsync' performance terrible, but improved the apparent safety for "optimistic" applications. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 3:36 ` Jessie Evangelista @ 2012-03-16 17:15 ` Brian Candler -1 siblings, 0 replies; 65+ messages in thread From: Brian Candler @ 2012-03-16 17:15 UTC (permalink / raw) To: Jessie Evangelista; +Cc: Peter Grandi, Linux RAID, Linux fs XFS On Fri, Mar 16, 2012 at 11:36:07AM +0800, Jessie Evangelista wrote: > I'm still scouring the internet for a best practice recipe for > implementing xfs/mdraid. > I am open to writing one and including the inputs everyone is contributing here. > In my search, I also saw some references of alignment issues for partitions. > this is what I used to setup the partitions for the md device > > sfdisk /dev/sdb <<EOF > unit: sectors > > 63,104872257,fd > 0,0,0 > 0,0,0 > 0,0,0 > EOF > > I've read a recommendation to start the partition on the 1MB mark. > Does this make sense? I would just make the raw disks members of the RAID array, e.g. /dev/sdb, /dev/sdc etc and not partition them. ^ permalink raw reply [flat|nested] 65+ messages in thread
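[Editor's note: a sketch of Brian's whole-disk suggestion, reusing the OP's original mdadm invocation and device names (illustrative). Giving md the raw disks removes the partition table and with it the alignment question; the near-2 layout is spelled out here although it is the raid10 default.]

```shell
# Sketch: build the raid10,n2 array from whole disks, no partitions,
# as suggested above. Device names are the OP's and illustrative.
mdadm --zero-superblock /dev/sdb /dev/sdc /dev/sdd
mdadm -v --create /dev/md0 --metadata=1.2 --level=raid10 --layout=n2 \
      --chunk=256 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
```

One trade-off to be aware of: without a partition table, other tools or operating systems may see the disks as blank and offer to initialize them.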
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 3:36 ` Jessie Evangelista @ 2012-03-17 15:35 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-17 15:35 UTC (permalink / raw) To: Linux RAID, Linux fs XFS [ ... ] > I've read a recommendation to start the partition on the 1MB > mark. Does this make sense? As a general principle it is good, as it has almost no cost. Indeed recent versions of some partitioning tools do that by default. I often recommend aligning partitions to 1GiB, also because I like to have 1GiB or so of empty space at the very beginning and end of a drive. > I'd like to read about the NFS blog entry but the link you > included results in a 404. I forgot to mention in my last > reply. Oops, I forgot a bit of the URL: http://www.sabi.co.uk/blog/0707jul.html#070701b Note that currently I suggest different values from: «vm/dirty_ratio =4 vm/dirty_background_ratio =2» Because: * 4% of memory "dirty" today is often a gigantic amount. I had provided an elegant patch to specify the same in absolute terms in http://www.sabi.co.uk/blog/0707jul.html#070701 but now the official way is the "_bytes" alternative. * 2% as the level at which writing becomes uncached is too low, and the system becomes unresponsive when that level is crossed. Sure it is risky, but, regretfully, I think that maintaining responsiveness is usually better than limiting outstanding background writes. > Based on what I understood from your thoughts above, if an > application issues a flush/fsync and it does not complete due > to some catastrophic crash, xfs on its own cannot roll back > to the previous version of the file in case of an unfinished write > operation. disabling the device caches wouldn't help either > right? If your goal is to make sure incomplete updates don't get persisted, disabling device caches might help with that, in a very perverse way (if the whole partial update is still in the device cache, it just vanishes). 
Forget that of course :-). The main message is that filesystems in UNIX-like systems should not provide atomic transactions, just the means to do them at the application level, because they are both difficult and very expensive. The secondary message is that some applications and the firmware of some host adapters and drives don't do the right thing, and if you really want to make sure about atomic transactions it is an expensive and difficult system integration challenge. > [ ... ] only filesystems that do COW can do this at the > expense of performance? (btrfs and zfs, please hurry and grow > up!) Filesystems that do COW sort-of do *global* "rolling" updates, that is filetree-level snapshots, but that's a side effect of a choice made for other reasons (consistency more than currency). > [ ... ] If you were in my place with the resource constraints, > you'd go with: xfs with barriers on top of mdraid10 with > device cache ON and setting vm/dirty_bytes, [ ... ] Yes, that seems a reasonable overall tradeoff, because XFS is implemented to provide well defined (and documented) semantics, to check whether the underlying storage layer actually does barriers, and to perform decently even if "delayed" writing is not that delayed. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier (GiB alignment) 2012-03-17 15:35 ` Peter Grandi @ 2012-03-17 21:39 ` Zdenek Kaspar -1 siblings, 0 replies; 65+ messages in thread From: Zdenek Kaspar @ 2012-03-17 21:39 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS On 17.3.2012 16:35, Peter Grandi wrote: > I often recommend aligning partitions to 1GiB, also because I > like to have 1GiB or so of empty space at the very beginning and > end of a drive. I'm really curious why you use such alignment. I can think of a few reasons, but the most practical, I think, is that you like to slice in gigabyte sizes. Z. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier (GiB alignment) 2012-03-17 21:39 ` Zdenek Kaspar @ 2012-03-18 0:08 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-18 0:08 UTC (permalink / raw) To: Linux RAID, Linux fs XFS >> I often recommend aligning partitions to 1GiB, also because I >> like to have 1GiB or so of empty space at the very beginning >> and end of a drive. > I'm really curious why you use such alignment. I can think > of a few reasons, but the most practical, I think, is that you > like to slice in gigabyte sizes. Indeed, and to summarize: * As mentioned before, I usually leave a chunk of unused space at the very start and end of a drive. This is also because: - Many head landing accidents happen at the start or end of a drive. - Free space at the start: sometimes it is useful to have a few seconds of grace at the start when duplicating a drive to realize one has mistyped the name, and many partitioning or booting schemes can use a bit of free space at the start. Consider XFS and its use of sector 0. - Free space at the end: many drives have slightly different sizes, and this can cause problems when, for example, rebuilding arrays or doing backups, so leaving a bit unused can avoid a lot of trouble. * Having even sizes for partitions means that it may be easier to image-copy them from one drive to another. I often do that. Indeed I usually create partitions of a few "standard" sizes, usually tailored to fit drives that tend also to come in fairly standard increments, because drive manufacturers in each new platter generation usually aim at a fairly standard factor of improvement. Standard drive sizes tend to be, in gigabytes: 80 160 250 500 1000 1500 2000 3000. Since 80 and 160 are somewhat old and no longer used, I currently tend to do partitions in sizes like 230GiB, 460GiB, 920GiB etc. SSDs have somewhat complicated this. 
* On a contemporary drive 1GiB is a rather small fraction of the capacity of the drive, so why not just align everything to 1GiB, even if it seems pretty large in absolute terms? And if you align the first partition to start at 1GiB, and leave free space at the end, it is fairly natural to align everything in between on 1GiB boundaries. In this as in many other cases I like to buy myself some extra degrees of freedom if they are cheap. Another example I have written about previously is specifying advisedly chosen 'sunit' and 'swidth' even on non-RAID volumes, or non-parity RAID setups, not because they really improve things, but because the cost is minimal and it might come in useful later. Note: I *really* like to be able to do partition image copies, because they are so awesomely faster than treewise ones. ^ permalink raw reply [flat|nested] 65+ messages in thread
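The 1GiB alignment above is just rounding a start position up to a multiple; a sketch of the arithmetic with made-up sector numbers (512-byte sectors assumed):

```shell
# First 1GiB-aligned sector at or after a proposed start sector.
ALIGN=$((1024 * 1024 * 1024 / 512))   # 1 GiB in 512-byte sectors = 2097152
START=123456                          # arbitrary example start sector
ALIGNED=$(( (START + ALIGN - 1) / ALIGN * ALIGN ))
echo "$ALIGNED"                       # prints 2097152 for this example
```

The same round-up-to-multiple formula works for any alignment unit, e.g. the 1MiB boundary discussed earlier in the thread.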
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-17 15:35 ` Peter Grandi (?) (?) @ 2012-03-26 19:50 ` Martin Steigerwald -1 siblings, 0 replies; 65+ messages in thread From: Martin Steigerwald @ 2012-03-26 19:50 UTC (permalink / raw) To: xfs On Saturday, 17 March 2012, Peter Grandi wrote: > > I'd like to read about the NFS blog entry but the link you > > included results in a 404. I forgot to mention it in my last > > reply. > > Oops, I forgot a bit of the URL: > http://www.sabi.co.uk/blog/0707jul.html#070701b > > Note that currently I suggest different values from: > > «vm/dirty_ratio =4 > vm/dirty_background_ratio =2» Consider dirty_background_bytes as that's independent of the amount of installed memory. Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 65+ messages in thread
* NOW:Peter goading Dave over delaylog - WAS: Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 14:07 ` Peter Grandi ` (2 preceding siblings ...) (?) @ 2012-03-17 4:21 ` Stan Hoeppner 2012-03-17 22:34 ` Dave Chinner -1 siblings, 1 reply; 65+ messages in thread From: Stan Hoeppner @ 2012-03-17 4:21 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs XFS On 3/15/2012 9:07 AM, Peter Grandi wrote: >>>> I want to create a raid10,n2 using 3 1TB SATA drives. >>>> I want to create an xfs filesystem on top of it. The >>>> filesystem will be used as NFS/Samba storage. > > Consider also an 'o2' layout (it is probably the same thing for a > 3 drive RAID10) or even a RAID5, as 3 drives and this usage seems > one of the few cases where RAID5 may be plausible. It's customary to note in your message body when you decide to CC another mailing list, and why. I just got to my XFS folder and realized you'd silently CC'd XFS. This was unnecessary and simply added noise to XFS. Given some of your comments in this post I suspect you did so in an effort to goad Dave into some kind of argument WRT delayed logging performance, and his linux.conf.au presentation claims in general. Doing this via subterfuge simply reduces people's level of respect for you Peter. If you want to have the delayed logging performance discussion/argument with Dave, it should be its own thread on xfs@oss, not slipped into a thread started on another list and CC'ed here. I'm removing linux-raid from the CC list of this message, as anything further in this discussion topic is only relevant to XFS. I'm guessing either Dave chose not to take your bait, or simply didn't read your message. If the former this thread will likely die now. If the latter, and Dave decides to respond, I'm grabbing some popcorn, a beer, and a lawn chair. 
;) -- Stan ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: NOW:Peter goading Dave over delaylog - WAS: Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-17 4:21 ` NOW:Peter goading Dave over delaylog - WAS: " Stan Hoeppner @ 2012-03-17 22:34 ` Dave Chinner 2012-03-18 2:09 ` Peter Grandi 0 siblings, 1 reply; 65+ messages in thread From: Dave Chinner @ 2012-03-17 22:34 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Peter Grandi, Linux fs XFS On Fri, Mar 16, 2012 at 11:21:49PM -0500, Stan Hoeppner wrote: > I'm guessing either Dave chose not to take your bait, or simply didn't > read your message. If the former this thread will likely die now. If > the latter, and Dave decides to respond, I'm grabbing some popcorn, a > beer, and a lawn chair. ;) Just ignore the troll, Stan. I've got code to write and bugs to fix - I don't have time to waste on irrelevant semantic arguments about the definition of "performance optimisation" for improvements that are done, dusted and widely deployed. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-17 22:34 ` Dave Chinner @ 2012-03-18 2:09 ` Peter Grandi 0 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-18 2:09 UTC (permalink / raw) To: Linux fs XFS, Linux RAID [ ... ] > Just ignore the troll, Stan. It is noticeable that Stan and you have chosen to write offtopic "contributions" that contain purely personal attacks in reply to a technical point about «guidance on write-cache/barrier», but I'll try to keep ontopic: > [ ... ] irrelevant semantic arguments about the definition of > "performance optimisation" [ ... ] Oops, here there is instead a (handwaving) technical argument, so I partially retract the above. Note: I have 'grep'ed for «"performance optimisation"» and it seems to me a made-up quote for this thread, and no argument has been made by me about the «definition of "performance optimisation"», so the above point seems to me a strong misrepresentation. The (handwaving) technical argument above seems to me a laughable attempt to attribute respectability to the disregard for how important the difference is between improving speed at the same (implicit) safety level vs. 
doing so at a lower one, even more so as (implicit) safety is an important theme in this thread, and my argument (quite different from the above misrepresentation) was in essence: «There have been decent but no major improvements in XFS metadata *performance*, but weaker implicit *semantics* have been made an option, and these have a different safety/performance tradeoff (less implicit safety, somewhat more performance), not "just" better performance.» The relevance of pointing out that there is a big tradeoff is demonstrated by the honest mention in 'delaylog.txt' that «the potential for loss of metadata on a crash is much greater than for the existing logging mechanism», which seems far from merely «semantic arguments», as the potential for «many thousands of transactions that simply did not occur as a result of the crash» is not purely a matter of «semantic arguments», and indeed mattered a lot to the topic of the thread, where the 'Subject:' is: «raid10n2/xfs setup guidance on write-cache/barrier» =============================== It seems to me that http://packages.debian.org/sid/eatmydata could also be described boldly and barely as making a «significant difference in [XFS] metadata performance», because its description says «This has two side-effects: making software that writes data safely to disk a lot quicker», even if it continues «and making this software no longer crash safe.» If considering both the speed and safety aspects is irrelevant semantics, then it seems to me that http://sandeen.net/wordpress/computers/fsync-sigh/ would be about «irrelevant semantic arguments» too, instead of being a sensible discussion of tradeoffs between speed and safety. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-18 2:09 ` Peter Grandi @ 2012-03-18 11:25 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-18 11:25 UTC (permalink / raw) To: Linux fs XFS, Linux RAID > «There have been decent but no major improvements in XFS metadata > *performance*, but weaker implicit *semantics* have been made an > option, and these have a different safety/performance tradeoff > (less implicit safety, somewhat more performance), not "just" > better performance.» I have left implicit a point that perhaps should be explicit: I think that XFS metadata performance before 'delaylog' was pretty good, and that it has remained pretty good with 'delaylog'. People who complained about slow metadata performance with XFS before 'delaylog' were in effect complaining that XFS was implementing overly (in some sense) safe metadata semantics, and in effect were demanding less (implicit) safety, probably without realizing they were asking for that. Accordingly, 'delaylog' offers less (implicit) safety, and it is a good and legitimate option to have, in the same way that 'nobarrier' is also a good and legitimate option to have. So in my view 'delaylog' cannot be boldly and barely described, especially in this thread, as an improvement in XFS performance, as it is an improvement in XFS's unsafety to obtain greater speed, similar to but not as extensive as 'nobarrier'. In the same way that 'eatmydata': > The relevance of pointing out that there is a big tradeoff [ ... ] > It seems to me that http://packages.debian.org/sid/eatmydata > could also be described boldly and barely as making a > «significant difference in [XFS] metadata performance» [ ... ] is a massive improvement in unsafety, as the name says. 
Since the thread is about maximizing safety and implicit safety too, technical arguments about changes in operational semantics as to safety are entirely appropriate here, even if there are those who don't "get" them. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-18 11:25 ` Peter Grandi @ 2012-03-18 14:00 ` Christoph Hellwig -1 siblings, 0 replies; 65+ messages in thread From: Christoph Hellwig @ 2012-03-18 14:00 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs XFS, Linux RAID On Sun, Mar 18, 2012 at 11:25:14AM +0000, Peter Grandi wrote: > > «There have been decent but no major improvements in XFS metadata > > *performance*, but weaker implicit *semantics* have been made an > > option, and these have a different safety/performance tradeoff > > (less implicit safety, somewhat more performance), not "just" > > better performance.» > > I have left implicit a point that perhaps should be explicit: I > think that XFS metadata performance before 'delaylog' was pretty > good, and that it has remained pretty good with 'delaylog'. For many workloads it absolutely wasn't. > People who complained about slow metadata performance with XFS > before 'delaylog' were in effect complaining that XFS was > implementing overly (in some sense) safe metadata semantics, and > in effect were demanding less (implicit) safety, without > probably realizing they were asking for that. No, they weren't, and as with most posts to the XFS and RAID lists you are completely off the track. Please read through Documentation/filesystems/xfs-delayed-logging-design.txt and if you have any actual technical questions that you don't understand feel free to come back and ask. But please stop giving advice taken out of thin air to people on the lists who might actually believe whatever madness you just dreamed up. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-18 14:00 ` Christoph Hellwig @ 2012-03-18 19:17 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-18 19:17 UTC (permalink / raw) To: Linux RAID, Linux fs XFS [ ... ] >>> «There have been decent but no major improvements in XFS >>> metadata *performance*, but weaker implicit *semantics* >>> have been made an option, and these have a different >>> safety/performance tradeoff (less implicit safety, somewhat >>> more performance), not "just" better performance.» >> I have left implicit a point that perhaps should be explicit: I >> think that XFS metadata performance before 'delaylog' was pretty >> good, and that it has remained pretty good with 'delaylog'. > For many workloads it absolutely wasn't. My self importance is not quite as huge as feeling able to just say «absolutely wasn't» to settle points once and for all. So I would rather argue (and I did, in a different form) that for some workloads 'nobarrier'+'hdparm -W1' or 'eatmydata' have the most desirable tradeoffs, and for many others the safety/speed tradeoff of 'delaylog' is more appropriate (so for example I think that making it the default is reasonable, if a bit edgy). But also that, as the already quoted document makes very clear, 'delaylog' overall increases unsafety, and it is only thanks to this that latency and time to completion are better: http://lwn.net/Articles/476267/ http://www.mjmwired.net/kernel/Documentation/filesystems/xfs-delayed-logging-design.txt 124 [ ... ] In other 125 words, instead of there only being a maximum of 2MB of transaction changes not 126 written to the log at any point in time, there may be a much greater amount 127 being accumulated in memory. Hence the potential for loss of metadata on a 128 crash is much greater than for the existing logging mechanism. That's why my argument was that performance without 'delaylog' was good: given the safer semantics, it was quite good. 
Just perhaps not the semantics tradeoff that some people wanted in some cases, and I think that it is cheeky marketing to describe something involving a much greater «potential for loss of metadata» boldly and barely as better performance, as then one could argue that 'eatmydata' gives the best "performance". Note: the work on multithreading the journaling path is an authentic (and I guess amazingly tricky) performance improvement instead, not merely a new safety/latency/speed tradeoff similar to 'nobarrier' or 'eatmydata'. >> People who complained about slow metadata performance with XFS >> before 'delaylog' were in effect complaining that XFS was >> implementing overly (in some sense) safe metadata semantics, and >> in effect were demanding less (implicit) safety, without >> probably realizing they were asking for that. > No, they weren't, Again, my self importance is not quite as huge as feeling able to just say «No, they weren't» to settle points once and for all. Here it is not clear to me what you mean by «they weren't», but as the quote above shows, even if complainers weren't in effect «demanding less (implicit) safety», that's what they got anyhow, because that's the main (unavoidable) way to improve latency massively, given how expensive barriers are (at least on disk devices). That's how the O_PONIES story goes... > [ ... personal attacks ... ] It is noticeable that 90% of your post is pure malicious offtopic personal attack, and the rest is "from on high", and the whole is entirely devoid of technical content. ^ permalink raw reply [flat|nested] 65+ messages in thread
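The safety/speed tradeoff being argued about can be observed directly: per-file durability costs a device round trip for each write. A rough sketch (the directory, file names and counts are arbitrary; `sync FILE` assumes a coreutils version that accepts file arguments, and actual timings depend entirely on the device):

```shell
# Write 50 tiny files relying on the page cache only, then 50 more
# forcing each one to stable storage before moving on.
mkdir -p /tmp/fsync-demo && cd /tmp/fsync-demo
time for i in $(seq 1 50); do printf 'x' > "unsafe.$i"; done               # cached-only writes
time for i in $(seq 1 50); do printf 'x' > "safe.$i"; sync "safe.$i"; done # durable per-file writes
```

On rotating media with barriers honoured, the second loop is typically far slower; that gap is the price of the stronger semantics discussed in this thread.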
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-18 19:17 ` Peter Grandi @ 2012-03-19 9:07 ` Stan Hoeppner -1 siblings, 0 replies; 65+ messages in thread From: Stan Hoeppner @ 2012-03-19 9:07 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS On 3/18/2012 2:17 PM, Peter Grandi wrote: > It is noticeable that 90% of your post is pure malicious > offtopic personal attack, and the rest is "from on high", > and the whole is entirely devoid of technical content. It is noticeable that 100% of my post was technical content, directly asked questions of you, yet you chose to respond to Christoph's "personal attacks" while avoiding answering my purely technical questions. I guess we can assume your silence, your unwillingness to answer my questions, is a sign of capitulation. -- Stan ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-19 9:07 ` Stan Hoeppner @ 2012-03-20 12:34 ` Jessie Evangelista -1 siblings, 0 replies; 65+ messages in thread From: Jessie Evangelista @ 2012-03-20 12:34 UTC (permalink / raw) To: Linux RAID, Linux fs XFS Thank you everyone for your insights and comments. I made a post about how I proceeded here: http://blog.henyo.com/2012/03/cheap-and-safe-file-storage-on-linux.html In summary, I did some benchmarks with bonnie++ using different raid levels (5, 10n2, 10f2, 10o2), different chunks (64, 128, 256, 512, 1024), different file systems (xfs, ext4) and different settings for each file system. I decided to go with ext4 because I wanted to make use of the data=journal option which IMHO is safer albeit slower. ^ permalink raw reply [flat|nested] 65+ messages in thread
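[Editorial aside: for reference, the data=journal configuration chosen above might look like the following. A hedged sketch only — /dev/md0 and the mount point are placeholders, not commands taken from the linked blog post:]

```shell
# ext4 with full data journalling: file data goes through the journal
# as well as metadata, which is slower but matches the safety-first
# priority stated at the start of the thread.
mkfs.ext4 /dev/md0
# barrier=1 is already the ext4 default, spelled out here for emphasis;
# recent kernels imply nodelalloc when data=journal is selected.
mount -t ext4 -o data=journal,barrier=1,noatime /dev/md0 /mnt/raid10ext4
```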
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-18 11:25 ` Peter Grandi @ 2012-03-18 18:08 ` Stan Hoeppner -1 siblings, 0 replies; 65+ messages in thread From: Stan Hoeppner @ 2012-03-18 18:08 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs XFS, Linux RAID On 3/18/2012 6:25 AM, Peter Grandi wrote: > So in my view 'delaylog' cannot be described boldly and barely > described, especially in this thread, as an improvement in XFS > performance, as it is an improvement in XFS's unsafety to obtain > greater speed, similar to but not as extensive as 'nobarrier'. You have recommended in various past posts on multiple lists that users should max out logbsize and logbufs to increase metadata performance. You made no mention in those posts about safety as you have here. Logbufs are in-memory journal write buffers and are volatile. Delaylog uses in-memory structures that are volatile. So, why do you consider logbufs to be inherently safer than delaylog? Following the logic you've used in this thread, both should be considered equally unsafe. Yet I don't recall you ever preaching against logbufs in the past. Is it because logbufs can 'only' potentially lose 2MB worth of metadata transactions, and delaylog can potentially lose more than 2MB? > In the same way that 'eatmydata': Hardly. From: http://packages.debian.org/sid/eatmydata "This package ... transparently disable fsync ... two side-effects: ... writes data ... quicker ... no longer crash safe ... useful if particular software calls fsync(), sync() etc. frequently but *the data it stores is not that valuable to you* and you may *afford losing it in case of system crash*." So you're comparing delaylog's volatile buffer architecture to software that *intentionally and transparently disables fsync*? So do you believe a similar warning should be attached to the docs for delaylog? And thus to the use of logbufs as well? How about all write buffers/caches in the Linux kernel? 
Where exactly do you draw the line Peter, between unsafe/safe use of in-memory write buffers? Is there some magical demarcation point between synchronous serial IO, and having gigabytes of inflight write data in memory buffers? -- Stan ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-18 18:08 ` Stan Hoeppner @ 2012-03-22 21:26 ` Peter Grandi -1 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-22 21:26 UTC (permalink / raw) To: Linux RAID, Linux fs XFS [ ... ] >> So in my view 'delaylog' cannot be described boldly and >> barely described, especially in this thread, as an >> improvement in XFS performance, as it is an improvement in >> XFS's unsafety to obtain greater speed, similar to but not as >> extensive as 'nobarrier'. > You have recommended in various past posts on multiple lists > that users should max out logbsize and logbufs to increase > metadata performance. Perhaps you confuse me with DaveC (or, see later, the XFS FAQ), for example: http://oss.sgi.com/archives/xfs/2010-09/msg00113.html «> Why isn't logbsize=256k default, when it's suggested most > of the time anyway? It's suggested when people are asking about performance tuning. When the performance is acceptible with the default value, then you don't hear about it, do you?» http://oss.sgi.com/archives/xfs/2007-11/msg00918.html «# mkfs.xfs -f -l lazy-count=1,version=2,size=128m -i attr=2 -d agcount=4 <dev> # mount -o logbsize=256k <dev> <mtpt> And if you don't care about filsystem corruption on power loss: # mount -o logbsize=256k,nobarrier <dev> <mtpt>» > You made no mention in those posts about safety as you have > here. As to safety, this thread, by the explicit request of the original poster, is about safety before speed. But I already made this point above as in «especially in this thread». Also, "logbufs" have been known for a long time to have an unsafety aspect, for example there is a clear mention from 2001, but also see the quote from the XFS FAQ below: http://oss.sgi.com/archives/xfs/2001-05/msg03391.html «logbufs=4 or logbufs=8, this increases (from 2) the number of in memory log buffers. 
This means you can have more active transactions at once, and can still perform metadata changes while the log is being synced to disk. The flip side of this is that the amount of metadata changes which may be lost on crash is greater.» That's "news" from over 10 years ago... > Logbufs are in-memory journal write buffers and are volatile. > Delaylog uses in-memory structures that are volatile. So, why do > you consider logbufs to be inherently safer than delaylog? That's a quote from the 'delaylog' documentation: «the potential for loss of metadata on a crash is much greater than for the existing logging mechanism». > Following the logic you've used in this thread, both should be > considered equally unsafe. They are both unsafe (at least with applications that do not use 'fsync' appropriately), but not equally, as they have quite different semantics and behaviour, as the quote above from the 'delaylog' docs states (and see the quote from the XFS FAQ below). > Yet I don't recall you ever preaching against logbufs in the > past. Why should I preach against any of the safety/speed tradeoffs? Each of them has a domain of usability, including 'nobarrier' or 'eatmydata', or even 'sync'. > Is it because logbufs can 'only' potentially lose 2MB worth of > metadata transactions, and delaylog can potentially lose more > than 2MB? That's a quote from the 'delaylog' documentation: «In other words, instead of there only being a maximum of 2MB of transaction changes not written to the log at any point in time, there may be a much greater amount being accumulated in memory.» «What it does mean is that as far as the recovered filesystem is concerned, there may be many thousands of transactions that simply did not occur as a result of the crash.» > So you're comparing delaylog's volatile buffer architecture to > software that *intentionally and transparently disables fsync*? They are both speed-enhancing options. 
If 'delaylog' can be compared with 'nobarrier' or 'sync' as to their effects on performance, so can 'eatmydata'. The point of comparing 'sync' or 'delaylog' to 'nobarrier' or to 'eatmydata' is to justify why I think that 'delaylog' «cannot be described boldly and barely described, especially in this thread, as an improvement in XFS performance», because if the only thing that matters is the improvement in speed, then 'nobarrier' or 'eatmydata' can give better performance than 'delaylog', and to me that is an absurd argument. > So do you believe a similar warning should be attached to the > docs for delaylog? You seem unaware that a similar warning is already part of the doc for 'delaylog', and I have quoted it prominently before (and above). > And thus to the use of logbufs as well? You seem unaware that the XFS FAQ already states: http://www.xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E «For mount options, the only thing that will change metadata performance considerably are the logbsize and delaylog mount options. Increasing logbsize reduces the number of journal IOs for a given workload, and delaylog will reduce them even further. The trade off for this increase in metadata performance is that more operations may be "missing" after recovery if the system crashes while actively making modifications.» > How about all write buffers/caches in the Linux kernel? Indeed it would be a very good idea given the poor level of awareness of the downsides of more buffering/caching (not just less safety, also higher latency and even lower overall throughput in many cases). But that discussion has already happened a few times, in the various 'O_PONIES' discussions, as to that I have mentioned this page for a weary summary of the story at some point in time: http://sandeen.net/wordpress/computers/fsync-sigh/ «So now we are faced with some decisions. Should the filesystem put in hacks that offer more data safety than posix guarantees? 
Possibly. Probably. But there are tradeoffs. XFS, after giving up on the fsync-education fight long ago (note; fsync is pretty well-behaved on XFS) put in some changes to essentially fsync under the covers on close, if a file has been truncated (think file overwrite).» Note the sad «XFS, after giving up on the fsync-education fight long ago» statement. Also related to this, about defaulting to safer implicit semantics: «But now we’ve taken that control away from the apps (did they want it?) and introduced behavior which may slow down some other workloads. And, perhaps worse, encouraged sloppy app writing because the filesystem has taken care of pushing stuff to disk when the application forgets (or never knew). I dunno how to resolve this right now.» > Where exactly do you draw the line Peter, between unsafe/safe > use of in-memory write buffers? At the point where the application requirements draw it (or perhaps a bit safer than that, "just in case"). For some applications it must be tight, for others it can be loose. Quoting again from the 'delaylog' docs: «This makes it even more important that applications that care about their data use fsync() where they need to ensure application level data integrity is maintained.» which seems a straight statement that the level of safety is application-dependent. For me 'delaylog' is just a point on a line of tradeoffs going from 'sync' to 'nobarrier', it is useful as a different point, but it cannot be boldly and barely described as giving better performance, anymore than 'nobarrier' can be boldly and barely described as giving better performance than 'sync'. Unless one boldly ignores the very different semantics, something that the 'delaylog' documentation and the XFS FAQ don't do. Overselling 'delaylog' with cheeky propaganda glossing over the heavy tradeoffs involved is understandable, but quite wrong. 
Again, XFS metadata performance without 'delaylog' was pretty decent, even if speed was slow due to unusually safe semantics. -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
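[Editorial aside: the tradeoff line described above, from 'sync' through the defaults to 'nobarrier', can be written out as mount invocations. Illustrative only — device and mount point are placeholders, and on kernels from 2.6.39 onward delaylog is already the default, so these reflect the era under discussion:]

```shell
# Safest end of the line: synchronous IO, barriers on (the default).
mount -t xfs -o sync /dev/md0 /mnt/xfs

# Middle of the line: bigger in-memory log buffers plus delayed logging;
# fewer journal IOs, but more metadata may be lost on a crash.
mount -t xfs -o logbsize=256k,delaylog /dev/md0 /mnt/xfs

# Loose end: no write barriers; only defensible when the storage
# controller has persistent (battery/flash backed) write cache.
mount -t xfs -o logbsize=256k,nobarrier /dev/md0 /mnt/xfs
```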
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-22 21:26 ` Peter Grandi @ 2012-03-23 5:10 ` Stan Hoeppner -1 siblings, 0 replies; 65+ messages in thread From: Stan Hoeppner @ 2012-03-23 5:10 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS On 3/22/2012 4:26 PM, Peter Grandi wrote: [snipped 2-3 pages of redundant nonsense, linked docs, and filesystem concepts everyone is already familiar with] > Overselling 'delaylog' with cheeky propaganda glossing over the > heavy tradeoffs involved is understandable, but quite wrong. And now we come full circle to what started this mess of a discussion: Peter's dislike of Dave's presentation of delaylog, and XFS in general, at linux.conf.au. Peter, if *you* had been giving Dave's presentation at linux.conf.au, how would *you* have presented delayed logging differently? How much time would you have spent warning of the dangers of potential data loss upon a crash and how would you have presented it? Note I'm not asking you to re-critique Dave's presentation. I'm asking you to write your own short presentation of the delayed logging feature, so we can all see it done the right way, without "cheeky propaganda" and without "glossing over the heavy tradeoffs". We're all on the edge of our seats, eagerly awaiting your expert XFS presentation Peter. -- Stan ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-22 21:26 ` Peter Grandi (?) (?) @ 2012-03-23 22:48 ` Martin Steigerwald 2012-03-24 1:27 ` Peter Grandi -1 siblings, 1 reply; 65+ messages in thread From: Martin Steigerwald @ 2012-03-23 22:48 UTC (permalink / raw) To: xfs On Thursday, 22 March 2012, Peter Grandi wrote: > Overselling 'delaylog' with cheeky propaganda glossing over the > heavy tradeoffs involved is understandable, but quite wrong. Thing is, as far as I understand Dave's slides and recent entries in Kernelnewbies Linux Changes as well as the Heise Open kernel log, there have been - besides delaylog - quite a few other metadata-related performance improvements. Thus IMHO reducing the recent improvements in metadata performance to delaylog alone is underselling XFS and overselling delaylog. Unless of course all those recent performance improvements could not have been done without the delaylog mode. That said, this is just my interpretation. If all recent improvements are only due to delaylog, then I am obviously off track. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-23 22:48 ` Martin Steigerwald @ 2012-03-24 1:27 ` Peter Grandi 2012-03-24 16:27 ` GNU 'tar', Schilling's 'tar', write-cache/barrier Peter Grandi 0 siblings, 1 reply; 65+ messages in thread From: Peter Grandi @ 2012-03-24 1:27 UTC (permalink / raw) To: Linux fs XFS >> Overselling 'delaylog' with cheeky propaganda glossing over >> the heavy tradeoffs involved is understandable, but quite >> wrong. > [ ... ] there has been quite some other metadata related > performance improvements. Thus IMHO reducing the recent > improvements in metadata performance is underselling XFS and > overselling delaylog. [ ... ] That's a good way of putting it, and I am pleased that I finally get a reasonable comment on this story, and one that agrees with one of my previous points in this thread: http://www.spinics.net/lists/raid/msg37931.html «Note: the work on multithreading the journaling path is an authentic (and I guess amazingly tricky) performance improvement instead, not merely a new safety/latency/speed tradeoff similar to 'nobarrier' or 'eatmydata'.» There are two reasons why I rate the multithreading work as more important than the 'delaylog' work: * It is a *net* improvement, as it increases the potential and actual retirement rate of metadata operations without adverse impact. * It improves XFS in the area where it is strongest, which is massive and multithreaded workloads, on reliable storage systems with large IOPS. Conversely, 'delaylog' does not improve the XFS performance envelope; it seems a crowd-pleasing yet useful intermediate tradeoff between 'sync' and 'nobarrier', and the standard documents about XFS tuning make it clear that XFS is really meant to run on reliable and massive storage layers with 'nobarrier', and it is/was not aimed at «untarring kernel tarballs» with 'barrier' on. 
My suspicion is that 'delaylog' therefore is in large part a marketing device to match 'ext4' in unsafety and therefore in apparent speed for "popular" systems, as an argument to stop investing in 'ext4' and continue to invest in XFS. Consider DaveC's famous presentation (the one in which he makes absolutely no mention of the safety/speed tradeoff of 'delaylog'): http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf «There's a White Elephant in the Room.... * With the speed, performance and capability of XFS and the maturing of BTRFS, why do we need EXT4 anymore?» That's a pretty big tell :-). I agree with it BTW. In the same presentation earlier there are also these other interesting points: http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf «* Ext4 can be up 20-50x times than XFS when data is also being written as well (e.g. untarring kernel tarballs). * This is XFS @ 2009-2010. * Unless you have seriously fast storage, XFS just won't perform well on metadata modification heavy workloads.» It is never mentioned that 'ext4' is 20-50x faster on metadata modification workloads because it implements much weaker semantics than «XFS @ 2009-2010», and that 'delaylog' matches 'ext4' because it implements similarly weaker semantics, by reducing the frequency of commits, as the XFS FAQ briefly summarizes: http://www.xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E «Increasing logbsize reduces the number of journal IOs for a given workload, and delaylog will reduce them even further. 
The trade off for this increase in metadata performance is that more operations may be "missing" after recovery if the system crashes while actively making modifications.» As should be obvious by now, I think that is an outrageously cheeky omission from the «filesystem of the future» presentation, an omission that makes «XFS @ 2009-2010» seem much worse than it really was/is, making 'delaylog' then seem a more significant improvement than it is, or as you wrote «underselling XFS and overselling delaylog». Note: I wrote «improvement» above because 'delaylog' is indeed an improvement, but not to the performance of XFS, but to its functionality/flexibility: it is significant as an additional and useful speed/safety tradeoff, not as a speed improvement. The last point above «Unless you have seriously fast storage» gives away the main story: metadata intensive workloads are mostly random access workloads, and random access workloads get around 1-2MB/s out of typical disk drives, which means that if you play it safe and commit modifications frequently, you need a storage layer with massive IOPS indeed. For what I think are essentially marketing reasons, 'ext3' and 'ext4' try to be "popular" filesystems (consider the quote from Eric Sandeen's blog about the O_PONIES issue), and this has caused a lot of problems, and 'delaylog' seems to be an attempt to compete with 'ext4' in "popular" appeal. It may be good salesmanship for whoever claims the credit for 'delaylog', but advertising a massive speed improvement with colourful graphs without ever mentioning the massive improvement in unsafety seems quite cheeky to me, and I guess to you too. BTW some other interesting quotes from DaveC, the first about the aim of 'delaylog' to compete with 'ext4' on low end systems: http://lwn.net/Articles/477278/ «That's *exactly* the point of my talk - to smash this silly stereotype that XFS is only for massive, expensive servers and storage arrays. 
It is simply not true - there are more consumer NAS devices running XFS in the world than there are servers running XFS. Not to mention DVRs, or the fact that even TVs these days run XFS.» Another one instead is on the impact of the locking improvements, where metadata operations now can use many CPUs instead of the previous limit of one: http://oss.sgi.com/archives/xfs/2010-08/msg00345.html «I'm getting a 8core/16thread server being CPU bound with multithreaded unlink workloads using delaylog, so it's entirely possible that all CPU cores are fully utilised on your machine.» http://lwn.net/Articles/476617/ «I even pointed out in the talk some performance artifacts in the distribution plots that were a result of separate threads lock-stepping at times on AG resources, and that increasing the number of AGs solves the problem (and makes XFS even faster!) e.g. at 8 threads, XFS unlink is about 20% faster when I increase the number of AGs from 17 to 32 on teh same test rig. If you have a workload that has a heavy concurrent metadata modification workload, then increasing the number of AGs might be a good thing. I tend to use 2x the number of CPU cores as a general rule of thumb for such workloads but the best tunings are highly depended on the workload so you should start just by using the defaults. :)» An interesting quote from an old (1996) design document for XFS where the metadata locking issue was acknowledged: http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html «In order to support the parallelism of such a machine, XFS has only one centralized resource: the transaction log. All other resources in the file system are made independent either across allocation groups or across individual inodes. This allows inodes and blocks to be allocated and freed in parallel throughout the file system. 
The transaction log is the most contentious resource in XFS.» «As long as the log can be written fast enough to keep up with the transaction load, the fact that it is centralized is not a problem. However, under workloads which modify large amount of metadata without pausing to do anything else, like a program constantly linking and unlinking a file in a directory, the metadata update rate will be limited to the speed at which we can write the log to disk.» It is remarkable that it has taken ~15 years before the implementation needed improving. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 65+ messages in thread
* GNU 'tar', Schilling's 'tar', write-cache/barrier 2012-03-24 1:27 ` Peter Grandi @ 2012-03-24 16:27 ` Peter Grandi 2012-03-24 17:11 ` Brian Candler 0 siblings, 1 reply; 65+ messages in thread From: Peter Grandi @ 2012-03-24 16:27 UTC (permalink / raw) To: Linux fs XFS >> [ ... ] there has been quite some other metadata related >> performance improvements. Thus IMHO reducing the recent >> improvements in metadata performance is underselling XFS and >> overselling delaylog. [ ... ] > That's a good way of putting it, and I am pleased that I finally > get a reasonable comment on this story, and one that agrees with > one of my previous points in this thread: [ ... ] [ ... ] > http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf > «* Ext4 can be up 20-50x times than XFS when data is also being > written as well (e.g. untarring kernel tarballs). > * This is XFS @ 2009-2010. > * Unless you have seriously fast storage, XFS just won't > perform well on metadata modification heavy workloads.» > It is never mentioned that 'ext4' is 20-50x faster on metadata > modification workloads because it implements much weaker > semantics than «XFS @ 2009-2010», and that 'delaylog' matches > 'ext4' because it implements similarly weaker semantics, by > reducing the frequency of commits, as the XFS FAQ briefly > summarizes: [ ... ] As to this, I have realized that there is a very big detail that I have left implicit but that perhaps at this point should be made explicit as to the deliberately misleading propaganda that «Ext4 can be up 20-50x times than XFS when data is also being written as well (e.g. untarring kernel tarballs).»: Almost all «untarring kernel tarballs» "benchmarks" are done with GNU 'tar', and it does not 'fsync'. This matters because XFS has done the "right thing" with 'fsync' for a long time, and if the application does 'fsync', then 'ext4' and XFS, without or with 'delaylog', are mostly equivalent. 
Conversely Schilling's 'tar' does 'fsync' and as a result it is often considered (by the gullible crowd to which the presentation propaganda referred to above is addressed) to have less "performance" than GNU 'tar'. To illustrate I have done a tiny test '.tar' file with a directory and two files within, and this is what happens with Schilling's 'tar': $ strace -f -e trace=file,fsync,fdatasync,read,write star xf d.tar open("d.tar", O_RDONLY) = 7 read(7, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512 Process 8201 attached [ ... ] [pid 8200] lstat("d/", 0x7fff174d9490) = -1 ENOENT (No such file or directory) [pid 8200] lstat("d/", 0x7fff174d9330) = -1 ENOENT (No such file or directory) [pid 8200] access("d", F_OK) = -1 ENOENT (No such file or directory) [pid 8200] mkdir("d", 0700) = 0 [pid 8200] lstat("d/", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0 [pid 8200] lstat("d/f1", 0x7fff174d9490) = -1 ENOENT (No such file or directory) [pid 8200] open("d/f1", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4 [pid 8200] write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128 [pid 8200] fsync(4 <unfinished ...> [pid 8201] <... write resumed> ) = 1 [pid 8201] read(7, "", 10240) = 0 Process 8201 detached <... fsync resumed> ) = 0 --- SIGCHLD (Child exited) @ 0 (0) --- utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0 utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0 lstat("d/f2", 0x7fff174d9490) = -1 ENOENT (No such file or directory) open("d/f2", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4 write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096 fsync(4) = 0 utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0 utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0 utimes("d", {{1332588242, 0}, {1332588242, 0}}) = 0 write(2, "star: 1 blocks + 0 bytes (total "..., 58star: 1 blocks + 0 bytes (total of 10240 bytes = 10.00k). 
) = 58 Compare with GNU 'tar': $ strace -f -e trace=file,fsync,fdatasync,read,write tar xf d.tar [ ... ] open("d.tar", O_RDONLY) = 3 read(3, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240 [ ... ] mkdir("d", 0700) = -1 EEXIST (File exists) stat("d", {st_mode=S_IFDIR|0700, st_size=24, ...}) = 0 open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists) unlink("d/f1") = 0 open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4 write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128 close(4) = 0 utimensat(AT_FDCWD, "d/f1", {{1332589368, 193330071}, {1332588240, 0}}, 0) = 0 open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists) unlink("d/f2") = 0 open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4 write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096 close(4) = 0 utimensat(AT_FDCWD, "d/f2", {{1332589368, 193330071}, {1332588257, 0}}, 0) = 0 close(3) = 0 utimensat(AT_FDCWD, "d", {{1332589368, 193330071}, {1332588242, 0}}, 0) = 0 close(1) = 0 close(2) = 0 In effect running GNU 'tar x' (GNU 'tar') is the same as running 'eatmydata tar x ...'; and indeed as its documentation says, 'eatmydata' is designed to achieve higher "performance" by turning programs that behave like Schilling's 'tar' into programs that behave like GNU 'tar'. When GNU 'tar' is used as a "benchmark" for 'delaylog' and there are no 'fsync's, the longer the interval between commits (and thus the implicit unsafety) the higher the "performance", or at least that's the argument I think propagandists and buffoons may be using. 
That's one important reason why I mentioned 'eatmydata' as one performance enhancing technique in a group with 'nobarrier' and 'delaylog'; and why I was amused by this buffoonery: «So you're comparing delaylog's volatile buffer architecture to software that *intentionally and transparently disables fsync*?» Because when the 'delaylog' propagandists write that: «Ext4 can be up 20-50x times than XFS when data is also being written as well (e.g. untarring kernel tarballs).» it is them who are comparing "performance" using GNU 'tar' which intentionally and transparently does not use at all 'fsync'. To illustrate here are some "benchmarks", which hopefully should be revealing as to the merit of the posturings of some of the buffoons or propagandists that have been discontributing to this discussion (note that there are somewhat subtle details both as to the setup and the results): -------------------------------------------------------------- # uname -a Linux base.ty.sabi.co.uk 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64 x86_64 x86_64 GNU/Linux # egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort none /tmp tmpfs rw 0 0 /dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0 /dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0 vm.dirty_background_bytes = 900000000 vm.dirty_bytes = 500000000 vm.dirty_expire_centisecs = 2000 vm.dirty_writeback_centisecs = 1000 -------------------------------------------------------------- # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m1.027s user 0m0.105s sys 0m0.922s Dirty: 419700 kB Writeback: 0 kB real 0m5.163s user 0m0.000s sys 0m0.473s -------------------------------------------------------------- # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) star: 37343 
blocks + 0 bytes (total of 382392320 bytes = 373430.00k). real 0m1.204s user 0m0.139s sys 0m1.270s Dirty: 419456 kB Writeback: 0 kB real 0m5.012s user 0m0.000s sys 0m0.458s -------------------------------------------------------------- # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k). real 23m29.346s user 0m0.327s sys 0m2.280s Dirty: 108 kB Writeback: 0 kB real 0m0.236s user 0m0.000s sys 0m0.199s -------------------------------------------------------------- # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m46.554s user 0m0.107s sys 0m1.271s Dirty: 415168 kB Writeback: 0 kB real 1m54.913s user 0m0.000s sys 0m0.325s ---------------------------------------------------------------- # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k). real 60m15.723s user 0m0.442s sys 0m7.009s Dirty: 4 kB Writeback: 0 kB real 0m0.222s user 0m0.000s sys 0m0.194s ---------------------------------------------------------------- From the above my conclusion is that «XFS @ 2009-2010» has half the performance of 'ext4' on this workload, and that «Ext4 can be up 20-50x times than XFS when data is also being written as well (e.g. untarring kernel tarballs).» holds only when both data and metadata are written to RAM by 'ext4'. One can spend a lot of time changing parameters, as in using 'delaylog' or 'nobarrier' etc. 
I have tried my favourite, rather "tighter" flusher parameters; some comparisons that I find interesting: ---------------------------------------------------------------- # egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort none /tmp tmpfs rw 0 0 /dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0 /dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0 vm.dirty_background_bytes = 900000000 vm.dirty_bytes = 100000 vm.dirty_expire_centisecs = 200 vm.dirty_writeback_centisecs = 100 # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m6.776s user 0m0.107s sys 0m1.260s Dirty: 1776 kB Writeback: 0 kB real 0m0.231s user 0m0.000s sys 0m0.197s ---------------------------------------------------------------- # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 2m25.805s user 0m0.135s sys 0m1.812s Dirty: 2372 kB Writeback: 84 kB real 0m1.683s user 0m0.000s sys 0m0.196s ---------------------------------------------------------------- That's a bit of a surprise, because the times to completion on both filesystems were the same when the flusher parameters allowed 'eatmydata tar' to write entirely to memory. 
It looks like, when flushing, 'xfs' still does a fair bit of implicit metadata commits, as switching off barriers shows: ---------------------------------------------------------------- # mount -o remount,barrier=0 /dev/sdd3 /tmp/ext4 # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m7.388s user 0m0.127s sys 0m1.235s Dirty: 508 kB Writeback: 0 kB real 0m0.243s user 0m0.000s sys 0m0.199s ---------------------------------------------------------------- # mount -o remount,nobarrier /dev/sdd8 /tmp/xfs # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m31.047s user 0m0.124s sys 0m1.880s Dirty: 2324 kB Writeback: 24 kB real 0m0.269s user 0m0.000s sys 0m0.195s ---------------------------------------------------------------- It seems likely that 'ext4' runs headlong without commits on either metadata or data ('ext4' and 'ext3' in effect have a rather loose 'delaylog'). XFS, however, seems to be a bit at a disadvantage, as with 'nobarrier' and 'eatmydata tar' the time to completion should be the same. The partition for XFS is on inner tracks, but that does not make that much of a difference. 
Also compare with 'ext4' using 'eatmydata tar' with no barriers and using 'star' with no barrier and also 'data=writeback': ---------------------------------------------------------------- base# umount /tmp/ext4; mount -t ext4 -o defaults,barrier=0,data=writeback /dev/sdd3 /tmp/ext4 base# (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m6.158s user 0m0.123s sys 0m1.233s Dirty: 1704 kB Writeback: 0 kB real 0m0.247s user 0m0.001s sys 0m0.194s ---------------------------------------------------------------- base# (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k). real 0m32.101s user 0m0.196s sys 0m1.718s Dirty: 24 kB Writeback: 48 kB real 0m0.217s user 0m0.000s sys 0m0.193s ---------------------------------------------------------------- Finally here is on XFS, with 'delaylog', on a system with a 3.x kernel and a rather fast (especially on small random writes) SSD drive (and my usual tighter flusher parameters): ---------------------------------------------------------------- # uname -a Linux.ty.sabi.co.UK 3.0.0-15-generic #26~lucid1-Ubuntu SMP Wed Jan 25 15:37:10 UTC 2012 x86_64 GNU/Linux # egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl -a 2>/dev/null | egrep '_(bytes|centisecs)' | sort none /tmp tmpfs rw,relatime,size=1024000k 0 0 /dev/sda6 /tmp/xfs xfs rw,noatime,nodiratime,attr2,delaylog,discard,inode64,logbsize=256k,sunit=16,swidth=8192,noquota 0 0 /dev/sda3 /tmp/ext4 ext4 rw,nodiratime,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered,discard 0 0 fs.xfs.age_buffer_centisecs = 1500 fs.xfs.filestream_centisecs = 3000 fs.xfs.xfsbufd_centisecs = 100 fs.xfs.xfssyncd_centisecs = 3000 vm.dirty_background_bytes = 900000000 vm.dirty_bytes = 100000000 vm.dirty_expire_centisecs = 200 vm.dirty_writeback_centisecs = 100 
---------------------------------------------------------------- # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) real 0m5.148s user 0m0.300s sys 0m2.876s Dirty: 50052 kB Writeback: 0 kB WritebackTmp: 0 kB real 0m0.784s user 0m0.000s sys 0m0.100s ---------------------------------------------------------------- # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k). real 6m21.946s user 0m0.808s sys 0m11.321s Dirty: 0 kB Writeback: 0 kB WritebackTmp: 0 kB real 0m0.097s user 0m0.000s sys 0m0.044s ---------------------------------------------------------------- The effect of 'delaylog' is pretty obvious there. The numbers above with their wide variation depending on changes in the level of safety requested amply demonstrate that it takes the skills of a propagandist or a buffoon to boast about the "performance" of 'delaylog' and comparisons with 'ext4' without prominently mentioning the big safety tradeoffs involved. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: GNU 'tar', Schilling's 'tar', write-cache/barrier 2012-03-24 16:27 ` GNU 'tar', Schilling's 'tar', write-cache/barrier Peter Grandi @ 2012-03-24 17:11 ` Brian Candler 2012-03-24 18:35 ` Peter Grandi 0 siblings, 1 reply; 65+ messages in thread From: Brian Candler @ 2012-03-24 17:11 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs XFS On Sat, Mar 24, 2012 at 04:27:19PM +0000, Peter Grandi wrote: > -------------------------------------------------------------- > # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) > star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k). > > real 0m1.204s > user 0m0.139s > sys 0m1.270s > Dirty: 419456 kB > Writeback: 0 kB > > real 0m5.012s > user 0m0.000s > sys 0m0.458s > -------------------------------------------------------------- > # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) > star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k). > > real 23m29.346s > user 0m0.327s > sys 0m2.280s > Dirty: 108 kB > Writeback: 0 kB > > real 0m0.236s > user 0m0.000s > sys 0m0.199s But as a user, what guarantees do I *want* from tar? I think the only meaningful guarantee I might want is: "if the tar returns successfully, I want to know that all the files are persisted to disk". And of course that's what your final "sync" does, although with the unfortunate side-effect of syncing all other dirty blocks in the system too. Calling fsync() after every single file is unpacked does also achieve the desired guarantee, but at a very high cost. This is partly because you have to wait for each fsync() to return [although I guess you could spawn threads to do them] but also because the disk can't aggregate lots of small writes into one larger write, even when the filesystem has carefully allocated them in adjacent blocks. 
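[Editorial note: Brian's point about write aggregation can be seen in miniature with two shell loops - a toy sketch, not from the thread; file names and sizes are arbitrary. The first loop forces every file to stable storage on its own (Schilling-'tar'-style, via dd's 'conv=fsync'); the second lets the page cache batch everything for a single flush at the end (GNU-'tar'-plus-'sync'-style).]

```shell
dir=$(mktemp -d) && cd "$dir"

# per-file fsync: each 4k write must reach stable storage before the next starts,
# so the disk cannot aggregate the small writes into larger ones
for i in 1 2 3; do
    dd if=/dev/zero of="safe_$i" bs=4k count=1 conv=fsync 2>/dev/null
done

# buffered writes: everything lands in the page cache first...
for i in 1 2 3; do
    dd if=/dev/zero of="fast_$i" bs=4k count=1 2>/dev/null
done
sync    # ...and one flush at the end persists the lot, allowing aggregation

ls safe_* fast_*
```

On real hardware the second form is the one the flusher can turn into a few large sequential writes, which is exactly the gap the benchmarks in this thread measure.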
I think what's needed is a group fsync which says "please ensure this set of files is all persisted to disk", which is done at the end, or after every N files. If such an API exists I don't know of it. On the flip side, does fsync()ing each individual file buy you anything over and above the desired guarantee? Possibly - in theory you could safely restart an aborted untar even through a system crash. You would have to be aware that the last file which was unpacked may only have been partially written to disk, so you'd have to restart by overwriting the last item in the archive which already exists on disk. Maybe star has this feature, I don't know. And unlike zip, I don't think tarfiles are indexed, so you'd still have to read it from the beginning. If the above benchmark is typical, it suggests that fsyncing after every file is 4 times slower than untar followed by sync. So I reckon you would be better off using the fast/unsafe version, and simply restarting it from the beginning if the system crashed while you were running it. That's unless you expect the system to crash 4 or more times while you untar this single file. Just my 2¢, as a user and definitely not a filesystem expert. Regards, Brian. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 65+ messages in thread
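[Editorial note: Brian's restart-after-crash idea maps onto GNU tar's '-k' ('--keep-old-files') flag. A small sketch with a hypothetical archive - note that newer GNU tar versions treat existing files as errors under '-k' while still extracting the remaining members, hence the '|| true'.]

```shell
dir=$(mktemp -d) && cd "$dir"
mkdir src && echo one > src/a && echo two > src/b
tar -c -f arc.tar src && rm -rf src

# simulate an interrupted restore: only the first member made it to disk
tar -x -f arc.tar src/a

# rerun with -k: existing files are kept (reported as errors), the rest extracted
tar -x -k -f arc.tar 2>/dev/null || true

cat src/a    # "one": the already-restored file was left alone
cat src/b    # "two": restored on the second pass
```

As Brian notes, this is only safe if the last file written before the crash is removed first, since it may be partially written.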
* Re: GNU 'tar', Schilling's 'tar', write-cache/barrier 2012-03-24 17:11 ` Brian Candler @ 2012-03-24 18:35 ` Peter Grandi 0 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-24 18:35 UTC (permalink / raw) To: Linux fs XFS [ ... ] >> # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) >> real 0m1.204s >> Dirty: 419456 kB >> real 0m5.012s >> # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync) >> real 23m29.346s >> Dirty: 108 kB >> real 0m0.236s > But as a user, what guarantees do I *want* from tar? Ahhhh, but that depends *a lot* on the application, which may or may not be 'tar', and what you are using 'tar' for. Consider for example restoring a backup using RSYNC instead of 'tar'. > I think the only meaningful guarantee I might want is: "if the > tar returns successfully, I want to know that all the files > are persisted to disk". Perhaps in some cases, but perhaps in others not. For example if you are restoring 20TB, having to redo the whole 20TB or a significant fraction may be undesirable, and you would like to change the guarantee as you write later: > On the flip side, does fsync()ing each individual file [ > ... ] you could safely restart an aborted untar [ ... ] the > last file which was unpacked may only have been partially > written to disk [ ... ] to add "if the tar does not return successfully, I want to know that most or all the files are persisted, except the last one that was only partially written, which I want to disappear, so I can rerun 'tar -x -k' and only restore the rest of the files". > And of course that's what your final "sync" does, although > with the unfortunate side-effect of syncing all other dirty > blocks in the system too. Just to be sure: that was on a quiescent system, so in the particular case of my tests it was just on the 'tar'. [ ... 
] > I think what's needed is a group fsync which says "please > ensure this set of files is all persisted to disk", which is > done at the end, or after every N files. If such an API > exists I don't know of it. That's in part what is mentioned here: [ ... ] > If the above benchmark is typical, it suggests that fsyncing > after every file is 4 times slower than untar followed by > sync. Depends on how often and how aggressively the flusher runs, and how much memory you get. In the comparison quoted above, GNU 'tar' on 'ext4' dumps 410MB into RAM in just over 1 second plus 5 seconds for 'sync', and Schilling's 'tar' persists the lot to disk, incrementally, in 1409 seconds. The ratio is 227 times. Because that's a typical disk drive that can either do around 100MB/s with bulk sequential IO (thus the 5 seconds 'sync') or around 0.5-4MB/s with small random IO. > So I reckon you would be better off using the fast/unsafe > version, and simply restarting it from the beginning if the > system crashed while you were running it. [ ... ] That's in one very specific example with one application in one context. As to this, for a better discussion, let's go back to your original and very appropriate question: > But as a user, what guarantees do I *want* from tar? The question is very sensible as far as it goes, but it does not go far enough, because «from tar» and small 'tar' archives is just happenstance: what you should ask yourself is: But as a user, what guarantees do I *want* from filesystems and the applications that use them? That's in essence the O_PONIES question. That question can have many answers, each of them addressing a different aspect of the normative and positive situation, and I'll try to list some. The first answer is that you want to be able to choose different guarantees and costs, and know which they are. In this respect 'delaylog', properly described as an improvement in both unsafety and speed, is a good thing to have, because it is often a useful option. 
So are 'sync', 'nobarrier', and 'eatmydata'. The second answer is that as a rule users don't have the knowledge or the desire to understand the tradeoffs offered by filesystems and how they relate to the behavior of the programs (including 'tar') that they use, so there needs to be a default guarantee that most users would have chosen if they could, and this should be about more safety rather than more speed, and this was what «XFS @ 2009-2010» was doing. More to follow... _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 12:06 ` Jessie Evangelista 2012-03-15 14:07 ` Peter Grandi @ 2012-03-16 12:25 ` Stan Hoeppner 2012-03-16 18:01 ` Jon Nelson 1 sibling, 1 reply; 65+ messages in thread From: Stan Hoeppner @ 2012-03-16 12:25 UTC (permalink / raw) To: Jessie Evangelista; +Cc: linux-raid On 3/15/2012 7:06 AM, Jessie Evangelista wrote: > On Thu, Mar 15, 2012 at 1:38 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote: >> Why 256KB for chunk size? >> > For reference, the machine has 16GB memory > > I've run some benchmarks with dd trying the different chunks and 256k > seems like the sweetspot. > dd if=/dev/zero of=/dev/md0 bs=64k count=655360 oflag=direct Using dd in this manner is precisely analogous to taking your daily driver Toyota to the local drag strip, making a few runs, and observing your car can accelerate from 0-92 mph in 1320 ft. This has no correlation to daily driving on public roads. > I'll probably forgo setting the journal log file size. It seemed like > a safe optimization from what I've read. See: http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E > I just wanted to be explicit about it so that I know what is set just > in case the defaults change See: http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E Even if the XFS mount defaults change you won't notice a difference, not on this server, except for possibly delaylog if you do a lot of 'rm -rf' operations on directories containing tens of thousands of files. Delaylog is the only mount default change in many years. It occurred in 2.6.39, which is why I recommended this rev as the minimum you should choose. >> In fact, it appears you don't need to specify anything in mkfs.xfs or >> fstab, but just use the defaults. Fancy that. And the one thing that >> might actually increase your performance a little bit you didn't >> specify--sunit/swidth. 
However, since you're using mdraid, mkfs.xfs >> will calculate these for you (which is nice as mdraid10 with odd disk >> count can be a tricky calculation). Again, defaults work for a reason. >> > The reason I did not set sunit/swidth is because I read somewhere that > mkfs.xfs will calculate based on mdraid. I guess my stating of the same got lost in the rest of that paragraph. ;) > The server is for a non-profit org that I am helping out. > I think a APC Smart-UPS SC 420VA 230V may fit their shoe string budget. Given the rough server specs you presented, a 420 (260 watts) should be fine, assuming you're not running seti@home, folding@home, etc, which can double average system power draw. A 420 won't yield much battery run time but will give more than enough time for a clean shutdown. Are you sure you want a 230v model? If so I'd guess you're outside the States. Also: http://www.apcupsd.com/ APC control daemon with auto shutdown. > nightly backups will be stored on an external USB disk > is xfs going to be prone to more data loss in case the non-redundant > power supply goes out? There are bigger issues here WRT XFS and USB connected disks IIRC from some list posts. USB is prone to random device/bus disconnections due to power management in various USB controllers. XFS design assumes storage devices are persistently connected, and it does frequent background reads/writes to the device. If the USB drive is offline long enough, lots of errors are logged, and XFS can't access it, it may/will shutdown the filesystem as a safety precaution. If you want to use XFS on that external USB drive, you need to do some research first--I don't have solid answers for you here. Or simply use EXT3/4. XFS isn't going to yield any advantage with single threaded backup anyway, so maybe just going with EXT is the smart move. >> I'll appreciate your response stating "Yes, I have a UPS and >> tested/working shutdown scripts" or "I'll be implementing a UPS very >> soon." 
:) > > I don't have shutdown scripts yet but will look into it. Again: http://www.apcupsd.com/ There may be others available. > Meatware would have to do for now as the server will probably be ON > only when there's people at the office. I just hope they do proper shutdowns when they power it off. ;) > And yes I will be asking them > to not go into production without a UPS If it's a hard sell, simply explain that every laptop has a built in UPS, and that the server and its data are obviously as important, if not more, than any laptop. > Thanks for your input Stan. You're welcome. > I just updated the kernel to 3.0.0-16. > Did they take out barrier support in mdraid? or was the implementation > replaced with FUA? Write barriers, in one form or another, are there. These will never be removed or broken--too critical. The implementation may have changed. Neil can answer this much better than me. > Is there a definitive test to determine if the off the shelf consumer > sata drives honor barrier or cache flush requests? Just connect the drive and boot up. You'll see this in dmesg: sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA And: $ hdparm -I /dev/sda ... Commands/features: Enabled Supported: ... * Write cache ... * Mandatory FLUSH_CACHE * FLUSH_CACHE_EXT ... These are good indicators that the drive supports barrier operations. > I think I'd like to go with device cache turned ON and barrier enabled. You just stated the Linux defaults. Do note that XFS write barriers will ensure journal and thus filesystem integrity in a crash/power fail event. They do NOT guarantee file data integrity as file data isn't journaled. No filesystem (Linux anyway) journals data, only metadata. To prevent file data loss due to a crash/power fail, you must disable the drive write caches completely and/or use a BBWC RAID card. As you know performance is horrible with caches disabled. 
With so few users and so few data writes, you're safe running with barriers and write cache enabled. This is how most people on this list with plain HBAs run their systems. > Am still torn between ext4 and xfs i.e. which will be safer in this > particular setup. Neither is "safer" than the other. That's up to your hardware and power configuration. Pick the one you are most comfortable working with and have the most experience supporting. For this non-profit SOHO workload, XFS' advanced features will yield little, if any, performance advantage--you have too few users, disks, and too little IO. If this box had, say, 24 cores, 128GB RAM, and 192 15k SAS drives across 4 dual port SAS RAID HBAs, 8x24 drive hardware RAID10s in an mdraid linear array, with a user IO load demanding such a system--multiple GB/s of concurrent IO, then the only choice is XFS. EXT4 simply can't scale close to anything like this. All things considered, for your system, EXT4 is probably the best choice. -- Stan ^ permalink raw reply [flat|nested] 65+ messages in thread
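Stan's dmesg/hdparm checks can be scripted. A minimal sketch — it assumes the /sys/block/<dev>/queue/write_cache attribute, which appeared in kernels much newer than the 3.0 discussed here, falling back to hdparm where it is absent; sdb/sdc/sdd are the array members from this thread:

```shell
# Print each member disk's volatile write-cache mode: "write back" means the
# drive cache is on and barriers/flushes matter; "write through" means writes
# are not acknowledged from volatile cache.
for dev in sdb sdc sdd; do
  attr=/sys/block/$dev/queue/write_cache
  if [ -r "$attr" ]; then
    echo "$dev: $(cat "$attr")"
  else
    echo "$dev: no sysfs attribute; try 'hdparm -I /dev/$dev' and look for FLUSH_CACHE"
  fi
done
```

The loop prints one line per disk either way, so it is safe to run on a machine where those devices are absent.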
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 12:25 ` raid10n2/xfs setup guidance on write-cache/barrier Stan Hoeppner @ 2012-03-16 18:01 ` Jon Nelson 2012-03-16 18:03 ` Jon Nelson 0 siblings, 1 reply; 65+ messages in thread From: Jon Nelson @ 2012-03-16 18:01 UTC (permalink / raw) To: stan; +Cc: linux-raid On Fri, Mar 16, 2012 at 7:25 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote: .. > You just stated the the Linux defaults. Do note that XFS write barriers > will ensure journal and thus filesystem integrity in a crash/power fail > event. They do NOT guarantee file data integrity as file data isn't > journaled. No filesystem (Linux anyway) journals data, only metadata. .. That's not true, is it? ext3 and ext4 support journal=data. -- Jon -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 18:01 ` Jon Nelson @ 2012-03-16 18:03 ` Jon Nelson 2012-03-16 19:28 ` Peter Grandi 0 siblings, 1 reply; 65+ messages in thread From: Jon Nelson @ 2012-03-16 18:03 UTC (permalink / raw) To: stan, LinuxRaid On Fri, Mar 16, 2012 at 1:01 PM, Jon Nelson <jnelson-linux-raid@jamponi.net> wrote: > On Fri, Mar 16, 2012 at 7:25 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > .. >> You just stated the the Linux defaults. Do note that XFS write barriers >> will ensure journal and thus filesystem integrity in a crash/power fail >> event. They do NOT guarantee file data integrity as file data isn't >> journaled. No filesystem (Linux anyway) journals data, only metadata. > > .. > > That's not true, is it? ext3 and ext4 support journal=data. And btrfs supports COW (as does nilfs2) with "transactions", which should/could be similar? -- Jon -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 18:03 ` Jon Nelson @ 2012-03-16 19:28 ` Peter Grandi 0 siblings, 0 replies; 65+ messages in thread From: Peter Grandi @ 2012-03-16 19:28 UTC (permalink / raw) To: Linux RAID, Linux fs XFS [ ... ] >>> write barriers will ensure journal and thus filesystem >>> integrity in a crash/power fail event. They do NOT guarantee >>> file data integrity as file data isn't journaled. Not well expressed, as XFS barriers do ensure file data integrity, *if the application uses them* (and uses them in exactly the right way). The difference between metadata and data with XFS is that XFS itself will use barriers on metadata at the right times, because that's data to XFS, but it won't use barriers on data, leaving that entirely to the application. >>> No filesystem (Linux anyway) journals data, only metadata. >> That's not true, is it? ext3 and ext4 support journal=data. They do, because they journal blocks, which is not generally a great choice, but gives the option to journal data blocks too more easily than other choices. But it is a very special case that few people use. Also, there are significant issues with 'ext3' and 'fsync' and journaling: http://lwn.net/Articles/328363/ «There is one other important change needed to get a truly quick fsync() with ext3, though: the filesystem must be mounted in data=writeback mode. This mode eliminates the requirement that data blocks be flushed to disk ahead of metadata; in data=ordered mode, instead, the amount of data to be written guarantees that fsync() will always be slower. Switching to data=writeback eliminates those writes, but, in the process, it also turns off the feature which made ext3 seem more robust than ext4.» On a more general note, journaling and barriers are sort of distinct issues. The real purpose of barriers is to ensure that updates are actually on the recording medium, whether in the journal or directly at the final destination. 
That is barriers are used to ensure that data or metadata on the persistent layer is current. The purpose of a journal is not to ensure that the state on the persistent layer is *current*, but rather *consistent* (at a lower cost than synchronous updates), without having to be careful about the order in which the updates are made current. The updates are made consistent by writing them to the log as they are needed (not necessarily immediately), and then on recovery the order gets sorted out spatially. Currency does not imply consistency (if the updates are made current in some arbitrary order) and consistency does not imply currency (if the recording medium is kept consistent but updates are applied to it infrequently). The BSD FFS does not need a journal because it is designed to be very careful as to the order in which updates are made current, and log file systems don't aim for spatial currency. > And btrfs supports COW (as does nilfs2) with "transactions", > which should/could be similar? Not quite. They are more like "checkpoints", that is alternate root inodes that "snapshot" the state of the whole filetree at some point. These are not entirely inexpensive, and, as I learned from a talk about some recent updates to the BSD FFS: http://www.sabi.co.uk/blog/12-two.html#120222 COW filesystems like ZFS/BTRFS/... need to have a journal too to support 'fsync' in between checkpoints. BTW there are now COW versions of 'ext3' and 'ext4', with snapshotting too: http://www.sabi.co.uk/blog/12-two.html#120218b The 'freeze' feature of XFS does not rely on snapshotting, it relies on suspending all processes that are writing to the filetree, so updates are avoided for the duration. As the XFS team have been adding or planning to add various "new" features like checksums, maybe one day they will add COW to XFS too (not such an easy task when considering how large XFS extents can be, but the hole punching code can help there). 
-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 65+ messages in thread
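Peter's "if the application uses them in exactly the right way" boils down to the classic durable-replace sequence: write to a temp name, flush the data, rename over the target, then flush the directory. A shell sketch — it assumes a coreutils sync that accepts file arguments (newer than 2012-era systems), hence the plain-sync fallback:

```shell
# Durably replace $dir/config: the data is flushed before the rename makes it
# visible, and the directory is flushed so the rename itself survives a crash.
dir=$(mktemp -d)
printf 'payload v2\n' > "$dir/config.tmp"
sync "$dir/config.tmp" 2>/dev/null || sync   # fsync(2) the file data + cache flush
mv "$dir/config.tmp" "$dir/config"           # atomic rename into place
sync "$dir" 2>/dev/null || sync              # fsync(2) the directory entry
cat "$dir/config"
```

An application doing this in C would call fsync() on the file descriptor and on the directory's descriptor at the same two points; barriers/flushes are then issued by the filesystem on its behalf.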
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-16 19:28 ` Peter Grandi @ 2012-03-17 0:02 ` Stan Hoeppner -1 siblings, 0 replies; 65+ messages in thread From: Stan Hoeppner @ 2012-03-17 0:02 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux RAID, Linux fs XFS On 3/16/2012 2:28 PM, Peter Grandi wrote: > [ ... ] > >>>> write barriers will ensure journal and thus filesystem >>>> integrity in a crash/power fail event. They do NOT guarantee >>>> file data integrity as file data isn't journaled. > > Not well expressed, Given the audience, the OP, I was simply avoiding getting too deep in the weeds Peter. This thread is on the linux-raid list, not xfs@oss. You know I have a tendency to get too deep in the weeds. I think I did a nice job of balance here. ;) > as XFS barriers do ensure file data integrity, > *if the applications uses them* (and uses them in exactly the > right way). How will the OP know which, if any, of his users' desktop applications do fsyncs properly? He won't. Which is why I made the general statement, which is correct, if not elaborate, nor down in the weeds. > The difference between metadata and data with XFS is that XFS > itself will use barriers on metadata at the right times, because > that's data to XFS, but it won't use barriers on data[1], leaving > that entirely to the application. [1]File data, just to be clear >>>> No filesystem (Linux anyway) journals data, only metadata. > >>> That's not true, is it? ext3 and ext4 support journal=data. > > They do, because they journal blocks, which is not generally a > great choice, but gives the option to journal data blocks too more > easily than other choices. But it is a very special case that few > people use. Few use it because the performance is absolutely horrible. data=journal disables delayed allocation (which seriously contributes to any modern filesystem's performance--EXT devs stole/borrowed delayed allocation from XFS BTW) and it disables O_DIRECT. 
It also doubles the number of data writes to media, once to the journal, once to the filesystem, for every block of every file written. > On a more general note, journaling and barriers are sort of > distinct issues. > > The real purpose of barriers is to ensure that updates are > actually on the recording medium, whether in the journal or > directly on final destination. > That is barriers are used to ensure that data or metadata on the > persistent layer is current. Correct. Again, trying to stay out of the weeds. I'd established that XFS uses barriers on journal writes for metadata consistency, which prevents filesystem corruption after a crash, but not necessarily file corruption. Making the statement that XFS doesn't journal data gets the point across more quickly, while staying out of the weeds. [...] > The 'freeze' features of XFS does not rely on snapshotting, it > relies on suspending all processes that are writing to the > filetree, so updates are avoided for the duration. xfs_freeze was moved into the VFS in 2.6.29 and is called automatically when doing an LVM snapshot of any Linux FS supporting such. Thus, snapshotting relies on xfs_freeze, not the other way round. And xfs_freeze doesn't suspend all processes that are writing to the filesystem. All write system calls to the filesystem are simply halted, and the process blocks on IO until the filesystem is unfrozen. > As the XFS team have been adding or planning to add various "new" > features like checksums, maybe one day they will add COW to XFS > too (not such an easy task when considering how large XFS extents > can be, but the hole punching code can help there). Not at all an easy rewrite of XFS. And that's what COW would be, a massive rewrite. Copy on write definitely has some advantages for some usage scenarios, but it's not yet been proven the holy grail of filesystem design. -- Stan ^ permalink raw reply [flat|nested] 65+ messages in thread
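For reference, the two ext3/ext4 data-journaling modes being contrasted map onto mount options like these (illustrative /etc/fstab lines; the device and mount point are assumptions):

```
# data=ordered (the ext3/ext4 default): file data is flushed before the
# metadata referencing it commits to the journal
/dev/md0  /srv/data  ext4  noatime,data=ordered   0  2
# data=journal: data blocks go through the journal too -- every block is
# written twice, and delayed allocation and O_DIRECT are disabled
/dev/md0  /srv/data  ext4  noatime,data=journal   0  2
```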
* Re: raid10n2/xfs setup guidance on write-cache/barrier 2012-03-15 0:30 raid10n2/xfs setup guidance on write-cache/barrier Jessie Evangelista 2012-03-15 5:38 ` Stan Hoeppner @ 2012-03-17 22:10 ` Zdenek Kaspar 1 sibling, 0 replies; 65+ messages in thread From: Zdenek Kaspar @ 2012-03-17 22:10 UTC (permalink / raw) To: Jessie Evangelista; +Cc: linux-raid On 15.3.2012 1:30, Jessie Evangelista wrote: > I want to create a raid10,n2 using 3 1TB SATA drives. > I want to create an xfs filesystem on top of it. > The filesystem will be used as NFS/Samba storage. > > mdadm --zero /dev/sdb1 /dev/sdc1 /dev/sdd1 > mdadm -v --create /dev/md0 --metadata=1.2 --assume-clean > --level=raid10 --chunk 256 --raid-devices=3 /dev/sdb1 /dev/sdc1 > /dev/sdd1 > mkfs -t xfs -l lazy-count=1,size=128m -f /dev/md0 > mount -t xfs -o barrier=1,logbsize=256k,logbufs=8,noatime /dev/md0 > /mnt/raid10xfs > > Will my files be safe even on sudden power loss? Is barrier=1 enough? > Do i need to disable the write cache? > with: hdparm -W0 /dev/sdb /dev/sdc /dev/sdd > > I tried it but performance is horrendous. > > Am I better of with ext4? Data safety/integrity is the priority and > optimization affecting it is not acceptable. > > Thanks and any advice/guidance would be appreciated > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html I think today you're most safe with old ext3. Maybe data=journal is not the best idea, because it has a small user base. The more aggressive caching and features that lead to awesome performance will bite you harder on power loss or software bugs. Limit power-loss exposure with a UPS unit. You need to cleanly shut down the system when the UPS reaches its low battery level. That's mandatory IMO and still inexpensive. Next you can think about multiple PSU/UPS units... 
But bear in mind that you will never make it 100% safe, so keep damn good backups! HTH, Z. ^ permalink raw reply [flat|nested] 65+ messages in thread
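Zdenek's "cleanly shut down at low battery" requirement is exactly what apcupsd, recommended earlier in the thread, automates. A hedged apcupsd.conf sketch — the directive names are stock apcupsd, but the USB cabling and the threshold values are assumptions for the Smart-UPS model mentioned:

```
UPSCABLE usb        # APC USB signalling cable
UPSTYPE usb
DEVICE              # left empty: autodetect the USB-attached UPS
BATTERYLEVEL 10     # start a clean shutdown at 10% charge remaining
MINUTES 5           # ...or when estimated runtime drops to 5 minutes
TIMEOUT 0           # rely on the two thresholds above, not a fixed timer
```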