* XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-05 18:10 UTC
To: xfs

Encouraged by reading about the recent improvements to XFS, I decided to give it another try on a new server machine. I am happy to report that compared to my previous tests a few years ago, performance has progressed from unusably slow to barely acceptable, but still lagging behind ext4 -- which is a noticeable (and notable) improvement indeed ;).

The filesystem operations I care about the most are the likes which involve thousands of small files across lots of directories, like large trees of source code. For my test, I created a tarball of a finished IcedTea6 build, about 2.5 GB in size. It contains roughly 200,000 files in 20,000 directories. The test I want to report about here was extracting this tarball onto an XFS filesystem. I tested other actions as well, but they didn't reveal anything too noticeable. So the test consists of nothing but un-tarring the archive, followed by a "sync" to make sure that the time-to-disk is measured.

Prior to running it, I had populated the filesystem in the following way: I created two directory hierarchies, each containing the unpacked tarball 20 times, which I rsynced simultaneously to the target filesystem. When this was done, I deleted one half of them, creating some free space fragmentation and what I hoped would mimic real-world conditions to some degree.

So now to the test itself -- the tar "x" command returned quite fast (on the order of only a few seconds), but the following sync took ages. I created a diagram using seekwatcher, and it reveals that the disk head jumps about wildly between four zones which are written to in almost perfectly linear fashion.
When I reran the test with only a single allocation group, behavior was much better (about twice as fast). OTOH, when I continuously extracted the same tarball in a loop without syncing in-between, it would continuously slow down in the ag=1 case to the point of being unacceptably slow. The same behavior did not occur with ag=4.

I am aware that no filesystem can be optimal, but given that the entire write set -- all 2.5 GB of it -- is "known" to the file system, that is, in memory, wouldn't it be possible to write it out to disk in a somewhat more reasonable fashion?

This is the seekwatcher graph: http://dl.dropbox.com/u/5338701/dev/xfs/xfs-ag4.png

And for comparison, the same on ext4, on the same partition primed in the same way (parallel rsyncs mentioned above): http://dl.dropbox.com/u/5338701/dev/xfs/ext4.png

As can be seen from the time scale in the bottom part, the ext4 version performed about 5 times as fast because of a much more disk-friendly write pattern.

I ran the tests with a current RHEL 6.2 kernel and also with a 3.3-rc2 kernel. Both of them exhibited the same behavior. The disk hardware used was a SmartArray P400 controller with 6x 10k rpm 300 GB SAS disks in RAID 6. The server has plenty of RAM (64 GB).

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
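The procedure described above (untar, then time the sync separately) can be sketched as a small harness. The tree built here is a tiny stand-in so the sketch runs anywhere; the real test used the 2.5 GB IcedTea6 tarball on the XFS mount.

```python
# Sketch of the test procedure: untar, then sync, timing both phases
# separately. Extraction returns quickly because the pages land in the
# page cache; the seek-bound writeout happens inside sync.
import os
import tarfile
import tempfile
import time

work = tempfile.mkdtemp()

# Build a small stand-in tree (the real tarball held ~200,000 files).
src = os.path.join(work, "src")
for sub in ("a", "b"):
    os.makedirs(os.path.join(src, sub))
    for i in range(5):
        with open(os.path.join(src, sub, f"f{i}"), "w") as fh:
            fh.write(f"content {i}\n")

tar_path = os.path.join(work, "tree.tar")
with tarfile.open(tar_path, "w") as tf:
    tf.add(src, arcname="src")

# The actual measurement.
target = os.path.join(work, "target")
os.makedirs(target)
t0 = time.monotonic()
with tarfile.open(tar_path) as tf:
    tf.extractall(target)
t1 = time.monotonic()
os.sync()
t2 = time.monotonic()
print(f"extract: {t1 - t0:.2f}s  sync: {t2 - t1:.2f}s")
```

With a large tarball on rotational storage, nearly all of the elapsed time shows up in the sync phase, which is exactly what seekwatcher visualizes.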
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-05 19:56 UTC
To: Linux fs XFS

[ ... ]

> The filesystem operations I care about the most are the likes which
> involve thousands of small files across lots of directories, like
> large trees of source code. For my test, I created a tarball of a
> finished IcedTea6 build, about 2.5 GB in size. It contains roughly
> 200,000 files in 20,000 directories.

Ah, another totally inappropriate "test" of something (euphemism) insipid. The XFS mailing list regularly gets queries on this topic. Apparently not many people in the Linux culture have figured out that general-purpose filesystems cannot handle large groups of small files well, and since the beginning of computing various forms of "aggregate" files have been used for that, like 'ar' ('.a') files from UNIX, which should have been used far more commonly than has happened, and never mind things like BDB/GDBM databases. But many lazy application programmers like to use the filesystem as a small-record database; it is so easy...

> [ ... ] I ran the tests with a current RHEL 6.2 kernel and
> also with a 3.3rc2 kernel. Both of them exhibited the same
> behavior. The disk hardware used was a SmartArray p400
> controller with 6x 10k rpm 300GB SAS disks in RAID 6. The
> server has plenty of RAM (64 GB). [ ... ]

Huge hardware, but (euphemism) imaginative setup; among its many defects, RAID6 is particularly inappropriate for most small-file/metadata-heavy operation.

> [ ... ] I created two directory hierarchies, each containing
> the unpacked tarball 20 times, which I rsynced simultaneously
> to the target filesystem. When this was done, I deleted one
> half of them, creating some free space fragmentation, and what
> I hoped would mimic real-world conditions to some degree.

Your test is less (euphemism) insignificant because you tried to cope with filetree lifetime issues.

> [ ... ] disk head jumps about wildly between four zones which
> are written to in almost perfectly linear fashion.
> [ ... ] I am aware that no filesystem can be optimal,

Every filesystem can be close to optimal, just not for every workload.

> but given that the entire write set -- all 2.5 GB of it -- is
> "known" to the file system, that is, in memory, wouldn't it be
> possible to write it out to disk in a somewhat more reasonable
> fashion?

That sounds to me like a (euphemism) strategic aim: why ever should a filesystem optimize that special case? Especially given that XFS spreads file allocations across AGs because it aims for multithreaded operations, especially on RAID sets with several independent (that is, not RAID6 with small writes) arms.

Unfortunately filesystems are not psychic and cannot use predictive allocation policies, and have to cope with poorly written applications that don't do advising (or 'fsync' properly, which is even worse). So some policies get hard-wired into the filesystem "flavor". Your remedy, as you have noticed, is to tweak the filesystem logic by changing the number of AGs, and you might also want to experiment with the elevator (you seem to have forgotten about that) and other block subsystem policies, and/or with the safety vs. latency tradeoffs available at the filesystem and storage system levels.

There are many annoying details, and recentish versions of XFS try to help with the hideous hack of building an elevator inside the filesystem code itself:

  http://oss.sgi.com/archives/xfs/2010-01/msg00011.html
  http://oss.sgi.com/archives/xfs/2010-01/msg00008.html

which however is only sort of effective, because the Linux block IO subsystem has several (euphemism) appalling issues.

> As can be seen from the time scale in the bottom part, the ext4
> version performed about 5 times as fast because of a much more
> disk-friendly write pattern.

Is it really disk-friendly for every workload? Think about what happens on 'ext4' there, and when it jumps between block groups, and that it is in effect doing commits in a different order. What 'ext4' does costs dearly on other workload types.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-05 22:41 UTC
To: Linux fs XFS

[ ... ]

> Apparently not many people have figured out in the Linux
> culture that general purpose filesystems cannot handle well
> large groups of small files, and since the beginning of
> computing various forms of "aggregate" files have been used
> for that, like 'ar' ('.a') files from UNIX, which should have
> been used far more commonly than has happened, and never mind
> things like BDB/GDBM databases.

As to this, another filesystem strongly oriented at massive streaming, Lustre, is sometimes used for small-file workloads, and one of the suggestions given for that is to put the small files inside an 'ext2' filesystem in a file, and mount it via 'loop'. That is, to use 'ext2' (or some other filesystem type) as an archive format. That is less crazy than it seems. [ ... ]
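The "filesystem image as archive format" trick described above can be sketched as follows. The image path is a temporary placeholder; mkfs.ext2 is only run if e2fsprogs is installed, and the mount step (which needs root) is shown as a comment.

```python
# Sketch of the trick described above: pack many small files into one
# ext2 image file, so the underlying storage sees a single large file.
import os
import shutil
import subprocess
import tempfile

img = os.path.join(tempfile.mkdtemp(), "tree.ext2")
with open(img, "wb") as fh:
    fh.truncate(64 * 1024 * 1024)   # 64 MiB sparse image

if shutil.which("mkfs.ext2"):
    # -F: operate on a regular file without prompting, -q: quiet
    subprocess.run(["mkfs.ext2", "-F", "-q", img], check=True)

# As root, the image then mounts like any block device:
#   mount -o loop tree.ext2 /mnt/tree
#   ... populate /mnt/tree with the small files ...
#   umount /mnt/tree
print("image ready:", img)
```

The point of the design is that the host filesystem only has to allocate and write one large, mostly-sequential file, while the small-file layout work happens inside the loop-mounted image.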
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-06 14:36 UTC
To: Linux fs XFS

[ ... ]

> [ ... ] general purpose filesystems cannot handle well large
> groups of small files, [ ... ]

>> As can be seen from the time scale in the bottom part, the
>> ext4 version performed about 5 times as fast because of a much
>> more disk-friendly write pattern.

As to 'ext4' and doing (euphemism) insipid tests involving peculiar setups, there is an interesting story in this post:

  http://oss.sgi.com/archives/xfs/2012-03/msg00465.html

on the perils of using 'tar x' as a "test" of something meaningful (illustrated using a much smaller "test" than yours). The telling detail was that there was a ratio of 227 times (6 seconds versus 23 minutes) between running 'tar x' without any safety and with most safeties.

A ratio of 227 times indicates that there is something big going on, which is that contemporary disk drives have 2 orders of magnitude between bulk sequential and small random "speed" (which is the major reason why «general purpose filesystems cannot handle well large groups of small files»), and that in between one can choose a vast number of different safety/speed tradeoffs (or introduce performance problems :->).

Does that mean that 'ext4' has "Abysmal write performance" in the 23 minutes case? No, just a different tradeoff. Similarly XFS has had for a long time a mostly undeserved reputation for being "slow" on small-IO/metadata-intensive workloads, in large part because traditionally it has been designed to deliver a higher level of (implicit, metadata) safety than other filesystems; for good reasons.
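The two-orders-of-magnitude claim above can be sanity-checked with rough arithmetic. The per-IO figures below are illustrative assumptions (typical 10k rpm drive numbers), not measurements from this thread.

```python
# Rough sanity check of the sequential-vs-random gap on a 10k rpm disk.
# Assumptions: ~130 MB/s streaming throughput, ~8 ms per random IO
# (seek plus rotational latency), 4 KiB per small file.
seq_mb_s = 130.0
io_time_s = 0.008            # one small random write
file_kib = 4.0

random_mb_s = (file_kib / 1024.0) / io_time_s   # effective small-file rate
ratio = seq_mb_s / random_mb_s

print(f"random small-file writeout: {random_mb_s:.2f} MB/s")
print(f"sequential/random ratio: {ratio:.0f}x")
```

The ratio comes out at a couple of hundred, which is consistent with the 227x spread (6 seconds vs. 23 minutes) quoted above.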
Therefore, as I argued in other comments, the «excessive seeking» you report seems to me due more to storage layer issues, and perhaps to a stricter interpretation of safety by XFS, than to something really wrong with XFS, which is a tool that has to be deployed with consideration.

As to that, a comparison that does point a finger at the underlying storage system:

* In your graphs 'ext4' writes out 2.5GB of small files at around 100MB/s (and with relatively few long seeks on that workload) on an "enterprise" storage system that has 4+2 disks each capable of 130MB/s.

* In the 6s "test" I reported above, in a similar situation 'ext4' wrote out 370MB also at not much less than 100MB/s, but on a single "consumer" disk on a much slower desktop.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-06 15:37 UTC
To: Linux fs XFS

> As to 'ext4' and doing (euphemism) insipid tests involving
> peculiar setups, there is an interesting story in this post:
>
> http://oss.sgi.com/archives/xfs/2012-03/msg00465.html

I really don't see the connection to this thread. You're mostly advocating that tar use fsync on every file, which to me seems absurd. If the system goes down halfway through tar extraction, I would delete the tree and untar again. What do I care if some files are corrupt, when the entire tree is incomplete anyway?

Despite the somewhat inflammatory thread subject, I don't want to bash anyone. It's just that untarring large source trees is a very typical workload for me. And I just don't want to accept that XFS cannot do better than being several orders of magnitude slower than ext4 (speaking of binary orders of magnitude).

As I see it, both file systems give the same guarantees:

1) That upon completion of sync, all data is readily available on permanent storage.

2) That the file system metadata doesn't suffer corruption, should the system lose power during the operation.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-07 13:33 UTC
To: Linux fs XFS

>> As to 'ext4' and doing (euphemism) insipid tests involving
>> peculiar setups, there is an interesting story in this post:
>> http://oss.sgi.com/archives/xfs/2012-03/msg00465.html

> I really don't see the connection to this thread. You're
> advocating mostly that tar use fsync on every file, which to
> me seems absurd.

Rather different: I am pointing out that there is a fundamental problem, that the spectrum of safety/speed tradeoffs covers 2 orders of magnitude as to speed, and that for equivalent points XFS and 'ext4' don't perform that differently (a factor of 2 in this particular "test", which is sort of "noise").

Note: it is Schilling who advocates for 'tar' to 'fsync' every file, and he gives some pretty good reasons why that should be the default, and why that should not be that expensive (which I think is a bit optimistic). My advocacy in that thread was that having different safety/speed tradeoffs is a good thing, if they are honestly represented as tradeoffs. So it is likely that if there is a significant difference you are getting a different tradeoff, even if you may not *want* a different tradeoff.

Note: JFS and XFS are more or less as good as it gets as to "general purpose" filesystems, and when people complain about their "speed" the odds are that they are using them improperly, or in corner cases, or there is a problem in the application or storage layer. To get something better than JFS or XFS one must look at filesystems based on radically different tradeoffs, like NILFS2 (log), OCFS2 (shareable) or BTRFS (COW). In your case perhaps NILFS2 would give best results.
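The safety/speed spectrum being argued about here can be probed directly: fsync after every file (the Schilling-style default) versus writing everything and syncing once at the end. This is a self-contained sketch in a temp directory with small synthetic files; on rotational storage the gap is far larger than on a cached or SSD-backed test box.

```python
# Compare two safety/speed tradeoffs: per-file fsync() vs. writing all
# files and calling sync() once at the end.
import os
import tempfile
import time

def write_tree(root, n, fsync_each):
    t0 = time.monotonic()
    for i in range(n):
        with open(os.path.join(root, f"f{i}"), "w") as fh:
            fh.write("x" * 1024)
            if fsync_each:
                fh.flush()
                os.fsync(fh.fileno())   # durability point per file
    if not fsync_each:
        os.sync()                       # single durability point at the end
    return time.monotonic() - t0

a = tempfile.mkdtemp()
b = tempfile.mkdtemp()
t_fsync = write_tree(a, 100, fsync_each=True)
t_batch = write_tree(b, 100, fsync_each=False)
print(f"per-file fsync: {t_fsync:.3f}s  batch + sync: {t_batch:.3f}s")
```

Neither variant is "wrong"; they sit at different points on the tradeoff curve, which is exactly the argument being made above.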
And that's what seems to be happening: 'ext4' seems to commit metadata and data in spacewise order, XFS in timewise order, because the seek order on writeout probably reflects the order in which files were extracted from the 'tar' file.

> If the system goes down halfway through tar extraction, I
> would delete the tree and untar again. What do I care if some
> files are corrupt, when the entire tree is incomplete anyway?

Maybe you don't care; but filesystems are not psychic (they use hardwired and adaptive policy, not predictive), and given that most people seem to care, the default for XFS is to try harder to keep metadata durable. Also, various versions of 'tar' have options that allow continuing rather than restarting an extraction, because some people prefer that.

> [ ... ] It's just that untarring large source trees is a very
> typical workload for me.

Well, it makes a lot of difference whether you are creating an extreme corner case just to see what happens, or whether you have a real problem, even a corner-case problem, about which you have to make some compromise. The problem you have described seems rather strange:

* You write a lot of little files to memory, as you have way more memory than data.

* The whole is written out to the RAID6 in one go, on a storage layer that can do 500-700MB/s but does 1/5th of that.

* You don't do anything else with the files.

> And I just don't want to accept that XFS cannot do better than
> being several orders of magnitude slower than ext4 (speaking
> of binary orders of magnitude).

> As I see it, both file systems give the same guarantees:
> 1) That upon completion of sync, all data is readily available
> on permanent storage.
> 2) That the file system metadata doesn't suffer corruption,
> should the system lose power during the operation.

Yes, but they also give you some *implicit* guarantees that are different.
For example:

* XFS spreads out files for you so you can better take advantage of parallelism in your storage layer, and further allocations are more resistant to fragmentation.

* 'ext4' probably commits in a different and less safe order from XFS. If the storage layer rearranged IO order, this might matter a lot less.

You may not care about either, but then you are doing something very special. For example, if you were to use your freshly written sources to do a build, then conceivably spreading the files over 4 AGs means that the builds can be much quicker on a system with available hardware parallelism.

Also, *you* don't care about the order in which losses would happen, and how much, if the system crashes; but most users tend to want to avoid repeating work, because either they are not copying data, or the copy is huge and they don't want to restart it from the beginning.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Christoph Hellwig @ 2012-04-05 21:37 UTC
To: Stefan Ring; +Cc: xfs

Hi Stefan,

thanks for the detailed report. The seekwatcher graph makes it very clear that XFS is spreading I/O over the 4 allocation groups, while ext4 isn't. There are a couple of reasons why XFS is doing that, including maxing out multiple devices in a multi-device setup, and not totally killing read speed.

Can you try a few mount options for me, both all together and, if you have some time, also individually.

  -o inode64

This allows inodes to be close to data even for >1TB filesystems. It's something we hope to make the default soon.

  -o filestreams

This keeps data written in a single directory group together. Not sure your directories are large enough to really benefit from it, but it's worth a try.

  -o allocsize=4k

This disables the aggressive file preallocation we do in XFS, which sounds like it's not useful for your workload.

> I ran the tests with a current RHEL 6.2 kernel and also with a 3.3rc2
> kernel. Both of them exhibited the same behavior. The disk hardware
> used was a SmartArray p400 controller with 6x 10k rpm 300GB SAS disks
> in RAID 6. The server has plenty of RAM (64 GB).

For metadata-intensive workloads like yours you would be much better off using a non-striping raid, e.g. concatenation and mirroring instead of raid 5 or raid 6. I know this has a cost in terms of "wasted" space, but for IOPs-bound workloads the difference is dramatic.

P.s.: please ignore Peter - he's made himself a name as not only being technically incompetent but also extremely abrasive. He is in no way associated with the XFS development team.
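The suggested experiment boils down to four remount-and-retest cycles. Since remounting needs root and the device is system-specific, this sketch only prints the commands it would run; the device and mount point are placeholders.

```python
# Dry-run sketch of the suggested mount-option experiment: each option
# individually, then all three combined. DEV and MNT are placeholders.
DEV = "/dev/sdX1"      # placeholder device
MNT = "/mnt/test"      # placeholder mount point

combos = [
    "inode64",
    "filestreams",
    "allocsize=4k",
    "inode64,filestreams,allocsize=4k",
]

commands = []
for opts in combos:
    commands.append(f"umount {MNT}")
    commands.append(f"mount -o {opts} {DEV} {MNT}")
    commands.append("# rerun the tar-extract + sync test, record the time")

print("\n".join(commands))
```

Running each combination against the same primed filesystem keeps the comparison fair, since free-space layout affects the result.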
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-06 1:09 UTC
To: Linux fs XFS

[ ... ]

> For metadata intensive workloads like yours you would be much
> better using a non-striping raid, e.g. concatenation and
> mirroring instead of raid 5 or raid 6. I know this has a cost
> in terms of "wasted" space, but for IOPs bound workloads the
> difference is dramatic.

The problem with parity RAIDs and small-write-IO-intensive workloads is not striping as such; it is with *large* stripes. That is a detail that matters quite a lot.
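The point about stripe size can be quantified with textbook read-modify-write accounting. The IO counts below follow the generic RAID6 small-write scheme and are an illustration, not a model of the SmartArray P400 firmware.

```python
# Why partial-stripe writes hurt on RAID6: a write smaller than a full
# stripe forces a read-modify-write (RMW) of the touched data chunks
# plus both parity chunks. Larger stripes make full-stripe writes rarer.
def raid6_ios(write_kib, chunk_kib, data_disks):
    stripe_kib = chunk_kib * data_disks
    if write_kib > 0 and write_kib % stripe_kib == 0:
        # Full-stripe write: parity computed from new data, no reads.
        return (write_kib // stripe_kib) * (data_disks + 2)
    # Partial write: read old data + P + Q, then write new data + P + Q.
    chunks = max(1, write_kib // chunk_kib)
    return 2 * (chunks + 2)

# 4 data disks (6-disk RAID6), 64 KiB chunks -> 256 KiB full stripe.
small = raid6_ios(16, 64, 4)    # a 16 KiB partial write
full = raid6_ios(256, 64, 4)    # one aligned full-stripe write
print(f"16 KiB partial write: {small} disk IOs")
print(f"256 KiB full-stripe write: {full} disk IOs")
```

Both cases cost the same number of disk IOs, but the full-stripe write moves 16 times as much payload per IO; batching small writes into aligned full stripes is the whole game on parity RAID.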
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-06 8:25 UTC
To: Christoph Hellwig; +Cc: xfs

> thanks for the detailed report.

Thanks for the detailed and kind answer.

> Can you try a few mount options for me, both all together and if you
> have some time also individually.
>
> -o inode64
>
> This allows inodes to be close to data even for >1TB
> filesystems. It's something we hope to make the default soon.

The filesystem is not that large. It’s only 400GB. I turned it on anyway. No difference.

> -o filestreams
>
> This keeps data written in a single directory group together.
> Not sure your directories are large enough to really benefit
> from it, but it's worth a try.
>
> -o allocsize=4k
>
> This disables the aggressive file preallocation we do in XFS,
> which sounds like it's not useful for your workload.

inode64+filestreams: no difference
inode64+allocsize: no difference
inode64+filestreams+allocsize: no difference :(

> For metadata intensive workloads like yours you would be much better
> using a non-striping raid, e.g. concatenation and mirroring instead of
> raid 5 or raid 6. I know this has a cost in terms of "wasted" space,
> but for IOPs bound workloads the difference is dramatic.

Hmm, I’m sure you’re right, but I’m out of luck here. If I had 24 drives, I could think about a different organization. But with only 6 bays, I cannot give up all that space.

Although *in theory*, it *should* be possible to run fast for write-only workloads. The stripe size is 64 KB (4x16), and it’s not like data is written all over the place. So it should very well be possible to write the data out in some reasonably sized and aligned chunks. The filesystem partition itself is nicely aligned.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Martin Steigerwald @ 2012-04-07 18:57 UTC
To: xfs; +Cc: Stefan Ring, Christoph Hellwig

On Friday, 6 April 2012, Stefan Ring wrote:
[ ... ]
> Although *in theory*, it *should* be possible to run fast for
> write-only workloads. The stripe size is 64 KB (4x16), and it’s not
> like data is written all over the place. So it should very well be
> possible to write the data out in some reasonably sized and aligned
> chunks. The filesystem partition itself is nicely aligned.

And is XFS aligned to the RAID 6?

What does xfs_info display on it?

--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-10 14:02 UTC
To: Martin Steigerwald; +Cc: Christoph Hellwig, xfs

> And is XFS aligned to the RAID 6?
>
> What does xfs_info display on it?

Yes, it’s aligned.

meta-data=/dev/mapper/vg_data-lvhome isize=256    agcount=4, agsize=73233656 blks
         =                           sectsz=512   attr=2
data     =                           bsize=4096   blocks=292934624, imaxpct=5
         =                           sunit=8      swidth=32 blks
naming   =version 2                  bsize=4096   ascii-ci=0
log      =internal                   bsize=4096   blocks=143040, version=2
         =                           sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                       extsz=4096   blocks=0, rtextents=0

I changed the stripe size to 32kb in the meantime. This way, it performs slightly better.
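The xfs_info output above can be cross-checked against the stated RAID geometry: sunit and swidth are reported in filesystem blocks (bsize), and swidth should equal sunit times the number of data disks (4 of the 6 in RAID6).

```python
# Cross-check the reported xfs_info geometry against the RAID layout.
bsize = 4096          # from xfs_info
sunit_blocks = 8      # from xfs_info (filesystem blocks)
swidth_blocks = 32    # from xfs_info (filesystem blocks)
data_disks = 4        # 6-disk RAID6 has 4 data disks

su_kib = sunit_blocks * bsize // 1024    # stripe unit per disk
sw_kib = swidth_blocks * bsize // 1024   # full stripe width
print(f"stripe unit: {su_kib} KiB, stripe width: {sw_kib} KiB")
assert sw_kib == su_kib * data_disks     # geometry is self-consistent
```

This works out to a 32 KiB stripe unit and a 128 KiB stripe width, matching Stefan's note that he had changed the controller stripe size to 32kb.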
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Joe Landman @ 2012-04-10 14:32 UTC
To: xfs

On 04/10/2012 10:02 AM, Stefan Ring wrote:
> > And is XFS aligned to the RAID 6?
> >
> > What does xfs_info display on it?
>
> Yes, it’s aligned.
>
[ ... xfs_info output quoted ... ]
>
> I changed the stripe size to 32kb in the meantime. This way, it
> performs slightly better.

Try 128k to 512k for stripe size. And try to increase your agcount by (nearly) an order of magnitude.

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-10 15:56 UTC
To: Linux fs XFS

> Try 128k to 512k for stripe size. And try to increase your agcount by
> (nearly) an order of magnitude.

Would that be of any real value to anyone here, except for satisfying curiosity (which I feel as well ;))? Because frankly, it’s a lot of work, and I’m quite through with this tedious kind of activity…

My conclusion is that everything should work well if the levels below the file system behaved the way they should and brought the writes into a sane order. Apparently, both the RAID controller as well as the Linux block scheduler fail to do so. Despite the annoying nature of this state of affairs, I do believe that file systems should be able to count on the lower levels of the stack for such low-level work and not work around them; but apparently, they are often failed. Probably that’s one of the reasons why almost every file system acquires some sort of block scheduling over time.

Maybe some day, the Linux IO scheduler will do a better job. Unfortunately, by then, this entire issue will be irrelevant because nobody will be using rotational storage anymore, at least not for everyday work.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Martin Steigerwald @ 2012-04-10 18:13 UTC
To: Stefan Ring; +Cc: Christoph Hellwig, xfs

On Tuesday, 10 April 2012, Stefan Ring wrote:
> > And is XFS aligned to the RAID 6?
> >
> > What does xfs_info display on it?
>
> Yes, it’s aligned.
>
[ ... xfs_info output quoted ... ]

Hmmm, so it's not the alignment. The xfs_info output looks sane otherwise.

I have no further ideas for now. But others had, it seems. (Reading the rest of the new messages in the thread.)

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stan Hoeppner @ 2012-04-10 20:44 UTC
To: Stefan Ring; +Cc: Christoph Hellwig, xfs

On 4/10/2012 9:02 AM, Stefan Ring wrote:
> > And is XFS aligned to the RAID 6?
> >
> > What does xfs_info display on it?
>
> Yes, it’s aligned.
>
> meta-data=/dev/mapper/vg_data-lvhome

Is the LVM volume aligned to the RAID stripe? Is there a partition atop the RAID LUN and under LVM? Is the partition aligned? Why LVM anyway?

[ ... xfs_info output quoted ... ]
>
> I changed the stripe size to 32kb in the meantime. This way, it
> performs slightly better.

The devil is always in the details. Were you using partitions and LVM with the RAID1 concat testing? With the free space testing? I assumed you were directly formatting the LUN with XFS. With LVM and possibly partitions involved here, that could explain some of the mediocre performance across the board, with both EXT4 and XFS. If one wants maximum performance from their filesystem, one should typically stay away from partitions and LVM, and any other layers that can slow IO down.

--
Stan
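Stan's alignment questions reduce to one check per layer: is the layer's start offset a multiple of the full stripe width? The offsets below are assumed, typical defaults (a 1 MiB partition start, a 1 MiB LVM physical-extent/data-area offset), not Stefan's actual values.

```python
# Check each storage layer's start offset against the full stripe width.
# Offsets are illustrative defaults, not measured from Stefan's setup.
STRIPE_WIDTH = 128 * 1024     # 4 data disks x 32 KiB chunks

def aligned(offset_bytes):
    return offset_bytes % STRIPE_WIDTH == 0

partition_start = 2048 * 512                      # sector 2048 = 1 MiB
lvm_data_start = partition_start + 1024 * 1024    # PV metadata area offset

for name, off in [("partition", partition_start),
                  ("LVM data area", lvm_data_start)]:
    print(f"{name}: offset {off} bytes, aligned: {aligned(off)}")
```

Any layer that starts at a non-multiple of the stripe width shifts every "aligned" filesystem write into a partial-stripe write underneath, which is why the question has to be asked for the partition and the LVM data area separately, not just for the filesystem.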
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-10 20:44 ` Stan Hoeppner @ 2012-04-10 21:00 ` Stefan Ring 0 siblings, 0 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-10 21:00 UTC (permalink / raw) To: stan; +Cc: Christoph Hellwig, xfs > Is the LVM volume aligned to the RAID stripe? Is their a partition atop > the RAID LUN and under LVM? Is the partition aligned? Why LVM anyway? Yes, it is aligned. I followed the advice from <http://www.mysqlperformanceblog.com/2011/06/09/aligning-io-on-a-hard-disk-raid-the-theory/>. Why LVM? Because we use it on lots of servers, and there is some value to having a somewhat similar setup in development as in production. I’ve done similar tests time and again with LVM and without, and I’ve never ever measured a significant difference. I haven’t re-tested it this time, true, but I would be surprised if it would magically behave completely differently this time. > The devil is always in the details. Were you using partitions and LVM > with the RAID1 concat tesing? With the free space testing? I used LVM linear for the concatenation – one volume group made from 3 physical volumes. The pvols were on primary partitions. The one-volume RAID 6 is set up similarly; from only one pvol of course. > I assumed you were directly formatting the LUN with XFS. With LVM and > possibly partitions involved here, that could explain some of the > mediocre performance across the board, with both EXT4 and XFS. If one > wants maximum performance from their filesystem, one should typically > stay away from partitions and LVM, and any other layers that can slow IO > down. I don’t want maximum performance, I want acceptable performance ;). This means, I am satisfied with 80% or more of what’s possible, but I’m not satisfied with 15%. 
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
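[Editor's note: the alignment claims above can be verified from userspace. A minimal sketch follows, with /dev/sda1 standing in for the actual PV partition — the device names are assumptions, not taken from the thread; the stripe figures come from the xfs_info output quoted earlier (bsize=4096, sunit=8, swidth=32, i.e. a 32 KiB chunk and a 128 KiB full data stripe).]

```shell
# Partition start in sectors; for a 32 KiB chunk / 128 KiB data stripe,
# the start should be a multiple of 256 sectors (128 KiB):
parted /dev/sda unit s print

# Offset of the LVM data area inside the PV; pe_start should also
# fall on a full-stripe boundary:
pvs -o pv_name,pe_start --units k

# When creating the PV from scratch, alignment can be forced explicitly:
pvcreate --dataalignment 128k /dev/sda1
```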
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring 2012-04-05 19:56 ` Peter Grandi 2012-04-05 21:37 ` Christoph Hellwig @ 2012-04-05 22:32 ` Roger Willcocks 2012-04-06 7:11 ` Stefan Ring 2012-04-05 23:07 ` Peter Grandi 3 siblings, 1 reply; 64+ messages in thread From: Roger Willcocks @ 2012-04-05 22:32 UTC (permalink / raw) To: Stefan Ring; +Cc: xfs http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s10.html On 5 Apr 2012, at 19:10, Stefan Ring wrote: > Encouraged by reading about the recent improvements to XFS, I decided > to give it another try on a new server machine. I am happy to report > that compared to my previous tests a few years ago, performance has > progressed from unusably slow to barely acceptable, but still lagging > behind ext4, which is a noticeable (and notable) improvement indeed > ;). > > The filesystem operations I care about the most are the likes which > involve thousands of small files across lots of directories, like > large trees of source code. For my test, I created a tarball of a > finished IcedTea6 build, about 2.5 GB in size. It contains roughly > 200,000 files in 20,000 directories. The test I want to report about > here was extracting this tarball onto an XFS filesystem. I tested > other actions as well, but they didn't reveal anything too noticeable. > > So the test consists of nothing but un-tarring the archive, followed > by a "sync" to make sure that the time-to-disk is measured. Prior to > running it, I had populated the filesystem in the following way: > > I created two directory hierarchies, each containing the unpacked > tarball 20 times, which I rsynced simultaneously to the target > filesystem. 
When this was done, I deleted one half of them, creating > some free space fragmentation, and what I hoped would mimic real-world > conditions to some degree. > > So now to the test itself -- the tar "x" command returned quite fast > (on the order of only a few seconds), but the following sync took > ages. I created a diagram using seekwatcher, and it reveals that the > disk head jumps about wildly between four zones which are written to > in almost perfectly linear fashion. > > When I reran the test with only a single allocation group, behavior > was much better (about twice as fast). > > OTOH, when I continuously extracted the same tarball in a loop without > syncing in-between, it would continuously slow down in the ag=1 case > to the point of being unacceptably slow. The same behavior did not > occur with ag=4. > > I am aware that no filesystem can be optimal, but given that the > entire write set -- all 2.5 GB of it -- is "known" to the file system, > that is, in memory, wouldn't it be possible to write it out to disk in > a somewhat more reasonable fashion? > > This is the seekwatcher graph: > http://dl.dropbox.com/u/5338701/dev/xfs/xfs-ag4.png > > And for comparison, the same on ext4, on the same partition primed in > the same way (parallel rsyncs mentioned above): > http://dl.dropbox.com/u/5338701/dev/xfs/ext4.png > > As can be seen from the time scale in the bottom part, the ext4 > version performed about 5 times as fast because of a much more > disk-friendly write pattern. > > I ran the tests with a current RHEL 6.2 kernel and also with a 3.3rc2 > kernel. Both of them exhibited the same behavior. The disk hardware > used was a SmartArray p400 controller with 6x 10k rpm 300GB SAS disks > in RAID 6. The server has plenty of RAM (64 GB). 
> > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
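[Editor's note: the test procedure quoted above — un-tar followed by sync so that actual writeback is measured, not just page-cache fills — can be sketched as a one-liner; the tarball path and mount point are placeholders.]

```shell
# Time extraction plus the flush to disk together, so the measurement
# includes writeback rather than only the in-memory "tar x" phase:
time sh -c 'tar -xf /tmp/icedtea6-build.tar -C /mnt/test && sync'
```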
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-05 22:32 ` Roger Willcocks @ 2012-04-06 7:11 ` Stefan Ring 2012-04-06 8:24 ` Stefan Ring 0 siblings, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-06 7:11 UTC (permalink / raw) To: Roger Willcocks; +Cc: xfs On Fri, Apr 6, 2012 at 12:32 AM, Roger Willcocks <roger@filmlight.ltd.uk> wrote: > http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s10.html This sounds like it could help very much. I'll try that. Thanks! _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 7:11 ` Stefan Ring @ 2012-04-06 8:24 ` Stefan Ring 0 siblings, 0 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-06 8:24 UTC (permalink / raw) To: Roger Willcocks; +Cc: xfs On Fri, Apr 6, 2012 at 9:11 AM, Stefan Ring <stefanrin@gmail.com> wrote: > On Fri, Apr 6, 2012 at 12:32 AM, Roger Willcocks <roger@filmlight.ltd.uk> wrote: >> http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch06s10.html > > This sounds like it could help very much. I'll try that. Thanks! Unfortunately, the documentation says it’s only effective for filesystems > 1TB, which mine isn’t. I tried it, and it doesn’t make a difference, which is to be expected. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring ` (2 preceding siblings ...) 2012-04-05 22:32 ` Roger Willcocks @ 2012-04-05 23:07 ` Peter Grandi 2012-04-06 0:13 ` Peter Grandi ` (2 more replies) 3 siblings, 3 replies; 64+ messages in thread From: Peter Grandi @ 2012-04-05 23:07 UTC (permalink / raw) To: Linux fs XFS [ ... ] > [ ... ] tarball of a finished IcedTea6 build, about 2.5 GB in > size. It contains roughly 200,000 files in 20,000 directories. > [ ... ] given that the entire write set -- all 2.5 GB of it -- > is "known" to the file system, that is, in memory, wouldn't it > be possible to write it out to disk in a somewhat more > reasonable fashion? [ ... ] The disk hardware used was a > SmartArray p400 controller with 6x 10k rpm 300GB SAS disks in > RAID 6. The server has plenty of RAM (64 GB). On reflection, this triggers an aside for me: traditional filesystem types are designed for the case where the ratio is the opposite, something like a 64GB data collection to process and 2.5GB of RAM, and where therefore the issue is minimizing ongoing disk accesses, not the upload from memory to disk of a bulk sparse set of stuff. The Sprite Log-structured File System was a design targeted at large-memory systems, assuming that writes are then the issue (especially as Sprite was network-based), and reads would mostly happen from RAM, as in your (euphemism) insipid test. I suspect that if the fundamental tradeoffs are inverted, then a completely different design like an LFS might be appropriate. 
But the above has a relationship to your (euphemism) unwise concerns: the case where 200,000 files for 2.5GB are completely written to RAM and then flushed as a whole to disk is not only "untraditional", it is also (euphemism) peculiar: try setting the flusher to run rather often so that no more than 100-300MB of dirty pages are left at any one time. Which brings up another subject: usually hw RAID host adapters have cache, and firmware that cleverly rearranges writes. Looking at the specs of the P400: http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/ it seems to me that it has a standard 256MB of cache, and only supports RAID6 with a battery-backed write cache (wise!). Which means that your Linux-level seek graphs may not be so useful, because the host adapter may be drastically rearranging the seek patterns, and you may need to tweak the P400 elevator, rather than or in addition to the Linux elevator. Unless possibly barriers are enabled, and even with a BBWC the P400 writes through on receiving a barrier request. IIRC XFS is rather stricter in issuing barrier requests than 'ext4', and you may be seeing more the effect of that than the effect of splitting the access patterns between 4 AGs to improve the potential for multithreading (which you deny because you are using what is most likely a large RAID6 stripe size with a small IO-intensive write workload, as previously noted). _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
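[Editor's note: Peter's suggestion to keep the amount of dirty memory small can be sketched with the kernel's writeback sysctls. The byte values below simply encode the 100-300MB range from his message, not tuned recommendations.]

```shell
# Start background writeback once ~100MB of pages are dirty, and
# throttle writers outright at ~300MB, instead of letting the default
# ratio-based thresholds accumulate gigabytes of dirty pages on a
# 64GB machine:
sysctl -w vm.dirty_background_bytes=$((100 * 1024 * 1024))
sysctl -w vm.dirty_bytes=$((300 * 1024 * 1024))
```

Setting the `*_bytes` variants automatically disables the corresponding `*_ratio` knobs.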
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-05 23:07 ` Peter Grandi @ 2012-04-06 0:13 ` Peter Grandi 2012-04-06 7:27 ` Stefan Ring 2012-04-06 0:53 ` Peter Grandi 2012-04-06 5:53 ` Stefan Ring 2 siblings, 1 reply; 64+ messages in thread From: Peter Grandi @ 2012-04-06 0:13 UTC (permalink / raw) To: Linux fs XFS [ ... ] > Which means that your Linux-level seek graphs may not be so > useful, because the host adapter may be drastically rearranging > the seek patterns, and you may need to tweak the P400 elevator, > rather than or in addition to the Linux elevator. > Unless possibly barriers are enabled, and even with a BBWC the > P400 writes through on receiving a barrier request. IIRC XFS is > rather stricter in issuing barrier requests than 'ext4', and you > may be seeing more the effect of that than the effect of > splitting the access patterns between 4 AGs [ ... ] As to this, in theory even having split the files among 4 AGs, the upload from system RAM to host adapter RAM and then to disk could happen by writing first all the dirty blocks for one AG, then a long seek to the next AG, and so on, and the additional cost of 3 long seeks would be negligible. That you report a significant slowdown indicates that this is not happening, and that likely XFS flushing is happening not in spacewise order but in timewise order. The seek graphs you have gathered indeed indicate that with 'ext4' there is a spacewise flush, while with XFS the flush alternates constantly among the 4 AGs, instead of doing each AG in turn. Which seems to indicate an elevator issue or a barrier issue after the delayed allocator has assigned block addresses to the various pages being flushed. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 0:13 ` Peter Grandi @ 2012-04-06 7:27 ` Stefan Ring 2012-04-06 23:28 ` Stan Hoeppner 2012-04-07 16:50 ` Peter Grandi 0 siblings, 2 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-06 7:27 UTC (permalink / raw) To: Linux fs XFS > As to this, in theory even having split the files among 4 AGs, > the upload from system RAM to host adapter RAM and then to disk > could happen by writing first all the dirty blocks for one AG, > then a long seek to the next AG, and so on, and the additional > cost of 3 long seeks would be negligible. Yes, that’s exactly what I had in mind, and what prompted me to write this post. It would be about 10 times as fast. That’s what bothers me so much. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 7:27 ` Stefan Ring @ 2012-04-06 23:28 ` Stan Hoeppner 2012-04-07 7:27 ` Stefan Ring ` (2 more replies) 2012-04-07 16:50 ` Peter Grandi 1 sibling, 3 replies; 64+ messages in thread From: Stan Hoeppner @ 2012-04-06 23:28 UTC (permalink / raw) To: Stefan Ring; +Cc: Linux fs XFS On 4/6/2012 2:27 AM, Stefan Ring wrote: >> As to this, in theory even having split the files among 4 AGs, >> the upload from system RAM to host adapter RAM and then to disk >> could happen by writing first all the dirty blocks for one AG, >> then a long seek to the next AG, and so on, and the additional >> cost of 3 long seeks would be negligible. > > Yes, that’s exactly what I had in mind, and what prompted me to write > this post. It would be about 10 times as fast. That’s what bothers me > so much. XFS is still primarily a "lots and large" filesystem. Its allocation group based design is what facilitates this. Very wide stripe arrays have horrible performance for most workloads, especially random IOPS heavy workloads, and you won't see hardware that will allow arrays of hundreds, let alone dozens of spindles in a RAID stripe set. Say one needs a high IOPS single 50TB filesystem. We could use 4 Nexsan E60 arrays each containing 60 15k SAS drives of 450GB each, 240 drives total. Creating four 60 drive RAID10 arrays, let alone 60 drive RAID6 arrays, would be silly. Instead, a far more optimal solution would be to set aside 4 spares per chassis and create 14 four drive RAID10 arrays. This would yield ~600 seeks/sec and ~400MB/s sequential throughput performance per 2 spindle array. We'd stitch the resulting 56 hardware RAID10 arrays together in an mdraid linear (concatenated) array. 
Then we'd format this 112 effective spindle linear array with simply: $ mkfs.xfs -d agcount=56 /dev/md0 Since each RAID10 is 900GB capacity, we have 56 AGs of just under the 1TB limit, 1 AG per 2 physical spindles. Due to the 2 stripe spindle nature of the constituent hardware RAID10 arrays, we don't need to worry about aligning XFS writes to the RAID stripe width. The hardware cache will take care of filling the small stripes. Now we're in the opposite situation of having too many AGs per spindle. We've put 2 spindles in a single AG and turned the seek starvation issue on its head. Given a workload with at least 56 threads, we can write 56 files in parallel at ~400MB/s each, one to each AG, 22.4GB/s aggregate throughput. With this particular hardware, the 16x8Gb FC ports limit total one way bandwidth to 12.8GB/s aggregate, or "only" 228MB/s per AG. Not too shabby. But streaming bandwidth isn't the workload here. This setup will allow for ~30,000 random write IOPS with 56 writers. Not that impressive compared to SSD, but you've got 50TB of space instead of a few hundred gigs. The moral of this story is this: If XFS behaved the way you opine above, each of these 56 AGs would be written in a serial fashion, basically limiting the throughput of 112 effective 15k SAS spindles to something along the lines of only ~400MB/s and ~600 random IOPS. Note that this hypothetical XFS storage system is tiny compared to some of those in the wild. NASA's Advanced Supercomputing Division alone has deployed 500TB+ XFS filesystems on nested concatenated/striped arrays. So while the XFS AG architecture may not be perfectly suited to your single 6 drive RAID6 array, it still gives rather remarkable performance given that the same architecture can scale pretty linearly to the heights above, and far beyond. Something EXTx and others could never dream of. Some of the SGI guys might be able to confirm deployed single XFS filesystems spanning 1000+ drives in the past. 
Today we'd probably only see that scale with CXFS. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 23:28 ` Stan Hoeppner @ 2012-04-07 7:27 ` Stefan Ring 2012-04-07 8:53 ` Emmanuel Florac ` (2 more replies) 2012-04-07 8:49 ` Emmanuel Florac 2012-04-09 14:21 ` Geoffrey Wehrman 2 siblings, 3 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-07 7:27 UTC (permalink / raw) To: stan; +Cc: Linux fs XFS > Instead, a far more optimal solution would be to set aside 4 spares per > chassis and create 14 four drive RAID10 arrays. This would yield ~600 > seeks/sec and ~400MB/s sequential throughput performance per 2 spindle > array. We'd stitch the resulting 56 hardware RAID10 arrays together in > an mdraid linear (concatenated) array. Then we'd format this 112 > effective spindle linear array with simply: > > $ mkfs.xfs -d agcount=56 /dev/md0 > > Since each RAID10 is 900GB capacity, we have 56 AGs of just under the > 1TB limit, 1 AG per 2 physical spindles. Due to the 2 stripe spindle > nature of the constituent hardware RAID10 arrays, we don't need to worry > about aligning XFS writes to the RAID stripe width. The hardware cache > will take care of filling the small stripes. Now we're in the opposite > situation of having too many AGs per spindle. We've put 2 spindles in a > single AG and turned the seek starvation issues on its head. So it sounds like, for poor guys like us who can’t afford the hardware to have dozens of spindles, the best option would be to create the XFS file system with agcount=1? That seems to be the only reasonable conclusion to me, since a single RAID device, like a single disk, cannot write in parallel anyway. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 7:27 ` Stefan Ring @ 2012-04-07 8:53 ` Emmanuel Florac 2012-04-07 14:57 ` Stan Hoeppner 2012-04-09 0:19 ` Dave Chinner 2 siblings, 0 replies; 64+ messages in thread From: Emmanuel Florac @ 2012-04-07 8:53 UTC (permalink / raw) To: Stefan Ring; +Cc: stan, Linux fs XFS On Sat, 7 Apr 2012 09:27:50 +0200, you wrote: > So it sounds like that for poor guys like us, who can’t afford the > hardware to have dozens of spindles, the best option would be to > create the XFS file system with agcount=1? That seems to be the only > reasonable conclusion to me, since a single RAID device, like a single > disk, cannot write in parallel anyway. Your best option is to buy an SSD. Seriously, even a basic decent consumer model will bury your array in the dust. Also, recent RAID controllers from LSI and Adaptec are able to "enhance" a spinning rust array by using an SSD as a cache. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 7:27 ` Stefan Ring 2012-04-07 8:53 ` Emmanuel Florac @ 2012-04-07 14:57 ` Stan Hoeppner 2012-04-09 11:02 ` Stefan Ring 2012-04-09 0:19 ` Dave Chinner 2 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-07 14:57 UTC (permalink / raw) To: Stefan Ring; +Cc: Linux fs XFS On 4/7/2012 2:27 AM, Stefan Ring wrote: >> Instead, a far more optimal solution would be to set aside 4 spares per >> chassis and create 14 four drive RADI10 arrays. This would yield ~600 >> seeks/sec and ~400MB/s sequential throughput performance per 2 spindle >> array. We'd stitch the resulting 56 hardware RAID10 arrays together in >> an mdraid linear (concatenated) array. Then we'd format this 112 >> effective spindle linear array with simply: >> >> $ mkfs.xfs -d agcount=56 /dev/md0 >> >> Since each RAID10 is 900GB capacity, we have 56 AGs of just under the >> 1TB limit, 1 AG per 2 physical spindles. Due to the 2 stripe spindle >> nature of the constituent hardware RAID10 arrays, we don't need to worry >> about aligning XFS writes to the RAID stripe width. The hardware cache >> will take care of filling the small stripes. Now we're in the opposite >> situation of having too many AGs per spindle. We've put 2 spindles in a >> single AG and turned the seek starvation issues on its head. > > So it sounds like that for poor guys like us, who can’t afford the > hardware to have dozens of spindles, the best option would be to > create the XFS file system with agcount=1? Not at all. You can achieve this performance with the 6 300GB spindles you currently have, as Christoph and I both mentioned. You simply lose one spindle of capacity, 300GB, vs your current RAID6 setup. Make 3 RAID1 pairs in the p400 and concatenate them. If the p400 can't do this concat the mirror pair devices with md --linear. Format the resulting Linux block device with the following and mount with inode64. 
$ mkfs.xfs -d agcount=3 /dev/[device] That will give you 1 AG per spindle, 3 horizontal AGs total instead of 4 vertical AGs as you get with default striping setup. This is optimal for your high IOPS workload as it eliminates all 'extraneous' seeks yielding a per disk access pattern nearly identical to EXT4. And it will almost certainly outrun EXT4 on your RAID6 due mostly to the eliminated seeks, but also to elimination of parity calculations. You've wiped the array a few times in your testing already right, so one or two more test setups should be no sweat. Give it a go. The results will be pleasantly surprising. > That seems to be the only > reasonable conclusion to me, since a single RAID device, like a single > disk, cannot write in parallel anyway. It's not a reasonable conclusion. And both striping and concat arrays write in parallel, just a different kind of parallel. The very coarse description (for which I'll likely take heat) is that striping 'breaks up' one file into stripe_width number of blocks, then writes all the blocks, one to each disk, in parallel, until all the blocks of the file are written. Conversely, with a concatenated array, since XFS writes each file to a different AG, and each spindle is 1 AG in this case, each file's blocks are written serially to one disk. But we can have 3 of these going in parallel with 3 disks. The former method relies on being able to neatly pack a file's blocks into stripes that are written in parallel, to get max write performance. This is irrelevant with a concat. We write all the blocks until the file is written, and we waste no rotation or seeks in the process as can be the case with partial stripe width writes on striped arrays. The only thing we "waste" is some disk space. Everyone knows parity equals lower write IOPS, and knows of the disk space tradeoff with non-parity RAID to get maximum IOPS. 
And since we're talking EXT4 vs XFS, make the playing field level by testing EXT4 on a p400 based RAID10 of these 6 drives and compare the results to the concat. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
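[Editor's note: Stan's recipe above might be sketched as follows if the p400 cannot concatenate the mirrors itself. The /dev/sdb, /dev/sdc, /dev/sdd names for the three RAID1 LUNs and the mount point are assumptions, not taken from the thread.]

```shell
# Concatenate the three hardware RAID1 pairs into one linear md device:
mdadm --create /dev/md0 --level=linear --raid-devices=3 \
    /dev/sdb /dev/sdc /dev/sdd

# One allocation group per RAID1 pair, i.e. per effective spindle:
mkfs.xfs -d agcount=3 /dev/md0

# inode64 lets XFS spread new directories (and their inodes and data)
# across all AGs instead of packing inodes into the low part of the device:
mount -o inode64 /dev/md0 /mnt/test
```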
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 14:57 ` Stan Hoeppner @ 2012-04-09 11:02 ` Stefan Ring 2012-04-09 12:48 ` Emmanuel Florac 2012-04-09 23:38 ` Stan Hoeppner 0 siblings, 2 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-09 11:02 UTC (permalink / raw) To: stan; +Cc: Linux fs XFS > Not at all. You can achieve this performance with the 6 300GB spindles > you currently have, as Christoph and I both mentioned. You simply lose > one spindle of capacity, 300GB, vs your current RAID6 setup. Make 3 > RAID1 pairs in the p400 and concatenate them. If the p400 can't do this > concat the mirror pair devices with md --linear. Format the resulting > Linux block device with the following and mount with inode64. > > $ mkfs.xfs -d agcount=3 /dev/[device] > > That will give you 1 AG per spindle, 3 horizontal AGs total instead of 4 > vertical AGs as you get with default striping setup. This is optimal > for your high IOPS workload as it eliminates all 'extraneous' seeks > yielding a per disk access pattern nearly identical to EXT4. And it > will almost certainly outrun EXT4 on your RAID6 due mostly to the > eliminated seeks, but also to elimination of parity calculations. > You've wiped the array a few times in your testing already right, so one > or two more test setups should be no sweat. Give it a go. The results > will be pleasantly surprising. Well I had to move around quite a bit of data, but for the sake of completeness, I had to give it a try. With a nice and tidy fresh XFS file system, performance is indeed impressive – about 16 sec for the same task that would take 2 min 25 before. So that’s about 150 MB/sec, which is not great, but for many tiny files it would perhaps be a bit unreasonable to expect more. A simple copy of the tar onto the XFS file system yields the same linear performance, the same as with ext4, btw. 
So 150 MB/sec seems to be the best these disks can do, meaning that theoretically, with 3 AGs, it should be able to reach 450 MB/sec under optimal conditions. I will still do a test with the free space fragmentation priming on the concatenated AG=3 volume, because it seems to be rather slow as well. But then I guess I’m back to ext4 land. XFS just doesn’t offer enough benefits in this case to justify the hassle. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 11:02 ` Stefan Ring @ 2012-04-09 12:48 ` Emmanuel Florac 2012-04-09 12:53 ` Stefan Ring 2012-04-09 23:38 ` Stan Hoeppner 1 sibling, 1 reply; 64+ messages in thread From: Emmanuel Florac @ 2012-04-09 12:48 UTC (permalink / raw) To: Stefan Ring; +Cc: stan, Linux fs XFS On Mon, 9 Apr 2012 13:02:27 +0200, you wrote: > So 150 MB/sec seems to be the > best these disks can do, Definitely NOT right. I mean, I routinely get 600 MB/s from 8-drive 7.2K RAID-6 arrays. Unless the cache is off, which RAID-6 obviously isn't meant for. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 12:48 ` Emmanuel Florac @ 2012-04-09 12:53 ` Stefan Ring 2012-04-09 13:03 ` Emmanuel Florac 0 siblings, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-09 12:53 UTC (permalink / raw) To: Emmanuel Florac; +Cc: stan, Linux fs XFS >> So 150 MB/sec seems to be the >> best these disks can do, > > Definitely NOT right. I mean I've got routinely 600 MB/s from 8 7.2K > drives RAID-6 arrays. Unless cache is off, which RAID-6 isn't obviously > thought for. In this case it was a 2-disk RAID 1, so it’s 150 MB/s per disk. Seems quite right to me. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 12:53 ` Stefan Ring @ 2012-04-09 13:03 ` Emmanuel Florac 0 siblings, 0 replies; 64+ messages in thread From: Emmanuel Florac @ 2012-04-09 13:03 UTC (permalink / raw) To: Stefan Ring; +Cc: stan, Linux fs XFS On Mon, 9 Apr 2012 14:53:14 +0200, you wrote: > In this case it was a 2-disk RAID 1, so it’s 150 MB/s per disk. Seems > quite right to me. Yes, sorry, I thought it was the 6-drive array speed. I saw it just after hitting "send", but there is no "supersede" on mailing lists :-) -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------ _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 11:02 ` Stefan Ring 2012-04-09 12:48 ` Emmanuel Florac @ 2012-04-09 23:38 ` Stan Hoeppner 2012-04-10 6:11 ` Stefan Ring 1 sibling, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-09 23:38 UTC (permalink / raw) To: Stefan Ring; +Cc: Linux fs XFS On 4/9/2012 6:02 AM, Stefan Ring wrote: >> Not at all. You can achieve this performance with the 6 300GB spindles >> you currently have, as Christoph and I both mentioned. You simply lose >> one spindle of capacity, 300GB, vs your current RAID6 setup. Make 3 >> RAID1 pairs in the p400 and concatenate them. If the p400 can't do this >> concat the mirror pair devices with md --linear. Format the resulting >> Linux block device with the following and mount with inode64. >> >> $ mkfs.xfs -d agcount=3 /dev/[device] >> >> That will give you 1 AG per spindle, 3 horizontal AGs total instead of 4 >> vertical AGs as you get with default striping setup. This is optimal >> for your high IOPS workload as it eliminates all 'extraneous' seeks >> yielding a per disk access pattern nearly identical to EXT4. And it >> will almost certainly outrun EXT4 on your RAID6 due mostly to the >> eliminated seeks, but also to elimination of parity calculations. >> You've wiped the array a few times in your testing already right, so one >> or two more test setups should be no sweat. Give it a go. The results >> will be pleasantly surprising. > > Well I had to move around quite a bit of data, but for the sake of > completeness, I had to give it a try. > > With a nice and tidy fresh XFS file system, performance is indeed > impressive – about 16 sec for the same task that would take 2 min 25 > before. So that’s about 150 MB/sec, which is not great, but for many > tiny files it would perhaps be a bit unreasonable to expect more. A 150MB/s isn't correct. Should be closer to 450MB/s. 
This makes it appear that you're writing all these files to a single directory. If you're writing them fairly evenly to 3 directories or a multiple of 3, you should see close to 450MB/s, if using mdraid linear over 3 P400 RAID1 pairs. If this is what you're doing, then something seems wrong somewhere. Try unpacking a kernel tarball. Lots of subdirectories to exercise all 3 AGs and thus all 3 spindles. > A simple copy of the tar onto the XFS file system yields the same linear > performance, the same as with ext4, btw. So 150 MB/sec seems to be the > best these disks can do, meaning that theoretically, with 3 AGs, it > should be able to reach 450 MB/sec under optimal conditions. The optimal condition, again, requires writing 3 of this file to 3 directories to hit ~450MB/s, which you should get close to if using mdraid linear over RAID1 pairs. XFS is a filesystem after all, so its parallelism must come from manipulating usage of filesystem structures. I thought I explained all of this previously when I introduced the "XFS concat" into this thread. > I will still do a test with the free space fragmentation priming on > the concatenated AG=3 volume, because it seems to be rather slow as > well. > But then I guess I’m back to ext4 land. XFS just doesn’t offer enough > benefits in this case to justify the hassle. If you were writing to only one directory, I can understand this sentiment. Again, if you were writing 3 directories fairly evenly, with the md concat, then your sentiment here should be quite different. -- Stan
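For reference, the linear-concat layout described above could be sketched roughly as follows. This is only a sketch, not a tested recipe: the device names /dev/sdb, /dev/sdc, /dev/sdd and the mount point are placeholders, assuming the P400 exports each RAID1 pair as its own block device.

```shell
# Concatenate the three RAID1 pair devices with md --linear
# (no striping, no parity -- pure concatenation).
mdadm --create /dev/md0 --level=linear --raid-devices=3 \
    /dev/sdb /dev/sdc /dev/sdd

# One allocation group per mirror pair, as suggested above.
mkfs.xfs -d agcount=3 /dev/md0

# inode64 lets XFS place new inodes (and thus new directories and
# their data) across all AGs instead of pinning them to the first 1TB.
mount -o inode64 /dev/md0 /mnt/scratch
```

The point of the layout is that each AG maps 1:1 onto a spindle pair, so concurrent writes to different top-level directories land on different physical disks.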
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 23:38 ` Stan Hoeppner @ 2012-04-10 6:11 ` Stefan Ring 2012-04-10 20:29 ` Stan Hoeppner 0 siblings, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-10 6:11 UTC (permalink / raw) To: stan; +Cc: Linux fs XFS > 150MB/s isn't correct. Should be closer to 450MB/s. This makes it > appear that you're writing all these files to a single directory. If > you're writing them fairly evenly to 3 directories or a multiple of 3, > you should see close to 450MB/s, if using mdraid linear over 3 P400 > RAID1 pairs. If this is what you're doing then something seems wrong > somewhere. Try unpacking a kernel tarball. Lots of subdirectories to > exercise all 3 AGs thus all 3 spindles. The spindles were exercised; I watched it with iostat. Maybe I could have achieved more with more parallelism, but that wasn’t my goal at all. Although, over the course of these experiments, I came to doubt that the controller could even handle this data rate. >> A simple copy of the tar onto the XFS file system yields the same linear >> performance, the same as with ext4, btw. So 150 MB/sec seems to be the >> best these disks can do, meaning that theoretically, with 3 AGs, it >> should be able to reach 450 MB/sec under optimal conditions. > > The optimal condition, again, requires writing 3 of this file to 3 > directories to hit ~450MB/s, which you should get close to if using > mdraid linear over RAID1 pairs. XFS is a filesystem after all, so its > parallelism must come from manipulating usage of filesystem structures. > I thought I explained all of this previously when I introduced the "XFS > concat" into this thread. The optimal condition would be 3 parallel writes of huge files, which can be easily written linearly. Not thousands of tiny files. >> But then I guess I’m back to ext4 land. XFS just doesn’t offer enough >> benefits in this case to justify the hassle.
> > If you were writing to only one directory I can understand this > sentiment. Again, if you were writing 3 directories fairly evenly, with > the md concat, then your sentiment here should be quite different. Haha, I made a U-turn on this one. XFS is back on the table (and on the disks now) ;). When I thought I was done, I wanted to restore a few large KVM images which were on the disks prior to the RAID reconfiguration. With ext4, I watched iostat writing at 130MB/s for a while. After 2 or 3 minutes, it broke down completely and languished at 30-40MB/s for many minutes, even after I had SIGSTOPed the writing process, during which it was nearly impossible to use vim to edit a file on the ext4 partition. It would pause for tens of seconds all the time. It’s not even clear why it broke down so badly. From another seekwatcher sample I took, it looked like fairly linear writing. So I threw XFS back in, restarted the restore, and it went very smoothly while still providing acceptable interactivity. XFS is not a panacea (obviously), and it may be a bit slower in many cases, and doesn’t seem to cope well with fragmented free space (which is what this entire thread is really about), but overall it feels more well-rounded. After all, I don’t really care how much it writes per time unit, as long as it’s not ridiculously little and it doesn’t bring everything else to a halt.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-10 6:11 ` Stefan Ring @ 2012-04-10 20:29 ` Stan Hoeppner 2012-04-10 20:43 ` Stefan Ring 0 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-10 20:29 UTC (permalink / raw) To: Stefan Ring; +Cc: Linux fs XFS On 4/10/2012 1:11 AM, Stefan Ring wrote: >> 150MB/s isn't correct. Should be closer to 450MB/s. This makes it >> appear that you're writing all these files to a single directory. If >> you're writing them fairly evenly to 3 directories or a multiple of 3, >> you should see close to 450MB/s, if using mdraid linear over 3 P400 >> RAID1 pairs. If this is what you're doing then something seems wrong >> somewhere. Try unpacking a kernel tarball. Lots of subdirectories to >> exercise all 3 AGs thus all 3 spindles. > > The spindles were exercised; I watched it with iostat. Maybe I could > have reached more with more parallelism, but that wasn’t my goal at > all. Although, over the course of these experiments, I got to doubt > that the controller could even handle this data rate. Hmm. We might need to see more detail of what your workload is actually doing. It's possible that 3 AGs is too few. Going with more will cause more head seeking, but it might also alleviate some bottlenecks within XFS itself that we may be creating by using only 3 AGs. I don't know XFS internals well enough to say. Dave can surely tell us if 3 may be too few. And yes, that controller doesn't seem to be the speediest with a huge random IO workload. >>> A simple copy of the tar onto the XFS file system yields the same linear >>> performance, the same as with ext4, btw. So 150 MB/sec seems to be the >>> best these disks can do, meaning that theoretically, with 3 AGs, it >>> should be able to reach 450 MB/sec under optimal conditions.
>> >> The optimal condition, again, requires writing 3 of this file to 3 >> directories to hit ~450MB/s, which you should get close to if using >> mdraid linear over RAID1 pairs. XFS is a filesystem after all, so it's >> parallelism must come from manipulating usage of filesystem structures. >> I thought I explained all of this previously when I introduced the "XFS >> concat" into this thread. > > The optimal condition would be 3 parallel writes of huge files, which > can be easily written linearly. Not thousands of tiny files. That was my point. You mentioned copying a single tar file. A single file write to a concatenated XFS will hit only one AG, thus only one spindle. If you launch 3 parallel copies of that file to 3 different directories, each one on a different AG, then you should hit close to 450. The trick is knowing which directories are on which AGs. If you manually create 3 directories right after making the filesystem, each one will be on a different AG. Write a file to each of these dirs in parallel and you should hit ~450MB/s. >>> But then I guess I’m back to ext4 land. XFS just doesn’t offer enough >>> benefits in this case to justify the hassle. >> >> If you were writing to only one directory I can understand this >> sentiment. Again, if you were writing 3 directories fairly evenly, with >> the md concat, then your sentiment here should be quite different. > > Haha, I made a U-turn on this one. XFS is back on the table (and on > the disks now) ;). When I thought I was done, I wanted to restore a > few large KVM images which were on the disks prior to the RAID > reconfiguration. With ext4, I watched iostat writing at 130MB/s for a > while. After 2 or 3 minutes, it broke down completely and languished > at 30-40MB/s for many minutes, even after I had SIGSTOPed the writing > process, during which it was nearly impossible to use vim to edit a > file on the ext4 partition. It would pause for tens of seconds all the > time. 
> It’s not even clear why it broke down so badly. From another > seekwatcher sample I took, it looked like fairly linear writing. What was the location of the KVM images you were copying? Is it possible the source device simply slowed down? Or network congestion if this was an NFS copy? > So I threw XFS back in, restarted the restore, and it went very > smoothly while still providing acceptable interactivity. It's nice to know XFS "saved the day" but I'm not so sure XFS deserves the credit here. The EXT4 driver itself/alone shouldn't cause the lack of responsiveness behavior you saw. I'm guessing something went wrong on the source side of these file copies, given your report of dropping to 30-40MB/s on the writeout. > XFS is not a panacea (obviously), and it may be a bit slower in many > cases, and doesn’t seem to cope well with fragmented free space (which > is what this entire thread is really about), Did you retest fragmented freespace writes with the linear concat or RAID10? If not you're drawing incorrect conclusions due to not having all the facts. RAID6 can cause tremendous overhead with writes into fragmented free space because of RMW, same with RAID5. And given the P400's RAID6 performance it's not at all surprising XFS would appear to perform poorly here. And my suggestion of using only 3 AGs to minimize seeks may actually be detrimental here as well. 6 AGs may perform better overall than 3 AGs. > but overall it feels more > well-rounded. After all, I don’t really care how much it writes per > time unit, as long as it’s not ridiculously little and it doesn’t > bring everything else to a halt. And you should be discovering by now that while XFS may not be a "panacea" of a filesystem, it has unbelievable flexibility in allowing you to tune it for specific storage layouts and workloads to wring out its maximum performance.
Even with optimum tuning, it may not match the performance of other filesystems for specific workloads, but you can tune it to get damn close with ALL workloads, and also trounce all others with very large workloads. No other filesystem can do this. Note Geoffrey's example of an XFS on 600 disks with 15GB/s throughput. Name another FS that can perform acceptably with your workload, and also that workload. ;) -- Stan
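Stan's point about needing parallel writers to hit all three AGs can be sketched concretely. In the demo below, plain directories under a temp dir stand in for the three top-level directories of a fresh 3-AG concat (names and sizes are made up for illustration); on the real filesystem each directory, and thus each writer, would map to a different AG and spindle pair.

```shell
# Three concurrent writers, one per directory. On a 3-AG linear
# concat, this is the access pattern that engages all three spindle
# pairs at once; a single writer would hit only one.
dest=$(mktemp -d)
mkdir "$dest/ag0" "$dest/ag1" "$dest/ag2"
for d in ag0 ag1 ag2; do
    # each stream writes 8 MiB into its own directory, in the background
    dd if=/dev/zero of="$dest/$d/big.img" bs=1M count=8 status=none &
done
wait    # let all three streams finish
ls -l "$dest"/ag*/big.img
```

The same shape applies to the tar test: three parallel `tar x` runs into the three directories would be the fair comparison against the ~450MB/s figure.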
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-10 20:29 ` Stan Hoeppner @ 2012-04-10 20:43 ` Stefan Ring 2012-04-10 21:29 ` Stan Hoeppner 0 siblings, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-10 20:43 UTC (permalink / raw) To: stan; +Cc: Linux fs XFS > What was the location of the KVM images you were copying? Is it > possible the source device simply slowed down? Or network congestion if > this was an NFS copy? Piped via ssh from another host. No, everything was completely idle otherwise. >> So I threw XFS back in, restarted the restore, and it went very >> smoothly while still providing acceptable interactivity. > > It's nice to know XFS "saved the day" but I'm not so sure XFS deserves > the credit here. The EXT4 driver itself/alone shouldn't cause the lack > of responsiveness behavior you saw. I'm guessing something went wrong > on the source side of these file copies, given your report of dropping > to 30-40MB/s on the writeout. Maybe it shouldn’t, but something sure did. And the circumstances seem to point at ext4. Since the situation persisted for minutes after I had stopped the transfer, it cannot possibly have been related to the source. I have a feeling that with appropriate vm.dirty_ratio tuning (and probably related settings), I could have remedied this. But that’s just one more thing I’d have to tinker with just to get acceptable behavior out of this machine. I don’t mind if I don’t get top-notch performance out of the box, but this is simply too much. I don’t want to be expected to hand-tune every damn thing. >> XFS is not a panacea (obviously), and it may be a bit slower in many >> cases, and doesn’t seem to cope well with fragmented free space (which >> is what this entire thread is really about), > > Did you retest fragmented freespace writes with the linear concat or > RAID10? If not you're drawing incorrect conclusions due to not having > all the facts.
Yes, I did this. It performed very well. Only slightly slower than on a completely empty file system.
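The vm.dirty_ratio tuning alluded to above would look something like the following. The values shown are only a hypothetical starting point, not a recommendation from the thread.

```shell
# Inspect the current writeback thresholds (typical defaults on
# kernels of this era: background 10%, foreground 20% of RAM).
sysctl vm.dirty_background_ratio vm.dirty_ratio

# With 64 GB of RAM even a few percent is gigabytes of dirty pages,
# so lowering the ratios makes background writeback start earlier and
# keeps the dirty pool -- and thus the stalls -- smaller. Hypothetical:
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=5

# Byte-based caps exist too and override the ratios when set, e.g.:
# sysctl -w vm.dirty_background_bytes=268435456
```

Whether this would actually have prevented the ext4 stall is speculation; it bounds how much dirty data can accumulate before writeback must happen.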
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-10 20:43 ` Stefan Ring @ 2012-04-10 21:29 ` Stan Hoeppner 0 siblings, 0 replies; 64+ messages in thread From: Stan Hoeppner @ 2012-04-10 21:29 UTC (permalink / raw) To: Stefan Ring; +Cc: Linux fs XFS On 4/10/2012 3:43 PM, Stefan Ring wrote: > I don’t want to be expected to hand-tune every damn thing. You don't. >> $ mkfs.xfs -d agcount=3 /dev/[device] > With a nice and tidy fresh XFS file system, performance is indeed > impressive – about 16 sec for the same task that would take 2 min 25 > before. 9x improvement in your workload. First problem down. What was the runtime for EXT4 here? Less than 16 seconds? >>> and doesn’t seem to cope well with fragmented free space (which >>> is what this entire thread is really about), >> Did you retest fragmented freespace writes > Yes, I did this. It performed very well. Only slightly slower than on > a completely empty file system. 2nd problem down. So the concat is your solution, no? If not, what's still missing? BTW, concats don't have parity thus no RMW, so with the concat setup you should set 100% of the P400 cache to writes. The 25% you had for reads definitely helps RAID6 RMW, but yields no benefit for concat. Bump write cache to 100% and you'll gain a little more XFS concat performance. And if by chance there is some weird logic in the P400 firmware, dedicating 100% to write cache may magically blow the doors off. I'm guessing I'm not the only one here to have seen odd magical settings values like this at least once, though not necessarily with RAID cache. Even if not magical, in addition to increasing write cache size by 25%, you will also increase write cache bandwidth with your high allocation workload, as metadata free space lookups won't get cached by the controller. And given that sector write ordering is an apparent problem currently, having this extra size and bandwidth may put you over the top. 
-- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 7:27 ` Stefan Ring 2012-04-07 8:53 ` Emmanuel Florac 2012-04-07 14:57 ` Stan Hoeppner @ 2012-04-09 0:19 ` Dave Chinner 2012-04-09 11:39 ` Emmanuel Florac 2 siblings, 1 reply; 64+ messages in thread From: Dave Chinner @ 2012-04-09 0:19 UTC (permalink / raw) To: Stefan Ring; +Cc: stan, Linux fs XFS On Sat, Apr 07, 2012 at 09:27:50AM +0200, Stefan Ring wrote: > > Instead, a far more optimal solution would be to set aside 4 spares per > > chassis and create 14 four drive RAID10 arrays. This would yield ~600 > > seeks/sec and ~400MB/s sequential throughput performance per 2 spindle > > array. We'd stitch the resulting 56 hardware RAID10 arrays together in > > an mdraid linear (concatenated) array. Then we'd format this 112 > > effective spindle linear array with simply: > > > > $ mkfs.xfs -d agcount=56 /dev/md0 > > > > Since each RAID10 is 900GB capacity, we have 56 AGs of just under the > > 1TB limit, 1 AG per 2 physical spindles. Due to the 2 stripe spindle > > nature of the constituent hardware RAID10 arrays, we don't need to worry > > about aligning XFS writes to the RAID stripe width. The hardware cache > > will take care of filling the small stripes. Now we're in the opposite > > situation of having too many AGs per spindle. We've put 2 spindles in a > > single AG and turned the seek starvation issues on its head. > > So it sounds like that for poor guys like us, who can’t afford the > hardware to have dozens of spindles, the best option would be to > create the XFS file system with agcount=1? No, because then you have no redundancy in metadata structures, so if you lose/corrupt the superblock you can more easily lose the entire filesystem. Not to mention you have no allocation parallelism in the filesystem, so you'll get terrible performance in many common workloads. IO fairness will also be a big problem.
> That seems to be the only reasonable conclusion to me, since a > single RAID device, like a single disk, cannot write in parallel > anyway. A decent RAID controller with a BBWC and a single LUN benefits from parallelism just as much as large disk arrays do, because the BBWC minimises the write IO latency and allows the controller to do a better job of scheduling its IO. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 0:19 ` Dave Chinner @ 2012-04-09 11:39 ` Emmanuel Florac 2012-04-09 21:47 ` Dave Chinner 0 siblings, 1 reply; 64+ messages in thread From: Emmanuel Florac @ 2012-04-09 11:39 UTC (permalink / raw) To: Dave Chinner; +Cc: Stefan Ring, stan, Linux fs XFS On Mon, 9 Apr 2012 10:19:43 +1000, you wrote: > A decent RAID controller with a BBWC and a single LUN benefits from > parallelism just as much as large disk arrays do, because the BBWC > minimises the write IO latency and allows the controller to do a better > job of scheduling its IO. BTW recently I've found that for storage servers, the noop io scheduler often is the best choice, I suppose precisely because it doesn't try to outsmart the RAID controller logic... -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 11:39 ` Emmanuel Florac @ 2012-04-09 21:47 ` Dave Chinner 0 siblings, 0 replies; 64+ messages in thread From: Dave Chinner @ 2012-04-09 21:47 UTC (permalink / raw) To: Emmanuel Florac; +Cc: Stefan Ring, stan, Linux fs XFS On Mon, Apr 09, 2012 at 01:39:13PM +0200, Emmanuel Florac wrote: > On Mon, 9 Apr 2012 10:19:43 +1000, you wrote: > > > A decent RAID controller with a BBWC and a single LUN benefits from > > parallelism just as much as large disk arrays do, because the BBWC > > minimises the write IO latency and allows the controller to do a better > > job of scheduling its IO. > > BTW recently I've found that for storage servers, the noop io scheduler > often is the best choice, I suppose precisely because it doesn't try to > outsmart the RAID controller logic... We've been recommending the use of the no-op (or worst case, deadline) scheduler for XFS on hardware RAID for quite a few years. I only test against the no-op scheduler, because I got sick of having to track down regressions caused by "smart" CFQ heuristics.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
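Switching a device to the no-op scheduler, as recommended above, is a per-device sysfs setting. The device name sdb below is a placeholder; on kernels of this era the available schedulers are typically noop, deadline, and cfq.

```shell
# The currently active scheduler is shown in [brackets]:
cat /sys/block/sdb/queue/scheduler

# Select noop for this device; this does not persist across reboots:
echo noop > /sys/block/sdb/queue/scheduler

# To make it the default for all block devices, boot with the kernel
# command-line parameter:
#   elevator=noop
```

The rationale matches Emmanuel's guess: with a BBWC-equipped RAID controller doing its own reordering, the kernel's seek-minimising heuristics mostly add latency without adding anything the controller can't do better.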
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 23:28 ` Stan Hoeppner 2012-04-07 7:27 ` Stefan Ring @ 2012-04-07 8:49 ` Emmanuel Florac 2012-04-08 20:33 ` Stan Hoeppner 2012-04-09 14:21 ` Geoffrey Wehrman 2 siblings, 1 reply; 64+ messages in thread From: Emmanuel Florac @ 2012-04-07 8:49 UTC (permalink / raw) To: stan; +Cc: Stefan Ring, Linux fs XFS On Fri, 06 Apr 2012 18:28:37 -0500, you wrote: > Creating four 60 drive RAID10 arrays, let alone 60 drive RAID6 > arrays, would be silly. From my experience, with modern arrays it doesn't make much of a difference. I've reached decent IOPS (i.e. about 4000 IOPS) on large arrays of up to 46 drives provided there are enough threads -- more threads than spindles, preferably. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 8:49 ` Emmanuel Florac @ 2012-04-08 20:33 ` Stan Hoeppner 2012-04-08 21:45 ` Emmanuel Florac 0 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-08 20:33 UTC (permalink / raw) To: Emmanuel Florac; +Cc: Stefan Ring, Linux fs XFS On 4/7/2012 3:49 AM, Emmanuel Florac wrote: > On Fri, 06 Apr 2012 18:28:37 -0500, you wrote: > >> Creating four 60 drive RAID10 arrays, let alone 60 drive RAID6 >> arrays, would be silly. > > From my experience, with modern arrays it doesn't make much of a difference. > I've reached decent IOPS (i.e. about 4000 IOPS) on large arrays of up > to 46 drives provided there are enough threads -- more threads than > spindles, preferably. Are you speaking of a mixed metadata/data heavy IOPS workload similar to that which is the focus of this thread, or another type of workload? Is this 46 drive array RAID10 or RAID6? -- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-08 20:33 ` Stan Hoeppner @ 2012-04-08 21:45 ` Emmanuel Florac 2012-04-09 5:27 ` Stan Hoeppner 0 siblings, 1 reply; 64+ messages in thread From: Emmanuel Florac @ 2012-04-08 21:45 UTC (permalink / raw) To: stan; +Cc: Stefan Ring, Linux fs XFS On Sun, 08 Apr 2012 15:33:01 -0500, you wrote: > > > > From my experience, with modern arrays it doesn't make much of a > > difference. I've reached decent IOPS (i.e. about 4000 IOPS) on > > large arrays of up to 46 drives provided there are enough threads > > -- more threads than spindles, preferably. > > Are you speaking of a mixed metadata/data heavy IOPS workload similar > to that which is the focus of this thread, or another type of > workload? Is this 46 drive array RAID10 or RAID6? Pure random access, 8K IO benchmark (database simulation). RAID-6 performs about the same in pure reading tests, but stinks terribly at writing of course. -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-08 21:45 ` Emmanuel Florac @ 2012-04-09 5:27 ` Stan Hoeppner 2012-04-09 12:45 ` Emmanuel Florac 0 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-09 5:27 UTC (permalink / raw) To: xfs On 4/8/2012 4:45 PM, Emmanuel Florac wrote: > On Sun, 08 Apr 2012 15:33:01 -0500, you wrote: > >>> >>> From my experience, with modern arrays it doesn't make much of a >>> difference. I've reached decent IOPS (i.e. about 4000 IOPS) on >>> large arrays of up to 46 drives provided there are enough threads >>> -- more threads than spindles, preferably. >> >> Are you speaking of a mixed metadata/data heavy IOPS workload similar >> to that which is the focus of this thread, or another type of >> workload? Is this 46 drive array RAID10 or RAID6? > > Pure random access, 8K IO benchmark (database simulation). RAID-6 > performs about the same in pure reading tests, but stinks terribly at > writing of course. In your RAID10 random write testing, was this with a filesystem or doing direct block IO? If the latter, I wonder if its write pattern is anything like the access pattern we'd see hitting dozens of AGs while creating 10s of thousands of files. -- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 5:27 ` Stan Hoeppner @ 2012-04-09 12:45 ` Emmanuel Florac 2012-04-13 19:36 ` Stefan Ring 0 siblings, 1 reply; 64+ messages in thread From: Emmanuel Florac @ 2012-04-09 12:45 UTC (permalink / raw) To: stan; +Cc: xfs On Mon, 09 Apr 2012 00:27:29 -0500, you wrote: > In your RAID10 random write testing, was this with a filesystem or > doing direct block IO? Doing random IO in a file lying on an XFS filesystem. > If the latter, I wonder if its write pattern > is anything like the access pattern we'd see hitting dozens of AGs > while creating 10s of thousands of files. I suppose the file creation process hits certain hot spots more than pure random access does. I just have a machine for testing purposes with 15 4TB drives in RAID-6, not exactly an IOPS demon :) So I've built a tar file to make it somewhat similar to the OP's problem: root@3[raid]# ls -lh test.tar -rw-r--r-- 1 root root 2,6G 9 avril 13:52 test.tar root@3[raid]# tar tf test.tar | wc -l 234318 # echo 3 > /proc/sys/vm/drop_caches # time tar xf test.tar real 1m2.584s user 0m1.376s sys 0m13.643s Let's rerun it with files cached (the machine has 16 GB RAM, so every single file must be cached): # time tar xf test.tar real 0m50.842s user 0m0.809s sys 0m13.767s Typical IOs during unarchiving: no read, write IO bound. Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 1573,50 0,00 480,50 0,00 36,96 157,52 60,65 124,45 2,08 100,10 dm-0 0,00 0,00 0,00 2067,00 0,00 39,56 39,20 322,55 151,62 0,48 100,10 The OP's setup, being 6 15k drives, should provide roughly the same number of true IOPS (1200) as my slow-as-hell bunch of 7200RPM 4TB drives (1500). I suppose write cache makes for most of the difference; or else 15K drives are overrated :) Alas, I can't run the test on this machine with ext4: I can't get mkfs.ext4 to swallow my big device.
mkfs -t ext4 -v -b 4096 -n /dev/dm-0 2147483647 should work (though drastically limiting the filesystem size), but dies miserably when removing the -n flag. Mmmph, I suppose it's production ready if you don't have much data to store. JFS doesn't work either. And I was wondering why I'm using XFS? :) -- ------------------------------------------------------------------------ Emmanuel Florac | Direction technique | Intellique | <eflorac@intellique.com> | +33 1 78 94 84 02 ------------------------------------------------------------------------
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 12:45 ` Emmanuel Florac @ 2012-04-13 19:36 ` Stefan Ring 2012-04-14 7:32 ` Stan Hoeppner 0 siblings, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-13 19:36 UTC (permalink / raw) To: Emmanuel Florac; +Cc: stan, xfs > Let's rerun it with files cached (the machine has 16 GB RAM, so > every single file must be cached): > > # time tar xf test.tar > > real 0m50.842s > user 0m0.809s > sys 0m13.767s That’s about the same time I’m getting on a fresh (non-fragmented) file system with the RAID 6 volume. Interestingly, the P400’s successor, the P410, does recognize a setting that the P400 lacks, which is called elevatorsort. It sounds like this could make all the difference. Unfortunately, the P400 doesn’t have it. I don’t have a P410 with more than 2 drives to test this, but some effect should definitely be measurable. Since this finding has piqued my interest again, I’m willing to invest a little more time, but I’m completely occupied for the next few days, so it will have to wait a while.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-13 19:36 ` Stefan Ring @ 2012-04-14 7:32 ` Stan Hoeppner 2012-04-14 11:30 ` Stefan Ring 0 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-14 7:32 UTC (permalink / raw) To: xfs On 4/13/2012 2:36 PM, Stefan Ring wrote: >> Let's rerun it with files cached (the machine has 16 GB RAM, so >> every single file must be cached): >> >> # time tar xf test.tar >> >> real 0m50.842s >> user 0m0.809s >> sys 0m13.767s > > That’s about the same time I’m getting on a fresh (non-fragmented) > file system with the RAID 6 volume. > > Interestingly, the P400’s successor, the P410 does recognize a setting > that the P400 lacks, which is called elevatorsort. It sounds like this > could make all the difference. Unfortunately, the P400 doesn’t have > it. I don’t have a P410 with more than 2 drives to test this, but some > effect should definitely be measurable. > > Since this finding has piqued my interest again, I’m willing to invest > a little more time, but I’m completely occupied for the next few days, > so it will have to wait a while. What configuration are you running right now Stefan? You said you went back to XFS due to the EXT4 lockups, but I can't recall what RAID config you put underneath it this time. -- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-14 7:32 ` Stan Hoeppner @ 2012-04-14 11:30 ` Stefan Ring 0 siblings, 0 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-14 11:30 UTC (permalink / raw) To: stan; +Cc: xfs On Sat, Apr 14, 2012 at 9:32 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 4/13/2012 2:36 PM, Stefan Ring wrote: >>> Let's rerun it with files cached (the machine has 16 GB RAM, so >>> every single file must be cached): >>> >>> # time tar xf test.tar >>> >>> real 0m50.842s >>> user 0m0.809s >>> sys 0m13.767s >> >> That’s about the same time I’m getting on a fresh (non-fragmented) >> file system with the RAID 6 volume. >> > What configuration are you running right now Stefan? You said you went > back to XFS due to the EXT4 lockups, but I can't recall what RAID config > you put underneath it this time. RAID 6 4+2, LVM (single volume), 32kb stripe size (=> full stripe: 128kb), agcount=4 Except for the stripe size, the same config I had originally. The only instance of really poor behavior is with the (artificially) fragmented free space. I have moved everything elsewhere for a while, so I can once again do some testing that involves destroying and rebuilding everything.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 23:28 ` Stan Hoeppner 2012-04-07 7:27 ` Stefan Ring 2012-04-07 8:49 ` Emmanuel Florac @ 2012-04-09 14:21 ` Geoffrey Wehrman 2012-04-10 19:30 ` Stan Hoeppner 2 siblings, 1 reply; 64+ messages in thread From: Geoffrey Wehrman @ 2012-04-09 14:21 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Stefan Ring, Linux fs XFS On Fri, Apr 06, 2012 at 06:28:37PM -0500, Stan Hoeppner wrote: | So while the XFS AG architecture may not be perfectly suited to your | single 6 drive RAID6 array, it still gives rather remarkable performance | given that the same architecture can scale pretty linearly to the | heights above, and far beyond. Something EXTx and others could never | dream of. Some of the SGI guys might be able to confirm deployed single | XFS filesystems spanning 1000+ drives in the past. Today we'd probably | only see that scale with CXFS. With an SGI IS16000 array, which supports up to 1,200 drives, building filesystems with large numbers of drives isn't difficult. Most configurations using the IS16000 have 8+2 RAID6 luns. I've seen sustained 15 GB/s to a single filesystem on one of the arrays with a 600 drive configuration. The scalability of XFS is impressive. -- Geoffrey Wehrman SGI Building 10 Office: (651)683-5496 2750 Blue Water Road Fax: (651)683-5098 Eagan, MN 55121 E-mail: gwehrman@sgi.com http://www.sgi.com/products/storage/software/
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 14:21 ` Geoffrey Wehrman @ 2012-04-10 19:30 ` Stan Hoeppner 2012-04-11 22:19 ` Geoffrey Wehrman 0 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-10 19:30 UTC (permalink / raw) To: Geoffrey Wehrman; +Cc: Stefan Ring, Linux fs XFS On 4/9/2012 9:21 AM, Geoffrey Wehrman wrote: > On Fri, Apr 06, 2012 at 06:28:37PM -0500, Stan Hoeppner wrote: > | So while the XFS AG architecture may not be perfectly suited to your > | single 6 drive RAID6 array, it still gives rather remarkable performance > | given that the same architecture can scale pretty linearly to the > | heights above, and far beyond. Something EXTx and others could never > | dream of. Some of the SGI guys might be able to confirm deployed single > | XFS filesystems spanning 1000+ drives in the past. Today we'd probably > | only see that scale with CXFS. Good to hear from you Geoffrey. > With an SGI IS16000 array which supports up to 1,200 drives, filesystems > with large numbers of drives isn't difficult. Most configurations > using the IS16000 have 8+2 RAID6 luns. Is the concatenation of all these RAID6 LUNs performed within the IS16000, or with md/lvm, or? > I've seen sustained 15 GB/s to > a single filesystem on one of the arrays with a 600 drive configuration. To be clear, this is a single Linux XFS filesystem on a single host, not multiple CXFS clients, correct? If so, out of curiosity, is the host in this case an old Itanium Altix or the newer Xeon based Altix UV? And finally, is this example system using FC or Infiniband connectivity? How many ports? > The scalability of XFS is impressive. Quite impressive. And there's nothing in XFS itself preventing scalability of a single filesystem over 4 IS16000s w/4800 total drives, although one might run into some limitations when attempting to concatenate that many LUNs. 
I've never attempted that scale with md or lvm, and I've never had my hands on an IS16000. -- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-10 19:30 ` Stan Hoeppner @ 2012-04-11 22:19 ` Geoffrey Wehrman 0 siblings, 0 replies; 64+ messages in thread From: Geoffrey Wehrman @ 2012-04-11 22:19 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Stefan Ring, Linux fs XFS On Tue, Apr 10, 2012 at 02:30:39PM -0500, Stan Hoeppner wrote: | On 4/9/2012 9:21 AM, Geoffrey Wehrman wrote: | > On Fri, Apr 06, 2012 at 06:28:37PM -0500, Stan Hoeppner wrote: | > | So while the XFS AG architecture may not be perfectly suited to your | > | single 6 drive RAID6 array, it still gives rather remarkable performance | > | given that the same architecture can scale pretty linearly to the | > | heights above, and far beyond. Something EXTx and others could never | > | dream of. Some of the SGI guys might be able to confirm deployed single | > | XFS filesystems spanning 1000+ drives in the past. Today we'd probably | > | only see that scale with CXFS. | | Good to hear from you Geoffrey. | | > With an SGI IS16000 array which supports up to 1,200 drives, filesystems | > with large numbers of drives isn't difficult. Most configurations | > using the IS16000 have 8+2 RAID6 luns. | | Is the concatenation of all these RAID6 LUNs performed within the | IS16000, or with md/lvm, or? The LUNs were concatenated with XVM which is SGI's md/lvm equivalent. The filesystem was then constructed so that the LUN boundaries matched AG boundaries in the filesystem. The filesystem was mounted with the inode64 mount option. inode64 rotors directories across AGs, and then attempts to allocate space for files created in the AG containing the directory. Utilizing this behavior allowed the generated load to be spread across the entire set of LUNs. | > I've seen sustained 15 GB/s to | > a single filesystem on one of the arrays with a 600 drive configuration. 
| | To be clear, this is a single Linux XFS filesystem on a single host, not | multiple CXFS clients, correct? If so, out of curiosity, is the host in | this case an old Itanium Altix or the newer Xeon based Altix UV? And | finally, is this example system using FC or Infiniband connectivity? | How many ports? This was a single Linux XFS filesystem, but with two CXFS client hosts. They were both rather ordinary dual socket x86_64 Xeon systems using FC connectivity. I fully expect that the same results could be obtained from a single host with enough I/O bandwidth. -- Geoffrey Wehrman
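The AG/LUN alignment Geoffrey describes comes down to picking an agsize that tiles each LUN evenly while staying under the 1 TiB per-AG limit. A sketch of that arithmetic with invented round numbers (the thread does not give the real drive or LUN sizes):

```shell
# Hypothetical: 60 LUNs of 8+2 RAID6, ~450 GiB of data capacity per drive.
lun_count=60
lun_data_gib=$((8 * 450))                # usable capacity per 8+2 LUN
max_ag_gib=1024                          # XFS caps an AG at 1 TiB
# Smallest AG count per LUN that keeps each AG under the cap:
ags_per_lun=$(( (lun_data_gib + max_ag_gib - 1) / max_ag_gib ))
agsize_gib=$((lun_data_gib / ags_per_lun))
agcount=$((lun_count * ags_per_lun))
echo "agsize=${agsize_gib}g gives ${agcount} AGs, all on LUN boundaries"
```

With agsize an exact divisor of the LUN size, every LUN boundary is also an AG boundary, so inode64's rotoring of new directories across AGs spreads the load over all LUNs, which is the behavior Geoffrey exploits.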
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 7:27 ` Stefan Ring 2012-04-06 23:28 ` Stan Hoeppner @ 2012-04-07 16:50 ` Peter Grandi 2012-04-07 17:10 ` Joe Landman 1 sibling, 1 reply; 64+ messages in thread From: Peter Grandi @ 2012-04-07 16:50 UTC (permalink / raw) To: Linux fs XFS [ ... ] >> As to this, in theory even having split the files among 4 >> AGs, the upload from system RAM to host adapter RAM and then >> to disk could happen by writing first all the dirty blocks >> for one AG, then a long seek to the next AG, and so on, and >> the additional cost of 3 long seeks would be negligible. > Yes, that’s exactly what I had in mind, and what prompted me > to write this post. It would be about 10 times as fast. Ahhh yes, but let's go back to this and summarize some of my previous observations: * If the scheduling order was by AG, and the hardware was parallel, the available parallelism would not be exploited (and fragmentation might be worse), as if there were only a single AG. And XFS does let you configure the number of AGs in part for that reason. * Your storage layer does not seem to deliver parallel operations: as the ~100MB/s overall 'ext4' speed and the seek graphs show, in effect your 4+2 RAID6 performs in this case as if it were a single drive with a single arm. * Even with the actual scheduling at the Linux level being by interleaving AGs in XFS, your host adapter with a BBWC should be able to reorder them, in 256MiB lots, ignoring Linux level barriers and ordering, but it seems that this is not happening. So the major things to look into seem to me: * Ensure that your RAID set can deliver the parallelism at which XFS is targeted, with the bulk transfer rates that it can do. * Otherwise figure out ways to ensure that the IO transactions generated by XFS are not in interleave-AG order. * Otherwise figure out ways to get the XFS IO ordering rearranged at the storage layer in spacewise order.
Summarizing some of the things to try, and some of them are rather tentative, because you have a rather peculiar corner case: * Change the flusher to writeout incrementally instead of just at 'sync' time, e.g. every 1-2 seconds. In some similar cases this makes things a lot better, as large 'uploads' to the storage layer from the page cache can cause damaging latencies. But the success of this may depend on having a properly parallel storage layer, at least for XFS. * Use a different RAID setup. If the RAID set is used only for reproducible data, a RAID0, else a RAID10, or even a RAID5 with a small chunk size. * Check the elevator and cache policy on the P400, if they are settable. Too bad many RAID host adapters have (euphemism) hideous fw (many older 3ware models come to mind) with some undocumented (euphemism) peculiarities as to scheduling. * Tweak 'queue/nr_requests' and 'device/queue_depth'. Probably they should be big (hundreds/thousands), but various settings should be tried as fw sometimes is so weird. * Given that it is now established that your host adapter has BBWC, consider switching the Linux elevator to 'noop', so as to leave IO scheduling to the host adapter fw, and reduce issue latency. 'queue/nr_requests' may be set to a very low number here perhaps, but my guess is that it shouldn't matter. * Alternatively, if the host adapter fw insists on not reordering IO from the Linux level, use Linux elevator settings that behave similarly to 'anticipatory'. It may help to use Bonnie (Garloff's 1.4 version with '-o_direct') to give a rough feel of the filetree speed profile; for example I tend to use these options: Bonnie -y -u -o_direct -s 2000 -v 2 -d "$DIR" Ultimately even 'ext4' does not seem the right filesystem for this workload either, because all these "legacy" filesystems are targeted at situations where data is much bigger than memory, and you are trying to fit them into a very specific corner case where the opposite is true.
Making my fantasy run wild, my guess is that your workload is not 'tar x', but release building, where sources and objects fit entirely in memory, and you are only concerned with persisting the sources because you want to do several builds from that set of sources without re-tar-x-ing them, and ideally you would like to reduce build times by building several objects in parallel. BTW, your corner case then has another property here: disk writes greatly exceed disk reads, because you would only write the sources once and then read them from cache every time thereafter while the system is up. I doubt also that you would want to persist the generated objects themselves, but only the generated final "package" containing them, which might suggest building the objects to a 'tmpfs', unless you want them persisted (a bit) to make builds restartable. If that's the case, and you cannot fix the storage layer to be more suitable for 'ext4' or XFS, consider using NILFS2, or even 'ext2' (with a long flusher interval perhaps). Note: or "cheat" and do your builds to a flash SSD, as they both run a fw layer that implements a COW/logging allocation strategy, and have nicer seek times :-). > That’s what bothers me so much. And in case you did not get this before, I have a long-standing pet peeve about abusing filesystems for small file IO, or other ways of going against the grain of what is plausible, which I call the "syntactic approach" (every syntactically valid system configuration is assumed to work equally well...). Some technical postscripts: * It seems that most if not all RAID6 implementations don't do shortened RMWs, where only the updated blocks and the PQ blocks are involved; they always do full stripe RMW. Even with a BBWC in the host adapter this is one major reason to avoid RAID6 in favor of at least RAID5, for your setup in particular. But hey, RAID6 setups are all syntactically valid!
:-) * The 'ext3' on-disk layout and allocation policies seem to deliver very good compact locality on bulk writeouts and on relatively fresh filetrees, but then locality can degrade apocalyptically over time, like seven times: http://www.sabi.co.uk/blog/anno05-3rd.html#050913 I suspect that the same applies to 'ext4', even if perhaps a bit less. You have tried to "age" the filetree a bit, but I suspect you did not succeed enough, as the graphed Linux-level seek patterns with 'ext4' show a mostly-linear write. * Hopefully your storage layer does not use DM/LVMs...
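Peter's list of knobs can be staged as a small script. The values below are illustrative starting points, not numbers from the thread, and writes go through a $ROOT prefix so the script can be rehearsed against a scratch directory rather than the live /proc and /sys (set ROOT to empty and run as root to apply for real):

```shell
# Stage the flusher/elevator/queue tunables discussed above.
ROOT=${ROOT:-$(mktemp -d)}   # scratch prefix by default; ROOT= targets the real knobs
dev=sda                      # hypothetical device name
set_knob() {                 # set_knob <value> <path>
    mkdir -p "$ROOT$(dirname "$2")"
    printf '%s\n' "$1" > "$ROOT$2"
}
set_knob 200  /proc/sys/vm/dirty_writeback_centisecs   # flush every ~2 s, not just at sync
set_knob noop "/sys/block/$dev/queue/scheduler"        # leave reordering to the BBWC
set_knob 512  "/sys/block/$dev/queue/nr_requests"      # try both large and small values
set_knob 64   "/sys/block/$dev/device/queue_depth"
echo "tunables staged under $ROOT"
```

None of these settings are persistent across reboots; they would need an init script or sysctl.conf/udev equivalents to stick.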
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 16:50 ` Peter Grandi @ 2012-04-07 17:10 ` Joe Landman 2012-04-08 21:42 ` Stan Hoeppner 0 siblings, 1 reply; 64+ messages in thread From: Joe Landman @ 2012-04-07 17:10 UTC (permalink / raw) To: xfs On 04/07/2012 12:50 PM, Peter Grandi wrote: > * Your storage layer does not seem to deliver parallel > operations: as the ~100MB/s overall 'ext4' speed and the > seek graphs show, in effect your 4+2 RAID6 performs in this > case as if it were a single drive with a single arm. This is what leapt out at me. I retried a very similar test (pulled Icedtea 2.1, compiled it, tarred it, measured untar on our boxen). I was getting a fairly consistent 4 +/- delta seconds. Ignoring the rest of your post for brevity (basically to focus upon this one issue), I suspect that the observed performance issue has more to do with the RAID card, the disks, and the server than with the file system. 100MB/s on some supposedly fast drives with a RAID card indicates that either the RAID is badly implemented, the RAID layout is suspect, or similar. He should be getting closer to N(data disks) * BW(single disk) for something "close" to a streaming operation. This isn't suggesting that he didn't hit some bug which happens to over-specify use of ag=0, but he definitely had a weak RAID system (at best). If he retries with a more capable system, or one with a saner RAID layout (16k chunk size? For spinning rust? Seriously? Short-stroking DB layout?), an agcount of 32 or higher, and still sees similar issues, then I'd be more suspicious of a bug. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc.
email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/sicluster phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-07 17:10 ` Joe Landman @ 2012-04-08 21:42 ` Stan Hoeppner 2012-04-09 5:13 ` Stan Hoeppner 2012-04-09 9:23 ` Stefan Ring 0 siblings, 2 replies; 64+ messages in thread From: Stan Hoeppner @ 2012-04-08 21:42 UTC (permalink / raw) To: xfs On 4/7/2012 12:10 PM, Joe Landman wrote: > On 04/07/2012 12:50 PM, Peter Grandi wrote: > >> * Your storage layer does not seem to deliver parallel >> operations: as the ~100MB/s overall 'ext4' speed and the >> seek graphs show, in effect your 4+2 RAID6 performs in this >> case as if it were a single drive with a single arm. > > This is what lept out at me. I retried a very similar test (pulled > Icedtea 2.1, compiled it, tarred it, measured untar on our boxen). I > was getting a fairly consistent 4 +/- delta seconds. That's an interesting point. I guess I'd chalked the low throughput up to high seeks. > 100MB/s on some supposedly fast drives with a RAID card indicates that > either the RAID is badly implemented, the RAID layout is suspect, or > similar. He should be getting closer to N(data disks) * BW(single disk) > for something "close" to a streaming operation. Reading this thread seems to indicate you're onto something Joe: http://h30499.www3.hp.com/t5/System-Administration/Extremely-slow-io-on-cciss-raid6/td-p/4214888 Add this to the mix: "The HP Smart Array P400 is HP's first PCI-Express (PCIe) serial attached SCSI (SAS) RAID controller" That's from: http://h18000.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/index.html First gen products aren't always duds, but the likelihood is often much higher. Everyone posting to that forum is getting low throughput, and most of them are testing streaming reads/writes, not massively random IO as is Stefan's case. 
> This isn't suggesting that he didn't hit some bug which happens to over > specify use of ag=0, but he definitely had a weak RAID system (at best). > > If he retries with a more capable system, or one with a saner RAID > layout (16k chunk size? For spinning rust? Seriously? Short stroking > DB layout?), an agcount of 32 or higher, and still sees similar issues, > then I'd be more suspicious of a bug. Or merely a weak/old product. The P400 was an entry-level RAID HBA, HP's first PCIe/SAS RAID card. It was discontinued quite some time ago. The use of DDR2/533 memory indicates its design stage probably started somewhere around 2004, 8 years ago. Now that I've researched the P400, and assuming Stefan currently has the card firmware optimally configured, I'd bet this workload is simply overwhelming the RAID ASIC. To confirm this, simply configure each drive as a RAID0 array, so all 6 drives are exported as block devices. Configure them as an md RAID6 and test the workload. Be sure to change the Linux elevator to noop first since you're using hardware write cache: $ echo noop > /sys/block/sdX/queue/scheduler Execute this 6 times, once for each of the 6 drives (sda through sdf), substituting the device name each time. This is not a persistent change. The gap between EXT4 and XFS will likely still exist, but overall numbers should jump substantially Northward, if the problem is indeed a slow RAID ASIC. -- Stan
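Stan's md RAID6 experiment can be sketched as a dry-run plan. The device names, md device, and chunk size below are placeholders, and since mdadm --create is destructive the commands are only printed, never executed:

```shell
# Build a command plan for the 6-drive md RAID6 test described above.
devs="/dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf"   # hypothetical
chunk_kib=64
plan=$(cat <<EOF
mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=$chunk_kib $devs
mkfs.xfs -d su=${chunk_kib}k,sw=4 /dev/md0
EOF
)
printf '%s\n' "$plan"
```

Note that su/sw on the md device must be recomputed from the md chunk size (here 64 KiB over 4 data disks), not carried over from the P400 geometry.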
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-08 21:42 ` Stan Hoeppner @ 2012-04-09 5:13 ` Stan Hoeppner 2012-04-09 11:52 ` Stefan Ring 2012-04-09 9:23 ` Stefan Ring 1 sibling, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-09 5:13 UTC (permalink / raw) To: xfs On 4/8/2012 4:42 PM, Stan Hoeppner wrote: > On 4/7/2012 12:10 PM, Joe Landman wrote: >> On 04/07/2012 12:50 PM, Peter Grandi wrote: >> >>> * Your storage layer does not seem to deliver parallel >>> operations: as the ~100MB/s overall 'ext4' speed and the >>> seek graphs show, in effect your 4+2 RAID6 performs in this >>> case as if it were a single drive with a single arm. >> >> This is what lept out at me. I retried a very similar test (pulled >> Icedtea 2.1, compiled it, tarred it, measured untar on our boxen). I >> was getting a fairly consistent 4 +/- delta seconds. > > That's an interesting point. I guess I'd chalked the low throughput up > to high seeks. > >> 100MB/s on some supposedly fast drives with a RAID card indicates that >> either the RAID is badly implemented, the RAID layout is suspect, or >> similar. He should be getting closer to N(data disks) * BW(single disk) >> for something "close" to a streaming operation. > > Reading this thread seems to indicate you're onto something Joe: > http://h30499.www3.hp.com/t5/System-Administration/Extremely-slow-io-on-cciss-raid6/td-p/4214888 The P400 uses the LSISAS1078 chip, PowerPC 500MHz core, "2 hardware RAID5/6 processors". Some sequential benchmarks under Windows with 8x750GB SATA drives on an LSI 1078 based card show sequential RAID6 write rates of ~100MB/s. RAID0 write rate of this card for 8 drives is 350MB/s. These drives are capable of 50MB/s sustained writes, so the RAID0 performance isn't far off the hardware max. It seems the 1078 is simply not that quick with anything but pure striping. Hardware RAID10 write performance appears only about 50% faster than RAID6. 
The RAID6 speed is roughly 1/3rd of the RAID0 speed. So exporting the individual drives as I previously mentioned and using mdraid6 should yield at least a 3x improvement, assuming your CPUs aren't already loaded down. Or, as others have mentioned, simply install an MLC SSD and get 10-100x more random throughput with XFS if you match the agcount to the number of flash chips in the SSD. XFS parallelism flexing its muscles once again. EXT4 won't improve as much as it will tend to write the flash chips sequentially. Newegg currently has two Mushkin 120GB models for $120 each, both with 4/5 eggs. http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=100008120+600038484+50001504&QksAutoSuggestion=&ShowDeactivatedMark=False&Configurator=&IsNodeId=1&Subcategory=636&description=&hisInDesc=&Ntk=&CFG=&SpeTabStoreType=&AdvancedSearch=1&srchInDesc= -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 5:13 ` Stan Hoeppner @ 2012-04-09 11:52 ` Stefan Ring 2012-04-10 7:34 ` Stan Hoeppner 0 siblings, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-09 11:52 UTC (permalink / raw) To: stan; +Cc: xfs > It seems the 1078 is simply not that quick with anything but pure > striping. Hardware RAID10 write performance appears only about 50% > faster than RAID6. The RAID6 speed is roughly 1/3rd of the RAID0 speed. > So exporting the individual drives as I previously mentioned and using > mdraid6 should yield at least a 3x improvement, assuming your CPUs > aren't already loaded down. Whatever the problem with the controller may be, it usually behaves quite nicely. It seems clear, though, that regardless of the storage technology, it cannot be a good idea to schedule tiny blocks in the order that XFS schedules them in my case. This:

AG0 * * *
AG1 * * *
AG2 * * *
AG3 * * *

cannot be better than this:

AG0 ***
AG1 ***
AG2 ***
AG3 ***

Yes, in theory, a good cache controller should be able to sort this out. But at least this particular controller is not able to do so and could use a little help. Also, a single consumer-grade drive is certainly not helped by this write ordering.
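Stefan's point can be made quantitative with a toy model: sum the absolute head jumps for both orderings in the diagram, with AGs at block offsets 0/100/200/300 and three adjacent small writes per AG. The offsets are invented for illustration; only the ratio matters:

```shell
# Total head travel (sum of absolute jumps) for a sequence of block offsets.
travel() {
    local prev=$1 sum=0 cur d
    shift
    for cur in "$@"; do
        d=$((cur - prev))
        [ "$d" -lt 0 ] && d=$((-d))
        sum=$((sum + d))
        prev=$cur
    done
    echo "$sum"
}
# Interleaved: one small write per AG, round-robin (the observed pattern).
interleaved=$(travel 0 100 200 300 1 101 201 301 2 102 202 302)
# Batched: drain each AG's writes before seeking to the next.
batched=$(travel 0 1 2 100 101 102 200 201 202 300 301 302)
echo "interleaved travel: $interleaved, batched: $batched"
```

In this model the interleaved order travels about 5x farther (1498 vs 302 units), which is roughly the gap Stefan measured between XFS and ext4 on this workload.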
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 11:52 ` Stefan Ring @ 2012-04-10 7:34 ` Stan Hoeppner 2012-04-10 13:59 ` Stefan Ring 0 siblings, 1 reply; 64+ messages in thread From: Stan Hoeppner @ 2012-04-10 7:34 UTC (permalink / raw) To: xfs On 4/9/2012 6:52 AM, Stefan Ring wrote: > Whatever the problem with the controller may be, it behaves quite > nicely usually. It seems clear though, that, regardless of the storage > technology, it cannot be a good idea to schedule tiny blocks in the > order that XFS schedules them in my case. > > This:
> AG0 * * *
> AG1 * * *
> AG2 * * *
> AG3 * * *
>
> cannot be better than this:
>
> AG0 ***
> AG1 ***
> AG2 ***
> AG3 ***

With 4 AGs this must represent the RAID6 or RAID10 case. Those don't seem to show any overlapping concurrency. Maybe I'm missing something, but it should look more like this, at least in the concat case:

AG0 ***
AG1 ***
AG2 ***

> Yes, in theory, a good cache controller should be able to sort this > out. But at least this particular controller is not able to do so and > could use a little help. Is the cache in write-through or write-back mode? The latter should allow for aggressive reordering. The former none, or very little. And is all of it dedicated to writes, or is it split? If split, dedicate it all to writes. Linux is going to cache block reads anyway, so it makes little sense to cache them in the controller as well.
Just format with 8 AGs to be on the safe side for DLP (directory level parallelism), and you're off to the races. The features of the SF2000 series make MLC SSDs based on it much more like 'enterprise' SLC SSDs in most respects. The lines between "consumer" and "enterprise" SSDs have already been blurred as many vendors have already been selling "enterprise" MLC SSDs for a while now, including Intel, Kingston, OCZ, PNY, and Seagate. All are based on the same SandForce 2281 as in this Mushkin, or the 2282, which is required for devices over 512GB. -- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-10 7:34 ` Stan Hoeppner @ 2012-04-10 13:59 ` Stefan Ring 0 siblings, 0 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-10 13:59 UTC (permalink / raw) To: Linux fs XFS > With 4 AGs this must represent the RAID6 or RAID10 case. Yes, the original RAID 6 case. >> Yes, in theory, a good cache controller should be able to sort this >> out. But at least this particular controller is not able to do so and >> could use a little help. > > Is the cache in write-through or write-back mode? The latter should > allow for aggressive reordering. The former none, or very little. And > is all of it dedicated to writes, or is it split? If split, dedicate it > all to writes. Linux is going to cache block reads anyway, so it makes > little sense to cache them in the controller as well. The cache is a write-back cache. Yes, it’s split 75% write / 25% read. Changing to 100% write does not make a difference. I can imagine that the small read cache might be beneficial for partial stripe writes, when the stripe contents from the untouched drives are in cache. >> Also, a single consumer-grade drive is >> certainly not helped by this write ordering. > > Are you referring to the Mushkin SSD I mentioned? No, I meant rotational storage. But even SSDs should gain at least a little from a linear write pattern.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-08 21:42 ` Stan Hoeppner 2012-04-09 5:13 ` Stan Hoeppner @ 2012-04-09 9:23 ` Stefan Ring 2012-04-09 23:06 ` Stan Hoeppner 1 sibling, 1 reply; 64+ messages in thread From: Stefan Ring @ 2012-04-09 9:23 UTC (permalink / raw) To: stan; +Cc: xfs > Or merely a weak/old product. The P400 was an entry level RAID HBA, > HP's first PCIe/SAS RAID card. It was discontinued quite some time ago. > The use of DDR2/533 memory indicates it's design stage started probably > somewhere around 2004, 8 years ago. It was what you got when you bought a direct-attach storage blade from HP until a few months ago. Apparently, they changed it to P410i very recently: <http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/3709945-3709945-3710114-3722820-3722776-4304942.html?dnr=1>
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-09 9:23 ` Stefan Ring @ 2012-04-09 23:06 ` Stan Hoeppner 0 siblings, 0 replies; 64+ messages in thread From: Stan Hoeppner @ 2012-04-09 23:06 UTC (permalink / raw) To: Stefan Ring; +Cc: xfs On 4/9/2012 4:23 AM, Stefan Ring wrote: >> Or merely a weak/old product. The P400 was an entry level RAID HBA, >> HP's first PCIe/SAS RAID card. It was discontinued quite some time ago. >> The use of DDR2/533 memory indicates it's design stage started probably >> somewhere around 2004, 8 years ago. > > It was what you got when you bought a direct-attach storage blade from > HP until a few months ago. Apparently, they changed it to P410i very > recently: <http://h10010.www1.hp.com/wwpc/us/en/sm/WF06a/3709945-3709945-3710114-3722820-3722776-4304942.html?dnr=1> Nonetheless, its performance is quite bad with RAID6 (RAID10 as well). If you're happy with EXT4 on the P400-based RAID6, you'll be even much happier with 3-4x more performance using md for the RAID6. If it was worth your time to test the XFS concat, I would think this test would be even more so, as it appears you'll be sticking with EXT4. -- Stan
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-05 23:07 ` Peter Grandi 2012-04-06 0:13 ` Peter Grandi @ 2012-04-06 0:53 ` Peter Grandi 2012-04-06 7:32 ` Stefan Ring 2012-04-06 5:53 ` Stefan Ring 2 siblings, 1 reply; 64+ messages in thread From: Peter Grandi @ 2012-04-06 0:53 UTC (permalink / raw) To: Linux fs XFS > [ ... ] Which brings another subject: usually hw RAID host > adapters have cache, and have firmware that cleverly rearranges > writes. Looking at the specs of the P400: [ ... ] it seems to > me that it has a standard 256MB of cache, and only supports > RAID6 with a battery backed write cache (wise!). [ ... ] Uhm, looking further into the P400, an interesting detail: http://hardforum.com/showpost.php?s=c19964285e760bee47b8558ae82899d5&p=1033958051&postcount=4 «One is a stick of memory with a battery attached to it and one without. The one without is what the basic models ship with and usually has either 256 or 512Mb of memory, it supports caching for read operations only. [ ... ] You need the battery backed write cache module if you want to be able to use/turn on write caching on the array controller which makes a huge difference for write performance in general and is pretty much critical for raid 5 performance on writes.» It may be worthwhile to check if there is an enabled BBWC, because if there is, the host adapter should be buffering writes up to 256MiB/512MiB and sorting them, and thus long inter-AG seeks should be happening only 10 or 5 times, or not much more than that (with 4 AGs). Instead it may be happening that the P400 is doing write-through, which would reflect the unsorted seek pattern at the Linux->host adapter level into the host adapter->disk drive level.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) 2012-04-06 0:53 ` Peter Grandi @ 2012-04-06 7:32 ` Stefan Ring 0 siblings, 0 replies; 64+ messages in thread From: Stefan Ring @ 2012-04-06 7:32 UTC (permalink / raw) To: Linux fs XFS > http://hardforum.com/showpost.php?s=c19964285e760bee47b8558ae82899d5&p=1033958051&postcount=4 > «One is a stick of memory with a battery attached to it and one > without. The one without is what the basic models ship with > and usually has either 256 or 512Mb of memory, it supports > caching for read operations only. [ ... ] You need the battery > backed write cache module if you want to be able to use/turn > on write caching on the array controller which makes a huge > difference for write performance in general and is pretty much > critical for raid 5 performance on writes.» > > It may be worthwhile to check if there is an enabled BBWC > because if there is the BBWC the host adapter should be > buffering writes up to 256MiB/512MiB and sorting them thus long > inter-AG seeks should be happening only 10 or 5 times or not > much more (4 times) that. Instead it may be happening that the > P400 is doing write-through, which would reflect the unsorted > seek pattern at the Linux->host adapter level into the host > adapter->disk drive level. The write cache can also be enabled without a battery present (at considerable risk), but I insisted on getting a battery. It is enabled, and it makes a noticeable difference. Without it, it’s even slower (by more than a factor of 2).
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-06 5:53 UTC
To: Peter Grandi; +Cc: Linux fs XFS

> Which brings another subject: usually hw RAID host adapters
> have cache, and have firmware that cleverly rearranges writes.
>
> Looking at the specs of the P400:
>
> http://h18004.www1.hp.com/products/servers/proliantstorage/arraycontrollers/smartarrayp400/
>
> it seems to me that it has a standard 256MB of cache, and only
> supports RAID6 with a battery-backed write cache (wise!).
>
> Which means that your Linux-level seek graphs may be not so
> useful, because the host adapter may be drastically rearranging
> the seek patterns, and you may need to tweak the P400 elevator,
> rather than or in addition to the Linux elevator.
>
> Unless possibly barriers are enabled, and even with a BBWC the
> P400 writes through on receiving a barrier request. IIRC XFS is
> rather stricter in issuing barrier requests than 'ext4', and
> you may be seeing more the effect of that than the effect of
> aiming to split the access patterns between 4 AGs to improve
> the potential for multithreading (which you deny because you
> are using what is most likely a large RAID6 stripe size with a
> small-IO-intensive write workload, as previously noted).

Yes, it does have 256 MB of BBWC, and it is enabled. When I
disabled it, the time needed rose from 120 sec in the BBWC case
to a whopping 330 sec.

IIRC, I did the benchmark with barrier=0, but changing this did
not make a big difference. Nothing did; that's what frustrated me
a bit ;). I also tried different Linux IO elevators, as you
suggested in your other response, without any measurable effect.
The stripe size is this, btw.: su=16k,sw=4
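For reference, that stripe geometry can be handed to mkfs.xfs explicitly, together with the allocation-group count discussed in the thread. A sketch only; the device path is a hypothetical placeholder:

```shell
# Sketch: /dev/cciss/c0d0p2 is a hypothetical device name.
# su/sw match the RAID6 geometry above (16 KiB chunk, 4 data
# disks); agcount pins the number of allocation groups.
mkfs.xfs -f -d su=16k,sw=4,agcount=4 /dev/cciss/c0d0p2
```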
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-06 15:35 UTC
To: Linux fs XFS

[ ... ]

> Yes, it does have 256 MB BBWC, and it is enabled. When I
> disabled it, the time needed would rise from 120 sec in the
> BBWC case to a whopping 330 sec.
>
> IIRC, I did the benchmark with barrier=0, but changing this did
> not make a big difference.

Note that the syntax is slightly different between 'ext4' and
XFS: if you use 'barrier=0' with XFS, it will mount the filetree
*with* barriers (just double-checked); the XFS option to disable
them is 'nobarrier'.

> Nothing did; that's what frustrated me a bit ;). I also tried
> different Linux IO elevators, as you suggested in your other
> response, without any measurable effect.

Here a lot depends on the firmware of the P400, because with a
BBWC it can in theory completely ignore the request ordering and
the barriers it receives from the Linux side (and barriers
*might* be disabled), so by and large the Linux elevator should
not matter. Note: what would look good in this very narrow
example is something like 'anticipatory', and while that
scheduler has disappeared, IIRC there is a way to tweak 'cfq' to
behave like it.

But the times you report are consistent with the notion that
your Linux-side seek graph is what happens at the P400 level
too, which is something that should not happen with a BBWC. If
that's the case, tweaking the Linux-side scheduling might help,
for example greatly increasing 'queue/nr_requests' and
'device/queue_depth' ('nr_requests' should apparently be at
least twice 'queue_depth' in most cases). Or else ensuring that
the P400 does reorder requests and does not write through, as it
has a BBWC.
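The queue-depth tweak suggested above can be sketched via sysfs. The device name 'sda' is a placeholder and the numbers are illustrative only, keeping nr_requests at least twice queue_depth per the note above:

```shell
# Illustrative values only; 'sda' is a placeholder device name.
echo 128 > /sys/block/sda/device/queue_depth  # per-device (SCSI) queue depth
echo 256 > /sys/block/sda/queue/nr_requests   # block-layer queue, >= 2x queue_depth
```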
Overall your test does not seem very surprising to me, except
that it is strange that in the XFS 4-AG case the generated IO
stream (at the Linux level) seeks incessantly between the 4 AGs
instead of in phases, and this apparently gets reflected to the
disks by the P400 even though it has a BBWC. It is not clear to
me why the seeks among the 4 AGs happen in such a tightly
interleaved way (barriers? the way journaling works?) instead of
a more bulky way. The suggestion by another commenter to use
'rotorstep' (probably set to a high value) may help then, as it
bunches files in AGs.

> The stripe size is this, btw.: su=16k,sw=4

BTW, congratulations for limiting your RAID6 set to 4+2, and for
using a relatively small chunk size compared to those chosen by
many others. But it is still pretty large for this case: the
full data stripe is 64KiB when your average file size is around
12KiB. Potentially lots of RMW, and little opportunity to take
advantage of the higher parallelism of having 4 AGs with 4
independent streams of data. As mentioned in another comment, I
got nearly the same 'ext4' writeout rate on a single disk...
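The RMW point checks out with back-of-envelope arithmetic: with su=16k and sw=4, a full data stripe is 64 KiB, so a ~12 KiB average file covers under a fifth of one stripe, and most small-file writes force the controller to read-modify-write parity:

```shell
su_kib=16   # chunk (stripe unit) in KiB
sw=4        # data disks in the 4+2 RAID6 set
stripe_kib=$((su_kib * sw))
echo "full data stripe: ${stripe_kib} KiB"          # 64 KiB
echo "avg files per stripe: $((stripe_kib / 12))"   # ~5 files of ~12 KiB each
```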
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Stefan Ring @ 2012-04-10 14:05 UTC
To: Linux fs XFS

> BTW congratulations for limiting your RAID6 set to 4+2, and
> using a relatively small chunk size compared to that chosen by
> many others.

Interestingly, it performs better with a larger stripe size,
though. Probably because the controller is better able to
combine writes when the blocks are larger.
* Re: XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?)
From: Peter Grandi @ 2012-04-07 19:11 UTC
To: Linux fs XFS

> [ ... ] I also tried different Linux IO elevators, as you
> suggested in your other response, without any measurable
> effect. [ ... ]

That's probably because the RAID6 host adapter is being
uncooperative, but I wondered whether this might apply in some
form:

http://xfs.org/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E

«As of kernel 3.2.12, the default i/o scheduler, CFQ, will
defeat much of the parallelization in XFS.»

BTW, earlier 'cfq' versions have been reported to have huge
problems with workloads involving mixed writes and reads, and
only 'deadline' (which is quite unsuitable for some workloads)
seems to be fairly reliable:

http://www.webhostingtalk.com/showthread.php?t=727173

«Anyway, we have found HUGE problems with CFQ in many different
scenarios and many different hardware setups. If it was only an
issue with our configuration I would have foregone posting this
message and simply informed those kernel developers responsible
for the fix.

Two scenarios where CFQ has a severe problem: When you are
running a single block device (1 drive, or a raid 1 scenario),
under certain circumstances where heavy sustained writes are
occurring, the CFQ scheduler will behave very strangely. It will
begin to give all access to reads and limit all writes to the
point of allowing only 0-2 I/O write operations per second vs
100-180 read operations per second. This condition will persist
indefinitely until the sustained write process completes. This
is VERY bad for a shared environment where you need reads and
writes to complete regardless of increased reads or writes.

This behavior goes beyond what CFQ says it is supposed to do in
this situation - meaning this is a bug, and a serious one at
that. We can reproduce this EVERY TIME.

The second scenario occurs when you have two or more block
devices, either single drives or any type of raid array,
including raid 0, 1, 0+1, 1+0, 5 and 6. (We never tested 3 or 4;
who uses raid 3 or 4 anymore anyway?!) This case is almost
exactly the opposite of what happens with only one block device:
if one or more of the drives is blocked with heavy writes for a
sustained period of time, CFQ will block reads from the other
devices or severely limit the reads until the writes have
completed. We can also reproduce this behavior with test
software we have written, on a 100% consistent basis.»
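Trying a different elevator, as discussed above, is a per-device sysfs one-liner; a sketch with 'sda' as a placeholder device name:

```shell
# 'sda' is a placeholder; the active scheduler is shown in brackets,
# e.g. "noop deadline [cfq]".
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler
```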
end of thread, other threads: [~2012-04-14 11:30 UTC | newest]

Thread overview: 64+ messages

2012-04-05 18:10 XFS: Abysmal write performance because of excessive seeking (allocation groups to blame?) Stefan Ring
2012-04-05 19:56 ` Peter Grandi
2012-04-05 22:41 ` Peter Grandi
2012-04-06 14:36 ` Peter Grandi
2012-04-06 15:37 ` Stefan Ring
2012-04-07 13:33 ` Peter Grandi
2012-04-05 21:37 ` Christoph Hellwig
2012-04-06 1:09 ` Peter Grandi
2012-04-06 8:25 ` Stefan Ring
2012-04-07 18:57 ` Martin Steigerwald
2012-04-10 14:02 ` Stefan Ring
2012-04-10 14:32 ` Joe Landman
2012-04-10 15:56 ` Stefan Ring
2012-04-10 18:13 ` Martin Steigerwald
2012-04-10 20:44 ` Stan Hoeppner
2012-04-10 21:00 ` Stefan Ring
2012-04-05 22:32 ` Roger Willcocks
2012-04-06 7:11 ` Stefan Ring
2012-04-06 8:24 ` Stefan Ring
2012-04-05 23:07 ` Peter Grandi
2012-04-06 0:13 ` Peter Grandi
2012-04-06 7:27 ` Stefan Ring
2012-04-06 23:28 ` Stan Hoeppner
2012-04-07 7:27 ` Stefan Ring
2012-04-07 8:53 ` Emmanuel Florac
2012-04-07 14:57 ` Stan Hoeppner
2012-04-09 11:02 ` Stefan Ring
2012-04-09 12:48 ` Emmanuel Florac
2012-04-09 12:53 ` Stefan Ring
2012-04-09 13:03 ` Emmanuel Florac
2012-04-09 23:38 ` Stan Hoeppner
2012-04-10 6:11 ` Stefan Ring
2012-04-10 20:29 ` Stan Hoeppner
2012-04-10 20:43 ` Stefan Ring
2012-04-10 21:29 ` Stan Hoeppner
2012-04-09 0:19 ` Dave Chinner
2012-04-09 11:39 ` Emmanuel Florac
2012-04-09 21:47 ` Dave Chinner
2012-04-07 8:49 ` Emmanuel Florac
2012-04-08 20:33 ` Stan Hoeppner
2012-04-08 21:45 ` Emmanuel Florac
2012-04-09 5:27 ` Stan Hoeppner
2012-04-09 12:45 ` Emmanuel Florac
2012-04-13 19:36 ` Stefan Ring
2012-04-14 7:32 ` Stan Hoeppner
2012-04-14 11:30 ` Stefan Ring
2012-04-09 14:21 ` Geoffrey Wehrman
2012-04-10 19:30 ` Stan Hoeppner
2012-04-11 22:19 ` Geoffrey Wehrman
2012-04-07 16:50 ` Peter Grandi
2012-04-07 17:10 ` Joe Landman
2012-04-08 21:42 ` Stan Hoeppner
2012-04-09 5:13 ` Stan Hoeppner
2012-04-09 11:52 ` Stefan Ring
2012-04-10 7:34 ` Stan Hoeppner
2012-04-10 13:59 ` Stefan Ring
2012-04-09 9:23 ` Stefan Ring
2012-04-09 23:06 ` Stan Hoeppner
2012-04-06 0:53 ` Peter Grandi
2012-04-06 7:32 ` Stefan Ring
2012-04-06 5:53 ` Stefan Ring
2012-04-06 15:35 ` Peter Grandi
2012-04-10 14:05 ` Stefan Ring
2012-04-07 19:11 ` Peter Grandi