* I/O hang, possibly XFS, possibly general
@ 2011-06-02 14:42 Paul Anderson
From: Paul Anderson
To: xfs-oss

This morning I saw the symptoms of an I/O throughput problem in which dirty pages appeared to be taking a very long time to write to disk.

The system is a large x64 Dell 810 server with 192GiB of RAM, running 2.6.38.5 from kernel.org. The basic workload was data intensive: concurrent large NFS traffic (high metadata, low file size) and rsync/lftp (low metadata, high file size), all working in a 200TiB XFS volume on a software MD RAID0 on top of 7 software MD RAID6 arrays, each with 18 drives. I had mounted the filesystem with inode64,largeio,logbufs=8,noatime.

The specific symptom was that 'sync' hung, a dpkg command hung (presumably trying to issue fsync), and experimenting with "killall -STOP" or "kill -STOP" on the workload jobs didn't let the system drain I/O enough to finish the sync. I probably did not wait long enough, however.

So here's what I did to diagnose: with all workloads stopped, there was still low-rate I/O from kflush to the md array jobs. No CPU starvation, but the I/O rate was low - 5-30MiB/s (the array can readily do >1000MiB/s for big I/O). Mind you, one "md5sum --check" job was able to run at >200MiB/s without trouble - turn it off or on and the aggregate I/O load shoots right up or down along with it - so I'm fairly confident in the underlying physical arrays as well as in XFS large-data I/O.

I did "echo 3 > /proc/sys/vm/drop_caches" repeatedly and noticed that according to top, the total amount of cached data would drop rapidly (the first time had the big drop), but would still be stuck at around 8-10GiB.
While continuing to do this, I noticed that the cached data value was in fact dropping slowly (at 5-30MiB/s), and it finally dropped to approximately 60MB, at which point the stuck dpkg command finished and I was again able to issue sync commands that completed instantly.

My guess is that I've done something to fill the buffer pool with slow-to-flush metadata - prior to rebooting the machine a few minutes ago, I removed the largeio option from /etc/fstab.

I can't say this is an XFS bug specifically - more likely it's how I am using it - but are there other tools I can use to better diagnose what is going on? I do know it will happen again, since we will have 5 of these machines running at very high rates soon.

Also, any suggestions for better metadata or log management are very welcome. This particular machine is probably our worst case, since it has the widest variation in offered file I/O load (tens of millions of small files, thousands of >1GB files). If this workload is pushing XFS too hard, I can deploy new hardware to split the workload across different filesystems.

Thanks very much for any thoughts or suggestions,

Paul Anderson

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
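[A note on the diagnosis above: top's cache figure conflates clean and dirty pages. A minimal sketch of watching the writeback counters directly - field names are as exposed by /proc/meminfo on recent kernels, and the commands are read-only, so safe on a production box:]

```shell
# Sample the dirty/writeback counters a few times, one second apart.
for i in 1 2 3; do
    awk '/^(Dirty|Writeback|NFS_Unstable):/ { printf "%-13s %9d kB\n", $1, $2 }' /proc/meminfo
    sleep 1
done
```

If Dirty stays large while Writeback stays near zero, pages are accumulating faster than the flusher can push them out.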
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-02 16:17 Stan Hoeppner
From: Stan Hoeppner
To: Paul Anderson; +Cc: xfs-oss

On 6/2/2011 9:42 AM, Paul Anderson wrote:
> I had mounted the filesystem with inode64,largeio,logbufs=8,noatime.

I don't see 'delaylog' in your mount options, nor an external log device specified. Delayed logging dramatically decreases IOPS to the log device by cleverly discarding duplicate metadata write operations, among other tricks. Enabling it may solve your problem, given your high-metadata workload.

Delayed logging design document:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt

Delaylog was an optional mount option from 2.6.35 to 2.6.38. In 2.6.39 and up it is the default. Give it a go.

-- 
Stan
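[For reference, enabling delaylog on a 2.6.35-2.6.38 kernel is just a mount-option change; a sketch, with the device and mount point illustrative rather than taken from the thread:]

```shell
# /etc/fstab line with delayed logging added to Paul's options:
# /dev/md0  /data  xfs  inode64,largeio,logbufs=8,noatime,delaylog  0 0

# One-off mount with the same options:
mount -t xfs -o inode64,largeio,logbufs=8,noatime,delaylog /dev/md0 /data
```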
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-02 18:56 Peter Grandi
From: Peter Grandi
To: Linux fs XFS

> This morning, I had a symptom of an I/O throughput problem in which
> dirty pages appeared to be taking a long time to write to disk.

That can happen for a lot of reasons: elevator issues (CFQ has serious problems), even CPU scheduler issues, RAID HA firmware problems (if you are using one - you seem to be using MD, but you may be using several HAs in JBOD mode to handle all the disks), or problems with the Linux page cache (readahead, the abominable plugger) or the flusher (the defaults are not so hot). Sometimes there are odd resonances between the page cache and multiple layers of MD or LVM too. Lots of people have been burned even with much simpler setups than the one you describe below:

> The system is a large x64 192GiB dell 810 server running
> 2.6.38.5 from kernel.org - the basic workload was data
> intensive - concurrent large NFS (with high metadata/low
> filesize),

Very imaginative. :-)

> rsync/lftp (with low metadata/high file size)

More suitable, but insignificant compared to this:

> all working in a 200TiB XFS volume on a software MD raid0 on
> top of 7 software MD raid6, each w/18 drives.

That's rather more than imaginative :-). But this is a family oriented mailing list, so I can't use appropriate euphemisms, because they no longer look like euphemisms.

> [ ... ] (the array can readily do >1000MiB/second for big
> I/O). [ ... ]

In a very specific narrow case, and you can get that with a lot fewer disks. You have 126 drives that can each do 130MB/s (outer tracks), so you should be getting 10GB/s :-).
Also, your 1000MiB/s set probably is not full yet, so that's outer tracks only; when it fills up, data gets onto the inner tracks and gets a bit churned, and then the real performance will "shine" through.

> I did "echo 3 > /proc/sys/vm/drop_caches" repeatedly and
> noticed that according to top, the total amount of cached data
> would drop down rapidly (first time had the big drop), but
> still be stuck at around 8-10Gigabytes.

You have to watch '/proc/meminfo' to check the dirty pages in the cache. But you seem to have 8-10GiB of dirty pages in your 192GiB system. Extraordinarily imaginative.

> While continuing to do this, I noticed finally that the cached
> data value was in fact dropping slowly (at the rate of
> 5-30MiB/second), and in fact finally dropped down to
> approximately 60Megabytes at which point the stuck dpkg
> command finished, and I was again able to issue sync commands
> that finished instantly.

Fantastic stuff - is that cached data, or cached and dirty data? Guessing that it is cached and dirty (also because of the "Subject" line), do you really want to have several GiB of cached dirty pages? Do you want those to be zillions of little metadata transactions scattered at random all over the place? How "good" (I hesitate to use the very word in this context) is this more-than-imaginative RAID60 set at writing widely scattered small transactions?

> [ ... ] since we will have 5 of these machines running at
> very high rates soon.

Look forward to that :-).

> Also, any suggestions for better metadata

Use some kind of low-overhead database if you need a database; otherwise, pray :-)

> or log management are very welcome.

Separate drives / flash SSD / RAM SSD. As previously revealed by a question I asked, Linux MD does full-width stripe updates with RAID6. The wider, the better, of course :-).

> This particular machine is probably our worst, since it has
> the widest variation in offered file I/O load (tens of
> millions of small files, thousands of >1GB files).
Wide variation is not the problem, and neither is the machine; it is the approach.

> If this workload is pushing XFS too hard,

XFS is a very good design within a fairly well defined envelope, and often the problems are more with Linux or application issues, but you may be a bit outside that envelope (euphemism alert), and you need to work with the grain of the storage system (understatement of the week).

> I can deploy new hardware to split the workload across
> different filesystems.

My usual recommendation is to default (unless you have extraordinarily good arguments otherwise, and almost nobody does) to RAID10 sets of at most 10 pairs (of "enterprise" drives of no more than 1TB each), with XFS or JFS depending on workload, and as many servers as needed (if at all possible located topologically near their users, to avoid some potentially nasty network syndromes like incast), and to forget about having a single large storage pool. Other details, such as the flusher (every 1-2 seconds) and the elevator (deadline or noop), can matter a great deal.

If you do need a single large storage pool, almost the only reasonable way currently (even if I have great hopes for GlusterFS) is Lustre or one of its forks (or a much simpler imitator like DPM), and that has its own downsides (it takes a lot of work). But a single large storage pool is almost never needed - at most a single large namespace - and that can be instantiated with an automounter (and Lustre/DPM/... is in effect a more sophisticated automounter).

If you know better, go ahead and build 200TB XFS filesystems on top of a 7x(16+2) drive RAID60 and put lots of small files in them (or whatever), and don't even think about 'fsck' because you "know" it will never happen. And what about backing up one of those storage sets to another one? That can happen in the "background" of course, with no extra load :-).

Just realized another imaginative detail: a 126-drive RAID60 set delivering 200TB - looks like you are using 2TB drives.
Why am I not surprised? It would be just picture-perfect if they were low-cost "eco" drives, and only a bit less so if they were ordinary drives without ERC. Indeed, cost-conscious budget heroes can only suggest using 2TB drives in a 126-drive RAID60 set even for a small-file, metadata-intensive workload, because IOPS and concurrent read/write are obsolete concepts in many parts of the world.

Disclaimer: some smart people I know knowingly built a similar and fortunately much smaller collection of RAID6 sets because that was the least-worst option for them, and since they know it will not fill up before they can replace it, they are effectively short-stroking all those 2TB drives (I still would have bought ERC ones if possible), so it's cooler than it looks.

> Thanks very much for any thoughts or suggestions,

* Don't expect to slap together a lot of stuff at random and have it work just like that. But then, if you didn't expect that, you wouldn't have done any of the above.

* "My usual recommendation" above is freely given, yet often worth more than months/years of very expensive consultants.

* This mailing list is continuing proof that the "let's bang it together, it will just work" club is large.
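[An aside on the elevator suggestion above: the scheduler can be inspected, and switched, per block device at runtime. A sketch - read-only as written, with the actual switch left commented out because it needs root:]

```shell
# Show the current I/O scheduler for every block device;
# the bracketed name is the active one, e.g. "noop [cfq] deadline".
for q in /sys/block/*/queue/scheduler; do
    [ -e "$q" ] || continue
    printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
    # To switch (as root): echo deadline > "$q"
done
```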
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-02 21:24 Paul Anderson
From: Paul Anderson
To: Peter Grandi; +Cc: Linux fs XFS

Hi Peter - I appreciate the feedback!

The background for this is that we live in an extreme corner case of the world - our use case deals with 1GiB to 100GiB files at present, and in the future probably 500GiB files (aggregated data from multiple deep sequencing runs). The data itself has very odd lifecycle behavior as well - since it is research, the different stages are still being sorted out, but some stages are essentially write once, read once, maybe keep, maybe discard, depending on the research scenario.

Parenthetically, I will note there are numerous other issues and problems that impose constraints beyond what is noted here - conventional workflow, research problems, budgets, rack space, rack power, time, and more.

On Thu, Jun 2, 2011 at 2:56 PM, Peter Grandi <pg_xf2@xf2.for.sabi.co.uk> wrote:
>> This morning, I had a symptom of an I/O throughput problem in which
>> dirty pages appeared to be taking a long time to write to disk.
>
> That can happen because of a lot of reasons, like elevator
> issues (CFQ has serious problems) [ ... ] or problems with the
> Linux page cache (read ahead, the abominable plugger) or the
> flusher (the defaults are not so hot).

All JBOD chassis (SuperMicro SC 847s)... been experimenting with the flusher; will look at the others.
> Lots of people have been burned even with much simpler setups
> than the one you describe below:

No doubt.

>> rsync/lftp (with low metadata/high file size)
>
> More suitable, but insignificant compared to this:

The rsync jobs currently appear to be causing the issue - one was rsyncing around 250,000 files. If the copy has already been done, the rsync is fast (i.e. stat is fast, despite the numbers), but when it starts moving data, the IOPS pegs and seems to be the limiting factor.

>> all working in a 200TiB XFS volume on a software MD raid0 on
>> top of 7 software MD raid6, each w/18 drives.
>
> That's rather more than imaginative :-). [ ... ]

We most likely live in different worlds - this is a pure research group with "different" constraints than those you're probably used to. Not my choice, but 4-10X the cost per unit of storage is currently not an option.

>> [ ... ] (the array can readily do >1000MiB/second for big
>> I/O). [ ... ]
>
> In a very specific narrow case, and you can get that with a lot
> fewer disks. You have 126 drives that can each do 130MB/s (outer
> tracks), so you should be getting 10GB/s :-).

The raw hardware will do about 5GiB/s - as near as I can tell, this saturates the PCIe bus (or maybe main memory). With XFS freshly installed, it was doing around 1400MiB/s write and around 1900MiB/s read, with 10 parallel high-throughput processes reading or writing as fast as possible (which actually is our use case).
> Also, your 1000MiB/s set probably is not full yet, so that's
> outer tracks only [ ... ]

Yeah - overall, I expect it to drop - perhaps 50%? I don't know. The particular filesystem being discussed is 80% full at the moment.

> You have to watch '/proc/meminfo' to check the dirty pages in
> the cache. But you seem to have 8-10GiB of dirty pages in your
> 192GiB system. Extraordinarily imaginative.

Will watch that - yes, too many dirty pages in RAM; the defaults are far from optimal here.

> Fantastic stuff, is that cached data or cached and dirty data?
> Guessing that it is cached and dirty (also because of the
> "Subject" line), do you really want to have several GiB of
> cached dirty pages?

After watching it reach steady state at around 60MB, it appears not to be dirty, as a sync command returned immediately and had no effect on that value. No, I do not want lots of dirty pages; however, I'm also aware that if those are just data pages, they represent only a few seconds of system operation.

> [ ... ] since we will have 5 of these machines running at
> very high rates soon.
>
> Look forward to that :-).

We are, actually - it is a tremendous improvement over what we've been using.

> Use some kind of low overhead database if you need a database,
> else pray :-)

No database that I'm aware of will work, at least for the end data storage.

> Wide variation is not the problem, and neither is the machine,
> it is the approach.

All other approaches I am aware of cost more. I favor Lustre, but the infrastructure costs alone for a 2-5PB system would be exceptional. Not that we have much choice - the system we have is well beyond the limits of what we should really be doing - but the constraints are also exceptional.

> XFS is a very good design within a fairly well defined envelope,
> and often the problems are more with Linux or application
> issues, but you may be a bit outside that envelope (euphemism
> alert), and you need to work with the grain of the storage
> system (understatement of the week).
> My usual recommendation is to default (unless you have
> extraordinarily good arguments otherwise, and almost nobody
> does) to use RAID10 sets of at most 10 pairs [ ... ] and forget
> about having a single large storage pool.

Re RAID10 specifically: I'd love to do something better; however, the process is currently severely cost and space constrained.

> If you do need a single large storage pool almost the only
> reasonable way currently (even if I have great hopes for
> GlusterFS) is Lustre or one of its forks (or much simpler
> imitators like DPM), and that has its own downsides (it takes a
> lot of work) [ ... ]

"It takes a lot of work" is another reason we aren't readily able to move to other architectures, despite their many advantages.

> If you know better go ahead and build 200TB XFS filesystems on
> top of a 7x(16+2) drive RAID60 and put lots of small files in
> them (or whatever) and don't even think about 'fsck' because you
> "know" it will never happen. And what about backing up one of
> those storage sets to another one?

fsck happens in less than a day, likewise rebuilding all the RAIDs... backups are interesting - backup was impossible with our prior generation of storage, but is possible now due to higher disk and network bandwidth. Keep in mind our ultimate backup is tissue samples.
> Just realized another imaginative detail: a 126 drive RAID60 set
> delivering 200TB, looks like that you are using 2TB drives. [ ... ]

We fortunately were able to afford reasonably good enterprise drives. 2TB drives are mandatory - there simply isn't enough space available in the data center otherwise. And the bulk of the work is not small-file; almost all of it is large files.

> Disclaimer: some smart people I know built knowingly a similar
> and fortunately much smaller collection of RAID6 sets because
> that was the least worst option for them [ ... ]

That is precisely the situation here - it is the "least worst" option.

> * This mailing list is continuing proof that the "let's bang it
> together, it will just work" club is large.

Research is research - not my choice of how it is done, either.
Paul
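[On the "defaults are far from optimal" point: the writeback tunables being discussed live under /proc/sys/vm and can be read safely anywhere. The commented-out values below are illustrative, not recommendations for this machine:]

```shell
# Show the current writeback knobs (read-only).
for k in dirty_ratio dirty_background_ratio \
         dirty_expire_centisecs dirty_writeback_centisecs; do
    printf '%-26s %s\n' "$k" "$(cat /proc/sys/vm/$k)"
done

# On a 192GiB box, percentage-based limits allow tens of GiB of dirty
# pages; the byte-based caps (2.6.29+) bound that directly (needs root):
#   echo $((1 * 1024 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
#   echo $((4 * 1024 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
```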
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-02 23:59 Phil Karn
From: Phil Karn
To: Paul Anderson; +Cc: Linux fs XFS

On 6/2/11 2:24 PM, Paul Anderson wrote:
> The data itself has very odd lifecycle behavior, as well - since it is
> research, the different stages are still being sorted out, but some
> stages are essentially write once, read once, maybe keep, maybe
> discard, depending on the research scenario.
...
> The bulk of the work is not small-file - almost all is large files.

Out of curiosity, do your writers use the fallocate() call? If not, how fragmented do your filesystems get?

Even if most of your data isn't read very often, it seems like a good idea to minimize its fragmentation, because that also reduces fragmentation of the free list, which makes it easier to keep contiguous the other files that *are* heavily read. Also, fewer extents per file means less metadata per file, ergo less metadata and log I/O, etc.

When a writer knows in advance how big a file will be, I can't see any downside to having it call fallocate() to let the file system know. Soon after I switched to XFS six months ago I began running locally patched versions of rsync/tar/cp and so on, and they really do minimize fragmentation with very little effort.
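[The effect Phil describes can be tried from the shell with fallocate(1) from util-linux, which issues the real fallocate(2) syscall rather than writing zeroes; the path and size here are illustrative:]

```shell
# Preallocate a file's full size up front, then inspect it.
f=$(mktemp /tmp/prealloc.XXXXXX)
fallocate -l 16M "$f"      # one allocation request, no data written
ls -ls "$f"                # blocks are allocated; contents read as zeroes
# On XFS, "xfs_bmap $f" would show how many extents were created.
rm -f "$f"
```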
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-03 0:39 Dave Chinner
From: Dave Chinner
To: Phil Karn; +Cc: Paul Anderson, Linux fs XFS

On Thu, Jun 02, 2011 at 04:59:25PM -0700, Phil Karn wrote:
> Out of curiosity, do your writers use the fallocate() call? If not, how
> fragmented do your filesystems get?
> [ ... ]
> When a writer knows in advance how big a file will be, I can't see any
> downside to having it call fallocate() to let the file system know.

You're ignoring the fact that delayed allocation effectively does this for you without needing to physically allocate the blocks. So when you have files that are short-lived, you don't actually do any allocation at all. Further, delayed allocation results in allocation order according to writeback order rather than write() order, so I/O patterns are much nicer when using delayed allocation. Basically, you are removing one of the major I/O optimisation capabilities of XFS by preallocating everything like this.
> Soon after I switched to XFS six months ago I've been running locally
> patched versions of rsync/tar/cp and so on, and they really do minimize
> fragmentation with very little effort.

So you don't have any idea of how well XFS minimises fragmentation without needing to use preallocation? Sounds like you have a classic case of premature optimisation. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-03 2:11 Phil Karn
From: Phil Karn
To: Dave Chinner; +Cc: Paul Anderson, Linux fs XFS

On Thu, Jun 2, 2011 at 5:39 PM, Dave Chinner <david@fromorbit.com> wrote:
> You're ignoring the fact that delayed allocation effectively does
> this for you without needing to physically allocate the blocks.
> [ ... ]

Oh, I'm well aware of delayed allocation. I've just noticed that, in my experience, it doesn't seem to work nearly as well as fallocate(). And why should it? If you know in advance how big a file you're writing, how can it hurt to inform your file system? I suppose the FS implementer could always ignore that information if he felt he could somehow do a better job, but it's hard to see how. Isn't it always better to know than to guess?

I'm talking here about the genuine fallocate() system call, not the POSIX hack that falls back to conventionally writing zeroes over the file. The true fallocate() call seems very fast, and if your file system doesn't support it, it simply fails without harm. I still can't see any reason not to use it.

I did know that XFS can avoid the disk allocation and writes entirely when the files are short-lived, but Paul was talking about writing large, long-lived files, so that's what I had in mind. And when I use fallocate(), my files are not likely to be short-lived either. Like most people, I write the vast majority of my short-lived files to /tmp, which is tmpfs, not xfs.
But you do raise an interesting point: is there any serious performance degradation from using fallocate() on a short-lived file? The written data still lives in the buffer cache for a while, so if you delete the file before it gets flushed, the disk writes will still be avoided. The file system may have a little extra work to undo the unnecessary allocation, but that doesn't seem like a big deal.

> Basically you are removing one of the major IO optimisation
> capabilities of XFS by preallocating everything like this.

"Remove" it? How is giving it the correct answer worse than letting it guess - even if it usually guesses correctly? I still rely on preallocation to keep log files and mailboxes from getting too badly fragmented.

> So you don't have any idea of how well XFS minimises fragmentation
> without needing to use preallocation? Sounds like you have a classic
> case of premature optimisation. ;)

As I said, I've tried it both ways. I found that the simple act of adding fallocate() to rsync (which I use for practically all copying) vastly reduces XFS fragmentation. Just as I expected it would.

Maybe I'm a little more sensitive to fragmentation than most because I've been experimenting with storing SHA1 hashes of all my files in extended attributes. This grew out of a data deduplication tool; at first I simply cached the hashes so I wouldn't have to recompute them on another run, but then I just added them to every file. This lets me get a warm and fuzzy feeling by periodically verifying that my files haven't been corrupted, especially since I began to use SSDs with trim tools.

XFS stores both attributes and extent lists directly in the inode when there's room, and it turns out that a default-sized XFS inode can store my hashes provided that the extent list is small. So now when I walk through my file system statting everything, I can read the hashes too at absolutely no extra cost. This makes deduplication really fast.
I haven't experimented to see how many extents a file can have before the attributes get pushed out of the inode, but by keeping most everything contiguous I simply avoid the problem.
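[Phil's hash-in-xattr scheme can be sketched in a few lines of shell. setfattr/getfattr from the attr package are assumed, and the snippet skips quietly where user xattrs aren't available:]

```shell
# Compute a file's SHA-1 and stash it in a user.* extended attribute.
f=$(mktemp ./hashdemo.XXXXXX)
printf 'example data\n' > "$f"
h=$(sha1sum "$f" | awk '{print $1}')
if command -v setfattr >/dev/null 2>&1 &&
   setfattr -n user.sha1 -v "$h" "$f" 2>/dev/null; then
    getfattr -n user.sha1 --only-values "$f"; echo
else
    echo "user xattrs unavailable here; sha1=$h"
fi
rm -f "$f"
```

A later integrity pass would recompute sha1sum and compare it against the stored user.sha1 value.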
* Re: I/O hang, possibly XFS, possibly general
@ 2011-06-03 2:54 Dave Chinner
From: Dave Chinner
To: karn; +Cc: Paul Anderson, Linux fs XFS

On Thu, Jun 02, 2011 at 07:11:15PM -0700, Phil Karn wrote:
> Oh, I'm well aware of delayed allocation. I've just noticed that, in my
> experience, it doesn't seem to work nearly as well as fallocate(). And
> why should it? If you know in advance how big a file you're writing,
> how can it hurt to inform your file system? [ ... ] Isn't it always
> better to know than to guess?

There are definitely cases where it helps prevent fragmentation, but as a sweeping generalisation it is very, very wrong.

> I'm talking here about the genuine fallocate() system call, not the
> POSIX hack that falls back to first conventionally writing zeroes over
> the file. [ ... ]
>
> I did know that xfs can avoid the disk allocation and writes entirely
> when the files are short-lived, but Paul was talking about writing
> large, long-lived files so that's what I had in mind.
And when I use fallocate(), > my files are not likely to be short-lived either. Like most people I write > the vast majority of my short-lived files to /tmp, which is tmpfs, not xfs. Do you do that for temporary object files when you build <program X> from source? > But you do raise an interesting point -- is there any serious performance > degradation from using fallocate() on a short-lived file? Allocation and freeing have CPU overhead, transaction overhead, log space overhead, can cause free space fragmentation when you have a mix of short- and long-lived files being preallocated at the same time, IO for long lived data does not get packed together closely so requires more seeks to issue, which leads to significantly worse IO performance on RAID5/6 storage sub-systems, etc. I could go on for quite some time, but the overall effect of such behaviour is that it speeds up filesystem aging degradation significantly. You might not notice that for 6 months or a year, but when you do.... > The written data > still lives in the buffer cache for a while, so if you delete the file > before it gets flushed the disk writes will still be avoided. The file > system may have a little extra work to undo the unnecessary allocation but > that doesn't seem to be a big deal. > > Basically you are removing one of the major IO optimisation > > capabilities of XFS by preallocating everything like this. > > > > "Remove" it? How is giving it the correct answer worse than letting it guess > -- even if it usually guesses correctly? See above. > I still rely on preallocation to keep log files and mailboxes from getting > too badly fragmented. > > >So you don't have any idea of how well XFS minimises fragmentation > > > without needing to use preallocation? Sounds like you have a classic > > case of premature optimisation. ;) > > > > > As I said, I've tried it both ways.
I found that the simple act of adding > fallocate() to rsync (which I use for practically all copying) vastly > reduces xfs fragmentation. Just as I expected it would. > > Maybe I'm a little more sensitive to fragmentation than most because I've > been experimenting with storing SHA1 hashes of all my files in external > attributes. This grew out of a data deduplication tool; at first I simply > cached the hashes so I wouldn't have to recompute them on another run, but > then I just added them to every file. This lets me get a warm and fuzzy > feeling by periodically verifying that my files haven't been corrupted, > especially when I began to use SSDs with trim tools. > > XFS stores both attributes and extent lists directly in the inode when > there's room, and it turns out that a default-sized xfs inode can store my > hashes provided that the extent list is small. So now when I walk through > my file system statting everything I can read the hashes too at absolutely > no extra cost. This makes deduplication really fast. /me slaps his forehead. You do realise that your "attr out of line" problem would have gone away by simply increasing the XFS inode size at mkfs time? And that there is almost no performance penalty for doing this? Instead, it seems you found a hammer named fallocate() and proceeded to treat every tool you have like a nail. :) Changing a single mkfs parameter is far less work than maintaining your own forks of multiple tools.... > I haven't experimented to see how many extents a file can have > before the attributes get pushed out of the inode, but by keeping > most everything contiguous I simply avoid the problem. Until aging has degraded your filesystem until free space is sufficiently fragmented that you can't allocate large extents any more. Then you are completely screwed. :/ Cheers, Dave.
-- Dave Chinner david@fromorbit.com
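The distinction Phil draws, the real fallocate(2) versus the POSIX fallback that writes zeroes, can be sketched as follows. This is illustrative Linux-specific code, not from the thread; it calls fallocate(2) directly via ctypes so that an unsupported filesystem simply returns EOPNOTSUPP instead of degrading to a zero-fill:

```python
import ctypes, errno, os, tempfile

# dlopen(NULL) exposes the C library symbols of the running process on Linux.
libc = ctypes.CDLL(None, use_errno=True)
# off_t is 64-bit on x86-64 Linux; declare the prototype so ctypes passes
# the offset and length arguments with the right width.
libc.fallocate.argtypes = (ctypes.c_int, ctypes.c_int,
                           ctypes.c_int64, ctypes.c_int64)

def preallocate(fd, length):
    """fallocate(2) with mode 0: allocate real extents (and extend i_size)
    without writing zeroes. Returns False, harmlessly, if the filesystem
    does not support it."""
    if libc.fallocate(fd, 0, 0, length) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS):
        return False          # caller just proceeds without preallocation
    raise OSError(err, os.strerror(err))

# Demo: preallocate 1 MiB in a scratch file.
with tempfile.NamedTemporaryFile() as f:
    if preallocate(f.fileno(), 1 << 20):
        print(os.fstat(f.fileno()).st_size)   # 1048576 where supported
```

This is the pattern a patched rsync would use: try the cheap syscall once, and fall back to ordinary writes on failure.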
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 2:54 ` Dave Chinner @ 2011-06-03 22:28 ` Phil Karn 2011-06-04 3:12 ` Dave Chinner 0 siblings, 1 reply; 25+ messages in thread From: Phil Karn @ 2011-06-03 22:28 UTC (permalink / raw) To: Dave Chinner; +Cc: Paul Anderson, Linux fs XFS On 6/2/11 7:54 PM, Dave Chinner wrote: > There are definitely cases where it helps for preventing > fragmenting, but as a sweeping generalisation it is very, very > wrong. Well, if I ever see that in practice I'll change my procedures. > Do you do that for temporary object files when you build <program X> > from source? No, that would involve patching gcc to use fallocate(). I could be wrong -- I don't know much about gcc internals -- but I think most temp files go on /tmp, which is not xfs. As I clearly said, I patched only a few file copy programs like rsync that I use to create long-lived files. I can't see why the upstream maintainers of those programs shouldn't accept patches to incorporate fallocate() as long as care is taken to avoid calling the POSIX version and no other harm is done on file systems or OSes that don't support it. > Allocation and freeing has CPU overhead, transaction overhead, log > space overhead, can cause free space fragmentation when you have a > mix of short- and long-lived files being preallocated at the same > time, IO for long lived data does not get packed together closely so > requires more seeks to issue which leads to significantly worse IO > performance on RAID5/6 storage sub-systems, etc. I'll believe that when I see it. Like a lot of people I am moving away from RAID 5/6. It is hard to see how keeping files contiguous can lead to free space fragmentation. Seems to me that when a file is severely fragmented, so is the free space around it. Keeping a file contiguous also keeps free space in fewer, larger pieces. > You do realise that your "attr out of line" problem would have gone > away by simply increasing the XFS inode size at mkfs time? 
And that > there is almost no performance penalty for doing this? Instead, it > seems you found a hammer named fallocate() and proceeded to treat > every tool you have like a nail. :) You do realize that I started experimenting with attributes well *after* I had built XFS on a 6 GB (net) RAID5 that took over a week of solid copying to load to 50%? I had noticed the inode size parameter to mkfs.xfs but I wasn't about to buy four more disks, mkfs a whole new file system with bigger inodes and copy all my data (again) just to waste more space on largely empty inodes and, more importantly, require many more disk seeks and reads to walk through them all. The default xfs inode is 256 bytes. That means a single 4KiB block read fetches 16 inodes at once. Making each inode 512 bytes means reading only 8 inodes in each 4KiB block. That's arithmetic. And I'd still have no guarantee of keeping my attributes in the inodes without some limit on the size of the extent list. > Changing a single mkfs parameter is far less work than maintaining > your own forks of multiple tools.... See above. I've since built a new RAID1 array with bigger and faster drives and am abandoning RAID5, but I still see no reason to waste disk space and seeks on larger data structures that are mostly empty space. A long extent table contains overhead information that is useless -- noise -- to me, the user. Defragmenting a file discards that information and allows more of the disk's storage and I/O capacity to be used for user data. The only drawback I can see to keeping a file system defragmented is that I give up an opportunity for steganography, i.e., hiding information in the locations and sizes of those seemingly random sequences of extent allocations. I know this has been done. > Until aging has degraded your filesystem til free space is > sufficiently fragmented that you can't allocate large extents any > more. Then you are completely screwed. 
:/ Once again, it is very difficult to see how keeping my long-lived files contiguous causes free space to become more fragmented, not less. Help me out here; it's highly counterintuitive, and more importantly I haven't seen that problem, at least not yet. I have a few extremely large files (many GB) that cannot be allocated a contiguous area. That's probably because of xfs's strategy of scattering files around disk to allow room for growth, which fragments the free space. But that's not a big problem since I don't have very many such files. Each extent is still pretty big, so sequential I/O is still quite fast, and if their attributes are squeezed out of their inodes it's not a big performance hit either. You seem to take personal offense at my use of fallocate(), which is hardly my intention. Did you perhaps write the xfs preallocation code that I'm bypassing? As I said, I still rely on it for log files, mailboxes and temporary files, and it is much appreciated. --Phil
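Phil's inode arithmetic from the message above is easy to check directly (block and inode sizes as given in the thread; mkfs.xfs's inode size is what is being varied):

```python
BLOCK = 4096  # bytes per 4 KiB filesystem block

def inodes_per_block(inode_size):
    """How many inodes a single 4 KiB block read fetches at once."""
    return BLOCK // inode_size

print(inodes_per_block(256))  # default XFS inode size -> 16
print(inodes_per_block(512))  # doubled inode size     -> 8
```

As Dave points out in the reply that follows, XFS does not actually perform inode IO one block at a time, so this ratio says less about seek cost than it appears to.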
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 22:28 ` Phil Karn @ 2011-06-04 3:12 ` Dave Chinner 0 siblings, 0 replies; 25+ messages in thread From: Dave Chinner @ 2011-06-04 3:12 UTC (permalink / raw) To: Phil Karn; +Cc: Paul Anderson, Linux fs XFS On Fri, Jun 03, 2011 at 03:28:54PM -0700, Phil Karn wrote: > On 6/2/11 7:54 PM, Dave Chinner wrote: > > > There are definitely cases where it helps for preventing > > fragmenting, but as a sweeping generalisation it is very, very > > wrong. > > Well, if I ever see that in practice I'll change my procedures. > > > Do you do that for temporary object files when you build <program X> > > from source? > > No, that would involve patching gcc to use fallocate(). I could be wrong > -- I don't know much about gcc internals -- but I think most temp files > go on /tmp, which is not xfs. As I clearly said, I patched only a few > file copy programs like rsync that I use to create long-lived files. I > can't see why the upstream maintainers of those programs shouldn't > accept patches to incorporate fallocate() as long as care is taken to > avoid calling the POSIX version and no other harm is done on file > systems or OSes that don't support it. They are trying, but, well, the file corruption problems seen on 2.6.38/.39 kernels that are the result of them using fiemap/fallocate don't inspire me with confidence.... > > away by simply increasing the XFS inode size at mkfs time? And that > > there is almost no performance penalty for doing this? Instead, it > > seems you found a hammer named fallocate() and proceeded to treat > > every tool you have like a nail. :) > > You do realize that I started experimenting with attributes well *after* > I had built XFS on a 6 GB (net) RAID5 that took over a week of solid > copying to load to 50%? 
I had noticed the inode size parameter to > mkfs.xfs but I wasn't about to buy four more disks, mkfs a whole new > file system with bigger inodes and copy all my data (again) just to > waste more space on largely empty inodes and, more importantly, require > many more disk seeks and reads to walk through them all. > > The default xfs inode is 256 bytes. That means a single 4KiB block read > fetches 16 inodes at once. Making each inode 512 bytes means reading > only 8 inodes in each 4KiB block. That's arithmetic. XFS does not do inode IO like that, so your logic is flawed. Firstly, inodes are read and written in clusters of 8k, and contiguous inode clusters are merged during IO by the elevator. Metadata blocks are heavily sorted before being issued for writeback, so we get excellent large IO patterns even for metadata IO. Under heavy file create workloads, I'm seeing XFS consistently write metadata to disk in 320k IOs - the maximum IO size my storage subsystem will allow. e.g. a couple of instructive graphs from Chris Mason for a parallel file create workload: http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.png http://oss.oracle.com/~mason/seekwatcher/fs_mark/xfs.ogg The fact that ~5000 IOPS is being sustained with only 30-100 seeks/s indicates that the elevator is merging roughly 50-100 individual IOs together into each physical IO. This will happen regardless of inode size, so inode/metadata writeback under these workloads tends to be limited by bandwidth, not IOPS.... Reads might be a bit more random, but due to inodes being allocated in larger chunks (64 inodes at a time) and temporal locality effects due to sequential allocation by apps like rsync, reads typically occur to localised areas as well and hit track caches or RAID controller readahead windows. > And I'd still have no guarantee of keeping my attributes in the inodes > without some limit on the size of the extent list.
Going from 256 -> 512 byte inodes gives you 256 bytes more space for attributes and extents, which in your case would be entirely for data extents. In that space you can fit another 16 extent records, which is more than enough for 99.9% of normal files. > > Changing a single mkfs parameter is far less work than maintaining > > your own forks of multiple tools.... > > See above. I've since built a new RAID1 array with bigger and faster > drives and am abandoning RAID5, but I still see no reason to waste disk > space and seeks on larger data structures that are mostly empty space. Well, if you think that inodes are taking too much space, then I guess you'd be really concerned about the amount of space that directories consume and how badly they get fragmented ;) > > Until aging has degraded your filesystem until free space is > sufficiently fragmented that you can't allocate large extents any > more. Then you are completely screwed. :/ > > Once again, it is very difficult to see how keeping my long-lived files > contiguous causes free space to become more fragmented, not less. Help > me out here; it's highly counterintuitive, and more importantly I > haven't seen that problem, at least not yet. Initial allocations are done via the "allocate near" algorithm. It starts by finding the largest freespace extent that will hold the allocation via a -size- match, i.e. it will look for a match on the size you are asking for. If there isn't a free space extent large enough, it will fall back to searching for a large enough extent near to where you are asking with an increasing search radius. Once a free space extent is found, it then trims it for alignment to stripe unit/stripe width. This generally leaves small, isolated chunks of free space behind, as allocations are typically not stripe unit/width length. Hence you end up with lots of little holes around.
Subsequent sequential allocations use an exact block allocation target to try to extend the contiguous allocation each file does. For large files, this tends to keep the files contiguous, or at least with multiple large extents rather than lots of small extents. Then things like unrelated metadata allocations will tend to fill those little holes, be it inodes, btree blocks, directory blocks or attributes. If there aren't little holes (or you aren't using alignment), they will simply sit between data extents. When you then free the allocated data space, you've still got that unrelated metadata lying around, and the free space is now somewhat fragmented. This pattern gets worse as the filesystem ages. Delayed allocation reduces the impact of this problem because it reduces the number of on-disk metadata modifications that occur during normal operations. It also allows things like directory and inode extent allocation during creates (e.g. untarring) to avoid interleaving with data allocations, so directory and inode extents tend to cluster and be more contiguous and not fill holes between data extents. This means that you are less likely to get sparse metadata blocks fragmenting free space, metadata read and write IO is more likely to be clustered effectively (better IO performance), and so on. IOWs, there are many reasons why delayed allocation reduces the effects of filesystem aging compared to up-front preallocation.... > I have a few extremely large files (many GB) that cannot be allocated a > contiguous area. That's probably because of xfs's strategy of scattering > files around disk to allow room for growth, which fragments the free > space. I doubt it. An extent can be at most 8GB on a 4kB filesystem, so that's why you see multiple extents for large files, i.e. they require multiple allocations.... > You seem to take personal offense to my use of fallocate(), which is > hardly my intention. Nothing personal at all.
> Did you perhaps write the xfs preallocation code > that I'm bypassing? No. People much smarter than me designed and wrote all this stuff. What I'm commenting on is your implication (sweeping generalisation) that preallocation should be used everywhere because it seems to work for you. I don't like to let such statements stand unchallenged, especially when there are very good reasons why it is likely to be wrong. I don't do this for my benefit - and I don't really care if you benefit from it or not - but there's a lot of XFS users on this list that might be wondering "why isn't that done by default?". Those people learn a lot from someone trying to explain why something one person finds beneficial for their use case might be considered harmful to everyone else... Cheers, Dave. -- Dave Chinner david@fromorbit.com
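Dave's figure of 16 extra extent records follows from the on-disk extent record size; a quick check (the 16-byte record size is the on-disk XFS BMBT extent record, two big-endian 64-bit words):

```python
EXTENT_REC = 16  # bytes per on-disk XFS extent record

def extra_extent_records(old_inode, new_inode):
    """Extent records gained by growing the inode, with all of the extra
    literal area going to the data fork, as in Dave's example."""
    return (new_inode - old_inode) // EXTENT_REC

print(extra_extent_records(256, 512))  # -> 16
```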
* Re: I/O hang, possibly XFS, possibly general 2011-06-02 21:24 ` Paul Anderson 2011-06-02 23:59 ` Phil Karn @ 2011-06-03 22:19 ` Peter Grandi 2011-06-06 7:29 ` Michael Monnerie 1 sibling, 1 reply; 25+ messages in thread From: Peter Grandi @ 2011-06-03 22:19 UTC (permalink / raw) To: Linux fs XFS > All JBOD chassis (SuperMicro SC 847's)... been experimenting > with the flusher, will look at the others. I think that from the symptoms you describe the hang happens in the first instance because the number of dirty pages has hit 'dirty_background_ratio', after which all writes become synchronous and this really works badly, especially with XFS. To prevent that, and in general to prevent the accumulation of lots of dirty pages, and sudden, latency-killing large bursts of IO, it is quite important to tell the flusher to sync pretty often and constantly. The Linux kernel by default permits the buildup of a mass of dirty pages proportional to memory, which is a very bad idea, as it should be proportional to write speed, with the idea that one should not buffer more than 1 second or perhaps less of dirty pages. In your case that's probably a few hundred MBs, and even that is pretty bad in case of crashes. The sw solution is to set the 'vm/dirty_*' tunables accordingly.

vm/dirty_ratio=2
vm/dirty_bytes=400000000
vm/dirty_background_ratio=60
vm/dirty_background_bytes=0
vm/dirty_expire_centisecs=200
vm/dirty_writeback_centisecs=400

The hw solution is to do that *and* use SAS/SATA host adapters with (large) battery-backed buffers/cache (but still keeping very few dirty pages in the Linux page cache). I would not use them in hw RAID mode, also because so many hw RAID cards have abominably buggy firmware, and I trust Linux MD rather more. Unfortunately it is difficult to recommend any specific host adapter. > The rsync job currently appears to be causing the issue - it > was rsyncing around 250,000 files. If the copy had already > been done, the rsync is fast (i.e.
stat is fast, despite the > numbers), but when it starts moving data, the IOPS pegs and > seems to be the limiting factor. That's probably also some effect related to writing to the intent log, and a RAID60 makes that very painful. [ ... ] > We most likely live in different worlds - this is a pure > research group with "different" constraints than those you're > probably used to. Not my choice, but 4-10X the cost per unit > of storage is currently not an option. Then lots more smaller RAID5 sets, or even RAID6 if you are sufficiently desperate. Joined together at the namespace level, not with a RAID0. Do you really need a single free space pool? I doubt it: you probably are reading/generating data and storing it, so instead of having a single 200TB storage pool, you could have 20x10TB ones and fill one after the other. Also ideally much smaller RAID sets: 18 wide with double parity beckons a world of read-modify-write pain, especially if the metadata intent log is on the same logical block device. The MD maintainer thinks that for his much smaller needs putting the metadata intent logs on a speedy small RAID1 is good enough, but I think that scales a fair bit. After all the maximum log size for XFS is not that large (fortunately) and smaller is better. Having multiple smaller filesystems also helps with having multiple smaller metadata intent logs. > With XFS freshly installed, it was doing around 1400MiB/sec > write, and around 1900MiB/sec read - 10 parallel high > throughput processes reading or writing as fast as possible > (which actually is our use case). >> Also, your 1000MiB/s set probably is not full yet, so that's >> outer tracks only, and when it fills up, data gets into the >> inner tracks, and get a bit churned, then the real >> performances will "shine" through. > Yeah - overall, I expect it to drop - perhaps 50%? I dunno. > The particular filesystem being discussed is 80% full at the > moment.
That's then fairly realistic, as it is getting well into the inner tracks. Getting above 90% will cause trouble. [ ... ] >> But you seem to have 8-10GiB of dirty pages in your 192GiB >> system. Extraordinarily imaginative. > No, I do not want lots of dirty pages, however, I'm also aware > that if those are just data pages, it represents a few seconds > of system operation. Only if written entirely sequentially. IOPS in random and sequential are quite different. > All other approaches I am aware of cost more. I favor Lustre, > but the infrastructure costs alone for a 2-5PB system will > tend to be exceptional. Why? Lustre can run on your existing hw, and you need the network anyhow (unless you compute several TB on one host and store them on that host's disks, in which case you are lucky). >> [ ... ] is Lustre or one of its forks (or much simpler >> imitators like DPM), and that has its own downsides (it takes >> a lot of work), but a single large storage pool is almost >> never needed, at most a single large namespace, and that can >> be instantiated with an automounter (and Lustre/DPM/.... is >> in effect a more sophisticated automounter). > "It takes a lot of work" is another reason we aren't readily > able to go to other architectures, despite their many > advantages. Creating a 200TB volume and formatting it as XFS seems a quick thing to do now, but soon you will need to cope with the consequences. Setting up Lustre takes more at the beginning, but will handle your workload a lot better, and it handles much better having a lot of smaller independently fsck-able pools and highly parallel network operation. It handles small files not so well, so some kind of NFS server with XFS or better JFS for that would be nice. There is a high throughput genomic data system at the Sanger Institute in Cambridge UK based on Lustre and it might inspire you.
This is a relatively old post; it has been in production for a long time: http://threebit.net/mail-archive/bioclusters/msg00188.html http://www.slideshare.net/gcoates Alternatively a number of smaller XFS filesystems as suggested above, but you lose the extra integration/parallelism Lustre gives. [ ... ] > fsck happens in less than a day, It takes less than a day *if there is essentially no damage*, otherwise it might take weeks. > likewise rebuilding all RAIDs... But the impact on performance will be terrifying, and if you reduce resync speed, it will take much longer, and while it rebuilds further failures will be far more likely, and that will be a very long day. Also consider that you have a 7-wide RAID0 of RAID6 sets; if one of the RAID6 sets becomes much slower because of rebuild, odds are this will impact *all* IO because of the RAID0. If you are unlucky, you could end up with one of the RAID6 members of the RAID0 set being in rebuild quite a good percentage of the time. > backups are interesting - it is impossible in the old scenario > (our prior generation storage) - possible now due to higher > disk and network bandwidth. But many people forget that a backup is often the most stressful operation that can happen. > Keep in mind our ultimate backup is tissue samples. If you can regenerate the data, even if expensively, then avoid RAID6. Two 8+1 RAID5 sets are better than a 16+2 RAID6 set, and losing a bit more space, three 5+1 RAID5 sets (10TB each) are better still. The reasons are a much smaller RMW stripe width, the ability to do non-full-width RMW updates, much nicer rebuilds (1/2 or 1/3 of the drives would be slowed down). > 2TB drives are mandatory - there simply isn't enough available > space in the data center otherwise. Ah that's a pretty hard constraint then. > The bulk of the work is not small-file - almost all is large > files. Then perhaps put the large files on XFS or Lustre and the small files on JFS.
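The RAID trade-off Peter sketches above can be put in numbers; this is illustrative arithmetic only, assuming the thread's mandated 2TB drives:

```python
DRIVE_TB = 2  # the thread's 2TB drives

def raid_layout(sets, data, parity):
    """Drive count, usable capacity, and read-modify-write stripe width
    for `sets` arrays of data+parity drives each."""
    return {"drives": sets * (data + parity),
            "usable_tb": sets * data * DRIVE_TB,
            "rmw_stripe_width": data}

print(raid_layout(1, 16, 2))  # one 16+2 RAID6: widest RMW stripe
print(raid_layout(2, 8, 1))   # two 8+1 RAID5 sets: same usable capacity
print(raid_layout(3, 5, 1))   # three 5+1 RAID5 sets: 10TB each, smallest
                              # rebuild domain, at the cost of some space
```

The point of the comparison is that the narrower sets give the same or nearly the same usable capacity while a rebuild slows down only a third or a half of the drives.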
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 22:19 ` Peter Grandi @ 2011-06-06 7:29 ` Michael Monnerie 2011-06-07 14:09 ` Peter Grandi 0 siblings, 1 reply; 25+ messages in thread From: Michael Monnerie @ 2011-06-06 7:29 UTC (permalink / raw) To: xfs On Saturday, 4 June 2011 Peter Grandi wrote: > vm/dirty_ratio=2 > vm/dirty_bytes=400000000 > > vm/dirty_background_ratio=60 > vm/dirty_background_bytes=0 > > vm/dirty_expire_centisecs=200 > vm/dirty_writeback_centisecs=400 Why dirty_background_ratio=60? This would mean you start to write dirty pages only after it reaches 60% of total system memory... Setting it to =1 would be the thing you want I guess. Also, setting both dirty_background_(ratio|bytes) is not supported. The latter wins, according to sysctl/vm.txt. Similarly, dirty_ratio and dirty_bytes belong together and exclude each other. Maybe you specified both to fit older and newer kernels in one example? dirty_expire_centisecs at 200 means a sync every 2s, which might be good in this specific setup mentioned here, but not for a generic server. That would defeat XFS's in-memory grouping of blocks before writeout, and in case of many parallel (slow|ftp) uploads could lead to much more data fragmentation, or no? -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/
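Peter's rule of thumb, that the dirty-page ceiling should be proportional to write speed rather than memory size, roughly one second of IO, can be sketched as follows. The 400MB/s bandwidth figure is illustrative and chosen to reproduce the thread's dirty_bytes value:

```python
def dirty_limits(write_bw_bytes, window_s=1.0):
    """Cap dirty pages at roughly window_s seconds of array write
    bandwidth, with the background (flusher-start) threshold at half
    the hard limit, mirroring the kernel's ordering requirement."""
    dirty_bytes = int(write_bw_bytes * window_s)
    return dirty_bytes, dirty_bytes // 2

dirty, background = dirty_limits(400_000_000)  # ~400 MB/s array
print(f"vm.dirty_bytes={dirty}")
print(f"vm.dirty_background_bytes={background}")
```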
* Re: I/O hang, possibly XFS, possibly general 2011-06-06 7:29 ` Michael Monnerie @ 2011-06-07 14:09 ` Peter Grandi 2011-06-08 5:18 ` Dave Chinner 2011-06-08 8:32 ` Michael Monnerie 0 siblings, 2 replies; 25+ messages in thread From: Peter Grandi @ 2011-06-07 14:09 UTC (permalink / raw) To: Michael Monnerie; +Cc: xfs [ ... ] >> vm/dirty_ratio=2 >> vm/dirty_bytes=400000000 >> >> vm/dirty_background_ratio=60 >> vm/dirty_background_bytes=0 > Why dirty_background_ratio=60? This would mean you start to > write dirty pages only after it reaches 60% of total system > memory... Oops, invert 'dirty_background_*' and 'dirty_*', I was writing from memory and got it the wrong way round. These are BTW my notes in my 'sysctl.conf', with pointer to a nice discussion: # http://www.westnet.com/~gsmith/content/linux-pdflush.htm # dirty_ratio # If more than this percentage of active memory is unflushed then # *all* processes that are writing start writing synchronously. # dirty_background_ratio # If more than this percentage of active memory is unflushed the # system starts flushing. # dirty_expire_centisecs # How long a page can be dirty before it gets flushed. # dirty_writeback_centisecs # How often the flusher runs. # In 'mm/pagewriteback.c' there is code that makes sure that in effect # the 'dirty_background_ratio' must be smaller (half if larger or equal) # than the 'dirty_ratio', and other code to put lower limits on # 'dirty_writeback_centisecs' and whatever. > [ ... '*_bytes' and '*_ratio' Maybe you specified both to fit > older and newer kernels in one example? Yes. I had written what I thought was a much simpler/neater change here: http://www.sabi.co.uk/blog/0707jul.html#070701 but I currently put in both versions and let the better one win :-). >> vm/dirty_expire_centisecs=200 >> vm/dirty_writeback_centisecs=400 > dirty_expire_centisecs to 200 means a sync every 2s, which > might be good in this specific setup mentioned here, Not quite, see above. 
There are times when I think the values should be the other way round (run the flusher every 2s and flush pages dirty for more than 4s). > but not for a generic server. Uhmmm, I am not so sure. Because I think that flushes should be related to IO speed, and even on a smaller system 2 seconds of IO are a lot of data. Quite a few traditional Linux (and Unix) tunables are set to defaults from a time when hardware was much slower. I started using UNIX when there was no 'update' daemon, and I got into the habit which I still have of typing 'sync' explicitly every now and then, and then when 'update' was introduced to do 'sync' every 30s there was not a lot of data one could lose in those 30s. > That would defeat XFS's in-memory grouping of blocks before > writeout, and in case of many parallel (slow|ftp) uploads > could lead to much more data fragmentation, or no? Well, it depends on what "fragmentation" means here. It is a long-standing item of discussion. It is nice to see a 10GB file all in one extent, but is it *necessary*? As long as a file is composed of fairly large contiguous extents and they are not themselves widely scattered, things are going to be fine. What matters is the ratio of long seeks to data reads, and minimizing that is not the same as reducing seeks to zero. Now consider two common cases: * A file that is written out at speed, say 100-500MB/s. 2-4s means that there is an opportunity to allocate 200MB-2GB contiguous extents, and with any luck much larger ones. Conversely any larger intervals means potentially losing 200MB-2GB of data. Sure, if they did not want to lose the data the user process should be doing 'fdatasync()', but XFS in particular is sort of pretty good at doing a mild version of 'O_PONIES' where there is a balance between going as fast as possible (buffer a lot in memory) and offering *some* level of safety (as shown in the tests I did for a fair comparison with 'ext3'). * A file that is written slowly in small chunks.
Well, *nothing* will help that except preallocate or space reservations. Personally I'd rather have a file system design with space reservations (on detecting an append-like access pattern) and truncate-on-close than delayed allocation like XFS; while delayed allocation seems to work well enough in many cases, it is not quite "the more the merrier".
* Re: I/O hang, possibly XFS, possibly general 2011-06-07 14:09 ` Peter Grandi @ 2011-06-08 5:18 ` Dave Chinner 2011-06-08 8:32 ` Michael Monnerie 1 sibling, 0 replies; 25+ messages in thread From: Dave Chinner @ 2011-06-08 5:18 UTC (permalink / raw) To: Peter Grandi; +Cc: Michael Monnerie, xfs On Tue, Jun 07, 2011 at 03:09:09PM +0100, Peter Grandi wrote: > Personally I'd rather have a file system design with space > reservations (on detecting an append-like access pattern) and > truncate-on-close than delayed allocation like XFS; Welcome to the 1990s, Peter. XFS has been doing this for 15 years. It is an optimisation used by the delayed allocation mechanism, not a replacement for it. You might have heard the term "speculative preallocation" before - this is what it does. FYI, ext3 has a space reservation infrastructure to try to ensure contiguous allocation occurs without using delayed allocation. It doesn't work nearly as well as delayed allocation in ext4, btrfs or XFS... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: I/O hang, possibly XFS, possibly general 2011-06-07 14:09 ` Peter Grandi 2011-06-08 5:18 ` Dave Chinner @ 2011-06-08 8:32 ` Michael Monnerie 1 sibling, 0 replies; 25+ messages in thread From: Michael Monnerie @ 2011-06-08 8:32 UTC (permalink / raw) To: xfs; +Cc: Peter Grandi [-- Attachment #1.1: Type: Text/Plain, Size: 2430 bytes --] On Dienstag, 7. Juni 2011 Peter Grandi wrote: > * A file that is written out at speed, say 100-500MB/s. 2-4s > means that there is an opportunity to allocate 200MB-2GB > contiguous extents, and with any luck much larger ones. > Conversely any larger intervals means potentially losing > 200MB-2GB of data. Sure, if they did not want to lose the > data the user process should be doing 'fdatasync()', but XFS > in particular is sort of pretty good at doing a mild version > of 'O_PONIES' where there is a balance between going as fast > as possible (buffer a lot in memory) and offering some > level of safety (as shown in the tests I did for a fair > comparison with 'ext3'). On a PC, that "losing 2GB of data" is losing a single file under normal use. It's quite seldom that people are copying data around. And even if they are, when the crash happens they usually know what they just did, and restart the copy after a crash. If we speak about a server, normally there should be a HW RAID card in it with a good cache, and then it's true you should limit the Linux write cache and flush early and often, as the card has BBWC and therefore data is protected once it is in the RAID card. People tend to forget to set writeback lower when using RAID controllers + BBWC, and it's documented almost nowhere. Maybe good for a FAQ entry on XFS, even if it's not XFS specific? I wonder if there is a good document for "best practice" on VMs? I've never seen someone test a VMware/Xen host with 20 Linux VMs, and what the settings should be for vm.dirty* and net.ipv4.* values. 
I've seen crashes on VM servers where afterwards databases in the VMs were broken despite using a RAID card + BBWC... > * A file that is written slowly in small chunks. Well, > nothing will help that except preallocate or space > reservations. On a common webserver we run, there are as a guideline about 8 parallel uploads going on at any time. Most of them are slow, as people are on ADSL. If you sync quite often, you are lucky that XFS still gives you preallocation and all that. Otherwise, you'd have chunks of all files scattered on disk. -- mit freundlichen Grüssen, Michael Monnerie, Ing. BSc it-management Internet Services: Protéger http://proteger.at [gesprochen: Prot-e-schee] Tel: +43 660 / 415 6531 // Haus zu verkaufen: http://zmi.at/langegg/ [-- Attachment #1.2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 198 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
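Michael's advice to "set writeback lower" in front of a BBWC controller is easy to motivate with numbers. A sketch, assuming the common defaults vm.dirty_ratio=20 and vm.dirty_background_ratio=10 (check your kernel; the helper function is illustrative, not a kernel interface):

```python
# What the default vm.dirty_* percentages translate to on a big-memory
# box, such as the 192 GiB server from this thread.

def dirty_limits_gib(ram_gib, dirty_ratio=20, background_ratio=10):
    """Return (background-writeback, hard-throttle) thresholds in GiB."""
    return ram_gib * background_ratio / 100, ram_gib * dirty_ratio / 100

bg, hard = dirty_limits_gib(192)
print(f"background writeback starts at ~{bg:.1f} GiB of dirty pages")
print(f"writers are throttled at ~{hard:.1f} GiB of dirty pages")
```

With tens of GiB allowed to sit dirty in the page cache but only 1-3 GB of battery-backed cache on the card, most of that buffered data is protected by nothing at all, which is exactly Michael's point.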
* Re: I/O hang, possibly XFS, possibly general 2011-06-02 18:56 ` Peter Grandi 2011-06-02 21:24 ` Paul Anderson @ 2011-06-03 0:06 ` Phil Karn 1 sibling, 0 replies; 25+ messages in thread From: Phil Karn @ 2011-06-03 0:06 UTC (permalink / raw) To: Peter Grandi; +Cc: Linux fs XFS On 6/2/11 11:56 AM, Peter Grandi wrote: > Disclaimer: some smart people I know built knowingly a similar > and fortunately much smaller collection of RAID6 sets because > that was the least worst option for them, and since they know > that it will not fill up before they can replace it, they are > effectively short-stroking all those 2TB drives (I still would > have bought ERC ones if possible) so it's cooler than it looks. What do you mean by "short stroking"? That the data (and head motions) stay in one part of the disk? I haven't been using XFS that long and I'm no expert on it, but I've noticed that it seems to distribute files pretty evenly across an entire disk. Even without the inode64 option, only the inodes are kept at the beginning; the data can be anywhere. The only way I can think of to confine the activity on a lightly-loaded XFS file system to one part of a disk (e.g., to reduce average seek times and to stay in the faster outer area of the drive) is to create partitions that initially span only part of the disk, then grow them later as needed. Is that what you mean? _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
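To put Phil's question in numbers: "short stroking" means the working set stays in the outer (fastest) LBA range with short seeks, which falls out naturally on a mostly empty disk. A toy model with made-up zone rates, not datasheet figures:

```python
# Crude linear zone model: sustained transfer rate falls from the outer
# edge to the inner edge of a drive; using only the outer fraction of
# the LBA space avoids the slow zones entirely.

def worst_zone_rate(fill_fraction, outer_mb_s=150.0, inner_mb_s=75.0):
    """Rate (MB/s) at the innermost track actually used, given how much
    of the LBA space holds data (0.0 = empty, 1.0 = full)."""
    return outer_mb_s - (outer_mb_s - inner_mb_s) * fill_fraction

print(f"full disk, worst zone used: {worst_zone_rate(1.0):.1f} MB/s")
print(f"outer 30% only, worst zone: {worst_zone_rate(0.3):.1f} MB/s")
```

This is the sense in which a filesystem known never to fill up is "effectively short-stroking" its drives, with no special partitioning required.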
* Re: I/O hang, possibly XFS, possibly general 2011-06-02 14:42 I/O hang, possibly XFS, possibly general Paul Anderson 2011-06-02 16:17 ` Stan Hoeppner 2011-06-02 18:56 ` Peter Grandi @ 2011-06-03 0:42 ` Christoph Hellwig 2011-06-03 1:39 ` Dave Chinner 2 siblings, 1 reply; 25+ messages in thread From: Christoph Hellwig @ 2011-06-03 0:42 UTC (permalink / raw) To: Paul Anderson; +Cc: xfs-oss On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote: > This morning, I had a symptom of a I/O throughput problem in which > dirty pages appeared to be taking a long time to write to disk. > > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from > kernel.org - the basic workload was data intensive - concurrent large > NFS (with high metadata/low filesize), rsync/lftp (with low > metadata/high file size) all working in a 200TiB XFS volume on a > software MD raid0 on top of 7 software MD raid6, each w/18 drives. I > had mounted the filesystem with inode64,largeio,logbufs=8,noatime. A few comments on the setup before trying to analyze what's going on in detail. I'd absolutely recommend an external log device for this setup; that is, buy another two fast but small disks, or take two existing ones and use a RAID 1 for the external log device. This will speed up anything log intensive, which both NFS and rsync workloads do a lot of. Second, split the workloads into multiple volumes if you can, since you have two such different workloads, so that they don't interfere with each other. Third, a RAID0 on top of RAID6 volumes sounds like pretty much a worst case for almost any type of I/O. You end up doing even relatively small I/O to all of the disks in the worst case. 
I think you'd be much better off with a simple linear concatenation of the RAID6 devices, even if you have to split them into multiple filesystems. > The specific symptom was that 'sync' hung, a dpkg command hung > (presumably trying to issue fsync), and experimenting with "killall > -STOP" or "kill -STOP" of the workload jobs didn't let the system > drain I/O enough to finish the sync. I probably did not wait long > enough, however. It really sounds like you're simply killing the MD setup with a lot of log I/O that goes to all the devices. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
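Christoph's objection to RAID0-over-RAID6 can be counted out in drives. A deliberately simple worst-case sketch, not a simulator:

```python
# Drives involved in one unlucky small write. Striping across the member
# arrays means an I/O spanning the RAID0 stripe touches every array, and
# each sub-stripe write to a RAID6 member forces a parity
# read-modify-write there. A linear concat keeps the write on one array.

def drives_touched(layout, n_arrays=7, drives_per_array=18):
    if layout == "raid0_over_raid6":
        return n_arrays * drives_per_array  # worst case: spans all arrays
    if layout == "linear_concat":
        return drives_per_array  # one allocation group, one member array
    raise ValueError(layout)

print(drives_touched("raid0_over_raid6"))  # 7 x 18 drives potentially busy
print(drives_touched("linear_concat"))     # 18 drives busy
```

With the thread's 7 x 18-drive arrays, the worst-case spread is 126 drives versus 18, which is why XFS allocation groups over a linear concat isolate workloads so much better.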
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 0:42 ` Christoph Hellwig @ 2011-06-03 1:39 ` Dave Chinner 2011-06-03 15:59 ` Paul Anderson 0 siblings, 1 reply; 25+ messages in thread From: Dave Chinner @ 2011-06-03 1:39 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Paul Anderson, xfs-oss On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote: > On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote: > > This morning, I had a symptom of a I/O throughput problem in which > > dirty pages appeared to be taking a long time to write to disk. > > > > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from > > kernel.org - the basic workload was data intensive - concurrent large > > NFS (with high metadata/low filesize), rsync/lftp (with low > > metadata/high file size) all working in a 200TiB XFS volume on a > > software MD raid0 on top of 7 software MD raid6, each w/18 drives. I > > had mounted the filesystem with inode64,largeio,logbufs=8,noatime. > > A few comments on the setup before trying to analze what's going on in > detail. I'd absolutely recommend an external log device for this setup, > that is buy another two fast but small disks, or take two existing ones > and use a RAID 1 for the external log device. This will speed up > anything log intensive, which both NFS, and resync workloads are lot. > > Second thing if you can split the workloads into multiple volumes if you > have two such different workloads, so thay they don't interfear with > each other. > > Second a RAID0 on top of RAID6 volumes sounds like a pretty worst case > for almost any type of I/O. You end up doing even relatively small I/O > to all of the disks in the worst case. 
I think you'd be much better > off with a simple linear concatenation of the RAID6 devices, even if you > can split them into multiple filesystems > > > The specific symptom was that 'sync' hung, a dpkg command hung > > (presumably trying to issue fsync), and experimenting with "killall > > -STOP" or "kill -STOP" of the workload jobs didn't let the system > > drain I/O enough to finish the sync. I probably did not wait long > > enough, however. > > It really sounds like you're simply killloing the MD setup with a > log of log I/O that does to all the devices. And this is one of the reasons why I originally suggested that storage at this scale really should be using hardware RAID with large amounts of BBWC to isolate the backend from such problematic IO patterns. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 1:39 ` Dave Chinner @ 2011-06-03 15:59 ` Paul Anderson 2011-06-04 3:15 ` Dave Chinner 2011-06-04 8:14 ` Stan Hoeppner 0 siblings, 2 replies; 25+ messages in thread From: Paul Anderson @ 2011-06-03 15:59 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, xfs-oss On Thu, Jun 2, 2011 at 9:39 PM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote: >> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote: >> > This morning, I had a symptom of a I/O throughput problem in which >> > dirty pages appeared to be taking a long time to write to disk. >> > >> > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from >> > kernel.org - the basic workload was data intensive - concurrent large >> > NFS (with high metadata/low filesize), rsync/lftp (with low >> > metadata/high file size) all working in a 200TiB XFS volume on a >> > software MD raid0 on top of 7 software MD raid6, each w/18 drives. I >> > had mounted the filesystem with inode64,largeio,logbufs=8,noatime. >> >> A few comments on the setup before trying to analze what's going on in >> detail. I'd absolutely recommend an external log device for this setup, >> that is buy another two fast but small disks, or take two existing ones >> and use a RAID 1 for the external log device. This will speed up >> anything log intensive, which both NFS, and resync workloads are lot. >> >> Second thing if you can split the workloads into multiple volumes if you >> have two such different workloads, so thay they don't interfear with >> each other. >> >> Second a RAID0 on top of RAID6 volumes sounds like a pretty worst case >> for almost any type of I/O. You end up doing even relatively small I/O >> to all of the disks in the worst case. 
I think you'd be much better >> off with a simple linear concatenation of the RAID6 devices, even if you >> can split them into multiple filesystems >> >> > The specific symptom was that 'sync' hung, a dpkg command hung >> > (presumably trying to issue fsync), and experimenting with "killall >> > -STOP" or "kill -STOP" of the workload jobs didn't let the system >> > drain I/O enough to finish the sync. I probably did not wait long >> > enough, however. >> >> It really sounds like you're simply killloing the MD setup with a >> log of log I/O that does to all the devices. > > And this is one of the reasons why I originally suggested that > storage at this scale really should be using hardware RAID with > large amounts of BBWC to isolate the backend from such problematic > IO patterns. > Dave Chinner > david@fromorbit.com > Good HW RAID cards are on order - seems to be backordered at least a few weeks now at CDW. Got the batteries immediately. That will give more options for test and deployment. Not sure what I can do about the log - man page says xfs_growfs doesn't implement log moving. I can rebuild the filesystems, but for the one mentioned in this thread, this will take a long time. I'm guessing we'll need to split out the workload - aside from the differences in file size and use patterns, they also have fundamentally different values (the high metadata dataset happens to be high value relative to the low metadata/large file dataset). Paul _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 15:59 ` Paul Anderson @ 2011-06-04 3:15 ` Dave Chinner 2011-06-04 8:14 ` Stan Hoeppner 1 sibling, 0 replies; 25+ messages in thread From: Dave Chinner @ 2011-06-04 3:15 UTC (permalink / raw) To: Paul Anderson; +Cc: Christoph Hellwig, xfs-oss On Fri, Jun 03, 2011 at 11:59:02AM -0400, Paul Anderson wrote: > On Thu, Jun 2, 2011 at 9:39 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Jun 02, 2011 at 08:42:47PM -0400, Christoph Hellwig wrote: > >> On Thu, Jun 02, 2011 at 10:42:46AM -0400, Paul Anderson wrote: > >> > This morning, I had a symptom of a I/O throughput problem in which > >> > dirty pages appeared to be taking a long time to write to disk. > >> > > >> > The system is a large x64 192GiB dell 810 server running 2.6.38.5 from > >> > kernel.org - the basic workload was data intensive - concurrent large > >> > NFS (with high metadata/low filesize), rsync/lftp (with low > >> > metadata/high file size) all working in a 200TiB XFS volume on a > >> > software MD raid0 on top of 7 software MD raid6, each w/18 drives. I > >> > had mounted the filesystem with inode64,largeio,logbufs=8,noatime. > >> > >> A few comments on the setup before trying to analze what's going on in > >> detail. I'd absolutely recommend an external log device for this setup, > >> that is buy another two fast but small disks, or take two existing ones > >> and use a RAID 1 for the external log device. This will speed up > >> anything log intensive, which both NFS, and resync workloads are lot. > >> > >> Second thing if you can split the workloads into multiple volumes if you > >> have two such different workloads, so thay they don't interfear with > >> each other. > >> > >> Second a RAID0 on top of RAID6 volumes sounds like a pretty worst case > >> for almost any type of I/O. You end up doing even relatively small I/O > >> to all of the disks in the worst case. 
I think you'd be much better > >> off with a simple linear concatenation of the RAID6 devices, even if you > >> can split them into multiple filesystems > >> > >> > The specific symptom was that 'sync' hung, a dpkg command hung > >> > (presumably trying to issue fsync), and experimenting with "killall > >> > -STOP" or "kill -STOP" of the workload jobs didn't let the system > >> > drain I/O enough to finish the sync. I probably did not wait long > >> > enough, however. > >> > >> It really sounds like you're simply killloing the MD setup with a > >> log of log I/O that does to all the devices. > > > > And this is one of the reasons why I originally suggested that > > storage at this scale really should be using hardware RAID with > > large amounts of BBWC to isolate the backend from such problematic > > IO patterns. > > > Dave Chinner > > david@fromorbit.com > > > > Good HW RAID cards are on order - seems to be backordered at least a > few weeks now at CDW. Got the batteries immediately. > > That will give more options for test and deployment. > > Not sure what I can do about the log - man page says xfs_growfs > doesn't implement log moving. I can rebuild the filesystems, but for > the one mentioned in this theread, this will take a long time. Once you have BBWC, the log IO gets aggregated into stripe width writes to the back end (because it is always sequential IO), so it's generally not a significant problem for HW RAID subsystems. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general 2011-06-03 15:59 ` Paul Anderson 2011-06-04 3:15 ` Dave Chinner @ 2011-06-04 8:14 ` Stan Hoeppner 2011-06-04 10:32 ` Dave Chinner 1 sibling, 1 reply; 25+ messages in thread From: Stan Hoeppner @ 2011-06-04 8:14 UTC (permalink / raw) To: Paul Anderson; +Cc: Christoph Hellwig, xfs-oss On 6/3/2011 10:59 AM, Paul Anderson wrote: Hi Paul, When I first replied to this thread I didn't recognize your name, thus forgot our off-list conversation. Sorry bout that. > Good HW RAID cards are on order - seems to be backordered at least a > few weeks now at CDW. Got the batteries immediately. As I mentioned, the 9285-8E is very new product, but I didn't realize it was *that* new. Sorry you're having to wait for them. > That will give more options for test and deployment. Others have made valid points WRT the down sides of wide stripe parity arrays. I've mentioned many times I loathe parity RAID due to those reasons, and others, but it's mandatory in your case due to the reasons you previously stated. If such arguments are sufficiently convincing, and you can afford to lose the capacity of 2 more disks per chassis to parity, and increase complexity a bit, you may want to consider 3 x 7 drive RAID5 arrays per backplane, 6 drive stripe width, 18 total arrays concatenated, 216 AGs, 6 AGs per array, 216TB raw storage per server, if my math is correct. That instead of the concatenated 6 x 21 drive RAID6 arrays I previously mentioned. You'd have 3 arrays per backplane/cable and thus retain some isolation advantages for troubleshooting, with the same spares arrangement. 
Your overall resiliency, mathematical/theoretical anyway, to drive failure should actually increase slightly as you would have 3 drives per backplane worth of parity instead of 2, and array rebuild time would be ~1/3rd that of the 21 drive array, somewhat negating the dual parity advantage of RAID6 as the odds of drive failure during a rebuild tend to increase with the duration of the rebuild. > Not sure what I can do about the log - man page says xfs_growfs > doesn't implement log moving. I can rebuild the filesystems, but for > the one mentioned in this theread, this will take a long time. See the logdev mount option. Using two mirrored drives was recommended, I'd go a step further and use two quality "consumer grade", i.e. MLC based, SSDs, such as: http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive. > I'm guessing we'll need to split out the workload - aside from the > differences in file size and use patterns, they also have > fundamentally different values (the high metadata dataset happens to > be high value relative to the low metadata/large file dataset). LSI is touting significantly better parity performance for the 9265/9285 vs LSI's previous generation cards for which they claim peaks of ~2700 MB/s sequential read and ~1800 MB/s write. The new cards have double the cache of the previous, so I would think write performance would increase more than read. I'm really interested in seeing your test results with your workloads Paul. -- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general 2011-06-04 8:14 ` Stan Hoeppner @ 2011-06-04 10:32 ` Dave Chinner 2011-06-04 12:11 ` Stan Hoeppner 0 siblings, 1 reply; 25+ messages in thread From: Dave Chinner @ 2011-06-04 10:32 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote: > On 6/3/2011 10:59 AM, Paul Anderson wrote: > > Not sure what I can do about the log - man page says xfs_growfs > > doesn't implement log moving. I can rebuild the filesystems, but for > > the one mentioned in this theread, this will take a long time. > > See the logdev mount option. Using two mirrored drives was recommended, > I'd go a step further and use two quality "consumer grade", i.e. MLC > based, SSDs, such as: > > http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx > > Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive. If you are using delayed logging, then a pair of mirrored 7200rpm SAS or SATA drives would be sufficient for most workloads as the log bandwidth rarely gets above 50MB/s in normal operation. If you have fsync heavy workloads, or are not using delayed logging, then you really need to use the RAID5/6 device behind a BBWC because the log is -seriously- bandwidth intensive. I can drive >500MB/s of log throughput on metadata intensive workloads on 2.6.39 when not using delayed logging or I'm regularly forcing the log via fsync. You sure as hell don't want to be running a sustained long term write load like that on consumer grade SSDs..... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general 2011-06-04 10:32 ` Dave Chinner @ 2011-06-04 12:11 ` Stan Hoeppner 2011-06-04 23:10 ` Dave Chinner 0 siblings, 1 reply; 25+ messages in thread From: Stan Hoeppner @ 2011-06-04 12:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss On 6/4/2011 5:32 AM, Dave Chinner wrote: > On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote: >> On 6/3/2011 10:59 AM, Paul Anderson wrote: >>> Not sure what I can do about the log - man page says xfs_growfs >>> doesn't implement log moving. I can rebuild the filesystems, but for >>> the one mentioned in this theread, this will take a long time. >> >> See the logdev mount option. Using two mirrored drives was recommended, >> I'd go a step further and use two quality "consumer grade", i.e. MLC >> based, SSDs, such as: >> >> http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx >> >> Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive. > > If you are using delayed logging, then a pair of mirrored 7200rpm > SAS or SATA drives would be sufficient for most workloads as the log > bandwidth rarely gets above 50MB/s in normal operation. Hi Dave. I made the first reply to Paul's post, recommending he enable delayed logging as a possible solution to his I/O hang problem. I recommended this due to his mention of super heavy metadata operations at the time on his all md raid60 on plain HBA setup. Paul did not list delaylog when he submitted his 2.6.38.5 mount options: inode64,largeio,logbufs=8,noatime Being the author of the delayed logging code, I had expected you to comment on this, either expounding on my recommendation, or shooting it down, and giving the reasons why. So, would delayed logging have possibly prevented his hang problem or no? I always read your replies at least twice, and I don't recall you touching on delayed logging in this thread. 
If you did and I missed it, my apologies. Paul will have 3 of LSI's newest RAID cards with a combined 3GB BBWC to test with, hopefully soon. With that much cache, is an external log device still needed? With and/or without delayed logging enabled? > If you have fsync heavy workloads, or are not using delayed logging, > then you really need to use the RAID5/6 device behind a BBWC because > the log is -seriously- bandwidth intensive. I can drive >500MB/s of > log throughput on metadata intensive workloads on 2.6.39 when not > using delayed logging or I'm regularly forcing the log via fsync. > You sure as hell don't want to be running a sustained long term > write load like that on consumer grade SSDs..... Given that the max log size is 2GB, IIRC, and that most recommendations I've seen here are against using a log that big, I figure such MLC drives would be fine. AIUI, modern wear leveling will spread writes throughout the entire flash array before going back and overwriting the first sector. Published MTBF rates on most MLC drives are roughly equivalent to enterprise SRDs, 1+ million hours. Do you believe MLC based SSDs are simply never appropriate for anything but consumer use, and that only SLC devices should be used for real storage applications? AIUI SLC flash cells do have about a 10:1 greater lifetime than MLC cells. However, there have been a number of articles/posts demonstrating math which shows a current generation SandForce based MLC SSD, under a constant 100MB/s write stream, will run for 20+ years, IIRC, before sufficient live+reserved spare cells burn out to cause hard write errors, thus necessitating drive replacement. Under your 500MB/s load, assuming that's constant, the drives would theoretically last 4+ years. If that 500MB/s load was only for 12 hours each day, the drives would last 8+ years. I wish I had one of those articles bookmarked... 
-- Stan _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general 2011-06-04 12:11 ` Stan Hoeppner @ 2011-06-04 23:10 ` Dave Chinner 2011-06-05 1:31 ` Stan Hoeppner 0 siblings, 1 reply; 25+ messages in thread From: Dave Chinner @ 2011-06-04 23:10 UTC (permalink / raw) To: Stan Hoeppner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss On Sat, Jun 04, 2011 at 07:11:50AM -0500, Stan Hoeppner wrote: > On 6/4/2011 5:32 AM, Dave Chinner wrote: > > On Sat, Jun 04, 2011 at 03:14:53AM -0500, Stan Hoeppner wrote: > >> On 6/3/2011 10:59 AM, Paul Anderson wrote: > >>> Not sure what I can do about the log - man page says xfs_growfs > >>> doesn't implement log moving. I can rebuild the filesystems, but for > >>> the one mentioned in this theread, this will take a long time. > >> > >> See the logdev mount option. Using two mirrored drives was recommended, > >> I'd go a step further and use two quality "consumer grade", i.e. MLC > >> based, SSDs, such as: > >> > >> http://www.cdw.com/shop/products/Corsair-Force-Series-F40-solid-state-drive-40-GB-SATA-300/2181114.aspx > >> > >> Rated at 50K 4K write IOPS, about 150 times greater than a 15K SAS drive. > > > > If you are using delayed logging, then a pair of mirrored 7200rpm > > SAS or SATA drives would be sufficient for most workloads as the log > > bandwidth rarely gets above 50MB/s in normal operation. > > Hi Dave. I made the first reply to Paul's post, recommending he enable > delayed logging as a possible solution to his I/O hang problem. I > recommended this due to his mention of super heavy metadata operations > at the time on his all md raid60 on plain HBA setup. Paul did not list > delaylog when he submitted his 2.6.38.5 mount options: > > inode64,largeio,logbufs=8,noatime > > Being the author of the delayed logging code, I had expected you to > comment on this, either expounding on my recommendation, or shooting it > down, and giving the reasons why. > > So, would delayed logging have possibly prevented his hang problem or > no? 
I always read your replies at least twice, and I don't recall you > touching on delayed logging in this thread. If you did and I missed it, > my apologies. It might, but delayed logging is not the solution to every problem, and NFS servers are notoriously heavy on log forces due to COMMIT operations during writes. So it's a good bet that delayed logging won't fix the problem entirely. > > If you have fsync heavy workloads, or are not using delayed logging, > > then you really need to use the RAID5/6 device behind a BBWC because > > the log is -seriously- bandwidth intensive. I can drive >500MB/s of > > log throughput on metadata intensive workloads on 2.6.39 when not > > using delayed logging or I'm regularly forcing the log via fsync. > > You sure as hell don't want to be running a sustained long term > > write load like that on consumer grade SSDs..... > > Given that the max log size is 2GB, IIRC, and that most recommendations > I've seen here are against using a log that big, I figure such MLC > drives would be fine. AIUI, modern wear leveling will spread writes > throughout the entire flash array before going back and over writing the > first sector. Published MTBF on most MLC drives rates are roughly > equivalent to enterprise SRDs, 1+ million hours. > > Do you believe MLC based SSDs are simply never appropriate for anything > but consumer use, and that only SLC devices should be used for real > storage applications? AIUI SLC flash cells do have about a 10:1 greater > lifetime than MLC cells. However, there have been a number of > articles/posts demonstrating math which shows a current generation > SandForce based MLC SSD, under a constant 100MB/s write stream, will run > for 20+ years, IIRC, before sufficient live+reserved spare cells burn > out to cause hard write errors, thus necessitating drive replacement. > Under your 500MB/s load, assuming that's constant, the drives would > theoretically last 4+ years. 
If that 500MB/s load was only for 12 hours > each day, the drives would last 8+ years. I wish I had one of those > articles bookmarked... That's the theory, anyway. Let's call it an expected 4 year life cycle under this workload (which is highly optimistic, IMO). Now you have two drives in RAID1, that means one will fail in 2 years, or if you need more drives to sustain the performance the log needs (*) you might be looking at 4 or more drives, and that brings the expected failure rate down to under one drive per year. Multiply that across 5-10 servers, and that's a drive failure every month just on the log devices. That failure rate would make me extremely nervous - losing the log is a -major- filesystem corruption event - and make me want to spend more money or change the config to reduce the risk of a double failure causing the log device to be lost. Especially if there are hundreds of terabytes of data at risk. Cheers, Dave. (*) You have to consider that sustained workloads mean that the drives don't get idle time to trigger background garbage collection, which is one of the key features that current consumer level drives rely on for maintaining performance and even wear levelling. The "spare" area in the drives is kept small because it is assumed that there won't be long term sustained IO, so that the garbage collection can clean up before the spare area is exhausted. Enterprise drives have a much larger relative percentage of flash reserved as spare to avoid severe degradation in such sustained (common enterprise) workloads. Hence performance on consumer MLC drives tails off much more quickly than on SLC drives, may not be sustainable, and wear leveling may not be optimal, resulting in flash failure earlier than you expect. To maintain baseline performance, you'll need more MLC drives. And with more drives, the chance of failure goes up... 
-- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: I/O hang, possibly XFS, possibly general
  2011-06-04 23:10 ` Dave Chinner
@ 2011-06-05  1:31 ` Stan Hoeppner
  0 siblings, 0 replies; 25+ messages in thread
From: Stan Hoeppner @ 2011-06-05  1:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Paul Anderson, Christoph Hellwig, xfs-oss

On 6/4/2011 6:10 PM, Dave Chinner wrote:
> On Sat, Jun 04, 2011 at 07:11:50AM -0500, Stan Hoeppner wrote:
>> So, would delayed logging have possibly prevented his hang problem or
>> no? I always read your replies at least twice, and I don't recall you
>> touching on delayed logging in this thread. If you did and I missed it,
>> my apologies.
>
> It might, but delayed logging is not the solution to every problem,
> and NFS servers are notoriously heavy on log forces due to COMMIT
> operations during writes. So it's a good bet that delayed logging
> won't fix the problem entirely.

So the solution in this case will likely require a multi-pronged
approach, including XFS optimization, and the RAID card and/or RAID
level reconfiguration that has been mentioned.

>> Do you believe MLC based SSDs are simply never appropriate for anything
>> but consumer use, and that only SLC devices should be used for real
>> storage applications? AIUI SLC flash cells do have about a 10:1 greater
>> lifetime than MLC cells. However, there have been a number of
>> articles/posts demonstrating math which shows a current generation
>> SandForce based MLC SSD, under a constant 100MB/s write stream, will run
>> for 20+ years, IIRC, before sufficient live+reserved spare cells burn
>> out to cause hard write errors, thus necessitating drive replacement.
>> Under your 500MB/s load, assuming that's constant, the drives would
>> theoretically last 4+ years. If that 500MB/s load was only for 12 hours
>> each day, the drives would last 8+ years. I wish I had one of those
>> articles bookmarked...
>
> That's the theory, anyway. Let's call it an expected 4 year life
> cycle under this workload (which is highly optimistic, IMO).
> Now you have two drives in RAID1, that means one will fail in 2
> years, or if you need more drives to sustain the performance the
> log needs (*) you might be looking at 4 or more drives, and that
> brings the expected time between failures down under one year per
> drive. Multiply that across 5-10 servers, and that's a drive
> failure every month just on the log devices.

Very good point. I was looking at single system probabilities
instead of farm scale (shame on me for that newbish oversight).

> That failure rate would make me extremely nervous - losing the log
> is a -major- filesystem corruption event - and make me want to spend
> more money or change the config to reduce the risk of a double
> failure causing the log device to be lost. Especially if there are
> hundreds of terabytes of data at risk.
>
> Cheers,
>
> Dave.
>
> (*) You have to consider that sustained workloads mean that the
> drives don't get idle time to trigger background garbage collection,
> which is one of the key features that current consumer level drives
> rely on for maintaining performance and even wear levelling. The
> "spare" area in the drives is kept small because it is assumed that
> there won't be long term sustained IO, so that the garbage collection
> can clean up before the spare area is exhausted.
>
> Enterprise drives have a much larger relative percentage of flash in
> the drive reserved as spare to avoid severe degradation in such
> sustained (common enterprise) workloads. Hence performance on
> consumer MLC drives tails off much more quickly than on SLC drives.

Ahh, I didn't realize the SLC drives have much larger reserved areas.
Shame on me again. A hardwarefreak should know such things. :(

> Hence performance on consumer MLC drives may not be sustainable, and
> wear levelling may not be optimal, resulting in flash failure earlier
> than you expect. To maintain performance, you'll need more MLC
> drives to maintain baseline performance.
> And with more drives, the chance of failure goes up...

Are the enterprise SLC drives able to perform garbage collection etc.
while under such constant load? If not, is it always better to use
SRDs (spinning rust drives) for the log, either internal on a BBWC
array, or an external mirrored pair?

I previously mentioned I always read your posts twice. You are a
deep well of authoritative information and experience. Keep up the
great work and contribution to the knowledge base of this list.

--
Stan
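The double-failure risk Dave flags (losing both mirrors of the log device before the first failed drive is replaced) can also be put in rough numbers. This is a hedged sketch assuming independent exponential drive lifetimes and a fixed replacement window; both are simplifications, and the independence assumption is optimistic here, since identically loaded SSDs in a mirror wear in lockstep and may fail close together.

```python
import math

def p_double_failure_per_year(drive_life_years, rebuild_days):
    """Approximate annual probability that the surviving mirror of a
    two-drive RAID1 log fails during the replacement window after the
    first drive dies, assuming independent exponential lifetimes."""
    lam = 1.0 / drive_life_years          # failure rate per drive-year
    window_years = rebuild_days / 365.0
    # P(either mirror fails within the year) ...
    p_first = 1 - math.exp(-2 * lam)
    # ... times P(the other fails inside the replacement window).
    p_second = 1 - math.exp(-lam * window_years)
    return p_first * p_second

# 4-year expected life per log SSD, 1-day replacement window:
# on the order of a few in ten thousand per server per year.
p = p_double_failure_per_year(4.0, 1.0)
```

Even this optimistic model shows why the thread leans toward spending more on the log device: the per-server probability looks small, but it scales with fleet size, and the consequence of the event is major filesystem corruption rather than a routine rebuild.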
end of thread, other threads:[~2011-06-08 8:33 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-02 14:42 I/O hang, possibly XFS, possibly general Paul Anderson
2011-06-02 16:17 ` Stan Hoeppner
2011-06-02 18:56 ` Peter Grandi
2011-06-02 21:24 ` Paul Anderson
2011-06-02 23:59 ` Phil Karn
2011-06-03  0:39 ` Dave Chinner
2011-06-03  2:11 ` Phil Karn
2011-06-03  2:54 ` Dave Chinner
2011-06-03 22:28 ` Phil Karn
2011-06-04  3:12 ` Dave Chinner
2011-06-03 22:19 ` Peter Grandi
2011-06-06  7:29 ` Michael Monnerie
2011-06-07 14:09 ` Peter Grandi
2011-06-08  5:18 ` Dave Chinner
2011-06-08  8:32 ` Michael Monnerie
2011-06-03  0:06 ` Phil Karn
2011-06-03  0:42 ` Christoph Hellwig
2011-06-03  1:39 ` Dave Chinner
2011-06-03 15:59 ` Paul Anderson
2011-06-04  3:15 ` Dave Chinner
2011-06-04  8:14 ` Stan Hoeppner
2011-06-04 10:32 ` Dave Chinner
2011-06-04 12:11 ` Stan Hoeppner
2011-06-04 23:10 ` Dave Chinner
2011-06-05  1:31 ` Stan Hoeppner