* drastic changes to allocsize semantics in or around 2.6.38?
From: Marc Lehmann @ 2011-05-20 0:55 UTC
To: xfs

Hi!

I have "allocsize=64m" (or similar sizes, such as 1m, 16m, etc.) on many of my xfs filesystems, in an attempt to fight fragmentation on logfiles.

I am not sure about its effectiveness, but in 2.6.38 (but not in 2.6.32), this leads to very unexpected and weird behaviour, namely that files being written have semi-permanently allocated chunks of allocsize to them.

I realised this when I did a make clean and a make in a buildroot directory, which cross-compiles uclibc, gcc, and lots of other packages, leading to a lot of mostly small files. After a few minutes, the job stopped because it ate 180GB of disk space and the disk was full.

When I came back in the morning (about 8 hours later), the disk was still full, and investigation showed that even 3kb files were allocated the full 64m (as seen with du).

After I deleted some files to get some space and rebooted, I suddenly had 180GB of space again, so it seems an unmount "fixes" this issue.

I often do these kinds of builds, and I have had allocsize at these high values for a very long time, without ever having run into this kind of problem.

It seems that files get temporarily allocated much larger chunks (which is expected behaviour), but xfs doesn't free them until there is an unmount (which is unexpected).

Is this the desired behaviour? I would assume that any allocsize > 0 could lead to a lot of fragmentation if files that are closed and no longer in use always have extra space allocated for expansion for extremely long periods of time.
--
The choice of a Deliantra, the free code+content MORPG
http://www.deliantra.net

Marc Lehmann <schmorp@schmorp.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Dave Chinner @ 2011-05-20 2:56 UTC
To: Marc Lehmann; +Cc: xfs

On Fri, May 20, 2011 at 02:55:11AM +0200, Marc Lehmann wrote:
> Hi!
>
> I have "allocsize=64m" (or similar sizes, such as 1m, 16m, etc.) on many of my
> xfs filesystems, in an attempt to fight fragmentation on logfiles.
>
> I am not sure about its effectiveness, but in 2.6.38 (but not in 2.6.32),
> this leads to very unexpected and weird behaviour, namely that files being
> written have semi-permanently allocated chunks of allocsize to them.

The change that will be causing this was to how the preallocation is dropped. In normal use cases, the preallocation should be dropped when the file descriptor is closed. The change in 2.6.38 made this conditional on whether the inode had been closed multiple times while dirty. If the inode is closed (.release is called) multiple times while dirty, then the preallocation is not truncated away until the inode is dropped from the caches, rather than immediately on close. This prevents writes on NFS servers from doing excessive work and triggering excessive fragmentation, as the NFS server does an "open-write-close" for every write that comes across the wire.

This was also coupled with a change to the default speculative allocation behaviour to do more and larger speculative preallocation, and so in most cases remove the need for ever using the allocsize mount option. It dynamically increases the preallocation size as the file size increases, so small file writes behave like pre-2.6.38 without the allocsize mount option, and large file writes behave as though a large allocsize mount option were set, thereby preventing most known delayed allocation fragmentation cases from occurring.
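[Editor's note: the conditional trimming described above can be sketched as a small state machine. This is an illustrative model only, not the kernel code — the class and method names are invented for this sketch; the real logic lives around XFS's dirty-release tracking in the .release path.]

```python
# Illustrative model of the 2.6.38 release heuristic described above.
# Not kernel code: names and structure are invented for illustration.

class SpeculativeInode:
    def __init__(self):
        self.closed_dirty_before = False  # models the "closed dirty" flag
        self.prealloc = 0                 # bytes of speculative preallocation

    def release(self, dirty):
        """Last file descriptor closed (.release in VFS terms)."""
        if dirty and self.closed_dirty_before:
            # Repeated open-write-close (NFS-style): keep the preallocation
            # until the inode is evicted, to avoid truncate/realloc churn.
            return
        if dirty:
            self.closed_dirty_before = True
        self.prealloc = 0                 # normal case: trim on close

    def evict(self):
        """Inode dropped from the VFS cache (drop_caches, unmount, reclaim)."""
        self.prealloc = 0
```

Under this model, any workload that closes a file dirty more than once flips the flag, after which the preallocation survives until eviction — which would explain the behaviour Marc reports if his build closes files dirty repeatedly.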
> I realised this when I did a make clean and a make in a buildroot directory,
> which cross-compiles uclibc, gcc, and lots of other packages, leading to a
> lot of mostly small files.

So the question there: how is your workload accessing the files? Is it opening and closing them multiple times in quick succession after writing them? I think it is triggering the "NFS server access pattern" logic and so keeping speculative preallocation around for longer.

> After I deleted some files to get some space and rebooted, I suddenly had
> 180GB of space again, so it seems an unmount "fixes" this issue.
>
> I often do these kinds of builds, and I have had allocsize at these high
> values for a very long time, without ever having run into this kind of
> problem.
>
> It seems that files get temporarily allocated much larger chunks (which is
> expected behaviour), but xfs doesn't free them until there is an unmount
> (which is unexpected).

"echo 3 > /proc/sys/vm/drop_caches" should free up the space, as the preallocation will be truncated as the inodes are removed from the VFS inode cache.

> Is this the desired behaviour? I would assume that any allocsize > 0 could
> lead to a lot of fragmentation if files that are closed and no longer in
> use always have extra space allocated for expansion for extremely long
> periods of time.

I'd suggest removing the allocsize mount option - you shouldn't need it anymore because the new default behaviour resists fragmentation a whole lot better than pre-2.6.38 kernels.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Marc Lehmann @ 2011-05-20 15:49 UTC
To: Dave Chinner; +Cc: xfs

On Fri, May 20, 2011 at 12:56:59PM +1000, Dave Chinner <david@fromorbit.com> wrote:
[thanks for the thorough explanation]

> So the question there: how is your workload accessing the files? Is
> it opening and closing them multiple times in quick succession after
> writing them?

I don't think so, but of course, when compiling a file, it will be linked afterwards, so I guess it would be accessed at least once.

> I think it is triggering the "NFS server access pattern" logic and so
> keeping speculative preallocation around for longer.

Longer meaning practically infinitely :)

> I'd suggest removing the allocsize mount option - you shouldn't need
> it anymore because the new default behaviour resists fragmentation a
> whole lot better than pre-2.6.38 kernels.

I did remove it already, and will actively try this on our production servers, which suffer from severe fragmentation (though xfs_fsr fixes that, with some work (suspending the logfile writing), anyway).

However, I would suggest that whatever heuristic 2.6.38 uses is deeply broken at the moment, as NFS was not involved here at all (so no need for it), the usage pattern was a simple compile-then-link pattern (which is very common), and there is really no need to cache this preallocation for files that were closed 8 hours ago and never touched since then.
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Dave Chinner @ 2011-05-21 0:45 UTC
To: Marc Lehmann; +Cc: xfs

On Fri, May 20, 2011 at 05:49:20PM +0200, Marc Lehmann wrote:
> On Fri, May 20, 2011 at 12:56:59PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> [thanks for the thorough explanation]
>
> > So the question there: how is your workload accessing the files? Is
> > it opening and closing them multiple times in quick succession after
> > writing them?
>
> I don't think so, but of course, when compiling a file, it will be linked
> afterwards, so I guess it would be accessed at least once.

Ok, I'll see if I can reproduce it locally.

> > I think it is triggering the "NFS server access pattern" logic and so
> > keeping speculative preallocation around for longer.
>
> Longer meaning practically infinitely :)

No, longer meaning the in-memory lifecycle of the inode.

> > I'd suggest removing the allocsize mount option - you shouldn't need
> > it anymore because the new default behaviour resists fragmentation a
> > whole lot better than pre-2.6.38 kernels.
>
> I did remove it already, and will actively try this on our production
> servers, which suffer from severe fragmentation (though xfs_fsr fixes that,
> with some work (suspending the logfile writing), anyway).

Log file writing - append-only workloads - is one case where the dynamic speculative preallocation can make a significant difference.

> However, I would suggest that whatever heuristic 2.6.38 uses is deeply
> broken at the moment,

One bug report two months after general availability != deeply broken.

> as NFS was not involved here at all (so no need for
> it), the usage pattern was a simple compile-then-link pattern (which is
> very common),

While using a large allocsize mount option, which is relatively rare.
Basically, you've told XFS to optimise allocation for large files and then are running workloads with lots of small files. It's no surprise that there are issues, and you don't need the changes in 2.6.38 to get bitten by this problem....

> and there is really no need to cache this preallocation for
> files that were closed 8 hours ago and never touched since then.

If the preallocation was the size of the dynamic behaviour, you wouldn't even have noticed this. So really what you are saying is that it is excessive for your current configuration and workload.

If I can reproduce it, I'll have a think about how to tweak it better for allocsize filesystems. However, I'm not going to start adding lots of workload-dependent tweaks to this code - the default behaviour is much better, and in most cases removes the problems that led to using allocsize in the first place. So removing allocsize from your config is, IMO, the correct fix, not tweaking heuristics in the code...

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Marc Lehmann @ 2011-05-21 1:36 UTC
To: Dave Chinner; +Cc: xfs

On Sat, May 21, 2011 at 10:45:44AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > Longer meaning practically infinitely :)
>
> No, longer meaning the in-memory lifecycle of the inode.

That makes no sense - if I have twice the memory, I suddenly have half (or some other factor) the free disk space.

The lifetime of the preallocated area should be tied to something sensible, really - all that xfs has now is a broken heuristic that ties the wrong statistic to the extra space allocated. Or in other words, tying the amount of preallocation to the amount of free ram (for the inode) is not a sensible heuristic.

> log file writing - append-only workloads - is one case where the dynamic
> speculative preallocation can make a significant difference.

That's absolutely fantastic, as that will apply to a large range of files that are problematic (while xfs performs really well in most cases).

> > However, I would suggest that whatever heuristic 2.6.38 uses is deeply
> > broken at the moment,
>
> One bug report two months after general availability != deeply
> broken.

That makes no sense - I only found out about this broken behaviour because I specified a large allocsize manually, which is rare. However, the behaviour happens even without that, but might not be immediately noticeable (how would you find out that you lost a few gigabytes of disk space unless the disk runs full? most people would have no clue where to look). Just because the breakage is not obviously visible doesn't mean it's not deeply broken.

Also, I just looked more thoroughly through the list - the problem has been reported before, but was basically ignored, so you are wrong in that there is only one report.
> While using a large allocsize mount option, which is relatively
> rare. Basically, you've told XFS to optimise allocation for large
> files and then are running workloads with lots of small files.

The allocsize isn't "optimise for large files", it's to reduce fragmentation. 64MB is _hardly_ a big size for logfiles. Note also that the breakage occurs with smaller allocsize values as well; it's just less obvious.

All you do right now is make up fantasy reasons to ignore this report - the problem applies to any allocsize, and, unless xfs uses a different heuristic for dynamic preallocation, even without the mount option.

> It's no surprise that there are issues, and you don't need the changes
> in 2.6.38 to get bitten by this problem....

Really? I do know (by measuring it) that older kernels do not have this problem, and you basically said the same thing, namely that there was a behaviour change. If your goal is to argue to yourself that the breakage has to stay, that's fine, but don't expect users (like me) to follow your illogical train of thought.

> > and there is really no need to cache this preallocation for
> > files that were closed 8 hours ago and never touched since then.
>
> If the preallocation was the size of the dynamic behaviour, you
> wouldn't even have noticed this.

Maybe; it certainly is a lot less noticeable. But the new xfs behaviour basically means you have less space (potentially a lot less) on your disk when you have more memory, and that disk space is lost indefinitely just because I have some free ram. This is simply not a sensible heuristic - more ram must not mean that potentially large amounts of disk space are lost forever (if you have enough ram).

> So really what you are saying is that it is excessive for your current
> configuration and workload.

No, what I am saying is that the heuristic is simply buggy - it ties one value (available ram for cache) to a completely unrelated one (amount of free space used for preallocation).
It also doesn't only happen in my workload.

> better for allocsize filesystems. However, I'm not going to start
> adding lots of workload-dependent tweaks to this code - the default
> behaviour is much better, and in most cases removes the problems that
> led to using allocsize in the first place. So removing allocsize
> from your config is, IMO, the correct fix, not tweaking heuristics in
> the code...

I am fine with not using allocsize if the fragmentation problems in xfs (for append-only cases) have been improved. But you said the heuristic applies regardless of whether allocsize was specified or not.
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Dave Chinner @ 2011-05-21 3:15 UTC
To: Marc Lehmann; +Cc: xfs

On Sat, May 21, 2011 at 03:36:04AM +0200, Marc Lehmann wrote:
> On Sat, May 21, 2011 at 10:45:44AM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > Longer meaning practically infinitely :)
> >
> > No, longer meaning the in-memory lifecycle of the inode.
>
> That makes no sense - if I have twice the memory, I suddenly have half (or
> some other factor) the free disk space.
>
> The lifetime of the preallocated area should be tied to something sensible,
> really - all that xfs has now is a broken heuristic that ties the wrong
> statistic to the extra space allocated.

So, instead of tying it to the lifecycle of the file descriptor, it gets tied to the lifecycle of the inode. There isn't much in between those that can be easily used. When your workload spans hundreds of thousands of inodes and they are cached in memory, switching to the inode life-cycle heuristic works better than anything else that has been tried. One of those cases is large NFS servers, and the changes made in 2.6.38 are intended to improve performance on NFS servers by switching to the inode life-cycle to control speculative preallocation.

As it is, regardless of this change, we already have pre-existing circumstances where speculative preallocation is controlled by the inode life-cycle - inodes with manual preallocation (e.g. fallocate) and append-only files - so this problem of allocsize causing premature ENOSPC raises its head every couple of years regardless of whether there have been any recent changes or not. FWIW, I remember reading bug reports for Irix from 1998 about such problems w.r.t. manual preallocation. In all cases that I can remember, the problems went away with small configuration tweaks....
> > > However, I would suggest that whatever heuristic 2.6.38 uses
> > > is deeply broken at the moment,
> >
> > One bug report two months after general availability != deeply
> > broken.
>
> That makes no sense - I only found out about this broken behaviour
> because I specified a large allocsize manually, which is rare.
>
> However, the behaviour happens even without that, but might not be
> immediately noticeable (how would you find out that you lost a few
> gigabytes of disk space unless the disk runs full? most people
> would have no clue where to look).

If most people never notice it, and it reduces fragmentation and improves performance, then I don't see a problem. Right now the evidence points to "most people have not noticed it".

Just to point out what people do notice: when the dynamic functionality was introduced in 2.6.38-rc1, it had a bug in a calculation that resulted in 32-bit machines always preallocating 8GB extents. That was noticed _immediately_ and reported by several people independently. Once that bug was fixed, there were no further reports until yours. That tells me that the new default behaviour is not actually causing ENOSPC problems for most people.

I've already said I'll look into the allocsize interaction with the new heuristic you've reported, and told you how to work around the problem in the meantime. I can't do any more than that.

> Just because the breakage is not obviously visible doesn't mean it's not
> deeply broken.
>
> Also, I just looked more thoroughly through the list - the problem has
> been reported before, but was basically ignored, so you are wrong in that
> there is only one report.

I stand corrected. I get at least 1000-1500 emails a day and I occasionally forget/miss/delete one I shouldn't. Or maybe it was one I put down to the above bug.

Cheers,

Dave.
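[Editor's note: the dynamic sizing mentioned in this thread can be modelled roughly as below. This is an illustrative sketch only — the constants and function names are invented, the real kernel calculation differs, and the "buggy" variant merely models the reported symptom (32-bit machines always preallocating the maximum), not the actual arithmetic of the 2.6.38-rc1 bug.]

```python
# Rough model of dynamic speculative preallocation: the preallocation
# scales with the current file size, capped at a maximum, so small files
# get little extra space while streaming writes get large preallocations.
# Constants are illustrative, not the kernel's values.

MAX_EXTENT = 8 << 30     # hypothetical cap (~8GB)
MIN_FILE = 64 << 10      # hypothetical floor below which nothing is preallocated

def dynamic_prealloc(file_size):
    """Illustrative preallocation for a file of file_size bytes."""
    if file_size < MIN_FILE:
        return 0
    return min(file_size, MAX_EXTENT)   # roughly "double the file", capped

def buggy_prealloc_32bit(file_size):
    """Models only the symptom of the reported rc1 bug: every file,
    however small, ended up with the maximum-sized preallocation."""
    return MAX_EXTENT
```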
--
Dave Chinner
david@fromorbit.com
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Marc Lehmann @ 2011-05-21 4:16 UTC
To: Dave Chinner; +Cc: xfs

On Sat, May 21, 2011 at 01:15:37PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > The lifetime of the preallocated area should be tied to something sensible,
> > really - all that xfs has now is a broken heuristic that ties the wrong
> > statistic to the extra space allocated.
>
> So, instead of tying it to the lifecycle of the file descriptor, it
> gets tied to the lifecycle of the inode.

That's quite the difference, though - the former is in some relation to the actual in-use files, while the latter is in no relation to it.

> those that can be easily used. When your workload spans hundreds of
> thousands of inodes and they are cached in memory, switching to the
> inode life-cycle heuristic works better than anything else that has
> been tried.

The problem is that this is not anything like the normal case. It simply doesn't make any sense to preallocate disk space for files that are not in use and are unlikely to be in use again.

> One of those cases is large NFS servers, and the changes made in 2.6.38
> are intended to improve performance on NFS servers by switching to the
> inode life-cycle to control speculative preallocation.

It's easy to get some gains in special situations at the expense of normal ones - keep in mind that this optimisation makes little sense for non-NFS cases, which are the majority of use cases.

The problem here is that XFS doesn't get enough feedback in the case of an NFS server, which might open and close files much more often than local processes.
However, the solution to this is a better nfs server, not some dirty hacks in some filesystem code in the hope that it works in the special case of an NFS server, to the detriment of all other workloads, which give better feedback. This heuristic is just that: a bad hack to improve benchmarks in a special case.

The preallocation makes sense in relation to the working set, which can be characterised by the open files, or recently opened files. Tying it to the (in-memory) inode lifetime is an abysmal approximation to this. I understand that XFS does this to please a very suboptimal case - the NFS server code, which doesn't give you enough feedback on which files are open.

But keep in mind that in my case, XFS cached a large number of inodes that had been closed many hours before - and hadn't been accessed for many hours either. I have 8GB of ram, which is plenty, but not really an abnormal amount of memory.

If I unpack a large tar file, this means that I get a lot of (internal) fragmentation because all files are spread over a larger area than necessary, and disk space is used for a potentially indefinite time.

> > However, the behaviour happens even without that, but might not be
> > immediately noticeable (how would you find out that you lost a few
> > gigabytes of disk space unless the disk runs full? most people
> > would have no clue where to look).
>
> If most people never notice it, and it reduces fragmentation
> and improves performance, then I don't see a problem. Right now

Preallocation surely also increases fragmentation when it's never going to be used.

> the evidence points to "most people have not noticed it".

The problem with these statements is that they have no meaning. Most people don't even notice filesystem fragmentation - or corruption, or bugs in xfs_repair. If I apply your style of arguing, that means it's no big deal - most people don't even notice when a few files get corrupted, they will just reinstall their box.
And hey, who uses xfs_repair and notices some bugs in it? Sorry, but this kind of arguing makes no sense to me.

> 8GB extents. That was noticed _immediately_ and reported by several
> people independently. Once that bug was fixed, there were no
> further reports until yours. That tells me that the new default
> behaviour is not actually causing ENOSPC problems for most people.

You of course know well enough that ENOSPC was just one symptom, and that the real problem is allocating free disk space semi-permanently. Why do you bring up this strawman of ENOSPC?

> I've already said I'll look into the allocsize interaction with the
> new heuristic you've reported, and told you how to work around the
> problem in the meantime. I can't do any more than that.

The problem is that you are selectively ignoring facts to downplay this problem. That doesn't instill confidence; you really sound like "don't insult my toy allocation heuristic, I'll just ignore the facts and claim there is no problem lalala".

You simply ignore most of what I wrote - the problem is also clearly not the allocsize interaction, but the broken logic behind the heuristic - "NFS servers have bad access patterns, so we assume every workload is like an NFS server". It's simply wrong.

The heuristic clearly doesn't make sense with any normal workload, where files that were closed long ago will not be used. Heck, in most workloads, files that are closed will almost never be written to soon afterwards, simply because it is a common-sense optimisation not to do unnecessary operations.

If XFS contains dirty hacks that are meant for specific workloads only (to work around bad access patterns by NFS servers), then it would make sense to disable these so as not to hurt the common cases. And this heuristic clearly is just a hack to suit a specific need.
I know that, and I am sure you know that too, otherwise you wouldn't be hammering home the NFS server case :)

Hacking some NFS server access pattern heuristic into XFS is, however, just a workaround for that case, not a fix, or a sensible thing to do in the general case. I would certainly appreciate XFS having such hacks and heuristics, and would certainly try them out (having lots of NFS servers :), but it's clear that enforcing workarounds for uncommon cases at the expense of normal workloads is a bad idea, in general.

So please give this a bit of consideration: is it really worth keeping preallocation for files that are not used by anything on a computer, just to improve benchmark numbers for a client with bad access patterns (the NFS server code)?
* Re: drastic changes to allocsize semantics in or around 2.6.38?
From: Dave Chinner @ 2011-05-22 2:00 UTC
To: Marc Lehmann; +Cc: xfs

On Sat, May 21, 2011 at 06:16:52AM +0200, Marc Lehmann wrote:
> On Sat, May 21, 2011 at 01:15:37PM +1000, Dave Chinner <david@fromorbit.com> wrote:
> > > The lifetime of the preallocated area should be tied to something sensible,
> > > really - all that xfs has now is a broken heuristic that ties the wrong
> > > statistic to the extra space allocated.
> >
> > So, instead of tying it to the lifecycle of the file descriptor, it
> > gets tied to the lifecycle of the inode.
>
> That's quite the difference, though - the former is in some relation to
> the actual in-use files, while the latter is in no relation to it.
>
> > those that can be easily used. When your workload spans hundreds of
> > thousands of inodes and they are cached in memory, switching to the
> > inode life-cycle heuristic works better than anything else that has
> > been tried.
>
> The problem is that this is not anything like the normal case.

For you, maybe.

> It simply doesn't make any sense to preallocate disk space for files that
> are not in use and are unlikely to be in use again.

That's why the normal close case truncates it away. But there are other cases where we don't want this to happen.

> > One of those cases is large NFS servers, and the changes made in 2.6.38
> > are intended to improve performance on NFS servers by switching to the
> > inode life-cycle to control speculative preallocation.
>
> It's easy to get some gains in special situations at the expense of normal
> ones - keep in mind that this optimisation makes little sense for non-NFS
> cases, which are the majority of use cases.
XFS is used extensively in NAS products, from small $100 ARM/MIPS embedded NAS systems all the way up to high-end commercial NAS products. It is one of the main use cases we optimise XFS for.

> The problem here is that XFS doesn't get enough feedback in the case of
> an NFS server, which might open and close files much more often than local
> processes.
>
> However, the solution to this is a better nfs server, not some dirty hacks
> in some filesystem code in the hope that it works in the special case of
> an NFS server, to the detriment of all other workloads, which give better
> feedback.

Sure, that would be my preferred approach. However, if you followed the discussion when this first came up, you'd realise that we've been trying to get NFS server changes to fix this for the past 5 years, and I've just about given up trying. Hell, the NFS OFC (open file cache) proposal that would have mostly solved this (and other problems like readahead state thrashing) from 2-3 years ago went nowhere...

> This heuristic is just that: a bad hack to improve benchmarks in a special
> case.

It wasn't aimed at improving benchmark performance - these changes have been measured to reduce large file fragmentation in real-world workloads on the default configuration by at least an order of magnitude.

> The preallocation makes sense in relation to the working set, which can be
> characterised by the open files, or recently opened files.
> Tying it to the (in-memory) inode lifetime is an abysmal approximation to
> this.

So you keep saying, but you keep ignoring the fact that the inode cache represents the _entire_ working set of inodes. It's not an approximation - it is the _exact_ current working set of files we currently have.
Hence falling back to "preallocation lasts for as long as the inode is part of the working set" is an extremely good heuristic to use - we move from preallocation for only the L1 cache lifecycle (open fds) to using the L2 cache lifecycle (recently opened inodes) instead.

> If I unpack a large tar file, this means that I get a lot of (internal)
> fragmentation because all files are spread over a larger area than
> necessary, and disk space is used for a potentially indefinite time.

So you can reproduce this using a tar? Any details on size, # of files, the untar command, etc? How do you know you get internal fragmentation, and that it is affecting fragmentation? Please provide concrete examples (e.g. copy+paste the command lines and any relevant output) so that I might be able to reproduce your problem myself.

I don't really care what you think the problem is based on what you've read in this email thread, or for that matter how you think we should fix it. What I really want is your test cases that reproduce the problem so I can analyse it for myself. Once I understand what is going on, then we can talk about what the real problem is and how to fix it.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: drastic changes to allocsize semantics in or around 2.6.38? 2011-05-22 2:00 ` Dave Chinner @ 2011-05-22 7:59 ` Matthias Schniedermeyer 2011-05-23 1:20 ` Dave Chinner 2011-05-23 13:35 ` Marc Lehmann 1 sibling, 1 reply; 14+ messages in thread From: Matthias Schniedermeyer @ 2011-05-22 7:59 UTC (permalink / raw) To: Dave Chinner; +Cc: Marc Lehmann, xfs On 22.05.2011 12:00, Dave Chinner wrote: > > I don't really care what you think the problem is based on what > you've read in this email thread, or for that matter how you think > we should fix it. What I really want is your test cases that > reproduce the problem so I can analyse it for myself. Once I > understand what is going on, then we can talk about what the real > problem is and how to fix it. What would interest me is why the following creates files with large preallocations. cp -a <somedir> target rm -rf target cp -a <somedir> target After the first copy everything looks normal, `du` is about the original value. After the second run a `du` shows a much higher value, until the preallocation is shrunk away. Bis denn -- Real Programmers consider "what you see is what you get" to be just as bad a concept in Text Editors as it is in women. No, the Real Programmer wants a "you asked for it, you got it" text editor -- complicated, cryptic, powerful, unforgiving, dangerous.
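[Editorial note: the three-command reproduction above can be scripted so the two `du` figures are captured side by side. This is a hypothetical sketch - the directory layout and file sizes are invented, and it must be run with XFS_MNT pointing at an XFS mount with a large allocsize (e.g. allocsize=64m) to show the inflated second figure; on any other filesystem the two numbers should simply match.]

```shell
#!/bin/sh
# Sketch of the cp/rm/cp sequence from the report above. Point XFS_MNT
# at an XFS mount with allocsize=64m to observe the effect; the mktemp
# fallback is only so the script runs anywhere for illustration.
XFS_MNT=${XFS_MNT:-$(mktemp -d)}
SRC="$XFS_MNT/somedir"
DST="$XFS_MNT/target"
mkdir -p "$SRC"
for i in 1 2 3 4 5; do
    head -c 3072 /dev/zero > "$SRC/file$i"   # small ~3k files
done

cp -a "$SRC" "$DST"
first=$(du -sk "$DST" | cut -f1)    # normal usage after the first copy
rm -rf "$DST"
cp -a "$SRC" "$DST"                 # in-core inodes may be recycled here
second=$(du -sk "$DST" | cut -f1)   # inflated by leftover preallocation?
echo "du after first copy: ${first}K, after second copy: ${second}K"
```

On an affected kernel the second figure is much larger than the first until the preallocation is shrunk away; the bug analysis in the follow-up explains why the second copy behaves differently.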
* Re: drastic changes to allocsize semantics in or around 2.6.38? 2011-05-22 7:59 ` Matthias Schniedermeyer @ 2011-05-23 1:20 ` Dave Chinner 2011-05-23 9:01 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2011-05-23 1:20 UTC (permalink / raw) To: Matthias Schniedermeyer; +Cc: Marc Lehmann, xfs On Sun, May 22, 2011 at 09:59:55AM +0200, Matthias Schniedermeyer wrote: > On 22.05.2011 12:00, Dave Chinner wrote: > > > > I don't really care what you think the problem is based on what > > you've read in this email thread, or for that matter how you think > > we should fix it. What I really want is your test cases that > > reproduce the problem so I can analyse it for myself. Once I > > understand what is going on, then we can talk about what the real > > problem is and how to fix it. > > What would interest me is why the following creates files with large > preallocations. > > cp -a <somedir> target > rm -rf target > cp -a <somedir> target > > After the first copy everything looks normal, `du` is about the > original value. > > After the second run a `du` shows a much higher value, until the > preallocation is shrunk away. That's obviously a bug. It's also a simple test case that is easy to reproduce - exactly what I like in a bug report. ;) The inodes are being recycled off the reclaimable list in the second case i.e. we're short-circuiting the inode lifecycle and making it new again because it has been reallocated. The XFS_IDIRTY_RELEASE flag is not being cleared in this case, so we are not removing the speculative preallocation when the fd is closed for the second copy. The patch below fixes this. Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: clear inode dirty release flag when recycling it From: Dave Chinner <dchinner@redhat.com> The state used to track dirty inode release calls is not reset when an inode is reallocated and reused from the reclaimable state. 
This leads to speculative preallocation not being truncated away in the expected manner for local files until the inode is subsequently truncated, freed or cycles out of the cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iget.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index cb9b6d1..e75e757 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -241,6 +241,13 @@ xfs_iget_cache_hit(
 		 */
 		ip->i_flags |= XFS_IRECLAIM;
 
+		/*
+		 * clear the dirty release state as we are now effectively a
+		 * new inode and so we need to treat speculative preallocation
+		 * accordingly.
+		 */
+		ip->i_flags &= ~XFS_IDIRTY_RELEASE;
+
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
* Re: drastic changes to allocsize semantics in or around 2.6.38? 2011-05-23 1:20 ` Dave Chinner @ 2011-05-23 9:01 ` Christoph Hellwig 2011-05-24 0:20 ` Dave Chinner 0 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2011-05-23 9:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Marc Lehmann, xfs On Mon, May 23, 2011 at 11:20:34AM +1000, Dave Chinner wrote:
> 
> The state used to track dirty inode release calls is not reset when
> an inode is reallocated and reused from the reclaimable state. This
> leads to speculative preallocation not being truncated away in the
> expected manner for local files until the inode is subsequently
> truncated, freed or cycles out of the cache.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iget.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index cb9b6d1..e75e757 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -241,6 +241,13 @@ xfs_iget_cache_hit(
>  		 */
>  		ip->i_flags |= XFS_IRECLAIM;
> 
> +		/*
> +		 * clear the dirty release state as we are now effectively a
> +		 * new inode and so we need to treat speculative preallocation
> +		 * accordingly.
> +		 */
> +		ip->i_flags &= ~XFS_IDIRTY_RELEASE;

Btw, don't we need to clear even more flags here? To me it seems we need to clear XFS_ISTALE, XFS_IFILESTREAM and XFS_ITRUNCATED as well.
* Re: drastic changes to allocsize semantics in or around 2.6.38? 2011-05-23 9:01 ` Christoph Hellwig @ 2011-05-24 0:20 ` Dave Chinner 0 siblings, 0 replies; 14+ messages in thread From: Dave Chinner @ 2011-05-24 0:20 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Marc Lehmann, xfs On Mon, May 23, 2011 at 05:01:44AM -0400, Christoph Hellwig wrote:
> On Mon, May 23, 2011 at 11:20:34AM +1000, Dave Chinner wrote:
> > 
> > The state used to track dirty inode release calls is not reset when
> > an inode is reallocated and reused from the reclaimable state. This
> > leads to speculative preallocation not being truncated away in the
> > expected manner for local files until the inode is subsequently
> > truncated, freed or cycles out of the cache.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_iget.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index cb9b6d1..e75e757 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> > @@ -241,6 +241,13 @@ xfs_iget_cache_hit(
> >  		 */
> >  		ip->i_flags |= XFS_IRECLAIM;
> > 
> > +		/*
> > +		 * clear the dirty release state as we are now effectively a
> > +		 * new inode and so we need to treat speculative preallocation
> > +		 * accordingly.
> > +		 */
> > +		ip->i_flags &= ~XFS_IDIRTY_RELEASE;
> 
> Btw, don't we need to clear even more flags here? To me it seems we
> need to clear XFS_ISTALE, XFS_IFILESTREAM and XFS_ITRUNCATED as well.

XFS_ISTALE is cleared unconditionally at the end of the function, which means that any lookup on a stale inode will clear it. I'm not absolutely sure this is right now that I think about it, but that's a different issue. XFS_ITRUNCATED is mostly harmless, so it isn't a big issue, but we probably should clear it. I'm not sure what the end result of not clearing XFS_IFILESTREAM is, but you are right in that it should not pass through here, either. 
I'll respin the patch to clear all the state flags that hold sub-lifecycle state. Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: drastic changes to allocsize semantics in or around 2.6.38? 2011-05-22 2:00 ` Dave Chinner 2011-05-22 7:59 ` Matthias Schniedermeyer @ 2011-05-23 13:35 ` Marc Lehmann 2011-05-24 1:30 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Marc Lehmann @ 2011-05-23 13:35 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Sun, May 22, 2011 at 12:00:24PM +1000, Dave Chinner <david@fromorbit.com> wrote: > > The problem is that this is not anything like the normal case. > > For you, maybe. For the majority of boxes that use xfs - most desktop boxes are not heavy NFS servers. > > It's easy to get some gains in special situations at the expense of normal > > ones - keep in mind that this optimisation makes little sense for non-NFS > > cases, which is the majority of use cases. > > XFS is used extensively in NAS products, from small $100 ARM/MIPS > embedded NAS systems all the way up to high end commercial NAS > products. It is one of the main use cases we optimise XFS for. That's really sad - maybe people like me who use XFS on their servers should rethink that decision, if XFS mainly optimises for commercial NAS boxes only. You aren't serious, are you? > Sure, that would be my preferred approach. However, if you followed > the discussion when this first came up, you'd realise that we've > been trying to get NFS server changes to fix this operation for the > past 5 years, and I've just about given up trying. Hell, the NFS > OFC (open file cache) proposal that would have mostly solved this > (and other problems like readahead state thrashing) from 2-3 years > ago went nowhere... In other words, if you can't do it right, you make ugly broken hacks, and then tell people that it's expected behaviour, because xfs is optimised for commercial NFS server boxes. > > The preallocation makes sense in relation to the working set, which can be > > characterised by the open files, or recently opened files. 
> > Tying it to the (in-memory) inode lifetime is an abysmal approximation to > > this. > > So you keep saying, but you keep ignoring the fact that the inode > cache represents the _entire_ working set of inodes. It's not an > approximation - it is the _exact_ current working set of files we > currently have. I am sorry, but that is wrong and shows a serious lack of understanding. The cached inode set is just that, a cache. It is definitely not corresponding to any working set, simply because it is a *cache*. ls -l in a directory will cache all inodes, but that doesn't mean that those files are the working set 8 hours later. Open files are in the working set, because applications open files to use them. The inode cache probably contains stuff that was in the working set before, but is no longer. > Hence falling back to "preallocation lasts for as long as the inode > is part of the working set" is an extremely good heuristic to use - It's of course extremely broken, because all it does is improve the (fragmentation) performance for broken clients - for normal clients it will reduce performance of course. > we move from preallocation for only the L1 cache lifecycle (open > fd's) to using the L2 cache lifecycle (recently opened inodes) > instead. That comparison is seriously flawed, as a cache is transparent, but the xfs behaviour is not. > > If I unpack a large tar file, this means that I get a lot of (internal) > fragmentation because all files are spread over a larger area than > necessary, and diskspace is used for a potentially indefinite time. > > So you can reproduce this using a tar? Any details on size, # of > files, the untar command, etc? I can reproduce it simply by running make in the uclibc source tree. Since gas has the same access behaviour as tar, why would it be different? What kind of broken heuristic is it that XFS now uses that these two use cases would make a difference? 
> How do you know you get internal fragmentation and that it is affecting > fragmentation? If what you say is true, that's a logical conclusion, it doesn't need evidence, it follows from your claims. XFS can't preallocate for basically all files that are being written and at the same time avoid fragmentation. > Please provide concrete examples (e.g. copy+paste the command lines and > any relevant output) so I might be able to reproduce your problem > myself? "make" - I already told you in my first e-mail. > we should fix it. What I really want is your test cases that > reproduce the problem so I can analyse it for myself. Once I > understand what is going on, then we can talk about what the real > problem is and how to fix it. Being a good citizen wanting to improve XFS I of course delivered that in my first e-mail. Again, I used allocsize=64m and then made a buildroot build, which stopped after a few minutes because 180GB of disk space were gone. The disk space was all used up by the buildroot, which is normally a few gigabytes (after a successful build). I found that the uclibc object directory uses 50GB of space, about 8 hours after the compile - the object files were typically a few kb in size, but du showed 64mb of usage, even though nobody was using that file more than once, or ever after the make stopped. I am sorry, I think you are more interested in forcing your personal toy heuristic through reality - that's how you come across, because you selectively ignore the bits that you don't like. It's also pretty telling that XFS mainly optimises for commercial NAS boxes now, and no longer for good performance on local boxes. 
:( -- The choice of a Deliantra, the free code+content MORPG -----==- _GNU_ http://www.deliantra.net ----==-- _ generation ---==---(_)__ __ ____ __ Marc Lehmann --==---/ / _ \/ // /\ \/ / schmorp@schmorp.de -=====/_/_//_/\_,_/ /_/\_\
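[Editorial note: the "few-kb object files showing 64mb in du" observation above can be checked mechanically by comparing each file's apparent size to its on-disk usage. A hedged sketch follows - the 1 MiB threshold is an arbitrary illustrative choice, and `stat -c` assumes GNU coreutils.]

```shell
#!/bin/sh
# find_preallocated DIR - list regular files whose on-disk usage exceeds
# their apparent size by more than 1MiB, as leftover speculative
# preallocation would. Assumes GNU stat; %b counts 512-byte blocks.
find_preallocated() {
    find "${1:-.}" -type f | while read -r f; do
        apparent=$(stat -c %s "$f")
        used=$(( $(stat -c %b "$f") * 512 ))
        if [ $(( used - apparent )) -gt $(( 1024 * 1024 )) ]; then
            printf '%s: size=%sB used=%sB\n' "$f" "$apparent" "$used"
        fi
    done
}

# Example: a freshly written 3k file should not normally be listed.
demo=$(mktemp -d)
head -c 3072 /dev/zero > "$demo/small"
find_preallocated "$demo"
```

Run against the build tree (e.g. the uclibc object directory) this would enumerate exactly the files still carrying preallocation, which is more precise evidence than a single `du` total.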
* Re: drastic changes to allocsize semantics in or around 2.6.38? 2011-05-23 13:35 ` Marc Lehmann @ 2011-05-24 1:30 ` Dave Chinner 0 siblings, 0 replies; 14+ messages in thread From: Dave Chinner @ 2011-05-24 1:30 UTC (permalink / raw) To: Marc Lehmann; +Cc: xfs On Mon, May 23, 2011 at 03:35:48PM +0200, Marc Lehmann wrote: > On Sun, May 22, 2011 at 12:00:24PM +1000, Dave Chinner <david@fromorbit.com> wrote: > > > The problem is that this is not anything like the normal case. > > > > For you, maybe. > > For the majority of boxes that use xfs - most desktop boxes are not heavy > NFS servers. Desktops are not a use case we optimise XFS for. We make sure XFS works adequately on the desktop, but other than that we focus on server workloads as optimisation targets. > > > It's easy to get some gains in special situations at the expense of normal > > > ones - keep in mind that this optimisation makes little sense for non-NFS > > > cases, which is the majority of use cases. > > > > XFS is used extensively in NAS products, from small $100 ARM/MIPS > > embedded NAS systems all the way up to high end commercial NAS > > products. It is one of the main use cases we optimise XFS for. > > That's really sad - maybe people like me who use XFS on their servers > should rethink that decision, if XFS mainly optimises for commercial > NAS boxes only. Nice twist - you're trying to imply we do something very different to what I said. So to set the record straight, we optimise for several different overlapping primary use cases. 
We make optimisation decisions that benefit systems and workloads that fall into the following categories:

- large filesystems (e.g. > 100TB)
- large storage subsystems (hundreds to thousands of spindles)
- large amounts of RAM (tens of GBs to TBs of RAM)
- high concurrency from large numbers of CPUs (thousands of CPU cores)
- high throughput, both IOPS and bandwidth
- low fragmentation of large files
- robust error detection and handling

IOWs, we optimise for high performance, high end servers and workloads. Just because we make changes that help high performance, high end NFS servers achieve these goals _does not mean_ we only optimise for NFS servers. I'm not going to continue this part of this thread - it's just a waste of my time. If you want the regression fixed, then stop trying to tell us what the bug is and instead try to help diagnose the cause of the problem. > > we should fix it. What I really want is your test cases that > > reproduce the problem so I can analyse it for myself. Once I > > understand what is going on, then we can talk about what the real > > problem is and how to fix it. > > Being a good citizen wanting to improve XFS I of course delivered that in > my first e-mail. Again, I used allocsize=64m and then made a buildroot > build, which stopped after a few minutes because 180GB of disk space were > gone. > > The disk space was all used up by the buildroot, which is normally a few > gigabytes (after a successful build). > > I found that the uclibc object directory uses 50GB of space, about 8 hours > after the compile - the object files were typically a few kb in size, but > du showed 64mb of usage, even though nobody was using that file more than > once, or ever after the make stopped. A vaguely specified 8 hour long test involving building some large number of packages is not a useful test case. There are too many variables, too much setup time, too much data to analyse and taking 8 hours to get a result is far too long. 
I did try a couple of kernel builds and didn't see the problem you reported. Hence I came to the conclusion that it was something specific to your build environment and asked for a more exact test case. Indeed, someone else presented a 100% reproducible test case in a 3 line script using cp and rm that took 10s to run. It then took me 15 minutes to analyse, then write, test and post a patch that fixes the problem their test case demonstrated. Does the patch in the following email fix your buildroot space usage problem? http://oss.sgi.com/pipermail/xfs/2011-May/050651.html Cheers, Dave. -- Dave Chinner david@fromorbit.com
end of thread, newest: 2011-05-24 1:30 UTC

Thread overview: 14+ messages
2011-05-20  0:55 drastic changes to allocsize semantics in or around 2.6.38? Marc Lehmann
2011-05-20  2:56 ` Dave Chinner
2011-05-20 15:49 ` Marc Lehmann
2011-05-21  0:45 ` Dave Chinner
2011-05-21  1:36 ` Marc Lehmann
2011-05-21  3:15 ` Dave Chinner
2011-05-21  4:16 ` Marc Lehmann
2011-05-22  2:00 ` Dave Chinner
2011-05-22  7:59 ` Matthias Schniedermeyer
2011-05-23  1:20 ` Dave Chinner
2011-05-23  9:01 ` Christoph Hellwig
2011-05-24  0:20 ` Dave Chinner
2011-05-23 13:35 ` Marc Lehmann
2011-05-24  1:30 ` Dave Chinner