* O(n^2) deletion performance
  From: Niklas Hambüchen @ 2018-01-02  0:21 UTC
  To: linux-fsdevel
  Cc: Paul Eggert, Jim Meyering, Pádraig Brady

Hello filesystem hackers,

Over in coreutils issue https://bugs.gnu.org/29921 we have found that
unlink()ing `n` many files seems to take in general O(n²) wall time.

So far I and others (in CC) have benchmarked `rm -r` and other methods
of unlink()ing multiple files, and we've found approximately quadratic
performance across:

* ext4, XFS and zfs-on-linux on spinning disks
* ext4 on SSDs
* the other combinations were not tested yet

Please find the exact numbers at https://bugs.gnu.org/29921.

So far we've observed apparently linear deletion time on:

* XFS on NVMe

It would be great to hear any kind of insights you may have regarding
this issue, implementation details that might be relevant here, or even
some quick benchmarks on your systems to confirm or refute our findings.

Unless instructed otherwise, I would also go ahead and file bugs for the
respective filesystem types, as quadratic deletion time makes for a bad
user experience and lots of sysadmin tears.

Happy new year,
Niklas
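[The benchmark under discussion boils down to a create-then-delete loop.
A minimal reproduction sketch, assuming a GNU userland; the file count
and target directory are illustrative placeholders, not the exact
scripts from the bug report:

  #!/bin/sh
  # Create n empty files in a single directory, then time their removal.
  # n and dir are illustrative values, not taken from the bug report.
  n=1000000
  dir=./many-files
  mkdir -p "$dir"
  (cd "$dir" && seq "$n" | xargs touch)   # creation phase (untimed)
  time rm -rf "$dir"                      # deletion phase (timed)

Plotting the `real` time of the final command against n is what produced
the quadratic-looking curves in the bug report.]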
* Re: O(n^2) deletion performance
  From: Niklas Hambüchen @ 2018-01-02  1:20 UTC
  To: linux-fsdevel
  Cc: Paul Eggert, Jim Meyering, Pádraig Brady

Fixing an omission: I meant "unlink()ing `n` many files [in the same
directory]".
* Re: O(n^2) deletion performance
  From: Theodore Ts'o @ 2018-01-02  1:59 UTC
  To: Niklas Hambüchen
  Cc: linux-fsdevel, Paul Eggert, Jim Meyering, Pádraig Brady

On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
>
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> So far I and others (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files, and we've found approximately quadratic
> performance across...

So, a couple of observations.  You're benchmarking the creation and
deletion of files that don't have any blocks associated with them, so
you are measuring the directory and inode deallocation operations.
There are no block (de-)allocation operations going on.

Secondly, you aren't dropping the cache between the creation phase and
the timed deletion phase of the test.  This means that the results are
heavily impacted by the amount of memory available on the system.  This
doesn't matter all that much here, since you aren't actually writing
any data blocks, and so you're only measuring metadata operations,
which are going to be journalled.

Sorting by inode number helps in the cold cache case, especially if you
have data blocks that you need to delete.  I also suspect that the
original tests of the sort-the-inodes commit were with substantially
smaller numbers of files, especially since those were non-zero-sized
files being deleted.

Sorting by inode number helps optimize deleting files in creation order
for ext3/4, and that helps optimize updates to the inode table and to
the allocation bitmaps.  However, the number of I/Os, and hence the
elapsed time, is also going to be a function of the directory block
updates.  In general you're not going to be able to delete files in
optimal directory tree order *and* simultaneously in optimal inode
table order, so fundamentally there's not a whole lot that can be done
here as directories get super-sized.

For ext4, you can improve things somewhat by using a larger journal
size, if the workload is extremely metadata-heavy (which it is in this
very artificial case of deleting a directory hierarchy with 16 million+
zero-length inodes).  We now use a large default journal size with
newer versions of e2fsprogs, but if your file system was created a
while ago, it is probably maxed out with a 128 MB journal.  Using a one
or two gigabyte journal (e.g., "mke2fs -t ext4 -J size=2048 /dev/sdXX")
will help in this case.  It's unlikely to help that much for more
realistic workloads, though.

> Unless instructed otherwise, I would also go ahead and file bugs for
> the respective filesystem types, as quadratic deletion time makes for
> a bad user experience and lots of sysadmin tears.

Unfortunately there's not a whole lot I can do from the ext4 end.  And
it's not clear to me that optimizing for this extremely artificial
workload makes a lot of sense even if we could do something.  (I
suppose we could create an "rm -rf" system call, which would work so
long as you had sufficient amounts of non-swappable kernel memory
available --- but in the end it's going to be impossible to make
something which is purely linear for arbitrarily large directories.)

					- Ted
P.S.  I suspect that if you had sufficiently large amounts of
super-expensive NVMe storage, you'd see super-linear times even for XFS
on NVMe....  The fast storage simply puts off the inevitable.
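[The inode-number sorting Ted refers to is what coreutils' fts-based rm
already does internally.  As a rough stand-alone illustration of the
idea -- not the actual rm implementation -- here is a sketch assuming
GNU find and file names without whitespace; `bigdir` is a placeholder:

  # Unlink files in inode-number order rather than directory order,
  # so inode-table and bitmap updates happen mostly sequentially.
  find bigdir -type f -printf '%i %p\n' \
    | sort -n \
    | cut -d' ' -f2- \
    | xargs rm --

As Ted notes, this optimizes inode-table access at the cost of jumping
around the directory blocks; the two orders cannot both be optimal.]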
* Re: O(n^2) deletion performance
  From: Andreas Dilger @ 2018-01-02  2:49 UTC
  To: Theodore Ts'o
  Cc: Niklas Hambüchen, Linux FS-devel Mailing List, Paul Eggert,
      Jim Meyering, Pádraig Brady

On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>> Hello filesystem hackers,
>>
>> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
>> unlink()ing `n` many files seems to take in general O(n²) wall time.

Looking at the performance results in these tests (in particular the
first results, which showed system time separately from user time), the
system time appears to be roughly linear in the number of files being
deleted.  It is the user time that is increasing quadratically, so as
far as I can see this isn't the fault of the filesystem/kernel, but
rather of "rm" itself.

>> So far I and others (in CC) have benchmarked `rm -r` and other
>> methods of unlink()ing multiple files, and we've found approximately
>> quadratic performance across...
>
> So, a couple of observations.  You're benchmarking the creation and
> deletion of files that don't have any blocks associated with them, so
> you are measuring the directory and inode deallocation operations.
> There are no block (de-)allocation operations going on.
>
> Secondly, you aren't dropping the cache between the creation phase and
> the timed deletion phase of the test.  This means that the results are
> heavily impacted by the amount of memory available on the system.
> This doesn't matter all that much here, since you aren't actually
> writing any data blocks, and so you're only measuring metadata
> operations, which are going to be journalled.
>
> Sorting by inode number helps in the cold cache case, especially if
> you have data blocks that you need to delete.  I also suspect that the
> original tests of the sort-the-inodes commit were with substantially
> smaller numbers of files, especially since those were non-zero-sized
> files being deleted.
>
> Sorting by inode number helps optimize deleting files in creation
> order for ext3/4, and that helps optimize updates to the inode table
> and to the allocation bitmaps.  However, the number of I/Os, and hence
> the elapsed time, is also going to be a function of the directory
> block updates.
>
> In general you're not going to be able to delete files in optimal
> directory tree order *and* simultaneously in optimal inode table
> order, so fundamentally there's not a whole lot that can be done here
> as directories get super-sized.

At one time we discussed changing inode number allocation to be
piecewise linear for inodes within the same directory.  As a directory
grows larger, it could grab a chunk of free inodes in the inode table,
then map each new filename hash into that range of free inodes, and use
that when selecting the new inode number.  As the chunk fills up, a
new, larger chunk of free inodes would be selected and new filenames
would be mapped into the new chunk.

If this were done, then a hash-order directory traversal for deletion
or stat would have approximately O(log(n)) inode table blocks to load
for a given hash range, rather than having to get each inode from a
random inode table block.
> For ext4, you can improve things somewhat by using a larger journal
> size, if the workload is extremely metadata-heavy (which it is in this
> very artificial case of deleting a directory hierarchy with 16
> million+ zero-length inodes).  We now use a large default journal size
> with newer versions of e2fsprogs, but if your file system was created
> a while ago, it is probably maxed out with a 128 MB journal.  Using a
> one or two gigabyte journal (e.g., "mke2fs -t ext4 -J size=2048
> /dev/sdXX") will help in this case.  It's unlikely to help that much
> for more realistic workloads, though.
>
>> Unless instructed otherwise, I would also go ahead and file bugs for
>> the respective filesystem types, as quadratic deletion time makes for
>> a bad user experience and lots of sysadmin tears.
>
> Unfortunately there's not a whole lot I can do from the ext4 end.  And
> it's not clear to me that optimizing for this extremely artificial
> workload makes a lot of sense even if we could do something.  (I
> suppose we could create an "rm -rf" system call, which would work so
> long as you had sufficient amounts of non-swappable kernel memory
> available --- but in the end it's going to be impossible to make
> something which is purely linear for arbitrarily large directories.)

Cheers, Andreas
* Re: O(n^2) deletion performance
  From: Jim Meyering @ 2018-01-02  4:27 UTC
  To: Andreas Dilger
  Cc: Theodore Ts'o, Niklas Hambüchen, Linux FS-devel Mailing List,
      Paul Eggert, Pádraig Brady

On Mon, Jan 1, 2018 at 6:49 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Jan 1, 2018, at 6:59 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>
>> On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
>>> Hello filesystem hackers,
>>>
>>> Over in coreutils issue https://bugs.gnu.org/29921 we have found
>>> that unlink()ing `n` many files seems to take in general O(n²) wall
>>> time.
>
> Looking at the performance results in these tests (in particular the
> first results, which showed system time separately from user time),
> the system time appears to be roughly linear in the number of files
> being deleted.  It is the user time that is increasing quadratically,
> so as far as I can see this isn't the fault of the filesystem/kernel,
> but rather of "rm" itself.

Thank you both for the quick replies.  Here are those numbers again:

  nfiles    real      user   sys
  100000    0.51s     0.07s  0.43s
  200000    2.46s     0.15s  0.89s
  400000    10.78s    0.26s  2.21s
  800000    44.72s    0.58s  6.03s
  1600000   180.37s   1.06s  10.70s

Since the user and sys durations for the 1.6M case are so low (nowhere
near 2-3x those of the 800K case), I wonder if that test hit the inode
limit for that file system.  Niklas, what does "df -i" report for the
file system on which you ran that test?

Here are some numbers from a system with 32 GB of RAM, an ext4 SSD and,
to start with, 8.8M free inodes.  They show user and system time very
close to linear, with a little uptick on the last iteration.  "real"
seems to be growing too fast, even here.

  $ i=100000; while :; do mkdir x; (cd x && seq $i|xargs touch); \
      env time -f "%e %U %S $i" env rm -rf x; \
      case $i in 32*) break;; esac; i=$[i*2]; done

  real    user   sys    N
  0.48    0.05   0.41   100000
  1.28    0.10   0.86   200000
  3.79    0.21   1.83   400000
  10.51   0.43   4.04   800000
  28.00   0.88   8.97   1600000
  62.54   1.86   19.30  3200000
  ...

> At one time we discussed changing inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map each new filename hash into that range of free inodes, and
> use that when selecting the new inode number.  As the chunk fills up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.
>
> If this were done, then a hash-order directory traversal for deletion
> or stat would have approximately O(log(n)) inode table blocks to load
> for a given hash range, rather than having to get each inode from a
> random inode table block.

Ouch.  It sounds like this is a known O(n) process, and the above would
have reduced it to O(log(n)).  Can you provide a reference?  I am
curious to know how it played out.

Our goal (with fts and coreutils) has been to make it harder for an
accident or maliciousness (with a few million entries in a directory)
to hinder file system traversals.  Of course, it's not just rm: any
FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
Sure, quotas can help, but even self-inflicted accidents happen on
single-user systems with no quotas.

Idly wondered whether the default inode limits could save ext4 users?
Perhaps not.
In this 850 GB file system, I see it has 48M inodes free (caveat: I may
have changed the default when I created it -- I don't recall):

  Filesystem  Type  Inodes  IUsed  IFree  IUse%  Mounted on
  /dev/sda3   ext4     54M   6.0M    48M    12%  /x

Putting even half of those all in one directory is quick and easy, yet
would cause disproportionate pain.  And that was a small partition.
* Re: O(n^2) deletion performance
  From: Theodore Ts'o @ 2018-01-02  6:22 UTC
  To: Jim Meyering
  Cc: Andreas Dilger, Niklas Hambüchen, Linux FS-devel Mailing List,
      Paul Eggert, Pádraig Brady

On Mon, Jan 01, 2018 at 08:27:48PM -0800, Jim Meyering wrote:
> Our goal (with fts and coreutils) has been to make it harder for an
> accident or maliciousness (with a few million entries in a directory)
> to hinder file system traversals.  Of course, it's not just rm: any
> FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
> Sure, quotas can help, but even self-inflicted accidents happen on
> single-user systems with no quotas.
>
> Idly wondered whether the default inode limits could save ext4 users?
> Perhaps not.  In this 850 GB file system, I see it has 48M inodes free
> (caveat: I may have changed the default when I created it -- I don't
> recall):

Well, it's a bit of a blunt hammer, but you *can* set a mount option,
"mount -t ext4 -o max_dir_size_kb=512", which will not allow any
directory to grow larger than 512 KB (or pick your favorite limit).

To me that's like putting a speed-limit governor on a car because you
don't trust the driver not to exceed some arbitrary limit (sometimes
imposed by politicians who think they know better).  Personally, I
would be super annoyed if my car were outfitted with something which
didn't let me go faster than 90 km/h or 56 mph.  But apparently many
commercial vehicles, particularly in the UK, do have such things.

So if you don't trust your users, the feature is there for a
sysadmin/BOFH to impose on said users.  Personally, though, it's a
relatively rare thing, and when trading off the pain caused by hitting
an artificially imposed limit against the patience needed to recover
from accidentally dumping 16 million files into a directory --- I
prefer the latter.  I can wait a few minutes....

					- Ted
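[A sketch of what that mount option looks like in practice; the device
and mount point are placeholders, and the exact failure mode is an
assumption (ext4 is expected to refuse directory growth past the cap
with ENOSPC):

  # Cap directory size at 512 KiB on this mount (illustrative paths).
  mount -t ext4 -o max_dir_size_kb=512 /dev/sdXX /mnt/test
  mkdir /mnt/test/big
  # Creating entries works until the directory itself grows past the
  # cap; after that, each creat() should fail with ENOSPC.
  seq 1000000 | (cd /mnt/test/big && xargs touch)]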
* Re: O(n^2) deletion performance
  From: Jim Meyering @ 2018-01-04  4:16 UTC
  To: Theodore Ts'o
  Cc: Andreas Dilger, Niklas Hambüchen, Linux FS-devel Mailing List,
      Paul Eggert, Pádraig Brady

On Mon, Jan 1, 2018 at 10:22 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Jan 01, 2018 at 08:27:48PM -0800, Jim Meyering wrote:
>> Our goal (with fts and coreutils) has been to make it harder for an
>> accident or maliciousness (with a few million entries in a directory)
>> to hinder file system traversals.  Of course, it's not just rm: any
>> FS-traversal tool is affected: cp, chmod, chgrp, du, find, tar, etc.
>> Sure, quotas can help, but even self-inflicted accidents happen on
>> single-user systems with no quotas.
>>
>> Idly wondered whether the default inode limits could save ext4 users?
>> Perhaps not.  In this 850 GB file system, I see it has 48M inodes
>> free (caveat: I may have changed the default when I created it -- I
>> don't recall):
>
> Well, it's a bit of a blunt hammer, but you *can* set a mount option,
> "mount -t ext4 -o max_dir_size_kb=512", which will not allow any
> directory to grow larger than 512 KB (or pick your favorite limit).

Thanks, but no thanks :-)

Still wondering how this happened... a deliberate optimization for
something else, probably.  And wishing I'd written a relative (not
absolute) test for it in 2008, so I would have noticed sooner.  In
2008, when I wrote this coreutils extN performance test:

  https://git.savannah.gnu.org/cgit/coreutils.git/tree/tests/rm/ext3-perf.sh

there was no O(N^2) or even "just" O(N^1.5) component when using the
then-just-improved rm.  Many of us plotted the curves.  Any idea when
ext4's unlink became more expensive?

> versus the patience needed to recover from accidentally dumping 16
> million files into a directory --- I prefer the latter.  I can wait a
> few minutes....

I've just run a test on the spinning-disk file system mentioned above,
and it took 75 minutes to delete 12.8M entries.  That's rather nasty.

On the bright side, Kevin Vigor was kind enough to run tests showing
that on some large, fast NVMe devices, everything looks linear:
https://docs.google.com/spreadsheets/d/1bPi8MTvSP4xzzuARPOd5fxFujhBoU2Dxandr-Vh1T9c/edit#gid=0
* Re: O(n^2) deletion performance
  From: Theodore Ts'o @ 2018-01-04  7:16 UTC
  To: Jim Meyering
  Cc: Andreas Dilger, Niklas Hambüchen, Linux FS-devel Mailing List,
      Paul Eggert, Pádraig Brady

On Wed, Jan 03, 2018 at 08:16:58PM -0800, Jim Meyering wrote:
>
> Still wondering how this happened... a deliberate optimization for
> something else, probably.  And wishing I'd written a relative (not
> absolute) test for it in 2008, so I would have noticed sooner.  In
> 2008, when I wrote this coreutils extN performance test:
>
>   https://git.savannah.gnu.org/cgit/coreutils.git/tree/tests/rm/ext3-perf.sh
>
> there was no O(N^2) or even "just" O(N^1.5) component when using the
> then-just-improved rm.  Many of us plotted the curves.

How high (how many files) did you plot the curves back in 2008?  Do you
remember?

> Any idea when ext4's unlink became more expensive?

Ext4's unlink hasn't really changed.  What *has* changed over time are
the block and inode allocation algorithms.  2008 is before we tuned the
allocation algorithms to make read/write operations on large files more
efficient, and to try to avoid free-space fragmentation as the file
system ages.

Given that on a large/fast NVMe device everything looks linear, it's
pretty clear that what you're seeing is essentially seek time on
spinning rust, and probably SSD GC overhead on flash devices.

> I've just run a test on the spinning-disk file system mentioned above,
> and it took 75 minutes to delete 12.8M entries.  That's rather nasty.

Yes, but how long did it take to *create* the 12.8M entries?  My guess
is that it was roughly comparable?

					- Ted
* Re: O(n^2) deletion performance
  From: Dave Chinner @ 2018-01-04 11:42 UTC
  To: Jim Meyering
  Cc: Theodore Ts'o, Andreas Dilger, Niklas Hambüchen, Linux FS-devel
      Mailing List, Paul Eggert, Pádraig Brady

On Wed, Jan 03, 2018 at 08:16:58PM -0800, Jim Meyering wrote:
> On Mon, Jan 1, 2018 at 10:22 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>> versus the patience needed to recover from accidentally dumping 16
>> million files into a directory --- I prefer the latter.  I can wait a
>> few minutes....
>
> I've just run a test on the spinning-disk file system mentioned above,
> and it took 75 minutes to delete 12.8M entries.  That's rather nasty.

An unrealistic performance expectation is not a code bug.  It's a
knowledge problem and a learning opportunity. :)

So let's put some numbers to this operation.  If this is a cache-cold
rm, then the number of read IOs required on a perfectly laid out /XFS/
filesystem would be roughly:

  read IOs = num getdents IOs + num inode lookup IOs
           = 12.8M / 32 + 12.8M / 500
           = 400,000 + 25,600
           = 425,600

Now, let's assume a low seek time for each IO on a SATA drive -- about
3 ms per IO.  That means the read IO delay is:

  read IO delay = 425,600 * 0.003s = 1277s = 21m17s

Hence with a perfect layout we've got at least 21 minutes of blocking
IO time alone in this workload.  If the directory and inodes are
scattered and we get an average seek time (7 ms), then we're looking at
50 minutes of blocking read IO time in the workload.

Then we've got to add the time for each operation to be executed on the
CPU.  If I take the numbers from a 10M-file unlink I did yesterday in a
VM with device image files hosted on XFS on NVMe drives:

  File count: 10000000
  Create:   918.46  47.14   866.24
  Unlink:  1545.99  12.14  1410.97

There's 1410s of system CPU time to unlink 10 million inodes.  Add
another 25% for the larger inode count in your test, and we've got
almost 30 minutes of CPU time in this workload.  If we add that to our
50 minutes of IO blocking time, we've got ~80 minutes total runtime.

IOWs, 75 minutes on a spinning SATA drive to delete 12.8M directory
entries is entirely reasonable for any filesystem, because the workload
is single-threaded, blocks on each read IO, requires hundreds of
thousands of read IOs to complete, and its IO time is largely dependent
on the physical layout of the directory and inode structure.

That's the reason I asked for iowatcher to be run to analyse these bug
reports -- it will tell us /exactly/ how much time is being spent
waiting for IO, what the seek patterns are, and that will tell us how
badly we all suck at laying out large directories.

> On the bright side, Kevin Vigor was kind enough to run tests showing
> that on some large, fast NVMe devices, everything looks linear:
> https://docs.google.com/spreadsheets/d/1bPi8MTvSP4xzzuARPOd5fxFujhBoU2Dxandr-Vh1T9c/edit#gid=0

Of course.  That's because the read IO latency is deterministic -- not
dependent on the physical layout (i.e. constant "seek" time) -- and
it's much, much faster than a spinning disk, probably around 100
microseconds.  In that case, the IO wait time drops to:

  read IO delay = 425,600 * 0.0001s = 42.6s

That's not unreasonable given the above numbers from my test VM,
showing roughly 2 minutes of wait time out of 25 minutes for 10 million
inodes being removed.
IOWs, the performance on fast SSDs and NVMe drives is determined almost
entirely by software algorithms and the per-operation CPU cost, rather
than by the physical layout of the structure on disk and the time
needed to retrieve it into memory.  That's why you see variable and
slow performance on slow spinning drives, but it gets more and more
linear as the devices get faster and less sensitive to the physical
layout of the filesystem metadata....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
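[Dave's estimate is easy to redo for other file counts and seek times.
A small sketch reproducing his arithmetic; the 32 dirents per getdents
IO and 500 inodes per inode-cluster read are the assumptions from his
mail, not universal constants:

  # Estimate blocking read-IO time for a cold-cache rm of nfiles entries.
  awk -v nfiles=12800000 -v seek_s=0.003 'BEGIN {
      ios = nfiles / 32 + nfiles / 500        # getdents + inode reads
      printf "read IOs: %d  IO wait: %.0fs (~%.0fm)\n",
             ios, ios * seek_s, ios * seek_s / 60
  }'
  # prints: read IOs: 425600  IO wait: 1277s (~21m)

Dropping seek_s to 0.0001 (the assumed NVMe latency) reproduces the
42.6s figure from the same mail.]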
* Re: O(n^2) deletion performance
  From: Theodore Ts'o @ 2018-01-02  4:33 UTC
  To: Andreas Dilger
  Cc: Niklas Hambüchen, Linux FS-devel Mailing List, Paul Eggert,
      Jim Meyering, Pádraig Brady

On Mon, Jan 01, 2018 at 07:49:55PM -0700, Andreas Dilger wrote:
>
> At one time we discussed changing inode number allocation to be
> piecewise linear for inodes within the same directory.  As a directory
> grows larger, it could grab a chunk of free inodes in the inode table,
> then map each new filename hash into that range of free inodes, and
> use that when selecting the new inode number.  As the chunk fills up,
> a new, larger chunk of free inodes would be selected and new filenames
> would be mapped into the new chunk.

Well, it's not so simple.  Remember that there are only 16 inodes per
4k inode table block, and only 32,768 inodes per block group.  In the
workloads discussed in the coreutils bug, there are 1 million to 32
million files being created in a single directory.  At that point,
unless we start doing truly ridiculous readaheads on the inode table,
the disk is going to be seeking randomly no matter what you do.

Also, if we try to stay strictly within the inodes in the block group,
then -- assuming the average file size is larger than 4k, generally a
safe bet -- this will tend to separate the data blocks from the inodes,
which will increase latency when reading files.  And most of the time,
optimizing for reading files makes sense (e.g., think /usr/include),
because that tends to happen more often than "rm -rf" workloads.

The bottom line is that this gets tricky, especially if the directory
is dynamically growing and shrinking.  You might then start deleting
from the first chunk while allocating from the second chunk, or maybe
the third chunk.  In the end, it's relatively easy to optimize for
specific workloads --- but even if this were a good idea, "rm -rf" of
zero-length files is not the first workload I would be hyper-optimizing
for.

Cheers,

					- Ted
* Re: O(n^2) deletion performance
  From: Dave Chinner @ 2018-01-02  4:54 UTC
  To: Niklas Hambüchen
  Cc: linux-fsdevel, Paul Eggert, Jim Meyering, Pádraig Brady

On Tue, Jan 02, 2018 at 01:21:32AM +0100, Niklas Hambüchen wrote:
> Hello filesystem hackers,
>
> Over in coreutils issue https://bugs.gnu.org/29921 we have found that
> unlink()ing `n` many files seems to take in general O(n²) wall time.
>
> So far I and others (in CC) have benchmarked `rm -r` and other methods
> of unlink()ing multiple files, and we've found approximately quadratic
> performance across:
>
> * ext4, XFS and zfs-on-linux on spinning disks
> * ext4 on SSDs
> * the other combinations were not tested yet
>
> Please find the exact numbers at https://bugs.gnu.org/29921.

Run iowatcher on the tests and have a look at the IO patterns before
and after the modifications to rm.  Work out where the seek delays are
actually coming from, and find the common characteristics that all the
reports share.

> It would be great to hear any kind of insights you may have regarding
> this issue, implementation details that might be relevant here, or
> even some quick benchmarks on your systems to confirm or refute our
> findings.

So I just ran a quick test here on XFS: create files with fsmark,
unmount, mount, remove them, and compare wall times.

  Nfiles   create time   remove time
  10k      0m0.518s      0m0.658s
  20k      0m1.039s      0m1.342s
  100k     0m4.540s      0m4.646s
  200k     0m8.567s      0m10.180s
  1M       1m23.059s     0m51.582s

Looks pretty linear in terms of rm time.  This is with rm v8.26, and
the rm command is "rm -rf /mnt/scratch/0", where that path contains a
single directory with all the files in it.

The typical IO pattern during rm is getdents at 40 calls/s and 20,000
inode lookup calls/s, i.e. inode lookups outnumber directory reads by
500:1.  XFS does inode reads in clusters, so those 20,000 inodes/s
after a sequential create are being serviced by about 650 read IOs/s.
Not very much, really.

This is run on XFS on a RAID0 of 2 NVMe drives: a sparse 500 TB file
passed to a VM as a virtio,cache=none,aio=native drive, formatted with
XFS (i.e. XFS on XFS).  So, run it on spinning drives.  Same VM, just a
SATA-hosted XFS image file on a SATA drive on the host used instead:

  Nfiles   create time   remove time
  10k      0m0.520s      0m0.643s
  20k      0m1.198s      0m1.351s
  100k     0m5.306s      0m6.768s
  200k     0m11.083s     0m13.763s
  1M       1m39.876s     1m14.765s

So, pretty much identical -- still linear.  The difference in
performance between the two was the journal size: the smaller
filesystem wrote at 50 MB/s continually to the journal, while the large
NVMe filesystem hardly touched it.  I remade the small filesystem with
a larger log, and the IO patterns and performance numbers became
identical to the large filesystem's.

Then I locked out 15.5 GB of RAM, so that the test runs short of memory
and causes writeback to occur during the creations.  That barely
changed the performance on either the spinning or the NVMe drives.

I also just upgraded to coreutils 8.28; performance is basically
identical to 8.26.

Oh, and I just remembered that I've also got some old Samsung 840 EVO
SSDs attached to that VM!  I ran the test again and got the same
results.  And, while I'm at it, I ran it on the emulated pmem I have
configured, and it got the same results.

IOWs, I've got XFS on 4.14 producing near-identical results for
spinning disks, old SATA SSDs, brand new NVMe SSDs and pmem.
There simply isn't an O(n^2) problem in the userspace or XFS code,
before or after the changes pointed out in the bug linked above.

Keep in mind that running a random script like this on a random
filesystem in a random initial state is about the most sub-optimal
environment you can devise for a regression test.  It won't give
reliable results, and you can't even compare results from the same
machine and filesystem, because the initial conditions are different
from run to run.

> Unless instructed otherwise, I would also go ahead and file bugs for
> the respective filesystem types, as quadratic deletion time makes for
> a bad user experience and lots of sysadmin tears.

I think you're being overly melodramatic.  There isn't an obvious
problem, just a lot of handwaving.  Do some IO analysis that shows
where the delays are coming from, and then we'll have something to work
from....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
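[A hedged sketch of a test along the lines Dave describes; the fs_mark
flags, device and paths here are illustrative assumptions, not his
exact invocation:

  #!/bin/sh
  # Create N zero-length files in one directory with fs_mark, cycle the
  # mount to get a cold cache, then time removal.  Paths are placeholders.
  dev=/dev/vdb; mnt=/mnt/scratch; n=1000000
  mount "$dev" "$mnt"
  fs_mark -d "$mnt/0" -n "$n" -s 0 -S 0 -L 1   # create phase, no fsync
  umount "$mnt" && mount "$dev" "$mnt"         # drop caches via remount
  time rm -rf "$mnt/0"                         # timed removal

Running this for a range of N and plotting the removal times against N
is the comparison made in the tables above.]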