* xfs and swift
@ 2016-01-06 15:15 Mark Seger
  2016-01-06 22:04 ` Dave Chinner
  2016-01-25 18:24 ` Bernd Schubert
  0 siblings, 2 replies; 10+ messages in thread

From: Mark Seger @ 2016-01-06 15:15 UTC (permalink / raw)
To: Linux fs XFS; +Cc: Laurence Oberman

[-- Attachment #1.1: Type: text/plain, Size: 2381 bytes --]

I've recently found the performance of our development swift system is degrading over time as the number of objects/files increases. This is a relatively small system; each server has 3 400GB disks. The system I'm currently looking at has about 70GB tied up in slabs alone, close to 55GB in xfs inodes and ili, and about 2GB free. The kernel is 3.14.57-1-amd64-hlinux.

Here's the way the filesystems are mounted:

/dev/sdb1 on /srv/node/disk0 type xfs (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)

I can do about 2000 1K file creates/sec when running 2 minute PUT tests at 100 threads. If I repeat that test for multiple hours, I see the number of IOPS steadily decreasing to about 770, and on the very next run it drops to 260 and continues to fall from there. This happens at about 12M files.

The directory structure is 2 tiered, with 1000 directories per tier, so we can have about 1M of them, though they don't currently all exist.

I've written a collectl plugin that lets me watch many of the xfs stats in real-time, and I also have a test script that exercises the swift PUT code directly and so eliminates all the inter-node communications. This script also allows me to write to the existing swift directories as well as redirect to an empty structure, so it mimics a clean environment with no existing subdirectories.

I'm attaching some xfs stats during the run and hope they're readable. These values are in operations/sec and each line is 1 second's worth of data. The first set of numbers is on the clean directory and the second on the existing 12M file one.
At the bottom of these stats are also the xfs slab allocations as reported by collectl. I can also watch these during a test and can see the number of inode and ili objects steadily grow at about 1K/sec, which is curious since I'm only creating about 300.

If there is anything else I can provide just let me know.

I don't fully understand all the xfs stats, but what does jump out at me is that the XFS read/write ops have increased by a factor of about 5 when the system is slower.

Right now the collectl plugin is not something I've released, but if there is interest and someone would like to help me present the data in a more organized/meaningful manner, just let me know. If there are any tuning suggestions I'm more than happy to try them out.

-mark

[-- Attachment #2: tests.txt --]
[-- Type: text/plain, Size: 6565 bytes --]

>>> Fast <<<
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><----Xstrat---><-------AttrOps-----><--------------INodes-------------->
# Write Reads Writes WrtKBs NoRoom Force Sleep ExtA BlkA ExtF ExtF Look Cre8 Remv Gdnt Sync Asyn Empt Quick Split Gets Sets Rmov List Atpt Hit Miss Recy Dup Recl Chgd
53 599 4 1024 0 4 4 3 65 10 70 155 6 14 284 0 75 1 2 0 275 1 0 24 0 149 5 0 0 0 1
92 836 16 4096 0 16 16 24 117 8 98 200 58 18 272 0 181 1 3 0 235 11 0 10 0 152 39 0 0 15 1
370 732 295 75520 0 295 295 592 685 6 96 1599 1442 293 829 0 3527 1 9 0 244 290 0 13 0 153 1144 0 0 0 1
383 837 284 72704 0 284 285 559 683 10 130 1532 1352 284 816 0 3343 0 4 0 236 276 0 10 0 155 1073 0 0 9 0
341 734 289 73984 0 289 289 583 690 8 68 1574 1393 297 860 0 3472 3 6 0 291 288 0 30 0 143 1105 0 0 0 3
342 812 291 74496 0 291 291 583 720 6 66 1574 1376 294 840 0 3439 2 2 0 261 289 0 19 0 144 1087 0 0 0 2
427 415 301 77056 0 301 302 598 843 14 164 1613 1391 305 870 0 3531 1 5 0 279 292 0 26 0 163 1090 0 0 0 1
401 832 302 77312 0 302 303 598 797 10 130 1604 1390 303 862 0 3522 1 4 0 244 295 0 13 0 148 1093 0 0 90 1
349 384 275 70400 0 275 275 549 717 10 100 1480 1258 281 814 0 3224 1 4 0 251 270 0 15 0 146 985 0 0 0 1
79 432 6 1536 0 6 6 9 102 6 96 158 3 3 250 0 47 0 9 0 248 0 0 14 0 156 2 0 0 0 0
54 253 4 1024 0 4 4 2 64 4 64 157 2 2 274 0 23 0 2 0 284 0 0 26 0 156 1 0 0 0 0

>>> Slow <<<
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><----Xstrat---><-------AttrOps-----><--------------INodes-------------->
# Write Reads Writes WrtKBs NoRoom Force Sleep ExtA BlkA ExtF ExtF Look Cre8 Remv Gdnt Sync Asyn Empt Quick Split Gets Sets Rmov List Atpt Hit Miss Recy Dup Recl Chgd
0 61 0 0 0 0 0 0 0 0 0 132 0 0 218 0 0 0 0 0 213 0 0 0 0 126 6 0 0 0 0
59 115 11 2816 0 11 11 16 78 4 65 160 33 9 230 0 104 0 2 0 210 7 0 0 0 128 28 0 0 0 0
1384 1263 272 69632 0 272 272 423 1998 92 1443 875 872 227 576 0 2639 0 45 0 210 182 0 0 0 153 675 0 0 4 0
1604 1503 294 75264 0 294 294 438 2201 106 1696 907 890 241 590 0 2772 0 53 0 210 188 0 0 0 151 681 0 0 9 0
1638 2255 309 79104 0 309 307 460 2314 114 1734 946 934 260 632 0 2942 0 54 0 237 199 0 0 0 193 678 0 0 0 0
1678 2298 337 86272 0 338 330 486 2326 128 1779 1031 987 291 712 0 3168 0 55 0 284 220 0 4 0 189 712 0 0 0 0
1578 2423 333 85248 0 332 325 492 2268 118 1649 1041 991 289 714 0 3153 0 51 0 270 222 0 0 0 200 700 0 0 0 0
1040 1861 239 61184 0 241 231 353 1496 82 1072 774 718 212 588 0 2272 0 33 0 264 164 0 0 0 153 524 0 0 1730 0
1709 2361 336 86016 0 340 331 485 2401 128 1810 1029 969 291 706 0 3137 0 56 0 278 216 0 4 0 174 722 0 0 4751 0
1616 2063 325 83200 0 326 321 485 2278 114 1707 1038 973 274 674 1 3067 0 53 0 240 214 0 3 0 173 731 0 0 0 0
1599 1557 312 79872 0 312 313 482 2274 104 1664 1048 962 260 684 0 2980 0 52 0 224 208 0 7 0 219 681 0 0 0 0
1114 1312 229 58624 0 229 230 356 1577 72 1152 817 709 192 552 0 2188 0 36 0 216 157 0 3 0 165 524 0 0 3570 0
1066 1185 175 44800 0 176 176 249 1440 72 1153 585 466 141 440 0 1577 0 36 0 214 104 0 2 0 155 339 0 0 4497 0
54 487 6 1536 0 6 6 2 64 4 64 137 2 2 238 0 24 0 2 0 216 0 0 3 0 135 2 0 0 2664 0
0 590 0 0 0 0 0 0 0 0 0 136 0 0 224 0 0 0 0 0 210 0 0 0 0 136 0 0 0 0 0
54 514 4 1024 0 4 4 2 64 4 64 142 2 2 248 0 24 0 2 0 214 0 0 2 0 140 2 0 0 0 0

>>> slabs <<<
stack@helion-cp1-swobj0001-mgmt:~$ sudo collectl -sY -i:1 -c1 --slabfilt xfs
waiting for 1 second sample...

# SLAB DETAIL
#               <-----------Objects----------><---------Slab Allocation------><---Change-->
#Name             InUse    Bytes   Alloc    Bytes   InUse   Bytes   Total   Bytes  Diff  Pct
xfs_btree_cur      1872   389376    1872   389376      48  393216      48  393216     0  0.0
xfs_da_state       1584   772992    1584   772992      48  786432      48  786432     0  0.0
xfs_dqtrx             0        0       0        0       0       0       0       0     0  0.0
xfs_efd_item       4360  1744000    4760  1904000     119 1949696     119 1949696     0  0.0
xfs_icr               0        0       0        0       0       0       0       0     0  0.0
xfs_ili          48127K    6976M  48197K    6986M  909380   7104M  909380   7104M     0  0.0
xfs_inode        48210K   47080M  48244K   47113M   1507K  47123M   1507K  47123M     0  0.0

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
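The per-second operation rates above come from the kernel's XFS stats counters, which a collectl-style plugin samples and differences each interval. A minimal sketch of that sampling is below; the helper names and the two inline samples are mine, not collectl's, and which counter rows appear in /proc/fs/xfs/stat varies by kernel version.

```python
def parse_xfs_stat(text):
    """Turn '/proc/fs/xfs/stat'-style 'rowname v1 v2 ...' lines
    into {rowname: [v1, v2, ...]}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # first field is the row name (e.g. extent_alloc, dir, ig, log),
        # the rest are monotonically increasing counters
        stats[fields[0]] = [int(v) for v in fields[1:]]
    return stats

def delta(prev, curr):
    """Per-interval rates: subtract the previous sample from the current."""
    return {name: [c - p for p, c in zip(prev[name], curr[name])]
            for name in curr if name in prev}

# Two hypothetical 1-second-apart samples (values invented for illustration):
sample1 = "ig 100 5 0 90 0 85 3\nlog 50 12800 0 40 40\n"
sample2 = "ig 160 8 0 150 0 140 6\nlog 54 13824 0 44 44\n"
rates = delta(parse_xfs_stat(sample1), parse_xfs_stat(sample2))
print(rates["log"])  # log writes/KB/noroom/force/sleep during the interval
```

In practice the sampler would read /proc/fs/xfs/stat once per interval and print the differenced rows, which is what produces tables like the Fast/Slow ones above.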
* Re: xfs and swift
  2016-01-06 15:15 xfs and swift Mark Seger
@ 2016-01-06 22:04 ` Dave Chinner
  2016-01-06 22:10   ` Dave Chinner
  2016-01-25 18:24 ` Bernd Schubert
  1 sibling, 1 reply; 10+ messages in thread

From: Dave Chinner @ 2016-01-06 22:04 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> I've recently found the performance our development swift system is
> degrading over time as the number of objects/files increases. This is a
> relatively small system, each server has 3 400GB disks. The system I'm
> currently looking at has about 70GB tied up in slabs alone, close to 55GB
> in xfs inodes and ili, and about 2GB free. The kernel
> is 3.14.57-1-amd64-hlinux.

So you've got 50M cached inodes in memory, and a relatively old kernel.

> Here's the way the filesystems are mounted:
>
> /dev/sdb1 on /srv/node/disk0 type xfs
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
>
> I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> 100 threads. If I repeat that tests for multiple hours, I see the number
> of IOPS steadily decreasing to about 770 and the very next run it drops to
> 260 and continues to fall from there. This happens at about 12M files.

According to the numbers you've provided:

       lookups  creates  removes
Fast:     1550     1350      300
Slow:     1000      900      250

This is pretty much what I'd expect at the XFS level when going from a small empty filesystem to one containing 12M 1k files.

That does not correlate to your numbers above, so it's not at all clear that there is really a problem here at the XFS level.

> The directory structure is 2 tiered, with 1000 directories per tier so we
> can have about 1M of them, though they don't currently all exist.

That's insane.

The xfs directory structure is much, much more space, time, IO and memory efficient than a directory hierarchy like this.
The only thing you need a directory hash hierarchy for is to provide sufficient concurrency for your operations, which you would probably get with a single level with one or two subdirs per filesystem AG.

What you are doing is spreading the IO over thousands of different regions on the disks, and then randomly seeking between them on every operation. i.e. your workload is seek-bound, and your directory structure has the effect of /maximising/ seeks per operation...

> I've written a collectl plugin that lets me watch many of the xfs stats in

/me sighs and points at PCP: http://pcp.io

> real-time and also have a test script that exercises the swift PUT code
> directly and so eliminates all the inter-node communications. This script
> also allows me to write to the existing swift directories as well as
> redirect to an empty structure so mimics clean environment with no existing
> subdirectories.

Yet that doesn't behave like an empty filesystem, which is clearly shown by the fact the caches are full of inodes that aren't being used by the test. It also points out that allocation of new inodes will follow the old logarithmic search speed degradation, because your kernel is sufficiently old that it doesn't support the free inode btree feature...

> I'm attaching some xfs stats during the run and hope they're readable.
> These values are in operations/sec and each line is 1 second's worth of
> data. The first set of numbers is on the clean directory and the second on
> the existing 12M file one. At the bottom of these stats are also the xfs
> slab allocations as reported by collectl. I can also watch these during a
> test and can see the number of inode and ilo objects steadily grow at about
> 1K/sec, which is curious since I'm only creating about 300.

It grows at exactly the rate of the lookups being done, which is what is expected. i.e. for each create being done, there are other lookups being done first, e.g.
directories, other objects to determine where to create the new one, a lookup has to be done before removes (of which there are a significant number), etc.

> If there is anything else I can provide just let me know.
>
> I don't fully understand all the xfs stats but what does jump out at me is
> the XFS read/write ops have increased by a factor of about 5 when the
> system is slower.

Which means your application is reading/writing 5x as much information from the filesystem when it is slow. That's not a filesystem problem - your application is having to traverse/modify 5x as much information for each object it is creating/modifying. There's a good chance that's a result of your massively wide object store directory hierarchy....

i.e. you need to start by understanding what your application is doing in terms of IO, configuration and algorithms and determine whether that is optimal before you start looking at whether the filesystem is actually the bottleneck.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
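Dave's suggested alternative — a single hash level with one or two subdirectories per allocation group — can be sketched as follows. The function name, bucket count, and hashing scheme here are illustrative only, not how swift or XFS actually names anything; the point is that the bucket count tracks the filesystem's AG/concurrency needs rather than the 1000x1000 fan-out described above.

```python
import hashlib
import os

def object_dir(root, name, nbuckets=8):
    """Map an object name to one of `nbuckets` top-level directories.

    nbuckets would be chosen to roughly match the filesystem's AG count
    (or required operation concurrency), keeping all creates confined to
    a handful of disk regions instead of thousands.
    """
    h = hashlib.md5(name.encode()).hexdigest()
    bucket = int(h, 16) % nbuckets
    return os.path.join(root, "objs%03d" % bucket)

print(object_dir("/srv/node/disk0", "account/container/object"))
```

With a layout like this, metadata for all objects lives in a few large directories whose btrees XFS keeps compact, rather than in a million tiny directories scattered across the disk.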
* Re: xfs and swift
  2016-01-06 22:04 ` Dave Chinner
@ 2016-01-06 22:10   ` Dave Chinner
  2016-01-06 22:46     ` Mark Seger
  0 siblings, 1 reply; 10+ messages in thread

From: Dave Chinner @ 2016-01-06 22:10 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Thu, Jan 07, 2016 at 09:04:54AM +1100, Dave Chinner wrote:
> On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> > I've recently found the performance our development swift system is
> > degrading over time as the number of objects/files increases. This is a
> > relatively small system, each server has 3 400GB disks. The system I'm
> > currently looking at has about 70GB tied up in slabs alone, close to 55GB
> > in xfs inodes and ili, and about 2GB free. The kernel
> > is 3.14.57-1-amd64-hlinux.
>
> So you go 50M cached inodes in memory, and a relatively old kernel.
>
> > Here's the way the filesystems are mounted:
> >
> > /dev/sdb1 on /srv/node/disk0 type xfs
> > (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> >
> > I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> > 100 threads. If I repeat that tests for multiple hours, I see the number
> > of IOPS steadily decreasing to about 770 and the very next run it drops to
> > 260 and continues to fall from there. This happens at about 12M files.
>
> According to the numbers you've provided:
>
>        lookups  creates  removes
> Fast:     1550     1350      300
> Slow:     1000      900      250
>
> This is pretty much what I'd expect on the XFS level when going from
> a small empty filesystem to one containing 12M 1k files.
>
> That does not correlate to your numbers above, so it's not at all
> clear that there is realy a problem here at the XFS level.
>
> > The directory structure is 2 tiered, with 1000 directories per tier so we
> > can have about 1M of them, though they don't currently all exist.
>
> That's insane.
>
> The xfs directory structure is much, much more space, time, IO and
> memory efficient that a directory hierachy like this. The only thing
> you need a directory hash hierarchy for is to provide sufficient
> concurrency for your operations, which you would probably get with a
> single level with one or two subdirs per filesystem AG.

BTW, you might want to read the section on directory block size for a quick introduction to XFS directory design and scalability:

https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs and swift
  2016-01-06 22:10 ` Dave Chinner
@ 2016-01-06 22:46   ` Mark Seger
  2016-01-06 23:49     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread

From: Mark Seger @ 2016-01-06 22:46 UTC (permalink / raw)
To: Dave Chinner; +Cc: Laurence Oberman, Linux fs XFS

[-- Attachment #1.1: Type: text/plain, Size: 5477 bytes --]

dave, thanks for getting back to me and the pointer to the config doc. lots to absorb and play with.

the real challenge for me is that I'm doing testing at different levels. While I realize running 100 parallel swift PUT threads on a small system is not the ideal way to do things, it's the only easy way to get massive numbers of objects into the filesystem, and once there, the performance of a single stream is pretty poor. By instrumenting the swift code I can clearly see excess time being spent in creating/writing the objects, and so that's led us to believe the problem lies in the way xfs is configured. creating a new directory structure on that same mount point immediately results in high levels of performance.

As an attempt to try to reproduce the problems w/o swift, I wrote a little python script that simply creates files in a 2-tier structure, the first tier consisting of 1024 directories, each containing 4096 subdirectories into which the 1K files are created. I'm doing this 10000 objects at a time and then timing them, reporting the times 10 per line, so each line represents 100 thousand file creates.
Here too I'm seeing degradation, and if I look at what happens when there are already 3M files and I write 1M more, I see these creation times per 10 thousand:

1.004236 0.961419 0.996514 1.012150 1.101794 0.999422 0.994796 1.214535 0.997276 1.306736
2.793429 1.201471 1.133576 1.069682 1.030985 1.096341 1.052602 1.391364 0.999480 1.914125
1.193892 0.967206 1.263310 0.890472 1.051962 4.253694 1.145573 1.528848 13.586892 4.925790
3.975442 8.896552 1.197005 3.904226 7.503806 1.294842 1.816422 9.329792 7.270323 5.936545
7.058685 5.516841 4.527271 1.956592 1.382551 1.510339 1.318341 13.255939 6.938845 4.106066
2.612064 2.028795 4.647980 7.371628 5.473423 5.823201 14.229120 0.899348 3.539658 8.501498
4.662593 6.423530 7.980757 6.367012 3.414239 7.364857 4.143751 6.317348 11.393067 1.273371
146.067300 1.317814 1.176529 1.177830 52.206605 1.112854 2.087990 42.328220 1.178436 1.335202
49.118140 1.368696 1.515826 44.690431 0.927428 0.920801 0.985965 1.000591 1.027458 60.650443
1.771318 2.690499 2.262868 1.061343 0.932998 64.064210 37.726213 1.245129 0.743771 0.996683

note that one set of 10K took almost 3 minutes!

my main questions at this point are: is this performance expected, and/or might a newer kernel help? and might it be possible to significantly improve things via tuning, or is it what it is? I do realize I'm starting with an empty directory tree whose performance degrades as it fills, but if I wanted to tune for say 10M or maybe 100M files, might I be able to expect more consistent numbers (perhaps starting out at lower performance) as the number of objects grows? I'm basically looking for more consistency over a broader range of numbers of files.
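For reference, a rough reconstruction of the kind of script described above might look like this. The layout constants (1024 x 4096, 1K payload, 10000-file batches) match the description; the helper name, random directory placement, and batch reporting are my own guesses at details the thread doesn't spell out.

```python
import os
import random
import time

def create_files(root, count, batch=10000, payload=b"x" * 1024):
    """Create `count` 1 KiB files spread across a 2-tier directory
    structure (1024 top-level dirs x 4096 subdirs each), returning the
    elapsed seconds per `batch` files created."""
    times = []
    start = time.time()
    for i in range(count):
        d = os.path.join(root,
                         "%04d" % random.randrange(1024),
                         "%04d" % random.randrange(4096))
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "obj%08d" % i), "wb") as f:
            f.write(payload)
        if (i + 1) % batch == 0:
            now = time.time()
            times.append(now - start)  # seconds for this batch of `batch`
            start = now
    return times
```

Printing the returned list 10 values per line reproduces the table format above; on a fresh filesystem the batch times stay flat, and the degradation shows up as the file count grows.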
-mark

On Wed, Jan 6, 2016 at 5:10 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Jan 07, 2016 at 09:04:54AM +1100, Dave Chinner wrote:
> > On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> > > I've recently found the performance our development swift system is
> > > degrading over time as the number of objects/files increases. This is a
> > > relatively small system, each server has 3 400GB disks. The system I'm
> > > currently looking at has about 70GB tied up in slabs alone, close to 55GB
> > > in xfs inodes and ili, and about 2GB free. The kernel
> > > is 3.14.57-1-amd64-hlinux.
> >
> > So you go 50M cached inodes in memory, and a relatively old kernel.
> >
> > > Here's the way the filesystems are mounted:
> > >
> > > /dev/sdb1 on /srv/node/disk0 type xfs
> > > (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> > >
> > > I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> > > 100 threads. If I repeat that tests for multiple hours, I see the number
> > > of IOPS steadily decreasing to about 770 and the very next run it drops to
> > > 260 and continues to fall from there. This happens at about 12M files.
> >
> > According to the numbers you've provided:
> >
> >        lookups  creates  removes
> > Fast:     1550     1350      300
> > Slow:     1000      900      250
> >
> > This is pretty much what I'd expect on the XFS level when going from
> > a small empty filesystem to one containing 12M 1k files.
> >
> > That does not correlate to your numbers above, so it's not at all
> > clear that there is realy a problem here at the XFS level.
> >
> > > The directory structure is 2 tiered, with 1000 directories per tier so we
> > > can have about 1M of them, though they don't currently all exist.
> >
> > That's insane.
> >
> > The xfs directory structure is much, much more space, time, IO and
> > memory efficient that a directory hierachy like this. The only thing
> > you need a directory hash hierarchy for is to provide sufficient
> > concurrency for your operations, which you would probably get with a
> > single level with one or two subdirs per filesystem AG.
>
> BTW, you might want to read the section on directory block size for
> a quick introduction to XFS directory design and scalability:
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: xfs and swift
  2016-01-06 22:46 ` Mark Seger
@ 2016-01-06 23:49   ` Dave Chinner
  2016-01-25 16:38     ` Mark Seger
  0 siblings, 1 reply; 10+ messages in thread

From: Dave Chinner @ 2016-01-06 23:49 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> dave, thanks for getting back to me and the pointer to the config doc.
> lots to absorb and play with.
>
> the real challenge for me is that I'm doing testing as different levels.
> While i realize running 100 parallel swift PUT threads on a small system is
> not the ideal way to do things, it's the only easy way to get massive
> numbers of objects into the fillesystem and once there, the performance of
> a single stream is pretty poor and by instrumenting the swift code I can
> clearly see excess time being spent in creating/writing the objects and so
> that's lead us to believe the problem lies in the way xfs is configured.
> creating a new directory structure on that same mount point immediately
> results in high levels of performance.
>
> As an attempt to try to reproduce the problems w/o swift, I wrote a little
> python script that simply creates files in a 2-tier structure, the first
> tier consisting of 1024 directories and each directory contains 4096
> subdirectories into which 1K files are created.

So you created something with even greater fan-out than what your swift app is using?

> I'm doing this for 10000
> objects as a time and then timing them, reporting the times, 10 per line so
> each line represents 100 thousand file creates.
>
> Here too I'm seeing degradation and if I look at what happens when there
> are already 3M files and I write 1M more, I see these creation times/10
> thousand:
>
> 1.004236 0.961419 0.996514 1.012150 1.101794 0.999422 0.994796 1.214535 0.997276 1.306736
> 2.793429 1.201471 1.133576 1.069682 1.030985 1.096341 1.052602 1.391364 0.999480 1.914125
> 1.193892 0.967206 1.263310 0.890472 1.051962 4.253694 1.145573 1.528848 13.586892 4.925790
> 3.975442 8.896552 1.197005 3.904226 7.503806 1.294842 1.816422 9.329792 7.270323 5.936545
> 7.058685 5.516841 4.527271 1.956592 1.382551 1.510339 1.318341 13.255939 6.938845 4.106066
> 2.612064 2.028795 4.647980 7.371628 5.473423 5.823201 14.229120 0.899348 3.539658 8.501498
> 4.662593 6.423530 7.980757 6.367012 3.414239 7.364857 4.143751 6.317348 11.393067 1.273371
> 146.067300 1.317814 1.176529 1.177830 52.206605 1.112854 2.087990 42.328220 1.178436 1.335202
> 49.118140 1.368696 1.515826 44.690431 0.927428 0.920801 0.985965 1.000591 1.027458 60.650443
> 1.771318 2.690499 2.262868 1.061343 0.932998 64.064210 37.726213 1.245129 0.743771 0.996683
>
> nothing one set of 10K took almost 3 minutes!

Which is no surprise because you have slow disks and a *lot* of memory. At some point the journal and/or memory is going to fill up with dirty objects and have to block waiting for writeback. At that point there's going to be several hundred thousand dirty inodes that need to be flushed to disk before progress can be made again. That metadata writeback will be seek bound, and that's where all the delay comes from.

We've been through this problem several times now with different swift users over the past couple of years.
Please go and search the list archives, because every time the solution has been the same:

- reduce the directory hierarchy to a single level with, at most, the
  number of directories matching the expected *production* concurrency
  level
- reduce the XFS log size down to 32-128MB to limit dirty metadata
  object buildup in memory
- reduce the number of AGs to as small as necessary to maintain
  /allocation/ concurrency to limit the number of different locations
  XFS writes to the disks (typically 10-20x less than the application
  level concurrency)
- use a 3.16+ kernel with the free inode btree on-disk format feature
  to keep inode allocation CPU overhead low and consistent regardless
  of the number of inodes already allocated in the filesystem.

> my main questions at this point are is this performance expected and/or
> might a newer kernel help? and might it be possible to significantly
> improve things via tuning or is it what it is? I do realize I'm starting
> with an empty directory tree whose performance degrades as it fills, but if
> I wanted to tune for say 10M or maybe 100M files might I be able to expect

The mkfs defaults will work just fine with that many files in the filesystem. Your application configuration and data store layout is likely to be your biggest problem here.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
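At provisioning time, the checklist above translates into something like the following. This is a sketch only: the device name, log size, and agcount are illustrative, the right agcount depends on the hardware, and the exact flags should be checked against the mkfs.xfs(8) man page for the xfsprogs version in use.

```shell
# Small log (32-128MB) bounds how much dirty metadata can accumulate
# in memory before writeback; a handful of AGs limits how many disk
# regions XFS allocates from concurrently.
mkfs.xfs -f -l size=64m -d agcount=4 /dev/sdb1

# With a 3.16+ kernel and recent xfsprogs, the free inode btree keeps
# inode allocation cost flat as the filesystem fills (finobt requires
# the crc=1 v5 on-disk format):
mkfs.xfs -f -m crc=1,finobt=1 -l size=64m -d agcount=4 /dev/sdb1

mount -o noatime,nodiratime,inode64,logbsize=256k /dev/sdb1 /srv/node/disk0
```

These are one-time format decisions, so they have to be made before loading the object store; neither log size nor agcount can be changed on an existing filesystem.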
* Re: xfs and swift 2016-01-06 23:49 ` Dave Chinner @ 2016-01-25 16:38 ` Mark Seger 2016-02-01 5:27 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Mark Seger @ 2016-01-25 16:38 UTC (permalink / raw) To: Dave Chinner; +Cc: Laurence Oberman, Linux fs XFS [-- Attachment #1.1: Type: text/plain, Size: 19884 bytes --] since getting your last reply I've been doing a lot more trying to understand the behavior of what I'm seeing by writing some non-swift code that sort of does what swift does with respect to a directory structure. in my case I have 1024 top level dirs, 4096 under each. each 1k file I'm creating gets it's only directory under these so there are clearly a lot of directories. xfs writes out about 25M objects and then the performance goes into the toilet. I'm sure what you said before about having to flush data and causing big delays, but would it be continuous? each entry in the following table shows the time to write 10K files so the 2 blocks are 1M each Sat Jan 23 12:15:09 2016 16.114386 14.656736 14.789760 17.418389 14.613157 15.938176 14.865369 14.962058 17.297193 15.953590 14.895471 15.560252 14.789937 14.308618 16.390057 16.561789 15.713806 14.843791 15.940992 16.466924 15.842781 15.611230 17.102329 15.006291 14.454088 17.923662 13.378340 16.084664 15.996794 13.736398 18.125125 14.462063 18.101833 15.355139 16.603660 14.205896 16.474111 16.212237 15.072443 14.217581 16.273899 14.905624 17.285019 14.955722 13.769731 18.308619 15.601386 15.832661 14.342416 16.516657 14.697575 15.719496 16.723135 16.808668 15.443325 14.608358 17.031334 16.426377 13.900535 13.528603 16.197697 16.839241 14.802707 15.507915 14.864337 15.836943 15.660089 15.998911 13.956739 14.337318 16.416974 17.729661 14.936045 13.450859 15.943900 15.106077 15.541450 16.523752 16.555945 14.440305 14.937772 16.486544 13.780310 16.944841 14.867400 18.214934 14.142108 15.931952 14.424949 15.533156 16.010153 16.323108 14.423508 15.970071 15.277186 15.561362 14.978766 15.855935 
16.953906 14.247016 Sat Jan 23 12:41:09 2016 15.908483 15.638943 17.681281 15.188704 15.721495 13.359225 15.999421 15.858876 16.402176 16.416312 15.443946 14.675751 15.470643 15.573755 15.422241 16.336590 17.220916 13.974890 15.877780 62.650921 62.667990 46.334603 53.546195 69.465447 65.006016 68.761229 70.754684 97.571669 104.811261 104.229302 105.605257 105.166030 105.058075 105.519703 106.573306 106.708545 106.114733 105.643131 106.049387 106.379378 104.239131 104.268931 103.852929 103.549319 103.516169 103.007015 103.724020 104.519983 105.839203 105.324985 104.328205 104.932713 103.051548 104.938652 102.769383 102.851609 101.432277 102.269842 100.937972 103.450103 103.477628 103.636130 103.444242 103.023145 102.565047 102.853115 101.402610 98.928230 99.310677 99.669667 101.140554 99.628664 102.093801 100.580659 101.762283 101.369349 102.637014 102.240950 101.778506 101.144526 100.899476 102.294952 102.029285 100.871166 102.763222 102.910690 104.892447 104.748194 105.403636 106.159345 106.413154 104.626632 105.775004 104.579775 104.778526 104.634778 106.233381 104.063642 106.635481 104.314503 if I look at the disk loads at the time, I see a dramatic increase in disk reads that correspond to the slow writes so I'm guessing at least some writes are waiting in the queue as you can see there - thanks to laurence for the patch to show disk read wait times ;) # DISK STATISTICS (/sec) # <---------reads---------------><---------writes--------------><--------averages--------> Pct #Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util 12:45:30 sdb 0 0 0 0 0 270040 105 2276 119 4 118 16 4 0 62 12:45:31 sdb 0 0 0 0 0 273776 120 2262 121 4 121 18 4 0 57 12:45:32 sdb 4 0 1 4 0 100164 57 909 110 4 110 6 4 0 84 12:45:33 sdb 0 0 0 0 0 229992 87 1924 120 1 119 2 1 0 68 12:45:34 sdb 4 0 1 4 4 153528 59 1304 118 0 117 1 0 0 78 12:45:35 sdb 0 0 0 0 0 220896 97 1895 117 1 116 1 1 0 62 12:45:36 sdb 0 0 0 0 0 419084 197 3504 120 0 119 1 0 0 32 
# DISK STATISTICS (/sec)
#          <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:45:37 sdb 0 0 0 0 0 428076 193 3662 117 0 116 1 0 0 32
12:45:38 sdb 0 0 0 0 0 428492 181 3560 120 0 120 1 0 0 30
12:45:39 sdb 0 0 0 0 0 426024 199 3641 117 0 117 1 0 0 32
12:45:40 sdb 0 0 0 0 0 429764 200 3589 120 0 119 1 0 0 28
12:45:41 sdb 0 0 0 0 0 410204 165 3430 120 0 119 3 0 0 36
12:45:42 sdb 0 0 0 0 0 406192 196 3437 118 0 118 5 0 0 39
12:45:43 sdb 0 0 0 0 0 420952 175 3552 119 0 118 1 0 0 34
12:45:44 sdb 0 0 0 0 0 428424 197 3645 118 0 117 1 0 0 31
12:45:45 sdb 0 0 0 0 0 192464 76 1599 120 8 120 18 8 0 75
12:45:46 sdb 0 0 0 0 0 340522 205 2951 115 2 115 16 2 0 41
12:45:47 sdb 0 0 0 0 0 429128 193 3664 117 0 117 1 0 0 28
12:45:48 sdb 0 0 0 0 0 402600 164 3311 122 0 121 3 0 0 39
12:45:49 sdb 0 0 0 0 0 435316 195 3701 118 0 117 1 0 0 36
12:45:50 sdb 0 0 0 0 0 367976 162 3152 117 1 116 7 1 0 46
12:45:51 sdb 0 0 0 0 0 255716 125 2153 119 4 118 16 4 0 60
12:45:52 sdb 0 0 0 0 0 360144 149 3006 120 1 119 9 1 0 46
12:45:53 sdb 0 0 0 0 0 343500 162 2909 118 1 118 11 1 0 43
12:45:54 sdb 0 0 0 0 0 256636 119 2188 117 2 117 11 2 0 54
12:45:55 sdb 0 0 0 0 0 149000 47 1260 118 14 118 22 14 0 79
12:45:56 sdb 0 0 0 0 0 198544 88 1654 120 7 120 19 7 0 67
12:45:57 sdb 0 0 0 0 0 320688 151 2731 117 1 117 8 1 0 53
12:45:58 sdb 0 0 0 0 0 422176 190 3532 120 0 119 1 0 0 32
12:45:59 sdb 0 0 0 0 0 266540 115 2233 119 5 119 13 5 0 93
12:46:00 sdb 8 0 2 4 690 291116 129 2463 118 3 118 9 3 0 82
12:46:01 sdb 0 0 0 0 0 249964 118 2160 116 4 115 15 4 0 60
12:46:02 sdb 4736 0 37 128 0 424680 167 3522 121 0 120 1 0 0 28
12:46:03 sdb 5016 0 42 119 0 391364 196 3344 117 0 117 6 0 0 34
12:46:04 sdb 0 0 0 0 0 415436 172 3501 119 0 118 2 0 0 33
12:46:05 sdb 0 0 0 0 0 398736 192 3373 118 0 118 3 0 0 39
12:46:06 sdb 0 0 0 0 0 367292 155 3015 122 0 121 6 0 0 39
12:46:07 sdb 0 0 0 0 0 420392 201 3614 116 0 116 1 0 0 30
12:46:08 sdb 0 0 0 0 0 424828 172 3547 120 0 119 1 0 0 32
12:46:09 sdb 0 0 0 0 0 500380 234 4277 117 0 116 2 0 0 34
12:46:10 sdb 0 0 0 0 0 104500 7 698 150 0 149 1 0 1 87
12:46:11 sdb 8 0 1 8 1260 77252 45 647 119 0 119 1 2 1 92
12:46:12 sdb 8 0 1 8 1244 73956 31 615 120 0 120 1 2 1 94
12:46:13 sdb 8 0 1 8 228 149552 64 1256 119 0 118 1 0 0 85
12:46:14 sdb 8 0 1 8 1232 37124 28 319 116 0 116 1 3 3 99
12:46:15 sdb 16 0 2 8 720 2776 23 120 23 1 22 1 13 8 99
12:46:16 sdb 0 0 0 0 0 108180 16 823 131 0 131 1 0 1 90
12:46:17 sdb 8 0 1 8 1260 37136 28 322 115 0 114 1 3 2 94
12:46:18 sdb 8 0 1 8 1252 108680 57 875 124 0 124 1 1 1 88
12:46:19 sdb 0 0 0 0 0 0 0 0 0 0 0 1 0 0 100
12:46:20 sdb 16 0 2 8 618 81516 49 685 119 0 118 1 1 1 94
12:46:21 sdb 16 0 2 8 640 225788 106 1907 118 0 118 1 0 0 75
12:46:22 sdb 32 0 4 8 95 73892 17 627 118 0 117 1 0 1 93
12:46:23 sdb 24 0 3 8 408 257012 119 2171 118 0 118 1 0 0 65
12:46:24 sdb 12 0 3 4 5 3608 0 20 180 0 157 1 0 43 100
12:46:25 sdb 44 0 7 6 210 74072 41 625 119 0 117 1 2 1 97
12:46:26 sdb 48 0 6 8 216 202852 112 1819 112 0 111 1 0 0 92
12:46:27 sdb 52 0 7 7 233 307156 137 2648 116 0 115 1 0 0 95
12:46:28 sdb 16 0 2 8 100 93168 7 638 146 0 145 1 0 1 97
12:46:29 sdb 16 0 2 8 642 37028 16 319 116 0 115 1 4 3 99
12:46:30 sdb 16 0 2 8 624 39068 36 342 114 0 113 1 3 2 99
12:46:31 sdb 80 0 10 8 94 253892 105 2169 117 0 116 1 0 0 84
12:46:32 sdb 0 0 0 0 0 5676 0 33 172 0 172 1 0 30 100
12:46:33 sdb 16 0 2 8 642 69236 28 583 119 0 118 1 2 1 96
12:46:34 sdb 8 0 1 8 1032 37132 30 315 118 0 117 1 3 3 100
12:46:35 sdb 16 0 2 8 822 56292 15 515 109 0 108 1 3 1 100
12:46:36 sdb 8 0 1 8 44 58768 15 452 130 0 129 1 0 2 96
12:46:37 sdb 28 0 4 7 390 114944 89 1100 104 0 104 1 1 0 88
12:46:38 sdb 0 0 0 0 0 29668 0 172 172 12 172 1 12 5 98
12:46:39 sdb 80 0 10 8 90 100084 31 882 113 0 112 1 1 1 91
12:46:40 sdb 0 0 0 0 0 24244 0 139 174 0 174 1 0 7 100
12:46:41 sdb 8 0 1 8 1224 0 0 0 0 0 8 1 1224 1000 100
12:46:42 sdb 8 0 1 8 1244 42368 29 354 120 0 119 1 3 2 96
12:46:43 sdb 36 0 5 7 251 51428 32 507 101 0 100 1 2 1 94
12:46:44 sdb 24 0 3 8 70 5732 31 147 39 15 38 2 16 6 99
12:46:45 sdb 32 0 4 8 4 213056 53 1647 129 0 129 1 0 0 74
12:46:46 sdb 8 0 1 8 1220 37416 28 328 114 0 113 1 3 2 96
12:46:47 sdb 8 0 1 8 1248 58572 67 607 96 0 96 1 2 1 93
12:46:48 sdb 40 0 5 8 84 274808 82 2173 126 0 126 1 0 0 70
12:46:49 sdb 0 0 0 0 0 0 0 0 0 0 0 1 0 0 100
12:46:50 sdb 8 0 1 8 1248 0 0 0 0 0 8 1 1248 1000 100
12:46:51 sdb 8 0 1 8 1272 0 0 0 0 0 8 1 1272 1000 100
12:46:52 sdb 24 0 3 8 414 205240 113 1798 114 0 113 1 0 0 75
12:46:53 sdb 8 0 1 8 876 92476 48 839 110 0 110 1 1 1 89
12:46:54 sdb 0 0 0 0 0 38700 0 225 172 0 172 1 0 4 99
12:46:55 sdb 16 0 2 8 582 150680 73 1262 119 0 119 1 1 0 87
12:46:56 sdb 8 0 1 8 1228 0 0 0 0 0 8 1 1228 1000 100
12:46:57 sdb 8 0 1 8 1244 0 0 0 0 0 8 1 1244 1000 100

next I played back the collectl process data and sorted by disk reads and
discovered the top process, corresponding to the long disk reads, was
xfsaild. btw - I also see the slab xfs_inode using about 60GB.

It's also worth noting that I'm only doing 1-2MB/sec of writes and the rest
of the data looks like it's coming from xfs journaling, because when I look
at the xfs stats I'm seeing on the order of 200-400MB/sec of xfs logging
writes - clearly they're not all going to disk. Once the read waits
increase, everything slows down, including xfs logging (since it's doing
less).
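Mark's ~60GB xfs_inode figure comes from the kernel's slab allocator. A quick way to compute it is to multiply active objects by object size from /proc/slabinfo; the helper below is a hypothetical sketch (not part of collectl) that assumes the standard slabinfo column layout, and the sample line is made up for illustration:

```python
def slab_usage(slabinfo_text, prefix="xfs"):
    """Return {cache_name: active_bytes} for slab caches whose name
    starts with prefix, given text in /proc/slabinfo format."""
    usage = {}
    for line in slabinfo_text.splitlines():
        if not line.startswith(prefix):
            continue
        f = line.split()
        # columns: name, active_objs, num_objs, objsize, ...
        usage[f[0]] = int(f[1]) * int(f[3])
    return usage

# On a live box you would feed it open("/proc/slabinfo").read()
# (usually root-only); this sample line is invented for illustration:
sample = "xfs_inode  54000000 54000000   1088   29    8 : tunables 0 0 0"
print(slab_usage(sample))
```

With 1088-byte inodes, 54 million cached inodes is already ~55GiB, which is consistent with the slab numbers reported above.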
I'm sure the simple answer may be that it is what it is, but I'm also
wondering whether, without changes to swift itself, there might be some
ways to improve the situation by adding more memory or making other tuning
changes? The system I'm currently running my tests on has 128GB.

-mark

On Wed, Jan 6, 2016 at 6:49 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> > dave, thanks for getting back to me and the pointer to the config doc.
> > lots to absorb and play with.
> >
> > the real challenge for me is that I'm doing testing at different levels.
> > While I realize running 100 parallel swift PUT threads on a small
> > system is not the ideal way to do things, it's the only easy way to get
> > massive numbers of objects into the filesystem, and once there, the
> > performance of a single stream is pretty poor. By instrumenting the
> > swift code I can clearly see excess time being spent in
> > creating/writing the objects, and so that's led us to believe the
> > problem lies in the way xfs is configured. Creating a new directory
> > structure on that same mount point immediately results in high levels
> > of performance.
> >
> > As an attempt to try to reproduce the problems w/o swift, I wrote a
> > little python script that simply creates files in a 2-tier structure,
> > the first tier consisting of 1024 directories, each directory
> > containing 4096 subdirectories into which 1K files are created.
>
> So you created something with even greater fan-out than what your
> swift app is using?
>
> > I'm doing this for 10000 objects at a time and then timing them,
> > reporting the times, 10 per line, so each line represents 100 thousand
> > file creates.
> >
> > Here too I'm seeing degradation, and if I look at what happens when
> > there are already 3M files and I write 1M more, I see these creation
> > times per 10 thousand:
> >
> >   1.004236   0.961419   0.996514   1.012150   1.101794   0.999422   0.994796   1.214535   0.997276   1.306736
> >   2.793429   1.201471   1.133576   1.069682   1.030985   1.096341   1.052602   1.391364   0.999480   1.914125
> >   1.193892   0.967206   1.263310   0.890472   1.051962   4.253694   1.145573   1.528848  13.586892   4.925790
> >   3.975442   8.896552   1.197005   3.904226   7.503806   1.294842   1.816422   9.329792   7.270323   5.936545
> >   7.058685   5.516841   4.527271   1.956592   1.382551   1.510339   1.318341  13.255939   6.938845   4.106066
> >   2.612064   2.028795   4.647980   7.371628   5.473423   5.823201  14.229120   0.899348   3.539658   8.501498
> >   4.662593   6.423530   7.980757   6.367012   3.414239   7.364857   4.143751   6.317348  11.393067   1.273371
> > 146.067300   1.317814   1.176529   1.177830  52.206605   1.112854   2.087990  42.328220   1.178436   1.335202
> >  49.118140   1.368696   1.515826  44.690431   0.927428   0.920801   0.985965   1.000591   1.027458  60.650443
> >   1.771318   2.690499   2.262868   1.061343   0.932998  64.064210  37.726213   1.245129   0.743771   0.996683
> >
> > note that one set of 10K took almost 3 minutes!
>
> Which is no surprise because you have slow disks and a *lot* of
> memory. At some point the journal and/or memory is going to fill up
> with dirty objects and have to block waiting for writeback. At that
> point there's going to be several hundred thousand dirty inodes that
> need to be flushed to disk before progress can be made again. That
> metadata writeback will be seek bound, and that's where all the
> delay comes from.
>
> We've been through this problem several times now with different
> swift users over the past couple of years.
> Please go and search the list archives, because every time the solution
> has been the same:
>
>     - reduce the directory hierarchy to a single level with, at
>       most, the number of directories matching the expected
>       *production* concurrency level
>     - reduce the XFS log size down to 32-128MB to limit dirty
>       metadata object buildup in memory
>     - reduce the number of AGs to as small as necessary to
>       maintain /allocation/ concurrency to limit the number of
>       different locations XFS writes to the disks (typically
>       10-20x less than the application level concurrency)
>     - use a 3.16+ kernel with the free inode btree on-disk
>       format feature to keep inode allocation CPU overhead low
>       and consistent regardless of the number of inodes already
>       allocated in the filesystem.
>
> > my main questions at this point are: is this performance expected,
> > and/or might a newer kernel help? and might it be possible to
> > significantly improve things via tuning, or is it what it is? I do
> > realize I'm starting with an empty directory tree whose performance
> > degrades as it fills, but if I wanted to tune for say 10M or maybe
> > 100M files might I be able to expect
>
> The mkfs defaults will work just fine with that many files in the
> filesystem. Your application configuration and data store layout is
> likely to be your biggest problem here.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

[-- Attachment #1.2: Type: text/html, Size: 29999 bytes --]
[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 10+ messages in thread
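The kind of test script Mark describes can be sketched in a few lines. This is an illustrative reconstruction, not his actual code: the function name is made up and the directory counts default to far smaller values than his 1024 x 4096 layout.

```python
import os
import time

def create_files(root, nbatches, batch=1000, top=16, sub=16, size=1024):
    """Create `batch` small files per batch under a two-tier directory
    structure, returning the elapsed seconds for each batch."""
    times, n = [], 0
    for _ in range(nbatches):
        t0 = time.time()
        for _ in range(batch):
            d = os.path.join(root, "d%04d" % (n % top), "s%04d" % (n % sub))
            os.makedirs(d, exist_ok=True)
            with open(os.path.join(d, "obj%08d" % n), "wb") as f:
                f.write(b"x" * size)  # 1K objects, like the swift test
            n += 1
        times.append(time.time() - t0)
    return times
```

Printing the returned times ten per line reproduces the report format in the tables above; the degradation shows up as a sudden jump in batch times once dirty metadata no longer fits in memory.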
* Re: xfs and swift
  2016-01-25 16:38 ` Mark Seger
@ 2016-02-01  5:27   ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2016-02-01 5:27 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Mon, Jan 25, 2016 at 11:38:07AM -0500, Mark Seger wrote:
> since getting your last reply I've been doing a lot more work trying to
> understand the behavior of what I'm seeing by writing some non-swift code
> that sort of does what swift does with respect to a directory structure.
> In my case I have 1024 top-level dirs with 4096 under each. Each 1K file
> I'm creating gets its own directory under these, so there are clearly a
> lot of directories.

I'm not sure you understood what I said in my last reply: your
directory structure is the problem, and that's what needs changing.

> xfs writes out about 25M objects and then the performance goes into the
> toilet. I'm sure what you said before about having to flush data and
> causing big delays applies, but would it be continuous?

Go read the previous thread on this subject. Or, alternatively, try
some of the suggestions I made, like reducing the log size, to see
how this affects such behaviour.

> each entry in the following table shows the time to write 10K files,
> so the 2 blocks are 1M each
>
> Sat Jan 23 12:15:09 2016
>   16.114386  14.656736  14.789760  17.418389  14.613157  15.938176  14.865369  14.962058  17.297193  15.953590
> .....
>   62.667990  46.334603  53.546195  69.465447  65.006016  68.761229  70.754684  97.571669 104.811261 104.229302
>  105.605257 105.166030 105.058075 105.519703 106.573306 106.708545 106.114733 105.643131 106.049387 106.379378

Your test goes from operating wholly in memory to being limited by
disk speed because it no longer fits in memory.

> if I look at the disk loads at the time, I see a dramatic increase in
> disk reads that correspond to the slow writes so I'm guessing at least
> some

.....
> next I played back the collectl process data and sorted by disk reads and
> discovered the top process, corresponding to the long disk reads, was
> xfsaild. btw - I also see the slab xfs_inode using about 60GB.

And there's your problem. You're accumulating gigabytes of dirty
inodes in memory, then wondering why everything goes to crap when
memory fills up and we have to start cleaning inodes. To clean those
inodes, we have to do RMW cycles on the inode cluster buffers,
because the inode cache memory pressure has caused the inode buffers
to be reclaimed from memory before the cached dirty inodes are
written.

All the changes I recommended you make also happen to address this
problem, too....

> It's also worth noting that I'm only doing 1-2MB/sec of writes and the
> rest of the data looks like it's coming from xfs journaling, because
> when I look at the xfs stats I'm seeing on the order of 200-400MB/sec of
> xfs logging writes - clearly they're not all going to disk.

Before delayed logging was introduced 5 years ago, it was quite
common to see XFS writing >500MB/s to the journal. The thing is,
your massive fan-out directory structure is mostly going to defeat
the relogging optimisations that make delayed logging work, so it's
entirely possible that you are seeing this much throughput through
the journal.

> Once the read waits increase everything slows down, including xfs
> logging (since it's doing less).

Of course, because we can't journal more changes until the dirty
inodes in the journal are cleaned. That's what the xfsaild does -
clean dirty inodes, and the reads coming from that thread are for
cleaning inodes...

> I'm sure the simple answer may be that it is what it is, but I'm also
> wondering, without changes to swift itself, might there be some ways to
> improve the situation by adding more memory or making any other tuning
> changes? The system I'm currently running my tests on has 128GB.
I've already described what you need to do to both the swift
directory layout and the XFS filesystem configuration to minimise
the impact of storing millions of tiny records in a filesystem. I'll
leave the quote from my last email for you:

> > We've been through this problem several times now with different
> > swift users over the past couple of years. Please go and search the
> > list archives, because every time the solution has been the same:
> >
> >     - reduce the directory hierarchy to a single level with, at
> >       most, the number of directories matching the expected
> >       *production* concurrency level
> >     - reduce the XFS log size down to 32-128MB to limit dirty
> >       metadata object buildup in memory
> >     - reduce the number of AGs to as small as necessary to
> >       maintain /allocation/ concurrency to limit the number of
> >       different locations XFS writes to the disks (typically
> >       10-20x less than the application level concurrency)
> >     - use a 3.16+ kernel with the free inode btree on-disk
> >       format feature to keep inode allocation CPU overhead low
> >       and consistent regardless of the number of inodes already
> >       allocated in the filesystem.

-Dave.
--
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 10+ messages in thread
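Concretely, the four recommendations above translate into something like the following. These values are illustrative only: the log size, AG count, and device/mount-point names have to be chosen for the real hardware, and the free inode btree (finobt) needs xfsprogs >= 3.2 at mkfs time plus a 3.16+ kernel:

```shell
# small log caps how much dirty metadata can accumulate before
# writeback; a low agcount limits the number of seek targets on
# a single spindle; crc=1,finobt=1 enables the free inode btree
mkfs.xfs -f -d agcount=4 -l size=64m -m crc=1,finobt=1 /dev/sdb1
mount -o noatime,nodiratime,inode64,logbsize=256k /dev/sdb1 /srv/node/disk0
```

The directory-layout change (one level of directories, sized to production concurrency) has to be made in the application and cannot be fixed from mkfs.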
* Re: xfs and swift
  2016-01-06 15:15 xfs and swift Mark Seger
  2016-01-06 22:04 ` Dave Chinner
@ 2016-01-25 18:24 ` Bernd Schubert
  2016-01-25 19:00   ` Mark Seger
  1 sibling, 1 reply; 10+ messages in thread
From: Bernd Schubert @ 2016-01-25 18:24 UTC (permalink / raw)
To: Mark Seger, Linux fs XFS; +Cc: Laurence Oberman

Hi Mark!

On 01/06/2016 04:15 PM, Mark Seger wrote:
> I've recently found the performance of our development swift system is
> degrading over time as the number of objects/files increases. This is a
> relatively small system; each server has 3 400GB disks. The system I'm
> currently looking at has about 70GB tied up in slabs alone, close to
> 55GB in xfs inodes and ili, and about 2GB free. The kernel is
> 3.14.57-1-amd64-hlinux.
>
> Here's the way the filesystems are mounted:
>
> /dev/sdb1 on /srv/node/disk0 type xfs
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
>
> I can do about 2000 1K file creates/sec when running 2 minute PUT tests
> at 100 threads. If I repeat that test for multiple hours, I see the
> number of IOPS steadily decreasing to about 770, and the very next run
> it drops to 260 and continues to fall from there. This happens at about
> 12M files.
>
> The directory structure is 2-tiered, with 1000 directories per tier so
> we can have about 1M of them, though they don't currently all exist.

This sounds pretty much like hash directories as used by some parallel
file systems (Lustre and, in the past, BeeGFS). For us the file create
slowdown was due to the lookup in directories of whether a file with the
same name already exists. At least for ext4 it was rather easy to
demonstrate that simply caching directory blocks would eliminate that
issue.
We then considered working on a better kernel cache, but in the end we
simply found a way to get rid of such a simple directory structure in
BeeGFS and changed it to a more complex layout, but with less random
access, and so we could eliminate the main reason for the slowdown.

Now I have no idea what a "swift system" is, in which order it creates
and accesses those files, or whether it would be possible to change the
access pattern. One thing you might try, and which should work much
better since 3.11, is the vfs_cache_pressure setting. The lower it is,
the fewer dentries/inodes are dropped from cache when pages are needed
for file data.

Cheers,
Bernd

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 10+ messages in thread
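Bernd's vfs_cache_pressure suggestion is a one-line sysctl. The value 50 below is only an example starting point: lower values make the kernel prefer keeping dentries/inodes cached over page-cache data, and 0 means never reclaim them, which risks running out of memory:

```shell
sysctl vm.vfs_cache_pressure                  # show current value (default 100)
sysctl -w vm.vfs_cache_pressure=50            # favour metadata caching
echo "vm.vfs_cache_pressure = 50" >> /etc/sysctl.conf   # persist across reboots
```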
* Re: xfs and swift
  2016-01-25 18:24 ` Bernd Schubert
@ 2016-01-25 19:00   ` Mark Seger
  2016-01-25 19:33     ` Bernd Schubert
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-25 19:00 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Laurence Oberman, Linux fs XFS

[-- Attachment #1.1: Type: text/plain, Size: 2934 bytes --]

hey bernd, long time no chat. it turns out you don't have to know what
swift is, because I've been able to demonstrate this behavior with a very
simple python script that simply creates files in a 3-tier hierarchy. The
third-level directories each contain a single file, which for my testing
are all 1K.

I have played with cache_pressure and it doesn't seem to make a
difference, though that was a while ago and perhaps it is worth
revisiting. One thing you may get a hoot out of, being a collectl user, is
that I have an xfs plugin that lets you look at a ton of xfs stats either
in realtime or after the fact, just like any other collectl stat. I just
haven't added it to the kit yet.

-mark

On Mon, Jan 25, 2016 at 1:24 PM, Bernd Schubert <bschubert@ddn.com> wrote:
> Hi Mark!
>
> On 01/06/2016 04:15 PM, Mark Seger wrote:
> > I've recently found the performance of our development swift system is
> > degrading over time as the number of objects/files increases. This is
> > a relatively small system; each server has 3 400GB disks. The system
> > I'm currently looking at has about 70GB tied up in slabs alone, close
> > to 55GB in xfs inodes and ili, and about 2GB free. The kernel is
> > 3.14.57-1-amd64-hlinux.
> >
> > Here's the way the filesystems are mounted:
> >
> > /dev/sdb1 on /srv/node/disk0 type xfs
> > (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> >
> > I can do about 2000 1K file creates/sec when running 2 minute PUT
> > tests at 100 threads.
> > If I repeat that test for multiple hours, I see the number of IOPS
> > steadily decreasing to about 770, and the very next run it drops to
> > 260 and continues to fall from there. This happens at about 12M files.
> >
> > The directory structure is 2-tiered, with 1000 directories per tier so
> > we can have about 1M of them, though they don't currently all exist.
>
> This sounds pretty much like hash directories as used by some parallel
> file systems (Lustre and, in the past, BeeGFS). For us the file create
> slowdown was due to the lookup in directories of whether a file with the
> same name already exists. At least for ext4 it was rather easy to
> demonstrate that simply caching directory blocks would eliminate that
> issue.
> We then considered working on a better kernel cache, but in the end we
> simply found a way to get rid of such a simple directory structure in
> BeeGFS and changed it to a more complex layout, but with less random
> access, and so we could eliminate the main reason for the slowdown.
>
> Now I have no idea what a "swift system" is, in which order it creates
> and accesses those files, or whether it would be possible to change the
> access pattern. One thing you might try, and which should work much
> better since 3.11, is the vfs_cache_pressure setting. The lower it is,
> the fewer dentries/inodes are dropped from cache when pages are needed
> for file data.
>
> Cheers,
> Bernd

[-- Attachment #1.2: Type: text/html, Size: 3529 bytes --]
[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: xfs and swift
  2016-01-25 19:00   ` Mark Seger
@ 2016-01-25 19:33     ` Bernd Schubert
  0 siblings, 0 replies; 10+ messages in thread
From: Bernd Schubert @ 2016-01-25 19:33 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

Hi Mark!

On 01/25/2016 08:00 PM, Mark Seger wrote:
> hey bernd, long time no chat. it turns out you don't have to know what
> swift is, because I've been able to demonstrate this behavior with a
> very simple python script that simply creates files in a 3-tier
> hierarchy. The third-level directories each contain a single file, which
> for my testing are all 1K.

So what exactly is the script doing? Does it create those files
sequentially per dir or randomly between those dirs?

Btw, I had been talking about that issue at Linux Plumbers in 2013:
https://www.youtube.com/watch?v=N_bZOGZAb-Y

> I have played with cache_pressure and it doesn't seem to make a
> difference, though that was a while ago and perhaps it is worth
> revisiting.

There are several patches from Mel Gorman in 3.11 which really made a
difference for me. So unless you tested with >= 3.11 you should
probably re-test.

> one thing you may get a hoot out of, being a collectl user, is I have
> an xfs plugin that lets you look at a ton of xfs stats either in
> realtime or after the fact, just like any other collectl stat. I just
> haven't added it to the kit yet.

Hmm, I currently don't have a good test system for that. I'm working on
an entirely different project now, and while it is also a parallel file
system, it does not have a linux file system in between but has its own
(log-rotated) layout.

Cheers,
Bernd

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2016-02-01  5:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-06 15:15 xfs and swift Mark Seger
2016-01-06 22:04 ` Dave Chinner
2016-01-06 22:10   ` Dave Chinner
2016-01-06 22:46   ` Mark Seger
2016-01-06 23:49     ` Dave Chinner
2016-01-25 16:38       ` Mark Seger
2016-02-01  5:27         ` Dave Chinner
2016-01-25 18:24 ` Bernd Schubert
2016-01-25 19:00   ` Mark Seger
2016-01-25 19:33     ` Bernd Schubert