* xfs and swift
@ 2016-01-06 15:15 Mark Seger
2016-01-06 22:04 ` Dave Chinner
2016-01-25 18:24 ` Bernd Schubert
0 siblings, 2 replies; 10+ messages in thread
From: Mark Seger @ 2016-01-06 15:15 UTC (permalink / raw)
To: Linux fs XFS; +Cc: Laurence Oberman
[-- Attachment #1.1: Type: text/plain, Size: 2381 bytes --]
I've recently found that the performance of our development swift system is
degrading over time as the number of objects/files increases. This is a
relatively small system; each server has 3 400GB disks. The system I'm
currently looking at has about 70GB tied up in slabs alone, close to 55GB
of that in xfs inode and ili caches, and about 2GB free. The kernel
is 3.14.57-1-amd64-hlinux.
Here's the way the filesystems are mounted:
/dev/sdb1 on /srv/node/disk0 type xfs
(rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
100 threads. If I repeat that test for multiple hours, I see the number
of IOPS steadily decreasing to about 770, and on the very next run it drops to
260 and continues to fall from there. This happens at about 12M files.
The directory structure is 2-tiered, with 1000 directories per tier, so we
can have about 1M of them, though they don't currently all exist.
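For illustration, a hashed 2-tier layout like the one described above can be sketched in a few lines of python (this is only a sketch of the layout, not swift's actual partition scheme, and the names are made up):

```python
import hashlib

def object_path(name, fanout=1000):
    # Map an object name to a tier1/tier2/name path; with fanout=1000
    # this gives up to ~1M leaf directories, as described above.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    tier1 = h % fanout
    tier2 = (h // fanout) % fanout
    return "%03d/%03d/%s" % (tier1, tier2, name)

print(object_path("object-0001"))
```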
I've written a collectl plugin that lets me watch many of the xfs stats in
real-time, and I also have a test script that exercises the swift PUT code
directly and so eliminates all the inter-node communications. This script
also allows me to write to the existing swift directories as well as
redirect to an empty structure, so it mimics a clean environment with no existing
subdirectories.
I'm attaching some xfs stats during the run and hope they're readable.
These values are in operations/sec and each line is 1 second's worth of
data. The first set of numbers is on the clean directory and the second on
the existing 12M file one. At the bottom of these stats are also the xfs
slab allocations as reported by collectl. I can also watch these during a
test and can see the number of inode and ili objects steadily grow at about
1K/sec, which is curious since I'm only creating about 300/sec.
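For anyone who wants to watch the same counters without collectl, here's a minimal python sketch that pulls the xfs caches out of /proc/slabinfo-style text. It assumes the usual slabinfo column order (name, active objects, total objects, ...); the sample values below are made up for illustration:

```python
def xfs_slab_counts(slabinfo_text):
    # Return {cache_name: active_objects} for the xfs_* slab caches,
    # assuming the usual /proc/slabinfo layout: name active total objsize ...
    counts = {}
    for line in slabinfo_text.splitlines():
        if line.startswith("xfs_"):
            fields = line.split()
            counts[fields[0]] = int(fields[1])
    return counts

# made-up sample in slabinfo format; on a live box use open("/proc/slabinfo").read()
sample = "xfs_ili 48127000 48197000 152 26 1\nxfs_inode 48210000 48244000 1024 4 1"
print(xfs_slab_counts(sample))
```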
If there is anything else I can provide just let me know.
I don't fully understand all the xfs stats, but what does jump out at me is
that the XFS read/write ops have increased by a factor of about 5 when the
system is slower. Right now the collectl plugin is not something I've
released, but if there is interest and someone would like to help me
present the data in a more organized/meaningful manner, just let me know.
If there are any tuning suggestions I'm more than happy to try them out.
-mark
[-- Attachment #1.2: Type: text/html, Size: 2715 bytes --]
[-- Attachment #2: tests.txt --]
[-- Type: text/plain, Size: 6565 bytes --]
>>> Fast <<<
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><----Xstrat---><-------AttrOps-----><--------------INodes-------------->
# Write Reads Writes WrtKBs NoRoom Force Sleep ExtA BlkA ExtF ExtF Look Cre8 Remv Gdnt Sync Asyn Empt Quick Split Gets Sets Rmov List Atpt Hit Miss Recy Dup Recl Chgd
53 599 4 1024 0 4 4 3 65 10 70 155 6 14 284 0 75 1 2 0 275 1 0 24 0 149 5 0 0 0 1
92 836 16 4096 0 16 16 24 117 8 98 200 58 18 272 0 181 1 3 0 235 11 0 10 0 152 39 0 0 15 1
370 732 295 75520 0 295 295 592 685 6 96 1599 1442 293 829 0 3527 1 9 0 244 290 0 13 0 153 1144 0 0 0 1
383 837 284 72704 0 284 285 559 683 10 130 1532 1352 284 816 0 3343 0 4 0 236 276 0 10 0 155 1073 0 0 9 0
341 734 289 73984 0 289 289 583 690 8 68 1574 1393 297 860 0 3472 3 6 0 291 288 0 30 0 143 1105 0 0 0 3
342 812 291 74496 0 291 291 583 720 6 66 1574 1376 294 840 0 3439 2 2 0 261 289 0 19 0 144 1087 0 0 0 2
427 415 301 77056 0 301 302 598 843 14 164 1613 1391 305 870 0 3531 1 5 0 279 292 0 26 0 163 1090 0 0 0 1
401 832 302 77312 0 302 303 598 797 10 130 1604 1390 303 862 0 3522 1 4 0 244 295 0 13 0 148 1093 0 0 90 1
349 384 275 70400 0 275 275 549 717 10 100 1480 1258 281 814 0 3224 1 4 0 251 270 0 15 0 146 985 0 0 0 1
79 432 6 1536 0 6 6 9 102 6 96 158 3 3 250 0 47 0 9 0 248 0 0 14 0 156 2 0 0 0 0
54 253 4 1024 0 4 4 2 64 4 64 157 2 2 274 0 23 0 2 0 284 0 0 26 0 156 1 0 0 0 0
>>> Slow <<<
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><----Xstrat---><-------AttrOps-----><--------------INodes-------------->
# Write Reads Writes WrtKBs NoRoom Force Sleep ExtA BlkA ExtF ExtF Look Cre8 Remv Gdnt Sync Asyn Empt Quick Split Gets Sets Rmov List Atpt Hit Miss Recy Dup Recl Chgd
0 61 0 0 0 0 0 0 0 0 0 132 0 0 218 0 0 0 0 0 213 0 0 0 0 126 6 0 0 0 0
59 115 11 2816 0 11 11 16 78 4 65 160 33 9 230 0 104 0 2 0 210 7 0 0 0 128 28 0 0 0 0
1384 1263 272 69632 0 272 272 423 1998 92 1443 875 872 227 576 0 2639 0 45 0 210 182 0 0 0 153 675 0 0 4 0
1604 1503 294 75264 0 294 294 438 2201 106 1696 907 890 241 590 0 2772 0 53 0 210 188 0 0 0 151 681 0 0 9 0
1638 2255 309 79104 0 309 307 460 2314 114 1734 946 934 260 632 0 2942 0 54 0 237 199 0 0 0 193 678 0 0 0 0
1678 2298 337 86272 0 338 330 486 2326 128 1779 1031 987 291 712 0 3168 0 55 0 284 220 0 4 0 189 712 0 0 0 0
1578 2423 333 85248 0 332 325 492 2268 118 1649 1041 991 289 714 0 3153 0 51 0 270 222 0 0 0 200 700 0 0 0 0
1040 1861 239 61184 0 241 231 353 1496 82 1072 774 718 212 588 0 2272 0 33 0 264 164 0 0 0 153 524 0 0 1730 0
1709 2361 336 86016 0 340 331 485 2401 128 1810 1029 969 291 706 0 3137 0 56 0 278 216 0 4 0 174 722 0 0 4751 0
1616 2063 325 83200 0 326 321 485 2278 114 1707 1038 973 274 674 1 3067 0 53 0 240 214 0 3 0 173 731 0 0 0 0
1599 1557 312 79872 0 312 313 482 2274 104 1664 1048 962 260 684 0 2980 0 52 0 224 208 0 7 0 219 681 0 0 0 0
1114 1312 229 58624 0 229 230 356 1577 72 1152 817 709 192 552 0 2188 0 36 0 216 157 0 3 0 165 524 0 0 3570 0
1066 1185 175 44800 0 176 176 249 1440 72 1153 585 466 141 440 0 1577 0 36 0 214 104 0 2 0 155 339 0 0 4497 0
54 487 6 1536 0 6 6 2 64 4 64 137 2 2 238 0 24 0 2 0 216 0 0 3 0 135 2 0 0 2664 0
0 590 0 0 0 0 0 0 0 0 0 136 0 0 224 0 0 0 0 0 210 0 0 0 0 136 0 0 0 0 0
54 514 4 1024 0 4 4 2 64 4 64 142 2 2 248 0 24 0 2 0 214 0 0 2 0 140 2 0 0 0 0
>>> slabs <<<
stack@helion-cp1-swobj0001-mgmt:~$ sudo collectl -sY -i:1 -c1 --slabfilt xfs
waiting for 1 second sample...
# SLAB DETAIL
# <-----------Objects----------><---------Slab Allocation------><---Change-->
#Name InUse Bytes Alloc Bytes InUse Bytes Total Bytes Diff Pct
xfs_btree_cur 1872 389376 1872 389376 48 393216 48 393216 0 0.0
xfs_da_state 1584 772992 1584 772992 48 786432 48 786432 0 0.0
xfs_dqtrx 0 0 0 0 0 0 0 0 0 0.0
xfs_efd_item 4360 1744000 4760 1904000 119 1949696 119 1949696 0 0.0
xfs_icr 0 0 0 0 0 0 0 0 0 0.0
xfs_ili 48127K 6976M 48197K 6986M 909380 7104M 909380 7104M 0 0.0
xfs_inode 48210K 47080M 48244K 47113M 1507K 47123M 1507K 47123M 0 0.0
[-- Attachment #3: Type: text/plain, Size: 121 bytes --]
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: xfs and swift
2016-01-06 15:15 xfs and swift Mark Seger
@ 2016-01-06 22:04 ` Dave Chinner
2016-01-06 22:10 ` Dave Chinner
2016-01-25 18:24 ` Bernd Schubert
1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-01-06 22:04 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS
On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> I've recently found that the performance of our development swift system is
> degrading over time as the number of objects/files increases. This is a
> relatively small system, each server has 3 400GB disks. The system I'm
> currently looking at has about 70GB tied up in slabs alone, close to 55GB
> in xfs inodes and ili, and about 2GB free. The kernel
> is 3.14.57-1-amd64-hlinux.
So you've got 50M cached inodes in memory, and a relatively old kernel.
> Here's the way the filesystems are mounted:
>
> /dev/sdb1 on /srv/node/disk0 type xfs
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
>
> I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> 100 threads. If I repeat that test for multiple hours, I see the number
> of IOPS steadily decreasing to about 770 and the very next run it drops to
> 260 and continues to fall from there. This happens at about 12M files.
According to the numbers you've provided:
lookups creates removes
Fast: 1550 1350 300
Slow: 1000 900 250
This is pretty much what I'd expect on the XFS level when going from
a small empty filesystem to one containing 12M 1k files.
That does not correlate to your numbers above, so it's not at all
clear that there is really a problem here at the XFS level.
> The directory structure is 2 tiered, with 1000 directories per tier so we
> can have about 1M of them, though they don't currently all exist.
That's insane.
The xfs directory structure is much, much more space, time, IO and
memory efficient than a directory hierarchy like this. The only thing
you need a directory hash hierarchy for is to provide sufficient
concurrency for your operations, which you would probably get with a
single level with one or two subdirs per filesystem AG.
What you are doing is spreading the IO over thousands of different
regions on the disks, and then randomly seeking between them on
every operation. i.e. your workload is seek-bound, and your directory
structure has the effect of /maximising/ seeks per operation...
> I've written a collectl plugin that lets me watch many of the xfs stats in
/me sighs and points at PCP: http://pcp.io
> real-time and also have a test script that exercises the swift PUT code
> directly and so eliminates all the inter-node communications. This script
> also allows me to write to the existing swift directories as well as
> redirect to an empty structure so it mimics a clean environment with no existing
> subdirectories.
Yet that doesn't behave like an empty filesystem, which is clearly
shown by the fact that the caches are full of inodes that aren't being
used by the test. It also points out that allocation of new inodes
will follow the old logarithmic search speed degradation, because
your kernel is sufficiently old that it doesn't support the free
inode btree feature...
> I'm attaching some xfs stats during the run and hope they're readable.
> These values are in operations/sec and each line is 1 second's worth of
> data. The first set of numbers is on the clean directory and the second on
> the existing 12M file one. At the bottom of these stats are also the xfs
> slab allocations as reported by collectl. I can also watch these during a
> test and can see the number of inode and ili objects steadily grow at about
> 1K/sec, which is curious since I'm only creating about 300.
It grows at exactly the rate of the lookups being done, which is what
is expected. i.e. for each create being done, there are other
lookups being done first. e.g. directories, other objects to
determine where to create the new one, a lookup has to be done before
removes (of which there are a significant number), etc.
>
> If there is anything else I can provide just let me know.
>
> I don't fully understand all the xfs stats but what does jump out at me is
> the XFS read/write ops have increased by a factor of about 5 when the
> system is slower.
Which means your application is reading/writing 5x as much
information from the filesystem when it is slow. That's not a
filesystem problem - your application is having to traverse/modify
5x as much information for each object it is creating/modifying.
There's a good chance that's a result of your massively wide
object store directory hierarchy....
i.e. you need to start by understanding what your application is
doing in terms of IO, configuration and algorithms and determine
whether that is optimal before you start looking at whether the
filesystem is actually the bottleneck.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs and swift
2016-01-06 22:04 ` Dave Chinner
@ 2016-01-06 22:10 ` Dave Chinner
2016-01-06 22:46 ` Mark Seger
0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-01-06 22:10 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS
On Thu, Jan 07, 2016 at 09:04:54AM +1100, Dave Chinner wrote:
> On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> > I've recently found that the performance of our development swift system is
> > degrading over time as the number of objects/files increases. This is a
> > relatively small system, each server has 3 400GB disks. The system I'm
> > currently looking at has about 70GB tied up in slabs alone, close to 55GB
> > in xfs inodes and ili, and about 2GB free. The kernel
> > is 3.14.57-1-amd64-hlinux.
>
> So you've got 50M cached inodes in memory, and a relatively old kernel.
>
> > Here's the way the filesystems are mounted:
> >
> > /dev/sdb1 on /srv/node/disk0 type xfs
> > (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> >
> > I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> > 100 threads. If I repeat that test for multiple hours, I see the number
> > of IOPS steadily decreasing to about 770 and the very next run it drops to
> > 260 and continues to fall from there. This happens at about 12M files.
>
> According to the numbers you've provided:
>
> lookups creates removes
> Fast: 1550 1350 300
> Slow: 1000 900 250
>
> This is pretty much what I'd expect on the XFS level when going from
> a small empty filesystem to one containing 12M 1k files.
>
> That does not correlate to your numbers above, so it's not at all
> clear that there is really a problem here at the XFS level.
>
> > The directory structure is 2 tiered, with 1000 directories per tier so we
> > can have about 1M of them, though they don't currently all exist.
>
> That's insane.
>
> The xfs directory structure is much, much more space, time, IO and
> memory efficient than a directory hierarchy like this. The only thing
> you need a directory hash hierarchy for is to provide sufficient
> concurrency for your operations, which you would probably get with a
> single level with one or two subdirs per filesystem AG.
BTW, you might want to read the section on directory block size for
a quick introduction to XFS directory design and scalability:
https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs and swift
2016-01-06 22:10 ` Dave Chinner
@ 2016-01-06 22:46 ` Mark Seger
2016-01-06 23:49 ` Dave Chinner
0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-06 22:46 UTC (permalink / raw)
To: Dave Chinner; +Cc: Laurence Oberman, Linux fs XFS
[-- Attachment #1.1: Type: text/plain, Size: 5477 bytes --]
dave, thanks for getting back to me and the pointer to the config doc.
lots to absorb and play with.
the real challenge for me is that I'm doing testing at different levels.
While I realize running 100 parallel swift PUT threads on a small system is
not the ideal way to do things, it's the only easy way to get massive
numbers of objects into the filesystem, and once there, the performance of
a single stream is pretty poor. By instrumenting the swift code I can
clearly see excess time being spent in creating/writing the objects, and so
that's led us to believe the problem lies in the way xfs is configured.
Creating a new directory structure on that same mount point immediately
results in high levels of performance.
As an attempt to reproduce the problems w/o swift, I wrote a little
python script that simply creates files in a 2-tier structure, the first
tier consisting of 1024 directories, each containing 4096
subdirectories into which the 1K files are created. I'm doing this for 10000
objects at a time and then timing them, reporting the times 10 per line, so
each line represents 100 thousand file creates.
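In rough terms the script does something like this (a simplified sketch, not the exact code; the random placement and file naming here are my own simplifications):

```python
import os, random, time

def create_batch(root, count, tier1=1024, tier2=4096, size=1024):
    # Create `count` small files spread across a tier1 x tier2 directory
    # tree and return the elapsed wall-clock seconds for the whole batch.
    payload = b"\0" * size
    start = time.time()
    for i in range(count):
        d = os.path.join(root, str(random.randrange(tier1)),
                         str(random.randrange(tier2)))
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "obj-%d" % i), "wb") as f:
            f.write(payload)
    return time.time() - start

# e.g. 10 batches of 10K creates, printed 10 per line like the output below:
# print(" ".join("%f" % create_batch("/srv/node/disk0/test", 10000)
#                for _ in range(10)))
```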
Here too I'm seeing degradation, and if I look at what happens when there
are already 3M files and I write 1M more, I see these creation times per 10
thousand:
1.004236 0.961419 0.996514 1.012150 1.101794 0.999422 0.994796 1.214535 0.997276 1.306736
2.793429 1.201471 1.133576 1.069682 1.030985 1.096341 1.052602 1.391364 0.999480 1.914125
1.193892 0.967206 1.263310 0.890472 1.051962 4.253694 1.145573 1.528848 13.586892 4.925790
3.975442 8.896552 1.197005 3.904226 7.503806 1.294842 1.816422 9.329792 7.270323 5.936545
7.058685 5.516841 4.527271 1.956592 1.382551 1.510339 1.318341 13.255939 6.938845 4.106066
2.612064 2.028795 4.647980 7.371628 5.473423 5.823201 14.229120 0.899348 3.539658 8.501498
4.662593 6.423530 7.980757 6.367012 3.414239 7.364857 4.143751 6.317348 11.393067 1.273371
146.067300 1.317814 1.176529 1.177830 52.206605 1.112854 2.087990 42.328220 1.178436 1.335202
49.118140 1.368696 1.515826 44.690431 0.927428 0.920801 0.985965 1.000591 1.027458 60.650443
1.771318 2.690499 2.262868 1.061343 0.932998 64.064210 37.726213 1.245129 0.743771 0.996683
notice one set of 10K took almost 3 minutes!
my main questions at this point are: is this performance expected, and
might a newer kernel help? And might it be possible to significantly
improve things via tuning, or is it what it is? I do realize I'm starting
with an empty directory tree whose performance degrades as it fills, but if
I wanted to tune for say 10M or maybe 100M files, might I be able to expect
more consistent numbers (perhaps starting out at lower performance) as the
number of objects grows? I'm basically looking for more consistency over a
broader range of numbers of files.
-mark
On Wed, Jan 6, 2016 at 5:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jan 07, 2016 at 09:04:54AM +1100, Dave Chinner wrote:
> > On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> > > I've recently found that the performance of our development swift system is
> > > degrading over time as the number of objects/files increases. This is
> a
> > > relatively small system, each server has 3 400GB disks. The system I'm
> > > currently looking at has about 70GB tied up in slabs alone, close to
> 55GB
> > > in xfs inodes and ili, and about 2GB free. The kernel
> > > is 3.14.57-1-amd64-hlinux.
> >
> > So you've got 50M cached inodes in memory, and a relatively old kernel.
> >
> > > Here's the way the filesystems are mounted:
> > >
> > > /dev/sdb1 on /srv/node/disk0 type xfs
> > >
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> > >
> > > I can do about 2000 1K file creates/sec when running 2 minute PUT
> tests at
> > > 100 threads. If I repeat that test for multiple hours, I see the
> number
> > > of IOPS steadily decreasing to about 770 and the very next run it
> drops to
> > > 260 and continues to fall from there. This happens at about 12M files.
> >
> > According to the numbers you've provided:
> >
> > lookups creates removes
> > Fast: 1550 1350 300
> > Slow: 1000 900 250
> >
> > This is pretty much what I'd expect on the XFS level when going from
> > a small empty filesystem to one containing 12M 1k files.
> >
> > That does not correlate to your numbers above, so it's not at all
> > clear that there is really a problem here at the XFS level.
> >
> > > The directory structure is 2 tiered, with 1000 directories per tier so
> we
> > > can have about 1M of them, though they don't currently all exist.
> >
> > That's insane.
> >
> > The xfs directory structure is much, much more space, time, IO and
> > memory efficient than a directory hierarchy like this. The only thing
> > you need a directory hash hierarchy for is to provide sufficient
> > concurrency for your operations, which you would probably get with a
> > single level with one or two subdirs per filesystem AG.
>
> BTW, you might want to read the section on directory block size for
> a quick introduction to XFS directory design and scalability:
>
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
[-- Attachment #1.2: Type: text/html, Size: 6872 bytes --]
[-- Attachment #2: Type: text/plain, Size: 121 bytes --]
* Re: xfs and swift
2016-01-06 22:46 ` Mark Seger
@ 2016-01-06 23:49 ` Dave Chinner
2016-01-25 16:38 ` Mark Seger
0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-01-06 23:49 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS
On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> dave, thanks for getting back to me and the pointer to the config doc.
> lots to absorb and play with.
>
> the real challenge for me is that I'm doing testing at different levels.
> While i realize running 100 parallel swift PUT threads on a small system is
> not the ideal way to do things, it's the only easy way to get massive
> numbers of objects into the filesystem and once there, the performance of
> a single stream is pretty poor and by instrumenting the swift code I can
> clearly see excess time being spent in creating/writing the objects and so
> that's led us to believe the problem lies in the way xfs is configured.
> creating a new directory structure on that same mount point immediately
> results in high levels of performance.
>
> As an attempt to try to reproduce the problems w/o swift, I wrote a little
> python script that simply creates files in a 2-tier structure, the first
> tier consisting of 1024 directories and each directory contains 4096
> subdirectories into which 1K files are created.
So you created something with even greater fan-out than what your
swift app is using?
> I'm doing this for 10000
> objects at a time and then timing them, reporting the times, 10 per line so
> each line represents 100 thousand file creates.
>
> Here too I'm seeing degradation and if I look at what happens when there
> are already 3M files and I write 1M more, I see these creation times/10
> thousand:
>
> 1.004236 0.961419 0.996514 1.012150 1.101794 0.999422 0.994796 1.214535 0.997276 1.306736
> 2.793429 1.201471 1.133576 1.069682 1.030985 1.096341 1.052602 1.391364 0.999480 1.914125
> 1.193892 0.967206 1.263310 0.890472 1.051962 4.253694 1.145573 1.528848 13.586892 4.925790
> 3.975442 8.896552 1.197005 3.904226 7.503806 1.294842 1.816422 9.329792 7.270323 5.936545
> 7.058685 5.516841 4.527271 1.956592 1.382551 1.510339 1.318341 13.255939 6.938845 4.106066
> 2.612064 2.028795 4.647980 7.371628 5.473423 5.823201 14.229120 0.899348 3.539658 8.501498
> 4.662593 6.423530 7.980757 6.367012 3.414239 7.364857 4.143751 6.317348 11.393067 1.273371
> 146.067300 1.317814 1.176529 1.177830 52.206605 1.112854 2.087990 42.328220 1.178436 1.335202
> 49.118140 1.368696 1.515826 44.690431 0.927428 0.920801 0.985965 1.000591 1.027458 60.650443
> 1.771318 2.690499 2.262868 1.061343 0.932998 64.064210 37.726213 1.245129 0.743771 0.996683
>
> notice one set of 10K took almost 3 minutes!
Which is no surprise because you have slow disks and a *lot* of
memory. At some point the journal and/or memory is going to fill up
with dirty objects and have to block waiting for writeback. At that
point there's going to be several hundred thousand dirty inodes that
need to be flushed to disk before progress can be made again. That
metadata writeback will be seek bound, and that's where all the
delay comes from.
We've been through this problem several times now with different
swift users over the past couple of years. Please go and search the
list archives, because every time the solution has been the same:
	- reduce the directory hierarchy to a single level with, at
most, the number of directories matching the expected
*production* concurrency level
- reduce the XFS log size down to 32-128MB to limit dirty
metadata object buildup in memory
- reduce the number of AGs to as small as necessary to
maintain /allocation/ concurrency to limit the number of
different locations XFS writes to the disks (typically
10-20x less than the application level concurrency)
- use a 3.16+ kernel with the free inode btree on-disk
format feature to keep inode allocation CPU overhead low
and consistent regardless of the number of inodes already
allocated in the filesystem.
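As a sketch of the first point, a single-level layout keyed off an object-name hash might look like the following (illustrative only; ndirs=16 is a made-up value and should be sized to the measured production concurrency):

```python
import hashlib

def single_level_path(name, ndirs=16):
    # One level of hash directories; ndirs=16 is illustrative only --
    # size it to the expected production concurrency, not to ~1M dirs.
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return "%02d/%s" % (h % ndirs, name)

paths = [single_level_path("obj-%d" % i) for i in range(1000)]
print(len({p.split("/")[0] for p in paths}))  # count of distinct top dirs
```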
> my main questions at this point are is this performance expected and/or
> might a newer kernel help? and might it be possible to significantly
> improve things via tuning or is it what it is? I do realize I'm starting
> with an empty directory tree whose performance degrades as it fills, but if
> I wanted to tune for say 10M or maybe 100M files might I be able to expect
The mkfs defaults will work just fine with that many files in the
filesystem. Your application configuration and data store layout is
likely to be your biggest problem here.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: xfs and swift
2016-01-06 23:49 ` Dave Chinner
@ 2016-01-25 16:38 ` Mark Seger
2016-02-01 5:27 ` Dave Chinner
0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-25 16:38 UTC (permalink / raw)
To: Dave Chinner; +Cc: Laurence Oberman, Linux fs XFS
[-- Attachment #1.1: Type: text/plain, Size: 19884 bytes --]
since getting your last reply I've been doing a lot more work trying to
understand the behavior of what I'm seeing, by writing some non-swift code
that sort of does what swift does with respect to a directory structure.
In my case I have 1024 top-level dirs, 4096 under each. Each 1k file I'm
creating gets its own directory under these, so there are clearly a lot of
directories.
xfs writes out about 25M objects and then the performance goes into the
toilet. I'm sure it's what you said before about having to flush data
causing big delays, but would it be continuous? Each entry in the
following table shows the time to write 10K files, so the 2 blocks are 1M
each
Sat Jan 23 12:15:09 2016
16.114386 14.656736 14.789760 17.418389 14.613157 15.938176 14.865369 14.962058 17.297193 15.953590
14.895471 15.560252 14.789937 14.308618 16.390057 16.561789 15.713806 14.843791 15.940992 16.466924
15.842781 15.611230 17.102329 15.006291 14.454088 17.923662 13.378340 16.084664 15.996794 13.736398
18.125125 14.462063 18.101833 15.355139 16.603660 14.205896 16.474111 16.212237 15.072443 14.217581
16.273899 14.905624 17.285019 14.955722 13.769731 18.308619 15.601386 15.832661 14.342416 16.516657
14.697575 15.719496 16.723135 16.808668 15.443325 14.608358 17.031334 16.426377 13.900535 13.528603
16.197697 16.839241 14.802707 15.507915 14.864337 15.836943 15.660089 15.998911 13.956739 14.337318
16.416974 17.729661 14.936045 13.450859 15.943900 15.106077 15.541450 16.523752 16.555945 14.440305
14.937772 16.486544 13.780310 16.944841 14.867400 18.214934 14.142108 15.931952 14.424949 15.533156
16.010153 16.323108 14.423508 15.970071 15.277186 15.561362 14.978766 15.855935 16.953906 14.247016
Sat Jan 23 12:41:09 2016
15.908483 15.638943 17.681281 15.188704 15.721495 13.359225 15.999421 15.858876 16.402176 16.416312
15.443946 14.675751 15.470643 15.573755 15.422241 16.336590 17.220916 13.974890 15.877780 62.650921
62.667990 46.334603 53.546195 69.465447 65.006016 68.761229 70.754684 97.571669 104.811261 104.229302
105.605257 105.166030 105.058075 105.519703 106.573306 106.708545 106.114733 105.643131 106.049387 106.379378
104.239131 104.268931 103.852929 103.549319 103.516169 103.007015 103.724020 104.519983 105.839203 105.324985
104.328205 104.932713 103.051548 104.938652 102.769383 102.851609 101.432277 102.269842 100.937972 103.450103
103.477628 103.636130 103.444242 103.023145 102.565047 102.853115 101.402610 98.928230 99.310677 99.669667
101.140554 99.628664 102.093801 100.580659 101.762283 101.369349 102.637014 102.240950 101.778506 101.144526
100.899476 102.294952 102.029285 100.871166 102.763222 102.910690 104.892447 104.748194 105.403636 106.159345
106.413154 104.626632 105.775004 104.579775 104.778526 104.634778 106.233381 104.063642 106.635481 104.314503
if I look at the disk loads at the time, I see a dramatic increase in disk
reads that corresponds to the slow writes, so I'm guessing at least some
writes are waiting in the queue, as you can see there - thanks to Laurence
for the patch to show disk read wait times ;)
# DISK STATISTICS (/sec)
#               <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:45:30 sdb 0 0 0 0 0 270040 105 2276 119 4 118 16 4 0 62
12:45:31 sdb 0 0 0 0 0 273776 120 2262 121 4 121 18 4 0 57
12:45:32 sdb 4 0 1 4 0 100164 57 909 110 4 110 6 4 0 84
12:45:33 sdb 0 0 0 0 0 229992 87 1924 120 1 119 2 1 0 68
12:45:34 sdb 4 0 1 4 4 153528 59 1304 118 0 117 1 0 0 78
12:45:35 sdb 0 0 0 0 0 220896 97 1895 117 1 116 1 1 0 62
12:45:36 sdb 0 0 0 0 0 419084 197 3504 120 0 119 1 0 0 32
12:45:37 sdb 0 0 0 0 0 428076 193 3662 117 0 116 1 0 0 32
12:45:38 sdb 0 0 0 0 0 428492 181 3560 120 0 120 1 0 0 30
12:45:39 sdb 0 0 0 0 0 426024 199 3641 117 0 117 1 0 0 32
12:45:40 sdb 0 0 0 0 0 429764 200 3589 120 0 119 1 0 0 28
12:45:41 sdb 0 0 0 0 0 410204 165 3430 120 0 119 3 0 0 36
12:45:42 sdb 0 0 0 0 0 406192 196 3437 118 0 118 5 0 0 39
12:45:43 sdb 0 0 0 0 0 420952 175 3552 119 0 118 1 0 0 34
12:45:44 sdb 0 0 0 0 0 428424 197 3645 118 0 117 1 0 0 31
12:45:45 sdb 0 0 0 0 0 192464 76 1599 120 8 120 18 8 0 75
12:45:46 sdb 0 0 0 0 0 340522 205 2951 115 2 115 16 2 0 41
12:45:47 sdb 0 0 0 0 0 429128 193 3664 117 0 117 1 0 0 28
12:45:48 sdb 0 0 0 0 0 402600 164 3311 122 0 121 3 0 0 39
12:45:49 sdb 0 0 0 0 0 435316 195 3701 118 0 117 1 0 0 36
12:45:50 sdb 0 0 0 0 0 367976 162 3152 117 1 116 7 1 0 46
12:45:51 sdb 0 0 0 0 0 255716 125 2153 119 4 118 16 4 0 60
# DISK STATISTICS (/sec)
#               <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time    Name KBytes Merged IOs Size Wait KBytes Merged IOs Size Wait RWSize QLen Wait SvcTim Util
12:45:52 sdb 0 0 0 0 0 360144 149 3006 120 1 119 9 1 0 46
12:45:53 sdb 0 0 0 0 0 343500 162 2909 118 1 118 11 1 0 43
12:45:54 sdb 0 0 0 0 0 256636 119 2188 117 2 117 11 2 0 54
12:45:55 sdb 0 0 0 0 0 149000 47 1260 118 14 118 22 14 0 79
12:45:56 sdb 0 0 0 0 0 198544 88 1654 120 7 120 19 7 0 67
12:45:57 sdb 0 0 0 0 0 320688 151 2731 117 1 117 8 1 0 53
12:45:58 sdb 0 0 0 0 0 422176 190 3532 120 0 119 1 0 0 32
12:45:59 sdb 0 0 0 0 0 266540 115 2233 119 5 119 13 5 0 93
12:46:00 sdb 8 0 2 4 690 291116 129 2463 118 3 118 9 3 0 82
12:46:01 sdb 0 0 0 0 0 249964 118 2160 116 4 115 15 4 0 60
12:46:02 sdb 4736 0 37 128 0 424680 167 3522 121 0 120 1 0 0 28
12:46:03 sdb 5016 0 42 119 0 391364 196 3344 117 0 117 6 0 0 34
12:46:04 sdb 0 0 0 0 0 415436 172 3501 119 0 118 2 0 0 33
12:46:05 sdb 0 0 0 0 0 398736 192 3373 118 0 118 3 0 0 39
12:46:06 sdb 0 0 0 0 0 367292 155 3015 122 0 121 6 0 0 39
12:46:07 sdb 0 0 0 0 0 420392 201 3614 116 0 116 1 0 0 30
12:46:08 sdb 0 0 0 0 0 424828 172 3547 120 0 119 1 0 0 32
12:46:09 sdb 0 0 0 0 0 500380 234 4277 117 0 116 2 0 0 34
12:46:10 sdb 0 0 0 0 0 104500 7 698 150 0 149 1 0 1 87
12:46:11 sdb 8 0 1 8 1260 77252 45 647 119 0 119 1 2 1 92
12:46:12 sdb 8 0 1 8 1244 73956 31 615 120 0 120 1 2 1 94
12:46:13 sdb 8 0 1 8 228 149552 64 1256 119 0 118 1 0 0 85
# DISK STATISTICS (/sec)
#
<---------reads---------------><---------writes--------------><--------averages-------->
Pct
#Time Name KBytes Merged IOs Size Wait KBytes Merged IOs Size
Wait RWSize QLen Wait SvcTim Util
12:46:14 sdb 8 0 1 8 1232 37124 28 319 116 0 116 1 3 3 99
12:46:15 sdb 16 0 2 8 720 2776 23 120 23 1 22 1 13 8 99
12:46:16 sdb 0 0 0 0 0 108180 16 823 131 0 131 1 0 1 90
12:46:17 sdb 8 0 1 8 1260 37136 28 322 115 0 114 1 3 2 94
12:46:18 sdb 8 0 1 8 1252 108680 57 875 124 0 124 1 1 1 88
12:46:19 sdb 0 0 0 0 0 0 0 0 0 0 0 1 0 0 100
12:46:20 sdb 16 0 2 8 618 81516 49 685 119 0 118 1 1 1 94
12:46:21 sdb 16 0 2 8 640 225788 106 1907 118 0 118 1 0 0 75
12:46:22 sdb 32 0 4 8 95 73892 17 627 118 0 117 1 0 1 93
12:46:23 sdb 24 0 3 8 408 257012 119 2171 118 0 118 1 0 0 65
12:46:24 sdb 12 0 3 4 5 3608 0 20 180 0 157 1 0 43 100
12:46:25 sdb 44 0 7 6 210 74072 41 625 119 0 117 1 2 1 97
12:46:26 sdb 48 0 6 8 216 202852 112 1819 112 0 111 1 0 0 92
12:46:27 sdb 52 0 7 7 233 307156 137 2648 116 0 115 1 0 0 95
12:46:28 sdb 16 0 2 8 100 93168 7 638 146 0 145 1 0 1 97
12:46:29 sdb 16 0 2 8 642 37028 16 319 116 0 115 1 4 3 99
12:46:30 sdb 16 0 2 8 624 39068 36 342 114 0 113 1 3 2 99
12:46:31 sdb 80 0 10 8 94 253892 105 2169 117 0 116 1 0 0 84
12:46:32 sdb 0 0 0 0 0 5676 0 33 172 0 172 1 0 30 100
12:46:33 sdb 16 0 2 8 642 69236 28 583 119 0 118 1 2 1 96
12:46:34 sdb 8 0 1 8 1032 37132 30 315 118 0 117 1 3 3 100
12:46:35 sdb 16 0 2 8 822 56292 15 515 109 0 108 1 3 1 100
12:46:36 sdb 8 0 1 8 44 58768 15 452 130 0 129 1 0 2 96
12:46:37 sdb 28 0 4 7 390 114944 89 1100 104 0 104 1 1 0 88
12:46:38 sdb 0 0 0 0 0 29668 0 172 172 12 172 1 12 5 98
12:46:39 sdb 80 0 10 8 90 100084 31 882 113 0 112 1 1 1 91
12:46:40 sdb 0 0 0 0 0 24244 0 139 174 0 174 1 0 7 100
12:46:41 sdb 8 0 1 8 1224 0 0 0 0 0 8 1 1224 1000 100
12:46:42 sdb 8 0 1 8 1244 42368 29 354 120 0 119 1 3 2 96
12:46:43 sdb 36 0 5 7 251 51428 32 507 101 0 100 1 2 1 94
12:46:44 sdb 24 0 3 8 70 5732 31 147 39 15 38 2 16 6 99
12:46:45 sdb 32 0 4 8 4 213056 53 1647 129 0 129 1 0 0 74
12:46:46 sdb 8 0 1 8 1220 37416 28 328 114 0 113 1 3 2 96
12:46:47 sdb 8 0 1 8 1248 58572 67 607 96 0 96 1 2 1 93
12:46:48 sdb 40 0 5 8 84 274808 82 2173 126 0 126 1 0 0 70
12:46:49 sdb 0 0 0 0 0 0 0 0 0 0 0 1 0 0 100
12:46:50 sdb 8 0 1 8 1248 0 0 0 0 0 8 1 1248 1000 100
12:46:51 sdb 8 0 1 8 1272 0 0 0 0 0 8 1 1272 1000 100
12:46:52 sdb 24 0 3 8 414 205240 113 1798 114 0 113 1 0 0 75
12:46:53 sdb 8 0 1 8 876 92476 48 839 110 0 110 1 1 1 89
12:46:54 sdb 0 0 0 0 0 38700 0 225 172 0 172 1 0 4 99
12:46:55 sdb 16 0 2 8 582 150680 73 1262 119 0 119 1 1 0 87
12:46:56 sdb 8 0 1 8 1228 0 0 0 0 0 8 1 1228 1000 100
12:46:57 sdb 8 0 1 8 1244 0 0 0 0 0 8 1 1244 1000 100
next I played back the collectl process data, sorted it by disk reads, and
discovered that the top process, corresponding to the long disk reads, was
xfsaild. btw - I also see the xfs_inode slab using about 60GB.
It's also worth noting that I'm only doing 1-2MB/sec of writes; the rest
of the data looks like it's coming from xfs journaling, because the xfs
stats show on the order of 200-400MB/sec of xfs logging
writes - clearly they're not all going to disk. Once the read waits
increase, everything slows down, including xfs logging (since it's doing
less).
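For anyone wanting to reproduce the slab figures quoted here without collectl, they can be pulled straight from /proc/slabinfo. A minimal sketch (the helper name is mine, and it assumes the standard "slabinfo - version: 2.x" column layout):

```python
# Estimate memory pinned by the xfs_inode and xfs_ili slabs from
# /proc/slabinfo. Assumed layout per line:
#   name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ...
def slab_bytes(slabinfo_text, names=("xfs_inode", "xfs_ili")):
    """Return {slab_name: total bytes} for the named slabs."""
    totals = {}
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] in names:
            # total memory held by the slab = allocated objects * object size
            totals[fields[0]] = int(fields[2]) * int(fields[3])
    return totals
```

Reading /proc/slabinfo itself normally requires root; feed its contents to slab_bytes() and divide by 2**30 to get GB figures like the ones above.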
I'm sure the simple answer may be that it is what it is, but I'm also
wondering: without changes to swift itself, might there be some way to
improve the situation by adding more memory or making other tuning
changes? The system I'm currently running my tests on has 128GB.
-mark
On Wed, Jan 6, 2016 at 6:49 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> > dave, thanks for getting back to me and the pointer to the config doc.
> > lots to absorb and play with.
> >
> > the real challenge for me is that I'm doing testing at different levels.
> > While I realize running 100 parallel swift PUT threads on a small system
> > is not the ideal way to do things, it's the only easy way to get massive
> > numbers of objects into the filesystem, and once there, the performance
> > of a single stream is pretty poor. By instrumenting the swift code I can
> > clearly see excess time being spent in creating/writing the objects, and
> > that's led us to believe the problem lies in the way xfs is configured.
> > Creating a new directory structure on that same mount point immediately
> > results in high levels of performance.
> >
> > As an attempt to reproduce the problems w/o swift, I wrote a little
> > python script that simply creates files in a 2-tier structure, the first
> > tier consisting of 1024 directories, each containing 4096
> > subdirectories into which 1K files are created.
>
> So you created something with even greater fan-out than what your
> swift app is using?
>
> > I'm doing this for 10000
> > objects at a time and then timing them, reporting the times, 10 per line,
> > so each line represents 100 thousand file creates.
> >
> > Here too I'm seeing degradation and if I look at what happens when there
> > are already 3M files and I write 1M more, I see these creation times/10
> > thousand:
> >
> > 1.004236 0.961419 0.996514 1.012150 1.101794 0.999422 0.994796
> > 1.214535 0.997276 1.306736
> > 2.793429 1.201471 1.133576 1.069682 1.030985 1.096341 1.052602
> > 1.391364 0.999480 1.914125
> > 1.193892 0.967206 1.263310 0.890472 1.051962 4.253694 1.145573
> > 1.528848 13.586892 4.925790
> > 3.975442 8.896552 1.197005 3.904226 7.503806 1.294842 1.816422
> > 9.329792 7.270323 5.936545
> > 7.058685 5.516841 4.527271 1.956592 1.382551 1.510339 1.318341
> > 13.255939 6.938845 4.106066
> > 2.612064 2.028795 4.647980 7.371628 5.473423 5.823201 14.229120
> > 0.899348 3.539658 8.501498
> > 4.662593 6.423530 7.980757 6.367012 3.414239 7.364857 4.143751
> > 6.317348 11.393067 1.273371
> > 146.067300 1.317814 1.176529 1.177830 52.206605 1.112854 2.087990
> > 42.328220 1.178436 1.335202
> > 49.118140 1.368696 1.515826 44.690431 0.927428 0.920801 0.985965
> > 1.000591 1.027458 60.650443
> > 1.771318 2.690499 2.262868 1.061343 0.932998 64.064210 37.726213
> > 1.245129 0.743771 0.996683
> >
> > notice one set of 10K took almost 3 minutes!
>
> Which is no surprise because you have slow disks and a *lot* of
> memory. At some point the journal and/or memory is going to fill up
> with dirty objects and have to block waiting for writeback. At that
> point there's going to be several hundred thousand dirty inodes that
> need to be flushed to disk before progress can be made again. That
> metadata writeback will be seek bound, and that's where all the
> delay comes from.
>
> We've been through this problem several times now with different
> swift users over the past couple of years. Please go and search the
> list archives, because every time the solution has been the same:
>
> 	- reduce the directory hierarchy to a single level with, at
> most, the number of directories matching the expected
> *production* concurrency level
> - reduce the XFS log size down to 32-128MB to limit dirty
> metadata object buildup in memory
> - reduce the number of AGs to as small as necessary to
> maintain /allocation/ concurrency to limit the number of
> different locations XFS writes to the disks (typically
> 10-20x less than the application level concurrency)
> - use a 3.16+ kernel with the free inode btree on-disk
> format feature to keep inode allocation CPU overhead low
> and consistent regardless of the number of inodes already
> allocated in the filesystem.
>
> > my main questions at this point are: is this performance expected, and/or
> > might a newer kernel help? and might it be possible to significantly
> > improve things via tuning, or is it what it is? I do realize I'm starting
> > with an empty directory tree whose performance degrades as it fills, but if
> > I wanted to tune for say 10M or maybe 100M files might I be able to expect
>
> The mkfs defaults will work just fine with that many files in the
> filesystem. Your application configuration and data store layout are
> likely to be your biggest problem here.
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: xfs and swift
2016-01-06 15:15 xfs and swift Mark Seger
2016-01-06 22:04 ` Dave Chinner
@ 2016-01-25 18:24 ` Bernd Schubert
2016-01-25 19:00 ` Mark Seger
1 sibling, 1 reply; 10+ messages in thread
From: Bernd Schubert @ 2016-01-25 18:24 UTC (permalink / raw)
To: Mark Seger, Linux fs XFS; +Cc: Laurence Oberman
Hi Mark!
On 01/06/2016 04:15 PM, Mark Seger wrote:
> I've recently found the performance our development swift system is
> degrading over time as the number of objects/files increases. This is a
> relatively small system, each server has 3 400GB disks. The system I'm
> currently looking at has about 70GB tied up in slabs alone, close to 55GB
> in xfs inodes and ili, and about 2GB free. The kernel
> is 3.14.57-1-amd64-hlinux.
>
> Here's the way the filesystems are mounted:
>
> /dev/sdb1 on /srv/node/disk0 type xfs
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
>
> I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> 100 threads. If I repeat that test for multiple hours, I see the number
> of IOPS steadily decreasing to about 770 and the very next run it drops to
> 260 and continues to fall from there. This happens at about 12M files.
>
> The directory structure is 2 tiered, with 1000 directories per tier so we
> can have about 1M of them, though they don't currently all exist.
This sounds pretty much like hash directories as used by some parallel
file systems (Lustre and, in the past, BeeGFS). For us the file-create
slowdown was due to lookups in directories to check whether a file with the
same name already exists. At least for ext4 it was rather easy to demonstrate
that simply caching directory blocks would eliminate that issue.
We then considered working on a better kernel cache, but in the end we
simply found a way to get rid of such a simple directory structure in
BeeGFS and changed it to a more complex layout with less random
access, which eliminated the main reason for the slowdown.
Now I have no idea what a "swift system" is, in which order it
creates and accesses those files, or whether it would be possible to change
the access pattern. One thing you might try, and which should work much
better since 3.11, is the vfs_cache_pressure setting. The lower it is, the
fewer dentries/inodes are dropped from cache when pages are needed for
file data.
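For reference, the knob Bernd describes lives under /proc/sys/vm. A minimal sketch of checking and lowering it (the path is parameterised purely for illustration; writing the real file needs root, or just run `sysctl vm.vfs_cache_pressure=50`):

```python
# vm.vfs_cache_pressure defaults to 100; values below 100 make the kernel
# prefer keeping dentry/inode caches over reclaiming them for page cache.
def get_cache_pressure(path="/proc/sys/vm/vfs_cache_pressure"):
    with open(path) as f:
        return int(f.read().strip())

def set_cache_pressure(value, path="/proc/sys/vm/vfs_cache_pressure"):
    with open(path, "w") as f:
        f.write(str(value))
```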
Cheers,
Bernd
* Re: xfs and swift
2016-01-25 18:24 ` Bernd Schubert
@ 2016-01-25 19:00 ` Mark Seger
2016-01-25 19:33 ` Bernd Schubert
0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-25 19:00 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Laurence Oberman, Linux fs XFS
hey bernd, long time no chat. it turns out you don't have to know what
swift is because I've been able to demonstrate this behavior with a very
simple python script that creates files in a 3-tier hierarchy. the
third-level directories each contain a single file, which for my testing
is always 1K.
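A sketch of what such a script might look like (counts and naming are illustrative, not Mark's actual code): every file lands in its own third-level directory, and creation times are reported per batch.

```python
import os
import time

def create_files(root, n_files, top=1000, mid=1000, batch=10000):
    """Create 1K files in a 3-tier hierarchy, one file per leaf directory.
    Returns the elapsed seconds for each batch of `batch` creates."""
    payload = b"x" * 1024
    times, start = [], time.time()
    for i in range(n_files):
        leaf = os.path.join(root,
                            "d%04d" % (i % top),
                            "s%04d" % ((i // top) % mid),
                            "f%08d" % i)          # unique leaf dir per file
        os.makedirs(leaf, exist_ok=True)
        with open(os.path.join(leaf, "obj"), "wb") as f:
            f.write(payload)
        if (i + 1) % batch == 0:
            now = time.time()
            times.append(now - start)             # seconds per `batch` creates
            start = now
    return times
```

Printed 10 values per line, this produces exactly the per-10K timing tables shown elsewhere in the thread.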
I have played with cache_pressure and it doesn't seem to make a difference,
though that was awhile ago and perhaps it is worth revisiting. one thing
you may get a hoot out of, being a collectl user, is I have an xfs plugin
that lets you look at a ton of xfs stats either in realtime or after the
fact just like any other collectl stat. I just haven't added it to the kit
yet.
-mark
On Mon, Jan 25, 2016 at 1:24 PM, Bernd Schubert <bschubert@ddn.com> wrote:
> Hi Mark!
>
> On 01/06/2016 04:15 PM, Mark Seger wrote:
> > I've recently found the performance our development swift system is
> > degrading over time as the number of objects/files increases. This is a
> > relatively small system, each server has 3 400GB disks. The system I'm
> > currently looking at has about 70GB tied up in slabs alone, close to 55GB
> > in xfs inodes and ili, and about 2GB free. The kernel
> > is 3.14.57-1-amd64-hlinux.
> >
> > Here's the way the filesystems are mounted:
> >
> > /dev/sdb1 on /srv/node/disk0 type xfs
> > (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> >
> > I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> > 100 threads. If I repeat that test for multiple hours, I see the number
> > of IOPS steadily decreasing to about 770, and the very next run it drops to
> > 260 and continues to fall from there. This happens at about 12M files.
> >
> > The directory structure is 2 tiered, with 1000 directories per tier so we
> > can have about 1M of them, though they don't currently all exist.
>
> This sounds pretty much like hash directories as used by some parallel
> file systems (Lustre and, in the past, BeeGFS). For us the file-create
> slowdown was due to lookups in directories to check whether a file with the
> same name already exists. At least for ext4 it was rather easy to demonstrate
> that simply caching directory blocks would eliminate that issue.
> We then considered working on a better kernel cache, but in the end we
> simply found a way to get rid of such a simple directory structure in
> BeeGFS and changed it to a more complex layout with less random
> access, which eliminated the main reason for the slowdown.
>
> Now I have no idea what a "swift system" is, in which order it
> creates and accesses those files, or whether it would be possible to change
> the access pattern. One thing you might try, and which should work much
> better since 3.11, is the vfs_cache_pressure setting. The lower it is, the
> fewer dentries/inodes are dropped from cache when pages are needed for
> file data.
>
>
>
> Cheers,
> Bernd
* Re: xfs and swift
2016-01-25 19:00 ` Mark Seger
@ 2016-01-25 19:33 ` Bernd Schubert
0 siblings, 0 replies; 10+ messages in thread
From: Bernd Schubert @ 2016-01-25 19:33 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS
Hi Mark!
On 01/25/2016 08:00 PM, Mark Seger wrote:
> hey bernd, long time no chat. it turns out you don't have to know what
> swift is because I've been able to demonstrate this behavior with a very
> simple python script that creates files in a 3-tier hierarchy. the
> third-level directories each contain a single file, which for my testing
> is always 1K.
So what is the script exactly doing? Does it create those files
sequentially per dir or randomly between those dirs?
Btw, I had been talking about that issue at linux plumbers in 2013
https://www.youtube.com/watch?v=N_bZOGZAb-Y
>
> I have played with cache_pressure and it doesn't seem to make a difference,
> though that was awhile ago and perhaps it is worth revisiting. one thing
There are several patches from Mel Gorman in 3.11, which really made a
difference for me. So unless you tested with >= 3.11 you should probably
re-test.
> you may get a hoot out of, being a collectl user, is I have an xfs plugin
> that lets you look at a ton of xfs stats either in realtime or after the
> fact just like any other collectl stat. I just haven't added it to the kit
> yet.
Hmm, I currently don't have a good test system for that. I'm working on
an entirely different project now, and while this is also a parallel file
system, it does not have a linux file system in between, but has its own
(log-rotated) layout.
Cheers,
Bernd
* Re: xfs and swift
2016-01-25 16:38 ` Mark Seger
@ 2016-02-01 5:27 ` Dave Chinner
0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2016-02-01 5:27 UTC (permalink / raw)
To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS
On Mon, Jan 25, 2016 at 11:38:07AM -0500, Mark Seger wrote:
> since getting your last reply I've been doing a lot more trying to
> understand the behavior of what I'm seeing by writing some non-swift code
> that sort of does what swift does with respect to a directory structure.
> in my case I have 1024 top level dirs, 4096 under each. each 1k file I'm
> creating gets its own directory under these, so there are clearly a lot of
> directories.
I'm not sure you understood what I said in my last reply: your
directory structure is the problem, and that's what needs changing.
> xfs writes out about 25M objects and then the performance goes into the
> toilet. I'm sure it's what you said before, about having to flush data and
> causing big delays, but would it be continuous?
Go read the previous thread on this subject. Or, alternatively, try
some of the suggestions I made, like reducing the log size, to see
how this affects such behaviour.
> each entry in the
> following table shows the time to write 10K files so the 2 blocks are 1M
> each
>
> Sat Jan 23 12:15:09 2016
> 16.114386 14.656736 14.789760 17.418389 14.613157 15.938176
> 14.865369 14.962058 17.297193 15.953590
.....
> 62.667990 46.334603 53.546195 69.465447 65.006016 68.761229
> 70.754684 97.571669 104.811261 104.229302
> 105.605257 105.166030 105.058075 105.519703 106.573306 106.708545
> 106.114733 105.643131 106.049387 106.379378
Your test goes from operating wholly in memory to being limited by
disk speed because it no longer fits in memory.
> if I look at the disk loads at the time, I see a dramatic increase in disk
> reads that correspond to the slow writes so I'm guessing at least some
.....
> next I played back the collectl process data, sorted it by disk reads, and
> discovered that the top process, corresponding to the long disk reads, was
> xfsaild. btw - I also see the xfs_inode slab using about 60GB.
And there's your problem. You're accumulating gigabytes of dirty
inodes in memory, then wondering why everything goes to crap when
memory fills up and we have to start cleaning inodes. To clean those
inodes, we have to do RMW cycles on the inode cluster buffers, because
the inode cache memory pressure has caused the inode buffers to be
reclaimed from memory before the cached dirty inodes are written. All
the changes I recommended you make also happen to address this
problem, too....
> It's also worth noting that I'm only doing 1-2MB/sec of writes and the rest
> of the data looks like it's coming from xfs journaling because when I look
> at the xfs stats I'm seeing on the order of 200-400MB/sec xfs logging
> writes - clearly they're not all going to disk.
Before delayed logging was introduced 5 years ago, it was quite
common to see XFS writing >500MB/s to the journal. The thing is,
your massive fan-out directory structure is mostly going to defeat
the relogging optimisations that make delayed logging work, so it's
entirely possible that you are seeing this much throughput through
the journal.
> Once the read waits
> increase everything slows down including xfs logging (since it's doing
> less).
Of course, because we can't journal more changes until the dirty
inodes in the journal are cleaned. That's what the xfsaild does -
clean dirty inodes, and the reads coming from that threads are for
cleaning inodes...
> I'm sure the simple answer may be that it is what it is, but I'm also
> wondering without changes to swift itself, might there be some ways to
> improve the situation by adding more memory or making any other tuning
> changes? The system I'm currently running my tests on has 128GB.
I've already described what you need to do to both the swift
directory layout and the XFS filesystem configuration to minimise
the impact of storing millions of tiny records in a filesystem. I'll
leave the quote from my last email for you:
> > We've been through this problem several times now with different
> > swift users over the past couple of years. Please go and search the
> > list archives, because every time the solution has been the same:
> >
> > 	- reduce the directory hierarchy to a single level with, at
> > most, the number of directories matching the expected
> > *production* concurrency level
> > - reduce the XFS log size down to 32-128MB to limit dirty
> > metadata object buildup in memory
> > - reduce the number of AGs to as small as necessary to
> > maintain /allocation/ concurrency to limit the number of
> > different locations XFS writes to the disks (typically
> > 10-20x less than the application level concurrency)
> > - use a 3.16+ kernel with the free inode btree on-disk
> > format feature to keep inode allocation CPU overhead low
> > and consistent regardless of the number of inodes already
> > allocated in the filesystem.
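Turned into an mkfs invocation, those recommendations might look something like the sketch below (flag names per xfsprogs; the exact log size and AG count need tuning to the workload, and finobt requires crc=1, xfsprogs >= 3.2, and a 3.16+ kernel):

```python
# Build (but don't run) an mkfs.xfs command line reflecting the advice above.
def mkfs_args(device, log_mb=64, agcount=8):
    return ["mkfs.xfs",
            "-l", "size=%dm" % log_mb,     # small log: caps dirty metadata buildup
            "-d", "agcount=%d" % agcount,  # ~10-20x below application concurrency
            "-m", "crc=1,finobt=1",        # free inode btree keeps allocation cheap
            device]
```

Pass the result to something like subprocess.run(mkfs_args("/dev/sdb1"), check=True) on a scratch device.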
-Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2016-02-01 5:28 UTC | newest]
Thread overview: 10+ messages
2016-01-06 15:15 xfs and swift Mark Seger
2016-01-06 22:04 ` Dave Chinner
2016-01-06 22:10 ` Dave Chinner
2016-01-06 22:46 ` Mark Seger
2016-01-06 23:49 ` Dave Chinner
2016-01-25 16:38 ` Mark Seger
2016-02-01 5:27 ` Dave Chinner
2016-01-25 18:24 ` Bernd Schubert
2016-01-25 19:00 ` Mark Seger
2016-01-25 19:33 ` Bernd Schubert