* xfs and swift
@ 2016-01-06 15:15 Mark Seger
  2016-01-06 22:04 ` Dave Chinner
  2016-01-25 18:24 ` Bernd Schubert
  0 siblings, 2 replies; 10+ messages in thread
From: Mark Seger @ 2016-01-06 15:15 UTC (permalink / raw)
  To: Linux fs XFS; +Cc: Laurence Oberman


[-- Attachment #1.1: Type: text/plain, Size: 2381 bytes --]

I've recently found the performance of our development swift system is
degrading over time as the number of objects/files increases.  This is a
relatively small system, each server has 3 400GB disks.  The system I'm
currently looking at has about 70GB tied up in slabs alone, close to 55GB
in xfs inodes and ili, and about 2GB free.  The kernel
is 3.14.57-1-amd64-hlinux.

Here's the way the filesystems are mounted:

/dev/sdb1 on /srv/node/disk0 type xfs
(rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)

I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
100 threads.  If I repeat that test for multiple hours, I see the number
of IOPS steadily decreasing to about 770 and the very next run it drops to
260 and continues to fall from there.  This happens at about 12M files.

The directory structure is 2 tiered, with 1000 directories per tier so we
can have about 1M of them, though they don't currently all exist.
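
For illustration only -- this is not swift's actual path scheme, just the
general shape of a 2-tier hashed layout with a 1000-way fan-out per tier --
placement might look roughly like this in python:

    import hashlib, os

    def object_path(root, name, fanout=1000):
        # hash the object name and pick one of ~1M (1000 x 1000) leaf dirs
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return os.path.join(root, str(h % fanout),
                            str((h // fanout) % fanout), name)

    print(object_path("/srv/node/disk0/objects", "some-object"))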

I've written a collectl plugin that lets me watch many of the xfs stats in
real-time and I also have a test script that exercises the swift PUT code
directly and so eliminates all the inter-node communications.  This script
also allows me to write to the existing swift directories as well as
redirect to an empty structure, so it mimics a clean environment with no
existing subdirectories.

I'm attaching some xfs stats during the run and hope they're readable.
These values are in operations/sec and each line is 1 second's worth of
data.  The first set of numbers is on the clean directory and the second on
the existing 12M file one.  At the bottom of these stats are also the xfs
slab allocations as reported by collectl.  I can also watch these during a
test and can see the number of inode and ili objects steadily grow at about
1K/sec, which is curious since I'm only creating about 300.

If there is anything else I can provide just let me know.

I don't fully understand all the xfs stats but what does jump out at me is
that the XFS read/write ops have increased by a factor of about 5 when the
system is slower.  Right now the collectl plugin is not something I've
released, but if there is interest and someone would like to help me
present the data in a more organized/meaningful manner just let me know.

If there are any tuning suggestions I'm more than happy to try them out.

-mark

[-- Attachment #1.2: Type: text/html, Size: 2715 bytes --]

[-- Attachment #2: tests.txt --]
[-- Type: text/plain, Size: 6565 bytes --]


>>> Fast <<<
#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><----Xstrat---><-------AttrOps-----><--------------INodes-------------->
# Write Reads  Writes WrtKBs NoRoom Force Sleep  ExtA BlkA ExtF ExtF  Look Cre8 Remv Gdnt  Sync Asyn Empt  Quick Split  Gets Sets Rmov List  Atpt  Hit Miss Recy  Dup Recl Chgd 
     53   599       4   1024      0     4     4     3   65   10   70   155    6   14  284     0   75    1      2     0   275    1    0   24     0  149    5    0    0    0    1
     92   836      16   4096      0    16    16    24  117    8   98   200   58   18  272     0  181    1      3     0   235   11    0   10     0  152   39    0    0   15    1
    370   732     295  75520      0   295   295   592  685    6   96  1599 1442  293  829     0 3527    1      9     0   244  290    0   13     0  153 1144    0    0    0    1
    383   837     284  72704      0   284   285   559  683   10  130  1532 1352  284  816     0 3343    0      4     0   236  276    0   10     0  155 1073    0    0    9    0
    341   734     289  73984      0   289   289   583  690    8   68  1574 1393  297  860     0 3472    3      6     0   291  288    0   30     0  143 1105    0    0    0    3
    342   812     291  74496      0   291   291   583  720    6   66  1574 1376  294  840     0 3439    2      2     0   261  289    0   19     0  144 1087    0    0    0    2
    427   415     301  77056      0   301   302   598  843   14  164  1613 1391  305  870     0 3531    1      5     0   279  292    0   26     0  163 1090    0    0    0    1
    401   832     302  77312      0   302   303   598  797   10  130  1604 1390  303  862     0 3522    1      4     0   244  295    0   13     0  148 1093    0    0   90    1
    349   384     275  70400      0   275   275   549  717   10  100  1480 1258  281  814     0 3224    1      4     0   251  270    0   15     0  146  985    0    0    0    1
     79   432       6   1536      0     6     6     9  102    6   96   158    3    3  250     0   47    0      9     0   248    0    0   14     0  156    2    0    0    0    0
     54   253       4   1024      0     4     4     2   64    4   64   157    2    2  274     0   23    0      2     0   284    0    0   26     0  156    1    0    0    0    0

>>> Slow <<<

#<--XFS Ops--><-----------XFS Logging----------><------Extents------><------DirOps-------><----Trans---><----Xstrat---><-------AttrOps-----><--------------INodes-------------->
# Write Reads  Writes WrtKBs NoRoom Force Sleep  ExtA BlkA ExtF ExtF  Look Cre8 Remv Gdnt  Sync Asyn Empt  Quick Split  Gets Sets Rmov List  Atpt  Hit Miss Recy  Dup Recl Chgd 
      0    61       0      0      0     0     0     0    0    0    0   132    0    0  218     0    0    0      0     0   213    0    0    0     0  126    6    0    0    0    0
     59   115      11   2816      0    11    11    16   78    4   65   160   33    9  230     0  104    0      2     0   210    7    0    0     0  128   28    0    0    0    0
   1384  1263     272  69632      0   272   272   423 1998   92 1443   875  872  227  576     0 2639    0     45     0   210  182    0    0     0  153  675    0    0    4    0
   1604  1503     294  75264      0   294   294   438 2201  106 1696   907  890  241  590     0 2772    0     53     0   210  188    0    0     0  151  681    0    0    9    0
   1638  2255     309  79104      0   309   307   460 2314  114 1734   946  934  260  632     0 2942    0     54     0   237  199    0    0     0  193  678    0    0    0    0
   1678  2298     337  86272      0   338   330   486 2326  128 1779  1031  987  291  712     0 3168    0     55     0   284  220    0    4     0  189  712    0    0    0    0
   1578  2423     333  85248      0   332   325   492 2268  118 1649  1041  991  289  714     0 3153    0     51     0   270  222    0    0     0  200  700    0    0    0    0
   1040  1861     239  61184      0   241   231   353 1496   82 1072   774  718  212  588     0 2272    0     33     0   264  164    0    0     0  153  524    0    0 1730    0
   1709  2361     336  86016      0   340   331   485 2401  128 1810  1029  969  291  706     0 3137    0     56     0   278  216    0    4     0  174  722    0    0 4751    0
   1616  2063     325  83200      0   326   321   485 2278  114 1707  1038  973  274  674     1 3067    0     53     0   240  214    0    3     0  173  731    0    0    0    0
   1599  1557     312  79872      0   312   313   482 2274  104 1664  1048  962  260  684     0 2980    0     52     0   224  208    0    7     0  219  681    0    0    0    0
   1114  1312     229  58624      0   229   230   356 1577   72 1152   817  709  192  552     0 2188    0     36     0   216  157    0    3     0  165  524    0    0 3570    0
   1066  1185     175  44800      0   176   176   249 1440   72 1153   585  466  141  440     0 1577    0     36     0   214  104    0    2     0  155  339    0    0 4497    0
     54   487       6   1536      0     6     6     2   64    4   64   137    2    2  238     0   24    0      2     0   216    0    0    3     0  135    2    0    0 2664    0
      0   590       0      0      0     0     0     0    0    0    0   136    0    0  224     0    0    0      0     0   210    0    0    0     0  136    0    0    0    0    0
     54   514       4   1024      0     4     4     2   64    4   64   142    2    2  248     0   24    0      2     0   214    0    0    2     0  140    2    0    0    0    0

>>> slabs <<<


stack@helion-cp1-swobj0001-mgmt:~$ sudo collectl -sY -i:1 -c1 --slabfilt xfs
waiting for 1 second sample...

# SLAB DETAIL
#                           <-----------Objects----------><---------Slab Allocation------><---Change-->
#Name                       InUse   Bytes   Alloc   Bytes   InUse   Bytes   Total   Bytes   Diff    Pct
xfs_btree_cur                1872  389376    1872  389376      48  393216      48  393216      0    0.0
xfs_da_state                 1584  772992    1584  772992      48  786432      48  786432      0    0.0
xfs_dqtrx                       0       0       0       0       0       0       0       0      0    0.0
xfs_efd_item                 4360 1744000    4760 1904000     119 1949696     119 1949696      0    0.0
xfs_icr                         0       0       0       0       0       0       0       0      0    0.0
xfs_ili                    48127K   6976M  48197K   6986M  909380   7104M  909380   7104M      0    0.0
xfs_inode                  48210K  47080M  48244K  47113M   1507K  47123M   1507K  47123M      0    0.0

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-06 15:15 xfs and swift Mark Seger
@ 2016-01-06 22:04 ` Dave Chinner
  2016-01-06 22:10   ` Dave Chinner
  2016-01-25 18:24 ` Bernd Schubert
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-01-06 22:04 UTC (permalink / raw)
  To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> I've recently found the performance our development swift system is
> degrading over time as the number of objects/files increases.  This is a
> relatively small system, each server has 3 400GB disks.  The system I'm
> currently looking at has about 70GB tied up in slabs alone, close to 55GB
> in xfs inodes and ili, and about 2GB free.  The kernel
> is 3.14.57-1-amd64-hlinux.

So you've got 50M cached inodes in memory, and a relatively old kernel.

> Here's the way the filesystems are mounted:
> 
> /dev/sdb1 on /srv/node/disk0 type xfs
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> 
> I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> 100 threads.  If I repeat that tests for multiple hours, I see the number
> of IOPS steadily decreasing to about 770 and the very next run it drops to
> 260 and continues to fall from there.  This happens at about 12M files.

According to the numbers you've provided:

	lookups		creates		removes
Fast:	1550		1350		300
Slow:	1000		 900		250

This is pretty much what I'd expect on the XFS level when going from
a small empty filesystem to one containing 12M 1k files.

That does not correlate to your numbers above, so it's not at all
clear that there is really a problem here at the XFS level.

> The directory structure is 2 tiered, with 1000 directories per tier so we
> can have about 1M of them, though they don't currently all exist.

That's insane.

The xfs directory structure is much, much more space, time, IO and
memory efficient than a directory hierarchy like this. The only thing
you need a directory hash hierarchy for is to provide sufficient
concurrency for your operations, which you would probably get with a
single level with one or two subdirs per filesystem AG.
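
A minimal sketch of that kind of flat layout, for illustration only (the
dir count of 8 is an assumed stand-in for "a couple of subdirs per AG",
not a tested value):

    import hashlib, os

    def flat_object_path(root, name, ndirs=8):
        # one small fixed set of top-level dirs instead of ~1M leaf dirs
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return os.path.join(root, "objs-%02d" % (h % ndirs), name)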

What you are doing is spreading the IO over thousands of different
regions on the disks, and then randomly seeking between them on
every operation. i.e. your workload is seek bound, and your directory
structure has the effect of /maximising/ seeks per operation...


> I've written a collectl plugin that lets me watch many of the xfs stats in

/me sighs and points at PCP: http://pcp.io

> real-time and also have a test script that exercises the swift PUT code
> directly and so eliminates all the inter-node communications.  This script
> also allows me to write to the existing swift directories as well as
> redirect to an empty structure so mimics clean environment with no existing
> subdirectories.

Yet that doesn't behave like an empty filesystem, which is clearly
shown by the fact the caches are full of inodes that aren't being
used by the test. It also points out that allocation of new inodes
will follow the old logarithmic search speed degradation, because
your kernel is sufficiently old that it doesn't support the free
inode btree feature...
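
On a newer kernel and xfsprogs you can check whether a filesystem was made
with that feature with something like the following (illustrative; older
xfsprogs may not print the field at all):

    xfs_info /srv/node/disk0 | grep finobt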

> I'm attaching some xfs stats during the run and hope they're readable.
> These values are in operations/sec and each line is 1 second's worth of
> data.  The first set of numbers is on the clean directory and the second on
> the existing 12M file one.  At the bottom of these stats are also the xfs
> slab allocations as reported by collectl.  I can also watch these during a
> test and can see the number of inode and ilo objects steadily grow at about
> 1K/sec, which is curious since I'm only creating about 300.

It grows at exactly the rate of the lookups being done, which is what
is expected. i.e. for each create being done, there are other
lookups being done first. e.g. directories, other objects to
determine where to create the new one, lookup has to be done before
removes (of which there are a significant number), etc.
> 
> If there is anything else I can provide just let me know.
> 
> I don't fully understand all the xfs stats but what does jump out at me is
> the XFS read/write ops have increased by a factor of about 5 when the
> system is slower.

Which means your application is reading/writing 5x as much
information from the filesystem when it is slow. That's not a
filesystem problem - your application is having to traverse/modify
5x as much information for each object it is creating/modifying.
There's a good chance that's a result of your massively wide
object store directory hierarchy....

i.e. you need to start by understanding what your application is
doing in terms of IO, configuration and algorithms and determine
whether that is optimal before you start looking at whether the
filesystem is actually the bottleneck.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-06 22:04 ` Dave Chinner
@ 2016-01-06 22:10   ` Dave Chinner
  2016-01-06 22:46     ` Mark Seger
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-01-06 22:10 UTC (permalink / raw)
  To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Thu, Jan 07, 2016 at 09:04:54AM +1100, Dave Chinner wrote:
> On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> > I've recently found the performance our development swift system is
> > degrading over time as the number of objects/files increases.  This is a
> > relatively small system, each server has 3 400GB disks.  The system I'm
> > currently looking at has about 70GB tied up in slabs alone, close to 55GB
> > in xfs inodes and ili, and about 2GB free.  The kernel
> > is 3.14.57-1-amd64-hlinux.
> 
> So you go 50M cached inodes in memory, and a relatively old kernel.
> 
> > Here's the way the filesystems are mounted:
> > 
> > /dev/sdb1 on /srv/node/disk0 type xfs
> > (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> > 
> > I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> > 100 threads.  If I repeat that tests for multiple hours, I see the number
> > of IOPS steadily decreasing to about 770 and the very next run it drops to
> > 260 and continues to fall from there.  This happens at about 12M files.
> 
> According to the numbers you've provided:
> 
> 	lookups		creates		removes
> Fast:	1550		1350		300
> Slow:	1000		 900		250
> 
> This is pretty much what I'd expect on the XFS level when going from
> a small empty filesystem to one containing 12M 1k files.
> 
> That does not correlate to your numbers above, so it's not at all
> clear that there is realy a problem here at the XFS level.
> 
> > The directory structure is 2 tiered, with 1000 directories per tier so we
> > can have about 1M of them, though they don't currently all exist.
> 
> That's insane.
> 
> The xfs directory structure is much, much more space, time, IO and
> memory efficient that a directory hierachy like this. The only thing
> you need a directory hash hierarchy for is to provide sufficient
> concurrency for your operations, which you would probably get with a
> single level with one or two subdirs per filesystem AG.

BTW, you might want to read the section on directory block size for
a quick introduction to XFS directory design and scalability:

https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-06 22:10   ` Dave Chinner
@ 2016-01-06 22:46     ` Mark Seger
  2016-01-06 23:49       ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-06 22:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Laurence Oberman, Linux fs XFS


[-- Attachment #1.1: Type: text/plain, Size: 5477 bytes --]

Dave, thanks for getting back to me and the pointer to the config doc.
Lots to absorb and play with.

The real challenge for me is that I'm doing testing at different levels.
While I realize running 100 parallel swift PUT threads on a small system is
not the ideal way to do things, it's the only easy way to get massive
numbers of objects into the filesystem and once there, the performance of
a single stream is pretty poor and by instrumenting the swift code I can
clearly see excess time being spent in creating/writing the objects and so
that's led us to believe the problem lies in the way xfs is configured.
Creating a new directory structure on that same mount point immediately
results in high levels of performance.

As an attempt to try to reproduce the problems w/o swift, I wrote a little
python script that simply creates files in a 2-tier structure, the first
tier consisting of 1024 directories and each directory contains 4096
subdirectories into which 1K files are created.  I'm doing this for 10000
objects at a time and then timing them, reporting the times, 10 per line so
each line represents 100 thousand file creates.
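
A minimal sketch of that kind of create benchmark, for reference only (not
the exact script, just an illustration matching the fan-out and batching
described above):

    import os, random, time

    ROOT = "/srv/node/disk0/test"        # assumed test directory for illustration
    PAYLOAD = b"x" * 1024                # 1K files
    BATCH = 10000

    def create_batch(start):
        for i in range(start, start + BATCH):
            d = os.path.join(ROOT, str(random.randrange(1024)),
                             str(random.randrange(4096)))
            os.makedirs(d, exist_ok=True)
            with open(os.path.join(d, "obj%010d" % i), "wb") as f:
                f.write(PAYLOAD)

    times = []
    for batch in range(100):             # 100 batches of 10K = 1M files
        t0 = time.time()
        create_batch(batch * BATCH)
        times.append(time.time() - t0)
        if len(times) == 10:             # report 10 timings per line
            print(" ".join("%10.6f" % t for t in times))
            times = []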

Here too I'm seeing degradation and if I look at what happens when there
are already 3M files and I write 1M more, I see these creation times/10
thousand:

 1.004236  0.961419  0.996514  1.012150  1.101794  0.999422  0.994796  1.214535  0.997276  1.306736
 2.793429  1.201471  1.133576  1.069682  1.030985  1.096341  1.052602  1.391364  0.999480  1.914125
 1.193892  0.967206  1.263310  0.890472  1.051962  4.253694  1.145573  1.528848 13.586892  4.925790
 3.975442  8.896552  1.197005  3.904226  7.503806  1.294842  1.816422  9.329792  7.270323  5.936545
 7.058685  5.516841  4.527271  1.956592  1.382551  1.510339  1.318341 13.255939  6.938845  4.106066
 2.612064  2.028795  4.647980  7.371628  5.473423  5.823201 14.229120  0.899348  3.539658  8.501498
 4.662593  6.423530  7.980757  6.367012  3.414239  7.364857  4.143751  6.317348 11.393067  1.273371
146.067300  1.317814  1.176529  1.177830 52.206605  1.112854  2.087990 42.328220  1.178436  1.335202
49.118140  1.368696  1.515826 44.690431  0.927428  0.920801  0.985965  1.000591  1.027458 60.650443
 1.771318  2.690499  2.262868  1.061343  0.932998 64.064210 37.726213  1.245129  0.743771  0.996683

Note that one set of 10K took almost 3 minutes!

My main questions at this point are: is this performance expected, and/or
might a newer kernel help?  And might it be possible to significantly
improve things via tuning, or is it what it is?  I do realize I'm starting
with an empty directory tree whose performance degrades as it fills, but if
I wanted to tune for say 10M or maybe 100M files might I be able to expect
more consistent numbers (perhaps starting out at lower performance) as the
number of objects grows?  I'm basically looking for more consistency over a
broader range of numbers of files.

-mark

On Wed, Jan 6, 2016 at 5:10 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Thu, Jan 07, 2016 at 09:04:54AM +1100, Dave Chinner wrote:
> > On Wed, Jan 06, 2016 at 10:15:25AM -0500, Mark Seger wrote:
> > > I've recently found the performance our development swift system is
> > > degrading over time as the number of objects/files increases.  This is
> a
> > > relatively small system, each server has 3 400GB disks.  The system I'm
> > > currently looking at has about 70GB tied up in slabs alone, close to
> 55GB
> > > in xfs inodes and ili, and about 2GB free.  The kernel
> > > is 3.14.57-1-amd64-hlinux.
> >
> > So you go 50M cached inodes in memory, and a relatively old kernel.
> >
> > > Here's the way the filesystems are mounted:
> > >
> > > /dev/sdb1 on /srv/node/disk0 type xfs
> > >
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> > >
> > > I can do about 2000 1K file creates/sec when running 2 minute PUT
> tests at
> > > 100 threads.  If I repeat that tests for multiple hours, I see the
> number
> > > of IOPS steadily decreasing to about 770 and the very next run it
> drops to
> > > 260 and continues to fall from there.  This happens at about 12M files.
> >
> > According to the numbers you've provided:
> >
> >       lookups         creates         removes
> > Fast: 1550            1350            300
> > Slow: 1000             900            250
> >
> > This is pretty much what I'd expect on the XFS level when going from
> > a small empty filesystem to one containing 12M 1k files.
> >
> > That does not correlate to your numbers above, so it's not at all
> > clear that there is realy a problem here at the XFS level.
> >
> > > The directory structure is 2 tiered, with 1000 directories per tier so
> we
> > > can have about 1M of them, though they don't currently all exist.
> >
> > That's insane.
> >
> > The xfs directory structure is much, much more space, time, IO and
> > memory efficient that a directory hierachy like this. The only thing
> > you need a directory hash hierarchy for is to provide sufficient
> > concurrency for your operations, which you would probably get with a
> > single level with one or two subdirs per filesystem AG.
>
> BTW, you might want to read the section on directory block size for
> a quick introduction to XFS directory design and scalability:
>
>
> https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

[-- Attachment #1.2: Type: text/html, Size: 6872 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-06 22:46     ` Mark Seger
@ 2016-01-06 23:49       ` Dave Chinner
  2016-01-25 16:38         ` Mark Seger
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2016-01-06 23:49 UTC (permalink / raw)
  To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> dave, thanks for getting back to me and the pointer to the config doc.
>  lots to absorb and play with.
> 
> the real challenge for me is that I'm doing testing as different levels.
> While i realize running 100 parallel swift PUT threads on a small system is
> not the ideal way to do things, it's the only easy way to get massive
> numbers of objects into the fillesystem and once there, the performance of
> a single stream is pretty poor and by instrumenting the swift code I can
> clearly see excess time being spent in creating/writing the objects and so
> that's lead us to believe the problem lies in the way xfs is configured.
>  creating a new directory structure on that same mount point immediately
> results in high levels of performance.
> 
> As an attempt to try to reproduce the problems w/o swift, I wrote a little
> python script that simply creates files in a 2-tier structure, the first
> tier consisting of 1024 directories and each directory contains 4096
> subdirectories into which 1K files are created.

So you created something with even greater fan-out than what your
swift app is using?

> I'm doing this for 10000
> objects as a time and then timing them, reporting the times, 10 per line so
> each line represents 100 thousand file creates.
> 
> Here too I'm seeing degradation and if I look at what happens when there
> are already 3M files and I write 1M more, I see these creation times/10
> thousand:
> 
>  1.004236  0.961419  0.996514  1.012150  1.101794  0.999422  0.994796
>  1.214535  0.997276  1.306736
>  2.793429  1.201471  1.133576  1.069682  1.030985  1.096341  1.052602
>  1.391364  0.999480  1.914125
>  1.193892  0.967206  1.263310  0.890472  1.051962  4.253694  1.145573
>  1.528848 13.586892  4.925790
>  3.975442  8.896552  1.197005  3.904226  7.503806  1.294842  1.816422
>  9.329792  7.270323  5.936545
>  7.058685  5.516841  4.527271  1.956592  1.382551  1.510339  1.318341
> 13.255939  6.938845  4.106066
>  2.612064  2.028795  4.647980  7.371628  5.473423  5.823201 14.229120
>  0.899348  3.539658  8.501498
>  4.662593  6.423530  7.980757  6.367012  3.414239  7.364857  4.143751
>  6.317348 11.393067  1.273371
> 146.067300  1.317814  1.176529  1.177830 52.206605  1.112854  2.087990
> 42.328220  1.178436  1.335202
> 49.118140  1.368696  1.515826 44.690431  0.927428  0.920801  0.985965
>  1.000591  1.027458 60.650443
>  1.771318  2.690499  2.262868  1.061343  0.932998 64.064210 37.726213
>  1.245129  0.743771  0.996683
> 
> nothing one set of 10K took almost 3 minutes!

Which is no surprise because you have slow disks and a *lot* of
memory. At some point the journal and/or memory is going to fill up
with dirty objects and have to block waiting for writeback. At that
point there's going to be several hundred thousand dirty inodes that
need to be flushed to disk before progress can be made again.  That
metadata writeback will be seek bound, and that's where all the
delay comes from.

We've been through this problem several times now with different
swift users over the past couple of years. Please go and search the
list archives, because every time the solution has been the same:

	- reduce the directory hierarchy to a single level with, at
	  most, the number of directories matching the expected
	  *production* concurrency level 
	- reduce the XFS log size down to 32-128MB to limit dirty
	  metadata object buildup in memory
	- reduce the number of AGs to as small as necessary to
	  maintain /allocation/ concurrency to limit the number of
	  different locations XFS writes to the disks (typically
	  10-20x less than the application level concurrency)
	- use a 3.16+ kernel with the free inode btree on-disk
	  format feature to keep inode allocation CPU overhead low
	  and consistent regardless of the number of inodes already
	  allocated in the filesystem.
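
By way of illustration only, an mkfs invocation along those lines might
look something like this (example values, not tested recommendations for
this particular system; finobt needs a 3.16+ kernel and a recent enough
xfsprogs):

	mkfs.xfs -f -m crc=1,finobt=1 -d agcount=4 -l size=64m /dev/sdb1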

> my main questions at this point are is this performance expected and/or
> might a newer kernel help?  and might it be possible to significantly
> improve things via tuning or is it what it is?  I do realize I'm starting
> with an empty directory tree whose performance degrades as it fills, but if
> I wanted to tune for say 10M or maybe 100M files might I be able to expect

The mkfs defaults will work just fine with that many files in the
filesystem. Your application configuration and data store layout is
likely to be your biggest problem here.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-06 23:49       ` Dave Chinner
@ 2016-01-25 16:38         ` Mark Seger
  2016-02-01  5:27           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-25 16:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Laurence Oberman, Linux fs XFS


[-- Attachment #1.1: Type: text/plain, Size: 19884 bytes --]

Since getting your last reply I've been doing a lot more trying to
understand the behavior of what I'm seeing by writing some non-swift code
that sort of does what swift does with respect to a directory structure.
In my case I have 1024 top level dirs, 4096 under each.  Each 1K file I'm
creating gets its own directory under these so there are clearly a lot of
directories.

xfs writes out about 25M objects and then the performance goes into the
toilet.  I'm sure it's what you said before about having to flush data
causing big delays, but would it be continuous?  Each entry in the
following table shows the time to write 10K files, so the 2 blocks are 1M
files each:

Sat Jan 23 12:15:09 2016
 16.114386  14.656736  14.789760  17.418389  14.613157  15.938176  14.865369  14.962058  17.297193  15.953590
 14.895471  15.560252  14.789937  14.308618  16.390057  16.561789  15.713806  14.843791  15.940992  16.466924
 15.842781  15.611230  17.102329  15.006291  14.454088  17.923662  13.378340  16.084664  15.996794  13.736398
 18.125125  14.462063  18.101833  15.355139  16.603660  14.205896  16.474111  16.212237  15.072443  14.217581
 16.273899  14.905624  17.285019  14.955722  13.769731  18.308619  15.601386  15.832661  14.342416  16.516657
 14.697575  15.719496  16.723135  16.808668  15.443325  14.608358  17.031334  16.426377  13.900535  13.528603
 16.197697  16.839241  14.802707  15.507915  14.864337  15.836943  15.660089  15.998911  13.956739  14.337318
 16.416974  17.729661  14.936045  13.450859  15.943900  15.106077  15.541450  16.523752  16.555945  14.440305
 14.937772  16.486544  13.780310  16.944841  14.867400  18.214934  14.142108  15.931952  14.424949  15.533156
 16.010153  16.323108  14.423508  15.970071  15.277186  15.561362  14.978766  15.855935  16.953906  14.247016
Sat Jan 23 12:41:09 2016
 15.908483  15.638943  17.681281  15.188704  15.721495  13.359225  15.999421  15.858876  16.402176  16.416312
 15.443946  14.675751  15.470643  15.573755  15.422241  16.336590  17.220916  13.974890  15.877780  62.650921
 62.667990  46.334603  53.546195  69.465447  65.006016  68.761229  70.754684  97.571669 104.811261 104.229302
105.605257 105.166030 105.058075 105.519703 106.573306 106.708545 106.114733 105.643131 106.049387 106.379378
104.239131 104.268931 103.852929 103.549319 103.516169 103.007015 103.724020 104.519983 105.839203 105.324985
104.328205 104.932713 103.051548 104.938652 102.769383 102.851609 101.432277 102.269842 100.937972 103.450103
103.477628 103.636130 103.444242 103.023145 102.565047 102.853115 101.402610  98.928230  99.310677  99.669667
101.140554  99.628664 102.093801 100.580659 101.762283 101.369349 102.637014 102.240950 101.778506 101.144526
100.899476 102.294952 102.029285 100.871166 102.763222 102.910690 104.892447 104.748194 105.403636 106.159345
106.413154 104.626632 105.775004 104.579775 104.778526 104.634778 106.233381 104.063642 106.635481 104.314503

If I look at the disk loads at the time, I see a dramatic increase in disk
reads that corresponds to the slow writes, so I'm guessing at least some
writes are waiting in the queue as you can see there - thanks to Laurence
for the patch to show disk read wait times ;)

# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:45:30 sdb              0      0    0    0     0  270040    105 2276  119     4     118    16     4      0   62
12:45:31 sdb              0      0    0    0     0  273776    120 2262  121     4     121    18     4      0   57
12:45:32 sdb              4      0    1    4     0  100164     57  909  110     4     110     6     4      0   84
12:45:33 sdb              0      0    0    0     0  229992     87 1924  120     1     119     2     1      0   68
12:45:34 sdb              4      0    1    4     4  153528     59 1304  118     0     117     1     0      0   78
12:45:35 sdb              0      0    0    0     0  220896     97 1895  117     1     116     1     1      0   62
12:45:36 sdb              0      0    0    0     0  419084    197 3504  120     0     119     1     0      0   32
12:45:37 sdb              0      0    0    0     0  428076    193 3662  117     0     116     1     0      0   32
12:45:38 sdb              0      0    0    0     0  428492    181 3560  120     0     120     1     0      0   30
12:45:39 sdb              0      0    0    0     0  426024    199 3641  117     0     117     1     0      0   32
12:45:40 sdb              0      0    0    0     0  429764    200 3589  120     0     119     1     0      0   28
12:45:41 sdb              0      0    0    0     0  410204    165 3430  120     0     119     3     0      0   36
12:45:42 sdb              0      0    0    0     0  406192    196 3437  118     0     118     5     0      0   39
12:45:43 sdb              0      0    0    0     0  420952    175 3552  119     0     118     1     0      0   34
12:45:44 sdb              0      0    0    0     0  428424    197 3645  118     0     117     1     0      0   31
12:45:45 sdb              0      0    0    0     0  192464     76 1599  120     8     120    18     8      0   75
12:45:46 sdb              0      0    0    0     0  340522    205 2951  115     2     115    16     2      0   41
12:45:47 sdb              0      0    0    0     0  429128    193 3664  117     0     117     1     0      0   28
12:45:48 sdb              0      0    0    0     0  402600    164 3311  122     0     121     3     0      0   39
12:45:49 sdb              0      0    0    0     0  435316    195 3701  118     0     117     1     0      0   36
12:45:50 sdb              0      0    0    0     0  367976    162 3152  117     1     116     7     1      0   46
12:45:51 sdb              0      0    0    0     0  255716    125 2153  119     4     118    16     4      0   60

# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:45:52 sdb              0      0    0    0     0  360144    149 3006  120     1     119     9     1      0   46
12:45:53 sdb              0      0    0    0     0  343500    162 2909  118     1     118    11     1      0   43
12:45:54 sdb              0      0    0    0     0  256636    119 2188  117     2     117    11     2      0   54
12:45:55 sdb              0      0    0    0     0  149000     47 1260  118    14     118    22    14      0   79
12:45:56 sdb              0      0    0    0     0  198544     88 1654  120     7     120    19     7      0   67
12:45:57 sdb              0      0    0    0     0  320688    151 2731  117     1     117     8     1      0   53
12:45:58 sdb              0      0    0    0     0  422176    190 3532  120     0     119     1     0      0   32
12:45:59 sdb              0      0    0    0     0  266540    115 2233  119     5     119    13     5      0   93
12:46:00 sdb              8      0    2    4   690  291116    129 2463  118     3     118     9     3      0   82
12:46:01 sdb              0      0    0    0     0  249964    118 2160  116     4     115    15     4      0   60
12:46:02 sdb           4736      0   37  128     0  424680    167 3522  121     0     120     1     0      0   28
12:46:03 sdb           5016      0   42  119     0  391364    196 3344  117     0     117     6     0      0   34
12:46:04 sdb              0      0    0    0     0  415436    172 3501  119     0     118     2     0      0   33
12:46:05 sdb              0      0    0    0     0  398736    192 3373  118     0     118     3     0      0   39
12:46:06 sdb              0      0    0    0     0  367292    155 3015  122     0     121     6     0      0   39
12:46:07 sdb              0      0    0    0     0  420392    201 3614  116     0     116     1     0      0   30
12:46:08 sdb              0      0    0    0     0  424828    172 3547  120     0     119     1     0      0   32
12:46:09 sdb              0      0    0    0     0  500380    234 4277  117     0     116     2     0      0   34
12:46:10 sdb              0      0    0    0     0  104500      7  698  150     0     149     1     0      1   87
12:46:11 sdb              8      0    1    8  1260   77252     45  647  119     0     119     1     2      1   92
12:46:12 sdb              8      0    1    8  1244   73956     31  615  120     0     120     1     2      1   94
12:46:13 sdb              8      0    1    8   228  149552     64 1256  119     0     118     1     0      0   85

# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:46:14 sdb              8      0    1    8  1232   37124     28  319  116     0     116     1     3      3   99
12:46:15 sdb             16      0    2    8   720    2776     23  120   23     1      22     1    13      8   99
12:46:16 sdb              0      0    0    0     0  108180     16  823  131     0     131     1     0      1   90
12:46:17 sdb              8      0    1    8  1260   37136     28  322  115     0     114     1     3      2   94
12:46:18 sdb              8      0    1    8  1252  108680     57  875  124     0     124     1     1      1   88
12:46:19 sdb              0      0    0    0     0       0      0    0    0     0       0     1     0      0  100
12:46:20 sdb             16      0    2    8   618   81516     49  685  119     0     118     1     1      1   94
12:46:21 sdb             16      0    2    8   640  225788    106 1907  118     0     118     1     0      0   75
12:46:22 sdb             32      0    4    8    95   73892     17  627  118     0     117     1     0      1   93
12:46:23 sdb             24      0    3    8   408  257012    119 2171  118     0     118     1     0      0   65
12:46:24 sdb             12      0    3    4     5    3608      0   20  180     0     157     1     0     43  100
12:46:25 sdb             44      0    7    6   210   74072     41  625  119     0     117     1     2      1   97
12:46:26 sdb             48      0    6    8   216  202852    112 1819  112     0     111     1     0      0   92
12:46:27 sdb             52      0    7    7   233  307156    137 2648  116     0     115     1     0      0   95
12:46:28 sdb             16      0    2    8   100   93168      7  638  146     0     145     1     0      1   97
12:46:29 sdb             16      0    2    8   642   37028     16  319  116     0     115     1     4      3   99
12:46:30 sdb             16      0    2    8   624   39068     36  342  114     0     113     1     3      2   99
12:46:31 sdb             80      0   10    8    94  253892    105 2169  117     0     116     1     0      0   84
12:46:32 sdb              0      0    0    0     0    5676      0   33  172     0     172     1     0     30  100
12:46:33 sdb             16      0    2    8   642   69236     28  583  119     0     118     1     2      1   96
12:46:34 sdb              8      0    1    8  1032   37132     30  315  118     0     117     1     3      3  100
12:46:35 sdb             16      0    2    8   822   56292     15  515  109     0     108     1     3      1  100

# DISK STATISTICS (/sec)
#                    <---------reads---------------><---------writes--------------><--------averages--------> Pct
#Time     Name       KBytes Merged  IOs Size  Wait  KBytes Merged  IOs Size  Wait  RWSize  QLen  Wait SvcTim Util
12:46:36 sdb              8      0    1    8    44   58768     15  452  130     0     129     1     0      2   96
12:46:37 sdb             28      0    4    7   390  114944     89 1100  104     0     104     1     1      0   88
12:46:38 sdb              0      0    0    0     0   29668      0  172  172    12     172     1    12      5   98
12:46:39 sdb             80      0   10    8    90  100084     31  882  113     0     112     1     1      1   91
12:46:40 sdb              0      0    0    0     0   24244      0  139  174     0     174     1     0      7  100
12:46:41 sdb              8      0    1    8  1224       0      0    0    0     0       8     1  1224   1000  100
12:46:42 sdb              8      0    1    8  1244   42368     29  354  120     0     119     1     3      2   96
12:46:43 sdb             36      0    5    7   251   51428     32  507  101     0     100     1     2      1   94
12:46:44 sdb             24      0    3    8    70    5732     31  147   39    15      38     2    16      6   99
12:46:45 sdb             32      0    4    8     4  213056     53 1647  129     0     129     1     0      0   74
12:46:46 sdb              8      0    1    8  1220   37416     28  328  114     0     113     1     3      2   96
12:46:47 sdb              8      0    1    8  1248   58572     67  607   96     0      96     1     2      1   93
12:46:48 sdb             40      0    5    8    84  274808     82 2173  126     0     126     1     0      0   70
12:46:49 sdb              0      0    0    0     0       0      0    0    0     0       0     1     0      0  100
12:46:50 sdb              8      0    1    8  1248       0      0    0    0     0       8     1  1248   1000  100
12:46:51 sdb              8      0    1    8  1272       0      0    0    0     0       8     1  1272   1000  100
12:46:52 sdb             24      0    3    8   414  205240    113 1798  114     0     113     1     0      0   75
12:46:53 sdb              8      0    1    8   876   92476     48  839  110     0     110     1     1      1   89
12:46:54 sdb              0      0    0    0     0   38700      0  225  172     0     172     1     0      4   99
12:46:55 sdb             16      0    2    8   582  150680     73 1262  119     0     119     1     1      0   87
12:46:56 sdb              8      0    1    8  1228       0      0    0    0     0       8     1  1228   1000  100
12:46:57 sdb              8      0    1    8  1244       0      0    0    0     0       8     1  1244   1000  100

Next I played back the collectl process data and sorted by disk reads and
discovered the top process corresponding to the long disk reads was
xfsaild.  BTW - I also see the slab xfs_inode using about 60GB.

It's also worth noting that I'm only doing 1-2MB/sec of writes and the rest
of the data looks like it's coming from xfs journaling because when I look
at the xfs stats I'm seeing on the order of 200-400MB/sec xfs logging
writes - clearly they're not all going to disk.  Once the read waits
increase everything slows down including xfs logging (since it's doing
less).

I'm sure the simple answer may be that it is what it is, but I'm also
wondering whether, without changes to swift itself, there might be some ways
to improve the situation by adding more memory or making any other tuning
changes.  The system I'm currently running my tests on has 128GB.

-mark

On Wed, Jan 6, 2016 at 6:49 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Jan 06, 2016 at 05:46:33PM -0500, Mark Seger wrote:
> > dave, thanks for getting back to me and the pointer to the config doc.
> >  lots to absorb and play with.
> >
> > the real challenge for me is that I'm doing testing as different levels.
> > While i realize running 100 parallel swift PUT threads on a small system
> is
> > not the ideal way to do things, it's the only easy way to get massive
> > numbers of objects into the fillesystem and once there, the performance
> of
> > a single stream is pretty poor and by instrumenting the swift code I can
> > clearly see excess time being spent in creating/writing the objects and
> so
> > that's lead us to believe the problem lies in the way xfs is configured.
> >  creating a new directory structure on that same mount point immediately
> > results in high levels of performance.
> >
> > As an attempt to try to reproduce the problems w/o swift, I wrote a
> little
> > python script that simply creates files in a 2-tier structure, the first
> > tier consisting of 1024 directories and each directory contains 4096
> > subdirectories into which 1K files are created.
>
> So you created something with even greater fan-out than what your
> swift app is using?
>
> > I'm doing this for 10000
> > objects as a time and then timing them, reporting the times, 10 per line
> so
> > each line represents 100 thousand file creates.
> >
> > Here too I'm seeing degradation and if I look at what happens when there
> > are already 3M files and I write 1M more, I see these creation times/10
> > thousand:
> >
> >  1.004236  0.961419  0.996514  1.012150  1.101794  0.999422  0.994796
> >  1.214535  0.997276  1.306736
> >  2.793429  1.201471  1.133576  1.069682  1.030985  1.096341  1.052602
> >  1.391364  0.999480  1.914125
> >  1.193892  0.967206  1.263310  0.890472  1.051962  4.253694  1.145573
> >  1.528848 13.586892  4.925790
> >  3.975442  8.896552  1.197005  3.904226  7.503806  1.294842  1.816422
> >  9.329792  7.270323  5.936545
> >  7.058685  5.516841  4.527271  1.956592  1.382551  1.510339  1.318341
> > 13.255939  6.938845  4.106066
> >  2.612064  2.028795  4.647980  7.371628  5.473423  5.823201 14.229120
> >  0.899348  3.539658  8.501498
> >  4.662593  6.423530  7.980757  6.367012  3.414239  7.364857  4.143751
> >  6.317348 11.393067  1.273371
> > 146.067300  1.317814  1.176529  1.177830 52.206605  1.112854  2.087990
> > 42.328220  1.178436  1.335202
> > 49.118140  1.368696  1.515826 44.690431  0.927428  0.920801  0.985965
> >  1.000591  1.027458 60.650443
> >  1.771318  2.690499  2.262868  1.061343  0.932998 64.064210 37.726213
> >  1.245129  0.743771  0.996683
> >
> > nothing one set of 10K took almost 3 minutes!
>
> Which is no surprise because you have slow disks and a *lot* of
> memory. At some point the journal and/or memory is going to fill up
> with dirty objects and have to block waiting for writeback. At that
> point there's going to be several hundred thousand dirty inodes that
> need to be flushed to disk before progress can be made again.  That
> metadata writeback will be seek bound, and that's where all the
> delay comes from.
>
> We've been through this problem several times now with different
> swift users over the past couple of years. Please go and search the
> list archives, because every time the solution has been the same:
>
>         - reduce the directory heirarchy to a single level with, at
>           most, the number of directories matching the expected
>           *production* concurrency level
>         - reduce the XFS log size down to 32-128MB to limit dirty
>           metadata object buildup in memory
>         - reduce the number of AGs to as small as necessary to
>           maintain /allocation/ concurrency to limit the number of
>           different locations XFS writes to the disks (typically
>           10-20x less than the application level concurrency)
>         - use a 3.16+ kernel with the free inode btree on-disk
>           format feature to keep inode allocation CPU overhead low
>           and consistent regardless of the number of inodes already
>           allocated in the filesystem.
>
> > my main questions at this point are is this performance expected and/or
> > might a newer kernel help?  and might it be possible to significantly
> > improve things via tuning or is it what it is?  I do realize I'm starting
> > with an empty directory tree whose performance degrades as it fills, but
> if
> > I wanted to tune for say 10M or maybe 100M files might I be able to
> expect
>
> The mkfs defaults will work just fine with that many files in the
> filesystem. Your application configuration and data store layout is
> likely to be your biggest problem here.
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com
>

[-- Attachment #1.2: Type: text/html, Size: 29999 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-06 15:15 xfs and swift Mark Seger
  2016-01-06 22:04 ` Dave Chinner
@ 2016-01-25 18:24 ` Bernd Schubert
  2016-01-25 19:00   ` Mark Seger
  1 sibling, 1 reply; 10+ messages in thread
From: Bernd Schubert @ 2016-01-25 18:24 UTC (permalink / raw)
  To: Mark Seger, Linux fs XFS; +Cc: Laurence Oberman

Hi Mark!

On 01/06/2016 04:15 PM, Mark Seger wrote:
> I've recently found the performance our development swift system is
> degrading over time as the number of objects/files increases.  This is a
> relatively small system, each server has 3 400GB disks.  The system I'm
> currently looking at has about 70GB tied up in slabs alone, close to 55GB
> in xfs inodes and ili, and about 2GB free.  The kernel
> is 3.14.57-1-amd64-hlinux.
> 
> Here's the way the filesystems are mounted:
> 
> /dev/sdb1 on /srv/node/disk0 type xfs
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> 
> I can do about 2000 1K file creates/sec when running 2 minute PUT tests at
> 100 threads.  If I repeat that tests for multiple hours, I see the number
> of IOPS steadily decreasing to about 770 and the very next run it drops to
> 260 and continues to fall from there.  This happens at about 12M files.
> 
> The directory structure is 2 tiered, with 1000 directories per tier so we
> can have about 1M of them, though they don't currently all exist.

This sounds pretty much like hash directories as used by some parallel
file systems (Lustre and in the past BeeGFS). For us the file create
slowdown was due to the lookup in directories to check whether a file with
the same name already exists. At least for ext4 it was rather easy to
demonstrate that
simply caching directory blocks would eliminate that issue.
We then considered working on a better kernel cache, but in the end
simply found a way to get rid of such a simple directory structure in
BeeGFS and changed it to a more complex layout, but with less random
access and so we could eliminate the main reason for the slow down.

Now I have no idea what a "swift system" is, in which order it
creates and accesses those files, or whether it would be possible to change
the access pattern. One thing you might try, and which should work much
better since 3.11, is the vfs_cache_pressure setting. The lower it is, the
fewer dentries/inodes are dropped from cache when pages are needed for
file data.
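
For example (the value is only an illustration; lower values mean dentries
and inodes are reclaimed less aggressively):

    sysctl -w vm.vfs_cache_pressure=50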



Cheers,
Bernd
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-25 18:24 ` Bernd Schubert
@ 2016-01-25 19:00   ` Mark Seger
  2016-01-25 19:33     ` Bernd Schubert
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Seger @ 2016-01-25 19:00 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Laurence Oberman, Linux fs XFS


[-- Attachment #1.1: Type: text/plain, Size: 2934 bytes --]

Hey Bernd, long time no chat.  It turns out you don't have to know what
swift is because I've been able to demonstrate this behavior with a very
simple python script that simply creates files in a 3-tier hierarchy.  The
third-level directories each contain a single file, which for my testing is
always 1K.

I have played with cache_pressure and it doesn't seem to make a difference,
though that was awhile ago and perhaps it is worth revisiting.  One thing
you may get a hoot out of, being a collectl user, is I have an xfs plugin
that lets you look at a ton of xfs stats either in realtime or after the
fact just like any other collectl stat.  I just haven't added it to the kit
yet.

-mark

On Mon, Jan 25, 2016 at 1:24 PM, Bernd Schubert <bschubert@ddn.com> wrote:

> Hi Mark!
>
> On 01/06/2016 04:15 PM, Mark Seger wrote:
> > I've recently found the performance our development swift system is
> > degrading over time as the number of objects/files increases.  This is a
> > relatively small system, each server has 3 400GB disks.  The system I'm
> > currently looking at has about 70GB tied up in slabs alone, close to 55GB
> > in xfs inodes and ili, and about 2GB free.  The kernel
> > is 3.14.57-1-amd64-hlinux.
> >
> > Here's the way the filesystems are mounted:
> >
> > /dev/sdb1 on /srv/node/disk0 type xfs
> >
> (rw,noatime,nodiratime,attr2,nobarrier,inode64,logbufs=8,logbsize=256k,sunit=512,swidth=1536,noquota)
> >
> > I can do about 2000 1K file creates/sec when running 2 minute PUT tests
> at
> > 100 threads.  If I repeat that tests for multiple hours, I see the number
> > of IOPS steadily decreasing to about 770 and the very next run it drops
> to
> > 260 and continues to fall from there.  This happens at about 12M files.
> >
> > The directory structure is 2 tiered, with 1000 directories per tier so we
> > can have about 1M of them, though they don't currently all exist.
>
> This sounds pretty much like hash directories as used by some parallel
> file systems (Lustre and in the past BeeGFS). For us the file create
> slow down was due to lookup in directories if a file with the same name
> already exists. At least for ext4 it was rather easy to demonstrate that
> simply caching directory blocks would eliminate that issue.
> We then considered working on a better kernel cache, but in the end
> simply found a way to get rid of such a simple directory structure in
> BeeGFS and changed it to a more complex layout, but with less random
> access and so we could eliminate the main reason for the slow down.
>
> Now I have no idea what a "swift system" is and in which order it
> creates and accesses those files and if it would be possible to change
> the access pattern. One thing you might try and which should work much
> better since 3.11 is the vfs_cache_pressure setting. The lower it is the
> less dentries/inodes are dropped from cache when pages are needed for
> file data.
>
>
>
> Cheers,
> Bernd

[-- Attachment #1.2: Type: text/html, Size: 3529 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-25 19:00   ` Mark Seger
@ 2016-01-25 19:33     ` Bernd Schubert
  0 siblings, 0 replies; 10+ messages in thread
From: Bernd Schubert @ 2016-01-25 19:33 UTC (permalink / raw)
  To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

Hi Mark!

On 01/25/2016 08:00 PM, Mark Seger wrote:
> hey bernd, long time no chat.  it turns out you don't have to know what
> swift is because I've been able to demonstrate this behavior with a very
> simple python script that simply creates files in a 3-tier hierarchy.  the
> third level directories each contain a single file which for my testing are
> all 1K.

So what is the script exactly doing? Does it create those files
sequentially per dir or randomly between those dirs?

Btw, I had been talking about that issue at Linux Plumbers in 2013

https://www.youtube.com/watch?v=N_bZOGZAb-Y

> 
> I have played wiht cache_pressure and it doesn't seem to make a difference,
> though that was awhlle ago and perhaps it is worth revisiting. one thing

There are several patches from Mel Gorman in 3.11, which really made a
difference for me. So unless you tested with >= 3.11 you should probably
re-test.

> you may get a hoot out of, being a collectl user, is I have an xfs plugin
> that lets you look at a ton of xfs stats either in realtime or after the
> fact just like any other collectl stat.  I just havent' added it to the kit
> yet.


Hmm, I currently don't have a good test system for that. I'm working on
an entirely different project now and while this is also a parallel file
system, it does not have a Linux file system in between, but has its own
(log rotated) layout.


Cheers,
Bernd
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: xfs and swift
  2016-01-25 16:38         ` Mark Seger
@ 2016-02-01  5:27           ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2016-02-01  5:27 UTC (permalink / raw)
  To: Mark Seger; +Cc: Laurence Oberman, Linux fs XFS

On Mon, Jan 25, 2016 at 11:38:07AM -0500, Mark Seger wrote:
> since getting your last reply I've been doing a lot more trying to
> understand the behavior of what I'm seeing by writing some non-swift code
> that sort of does what swift does with respect to a directory structure.
>  in my case I have 1024 top level dirs, 4096 under each.  each 1k file I'm
> creating gets it's only directory under these so there are clearly a lot of
> directories.

I'm not sure you understood what I said in my last reply: your
directory structure is the problem, and that's what needs changing.

> xfs writes out about 25M objects and then the performance goes into the
> toilet.  I'm sure what you said before about having to flush data and
> causing big delays, but would it be continuous?

Go read the previous thread on this subject. Or, alternatively, try
some of the suggestions I made, like reducing the log size, to see
how this affects such behaviour.

> each entry in the
> following table shows the time to write 10K files so the 2 blocks are 1M
> each
> 
> Sat Jan 23 12:15:09 2016
>  16.114386  14.656736  14.789760  17.418389  14.613157  15.938176
>  14.865369  14.962058  17.297193  15.953590
.....
>  62.667990  46.334603  53.546195  69.465447  65.006016  68.761229
>  70.754684  97.571669 104.811261 104.229302
> 105.605257 105.166030 105.058075 105.519703 106.573306 106.708545
> 106.114733 105.643131 106.049387 106.379378

Your test goes from operating wholly in memory to being limited by
disk speed because it no longer fits in memory.

> if I look at the disk loads at the time, I see a dramatic increase in disk
> reads that correspond to the slow writes so I'm guessing at least some
.....
> next I played back the collectl process data and sorted by disk reads and
> discovered the top process, corresponding to the long disk reads was
> xfsaild.  btw - I also see the slab xfs_inode using about 60GB.

And there's your problem. You're accumulating gigabytes of dirty
inodes in memory, then wondering why everything goes to crap when
memory fills up and we have to start cleaning inodes. To clean those
inodes, we have to do RMW cycles on the inode cluster buffers, because
the inode cache memory pressure has caused the inode buffers to be
reclaimed from memory before the cached dirty inodes are written. All
the changes I recommended you make also happen to address this problem,
too....

> It's also worth noting that I'm only doing 1-2MB/sec of writes and the rest
> of the data looks like it's coming from xfs journaling because when I look
> at the xfs stats I'm seeing on the order of 200-400MB/sec xfs logging
> writes - clearly they're not all going to disk.

Before delayed logging was introduced 5 years ago, it was quite
common to see XFS writing >500MB/s to the journal. The thing is,
your massive fan-out directory structure is mostly going to defeat
the relogging optimisations that make delayed logging work, so it's
entirely possible that you are seeing this much throughput through
the journal.

> Once the read waits
> increase everything slows down including xfs logging (since it's doing
> less).

Of course, because we can't journal more changes until the dirty
inodes in the journal are cleaned. That's what the xfsaild does -
clean dirty inodes, and the reads coming from that thread are for
cleaning inodes...

> I'm sure the simple answer may be that it is what it is, but I'm also
> wondering without changes to swift itself, might there be some ways to
> improve the situation by adding more memory or making any other tuning
> changes?  The system I'm currently running my tests on has 128GB.

I've already described what you need to do to both the swift
directory layout and the XFS filesystem configuration to minimise
the impact of storing millions of tiny records in a filesystem. I'll
leave the quote from my last email for you:

> > We've been through this problem several times now with different
> > swift users over the past couple of years. Please go and search the
> > list archives, because every time the solution has been the same:
> >
> >         - reduce the directory heirarchy to a single level with, at
> >           most, the number of directories matching the expected
> >           *production* concurrency level
> >         - reduce the XFS log size down to 32-128MB to limit dirty
> >           metadata object buildup in memory
> >         - reduce the number of AGs to as small as necessary to
> >           maintain /allocation/ concurrency to limit the number of
> >           different locations XFS writes to the disks (typically
> >           10-20x less than the application level concurrency)
> >         - use a 3.16+ kernel with the free inode btree on-disk
> >           format feature to keep inode allocation CPU overhead low
> >           and consistent regardless of the number of inodes already
> >           allocated in the filesystem.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-02-01  5:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-06 15:15 xfs and swift Mark Seger
2016-01-06 22:04 ` Dave Chinner
2016-01-06 22:10   ` Dave Chinner
2016-01-06 22:46     ` Mark Seger
2016-01-06 23:49       ` Dave Chinner
2016-01-25 16:38         ` Mark Seger
2016-02-01  5:27           ` Dave Chinner
2016-01-25 18:24 ` Bernd Schubert
2016-01-25 19:00   ` Mark Seger
2016-01-25 19:33     ` Bernd Schubert
