* bad performance on touch/cp file on XFS system
@ 2014-08-25  3:34 Zhang Qiang
  2014-08-25  5:18 ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-25  3:34 UTC (permalink / raw)
  To: xfs



Dear XFS community & developers,

I am using CentOS 6.3 with xfs as the base file system, on top of hardware
RAID5 storage.

Detailed environment as follows:
   OS: CentOS 6.3
   Kernel: kernel-2.6.32-279.el6.x86_64
   XFS mount info: /dev/sdb1 on /data type xfs
(rw,noatime,nodiratime,nobarrier)

Details of the problem:

    # df
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sda1              29G   17G   11G  61% /
    /dev/sdb1             893G  803G   91G  90% /data
    /dev/sda4             2.2T  1.6T  564G  75% /data1

    # time touch /data1/1111
    real    0m23.043s
    user    0m0.001s
    sys     0m0.349s

    # perf top
    Events: 6K cycles
     16.96%  [xfs]                     [k] xfs_inobt_get_rec
     11.95%  [xfs]                     [k] xfs_btree_increment
     11.16%  [xfs]                     [k] xfs_btree_get_rec
      7.39%  [xfs]                     [k] xfs_btree_get_block
      5.02%  [xfs]                     [k] xfs_dialloc
      4.87%  [xfs]                     [k] xfs_btree_rec_offset
      4.33%  [xfs]                     [k] xfs_btree_readahead
      4.13%  [xfs]                     [k] _xfs_buf_find
      4.05%  [kernel]                  [k] intel_idle
      2.89%  [xfs]                     [k] xfs_btree_rec_addr
      1.04%  [kernel]                  [k] kmem_cache_free


It seems that some xfs kernel functions (xfs_inobt_get_rec,
xfs_btree_increment, etc.) are consuming a lot of CPU time.

I found a bug in bugzilla [1]; is that the same issue as this one?

Any constructive suggestions about this issue would be greatly appreciated,
as it's really hard to reproduce on another system and it's not possible to
upgrade that production machine.


[1] https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=813137

Thanks in advance
Qiang


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  3:34 bad performance on touch/cp file on XFS system Zhang Qiang
@ 2014-08-25  5:18 ` Dave Chinner
  2014-08-25  8:09   ` Zhang Qiang
  2014-08-25  8:47   ` Zhang Qiang
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2014-08-25  5:18 UTC (permalink / raw)
  To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> Dear XFS community & developers,
> 
> I am using CentOS 6.3 and xfs as base file system and use RAID5 as hardware
> storage.
> 
> Detail environment as follow:
>    OS: CentOS 6.3
>    Kernel: kernel-2.6.32-279.el6.x86_64
>    XFS option info(df output): /dev/sdb1 on /data type xfs
> (rw,noatime,nodiratime,nobarrier)
> 
> Detail phenomenon:
> 
>     # df
>     Filesystem            Size  Used Avail Use% Mounted on
>     /dev/sda1              29G   17G   11G  61% /
>     /dev/sdb1             893G  803G   91G  90% /data
>     /dev/sda4             2.2T  1.6T  564G  75% /data1
> 
>     # time touch /data1/1111
>     real    0m23.043s
>     user    0m0.001s
>     sys     0m0.349s
> 
>     # perf top
>     Events: 6K cycles
>      16.96%  [xfs]                     [k] xfs_inobt_get_rec
>      11.95%  [xfs]                     [k] xfs_btree_increment
>      11.16%  [xfs]                     [k] xfs_btree_get_rec
>       7.39%  [xfs]                     [k] xfs_btree_get_block
>       5.02%  [xfs]                     [k] xfs_dialloc
>       4.87%  [xfs]                     [k] xfs_btree_rec_offset
>       4.33%  [xfs]                     [k] xfs_btree_readahead
>       4.13%  [xfs]                     [k] _xfs_buf_find
>       4.05%  [kernel]                  [k] intel_idle
>       2.89%  [xfs]                     [k] xfs_btree_rec_addr
>       1.04%  [kernel]                  [k] kmem_cache_free
> 
> 
> It seems that some xfs kernel function spend much time (xfs_inobt_get_rec,
> xfs_btree_increment, etc.)
> 
> I found a bug in bugzilla [1], is that is the same issue like this?

No.

> It's very greatly appreciated if you can give constructive suggestion about
> this issue, as It's really hard to reproduce from another system and it's
> not possible to do upgrade on that online machine.

You've got very few free inodes, widely distributed in the allocated
inode btree. The CPU time above is the btree search for the next
free inode.

This is the issue solved by this series of recent commits to add a
new on-disk free inode btree index:

53801fd xfs: enable the finobt feature on v5 superblocks
0c153c1 xfs: report finobt status in fs geometry
a3fa516 xfs: add finobt support to growfs
3efa4ff xfs: update the finobt on inode free
2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
6dd8638 xfs: use and update the finobt on inode allocation
0aa0a75 xfs: insert newly allocated inode chunks into the finobt
9d43b18 xfs: update inode allocation/free transaction reservations for finobt
aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers

Which is of no help to you, however, because it's not available in
any CentOS kernel.

There's really not much you can do to avoid the problem once you've
punched random freespace holes in the allocated inode btree. It
generally doesn't affect many people; those that it does affect are
normally using XFS as an object store indexed by a hard link farm
(e.g. various backup programs do this).

If you dump the superblock via xfs_db, the difference between icount
and ifree will give you an idea of how much "needle in a haystack"
searching is going on. You can probably narrow it down to a specific
AG by dumping the AGI headers and checking the same thing. Filling
in all the holes (by creating a bunch of zero length files in the
appropriate AGs) might take some time, but it should make the
problem go away until you remove more files and create random
free inode holes again...
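
For example, something like this should give those numbers (just a sketch
using the device name from your report; -r opens the device read-only, and
the "agi 0" command can be repeated for each AG):

# xfs_db -r -c "sb 0" -c "p icount ifree" /dev/sda4
# xfs_db -r -c "agi 0" -c "p count freecount" /dev/sda4

count minus freecount in each AGI shows how many allocated-but-not-free
inodes that AG's btree is carrying.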

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  5:18 ` Dave Chinner
@ 2014-08-25  8:09   ` Zhang Qiang
  2014-08-25  8:56     ` Dave Chinner
  2014-08-25  8:47   ` Zhang Qiang
  1 sibling, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-25  8:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Thanks for your quick and clear response. Some comments below:


2014-08-25 13:18 GMT+08:00 Dave Chinner <david@fromorbit.com>:

> On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> > Dear XFS community & developers,
> >
> > I am using CentOS 6.3 and xfs as base file system and use RAID5 as
> hardware
> > storage.
> >
> > Detail environment as follow:
> >    OS: CentOS 6.3
> >    Kernel: kernel-2.6.32-279.el6.x86_64
> >    XFS option info(df output): /dev/sdb1 on /data type xfs
> > (rw,noatime,nodiratime,nobarrier)
> >
> > Detail phenomenon:
> >
> >     # df
> >     Filesystem            Size  Used Avail Use% Mounted on
> >     /dev/sda1              29G   17G   11G  61% /
> >     /dev/sdb1             893G  803G   91G  90% /data
> >     /dev/sda4             2.2T  1.6T  564G  75% /data1
> >
> >     # time touch /data1/1111
> >     real    0m23.043s
> >     user    0m0.001s
> >     sys     0m0.349s
> >
> >     # perf top
> >     Events: 6K cycles
> >      16.96%  [xfs]                     [k] xfs_inobt_get_rec
> >      11.95%  [xfs]                     [k] xfs_btree_increment
> >      11.16%  [xfs]                     [k] xfs_btree_get_rec
> >       7.39%  [xfs]                     [k] xfs_btree_get_block
> >       5.02%  [xfs]                     [k] xfs_dialloc
> >       4.87%  [xfs]                     [k] xfs_btree_rec_offset
> >       4.33%  [xfs]                     [k] xfs_btree_readahead
> >       4.13%  [xfs]                     [k] _xfs_buf_find
> >       4.05%  [kernel]                  [k] intel_idle
> >       2.89%  [xfs]                     [k] xfs_btree_rec_addr
> >       1.04%  [kernel]                  [k] kmem_cache_free
> >
> >
> > It seems that some xfs kernel function spend much time
> (xfs_inobt_get_rec,
> > xfs_btree_increment, etc.)
> >
> > I found a bug in bugzilla [1], is that is the same issue like this?
>
> No.
>


>
> > It's very greatly appreciated if you can give constructive suggestion
> about
> > this issue, as It's really hard to reproduce from another system and it's
> > not possible to do upgrade on that online machine.
>
> You've got very few free inodes, widely distributed in the allocated
> inode btree. The CPU time above is the btree search for the next
> free inode.
>
> This is the issue solved by this series of recent commits to add a
> new on-disk free inode btree index:
>
[Qiang] This means that if I want to fix this issue, I have to apply the
following patches and build my own kernel.

Since the on-disk structure has changed, should I also re-create the xfs
filesystem? Is there any user-space tool to convert the old on-disk format to
the new one, without having to back up and restore the current data?



>
> 53801fd xfs: enable the finobt feature on v5 superblocks
> 0c153c1 xfs: report finobt status in fs geometry
> a3fa516 xfs: add finobt support to growfs
> 3efa4ff xfs: update the finobt on inode free
> 2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt()
> helper
> 6dd8638 xfs: use and update the finobt on inode allocation
> 0aa0a75 xfs: insert newly allocated inode chunks into the finobt
> 9d43b18 xfs: update inode allocation/free transaction reservations for
> finobt
> aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
> 8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
> 57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
>
> Which is of no help to you, however, because it's not available in
> any CentOS kernel.
>
[Qiang] Do you think it's possible to just backport these patches to
kernel 2.6.32 (CentOS 6.3) to fix this issue?

Or would it be better to backport them to the 3.10 kernel used in CentOS 7.0?



> There's really not much you can do to avoid the problem once you've
> punched random freespace holes in the allocated inode btree. IT
> generally doesn't affect many people; those that it does affect are
> normally using XFS as an object store indexed by a hard link farm
> (e.g. various backup programs do this).
>
OK, I see.

Could you please guide me on how to reproduce this issue easily? I have tried
a 500G xfs partition and used about 98% of the space, but still can't
reproduce the issue. Is there an easy way you can think of?


> If you dump the superblock via xfs_db, the difference between icount
> and ifree will give you idea of how much "needle in a haystack"
> searching is going on. You can probably narrow it down to a specific
> AG by dumping the AGI headers and checking the same thing. filling
> in all the holes (by creating a bunch of zero length files in the
> appropriate AGs) might take some time, but it should make the
> problem go away until you remove more filesystem and create random
> free inode holes again...
>

I will investigate the issue in more detail.

Thanks for your kind response.
Qiang


>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  5:18 ` Dave Chinner
  2014-08-25  8:09   ` Zhang Qiang
@ 2014-08-25  8:47   ` Zhang Qiang
  2014-08-25  9:08     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-25  8:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



I have checked icount and ifree, and found that about 11.8 percent of the
inodes are free, so free inodes should not be too scarce.

Here's the detailed log; any new clues?

# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384
blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=569089536, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=277875, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# umount /dev/sda4
# xfs_db /dev/sda4
xfs_db> sb 0
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 220619904
ifree = 26202919
fdblocks = 147805479
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 1
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 2
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = null
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 3
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa



2014-08-25 13:18 GMT+08:00 Dave Chinner <david@fromorbit.com>:

> On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> > Dear XFS community & developers,
> >
> > I am using CentOS 6.3 and xfs as base file system and use RAID5 as
> hardware
> > storage.
> >
> > Detail environment as follow:
> >    OS: CentOS 6.3
> >    Kernel: kernel-2.6.32-279.el6.x86_64
> >    XFS option info(df output): /dev/sdb1 on /data type xfs
> > (rw,noatime,nodiratime,nobarrier)
> >
> > Detail phenomenon:
> >
> >     # df
> >     Filesystem            Size  Used Avail Use% Mounted on
> >     /dev/sda1              29G   17G   11G  61% /
> >     /dev/sdb1             893G  803G   91G  90% /data
> >     /dev/sda4             2.2T  1.6T  564G  75% /data1
> >
> >     # time touch /data1/1111
> >     real    0m23.043s
> >     user    0m0.001s
> >     sys     0m0.349s
> >
> >     # perf top
> >     Events: 6K cycles
> >      16.96%  [xfs]                     [k] xfs_inobt_get_rec
> >      11.95%  [xfs]                     [k] xfs_btree_increment
> >      11.16%  [xfs]                     [k] xfs_btree_get_rec
> >       7.39%  [xfs]                     [k] xfs_btree_get_block
> >       5.02%  [xfs]                     [k] xfs_dialloc
> >       4.87%  [xfs]                     [k] xfs_btree_rec_offset
> >       4.33%  [xfs]                     [k] xfs_btree_readahead
> >       4.13%  [xfs]                     [k] _xfs_buf_find
> >       4.05%  [kernel]                  [k] intel_idle
> >       2.89%  [xfs]                     [k] xfs_btree_rec_addr
> >       1.04%  [kernel]                  [k] kmem_cache_free
> >
> >
> > It seems that some xfs kernel function spend much time
> (xfs_inobt_get_rec,
> > xfs_btree_increment, etc.)
> >
> > I found a bug in bugzilla [1], is that is the same issue like this?
>
> No.
>
> > It's very greatly appreciated if you can give constructive suggestion
> about
> > this issue, as It's really hard to reproduce from another system and it's
> > not possible to do upgrade on that online machine.
>
> You've got very few free inodes, widely distributed in the allocated
> inode btree. The CPU time above is the btree search for the next
> free inode.
>
> This is the issue solved by this series of recent commits to add a
> new on-disk free inode btree index:
>
> 53801fd xfs: enable the finobt feature on v5 superblocks
> 0c153c1 xfs: report finobt status in fs geometry
> a3fa516 xfs: add finobt support to growfs
> 3efa4ff xfs: update the finobt on inode free
> 2b64ee5 xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt()
> helper
> 6dd8638 xfs: use and update the finobt on inode allocation
> 0aa0a75 xfs: insert newly allocated inode chunks into the finobt
> 9d43b18 xfs: update inode allocation/free transaction reservations for
> finobt
> aafc3c2 xfs: support the XFS_BTNUM_FINOBT free inode btree type
> 8e2c84d xfs: reserve v5 superblock read-only compat. feature bit for finobt
> 57bd3db xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
>
> Which is of no help to you, however, because it's not available in
> any CentOS kernel.
>
> There's really not much you can do to avoid the problem once you've
> punched random freespace holes in the allocated inode btree. IT
> generally doesn't affect many people; those that it does affect are
> normally using XFS as an object store indexed by a hard link farm
> (e.g. various backup programs do this).
>
> If you dump the superblock via xfs_db, the difference between icount
> and ifree will give you idea of how much "needle in a haystack"
> searching is going on. You can probably narrow it down to a specific
> AG by dumping the AGI headers and checking the same thing. filling
> in all the holes (by creating a bunch of zero length files in the
> appropriate AGs) might take some time, but it should make the
> problem go away until you remove more filesystem and create random
> free inode holes again...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  8:09   ` Zhang Qiang
@ 2014-08-25  8:56     ` Dave Chinner
  2014-08-25  9:05       ` Zhang Qiang
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-08-25  8:56 UTC (permalink / raw)
  To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 04:09:05PM +0800, Zhang Qiang wrote:
> Thanks for your quick and clear response. Some comments bellow:
> 
> 
> 2014-08-25 13:18 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> 
> > On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> > > Dear XFS community & developers,
> > >
> > > I am using CentOS 6.3 and xfs as base file system and use RAID5 as
> > hardware
> > > storage.
> > >
> > > Detail environment as follow:
> > >    OS: CentOS 6.3
> > >    Kernel: kernel-2.6.32-279.el6.x86_64
> > >    XFS option info(df output): /dev/sdb1 on /data type xfs
> > > (rw,noatime,nodiratime,nobarrier)
....

> > > It's very greatly appreciated if you can give constructive suggestion
> > about
> > > this issue, as It's really hard to reproduce from another system and it's
> > > not possible to do upgrade on that online machine.
> >
> > You've got very few free inodes, widely distributed in the allocated
> > inode btree. The CPU time above is the btree search for the next
> > free inode.
> >
> > This is the issue solved by this series of recent commits to add a
> > new on-disk free inode btree index:
> >
> [Qiang] This meas that if I want to fix this issue, I have to apply the
> following patches and build my own kernel.

Yes. Good luck, even I wouldn't attempt to do that.

And then use xfsprogs 3.2.1, and make a new filesystem that enables
metadata CRCs and the free inode btree feature.

> As the on-disk structure has been changed, so should I also re-create xfs
> filesystem again?

Yes, you need to download the latest xfsprogs (3.2.1) to be able to
make it with the necessary feature bits set.
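
With xfsprogs 3.2.1 that would be something like the following (a sketch
only; the device name is a placeholder and any stripe geometry options
would be added as usual):

# mkfs.xfs -m crc=1,finobt=1 /dev/sdXN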

> is there any user space tools to convert old disk
> filesystem to new one, and don't need to backup and restore currently data?

No, we don't write utilities to mangle on disk formats. dump, mkfs
and restore is far more reliable than any "in-place conversion" code
we could write. It will probably be faster, too.
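
A rough outline of that cycle, using the device and mount point from this
thread (a sketch; the dump file has to live on a different filesystem and
the session/media labels are arbitrary):

# xfsdump -l 0 -L data1 -M data1 -f /backup/data1.level0 /data1
# umount /data1
# mkfs.xfs -f -m crc=1,finobt=1 /dev/sda4    # feature bits as above
# mount -o inode64 /dev/sda4 /data1
# xfsrestore -f /backup/data1.level0 /data1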

> > Which is of no help to you, however, because it's not available in
> > any CentOS kernel.
> >
> [Qiang] Do you think if it's possible to just backport these patches to
> kernel  6.2.32 (CentOS 6.3) to fix this issue?
> 
> Or it's better to backport to 3.10 kernel, used in CentOS 7.0?

You can try, but if you break it you get to keep all the pieces
yourself. Eventually someone who maintains the RHEL code will do a
backport that will trickle down to CentOS. If you need it any
sooner, then you'll need to do it yourself, or upgrade to RHEL
and ask your support contact for it to be included in RHEL 7.1....

> > There's really not much you can do to avoid the problem once you've
> > punched random freespace holes in the allocated inode btree. IT
> > generally doesn't affect many people; those that it does affect are
> > normally using XFS as an object store indexed by a hard link farm
> > (e.g. various backup programs do this).
> >
> OK, I see.
> 
> Could you please guide me to reproduce this issue easily? as I have tried
> to use a 500G xfs partition, and use about 98 % spaces, but still can't
> reproduce this issue. Is there any easy way from your mind?

Search the archives for the test cases that were used for the patch
set. There's a performance test case documented in the review
discussions.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  8:56     ` Dave Chinner
@ 2014-08-25  9:05       ` Zhang Qiang
  0 siblings, 0 replies; 15+ messages in thread
From: Zhang Qiang @ 2014-08-25  9:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Great, thank you.

From my xfs_db debugging, I found the following icount and ifree:

icount = 220619904
ifree = 26202919

So free inodes are about 10% of the total, which is not that few.

So, are you still sure the patches can fix this issue?

Here's the detailed xfs_db info:

# mount /dev/sda4 /data1/
# xfs_info /data1/
meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384
blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=569089536, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=277875, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# umount /dev/sda4
# xfs_db /dev/sda4
xfs_db> sb 0
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = 129
rsumino = 130
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 0
imax_pct = 5
icount = 220619904
ifree = 26202919
fdblocks = 147805479
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 1
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 2
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = null
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa
xfs_db> sb 3
xfs_db> p
magicnum = 0x58465342
blocksize = 4096
dblocks = 569089536
rblocks = 0
rextents = 0
uuid = 13ecf47b-52cf-4944-9a71-885bddc5e008
logstart = 536870916
rootino = 128
rbmino = null
rsumino = null
rextsize = 1
agblocks = 142272384
agcount = 4
rbmblocks = 0
logblocks = 277875
versionnum = 0xb4a4
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 28
rextslog = 0
inprogress = 1
imax_pct = 5
icount = 0
ifree = 0
fdblocks = 568811645
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 0
width = 0
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 1
features2 = 0xa
bad_features2 = 0xa


Thanks
Qiang



2014-08-25 16:56 GMT+08:00 Dave Chinner <david@fromorbit.com>:

> On Mon, Aug 25, 2014 at 04:09:05PM +0800, Zhang Qiang wrote:
> > Thanks for your quick and clear response. Some comments bellow:
> >
> >
> > 2014-08-25 13:18 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> >
> > > On Mon, Aug 25, 2014 at 11:34:34AM +0800, Zhang Qiang wrote:
> > > > Dear XFS community & developers,
> > > >
> > > > I am using CentOS 6.3 and xfs as base file system and use RAID5 as
> > > hardware
> > > > storage.
> > > >
> > > > Detail environment as follow:
> > > >    OS: CentOS 6.3
> > > >    Kernel: kernel-2.6.32-279.el6.x86_64
> > > >    XFS option info(df output): /dev/sdb1 on /data type xfs
> > > > (rw,noatime,nodiratime,nobarrier)
> ....
>
> > > > It's very greatly appreciated if you can give constructive suggestion
> > > about
> > > > this issue, as It's really hard to reproduce from another system and
> it's
> > > > not possible to do upgrade on that online machine.
> > >
> > > You've got very few free inodes, widely distributed in the allocated
> > > inode btree. The CPU time above is the btree search for the next
> > > free inode.
> > >
> > > This is the issue solved by this series of recent commits to add a
> > > new on-disk free inode btree index:
> > >
> > [Qiang] This meas that if I want to fix this issue, I have to apply the
> > following patches and build my own kernel.
>
> Yes. Good luck, even I wouldn't attempt to do that.
>
> And then use xfsprogs 3.2.1, and make a new filesystem that enables
> metadata CRCs and the free inode btree feature.
>
> > As the on-disk structure has been changed, so should I also re-create xfs
> > filesystem again?
>
> Yes, you need to download the latest xfsprogs (3.2.1) to be able to
> make it with the necessary feature bits set.
>
> > is there any user space tools to convert old disk
> > filesystem to new one, and don't need to backup and restore currently
> data?
>
> No, we don't write utilities to mangle on disk formats. dump, mkfs
> and restore is far more reliable than any "in-place conversion" code
> we could write. It will probably be faster, too.
>
> > > Which is of no help to you, however, because it's not available in
> > > any CentOS kernel.
> > >
> > [Qiang] Do you think if it's possible to just backport these patches to
> > kernel  6.2.32 (CentOS 6.3) to fix this issue?
> >
> > Or it's better to backport to 3.10 kernel, used in CentOS 7.0?
>
> You can try, but if you break it you get to keep all the pieces
> yourself. Eventually someone who maintains the RHEL code will do a
> backport that will trickle down to CentOS. If you need it any
> sooner, then you'll need to do it yourself, or upgrade to RHEL
> and ask your support contact for it to be included in RHEL 7.1....
>
> > > There's really not much you can do to avoid the problem once you've
> > > punched random freespace holes in the allocated inode btree. IT
> > > generally doesn't affect many people; those that it does affect are
> > > normally using XFS as an object store indexed by a hard link farm
> > > (e.g. various backup programs do this).
> > >
> > OK, I see.
> >
> > Could you please guide me to reproduce this issue easily? as I have tried
> > to use a 500G xfs partition, and use about 98 % spaces, but still can't
> > reproduce this issue. Is there any easy way from your mind?
>
> Search the archives for the test cases that were used for the patch
> set. There's a performance test case documented in the review
> discussions.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  8:47   ` Zhang Qiang
@ 2014-08-25  9:08     ` Dave Chinner
  2014-08-25 10:31       ` Zhang Qiang
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-08-25  9:08 UTC (permalink / raw)
  To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
> I have checked icount and ifree, but I found there are about 11.8 percent
> free, so the free inode should not be too few.
> 
> Here's the detail log, any new clue?
> 
> # mount /dev/sda4 /data1/
> # xfs_info /data1/
> meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384

4 AGs

> icount = 220619904
> ifree = 26202919

And 220 million inodes. There's your problem - that's an average
of 55 million inodes per AGI btree assuming you are using inode64.
If you are using inode32, then the inodes will be in 2 btrees, or
maybe even only one.

Anyway you look at it, searching btrees with tens of millions of
entries is going to consume a *lot* of CPU time. So, really, the
state your fs is in is probably unfixable without mkfs. And really,
that's probably pushing the boundaries of what xfsdump and
xfsrestore can support - it's going to take a long time to dump and
restore that data....

With that many inodes, I'd be considering moving to 32 or 64 AGs to
keep the btree size down to a more manageable size. The free inode
btree would also help, but, really, 220M inodes in a 2TB filesystem
is really pushing the boundaries of sanity.....
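
At mkfs time that is just the agcount option, e.g. (a sketch; the finobt/crc
options mentioned earlier and any stripe geometry would be added as
appropriate):

# mkfs.xfs -f -d agcount=32 /dev/sda4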

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25  9:08     ` Dave Chinner
@ 2014-08-25 10:31       ` Zhang Qiang
  2014-08-25 22:26         ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-25 10:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



2014-08-25 17:08 GMT+08:00 Dave Chinner <david@fromorbit.com>:

> On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
> > I have checked icount and ifree, but I found there are about 11.8 percent
> > free, so the free inode should not be too few.
> >
> > Here's the detail log, any new clue?
> >
> > # mount /dev/sda4 /data1/
> > # xfs_info /data1/
> > meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384
>
> 4 AGs
>
Yes.

>
> > icount = 220619904
> > ifree = 26202919
>
> And 220 million inodes. There's your problem - that's an average
> of 55 million inodes per AGI btree assuming you are using inode64.
> If you are using inode32, then the inodes will be in 2 btrees, or
> maybe even only one.
>

You are right, all the inodes are in one AG.

BTW, I allocated 4 AGs, so why do all the inodes stay in one AG with inode32?
Sorry, I am not very familiar with xfs yet.


>
> Anyway you look at it, searching btrees with tens of millions of
> entries is going to consume a *lot* of CPU time. So, really, the
> state your fs is in is probably unfixable without mkfs. And really,
> that's probably pushing the boundaries of what xfsdump and
> xfs-restore can support - it's going to take a long tiem to dump and
> restore that data....
>

 That sounds reasonable, thanks.



> With that many inodes, I'd be considering moving to 32 or 64 AGs to
> keep the btree size down to a more manageable size. The free inode

btree would also help, but, really, 220M inodes in a 2TB filesystem
> is really pushing the boundaries of sanity.....
>

So a better number of inodes per AG would be about 5 million? Are there any
documents about these options where I can learn more?

I will spend more time learning how to use xfs and its internals, and try to
contribute code.

Thanks for your help.



> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25 10:31       ` Zhang Qiang
@ 2014-08-25 22:26         ` Dave Chinner
  2014-08-25 22:46           ` Greg Freemyer
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-08-25 22:26 UTC (permalink / raw)
  To: Zhang Qiang; +Cc: xfs

On Mon, Aug 25, 2014 at 06:31:10PM +0800, Zhang Qiang wrote:
> 2014-08-25 17:08 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> 
> > On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
> > > I have checked icount and ifree, but I found there are about 11.8 percent
> > > free, so the free inode should not be too few.
> > >
> > > Here's the detail log, any new clue?
> > >
> > > # mount /dev/sda4 /data1/
> > > # xfs_info /data1/
> > > meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384
> >
> > 4 AGs
> >
> Yes.
> 
> >
> > > icount = 220619904
> > > ifree = 26202919
> >
> > And 220 million inodes. There's your problem - that's an average
> > of 55 million inodes per AGI btree assuming you are using inode64.
> > If you are using inode32, then the inodes will be in 2 btrees, or
> > maybe even only one.
> >
> 
> You are right, all inodes stay on one AG.
> 
> BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,

Because the top addresses in the 2nd AG go over 32 bits, hence only
AG 0 can be used for inodes. Changing to inode64 will give you some
relief, but any time allocation occurs in AG0 it will be slow. i.e.
you'll be trading always slow for "unpredictably slow".
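
A quick back-of-the-envelope check with the superblock values posted earlier
(agblklog = 28, inopblog = 4): an inode number is the AG number shifted left
by agblklog + inopblog bits, with the block and slot within the AG in the low
bits, so any inode outside AG 0 needs more than 32 bits:

# echo $(( 28 + 4 ))      # bits used for the block and slot within one AG
32
# echo $(( 1 << 32 ))     # first possible inode number in AG 1
4294967296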

> > With that many inodes, I'd be considering moving to 32 or 64 AGs to
> > keep the btree size down to a more manageable size. The free inode
> 
> btree would also help, but, really, 220M inodes in a 2TB filesystem
> > is really pushing the boundaries of sanity.....
> >
> 
> So the better inodes size in one AG is about 5M,

Not necessarily. But for your storage it's almost certainly going to
minimise the problem you are seeing.

> is there any documents
> about these options I can learn more?

http://xfs.org/index.php/XFS_Papers_and_Documentation

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25 22:26         ` Dave Chinner
@ 2014-08-25 22:46           ` Greg Freemyer
  2014-08-26  2:37             ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Greg Freemyer @ 2014-08-25 22:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Zhang Qiang, xfs-oss

On Mon, Aug 25, 2014 at 6:26 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Aug 25, 2014 at 06:31:10PM +0800, Zhang Qiang wrote:
>> 2014-08-25 17:08 GMT+08:00 Dave Chinner <david@fromorbit.com>:
>>
>> > On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
>> > > I have checked icount and ifree, but I found there are about 11.8 percent
>> > > free, so the free inode should not be too few.
>> > >
>> > > Here's the detail log, any new clue?
>> > >
>> > > # mount /dev/sda4 /data1/
>> > > # xfs_info /data1/
>> > > meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384
>> >
>> > 4 AGs
>> >
>> Yes.
>>
>> >
>> > > icount = 220619904
>> > > ifree = 26202919
>> >
>> > And 220 million inodes. There's your problem - that's an average
>> > of 55 million inodes per AGI btree assuming you are using inode64.
>> > If you are using inode32, then the inodes will be in 2 btrees, or
>> > maybe even only one.
>> >
>>
>> You are right, all inodes stay on one AG.
>>
>> BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
>
> Because the top addresses in the 2nd AG go over 32 bits, hence only
> AG 0 can be used for inodes. Changing to inode64 will give you some
> relief, but any time allocation occurs in AG0 is will be slow. i.e.
> you'll be trading always slow for "unpredictably slow".
>
>> > With that many inodes, I'd be considering moving to 32 or 64 AGs to
>> > keep the btree size down to a more manageable size. The free inode
>>
>> btree would also help, but, really, 220M inodes in a 2TB filesystem
>> > is really pushing the boundaries of sanity.....
>> >
>>
>> So the better inodes size in one AG is about 5M,
>
> Not necessarily. But for your storage it's almost certainly going to
> minimise the problem you are seeing.
>
>> is there any documents
>> about these options I can learn more?
>
> http://xfs.org/index.php/XFS_Papers_and_Documentation

Given the apparently huge number of small files, would he likely see a
big performance increase if he replaced that 2TB of rust with SSD?

Greg


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-25 22:46           ` Greg Freemyer
@ 2014-08-26  2:37             ` Dave Chinner
  2014-08-26 10:04               ` Zhang Qiang
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-08-26  2:37 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Zhang Qiang, xfs-oss

On Mon, Aug 25, 2014 at 06:46:31PM -0400, Greg Freemyer wrote:
> On Mon, Aug 25, 2014 at 6:26 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Aug 25, 2014 at 06:31:10PM +0800, Zhang Qiang wrote:
> >> 2014-08-25 17:08 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> >>
> >> > On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
> >> > > I have checked icount and ifree, but I found there are about 11.8 percent
> >> > > free, so the free inode should not be too few.
> >> > >
> >> > > Here's the detail log, any new clue?
> >> > >
> >> > > # mount /dev/sda4 /data1/
> >> > > # xfs_info /data1/
> >> > > meta-data=/dev/sda4              isize=256    agcount=4, agsize=142272384
> >> >
> >> > 4 AGs
> >> >
> >> Yes.
> >>
> >> >
> >> > > icount = 220619904
> >> > > ifree = 26202919
> >> >
> >> > And 220 million inodes. There's your problem - that's an average
> >> > of 55 million inodes per AGI btree assuming you are using inode64.
> >> > If you are using inode32, then the inodes will be in 2 btrees, or
> >> > maybe even only one.
> >> >
> >>
> >> You are right, all inodes stay on one AG.
> >>
> >> BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
> >
> > Because the top addresses in the 2nd AG go over 32 bits, hence only
> > AG 0 can be used for inodes. Changing to inode64 will give you some
> > relief, but any time allocation occurs in AG0 is will be slow. i.e.
> > you'll be trading always slow for "unpredictably slow".
> >
> >> > With that many inodes, I'd be considering moving to 32 or 64 AGs to
> >> > keep the btree size down to a more manageable size. The free inode
> >>
> >> btree would also help, but, really, 220M inodes in a 2TB filesystem
> >> > is really pushing the boundaries of sanity.....
> >> >
> >>
> >> So the better inodes size in one AG is about 5M,
> >
> > Not necessarily. But for your storage it's almost certainly going to
> > minimise the problem you are seeing.
> >
> >> is there any documents
> >> about these options I can learn more?
> >
> > http://xfs.org/index.php/XFS_Papers_and_Documentation
> 
> Given the apparently huge number of small files would he likely see a
> big performance increase if he replaced that 2TB or rust with SSD.

Doubt it - the profiles showed the allocation being CPU bound
searching the metadata that indexes all those inodes. Those same
profiles showed all the signs that it was hitting the buffer
cache most of the time, too, which is why it was CPU bound....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-26  2:37             ` Dave Chinner
@ 2014-08-26 10:04               ` Zhang Qiang
  2014-08-26 13:13                 ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-26 10:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Freemyer, xfs-oss



Thanks Dave/Greg for your analysis and suggestions.

I can summarize what I should do next:

- back up my data using xfsdump
- rebuild the filesystem using mkfs with agcount=32 for the 2T disk
- mount the filesystem with inode64,nobarrier
- apply the patches that add the on-disk free inode btree

As we have about ~100 servers that need backing up, that will take a lot of
effort; do you have any other suggestions?

What I am testing (ongoing):
 - created a new 2T partition filesystem
 - create small files until the space is nearly full, then remove some of them
randomly (a rough sketch of this step is shown below)
 - check the performance of touch/cp on files
 - apply the patches and verify them
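
Roughly how I am doing the fill/remove step (a scaled-down sketch; the mount
point, directory layout, file counts and the ~10% delete ratio are arbitrary):

# mkdir -p /data/stress
# for d in $(seq 0 99); do
>     mkdir /data/stress/$d
>     touch $(seq -f "/data/stress/$d/%g" 0 9999)
> done
# find /data/stress -type f | awk 'rand() < 0.1' | xargs rm
# sync; echo 3 > /proc/sys/vm/drop_caches
# time touch /data/stress/probe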

I have got more data from the server:

1) flush all caches (echo 3 > /proc/sys/vm/drop_caches) and unmount the filesystem
2) mount the filesystem and test with the touch command
  * the first touch of a new file takes about ~23s
  * the second touch takes about ~0.1s

Here's the perf data:
First touch command:

Events: 435  cycles
+   7.51%  touch  [xfs]              [k] xfs_inobt_get_rec
+   5.61%  touch  [xfs]              [k] xfs_btree_get_block
+   5.38%  touch  [xfs]              [k] xfs_btree_increment
+   4.26%  touch  [xfs]              [k] xfs_btree_get_rec
+   3.73%  touch  [kernel.kallsyms]  [k] find_busiest_group
+   3.43%  touch  [xfs]              [k] _xfs_buf_find
+   2.72%  touch  [xfs]              [k] xfs_btree_readahead
+   2.38%  touch  [xfs]              [k] xfs_trans_buf_item_match
+   2.34%  touch  [xfs]              [k] xfs_dialloc
+   2.32%  touch  [kernel.kallsyms]  [k] generic_make_request
+   2.09%  touch  [xfs]              [k] xfs_btree_rec_offset
+   1.75%  touch  [kernel.kallsyms]  [k] kmem_cache_alloc
+   1.63%  touch  [kernel.kallsyms]  [k] cpumask_next_and
+   1.41%  touch  [sd_mod]           [k] sd_prep_fn
+   1.41%  touch  [kernel.kallsyms]  [k] get_page_from_freelist
+   1.38%  touch  [kernel.kallsyms]  [k] __alloc_pages_nodemask
+   1.27%  touch  [kernel.kallsyms]  [k] scsi_request_fn
+   1.22%  touch  [kernel.kallsyms]  [k] blk_queue_bounce
+   1.20%  touch  [kernel.kallsyms]  [k] cfq_should_idle
+   1.10%  touch  [xfs]              [k] xfs_btree_rec_addr
+   1.03%  touch  [kernel.kallsyms]  [k] cfq_dispatch_requests
+   1.00%  touch  [kernel.kallsyms]  [k] _spin_lock_irqsave
+   0.94%  touch  [kernel.kallsyms]  [k] memcpy
+   0.86%  touch  [kernel.kallsyms]  [k] swiotlb_map_sg_attrs
+   0.84%  touch  [kernel.kallsyms]  [k] alloc_pages_current
+   0.82%  touch  [kernel.kallsyms]  [k] submit_bio
+   0.81%  touch  [megaraid_sas]     [k] megasas_build_and_issue_cmd_fusion
+   0.77%  touch  [kernel.kallsyms]  [k] blk_peek_request
+   0.73%  touch  [xfs]              [k] xfs_btree_setbuf
+   0.73%  touch  [megaraid_sas]     [k] MR_GetPhyParams
+   0.73%  touch  [kernel.kallsyms]  [k] run_timer_softirq
+   0.71%  touch  [kernel.kallsyms]  [k] pick_next_task_rt
+   0.71%  touch  [kernel.kallsyms]  [k] init_request_from_bio
+   0.70%  touch  [kernel.kallsyms]  [k] thread_return
+   0.69%  touch  [kernel.kallsyms]  [k] cfq_set_request
+   0.67%  touch  [kernel.kallsyms]  [k] mempool_alloc
+   0.66%  touch  [xfs]              [k] xfs_buf_hold
+   0.66%  touch  [kernel.kallsyms]  [k] find_next_bit
+   0.62%  touch  [kernel.kallsyms]  [k] cfq_insert_request
+   0.61%  touch  [kernel.kallsyms]  [k] scsi_init_io
+   0.60%  touch  [megaraid_sas]     [k] MR_BuildRaidContext
+   0.59%  touch  [kernel.kallsyms]  [k] policy_zonelist
+   0.59%  touch  [kernel.kallsyms]  [k] elv_insert
+   0.58%  touch  [xfs]              [k] xfs_buf_allocate_memory


Second touch command:


Events: 105  cycles
+  20.92%  touch  [xfs]              [k] xfs_inobt_get_rec
+  14.27%  touch  [xfs]              [k] xfs_btree_get_rec
+  12.21%  touch  [xfs]              [k] xfs_btree_get_block
+  12.12%  touch  [xfs]              [k] xfs_btree_increment
+   9.86%  touch  [xfs]              [k] xfs_btree_readahead
+   7.87%  touch  [xfs]              [k] _xfs_buf_find
+   4.93%  touch  [xfs]              [k] xfs_btree_rec_addr
+   4.12%  touch  [xfs]              [k] xfs_dialloc
+   3.03%  touch  [kernel.kallsyms]  [k] clear_page_c
+   2.96%  touch  [xfs]              [k] xfs_btree_rec_offset
+   1.31%  touch  [kernel.kallsyms]  [k] kmem_cache_free
+   1.03%  touch  [xfs]              [k] xfs_trans_buf_item_match
+   0.99%  touch  [kernel.kallsyms]  [k] _atomic_dec_and_lock
+   0.99%  touch  [xfs]              [k] xfs_inobt_get_maxrecs
+   0.99%  touch  [xfs]              [k] xfs_buf_unlock
+   0.99%  touch  [xfs]              [k] kmem_zone_alloc
+   0.98%  touch  [kernel.kallsyms]  [k] kmem_cache_alloc
+   0.28%  touch  [kernel.kallsyms]  [k] pgd_alloc
+   0.17%  touch  [kernel.kallsyms]  [k] page_fault
+   0.01%  touch  [kernel.kallsyms]  [k] native_write_msr_safe

I have compared the memory usage; it seems that xfs loads the inode bmap
blocks the first time, which takes a lot of time. Is that the reason the first
touch operation takes so long?

Thanks
Qiang
2014-08-26 10:37 GMT+08:00 Dave Chinner <david@fromorbit.com>:

> On Mon, Aug 25, 2014 at 06:46:31PM -0400, Greg Freemyer wrote:
> > On Mon, Aug 25, 2014 at 6:26 PM, Dave Chinner <david@fromorbit.com>
> wrote:
> > > On Mon, Aug 25, 2014 at 06:31:10PM +0800, Zhang Qiang wrote:
> > >> 2014-08-25 17:08 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> > >>
> > >> > On Mon, Aug 25, 2014 at 04:47:39PM +0800, Zhang Qiang wrote:
> > >> > > I have checked icount and ifree, but I found there are about 11.8
> percent
> > >> > > free, so the free inode should not be too few.
> > >> > >
> > >> > > Here's the detail log, any new clue?
> > >> > >
> > >> > > # mount /dev/sda4 /data1/
> > >> > > # xfs_info /data1/
> > >> > > meta-data=/dev/sda4              isize=256    agcount=4,
> agsize=142272384
> > >> >
> > >> > 4 AGs
> > >> >
> > >> Yes.
> > >>
> > >> >
> > >> > > icount = 220619904
> > >> > > ifree = 26202919
> > >> >
> > >> > And 220 million inodes. There's your problem - that's an average
> > >> > of 55 million inodes per AGI btree assuming you are using inode64.
> > >> > If you are using inode32, then the inodes will be in 2 btrees, or
> > >> > maybe even only one.
> > >> >
> > >>
> > >> You are right, all inodes stay on one AG.
> > >>
> > >> BTW, why i allocate 4 AGs, and all inodes stay in one AG for inode32?,
> > >
> > > Because the top addresses in the 2nd AG go over 32 bits, hence only
> > > AG 0 can be used for inodes. Changing to inode64 will give you some
> > > relief, but any time allocation occurs in AG0 is will be slow. i.e.
> > > you'll be trading always slow for "unpredictably slow".
> > >
> > >> > With that many inodes, I'd be considering moving to 32 or 64 AGs to
> > >> > keep the btree size down to a more manageable size. The free inode
> > >>
> > >> btree would also help, but, really, 220M inodes in a 2TB filesystem
> > >> > is really pushing the boundaries of sanity.....
> > >> >
> > >>
> > >> So the better inodes size in one AG is about 5M,
> > >
> > > Not necessarily. But for your storage it's almost certainly going to
> > > minimise the problem you are seeing.
> > >
> > >> is there any documents
> > >> about these options I can learn more?
> > >
> > > http://xfs.org/index.php/XFS_Papers_and_Documentation
> >
> > Given the apparently huge number of small files would he likely see a
> > big performance increase if he replaced that 2TB or rust with SSD.
>
> Doubt it - the profiles showed the allocation being CPU bound
> searching the metadata that indexes all those inodes. Those same
> profiles showed all the signs that it was hitting the buffer
> cache most of the time, too, which is why it was CPU bound....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-26 10:04               ` Zhang Qiang
@ 2014-08-26 13:13                 ` Dave Chinner
  2014-08-27  8:53                   ` Zhang Qiang
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-08-26 13:13 UTC (permalink / raw)
  To: Zhang Qiang; +Cc: Greg Freemyer, xfs-oss

On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> Thanks Dave/Greg for your analysis and suggestions.
> 
> I can summarize what I should do next:
> 
> - backup my data using xfsdump
> - rebuilt filesystem using mkfs with options: agcount=32 for 2T disk
> - mount filesystem with option inode64,nobarrier

Ok up to here.

> - applied patches about adding free list inode on disk structure

No, don't do that. You're almost certain to get it wrong and corrupt
your filesystems and lose data.

> As we have about ~100 servers need back up, so that will take much effort,
> do you have any other suggestion?

Just remount them with inode64. Nothing else. Over time as you add
and remove files the inodes will redistribute across all 4 AGs.
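
That is, nothing more than the mount options (shown here with the device,
mount point and options already used in this thread; on these older kernels
inode64 may not take effect on a live remount, so a clean umount/mount is the
safe way, plus adding inode64 to the fstab entry so it persists):

# umount /data1
# mount -o inode64,nobarrier /dev/sda4 /data1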

> What I am testing (ongoing):
>  - created a new 2T partition filesystem
>  - try to create small files and fill whole spaces then remove some of them
> randomly
>  - check the performance of touch/cp files
>  - apply patches and verify it.
> 
> I have got more data from server:
> 
> 1) flush all cache(echo 3 > /proc/sys/vm/drop_caches), and umount filesystem
> 2) mount filesystem and testing with touch command
>   * The first touch new file command take about ~23s
>   * second touch command take about ~0.1s.

So it's cache population that is your issue. You didn't say that
first time around, which means the diagnosis was wrong. Again, it's having to
search a btree with 220 million inodes in it to find the first free
inode, and that btree has to be pulled in from disk and searched.
Once it's cached, then each subsequent allocation will be much
faster because the majority of the tree being searched will already
be in cache...

> I have compared the memory used, it seems that xfs try to load inode bmap
> block for the first time, which take much time, is that the reason to take
> so much time for the first touch operation?

No. reading the AGI btree to find the first free inode to allocate
is what is taking the time. If you spread the inodes out over 4 AGs
(using inode64) then the overhead of the first read will go down
proportionally. Indeed, that is one of the reasons for using more
AGs than 4 for filesystems like this.

Still, I can't help but wonder why you are using a filesystem to
store hundreds of millions of tiny files, when a database is far
better suited to storing and indexing this type and quantity of
data....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: bad performance on touch/cp file on XFS system
  2014-08-26 13:13                 ` Dave Chinner
@ 2014-08-27  8:53                   ` Zhang Qiang
  2014-08-28  2:08                     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Zhang Qiang @ 2014-08-27  8:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Greg Freemyer, xfs-oss


2014-08-26 21:13 GMT+08:00 Dave Chinner <david@fromorbit.com>:

> On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> > Thanks Dave/Greg for your analysis and suggestions.
> >
> > I can summarize what I should do next:
> >
> > - backup my data using xfsdump
> > - rebuild the filesystem using mkfs with options: agcount=32 for 2T disk
> > - mount filesystem with option inode64,nobarrier
>
> Ok up to here.
>
> > - applied patches about adding free list inode on disk structure
>
> No, don't do that. You're almost certain to get it wrong and corrupt
> your filesystems and lose data.
>
> > As we have about ~100 servers need back up, so that will take much
> effort,
> > do you have any other suggestion?
>
> Just remount them with inode64. Nothing else. Over time as you add
> and remove files the inodes will redistribute across all 4 AGs.
>
OK.

How can I see how many inodes end up in each AG? Here are my checking
steps:

1) Check unmounted file system first:
[root@fstest data1]# xfs_db -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 421793920
ifree = 41
[root@fstest data1]# xfs_db -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 2" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
[root@fstest data1]# xfs_db -c "sb 3" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0
2) mount it with inode64 and create many files:

[root@fstest /]# mount -o inode64,nobarrier /dev/sdb1 /data
[root@fstest /]# cd /data/tmp/
[root@fstest tmp]# fdtree.bash -d 16 -l 2 -f 100 -s 1
[root@fstest /]# umount /data

3) Check with xfs_db again:

[root@fstest data1]# xfs_db -f -c "sb 0" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 421821504
ifree = 52
[root@fstest data1]# xfs_db -f -c "sb 1" -c "p" /dev/sdb1 | egrep 'icount|ifree'
icount = 0
ifree = 0

So it seems that the inodes are all in the first AG. Or are icount/ifree
not the right values to check? How should I check how many inodes are in each AG?


I am looking for a way to improve performance on the current filesystem and
kernel just by remounting with inode64, so I am trying to work out how to
redistribute the inodes evenly across all the AGs.

Is there any good way to do that? For example, backing up half of the data
to another device, removing it, and then copying it back.


> > What I am testing (ongoing):
> >  - created a new 2T partition filesystem
> >  - try to create small files and fill whole spaces then remove some of
> them
> > randomly
> >  - check the performance of touch/cp files
> >  - apply patches and verify it.
> >
> > I have got more data from server:
> >
> > 1) flush all cache(echo 3 > /proc/sys/vm/drop_caches), and umount
> filesystem
> > 2) mount filesystem and testing with touch command
> >   * The first touch new file command take about ~23s
> >   * second touch command take about ~0.1s.
>
> So it's cache population that is your issue. You didn't say that
> first time around, which means the diagnosis was wrong. Again, it's having
> to
> search a btree with 220 million inodes in it to find the first free
> inode, and that btree has to be pulled in from disk and searched.
> Once it's cached, then each subsequent allocation will be much
> faster because the majority of the tree being searched will already
> be in cache...
>
> > I have compared the memory used, it seems that xfs try to load inode bmap
> > block for the first time, which take much time, is that the reason to
> take
> > so much time for the first touch operation?
>
> No. Reading the AGI btree to find the first free inode to allocate
> is what is taking the time. If you spread the inodes out over 4 AGs
> (using inode64) then the overhead of the first read will go down
> proportionally. Indeed, that is one of the reasons for using more
> AGs than 4 for filesystems like this.
>
OK, I see.


> Still, I can't help but wonder why you are using a filesystem to
> store hundreds of millions of tiny files, when a database is far
> better suited to storing and indexing this type and quantity of
> data....
>

OK, these are the back-end servers for a social networking website, in fact
its CDN infrastructure, with servers located in different cities.
We have a global sync script that keeps the data identical across all ~100
servers.

For each server we use RAID10 and XFS (CentOS 6.3).

About 3M files (roughly 50 KB each) are generated every day, and we track
the path of each file in a database.

Do you have any suggestions to improve our solution?



> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>


* Re: bad performance on touch/cp file on XFS system
  2014-08-27  8:53                   ` Zhang Qiang
@ 2014-08-28  2:08                     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-08-28  2:08 UTC (permalink / raw)
  To: Zhang Qiang; +Cc: Greg Freemyer, xfs-oss

On Wed, Aug 27, 2014 at 04:53:17PM +0800, Zhang Qiang wrote:
> 2014-08-26 21:13 GMT+08:00 Dave Chinner <david@fromorbit.com>:
> 
> > On Tue, Aug 26, 2014 at 06:04:52PM +0800, Zhang Qiang wrote:
> > > Thanks Dave/Greg for your analysis and suggestions.
> > >
> > > I can summarize what I should do next:
> > >
> > > - backup my data using xfsdump
> > > - rebuild the filesystem using mkfs with options: agcount=32 for 2T disk
> > > - mount filesystem with option inode64,nobarrier
> >
> > Ok up to here.
> >
> > > - applied patches about adding free list inode on disk structure
> >
> > No, don't do that. You're almost certain to get it wrong and corrupt
> > your filesystems and lose data.
> >
> > > As we have about ~100 servers need back up, so that will take much
> > effort,
> > > do you have any other suggestion?
> >
> > Just remount them with inode64. Nothing else. Over time as you add
> > and remove files the inodes will redistribute across all 4 AGs.
> >
> OK.
> 
> How I can see  the layout number of inodes on each AGs? Here's my checking
> steps:
> 
> 1) Check unmounted file system first:
> [root@fstest data1]# xfs_db -c "sb 0"  -c "p" /dev/sdb1 |egrep
> 'icount|ifree'
> icount = 421793920
> ifree = 41
> [root@fstest data1]# xfs_db -c "sb 1"  -c "p" /dev/sdb1 |egrep
> 'icount|ifree'
> icount = 0
> ifree = 0

That's wrong. You need to check the AGI headers, not the superblock.
Only the primary superblock gets updated, and it's the aggregate of
all the AGI values, not the per-AG values.
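
Something like this, once per AG (0-3 on a 4-AG filesystem), run against
the unmounted device, or with -r on a mounted one and accepting that the
numbers may be slightly stale:

    # xfs_db -c "agi 0" -c "p" /dev/sdb1 | egrep '^count|^freecount'
    # xfs_db -c "agi 1" -c "p" /dev/sdb1 | egrep '^count|^freecount'
    ...

count is the number of inodes allocated in that AG and freecount the number
of free ones; with inode64 in effect you should see the counts start growing
in AGs other than 0.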

And, BTW, that's *421 million* inodes in that filesystem. Almost
twice as many as the filesystem you started showing problems on...

> OK, this is a social networking website back end servers, actually the CDN
> infrastructure, and different server located different cities.
> We have a global sync script to make all these 100 servers have the same
> data.
> 
> For each server we use RAID10 and XFS (CentOS6.3).
> 
> There are about 3M files (50K in size) generated every day, and we track
> the path of each files in database.

I'd suggest you are overestimating the size of the files being
stored by an order of magnitude: 200M files at 50k in size is 10TB,
not 1.5TB.
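
(Spelled out: 200e6 files x 50e3 bytes is 1e13 bytes, i.e. 10TB. Working
back from the ~1.5TB actually in use, the average object is more like 7-8KB,
or a large fraction of those inodes aren't 50KB data files at all.)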

But you've confirmed exactly what I thought - you're using the
filesystem as an anonymous object store for hundreds of millions of
small objects and that's exactly the situation I'd expect to see
these problems....

> Do you have any suggestions to improve our solution?

TANSTAAFL (there ain't no such thing as a free lunch).

I've given you some stuff to try; worst case is reformatting and
recopying all the data. I don't really have much time to do
much more than that - talk to Red Hat (because you are using CentOS)
if you want help with a more targeted solution to your problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread

Thread overview: 15+ messages
2014-08-25  3:34 bad performance on touch/cp file on XFS system Zhang Qiang
2014-08-25  5:18 ` Dave Chinner
2014-08-25  8:09   ` Zhang Qiang
2014-08-25  8:56     ` Dave Chinner
2014-08-25  9:05       ` Zhang Qiang
2014-08-25  8:47   ` Zhang Qiang
2014-08-25  9:08     ` Dave Chinner
2014-08-25 10:31       ` Zhang Qiang
2014-08-25 22:26         ` Dave Chinner
2014-08-25 22:46           ` Greg Freemyer
2014-08-26  2:37             ` Dave Chinner
2014-08-26 10:04               ` Zhang Qiang
2014-08-26 13:13                 ` Dave Chinner
2014-08-27  8:53                   ` Zhang Qiang
2014-08-28  2:08                     ` Dave Chinner
