* extremely slow file creation/deletion after xfs ran full
@ 2015-01-12  8:36 Carsten Aulbert
  2015-01-12 12:44 ` Brian Foster
  2015-01-12 13:30 ` Carsten Aulbert
  0 siblings, 2 replies; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-12  8:36 UTC (permalink / raw)
  To: xfs

Hi

A couple of days ago, one of our XFS pools ran full, which we did not
notice due to some broken monitoring.

After deleting files, the FS is now filled to about 80-85%, but file
creation as well as deletion is extremely slow:

$ cd /srv
$ time touch test
touch test  0.00s user 0.17s system 5% cpu 2.993 total
$ time touch test
touch test  0.00s user 0.00s system 0% cpu 0.002 total
$ time rm test
rm test  0.00s user 0.00s system 0% cpu 4.954 total

As we had seen some XFS error messages like XFS: Internal error
XFS_WANT_CORRUPTED_GOTO[1] shortly before the file system was completely
filled, we umounted it and ran successfully xfs_repair on it[2].

We started to dig into fragmentation levels, but to be honest, I'm not
too sure how to interpret the numbers. All seem pretty innocent to me
except the very high directory fragmentation:

# mount
[...]
/dev/sdb1 on /srv type xfs (rw,relatime,attr2,inode64,noquota)

# uname -a
Linux einstein4 3.14.24-atlas #1 SMP Wed Nov 19 08:29:23 UTC 2014 x86_64
GNU/Linux

(after the reboot to add more RAM, before 3.14.11)
# df -h /dev/sdb1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        15T   12T  2.7T  82% /srv

# df -i /dev/sdb1
Filesystem         Inodes     IUsed      IFree IUse% Mounted on
/dev/sdb1      3105468352 554441361 2551026991   18% /srv

# xfs_info /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=15, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3881835478, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

(underneath is an Areca-backed 16-disk RAID 10 and yes, possibly we
should have gone to more AGs, hindsight is great ;))

# xfs_db -r -c "frag" /dev/sdb1
actual 562488447, ideal 553668062, fragmentation factor 1.57%
# xfs_db -r -c "freesp" /dev/sdb1
   from      to extents  blocks    pct
      1       1 9669664 9669664   1.36
      2       3 5634086 12956862   1.83
      4       7 1720988 8320694   1.17
      8      15  314377 3088702   0.44
     16      31  231279 4941737   0.70
     32      63 10501367 372068430  52.45
     64     127 2604359 204737946  28.86
    128     255  377595 63623350   8.97
    256     511   58744 20198673   2.85
    512    1023    8843 5759074   0.81
   1024    2047    1044 1418464   0.20
   2048    4095     255  685540   0.10
   4096    8191      81  453953   0.06
   8192   16383      14  126863   0.02
  16384   32767      18  373876   0.05
  32768   65535      11  435041   0.06
  65536  131071       6  458767   0.06

# xfs_db -r -c "frag -d" /dev/sdb1
actual 7987218, ideal 67532, fragmentation factor 99.15%
# xfs_db -r -c "frag -f" /dev/sdb1
actual 555079796, ideal 554171759, fragmentation factor 0.16%

Finally, we still see XFS error messages in the kernel logs[3].

Questions:
  * Would xfs_fsr be able to cure this problem?
  * Would moving to xfsprogs 3.2 and kernel > 3.15 with on disk format 5
plus crc help to protect against this "wear"?

Cheers

Carsten

PS: Please Cc me, as I'm not subscribed ATM.


[1]Jan  7 20:58:47 einstein4 kernel: [14029681.124548] XFS: Internal
error XFS_WANT_CORRUPTED_GOTO at line 1419 of file fs/xfs/xfs_alloc.c.
Caller 0xffffffffa023594d
Jan  7 20:58:47 einstein4 kernel: [14029681.124657] CPU: 5 PID: 13437
Comm: kworker/5:1 Not tainted 3.14.11-atlas #1
Jan  7 20:58:47 einstein4 kernel: [14029681.124745] Hardware name:
Supermicro X7DVL/X7DVL, BIOS 6.00 11/02/2007
Jan  7 20:58:47 einstein4 kernel: [14029681.124860] Workqueue: xfsalloc
xfs_bmapi_allocate_worker [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.124977]  0000000000000000
ffff88013b7ebc80 ffffffff814e9555 ffff88017022bd30
Jan  7 20:58:47 einstein4 kernel: [14029681.125247]  ffffffffa0234aed
ffff88041877c380 0000000000000000 000000000000000c
Jan  7 20:58:47 einstein4 kernel: [14029681.125514]  0000000000000000
0000000100000000 0ff0b98800000000 0000000100000002
Jan  7 20:58:47 einstein4 kernel: [14029681.125783] Call Trace:
Jan  7 20:58:47 einstein4 kernel: [14029681.125856]
[<ffffffff814e9555>] ? dump_stack+0x41/0x51
Jan  7 20:58:47 einstein4 kernel: [14029681.125937]
[<ffffffffa0234aed>] ? xfs_alloc_ag_vextent_size+0x33d/0x650 [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.126035]
[<ffffffffa023594d>] ? xfs_alloc_ag_vextent+0x9d/0x100 [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.126118]
[<ffffffffa02366c5>] ? xfs_alloc_vextent+0x485/0x610 [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.126202]
[<ffffffffa0245ff4>] ? xfs_bmap_btalloc+0x2d4/0x790 [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.126286]
[<ffffffffa0246ab1>] ? __xfs_bmapi_allocate+0xd1/0x2e0 [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.126366]
[<ffffffffa0216375>] ? xfs_bmapi_allocate_worker+0x35/0x70 [xfs]
Jan  7 20:58:47 einstein4 kernel: [14029681.126456]
[<ffffffff8107d5db>] ? process_one_work+0x16b/0x400
Jan  7 20:58:47 einstein4 kernel: [14029681.126529]
[<ffffffff8107e9c6>] ? worker_thread+0x116/0x380
Jan  7 20:58:47 einstein4 kernel: [14029681.126601]
[<ffffffff8107e8b0>] ? manage_workers.isra.21+0x290/0x290
Jan  7 20:58:47 einstein4 kernel: [14029681.126675]
[<ffffffff81084bcc>] ? kthread+0xbc/0xe0
Jan  7 20:58:47 einstein4 kernel: [14029681.126745]
[<ffffffff81084b10>] ? flush_kthread_worker+0x80/0x80
Jan  7 20:58:47 einstein4 kernel: [14029681.126819]
[<ffffffff814f724c>] ? ret_from_fork+0x7c/0xb0
Jan  7 20:58:47 einstein4 kernel: [14029681.126890]
[<ffffffff81084b10>] ? flush_kthread_worker+0x80/0x80
Jan  7 20:58:47 einstein4 kernel: [14029681.126973] XFS (sdb1): page
discard on page ffffea000b973650, inode 0x8e56233bd, offset 172167168.

[2] multiple times until we had enough RAM installed into the machine
(24GB+swap space finally made the difference)

[3] e.g.
[170369.029932] ------------[ cut here ]------------
[170369.029940] WARNING: CPU: 4 PID: 65 at fs/xfs/xfs_aops.c:1184
xfs_vm_releasepage+0xb5/0xc0 [xfs]()
[170369.029942] Modules linked in: dm_mod nfsv3 tun xt_nat autofs4
ipt_REJECT ipt_MASQUERADE xt_state iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables
xt_tcpudp x_tables nfsd auth_rpcgss oid_registry nfs_acl nfs lockd
fscache sunrpc ext4 crc16 mbcache jbd2 ipmi_poweroff ipmi_watchdog
ipmi_devintf ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support ttm
drm_kms_helper drm coretemp i5000_edac edac_core lpc_ich i5k_amb
mfd_core ioatdma sr_mod cdrom i2c_algo_bit i2c_i801 rng_core i2c_core
joydev dca psmouse ehci_pci snd_pcm snd_timer snd soundcore pcspkr
serio_raw shpchp processor parport_pc parport evdev kvm thermal_sys
button xfs crc32c libcrc32c md_mod usb_storage hid_generic usbhid hid
sd_mod sg crc_t10dif crct10dif_common floppy ahci libahci libata
uhci_hcd e1000 ehci_hcd arcmsr usbcore usb_common scsi_mod e1000e ptp
pps_core
[170369.029986] CPU: 4 PID: 65 Comm: kswapd0 Tainted: G        W
3.14.24-atlas #1
[170369.029986] Hardware name: Supermicro X7DVL/X7DVL, BIOS 6.00 11/02/2007
[170369.029987]  0000000000000009 ffff88060d461a18 ffffffff81529af8
0000000000000000
[170369.029989]  0000000000000000 ffff88060d461a58 ffffffff8106a37c
ffff880610389d40
[170369.029991]  ffff88060d461b28 ffffea000f39f8a0 ffff8805f722dd98
ffffea000f39f8c0
[170369.029994] Call Trace:
[170369.029994]  [<ffffffff81529af8>] dump_stack+0x46/0x58
[170369.029996]  [<ffffffff8106a37c>] warn_slowpath_common+0x8c/0xc0
[170369.029998]  [<ffffffff8106a3ca>] warn_slowpath_null+0x1a/0x20
[170369.030000]  [<ffffffffa0212225>] xfs_vm_releasepage+0xb5/0xc0 [xfs]
[170369.030009]  [<ffffffff81182a10>] ? anon_vma_prepare+0x140/0x140
[170369.030011]  [<ffffffff81149902>] try_to_release_page+0x32/0x50
[170369.030014]  [<ffffffff8115ed7a>] shrink_active_list+0x3aa/0x3b0
[170369.030016]  [<ffffffff8115f154>] shrink_lruvec+0x3d4/0x6a0
[170369.030017]  [<ffffffff811aa001>] ?
__mem_cgroup_largest_soft_limit_node+0x21/0xd0
[170369.030019]  [<ffffffff811aa001>] ?
__mem_cgroup_largest_soft_limit_node+0x21/0xd0
[170369.030021]  [<ffffffff8115f486>] shrink_zone+0x66/0x1a0
[170369.030023]  [<ffffffff8116065c>] balance_pgdat+0x3ac/0x600
[170369.030025]  [<ffffffff81160a34>] kswapd+0x184/0x470
[170369.030027]  [<ffffffff810ae460>] ? finish_wait+0x80/0x80
[170369.030028]  [<ffffffff811608b0>] ? balance_pgdat+0x600/0x600
[170369.030030]  [<ffffffff8108da79>] kthread+0xc9/0xe0
[170369.030032]  [<ffffffff8108d9b0>] ? flush_kthread_worker+0x80/0x80
[170369.030034]  [<ffffffff81538cfc>] ret_from_fork+0x7c/0xb0
[170369.030035]  [<ffffffff8108d9b0>] ? flush_kthread_worker+0x80/0x80
[170369.030038] ---[ end trace 9f4d705dbe4e039c ]---

-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
phone/fax: +49 511 762-17185 / -17193
https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12  8:36 extremely slow file creation/deletion after xfs ran full Carsten Aulbert
@ 2015-01-12 12:44 ` Brian Foster
  2015-01-12 13:30 ` Carsten Aulbert
  1 sibling, 0 replies; 14+ messages in thread
From: Brian Foster @ 2015-01-12 12:44 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs

On Mon, Jan 12, 2015 at 09:36:49AM +0100, Carsten Aulbert wrote:
> Hi
> 
> A couple of days ago, one of our XFS pools ran full, which we did not
> notice due to some broken monitoring.
> 
> After deleting files, the FS is now filled to about 80-85%, but file
> creation as well as deletion is extremely slow:
> 
> $ cd /srv
> $ time touch test
> touch test  0.00s user 0.17s system 5% cpu 2.993 total
> $ time touch test
> touch test  0.00s user 0.00s system 0% cpu 0.002 total
> $ time rm test
> rm test  0.00s user 0.00s system 0% cpu 4.954 total
> 

Based on the size and consumption of the fs, first thing that comes to
mind is perhaps fragmented use of inode records. E.g., inode records are
spread all over the storage with handfuls of inodes free here and
there, which means individual inode allocation can take a hit searching
for the record with free inodes. I don't think that explains rm
performance though.

It might be interesting to grab a couple perf traces of the touch and rm
commands and see what cpu usage looks like. E.g., 'perf record -g touch
<file>,' 'perf report -g,' etc.
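
A rough sketch of such a capture (untested; 'perf-test-file' is just a
placeholder and /srv is the affected mount point):

cd /srv
# record a call-graph profile of one create and one remove
perf record -g -o touch.perf.data -- touch perf-test-file
perf record -g -o rm.perf.data -- rm perf-test-file
# then look at where the CPU time went
perf report -g -i touch.perf.data
perf report -g -i rm.perf.data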

> As we had seen some XFS error messages like XFS: Internal error
> XFS_WANT_CORRUPTED_GOTO[1] shortly before the file system was completely
> filled, we umounted it and ran successfully xfs_repair on it[2].
> 

Interesting, that is:

	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0);

... which seems strange given that rlen should have been previously
trimmed. Taking a closer look at xfs_alloc_fix_len() reminded me of the
following commit, which actually documents and fixes instances of this
error, so that might be the most likely culprit:

	30265117 xfs: Fix rounding in xfs_alloc_fix_len()

That originally went into 3.16 and I don't see it in the 3.14 stable
branch. Did xfs_repair actually report anything wrong?
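
FWIW, one way to double-check from a local clone of the kernel tree (the
stable branch name below is only illustrative):

# releases in Linus' tree that already contain the fix
git tag --contains 30265117
# and whether it has been backported to the 3.14 stable branch
git log --oneline linux-3.14.y -- fs/xfs/xfs_alloc.c | grep -i rounding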

> We started to dig into fragmentation levels, but to be honest, I'm not
> too sure how to interpret the numbers. All seem pretty innocent to me
> except the very high directory fragmentation:
> 
> # mount
> [...]
> /dev/sdb1 on /srv type xfs (rw,relatime,attr2,inode64,noquota)
> 
> # uname -a
> Linux einstein4 3.14.24-atlas #1 SMP Wed Nov 19 08:29:23 UTC 2014 x86_64
> GNU/Linux
> 
> (after the reboot to add more RAM, before 3.14.11)
> # df -h /dev/sdb1
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1        15T   12T  2.7T  82% /srv
> 
> # df -i /dev/sdb1
> Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> /dev/sdb1      3105468352 554441361 2551026991   18% /srv
> 
> # xfs_info /dev/sdb1
> meta-data=/dev/sdb1              isize=256    agcount=15, agsize=268435455 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=3881835478, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> (underneath is an Areca-backed 16-disk RAID 10 and yes, possibly we
> should have gone to more AGs, hindsight is great ;))
> 
> # xfs_db -r -c "frag" /dev/sdb1
> actual 562488447, ideal 553668062, fragmentation factor 1.57%
> # xfs_db -r -c "freesp" /dev/sdb1
>    from      to extents  blocks    pct
>       1       1 9669664 9669664   1.36
>       2       3 5634086 12956862   1.83
>       4       7 1720988 8320694   1.17
>       8      15  314377 3088702   0.44
>      16      31  231279 4941737   0.70
>      32      63 10501367 372068430  52.45
>      64     127 2604359 204737946  28.86
>     128     255  377595 63623350   8.97
>     256     511   58744 20198673   2.85
>     512    1023    8843 5759074   0.81
>    1024    2047    1044 1418464   0.20
>    2048    4095     255  685540   0.10
>    4096    8191      81  453953   0.06
>    8192   16383      14  126863   0.02
>   16384   32767      18  373876   0.05
>   32768   65535      11  435041   0.06
>   65536  131071       6  458767   0.06
> 
> # xfs_db -r -c "frag -d" /dev/sdb1
> actual 7987218, ideal 67532, fragmentation factor 99.15%
> # xfs_db -r -c "frag -f" /dev/sdb1
> actual 555079796, ideal 554171759, fragmentation factor 0.16%
> 
> Finally, we still see XFS error messages in the kernel logs[3].
> 
> Questions:
>   * Would xfs_fsr be able to cure this problem?
>   * Would moving to xfsprogs 3.2 and kernel > 3.15 with on disk format 5
> plus crc help to protect against this "wear"?
> 

It seems like you have sufficiently large and available free space. That
said, it's fairly common for filesystems to naturally drop in performance
as free space becomes more limited. E.g., I think it's common practice
to avoid regular usage while over 80% used if performance is a major
concern. Also, I doubt xfs_fsr will do much to affect inode allocation
performance, but I could be wrong.

With regard to v5, I don't think there is anything that will affect
"wear" as such. Perhaps there are speculative preallocation fixes that
might affect long term allocation patterns, depending on your workload.
CRC support provides extra metadata validation. v5 also includes free
inode btree support, which can improve inode allocation performance in
the inode record fragmentation scenario described above.

Brian

> Cheers
> 
> Carsten
> 
> PS: Please Cc me, as I'm not subscribed ATM.
> 
> 
> [1]Jan  7 20:58:47 einstein4 kernel: [14029681.124548] XFS: Internal
> error XFS_WANT_CORRUPTED_GOTO at line 1419 of file fs/xfs/xfs_alloc.c.
> Caller 0xffffffffa023594d
> Jan  7 20:58:47 einstein4 kernel: [14029681.124657] CPU: 5 PID: 13437
> Comm: kworker/5:1 Not tainted 3.14.11-atlas #1
> Jan  7 20:58:47 einstein4 kernel: [14029681.124745] Hardware name:
> Supermicro X7DVL/X7DVL, BIOS 6.00 11/02/2007
> Jan  7 20:58:47 einstein4 kernel: [14029681.124860] Workqueue: xfsalloc
> xfs_bmapi_allocate_worker [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.124977]  0000000000000000
> ffff88013b7ebc80 ffffffff814e9555 ffff88017022bd30
> Jan  7 20:58:47 einstein4 kernel: [14029681.125247]  ffffffffa0234aed
> ffff88041877c380 0000000000000000 000000000000000c
> Jan  7 20:58:47 einstein4 kernel: [14029681.125514]  0000000000000000
> 0000000100000000 0ff0b98800000000 0000000100000002
> Jan  7 20:58:47 einstein4 kernel: [14029681.125783] Call Trace:
> Jan  7 20:58:47 einstein4 kernel: [14029681.125856]
> [<ffffffff814e9555>] ? dump_stack+0x41/0x51
> Jan  7 20:58:47 einstein4 kernel: [14029681.125937]
> [<ffffffffa0234aed>] ? xfs_alloc_ag_vextent_size+0x33d/0x650 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126035]
> [<ffffffffa023594d>] ? xfs_alloc_ag_vextent+0x9d/0x100 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126118]
> [<ffffffffa02366c5>] ? xfs_alloc_vextent+0x485/0x610 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126202]
> [<ffffffffa0245ff4>] ? xfs_bmap_btalloc+0x2d4/0x790 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126286]
> [<ffffffffa0246ab1>] ? __xfs_bmapi_allocate+0xd1/0x2e0 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126366]
> [<ffffffffa0216375>] ? xfs_bmapi_allocate_worker+0x35/0x70 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126456]
> [<ffffffff8107d5db>] ? process_one_work+0x16b/0x400
> Jan  7 20:58:47 einstein4 kernel: [14029681.126529]
> [<ffffffff8107e9c6>] ? worker_thread+0x116/0x380
> Jan  7 20:58:47 einstein4 kernel: [14029681.126601]
> [<ffffffff8107e8b0>] ? manage_workers.isra.21+0x290/0x290
> Jan  7 20:58:47 einstein4 kernel: [14029681.126675]
> [<ffffffff81084bcc>] ? kthread+0xbc/0xe0
> Jan  7 20:58:47 einstein4 kernel: [14029681.126745]
> [<ffffffff81084b10>] ? flush_kthread_worker+0x80/0x80
> Jan  7 20:58:47 einstein4 kernel: [14029681.126819]
> [<ffffffff814f724c>] ? ret_from_fork+0x7c/0xb0
> Jan  7 20:58:47 einstein4 kernel: [14029681.126890]
> [<ffffffff81084b10>] ? flush_kthread_worker+0x80/0x80
> Jan  7 20:58:47 einstein4 kernel: [14029681.126973] XFS (sdb1): page
> discard on page ffffea000b973650, inode 0x8e56233bd, offset 172167168.
> 
> [2] multiple times until we had enough RAM installed into the machine
> (24GB+swap space finally made the difference)
> 
> [3] e.g.
> [170369.029932] ------------[ cut here ]------------
> [170369.029940] WARNING: CPU: 4 PID: 65 at fs/xfs/xfs_aops.c:1184
> xfs_vm_releasepage+0xb5/0xc0 [xfs]()
> [170369.029942] Modules linked in: dm_mod nfsv3 tun xt_nat autofs4
> ipt_REJECT ipt_MASQUERADE xt_state iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables
> xt_tcpudp x_tables nfsd auth_rpcgss oid_registry nfs_acl nfs lockd
> fscache sunrpc ext4 crc16 mbcache jbd2 ipmi_poweroff ipmi_watchdog
> ipmi_devintf ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support ttm
> drm_kms_helper drm coretemp i5000_edac edac_core lpc_ich i5k_amb
> mfd_core ioatdma sr_mod cdrom i2c_algo_bit i2c_i801 rng_core i2c_core
> joydev dca psmouse ehci_pci snd_pcm snd_timer snd soundcore pcspkr
> serio_raw shpchp processor parport_pc parport evdev kvm thermal_sys
> button xfs crc32c libcrc32c md_mod usb_storage hid_generic usbhid hid
> sd_mod sg crc_t10dif crct10dif_common floppy ahci libahci libata
> uhci_hcd e1000 ehci_hcd arcmsr usbcore usb_common scsi_mod e1000e ptp
> pps_core
> [170369.029986] CPU: 4 PID: 65 Comm: kswapd0 Tainted: G        W
> 3.14.24-atlas #1
> [170369.029986] Hardware name: Supermicro X7DVL/X7DVL, BIOS 6.00 11/02/2007
> [170369.029987]  0000000000000009 ffff88060d461a18 ffffffff81529af8
> 0000000000000000
> [170369.029989]  0000000000000000 ffff88060d461a58 ffffffff8106a37c
> ffff880610389d40
> [170369.029991]  ffff88060d461b28 ffffea000f39f8a0 ffff8805f722dd98
> ffffea000f39f8c0
> [170369.029994] Call Trace:
> [170369.029994]  [<ffffffff81529af8>] dump_stack+0x46/0x58
> [170369.029996]  [<ffffffff8106a37c>] warn_slowpath_common+0x8c/0xc0
> [170369.029998]  [<ffffffff8106a3ca>] warn_slowpath_null+0x1a/0x20
> [170369.030000]  [<ffffffffa0212225>] xfs_vm_releasepage+0xb5/0xc0 [xfs]
> [170369.030009]  [<ffffffff81182a10>] ? anon_vma_prepare+0x140/0x140
> [170369.030011]  [<ffffffff81149902>] try_to_release_page+0x32/0x50
> [170369.030014]  [<ffffffff8115ed7a>] shrink_active_list+0x3aa/0x3b0
> [170369.030016]  [<ffffffff8115f154>] shrink_lruvec+0x3d4/0x6a0
> [170369.030017]  [<ffffffff811aa001>] ?
> __mem_cgroup_largest_soft_limit_node+0x21/0xd0
> [170369.030019]  [<ffffffff811aa001>] ?
> __mem_cgroup_largest_soft_limit_node+0x21/0xd0
> [170369.030021]  [<ffffffff8115f486>] shrink_zone+0x66/0x1a0
> [170369.030023]  [<ffffffff8116065c>] balance_pgdat+0x3ac/0x600
> [170369.030025]  [<ffffffff81160a34>] kswapd+0x184/0x470
> [170369.030027]  [<ffffffff810ae460>] ? finish_wait+0x80/0x80
> [170369.030028]  [<ffffffff811608b0>] ? balance_pgdat+0x600/0x600
> [170369.030030]  [<ffffffff8108da79>] kthread+0xc9/0xe0
> [170369.030032]  [<ffffffff8108d9b0>] ? flush_kthread_worker+0x80/0x80
> [170369.030034]  [<ffffffff81538cfc>] ret_from_fork+0x7c/0xb0
> [170369.030035]  [<ffffffff8108d9b0>] ? flush_kthread_worker+0x80/0x80
> [170369.030038] ---[ end trace 9f4d705dbe4e039c ]---
> 
> -- 
> Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
> Callinstrasse 38, 30167 Hannover, Germany
> phone/fax: +49 511 762-17185 / -17193
> https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome
> 

* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12  8:36 extremely slow file creation/deletion after xfs ran full Carsten Aulbert
  2015-01-12 12:44 ` Brian Foster
@ 2015-01-12 13:30 ` Carsten Aulbert
  2015-01-12 15:52   ` Brian Foster
  1 sibling, 1 reply; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-12 13:30 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: Type: text/plain, Size: 2350 bytes --]

Hi again

(sorry that I reply to my own email, rather than Brian's, but I've only
just subscribed to the list), I'll try to address Brian's email in here
with fake quotation. Sorry for breaking the threading :(

Brian Foster wrote:
> Based on the size and consumption of the fs, first thing that comes
> to mind is perhaps fragmented use of inode records. E.g., inode
> records are spread all over the storage with handfuls of inodes
> free here and there, which means individual inode allocation can
> take a hit searching for the record with free inodes. I don't think
> that explains rm performance though.

> It might be interesting to grab a couple perf traces of the touch
> and rm commands and see what cpu usage looks like. E.g., 'perf
> record -g touch <file>,' 'perf report -g,' etc.

I've attached both perf outputs and after reviewing them briefly I think
the slowness has different causes, i.e. only the touch one is in
xfs' territory.

> 	30265117 xfs: Fix rounding in xfs_alloc_fix_len()
>
> That originally went into 3.16 and I don't see it in the 3.14 stable
> branch. Did xfs_repair actually report anything wrong?

Nope, only displayed all the stages, but nothing was fixed.

> It seems like you have sufficiently large and available free
> space. That said, it's fairly common for filesystems to naturally
> drop in performance as free space becomes more limited. E.g., I
> think it's common practice to avoid regular usage while over 80%
> used if performance is a major concern. Also, I doubt xfs_fsr will
> do much to affect inode allocation performance, but I could be
> wrong.

Yes, we should have monitored that mount point rather than /tmp, which is
what we did when bad things happened(TM). Given that we have a high
fragmentation of directories, would xfs_fsr help here at all?

Regarding v5, currently we are copying data off that disk and will
create it anew with -m crc=1,finobt=1 on a recent 3.18 kernel. Apart
from that I don't know what more we can do to safeguard us against
this happening again (well, keep it below 80% all the time as well).
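
For the archive, the recreate step is roughly the following (assuming an
xfsprogs new enough to know about finobt):

# v5 format: metadata CRCs plus the free inode btree
mkfs.xfs -f -m crc=1,finobt=1 /dev/sdb1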

Thanks a lot for the remarks!

cheers

Carsten

-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
phone/fax: +49 511 762-17185 / -17193
https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome

[-- Attachment #2: rm.perf.data.gz --]
[-- Type: application/gzip, Size: 2822 bytes --]

[-- Attachment #3: touch.perf.data.gz --]
[-- Type: application/gzip, Size: 14833 bytes --]

[-- Attachment #4: Type: text/plain, Size: 121 bytes --]


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12 13:30 ` Carsten Aulbert
@ 2015-01-12 15:52   ` Brian Foster
  2015-01-12 16:09     ` Carsten Aulbert
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2015-01-12 15:52 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs

On Mon, Jan 12, 2015 at 02:30:18PM +0100, Carsten Aulbert wrote:
> Hi again
> 
> (sorry that I reply to my own email, rather than Brian's, but I've only
> just subscribed to the list), I'll try to address Brian's email in here
> with fake quotation. Sorry for breaking the threading :(
> 
> Brian Foster wrote:
> > Based on the size and consumption of the fs, first thing that comes
> > to mind is perhaps fragmented use of inode records. E.g., inode
> > records are spread all over the storage with handfuls of inodes
> > free here and there, which means individual inode allocation can
> > take a hit searching for the record with free inodes. I don't think
> > that explains rm performance though.
> 
> > It might be interesting to grab a couple perf traces of the touch
> > and rm commands and see what cpu usage looks like. E.g., 'perf
> > record -g touch <file>,' 'perf report -g,' etc.
> 
> I've attached both perf outputs and after reviewing them briefly I think
> the slowness has different causes, i.e. only the touch one is in
> xfs' territory.
> 

Ok, well if something else causes the rm slowness, the scattered free
inode scenario might be more likely.

I can't see any symbols associated with the perf output. I suspect that's
because I'm not running your kernel. It might be better to run 'perf
report -g' and copy/paste the stack trace for some of the larger
consumers.
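
Roughly, run on the box where the data was recorded so the symbols resolve:

# dump the call-graph report as plain text that can be pasted into a mail
perf report -g -i touch.perf.data --stdio > touch.stack.txt
perf report -g -i rm.perf.data --stdio > rm.stack.txt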

> > 	30265117 xfs: Fix rounding in xfs_alloc_fix_len()
> >
> > That originally went into 3.16 and I don't see it in the 3.14 stable
> > branch. Did xfs_repair actually report anything wrong?
> 
> Nope, only displayed all the stages, but nothing was fixed.
> 

Ok, that also seems like a further indication of the problem fixed by the
above commit.
commit.

> > It seems like you have sufficiently large and available free
> > space. That said, it's fairly common for filesystems to naturally
> > drop in performance as free space becomes more limited. E.g., I
> > think it's common practice to avoid regular usage while over 80%
> > used if performance is a major concern. Also, I doubt xfs_fsr will
> > do much to affect inode allocation performance, but I could be
> > wrong.
> 
> Yes, we should have monitored that mount point rather than /tmp, which is
> what we did when bad things happened(TM). Given that we have a high
> fragmentation of directories, would xfs_fsr help here at all?
> 

I don't _think_ that fsr will mess with directories, but I don't know
for sure...

> Regarding v5, currently we are copying data off that disk and will
> create it anew with -m crc=1,finobt=1 on a recent 3.18 kernel. Apart
> from that I don't know what more we can do to safeguard us against
> this happening again (well, keep it below 80% all the time as well).
> 

Sounds good. FWIW, something like the following should tell us how many
free inodes are available in each ag, and thus whether we have to search
for free inodes in existing records rather than allocate new ones:

for i in $(seq 0 14); do
	xfs_db -c "agi $i" -c "p freecount" <dev>
done

Brian

> Thanks a lot for the remarks!
> 
> cheers
> 
> Carsten
> 
> -- 
> Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
> Callinstrasse 38, 30167 Hannover, Germany
> phone/fax: +49 511 762-17185 / -17193
> https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome




* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12 15:52   ` Brian Foster
@ 2015-01-12 16:09     ` Carsten Aulbert
  2015-01-12 16:37       ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-12 16:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 1095 bytes --]

Hi Brian

On 01/12/2015 04:52 PM, Brian Foster wrote:
> I can't see any symbols associated with the perf output. I suspect
> because I'm not running on your kernel. It might be better to run 'perf
> report -g' and copy/paste the stack trace for some of the larger
> consumers.
> 

Sorry, I rarely need to use perf and of course forgot that the
intermediate output is tightly coupled to the running kernel. Attaching
the output of perf report -g here.
> 
> Sounds good. FWIW, something like the following should tell us how many
> free inodes are available in each ag, and thus whether we have to search
> for free inodes in existing records rather than allocate new ones:
> 
> for i in $(seq 0 14); do
> 	xfs_db -c "agi $i" -c "p freecount" <dev>
> done
> 
Another metric :)

freecount = 53795884
freecount = 251
freecount = 45
freecount = 381
freecount = 11009
freecount = 6748
freecount = 663
freecount = 595
freecount = 693
freecount = 9089
freecount = 37122
freecount = 2657
freecount = 60497
freecount = 1790275
freecount = 54544

That looks... not really uniform to me.

Cheers

Carsten



[-- Attachment #2: touch.stack.gz --]
[-- Type: application/gzip, Size: 3498 bytes --]

[-- Attachment #3: rm.stack.gz --]
[-- Type: application/gzip, Size: 702 bytes --]

[-- Attachment #4: Type: text/plain, Size: 121 bytes --]


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12 16:09     ` Carsten Aulbert
@ 2015-01-12 16:37       ` Brian Foster
  2015-01-12 17:33         ` Carsten Aulbert
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2015-01-12 16:37 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs

On Mon, Jan 12, 2015 at 05:09:01PM +0100, Carsten Aulbert wrote:
> Hi Brian
> 
> On 01/12/2015 04:52 PM, Brian Foster wrote:
> > I can't see any symbols associated with the perf output. I suspect
> > because I'm not running on your kernel. It might be better to run 'perf
> > report -g' and copy/paste the stack trace for some of the larger
> > consumers.
> > 
> 
> Sorry, I rarely need to use perf and of course forgot that the
> intermediate output is tightly coupled to the running kernel. Attaching
> the output of perf report -g here.

It does look like we're spending most of the time down in
xfs_dialloc_ag(), which is the algorithm for finding a free inode in the
btree for a particular AG when we know that existing records have some
free.

> > 
> > Sounds good. FWIW, something like the following should tell us how many
> > free inodes are available in each ag, and thus whether we have to search
> > for free inodes in existing records rather than allocate new ones:
> > 
> > for i in $(seq 0 14); do
> > 	xfs_db -c "agi $i" -c "p freecount" <dev>
> > done
> > 
> Another metric :)
> 
> freecount = 53795884
> freecount = 251
> freecount = 45
> freecount = 381
> freecount = 11009
> freecount = 6748
> freecount = 663
> freecount = 595
> freecount = 693
> freecount = 9089
> freecount = 37122
> freecount = 2657
> freecount = 60497
> freecount = 1790275
> freecount = 54544
> 
> That looks... not really uniform to me.
> 

No, but it does show that there are a bunch of free inodes scattered
throughout the existing records in most of the AGs. The finobt should
definitely help avoid the allocation latency when this occurs.

It is interesting that you have so many more free inodes in ag 0 (~53m
as opposed to several hundreds/thousands in others). What does 'p count'
show for each ag? Was this fs grown to the current size over time?
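
The same kind of loop as before should do it, e.g.:

for i in $(seq 0 14); do
	xfs_db -r -c "agi $i" -c "p count" <dev>
done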

Brian

> Cheers
> 
> Carsten
> 
> 




* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12 16:37       ` Brian Foster
@ 2015-01-12 17:33         ` Carsten Aulbert
  2015-01-13 20:06           ` Stan Hoeppner
  2015-01-13 20:33           ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-12 17:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

[-- Attachment #1: Type: text/plain, Size: 1205 bytes --]

Hi

On 01/12/2015 05:37 PM, Brian Foster wrote:
> No, but it does show that there are a bunch of free inodes scattered
> throughout the existing records in most of the AGs. The finobt should
> definitely help avoid the allocation latency when this occurs.
> 

That is good to know/hope :)

> It is interesting that you have so many more free inodes in ag 0 (~53m
> as opposed to several hundreds/thousands in others). What does 'p count'
> show for each ag? Was this fs grown to the current size over time?
> 
"p count" seems to "thin out" over ag:

count = 513057792
count = 16596224
count = 15387584
count = 14958528
count = 4096960
count = 4340416
count = 4987968
count = 3321792
count = 5041856
count = 5485376
count = 5233088
count = 5810432
count = 5271552
count = 5464000
count = 365440

(if the full print output is interesting, it's attached).

The FS was never grown, the machine was installed on November 5th, 2012
and has been (ab)using the FS ever since.
1-2 million file creations per day (ranging from a few kByte files to a
few 100 kBytes) and also an equally large number of deletions (after
some time). Thus overall, a somewhat busy server.

Cheers

Carsten


[-- Attachment #2: ag-print.log --]
[-- Type: text/plain, Size: 4175 bytes --]


AG 0
magicnum = 0x58414749
versionnum = 1
seqno = 0
length = 268435455
count = 513057792
root = 1286
level = 3
freecount = 53795884
newino = 128
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 1
magicnum = 0x58414749
versionnum = 1
seqno = 1
length = 268435455
count = 16596224
root = 123653
level = 3
freecount = 251
newino = 17696
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 2
magicnum = 0x58414749
versionnum = 1
seqno = 2
length = 268435455
count = 15387584
root = 644
level = 3
freecount = 45
newino = 5081504
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 3
magicnum = 0x58414749
versionnum = 1
seqno = 3
length = 268435455
count = 14958528
root = 317207
level = 3
freecount = 381
newino = 5034112
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 4
magicnum = 0x58414749
versionnum = 1
seqno = 4
length = 268435455
count = 4096960
root = 320520
level = 2
freecount = 11009
newino = 5079872
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 5
magicnum = 0x58414749
versionnum = 1
seqno = 5
length = 268435455
count = 4340416
root = 743
level = 2
freecount = 6748
newino = 5150208
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 6
magicnum = 0x58414749
versionnum = 1
seqno = 6
length = 268435455
count = 4987968
root = 155879
level = 2
freecount = 663
newino = 4937472
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 7
magicnum = 0x58414749
versionnum = 1
seqno = 7
length = 268435455
count = 3321792
root = 543815
level = 2
freecount = 595
newino = 13695232
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 8
magicnum = 0x58414749
versionnum = 1
seqno = 8
length = 268435455
count = 5041856
root = 507410
level = 2
freecount = 693
newino = 5057856
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 9
magicnum = 0x58414749
versionnum = 1
seqno = 9
length = 268435455
count = 5485376
root = 208
level = 2
freecount = 9089
newino = 5134912
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 10
magicnum = 0x58414749
versionnum = 1
seqno = 10
length = 268435455
count = 5233088
root = 821
level = 2
freecount = 37122
newino = 5060416
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 11
magicnum = 0x58414749
versionnum = 1
seqno = 11
length = 268435455
count = 5810432
root = 349837
level = 2
freecount = 2657
newino = 5102880
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 12
magicnum = 0x58414749
versionnum = 1
seqno = 12
length = 268435455
count = 5271552
root = 314
level = 2
freecount = 60497
newino = 5050528
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 13
magicnum = 0x58414749
versionnum = 1
seqno = 13
length = 268435455
count = 5464000
root = 992
level = 2
freecount = 1790275
newino = 5067808
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

AG 14
magicnum = 0x58414749
versionnum = 1
seqno = 14
length = 123739108
count = 365440
root = 945
level = 2
freecount = 54544
newino = 5084224
dirino = null
unlinked[0-63] = 
uuid = 00000000-0000-0000-0000-000000000000
lsn = 0
crc = 0 (unchecked)
free_root = 0
free_level = 0

[-- Attachment #3: Type: text/plain, Size: 121 bytes --]


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12 17:33         ` Carsten Aulbert
@ 2015-01-13 20:06           ` Stan Hoeppner
  2015-01-13 20:13             ` Carsten Aulbert
  2015-01-13 20:33           ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Stan Hoeppner @ 2015-01-13 20:06 UTC (permalink / raw)
  To: Carsten Aulbert, Brian Foster; +Cc: xfs

On 01/12/2015 11:33 AM, Carsten Aulbert wrote:
...
> The FS was never grown, the machine was installed on November 5th, 2012
> and has been (ab)using the FS ever since. On average there have been about
> 1-2 million file creations per day (ranging from a few kByte files to a
> few 100 kBytes) and also an equally large number of deletions (after
> some time). Thus overall, a somewhat busy server.

This workload seems more suited to a database than a filesystem. Though
surely you've already considered such, and chose not to go that route.

With high fragmentation you get lots of seeking.  What model disks are
these?  What is your RAID10 geometry?  Are your partitions properly
aligned to that geometry, and to the drives (512n/512e)?
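
A quick way to check, using the device names from this thread:

# partition start and size in 512-byte sectors; the start (ideally the size
# too) should be a multiple of the full stripe width
parted /dev/sdb unit s print
cat /sys/block/sdb/sdb1/start /sys/block/sdb/sdb1/size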


Stan


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-13 20:06           ` Stan Hoeppner
@ 2015-01-13 20:13             ` Carsten Aulbert
  2015-01-13 20:43               ` Stan Hoeppner
  0 siblings, 1 reply; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-13 20:13 UTC (permalink / raw)
  To: Stan Hoeppner, Brian Foster; +Cc: xfs

Hi Stan

On 01/13/2015 09:06 PM, Stan Hoeppner wrote:
> This workload seems more suited to a database than a filesystem. Though
> surely you've already considered such, and chose not to go that route.
> 

Yepp, but as we do not fully control the server software and need to
work further on the binary blobs arriving, a database is also not that
well suited for it, but yes, we looked into it (and run mysql, mariadb,
cassandra, mongo, postgresql, ...)

> With high fragmentation you get lots of seeking.  What model disks are
> these?  What is your RAID10 geometry?  Are your partitions properly
> aligned to that geometry, and to the drives (512n/512e)?

Disks are 2TB Hitachi SATA drives (Ultrastar, HUA722020ALA330). As these
are some yrs old, they are native 512byte ones. They are connected via
an Areca 1261ML controller with a Supermicro backplane.

RAID striping is not ideal (128kByte per member disk) and thus our xfs
layout is not ideal either. Things we plan to change with the next
attempt ;)

After the arrival of "advanced format" HDD and SSDs we usually try to
align everything to full 1 MByte or larger, just to be sure any
combination of 512b, 4kb, ... will eventually align :)
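
For the rebuild we are considering something along these lines (only a
sketch, assuming the array stays at 8 data disks with 128 kByte chunks):

# tell XFS about the RAID10 geometry: 128 KiB stripe unit, 8 data disks wide
mkfs.xfs -f -d su=128k,sw=8 -m crc=1,finobt=1 /dev/sdb1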

Cheers

Carsten


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-12 17:33         ` Carsten Aulbert
  2015-01-13 20:06           ` Stan Hoeppner
@ 2015-01-13 20:33           ` Dave Chinner
  2015-01-14  6:12             ` Carsten Aulbert
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2015-01-13 20:33 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: Brian Foster, xfs

On Mon, Jan 12, 2015 at 06:33:06PM +0100, Carsten Aulbert wrote:
> Hi
> 
> On 01/12/2015 05:37 PM, Brian Foster wrote:
> > No, but it does show that there are a bunch of free inodes scattered
> > throughout the existing records in most of the AGs. The finobt should
> > definitely help avoid the allocation latency when this occurs.
> > 
> 
> That is good to know/hope :)
> 
> > It is interesting that you have so many more free inodes in ag 0 (~53m
> > as opposed to several hundreds/thousands in others). What does 'p count'
> > show for each ag? Was this fs grown to the current size over time?
> > 
> "p count" seems to "thin out" over ag:
> 
> count = 513057792

AG 0 has > 500 million allocated inodes, and > 50 million free
inodes.  Traversal of the inode btree to find free inodes is going
to be costly in both CPU time and memory footprint. The other AGs
have much fewer inodes, so searches are much less costly.

That's why allocation will be sometimes slow and sometimes fast - it
depends on what AG the inode is being allocated from.

> count = 16596224
> count = 15387584
> count = 14958528
> count = 4096960
> count = 4340416
> count = 4987968
> count = 3321792
> count = 5041856
> count = 5485376
> count = 5233088
> count = 5810432
> count = 5271552
> count = 5464000
> count = 365440

This pattern seems to match a filesystem that was running under
inode32 for most of its life, and is now using inode64 and hence
spreading the subdirectories and hence new inodes over all AGs
instead of just limiting them to AG 0....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-13 20:13             ` Carsten Aulbert
@ 2015-01-13 20:43               ` Stan Hoeppner
  2015-01-14  6:07                 ` Carsten Aulbert
  0 siblings, 1 reply; 14+ messages in thread
From: Stan Hoeppner @ 2015-01-13 20:43 UTC (permalink / raw)
  To: Carsten Aulbert, Brian Foster; +Cc: xfs

On 01/13/2015 02:13 PM, Carsten Aulbert wrote:
> Hi Stan
> 
> On 01/13/2015 09:06 PM, Stan Hoeppner wrote:
>> This workload seems more suited to a database than a filesystem. Though
>> surely you've already considered such, and chose not to go that route.
>>
> 
> Yepp, but as we do not fully control the server software and need to
> work further on the binary blobs arriving, a database is also not that
> well suited for it, but yes, we looked into it (and run mysql, marida,
> cassandra, mongo, postgresql, ...)
> 
>> With high fragmentation you get lots of seeking.  What model disks are
>> these?  What is your RAID10 geometry?  Are your partitions properly
>> aligned to that geometry, and to the drives (512n/512e)?
> 
> Disks are 2TB Hitachi SATA drives (Ultrastar, HUA722020ALA330). As these
> are some yrs old, they are native 512byte ones. They are connected via
> an Areca 1261ML controller with a Supermicro backplane.
> 
> RAID striping is not ideal (128kByte per member disk) and thus our xfs
> layout is not ideal either. Things we plan to change with the next
> attempt ;)

With your file sizes that seems a recipe for hotspots.  What do the
controller metrics tell you about IOs per drive, bytes per drive?  Are
they balanced?

> After the arrival of "advanced format" HDD and SSDs we usually try to
> align everything to full 1 MByte or larger, just to be sure any
> combination of 512b, 4kb, ... will eventually align :)

It's not that simple with striping.  Partitions need to start and end on
stripe boundaries, not simply multiples of 4KB or 1MB as with single
disks. If you use non power of 2 drive counts in a stripe, aligning at
multiples of 1MB will screw ya, e.g. 12 drives * 64KB is a 768KB stripe.
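
To put numbers on it (the 12x64KB case is the hypothetical one; 8 data
disks x 128KB is the array in this thread):

# 12 data disks * 64 KiB chunks = 768 KiB stripe; a 1 MiB offset is NOT aligned
echo $(( (1024 * 1024) % (12 * 64 * 1024) ))   # 262144 bytes left over
# 8 data disks * 128 KiB chunks = 1 MiB stripe; a 1 MiB offset IS aligned
echo $(( (1024 * 1024) % (8 * 128 * 1024) ))   # 0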

Stan




* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-13 20:43               ` Stan Hoeppner
@ 2015-01-14  6:07                 ` Carsten Aulbert
  0 siblings, 0 replies; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-14  6:07 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: xfs

Hi Stan

On 01/13/2015 09:43 PM, Stan Hoeppner wrote:
> With your file sizes that seems a recipe for hotspots.  What do the
> controller metrics tell you about IOs per drive, bytes per drive?  Are
> they balanced?
> 

They looked ok from 10000ft, but unfortunately, the controller is not
that talkative :(

>> After the arrival of "advanced format" HDD and SSDs we usually try to
>> align everything to full 1 MByte or larger, just to be sure any
>> combination of 512b, 4kb, ... will eventually align :)
> 
> It's not that simple with striping.  Partitions need to start and end on
> stripe boundaries, not simply multiples of 4KB or 1MB as with single
> disks. If you use non power of 2 drive counts in a stripe, aligning at
> multiples of 1MB will screw ya, e.g. 12 drives * 64KB is a 768KB stripe.

At the moment there are 16 disks (or 8 "data" disks), so here, power of
2 still prevails. But still, very valid remark!

Cheers

Carsten


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-13 20:33           ` Dave Chinner
@ 2015-01-14  6:12             ` Carsten Aulbert
  2015-01-16 15:35               ` Carlos Maiolino
  0 siblings, 1 reply; 14+ messages in thread
From: Carsten Aulbert @ 2015-01-14  6:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hi Dave

On 01/13/2015 09:33 PM, Dave Chinner wrote:
> 
> This pattern seems to match a filesystem that was running under
> inode32 for most of its life, and is now using inode64 and hence
> spreading the subdirectories and hence new inodes over all AGs
> instead of just limiting them to AG 0....
> 

I fear you are right. The install logs from 2012 show that fstab simply
contained:

UUID=dd407af5-1edc-41de-8c72-5fe1de31fb11 /srv xfs rw  0 2

And our change management is not good enough to track back when we
changed to inode64.
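
Checking the current state is at least easy, e.g.:

# the active allocator mode shows up in the mount options
grep ' /srv ' /proc/mounts
# which here currently ends in "...,inode64,noquota 0 0"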

OK, I think I've gathered quite a lot of things we did wrong. I'll try
to summarize it in another email later today or tomorrow to invite a few
more comments before we go ahead redoing the FS. Also, it may serve as
summary for the list archive.

For now, I want to say a very, very big thank you for remarks and
suggestions!

Cheers

Carsten


* Re: extremely slow file creation/deletion after xfs ran full
  2015-01-14  6:12             ` Carsten Aulbert
@ 2015-01-16 15:35               ` Carlos Maiolino
  0 siblings, 0 replies; 14+ messages in thread
From: Carlos Maiolino @ 2015-01-16 15:35 UTC (permalink / raw)
  To: Carsten Aulbert; +Cc: xfs

On Wed, Jan 14, 2015 at 07:12:55AM +0100, Carsten Aulbert wrote:
> Hi Dave
> 
> On 01/13/2015 09:33 PM, Dave Chinner wrote:
> > 
> > This pattern seems to match a filesystem that was running under
> > inode32 for most of its life, and is now using inode64 and hence
> > spreading the subdirectories and hence new inodes over all AGs
> > instead of just limiting them to AG 0....
> > 
> 
> I fear you are right. The install logs from 2012 show that fstab simply
> contained:
> 
> UUID=dd407af5-1edc-41de-8c72-5fe1de31fb11 /srv xfs rw  0 2
> 
> And our change management is not that good to track back, when we
> changed to inode64.
> 
So, it looks like Dave is right: you had been running XFS with inode32 for
most of the time (which probably overloaded AG 0 of this FS) and then
changed to inode64 when it became the default option.

> OK, I think I've gathered quite a lot of things we did wrong. I'll try
> to summarize it in another email later today or tomorrow to invite a few
> more comments before we go ahead redoing the FS. Also, it may serve as
> summary for the list archive.
> 
> For now, I want to say a very, very big thank you for remarks and
> suggestions!
> 
> Cheers
> 
> Carsten
> 

-- 
Carlos


end of thread, other threads:[~2015-01-16 15:35 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-12  8:36 extremely slow file creation/deletion after xfs ran full Carsten Aulbert
2015-01-12 12:44 ` Brian Foster
2015-01-12 13:30 ` Carsten Aulbert
2015-01-12 15:52   ` Brian Foster
2015-01-12 16:09     ` Carsten Aulbert
2015-01-12 16:37       ` Brian Foster
2015-01-12 17:33         ` Carsten Aulbert
2015-01-13 20:06           ` Stan Hoeppner
2015-01-13 20:13             ` Carsten Aulbert
2015-01-13 20:43               ` Stan Hoeppner
2015-01-14  6:07                 ` Carsten Aulbert
2015-01-13 20:33           ` Dave Chinner
2015-01-14  6:12             ` Carsten Aulbert
2015-01-16 15:35               ` Carlos Maiolino
