* xfs_alloc_ag_vextent_near() takes minutes to complete
@ 2017-05-01 12:37 Alex Lyakas
  2017-05-01 15:26 ` Brian Foster
  2017-05-02  7:35 ` Christoph Hellwig
  0 siblings, 2 replies; 12+ messages in thread
From: Alex Lyakas @ 2017-05-01 12:37 UTC (permalink / raw)
  To: linux-xfs

[resending to proper list]

Greetings XFS community,

We have an XFS instance mounted as "sync"
(rw,sync,noatime,wsync,attr2,discard,inode64,prjquota 0 0). This instance is
exposed via nfsd. Size is about 50TB.

root@0000000f:/home/alex # xfs_info /export/share1
meta-data=/dev/dm-64             isize=256    agcount=49, agsize=268435455 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=13107200000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

We observe the following issue:

An nfsd thread comes into xfs_file_write_iter()=>xfs_file_buffered_aio_write(),
which copies the data into the page cache. Then xfs_file_write_iter() calls
generic_write_sync(), which ends up working for minutes in a stack like [1].
It seems to be scanning the free space btrees, calling xfs_btree_increment()
and sometimes xfs_btree_decrement().

This whole operation is performed while i_lock is exclusively locked by
xfs_iomap_write_allocate(), before calling xfs_bmapi_write(). So i_lock is
being held for minutes.

Meanwhile, another nfsd thread calls into the same inode, reaches
xfs_file_buffered_aio_write(), and calls xfs_rw_ilock(XFS_IOLOCK_EXCL),
which locks both i_mutex and i_iolock. This thread then calls
xfs_file_aio_write_checks(), which wants to grab the i_lock in a stack like
[2]. So now all three locks (i_mutex, i_lock and i_iolock) are taken.

Now another nfsd thread calling into xfs_file_buffered_aio_write() will get
stuck waiting for i_mutex, as will any other thread that needs one of these
locks. In some cases this triggers a hung-task panic.

After the allocation completes, I look at the inode in question via xfs_db,
and I see that the file is not heavily fragmented; it has only a few tens of
extents. So file fragmentation is not the issue.

Questions:
1) The thread in [1] does not seem to be waiting for IO. It wants to lock a
metadata buffer, which (to my understanding) is done before issuing IO on it.
What is it waiting for?

2) Any recommendations on what to do in such a situation? Would running
xfs_fsr help? (Although the files in question are not heavily fragmented.)

Kernel version is 3.18.19.

BTW, in the code I see the following comment:
/*
 * Locking primitives for read and write IO paths to ensure we consistently
 * use and order the inode->i_mutex, ip->i_lock and ip->i_iolock.
 */

But the above analysis clearly shows that this locking order is violated:
xfs_file_buffered_aio_write() grabs the i_mutex and the i_iolock, and only
then does it (indirectly) grab the i_lock. Is the comment wrong, or is the
code wrong? I looked at kernel 4.10, and it seems to have the same
discrepancy between the comment and the code.

Thanks,
Alex.




[1]
[<ffffffff810b90f1>] down+0x41/0x50
[<ffffffffc0d79090>] xfs_buf_lock+0x40/0x120 [xfs]
[<ffffffffc0d792cd>] _xfs_buf_find+0x15d/0x2e0 [xfs]
[<ffffffffc0d7947a>] xfs_buf_get_map+0x2a/0x210 [xfs]
[<ffffffffc0d7a1dc>] xfs_buf_read_map+0x2c/0x190 [xfs]
[<ffffffffc0dac621>] xfs_trans_read_buf_map+0x261/0x490 [xfs]
[<ffffffffc0d50a89>] xfs_btree_read_buf_block.constprop.28+0x69/0xa0 [xfs]
[<ffffffffc0d5340e>] xfs_btree_increment+0x1ee/0x2e0 [xfs]
[<ffffffffc0d384ea>] xfs_alloc_ag_vextent_near+0x58a/0x1e20 [xfs]
[<ffffffffc0d39ea5>] xfs_alloc_ag_vextent+0xd5/0x120 [xfs]
[<ffffffffc0d3b65f>] xfs_alloc_vextent+0x48f/0x690 [xfs]
[<ffffffffc0d4c4ff>] xfs_bmap_btalloc+0x39f/0x700 [xfs]
[<ffffffffc0d4c884>] xfs_bmap_alloc+0x24/0x40 [xfs]
[<ffffffffc0d4d601>] xfs_bmapi_write+0x811/0xe00 [xfs]
[<ffffffffc0d8b0c5>] xfs_iomap_write_allocate+0x155/0x4f0 [xfs]
[<ffffffffc0d7159e>] xfs_map_blocks+0x1de/0x250 [xfs]
[<ffffffffc0d72e15>] xfs_vm_writepage+0x1a5/0x630 [xfs]
[<ffffffff8117c377>] __writepage+0x17/0x50
[<ffffffff8117ccb8>] write_cache_pages+0x238/0x4c0
[<ffffffff8117cf80>] generic_writepages+0x40/0x60
[<ffffffffc0d71c93>] xfs_vm_writepages+0x43/0x50 [xfs]
[<ffffffff8117e7ae>] do_writepages+0x1e/0x40
[<ffffffff81173429>] __filemap_fdatawrite_range+0x59/0x60
[<ffffffff8117352a>] filemap_write_and_wait_range+0x2a/0x70
[<ffffffffc0d80af7>] xfs_file_fsync+0x57/0x220 [xfs]
[<ffffffff8121823b>] vfs_fsync_range+0x1b/0x30
[<ffffffffc0d8238b>] xfs_file_write_iter+0xeb/0x130 [xfs]
[<ffffffff811e7545>] do_iter_readv_writev+0x65/0xa0
[<ffffffff811e8bf9>] do_readv_writev+0xc9/0x280
[<ffffffff811e8e39>] vfs_writev+0x39/0x50
[<ffffffffc0306fb7>] nfsd_vfs_write.isra.14+0x97/0x350 [nfsd]
[<ffffffffc030a8ce>] nfsd_write+0x8e/0x100 [nfsd]
[<ffffffffc0315411>] nfsd4_write+0x1c1/0x220 [nfsd]
[<ffffffffc0316edc>] nfsd4_proc_compound+0x50c/0x850 [nfsd]
[<ffffffffc0302ed3>] nfsd_dispatch+0xd3/0x240 [nfsd]
[<ffffffffc0263dc0>] svc_process_common+0x440/0x6d0 [sunrpc]
[<ffffffffc0264157>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffc0302837>] nfsd+0x127/0x1a0 [nfsd]
[<ffffffff810911b9>] kthread+0xc9/0xe0
[<ffffffff81717918>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

[2]
[<ffffffffc0c80db1>] xfs_ilock+0x321/0x330 [xfs]
[<ffffffffc0c7e5dd>] xfs_vn_update_time+0x6d/0x1b0 [xfs]
[<ffffffff81202c15>] update_time+0x25/0xc0
[<ffffffff810d7c76>] ? current_fs_time+0x16/0x60
[<ffffffff81202e70>] file_update_time+0x80/0xd0
[<ffffffffc0c74e60>] xfs_file_aio_write_checks+0x140/0x1a0 [xfs]
[<ffffffffc0c74fac>] xfs_file_buffered_aio_write.isra.11+0xec/0x390 [xfs]
[<ffffffff811e7430>] ? new_sync_read+0xb0/0xb0
[<ffffffffc0c752d3>] xfs_file_write_iter+0x83/0x130 [xfs]
[<ffffffff811e7545>] do_iter_readv_writev+0x65/0xa0
[<ffffffff811e8bf9>] do_readv_writev+0xc9/0x280
[<ffffffffc0c75250>] ? xfs_file_buffered_aio_write.isra.11+0x390/0x390 [xfs]
[<ffffffffc0c75250>] ? xfs_file_buffered_aio_write.isra.11+0x390/0x390 [xfs]
[<ffffffff81341e43>] ? apparmor_cred_prepare+0x33/0x50
[<ffffffff811e8e39>] vfs_writev+0x39/0x50
[<ffffffffc0306fb7>] nfsd_vfs_write.isra.14+0x97/0x350 [nfsd]
[<ffffffffc030a8ce>] nfsd_write+0x8e/0x100 [nfsd]
[<ffffffffc0315411>] nfsd4_write+0x1c1/0x220 [nfsd]
[<ffffffffc0316edc>] nfsd4_proc_compound+0x50c/0x850 [nfsd]
[<ffffffffc0302ed3>] nfsd_dispatch+0xd3/0x240 [nfsd]



* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-01 12:37 xfs_alloc_ag_vextent_near() takes minutes to complete Alex Lyakas
@ 2017-05-01 15:26 ` Brian Foster
  2017-05-02  7:35 ` Christoph Hellwig
  1 sibling, 0 replies; 12+ messages in thread
From: Brian Foster @ 2017-05-01 15:26 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-xfs

On Mon, May 01, 2017 at 03:37:23PM +0300, Alex Lyakas wrote:
> [resending to proper list]
> 
> Greetings XFS community,
> 
> We have an XFS instance mounted as "sync"
> (rw,sync,noatime,wsync,attr2,discard,inode64,prjquota 0 0). This instance is
> exposed via nfsd. Size is about 50TB.
> 
> root@0000000f:/home/alex # xfs_info /export/share1
> meta-data=/dev/dm-64             isize=256    agcount=49, agsize=268435455
> blks
>         =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=13107200000, imaxpct=5
>         =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>         =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 

Looks mostly like you are using defaults, with the exception of the sync-
and discard-related mount options.

> We observe the following issue:
> 
> nfsd thread comes into xfs_file_write_iter()=>xfs_file_buffered_aio_write(),
> which copies the data into page cache. Then xfs_file_write_iter() calls
> generic_write_sync(), which ends up working for minutes in stack like [1].
> It seems to be scanning the free space btrees, calling xfs_btree_increment()
> and sometimes xfs_btree_decrement().
> 
> This whole operation is performed while i_lock is exclusively locked by
> xfs_iomap_write_allocate(), before calling xfs_bmapi_write(). So i_lock is
> being held for minutes.
> 
> Meanwhile, another nfsd thread calls into the same inode, and  reaches
> xfs_file_buffered_aio_write(), and calls xfs_rw_ilock(XFS_IOLOCK_EXCL),
> which locks both i_mutex and i_iolock. Now this thread calls
> xfs_file_aio_write_checks() which wants to grab the i_lock in stack like
> [2]. So now all three locks (i_mutex, i_lock and i_iolock) are taken.
> 
> Now another nfsd thread calling into xfs_file_buffered_aio_write() will get
> stuck waiting for i_mutex. Or any other thread that needs one of these locks
> will get stuck. This in some cases triggers hung-task panic.
> 
> After allocation completes, I look at the inode in question via xfs_db, and
> I see that this file is not heavily fragmented. It has few 10s of extents
> only. So file fragmentation is not an issue.
> 
> Questions:
> 1) The thread in [1] does not seem to be waiting for IO. It wants to lock a
> metadata buffer, which is done before spawning a IO on it (to my
> understanding). What is it waiting for?
> 

The search overhead could be high due to either fragmented free space or
perhaps waiting on busy extents (since you have enabled online discard).
Do you have any threads freeing space and waiting on discard operations
when this occurs? Also, what does 'xfs_db -c "freesp -s" <dev>' show for
this filesystem?

> 2) Any recommendations what to do in such situation? Will running xfs_fsr
> help? (although the files in question are not heavily fragmented)
> 

I don't think so.

> Kernel version is 3.18.19.
> 
> BTW, in the code I see the following comment:
> /*
> * Locking primitives for read and write IO paths to ensure we consistently
> use
> * and order the inode->i_mutex, ip->i_lock and ip->i_iolock.
> */
> 
> But the above analysis clearly shows that this locking order is violated.
> xfs_file_buffered_aio_write()  grabs the i_mutex and the i_iolock. And only
> then it wants to grab the i_lock (indirectly). Is the comment wrong or is
> the code wrong? I looked at kernel 4.10, and it seems to have the same
> discrepancy between the comment and the code.
> 

I don't think the comment above is intended to define locking order,
just to indicate that order is important. The iolock generally comes
before the ilock.

Brian

> Thanks,
> Alex.


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-01 12:37 xfs_alloc_ag_vextent_near() takes minutes to complete Alex Lyakas
  2017-05-01 15:26 ` Brian Foster
@ 2017-05-02  7:35 ` Christoph Hellwig
  2017-05-04  8:07   ` Alex Lyakas
  2017-05-07  8:00   ` Alex Lyakas
  1 sibling, 2 replies; 12+ messages in thread
From: Christoph Hellwig @ 2017-05-02  7:35 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-xfs

On Mon, May 01, 2017 at 03:37:23PM +0300, Alex Lyakas wrote:
> [resending to proper list]
> 
> Greetings XFS community,
> 
> We have an XFS instance mounted as "sync"
> (rw,sync,noatime,wsync,attr2,discard,inode64,prjquota 0 0). This instance is
> exposed via nfsd. Size is about 50TB.

Don't use the sync option ever.  In fact we probably should remove it.

> Kernel version is 3.18.19.

Which is extremely outdated.  Please try the latest mainline, or at least
4.9-stable, which is long-term supported.


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-02  7:35 ` Christoph Hellwig
@ 2017-05-04  8:07   ` Alex Lyakas
  2017-05-04 11:13     ` Alex Lyakas
                       ` (2 more replies)
  2017-05-07  8:00   ` Alex Lyakas
  1 sibling, 3 replies; 12+ messages in thread
From: Alex Lyakas @ 2017-05-04  8:07 UTC (permalink / raw)
  To: Christoph Hellwig, Brian Foster; +Cc: linux-xfs

Hello Brian, Christoph,

Thank you for your responses.

> The search overhead could be high due to either fragmented free space or
> perhaps waiting on busy extents (since you have enabled online discard).
> Do you have any threads freeing space and waiting on discard operations
> when this occurs? Also, what does 'xfs_db -c "freesp -s" <dev>' show for
> this filesystem?
I disabled the discard, but the problem still happens. Output of the freesp
command is at [1]. To my understanding this means that 60% of the free space
is in extents of 16-31 contiguous blocks, i.e., 64KB-124KB. Does this count
as fragmented free space?

I debugged the issue further, profiling the xfs_alloc_ag_vextent_near() call 
and what it does. Some results:

# it appears to not be triggering any READs of xfs_buf, i.e., no calls to 
xfs_buf_ioapply_map() with rw==READ or rw==READA in the same thread
# most of the time (about 95%) is spent in xfs_buf_lock(), waiting in the
"down(&bp->b_sema)" call
# the average time to lock an xfs_buf is about 10-12 ms

For example, in one test it took 45778 ms to complete the
xfs_alloc_ag_vextent_near() execution. During this time, 6240 xfs_bufs were
locked, totalling 42810 ms spent locking buffers, which is about 93%, or
about 7 ms on average to lock a buffer.

# it is still not clear who is holding the lock
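
For reference, a rough sketch of how this kind of per-call timing can be
gathered with the stock function_graph tracer; this assumes debugfs is
mounted at /sys/kernel/debug and that xfs_buf_lock appears in
available_filter_functions, and is not the exact instrumentation I used:

  # report per-call durations of xfs_buf_lock() and its callees
  cd /sys/kernel/debug/tracing
  echo 0 > tracing_on
  echo function_graph > current_tracer
  echo xfs_buf_lock > set_graph_function
  # optional: only record calls slower than 1000 microseconds
  echo 1000 > tracing_thresh
  echo 1 > tracing_on
  # ... reproduce the slow allocation, then:
  head -100 trace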

Christoph, I understand that kernel 3.18 is EOL at the moment, but it used to
be a long-term kernel, so there is still an expectation of stability, though
perhaps not of community support at this point.

Thanks,
Alex.


[1]
   from      to extents  blocks    pct
      1       1  155759  155759   0.00
      2       3    1319    3328   0.00
      4       7   13153   56265   0.00
      8      15  152663 1752813   0.03
     16      31 143626908 4019133338  60.17
     32      63 1484214 72838775   1.09
     64     127 9799130 876068428  13.12
    128     255 1929928 310722786   4.65
    256     511  150035 49779357   0.75
    512    1023   26496 19658529   0.29
   1024    2047   27800 41418636   0.62
   2048    4095   26369 77587481   1.16
   4096    8191   13872 80270202   1.20
   8192   16383    6653 77527746   1.16
  16384   32767    4384 100576452   1.51
  32768   65535    3967 200958816   3.01
  65536  131071    1346 127613203   1.91
131072  262143     753 141530959   2.12
262144  524287     473 168900109   2.53
524288 1048575     202 147607986   2.21
1048576 2097151      65 95394305   1.43
2097152 4194303      16 42998164   0.64
4194304 8388607       5 26710209   0.40
total free extents 157425510
total free blocks 6679263646
average free extent size 42.4281



* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-04  8:07   ` Alex Lyakas
@ 2017-05-04 11:13     ` Alex Lyakas
  2017-05-04 12:29       ` Brian Foster
  2017-05-04 12:25     ` Brian Foster
  2017-05-05  3:29     ` Dave Chinner
  2 siblings, 1 reply; 12+ messages in thread
From: Alex Lyakas @ 2017-05-04 11:13 UTC (permalink / raw)
  To: Christoph Hellwig, Brian Foster; +Cc: linux-xfs

Hello,

> # it is still not clear who is holding the lock
Further analysis shows that the xfs_buf lock is being held because the
buffer is currently being read from disk. The stack that unlocks the
buffer is [1]. The size of each buffer being read is 4KB.

So the bottom line is that XFS sometimes needs to search through thousands
of 4KB metadata blocks, and has to wait while they are read from disk. This
causes very slow allocation in some cases, which in turn causes other
threads to wait on i_lock/i_mutex, in some cases triggering hung-task
panics.

Can anything be done to remedy this?

Thanks,
Alex.



[1]
Call Trace:
 [<ffffffff81710c85>] dump_stack+0x4e/0x71
 [<ffffffffc0fade1b>] xfs_buf_ioend+0x22b/0x230 [xfs]
 [<ffffffffc0fade35>] xfs_buf_ioend_work+0x15/0x20 [xfs]
 [<ffffffff8108bd56>] process_one_work+0x146/0x410
 [<ffffffff8108c141>] worker_thread+0x121/0x450
 [<ffffffff8108c020>] ? process_one_work+0x410/0x410
 [<ffffffff810911b9>] kthread+0xc9/0xe0
 [<ffffffff810910f0>] ? kthread_create_on_node+0x180/0x180
 [<ffffffff81717918>] ret_from_fork+0x58/0x90
 [<ffffffff810910f0>] ? kthread_create_on_node+0x180/0x180


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-04  8:07   ` Alex Lyakas
  2017-05-04 11:13     ` Alex Lyakas
@ 2017-05-04 12:25     ` Brian Foster
  2017-05-04 13:53       ` Alex Lyakas
  2017-05-05  3:29     ` Dave Chinner
  2 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2017-05-04 12:25 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Christoph Hellwig, linux-xfs

On Thu, May 04, 2017 at 11:07:45AM +0300, Alex Lyakas wrote:
> Hello Brian, Cristoph,
> 
> Thank you for your responses.
> 
> > The search overhead could be high due to either fragmented free space or
> > perhaps waiting on busy extents (since you have enabled online discard).
> > Do you have any threads freeing space and waiting on discard operations
> > when this occurs? Also, what does 'xfs_db -c "freesp -s" <dev>' show for
> > this filesystem?
> I disabled the discard, but the problem still happens. Output of the freesp
> command is at [1]. To my understanding this means that 60% of the free space
> is 16-31 continuous blocks, i.e., 64kb-124kb. Does this count as a
> fragmented free space?
> 

Ok. Note that the discard implementation in kernels that old is known
to produce similar stalls, so you may want to consider using fstrim going
forward, independent of this problem.
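
For example, a periodic batched trim along these lines (assuming the
filesystem is still mounted at /export/share1 as in your original report)
gets you most of the benefit without the per-extent discards:

  # run from cron, e.g. weekly; -v reports how much was trimmed
  fstrim -v /export/share1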

That aside, free space does appear to be fairly fragmented to me. You do
have some larger extents available, but doing locality-based allocations
certainly seems like it could drop into a range where a time-consuming
search is required.
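
If it helps to narrow things down, the same histogram can be generated per
AG, e.g. something like the following (assuming the same /dev/dm-64 device
node; repeat -a for other AGs):

  # free space histogram for AG 0 only
  xfs_db -r -c "freesp -s -a 0" /dev/dm-64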

> I debugged the issue further, profiling the xfs_alloc_ag_vextent_near() call
> and what it does. Some results:
> 
> # it appears to not be triggering any READs of xfs_buf, i.e., no calls to
> xfs_buf_ioapply_map() with rw==READ or rw==READA in the same thread
> # most of the time (about 95%) is spent in xfs_buf_lock() waiting in
> "down(&bp->b_sema)" call
> # the average time to lock an xfs_buf is about 10-12 ms
> 
> For example, in one test it took 45778 ms to complete the
> xfs_alloc_ag_vextent_near()  execution. During this time, 6240 xfs_buf were
> locked, totalling to 42810 ms spent in locking the buffers, which is about
> 93%. On average 7 ms to lock a buffer.
> 

This is probably dropping into the fallback allocation algorithm in
xfs_alloc_ag_vextent_near(), explained by the following comment:

        /*
         * Second algorithm.
         * Search in the by-bno tree to the left and to the right
         * simultaneously, until in each case we find a space big enough,
         * or run into the edge of the tree.  When we run into the edge,
         * we deallocate that cursor.
         * If both searches succeed, we compare the two spaces and pick
         * the better one.
         * With alignment, it's possible for both to fail; the upper
         * level algorithm that picks allocation groups for allocations
         * is not supposed to do this.
         */

So what is happening here is that the algorithm is starting at a point
in an AG based on a starting blockno (e.g., such as the inode block) and
searching left and right from there for a suitable range of free blocks.
Depending on the size of the tree, fragmentation of free space, size of
the allocation request, etc., it can certainly take a while to seek/read
all of the btree blocks required to satisfy the allocation.

I suspect this is ultimately caused by the sync mount option, presumably
converting smallish chunks of delalloc blocks to real blocks repeatedly
and in parallel with other allocations, and fragmenting free space over
time. I don't think there is any easy way out of the current state of
the fs outside of reformatting and migrating data to a new fs without
using the sync mount option.
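
For what it's worth, one common way to do that kind of migration, assuming
downtime is acceptable and the new filesystem is mounted at a hypothetical
/mnt/new, is a dump/restore pipe:

  # level-0 dump of the old fs streamed straight into the new one
  xfsdump -l 0 - /export/share1 | xfsrestore - /mnt/new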

Brian



* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-04 11:13     ` Alex Lyakas
@ 2017-05-04 12:29       ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2017-05-04 12:29 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Christoph Hellwig, linux-xfs

On Thu, May 04, 2017 at 02:13:40PM +0300, Alex Lyakas wrote:
> Hello,
> 
> > # it is still not clear who is holding the lock
> Further analysis shows that the xfs_buf lock is being held, because
> the buffer is currently being read from disk. The stack that unlocks
> the buffer is [1]. The size of each buffer being read is 4Kb.
> 
> So bottom line is that sometimes XFS needs to search through thousands
> of metadata blocks of 4Kb, and needs to wait while they are being read
> from disk. This causes very slow allocation in some cases, further
> causing other threads to wait on i_lock/i_mutex, in some cases
> triggering hung-task panics.
> 

Not that this helps resolve the fundamental problem, but note that I
don't think hung task messages have to result in kernel panics. On a
quick look at v3.18, the CONFIG_BOOTPARAM_HUNG_TASK_PANIC kernel config
option looks like it determines whether a hung task message panics the
kernel or not.
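
The same behaviour is also tunable at runtime on kernels built with
CONFIG_DETECT_HUNG_TASK, roughly along these lines:

  # 0 = only warn on hung tasks, 1 = panic; plus the detection timeout
  cat /proc/sys/kernel/hung_task_panic
  echo 0 > /proc/sys/kernel/hung_task_panic
  echo 240 > /proc/sys/kernel/hung_task_timeout_secs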

Brian



* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-04 12:25     ` Brian Foster
@ 2017-05-04 13:53       ` Alex Lyakas
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Lyakas @ 2017-05-04 13:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs

Hi Brian,

> This is probably dropping into the fallback allocation algorithm in
> xfs_alloc_ag_vextent_near(), explained by the following comment:
>
>         /*
>          * Second algorithm.
>          * Search in the by-bno tree to the left and to the right
>          * simultaneously, until in each case we find a space big enough,
>          * or run into the edge of the tree.  When we run into the edge,
>          * we deallocate that cursor.
>          * If both searches succeed, we compare the two spaces and pick
>          * the better one.
>          * With alignment, it's possible for both to fail; the upper
>          * level algorithm that picks allocation groups for allocations
>          * is not supposed to do this.
>          */
>
> So what is happening here is that the algorithm is starting at a point
> in an AG based on a starting blockno (e.g., such as the inode block) and
> searching left and right from there for a suitable range of free blocks.
> Depending on the size of the tree, fragmentation of free space, size of
> the allocation request, etc., it can certainly take a while to seek/read
> all of the btree blocks required to satisfy the allocation.
Yes, according to my profiling, all of the slow allocation incidents
indeed end up in the "second" algorithm.

Thanks,
Alex.


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-04  8:07   ` Alex Lyakas
  2017-05-04 11:13     ` Alex Lyakas
  2017-05-04 12:25     ` Brian Foster
@ 2017-05-05  3:29     ` Dave Chinner
  2017-05-07  7:52       ` Alex Lyakas
  2 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2017-05-05  3:29 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Christoph Hellwig, Brian Foster, linux-xfs

On Thu, May 04, 2017 at 11:07:45AM +0300, Alex Lyakas wrote:
> Hello Brian, Cristoph,
> 
> Thank you for your responses.
> 
> >The search overhead could be high due to either fragmented free space or
> >perhaps waiting on busy extents (since you have enabled online discard).
> >Do you have any threads freeing space and waiting on discard operations
> >when this occurs? Also, what does 'xfs_db -c "freesp -s" <dev>' show for
> >this filesystem?
> I disabled the discard, but the problem still happens. Output of the
> freesp command is at [1]. To my understanding this means that 60% of
> the free space is 16-31 continuous blocks, i.e., 64kb-124kb. Does
> this count as a fragmented free space?
> 
> I debugged the issue further, profiling the
> xfs_alloc_ag_vextent_near() call and what it does. Some results:
> 
> # it appears to not be triggering any READs of xfs_buf, i.e., no
> calls to xfs_buf_ioapply_map() with rw==READ or rw==READA in the
> same thread
> # most of the time (about 95%) is spent in xfs_buf_lock() waiting in
> "down(&bp->b_sema)" call
> # the average time to lock an xfs_buf is about 10-12 ms
> 
> For example, in one test it took 45778 ms to complete the
> xfs_alloc_ag_vextent_near()  execution. During this time, 6240
> xfs_buf were locked, totalling to 42810 ms spent in locking the
> buffers, which is about 93%. On average 7 ms to lock a buffer.
> 
> # it is still not clear who is holding the lock
> 
> Cristoph, I understand that kernel 3.18 is EOL at the moment, but it
> used to be a long-term kernel, so there is an expectation of
> stability, but perhaps not community support at this point.
> 
> Thanks,
> Alex.
> 
> 
> [1]
>   from      to extents  blocks    pct
>      1       1  155759  155759   0.00
>      2       3    1319    3328   0.00
>      4       7   13153   56265   0.00
>      8      15  152663 1752813   0.03
>     16      31 143626908 4019133338  60.17

There's your problem. 143 million small free space extents totalling
4TB of free space. That's going to require (roughly speaking) somewhere
between 300,000 and 500,000 4k btree leaf blocks to index, i.e. a
footprint of 10-20GB of metadata.

Even accounting for it being evenly spread across 50 AGs, that's still
5-10k btree blocks per free space btree per AG, and so if that's not in
cache when we end up doing a linear search for a near block of a size
that falls into this bucket, it's going to get stuck reading btree leaf
siblings from disk synchronously....

Perhaps this "near block" search needs to terminate after a certain
search radius, similar to how the old AGI btree searches during inode
allocation were terminated after a certain radius of allocated inode
clusters were searched for free inodes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-05  3:29     ` Dave Chinner
@ 2017-05-07  7:52       ` Alex Lyakas
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Lyakas @ 2017-05-07  7:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Brian Foster, linux-xfs

>> [1]
>>   from      to extents  blocks    pct
>>      1       1  155759  155759   0.00
>>      2       3    1319    3328   0.00
>>      4       7   13153   56265   0.00
>>      8      15  152663 1752813   0.03
>>     16      31 143626908 4019133338  60.17
>
> There's your problem. 143 million small free space extents totalling
> 4TB of free space. That's going to require (roughly speaking)
> somewhere between 3-500,000 4k btree leaf blocks to index. i.e a
> footprint of 10-20GB of metadata.
>
> Even accounting for it being evenly spread across 50AGs, that's
> still a 5-10k of btree blocks per free space btree per AG, and so if
> that's not in cache when we end up doing a linear search for a near
> block of a size that falls into this bucket, it's going to get stuck
> reading btree leaf siblings from disk synchronously....
>
> Perhaps this "near block" search needs to terminate after at a
> certain search radius, similar to how the old AGI btree searches
> during inode allocation were terminated after a certain radius of
> allocated inode clusters were searched for free inodes....
Thank you for your response, Dave, this makes sense.

Alex.


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-02  7:35 ` Christoph Hellwig
  2017-05-04  8:07   ` Alex Lyakas
@ 2017-05-07  8:00   ` Alex Lyakas
  2017-05-07  9:12     ` Christoph Hellwig
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Lyakas @ 2017-05-07  8:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

Hi Christoph,

>> We have an XFS instance mounted as "sync"
>> (rw,sync,noatime,wsync,attr2,discard,inode64,prjquota 0 0). This instance is
>> exposed via nfsd. Size is about 50TB.
>
> Don't use the sync option ever.  In fact we probably should remove it.
We are using the sync mount option because the XFS instances that we
have are exposed via nfsd. Without the "sync" mount option, data from
an NFS WRITE command would end up sitting in the page cache, and the NFS
client would not know when it has finally been persisted to disk.
Is there any other way you can recommend to provide a data integrity
guarantee for NFS clients? Even if it requires some development, like
placing the data in a short-term persistent space, this is something
we will look at.

Thanks,
Alex.


* Re: xfs_alloc_ag_vextent_near() takes minutes to complete
  2017-05-07  8:00   ` Alex Lyakas
@ 2017-05-07  9:12     ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2017-05-07  9:12 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: Christoph Hellwig, linux-xfs

On Sun, May 07, 2017 at 11:00:40AM +0300, Alex Lyakas wrote:
> We are using the sync mount option, because the XFS instances that we
> have are exposed via nfsd. Without the "sync" mount option, data from
> NFS write command would end up sitting in page cache, and the nfs
> client would not know when it has been finally persisted on disk.
> Is the any other way that you can recommend to provide data integrity
> guarantee for nfs clients? Even if it requires some development, like
> placing the data in a short-term persistent space, this is something
> we will look at.

You don't need any mount option for NFS exports.  Only NFSv2 requires
synchronous writes, while NFSv3+ has the concept of unstable writes
that the client needs to explicitly commit using the COMMIT operation.
Both the NFSv2 synchronous and the NFSv3+ unstable semantics are
managed by NFSD.

(I'll speak here as an XFS and NFSD developer, and someone involved in
NFS protocol development, just in case you're doubting.)
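
In other words, an export along these lines (a hypothetical example; the
"sync" here is the NFS export option, which is the default, not the XFS
mount option) is already enough to give clients the commit guarantee:

  # /etc/exports
  /export/share1  *(rw,sync,no_subtree_check)

and the filesystem itself can then be mounted without the "sync" mount
option.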

> 
> Thanks,
> Alex.
---end quoted text---

