* Processes stuck in D state when accessing XFSv5 filesystem
@ 2017-04-06 18:41 David Shaw
  2017-04-07 14:56 ` Emmanuel Florac
  0 siblings, 1 reply; 7+ messages in thread
From: David Shaw @ 2017-04-06 18:41 UTC (permalink / raw)
  To: linux-xfs

Hello,

I'm having a problem with processes getting "stuck" when accessing an XFS (v5) filesystem.  When it happens, I start getting the "blocked for more than 120 seconds" error, and the process stays in that state until I reboot.  The kernel is 3.10.0-514.2.2 and the xfsprogs is 4.5.0-9 (both Centos 7.3).
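
For reference, one way to spot the stuck tasks between hung-task reports is a sketch like the following (run as root; the PID is taken from the report below):

  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'   # list tasks in uninterruptible sleep (D state)
  cat /proc/23300/stack                             # kernel stack of one stuck task (PID from the hung task report below)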

I have tried running xfs_repair on the filesystem in question, but no problems were reported as fixed, and the stuck processes happened again after the xfs_repair run.
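
For context, a typical xfs_repair pass looks something like this sketch (the device path is illustrative; the filesystem must be unmounted first):

  umount /dev/mapper/xfsvol       # device path is illustrative
  xfs_repair /dev/mapper/xfsvol   # add -n first for a report-only dry run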

Any thoughts or directions I could take with this?  One thing which may or may not be useful: while the current kernel and xfsprogs are as given above, the filesystem was originally created using kernel 3.10.0-327.13.1 and xfsprogs 3.2.2-2 (Centos 7.2).  It's an XFSv5 filesystem, and I know there were a number of problems with XFSv5 in Centos 7.2 that were resolved in 7.3.

Thanks very much for any advice,

David

2017-04-05 17:15:03.853 localhost kernel: [607696.887764] INFO: task serv:23300 blocked for more than 120 seconds.
2017-04-05 17:15:03.853 localhost kernel: [607696.887764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2017-04-05 17:15:03.853 localhost kernel: [607696.887765] serv         D 0000000000000000     0 23300  11642 0x00000000
2017-04-05 17:15:03.853 localhost kernel: [607696.887766]  ffff880856bff910 0000000000000086 ffff8807d93c1f60 ffff880856bfffd8
2017-04-05 17:15:03.853 localhost kernel: [607696.887768]  ffff880856bfffd8 ffff880856bfffd8 ffff8807d93c1f60 ffff880484aed3b0
2017-04-05 17:15:03.853 localhost kernel: [607696.887769]  7fffffffffffffff ffff8807d93c1f60 0000000000000002 0000000000000000
2017-04-05 17:15:03.853 localhost kernel: [607696.887770] Call Trace:
2017-04-05 17:15:03.853 localhost kernel: [607696.887772]  [<ffffffff8168b899>] schedule+0x29/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.887774]  [<ffffffff816892e9>] schedule_timeout+0x239/0x2d0
2017-04-05 17:15:03.853 localhost kernel: [607696.887778]  [<ffffffff812efba3>] ? blk_peek_request+0x83/0x280
2017-04-05 17:15:03.853 localhost kernel: [607696.887808]  [<ffffffffa03358a7>] ? xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887810]  [<ffffffff8168ab68>] __down_common+0x108/0x154
2017-04-05 17:15:03.853 localhost kernel: [607696.887821]  [<ffffffffa0343136>] ? _xfs_buf_find+0x176/0x340 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887823]  [<ffffffff8168abd1>] __down+0x1d/0x1f
2017-04-05 17:15:03.853 localhost kernel: [607696.887825]  [<ffffffff810b6581>] down+0x41/0x50
2017-04-05 17:15:03.853 localhost kernel: [607696.887835]  [<ffffffffa0342f2c>] xfs_buf_lock+0x3c/0xd0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887844]  [<ffffffffa0343136>] _xfs_buf_find+0x176/0x340 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887853]  [<ffffffffa034332a>] xfs_buf_get_map+0x2a/0x240 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887862]  [<ffffffffa0343ea0>] xfs_buf_read_map+0x30/0x160 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887875]  [<ffffffffa0373709>] xfs_trans_read_buf_map+0x199/0x400 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887884]  [<ffffffffa03222c4>] xfs_da_read_buf+0xd4/0x100 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887892]  [<ffffffffa0308879>] xfs_attr3_leaf_read+0x29/0x70 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887900]  [<ffffffffa0318028>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887906]  [<ffffffffa0305efc>] xfs_attr_leaf_get+0x4c/0x120 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887913]  [<ffffffffa030722c>] xfs_attr_get+0x12c/0x1d0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887925]  [<ffffffffa03628ad>] xfs_xattr_get+0x3d/0x80 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887927]  [<ffffffff8122349c>] generic_getxattr+0x4c/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.887928]  [<ffffffff812238e8>] vfs_getxattr+0x88/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.887930]  [<ffffffff81223a2b>] getxattr+0xab/0x1d0
2017-04-05 17:15:03.853 localhost kernel: [607696.887931]  [<ffffffff8120ecad>] ? putname+0x3d/0x60
2017-04-05 17:15:03.853 localhost kernel: [607696.887932]  [<ffffffff8120fe52>] ? user_path_at_empty+0x72/0xc0
2017-04-05 17:15:03.853 localhost kernel: [607696.887934]  [<ffffffff81224811>] SyS_lgetxattr+0x61/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.887936]  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
2017-04-05 17:15:03.853 localhost kernel: [607696.887937] INFO: task serv:23317 blocked for more than 120 seconds.
2017-04-05 17:15:03.853 localhost kernel: [607696.887938] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2017-04-05 17:15:03.853 localhost kernel: [607696.887939] serv         D ffff880867f07980     0 23317  11642 0x00000000
2017-04-05 17:15:03.853 localhost kernel: [607696.887940]  ffff8806082df9d0 0000000000000086 ffff88012b88de20 ffff8806082dffd8
2017-04-05 17:15:03.853 localhost kernel: [607696.887942]  ffff8806082dffd8 ffff8806082dffd8 ffff88012b88de20 ffff880484aed458
2017-04-05 17:15:03.853 localhost kernel: [607696.887943]  ffff880484aed460 7fffffffffffffff ffff88012b88de20 ffff880867f07980
2017-04-05 17:15:03.853 localhost kernel: [607696.887944] Call Trace:
2017-04-05 17:15:03.853 localhost kernel: [607696.887946]  [<ffffffff8168b899>] schedule+0x29/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.887948]  [<ffffffff816892e9>] schedule_timeout+0x239/0x2d0
2017-04-05 17:15:03.853 localhost kernel: [607696.887950]  [<ffffffff812f04d4>] ? blk_finish_plug+0x14/0x40
2017-04-05 17:15:03.853 localhost kernel: [607696.887961]  [<ffffffffa0342014>] ? _xfs_buf_ioapply+0x334/0x460 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887963]  [<ffffffff8168bc76>] wait_for_completion+0x116/0x170
2017-04-05 17:15:03.853 localhost kernel: [607696.887967]  [<ffffffff810c4fe0>] ? wake_up_state+0x20/0x20
2017-04-05 17:15:03.853 localhost kernel: [607696.887978]  [<ffffffffa0343e53>] ? _xfs_buf_read+0x23/0x40 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.887990]  [<ffffffffa0343d46>] xfs_buf_submit_wait+0xe6/0x1d0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888004]  [<ffffffffa0373709>] ? xfs_trans_read_buf_map+0x199/0x400 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888015]  [<ffffffffa0343e53>] _xfs_buf_read+0x23/0x40 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888027]  [<ffffffffa0343f68>] xfs_buf_read_map+0xf8/0x160 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888041]  [<ffffffffa0373709>] xfs_trans_read_buf_map+0x199/0x400 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888051]  [<ffffffffa03222c4>] xfs_da_read_buf+0xd4/0x100 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888060]  [<ffffffffa0308879>] xfs_attr3_leaf_read+0x29/0x70 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888070]  [<ffffffffa0318028>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888078]  [<ffffffffa0305efc>] xfs_attr_leaf_get+0x4c/0x120 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888087]  [<ffffffffa030722c>] xfs_attr_get+0x12c/0x1d0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888100]  [<ffffffffa03628ad>] xfs_xattr_get+0x3d/0x80 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888101]  [<ffffffff8122349c>] generic_getxattr+0x4c/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.888103]  [<ffffffff812238e8>] vfs_getxattr+0x88/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.888104]  [<ffffffff81223a2b>] getxattr+0xab/0x1d0
2017-04-05 17:15:03.853 localhost kernel: [607696.888105]  [<ffffffff8120ecad>] ? putname+0x3d/0x60
2017-04-05 17:15:03.853 localhost kernel: [607696.888107]  [<ffffffff8120fe52>] ? user_path_at_empty+0x72/0xc0
2017-04-05 17:15:03.853 localhost kernel: [607696.888108]  [<ffffffff81224811>] SyS_lgetxattr+0x61/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.888110]  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
2017-04-05 17:15:03.853 localhost kernel: [607696.888112] INFO: task serv:190686 blocked for more than 120 seconds.
2017-04-05 17:15:03.853 localhost kernel: [607696.888113] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2017-04-05 17:15:03.853 localhost kernel: [607696.888114] serv         D 0000000000000000     0 190686  11642 0x00000000
2017-04-05 17:15:03.853 localhost kernel: [607696.888115]  ffff8804177e3910 0000000000000086 ffff88020c9d8000 ffff8804177e3fd8
2017-04-05 17:15:03.853 localhost kernel: [607696.888117]  ffff8804177e3fd8 ffff8804177e3fd8 ffff88020c9d8000 ffff880484aed3b0
2017-04-05 17:15:03.853 localhost kernel: [607696.888118]  7fffffffffffffff ffff88020c9d8000 0000000000000002 0000000000000000
2017-04-05 17:15:03.853 localhost kernel: [607696.888119] Call Trace:
2017-04-05 17:15:03.853 localhost kernel: [607696.888121]  [<ffffffff8168b899>] schedule+0x29/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.888123]  [<ffffffff816892e9>] schedule_timeout+0x239/0x2d0
2017-04-05 17:15:03.853 localhost kernel: [607696.888134]  [<ffffffffa03358a7>] ? xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888136]  [<ffffffff8168ab68>] __down_common+0x108/0x154
2017-04-05 17:15:03.853 localhost kernel: [607696.888148]  [<ffffffffa0343136>] ? _xfs_buf_find+0x176/0x340 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888149]  [<ffffffff8168abd1>] __down+0x1d/0x1f
2017-04-05 17:15:03.853 localhost kernel: [607696.888151]  [<ffffffff810b6581>] down+0x41/0x50
2017-04-05 17:15:03.853 localhost kernel: [607696.888162]  [<ffffffffa0342f2c>] xfs_buf_lock+0x3c/0xd0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888174]  [<ffffffffa0343136>] _xfs_buf_find+0x176/0x340 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888185]  [<ffffffffa034332a>] xfs_buf_get_map+0x2a/0x240 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888196]  [<ffffffffa0343ea0>] xfs_buf_read_map+0x30/0x160 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888210]  [<ffffffffa0373709>] xfs_trans_read_buf_map+0x199/0x400 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888220]  [<ffffffffa03222c4>] xfs_da_read_buf+0xd4/0x100 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888230]  [<ffffffffa0308879>] xfs_attr3_leaf_read+0x29/0x70 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888239]  [<ffffffffa0318028>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888247]  [<ffffffffa0305efc>] xfs_attr_leaf_get+0x4c/0x120 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888256]  [<ffffffffa030722c>] xfs_attr_get+0x12c/0x1d0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888269]  [<ffffffffa03628ad>] xfs_xattr_get+0x3d/0x80 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888270]  [<ffffffff8122349c>] generic_getxattr+0x4c/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.888272]  [<ffffffff812238e8>] vfs_getxattr+0x88/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.888273]  [<ffffffff81223a2b>] getxattr+0xab/0x1d0
2017-04-05 17:15:03.853 localhost kernel: [607696.888274]  [<ffffffff8120ecad>] ? putname+0x3d/0x60
2017-04-05 17:15:03.853 localhost kernel: [607696.888276]  [<ffffffff8120fe52>] ? user_path_at_empty+0x72/0xc0
2017-04-05 17:15:03.853 localhost kernel: [607696.888277]  [<ffffffff81224811>] SyS_lgetxattr+0x61/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.888279]  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
2017-04-05 17:15:03.853 localhost kernel: [607696.888280] INFO: task serv:190687 blocked for more than 120 seconds.
2017-04-05 17:15:03.853 localhost kernel: [607696.888281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2017-04-05 17:15:03.853 localhost kernel: [607696.888281] serv         D 0000000000000000     0 190687  11642 0x00000000
2017-04-05 17:15:03.853 localhost kernel: [607696.888283]  ffff88048d82b910 0000000000000086 ffff88010a6a0fb0 ffff88048d82bfd8
2017-04-05 17:15:03.853 localhost kernel: [607696.888284]  ffff88048d82bfd8 ffff88048d82bfd8 ffff88010a6a0fb0 ffff880484aed3b0
2017-04-05 17:15:03.853 localhost kernel: [607696.888286]  7fffffffffffffff ffff88010a6a0fb0 0000000000000002 0000000000000000
2017-04-05 17:15:03.853 localhost kernel: [607696.888287] Call Trace:
2017-04-05 17:15:03.853 localhost kernel: [607696.888289]  [<ffffffff8168b899>] schedule+0x29/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.888291]  [<ffffffff816892e9>] schedule_timeout+0x239/0x2d0
2017-04-05 17:15:03.853 localhost kernel: [607696.888301]  [<ffffffffa03358a7>] ? xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888303]  [<ffffffff8168ab68>] __down_common+0x108/0x154
2017-04-05 17:15:03.853 localhost kernel: [607696.888315]  [<ffffffffa0343136>] ? _xfs_buf_find+0x176/0x340 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888317]  [<ffffffff8168abd1>] __down+0x1d/0x1f
2017-04-05 17:15:03.853 localhost kernel: [607696.888318]  [<ffffffff810b6581>] down+0x41/0x50
2017-04-05 17:15:03.853 localhost kernel: [607696.888330]  [<ffffffffa0342f2c>] xfs_buf_lock+0x3c/0xd0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888341]  [<ffffffffa0343136>] _xfs_buf_find+0x176/0x340 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888352]  [<ffffffffa034332a>] xfs_buf_get_map+0x2a/0x240 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888364]  [<ffffffffa0343ea0>] xfs_buf_read_map+0x30/0x160 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888378]  [<ffffffffa0373709>] xfs_trans_read_buf_map+0x199/0x400 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888388]  [<ffffffffa03222c4>] xfs_da_read_buf+0xd4/0x100 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888397]  [<ffffffffa0308879>] xfs_attr3_leaf_read+0x29/0x70 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888407]  [<ffffffffa0318028>] ? xfs_bmbt_get_all+0x18/0x20 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888415]  [<ffffffffa0305efc>] xfs_attr_leaf_get+0x4c/0x120 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888424]  [<ffffffffa030722c>] xfs_attr_get+0x12c/0x1d0 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888437]  [<ffffffffa03628ad>] xfs_xattr_get+0x3d/0x80 [xfs]
2017-04-05 17:15:03.853 localhost kernel: [607696.888438]  [<ffffffff8122349c>] generic_getxattr+0x4c/0x70
2017-04-05 17:15:03.853 localhost kernel: [607696.888439]  [<ffffffff812238e8>] vfs_getxattr+0x88/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.888441]  [<ffffffff81223a2b>] getxattr+0xab/0x1d0
2017-04-05 17:15:03.853 localhost kernel: [607696.888442]  [<ffffffff8120ecad>] ? putname+0x3d/0x60
2017-04-05 17:15:03.853 localhost kernel: [607696.888444]  [<ffffffff8120fe52>] ? user_path_at_empty+0x72/0xc0
2017-04-05 17:15:03.853 localhost kernel: [607696.888445]  [<ffffffff81224811>] SyS_lgetxattr+0x61/0xb0
2017-04-05 17:15:03.853 localhost kernel: [607696.888447]  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b



* Re: Processes stuck in D state when accessing XFSv5 filesystem
  2017-04-06 18:41 Processes stuck in D state when accessing XFSv5 filesystem David Shaw
@ 2017-04-07 14:56 ` Emmanuel Florac
  2017-04-07 16:56   ` David Shaw
  0 siblings, 1 reply; 7+ messages in thread
From: Emmanuel Florac @ 2017-04-07 14:56 UTC (permalink / raw)
  To: David Shaw; +Cc: linux-xfs

On Thu, 6 Apr 2017 14:41:45 -0400,
David Shaw <dshaw@jabberwocky.com> wrote:

> I'm having a problem with processes getting "stuck" when accessing an
> XFS (v5) filesystem.  When it happens, I start getting the "blocked
> for more than 120 seconds" error, and the process stays in that state
> until I reboot.  The kernel is 3.10.0-514.2.2 and the xfsprogs is
> 4.5.0-9 (both Centos 7.3).
> 

Could it be that your system is under high IO load?

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------



* Re: Processes stuck in D state when accessing XFSv5 filesystem
  2017-04-07 14:56 ` Emmanuel Florac
@ 2017-04-07 16:56   ` David Shaw
  2017-04-07 17:39     ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: David Shaw @ 2017-04-07 16:56 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: linux-xfs

On Apr 7, 2017, at 10:56 AM, Emmanuel Florac <eflorac@intellique.com> wrote:
> 
> On Thu, 6 Apr 2017 14:41:45 -0400,
> David Shaw <dshaw@jabberwocky.com> wrote:
> 
>> I'm having a problem with processes getting "stuck" when accessing an
>> XFS (v5) filesystem.  When it happens, I start getting the "blocked
>> for more than 120 seconds" error, and the process stays in that state
>> until I reboot.  The kernel is 3.10.0-514.2.2 and the xfsprogs is
>> 4.5.0-9 (both Centos 7.3).
>> 
> 
> Could it be that your system is under high IO load?

It's possible that the problem is instigated or made worse by high load, but once the processes enter D state, they stay there even when the system is idle.  They stay in D state until I reboot.

David



* Re: Processes stuck in D state when accessing XFSv5 filesystem
  2017-04-07 16:56   ` David Shaw
@ 2017-04-07 17:39     ` Brian Foster
  2017-04-08 11:26       ` David Shaw
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2017-04-07 17:39 UTC (permalink / raw)
  To: David Shaw; +Cc: Emmanuel Florac, linux-xfs

On Fri, Apr 07, 2017 at 12:56:48PM -0400, David Shaw wrote:
> On Apr 7, 2017, at 10:56 AM, Emmanuel Florac <eflorac@intellique.com> wrote:
> > 
> > On Thu, 6 Apr 2017 14:41:45 -0400,
> > David Shaw <dshaw@jabberwocky.com> wrote:
> > 
> >> I'm having a problem with processes getting "stuck" when accessing an
> >> XFS (v5) filesystem.  When it happens, I start getting the "blocked
> >> for more than 120 seconds" error, and the process stays in that state
> >> until I reboot.  The kernel is 3.10.0-514.2.2 and the xfsprogs is
> >> 4.5.0-9 (both Centos 7.3).
> >> 
> > 
> > Could it be that your system is under high IO load?
> 
> It's possible that the problem is instigated or made worse by high load, but once the processes enter D state, they stay there even when the system is idle.  They stay in D state until I reboot.
> 

There isn't enough information provided to suggest the filesystem is
locked up as opposed to waiting for (very slow) I/O, as suggested by
Emmanuel.

If the filesystem appears to be deadlocked, can you provide the complete
hung task output (echo w > /proc/sysrq-trigger) as well as any activity
that might be shown by tracepoints if enabled when in this state
(trace-cmd start -e xfs:*; cat /sys/kernel/debug/tracing/trace_pipe)?
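
Roughly, that capture might look like the following sketch (it assumes debugfs is mounted at /sys/kernel/debug and that trace-cmd is installed):

  echo w > /proc/sysrq-trigger                               # dump stacks of all blocked (D state) tasks to the kernel log
  dmesg > hung-tasks.txt                                     # save the hung task output
  trace-cmd start -e xfs:*                                   # enable all XFS tracepoints
  cat /sys/kernel/debug/tracing/trace_pipe > xfs-trace.txt   # collect tracepoint activity while the hang is in progress
  trace-cmd stop                                             # stop tracing once enough has been captured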

Brian

> David
> 


* Re: Processes stuck in D state when accessing XFSv5 filesystem
  2017-04-07 17:39     ` Brian Foster
@ 2017-04-08 11:26       ` David Shaw
  2017-04-10 15:23         ` Brian Foster
  0 siblings, 1 reply; 7+ messages in thread
From: David Shaw @ 2017-04-08 11:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: Emmanuel Florac, linux-xfs

On Apr 7, 2017, at 1:39 PM, Brian Foster <bfoster@redhat.com> wrote:
> 
> On Fri, Apr 07, 2017 at 12:56:48PM -0400, David Shaw wrote:
>> On Apr 7, 2017, at 10:56 AM, Emmanuel Florac <eflorac@intellique.com> wrote:
>>> 
> > > On Thu, 6 Apr 2017 14:41:45 -0400,
> > > David Shaw <dshaw@jabberwocky.com> wrote:
>>> 
>>>> I'm having a problem with processes getting "stuck" when accessing an
>>>> XFS (v5) filesystem.  When it happens, I start getting the "blocked
>>>> for more than 120 seconds" error, and the process stays in that state
>>>> until I reboot.  The kernel is 3.10.0-514.2.2 and the xfsprogs is
>>>> 4.5.0-9 (both Centos 7.3).
>>>> 
>>> 
>>> Could it be that your system is under high IO load?
>> 
>> It's possible that the problem is instigated or made worse by high load, but once the processes enter D state, they stay there even when the system is idle.  They stay in D state until I reboot.
>> 
> 
> There isn't enough information provided to suggest the filesystem is
> locked up as opposed to waiting for (very slow) I/O, as suggested by
> Emmanuel.

Ah, I didn't understand, thanks.

> If the filesystem appears to be deadlocked, can you provide the complete
> hung task output (echo w > /proc/sysrq-trigger) as well as any activity
> that might be shown by tracepoints if enabled when in this state
> (trace-cmd start -e xfs:*; cat /sys/kernel/debug/tracing/trace_pipe)?

I have the hung task output from a recent time this happened (http://www.jabberwocky.com/xfs/blocked.txt).  In it, "servxfs" is the process that reads from the XFS filesystem.  It's the process feeding a fuse filesystem.  "mmon" and "smbd" are both trying to access the fuse filesystem.  I suspect that they're blocking because the servxfs threads are blocking and thus not fulfilling the fuse requests.

I will get the xfs traces the next time it happens.  It seems to be happening 2-3 times a week, but frustratingly, I can't make it happen on demand.  Is there something I should look for in particular in the trace output, or some amount of time to capture it for?

Thanks again for your help,

David



* Re: Processes stuck in D state when accessing XFSv5 filesystem
  2017-04-08 11:26       ` David Shaw
@ 2017-04-10 15:23         ` Brian Foster
  2017-05-10 20:56           ` David Shaw
  0 siblings, 1 reply; 7+ messages in thread
From: Brian Foster @ 2017-04-10 15:23 UTC (permalink / raw)
  To: David Shaw; +Cc: Emmanuel Florac, linux-xfs

On Sat, Apr 08, 2017 at 07:26:14AM -0400, David Shaw wrote:
> On Apr 7, 2017, at 1:39 PM, Brian Foster <bfoster@redhat.com> wrote:
> > 
> > On Fri, Apr 07, 2017 at 12:56:48PM -0400, David Shaw wrote:
> >> On Apr 7, 2017, at 10:56 AM, Emmanuel Florac <eflorac@intellique.com> wrote:
> >>> 
> > > On Thu, 6 Apr 2017 14:41:45 -0400,
> > > David Shaw <dshaw@jabberwocky.com> wrote:
> >>> 
> >>>> I'm having a problem with processes getting "stuck" when accessing an
> >>>> XFS (v5) filesystem.  When it happens, I start getting the "blocked
> >>>> for more than 120 seconds" error, and the process stays in that state
> >>>> until I reboot.  The kernel is 3.10.0-514.2.2 and the xfsprogs is
> >>>> 4.5.0-9 (both Centos 7.3).
> >>>> 
> >>> 
> >>> Could it be that your system is under high IO load?
> >> 
> >> It's possible that the problem is instigated or made worse by high load, but once the processes enter D state, they stay there even when the system is idle.  They stay in D state until I reboot.
> >> 
> > 
> > There isn't enough information provided to suggest the filesystem is
> > locked up as opposed to waiting for (very slow) I/O, as suggested by
> > Emmanuel.
> 
> Ah, I didn't understand, thanks.
> 
> > If the filesystem appears to be deadlocked, can you provide the complete
> > hung task output (echo w > /proc/sysrq-trigger) as well as any activity
> > that might be shown by tracepoints if enabled when in this state
> > (trace-cmd start -e xfs:*; cat /sys/kernel/debug/tracing/trace_pipe)?
> 
> I have the hung task output from a recent time this happened (http://www.jabberwocky.com/xfs/blocked.txt).  In it, "servxfs" is the process that reads from the XFS filesystem.  It's the process feeding a fuse filesystem.  "mmon" and "smbd" are both trying to access the fuse filesystem.  I suspect that they're blocking because the servxfs threads are blocking and thus not fulfilling the fuse requests.
> 

Interesting, nothing obvious stands out to me. It looks like you have smbd
waiting on a pread over fuse and mmon waiting on a stat. Presumably the
smbd pread corresponds to the blocked servxfs pread and the stat to one
of the servxfs getxattr calls.

Any idea where the other servxfs getxattr comes in?

> I will get the xfs traces the next time it happens.  It seems to be happening 2-3 times a week, but frustratingly, I can't make it happen on demand.  Is there something I should look for in particular in the trace output, or some amount of time to capture it for?
> 

Can you elaborate on the resulting behavior? Are these same processes
always involved? Can you identify whether they are attempting to access
the same file(s) or not? Also, does the underlying filesystem continue
to function outside of these processes seemingly all blocked on reads?
E.g., can you read another file from the XFS fs or is all I/O blocked?
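
For example, a quick check of the underlying filesystem while the hang is in progress might be something like this (mount point and file name are illustrative):

  ls -l /mnt/xfs/                                                    # directory listing on the XFS mount itself
  dd if=/mnt/xfs/otherfile of=/dev/null bs=1M count=1 iflag=direct   # O_DIRECT read of an unrelated file, bypassing the page cache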

Because there are multiple layers involved here, presumably with custom
code in between (e.g., your fuse userspace), this might be easier to
reason about if you can dig more into what's blocked in the upper layers
to describe precisely what high level requests are active at the XFS
level.

Brian

> Thanks again for your help,
> 
> David
> 


* Re: Processes stuck in D state when accessing XFSv5 filesystem
  2017-04-10 15:23         ` Brian Foster
@ 2017-05-10 20:56           ` David Shaw
  0 siblings, 0 replies; 7+ messages in thread
From: David Shaw @ 2017-05-10 20:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: Emmanuel Florac, linux-xfs

On Apr 10, 2017, at 11:23 AM, Brian Foster <bfoster@redhat.com> wrote:
> 
> Interesting, nothing obvious stands out to me. It looks like you have smbd
> waiting on a pread over fuse and mmon waiting on a stat. Presumably the
> smbd pread corresponds to the blocked servxfs pread and the stat to one
> of the servxfs getxattr calls.
> 
> Any idea where the other servxfs getxattr comes in?

servxfs stores some internal metadata in xattrs, so it is normal for it to make occasional xattr calls.
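
Illustratively, the kind of lookup that ends up in the SyS_lgetxattr path from the traces is a no-follow xattr read like this sketch (the attribute name is made up):

  getfattr -h -n user.servxfs.meta /mnt/xfs/some/file   # -h: don't follow symlinks, which maps to lgetxattr()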

>> I will get the xfs traces the next time it happens.  It seems to be happening 2-3 times a week, but frustratingly, I can't make it happen on demand. Is there something I should look for in particular in the trace output, or some amount of time to capture it for?
>> 
> 
> Can you elaborate on the resulting behavior? Are these same processes
> always involved? Can you identify whether they are attempting to access
> the same file(s) or not? Also, does the underlying filesystem continue
> to function outside of these processes seemingly all blocked on reads?
> E.g., can you read another file from the XFS fs or is all I/O blocked?
> 
> Because there are multiple layers involved here, presumably with custom
> code in between (e.g., your fuse userspace), this might be easier to
> reason about if you can dig more into what's blocked in the upper layers
> to describe precisely what high level requests are active at the XFS
> level.

I agree that removing the fuse userspace code from the equation can simplify things.  There are cases that don't go through fuse - for example, a postgres daemon running directly on the XFS filesystem has occasionally hit the same issue (meaning it goes into D state and stays there "forever", or at least for the 30 minutes until the machine is rebooted).

Unfortunately, I'm now having a hard time reproducing the problem.  Where it used to happen 2-3 times a week, it now isn't happening at all.  So on the one hand, great, but on the other hand, it makes the issue hard to debug.

I will update as soon as I have something new.

David


