* [Bug 214767] New: xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20  7:36 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

            Bug ID: 214767
           Summary: xfs seems to hang due to race condition? maybe related
                    to (gratuitous) thaw.
           Product: File System
           Version: 2.5
    Kernel Version: 5.10.70
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: XFS
          Assignee: filesystem_xfs@kernel-bugs.kernel.org
          Reporter: ct@flyingcircus.io
        Regression: No

We have been getting blocked tasks somewhat frequently (around one per day
across about 500 VMs), with tracebacks always similar to this:

First we get:

[656898.010322] INFO: task kworker/u2:1:458736 blocked for more than 122
seconds.
[656898.011988]       Not tainted 5.10.70 #1-NixOS
[656898.012981] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[656898.014679] task:kworker/u2:1    state:D stack:    0 pid:458736 ppid:     2
flags:0x00004000
[656898.016530] Workqueue: writeback wb_workfn (flush-253:0)
[656898.017698] Call Trace:
[656898.018310]  __schedule+0x271/0x860
[656898.019208]  schedule+0x46/0xb0
[656898.019944]  rwsem_down_read_slowpath+0x169/0x490
[656898.021008]  ? mempool_alloc+0x62/0x170
[656898.022032]  xfs_map_blocks+0xb9/0x400 [xfs]
[656898.022998]  iomap_do_writepage+0x163/0x850
[656898.023587]  ? __mod_memcg_lruvec_state+0x21/0xe0
[656898.024044]  write_cache_pages+0x186/0x3d0
[656898.024463]  ? iomap_migrate_page+0xc0/0xc0
[656898.024882]  ? submit_bio_noacct+0x3a9/0x420
[656898.025303]  iomap_writepages+0x1c/0x40
[656898.025712]  xfs_vm_writepages+0x64/0x90 [xfs]
[656898.026146]  do_writepages+0x34/0xc0
[656898.026513]  __writeback_single_inode+0x39/0x2a0
[656898.026969]  writeback_sb_inodes+0x200/0x470
[656898.027388]  __writeback_inodes_wb+0x4c/0xe0
[656898.027805]  wb_writeback+0x1d8/0x290
[656898.028161]  wb_workfn+0x29b/0x4d0
[656898.028526]  ? __switch_to_asm+0x42/0x70
[656898.028915]  ? __switch_to+0x7b/0x3e0
[656898.029285]  process_one_work+0x1df/0x370
[656898.029696]  worker_thread+0x50/0x400
[656898.030054]  ? process_one_work+0x370/0x370
[656898.030475]  kthread+0x11b/0x140
[656898.030798]  ? __kthread_bind_mask+0x60/0x60
[656898.031214]  ret_from_fork+0x22/0x30
[656898.031592] INFO: task nix-daemon:459204 blocked for more than 122 seconds.
[656898.032259]       Not tainted 5.10.70 #1-NixOS
[656898.032695] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[656898.033451] task:nix-daemon      state:D stack:    0 pid:459204 ppid:  1067
flags:0x00000000
[656898.034261] Call Trace:
[656898.034538]  __schedule+0x271/0x860
[656898.034881]  schedule+0x46/0xb0
[656898.035226]  xfs_log_commit_cil+0x6a4/0x800 [xfs]
[656898.035709]  ? wake_up_q+0xa0/0xa0
[656898.036073]  __xfs_trans_commit+0x9d/0x310 [xfs]
[656898.036561]  xfs_setattr_nonsize+0x342/0x520 [xfs]
[656898.037027]  notify_change+0x348/0x4c0
[656898.037407]  ? chmod_common+0xa1/0x150
[656898.037779]  chmod_common+0xa1/0x150
[656898.038139]  do_fchmodat+0x5a/0xb0
[656898.038491]  __x64_sys_chmod+0x17/0x20
[656898.038860]  do_syscall_64+0x33/0x40
[656898.039221]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[656898.039722] RIP: 0033:0x7f3d9b3a3b77
[656898.040072] RSP: 002b:00007ffc388e2b18 EFLAGS: 00000206 ORIG_RAX:
000000000000005a
[656898.040797] RAX: ffffffffffffffda RBX: 00007ffc388e2e10 RCX:
00007f3d9b3a3b77
[656898.041542] RDX: 0000000000008049 RSI: 0000000000008124 RDI:
00000000016a2400
[656898.042222] RBP: 00007ffc388e32a0 R08: 00007ffc388e4628 R09:
0000000000000000
[656898.042913] R10: 0000000000000000 R11: 0000000000000206 R12:
00007ffc388e4620
[656898.043604] R13: 0000000000000000 R14: 00007ffc388e2df0 R15:
00007ffc388e2e10


and then a while later we see this:

[657020.886633] INFO: task qemu-ga:750 blocked for more than 122 seconds.
[657020.888346]       Not tainted 5.10.70 #1-NixOS
[657020.889482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[657020.891072] task:qemu-ga         state:D stack:    0 pid:  750 ppid:     1
flags:0x00000080
[657020.891906] Call Trace:
[657020.892188]  __schedule+0x271/0x860
[657020.892547]  ? kvm_sched_clock_read+0xd/0x20
[657020.892983]  schedule+0x46/0xb0
[657020.893319]  rwsem_down_write_slowpath+0x218/0x480
[657020.893802]  thaw_super+0x12/0x20
[657020.894130]  __x64_sys_ioctl+0x62/0xb0
[657020.894491]  do_syscall_64+0x33/0x40
[657020.894860]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[657020.895396] RIP: 0033:0x7f69a55f2b17
[657020.895762] RSP: 002b:00007ffe5687b548 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[657020.896470] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007f69a55f2b17
[657020.897144] RDX: 0000000000000000 RSI: 00000000c0045878 RDI:
0000000000000006
[657020.897815] RBP: 0000000000000000 R08: 00007f69a5477c08 R09:
0000000000000000
[657020.898479] R10: 0000000000000000 R11: 0000000000000246 R12:
000055ef74a15190
[657020.899155] R13: 00007ffe5687b5c0 R14: 00007ffe5687b560 R15:
0000000000000006
[657020.899850] INFO: task kworker/u2:1:458736 blocked for more than 245
seconds.
[657020.900513]       Not tainted 5.10.70 #1-NixOS
[657020.900942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[657020.901672] task:kworker/u2:1    state:D stack:    0 pid:458736 ppid:     2
flags:0x00004000
[657020.902469] Workqueue: writeback wb_workfn (flush-253:0)
[657020.902976] Call Trace:
[657020.903220]  __schedule+0x271/0x860
[657020.903556]  schedule+0x46/0xb0
[657020.903870]  rwsem_down_read_slowpath+0x169/0x490
[657020.904321]  ? mempool_alloc+0x62/0x170
[657020.904996]  xfs_map_blocks+0xb9/0x400 [xfs]
[657020.905412]  iomap_do_writepage+0x163/0x850
[657020.905823]  ? __mod_memcg_lruvec_state+0x21/0xe0
[657020.906269]  write_cache_pages+0x186/0x3d0
[657020.906664]  ? iomap_migrate_page+0xc0/0xc0
[657020.907079]  ? submit_bio_noacct+0x3a9/0x420
[657020.907487]  iomap_writepages+0x1c/0x40
[657020.907942]  xfs_vm_writepages+0x64/0x90 [xfs]
[657020.908376]  do_writepages+0x34/0xc0
[657020.908760]  __writeback_single_inode+0x39/0x2a0
[657020.909222]  writeback_sb_inodes+0x200/0x470
[657020.909635]  __writeback_inodes_wb+0x4c/0xe0
[657020.910058]  wb_writeback+0x1d8/0x290
[657020.910410]  wb_workfn+0x29b/0x4d0
[657020.910756]  ? __switch_to_asm+0x42/0x70
[657020.911135]  ? __switch_to+0x7b/0x3e0
[657020.911489]  process_one_work+0x1df/0x370
[657020.911912]  worker_thread+0x50/0x400
[657020.912264]  ? process_one_work+0x370/0x370
[657020.912667]  kthread+0x11b/0x140
[657020.913008]  ? __kthread_bind_mask+0x60/0x60
[657020.913413]  ret_from_fork+0x22/0x30
[657020.913791] INFO: task nix-daemon:459204 blocked for more than 245 seconds.
[657020.914440]       Not tainted 5.10.70 #1-NixOS
[657020.914879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[657020.915612] task:nix-daemon      state:D stack:    0 pid:459204 ppid:  1067
flags:0x00000000
[657020.916399] Call Trace:
[657020.916654]  __schedule+0x271/0x860
[657020.917006]  schedule+0x46/0xb0
[657020.917343]  xfs_log_commit_cil+0x6a4/0x800 [xfs]
[657020.917806]  ? wake_up_q+0xa0/0xa0
[657020.918162]  __xfs_trans_commit+0x9d/0x310 [xfs]
[657020.918633]  xfs_setattr_nonsize+0x342/0x520 [xfs]
[657020.919101]  notify_change+0x348/0x4c0
[657020.919461]  ? chmod_common+0xa1/0x150
[657020.919831]  chmod_common+0xa1/0x150
[657020.920180]  do_fchmodat+0x5a/0xb0
[657020.920509]  __x64_sys_chmod+0x17/0x20
[657020.920878]  do_syscall_64+0x33/0x40
[657020.921222]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[657020.921703] RIP: 0033:0x7f3d9b3a3b77
[657020.922055] RSP: 002b:00007ffc388e2b18 EFLAGS: 00000206 ORIG_RAX:
000000000000005a
[657020.922772] RAX: ffffffffffffffda RBX: 00007ffc388e2e10 RCX:
00007f3d9b3a3b77
[657020.923435] RDX: 0000000000008049 RSI: 0000000000008124 RDI:
00000000016a2400
[657020.924115] RBP: 00007ffc388e32a0 R08: 00007ffc388e4628 R09:
0000000000000000
[657020.924787] R10: 0000000000000000 R11: 0000000000000206 R12:
00007ffc388e4620
[657020.925450] R13: 0000000000000000 R14: 00007ffc388e2df0 R15:
00007ffc388e2e10


IO is subsequently stalled on ALL disks (we have vda/vdb/vdc with two of them
running XFS and one running swap).

I am pretty sure that this is not primarily an issue in the hypervisor (Qemu
4.1) or storage (Ceph Jewel) because I can warm reset the virtual machine using
"system-reset" and the machine will boot properly.

This could be related to https://bugzilla.kernel.org/show_bug.cgi?id=207053 but
it doesn't look quite right.

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20  7:39 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #1 from Christian Theune (ct@flyingcircus.io) ---
Also note that I'm not sure it's related to the thaw at all: the stuck-thaw
report comes exactly 122 seconds after the first blocked-task report, which
IMHO indicates that it's a follow-on error, not a cause.

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20 16:16 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #2 from Christian Theune (ct@flyingcircus.io) ---
Here's another excerpt that I dug out from a different VM with the same issue.
No thawing was happening around this time, AFAICT:

[848865.353541] INFO: task nix-daemon:1855245 blocked for more than 122
seconds.
[848865.355625]       Not tainted 5.10.70 #1-NixOS
[848865.356912] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[848865.359117] task:nix-daemon      state:D stack:    0 pid:1855245 ppid: 
2165 flags:0x00000000
[848865.360999] Call Trace:
[848865.361480]  __schedule+0x271/0x860
[848865.362069]  schedule+0x46/0xb0
[848865.362657]  xfs_log_commit_cil+0x6a4/0x800 [xfs]
[848865.363436]  ? wake_up_q+0xa0/0xa0
[848865.364026]  __xfs_trans_commit+0x9d/0x310 [xfs]
[848865.364818]  xfs_create+0x472/0x560 [xfs]
[848865.365517]  xfs_generic_create+0x247/0x320 [xfs]
[848865.366310]  ? xfs_lookup+0x55/0x100 [xfs]
[848865.366984]  path_openat+0xdd7/0x1070
[848865.367617]  do_filp_open+0x88/0x130
[848865.368199]  ? getname_flags.part.0+0x29/0x1a0
[848865.368925]  do_sys_openat2+0x97/0x150
[848865.369557]  __x64_sys_openat+0x54/0x90
[848865.370176]  do_syscall_64+0x33/0x40
[848865.370787]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[848865.371619] RIP: 0033:0x7f48b24c0ea8
[848865.372196] RSP: 002b:00007ffed91cee30 EFLAGS: 00000293 ORIG_RAX:
0000000000000101
[848865.373398] RAX: ffffffffffffffda RBX: 00000000000800c1 RCX:
00007f48b24c0ea8
[848865.374530] RDX: 00000000000800c1 RSI: 00000000015c4750 RDI:
00000000ffffff9c
[848865.375661] RBP: 00000000015c4750 R08: 0000000000000000 R09:
0000000000000003
[848865.376790] R10: 00000000000001b6 R11: 0000000000000293 R12:
00007ffed91cf220
[848865.377920] R13: 00007ffed91ceee0 R14: 00007ffed91ceed0 R15:
00007ffed91d0c50

and I can see it not making any progress:

[848988.235884] INFO: task nix-daemon:1855245 blocked for more than 245
seconds.
[848988.238567]       Not tainted 5.10.70 #1-NixOS
[848988.240254] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[848988.242358] task:nix-daemon      state:D stack:    0 pid:1855245 ppid: 
2165 flags:0x00000000
[848988.243756] Call Trace:
[848988.244183]  __schedule+0x271/0x860
[848988.244780]  schedule+0x46/0xb0
[848988.245366]  xfs_log_commit_cil+0x6a4/0x800 [xfs]
[848988.246153]  ? wake_up_q+0xa0/0xa0
[848988.246775]  __xfs_trans_commit+0x9d/0x310 [xfs]
[848988.247572]  xfs_create+0x472/0x560 [xfs]
[848988.248269]  xfs_generic_create+0x247/0x320 [xfs]
[848988.249084]  ? xfs_lookup+0x55/0x100 [xfs]
[848988.249781]  path_openat+0xdd7/0x1070
[848988.250388]  do_filp_open+0x88/0x130
[848988.251000]  ? getname_flags.part.0+0x29/0x1a0
[848988.251746]  do_sys_openat2+0x97/0x150
[848988.252366]  __x64_sys_openat+0x54/0x90
[848988.253015]  do_syscall_64+0x33/0x40
[848988.253610]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[848988.254445] RIP: 0033:0x7f48b24c0ea8
[848988.255071] RSP: 002b:00007ffed91cee30 EFLAGS: 00000293 ORIG_RAX:
0000000000000101
[848988.256292] RAX: ffffffffffffffda RBX: 00000000000800c1 RCX:
00007f48b24c0ea8
[848988.257460] RDX: 00000000000800c1 RSI: 00000000015c4750 RDI:
00000000ffffff9c
[848988.258622] RBP: 00000000015c4750 R08: 0000000000000000 R09:
0000000000000003
[848988.259777] R10: 00000000000001b6 R11: 0000000000000293 R12:
00007ffed91cf220
[848988.260933] R13: 00007ffed91ceee0 R14: 00007ffed91ceed0 R15:
00007ffed91d0c50

Both stack traces are identical in every regard, so that might be useful info.

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20 16:31 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #3 from Christian Theune (ct@flyingcircus.io) ---
I have another machine that managed to break free after 20 minutes; it
performed around three thaws in that period:

Oct 20 18:05:22 pixometerstag04 kernel: INFO: task nix-daemon:1387736 blocked
for more than 1228 seconds.
Oct 20 18:05:22 pixometerstag04 kernel:       Not tainted 5.10.70 #1-NixOS
Oct 20 18:05:22 pixometerstag04 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 20 18:05:22 pixometerstag04 kernel: task:nix-daemon      state:D stack:   
0 pid:1387736 ppid:  1356 flags:0x00000000
Oct 20 18:05:22 pixometerstag04 kernel: Call Trace:
Oct 20 18:05:22 pixometerstag04 kernel:  __schedule+0x271/0x860
Oct 20 18:05:22 pixometerstag04 kernel:  schedule+0x46/0xb0
Oct 20 18:05:22 pixometerstag04 kernel:  xfs_log_commit_cil+0x6a4/0x800 [xfs]
Oct 20 18:05:22 pixometerstag04 kernel:  ? wake_up_q+0xa0/0xa0
Oct 20 18:05:22 pixometerstag04 kernel:  __xfs_trans_commit+0x9d/0x310 [xfs]
Oct 20 18:05:22 pixometerstag04 kernel:  xfs_create+0x472/0x560 [xfs]
Oct 20 18:05:22 pixometerstag04 kernel:  xfs_generic_create+0x247/0x320 [xfs]
Oct 20 18:05:22 pixometerstag04 kernel:  ? xfs_lookup+0x55/0x100 [xfs]
Oct 20 18:05:22 pixometerstag04 kernel:  path_openat+0xdd7/0x1070
Oct 20 18:05:22 pixometerstag04 kernel:  do_filp_open+0x88/0x130
Oct 20 18:05:22 pixometerstag04 kernel:  ? getname_flags.part.0+0x29/0x1a0
Oct 20 18:05:22 pixometerstag04 kernel:  do_sys_openat2+0x97/0x150
Oct 20 18:05:22 pixometerstag04 kernel:  __x64_sys_openat+0x54/0x90
Oct 20 18:05:22 pixometerstag04 kernel:  do_syscall_64+0x33/0x40
Oct 20 18:05:22 pixometerstag04 kernel: 
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct 20 18:05:22 pixometerstag04 kernel: RIP: 0033:0x7f3e3114dea8
Oct 20 18:05:22 pixometerstag04 kernel: RSP: 002b:00007ffd92f7df10 EFLAGS:
00000293 ORIG_RAX: 0000000000000101
Oct 20 18:05:22 pixometerstag04 kernel: RAX: ffffffffffffffda RBX:
00000000000800c1 RCX: 00007f3e3114dea8
Oct 20 18:05:22 pixometerstag04 kernel: RDX: 00000000000800c1 RSI:
000000000195cb50 RDI: 00000000ffffff9c
Oct 20 18:05:22 pixometerstag04 kernel: RBP: 000000000195cb50 R08:
0000000000000000 R09: 0000000000000003
Oct 20 18:05:22 pixometerstag04 kernel: R10: 00000000000001b6 R11:
0000000000000293 R12: 00007ffd92f7e300
Oct 20 18:05:22 pixometerstag04 kernel: R13: 00007ffd92f7dfc0 R14:
00007ffd92f7dfb0 R15: 00007ffd92f7fd30

However, the first thaw came between the 122 s and 245 s hang reports, and
the last thaw came after the machine broke free.

Interestingly, the machine also logged a variety of things that kept working
while the hangs were being reported. Apparently PostgreSQL was still writing
to disk in that period, so I'd guess the hang only affected the /tmp
filesystem and not /.

This is square in the middle of the reported blocked period:

Oct 20 17:52:00 pixometerstag04 postgres[1008]: user=,db= LOG:  checkpoint
complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=0.004 s, sync=0.001 s, total=0.018 s; sync fi>

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20 18:29 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #4 from Christian Theune (ct@flyingcircus.io) ---
Another thing I noticed: it always hangs the nix-daemon (nixos.org), which
does some dance around managing an overlay read-only mount that gets
remounted rw for a short period while it changes things in there (a sketch
follows the mount listing below). Here is what it looks like normally:

/dev/vda1 on / type xfs
(rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/dev/vda1 on /nix/store type xfs
(ro,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
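
As far as I understand it, the rw window is just a remount cycle on that
second mount; a rough sketch (the exact sequencing inside the nix-daemon is
an assumption on my part):

    mount -o remount,rw /nix/store    # open a short read-write window
    # ... nix-daemon adds or changes store paths here ...
    mount -o remount,ro /nix/store    # flip the store back to read-only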

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20 18:37 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #5 from Christian Theune (ct@flyingcircus.io) ---
I've also raised this with the NixOS folks, in case they can shed some light
on the dance that the nix-daemon performs here:
https://github.com/NixOS/nixpkgs/issues/142394

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20 19:22 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

Eric Sandeen (sandeen@sandeen.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sandeen@sandeen.net

--- Comment #6 from Eric Sandeen (sandeen@sandeen.net) ---
You might want to do an "echo w > /proc/sysrq-trigger" when this happens, to
get all blocked tasks.

The thaw seems a little suspicious; what all is going on there with freezing
and thawing?
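
For anyone following along, the full sequence is something like this (the
sysrq-w output lands in the kernel log):

    echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
    echo w > /proc/sysrq-trigger       # dump stacks of all blocked (D-state) tasks
    dmesg                              # the traces show up in the kernel log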

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-20 19:26 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #7 from Christian Theune (ct@flyingcircus.io) ---
Thanks, I'll try to do something like that, likely by using a QEMU monitor
command to send in the trigger via the virtual keyboard the next time I catch
this happening.
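
Something like the following HMP command should inject that key combination
from the host (the monitor socket path is hypothetical):

    echo 'sendkey alt-sysrq-w' | socat - UNIX-CONNECT:/run/qemu/vm.monitor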

The freezing/thawing is part of our image backup machinery, and over the
years we took to regularly calling 'thaw' gratuitously, because once in a
blue moon a VM can get stuck indefinitely if an initial thaw doesn't go
through (see the sketch below). This has been reliable for at least 2-3 years
now, and I'm not sure whether this will turn out to be a red herring or not.
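
A reduced sketch of that cycle, with a hypothetical mount point and snapshot
step:

    xfs_freeze -f /srv/data          # freeze: block new modifications
    create-snapshot /dev/vdb         # hypothetical image/snapshot step
    xfs_freeze -u /srv/data          # thaw
    sleep 60
    xfs_freeze -u /srv/data || true  # gratuitous extra thaw; fails harmlessly
                                     # if the filesystem is already thawed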

* Re: [Bug 214767] New: xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: Dave Chinner @ 2021-10-20 22:16 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Wed, Oct 20, 2021 at 07:36:07AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> [656898.034261] Call Trace:
> [656898.034538]  __schedule+0x271/0x860
> [656898.034881]  schedule+0x46/0xb0
> [656898.035226]  xfs_log_commit_cil+0x6a4/0x800 [xfs]

This looks like it is stuck on the hard throttle. Which means this
is the likely fix:

19f4e7cc8197 xfs: Fix CIL throttle hang when CIL space used going backwards

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-21  4:21 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #9 from Christian Theune (ct@flyingcircus.io) ---
Thanks Dave!

From what I can tell this fix was merged into 5.14 with a largish bunch of
other changes (via 9f7b640f001f). I checked that the fix you mentioned applies
cleanly to the current 5.10 codebase (it does) but I'm not sure whether we
missed some more backports for 5.10.

I wonder whether "we" (I don't think I'm expert enough to do this reliably)
need to check for other fixes that didn't make it back to 5.10. Do you have a
recommendation for how, or whom, to approach?

As I'm not doing this often, I would contact GKH and ask for a backport of
the patch you mentioned (given that you consider it semantically appropriate
for 5.10), but maybe you have better advice.
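
For testing the candidate fix locally, something like this should work (the
remote name is my own choice; the stable tree URL and branch are the usual
ones):

    git remote add stable \
        https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    git fetch stable linux-5.10.y
    # check whether the CIL fix already landed in the stable branch
    git log --oneline stable/linux-5.10.y -- fs/xfs/xfs_log_cil.c
    # otherwise, cherry-pick it onto a local 5.10 branch for testing
    git cherry-pick 19f4e7cc8197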

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-21 13:17 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #10 from Christian Theune (ct@flyingcircus.io) ---
Alright,

I managed to catch a machine that logged blocked tasks for only the /tmp
partition. I logged in, ran a 'sync', and issued the 'w' sysrq. I'm attaching
the log: at first you see a number of blocked tasks, then a while later I log
in and run the sync, and you can see the sync getting stuck as well, along
with a large number of workers.

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-21 13:19 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #11 from Christian Theune (ct@flyingcircus.io) ---
Going through the log I can see that aside from 'sync' and the workers there is
a tar process that is stuck since 13:20 and not making any progress. I can also
not see any IO on the block device.
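
For completeness, this is roughly how the device-level idleness can be
confirmed (standard tools; the device name is from this VM's layout):

    iostat -x 1 vdc                 # per-device utilization, 1 s interval (sysstat)
    grep ' vdc ' /proc/diskstats    # raw kernel counters; static values mean no IO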

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-21 13:20 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #12 from Christian Theune (ct@flyingcircus.io) ---
Created attachment 299295
  --> https://bugzilla.kernel.org/attachment.cgi?id=299295&action=edit
log of a vm with only vdc1 (/tmp) stuck

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-29  2:21 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

Pedram Fard (pedram.fard@appian.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pedram.fard@appian.com

--- Comment #13 from Pedram Fard (pedram.fard@appian.com) ---
We have run into a similar issue. We use xfs_freeze to take backup snapshots
of our filesystems, and occasionally the xfs_freeze/thaw processes turn into
zombies in the process.
We used 'echo w > /proc/sysrq-trigger' to get the stack traces:

[736826.666940] task:kswapd0         state:D stack:    0 pid:  576 ppid:     2
flags:0x00004000
[736826.674439] Call Trace:
[736826.677686]  __schedule+0x1f9/0x660
[736826.681593]  schedule+0x46/0xb0
[736826.685141]  percpu_rwsem_wait+0xa9/0x180
[736826.688867]  ? percpu_free_rwsem+0x30/0x30
[736826.692764]  __percpu_down_read+0x49/0x60
[736826.696549]  xfs_trans_alloc+0x15d/0x170
[736826.700367]  xfs_free_eofblocks+0x130/0x1e0
[736826.807546]  xfs_fs_destroy_inode+0xa8/0x1b0
[736826.811610]  destroy_inode+0x3b/0x70
[736826.815150]  dispose_list+0x48/0x60
[736826.818796]  prune_icache_sb+0x54/0x80
[736826.822476]  super_cache_scan+0x161/0x1e0
[736826.826341]  do_shrink_slab+0x145/0x240
[736826.830294]  shrink_slab_memcg+0xcd/0x1e0
[736826.834371]  shrink_node_memcgs+0x186/0x1c0
[736826.838491]  shrink_node+0x14f/0x570
[736826.842088]  balance_pgdat+0x232/0x510
[736826.845820]  kswapd+0xe2/0x170
[736826.849154]  ? balance_pgdat+0x510/0x510
[736826.852944]  kthread+0x11b/0x140
[736826.856326]  ? __kthread_bind_mask+0x60/0x60
[736826.860313]  ret_from_fork+0x22/0x30

[736826.863839] task:systemd-logind  state:D stack:    0 pid: 2073 ppid:     1
flags:0x00000084
[736826.871453] Call Trace:
[736826.874898]  __schedule+0x1f9/0x660
[736826.878898]  ? kvm_sched_clock_read+0xd/0x20
[736826.883265]  schedule+0x46/0xb0
[736826.887003]  rwsem_down_write_slowpath+0x234/0x4b0
[736826.891114]  ? __kmalloc+0x154/0x2b0
[736826.894794]  prealloc_shrinker+0x4c/0xf0
[736826.898526]  alloc_super+0x29a/0x2f0
[736826.902210]  ? set_anon_super+0x40/0x40
[736826.905867]  sget_fc+0x6c/0x240
[736826.909206]  ? shmem_create+0x20/0x20
[736826.912912]  get_tree_nodev+0x23/0xb0
[736826.916533]  ? bpf_lsm_capset+0x10/0x10
[736826.920310]  vfs_get_tree+0x25/0xb0
[736826.923848]  do_new_mount+0x152/0x1b0
[736826.927548]  __x64_sys_mount+0x103/0x140
[736826.931531]  do_syscall_64+0x33/0x40
[736826.935597]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[736826.940323] RIP: 0033:0x7f49f8efcdba
[736826.944298] RSP: 002b:00007ffdb98f2658 EFLAGS: 00000202 ORIG_RAX:
00000000000000a5
[736826.952238] RAX: ffffffffffffffda RBX: 000055f21a7113a0 RCX:
00007f49f8efcdba
[736826.959940] RDX: 000055f218672af0 RSI: 000055f21a721de0 RDI:
000055f218672af0
[736826.967721] RBP: 00007ffdb98f2690 R08: 000055f21a7a51e0 R09:
0000000000000000
[736826.975503] R10: 0000000000000006 R11: 0000000000000202 R12:
0000000000000000
[736826.983253] R13: 00007ffdb98f2740 R14: 00007ffdb98f2820 R15:
0000000000000000

[736826.990979] task:dockerd         state:D stack:    0 pid: 6639 ppid:     1
flags:0x00000080
[736826.999289] Call Trace:
[736827.002695]  __schedule+0x1f9/0x660
[736827.006708]  schedule+0x46/0xb0
[736827.010512]  rwsem_down_write_slowpath+0x234/0x4b0
[736827.015214]  ? __kmalloc+0x154/0x2b0
[736827.019291]  prealloc_shrinker+0x4c/0xf0
[736827.023495]  alloc_super+0x29a/0x2f0
[736827.027505]  sget+0x91/0x220
[736827.031089]  ? destroy_super_rcu+0x40/0x40
[736827.035268]  ? ovl_get_lowerstack+0x190/0x190 [overlay]
[736827.040172]  mount_nodev+0x26/0x90
[736827.044070]  legacy_get_tree+0x27/0x40
[736827.048169]  vfs_get_tree+0x25/0xb0
[736827.052088]  do_new_mount+0x152/0x1b0
[736827.056076]  __x64_sys_mount+0x103/0x140
[736827.060176]  do_syscall_64+0x33/0x40
[736827.064158]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[736827.068950] RIP: 0033:0x55c5416cc2aa
[736827.072915] RSP: 002b:000000c0015b2768 EFLAGS: 00000216 ORIG_RAX:
00000000000000a5
[736827.080676] RAX: ffffffffffffffda RBX: 000000c00005e800 RCX:
000055c5416cc2aa
[736827.088375] RDX: 000000c000df5320 RSI: 000000c0012db140 RDI:
000000c000df52e8
[736827.095311] RBP: 000000c0015b2800 R08: 000000c0015a0600 R09:
0000000000000000
[736827.102300] R10: 0000000000000000 R11: 0000000000000216 R12:
0000000000000133
[736827.109100] R13: 0000000000000132 R14: 0000000000000200 R15:
0000000000000055


[737030.525416] task:sh              state:D stack:    0 pid:27526 ppid: 27517
flags:0x00000080
[737030.532905] Call Trace:
[737030.536188]  __schedule+0x1f9/0x660
[737030.540060]  schedule+0x46/0xb0
[737030.543385]  percpu_rwsem_wait+0xa9/0x180
[737030.547220]  ? percpu_free_rwsem+0x30/0x30
[737030.551093]  __percpu_down_read+0x49/0x60
[737030.555220]  mnt_want_write+0x66/0x90
[737030.558966]  open_last_lookups+0x30d/0x3f0
[737030.562914]  ? path_init+0x2bd/0x3e0
[737030.681735]  path_openat+0x88/0x1d0
[737030.685600]  ? xfs_iunlock+0x94/0xe0
[737030.689575]  do_filp_open+0x88/0x130
[737030.693496]  ? __check_object_size.part.0+0x11f/0x140
[737030.698201]  do_sys_openat2+0x97/0x150
[737030.702232]  __x64_sys_openat+0x54/0x90
[737030.706301]  do_syscall_64+0x33/0x40
[737030.710248]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[737030.714945] RIP: 0033:0x7f4502c0d14e
[737030.718849] RSP: 002b:00007ffd066be7a0 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[737030.726628] RAX: ffffffffffffffda RBX: 00000000009ba360 RCX:
00007f4502c0d14e
[737030.734177] RDX: 0000000000000241 RSI: 00000000009bb8c0 RDI:
ffffffffffffff9c
[737030.741759] RBP: 00007ffd066be8a0 R08: 0000000000000020 R09:
00000000009bb8c0
[737030.749452] R10: 00000000000001b6 R11: 0000000000000246 R12:
0000000000000000
[737030.757142] R13: 0000000000000001 R14: 0000000000000001 R15:
00000000009bad60
[737030.764836] task:bash            state:D stack:    0 pid:27580 ppid:     1
flags:0x00000084
[737030.773068] Call Trace:
[737030.776437]  __schedule+0x1f9/0x660
[737030.780354]  schedule+0x46/0xb0
[737030.784080]  percpu_rwsem_wait+0xa9/0x180
[737030.788264]  ? percpu_free_rwsem+0x30/0x30
[737030.792476]  __percpu_down_read+0x49/0x60
[737030.796642]  mnt_want_write+0x66/0x90
[737030.800660]  open_last_lookups+0x30d/0x3f0
[737030.804923]  ? path_init+0x2bd/0x3e0
[737030.808895]  path_openat+0x88/0x1d0
[737030.812800]  ? xfs_iunlock+0x94/0xe0
[737030.816745]  do_filp_open+0x88/0x130
[737030.820694]  ? getname_flags.part.0+0x29/0x1a0
[737030.825097]  ? __check_object_size.part.0+0x11f/0x140
[737030.829846]  ? audit_alloc_name+0x8c/0xe0
[737030.834010]  do_sys_openat2+0x97/0x150
[737030.838056]  __x64_sys_openat+0x54/0x90
[737030.842164]  do_syscall_64+0x33/0x40
[737030.846146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[737030.850855] RIP: 0033:0x7f2072a9414e
[737030.854822] RSP: 002b:00007ffd855832e0 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[737030.862655] RAX: ffffffffffffffda RBX: 000000000178aab0 RCX:
00007f2072a9414e
[737030.870259] RDX: 0000000000000241 RSI: 000000000178af20 RDI:
ffffffffffffff9c
[737030.877820] RBP: 00007ffd855833e0 R08: 0000000000000020 R09:
000000000178af20
[737030.885463] R10: 00000000000001b6 R11: 0000000000000246 R12:
0000000000000000
[737030.893083] R13: 0000000000000003 R14: 0000000000000001 R15:
000000000178aa60

[736992.593491] task:k               state:D stack:    0 pid:11166 ppid: 29014
flags:0x00000184
[736992.600911] Call Trace:
[736992.604234]  __schedule+0x1f9/0x660
[736992.607842]  schedule+0x46/0xb0
[736992.611295]  percpu_rwsem_wait+0xa9/0x180
[736992.614999]  ? percpu_free_rwsem+0x30/0x30
[736992.618901]  __percpu_down_read+0x49/0x60
[736992.622614]  mnt_want_write+0x66/0x90
[736992.626323]  open_last_lookups+0x30d/0x3f0
[736992.630064]  ? path_init+0x2bd/0x3e0
[736992.633795]  path_openat+0x88/0x1d0
[736992.637360]  do_filp_open+0x88/0x130
[736992.640911]  ? getname_flags.part.0+0x29/0x1a0
[736992.645006]  ? __check_object_size.part.0+0x11f/0x140
[736992.649268]  ? audit_alloc_name+0x8c/0xe0
[736992.653173]  do_sys_openat2+0x97/0x150
[736992.656832]  __x64_sys_openat+0x54/0x90
[736992.660619]  do_syscall_64+0x33/0x40
[736992.664181]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[736992.668543] RIP: 0033:0x7f81dc2d11ae
[736992.672208] RSP: 002b:00007ffcff615c00 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[736992.679509] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007f81dc2d11ae
[736992.686481] RDX: 0000000000000042 RSI: 00007f81d757e6b8 RDI:
00000000ffffff9c
[736992.693723] RBP: 00007f81d7ba14c4 R08: 000000000000003a R09:
0000000000000000
[736992.700684] R10: 00000000000001b6 R11: 0000000000000246 R12:
00007f81d757e6b8
[736992.708136] R13: 00007f81d7ba14c4 R14: 00007f81dbfb45b8 R15:
00007f81d7f3f8fc


[736992.715824] task:k               state:D stack:    0 pid:11273 ppid: 29014
flags:0x00000184
[736992.724256] Call Trace:
[736992.727698]  __schedule+0x1f9/0x660
[736992.731616]  schedule+0x46/0xb0
[736992.735367]  percpu_rwsem_wait+0xa9/0x180
[736992.739512]  ? percpu_free_rwsem+0x30/0x30
[736992.743681]  __percpu_down_read+0x49/0x60
[736992.747837]  mnt_want_write+0x66/0x90
[736992.751896]  open_last_lookups+0x30d/0x3f0
[736992.756191]  ? path_init+0x2bd/0x3e0
[736992.760175]  path_openat+0x88/0x1d0
[736992.764055]  do_filp_open+0x88/0x130
[736992.768115]  ? getname_flags.part.0+0x29/0x1a0
[736992.772563]  ? __check_object_size.part.0+0x11f/0x140
[736992.777391]  ? audit_alloc_name+0x8c/0xe0
[736992.781658]  do_sys_openat2+0x97/0x150
[736992.785794]  __x64_sys_openat+0x54/0x90
[736992.789856]  do_syscall_64+0x33/0x40
[736992.793913]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[736992.798746] RIP: 0033:0x7fb7d337f1ae
[736992.802878] RSP: 002b:00007ffd9b2dbda0 EFLAGS: 00000246 ORIG_RAX:
0000000000000101
[736992.810824] RAX: ffffffffffffffda RBX: 0000000000000002 RCX:
00007fb7d337f1ae
[736992.818496] RDX: 0000000000000042 RSI: 00007fb7d2e2aa88 RDI:
00000000ffffff9c
[736992.826359] RBP: 00007fb7cecdc7dc R08: 0000000000000008 R09:
0000000000000000
[736992.833753] R10: 00000000000001b6 R11: 0000000000000246 R12:
00007fb7d2e2aa88
[736992.840727] R13: 00007fb7d2e2aa7c R14: 00007fb7d3150920 R15:
0000000000000004


[736992.847580] task:tail            state:D stack:    0 pid:11608 ppid:  9603
flags:0x00004186
[736992.855146] Call Trace:
[736992.858238]  __schedule+0x1f9/0x660
[736992.861943]  schedule+0x46/0xb0
[736992.865387]  percpu_rwsem_wait+0xa9/0x180
[736992.869138]  ? percpu_free_rwsem+0x30/0x30
[736992.873134]  __percpu_down_read+0x49/0x60
[736992.876948]  xfs_trans_alloc+0x15d/0x170
[736992.880854]  xfs_free_eofblocks+0x130/0x1e0
[736992.885085]  xfs_release+0x13d/0x160
[736992.888807]  __fput+0x96/0x240
[736992.892138]  task_work_run+0x5f/0x90
[736992.895834]  do_exit+0x22c/0x3b0
[736992.899189]  ? timerqueue_del+0x1e/0x40
[736992.902967]  do_group_exit+0x33/0xa0
[736992.906469]  get_signal+0x15d/0x5b0
[736993.013638]  arch_do_signal+0x25/0xf0
[736993.017223]  exit_to_user_mode_loop+0x8d/0xc0
[736993.021148]  exit_to_user_mode_prepare+0x6a/0x70
[736993.025371]  syscall_exit_to_user_mode+0x22/0x140
[736993.029417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[736993.033852] RIP: 0033:0x7f907725e6f4
[736993.037391] RSP: 002b:00007ffcf7722f38 EFLAGS: 00000246 ORIG_RAX:
0000000000000023
[736993.044403] RAX: fffffffffffffdfc RBX: 0000000000000001 RCX:
00007f907725e6f4
[736993.051384] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
00007ffcf7722f40
[736993.058190] RBP: 0000000000000000 R08: 00007ffcf7723070 R09:
00007f9077312e80
[736993.065109] R10: 0000000000000000 R11: 0000000000000246 R12:
00007ffcf7722f40
[736993.071934] R13: 00007ffcf77256bf R14: 0000560396137440 R15:
0000000000000000


[736996.659414] task:process-exporte state:D stack:    0 pid:30396 ppid:  2204
flags:0x00004082
[736996.666720] Call Trace:
[736996.669864]  __schedule+0x1f9/0x660
[736996.673371]  schedule+0x46/0xb0
[736996.676730]  rwsem_down_write_slowpath+0x234/0x4b0
[736996.680991]  unregister_memcg_shrinker.isra.0+0x18/0x40
[736996.685377]  unregister_shrinker+0x7b/0x80
[736996.689295]  deactivate_locked_super+0x29/0xa0
[736996.693241]  cleanup_mnt+0x12d/0x190
[736996.696912]  task_work_run+0x5f/0x90
[736996.700444]  do_exit+0x22c/0x3b0
[736996.703996]  do_group_exit+0x33/0xa0
[736996.707581]  get_signal+0x15d/0x5b0
[736996.711191]  arch_do_signal+0x25/0xf0
[736996.714771]  exit_to_user_mode_loop+0x8d/0xc0
[736996.718759]  exit_to_user_mode_prepare+0x6a/0x70
[736996.722740]  syscall_exit_to_user_mode+0x22/0x140
[736996.726869]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[736996.835174] RIP: 0033:0x4a2c50
[736996.838687] RSP: 002b:000000c000213798 EFLAGS: 00000216 ORIG_RAX:
0000000000000000
[736996.845651] RAX: 0000000000000200 RBX: 000000c000032500 RCX:
00000000004a2c50
[736996.852636] RDX: 0000000000000200 RSI: 000000c0008ef600 RDI:
0000000000000005
[736996.859446] RBP: 000000c0002137e8 R08: 0000000000000000 R09:
0000000000000000
[736996.866347] R10: 0000000000000000 R11: 0000000000000216 R12:
ffffffffffffffff
[736996.873124] R13: 000000000000000c R14: 000000000000000b R15:
0000000000000010

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-10-29  2:22 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #14 from Pedram Fard (pedram.fard@appian.com) ---
Here is the result of 'echo t > /proc/sysrq-trigger':

1 I root       578     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfsalloc]
1 I root       579     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs_mru_cache]
1 I root      1251     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-buf/nvme0n1]
1 I root      1252     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-conv/nvme0n]
1 I root      1253     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-cil/nvme0n1]
1 I root      1254     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-reclaim/nvm]
1 I root      1255     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-eofblocks/n]
1 I root      1256     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-log/nvme0n1]
1 S root      1257     2  0  80   0 -     0 -      Oct20 ?        00:02:06
[xfsaild/nvme0n1]
1 I root      1990     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-buf/dm-0]
1 I root      1991     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-conv/dm-0]
1 I root      1992     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-cil/dm-0]
1 I root      1993     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-reclaim/dm-]
1 I root      1994     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-eofblocks/d]
1 I root      1995     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-log/dm-0]
1 S root      1996     2  0  80   0 -     0 -      Oct20 ?        00:00:00
[xfsaild/dm-0]
1 I root      1999     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-buf/dm-1]
1 I root      2000     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-conv/dm-1]
1 I root      2001     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-cil/dm-1]
1 I root      2002     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-reclaim/dm-]
1 I root      2003     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-eofblocks/d]
1 I root      2004     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-log/dm-1]
1 S root      2005     2  0  80   0 -     0 -      Oct20 ?        00:01:53
[xfsaild/dm-1]
1 I root      2008     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-buf/dm-2]
1 I root      2009     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-conv/dm-2]
1 I root      2010     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-cil/dm-2]
1 I root      2011     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-reclaim/dm-]
1 I root      2012     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-eofblocks/d]
1 I root      2013     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-log/dm-2]
1 S root      2014     2  0  80   0 -     0 -      Oct20 ?        00:00:00
[xfsaild/dm-2]
0 S root      2226  1942  0  80   0 - 30496 -      17:41 ?        00:00:00
/bin/sh -f /usr/sbin/xfs_freeze -u /usr/local/appian
4 D root      2230  2226  0  80   0 - 29825 -      17:41 ?        00:00:00
/usr/sbin/xfs_io -F -r -p xfs_freeze -x -c thaw /usr/local/appian
1 I root      4068     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-buf/dm-3]
1 I root      4069     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-conv/dm-3]
1 I root      4070     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-cil/dm-3]
1 I root      4071     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-reclaim/dm-]
1 I root      4072     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-eofblocks/d]
1 I root      4073     2  0  60 -20 -     0 -      Oct20 ?        00:00:00
[xfs-log/dm-3]
1 S root      4074     2  0  80   0 -     0 -      Oct20 ?        00:02:01
[xfsaild/dm-3]
0 S root     10903 13809  0  80   0 - 29855 -      17:57 pts/0    00:00:00 grep
--color=auto xfs
1 I root     11221     2  0  80   0 -     0 -      17:44 ?        00:00:00
[kworker/1:4-xfs]
4 D root     22209     1  0  80   0 - 29825 -      16:08 ?        00:00:00
/usr/sbin/xfs_io -F -r -p xfs_freeze -x -c thaw /usr/local/appian
1 I root     23187     2  0  80   0 -     0 -      17:37 ?        00:00:00
[kworker/1:3-xfs]
1 I root     30083     2  0  80   0 -     0 -      17:29 ?        00:00:00
[kworker/1:1-xfs]

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-11-09  5:19 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #15 from Christian Theune (ct@flyingcircus.io) ---
@Pedram: I'm not sure those are related. I'm 50/50 on whether my issue is
freeze-related at all, and all my stack traces have 'xfs_log_commit_cil' in
them, which I don't see in yours; it could be a separate issue.

* [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: bugzilla-daemon @ 2021-11-10 15:16 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=214767

--- Comment #16 from Christian Theune (ct@flyingcircus.io) ---
I started trying out the fix that Dave mentioned and am using it with 5.10.76
(it applied cleanly with a bit of fuzz).

@Dave, do you happen to know whether there's a helper that can stress-test
live systems in this regard?

* Re: [Bug 214767] xfs seems to hang due to race condition? maybe related to (gratuitous) thaw.
From: Dave Chinner @ 2021-11-20 22:31 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Wed, Nov 10, 2021 at 03:16:35PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214767
> 
> --- Comment #16 from Christian Theune (ct@flyingcircus.io) ---
> I started trying out the fix that Dave mentioned and am using it with 5.10.76
> (it applied cleanly with a bit of fuzz).
> 
> @Dave, do you happen to know whether there's a helper that can stress-test
> live systems in this regard?

fstests has some tests in the "freeze" group that stress
freeze/thaw. And it has lots of other tests in it that will tell you
if there's a regression in your backport, so might be an idea to run
a full "auto" group pass rather than just the freeze group....
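
A run from an fstests checkout might look like this (the config values are
placeholders for local test/scratch devices):

    # minimal local.config; device paths are placeholders
    export TEST_DEV=/dev/vdb1
    export TEST_DIR=/mnt/test
    export SCRATCH_DEV=/dev/vdc1
    export SCRATCH_MNT=/mnt/scratch

    ./check -g freeze    # just the freeze/thaw stress tests
    ./check -g auto      # the full regression pass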

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
