All of lore.kernel.org
 help / color / mirror / Atom feed
* steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)
@ 2021-01-03 14:27 Nick Alcock
  2021-02-23 22:57 ` J. Bruce Fields
  0 siblings, 1 reply; 12+ messages in thread
From: Nick Alcock @ 2021-01-03 14:27 UTC (permalink / raw)
  To: NFS list

So I bought a game or two on Steam a fortnight ago to while away a
Christmas spent stuck indoors with no family, decorations, or presents
(stuck at my sister's and she got a perfectly-timed case of covid-19).
It turned out this was a very bad idea!

I keep my Steam directory on an ext4 fs on the local fileserver and
access it via NFSv4.2: over 10GbE: it's actually faster than local
spinning-rust storage would be (five-spindle RAID-6 and bcache and
128GiB RAM on the server for caching probably have something to do with
that!) though likely slower than the NVMe the local machine actually
has: but it doesn't have enough local storage, so I keep all the Steam
stuff remotely.

(I keep my $HOME on the same fileserver, though that's on xfs: I'd use
xfs for Steam, too, but there are still a lot of games out there that
fail on filesystems with 64-bit inode numbers.)

Unfortunately, what I observe now when kicking up Steam is a spasm of
downloading I/O for the game I just bought and then a complete
client-side hang when it starts installing: just about all I can do is
hard-reboot. This of course stops me using Steam for anything which
didn't make me very happy :P

The server hums onwards as though nothing went wrong, but the client is
in a bad way indeed: not only has NFS hung but there are clearly some
important locks taken out since even an ls on a non-NFS dir (with no NFS
dirs in the PATH) hangs, as does a while :; ps -e; done loop. I can
rebuild with lockdep and do a lock dump if that would be useful. (The
kernel is not quite stock: it has DTrace built in, but the hangs happen
even if the kernel module has never been loaded, and I'm pretty sure the
hangs can't be associated with the not-very-substantial built-in DTrace
changes. I can try rebuilding without that as well, but I'll have to rip
chunks out of my monitoring machinery to do that so I've avoided doing
it so far.)

I finally got round to doing a bit of debugging today. NFS debugging
dumps covering the last fifteen seconds or so before the hang and a
couple of minutes afterwards (taken on the client, logged without
incident to the server's syslog, as proof that not *everything* is
stalled) are a few megabytes long so are at
<http://www.esperi.org.uk/~nix/temporary/nfs-steam-hang.log>. Everything
seems to go wrong after an especially large bunch of NFS commits.

Both client (silk, 192.168.16.21) and server (loom, 192.168.16.8) are
running 5.10.4 with the same NFS config, though the client has no NFS
server exports at the time and the server isn't acting as an NFS client
at all:

CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_PNFS_FILE_LAYOUT=y
CONFIG_PNFS_BLOCK=y
CONFIG_PNFS_FLEXFILE_LAYOUT=m
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
# CONFIG_NFS_V4_1_MIGRATION is not set
# CONFIG_ROOT_NFS is not set
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DEBUG=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
# CONFIG_NFS_V4_2_READ_PLUS is not set
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
# CONFIG_NFSD_BLOCKLAYOUT is not set
# CONFIG_NFSD_SCSILAYOUT is not set
# CONFIG_NFSD_FLEXFILELAYOUT is not set
CONFIG_NFSD_V4_2_INTER_SSC=y
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_SUNRPC_DEBUG=y

The server (loom) has a large number of exports: many of the filesystems
being exported are moved around via bind mounts and the like. The steam
datadir is in /pkg/non-free/steam, which is bind-mounted onto
/home/nix/.steam on the client: /home/nix itself is bind-mounted from
/home/.loom.srvr.nix/steam, and is NFS-exported from the server under
that pathname.

Relevant exports, from /proc/fs/nfs/exports:

/	192.168.16.0/24,silk.wkstn.nix(ro,insecure,no_root_squash,sync,no_wdelay,no_subtree_check,v4root,fsid=0,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=390003:390004:390005:1)
/home/.loom.srvr.nix	*.srvr.nix,fold.srvr.nix(rw,root_squash,sync,wdelay,no_subtree_check,uuid=0a4a4563:00764033:8c827c0e:989cf534,sec=1)
/pkg/non-free/steam	192.168.16.0/24,silk.wkstn.nix(rw,root_squash,sync,wdelay,uuid=11b94016:cb6e4b64:beb6c134:9ea3d6b3,sec=1)

Blocked tasks after the hang suggest that the client is hung waiting
indefinitely for the server to reply, and I'd guess the server doesn't
think it needs to. (I can do some server-side debugging dumps too if you
think the problem might be there.)

[  322.843962] sysrq: Show Blocked State
[  322.844058] task:tee             state:D stack:    0 pid: 1176 ppid:  1173 flags:0x00000000
[  322.844064] Call Trace:
[  322.844074]  __schedule+0x3db/0x7ee
[  322.844079]  ? io_schedule+0x70/0x6d
[  322.844082]  schedule+0x68/0xdc
[  322.844085]  io_schedule+0x46/0x6d
[  322.844089]  bit_wait_io+0x11/0x52
[  322.844092]  __wait_on_bit+0x31/0x90
[  322.844096]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844100]  ? var_wake_function+0x30/0x27
[  322.844104]  nfs_wait_on_request.part.0+0x31/0x33
[  322.844107]  nfs_page_group_lock_head+0x31/0x78
[  322.844110]  nfs_lock_and_join_requests+0x9d/0x26e
[  322.844114]  nfs_updatepage+0x1be/0x985
[  322.844118]  nfs_write_end+0x128/0x52a
[  322.844122]  ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
[  322.844127]  generic_perform_write+0xed/0x17c
[  322.844131]  nfs_file_write+0x14a/0x29f
[  322.844135]  new_sync_write+0xfb/0x16b
[  322.844139]  vfs_write+0x1ee/0x29a
[  322.844142]  ? vfs_read+0x17b/0x198
[  322.844145]  ksys_write+0x61/0xd0
[  322.844148]  __x64_sys_write+0x1a/0x1c
[  322.844151]  do_syscall_64+0x32/0x45
[  322.844154]  entry_SYSCALL_64_after_hwframe+0x44/0x0
[  322.844158] RIP: 0033:0x7f4781f323a3
[  322.844160] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  322.844164] RAX: ffffffffffffffda RBX: 0000000000000092 RCX: 00007f4781f323a3
[  322.844166] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI: 0000000000000003
[  322.844168] RBP: 00007ffcf8220530 R08: 0000000000000000 R09: 0000000000000001
[  322.844170] R10: 0000000000400ad2 R11: 0000000000000246 R12: 0000000000000092
[  322.844172] R13: 0000000002361580 R14: 0000000000000092 R15: 00007f4782201720
[  322.844202] task:CIPCServer::Thr state:D stack:    0 pid: 1801 ppid:  1540 flags:0x20020000
[  322.844207] Call Trace:
[  322.844212]  __schedule+0x3db/0x7ee
[  322.844219]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844222]  schedule+0x68/0xdc
[  322.844227]  rpc_wait_bit_killable+0x11/0x52
[  322.844230]  __wait_on_bit+0x31/0x90
[  322.844235]  ? __rpc_atrun+0x20/0x1a
[  322.844238]  ? __rpc_atrun+0x20/0x1a
[  322.844242]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844246]  ? var_wake_function+0x30/0x27
[  322.844250]  __rpc_execute+0xe5/0x485
[  322.844254]  ? sugov_get_util+0xf0/0xeb
[  322.844258]  rpc_execute+0xa5/0xaa
[  322.844263]  rpc_run_task+0x148/0x187
[  322.844267]  nfs4_do_call_sync+0x61/0x81
[  322.844272]  _nfs4_proc_getattr+0x10a/0x120
[  322.844278]  nfs4_proc_getattr+0x67/0xee
[  322.844283]  __nfs_revalidate_inode+0xb7/0x2df
[  322.844287]  nfs_access_get_cached+0x14d/0x1df
[  322.844291]  nfs_do_access+0x48/0x28e
[  322.844296]  ? step_into+0x2bc/0x6a1
[  322.844301]  nfs_permission+0xa2/0x1b8
[  322.844305]  inode_permission+0x96/0xca
[  322.844308]  ? unlazy_walk+0x4c/0x80
[  322.844312]  link_path_walk.part.0+0x2a5/0x320
[  322.844315]  ? path_init+0x2c2/0x3e5
[  322.844318]  path_lookupat+0x3f/0x1b1
[  322.844322]  filename_lookup+0x97/0x17f
[  322.844326]  ? kmem_cache_alloc+0x32/0x200
[  322.844331]  user_path_at_empty+0x59/0x8a
[  322.844333]  do_faccessat+0x70/0x227
[  322.844337]  __ia32_sys_access+0x1c/0x1e
[  322.844340]  __do_fast_syscall_32+0x5f/0x8e
[  322.844342]  do_fast_syscall_32+0x3d/0x80
[  322.844346]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
[  322.844348] RIP: 0023:0xf7ee6549
[  322.844350] RSP: 002b:00000000eb763b28 EFLAGS: 00200292 ORIG_RAX: 0000000000000021
[  322.844353] RAX: ffffffffffffffda RBX: 000000005754d9e0 RCX: 0000000000000004
[  322.844355] RDX: 00000000f7ab3e1c RSI: 000000005754d9e0 RDI: 00000000eb763bc8
[  322.844357] RBP: 00000000f78a91c0 R08: 0000000000000000 R09: 0000000000000000
[  322.844359] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  322.844360] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  322.844364] task:CFileWriterThre state:D stack:    0 pid: 1810 ppid:  1540 flags:0x20024000
[  322.844367] Call Trace:
[  322.844371]  __schedule+0x3db/0x7ee
[  322.844375]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844379]  schedule+0x68/0xdc
[  322.844382]  rpc_wait_bit_killable+0x11/0x52
[  322.844385]  __wait_on_bit+0x31/0x90
[  322.844389]  ? call_reserveresult+0xa0/0x9f
[  322.844391]  ? call_transmit_status+0x160/0x159
[  322.844395]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844398]  ? var_wake_function+0x30/0x27
[  322.844401]  __rpc_execute+0xe5/0x485
[  322.844404]  ? nfs_check_verifier+0x5b/0xa4
[  322.844407]  rpc_execute+0xa5/0xaa
[  322.844410]  rpc_run_task+0x148/0x187
[  322.844414]  nfs4_do_call_sync+0x61/0x81
[  322.844417]  _nfs4_proc_getattr+0x10a/0x120
[  322.844422]  nfs4_proc_getattr+0x67/0xee
[  322.844426]  __nfs_revalidate_inode+0xb7/0x2df
[  322.844430]  nfs_getattr+0x159/0x374
[  322.844433]  vfs_getattr_nosec+0x96/0xa4
[  322.844435]  vfs_statx+0x7a/0xe5
[  322.844438]  vfs_fstatat+0x1f/0x21
[  322.844443]  __do_compat_sys_ia32_lstat64+0x2b/0x4e
[  322.844448]  __ia32_compat_sys_ia32_lstat64+0x14/0x16
[  322.844450]  __do_fast_syscall_32+0x5f/0x8e
[  322.844453]  do_fast_syscall_32+0x3d/0x80
[  322.844457]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
[  322.844459] RIP: 0023:0xf7ee6549
[  322.844460] RSP: 002b:00000000eb561be8 EFLAGS: 00000292 ORIG_RAX: 00000000000000c4
[  322.844463] RAX: ffffffffffffffda RBX: 00000000c521bf60 RCX: 00000000eb561e70
[  322.844465] RDX: 00000000f7ab3e1c RSI: 00000000eb561e70 RDI: 00000000c521bf60
[  322.844467] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  322.844468] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  322.844470] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  322.844477] task:Steam - DirWatc state:D stack:    0 pid: 1924 ppid:  1540 flags:0x20024000
[  322.844480] Call Trace:
[  322.844484]  __schedule+0x3db/0x7ee
[  322.844488]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844492]  schedule+0x68/0xdc
[  322.844495]  rpc_wait_bit_killable+0x11/0x52
[  322.844498]  __wait_on_bit+0x31/0x90
[  322.844501]  ? rpc_sleep_on_timeout+0x64/0x9c
[  322.844504]  ? rpc_check_timeout+0x1e0/0x1e0
[  322.844507]  ? call_connect_status+0x180/0x17a
[  322.844511]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844513]  ? var_wake_function+0x30/0x27
[  322.844516]  __rpc_execute+0xe5/0x485
[  322.844520]  rpc_execute+0xa5/0xaa
[  322.844523]  rpc_run_task+0x148/0x187
[  322.844526]  nfs4_do_call_sync+0x61/0x81
[  322.844530]  _nfs4_proc_getattr+0x10a/0x120
[  322.844534]  nfs4_proc_getattr+0x67/0xee
[  322.844538]  __nfs_revalidate_inode+0xb7/0x2df
[  322.844542]  nfs_lookup_verify_inode+0x67/0x82
[  322.844545]  nfs_do_lookup_revalidate+0x19b/0x29c
[  322.844548]  nfs4_do_lookup_revalidate+0x5d/0xc5
[  322.844552]  nfs4_lookup_revalidate+0x6a/0x8d
[  322.844555]  lookup_fast+0xb8/0x122
[  322.844559]  path_openat+0xfd/0xfd8
[  322.844562]  ? unlazy_walk+0x4c/0x80
[  322.844565]  ? terminate_walk+0x62/0xed
[  322.844569]  ? putname+0x4b/0x53
[  322.844572]  do_filp_open+0x86/0x110
[  322.844576]  ? kmem_cache_alloc+0x32/0x200
[  322.844579]  ? __alloc_fd+0xb2/0x143
[  322.844583]  do_sys_openat2+0x8d/0x13a
[  322.844585]  ? do_faccessat+0x108/0x227
[  322.844588]  __ia32_compat_sys_openat+0x48/0x63
[  322.844591]  __do_fast_syscall_32+0x5f/0x8e
[  322.844594]  do_fast_syscall_32+0x3d/0x80
[  322.844597]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
[  322.844600] RIP: 0023:0xf7ee6549
[  322.844601] RSP: 002b:000000008c8fbcb0 EFLAGS: 00200287 ORIG_RAX: 0000000000000127
[  322.844604] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 000000008c8fbffc
[  322.844606] RDX: 0000000000098800 RSI: 0000000000000000 RDI: 00000000f7ab3e1c
[  322.844608] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  322.844609] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  322.844611] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  322.844614] task:CGenericAsyncFi state:D stack:    0 pid: 1927 ppid:  1540 flags:0x20020000
[  322.844617] Call Trace:
[  322.844621]  __schedule+0x3db/0x7ee
[  322.844625]  schedule+0x68/0xdc
[  322.844628]  io_schedule+0x46/0x6d
[  322.844631]  wait_on_page_bit_common+0xf8/0x325
[  322.844634]  ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
[  322.844638]  wait_on_page_bit+0x3f/0x41
[  322.844642]  wait_on_page_writeback.part.0+0x35/0x80
[  322.844645]  wait_on_page_writeback+0x27/0x29
[  322.844647]  __filemap_fdatawait_range+0x88/0xd9
[  322.844651]  filemap_write_and_wait_range+0x45/0x80
[  322.844655]  nfs_wb_all+0x30/0x13d
[  322.844658]  nfs4_file_flush+0x85/0xc0
[  322.844661]  filp_close+0x3e/0x78
[  322.844664]  __close_fd+0x23/0x30
[  322.844667]  __ia32_sys_close+0x22/0x50
[  322.844670]  __do_fast_syscall_32+0x5f/0x8e
[  322.844673]  do_fast_syscall_32+0x3d/0x80
[  322.844676]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
[  322.844678] RIP: 0023:0xf7ee6549
[  322.844680] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286 ORIG_RAX: 0000000000000006
[  322.844682] RAX: ffffffffffffffda RBX: 000000000000004d RCX: 0000000000000002
[  322.844684] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI: 000000000000004d
[  322.844686] RBP: 000000008c5fe110 R08: 0000000000000000 R09: 0000000000000000
[  322.844688] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  322.844689] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  322.844699] task:ThreadPoolForeg state:D stack:    0 pid: 1768 ppid:  1735 flags:0x00004000
[  322.844702] Call Trace:
[  322.844705]  __schedule+0x3db/0x7ee
[  322.844709]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844712]  schedule+0x68/0xdc
[  322.844716]  rpc_wait_bit_killable+0x11/0x52
[  322.844719]  __wait_on_bit+0x31/0x90
[  322.844722]  ? rpc_sleep_on_timeout+0x64/0x9c
[  322.844725]  ? rpc_check_timeout+0x1e0/0x1e0
[  322.844728]  ? call_connect_status+0x180/0x17a
[  322.844731]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844734]  ? var_wake_function+0x30/0x27
[  322.844737]  __rpc_execute+0xe5/0x485
[  322.844740]  ? nfs_check_verifier+0x5b/0xa4
[  322.844743]  rpc_execute+0xa5/0xaa
[  322.844746]  rpc_run_task+0x148/0x187
[  322.844750]  nfs4_do_call_sync+0x61/0x81
[  322.844753]  _nfs4_proc_getattr+0x10a/0x120
[  322.844758]  nfs4_proc_getattr+0x67/0xee
[  322.844762]  __nfs_revalidate_inode+0xb7/0x2df
[  322.844766]  nfs_getattr+0x159/0x374
[  322.844769]  vfs_getattr_nosec+0x96/0xa4
[  322.844771]  vfs_statx+0x7a/0xe5
[  322.844774]  __do_sys_newstat+0x31/0x4c
[  322.844778]  __x64_sys_newstat+0x16/0x18
[  322.844781]  do_syscall_64+0x32/0x45
[  322.844783]  entry_SYSCALL_64_after_hwframe+0x44/0x0
[  322.844786] RIP: 0033:0x7fe70ebf19e6
[  322.844787] RSP: 002b:00007fe6c15b2068 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  322.844790] RAX: ffffffffffffffda RBX: 00000bd4193e1910 RCX: 00007fe70ebf19e6
[  322.844792] RDX: 00007fe6c15b2078 RSI: 00007fe6c15b2078 RDI: 00000bd4193e1910
[  322.844794] RBP: 00007fe6c15b2160 R08: 0000000000000001 R09: 00007ffd971ed090
[  322.844796] R10: 0000000000056764 R11: 0000000000000246 R12: 00007fe6c15b2078
[  322.844797] R13: 00000bd4192f6f60 R14: 00007fe6c15b2108 R15: 00007fe6c15b2120
[  322.844801] task:ThreadPoolForeg state:D stack:    0 pid: 1769 ppid:  1735 flags:0x00000000
[  322.844804] Call Trace:
[  322.844807]  __schedule+0x3db/0x7ee
[  322.844811]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844814]  schedule+0x68/0xdc
[  322.844817]  rpc_wait_bit_killable+0x11/0x52
[  322.844820]  __wait_on_bit+0x31/0x90
[  322.844823]  ? __rpc_atrun+0x20/0x1a
[  322.844826]  ? __rpc_atrun+0x20/0x1a
[  322.844829]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844831]  ? var_wake_function+0x30/0x27
[  322.844834]  __rpc_execute+0xe5/0x485
[  322.844837]  ? update_cfs_group+0x9a/0x9e
[  322.844840]  rpc_execute+0xa5/0xaa
[  322.844842]  rpc_run_task+0x148/0x187
[  322.844845]  nfs4_do_call_sync+0x61/0x81
[  322.844849]  _nfs4_proc_getattr+0x10a/0x120
[  322.844853]  nfs4_proc_getattr+0x67/0xee
[  322.844856]  __nfs_revalidate_inode+0xb7/0x2df
[  322.844859]  nfs_file_write+0xfc/0x29f
[  322.844862]  new_sync_write+0xfb/0x16b
[  322.844866]  vfs_write+0x1ee/0x29a
[  322.844868]  ? __fget_light+0x31/0x72
[  322.844871]  ksys_write+0x61/0xd0
[  322.844874]  __x64_sys_write+0x1a/0x1c
[  322.844877]  do_syscall_64+0x32/0x45
[  322.844880]  entry_SYSCALL_64_after_hwframe+0x44/0x0
[  322.844881] RIP: 0033:0x7fe71859ab7f
[  322.844883] RSP: 002b:00007fe6bd5b0d30 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[  322.844885] RAX: ffffffffffffffda RBX: 00007fe6bd5b0d68 RCX: 00007fe71859ab7f
[  322.844887] RDX: 0000000000000007 RSI: 00007fe6bd5b1090 RDI: 000000000000005b
[  322.844888] RBP: 00007fe6bd5b0de0 R08: 0000000000000000 R09: 00007ffd971ed090
[  322.844890] R10: 0000000000065fd2 R11: 0000000000000293 R12: 00000bd418d95020
[  322.844891] R13: 00007fe6bd5b1090 R14: 0000000000000007 R15: 0000000000000000
[  322.844900] task:ThreadPoolForeg state:D stack:    0 pid: 1807 ppid:  1773 flags:0x00000000
[  322.844902] Call Trace:
[  322.844904]  __schedule+0x3db/0x7ee
[  322.844907]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844908]  schedule+0x68/0xdc
[  322.844911]  rpc_wait_bit_killable+0x11/0x52
[  322.844913]  __wait_on_bit+0x31/0x90
[  322.844915]  ? __rpc_atrun+0x20/0x1a
[  322.844916]  ? __rpc_atrun+0x20/0x1a
[  322.844918]  out_of_line_wait_on_bit+0x7e/0x80
[  322.844920]  ? var_wake_function+0x30/0x27
[  322.844922]  __rpc_execute+0xe5/0x485
[  322.844924]  rpc_execute+0xa5/0xaa
[  322.844926]  rpc_run_task+0x148/0x187
[  322.844928]  nfs4_do_call_sync+0x61/0x81
[  322.844930]  _nfs4_proc_getattr+0x10a/0x120
[  322.844933]  nfs4_proc_getattr+0x67/0xee
[  322.844936]  __nfs_revalidate_inode+0xb7/0x2df
[  322.844938]  nfs_access_get_cached+0x14d/0x1df
[  322.844940]  nfs_do_access+0x48/0x28e
[  322.844942]  ? nfs4_lookup_revalidate+0x38/0x8d
[  322.844944]  nfs_permission+0xa2/0x1b8
[  322.844946]  inode_permission+0x96/0xca
[  322.844948]  ? unlazy_walk+0x4c/0x80
[  322.844950]  link_path_walk.part.0+0x2a5/0x320
[  322.844952]  ? path_init+0x2c2/0x3e5
[  322.844954]  path_lookupat+0x3f/0x1b1
[  322.844956]  filename_lookup+0x97/0x17f
[  322.844959]  ? do_unlk+0x9a/0xca
[  322.844960]  ? kmem_cache_alloc+0x32/0x200
[  322.844963]  user_path_at_empty+0x59/0x8a
[  322.844965]  vfs_statx+0x64/0xe5
[  322.844967]  __do_sys_newstat+0x31/0x4c
[  322.844968]  ? do_fcntl+0x113/0x582
[  322.844971]  ? fput+0x13/0x15
[  322.844972]  ? __x64_sys_fcntl+0x81/0xa8
[  322.844974]  __x64_sys_newstat+0x16/0x18
[  322.844976]  do_syscall_64+0x32/0x45
[  322.844977]  entry_SYSCALL_64_after_hwframe+0x44/0x0
[  322.844979] RIP: 0033:0x7f0d1fd0c9e6
[  322.844980] RSP: 002b:00007f0d08e12e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  322.844981] RAX: ffffffffffffffda RBX: 00007f0d08e12edc RCX: 00007f0d1fd0c9e6
[  322.844983] RDX: 00007f0d08e12e28 RSI: 00007f0d08e12e28 RDI: 00001605c2ff1169
[  322.844984] RBP: 00007f0d08e12ec0 R08: 0000000000000001 R09: 00007fffaddd9090
[  322.844985] R10: 00001605c3133480 R11: 0000000000000246 R12: 0000000000000000
[  322.844986] R13: 0000000000000000 R14: 00001605c2ff0f00 R15: 0000000000000000
[  322.844988] task:ThreadPoolForeg state:D stack:    0 pid: 2031 ppid:  1773 flags:0x00004000
[  322.844990] Call Trace:
[  322.844992]  __schedule+0x3db/0x7ee
[  322.844995]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.844997]  schedule+0x68/0xdc
[  322.844999]  rpc_wait_bit_killable+0x11/0x52
[  322.845001]  __wait_on_bit+0x31/0x90
[  322.845003]  ? rpc_sleep_on_timeout+0x64/0x9c
[  322.845005]  ? rpc_check_timeout+0x1e0/0x1e0
[  322.845006]  ? call_connect_status+0x180/0x17a
[  322.845009]  out_of_line_wait_on_bit+0x7e/0x80
[  322.845010]  ? var_wake_function+0x30/0x27
[  322.845012]  __rpc_execute+0xe5/0x485
[  322.845014]  rpc_execute+0xa5/0xaa
[  322.845016]  rpc_run_task+0x148/0x187
[  322.845018]  nfs4_do_call_sync+0x61/0x81
[  322.845020]  _nfs4_proc_getattr+0x10a/0x120
[  322.845023]  nfs4_proc_getattr+0x67/0xee
[  322.845026]  __nfs_revalidate_inode+0xb7/0x2df
[  322.845028]  nfs_access_get_cached+0x14d/0x1df
[  322.845030]  nfs_do_access+0x48/0x28e
[  322.845032]  ? nfs4_lookup_revalidate+0x38/0x8d
[  322.845034]  nfs_permission+0xa2/0x1b8
[  322.845036]  inode_permission+0x96/0xca
[  322.845038]  ? unlazy_walk+0x4c/0x80
[  322.845040]  link_path_walk.part.0+0x2a5/0x320
[  322.845042]  ? path_init+0x2c2/0x3e5
[  322.845044]  path_lookupat+0x3f/0x1b1
[  322.845046]  filename_lookup+0x97/0x17f
[  322.845049]  ? kmem_cache_alloc+0x32/0x200
[  322.845051]  user_path_at_empty+0x59/0x8a
[  322.845054]  ? mntput_no_expire+0x4f/0x264
[  322.845055]  vfs_statx+0x64/0xe5
[  322.845057]  ? putname+0x4b/0x53
[  322.845059]  __do_sys_newstat+0x31/0x4c
[  322.845062]  ? __x64_sys_futex+0x13a/0x159
[  322.845064]  ? switch_fpu_return+0x58/0xf2
[  322.845066]  __x64_sys_newstat+0x16/0x18
[  322.845068]  do_syscall_64+0x32/0x45
[  322.845069]  entry_SYSCALL_64_after_hwframe+0x44/0x0
[  322.845071] RIP: 0033:0x7f0d1fd0c9e6
[  322.845072] RSP: 002b:00007f0d00331108 EFLAGS: 00000246 ORIG_RAX: 0000000000000004
[  322.845073] RAX: ffffffffffffffda RBX: 00001605c2fdc0a0 RCX: 00007f0d1fd0c9e6
[  322.845074] RDX: 00007f0d00331118 RSI: 00007f0d00331118 RDI: 00001605c2fdc0a0
[  322.845076] RBP: 00007f0d00331200 R08: 0000000000000001 R09: 00007fffaddd9090
[  322.845077] R10: 000000000005771c R11: 0000000000000246 R12: 00007f0d00331118
[  322.845078] R13: 00001605c3309c40 R14: 00007f0d003311a8 R15: 00007f0d003311c0
[  322.845087] task:192.168.16.8-ma state:D stack:    0 pid: 2044 ppid:     2 flags:0x00004000
[  322.845089] Call Trace:
[  322.845091]  __schedule+0x3db/0x7ee
[  322.845094]  ? trace_event_raw_event_rpc_xdr_overflow+0x280/0x27e
[  322.845095]  schedule+0x68/0xdc
[  322.845098]  rpc_wait_bit_killable+0x11/0x52
[  322.845100]  __wait_on_bit+0x31/0x90
[  322.845102]  ? __rpc_atrun+0x20/0x1a
[  322.845103]  ? __rpc_atrun+0x20/0x1a
[  322.845105]  out_of_line_wait_on_bit+0x7e/0x80
[  322.845107]  ? var_wake_function+0x30/0x27
[  322.845109]  __rpc_execute+0xe5/0x485
[  322.845111]  rpc_execute+0xa5/0xaa
[  322.845113]  rpc_run_task+0x148/0x187
[  322.845115]  nfs4_do_call_sync+0x61/0x81
[  322.845117]  _nfs4_proc_statfs+0xc1/0xcf
[  322.845120]  nfs4_proc_statfs+0x55/0x74
[  322.845122]  nfs_statfs+0x65/0x152
[  322.845124]  ? _nfs4_proc_delegreturn+0x21e/0x2e5
[  322.845127]  statfs_by_dentry+0x55/0x74
[  322.845129]  vfs_statfs+0x24/0x43
[  322.845132]  check_free_space+0x3c/0xe5
[  322.845135]  do_acct_process+0x5d/0x5e1
[  322.845136]  ? nfs_end_delegation_return+0x1cb/0x3aa
[  322.845138]  ? complete+0x49/0x57
[  322.845141]  acct_process+0xe5/0x105
[  322.845143]  do_exit+0x781/0xa1b
[  322.845145]  ? refcount_dec_and_lock+0x22/0x8c
[  322.845148]  __module_put_and_exit+0x1b/0x1b
[  322.845150]  nfs4_run_state_manager+0x293/0x933
[  322.845152]  ? nfs4_do_reclaim+0x820/0x81c
[  322.845155]  kthread+0x142/0x15f
[  322.845157]  ? __kthread_create_worker+0x130/0x122
[  322.845159]  ret_from_fork+0x1f/0x2a
[  369.859385] INFO: task tee:1176 blocked for more than 122 seconds.
[  369.859391]       Not tainted 5.10.4-00021-gde017294a539-dirty #3
[  369.859393] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.859395] task:tee             state:D stack:    0 pid: 1176 ppid:  1173 flags:0x00000000
[  369.859399] Call Trace:
[  369.859407]  __schedule+0x3db/0x7ee
[  369.859410]  ? io_schedule+0x70/0x6d
[  369.859412]  schedule+0x68/0xdc
[  369.859415]  io_schedule+0x46/0x6d
[  369.859417]  bit_wait_io+0x11/0x52
[  369.859420]  __wait_on_bit+0x31/0x90
[  369.859422]  out_of_line_wait_on_bit+0x7e/0x80
[  369.859426]  ? var_wake_function+0x30/0x27
[  369.859429]  nfs_wait_on_request.part.0+0x31/0x33
[  369.859431]  nfs_page_group_lock_head+0x31/0x78
[  369.859433]  nfs_lock_and_join_requests+0x9d/0x26e
[  369.859436]  nfs_updatepage+0x1be/0x985
[  369.859439]  nfs_write_end+0x128/0x52a
[  369.859443]  ? iov_iter_copy_from_user_atomic+0xe8/0x3ce
[  369.859446]  generic_perform_write+0xed/0x17c
[  369.859449]  nfs_file_write+0x14a/0x29f
[  369.859452]  new_sync_write+0xfb/0x16b
[  369.859455]  vfs_write+0x1ee/0x29a
[  369.859457]  ? vfs_read+0x17b/0x198
[  369.859459]  ksys_write+0x61/0xd0
[  369.859462]  __x64_sys_write+0x1a/0x1c
[  369.859464]  do_syscall_64+0x32/0x45
[  369.859467]  entry_SYSCALL_64_after_hwframe+0x44/0x0
[  369.859469] RIP: 0033:0x7f4781f323a3
[  369.859471] RSP: 002b:00007ffcf8220408 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  369.859474] RAX: ffffffffffffffda RBX: 0000000000000092 RCX: 00007f4781f323a3
[  369.859475] RDX: 0000000000000092 RSI: 00007ffcf8220530 RDI: 0000000000000003
[  369.859477] RBP: 00007ffcf8220530 R08: 0000000000000000 R09: 0000000000000001
[  369.859478] R10: 0000000000400ad2 R11: 0000000000000246 R12: 0000000000000092
[  369.859480] R13: 0000000002361580 R14: 0000000000000092 R15: 00007f4782201720
[  369.859495] INFO: task CGenericAsyncFi:1927 blocked for more than 122 seconds.
[  369.859497]       Not tainted 5.10.4-00021-gde017294a539-dirty #3
[  369.859498] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  369.859499] task:CGenericAsyncFi state:D stack:    0 pid: 1927 ppid:  1540 flags:0x20020000
[  369.859501] Call Trace:
[  369.859504]  __schedule+0x3db/0x7ee
[  369.859506]  schedule+0x68/0xdc
[  369.859508]  io_schedule+0x46/0x6d
[  369.859510]  wait_on_page_bit_common+0xf8/0x325
[  369.859512]  ? trace_raw_output_file_check_and_advance_wb_err+0x80/0x7d
[  369.859515]  wait_on_page_bit+0x3f/0x41
[  369.859517]  wait_on_page_writeback.part.0+0x35/0x80
[  369.859519]  wait_on_page_writeback+0x27/0x29
[  369.859521]  __filemap_fdatawait_range+0x88/0xd9
[  369.859523]  filemap_write_and_wait_range+0x45/0x80
[  369.859525]  nfs_wb_all+0x30/0x13d
[  369.859528]  nfs4_file_flush+0x85/0xc0
[  369.859530]  filp_close+0x3e/0x78
[  369.859532]  __close_fd+0x23/0x30
[  369.859534]  __ia32_sys_close+0x22/0x50
[  369.859536]  __do_fast_syscall_32+0x5f/0x8e
[  369.859538]  do_fast_syscall_32+0x3d/0x80
[  369.859540]  entry_SYSCALL_compat_after_hwframe+0x45/0x0
[  369.859541] RIP: 0023:0xf7ee6549
[  369.859543] RSP: 002b:000000008c5fddb0 EFLAGS: 00200286 ORIG_RAX: 0000000000000006
[  369.859545] RAX: ffffffffffffffda RBX: 000000000000004d RCX: 0000000000000002
[  369.859546] RDX: 0000000000000000 RSI: 00000000f7d02e70 RDI: 000000000000004d
[  369.859547] RBP: 000000008c5fe110 R08: 0000000000000000 R09: 0000000000000000
[  369.859548] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  369.859549] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  394.423093] nfs: server loom not responding, still trying

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10)
@ 2021-02-13 15:21 Nick Alcock
  0 siblings, 0 replies; 12+ messages in thread
From: Nick Alcock @ 2021-02-13 15:21 UTC (permalink / raw)
  To: NFS list

(I can't get References: right on this mail due to the original aging
out of my mailbox: archive URL,
https://www.spinics.net/lists/linux-nfs/msg81430.html).

I now have a little lockdep info from this hang (and reports from at
least two others that they've seen similar-looking hangs dating back to
4.19, though much harder to reproduce, taking many hours rather than
five minutes: in one case they report not using NFS in production any
more because of this).

Unfortunately the lockdep info isn't much use:

Feb 13 14:13:12 silk warning: : [  888.834464] Showing all locks held in the system:
Feb 13 14:13:12 silk warning: : [  888.834501] 1 lock held by dmesg/1152:
Feb 13 14:13:12 silk warning: : [  888.834508]  #0: ffff980c3b7200d0 (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x49/0x2d1
Feb 13 14:13:12 silk warning: : [  888.834540] 2 locks held by tee/1322:
Feb 13 14:13:12 silk warning: : [  888.834546]  #0: ffff980c0809a430 (sb_writers#12){.+.+}-{0:0}, at: ksys_write+0x6a/0xdc
Feb 13 14:13:12 silk warning: : [  888.834573]  #1: ffff980c3ca7b5e8 (&sb->s_type->i_mutex_key#16){++++}-{3:3}, at: nfs_start_io_write+0x1a/0x45
Feb 13 14:13:12 silk warning: : [  888.834632] 1 lock held by 192.168.16.8-ma/2302:
Feb 13 14:13:12 silk warning: : [  888.834638]  #0: ffff980c0fe6b700 (&acct->lock#2){+.+.}-{3:3}, at: acct_process+0x102/0x2bc

The first of those is my ongoing dmesg -w. The last is process
accounting. The middle one is an ongoing, always-active Xsession-errors
tee over the same NFSv4 connection, which says nothing more than that
writes to this NFS server from this client have hung, which we already
know. There are no signs of locks held by the Steam client which has
hung in the middle of installation.

So whateverthehell this is, it's not blocked on a lock. The NFS client
is hanging all on its own. (I have no idea how clients can block in the
middle of writing if a lock is *not* involved somehow, but that is what
it looks like from the lockdep output.)

Does anyone know how I might start debugging this sod?

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-04-05 16:52 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-03 14:27 steam-associated reproducible hard NFSv4.2 client hang (5.9, 5.10) Nick Alcock
2021-02-23 22:57 ` J. Bruce Fields
2021-02-23 23:58   ` Trond Myklebust
2021-02-24  2:01     ` bfields
2021-04-01 13:33       ` Nix
2021-04-01 13:44         ` bfields
2021-04-01 21:52           ` Nix
2021-04-02 19:20             ` bfields
2021-04-03 22:41               ` Nix
2021-04-05 11:48               ` Nix
2021-04-05 16:52       ` Nix
2021-02-13 15:21 Nick Alcock

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.