User process NFS write hang in wait_on_commit with kworker

* User process NFS write hang in wait_on_commit with kworker
@ 2019-06-18  0:06 Alan Post
  2019-06-18 15:29 ` Benjamin Coddington
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Post @ 2019-06-18  0:06 UTC (permalink / raw)
  To: linux-nfs

On May 20th I reported "User process NFS write hang followed
by automount hang requiring reboot" to this list.  There I
had a process that would hang on NFS write, followed by sync
hanging, eventually leading to my need to reboot the host.

On June 4th, after upgrading to Linux 4.19.44, I reported
the issue resolved.  Since that time, as I've deployed out
Linux 4.19.44, the issue has come back--sort of.

I have begun once again getting sync hangs following a
hung NFS write.  The hung write has a different stack trace
than any I previously reported:

    [<0>] wait_on_commit+0x60/0x90 [nfs]
    [<0>] __nfs_commit_inode+0x146/0x1a0 [nfs]
    [<0>] nfs_file_fsync+0xa7/0x1d0 [nfs]
    [<0>] filp_close+0x25/0x70
    [<0>] put_files_struct+0x66/0xb0
    [<0>] do_exit+0x2af/0xbb0
    [<0>] do_group_exit+0x35/0xa0
    [<0>] __x64_sys_exit_group+0xf/0x10
    [<0>] do_syscall_64+0x45/0x100
    [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [<0>] 0xffffffffffffffff

And there is attendant kworker thread:

    [<0>] wait_on_commit+0x60/0x90 [nfs]
    [<0>] __nfs_commit_inode+0x146/0x1a0 [nfs]
    [<0>] nfs_write_inode+0x5c/0x90 [nfs]
    [<0>] nfs4_write_inode+0xd/0x30 [nfsv4]
    [<0>] __writeback_single_inode+0x27a/0x320
    [<0>] writeback_sb_inodes+0x19a/0x460
    [<0>] wb_writeback+0x102/0x2f0
    [<0>] wb_workfn+0xa3/0x400
    [<0>] process_one_work+0x1e3/0x3d0
    [<0>] worker_thread+0x28/0x3c0
    [<0>] kthread+0x10e/0x130
    [<0>] ret_from_fork+0x35/0x40
    [<0>] 0xffffffffffffffff

Oddly enough, I can clear the problem without rebooting the host.
I arrange to block all traffic between the NFS server and NFS
client using iptables, of sufficient time for any open TCP
connections to timeout.  After which the connection apparently
reestablishes and unblocks the hung process.

I can't explain what's keeping the connection alive but apparently
stalled--requiring my manual intervention.  Do any of you have
ideas or speculation?  I'm happy to poke around in a packet capture
if the information provided isn't sufficient.

-A
-- 
Alan Post | Xen VPS hosting for the technically adept
PO Box 61688 | Sunnyvale, CA 94088-1681 | https://prgmr.com/
email: adp@prgmr.com

^ permalink raw reply	[flat|nested] 9+ messages in thread