linux-nfs.vger.kernel.org archive mirror
* Commit which exposes blocked tasks with NFSv4.0 and Kerberos
@ 2018-06-24 20:30 Armin Größlinger
  2018-06-24 20:56 ` Trond Myklebust
From: Armin Größlinger @ 2018-06-24 20:30 UTC
  To: linux-nfs

Hello NFS developers,

I've written to this list before [1],[2] concerning uninterruptible hung
tasks in clients using NFSv4.0 with Kerberos. I have also written
scripts (which can be cloned from [3]) which help to reproduce the hangs
by configuring two virtual machines with the required setup and a test
program which triggers the hangs rather quickly (see [2] for details).

Meanwhile, I have been able to do some bisecting of kernel sources to
find a commit which exposes the hangs. It seems that since commit

2aca5b869ace67a63aab895659e5dc14c33a4d6e
SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT

(introduced with v3.18-rc1) the uninterruptible hangs occur. When I
revert this commit, then I do not observe the uninterruptible hangs.
I've tested this on Ubuntu 16.04's 4.4 kernel and Debian 9's 4.9 kernel
and several stock kernels.

In our group at the university, we have about 15 desktop machines and 70
nodes in a SLURM cluster. Without reverting the commit, we've had on
average one machine per day locking up with uninterruptible hanging
tasks reported by the kernel; for about 6 weeks now, we have been running only kernels
with the commit reverted (i.e., Debian/Ubuntu's kernel recompiled after
reverting the patch) and we have not had any NFS-related machine lockups
so far.
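
A minimal sketch of how such lockups could be spotted in a saved kernel log, assuming the usual hung-task detector message of the form "INFO: task NAME:PID blocked for more than N seconds." The function name find_hung_tasks and the sample log lines are hypothetical and are not part of the scripts in [3]:

```python
import re

# Assumed format of the kernel's hung-task detector message, e.g.
#   INFO: task cp:4321 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (command, pid, seconds) for each hung-task report in log_text."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            hits.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hits


# Hypothetical sample lines, made up for illustration only.
SAMPLE_LOG = """\
[ 1234.567890] INFO: task cp:4321 blocked for more than 120 seconds.
[ 1234.567900]       Not tainted 4.9.0 #1
[ 1300.000000] nfs: server example not responding, still trying
"""

if __name__ == "__main__":
    for comm, pid, secs in find_hung_tasks(SAMPLE_LOG):
        print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Feeding it the output of dmesg or a saved kern.log from each client would give a rough count of lockups per machine; it only recognizes the message format assumed above.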

I'm not claiming that the mentioned commit is the cause of the problem;
I think it exposes the problem. The problem is also present in current
kernels. Unfortunately, there seems to be another problem which can be
triggered by my test program from [3]. Since commit

9b30889c548a4d45bfe6226e58de32504c1d682f
SUNRPC: Ensure we always close the socket after a connection shuts down

(introduced with v4.16-rc1) it is very likely that the system dies due
to an out of memory condition, i.e., at some point the kernel consumes
all the memory and the OOM killer kills all user processes. When this
commit is reverted, I can observe the uninterruptible hung tasks again
(with kernel up to 4.18-rc2).

Since I have no expertise in the NFS client implementation, I'm still
hoping that experts on this list have an idea how to fix the NFS
client's behavior.

Regards,
Armin



[1] https://marc.info/?l=linux-nfs&m=150620442017672

[2] https://marc.info/?l=linux-nfs&m=152396752525579

[3] https://gitlab.infosun.fim.uni-passau.de/groessli/nfs-krb5-vms


Thread overview: 3+ messages
2018-06-24 20:30 Commit which exposes blocked tasks with NFSv4.0 and Kerberos Armin Größlinger
2018-06-24 20:56 ` Trond Myklebust
2018-06-25 20:51   ` Armin Größlinger
