* NFSv4 client locks up on larger writes with Kerberos enabled
@ 2019-09-18 23:36 Kevin Vasko
From: Kevin Vasko @ 2019-09-18 23:36 UTC (permalink / raw)
  To: linux-nfs

Hello,


We have a new Dell EMC Unity 300 acting as a NAS server, presenting an
NFSv4 share. Our clients are mostly Ubuntu 18.04.3, but the issue is
also present on CentOS 7.6 systems. We have been struggling with this
for over a week now and are not sure how to resolve it.



We are having trouble with NFS clients completing their writes to the
Dell EMC Unity 300 NFS server when Kerberos is enabled on the NFS
share. I created the NFS share on the U300, associated it with our
FreeIPA server (Kerberos/LDAP), and everything shows as successful.



I can mount the share without a problem on multiple Ubuntu/CentOS
machines. I can ls and move around on the share without issue
(seemingly), and copying files also seems to work (I haven't tested it
extensively, as I have been heavily focused on this issue). However,
once a file gets to roughly 2 GB or larger, clients randomly fail to
complete their writes to the NFS share. _Sometimes_ they do complete,
but for anything over about 3 GB the client locks up and never returns
about 80% of the time I try it.



For example, I can run dd and write ten 1 MB files to the NFS share
back to back with no issues. However, when I try to write a single
3 GB file, it may or may not finish. If it doesn't finish, it simply
locks up the client. Sometimes it does finish, so the behavior is
random, but I would say it is reproducible 70-80% of the time with
files of roughly 3 GB or larger.



Here is some information on the client and what I'm doing to reproduce
the issue. I have tried krb5 and krb5p with the same results. If I
remove the Kerberos requirement on the NFS server and mount with
sec=none, I don't seem to hit the issue.
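
For reference, the mounts I've been comparing look roughly like this
(same export as in the mount output below; the exact option strings
are a sketch rather than copied verbatim):

$ sudo mount -t nfs4 -o vers=4.1,sec=krb5  nfs-server.example.com:/ds1 /data    # hangs on large writes
$ sudo mount -t nfs4 -o vers=4.1,sec=krb5p nfs-server.example.com:/ds1 /data    # hangs on large writes
$ sudo mount -t nfs4 -o vers=4.1,sec=none  nfs-server.example.com:/ds1 /data    # large writes complete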



$ uname -a

5.0.0-27-generic #28~18.04.1-Ubuntu SMP Thu Aug 22 03:00:32 UTC 2019
x86_64 x86_64 x86_64 GNU/Linux



$ mount

nfs-server.example.com:/ds1 on /data type nfs4
(rw,relatime,vers=4.1,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=10.75.40.254,local_lock=none,addr=10.75.37.2)



$ dd if=/dev/zero of=/data/test.img count=1000 bs=1000

1000+0 records in

1000+0 records out

1000000 bytes (1.0 MB, 977 KiB) copied, 0.00613671 s, 163 MB/s

$ dd if=/dev/zero of=/data/test.img bs=1000M count=6

---- HUNG ----



After it hangs, it never returns. The only way I've been able to get
the client to return is to go into the U300 control panel and change
the "SP Owner:" from SPA to SPB (effectively restarting the NFS
server). Once that happens, the client returns.



If I disable Kerberos on the NFS share, I seem to have no problems
writing larger files.



I ran strace on the dd commands, and this is what I saw.



This was a successful write….



$ strace dd if=/dev/zero of=/data/test3.img bs=100M count=1

execve("/bin/dd", ["dd", "if=/dev/zero", "of=/data/test3.img",
"bs=100M", "count=1"], 0x7fff8c03c7f0 /* 21 vars */) = 0

brk(NULL)                               = 0x559b8b39a000

access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

fstat(3, {st_mode=S_IFREG|0644, st_size=93482, ...}) = 0

mmap(NULL, 93482, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f62b6180000

close(3)                                = 0

access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3

read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"...,
832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0

mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f62b617e000

mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f62b5b7f000

mprotect(0x7f62b5d66000, 2097152, PROT_NONE) = 0

mmap(0x7f62b5f66000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f62b5f66000

mmap(0x7f62b5f6c000, 15072, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f62b5f6c000

close(3)                                = 0

arch_prctl(ARCH_SET_FS, 0x7f62b617f540) = 0

mprotect(0x7f62b5f66000, 16384, PROT_READ) = 0

mprotect(0x559b8978b000, 4096, PROT_READ) = 0

mprotect(0x7f62b6197000, 4096, PROT_READ) = 0

munmap(0x7f62b6180000, 93482)           = 0

rt_sigaction(SIGINT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0

rt_sigaction(SIGUSR1, {sa_handler=0x559b8957e0e0, sa_mask=[INT USR1],
sa_flags=SA_RESTORER, sa_restorer=0x7f62b5bbdf20}, NULL, 8) = 0

rt_sigaction(SIGINT, {sa_handler=0x559b8957e0d0, sa_mask=[INT USR1],
sa_flags=SA_RESTORER|SA_NODEFER|SA_RESETHAND,
sa_restorer=0x7f62b5bbdf20}, NULL, 8) = 0

brk(NULL)                               = 0x559b8b39a000

brk(0x559b8b3bb000)                     = 0x559b8b3bb000

openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3

fstat(3, {st_mode=S_IFREG|0644, st_size=4326016, ...}) = 0

mmap(NULL, 4326016, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f62b575e000

close(3)                                = 0

openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3

dup2(3, 0)                              = 0

close(3)                                = 0

lseek(0, 0, SEEK_CUR)                   = 0

openat(AT_FDCWD, "/data/test3.img", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

dup2(3, 1)                              = 1

close(3)                                = 0

mmap(NULL, 104869888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x7f62af35b000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
104857600) = 104857600

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
104857600) = 104857600

close(0)                                = 0

close(1)                                = 0

openat(AT_FDCWD, "/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 0

fstat(0, {st_mode=S_IFREG|0644, st_size=2995, ...}) = 0

read(0, "# Locale name alias data base.\n#"..., 4096) = 2995

read(0, "", 4096)                       = 0

close(0)                                = 0

openat(AT_FDCWD,
"/usr/share/locale/en_US.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) =
-1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale/en_US.utf8/LC_MESSAGES/coreutils.mo", O_RDONLY) =
-1 ENOENT (No such file or directory)

openat(AT_FDCWD, "/usr/share/locale/en_US/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale/en.UTF-8/LC_MESSAGES/coreutils.mo", O_RDONLY) = -1
ENOENT (No such file or directory)

openat(AT_FDCWD, "/usr/share/locale/en.utf8/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD, "/usr/share/locale/en/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale-langpack/en_US.UTF-8/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale-langpack/en_US.utf8/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale-langpack/en_US/LC_MESSAGES/coreutils.mo", O_RDONLY)
= -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale-langpack/en.UTF-8/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale-langpack/en.utf8/LC_MESSAGES/coreutils.mo",
O_RDONLY) = -1 ENOENT (No such file or directory)

openat(AT_FDCWD,
"/usr/share/locale-langpack/en/LC_MESSAGES/coreutils.mo", O_RDONLY) =
0

fstat(0, {st_mode=S_IFREG|0644, st_size=578, ...}) = 0

mmap(NULL, 578, PROT_READ, MAP_PRIVATE, 0, 0) = 0x7f62b6196000

close(0)                                = 0

write(2, "1+0 records in\n1+0 records out\n", 311+0 records in

1+0 records out

) = 31

write(2, "104857600 bytes (105 MB, 100 MiB"..., 61104857600 bytes (105
MB, 100 MiB) copied, 1.01552 s, 103 MB/s) = 61

write(2, "\n", 1

)                       = 1

close(2)                                = 0

exit_group(0)                           = ?

+++ exited with 0 +++



This one was an unsuccessful write; it hung until I restarted the NFS server.



$ strace -f dd if=/dev/zero of=/data/test.img bs=1000M count=6

execve("/bin/dd", ["dd", "if=/dev/zero", "of=/data/test.img",
"bs=1000M", "count=6"], 0x7ffc8ee7e8b8 /* 30 vars */) = 0

brk(NULL)                               = 0x5648bd26a000

access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

fstat(3, {st_mode=S_IFREG|0644, st_size=125550, ...}) = 0

mmap(NULL, 125550, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4f8dda1000

close(3)                                = 0

access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)

openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3

read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"...,
832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0

mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f4f8dd9f000

mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0x7f4f8d7a8000

mprotect(0x7f4f8d98f000, 2097152, PROT_NONE) = 0

mmap(0x7f4f8db8f000, 24576, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f4f8db8f000

mmap(0x7f4f8db95000, 15072, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f4f8db95000

close(3)                                = 0

arch_prctl(ARCH_SET_FS, 0x7f4f8dda0540) = 0

mprotect(0x7f4f8db8f000, 16384, PROT_READ) = 0

mprotect(0x5648bb739000, 4096, PROT_READ) = 0

mprotect(0x7f4f8ddc0000, 4096, PROT_READ) = 0

munmap(0x7f4f8dda1000, 125550)          = 0

rt_sigaction(SIGINT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0

rt_sigaction(SIGUSR1, {sa_handler=0x5648bb52c0e0, sa_mask=[INT USR1],
sa_flags=SA_RESTORER, sa_restorer=0x7f4f8d7e6f20}, NULL, 8) = 0

rt_sigaction(SIGINT, {sa_handler=0x5648bb52c0d0, sa_mask=[INT USR1],
sa_flags=SA_RESTORER|SA_NODEFER|SA_RESETHAND,
sa_restorer=0x7f4f8d7e6f20}, NULL, 8) = 0

brk(NULL)                               = 0x5648bd26a000

brk(0x5648bd28b000)                     = 0x5648bd28b000

openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3

fstat(3, {st_mode=S_IFREG|0644, st_size=11603728, ...}) = 0

mmap(NULL, 11603728, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f4f8cc97000

close(3)                                = 0

openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3

dup2(3, 0)                              = 0

close(3)                                = 0

lseek(0, 0, SEEK_CUR)                   = 0

openat(AT_FDCWD, "/data/test.img", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3

dup2(3, 1)                              = 1

close(3)                                = 0

mmap(NULL, 1048588288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4f4e494000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1048576000) = 1048576000

close(0)                                = 0

close(1







I did some packet tracing. On the smaller writes I would see a bunch
of WRITE calls and replies to those writes, then the client would send
a COMMIT, the NFS server would acknowledge the COMMIT, the client
would issue a CLOSE, and the server would respond to the CLOSE. On the
larger writes that hang, the client would send a bunch of WRITE
packets, the NFS server would respond to them, the client would issue
a COMMIT, the server would respond to the COMMIT, and then nothing
else would happen. No CLOSE was ever seen from the client, so I don't
know whether the client was waiting on the server for some message
before it could issue the CLOSE, or whether the client simply never
sent one.
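
In case it matters, the captures were taken roughly like this (the
interface name is just an example; with sec=krb5p the NFS payload is
encrypted, so the per-operation view only decodes cleanly on the
non-privacy mounts):

$ sudo tcpdump -i eth0 -s 0 -w /tmp/nfs-hang.pcap host nfs-server.example.com and port 2049
$ tshark -r /tmp/nfs-hang.pcap -Y nfs     # shows the WRITE/COMMIT/CLOSE calls and replies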



Once it is in the locked state, if I restart the NFS server as
mentioned above, the hung close(1 call continues, the write finishes,
and dd exits normally.



I did a kernel stack traceback on the NFS client (the lightweight
version of just printing out /proc/*/wchan) while the dd command was
stuck in the "close(1" call, and for the dd process it just shows:



Process /proc/32161/wchan

ep_poll
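
(That was gathered roughly as follows; /proc/<pid>/stack or sysrq-t
would give a fuller kernel trace if that is more useful:)

$ cat /proc/32161/wchan; echo       # 32161 is the stuck dd process
ep_poll
$ sudo cat /proc/32161/stack        # fuller kernel stack, if needed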



I also enabled NFS debugging (roughly as shown below), and while the
write is stuck the client just keeps repeating the section that
follows.
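
The debug flags were switched on with something like this (rpcdebug is
from nfs-utils; the exact flag set is approximate):

$ sudo rpcdebug -m nfs -s all
$ sudo rpcdebug -m rpc -s call xprt auth
# and off again afterwards with:
$ sudo rpcdebug -m nfs -c all
$ sudo rpcdebug -m rpc -c all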



Sep 19 06:25:19 examplehost kernel: [772614.270750] NFS:       commit
(0:52/9439 4096@4609744896) OK

Sep 19 06:25:19 examplehost kernel: [772614.270751] NFS:       commit
(0:52/9439 4096@4609748992) OK

Sep 19 06:25:19 examplehost kernel: [772614.270754] NFS:       commit
(0:52/9439 4096@4609753088) OK

Sep 19 06:25:19 examplehost kernel: [772614.270757] NFS:       commit
(0:52/9439 4096@4609757184) OK

Sep 19 06:25:19 examplehost kernel: [772614.270759] NFS:       commit
(0:52/9439 4096@4609761280) OK

Sep 19 06:25:19 examplehost kernel: [772614.270761] NFS:       commit
(0:52/9439 4096@4609765376) OK

Sep 19 06:25:19 examplehost kernel: [772614.270763] NFS:       commit
(0:52/9439 4096@4609769472) OK

Sep 19 06:25:19 examplehost kernel: [772614.270767] NFS:       commit
(0:52/9439 4096@4609773568) OK

Sep 19 06:25:19 examplehost kernel: [772614.270769] NFS:       commit
(0:52/9439 4096@4609777664) OK

Sep 19 06:25:19 examplehost kernel: [772614.270772] NFS:       commit
(0:52/9439 4096@4609781760) OK

Sep 19 06:25:19 examplehost kernel: [772614.270773] NFS:       commit
(0:52/9439 4096@4609785856) OK

Sep 19 06:25:19 examplehost kernel: [772614.270775] NFS:       commit
(0:52/9439 4096@4609789952) OK

Sep 19 06:25:19 examplehost kernel: [772614.270777] NFS:       commit
(0:52/9439 4096@4609794048) OK

Sep 19 06:25:19 examplehost kernel: [772614.270781] NFS:       commit
(0:52/9439 4096@4609798144) OK

Sep 19 06:25:31 examplehost kernel: [772626.175927] <--
nfs41_proc_async_sequence status=0

Sep 19 06:25:31 examplehost kernel: [772626.176018] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=2907 slotid=0 max_slotid=35
cache_this=0

Sep 19 06:25:31 examplehost kernel: [772626.176572]
nfs41_sequence_process: Error 0 free the slot

Sep 19 06:25:31 examplehost kernel: [772626.176576]
nfs41_sequence_call_done rpc_cred 000000004b01718a

Sep 19 06:25:31 examplehost kernel: [772626.176577] <-- nfs41_sequence_call_done

Sep 19 06:25:43 examplehost kernel: [772638.207976] <--
nfs41_proc_async_sequence status=0

Sep 19 06:25:43 examplehost kernel: [772638.208074] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=2908 slotid=0 max_slotid=35
cache_this=0

Sep 19 06:25:43 examplehost kernel: [772638.208642]
nfs41_sequence_process: Error 0 free the slot

Sep 19 06:25:43 examplehost kernel: [772638.208646]
nfs41_sequence_call_done rpc_cred 000000004b01718a

Sep 19 06:25:43 examplehost kernel: [772638.208648] <-- nfs41_sequence_call_done

Sep 19 06:25:55 examplehost kernel: [772650.240046] <--
nfs41_proc_async_sequence status=0

Sep 19 06:25:55 examplehost kernel: [772650.240129] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=2909 slotid=0 max_slotid=35
cache_this=0

Sep 19 06:25:55 examplehost kernel: [772650.240904]
nfs41_sequence_process: Error 0 free the slot

Sep 19 06:25:55 examplehost kernel: [772650.240908]
nfs41_sequence_call_done rpc_cred 000000004b01718a

Sep 19 06:25:55 examplehost kernel: [772650.240909] <-- nfs41_sequence_call_done





On a small file that actually completes it looks like this…



Sep 19 06:13:29 examplehost kernel: [771904.532059] NFS:       commit
(0:52/9439 4096@7409315840) OK

Sep 19 06:13:29 examplehost kernel: [771904.532061] NFS:       commit
(0:52/9439 4096@7409319936) OK

Sep 19 06:13:29 examplehost kernel: [771904.532064] NFS:       commit
(0:52/9439 4096@7409324032) OK

Sep 19 06:13:29 examplehost kernel: [771904.532066] NFS:       commit
(0:52/9439 4096@7409328128) OK

Sep 19 06:13:29 examplehost kernel: [771904.532069] NFS:       commit
(0:52/9439 4096@7409332224) OK

Sep 19 06:13:29 examplehost kernel: [771904.532071] NFS:       commit
(0:52/9439 4096@7409336320) OK

Sep 19 06:13:29 examplehost kernel: [771904.532073] NFS:       commit
(0:52/9439 4096@7409340416) OK

Sep 19 06:13:29 examplehost kernel: [771904.532076] NFS:       commit
(0:52/9439 4096@7409344512) OK

Sep 19 06:13:29 examplehost kernel: [771904.532078] NFS:       commit
(0:52/9439 4096@7409348608) OK

Sep 19 06:13:29 examplehost kernel: [771904.532081] NFS:       commit
(0:52/9439 4096@7409352704) OK

Sep 19 06:13:29 examplehost kernel: [771904.532083] NFS:       commit
(0:52/9439 4096@7409356800) OK

Sep 19 06:13:29 examplehost kernel: [771904.532086] NFS:       commit
(0:52/9439 4096@7409360896) OK

Sep 19 06:13:29 examplehost kernel: [771904.532088] NFS:       commit
(0:52/9439 4096@7409364992) OK

Sep 19 06:13:29 examplehost kernel: [771904.532090] NFS:       commit
(0:52/9439 4096@7409369088) OK

Sep 19 06:13:29 examplehost kernel: [771904.532093] NFS:       commit
(0:52/9439 4096@7409373184) OK

Sep 19 06:13:29 examplehost kernel: [771904.532171] NFS: release(/test.img)

Sep 19 06:13:29 examplehost kernel: [771904.532193] nfs4_close_prepare: begin!

Sep 19 06:13:29 examplehost kernel: [771904.532197] nfs4_close_prepare: done!

Sep 19 06:13:29 examplehost kernel: [771904.532235] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=6 slotid=0 max_slotid=0
cache_this=1

Sep 19 06:13:29 examplehost kernel: [771904.532525] nfs4_close_done: begin!

Sep 19 06:13:29 examplehost kernel: [771904.532528]
nfs41_sequence_process: Error 0 free the slot

Sep 19 06:13:29 examplehost kernel: [771904.532531] nfs4_close_done:
done, ret = 0!

Sep 19 06:13:29 examplehost kernel: [771904.532555] NFS:
dentry_delete(/test.img, 48084c)

Sep 19 06:13:41 examplehost kernel: [771916.540491] <--
nfs41_proc_async_sequence status=0

Sep 19 06:13:41 examplehost kernel: [771916.540586] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=7 slotid=0 max_slotid=0
cache_this=0

Sep 19 06:13:41 examplehost kernel: [771916.541182]
nfs41_sequence_process: Error 0 free the slot

Sep 19 06:13:41 examplehost kernel: [771916.541186]
nfs41_sequence_call_done rpc_cred 000000004b01718a

Sep 19 06:13:41 examplehost kernel: [771916.541187] <-- nfs41_sequence_call_done

Sep 19 06:13:53 examplehost kernel: [771928.572599] <--
nfs41_proc_async_sequence status=0

Sep 19 06:13:53 examplehost kernel: [771928.572675] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=8 slotid=0 max_slotid=0
cache_this=0

Sep 19 06:13:53 examplehost kernel: [771928.572895] encode_sequence:
sessionid=61:1568410703:305419896:61 seqid=2 slotid=1 max_slotid=1
cache_this=1

Sep 19 06:13:53 examplehost kernel: [771928.573163]
nfs41_sequence_process: Error 0 free the slot

Sep 19 06:13:53 examplehost kernel: [771928.573166]
nfs41_sequence_call_done rpc_cred 000000004b01718a

Sep 19 06:13:53 examplehost kernel: [771928.573167] <-- nfs41_sequence_call_done

Sep 19 06:13:53 examplehost kernel: [771928.573236] decode_attr_type: type=00

Sep 19 06:13:53 examplehost kernel: [771928.573238]
decode_attr_change: change attribute=6738121685800950864

Sep 19 06:13:53 examplehost kernel: [771928.573240] decode_attr_size:
file size=8388608000

Sep 19 06:13:53 examplehost kernel: [771928.573242] decode_attr_fsid:
fsid=(0x0/0x0)

Sep 19 06:13:53 examplehost kernel: [771928.573243] decode_attr_fileid: fileid=0

Sep 19 06:13:53 examplehost kernel: [771928.573245]
decode_attr_fs_locations: fs_locations done, error = 0

Sep 19 06:13:53 examplehost kernel: [771928.573246] decode_attr_mode:
file mode=00



As you can see, the NFS: release(/test.img) comes immediately after
all of the writes in the successful case. In the version that gets
stuck, that "release" never appears.



It's hard for me to do any debugging on the NFS server since it's more
of a "managed" box (i.e. I don't have root and only have access to
certain debugging utilities). I do have a ticket open with Dell and
their engineering team is looking at this, but I'm trying anything I
can to figure this out as soon as possible.




Any help on this matter would be appreciated.

* Re: NFSv4 client locks up on larger writes with Kerberos enabled
@ 2019-09-26 16:25 Kevin Vasko
From: Kevin Vasko @ 2019-09-26 16:25 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Chuck Lever, Trond Myklebust, Linux NFS Mailing List

Bruce,

Yes. It’s a Dell EMC Unity 300 storage appliance. 

From what I have gathered, under the hood it is running SLES12SP1:

3.12.74-60.64.66.1.NEOKERNEL_SLES12SP1 #2 SMP Fri May 17 06:41:36 EDT 2019 x86_64

Unfortunately I only have limited access to the appliance through a “service” account. I do not have root privileges and I can only run certain commands under the service account. 

I've got a ticket open with them and it's been escalated to their
"engineering" department, but they haven't provided much insight yet.

If you give me some idea of what might help, I'll try to get more info
if I can.


-Kevin

> On Sep 26, 2019, at 11:06 AM, Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Thu, Sep 26, 2019 at 08:55:17AM -0700, Chuck Lever wrote:
>>>> On Sep 25, 2019, at 1:07 PM, Bruce Fields <bfields@fieldses.org> wrote:
>>> 
>>> On Wed, Sep 25, 2019 at 11:49:14AM -0700, Chuck Lever wrote:
>>>> Sounds like the NFS server is dropping the connection. With
>>>> GSS enabled, that's usually a sign that the GSS window has
>>>> overflowed.
>>> 
>>> Would that show up in the rpc statistics on the client somehow?
>> 
>> More likely on the server. The client just sees a disconnect
>> without any explanation attached.
> 
> So watching a count of disconnects might at least tell us something?
> 
>> gss_verify_header is where the checking is done on the server.
>> Disappointingly, I see some dprintk's in there, but no static
>> trace events.
> 
> Kevin, was this a Linux server?
> 
> --b.
> 
>>> In that case--I seem to remember there's a way to configure the size of
>>> the client's slot table, maybe lowering that (decreasing the number of
>>> rpc's allowed to be outstanding at a time) would work around the
>>> problem.
>> 
>>> Should the client be doing something different to avoid or recover from
>>> overflows of the gss window?
>> 
>> The client attempts to meter the request stream so that it stays
>> within the bounds of the GSS sequence number window. The stream
>> of requests is typically unordered coming out of the transmit
>> queue.
>> 
>> There is some new code (since maybe v5.0?) that handles the
>> metering: gss_xmit_need_reencode().
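
For what it's worth, if I understand the slot-table suggestion above
correctly, capping the number of outstanding RPCs on the client would
look roughly like one of these (the value 16 is only an illustration,
and I'm not sure which knob is the relevant one here):

$ # sunrpc transport slots:
$ echo 16 | sudo tee /proc/sys/sunrpc/tcp_max_slot_table_entries
$ # or the NFSv4.1 session slots (module parameter, set before mounting):
$ echo "options nfs max_session_slots=16" | sudo tee /etc/modprobe.d/nfsclient.conf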
