Troubles with CIFS and suspend

From: Mikko Rasa @ 2016-12-16 12:45 UTC
  To: linux-cifs-u79uwXL29TY76Z2rM5mHXA

Hi,

I've been using CIFS for my home directory in my local network for a 
while now.  The performance is good but there are some issues with 
suspend which require me to do a full reboot more often than I'd like.

My setup is Samba 4.5.2 on the server and Linux 4.7.2 on the client. 
Kerberos 5 is used for authentication.  The filesystem is mounted with 
multiuser,sec=krb5,username=nobody,iocharset=utf8,soft.

Typical use cases for suspend are putting the computer to sleep for the 
night and hibernating before booting into Windows to play games.

I've observed at least the following symptoms, in order from least to 
most severe:

1. When starting a new terminal, this text appears before the command 
prompt:

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

The message repeats if I try to do things like tab completion. 
Inspecting /proc shows this:

$ ls -l /proc/2055/cwd
lrwxrwxrwx 1 tdb users 0 joulu 16 13:46 /proc/2055/cwd -> /home/tdb (deleted)

Apparently the FS gets remounted at some point and existing processes 
lose their cwd?  A simple cd fixes the issue for that shell so this is 
fairly benign.
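
A minimal sketch of how the affected processes could be listed in one go, 
assuming a Linux /proc layout where the kernel appends " (deleted)" to the 
cwd link target as in the output above.  The function name and the 
proc_root parameter are made up for illustration.

```python
import os


def find_deleted_cwds(proc_root="/proc"):
    """Return sorted (pid, target) pairs for processes whose cwd is reported as deleted."""
    results = []
    for entry in os.listdir(proc_root):
        # Only numeric entries under /proc are process directories.
        if not entry.isdigit():
            continue
        try:
            target = os.readlink(os.path.join(proc_root, entry, "cwd"))
        except OSError:
            # The process exited, or we lack permission to read its cwd link.
            continue
        if target.endswith(" (deleted)"):
            results.append((int(entry), target))
    return sorted(results)
```

Calling find_deleted_cwds() as root should list every process that lost 
its cwd; as a regular user only one's own processes are readable.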

2. Long delays after resume before xscreensaver presents the password 
prompt.  Sometimes it takes in excess of 10 minutes.  This hasn't 
happened for a while though, so perhaps some update or another fixed it.

3. User processes getting stuck after resume.  Some programs are more 
vulnerable than others.  Most common are evince and icedove; I suspect 
this has something to do with gvfs.  I've seen it affect opera, blender 
and steam as well, but much less often.

4. User processes becoming unkillable after resume.  This is a more 
severe case of the above: sometimes the process gets stuck so hard that 
it can't be killed even with SIGKILL.  I try to close the common culprits 
before suspending the system but sometimes I forget.

Just now I have one process which claims to be in running state and eats 
100% CPU time but it's still not possible to kill it.  Attaching to it 
with strace to see what it's doing doesn't work either; strace gets 
stuck as well (but can be killed).

When I've inspected /proc/<pid>/syscall for such processes, the most 
common results are futex and getdents.
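
A rough sketch of how that inspection could be automated, assuming the 
usual formats of /proc/<pid>/stat ("pid (comm) state ...") and 
/proc/<pid>/syscall (syscall number first).  The function names and the 
proc_root and states parameters are hypothetical.  It reports the raw 
syscall number only, since the number-to-name mapping depends on the 
architecture and on whether the task is 32-bit.

```python
import os


def read_task_state(pid, proc_root="/proc"):
    """Return (comm, state) parsed from <proc_root>/<pid>/stat."""
    with open(os.path.join(proc_root, str(pid), "stat")) as f:
        data = f.read()
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start = data.index("(")
    end = data.rindex(")")
    return data[start + 1:end], data[end + 1:].split()[0]


def read_syscall(pid, proc_root="/proc"):
    """Return the first field of <proc_root>/<pid>/syscall, or None if unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "syscall")) as f:
            fields = f.read().split()
    except OSError:
        return None
    return fields[0] if fields else None


def stuck_tasks(proc_root="/proc", states=("D",)):
    """Return sorted (pid, comm, syscall) tuples for tasks in the given states."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            comm, state = read_task_state(entry, proc_root)
        except (OSError, ValueError, IndexError):
            # Process vanished or the stat line was not in the expected form.
            continue
        if state in states:
            results.append((int(entry), comm, read_syscall(entry, proc_root)))
    return sorted(results)
```

By default only tasks in uninterruptible sleep (D) are listed.  A 
spinning process like the one described above shows up as R, so 
states=("D", "R") would be needed to catch it, at the cost of also 
listing healthy running tasks.  Reading another user's syscall file 
generally requires root.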

See the attached traces.txt for typical dmesg output when attempting to 
suspend the system in such a case.

5. kswapd getting stuck after resume.  This spells imminent trouble and 
the computer is likely to become unstable in various ways unless I 
reboot immediately.  Even a clean reboot does not complete: it hangs at 
a late stage and I have to use the physical reset button.

I suspected the problems might have something to do with the Kerberos 
tickets expiring while the computer was suspended, but increasing ticket 
lifetime to a week did not help.

I tried adding a hook to /etc/pm/sleep.d to dump process states to a log 
file in an effort to find out if they are in the stuck state immediately 
after resume.  They aren't, so it seems that the suspend cycle leaves 
the filesystem in a state in which processes get stuck when they try to 
perform particular operations on it.
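
For reference, a minimal sketch of what such a hook could look like.  It 
is not the exact script I used.  It assumes pm-utils passes the action 
("suspend", "resume", "hibernate" or "thaw") as the first argument; the 
log path and the function names are hypothetical.

```python
#!/usr/bin/env python3
import os
import sys
import time

LOG_PATH = "/var/log/resume-task-states.log"  # hypothetical location


def snapshot(proc_root="/proc"):
    """Return one 'pid state comm' line per process, ordered by pid."""
    lines = []
    pids = sorted((e for e in os.listdir(proc_root) if e.isdigit()), key=int)
    for pid in pids:
        try:
            with open(os.path.join(proc_root, pid, "stat")) as f:
                data = f.read()
            # comm is parenthesised and may contain spaces or ')'.
            end = data.rindex(")")
            comm = data[data.index("(") + 1:end]
            state = data[end + 1:].split()[0]
        except (OSError, ValueError, IndexError):
            # Process exited while we were scanning, or unexpected format.
            continue
        lines.append("%s %s %s" % (pid, state, comm))
    return lines


def handle(action, log_path=LOG_PATH, proc_root="/proc"):
    """Append a snapshot to the log after waking up; ignore other actions."""
    if action not in ("resume", "thaw"):
        return False
    with open(log_path, "a") as log:
        log.write("== %s %s ==\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), action))
        for line in snapshot(proc_root):
            log.write(line + "\n")
    return True


if __name__ == "__main__":
    handle(sys.argv[1] if len(sys.argv) > 1 else "")
```

The file would have to be executable and placed in /etc/pm/sleep.d under 
some name such as 99dump-states (also made up).  It captures a single 
snapshot right after wake-up, so it cannot show processes that only get 
stuck later.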

I get this message from the kernel on boot when the filesystem is mounted:

Dec 12 10:13:11 muskrat kernel: CIFS VFS: Autodisabling the use of server inode numbers on \x5c\x5ccapybara.tdb.fi\x5chome. This server doesn't seem to support them properly. Hardlinks will not be recognized on this mount. Consider mounting with the "noserverino" option to silence this message.

Could it be related?  I can see how the lack of persistent inodes could 
lead to the cwd disappearing at least.

What could I do to resolve the issues or at least gain further insight 
into what's causing them?  I could live with having to occasionally 
restart single programs if an unrecoverable error occurs, but having to 
close down everything for a reboot is annoying.

-- 
Mikko

[-- Attachment #2: traces.txt --]
[-- Type: text/plain, Size: 4513 bytes --]

Dec  8 03:21:45 muskrat kernel: Freezing of tasks failed after 20.001 seconds (1 tasks refusing to freeze, wq_busy=0):
Dec  8 03:21:45 muskrat kernel: SimpleCacheWork D ffff880441743c08     0  2411      1 0x00000004
Dec  8 03:21:45 muskrat kernel:  ffff880441743c08 000008625c9ce600 ffff88045a470800 ffff880441744000
Dec  8 03:21:45 muskrat kernel:  ffff88046ecd3f80 7fffffffffffffff ffffffff8136bff0 ffff880441743d28
Dec  8 03:21:45 muskrat kernel:  ffff880441743c20 ffffffff8136b9d0 0000000000000000 ffff880441743c90
Dec  8 03:21:45 muskrat kernel: Call Trace:
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136bff0>] ? bit_wait+0x60/0x60
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136b9d0>] schedule+0x30/0x80
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136df79>] schedule_timeout+0x159/0x1a0
Dec  8 03:21:45 muskrat kernel:  [<ffffffff810a2d6c>] ? ktime_get+0x3c/0xb0
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136bff0>] ? bit_wait+0x60/0x60
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136b38f>] io_schedule_timeout+0x9f/0x110
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136c006>] bit_wait_io+0x16/0x60
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136bcf6>] __wait_on_bit+0x56/0x80
Dec  8 03:21:45 muskrat kernel:  [<ffffffff810dcfcb>] wait_on_page_bit_killable+0xab/0xb0
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8107f3e0>] ? autoremove_wake_function+0x30/0x30
Dec  8 03:21:45 muskrat kernel:  [<ffffffff810dd0ef>] generic_file_read_iter+0x11f/0x740
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8106c8ce>] ? try_to_wake_up+0x1ae/0x2b0
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8106ca1d>] ? wake_up_q+0x2d/0x70
Dec  8 03:21:45 muskrat kernel:  [<ffffffffa2b70c69>] cifs_strict_readv+0xc9/0x100 [cifs]
Dec  8 03:21:45 muskrat kernel:  [<ffffffff811252ca>] __vfs_read+0xba/0x110
Dec  8 03:21:45 muskrat kernel:  [<ffffffff811255b8>] vfs_read+0x88/0x150
Dec  8 03:21:45 muskrat kernel:  [<ffffffff81127581>] SyS_pread64+0x71/0x90
Dec  8 03:21:45 muskrat kernel:  [<ffffffff8136eadb>] entry_SYSCALL_64_fastpath+0x13/0x8f

Dec 16 14:08:53 muskrat kernel: Freezing of tasks failed after 20.005 seconds (2 tasks refusing to freeze, wq_busy=0):
Dec 16 14:08:53 muskrat kernel: PathOfExile.exe R  running task        0 26021      1 0x20020004
Dec 16 14:08:53 muskrat kernel:  ffffffff81139da4 0000000000000000 0100000000000000 ffff880133519298
Dec 16 14:08:53 muskrat kernel:  0000000000023896 ffff880133519240 ffffffff8113b4b0 ffff880180d4be48
Dec 16 14:08:53 muskrat kernel:  ffff880133519240 ffff880180d4be50 ffff880180d4be50 000000000033bcbc
Dec 16 14:08:53 muskrat kernel: Call Trace:
Dec 16 14:08:53 muskrat kernel:  [<ffffffff81139da4>] ? d_walk+0xb4/0x250
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8113b4b0>] ? d_lru_del+0x90/0x90
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8113bf42>] ? shrink_dcache_parent+0x52/0x70
Dec 16 14:08:53 muskrat kernel:  [<ffffffff81133f3d>] ? vfs_rmdir+0x5d/0x120
Dec 16 14:08:53 muskrat kernel:  [<ffffffff811349ab>] ? do_rmdir+0x18b/0x200
Dec 16 14:08:53 muskrat kernel:  [<ffffffff811353f1>] ? SyS_rmdir+0x11/0x20
Dec 16 14:08:53 muskrat kernel:  [<ffffffff81002611>] ? do_fast_syscall_32+0x91/0x150
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8137046c>] ? entry_SYSENTER_compat+0x4c/0x5b
Dec 16 14:08:53 muskrat kernel: steam           D ffff88042d533d98     0 26566      1 0x20020004
Dec 16 14:08:53 muskrat kernel:  ffff88042d533d98 ffff88045c30b020 ffff8801350dc900 ffff88042d534000
Dec 16 14:08:53 muskrat kernel:  ffff8800047a72c0 ffff8800047a72c0 00000000fffffffe ffff88042d533ec0
Dec 16 14:08:53 muskrat kernel:  ffff88042d533db0 ffffffff8136b9d0 ffff8803dfe2b280 ffff88042d533e08
Dec 16 14:08:53 muskrat kernel: Call Trace:
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8136b9d0>] schedule+0x30/0x80
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8136d88e>] rwsem_down_read_failed+0xbe/0x100
Dec 16 14:08:53 muskrat kernel:  [<ffffffffa2c1de22>] ? cifs_revalidate_dentry_attr+0x32/0xf0 [cifs]
Dec 16 14:08:53 muskrat kernel:  [<ffffffff811dc2f8>] call_rwsem_down_read_failed+0x18/0x30
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8136d1bb>] down_read+0x1b/0x20
Dec 16 14:08:53 muskrat kernel:  [<ffffffff811376e0>] iterate_dir+0x40/0x180
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8116cfa3>] compat_SyS_getdents+0x83/0x100
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8116b790>] ? compat_set_fd_set+0x80/0x80
Dec 16 14:08:53 muskrat kernel:  [<ffffffff81002611>] do_fast_syscall_32+0x91/0x150
Dec 16 14:08:53 muskrat kernel:  [<ffffffff8137046c>] entry_SYSENTER_compat+0x4c/0x5b
