* TLINK_ERROR_EXPIRE breakage
@ 2016-09-01 16:26 Ben Harris
[not found] ` <alpine.DEB.2.10.1609011625050.20241-DQa+Qhn4Z596hOdV6ebWgAmrLCtcsW2JsGvUnSjEs/k@public.gmane.org>
0 siblings, 1 reply; 3+ messages in thread
From: Ben Harris @ 2016-09-01 16:26 UTC (permalink / raw)
To: linux-cifs-u79uwXL29TY76Z2rM5mHXA; +Cc: UIS Platforms
We have a system that uses CIFS for users' home directories, and that uses
pam_cifscreds to inject the user's password into the kernel keyring at
login. Using the stock Ubuntu 16.04 kernel (4.4.0-36-generic) we get
problems on console and SSH logins where Bash can't read .bash_profile:
Last login: Thu Sep 1 16:34:17 2016 from 172.24.193.54
-bash: /home/bjh21/.bash_profile: Permission denied
But if I then type "ls", everything seems to be fine.
I think the problem here is that parts of the login process try to access
the user's home directory early (e.g. login(8) tries to read
~/.hushlogin). At that point, the user's password isn't in the kernel
keyring yet, so request_key() fails. Then, when an attempt is made to
access the user's home directory after the password has been injected,
this case in cifs_sb_tlink() fires:
/* return error if we tried this already recently */
if (time_before(jiffies, tlink->tl_time + TLINK_ERROR_EXPIRE)) {
cifs_put_tlink(tlink);
return ERR_PTR(-EACCES);
}
Essentially, the CIFS layer is caching the failed key look-up for a
second even though in the intervening time the kernel has received a
suitable key.
I can demonstrate that the problem is a 1-second timeout by, for instance,
"ssh warg 'sleep 0.9; ls'", where the "ls" fails, but the same command
with a 1.1-second sleep succeeds. More amusingly, a sequence of ls'es
interspersed with "sleep 0.9" can keep the negative cache entry alive
indefinitely.
To work around this problem, I've build a CIFS module that defines
TLINK_ERROR_EXPIRE to -1, which effectively disables the above check.
This seems to have solved our problems. Maybe there are cases where this
negative caching is necessary, though, so a more subtle approach might be
required.
--
Ben Harris, University of Cambridge Information Services.
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: TLINK_ERROR_EXPIRE breakage
[not found] ` <alpine.DEB.2.10.1609011625050.20241-DQa+Qhn4Z596hOdV6ebWgAmrLCtcsW2JsGvUnSjEs/k@public.gmane.org>
@ 2016-09-02 13:28 ` Jeff Layton
[not found] ` <1472822881.23692.9.camel-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org>
0 siblings, 1 reply; 3+ messages in thread
From: Jeff Layton @ 2016-09-02 13:28 UTC (permalink / raw)
To: Ben Harris, linux-cifs-u79uwXL29TY76Z2rM5mHXA; +Cc: UIS Platforms
On Thu, 2016-09-01 at 17:26 +0100, Ben Harris wrote:
> We have a system that uses CIFS for users' home directories, and that uses
> pam_cifscreds to inject the user's password into the kernel keyring at
> login. Using the stock Ubuntu 16.04 kernel (4.4.0-36-generic) we get
> problems on console and SSH logins where Bash can't read .bash_profile:
>
> Last login: Thu Sep 1 16:34:17 2016 from 172.24.193.54
> -bash: /home/bjh21/.bash_profile: Permission denied
>
> But if I then type "ls", everything seems to be fine.
>
> I think the problem here is that parts of the login process try to access
> the user's home directory early (e.g. login(8) tries to read
> ~/.hushlogin). At that point, the user's password isn't in the kernel
> keyring yet, so request_key() fails. Then, when an attempt is made to
> access the user's home directory after the password has been injected,
> this case in cifs_sb_tlink() fires:
>
> /* return error if we tried this already recently */
> if (time_before(jiffies, tlink->tl_time + TLINK_ERROR_EXPIRE)) {
> cifs_put_tlink(tlink);
> return ERR_PTR(-EACCES);
> }
>
> Essentially, the CIFS layer is caching the failed key look-up for a
> second even though in the intervening time the kernel has received a
> suitable key.
>
> I can demonstrate that the problem is a 1-second timeout by, for instance,
> "ssh warg 'sleep 0.9; ls'", where the "ls" fails, but the same command
> with a 1.1-second sleep succeeds. More amusingly, a sequence of ls'es
> interspersed with "sleep 0.9" can keep the negative cache entry alive
> indefinitely.
>
> To work around this problem, I've build a CIFS module that defines
> TLINK_ERROR_EXPIRE to -1, which effectively disables the above check.
> This seems to have solved our problems. Maybe there are cases where this
> negative caching is necessary, though, so a more subtle approach might be
> required.
>
Interesting. It has been a while since that code was written, but IIRC
the main worry was spamming the server (and upcalls) that are just
going to fail.
A more subtle approach may make sense there though. One idea might be
to continue to retry for a period of time, and only back off with
EACCES errors after that. Or, we could just sleep in the kernel for a
bit, and retry there, and only give up after that fails for a while.
I guess it comes down to: "What behavior makes the most sense in the
most situations?"
--
Jeff Layton <jlayton-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: TLINK_ERROR_EXPIRE breakage
[not found] ` <1472822881.23692.9.camel-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org>
@ 2016-09-05 18:12 ` Ben Harris
0 siblings, 0 replies; 3+ messages in thread
From: Ben Harris @ 2016-09-05 18:12 UTC (permalink / raw)
To: Jeff Layton; +Cc: UIS Platforms, linux-cifs-u79uwXL29TY76Z2rM5mHXA
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1379 bytes --]
On Fri, 2 Sep 2016, Jeff Layton wrote:
>> /* return error if we tried this already recently */
>> if (time_before(jiffies, tlink->tl_time + TLINK_ERROR_EXPIRE)) {
>> cifs_put_tlink(tlink);
>> return ERR_PTR(-EACCES);
>> }
>>
>> Essentially, the CIFS layer is caching the failed key look-up for a
>> second even though in the intervening time the kernel has received a
>> suitable key.
>
> Interesting. It has been a while since that code was written, but IIRC
> the main worry was spamming the server (and upcalls) that are just
> going to fail.
The keys subsystem has negative caching for request_key() calls that fail,
and it keeps those much longer than the CIFS timeout. So at least the
upcalls to user space are properly rate-limited.
> A more subtle approach may make sense there though. One idea might be
> to continue to retry for a period of time, and only back off with
> EACCES errors after that. Or, we could just sleep in the kernel for a
> bit, and retry there, and only give up after that fails for a while.
I think the right thing might be to distinguish request_key() failures
from other kinds of failure, and to only apply the rate limit to failures
that didn't come from request_key().
--
Ben Harris, University of Cambridge Information Services.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2016-09-05 18:12 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-01 16:26 TLINK_ERROR_EXPIRE breakage Ben Harris
[not found] ` <alpine.DEB.2.10.1609011625050.20241-DQa+Qhn4Z596hOdV6ebWgAmrLCtcsW2JsGvUnSjEs/k@public.gmane.org>
2016-09-02 13:28 ` Jeff Layton
[not found] ` <1472822881.23692.9.camel-vpEMnDpepFuMZCB2o+C8xQ@public.gmane.org>
2016-09-05 18:12 ` Ben Harris
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.