* [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
@ 2012-06-27 20:05 andros
  2012-06-27 20:05 ` [PATCH 1/1] " andros
  2012-06-28 15:43 ` [PATCH 0/1] " Jeff Layton
  0 siblings, 2 replies; 10+ messages in thread
From: andros @ 2012-06-27 20:05 UTC
  To: trond.myklebust; +Cc: linux-nfs, Andy Adamson

From: Andy Adamson <andros@netapp.com>

Without this patch, attempting to access a Kerberos mount with expired or no
credentials resulted in the NFS client hanging while retrying to refresh creds
forever.

I tested NFSv3/v4/v4.1 sec=krb5 mounts. With expired or non-existent user
Kerberos credentials, trying to ls or cd into the mountpoint resulted in three
failed upcalls to gssd (the first attempt plus two retries, since tk_cred_retry
is set to 2), after which the 'Operation not permitted' message was returned to
the user.
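
The retry count above can be checked with a small userspace sketch. This is not kernel code: `toy_task`, `toy_gssd_upcall()` and `toy_refresh_creds()` are hypothetical names, and the sketch assumes that every refresh attempt costs exactly one gssd upcall and that every upcall fails because the user has no valid ticket. It only mirrors the patched rule of giving up with -EPERM once `tk_cred_retry` reaches zero.

```c
#include <assert.h>
#include <errno.h>

#ifndef EKEYEXPIRED
#define EKEYEXPIRED 127 /* Linux errno value; fallback for other libcs */
#endif

/* Hypothetical stand-in for the relevant fields of struct rpc_task. */
struct toy_task {
	int tk_cred_retry; /* remaining refresh retries, as in the kernel */
	int upcalls;       /* simulated gssd upcalls made so far */
};

/* Simulated gssd upcall: always fails, as for an expired or absent ticket. */
static int toy_gssd_upcall(struct toy_task *task)
{
	task->upcalls++;
	return -EKEYEXPIRED;
}

/*
 * Refresh credentials until an upcall succeeds or the retry budget is
 * spent. Returns 0 on success, or -EPERM when retries run out (the
 * behaviour the patch introduces for -EKEYEXPIRED).
 */
static int toy_refresh_creds(struct toy_task *task)
{
	for (;;) {
		assert(task->tk_cred_retry >= 0);
		if (toy_gssd_upcall(task) == 0)
			return 0;
		if (!task->tk_cred_retry)
			return -EPERM;
		task->tk_cred_retry--;
	}
}
```

Starting from `tk_cred_retry = 2`, `toy_refresh_creds()` makes three upcalls (the first attempt plus two retries) and then returns -EPERM, matching the three failed upcalls reported above.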

I think this patch should go into the stable kernel.

Andy Adamson (1):
  SUNRPC handle EKEYEXPIRED in call_refreshresult

 fs/nfs/nfs4proc.c |    2 --
 net/sunrpc/clnt.c |    4 ++++
 2 files changed, 4 insertions(+), 2 deletions(-)

-- 
1.7.7.6



* [PATCH 1/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-27 20:05 [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult andros
@ 2012-06-27 20:05 ` andros
  2012-07-16 18:44   ` Myklebust, Trond
  2012-06-28 15:43 ` [PATCH 0/1] " Jeff Layton
  1 sibling, 1 reply; 10+ messages in thread
From: andros @ 2012-06-27 20:05 UTC
  To: trond.myklebust; +Cc: linux-nfs, Andy Adamson

From: Andy Adamson <andros@netapp.com>

When an RPCSEC_GSS context has expired or is non-existent, and the user
(Kerberos) credentials have also expired or are non-existent, the client
retries the context refresh forever and the application hangs. The user
is not prompted to refresh or establish their credentials.

Move the -EKEYEXPIRED handling into the RPC layer. Try to refresh the
gss_context tk_cred_retry times, and then pass -EPERM to the application.

Signed-off-by: Andy Adamson <andros@netapp.com>
---
 fs/nfs/nfs4proc.c |    2 --
 net/sunrpc/clnt.c |    4 ++++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5a7b372..2f291b3 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -342,7 +342,6 @@ static int nfs4_handle_exception(struct nfs_server *server, int errorcode, struc
 			}
 		case -NFS4ERR_GRACE:
 		case -NFS4ERR_DELAY:
-		case -EKEYEXPIRED:
 			ret = nfs4_delay(server->client, &exception->timeout);
 			if (ret != 0)
 				break;
@@ -3939,7 +3938,6 @@ nfs4_async_handle_error(struct rpc_task *task, const struct nfs_server *server,
 		case -NFS4ERR_DELAY:
 			nfs_inc_server_stats(server, NFSIOS_DELAY);
 		case -NFS4ERR_GRACE:
-		case -EKEYEXPIRED:
 			rpc_delay(task, NFS4_POLL_RETRY_MAX);
 			task->tk_status = 0;
 			return -EAGAIN;
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index f56f045..a94fc0c 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1334,8 +1334,12 @@ call_refreshresult(struct rpc_task *task)
 		return;
 	case -ETIMEDOUT:
 		rpc_delay(task, 3*HZ);
+	case -EKEYEXPIRED:
+		status = -EPERM;
+		goto cred_retry;
 	case -EAGAIN:
 		status = -EACCES;
+cred_retry:
 		if (!task->tk_cred_retry)
 			break;
 		task->tk_cred_retry--;
-- 
1.7.7.6
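
The error mapping produced by the clnt.c hunk can be modelled in userspace as below. `toy_refresh_error_status()` is a hypothetical helper: it reproduces only the `switch` as posted, uses a flag in place of `rpc_delay()`, and collapses the shared `cred_retry` path into the status reported once `tk_cred_retry` is exhausted. One consequence of C fall-through as the hunk is written: `-ETIMEDOUT` now falls into the new `-EKEYEXPIRED` case, so a timed-out refresh also ends in -EPERM, where before the patch it fell through to the `-EAGAIN` case and ended in -EACCES.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EKEYEXPIRED
#define EKEYEXPIRED 127 /* Linux errno value; fallback for other libcs */
#endif

/*
 * Hypothetical model of the patched switch in call_refreshresult().
 * Returns the status reported once the credential retries are used up;
 * *delayed is set when the -ETIMEDOUT branch would have called rpc_delay().
 * Errors not listed in the hunk are passed through unchanged.
 */
static int toy_refresh_error_status(int err, int *delayed)
{
	int status = err;

	assert(delayed != NULL);
	*delayed = 0;
	switch (err) {
	case -ETIMEDOUT:
		*delayed = 1; /* stands in for rpc_delay(task, 3*HZ); no break */
		/* fall through */
	case -EKEYEXPIRED:
		status = -EPERM;
		break;
	case -EAGAIN:
		status = -EACCES;
		break;
	default:
		break;
	}
	return status;
}
```

For example, `toy_refresh_error_status(-ETIMEDOUT, &d)` returns -EPERM with `d` set to 1, while -EAGAIN still maps to -EACCES. Whether the changed result for -ETIMEDOUT is intended is not discussed in the messages here.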



* Re: [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-27 20:05 [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult andros
  2012-06-27 20:05 ` [PATCH 1/1] " andros
@ 2012-06-28 15:43 ` Jeff Layton
  2012-06-28 16:31   ` Adamson, Andy
  1 sibling, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-06-28 15:43 UTC
  To: andros; +Cc: trond.myklebust, linux-nfs

On Wed, 27 Jun 2012 16:05:34 -0400
andros@netapp.com wrote:

> From: Andy Adamson <andros@netapp.com>
> 
> Without this patch, attempting to access a Kerberos mount with expired or no
> credentials resulted in the NFS client hanging while retrying to refresh creds
> forever.
>
> I tested NFSv3/v4/v4.1 sec=krb5 mounts. With expired or non-existent user
> Kerberos credentials, trying to ls or cd into the mountpoint resulted in three
> failed upcalls to gssd (the first attempt plus two retries, since tk_cred_retry
> is set to 2), after which the 'Operation not permitted' message was returned to
> the user.
> 
> I think this patch should go into the stable kernel.
> 
> Andy Adamson (1):
>   SUNRPC handle EKEYEXPIRED in call_refreshresult
> 
>  fs/nfs/nfs4proc.c |    2 --
>  net/sunrpc/clnt.c |    4 ++++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 

Wait...is this really the behavior you want here?

We had many complaints from users of krb5 mounts where long-running
jobs would routinely fail when the ticket expired.

The compromise behavior that we worked out at that time was to treat an
expired credcache differently from a "no credcache" situation. gssd would
return EKEYEXPIRED if the credcache existed but was expired, and
EACCES otherwise. The kernel would then treat those errors
differently:

    http://permalink.gmane.org/gmane.linux.nfsv4/11019

With EKEYEXPIRED, we'd want RPCs to hang indefinitely until the tickets
were renewed. With EACCES, the call would return an error. The idea
there is that the user would kdestroy if he needed to unwedge his krb5
mount.
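
The pre-patch policy described above splits on whether a credcache exists. A minimal sketch, assuming only the two gssd outcomes named in this message: `toy_gssd_status()` and `toy_kernel_action()` are hypothetical names, and the credcache state is reduced to two booleans. In this model, running kdestroy removes the credcache, which moves the user from the EKEYEXPIRED branch (delay and retry) to the EACCES branch (fail); that is how it unwedges the mount.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EKEYEXPIRED
#define EKEYEXPIRED 127 /* Linux errno value; fallback for other libcs */
#endif

/* What the kernel does with a gssd result under the pre-patch scheme. */
enum toy_action {
	TOY_PROCEED,         /* context established, send the RPC */
	TOY_DELAY_AND_RETRY, /* hang: back off, then try the upcall again */
	TOY_FAIL             /* return an error to the application */
};

/* Hypothetical gssd decision: no credcache -> EACCES, expired -> EKEYEXPIRED. */
static int toy_gssd_status(bool credcache_exists, bool ticket_valid)
{
	if (!credcache_exists)
		return -EACCES;
	if (!ticket_valid)
		return -EKEYEXPIRED;
	return 0;
}

/* Kernel reaction to the gssd result, per the compromise described above. */
static enum toy_action toy_kernel_action(int gssd_status)
{
	switch (gssd_status) {
	case 0:
		return TOY_PROCEED;
	case -EKEYEXPIRED:
		return TOY_DELAY_AND_RETRY;
	default:
		return TOY_FAIL;
	}
}
```

With an expired ticket, `toy_kernel_action(toy_gssd_status(true, false))` yields `TOY_DELAY_AND_RETRY`; after kdestroy, `toy_gssd_status(false, false)` yields -EACCES and the action becomes `TOY_FAIL`. The patch under discussion instead returns an error in the expired case too, after tk_cred_retry retries.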

This patch makes it sound like you're wanting to revert that behavior.
Is that the case? If so, what about people trying to run long-running
tasks on a kerberized mount? Are they just SOL if their ticket isn't
renewed in time?

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-28 15:43 ` [PATCH 0/1] " Jeff Layton
@ 2012-06-28 16:31   ` Adamson, Andy
  2012-06-28 18:03     ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Adamson, Andy @ 2012-06-28 16:31 UTC
  To: Jeff Layton; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>


On Jun 28, 2012, at 11:43 AM, Jeff Layton wrote:

> On Wed, 27 Jun 2012 16:05:34 -0400
> andros@netapp.com wrote:
> 
>> From: Andy Adamson <andros@netapp.com>
>> 
>> Without this patch, attempting to access a Kerberos mount with expired or no
>> credentials resulted in the NFS client hanging while retrying to refresh creds
>> forever.
>>
>> I tested NFSv3/v4/v4.1 sec=krb5 mounts. With expired or non-existent user
>> Kerberos credentials, trying to ls or cd into the mountpoint resulted in three
>> failed upcalls to gssd (the first attempt plus two retries, since tk_cred_retry
>> is set to 2), after which the 'Operation not permitted' message was returned to
>> the user.
>> 
>> I think this patch should go into the stable kernel.
>> 
>> Andy Adamson (1):
>>  SUNRPC handle EKEYEXPIRED in call_refreshresult
>> 
>> fs/nfs/nfs4proc.c |    2 --
>> net/sunrpc/clnt.c |    4 ++++
>> 2 files changed, 4 insertions(+), 2 deletions(-)
>> 
> 
> Wait...is this really the behavior you want here?

Yes. Just having the client hang with no indication to the user is wrong.

> 
> We had many complaints from users of krb5 mounts where long-running
> jobs would routinely fail when the ticket expired.

That is a Kerberos ticket management issue, not an NFS kernel client issue.
If you have long-running jobs, then use kinit -l, run krenew, use a keytab with a
cron job, or use some other credential management software package.


> The compromise behavior that we worked out at that time was to treat an
> expired credcache differently from a "no credcache" situation. gssd would
> return EKEYEXPIRED if the credcache existed but was expired, and
> EACCES otherwise. The kernel would then treat those errors
> differently:

In both cases, EPERM is the correct response from the Linux NFS client, as 
the user has no permissions to do anything in the file system.

> 
>    http://permalink.gmane.org/gmane.linux.nfsv4/11019
> 
> With EKEYEXPIRED, we'd want RPCs to hang indefinitely until the tickets
> were renewed.

Sounds like a good DoS attack. Consider V4.1 and a multi-user machine. If a
user's credentials expire during a heavy I/O run, that user could be using all of
the session slots, and no other user could make progress while the RPCs call
rpc_delay and retry indefinitely...


> With EACCES, the call would return an error. The idea
> there is that the user would kdestroy if he needed to unwedge his krb5
> mount.

Exactly how is the user supposed to know to kdestroy? All they see is a hung mount.
 
> 
> This patch makes it sound like you're wanting to revert that behavior.
> Is that the case?

Yes.

> If so, what about people trying to run long-running
> tasks on a kerberized mount? Are they just SOL if their ticket isn't
> renewed in time?

Yes. As with _any_ resource, you need to plan ahead. As I said above, the
administrator in such a situation needs to set up krenew or the equivalent.

-->Andy


> 
> -- 
> Jeff Layton <jlayton@redhat.com>



* Re: [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-28 16:31   ` Adamson, Andy
@ 2012-06-28 18:03     ` Jeff Layton
  2012-06-28 23:33       ` Andy Adamson
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2012-06-28 18:03 UTC
  To: Adamson, Andy; +Cc: Myklebust, Trond, <linux-nfs@vger.kernel.org>

On Thu, 28 Jun 2012 16:31:41 +0000
"Adamson, Andy" <William.Adamson@netapp.com> wrote:

> 
> On Jun 28, 2012, at 11:43 AM, Jeff Layton wrote:
> 
> > On Wed, 27 Jun 2012 16:05:34 -0400
> > andros@netapp.com wrote:
> > 
> >> From: Andy Adamson <andros@netapp.com>
> >> 
> >> Without this patch, attempting to access a Kerberos mount with expired or no
> >> credentials resulted in the NFS client hanging while retrying to refresh creds
> >> forever.
> >>
> >> I tested NFSv3/v4/v4.1 sec=krb5 mounts. With expired or non-existent user
> >> Kerberos credentials, trying to ls or cd into the mountpoint resulted in three
> >> failed upcalls to gssd (the first attempt plus two retries, since tk_cred_retry
> >> is set to 2), after which the 'Operation not permitted' message was returned to
> >> the user.
> >> 
> >> I think this patch should go into the stable kernel.
> >> 
> >> Andy Adamson (1):
> >>  SUNRPC handle EKEYEXPIRED in call_refreshresult
> >> 
> >> fs/nfs/nfs4proc.c |    2 --
> >> net/sunrpc/clnt.c |    4 ++++
> >> 2 files changed, 4 insertions(+), 2 deletions(-)
> >> 
> > 
> > Wait...is this really the behavior you want here?
> 
> Yes. Just having the client hang with no indication to the user is wrong.
> 

I presume you mean to say that that behavior isn't ideal. I tend to
agree, but there's no good way to report that to the user who can do
anything about it. I'll also point out that this scheme doesn't really
help that either. The user will end up with a failing job, at which
point it's too late to do anything about it...

> > 
> > We had many complaints from users of krb5 mounts where long-running
> > jobs would routinely fail when the ticket expired.
> 
> That is a Kerberos ticket management issue, not an NFS kernel client issue.
> If you have long-running jobs, then use kinit -l, run krenew, use a keytab with a
> cron job, or use some other credential management software package.
> 

Easy to say, far more difficult to do. Most of the people who
complained about the non-robustness of this were people who were
running jobs that took days or weeks. They were understandably upset
when that job failed just because the ticket expired.

> 
> > The compromise behavior that we worked out at that time was to treat an
> > expired credcache differently from a "no credcache" situation. gssd would
> > return EKEYEXPIRED if the credcache existed but was expired, and
> > EACCES otherwise. The kernel would then treat those errors
> > differently:
> 
> In both cases, EPERM is the correct response from the Linux NFS client, as 
> the user has no permissions to do anything in the file system.
> 

But, in the case of an expired ticket, it's quite likely that he had
permissions at some point in time. The rationale at the time was that
if that user could reacquire creds he could keep his job going.

> > 
> >    http://permalink.gmane.org/gmane.linux.nfsv4/11019
> > 
> > With EKEYEXPIRED, we'd want RPCs to hang indefinitely until the tickets
> > were renewed.
> 
> Sounds like a good DoS attack. Consider V4.1 and a multi-user machine. If a
> user's credentials expire during a heavy I/O run, that user could be using all of
> the session slots, and no other user could make progress while the RPCs call
> rpc_delay and retry indefinitely...
> 

Well, no. That was the main reason we handled this in the NFS layer and
not in sunrpc. The rpc_task would exit with EKEYEXPIRED and the NFS
code would treat that like an NFS4ERR_DELAY. Back off and try again
later. Once the task has exited, any resources held in the rpc layer
including the slot should be available.

> 
> > With EACCES, the call would return an error. The idea
> > there is that the user would kdestroy if he needed to unwedge his krb5
> > mount.
> 
> Exactly how is the user supposed to know to kdestroy? All they see is a hung mount.
>  

We do throw a warning when the state manager's ticket expires. Perhaps
we could do something similar from gssd for user tickets. The point is
though that the user has the ability to unwedge the mount without
reacquiring the ticket if he so chooses.

> > 
> > This patch makes it sound like you're wanting to revert that behavior.
> > Is that the case?
> 
> Yes.
> 
> > If so, what about people trying to run long-running
> > tasks on a kerberized mount? Are they just SOL if their ticket isn't
> > renewed in time?
> 
> Yes. As with _any_ resource, you need to plan ahead. As I said above, the
> administrator in such a situation needs to set up krenew or the equivalent.
> 

That's not helpful. Everyone makes mistakes and you don't necessarily
want your job to fail simply due to that fact. But regardless, Trond
NAK'ed a similar idea not that long ago:

    http://marc.info/?l=linux-nfs&m=132161606503398&w=2

...you may want to read over that thread as I'm fairly certain what
you're proposing will have the same issues...

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-28 18:03     ` Jeff Layton
@ 2012-06-28 23:33       ` Andy Adamson
  2012-06-29 20:43         ` Steve Dickson
  0 siblings, 1 reply; 10+ messages in thread
From: Andy Adamson @ 2012-06-28 23:33 UTC
  To: Jeff Layton
  Cc: Adamson, Andy, Myklebust, Trond, <linux-nfs@vger.kernel.org>

On Thu, Jun 28, 2012 at 2:03 PM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 28 Jun 2012 16:31:41 +0000
> "Adamson, Andy" <William.Adamson@netapp.com> wrote:
>
>>
>> On Jun 28, 2012, at 11:43 AM, Jeff Layton wrote:
>>
>> > On Wed, 27 Jun 2012 16:05:34 -0400
>> > andros@netapp.com wrote:
>> >
>> >> From: Andy Adamson <andros@netapp.com>
>> >>
>> >> Without this patch, attempting to access a Kerberos mount with expired or no
>> >> credentials resulted in the NFS client hanging while retrying to refresh creds
>> >> forever.
>> >>
>> >> I tested NFSv3/v4/v4.1 sec=krb5 mounts. With expired or non-existent user
>> >> Kerberos credentials, trying to ls or cd into the mountpoint resulted in three
>> >> failed upcalls to gssd (the first attempt plus two retries, since tk_cred_retry
>> >> is set to 2), after which the 'Operation not permitted' message was returned to
>> >> the user.
>> >>
>> >> I think this patch should go into the stable kernel.
>> >>
>> >> Andy Adamson (1):
>> >>  SUNRPC handle EKEYEXPIRED in call_refreshresult
>> >>
>> >> fs/nfs/nfs4proc.c |    2 --
>> >> net/sunrpc/clnt.c |    4 ++++
>> >> 2 files changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >
>> > Wait...is this really the behavior you want here?
>>
>> Yes. Just having the client hang with no indication to the user is wrong.
>>
>
> I presume you mean to say that that behavior isn't ideal. I tend to
> agree, but there's no good way to report that to the user who can do
> anything about it. I'll also point out that this scheme doesn't really
> help that either. The user will end up with a failing job, at which
> point it's too late to do anything about it...
>
>> >
>> > We had many complaints from users of krb5 mounts where long-running
>> > jobs would routinely fail when the ticket expired.
>>
>> That is a Kerberos ticket management issue, not an NFS kernel client issue.
>> If you have long-running jobs, then use kinit -l, run krenew, use a keytab with a
>> cron job, or use some other credential management software package.
>>
>
> Easy to say, far more difficult to do.

Here is how simple it is.

$ kinit
$ krenew  -K 30&

This simple krenew command renews the current user's TGT every 30 minutes
for as long as the TGT is renewable. For my setup, the TGT can be renewed for
10 days, which is long enough for most long-running jobs. The kinit gets a
fresh TGT before running krenew so that you start with as much time as
possible.

I just tested it. My TGS expires in 30 minutes and my TGT in 1 hour. Three
hours later, I'm still able to access the Kerberos mount.
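
The arithmetic behind this test can be sketched as a toy model of periodic renewal. `toy_access_lost_at()` is a hypothetical helper, times are in minutes, and it assumes every renewal succeeds while the ticket is still valid and before its renewable limit, granting a fresh full lifetime capped at that limit. It renews unconditionally at each interval, which simplifies what krenew actually does; real KDC policy may also differ.

```c
#include <assert.h>

/*
 * Minute at which a ticket finally expires when it is renewed every
 * `interval` minutes (0 means never renewed). `lifetime` is the ticket
 * lifetime and `renew_until` the renewable limit, both measured from t = 0.
 */
static long toy_access_lost_at(long lifetime, long renew_until, long interval)
{
	assert(lifetime > 0 && renew_until > 0);
	long expires = lifetime < renew_until ? lifetime : renew_until;

	if (interval <= 0)
		return expires; /* no renewal daemon running */

	/* A renewal only helps if it happens before the current expiry. */
	for (long t = interval; t < expires; t += interval) {
		long next = t + lifetime;
		expires = next < renew_until ? next : renew_until;
	}
	return expires;
}
```

With the figures above (a 1-hour TGT renewable for 10 days, i.e. 14400 minutes) and a 30-minute interval, `toy_access_lost_at(60, 14400, 30)` returns 14400, well past the 3-hour mark. With no renewal it returns 60, and with a 90-minute interval the first renewal comes too late, so it also returns 60.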

I'm sure most users with long-running jobs could handle this. If you need
to run even longer jobs, simply talk to the Kerberos admin and either get an
extended TGT lifetime, or get a keytab and run a cron job (or use krenew to
run a script).


> Most of the people who
> complained about the non-robustness of this were people who were
> running jobs that took days or weeks. They were understandably upset
> when that job failed just because the ticket expired.

Well, they get what they set up. If they didn't ensure a long enough
ticket lifetime, they shouldn't be surprised when access to the file
system fails.

>>
>> > The compromise behavior that we worked out at that time was to treat an
>> > expired credcache differently from a "no credcache" situation. gssd would
>> > return EKEYEXPIRED if the credcache existed but was expired, and
>> > EACCES otherwise. The kernel would then treat those errors
>> > differently:
>>
>> In both cases, EPERM is the correct response from the Linux NFS client, as
>> the user has no permissions to do anything in the file system.
>>
>
> But, in the case of an expired ticket, it's quite likely that he had
> permissions at some point in time. The rationale at the time was that
> if that user could reacquire creds he could keep his job going.

Exactly! It's just not the NFS client's job to arrange this! It's a
Kerberos Ticket Lifetime Issue. Use the Kerberos Ticket Lifetime Issue
Tools!

>
>> >
>> >    http://permalink.gmane.org/gmane.linux.nfsv4/11019
>> >
>> > With EKEYEXPIRED, we'd want RPCs to hang indefinitely until the tickets
>> > were renewed.
>>
>> Sounds like a good DoS attack. Consider V4.1 and a multi-user machine. If a
>> user's credentials expire during a heavy I/O run, that user could be using all of
>> the session slots, and no other user could make progress while the RPCs call
>> rpc_delay and retry indefinitely...
>>
>
> Well, no. That was the main reason we handled this in the NFS layer and
> not in sunrpc. The rpc_task would exit with EKEYEXPIRED and the NFS
> code would treat that like an NFS4ERR_DELAY. Back off and try again
> later.
> Once the task has exited, any resources held in the rpc layer
> including the slot should be available.

Ah, nfs4_async_handle_error on EKEYEXPIRED calls rpc_delay and then
returns with an EAGAIN error code, so it doesn't call rpc_exit; instead,
the EKEYEXPIRED is handled in the rpc_call_done routine, which frees the
slot, allowing it to be reused.

Fair enough.

After the delay timer expires, the task goes into the rpc_prepare_task
state to get a new slot; in heavy I/O situations where all slots are
being used, it is put on the session slot_tbl_waitq.

So in my example above, with an expired user GSS context on a
multi-user client, I/O gets stacked up in the session slot_tbl_waitq,
which I've seen grow to hundreds of RPC requests depending on the
number of session slots and the amount of I/O. Each I/O will keep
returning to the session slot_tbl_waitq after an upcall to gssd that
fails with -EKEYEXPIRED. Gssd gets pounded, and the session
slot_tbl_waitq is not drained of this user's I/O.

Other users will feel the pain, especially if there is more than one user
on the client irresponsible enough to start a long-running job without
handling their Kerberos credentials.
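
The gssd load argument can be put into rough numbers. A back-of-the-envelope sketch, assuming that each queued RPC belonging to the user with the expired context makes one gssd upcall per delay-and-retry round and that nobody renews the ticket in the meantime; the function names are hypothetical and the round count is a free parameter, not a measured figure.

```c
#include <assert.h>

/* Smaller of two unsigned longs. */
static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Pre-patch: -EKEYEXPIRED is treated like NFS4ERR_DELAY, so every queued
 * RPC retries (and upcalls) once per round, with no upper bound.
 */
static unsigned long toy_upcalls_prepatch(unsigned long queued_rpcs,
					  unsigned long rounds)
{
	return queued_rpcs * rounds;
}

/*
 * Patched: each RPC stops after tk_cred_retry + 1 failed upcalls and
 * returns -EPERM, so the total is capped regardless of elapsed time.
 */
static unsigned long toy_upcalls_patched(unsigned long queued_rpcs,
					 unsigned long rounds,
					 unsigned int tk_cred_retry)
{
	return queued_rpcs * toy_min(rounds, tk_cred_retry + 1UL);
}
```

With, say, 200 queued RPCs (an illustrative figure in the "hundreds" range mentioned above) and `tk_cred_retry = 2`, the patched total stays at 600 upcalls however long the ticket stays expired, while the pre-patch total grows by 200 every round: 20000 after 100 rounds.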

>
>>
>> > With EACCES, the call would return an error. The idea
>> > there is that the user would kdestroy if he needed to unwedge his krb5
>> > mount.
>>
>> Exactly how is the user supposed to know to kdestroy? All they see is a hung mount.
>>
>
> We do throw a warning when the state manager's ticket expires. Perhaps
> we could do something similar from gssd for user tickets. The point is
> though that the user has the ability to unwedge the mount without
> reacquiring the ticket if he so chooses.

And my point is: this is wrong! It is a key management issue! What
if the user doesn't choose to 'unwedge' the mount? Does it hang
forever?

If the user is going to have to make a choice, let them make it up
front, and set up their Kerberos tickets according to their needs.

>
>> >
>> > This patch makes it sound like you're wanting to revert that behavior.
>> > Is that the case?
>>
>> Yes.
>>
>> > If so, what about people trying to run long-running
>> > tasks on a kerberized mount? Are they just SOL if their ticket isn't
>> > renewed in time?
>>
>> Yes. As with _any_ resource, you need to plan ahead. As I said above, the
>> administrator in such a situation needs to set up krenew or the equivalent.
>>
>
> That's not helpful. Everyone makes mistakes and you don't necessarily
> want your job to fail simply due to that fact.

Yes, WRT security, you actually do. There is a reason Kerberos
tickets have a lifetime.
The user needs to be aware of it, and extend or cut off credentials
according to their use case.

> But regardless, Trond
> NAK'ed a similar idea not that long ago:
>
>    http://marc.info/?l=linux-nfs&m=132161606503398&w=2
>
> ...you may want to read over that thread as I'm fairly certain what
> you're proposing will have the same issues...

You're referring to data corruption due to buffered writes or commits
with expired GSS credentials?
I'm preparing patches for this issue as well. I'll resend this patch
with the others that handle the expired GSS creds.

-->Andy

>
> --
> Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-28 23:33       ` Andy Adamson
@ 2012-06-29 20:43         ` Steve Dickson
  2012-06-30 11:00           ` Jeff Layton
  0 siblings, 1 reply; 10+ messages in thread
From: Steve Dickson @ 2012-06-29 20:43 UTC
  To: Andy Adamson
  Cc: Jeff Layton, Adamson, Andy, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>



On 06/28/2012 07:33 PM, Andy Adamson wrote:
> On Thu, Jun 28, 2012 at 2:03 PM, Jeff Layton <jlayton@redhat.com> wrote:
>> On Thu, 28 Jun 2012 16:31:41 +0000
>> "Adamson, Andy" <William.Adamson@netapp.com> wrote:
>>
>>>
>>> On Jun 28, 2012, at 11:43 AM, Jeff Layton wrote:
>>>
>>>> On Wed, 27 Jun 2012 16:05:34 -0400
>>>> andros@netapp.com wrote:
>>>>
>>>>> From: Andy Adamson <andros@netapp.com>
>>>>>
>>>>> Without this patch, attempting to access a Kerberos mount with expired or no
>>>>> credentials resulted in the NFS client hanging while retrying to refresh creds
>>>>> forever.
>>>>>
>>>>> I tested NFSv3/v4/v4.1 sec=krb5 mounts. With expired or non-existent user
>>>>> Kerberos credentials, trying to ls or cd into the mountpoint resulted in three
>>>>> failed upcalls to gssd (the first attempt plus two retries, since tk_cred_retry
>>>>> is set to 2), after which the 'Operation not permitted' message was returned to
>>>>> the user.
>>>>>
>>>>> I think this patch should go into the stable kernel.
>>>>>
>>>>> Andy Adamson (1):
>>>>>  SUNRPC handle EKEYEXPIRED in call_refreshresult
>>>>>
>>>>> fs/nfs/nfs4proc.c |    2 --
>>>>> net/sunrpc/clnt.c |    4 ++++
>>>>> 2 files changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>
>>>> Wait...is this really the behavior you want here?
>>>
>>> Yes. Just having the client hang with no indication to the user is wrong.
+1 I completely agree with this... 
>>>
>>
>> I presume you mean to say that that behavior isn't ideal. I tend to
>> agree, but there's no good way to report that to the user who can do
>> anything about it. I'll also point out that this scheme doesn't really
>> help that either. The user will end up with a failing job, at which
>> point it's too late to do anything about it...
There are mechanisms (krenew, sssd) the user can use to stop the
jobs from failing...

>>
>>>>
>>>> We had many complaints from users of krb5 mounts where long-running
>>>> jobs would routinely fail when the ticket expired.
We also have complaints about things hanging too... ;-) 

>>>
>>> That is a Kerberos ticket management issue, not an NFS kernel client issue.
>>> If you have long-running jobs, then use kinit -l, run krenew, use a keytab with a
>>> cron job, or use some other credential management software package.
Right... Some early preparation to ensure the tickets remain renewable is
the solution...

>>>
>>
>> Easy to say, far more difficult to do.
> 
> Here is how simple it is.
> 
> $ kinit
> $ krenew  -K 30&
If this simple process cannot renew a ticket then it's a bug
in the client... We have to let processes like this renew
renewable tickets!

I've never been a fan of this hang from the beginning, but
if it's getting in the way of things renewing tickets then it's
got to go... IMHO...

Let's push the error up to the app and then educate people
on how to avoid the error...

steved.

> 
> This simple krenew command renews the current user's TGT every 30 minutes
> for as long as the TGT is renewable. For my setup, the TGT can be renewed for
> 10 days, which is long enough for most long-running jobs. The kinit gets a
> fresh TGT before running krenew so that you start with as much time as
> possible.
> 
> I just tested it. My TGS expires in 30 minutes and my TGT in 1 hour. Three
> hours later, I'm still able to access the Kerberos mount.
> 
> I'm sure most users with long-running jobs could handle this. If you need
> to run even longer jobs, simply talk to the Kerberos admin and either get an
> extended TGT lifetime, or get a keytab and run a cron job (or use krenew to
> run a script).
> 
> 
>> Most of the people who
>> complained about the non-robustness of this were people who were
>> running jobs that took days or weeks. They were understandably upset
>> when that job failed just because the ticket expired.
> 
> Well, they get what they set up. If they didn't ensure a long enough
> ticket lifetime, they shouldn't be surprised when access to the file
> system fails.
> 
>>>
>>>> The compromise behavior that we worked out at that time was to treat an
>>>> expired credcache differently from a "no credcache" situation. gssd would
>>>> return EKEYEXPIRED if the credcache existed but was expired, and
>>>> EACCES otherwise. The kernel would then treat those errors
>>>> differently:
>>>
>>> In both cases, EPERM is the correct response from the Linux NFS client, as
>>> the user has no permissions to do anything in the file system.
>>>
>>
>> But, in the case of an expired ticket, it's quite likely that he had
>> permissions at some point in time. The rationale at the time was that
>> if that user could reacquire creds he could keep his job going.
> 
> Exactly! It's just not the NFS client's job to arrange this! It's a
> Kerberos Ticket Lifetime Issue. Use the Kerberos Ticket Lifetime Issue
> Tools!
> 
>>
>>>>
>>>>    http://permalink.gmane.org/gmane.linux.nfsv4/11019
>>>>
>>>> With EKEYEXPIRED, we'd want RPCs to hang indefinitely until the tickets
>>>> were renewed.
>>>
>>> Sounds like a good DoS attack. Consider V4.1 and a multi-user machine. If a
>>> user's credentials expire during a heavy I/O run, that user could be using all of
>>> the session slots, and no other user could make progress while the RPCs call
>>> rpc_delay and retry indefinitely...
>>>
>>
>> Well, no. That was the main reason we handled this in the NFS layer and
>> not in sunrpc. The rpc_task would exit with EKEYEXPIRED and the NFS
>> code would treat that like an NFS4ERR_DELAY. Back off and try again
>> later.
>> Once the task has exited, any resources held in the rpc layer
>> including the slot should be available.
> 
> Ah, nfs4_async_handle_error on EKEYEXPIRED calls rpc_delay and then
> returns with an EAGAIN error code, so it doesn't call rpc_exit; instead,
> the EKEYEXPIRED is handled in the rpc_call_done routine, which frees the
> slot, allowing it to be reused.
> 
> Fair enough.
> 
> After the delay timer expires, the task goes into the rpc_prepare_task
> state to get a new slot; in heavy I/O situations where all slots are
> being used, it is put on the session slot_tbl_waitq.
> 
> So in my example above, with an expired user GSS context on a
> multi-user client, I/O gets stacked up in the session slot_tbl_waitq,
> which I've seen grow to hundreds of RPC requests depending on the
> number of session slots and the amount of I/O. Each I/O will keep
> returning to the session slot_tbl_waitq after an upcall to gssd that
> fails with -EKEYEXPIRED. Gssd gets pounded, and the session
> slot_tbl_waitq is not drained of this user's I/O.
> 
> Other users will feel the pain, especially if there is more than one user
> on the client irresponsible enough to start a long-running job without
> handling their Kerberos credentials.
> 
>>
>>>
>>>> With EACCES, the call would return an error. The idea
>>>> there is that the user would kdestroy if he needed to unwedge his krb5
>>>> mount.
>>>
>>> Exactly how is the user supposed to know to kdestroy? All they see is a hung mount.
>>>
>>
>> We do throw a warning when the state manager's ticket expires. Perhaps
>> we could do something similar from gssd for user tickets. The point is
>> though that the user has the ability to unwedge the mount without
>> reacquiring the ticket if he so chooses.
> 
> And my point is: this is wrong! It is a key management issue! What
> if the user doesn't choose to 'unwedge' the mount? Does it hang
> forever?
>
> If the user is going to have to make a choice, let them make it up
> front, and set up their Kerberos tickets according to their needs.
> 
>>
>>>>
>>>> This patch makes it sound like you're wanting to revert that behavior.
>>>> Is that the case?
>>>
>>> Yes.
>>>
>>>> If so, what about people trying to run long-running
>>>> tasks on a kerberized mount? Are they just SOL if their ticket isn't
>>>> renewed in time?
>>>
>>> Yes. As with _any_ resource, you need to plan ahead. As I said above, the
>>> administrator in such a situation needs to set up krenew or the equivalent.
>>>
>>
>> That's not helpful. Everyone makes mistakes and you don't necessarily
>> want your job to fail simply due to that fact.
> 
> Yes, WRT security, you actually do. There is a reason Kerberos
> tickets have a lifetime.
> The user needs to be aware of it, and extend or cut off credentials
> according to their use case.
> 
>> But regardless, Trond
>> NAK'ed a similar idea not that long ago:
>>
>>    http://marc.info/?l=linux-nfs&m=132161606503398&w=2
>>
>> ...you may want to read over that thread as I'm fairly certain what
>> you're proposing will have the same issues...
> 
> You're referring to data corruption due to buffered writes or commits
> with expired GSS credentials?
> I'm preparing patches for this issue as well. I'll resend this patch
> with the others that handle the expired GSS creds.
> 
> -->Andy
> 
>>
>> --
>> Jeff Layton <jlayton@redhat.com>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-29 20:43         ` Steve Dickson
@ 2012-06-30 11:00           ` Jeff Layton
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff Layton @ 2012-06-30 11:00 UTC (permalink / raw)
  To: Steve Dickson
  Cc: Andy Adamson, Adamson, Andy, Myklebust, Trond,
	<linux-nfs@vger.kernel.org>

On Fri, 29 Jun 2012 16:43:45 -0400
Steve Dickson <SteveD@redhat.com> wrote:

> >>>
> >>
> >> Easy to say, far more difficult to do.
> > 
> > Here is how simple it is.
> > 
> > $ kinit
> > $ krenew  -K 30&
> If this simple process cannot renew a ticket then it's a bug
> in the client... We have to let processes like this renew
> renewable tickets! 
> 
> I've never been a fan of this hang from the beginning but 
> if it's getting in the way of things renewing tickets then it's
> got to go... IMHO...
> 
> Let's push the error up to the app and then educate people
> on how to avoid the error... 
> 

Fair enough. I'd prefer to see this handled better too. If Andy has a
solution for the writeback issues then I'm fine with reverting this
behavior.

If we have kerberized NFS howtos anywhere (do we?) then we probably
need to add some notes about krenewd or how to make sssd handle this.

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [PATCH 1/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-06-27 20:05 ` [PATCH 1/1] " andros
@ 2012-07-16 18:44   ` Myklebust, Trond
  2012-07-16 20:09     ` Adamson, Andy
  0 siblings, 1 reply; 10+ messages in thread
From: Myklebust, Trond @ 2012-07-16 18:44 UTC (permalink / raw)
  To: Adamson, Andy; +Cc: linux-nfs

On Wed, 2012-06-27 at 16:05 -0400, andros@netapp.com wrote:
> From: Andy Adamson <andros@netapp.com>
> 
> When an RPCSEC_GSS context has expired or is non-existent, and the user
> (Kerberos) credentials have also expired or are non-existent, the client
> retries to refresh the context for ever and the application
> hangs. The user is not prompted to refresh/establish their credentials.
> 
> Move the -EKEYEXPIRED handling into the RPC layer. Try tk_cred_retry number
> of times to refresh the gss_context, and then pass -EPERM to application.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> ---
>  fs/nfs/nfs4proc.c |    2 --
>  net/sunrpc/clnt.c |    4 ++++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 5a7b372..2f291b3 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -342,7 +342,6 @@ static int nfs4_handle_exception(struct nfs_server *server, int errorcode, struc
>  			}
>  		case -NFS4ERR_GRACE:
>  		case -NFS4ERR_DELAY:
> -		case -EKEYEXPIRED:
>  			ret = nfs4_delay(server->client, &exception->timeout);
>  			if (ret != 0)
>  				break;
> @@ -3939,7 +3938,6 @@ nfs4_async_handle_error(struct rpc_task *task, const struct nfs_server *server,
>  		case -NFS4ERR_DELAY:
>  			nfs_inc_server_stats(server, NFSIOS_DELAY);
>  		case -NFS4ERR_GRACE:
> -		case -EKEYEXPIRED:
>  			rpc_delay(task, NFS4_POLL_RETRY_MAX);
>  			task->tk_status = 0;
>  			return -EAGAIN;
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index f56f045..a94fc0c 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -1334,8 +1334,12 @@ call_refreshresult(struct rpc_task *task)
>  		return;
>  	case -ETIMEDOUT:
>  		rpc_delay(task, 3*HZ);
> +	case -EKEYEXPIRED:
> +		status = -EPERM;

This needs to be EACCES...

> +		goto cred_retry;
>  	case -EAGAIN:
>  		status = -EACCES;
> +cred_retry:
>  		if (!task->tk_cred_retry)
>  			break;
>  		task->tk_cred_retry--;

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


* Re: [PATCH 1/1] SUNRPC handle EKEYEXPIRED in call_refreshresult
  2012-07-16 18:44   ` Myklebust, Trond
@ 2012-07-16 20:09     ` Adamson, Andy
  0 siblings, 0 replies; 10+ messages in thread
From: Adamson, Andy @ 2012-07-16 20:09 UTC (permalink / raw)
  To: Myklebust, Trond; +Cc: Adamson, Andy, linux-nfs


On Jul 16, 2012, at 2:44 PM, Myklebust, Trond wrote:

> On Wed, 2012-06-27 at 16:05 -0400, andros@netapp.com wrote:
>> From: Andy Adamson <andros@netapp.com>
>> 
>> When an RPCSEC_GSS context has expired or is non-existent, and the user
>> (Kerberos) credentials have also expired or are non-existent, the client
>> retries to refresh the context for ever and the application
>> hangs. The user is not prompted to refresh/establish their credentials.
>> 
>> Move the -EKEYEXPIRED handling into the RPC layer. Try tk_cred_retry number
>> of times to refresh the gss_context, and then pass -EPERM to application.
>> 
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> ---
>> fs/nfs/nfs4proc.c |    2 --
>> net/sunrpc/clnt.c |    4 ++++
>> 2 files changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index 5a7b372..2f291b3 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -342,7 +342,6 @@ static int nfs4_handle_exception(struct nfs_server *server, int errorcode, struc
>> 			}
>> 		case -NFS4ERR_GRACE:
>> 		case -NFS4ERR_DELAY:
>> -		case -EKEYEXPIRED:
>> 			ret = nfs4_delay(server->client, &exception->timeout);
>> 			if (ret != 0)
>> 				break;
>> @@ -3939,7 +3938,6 @@ nfs4_async_handle_error(struct rpc_task *task, const struct nfs_server *server,
>> 		case -NFS4ERR_DELAY:
>> 			nfs_inc_server_stats(server, NFSIOS_DELAY);
>> 		case -NFS4ERR_GRACE:
>> -		case -EKEYEXPIRED:
>> 			rpc_delay(task, NFS4_POLL_RETRY_MAX);
>> 			task->tk_status = 0;
>> 			return -EAGAIN;
>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>> index f56f045..a94fc0c 100644
>> --- a/net/sunrpc/clnt.c
>> +++ b/net/sunrpc/clnt.c
>> @@ -1334,8 +1334,12 @@ call_refreshresult(struct rpc_task *task)
>> 		return;
>> 	case -ETIMEDOUT:
>> 		rpc_delay(task, 3*HZ);
>> +	case -EKEYEXPIRED:
>> +		status = -EPERM;
> 
> This needs to be EACCES…

Agreed. I'll also remove all -EKEYEXPIRED handling from NFS.

-->Andy

> 
>> +		goto cred_retry;
>> 	case -EAGAIN:
>> 		status = -EACCES;
>> +cred_retry:
>> 		if (!task->tk_cred_retry)
>> 			break;
>> 		task->tk_cred_retry--;
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
> 




Thread overview: 10+ messages
2012-06-27 20:05 [PATCH 0/1] SUNRPC handle EKEYEXPIRED in call_refreshresult andros
2012-06-27 20:05 ` [PATCH 1/1] " andros
2012-07-16 18:44   ` Myklebust, Trond
2012-07-16 20:09     ` Adamson, Andy
2012-06-28 15:43 ` [PATCH 0/1] " Jeff Layton
2012-06-28 16:31   ` Adamson, Andy
2012-06-28 18:03     ` Jeff Layton
2012-06-28 23:33       ` Andy Adamson
2012-06-29 20:43         ` Steve Dickson
2012-06-30 11:00           ` Jeff Layton
