linux-nfs.vger.kernel.org archive mirror
* Fedora 32 rpc.gssd misbehavior
@ 2020-07-29 17:19 Chuck Lever
  2020-07-29 18:27 ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2020-07-29 17:19 UTC
  To: Jeff Layton, Bruce Fields; +Cc: Simo Sorce, Linux NFS Mailing List

Hi!

I recently updated my test systems from EL7 to Fedora 32, and
NFSv4.0 with Kerberos has stopped working.

I mount with "klimt.ib" as before. The client workload stops
dead when the server tries to perform its first CB_RECALL.

I added some client instrumentation:

   kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
   kernel: NFS: NFSv4 callback contains invalid cred

I boosted gssd verbosity, and it says:

   rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib

But it knows the full hostname for the server:

   rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'


The acceptor appears to come from the Kerberos library. Shouldn't
it be canonicalized? If so, should the Kerberos library do it, or
should gssd? Since this behavior appeared after an upgrade, I
suspect a Kerberos library regression. But it could be config-
related, since both systems were re-imaged from the ground up.
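
To make the mismatch concrete, here is a minimal Python sketch of the check that the kernel message suggests. It assumes the client compares the two names as exact strings (the log implies this, but the sketch is not the kernel's code). The helpers `canonicalize` and `acceptor_matches` are hypothetical, and the hostname mapping is copied from the rpc.gssd log line above.

```python
def canonicalize(principal, full_names):
    """Replace the host part of service@host with its full name, if known."""
    service, _, host = principal.partition("@")
    return f"{service}@{full_names.get(host, host)}"


def acceptor_matches(callback_principal, acceptor):
    """Assumed check: the two names must be identical strings."""
    return callback_principal == acceptor


# Hostname mapping taken from the rpc.gssd log line in this report.
FULL_NAMES = {"klimt.ib": "klimt.ib.1015granger.net"}

callback_principal = "nfs@klimt.ib.1015granger.net"
acceptor = "nfs@klimt.ib"

# The acceptor as reported by gssd, then the same acceptor canonicalized.
print(acceptor_matches(callback_principal, acceptor))
print(acceptor_matches(callback_principal, canonicalize(acceptor, FULL_NAMES)))
```

Under that assumption, the callback would be accepted if whichever component produces the acceptor canonicalized the hostname first.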

Also noticing some other problems on the server (missing hostname
strings in debug messages, sssd_kcm infinite loops, and gssd
sending garbage to the client after the NULL request that
establishes the callback context).

But let's look at the client acceptor problem first.


--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-29 17:19 Fedora 32 rpc.gssd misbehavior Chuck Lever
@ 2020-07-29 18:27 ` Chuck Lever
  2020-07-30 14:43   ` Steve Dickson
  2020-07-30 16:14   ` Simo Sorce
  0 siblings, 2 replies; 15+ messages in thread
From: Chuck Lever @ 2020-07-29 18:27 UTC
  To: Jeff Layton, Bruce Fields; +Cc: Simo Sorce, Linux NFS Mailing List



> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> Hi!
> 
> I recently updated my test systems from EL7 to Fedora 32, and
> NFSv4.0 with Kerberos has stopped working.
> 
> I mount with "klimt.ib" as before. The client workload stops
> dead when the server tries to perform its first CB_RECALL.
> 
> I added some client instrumentation:
> 
>   kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>   kernel: NFS: NFSv4 callback contains invalid cred
> 
> I boosted gssd verbosity, and it says:
> 
>   rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
> 
> But it knows the full hostname for the server:
> 
>   rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
> 
> 
> The acceptor appears to come from the Kerberos library. Shouldn't
> it be canonicalized? If so, should the Kerberos library do it, or
> should gssd? Since this behavior appeared after an upgrade, I
> suspect a Kerberos library regression. But it could be config-
> related, since both systems were re-imaged from the ground up.
> 
> Also noticing some other problems on the server (missing hostname
> strings in debug messages, sssd_kcm infinite loops, and gssd
> sending garbage to the client after the NULL request that
> establishes the callback context).
> 
> But let's look at the client acceptor problem first.

I believe I found the problem.

8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
options to /etc/nfs.conf, including "avoid-dns". The default setting of
avoid-dns is 1. When I set this option on my client system explicitly to 0,
NFSv4.0 with Kerberos works again.

Is there a reason the default setting is 1?
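
For reference, a minimal Python sketch of how that setting might be read. It assumes the option lives in a "[gssd]" section of /etc/nfs.conf and uses Python's configparser as a stand-in for the parser in nfs-utils, which is a separate implementation. The helper name `avoid_dns_enabled` is hypothetical, and the default of 1 is the one described above.

```python
import configparser

# Fragment as it might appear in /etc/nfs.conf on the client.
NFS_CONF_FRAGMENT = """
[gssd]
avoid-dns = 0
"""


def avoid_dns_enabled(conf_text, default=True):
    """Return the effective avoid-dns setting from nfs.conf-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if parser.has_option("gssd", "avoid-dns"):
        return parser.getboolean("gssd", "avoid-dns")
    # Option not present: fall back to the default (1, per this thread).
    return default


print(avoid_dns_enabled(NFS_CONF_FRAGMENT))
print(avoid_dns_enabled("[gssd]\n"))
```

The first call reflects the explicit override; the second shows the default applying when the option is absent.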


--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-29 18:27 ` Chuck Lever
@ 2020-07-30 14:43   ` Steve Dickson
  2020-07-30 16:14   ` Simo Sorce
  1 sibling, 0 replies; 15+ messages in thread
From: Steve Dickson @ 2020-07-30 14:43 UTC
  To: Chuck Lever, Jeff Layton, Bruce Fields; +Cc: Simo Sorce, Linux NFS Mailing List



On 7/29/20 2:27 PM, Chuck Lever wrote:
> 
> 
>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> Hi!
>>
>> I recently updated my test systems from EL7 to Fedora 32, and
>> NFSv4.0 with Kerberos has stopped working.
>>
>> I mount with "klimt.ib" as before. The client workload stops
>> dead when the server tries to perform its first CB_RECALL.
>>
>> I added some client instrumentation:
>>
>>   kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>   kernel: NFS: NFSv4 callback contains invalid cred
>>
>> I boosted gssd verbosity, and it says:
>>
>>   rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>
>> But it knows the full hostname for the server:
>>
>>   rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>
>>
>> The acceptor appears to come from the Kerberos library. Shouldn't
>> it be canonicalized? If so, should the Kerberos library do it, or
>> should gssd? Since this behavior appeared after an upgrade, I
>> suspect a Kerberos library regression. But it could be config-
>> related, since both systems were re-imaged from the ground up.
>>
>> Also noticing some other problems on the server (missing hostname
>> strings in debug messages, sssd_kcm infinite loops, and gssd
>> sending garbage to the client after the NULL request that
>> establishes the callback context).
>>
>> But let's look at the client acceptor problem first.
> 
> I believe I found the problem.
> 
> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
> with Kerberos works again.
Strange... What is failing in rpc.gssd when it is set to 1?
Maybe it has something to do with your DNS setup?

> 
> Is there a reason the default setting is 1?
Looking back... It's always been set to 1. So no good reason ;-)

steved.
>
> 
> --
> Chuck Lever
> 
> 
> 



* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-29 18:27 ` Chuck Lever
  2020-07-30 14:43   ` Steve Dickson
@ 2020-07-30 16:14   ` Simo Sorce
  2020-07-30 17:08     ` Robbie Harwood
  2020-07-30 17:09     ` Chuck Lever
  1 sibling, 2 replies; 15+ messages in thread
From: Simo Sorce @ 2020-07-30 16:14 UTC
  To: Chuck Lever, Jeff Layton, Bruce Fields
  Cc: Linux NFS Mailing List, Robbie Harwood

On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
> > On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > 
> > Hi!
> > 
> > I recently updated my test systems from EL7 to Fedora 32, and
> > NFSv4.0 with Kerberos has stopped working.
> > 
> > I mount with "klimt.ib" as before. The client workload stops
> > dead when the server tries to perform its first CB_RECALL.
> > 
> > I added some client instrumentation:
> > 
> >   kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
> >   kernel: NFS: NFSv4 callback contains invalid cred
> > 
> > I boosted gssd verbosity, and it says:
> > 
> >   rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
> > 
> > But it knows the full hostname for the server:
> > 
> >   rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
> > 
> > 
> > The acceptor appears to come from the Kerberos library. Shouldn't
> > it be canonicalized? If so, should the Kerberos library do it, or
> > should gssd? Since this behavior appeared after an upgrade, I
> > suspect a Kerberos library regression. But it could be config-
> > related, since both systems were re-imaged from the ground up.
> > 
> > Also noticing some other problems on the server (missing hostname
> > strings in debug messages, sssd_kcm infinite loops, and gssd
> > sending garbage to the client after the NULL request that
> > establishes the callback context).
> > 
> > But let's look at the client acceptor problem first.
> 
> I believe I found the problem.
> 
> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
> with Kerberos works again.
> 
> Is there a reason the default setting is 1?
> 

Now that you mention DNS, this may be an interaction between a new
default in Fedora 32 and how your environment is set up for DNS.

In F32 we changed the option dns_canonicalize_hostname from 'true' to
'fallback'.
This is a transitional state to eventually move it to 'false' at some
point in the future.

What it changes in practice is that it will first try the name passed
in *as is* and only as a fallback try a CNAME if the name passed is not
resolved as an A name. If you have principals in the KDC for both
names, but you do not have keys in the keytab for both, you can have
transitional issues.

Additionally we discovered a bug that causes unqualified names to
fail resolution with the 'fallback' option.
If the name in your principal really is unqualified, it will try to
qualify it anyway, so if your principal is literally nfs/foo@FOO,
libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
what is defined in the resolv.conf search path.

We are trying to address this regression.

So try to set dns_canonicalize_hostname to true to see if that may
influence your issue. If so, please let me know, as we still need to
address this where possible.
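
The difference between the three settings can be sketched in a few lines of Python. This is only a model of the behavior described above, not the library's code: `candidate_hostnames` and `fake_dns` are hypothetical names, and the lookup table stands in for a real DNS query, using the hostname pair from the gssd log quoted above.

```python
def candidate_hostnames(name, mode, canonicalize):
    """Return the hostnames to try, in order, for a given setting.

    mode is one of "true", "fallback", or "false"; canonicalize is a
    caller-supplied function standing in for the DNS lookup.
    """
    if mode == "true":
        return [canonicalize(name)]
    if mode == "fallback":
        # Try the name exactly as given first, then the canonical form.
        # (This sketch looks the name up eagerly to keep the code short;
        # a real lookup would only be needed if the first attempt fails.)
        canonical = canonicalize(name)
        return [name] if canonical == name else [name, canonical]
    if mode == "false":
        return [name]
    raise ValueError(f"unknown mode: {mode!r}")


def fake_dns(name):
    # Stand-in for a real lookup; mapping taken from the gssd log.
    return {"klimt.ib": "klimt.ib.1015granger.net"}.get(name, name)


for mode in ("true", "fallback", "false"):
    print(mode, candidate_hostnames("klimt.ib", mode, fake_dns))
```

In this model, 'fallback' only reaches the canonical name if the attempt with the name as given does not succeed, which is where principals registered under both names can cause surprises.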

Simo.

-- 
Simo Sorce
RHEL Crypto Team
Red Hat, Inc






* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 16:14   ` Simo Sorce
@ 2020-07-30 17:08     ` Robbie Harwood
  2020-07-30 17:59       ` Chuck Lever
  2020-07-30 17:09     ` Chuck Lever
  1 sibling, 1 reply; 15+ messages in thread
From: Robbie Harwood @ 2020-07-30 17:08 UTC
  To: Simo Sorce, Chuck Lever, Jeff Layton, Bruce Fields; +Cc: Linux NFS Mailing List


Simo Sorce <simo@redhat.com> writes:

> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>> > On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>> > 
>> > Hi!
>> > 
>> > I recently updated my test systems from EL7 to Fedora 32, and
>> > NFSv4.0 with Kerberos has stopped working.
>> > 
>> > I mount with "klimt.ib" as before. The client workload stops
>> > dead when the server tries to perform its first CB_RECALL.
>> > 
>> > I added some client instrumentation:
>> > 
>> >   kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>> >   kernel: NFS: NFSv4 callback contains invalid cred
>> > 
>> > I boosted gssd verbosity, and it says:
>> > 
>> >   rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>> > 
>> > But it knows the full hostname for the server:
>> > 
>> >   rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>> > 
>> > 
>> > The acceptor appears to come from the Kerberos library. Shouldn't
>> > it be canonicalized? If so, should the Kerberos library do it, or
>> > should gssd? Since this behavior appeared after an upgrade, I
>> > suspect a Kerberos library regression. But it could be config-
>> > related, since both systems were re-imaged from the ground up.
>> > 
>> > Also noticing some other problems on the server (missing hostname
>> > strings in debug messages, sssd_kcm infinite loops, and gssd
>> > sending garbage to the client after the NULL request that
>> > establishes the callback context).
>> > 
>> > But let's look at the client acceptor problem first.
>> 
>> I believe I found the problem.
>> 
>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>> with Kerberos works again.
>> 
>> Is there a reason the default setting is 1?
>> 
>
> Now that you mention DNS, this may be an interaction between a new
> default in Fedora 32 and how your environment is setup re DNS.
>
> In F32 we changed the option dns_canonicalize_hostname from 'true' to
> 'fallback'.
> This is a transitional state to eventually move it to 'false' at some
> point in the future.
>
> What it changes in practice is that it will first try the name passed
> in *as is* and only as a fallback try a CNAME if the name passed is not
> resolved as an A name. If you have principals in the KDC for both
> names, but you do not have keys in the keytab for both, you can have
> transitional issues.
>
> Additionally we discovered a bug that causes non qualified names to
> fail resolution with the 'fallback' option.
> If your name in the principal is really not qualified it will try to
> qualify it anyway, so if your principal is literally nfs/foo@FOO
> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
> what is defined in resolv.conf search path.
>
> We are trying to address this regression.
>
> So try to set dns_canonicalize_hostname to true to see if that may
> influence your issue. If so, please let me know, as we still need to
> address this where possible.

Also, please try setting `qualify_shortname = ""`.  (I did update the
config file we ship with Fedora, but upstream's default turns that on.
This is a temporary workaround while we merge something better
upstream.)
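
As a rough illustration of what that option controls, here is a minimal Python sketch. It assumes, based on a reading of the krb5.conf documentation, that qualification applies only to single-component hostnames and that an empty string disables it. The function below is a hypothetical model, not the library's implementation.

```python
def qualify_shortname(hostname, suffix):
    """Append suffix to a single-component hostname; leave others alone.

    An empty suffix models qualify_shortname = "" (qualification disabled).
    """
    if not suffix or "." in hostname:
        return hostname
    return f"{hostname}.{suffix}"


print(qualify_shortname("foo", "my.domain"))
print(qualify_shortname("foo", ""))
print(qualify_shortname("klimt.ib", "1015granger.net"))
```

Note that "klimt.ib" already contains a dot, so under this assumption it would not be treated as a short name either way.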

Thanks,
--Robbie



* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 16:14   ` Simo Sorce
  2020-07-30 17:08     ` Robbie Harwood
@ 2020-07-30 17:09     ` Chuck Lever
  2020-07-30 17:57       ` Simo Sorce
  1 sibling, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2020-07-30 17:09 UTC
  To: Simo Sorce
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List, Robbie Harwood



> On Jul 30, 2020, at 12:14 PM, Simo Sorce <simo@redhat.com> wrote:
> 
> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>> 
>>> Hi!
>>> 
>>> I recently updated my test systems from EL7 to Fedora 32, and
>>> NFSv4.0 with Kerberos has stopped working.
>>> 
>>> I mount with "klimt.ib" as before. The client workload stops
>>> dead when the server tries to perform its first CB_RECALL.
>>> 
>>> I added some client instrumentation:
>>> 
>>>  kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>>  kernel: NFS: NFSv4 callback contains invalid cred
>>> 
>>> I boosted gssd verbosity, and it says:
>>> 
>>>  rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>> 
>>> But it knows the full hostname for the server:
>>> 
>>>  rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>> 
>>> 
>>> The acceptor appears to come from the Kerberos library. Shouldn't
>>> it be canonicalized? If so, should the Kerberos library do it, or
>>> should gssd? Since this behavior appeared after an upgrade, I
>>> suspect a Kerberos library regression. But it could be config-
>>> related, since both systems were re-imaged from the ground up.
>>> 
>>> Also noticing some other problems on the server (missing hostname
>>> strings in debug messages, sssd_kcm infinite loops, and gssd
>>> sending garbage to the client after the NULL request that
>>> establishes the callback context).
>>> 
>>> But let's look at the client acceptor problem first.
>> 
>> I believe I found the problem.
>> 
>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>> with Kerberos works again.
>> 
>> Is there a reason the default setting is 1?
>> 
> 
> Now that you mention DNS, this may be an interaction between a new
> default in Fedora 32 and how your environment is setup re DNS.
> 
> In F32 we changed the option dns_canonicalize_hostname from 'true' to
> 'fallback'.
> This is a transitional state to eventually move it to 'false' at some
> point in the future.
> 
> What it changes in practice is that it will first try the name passed
> in *as is* and only as a fallback try a CNAME if the name passed is not
> resolved as an A name. If you have principals in the KDC for both
> names, but you do not have keys in the keytab for both, you can have
> transitional issues.
> 
> Additionally we discovered a bug that causes non qualified names to
> fail resolution with the 'fallback' option.
> If your name in the principal is really not qualified it will try to
> qualify it anyway, so if your principal is literally nfs/foo@FOO
> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
> what is defined in resolv.conf search path.
> 
> We are trying to address this regression.
> 
> So try to set dns_canonicalize_hostname to true to see if that may
> influence your issue. If so, please let me know, as we still need to
> address this where possible.

I set avoid-dns to 1 and dns_canonicalize_hostname to true. The
workload hang is not reproducible, and the acceptor is fully qualified.

rpc.gssd[965]: doing downcall: lifetime_rec=86338 acceptor=nfs@klimt.ib.1015granger.net


--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 17:09     ` Chuck Lever
@ 2020-07-30 17:57       ` Simo Sorce
  2020-07-30 18:07         ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Simo Sorce @ 2020-07-30 17:57 UTC
  To: Chuck Lever
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List, Robbie Harwood

On Thu, 2020-07-30 at 13:09 -0400, Chuck Lever wrote:
> > On Jul 30, 2020, at 12:14 PM, Simo Sorce <simo@redhat.com> wrote:
> > 
> > On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
> > > > On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > > > 
> > > > Hi!
> > > > 
> > > > I recently updated my test systems from EL7 to Fedora 32, and
> > > > NFSv4.0 with Kerberos has stopped working.
> > > > 
> > > > I mount with "klimt.ib" as before. The client workload stops
> > > > dead when the server tries to perform its first CB_RECALL.
> > > > 
> > > > I added some client instrumentation:
> > > > 
> > > >  kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
> > > >  kernel: NFS: NFSv4 callback contains invalid cred
> > > > 
> > > > I boosted gssd verbosity, and it says:
> > > > 
> > > >  rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
> > > > 
> > > > But it knows the full hostname for the server:
> > > > 
> > > >  rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
> > > > 
> > > > 
> > > > The acceptor appears to come from the Kerberos library. Shouldn't
> > > > it be canonicalized? If so, should the Kerberos library do it, or
> > > > should gssd? Since this behavior appeared after an upgrade, I
> > > > suspect a Kerberos library regression. But it could be config-
> > > > related, since both systems were re-imaged from the ground up.
> > > > 
> > > > Also noticing some other problems on the server (missing hostname
> > > > strings in debug messages, sssd_kcm infinite loops, and gssd
> > > > sending garbage to the client after the NULL request that
> > > > establishes the callback context).
> > > > 
> > > > But let's look at the client acceptor problem first.
> > > 
> > > I believe I found the problem.
> > > 
> > > 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
> > > options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
> > > dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
> > > with Kerberos works again.
> > > 
> > > Is there a reason the default setting is 1?
> > > 
> > 
> > Now that you mention DNS, this may be an interaction between a new
> > default in Fedora 32 and how your environment is setup re DNS.
> > 
> > In F32 we changed the option dns_canonicalize_hostname from 'true' to
> > 'fallback'.
> > This is a transitional state to eventually move it to 'false' at some
> > point in the future.
> > 
> > What it changes in practice is that it will first try the name passed
> > in *as is* and only as a fallback try a CNAME if the name passed is not
> > resolved as an A name. If you have principals in the KDC for both
> > names, but you do not have keys in the keytab for both, you can have
> > transitional issues.
> > 
> > Additionally we discovered a bug that causes non qualified names to
> > fail resolution with the 'fallback' option.
> > If your name in the principal is really not qualified it will try to
> > qualify it anyway, so if your principal is literally nfs/foo@FOO
> > libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
> > what is defined in resolv.conf search path.
> > 
> > We are trying to address this regression.
> > 
> > So try to set dns_canonicalize_hostname to true to see if that may
> > influence your issue. If so, please let me know, as we still need to
> > address this where possible.
> 
> I set avoid-dns to 1 and dns_canonicalize_hostname to true. The
> workload hang is not reproducible, and the acceptor is fully qualified.
> 
> rpc.gssd[965]: doing downcall: lifetime_rec=86338 acceptor=nfs@klimt.ib.1015granger.net

Chuck,
can you tell me what klimt.ib.1015granger.net resolves to (A names or
CNAMEs; I'm not really interested in the IP address)?
Also, what ticket do you ultimately get in the ccache when this request
is made?
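
One way to check the first question from the client is to ask the system resolver for the canonical name. The Python sketch below uses socket.getaddrinfo with AI_CANONNAME. The helper `canonical_name` is a hypothetical name, and `fake_resolver` with its placeholder address exists only so the example runs without DNS access.

```python
import socket


def canonical_name(host, resolver=socket.getaddrinfo):
    """Return the canonical name the resolver reports for host."""
    results = resolver(host, None, flags=socket.AI_CANONNAME)
    # With AI_CANONNAME, the first result carries the canonical name
    # in the fourth field of the tuple.
    canonname = results[0][3]
    return canonname or host


def fake_resolver(host, port, flags=0):
    # Hypothetical stand-in for DNS; the address is a documentation
    # placeholder, not a real host.
    return [(socket.AF_INET, socket.SOCK_STREAM, 6,
             "klimt.ib.1015granger.net", ("192.0.2.1", 0))]


print(canonical_name("klimt.ib", resolver=fake_resolver))
```

Calling canonical_name("klimt.ib") without the resolver argument performs a real lookup on the client; dig or "getent hosts" answers the same question from a shell.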

Simo.

-- 
Simo Sorce
RHEL Crypto Team
Red Hat, Inc






* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 17:08     ` Robbie Harwood
@ 2020-07-30 17:59       ` Chuck Lever
  2020-07-30 19:10         ` Simo Sorce
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2020-07-30 17:59 UTC
  To: Robbie Harwood, Simo Sorce
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List



> On Jul 30, 2020, at 1:08 PM, Robbie Harwood <rharwood@redhat.com> wrote:
> 
> Simo Sorce <simo@redhat.com> writes:
> 
>> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>>>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> Hi!
>>>> 
>>>> I recently updated my test systems from EL7 to Fedora 32, and
>>>> NFSv4.0 with Kerberos has stopped working.
>>>> 
>>>> I mount with "klimt.ib" as before. The client workload stops
>>>> dead when the server tries to perform its first CB_RECALL.
>>>> 
>>>> I added some client instrumentation:
>>>> 
>>>>  kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>>>  kernel: NFS: NFSv4 callback contains invalid cred
>>>> 
>>>> I boosted gssd verbosity, and it says:
>>>> 
>>>>  rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>>> 
>>>> But it knows the full hostname for the server:
>>>> 
>>>>  rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>>> 
>>>> 
>>>> The acceptor appears to come from the Kerberos library. Shouldn't
>>>> it be canonicalized? If so, should the Kerberos library do it, or
>>>> should gssd? Since this behavior appeared after an upgrade, I
>>>> suspect a Kerberos library regression. But it could be config-
>>>> related, since both systems were re-imaged from the ground up.
>>>> 
>>>> Also noticing some other problems on the server (missing hostname
>>>> strings in debug messages, sssd_kcm infinite loops, and gssd
>>>> sending garbage to the client after the NULL request that
>>>> establishes the callback context).
>>>> 
>>>> But let's look at the client acceptor problem first.
>>> 
>>> I believe I found the problem.
>>> 
>>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>>> with Kerberos works again.
>>> 
>>> Is there a reason the default setting is 1?
>>> 
>> 
>> Now that you mention DNS, this may be an interaction between a new
>> default in Fedora 32 and how your environment is setup re DNS.
>> 
>> In F32 we changed the option dns_canonicalize_hostname from 'true' to
>> 'fallback'.
>> This is a transitional state to eventually move it to 'false' at some
>> point in the future.
>> 
>> What it changes in practice is that it will first try the name passed
>> in *as is* and only as a fallback try a CNAME if the name passed is not
>> resolved as an A name. If you have principals in the KDC for both
>> names, but you do not have keys in the keytab for both, you can have
>> transitional issues.
>> 
>> Additionally we discovered a bug that causes non qualified names to
>> fail resolution with the 'fallback' option.
>> If your name in the principal is really not qualified it will try to
>> qualify it anyway, so if your principal is literally nfs/foo@FOO
>> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
>> what is defined in resolv.conf search path.
>> 
>> We are trying to address this regression.
>> 
>> So try to set dns_canonicalize_hostname to true to see if that may
>> influence your issue. If so, please let me know, as we still need to
>> address this where possible.
> 
> Also, please try setting `qualify_shortname = ""`.  (I did update the
> config file we ship with Fedora, but upstream's default turns that on.
> This is a temporary workaround while we merge something better
> upstream.)

For completeness, I tried:

avoid-dns = 1
dns_canonicalize_hostname = fallback
qualify_shortname = ""

which is the default configuration out of the shrink wrap.

The workload hangs as before, and the acceptor is unqualified:

rpc.gssd[985]: doing downcall: lifetime_rec=84046 acceptor=nfs@klimt.ib


The test is:

Configured domain name is "1015granger.net"

Fully-qualified client hostname is "manet.ib.1015granger.net"

Fully-qualified server hostname is "klimt.ib.1015granger.net"

mount command is "mount -o vers=4.0,sec=sys klimt.ib:/export /mnt"

In this case, both systems have keytabs and service principals, so
the client automatically attempts to establish a GSS context for
lease management and callback operations. The failure occurs because
the server's principal is nfs@klimt.ib.1015granger.net but the
acceptor now matches the server hostname from the mount command line,
which is not always fully qualified.


--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 17:57       ` Simo Sorce
@ 2020-07-30 18:07         ` Chuck Lever
  2020-07-30 18:20           ` Simo Sorce
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2020-07-30 18:07 UTC
  To: Simo Sorce
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List, Robbie Harwood



> On Jul 30, 2020, at 1:57 PM, Simo Sorce <simo@redhat.com> wrote:
> 
> On Thu, 2020-07-30 at 13:09 -0400, Chuck Lever wrote:
>>> On Jul 30, 2020, at 12:14 PM, Simo Sorce <simo@redhat.com> wrote:
>>> 
>>> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>>>>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>> 
>>>>> Hi!
>>>>> 
>>>>> I recently updated my test systems from EL7 to Fedora 32, and
>>>>> NFSv4.0 with Kerberos has stopped working.
>>>>> 
>>>>> I mount with "klimt.ib" as before. The client workload stops
>>>>> dead when the server tries to perform its first CB_RECALL.
>>>>> 
>>>>> I added some client instrumentation:
>>>>> 
>>>>> kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>>>> kernel: NFS: NFSv4 callback contains invalid cred
>>>>> 
>>>>> I boosted gssd verbosity, and it says:
>>>>> 
>>>>> rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>>>> 
>>>>> But it knows the full hostname for the server:
>>>>> 
>>>>> rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>>>> 
>>>>> 
>>>>> The acceptor appears to come from the Kerberos library. Shouldn't
>>>>> it be canonicalized? If so, should the Kerberos library do it, or
>>>>> should gssd? Since this behavior appeared after an upgrade, I
>>>>> suspect a Kerberos library regression. But it could be config-
>>>>> related, since both systems were re-imaged from the ground up.
>>>>> 
>>>>> Also noticing some other problems on the server (missing hostname
>>>>> strings in debug messages, sssd_kcm infinite loops, and gssd
>>>>> sending garbage to the client after the NULL request that
>>>>> establishes the callback context).
>>>>> 
>>>>> But let's look at the client acceptor problem first.
>>>> 
>>>> I believe I found the problem.
>>>> 
>>>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>>>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>>>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>>>> with Kerberos works again.
>>>> 
>>>> Is there a reason the default setting is 1?
>>>> 
>>> 
>>> Now that you mention DNS, this may be an interaction between a new
>>> default in Fedora 32 and how your environment is setup re DNS.
>>> 
>>> In F32 we changed the option dns_canonicalize_hostname from 'true' to
>>> 'fallback'.
>>> This is a transitional state to eventually move it to 'false' at some
>>> point in the future.
>>> 
>>> What it changes in practice is that it will first try the name passed
>>> in *as is* and only as a fallback try a CNAME if the name passed is not
>>> resolved as an A name. If you have principals in the KDC for both
>>> names, but you do not have keys in the keytab for both, you can have
>>> transitional issues.
>>> 
>>> Additionally we discovered a bug that causes non qualified names to
>>> fail resolution with the 'fallback' option.
>>> If your name in the principal is really not qualified it will try to
>>> qualify it anyway, so if your principal is literally nfs/foo@FOO
>>> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
>>> what is defined in resolv.conf search path.
>>> 
>>> We are trying to address this regression.
>>> 
>>> So try to set dns_canonicalize_hostname to true to see if that may
>>> influence your issue. If so, please let me know, as we still need to
>>> address this where possible.
>> 
>> I set avoid-dns to 1 and dns_canonicalize_hostname to true. The
>> workload hang is not reproducible, and the acceptor is fully qualified.
>> 
>> rpc.gssd[965]: doing downcall: lifetime_rec=86338 acceptor=nfs@klimt.ib.1015granger.net
> 
> Chuck,
> can you tell what does klimt.ib.1015granger.net resolve to (A names
> CNAMEs, not really interested in IP address)?

[root@manet ~]# dig klimt.ib.1015granger.net

; <<>> DiG 9.11.20-RedHat-9.11.20-1.fc32 <<>> klimt.ib.1015granger.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55806
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 0a7a07d8b06eedeab886314f5f230a8c6f752fe4a24c2f97 (good)
;; QUESTION SECTION:
;klimt.ib.1015granger.net.	IN	A

;; ANSWER SECTION:
klimt.ib.1015granger.net. 10800	IN	A	192.168.2.55

;; AUTHORITY SECTION:
ib.1015granger.net.	10800	IN	NS	gateway.1015granger.net.

;; ADDITIONAL SECTION:
gateway.1015granger.net. 10800	IN	A	192.168.1.1

;; Query time: 0 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Thu Jul 30 13:59:40 EDT 2020
;; MSG SIZE  rcvd: 135

[root@manet ~]#


> Also what ticket do you ultimately get in the ccache when this request
> is made ?

I'm not exactly sure what you're asking, but:

[root@manet ~]# klist FILE:/tmp/krb5ccmachine_1015GRANGER.NET
Ticket cache: FILE:/tmp/krb5ccmachine_1015GRANGER.NET
Default principal: host/manet.1015granger.net@1015GRANGER.NET

Valid starting       Expires              Service principal
07/30/2020 13:45:38  07/31/2020 13:45:38  krbtgt/1015GRANGER.NET@1015GRANGER.NET
	renew until 08/06/2020 13:45:38
[root@manet ~]#


--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 18:07         ` Chuck Lever
@ 2020-07-30 18:20           ` Simo Sorce
  2020-07-30 18:29             ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Simo Sorce @ 2020-07-30 18:20 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List, Robbie Harwood

On Thu, 2020-07-30 at 14:07 -0400, Chuck Lever wrote:
> > On Jul 30, 2020, at 1:57 PM, Simo Sorce <simo@redhat.com> wrote:
> > 
> > On Thu, 2020-07-30 at 13:09 -0400, Chuck Lever wrote:
> > > > On Jul 30, 2020, at 12:14 PM, Simo Sorce <simo@redhat.com> wrote:
> > > > 
> > > > On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
> > > > > > On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > > > > > 
> > > > > > Hi!
> > > > > > 
> > > > > > I recently updated my test systems from EL7 to Fedora 32, and
> > > > > > NFSv4.0 with Kerberos has stopped working.
> > > > > > 
> > > > > > I mount with "klimt.ib" as before. The client workload stops
> > > > > > dead when the server tries to perform its first CB_RECALL.
> > > > > > 
> > > > > > I added some client instrumentation:
> > > > > > 
> > > > > > kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
> > > > > > kernel: NFS: NFSv4 callback contains invalid cred
> > > > > > 
> > > > > > I boosted gssd verbosity, and it says:
> > > > > > 
> > > > > > rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
> > > > > > 
> > > > > > But it knows the full hostname for the server:
> > > > > > 
> > > > > > rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
> > > > > > 
> > > > > > 
> > > > > > The acceptor appears to come from the Kerberos library. Shouldn't
> > > > > > it be canonicalized? If so, should the Kerberos library do it, or
> > > > > > should gssd? Since this behavior appeared after an upgrade, I
> > > > > > suspect a Kerberos library regression. But it could be config-
> > > > > > related, since both systems were re-imaged from the ground up.
> > > > > > 
> > > > > > Also noticing some other problems on the server (missing hostname
> > > > > > strings in debug messages, sssd_kcm infinite loops, and gssd
> > > > > > sending garbage to the client after the NULL request that
> > > > > > establishes the callback context).
> > > > > > 
> > > > > > But let's look at the client acceptor problem first.
> > > > > 
> > > > > I believe I found the problem.
> > > > > 
> > > > > 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
> > > > > options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
> > > > > dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
> > > > > with Kerberos works again.
> > > > > 
> > > > > Is there a reason the default setting is 1?
> > > > > 
> > > > 
> > > > Now that you mention DNS, this may be an interaction between a new
> > > > default in Fedora 32 and how your environment is setup re DNS.
> > > > 
> > > > In F32 we changed the option dns_canonicalize_hostname from 'true' to
> > > > 'fallback'.
> > > > This is a transitional state to eventually move it to 'false' at some
> > > > point in the future.
> > > > 
> > > > What it changes in practice is that it will first try the name passed
> > > > in *as is* and only as a fallback try a CNAME if the name passed is not
> > > > resolved as an A name. If you have principals in the KDC for both
> > > > names, but you do not have keys in the keytab for both, you can have
> > > > transitional issues.
> > > > 
> > > > Additionally we discovered a bug that causes non qualified names to
> > > > fail resolution with the 'fallback' option.
> > > > If your name in the principal is really not qualified it will try to
> > > > qualify it anyway, so if your principal is literally nfs/foo@FOO
> > > > libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
> > > > what is defined in resolv.conf search path.
> > > > 
> > > > We are trying to address this regression.
> > > > 
> > > > So try to set dns_canonicalize_hostname to true to see if that may
> > > > influence your issue. If so, please let me know, as we still need to
> > > > address this where possible.
> > > 
> > > I set avoid-dns to 1 and dns_canonicalize_hostname to true. The
> > > workload hang is not reproducible, and the acceptor is fully qualified.
> > > 
> > > rpc.gssd[965]: doing downcall: lifetime_rec=86338 acceptor=nfs@klimt.ib.1015granger.net
> > 
> > Chuck,
> > can you tell what does klimt.ib.1015granger.net resolve to (A names
> > CNAMEs, not really interested in IP address)?
> 
> [root@manet ~]# dig klimt.ib.1015granger.net
> 
> ; <<>> DiG 9.11.20-RedHat-9.11.20-1.fc32 <<>> klimt.ib.1015granger.net
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55806
> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2
> 
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4096
> ; COOKIE: 0a7a07d8b06eedeab886314f5f230a8c6f752fe4a24c2f97 (good)
> ;; QUESTION SECTION:
> ;klimt.ib.1015granger.net.	IN	A
> 
> ;; ANSWER SECTION:
> klimt.ib.1015granger.net. 10800	IN	A	192.168.2.55
> 
> ;; AUTHORITY SECTION:
> ib.1015granger.net.	10800	IN	NS	gateway.1015granger.net.
> 
> ;; ADDITIONAL SECTION:
> gateway.1015granger.net. 10800	IN	A	192.168.1.1
> 
> ;; Query time: 0 msec
> ;; SERVER: 192.168.1.1#53(192.168.1.1)
> ;; WHEN: Thu Jul 30 13:59:40 EDT 2020
> ;; MSG SIZE  rcvd: 135
> 
> [root@manet ~]#

so klimt is an A name, and it is a fqdn in the principal, so I am
puzzled why the krb5.conf option would make a difference, unless
192.168.2.55 perhaps resolves to a different name and you have rdns =
true ?

> 
> > Also what ticket do you ultimately get in the ccache when this request
> > is made ?
> 
> I'm not exactly sure what you're asking, but:
> 
> [root@manet ~]# klist FILE:/tmp/krb5ccmachine_1015GRANGER.NET
> Ticket cache: FILE:/tmp/krb5ccmachine_1015GRANGER.NET
> Default principal: host/manet.1015granger.net@1015GRANGER.NET
> 
> Valid starting       Expires              Service principal
> 07/30/2020 13:45:38  07/31/2020 13:45:38  krbtgt/1015GRANGER.NET@1015GRANGER.NET
> 	renew until 08/06/2020 13:45:38
> [root@manet ~]#

I was asking which ticket you end up using to talk to klimt. This is the
machine ccache, but it only holds the krbtgt for the machine host key.


And now that I reread the thread better I see:

Callback principal (nfs@klimt.ib.1015granger.net) does not match
acceptor (nfs@klimt.ib)

do you use klimt.ib explicitly somewhere instead of the fqdn and rely on
name expansion ?

Simo.

-- 
Simo Sorce
RHEL Crypto Team
Red Hat, Inc






* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 18:20           ` Simo Sorce
@ 2020-07-30 18:29             ` Chuck Lever
  2020-07-30 18:55               ` Simo Sorce
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2020-07-30 18:29 UTC (permalink / raw)
  To: Simo Sorce
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List, Robbie Harwood



> On Jul 30, 2020, at 2:20 PM, Simo Sorce <simo@redhat.com> wrote:
> 
> On Thu, 2020-07-30 at 14:07 -0400, Chuck Lever wrote:
>>> On Jul 30, 2020, at 1:57 PM, Simo Sorce <simo@redhat.com> wrote:
>>> 
>>> On Thu, 2020-07-30 at 13:09 -0400, Chuck Lever wrote:
>>>>> On Jul 30, 2020, at 12:14 PM, Simo Sorce <simo@redhat.com> wrote:
>>>>> 
>>>>> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>>>>>>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>> 
>>>>>>> Hi!
>>>>>>> 
>>>>>>> I recently updated my test systems from EL7 to Fedora 32, and
>>>>>>> NFSv4.0 with Kerberos has stopped working.
>>>>>>> 
>>>>>>> I mount with "klimt.ib" as before. The client workload stops
>>>>>>> dead when the server tries to perform its first CB_RECALL.
>>>>>>> 
>>>>>>> I added some client instrumentation:
>>>>>>> 
>>>>>>> kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>>>>>> kernel: NFS: NFSv4 callback contains invalid cred
>>>>>>> 
>>>>>>> I boosted gssd verbosity, and it says:
>>>>>>> 
>>>>>>> rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>>>>>> 
>>>>>>> But it knows the full hostname for the server:
>>>>>>> 
>>>>>>> rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>>>>>> 
>>>>>>> 
>>>>>>> The acceptor appears to come from the Kerberos library. Shouldn't
>>>>>>> it be canonicalized? If so, should the Kerberos library do it, or
>>>>>>> should gssd? Since this behavior appeared after an upgrade, I
>>>>>>> suspect a Kerberos library regression. But it could be config-
>>>>>>> related, since both systems were re-imaged from the ground up.
>>>>>>> 
>>>>>>> Also noticing some other problems on the server (missing hostname
>>>>>>> strings in debug messages, sssd_kcm infinite loops, and gssd
>>>>>>> sending garbage to the client after the NULL request that
>>>>>>> establishes the callback context).
>>>>>>> 
>>>>>>> But let's look at the client acceptor problem first.
>>>>>> 
>>>>>> I believe I found the problem.
>>>>>> 
>>>>>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>>>>>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>>>>>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>>>>>> with Kerberos works again.
>>>>>> 
>>>>>> Is there a reason the default setting is 1?
>>>>>> 
>>>>> 
>>>>> Now that you mention DNS, this may be an interaction between a new
>>>>> default in Fedora 32 and how your environment is setup re DNS.
>>>>> 
>>>>> In F32 we changed the option dns_canonicalize_hostname from 'true' to
>>>>> 'fallback'.
>>>>> This is a transitional state to eventually move it to 'false' at some
>>>>> point in the future.
>>>>> 
>>>>> What it changes in practice is that it will first try the name passed
>>>>> in *as is* and only as a fallback try a CNAME if the name passed is not
>>>>> resolved as an A name. If you have principals in the KDC for both
>>>>> names, but you do not have keys in the keytab for both, you can have
>>>>> transitional issues.
>>>>> 
>>>>> Additionally we discovered a bug that causes non qualified names to
>>>>> fail resolution with the 'fallback' option.
>>>>> If your name in the principal is really not qualified it will try to
>>>>> qualify it anyway, so if your principal is literally nfs/foo@FOO
>>>>> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
>>>>> what is defined in resolv.conf search path.
>>>>> 
>>>>> We are trying to address this regression.
>>>>> 
>>>>> So try to set dns_canonicalize_hostname to true to see if that may
>>>>> influence your issue. If so, please let me know, as we still need to
>>>>> address this where possible.
>>>> 
>>>> I set avoid-dns to 1 and dns_canonicalize_hostname to true. The
>>>> workload hang is not reproducible, and the acceptor is fully qualified.
>>>> 
>>>> rpc.gssd[965]: doing downcall: lifetime_rec=86338 acceptor=nfs@klimt.ib.1015granger.net
>>> 
>>> Chuck,
>>> can you tell what does klimt.ib.1015granger.net resolve to (A names
>>> CNAMEs, not really interested in IP address)?
>> 
>> [root@manet ~]# dig klimt.ib.1015granger.net
>> 
>> ; <<>> DiG 9.11.20-RedHat-9.11.20-1.fc32 <<>> klimt.ib.1015granger.net
>> ;; global options: +cmd
>> ;; Got answer:
>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55806
>> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2
>> 
>> ;; OPT PSEUDOSECTION:
>> ; EDNS: version: 0, flags:; udp: 4096
>> ; COOKIE: 0a7a07d8b06eedeab886314f5f230a8c6f752fe4a24c2f97 (good)
>> ;; QUESTION SECTION:
>> ;klimt.ib.1015granger.net.	IN	A
>> 
>> ;; ANSWER SECTION:
>> klimt.ib.1015granger.net. 10800	IN	A	192.168.2.55
>> 
>> ;; AUTHORITY SECTION:
>> ib.1015granger.net.	10800	IN	NS	gateway.1015granger.net.
>> 
>> ;; ADDITIONAL SECTION:
>> gateway.1015granger.net. 10800	IN	A	192.168.1.1
>> 
>> ;; Query time: 0 msec
>> ;; SERVER: 192.168.1.1#53(192.168.1.1)
>> ;; WHEN: Thu Jul 30 13:59:40 EDT 2020
>> ;; MSG SIZE  rcvd: 135
>> 
>> [root@manet ~]#
> 
> so klimt is an A name, and it is a fqdn in the principal, so I am
> puzzled why the krb5.conf option would make a difference, unless
> 192.168.2.55 perhaps resolves to a different name and you have rdns =
> true ?

rdns is false on my NFS client.

[cel@manet ~]$ dig -x 192.168.2.55

; <<>> DiG 9.11.20-RedHat-9.11.20-1.fc32 <<>> -x 192.168.2.55
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22491
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 868fe3a23c55fdee29cc812d5f231096a66a540a8064f2e8 (good)
;; QUESTION SECTION:
;55.2.168.192.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
55.2.168.192.IN-ADDR.ARPA. 10800 IN	PTR	klimt.ib.1015granger.net.

;; AUTHORITY SECTION:
2.168.192.IN-ADDR.ARPA.	10800	IN	NS	gateway.1015granger.net.

;; ADDITIONAL SECTION:
gateway.1015granger.net. 10800	IN	A	192.168.1.1

;; Query time: 0 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Thu Jul 30 14:25:26 EDT 2020
;; MSG SIZE  rcvd: 183

[cel@manet ~]$


>>> Also what ticket do you ultimately get in the ccache when this request
>>> is made ?
>> 
>> I'm not exactly sure what you're asking, but:
>> 
>> [root@manet ~]# klist FILE:/tmp/krb5ccmachine_1015GRANGER.NET
>> Ticket cache: FILE:/tmp/krb5ccmachine_1015GRANGER.NET
>> Default principal: host/manet.1015granger.net@1015GRANGER.NET
>> 
>> Valid starting       Expires              Service principal
>> 07/30/2020 13:45:38  07/31/2020 13:45:38  krbtgt/1015GRANGER.NET@1015GRANGER.NET
>> 	renew until 08/06/2020 13:45:38
>> [root@manet ~]#
> 
> I was asking what ticket you get to use to talk to klimt, this is the
> machine ccache but it only has the krbtgt for the machine host key.

I'm mounting with sec=sys, so the machine host key is the only
key in play. The client uses that key for NFSv4.0 lease management
and callbacks.

It is the callback context that is the problem. The client is
required to match the server's principal to the client's
acceptor when authenticating a callback operation.
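
To make the failing check concrete, here is a minimal Python sketch of
the comparison. It is an illustration only, not the kernel code: the
function name is hypothetical, and the exact-match rule is an
assumption based on the "does not match acceptor" message quoted
earlier in the thread.

```python
def callback_principal_matches(callback_principal: str, acceptor: str) -> bool:
    """Illustrative check: both names must be 'nfs@<host>' with identical hosts.

    Assumption: the comparison is an exact string match on the host part,
    so a short name and its fully qualified form are treated as different.
    """
    service_cb, _, host_cb = callback_principal.partition("@")
    service_acc, _, host_acc = acceptor.partition("@")
    if service_cb != "nfs" or service_acc != "nfs":
        return False
    if not host_cb or not host_acc:
        return False
    return host_cb == host_acc


# The two names from the client log earlier in the thread.
print(callback_principal_matches("nfs@klimt.ib.1015granger.net", "nfs@klimt.ib"))
print(callback_principal_matches("nfs@klimt.ib.1015granger.net",
                                 "nfs@klimt.ib.1015granger.net"))
```

With an unqualified acceptor the check fails, which is consistent with
the "invalid cred" message on the client.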


> And now that I reread the thread better I see:
> 
> Callback principal (nfs@klimt.ib.1015granger.net) does not match
> acceptor (nfs@klimt.ib)
> 
> do you use klimt.ib explicitly somewhere instead of the fqdn and rely on
> name expansion ?

Yes, I mount with "mount -o vers=4.0,sec=sys klimt.ib:/export /mnt".

When dns_canonicalize_hostname = true or avoid-dns = 0, gssd
properly canonicalizes the acceptor so that it matches the
server's callback principal.
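
As a rough illustration of what "canonicalizes the acceptor" means
here, the following Python sketch builds the acceptor name from a
forward lookup. It is not gssd's implementation: the function names
are hypothetical, and using getaddrinfo() with AI_CANONNAME is an
assumption about how the full hostname could be obtained.

```python
import socket
from typing import Callable


def lookup_canonical_name(hostname: str) -> str:
    """Ask the resolver for the canonical name; fall back to the input."""
    infos = socket.getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    canonname = infos[0][3]  # canonname is only filled in on the first entry
    return canonname or hostname


def canonical_acceptor(service: str, hostname: str,
                       lookup: Callable[[str], str] = lookup_canonical_name) -> str:
    """Build 'service@host' using the canonicalized host name."""
    return f"{service}@{lookup(hostname)}"


# Stand-in resolver so the example does not depend on my DNS setup.
fake_dns = {"klimt.ib": "klimt.ib.1015granger.net"}
print(canonical_acceptor("nfs", "klimt.ib", lookup=lambda h: fake_dns.get(h, h)))
```

The lookup is passed in as a parameter only so the sketch can be tried
without access to a particular DNS zone.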

--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 18:29             ` Chuck Lever
@ 2020-07-30 18:55               ` Simo Sorce
  0 siblings, 0 replies; 15+ messages in thread
From: Simo Sorce @ 2020-07-30 18:55 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List, Robbie Harwood

On Thu, 2020-07-30 at 14:29 -0400, Chuck Lever wrote:
> > On Jul 30, 2020, at 2:20 PM, Simo Sorce <simo@redhat.com> wrote:
> > 
> > On Thu, 2020-07-30 at 14:07 -0400, Chuck Lever wrote:
> > > > On Jul 30, 2020, at 1:57 PM, Simo Sorce <simo@redhat.com> wrote:
> > > > 
> > > > On Thu, 2020-07-30 at 13:09 -0400, Chuck Lever wrote:
> > > > > > On Jul 30, 2020, at 12:14 PM, Simo Sorce <simo@redhat.com> wrote:
> > > > > > 
> > > > > > On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
> > > > > > > > On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > > > > > > > 
> > > > > > > > Hi!
> > > > > > > > 
> > > > > > > > I recently updated my test systems from EL7 to Fedora 32, and
> > > > > > > > NFSv4.0 with Kerberos has stopped working.
> > > > > > > > 
> > > > > > > > I mount with "klimt.ib" as before. The client workload stops
> > > > > > > > dead when the server tries to perform its first CB_RECALL.
> > > > > > > > 
> > > > > > > > I added some client instrumentation:
> > > > > > > > 
> > > > > > > > kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
> > > > > > > > kernel: NFS: NFSv4 callback contains invalid cred
> > > > > > > > 
> > > > > > > > I boosted gssd verbosity, and it says:
> > > > > > > > 
> > > > > > > > rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
> > > > > > > > 
> > > > > > > > But it knows the full hostname for the server:
> > > > > > > > 
> > > > > > > > rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
> > > > > > > > 
> > > > > > > > 
> > > > > > > > The acceptor appears to come from the Kerberos library. Shouldn't
> > > > > > > > it be canonicalized? If so, should the Kerberos library do it, or
> > > > > > > > should gssd? Since this behavior appeared after an upgrade, I
> > > > > > > > suspect a Kerberos library regression. But it could be config-
> > > > > > > > related, since both systems were re-imaged from the ground up.
> > > > > > > > 
> > > > > > > > Also noticing some other problems on the server (missing hostname
> > > > > > > > strings in debug messages, sssd_kcm infinite loops, and gssd
> > > > > > > > sending garbage to the client after the NULL request that
> > > > > > > > establishes the callback context).
> > > > > > > > 
> > > > > > > > But let's look at the client acceptor problem first.
> > > > > > > 
> > > > > > > I believe I found the problem.
> > > > > > > 
> > > > > > > 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
> > > > > > > options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
> > > > > > > dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
> > > > > > > with Kerberos works again.
> > > > > > > 
> > > > > > > Is there a reason the default setting is 1?
> > > > > > > 
> > > > > > 
> > > > > > Now that you mention DNS, this may be an interaction between a new
> > > > > > default in Fedora 32 and how your environment is setup re DNS.
> > > > > > 
> > > > > > In F32 we changed the option dns_canonicalize_hostname from 'true' to
> > > > > > 'fallback'.
> > > > > > This is a transitional state to eventually move it to 'false' at some
> > > > > > point in the future.
> > > > > > 
> > > > > > What it changes in practice is that it will first try the name passed
> > > > > > in *as is* and only as a fallback try a CNAME if the name passed is not
> > > > > > resolved as an A name. If you have principals in the KDC for both
> > > > > > names, but you do not have keys in the keytab for both, you can have
> > > > > > transitional issues.
> > > > > > 
> > > > > > Additionally we discovered a bug that causes non qualified names to
> > > > > > fail resolution with the 'fallback' option.
> > > > > > If your name in the principal is really not qualified it will try to
> > > > > > qualify it anyway, so if your principal is literally nfs/foo@FOO
> > > > > > libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
> > > > > > what is defined in resolv.conf search path.
> > > > > > 
> > > > > > We are trying to address this regression.
> > > > > > 
> > > > > > So try to set dns_canonicalize_hostname to true to see if that may
> > > > > > influence your issue. If so, please let me know, as we still need to
> > > > > > address this where possible.
> > > > > 
> > > > > I set avoid-dns to 1 and dns_canonicalize_hostname to true. The
> > > > > workload hang is not reproducible, and the acceptor is fully qualified.
> > > > > 
> > > > > rpc.gssd[965]: doing downcall: lifetime_rec=86338 acceptor=nfs@klimt.ib.1015granger.net
> > > > 
> > > > Chuck,
> > > > can you tell what does klimt.ib.1015granger.net resolve to (A names
> > > > CNAMEs, not really interested in IP address)?
> > > 
> > > [root@manet ~]# dig klimt.ib.1015granger.net
> > > 
> > > ; <<>> DiG 9.11.20-RedHat-9.11.20-1.fc32 <<>> klimt.ib.1015granger.net
> > > ;; global options: +cmd
> > > ;; Got answer:
> > > ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55806
> > > ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2
> > > 
> > > ;; OPT PSEUDOSECTION:
> > > ; EDNS: version: 0, flags:; udp: 4096
> > > ; COOKIE: 0a7a07d8b06eedeab886314f5f230a8c6f752fe4a24c2f97 (good)
> > > ;; QUESTION SECTION:
> > > ;klimt.ib.1015granger.net.	IN	A
> > > 
> > > ;; ANSWER SECTION:
> > > klimt.ib.1015granger.net. 10800	IN	A	192.168.2.55
> > > 
> > > ;; AUTHORITY SECTION:
> > > ib.1015granger.net.	10800	IN	NS	gateway.1015granger.net.
> > > 
> > > ;; ADDITIONAL SECTION:
> > > gateway.1015granger.net. 10800	IN	A	192.168.1.1
> > > 
> > > ;; Query time: 0 msec
> > > ;; SERVER: 192.168.1.1#53(192.168.1.1)
> > > ;; WHEN: Thu Jul 30 13:59:40 EDT 2020
> > > ;; MSG SIZE  rcvd: 135
> > > 
> > > [root@manet ~]#
> > 
> > so klimt is an A name, and it is a fqdn in the principal, so I am
> > puzzled why the krb5.conf option would make a difference, unless
> > 192.168.2.55 perhaps resolves to a different name and you have rdns =
> > true ?
> 
> rdns is false on my NFS client.
> 
> [cel@manet ~]$ dig -x 192.168.2.55
> 
> ; <<>> DiG 9.11.20-RedHat-9.11.20-1.fc32 <<>> -x 192.168.2.55
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22491
> ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 2
> 
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4096
> ; COOKIE: 868fe3a23c55fdee29cc812d5f231096a66a540a8064f2e8 (good)
> ;; QUESTION SECTION:
> ;55.2.168.192.in-addr.arpa.	IN	PTR
> 
> ;; ANSWER SECTION:
> 55.2.168.192.IN-ADDR.ARPA. 10800 IN	PTR	klimt.ib.1015granger.net.
> 
> ;; AUTHORITY SECTION:
> 2.168.192.IN-ADDR.ARPA.	10800	IN	NS	gateway.1015granger.net.
> 
> ;; ADDITIONAL SECTION:
> gateway.1015granger.net. 10800	IN	A	192.168.1.1
> 
> ;; Query time: 0 msec
> ;; SERVER: 192.168.1.1#53(192.168.1.1)
> ;; WHEN: Thu Jul 30 14:25:26 EDT 2020
> ;; MSG SIZE  rcvd: 183
> 
> [cel@manet ~]$
> 
> 
> > > > Also what ticket do you ultimately get in the ccache when this request
> > > > is made ?
> > > 
> > > I'm not exactly sure what you're asking, but:
> > > 
> > > [root@manet ~]# klist FILE:/tmp/krb5ccmachine_1015GRANGER.NET
> > > Ticket cache: FILE:/tmp/krb5ccmachine_1015GRANGER.NET
> > > Default principal: host/manet.1015granger.net@1015GRANGER.NET
> > > 
> > > Valid starting       Expires              Service principal
> > > 07/30/2020 13:45:38  07/31/2020 13:45:38  krbtgt/1015GRANGER.NET@1015GRANGER.NET
> > > 	renew until 08/06/2020 13:45:38
> > > [root@manet ~]#
> > 
> > I was asking what ticket you get to use to talk to klimt, this is the
> > machine ccache but it only has the krbtgt for the machine host key.
> 
> I'm mounting with sec=sys, so the machine host key is the only
> key in play. The client uses that key for NFSv4.0 lease management
> and callbacks.
> 
> It is the callback context that is the problem. The client is
> required to match the server's principal to the client's
> acceptor when authenticating a callback operation.
> 
> 
> > And now that I reread the thread better I see:
> > 
> > Callback principal (nfs@klimt.ib.1015granger.net) does not match
> > acceptor (nfs@klimt.ib)
> > 
> > do you use klimt.ib explicitly somewhere instead of the fqdn and rely on
> > name expansion ?
> 
> Yes, I mount with "mount -o vers=4.0,sec=sys klimt.ib:/export /mnt" .
> 
> When dns_canonicalize_hostname = true or avoid-dns = 0, gssd
> properly canonicalizes the acceptor so that it matches the
> server's callback principal.

Ok, this seems to be the "issue". Try restoring all the defaults but set
`qualify_shortname = ""` and see if that also makes it work.

Simo.

-- 
Simo Sorce
RHEL Crypto Team
Red Hat, Inc






* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 17:59       ` Chuck Lever
@ 2020-07-30 19:10         ` Simo Sorce
  2020-07-30 19:39           ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Simo Sorce @ 2020-07-30 19:10 UTC (permalink / raw)
  To: Chuck Lever, Robbie Harwood
  Cc: Jeff Layton, Bruce Fields, Linux NFS Mailing List

On Thu, 2020-07-30 at 13:59 -0400, Chuck Lever wrote:
> > On Jul 30, 2020, at 1:08 PM, Robbie Harwood <rharwood@redhat.com> wrote:
> > 
> > Simo Sorce <simo@redhat.com> writes:
> > 
> > > On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
> > > > > On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> > > > > 
> > > > > Hi!
> > > > > 
> > > > > I recently updated my test systems from EL7 to Fedora 32, and
> > > > > NFSv4.0 with Kerberos has stopped working.
> > > > > 
> > > > > I mount with "klimt.ib" as before. The client workload stops
> > > > > dead when the server tries to perform its first CB_RECALL.
> > > > > 
> > > > > I added some client instrumentation:
> > > > > 
> > > > >  kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
> > > > >  kernel: NFS: NFSv4 callback contains invalid cred
> > > > > 
> > > > > I boosted gssd verbosity, and it says:
> > > > > 
> > > > >  rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
> > > > > 
> > > > > But it knows the full hostname for the server:
> > > > > 
> > > > >  rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
> > > > > 
> > > > > 
> > > > > The acceptor appears to come from the Kerberos library. Shouldn't
> > > > > it be canonicalized? If so, should the Kerberos library do it, or
> > > > > should gssd? Since this behavior appeared after an upgrade, I
> > > > > suspect a Kerberos library regression. But it could be config-
> > > > > related, since both systems were re-imaged from the ground up.
> > > > > 
> > > > > Also noticing some other problems on the server (missing hostname
> > > > > strings in debug messages, sssd_kcm infinite loops, and gssd
> > > > > sending garbage to the client after the NULL request that
> > > > > establishes the callback context).
> > > > > 
> > > > > But let's look at the client acceptor problem first.
> > > > 
> > > > I believe I found the problem.
> > > > 
> > > > 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
> > > > options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
> > > > dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
> > > > with Kerberos works again.
> > > > 
> > > > Is there a reason the default setting is 1?
> > > > 
> > > 
> > > Now that you mention DNS, this may be an interaction between a new
> > > default in Fedora 32 and how your environment is setup re DNS.
> > > 
> > > In F32 we changed the option dns_canonicalize_hostname from 'true' to
> > > 'fallback'.
> > > This is a transitional state to eventually move it to 'false' at some
> > > point in the future.
> > > 
> > > What it changes in practice is that it will first try the name passed
> > > in *as is* and only as a fallback try a CNAME if the name passed is not
> > > resolved as an A name. If you have principals in the KDC for both
> > > names, but you do not have keys in the keytab for both, you can have
> > > transitional issues.
> > > 
> > > Additionally we discovered a bug that causes non qualified names to
> > > fail resolution with the 'fallback' option.
> > > If your name in the principal is really not qualified it will try to
> > > qualify it anyway, so if your principal is literally nfs/foo@FOO
> > > libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
> > > what is defined in resolv.conf search path.
> > > 
> > > We are trying to address this regression.
> > > 
> > > So try to set dns_canonicalize_hostname to true to see if that may
> > > influence your issue. If so, please let me know, as we still need to
> > > address this where possible.
> > 
> > Also, please try setting `qualify_shortname = ""`.  (I did update the
> > config file we ship with Fedora, but upstream's default turns that on.
> > This is a temporary workaround while we merge something better
> > upstream.)
> 
> For completeness, I tried:
> 
> avoid-dns = 1
> dns_canonicalize_hostname = fallback
> qualify_shortname = ""
> 
> which is the default configuration out of the shrink wrap.
> 
> The workload hangs as before, and the acceptor is unqualified:
> 
> rpc.gssd[985]: doing downcall: lifetime_rec=84046 acceptor=nfs@klimt.ib
> 
> 
> The test is:
> 
> Configured domain name is "1015granger.net"
> 
> Fully-qualified client hostname is "manet.ib.granger.net"
> 
> Fully-qualified server hostname is "klimt.ib.granger.net"
> 
> mount command is "mount -o vers=4.0,sec=sys klimt.ib:/export /mnt"
> 
> In this case, both systems have keytabs and service principals, so
> the client automatically attempts to establish a GSS context for
> lease management and callback operations. The failure occurs because
> the server's principal is nfs@klimt.ib.1015granger.net but the
> acceptor now matches the server hostname from the mount command line,
> which is not always fully qualified.

Ok, TBH I personally consider the syntax you are currently using as
working by accident, and you should really use the FQDN on the
command line (I assume it works that way, right?). However, I understand
this is also technically a regression. That said, I do not think we can
really fix this case, because your "shortname" is not short (it has a
dot in it), so the heuristics won't trigger to qualify it even when you
set qualify_shortname="".

I have the feeling we'll break this case, and our answer will have to
be "use the fqdn on the command line".
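
To show why a name like "klimt.ib" falls through, here is a small
Python sketch of a "qualify only single-component names" rule. It is
not the libkrb5 code: the function name is hypothetical, and the rule
is an assumption drawn from the description above, namely that any
name containing a dot is left alone.

```python
def qualify_if_short(hostname: str, search_domain: str) -> str:
    """Append the search domain only to single-component host names.

    Assumption: a name that already contains a dot is never treated as
    short, and an empty search domain means no qualification at all.
    """
    if "." in hostname or not search_domain:
        return hostname
    return f"{hostname}.{search_domain}"


print(qualify_if_short("klimt", "1015granger.net"))     # single component
print(qualify_if_short("klimt.ib", "1015granger.net"))  # already has a dot
```

Under this rule "klimt" would be expanded but "klimt.ib" is passed
through unchanged, which is the case that breaks.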

Simo.
 
-- 
Simo Sorce
RHEL Crypto Team
Red Hat, Inc






* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 19:10         ` Simo Sorce
@ 2020-07-30 19:39           ` Chuck Lever
  2020-08-10 15:28             ` Chuck Lever
  0 siblings, 1 reply; 15+ messages in thread
From: Chuck Lever @ 2020-07-30 19:39 UTC (permalink / raw)
  To: Simo Sorce
  Cc: Robbie Harwood, Jeff Layton, Bruce Fields, Linux NFS Mailing List



> On Jul 30, 2020, at 3:10 PM, Simo Sorce <simo@redhat.com> wrote:
> 
> On Thu, 2020-07-30 at 13:59 -0400, Chuck Lever wrote:
>>> On Jul 30, 2020, at 1:08 PM, Robbie Harwood <rharwood@redhat.com> wrote:
>>> 
>>> Simo Sorce <simo@redhat.com> writes:
>>> 
>>>> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>>>>>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> Hi!
>>>>>> 
>>>>>> I recently updated my test systems from EL7 to Fedora 32, and
>>>>>> NFSv4.0 with Kerberos has stopped working.
>>>>>> 
>>>>>> I mount with "klimt.ib" as before. The client workload stops
>>>>>> dead when the server tries to perform its first CB_RECALL.
>>>>>> 
>>>>>> I added some client instrumentation:
>>>>>> 
>>>>>> kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>>>>> kernel: NFS: NFSv4 callback contains invalid cred
>>>>>> 
>>>>>> I boosted gssd verbosity, and it says:
>>>>>> 
>>>>>> rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>>>>> 
>>>>>> But it knows the full hostname for the server:
>>>>>> 
>>>>>> rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>>>>> 
>>>>>> 
>>>>>> The acceptor appears to come from the Kerberos library. Shouldn't
>>>>>> it be canonicalized? If so, should the Kerberos library do it, or
>>>>>> should gssd? Since this behavior appeared after an upgrade, I
>>>>>> suspect a Kerberos library regression. But it could be config-
>>>>>> related, since both systems were re-imaged from the ground up.
>>>>>> 
>>>>>> Also noticing some other problems on the server (missing hostname
>>>>>> strings in debug messages, sssd_kcm infinite loops, and gssd
>>>>>> sending garbage to the client after the NULL request that
>>>>>> establishes the callback context).
>>>>>> 
>>>>>> But let's look at the client acceptor problem first.
>>>>> 
>>>>> I believe I found the problem.
>>>>> 
>>>>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>>>>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>>>>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>>>>> with Kerberos works again.
>>>>> 
>>>>> Is there a reason the default setting is 1?
>>>>> 
>>>> 
>>>> Now that you mention DNS, this may be an interaction between a new
>>>> default in Fedora 32 and how your environment is setup re DNS.
>>>> 
>>>> In F32 we changed the option dns_canonicalize_hostname from 'true' to
>>>> 'fallback'.
>>>> This is a transitional state to eventually move it to 'false' at some
>>>> point in the future.
>>>> 
>>>> What it changes in practice is that it will first try the name passed
>>>> in *as is* and only as a fallback try a CNAME if the name passed is not
>>>> resolved as an A name. If you have principals in the KDC for both
>>>> names, but you do not have keys in the keytab for both, you can have
>>>> transitional issues.
>>>> 
>>>> Additionally we discovered a bug that causes non qualified names to
>>>> fail resolution with the 'fallback' option.
>>>> If your name in the principal is really not qualified it will try to
>>>> qualify it anyway, so if your principal is literally nfs/foo@FOO
>>>> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
>>>> what is defined in resolv.conf search path.
>>>> 
>>>> We are trying to address this regression.
>>>> 
>>>> So try to set dns_canonicalize_hostname to true to see if that may
>>>> influence your issue. If so, please let me know, as we still need to
>>>> address this where possible.
>>> 
>>> Also, please try setting `qualify_shortname = ""`.  (I did update the
>>> config file we ship with Fedora, but upstream's default turns that on.
>>> This is a temporary workaround while we merge something better
>>> upstream.)
>> 
>> For completeness, I tried:
>> 
>> avoid-dns = 1
>> dns_canonicalize_hostname = fallback
>> qualify_shortname = ""
>> 
>> which is the default configuration out of the shrink wrap.
>> 
>> The workload hangs as before, and the acceptor is unqualified:
>> 
>> rpc.gssd[985]: doing downcall: lifetime_rec=84046 acceptor=nfs@klimt.ib
>> 
>> 
>> The test is:
>> 
>> Configured domain name is "1015granger.net"
>> 
>> Fully-qualified client hostname is "manet.ib.1015granger.net"
>> 
>> Fully-qualified server hostname is "klimt.ib.1015granger.net"
>> 
>> mount command is "mount -o vers=4.0,sec=sys klimt.ib:/export /mnt"
>> 
>> In this case, both systems have keytabs and service principals, so
>> the client automatically attempts to establish a GSS context for
>> lease management and callback operations. The failure occurs because
>> the server's principal is nfs@klimt.ib.1015granger.net but the
>> acceptor now matches the server hostname from the mount command line,
>> which is not always fully qualified.
> 
> Ok, TBH I personally consider the syntax you are currently using as
> working by accident, and you should really use the FQDN on the
> command line (I assume it works that way, right?). However, I understand
> this is also technically a regression. That said, I do not think we can
> really fix this case because your "shortname" is not short (it has a
> dot in it), so the heuristics won't trigger to qualify it even when you
> set qualify_shortname="".
> 
> I have the feeling we'll break this case, and our answer will have to
> be "use the fqdn on the command line".

See previous e-mail. Using the shrink wrap default settings, which
includes qualify_shortname="", results in a hang on callback, as
originally observed.

Users will notice this and complain: klimt.ib works for the NFSv3
case and the NFSv4.1 case, and for NFSv4.0 when there is no keytab,
but NFSv4.0,sec=* with a keytab eventually hangs.
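
To make the mismatch concrete, here is a minimal sketch of the
comparison that fails. It is not the kernel's code: both helper names
are hypothetical, and the exact-match rule is an assumption inferred
from the "does not match acceptor" log message quoted above. The only
grounded inputs are the two names from that message.

```python
def acceptor_from_mount(hostname: str, service: str = "nfs") -> str:
    """Form service@hostname from the name exactly as typed at mount time.

    Assumption: models the avoid-dns = 1 behavior seen in the gssd log,
    where the acceptor is not canonicalized.
    """
    return f"{service}@{hostname}"


def callback_cred_valid(callback_principal: str, acceptor: str) -> bool:
    """Hypothetical stand-in for the client's check: exact string match."""
    return callback_principal == acceptor


server_principal = "nfs@klimt.ib.1015granger.net"

# Mounting by the partially qualified name reproduces the mismatch.
print(callback_cred_valid(server_principal, acceptor_from_mount("klimt.ib")))
# Mounting by the FQDN makes the two names agree.
print(callback_cred_valid(
    server_principal, acceptor_from_mount("klimt.ib.1015granger.net")))
```

NFSv3 and NFSv4.1 do not depend on this backchannel check in the
same way, which is consistent with only the NFSv4.0 keytab case
hanging.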


--
Chuck Lever





* Re: Fedora 32 rpc.gssd misbehavior
  2020-07-30 19:39           ` Chuck Lever
@ 2020-08-10 15:28             ` Chuck Lever
  0 siblings, 0 replies; 15+ messages in thread
From: Chuck Lever @ 2020-08-10 15:28 UTC (permalink / raw)
  To: Simo Sorce, Steve Dickson
  Cc: Robbie Harwood, Jeff Layton, Bruce Fields, Linux NFS Mailing List



> On Jul 30, 2020, at 3:39 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Jul 30, 2020, at 3:10 PM, Simo Sorce <simo@redhat.com> wrote:
>> 
>> On Thu, 2020-07-30 at 13:59 -0400, Chuck Lever wrote:
>>>> On Jul 30, 2020, at 1:08 PM, Robbie Harwood <rharwood@redhat.com> wrote:
>>>> 
>>>> Simo Sorce <simo@redhat.com> writes:
>>>> 
>>>>> On Wed, 2020-07-29 at 14:27 -0400, Chuck Lever wrote:
>>>>>>> On Jul 29, 2020, at 1:19 PM, Chuck Lever <chuck.lever@oracle.com> wrote:
>>>>>>> 
>>>>>>> Hi!
>>>>>>> 
>>>>>>> I recently updated my test systems from EL7 to Fedora 32, and
>>>>>>> NFSv4.0 with Kerberos has stopped working.
>>>>>>> 
>>>>>>> I mount with "klimt.ib" as before. The client workload stops
>>>>>>> dead when the server tries to perform its first CB_RECALL.
>>>>>>> 
>>>>>>> I added some client instrumentation:
>>>>>>> 
>>>>>>> kernel: NFSv4: Callback principal (nfs@klimt.ib.1015granger.net) does not match acceptor (nfs@klimt.ib).
>>>>>>> kernel: NFS: NFSv4 callback contains invalid cred
>>>>>>> 
>>>>>>> I boosted gssd verbosity, and it says:
>>>>>>> 
>>>>>>> rpc.gssd[986]: doing downcall: lifetime_rec=72226 acceptor=nfs@klimt.ib
>>>>>>> 
>>>>>>> But it knows the full hostname for the server:
>>>>>>> 
>>>>>>> rpc.gssd[986]: Full hostname for 'klimt.ib' is 'klimt.ib.1015granger.net'
>>>>>>> 
>>>>>>> 
>>>>>>> The acceptor appears to come from the Kerberos library. Shouldn't
>>>>>>> it be canonicalized? If so, should the Kerberos library do it, or
>>>>>>> should gssd? Since this behavior appeared after an upgrade, I
>>>>>>> suspect a Kerberos library regression. But it could be config-
>>>>>>> related, since both systems were re-imaged from the ground up.
>>>>>>> 
>>>>>>> Also noticing some other problems on the server (missing hostname
>>>>>>> strings in debug messages, sssd_kcm infinite loops, and gssd
>>>>>>> sending garbage to the client after the NULL request that
>>>>>>> establishes the callback context).
>>>>>>> 
>>>>>>> But let's look at the client acceptor problem first.
>>>>>> 
>>>>>> I believe I found the problem.
>>>>>> 
>>>>>> 8bffe8c5ec1a ("gssd: add /etc/nfs.conf support") added a number of gssd config
>>>>>> options to /etc/nfs.conf, including "avoid-dns". The default setting of avoid-
>>>>>> dns is 1. When I set this option on my client system explicitly to 0, NFSv4.0
>>>>>> with Kerberos works again.
>>>>>> 
>>>>>> Is there a reason the default setting is 1?
>>>>>> 
>>>>> 
>>>>> Now that you mention DNS, this may be an interaction between a new
>>>>> default in Fedora 32 and how your environment is setup re DNS.
>>>>> 
>>>>> In F32 we changed the option dns_canonicalize_hostname from 'true' to
>>>>> 'fallback'.
>>>>> This is a transitional state to eventually move it to 'false' at some
>>>>> point in the future.
>>>>> 
>>>>> What it changes in practice is that it will first try the name passed
>>>>> in *as is* and only as a fallback try a CNAME if the name passed is not
>>>>> resolved as an A name. If you have principals in the KDC for both
>>>>> names, but you do not have keys in the keytab for both, you can have
>>>>> transitional issues.
>>>>> 
>>>>> Additionally we discovered a bug that causes non qualified names to
>>>>> fail resolution with the 'fallback' option.
>>>>> If your name in the principal is really not qualified it will try to
>>>>> qualify it anyway, so if your principal is literally nfs/foo@FOO
>>>>> libgssapi may try to use nfs/foo.my.domain@FOO, where "my.domain" is
>>>>> what is defined in resolv.conf search path.
>>>>> 
>>>>> We are trying to address this regression.
>>>>> 
>>>>> So try to set dns_canonicalize_hostname to true to see if that may
>>>>> influence your issue. If so, please let me know, as we still need to
>>>>> address this where possible.
>>>> 
>>>> Also, please try setting `qualify_shortname = ""`.  (I did update the
>>>> config file we ship with Fedora, but upstream's default turns that on.
>>>> This is a temporary workaround while we merge something better
>>>> upstream.)
>>> 
>>> For completeness, I tried:
>>> 
>>> avoid-dns = 1
>>> dns_canonicalize_hostname = fallback
>>> qualify_shortname = ""
>>> 
>>> which is the default configuration out of the shrink wrap.
>>> 
>>> The workload hangs as before, and the acceptor is unqualified:
>>> 
>>> rpc.gssd[985]: doing downcall: lifetime_rec=84046 acceptor=nfs@klimt.ib
>>> 
>>> 
>>> The test is:
>>> 
>>> Configured domain name is "1015granger.net"
>>> 
>>> Fully-qualified client hostname is "manet.ib.1015granger.net"
>>> 
>>> Fully-qualified server hostname is "klimt.ib.1015granger.net"
>>> 
>>> mount command is "mount -o vers=4.0,sec=sys klimt.ib:/export /mnt"
>>> 
>>> In this case, both systems have keytabs and service principals, so
>>> the client automatically attempts to establish a GSS context for
>>> lease management and callback operations. The failure occurs because
>>> the server's principal is nfs@klimt.ib.1015granger.net but the
>>> acceptor now matches the server hostname from the mount command line,
>>> which is not always fully qualified.
>> 
>> Ok, TBH I personally consider the syntax you are currently using as
>> working by accident, and you should really use the FQDN on the
>> command line (I assume it works that way, right?). However, I understand
>> this is also technically a regression. That said, I do not think we can
>> really fix this case because your "shortname" is not short (it has a
>> dot in it), so the heuristics won't trigger to qualify it even when you
>> set qualify_shortname="".
>> 
>> I have the feeling we'll break this case, and our answer will have to
>> be "use the fqdn on the command line".
> 
> See previous e-mail. Using the shrink wrap default settings, which
> includes qualify_shortname="", results in a hang on callback, as
> originally observed.
> 
> Users will notice this and complain: klimt.ib works for the NFSv3
> case and the NFSv4.1 case, and for NFSv4.0 when there is no keytab,
> but NFSv4.0,sec=* with a keytab eventually hangs.

I filed

https://bugzilla.redhat.com/show_bug.cgi?id=1867719

to document the behavior regression and some possible workarounds.
Further discussion about addressing the issue (possibly in nfs-utils)
can happen there.

Thanks!


--
Chuck Lever





end of thread, other threads:[~2020-08-10 15:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-29 17:19 Fedora 32 rpc.gssd misbehavior Chuck Lever
2020-07-29 18:27 ` Chuck Lever
2020-07-30 14:43   ` Steve Dickson
2020-07-30 16:14   ` Simo Sorce
2020-07-30 17:08     ` Robbie Harwood
2020-07-30 17:59       ` Chuck Lever
2020-07-30 19:10         ` Simo Sorce
2020-07-30 19:39           ` Chuck Lever
2020-08-10 15:28             ` Chuck Lever
2020-07-30 17:09     ` Chuck Lever
2020-07-30 17:57       ` Simo Sorce
2020-07-30 18:07         ` Chuck Lever
2020-07-30 18:20           ` Simo Sorce
2020-07-30 18:29             ` Chuck Lever
2020-07-30 18:55               ` Simo Sorce
