All of lore.kernel.org
 help / color / mirror / Atom feed
* nfsd issue with a kerberized callback
@ 2018-04-16 19:48 Olga Kornievskaia
  2018-04-16 21:29 ` J. Bruce Fields
  0 siblings, 1 reply; 6+ messages in thread
From: Olga Kornievskaia @ 2018-04-16 19:48 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs, ng-linux-team

[-- Attachment #1: Type: text/plain, Size: 1919 bytes --]

Hi Bruce,

I have a failure that I’m investigating from the Bakeathon (this was going against redhat-75 server. Not sure who was running that server. But I believe that was RHEL7.5 server). I have a network trace and I was wondering if you could help with what the server is doing.

I’m attaching a network trace. The parts I’m interested in explaining have to do with the kerberized backchannel for NFS4.0.

A setup is client doing v3 and v4 mount and opening file with one version and appending to it with a different version. Its opened with 4.0 and got a delegation and it’s trying to write with v3 and server is recalling a delegation

Server is issuing CB_NULL gss_init trying to establish a gss context. But it’s doing it twice in frame 259 and frame 261. It’s weird that it’s doing it twice. But Ok.

Now in frame, 283 it sends CB_COMPOUND CB_RECALL 
And in frame 285 it sends CB_NULL with gss_data with the CB_NULL as the payload. I think this is to establish the callback.

In frame 287, client responds with RPC accept state of 6000 (which I believe is "drop reply").

I believe what’s happening is that because the client hasn’t received CB_NULL that establishes a callback channel but got a CB_RECALL it’s just ignoring it.

What happens later is that server re-transmits the CB_COMPOUND but client replies out of the cache. What’s interesting is that by this time since CB_NULL that came after the CB_COMPOUND should have established the callback and if the re-trasmission was instead a new CB_RECALL, then it would have succeeded I would think. Server tries twice and then finally, the sets the CB_PATH_DOWN on the RENEW that client sends.

Questions:
1. Do you see how CB_RECALL can travel before the callback is established?
2. Should the server do something else beside re-transmitting the CB_RECALL because it got this “drop reply” error code back? 


[-- Attachment #2: nfstest_interop_20180329095958_2.cap --]
[-- Type: application/octet-stream, Size: 69418 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nfsd issue with a kerberized callback
  2018-04-16 19:48 nfsd issue with a kerberized callback Olga Kornievskaia
@ 2018-04-16 21:29 ` J. Bruce Fields
  2018-04-17 13:08   ` Olga Kornievskaia
       [not found]   ` <EC52DFFD-CB6F-45F5-BAB9-C684C205DE26@netapp.com>
  0 siblings, 2 replies; 6+ messages in thread
From: J. Bruce Fields @ 2018-04-16 21:29 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-nfs, ng-linux-team

On Mon, Apr 16, 2018 at 03:48:49PM -0400, Olga Kornievskaia wrote:
> I have a failure that I’m investigating from the Bakeathon (this was
> going against redhat-75 server. Not sure who was running that server.
> But I believe that was RHEL7.5 server). I have a network trace and I
> was wondering if you could help with what the server is doing.
>
> I’m attaching a network trace. The parts I’m interested in explaining
> have to do with the kerberized backchannel for NFS4.0.
> 
> A setup is client doing v3 and v4 mount and opening file with one
> version and appending to it with a different version. Its opened with
> 4.0 and got a delegation and it’s trying to write with v3 and server
> is recalling a delegation
> 
> Server is issuing CB_NULL gss_init trying to establish a gss context.
> But it’s doing it twice in frame 259 and frame 261. It’s weird that
> it’s doing it twice. But Ok.

I also wonder why the client sent two sets of
SETCLIENTID/SETCLIENTID_CONFIRM calls.  The second gets back the same
clientid as the first, so I think the only thing the server might do is
update the callback information--but the callback information is the
same in both cases.  Maybe some server bug is causing it not to handle
that update correctly?

I also expect the server to start a CB_NULL as soon as it gets the
setclientid_confirm, so I would have expected to see that sooner.

> Now in frame, 283 it sends CB_COMPOUND CB_RECALL And in frame 285 it
> sends CB_NULL with gss_data with the CB_NULL as the payload. I think
> this is to establish the callback.
> 
> In frame 287, client responds with RPC accept state of 6000 (which I
> believe is "drop reply").

That value shouldn't ever appear on the wire.

Looks like RHEL7 may need 0533b13072f4 "svc: Avoid garbage replies when
pc_func() returns rpc_drop_reply".

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nfsd issue with a kerberized callback
  2018-04-16 21:29 ` J. Bruce Fields
@ 2018-04-17 13:08   ` Olga Kornievskaia
       [not found]   ` <EC52DFFD-CB6F-45F5-BAB9-C684C205DE26@netapp.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Olga Kornievskaia @ 2018-04-17 13:08 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Olga Kornievskaia, linux-nfs, ng-linux-team

On Mon, Apr 16, 2018 at 5:29 PM, J. Bruce Fields <bfields@redhat.com> wrote=
:
> On Mon, Apr 16, 2018 at 03:48:49PM -0400, Olga Kornievskaia wrote:
>> I have a failure that I=E2=80=99m investigating from the Bakeathon (this=
 was
>> going against redhat-75 server. Not sure who was running that server.
>> But I believe that was RHEL7.5 server). I have a network trace and I
>> was wondering if you could help with what the server is doing.
>>
>> I=E2=80=99m attaching a network trace. The parts I=E2=80=99m interested =
in explaining
>> have to do with the kerberized backchannel for NFS4.0.
>>
>> A setup is client doing v3 and v4 mount and opening file with one
>> version and appending to it with a different version. Its opened with
>> 4.0 and got a delegation and it=E2=80=99s trying to write with v3 and se=
rver
>> is recalling a delegation
>>
>> Server is issuing CB_NULL gss_init trying to establish a gss context.
>> But it=E2=80=99s doing it twice in frame 259 and frame 261. It=E2=80=99s=
 weird that
>> it=E2=80=99s doing it twice. But Ok.
>
> I also wonder why the client sent two sets of
> SETCLIENTID/SETCLIENTID_CONFIRM calls.

The test does two different mounts and there is an unmount between
them. There is 4.0 mount, unmount. Then v3 mount, then v4 mount again.
I think that=E2=80=99s why there are 2 SETCLIENTIDs.

> The second gets back the same
> clientid as the first, so I think the only thing the server might do is
> update the callback information--but the callback information is the
> same in both cases.  Maybe some server bug is causing it not to handle
> that update correctly?
>
> I also expect the server to start a CB_NULL as soon as it gets the
> setclientid_confirm, so I would have expected to see that sooner.

That=E2=80=99s true but I have seen when CB_NULL doesn=E2=80=99t happen rig=
ht away. So
this happens once in a while, don=E2=80=99t know why...

>> Now in frame, 283 it sends CB_COMPOUND CB_RECALL And in frame 285 it
>> sends CB_NULL with gss_data with the CB_NULL as the payload. I think
>> this is to establish the callback.
>>
>> In frame 287, client responds with RPC accept state of 6000 (which I
>> believe is "drop reply").
>
> That value shouldn't ever appear on the wire.
>
> Looks like RHEL7 may need 0533b13072f4 "svc: Avoid garbage replies when
> pc_func() returns rpc_drop_reply".
>
> From nfs4_callback_compound, there are a couple reasons it might return
> rpc_drop_reply--might be a failure to recognize the callback principal?

So you think the callback server doesn=E2=80=99t need an initial NULL call
before able to handle CB_RECALL?

>
>> I believe what=E2=80=99s happening is that because the client hasn=E2=80=
=99t received
>> CB_NULL that establishes a callback channel but got a CB_RECALL it=E2=80=
=99s
>> just ignoring it.
>
> I see two succesful CB_NULL calls and replies, so I think the context
> establishment worked.  I don't know why there's a third CB_NULL in frame
> 285.

The two CB_NULL calls are both gss_init calls. They are not RPC NULL
call. Server for some reason establishes 2 different gss context
(again not necessarily a problem). The 3rd CB_NULL is gss_data and it
sends an actual NULL call. I believe that=E2=80=99s the one that establishe=
s a
callback channel.

>
>> What happens later is that server re-transmits the CB_COMPOUND but
>> client replies out of the cache. What=E2=80=99s interesting is that by t=
his
>> time since CB_NULL that came after the CB_COMPOUND should have
>> established the callback and if the re-trasmission was instead a new
>> CB_RECALL, then it would have succeeded I would think. Server tries
>> twice and then finally, the sets the CB_PATH_DOWN on the RENEW that
>> client sends.
>>
>> Questions: 1. Do you see how CB_RECALL can travel before the callback
>> is established?
>
> I assume it's using the context established by one of the previous
> CB_NULLs.  I guess you could check the context handle to confirm this.

Both the CB_RECALL and CB_NULL carrying the NULL call that establishes
context use the gss_context established by the 2nd of the CB_NULL
calls.

I still believe that CB_RECALL was sent incorrectly as a callback
channel has not been established.

>> 2. Should the server do something else beside
>> re-transmitting the CB_RECALL because it got this =E2=80=9Cdrop reply=E2=
=80=9D error
>> code back?
>
> Since the reply isn't really valid, I think retrying is the server's
> only option.
>
> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nfsd issue with a kerberized callback
       [not found]   ` <EC52DFFD-CB6F-45F5-BAB9-C684C205DE26@netapp.com>
@ 2018-04-17 13:21     ` J. Bruce Fields
  2018-04-19 15:10       ` Olga Kornievskaia
  0 siblings, 1 reply; 6+ messages in thread
From: J. Bruce Fields @ 2018-04-17 13:21 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-nfs, ng-linux-team

On Mon, Apr 16, 2018 at 08:05:44PM -0400, Olga Kornievskaia wrote:
> > On Apr 16, 2018, at 5:29 PM, J. Bruce Fields <bfields@redhat.com> wrote:
> >> I believe what’s happening is that because the client hasn’t received
> >> CB_NULL that establishes a callback channel but got a CB_RECALL it’s
> >> just ignoring it.
> > 
> > I see two succesful CB_NULL calls and replies, so I think the context
> > establishment worked.  I don't know why there's a third CB_NULL in frame
> > 285.
> 
> The two CB_NULL calls are both gss_init calls. They are not RPC NULL call. Server for some reason establishes 2 different gss context (again not necessarily a problem). The 3rd CB_NULL is gss_data and it sends an actual NULL call. I believe that’s the one that establishes a callback channel.

Oh, I see.  No, that NULL call isn't necessary.  It's just the server
probing to see whether the callback channel works.  That NULL call isn't
required by the spec and everything should still work if the CB_RECALL
is sent out before it completes.

--b.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nfsd issue with a kerberized callback
  2018-04-17 13:21     ` J. Bruce Fields
@ 2018-04-19 15:10       ` Olga Kornievskaia
  2018-04-19 15:52         ` J. Bruce Fields
  0 siblings, 1 reply; 6+ messages in thread
From: Olga Kornievskaia @ 2018-04-19 15:10 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Olga Kornievskaia, linux-nfs, ng-linux-team

On Tue, Apr 17, 2018 at 9:21 AM, J. Bruce Fields <bfields@redhat.com> wrote=
:
> On Mon, Apr 16, 2018 at 08:05:44PM -0400, Olga Kornievskaia wrote:
>> > On Apr 16, 2018, at 5:29 PM, J. Bruce Fields <bfields@redhat.com> wrot=
e:
>> >> I believe what=E2=80=99s happening is that because the client hasn=E2=
=80=99t received
>> >> CB_NULL that establishes a callback channel but got a CB_RECALL it=E2=
=80=99s
>> >> just ignoring it.
>> >
>> > I see two succesful CB_NULL calls and replies, so I think the context
>> > establishment worked.  I don't know why there's a third CB_NULL in fra=
me
>> > 285.
>>
>> The two CB_NULL calls are both gss_init calls. They are not RPC NULL cal=
l. Server for some reason establishes 2 different gss context (again not ne=
cessarily a problem). The 3rd CB_NULL is gss_data and it sends an actual NU=
LL call. I believe that=E2=80=99s the one that establishes a callback chann=
el.
>
> Oh, I see.  No, that NULL call isn't necessary.  It's just the server
> probing to see whether the callback channel works.  That NULL call isn't
> required by the spec and everything should still work if the CB_RECALL
> is sent out before it completes.

Ok thank you. Another question: can the nfsd decide to give out
delegation regardless of the existence of the callback path? Looking
at the nfsd open code I see that it checks if cl_cb_state is
NFSD4_CB_UP. But I guess somehow (according to this trace), it finds
it up? Any ideas, looks like a bug...

On this particular network trace, in frame 240, the server gives out
the read delegation (which is later on it's trying to recall). However
the setup of the callback path(s) doesn't happen until frames 259/261
(tcp connections are established starting from frames 253/256).

Goal is to try and reproduce this problem where the client is failing
the CB_RECALL... but so far I can't trigger this.

Also I figured out why there are 2 SETCLIENTIDs and it's not because
of the 2 mounts (with unmount in the middle). First SETCLIENTID is
trunking discovery. Second SETCLIENTID is for when client is trying to
do an operation requiring a clientid. Therefore this leads to
existence of 2 callback paths (ie 2 CB_NULL gss context
establishment). I'm not sure if this is somehow incorrect that client
provides 2 callback path and server establishes context for both.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: nfsd issue with a kerberized callback
  2018-04-19 15:10       ` Olga Kornievskaia
@ 2018-04-19 15:52         ` J. Bruce Fields
  0 siblings, 0 replies; 6+ messages in thread
From: J. Bruce Fields @ 2018-04-19 15:52 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Olga Kornievskaia, linux-nfs, ng-linux-team

On Thu, Apr 19, 2018 at 11:10:06AM -0400, Olga Kornievskaia wrote:
> On Tue, Apr 17, 2018 at 9:21 AM, J. Bruce Fields <bfields@redhat.com> wrote:
> > On Mon, Apr 16, 2018 at 08:05:44PM -0400, Olga Kornievskaia wrote:
> >> > On Apr 16, 2018, at 5:29 PM, J. Bruce Fields <bfields@redhat.com> wrote:
> >> >> I believe what’s happening is that because the client hasn’t received
> >> >> CB_NULL that establishes a callback channel but got a CB_RECALL it’s
> >> >> just ignoring it.
> >> >
> >> > I see two succesful CB_NULL calls and replies, so I think the context
> >> > establishment worked.  I don't know why there's a third CB_NULL in frame
> >> > 285.
> >>
> >> The two CB_NULL calls are both gss_init calls. They are not RPC NULL call. Server for some reason establishes 2 different gss context (again not necessarily a problem). The 3rd CB_NULL is gss_data and it sends an actual NULL call. I believe that’s the one that establishes a callback channel.
> >
> > Oh, I see.  No, that NULL call isn't necessary.  It's just the server
> > probing to see whether the callback channel works.  That NULL call isn't
> > required by the spec and everything should still work if the CB_RECALL
> > is sent out before it completes.
> 
> Ok thank you. Another question: can the nfsd decide to give out
> delegation regardless of the existence of the callback path? Looking
> at the nfsd open code I see that it checks if cl_cb_state is
> NFSD4_CB_UP. But I guess somehow (according to this trace), it finds
> it up? Any ideas, looks like a bug...

That CB_NULL isn't required by the spec and the lack of it shouldn't
bother the client.  But, yes, you're right, I think it's still a bug if
the server's giving out a delegation before it returns, in the NFSv4.0
case.  There are so many ways the NFSv4.0 callback connection can fail
(firewalls, NAT, gss configuration...) that I'd rather not assume it's
up.

So, I agree, it's a bug, it just doesn't necessarily explain what's
going wrong here.

> On this particular network trace, in frame 240, the server gives out
> the read delegation (which is later on it's trying to recall). However
> the setup of the callback path(s) doesn't happen until frames 259/261
> (tcp connections are established starting from frames 253/256).
> 
> Goal is to try and reproduce this problem where the client is failing
> the CB_RECALL... but so far I can't trigger this.
> 
> Also I figured out why there are 2 SETCLIENTIDs and it's not because
> of the 2 mounts (with unmount in the middle). First SETCLIENTID is
> trunking discovery. Second SETCLIENTID is for when client is trying to
> do an operation requiring a clientid. Therefore this leads to
> existence of 2 callback paths (ie 2 CB_NULL gss context
> establishment). I'm not sure if this is somehow incorrect that client
> provides 2 callback path and server establishes context for both.

When I looked I think I saw that the second SETCLIENTID/CONFIRM uses the
same callback info as the first.  So the server shouldn't need to do
anything.  But it may actually close the connection and open a new one,
which is unnecessary but not necessarily wrong.  But if it's keeping two
connections around and trying to use both then something's definitely
misbehaving.

--b.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-04-19 15:52 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-16 19:48 nfsd issue with a kerberized callback Olga Kornievskaia
2018-04-16 21:29 ` J. Bruce Fields
2018-04-17 13:08   ` Olga Kornievskaia
     [not found]   ` <EC52DFFD-CB6F-45F5-BAB9-C684C205DE26@netapp.com>
2018-04-17 13:21     ` J. Bruce Fields
2018-04-19 15:10       ` Olga Kornievskaia
2018-04-19 15:52         ` J. Bruce Fields

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.