Linux-NFS Archive on lore.kernel.org
 help / color / Atom feed
* RPC Pipefs: Frequent parsing errors in client database
@ 2020-06-19 21:24 Kraus, Sebastian
  2020-06-19 22:04 ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Kraus, Sebastian @ 2020-06-19 21:24 UTC (permalink / raw)
  To: linux-nfs

Hi all,
for several weeks now, I have been seeing, on a regular basis, errors like the following in the system log of one of my NFSv4 file servers:

Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info

Looks like premature closing of client connections.
The security flavor of the NFS export is set to krb5p (integrity+privacy).

Does anyone have a hint on how to track down the problem efficiently?


Best and thanks
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RPC Pipefs: Frequent parsing errors in client database
  2020-06-19 21:24 RPC Pipefs: Frequent parsing errors in client database Kraus, Sebastian
@ 2020-06-19 22:04 ` J. Bruce Fields
  2020-06-20 11:35   ` Kraus, Sebastian
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-19 22:04 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs

On Fri, Jun 19, 2020 at 09:24:27PM +0000, Kraus, Sebastian wrote:
> Hi all,
> for several weeks now, I have been seeing, on a regular basis, errors like the following in the system log of one of my NFSv4 file servers:
> 
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info

I'm not sure what exactly is happening.

Are the log messages the only problem you're seeing, or is there some
other problem?

--b.

> 
> Looks like premature closing of client connections.
> The security flavor of the NFS export is set to krb5p (integrity+privacy).
> 
> Does anyone have a hint on how to track down the problem efficiently?
> 
> 
> Best and thanks
> Sebastian
> 
> 
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
> 
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RPC Pipefs: Frequent parsing errors in client database
  2020-06-19 22:04 ` J. Bruce Fields
@ 2020-06-20 11:35   ` Kraus, Sebastian
  2020-06-20 17:03     ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Kraus, Sebastian @ 2020-06-20 11:35 UTC (permalink / raw)
  To: linux-nfs; +Cc: J. Bruce Fields

Hi Bruce,
OK, here is the somewhat longer story 8-O:

I maintain several virtualized NFS server instances running on a VMware ESXi hypervisor. The operating system is Debian Stretch/Buster.
Most of the time, the NFS servers are nearly idle, and there is only moderate CPU load during rush hours. So the servers are far from being overloaded.

Anyway, more than a year ago, the rpc.gssd daemon started getting unstable in production use.
The daemon provokes segmentation violations several times a day, on an irregular basis and unfortunately without any obvious reason. :-(

The observed violations look like this:
Jun 11 21:52:08 all kernel: rpc.gssd[12043]: segfault at 0 ip 000056065d50e38e sp 00007fde27ffe880 error 4 in rpc.gssd[56065d50b000+9000]
or that:
Mar 17 10:32:10 all kernel: rpc.gssd[25793]: segfault at ffffffffffffffc0 ip 00007ffa61f246e4 sp 00007ffa6145f0f8 error 5 in libc-2.24.so[7ffa61ea4000+195000]

In order to manage the problem in a quick and dirty way, I activated automatic restart of the rpc-gssd.service unit on failure.
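For reference, that quick-and-dirty workaround is just a systemd drop-in; on Debian it can look roughly like this (the path and values are illustrative, not a recommendation):

```
# /etc/systemd/system/rpc-gssd.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=2
```

followed by `systemctl daemon-reload` and a restart of rpc-gssd.service.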


Several months ago, I decided to investigate the problem further by launching the rpc.svcgssd and rpc.gssd daemons with an enhanced debug level from their service units.
Sadly, this didn't give me any clue about the root cause of these strange segmentation violations.

Some of my colleagues urged me to migrate the server instances from Debian Stretch (current oldstable) to Debian Buster (current stable).
They argued that rpc.gssd's crashes were possibly rooted in NFS stack instabilities. So, about three weeks ago, I upgraded two of my server instances.
Unexpectedly, not only did the problem not disappear, but the frequency of the segmentation violations even increased slightly.

Debian Stretch ships with nfs-common v1.3.4-2.1 and Buster with nfs-common v1.3.4-2.5. So both are based on the same nfs-common point release.


As a consequence, about a week ago, I decided to investigate the problem more deeply by stracing the rpc.gssd daemon while it was running.
Since then, the segmentation violations have been gone, but now lots of complaints of the following type appear in the system log:

 Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
 Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info


This behaviour seems somewhat strange to me.
But one possible explanation could be: the execution speed of rpc.gssd slows down while being straced, and the "true" reason for the segmentation violations pops up.
I would argue that rpc.gssd trying to parse non-existent files points to unsound and defective behaviour of the RPC GSS user-space daemon implementation anyway.


Best and a nice weekend
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

________________________________________
From: J. Bruce Fields <bfields@fieldses.org>
Sent: Saturday, June 20, 2020 00:04
To: Kraus, Sebastian
Cc: linux-nfs@vger.kernel.org
Subject: Re: RPC Pipefs: Frequent parsing errors in client database

On Fri, Jun 19, 2020 at 09:24:27PM +0000, Kraus, Sebastian wrote:
> Hi all,
> for several weeks now, I have been seeing, on a regular basis, errors like the following in the system log of one of my NFSv4 file servers:
>
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt3bb/info: No such file or directory
> Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info

I'm not sure what exactly is happening.

Are the log messages the only problem you're seeing, or is there some
other problem?

--b.

>
> Looks like premature closing of client connections.
> The security flavor of the NFS export is set to krb5p (integrity+privacy).
>
> Does anyone have a hint on how to track down the problem efficiently?
>
>
> Best and thanks
> Sebastian
>
>
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
>
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RPC Pipefs: Frequent parsing errors in client database
  2020-06-20 11:35   ` Kraus, Sebastian
@ 2020-06-20 17:03     ` J. Bruce Fields
  2020-06-20 21:08       ` Kraus, Sebastian
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-20 17:03 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs

On Sat, Jun 20, 2020 at 11:35:56AM +0000, Kraus, Sebastian wrote:
> In consequence, about a week ago, I decided to investigate the problem
> in a deep manner by stracing the rpc.gssd daemon while running.  Since
> then, the segmentation violations were gone, but now lots of
> complaints of the following type appear in the system log:
> 
>  Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open
>  nfsd4_cb/clnt3bb/info: No such file or directory Jun 19 11:14:00 all
>  rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
> 
> 
> This behaviour seems somehow strange to me.  But, one possible
> explanation could be: The execution speed of rpc.gssd slows down while
> being straced and the "true" reason for the segmentation violations
> pops up.  I would argue, rpc.gssd trying to parse non-existing files
> points anyway to an insane and defective behaviour of the RPC GSS user
> space daemon implementation.

Those files under rpc_pipefs come and go.  rpc.gssd monitors directories
under rpc_pipefs to learn about changes, and then tries to parse the
files under any new directories.

The thing is, if rpc.gssd is a little fast, I think it's possible that
it could get the notification about clnt3bb/ being created, and try to
look up "info", before "info" itself is actually created.

Or alternatively, if clnt3bb/ is short-lived, it might not look up
"info" until the directory's already been deleted again.

Neither problem should be fatal--rpc.gssd will get another update and
adjust to the new situation soon enough.

So it may be that the real error here is an unconditional log message
in a case that's expected, not actually an error.

Or I could be wrong and maybe something else is happening.

But I think it'd be more useful to stay focused on the segfaults.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RPC Pipefs: Frequent parsing errors in client database
  2020-06-20 17:03     ` J. Bruce Fields
@ 2020-06-20 21:08       ` Kraus, Sebastian
  2020-06-22 22:36         ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Kraus, Sebastian @ 2020-06-20 21:08 UTC (permalink / raw)
  To: linux-nfs; +Cc: J. Bruce Fields

Hi Bruce,

>> But I think it'd be more useful to stay focused on the segfaults.

Is it a clever idea to analyze core dumps? Or are there other much better debugging techniques w.r.t. RPC daemons?
I now do more tests while fiddling around with the time-out parameters "-T" and "-t" on the command line of rpc.gssd.

There are several things I do not really understand about the trace shown below:

1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?
2) Why are there two identifiers clnt36e and clnt36f being used for the same client?
3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?


Jun 18 18:55:07 server  rpcbind[30464]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server  rpcbind[30465]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server  rpcbind[30466]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server  rpcbind[30467]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server  rpcbind[30468]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server  rpcbind[30469]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server  rpcbind[30470]: connect from 130.149.XXX.YYY to null()
Jun 18 18:55:07 server  rpcbind[30471]: connect from 130.149.XXX.YYY to getport/addr(nfs)
Jun 18 18:55:07 server  rpc.svcgssd[20418]: leaving poll
Jun 18 18:55:07 server  rpc.svcgssd[20418]: handling null request
Jun 18 18:55:07 server  rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
Jun 18 18:55:07 server  rpc.svcgssd[20418]: sname = host/client.domain.tu-berlin.de@TU-BERLIN.DE
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_getpwnam: name 'host/client.domain.tu-berlin.de@TU-BERLIN.DE' domain '(null)': resulting localname 'host/test0815.chem.tu-berlin.de'
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_ldap: reconnecting to LDAP server...
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_ldap: reconnected to LDAP server  ldaps://ldap-server.tu-berlin.de after 1 attempt
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
Jun 18 18:55:07 server  rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
Jun 18 18:55:07 server  rpc.svcgssd[20418]: doing downcall
Jun 18 18:55:07 server  rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: host@client.domain.tu-berlin.de, uid: -1, gid: -1, num aux grps: 0:
Jun 18 18:55:07 server  rpc.svcgssd[20418]: sending null reply
Jun 18 18:55:07 server  rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
Jun 18 18:55:07 server  rpc.svcgssd[20418]: finished handling null request
Jun 18 18:55:07 server  rpc.svcgssd[20418]: entering poll

Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36e) ev->mask (0x40000100)
Jun 18 18:55:07 server  rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt36e/info: No such file or directory
Jun 18 18:55:07 server  rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt36e/info

Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (krb5) ev->mask (0x00000200)
Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (gssd) ev->mask (0x00000200)
Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (info) ev->mask (0x00000200)
Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36f) ev->mask (0x40000100)
Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)
Jun 18 18:55:07 server  rpc.svcgssd[20418]: leaving poll
Jun 18 18:55:07 server  rpc.svcgssd[20418]: handling null request
Jun 18 18:55:07 server  rpc.gssd[23620]: 
                                     handle_gssd_upcall: 'mech=krb5 uid=0 target=host@client.domain.tu-berlin.de service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt36f)
Jun 18 18:55:07 server  rpc.gssd[23620]: krb5_use_machine_creds: uid 0 tgtname host@client.domain.tu-berlin.de
Jun 18 18:55:07 server  rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
Jun 18 18:55:07 server  rpc.gssd[23620]: Full hostname for 'client.domain.tu-berlin.de' is 'client.domain.tu-berlin.de'
Jun 18 18:55:07 server  rpc.gssd[23620]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
Jun 18 18:55:07 server  rpc.gssd[23620]: Success getting keytab entry for 'nfs/server.domain.tu-berlin.de@TU-BERLIN.DE'
Jun 18 18:55:07 server  rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
Jun 18 18:55:07 server  rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
Jun 18 18:55:07 server  rpc.gssd[23620]: creating tcp client for server  client.domain.tu-berlin.de
Jun 18 18:55:07 server  rpc.gssd[23620]: DEBUG: port already set to 39495
Jun 18 18:55:07 server  rpc.gssd[23620]: creating context with server  host@client.domain.tu-berlin.de
Jun 18 18:55:07 server  rpc.gssd[23620]: in authgss_create_default()
Jun 18 18:55:07 server  rpc.gssd[23620]: in authgss_create()
Jun 18 18:55:07 server  rpc.gssd[23620]: authgss_create: name is 0x7fa5ac011970
Jun 18 18:55:07 server  rpc.gssd[23620]: authgss_create: gd->name is 0x7fa5ac005770
Jun 18 18:55:07 server  rpc.gssd[23620]: in authgss_refresh()
Jun 18 18:55:07 server  rpc.svcgssd[20418]: sname = host/client.domain.tu-berlin.de@TU-BERLIN.DE
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_getpwnam: name 'host/client.domain.tu-berlin.de@TU-BERLIN.DE' domain '(null)': resulting localname 'host/client.domain.tu-berlin.de'
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
Jun 18 18:55:07 server  rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
Jun 18 18:55:07 server  rpc.svcgssd[20418]: doing downcall
Jun 18 18:55:07 server  rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: host@client.domain.tu-berlin.de, uid: -1, gid: -1, num aux grps: 0:
Jun 18 18:55:07 server  rpc.svcgssd[20418]: sending null reply
Jun 18 18:55:07 server  rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
Jun 18 18:55:07 server  rpc.svcgssd[20418]: finished handling null request
Jun 18 18:55:07 server  rpc.svcgssd[20418]: entering poll



Best and keep well
Sebastian


Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

________________________________________
From: J. Bruce Fields <bfields@fieldses.org>
Sent: Saturday, June 20, 2020 19:03
To: Kraus, Sebastian
Cc: linux-nfs@vger.kernel.org
Subject: Re: RPC Pipefs: Frequent parsing errors in client database

On Sat, Jun 20, 2020 at 11:35:56AM +0000, Kraus, Sebastian wrote:
> In consequence, about a week ago, I decided to investigate the problem
> in a deep manner by stracing the rpc.gssd daemon while running.  Since
> then, the segmentation violations were gone, but now lots of
> complaints of the following type appear in the system log:
>
>  Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open
>  nfsd4_cb/clnt3bb/info: No such file or directory Jun 19 11:14:00 all
>  rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
>
>
> This behaviour seems somehow strange to me.  But, one possible
> explanation could be: The execution speed of rpc.gssd slows down while
> being straced and the "true" reason for the segmentation violations
> pops up.  I would argue, rpc.gssd trying to parse non-existing files
> points anyway to an insane and defective behaviour of the RPC GSS user
> space daemon implementation.

Those files under rpc_pipefs come and go.  rpc.gssd monitors directories
under rpc_pipefs to learn about changes, and then tries to parse the
files under any new directories.

The thing is, if rpc.gssd is a little fast, I think it's possible that
it could get the notification about clnt3bb/ being created, and try to
look up "info", before "info" itself is actually created.

Or alternatively, if clnt3bb/ is short-lived, it might not look up
"info" until the directory's already been deleted again.

Neither problem should be fatal--rpc.gssd will get another update and
adjust to the new situation soon enough.

> So it may be that the real error here is an unconditional log message
in a case that's expected, not actually an error.

Or I could be wrong and maybe something else is happening.

But I think it'd be more useful to stay focused on the segfaults.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: RPC Pipefs: Frequent parsing errors in client database
  2020-06-20 21:08       ` Kraus, Sebastian
@ 2020-06-22 22:36         ` J. Bruce Fields
  2020-06-25 17:43           ` Strange segmentation violations of rpc.gssd in Debian Buster Kraus, Sebastian
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-22 22:36 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs

On Sat, Jun 20, 2020 at 09:08:55PM +0000, Kraus, Sebastian wrote:
> Hi Bruce,
> 
> >> But I think it'd be more useful to stay focused on the segfaults.
> 
> Is it a clever idea to analyze core dumps? Or are there other much better debugging techniques w.r.t. RPC daemons?

If we could at least get a backtrace out of the core dump that could be
useful.

> I now do more tests while fiddling around with the time-out parameters "-T" and "-t" on the command line of rpc.gssd.
> 
> There are several things I do not really understand about the trace shown below:
> 
> 1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?

It doesn't know beforehand, in the scenarios I described.

> 2) Why are there two identifiers clnt36e and clnt36f being used for the same client?

This is actually happening on an NFS server; the rpc client in question
is the callback client used to do things like send delegation recalls
back to the NFS client.

I'm not sure why two different callback clients are being created here,
but there's nothing inherently weird about that.

> 3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?

Off the top of my head, I don't know, we'd probably need to look through
header files or inotify man pages for the definitions of those masks.

--b.

> 
> 
> Jun 18 18:55:07 server  rpcbind[30464]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server  rpcbind[30465]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server  rpcbind[30466]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server  rpcbind[30467]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server  rpcbind[30468]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server  rpcbind[30469]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server  rpcbind[30470]: connect from 130.149.XXX.YYY to null()
> Jun 18 18:55:07 server  rpcbind[30471]: connect from 130.149.XXX.YYY to getport/addr(nfs)
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: leaving poll
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: handling null request
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: sname = host/client.domain.tu-berlin.de@TU-BERLIN.DE
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_getpwnam: name 'host/client.domain.tu-berlin.de@TU-BERLIN.DE' domain '(null)': resulting localname 'host/test0815.chem.tu-berlin.de'
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_ldap: reconnecting to LDAP server...
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_ldap: reconnected to LDAP server  ldaps://ldap-server.tu-berlin.de after 1 attempt
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: doing downcall
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: host@client.domain.tu-berlin.de, uid: -1, gid: -1, num aux grps: 0:
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: sending null reply
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: finished handling null request
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: entering poll
> 
> Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36e) ev->mask (0x40000100)
> Jun 18 18:55:07 server  rpc.gssd[23620]: ERROR: can't open nfsd4_cb/clnt36e/info: No such file or directory
> Jun 18 18:55:07 server  rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt36e/info
> 
> Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (krb5) ev->mask (0x00000200)
> Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (gssd) ev->mask (0x00000200)
> Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (info) ev->mask (0x00000200)
> Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt36f) ev->mask (0x40000100)
> Jun 18 18:55:07 server  rpc.gssd[23620]: inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: leaving poll
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: handling null request
> Jun 18 18:55:07 server  rpc.gssd[23620]: 
>                                      handle_gssd_upcall: 'mech=krb5 uid=0 target=host@client.domain.tu-berlin.de service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt36f)
> Jun 18 18:55:07 server  rpc.gssd[23620]: krb5_use_machine_creds: uid 0 tgtname host@client.domain.tu-berlin.de
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: svcgssd_limit_krb5_enctypes: Calling gss_set_allowable_enctypes with 7 enctypes from the kernel
> Jun 18 18:55:07 server  rpc.gssd[23620]: Full hostname for 'client.domain.tu-berlin.de' is 'client.domain.tu-berlin.de'
> Jun 18 18:55:07 server  rpc.gssd[23620]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
> Jun 18 18:55:07 server  rpc.gssd[23620]: Success getting keytab entry for 'nfs/server.domain.tu-berlin.de@TU-BERLIN.DE'
> Jun 18 18:55:07 server  rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
> Jun 18 18:55:07 server  rpc.gssd[23620]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1592526348
> Jun 18 18:55:07 server  rpc.gssd[23620]: creating tcp client for server  client.domain.tu-berlin.de
> Jun 18 18:55:07 server  rpc.gssd[23620]: DEBUG: port already set to 39495
> Jun 18 18:55:07 server  rpc.gssd[23620]: creating context with server  host@client.domain.tu-berlin.de
> Jun 18 18:55:07 server  rpc.gssd[23620]: in authgss_create_default()
> Jun 18 18:55:07 server  rpc.gssd[23620]: in authgss_create()
> Jun 18 18:55:07 server  rpc.gssd[23620]: authgss_create: name is 0x7fa5ac011970
> Jun 18 18:55:07 server  rpc.gssd[23620]: authgss_create: gd->name is 0x7fa5ac005770
> Jun 18 18:55:07 server  rpc.gssd[23620]: in authgss_refresh()
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: sname = host/client.domain.tu-berlin.de@TU-BERLIN.DE
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: calling nsswitch->princ_to_ids
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nss_getpwnam: name 'host/client.domain.tu-berlin.de@TU-BERLIN.DE' domain '(null)': resulting localname 'host/client.domain.tu-berlin.de'
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: nsswitch->princ_to_ids returned -2
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: nfs4_gss_princ_to_ids: final return value is -2
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: DEBUG: serialize_krb5_ctx: lucid version!
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: protocol 1
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: prepare_krb5_rfc4121_buffer: serializing key with enctype 18 and size 32
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: doing downcall
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: mech: krb5, hndl len: 4, ctx len 52, timeout: 1592510348 (11041 from now), clnt: host@client.domain.tu-berlin.de, uid: -1, gid: -1, num aux grps: 0:
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: sending null reply
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: writing message: \x \x608202cf06092a864886f71201020201006e8202be308202baa003020105a10302010ea20703050020000000a3820194618201903082018ca003020105a10e1b0c54552d4245524c494e2e4445a2273025a003020103a11e301c1b036e66731b15616c6c2e6368656d2e74752d6265726c696e2e6465a382014a30820146a003
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: finished handling null request
> Jun 18 18:55:07 server  rpc.svcgssd[20418]: entering poll
> 
> 
> 
> Best and keep well
> Sebastian
> 
> 
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
> 
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
> 
> ________________________________________
> From: J. Bruce Fields <bfields@fieldses.org>
> Sent: Saturday, June 20, 2020 19:03
> To: Kraus, Sebastian
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: RPC Pipefs: Frequent parsing errors in client database
> 
> On Sat, Jun 20, 2020 at 11:35:56AM +0000, Kraus, Sebastian wrote:
> > In consequence, about a week ago, I decided to investigate the problem
> > in a deep manner by stracing the rpc.gssd daemon while running.  Since
> > then, the segmentation violations were gone, but now lots of
> > complaints of the following type appear in the system log:
> >
> >  Jun 19 11:14:00 all rpc.gssd[23620]: ERROR: can't open
> >  nfsd4_cb/clnt3bb/info: No such file or directory Jun 19 11:14:00 all
> >  rpc.gssd[23620]: ERROR: failed to parse nfsd4_cb/clnt3bb/info
> >
> >
> > This behaviour seems somehow strange to me.  But, one possible
> > explanation could be: The execution speed of rpc.gssd slows down while
> > being straced and the "true" reason for the segmentation violations
> > pops up.  I would argue, rpc.gssd trying to parse non-existing files
> > points anyway to an insane and defective behaviour of the RPC GSS user
> > space daemon implementation.
> 
> Those files under rpc_pipefs come and go.  rpc.gssd monitors directories
> under rpc_pipefs to learn about changes, and then tries to parse the
> files under any new directories.
> 
> The thing is, if rpc.gssd is a little fast, I think it's possible that
> it could get the notification about clnt3bb/ being created, and try to
> look up "info", before "info" itself is actually created.
> 
> Or alternatively, if clnt3bb/ is short-lived, it might not look up
> "info" until the directory's already been deleted again.
> 
> Neither problem should be fatal--rpc.gssd will get another update and
> adjust to the new situation soon enough.
> 
> > So it may be that the real error here is an unconditional log message
> in a case that's expected, not actually an error.
> 
> Or I could be wrong and maybe something else is happening.
> 
> But I think it'd be more useful to stay focused on the segfaults.
> 
> --b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-22 22:36         ` J. Bruce Fields
@ 2020-06-25 17:43           ` Kraus, Sebastian
  2020-06-25 20:14             ` J. Bruce Fields
  2020-06-25 21:44             ` Doug Nazar
  0 siblings, 2 replies; 21+ messages in thread
From: Kraus, Sebastian @ 2020-06-25 17:43 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs

Dear Bruce,
I got the following stack trace and gdb backtrace:

root@all:~# coredumpctl debug
           PID: 6356 (rpc.gssd)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Thu 2020-06-25 11:46:08 CEST (3h 4min ago)
  Command Line: /usr/sbin/rpc.gssd -vvvvvvv -rrrrrrr -t 3600 -T 10
    Executable: /usr/sbin/rpc.gssd
 Control Group: /system.slice/rpc-gssd.service
          Unit: rpc-gssd.service
         Slice: system.slice
       Boot ID: XXXXXXXXXXXXXXXXXXXXXXXXXXX
    Machine ID: YYYYYYYYYYYYYYYYYYYYYYYYYYYY
      Hostname: XYZ
       Storage: /var/lib/systemd/coredump/core.rpc\x2egssd.0.7f31136228274af0a1a855b91ad1e75c.6356.1593078368000000.lz4
       Message: Process 6356 (rpc.gssd) of user 0 dumped core.
                
                Stack trace of thread 14174:
                #0  0x000056233fff038e n/a (rpc.gssd)
                #1  0x000056233fff09f8 n/a (rpc.gssd)
                #2  0x000056233fff0b92 n/a (rpc.gssd)
                #3  0x000056233fff13b3 n/a (rpc.gssd)
                #4  0x00007fb2eb8dbfa3 start_thread (libpthread.so.0)
                #5  0x00007fb2eb80c4cf __clone (libc.so.6)
                
                Stack trace of thread 6356:
                #0  0x00007fb2eb801819 __GI___poll (libc.so.6)
                #1  0x00007fb2eb6e7207 send_dg (libresolv.so.2)
                #2  0x00007fb2eb6e4c43 __GI___res_context_query (libresolv.so.2)
                #3  0x00007fb2eb6bf536 __GI__nss_dns_gethostbyaddr2_r (libnss_dns.so.2)
                #4  0x00007fb2eb6bf823 _nss_dns_gethostbyaddr_r (libnss_dns.so.2)
                #5  0x00007fb2eb81dee2 __gethostbyaddr_r (libc.so.6)
                #6  0x00007fb2eb8267d5 gni_host_inet_name (libc.so.6)
                #7  0x000056233ffef455 n/a (rpc.gssd)
                #8  0x000056233ffef82c n/a (rpc.gssd)
                #9  0x000056233fff01d0 n/a (rpc.gssd)
                #10 0x00007fb2ebab49ba n/a (libevent-2.1.so.6)
                #11 0x00007fb2ebab5537 event_base_loop (libevent-2.1.so.6)
                #12 0x000056233ffedeaa n/a (rpc.gssd)
                #13 0x00007fb2eb73709b __libc_start_main (libc.so.6)
                #14 0x000056233ffee03a n/a (rpc.gssd)

GNU gdb (Debian 8.2.1-2+b3) 8.2.1
[...]
Reading symbols from /usr/sbin/rpc.gssd...(no debugging symbols found)...done.
[New LWP 14174]
[New LWP 6356]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/rpc.gssd -vvvvvvv -rrrrrrr -t 3600 -T 10'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056233fff038e in ?? ()
[Current thread is 1 (Thread 0x7fb2eaeba700 (LWP 14174))]
(gdb) bt
#0  0x000056233fff038e in ?? ()
#1  0x000056233fff09f8 in ?? ()
#2  0x000056233fff0b92 in ?? ()
#3  0x000056233fff13b3 in ?? ()
#4  0x00007fb2eb8dbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#5  0x00007fb2eb80c4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) quit


I am not an expert in analyzing stack traces and backtraces. Is there anything meaningful you are able to extract from the trace?
As far as I can see, thread 14174 caused the segmentation violation just after its birth on clone.
Please correct me if I am wrong.
It seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable.
So far, I was not able to find such a package.
What's your opinion about the trace?


Best and Thanks
Sebastian

_____________________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Email: sebastian.kraus@tu-berlin.de

________________________________________
From: linux-nfs-owner@vger.kernel.org <linux-nfs-owner@vger.kernel.org> on behalf of J. Bruce Fields <bfields@fieldses.org>
Sent: Tuesday, June 23, 2020 00:36
To: Kraus, Sebastian
Cc: linux-nfs@vger.kernel.org
Subject: Re: RPC Pipefs: Frequent parsing errors in client database

On Sat, Jun 20, 2020 at 09:08:55PM +0000, Kraus, Sebastian wrote:
> Hi Bruce,
>
> >> But I think it'd be more useful to stay focused on the segfaults.
>
> is it a clever idea to analyze core dumps? Or are there other much better debugging techniques w.r.t. RPC daemons?

If we could at least get a backtrace out of the core dump that could be
useful.

> I now do more tests while fiddling around with the time-out parameters "-T" and "-t" on the command line of rpc.gssd.
>
> There are several things I do not really understand about the trace shown below:
>
> 1) How can it be useful that the rpc.gssd daemon tries to parse the info file although it knows about its absence beforehand?

It doesn't know beforehand, in the scenarios I described.

> 2) Why are there two identifiers clnt36e and clnt36f being used for the same client?

This is actually happening on an NFS server, the rpc client in question
is the callback client used to do things like send delegation recalls
back to the NFS client.

I'm not sure why two different callback clients are being created here,
but there's nothing inherently weird about that.

> 3) What does the <?> in "inotify event for clntdir (nfsd4_cb/clnt36e) - ev->wd (600) ev->name (<?>) ev->mask (0x00008000)" mean?

Off the top of my head, I don't know, we'd probably need to look through
header files or inotify man pages for the definitions of those masks.

--b.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-25 17:43           ` Strange segmentation violations of rpc.gssd in Debian Buster Kraus, Sebastian
@ 2020-06-25 20:14             ` J. Bruce Fields
  2020-06-25 21:44             ` Doug Nazar
  1 sibling, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-25 20:14 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs

On Thu, Jun 25, 2020 at 05:43:53PM +0000, Kraus, Sebastian wrote:
> Dear Bruce,
> I got the following stack and back trace:
> 
> [...]
> 
> I am not an expert in analyzing stack and backtraces. Is there anything meaningful, you are able to extract from the trace?
> As far as I see, thread 14174 caused the segmentation violation just after its birth on clone. 
> Please correct me, if I am in error.
> Seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable. 

Have you reported a debian bug?  They might know how to get a good trace
out of it.

--b.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-25 17:43           ` Strange segmentation violations of rpc.gssd in Debian Buster Kraus, Sebastian
  2020-06-25 20:14             ` J. Bruce Fields
@ 2020-06-25 21:44             ` Doug Nazar
  2020-06-26 12:31               ` Kraus, Sebastian
  1 sibling, 1 reply; 21+ messages in thread
From: Doug Nazar @ 2020-06-25 21:44 UTC (permalink / raw)
  To: Kraus, Sebastian, J. Bruce Fields; +Cc: linux-nfs

On 2020-06-25 13:43, Kraus, Sebastian wrote:
> [Current thread is 1 (Thread 0x7fb2eaeba700 (LWP 14174))]
> (gdb) bt
> #0  0x000056233fff038e in ?? ()
> #1  0x000056233fff09f8 in ?? ()
> #2  0x000056233fff0b92 in ?? ()
> #3  0x000056233fff13b3 in ?? ()
> #4  0x00007fb2eb8dbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
> #5  0x00007fb2eb80c4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) quit
>
>
> I am not an expert in analyzing stack and backtraces. Is there anything meaningful, you are able to extract from the trace?
> As far as I see, thread 14174 caused the segmentation violation just after its birth on clone.
> Please correct me, if I am in error.
> Seems Debian Buster does not ship any dedicated package with debug symbols for the rpc.gssd executable.
> So far, I was not able to find such a package.
> What's your opinon about the trace?

You'll need to install the debug symbols for your distribution/package. 
A quick google links to https://wiki.debian.org/HowToGetABacktrace. 
Those ?? lines should then be replaced with function, file & line numbers.
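On Debian Buster that boils down to something like the following (a sketch; the archive line is release-specific, and the package name carrying the rpc.gssd symbols, assumed here to be nfs-common-dbgsym, should be verified with apt search):

```shell
# Enable the debug-symbol archive (assumption: release is buster)
echo "deb http://deb.debian.org/debian-debug/ buster-debug main" \
    | sudo tee /etc/apt/sources.list.d/debug.list
sudo apt update
sudo apt install nfs-common-dbgsym   # assumed package name; verify first
# Re-open the core dump; gdb should now resolve the rpc.gssd frames
coredumpctl debug
```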

I've been following this with interest since it used to happen to me a 
lot. It hasn't happened recently, even though every so often I spend a few 
hours trying to re-create it so I can debug it.

Doug


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-25 21:44             ` Doug Nazar
@ 2020-06-26 12:31               ` Kraus, Sebastian
  2020-06-26 17:23                 ` Doug Nazar
  0 siblings, 1 reply; 21+ messages in thread
From: Kraus, Sebastian @ 2020-06-26 12:31 UTC (permalink / raw)
  To: linux-nfs; +Cc: Doug Nazar, J. Bruce Fields

Hi Doug, Dear Bruce,
FYI, I filed a bug with nfs-common package in Debian Buster:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=963746

Hope this helps to track down the problem further and shed some light on the mysterious segfaults of rpc.gssd. 

>> A quick google links to https://wiki.debian.org/HowToGetABacktrace.
BTW: Thanks very much for the link. It helped me find the right dbgsym packages and produce a more readable and hopefully valuable backtrace. ;-)

I also came across a similar, still-open issue in the CentOS bug tracker:
https://bugs.centos.org/view.php?id=15895


Best and Thanks for your support up to now
Sebastian



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 12:31               ` Kraus, Sebastian
@ 2020-06-26 17:23                 ` Doug Nazar
  2020-06-26 19:46                   ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Doug Nazar @ 2020-06-26 17:23 UTC (permalink / raw)
  To: Kraus, Sebastian, linux-nfs; +Cc: J. Bruce Fields

On 2020-06-26 08:31, Kraus, Sebastian wrote:
> Hi Doug, Dear Bruce,
> FYI, I filed a bug with nfs-common package in Debian Buster:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=963746
>
> Hope this helps to track down the problem further and shed some light on the mysterious segfaults of rpc.gssd.

Ok, I think I see what's going on. The struct clnt_info is getting freed 
out from under the upcall thread. In this case it immediately got reused 
for another client, which zeroed the struct and was in the process of 
looking up the info for its client; hence the protocol & server fields 
were null in the upcall thread.

Explains why I haven't been able to recreate it. Thanks for the stack 
trace Sebastian.

Bruce, I can't see any locking/reference counting around struct 
clnt_info. It just gets destroyed when receiving the inotify. Or should 
it be deep copied when starting an upcall? Am I missing something?

Doug

Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt50e) ev->mask (0x40000100)
Jun 25 11:46:08 server rpc.gssd[6356]: handle_gssd_upcall: 'mech=krb5 uid=0 target=host@client.domain.tu-berlin.de service=nfs enctypes=18,17,16,23,3,1,2 ' (nfsd4_cb/clnt50e)
Jun 25 11:46:08 server rpc.gssd[6356]: krb5_use_machine_creds: uid 0 tgtname host@client.domain.tu-berlin.de
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (krb5) ev->mask (0x00000200)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (gssd) ev->mask (0x00000200)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (info) ev->mask (0x00000200)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for clntdir (nfsd4_cb/clnt50e) - ev->wd (75) ev->name (<?>) ev->mask (0x00008000)
Jun 25 11:46:08 server rpc.gssd[6356]: inotify event for topdir (nfsd4_cb) - ev->wd (5) ev->name (clnt50f) ev->mask (0x40000100)
Jun 25 11:46:08 server rpc.gssd[6356]: Full hostname for '' is 'client.domain.tu-berlin.de'
Jun 25 11:46:08 server rpc.gssd[6356]: Full hostname for 'server.domain.tu-berlin.de' is 'server.domain.tu-berlin.de'
Jun 25 11:46:08 server rpc.gssd[6356]: Success getting keytab entry for 'nfs/server.domain.tu-berlin.de@TU-BERLIN.DE'
Jun 25 11:46:08 server rpc.gssd[6356]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1593101766
Jun 25 11:46:08 server rpc.gssd[6356]: INFO: Credentials in CC 'FILE:/tmp/krb5ccmachine_TU-BERLIN.DE' are good until 1593101766
Jun 25 11:46:08 server rpc.gssd[6356]: creating (null) client for server (null)
Jun 25 11:46:08 all kernel: rpc.gssd[14174]: segfault at 0 ip 000056233fff038e sp 00007fb2eaeb9880 error 4 in rpc.gssd[56233ffed000+9000]


Thread 1 (Thread 0x7fb2eaeba700 (LWP 14174)):
#0  0x000056233fff038e in create_auth_rpc_client (clp=clp@entry=0x562341008fa0, tgtname=tgtname@entry=0x562341011c8f "host@client.domain.tu-berlin.de", clnt_return=clnt_return@entry=0x7fb2eaeb9de8, auth_return=auth_return@entry=0x7fb2eaeb9d50, uid=uid@entry=0, cred=cred@entry=0x0, authtype=0) at gssd_proc.c:352

Thread 2 (Thread 0x7fb2eb6d9740 (LWP 6356)):
#12 0x000056233ffef82c in gssd_read_service_info (clp=0x562341008fa0, dirfd=11) at gssd.c:326



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 17:23                 ` Doug Nazar
@ 2020-06-26 19:46                   ` J. Bruce Fields
  2020-06-26 20:15                     ` Doug Nazar
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-26 19:46 UTC (permalink / raw)
  To: Doug Nazar; +Cc: Kraus, Sebastian, linux-nfs, Steve Dickson, Olga Kornievskaia

On Fri, Jun 26, 2020 at 01:23:54PM -0400, Doug Nazar wrote:
> Ok, I think I see what's going on. The struct clnt_info is getting
> freed out from under the upcall thread. In this case it immediately
> got reused for another client which zeroed the struct and was in the
> process of looking up the info for its client, hence the protocol &
> server fields were null in the upcall thread.
> 
> Explains why I haven't been able to recreate it. Thanks for the
> stack trace Sebastian.
> 
> Bruce, I can't see any locking/reference counting around struct
> clnt_info. It just gets destroyed when receiving the inotify. Or
> should it be deep copied when starting an upcall? Am I missing
> something?

Thanks for finding that!

Staring at that code in an attempt to catch up here....

Looks like there's one main thread that watches for upcalls and other
events, then creates a new short-lived thread for each upcall.

The main thread is the only one that really manipulates the data
structure with all the clients.  So that data structure shouldn't need
any locking.  Except, as you point out, to keep the clnt_info from
disappearing out from under them.

So, yeah, either a reference count or a deep copy is probably all that's
needed, in alloc_upcall_info() and at the end of handle_krb5_upcall().

--b.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 19:46                   ` J. Bruce Fields
@ 2020-06-26 20:15                     ` Doug Nazar
  2020-06-26 21:02                       ` J. Bruce Fields
  0 siblings, 1 reply; 21+ messages in thread
From: Doug Nazar @ 2020-06-26 20:15 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Kraus, Sebastian, linux-nfs, Steve Dickson, Olga Kornievskaia


[-- Attachment #1: Type: text/plain, Size: 584 bytes --]

On 2020-06-26 15:46, J. Bruce Fields wrote:
> So, yeah, either a reference count or a deep copy is probably all that's
> needed, in alloc_upcall_info() and at the end of handle_krb5_upcall().

Slightly more complex, to handle the error cases & event tear-down, but 
this seems to work. I'm not sure how much of a hot spot this is, so I 
just went with a global mutex. Strangely, there was an existing unused 
mutex & thread_started flag.

Survives basic testing, but I don't have a reproducer. Maybe blocking 
access to the KDC would trigger it. If I get time I'll try to set up a 
test environment.

Doug


[-- Attachment #2: 0001-gssd-Refcount-struct-clnt_info-to-protect-multithrea.patch --]
[-- Type: text/plain, Size: 5751 bytes --]

From 3dcd482adf54e6ba97eaf9d881b996bdd25797db Mon Sep 17 00:00:00 2001
From: Doug Nazar <nazard@nazar.ca>
Date: Fri, 26 Jun 2020 16:02:04 -0400
Subject: [PATCH] gssd: Refcount struct clnt_info to protect multithread usage

Struct clnt_info is shared with the various upcall threads so
we need to ensure that it stays around even if the client dir
gets removed.

Reported-by: Sebastian Kraus <sebastian.kraus@tu-berlin.de>
Signed-off-by: Doug Nazar <nazard@nazar.ca>
---
 utils/gssd/gssd.c      | 59 ++++++++++++++++++++++++++++++++----------
 utils/gssd/gssd.h      |  5 ++--
 utils/gssd/gssd_proc.c |  4 +--
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/utils/gssd/gssd.c b/utils/gssd/gssd.c
index 588da0fb..36bda230 100644
--- a/utils/gssd/gssd.c
+++ b/utils/gssd/gssd.c
@@ -90,9 +90,7 @@ char *ccachedir = NULL;
 /* Avoid DNS reverse lookups on server names */
 static bool avoid_dns = true;
 static bool use_gssproxy = false;
-int thread_started = false;
-pthread_mutex_t pmutex = PTHREAD_MUTEX_INITIALIZER;
-pthread_cond_t pcond = PTHREAD_COND_INITIALIZER;
+pthread_mutex_t clp_lock = PTHREAD_MUTEX_INITIALIZER;
 
 TAILQ_HEAD(topdir_list_head, topdir) topdir_list;
 
@@ -359,9 +357,37 @@ out:
 	free(port);
 }
 
+/* Actually frees clp and fields that might be used from other
+ * threads if was last reference.
+ */
+static void
+gssd_free_client(struct clnt_info *clp)
+{
+	int refcnt;
+
+	pthread_mutex_lock(&clp_lock);
+	refcnt = --clp->refcount;
+	pthread_mutex_unlock(&clp_lock);
+	if (refcnt > 0)
+		return;
+
+	printerr(3, "freeing client %s\n", clp->relpath);
+
+	free(clp->relpath);
+	free(clp->servicename);
+	free(clp->servername);
+	free(clp->protocol);
+	free(clp);
+}
+
+/* Called when removing from clnt_list to tear down event handling.
+ * Will then free clp if was last reference.
+ */
 static void
 gssd_destroy_client(struct clnt_info *clp)
 {
+	printerr(3, "destroying client %s\n", clp->relpath);
+
 	if (clp->krb5_fd >= 0) {
 		close(clp->krb5_fd);
 		event_del(&clp->krb5_ev);
@@ -373,11 +399,7 @@ gssd_destroy_client(struct clnt_info *clp)
 	}
 
 	inotify_rm_watch(inotify_fd, clp->wd);
-	free(clp->relpath);
-	free(clp->servicename);
-	free(clp->servername);
-	free(clp->protocol);
-	free(clp);
+	gssd_free_client(clp);
 }
 
 static void gssd_scan(void);
@@ -416,11 +438,21 @@ static struct clnt_upcall_info *alloc_upcall_info(struct clnt_info *clp)
 	info = malloc(sizeof(struct clnt_upcall_info));
 	if (info == NULL)
 		return NULL;
+
+	pthread_mutex_lock(&clp_lock);
+	clp->refcount++;
+	pthread_mutex_unlock(&clp_lock);
 	info->clp = clp;
 
 	return info;
 }
 
+void free_upcall_info(struct clnt_upcall_info *info)
+{
+	gssd_free_client(info->clp);
+	free(info);
+}
+
 /* For each upcall read the upcall info into the buffer, then create a
  * thread in a detached state so that resources are released back into
  * the system without the need for a join.
@@ -438,13 +470,13 @@ gssd_clnt_gssd_cb(int UNUSED(fd), short UNUSED(which), void *data)
 	info->lbuflen = read(clp->gssd_fd, info->lbuf, sizeof(info->lbuf));
 	if (info->lbuflen <= 0 || info->lbuf[info->lbuflen-1] != '\n') {
 		printerr(0, "WARNING: %s: failed reading request\n", __func__);
-		free(info);
+		free_upcall_info(info);
 		return;
 	}
 	info->lbuf[info->lbuflen-1] = 0;
 
 	if (start_upcall_thread(handle_gssd_upcall, info))
-		free(info);
+		free_upcall_info(info);
 }
 
 static void
@@ -461,12 +493,12 @@ gssd_clnt_krb5_cb(int UNUSED(fd), short UNUSED(which), void *data)
 			sizeof(info->uid)) < (ssize_t)sizeof(info->uid)) {
 		printerr(0, "WARNING: %s: failed reading uid from krb5 "
 			 "upcall pipe: %s\n", __func__, strerror(errno));
-		free(info);
+		free_upcall_info(info);
 		return;
 	}
 
 	if (start_upcall_thread(handle_krb5_upcall, info))
-		free(info);
+		free_upcall_info(info);
 }
 
 static struct clnt_info *
@@ -501,6 +533,7 @@ gssd_get_clnt(struct topdir *tdi, const char *name)
 	clp->name = clp->relpath + strlen(tdi->name) + 1;
 	clp->krb5_fd = -1;
 	clp->gssd_fd = -1;
+	clp->refcount = 1;
 
 	TAILQ_INSERT_HEAD(&tdi->clnt_list, clp, list);
 	return clp;
@@ -651,7 +684,7 @@ gssd_scan_topdir(const char *name)
 		if (clp->scanned)
 			continue;
 
-		printerr(3, "destroying client %s\n", clp->relpath);
+		printerr(3, "orphaned client %s\n", clp->relpath);
 		saveprev = clp->list.tqe_prev;
 		TAILQ_REMOVE(&tdi->clnt_list, clp, list);
 		gssd_destroy_client(clp);
diff --git a/utils/gssd/gssd.h b/utils/gssd/gssd.h
index f4f59754..d33e4dff 100644
--- a/utils/gssd/gssd.h
+++ b/utils/gssd/gssd.h
@@ -63,12 +63,10 @@ extern unsigned int 		context_timeout;
 extern unsigned int rpc_timeout;
 extern char			*preferred_realm;
 extern pthread_mutex_t ple_lock;
-extern pthread_cond_t pcond;
-extern pthread_mutex_t pmutex;
-extern int thread_started;
 
 struct clnt_info {
 	TAILQ_ENTRY(clnt_info)	list;
+	int			refcount;
 	int			wd;
 	bool			scanned;
 	char			*name;
@@ -94,6 +92,7 @@ struct clnt_upcall_info {
 
 void handle_krb5_upcall(struct clnt_upcall_info *clp);
 void handle_gssd_upcall(struct clnt_upcall_info *clp);
+void free_upcall_info(struct clnt_upcall_info *info);
 
 
 #endif /* _RPC_GSSD_H_ */
diff --git a/utils/gssd/gssd_proc.c b/utils/gssd/gssd_proc.c
index 8fe6605b..05c1da64 100644
--- a/utils/gssd/gssd_proc.c
+++ b/utils/gssd/gssd_proc.c
@@ -730,7 +730,7 @@ handle_krb5_upcall(struct clnt_upcall_info *info)
 	printerr(2, "\n%s: uid %d (%s)\n", __func__, info->uid, clp->relpath);
 
 	process_krb5_upcall(clp, info->uid, clp->krb5_fd, NULL, NULL, NULL);
-	free(info);
+	free_upcall_info(info);
 }
 
 void
@@ -830,6 +830,6 @@ handle_gssd_upcall(struct clnt_upcall_info *info)
 out:
 	free(upcall_str);
 out_nomem:
-	free(info);
+	free_upcall_info(info);
 	return;
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 20:15                     ` Doug Nazar
@ 2020-06-26 21:02                       ` J. Bruce Fields
  2020-06-26 21:30                         ` [PATCH v2] " Doug Nazar
  0 siblings, 1 reply; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-26 21:02 UTC (permalink / raw)
  To: Doug Nazar; +Cc: Kraus, Sebastian, linux-nfs, Steve Dickson, Olga Kornievskaia

On Fri, Jun 26, 2020 at 04:15:46PM -0400, Doug Nazar wrote:
> On 2020-06-26 15:46, J. Bruce Fields wrote:
> >So, yeah, either a reference count or a deep copy is probably all that's
> >needed, in alloc_upcall_info() and at the end of handle_krb5_upcall().
> 
> Slightly more complex, to handle the error cases & event tear-down
> but this seems to work. I'm not sure how much of a hot spot this is
> so I just went with a global mutex. Strangely there was an existing
> unused mutex & thread_started flag.
> 
> Survives basic testing but I don't have a reproducer. Maybe blocking
> access to the KDC would do it. If I get time I'll try to set up a test
> environment.

Thanks, looks good.  The only thing I wonder about:

> @@ -359,9 +357,37 @@ out:
>  	free(port);
>  }
>  
> +/* Actually frees clp and fields that might be used from other
> + * threads if was last reference.
> + */
> +static void
> +gssd_free_client(struct clnt_info *clp)
> +{
> +	int refcnt;
> +
> +	pthread_mutex_lock(&clp_lock);
> +	refcnt = --clp->refcount;
> +	pthread_mutex_unlock(&clp_lock);
> +	if (refcnt > 0)
> +		return;
> +
> +	printerr(3, "freeing client %s\n", clp->relpath);
> +
> +	free(clp->relpath);
> +	free(clp->servicename);
> +	free(clp->servername);
> +	free(clp->protocol);
> +	free(clp);
> +}
> +
> +/* Called when removing from clnt_list to tear down event handling.
> + * Will then free clp if was last reference.
> + */
>  static void
>  gssd_destroy_client(struct clnt_info *clp)
>  {
> +	printerr(3, "destroying client %s\n", clp->relpath);
> +
>  	if (clp->krb5_fd >= 0) {
>  		close(clp->krb5_fd);

Unless I'm missing something--an upcall thread could still be using this
file descriptor.

If we're particularly unlucky, we could do a new open in a moment and
reuse this file descriptor number, and then writes in do_downcall()
could end up going to some other random file.

I think we want these closes done by gssd_free_client() in the !refcnt
case?

--b.

>  		event_del(&clp->krb5_ev);
> @@ -373,11 +399,7 @@ gssd_destroy_client(struct clnt_info *clp)
>  	}
>  
>  	inotify_rm_watch(inotify_fd, clp->wd);
> -	free(clp->relpath);
> -	free(clp->servicename);
> -	free(clp->servername);
> -	free(clp->protocol);
> -	free(clp);
> +	gssd_free_client(clp);
>  }
>  
>  static void gssd_scan(void);
> @@ -416,11 +438,21 @@ static struct clnt_upcall_info *alloc_upcall_info(struct clnt_info *clp)
>  	info = malloc(sizeof(struct clnt_upcall_info));
>  	if (info == NULL)
>  		return NULL;
> +
> +	pthread_mutex_lock(&clp_lock);
> +	clp->refcount++;
> +	pthread_mutex_unlock(&clp_lock);
>  	info->clp = clp;
>  
>  	return info;
>  }
>  
> +void free_upcall_info(struct clnt_upcall_info *info)
> +{
> +	gssd_free_client(info->clp);
> +	free(info);
> +}
> +
>  /* For each upcall read the upcall info into the buffer, then create a
>   * thread in a detached state so that resources are released back into
>   * the system without the need for a join.
> @@ -438,13 +470,13 @@ gssd_clnt_gssd_cb(int UNUSED(fd), short UNUSED(which), void *data)
>  	info->lbuflen = read(clp->gssd_fd, info->lbuf, sizeof(info->lbuf));
>  	if (info->lbuflen <= 0 || info->lbuf[info->lbuflen-1] != '\n') {
>  		printerr(0, "WARNING: %s: failed reading request\n", __func__);
> -		free(info);
> +		free_upcall_info(info);
>  		return;
>  	}
>  	info->lbuf[info->lbuflen-1] = 0;
>  
>  	if (start_upcall_thread(handle_gssd_upcall, info))
> -		free(info);
> +		free_upcall_info(info);
>  }
>  
>  static void
> @@ -461,12 +493,12 @@ gssd_clnt_krb5_cb(int UNUSED(fd), short UNUSED(which), void *data)
>  			sizeof(info->uid)) < (ssize_t)sizeof(info->uid)) {
>  		printerr(0, "WARNING: %s: failed reading uid from krb5 "
>  			 "upcall pipe: %s\n", __func__, strerror(errno));
> -		free(info);
> +		free_upcall_info(info);
>  		return;
>  	}
>  
>  	if (start_upcall_thread(handle_krb5_upcall, info))
> -		free(info);
> +		free_upcall_info(info);
>  }
>  
>  static struct clnt_info *
> @@ -501,6 +533,7 @@ gssd_get_clnt(struct topdir *tdi, const char *name)
>  	clp->name = clp->relpath + strlen(tdi->name) + 1;
>  	clp->krb5_fd = -1;
>  	clp->gssd_fd = -1;
> +	clp->refcount = 1;
>  
>  	TAILQ_INSERT_HEAD(&tdi->clnt_list, clp, list);
>  	return clp;
> @@ -651,7 +684,7 @@ gssd_scan_topdir(const char *name)
>  		if (clp->scanned)
>  			continue;
>  
> -		printerr(3, "destroying client %s\n", clp->relpath);
> +		printerr(3, "orphaned client %s\n", clp->relpath);
>  		saveprev = clp->list.tqe_prev;
>  		TAILQ_REMOVE(&tdi->clnt_list, clp, list);
>  		gssd_destroy_client(clp);
> diff --git a/utils/gssd/gssd.h b/utils/gssd/gssd.h
> index f4f59754..d33e4dff 100644
> --- a/utils/gssd/gssd.h
> +++ b/utils/gssd/gssd.h
> @@ -63,12 +63,10 @@ extern unsigned int 		context_timeout;
>  extern unsigned int rpc_timeout;
>  extern char			*preferred_realm;
>  extern pthread_mutex_t ple_lock;
> -extern pthread_cond_t pcond;
> -extern pthread_mutex_t pmutex;
> -extern int thread_started;
>  
>  struct clnt_info {
>  	TAILQ_ENTRY(clnt_info)	list;
> +	int			refcount;
>  	int			wd;
>  	bool			scanned;
>  	char			*name;
> @@ -94,6 +92,7 @@ struct clnt_upcall_info {
>  
>  void handle_krb5_upcall(struct clnt_upcall_info *clp);
>  void handle_gssd_upcall(struct clnt_upcall_info *clp);
> +void free_upcall_info(struct clnt_upcall_info *info);
>  
>  
>  #endif /* _RPC_GSSD_H_ */
> diff --git a/utils/gssd/gssd_proc.c b/utils/gssd/gssd_proc.c
> index 8fe6605b..05c1da64 100644
> --- a/utils/gssd/gssd_proc.c
> +++ b/utils/gssd/gssd_proc.c
> @@ -730,7 +730,7 @@ handle_krb5_upcall(struct clnt_upcall_info *info)
>  	printerr(2, "\n%s: uid %d (%s)\n", __func__, info->uid, clp->relpath);
>  
>  	process_krb5_upcall(clp, info->uid, clp->krb5_fd, NULL, NULL, NULL);
> -	free(info);
> +	free_upcall_info(info);
>  }
>  
>  void
> @@ -830,6 +830,6 @@ handle_gssd_upcall(struct clnt_upcall_info *info)
>  out:
>  	free(upcall_str);
>  out_nomem:
> -	free(info);
> +	free_upcall_info(info);
>  	return;
>  }
> -- 
> 2.26.2
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 21:02                       ` J. Bruce Fields
@ 2020-06-26 21:30                         ` Doug Nazar
  2020-06-26 21:44                           ` J. Bruce Fields
  2020-06-29  5:39                           ` Kraus, Sebastian
  0 siblings, 2 replies; 21+ messages in thread
From: Doug Nazar @ 2020-06-26 21:30 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Kraus, Sebastian, linux-nfs, Steve Dickson, Olga Kornievskaia


[-- Attachment #1: Type: text/plain, Size: 684 bytes --]

On 2020-06-26 17:02, J. Bruce Fields wrote:
> Unless I'm missing something--an upcall thread could still be using this
> file descriptor.
>
> If we're particularly unlucky, we could do a new open in a moment and
> reuse this file descriptor number, and then writes in do_downcall()
> could end up going to some other random file.
>
> I think we want these closes done by gssd_free_client() in the !refcnt
> case?

Makes sense. I was thinking more that it was an abort situation and we 
shouldn't be sending any data to the kernel, but re-use is definitely a 
concern.

I've split it so that we are removed from the event loop in destroy() 
but the close happens in free().

Doug


[-- Attachment #2: 0001-gssd-Refcount-struct-clnt_info-to-protect-multithrea.patch --]
[-- Type: text/plain, Size: 5967 bytes --]

From 8ef49081e8a42bfa05bb63265cd4f951e2b23413 Mon Sep 17 00:00:00 2001
From: Doug Nazar <nazard@nazar.ca>
Date: Fri, 26 Jun 2020 16:02:04 -0400
Subject: [PATCH] gssd: Refcount struct clnt_info to protect multithread usage

Struct clnt_info is shared with the various upcall threads so
we need to ensure that it stays around even if the client dir
gets removed.

Reported-by: Sebastian Kraus <sebastian.kraus@tu-berlin.de>
Signed-off-by: Doug Nazar <nazard@nazar.ca>
---
 utils/gssd/gssd.c      | 67 ++++++++++++++++++++++++++++++++----------
 utils/gssd/gssd.h      |  5 ++--
 utils/gssd/gssd_proc.c |  4 +--
 3 files changed, 55 insertions(+), 21 deletions(-)

diff --git a/utils/gssd/gssd.c b/utils/gssd/gssd.c
index 588da0fb..b40c3220 100644
--- a/utils/gssd/gssd.c
+++ b/utils/gssd/gssd.c
@@ -90,9 +90,7 @@ char *ccachedir = NULL;
 /* Avoid DNS reverse lookups on server names */
 static bool avoid_dns = true;
 static bool use_gssproxy = false;
-int thread_started = false;
-pthread_mutex_t pmutex = PTHREAD_MUTEX_INITIALIZER;
-pthread_cond_t pcond = PTHREAD_COND_INITIALIZER;
+pthread_mutex_t clp_lock = PTHREAD_MUTEX_INITIALIZER;
 
 TAILQ_HEAD(topdir_list_head, topdir) topdir_list;
 
@@ -359,20 +357,28 @@ out:
 	free(port);
 }
 
+/* Actually frees clp and fields that might be used from other
+ * threads if was last reference.
+ */
 static void
-gssd_destroy_client(struct clnt_info *clp)
+gssd_free_client(struct clnt_info *clp)
 {
-	if (clp->krb5_fd >= 0) {
+	int refcnt;
+
+	pthread_mutex_lock(&clp_lock);
+	refcnt = --clp->refcount;
+	pthread_mutex_unlock(&clp_lock);
+	if (refcnt > 0)
+		return;
+
+	printerr(3, "freeing client %s\n", clp->relpath);
+
+	if (clp->krb5_fd >= 0)
 		close(clp->krb5_fd);
-		event_del(&clp->krb5_ev);
-	}
 
-	if (clp->gssd_fd >= 0) {
+	if (clp->gssd_fd >= 0)
 		close(clp->gssd_fd);
-		event_del(&clp->gssd_ev);
-	}
 
-	inotify_rm_watch(inotify_fd, clp->wd);
 	free(clp->relpath);
 	free(clp->servicename);
 	free(clp->servername);
@@ -380,6 +386,24 @@ gssd_destroy_client(struct clnt_info *clp)
 	free(clp);
 }
 
+/* Called when removing from clnt_list to tear down event handling.
+ * Will then free clp if was last reference.
+ */
+static void
+gssd_destroy_client(struct clnt_info *clp)
+{
+	printerr(3, "destroying client %s\n", clp->relpath);
+
+	if (clp->krb5_fd >= 0)
+		event_del(&clp->krb5_ev);
+
+	if (clp->gssd_fd >= 0)
+		event_del(&clp->gssd_ev);
+
+	inotify_rm_watch(inotify_fd, clp->wd);
+	gssd_free_client(clp);
+}
+
 static void gssd_scan(void);
 
 static int
@@ -416,11 +440,21 @@ static struct clnt_upcall_info *alloc_upcall_info(struct clnt_info *clp)
 	info = malloc(sizeof(struct clnt_upcall_info));
 	if (info == NULL)
 		return NULL;
+
+	pthread_mutex_lock(&clp_lock);
+	clp->refcount++;
+	pthread_mutex_unlock(&clp_lock);
 	info->clp = clp;
 
 	return info;
 }
 
+void free_upcall_info(struct clnt_upcall_info *info)
+{
+	gssd_free_client(info->clp);
+	free(info);
+}
+
 /* For each upcall read the upcall info into the buffer, then create a
  * thread in a detached state so that resources are released back into
  * the system without the need for a join.
@@ -438,13 +472,13 @@ gssd_clnt_gssd_cb(int UNUSED(fd), short UNUSED(which), void *data)
 	info->lbuflen = read(clp->gssd_fd, info->lbuf, sizeof(info->lbuf));
 	if (info->lbuflen <= 0 || info->lbuf[info->lbuflen-1] != '\n') {
 		printerr(0, "WARNING: %s: failed reading request\n", __func__);
-		free(info);
+		free_upcall_info(info);
 		return;
 	}
 	info->lbuf[info->lbuflen-1] = 0;
 
 	if (start_upcall_thread(handle_gssd_upcall, info))
-		free(info);
+		free_upcall_info(info);
 }
 
 static void
@@ -461,12 +495,12 @@ gssd_clnt_krb5_cb(int UNUSED(fd), short UNUSED(which), void *data)
 			sizeof(info->uid)) < (ssize_t)sizeof(info->uid)) {
 		printerr(0, "WARNING: %s: failed reading uid from krb5 "
 			 "upcall pipe: %s\n", __func__, strerror(errno));
-		free(info);
+		free_upcall_info(info);
 		return;
 	}
 
 	if (start_upcall_thread(handle_krb5_upcall, info))
-		free(info);
+		free_upcall_info(info);
 }
 
 static struct clnt_info *
@@ -501,6 +535,7 @@ gssd_get_clnt(struct topdir *tdi, const char *name)
 	clp->name = clp->relpath + strlen(tdi->name) + 1;
 	clp->krb5_fd = -1;
 	clp->gssd_fd = -1;
+	clp->refcount = 1;
 
 	TAILQ_INSERT_HEAD(&tdi->clnt_list, clp, list);
 	return clp;
@@ -651,7 +686,7 @@ gssd_scan_topdir(const char *name)
 		if (clp->scanned)
 			continue;
 
-		printerr(3, "destroying client %s\n", clp->relpath);
+		printerr(3, "orphaned client %s\n", clp->relpath);
 		saveprev = clp->list.tqe_prev;
 		TAILQ_REMOVE(&tdi->clnt_list, clp, list);
 		gssd_destroy_client(clp);
diff --git a/utils/gssd/gssd.h b/utils/gssd/gssd.h
index f4f59754..d33e4dff 100644
--- a/utils/gssd/gssd.h
+++ b/utils/gssd/gssd.h
@@ -63,12 +63,10 @@ extern unsigned int 		context_timeout;
 extern unsigned int rpc_timeout;
 extern char			*preferred_realm;
 extern pthread_mutex_t ple_lock;
-extern pthread_cond_t pcond;
-extern pthread_mutex_t pmutex;
-extern int thread_started;
 
 struct clnt_info {
 	TAILQ_ENTRY(clnt_info)	list;
+	int			refcount;
 	int			wd;
 	bool			scanned;
 	char			*name;
@@ -94,6 +92,7 @@ struct clnt_upcall_info {
 
 void handle_krb5_upcall(struct clnt_upcall_info *clp);
 void handle_gssd_upcall(struct clnt_upcall_info *clp);
+void free_upcall_info(struct clnt_upcall_info *info);
 
 
 #endif /* _RPC_GSSD_H_ */
diff --git a/utils/gssd/gssd_proc.c b/utils/gssd/gssd_proc.c
index 8fe6605b..05c1da64 100644
--- a/utils/gssd/gssd_proc.c
+++ b/utils/gssd/gssd_proc.c
@@ -730,7 +730,7 @@ handle_krb5_upcall(struct clnt_upcall_info *info)
 	printerr(2, "\n%s: uid %d (%s)\n", __func__, info->uid, clp->relpath);
 
 	process_krb5_upcall(clp, info->uid, clp->krb5_fd, NULL, NULL, NULL);
-	free(info);
+	free_upcall_info(info);
 }
 
 void
@@ -830,6 +830,6 @@ handle_gssd_upcall(struct clnt_upcall_info *info)
 out:
 	free(upcall_str);
 out_nomem:
-	free(info);
+	free_upcall_info(info);
 	return;
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 21:30                         ` [PATCH v2] " Doug Nazar
@ 2020-06-26 21:44                           ` J. Bruce Fields
  2020-06-29  5:39                           ` Kraus, Sebastian
  1 sibling, 0 replies; 21+ messages in thread
From: J. Bruce Fields @ 2020-06-26 21:44 UTC (permalink / raw)
  To: Doug Nazar; +Cc: Kraus, Sebastian, linux-nfs, Steve Dickson, Olga Kornievskaia

On Fri, Jun 26, 2020 at 05:30:15PM -0400, Doug Nazar wrote:
> On 2020-06-26 17:02, J. Bruce Fields wrote:
> >Unless I'm missing something--an upcall thread could still be using this
> >file descriptor.
> >
> >If we're particularly unlucky, we could do a new open in a moment and
> >reuse this file descriptor number, and then writes in do_downcall()
> >could end up going to some other random file.
> >
> >I think we want these closes done by gssd_free_client() in the !refcnt
> >case?
> 
> Makes sense. I was thinking more that it was an abort situation and
> we shouldn't be sending any data to the kernel but re-use is
> definitely a concern.
> 
> I've split it so that we are removed from the event loop in
> destroy() but the close happens in free().

Looks good to me.  Steve?

--b.

> 
> Doug
> 

> From 8ef49081e8a42bfa05bb63265cd4f951e2b23413 Mon Sep 17 00:00:00 2001
> From: Doug Nazar <nazard@nazar.ca>
> Date: Fri, 26 Jun 2020 16:02:04 -0400
> Subject: [PATCH] gssd: Refcount struct clnt_info to protect multithread usage
> 
> Struct clnt_info is shared with the various upcall threads so
> we need to ensure that it stays around even if the client dir
> gets removed.
> 
> Reported-by: Sebastian Kraus <sebastian.kraus@tu-berlin.de>
> Signed-off-by: Doug Nazar <nazard@nazar.ca>
> ---
>  utils/gssd/gssd.c      | 67 ++++++++++++++++++++++++++++++++----------
>  utils/gssd/gssd.h      |  5 ++--
>  utils/gssd/gssd_proc.c |  4 +--
>  3 files changed, 55 insertions(+), 21 deletions(-)
> 
> diff --git a/utils/gssd/gssd.c b/utils/gssd/gssd.c
> index 588da0fb..b40c3220 100644
> --- a/utils/gssd/gssd.c
> +++ b/utils/gssd/gssd.c
> @@ -90,9 +90,7 @@ char *ccachedir = NULL;
>  /* Avoid DNS reverse lookups on server names */
>  static bool avoid_dns = true;
>  static bool use_gssproxy = false;
> -int thread_started = false;
> -pthread_mutex_t pmutex = PTHREAD_MUTEX_INITIALIZER;
> -pthread_cond_t pcond = PTHREAD_COND_INITIALIZER;
> +pthread_mutex_t clp_lock = PTHREAD_MUTEX_INITIALIZER;
>  
>  TAILQ_HEAD(topdir_list_head, topdir) topdir_list;
>  
> @@ -359,20 +357,28 @@ out:
>  	free(port);
>  }
>  
> +/* Actually frees clp and fields that might be used from other
> + * threads if was last reference.
> + */
>  static void
> -gssd_destroy_client(struct clnt_info *clp)
> +gssd_free_client(struct clnt_info *clp)
>  {
> -	if (clp->krb5_fd >= 0) {
> +	int refcnt;
> +
> +	pthread_mutex_lock(&clp_lock);
> +	refcnt = --clp->refcount;
> +	pthread_mutex_unlock(&clp_lock);
> +	if (refcnt > 0)
> +		return;
> +
> +	printerr(3, "freeing client %s\n", clp->relpath);
> +
> +	if (clp->krb5_fd >= 0)
>  		close(clp->krb5_fd);
> -		event_del(&clp->krb5_ev);
> -	}
>  
> -	if (clp->gssd_fd >= 0) {
> +	if (clp->gssd_fd >= 0)
>  		close(clp->gssd_fd);
> -		event_del(&clp->gssd_ev);
> -	}
>  
> -	inotify_rm_watch(inotify_fd, clp->wd);
>  	free(clp->relpath);
>  	free(clp->servicename);
>  	free(clp->servername);
> @@ -380,6 +386,24 @@ gssd_destroy_client(struct clnt_info *clp)
>  	free(clp);
>  }
>  
> +/* Called when removing from clnt_list to tear down event handling.
> + * Will then free clp if was last reference.
> + */
> +static void
> +gssd_destroy_client(struct clnt_info *clp)
> +{
> +	printerr(3, "destroying client %s\n", clp->relpath);
> +
> +	if (clp->krb5_fd >= 0)
> +		event_del(&clp->krb5_ev);
> +
> +	if (clp->gssd_fd >= 0)
> +		event_del(&clp->gssd_ev);
> +
> +	inotify_rm_watch(inotify_fd, clp->wd);
> +	gssd_free_client(clp);
> +}
> +
>  static void gssd_scan(void);
>  
>  static int
> @@ -416,11 +440,21 @@ static struct clnt_upcall_info *alloc_upcall_info(struct clnt_info *clp)
>  	info = malloc(sizeof(struct clnt_upcall_info));
>  	if (info == NULL)
>  		return NULL;
> +
> +	pthread_mutex_lock(&clp_lock);
> +	clp->refcount++;
> +	pthread_mutex_unlock(&clp_lock);
>  	info->clp = clp;
>  
>  	return info;
>  }
>  
> +void free_upcall_info(struct clnt_upcall_info *info)
> +{
> +	gssd_free_client(info->clp);
> +	free(info);
> +}
> +
>  /* For each upcall read the upcall info into the buffer, then create a
>   * thread in a detached state so that resources are released back into
>   * the system without the need for a join.
> @@ -438,13 +472,13 @@ gssd_clnt_gssd_cb(int UNUSED(fd), short UNUSED(which), void *data)
>  	info->lbuflen = read(clp->gssd_fd, info->lbuf, sizeof(info->lbuf));
>  	if (info->lbuflen <= 0 || info->lbuf[info->lbuflen-1] != '\n') {
>  		printerr(0, "WARNING: %s: failed reading request\n", __func__);
> -		free(info);
> +		free_upcall_info(info);
>  		return;
>  	}
>  	info->lbuf[info->lbuflen-1] = 0;
>  
>  	if (start_upcall_thread(handle_gssd_upcall, info))
> -		free(info);
> +		free_upcall_info(info);
>  }
>  
>  static void
> @@ -461,12 +495,12 @@ gssd_clnt_krb5_cb(int UNUSED(fd), short UNUSED(which), void *data)
>  			sizeof(info->uid)) < (ssize_t)sizeof(info->uid)) {
>  		printerr(0, "WARNING: %s: failed reading uid from krb5 "
>  			 "upcall pipe: %s\n", __func__, strerror(errno));
> -		free(info);
> +		free_upcall_info(info);
>  		return;
>  	}
>  
>  	if (start_upcall_thread(handle_krb5_upcall, info))
> -		free(info);
> +		free_upcall_info(info);
>  }
>  
>  static struct clnt_info *
> @@ -501,6 +535,7 @@ gssd_get_clnt(struct topdir *tdi, const char *name)
>  	clp->name = clp->relpath + strlen(tdi->name) + 1;
>  	clp->krb5_fd = -1;
>  	clp->gssd_fd = -1;
> +	clp->refcount = 1;
>  
>  	TAILQ_INSERT_HEAD(&tdi->clnt_list, clp, list);
>  	return clp;
> @@ -651,7 +686,7 @@ gssd_scan_topdir(const char *name)
>  		if (clp->scanned)
>  			continue;
>  
> -		printerr(3, "destroying client %s\n", clp->relpath);
> +		printerr(3, "orphaned client %s\n", clp->relpath);
>  		saveprev = clp->list.tqe_prev;
>  		TAILQ_REMOVE(&tdi->clnt_list, clp, list);
>  		gssd_destroy_client(clp);
> diff --git a/utils/gssd/gssd.h b/utils/gssd/gssd.h
> index f4f59754..d33e4dff 100644
> --- a/utils/gssd/gssd.h
> +++ b/utils/gssd/gssd.h
> @@ -63,12 +63,10 @@ extern unsigned int 		context_timeout;
>  extern unsigned int rpc_timeout;
>  extern char			*preferred_realm;
>  extern pthread_mutex_t ple_lock;
> -extern pthread_cond_t pcond;
> -extern pthread_mutex_t pmutex;
> -extern int thread_started;
>  
>  struct clnt_info {
>  	TAILQ_ENTRY(clnt_info)	list;
> +	int			refcount;
>  	int			wd;
>  	bool			scanned;
>  	char			*name;
> @@ -94,6 +92,7 @@ struct clnt_upcall_info {
>  
>  void handle_krb5_upcall(struct clnt_upcall_info *clp);
>  void handle_gssd_upcall(struct clnt_upcall_info *clp);
> +void free_upcall_info(struct clnt_upcall_info *info);
>  
>  
>  #endif /* _RPC_GSSD_H_ */
> diff --git a/utils/gssd/gssd_proc.c b/utils/gssd/gssd_proc.c
> index 8fe6605b..05c1da64 100644
> --- a/utils/gssd/gssd_proc.c
> +++ b/utils/gssd/gssd_proc.c
> @@ -730,7 +730,7 @@ handle_krb5_upcall(struct clnt_upcall_info *info)
>  	printerr(2, "\n%s: uid %d (%s)\n", __func__, info->uid, clp->relpath);
>  
>  	process_krb5_upcall(clp, info->uid, clp->krb5_fd, NULL, NULL, NULL);
> -	free(info);
> +	free_upcall_info(info);
>  }
>  
>  void
> @@ -830,6 +830,6 @@ handle_gssd_upcall(struct clnt_upcall_info *info)
>  out:
>  	free(upcall_str);
>  out_nomem:
> -	free(info);
> +	free_upcall_info(info);
>  	return;
>  }
> -- 
> 2.26.2
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-26 21:30                         ` [PATCH v2] " Doug Nazar
  2020-06-26 21:44                           ` J. Bruce Fields
@ 2020-06-29  5:39                           ` Kraus, Sebastian
  2020-06-29 14:09                             ` Doug Nazar
  1 sibling, 1 reply; 21+ messages in thread
From: Kraus, Sebastian @ 2020-06-29  5:39 UTC (permalink / raw)
  To: Doug Nazar; +Cc: linux-nfs

Hi Doug,
thanks very much for your patch and efforts.
I manually backported the patch to nfs-utils 1.3.4-2.5 source in Debian Buster.
I am now testing the modified build on one of my NFSv4 file servers. Looks promising.

One additional question: Which nfs-utils branch are you working on - steved/nfs-utils.git?

Best Sebastian

__________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin

Email: sebastian.kraus@tu-berlin.de

________________________________________
From: linux-nfs-owner@vger.kernel.org <linux-nfs-owner@vger.kernel.org> on behalf of Doug Nazar <nazard@nazar.ca>
Sent: Friday, June 26, 2020 23:30
To: J. Bruce Fields
Cc: Kraus, Sebastian; linux-nfs@vger.kernel.org; Steve Dickson; Olga Kornievskaia
Subject: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster

On 2020-06-26 17:02, J. Bruce Fields wrote:
> Unless I'm missing something--an upcall thread could still be using this
> file descriptor.
>
> If we're particularly unlucky, we could do a new open in a moment and
> reuse this file descriptor number, and then writes in do_downcall()
> could end up going to some other random file.
>
> I think we want these closes done by gssd_free_client() in the !refcnt
> case?

Makes sense. I was thinking more that it was an abort situation and we
shouldn't be sending any data to the kernel but re-use is definitely a
concern.

I've split it so that we are removed from the event loop in destroy()
but the close happens in free().

Doug

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-29  5:39                           ` Kraus, Sebastian
@ 2020-06-29 14:09                             ` Doug Nazar
  2020-07-01  7:39                               ` Kraus, Sebastian
  0 siblings, 1 reply; 21+ messages in thread
From: Doug Nazar @ 2020-06-29 14:09 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs

On 2020-06-29 01:39, Kraus, Sebastian wrote:
> Hi Doug,
> thanks very much for your patch and efforts.
> I manually backported the patch to nfs-utils 1.3.4-2.5 source in Debian Buster.
> I am now testing the modified build on one of my NFSv4 file servers. Looks promising.
>
> One additional question: Which nfs-utils branch are you working on - steved/nfs-utils.git?

Yes, I'm working against upstream. I did check briefly that the code 
hadn't changed too much since 1.3.4 in that area.

I've found one other place that has insufficient locking but the race to 
hit it is fairly small. It's in the Kerberos machine principal cache 
when it refreshes the machine credentials. I have a patch for that, but 
it's pretty invasive due to some other changes I'm currently working on. 
Let me know if you hit it, and I can work on a simple version to backport.
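
The shape of that fix is straightforward: take one mutex around both the
refresh and every reader of the cached entry, so no thread ever sees it
half-replaced. A hypothetical sketch (field and function names invented
for illustration; this is not the actual gssd code):

```c
#include <pthread.h>
#include <string.h>

/* A shared cache entry that several upcall threads read while one
 * thread refreshes it. Without the mutex, a reader could observe the
 * entry mid-replacement. */
struct princ_cache {
	pthread_mutex_t lock;
	char ccname[64];	/* e.g. path of the machine credential cache */
	int generation;		/* bumped on every refresh */
};

static void cache_refresh(struct princ_cache *c, const char *newname)
{
	pthread_mutex_lock(&c->lock);
	strncpy(c->ccname, newname, sizeof(c->ccname) - 1);
	c->ccname[sizeof(c->ccname) - 1] = '\0';
	c->generation++;
	pthread_mutex_unlock(&c->lock);
}

static int cache_read(struct princ_cache *c, char *out, size_t outlen)
{
	pthread_mutex_lock(&c->lock);
	strncpy(out, c->ccname, outlen - 1);
	out[outlen - 1] = '\0';
	int gen = c->generation;
	pthread_mutex_unlock(&c->lock);
	return gen;	/* lets a caller detect a refresh between two reads */
}
```

Returning a generation counter from the read keeps the copy and the
version observation atomic with respect to a concurrent refresh.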

Doug


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-06-29 14:09                             ` Doug Nazar
@ 2020-07-01  7:39                               ` Kraus, Sebastian
  2020-07-01  8:13                                 ` [PATCH v2] " Peter Eriksson
  2020-07-01 18:45                                 ` [PATCH v2] " Doug Nazar
  0 siblings, 2 replies; 21+ messages in thread
From: Kraus, Sebastian @ 2020-07-01  7:39 UTC (permalink / raw)
  To: Doug Nazar; +Cc: linux-nfs, J. Bruce Fields

Hi Doug,

>> Yes, I'm working against upstream. I did check briefly that the code hadn't changed too much since 1.3.4 in that area.
OK, thanks for the info. I wondered because your patch did not show up as a commit upstream.
Your patch seems to do a good job - no segfaults for four days now. :-)

>> I've found one other place that has insufficient locking but the race to hit it is fairly small. It's in the Kerberos machine principal cache when it refreshes the machine credentials. 
These types of patches are always welcome. :-)
In the recent past, some of our scientific staff experienced strange problems with Kerberos authentication against our NFSv4 file servers.
Maybe the outages were connected to this type of race condition, but I do not know for sure, as the authentication errors happened only sporadically.

>> I have a patch for that, but it's pretty invasive due to some other changes I'm currently working on. Let me know if you hit it, and I can work on a simple version to backport.
NFSv4+Kerberos is not for the faint-hearted. I do not fear invasive patches - as long as they are technically correct. ;-)

A question apart from this:
How widespread are NFSv4+Kerberos setups in the academic community and in commercial environments?
Are there, to your knowledge, any bigger on-premise or cloud setups out there?
And are there any companies running dedicated NFSv4+Kerberos setups?


Best and keep well and fit
Sebastian

_________________
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7

Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin


Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kraus@tu-berlin.de

________________________________________
From: Doug Nazar <nazard@nazar.ca>
Sent: Monday, June 29, 2020 16:09
To: Kraus, Sebastian
Cc: linux-nfs@vger.kernel.org
Subject: Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster

On 2020-06-29 01:39, Kraus, Sebastian wrote:
> Hi Doug,
> thanks very much for your patch and efforts.
> I manually backported the patch to nfs-utils 1.3.4-2.5 source in Debian Buster.
> I am now testing the modified build on one of my NFSv4 file servers. Looks promising.
>
> One additional question: Which nfs-utils branch are you working on - steved/nfs-utils.git?

Yes, I'm working against upstream. I did check briefly that the code
hadn't changed too much since 1.3.4 in that area.

I've found one other place that has insufficient locking but the race to
hit it is fairly small. It's in the Kerberos machine principal cache
when it refreshes the machine credentials. I have a patch for that, but
it's pretty invasive due to some other changes I'm currently working on.
Let me know if you hit it, and I can work on a simple version to backport.

Doug


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] Strange segmentation violations of rpc.gssd in Debian Buster
  2020-07-01  7:39                               ` Kraus, Sebastian
@ 2020-07-01  8:13                                 ` Peter Eriksson
  2020-07-01 18:45                                 ` [PATCH v2] " Doug Nazar
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Eriksson @ 2020-07-01  8:13 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs


>>> I've found one other place that has insufficient locking but the race to hit it is fairly small. It's in the Kerberos machine principal cache when it refreshes the machine credentials. 
> These types of patches are always welcome. :-)
> In the recent past, some of our scientific staff experienced strange problems with Kerberos authentication against our NFSv4 file servers.
> Maybe the outages were connected to this type of race condition, but I do not know for sure, as the authentication errors happened only sporadically.

We (Linköping University in Sweden) have seen these problems before too. I sent a patch for rpc.gssd this spring that “fixed” this problem as well (well, it fixed the symptom and not the root cause, so it wasn’t the right fix). Without that patch we typically had rpc.gssd crash on our multiuser client servers every other day. It was partly masked by Puppet detecting it down and restarting it, but the users reported strange errors, and by the time the support folks checked, everything was running again :-). It also crashed very often on a set of test machines that connect to our NFS servers every minute to verify that they are running and giving good response times. Multiple NFS connections being set up and torn down concurrently easily forced this problem to happen after a day or two.


> A question apart from this:
> How widespread are NFSv4+Kerberos setups in the academic community and in commercial environments?

We are using NFSv4+Kerberos. Most of our users are SMBv3 clients (Windows & Mac, 10x the Linux users) though, but we have some 600 NFS clients (99.9% Linux (CentOS & Ubuntu mostly) based; servers are FreeBSD with ZFS). We used to be a big Sun/Solaris NFS shop, so NFS comes “naturally” for us :-)

(Would have loved to use NFSv4+Kerberos on the MacOS clients too, but unfortunately MacOS panics when the Kerberos ticket expires while an NFS share is still mounted, which is a bit of a bummer :-)

(Using NFS v3 or lower, and without Kerberos, isn’t really an option - real ACLs and some sort of security are really needed)


Anyway - it’s good to see that the root cause for this bug has been found and fixed the right way :-)

- Peter

> Are there, to your knowledge, any bigger on-premise or cloud setups out there?
> And are there any companies running dedicated NFSv4+Kerberos setups?
> 
> 
> Best and keep well and fit
> Sebastian
> 
> _________________
> Sebastian Kraus
> Team IT am Institut für Chemie
> Gebäude C, Straße des 17. Juni 115, Raum C7
> 
> Technische Universität Berlin
> Fakultät II
> Institut für Chemie
> Sekretariat C3
> Straße des 17. Juni 135
> 10623 Berlin
> 
> 
> Tel.: +49 30 314 22263
> Fax: +49 30 314 29309
> Email: sebastian.kraus@tu-berlin.de
> 
> ________________________________________
> From: Doug Nazar <nazard@nazar.ca>
> Sent: Monday, June 29, 2020 16:09
> To: Kraus, Sebastian
> Cc: linux-nfs@vger.kernel.org
> Subject: Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
> 
> On 2020-06-29 01:39, Kraus, Sebastian wrote:
>> Hi Doug,
>> thanks very much for your patch and efforts.
>> I manually backported the patch to nfs-utils 1.3.4-2.5 source in Debian Buster.
>> I am now testing the modified build on one of my NFSv4 file servers. Looks promising.
>> 
> One additional question: Which nfs-utils branch are you working on - steved/nfs-utils.git?
> 
> Yes, I'm working against upstream. I did check briefly that the code
> hadn't changed too much since 1.3.4 in that area.
> 
> I've found one other place that has insufficient locking but the race to
> hit it is fairly small. It's in the Kerberos machine principal cache
> when it refreshes the machine credentials. I have a patch for that, but
> it's pretty invasive due to some other changes I'm currently working on.
> Let me know if you hit it, and I can work on a simple version to backport.
> 
> Doug
> 



* Re: [PATCH v2] Re: Strange segmentation violations of rpc.gssd in Debian Buster
  2020-07-01  7:39                               ` Kraus, Sebastian
  2020-07-01  8:13                                 ` [PATCH v2] " Peter Eriksson
@ 2020-07-01 18:45                                 ` Doug Nazar
  1 sibling, 0 replies; 21+ messages in thread
From: Doug Nazar @ 2020-07-01 18:45 UTC (permalink / raw)
  To: Kraus, Sebastian; +Cc: linux-nfs

On 2020-07-01 03:39, Kraus, Sebastian wrote:
> OK, thanks for the info. I wondered, because your patch did not show 
> up as a commit within upstream.
> Your patch seems to do a good job - no more segfaults in four days. :-)

I'm not a maintainer, just an enthusiastic user with a compiler... ;-)
I'm sure it'll get applied in the near future, as time permits.

>> I've found one other place that has insufficient locking but the race to hit it is fairly small. It's in the Kerberos machine principal cache when it refreshes the machine credentials.
> These types of patches are always welcome. :-)
> In the recent past, some of our scientific staff experienced strange problems with Kerberos authentication against our NFSv4 file servers.
> Maybe the outages were connected to this type of race condition. But I do not know for sure, as the authentication errors happened on a rather sporadic basis.

The previous bug could also cause authentication issues without crashing, 
depending on load, timing, memory usage, malloc library, etc. This one 
would only crop up during the machine credentials refresh, which by default 
happens once every 24 hours. I've just posted a patch 'gssd: Fix locking 
for machine principal list'; the interesting part for backporting is around 
line 447. The code used to always strdup() even if the cache name was the same.

>> I have a patch for that, but it's pretty invasive due to some other changes I'm currently working on. Let me know if you hit it, and I can work on a simple version to backport.
> NFSv4+Kerberos is not for the faint-hearted. I do not fear invasive patches - as long as they are technically correct. ;-)

No guarantees... but I do try.  ;-)

> A question far apart from this:
> What is the spread of NFSv4+Kerberos setups within the academic community and commercial environments?
> Are there, to your knowledge, any bigger on-premise or cloud setups out there?
> And are there any companies running dedicated NFSv4+Kerberos setups?

I really have no idea. I only run it on my home network of a few dozen 
(old) machines. From what I've seen while googling to figure out how the 
code base works, there are a fair number of users. There's also been a 
large amount of work in recent years, which would point to something 
driving that.

Doug



end of thread, back to index

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19 21:24 RPC Pipefs: Frequent parsing errors in client database Kraus, Sebastian
2020-06-19 22:04 ` J. Bruce Fields
2020-06-20 11:35   ` Kraus, Sebastian
2020-06-20 17:03     ` J. Bruce Fields
2020-06-20 21:08       ` Kraus, Sebastian
2020-06-22 22:36         ` J. Bruce Fields
2020-06-25 17:43           ` Strange segmentation violations of rpc.gssd in Debian Buster Kraus, Sebastian
2020-06-25 20:14             ` J. Bruce Fields
2020-06-25 21:44             ` Doug Nazar
2020-06-26 12:31               ` Kraus, Sebastian
2020-06-26 17:23                 ` Doug Nazar
2020-06-26 19:46                   ` J. Bruce Fields
2020-06-26 20:15                     ` Doug Nazar
2020-06-26 21:02                       ` J. Bruce Fields
2020-06-26 21:30                         ` [PATCH v2] " Doug Nazar
2020-06-26 21:44                           ` J. Bruce Fields
2020-06-29  5:39                           ` Kraus, Sebastian
2020-06-29 14:09                             ` Doug Nazar
2020-07-01  7:39                               ` Kraus, Sebastian
2020-07-01  8:13                                 ` [PATCH v2] " Peter Eriksson
2020-07-01 18:45                                 ` [PATCH v2] " Doug Nazar
