linux-nfs.vger.kernel.org archive mirror
* NULL dereference in rpcauth_lookup_credcache
@ 2018-11-08 21:44 J. Bruce Fields
  2018-11-09 18:01 ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: J. Bruce Fields @ 2018-11-08 21:44 UTC
  To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs

Since -rc1 my regression tests crash my client.  Is this a known
problem?  I'll investigate some more, I haven't even looked at the code
yet or checked which test exactly is hitting this.

--b.

[  164.109570] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  164.111207] PGD 0 P4D 0 
[  164.111528] Oops: 0000 [#1] PREEMPT SMP PTI
[  164.112303] CPU: 2 PID: 2947 Comm: kworker/u8:5 Not tainted 4.20.0-rc1-13223-gafb6d1c474ef #1898
[  164.113487] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
[  164.115301] Workqueue: rpciod rpc_async_schedule [sunrpc]
[  164.115920] RIP: 0010:rpcauth_lookup_credcache+0x3d/0x450 [sunrpc]
[  164.116700] Code: 89 f5 41 54 41 89 d4 53 48 83 ec 38 89 4d b0 4c 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8d 45 c0 48 89 45 c8 <41> 8b 77 08 48 89 45 c0 48 8b 47 10 4c 89 ef 48 8b 40 28 e8 cb d2
[  164.119299] RSP: 0018:ffffc90001ee3cf0 EFLAGS: 00010246
[  164.119872] RAX: ffffc90001ee3d10 RBX: ffff88007cc18180 RCX: 0000000000600040
[  164.120800] RDX: 0000000000000001 RSI: ffffc90001ee3d60 RDI: ffff88007cafb198
[  164.121643] RBP: ffffc90001ee3d50 R08: 0000000000000000 R09: 0000000000000000
[  164.122464] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[  164.123373] R13: ffffc90001ee3d60 R14: ffff88007cafb198 R15: 0000000000000000
[  164.124296] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  164.125322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  164.126006] CR2: 0000000000000008 CR3: 000000007829c003 CR4: 00000000001606e0
[  164.126860] Call Trace:
[  164.127045]  ? call_retry_reserve+0x30/0x30 [sunrpc]
[  164.127622]  rpcauth_lookupcred+0xa0/0xc0 [sunrpc]
[  164.128200]  rpcauth_refreshcred+0x15f/0x170 [sunrpc]
[  164.128807]  __rpc_execute+0xa9/0x460 [sunrpc]
[  164.129281]  process_one_work+0x227/0x630
[  164.129684]  worker_thread+0x3c/0x390
[  164.130062]  ? process_one_work+0x630/0x630
[  164.130609]  kthread+0x11d/0x140
[  164.130936]  ? kthread_park+0x80/0x80
[  164.131339]  ret_from_fork+0x3a/0x50
[  164.131676] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs lockd grace auth_rpcgss sunrpc
[  164.132719] CR2: 0000000000000008
[  164.133050] ---[ end trace b4028a6781a696ad ]---
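
Editorial sketch, not part of the original message: the oops above can be read mechanically. The bytes after the `<41>` marker on the `Code:` line are the faulting instruction, and CR2 holds the address that faulted. The Python below decodes just that one instruction form and compares it with the register dump. The helper names (`faulting_bytes`, `decode_mov_load_disp8`, `registers`) are made up for this illustration, and the decoder deliberately handles only a REX-prefixed `8b` load with an 8-bit displacement and no SIB byte.

```python
import re

# Excerpt of the oops quoted above (only the lines this sketch needs).
OOPS = """\
[  164.116700] Code: 89 f5 41 54 41 89 d4 53 48 83 ec 38 89 4d b0 4c 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8d 45 c0 48 89 45 c8 <41> 8b 77 08 48 89 45 c0 48 8b 47 10 4c 89 ef 48 8b 40 28 e8 cb d2
[  164.123373] R13: ffffc90001ee3d60 R14: ffff88007cafb198 R15: 0000000000000000
[  164.126006] CR2: 0000000000000008 CR3: 000000007829c003 CR4: 00000000001606e0
"""

# x86-64 register numbering, spelled the way the oops prints the names.
REG_NAMES = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
             "R08", "R09", "R10", "R11", "R12", "R13", "R14", "R15"]


def faulting_bytes(oops):
    """Return the instruction bytes starting at the <xx> marker."""
    code = re.search(r"Code: (.*)", oops).group(1).split()
    start = next(i for i, tok in enumerate(code) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in code[start:]]


def decode_mov_load_disp8(insn):
    """Decode only 'REX 8b /r disp8', i.e. mov disp8(base), reg."""
    rex, opcode, modrm, disp = insn[:4]
    if (rex & 0xF0) != 0x40 or opcode != 0x8B:
        raise ValueError("not a REX-prefixed 8b mov")
    if (modrm >> 6) != 1 or (modrm & 7) == 4:
        raise ValueError("only mod=01 without a SIB byte is handled")
    dest = ((modrm >> 3) & 7) | ((rex & 0x4) << 1)  # REX.R extends reg
    base = (modrm & 7) | ((rex & 0x1) << 3)         # REX.B extends r/m
    if disp >= 0x80:                                # disp8 is signed
        disp -= 0x100
    return REG_NAMES[dest], REG_NAMES[base], disp


def registers(oops):
    """Collect the general-purpose registers and CR2 from the dump."""
    pairs = re.findall(r"\b(R[A-Z0-9]{2}|CR2): ([0-9a-f]{16})", oops)
    return {name: int(value, 16) for name, value in pairs}


dest, base, disp = decode_mov_load_disp8(faulting_bytes(OOPS))
regs = registers(OOPS)
print(dest, base, hex(disp), hex(regs[base]), hex(regs["CR2"]))
```

Under those assumptions, the marked instruction is a load from offset 8 of R15, R15 is zero in the dump, and zero plus 8 matches CR2. The same decoder applied to the earlier bytes `4c 8b 7f 20` shows R15 being loaded from offset 0x20 of RDI, the function's first argument. Which structure field that corresponds to cannot be read off the dump alone and needs the matching source and build.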



* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-08 21:44 NULL dereference in rpcauth_lookup_credcache J. Bruce Fields
@ 2018-11-09 18:01 ` Chuck Lever
  2018-11-10 21:49   ` Bruce Fields
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2018-11-09 18:01 UTC
  To: Bruce Fields; +Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List



> On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> Since -rc1 my regression tests crash my client.  Is this a known
> problem?  I'll investigate some more, I haven't even looked at the code
> yet or checked which test exactly is hitting this.
> 
> --b.
> 
> 

I just encountered this repeatedly with cthon04 general tests.

MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"


--
Chuck Lever
chucklever@gmail.com





* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-09 18:01 ` Chuck Lever
@ 2018-11-10 21:49   ` Bruce Fields
  2018-11-12 17:59     ` Trond Myklebust
  0 siblings, 1 reply; 13+ messages in thread
From: Bruce Fields @ 2018-11-10 21:49 UTC
  To: Chuck Lever; +Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List

Looks like it's the fault of

07d02a67b7faae "SUNRPC: Simplify lookup code"

--b.

On Fri, Nov 09, 2018 at 01:01:30PM -0500, Chuck Lever wrote:
> 
> 
> > On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > Since -rc1 my regression tests crash my client.  Is this a known
> > problem?  I'll investigate some more, I haven't even looked at the code
> > yet or checked which test exactly is hitting this.
> > 
> > --b.
> > 
> > 
> 
> I just encountered this repeatedly with cthon04 general tests.
> 
> MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"
> 
> 
> --
> Chuck Lever
> chucklever@gmail.com
> 
> 


* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-10 21:49   ` Bruce Fields
@ 2018-11-12 17:59     ` Trond Myklebust
  2018-11-12 18:16       ` Chuck Lever
  2018-11-12 18:24       ` bfields
  0 siblings, 2 replies; 13+ messages in thread
From: Trond Myklebust @ 2018-11-12 17:59 UTC
  To: bfields, chucklever; +Cc: schumakeranna, linux-nfs

On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> Looks like it's the fault of
> 
> 07d02a67b7faae "SUNRPC: Simplify lookup code"

I'm having trouble reproducing this bug. I've tried both cthon and
xfstests in a loop, so far without success (both NFSv3 and v4.1, but
only sec=sys). Is there anything else you're doing that I might try?

e.g. Are you running multiple workloads in parallel? Different users?..

> 
> --b.
> 
> On Fri, Nov 09, 2018 at 01:01:30PM -0500, Chuck Lever wrote:
> > 
> > > On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <bfields@fieldses.org
> > > > wrote:
> > > 
> > > Since -rc1 my regression tests crash my client.  Is this a known
> > > problem?  I'll investigate some more, I haven't even looked at
> > > the code
> > > yet or checked which test exactly is hitting this.
> > > 
> > > --b.
> > > 
> > > [  164.109570] BUG: unable to handle kernel NULL pointer
> > > dereference at 0000000000000008
> > > [  164.111207] PGD 0 P4D 0 
> > > [  164.111528] Oops: 0000 [#1] PREEMPT SMP PTI
> > > [  164.112303] CPU: 2 PID: 2947 Comm: kworker/u8:5 Not tainted
> > > 4.20.0-rc1-13223-gafb6d1c474ef #1898
> > > [  164.113487] Hardware name: QEMU Standard PC (i440FX + PIIX,
> > > 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-
> > > 1.fc28 04/01/2014
> > > [  164.115301] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > > [  164.115920] RIP: 0010:rpcauth_lookup_credcache+0x3d/0x450
> > > [sunrpc]
> > > [  164.116700] Code: 89 f5 41 54 41 89 d4 53 48 83 ec 38 89 4d b0
> > > 4c 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8d 45
> > > c0 48 89 45 c8 <41> 8b 77 08 48 89 45 c0 48 8b 47 10 4c 89 ef 48
> > > 8b 40 28 e8 cb d2
> > > [  164.119299] RSP: 0018:ffffc90001ee3cf0 EFLAGS: 00010246
> > > [  164.119872] RAX: ffffc90001ee3d10 RBX: ffff88007cc18180 RCX:
> > > 0000000000600040
> > > [  164.120800] RDX: 0000000000000001 RSI: ffffc90001ee3d60 RDI:
> > > ffff88007cafb198
> > > [  164.121643] RBP: ffffc90001ee3d50 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [  164.122464] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > 0000000000000001
> > > [  164.123373] R13: ffffc90001ee3d60 R14: ffff88007cafb198 R15:
> > > 0000000000000000
> > > [  164.124296] FS:  0000000000000000(0000)
> > > GS:ffff88007fd00000(0000) knlGS:0000000000000000
> > > [  164.125322] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [  164.126006] CR2: 0000000000000008 CR3: 000000007829c003 CR4:
> > > 00000000001606e0
> > > [  164.126860] Call Trace:
> > > [  164.127045]  ? call_retry_reserve+0x30/0x30 [sunrpc]
> > > [  164.127622]  rpcauth_lookupcred+0xa0/0xc0 [sunrpc]
> > > [  164.128200]  rpcauth_refreshcred+0x15f/0x170 [sunrpc]
> > > [  164.128807]  __rpc_execute+0xa9/0x460 [sunrpc]
> > > [  164.129281]  process_one_work+0x227/0x630
> > > [  164.129684]  worker_thread+0x3c/0x390
> > > [  164.130062]  ? process_one_work+0x630/0x630
> > > [  164.130609]  kthread+0x11d/0x140
> > > [  164.130936]  ? kthread_park+0x80/0x80
> > > [  164.131339]  ret_from_fork+0x3a/0x50
> > > [  164.131676] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs lockd
> > > grace auth_rpcgss sunrpc
> > > [  164.132719] CR2: 0000000000000008
> > > [  164.133050] ---[ end trace b4028a6781a696ad ]---
> > > 
> > 
> > I just encountered this repeatedly with cthon04 general tests.
> > 
> > MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"
> > 
> > 
> > --
> > Chuck Lever
> > chucklever@gmail.com
> > 
> > 
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space




* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 17:59     ` Trond Myklebust
@ 2018-11-12 18:16       ` Chuck Lever
  2018-11-12 18:18         ` Trond Myklebust
  2018-11-12 18:24       ` bfields
  1 sibling, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2018-11-12 18:16 UTC
  To: Trond Myklebust; +Cc: Bruce Fields, schumakeranna, Linux NFS Mailing List



> On Nov 12, 2018, at 9:59 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
>> Looks like it's the fault of
>> 
>> 07d02a67b7faae "SUNRPC: Simplify lookup code"
> 
> I'm having trouble reproducing this bug. I've tried both cthon and
> xfstests in a loop, so far without success (both NFSv3 and v4.1, but
> only sec=sys). Is there anything else you're doing that I might try?
> 
> e.g. Are you running multiple workloads in parallel? Different users?..

Some observations, for what they are worth:

Single user test running with no other NFS workload.

I see the BUG fire at umount time, not during the test.

My client is a two-node NUMA system with 12 cores, which
could be more likely to trigger races.

Export is tmpfs.


>> --b.
>> 
>> On Fri, Nov 09, 2018 at 01:01:30PM -0500, Chuck Lever wrote:
>>> 
>>>> On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <bfields@fieldses.org
>>>>> wrote:
>>>> 
>>>> Since -rc1 my regression tests crash my client.  Is this a known
>>>> problem?  I'll investigate some more, I haven't even looked at
>>>> the code
>>>> yet or checked which test exactly is hitting this.
>>>> 
>>>> --b.
>>>> 
>>>> 
>>> 
>>> I just encountered this repeatedly with cthon04 general tests.
>>> 
>>> MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"
>>> 
>>> 
>>> --
>>> Chuck Lever
>>> chucklever@gmail.com
>>> 
>>> 
> -- 
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space

--
Chuck Lever
chucklever@gmail.com





* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 18:16       ` Chuck Lever
@ 2018-11-12 18:18         ` Trond Myklebust
  0 siblings, 0 replies; 13+ messages in thread
From: Trond Myklebust @ 2018-11-12 18:18 UTC
  To: chucklever; +Cc: bfields, schumakeranna, linux-nfs

On Mon, 2018-11-12 at 10:16 -0800, Chuck Lever wrote:
> > On Nov 12, 2018, at 9:59 AM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > Looks like it's the fault of
> > > 
> > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > 
> > I'm having trouble reproducing this bug. I've tried both cthon and
> > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > but
> > only sec=sys). Is there anything else you're doing that I might
> > try?
> > 
> > e.g. Are you running multiple workloads in parallel? Different
> > users?..
> 
> Some observations, for what they are worth:
> 
> Single user test running with no other NFS workload.
> 
> I see the BUG fire at umount time, not during the test.
> 
> My client is a two-node NUMA system with 12 cores, which
> could be more likely to trigger races.
> 
> Export is tmpfs.
> 

Thanks! That's useful info. Particularly the observation that you're
seeing it at umount time...

> 
> > > --b.
> > > 
> > > On Fri, Nov 09, 2018 at 01:01:30PM -0500, Chuck Lever wrote:
> > > > > On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <
> > > > > bfields@fieldses.org
> > > > > > wrote:
> > > > > 
> > > > > Since -rc1 my regression tests crash my client.  Is this a
> > > > > known
> > > > > problem?  I'll investigate some more, I haven't even looked
> > > > > at
> > > > > the code
> > > > > yet or checked which test exactly is hitting this.
> > > > > 
> > > > > --b.
> > > > > 
> > > > > [  164.109570] BUG: unable to handle kernel NULL pointer
> > > > > dereference at 0000000000000008
> > > > > [  164.111207] PGD 0 P4D 0 
> > > > > [  164.111528] Oops: 0000 [#1] PREEMPT SMP PTI
> > > > > [  164.112303] CPU: 2 PID: 2947 Comm: kworker/u8:5 Not
> > > > > tainted
> > > > > 4.20.0-rc1-13223-gafb6d1c474ef #1898
> > > > > [  164.113487] Hardware name: QEMU Standard PC (i440FX +
> > > > > PIIX,
> > > > > 1996), BIOS ?-20180531_142017-buildhw-
> > > > > 08.phx2.fedoraproject.org-
> > > > > 1.fc28 04/01/2014
> > > > > [  164.115301] Workqueue: rpciod rpc_async_schedule [sunrpc]
> > > > > [  164.115920] RIP: 0010:rpcauth_lookup_credcache+0x3d/0x450
> > > > > [sunrpc]
> > > > > [  164.116700] Code: 89 f5 41 54 41 89 d4 53 48 83 ec 38 89
> > > > > 4d b0
> > > > > 4c 8b 7f 20 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48
> > > > > 8d 45
> > > > > c0 48 89 45 c8 <41> 8b 77 08 48 89 45 c0 48 8b 47 10 4c 89 ef
> > > > > 48
> > > > > 8b 40 28 e8 cb d2
> > > > > [  164.119299] RSP: 0018:ffffc90001ee3cf0 EFLAGS: 00010246
> > > > > [  164.119872] RAX: ffffc90001ee3d10 RBX: ffff88007cc18180
> > > > > RCX:
> > > > > 0000000000600040
> > > > > [  164.120800] RDX: 0000000000000001 RSI: ffffc90001ee3d60
> > > > > RDI:
> > > > > ffff88007cafb198
> > > > > [  164.121643] RBP: ffffc90001ee3d50 R08: 0000000000000000
> > > > > R09:
> > > > > 0000000000000000
> > > > > [  164.122464] R10: 0000000000000000 R11: 0000000000000000
> > > > > R12:
> > > > > 0000000000000001
> > > > > [  164.123373] R13: ffffc90001ee3d60 R14: ffff88007cafb198
> > > > > R15:
> > > > > 0000000000000000
> > > > > [  164.124296] FS:  0000000000000000(0000)
> > > > > GS:ffff88007fd00000(0000) knlGS:0000000000000000
> > > > > [  164.125322] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > 0000000080050033
> > > > > [  164.126006] CR2: 0000000000000008 CR3: 000000007829c003
> > > > > CR4:
> > > > > 00000000001606e0
> > > > > [  164.126860] Call Trace:
> > > > > [  164.127045]  ? call_retry_reserve+0x30/0x30 [sunrpc]
> > > > > [  164.127622]  rpcauth_lookupcred+0xa0/0xc0 [sunrpc]
> > > > > [  164.128200]  rpcauth_refreshcred+0x15f/0x170 [sunrpc]
> > > > > [  164.128807]  __rpc_execute+0xa9/0x460 [sunrpc]
> > > > > [  164.129281]  process_one_work+0x227/0x630
> > > > > [  164.129684]  worker_thread+0x3c/0x390
> > > > > [  164.130062]  ? process_one_work+0x630/0x630
> > > > > [  164.130609]  kthread+0x11d/0x140
> > > > > [  164.130936]  ? kthread_park+0x80/0x80
> > > > > [  164.131339]  ret_from_fork+0x3a/0x50
> > > > > [  164.131676] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs
> > > > > lockd
> > > > > grace auth_rpcgss sunrpc
> > > > > [  164.132719] CR2: 0000000000000008
> > > > > [  164.133050] ---[ end trace b4028a6781a696ad ]---
> > > > > 
> > > > 
> > > > I just encountered this repeatedly with cthon04 general tests.
> > > > 
> > > > MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"
> > > > 
> > > > 
> > > > --
> > > > Chuck Lever
> > > > chucklever@gmail.com
> > > > 
> > > > 
> > -- 
> > Trond Myklebust
> > CTO, Hammerspace Inc
> > 4300 El Camino Real, Suite 105
> > Los Altos, CA 94022
> > www.hammer.space
> 
> --
> Chuck Lever
> chucklever@gmail.com
> 
> 
> 
-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space




* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 17:59     ` Trond Myklebust
  2018-11-12 18:16       ` Chuck Lever
@ 2018-11-12 18:24       ` bfields
  2018-11-12 21:17         ` Trond Myklebust
  1 sibling, 1 reply; 13+ messages in thread
From: bfields @ 2018-11-12 18:24 UTC
  To: Trond Myklebust; +Cc: chucklever, schumakeranna, linux-nfs

On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > Looks like it's the fault of
> > 
> > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> 
> I'm having trouble reproducing this bug. I've tried both cthon and
> xfstests in a loop, so far without success (both NFSv3 and v4.1, but
> only sec=sys). Is there anything else you're doing that I might try?
> 
> e.g. Are you running multiple workloads in parallel? Different users?..

Nothing that interesting.  Currently it's connectathon over v4, v3,
v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just serially
one after the other.  Then some pynfs tests (which bypass the client),
then xfstests over v4.2/sys.  And also a few one-off locking tests of my
own that probably aren't a factor here.

(Hah, I just realized I was mounting with vers=4 and assuming that meant
4.0, but actually it's changed over time depending on the defaults, so
currently those "v4" runs are actually all 4.2.  Gah.)
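
Editorial sketch, not part of the original message: one way to check which protocol version a `vers=4` mount actually got is to read the `vers=` option back from the client's mount table. The Python below assumes the table is in `/proc/mounts` format and that the client reports the version in use as a `vers=` option there. The sample line, server name and mount point are hypothetical, and `nfs_versions` is a name chosen for this illustration.

```python
# Hypothetical /proc/mounts line for a mount requested with vers=4.
SAMPLE = "server:/export /mnt/test nfs4 rw,relatime,vers=4.2,proto=tcp,sec=sys 0 0\n"


def nfs_versions(mounts_text):
    """Map mount point -> value of the vers= option for NFS mounts."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        # "key=value" options become dict entries; bare flags map to "".
        opts = dict(opt.partition("=")[::2] for opt in fields[3].split(","))
        result[fields[1]] = opts.get("vers")
    return result


print(nfs_versions(SAMPLE))
```

On a live client the same function could be fed the contents of `/proc/mounts` instead of `SAMPLE`. A test harness that records this value per run would catch a silent change in what a bare `vers=4` resolves to.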

--b.

> 
> > 
> > --b.
> > 
> > On Fri, Nov 09, 2018 at 01:01:30PM -0500, Chuck Lever wrote:
> > > 
> > > > On Nov 8, 2018, at 4:44 PM, J. Bruce Fields <bfields@fieldses.org
> > > > > wrote:
> > > > 
> > > > Since -rc1 my regression tests crash my client.  Is this a known
> > > > problem?  I'll investigate some more, I haven't even looked at
> > > > the code
> > > > yet or checked which test exactly is hitting this.
> > > > 
> > > > --b.
> > > > 
> > > > 
> > > 
> > > I just encountered this repeatedly with cthon04 general tests.
> > > 
> > > MNTOPTIONS="rw,proto=tcp,vers=4.1,sec=sys"
> > > 
> > > 
> > > --
> > > Chuck Lever
> > > chucklever@gmail.com
> > > 
> > > 
> -- 
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
> 
> 


* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 18:24       ` bfields
@ 2018-11-12 21:17         ` Trond Myklebust
  2018-11-12 23:01           ` bfields
  0 siblings, 1 reply; 13+ messages in thread
From: Trond Myklebust @ 2018-11-12 21:17 UTC
  To: bfields; +Cc: schumakeranna, chucklever, linux-nfs

On Mon, 2018-11-12 at 13:24 -0500, bfields@fieldses.org wrote:
> On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > Looks like it's the fault of
> > > 
> > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > 
> > I'm having trouble reproducing this bug. I've tried both cthon and
> > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > but
> > only sec=sys). Is there anything else you're doing that I might
> > try?
> > 
> > e.g. Are you running multiple workloads in parallel? Different
> > users?..
> 
> Nothing that interesting.  Currently it's connectathon over v4, v3,
> v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> serially
> one after the other.  Then some pynfs tests (which bypass the
> client),
> then xfstests over v4.2/sys.  And also a few one-off locking tests of
> my
> own that probably aren't a factor here.
> 
> (Hah, I just realized I was mounting with vers=4 and assuming that
> meant
> 4.0, but actually it's changed over time depending on the defaults,
> so
> currently those "v4" runs are actually all 4.2.  Gah.)

Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
EXCHANGE_ID authentication? The client will attempt to use that by
default if rpc.gssd is running.

I ask because I think the issue might be with RPCSEC_GSS, specifically
with the RPCSEC_GSS context destroy code, hence the 2 patches that I
just sent out.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 21:17         ` Trond Myklebust
@ 2018-11-12 23:01           ` bfields
  2018-11-12 23:57             ` Trond Myklebust
  0 siblings, 1 reply; 13+ messages in thread
From: bfields @ 2018-11-12 23:01 UTC
  To: Trond Myklebust; +Cc: schumakeranna, chucklever, linux-nfs

On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
> On Mon, 2018-11-12 at 13:24 -0500, bfields@fieldses.org wrote:
> > On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > > Looks like it's the fault of
> > > > 
> > > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > > 
> > > I'm having trouble reproducing this bug. I've tried both cthon and
> > > xfstests in a loop, so far without success (both NFSv3 and v4.1,
> > > but
> > > only sec=sys). Is there anything else you're doing that I might
> > > try?
> > > 
> > > e.g. Are you running multiple workloads in parallel? Different
> > > users?..
> > 
> > Nothing that interesting.  Currently it's connectathon over v4, v3,
> > v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> > serially
> > one after the other.  Then some pynfs tests (which bypass the
> > client),
> > then xfstests over v4.2/sys.  And also a few one-off locking tests of
> > my
> > own that probably aren't a factor here.
> > 
> > (Hah, I just realized I was mounting with vers=4 and assuming that
> > meant
> > 4.0, but actually it's changed over time depending on the defaults,
> > so
> > currently those "v4" runs are actually all 4.2.  Gah.)
> 
> Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
> EXCHANGE_ID authentication? The client will attempt to use that by
> default if rpc.gssd is running.

Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
are using krb5i for EXCHANGE_ID.

> I ask because I think the issue might be with RPCSEC_GSS, specifically
> with the RPCSEC_GSS context destroy code, hence the 2 patches that I
> just sent out.

Looks like my tests pass after applying those two patches.

--b.


* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 23:01           ` bfields
@ 2018-11-12 23:57             ` Trond Myklebust
  2018-11-13  0:00               ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Trond Myklebust @ 2018-11-12 23:57 UTC
  To: bfields; +Cc: schumakeranna, chucklever, linux-nfs

On Mon, 2018-11-12 at 18:01 -0500, bfields@fieldses.org wrote:
> On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
> > On Mon, 2018-11-12 at 13:24 -0500, bfields@fieldses.org wrote:
> > > On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > > > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > > > Looks like it's the fault of
> > > > > 
> > > > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > > > 
> > > > I'm having trouble reproducing this bug. I've tried both cthon and
> > > > xfstests in a loop, so far without success (both NFSv3 and v4.1, but
> > > > only sec=sys). Is there anything else you're doing that I might try?
> > > > 
> > > > e.g. Are you running multiple workloads in parallel? Different
> > > > users?..
> > > 
> > > Nothing that interesting.  Currently it's connectathon over v4, v3,
> > > v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just serially
> > > one after the other.  Then some pynfs tests (which bypass the client),
> > > then xfstests over v4.2/sys.  And also a few one-off locking tests of my
> > > own that probably aren't a factor here.
> > > 
> > > (Hah, I just realized I was mounting with vers=4 and assuming that meant
> > > 4.0, but actually it's changed over time depending on the defaults, so
> > > currently those "v4" runs are actually all 4.2.  Gah.)
> > 
> > Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
> > EXCHANGE_ID authentication? The client will attempt to use that by
> > default if rpc.gssd is running.
> 
> Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
> are using krb5i for EXCHANGE_ID.
> 
> > I ask because I think the issue might be with RPCSEC_GSS, specifically
> > with the RPCSEC_GSS context destroy code, hence the 2 patches that I
> > just sent out.
> 
> Looks like my tests pass after applying those two patches.
> 

Cool! Thanks for testing.

Chuck, do you think the above might also explain your sighting of the
same Oops?

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-12 23:57             ` Trond Myklebust
@ 2018-11-13  0:00               ` Chuck Lever
  2018-11-13  0:08                 ` Trond Myklebust
  0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2018-11-13  0:00 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, schumakeranna, chucklever, linux-nfs


> On Nov 12, 2018, at 3:57 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
>> On Mon, 2018-11-12 at 18:01 -0500, bfields@fieldses.org wrote:
>>> On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
>>>> On Mon, 2018-11-12 at 13:24 -0500, bfields@fieldses.org wrote:
>>>>> On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
>>>>>> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
>>>>>> Looks like it's the fault of
>>>>>> 
>>>>>> 07d02a67b7faae "SUNRPC: Simplify lookup code"
>>>>> 
>>>>> I'm having trouble reproducing this bug. I've tried both cthon and
>>>>> xfstests in a loop, so far without success (both NFSv3 and v4.1, but
>>>>> only sec=sys). Is there anything else you're doing that I might try?
>>>>> 
>>>>> e.g. Are you running multiple workloads in parallel? Different
>>>>> users?..
>>>> 
>>>> Nothing that interesting.  Currently it's connectathon over v4, v3,
>>>> v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just serially
>>>> one after the other.  Then some pynfs tests (which bypass the client),
>>>> then xfstests over v4.2/sys.  And also a few one-off locking tests of my
>>>> own that probably aren't a factor here.
>>>> 
>>>> (Hah, I just realized I was mounting with vers=4 and assuming that meant
>>>> 4.0, but actually it's changed over time depending on the defaults, so
>>>> currently those "v4" runs are actually all 4.2.  Gah.)
>>> 
>>> Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
>>> EXCHANGE_ID authentication? The client will attempt to use that by
>>> default if rpc.gssd is running.
>> 
>> Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
>> are using krb5i for EXCHANGE_ID.
>> 
>>> I ask because I think the issue might be with RPCSEC_GSS, specifically
>>> with the RPCSEC_GSS context destroy code, hence the 2 patches that I
>>> just sent out.
>> 
>> Looks like my tests pass after applying those two patches.
>> 
> 
> Cool! Thanks for testing.
> 
> Chuck, do you think the above might also explain your sighting of the
> same Oops?

Could be, I don’t think I saw it until I started testing NFSv4.
I won’t be able to confirm that until next week.


> Cheers
>  Trond
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 



* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-13  0:00               ` Chuck Lever
@ 2018-11-13  0:08                 ` Trond Myklebust
  2018-11-13  0:17                   ` Chuck Lever
  0 siblings, 1 reply; 13+ messages in thread
From: Trond Myklebust @ 2018-11-13  0:08 UTC (permalink / raw)
  To: chuck.lever; +Cc: bfields, schumakeranna, chucklever, linux-nfs

On Mon, 2018-11-12 at 16:00 -0800, Chuck Lever wrote:
> > On Nov 12, 2018, at 3:57 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> > > 
> > > > On Mon, 2018-11-12 at 18:01 -0500, bfields@fieldses.org wrote:
> > > > > On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
> > > > > > On Mon, 2018-11-12 at 13:24 -0500, bfields@fieldses.org wrote:
> > > > > > > On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
> > > > > > > > On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
> > > > > > > Looks like it's the fault of
> > > > > > > 
> > > > > > > 07d02a67b7faae "SUNRPC: Simplify lookup code"
> > > > > > 
> > > > > > I'm having trouble reproducing this bug. I've tried both cthon
> > > > > > and xfstests in a loop, so far without success (both NFSv3 and
> > > > > > v4.1, but only sec=sys). Is there anything else you're doing
> > > > > > that I might try?
> > > > > > 
> > > > > > e.g. Are you running multiple workloads in parallel? Different
> > > > > > users?..
> > > > > 
> > > > > Nothing that interesting.  Currently it's connectathon over v4,
> > > > > v3, v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
> > > > > serially one after the other.  Then some pynfs tests (which bypass
> > > > > the client), then xfstests over v4.2/sys.  And also a few one-off
> > > > > locking tests of my own that probably aren't a factor here.
> > > > > 
> > > > > (Hah, I just realized I was mounting with vers=4 and assuming that
> > > > > meant 4.0, but actually it's changed over time depending on the
> > > > > defaults, so currently those "v4" runs are actually all 4.2.  Gah.)
> > > > 
> > > > Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
> > > > EXCHANGE_ID authentication? The client will attempt to use that by
> > > > default if rpc.gssd is running.
> > > 
> > > Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
> > > are using krb5i for EXCHANGE_ID.
> > > 
> > > > I ask because I think the issue might be with RPCSEC_GSS, specifically
> > > > with the RPCSEC_GSS context destroy code, hence the 2 patches that I
> > > > just sent out.
> > > 
> > > Looks like my tests pass after applying those two patches.
> > > 
> > 
> > Cool! Thanks for testing.
> > 
> > Chuck, do you think the above might also explain your sighting of the
> > same Oops?
> 
> Could be, I don’t think I saw it until I started testing NFSv4.
> I won’t be able to confirm that until next week.
> 

OK. Either way, I know that part of the GSS code needs to be fixed in
order to deal with the reference count being 0, so I think it is worth
merging this patch now, and then we can see if there is more to the
regression when you can get back to your test rig.

Thanks
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
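
For readers following the "reference count being 0" remark above: one common
way to handle an object whose count may already have dropped to zero is to
take a reference only if the count is still nonzero, similar in spirit to the
kernel's refcount_inc_not_zero(). The userspace sketch below illustrates that
general pattern with C11 atomics. It is not the patch discussed in this
thread, and the names struct cred, cred_get_unless_zero() and cred_put() are
hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; NOT the kernel's struct rpc_cred. */
struct cred {
	atomic_int refcount;
};

/*
 * Take a reference only if the object is still live.
 * Returns false when the count has already reached zero, meaning
 * another path is tearing the object down and it must not be reused.
 */
static bool cred_get_unless_zero(struct cred *c)
{
	int old = atomic_load(&c->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&c->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Drop a reference. Returns true when this call released the last
 * reference, so the caller is the one responsible for destruction.
 */
static bool cred_put(struct cred *c)
{
	int old = atomic_fetch_sub(&c->refcount, 1);

	assert(old > 0);	/* catch underflow in this sketch */
	return old == 1;
}
```

A lookup path built this way would treat a false return from
cred_get_unless_zero() as "entry is going away" and fall back to creating a
fresh entry rather than touching the dying one.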




* Re: NULL dereference in rpcauth_lookup_credcache
  2018-11-13  0:08                 ` Trond Myklebust
@ 2018-11-13  0:17                   ` Chuck Lever
  0 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2018-11-13  0:17 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: bfields, schumakeranna, chucklever, linux-nfs


> On Nov 12, 2018, at 4:08 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2018-11-12 at 16:00 -0800, Chuck Lever wrote:
>>> On Nov 12, 2018, at 3:57 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>> 
>>>>> On Mon, 2018-11-12 at 18:01 -0500, bfields@fieldses.org wrote:
>>>>> On Mon, Nov 12, 2018 at 09:17:16PM +0000, Trond Myklebust wrote:
>>>>>> On Mon, 2018-11-12 at 13:24 -0500, bfields@fieldses.org wrote:
>>>>>>> On Mon, Nov 12, 2018 at 05:59:33PM +0000, Trond Myklebust wrote:
>>>>>>>> On Sat, 2018-11-10 at 16:49 -0500, Bruce Fields wrote:
>>>>>>>> Looks like it's the fault of
>>>>>>>> 
>>>>>>>> 07d02a67b7faae "SUNRPC: Simplify lookup code"
>>>>>>> 
>>>>>>> I'm having trouble reproducing this bug. I've tried both cthon and
>>>>>>> xfstests in a loop, so far without success (both NFSv3 and v4.1,
>>>>>>> but only sec=sys). Is there anything else you're doing that I might
>>>>>>> try?
>>>>>>> 
>>>>>>> e.g. Are you running multiple workloads in parallel? Different
>>>>>>> users?..
>>>>>> 
>>>>>> Nothing that interesting.  Currently it's connectathon over v4, v3,
>>>>>> v4/krb5, v3/krb5, v4/krb5i, v4/krb5p, v4.1, v4.1/krb5, but just
>>>>>> serially one after the other.  Then some pynfs tests (which bypass
>>>>>> the client), then xfstests over v4.2/sys.  And also a few one-off
>>>>>> locking tests of my own that probably aren't a factor here.
>>>>>> 
>>>>>> (Hah, I just realized I was mounting with vers=4 and assuming that
>>>>>> meant 4.0, but actually it's changed over time depending on the
>>>>>> defaults, so currently those "v4" runs are actually all 4.2.  Gah.)
>>>>> 
>>>>> Are you perhaps both using RPCSEC_GSS w/ integrity checking for your
>>>>> EXCHANGE_ID authentication? The client will attempt to use that by
>>>>> default if rpc.gssd is running.
>>>> 
>>>> Yes, in addition to the krb5i mount I'd expect the sys/krb5/krb5p mounts
>>>> are using krb5i for EXCHANGE_ID.
>>>> 
>>>>> I ask because I think the issue might be with RPCSEC_GSS, specifically
>>>>> with the RPCSEC_GSS context destroy code, hence the 2 patches that I
>>>>> just sent out.
>>>> 
>>>> Looks like my tests pass after applying those two patches.
>>>> 
>>> 
>>> Cool! Thanks for testing.
>>> 
>>> Chuck, do you think the above might also explain your sighting of the
>>> same Oops?
>> 
>> Could be, I don’t think I saw it until I started testing NFSv4.
>> I won’t be able to confirm that until next week.
>> 
> 
> OK. Either way, I know that part of the GSS code needs to be fixed in
> order to deal with the reference count being 0, so I think it is worth
> merging this patch now, and then we can see if there is more to the
> regression when you can get back to your test rig.

Sounds fine to me.


> Thanks
>  Trond
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 



Thread overview: 13+ messages
2018-11-08 21:44 NULL dereference in rpcauth_lookup_credcache J. Bruce Fields
2018-11-09 18:01 ` Chuck Lever
2018-11-10 21:49   ` Bruce Fields
2018-11-12 17:59     ` Trond Myklebust
2018-11-12 18:16       ` Chuck Lever
2018-11-12 18:18         ` Trond Myklebust
2018-11-12 18:24       ` bfields
2018-11-12 21:17         ` Trond Myklebust
2018-11-12 23:01           ` bfields
2018-11-12 23:57             ` Trond Myklebust
2018-11-13  0:00               ` Chuck Lever
2018-11-13  0:08                 ` Trond Myklebust
2018-11-13  0:17                   ` Chuck Lever
