All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
@ 2016-09-01 14:50 Chuck Lever
  2016-09-06 20:42 ` J. Bruce Fields
  0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2016-09-01 14:50 UTC (permalink / raw)
  To: bfields; +Cc: linux-nfs

On occasion I see this on my server:

kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
kernel: invalid opcode: 0000 [#1] SMP
kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15
kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
kernel: Workqueue: events do_cache_clean [sunrpc]
kernel: task: ffff8808541d8000 task.stack: ffff880854344000
kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600
kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064
kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780
kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500
kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0
kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0
kernel: Stack:
kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0
kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006
kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f
kernel: Call Trace:
kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
kernel: RSP <ffff880854347d70>
kernel: ---[ end trace 3fdec044969def26 ]---

It seems to be most common after a server reboot where a client has been
using a Kerberos mount, and reconnects to continue its workload.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
Hi Bruce-

I've hit this off-and-on oops since 64c59a3726f2 was merged.
Reverting it seems to make the problem go away. I haven't dug
into why.

Thoughts?


 net/sunrpc/auth_gss/svcauth_gss.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
index 1d28181..d858202 100644
--- a/net/sunrpc/auth_gss/svcauth_gss.c
+++ b/net/sunrpc/auth_gss/svcauth_gss.c
@@ -569,9 +569,10 @@ gss_svc_searchbyctx(struct cache_detail *cd, struct xdr_netobj *handle)
 	struct rsc *found;
 
 	memset(&rsci, 0, sizeof(rsci));
-	rsci.handle.data = handle->data;
-	rsci.handle.len = handle->len;
+	if (dup_to_netobj(&rsci.handle, handle->data, handle->len))
+		return NULL;
 	found = rsc_lookup(cd, &rsci);
+	rsc_free(&rsci);
 	if (!found)
 		return NULL;
 	if (cache_check(cd, &found->h, NULL))


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-01 14:50 [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation") Chuck Lever
@ 2016-09-06 20:42 ` J. Bruce Fields
  2016-09-06 20:49   ` Chuck Lever
  0 siblings, 1 reply; 9+ messages in thread
From: J. Bruce Fields @ 2016-09-06 20:42 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, bcodding

On Thu, Sep 01, 2016 at 10:50:38AM -0400, Chuck Lever wrote:
> On occasion I see this on my server:
> 
> kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
> kernel: invalid opcode: 0000 [#1] SMP
> kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
> kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15
> kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
> kernel: Workqueue: events do_cache_clean [sunrpc]
> kernel: task: ffff8808541d8000 task.stack: ffff880854344000
> kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] kfree+0x155/0x180
> kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
> kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600
> kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064
> kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780
> kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500
> kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0
> kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0
> kernel: Stack:
> kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0
> kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006
> kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f
> kernel: Call Trace:
> kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
> kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
> kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
> kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
> kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
> kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
> kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
> kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
> kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
> kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
> kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
> kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
> kernel: RSP <ffff880854347d70>
> kernel: ---[ end trace 3fdec044969def26 ]---
> 
> It seems to be most common after a server reboot where a client has been
> using a Kerberos mount, and reconnects to continue its workload.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> Hi Bruce-
> 
> I've hit this off-and-on oops since 64c59a3726f2 was merged.
> Reverting it seems to make the problem go away. I haven't dug
> into why.
> 
> Thoughts?

Apologies, I wasn't thinking when I wrote that patch.  The problem is
probably that rsc_lookup steals the passed-in memory to avoid doing an
allocation of its own, so we can't just pass in a pointer to memory that
someone else is using....

If we really want to avoid allocation there then maybe we should
preallocate somwhere, or reference count these handles.

For now reverting sounds like the right thing to do.

Ben, did you ever confirm whether this helped with the problem you were
seeing?  (If I remember correctly, unnpredictable delays here could
cause the request to be dropped if later requests push the rpcsec_gss
sequence window too far.)  If so then we could look into reference
counting.

--b.

> 
> 
>  net/sunrpc/auth_gss/svcauth_gss.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
> index 1d28181..d858202 100644
> --- a/net/sunrpc/auth_gss/svcauth_gss.c
> +++ b/net/sunrpc/auth_gss/svcauth_gss.c
> @@ -569,9 +569,10 @@ gss_svc_searchbyctx(struct cache_detail *cd, struct xdr_netobj *handle)
>  	struct rsc *found;
>  
>  	memset(&rsci, 0, sizeof(rsci));
> -	rsci.handle.data = handle->data;
> -	rsci.handle.len = handle->len;
> +	if (dup_to_netobj(&rsci.handle, handle->data, handle->len))
> +		return NULL;
>  	found = rsc_lookup(cd, &rsci);
> +	rsc_free(&rsci);
>  	if (!found)
>  		return NULL;
>  	if (cache_check(cd, &found->h, NULL))

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 20:42 ` J. Bruce Fields
@ 2016-09-06 20:49   ` Chuck Lever
  2016-09-06 21:01     ` J. Bruce Fields
  0 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2016-09-06 20:49 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS Mailing List, bcodding


> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> =
wrote:
>=20
> On Thu, Sep 01, 2016 at 10:50:38AM -0400, Chuck Lever wrote:
>> On occasion I see this on my server:
>>=20
>> kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
>> kernel: invalid opcode: 0000 [#1] SMP
>> kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core =
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass =
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw =
gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt =
iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core =
mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad =
acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad =
rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc =
ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast =
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm =
crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca =
i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
>> kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted =
4.8.0-rc4-00006-g9d06b0b #15
>> kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c =
09/09/2015
>> kernel: Workqueue: events do_cache_clean [sunrpc]
>> kernel: task: ffff8808541d8000 task.stack: ffff880854344000
>> kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] =
kfree+0x155/0x180
>> kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
>> kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: =
146ff0f9d5ec5600
>> kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: =
ffff88083f9db064
>> kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: =
ffff88087fdd8780
>> kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: =
ffff880853f01500
>> kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: =
ffffffffa04253a0
>> kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) =
knlGS:0000000000000000
>> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: =
00000000001406e0
>> kernel: Stack:
>> kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 =
ffff880854347da0
>> kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 =
ffffffffa013d006
>> kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 =
0000000057c7044f
>> kernel: Call Trace:
>> kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
>> kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
>> kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
>> kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
>> kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
>> kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
>> kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
>> kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
>> kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
>> kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
>> kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 =
a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff =
ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
>> kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
>> kernel: RSP <ffff880854347d70>
>> kernel: ---[ end trace 3fdec044969def26 ]---
>>=20
>> It seems to be most common after a server reboot where a client has =
been
>> using a Kerberos mount, and reconnects to continue its workload.
>>=20
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> Hi Bruce-
>>=20
>> I've hit this off-and-on oops since 64c59a3726f2 was merged.
>> Reverting it seems to make the problem go away. I haven't dug
>> into why.
>>=20
>> Thoughts?
>=20
> Apologies, I wasn't thinking when I wrote that patch.  The problem is
> probably that rsc_lookup steals the passed-in memory to avoid doing an
> allocation of its own, so we can't just pass in a pointer to memory =
that
> someone else is using....
>=20
> If we really want to avoid allocation there then maybe we should
> preallocate somwhere, or reference count these handles.
>=20
> For now reverting sounds like the right thing to do.

NP, thanks for confirming!


> Ben, did you ever confirm whether this helped with the problem you =
were
> seeing?  (If I remember correctly, unnpredictable delays here could
> cause the request to be dropped if later requests push the rpcsec_gss
> sequence window too far.)  If so then we could look into reference
> counting.

Well that's interesting.

When a request is dropped, would the server disconnect? Because if it
doesn't, the client will wait forever.


> --b.
>=20
>>=20
>>=20
>> net/sunrpc/auth_gss/svcauth_gss.c |    5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>=20
>> diff --git a/net/sunrpc/auth_gss/svcauth_gss.c =
b/net/sunrpc/auth_gss/svcauth_gss.c
>> index 1d28181..d858202 100644
>> --- a/net/sunrpc/auth_gss/svcauth_gss.c
>> +++ b/net/sunrpc/auth_gss/svcauth_gss.c
>> @@ -569,9 +569,10 @@ gss_svc_searchbyctx(struct cache_detail *cd, =
struct xdr_netobj *handle)
>> 	struct rsc *found;
>>=20
>> 	memset(&rsci, 0, sizeof(rsci));
>> -	rsci.handle.data =3D handle->data;
>> -	rsci.handle.len =3D handle->len;
>> +	if (dup_to_netobj(&rsci.handle, handle->data, handle->len))
>> +		return NULL;
>> 	found =3D rsc_lookup(cd, &rsci);
>> +	rsc_free(&rsci);
>> 	if (!found)
>> 		return NULL;
>> 	if (cache_check(cd, &found->h, NULL))

--
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 20:49   ` Chuck Lever
@ 2016-09-06 21:01     ` J. Bruce Fields
  2016-09-06 21:25       ` Chuck Lever
  0 siblings, 1 reply; 9+ messages in thread
From: J. Bruce Fields @ 2016-09-06 21:01 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List, bcodding

On Tue, Sep 06, 2016 at 04:49:33PM -0400, Chuck Lever wrote:
> 
> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > Apologies, I wasn't thinking when I wrote that patch.  The problem is
> > probably that rsc_lookup steals the passed-in memory to avoid doing an
> > allocation of its own, so we can't just pass in a pointer to memory that
> > someone else is using....
> > 
> > If we really want to avoid allocation there then maybe we should
> > preallocate somwhere, or reference count these handles.
> > 
> > For now reverting sounds like the right thing to do.
> 
> NP, thanks for confirming!
> 
> 
> > Ben, did you ever confirm whether this helped with the problem you were
> > seeing?  (If I remember correctly, unnpredictable delays here could
> > cause the request to be dropped if later requests push the rpcsec_gss
> > sequence window too far.)  If so then we could look into reference
> > counting.
> 
> Well that's interesting.
> 
> When a request is dropped, would the server disconnect? Because if it
> doesn't, the client will wait forever.

Checking... gss_verify_header returns SVC_DROP, which is just a silent
close (SVC_CLOSE would close the connection).

I'm not sure what's correct there.

--b.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 21:01     ` J. Bruce Fields
@ 2016-09-06 21:25       ` Chuck Lever
  2016-09-06 23:49         ` Benjamin Coddington
  2016-09-07 14:47         ` J. Bruce Fields
  0 siblings, 2 replies; 9+ messages in thread
From: Chuck Lever @ 2016-09-06 21:25 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS Mailing List, bcodding


> On Sep 6, 2016, at 5:01 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Tue, Sep 06, 2016 at 04:49:33PM -0400, Chuck Lever wrote:
>> 
>> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>> Apologies, I wasn't thinking when I wrote that patch.  The problem is
>>> probably that rsc_lookup steals the passed-in memory to avoid doing an
>>> allocation of its own, so we can't just pass in a pointer to memory that
>>> someone else is using....
>>> 
>>> If we really want to avoid allocation there then maybe we should
>>> preallocate somwhere, or reference count these handles.
>>> 
>>> For now reverting sounds like the right thing to do.
>> 
>> NP, thanks for confirming!
>> 
>> 
>>> Ben, did you ever confirm whether this helped with the problem you were
>>> seeing?  (If I remember correctly, unnpredictable delays here could
>>> cause the request to be dropped if later requests push the rpcsec_gss
>>> sequence window too far.)  If so then we could look into reference
>>> counting.
>> 
>> Well that's interesting.
>> 
>> When a request is dropped, would the server disconnect? Because if it
>> doesn't, the client will wait forever.
> 
> Checking... gss_verify_header returns SVC_DROP, which is just a silent
> close (SVC_CLOSE would close the connection).
> 
> I'm not sure what's correct there.

Right, we may not get any guidance from the RPCSEC GSS specifications.

However, the Linux NFS client retransmit code was changed in 2013 so that
NFSv4 never retransmits until the server drops the connection, starting
around commit 8a19a0b6cb2e2216afd68ef2047f30260cc8a220.

SVC_CLOSE might be a better choice, at least for NFSv4.


--
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 21:25       ` Chuck Lever
@ 2016-09-06 23:49         ` Benjamin Coddington
  2016-09-07 14:35           ` Chuck Lever
  2016-09-13 20:39           ` J. Bruce Fields
  2016-09-07 14:47         ` J. Bruce Fields
  1 sibling, 2 replies; 9+ messages in thread
From: Benjamin Coddington @ 2016-09-06 23:49 UTC (permalink / raw)
  To: J. Bruce Fields, Chuck Lever; +Cc: Linux NFS Mailing List

On 6 Sep 2016, at 17:25, Chuck Lever wrote:

>> On Sep 6, 2016, at 5:01 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>
>> On Tue, Sep 06, 2016 at 04:49:33PM -0400, Chuck Lever wrote:
>>>
>>> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
>>>> Apologies, I wasn't thinking when I wrote that patch.  The problem is
>>>> probably that rsc_lookup steals the passed-in memory to avoid doing an
>>>> allocation of its own, so we can't just pass in a pointer to memory that
>>>> someone else is using....
>>>>
>>>> If we really want to avoid allocation there then maybe we should
>>>> preallocate somwhere, or reference count these handles.
>>>>
>>>> For now reverting sounds like the right thing to do.
>>>
>>> NP, thanks for confirming!
>>>
>>>
>>>> Ben, did you ever confirm whether this helped with the problem you were
>>>> seeing?  (If I remember correctly, unnpredictable delays here could
>>>> cause the request to be dropped if later requests push the rpcsec_gss
>>>> sequence window too far.)  If so then we could look into reference
>>>> counting.

I somehow missed this; I haven't tested it.

>>> Well that's interesting.
>>>
>>> When a request is dropped, would the server disconnect? Because if it
>>> doesn't, the client will wait forever.
>>
>> Checking... gss_verify_header returns SVC_DROP, which is just a silent
>> close (SVC_CLOSE would close the connection).
>>
>> I'm not sure what's correct there.
>
> Right, we may not get any guidance from the RPCSEC GSS specifications.
>
> However, the Linux NFS client retransmit code was changed in 2013 so that
> NFSv4 never retransmits until the server drops the connection, starting
> around commit 8a19a0b6cb2e2216afd68ef2047f30260cc8a220.

That is the behavior I was seeing.

Ben


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 23:49         ` Benjamin Coddington
@ 2016-09-07 14:35           ` Chuck Lever
  2016-09-13 20:39           ` J. Bruce Fields
  1 sibling, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2016-09-07 14:35 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: J. Bruce Fields, Linux NFS Mailing List


> On Sep 6, 2016, at 7:49 PM, Benjamin Coddington <bcodding@redhat.com> wrot=
e:
>=20
> On 6 Sep 2016, at 17:25, Chuck Lever wrote:
>=20
>>> On Sep 6, 2016, at 5:01 PM, J. Bruce Fields <bfields@fieldses.org> wrote=
:
>>>=20
>>>> On Tue, Sep 06, 2016 at 04:49:33PM -0400, Chuck Lever wrote:
>>>>=20
>>>>> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wro=
te:
>>>>> Apologies, I wasn't thinking when I wrote that patch.  The problem is
>>>>> probably that rsc_lookup steals the passed-in memory to avoid doing an=

>>>>> allocation of its own, so we can't just pass in a pointer to memory th=
at
>>>>> someone else is using....
>>>>>=20
>>>>> If we really want to avoid allocation there then maybe we should
>>>>> preallocate somwhere, or reference count these handles.
>>>>>=20
>>>>> For now reverting sounds like the right thing to do.
>>>>=20
>>>> NP, thanks for confirming!
>>>>=20
>>>>=20
>>>>> Ben, did you ever confirm whether this helped with the problem you wer=
e
>>>>> seeing?  (If I remember correctly, unnpredictable delays here could
>>>>> cause the request to be dropped if later requests push the rpcsec_gss
>>>>> sequence window too far.)  If so then we could look into reference
>>>>> counting.
>=20
> I somehow missed this; I haven't tested it.
>=20
>>>> Well that's interesting.
>>>>=20
>>>> When a request is dropped, would the server disconnect? Because if it
>>>> doesn't, the client will wait forever.
>>>=20
>>> Checking... gss_verify_header returns SVC_DROP, which is just a silent
>>> close (SVC_CLOSE would close the connection).
>>>=20
>>> I'm not sure what's correct there.
>>=20
>> Right, we may not get any guidance from the RPCSEC GSS specifications.
>>=20
>> However, the Linux NFS client retransmit code was changed in 2013 so that=

>> NFSv4 never retransmits until the server drops the connection, starting
>> around commit 8a19a0b6cb2e2216afd68ef2047f30260cc8a220.
>=20
> That is the behavior I was seeing.

I have a reproducer that is somewhat reliable. I will try using
SVC_CLOSE.=

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 21:25       ` Chuck Lever
  2016-09-06 23:49         ` Benjamin Coddington
@ 2016-09-07 14:47         ` J. Bruce Fields
  1 sibling, 0 replies; 9+ messages in thread
From: J. Bruce Fields @ 2016-09-07 14:47 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Linux NFS Mailing List, bcodding

On Tue, Sep 06, 2016 at 05:25:38PM -0400, Chuck Lever wrote:
> 
> > On Sep 6, 2016, at 5:01 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > 
> > On Tue, Sep 06, 2016 at 04:49:33PM -0400, Chuck Lever wrote:
> >> 
> >> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>> Apologies, I wasn't thinking when I wrote that patch.  The problem is
> >>> probably that rsc_lookup steals the passed-in memory to avoid doing an
> >>> allocation of its own, so we can't just pass in a pointer to memory that
> >>> someone else is using....
> >>> 
> >>> If we really want to avoid allocation there then maybe we should
> >>> preallocate somwhere, or reference count these handles.
> >>> 
> >>> For now reverting sounds like the right thing to do.
> >> 
> >> NP, thanks for confirming!
> >> 
> >> 
> >>> Ben, did you ever confirm whether this helped with the problem you were
> >>> seeing?  (If I remember correctly, unnpredictable delays here could
> >>> cause the request to be dropped if later requests push the rpcsec_gss
> >>> sequence window too far.)  If so then we could look into reference
> >>> counting.
> >> 
> >> Well that's interesting.
> >> 
> >> When a request is dropped, would the server disconnect? Because if it
> >> doesn't, the client will wait forever.
> > 
> > Checking... gss_verify_header returns SVC_DROP, which is just a silent
> > close (SVC_CLOSE would close the connection).
> > 
> > I'm not sure what's correct there.
> 
> Right, we may not get any guidance from the RPCSEC GSS specifications.

Yeah, it won't say anything about disconnecting.  It does require the
drop, and gives rationale:

	The reason for discarding requests silently is that the server
	is unable to determine if the duplicate or out of range request
	was due to a sequencing problem in the client, network, or the
	operating system, or due to some quirk in routing, or a replay
	attack by an intruder.  Discarding the request allows the client
	to recover after timing out, if indeed the duplication was
	unintentional or well intended.

I'm trying to think of disadvantages to dropping:

	- an attacker can force a disconnect.  But if they can sniff the
	  network and inject packets then they can already break TCP
	  connections.
	- replays due to networking bugs get turned into unnecessary
	  disconnections.  But, do those actually happen, especially
	  over TCP?

So, OK, disconnect.

> 
> However, the Linux NFS client retransmit code was changed in 2013 so that
> NFSv4 never retransmits until the server drops the connection, starting
> around commit 8a19a0b6cb2e2216afd68ef2047f30260cc8a220.
> 
> SVC_CLOSE might be a better choice, at least for NFSv4.

Ugh.  I don't like sticking an NFSv4-specific exception here in the rpc
code, but that's probably right.  We'll need to check for where else
this is needed.

--b.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")
  2016-09-06 23:49         ` Benjamin Coddington
  2016-09-07 14:35           ` Chuck Lever
@ 2016-09-13 20:39           ` J. Bruce Fields
  1 sibling, 0 replies; 9+ messages in thread
From: J. Bruce Fields @ 2016-09-13 20:39 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Chuck Lever, Linux NFS Mailing List

On Tue, Sep 06, 2016 at 07:49:31PM -0400, Benjamin Coddington wrote:
> On 6 Sep 2016, at 17:25, Chuck Lever wrote:
> 
> >> On Sep 6, 2016, at 5:01 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>
> >> On Tue, Sep 06, 2016 at 04:49:33PM -0400, Chuck Lever wrote:
> >>>
> >>> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> >>>> Apologies, I wasn't thinking when I wrote that patch.  The problem is
> >>>> probably that rsc_lookup steals the passed-in memory to avoid doing an
> >>>> allocation of its own, so we can't just pass in a pointer to memory that
> >>>> someone else is using....
> >>>>
> >>>> If we really want to avoid allocation there then maybe we should
> >>>> preallocate somwhere, or reference count these handles.
> >>>>
> >>>> For now reverting sounds like the right thing to do.
> >>>
> >>> NP, thanks for confirming!
> >>>
> >>>
> >>>> Ben, did you ever confirm whether this helped with the problem you were
> >>>> seeing?  (If I remember correctly, unnpredictable delays here could
> >>>> cause the request to be dropped if later requests push the rpcsec_gss
> >>>> sequence window too far.)  If so then we could look into reference
> >>>> counting.
> 
> I somehow missed this; I haven't tested it.

OK.  If you get a reliable reproducer for the reordering problem, we can
revisit this, till then I'm not going to worry about it (and meanwhile
Chuck's work to close on dropping should mitigate the symptoms).

--b.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-09-13 20:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-01 14:50 [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation") Chuck Lever
2016-09-06 20:42 ` J. Bruce Fields
2016-09-06 20:49   ` Chuck Lever
2016-09-06 21:01     ` J. Bruce Fields
2016-09-06 21:25       ` Chuck Lever
2016-09-06 23:49         ` Benjamin Coddington
2016-09-07 14:35           ` Chuck Lever
2016-09-13 20:39           ` J. Bruce Fields
2016-09-07 14:47         ` J. Bruce Fields

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.