kernel-tls-handshake.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* fput() considered harmful
@ 2023-03-08 13:24 Hannes Reinecke
  2023-03-08 13:43 ` Chuck Lever III
  0 siblings, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2023-03-08 13:24 UTC (permalink / raw)
  To: Chuck Lever; +Cc: kernel-tls-handshake

Hi Chuck,

I'm playing around with v6 and (again) facing really nasty crashes:

[ 1662.912887] ------------[ cut here ]------------
[ 1662.913399] kernel BUG at fs/inode.c:1763!
[ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G 
       E      6.2.0-rc4-54-default+ #231
[ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
0.0.0 02/06/2015
[ 1662.915932] RIP: 0010:iput+0x1d/0x20
[ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 
00 48 85 ff 74 0e f6 87 98 00 00 00 40
  75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 0f 1f 44
[ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
[ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 
0000000000000064
[ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: 
ffff953bcbb23b00
[ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: 
ffff953bcbb23b00
[ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: 
ffff953bcb9d91d8
[ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: 
ffff953b83918cb8
[ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) 
knlGS:0000000000000000
[ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 
0000000000350ee0
[ 1662.923568] Call Trace:
[ 1662.923801]  <TASK>
[ 1662.924006]  __dentry_kill+0xca/0x160
[ 1662.924353]  __fput+0xdf/0x240
[ 1662.924652]  task_work_run+0x67/0xa0
[ 1662.924992]  do_exit+0x343/0xb50
[ 1662.925302]  do_group_exit+0x2d/0x80
[ 1662.925644]  __x64_sys_exit_group+0x14/0x20
[ 1662.926039]  do_syscall_64+0x31/0x80

Key problem seems to be userspace here, which (apparently) is calling 
close() on the fd which we've passed from the kernel.
While this _normally_ is not a problem, in our case we have the problem
that the filedescriptor is associated with a socket (ie struct socket).
And that one is shared between kernel code and userland.
And doesn't have any refcounting whatsoever; socket_release() removes 
the reference to the underlying file:

	if (!sock->file) {
		iput(SOCK_INODE(sock));
		return;
	}
	sock->file = NULL;

ie if _another_ process (the kernel driver, say) is calling 'close()', 
too, it'll run into the !sock->file condition and crash in iput().

I have been testing various things here, not calling fput, calling 
get_file() etc, but either hit a crash or kmemleak complaining about
the file not being freed.

Have you seen similar issues?
And even if you haven't: relying on userspace _not_ to call 'close()'
on the socket seems to be a quite dangerous concept.
Maybe we need to implement a 'socket_clone()' functionality?

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fput() considered harmful
  2023-03-08 13:24 fput() considered harmful Hannes Reinecke
@ 2023-03-08 13:43 ` Chuck Lever III
  2023-03-08 15:40   ` Chuck Lever III
  0 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 13:43 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: kernel-tls-handshake



> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> Hi Chuck,
> 
> I'm playing around with v6 and (again) facing really nasty crashes:
> 
> [ 1662.912887] ------------[ cut here ]------------
> [ 1662.913399] kernel BUG at fs/inode.c:1763!
> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G       E      6.2.0-rc4-54-default+ #231
> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
> [ 1662.923568] Call Trace:
> [ 1662.923801]  <TASK>
> [ 1662.924006]  __dentry_kill+0xca/0x160
> [ 1662.924353]  __fput+0xdf/0x240
> [ 1662.924652]  task_work_run+0x67/0xa0
> [ 1662.924992]  do_exit+0x343/0xb50
> [ 1662.925302]  do_group_exit+0x2d/0x80
> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
> [ 1662.926039]  do_syscall_64+0x31/0x80
> 
> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
> While this _normally_ is not a problem, in our case we have the problem
> that the filedescriptor is associated with a socket (ie struct socket).
> And that one is shared between kernel code and userland.
> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
> 
> 	if (!sock->file) {
> 		iput(SOCK_INODE(sock));
> 		return;
> 	}
> 	sock->file = NULL;
> 
> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput().
> 
> I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about
> the file not being freed.
> 
> Have you seen similar issues?

I have, and they all vanished with the introduction of handshake_dup and haven't returned.


> And even if you haven't: relying on userspace _not_ to call 'close()'
> on the socket seems to be a quite dangerous concept.

- The kernel "closes" that file descriptor anyway when the process exits. Is that not the same as a user space close(2) ?

- The endpoint was created via dup, so calling close on it should merely remove that endpoint and not trigger a sock_release().

AIUI the reference counting of the struct file * is what prevents user space from calling ->release. So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().


> Maybe we need to implement a 'socket_clone()' functionality?

The alternative is to create a unique set of file_ops for such shared sockets, but that quickly becomes painful.

I was thinking of trying to duplicate your set up here to help diagnose. Are there instructions you like to follow for setting up a small NVMe/TCP test bed?


--
Chuck Lever



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fput() considered harmful
  2023-03-08 13:43 ` Chuck Lever III
@ 2023-03-08 15:40   ` Chuck Lever III
  2023-03-08 16:04     ` Hannes Reinecke
  0 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 15:40 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: kernel-tls-handshake



> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>> 
>> Hi Chuck,
>> 
>> I'm playing around with v6 and (again) facing really nasty crashes:
>> 
>> [ 1662.912887] ------------[ cut here ]------------
>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G       E      6.2.0-rc4-54-default+ #231
>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>> [ 1662.923568] Call Trace:
>> [ 1662.923801]  <TASK>
>> [ 1662.924006]  __dentry_kill+0xca/0x160
>> [ 1662.924353]  __fput+0xdf/0x240
>> [ 1662.924652]  task_work_run+0x67/0xa0
>> [ 1662.924992]  do_exit+0x343/0xb50
>> [ 1662.925302]  do_group_exit+0x2d/0x80
>> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
>> [ 1662.926039]  do_syscall_64+0x31/0x80
>> 
>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>> While this _normally_ is not a problem, in our case we have the problem
>> that the filedescriptor is associated with a socket (ie struct socket).
>> And that one is shared between kernel code and userland.
>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>> 
>> 	if (!sock->file) {
>> 		iput(SOCK_INODE(sock));
>> 		return;
>> 	}
>> 	sock->file = NULL;
>> 
>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput().
>> 
>> I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about
>> the file not being freed.
>> 
>> Have you seen similar issues?
> 
> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
> 
> 
>> And even if you haven't: relying on userspace _not_ to call 'close()'
>> on the socket seems to be a quite dangerous concept.
> 
> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the same as a user space close(2) ?
> 
> - The endpoint was created via dup, so calling close on it should merely remove that endpoint and not trigger a sock_release().
> 
> AIUI the reference counting of the struct file * is what prevents user space from calling ->release. So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().

I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
I tried it once with the close /before/ doing the handshake, and once
with the close /after/. I don't see any kernel issues on my server
system when a client attempts to handshake with it.

Based on what I've read here, the missing get_file() theory seems
plausible.


>> Maybe we need to implement a 'socket_clone()' functionality?
> 
> The alternative is to create a unique set of file_ops for such shared sockets, but that quickly becomes painful.
> 
> I was thinking of trying to duplicate your set up here to help diagnose. Are there instructions you like to follow for setting up a small NVMe/TCP test bed?


--
Chuck Lever



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fput() considered harmful
  2023-03-08 15:40   ` Chuck Lever III
@ 2023-03-08 16:04     ` Hannes Reinecke
  2023-03-08 16:51       ` Chuck Lever III
  2023-03-08 17:33       ` Hannes Reinecke
  0 siblings, 2 replies; 7+ messages in thread
From: Hannes Reinecke @ 2023-03-08 16:04 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: kernel-tls-handshake

On 3/8/23 16:40, Chuck Lever III wrote:
> 
> 
>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>
>>
>>
>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>
>>> Hi Chuck,
>>>
>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>
>>> [ 1662.912887] ------------[ cut here ]------------
>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G       E      6.2.0-rc4-54-default+ #231
>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>>> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>>> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>>> [ 1662.923568] Call Trace:
>>> [ 1662.923801]  <TASK>
>>> [ 1662.924006]  __dentry_kill+0xca/0x160
>>> [ 1662.924353]  __fput+0xdf/0x240
>>> [ 1662.924652]  task_work_run+0x67/0xa0
>>> [ 1662.924992]  do_exit+0x343/0xb50
>>> [ 1662.925302]  do_group_exit+0x2d/0x80
>>> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
>>> [ 1662.926039]  do_syscall_64+0x31/0x80
>>>
>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>>> While this _normally_ is not a problem, in our case we have the problem
>>> that the filedescriptor is associated with a socket (ie struct socket).
>>> And that one is shared between kernel code and userland.
>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>>
>>> 	if (!sock->file) {
>>> 		iput(SOCK_INODE(sock));
>>> 		return;
>>> 	}
>>> 	sock->file = NULL;
>>>
>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the
>>> !sock->file condition and crash in iput().
>>>
>>> I have been testing various things here, not calling fput, calling get_file() etc, but either
>>> hit a crash or kmemleak complaining about
>>> the file not being freed.
>>>
>>> Have you seen similar issues?
>>
>> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>>
>>
>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>> on the socket seems to be a quite dangerous concept.
>>
>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the
>> same as a user space close(2) ?
>>
>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint
>> and not trigger a sock_release().
>>
>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release.
>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
> 
> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
> I tried it once with the close /before/ doing the handshake, and once
> with the close /after/. I don't see any kernel issues on my server
> system when a client attempts to handshake with it.
> 
> Based on what I've read here, the missing get_file() theory seems
> plausible.
> 
Hmm. Let's see.
I must say the lifetime rules for 'struct file' (or 'struct socket') are 
still somewhat beyond me.
Currently I have to do a

sock_alloc_file()

_and_

get_file()

before calling tls_client_hello_psk() to avoid a crash.
One does wonder why; typically the refcount is initialized to '1',
so the extra 'get_file()' shouldn't be required.

But what do I know ...

I see to get you instructions for my testbed.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fput() considered harmful
  2023-03-08 16:04     ` Hannes Reinecke
@ 2023-03-08 16:51       ` Chuck Lever III
  2023-03-08 17:33       ` Hannes Reinecke
  1 sibling, 0 replies; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 16:51 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: kernel-tls-handshake



> On Mar 8, 2023, at 11:04 AM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 3/8/23 16:40, Chuck Lever III wrote:
>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>> 
>>>> Hi Chuck,
>>>> 
>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>> 
>>>> [ 1662.912887] ------------[ cut here ]------------
>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G       E      6.2.0-rc4-54-default+ #231
>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>>>> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>>>> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>>>> [ 1662.923568] Call Trace:
>>>> [ 1662.923801]  <TASK>
>>>> [ 1662.924006]  __dentry_kill+0xca/0x160
>>>> [ 1662.924353]  __fput+0xdf/0x240
>>>> [ 1662.924652]  task_work_run+0x67/0xa0
>>>> [ 1662.924992]  do_exit+0x343/0xb50
>>>> [ 1662.925302]  do_group_exit+0x2d/0x80
>>>> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
>>>> [ 1662.926039]  do_syscall_64+0x31/0x80
>>>> 
>>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>>>> While this _normally_ is not a problem, in our case we have the problem
>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>> And that one is shared between kernel code and userland.
>>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>>> 
>>>> 	if (!sock->file) {
>>>> 		iput(SOCK_INODE(sock));
>>>> 		return;
>>>> 	}
>>>> 	sock->file = NULL;
>>>> 
>>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the
>>>> !sock->file condition and crash in iput().
>>>> 
>>>> I have been testing various things here, not calling fput, calling get_file() etc, but either
>>>> hit a crash or kmemleak complaining about
>>>> the file not being freed.
>>>> 
>>>> Have you seen similar issues?
>>> 
>>> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>>> 
>>> 
>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>> on the socket seems to be a quite dangerous concept.
>>> 
>>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the
>>> same as a user space close(2) ?
>>> 
>>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint
>>> and not trigger a sock_release().
>>> 
>>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release.
>>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>> I tried it once with the close /before/ doing the handshake, and once
>> with the close /after/. I don't see any kernel issues on my server
>> system when a client attempts to handshake with it.
>> Based on what I've read here, the missing get_file() theory seems
>> plausible.
> Hmm. Let's see.
> I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me.
> Currently I have to do a
> 
> sock_alloc_file()
> 
> _and_
> 
> get_file()
> 
> before calling tls_client_hello_psk() to avoid a crash.
> One does wonder why; typically the refcount is initialized to '1',
> so the extra 'get_file()' shouldn't be required.

Especially since handshake_dup() is supposed to do that
one for you. Could be someone else is doing an unwanted
fput()?


--
Chuck Lever



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fput() considered harmful
  2023-03-08 16:04     ` Hannes Reinecke
  2023-03-08 16:51       ` Chuck Lever III
@ 2023-03-08 17:33       ` Hannes Reinecke
  2023-03-08 17:53         ` Chuck Lever III
  1 sibling, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2023-03-08 17:33 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: kernel-tls-handshake

On 3/8/23 17:04, Hannes Reinecke wrote:
> On 3/8/23 16:40, Chuck Lever III wrote:
>>
>>
>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> 
>>> wrote:
>>>
>>>
>>>
>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> Hi Chuck,
>>>>
>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>>
>>>> [ 1662.912887] ------------[ cut here ]------------
>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: 
>>>> G       E      6.2.0-rc4-54-default+ #231
>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
>>>> BIOS 0.0.0 02/06/2015
>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 
>>>> 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 
>>>> 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 
>>>> 0000000000000064
>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: 
>>>> ffff953bcbb23b00
>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: 
>>>> ffff953bcbb23b00
>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: 
>>>> ffff953bcb9d91d8
>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: 
>>>> ffff953b83918cb8
>>>> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) 
>>>> knlGS:0000000000000000
>>>> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 
>>>> 0000000000350ee0
>>>> [ 1662.923568] Call Trace:
>>>> [ 1662.923801]  <TASK>
>>>> [ 1662.924006]  __dentry_kill+0xca/0x160
>>>> [ 1662.924353]  __fput+0xdf/0x240
>>>> [ 1662.924652]  task_work_run+0x67/0xa0
>>>> [ 1662.924992]  do_exit+0x343/0xb50
>>>> [ 1662.925302]  do_group_exit+0x2d/0x80
>>>> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
>>>> [ 1662.926039]  do_syscall_64+0x31/0x80
>>>>
>>>> Key problem seems to be userspace here, which (apparently) is 
>>>> calling close() on the fd which we've passed from the kernel.
>>>> While this _normally_ is not a problem, in our case we have the problem
>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>> And that one is shared between kernel code and userland.
>>>> And doesn't have any refcounting whatsoever; socket_release() 
>>>> removes the reference to the underlying file:
>>>>
>>>>     if (!sock->file) {
>>>>         iput(SOCK_INODE(sock));
>>>>         return;
>>>>     }
>>>>     sock->file = NULL;
>>>>
>>>> ie if _another_ process (the kernel driver, say) is calling 
>>>> 'close()', too, it'll run into the
>>>> !sock->file condition and crash in iput().
>>>>
>>>> I have been testing various things here, not calling fput, calling 
>>>> get_file() etc, but either
>>>> hit a crash or kmemleak complaining about
>>>> the file not being freed.
>>>>
>>>> Have you seen similar issues?
>>>
>>> I have, and they all vanished with the introduction of handshake_dup 
>>> and haven't returned.
>>>
>>>
>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>> on the socket seems to be a quite dangerous concept.
>>>
>>> - The kernel "closes" that file descriptor anyway when the process 
>>> exits. Is that not the
>>> same as a user space close(2) ?
>>>
>>> - The endpoint was created via dup, so calling close on it should 
>>> merely remove that endpoint
>>> and not trigger a sock_release().
>>>
>>> AIUI the reference counting of the struct file * is what prevents 
>>> user space from calling ->release.
>>> So maybe NVMe needs another get_file() somewhere before it calls 
>>> tls_server_hello_psk().
>>
>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>> I tried it once with the close /before/ doing the handshake, and once
>> with the close /after/. I don't see any kernel issues on my server
>> system when a client attempts to handshake with it.
>>
>> Based on what I've read here, the missing get_file() theory seems
>> plausible.
>>
> Hmm. Let's see.
> I must say the lifetime rules for 'struct file' (or 'struct socket') are 
> still somewhat beyond me.
> Currently I have to do a
> 
> sock_alloc_file()
> 
> _and_
> 
> get_file()
> 
> before calling tls_client_hello_psk() to avoid a crash.
> One does wonder why; typically the refcount is initialized to '1',
> so the extra 'get_file()' shouldn't be required.
> 
> But what do I know ...
> Oh, grand.

sock_alloc_file()
get_file()
tls_client_hello_psk()

gives a memleak for the socket/file.

sock_alloc_file()
get_file()
tls_client_hello_psk()
fput()

crashes in handshake_nl_accept_doit()->fput().

sock_alloc_file()
tls_client_hello_psk()

crashes with
kernel BUG at fs/inode.c:1763!
on tlshd sys_exit().

Guess the only way of safely handling this is to
keep the socket file around until tlshd has
exited.

Sadly that's the _one_ thing I cannot do; I've got a
on-shot process 'nvme connect' which tries to establish
the connection and then exits.
And exist calls 'sock_release()' which will forcibly
remove the socket file.
I could try to disconnect the socket file before exit, but
that is ugly as hell.

Any ideas?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fput() considered harmful
  2023-03-08 17:33       ` Hannes Reinecke
@ 2023-03-08 17:53         ` Chuck Lever III
  0 siblings, 0 replies; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 17:53 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: kernel-tls-handshake



> On Mar 8, 2023, at 12:33 PM, Hannes Reinecke <hare@suse.de> wrote:
> 
> On 3/8/23 17:04, Hannes Reinecke wrote:
>> On 3/8/23 16:40, Chuck Lever III wrote:
>>> 
>>> 
>>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>> 
>>>>> Hi Chuck,
>>>>> 
>>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>>> 
>>>>> [ 1662.912887] ------------[ cut here ]------------
>>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G       E      6.2.0-rc4-54-default+ #231
>>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>>>>> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>>>>> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>>>>> [ 1662.923568] Call Trace:
>>>>> [ 1662.923801]  <TASK>
>>>>> [ 1662.924006]  __dentry_kill+0xca/0x160
>>>>> [ 1662.924353]  __fput+0xdf/0x240
>>>>> [ 1662.924652]  task_work_run+0x67/0xa0
>>>>> [ 1662.924992]  do_exit+0x343/0xb50
>>>>> [ 1662.925302]  do_group_exit+0x2d/0x80
>>>>> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
>>>>> [ 1662.926039]  do_syscall_64+0x31/0x80
>>>>> 
>>>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>>>>> While this _normally_ is not a problem, in our case we have the problem
>>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>>> And that one is shared between kernel code and userland.
>>>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>>>> 
>>>>>     if (!sock->file) {
>>>>>         iput(SOCK_INODE(sock));
>>>>>         return;
>>>>>     }
>>>>>     sock->file = NULL;
>>>>> 
>>>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the
>>>>> !sock->file condition and crash in iput().
>>>>> 
>>>>> I have been testing various things here, not calling fput, calling get_file() etc, but either
>>>>> hit a crash or kmemleak complaining about
>>>>> the file not being freed.
>>>>> 
>>>>> Have you seen similar issues?
>>>> 
>>>> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>>>> 
>>>> 
>>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>>> on the socket seems to be a quite dangerous concept.
>>>> 
>>>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the
>>>> same as a user space close(2) ?
>>>> 
>>>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint
>>>> and not trigger a sock_release().
>>>> 
>>>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release.
>>>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
>>> 
>>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>>> I tried it once with the close /before/ doing the handshake, and once
>>> with the close /after/. I don't see any kernel issues on my server
>>> system when a client attempts to handshake with it.
>>> 
>>> Based on what I've read here, the missing get_file() theory seems
>>> plausible.
>>> 
>> Hmm. Let's see.
>> I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me.
>> Currently I have to do a
>> sock_alloc_file()
>> _and_
>> get_file()
>> before calling tls_client_hello_psk() to avoid a crash.
>> One does wonder why; typically the refcount is initialized to '1',
>> so the extra 'get_file()' shouldn't be required.
>> But what do I know ...
>> Oh, grand.
> 
> sock_alloc_file()
> get_file()
> tls_client_hello_psk()
> 
> gives a memleak for the socket/file.
> 
> sock_alloc_file()
> get_file()
> tls_client_hello_psk()
> fput()
> 
> crashes in handshake_nl_accept_doit()->fput().
> 
> sock_alloc_file()
> tls_client_hello_psk()
> 
> crashes with
> kernel BUG at fs/inode.c:1763!
> on tlshd sys_exit().
> 
> Guess the only way of safely handling this is to
> keep the socket file around until tlshd has
> exited.
> 
> Sadly that's the _one_ thing I cannot do; I've got a
> on-shot process 'nvme connect' which tries to establish
> the connection and then exits.

SunRPC client does much the same: a kworker is spun up to
perform the connection, then goes away once the connection
has been established or rejected.


> And exist calls 'sock_release()' which will forcibly
> remove the socket file.

I think sockfd_put() might be better. That way the process
that connects can increment the socket's reference count
to prevent it from vanishing.


> I could try to disconnect the socket file before exit, but
> that is ugly as hell.
> 
> Any ideas?



--
Chuck Lever



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-03-08 17:54 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-08 13:24 fput() considered harmful Hannes Reinecke
2023-03-08 13:43 ` Chuck Lever III
2023-03-08 15:40   ` Chuck Lever III
2023-03-08 16:04     ` Hannes Reinecke
2023-03-08 16:51       ` Chuck Lever III
2023-03-08 17:33       ` Hannes Reinecke
2023-03-08 17:53         ` Chuck Lever III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).