* fput() considered harmful
@ 2023-03-08 13:24 Hannes Reinecke
2023-03-08 13:43 ` Chuck Lever III
0 siblings, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2023-03-08 13:24 UTC (permalink / raw)
To: Chuck Lever; +Cc: kernel-tls-handshake
Hi Chuck,
I'm playing around with v6 and (again) facing really nasty crashes:
[ 1662.912887] ------------[ cut here ]------------
[ 1662.913399] kernel BUG at fs/inode.c:1763!
[ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G
E 6.2.0-rc4-54-default+ #231
[ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
0.0.0 02/06/2015
[ 1662.915932] RIP: 0010:iput+0x1d/0x20
[ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00
00 48 85 ff 74 0e f6 87 98 00 00 00 40
75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 0f 1f 44
[ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
[ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX:
0000000000000064
[ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI:
ffff953bcbb23b00
[ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09:
ffff953bcbb23b00
[ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12:
ffff953bcb9d91d8
[ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15:
ffff953b83918cb8
[ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000)
knlGS:0000000000000000
[ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4:
0000000000350ee0
[ 1662.923568] Call Trace:
[ 1662.923801] <TASK>
[ 1662.924006] __dentry_kill+0xca/0x160
[ 1662.924353] __fput+0xdf/0x240
[ 1662.924652] task_work_run+0x67/0xa0
[ 1662.924992] do_exit+0x343/0xb50
[ 1662.925302] do_group_exit+0x2d/0x80
[ 1662.925644] __x64_sys_exit_group+0x14/0x20
[ 1662.926039] do_syscall_64+0x31/0x80
Key problem seems to be userspace here, which (apparently) is calling
close() on the fd which we've passed from the kernel.
While this _normally_ is not a problem, in our case we have the problem
that the filedescriptor is associated with a socket (ie struct socket).
And that one is shared between kernel code and userland.
And doesn't have any refcounting whatsoever; socket_release() removes
the reference to the underlying file:
if (!sock->file) {
iput(SOCK_INODE(sock));
return;
}
sock->file = NULL;
ie if _another_ process (the kernel driver, say) is calling 'close()',
too, it'll run into the !sock->file condition and crash in iput().
I have been testing various things here, not calling fput, calling
get_file() etc, but either hit a crash or kmemleak complaining about
the file not being freed.
Have you seen similar issues?
And even if you haven't: relying on userspace _not_ to call 'close()'
on the socket seems to be a quite dangerous concept.
Maybe we need to implement a 'socket_clone()' functionality?
Cheers,
Hannes
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful
2023-03-08 13:24 fput() considered harmful Hannes Reinecke
@ 2023-03-08 13:43 ` Chuck Lever III
2023-03-08 15:40 ` Chuck Lever III
0 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 13:43 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: kernel-tls-handshake
> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>
> Hi Chuck,
>
> I'm playing around with v6 and (again) facing really nasty crashes:
>
> [ 1662.912887] ------------[ cut here ]------------
> [ 1662.913399] kernel BUG at fs/inode.c:1763!
> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231
> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
> [ 1662.923568] Call Trace:
> [ 1662.923801] <TASK>
> [ 1662.924006] __dentry_kill+0xca/0x160
> [ 1662.924353] __fput+0xdf/0x240
> [ 1662.924652] task_work_run+0x67/0xa0
> [ 1662.924992] do_exit+0x343/0xb50
> [ 1662.925302] do_group_exit+0x2d/0x80
> [ 1662.925644] __x64_sys_exit_group+0x14/0x20
> [ 1662.926039] do_syscall_64+0x31/0x80
>
> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
> While this _normally_ is not a problem, in our case we have the problem
> that the filedescriptor is associated with a socket (ie struct socket).
> And that one is shared between kernel code and userland.
> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>
> if (!sock->file) {
> iput(SOCK_INODE(sock));
> return;
> }
> sock->file = NULL;
>
> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput().
>
> I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about
> the file not being freed.
>
> Have you seen similar issues?
I have, and they all vanished with the introduction of handshake_dup and haven't returned.
> And even if you haven't: relying on userspace _not_ to call 'close()'
> on the socket seems to be a quite dangerous concept.
- The kernel "closes" that file descriptor anyway when the process exits. Is that not the same as a user space close(2) ?
- The endpoint was created via dup, so calling close on it should merely remove that endpoint and not trigger a sock_release().
AIUI the reference counting of the struct file * is what prevents user space from calling ->release. So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
> Maybe we need to implement a 'socket_clone()' functionality?
The alternative is to create a unique set of file_ops for such shared sockets, but that quickly becomes painful.
I was thinking of trying to duplicate your set up here to help diagnose. Are there instructions you like to follow for setting up a small NVMe/TCP test bed?
--
Chuck Lever
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful
2023-03-08 13:43 ` Chuck Lever III
@ 2023-03-08 15:40 ` Chuck Lever III
2023-03-08 16:04 ` Hannes Reinecke
0 siblings, 1 reply; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 15:40 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: kernel-tls-handshake
> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>
>> Hi Chuck,
>>
>> I'm playing around with v6 and (again) facing really nasty crashes:
>>
>> [ 1662.912887] ------------[ cut here ]------------
>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231
>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>> [ 1662.923568] Call Trace:
>> [ 1662.923801] <TASK>
>> [ 1662.924006] __dentry_kill+0xca/0x160
>> [ 1662.924353] __fput+0xdf/0x240
>> [ 1662.924652] task_work_run+0x67/0xa0
>> [ 1662.924992] do_exit+0x343/0xb50
>> [ 1662.925302] do_group_exit+0x2d/0x80
>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20
>> [ 1662.926039] do_syscall_64+0x31/0x80
>>
>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>> While this _normally_ is not a problem, in our case we have the problem
>> that the filedescriptor is associated with a socket (ie struct socket).
>> And that one is shared between kernel code and userland.
>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>
>> if (!sock->file) {
>> iput(SOCK_INODE(sock));
>> return;
>> }
>> sock->file = NULL;
>>
>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput().
>>
>> I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about
>> the file not being freed.
>>
>> Have you seen similar issues?
>
> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>
>
>> And even if you haven't: relying on userspace _not_ to call 'close()'
>> on the socket seems to be a quite dangerous concept.
>
> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the same as a user space close(2) ?
>
> - The endpoint was created via dup, so calling close on it should merely remove that endpoint and not trigger a sock_release().
>
> AIUI the reference counting of the struct file * is what prevents user space from calling ->release. So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
I tried it once with the close /before/ doing the handshake, and once
with the close /after/. I don't see any kernel issues on my server
system when a client attempts to handshake with it.
Based on what I've read here, the missing get_file() theory seems
plausible.
>> Maybe we need to implement a 'socket_clone()' functionality?
>
> The alternative is to create a unique set of file_ops for such shared sockets, but that quickly becomes painful.
>
> I was thinking of trying to duplicate your set up here to help diagnose. Are there instructions you like to follow for setting up a small NVMe/TCP test bed?
--
Chuck Lever
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful
2023-03-08 15:40 ` Chuck Lever III
@ 2023-03-08 16:04 ` Hannes Reinecke
2023-03-08 16:51 ` Chuck Lever III
2023-03-08 17:33 ` Hannes Reinecke
0 siblings, 2 replies; 7+ messages in thread
From: Hannes Reinecke @ 2023-03-08 16:04 UTC (permalink / raw)
To: Chuck Lever III; +Cc: kernel-tls-handshake
On 3/8/23 16:40, Chuck Lever III wrote:
>
>
>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>
>>
>>
>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>
>>> Hi Chuck,
>>>
>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>
>>> [ 1662.912887] ------------[ cut here ]------------
>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231
>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>>> [ 1662.923568] Call Trace:
>>> [ 1662.923801] <TASK>
>>> [ 1662.924006] __dentry_kill+0xca/0x160
>>> [ 1662.924353] __fput+0xdf/0x240
>>> [ 1662.924652] task_work_run+0x67/0xa0
>>> [ 1662.924992] do_exit+0x343/0xb50
>>> [ 1662.925302] do_group_exit+0x2d/0x80
>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20
>>> [ 1662.926039] do_syscall_64+0x31/0x80
>>>
>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>>> While this _normally_ is not a problem, in our case we have the problem
>>> that the filedescriptor is associated with a socket (ie struct socket).
>>> And that one is shared between kernel code and userland.
>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>>
>>> if (!sock->file) {
>>> iput(SOCK_INODE(sock));
>>> return;
>>> }
>>> sock->file = NULL;
>>>
>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the
>>> !sock->file condition and crash in iput().
>>>
>>> I have been testing various things here, not calling fput, calling get_file() etc, but either
>>> hit a crash or kmemleak complaining about
>>> the file not being freed.
>>>
>>> Have you seen similar issues?
>>
>> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>>
>>
>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>> on the socket seems to be a quite dangerous concept.
>>
>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the
>> same as a user space close(2) ?
>>
>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint
>> and not trigger a sock_release().
>>
>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release.
>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
>
> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
> I tried it once with the close /before/ doing the handshake, and once
> with the close /after/. I don't see any kernel issues on my server
> system when a client attempts to handshake with it.
>
> Based on what I've read here, the missing get_file() theory seems
> plausible.
>
Hmm. Let's see.
I must say the lifetime rules for 'struct file' (or 'struct socket') are
still somewhat beyond me.
Currently I have to do a
sock_alloc_file()
_and_
get_file()
before calling tls_client_hello_psk() to avoid a crash.
One does wonder why; typically the refcount is initialized to '1',
so the extra 'get_file()' shouldn't be required.
But what do I know ...
I see to get you instructions for my testbed.
Cheers,
Hannes
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful
2023-03-08 16:04 ` Hannes Reinecke
@ 2023-03-08 16:51 ` Chuck Lever III
2023-03-08 17:33 ` Hannes Reinecke
1 sibling, 0 replies; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 16:51 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: kernel-tls-handshake
> On Mar 8, 2023, at 11:04 AM, Hannes Reinecke <hare@suse.de> wrote:
>
> On 3/8/23 16:40, Chuck Lever III wrote:
>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>
>>>
>>>
>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> Hi Chuck,
>>>>
>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>>
>>>> [ 1662.912887] ------------[ cut here ]------------
>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231
>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>>>> [ 1662.923568] Call Trace:
>>>> [ 1662.923801] <TASK>
>>>> [ 1662.924006] __dentry_kill+0xca/0x160
>>>> [ 1662.924353] __fput+0xdf/0x240
>>>> [ 1662.924652] task_work_run+0x67/0xa0
>>>> [ 1662.924992] do_exit+0x343/0xb50
>>>> [ 1662.925302] do_group_exit+0x2d/0x80
>>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20
>>>> [ 1662.926039] do_syscall_64+0x31/0x80
>>>>
>>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>>>> While this _normally_ is not a problem, in our case we have the problem
>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>> And that one is shared between kernel code and userland.
>>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>>>
>>>> if (!sock->file) {
>>>> iput(SOCK_INODE(sock));
>>>> return;
>>>> }
>>>> sock->file = NULL;
>>>>
>>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the
>>>> !sock->file condition and crash in iput().
>>>>
>>>> I have been testing various things here, not calling fput, calling get_file() etc, but either
>>>> hit a crash or kmemleak complaining about
>>>> the file not being freed.
>>>>
>>>> Have you seen similar issues?
>>>
>>> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>>>
>>>
>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>> on the socket seems to be a quite dangerous concept.
>>>
>>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the
>>> same as a user space close(2) ?
>>>
>>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint
>>> and not trigger a sock_release().
>>>
>>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release.
>>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>> I tried it once with the close /before/ doing the handshake, and once
>> with the close /after/. I don't see any kernel issues on my server
>> system when a client attempts to handshake with it.
>> Based on what I've read here, the missing get_file() theory seems
>> plausible.
> Hmm. Let's see.
> I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me.
> Currently I have to do a
>
> sock_alloc_file()
>
> _and_
>
> get_file()
>
> before calling tls_client_hello_psk() to avoid a crash.
> One does wonder why; typically the refcount is initialized to '1',
> so the extra 'get_file()' shouldn't be required.
Especially since handshake_dup() is supposed to do that
one for you. Could be someone else is doing an unwanted
fput()?
--
Chuck Lever
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful
2023-03-08 16:04 ` Hannes Reinecke
2023-03-08 16:51 ` Chuck Lever III
@ 2023-03-08 17:33 ` Hannes Reinecke
2023-03-08 17:53 ` Chuck Lever III
1 sibling, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2023-03-08 17:33 UTC (permalink / raw)
To: Chuck Lever III; +Cc: kernel-tls-handshake
On 3/8/23 17:04, Hannes Reinecke wrote:
> On 3/8/23 16:40, Chuck Lever III wrote:
>>
>>
>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com>
>>> wrote:
>>>
>>>
>>>
>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> Hi Chuck,
>>>>
>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>>
>>>> [ 1662.912887] ------------[ cut here ]------------
>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted:
>>>> G E 6.2.0-rc4-54-default+ #231
>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>>>> BIOS 0.0.0 02/06/2015
>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>> 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90
>>>> 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX:
>>>> 0000000000000064
>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI:
>>>> ffff953bcbb23b00
>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09:
>>>> ffff953bcbb23b00
>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12:
>>>> ffff953bcb9d91d8
>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15:
>>>> ffff953b83918cb8
>>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000)
>>>> knlGS:0000000000000000
>>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4:
>>>> 0000000000350ee0
>>>> [ 1662.923568] Call Trace:
>>>> [ 1662.923801] <TASK>
>>>> [ 1662.924006] __dentry_kill+0xca/0x160
>>>> [ 1662.924353] __fput+0xdf/0x240
>>>> [ 1662.924652] task_work_run+0x67/0xa0
>>>> [ 1662.924992] do_exit+0x343/0xb50
>>>> [ 1662.925302] do_group_exit+0x2d/0x80
>>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20
>>>> [ 1662.926039] do_syscall_64+0x31/0x80
>>>>
>>>> Key problem seems to be userspace here, which (apparently) is
>>>> calling close() on the fd which we've passed from the kernel.
>>>> While this _normally_ is not a problem, in our case we have the problem
>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>> And that one is shared between kernel code and userland.
>>>> And doesn't have any refcounting whatsoever; socket_release()
>>>> removes the reference to the underlying file:
>>>>
>>>> if (!sock->file) {
>>>> iput(SOCK_INODE(sock));
>>>> return;
>>>> }
>>>> sock->file = NULL;
>>>>
>>>> ie if _another_ process (the kernel driver, say) is calling
>>>> 'close()', too, it'll run into the
>>>> !sock->file condition and crash in iput().
>>>>
>>>> I have been testing various things here, not calling fput, calling
>>>> get_file() etc, but either
>>>> hit a crash or kmemleak complaining about
>>>> the file not being freed.
>>>>
>>>> Have you seen similar issues?
>>>
>>> I have, and they all vanished with the introduction of handshake_dup
>>> and haven't returned.
>>>
>>>
>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>> on the socket seems to be a quite dangerous concept.
>>>
>>> - The kernel "closes" that file descriptor anyway when the process
>>> exits. Is that not the
>>> same as a user space close(2) ?
>>>
>>> - The endpoint was created via dup, so calling close on it should
>>> merely remove that endpoint
>>> and not trigger a sock_release().
>>>
>>> AIUI the reference counting of the struct file * is what prevents
>>> user space from calling ->release.
>>> So maybe NVMe needs another get_file() somewhere before it calls
>>> tls_server_hello_psk().
>>
>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>> I tried it once with the close /before/ doing the handshake, and once
>> with the close /after/. I don't see any kernel issues on my server
>> system when a client attempts to handshake with it.
>>
>> Based on what I've read here, the missing get_file() theory seems
>> plausible.
>>
> Hmm. Let's see.
> I must say the lifetime rules for 'struct file' (or 'struct socket') are
> still somewhat beyond me.
> Currently I have to do a
>
> sock_alloc_file()
>
> _and_
>
> get_file()
>
> before calling tls_client_hello_psk() to avoid a crash.
> One does wonder why; typically the refcount is initialized to '1',
> so the extra 'get_file()' shouldn't be required.
>
> But what do I know ...
> Oh, grand.
sock_alloc_file()
get_file()
tls_client_hello_psk()
gives a memleak for the socket/file.
sock_alloc_file()
get_file()
tls_client_hello_psk()
fput()
crashes in handshake_nl_accept_doit()->fput().
sock_alloc_file()
tls_client_hello_psk()
crashes with
kernel BUG at fs/inode.c:1763!
on tlshd sys_exit().
Guess the only way of safely handling this is to
keep the socket file around until tlshd has
exited.
Sadly that's the _one_ thing I cannot do; I've got a
on-shot process 'nvme connect' which tries to establish
the connection and then exits.
And exist calls 'sock_release()' which will forcibly
remove the socket file.
I could try to disconnect the socket file before exit, but
that is ugly as hell.
Any ideas?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful
2023-03-08 17:33 ` Hannes Reinecke
@ 2023-03-08 17:53 ` Chuck Lever III
0 siblings, 0 replies; 7+ messages in thread
From: Chuck Lever III @ 2023-03-08 17:53 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: kernel-tls-handshake
> On Mar 8, 2023, at 12:33 PM, Hannes Reinecke <hare@suse.de> wrote:
>
> On 3/8/23 17:04, Hannes Reinecke wrote:
>> On 3/8/23 16:40, Chuck Lever III wrote:
>>>
>>>
>>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>
>>>>
>>>>
>>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>>
>>>>> Hi Chuck,
>>>>>
>>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>>>
>>>>> [ 1662.912887] ------------[ cut here ]------------
>>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231
>>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064
>>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00
>>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00
>>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8
>>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8
>>>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000
>>>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0
>>>>> [ 1662.923568] Call Trace:
>>>>> [ 1662.923801] <TASK>
>>>>> [ 1662.924006] __dentry_kill+0xca/0x160
>>>>> [ 1662.924353] __fput+0xdf/0x240
>>>>> [ 1662.924652] task_work_run+0x67/0xa0
>>>>> [ 1662.924992] do_exit+0x343/0xb50
>>>>> [ 1662.925302] do_group_exit+0x2d/0x80
>>>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20
>>>>> [ 1662.926039] do_syscall_64+0x31/0x80
>>>>>
>>>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel.
>>>>> While this _normally_ is not a problem, in our case we have the problem
>>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>>> And that one is shared between kernel code and userland.
>>>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file:
>>>>>
>>>>> if (!sock->file) {
>>>>> iput(SOCK_INODE(sock));
>>>>> return;
>>>>> }
>>>>> sock->file = NULL;
>>>>>
>>>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the
>>>>> !sock->file condition and crash in iput().
>>>>>
>>>>> I have been testing various things here, not calling fput, calling get_file() etc, but either
>>>>> hit a crash or kmemleak complaining about
>>>>> the file not being freed.
>>>>>
>>>>> Have you seen similar issues?
>>>>
>>>> I have, and they all vanished with the introduction of handshake_dup and haven't returned.
>>>>
>>>>
>>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>>> on the socket seems to be a quite dangerous concept.
>>>>
>>>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the
>>>> same as a user space close(2) ?
>>>>
>>>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint
>>>> and not trigger a sock_release().
>>>>
>>>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release.
>>>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk().
>>>
>>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>>> I tried it once with the close /before/ doing the handshake, and once
>>> with the close /after/. I don't see any kernel issues on my server
>>> system when a client attempts to handshake with it.
>>>
>>> Based on what I've read here, the missing get_file() theory seems
>>> plausible.
>>>
>> Hmm. Let's see.
>> I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me.
>> Currently I have to do a
>> sock_alloc_file()
>> _and_
>> get_file()
>> before calling tls_client_hello_psk() to avoid a crash.
>> One does wonder why; typically the refcount is initialized to '1',
>> so the extra 'get_file()' shouldn't be required.
>> But what do I know ...
>> Oh, grand.
>
> sock_alloc_file()
> get_file()
> tls_client_hello_psk()
>
> gives a memleak for the socket/file.
>
> sock_alloc_file()
> get_file()
> tls_client_hello_psk()
> fput()
>
> crashes in handshake_nl_accept_doit()->fput().
>
> sock_alloc_file()
> tls_client_hello_psk()
>
> crashes with
> kernel BUG at fs/inode.c:1763!
> on tlshd sys_exit().
>
> Guess the only way of safely handling this is to
> keep the socket file around until tlshd has
> exited.
>
> Sadly that's the _one_ thing I cannot do; I've got a
> on-shot process 'nvme connect' which tries to establish
> the connection and then exits.
SunRPC client does much the same: a kworker is spun up to
perform the connection, then goes away once the connection
has been established or rejected.
> And exist calls 'sock_release()' which will forcibly
> remove the socket file.
I think sockfd_put() might be better. That way the process
that connects can increment the socket's reference count
to prevent it from vanishing.
> I could try to disconnect the socket file before exit, but
> that is ugly as hell.
>
> Any ideas?
--
Chuck Lever
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-03-08 17:54 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-08 13:24 fput() considered harmful Hannes Reinecke
2023-03-08 13:43 ` Chuck Lever III
2023-03-08 15:40 ` Chuck Lever III
2023-03-08 16:04 ` Hannes Reinecke
2023-03-08 16:51 ` Chuck Lever III
2023-03-08 17:33 ` Hannes Reinecke
2023-03-08 17:53 ` Chuck Lever III
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).