* fput() considered harmful @ 2023-03-08 13:24 Hannes Reinecke 2023-03-08 13:43 ` Chuck Lever III 0 siblings, 1 reply; 7+ messages in thread From: Hannes Reinecke @ 2023-03-08 13:24 UTC (permalink / raw) To: Chuck Lever; +Cc: kernel-tls-handshake Hi Chuck, I'm playing around with v6 and (again) facing really nasty crashes: [ 1662.912887] ------------[ cut here ]------------ [ 1662.913399] kernel BUG at fs/inode.c:1763! [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231 [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 1662.915932] RIP: 0010:iput+0x1d/0x20 [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064 [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00 [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00 [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8 [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8 [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000 [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0 [ 1662.923568] Call Trace: [ 1662.923801] <TASK> [ 1662.924006] __dentry_kill+0xca/0x160 [ 1662.924353] __fput+0xdf/0x240 [ 1662.924652] task_work_run+0x67/0xa0 [ 1662.924992] do_exit+0x343/0xb50 [ 1662.925302] do_group_exit+0x2d/0x80 [ 1662.925644] __x64_sys_exit_group+0x14/0x20 [ 1662.926039] do_syscall_64+0x31/0x80 Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel. While this _normally_ is not a problem, in our case we have the problem that the filedescriptor is associated with a socket (ie struct socket). And that one is shared between kernel code and userland. And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file: if (!sock->file) { iput(SOCK_INODE(sock)); return; } sock->file = NULL; ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput(). I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about the file not being freed. Have you seen similar issues? And even if you haven't: relying on userspace _not_ to call 'close()' on the socket seems to be a quite dangerous concept. Maybe we need to implement a 'socket_clone()' functionality? Cheers, Hannes ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful 2023-03-08 13:24 fput() considered harmful Hannes Reinecke @ 2023-03-08 13:43 ` Chuck Lever III 2023-03-08 15:40 ` Chuck Lever III 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever III @ 2023-03-08 13:43 UTC (permalink / raw) To: Hannes Reinecke; +Cc: kernel-tls-handshake > On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote: > > Hi Chuck, > > I'm playing around with v6 and (again) facing really nasty crashes: > > [ 1662.912887] ------------[ cut here ]------------ > [ 1662.913399] kernel BUG at fs/inode.c:1763! > [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI > [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231 > [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 > [ 1662.915932] RIP: 0010:iput+0x1d/0x20 > [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 > 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 > [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 > [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064 > [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00 > [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00 > [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8 > [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8 > [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000 > [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0 > [ 1662.923568] Call Trace: > [ 1662.923801] <TASK> > [ 1662.924006] __dentry_kill+0xca/0x160 > [ 1662.924353] __fput+0xdf/0x240 > [ 1662.924652] task_work_run+0x67/0xa0 > [ 1662.924992] do_exit+0x343/0xb50 > [ 1662.925302] do_group_exit+0x2d/0x80 > [ 1662.925644] __x64_sys_exit_group+0x14/0x20 > [ 1662.926039] do_syscall_64+0x31/0x80 > > Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel. > While this _normally_ is not a problem, in our case we have the problem > that the filedescriptor is associated with a socket (ie struct socket). > And that one is shared between kernel code and userland. > And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file: > > if (!sock->file) { > iput(SOCK_INODE(sock)); > return; > } > sock->file = NULL; > > ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput(). > > I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about > the file not being freed. > > Have you seen similar issues? I have, and they all vanished with the introduction of handshake_dup and haven't returned. > And even if you haven't: relying on userspace _not_ to call 'close()' > on the socket seems to be a quite dangerous concept. - The kernel "closes" that file descriptor anyway when the process exits. Is that not the same as a user space close(2) ? - The endpoint was created via dup, so calling close on it should merely remove that endpoint and not trigger a sock_release(). AIUI the reference counting of the struct file * is what prevents user space from calling ->release. So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk(). > Maybe we need to implement a 'socket_clone()' functionality? The alternative is to create a unique set of file_ops for such shared sockets, but that quickly becomes painful. I was thinking of trying to duplicate your set up here to help diagnose. Are there instructions you like to follow for setting up a small NVMe/TCP test bed? -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful 2023-03-08 13:43 ` Chuck Lever III @ 2023-03-08 15:40 ` Chuck Lever III 2023-03-08 16:04 ` Hannes Reinecke 0 siblings, 1 reply; 7+ messages in thread From: Chuck Lever III @ 2023-03-08 15:40 UTC (permalink / raw) To: Hannes Reinecke; +Cc: kernel-tls-handshake > On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote: > > > >> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote: >> >> Hi Chuck, >> >> I'm playing around with v6 and (again) facing really nasty crashes: >> >> [ 1662.912887] ------------[ cut here ]------------ >> [ 1662.913399] kernel BUG at fs/inode.c:1763! >> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231 >> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >> [ 1662.915932] RIP: 0010:iput+0x1d/0x20 >> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 >> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 >> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 >> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064 >> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00 >> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00 >> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8 >> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8 >> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000 >> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0 >> [ 1662.923568] Call Trace: >> [ 1662.923801] <TASK> >> [ 1662.924006] __dentry_kill+0xca/0x160 >> [ 1662.924353] __fput+0xdf/0x240 >> [ 1662.924652] task_work_run+0x67/0xa0 >> [ 1662.924992] do_exit+0x343/0xb50 >> [ 1662.925302] do_group_exit+0x2d/0x80 >> [ 1662.925644] __x64_sys_exit_group+0x14/0x20 >> [ 1662.926039] do_syscall_64+0x31/0x80 >> >> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel. >> While this _normally_ is not a problem, in our case we have the problem >> that the filedescriptor is associated with a socket (ie struct socket). >> And that one is shared between kernel code and userland. >> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file: >> >> if (!sock->file) { >> iput(SOCK_INODE(sock)); >> return; >> } >> sock->file = NULL; >> >> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the !sock->file condition and crash in iput(). >> >> I have been testing various things here, not calling fput, calling get_file() etc, but either hit a crash or kmemleak complaining about >> the file not being freed. >> >> Have you seen similar issues? > > I have, and they all vanished with the introduction of handshake_dup and haven't returned. > > >> And even if you haven't: relying on userspace _not_ to call 'close()' >> on the socket seems to be a quite dangerous concept. > > - The kernel "closes" that file descriptor anyway when the process exits. Is that not the same as a user space close(2) ? > > - The endpoint was created via dup, so calling close on it should merely remove that endpoint and not trigger a sock_release(). > > AIUI the reference counting of the struct file * is what prevents user space from calling ->release. So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk(). I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c. I tried it once with the close /before/ doing the handshake, and once with the close /after/. I don't see any kernel issues on my server system when a client attempts to handshake with it. Based on what I've read here, the missing get_file() theory seems plausible. >> Maybe we need to implement a 'socket_clone()' functionality? > > The alternative is to create a unique set of file_ops for such shared sockets, but that quickly becomes painful. > > I was thinking of trying to duplicate your set up here to help diagnose. Are there instructions you like to follow for setting up a small NVMe/TCP test bed? -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful 2023-03-08 15:40 ` Chuck Lever III @ 2023-03-08 16:04 ` Hannes Reinecke 2023-03-08 16:51 ` Chuck Lever III 2023-03-08 17:33 ` Hannes Reinecke 0 siblings, 2 replies; 7+ messages in thread From: Hannes Reinecke @ 2023-03-08 16:04 UTC (permalink / raw) To: Chuck Lever III; +Cc: kernel-tls-handshake On 3/8/23 16:40, Chuck Lever III wrote: > > >> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote: >> >> >> >>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote: >>> >>> Hi Chuck, >>> >>> I'm playing around with v6 and (again) facing really nasty crashes: >>> >>> [ 1662.912887] ------------[ cut here ]------------ >>> [ 1662.913399] kernel BUG at fs/inode.c:1763! >>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231 >>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20 >>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 >>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 >>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 >>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064 >>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00 >>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00 >>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8 >>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8 >>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000 >>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0 >>> [ 1662.923568] Call Trace: >>> [ 1662.923801] <TASK> >>> [ 1662.924006] __dentry_kill+0xca/0x160 >>> [ 1662.924353] __fput+0xdf/0x240 >>> [ 1662.924652] task_work_run+0x67/0xa0 >>> [ 1662.924992] do_exit+0x343/0xb50 >>> [ 1662.925302] do_group_exit+0x2d/0x80 >>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20 >>> [ 1662.926039] do_syscall_64+0x31/0x80 >>> >>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel. >>> While this _normally_ is not a problem, in our case we have the problem >>> that the filedescriptor is associated with a socket (ie struct socket). >>> And that one is shared between kernel code and userland. >>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file: >>> >>> if (!sock->file) { >>> iput(SOCK_INODE(sock)); >>> return; >>> } >>> sock->file = NULL; >>> >>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the >>> !sock->file condition and crash in iput(). >>> >>> I have been testing various things here, not calling fput, calling get_file() etc, but either >>> hit a crash or kmemleak complaining about >>> the file not being freed. >>> >>> Have you seen similar issues? >> >> I have, and they all vanished with the introduction of handshake_dup and haven't returned. >> >> >>> And even if you haven't: relying on userspace _not_ to call 'close()' >>> on the socket seems to be a quite dangerous concept. >> >> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the >> same as a user space close(2) ? >> >> - The endpoint was created via dup, so calling close on it should merely remove that endpoint >> and not trigger a sock_release(). >> >> AIUI the reference counting of the struct file * is what prevents user space from calling ->release. >> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk(). > > I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c. > I tried it once with the close /before/ doing the handshake, and once > with the close /after/. I don't see any kernel issues on my server > system when a client attempts to handshake with it. > > Based on what I've read here, the missing get_file() theory seems > plausible. > Hmm. Let's see. I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me. Currently I have to do a sock_alloc_file() _and_ get_file() before calling tls_client_hello_psk() to avoid a crash. One does wonder why; typically the refcount is initialized to '1', so the extra 'get_file()' shouldn't be required. But what do I know ... I see to get you instructions for my testbed. Cheers, Hannes ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful 2023-03-08 16:04 ` Hannes Reinecke @ 2023-03-08 16:51 ` Chuck Lever III 2023-03-08 17:33 ` Hannes Reinecke 1 sibling, 0 replies; 7+ messages in thread From: Chuck Lever III @ 2023-03-08 16:51 UTC (permalink / raw) To: Hannes Reinecke; +Cc: kernel-tls-handshake > On Mar 8, 2023, at 11:04 AM, Hannes Reinecke <hare@suse.de> wrote: > > On 3/8/23 16:40, Chuck Lever III wrote: >>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote: >>> >>> >>> >>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote: >>>> >>>> Hi Chuck, >>>> >>>> I'm playing around with v6 and (again) facing really nasty crashes: >>>> >>>> [ 1662.912887] ------------[ cut here ]------------ >>>> [ 1662.913399] kernel BUG at fs/inode.c:1763! >>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231 >>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20 >>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 >>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 >>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 >>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064 >>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00 >>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00 >>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8 >>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8 >>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000 >>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0 >>>> [ 1662.923568] Call Trace: >>>> [ 1662.923801] <TASK> >>>> [ 1662.924006] __dentry_kill+0xca/0x160 >>>> [ 1662.924353] __fput+0xdf/0x240 >>>> [ 1662.924652] task_work_run+0x67/0xa0 >>>> [ 1662.924992] do_exit+0x343/0xb50 >>>> [ 1662.925302] do_group_exit+0x2d/0x80 >>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20 >>>> [ 1662.926039] do_syscall_64+0x31/0x80 >>>> >>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel. >>>> While this _normally_ is not a problem, in our case we have the problem >>>> that the filedescriptor is associated with a socket (ie struct socket). >>>> And that one is shared between kernel code and userland. >>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file: >>>> >>>> if (!sock->file) { >>>> iput(SOCK_INODE(sock)); >>>> return; >>>> } >>>> sock->file = NULL; >>>> >>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the >>>> !sock->file condition and crash in iput(). >>>> >>>> I have been testing various things here, not calling fput, calling get_file() etc, but either >>>> hit a crash or kmemleak complaining about >>>> the file not being freed. >>>> >>>> Have you seen similar issues? >>> >>> I have, and they all vanished with the introduction of handshake_dup and haven't returned. >>> >>> >>>> And even if you haven't: relying on userspace _not_ to call 'close()' >>>> on the socket seems to be a quite dangerous concept. >>> >>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the >>> same as a user space close(2) ? >>> >>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint >>> and not trigger a sock_release(). >>> >>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release. >>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk(). >> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c. >> I tried it once with the close /before/ doing the handshake, and once >> with the close /after/. I don't see any kernel issues on my server >> system when a client attempts to handshake with it. >> Based on what I've read here, the missing get_file() theory seems >> plausible. > Hmm. Let's see. > I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me. > Currently I have to do a > > sock_alloc_file() > > _and_ > > get_file() > > before calling tls_client_hello_psk() to avoid a crash. > One does wonder why; typically the refcount is initialized to '1', > so the extra 'get_file()' shouldn't be required. Especially since handshake_dup() is supposed to do that one for you. Could be someone else is doing an unwanted fput()? -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful 2023-03-08 16:04 ` Hannes Reinecke 2023-03-08 16:51 ` Chuck Lever III @ 2023-03-08 17:33 ` Hannes Reinecke 2023-03-08 17:53 ` Chuck Lever III 1 sibling, 1 reply; 7+ messages in thread From: Hannes Reinecke @ 2023-03-08 17:33 UTC (permalink / raw) To: Chuck Lever III; +Cc: kernel-tls-handshake On 3/8/23 17:04, Hannes Reinecke wrote: > On 3/8/23 16:40, Chuck Lever III wrote: >> >> >>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> >>> wrote: >>> >>> >>> >>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote: >>>> >>>> Hi Chuck, >>>> >>>> I'm playing around with v6 and (again) facing really nasty crashes: >>>> >>>> [ 1662.912887] ------------[ cut here ]------------ >>>> [ 1662.913399] kernel BUG at fs/inode.c:1763! >>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: >>>> G E 6.2.0-rc4-54-default+ #231 >>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), >>>> BIOS 0.0.0 02/06/2015 >>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20 >>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 >>>> 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 >>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 >>>> 90 90 90 90 90 90 90 90 90 0f 1f 44 >>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 >>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: >>>> 0000000000000064 >>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: >>>> ffff953bcbb23b00 >>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: >>>> ffff953bcbb23b00 >>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: >>>> ffff953bcb9d91d8 >>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: >>>> ffff953b83918cb8 >>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) >>>> knlGS:0000000000000000 >>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: >>>> 0000000000350ee0 >>>> [ 1662.923568] Call Trace: >>>> [ 1662.923801] <TASK> >>>> [ 1662.924006] __dentry_kill+0xca/0x160 >>>> [ 1662.924353] __fput+0xdf/0x240 >>>> [ 1662.924652] task_work_run+0x67/0xa0 >>>> [ 1662.924992] do_exit+0x343/0xb50 >>>> [ 1662.925302] do_group_exit+0x2d/0x80 >>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20 >>>> [ 1662.926039] do_syscall_64+0x31/0x80 >>>> >>>> Key problem seems to be userspace here, which (apparently) is >>>> calling close() on the fd which we've passed from the kernel. >>>> While this _normally_ is not a problem, in our case we have the problem >>>> that the filedescriptor is associated with a socket (ie struct socket). >>>> And that one is shared between kernel code and userland. >>>> And doesn't have any refcounting whatsoever; socket_release() >>>> removes the reference to the underlying file: >>>> >>>> if (!sock->file) { >>>> iput(SOCK_INODE(sock)); >>>> return; >>>> } >>>> sock->file = NULL; >>>> >>>> ie if _another_ process (the kernel driver, say) is calling >>>> 'close()', too, it'll run into the >>>> !sock->file condition and crash in iput(). >>>> >>>> I have been testing various things here, not calling fput, calling >>>> get_file() etc, but either >>>> hit a crash or kmemleak complaining about >>>> the file not being freed. >>>> >>>> Have you seen similar issues? >>> >>> I have, and they all vanished with the introduction of handshake_dup >>> and haven't returned. >>> >>> >>>> And even if you haven't: relying on userspace _not_ to call 'close()' >>>> on the socket seems to be a quite dangerous concept. >>> >>> - The kernel "closes" that file descriptor anyway when the process >>> exits. Is that not the >>> same as a user space close(2) ? >>> >>> - The endpoint was created via dup, so calling close on it should >>> merely remove that endpoint >>> and not trigger a sock_release(). >>> >>> AIUI the reference counting of the struct file * is what prevents >>> user space from calling ->release. >>> So maybe NVMe needs another get_file() somewhere before it calls >>> tls_server_hello_psk(). >> >> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c. >> I tried it once with the close /before/ doing the handshake, and once >> with the close /after/. I don't see any kernel issues on my server >> system when a client attempts to handshake with it. >> >> Based on what I've read here, the missing get_file() theory seems >> plausible. >> > Hmm. Let's see. > I must say the lifetime rules for 'struct file' (or 'struct socket') are > still somewhat beyond me. > Currently I have to do a > > sock_alloc_file() > > _and_ > > get_file() > > before calling tls_client_hello_psk() to avoid a crash. > One does wonder why; typically the refcount is initialized to '1', > so the extra 'get_file()' shouldn't be required. > > But what do I know ... > Oh, grand. sock_alloc_file() get_file() tls_client_hello_psk() gives a memleak for the socket/file. sock_alloc_file() get_file() tls_client_hello_psk() fput() crashes in handshake_nl_accept_doit()->fput(). sock_alloc_file() tls_client_hello_psk() crashes with kernel BUG at fs/inode.c:1763! on tlshd sys_exit(). Guess the only way of safely handling this is to keep the socket file around until tlshd has exited. Sadly that's the _one_ thing I cannot do; I've got a on-shot process 'nvme connect' which tries to establish the connection and then exits. And exist calls 'sock_release()' which will forcibly remove the socket file. I could try to disconnect the socket file before exit, but that is ugly as hell. Any ideas? Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew Myers, Andrew McDonald, Martje Boudien Moerman ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: fput() considered harmful 2023-03-08 17:33 ` Hannes Reinecke @ 2023-03-08 17:53 ` Chuck Lever III 0 siblings, 0 replies; 7+ messages in thread From: Chuck Lever III @ 2023-03-08 17:53 UTC (permalink / raw) To: Hannes Reinecke; +Cc: kernel-tls-handshake > On Mar 8, 2023, at 12:33 PM, Hannes Reinecke <hare@suse.de> wrote: > > On 3/8/23 17:04, Hannes Reinecke wrote: >> On 3/8/23 16:40, Chuck Lever III wrote: >>> >>> >>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> wrote: >>>> >>>> >>>> >>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote: >>>>> >>>>> Hi Chuck, >>>>> >>>>> I'm playing around with v6 and (again) facing really nasty crashes: >>>>> >>>>> [ 1662.912887] ------------[ cut here ]------------ >>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763! >>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI >>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: G E 6.2.0-rc4-54-default+ #231 >>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 >>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20 >>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40 >>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 >>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202 >>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 0000000000000064 >>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: ffff953bcbb23b00 >>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: ffff953bcbb23b00 >>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: ffff953bcb9d91d8 >>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: ffff953b83918cb8 >>>>> [ 1662.921617] FS: 0000000000000000(0000) GS:ffff953bfb900000(0000) knlGS:0000000000000000 >>>>> [ 1662.922376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 0000000000350ee0 >>>>> [ 1662.923568] Call Trace: >>>>> [ 1662.923801] <TASK> >>>>> [ 1662.924006] __dentry_kill+0xca/0x160 >>>>> [ 1662.924353] __fput+0xdf/0x240 >>>>> [ 1662.924652] task_work_run+0x67/0xa0 >>>>> [ 1662.924992] do_exit+0x343/0xb50 >>>>> [ 1662.925302] do_group_exit+0x2d/0x80 >>>>> [ 1662.925644] __x64_sys_exit_group+0x14/0x20 >>>>> [ 1662.926039] do_syscall_64+0x31/0x80 >>>>> >>>>> Key problem seems to be userspace here, which (apparently) is calling close() on the fd which we've passed from the kernel. >>>>> While this _normally_ is not a problem, in our case we have the problem >>>>> that the filedescriptor is associated with a socket (ie struct socket). >>>>> And that one is shared between kernel code and userland. >>>>> And doesn't have any refcounting whatsoever; socket_release() removes the reference to the underlying file: >>>>> >>>>> if (!sock->file) { >>>>> iput(SOCK_INODE(sock)); >>>>> return; >>>>> } >>>>> sock->file = NULL; >>>>> >>>>> ie if _another_ process (the kernel driver, say) is calling 'close()', too, it'll run into the >>>>> !sock->file condition and crash in iput(). >>>>> >>>>> I have been testing various things here, not calling fput, calling get_file() etc, but either >>>>> hit a crash or kmemleak complaining about >>>>> the file not being freed. >>>>> >>>>> Have you seen similar issues? >>>> >>>> I have, and they all vanished with the introduction of handshake_dup and haven't returned. >>>> >>>> >>>>> And even if you haven't: relying on userspace _not_ to call 'close()' >>>>> on the socket seems to be a quite dangerous concept. >>>> >>>> - The kernel "closes" that file descriptor anyway when the process exits. Is that not the >>>> same as a user space close(2) ? >>>> >>>> - The endpoint was created via dup, so calling close on it should merely remove that endpoint >>>> and not trigger a sock_release(). >>>> >>>> AIUI the reference counting of the struct file * is what prevents user space from calling ->release. >>>> So maybe NVMe needs another get_file() somewhere before it calls tls_server_hello_psk(). >>> >>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c. >>> I tried it once with the close /before/ doing the handshake, and once >>> with the close /after/. I don't see any kernel issues on my server >>> system when a client attempts to handshake with it. >>> >>> Based on what I've read here, the missing get_file() theory seems >>> plausible. >>> >> Hmm. Let's see. >> I must say the lifetime rules for 'struct file' (or 'struct socket') are still somewhat beyond me. >> Currently I have to do a >> sock_alloc_file() >> _and_ >> get_file() >> before calling tls_client_hello_psk() to avoid a crash. >> One does wonder why; typically the refcount is initialized to '1', >> so the extra 'get_file()' shouldn't be required. >> But what do I know ... >> Oh, grand. > > sock_alloc_file() > get_file() > tls_client_hello_psk() > > gives a memleak for the socket/file. > > sock_alloc_file() > get_file() > tls_client_hello_psk() > fput() > > crashes in handshake_nl_accept_doit()->fput(). > > sock_alloc_file() > tls_client_hello_psk() > > crashes with > kernel BUG at fs/inode.c:1763! > on tlshd sys_exit(). > > Guess the only way of safely handling this is to > keep the socket file around until tlshd has > exited. > > Sadly that's the _one_ thing I cannot do; I've got a > on-shot process 'nvme connect' which tries to establish > the connection and then exits. SunRPC client does much the same: a kworker is spun up to perform the connection, then goes away once the connection has been established or rejected. > And exist calls 'sock_release()' which will forcibly > remove the socket file. I think sockfd_put() might be better. That way the process that connects can increment the socket's reference count to prevent it from vanishing. > I could try to disconnect the socket file before exit, but > that is ugly as hell. > > Any ideas? -- Chuck Lever ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-03-08 17:54 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-08 13:24 fput() considered harmful Hannes Reinecke 2023-03-08 13:43 ` Chuck Lever III 2023-03-08 15:40 ` Chuck Lever III 2023-03-08 16:04 ` Hannes Reinecke 2023-03-08 16:51 ` Chuck Lever III 2023-03-08 17:33 ` Hannes Reinecke 2023-03-08 17:53 ` Chuck Lever III
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).