kernel-tls-handshake.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: "kernel-tls-handshake@lists.linux.dev"
	<kernel-tls-handshake@lists.linux.dev>
Subject: Re: fput() considered harmful
Date: Wed, 8 Mar 2023 18:33:17 +0100	[thread overview]
Message-ID: <0ab549c7-d482-1b50-5630-c604e4c92179@suse.de> (raw)
In-Reply-To: <c519d883-5e08-0d28-dee8-89f1a7f197e3@suse.de>

On 3/8/23 17:04, Hannes Reinecke wrote:
> On 3/8/23 16:40, Chuck Lever III wrote:
>>
>>
>>> On Mar 8, 2023, at 8:43 AM, Chuck Lever III <chuck.lever@oracle.com> 
>>> wrote:
>>>
>>>
>>>
>>>> On Mar 8, 2023, at 8:24 AM, Hannes Reinecke <hare@suse.de> wrote:
>>>>
>>>> Hi Chuck,
>>>>
>>>> I'm playing around with v6 and (again) facing really nasty crashes:
>>>>
>>>> [ 1662.912887] ------------[ cut here ]------------
>>>> [ 1662.913399] kernel BUG at fs/inode.c:1763!
>>>> [ 1662.913822] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>>>> [ 1662.914336] CPU: 1 PID: 6494 Comm: tlshd Kdump: loaded Tainted: 
>>>> G       E      6.2.0-rc4-54-default+ #231
>>>> [ 1662.915235] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
>>>> BIOS 0.0.0 02/06/2015
>>>> [ 1662.915932] RIP: 0010:iput+0x1d/0x20
>>>> [ 1662.916275] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 
>>>> 00 00 48 85 ff 74 0e f6 87 98 00 00 00 40
>>>> 75 0a e9 18 fe ff ff e9 23 e3 7a 00 <0f> 0b 90 90 90 90 90 90 90 90 
>>>> 90 90 90 90 90 90 90 90 90 0f 1f 44
>>>> [ 1662.917921] RSP: 0018:ffffaeb600b1fe38 EFLAGS: 00010202
>>>> [ 1662.918423] RAX: 0000000000000000 RBX: ffff953bcb9d9180 RCX: 
>>>> 0000000000000064
>>>> [ 1662.919065] RDX: 0000000000000001 RSI: ffff953bcbb23c38 RDI: 
>>>> ffff953bcbb23b00
>>>> [ 1662.919703] RBP: 0000000000000000 R08: 0000000100053304 R09: 
>>>> ffff953bcbb23b00
>>>> [ 1662.920339] R10: ffff953bc5538b40 R11: 0000000000000003 R12: 
>>>> ffff953bcb9d91d8
>>>> [ 1662.920978] R13: ffff953bfd15ade0 R14: ffff953bcb9d9180 R15: 
>>>> ffff953b83918cb8
>>>> [ 1662.921617] FS:  0000000000000000(0000) GS:ffff953bfb900000(0000) 
>>>> knlGS:0000000000000000
>>>> [ 1662.922376] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 1662.922907] CR2: 00007fdc89a091a0 CR3: 000000004ad8c000 CR4: 
>>>> 0000000000350ee0
>>>> [ 1662.923568] Call Trace:
>>>> [ 1662.923801]  <TASK>
>>>> [ 1662.924006]  __dentry_kill+0xca/0x160
>>>> [ 1662.924353]  __fput+0xdf/0x240
>>>> [ 1662.924652]  task_work_run+0x67/0xa0
>>>> [ 1662.924992]  do_exit+0x343/0xb50
>>>> [ 1662.925302]  do_group_exit+0x2d/0x80
>>>> [ 1662.925644]  __x64_sys_exit_group+0x14/0x20
>>>> [ 1662.926039]  do_syscall_64+0x31/0x80
>>>>
>>>> Key problem seems to be userspace here, which (apparently) is 
>>>> calling close() on the fd which we've passed from the kernel.
>>>> While this _normally_ is not a problem, in our case we have the problem
>>>> that the filedescriptor is associated with a socket (ie struct socket).
>>>> And that one is shared between kernel code and userland.
>>>> And doesn't have any refcounting whatsoever; socket_release() 
>>>> removes the reference to the underlying file:
>>>>
>>>>     if (!sock->file) {
>>>>         iput(SOCK_INODE(sock));
>>>>         return;
>>>>     }
>>>>     sock->file = NULL;
>>>>
>>>> ie if _another_ process (the kernel driver, say) is calling 
>>>> 'close()', too, it'll run into the
>>>> !sock->file condition and crash in iput().
>>>>
>>>> I have been testing various things here, not calling fput, calling 
>>>> get_file() etc, but either
>>>> hit a crash or kmemleak complaining about
>>>> the file not being freed.
>>>>
>>>> Have you seen similar issues?
>>>
>>> I have, and they all vanished with the introduction of handshake_dup 
>>> and haven't returned.
>>>
>>>
>>>> And even if you haven't: relying on userspace _not_ to call 'close()'
>>>> on the socket seems to be a quite dangerous concept.
>>>
>>> - The kernel "closes" that file descriptor anyway when the process 
>>> exits. Is that not the
>>> same as a user space close(2) ?
>>>
>>> - The endpoint was created via dup, so calling close on it should 
>>> merely remove that endpoint
>>> and not trigger a sock_release().
>>>
>>> AIUI the reference counting of the struct file * is what prevents 
>>> user space from calling ->release.
>>> So maybe NVMe needs another get_file() somewhere before it calls 
>>> tls_server_hello_psk().
>>
>> I added an explicit "close(sockfd);return;" in src/tlshd/handshake.c.
>> I tried it once with the close /before/ doing the handshake, and once
>> with the close /after/. I don't see any kernel issues on my server
>> system when a client attempts to handshake with it.
>>
>> Based on what I've read here, the missing get_file() theory seems
>> plausible.
>>
> Hmm. Let's see.
> I must say the lifetime rules for 'struct file' (or 'struct socket') are 
> still somewhat beyond me.
> Currently I have to do a
> 
> sock_alloc_file()
> 
> _and_
> 
> get_file()
> 
> before calling tls_client_hello_psk() to avoid a crash.
> One does wonder why; typically the refcount is initialized to '1',
> so the extra 'get_file()' shouldn't be required.
> 
> But what do I know ...
> Oh, grand.

sock_alloc_file()
get_file()
tls_client_hello_psk()

gives a memleak for the socket/file.

sock_alloc_file()
get_file()
tls_client_hello_psk()
fput()

crashes in handshake_nl_accept_doit()->fput().

sock_alloc_file()
tls_client_hello_psk()

crashes with
kernel BUG at fs/inode.c:1763!
on tlshd sys_exit().

Guess the only way of safely handling this is to
keep the socket file around until tlshd has
exited.

Sadly that's the _one_ thing I cannot do; I've got a
on-shot process 'nvme connect' which tries to establish
the connection and then exits.
And exist calls 'sock_release()' which will forcibly
remove the socket file.
I could try to disconnect the socket file before exit, but
that is ugly as hell.

Any ideas?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


  parent reply	other threads:[~2023-03-08 17:33 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-08 13:24 fput() considered harmful Hannes Reinecke
2023-03-08 13:43 ` Chuck Lever III
2023-03-08 15:40   ` Chuck Lever III
2023-03-08 16:04     ` Hannes Reinecke
2023-03-08 16:51       ` Chuck Lever III
2023-03-08 17:33       ` Hannes Reinecke [this message]
2023-03-08 17:53         ` Chuck Lever III

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0ab549c7-d482-1b50-5630-c604e4c92179@suse.de \
    --to=hare@suse.de \
    --cc=chuck.lever@oracle.com \
    --cc=kernel-tls-handshake@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).