Re: [PATCH] fuse: allow CAP_SYS_ADMIN in root userns to access allow_other mount

From: Dave Marchevsky <davemarchevsky@fb.com>
To: Christian Brauner <christian.brauner@ubuntu.com>
Cc: <linux-fsdevel@vger.kernel.org>,
	Miklos Szeredi <miklos@szeredi.hu>,
	Seth Forshee <sforshee@digitalocean.com>,
	Rik van Riel <riel@surriel.com>, <kernel-team@fb.com>
Subject: Re: [PATCH] fuse: allow CAP_SYS_ADMIN in root userns to access allow_other mount
Date: Fri, 12 Nov 2021 18:29:20 -0500	[thread overview]
Message-ID: <0515c3c8-c9e3-25dd-4b49-bb8e19c76f0d@fb.com> (raw)
In-Reply-To: <20211112101307.iqf3nhxgchf2u2i3@wittgenstein>

On 11/12/21 5:13 AM, Christian Brauner wrote:   
> On Thu, Nov 11, 2021 at 02:11:42PM -0800, Dave Marchevsky wrote:
>> Since commit 73f03c2b4b52 ("fuse: Restrict allow_other to the
>> superblock's namespace or a descendant"), access to allow_other FUSE
>> filesystems has been limited to users in the mounting user namespace or
>> descendants. This prevents a process that is privileged in its userns -
>> but not its parent namespaces - from mounting a FUSE fs w/ allow_other
>> that is accessible to processes in parent namespaces.
>>
>> While this restriction makes sense overall it breaks a legitimate
>> usecase for me. I have a tracing daemon which needs to peek into
>> process' open files in order to symbolicate - similar to 'perf'. The
>> daemon is a privileged process in the root userns, but is unable to peek
>> into FUSE filesystems mounted with allow_other by processes in child
>> namespaces.
>>
>> This patch adds an escape hatch to the descendant userns logic
>> specifically for processes with CAP_SYS_ADMIN in the root userns. Such
>> processes can already do many dangerous things regardless of namespace,
>> and moreover could fork and setns into any child userns with a FUSE
>> mount, so it's reasonable to allow them to interact with all allow_other
>> FUSE filesystems.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> Cc: Miklos Szeredi <miklos@szeredi.hu>
>> Cc: Seth Forshee <sforshee@digitalocean.com>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: kernel-team@fb.com
>> ---
> 
> If your tracing daemon runs in init_user_ns with CAP_SYS_ADMIN why can't
> it simply use a helper process/thread to
> setns(userns_fd/pidfd, CLONE_NEWUSER)
> to the target userns? This way we don't need to special-case
> init_user_ns at all.

helper process + setns could work for my usecase. But the fact that there's no
way to say "I know what I am about to do is potentially stupid and dangerous,
but I am root so let me do it", without spawning a helper process in this case,
feels like it'll result in special-case userspace workarounds for anyone doing
symbolication of backtraces.

e.g. perf will have to add some logic: "did I fail
to grab this exe that some process had mapped? Is it in a FUSE mounted by some
descendant userns? let's fork a helper process..." Not the end of the world,
but unnecessary complexity nonetheless.

That being said, I agree that this patch's special-casing of init_user_ns is
hacky. What do you think about a more explicit and general "let me do this
stupid and dangerous thing" mechanism - perhaps a new struct fuse_conn field
containing a set of exception userns', populated with ioctl or similar.

> 
>>
>> Note: I was unsure whether CAP_SYS_ADMIN or CAP_SYS_PTRACE was the best
>> choice of capability here. Went with the former as it's checked
>> elsewhere in fs/fuse while CAP_SYS_PTRACE isn't.
>>
>>  fs/fuse/dir.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 0654bfedcbb0..2524eeb0f35d 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -1134,7 +1134,7 @@ int fuse_allow_current_process(struct fuse_conn *fc)
>>  	const struct cred *cred;
>>  
>>  	if (fc->allow_other)
>> -		return current_in_userns(fc->user_ns);
>> +		return current_in_userns(fc->user_ns) || capable(CAP_SYS_ADMIN);
>>  
>>  	cred = current_cred();
>>  	if (uid_eq(cred->euid, fc->user_id) &&
>> -- 
>> 2.30.2
>>
>>