From: Al Viro <viro@zeniv.linux.org.uk>
To: Hao Luo <haoluo@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>, KP Singh <kpsingh@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Joe Burton <jevburton.kernel@gmail.com>,
	Tejun Heo <tj@kernel.org>,
	joshdon@google.com, sdf@google.com, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
Date: Sat, 12 Mar 2022 03:46:37 +0000
Message-ID: <YiwXnSGf9Nb79wnm@zeniv-ca.linux.org.uk>
In-Reply-To: <20220225234339.2386398-2-haoluo@google.com>

On Fri, Feb 25, 2022 at 03:43:31PM -0800, Hao Luo wrote:
> This patch allows bpf_syscall prog to perform some basic filesystem
> operations: create, remove directories and unlink files. Three bpf
> helpers are added for this purpose. When combined with the following
> patches that allow pinning and getting bpf objects from bpf prog,
> this feature can be used to create a directory hierarchy in bpffs that
> helps manage bpf objects purely using bpf progs.
> 
> The added helpers are subject to the same permission checks as their
> syscall versions. For example, one cannot write to a read-only file
> system; the identity of the current process is checked to see whether
> it has sufficient permission to perform the operations.
> 
> Only directories and files in bpffs can be created or removed by these
> helpers. But it won't be too hard to allow these helpers to operate
> on files in other filesystems, if we want.

In which contexts can those be called?

> +BPF_CALL_2(bpf_rmdir, const char *, pathname, int, pathname_sz)
> +{
> +	struct user_namespace *mnt_userns;
> +	struct path parent;
> +	struct dentry *dentry;
> +	int err;
> +
> +	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
> +		return -EINVAL;
> +
> +	err = kern_path(pathname, 0, &parent);
> +	if (err)
> +		return err;
> +
> +	if (!bpf_path_is_bpf_dir(&parent)) {
> +		err = -EPERM;
> +		goto exit1;
> +	}
> +
> +	err = mnt_want_write(parent.mnt);
> +	if (err)
> +		goto exit1;
> +
> +	dentry = kern_path_locked(pathname, &parent);

This can't be right.  Ever.  There is no promise whatsoever
that these two lookups will resolve to the same place.

> +BPF_CALL_2(bpf_unlink, const char *, pathname, int, pathname_sz)
> +{
> +	struct user_namespace *mnt_userns;
> +	struct path parent;
> +	struct dentry *dentry;
> +	struct inode *inode = NULL;
> +	int err;
> +
> +	if (pathname_sz <= 1 || pathname[pathname_sz - 1])
> +		return -EINVAL;
> +
> +	err = kern_path(pathname, 0, &parent);
> +	if (err)
> +		return err;
> +
> +	err = mnt_want_write(parent.mnt);
> +	if (err)
> +		goto exit1;
> +
> +	dentry = kern_path_locked(pathname, &parent);
> +	if (IS_ERR(dentry)) {
> +		err = PTR_ERR(dentry);
> +		goto exit2;
> +	}

Ditto.  NAK; if you want to poke into fs/namei.c guts, do it right.
Or at least discuss that on fsdevel.  As it is, it's completely broken.
It's racy *and* it blatantly leaks both vfsmount and dentry references.

NAKed-by: Al Viro <viro@zeniv.linux.org.uk>