Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache

From: Daniel Xu <dxu@dxuuu.xyz>
To: Christian Brauner <christian.brauner@ubuntu.com>
Cc: bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, jolsa@kernel.org, hannes@cmpxchg.org,
	yhs@fb.com, Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache
Date: Thu, 8 Apr 2021 13:44:10 -0700	[thread overview]
Message-ID: <20210408204410.wszz3rjmqbg4ps3q@dlxu-fedora-R90QNFJV> (raw)
In-Reply-To: <20210408081935.b3xollrzl6lejbyf@wittgenstein>

On Thu, Apr 08, 2021 at 10:19:35AM +0200, Christian Brauner wrote:
> On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> > This commit introduces the bpf page cache iterator. This iterator allows
> > users to run a bpf prog against each page in the "page cache".
> > Internally, the "page cache" is extremely tied to VFS superblock + inode
> > combo. Because of this, iter_pagecache will only examine pages in the
> > caller's mount namespace.
> > 
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >  kernel/bpf/Makefile         |   2 +-
> >  kernel/bpf/pagecache_iter.c | 293 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 294 insertions(+), 1 deletion(-)
> >  create mode 100644 kernel/bpf/pagecache_iter.c

<...>

> > 
> > +static int init_seq_pagecache(void *priv_data, struct bpf_iter_aux_info *aux)
> > +{
> > +	struct bpf_iter_seq_pagecache_info *info = priv_data;
> > +	struct radix_tree_iter iter;
> > +	struct super_block *sb;
> > +	struct mount *mnt;
> > +	void **slot;
> > +	int err;
> > +
> > +	info->ns = current->nsproxy->mnt_ns;
> > +	get_mnt_ns(info->ns);
> > +	INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> > +
> > +	spin_lock(&info->ns->ns_lock);
> > +	list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> 
> Not just are there helpers for taking ns_lock
> static inline void lock_ns_list(struct mnt_namespace *ns)
> static inline void unlock_ns_list(struct mnt_namespace *ns)
> they are private to fs/namespace.c because it's the only place that
> should ever walk this list.

Thanks for the hints. Would it be acceptable to add some helpers to
fs/namespace.c to allow walking the list?

IIUC the only way to find a list of mounts is by looking at the mount
namespace. And walking each mount and looking at each `struct
super_node`'s inode's `struct address_space` seemed like the cleanest
way to walkthe page cache.

> This seems buggy: why is it ok here to only take ns_lock and not also
> namespace_sem like mnt_already_visible() and __is_local_mountpoint()
> or the relevant proc iterators? I might be missing something.

Thanks for the hints. I'll take a closer look at the locking. Most
probably I didn't get it right.

I should have also mentioned in the cover letter that I'm fairly sure I
messed up the locking somewhere.

> 
> > +		sb = mnt->mnt.mnt_sb;
> > +
> > +		/* The same mount may be mounted in multiple places */
> > +		if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> > +			continue;
> > +
> > +		err = radix_tree_insert(&info->superblocks,
> > +				        (unsigned long)sb, (void *)1);
> > +		if (err)
> > +			goto out;
> > +	}
> > +
> > +	radix_tree_for_each_slot(slot, &info->superblocks, &iter, 0) {
> > +		sb = (struct super_block *)iter.index;
> > +		atomic_inc(&sb->s_active);
> 
> It also isn't nice that you mess with sb->s_active directly.
> 
> Imho, this is poking around in a lot of fs/ specific stuff that other
> parts of the kernel should not care about or have access to.

Re above: do you think it'd be appropriate to add more helpers to fs/ ?

<...>

Thanks,
Daniel