BPF Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Daniel Xu <dxu@dxuuu.xyz>
Cc: bpf@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, jolsa@kernel.org, hannes@cmpxchg.org,
Subject: Re: [RFC bpf-next 1/1] bpf: Introduce iter_pagecache
Date: Fri, 9 Apr 2021 08:11:54 +1000
Message-ID: <20210408221154.GL1990290@dread.disaster.area> (raw)
In-Reply-To: <22bededbd502e0df45326a54b3056941de65a101.1617831474.git.dxu@dxuuu.xyz>

On Wed, Apr 07, 2021 at 02:46:11PM -0700, Daniel Xu wrote:
> This commit introduces the bpf page cache iterator. This iterator allows
> users to run a bpf prog against each page in the "page cache".
> Internally, the "page cache" is extremely tied to VFS superblock + inode
> combo. Because of this, iter_pagecache will only examine pages in the
> caller's mount namespace.

No, it does not just examine pages with in the callers mount
namespace, because ....

> +static struct inode *goto_next_inode(struct bpf_iter_seq_pagecache_info *info)
> +{
> +	struct inode *prev_inode = info->cur_inode;
> +	struct inode *inode;
> +
> +retry:
> +	BUG_ON(!info->cur_sb);
> +	spin_lock(&info->cur_sb->s_inode_list_lock);
> +
> +	if (!info->cur_inode) {
> +		list_for_each_entry(inode, &info->cur_sb->s_inodes, i_sb_list) {

... this is an "all inodes on the superblock" walk.  This will also
iterate inodes in other mount namespaces that point to the same

IOWs, if you have different parts of the same filesystem mounted
into hundreds of container mount namespaces, this script will not
just iterate the local mount name space, it will iterate every inode
in every mount namespace.

And, of course, if the same files are mounted into multiple
containers (think read-only bind mounts using idmapping) then you
have zero indication of which container is actually using them, just
that there are hundreds of paths to the same inode. And every
container will appear to be using exactly the same amount of page cache.

IOWs, the stats this generates provide no insight into page cache
usage across mount namespaces in many situations, and it leaks
information about page cache usage across mount namespace

And that's before I say "iterating all inodes in a superblock is
bad" because it causes lock contention and interrupts normal usage.
We avoid s_inodes lists walks as much as we possibly can, and the
last thing we want is for userspace to be able to trivially
instigate long running walks of the s_inodes list. Remember, we can
have hundreds of millions of inodes on this list....

> +			spin_lock(&inode->i_lock);
> +			if (inode_unusual(inode)) {
> +				spin_unlock(&inode->i_lock);
> +				continue;
> +			}
> +			__iget(inode);
> +			spin_unlock(&inode->i_lock);

This can spin long enough to trigger livelock warnings. Even if it's
not held that long, it can cause unexpected long tail latencies in
memory reclaim and inode instantiation. Every s_inodes list walk has
cond_resched() built into it now....

> +	info->ns = current->nsproxy->mnt_ns;
> +	get_mnt_ns(info->ns);
> +	INIT_RADIX_TREE(&info->superblocks, GFP_KERNEL);
> +
> +	spin_lock(&info->ns->ns_lock);
> +	list_for_each_entry(mnt, &info->ns->list, mnt_list) {
> +		sb = mnt->mnt.mnt_sb;
> +
> +		/* The same mount may be mounted in multiple places */
> +		if (radix_tree_lookup(&info->superblocks, (unsigned long)sb))
> +			continue;
> +
> +		err = radix_tree_insert(&info->superblocks,
> +				        (unsigned long)sb, (void *)1);

And just because nobody has pointed it out yet: radix_tree_insert()
will do GFP_KERNEL memory allocations inside the spinlock being held


You said that you didn't take the "walk the LRUs" approach because
walking superblocks "seemed simpler". It's not. Page cache residency
and accounting is managed by memcgs, not by mount namespaces.

That is, containers usually have a memcg associated with them to control
memory usage of the container. The page cache used by a container is
accounted directly to the memcg, and memory reclaim can find all the
file-backed page cache pages associated with a memcg very quickly
(via mem_cgroup_lruvec()).  This will find pages associated directly
with the memcg, so it gives you a fairly accurate picture of the
page cache usage within the container.

This has none of the issues that arise from "sb != mnt_ns" that
walking superblocks and inode lists have, and it doesn't require you
to play games with mounts, superblocks and inode references....


Dave Chinner

  parent reply index

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-07 21:46 [RFC bpf-next 0/1] bpf: Add page cache iterator Daniel Xu
2021-04-07 21:46 ` [RFC bpf-next 1/1] bpf: Introduce iter_pagecache Daniel Xu
2021-04-08  6:14   ` Matthew Wilcox
2021-04-08 19:48     ` Daniel Xu
2021-04-08 21:29       ` Matthew Wilcox
2021-04-08  8:19   ` Christian Brauner
2021-04-08 20:44     ` Daniel Xu
2021-04-08 16:45   ` Al Viro
2021-04-08 20:49     ` Daniel Xu
2021-04-08 21:04       ` Al Viro
2021-04-08 22:11   ` Dave Chinner [this message]
2021-04-08  7:51 ` [RFC bpf-next 0/1] bpf: Add page cache iterator Christian Brauner
2021-04-08 16:08   ` Daniel Xu
2021-04-08 21:33 ` Shakeel Butt
2021-04-08 23:13 ` Darrick J. Wong
2021-04-09  0:24   ` Daniel Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210408221154.GL1990290@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=bpf@vger.kernel.org \
    --cc=dxu@dxuuu.xyz \
    --cc=hannes@cmpxchg.org \
    --cc=jolsa@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yhs@fb.com \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git