From: Amir Goldstein <amir73il@gmail.com>
To: Chris Down <chris@chrisdown.name>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team@fb.com, Hugh Dickins <hughd@google.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	"zhengbin (A)" <zhengbin13@huawei.com>,
	Roman Gushchin <guro@fb.com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Tue, 24 Dec 2019 05:04:11 +0200
Message-ID: <CAOQ4uxjm5JMvfbi4xa3yaDwuM+XpNOSDrbVsHvJtkms00ZBnAg@mail.gmail.com>
In-Reply-To: <20191223204551.GA272672@chrisdown.name>

> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
> callsites only leads to about 10% slab reuse, which doesn't really stem the
> bleeding of 32-bit inums on an affected workload:
>
>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>          4454 recycle_or_get_next_ino: not recycled
>           546 recycle_or_get_next_ino: recycled
>

Too bad.
Maybe inode number recycling should be implemented all the same, since it
is simple and may improve workloads that are not so memcg-intensive.
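
To be clear about what I mean, here is my guess at the shape of the
recycling helper -- a sketch only, not the actual code behind your [0]
footnote:

#include <linux/fs.h>	/* for get_next_ino() */

/* Sketch only: reuse the i_ino left behind in the slab object, if any. */
static unsigned int recycle_or_get_next_ino(unsigned int old_ino)
{
        /*
         * If the inode cache does not zero objects on free, a nonzero
         * old_ino means this object had a previous life and its number
         * can be handed straight back out.
         */
        if (old_ino)
                return old_ino;

        return get_next_ino();
}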

> Roman (who I've just added to cc) tells me that currently we only have
> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
> contributes fairly significantly here since there are multiple tasks across
> multiple cgroups which are contributing to the get_next_ino() thrash.
>
> I think this is a good start, but we need something of a different magnitude in
> order to actually solve this problem with the current slab infrastructure. How
> about something like the following?
>
> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
> 2. Use get_next_ino_full in tmpfs (et al.)

I would prefer that filesystems making heavy use of get_next_ino() be
converted to use a private ino pool per sb:

ino_pool_create()
ino_pool_get_next()

Flags to ino_pool_create() could determine the desired ino range.
Does the Facebook use case involve a single large tmpfs or many
small ones? I would guess the latter, and in that case we are trying to
solve a problem that nobody really needs solved (i.e. an efficient
global ino pool).
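
To make the proposal concrete, here is a bare sketch of the API I have
in mind. Everything below (names, flags, struct layout) is made up for
illustration; nothing like it exists in the tree:

#include <linux/atomic.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/types.h>

#define INO_POOL_32BIT	(1 << 0)	/* inos must fit in 32 bits */

struct ino_pool {
        atomic64_t      next_ino;
        unsigned int    flags;
};

struct ino_pool *ino_pool_create(unsigned int flags)
{
        struct ino_pool *pool = kzalloc(sizeof(*pool), GFP_KERNEL);

        if (!pool)
                return ERR_PTR(-ENOMEM);
        pool->flags = flags;
        atomic64_set(&pool->next_ino, 0);
        return pool;
}

ino_t ino_pool_get_next(struct ino_pool *pool)
{
        u64 ino;

        do {
                ino = atomic64_inc_return(&pool->next_ino);
                if (pool->flags & INO_POOL_32BIT)
                        ino = (u32)ino;
        } while (!ino);	/* 0 is reserved; skip it on wraparound */

        return ino;
}

The point is that with a pool per sb, a 32-bit range only wraps when a
single instance churns through 2^32 inodes, instead of every tmpfs on
the machine sharing one counter.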

> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>     pass if they want the 32-bit inode numbers back. This would still allow
>     people who want to make this tradeoff to use xino.

inode32|inode64 (see man xfs(5)).
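
i.e. something like this (hypothetical for tmpfs, just mirroring the
XFS semantics):

        mount -t tmpfs -o inode64 tmpfs /mnt/scratch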

> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>

I don't know about disabling it at compile time, but the default mode for
tmpfs (inode32|inode64) might be best determined by a CONFIG option, so
distro builders could decide whether to take the risk of breaking
applications on tmpfs.
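
A minimal sketch of what I mean, assuming a hypothetical
CONFIG_TMPFS_INODE64 knob (neither the symbol nor the macro below
exists today):

/* Compile-time default; inode32|inode64 mount options override per sb. */
#ifdef CONFIG_TMPFS_INODE64
#define SHMEM_INODE64_DEFAULT	true
#else
#define SHMEM_INODE64_DEFAULT	false
#endif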

But if you implement a per-sb ino pool, maybe inode64 will no longer be
required for your use case?

Thanks,
Amir.

Thread overview: 19+ messages
2019-12-20  2:49 [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit Chris Down
2019-12-20  3:05 ` zhengbin (A)
2019-12-20  8:32 ` Amir Goldstein
2019-12-20 12:16   ` Chris Down
2019-12-20 13:41     ` Amir Goldstein
2019-12-20 16:46       ` Matthew Wilcox
2019-12-20 17:35         ` Amir Goldstein
2019-12-20 19:50           ` Matthew Wilcox
2019-12-23 20:45             ` Chris Down
2019-12-24  3:04               ` Amir Goldstein [this message]
2019-12-25 12:54                 ` Chris Down
2019-12-26  1:40                   ` zhengbin (A)
2019-12-20 21:30 ` Darrick J. Wong
2019-12-21  8:43   ` Amir Goldstein
2019-12-21 18:05     ` Darrick J. Wong
2019-12-21 10:16   ` Chris Down
2020-01-07 17:35     ` J. Bruce Fields
2020-01-07 17:44       ` Chris Down
2020-01-08  3:00         ` J. Bruce Fields
