Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit

From: "zhengbin (A)" <zhengbin13@huawei.com>
To: Chris Down <chris@chrisdown.name>, Amir Goldstein <amir73il@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	"Jeff Layton" <jlayton@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>, <kernel-team@fb.com>,
	Hugh Dickins <hughd@google.com>,
	Miklos Szeredi <miklos@szeredi.hu>, Roman Gushchin <guro@fb.com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Thu, 26 Dec 2019 09:40:32 +0800	[thread overview]
Message-ID: <d2205e04-524f-6f7d-000f-7b45b06ed93a@huawei.com> (raw)
In-Reply-To: <20191225125448.GA309148@chrisdown.name>

On 2019/12/25 20:54, Chris Down wrote:
> Amir Goldstein writes:
>>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
>>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
>>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
>>> callsites only leads to about 10% slab reuse, which doesn't really stem the
>>> bleeding of 32-bit inums on an affected workload:
>>>
>>>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>>>          4454 recycle_or_get_next_ino: not recycled
>>>           546 recycle_or_get_next_ino: recycled
>>>
>>
>> Too bad..
>> Maybe recycled ino should be implemented all the same because it is simple
>> and may improve workloads that are not so MEMCG intensive.
>
> Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for this) since it's not a total solution for the problem, but still helps somewhat and we all seem to agree that it's overall an uncontroversial improvement.

Please cc me as well, thanks.

>
>>> Roman (who I've just added to cc) tells me that currently we only have
>>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>>> contributes fairly significantly here since there are multiple tasks across
>>> multiple cgroups which are contributing to the get_next_ino() thrash.
>>>
>>> I think this is a good start, but we need something of a different magnitude in
>>> order to actually solve this problem with the current slab infrastructure. How
>>> about something like the following?
>>>
>>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>>> 2. Use get_next_ino_full in tmpfs (et al.)
>>
>> I would prefer that filesystems making heavy use of get_next_ino, be converted
>> to use a private ino pool per sb:
>>
>> ino_pool_create()
>> ino_pool_get_next()
>>
>> flags to ino_pool_create() can determine the desired ino range.
>> Does the Facebook use case involve a single large tmpfs or many
>> small ones? I would guess the latter and therefore we are trying to solve
>> a problem that nobody really needs to solve (i.e. global efficient ino pool).
>
> Unfortunately in the case under discussion, it's all in one large tmpfs in /dev/shm. I can empathise with that -- application owners often prefer to use the mounts provided to them rather than having to set up their own. For this one case we can change that, but I think it seems reasonable to support this case since using a single tmpfs can be a reasonable decision as an application developer, especially if you only have unprivileged access to the system.
>
>>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>>>     pass if they want the 32-bit inode numbers back. This would still allow
>>>     people who want to make this tradeoff to use xino.
>>
>> inode32|inode64 (see man xfs(5)).
>
> Ah great, thanks! I'll reuse precedent from those.
>
>>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>>>
>>
>> I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
>> might me best determined by CONFIG option, so distro builders could decide
>> if they want to take the risk of breaking applications on tmpfs.
>
> Sounds good.
>
>> But if you implement per sb ino pool, maybe inode64 will no longer be
>> required for your use case?
>
> In this case I think per-sb ino pool will help a bit, but unfortunately not by an order of magnitude. As with the recycling patch this will help reduce thrash a bit but not conclusively prevent the problem from happening long-term. To fix that, I think we really do need the option to use ino_t-sized get_next_ino_full (or per-sb equivalent).
>
> Happy holidays, and thanks for your feedback!
>
> Chris
>
> .
>