linux-fsdevel.vger.kernel.org archive mirror
From: Chris Down <chris@chrisdown.name>
To: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team@fb.com, Hugh Dickins <hughd@google.com>,
	Miklos Szeredi <miklos@szeredi.hu>,
	"zhengbin (A)" <zhengbin13@huawei.com>,
	Roman Gushchin <guro@fb.com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Mon, 23 Dec 2019 20:45:51 +0000	[thread overview]
Message-ID: <20191223204551.GA272672@chrisdown.name> (raw)
In-Reply-To: <20191220195025.GA9469@bombadil.infradead.org>

Matthew Wilcox writes:
>On Fri, Dec 20, 2019 at 07:35:38PM +0200, Amir Goldstein wrote:
>> On Fri, Dec 20, 2019 at 6:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>> >
>> > On Fri, Dec 20, 2019 at 03:41:11PM +0200, Amir Goldstein wrote:
>> > > Suggestion:
>> > > 1. Extend the kmem_cache API to let the ctor() know if it is
>> > > initializing an object
>> > >     for the first time (new page) or recycling an object.
>> >
>> > Uh, what?  The ctor is _only_ called when new pages are allocated.
>> > Part of the contract with the slab user is that objects are returned to
>> > the slab in an initialised state.
>>
>> Right. I mixed up the ctor() with alloc_inode().
>> So is there anything stopping us from reusing an existing non-zero
>> value of i_ino in shmem_get_inode() for recycling shmem ino
>> numbers?
>
>I think that would be an excellent solution to the problem!  At least,
>I can't think of any problems with it.

Thanks for the suggestions and feedback, Amir and Matthew :-)

The slab i_ino recycling approach helps somewhat, but is unfortunately neutered 
quite a lot by the fact that slab recycling is per-memcg. That is, replacing 
get_next_ino() with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few 
other trivial callsites only recycles about 10% of inode numbers, which doesn't 
really stem the bleeding of 32-bit inums on an affected workload:

     # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
         4454 recycle_or_get_next_ino: not recycled
          546 recycle_or_get_next_ino: recycled

Roman (who I've just added to cc) tells me that with CONFIG_MEMCG we currently 
only get per-memcg slab reuse rather than global reuse. That matters quite a 
bit here, since tasks spread across multiple cgroups are all contributing to 
the get_next_ino() thrash.

I think this is a good start, but with the current slab infrastructure, reuse 
alone isn't going to be enough to actually solve this problem. How about 
something like the following?

1. Add get_next_ino_full, which hands out numbers across the full width of 
    ino_t (rough sketch after this list)
2. Use get_next_ino_full in tmpfs (et al.)
3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can 
    pass if they want the 32-bit inode numbers back. This would still allow 
    people who want to make this tradeoff to use xino.
4. (If you like) Also add a CONFIG option to disable this at compile time.
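
For concreteness, here's roughly what I have in mind for get_next_ino_full: the 
same per-cpu batching scheme that get_next_ino() uses today, just widened to 
the full ino_t. This is only a sketch to show the shape of it -- the names and 
batch size are placeholders, not a final patch:

#define LAST_INO_FULL_BATCH 1024

static DEFINE_PER_CPU(ino_t, last_ino_full);

ino_t get_next_ino_full(void)
{
	ino_t *p = &get_cpu_var(last_ino_full);
	ino_t res = *p;

#ifdef CONFIG_SMP
	if (unlikely((res & (LAST_INO_FULL_BATCH - 1)) == 0)) {
		static atomic64_t shared_last_ino_full;
		u64 next = atomic64_add_return(LAST_INO_FULL_BATCH,
					       &shared_last_ino_full);

		res = next - LAST_INO_FULL_BATCH;
	}
#endif

	res++;
	/* Don't hand out inode number 0. */
	if (unlikely(!res))
		res = 1;
	*p = res;
	put_cpu_var(last_ino_full);
	return res;
}

With a 64-bit ino_t that can't realistically wrap, and the 32bit-inums option 
from point 3 would simply keep the current get_next_ino() behaviour for anyone 
who needs 32-bit inode numbers (for example to keep using xino).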

I'd appreciate your thoughts on that approach or others you have ideas about. 
Thanks! :-)

0:

unsigned int recycle_or_get_next_ino(ino_t old_ino)
{
	/*
	 * get_next_ino() returns unsigned int. If this fires, i_ino uses
	 * more than 32 bits and must have been changed after allocation,
	 * so the caller shouldn't be recycling inode numbers.
	 */
	WARN_ONCE((u64)old_ino > UINT_MAX,
		  "Recyclable i_ino uses more bits than unsigned int: %llu",
		  (u64)old_ino);

	if (old_ino) {
		/* Sample ~1% of calls to avoid flooding the trace buffer. */
		if (prandom_u32() % 100 == 0)
			trace_printk("recycled\n");
		return old_ino;
	} else {
		if (prandom_u32() % 100 == 0)
			trace_printk("not recycled\n");
		return get_next_ino();
	}
}
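
And for reference, the shmfs call-site change in the experiment above is 
roughly just the following (a sketch, not the exact diff), relying on the 
recycled slab object's old i_ino surviving new_inode() -- which the ~10% hit 
rate above suggests it sometimes does:

	/*
	 * In shmem_get_inode(), where we currently do
	 * inode->i_ino = get_next_ino():
	 */
	inode->i_ino = recycle_or_get_next_ino(inode->i_ino);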

Thread overview: 19+ messages
2019-12-20  2:49 [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit Chris Down
2019-12-20  3:05 ` zhengbin (A)
2019-12-20  8:32 ` Amir Goldstein
2019-12-20 12:16   ` Chris Down
2019-12-20 13:41     ` Amir Goldstein
2019-12-20 16:46       ` Matthew Wilcox
2019-12-20 17:35         ` Amir Goldstein
2019-12-20 19:50           ` Matthew Wilcox
2019-12-23 20:45             ` Chris Down [this message]
2019-12-24  3:04               ` Amir Goldstein
2019-12-25 12:54                 ` Chris Down
2019-12-26  1:40                   ` zhengbin (A)
2019-12-20 21:30 ` Darrick J. Wong
2019-12-21  8:43   ` Amir Goldstein
2019-12-21 18:05     ` Darrick J. Wong
2019-12-21 10:16   ` Chris Down
2020-01-07 17:35     ` J. Bruce Fields
2020-01-07 17:44       ` Chris Down
2020-01-08  3:00         ` J. Bruce Fields
