Re: [PATCH] eventfs: Have inodes have unique inode numbers

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	 LKML <linux-kernel@vger.kernel.org>,
	 Linux Trace Devel <linux-trace-devel@vger.kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	 Ajay Kaher <ajay.kaher@broadcom.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	 linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] eventfs: Have inodes have unique inode numbers
Date: Sun, 28 Jan 2024 12:53:31 -0800	[thread overview]
Message-ID: <CAHk-=whKJ6dzQJX27gvL4Xug5bFRKW7_Cx4XpngMKmWxOtb+Qg@mail.gmail.com> (raw)
In-Reply-To: <20240128151542.6efa2118@rorschach.local.home>

On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I have to understand how the dentry lookup works. Basically, when the
> ei gets deleted, it can't be freed until all dentries it references
> (including its children) are no longer being accessed. Does that lookup
> get called only when a dentry with the name doesn't already exist?

Dentry lookup gets called with the parent inode locked for reading, so
a lookup can happen in parallel with readdir and other dentry lookup.

BUT.

Each dentry is also "self-serializing", so you will never see a lookup
on the same name (in the same directory) concurrently.

The implementation is very much internal to the VFS layer, and it's
all kinds of nasty, with a new potential lookup waiting for the old
one, verifying that the old one is still usable, and maybe repeating
it all until we find a successful previous lookup or we're the only
dentry remaining.

It's nasty code that is very much in the "Al Viro" camp, but the point
is that any normal filesystem should treat lookups as being concurrent
with non-creation events, but not concurrently multiples.

There *is* some common code with "atomic_open()", where filesystems
that implement that then want to know if it's the *first* lookup, or a
use of a previously looked up dentry, and they'll use the
"d_in_lookup()" thing to determine that. So this whole "keep track of
which dentries are *currently* being looked up is actually exposed,
but any normal filesystem should never care.

But if you think you have that issue (tracefs does not), you really
want to talk to Al extensively.

> That is, can I assume that eventfs_root_lookup() is only called when
> the VFS file system could not find an existing dentry and it has to
> create one?

Correct. For any _particular_ name, you should think of lookup as serialized.

> If that's the case, then I can remove the ei->dentry and just add a ref
> counter that it was accessed. Then the final dput() should call
> eventfs_set_ei_status_free() (I hate that name and need to change it),
> and if the ei->is_freed is set, it can free the ei.

Note that the final 'dput()' will happen *after* the dentry has been
removed, so what can happen is

   lookup("name", d1);
   ... lookup successful, dentry is used ..
   ... dentry at some point has no more users ..
   ... memory pressure prunes unused dentries ..
   ... dentry gets unhashed and is no longer visible ..
   lookup("name", d2);
   ... new dentry is created ..
   final dput(d1);
   .. old dentry - that wasn't accessible any more is freed ..

and this is actually one of the *reasons* that virtual filesystems
must not try to cache dentry pointers in their internal data
structures. Because note how the fuilesystem saw the new lookup(d2) of
the same name *before* it saw the >d_release(d1) of the old dentry.

And the above is fundamental: we obviously cannot call '->d_release()'
until the old dentry is all dead and buried (lockref_mark_dead() etc),
so pretty much by definition you'll have that ordering being possible.

It's extremely unlikely, of course. I'll be you'll never hit it in testing.

So if if you associate some internal data structure with a dentry,
just *what* happens when you haven't been told abotu the old dentry
being dead when the new one happens?

See why I say that it's fundamentally wrong for a filesystem to try to
track dentries? All the operations that can use a dentry will get one
passed down to them by the VFS layer. The filesystem has no business
trying to remember some dentry from a previous operation, and the
filesystem *will* get it wrong.

But also note how refcounting works fine. In fact, refcounting is
pretty much the *only* thing that works fine. So what you *should* do
is

 - at lookup(), when you save your filesystem data in "->d_fsdata",
you increment a refcount

 - at ->d_release(), you decrement a refcount

and now you're fine. Yes, when the above (very very unlikely)
situation happens, you'll temporarily have a refcount incremented
twice, but that's kind of the *point* of refcounts.

Side note: this is pretty much true of any kernel data structure. If
you have a kernel data structure that isn't just used within one
thread, it must be refcounted. But it'as *doubly* true when you save
references to something that the VFS maintains, because you DO NOT
CONTROL the lifetime of that entity.

> The eventfs_remove_dir() can free the ei (after SRCU) if it has no
> references, otherwise it needs to wait for the final dput() to do the
> free.

Honestly, you should just *always* do refcounting. No "free after RCU
delay" as an alternative. Just refcount it.

Now, the RCU delay may be needed if the lookup of said structure
happens under RCU, but no, saying "I use SRCU to make sure the
lifetime is at least X" is just broken.

The refcount is what gives the lifetime. Any form of RCU-delaying
should then be purely about non-refcounting RCU lookups that may
happen as the thing is dying (and said lookup should *look* at the
refcount and say "oh, this is dead, I'm not returning this".

> I think the ei->dentry was required for the dir wrapper logic that we
> removed.

I think all of this was due to the bogus readdir that created dentries
willy-nilly and without the required serialization.

And that was all horribly broken.  It wasn't even the above kind of
"really subtle race that you'll never hit in practice" broken. It was
just absolutely broken with readdir and lookup racing on the same name
and creating an unholy dentry mess.

            Linus