linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Al Viro <viro@zeniv.linux.org.uk>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	wugyuan@cn.ibm.com, jlayton@kernel.org, hsiangkao@aol.com,
	Jan Kara <jack@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
Date: Sat, 2 Nov 2019 18:08:42 +0000	[thread overview]
Message-ID: <20191102180842.GN26530@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20191102172229.GT20975@paulmck-ThinkPad-P72>

On Sat, Nov 02, 2019 at 10:22:29AM -0700, Paul E. McKenney wrote:
> Ignoring the possibility of the more exotic compiler optimizations, if
> the first task's load into f sees the value stored by the second task,
> then the pair of memory barriers guarantee that the first task's load
> into d will see the second task's store.

The question was about the load into i being also safe.

> In fact, you could instead say this in recent kernels:
> 
> 	f = READ_ONCE(fdt[n])  // provides dependency ordering via mb on Alpha
> 	mb

Er... that mb comes from expanded READ_ONCE(), actually - the call chain
is
	fdget_pos() -> __fdget_pos() -> __fdget() -> __fget_light() ->
	__fcheck_files(), either directly or via
			__fget() -> fcheck_files() -> __fcheck_files()
	rcu_dereference_raw() -> READ_ONCE() -> smp_read_barrier_depends()
which yields mb on alpha.
						
> 	d = f->f_path.dentry
> 	i = d->d_inode  // But this is OK only if ->f_path.entry is
> 			// constant throughout

Yes, it is - once you hold a reference to a positive dentry, it can't
be made negative by anybody else.  See d_delete() for details; basically,
if you have refcount > 1, dentry will be unhashed, but not made negative.

> The result of the first task's load into i requires information outside
> of the two code fragments.
> 
> Or am I missing your point?

My point is that barriers sufficient to guarantee visibility of *f in
the reader will also suffice to guarantee visibility of *f->f_path.dentry.

On alpha it boils down to having load of d->d_inode when opening the
file orders before the barrier prior to storing the reference to f
in the descriptor table, so if it observes the store to d->d_inode done
by the same CPU, that store is ordered before the barrier due to
processor instruction order constraints and if it observes the store
to d->d_inode done by some other CPU, that store is ordered before
the load and before the barrier by transitivity.  So in either case
that store is ordered before the store into descriptor table.
IOW, the reader that has enough barriers to guarantee seing ->f_path.dentry
will be guaranteed to see ->f_path.dentry->d_inode.

And yes, we will need some barriers added near some positivity checks in
pathname resolution - that's what has started the entire thread.  This
part ("any place looking at file->f_path.dentry will have ->d_inode and
mode bits of ->d_flags visible and stable") covers quite a few places
that come up in the analysis...

This morning catch, BTW:

    audit_get_nd(): don't unlock parent too early
    
    if the child has been negative and just went positive
    under us, we want coherent d_is_positive() and ->d_inode.
    Don't unlock the parent until we'd done that work...
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index 1f31c2f1e6fc..4508d5e0cf69 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -351,12 +351,12 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
        struct dentry *d = kern_path_locked(watch->path, parent);
        if (IS_ERR(d))
                return PTR_ERR(d);
-       inode_unlock(d_backing_inode(parent->dentry));
        if (d_is_positive(d)) {
                /* update watch filter fields */
                watch->dev = d->d_sb->s_dev;
                watch->ino = d_backing_inode(d)->i_ino;
        }
+       inode_unlock(d_backing_inode(parent->dentry));
        dput(d);
        return 0;
 }

For other fun bits and pieces see ceph bugs caught this week and crap in
dget_parent() (not posted yet).  The former had been ceph violating the
"turning a previously observable dentry positive requires exclusive lock
on parent" rule, the latter - genuine insufficient barriers in the fast
path of dget_parent().

It is converging to a reasonably small and understandable surface, actually,
most of that being in core pathname resolution.  Two big piles of nightmares
left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
the latter due to IMA/EVM/LSM insanity...

There's also some secondary stuff dropping out of that (e.g. ceph seeding
dcache on readdir and blindly unhashing dentries it sees stale instead of
doing d_invalidate() as it ought to - leads to fun results if you had
something mounted on a subdirectory that got removed/recreated on server),
but that's a separate pile of joy - doesn't affect this analysis, so
it'll have to be dealt with later.  It had been an interesting couple of
weeks...

  reply	other threads:[~2019-11-02 18:08 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-27  4:42 [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Ritesh Harjani
2019-10-15  4:07 ` Ritesh Harjani
2019-10-22 13:38   ` Ritesh Harjani
2019-10-22 14:37     ` Al Viro
2019-10-22 14:50       ` Al Viro
2019-10-22 20:11       ` Al Viro
2019-10-23 11:05         ` Ritesh Harjani
2019-11-01 23:46           ` Al Viro
2019-11-02  6:17             ` Al Viro
2019-11-02 17:24               ` Paul E. McKenney
2019-11-02 17:22             ` Paul E. McKenney
2019-11-02 18:08               ` Al Viro [this message]
2019-11-03 14:41                 ` Paul E. McKenney
2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
2019-11-03 18:20                   ` Al Viro
2019-11-03 18:51                     ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
2019-11-03 19:03                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
2019-11-13  7:01                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Amir Goldstein
2019-11-13 12:52                         ` Al Viro
2019-11-13 16:22                           ` Amir Goldstein
2019-11-13 20:18                           ` Jean-Louis Biasini
2019-11-03 17:05                 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
2019-11-09 16:55                   ` Linus Torvalds
2019-11-09 18:26                     ` Al Viro
2019-11-11  9:16                   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191102180842.GN26530@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=hsiangkao@aol.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=riteshh@linux.ibm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=wugyuan@cn.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).