* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
2019-11-02 18:08 ` Al Viro
@ 2019-11-03 14:41 ` Paul E. McKenney
2019-11-03 16:35 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2019-11-03 14:41 UTC (permalink / raw)
To: Al Viro
Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, wugyuan, jlayton,
hsiangkao, Jan Kara, Linus Torvalds
On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:
> On Sat, Nov 02, 2019 at 10:22:29AM -0700, Paul E. McKenney wrote:
> > Ignoring the possibility of the more exotic compiler optimizations, if
> > the first task's load into f sees the value stored by the second task,
> > then the pair of memory barriers guarantee that the first task's load
> > into d will see the second task's store.
>
> The question was about the load into i being also safe.
>
> > In fact, you could instead say this in recent kernels:
> >
> > f = READ_ONCE(fdt[n]) // provides dependency ordering via mb on Alpha
> > mb
>
> Er... that mb comes from expanded READ_ONCE(), actually - the call chain
> is
> fdget_pos() -> __fdget_pos() -> __fdget() -> __fget_light() ->
> __fcheck_files(), either directly or via
> __fget() -> fcheck_files() -> __fcheck_files()
> rcu_dereference_raw() -> READ_ONCE() -> smp_read_barrier_depends()
> which yields mb on alpha.
Exactly. But yes, my "provides dependency ordering via mb on Alpha"
comment was not as clearly stated as it should have been. Something about
my having dispensed with proofreading in order to catch my plane... :-/
> > d = f->f_path.dentry
> > i = d->d_inode // But this is OK only if ->f_path.entry is
> > // constant throughout
>
> Yes, it is - once you hold a reference to a positive dentry, it can't
> be made negative by anybody else. See d_delete() for details; basically,
> if you have refcount > 1, dentry will be unhashed, but not made negative.
Very good, then. At least assuming that you are double-checking the ordering
(as opposed to trying to figure out what is wrong with the ordering). ;-)
> > The result of the first task's load into i requires information outside
> > of the two code fragments.
> >
> > Or am I missing your point?
>
> My point is that barriers sufficient to guarantee visibility of *f in
> the reader will also suffice to guarantee visibility of *f->f_path.dentry.
>
> On alpha it boils down to having load of d->d_inode when opening the
> file orders before the barrier prior to storing the reference to f
> in the descriptor table, so if it observes the store to d->d_inode done
> by the same CPU, that store is ordered before the barrier due to
> processor instruction order constraints and if it observes the store
> to d->d_inode done by some other CPU, that store is ordered before
> the load and before the barrier by transitivity. So in either case
> that store is ordered before the store into descriptor table.
> IOW, the reader that has enough barriers to guarantee seing ->f_path.dentry
> will be guaranteed to see ->f_path.dentry->d_inode.
>
> And yes, we will need some barriers added near some positivity checks in
> pathname resolution - that's what has started the entire thread. This
> part ("any place looking at file->f_path.dentry will have ->d_inode and
> mode bits of ->d_flags visible and stable") covers quite a few places
> that come up in the analysis...
>
> This morning catch, BTW:
>
> audit_get_nd(): don't unlock parent too early
>
> if the child has been negative and just went positive
> under us, we want coherent d_is_positive() and ->d_inode.
> Don't unlock the parent until we'd done that work...
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
>
> diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> index 1f31c2f1e6fc..4508d5e0cf69 100644
> --- a/kernel/audit_watch.c
> +++ b/kernel/audit_watch.c
> @@ -351,12 +351,12 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
> struct dentry *d = kern_path_locked(watch->path, parent);
> if (IS_ERR(d))
> return PTR_ERR(d);
> - inode_unlock(d_backing_inode(parent->dentry));
> if (d_is_positive(d)) {
> /* update watch filter fields */
> watch->dev = d->d_sb->s_dev;
> watch->ino = d_backing_inode(d)->i_ino;
> }
> + inode_unlock(d_backing_inode(parent->dentry));
> dput(d);
> return 0;
> }
>
> For other fun bits and pieces see ceph bugs caught this week and crap in
> dget_parent() (not posted yet). The former had been ceph violating the
> "turning a previously observable dentry positive requires exclusive lock
> on parent" rule, the latter - genuine insufficient barriers in the fast
> path of dget_parent().
>
> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution. Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...
>
> There's also some secondary stuff dropping out of that (e.g. ceph seeding
> dcache on readdir and blindly unhashing dentries it sees stale instead of
> doing d_invalidate() as it ought to - leads to fun results if you had
> something mounted on a subdirectory that got removed/recreated on server),
> but that's a separate pile of joy - doesn't affect this analysis, so
> it'll have to be dealt with later. It had been an interesting couple of
> weeks...
Looks like you are making a very productive (if perhaps somewhat painful)
tour through the code, good show!
Thanx, Paul
^ permalink raw reply [flat|nested] 26+ messages in thread
* [RFC] lookup_one_len_unlocked() lousy calling conventions
2019-11-02 18:08 ` Al Viro
2019-11-03 14:41 ` Paul E. McKenney
@ 2019-11-03 16:35 ` Al Viro
2019-11-03 18:20 ` Al Viro
2019-11-03 17:05 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
2019-11-09 3:13 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
3 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2019-11-03 16:35 UTC (permalink / raw)
To: linux-fsdevel
Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
Jan Kara, Linus Torvalds
On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:
> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution. Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...
One thing found while digging through overlayfs (and responsible for several
remaining pieces from the assorted pile):
lookup_one_len_unlocked() calling conventions are wrong for its callers.
Namely, 11 out of 12 callers really want ERR_PTR(-ENOENT) on negatives.
Most of them take care to check, some rely upon that being impossible in
their case. Interactions with dentry turning positive right after
lookup_one_len_unlocked() has returned it are of varying bugginess...
The only exception is ecryptfs, where we do lookup in the underlying fs
on ecryptfs_lookup() and want to retain a negative dentry if we get one.
Not sure what's the best way to handle that - it looks like we want
a primitive with different behaviour on negatives, so the most conservative
variant would be to add such, leaving lookup_one_len_unlocked() use
for ecryptfs (and dealing with barriers properly in there), with
everybody else switching to lookup_positive_unlocked(), or whatever
we call that new primitive.
Other variants:
1) add a primitve with existing lookup_one_len_unlocked() behaviour,
changing lookup_one_len_unlocked() itselt to new calling conventions.
Switch ecryptfs to that, drop now-useless checks from other callers.
2) get rid of pinning negative underlying dentries in ecryptfs.
The cost will be doing lookup in underlying layer twice on something
like mkdir(2), the second one quite likely served out of dcache.
That would make _all_ callers of lookup_one_len_unlocked() need
the same calling conventions.
Preferences?
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC] lookup_one_len_unlocked() lousy calling conventions
2019-11-03 16:35 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
@ 2019-11-03 18:20 ` Al Viro
2019-11-03 18:51 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
0 siblings, 1 reply; 26+ messages in thread
From: Al Viro @ 2019-11-03 18:20 UTC (permalink / raw)
To: linux-fsdevel
Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
Jan Kara, Linus Torvalds, ecryptfs
On Sun, Nov 03, 2019 at 04:35:24PM +0000, Al Viro wrote:
> lookup_one_len_unlocked() calling conventions are wrong for its callers.
> Namely, 11 out of 12 callers really want ERR_PTR(-ENOENT) on negatives.
> Most of them take care to check, some rely upon that being impossible in
> their case. Interactions with dentry turning positive right after
> lookup_one_len_unlocked() has returned it are of varying bugginess...
>
> The only exception is ecryptfs, where we do lookup in the underlying fs
> on ecryptfs_lookup() and want to retain a negative dentry if we get one.
Looking at that code... the thing that deals with the result of lookup in
underlying fs is ecryptfs_lookup_interpose(), and there we have this:
struct inode *inode, *lower_inode = d_inode(lower_dentry);
...
dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
...
if (d_really_is_negative(lower_dentry)) {
/* We want to add because we couldn't find in lower */
d_add(dentry, NULL);
return NULL;
}
inode = __ecryptfs_get_inode(lower_inode, dentry->d_sb);
If lower dentry used to be negative, but went positive while we'd
been doing allocation, we'll get d_really_is_negative() (i.e.
!lower_dentry->d_inode) false, but lower_inode (fetched earlier)
still NULL. __ecryptfs_get_inode() starts with
if (lower_inode->i_sb != ecryptfs_superblock_to_lower(sb))
return ERR_PTR(-EXDEV);
which won't be happy in that situation... That has nothing to do
with barriers, ->d_flags, etc. - the window is rather wide here.
GFP_KERNEL allocation can block just fine.
IOW, the only caller of lookup_one_len_unlocked() that does not
reject negative dentries doesn't manage to handle them correctly ;-/
And then in the same ecryptfs_lookup_interpose() we have e.g.
fsstack_copy_attr_atime(d_inode(dentry->d_parent),
d_inode(lower_dentry->d_parent));
Now, dentry->d_parent is stable; dentry is guaranteed to be new
and not yet visible to anybody else, besides it's negative and
the parent is held shared, so it couldn't have been moved around
even if it had been seen by somebody else.
However, lower_dentry->d_parent is a different story. We are not holding
the lock on its parent anymore; it could've been moved around by somebody
mounting the underlying layer elsewhere and accessing it directly.
Moreover, there's nothing to guarantee that the pointer we fetch from
lower_dentry->d_parent won't be pointing to freed memory by the
time we get around to looking at its inode - lose the timeslice to
preemption just after fetching ->d_parent, have another process move
the damn thing around and there's nothing to keep the ex-parent
around by the time you regain CPU.
The problem goes all way back to addd65ad8d19 "eCryptfs: Filename Encryption:
filldir, lookup, and readlink" from 2009. That turned
lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
into
lower_dir_dentry = lower_dentry->d_parent;
and it had hit the fan...
Sure, "somebody mounted the underlying fs elsewhere and is actively
trying to screw us over" is not how ecryptfs is supposed to be used
and it can demonstrate unexpected behavior - odd errors, etc.
But that behaviour should not include oopsen and access to freed
kernel memory...
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
2019-11-03 18:20 ` Al Viro
@ 2019-11-03 18:51 ` Al Viro
2019-11-03 19:03 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
2019-11-13 7:01 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Amir Goldstein
0 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2019-11-03 18:51 UTC (permalink / raw)
To: linux-fsdevel
Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
Jan Kara, Linus Torvalds, ecryptfs
lower_dentry can't go from positive to negative (we have it pinned),
but it *can* go from negative to positive. So fetching ->d_inode
into a local variable, doing a blocking allocation, checking that
now ->d_inode is non-NULL and feeding the value we'd fetched
earlier to a function that won't accept NULL is not a good idea.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index a905d5f4f3b0..3c2298721359 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -319,7 +319,7 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode)
static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
struct dentry *lower_dentry)
{
- struct inode *inode, *lower_inode = d_inode(lower_dentry);
+ struct inode *inode, *lower_inode;
struct ecryptfs_dentry_info *dentry_info;
struct vfsmount *lower_mnt;
int rc = 0;
@@ -339,7 +339,15 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
dentry_info->lower_path.mnt = lower_mnt;
dentry_info->lower_path.dentry = lower_dentry;
- if (d_really_is_negative(lower_dentry)) {
+ /*
+ * negative dentry can go positive under us here - its parent is not
+ * locked. That's OK and that could happen just as we return from
+ * ecryptfs_lookup() anyway. Just need to be careful and fetch
+ * ->d_inode only once - it's not stable here.
+ */
+ lower_inode = READ_ONCE(lower_dentry->d_inode);
+
+ if (!lower_inode) {
/* We want to add because we couldn't find in lower */
d_add(dentry, NULL);
return NULL;
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
2019-11-03 18:51 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
@ 2019-11-03 19:03 ` Al Viro
2019-11-13 7:01 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Amir Goldstein
1 sibling, 0 replies; 26+ messages in thread
From: Al Viro @ 2019-11-03 19:03 UTC (permalink / raw)
To: linux-fsdevel
Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
Jan Kara, Linus Torvalds, ecryptfs
We need to get the underlying dentry of parent; sure, absent the races
it is the parent of underlying dentry, but there's nothing to prevent
losing a timeslice to preemtion in the middle of evaluation of
lower_dentry->d_parent->d_inode, having another process move lower_dentry
around and have its (ex)parent not pinned anymore and freed on memory
pressure. Then we regain CPU and try to fetch ->d_inode from memory
that is freed by that point.
dentry->d_parent *is* stable here - it's an argument of ->lookup() and
we are guaranteed that it won't be moved anywhere until we feed it
to d_add/d_splice_alias. So we safely go that way to get to its
underlying dentry.
Cc: stable@vger.kernel.org # since 2009 or so
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 3c2298721359..e23752d9a79f 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -319,9 +319,9 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode)
static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
struct dentry *lower_dentry)
{
+ struct path *path = ecryptfs_dentry_to_lower_path(dentry->d_parent);
struct inode *inode, *lower_inode;
struct ecryptfs_dentry_info *dentry_info;
- struct vfsmount *lower_mnt;
int rc = 0;
dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
@@ -330,13 +330,12 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
return ERR_PTR(-ENOMEM);
}
- lower_mnt = mntget(ecryptfs_dentry_to_lower_mnt(dentry->d_parent));
fsstack_copy_attr_atime(d_inode(dentry->d_parent),
- d_inode(lower_dentry->d_parent));
+ d_inode(path->dentry));
BUG_ON(!d_count(lower_dentry));
ecryptfs_set_dentry_private(dentry, dentry_info);
- dentry_info->lower_path.mnt = lower_mnt;
+ dentry_info->lower_path.mnt = mntget(path->mnt);
dentry_info->lower_path.dentry = lower_dentry;
/*
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
2019-11-03 18:51 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
2019-11-03 19:03 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
@ 2019-11-13 7:01 ` Amir Goldstein
2019-11-13 12:52 ` Al Viro
1 sibling, 1 reply; 26+ messages in thread
From: Amir Goldstein @ 2019-11-13 7:01 UTC (permalink / raw)
To: Al Viro
Cc: linux-fsdevel, Ritesh Harjani, linux-kernel, wugyuan,
Jeff Layton, Gao Xiang, Jan Kara, Linus Torvalds, ecryptfs
On Sun, Nov 3, 2019 at 8:52 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> lower_dentry can't go from positive to negative (we have it pinned),
> but it *can* go from negative to positive. So fetching ->d_inode
> into a local variable, doing a blocking allocation, checking that
> now ->d_inode is non-NULL and feeding the value we'd fetched
> earlier to a function that won't accept NULL is not a good idea.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index a905d5f4f3b0..3c2298721359 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -319,7 +319,7 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode)
> static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
> struct dentry *lower_dentry)
> {
> - struct inode *inode, *lower_inode = d_inode(lower_dentry);
> + struct inode *inode, *lower_inode;
> struct ecryptfs_dentry_info *dentry_info;
> struct vfsmount *lower_mnt;
> int rc = 0;
> @@ -339,7 +339,15 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
> dentry_info->lower_path.mnt = lower_mnt;
> dentry_info->lower_path.dentry = lower_dentry;
>
> - if (d_really_is_negative(lower_dentry)) {
> + /*
> + * negative dentry can go positive under us here - its parent is not
> + * locked. That's OK and that could happen just as we return from
> + * ecryptfs_lookup() anyway. Just need to be careful and fetch
> + * ->d_inode only once - it's not stable here.
> + */
> + lower_inode = READ_ONCE(lower_dentry->d_inode);
> +
> + if (!lower_inode) {
> /* We want to add because we couldn't find in lower */
> d_add(dentry, NULL);
> return NULL;
Sigh!
Open coding a human readable macro to solve a subtle lookup race.
That doesn't sound like a scalable solution.
I have a feeling this is not the last patch we will be seeing along
those lines.
Seeing that developers already confused about when they should use
d_really_is_negative() over d_is_negative() [1] and we probably
don't want to add d_really_really_is_negative(), how about
applying that READ_ONCE into d_really_is_negative() and
re-purpose it as a macro to be used when races with lookup are
a concern?
Thanks,
Amir.
[1] https://lore.kernel.org/linux-fsdevel/20190903135803.GA25692@hsiangkao-HP-ZHAN-66-Pro-G1/
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
2019-11-13 7:01 ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Amir Goldstein
@ 2019-11-13 12:52 ` Al Viro
2019-11-13 16:22 ` Amir Goldstein
2019-11-13 20:18 ` Jean-Louis Biasini
0 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2019-11-13 12:52 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-fsdevel, Ritesh Harjani, linux-kernel, wugyuan,
Jeff Layton, Gao Xiang, Jan Kara, Linus Torvalds, ecryptfs
On Wed, Nov 13, 2019 at 09:01:36AM +0200, Amir Goldstein wrote:
> > - if (d_really_is_negative(lower_dentry)) {
> > + /*
> > + * negative dentry can go positive under us here - its parent is not
> > + * locked. That's OK and that could happen just as we return from
> > + * ecryptfs_lookup() anyway. Just need to be careful and fetch
> > + * ->d_inode only once - it's not stable here.
> > + */
> > + lower_inode = READ_ONCE(lower_dentry->d_inode);
> > +
> > + if (!lower_inode) {
> > /* We want to add because we couldn't find in lower */
> > d_add(dentry, NULL);
> > return NULL;
>
> Sigh!
>
> Open coding a human readable macro to solve a subtle lookup race.
> That doesn't sound like a scalable solution.
> I have a feeling this is not the last patch we will be seeing along
> those lines.
>
> Seeing that developers already confused about when they should use
> d_really_is_negative() over d_is_negative() [1] and we probably
> don't want to add d_really_really_is_negative(), how about
> applying that READ_ONCE into d_really_is_negative() and
> re-purpose it as a macro to be used when races with lookup are
> a concern?
Would you care to explain what that "fix" would've achieved here,
considering the fact that barriers are no-ops on UP and this is
*NOT* an SMP race?
And it's very much present on UP - we have
fetch ->d_inode into local variable
do blocking allocation
check if ->d_inode is NULL now
if it is not, use the value in local variable and expect it to be non-NULL
That's not a case of missing barriers. At all. And no redefinition of
d_really_is_negative() is going to help - it can't retroactively affect
the value explicitly fetched into a local variable some time prior to
that.
There are other patches dealing with ->d_inode accesses, but they are
generally not along the same lines. The problem is rarely the same...
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
2019-11-13 12:52 ` Al Viro
@ 2019-11-13 16:22 ` Amir Goldstein
2019-11-13 20:18 ` Jean-Louis Biasini
1 sibling, 0 replies; 26+ messages in thread
From: Amir Goldstein @ 2019-11-13 16:22 UTC (permalink / raw)
To: Al Viro
Cc: linux-fsdevel, Ritesh Harjani, linux-kernel, wugyuan,
Jeff Layton, Gao Xiang, Jan Kara, Linus Torvalds, ecryptfs
On Wed, Nov 13, 2019 at 2:52 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 13, 2019 at 09:01:36AM +0200, Amir Goldstein wrote:
> > > - if (d_really_is_negative(lower_dentry)) {
> > > + /*
> > > + * negative dentry can go positive under us here - its parent is not
> > > + * locked. That's OK and that could happen just as we return from
> > > + * ecryptfs_lookup() anyway. Just need to be careful and fetch
> > > + * ->d_inode only once - it's not stable here.
> > > + */
> > > + lower_inode = READ_ONCE(lower_dentry->d_inode);
> > > +
> > > + if (!lower_inode) {
> > > /* We want to add because we couldn't find in lower */
> > > d_add(dentry, NULL);
> > > return NULL;
> >
> > Sigh!
> >
> > Open coding a human readable macro to solve a subtle lookup race.
> > That doesn't sound like a scalable solution.
> > I have a feeling this is not the last patch we will be seeing along
> > those lines.
> >
> > Seeing that developers already confused about when they should use
> > d_really_is_negative() over d_is_negative() [1] and we probably
> > don't want to add d_really_really_is_negative(), how about
> > applying that READ_ONCE into d_really_is_negative() and
> > re-purpose it as a macro to be used when races with lookup are
> > a concern?
>
> Would you care to explain what that "fix" would've achieved here,
> considering the fact that barriers are no-ops on UP and this is
> *NOT* an SMP race?
>
> And it's very much present on UP - we have
> fetch ->d_inode into local variable
> do blocking allocation
> check if ->d_inode is NULL now
> if it is not, use the value in local variable and expect it to be non-NULL
>
> That's not a case of missing barriers. At all. And no redefinition of
> d_really_is_negative() is going to help - it can't retroactively affect
> the value explicitly fetched into a local variable some time prior to
> that.
>
Indeed. I missed that part of your commit message and didn't
realize the variable was being used later.
The language in the comment "can go positive under us" implied
SMP race so I misunderstood the reason for READ_ONCE().
Sorry for the noise.
Amir.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
2019-11-13 12:52 ` Al Viro
2019-11-13 16:22 ` Amir Goldstein
@ 2019-11-13 20:18 ` Jean-Louis Biasini
1 sibling, 0 replies; 26+ messages in thread
From: Jean-Louis Biasini @ 2019-11-13 20:18 UTC (permalink / raw)
To: Al Viro, Amir Goldstein
Cc: linux-fsdevel, Ritesh Harjani, linux-kernel, wugyuan,
Jeff Layton, Gao Xiang, Jan Kara, Linus Torvalds, ecryptfs
Please can you UNSUBSCRIBE me from this list?
thx
Le 13/11/2019 à 13:52, Al Viro a écrit :
> On Wed, Nov 13, 2019 at 09:01:36AM +0200, Amir Goldstein wrote:
>>> - if (d_really_is_negative(lower_dentry)) {
>>> + /*
>>> + * negative dentry can go positive under us here - its parent is not
>>> + * locked. That's OK and that could happen just as we return from
>>> + * ecryptfs_lookup() anyway. Just need to be careful and fetch
>>> + * ->d_inode only once - it's not stable here.
>>> + */
>>> + lower_inode = READ_ONCE(lower_dentry->d_inode);
>>> +
>>> + if (!lower_inode) {
>>> /* We want to add because we couldn't find in lower */
>>> d_add(dentry, NULL);
>>> return NULL;
>> Sigh!
>>
>> Open coding a human readable macro to solve a subtle lookup race.
>> That doesn't sound like a scalable solution.
>> I have a feeling this is not the last patch we will be seeing along
>> those lines.
>>
>> Seeing that developers already confused about when they should use
>> d_really_is_negative() over d_is_negative() [1] and we probably
>> don't want to add d_really_really_is_negative(), how about
>> applying that READ_ONCE into d_really_is_negative() and
>> re-purpose it as a macro to be used when races with lookup are
>> a concern?
> Would you care to explain what that "fix" would've achieved here,
> considering the fact that barriers are no-ops on UP and this is
> *NOT* an SMP race?
>
> And it's very much present on UP - we have
> fetch ->d_inode into local variable
> do blocking allocation
> check if ->d_inode is NULL now
> if it is not, use the value in local variable and expect it to be non-NULL
>
> That's not a case of missing barriers. At all. And no redefinition of
> d_really_is_negative() is going to help - it can't retroactively affect
> the value explicitly fetched into a local variable some time prior to
> that.
>
> There are other patches dealing with ->d_inode accesses, but they are
> generally not along the same lines. The problem is rarely the same...
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year)
2019-11-02 18:08 ` Al Viro
2019-11-03 14:41 ` Paul E. McKenney
2019-11-03 16:35 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
@ 2019-11-03 17:05 ` Al Viro
2019-11-09 3:13 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
3 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2019-11-03 17:05 UTC (permalink / raw)
To: Tyler Hicks; +Cc: ecryptfs, linux-fsdevel, linux-kernel
On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:
> There's also some secondary stuff dropping out of that (e.g. ceph seeding
> dcache on readdir and blindly unhashing dentries it sees stale instead of
> doing d_invalidate() as it ought to - leads to fun results if you had
> something mounted on a subdirectory that got removed/recreated on server),
> but that's a separate pile of joy - doesn't affect this analysis, so
> it'll have to be dealt with later. It had been an interesting couple of
> weeks...
More secondaries: a piece of nastiness in ecryptfs unlink/rmdir -
have the underlying layer mounted elsewhere, look something up
via ecryptfs, move the underlying object via that mount elsewhere
and call unlink(). Ends up passing the wrong directory
inode to vfs_unlink() and then - to ->unlink() of the underlying
fs... Embarrassing, since ecryptfs_rename() used to have exact
same problem, caught and fixed in "ecryptfs_rename(): verify that
lower dentries are still OK after lock_rename()" a year ago.
Bugs are like mushrooms - found one, look around for its relatives...
How about the following (completely untested) patch? Tyler, do you
see any problems in that?
ecryptfs: fix unlink and rmdir in face of underlying fs modifications
A problem similar to the one caught in commit 74dd7c97ea2a ("ecryptfs_rename():
verify that lower dentries are still OK after lock_rename()") exists for
unlink/rmdir as well.
Instead of playing with dget_parent() of underlying dentry of victim
and hoping it's the same as underlying dentry of our directory,
do the following:
* find the underlying dentry of victim
* find the underlying directory of victim's parent (stable
since the victim is ecryptfs dentry and inode of its parent is
held exclusive by the caller).
* lock the inode of dentry underlying the victim's parent
* check that underlying dentry of victim is still hashed and
has the right parent - it can be moved, but it can't be moved to/from
the directory we are holding exclusive. So while ->d_parent itself
might not be stable, the result of comparison is.
If the check passes, everything is fine - underlying directory is locked,
underlying victim is still a child of that directory and we can go ahead
and feed them to vfs_unlink(). As in the current mainline we need to
pin the underlying dentry of victim, so that it wouldn't go negative under
us, but that's the only temporary reference that needs to be grabbed there.
Underlying dentry of parent won't go away (it's pinned by the parent,
which is held by caller), so there's no need to grab it.
The same problem (with the same solution) exists for rmdir. Moreover,
rename gets simpler and more robust with the same "don't bother with
dget_parent()" approach.
Fixes: 74dd7c97ea2 "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 18426f4855f1..a905d5f4f3b0 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -128,13 +128,20 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
- struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
struct dentry *lower_dir_dentry;
+ struct inode *lower_dir_inode;
int rc;
- dget(lower_dentry);
- lower_dir_dentry = lock_parent(lower_dentry);
- rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
+ lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
+ lower_dir_inode = d_inode(lower_dir_dentry);
+ inode_lock_nested(lower_dir_inode, I_MUTEX_PARENT);
+ dget(lower_dentry); // don't even try to make the lower negative
+ if (lower_dentry->d_parent != lower_dir_dentry)
+ rc = -EINVAL;
+ else if (d_unhashed(lower_dentry))
+ rc = -EINVAL;
+ else
+ rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
if (rc) {
printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
goto out_unlock;
@@ -142,10 +149,11 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
fsstack_copy_attr_times(dir, lower_dir_inode);
set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
inode->i_ctime = dir->i_ctime;
- d_drop(dentry);
out_unlock:
- unlock_dir(lower_dir_dentry);
dput(lower_dentry);
+ inode_unlock(lower_dir_inode);
+ if (!rc)
+ d_drop(dentry);
return rc;
}
@@ -512,22 +520,30 @@ static int ecryptfs_rmdir(struct inode *dir, struct dentry *dentry)
{
struct dentry *lower_dentry;
struct dentry *lower_dir_dentry;
+ struct inode *lower_dir_inode;
int rc;
lower_dentry = ecryptfs_dentry_to_lower(dentry);
- dget(dentry);
- lower_dir_dentry = lock_parent(lower_dentry);
- dget(lower_dentry);
- rc = vfs_rmdir(d_inode(lower_dir_dentry), lower_dentry);
- dput(lower_dentry);
- if (!rc && d_really_is_positive(dentry))
+ lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
+ lower_dir_inode = d_inode(lower_dir_dentry);
+
+ inode_lock_nested(lower_dir_inode, I_MUTEX_PARENT);
+ dget(lower_dentry); // don't even try to make the lower negative
+ if (lower_dentry->d_parent != lower_dir_dentry)
+ rc = -EINVAL;
+ else if (d_unhashed(lower_dentry))
+ rc = -EINVAL;
+ else
+ rc = vfs_rmdir(lower_dir_inode, lower_dentry);
+ if (!rc) {
clear_nlink(d_inode(dentry));
- fsstack_copy_attr_times(dir, d_inode(lower_dir_dentry));
- set_nlink(dir, d_inode(lower_dir_dentry)->i_nlink);
- unlock_dir(lower_dir_dentry);
+ fsstack_copy_attr_times(dir, lower_dir_inode);
+ set_nlink(dir, lower_dir_inode->i_nlink);
+ }
+ dput(lower_dentry);
+ inode_unlock(lower_dir_inode);
if (!rc)
d_drop(dentry);
- dput(dentry);
return rc;
}
@@ -565,20 +581,22 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct dentry *lower_new_dentry;
struct dentry *lower_old_dir_dentry;
struct dentry *lower_new_dir_dentry;
- struct dentry *trap = NULL;
+ struct dentry *trap;
struct inode *target_inode;
if (flags)
return -EINVAL;
+ lower_old_dir_dentry = ecryptfs_dentry_to_lower(old_dentry->d_parent);
+ lower_new_dir_dentry = ecryptfs_dentry_to_lower(new_dentry->d_parent);
+
lower_old_dentry = ecryptfs_dentry_to_lower(old_dentry);
lower_new_dentry = ecryptfs_dentry_to_lower(new_dentry);
- dget(lower_old_dentry);
- dget(lower_new_dentry);
- lower_old_dir_dentry = dget_parent(lower_old_dentry);
- lower_new_dir_dentry = dget_parent(lower_new_dentry);
+
target_inode = d_inode(new_dentry);
+
trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+ dget(lower_new_dentry);
rc = -EINVAL;
if (lower_old_dentry->d_parent != lower_old_dir_dentry)
goto out_lock;
@@ -606,11 +624,8 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
if (new_dir != old_dir)
fsstack_copy_attr_all(old_dir, d_inode(lower_old_dir_dentry));
out_lock:
- unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
- dput(lower_new_dir_dentry);
- dput(lower_old_dir_dentry);
dput(lower_new_dentry);
- dput(lower_old_dentry);
+ unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
return rc;
}
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH][RFC] race in exportfs_decode_fh()
2019-11-02 18:08 ` Al Viro
` (2 preceding siblings ...)
2019-11-03 17:05 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
@ 2019-11-09 3:13 ` Al Viro
2019-11-09 16:55 ` Linus Torvalds
2019-11-11 9:16 ` Christoph Hellwig
3 siblings, 2 replies; 26+ messages in thread
From: Al Viro @ 2019-11-09 3:13 UTC (permalink / raw)
To: J. Bruce Fields
Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, wugyuan, jlayton,
hsiangkao, Jan Kara, Linus Torvalds, Christoph Hellwig
On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:
> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution. Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...
Oh, lovely - in exportfs_decode_fh() we have this:
err = exportfs_get_name(mnt, target_dir, nbuf, result);
if (!err) {
inode_lock(target_dir->d_inode);
nresult = lookup_one_len(nbuf, target_dir,
strlen(nbuf));
inode_unlock(target_dir->d_inode);
if (!IS_ERR(nresult)) {
if (nresult->d_inode) {
dput(result);
result = nresult;
} else
dput(nresult);
}
}
We have derived the parent from fhandle, we have a disconnected dentry for child,
we go look for the name. We even find it. Now, we want to look it up. And
some bastard goes and unlinks it, just as we are trying to lock the parent.
We do a lookup, and get a negative dentry. Then we unlock the parent... and
some other bastard does e.g. mkdir with the same name. OK, nresult->d_inode
is not NULL (anymore). It has fuck-all to do with the original fhandle
(different inumber, etc.) but we happily accept it.
Even better, we have no barriers between our check and nresult becoming positive.
IOW, having observed non-NULL ->d_inode doesn't give us enough - e.g. we might
still see the old ->d_flags value, from back when ->d_inode used to be NULL.
On something like alpha we also have no promises that we'll observe anything
about the fields of nresult->d_inode, but ->d_flags alone is enough for fun.
The callers can't e.g. expect d_is_reg() et.al. to match the reality.
This is obviously bogus. And the fix is obvious: check that nresult->d_inode is
equal to result->d_inode before unlocking the parent. Note that we'd *already* had
the original result and all of its aliases rejected by the 'acceptable' predicate,
so if nresult doesn't supply us a better alias, we are SOL.
Does anyone see objections to the following patch? Christoph, that seems to
be your code; am I missing something subtle here? AFAICS, that goes back to
2007 or so...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 09bc68708d28..2dd55b172d57 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -519,26 +519,33 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
* inode is actually connected to the parent.
*/
err = exportfs_get_name(mnt, target_dir, nbuf, result);
- if (!err) {
- inode_lock(target_dir->d_inode);
- nresult = lookup_one_len(nbuf, target_dir,
- strlen(nbuf));
- inode_unlock(target_dir->d_inode);
- if (!IS_ERR(nresult)) {
- if (nresult->d_inode) {
- dput(result);
- result = nresult;
- } else
- dput(nresult);
- }
+ if (err) {
+ dput(target_dir);
+ goto err_result;
}
+ inode_lock(target_dir->d_inode);
+ nresult = lookup_one_len(nbuf, target_dir, strlen(nbuf));
+ if (!IS_ERR(nresult)) {
+ if (unlikely(nresult->d_inode != result->d_inode)) {
+ dput(nresult);
+ nresult = ERR_PTR(-ESTALE);
+ }
+ }
+ inode_unlock(target_dir->d_inode);
/*
* At this point we are done with the parent, but it's pinned
* by the child dentry anyway.
*/
dput(target_dir);
+ if (IS_ERR(nresult)) {
+ err = PTR_ERR(nresult);
+ goto err_result;
+ }
+ dput(result);
+ result = nresult;
+
/*
* And finally make sure the dentry is actually acceptable
* to NFSD.
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] race in exportfs_decode_fh()
2019-11-09 3:13 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
@ 2019-11-09 16:55 ` Linus Torvalds
2019-11-09 18:26 ` Al Viro
2019-11-11 9:16 ` Christoph Hellwig
1 sibling, 1 reply; 26+ messages in thread
From: Linus Torvalds @ 2019-11-09 16:55 UTC (permalink / raw)
To: Al Viro
Cc: J. Bruce Fields, Ritesh Harjani, linux-fsdevel,
Linux Kernel Mailing List, wugyuan, Jeff Layton, hsiangkao,
Jan Kara, Christoph Hellwig
On Fri, Nov 8, 2019 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We have derived the parent from fhandle, we have a disconnected dentry for child,
> we go look for the name. We even find it. Now, we want to look it up. And
> some bastard goes and unlinks it, just as we are trying to lock the parent.
> We do a lookup, and get a negative dentry. Then we unlock the parent... and
> some other bastard does e.g. mkdir with the same name. OK, nresult->d_inode
> is not NULL (anymore). It has fuck-all to do with the original fhandle
> (different inumber, etc.) but we happily accept it.
No arguments with your patch, although I doubt that this case has
actually ever happened in practice ;)
Linus
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] race in exportfs_decode_fh()
2019-11-09 16:55 ` Linus Torvalds
@ 2019-11-09 18:26 ` Al Viro
0 siblings, 0 replies; 26+ messages in thread
From: Al Viro @ 2019-11-09 18:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: J. Bruce Fields, Ritesh Harjani, linux-fsdevel,
Linux Kernel Mailing List, wugyuan, Jeff Layton, hsiangkao,
Jan Kara, Christoph Hellwig
On Sat, Nov 09, 2019 at 08:55:38AM -0800, Linus Torvalds wrote:
> On Fri, Nov 8, 2019 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > We have derived the parent from fhandle, we have a disconnected dentry for child,
> > we go look for the name. We even find it. Now, we want to look it up. And
> > some bastard goes and unlinks it, just as we are trying to lock the parent.
> > We do a lookup, and get a negative dentry. Then we unlock the parent... and
> > some other bastard does e.g. mkdir with the same name. OK, nresult->d_inode
> > is not NULL (anymore). It has fuck-all to do with the original fhandle
> > (different inumber, etc.) but we happily accept it.
>
> No arguments with your patch, although I doubt that this case has
> actually ever happened in practice ;)
Frankly, by this point I'm rather tempted to introduce new sparse annotation for
dentries - "might not be positive". The thing is, there are four cases when
->d_inode is guaranteed to be stable:
1) nobody else has seen that dentry; that includes in-lookup ones - they
can be found in in-lookup hash by d_alloc_parallel(), but it'll wait until they
cease to be in-lookup.
2) ->d_lock is held
3) pinned positive
4) pinned, parent held at least shared
Anything else can have ->d_inode changed by another thread. And class 3 is by
far the most common. As the matter of fact, most of dentry pointers in the
system (as opposed to (h)lists traversing dentries) are such.
For obvious reasons we have shitloads of ->d_inode accesses; I'd been going
through a tree-wide audit of those and doing that manually is bloody unpleasant.
And we do have races in that area - the one above is the latest I'd caught;
there had been more and I'm fairly sure that it's not the last.
Redoing that kind of audit every once in a while is something I wouldn't
wish on anyone; it would be nice to use sparse to narrow the field. Note
that simply checking ->d_inode (or ->d_flags) is not enough unless we
know them to be stable. E.g. callers of lookup_one_len_unlocked() tend
to be racy exactly because of such tests. I'm going to add
lookup_positive_unlocked() that would return ERR_PTR(-ENOENT) on
negatives and convert the callers - with one exception they treat
negatives the same way they would treat ERR_PTR(-ENOENT) and their
tests are racy. I'd rather have sufficient barriers done in one common
helper, instead of trying to add them in 11 places and hope that new
users won't fuck it up...
I'm still not sure what's the best way to do the annotations - propagation
is not a big deal, but expressing the transitions and checking that
maybe-negative ones are not misused is trickier. The last part can
be actually left to manual audit - annotations would already reduce
the search area big way. A not too ugly way to say that now this dentry
is known to be non-negative, OTOH... I want to finish the audit and
take a good look at the list of places where such transitions happen.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH][RFC] race in exportfs_decode_fh()
2019-11-09 3:13 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
2019-11-09 16:55 ` Linus Torvalds
@ 2019-11-11 9:16 ` Christoph Hellwig
1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2019-11-11 9:16 UTC (permalink / raw)
To: Al Viro
Cc: J. Bruce Fields, Ritesh Harjani, linux-fsdevel, linux-kernel,
wugyuan, jlayton, hsiangkao, Jan Kara, Linus Torvalds,
Christoph Hellwig
On Sat, Nov 09, 2019 at 03:13:33AM +0000, Al Viro wrote:
> Does anyone see objections to the following patch? Christoph, that seems to
> be your code; am I missing something subtle here? AFAICS, that goes back to
> 2007 or so...
This goes back to way before that, that series jut factored out proper
export operations from the two inode or superblock methods we had before
with the rest handled in core code somewhere that made a complete
mess of file systems with 64-bit inode numbers.
Otherwise this looks fine, although splitting the refactoring from
the actual change would make for a much more readable series.
^ permalink raw reply [flat|nested] 26+ messages in thread