Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
@ 2019-09-27  4:42 Ritesh Harjani
  2019-10-15  4:07 ` Ritesh Harjani
  0 siblings, 1 reply; 22+ messages in thread
From: Ritesh Harjani @ 2019-09-27  4:42 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao, riteshh

d_is_negative can race with d_instantiate_new()
-> __d_set_inode_and_type().
For e.g. in use cases where Thread-1 is creating
symlink (doing d_instantiate_new()) & Thread-2 is doing
cat of that symlink while doing lookup_fast (via REF-walk-
one such case is, when ->permission returns -ECHILD).

During this race if __d_set_and_inode_type() does out-of-order
execution and set the dentry->d_flags before setting
dentry->inode, then it can result into following kernel panic.

This change fixes the issue by directly checking for inode.

E.g. kernel panic, since inode was NULL.
trailing_symlink() -> may_follow_link() -> inode->i_uid.
Issue signature:-
  [NIP  : trailing_symlink+80]
  [LR   : trailing_symlink+1092]
  #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
  #5 [c00000198069bc00] path_openat at c0000000004bdd14
  #6 [c00000198069bc90] do_filp_open at c0000000004c0274
  #7 [c00000198069bdb0] do_sys_open at c00000000049b248
  #8 [c00000198069be30] system_call at c00000000000b388

Sequence of events:-
Thread-2(Comm: ln) 	       Thread-1(Comm: cat)

	                dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
    flags = READ_ONCE(dentry->d_flags);
    flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
    flags |= type_flags;
    WRITE_ONCE(dentry->d_flags, flags);

	                if (unlikely(d_is_negative()) // fails
	                       {}
	                // since d_flags is already updated in
	                // Thread-2 in parallel but inode
	                // not yet set.
	                // d_is_negative returns false

	                *inode = d_backing_inode(path->dentry);
	                // means inode is still NULL

    dentry->d_inode = inode;

	                trailing_symlink()
	                    may_follow_link()
	                        inode = nd->link_inode;
	                        // nd->link_inode = NULL
	                        //Then it crashes while
	                        //doing inode->i_uid

Reported-by: Guang Yuan Wu <wugyuan@cn.ibm.com>
Tested-by: Guang Yuan Wu <wugyuan@cn.ibm.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/namei.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 671c3c1a3425..7c5337cddebd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
 		dput(dentry);
 		return status;
 	}
-	if (unlikely(d_is_negative(dentry))) {
+
+	/*
+	 * Caution: d_is_negative() can race with
+	 * __d_set_inode_and_type().
+	 * For e.g. in use cases where Thread-1 is creating
+	 * symlink (doing d_instantiate_new()) & Thread-2 is doing
+	 * cat of that symlink and falling here (via Ref-walk) while
+	 * doing lookup_fast (one such case is when ->permission
+	 * returns -ECHILD).
+	 * Now if __d_set_inode_and_type() does out-of-order execution
+	 * i.e. it first sets the dentry->d_flags & then dentry->inode
+	 * then it can result into inode being NULL, causing panic later.
+	 * Hence directly check if inode is NULL here.
+	 */
+	if (unlikely(d_really_is_negative(dentry))) {
 		dput(dentry);
 		return -ENOENT;
 	}
-- 
2.21.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-09-27  4:42 [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Ritesh Harjani
@ 2019-10-15  4:07 ` Ritesh Harjani
  2019-10-22 13:38   ` Ritesh Harjani
  0 siblings, 1 reply; 22+ messages in thread
From: Ritesh Harjani @ 2019-10-15  4:07 UTC (permalink / raw)
  To: viro, linux-fsdevel; +Cc: linux-kernel, wugyuan, jlayton, hsiangkao

ping!!

On 9/27/19 10:12 AM, Ritesh Harjani wrote:
> d_is_negative can race with d_instantiate_new()
> -> __d_set_inode_and_type().
> For e.g. in use cases where Thread-1 is creating
> symlink (doing d_instantiate_new()) & Thread-2 is doing
> cat of that symlink while doing lookup_fast (via REF-walk-
> one such case is, when ->permission returns -ECHILD).
> 
> During this race if __d_set_and_inode_type() does out-of-order
> execution and set the dentry->d_flags before setting
> dentry->inode, then it can result into following kernel panic.
> 
> This change fixes the issue by directly checking for inode.
> 
> E.g. kernel panic, since inode was NULL.
> trailing_symlink() -> may_follow_link() -> inode->i_uid.
> Issue signature:-
>    [NIP  : trailing_symlink+80]
>    [LR   : trailing_symlink+1092]
>    #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
>    #5 [c00000198069bc00] path_openat at c0000000004bdd14
>    #6 [c00000198069bc90] do_filp_open at c0000000004c0274
>    #7 [c00000198069bdb0] do_sys_open at c00000000049b248
>    #8 [c00000198069be30] system_call at c00000000000b388
> 
> Sequence of events:-
> Thread-2(Comm: ln) 	       Thread-1(Comm: cat)
> 
> 	                dentry = __d_lookup() //nonRCU
> 
> __d_set_and_inode_type() (Out-of-order execution)
>      flags = READ_ONCE(dentry->d_flags);
>      flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>      flags |= type_flags;
>      WRITE_ONCE(dentry->d_flags, flags);
> 
> 	                if (unlikely(d_is_negative()) // fails
> 	                       {}
> 	                // since d_flags is already updated in
> 	                // Thread-2 in parallel but inode
> 	                // not yet set.
> 	                // d_is_negative returns false
> 
> 	                *inode = d_backing_inode(path->dentry);
> 	                // means inode is still NULL
> 
>      dentry->d_inode = inode;
> 
> 	                trailing_symlink()
> 	                    may_follow_link()
> 	                        inode = nd->link_inode;
> 	                        // nd->link_inode = NULL
> 	                        //Then it crashes while
> 	                        //doing inode->i_uid
> 
> Reported-by: Guang Yuan Wu <wugyuan@cn.ibm.com>
> Tested-by: Guang Yuan Wu <wugyuan@cn.ibm.com>
> Acked-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>   fs/namei.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 671c3c1a3425..7c5337cddebd 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
>   		dput(dentry);
>   		return status;
>   	}
> -	if (unlikely(d_is_negative(dentry))) {
> +
> +	/*
> +	 * Caution: d_is_negative() can race with
> +	 * __d_set_inode_and_type().
> +	 * For e.g. in use cases where Thread-1 is creating
> +	 * symlink (doing d_instantiate_new()) & Thread-2 is doing
> +	 * cat of that symlink and falling here (via Ref-walk) while
> +	 * doing lookup_fast (one such case is when ->permission
> +	 * returns -ECHILD).
> +	 * Now if __d_set_inode_and_type() does out-of-order execution
> +	 * i.e. it first sets the dentry->d_flags & then dentry->inode
> +	 * then it can result into inode being NULL, causing panic later.
> +	 * Hence directly check if inode is NULL here.
> +	 */
> +	if (unlikely(d_really_is_negative(dentry))) {
>   		dput(dentry);
>   		return -ENOENT;
>   	}
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-10-15  4:07 ` Ritesh Harjani
@ 2019-10-22 13:38   ` Ritesh Harjani
  2019-10-22 14:37     ` Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Ritesh Harjani @ 2019-10-22 13:38 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: linux-kernel, wugyuan, jlayton, hsiangkao, Jan Kara

I think we have still not taken this patch. Al?


On 10/15/19 9:37 AM, Ritesh Harjani wrote:
> ping!!
> 
> On 9/27/19 10:12 AM, Ritesh Harjani wrote:
>> d_is_negative can race with d_instantiate_new()
>> -> __d_set_inode_and_type().
>> For e.g. in use cases where Thread-1 is creating
>> symlink (doing d_instantiate_new()) & Thread-2 is doing
>> cat of that symlink while doing lookup_fast (via REF-walk-
>> one such case is, when ->permission returns -ECHILD).
>>
>> During this race if __d_set_and_inode_type() does out-of-order
>> execution and set the dentry->d_flags before setting
>> dentry->inode, then it can result into following kernel panic.
>>
>> This change fixes the issue by directly checking for inode.
>>
>> E.g. kernel panic, since inode was NULL.
>> trailing_symlink() -> may_follow_link() -> inode->i_uid.
>> Issue signature:-
>>    [NIP  : trailing_symlink+80]
>>    [LR   : trailing_symlink+1092]
>>    #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  
>> (unreliable)
>>    #5 [c00000198069bc00] path_openat at c0000000004bdd14
>>    #6 [c00000198069bc90] do_filp_open at c0000000004c0274
>>    #7 [c00000198069bdb0] do_sys_open at c00000000049b248
>>    #8 [c00000198069be30] system_call at c00000000000b388
>>
>> Sequence of events:-
>> Thread-2(Comm: ln)            Thread-1(Comm: cat)
>>
>>                     dentry = __d_lookup() //nonRCU
>>
>> __d_set_and_inode_type() (Out-of-order execution)
>>      flags = READ_ONCE(dentry->d_flags);
>>      flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>>      flags |= type_flags;
>>      WRITE_ONCE(dentry->d_flags, flags);
>>
>>                     if (unlikely(d_is_negative()) // fails
>>                            {}
>>                     // since d_flags is already updated in
>>                     // Thread-2 in parallel but inode
>>                     // not yet set.
>>                     // d_is_negative returns false
>>
>>                     *inode = d_backing_inode(path->dentry);
>>                     // means inode is still NULL
>>
>>      dentry->d_inode = inode;
>>
>>                     trailing_symlink()
>>                         may_follow_link()
>>                             inode = nd->link_inode;
>>                             // nd->link_inode = NULL
>>                             //Then it crashes while
>>                             //doing inode->i_uid
>>
>> Reported-by: Guang Yuan Wu <wugyuan@cn.ibm.com>
>> Tested-by: Guang Yuan Wu <wugyuan@cn.ibm.com>
>> Acked-by: Jeff Layton <jlayton@kernel.org>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> ---
>>   fs/namei.c | 16 +++++++++++++++-
>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 671c3c1a3425..7c5337cddebd 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -1617,7 +1617,21 @@ static int lookup_fast(struct nameidata *nd,
>>           dput(dentry);
>>           return status;
>>       }
>> -    if (unlikely(d_is_negative(dentry))) {
>> +
>> +    /*
>> +     * Caution: d_is_negative() can race with
>> +     * __d_set_inode_and_type().
>> +     * For e.g. in use cases where Thread-1 is creating
>> +     * symlink (doing d_instantiate_new()) & Thread-2 is doing
>> +     * cat of that symlink and falling here (via Ref-walk) while
>> +     * doing lookup_fast (one such case is when ->permission
>> +     * returns -ECHILD).
>> +     * Now if __d_set_inode_and_type() does out-of-order execution
>> +     * i.e. it first sets the dentry->d_flags & then dentry->inode
>> +     * then it can result into inode being NULL, causing panic later.
>> +     * Hence directly check if inode is NULL here.
>> +     */
>> +    if (unlikely(d_really_is_negative(dentry))) {
>>           dput(dentry);
>>           return -ENOENT;
>>       }
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-10-22 13:38   ` Ritesh Harjani
@ 2019-10-22 14:37     ` Al Viro
  2019-10-22 14:50       ` Al Viro
  2019-10-22 20:11       ` Al Viro
  0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-10-22 14:37 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds

On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
> I think we have still not taken this patch. Al?

I'm still not sure it's a good way to deal with the whole
mess, TBH ;-/  Consider e.g. walk_component(), with its

                if (unlikely(d_is_negative(path.dentry))) {
                        path_to_nameidata(&path, nd);
                        return -ENOENT;
                }

                seq = 0;        /* we are already out of RCU mode */
                inode = d_backing_inode(path.dentry);

or mountpoint_last()
        if (d_is_negative(path.dentry)) {
                dput(path.dentry);
                return -ENOENT;
        }
        path.mnt = nd->path.mnt;
        return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0);
or last_open()
        if (unlikely(d_is_negative(path.dentry))) {
                path_to_nameidata(&path, nd);
                return -ENOENT; 
        }

        /*
         * create/update audit record if it already exists.
         */
        audit_inode(nd->name, path.dentry, 0);

        if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
                path_to_nameidata(&path, nd);
                return -EEXIST;
        }

        seq = 0;        /* out of RCU mode, so the value doesn't matter */
        inode = d_backing_inode(path.dentry);
or, for that matter, any callers of filename_lookup() assuming that the
lack of ENOENT means that the last call of walk_component() (from lookup_last())
has not failed with the same ENOENT and thus the result has been observed
positive.

You've picked the easiest one to hit, but on e.g. KVM setups you can have the
host thread representing the CPU where __d_set_inode_and_type() runs get
preempted (by the host kernel), leaving others with much wider window.

Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
the same goes for any places that assumes that d_is_dir() implies that
the sucker is positive, etc.

What we have guaranteed is
	* ->d_lock serializes ->d_flags/->d_inode changes
	* ->d_seq is bumped before/after such changes
	* positive dentry never changes ->d_inode as long as you hold
a reference (negative dentry *can* become positive right under you)

So there are 3 classes of valid users: those holding ->d_lock, those
sampling and rechecking ->d_seq and those relying upon having observed
the sucker they've pinned to be positive.

What you've been hitting is "we have it pinned, ->d_flags says it's
positive but we still observe the value of ->d_inode from before the
store to ->d_flags that has made it look positive".

I'm somewhat tempted to make __d_set_inode_and_type() do smp_store_release()
for setting ->d_flags and __d_entry_type() - smp_load_acquire(); that would
be pretty much free for x86 (as well as sparc, s390 and itanic) and reasonably
cheap on ppc and arm64.  How badly would e.g. SMP arm suffer from the below
(completely untested)?

diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..35368316c582 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -319,7 +319,7 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
 	flags = READ_ONCE(dentry->d_flags);
 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
 	flags |= type_flags;
-	WRITE_ONCE(dentry->d_flags, flags);
+	smp_store_release(&dentry->d_flags, flags);
 }
 
 static inline void __d_clear_type_and_inode(struct dentry *dentry)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 10090f11ab95..bee28076a3fd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -386,7 +386,7 @@ static inline bool d_mountpoint(const struct dentry *dentry)
  */
 static inline unsigned __d_entry_type(const struct dentry *dentry)
 {
-	return dentry->d_flags & DCACHE_ENTRY_TYPE;
+	return smp_load_acquire(&dentry->d_flags) & DCACHE_ENTRY_TYPE;
 }
 
 static inline bool d_is_miss(const struct dentry *dentry)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-10-22 14:37     ` Al Viro
@ 2019-10-22 14:50       ` Al Viro
  2019-10-22 20:11       ` Al Viro
  1 sibling, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-10-22 14:50 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds

On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:

> I'm somewhat tempted to make __d_set_inode_and_type() do smp_store_release()
> for setting ->d_flags and __d_entry_type() - smp_load_acquire(); that would
> be pretty much free for x86 (as well as sparc, s390 and itanic) and reasonably
> cheap on ppc and arm64.  How badly would e.g. SMP arm suffer from the below
> (completely untested)?

PS: if we really go that way, we'd probably want __d_is_negative(), to be
used only under ->d_lock or with ->d_seq validation.

d_is_negative() callers in fs/dcache.c (all under ->d_lock) as well as
the RCU-side one in lookup_fast() (->d_seq validated) would you that.

Another fun place is do_unlinkat() - there we want to reorder
                inode = dentry->d_inode;
                if (d_is_negative(dentry))
                        goto slashes;
As it were, the same race here can lead to unbalanced iput(), if you
hit the window just right.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-10-22 14:37     ` Al Viro
  2019-10-22 14:50       ` Al Viro
@ 2019-10-22 20:11       ` Al Viro
  2019-10-23 11:05         ` Ritesh Harjani
  1 sibling, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-10-22 20:11 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds

On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:
> On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
> > I think we have still not taken this patch. Al?

> You've picked the easiest one to hit, but on e.g. KVM setups you can have the
> host thread representing the CPU where __d_set_inode_and_type() runs get
> preempted (by the host kernel), leaving others with much wider window.
> 
> Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
> the same goes for any places that assumes that d_is_dir() implies that
> the sucker is positive, etc.
> 
> What we have guaranteed is
> 	* ->d_lock serializes ->d_flags/->d_inode changes
> 	* ->d_seq is bumped before/after such changes
> 	* positive dentry never changes ->d_inode as long as you hold
> a reference (negative dentry *can* become positive right under you)
> 
> So there are 3 classes of valid users: those holding ->d_lock, those
> sampling and rechecking ->d_seq and those relying upon having observed
> the sucker they've pinned to be positive.
> 
> What you've been hitting is "we have it pinned, ->d_flags says it's
> positive but we still observe the value of ->d_inode from before the
> store to ->d_flags that has made it look positive".

Actually, your patch opens another problem there.  Suppose you make
it d_really_is_positive() and hit the same race sans reordering.
Dentry is found by __d_lookup() and is negative.  Right after we
return from __d_lookup() another thread makes it positive (a symlink)
- ->d_inode is set, d_really_is_positive() becomes true.  OK, on we
go, pick the inode and move on.  Right?  ->d_flags is still not set
by the thread that made it positive.  We return from lookup_fast()
and call step_into().  And get to
        if (likely(!d_is_symlink(path->dentry)) ||
Which checks ->d_flags and sees the value from before the sucker
became positive.  IOW, d_is_symlink() is false here.  If that
was the last path component and we'd been told to follow links,
we will end up with positive dentry of a symlink coming out of
pathname resolution.

Similar fun happens if you have mkdir racing with lookup - ENOENT
is what should've happened if lookup comes first, success - if
mkdir does.  This way we can hit ENOTDIR, due to false negative
from d_can_lookup().

IOW, d_really_is_negative() in lookup_fast() will paper over
one of oopsen, but it
	* won't cover similar oopsen on other codepaths and
	* will lead to bogus behaviour.

I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
is the right solution; it might very well be that we need to do that
only on a small subset of call sites, lookup_fast() being one of
those.  But we do want at least to be certain that something we'd
got from lookup_fast() in non-RCU mode already has ->d_flags visible.

I'm going through the callers right now, will post a followup once
the things get cleaner...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-10-22 20:11       ` Al Viro
@ 2019-10-23 11:05         ` Ritesh Harjani
  2019-11-01 23:46           ` Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Ritesh Harjani @ 2019-10-23 11:05 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds



On 10/23/19 1:41 AM, Al Viro wrote:
> On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:
>> On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
>>> I think we have still not taken this patch. Al?

>> or, for that matter, any callers of filename_lookup() assuming that the
>> lack of ENOENT means that the last call of walk_component() (from lookup_last())
>> has not failed with the same ENOENT and thus the result has been observed
>> positive.
>> You've picked the easiest one to hit, but on e.g. KVM setups you can have the
>> host thread representing the CPU where __d_set_inode_and_type() runs get
>> preempted (by the host kernel), leaving others with much wider window.

I had thought so about the other call sites, but as you said
maybe this was the easiest one to hit.
Then, I kind of followed your suggested fix in below link to fix at 
least this current crash.
https://patchwork.kernel.org/patch/10909881/

>>
>> Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
>> the same goes for any places that assumes that d_is_dir() implies that
>> the sucker is positive, etc.
>>
>> What we have guaranteed is
>> 	* ->d_lock serializes ->d_flags/->d_inode changes
>> 	* ->d_seq is bumped before/after such changes
>> 	* positive dentry never changes ->d_inode as long as you hold
>> a reference (negative dentry *can* become positive right under you)
>>
>> So there are 3 classes of valid users: those holding ->d_lock, those
>> sampling and rechecking ->d_seq and those relying upon having observed
>> the sucker they've pinned to be positive.

:) Thanks for simplifying like this. Agreed.



>>
>> What you've been hitting is "we have it pinned, ->d_flags says it's
>> positive but we still observe the value of ->d_inode from before the
>> store to ->d_flags that has made it look positive".
> 
> Actually, your patch opens another problem there.  Suppose you make
> it d_really_is_positive() and hit the same race sans reordering.
> Dentry is found by __d_lookup() and is negative.  Right after we
> return from __d_lookup() another thread makes it positive (a symlink)
> - ->d_inode is set, d_really_is_positive() becomes true.  OK, on we
> go, pick the inode and move on.  Right?  ->d_flags is still not set
> by the thread that made it positive.  We return from lookup_fast()
> and call step_into().  And get to
>          if (likely(!d_is_symlink(path->dentry)) ||
> Which checks ->d_flags and sees the value from before the sucker
> became positive.  IOW, d_is_symlink() is false here.  If that
> was the last path component and we'd been told to follow links,
> we will end up with positive dentry of a symlink coming out of
> pathname resolution.
> 

hmm. yes, looks like it, based on your above explanation.
So even though this could avoid crash, but still we may end up with
a bogus entry with current patch.



> Similar fun happens if you have mkdir racing with lookup - ENOENT
> is what should've happened if lookup comes first, success - if
> mkdir does.  This way we can hit ENOTDIR, due to false negative
> from d_can_lookup().
> 
> IOW, d_really_is_negative() in lookup_fast() will paper over
> one of oopsen, but it
> 	* won't cover similar oopsen on other codepaths and
> 	* will lead to bogus behaviour.
> 
> I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
> is the right solution; it might very well be that we need to do that
> only on a small subset of call sites, lookup_fast() being one of
> those.  But we do want at least to be certain that something we'd
> got from lookup_fast() in non-RCU mode already has ->d_flags visible.

We may also need similar guarantees with __d_clear_type_and_inode().

So do you think we should make use of ->d_seq for verifying this?
I see both __d_set_inode_and_type & __d_clear_type_and_inode() called
under ->d_seq_begin/->d_seq_end.

Then maybe we should use ->d_seq checking at those call sites.
We cannot unconditionally use ->d_seq checking in __d_entry_type(),
since we sometimes call this function inside ->d_seq_begin
(like in lookup_fast).


> 
> I'm going through the callers right now, will post a followup once
> the things get cleaner...
> 
Thanks for looking into this.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-10-23 11:05         ` Ritesh Harjani
@ 2019-11-01 23:46           ` Al Viro
  2019-11-02  6:17             ` Al Viro
  2019-11-02 17:22             ` Paul E. McKenney
  0 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-11-01 23:46 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds, Paul E. McKenney

On Wed, Oct 23, 2019 at 04:35:50PM +0530, Ritesh Harjani wrote:

> > > What we have guaranteed is
> > > 	* ->d_lock serializes ->d_flags/->d_inode changes
> > > 	* ->d_seq is bumped before/after such changes
> > > 	* positive dentry never changes ->d_inode as long as you hold
> > > a reference (negative dentry *can* become positive right under you)
> > > 
> > > So there are 3 classes of valid users: those holding ->d_lock, those
> > > sampling and rechecking ->d_seq and those relying upon having observed
> > > the sucker they've pinned to be positive.
> 
> :) Thanks for simplifying like this. Agreed.

FWIW, after fixing several ceph bugs, add to that the following:
	* all places that turn a negative dentry into positive one are
holding its parent exclusive or dentry has not been observable for
anybody else.  It had been present in the parent's list of children
(negative and unhashed) and it might have been present in in-lookup
hashtable.  However, nobody is going to grab a reference to it from there
without having grabbed ->d_lock on it and observed the state after
it became positive. 

Which means that holding a reference to dentry *and* holding its
parent at least shared stabilizes both ->d_inode and type bits in
->d_flags.  The situation with barriers is more subtle - *IF* we
had sufficient barriers to have ->d_inode/type bits seen right
after having gotten the reference, we are fine.  The only change
possible after that point is negative->positive transition and
that gets taken care of by barriers provided by ->i_rwsem.

If we'd obtained that reference by d_lookup() or __d_lookup(),
we are fine - ->d_lock gives a barrier.  The same goes for places
that grab references during a tree traversal, provided that they
hold ->d_lock around that (fs/autofs/expire.c stuff).  The same goes
for having it found in inode's aliases list (->i_lock).

I really hope that the same applies to accesses to file_dentry(file);
on anything except alpha that would be pretty much automatic and
on alpha we get the things along the lines of

	f = fdt[n]
	mb
	d = f->f_path.dentry
	i = d->d_inode
	assert(i != NULL)
vs.
	see that d->d_inode is non-NULL
	f->f_path.dentry = d
	mb
	fdt[n] = f

IOW, the barriers that make it safe to fetch the fields of struct file
(rcu_dereference_raw() in __fcheck_files() vs. smp_store_release()
in __fd_install() in the above) should *hopefully* take care of all
stores visible by the time of do_dentry_open().  Sure, alpha cache
coherency is insane, but AFAICS it's not _that_ insane.

Question to folks familiar with alpha memory model:

A = 0, B = NULL, C = NULL
CPU1:
	A = 1

CPU2:
	r1 = A
	if (r1) {
		B = &A
		mb
		C = &B
	}

CPU3:
	r2 = C;
	mb
	if (r2) {	// &B
		r3 = *r2	// &A
		r4 = *r3	// 1
		assert(r4 == 1)
	}

is the above safe on alpha?

[snip]

> We may also need similar guarantees with __d_clear_type_and_inode().

Not really - pinned dentry can't go negative.  In any case, with the
audit I've done so far, I don't believe that blanket solutions like
that are good idea - most of the places doing checks are safe as it is.
The surface that needs to be taken care of is fairly small, actually;
most of that is in fs/namei.c and fs/dcache.c.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-11-01 23:46           ` Al Viro
@ 2019-11-02  6:17             ` Al Viro
  2019-11-02 17:24               ` Paul E. McKenney
  2019-11-02 17:22             ` Paul E. McKenney
  1 sibling, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-11-02  6:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds, Ritesh Harjani

On Fri, Nov 01, 2019 at 11:46:22PM +0000, Al Viro wrote:
> on anything except alpha that would be pretty much automatic and
> on alpha we get the things along the lines of
> 
> 	f = fdt[n]
> 	mb
> 	d = f->f_path.dentry
> 	i = d->d_inode
> 	assert(i != NULL)
> vs.
> 	see that d->d_inode is non-NULL
> 	f->f_path.dentry = d
> 	mb
> 	fdt[n] = f
> 
> IOW, the barriers that make it safe to fetch the fields of struct file
> (rcu_dereference_raw() in __fcheck_files() vs. smp_store_release()
> in __fd_install() in the above) should *hopefully* take care of all
> stores visible by the time of do_dentry_open().  Sure, alpha cache
> coherency is insane, but AFAICS it's not _that_ insane.
> 
> Question to folks familiar with alpha memory model:
> 
> A = 0, B = NULL, C = NULL
> CPU1:
> 	A = 1
> 
> CPU2:
> 	r1 = A
> 	if (r1) {
> 		B = &A
> 		mb
> 		C = &B
> 	}
> 
> CPU3:
> 	r2 = C;
> 	mb
> 	if (r2) {	// &B
> 		r3 = *r2	// &A
> 		r4 = *r3	// 1
> 		assert(r4 == 1)
> 	}
> 
> is the above safe on alpha?

Hmm...  After digging through alpha manuals, it should be -

U1: W A, 1

V1: R A, 1
V2: W B, &A
V3: MB
V4: W C, &B

W1: R C, &B
W2: MB
W3: R B, &A
W4: R A, 0

is rejected since
	U1 BEFORE V1 (storage and visibility)
	V1 BEFORE V3 BEFORE V4 (processor issue order constraints)
	V4 BEFORE W1 (storage and visibility)
	W1 BEFORE W2 BEFORE W4 (processor issue order constraints)
and W4 BEFORE U1 (storage and visibility), which is impossible
due to BEFORE being acyclic and transitive.

I might very well be missing something, though...  Paul, could you
take a look and tell if the above makes sense?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-11-01 23:46           ` Al Viro
  2019-11-02  6:17             ` Al Viro
@ 2019-11-02 17:22             ` Paul E. McKenney
  2019-11-02 18:08               ` Al Viro
  1 sibling, 1 reply; 22+ messages in thread
From: Paul E. McKenney @ 2019-11-02 17:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, wugyuan, jlayton,
	hsiangkao, Jan Kara, Linus Torvalds

On Fri, Nov 01, 2019 at 11:46:22PM +0000, Al Viro wrote:
> On Wed, Oct 23, 2019 at 04:35:50PM +0530, Ritesh Harjani wrote:
> 
> > > > What we have guaranteed is
> > > > 	* ->d_lock serializes ->d_flags/->d_inode changes
> > > > 	* ->d_seq is bumped before/after such changes
> > > > 	* positive dentry never changes ->d_inode as long as you hold
> > > > a reference (negative dentry *can* become positive right under you)
> > > > 
> > > > So there are 3 classes of valid users: those holding ->d_lock, those
> > > > sampling and rechecking ->d_seq and those relying upon having observed
> > > > the sucker they've pinned to be positive.
> > 
> > :) Thanks for simplifying like this. Agreed.
> 
> FWIW, after fixing several ceph bugs, add to that the following:
> 	* all places that turn a negative dentry into positive one are
> holding its parent exclusive or dentry has not been observable for
> anybody else.  It had been present in the parent's list of children
> (negative and unhashed) and it might have been present in in-lookup
> hashtable.  However, nobody is going to grab a reference to it from there
> without having grabbed ->d_lock on it and observed the state after
> it became positive. 
> 
> Which means that holding a reference to dentry *and* holding its
> parent at least shared stabilizes both ->d_inode and type bits in
> ->d_flags.  The situation with barriers is more subtle - *IF* we
> had sufficient barriers to have ->d_inode/type bits seen right
> after having gotten the reference, we are fine.  The only change
> possible after that point is negative->positive transition and
> that gets taken care of by barriers provided by ->i_rwsem.
> 
> If we'd obtained that reference by d_lookup() or __d_lookup(),
> we are fine - ->d_lock gives a barrier.  The same goes for places
> that grab references during a tree traversal, provided that they
> hold ->d_lock around that (fs/autofs/expire.c stuff).  The same goes
> for having it found in inode's aliases list (->i_lock).
> 
> I really hope that the same applies to accesses to file_dentry(file);
> on anything except alpha that would be pretty much automatic and
> on alpha we get the things along the lines of
> 
> 	f = fdt[n]
> 	mb
> 	d = f->f_path.dentry
> 	i = d->d_inode
> 	assert(i != NULL)
> vs.
> 	see that d->d_inode is non-NULL
> 	f->f_path.dentry = d
> 	mb
> 	fdt[n] = f

Ignoring the possibility of the more exotic compiler optimizations, if
the first task's load into f sees the value stored by the second task,
then the pair of memory barriers guarantee that the first task's load
into d will see the second task's store.

In fact, you could instead say this in recent kernels:

	f = READ_ONCE(fdt[n])  // provides dependency ordering via mb on Alpha
	mb
	d = f->f_path.dentry
	i = d->d_inode  // But this is OK only if ->f_path.entry is
			// constant throughout
	assert(i != NULL)
vs.
	see that d->d_inode is non-NULL
	f->f_path.dentry = d
	mb
	fdt[n] = f

The result of the first task's load into i requires information outside
of the two code fragments.

Or am I missing your point?

> IOW, the barriers that make it safe to fetch the fields of struct file
> (rcu_dereference_raw() in __fcheck_files() vs. smp_store_release()
> in __fd_install() in the above) should *hopefully* take care of all
> stores visible by the time of do_dentry_open().  Sure, alpha cache
> coherency is insane, but AFAICS it's not _that_ insane.

Agreed, not -that- insane.  ;-)

> Question to folks familiar with alpha memory model:
> 
> A = 0, B = NULL, C = NULL
> CPU1:
> 	A = 1
> 
> CPU2:
> 	r1 = A
> 	if (r1) {
> 		B = &A
> 		mb
> 		C = &B
> 	}
> 
> CPU3:
> 	r2 = C;
> 	mb
> 	if (r2) {	// &B
> 		r3 = *r2	// &A
> 		r4 = *r3	// 1
> 		assert(r4 == 1)
> 	}
> 
> is the above safe on alpha?

Looks that way to me.  LKMM agrees:

	C viro

	{
	}

	P0(int *a)
	{
		WRITE_ONCE(*a, 1);
	}

	P1(int *a, int **b, int ***c)
	{
		int r1;

		r1 = READ_ONCE(*a);
		if (r1) {
			WRITE_ONCE(*b, a);
			smp_mb();
			WRITE_ONCE(*c, b);
		}
	}

	P2(int ***c)
	{
		int **r2;
		int *r3;
		int r4;

		r2 = READ_ONCE(*c);
		smp_mb();
		if (r2) {
			r3 = READ_ONCE(*r2);
			r4 = READ_ONCE(*r3);
		}
	}

	exists (1:r1=1 /\ ~2:r2=0 /\ 2:r4=0)

Which gets us this:

	$ herd7 -conf linux-kernel.cfg /tmp/viro.litmus 
	Test viro Allowed
	States 3
	1:r1=0; 2:r2=0; 2:r4=0;
	1:r1=1; 2:r2=0; 2:r4=0;
	1:r1=1; 2:r2=b; 2:r4=1;
	No
	Witnesses
	Positive: 0 Negative: 3
	Condition exists (1:r1=1 /\ not (2:r2=0) /\ 2:r4=0)
	Observation viro Never 0 3
	Time viro 0.01
	Hash=cabcc7f3122771a04dd21686b2d58124

The state "1:r1=1; 2:r2=b; 2:r4=0;" does not appear, as expected.

							Thanx, Paul

> [snip]
> 
> > We may also need similar guarantees with __d_clear_type_and_inode().
> 
> Not really - pinned dentry can't go negative.  In any case, with the
> audit I've done so far, I don't believe that blanket solutions like
> that are good idea - most of the places doing checks are safe as it is.
> The surface that needs to be taken care of is fairly small, actually;
> most of that is in fs/namei.c and fs/dcache.c.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-11-02  6:17             ` Al Viro
@ 2019-11-02 17:24               ` Paul E. McKenney
  0 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2019-11-02 17:24 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds, Ritesh Harjani

On Sat, Nov 02, 2019 at 06:17:06AM +0000, Al Viro wrote:
> On Fri, Nov 01, 2019 at 11:46:22PM +0000, Al Viro wrote:
> > on anything except alpha that would be pretty much automatic and
> > on alpha we get the things along the lines of
> > 
> > 	f = fdt[n]
> > 	mb
> > 	d = f->f_path.dentry
> > 	i = d->d_inode
> > 	assert(i != NULL)
> > vs.
> > 	see that d->d_inode is non-NULL
> > 	f->f_path.dentry = d
> > 	mb
> > 	fdt[n] = f
> > 
> > IOW, the barriers that make it safe to fetch the fields of struct file
> > (rcu_dereference_raw() in __fcheck_files() vs. smp_store_release()
> > in __fd_install() in the above) should *hopefully* take care of all
> > stores visible by the time of do_dentry_open().  Sure, alpha cache
> > coherency is insane, but AFAICS it's not _that_ insane.
> > 
> > Question to folks familiar with alpha memory model:
> > 
> > A = 0, B = NULL, C = NULL
> > CPU1:
> > 	A = 1
> > 
> > CPU2:
> > 	r1 = A
> > 	if (r1) {
> > 		B = &A
> > 		mb
> > 		C = &B
> > 	}
> > 
> > CPU3:
> > 	r2 = C;
> > 	mb
> > 	if (r2) {	// &B
> > 		r3 = *r2	// &A
> > 		r4 = *r3	// 1
> > 		assert(r4 == 1)
> > 	}
> > 
> > is the above safe on alpha?
> 
> Hmm...  After digging through alpha manuals, it should be -
> 
> U1: W A, 1
> 
> V1: R A, 1

Assuming a compare and branch here ...

> V2: W B, &A
> V3: MB
> V4: W C, &B
> 
> W1: R C, &B
> W2: MB

... and here ...

> W3: R B, &A
> W4: R A, 0
> 
> is rejected since
> 	U1 BEFORE V1 (storage and visibility)
> 	V1 BEFORE V3 BEFORE V4 (processor issue order constraints)
> 	V4 BEFORE W1 (storage and visibility)
> 	W1 BEFORE W2 BEFORE W4 (processor issue order constraints)
> and W4 BEFORE U1 (storage and visibility), which is impossible
> due to BEFORE being acyclic and transitive.
> 
> I might very well be missing something, though...  Paul, could you
> take a look and tell if the above makes sense?

...  then yes, agreed.  Alpha does respect control dependencies to
stores, and you supplied the required mb for the last task that has
a control dependency only to loads.

I have to ask...  Are you seeing failures on Alpha?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-11-02 17:22             ` Paul E. McKenney
@ 2019-11-02 18:08               ` Al Viro
  2019-11-03 14:41                 ` Paul E. McKenney
                                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Al Viro @ 2019-11-02 18:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, wugyuan, jlayton,
	hsiangkao, Jan Kara, Linus Torvalds

On Sat, Nov 02, 2019 at 10:22:29AM -0700, Paul E. McKenney wrote:
> Ignoring the possibility of the more exotic compiler optimizations, if
> the first task's load into f sees the value stored by the second task,
> then the pair of memory barriers guarantee that the first task's load
> into d will see the second task's store.

The question was about the load into i being also safe.

> In fact, you could instead say this in recent kernels:
> 
> 	f = READ_ONCE(fdt[n])  // provides dependency ordering via mb on Alpha
> 	mb

Er... that mb comes from expanded READ_ONCE(), actually - the call chain
is
	fdget_pos() -> __fdget_pos() -> __fdget() -> __fget_light() ->
	__fcheck_files(), either directly or via
			__fget() -> fcheck_files() -> __fcheck_files()
	rcu_dereference_raw() -> READ_ONCE() -> smp_read_barrier_depends()
which yields mb on alpha.
						
> 	d = f->f_path.dentry
> 	i = d->d_inode  // But this is OK only if ->f_path.entry is
> 			// constant throughout

Yes, it is - once you hold a reference to a positive dentry, it can't
be made negative by anybody else.  See d_delete() for details; basically,
if you have refcount > 1, dentry will be unhashed, but not made negative.

> The result of the first task's load into i requires information outside
> of the two code fragments.
> 
> Or am I missing your point?

My point is that barriers sufficient to guarantee visibility of *f in
the reader will also suffice to guarantee visibility of *f->f_path.dentry.

On alpha it boils down to having load of d->d_inode when opening the
file orders before the barrier prior to storing the reference to f
in the descriptor table, so if it observes the store to d->d_inode done
by the same CPU, that store is ordered before the barrier due to
processor instruction order constraints and if it observes the store
to d->d_inode done by some other CPU, that store is ordered before
the load and before the barrier by transitivity.  So in either case
that store is ordered before the store into descriptor table.
IOW, the reader that has enough barriers to guarantee seing ->f_path.dentry
will be guaranteed to see ->f_path.dentry->d_inode.

And yes, we will need some barriers added near some positivity checks in
pathname resolution - that's what has started the entire thread.  This
part ("any place looking at file->f_path.dentry will have ->d_inode and
mode bits of ->d_flags visible and stable") covers quite a few places
that come up in the analysis...

This morning catch, BTW:

    audit_get_nd(): don't unlock parent too early
    
    if the child has been negative and just went positive
    under us, we want coherent d_is_positive() and ->d_inode.
    Don't unlock the parent until we'd done that work...
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index 1f31c2f1e6fc..4508d5e0cf69 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -351,12 +351,12 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
        struct dentry *d = kern_path_locked(watch->path, parent);
        if (IS_ERR(d))
                return PTR_ERR(d);
-       inode_unlock(d_backing_inode(parent->dentry));
        if (d_is_positive(d)) {
                /* update watch filter fields */
                watch->dev = d->d_sb->s_dev;
                watch->ino = d_backing_inode(d)->i_ino;
        }
+       inode_unlock(d_backing_inode(parent->dentry));
        dput(d);
        return 0;
 }

For other fun bits and pieces see ceph bugs caught this week and crap in
dget_parent() (not posted yet).  The former had been ceph violating the
"turning a previously observable dentry positive requires exclusive lock
on parent" rule, the latter - genuine insufficient barriers in the fast
path of dget_parent().

It is converging to a reasonably small and understandable surface, actually,
most of that being in core pathname resolution.  Two big piles of nightmares
left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
the latter due to IMA/EVM/LSM insanity...

There's also some secondary stuff dropping out of that (e.g. ceph seeding
dcache on readdir and blindly unhashing dentries it sees stale instead of
doing d_invalidate() as it ought to - leads to fun results if you had
something mounted on a subdirectory that got removed/recreated on server),
but that's a separate pile of joy - doesn't affect this analysis, so
it'll have to be dealt with later.  It had been an interesting couple of
weeks...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
  2019-11-02 18:08               ` Al Viro
@ 2019-11-03 14:41                 ` Paul E. McKenney
  2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
                                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Paul E. McKenney @ 2019-11-03 14:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, wugyuan, jlayton,
	hsiangkao, Jan Kara, Linus Torvalds

On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:
> On Sat, Nov 02, 2019 at 10:22:29AM -0700, Paul E. McKenney wrote:
> > Ignoring the possibility of the more exotic compiler optimizations, if
> > the first task's load into f sees the value stored by the second task,
> > then the pair of memory barriers guarantee that the first task's load
> > into d will see the second task's store.
> 
> The question was about the load into i being also safe.
> 
> > In fact, you could instead say this in recent kernels:
> > 
> > 	f = READ_ONCE(fdt[n])  // provides dependency ordering via mb on Alpha
> > 	mb
> 
> Er... that mb comes from expanded READ_ONCE(), actually - the call chain
> is
> 	fdget_pos() -> __fdget_pos() -> __fdget() -> __fget_light() ->
> 	__fcheck_files(), either directly or via
> 			__fget() -> fcheck_files() -> __fcheck_files()
> 	rcu_dereference_raw() -> READ_ONCE() -> smp_read_barrier_depends()
> which yields mb on alpha.

Exactly.  But yes, my "provides dependency ordering via mb on Alpha"
comment was not as clearly stated as it should have been.  Something about
my having dispensed with proofreading in order to catch my plane...  :-/

> > 	d = f->f_path.dentry
> > 	i = d->d_inode  // But this is OK only if ->f_path.entry is
> > 			// constant throughout
> 
> Yes, it is - once you hold a reference to a positive dentry, it can't
> be made negative by anybody else.  See d_delete() for details; basically,
> if you have refcount > 1, dentry will be unhashed, but not made negative.

Very good, then.  At least assuming that you are double-checking the ordering
(as opposed to trying to figure out what is wrong with the ordering).  ;-)

> > The result of the first task's load into i requires information outside
> > of the two code fragments.
> > 
> > Or am I missing your point?
> 
> My point is that barriers sufficient to guarantee visibility of *f in
> the reader will also suffice to guarantee visibility of *f->f_path.dentry.
> 
> On alpha it boils down to having load of d->d_inode when opening the
> file orders before the barrier prior to storing the reference to f
> in the descriptor table, so if it observes the store to d->d_inode done
> by the same CPU, that store is ordered before the barrier due to
> processor instruction order constraints and if it observes the store
> to d->d_inode done by some other CPU, that store is ordered before
> the load and before the barrier by transitivity.  So in either case
> that store is ordered before the store into descriptor table.
> IOW, the reader that has enough barriers to guarantee seing ->f_path.dentry
> will be guaranteed to see ->f_path.dentry->d_inode.
> 
> And yes, we will need some barriers added near some positivity checks in
> pathname resolution - that's what has started the entire thread.  This
> part ("any place looking at file->f_path.dentry will have ->d_inode and
> mode bits of ->d_flags visible and stable") covers quite a few places
> that come up in the analysis...
> 
> This morning catch, BTW:
> 
>     audit_get_nd(): don't unlock parent too early
>     
>     if the child has been negative and just went positive
>     under us, we want coherent d_is_positive() and ->d_inode.
>     Don't unlock the parent until we'd done that work...
>     
>     Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> index 1f31c2f1e6fc..4508d5e0cf69 100644
> --- a/kernel/audit_watch.c
> +++ b/kernel/audit_watch.c
> @@ -351,12 +351,12 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
>         struct dentry *d = kern_path_locked(watch->path, parent);
>         if (IS_ERR(d))
>                 return PTR_ERR(d);
> -       inode_unlock(d_backing_inode(parent->dentry));
>         if (d_is_positive(d)) {
>                 /* update watch filter fields */
>                 watch->dev = d->d_sb->s_dev;
>                 watch->ino = d_backing_inode(d)->i_ino;
>         }
> +       inode_unlock(d_backing_inode(parent->dentry));
>         dput(d);
>         return 0;
>  }
> 
> For other fun bits and pieces see ceph bugs caught this week and crap in
> dget_parent() (not posted yet).  The former had been ceph violating the
> "turning a previously observable dentry positive requires exclusive lock
> on parent" rule, the latter - genuine insufficient barriers in the fast
> path of dget_parent().
> 
> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution.  Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...
> 
> There's also some secondary stuff dropping out of that (e.g. ceph seeding
> dcache on readdir and blindly unhashing dentries it sees stale instead of
> doing d_invalidate() as it ought to - leads to fun results if you had
> something mounted on a subdirectory that got removed/recreated on server),
> but that's a separate pile of joy - doesn't affect this analysis, so
> it'll have to be dealt with later.  It had been an interesting couple of
> weeks...

Looks like you are making a very productive (if perhaps somewhat painful)
tour through the code, good show!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC] lookup_one_len_unlocked() lousy calling conventions
  2019-11-02 18:08               ` Al Viro
  2019-11-03 14:41                 ` Paul E. McKenney
@ 2019-11-03 16:35                 ` Al Viro
  2019-11-03 18:20                   ` Al Viro
  2019-11-03 17:05                 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
  2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
  3 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-11-03 16:35 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds

On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:

> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution.  Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...

One thing found while digging through overlayfs (and responsible for several
remaining pieces from the assorted pile):

lookup_one_len_unlocked() calling conventions are wrong for its callers.
Namely, 11 out of 12 callers really want ERR_PTR(-ENOENT) on negatives.
Most of them take care to check, some rely upon that being impossible in
their case.  Interactions with dentry turning positive right after
lookup_one_len_unlocked() has returned it are of varying bugginess...

The only exception is ecryptfs, where we do lookup in the underlying fs
on ecryptfs_lookup() and want to retain a negative dentry if we get one.

Not sure what's the best way to handle that - it looks like we want
a primitive with different behaviour on negatives, so the most conservative
variant would be to add such, leaving lookup_one_len_unlocked() use
for ecryptfs (and dealing with barriers properly in there), with
everybody else switching to lookup_positive_unlocked(), or whatever
we call that new primitive.

Other variants:

1) add a primitve with existing lookup_one_len_unlocked() behaviour,
changing lookup_one_len_unlocked() itselt to new calling conventions.
Switch ecryptfs to that, drop now-useless checks from other callers.

2) get rid of pinning negative underlying dentries in ecryptfs.
The cost will be doing lookup in underlying layer twice on something
like mkdir(2), the second one quite likely served out of dcache.
That would make _all_ callers of lookup_one_len_unlocked() need
the same calling conventions.

Preferences?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year)
  2019-11-02 18:08               ` Al Viro
  2019-11-03 14:41                 ` Paul E. McKenney
  2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
@ 2019-11-03 17:05                 ` Al Viro
  2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
  3 siblings, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-11-03 17:05 UTC (permalink / raw)
  To: Tyler Hicks; +Cc: ecryptfs, linux-fsdevel, linux-kernel

On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:

> There's also some secondary stuff dropping out of that (e.g. ceph seeding
> dcache on readdir and blindly unhashing dentries it sees stale instead of
> doing d_invalidate() as it ought to - leads to fun results if you had
> something mounted on a subdirectory that got removed/recreated on server),
> but that's a separate pile of joy - doesn't affect this analysis, so
> it'll have to be dealt with later.  It had been an interesting couple of
> weeks...

More secondaries: a piece of nastiness in ecryptfs unlink/rmdir -
have the underlying layer mounted elsewhere, look something up
via ecryptfs, move the underlying object via that mount elsewhere
and call unlink().  Ends up passing the wrong directory
inode to vfs_unlink() and then - to ->unlink() of the underlying
fs...  Embarrassing, since ecryptfs_rename() used to have exact
same problem, caught and fixed in "ecryptfs_rename(): verify that
lower dentries are still OK after lock_rename()" a year ago.
Bugs are like mushrooms - found one, look around for its relatives...

How about the following (completely untested) patch?  Tyler, do you
see any problems in that?

ecryptfs: fix unlink and rmdir in face of underlying fs modifications

A problem similar to the one caught in commit 74dd7c97ea2a ("ecryptfs_rename():
verify that lower dentries are still OK after lock_rename()") exists for
unlink/rmdir as well.

Instead of playing with dget_parent() of underlying dentry of victim
and hoping it's the same as underlying dentry of our directory,
do the following:
	* find the underlying dentry of victim
	* find the underlying directory of victim's parent (stable
since the victim is ecryptfs dentry and inode of its parent is
held exclusive by the caller).
	* lock the inode of dentry underlying the victim's parent
	* check that underlying dentry of victim is still hashed and
has the right parent - it can be moved, but it can't be moved to/from
the directory we are holding exclusive.  So while ->d_parent itself
might not be stable, the result of comparison is.

If the check passes, everything is fine - underlying directory is locked,
underlying victim is still a child of that directory and we can go ahead
and feed them to vfs_unlink().  As in the current mainline we need to
pin the underlying dentry of victim, so that it wouldn't go negative under
us, but that's the only temporary reference that needs to be grabbed there.
Underlying dentry of parent won't go away (it's pinned by the parent,
which is held by caller), so there's no need to grab it.

The same problem (with the same solution) exists for rmdir.  Moreover,
rename gets simpler and more robust with the same "don't bother with
dget_parent()" approach.

Fixes: 74dd7c97ea2 "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 18426f4855f1..a905d5f4f3b0 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -128,13 +128,20 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
 			      struct inode *inode)
 {
 	struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
-	struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
 	struct dentry *lower_dir_dentry;
+	struct inode *lower_dir_inode;
 	int rc;
 
-	dget(lower_dentry);
-	lower_dir_dentry = lock_parent(lower_dentry);
-	rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
+	lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
+	lower_dir_inode = d_inode(lower_dir_dentry);
+	inode_lock_nested(lower_dir_inode, I_MUTEX_PARENT);
+	dget(lower_dentry);	// don't even try to make the lower negative
+	if (lower_dentry->d_parent != lower_dir_dentry)
+		rc = -EINVAL;
+	else if (d_unhashed(lower_dentry))
+		rc = -EINVAL;
+	else
+		rc = vfs_unlink(lower_dir_inode, lower_dentry, NULL);
 	if (rc) {
 		printk(KERN_ERR "Error in vfs_unlink; rc = [%d]\n", rc);
 		goto out_unlock;
@@ -142,10 +149,11 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry,
 	fsstack_copy_attr_times(dir, lower_dir_inode);
 	set_nlink(inode, ecryptfs_inode_to_lower(inode)->i_nlink);
 	inode->i_ctime = dir->i_ctime;
-	d_drop(dentry);
 out_unlock:
-	unlock_dir(lower_dir_dentry);
 	dput(lower_dentry);
+	inode_unlock(lower_dir_inode);
+	if (!rc)
+		d_drop(dentry);
 	return rc;
 }
 
@@ -512,22 +520,30 @@ static int ecryptfs_rmdir(struct inode *dir, struct dentry *dentry)
 {
 	struct dentry *lower_dentry;
 	struct dentry *lower_dir_dentry;
+	struct inode *lower_dir_inode;
 	int rc;
 
 	lower_dentry = ecryptfs_dentry_to_lower(dentry);
-	dget(dentry);
-	lower_dir_dentry = lock_parent(lower_dentry);
-	dget(lower_dentry);
-	rc = vfs_rmdir(d_inode(lower_dir_dentry), lower_dentry);
-	dput(lower_dentry);
-	if (!rc && d_really_is_positive(dentry))
+	lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
+	lower_dir_inode = d_inode(lower_dir_dentry);
+
+	inode_lock_nested(lower_dir_inode, I_MUTEX_PARENT);
+	dget(lower_dentry);	// don't even try to make the lower negative
+	if (lower_dentry->d_parent != lower_dir_dentry)
+		rc = -EINVAL;
+	else if (d_unhashed(lower_dentry))
+		rc = -EINVAL;
+	else
+		rc = vfs_rmdir(lower_dir_inode, lower_dentry);
+	if (!rc) {
 		clear_nlink(d_inode(dentry));
-	fsstack_copy_attr_times(dir, d_inode(lower_dir_dentry));
-	set_nlink(dir, d_inode(lower_dir_dentry)->i_nlink);
-	unlock_dir(lower_dir_dentry);
+		fsstack_copy_attr_times(dir, lower_dir_inode);
+		set_nlink(dir, lower_dir_inode->i_nlink);
+	}
+	dput(lower_dentry);
+	inode_unlock(lower_dir_inode);
 	if (!rc)
 		d_drop(dentry);
-	dput(dentry);
 	return rc;
 }
 
@@ -565,20 +581,22 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	struct dentry *lower_new_dentry;
 	struct dentry *lower_old_dir_dentry;
 	struct dentry *lower_new_dir_dentry;
-	struct dentry *trap = NULL;
+	struct dentry *trap;
 	struct inode *target_inode;
 
 	if (flags)
 		return -EINVAL;
 
+	lower_old_dir_dentry = ecryptfs_dentry_to_lower(old_dentry->d_parent);
+	lower_new_dir_dentry = ecryptfs_dentry_to_lower(new_dentry->d_parent);
+
 	lower_old_dentry = ecryptfs_dentry_to_lower(old_dentry);
 	lower_new_dentry = ecryptfs_dentry_to_lower(new_dentry);
-	dget(lower_old_dentry);
-	dget(lower_new_dentry);
-	lower_old_dir_dentry = dget_parent(lower_old_dentry);
-	lower_new_dir_dentry = dget_parent(lower_new_dentry);
+
 	target_inode = d_inode(new_dentry);
+
 	trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+	dget(lower_new_dentry);
 	rc = -EINVAL;
 	if (lower_old_dentry->d_parent != lower_old_dir_dentry)
 		goto out_lock;
@@ -606,11 +624,8 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 	if (new_dir != old_dir)
 		fsstack_copy_attr_all(old_dir, d_inode(lower_old_dir_dentry));
 out_lock:
-	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
-	dput(lower_new_dir_dentry);
-	dput(lower_old_dir_dentry);
 	dput(lower_new_dentry);
-	dput(lower_old_dentry);
+	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
 	return rc;
 }
 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC] lookup_one_len_unlocked() lousy calling conventions
  2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
@ 2019-11-03 18:20                   ` Al Viro
  2019-11-03 18:51                     ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-11-03 18:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds, ecryptfs

On Sun, Nov 03, 2019 at 04:35:24PM +0000, Al Viro wrote:

> lookup_one_len_unlocked() calling conventions are wrong for its callers.
> Namely, 11 out of 12 callers really want ERR_PTR(-ENOENT) on negatives.
> Most of them take care to check, some rely upon that being impossible in
> their case.  Interactions with dentry turning positive right after
> lookup_one_len_unlocked() has returned it are of varying bugginess...
> 
> The only exception is ecryptfs, where we do lookup in the underlying fs
> on ecryptfs_lookup() and want to retain a negative dentry if we get one.

Looking at that code... the thing that deals with the result of lookup in
underlying fs is ecryptfs_lookup_interpose(), and there we have this:
        struct inode *inode, *lower_inode = d_inode(lower_dentry);
...
        dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
...
        if (d_really_is_negative(lower_dentry)) {
                /* We want to add because we couldn't find in lower */
                d_add(dentry, NULL);
                return NULL;
        }
        inode = __ecryptfs_get_inode(lower_inode, dentry->d_sb);

If lower dentry used to be negative, but went positive while we'd
been doing allocation, we'll get d_really_is_negative() (i.e.
!lower_dentry->d_inode) false, but lower_inode (fetched earlier)
still NULL.  __ecryptfs_get_inode() starts with
        if (lower_inode->i_sb != ecryptfs_superblock_to_lower(sb))
                return ERR_PTR(-EXDEV);
which won't be happy in that situation...  That has nothing to do
with barriers, ->d_flags, etc. - the window is rather wide here.
GFP_KERNEL allocation can block just fine.

IOW, the only caller of lookup_one_len_unlocked() that does not
reject negative dentries doesn't manage to handle them correctly ;-/

And then in the same ecryptfs_lookup_interpose() we have e.g.
        fsstack_copy_attr_atime(d_inode(dentry->d_parent),
                                d_inode(lower_dentry->d_parent));
Now, dentry->d_parent is stable; dentry is guaranteed to be new
and not yet visible to anybody else, besides it's negative and
the parent is held shared, so it couldn't have been moved around
even if it had been seen by somebody else.

However, lower_dentry->d_parent is a different story.  We are not holding
the lock on its parent anymore; it could've been moved around by somebody
mounting the underlying layer elsewhere and accessing it directly.
Moreover, there's nothing to guarantee that the pointer we fetch from
lower_dentry->d_parent won't be pointing to freed memory by the
time we get around to looking at its inode - lose the timeslice to
preemption just after fetching ->d_parent, have another process move
the damn thing around and there's nothing to keep the ex-parent
around by the time you regain CPU.

The problem goes all way back to addd65ad8d19 "eCryptfs: Filename Encryption:
filldir, lookup, and readlink" from 2009.  That turned
	lower_dir_dentry = ecryptfs_dentry_to_lower(dentry->d_parent);
into
	lower_dir_dentry = lower_dentry->d_parent;
and it had hit the fan...

Sure, "somebody mounted the underlying fs elsewhere and is actively
trying to screw us over" is not how ecryptfs is supposed to be used
and it can demonstrate unexpected behavior - odd errors, etc.
But that behaviour should not include oopsen and access to freed
kernel memory...

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  2019-11-03 18:20                   ` Al Viro
@ 2019-11-03 18:51                     ` Al Viro
  2019-11-03 19:03                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
  0 siblings, 1 reply; 22+ messages in thread
From: Al Viro @ 2019-11-03 18:51 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds, ecryptfs

lower_dentry can't go from positive to negative (we have it pinned),
but it *can* go from negative to positive.  So fetching ->d_inode
into a local variable, doing a blocking allocation, checking that
now ->d_inode is non-NULL and feeding the value we'd fetched
earlier to a function that won't accept NULL is not a good idea.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index a905d5f4f3b0..3c2298721359 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -319,7 +319,7 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode)
 static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
 				     struct dentry *lower_dentry)
 {
-	struct inode *inode, *lower_inode = d_inode(lower_dentry);
+	struct inode *inode, *lower_inode;
 	struct ecryptfs_dentry_info *dentry_info;
 	struct vfsmount *lower_mnt;
 	int rc = 0;
@@ -339,7 +339,15 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
 	dentry_info->lower_path.mnt = lower_mnt;
 	dentry_info->lower_path.dentry = lower_dentry;
 
-	if (d_really_is_negative(lower_dentry)) {
+	/*
+	 * negative dentry can go positive under us here - its parent is not
+	 * locked.  That's OK and that could happen just as we return from
+	 * ecryptfs_lookup() anyway.  Just need to be careful and fetch
+	 * ->d_inode only once - it's not stable here.
+	 */
+	lower_inode = READ_ONCE(lower_dentry->d_inode);
+
+	if (!lower_inode) {
 		/* We want to add because we couldn't find in lower */
 		d_add(dentry, NULL);
 		return NULL;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  2019-11-03 18:51                     ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
@ 2019-11-03 19:03                       ` Al Viro
  0 siblings, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-11-03 19:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Ritesh Harjani, linux-kernel, wugyuan, jlayton, hsiangkao,
	Jan Kara, Linus Torvalds, ecryptfs

We need to get the underlying dentry of parent; sure, absent the races
it is the parent of underlying dentry, but there's nothing to prevent
losing a timeslice to preemtion in the middle of evaluation of
lower_dentry->d_parent->d_inode, having another process move lower_dentry
around and have its (ex)parent not pinned anymore and freed on memory
pressure.  Then we regain CPU and try to fetch ->d_inode from memory
that is freed by that point.

dentry->d_parent *is* stable here - it's an argument of ->lookup() and
we are guaranteed that it won't be moved anywhere until we feed it
to d_add/d_splice_alias.  So we safely go that way to get to its
underlying dentry.

Cc: stable@vger.kernel.org # since 2009 or so
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 3c2298721359..e23752d9a79f 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -319,9 +319,9 @@ static int ecryptfs_i_size_read(struct dentry *dentry, struct inode *inode)
 static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
 				     struct dentry *lower_dentry)
 {
+	struct path *path = ecryptfs_dentry_to_lower_path(dentry->d_parent);
 	struct inode *inode, *lower_inode;
 	struct ecryptfs_dentry_info *dentry_info;
-	struct vfsmount *lower_mnt;
 	int rc = 0;
 
 	dentry_info = kmem_cache_alloc(ecryptfs_dentry_info_cache, GFP_KERNEL);
@@ -330,13 +330,12 @@ static struct dentry *ecryptfs_lookup_interpose(struct dentry *dentry,
 		return ERR_PTR(-ENOMEM);
 	}
 
-	lower_mnt = mntget(ecryptfs_dentry_to_lower_mnt(dentry->d_parent));
 	fsstack_copy_attr_atime(d_inode(dentry->d_parent),
-				d_inode(lower_dentry->d_parent));
+				d_inode(path->dentry));
 	BUG_ON(!d_count(lower_dentry));
 
 	ecryptfs_set_dentry_private(dentry, dentry_info);
-	dentry_info->lower_path.mnt = lower_mnt;
+	dentry_info->lower_path.mnt = mntget(path->mnt);
 	dentry_info->lower_path.dentry = lower_dentry;
 
 	/*

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH][RFC] race in exportfs_decode_fh()
  2019-11-02 18:08               ` Al Viro
                                   ` (2 preceding siblings ...)
  2019-11-03 17:05                 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
@ 2019-11-09  3:13                 ` Al Viro
  2019-11-09 16:55                   ` Linus Torvalds
  2019-11-11  9:16                   ` Christoph Hellwig
  3 siblings, 2 replies; 22+ messages in thread
From: Al Viro @ 2019-11-09  3:13 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Ritesh Harjani, linux-fsdevel, linux-kernel, wugyuan, jlayton,
	hsiangkao, Jan Kara, Linus Torvalds, Christoph Hellwig

On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:

> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution.  Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...

Oh, lovely - in exportfs_decode_fh() we have this:
                err = exportfs_get_name(mnt, target_dir, nbuf, result);
                if (!err) {
                        inode_lock(target_dir->d_inode);
                        nresult = lookup_one_len(nbuf, target_dir,
                                                 strlen(nbuf));
                        inode_unlock(target_dir->d_inode);
                        if (!IS_ERR(nresult)) {
                                if (nresult->d_inode) {
                                        dput(result);
                                        result = nresult;
                                } else
                                        dput(nresult);
                        }
                }
We have derived the parent from fhandle, we have a disconnected dentry for child,
we go look for the name.  We even find it.  Now, we want to look it up.  And
some bastard goes and unlinks it, just as we are trying to lock the parent.
We do a lookup, and get a negative dentry.  Then we unlock the parent... and
some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
is not NULL (anymore).  It has fuck-all to do with the original fhandle
(different inumber, etc.) but we happily accept it.

Even better, we have no barriers between our check and nresult becoming positive.
IOW, having observed non-NULL ->d_inode doesn't give us enough - e.g. we might
still see the old ->d_flags value, from back when ->d_inode used to be NULL.
On something like alpha we also have no promises that we'll observe anything
about the fields of nresult->d_inode, but ->d_flags alone is enough for fun.
The callers can't e.g. expect d_is_reg() et.al. to match the reality.

This is obviously bogus.  And the fix is obvious: check that nresult->d_inode is
equal to result->d_inode before unlocking the parent.  Note that we'd *already* had
the original result and all of its aliases rejected by the 'acceptable' predicate,
so if nresult doesn't supply us a better alias, we are SOL.

Does anyone see objections to the following patch?  Christoph, that seems to
be your code; am I missing something subtle here?  AFAICS, that goes back to
2007 or so...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 09bc68708d28..2dd55b172d57 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -519,26 +519,33 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
 		 * inode is actually connected to the parent.
 		 */
 		err = exportfs_get_name(mnt, target_dir, nbuf, result);
-		if (!err) {
-			inode_lock(target_dir->d_inode);
-			nresult = lookup_one_len(nbuf, target_dir,
-						 strlen(nbuf));
-			inode_unlock(target_dir->d_inode);
-			if (!IS_ERR(nresult)) {
-				if (nresult->d_inode) {
-					dput(result);
-					result = nresult;
-				} else
-					dput(nresult);
-			}
+		if (err) {
+			dput(target_dir);
+			goto err_result;
 		}
 
+		inode_lock(target_dir->d_inode);
+		nresult = lookup_one_len(nbuf, target_dir, strlen(nbuf));
+		if (!IS_ERR(nresult)) {
+			if (unlikely(nresult->d_inode != result->d_inode)) {
+				dput(nresult);
+				nresult = ERR_PTR(-ESTALE);
+			}
+		}
+		inode_unlock(target_dir->d_inode);
 		/*
 		 * At this point we are done with the parent, but it's pinned
 		 * by the child dentry anyway.
 		 */
 		dput(target_dir);
 
+		if (IS_ERR(nresult)) {
+			err = PTR_ERR(nresult);
+			goto err_result;
+		}
+		dput(result);
+		result = nresult;
+
 		/*
 		 * And finally make sure the dentry is actually acceptable
 		 * to NFSD.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][RFC] race in exportfs_decode_fh()
  2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
@ 2019-11-09 16:55                   ` Linus Torvalds
  2019-11-09 18:26                     ` Al Viro
  2019-11-11  9:16                   ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2019-11-09 16:55 UTC (permalink / raw)
  To: Al Viro
  Cc: J. Bruce Fields, Ritesh Harjani, linux-fsdevel,
	Linux Kernel Mailing List, wugyuan, Jeff Layton, hsiangkao,
	Jan Kara, Christoph Hellwig

On Fri, Nov 8, 2019 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> We have derived the parent from fhandle, we have a disconnected dentry for child,
> we go look for the name.  We even find it.  Now, we want to look it up.  And
> some bastard goes and unlinks it, just as we are trying to lock the parent.
> We do a lookup, and get a negative dentry.  Then we unlock the parent... and
> some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
> is not NULL (anymore).  It has fuck-all to do with the original fhandle
> (different inumber, etc.) but we happily accept it.

No arguments with your patch, although I doubt that this case has
actually ever happened in practice ;)

              Linus

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][RFC] race in exportfs_decode_fh()
  2019-11-09 16:55                   ` Linus Torvalds
@ 2019-11-09 18:26                     ` Al Viro
  0 siblings, 0 replies; 22+ messages in thread
From: Al Viro @ 2019-11-09 18:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: J. Bruce Fields, Ritesh Harjani, linux-fsdevel,
	Linux Kernel Mailing List, wugyuan, Jeff Layton, hsiangkao,
	Jan Kara, Christoph Hellwig

On Sat, Nov 09, 2019 at 08:55:38AM -0800, Linus Torvalds wrote:
> On Fri, Nov 8, 2019 at 7:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > We have derived the parent from fhandle, we have a disconnected dentry for child,
> > we go look for the name.  We even find it.  Now, we want to look it up.  And
> > some bastard goes and unlinks it, just as we are trying to lock the parent.
> > We do a lookup, and get a negative dentry.  Then we unlock the parent... and
> > some other bastard does e.g. mkdir with the same name.  OK, nresult->d_inode
> > is not NULL (anymore).  It has fuck-all to do with the original fhandle
> > (different inumber, etc.) but we happily accept it.
> 
> No arguments with your patch, although I doubt that this case has
> actually ever happened in practice ;)

Frankly, by this point I'm rather tempted to introduce new sparse annotation for
dentries - "might not be positive".  The thing is, there are four cases when
->d_inode is guaranteed to be stable:
	1) nobody else has seen that dentry; that includes in-lookup ones - they
can be found in in-lookup hash by d_alloc_parallel(), but it'll wait until they
cease to be in-lookup.
	2) ->d_lock is held
	3) pinned positive
	4) pinned, parent held at least shared
Anything else can have ->d_inode changed by another thread.  And class 3 is by
far the most common.  As the matter of fact, most of dentry pointers in the
system (as opposed to (h)lists traversing dentries) are such.

For obvious reasons we have shitloads of ->d_inode accesses; I'd been going
through a tree-wide audit of those and doing that manually is bloody unpleasant.
And we do have races in that area - the one above is the latest I'd caught;
there had been more and I'm fairly sure that it's not the last.

Redoing that kind of audit every once in a while is something I wouldn't
wish on anyone; it would be nice to use sparse to narrow the field.  Note
that simply checking ->d_inode (or ->d_flags) is not enough unless we
know them to be stable.  E.g. callers of lookup_one_len_unlocked() tend
to be racy exactly because of such tests.  I'm going to add
lookup_positive_unlocked() that would return ERR_PTR(-ENOENT) on
negatives and convert the callers - with one exception they treat
negatives the same way they would treat ERR_PTR(-ENOENT) and their
tests are racy.  I'd rather have sufficient barriers done in one common
helper, instead of trying to add them in 11 places and hope that new
users won't fuck it up...

I'm still not sure what's the best way to do the annotations - propagation
is not a big deal, but expressing the transitions and checking that
maybe-negative ones are not misused is trickier.  The last part can
be actually left to manual audit - annotations would already reduce
the search area big way.  A not too ugly way to say that now this dentry
is known to be non-negative, OTOH...  I want to finish the audit and
take a good look at the list of places where such transitions happen.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH][RFC] race in exportfs_decode_fh()
  2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
  2019-11-09 16:55                   ` Linus Torvalds
@ 2019-11-11  9:16                   ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2019-11-11  9:16 UTC (permalink / raw)
  To: Al Viro
  Cc: J. Bruce Fields, Ritesh Harjani, linux-fsdevel, linux-kernel,
	wugyuan, jlayton, hsiangkao, Jan Kara, Linus Torvalds,
	Christoph Hellwig

On Sat, Nov 09, 2019 at 03:13:33AM +0000, Al Viro wrote:
> Does anyone see objections to the following patch?  Christoph, that seems to
> be your code; am I missing something subtle here?  AFAICS, that goes back to
> 2007 or so...

This goes back to way before that, that series jut factored out proper
export operations from the two inode or superblock methods we had before
with the rest handled in core code somewhere that made a complete
mess of file systems with 64-bit inode numbers.

Otherwise this looks fine, although splitting the refactoring from
the actual change would make for a much more readable series.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, back to index

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-27  4:42 [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Ritesh Harjani
2019-10-15  4:07 ` Ritesh Harjani
2019-10-22 13:38   ` Ritesh Harjani
2019-10-22 14:37     ` Al Viro
2019-10-22 14:50       ` Al Viro
2019-10-22 20:11       ` Al Viro
2019-10-23 11:05         ` Ritesh Harjani
2019-11-01 23:46           ` Al Viro
2019-11-02  6:17             ` Al Viro
2019-11-02 17:24               ` Paul E. McKenney
2019-11-02 17:22             ` Paul E. McKenney
2019-11-02 18:08               ` Al Viro
2019-11-03 14:41                 ` Paul E. McKenney
2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
2019-11-03 18:20                   ` Al Viro
2019-11-03 18:51                     ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
2019-11-03 19:03                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
2019-11-03 17:05                 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
2019-11-09 16:55                   ` Linus Torvalds
2019-11-09 18:26                     ` Al Viro
2019-11-11  9:16                   ` Christoph Hellwig

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git