linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	wugyuan@cn.ibm.com, jlayton@kernel.org, hsiangkao@aol.com,
	Jan Kara <jack@suse.cz>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast
Date: Wed, 23 Oct 2019 16:35:50 +0530	[thread overview]
Message-ID: <20191023110551.D04AE4C044@d06av22.portsmouth.uk.ibm.com> (raw)
In-Reply-To: <20191022201131.GZ26530@ZenIV.linux.org.uk>



On 10/23/19 1:41 AM, Al Viro wrote:
> On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:
>> On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
>>> I think we have still not taken this patch. Al?

>> or, for that matter, any callers of filename_lookup() assuming that the
>> lack of ENOENT means that the last call of walk_component() (from lookup_last())
>> has not failed with the same ENOENT and thus the result has been observed
>> positive.
>> You've picked the easiest one to hit, but on e.g. KVM setups you can have the
>> host thread representing the CPU where __d_set_inode_and_type() runs get
>> preempted (by the host kernel), leaving others with much wider window.

I had thought so about the other call sites, but as you said
maybe this was the easiest one to hit.
Then, I kind of followed your suggested fix in below link to fix at 
least this current crash.
https://patchwork.kernel.org/patch/10909881/

>>
>> Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
>> the same goes for any places that assumes that d_is_dir() implies that
>> the sucker is positive, etc.
>>
>> What we have guaranteed is
>> 	* ->d_lock serializes ->d_flags/->d_inode changes
>> 	* ->d_seq is bumped before/after such changes
>> 	* positive dentry never changes ->d_inode as long as you hold
>> a reference (negative dentry *can* become positive right under you)
>>
>> So there are 3 classes of valid users: those holding ->d_lock, those
>> sampling and rechecking ->d_seq and those relying upon having observed
>> the sucker they've pinned to be positive.

:) Thanks for simplifying like this. Agreed.



>>
>> What you've been hitting is "we have it pinned, ->d_flags says it's
>> positive but we still observe the value of ->d_inode from before the
>> store to ->d_flags that has made it look positive".
> 
> Actually, your patch opens another problem there.  Suppose you make
> it d_really_is_positive() and hit the same race sans reordering.
> Dentry is found by __d_lookup() and is negative.  Right after we
> return from __d_lookup() another thread makes it positive (a symlink)
> - ->d_inode is set, d_really_is_positive() becomes true.  OK, on we
> go, pick the inode and move on.  Right?  ->d_flags is still not set
> by the thread that made it positive.  We return from lookup_fast()
> and call step_into().  And get to
>          if (likely(!d_is_symlink(path->dentry)) ||
> Which checks ->d_flags and sees the value from before the sucker
> became positive.  IOW, d_is_symlink() is false here.  If that
> was the last path component and we'd been told to follow links,
> we will end up with positive dentry of a symlink coming out of
> pathname resolution.
> 

hmm. yes, looks like it, based on your above explanation.
So even though this could avoid crash, but still we may end up with
a bogus entry with current patch.



> Similar fun happens if you have mkdir racing with lookup - ENOENT
> is what should've happened if lookup comes first, success - if
> mkdir does.  This way we can hit ENOTDIR, due to false negative
> from d_can_lookup().
> 
> IOW, d_really_is_negative() in lookup_fast() will paper over
> one of oopsen, but it
> 	* won't cover similar oopsen on other codepaths and
> 	* will lead to bogus behaviour.
> 
> I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
> is the right solution; it might very well be that we need to do that
> only on a small subset of call sites, lookup_fast() being one of
> those.  But we do want at least to be certain that something we'd
> got from lookup_fast() in non-RCU mode already has ->d_flags visible.

We may also need similar guarantees with __d_clear_type_and_inode().

So do you think we should make use of ->d_seq for verifying this?
I see both __d_set_inode_and_type & __d_clear_type_and_inode() called
under ->d_seq_begin/->d_seq_end.

Then maybe we should use ->d_seq checking at those call sites.
We cannot unconditionally use ->d_seq checking in __d_entry_type(),
since we sometimes call this function inside ->d_seq_begin
(like in lookup_fast).


> 
> I'm going through the callers right now, will post a followup once
> the things get cleaner...
> 
Thanks for looking into this.


  reply	other threads:[~2019-10-23 11:06 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-27  4:42 [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Ritesh Harjani
2019-10-15  4:07 ` Ritesh Harjani
2019-10-22 13:38   ` Ritesh Harjani
2019-10-22 14:37     ` Al Viro
2019-10-22 14:50       ` Al Viro
2019-10-22 20:11       ` Al Viro
2019-10-23 11:05         ` Ritesh Harjani [this message]
2019-11-01 23:46           ` Al Viro
2019-11-02  6:17             ` Al Viro
2019-11-02 17:24               ` Paul E. McKenney
2019-11-02 17:22             ` Paul E. McKenney
2019-11-02 18:08               ` Al Viro
2019-11-03 14:41                 ` Paul E. McKenney
2019-11-03 16:35                 ` [RFC] lookup_one_len_unlocked() lousy calling conventions Al Viro
2019-11-03 18:20                   ` Al Viro
2019-11-03 18:51                     ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Al Viro
2019-11-03 19:03                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either Al Viro
2019-11-13  7:01                       ` [PATCH][RFC] ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable Amir Goldstein
2019-11-13 12:52                         ` Al Viro
2019-11-13 16:22                           ` Amir Goldstein
2019-11-13 20:18                           ` Jean-Louis Biasini
2019-11-03 17:05                 ` [PATCH][RFC] ecryptfs unlink/rmdir breakage (similar to caught in ecryptfs rename last year) Al Viro
2019-11-09  3:13                 ` [PATCH][RFC] race in exportfs_decode_fh() Al Viro
2019-11-09 16:55                   ` Linus Torvalds
2019-11-09 18:26                     ` Al Viro
2019-11-11  9:16                   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191023110551.D04AE4C044@d06av22.portsmouth.uk.ibm.com \
    --to=riteshh@linux.ibm.com \
    --cc=hsiangkao@aol.com \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wugyuan@cn.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).