All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: linux-fsdevel@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>
Cc: aneesh.kumar@linux.ibm.com, Jeff Layton <jlayton@kernel.org>,
	wugyuan@cn.ibm.com
Subject: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
Date: Tue, 3 Sep 2019 17:28:26 +0530	[thread overview]
Message-ID: <20190903115827.0A8A0A405B@b06wcsmtp001.portsmouth.uk.ibm.com> (raw)

Hi Viro/All,

Could you please review below issue and it's proposed solutions.
If you could let me know which of the two you think will be a better 
approach to solve this or in case if you have any other better approach, 
I can prepare and submit a official patch with that.



Issue signature:-
  [NIP  : trailing_symlink+80]
  [LR   : trailing_symlink+1092]
  #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
  #5 [c00000198069bc00] path_openat at c0000000004bdd14
  #6 [c00000198069bc90] do_filp_open at c0000000004c0274
  #7 [c00000198069bdb0] do_sys_open at c00000000049b248
  #8 [c00000198069be30] system_call at c00000000000b388



Test case:-
shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1 
/gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done



Problem description:-
In some filesystems like GPFS below described scenario may happen on 
some platforms (Reported-By:- wugyuan)

Here, two threads are being run in 2 different shells. Thread-1(cat) 
does cat of the symlink and Thread-2(ln) is creating the symlink.

Now on any platform with GPFS like filesystem, if CPU does out-of-order 
execution (or any kind of re-ordering due compiler optimization?) in 
function __d_set_and_inode_type(), then we see a NULL pointer 
dereference due to inode->i_uid.

This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. 
in else condition), we check d_is_negative() without any lock protection.
And since in __d_set_and_inode_type() re-ordering may happen in setting 
of dentry->type & dentry->inode => this means that there is this tiny 
window where things are going wrong.


(GPFS like):- Any FS with -inode_operations ->permission callback 
returning -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this 
problem to happen. (few e.g. found were - ocfs2, ceph, coda, afs)

int xxx_permission(struct inode *inode, int mask)
{
          if (mask & MAY_NOT_BLOCK)
                  return -ECHILD;
	<...>
}

Wugyuan(cc), could reproduce this problem with GPFS filesystem.
Since, I didn't have the GPFS setup, so I tried replicating on a native 
FS by forcing out-of-order execution in function 
__d_set_inode_and_type() and making sure we return -ECHILD in 
MAY_NOT_BLOCK case in ->permission operation for all inodes.

With above changes in kernel, I could as well hit this issue on a native 
FS too.

(basically what we observed is link_path_walk will do nonRCU(REF-walk) 
lookup due to may_lookup -> inode_permission return -ECHILD and then 
unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race 
is possible).



Sequence of events:-

Thread-2(Comm: ln)		Thread-1(Comm: cat)

				dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
	flags = READ_ONCE(dentry->d_flags);
	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
	flags |= type_flags;
	WRITE_ONCE(dentry->d_flags, flags);

					
				if (unlikely(d_is_negative()) // fails
   					{}
				// since type is already updated in
				// Thread-2 in parallel but inode
				// not yet set.
				// d_is_negative returns false

				*inode = d_backing_inode(path->dentry);
				// means inode is still NULL

	dentry->d_inode = inode;
	
				trailing_symlink()
					may_follow_link()
						inode = nd->link_inode;
						// nd->link_inode = NULL
						//Then it crashes while
						//doing inode->i_uid
					
	



Approach-1:- using wmb()

diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..966172df77ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,7 @@ static inline void __d_set_inode_and_type(struct 
dentry *dentry,
         unsigned flags;

         dentry->d_inode = inode;
+       smp_wmb();
         flags = READ_ONCE(dentry->d_flags);
         flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
         flags |= type_flags;



Approach-2:- using spin_lock(&dentry->d_lock);

Do you think lock should be a better approach, given that we are already
in REF-walk mode. As per the Documentation, we should be able to take
spin_lock(&dentry->d_lock) in Ref-walk mode whenever required?


With smp_wmb(), if added, should add a small latency in both
RCU/REF-walk. But should be able to cover all the cases in general 
related to dentry->type & dentry->inode ordering. Though, we don't have 
any other race conditions reported or tested, other than the one, 
mentioned in this email.

Confused :(



diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..a3145594da1c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1557,6 +1557,7 @@ static int lookup_fast(struct nameidata *nd,
         struct dentry *dentry, *parent = nd->path.dentry;
         int status = 1;
         int err;
+       bool negative;

         /*
          * Rename seqlock is not required here because in the off chance
@@ -1565,7 +1566,6 @@ static int lookup_fast(struct nameidata *nd,
          */
         if (nd->flags & LOOKUP_RCU) {
                 unsigned seq;
-               bool negative;
                 dentry = __d_lookup_rcu(parent, &nd->last, &seq);
                 if (unlikely(!dentry)) {
                         if (unlazy_walk(nd))
@@ -1623,7 +1623,11 @@ static int lookup_fast(struct nameidata *nd,
                 dput(dentry);
                 return status;
         }
-       if (unlikely(d_is_negative(dentry))) {
+
+       spin_lock(&dentry->d_lock);
+       negative = d_is_negative(dentry);
+       spin_unlock(&dentry->d_lock);
+       if (unlikely(negative)) {
                 dput(dentry);
                 return -ENOENT;
         }


Regards
Ritesh


             reply	other threads:[~2019-09-03 11:58 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-03 11:58 Ritesh Harjani [this message]
2019-09-03 12:59 ` [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink Gao Xiang
2019-09-03 13:41   ` Ritesh Harjani
2019-09-03 13:58     ` Gao Xiang
2019-09-04 14:39     ` Jeff Layton
2019-09-06  5:17       ` Ritesh Harjani

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190903115827.0A8A0A405B@b06wcsmtp001.portsmouth.uk.ibm.com \
    --to=riteshh@linux.ibm.com \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=jlayton@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wugyuan@cn.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.