[RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink

From: Ritesh Harjani <riteshh@linux.ibm.com>
To: linux-fsdevel@vger.kernel.org, Alexander Viro <viro@zeniv.linux.org.uk>
Cc: aneesh.kumar@linux.ibm.com, Jeff Layton <jlayton@kernel.org>,
	wugyuan@cn.ibm.com
Subject: [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
Date: Tue, 3 Sep 2019 17:28:26 +0530	[thread overview]
Message-ID: <20190903115827.0A8A0A405B@b06wcsmtp001.portsmouth.uk.ibm.com> (raw)

Hi Viro/All,

Could you please review below issue and it's proposed solutions.
If you could let me know which of the two you think will be a better 
approach to solve this or in case if you have any other better approach, 
I can prepare and submit a official patch with that.



Issue signature:-
  [NIP  : trailing_symlink+80]
  [LR   : trailing_symlink+1092]
  #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
  #5 [c00000198069bc00] path_openat at c0000000004bdd14
  #6 [c00000198069bc90] do_filp_open at c0000000004c0274
  #7 [c00000198069bdb0] do_sys_open at c00000000049b248
  #8 [c00000198069be30] system_call at c00000000000b388



Test case:-
shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1 
/gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done



Problem description:-
In some filesystems like GPFS below described scenario may happen on 
some platforms (Reported-By:- wugyuan)

Here, two threads are being run in 2 different shells. Thread-1(cat) 
does cat of the symlink and Thread-2(ln) is creating the symlink.

Now on any platform with GPFS like filesystem, if CPU does out-of-order 
execution (or any kind of re-ordering due compiler optimization?) in 
function __d_set_and_inode_type(), then we see a NULL pointer 
dereference due to inode->i_uid.

This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. 
in else condition), we check d_is_negative() without any lock protection.
And since in __d_set_and_inode_type() re-ordering may happen in setting 
of dentry->type & dentry->inode => this means that there is this tiny 
window where things are going wrong.


(GPFS like):- Any FS with -inode_operations ->permission callback 
returning -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this 
problem to happen. (few e.g. found were - ocfs2, ceph, coda, afs)

int xxx_permission(struct inode *inode, int mask)
{
          if (mask & MAY_NOT_BLOCK)
                  return -ECHILD;
	<...>
}

Wugyuan(cc), could reproduce this problem with GPFS filesystem.
Since, I didn't have the GPFS setup, so I tried replicating on a native 
FS by forcing out-of-order execution in function 
__d_set_inode_and_type() and making sure we return -ECHILD in 
MAY_NOT_BLOCK case in ->permission operation for all inodes.

With above changes in kernel, I could as well hit this issue on a native 
FS too.

(basically what we observed is link_path_walk will do nonRCU(REF-walk) 
lookup due to may_lookup -> inode_permission return -ECHILD and then 
unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race 
is possible).



Sequence of events:-

Thread-2(Comm: ln)		Thread-1(Comm: cat)

				dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
	flags = READ_ONCE(dentry->d_flags);
	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
	flags |= type_flags;
	WRITE_ONCE(dentry->d_flags, flags);

					
				if (unlikely(d_is_negative()) // fails
   					{}
				// since type is already updated in
				// Thread-2 in parallel but inode
				// not yet set.
				// d_is_negative returns false

				*inode = d_backing_inode(path->dentry);
				// means inode is still NULL

	dentry->d_inode = inode;
	
				trailing_symlink()
					may_follow_link()
						inode = nd->link_inode;
						// nd->link_inode = NULL
						//Then it crashes while
						//doing inode->i_uid
					
	



Approach-1:- using wmb()

diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..966172df77ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,7 @@ static inline void __d_set_inode_and_type(struct 
dentry *dentry,
         unsigned flags;

         dentry->d_inode = inode;
+       smp_wmb();
         flags = READ_ONCE(dentry->d_flags);
         flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
         flags |= type_flags;



Approach-2:- using spin_lock(&dentry->d_lock);

Do you think lock should be a better approach, given that we are already
in REF-walk mode. As per the Documentation, we should be able to take
spin_lock(&dentry->d_lock) in Ref-walk mode whenever required?


With smp_wmb(), if added, should add a small latency in both
RCU/REF-walk. But should be able to cover all the cases in general 
related to dentry->type & dentry->inode ordering. Though, we don't have 
any other race conditions reported or tested, other than the one, 
mentioned in this email.

Confused :(



diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..a3145594da1c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1557,6 +1557,7 @@ static int lookup_fast(struct nameidata *nd,
         struct dentry *dentry, *parent = nd->path.dentry;
         int status = 1;
         int err;
+       bool negative;

         /*
          * Rename seqlock is not required here because in the off chance
@@ -1565,7 +1566,6 @@ static int lookup_fast(struct nameidata *nd,
          */
         if (nd->flags & LOOKUP_RCU) {
                 unsigned seq;
-               bool negative;
                 dentry = __d_lookup_rcu(parent, &nd->last, &seq);
                 if (unlikely(!dentry)) {
                         if (unlazy_walk(nd))
@@ -1623,7 +1623,11 @@ static int lookup_fast(struct nameidata *nd,
                 dput(dentry);
                 return status;
         }
-       if (unlikely(d_is_negative(dentry))) {
+
+       spin_lock(&dentry->d_lock);
+       negative = d_is_negative(dentry);
+       spin_unlock(&dentry->d_lock);
+       if (unlikely(negative)) {
                 dput(dentry);
                 return -ENOENT;
         }


Regards
Ritesh