All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/1] dcache: Translating dentry into pathname without taking rename_lock
@ 2013-09-05 18:55 Waiman Long
  2013-09-05 18:55 ` [PATCH v2 1/1] " Waiman Long
  0 siblings, 1 reply; 11+ messages in thread
From: Waiman Long @ 2013-09-05 18:55 UTC (permalink / raw)
  To: Alexander Viro, Linus Torvalds
  Cc: Waiman Long, linux-fsdevel, linux-kernel, Chandramouleeswaran,
	Aswin, Norton, Scott J, George Spelvin, John Stoffel

Change History

v1->v2:
 - Check for internal vs external dname, taking d_lock only for
   external dname for safety.
 - Replace memchr() by a byte-by-byte checking for loop.
 - Try lockless dentry to pathname conversion 3 times before falling
   back to taking the rename_lock to prevent live-lock.
 - Make code re-factoring suggested by George Spelvin.

Waiman Long (1):
  dcache: Translating dentry into pathname without taking rename_lock

 fs/dcache.c |  213 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 151 insertions(+), 62 deletions(-)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 18:55 [PATCH v2 0/1] dcache: Translating dentry into pathname without taking rename_lock Waiman Long
@ 2013-09-05 18:55 ` Waiman Long
  2013-09-05 19:35   ` Linus Torvalds
  2013-09-05 20:04   ` Al Viro
  0 siblings, 2 replies; 11+ messages in thread
From: Waiman Long @ 2013-09-05 18:55 UTC (permalink / raw)
  To: Alexander Viro, Linus Torvalds
  Cc: Waiman Long, linux-fsdevel, linux-kernel, Chandramouleeswaran,
	Aswin, Norton, Scott J, George Spelvin, John Stoffel

When running the AIM7's short workload, Linus' lockref patch eliminated
most of the spinlock contention. However, there were still some left:

     8.46%     reaim  [kernel.kallsyms]     [k] _raw_spin_lock
                 |--42.21%-- d_path
                 |          proc_pid_readlink
                 |          SyS_readlinkat
                 |          SyS_readlink
                 |          system_call
                 |          __GI___readlink
                 |
                 |--40.97%-- sys_getcwd
                 |          system_call
                 |          __getcwd

The big one here is the rename_lock (seqlock) contention in d_path()
and the getcwd system call. This patch will eliminate the need to take
the rename_lock while translating dentries into the full pathnames.

The need to take the rename_lock is to make sure that no rename
operation can be ongoing while the translation is in progress. However,
only one thread can take the rename_lock thus blocking all the other
threads that need it even though the translation process won't make
any change to the dentries.

This patch will replace the writer's write_seqlock/write_sequnlock
sequence of the rename_lock of the callers of the prepend_path() and
__dentry_path() functions with the reader's read_seqbegin/read_seqretry
sequence within these 2 functions. As a result, the code will have to
retry if one or more rename operations had been performed. In addition,
RCU read lock will be taken during the translation process to make sure
that no dentries will go away. To prevent live-lock from happening,
the code will switch back to take the rename_lock if read_seqretry()
fails for three times.

To further reduce spinlock contention, this patch also tries not
to take the dentry's d_lock if the dentry name is internal. For
external dentry name, it is safer to take the d_lock as racing with
d_move() may cause it to access invalid memory location leading to
segmentation fault.  For safety, there is also additional check to
see if the name string is valid (no embedded null).

The following code re-factoring are also made:
1. Move prepend('/') into prepend_name().
2. Move the global root check in prepend_path() back to the top of
   the while loop.

With this patch, the _raw_spin_lock will now account for only 1.2%
of the total CPU cycles for the short workload. This patch also has
the effect of reducing the effect of running perf on its profile
since the perf command itself can be a heavy user of the d_path()
function depending on the complexity of the workload.

When taking the perf profile of the high-systime workload, the amount
of spinlock contention contributed by running perf without this patch
was about 16%. With this patch, the spinlock contention caused by
the running of perf will go away and we will have a more accurate
perf profile.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 fs/dcache.c |  213 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 151 insertions(+), 62 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 96655f4..9703aa6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2510,9 +2510,57 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
 	return 0;
 }
 
-static int prepend_name(char **buffer, int *buflen, struct qstr *name)
+static int prepend_name(char **buffer, int *buflen, struct dentry *dentry)
 {
-	return prepend(buffer, buflen, name->name, name->len);
+	/*
+	 * With RCU path tracing, it may race with rename. Use
+	 * ACCESS_ONCE() to make sure that it is either the old or
+	 * the new name pointer. The length does not really matter as
+	 * the sequence number check will eventually catch any ongoing
+	 * rename operation.
+	 */
+	const char *dname = ACCESS_ONCE(dentry->d_name.name);
+	u32 dlen = dentry->d_name.len;
+	int error;
+
+	if (likely(dname == (const char *)dentry->d_iname)) {
+		/*
+		 * Internal dname, the string memory is valid as long
+		 * as its length is not over the limit.
+		 */
+		if (unlikely(dlen > sizeof(dentry->d_iname)))
+			return -EINVAL;
+	} else if (!dname)
+		return -EINVAL;
+	else {
+		const char *ptr;
+		u32 len;
+
+		/*
+		 * External dname, need to fetch name pointer and length
+		 * again under d_lock to get a consistent set and avoid
+		 * racing with d_move() which will take d_lock before
+		 * acting on the dentries.
+		 */
+		spin_lock(&dentry->d_lock);
+		dname = dentry->d_name.name;
+		dlen  = dentry->d_name.len;
+		spin_unlock(&dentry->d_lock);
+
+		if (unlikely(!dname || !dlen))
+			return -EINVAL;
+		/*
+		 * As the length and the content of the string may not be
+		 * valid, need to scan the string and return EINVAL if there
+		 * is embedded null byte within the length of the string.
+		 */
+		for (ptr = dname, len = dlen; len; len--, ptr++) {
+			if (*ptr == '\0')
+				return -EINVAL;
+		}
+	}
+	error = prepend(buffer, buflen, dname, dlen);
+	return error ? error : prepend(buffer, buflen, "/", 1);
 }
 
 /**
@@ -2522,7 +2570,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
  * @buffer: pointer to the end of the buffer
  * @buflen: pointer to buffer length
  *
- * Caller holds the rename_lock.
+ * The function tries to write out the pathname without taking any lock other
+ * than the RCU read lock to make sure that dentries won't go away. It only
+ * checks the sequence number of the global rename_lock as any change in the
+ * dentry's d_seq will be preceded by changes in the rename_lock sequence
+ * number. If the sequence number had been change, it will restart the whole
+ * pathname back-tracing sequence again. It performs a total of 3 trials of
+ * lockless back-tracing sequences before falling back to take the
+ * rename_lock.
  */
 static int prepend_path(const struct path *path,
 			const struct path *root,
@@ -2531,54 +2586,80 @@ static int prepend_path(const struct path *path,
 	struct dentry *dentry = path->dentry;
 	struct vfsmount *vfsmnt = path->mnt;
 	struct mount *mnt = real_mount(vfsmnt);
-	bool slash = false;
 	int error = 0;
+	unsigned seq = 0;
+	char *bptr;
+	int blen;
+	int retry_cnt = 3;	/* Try lockless path conversion 3 times */
 
+restart:
+	bptr = *buffer;
+	blen = *buflen;
+	if (retry_cnt) {
+		seq = read_seqbegin(&rename_lock);
+		rcu_read_lock();
+	} else
+		write_seqlock(&rename_lock);
 	while (dentry != root->dentry || vfsmnt != root->mnt) {
 		struct dentry * parent;
 
 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			/* Global root? */
-			if (!mnt_has_parent(mnt))
-				goto global_root;
-			dentry = mnt->mnt_mountpoint;
-			mnt = mnt->mnt_parent;
-			vfsmnt = &mnt->mnt;
-			continue;
+			if (mnt_has_parent(mnt)) {
+				dentry = mnt->mnt_mountpoint;
+				mnt = mnt->mnt_parent;
+				vfsmnt = &mnt->mnt;
+				continue;
+			}
+			/*
+			 * Filesystems needing to implement special "root names"
+			 * should do so with ->d_dname()
+			 */
+			if (IS_ROOT(dentry) &&
+			   (dentry->d_name.len != 1 ||
+			    dentry->d_name.name[0] != '/')) {
+				WARN(1, "Root dentry has weird name <%.*s>\n",
+				     (int) dentry->d_name.len,
+				     dentry->d_name.name);
+			}
+			if (!error)
+				error = is_mounted(vfsmnt) ? 1 : 2;
+			break;
 		}
 		parent = dentry->d_parent;
+		if (unlikely(parent == NULL)) {
+			error = -EINVAL;
+			break;	/* Check sequence number */
+		}
 		prefetch(parent);
-		spin_lock(&dentry->d_lock);
-		error = prepend_name(buffer, buflen, &dentry->d_name);
-		spin_unlock(&dentry->d_lock);
-		if (!error)
-			error = prepend(buffer, buflen, "/", 1);
+		error = prepend_name(&bptr, &blen, dentry);
 		if (error)
 			break;
 
-		slash = true;
 		dentry = parent;
 	}
+	if (retry_cnt) {
+		retry_cnt--;
+		rcu_read_unlock();
+		if (read_seqretry(&rename_lock, seq))
+			goto restart;
+	} else
+		write_sequnlock(&rename_lock);
 
-	if (!error && !slash)
-		error = prepend(buffer, buflen, "/", 1);
-
+	if (error >= 0 && bptr == *buffer) {
+		if (--blen < 0)
+			error = -ENAMETOOLONG;
+		else
+			*--bptr = '/';
+	}
+	if (error == -EINVAL)
+		goto garbage;
+	*buffer = bptr;
+	*buflen = blen;
 	return error;
 
-global_root:
-	/*
-	 * Filesystems needing to implement special "root names"
-	 * should do so with ->d_dname()
-	 */
-	if (IS_ROOT(dentry) &&
-	    (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) {
-		WARN(1, "Root dentry has weird name <%.*s>\n",
-		     (int) dentry->d_name.len, dentry->d_name.name);
-	}
-	if (!slash)
-		error = prepend(buffer, buflen, "/", 1);
-	if (!error)
-		error = is_mounted(vfsmnt) ? 1 : 2;
+garbage:
+	error = prepend(buffer, buflen, "(garbage)", 10);
 	return error;
 }
 
@@ -2607,9 +2688,7 @@ char *__d_path(const struct path *path,
 
 	prepend(&res, &buflen, "\0", 1);
 	br_read_lock(&vfsmount_lock);
-	write_seqlock(&rename_lock);
 	error = prepend_path(path, root, &res, &buflen);
-	write_sequnlock(&rename_lock);
 	br_read_unlock(&vfsmount_lock);
 
 	if (error < 0)
@@ -2628,9 +2707,7 @@ char *d_absolute_path(const struct path *path,
 
 	prepend(&res, &buflen, "\0", 1);
 	br_read_lock(&vfsmount_lock);
-	write_seqlock(&rename_lock);
 	error = prepend_path(path, &root, &res, &buflen);
-	write_sequnlock(&rename_lock);
 	br_read_unlock(&vfsmount_lock);
 
 	if (error > 1)
@@ -2696,9 +2773,7 @@ char *d_path(const struct path *path, char *buf, int buflen)
 
 	get_fs_root(current->fs, &root);
 	br_read_lock(&vfsmount_lock);
-	write_seqlock(&rename_lock);
 	error = path_with_deleted(path, &root, &res, &buflen);
-	write_sequnlock(&rename_lock);
 	br_read_unlock(&vfsmount_lock);
 	if (error < 0)
 		res = ERR_PTR(error);
@@ -2733,10 +2808,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
 	char *end = buffer + buflen;
 	/* these dentries are never renamed, so d_lock is not needed */
 	if (prepend(&end, &buflen, " (deleted)", 11) ||
-	    prepend_name(&end, &buflen, &dentry->d_name) ||
+	    prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) ||
 	    prepend(&end, &buflen, "/", 1))  
 		end = ERR_PTR(-ENAMETOOLONG);
-	return end;  
+	return end;
 }
 
 /*
@@ -2744,30 +2819,55 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen)
  */
 static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
 {
-	char *end = buf + buflen;
-	char *retval;
+	char *end, *retval;
+	int len, seq = 0;
+	int error = 0;
+	int retry_cnt = 3;	/* Try lockless 3 times before taking lock */
 
-	prepend(&end, &buflen, "\0", 1);
+restart:
+	end = buf + buflen;
+	len = buflen;
+	prepend(&end, &len, "\0", 1);
 	if (buflen < 1)
 		goto Elong;
 	/* Get '/' right */
 	retval = end-1;
 	*retval = '/';
-
+	if (retry_cnt) {
+		seq = read_seqbegin(&rename_lock);
+		rcu_read_lock();
+	} else
+		write_seqlock(&rename_lock);
 	while (!IS_ROOT(dentry)) {
 		struct dentry *parent = dentry->d_parent;
 		int error;
 
+		if (unlikely(parent == NULL)) {
+			error = -EINVAL;
+			break;
+		}
 		prefetch(parent);
-		spin_lock(&dentry->d_lock);
-		error = prepend_name(&end, &buflen, &dentry->d_name);
-		spin_unlock(&dentry->d_lock);
-		if (error != 0 || prepend(&end, &buflen, "/", 1) != 0)
-			goto Elong;
+		error = prepend_name(&end, &len, dentry);
+		if (error)
+			break;
 
 		retval = end;
 		dentry = parent;
 	}
+	if (retry_cnt) {
+		retry_cnt--;
+		rcu_read_unlock();
+		if (read_seqretry(&rename_lock, seq))
+			goto restart;
+	} else
+		write_sequnlock(&rename_lock);
+	if (error == -EINVAL) {
+		error = prepend(&end, &len, "//garbage", 10);
+		if (!error)
+			return end;
+	}
+	if (error)
+		goto Elong;
 	return retval;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
@@ -2775,13 +2875,7 @@ Elong:
 
 char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
 {
-	char *retval;
-
-	write_seqlock(&rename_lock);
-	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
-
-	return retval;
+	return __dentry_path(dentry, buf, buflen);
 }
 EXPORT_SYMBOL(dentry_path_raw);
 
@@ -2790,7 +2884,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 	char *p = NULL;
 	char *retval;
 
-	write_seqlock(&rename_lock);
 	if (d_unlinked(dentry)) {
 		p = buf + buflen;
 		if (prepend(&p, &buflen, "//deleted", 10) != 0)
@@ -2798,7 +2891,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen)
 		buflen++;
 	}
 	retval = __dentry_path(dentry, buf, buflen);
-	write_sequnlock(&rename_lock);
 	if (!IS_ERR(retval) && p)
 		*p = '/';	/* restore '/' overriden with '\0' */
 	return retval;
@@ -2837,7 +2929,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 	error = -ENOENT;
 	br_read_lock(&vfsmount_lock);
-	write_seqlock(&rename_lock);
 	if (!d_unlinked(pwd.dentry)) {
 		unsigned long len;
 		char *cwd = page + PAGE_SIZE;
@@ -2845,7 +2936,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 
 		prepend(&cwd, &buflen, "\0", 1);
 		error = prepend_path(&pwd, &root, &cwd, &buflen);
-		write_sequnlock(&rename_lock);
 		br_read_unlock(&vfsmount_lock);
 
 		if (error < 0)
@@ -2866,7 +2956,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size)
 				error = -EFAULT;
 		}
 	} else {
-		write_sequnlock(&rename_lock);
 		br_read_unlock(&vfsmount_lock);
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 18:55 ` [PATCH v2 1/1] " Waiman Long
@ 2013-09-05 19:35   ` Linus Torvalds
  2013-09-05 20:29     ` Waiman Long
  2013-09-05 20:04   ` Al Viro
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2013-09-05 19:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On Thu, Sep 5, 2013 at 11:55 AM, Waiman Long <Waiman.Long@hp.com> wrote:
> +static int prepend_name(char **buffer, int *buflen, struct dentry *dentry)
>  {
> -       return prepend(buffer, buflen, name->name, name->len);
> +       /*
> +        * With RCU path tracing, it may race with rename. Use
> +        * ACCESS_ONCE() to make sure that it is either the old or
> +        * the new name pointer. The length does not really matter as
> +        * the sequence number check will eventually catch any ongoing
> +        * rename operation.
> +        */
> +       const char *dname = ACCESS_ONCE(dentry->d_name.name);
> +       u32 dlen = dentry->d_name.len;
> +       int error;
> +
> +       if (likely(dname == (const char *)dentry->d_iname)) {
> +               /*
> +                * Internal dname, the string memory is valid as long
> +                * as its length is not over the limit.
> +                */
> +               if (unlikely(dlen > sizeof(dentry->d_iname)))
> +                       return -EINVAL;
> +       } else if (!dname)
> +               return -EINVAL;
> +       else {
> +               const char *ptr;
> +               u32 len;
> +
> +               /*
> +                * External dname, need to fetch name pointer and length
> +                * again under d_lock to get a consistent set and avoid
> +                * racing with d_move() which will take d_lock before
> +                * acting on the dentries.
> +                */
> +               spin_lock(&dentry->d_lock);
> +               dname = dentry->d_name.name;
> +               dlen  = dentry->d_name.len;
> +               spin_unlock(&dentry->d_lock);
> +
> +               if (unlikely(!dname || !dlen))
> +                       return -EINVAL;
> +               /*
> +                * As the length and the content of the string may not be
> +                * valid, need to scan the string and return EINVAL if there
> +                * is embedded null byte within the length of the string.
> +                */
> +               for (ptr = dname, len = dlen; len; len--, ptr++) {
> +                       if (*ptr == '\0')
> +                               return -EINVAL;
> +               }
> +       }
> +       error = prepend(buffer, buflen, dname, dlen);

No. Stop all these stupid games.

No d_lock, no trying to make d_name/d_len consistent.

No "compare d_name against d_iname".

No EINVAL.

No idiotic racy "let's fetch each byte one-by one and test them
against NUL", which is just racy and stupid.

Just do what I told you to do: *copy* the name one byte at a time, and
stop copying if you hit NUL. IOW, make prepend() just use "strncpy()"
instead of "memcpy()". You don't even need to check the end result -
if it's bogus, the sequence count will fix it.

No special cases. No games. No crap. Just make "prepend()" able to
handle the special case of "oops, the name length didn't match, but we
must not ever go past the end of the buffer".

We can optimize strncpy() to use word accesses if it ends up ever
being a performance issue. I suspect it never even shows up, but it's
not hard to do if if does.

            Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 18:55 ` [PATCH v2 1/1] " Waiman Long
  2013-09-05 19:35   ` Linus Torvalds
@ 2013-09-05 20:04   ` Al Viro
  2013-09-05 20:43     ` Waiman Long
  1 sibling, 1 reply; 11+ messages in thread
From: Al Viro @ 2013-09-05 20:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Chandramouleeswaran,
	Aswin, Norton, Scott J, George Spelvin, John Stoffel

On Thu, Sep 05, 2013 at 02:55:16PM -0400, Waiman Long wrote:
> +	const char *dname = ACCESS_ONCE(dentry->d_name.name);
> +	u32 dlen = dentry->d_name.len;
> +	int error;
> +
> +	if (likely(dname == (const char *)dentry->d_iname)) {
> +		/*
> +		 * Internal dname, the string memory is valid as long
> +		 * as its length is not over the limit.
> +		 */
> +		if (unlikely(dlen > sizeof(dentry->d_iname)))
> +			return -EINVAL;
> +	} else if (!dname)
> +		return -EINVAL;
Can't happen.
> +	else {
> +		const char *ptr;
> +		u32 len;
> +
> +		/*
> +		 * External dname, need to fetch name pointer and length
> +		 * again under d_lock to get a consistent set and avoid
> +		 * racing with d_move() which will take d_lock before
> +		 * acting on the dentries.
> +		 */
> +		spin_lock(&dentry->d_lock);
> +		dname = dentry->d_name.name;
> +		dlen  = dentry->d_name.len;
> +		spin_unlock(&dentry->d_lock);
> +
> +		if (unlikely(!dname || !dlen))
> +			return -EINVAL;
Can't happen.

> +		/*
> +		 * As the length and the content of the string may not be
> +		 * valid, need to scan the string and return EINVAL if there
> +		 * is embedded null byte within the length of the string.
> +		 */
> +		for (ptr = dname, len = dlen; len; len--, ptr++) {
> +			if (*ptr == '\0')
> +				return -EINVAL;

Egads...  First of all, this is completely pointless - if you've grabbed
->d_name.name and ->d_name.len under ->d_lock, you don't *need* that crap.
At all.  The whole point of that exercise is to avoid taking ->d_lock;
_that_ is where the "read byte by byte until you hit NUL" comes from.
And if you do that, you can bloody well just go ahead and store them in
the target array *as* *you* *go*.  No reason to bother with memcpy()
afterwards.

Damnit, just grab len and name (no ->d_lock, etc.).  Check if you've got
enough space in the buffer, treat "not enough" as an overflow.  Then
proceed to copy the damn thing over there (starting at *buffer -= len)
byte by byte, stopping when you've copied len bytes *or* when the byte you've
got happens to be NUL.  Don't bother with EINVAL, etc. - just return to
caller and let rename_lock logics take care of the races.  That's it - nothing
more is needed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 19:35   ` Linus Torvalds
@ 2013-09-05 20:29     ` Waiman Long
  2013-09-05 20:42       ` Linus Torvalds
  2013-09-05 20:46       ` Al Viro
  0 siblings, 2 replies; 11+ messages in thread
From: Waiman Long @ 2013-09-05 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On 09/05/2013 03:35 PM, Linus Torvalds wrote:
> No. Stop all these stupid games.
>
> No d_lock, no trying to make d_name/d_len consistent.
>
> No "compare d_name against d_iname".
>
> No EINVAL.
>
> No idiotic racy "let's fetch each byte one-by one and test them
> against NUL", which is just racy and stupid.
>
> Just do what I told you to do: *copy* the name one byte at a time, and
> stop copying if you hit NUL. IOW, make prepend() just use "strncpy()"
> instead of "memcpy()". You don't even need to check the end result -
> if it's bogus, the sequence count will fix it.
>
> No special cases. No games. No crap. Just make "prepend()" able to
> handle the special case of "oops, the name length didn't match, but we
> must not ever go past the end of the buffer".
>
> We can optimize strncpy() to use word accesses if it ends up ever
> being a performance issue. I suspect it never even shows up, but it's
> not hard to do if if does.
>
>              Linus

It is not as simple as doing a strncpy(). The pathname was built from 
the leaf up to the root, and from the end of buffer toward the 
beginning. As it goes through the while loop, the buffer will look like:

"    /c"
"  /b/c"
"/a/b/c"

If the content of the string is unreliable, I have to do at least 2 passes:
1) Locate the end of the string and determine the actual length
2) Copy the whole string or byte-by-byte backward

That is why I am looking for the null byte. Using strncpy() alone won't 
work. However, if the actual length doesn't match that of the dlen, the 
name string will be invalid and there is no point in proceeding any further.

I also consider the possible, but unlikely, scenario that during a 
rename operation, a short old name could be freed and replaced by a long 
new name. The old name could then be allocated to another user filling 
it up with non-NULL byte. If the buffer happen to be the end of 
valid-to-invalid memory page boundary, the code may step over to an 
invalid address by looking for the null byte. The current code probably 
won't free the buffer while the RCU lock is being hold, but future code 
change may make this assumption not valid. Blindly taking the d_lock may 
be too expensive as the original code does. That is why I come up with 
the internal vs. external dname check and take the lock only for 
external dname for safety.

I can change the code to do just locating the end of string and copying 
it backward, but not using strncpy().

-Longman

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 20:29     ` Waiman Long
@ 2013-09-05 20:42       ` Linus Torvalds
  2013-09-06  2:01         ` Waiman Long
  2013-09-05 20:46       ` Al Viro
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2013-09-05 20:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On Thu, Sep 5, 2013 at 1:29 PM, Waiman Long <waiman.long@hp.com> wrote:
>
> It is not as simple as doing a strncpy().

Yes it damn well is.

Stop the f*cking stupid arguments, and instead listen to what I say.

Here. Let me bold-face the most important part for you, so that you
don't miss it in all the other crap:

   MAKE prepend() JUST USE "strncpy()" INSTEAD OF "memcpy()".

Nothing else. Seriously. Your "you can't do it because we copy
backwards" arguments are pure and utter garbage, exactly BECAUSE YOU
DON'T CHANGE ANY OF THAT. You can actually use the unreliable length
variable BUT YOU MUST STILL STOP AT A ZERO.

Get it?

You're complicating the whole thing for no good reason. I'm telling
you (and HAVE BEEN telling you multiple times) that you cannot use
"memcpy()" because the length may not be reliable, so you need to
check for zero in the middle and stop early. All your arguments have
been totally pointless, because you don't seem to see that simple and
fundamental issue. You don't change ANYTHING else. But you damn well
not do a "memcpy", you do something that stops when it hits a NUL
character.

We call that function "strncpy()". I'd actually prefer to write it out
by hand (because somebody could implement "strncpy()" as a
questionable function that accesses past the NUL as long as it's
within the 'n'), and because I think we might want to do that
word-at-a-time version of it, but for a first approximation, just do
that one-liner version.

Don't do anything else. Don't do locking. Don't do memchr. Just make
sure that you stop at a NUL character, and don't trust the length,
because the length may not match the pointer. That's was always ALL
you needed to do.

              Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 20:04   ` Al Viro
@ 2013-09-05 20:43     ` Waiman Long
  0 siblings, 0 replies; 11+ messages in thread
From: Waiman Long @ 2013-09-05 20:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Chandramouleeswaran,
	Aswin, Norton, Scott J, George Spelvin, John Stoffel

On 09/05/2013 04:04 PM, Al Viro wrote:
> On Thu, Sep 05, 2013 at 02:55:16PM -0400, Waiman Long wrote:
>> +	const char *dname = ACCESS_ONCE(dentry->d_name.name);
>> +	u32 dlen = dentry->d_name.len;
>> +	int error;
>> +
>> +	if (likely(dname == (const char *)dentry->d_iname)) {
>> +		/*
>> +		 * Internal dname, the string memory is valid as long
>> +		 * as its length is not over the limit.
>> +		 */
>> +		if (unlikely(dlen>  sizeof(dentry->d_iname)))
>> +			return -EINVAL;
>> +	} else if (!dname)
>> +		return -EINVAL;
> Can't happen.
>> +	else {
>> +		const char *ptr;
>> +		u32 len;
>> +
>> +		/*
>> +		 * External dname, need to fetch name pointer and length
>> +		 * again under d_lock to get a consistent set and avoid
>> +		 * racing with d_move() which will take d_lock before
>> +		 * acting on the dentries.
>> +		 */
>> +		spin_lock(&dentry->d_lock);
>> +		dname = dentry->d_name.name;
>> +		dlen  = dentry->d_name.len;
>> +		spin_unlock(&dentry->d_lock);
>> +
>> +		if (unlikely(!dname || !dlen))
>> +			return -EINVAL;
> Can't happen.
>
>> +		/*
>> +		 * As the length and the content of the string may not be
>> +		 * valid, need to scan the string and return EINVAL if there
>> +		 * is embedded null byte within the length of the string.
>> +		 */
>> +		for (ptr = dname, len = dlen; len; len--, ptr++) {
>> +			if (*ptr == '\0')
>> +				return -EINVAL;
> Egads...  First of all, this is completely pointless - if you've grabbed
> ->d_name.name and ->d_name.len under ->d_lock, you don't *need* that crap.
> At all.  The whole point of that exercise is to avoid taking ->d_lock;
> _that_ is where the "read byte by byte until you hit NUL" comes from.
> And if you do that, you can bloody well just go ahead and store them in
> the target array *as* *you* *go*.  No reason to bother with memcpy()
> afterwards.

That is what I thought too. I am just not totally sure about it. So yes, 
I can scrap all these additional check.

As the internal dname buffer is at least 32 bytes, most dentries will 
use the internal buffer instead of allocating from kmem. IOW, the d_lock 
taking code path is unlikely to be used.

> Damnit, just grab len and name (no ->d_lock, etc.).  Check if you've got
> enough space in the buffer, treat "not enough" as an overflow.  Then
> proceed to copy the damn thing over there (starting at *buffer -= len)
> byte by byte, stopping when you've copied len bytes *or* when the byte you've
> got happens to be NUL.  Don't bother with EINVAL, etc. - just return to
> caller and let rename_lock logics take care of the races.  That's it - nothing
> more is needed.

OK, I will do that.

-Longman

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 20:29     ` Waiman Long
  2013-09-05 20:42       ` Linus Torvalds
@ 2013-09-05 20:46       ` Al Viro
  2013-09-05 21:27         ` Linus Torvalds
  1 sibling, 1 reply; 11+ messages in thread
From: Al Viro @ 2013-09-05 20:46 UTC (permalink / raw)
  To: Waiman Long
  Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On Thu, Sep 05, 2013 at 04:29:06PM -0400, Waiman Long wrote:

> It is not as simple as doing a strncpy(). The pathname was built
> from the leaf up to the root, and from the end of buffer toward the
> beginning. As it goes through the while loop, the buffer will look
> like:
> 
> "    /c"
> "  /b/c"
> "/a/b/c"
> 
> If the content of the string is unreliable, I have to do at least 2 passes:
> 1) Locate the end of the string and determine the actual length
> 2) Copy the whole string or byte-by-byte backward

No, you do not need anything of that kind.  All you need is
	a) don't step out of the array (which will contain NUL at the end
at all times, no matter what) and
	b) generate correct output *IF* no d_move() happens while you
do that.

Nothing else matters at all.  You trust the length to be correct in absense
of d_move().  You can not trust it to match the size of ->d_name.name when
d_move() is happening, but you can trust everything up to ->d_name.len *or*
the first NUL, whichever happens first, to be safe to access.

Again, the contents copied into the buffer needs to be valid only if d_move()
hasn't happened; if it has, we don't give a fuck - read_seqretry() will take
care of that.  All you need to care about in that case is not oopsing the
damn thing.

static int prepend_name(char **buffer, int *buflen, struct qstr *name)
{
	const char *s = ACCESS_ONCE(name->name);
	unsigned len = ACCESS_ONCE(name->len);
	char *p;

	*buflen -= len;
	if (*buflen < 0)
		return -ENAMETOOLONG;
	p = *buffer -= len;
	while (len--) {
		c = *s++;
		if (!c)
			break;
		*p++ = c;
	}
        return 0;
}

And that's *all* - just call that under rcu_read_lock() and within
seq = read_seqbegin(&rename_lock)/read_seqretry(&rename_lock, seq)
loop over the whole prepend_path/path_with_deleted/__dentry_path
thing.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 20:46       ` Al Viro
@ 2013-09-05 21:27         ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2013-09-05 21:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On Thu, Sep 5, 2013 at 1:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> static int prepend_name(char **buffer, int *buflen, struct qstr *name)
> {
>         const char *s = ACCESS_ONCE(name->name);
>         unsigned len = ACCESS_ONCE(name->len);
>         char *p;
>
>         *buflen -= len;
>         if (*buflen < 0)
>                 return -ENAMETOOLONG;
>         p = *buffer -= len;
>         while (len--) {
>                 c = *s++;
>                 if (!c)
>                         break;
>                 *p++ = c;
>         }
>         return 0;
> }

.. and appended is a *totally* untested "dentry_string_copy()" that
does things a word at a time (it doesn't have your "buffer/buflen -=
len" parts, this is purely the copying code).

Note that the return value is only optimistic - it may be copying NUL
bytes *without* returning an error. We don't care. We really only care
that it not access past the end of the string using the same rules as
the RCU lookup does: it *can* access aligned words. It's just meant as
a "hey, if we had to go to the bother of checking for a NUL byte and
we found it, we might as well let the user know so that he could
choose to stop the optimistic RCU thing early".

Also note that this thing depends on CONFIG_DCACHE_WORD_ACCESS (which
in turn depends on little-endian), _and_ only works if the
architecture supports fast unaligned stores (which is true on x86, but
may not be true on ARM). And it does require that it can overshoot the
destination string - it won't *change* the destination string past the
end, but it will do an unaligned store there needs padding. And that
might be the biggest problem with this routine and the thing that
scuttles it. We could do that final store as a series of byte writes
instead, I guess. So maybe it should do

   do {
      *(char *)ct++ = a;
      a >>= 8;
   } while (--tcount);

at the end instead of doing the final word store.

I haven't actually tested it, but it looks right (within the above
constraints), and it generates asm that looks good enough. It is not
entirely obvious that it's really worth it over the "stupid"
character-at-a-time model, though: the loop counts are likely not
really all that high, and the unaligned store together with loading
the big constants to do the "test word for zero" might just add enough
overhead that it's not a big win. But the word-at-a-time approach was
a noticeable win for the rcu dentry lookup, so it's possible that it's
noticeable here too.

                   Linus

----
/*
 * The source comes from a dentry, which means that it is guaranteed to
 * be aligned. However, the destination is not likely to be aligned.
 *
 * The only rule is: we must *not* overstep the source by more than that
 * aligned word access if it has a NUL character in it.
 *
 * NOTE NOTE NOTE! This also requires that we can overshoot the destination
 * string by up to one unaligned word access!
 */
int dentry_string_copy(const unsigned char *_cs, unsigned char *_ct,
unsigned tcount)
{
        const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
        const unsigned long *cs = (const void *)_cs;
        unsigned long *ct = (void *)_ct;
        unsigned long a,mask;

        for (;;) {
                unsigned long adata;
                a = *cs;
                if (tcount < sizeof(unsigned long))
                        break;
                *ct = a;
                if (unlikely(has_zero(a, &adata, &constants)))
                        return -EINVAL;
                cs++; ct++;
                tcount -= sizeof(unsigned long);
                if (!tcount)
                        return 0;
        }
        /* We don't care if there's a NUL in the last word */
        mask = ~(~0ul << tcount*8);
        *ct = (a & mask) | (~mask & *ct);
        return 0;
}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-05 20:42       ` Linus Torvalds
@ 2013-09-06  2:01         ` Waiman Long
  2013-09-06  4:54           ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Waiman Long @ 2013-09-06  2:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On 09/05/2013 04:42 PM, Linus Torvalds wrote:
> On Thu, Sep 5, 2013 at 1:29 PM, Waiman Long<waiman.long@hp.com>  wrote:
>> It is not as simple as doing a strncpy().
> Yes it damn well is.
>
> Stop the f*cking stupid arguments, and instead listen to what I say.
>
> Here. Let me bold-face the most important part for you, so that you
> don't miss it in all the other crap:
>
>     MAKE prepend() JUST USE "strncpy()" INSTEAD OF "memcpy()".
>
> Nothing else. Seriously. Your "you can't do it because we copy
> backwards" arguments are pure and utter garbage, exactly BECAUSE YOU
> DON'T CHANGE ANY OF THAT. You can actually use the unreliable length
> variable BUT YOU MUST STILL STOP AT A ZERO.
>
> Get it?
>
> You're complicating the whole thing for no good reason. I'm telling
> you (and HAVE BEEN telling you multiple times) that you cannot use
> "memcpy()" because the length may not be reliable, so you need to
> check for zero in the middle and stop early. All your arguments have
> been totally pointless, because you don't seem to see that simple and
> fundamental issue. You don't change ANYTHING else. But you damn well
> not do a "memcpy", you do something that stops when it hits a NUL
> character.
>
> We call that function "strncpy()". I'd actually prefer to write it out
> by hand (because somebody could implement "strncpy()" as a
> questionable function that accesses past the NUL as long as it's
> within the 'n'), and because I think we might want to do that
> word-at-a-time version of it, but for a first approximation, just do
> that one-liner version.
>
> Don't do anything else. Don't do locking. Don't do memchr. Just make
> sure that you stop at a NUL character, and don't trust the length,
> because the length may not match the pointer. That's was always ALL
> you needed to do.
>
>                Linus
I am sorry that I misunderstand what you said. I will do what you and Al 
advise me to do.

-Longman

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 1/1] dcache: Translating dentry into pathname without taking rename_lock
  2013-09-06  2:01         ` Waiman Long
@ 2013-09-06  4:54           ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2013-09-06  4:54 UTC (permalink / raw)
  To: Waiman Long
  Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List,
	Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin,
	John Stoffel

On Thu, Sep 5, 2013 at 7:01 PM, Waiman Long <waiman.long@hp.com> wrote:
>
> I am sorry that I misunderstand what you said. I will do what you and Al
> advise me to do.

I'm sorry I shouted at you. I was getting a bit frustrated there..

            Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-09-06  4:54 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-05 18:55 [PATCH v2 0/1] dcache: Translating dentry into pathname without taking rename_lock Waiman Long
2013-09-05 18:55 ` [PATCH v2 1/1] " Waiman Long
2013-09-05 19:35   ` Linus Torvalds
2013-09-05 20:29     ` Waiman Long
2013-09-05 20:42       ` Linus Torvalds
2013-09-06  2:01         ` Waiman Long
2013-09-06  4:54           ` Linus Torvalds
2013-09-05 20:46       ` Al Viro
2013-09-05 21:27         ` Linus Torvalds
2013-09-05 20:04   ` Al Viro
2013-09-05 20:43     ` Waiman Long

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.