* x86: faster strncpy_from_user()
@ 2012-04-06 21:32 Linus Torvalds
  2012-04-06 21:49 ` Linus Torvalds
  2012-04-10 22:35 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2012-04-06 21:32 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, linux-arch

[-- Attachment #1: Type: text/plain, Size: 4983 bytes --]

Ok, as some of you are aware, one of the things that got merged very
early in the 3.4 merge window was the "word-at-a-time" filename lookup
patches I had been working on. They only get enabled on x86, but when
they do, they do speed things up by quite a noticeable bit (mainly on
x86-64, which ends up doing things 8 bytes at a time - it's much less
noticeable on x86-32).

Now that the name lookup itself is quite lean, that ended up showing
some serious problems in selinux and the security wrappers, some of
which were largely solved a few days ago (the recent "Merge branch
'selinux'" thing got rid of the most annoying cases).

I have a few more selinux fixes to really get the worst stupidity out
of the way, but I suspect they are for the next merge window.  But with
those, the path lookup is fast, and selinux isn't eating up *too* much
time either.

And the next thing that shows up is "strncpy_from_user()".

It's not all that surprising - the function really isn't particularly
smart, and using lodsb/stosb is just legacy cruft. But realistically,
while it hasn't been a piece of wonderful engineering, it also never
really showed up very high on profiles. Now with all the other
improvements I have a few loads where it really does show up pretty
well - the only real user of strncpy_from_user() is the pathname
copy code (sure, there's stuff elsewhere, but it's just not that
important).

Anyway, that function uses asms and is duplicated across x86-32 and
x86-64 for no real good reason except it was never a priority.

Here's a patch that actually removes more lines than it adds, because
it removes that duplication, and replaces the asm with some fairly
optimal C code. So we have

 6 files changed, 105 insertions(+), 145 deletions(-)

but what is perhaps more interesting (and the reason linux-arch is
cc'd) is that not only is it noticeably faster, it should also
actually get all the error cases right.

What do I mean by that?

I mean that in my tree (but not committed), I have actually removed
this nasty piece of code from fs/namei.c:

  diff --git a/fs/namei.c b/fs/namei.c
  index 0062dd17eb55..2b7d691523bb 100644
  --- a/fs/namei.c
  +++ b/fs/namei.c
  @@ -119,14 +119,7 @@
   static int do_getname(const char __user *filename, char *page)
   {
          int retval;
  -       unsigned long len = PATH_MAX;
  -
  -       if (!segment_eq(get_fs(), KERNEL_DS)) {
  -               if ((unsigned long) filename >= TASK_SIZE)
  -                       return -EFAULT;
  -               if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
  -                       len = TASK_SIZE - (unsigned long) filename;
  -       }
  +       const unsigned long len = PATH_MAX;

          retval = strncpy_from_user(page, filename, len);
          if (retval > 0) {

which worked around the fact that strncpy_from_user() didn't really
get the end conditions quite right, and it really just did an
access_ok() on the first byte. The new x86 implementation should get
all of that right without any need for any ugly hacks.

Now, I haven't committed that change to do_getname(), but it *will*
get committed at some point. Probably not until the next merge window,
but I did want to let architecture maintainers know: you need to look
at your strncpy_from_user() implementation, and make sure it gets the "end of
address space" case right, because the pathname code won't hack around
it for you any more if you don't.

The x86-"specific" strncpy_from_user() attached in the patch is
actually fairly generic. It does do the word-at-a-time tricks, but if
you don't want to bother with those, you can just remove that loop
entirely, and it should all work in general (the x86 code has a
fall-back to the byte-at-a-time case, so if you remove the "while (max
>= sizeof(unsigned long))" loop, it should all "just work").

And the word-at-a-time tricks it does do are actually simpler than the
ones the dcache does - just a subset of them - so you may decide that
you'll do that part too, even if the dcache ones are more complicated.

(For example, if you just implement the "has_zero()" function, you can
make it do

    if (has_zero(a))
        break;

to fall back to the byte-at-a-time model - that's what I did
originally in this patch too, but quite frankly, it does mean that the
word-at-a-time logic isn't as efficient).
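
To make that concrete, here is a rough sketch of what a plain-C
has_zero() plus that fallback could look like. This is illustration
only - the constant names are invented here, and the real
<asm/word-at-a-time.h> helpers may differ in detail:

    /* 0x0101...01 and 0x8080...80 patterns - names made up for this sketch */
    #define ONE_BYTES  (~0ul / 0xff)
    #define HIGH_BITS  (ONE_BYTES * 0x80)

    static inline unsigned long has_zero(unsigned long a)
    {
        /* non-zero iff some byte of 'a' is zero */
        return (a - ONE_BYTES) & ~a & HIGH_BITS;
    }

    ...
    if (has_zero(a))
        break;    /* punt to the byte-at-a-time tail */

That alone is enough to get the word-at-a-time copy going on an
architecture that doesn't want to bother with the byte-counting
tricks.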

Anyway, this works for me on x86 (you need current tip-of-git for the
<asm/word-at-a-time.h> thing) and is quite noticeable in profiles
(well, if you have loads that do a lot of pathname lookups: I use a
microbenchmark that does ten million lookups of a filename with
multiple components, but I also do "make -j" profiles of a fully built
tree).

And considering that the old x86 strncpy_from_user() code was so ugly,
it's pretty much a no-brainer there. But for other architectures, I'd
still like to point out the end-of-address-space problem, which you may
or may not actually have.

Comments?

                         Linus

[-- Attachment #2: x86-strncpy_from_user.patch --]
[-- Type: application/octet-stream, Size: 9998 bytes --]

 arch/x86/include/asm/uaccess.h    |    2 +
 arch/x86/include/asm/uaccess_32.h |    5 --
 arch/x86/include/asm/uaccess_64.h |    4 --
 arch/x86/lib/usercopy.c           |  103 +++++++++++++++++++++++++++++++++++++
 arch/x86/lib/usercopy_32.c        |   87 -------------------------------
 arch/x86/lib/usercopy_64.c        |   49 ------------------
 6 files changed, 105 insertions(+), 145 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 8be5f54d9360..e0544597cfe7 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -557,6 +557,8 @@ struct __large_struct { unsigned long buf[100]; };
 
 extern unsigned long
 copy_from_user_nmi(void *to, const void __user *from, unsigned long n);
+extern __must_check long
+strncpy_from_user(char *dst, const char __user *src, long count);
 
 /*
  * movsl can be slow when source and dest are not both 8-byte aligned
diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h
index 566e803cc602..8084bc73b18c 100644
--- a/arch/x86/include/asm/uaccess_32.h
+++ b/arch/x86/include/asm/uaccess_32.h
@@ -213,11 +213,6 @@ static inline unsigned long __must_check copy_from_user(void *to,
 	return n;
 }
 
-long __must_check strncpy_from_user(char *dst, const char __user *src,
-				    long count);
-long __must_check __strncpy_from_user(char *dst,
-				      const char __user *src, long count);
-
 /**
  * strlen_user: - Get the size of a string in user space.
  * @str: The string to measure.
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 1c66d30971ad..fcd4b6f3ef02 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -208,10 +208,6 @@ int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
 	}
 }
 
-__must_check long
-strncpy_from_user(char *dst, const char __user *src, long count);
-__must_check long
-__strncpy_from_user(char *dst, const char __user *src, long count);
 __must_check long strnlen_user(const char __user *str, long n);
 __must_check long __strnlen_user(const char __user *str, long n);
 __must_check long strlen_user(const char __user *str);
diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index 97be9cb54483..57252c928f56 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -7,6 +7,8 @@
 #include <linux/highmem.h>
 #include <linux/module.h>
 
+#include <asm/word-at-a-time.h>
+
 /*
  * best effort, GUP based copy_from_user() that is NMI-safe
  */
@@ -41,3 +43,104 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
 	return len;
 }
 EXPORT_SYMBOL_GPL(copy_from_user_nmi);
+
+static inline unsigned long count_bytes(unsigned long mask)
+{
+	mask = (mask - 1) & ~mask;
+	mask >>= 7;
+	return count_masked_bytes(mask);
+}
+
+/*
+ * Do a strncpy, return length of string without final '\0'.
+ * 'count' is the user-supplied count (return 'count' if we
+ * hit it), 'max' is the address space maximum (and we return
+ * -EFAULT if we hit it).
+ */
+static inline long do_strncpy_from_user(char *dst, const char __user *src, long count, long max)
+{
+	long res = 0;
+
+	/*
+	 * Truncate 'max' to the user-specified limit, so that
+	 * we only have one limit we need to check in the loop
+	 */
+	if (max > count)
+		max = count;
+
+	while (max >= sizeof(unsigned long)) {
+		unsigned long c;
+
+		/* Fall back to byte-at-a-time if we get a page fault */
+		if (unlikely(__get_user(c,(unsigned long __user *)(src+res))))
+			break;
+		/* This can write a few bytes past the NUL character, but that's ok */
+		*(unsigned long *)(dst+res) = c;
+		c = has_zero(c);
+		if (c)
+			return res + count_bytes(c);
+		res += sizeof(unsigned long);
+		max -= sizeof(unsigned long);
+	}
+
+	while (max) {
+		char c;
+
+		if (unlikely(__get_user(c,src+res)))
+			return -EFAULT;
+		dst[res] = c;
+		if (!c)
+			return res;
+		res++;
+		max--;
+	}
+
+	/*
+	 * Uhhuh. We hit 'max'. But was that the user-specified maximum
+	 * too? If so, that's ok - we got as much as the user asked for.
+	 */
+	if (res >= count)
+		return count;
+
+	/*
+	 * Nope: we hit the address space limit, and we still had more
+	 * characters the caller would have wanted. That's an EFAULT.
+	 */
+	return -EFAULT;
+}
+
+/**
+ * strncpy_from_user: - Copy a NUL terminated string from userspace.
+ * @dst:   Destination address, in kernel space.  This buffer must be at
+ *         least @count bytes long.
+ * @src:   Source address, in user space.
+ * @count: Maximum number of bytes to copy, including the trailing NUL.
+ *
+ * Copies a NUL-terminated string from userspace to kernel space.
+ *
+ * On success, returns the length of the string (not including the trailing
+ * NUL).
+ *
+ * If access to userspace fails, returns -EFAULT (some data may have been
+ * copied).
+ *
+ * If @count is smaller than the length of the string, copies @count bytes
+ * and returns @count.
+ */
+long
+strncpy_from_user(char *dst, const char __user *src, long count)
+{
+	unsigned long max_addr, src_addr;
+
+	if (unlikely(count <= 0))
+		return 0;
+
+	max_addr = current_thread_info()->addr_limit.seg;
+	src_addr = (unsigned long)src;
+	if (likely(src_addr < max_addr)) {
+		unsigned long max = max_addr - src_addr;
+		return do_strncpy_from_user(dst, src, count, max);
+	}
+	return -EFAULT;
+}
+EXPORT_SYMBOL(strncpy_from_user);
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index d9b094ca7aaa..ef2a6a5d78e3 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -33,93 +33,6 @@ static inline int __movsl_is_ok(unsigned long a1, unsigned long a2, unsigned lon
 	__movsl_is_ok((unsigned long)(a1), (unsigned long)(a2), (n))
 
 /*
- * Copy a null terminated string from userspace.
- */
-
-#define __do_strncpy_from_user(dst, src, count, res)			   \
-do {									   \
-	int __d0, __d1, __d2;						   \
-	might_fault();							   \
-	__asm__ __volatile__(						   \
-		"	testl %1,%1\n"					   \
-		"	jz 2f\n"					   \
-		"0:	lodsb\n"					   \
-		"	stosb\n"					   \
-		"	testb %%al,%%al\n"				   \
-		"	jz 1f\n"					   \
-		"	decl %1\n"					   \
-		"	jnz 0b\n"					   \
-		"1:	subl %1,%0\n"					   \
-		"2:\n"							   \
-		".section .fixup,\"ax\"\n"				   \
-		"3:	movl %5,%0\n"					   \
-		"	jmp 2b\n"					   \
-		".previous\n"						   \
-		_ASM_EXTABLE(0b,3b)					   \
-		: "=&d"(res), "=&c"(count), "=&a" (__d0), "=&S" (__d1),	   \
-		  "=&D" (__d2)						   \
-		: "i"(-EFAULT), "0"(count), "1"(count), "3"(src), "4"(dst) \
-		: "memory");						   \
-} while (0)
-
-/**
- * __strncpy_from_user: - Copy a NUL terminated string from userspace, with less checking.
- * @dst:   Destination address, in kernel space.  This buffer must be at
- *         least @count bytes long.
- * @src:   Source address, in user space.
- * @count: Maximum number of bytes to copy, including the trailing NUL.
- *
- * Copies a NUL-terminated string from userspace to kernel space.
- * Caller must check the specified block with access_ok() before calling
- * this function.
- *
- * On success, returns the length of the string (not including the trailing
- * NUL).
- *
- * If access to userspace fails, returns -EFAULT (some data may have been
- * copied).
- *
- * If @count is smaller than the length of the string, copies @count bytes
- * and returns @count.
- */
-long
-__strncpy_from_user(char *dst, const char __user *src, long count)
-{
-	long res;
-	__do_strncpy_from_user(dst, src, count, res);
-	return res;
-}
-EXPORT_SYMBOL(__strncpy_from_user);
-
-/**
- * strncpy_from_user: - Copy a NUL terminated string from userspace.
- * @dst:   Destination address, in kernel space.  This buffer must be at
- *         least @count bytes long.
- * @src:   Source address, in user space.
- * @count: Maximum number of bytes to copy, including the trailing NUL.
- *
- * Copies a NUL-terminated string from userspace to kernel space.
- *
- * On success, returns the length of the string (not including the trailing
- * NUL).
- *
- * If access to userspace fails, returns -EFAULT (some data may have been
- * copied).
- *
- * If @count is smaller than the length of the string, copies @count bytes
- * and returns @count.
- */
-long
-strncpy_from_user(char *dst, const char __user *src, long count)
-{
-	long res = -EFAULT;
-	if (access_ok(VERIFY_READ, src, 1))
-		__do_strncpy_from_user(dst, src, count, res);
-	return res;
-}
-EXPORT_SYMBOL(strncpy_from_user);
-
-/*
  * Zero Userspace
  */
 
diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index b7c2849ffb66..0d0326f388c0 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -9,55 +9,6 @@
 #include <asm/uaccess.h>
 
 /*
- * Copy a null terminated string from userspace.
- */
-
-#define __do_strncpy_from_user(dst,src,count,res)			   \
-do {									   \
-	long __d0, __d1, __d2;						   \
-	might_fault();							   \
-	__asm__ __volatile__(						   \
-		"	testq %1,%1\n"					   \
-		"	jz 2f\n"					   \
-		"0:	lodsb\n"					   \
-		"	stosb\n"					   \
-		"	testb %%al,%%al\n"				   \
-		"	jz 1f\n"					   \
-		"	decq %1\n"					   \
-		"	jnz 0b\n"					   \
-		"1:	subq %1,%0\n"					   \
-		"2:\n"							   \
-		".section .fixup,\"ax\"\n"				   \
-		"3:	movq %5,%0\n"					   \
-		"	jmp 2b\n"					   \
-		".previous\n"						   \
-		_ASM_EXTABLE(0b,3b)					   \
-		: "=&r"(res), "=&c"(count), "=&a" (__d0), "=&S" (__d1),	   \
-		  "=&D" (__d2)						   \
-		: "i"(-EFAULT), "0"(count), "1"(count), "3"(src), "4"(dst) \
-		: "memory");						   \
-} while (0)
-
-long
-__strncpy_from_user(char *dst, const char __user *src, long count)
-{
-	long res;
-	__do_strncpy_from_user(dst, src, count, res);
-	return res;
-}
-EXPORT_SYMBOL(__strncpy_from_user);
-
-long
-strncpy_from_user(char *dst, const char __user *src, long count)
-{
-	long res = -EFAULT;
-	if (access_ok(VERIFY_READ, src, 1))
-		return __strncpy_from_user(dst, src, count);
-	return res;
-}
-EXPORT_SYMBOL(strncpy_from_user);
-
-/*
  * Zero Userspace
  */
 


* Re: x86: faster strncpy_from_user()
  2012-04-06 21:32 x86: faster strncpy_from_user() Linus Torvalds
@ 2012-04-06 21:49 ` Linus Torvalds
  2012-04-10 22:35 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2012-04-06 21:49 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, linux-arch

On Fri, Apr 6, 2012 at 2:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, this works for me on x86 (you need current tip-of-git for the
> <asm/word-at-a-time.h> thing) and is quite noticeable in profiles
> (well, if you have loads that do a lot of pathname lookups: I use a
> microbenchmark that does ten million lookups of a filename with
> multiple components, but I also do "make -j" profiles of a fully built
> tree).

Some actual numbers: my microbenchmark used to take just over 9s
(best-of-six: 9.053s) back before the dcache lookup optimizations.

With some of the __d_lookup_rcu cleanups, and doing dentry_cmp() a
word at a time, that best-of-six dropped to 8.504s, and then
eventually with the full word-wide hashing it dropped to 8.013s. Some
more optimization got it consistently into the mid-sevens, which is
roughly where 3.4-rc1 was.

Today, my "best-of-six" numbers are 6.301s for that same binary, on
the same machine. Now, the numbers aren't hugely stable, and I didn't
do the same pre-population of the dcache that I did with the earlier
numbers, so it's not entirely comparable. But it's approaching a 30%
improvement over the original code.

Admittedly just on a silly micro-benchmark, but still: all that
benchmark does is just "stat()" the same path over and over - which is
not a totally unrealistic thing to do. The "same path" means that it
doesn't show the cache effects as much (and the profiles are much
"sharper"). And while the "make -j" thing clearly shows different
profiles, the top five kernel functions have actually been pretty
consistent between the two cases, even if the details change.

Note: extra good numbers are with all my (uncommitted) selinux fixes
too, and there is a fair amount of fluctuation in the numbers (lots
and lots of cache effects), but the strncpy change did make a
noticeable difference. Earlier today, I had a hard time getting the
number under 7s. Getting to 6.3 was partly strncpy, but about half of
it was my very experimental selinux stuff. Some of which might never
end up being committed, who knows..

                     Linus


* Re: x86: faster strncpy_from_user()
  2012-04-06 21:32 x86: faster strncpy_from_user() Linus Torvalds
  2012-04-06 21:49 ` Linus Torvalds
@ 2012-04-10 22:35 ` Benjamin Herrenschmidt
  2012-04-10 22:50   ` Linus Torvalds
  2012-04-10 23:25   ` David Miller
  1 sibling, 2 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2012-04-10 22:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, linux-arch

On Fri, 2012-04-06 at 14:32 -0700, Linus Torvalds wrote:
> Ok, as some of you are aware, one of the things that got merged very
> early in the 3.4 merge window was the "word-at-a-time" filename lookup
> patches I had been working on. They only get enabled on x86, but when
> they do, they do speed things up by quite a noticeable bit (mainly on
> x86-64, which ends up doing things 8 bytes at a time - it's much less
> noticeable on x86-32).

Talking of which ... I haven't had much time to look but any reason that
wouldn't work on BE platforms as well when they have a fast
byteswap-load? Now powerpc sadly only has up to 32-bit byteswap loads
so doing 64-bit requires a bit of shifting around but the result might
still be faster than loading individual bytes especially since we do
have a bunch of registers to spare....

Something along the lines of

-		a = *(unsigned long *)name;
+		a = le64_to_cpup((__le64 *)name);

(etc...)

Maybe ?

I might have a chance to actually test later today (chasing some
regressions goes first)

Cheers,
Ben.


* Re: x86: faster strncpy_from_user()
  2012-04-10 22:35 ` Benjamin Herrenschmidt
@ 2012-04-10 22:50   ` Linus Torvalds
  2012-04-10 23:29     ` David Miller
  2012-04-10 23:25   ` David Miller
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2012-04-10 22:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Ingo Molnar, H. Peter Anvin, the arch/x86 maintainers, linux-arch

On Tue, Apr 10, 2012 at 3:35 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Talking of which ... I haven't had much time to look but any reason that
> wouldn't work on BE platforms as well when they have a fast
> byteswap-load

No reason. Talk to Davem - I know he was looking at doing the whole
dcache-by-word thing on sparc.

Sparc has the added complication of doing slow unaligneds, though - I
think you might be in a better situation than that on ppc (at least
some of them).

> Now powerpc sadly only has up to 32-bit byteswap loads
> so doing 64-bit requires a bit of shifting around but the result might
> still be faster than loading individual bytes especially since we do
> have a bunch of registers to spare....

So one thing you might want to look into is to only do the byte-swap
*outside* the loop. You can do the "does it have zero or slash bytes"
inside the loop with the big-endian values, and then you can
re-compute it at the end with the byte-swapped one.  It's a few extra
ops, but it shouldn't be too bad.

Of course, for the actual dcache lookup, the loop count really does
tend to be just one or two, because you work one component at a time.
So you might just want to do the byte swapping inside the loop, in
order to not have to re-do the zero/slash detection afterwards.

For the "strncpy_from_user()", you only have the 'detect zeroes', and
the loop count is often noticeable (whole pathname), so it might make
sense to do that outside the loop.
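
As a purely illustrative sketch (the byte-swap helper is hand-waved
here - a real version would pick swab64() or swab32() depending on
word size), the word loop from the patch could keep the native
big-endian load and only swap once a zero byte has actually been
spotted:

	while (max >= sizeof(unsigned long)) {
		unsigned long c;

		if (unlikely(__get_user(c, (unsigned long __user *)(src+res))))
			break;
		*(unsigned long *)(dst+res) = c;
		if (has_zero(c)) {
			/* detection is endian-agnostic, only "which byte" wants LE */
			c = swab64(c);
			return res + count_bytes(has_zero(c));
		}
		res += sizeof(unsigned long);
		max -= sizeof(unsigned long);
	}

That keeps the swap off the hot path entirely - it runs at most once
per call.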

> Maybe ?
>
> I might have a chance to actually test later today (chasing some
> regressions goes first)

Try it out. I used three different "benchmarks" for profiling:
 - the "stat() same file 10 million times" (to avoid cache misses)
 - the "make -j" on a fully build kernel (to see a "real load")
 - a "git diff" with "core.preloadindex=true" on a git repository that
is just a collection of 16 kernel trees side-by-side (it just does a
lot of 'lstat()' calls in parallel threads, and shows cache misses
but unlike "make" has almost zero actual user space costs)

The "stat ten million times" is the one that is worth most to test the
word-at-a-time things, because the "lots of files" cases really do
tend to do a lot of D$ misses, both on the dentry hash chains, the
inode accesses, and the security layer adds its own horribly
inode->i_security dereference.
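
(For reference, that microbenchmark is basically just a loop along
these lines - the path shown is only an example, anything with a few
components that stays in the dcache will do:

    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;
        int i;

        /* same multi-component path every time, so it stays cache-hot */
        for (i = 0; i < 10000000; i++)
            stat("/usr/src/linux/fs/namei.c", &st);
        return 0;
    }

and then you just profile the kernel side while it runs.)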

                   Linus


* Re: x86: faster strncpy_from_user()
  2012-04-10 22:35 ` Benjamin Herrenschmidt
  2012-04-10 22:50   ` Linus Torvalds
@ 2012-04-10 23:25   ` David Miller
  2012-04-11  0:34     ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: David Miller @ 2012-04-10 23:25 UTC (permalink / raw)
  To: benh; +Cc: torvalds, mingo, hpa, x86, linux-arch

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Wed, 11 Apr 2012 08:35:21 +1000

> On Fri, 2012-04-06 at 14:32 -0700, Linus Torvalds wrote:
>> Ok, as some of you are aware, one of the things that got merged very
>> early in the 3.4 merge window was the "word-at-a-time" filename lookup
>> patches I had been working on. They only get enabled on x86, but when
>> they do, they do speed things up by quite a noticeable bit (mainly on
>> x86-64, which ends up doing things 8 bytes at a time - it's much less
>> noticeable on x86-32).
> 
> Talking of which ... I haven't had much time to look but any reason that
> wouldn't work on BE platforms as well when they have a fast
> byteswap-load? Now powerpc sadly only has up to 32-bit byteswap loads
> so doing 64-bit requires a bit of shifting around but the result might
> still be faster than loading individual bytes especially since we do
> have a bunch of registers to spare....
> 
> Something along the lines of
> 
> -		a = *(unsigned long *)name;
> +		a = le64_to_cpup((__le64 *)name);

Yes, you're right:

https://plus.google.com/101384639386588513837/posts/737GNZPe2q6

and we can use the trick that strlen uses in glibc on powerpc
and sparc to avoid the alignment issues in strncpy_from_user().

For the DCACHE hashing bits, though, we'll need to use a barrel-shifted
loop on platforms that can't handle unaligned loads well.  The way the
hashing computation works, we can't use the same alignment trick we
use for things like strlen and strncpy.

Although that would be a neat trick: making the hashing function work
in such a way that if we simply rewound the initial pointer to be
aligned and put 0xff into the alignment bytes, the hash would still
compute properly.


* Re: x86: faster strncpy_from_user()
  2012-04-10 22:50   ` Linus Torvalds
@ 2012-04-10 23:29     ` David Miller
  2012-04-10 23:33       ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2012-04-10 23:29 UTC (permalink / raw)
  To: torvalds; +Cc: benh, mingo, hpa, x86, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 10 Apr 2012 15:50:49 -0700

> For the "strncpy_from_user()", you only have the 'detect zeroes', and
> the loop count is often noticeable (whole pathname), so it might make
> sense to do that outside the loop.

Just wanted to mention that handling the detect-zeroes operation on
cpus that require alignment is easy: just rewind the pointer at the
beginning to be aligned and "or" in a mask of 0xff for each alignment
pad byte into the initially loaded word.

That's how we do it in glibc on sparc and powerpc already.
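
In rough C it looks something like this - an illustrative sketch only,
not the actual glibc code, with 's' being the start of the string and
the endian difference reduced to an ifdef:

	unsigned long align = (unsigned long)s & (sizeof(long) - 1);
	const unsigned long *p = (const unsigned long *)(s - align);
	unsigned long word = *p;

	/* force the pad bytes to 0xff so they can never look like a NUL */
	#ifdef __BIG_ENDIAN
	word |= ~(~0ul >> (8 * align));		/* pad bytes are the high bytes */
	#else
	word |= (1ul << (8 * align)) - 1;	/* pad bytes are the low bytes */
	#endif
	/* from here on, zero-byte detection only fires on real string bytes */

and the rest of the loop can then use aligned loads throughout.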

The only reason to go to little-endian is 1) to potentially save a
small number of cycles on the loop exit (and frankly, I think that if
you do another load to accomplish that, you'll annihilate all the
gains) and 2) to make the "strlen + hash function computation" case
easier.


* Re: x86: faster strncpy_from_user()
  2012-04-10 23:29     ` David Miller
@ 2012-04-10 23:33       ` H. Peter Anvin
  2012-04-10 23:56         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2012-04-10 23:33 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, benh, mingo, x86, linux-arch

On 04/10/2012 04:29 PM, David Miller wrote:
> 
> Just wanted to mention that handling the detect-zeroes operation on
> cpus that require alignment is easy: just rewind the pointer at the
> beginning to be aligned and "or" in a mask of 0xff for each alignment
> pad byte into the initially loaded word.
> 

Even on machines which don't require alignment it will still be faster
to do aligned memory references only, not counting the startup cost
(which is substantial in this case, of course, since the average length
is so short.)  However, it also neatly avoids the page overrun problem.

	-hpa


* Re: x86: faster strncpy_from_user()
  2012-04-10 23:33       ` H. Peter Anvin
@ 2012-04-10 23:56         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2012-04-10 23:56 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: David Miller, torvalds, mingo, x86, linux-arch

On Tue, 2012-04-10 at 16:33 -0700, H. Peter Anvin wrote:
> > Just wanted to mention that handling the detect-zeroes operation on
> > cpus that require alignment is easy: just rewind the pointer at the
> > beginning to be aligned and "or" in a mask of 0xff for each alignment
> > pad byte into the initially loaded word.
> > 
> 
> Even on machines which don't require alignment it will still be faster
> to do aligned memory references only, not counting the startup cost
> (which is substantial in this case, of course, since the average length
> is so short.)  However, it also neatly avoids the page overrun problem.

I'm leaning toward that too, but I want to do some benches. The main
issues for me are:

  - I have to deal with a reasonably wide range of different cores which
will handle unaligned accesses very differently. Almost all will do it
in HW, but with widely varying degrees of performance, and some will
occasionally trap (SW emulation kicks in, but that's extremely slow). The
trapping case is generally rare though; depending on the core, it will
happen on things like page boundaries or segment boundaries. I also
suspect that the byte-reverse load/store instructions will suck more at
unaligned.

 - The page overrun is an issue. On 64-bit we don't have anything mapped
past the end of the linear mapping and on 32-bit we fall into ioremap
space. That's fixable with a quick hack to add one more page to the
linear mapping, creating a double mapping of either page 0 or any random
page of memory; I don't have cache aliases or anything like that to
worry about, but it's gross.

Anyways, I'll try to play around if I get time, might have to wait for
next week tho, I have some more urgent stuff to sort out and I'm off
friday to tuesday.

Cheers,
Ben.


* Re: x86: faster strncpy_from_user()
  2012-04-10 23:25   ` David Miller
@ 2012-04-11  0:34     ` Linus Torvalds
  2012-04-11  0:43       ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2012-04-11  0:34 UTC (permalink / raw)
  To: David Miller; +Cc: benh, mingo, hpa, x86, linux-arch

On Tue, Apr 10, 2012 at 4:25 PM, David Miller <davem@davemloft.net> wrote:
>
> Although that would be a neat trick: making the hashing function work
> in such a way that if we simply rewound the initial pointer to be
> aligned and put 0xff into the alignment bytes, the hash would still
> compute properly.

I cannot imagine any worthwhile hash that works like that.

Any hash that magically works with aligned data implies basically
throwing away all the actual real bits of information in a file name,
notably all the "where are the letters" information.

So the hashing really *does* need unaligned loads (or the
shift-things-by-hand to emulate it).

Sparc has "faligndata", no? Power generally does unaligned loads
"reasonably ok", although I seem to recall some horrible 7-cycle
micro-trap on at least some micro-architecture (and a page-crossing
fault fixup in software, but that's not going to happen in practice:
the pathname is allocated from a page in getname()).

                      Linus


* Re: x86: faster strncpy_from_user()
  2012-04-11  0:34     ` Linus Torvalds
@ 2012-04-11  0:43       ` David Miller
  2012-04-11  0:50         ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2012-04-11  0:43 UTC (permalink / raw)
  To: torvalds; +Cc: benh, mingo, hpa, x86, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 10 Apr 2012 17:34:14 -0700

> Sparc has "faligndata", no?

It's an FPU instruction :-)

The barrel shift would be the fastest on sparc.


* Re: x86: faster strncpy_from_user()
  2012-04-11  0:43       ` David Miller
@ 2012-04-11  0:50         ` Linus Torvalds
  2012-04-11  0:57           ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2012-04-11  0:50 UTC (permalink / raw)
  To: David Miller; +Cc: benh, mingo, hpa, x86, linux-arch

On Tue, Apr 10, 2012 at 5:43 PM, David Miller <davem@davemloft.net> wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue, 10 Apr 2012 17:34:14 -0700
>
>> Sparc has "faligndata", no?
>
> It's an FPU instruction :-)

Ok.

> The barrel shift would be the fastest on sparc.

.. and you don't have a double shift, right? So you'd need to do two
shifts and the or for each word.

At which point it is probably pointless to even bother on sparc32,
since you only get four bytes per loop. But on 64-bit it presumably
ends up being at least a slight win.

                        Linus


* Re: x86: faster strncpy_from_user()
  2012-04-11  0:50         ` Linus Torvalds
@ 2012-04-11  0:57           ` Linus Torvalds
  2012-04-11  1:09             ` David Miller
  2012-04-11  1:25             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2012-04-11  0:57 UTC (permalink / raw)
  To: David Miller; +Cc: benh, mingo, hpa, x86, linux-arch

On Tue, Apr 10, 2012 at 5:50 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> .. and you don't have a double shift, right? So you'd need to do two
> shifts and the or for each word.

Actually, if sparc has a "rotate" instruction, you can do it with a
single shift (rotate) per word loop.

You need to set up a mask register based on the alignment, and
pre-load the first word, but if you do have a rotate you can rotate
and then use "and mask" first to generate the "high bits" of the
current word, and then use the "andn mask" to generate the low bits of
the next word. So then you just need a single rotate per loop, and
some (very minor) loop prep.
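
In C-ish form, just to show the data flow (illustrative only - a real
version would be in the arch's asm, 'rotr' is a hypothetical
rotate-right helper, and the mask orientation here assumes
little-endian byte order):

	unsigned int shift = 8 * ((unsigned long)s & (sizeof(long) - 1));
	unsigned long mask = shift ? ~0ul >> shift : ~0ul; /* bits from the current word */
	const unsigned long *p = (const unsigned long *)((unsigned long)s & ~(sizeof(long) - 1));
	unsigned long prev = rotr(*p++, shift);	/* rotr(): hypothetical rotate helper */

	for (;;) {
		unsigned long next = rotr(*p++, shift);	/* the single rotate per loop */
		unsigned long word = (prev & mask) | (next & ~mask);

		/* ... has_zero()/hash on 'word', stop on NUL or '/' (elided) ... */
		prev = next;
	}

Setting up 'mask' and the first rotated word is the "(very minor) loop
prep".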

Of course, RISC people tended to throw out rotate too, so maybe you
don't have even that.

                 Linus


* Re: x86: faster strncpy_from_user()
  2012-04-11  0:57           ` Linus Torvalds
@ 2012-04-11  1:09             ` David Miller
  2012-04-11  1:18               ` Linus Torvalds
  2012-04-11  1:25             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 16+ messages in thread
From: David Miller @ 2012-04-11  1:09 UTC (permalink / raw)
  To: torvalds; +Cc: benh, mingo, hpa, x86, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 10 Apr 2012 17:57:07 -0700

> On Tue, Apr 10, 2012 at 5:50 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> .. and you don't have a double shift, right? So you'd need to do two
>> shifts and the or for each word.
> 
> Actually, if sparc has a "rotate" instruction, you can do it with a
> single shift (rotate) per word loop.

Sorry, no rotate :-)


* Re: x86: faster strncpy_from_user()
  2012-04-11  1:09             ` David Miller
@ 2012-04-11  1:18               ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2012-04-11  1:18 UTC (permalink / raw)
  To: David Miller; +Cc: benh, mingo, hpa, x86, linux-arch

On Tue, Apr 10, 2012 at 6:09 PM, David Miller <davem@davemloft.net> wrote:
>
> Sorry, no rotate :-)

Remind me why you work with that awful architecture again?

Your girlfriend doesn't like to do the whole whips and chains thing?
So you need to spice up your life other ways?

                  Linus


* Re: x86: faster strncpy_from_user()
  2012-04-11  0:57           ` Linus Torvalds
  2012-04-11  1:09             ` David Miller
@ 2012-04-11  1:25             ` Benjamin Herrenschmidt
  2012-04-11  8:22               ` Geert Uytterhoeven
  1 sibling, 1 reply; 16+ messages in thread
From: Benjamin Herrenschmidt @ 2012-04-11  1:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Miller, mingo, hpa, x86, linux-arch

On Tue, 2012-04-10 at 17:57 -0700, Linus Torvalds wrote:
> On Tue, Apr 10, 2012 at 5:50 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > .. and you don't have a double shift, right? So you'd need to do two
> > shifts and the or for each word.
> 
> Actually, if sparc has a "rotate" instruction, you can do it with a
> single shift (rotate) per word loop.
> 
> You need to set up a mask register based on the alignment, and
> pre-load the first word, but if you do have a rotate you can rotate
> and then use "and mask" first to generate the "high bits" of the
> current word, and then use the "andn mask" to generate the low bits of
> the next word. So then you just need a single rotate per loop, and
> some (very minor) loop prep.
> 
> Of course, RISC people tended to throw out rotate too, so maybe you
> don't have even that.

Well, we do have a very nice & flexible rotate & mask on ppc at
least :-)

Cheers,
Ben.


* Re: x86: faster strncpy_from_user()
  2012-04-11  1:25             ` Benjamin Herrenschmidt
@ 2012-04-11  8:22               ` Geert Uytterhoeven
  0 siblings, 0 replies; 16+ messages in thread
From: Geert Uytterhoeven @ 2012-04-11  8:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, David Miller, mingo, hpa, x86, linux-arch

On Wed, Apr 11, 2012 at 03:25, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>> Of course, RISC people tended to throw out rotate too, so maybe you
>> don't have even that.
>
> Well, we do have a very nice & flexible rotate & mask on ppc at
> least :-)

PPC is not RISC. It's LSWNOWCITCSCIOC.

(let-see-which-new-operations-we-can-invent-that-can-still-complete-in-one-cycle
;-)

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

