* Arch maintainers Ahoy! (was Re: x86: faster strncpy_from_user())
@ 2012-05-21 16:50 Linus Torvalds
  2012-05-23  5:46 ` Arch maintainers Ahoy! David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-21 16:50 UTC (permalink / raw)
  To: linux-arch

Ok, several weeks ago I sent out an email about the x86
strncpy_from_user() patch I did, and cc'd linux-arch because the
second stage might well affect other architectures.

Nobody reacted to it, and today that second stage was merged into
mainline, so I thought I'd re-do the notice. The relevant part of the
original explanation is appended, but I'll explain once more..

On x86, we *used* to be very lazy about strncpy_from_user(), and not
actually do the full proper error handling: the code was (for
historical reasons) using inline asms where it was very inconvenient
to actually check the proper end of the address space. And nobody had
really ever bothered to fix it up. As a result, the main user
(getname()) had its own special magic workarounds for this.

Until 3.4. We now do strncpy_from_user() properly, with all the
appropriate error cases of hitting the end of the user address space
etc handled.

And now that the 3.5 merge window is open, some of the first merging I
did was my own branches that had been pending. Including the one that
removes all the hackery from getname(), and expects
strncpy_from_user() to just work right.

If any architecture copied the lazy x86 strncpy_from_user(), they
really do need to fix themselves now. You can actually take the new
all-C x86 implementation, and just strip away the word-at-a-time
optimizations, and it should be pretty generic (if perhaps not
optimal).

The thing to really look out for is when the top of the user address
space abuts the kernel address space. If you *only* rely on page
faults, the user can read kernel addresses by mapping the very last
page and then putting a non-terminated string at the end of that page,
and trying to open a file using that name. So you need to either have
a guard page (guaranteeing a fault), or you need to take TASK_SIZE or
similar into account. Because getname() no longer does the TASK_SIZE
games for you.
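The clamp that getname() used to do (and that each architecture must now do itself) can be sketched in a few lines. This is illustrative only: the names are hypothetical, -1 stands in for -EFAULT, and the 3G boundary in the examples is just one possible TASK_SIZE.

```c
/*
 * Hypothetical sketch: compute how many bytes may safely be read
 * starting at src_addr before hitting the top of the user address
 * space.  Returns -1 (standing in for -EFAULT) if src_addr already
 * points at or past the boundary.
 */
long clamp_user_copy(unsigned long src_addr, long count,
		     unsigned long task_size)
{
	unsigned long max;

	if (src_addr >= task_size)	/* starts in kernel space */
		return -1;

	max = task_size - src_addr;	/* bytes until the boundary */
	if ((unsigned long)count < max)
		max = (unsigned long)count;
	return (long)max;
}
```

A fault-only scheme would miss exactly the case described above: a non-terminated string on the last user page that runs straight into kernel memory without ever faulting.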

                       Linus

On Fri, Apr 6, 2012 at 2:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> but what is perhaps more interesting (and the reason linux-arch is
> cc'd) is that not only is it noticeably faster, it should also
> actually get all the error cases right.
>
> What do I mean by that?
>
> I mean that in my tree (but not committed), I have actually removed
> this nasty piece of code from fs/namei.c:
>
>  diff --git a/fs/namei.c b/fs/namei.c
>  index 0062dd17eb55..2b7d691523bb 100644
>  --- a/fs/namei.c
>  +++ b/fs/namei.c
>  @@ -119,14 +119,7 @@
>   static int do_getname(const char __user *filename, char *page)
>   {
>          int retval;
>  -       unsigned long len = PATH_MAX;
>  -
>  -       if (!segment_eq(get_fs(), KERNEL_DS)) {
>  -               if ((unsigned long) filename >= TASK_SIZE)
>  -                       return -EFAULT;
>  -               if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
>  -                       len = TASK_SIZE - (unsigned long) filename;
>  -       }
>  +       const unsigned long len = PATH_MAX;
>
>          retval = strncpy_from_user(page, filename, len);
>          if (retval > 0) {
>
> which worked around the fact that strncpy_from_user() didn't really
> get the end conditions quite right, and it really just did an
> access_ok() on the first byte. The new x86 implementation should get
> all of that right without any need for any ugly hacks.
>
> Now, I haven't committed that change to do_getname(), but it *will*
> get committed at some point.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Arch maintainers Ahoy!
  2012-05-21 16:50 Arch maintainers Ahoy! (was Re: x86: faster strncpy_from_user()) Linus Torvalds
@ 2012-05-23  5:46 ` David Miller
  2012-05-23  8:02   ` Geert Uytterhoeven
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2012-05-23  5:46 UTC (permalink / raw)
  To: torvalds; +Cc: linux-arch


Thanks for the reminder, although I did still have this on my TODO
list somewhere :-)

I'll do something like the following (untested) for sparc.

--------------------
[PATCH] sparc: Add full proper error handling to strncpy_from_user().

Linus removed the end-of-address-space hackery from
fs/namei.c:do_getname() so we really have to validate these edge
conditions and cannot cheat any more (as x86 used to as well).

Move to a common C implementation like x86 did.  And if both
src and dst are sufficiently aligned we'll do word at a time
copies and checks as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/uaccess.h      |    3 +
 arch/sparc/include/asm/uaccess_32.h   |   10 ---
 arch/sparc/include/asm/uaccess_64.h   |    4 -
 arch/sparc/lib/Makefile               |    2 +-
 arch/sparc/lib/ksyms.c                |    3 -
 arch/sparc/lib/strncpy_from_user_32.S |   47 ------------
 arch/sparc/lib/strncpy_from_user_64.S |  133 ---------------------------------
 arch/sparc/lib/usercopy.c             |  132 ++++++++++++++++++++++++++++++++
 8 files changed, 136 insertions(+), 198 deletions(-)
 delete mode 100644 arch/sparc/lib/strncpy_from_user_32.S
 delete mode 100644 arch/sparc/lib/strncpy_from_user_64.S

diff --git a/arch/sparc/include/asm/uaccess.h b/arch/sparc/include/asm/uaccess.h
index e88fbe5..42a28cf 100644
--- a/arch/sparc/include/asm/uaccess.h
+++ b/arch/sparc/include/asm/uaccess.h
@@ -5,4 +5,7 @@
 #else
 #include <asm/uaccess_32.h>
 #endif
+
+extern long strncpy_from_user(char *dest, const char __user *src, long count);
+
 #endif
diff --git a/arch/sparc/include/asm/uaccess_32.h b/arch/sparc/include/asm/uaccess_32.h
index d50c310..59586b5 100644
--- a/arch/sparc/include/asm/uaccess_32.h
+++ b/arch/sparc/include/asm/uaccess_32.h
@@ -304,16 +304,6 @@ static inline unsigned long clear_user(void __user *addr, unsigned long n)
 		return n;
 }
 
-extern long __strncpy_from_user(char *dest, const char __user *src, long count);
-
-static inline long strncpy_from_user(char *dest, const char __user *src, long count)
-{
-	if (__access_ok((unsigned long) src, count))
-		return __strncpy_from_user(dest, src, count);
-	else
-		return -EFAULT;
-}
-
 extern long __strlen_user(const char __user *);
 extern long __strnlen_user(const char __user *, long len);
 
diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h
index a1091afb..dcdfb89 100644
--- a/arch/sparc/include/asm/uaccess_64.h
+++ b/arch/sparc/include/asm/uaccess_64.h
@@ -257,10 +257,6 @@ extern unsigned long __must_check __clear_user(void __user *, unsigned long);
 
 #define clear_user __clear_user
 
-extern long __must_check __strncpy_from_user(char *dest, const char __user *src, long count);
-
-#define strncpy_from_user __strncpy_from_user
-
 extern long __strlen_user(const char __user *);
 extern long __strnlen_user(const char __user *, long len);
 
diff --git a/arch/sparc/lib/Makefile b/arch/sparc/lib/Makefile
index 389628f..943d98d 100644
--- a/arch/sparc/lib/Makefile
+++ b/arch/sparc/lib/Makefile
@@ -10,7 +10,7 @@ lib-y                 += strlen.o
 lib-y                 += checksum_$(BITS).o
 lib-$(CONFIG_SPARC32) += blockops.o
 lib-y                 += memscan_$(BITS).o memcmp.o strncmp_$(BITS).o
-lib-y                 += strncpy_from_user_$(BITS).o strlen_user_$(BITS).o
+lib-y                 += strlen_user_$(BITS).o
 lib-$(CONFIG_SPARC32) += divdi3.o udivdi3.o
 lib-$(CONFIG_SPARC32) += copy_user.o locks.o
 lib-$(CONFIG_SPARC64) += atomic_64.o
diff --git a/arch/sparc/lib/ksyms.c b/arch/sparc/lib/ksyms.c
index 2dc3087..6b278ab 100644
--- a/arch/sparc/lib/ksyms.c
+++ b/arch/sparc/lib/ksyms.c
@@ -33,9 +33,6 @@ EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__bzero);
 
-/* Moving data to/from/in userspace. */
-EXPORT_SYMBOL(__strncpy_from_user);
-
 /* Networking helper routines. */
 EXPORT_SYMBOL(csum_partial);
 
diff --git a/arch/sparc/lib/strncpy_from_user_32.S b/arch/sparc/lib/strncpy_from_user_32.S
deleted file mode 100644
index db0ed29..0000000
--- a/arch/sparc/lib/strncpy_from_user_32.S
+++ /dev/null
@@ -1,47 +0,0 @@
-/* strncpy_from_user.S: Sparc strncpy from userspace.
- *
- *  Copyright(C) 1996 David S. Miller
- */
-
-#include <linux/linkage.h>
-#include <asm/ptrace.h>
-#include <asm/errno.h>
-
-	.text
-
-	/* Must return:
-	 *
-	 * -EFAULT		for an exception
-	 * count		if we hit the buffer limit
-	 * bytes copied		if we hit a null byte
-	 */
-
-ENTRY(__strncpy_from_user)
-	/* %o0=dest, %o1=src, %o2=count */
-	mov	%o2, %o3
-1:
-	subcc	%o2, 1, %o2
-	bneg	2f
-	 nop
-10:
-	ldub	[%o1], %o4
-	add	%o0, 1, %o0
-	cmp	%o4, 0
-	add	%o1, 1, %o1
-	bne	1b
-	 stb	%o4, [%o0 - 1]
-2:
-	add	%o2, 1, %o0
-	retl
-	 sub	%o3, %o0, %o0
-ENDPROC(__strncpy_from_user)
-
-	.section .fixup,#alloc,#execinstr
-	.align	4
-4:
-	retl
-	 mov	-EFAULT, %o0
-
-	.section __ex_table,#alloc
-	.align	4
-	.word	10b, 4b
diff --git a/arch/sparc/lib/strncpy_from_user_64.S b/arch/sparc/lib/strncpy_from_user_64.S
deleted file mode 100644
index d1246b7..0000000
--- a/arch/sparc/lib/strncpy_from_user_64.S
+++ /dev/null
@@ -1,133 +0,0 @@
-/*
- * strncpy_from_user.S: Sparc64 strncpy from userspace.
- *
- *  Copyright (C) 1997, 1999 Jakub Jelinek (jj@ultra.linux.cz)
- */
-
-#include <linux/linkage.h>
-#include <asm/asi.h>
-#include <asm/errno.h>
-
-	.data
-	.align	8
-0:	.xword	0x0101010101010101
-
-	.text
-
-	/* Must return:
-	 *
-	 * -EFAULT		for an exception
-	 * count		if we hit the buffer limit
-	 * bytes copied		if we hit a null byte
-	 * (without the null byte)
-	 *
-	 * This implementation assumes:
-	 * %o1 is 8 aligned => !(%o2 & 7)
-	 * %o0 is 8 aligned (if not, it will be slooooow, but will work)
-	 *
-	 * This is optimized for the common case:
-	 * in my stats, 90% of src are 8 aligned (even on sparc32)
-	 * and average length is 18 or so.
-	 */
-
-ENTRY(__strncpy_from_user)
-	/* %o0=dest, %o1=src, %o2=count */
-	andcc	%o1, 7, %g0		! IEU1	Group
-	bne,pn	%icc, 30f		! CTI
-	 add	%o0, %o2, %g3		! IEU0
-60:	ldxa	[%o1] %asi, %g1		! Load	Group
-	brlez,pn %o2, 10f		! CTI
-	 mov	%o0, %o3		! IEU0
-50:	sethi	%hi(0b), %o4		! IEU0	Group
-	ldx	[%o4 + %lo(0b)], %o4	! Load
-	sllx	%o4, 7, %o5		! IEU1	Group
-1:	sub	%g1, %o4, %g2		! IEU0	Group
-	stx	%g1, [%o0]		! Store
-	add	%o0, 8, %o0		! IEU1
-	andcc	%g2, %o5, %g0		! IEU1	Group
-	bne,pn	%xcc, 5f		! CTI
-	 add	%o1, 8, %o1		! IEU0
-	cmp	%o0, %g3		! IEU1	Group
-	bl,a,pt %xcc, 1b		! CTI
-61:	 ldxa	[%o1] %asi, %g1		! Load
-10:	retl				! CTI	Group
-	 mov	%o2, %o0		! IEU0
-5:	srlx	%g2, 32, %g7		! IEU0	Group
-	sethi	%hi(0xff00), %o4	! IEU1
-	andcc	%g7, %o5, %g0		! IEU1	Group
-	be,pn	%icc, 2f		! CTI
-	 or	%o4, %lo(0xff00), %o4	! IEU0
-	srlx	%g1, 48, %g7		! IEU0	Group
-	andcc	%g7, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 50f		! CTI
-	 andcc	%g7, 0xff, %g0		! IEU1	Group
-	be,pn	%icc, 51f		! CTI
-	 srlx	%g1, 32, %g7		! IEU0
-	andcc	%g7, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 52f		! CTI
-	 andcc	%g7, 0xff, %g0		! IEU1	Group
-	be,pn	%icc, 53f		! CTI
-2:	 andcc	%g2, %o5, %g0		! IEU1	Group
-	be,pn	%icc, 2f		! CTI
-	 srl	%g1, 16, %g7		! IEU0
-	andcc	%g7, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 54f		! CTI
-	 andcc	%g7, 0xff, %g0		! IEU1	Group
-	be,pn	%icc, 55f		! CTI
-	 andcc	%g1, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 56f		! CTI
-	 andcc	%g1, 0xff, %g0		! IEU1	Group
-	be,a,pn	%icc, 57f		! CTI
-	 sub	%o0, %o3, %o0		! IEU0
-2:	cmp	%o0, %g3		! IEU1	Group
-	bl,a,pt	%xcc, 50b		! CTI
-62:	 ldxa	[%o1] %asi, %g1		! Load
-	retl				! CTI	Group
-	 mov	%o2, %o0		! IEU0
-50:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 8, %o0
-51:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 7, %o0
-52:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 6, %o0
-53:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 5, %o0
-54:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 4, %o0
-55:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 3, %o0
-56:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 2, %o0
-57:	retl
-	 sub	%o0, 1, %o0
-30:	brlez,pn %o2, 3f
-	 sub	%g0, %o2, %o3
-	add	%o0, %o2, %o0
-63:	lduba	[%o1] %asi, %o4
-1:	add	%o1, 1, %o1
-	brz,pn	%o4, 2f
-	 stb	%o4, [%o0 + %o3]
-	addcc	%o3, 1, %o3
-	bne,pt	%xcc, 1b
-64:	 lduba	[%o1] %asi, %o4
-3:	retl
-	 mov	%o2, %o0
-2:	retl
-	 add	%o2, %o3, %o0
-ENDPROC(__strncpy_from_user)
-
-	.section __ex_table,"a"
-	.align	4
-	.word	60b, __retl_efault
-	.word	61b, __retl_efault
-	.word	62b, __retl_efault
-	.word	63b, __retl_efault
-	.word	64b, __retl_efault
-	.previous
diff --git a/arch/sparc/lib/usercopy.c b/arch/sparc/lib/usercopy.c
index 14b363f..4621dba 100644
--- a/arch/sparc/lib/usercopy.c
+++ b/arch/sparc/lib/usercopy.c
@@ -1,4 +1,6 @@
 #include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/errno.h>
 #include <linux/bug.h>
 
 void copy_from_user_overflow(void)
@@ -6,3 +8,133 @@ void copy_from_user_overflow(void)
 	WARN(1, "Buffer overflow detected!\n");
 }
 EXPORT_SYMBOL(copy_from_user_overflow);
+
+#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))
+
+/* Return the high bit set in the first byte that is a zero */
+static inline unsigned long has_zero(unsigned long a)
+{
+	return ((a - REPEAT_BYTE(0x01)) & ~a) & REPEAT_BYTE(0x80);
+}
+
+static inline long find_zero(unsigned long c)
+{
+#ifdef CONFIG_64BIT
+	if (!(c & 0xff00000000000000UL))
+		return 0;
+	if (!(c & 0x00ff000000000000UL))
+		return 1;
+	if (!(c & 0x0000ff0000000000UL))
+		return 2;
+	if (!(c & 0x000000ff00000000UL))
+		return 3;
+#define __OFF 4
+#else
+#define __OFF 0
+#endif
+	if (!(c & 0xff000000))
+		return __OFF + 0;
+	if (!(c & 0x00ff0000))
+		return __OFF + 1;
+	if (!(c & 0x0000ff00))
+		return __OFF + 2;
+	return __OFF + 3;
+#undef __OFF
+}
+
+/*
+ * Do a strncpy, return length of string without final '\0'.
+ * 'count' is the user-supplied count (return 'count' if we
+ * hit it), 'max' is the address space maximum (and we return
+ * -EFAULT if we hit it).
+ */
+static inline long do_strncpy_from_user(char *dst, const char __user *src, long count, unsigned long max)
+{
+	long res = 0;
+
+	/*
+	 * Truncate 'max' to the user-specified limit, so that
+	 * we only have one limit we need to check in the loop
+	 */
+	if (max > count)
+		max = count;
+
+	if (((long) dst | (long) src) & (sizeof(long) - 1))
+		goto byte_at_a_time;
+
+	while (max >= sizeof(unsigned long)) {
+		unsigned long c;
+
+		/* Fall back to byte-at-a-time if we get a page fault */
+		if (unlikely(__get_user(c,(unsigned long __user *)(src+res))))
+			break;
+		if (has_zero(c))
+			return res + find_zero(c);
+		*(unsigned long *)(dst+res) = c;
+		res += sizeof(unsigned long);
+		max -= sizeof(unsigned long);
+	}
+
+byte_at_a_time:
+	while (max) {
+		char c;
+
+		if (unlikely(__get_user(c,src+res)))
+			return -EFAULT;
+		dst[res] = c;
+		if (!c)
+			return res;
+		res++;
+		max--;
+	}
+
+	/*
+	 * Uhhuh. We hit 'max'. But was that the user-specified maximum
+	 * too? If so, that's ok - we got as much as the user asked for.
+	 */
+	if (res >= count)
+		return res;
+
+	/*
+	 * Nope: we hit the address space limit, and we still had more
+	 * characters the caller would have wanted. That's an EFAULT.
+	 */
+	return -EFAULT;
+}
+
+/**
+ * strncpy_from_user: - Copy a NUL terminated string from userspace.
+ * @dst:   Destination address, in kernel space.  This buffer must be at
+ *         least @count bytes long.
+ * @src:   Source address, in user space.
+ * @count: Maximum number of bytes to copy, including the trailing NUL.
+ *
+ * Copies a NUL-terminated string from userspace to kernel space.
+ *
+ * On success, returns the length of the string (not including the trailing
+ * NUL).
+ *
+ * If access to userspace fails, returns -EFAULT (some data may have been
+ * copied).
+ *
+ * If @count is smaller than the length of the string, copies @count bytes
+ * and returns @count.
+ */
+long strncpy_from_user(char *dst, const char __user *src, long count)
+{
+	unsigned long max_addr, src_addr;
+
+	if (unlikely(count <= 0))
+		return 0;
+
+	max_addr = ~0UL;
+	if (likely(segment_eq(get_fs(), USER_DS)))
+		max_addr = STACK_TOP;
+	src_addr = (unsigned long)src;
+	if (likely(src_addr < max_addr)) {
+		unsigned long max = max_addr - src_addr;
+		return do_strncpy_from_user(dst, src, count, max);
+	}
+	return -EFAULT;
+}
+EXPORT_SYMBOL(strncpy_from_user);
-- 
1.7.10

* Re: Arch maintainers Ahoy!
  2012-05-23  5:46 ` Arch maintainers Ahoy! David Miller
@ 2012-05-23  8:02   ` Geert Uytterhoeven
  2012-05-23  9:40     ` James Bottomley
  0 siblings, 1 reply; 25+ messages in thread
From: Geert Uytterhoeven @ 2012-05-23  8:02 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, linux-arch

On Wed, May 23, 2012 at 7:46 AM, David Miller <davem@davemloft.net> wrote:
> [PATCH] sparc: Add full proper error handling to strncpy_from_user().
>
> Linus removed the end-of-address-space hackery from
> fs/namei.c:do_getname() so we really have to validate these edge
> conditions and cannot cheat any more (as x86 used to as well).
>
> Move to a common C implementation like x86 did.  And if both
> src and dst are sufficiently aligned we'll do word at a time
> copies and checks as well.

Now everybody is rewriting this in plain C, perhaps it makes sense to
have it in lib/, so I don't have to write anything myself?

Gr{oetje,eeting}s,

                        Lazy Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

* Re: Arch maintainers Ahoy!
  2012-05-23  8:02   ` Geert Uytterhoeven
@ 2012-05-23  9:40     ` James Bottomley
  2012-05-23 15:15       ` Linus Torvalds
  2012-05-23 17:19       ` David Miller
  0 siblings, 2 replies; 25+ messages in thread
From: James Bottomley @ 2012-05-23  9:40 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: David Miller, torvalds, linux-arch

On Wed, 2012-05-23 at 10:02 +0200, Geert Uytterhoeven wrote:
> On Wed, May 23, 2012 at 7:46 AM, David Miller <davem@davemloft.net> wrote:
> > [PATCH] sparc: Add full proper error handling to strncpy_from_user().
> >
> > Linus removed the end-of-address-space hackery from
> > fs/namei.c:do_getname() so we really have to validate these edge
> > conditions and cannot cheat any more (as x86 used to as well).
> >
> > Move to a common C implementation like x86 did.  And if both
> > src and dst are sufficiently aligned we'll do word at a time
> > copies and checks as well.
> 
> Now everybody is rewriting this in plain C, perhaps it makes sense to
> have it in lib/, so I don't have to write anything myself?

I think the sparc version will work for everybody (it will certainly
work on parisc).  The only thing that might be necessary is to add these
guards:


#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
       if (((long) dst | (long) src) & (sizeof(long) - 1))
               goto byte_at_a_time;
#endif

Dave, did you want to add it to lib/ ?

James

* Re: Arch maintainers Ahoy!
  2012-05-23  9:40     ` James Bottomley
@ 2012-05-23 15:15       ` Linus Torvalds
  2012-05-23 17:21         ` David Miller
  2012-05-23 17:19       ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-23 15:15 UTC (permalink / raw)
  To: James Bottomley; +Cc: Geert Uytterhoeven, David Miller, linux-arch

On Wed, May 23, 2012 at 2:40 AM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> I think the sparc version will work for everybody (it will certainly
> work on parisc).  The only thing that might be necessary is to add these
> guards:

The sparc version looks pretty generic (hey, it's almost all the x86
version verbatim), but no, there are more details there:

 - the alignment checks you noted (would make sense in
<asm/word-at-a-time.h>: maybe an architecture could do a cut-down
version that doesn't do everything needed for dcache word access, but
is still enough for just this)

 - the byte counting depends on byte order (more
<asm/word-at-a-time.h> details) and I suspect people wouldn't want to
do it for each byte anyway: even on big-endian I suspect you can find
smarter zero-finding than doing 8 byte tests by doing a "binary
search" with "has_zero()" and shifting.

 - the "where is end of address space" is architecture-dependent
(would need some helper inline in <asm/uaccess.h>). And at least FRV
has a *low* address check too, afaik, so the range check is not
*always* just "this is the max address".

but yes, the amount of architecture-specific code is fairly small.
There *might* be some additional detail I missed, but it's still going
to be a matter of small details rather than anything else.
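The "binary search" zero-finding suggested above for big-endian can be sketched in plain C. The helper names are hypothetical; the key point is that only *presence* tests are used (which the SWAR trick answers exactly), never the per-byte mask bits, which are unreliable on big-endian because the borrow propagates upward.

```c
#include <stdint.h>

/* Nonzero iff the value contains a zero byte (classic SWAR test). */
uint32_t has_zero32(uint32_t a)
{
	return (a - 0x01010101u) & ~a & 0x80808080u;
}

uint16_t has_zero16(uint16_t a)
{
	return (uint16_t)((a - 0x0101u) & ~a & 0x8080u);
}

/*
 * Find the index of the first zero byte of a big-endian word (byte 0
 * is the most significant byte), narrowing 8 -> 4 -> 2 -> 1 instead
 * of doing eight sequential byte tests.
 * Precondition: the word contains at least one zero byte.
 */
long find_zero_be(uint64_t c)
{
	long off = 0;
	uint32_t h = (uint32_t)(c >> 32);	/* leading four bytes */
	uint16_t q;

	if (!has_zero32(h)) {
		off = 4;
		h = (uint32_t)c;
	}
	q = (uint16_t)(h >> 16);		/* leading two bytes */
	if (!has_zero16(q)) {
		off += 2;
		q = (uint16_t)h;
	}
	if (q & 0xff00)				/* leading byte nonzero */
		off += 1;
	return off;
}
```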

Btw, the x86 version these days actually zeroes the last bytes of the
word-at-a-time copy. It was cheap, since we calculate the proper mask
anyway, and it means that you never copy unspecified bytes from user
space (even when you do overrun things), but it doesn't really
*matter* - we don't really care what crap goes into the rest of the
buffer. I thought it was cleaner that way, though, and if possible I'd
suggest people try to aim for that.

                        Linus

* Re: Arch maintainers Ahoy!
  2012-05-23  9:40     ` James Bottomley
  2012-05-23 15:15       ` Linus Torvalds
@ 2012-05-23 17:19       ` David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: David Miller @ 2012-05-23 17:19 UTC (permalink / raw)
  To: James.Bottomley; +Cc: geert, torvalds, linux-arch

From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Wed, 23 May 2012 10:40:25 +0100

> On Wed, 2012-05-23 at 10:02 +0200, Geert Uytterhoeven wrote:
>> On Wed, May 23, 2012 at 7:46 AM, David Miller <davem@davemloft.net> wrote:
>> > [PATCH] sparc: Add full proper error handling to strncpy_from_user().
>> >
>> > Linus removed the end-of-address-space hackery from
>> > fs/namei.c:do_getname() so we really have to validate these edge
>> > conditions and cannot cheat any more (as x86 used to as well).
>> >
>> > Move to a common C implementation like x86 did.  And if both
>> > src and dst are sufficiently aligned we'll do word at a time
>> > copies and checks as well.
>> 
>> Now everybody is rewriting this in plain C, perhaps it makes sense to
>> have it in lib/, so I don't have to write anything myself?
> 
> I think the sparc version will work for everybody (it will certainly
> work on parisc).  The only thing that might be necessary is to add these
> guards:
> 
> 
> #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>        if (((long) dst | (long) src) & (sizeof(long) - 1))
>                goto byte_at_a_time;
> #endif
> 
> Dave, did you want to add it to lib/ ?

Yes, but if so, the fixed version which I already pushed out to the
sparc tree :-)

But frankly this was only a temporary measure; I intend to put this
back into assembler so I can optimize it properly.  I was just too
lazy to do so right now and wanted this bug fixed.

* Re: Arch maintainers Ahoy!
  2012-05-23 15:15       ` Linus Torvalds
@ 2012-05-23 17:21         ` David Miller
  2012-05-23 17:32           ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2012-05-23 17:21 UTC (permalink / raw)
  To: torvalds; +Cc: James.Bottomley, geert, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 23 May 2012 08:15:42 -0700

> Btw, the x86 version these days actually zeroes the last bytes of the
> word-at-a-time copy.

I noticed, and that was the bug in the version I posted.

I decided in the end to not do what x86 does because there's no
cheap way to do it on big-endian.  I simply moved the 'dst'
store before the has_zero() test.

* Re: Arch maintainers Ahoy!
  2012-05-23 17:21         ` David Miller
@ 2012-05-23 17:32           ` Linus Torvalds
  2012-05-23 18:16             ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-23 17:32 UTC (permalink / raw)
  To: David Miller; +Cc: James.Bottomley, geert, linux-arch

On Wed, May 23, 2012 at 10:21 AM, David Miller <davem@davemloft.net> wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed, 23 May 2012 08:15:42 -0700
>
>> Btw, the x86 version these days actually zeroes the last bytes of the
>> word-at-a-time copy.
>
> I noticed, and that was the bug in the version I posted.
>
> I decided in the end to not do what x86 does because there's no
> cheap way to do it on big-endian.  I simply moved the 'dst'
> store before the has_zero() test.

Yes, the original x86 code worked that way too, and it's how 3.4 does
things there. It results in garbage at the end of the result string,
but since we consider that to be undefined data anyway (unlike the C
library standard strncpy() that zero-fills it), it's not a big deal.

The reason I changed it on x86 was that I was a *tiny* bit worried
that some kernel user would eventually possibly expose those garbage
bytes. *IF* we were to ever have code like

   memset(buffer, 0, sizeof(buffer));
   strncpy_from_user(buffer, ptr, sizeof(buffer) - 1);

this would matter, and could expose data that the user didn't *intend*
to expose.

We don't have anything like that right now as far as I can tell (and I
did check, although it was more like a "glance through all the
users"), so it's more of a "possible future issues" thing than
anything else.
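The tail-zeroing idea can be sketched like this, in little-endian byte order for simplicity. This is a hypothetical helper, not the actual x86 kernel code (which derives its mask differently as part of the copy): keep the bytes up to and including the NUL, clear the rest, so bytes the user never supplied are not stored past the string's end.

```c
#include <stdint.h>

/*
 * Keep bytes 0..zb of a little-endian word (zb = index of the
 * terminating NUL byte) and zero everything above it.
 * Illustrative sketch only.
 */
uint64_t zero_tail_le(uint64_t c, long zb)
{
	uint64_t keep = (zb == 7) ? ~(uint64_t)0
				  : (((uint64_t)1 << ((zb + 1) * 8)) - 1);
	return c & keep;
}
```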

                 Linus

* Re: Arch maintainers Ahoy!
  2012-05-23 17:32           ` Linus Torvalds
@ 2012-05-23 18:16             ` David Miller
  2012-05-23 18:27               ` Linus Torvalds
  2012-05-24  9:40               ` David Howells
  0 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2012-05-23 18:16 UTC (permalink / raw)
  To: torvalds; +Cc: James.Bottomley, geert, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 23 May 2012 10:32:50 -0700

> Yes, the original x86 code worked that way too, and it's how 3.4 does
> thing there. It results in garbage at the end of the result string,
> but since we consider that to be undefined data anyway (unlike the C
> library standard strncpy() that zero-fills it), it's not a big deal.
> 
> The reason I changed it on x86 was that I was a *tiny* bit worried
> that some kernel user would eventually possibly expose those garbage
> bytes. *IF* we were to ever have code like
> 
>    memset(buffer, 0, sizeof(buffer));
>    strncpy_from_user(buffer, ptr, sizeof(buffer) - 1);
> 
> this would matter, and could expose data that the user didn't *intend*
> to expose.
> 
> We don't have anything like that right now as far as I can tell (and I
> did check, although it was more like a "glance through all the
> users"), so it's more of a "possible future issues" thing than
> anything else.

Right.

And believe it or not the "mask test each byte" thing is the fastest
portable code I could come up with.  Big endian just blows for making
this calculation.

If little-endian were available always (don't have this on sparc32)
and cheap (not true on any sparc64 chip), and I had population count
(only the more recent sparc64 chips do it in HW) then yes I could do
the zero byte discovery in 4 instructions.

But hey Linus I'm willing to be proven wrong, why don't you ask your
google+ programming challenge entourage for some help? :-)

For reference, here is the final version of the sparc commit; it works,
and I've been running tests on it since last night.  I'm extremely
confident the C code will work on any big-endian machine.

--------------------
[PATCH] sparc: Add full proper error handling to strncpy_from_user().

Linus removed the end-of-address-space hackery from
fs/namei.c:do_getname() so we really have to validate these edge
conditions and cannot cheat any more (as x86 used to as well).

Move to a common C implementation like x86 did.  And if both
src and dst are sufficiently aligned we'll do word at a time
copies and checks as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 arch/sparc/include/asm/uaccess.h      |    3 +
 arch/sparc/include/asm/uaccess_32.h   |   10 ---
 arch/sparc/include/asm/uaccess_64.h   |    4 -
 arch/sparc/lib/Makefile               |    2 +-
 arch/sparc/lib/ksyms.c                |    3 -
 arch/sparc/lib/strncpy_from_user_32.S |   47 ------------
 arch/sparc/lib/strncpy_from_user_64.S |  133 ---------------------------------
 arch/sparc/lib/usercopy.c             |  132 ++++++++++++++++++++++++++++++++
 8 files changed, 136 insertions(+), 198 deletions(-)
 delete mode 100644 arch/sparc/lib/strncpy_from_user_32.S
 delete mode 100644 arch/sparc/lib/strncpy_from_user_64.S

diff --git a/arch/sparc/include/asm/uaccess.h b/arch/sparc/include/asm/uaccess.h
index e88fbe5..42a28cf 100644
--- a/arch/sparc/include/asm/uaccess.h
+++ b/arch/sparc/include/asm/uaccess.h
@@ -5,4 +5,7 @@
 #else
 #include <asm/uaccess_32.h>
 #endif
+
+extern long strncpy_from_user(char *dest, const char __user *src, long count);
+
 #endif
diff --git a/arch/sparc/include/asm/uaccess_32.h b/arch/sparc/include/asm/uaccess_32.h
index d50c310..59586b5 100644
--- a/arch/sparc/include/asm/uaccess_32.h
+++ b/arch/sparc/include/asm/uaccess_32.h
@@ -304,16 +304,6 @@ static inline unsigned long clear_user(void __user *addr, unsigned long n)
 		return n;
 }
 
-extern long __strncpy_from_user(char *dest, const char __user *src, long count);
-
-static inline long strncpy_from_user(char *dest, const char __user *src, long count)
-{
-	if (__access_ok((unsigned long) src, count))
-		return __strncpy_from_user(dest, src, count);
-	else
-		return -EFAULT;
-}
-
 extern long __strlen_user(const char __user *);
 extern long __strnlen_user(const char __user *, long len);
 
diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h
index a1091afb..dcdfb89 100644
--- a/arch/sparc/include/asm/uaccess_64.h
+++ b/arch/sparc/include/asm/uaccess_64.h
@@ -257,10 +257,6 @@ extern unsigned long __must_check __clear_user(void __user *, unsigned long);
 
 #define clear_user __clear_user
 
-extern long __must_check __strncpy_from_user(char *dest, const char __user *src, long count);
-
-#define strncpy_from_user __strncpy_from_user
-
 extern long __strlen_user(const char __user *);
 extern long __strnlen_user(const char __user *, long len);
 
diff --git a/arch/sparc/lib/Makefile b/arch/sparc/lib/Makefile
index 389628f..943d98d 100644
--- a/arch/sparc/lib/Makefile
+++ b/arch/sparc/lib/Makefile
@@ -10,7 +10,7 @@ lib-y                 += strlen.o
 lib-y                 += checksum_$(BITS).o
 lib-$(CONFIG_SPARC32) += blockops.o
 lib-y                 += memscan_$(BITS).o memcmp.o strncmp_$(BITS).o
-lib-y                 += strncpy_from_user_$(BITS).o strlen_user_$(BITS).o
+lib-y                 += strlen_user_$(BITS).o
 lib-$(CONFIG_SPARC32) += divdi3.o udivdi3.o
 lib-$(CONFIG_SPARC32) += copy_user.o locks.o
 lib-$(CONFIG_SPARC64) += atomic_64.o
diff --git a/arch/sparc/lib/ksyms.c b/arch/sparc/lib/ksyms.c
index 2dc3087..6b278ab 100644
--- a/arch/sparc/lib/ksyms.c
+++ b/arch/sparc/lib/ksyms.c
@@ -33,9 +33,6 @@ EXPORT_SYMBOL(memset);
 EXPORT_SYMBOL(memmove);
 EXPORT_SYMBOL(__bzero);
 
-/* Moving data to/from/in userspace. */
-EXPORT_SYMBOL(__strncpy_from_user);
-
 /* Networking helper routines. */
 EXPORT_SYMBOL(csum_partial);
 
diff --git a/arch/sparc/lib/strncpy_from_user_32.S b/arch/sparc/lib/strncpy_from_user_32.S
deleted file mode 100644
index db0ed29..0000000
--- a/arch/sparc/lib/strncpy_from_user_32.S
+++ /dev/null
@@ -1,47 +0,0 @@
-/* strncpy_from_user.S: Sparc strncpy from userspace.
- *
- *  Copyright(C) 1996 David S. Miller
- */
-
-#include <linux/linkage.h>
-#include <asm/ptrace.h>
-#include <asm/errno.h>
-
-	.text
-
-	/* Must return:
-	 *
-	 * -EFAULT		for an exception
-	 * count		if we hit the buffer limit
-	 * bytes copied		if we hit a null byte
-	 */
-
-ENTRY(__strncpy_from_user)
-	/* %o0=dest, %o1=src, %o2=count */
-	mov	%o2, %o3
-1:
-	subcc	%o2, 1, %o2
-	bneg	2f
-	 nop
-10:
-	ldub	[%o1], %o4
-	add	%o0, 1, %o0
-	cmp	%o4, 0
-	add	%o1, 1, %o1
-	bne	1b
-	 stb	%o4, [%o0 - 1]
-2:
-	add	%o2, 1, %o0
-	retl
-	 sub	%o3, %o0, %o0
-ENDPROC(__strncpy_from_user)
-
-	.section .fixup,#alloc,#execinstr
-	.align	4
-4:
-	retl
-	 mov	-EFAULT, %o0
-
-	.section __ex_table,#alloc
-	.align	4
-	.word	10b, 4b
diff --git a/arch/sparc/lib/strncpy_from_user_64.S b/arch/sparc/lib/strncpy_from_user_64.S
deleted file mode 100644
index d1246b7..0000000
--- a/arch/sparc/lib/strncpy_from_user_64.S
+++ /dev/null
@@ -1,133 +0,0 @@
-/*
- * strncpy_from_user.S: Sparc64 strncpy from userspace.
- *
- *  Copyright (C) 1997, 1999 Jakub Jelinek (jj@ultra.linux.cz)
- */
-
-#include <linux/linkage.h>
-#include <asm/asi.h>
-#include <asm/errno.h>
-
-	.data
-	.align	8
-0:	.xword	0x0101010101010101
-
-	.text
-
-	/* Must return:
-	 *
-	 * -EFAULT		for an exception
-	 * count		if we hit the buffer limit
-	 * bytes copied		if we hit a null byte
-	 * (without the null byte)
-	 *
-	 * This implementation assumes:
-	 * %o1 is 8 aligned => !(%o2 & 7)
-	 * %o0 is 8 aligned (if not, it will be slooooow, but will work)
-	 *
-	 * This is optimized for the common case:
-	 * in my stats, 90% of src are 8 aligned (even on sparc32)
-	 * and average length is 18 or so.
-	 */
-
-ENTRY(__strncpy_from_user)
-	/* %o0=dest, %o1=src, %o2=count */
-	andcc	%o1, 7, %g0		! IEU1	Group
-	bne,pn	%icc, 30f		! CTI
-	 add	%o0, %o2, %g3		! IEU0
-60:	ldxa	[%o1] %asi, %g1		! Load	Group
-	brlez,pn %o2, 10f		! CTI
-	 mov	%o0, %o3		! IEU0
-50:	sethi	%hi(0b), %o4		! IEU0	Group
-	ldx	[%o4 + %lo(0b)], %o4	! Load
-	sllx	%o4, 7, %o5		! IEU1	Group
-1:	sub	%g1, %o4, %g2		! IEU0	Group
-	stx	%g1, [%o0]		! Store
-	add	%o0, 8, %o0		! IEU1
-	andcc	%g2, %o5, %g0		! IEU1	Group
-	bne,pn	%xcc, 5f		! CTI
-	 add	%o1, 8, %o1		! IEU0
-	cmp	%o0, %g3		! IEU1	Group
-	bl,a,pt %xcc, 1b		! CTI
-61:	 ldxa	[%o1] %asi, %g1		! Load
-10:	retl				! CTI	Group
-	 mov	%o2, %o0		! IEU0
-5:	srlx	%g2, 32, %g7		! IEU0	Group
-	sethi	%hi(0xff00), %o4	! IEU1
-	andcc	%g7, %o5, %g0		! IEU1	Group
-	be,pn	%icc, 2f		! CTI
-	 or	%o4, %lo(0xff00), %o4	! IEU0
-	srlx	%g1, 48, %g7		! IEU0	Group
-	andcc	%g7, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 50f		! CTI
-	 andcc	%g7, 0xff, %g0		! IEU1	Group
-	be,pn	%icc, 51f		! CTI
-	 srlx	%g1, 32, %g7		! IEU0
-	andcc	%g7, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 52f		! CTI
-	 andcc	%g7, 0xff, %g0		! IEU1	Group
-	be,pn	%icc, 53f		! CTI
-2:	 andcc	%g2, %o5, %g0		! IEU1	Group
-	be,pn	%icc, 2f		! CTI
-	 srl	%g1, 16, %g7		! IEU0
-	andcc	%g7, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 54f		! CTI
-	 andcc	%g7, 0xff, %g0		! IEU1	Group
-	be,pn	%icc, 55f		! CTI
-	 andcc	%g1, %o4, %g0		! IEU1	Group
-	be,pn	%icc, 56f		! CTI
-	 andcc	%g1, 0xff, %g0		! IEU1	Group
-	be,a,pn	%icc, 57f		! CTI
-	 sub	%o0, %o3, %o0		! IEU0
-2:	cmp	%o0, %g3		! IEU1	Group
-	bl,a,pt	%xcc, 50b		! CTI
-62:	 ldxa	[%o1] %asi, %g1		! Load
-	retl				! CTI	Group
-	 mov	%o2, %o0		! IEU0
-50:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 8, %o0
-51:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 7, %o0
-52:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 6, %o0
-53:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 5, %o0
-54:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 4, %o0
-55:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 3, %o0
-56:	sub	%o0, %o3, %o0
-	retl
-	 sub	%o0, 2, %o0
-57:	retl
-	 sub	%o0, 1, %o0
-30:	brlez,pn %o2, 3f
-	 sub	%g0, %o2, %o3
-	add	%o0, %o2, %o0
-63:	lduba	[%o1] %asi, %o4
-1:	add	%o1, 1, %o1
-	brz,pn	%o4, 2f
-	 stb	%o4, [%o0 + %o3]
-	addcc	%o3, 1, %o3
-	bne,pt	%xcc, 1b
-64:	 lduba	[%o1] %asi, %o4
-3:	retl
-	 mov	%o2, %o0
-2:	retl
-	 add	%o2, %o3, %o0
-ENDPROC(__strncpy_from_user)
-
-	.section __ex_table,"a"
-	.align	4
-	.word	60b, __retl_efault
-	.word	61b, __retl_efault
-	.word	62b, __retl_efault
-	.word	63b, __retl_efault
-	.word	64b, __retl_efault
-	.previous
diff --git a/arch/sparc/lib/usercopy.c b/arch/sparc/lib/usercopy.c
index 14b363f..851cb75 100644
--- a/arch/sparc/lib/usercopy.c
+++ b/arch/sparc/lib/usercopy.c
@@ -1,4 +1,6 @@
 #include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/errno.h>
 #include <linux/bug.h>
 
 void copy_from_user_overflow(void)
@@ -6,3 +8,133 @@ void copy_from_user_overflow(void)
 	WARN(1, "Buffer overflow detected!\n");
 }
 EXPORT_SYMBOL(copy_from_user_overflow);
+
+#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))
+
+/* Return the high bit set in the first byte that is a zero */
+static inline unsigned long has_zero(unsigned long a)
+{
+	return ((a - REPEAT_BYTE(0x01)) & ~a) & REPEAT_BYTE(0x80);
+}
+
+static inline long find_zero(unsigned long c)
+{
+#ifdef CONFIG_64BIT
+	if (!(c & 0xff00000000000000UL))
+		return 0;
+	if (!(c & 0x00ff000000000000UL))
+		return 1;
+	if (!(c & 0x0000ff0000000000UL))
+		return 2;
+	if (!(c & 0x000000ff00000000UL))
+		return 3;
+#define __OFF 4
+#else
+#define __OFF 0
+#endif
+	if (!(c & 0xff000000))
+		return __OFF + 0;
+	if (!(c & 0x00ff0000))
+		return __OFF + 1;
+	if (!(c & 0x0000ff00))
+		return __OFF + 2;
+	return __OFF + 3;
+#undef __OFF
+}
+
+/*
+ * Do a strncpy, return length of string without final '\0'.
+ * 'count' is the user-supplied count (return 'count' if we
+ * hit it), 'max' is the address space maximum (and we return
+ * -EFAULT if we hit it).
+ */
+static inline long do_strncpy_from_user(char *dst, const char __user *src, long count, unsigned long max)
+{
+	long res = 0;
+
+	/*
+	 * Truncate 'max' to the user-specified limit, so that
+	 * we only have one limit we need to check in the loop
+	 */
+	if (max > count)
+		max = count;
+
+	if (((long) dst | (long) src) & (sizeof(long) - 1))
+		goto byte_at_a_time;
+
+	while (max >= sizeof(unsigned long)) {
+		unsigned long c;
+
+		/* Fall back to byte-at-a-time if we get a page fault */
+		if (unlikely(__get_user(c,(unsigned long __user *)(src+res))))
+			break;
+		*(unsigned long *)(dst+res) = c;
+		if (has_zero(c))
+			return res + find_zero(c);
+		res += sizeof(unsigned long);
+		max -= sizeof(unsigned long);
+	}
+
+byte_at_a_time:
+	while (max) {
+		char c;
+
+		if (unlikely(__get_user(c,src+res)))
+			return -EFAULT;
+		dst[res] = c;
+		if (!c)
+			return res;
+		res++;
+		max--;
+	}
+
+	/*
+	 * Uhhuh. We hit 'max'. But was that the user-specified maximum
+	 * too? If so, that's ok - we got as much as the user asked for.
+	 */
+	if (res >= count)
+		return res;
+
+	/*
+	 * Nope: we hit the address space limit, and we still had more
+	 * characters the caller would have wanted. That's an EFAULT.
+	 */
+	return -EFAULT;
+}
+
+/**
+ * strncpy_from_user: - Copy a NUL terminated string from userspace.
+ * @dst:   Destination address, in kernel space.  This buffer must be at
+ *         least @count bytes long.
+ * @src:   Source address, in user space.
+ * @count: Maximum number of bytes to copy, including the trailing NUL.
+ *
+ * Copies a NUL-terminated string from userspace to kernel space.
+ *
+ * On success, returns the length of the string (not including the trailing
+ * NUL).
+ *
+ * If access to userspace fails, returns -EFAULT (some data may have been
+ * copied).
+ *
+ * If @count is smaller than the length of the string, copies @count bytes
+ * and returns @count.
+ */
+long strncpy_from_user(char *dst, const char __user *src, long count)
+{
+	unsigned long max_addr, src_addr;
+
+	if (unlikely(count <= 0))
+		return 0;
+
+	max_addr = ~0UL;
+	if (likely(segment_eq(get_fs(), USER_DS)))
+		max_addr = STACK_TOP;
+	src_addr = (unsigned long)src;
+	if (likely(src_addr < max_addr)) {
+		unsigned long max = max_addr - src_addr;
+		return do_strncpy_from_user(dst, src, count, max);
+	}
+	return -EFAULT;
+}
+EXPORT_SYMBOL(strncpy_from_user);
-- 
1.7.7.6
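
As an editorial aside, the has_zero()/find_zero() pair added by the patch above is easy to sanity-check as plain C in userspace. The harness below assumes a 64-bit build and uses uint64_t in place of unsigned long; the loop in find_zero() is a compact equivalent of the patch's chained byte tests:

```c
#include <assert.h>
#include <stdint.h>

#define REPEAT_BYTE(x)	((~0ull / 0xff) * (x))

/* Nonzero iff 'a' contains at least one zero byte (Mycroft's test, as
 * in the patch); note the 0x80 marks may bleed into bytes above a real
 * zero because of borrow propagation, so only use this for detection */
static uint64_t has_zero(uint64_t a)
{
	return ((a - REPEAT_BYTE(0x01)) & ~a) & REPEAT_BYTE(0x80);
}

/* Index of the first zero byte, counting from the most significant
 * byte (the first byte in memory on big-endian); the caller must have
 * checked has_zero() first, as the patch does */
static long find_zero(uint64_t c)
{
	long i;

	for (i = 0; i < 7; i++)
		if (!((c >> (56 - 8 * i)) & 0xff))
			break;
	return i;
}
```

For example, 0x4142004341424344 (the bytes "AB\0CABCD" on big-endian) has a zero byte and find_zero() reports index 2.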

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Arch maintainers Ahoy!
  2012-05-23 18:16             ` David Miller
@ 2012-05-23 18:27               ` Linus Torvalds
  2012-05-23 18:35                 ` David Miller
  2012-05-24  9:40               ` David Howells
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-23 18:27 UTC (permalink / raw)
  To: David Miller; +Cc: James.Bottomley, geert, linux-arch

On Wed, May 23, 2012 at 11:16 AM, David Miller <davem@davemloft.net> wrote:
>
> And believe it or not the "mask test each byte" thing is the fastest
> portable code I could come up with.  Big endian just blows for making
> this calculation.

Is it not faster to just do something like

   int byte = 4;

#if CONFIG_64BIT
   byte = 8;
   if (has_zero_32bit(value >> 32)) {
       value >>= 32;
       byte = 4;
   }
#endif
   if (has_zero_16(value >> 16)) {
      value >>= 16;
      byte -= 2;
   }
   if (!(value & 0xff00))
      byte--;
   return byte;

which looks like it might generate ok code?

> But hey Linus I'm willing to be proven wrong, why don't you ask your
> google+ programming challenge entourage for some help? :-)

Hey, I refuse to start from something that looks stupid.

Btw, when benchmarking, make sure that your branches do not predict
well. Because in real life they won't predict well. So you can't
benchmark the mask->byte function with some well-behaved input that
commonly returns the same value.

> For reference here is the final version of the sparc commit, it works
> and I've been running tests on it since last night.  I'm extremely
> confident the C code will work on any big-endian machine.

Umm. Except your "top of address space" thing is entirely sparc-specific.

So no. The *bulk* of the code will work. But not all of it.

              Linus


* Re: Arch maintainers Ahoy!
  2012-05-23 18:27               ` Linus Torvalds
@ 2012-05-23 18:35                 ` David Miller
  2012-05-23 18:44                   ` Linus Torvalds
  2012-05-23 18:46                   ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2012-05-23 18:35 UTC (permalink / raw)
  To: torvalds; +Cc: James.Bottomley, geert, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 23 May 2012 11:27:00 -0700

> Is it not faster to just do something like
> 
>    int byte = 4;
> 
> #if CONFIG_64BIT
>    byte = 8;
>    if (has_zero_32bit(value >> 32)) {
>        value >>= 32;
>        byte = 4;
>    }
> #endif
>    if (has_zero_16(value >> 16)) {
>       value >>= 16;
>       byte -= 2;
>    }
>    if (!(value & 0xff00))
>       byte--;
>    return byte;
> 
> which looks like it might generate ok code?

It might be, I'll play around with it.

FWIW, when I code this end case in assembler on sparc64 I just go for
a bunch of conditional moves, so I'll try to come up with something
similar to the above that gcc will emit reasonably.

> Btw, when benchmarking, make sure that your branches do not predict
> well. Because in real life they won't predict well. So you can't
> benchmark the mask->byte function with some well-behaved input that
> commonly returns the same value.

Indeed, and that's why I'd prefer it if gcc were to emit conditional
moves :-)

>> For reference here is the final version of the sparc commit, it works
>> and I've been running tests on it since last night.  I'm extremely
>> confident the C code will work on any big-endian machine.
> 
> Umm. Except your "top of address space" thing is entirely sparc-specific.

Although a bit more expensive than what you can do on the x86 side
with these tests, I think my code should work.

Comparing get_fs() with USER_DS should be portable enough, as should
STACK_TOP, right?  Or does STACK_TOP have some weird semantics on some
architectures that I'm not aware of?

The only other thing is how we are using ~0UL as the limit for the
kernel, and for all practical purposes that ought to be fine too.


* Re: Arch maintainers Ahoy!
  2012-05-23 18:35                 ` David Miller
@ 2012-05-23 18:44                   ` Linus Torvalds
  2012-05-23 18:46                   ` Linus Torvalds
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2012-05-23 18:44 UTC (permalink / raw)
  To: David Miller; +Cc: James.Bottomley, geert, linux-arch

On Wed, May 23, 2012 at 11:35 AM, David Miller <davem@davemloft.net> wrote:
>
> Comparing get_fs() with USER_DS should be portable enough, as should
> STACK_TOP, right?  Or does STACK_TOP have some weird semantics on some
> architectures that I'm not aware of?

STACK_TOP is definitely not portable. TASK_SIZE would be closer, and
was what the old deleted code in fs/namei.c used to use.

It may *happen* that all big-endian machines make STACK_TOP be the
same as TASK_SIZE, but that would be more of an accident than anything
else.

                  Linus


* Re: Arch maintainers Ahoy!
  2012-05-23 18:35                 ` David Miller
  2012-05-23 18:44                   ` Linus Torvalds
@ 2012-05-23 18:46                   ` Linus Torvalds
  2012-05-23 20:36                     ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-23 18:46 UTC (permalink / raw)
  To: David Miller; +Cc: James.Bottomley, geert, linux-arch

On Wed, May 23, 2012 at 11:35 AM, David Miller <davem@davemloft.net> wrote:
>
> FWIW, when I code this end case in assembler on sparc64 I just go for
> a bunch of conditional moves, so I'll try to come up with something
> similar to the above that gcc will emit reasonably.

.. and yes, it's possible that the keep-it-simple-and-stupid code you
posted first will actually generate better code if gcc can change it
all to cmov's.

                   Linus


* Re: Arch maintainers Ahoy!
  2012-05-23 18:46                   ` Linus Torvalds
@ 2012-05-23 20:36                     ` David Miller
  2012-05-23 21:01                       ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2012-05-23 20:36 UTC (permalink / raw)
  To: torvalds; +Cc: James.Bottomley, geert, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 23 May 2012 11:46:54 -0700

> On Wed, May 23, 2012 at 11:35 AM, David Miller <davem@davemloft.net> wrote:
>>
>> FWIW, when I code this end case in assembler on sparc64 I just go for
>> a bunch of conditional moves, so I'll try to come up with something
>> similar to the above that gcc will emit reasonably.
> 
> .. and yes, it's possible that the keep-it-simple-and-stupid code you
> posted first will actually generate better code if gcc can change it
> all to cmov's.

I toyed around with some of the ideas we discussed but gcc really
mishandled all the approaches I tried.

It seemed to, no matter what I did, want to reload the Mycroft
constants in the tail code even though it had them all readily
available in registers for the word-at-a-time loop body.

We, of course, can't just use the already calculated has_zero() mask
value to figure out has_zero_32() and has_zero_16().  This is because
carrying can cause 0x80 values to end up in the mask at the locations
adjacent to the real zero byte.

As discussed in the past and as implemented in fs/namei.c, on
little-endian it's trivial to mask out the uninteresting 0x80 bytes
with that:

	mask = (mask - 1) & ~mask;

thing.

What I think we can do on big-endian is this:

1) In the loop, use the test:

      (x + 0xfefefeff) & ~(x | 0x7f7f7f7f)

   It's the same effective cost as the current test (on sparc
   it would be ADD, OR, ANDNCC).

   We make sure to calculate the "x | 0x7f7f7f7f" part into
   a variable which is not clobbered by the rest of the test.

   This is so we can reuse it in #2.

2) Once we find a word containing the zero byte, do a:

	~(((x & 0x7f7f7f7f) + 0x7f7f7f7f) | x | 0x7f7f7f7f)

   and that "x | 0x7f7f7f7f" part is already calculated and thus
   can be cribbed the place we left it in #1 above.

   And now we'll have exactly a 0x80 where there is a zero byte,
   and no bleeding of 0x80 values into adjacent byte positions.

Once we have that we can just test that mask directly for the
zero byte location search code.
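
For concreteness, the two steps can be checked as plain C with the 32-bit constants used in the text (an editorial harness, not kernel code). Step 1 is the detector and may set extra 0x80 bits above the real zero byte; step 2 produces the exact mask:

```c
#include <assert.h>
#include <stdint.h>

/* Step 1: zero-byte detection; arithmetically the same as the classic
 * (x - 0x01010101) & ~x & 0x80808080 test, so stray 0x80 bits can
 * appear in bytes above (MSB-ward of) the first zero byte */
static uint32_t step1(uint32_t x)
{
	return (x + 0xfefefeffu) & ~(x | 0x7f7f7f7fu);
}

/* Step 2: exact mask; a byte of the result is 0x80 iff the
 * corresponding byte of x is zero.  There is no bleeding because the
 * per-byte sums here can never carry across byte boundaries */
static uint32_t step2(uint32_t x)
{
	return ~(((x & 0x7f7f7f7fu) + 0x7f7f7f7fu) | x | 0x7f7f7f7fu);
}
```

With x = 0x41010043, step1() returns 0x00808000 (the real zero byte plus a stray mark from the 0x01 byte above it), while step2() returns exactly 0x00008000.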


* Re: Arch maintainers Ahoy!
  2012-05-23 20:36                     ` David Miller
@ 2012-05-23 21:01                       ` Linus Torvalds
  2012-05-24  2:11                         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-23 21:01 UTC (permalink / raw)
  To: David Miller; +Cc: James.Bottomley, geert, linux-arch

On Wed, May 23, 2012 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>
> I toyed around with some of the ideas we discussed but gcc really
> mishandled all the approaches I tried.

Have you tried coding them as ?: expressions, along with making all
the temporaries separate variables? Sometimes that seems to make gcc
more eager to use cmov's.

Although that seemed to work better before. These days gcc sometimes
seems so eager to show it knows better than the programmer that it is
hard to make it do the obvious thing from the obvious source code..

> 1) In the loop, use the test:
>
>      (x + 0xfefefeff) & ~(x | 0x7f7f7f7f)
>
>   It's the same effective cost as the current test (on sparc
>   it would be ADD, OR, ANDNCC).
>
>   We make sure to calculate the "x | 0x7f7f7f7f" part into
>   a variable which is not clobbered by the rest of the test.
>
>   This is so we can reuse it in #2.
>
> 2) Once we find a word containing the zero byte, do a:
>
>        ~(((x & 0x7f7f7f7f) + 0x7f7f7f7f) | x | 0x7f7f7f7f)
>
>   and that "x | 0x7f7f7f7f" part is already calculated and thus
>   can be cribbed from the place we left it in #1 above.
>
>   And now we'll have exactly a 0x80 where there is a zero byte,
>   and no bleeding of 0x80 values into adjacent byte positions.
>
> Once we have that we can just test that mask directly for the
> zero byte location search code.

Sounds likely, and you only have two different constants to worry about.

Sadly, I don't see any way to get the "only high bits set" cheaply,
like the little-endian case does (ie going from "zero in second byte
and after": 0x00808080 to the byte mask you need: 0xff000000). If you
had that, and the appropriate unaligneds, you'd also have everything
for the dcache case, not just strncpy.

                  Linus


* Re: Arch maintainers Ahoy!
  2012-05-23 21:01                       ` Linus Torvalds
@ 2012-05-24  2:11                         ` David Miller
  2012-05-24  5:25                           ` Paul Mackerras
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2012-05-24  2:11 UTC (permalink / raw)
  To: torvalds; +Cc: James.Bottomley, geert, linux-arch

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 23 May 2012 14:01:26 -0700

> On Wed, May 23, 2012 at 1:36 PM, David Miller <davem@davemloft.net> wrote:
>> 1) In the loop, use the test:
>>
>>      (x + 0xfefefeff) & ~(x | 0x7f7f7f7f)
>>
>>   It's the same effective cost as the current test (on sparc
>>   it would be ADD, OR, ANDNCC).
>>
>>   We make sure to calculate the "x | 0x7f7f7f7f" part into
>>   a variable which is not clobbered by the rest of the test.
>>
>>   This is so we can reuse it in #2.
>>
>> 2) Once we find a word containing the zero byte, do a:
>>
>>        ~(((x & 0x7f7f7f7f) + 0x7f7f7f7f) | x | 0x7f7f7f7f)
>>
>>   and that "x | 0x7f7f7f7f" part is already calculated and thus
>>   can be cribbed from the place we left it in #1 above.
>>
>>   And now we'll have exactly a 0x80 where there is a zero byte,
>>   and no bleeding of 0x80 values into adjacent byte positions.
>>
>> Once we have that we can just test that mask directly for the
>> zero byte location search code.
> 
> Sounds likely, and you only have two different constants to worry about.

Ok, now confirmed, the following patch generates the best code I've
been able to get out of GCC thus far, and it's tested too.

Therefore, I'll push this out to my sparc tree.

Next, I'll try to add the abstractions et al. necessary to put this
under the top-level lib/ so others can use it too.

diff --git a/arch/sparc/lib/usercopy.c b/arch/sparc/lib/usercopy.c
index 851cb75..78c363f 100644
--- a/arch/sparc/lib/usercopy.c
+++ b/arch/sparc/lib/usercopy.c
@@ -17,29 +17,20 @@ static inline unsigned long has_zero(unsigned long a)
 	return ((a - REPEAT_BYTE(0x01)) & ~a) & REPEAT_BYTE(0x80);
 }
 
-static inline long find_zero(unsigned long c)
+static inline long find_zero(unsigned long mask)
 {
+	long byte = 0;
 #ifdef CONFIG_64BIT
-	if (!(c & 0xff00000000000000UL))
-		return 0;
-	if (!(c & 0x00ff000000000000UL))
-		return 1;
-	if (!(c & 0x0000ff0000000000UL))
-		return 2;
-	if (!(c & 0x000000ff00000000UL))
-		return 3;
-#define __OFF 4
-#else
-#define __OFF 0
+	if (mask >> 32)
+		mask >>= 32;
+	else
+		byte = 4;
 #endif
-	if (!(c & 0xff000000))
-		return __OFF + 0;
-	if (!(c & 0x00ff0000))
-		return __OFF + 1;
-	if (!(c & 0x0000ff00))
-		return __OFF + 2;
-	return __OFF + 3;
-#undef __OFF
+	if (mask >> 16)
+		mask >>= 16;
+	else
+		byte += 2;
+	return (mask >> 8) ? byte : byte + 1;
 }
 
 /*
@@ -50,6 +41,8 @@ static inline long find_zero(unsigned long c)
  */
 static inline long do_strncpy_from_user(char *dst, const char __user *src, long count, unsigned long max)
 {
+	const unsigned long high_bits = REPEAT_BYTE(0xfe) + 1;
+	const unsigned long low_bits = REPEAT_BYTE(0x7f);
 	long res = 0;
 
 	/*
@@ -63,14 +56,19 @@ static inline long do_strncpy_from_user(char *dst, const char __user *src, long
 		goto byte_at_a_time;
 
 	while (max >= sizeof(unsigned long)) {
-		unsigned long c;
+		unsigned long c, v, rhs;
 
 		/* Fall back to byte-at-a-time if we get a page fault */
 		if (unlikely(__get_user(c,(unsigned long __user *)(src+res))))
 			break;
+		rhs = c | low_bits;
+		v = (c + high_bits) & ~rhs;
 		*(unsigned long *)(dst+res) = c;
-		if (has_zero(c))
-			return res + find_zero(c);
+		if (v) {
+			v = (c & low_bits) + low_bits;;
+			v = ~(v | rhs);
+			return res + find_zero(v);
+		}
 		res += sizeof(unsigned long);
 		max -= sizeof(unsigned long);
 	}
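
The reworked find_zero() is self-contained enough to test in isolation. The editorial harness below assumes a 64-bit build; note that the argument is now the exact mask (one 0x80 per zero byte) produced by the new loop code, not the raw word:

```c
#include <assert.h>
#include <stdint.h>

/* Locate the most significant 0x80 in an exact zero-byte mask by
 * repeatedly halving the search window, counting skipped bytes from
 * the most significant end; written so gcc has a chance to emit
 * conditional moves instead of hard-to-predict branches */
static long find_zero(uint64_t mask)
{
	long byte = 0;

	if (mask >> 32)		/* mark in the upper half? */
		mask >>= 32;
	else
		byte = 4;
	if (mask >> 16)
		mask >>= 16;
	else
		byte += 2;
	return (mask >> 8) ? byte : byte + 1;
}
```

For a mask of 0x0000800000000000 (zero in the third byte from the most significant end), this returns 2.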


* Re: Arch maintainers Ahoy!
  2012-05-24  2:11                         ` David Miller
@ 2012-05-24  5:25                           ` Paul Mackerras
  2012-05-24  5:56                             ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Paul Mackerras @ 2012-05-24  5:25 UTC (permalink / raw)
  To: David Miller; +Cc: torvalds, James.Bottomley, geert, linux-arch

On Wed, May 23, 2012 at 10:11:50PM -0400, David Miller wrote:

> Ok, now confirmed, the following patch generates the best code I've
> been able to get out of GCC thus far, and it's tested too.
> 
> Therefore, I'll push this out to my sparc tree.
> 
> Next, I'll try to add the abstractions et al. necessary to put this
> under the top-level lib/ so others can use it too.

Sounds good, thanks for doing this.  On PowerPC we have the "count
leading zeroes" instructions we could use to find the position of the
leftmost zero byte, so if you can conveniently make the find_zero()
function overridable, that would be good.

Paul.


* Re: Arch maintainers Ahoy!
  2012-05-24  5:25                           ` Paul Mackerras
@ 2012-05-24  5:56                             ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2012-05-24  5:56 UTC (permalink / raw)
  To: paulus; +Cc: torvalds, James.Bottomley, geert, linux-arch

From: Paul Mackerras <paulus@samba.org>
Date: Thu, 24 May 2012 15:25:45 +1000

> On PowerPC we have the "count leading zeroes" instructions we could
> use to find the position of the leftmost zero byte,

I know, guess whose glibc strlen asm I got these formulas from?
:-)

> so if you can conveniently make the find_zero() function
> overridable, that would be good.

I'm sure a powerpc person can cons up a patch that does that once
my stuff goes in, and test it too.


* Re: Arch maintainers Ahoy!
  2012-05-23 18:16             ` David Miller
  2012-05-23 18:27               ` Linus Torvalds
@ 2012-05-24  9:40               ` David Howells
  2012-05-24 15:53                 ` Linus Torvalds
  2012-05-24 16:45                 ` David Howells
  1 sibling, 2 replies; 25+ messages in thread
From: David Howells @ 2012-05-24  9:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, David Miller, James.Bottomley, geert, linux-arch


Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Is it not faster to just do something like
> 
>    int byte = 4;
> 
> #if CONFIG_64BIT
>    byte = 8;
>    if (has_zero_32bit(value >> 32)) {
>        value >>= 32;
>        byte = 4;
>    }
> #endif
>    if (has_zero_16(value >> 16)) {
>       value >>= 16;
>       byte -= 2;
>    }
>    if (!(value & 0xff00))
>       byte--;
>    return byte;


Could you use cpu_to_be32/64() and then ffs()?  That ought to work for both
variants of endianness.  The cpu_to_beXX() should be a noop on BE and is
likely to be a single instruction on LE.  The meat of ffs() is usually a
single instruction, though it may have to have zero-detect logic added.

David


* Re: Arch maintainers Ahoy!
  2012-05-24  9:40               ` David Howells
@ 2012-05-24 15:53                 ` Linus Torvalds
  2012-06-13 11:08                   ` Michael Cree
  2012-05-24 16:45                 ` David Howells
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2012-05-24 15:53 UTC (permalink / raw)
  To: David Howells; +Cc: David Miller, James.Bottomley, geert, linux-arch

On Thu, May 24, 2012 at 2:40 AM, David Howells <dhowells@redhat.com> wrote:
>
> Could you use cpu_to_be32/64() and then ffs()?  That ought to work for both
> variants of endianness.  The cpu_to_beXX() should be a noop on BE and is
> likely to be a single instruction on LE.  The meat of ffs() is usually a
> single instruction, though it may have to have zero-detect logic added.

First off, the *last* thing you want to do is go to big-endian mode.
All the bit counting gets *much* more complicated, and your argument
that it's "free" on some architectures is pointless, since it is only
free on the architectures that have the *least* users.

Secondly, it's not "likely a single instruction" on LE, neither is
ffs. It can be, but it's often one of the slower instructions.

Many architectures will have - but only in their most recent uarch
versions - popcount or similar, and if you're little-endian, that
would be what you want. Except we already figured out faster versions
for little-endian based on multiplication or a few add/shift
operations.

                        Linus


* Re: Arch maintainers Ahoy!
  2012-05-24  9:40               ` David Howells
  2012-05-24 15:53                 ` Linus Torvalds
@ 2012-05-24 16:45                 ` David Howells
  2012-05-24 16:56                   ` Linus Torvalds
  2012-05-24 17:16                   ` David Howells
  1 sibling, 2 replies; 25+ messages in thread
From: David Howells @ 2012-05-24 16:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, David Miller, James.Bottomley, geert, linux-arch

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Could you use cpu_to_be32/64() and then ffs()?  That ought to work for both
> > variants of endianness.  The cpu_to_beXX() should be a noop on BE and is
> > likely to be a single instruction on LE.  The meat of ffs() is usually a
> > single instruction, though it may have to have zero-detect logic added.
> 
> First off, the *last* thing you want to do is go to big-endian mode.
> All the bit counting gets *much* more complicated, and your argument
> that it's "free" on some architectures is pointless, since it is only
> free on the architectures that have the *least* users.

I didn't suggest it was free, but it might be cheaper.  Besides, x86/x86_64 has
BSF/BSR instructions - though having played with Dave's algorithm some, I
don't think they're usable for this.

My suggestion assumes that there would be zeros beyond the terminal NUL, which
isn't something you can rely on.

David


* Re: Arch maintainers Ahoy!
  2012-05-24 16:45                 ` David Howells
@ 2012-05-24 16:56                   ` Linus Torvalds
  2012-05-24 17:16                   ` David Howells
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2012-05-24 16:56 UTC (permalink / raw)
  To: David Howells; +Cc: David Miller, James.Bottomley, geert, linux-arch

On Thu, May 24, 2012 at 9:45 AM, David Howells <dhowells@redhat.com> wrote:
>
> I didn't suggest it was free, but it might be cheaper.  Besides x86/x86_64 has
> BSF/BSR instructions - though having played with Dave's algorithm some, I
> don't think they're usable for this.

David, I don't think you've followed this saga very well.

That word-at-a-time algorithm *comes* from x86. Literally. The code is
90% pure copies from arch/x86/lib/usercopy.c.

We *know* how the code should be written on little-endian, AND IT IS
BOTH BETTER AND FASTER there.

Seriously. Little-endian is *superior* for string handling. No
questions. In fact, anybody who thinks that big-endian is better for
*anything* is seriously deluded these days.

Also, BSF is too damn slow. It's slow as hell on most older x86 chips,
and it's pointless. You can do clever tricks (again - only on
little-endian) that do it portably without it. Grep for
count_masked_bytes in the current tree, which does need a fast
multiplier on 64-bit, but fast multipliers are way more common than
the fast bit scan instructions.
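
For reference, the multiply idea behind count_masked_bytes can be sketched as follows. This is an editorial reconstruction of the trick, not the kernel's exact formula: it first isolates the lowest set bit so that stray carry marks in the higher bytes of a has_zero()-style mask cannot disturb the result:

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian: byte index of the first zero byte, given a
 * has_zero()-style mask (0x80 at the zero byte, possibly with stray
 * 0x80 marks in higher bytes).  'mask & -mask' leaves the single bit
 * 0x80 << 8k; the multiply then routes k into the top byte of the
 * product, avoiding any bit-scan instruction. */
static long first_zero_byte_le(uint64_t mask)
{
	uint64_t bit = mask & -mask;	/* lowest set bit */

	return (long)(((bit >> 7) * 0x0001020304050607ULL) >> 56);
}
```

For example, a mask of 0x8000 (zero in byte 1) gives 1, and any stray 0x80 marks above the first zero byte are ignored.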

I'd love to use a population count instruction, but efficient popc
instructions are simply not widely enough available. And the bsf
instruction is very slow on old x86 microarchitectures.

                  Linus


* Re: Arch maintainers Ahoy!
  2012-05-24 16:45                 ` David Howells
  2012-05-24 16:56                   ` Linus Torvalds
@ 2012-05-24 17:16                   ` David Howells
  1 sibling, 0 replies; 25+ messages in thread
From: David Howells @ 2012-05-24 17:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, David Miller, James.Bottomley, geert, linux-arch

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > I didn't suggest it was free, but it might be cheaper.  Besides x86/x86_64
> > has BSF/BSR instructions - though having played with Dave's algorithm
> > some, I don't think they're usable for this.
> 
> David, I don't think you've followed this saga very well.

No, I didn't phrase the paragraph above very well, so let me rephrase: I think
Dave's patch is _good_[*] and that BSF/BSR would _not_ be usable for it.

I was thinking that reordering the bytes on LE might help, but no, it doesn't
as Dave's algorithm works equally well, no matter what order the bytes
are in.

David

[*] Barring an extraneous semicolon already pointed out


* Re: Arch maintainers Ahoy!
  2012-05-24 15:53                 ` Linus Torvalds
@ 2012-06-13 11:08                   ` Michael Cree
  2012-06-13 14:51                     ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Michael Cree @ 2012-06-13 11:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, David Miller, James.Bottomley, geert, linux-arch

On 25/05/12 03:53, Linus Torvalds wrote:
> First off, the *last* thing you want to do is go to big-endian mode.
> All the bit counting gets *much* more complicated, and your argument
> that it's "free" on some architectures is pointless, since it is only
> free on the architectures that have the *least* users.

On Alpha we can find the zero bytes extremely efficiently, and, yeah,
we have rather few users, so carry bugger-all weight.  Nevertheless I
want to ask about the semantics of the new prep_zero_mask() function
because if we have to implement it exactly as specified in the message
to commit 36126f8f2ed8 then we are forced to take a round-about, thus
less efficient, route in the find_zero() implementation on Alpha.

From commit 36126f8f2ed8 prep_zero_mask() must, and I quote, "generate
an *exact* mask of which byte had the first zero."  But in all extant
usage the result of prep_zero_mask() is passed _only_ to
create_zero_mask().  It seems to me then that current usage is only
constrained by the following:

1) The result of prep_zero_mask() must be bitwise OR-able, and the
ORed result must in turn be a valid mask of zero bytes.

2) The result is only ever passed to create_zero_mask() which, like
prep_zero_mask(), is architecture specific.

But nothing currently in the kernel (other than a commit message)
requires the result of prep_zero_mask() to be an *exact* mask of the
zero bytes, only that it be *a* mask of zero bytes.  The difference is
important to Alpha because if we can have a mask where the lowest eight
bits represent each byte (rather than a 64-bit mask where a whole eight
bits are set to represent a byte) we get an extremely efficient
implementation.

So, may I generalise prep_zero_mask() as suggested above?

I follow with the Alpha code for word-at-a-time.h that results if I may
(and is running fine on my Alpha):


/*
 * We do not use the word_at_a_time struct on Alpha, but it needs to be
 * implemented to humour the generic code.
 */
struct word_at_a_time {
        const unsigned long unused;
};

#define WORD_AT_A_TIME_CONSTANTS { 0 }

/* Return nonzero if val has a zero */
static inline unsigned long has_zero(unsigned long val, unsigned long *bits,
                                     const struct word_at_a_time *c)
{
        unsigned long zero_locations = __kernel_cmpbge(0, val);
        *bits = zero_locations;
        return zero_locations;
}

static inline unsigned long prep_zero_mask(unsigned long val,
                                           unsigned long bits,
                                           const struct word_at_a_time *c)
{
        return bits;
}

#define create_zero_mask(bits) (bits)

static inline unsigned long find_zero(unsigned long bits)
{
        unsigned long t1, t2, t3;
        /* Isolate the lowest set bit: the first zero byte */
        bits &= -bits;
        /* Binary search within the low eight bits for its position */
        t1 = bits & 0xf0 ? 4 : 0;
        t2 = bits & 0xcc ? 2 : 0;
        t3 = bits & 0xaa ? 1 : 0;
        return t1 + t2 + t3;
}
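For context, here is how callers chain these primitives together. This is a self-contained sketch of a word-at-a-time strlen; the helper bodies below are generic little-endian 64-bit ones (x86-style, not the Alpha ones above), and word_strlen is a made-up name for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ONEBYTES  0x0101010101010101ul
#define HIGHBITS  0x8080808080808080ul

/* High bit set in each zero byte of a; also reported via *bits */
static inline unsigned long has_zero(unsigned long a, unsigned long *bits)
{
        *bits = ((a - ONEBYTES) & ~a) & HIGHBITS;
        return *bits;
}

/* No-op in this representation: the high-bit pattern is already OR-able */
static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits)
{
        return bits;
}

/* 0xff in every byte before the first zero byte */
static inline unsigned long create_zero_mask(unsigned long bits)
{
        return ((bits - 1) & ~bits) >> 7;
}

/* Index of the first zero byte */
static inline unsigned long find_zero(unsigned long mask)
{
        return mask * 0x0001020304050608ul >> 56;
}

/* strlen a word at a time.  memcpy avoids alignment/aliasing traps in
 * the sketch; the caller must guarantee the buffer stays readable up
 * to the next word boundary past the NUL. */
static size_t word_strlen(const char *s)
{
        size_t len = 0;
        for (;;) {
                unsigned long data, bits;
                memcpy(&data, s + len, sizeof(data));
                if (has_zero(data, &bits)) {
                        bits = prep_zero_mask(data, bits);
                        bits = create_zero_mask(bits);
                        return len + find_zero(bits);
                }
                len += sizeof(data);
        }
}
```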

Cheers
Michael.


* Re: Arch maintainers Ahoy!
  2012-06-13 11:08                   ` Michael Cree
@ 2012-06-13 14:51                     ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2012-06-13 14:51 UTC (permalink / raw)
  To: Michael Cree
  Cc: David Howells, David Miller, James.Bottomley, geert, linux-arch

On Wed, Jun 13, 2012 at 2:08 PM, Michael Cree <mcree@orcon.net.nz> wrote:
>
> From commit 36126f8f2ed8 prep_zero_mask() must, and I quote, "generate
> an *exact* mask of which byte had the first zero."

No, that wasn't the intent. It must generate something that can be
or'ed together so that create_zero_mask() can actually then
generate the mask itself.

It has to be exact in the sense that the value must *exactly* imply
the first byte. That's different from the "has_zero()" function that
doesn't need to calculate which is the *first* byte with the zero,
just that there is at least one byte of zero.

In fact, look at the x86 implementation. The prep_zero_mask() function
is a no-op, because it just leaves the high bit set in the appropriate
byte. The actual bytemask is generated by create_zero_mask().
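A minimal sketch of that arrangement (simplified signatures — the real x86 header also threads a constants struct through; the function names match the kernel's, the bodies are from memory):

```c
#include <assert.h>

/* bits comes from has_zero(): high bit set in each byte that was zero.
 * On x86 this representation is already OR-able, so prep is a no-op. */
static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits)
{
        (void)a;        /* unused in this representation */
        return bits;
}

/* The actual work: produce 0xff in every byte before the first zero byte */
static inline unsigned long create_zero_mask(unsigned long bits)
{
        bits = (bits - 1) & ~bits;
        return bits >> 7;
}
```

So a zero in byte 1 (bits == 0x8000) yields the bytemask 0xff, i.e. exactly one full byte before the terminator.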

           Linus


end of thread, other threads:[~2012-06-13 14:51 UTC | newest]

Thread overview: 25+ messages
2012-05-21 16:50 Arch maintainers Ahoy! (was Re: x86: faster strncpy_from_user()) Linus Torvalds
2012-05-23  5:46 ` Arch maintainers Ahoy! David Miller
2012-05-23  8:02   ` Geert Uytterhoeven
2012-05-23  9:40     ` James Bottomley
2012-05-23 15:15       ` Linus Torvalds
2012-05-23 17:21         ` David Miller
2012-05-23 17:32           ` Linus Torvalds
2012-05-23 18:16             ` David Miller
2012-05-23 18:27               ` Linus Torvalds
2012-05-23 18:35                 ` David Miller
2012-05-23 18:44                   ` Linus Torvalds
2012-05-23 18:46                   ` Linus Torvalds
2012-05-23 20:36                     ` David Miller
2012-05-23 21:01                       ` Linus Torvalds
2012-05-24  2:11                         ` David Miller
2012-05-24  5:25                           ` Paul Mackerras
2012-05-24  5:56                             ` David Miller
2012-05-24  9:40               ` David Howells
2012-05-24 15:53                 ` Linus Torvalds
2012-06-13 11:08                   ` Michael Cree
2012-06-13 14:51                     ` Linus Torvalds
2012-05-24 16:45                 ` David Howells
2012-05-24 16:56                   ` Linus Torvalds
2012-05-24 17:16                   ` David Howells
2012-05-23 17:19       ` David Miller
