linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] lib/strscpy: remove word-at-a-time optimization.
@ 2018-01-09 16:37 Andrey Ryabinin
  2018-01-09 16:47 ` Andrey Ryabinin
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andrey Ryabinin @ 2018-01-09 16:37 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
	Chris Metcalf, David Laight, Dmitry Vyukov, Andrey Ryabinin,
	stable

strscpy() performs the word-at-a-time optimistic reads. So it may
may access the memory past the end of the object, which is perfectly fine
since strscpy() doesn't use that (past-the-end) data and makes sure the
optimistic read won't cross a page boundary.

But KASAN doesn't know anything about that so it will complain.
There are several possible ways to address this issue, but none
are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com

It seems the best solution is to simply disable word-at-a-time
optimization. My trivial testing shows that byte-at-a-time
could be up to x4.3 times slower than word-at-a-time.
It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
in a performance critical paths to copy large amounts of data,
so it shouldn't matter anyway.

Fixes: 30035e45753b7 ("string: provide strscpy()")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org>
---
 lib/string.c | 38 --------------------------------------
 1 file changed, 38 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..6205dd71aa0f 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,7 +29,6 @@
 #include <linux/errno.h>
 
 #include <asm/byteorder.h>
-#include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
 #ifndef __HAVE_ARCH_STRNCASECMP
@@ -177,45 +176,8 @@ EXPORT_SYMBOL(strlcpy);
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
-	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
-	size_t max = count;
 	long res = 0;
 
-	if (count == 0)
-		return -E2BIG;
-
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	/*
-	 * If src is unaligned, don't cross a page boundary,
-	 * since we don't know if the next page is mapped.
-	 */
-	if ((long)src & (sizeof(long) - 1)) {
-		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
-		if (limit < max)
-			max = limit;
-	}
-#else
-	/* If src or dest is unaligned, don't do word-at-a-time. */
-	if (((long) dest | (long) src) & (sizeof(long) - 1))
-		max = 0;
-#endif
-
-	while (max >= sizeof(unsigned long)) {
-		unsigned long c, data;
-
-		c = *(unsigned long *)(src+res);
-		if (has_zero(c, &data, &constants)) {
-			data = prep_zero_mask(c, data, &constants);
-			data = create_zero_mask(data);
-			*(unsigned long *)(dest+res) = c & zero_bytemask(data);
-			return res + find_zero(data);
-		}
-		*(unsigned long *)(dest+res) = c;
-		res += sizeof(unsigned long);
-		count -= sizeof(unsigned long);
-		max -= sizeof(unsigned long);
-	}
-
 	while (count) {
 		char c;
 
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
@ 2018-01-09 16:47 ` Andrey Ryabinin
  2018-01-24  8:54   ` Rasmus Villemoes
  2018-01-24  8:47 ` Rasmus Villemoes
  2018-01-24 15:53 ` Andy Shevchenko
  2 siblings, 1 reply; 12+ messages in thread
From: Andrey Ryabinin @ 2018-01-09 16:47 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
	Chris Metcalf, David Laight, Dmitry Vyukov, stable

[-- Attachment #1: Type: text/plain, Size: 2800 bytes --]

Attached user space program I used to see the difference.
Usage:
	gcc -02 -o strscpy strscpy_test.c
	./strscpy {b|w} src_str_len count

src_str_len - length of source string in between 1-4096
count - how many strscpy() to execute.



Also I've noticed something strange. I'm not sure why, but certain
src_len values (e.g. 30) drives branch predictor crazy causing worse than usual results
for byte-at-a-time copy:

$ perf stat   ./strscpy b 29 10000000

 Performance counter stats for './strscpy b 29 10000000':

        165.354974      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                48      page-faults:u             #    0.290 K/sec                  
       640,475,981      cycles:u                  #    3.873 GHz                    
     2,500,090,080      instructions:u            #    3.90  insn per cycle         
       640,017,126      branches:u                # 3870.565 M/sec                  
             1,589      branch-misses:u           #    0.00% of all branches        

       0.165568346 seconds time elapsed



 Performance counter stats for './strscpy b 30 10000000':

        250.835659      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                46      page-faults:u             #    0.183 K/sec                  
       974,528,780      cycles:u                  #    3.885 GHz                    
     2,580,090,165      instructions:u            #    2.65  insn per cycle         
       660,017,211      branches:u                # 2631.273 M/sec                  
        14,488,234      branch-misses:u           #    2.20% of all branches        

       0.251147341 seconds time elapsed


 Performance counter stats for './strscpy b 31 10000000':

        176.598368      task-clock:u (msec)       #    0.997 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
                46      page-faults:u             #    0.260 K/sec                  
       681,367,948      cycles:u                  #    3.858 GHz                    
     2,660,090,092      instructions:u            #    3.90  insn per cycle         
       680,017,138      branches:u                # 3850.642 M/sec                  
             1,817      branch-misses:u           #    0.00% of all branches        

       0.177150181 seconds time elapsed


[-- Attachment #2: strscpy_test.c --]
[-- Type: text/x-csrc, Size: 3291 bytes --]

#include <stdlib.h>
#include <string.h>

#define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))

#define E2BIG -1
#define PAGE_SIZE 4096
struct word_at_a_time {
	const unsigned long one_bits, high_bits;
};

#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }
static inline long count_masked_bytes(unsigned long mask)
{
	return mask*0x0001020304050608ul >> 56;
}
static inline unsigned long has_zero(unsigned long a, unsigned long *bits, const struct word_at_a_time *c)
{
	unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
	*bits = mask;
	return mask;
}

static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits, const struct word_at_a_time *c)
{
	return bits;
}

static inline unsigned long create_zero_mask(unsigned long bits)
{
	bits = (bits - 1) & ~bits;
	return bits >> 7;
}

/* The mask we created is directly usable as a bytemask */
#define zero_bytemask(mask) (mask)

static inline unsigned long find_zero(unsigned long mask)
{
	return count_masked_bytes(mask);
}

__attribute__((noinline))
int strscpy_word(char *dest, const char *src, size_t count)
{
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
	size_t max = count;
	long res = 0;

	if (count == 0)
		return -E2BIG;

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	/*
	 * If src is unaligned, don't cross a page boundary,
	 * since we don't know if the next page is mapped.
	 */
	if ((long)src & (sizeof(long) - 1)) {
		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
		if (limit < max)
			max = limit;
	}
#else
	/* If src or dest is unaligned, don't do word-at-a-time. */
	if (((long) dest | (long) src) & (sizeof(long) - 1))
		max = 0;
#endif

	while (max >= sizeof(unsigned long)) {
		unsigned long c, data;

		c = *(unsigned long *)(src+res);
		if (has_zero(c, &data, &constants)) {
			data = prep_zero_mask(c, data, &constants);
			data = create_zero_mask(data);
			*(unsigned long *)(dest+res) = c & zero_bytemask(data);
			return res + find_zero(data);
		}
		*(unsigned long *)(dest+res) = c;
		res += sizeof(unsigned long);
		count -= sizeof(unsigned long);
		max -= sizeof(unsigned long);
	}

	while (count) {
		char c;

		c = src[res];
		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;
	}

	/* Hit buffer length without finding a NUL; force NUL-termination. */
	if (res)
		dest[res-1] = '\0';

	return -E2BIG;
}

__attribute__((noinline))
int strscpy_byte(char *dest, const char *src, int count)
{
	int res = 0;

	while (count) {
		char c;
		c = src[res];
		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;

	}

	/* Hit buffer length without finding a NUL; force NUL-termination. */
	if (res)
		dest[res-1] = '\0';

	return -E2BIG;
}

char dest[4096] __attribute__((aligned(4096)));
char src[4096] __attribute__((aligned(4096)));

int main(int argc, char **argv)
{
	unsigned long long i;
	unsigned long src_len;
	unsigned long count;

	if (argc < 4)
		return -1;

	src_len = atoi(argv[2]);
	count = atoi(argv[3]);

	memset(src, 1, src_len);

	if (argv[1][0] == 'w') {
		for (i = 0; i < count; i++) {
			strscpy_word(dest, src, sizeof(dest));
		}
	} else if (argv[1][0] == 'b') {
		for (i = 0; i < count; i++) {
			strscpy_byte(dest, src, sizeof(dest));
		}
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
  2018-01-09 16:47 ` Andrey Ryabinin
@ 2018-01-24  8:47 ` Rasmus Villemoes
  2018-01-24 15:53 ` Andy Shevchenko
  2 siblings, 0 replies; 12+ messages in thread
From: Rasmus Villemoes @ 2018-01-24  8:47 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton, Linus Torvalds
  Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
	Chris Metcalf, David Laight, Dmitry Vyukov, stable

On 2018-01-09 17:37, Andrey Ryabinin wrote:
> strscpy() performs the word-at-a-time optimistic reads. So it may
> may access the memory past the end of the object, which is perfectly fine
> since strscpy() doesn't use that (past-the-end) data and makes sure the
> optimistic read won't cross a page boundary.
> 
> But KASAN doesn't know anything about that so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
> 
> It seems the best solution is to simply disable word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time
> could be up to x4.3 times slower than word-at-a-time.
> It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
> ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
> in a performance critical paths to copy large amounts of data,
> so it shouldn't matter anyway.
> 
> Fixes: 30035e45753b7 ("string: provide strscpy()")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: <stable@vger.kernel.org>
>

Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>

Your microbenchmark even favours word-at-a-time slightly, since in
practice I think at least one of src or dst will be unaligned a lot of
the time, and while x86 may HAVE_EFFICIENT_UNALIGNED_ACCESS, it's still
a little more expensive than doing aligned access. And since strscpy is
not called that often, I expect some of the ~300 bytes of instruction
cache it occupies can be put to better use elsewhere.

Rasmus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-09 16:47 ` Andrey Ryabinin
@ 2018-01-24  8:54   ` Rasmus Villemoes
  2018-01-24 17:52     ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Rasmus Villemoes @ 2018-01-24  8:54 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton, Linus Torvalds
  Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
	Chris Metcalf, David Laight, Dmitry Vyukov, stable

On 2018-01-09 17:47, Andrey Ryabinin wrote:
> Attached user space program I used to see the difference.
> Usage:
> 	gcc -02 -o strscpy strscpy_test.c
> 	./strscpy {b|w} src_str_len count
> 
> src_str_len - length of source string in between 1-4096
> count - how many strscpy() to execute.
>  
> Also I've noticed something strange. I'm not sure why, but certain
> src_len values (e.g. 30) drives branch predictor crazy causing worse than usual results
> for byte-at-a-time copy:

I see something similar, but at the 30->31 transition, and the
branch-misses remain at 1-3% for higher values, until 42 where it drops
back to 0%. Anyway, I highly doubt we do a lot of string copies of
strings longer then 32.

$ perf stat ./strscpy_test b 30 10000000

 Performance counter stats for './strscpy_test b 30 10000000':

        156,777082      task-clock (msec)         #    0,999 CPUs
utilized
                 0      context-switches          #    0,000 K/sec

                 0      cpu-migrations            #    0,000 K/sec

                48      page-faults               #    0,306 K/sec

       584.646.177      cycles                    #    3,729 GHz

   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
     2.580.599.614      instructions              #    4,41  insns per
cycle
       660.114.283      branches                  # 4210,528 M/sec

             4.891      branch-misses             #    0,00% of all
branches

       0,156970910 seconds time elapsed

$ perf stat ./strscpy_test b 31 10000000

 Performance counter stats for './strscpy_test b 31 10000000':

        258,533250      task-clock (msec)         #    0,999 CPUs
utilized
                 0      context-switches          #    0,000 K/sec

                 0      cpu-migrations            #    0,000 K/sec

                50      page-faults               #    0,193 K/sec

       965.505.138      cycles                    #    3,735 GHz

   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
     2.660.773.463      instructions              #    2,76  insns per
cycle
       680.141.051      branches                  # 2630,768 M/sec

        19.150.367      branch-misses             #    2,82% of all
branches

       0,258725192 seconds time elapsed


Rasmus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
  2018-01-09 16:47 ` Andrey Ryabinin
  2018-01-24  8:47 ` Rasmus Villemoes
@ 2018-01-24 15:53 ` Andy Shevchenko
  2 siblings, 0 replies; 12+ messages in thread
From: Andy Shevchenko @ 2018-01-24 15:53 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List,
	Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf,
	David Laight, Dmitry Vyukov, stable

On Tue, Jan 9, 2018 at 6:37 PM, Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> strscpy() performs the word-at-a-time optimistic reads. So it may
> may access the memory past the end of the object, which is perfectly fine
> since strscpy() doesn't use that (past-the-end) data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
>
> It seems the best solution is to simply disable word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time
> could be up to x4.3 times slower than word-at-a-time.
> It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
> ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
> in a performance critical paths to copy large amounts of data,
> so it shouldn't matter anyway.

What prevents you to revert the patch? After this one the all
advantages of the function becomes useless AFAIU.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-24  8:54   ` Rasmus Villemoes
@ 2018-01-24 17:52     ` Linus Torvalds
  2018-01-25  8:32       ` Dmitry Vyukov
  0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2018-01-24 17:52 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List,
	Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf,
	David Laight, Dmitry Vyukov, stable

On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
<rasmus.villemoes@prevas.dk> wrote:
>
> I see something similar, but at the 30->31 transition, and the
> branch-misses remain at 1-3% for higher values, until 42 where it drops
> back to 0%. Anyway, I highly doubt we do a lot of string copies of
> strings longer then 32.

So I really dislike that microbenchmark, because it just has the same
length all the time. Which is very wrong, and makes the benchmark
pointless. A big part of this all is branch mispredicts, you shouldn't
just hand it the pattern on a plate.

Anyway, the reason I really dislike the patch is not because I think
strscpy() is all that important, but I *do* think that the
word-at-a-time thing is conceptually something we do care about, and I
hate removing it just because of KASAN not understanding it.

So I'd *much* rather have some way to tell KASAN that word-at-a-time
is going on. Because that approach definitely makes a difference in
other places.

                    Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-24 17:52     ` Linus Torvalds
@ 2018-01-25  8:32       ` Dmitry Vyukov
  2018-01-25  8:42         ` David Laight
  2018-01-25 17:55         ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2018-01-25  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
	Linux Kernel Mailing List, Kees Cook, Eryu Guan,
	Alexander Potapenko, Chris Metcalf, David Laight, stable,
	kasan-dev

On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
> <rasmus.villemoes@prevas.dk> wrote:
>>
>> I see something similar, but at the 30->31 transition, and the
>> branch-misses remain at 1-3% for higher values, until 42 where it drops
>> back to 0%. Anyway, I highly doubt we do a lot of string copies of
>> strings longer then 32.
>
> So I really dislike that microbenchmark, because it just has the same
> length all the time. Which is very wrong, and makes the benchmark
> pointless. A big part of this all is branch mispredicts, you shouldn't
> just hand it the pattern on a plate.
>
> Anyway, the reason I really dislike the patch is not because I think
> strscpy() is all that important, but I *do* think that the
> word-at-a-time thing is conceptually something we do care about, and I
> hate removing it just because of KASAN not understanding it.
>
> So I'd *much* rather have some way to tell KASAN that word-at-a-time
> is going on. Because that approach definitely makes a difference in
> other places.


The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
once" part will affect codegen here, though.
But if word-at-a-time thing is conceptually something we do care
about, we could also introduce something like READ_PARTIALLY_VALID(),
which would check that at least first byte of the read is valid and
that it does not cross heap block boundary (but outside of KASAN is a
normal read).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-25  8:32       ` Dmitry Vyukov
@ 2018-01-25  8:42         ` David Laight
  2018-01-25  9:08           ` Dmitry Vyukov
  2018-01-25 17:55         ` Linus Torvalds
  1 sibling, 1 reply; 12+ messages in thread
From: David Laight @ 2018-01-25  8:42 UTC (permalink / raw)
  To: 'Dmitry Vyukov', Linus Torvalds
  Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
	Linux Kernel Mailing List, Kees Cook, Eryu Guan,
	Alexander Potapenko, Chris Metcalf, stable, kasan-dev

From: Dmitry Vyukov [mailto:dvyukov@google.com]
> Sent: 25 January 2018 08:33
> 
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
> > <rasmus.villemoes@prevas.dk> wrote:
> >>
> >> I see something similar, but at the 30->31 transition, and the
> >> branch-misses remain at 1-3% for higher values, until 42 where it drops
> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of
> >> strings longer then 32.
> >
> > So I really dislike that microbenchmark, because it just has the same
> > length all the time. Which is very wrong, and makes the benchmark
> > pointless. A big part of this all is branch mispredicts, you shouldn't
> > just hand it the pattern on a plate.
> >
> > Anyway, the reason I really dislike the patch is not because I think
> > strscpy() is all that important, but I *do* think that the
> > word-at-a-time thing is conceptually something we do care about, and I
> > hate removing it just because of KASAN not understanding it.
> >
> > So I'd *much* rather have some way to tell KASAN that word-at-a-time
> > is going on. Because that approach definitely makes a difference in
> > other places.
> 
> 
> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
> once" part will affect codegen here, though.
> But if word-at-a-time thing is conceptually something we do care
> about, we could also introduce something like READ_PARTIALLY_VALID(),
> which would check that at least first byte of the read is valid and
> that it does not cross heap block boundary (but outside of KASAN is a
> normal read).

The first byte might not have been written either.
For example, doing a strlen() on a misaligned string you might read
the aligned word containing the first byte and adjust the value so
that the initial byte(s) are not zero.
After scanning for a zero byte the length would be corrected.

	David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-25  8:42         ` David Laight
@ 2018-01-25  9:08           ` Dmitry Vyukov
  0 siblings, 0 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2018-01-25  9:08 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
	Linux Kernel Mailing List, Kees Cook, Eryu Guan,
	Alexander Potapenko, Chris Metcalf, stable, kasan-dev

On Thu, Jan 25, 2018 at 9:42 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Dmitry Vyukov [mailto:dvyukov@google.com]
>> Sent: 25 January 2018 08:33
>>
>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
>> > <rasmus.villemoes@prevas.dk> wrote:
>> >>
>> >> I see something similar, but at the 30->31 transition, and the
>> >> branch-misses remain at 1-3% for higher values, until 42 where it drops
>> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of
>> >> strings longer then 32.
>> >
>> > So I really dislike that microbenchmark, because it just has the same
>> > length all the time. Which is very wrong, and makes the benchmark
>> > pointless. A big part of this all is branch mispredicts, you shouldn't
>> > just hand it the pattern on a plate.
>> >
>> > Anyway, the reason I really dislike the patch is not because I think
>> > strscpy() is all that important, but I *do* think that the
>> > word-at-a-time thing is conceptually something we do care about, and I
>> > hate removing it just because of KASAN not understanding it.
>> >
>> > So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> > is going on. Because that approach definitely makes a difference in
>> > other places.
>>
>>
>> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
>> once" part will affect codegen here, though.
>> But if word-at-a-time thing is conceptually something we do care
>> about, we could also introduce something like READ_PARTIALLY_VALID(),
>> which would check that at least first byte of the read is valid and
>> that it does not cross heap block boundary (but outside of KASAN is a
>> normal read).
>
> The first byte might not have been written either.
> For example, doing a strlen() on a misaligned string you might read
> the aligned word containing the first byte and adjust the value so
> that the initial byte(s) are not zero.
> After scanning for a zero byte the length would be corrected.

Was the first byte at least kmalloc-ed? That's what KASAN checks, it
does not care about "written". KMSAN can detect uses of uninit data,
but that's another story.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-25  8:32       ` Dmitry Vyukov
  2018-01-25  8:42         ` David Laight
@ 2018-01-25 17:55         ` Linus Torvalds
  2018-01-25 19:13           ` Andrey Ryabinin
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2018-01-25 17:55 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
	Linux Kernel Mailing List, Kees Cook, Eryu Guan,
	Alexander Potapenko, Chris Metcalf, David Laight, stable,
	kasan-dev

[-- Attachment #1: Type: text/plain, Size: 804 bytes --]

On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> is going on. Because that approach definitely makes a difference in
>> other places.
>
> The other option was to use READ_ONCE_NOCHECK().

How about just using the same accessor that we do for the dcache case.
That gives a reasonable example of the whole word-at-a-time model, and
should be good.

COMPLETELY UNTESTED patch attached. This needs to be verified.

It does limit the word-at-a-time code to the architectures that select
both HAVE_EFFICIENT_UNALIGNED_ACCESS and DCACHE_WORD_ACCESS, but that
seems a reasonable choice anyway.

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1651 bytes --]

 lib/string.c | 28 +++++++---------------------
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..25a3c200307e 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -177,33 +177,18 @@ EXPORT_SYMBOL(strlcpy);
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
-	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
-	size_t max = count;
 	long res = 0;
 
 	if (count == 0)
 		return -E2BIG;
 
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	/*
-	 * If src is unaligned, don't cross a page boundary,
-	 * since we don't know if the next page is mapped.
-	 */
-	if ((long)src & (sizeof(long) - 1)) {
-		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
-		if (limit < max)
-			max = limit;
-	}
-#else
-	/* If src or dest is unaligned, don't do word-at-a-time. */
-	if (((long) dest | (long) src) & (sizeof(long) - 1))
-		max = 0;
-#endif
-
-	while (max >= sizeof(unsigned long)) {
+#if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS && CONFIG_DCACHE_WORD_ACCESS
+{
+	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
+	while (count >= sizeof(unsigned long)) {
 		unsigned long c, data;
 
-		c = *(unsigned long *)(src+res);
+		c = load_unaligned_zeropad(src+res);
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
@@ -213,8 +198,9 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
 		*(unsigned long *)(dest+res) = c;
 		res += sizeof(unsigned long);
 		count -= sizeof(unsigned long);
-		max -= sizeof(unsigned long);
 	}
+}
+#endif
 
 	while (count) {
 		char c;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-25 17:55         ` Linus Torvalds
@ 2018-01-25 19:13           ` Andrey Ryabinin
  2018-01-30  9:12             ` Dmitry Vyukov
  0 siblings, 1 reply; 12+ messages in thread
From: Andrey Ryabinin @ 2018-01-25 19:13 UTC (permalink / raw)
  To: Linus Torvalds, Dmitry Vyukov
  Cc: Rasmus Villemoes, Andrew Morton, Linux Kernel Mailing List,
	Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf,
	David Laight, stable, kasan-dev

On 01/25/2018 08:55 PM, Linus Torvalds wrote:
> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>>> is going on. Because that approach definitely makes a difference in
>>> other places.
>>
>> The other option was to use READ_ONCE_NOCHECK().
> 
> How about just using the same accessor that we do for the dcache case.
> That gives a reasonable example of the whole word-at-a-time model, and
> should be good.
> 

If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1),
than it should be fine. We don't want completely unchecked read of a source string.

But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
So I was going to send something like the hunk bellow (split in several patches).

Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong
to use it to load *cs only to shut up KASAN.


---
 fs/dcache.c              |  2 +-
 include/linux/compiler.h | 11 +++++++++++
 lib/string.c             |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7df1df81ff..6aa7be55a96d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 	unsigned long a,b,mask;
 
 	for (;;) {
-		a = *(unsigned long *)cs;
+		a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
 		b = load_unaligned_zeropad(ct);
 		if (tcount < sizeof(unsigned long))
 			break;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..85b63c2e196e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  * required ordering.
  */
 #include <asm/barrier.h>
+#include <linux/kasan-checks.h>
 
 #define __READ_ONCE(x, check)						\
 ({									\
@@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  */
 #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
 
+#ifdef CONFIG_KASAN
+#define READ_PARTIAL_CHECK(x)		\
+({					\
+	kasan_check_read(&(x), 1);	\
+	READ_ONCE_NOCHECK(x);		\
+})
+#else
+#define READ_PARTIAL_CHECK(x) (x)
+#endif
+
 #define WRITE_ONCE(x, val) \
 ({							\
 	union { typeof(x) __val; char __c[1]; } __u =	\
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..2396856e4c56 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
 	while (max >= sizeof(unsigned long)) {
 		unsigned long c, data;
 
-		c = *(unsigned long *)(src+res);
+		c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
  2018-01-25 19:13           ` Andrey Ryabinin
@ 2018-01-30  9:12             ` Dmitry Vyukov
  0 siblings, 0 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2018-01-30  9:12 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Linus Torvalds, Rasmus Villemoes, Andrew Morton,
	Linux Kernel Mailing List, Kees Cook, Eryu Guan,
	Alexander Potapenko, Chris Metcalf, David Laight, stable,
	kasan-dev

On Thu, Jan 25, 2018 at 8:13 PM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> On 01/25/2018 08:55 PM, Linus Torvalds wrote:
>> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>>>> is going on. Because that approach definitely makes a difference in
>>>> other places.
>>>
>>> The other option was to use READ_ONCE_NOCHECK().
>>
>> How about just using the same accessor that we do for the dcache case.
>> That gives a reasonable example of the whole word-at-a-time model, and
>> should be good.
>>
>
> If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1),
> than it should be fine. We don't want completely unchecked read of a source string.
>
> But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
> So I was going to send something like the hunk bellow (split in several patches).
>
> Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong
> to use it to load *cs only to shut up KASAN.
>
>
> ---
>  fs/dcache.c              |  2 +-
>  include/linux/compiler.h | 11 +++++++++++
>  lib/string.c             |  2 +-
>  3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7df1df81ff..6aa7be55a96d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
>         unsigned long a,b,mask;
>
>         for (;;) {
> -               a = *(unsigned long *)cs;
> +               a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
>                 b = load_unaligned_zeropad(ct);
>                 if (tcount < sizeof(unsigned long))
>                         break;
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 52e611ab9a6c..85b63c2e196e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>   * required ordering.
>   */
>  #include <asm/barrier.h>
> +#include <linux/kasan-checks.h>
>
>  #define __READ_ONCE(x, check)                                          \
>  ({                                                                     \
> @@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>   */
>  #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
>
> +#ifdef CONFIG_KASAN
> +#define READ_PARTIAL_CHECK(x)          \
> +({                                     \
> +       kasan_check_read(&(x), 1);      \
> +       READ_ONCE_NOCHECK(x);           \
> +})
> +#else
> +#define READ_PARTIAL_CHECK(x) (x)
> +#endif
> +
>  #define WRITE_ONCE(x, val) \
>  ({                                                     \
>         union { typeof(x) __val; char __c[1]; } __u =   \
> diff --git a/lib/string.c b/lib/string.c
> index 64a9e33f1daa..2396856e4c56 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
>         while (max >= sizeof(unsigned long)) {
>                 unsigned long c, data;
>
> -               c = *(unsigned long *)(src+res);
> +               c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
>                 if (has_zero(c, &data, &constants)) {
>                         data = prep_zero_mask(c, data, &constants);
>                         data = create_zero_mask(data);


Looks good to me a general way to support word-at-a-time pattern.

This will also get rid of this in fs/dcache.c:

                if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
                        kasan_unpoison_shadow(dname,
                                round_up(name->len + 1, sizeof(unsigned long)));

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-01-30  9:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
2018-01-09 16:47 ` Andrey Ryabinin
2018-01-24  8:54   ` Rasmus Villemoes
2018-01-24 17:52     ` Linus Torvalds
2018-01-25  8:32       ` Dmitry Vyukov
2018-01-25  8:42         ` David Laight
2018-01-25  9:08           ` Dmitry Vyukov
2018-01-25 17:55         ` Linus Torvalds
2018-01-25 19:13           ` Andrey Ryabinin
2018-01-30  9:12             ` Dmitry Vyukov
2018-01-24  8:47 ` Rasmus Villemoes
2018-01-24 15:53 ` Andy Shevchenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).