* [PATCH] lib/strscpy: remove word-at-a-time optimization.
@ 2018-01-09 16:37 Andrey Ryabinin
2018-01-09 16:47 ` Andrey Ryabinin
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Andrey Ryabinin @ 2018-01-09 16:37 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds
Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
Chris Metcalf, David Laight, Dmitry Vyukov, Andrey Ryabinin,
stable
strscpy() performs the word-at-a-time optimistic reads. So it may
may access the memory past the end of the object, which is perfectly fine
since strscpy() doesn't use that (past-the-end) data and makes sure the
optimistic read won't cross a page boundary.
But KASAN doesn't know anything about that so it will complain.
There are several possible ways to address this issue, but none
are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
It seems the best solution is to simply disable word-at-a-time
optimization. My trivial testing shows that byte-at-a-time
could be up to x4.3 times slower than word-at-a-time.
It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
in a performance critical paths to copy large amounts of data,
so it shouldn't matter anyway.
Fixes: 30035e45753b7 ("string: provide strscpy()")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org>
---
lib/string.c | 38 --------------------------------------
1 file changed, 38 deletions(-)
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..6205dd71aa0f 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,7 +29,6 @@
#include <linux/errno.h>
#include <asm/byteorder.h>
-#include <asm/word-at-a-time.h>
#include <asm/page.h>
#ifndef __HAVE_ARCH_STRNCASECMP
@@ -177,45 +176,8 @@ EXPORT_SYMBOL(strlcpy);
*/
ssize_t strscpy(char *dest, const char *src, size_t count)
{
- const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
- size_t max = count;
long res = 0;
- if (count == 0)
- return -E2BIG;
-
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
- /*
- * If src is unaligned, don't cross a page boundary,
- * since we don't know if the next page is mapped.
- */
- if ((long)src & (sizeof(long) - 1)) {
- size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
- if (limit < max)
- max = limit;
- }
-#else
- /* If src or dest is unaligned, don't do word-at-a-time. */
- if (((long) dest | (long) src) & (sizeof(long) - 1))
- max = 0;
-#endif
-
- while (max >= sizeof(unsigned long)) {
- unsigned long c, data;
-
- c = *(unsigned long *)(src+res);
- if (has_zero(c, &data, &constants)) {
- data = prep_zero_mask(c, data, &constants);
- data = create_zero_mask(data);
- *(unsigned long *)(dest+res) = c & zero_bytemask(data);
- return res + find_zero(data);
- }
- *(unsigned long *)(dest+res) = c;
- res += sizeof(unsigned long);
- count -= sizeof(unsigned long);
- max -= sizeof(unsigned long);
- }
-
while (count) {
char c;
--
2.13.6
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
@ 2018-01-09 16:47 ` Andrey Ryabinin
2018-01-24 8:54 ` Rasmus Villemoes
2018-01-24 8:47 ` Rasmus Villemoes
2018-01-24 15:53 ` Andy Shevchenko
2 siblings, 1 reply; 12+ messages in thread
From: Andrey Ryabinin @ 2018-01-09 16:47 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds
Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
Chris Metcalf, David Laight, Dmitry Vyukov, stable
[-- Attachment #1: Type: text/plain, Size: 2800 bytes --]
Attached user space program I used to see the difference.
Usage:
gcc -02 -o strscpy strscpy_test.c
./strscpy {b|w} src_str_len count
src_str_len - length of source string in between 1-4096
count - how many strscpy() to execute.
Also I've noticed something strange. I'm not sure why, but certain
src_len values (e.g. 30) drives branch predictor crazy causing worse than usual results
for byte-at-a-time copy:
$ perf stat ./strscpy b 29 10000000
Performance counter stats for './strscpy b 29 10000000':
165.354974 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
48 page-faults:u # 0.290 K/sec
640,475,981 cycles:u # 3.873 GHz
2,500,090,080 instructions:u # 3.90 insn per cycle
640,017,126 branches:u # 3870.565 M/sec
1,589 branch-misses:u # 0.00% of all branches
0.165568346 seconds time elapsed
Performance counter stats for './strscpy b 30 10000000':
250.835659 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
46 page-faults:u # 0.183 K/sec
974,528,780 cycles:u # 3.885 GHz
2,580,090,165 instructions:u # 2.65 insn per cycle
660,017,211 branches:u # 2631.273 M/sec
14,488,234 branch-misses:u # 2.20% of all branches
0.251147341 seconds time elapsed
Performance counter stats for './strscpy b 31 10000000':
176.598368 task-clock:u (msec) # 0.997 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
46 page-faults:u # 0.260 K/sec
681,367,948 cycles:u # 3.858 GHz
2,660,090,092 instructions:u # 3.90 insn per cycle
680,017,138 branches:u # 3850.642 M/sec
1,817 branch-misses:u # 0.00% of all branches
0.177150181 seconds time elapsed
[-- Attachment #2: strscpy_test.c --]
[-- Type: text/x-csrc, Size: 3291 bytes --]
#include <stdlib.h>
#include <string.h>
#define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
#define REPEAT_BYTE(x) ((~0ul / 0xff) * (x))
#define E2BIG -1
#define PAGE_SIZE 4096
struct word_at_a_time {
const unsigned long one_bits, high_bits;
};
#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }
static inline long count_masked_bytes(unsigned long mask)
{
return mask*0x0001020304050608ul >> 56;
}
static inline unsigned long has_zero(unsigned long a, unsigned long *bits, const struct word_at_a_time *c)
{
unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
*bits = mask;
return mask;
}
static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits, const struct word_at_a_time *c)
{
return bits;
}
static inline unsigned long create_zero_mask(unsigned long bits)
{
bits = (bits - 1) & ~bits;
return bits >> 7;
}
/* The mask we created is directly usable as a bytemask */
#define zero_bytemask(mask) (mask)
static inline unsigned long find_zero(unsigned long mask)
{
return count_masked_bytes(mask);
}
__attribute__((noinline))
int strscpy_word(char *dest, const char *src, size_t count)
{
const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
size_t max = count;
long res = 0;
if (count == 0)
return -E2BIG;
#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
/*
* If src is unaligned, don't cross a page boundary,
* since we don't know if the next page is mapped.
*/
if ((long)src & (sizeof(long) - 1)) {
size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
if (limit < max)
max = limit;
}
#else
/* If src or dest is unaligned, don't do word-at-a-time. */
if (((long) dest | (long) src) & (sizeof(long) - 1))
max = 0;
#endif
while (max >= sizeof(unsigned long)) {
unsigned long c, data;
c = *(unsigned long *)(src+res);
if (has_zero(c, &data, &constants)) {
data = prep_zero_mask(c, data, &constants);
data = create_zero_mask(data);
*(unsigned long *)(dest+res) = c & zero_bytemask(data);
return res + find_zero(data);
}
*(unsigned long *)(dest+res) = c;
res += sizeof(unsigned long);
count -= sizeof(unsigned long);
max -= sizeof(unsigned long);
}
while (count) {
char c;
c = src[res];
dest[res] = c;
if (!c)
return res;
res++;
count--;
}
/* Hit buffer length without finding a NUL; force NUL-termination. */
if (res)
dest[res-1] = '\0';
return -E2BIG;
}
__attribute__((noinline))
int strscpy_byte(char *dest, const char *src, int count)
{
int res = 0;
while (count) {
char c;
c = src[res];
dest[res] = c;
if (!c)
return res;
res++;
count--;
}
/* Hit buffer length without finding a NUL; force NUL-termination. */
if (res)
dest[res-1] = '\0';
return -E2BIG;
}
char dest[4096] __attribute__((aligned(4096)));
char src[4096] __attribute__((aligned(4096)));
int main(int argc, char **argv)
{
unsigned long long i;
unsigned long src_len;
unsigned long count;
if (argc < 4)
return -1;
src_len = atoi(argv[2]);
count = atoi(argv[3]);
memset(src, 1, src_len);
if (argv[1][0] == 'w') {
for (i = 0; i < count; i++) {
strscpy_word(dest, src, sizeof(dest));
}
} else if (argv[1][0] == 'b') {
for (i = 0; i < count; i++) {
strscpy_byte(dest, src, sizeof(dest));
}
}
return 0;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
2018-01-09 16:47 ` Andrey Ryabinin
@ 2018-01-24 8:47 ` Rasmus Villemoes
2018-01-24 15:53 ` Andy Shevchenko
2 siblings, 0 replies; 12+ messages in thread
From: Rasmus Villemoes @ 2018-01-24 8:47 UTC (permalink / raw)
To: Andrey Ryabinin, Andrew Morton, Linus Torvalds
Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
Chris Metcalf, David Laight, Dmitry Vyukov, stable
On 2018-01-09 17:37, Andrey Ryabinin wrote:
> strscpy() performs the word-at-a-time optimistic reads. So it may
> may access the memory past the end of the object, which is perfectly fine
> since strscpy() doesn't use that (past-the-end) data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
>
> It seems the best solution is to simply disable word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time
> could be up to x4.3 times slower than word-at-a-time.
> It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
> ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
> in a performance critical paths to copy large amounts of data,
> so it shouldn't matter anyway.
>
> Fixes: 30035e45753b7 ("string: provide strscpy()")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: <stable@vger.kernel.org>
>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Your microbenchmark even favours word-at-a-time slightly, since in
practice I think at least one of src or dst will be unaligned a lot of
the time, and while x86 may HAVE_EFFICIENT_UNALIGNED_ACCESS, it's still
a little more expensive than doing aligned access. And since strscpy is
not called that often, I expect some of the ~300 bytes of instruction
cache it occupies can be put to better use elsewhere.
Rasmus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-09 16:47 ` Andrey Ryabinin
@ 2018-01-24 8:54 ` Rasmus Villemoes
2018-01-24 17:52 ` Linus Torvalds
0 siblings, 1 reply; 12+ messages in thread
From: Rasmus Villemoes @ 2018-01-24 8:54 UTC (permalink / raw)
To: Andrey Ryabinin, Andrew Morton, Linus Torvalds
Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko,
Chris Metcalf, David Laight, Dmitry Vyukov, stable
On 2018-01-09 17:47, Andrey Ryabinin wrote:
> Attached user space program I used to see the difference.
> Usage:
> gcc -02 -o strscpy strscpy_test.c
> ./strscpy {b|w} src_str_len count
>
> src_str_len - length of source string in between 1-4096
> count - how many strscpy() to execute.
>
> Also I've noticed something strange. I'm not sure why, but certain
> src_len values (e.g. 30) drives branch predictor crazy causing worse than usual results
> for byte-at-a-time copy:
I see something similar, but at the 30->31 transition, and the
branch-misses remain at 1-3% for higher values, until 42 where it drops
back to 0%. Anyway, I highly doubt we do a lot of string copies of
strings longer then 32.
$ perf stat ./strscpy_test b 30 10000000
Performance counter stats for './strscpy_test b 30 10000000':
156,777082 task-clock (msec) # 0,999 CPUs
utilized
0 context-switches # 0,000 K/sec
0 cpu-migrations # 0,000 K/sec
48 page-faults # 0,306 K/sec
584.646.177 cycles # 3,729 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
2.580.599.614 instructions # 4,41 insns per
cycle
660.114.283 branches # 4210,528 M/sec
4.891 branch-misses # 0,00% of all
branches
0,156970910 seconds time elapsed
$ perf stat ./strscpy_test b 31 10000000
Performance counter stats for './strscpy_test b 31 10000000':
258,533250 task-clock (msec) # 0,999 CPUs
utilized
0 context-switches # 0,000 K/sec
0 cpu-migrations # 0,000 K/sec
50 page-faults # 0,193 K/sec
965.505.138 cycles # 3,735 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
2.660.773.463 instructions # 2,76 insns per
cycle
680.141.051 branches # 2630,768 M/sec
19.150.367 branch-misses # 2,82% of all
branches
0,258725192 seconds time elapsed
Rasmus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
2018-01-09 16:47 ` Andrey Ryabinin
2018-01-24 8:47 ` Rasmus Villemoes
@ 2018-01-24 15:53 ` Andy Shevchenko
2 siblings, 0 replies; 12+ messages in thread
From: Andy Shevchenko @ 2018-01-24 15:53 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List,
Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf,
David Laight, Dmitry Vyukov, stable
On Tue, Jan 9, 2018 at 6:37 PM, Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> strscpy() performs the word-at-a-time optimistic reads. So it may
> may access the memory past the end of the object, which is perfectly fine
> since strscpy() doesn't use that (past-the-end) data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
>
> It seems the best solution is to simply disable word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time
> could be up to x4.3 times slower than word-at-a-time.
> It may seems like a lot, but it's actually ~1.2e-10 sec per symbol vs
> ~4.8e-10 sec per symbol on modern hardware. And we don't use strscpy()
> in a performance critical paths to copy large amounts of data,
> so it shouldn't matter anyway.
What prevents you to revert the patch? After this one the all
advantages of the function becomes useless AFAIU.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-24 8:54 ` Rasmus Villemoes
@ 2018-01-24 17:52 ` Linus Torvalds
2018-01-25 8:32 ` Dmitry Vyukov
0 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2018-01-24 17:52 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List,
Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf,
David Laight, Dmitry Vyukov, stable
On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
<rasmus.villemoes@prevas.dk> wrote:
>
> I see something similar, but at the 30->31 transition, and the
> branch-misses remain at 1-3% for higher values, until 42 where it drops
> back to 0%. Anyway, I highly doubt we do a lot of string copies of
> strings longer then 32.
So I really dislike that microbenchmark, because it just has the same
length all the time. Which is very wrong, and makes the benchmark
pointless. A big part of this all is branch mispredicts, you shouldn't
just hand it the pattern on a plate.
Anyway, the reason I really dislike the patch is not because I think
strscpy() is all that important, but I *do* think that the
word-at-a-time thing is conceptually something we do care about, and I
hate removing it just because of KASAN not understanding it.
So I'd *much* rather have some way to tell KASAN that word-at-a-time
is going on. Because that approach definitely makes a difference in
other places.
Linus
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-24 17:52 ` Linus Torvalds
@ 2018-01-25 8:32 ` Dmitry Vyukov
2018-01-25 8:42 ` David Laight
2018-01-25 17:55 ` Linus Torvalds
0 siblings, 2 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2018-01-25 8:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
Linux Kernel Mailing List, Kees Cook, Eryu Guan,
Alexander Potapenko, Chris Metcalf, David Laight, stable,
kasan-dev
On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
> <rasmus.villemoes@prevas.dk> wrote:
>>
>> I see something similar, but at the 30->31 transition, and the
>> branch-misses remain at 1-3% for higher values, until 42 where it drops
>> back to 0%. Anyway, I highly doubt we do a lot of string copies of
>> strings longer then 32.
>
> So I really dislike that microbenchmark, because it just has the same
> length all the time. Which is very wrong, and makes the benchmark
> pointless. A big part of this all is branch mispredicts, you shouldn't
> just hand it the pattern on a plate.
>
> Anyway, the reason I really dislike the patch is not because I think
> strscpy() is all that important, but I *do* think that the
> word-at-a-time thing is conceptually something we do care about, and I
> hate removing it just because of KASAN not understanding it.
>
> So I'd *much* rather have some way to tell KASAN that word-at-a-time
> is going on. Because that approach definitely makes a difference in
> other places.
The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
once" part will affect codegen here, though.
But if word-at-a-time thing is conceptually something we do care
about, we could also introduce something like READ_PARTIALLY_VALID(),
which would check that at least first byte of the read is valid and
that it does not cross heap block boundary (but outside of KASAN is a
normal read).
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-25 8:32 ` Dmitry Vyukov
@ 2018-01-25 8:42 ` David Laight
2018-01-25 9:08 ` Dmitry Vyukov
2018-01-25 17:55 ` Linus Torvalds
1 sibling, 1 reply; 12+ messages in thread
From: David Laight @ 2018-01-25 8:42 UTC (permalink / raw)
To: 'Dmitry Vyukov', Linus Torvalds
Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
Linux Kernel Mailing List, Kees Cook, Eryu Guan,
Alexander Potapenko, Chris Metcalf, stable, kasan-dev
From: Dmitry Vyukov [mailto:dvyukov@google.com]
> Sent: 25 January 2018 08:33
>
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
> > <rasmus.villemoes@prevas.dk> wrote:
> >>
> >> I see something similar, but at the 30->31 transition, and the
> >> branch-misses remain at 1-3% for higher values, until 42 where it drops
> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of
> >> strings longer then 32.
> >
> > So I really dislike that microbenchmark, because it just has the same
> > length all the time. Which is very wrong, and makes the benchmark
> > pointless. A big part of this all is branch mispredicts, you shouldn't
> > just hand it the pattern on a plate.
> >
> > Anyway, the reason I really dislike the patch is not because I think
> > strscpy() is all that important, but I *do* think that the
> > word-at-a-time thing is conceptually something we do care about, and I
> > hate removing it just because of KASAN not understanding it.
> >
> > So I'd *much* rather have some way to tell KASAN that word-at-a-time
> > is going on. Because that approach definitely makes a difference in
> > other places.
>
>
> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
> once" part will affect codegen here, though.
> But if word-at-a-time thing is conceptually something we do care
> about, we could also introduce something like READ_PARTIALLY_VALID(),
> which would check that at least first byte of the read is valid and
> that it does not cross heap block boundary (but outside of KASAN is a
> normal read).
The first byte might not have been written either.
For example, doing a strlen() on a misaligned string you might read
the aligned word containing the first byte and adjust the value so
that the initial byte(s) are not zero.
After scanning for a zero byte the length would be corrected.
David
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-25 8:42 ` David Laight
@ 2018-01-25 9:08 ` Dmitry Vyukov
0 siblings, 0 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2018-01-25 9:08 UTC (permalink / raw)
To: David Laight
Cc: Linus Torvalds, Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
Linux Kernel Mailing List, Kees Cook, Eryu Guan,
Alexander Potapenko, Chris Metcalf, stable, kasan-dev
On Thu, Jan 25, 2018 at 9:42 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Dmitry Vyukov [mailto:dvyukov@google.com]
>> Sent: 25 January 2018 08:33
>>
>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
>> > <rasmus.villemoes@prevas.dk> wrote:
>> >>
>> >> I see something similar, but at the 30->31 transition, and the
>> >> branch-misses remain at 1-3% for higher values, until 42 where it drops
>> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of
>> >> strings longer then 32.
>> >
>> > So I really dislike that microbenchmark, because it just has the same
>> > length all the time. Which is very wrong, and makes the benchmark
>> > pointless. A big part of this all is branch mispredicts, you shouldn't
>> > just hand it the pattern on a plate.
>> >
>> > Anyway, the reason I really dislike the patch is not because I think
>> > strscpy() is all that important, but I *do* think that the
>> > word-at-a-time thing is conceptually something we do care about, and I
>> > hate removing it just because of KASAN not understanding it.
>> >
>> > So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> > is going on. Because that approach definitely makes a difference in
>> > other places.
>>
>>
>> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read
>> once" part will affect codegen here, though.
>> But if word-at-a-time thing is conceptually something we do care
>> about, we could also introduce something like READ_PARTIALLY_VALID(),
>> which would check that at least first byte of the read is valid and
>> that it does not cross heap block boundary (but outside of KASAN is a
>> normal read).
>
> The first byte might not have been written either.
> For example, doing a strlen() on a misaligned string you might read
> the aligned word containing the first byte and adjust the value so
> that the initial byte(s) are not zero.
> After scanning for a zero byte the length would be corrected.
Was the first byte at least kmalloc-ed? That's what KASAN checks, it
does not care about "written". KMSAN can detect uses of uninit data,
but that's another story.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-25 8:32 ` Dmitry Vyukov
2018-01-25 8:42 ` David Laight
@ 2018-01-25 17:55 ` Linus Torvalds
2018-01-25 19:13 ` Andrey Ryabinin
1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2018-01-25 17:55 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton,
Linux Kernel Mailing List, Kees Cook, Eryu Guan,
Alexander Potapenko, Chris Metcalf, David Laight, stable,
kasan-dev
[-- Attachment #1: Type: text/plain, Size: 804 bytes --]
On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> is going on. Because that approach definitely makes a difference in
>> other places.
>
> The other option was to use READ_ONCE_NOCHECK().
How about just using the same accessor that we do for the dcache case.
That gives a reasonable example of the whole word-at-a-time model, and
should be good.
COMPLETELY UNTESTED patch attached. This needs to be verified.
It does limit the word-at-a-time code to the architectures that select
both HAVE_EFFICIENT_UNALIGNED_ACCESS and DCACHE_WORD_ACCESS, but that
seems a reasonable choice anyway.
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1651 bytes --]
lib/string.c | 28 +++++++---------------------
1 file changed, 7 insertions(+), 21 deletions(-)
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..25a3c200307e 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -177,33 +177,18 @@ EXPORT_SYMBOL(strlcpy);
*/
ssize_t strscpy(char *dest, const char *src, size_t count)
{
- const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
- size_t max = count;
long res = 0;
if (count == 0)
return -E2BIG;
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
- /*
- * If src is unaligned, don't cross a page boundary,
- * since we don't know if the next page is mapped.
- */
- if ((long)src & (sizeof(long) - 1)) {
- size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
- if (limit < max)
- max = limit;
- }
-#else
- /* If src or dest is unaligned, don't do word-at-a-time. */
- if (((long) dest | (long) src) & (sizeof(long) - 1))
- max = 0;
-#endif
-
- while (max >= sizeof(unsigned long)) {
+#if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS && CONFIG_DCACHE_WORD_ACCESS
+{
+ const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
+ while (count >= sizeof(unsigned long)) {
unsigned long c, data;
- c = *(unsigned long *)(src+res);
+ c = load_unaligned_zeropad(src+res);
if (has_zero(c, &data, &constants)) {
data = prep_zero_mask(c, data, &constants);
data = create_zero_mask(data);
@@ -213,8 +198,9 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
*(unsigned long *)(dest+res) = c;
res += sizeof(unsigned long);
count -= sizeof(unsigned long);
- max -= sizeof(unsigned long);
}
+}
+#endif
while (count) {
char c;
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-25 17:55 ` Linus Torvalds
@ 2018-01-25 19:13 ` Andrey Ryabinin
2018-01-30 9:12 ` Dmitry Vyukov
0 siblings, 1 reply; 12+ messages in thread
From: Andrey Ryabinin @ 2018-01-25 19:13 UTC (permalink / raw)
To: Linus Torvalds, Dmitry Vyukov
Cc: Rasmus Villemoes, Andrew Morton, Linux Kernel Mailing List,
Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf,
David Laight, stable, kasan-dev
On 01/25/2018 08:55 PM, Linus Torvalds wrote:
> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>>> is going on. Because that approach definitely makes a difference in
>>> other places.
>>
>> The other option was to use READ_ONCE_NOCHECK().
>
> How about just using the same accessor that we do for the dcache case.
> That gives a reasonable example of the whole word-at-a-time model, and
> should be good.
>
If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1),
than it should be fine. We don't want completely unchecked read of a source string.
But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
So I was going to send something like the hunk bellow (split in several patches).
Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong
to use it to load *cs only to shut up KASAN.
---
fs/dcache.c | 2 +-
include/linux/compiler.h | 11 +++++++++++
lib/string.c | 2 +-
3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7df1df81ff..6aa7be55a96d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
unsigned long a,b,mask;
for (;;) {
- a = *(unsigned long *)cs;
+ a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
b = load_unaligned_zeropad(ct);
if (tcount < sizeof(unsigned long))
break;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..85b63c2e196e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
* required ordering.
*/
#include <asm/barrier.h>
+#include <linux/kasan-checks.h>
#define __READ_ONCE(x, check) \
({ \
@@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
*/
#define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
+#ifdef CONFIG_KASAN
+#define READ_PARTIAL_CHECK(x) \
+({ \
+ kasan_check_read(&(x), 1); \
+ READ_ONCE_NOCHECK(x); \
+})
+#else
+#define READ_PARTIAL_CHECK(x) (x)
+#endif
+
#define WRITE_ONCE(x, val) \
({ \
union { typeof(x) __val; char __c[1]; } __u = \
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..2396856e4c56 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
while (max >= sizeof(unsigned long)) {
unsigned long c, data;
- c = *(unsigned long *)(src+res);
+ c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
if (has_zero(c, &data, &constants)) {
data = prep_zero_mask(c, data, &constants);
data = create_zero_mask(data);
--
2.13.6
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
2018-01-25 19:13 ` Andrey Ryabinin
@ 2018-01-30 9:12 ` Dmitry Vyukov
0 siblings, 0 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2018-01-30 9:12 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: Linus Torvalds, Rasmus Villemoes, Andrew Morton,
Linux Kernel Mailing List, Kees Cook, Eryu Guan,
Alexander Potapenko, Chris Metcalf, David Laight, stable,
kasan-dev
On Thu, Jan 25, 2018 at 8:13 PM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> On 01/25/2018 08:55 PM, Linus Torvalds wrote:
>> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>>>> is going on. Because that approach definitely makes a difference in
>>>> other places.
>>>
>>> The other option was to use READ_ONCE_NOCHECK().
>>
>> How about just using the same accessor that we do for the dcache case.
>> That gives a reasonable example of the whole word-at-a-time model, and
>> should be good.
>>
>
> If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1),
> than it should be fine. We don't want completely unchecked read of a source string.
>
> But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports")
> So I was going to send something like the hunk bellow (split in several patches).
>
> Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong
> to use it to load *cs only to shut up KASAN.
>
>
> ---
> fs/dcache.c | 2 +-
> include/linux/compiler.h | 11 +++++++++++
> lib/string.c | 2 +-
> 3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5c7df1df81ff..6aa7be55a96d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
> unsigned long a,b,mask;
>
> for (;;) {
> - a = *(unsigned long *)cs;
> + a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
> b = load_unaligned_zeropad(ct);
> if (tcount < sizeof(unsigned long))
> break;
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 52e611ab9a6c..85b63c2e196e 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> * required ordering.
> */
> #include <asm/barrier.h>
> +#include <linux/kasan-checks.h>
>
> #define __READ_ONCE(x, check) \
> ({ \
> @@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> */
> #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)
>
> +#ifdef CONFIG_KASAN
> +#define READ_PARTIAL_CHECK(x) \
> +({ \
> + kasan_check_read(&(x), 1); \
> + READ_ONCE_NOCHECK(x); \
> +})
> +#else
> +#define READ_PARTIAL_CHECK(x) (x)
> +#endif
> +
> #define WRITE_ONCE(x, val) \
> ({ \
> union { typeof(x) __val; char __c[1]; } __u = \
> diff --git a/lib/string.c b/lib/string.c
> index 64a9e33f1daa..2396856e4c56 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
> while (max >= sizeof(unsigned long)) {
> unsigned long c, data;
>
> - c = *(unsigned long *)(src+res);
> + c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
> if (has_zero(c, &data, &constants)) {
> data = prep_zero_mask(c, data, &constants);
> data = create_zero_mask(data);
Looks good to me a general way to support word-at-a-time pattern.
This will also get rid of this in fs/dcache.c:
if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
kasan_unpoison_shadow(dname,
round_up(name->len + 1, sizeof(unsigned long)));
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2018-01-30 9:12 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin
2018-01-09 16:47 ` Andrey Ryabinin
2018-01-24 8:54 ` Rasmus Villemoes
2018-01-24 17:52 ` Linus Torvalds
2018-01-25 8:32 ` Dmitry Vyukov
2018-01-25 8:42 ` David Laight
2018-01-25 9:08 ` Dmitry Vyukov
2018-01-25 17:55 ` Linus Torvalds
2018-01-25 19:13 ` Andrey Ryabinin
2018-01-30 9:12 ` Dmitry Vyukov
2018-01-24 8:47 ` Rasmus Villemoes
2018-01-24 15:53 ` Andy Shevchenko
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).