* [PATCH] lib/strscpy: remove word-at-a-time optimization. @ 2018-01-09 16:37 Andrey Ryabinin 2018-01-09 16:47 ` Andrey Ryabinin ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Andrey Ryabinin @ 2018-01-09 16:37 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, Dmitry Vyukov, Andrey Ryabinin, stable

strscpy() performs word-at-a-time optimistic reads, so it may access
memory past the end of the object. That is perfectly fine, since
strscpy() doesn't use the past-the-end data and makes sure the
optimistic read won't cross a page boundary. But KASAN doesn't know
anything about that, so it will complain.

There are several possible ways to address this issue, but none are
perfect. See
https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com

It seems the best solution is to simply disable the word-at-a-time
optimization. My trivial testing shows that byte-at-a-time can be up to
4.3x slower than word-at-a-time. That may seem like a lot, but it's
actually ~1.2e-10 sec per character vs ~4.8e-10 sec per character on
modern hardware. And we don't use strscpy() in performance-critical
paths to copy large amounts of data, so it shouldn't matter anyway.
Fixes: 30035e45753b7 ("string: provide strscpy()")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org>
---
 lib/string.c | 38 --------------------------------------
 1 file changed, 38 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..6205dd71aa0f 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -29,7 +29,6 @@
 #include <linux/errno.h>

 #include <asm/byteorder.h>
-#include <asm/word-at-a-time.h>
 #include <asm/page.h>

 #ifndef __HAVE_ARCH_STRNCASECMP
@@ -177,45 +176,8 @@ EXPORT_SYMBOL(strlcpy);
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
-	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
-	size_t max = count;
 	long res = 0;

-	if (count == 0)
-		return -E2BIG;
-
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	/*
-	 * If src is unaligned, don't cross a page boundary,
-	 * since we don't know if the next page is mapped.
-	 */
-	if ((long)src & (sizeof(long) - 1)) {
-		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
-		if (limit < max)
-			max = limit;
-	}
-#else
-	/* If src or dest is unaligned, don't do word-at-a-time. */
-	if (((long) dest | (long) src) & (sizeof(long) - 1))
-		max = 0;
-#endif
-
-	while (max >= sizeof(unsigned long)) {
-		unsigned long c, data;
-
-		c = *(unsigned long *)(src+res);
-		if (has_zero(c, &data, &constants)) {
-			data = prep_zero_mask(c, data, &constants);
-			data = create_zero_mask(data);
-			*(unsigned long *)(dest+res) = c & zero_bytemask(data);
-			return res + find_zero(data);
-		}
-		*(unsigned long *)(dest+res) = c;
-		res += sizeof(unsigned long);
-		count -= sizeof(unsigned long);
-		max -= sizeof(unsigned long);
-	}
-
 	while (count) {
 		char c;

-- 
2.13.6

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin @ 2018-01-09 16:47 ` Andrey Ryabinin 2018-01-24 8:54 ` Rasmus Villemoes ` (2 subsequent siblings) 2 siblings, 1 reply; 12+ messages in thread From: Andrey Ryabinin @ 2018-01-09 16:47 UTC (permalink / raw) To: Andrew Morton, Linus Torvalds Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, Dmitry Vyukov, stable

[-- Attachment #1: Type: text/plain, Size: 2800 bytes --]

Attached is the user-space program I used to see the difference.
Usage:
	gcc -O2 -o strscpy strscpy_test.c
	./strscpy {b|w} src_str_len count

	src_str_len - length of the source string, between 1 and 4096
	count - how many strscpy() calls to execute

Also I've noticed something strange. I'm not sure why, but certain
src_len values (e.g. 30) drive the branch predictor crazy, causing
worse-than-usual results for the byte-at-a-time copy:

$ perf stat ./strscpy b 29 10000000

 Performance counter stats for './strscpy b 29 10000000':

        165.354974      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                48      page-faults:u             #    0.290 K/sec
       640,475,981      cycles:u                  #    3.873 GHz
     2,500,090,080      instructions:u            #    3.90  insn per cycle
       640,017,126      branches:u                # 3870.565 M/sec
             1,589      branch-misses:u           #    0.00% of all branches

       0.165568346 seconds time elapsed

 Performance counter stats for './strscpy b 30 10000000':

        250.835659      task-clock:u (msec)       #    0.999 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                46      page-faults:u             #    0.183 K/sec
       974,528,780      cycles:u                  #    3.885 GHz
     2,580,090,165      instructions:u            #    2.65  insn per cycle
       660,017,211      branches:u                # 2631.273 M/sec
        14,488,234      branch-misses:u           #    2.20% of all branches

       0.251147341 seconds time elapsed

 Performance counter stats for './strscpy b 31 10000000':

        176.598368      task-clock:u (msec)       #    0.997 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
                46      page-faults:u             #    0.260 K/sec
       681,367,948      cycles:u                  #    3.858 GHz
     2,660,090,092      instructions:u            #    3.90  insn per cycle
       680,017,138      branches:u                # 3850.642 M/sec
             1,817      branch-misses:u           #    0.00% of all branches

       0.177150181 seconds time elapsed

[-- Attachment #2: strscpy_test.c --]
[-- Type: text/x-csrc, Size: 3291 bytes --]

#include <stdlib.h>
#include <string.h>

#define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))
#define E2BIG		-1
#define PAGE_SIZE	4096

struct word_at_a_time {
	const unsigned long one_bits, high_bits;
};

#define WORD_AT_A_TIME_CONSTANTS { REPEAT_BYTE(0x01), REPEAT_BYTE(0x80) }

static inline long count_masked_bytes(unsigned long mask)
{
	return mask*0x0001020304050608ul >> 56;
}

static inline unsigned long has_zero(unsigned long a, unsigned long *bits,
				     const struct word_at_a_time *c)
{
	unsigned long mask = ((a - c->one_bits) & ~a) & c->high_bits;
	*bits = mask;
	return mask;
}

static inline unsigned long prep_zero_mask(unsigned long a, unsigned long bits,
					   const struct word_at_a_time *c)
{
	return bits;
}

static inline unsigned long create_zero_mask(unsigned long bits)
{
	bits = (bits - 1) & ~bits;
	return bits >> 7;
}

/* The mask we created is directly usable as a bytemask */
#define zero_bytemask(mask) (mask)

static inline unsigned long find_zero(unsigned long mask)
{
	return count_masked_bytes(mask);
}

__attribute__((noinline))
int strscpy_word(char *dest, const char *src, size_t count)
{
	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
	size_t max = count;
	long res = 0;

	if (count == 0)
		return -E2BIG;

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
	/*
	 * If src is unaligned, don't cross a page boundary,
	 * since we don't know if the next page is mapped.
	 */
	if ((long)src & (sizeof(long) - 1)) {
		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
		if (limit < max)
			max = limit;
	}
#else
	/* If src or dest is unaligned, don't do word-at-a-time. */
	if (((long) dest | (long) src) & (sizeof(long) - 1))
		max = 0;
#endif

	while (max >= sizeof(unsigned long)) {
		unsigned long c, data;

		c = *(unsigned long *)(src+res);
		if (has_zero(c, &data, &constants)) {
			data = prep_zero_mask(c, data, &constants);
			data = create_zero_mask(data);
			*(unsigned long *)(dest+res) = c & zero_bytemask(data);
			return res + find_zero(data);
		}
		*(unsigned long *)(dest+res) = c;
		res += sizeof(unsigned long);
		count -= sizeof(unsigned long);
		max -= sizeof(unsigned long);
	}

	while (count) {
		char c;

		c = src[res];
		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;
	}

	/* Hit buffer length without finding a NUL; force NUL-termination. */
	if (res)
		dest[res-1] = '\0';

	return -E2BIG;
}

__attribute__((noinline))
int strscpy_byte(char *dest, const char *src, int count)
{
	int res = 0;

	while (count) {
		char c;

		c = src[res];
		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;
	}

	/* Hit buffer length without finding a NUL; force NUL-termination. */
	if (res)
		dest[res-1] = '\0';

	return -E2BIG;
}

char dest[4096] __attribute__((aligned(4096)));
char src[4096] __attribute__((aligned(4096)));

int main(int argc, char **argv)
{
	unsigned long long i;
	unsigned long src_len;
	unsigned long count;

	if (argc < 4)
		return -1;

	src_len = atoi(argv[2]);
	count = atoi(argv[3]);
	memset(src, 1, src_len);

	if (argv[1][0] == 'w') {
		for (i = 0; i < count; i++) {
			strscpy_word(dest, src, sizeof(dest));
		}
	} else if (argv[1][0] == 'b') {
		for (i = 0; i < count; i++) {
			strscpy_byte(dest, src, sizeof(dest));
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 12+ messages in thread
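[Editorial note: the WORD_AT_A_TIME_CONSTANTS bit trick in the program above is dense, so here is a hedged standalone sketch — my own illustration, not part of the original posting. It shows how `(x - 0x01…01) & ~x & 0x80…80` flags zero bytes and how the mask/multiply steps recover the index of the first one; the helper names are mine, and it assumes a 64-bit little-endian `unsigned long`, matching the test program.]

```c
#include <assert.h>
#include <string.h>

/* Repeat a byte value across an unsigned long (64-bit assumed). */
#define REPEAT_BYTE(x)	((~0ul / 0xff) * (x))

/*
 * Nonzero iff 'x' contains a 0x00 byte: subtracting 0x01 from a zero
 * byte borrows into its high bit, and the '& ~x' term rules out bytes
 * that already had their high bit set.
 */
static unsigned long has_zero_byte(unsigned long x)
{
	return (x - REPEAT_BYTE(0x01)) & ~x & REPEAT_BYTE(0x80);
}

/*
 * Index of the first zero byte, mirroring create_zero_mask() +
 * count_masked_bytes() from the test program. The caller must first
 * check that has_zero_byte() returned nonzero.
 */
static long first_zero_index(unsigned long x)
{
	unsigned long bits = has_zero_byte(x);

	bits = (bits - 1) & ~bits;	/* ones strictly below the first flag */
	bits >>= 7;			/* -> 0xff in each byte before the zero */
	return bits * 0x0001020304050608ul >> 56;
}

/* Load 8 bytes of memory into a word without alignment traps. */
static unsigned long load_word(const char *s)
{
	unsigned long w;

	memcpy(&w, s, sizeof(w));
	return w;
}
```

Walking one case through by hand: for `"abc\0defg"` the subtraction borrows only in byte 3, so `has_zero_byte()` leaves `0x80` in byte 3, and the mask/multiply collapses that to the index 3.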
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-09 16:47 ` Andrey Ryabinin @ 2018-01-24 8:54 ` Rasmus Villemoes 2018-01-24 17:52 ` Linus Torvalds 0 siblings, 1 reply; 12+ messages in thread From: Rasmus Villemoes @ 2018-01-24 8:54 UTC (permalink / raw) To: Andrey Ryabinin, Andrew Morton, Linus Torvalds Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, Dmitry Vyukov, stable

On 2018-01-09 17:47, Andrey Ryabinin wrote:
> Attached is the user-space program I used to see the difference.
> Usage:
> 	gcc -O2 -o strscpy strscpy_test.c
> 	./strscpy {b|w} src_str_len count
>
> 	src_str_len - length of the source string, between 1 and 4096
> 	count - how many strscpy() calls to execute
>
> Also I've noticed something strange. I'm not sure why, but certain
> src_len values (e.g. 30) drive the branch predictor crazy, causing
> worse-than-usual results for the byte-at-a-time copy:

I see something similar, but at the 30->31 transition, and the
branch-misses remain at 1-3% for higher values, until 42, where it
drops back to 0%. Anyway, I highly doubt we do a lot of copies of
strings longer than 32.
$ perf stat ./strscpy_test b 30 10000000

 Performance counter stats for './strscpy_test b 30 10000000':

        156,777082      task-clock (msec)         #    0,999 CPUs utilized
                 0      context-switches          #    0,000 K/sec
                 0      cpu-migrations            #    0,000 K/sec
                48      page-faults               #    0,306 K/sec
       584.646.177      cycles                    #    3,729 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
     2.580.599.614      instructions              #    4,41  insns per cycle
       660.114.283      branches                  # 4210,528 M/sec
             4.891      branch-misses             #    0,00% of all branches

       0,156970910 seconds time elapsed

$ perf stat ./strscpy_test b 31 10000000

 Performance counter stats for './strscpy_test b 31 10000000':

        258,533250      task-clock (msec)         #    0,999 CPUs utilized
                 0      context-switches          #    0,000 K/sec
                 0      cpu-migrations            #    0,000 K/sec
                50      page-faults               #    0,193 K/sec
       965.505.138      cycles                    #    3,735 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
     2.660.773.463      instructions              #    2,76  insns per cycle
       680.141.051      branches                  # 2630,768 M/sec
        19.150.367      branch-misses             #    2,82% of all branches

       0,258725192 seconds time elapsed

Rasmus

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-24 8:54 ` Rasmus Villemoes @ 2018-01-24 17:52 ` Linus Torvalds 2018-01-25 8:32 ` Dmitry Vyukov 0 siblings, 1 reply; 12+ messages in thread From: Linus Torvalds @ 2018-01-24 17:52 UTC (permalink / raw) To: Rasmus Villemoes Cc: Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, Dmitry Vyukov, stable

On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes
<rasmus.villemoes@prevas.dk> wrote:
>
> I see something similar, but at the 30->31 transition, and the
> branch-misses remain at 1-3% for higher values, until 42, where it
> drops back to 0%. Anyway, I highly doubt we do a lot of copies of
> strings longer than 32.

So I really dislike that microbenchmark, because it uses the same
length all the time. That is very wrong, and makes the benchmark
pointless: a big part of the cost here is branch mispredicts, and you
shouldn't just hand the predictor the pattern on a plate.

Anyway, the reason I really dislike the patch is not because I think
strscpy() is all that important, but I *do* think that the
word-at-a-time thing is conceptually something we care about, and I
hate removing it just because KASAN doesn't understand it.

So I'd *much* rather have some way to tell KASAN that word-at-a-time
is going on. Because that approach definitely makes a difference in
other places.

                 Linus

^ permalink raw reply	[flat|nested] 12+ messages in thread
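[Editorial note: Linus's objection can be made concrete. The fix to the microbenchmark is to draw a fresh, unpredictable length each iteration so the copy loop's exit branch can't be trivially predicted. A rough sketch follows — my own code, not from the thread; the xorshift generator and the length range are arbitrary choices.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time copy, same shape as strscpy_byte() in the thread. */
static long copy_bytes(char *dest, const char *src, size_t count)
{
	long res = 0;

	while (count) {
		char c = src[res];

		dest[res] = c;
		if (!c)
			return res;
		res++;
		count--;
	}
	if (res)
		dest[res - 1] = '\0';
	return -1;
}

/* Cheap PRNG so the benchmarked branch can't lock onto a pattern. */
static unsigned long xorshift64(unsigned long *state)
{
	*state ^= *state << 13;
	*state ^= *state >> 7;
	*state ^= *state << 17;
	return *state;
}

/*
 * Run 'iters' copies with source lengths scattered over 1..max_len and
 * return the summed lengths (which also defeats dead-code elimination).
 * 'src' and 'dest' must each hold at least max_len + 1 bytes.
 */
static long bench_varied(char *dest, char *src, size_t max_len, long iters)
{
	unsigned long state = 88172645463325252ul;
	long sum = 0, i;

	memset(src, 'x', max_len + 1);
	for (i = 0; i < iters; i++) {
		size_t len = 1 + xorshift64(&state) % max_len;

		src[len] = '\0';	/* terminate at a fresh spot */
		sum += copy_bytes(dest, src, max_len + 1);
		src[len] = 'x';		/* restore for the next round */
	}
	return sum;
}
```

Timing this variant against a fixed-length loop would separate the predictor effect Rasmus and Andrey observed from the copy routine's real per-byte cost.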
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-24 17:52 ` Linus Torvalds @ 2018-01-25 8:32 ` Dmitry Vyukov 2018-01-25 8:42 ` David Laight 2018-01-25 17:55 ` Linus Torvalds 0 siblings, 2 replies; 12+ messages in thread From: Dmitry Vyukov @ 2018-01-25 8:32 UTC (permalink / raw) To: Linus Torvalds Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, stable, kasan-dev

On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I really dislike that microbenchmark, because it just has the same
> length all the time. Which is very wrong, and makes the benchmark
> pointless. A big part of this all is branch mispredicts, you shouldn't
> just hand it the pattern on a plate.
>
> Anyway, the reason I really dislike the patch is not because I think
> strscpy() is all that important, but I *do* think that the
> word-at-a-time thing is conceptually something we do care about, and I
> hate removing it just because of KASAN not understanding it.
>
> So I'd *much* rather have some way to tell KASAN that word-at-a-time
> is going on. Because that approach definitely makes a difference in
> other places.

The other option was to use READ_ONCE_NOCHECK(). I'm not sure whether
the "read once" part will affect codegen here, though.

But if the word-at-a-time thing is conceptually something we do care
about, we could also introduce something like READ_PARTIALLY_VALID(),
which would check that at least the first byte of the read is valid
and that the read does not cross a heap block boundary (outside of
KASAN it would be a normal read).
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-25 8:32 ` Dmitry Vyukov @ 2018-01-25 8:42 ` David Laight 2018-01-25 9:08 ` Dmitry Vyukov 2018-01-25 17:55 ` Linus Torvalds 1 sibling, 1 reply; 12+ messages in thread From: David Laight @ 2018-01-25 8:42 UTC (permalink / raw) To: 'Dmitry Vyukov', Linus Torvalds Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, stable, kasan-dev From: Dmitry Vyukov [mailto:dvyukov@google.com] > Sent: 25 January 2018 08:33 > > On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes > > <rasmus.villemoes@prevas.dk> wrote: > >> > >> I see something similar, but at the 30->31 transition, and the > >> branch-misses remain at 1-3% for higher values, until 42 where it drops > >> back to 0%. Anyway, I highly doubt we do a lot of string copies of > >> strings longer then 32. > > > > So I really dislike that microbenchmark, because it just has the same > > length all the time. Which is very wrong, and makes the benchmark > > pointless. A big part of this all is branch mispredicts, you shouldn't > > just hand it the pattern on a plate. > > > > Anyway, the reason I really dislike the patch is not because I think > > strscpy() is all that important, but I *do* think that the > > word-at-a-time thing is conceptually something we do care about, and I > > hate removing it just because of KASAN not understanding it. > > > > So I'd *much* rather have some way to tell KASAN that word-at-a-time > > is going on. Because that approach definitely makes a difference in > > other places. > > > The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read > once" part will affect codegen here, though. 
> But if word-at-a-time thing is conceptually something we do care
> about, we could also introduce something like READ_PARTIALLY_VALID(),
> which would check that at least first byte of the read is valid and
> that it does not cross heap block boundary (but outside of KASAN is a
> normal read).

The first byte might not have been written either. For example, doing
a strlen() on a misaligned string, you might read the aligned word
containing the first byte and adjust the value so that the initial
byte(s) are not zero. After scanning for a zero byte, the length would
be corrected.

	David

^ permalink raw reply	[flat|nested] 12+ messages in thread
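[Editorial note: David's scenario — reading the aligned word that *contains* a misaligned string's first byte, then forcing the pre-string bytes nonzero so stale data there can never look like a terminator — can be sketched as below. This is my own little-endian illustration, not code from the thread, and like all such routines it assumes the caller guarantees readable padding past the string (which is exactly the property the thread is arguing about).]

```c
#include <assert.h>
#include <string.h>

/*
 * Index of the first 0x00 byte in 'w' (little-endian memory order), or
 * sizeof(w) if there is none. A plain loop keeps the sketch obvious;
 * real code would use the has_zero()/find_zero() bit tricks.
 */
static size_t first_zero_byte(unsigned long w)
{
	size_t i;

	for (i = 0; i < sizeof(w); i++)
		if (!((w >> (8 * i)) & 0xff))
			return i;
	return sizeof(w);
}

/*
 * strlen() reading one aligned word at a time. The first load is the
 * aligned word containing 's'; the bytes before 's' are forced nonzero
 * ("poisoned") so leftover garbage there can never match the NUL test.
 */
static size_t strlen_word(const char *s)
{
	unsigned long align = (unsigned long)s & (sizeof(long) - 1);
	const char *p = s - align;
	unsigned long w;
	size_t z;

	memcpy(&w, p, sizeof(w));		/* stand-in for the aligned load */
	if (align)
		w |= (1ul << (8 * align)) - 1;	/* pre-string bytes -> nonzero */

	for (;;) {
		z = first_zero_byte(w);
		if (z < sizeof(w))
			return (size_t)(p + z - s);
		p += sizeof(w);
		memcpy(&w, p, sizeof(w));
	}
}
```

Because the poisoned bytes sit below the string start, the first zero found is always at or after `s`, and subtracting the initial misalignment (via `p + z - s`) yields the correct length — the "adjust, scan, correct" sequence David describes.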
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-25 8:42 ` David Laight @ 2018-01-25 9:08 ` Dmitry Vyukov 0 siblings, 0 replies; 12+ messages in thread From: Dmitry Vyukov @ 2018-01-25 9:08 UTC (permalink / raw) To: David Laight Cc: Linus Torvalds, Rasmus Villemoes, Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, stable, kasan-dev On Thu, Jan 25, 2018 at 9:42 AM, David Laight <David.Laight@aculab.com> wrote: > From: Dmitry Vyukov [mailto:dvyukov@google.com] >> Sent: 25 January 2018 08:33 >> >> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >> > On Wed, Jan 24, 2018 at 12:54 AM, Rasmus Villemoes >> > <rasmus.villemoes@prevas.dk> wrote: >> >> >> >> I see something similar, but at the 30->31 transition, and the >> >> branch-misses remain at 1-3% for higher values, until 42 where it drops >> >> back to 0%. Anyway, I highly doubt we do a lot of string copies of >> >> strings longer then 32. >> > >> > So I really dislike that microbenchmark, because it just has the same >> > length all the time. Which is very wrong, and makes the benchmark >> > pointless. A big part of this all is branch mispredicts, you shouldn't >> > just hand it the pattern on a plate. >> > >> > Anyway, the reason I really dislike the patch is not because I think >> > strscpy() is all that important, but I *do* think that the >> > word-at-a-time thing is conceptually something we do care about, and I >> > hate removing it just because of KASAN not understanding it. >> > >> > So I'd *much* rather have some way to tell KASAN that word-at-a-time >> > is going on. Because that approach definitely makes a difference in >> > other places. >> >> >> The other option was to use READ_ONCE_NOCHECK(). Not sure if the "read >> once" part will affect codegen here, though. 
>> But if word-at-a-time thing is conceptually something we do care >> about, we could also introduce something like READ_PARTIALLY_VALID(), >> which would check that at least first byte of the read is valid and >> that it does not cross heap block boundary (but outside of KASAN is a >> normal read). > > The first byte might not have been written either. > For example, doing a strlen() on a misaligned string you might read > the aligned word containing the first byte and adjust the value so > that the initial byte(s) are not zero. > After scanning for a zero byte the length would be corrected. Was the first byte at least kmalloc-ed? That's what KASAN checks, it does not care about "written". KMSAN can detect uses of uninit data, but that's another story. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-25 8:32 ` Dmitry Vyukov 2018-01-25 8:42 ` David Laight @ 2018-01-25 17:55 ` Linus Torvalds 2018-01-25 19:13 ` Andrey Ryabinin 1 sibling, 1 reply; 12+ messages in thread From: Linus Torvalds @ 2018-01-25 17:55 UTC (permalink / raw) To: Dmitry Vyukov Cc: Rasmus Villemoes, Andrey Ryabinin, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, stable, kasan-dev

[-- Attachment #1: Type: text/plain, Size: 804 bytes --]

On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So I'd *much* rather have some way to tell KASAN that word-at-a-time
>> is going on. Because that approach definitely makes a difference in
>> other places.
>
> The other option was to use READ_ONCE_NOCHECK().

How about just using the same accessor that we do for the dcache case.
That gives a reasonable example of the whole word-at-a-time model, and
should be good.

COMPLETELY UNTESTED patch attached. This needs to be verified.

It does limit the word-at-a-time code to the architectures that select
both HAVE_EFFICIENT_UNALIGNED_ACCESS and DCACHE_WORD_ACCESS, but that
seems a reasonable choice anyway.

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1651 bytes --]

 lib/string.c | 28 +++++++---------------------
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..25a3c200307e 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -177,33 +177,18 @@ EXPORT_SYMBOL(strlcpy);
  */
 ssize_t strscpy(char *dest, const char *src, size_t count)
 {
-	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
-	size_t max = count;
 	long res = 0;

 	if (count == 0)
 		return -E2BIG;

-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-	/*
-	 * If src is unaligned, don't cross a page boundary,
-	 * since we don't know if the next page is mapped.
-	 */
-	if ((long)src & (sizeof(long) - 1)) {
-		size_t limit = PAGE_SIZE - ((long)src & (PAGE_SIZE - 1));
-		if (limit < max)
-			max = limit;
-	}
-#else
-	/* If src or dest is unaligned, don't do word-at-a-time. */
-	if (((long) dest | (long) src) & (sizeof(long) - 1))
-		max = 0;
-#endif
-
-	while (max >= sizeof(unsigned long)) {
+#if CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS && CONFIG_DCACHE_WORD_ACCESS
+{
+	const struct word_at_a_time constants = WORD_AT_A_TIME_CONSTANTS;
+	while (count >= sizeof(unsigned long)) {
 		unsigned long c, data;

-		c = *(unsigned long *)(src+res);
+		c = load_unaligned_zeropad(src+res);
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
@@ -213,8 +198,9 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
 		*(unsigned long *)(dest+res) = c;
 		res += sizeof(unsigned long);
 		count -= sizeof(unsigned long);
-		max -= sizeof(unsigned long);
 	}
+}
+#endif

 	while (count) {
 		char c;

^ permalink raw reply related	[flat|nested] 12+ messages in thread
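[Editorial note: for readers who haven't met load_unaligned_zeropad(): on the architectures this patch gates on, it performs a plain word load and relies on the page-fault fixup path to zero-fill any bytes that sit in an unmapped page. The real helper is arch-specific exception-table code; the following is only a behavioural model of my own, where the `readable` parameter stands in for the fault boundary that the hardware would discover for itself.]

```c
#include <assert.h>
#include <string.h>

/*
 * Model of load_unaligned_zeropad() semantics (little-endian): copy
 * the bytes that are accessible, and let the inaccessible tail of the
 * word read back as zero. The kernel helper takes no 'readable'
 * argument — the page fault supplies that information at run time.
 */
static unsigned long zeropad_model(const char *p, size_t readable)
{
	unsigned long w = 0;

	memcpy(&w, p, readable < sizeof(w) ? readable : sizeof(w));
	return w;	/* missing high bytes come back as 0x00 */
}
```

The zero padding is what makes the over-read safe for string scanning: any byte past the fault boundary looks like a NUL, so the word-at-a-time loop stops there instead of faulting — which is exactly why the patch can drop the explicit page-boundary clamp.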
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-25 17:55 ` Linus Torvalds @ 2018-01-25 19:13 ` Andrey Ryabinin 2018-01-30 9:12 ` Dmitry Vyukov 0 siblings, 1 reply; 12+ messages in thread From: Andrey Ryabinin @ 2018-01-25 19:13 UTC (permalink / raw) To: Linus Torvalds, Dmitry Vyukov Cc: Rasmus Villemoes, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, stable, kasan-dev

On 01/25/2018 08:55 PM, Linus Torvalds wrote:
> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
>>
>> The other option was to use READ_ONCE_NOCHECK().
>
> How about just using the same accessor that we do for the dcache case.
> That gives a reasonable example of the whole word-at-a-time model, and
> should be good.

If we also instrument load_unaligned_zeropad() with
kasan_check_read(addr, 1), then it should be fine. We don't want a
completely unchecked read of the source string.

But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually
unpoison dname after allocation to shut up kasan's reports"), so I was
going to send something like the hunk below (split into several
patches).

Or we could just use the instrumented load_unaligned_zeropad()
everywhere, but it seems wrong to use it to load *cs only to shut up
KASAN.

---
 fs/dcache.c              |  2 +-
 include/linux/compiler.h | 11 +++++++++++
 lib/string.c             |  2 +-
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7df1df81ff..6aa7be55a96d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char
 	unsigned long a,b,mask;

 	for (;;) {
-		a = *(unsigned long *)cs;
+		a = READ_PARTIAL_CHECK(*(unsigned long *)cs);
 		b = load_unaligned_zeropad(ct);
 		if (tcount < sizeof(unsigned long))
 			break;
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 52e611ab9a6c..85b63c2e196e 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  * required ordering.
  */
 #include <asm/barrier.h>
+#include <linux/kasan-checks.h>

 #define __READ_ONCE(x, check)						\
 ({									\
@@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
  */
 #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0)

+#ifdef CONFIG_KASAN
+#define READ_PARTIAL_CHECK(x)						\
+({									\
+	kasan_check_read(&(x), 1);					\
+	READ_ONCE_NOCHECK(x);						\
+})
+#else
+#define READ_PARTIAL_CHECK(x) (x)
+#endif
+
 #define WRITE_ONCE(x, val)						\
 ({									\
 	union { typeof(x) __val; char __c[1]; } __u =			\
diff --git a/lib/string.c b/lib/string.c
index 64a9e33f1daa..2396856e4c56 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count)
 	while (max >= sizeof(unsigned long)) {
 		unsigned long c, data;

-		c = *(unsigned long *)(src+res);
+		c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res));
 		if (has_zero(c, &data, &constants)) {
 			data = prep_zero_mask(c, data, &constants);
 			data = create_zero_mask(data);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-25 19:13 ` Andrey Ryabinin @ 2018-01-30 9:12 ` Dmitry Vyukov 0 siblings, 0 replies; 12+ messages in thread From: Dmitry Vyukov @ 2018-01-30 9:12 UTC (permalink / raw) To: Andrey Ryabinin Cc: Linus Torvalds, Rasmus Villemoes, Andrew Morton, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, stable, kasan-dev On Thu, Jan 25, 2018 at 8:13 PM, Andrey Ryabinin <aryabinin@virtuozzo.com> wrote: > On 01/25/2018 08:55 PM, Linus Torvalds wrote: >> On Thu, Jan 25, 2018 at 12:32 AM, Dmitry Vyukov <dvyukov@google.com> wrote: >>> On Wed, Jan 24, 2018 at 6:52 PM, Linus Torvalds >>> <torvalds@linux-foundation.org> wrote: >>>> >>>> So I'd *much* rather have some way to tell KASAN that word-at-a-time >>>> is going on. Because that approach definitely makes a difference in >>>> other places. >>> >>> The other option was to use READ_ONCE_NOCHECK(). >> >> How about just using the same accessor that we do for the dcache case. >> That gives a reasonable example of the whole word-at-a-time model, and >> should be good. >> > > If we also instrument load_unaligned_zeropad() with kasan_check_read(addr, 1), > than it should be fine. We don't want completely unchecked read of a source string. > > But I also would like to revert df4c0e36f1b1 ("fs: dcache: manually unpoison dname after allocation to shut up kasan's reports") > So I was going to send something like the hunk bellow (split in several patches). > > Or we could just use instrumented load_unalingned_zeropad() everywhere, but it seems wrong > to use it to load *cs only to shut up KASAN. 
> > > --- > fs/dcache.c | 2 +- > include/linux/compiler.h | 11 +++++++++++ > lib/string.c | 2 +- > 3 files changed, 13 insertions(+), 2 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 5c7df1df81ff..6aa7be55a96d 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -195,7 +195,7 @@ static inline int dentry_string_cmp(const unsigned char *cs, const unsigned char > unsigned long a,b,mask; > > for (;;) { > - a = *(unsigned long *)cs; > + a = READ_PARTIAL_CHECK(*(unsigned long *)cs); > b = load_unaligned_zeropad(ct); > if (tcount < sizeof(unsigned long)) > break; > diff --git a/include/linux/compiler.h b/include/linux/compiler.h > index 52e611ab9a6c..85b63c2e196e 100644 > --- a/include/linux/compiler.h > +++ b/include/linux/compiler.h > @@ -240,6 +240,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s > * required ordering. > */ > #include <asm/barrier.h> > +#include <linux/kasan-checks.h> > > #define __READ_ONCE(x, check) \ > ({ \ > @@ -259,6 +260,16 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s > */ > #define READ_ONCE_NOCHECK(x) __READ_ONCE(x, 0) > > +#ifdef CONFIG_KASAN > +#define READ_PARTIAL_CHECK(x) \ > +({ \ > + kasan_check_read(&(x), 1); \ > + READ_ONCE_NOCHECK(x); \ > +}) > +#else > +#define READ_PARTIAL_CHECK(x) (x) > +#endif > + > #define WRITE_ONCE(x, val) \ > ({ \ > union { typeof(x) __val; char __c[1]; } __u = \ > diff --git a/lib/string.c b/lib/string.c > index 64a9e33f1daa..2396856e4c56 100644 > --- a/lib/string.c > +++ b/lib/string.c > @@ -203,7 +203,7 @@ ssize_t strscpy(char *dest, const char *src, size_t count) > while (max >= sizeof(unsigned long)) { > unsigned long c, data; > > - c = *(unsigned long *)(src+res); > + c = READ_PARTIAL_CHECK(*(unsigned long *)(src+res)); > if (has_zero(c, &data, &constants)) { > data = prep_zero_mask(c, data, &constants); > data = create_zero_mask(data); Looks good to me a general way to support word-at-a-time pattern. 
This will also get rid of this in fs/dcache.c:

	if (IS_ENABLED(CONFIG_DCACHE_WORD_ACCESS))
		kasan_unpoison_shadow(dname,
			round_up(name->len + 1, sizeof(unsigned long)));

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin 2018-01-09 16:47 ` Andrey Ryabinin @ 2018-01-24 8:47 ` Rasmus Villemoes 2018-01-24 15:53 ` Andy Shevchenko 2 siblings, 0 replies; 12+ messages in thread From: Rasmus Villemoes @ 2018-01-24 8:47 UTC (permalink / raw) To: Andrey Ryabinin, Andrew Morton, Linus Torvalds Cc: linux-kernel, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, Dmitry Vyukov, stable

On 2018-01-09 17:37, Andrey Ryabinin wrote:
> strscpy() performs word-at-a-time optimistic reads, so it may access
> memory past the end of the object. That is perfectly fine, since
> strscpy() doesn't use the past-the-end data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that, so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
>
> It seems the best solution is to simply disable the word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time can be up
> to 4.3x slower than word-at-a-time. That may seem like a lot, but it's
> actually ~1.2e-10 sec per character vs ~4.8e-10 sec per character on
> modern hardware. And we don't use strscpy() in performance-critical
> paths to copy large amounts of data, so it shouldn't matter anyway.
>
> Fixes: 30035e45753b7 ("string: provide strscpy()")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> Cc: <stable@vger.kernel.org>

Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>

Your microbenchmark even favours word-at-a-time slightly: in practice
I think at least one of src or dst will be unaligned a lot of the
time, and while x86 may HAVE_EFFICIENT_UNALIGNED_ACCESS, unaligned
access is still a little more expensive than aligned access.

And since strscpy is not called that often, I expect some of the ~300
bytes of instruction cache it occupies can be put to better use
elsewhere.

Rasmus

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] lib/strscpy: remove word-at-a-time optimization. 2018-01-09 16:37 [PATCH] lib/strscpy: remove word-at-a-time optimization Andrey Ryabinin 2018-01-09 16:47 ` Andrey Ryabinin 2018-01-24 8:47 ` Rasmus Villemoes @ 2018-01-24 15:53 ` Andy Shevchenko 2 siblings, 0 replies; 12+ messages in thread From: Andy Shevchenko @ 2018-01-24 15:53 UTC (permalink / raw) To: Andrey Ryabinin Cc: Andrew Morton, Linus Torvalds, Linux Kernel Mailing List, Kees Cook, Eryu Guan, Alexander Potapenko, Chris Metcalf, David Laight, Dmitry Vyukov, stable

On Tue, Jan 9, 2018 at 6:37 PM, Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
> strscpy() performs word-at-a-time optimistic reads, so it may access
> memory past the end of the object. That is perfectly fine, since
> strscpy() doesn't use the past-the-end data and makes sure the
> optimistic read won't cross a page boundary.
>
> But KASAN doesn't know anything about that, so it will complain.
> There are several possible ways to address this issue, but none
> are perfect. See https://lkml.kernel.org/r/9f0a9cf6-51f7-cd1f-5dc6-6d510a7b8ec4@virtuozzo.com
>
> It seems the best solution is to simply disable the word-at-a-time
> optimization. My trivial testing shows that byte-at-a-time can be up
> to 4.3x slower than word-at-a-time. That may seem like a lot, but it's
> actually ~1.2e-10 sec per character vs ~4.8e-10 sec per character on
> modern hardware. And we don't use strscpy() in performance-critical
> paths to copy large amounts of data, so it shouldn't matter anyway.

What prevents you from just reverting the original patch? After this
one, all the advantages of the function become useless AFAIU.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 12+ messages in thread