* [PATCH] vfs: Optimize dedupe comparison
@ 2021-07-15 14:13 Nikolay Borisov
  2021-07-15 14:30 ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2021-07-15 14:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: viro, david, djwong, Nikolay Borisov

Currently vfs_dedupe_file_range_compare() uses a plain memcmp(), which
effectively means the code is doing a byte-by-byte comparison. Instead,
the code could do word-sized comparisons without adverse effect on
performance, provided that the comparison's length is at least as big
as the native word size and the resulting memory addresses are
properly aligned.

On a workload consisting of running duperemove (a userspace program
doing deduplication of duplicated extents) on a fully-duplicated
dataset of 80G spread among 20k 4M files, I get the following
results:

		Unpatched:		Patched:
real		21m45.275s		21m14.445s
user		0m0.986s		0m0.933s
sys		1m30.734s		1m8.900s (-25%)

Notable changes in the perf profiles:
 .... omitted for brevity ....
     0.29%     +1.01%  [kernel.vmlinux]         [k] vfs_dedupe_file_range_compare.constprop.0
    23.62%             [kernel.vmlinux]         [k] memcmp
 .... omitted for brevity ....

The memcmp() is eliminated altogether, replaced by the newly
introduced loop in vfs_dedupe_file_range_compare(), hence the 1%
increase of cycles spent there.
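(Editor's illustration, not part of the patch; the function name is
invented. The strategy above reads in plain userspace C roughly as:)

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the strategy described above: when both addresses are
 * word-aligned and the length is at least one word, compare native
 * words; fall back to memcmp() for unaligned buffers and for the
 * sub-word tail. */
static int ranges_equal(const void *a, const void *b, size_t len)
{
	const size_t ws = sizeof(unsigned long);
	const unsigned long *wa = a, *wb = b;
	size_t i, words;

	if ((((uintptr_t)a | (uintptr_t)b) & (ws - 1)) || len < ws)
		return memcmp(a, b, len) == 0;

	words = len / ws;
	for (i = 0; i < words; i++)
		if (wa[i] != wb[i])
			return 0;

	/* compare the tail shorter than one word, if any */
	return memcmp((const char *)a + words * ws,
		      (const char *)b + words * ws, len & (ws - 1)) == 0;
}
```

(The fallback keeps the function correct for any input; the word loop
is only a fast path.)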

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/remap_range.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..041e03b082ed 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -212,6 +212,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 	loff_t cmp_len;
 	bool same;
 	int error;
+	const uint8_t block_size = sizeof(unsigned long);

 	error = -EINVAL;
 	same = true;
@@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
 		flush_dcache_page(src_page);
 		flush_dcache_page(dest_page);

-		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
-			same = false;

+		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
+		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
+		    cmp_len < block_size) {
+			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
+				   cmp_len))
+				same = false;
+		} else {
+			size_t i;
+			size_t blocks = cmp_len / block_size;
+			loff_t rem_len = cmp_len - (blocks * block_size);
+			unsigned long *src = src_addr + src_poff;
+			unsigned long *dst = dest_addr + dest_poff;
+
+			for (i = 0; i < blocks; i++) {
+				if (src[i] != dst[i]) {
+					same = false;
+					goto finished;
+				}
+			}
+
+			if (rem_len) {
+				/* tail compare; keep src_addr/dest_addr
+				 * intact for kunmap_atomic */
+				if (memcmp(src + blocks, dst + blocks, rem_len))
+					same = false;
+			}
+		}
+finished:
 		kunmap_atomic(dest_addr);
 		kunmap_atomic(src_addr);
 unlock:
--
2.25.1



* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-15 14:13 [PATCH] vfs: Optimize dedupe comparison Nikolay Borisov
@ 2021-07-15 14:30 ` Matthew Wilcox
  2021-07-15 14:44   ` Nikolay Borisov
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2021-07-15 14:30 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-fsdevel, viro, david, djwong

On Thu, Jul 15, 2021 at 05:13:09PM +0300, Nikolay Borisov wrote:
> Currently vfs_dedupe_file_range_compare() uses a plain memcmp(), which
> effectively means the code is doing a byte-by-byte comparison. Instead,
> the code could do word-sized comparisons without adverse effect on
> performance, provided that the comparison's length is at least as big
> as the native word size and the resulting memory addresses are
> properly aligned.

Sounds to me like somebody hasn't optimised memcmp() very well ...
is this x86-64?

> @@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  		flush_dcache_page(src_page);
>  		flush_dcache_page(dest_page);
> 
> -		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
> -			same = false;
> 
> +		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
> +		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
> +		    cmp_len < block_size) {

Can this even happen?  Surely we can only dedup on a block boundary and
blocks are required to be a power of two and at least 512 bytes in size?

> +			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
> +				   cmp_len))
> +				same = false;
> +		} else {
> +			size_t i;
> +			size_t blocks = cmp_len / block_size;
> +			loff_t rem_len = cmp_len - (blocks * block_size);
> +			unsigned long *src = src_addr + src_poff;
> +			unsigned long *dst = dest_addr + dest_poff;
> +
> +			for (i = 0; i < blocks; i++) {
> +				if (src[i] != dst[i]) {
> +					same = false;
> +					goto finished;
> +				}
> +			}
> +
> +			if (rem_len) {
> +				/* tail compare; keep src_addr/dest_addr
> +				 * intact for kunmap_atomic */
> +				if (memcmp(src + blocks, dst + blocks, rem_len))
> +					same = false;
> +			}
> +		}
> +finished:
>  		kunmap_atomic(dest_addr);
>  		kunmap_atomic(src_addr);
>  unlock:
> --
> 2.25.1
> 


* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-15 14:30 ` Matthew Wilcox
@ 2021-07-15 14:44   ` Nikolay Borisov
  2021-07-15 15:09     ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2021-07-15 14:44 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, viro, david, djwong



On 15.07.21 17:30, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 05:13:09PM +0300, Nikolay Borisov wrote:
>> Currently vfs_dedupe_file_range_compare() uses a plain memcmp(), which
>> effectively means the code is doing a byte-by-byte comparison. Instead,
>> the code could do word-sized comparisons without adverse effect on
>> performance, provided that the comparison's length is at least as big
>> as the native word size and the resulting memory addresses are
>> properly aligned.
> 
> Sounds to me like somebody hasn't optimised memcmp() very well ...
> is this x86-64?
> 

That was my first impression, here's the profile:

       │    Disassembly of section .text:
       │
       │    ffffffff815c6f60 <memcmp>:
       │    memcmp():
       │      test   %rdx,%rdx
       │    ↓ je     22
       │      xor    %ecx,%ecx
       │    ↓ jmp    12
 49.32 │ 9:   add    $0x1,%rcx
  0.03 │      cmp    %rcx,%rdx
 11.82 │    ↓ je     21
  0.01 │12:   movzbl (%rdi,%rcx,1),%eax
 38.19 │      movzbl (%rsi,%rcx,1),%r8d
  0.59 │      sub    %r8d,%eax
  0.04 │    ↑ je     9
       │    ← retq
       │21: ← retq
       │22:   xor    %eax,%eax
       │    ← retq


It's indeed on x86-64, and according to arch/x86/boot/string.h the
sources are using __builtin_memcmp.

>> @@ -256,9 +257,35 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>>  		flush_dcache_page(src_page);
>>  		flush_dcache_page(dest_page);
>>
>> -		if (memcmp(src_addr + src_poff, dest_addr + dest_poff, cmp_len))
>> -			same = false;
>>
>> +		if (!IS_ALIGNED((unsigned long)(src_addr + src_poff), block_size) ||
>> +		    !IS_ALIGNED((unsigned long)(dest_addr + dest_poff), block_size) ||
>> +		    cmp_len < block_size) {
> 
> Can this even happen?  Surely we can only dedup on a block boundary and
> blocks are required to be a power of two and at least 512 bytes in size?

I was wondering the same thing, but AFAICS it seems to be possible,
i.e. if userspace passes bad offsets. While all kinds of internal fs
synchronization ops are going to be performed on aligned offsets, that
doesn't mean the original ones, passed from userspace, are themselves
aligned.
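(Editorial aside: the gate in the patch is just the kernel's
IS_ALIGNED() test; a userspace rendering for illustration — the macro
is adapted here, the kernel version uses typeof, and the helper name
is invented.)

```c
#include <stdint.h>
#include <stddef.h>

/* Userspace rendering of the kernel's IS_ALIGNED() macro;
 * 'a' must be a power of two. */
#define IS_ALIGNED(x, a) (((x) & ((uintptr_t)(a) - 1)) == 0)

/* The gate from the patch, reduced to its essentials: take the word
 * path only if both mapped addresses are word-aligned and there is at
 * least one whole word to compare. */
static int can_take_word_path(uintptr_t src, uintptr_t dst, size_t len)
{
	const size_t ws = sizeof(unsigned long);

	return IS_ALIGNED(src, ws) && IS_ALIGNED(dst, ws) && len >= ws;
}
```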
> 
>> +			if (memcmp(src_addr + src_poff, dest_addr + dest_poff,
>> +				   cmp_len))
>> +				same = false;
>> +		} else {
>> +			int i;
>> +			size_t blocks = cmp_len / block_size;
>> +			loff_t rem_len = cmp_len - (blocks * block_size);
>> +			unsigned long *src = src_addr + src_poff;
>> +			unsigned long *dst = dest_addr + src_poff;
>> +
>> +			for (i = 0; i < blocks; i++) {
>> +				if (src[i] - dst[i]) {
>> +					same = false;
>> +					goto finished;
>> +				}
>> +			}
>> +
>> +			if (rem_len) {
>> +				src_addr += src_poff + (blocks * block_size);
>> +				dest_addr += dest_poff + (blocks * block_size);
>> +				if (memcmp(src_addr, dest_addr, rem_len))
>> +					same = false;
>> +			}
>> +		}
>> +finished:
>>  		kunmap_atomic(dest_addr);
>>  		kunmap_atomic(src_addr);
>>  unlock:
>> --
>> 2.25.1
>>
> 


* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-15 14:44   ` Nikolay Borisov
@ 2021-07-15 15:09     ` Matthew Wilcox
  2021-07-15 22:33       ` Dave Chinner
  2021-07-16 12:10       ` Nikolay Borisov
  0 siblings, 2 replies; 8+ messages in thread
From: Matthew Wilcox @ 2021-07-15 15:09 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-fsdevel, viro, david, djwong

On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> That was my first impression, here's the profile:
> 
>        │    Disassembly of section .text:
>        │
>        │    ffffffff815c6f60 <memcmp>:
>        │    memcmp():
>        │      test   %rdx,%rdx
>        │    ↓ je     22
>        │      xor    %ecx,%ecx
>        │    ↓ jmp    12
>  49.32 │ 9:   add    $0x1,%rcx
>   0.03 │      cmp    %rcx,%rdx
>  11.82 │    ↓ je     21
>   0.01 │12:   movzbl (%rdi,%rcx,1),%eax
>  38.19 │      movzbl (%rsi,%rcx,1),%r8d
>   0.59 │      sub    %r8d,%eax
>   0.04 │    ↑ je     9

That looks like a byte loop to me ...

> It's indeed on x86-64, and according to arch/x86/boot/string.h the
> sources are using __builtin_memcmp.

I think the 'boot' part of that path might indicate that it's not what's
actually being used by the kernel.

$ git grep __HAVE_ARCH_MEMCMP
arch/arc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
arch/arm64/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
arch/csky/abiv2/inc/abi/string.h:#define __HAVE_ARCH_MEMCMP
arch/powerpc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
arch/s390/include/asm/string.h:#define __HAVE_ARCH_MEMCMP       /* arch function */
arch/s390/lib/string.c:#ifdef __HAVE_ARCH_MEMCMP
arch/s390/purgatory/string.c:#define __HAVE_ARCH_MEMCMP /* arch function */
arch/sparc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
include/linux/string.h:#ifndef __HAVE_ARCH_MEMCMP
lib/string.c:#ifndef __HAVE_ARCH_MEMCMP

So I think x86-64 is using the stupid one.
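(Editorial aside: the "stupid one" — the generic fallback in
lib/string.c — is essentially the following byte loop, paraphrased
here rather than copied verbatim; it matches the movzbl loop in the
profile above.)

```c
#include <stddef.h>

/* Paraphrase of the generic memcmp() fallback from lib/string.c: a
 * byte-at-a-time loop with no word-sized fast path. */
static int generic_memcmp(const void *cs, const void *ct, size_t count)
{
	const unsigned char *su1 = cs, *su2 = ct;
	int res = 0;

	for (; count > 0; su1++, su2++, count--) {
		res = *su1 - *su2;
		if (res != 0)
			break;
	}
	return res;
}
```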

> > Can this even happen?  Surely we can only dedup on a block boundary and
> > blocks are required to be a power of two and at least 512 bytes in size?
> 
> I was wondering the same thing, but AFAICS it seems to be possible,
> i.e. if userspace passes bad offsets. While all kinds of internal fs
> synchronization ops are going to be performed on aligned offsets, that
> doesn't mean the original ones, passed from userspace, are themselves
> aligned.

Ah, I thought it'd be failed before we got to this point.

But honestly, I think x86-64 needs to be fixed to either use
__builtin_memcmp() or to have a nicely written custom memcmp().  I
tried to find the gcc implementation of __builtin_memcmp() on
x86-64, but I can't.


* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-15 15:09     ` Matthew Wilcox
@ 2021-07-15 22:33       ` Dave Chinner
  2021-07-20 14:58         ` Nikolay Borisov
  2021-07-16 12:10       ` Nikolay Borisov
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2021-07-15 22:33 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Nikolay Borisov, linux-fsdevel, viro, djwong

On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> > I was wondering the same thing, but AFAICS it seems to be possible,
> > i.e. if userspace passes bad offsets. While all kinds of internal fs
> > synchronization ops are going to be performed on aligned offsets, that
> > doesn't mean the original ones, passed from userspace, are themselves
> > aligned.
> 
> Ah, I thought it'd be failed before we got to this point.
> 
> But honestly, I think x86-64 needs to be fixed to either use
> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> tried to find the gcc implementation of __builtin_memcmp() on
> x86-64, but I can't.

Yup, this. memcmp() is widely used in hot paths through all the
filesystem code and the rest of the kernel. We should fix the
generic infrastructure problem, not play whack-a-mole with custom
one-off fixes that avoid the problem just where it shows up in some
profile...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-15 15:09     ` Matthew Wilcox
  2021-07-15 22:33       ` Dave Chinner
@ 2021-07-16 12:10       ` Nikolay Borisov
  1 sibling, 0 replies; 8+ messages in thread
From: Nikolay Borisov @ 2021-07-16 12:10 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, viro, david, djwong



On 15.07.21 18:09, Matthew Wilcox wrote:
> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
>> That was my first impression, here's the profile:
>>
>>        │    Disassembly of section .text:
>>        │
>>        │    ffffffff815c6f60 <memcmp>:
>>        │    memcmp():
>>        │      test   %rdx,%rdx
>>        │    ↓ je     22
>>        │      xor    %ecx,%ecx
>>        │    ↓ jmp    12
>>  49.32 │ 9:   add    $0x1,%rcx
>>   0.03 │      cmp    %rcx,%rdx
>>  11.82 │    ↓ je     21
>>   0.01 │12:   movzbl (%rdi,%rcx,1),%eax
>>  38.19 │      movzbl (%rsi,%rcx,1),%r8d
>>   0.59 │      sub    %r8d,%eax
>>   0.04 │    ↑ je     9
> 
> That looks like a byte loop to me ...
> 
>> It's indeed on x86-64, and according to arch/x86/boot/string.h the
>> sources are using __builtin_memcmp.
> 
> I think the 'boot' part of that path might indicate that it's not what's
> actually being used by the kernel.
> 
> $ git grep __HAVE_ARCH_MEMCMP
> arch/arc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> arch/arm64/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> arch/csky/abiv2/inc/abi/string.h:#define __HAVE_ARCH_MEMCMP
> arch/powerpc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> arch/s390/include/asm/string.h:#define __HAVE_ARCH_MEMCMP       /* arch function */
> arch/s390/lib/string.c:#ifdef __HAVE_ARCH_MEMCMP
> arch/s390/purgatory/string.c:#define __HAVE_ARCH_MEMCMP /* arch function */
> arch/sparc/include/asm/string.h:#define __HAVE_ARCH_MEMCMP
> include/linux/string.h:#ifndef __HAVE_ARCH_MEMCMP
> lib/string.c:#ifndef __HAVE_ARCH_MEMCMP
> 
> So I think x86-64 is using the stupid one.
> 
>>> Can this even happen?  Surely we can only dedup on a block boundary and
>>> blocks are required to be a power of two and at least 512 bytes in size?
>>
>> I was wondering the same thing, but AFAICS it seems to be possible,
>> i.e. if userspace passes bad offsets. While all kinds of internal fs
>> synchronization ops are going to be performed on aligned offsets, that
>> doesn't mean the original ones, passed from userspace, are themselves
>> aligned.
> 
> Ah, I thought it'd be failed before we got to this point.
> 
> But honestly, I think x86-64 needs to be fixed to either use
> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> tried to find the gcc implementation of __builtin_memcmp() on
> x86-64, but I can't.

__builtin_memcmp is a no-go: since memcmp is an ifunc [0], the builtin
ends up resolving to a call to memcmp, which causes link (and hence
build) failures. So what remains is to either patch particular call
sites or, as Dave suggested, have a generic optimized implementation.

glibc's implementation [1] seems straightforward enough to be
convertible to kernel style. However it would need definitive proof it
actually improves performance in a variety of scenarios.

[0] https://sourceware.org/glibc/wiki/GNU_IFUNC
[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=string/memcmp.c

> 


* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-15 22:33       ` Dave Chinner
@ 2021-07-20 14:58         ` Nikolay Borisov
  2021-07-20 15:12           ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Borisov @ 2021-07-20 14:58 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox; +Cc: linux-fsdevel, viro, djwong



On 16.07.21 1:33, Dave Chinner wrote:
> On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
>> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
>>> I was wondering the same thing, but AFAICS it seems to be possible,
>>> i.e. if userspace passes bad offsets. While all kinds of internal fs
>>> synchronization ops are going to be performed on aligned offsets, that
>>> doesn't mean the original ones, passed from userspace, are themselves
>>> aligned.
>>
>> Ah, I thought it'd be failed before we got to this point.
>>
>> But honestly, I think x86-64 needs to be fixed to either use
>> __builtin_memcmp() or to have a nicely written custom memcmp().  I
>> tried to find the gcc implementation of __builtin_memcmp() on
>> x86-64, but I can't.
> 
> Yup, this. memcmp() is widely used in hot paths through all the
> filesystem code and the rest of the kernel. We should fix the
> generic infrastructure problem, not play whack-a-mole with custom
> one-off fixes that avoid the problem just where it shows up in some
> profile...

I ported glibc's implementation of memcmp to the kernel and after
running the same workload I get the same performance as with the basic
memcmp implementation of doing byte comparison ...

> 
> Cheers,
> 
> Dave.
> 


* Re: [PATCH] vfs: Optimize dedupe comparison
  2021-07-20 14:58         ` Nikolay Borisov
@ 2021-07-20 15:12           ` Matthew Wilcox
  0 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2021-07-20 15:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Dave Chinner, linux-fsdevel, viro, djwong

On Tue, Jul 20, 2021 at 05:58:39PM +0300, Nikolay Borisov wrote:
> 
> 
On 16.07.21 1:33, Dave Chinner wrote:
> > On Thu, Jul 15, 2021 at 04:09:06PM +0100, Matthew Wilcox wrote:
> >> On Thu, Jul 15, 2021 at 05:44:15PM +0300, Nikolay Borisov wrote:
> >>> I was wondering the same thing, but AFAICS it seems to be possible,
> >>> i.e. if userspace passes bad offsets. While all kinds of internal fs
> >>> synchronization ops are going to be performed on aligned offsets, that
> >>> doesn't mean the original ones, passed from userspace, are themselves
> >>> aligned.
> >>
> >> Ah, I thought it'd be failed before we got to this point.
> >>
> >> But honestly, I think x86-64 needs to be fixed to either use
> >> __builtin_memcmp() or to have a nicely written custom memcmp().  I
> >> tried to find the gcc implementation of __builtin_memcmp() on
> >> x86-64, but I can't.
> > 
> > Yup, this. memcmp() is widely used in hot paths through all the
> > filesystem code and the rest of the kernel. We should fix the
> > generic infrastructure problem, not play whack-a-mole with custom
> > one-off fixes that avoid the problem just where it shows up in some
> > profile...
> 
> I ported glibc's implementation of memcmp to the kernel and after
> running the same workload I get the same performance as with the basic
> memcmp implementation of doing byte comparison ...

That's bizarre because the glibc memcmp that you pointed to earlier
basically does what your open-coded solution did.  Is it possible
you have a bug in one of the tests and it's falling back to the byte
loop?

Specifically for the dedup case, we only need the optimisation that

	if ((p1 | p2 | length) & 7)
		... do the byte loop ...
	... do the long-based comparison ...

so another possibility is that memcmp is doing too many tests.
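(Editorial sketch of the shortcut suggested above, assuming an 8-byte
unsigned long; the function name is invented. One combined test on
both pointers and the length selects the word loop, and an
8-byte-aligned length leaves no sub-word tail to handle.)

```c
#include <stdint.h>
#include <string.h>

/* One test up front: if either pointer or the length has any of the
 * low three bits set, do the byte-wise memcmp(); otherwise compare
 * whole longs.  Returns 0 on equal, nonzero on different (the sign of
 * the first difference is not preserved on the word path). */
static int dedupe_cmp(const void *p1, const void *p2, size_t length)
{
	const unsigned long *a = p1, *b = p2;
	size_t i;

	if (((uintptr_t)p1 | (uintptr_t)p2 | length) & 7)
		return memcmp(p1, p2, length);

	for (i = 0; i < length / sizeof(unsigned long); i++)
		if (a[i] != b[i])
			return 1;
	return 0;
}
```

(For the dedupe caller only equality matters, so discarding the sign
of the difference on the word path is fine.)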

