linux-block.vger.kernel.org archive mirror
* [PATCH] block: Optimize bio_init()
@ 2021-09-11 21:47 Bart Van Assche
  2021-09-11 22:01 ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2021-09-11 21:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

The following test:

sudo taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nullb0

reports 1366 K IOPS on my test setup without this patch and 1380 K IOPS
with this patch applied, i.e. about a 1% performance improvement. I
think this is because this patch lets the compiler generate better
code. See also commit da521626ac62 ("bio: optimize initialization of a
bio").
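
For reference, here is a minimal user-space sketch of the two
initialization styles (not part of the patch; the struct and function
names are made up). Field-by-field assignments tend to become one store
per member, while assigning a compound literal lets the compiler zero
all unnamed members as a single block (a memset() or rep stos), which
is what the listings below show for the 128-byte struct bio:

#include <stddef.h>

struct example {
	void *a, *b, *c;
	unsigned int flags;
	unsigned short nvecs;
	void *vec;
};

/* One explicit store per member, like the old bio_init(). */
void init_fields(struct example *e, void *vec, unsigned short nvecs)
{
	e->a = NULL;
	e->b = NULL;
	e->c = NULL;
	e->flags = 0;
	e->nvecs = nvecs;
	e->vec = vec;
}

/*
 * Compound-literal assignment, like the new bio_init(): members that
 * are not named are implicitly zero-initialized, so the compiler is
 * free to emit one block clear. Compare the output of e.g.
 * "gcc -O2 -S" and "clang -O2 -S" on this file.
 */
void init_aggregate(struct example *e, void *vec, unsigned short nvecs)
{
	*e = (struct example) {
		.nvecs	= nvecs,
		.vec	= vec,
	};
}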

The assembly code generated by gcc without this patch is as follows:

   0x0000000000000000 <+0>:     call   0x5 <bio_init+5>
   0x0000000000000005 <+5>:     xor    %eax,%eax
   0x0000000000000007 <+7>:     xor    %ecx,%ecx
   0x0000000000000009 <+9>:     movl   $0x1,0x1c(%rdi)
   0x0000000000000010 <+16>:    movq   $0x0,(%rdi)
   0x0000000000000017 <+23>:    movq   $0x0,0x8(%rdi)
   0x000000000000001f <+31>:    movq   $0x0,0x10(%rdi)
   0x0000000000000027 <+39>:    mov    %ax,0x18(%rdi)
   0x000000000000002b <+43>:    movb   $0x0,0x1a(%rdi)
   0x000000000000002f <+47>:    movq   $0x0,0x20(%rdi)
   0x0000000000000037 <+55>:    movq   $0x0,0x28(%rdi)
   0x000000000000003f <+63>:    movl   $0x0,0x30(%rdi)
   0x0000000000000046 <+70>:    movq   $0x0,0x38(%rdi)
   0x000000000000004e <+78>:    movq   $0x0,0x40(%rdi)
   0x0000000000000056 <+86>:    movq   $0x0,0x48(%rdi)
   0x000000000000005e <+94>:    movq   $0x0,0x50(%rdi)
   0x0000000000000066 <+102>:   movq   $0x0,0x58(%rdi)
   0x000000000000006e <+110>:   movq   $0x0,0x60(%rdi)
   0x0000000000000076 <+118>:   mov    %cx,0x68(%rdi)
   0x000000000000007a <+122>:   movl   $0x1,0x6c(%rdi)
   0x0000000000000081 <+129>:   mov    %dx,0x6a(%rdi)
   0x0000000000000085 <+133>:   mov    %rsi,0x70(%rdi)
   0x0000000000000089 <+137>:   movq   $0x0,0x78(%rdi)
   0x0000000000000091 <+145>:   ret

With this patch, gcc compiles bio_init() into the following assembly
code; a single rep stos (%ecx = 0x10, i.e. 16 quadwords) now zeroes the
entire 128-byte struct before the non-zero fields are filled in:

   0x0000000000000000 <+0>:     call   0x5 <bio_init+5>
   0x0000000000000005 <+5>:     mov    %rdi,%r8
   0x0000000000000008 <+8>:     mov    $0x10,%ecx
   0x000000000000000d <+13>:    xor    %eax,%eax
   0x000000000000000f <+15>:    rep stos %rax,%es:(%rdi)
   0x0000000000000012 <+18>:    movl   $0x1,0x1c(%r8)
   0x000000000000001a <+26>:    mov    %dx,0x6a(%r8)
   0x000000000000001f <+31>:    movl   $0x1,0x6c(%r8)
   0x0000000000000027 <+39>:    mov    %rsi,0x70(%r8)
   0x000000000000002b <+43>:    ret

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/bio.c | 45 ++++++++-------------------------------------
 1 file changed, 8 insertions(+), 37 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5df3dd282e40..775cd4274523 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -244,47 +244,18 @@ static void bio_free(struct bio *bio)
 }
 
 /*
- * Users of this function have their own bio allocation. Subsequently,
- * they must remember to pair any call to bio_init() with bio_uninit()
- * when IO has completed, or when the bio is released.
+ * Users of this function must pair any call to bio_init() with a call to
+ * bio_uninit() after IO has completed or when the bio is released.
  */
 void bio_init(struct bio *bio, struct bio_vec *table,
 	      unsigned short max_vecs)
 {
-	bio->bi_next = NULL;
-	bio->bi_bdev = NULL;
-	bio->bi_opf = 0;
-	bio->bi_flags = 0;
-	bio->bi_ioprio = 0;
-	bio->bi_write_hint = 0;
-	bio->bi_status = 0;
-	bio->bi_iter.bi_sector = 0;
-	bio->bi_iter.bi_size = 0;
-	bio->bi_iter.bi_idx = 0;
-	bio->bi_iter.bi_bvec_done = 0;
-	bio->bi_end_io = NULL;
-	bio->bi_private = NULL;
-#ifdef CONFIG_BLK_CGROUP
-	bio->bi_blkg = NULL;
-	bio->bi_issue.value = 0;
-#ifdef CONFIG_BLK_CGROUP_IOCOST
-	bio->bi_iocost_cost = 0;
-#endif
-#endif
-#ifdef CONFIG_BLK_INLINE_ENCRYPTION
-	bio->bi_crypt_context = NULL;
-#endif
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-	bio->bi_integrity = NULL;
-#endif
-	bio->bi_vcnt = 0;
-
-	atomic_set(&bio->__bi_remaining, 1);
-	atomic_set(&bio->__bi_cnt, 1);
-
-	bio->bi_max_vecs = max_vecs;
-	bio->bi_io_vec = table;
-	bio->bi_pool = NULL;
+	*bio = (struct bio) {
+		.__bi_remaining	= ATOMIC_INIT(1),
+		.__bi_cnt	= ATOMIC_INIT(1),
+		.bi_max_vecs	= max_vecs,
+		.bi_io_vec	= table,
+	};
 }
 EXPORT_SYMBOL(bio_init);
 

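A side note on the comment updated by this patch: below is a hedged
sketch (not from this thread; the function name and the elided setup
are made up) of the bio_init()/bio_uninit() pairing it asks callers to
respect, for a caller-owned bio with an inline bio_vec table:

#include <linux/bio.h>

static void example_caller(struct bio_vec *bvec_table, unsigned short nr)
{
	struct bio bio;

	bio_init(&bio, bvec_table, nr);
	/* ... set bi_bdev/bi_opf, add pages, submit the bio and wait ... */
	bio_uninit(&bio);	/* must follow every bio_init() */
}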

* Re: [PATCH] block: Optimize bio_init()
  2021-09-11 21:47 [PATCH] block: Optimize bio_init() Bart Van Assche
@ 2021-09-11 22:01 ` Jens Axboe
  2021-09-11 22:09   ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2021-09-11 22:01 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig

On 9/11/21 3:47 PM, Bart Van Assche wrote:
> [...]
> +	*bio = (struct bio) {
> +		.__bi_remaining	= ATOMIC_INIT(1),
> +		.__bi_cnt	= ATOMIC_INIT(1),
> +		.bi_max_vecs	= max_vecs,
> +		.bi_io_vec	= table,
> +	};
>  }

I'll give this a whirl too, another upside is that it's less prone to
errors if struct bio is changed.

-- 
Jens Axboe



* Re: [PATCH] block: Optimize bio_init()
  2021-09-11 22:01 ` Jens Axboe
@ 2021-09-11 22:09   ` Jens Axboe
  2021-09-11 22:16     ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2021-09-11 22:09 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig

On 9/11/21 4:01 PM, Jens Axboe wrote:
> On 9/11/21 3:47 PM, Bart Van Assche wrote:
>> [...]
> 
> I'll give this a whirl too, another upside is that it's less prone to
> errors if struct bio is changed.

Seems slower for me, by about 1-1.5%; the extra time shows up in
bio_alloc_kiocb(), which is the only bio_init() caller in my test. I'm
using gcc 11.1 here, and my code generation seems to match yours (a
series of movs without the patch vs rep stos with it).

Probably a CPU thing. I'm running on an AMD 3970X for these tests.

-- 
Jens Axboe



* Re: [PATCH] block: Optimize bio_init()
  2021-09-11 22:09   ` Jens Axboe
@ 2021-09-11 22:16     ` Jens Axboe
  2021-09-12  3:19       ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2021-09-11 22:16 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig

On 9/11/21 4:09 PM, Jens Axboe wrote:
> On 9/11/21 4:01 PM, Jens Axboe wrote:
>> On 9/11/21 3:47 PM, Bart Van Assche wrote:
>>> [...]
>>
>> I'll give this a whirl too, another upside is that it's less prone to
>> errors if struct bio is changed.
> 
> Seems slower for me, by about 1-1.5%; the extra time shows up in
> bio_alloc_kiocb(), which is the only bio_init() caller in my test. I'm
> using gcc 11.1 here, and my code generation seems to match yours (a
> series of movs without the patch vs rep stos with it).
> 
> Probably a CPU thing. I'm running on an AMD 3970X for these tests.

Looking at profile:

 43.34 │      rep    stos %rax,%es:(%rdi)

I do wonder if rep stos is just not very well suited for small regions,
either in general or particularly on AMD.

What do your profiles look like for before and after?

-- 
Jens Axboe



* Re: [PATCH] block: Optimize bio_init()
  2021-09-11 22:16     ` Jens Axboe
@ 2021-09-12  3:19       ` Bart Van Assche
  2021-09-12 13:03         ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2021-09-12  3:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 9/11/21 15:16, Jens Axboe wrote:
> Looking at profile:
> 
>   43.34 │      rep    stos %rax,%es:(%rdi)
>
> I do wonder if rep stos is just not very well suited for small regions,
> either in general or particularly on AMD.
> 
> What do your profiles look like for before and after?

Since I do not know which tool was used to obtain the above
information, I ran "perf record -ags sleep 10" while the test
was running. I could not find bio_init in the output; I think
that means that function got inlined. bio_alloc_bioset() did
show up in the output, and the time spent in that function is
lower when IOPS are higher.

The performance numbers in the patch description come from an
Intel Xeon Gold 6154 CPU. I reran the test today on an old Intel
Core i7-4790 CPU and obtained the opposite result: higher IOPS
without this patch than with it, although the assembly code
looks the same. It seems that how fast "rep stos" runs depends
on the CPU type?

Bart.



* Re: [PATCH] block: Optimize bio_init()
  2021-09-12  3:19       ` Bart Van Assche
@ 2021-09-12 13:03         ` Jens Axboe
  2021-09-12 22:01           ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2021-09-12 13:03 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig

On 9/11/21 9:19 PM, Bart Van Assche wrote:
> On 9/11/21 15:16, Jens Axboe wrote:
>> Looking at profile:
>>
>>   43.34 │      rep    stos %rax,%es:(%rdi)
>>
>> I do wonder if rep stos is just not very well suited for small regions,
>> either in general or particularly on AMD.
>>
>> What do your profiles look like for before and after?
> 
> Since I do not know which tool was used to obtain the above
> information, I ran "perf record -ags sleep 10" while the test
> was running. I could not find bio_init in the output; I think
> that means that function got inlined. bio_alloc_bioset() did
> show up in the output, and the time spent in that function is
> lower when IOPS are higher.

The above is from perf report, diving into the functions. Yours shows
up in bio_alloc_bioset(), and mine in bio_alloc_kiocb(), since I'm
doing polled IO.

> The performance numbers in the patch description come from an
> Intel Xeon Gold 6154 CPU. I reran the test today on an old Intel
> Core i7-4790 CPU and obtained the opposite result: higher IOPS
> without this patch than with it, although the assembly code
> looks the same. It seems that how fast "rep stos" runs depends
> on the CPU type?

It does appear so. Which is a bit frustrating...

-- 
Jens Axboe



* Re: [PATCH] block: Optimize bio_init()
  2021-09-12 13:03         ` Jens Axboe
@ 2021-09-12 22:01           ` Bart Van Assche
  2021-09-12 22:13             ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2021-09-12 22:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 9/12/21 06:03, Jens Axboe wrote:
> On 9/11/21 9:19 PM, Bart Van Assche wrote:
>> The performance numbers in the patch description come from an
>> Intel Xeon Gold 6154 CPU. I reran the test today on an old Intel
>> Core i7-4790 CPU and obtained the opposite result: higher IOPS
>> without this patch than with it, although the assembly code
>> looks the same. It seems that how fast "rep stos" runs depends
>> on the CPU type?
> 
> It does appear so. Which is a bit frustrating...

Further measurements have shown that this behavior is specific to
gcc and also that clang always generates faster code for the version
of bio_init() in my patch. I have reported this as a bug to the gcc
project. See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294.

Bart.




* Re: [PATCH] block: Optimize bio_init()
  2021-09-12 22:01           ` Bart Van Assche
@ 2021-09-12 22:13             ` Jens Axboe
  2021-09-13  3:52               ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2021-09-12 22:13 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig

On 9/12/21 4:01 PM, Bart Van Assche wrote:
> On 9/12/21 06:03, Jens Axboe wrote:
>> On 9/11/21 9:19 PM, Bart Van Assche wrote:
>>> The performance numbers in the patch description come from an
>>> Intel Xeon Gold 6154 CPU. I reran the test today on an old Intel
>>> Core i7-4790 CPU and obtained the opposite result: higher IOPS
>>> without this patch than with it, although the assembly code
>>> looks the same. It seems that how fast "rep stos" runs depends
>>> on the CPU type?
>>
>> It does appear so. Which is a bit frustrating...
> 
> Further measurements have shown that this behavior is specific to
> gcc and also that clang always generates faster code for the version
> of bio_init() in my patch. I have reported this as a bug to the gcc
> project. See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294.

Interesting! Here are some results from my end. First the 3970X again:

gcc-11.1
Elapsed time: 0.980807 s
Elapsed time: 0.452951 s
Elapsed time: 0.949918 s

clang-11.0
Elapsed time: 0.284734 s
Elapsed time: 0.356595 s
Elapsed time: 0.285459 s

And my laptop, which is using:

11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

gcc-11.1
Elapsed time: 0.218427 s
Elapsed time: 0.235000 s
Elapsed time: 0.214217 s

clang-11.0
Elapsed time: 0.217436 s
Elapsed time: 0.170959 s
Elapsed time: 0.149630 s

All compiles done with -O2 -march=native
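
The "Elapsed time" numbers above presumably come from a small
stand-alone timing loop; the actual test program is not included in the
thread, so the following is only a guess at its shape (struct layout,
iteration count and names are all made up), built with the flags noted
above:

#include <stdio.h>
#include <time.h>

/* Roughly the size of struct bio: mostly zeroed members plus a couple
 * of explicitly initialized ones at the end.
 */
struct fake_bio {
	char zeroed[112];
	unsigned short max_vecs;
	void *io_vec;
};

static struct fake_bio b;

int main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < 100000000L; i++) {
		b = (struct fake_bio) {
			.max_vecs	= 1,
			.io_vec		= &b,
		};
		__asm__ volatile("" : "+m" (b));	/* keep the store */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("Elapsed time: %f s\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}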

Now I kind of want to compile the kernel with clang and see how that
goes...

-- 
Jens Axboe



* Re: [PATCH] block: Optimize bio_init()
  2021-09-12 22:13             ` Jens Axboe
@ 2021-09-13  3:52               ` Bart Van Assche
  0 siblings, 0 replies; 9+ messages in thread
From: Bart Van Assche @ 2021-09-13  3:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig

On 9/12/21 15:13, Jens Axboe wrote:
> Now I kind of want to compile the kernel with clang and see how that
> goes...

Another question is which compiler we should use as a reference to
decide which implementation is faster. My view of the conversation in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102294 is that it could
take a while before gcc generates code for memset() and structure
assignment that is as fast as what clang generates ...

Thanks,

Bart.
