* Question about btrfs and XOR offloading
@ 2021-01-03  3:50 Rosen Penev
  2021-01-03  6:53 ` Qu Wenruo
  2021-01-04 14:44 ` David Sterba
  0 siblings, 2 replies; 9+ messages in thread
From: Rosen Penev @ 2021-01-03  3:50 UTC (permalink / raw)
  To: linux-btrfs

I've noticed that btrfs' internal XOR code is CPU-only. Does anyone
know if there is a performance advantage to using a hardware-accelerated
path? I ask because I use btrfs on a Marvell ARM platform with XOR
offload capability.

* Re: Question about btrfs and XOR offloading
  2021-01-03  3:50 Question about btrfs and XOR offloading Rosen Penev
@ 2021-01-03  6:53 ` Qu Wenruo
  2021-01-03  7:12   ` Rosen Penev
  2021-01-04 14:44 ` David Sterba
  1 sibling, 1 reply; 9+ messages in thread
From: Qu Wenruo @ 2021-01-03  6:53 UTC (permalink / raw)
  To: Rosen Penev, linux-btrfs



On 2021/1/3 11:50 AM, Rosen Penev wrote:
> I've noticed that internally, btrfs' XOR code is CPU only. Does anyone
> know if there is a performance advantage to using a hardware
> accelerated path? I ask as I use BTRFS on a Marvell ARM platform with
> XOR offload capability.
>
AFAIK XOR is only used by RAID56, and RAID56 is not considered safe
because of the write hole, so I don't think hardware XOR support in
btrfs would make any difference for now.
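
For readers less familiar with the term: the XOR in question is the
RAID5-style parity computation, where the parity block is the byte-wise
XOR of the data blocks, so any single lost block can be rebuilt from the
others. A minimal sketch in Python (illustration only, not btrfs code;
`xor_blocks` is a hypothetical helper name):

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-sized byte blocks together, byte by byte."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    # zip(*blocks) yields one tuple of ints per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data blocks of one stripe (made-up contents).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Pretend data[1] is lost: XOR of the survivors and the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

This is the operation a hardware XOR engine would take over; the
write-hole concern is about update atomicity and is independent of who
computes the parity.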

Thanks,
Qu

* Re: Question about btrfs and XOR offloading
  2021-01-03  6:53 ` Qu Wenruo
@ 2021-01-03  7:12   ` Rosen Penev
  0 siblings, 0 replies; 9+ messages in thread
From: Rosen Penev @ 2021-01-03  7:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Jan 2, 2021 at 10:53 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2021/1/3 11:50 AM, Rosen Penev wrote:
> > I've noticed that internally, btrfs' XOR code is CPU only. Does anyone
> > know if there is a performance advantage to using a hardware
> > accelerated path? I ask as I use BTRFS on a Marvell ARM platform with
> > XOR offload capability.
> >
> AFAIK XOR is only utilized by RAID56, while RAID56 is not considered
> safe due to write-hole, thus I don't believe whether btrfs supports
> hardware XOR would make any difference for now.
Right, the question is about performance. I don't know whether making
the XOR asynchronous would make any difference.
>
> Thanks,
> Qu

* Re: Question about btrfs and XOR offloading
  2021-01-03  3:50 Question about btrfs and XOR offloading Rosen Penev
  2021-01-03  6:53 ` Qu Wenruo
@ 2021-01-04 14:44 ` David Sterba
  2021-01-04 22:56   ` Rosen Penev
  1 sibling, 1 reply; 9+ messages in thread
From: David Sterba @ 2021-01-04 14:44 UTC (permalink / raw)
  To: Rosen Penev; +Cc: linux-btrfs

On Sat, Jan 02, 2021 at 07:50:38PM -0800, Rosen Penev wrote:
> I've noticed that internally, btrfs' XOR code is CPU only. Does anyone
> know if there is a performance advantage to using a hardware
> accelerated path? I ask as I use BTRFS on a Marvell ARM platform with
> XOR offload capability.

Even though it runs on the CPU, it's accelerated, and the best algorithm
is selected at boot time:

[   16.357703] raid6: avx2x4   gen() 30635 MB/s
[   16.425701] raid6: avx2x4   xor() 10727 MB/s
[   16.493701] raid6: avx2x2   gen() 32995 MB/s
[   16.561701] raid6: avx2x2   xor() 19596 MB/s
[   16.629701] raid6: avx2x1   gen() 26349 MB/s
[   16.697710] raid6: avx2x1   xor() 17794 MB/s
[   16.765701] raid6: sse2x4   gen() 17354 MB/s
[   16.833701] raid6: sse2x4   xor()  9653 MB/s
[   16.901706] raid6: sse2x2   gen() 18495 MB/s
[   16.969702] raid6: sse2x2   xor() 11562 MB/s
[   17.037701] raid6: sse2x1   gen() 14440 MB/s
[   17.105818] raid6: sse2x1   xor() 10387 MB/s
[   17.108300] raid6: using algorithm avx2x2 gen() 32995 MB/s
[   17.110703] raid6: .... xor() 19596 MB/s, rmw enabled
[   17.113587] raid6: using avx2x2 recovery algorithm
[   17.327666] xor: automatically using best checksumming function   avx
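
As a rough illustration of how to read such a log, the sketch below
parses the per-algorithm lines and picks the highest gen() and xor()
figures. It only mirrors what this particular log shows (avx2x2 winning
both); it does not reproduce the kernel's actual selection logic, and
the function names are made up for the example.

```python
import re

# A subset of the log above, including the two summary lines, which the
# parser is expected to skip.
LOG = """\
[   16.357703] raid6: avx2x4   gen() 30635 MB/s
[   16.425701] raid6: avx2x4   xor() 10727 MB/s
[   16.493701] raid6: avx2x2   gen() 32995 MB/s
[   16.561701] raid6: avx2x2   xor() 19596 MB/s
[   16.629701] raid6: avx2x1   gen() 26349 MB/s
[   16.697710] raid6: avx2x1   xor() 17794 MB/s
[   16.765701] raid6: sse2x4   gen() 17354 MB/s
[   16.833701] raid6: sse2x4   xor()  9653 MB/s
[   17.108300] raid6: using algorithm avx2x2 gen() 32995 MB/s
[   17.110703] raid6: .... xor() 19596 MB/s, rmw enabled
"""

# Matches only "raid6: <algo> gen()/xor() <rate> MB/s" benchmark lines.
LINE = re.compile(r"raid6:\s+(\w+)\s+(gen|xor)\(\)\s+(\d+) MB/s")


def parse_raid6_log(text):
    """Return {"gen": {algo: MB/s}, "xor": {algo: MB/s}}."""
    results = {"gen": {}, "xor": {}}
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            algo, kind, rate = match.groups()
            results[kind][algo] = int(rate)
    return results


def fastest(results, kind):
    """Return (algo, MB/s) with the highest rate for 'gen' or 'xor'."""
    table = results[kind]
    algo = max(table, key=table.get)
    return algo, table[algo]


results = parse_raid6_log(LOG)
print(fastest(results, "gen"))
print(fastest(results, "xor"))
```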

The xor/parity calculations are done synchronously, while offloading
to hw usually requires an asynchronous submit/wait mechanism. That adds
some overhead, so whether offloading wins depends on the hardware and
the amount of data. The code in btrfs would need to be adapted to work
asynchronously, unless it's all somehow hidden under the crypto API.
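
To make the difference between the two calling styles concrete, here is
a small userspace sketch in Python. It is only an analogy: a worker
thread stands in for the offload engine, and `FakeXorEngine`,
`parity_sync` and `xor_pair` are hypothetical names, not kernel or btrfs
interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def xor_pair(a, b):
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_sync(a, b):
    # Synchronous style: the caller does the work itself and has the
    # result as soon as the call returns.
    return xor_pair(a, b)


class FakeXorEngine:
    """Stand-in for an offload engine; a worker thread plays the hardware."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, a, b):
        # Asynchronous style: hand the job over and get a handle back.
        return self._pool.submit(xor_pair, a, b)

    def shutdown(self):
        self._pool.shutdown(wait=True)


a = bytes(range(16))
b = bytes([0xFF]) * 16

engine = FakeXorEngine()
pending = engine.submit(a, b)        # submit: returns without the result
# ... the caller is free to do other work here ...
result = pending.result(timeout=5)   # wait: block until the job completes
engine.shutdown()

assert result == parity_sync(a, b)
```

The submit and wait steps are where the extra overhead comes from, and
every caller has to be restructured around the pending handle, which is
the adaptation mentioned above.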

* Re: Question about btrfs and XOR offloading
  2021-01-04 14:44 ` David Sterba
@ 2021-01-04 22:56   ` Rosen Penev
  2021-01-05 15:33     ` David Sterba
  0 siblings, 1 reply; 9+ messages in thread
From: Rosen Penev @ 2021-01-04 22:56 UTC (permalink / raw)
  To: dsterba, Rosen Penev, linux-btrfs

On Mon, Jan 4, 2021 at 6:46 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Sat, Jan 02, 2021 at 07:50:38PM -0800, Rosen Penev wrote:
> > I've noticed that internally, btrfs' XOR code is CPU only. Does anyone
> > know if there is a performance advantage to using a hardware
> > accelerated path? I ask as I use BTRFS on a Marvell ARM platform with
> > XOR offload capability.
>
> Even if it's CPU, it's accelerated and best algorithm is selected at
> boot time:
>
> [   16.357703] raid6: avx2x4   gen() 30635 MB/s
> [   16.425701] raid6: avx2x4   xor() 10727 MB/s
> [   16.493701] raid6: avx2x2   gen() 32995 MB/s
> [   16.561701] raid6: avx2x2   xor() 19596 MB/s
> [   16.629701] raid6: avx2x1   gen() 26349 MB/s
> [   16.697710] raid6: avx2x1   xor() 17794 MB/s
> [   16.765701] raid6: sse2x4   gen() 17354 MB/s
> [   16.833701] raid6: sse2x4   xor()  9653 MB/s
> [   16.901706] raid6: sse2x2   gen() 18495 MB/s
> [   16.969702] raid6: sse2x2   xor() 11562 MB/s
> [   17.037701] raid6: sse2x1   gen() 14440 MB/s
> [   17.105818] raid6: sse2x1   xor() 10387 MB/s
> [   17.108300] raid6: using algorithm avx2x2 gen() 32995 MB/s
> [   17.110703] raid6: .... xor() 19596 MB/s, rmw enabled
> [   17.113587] raid6: using avx2x2 recovery algorithm
> [   17.327666] xor: automatically using best checksumming function   avx
Yeah...

[    0.316064] raid6: neonx8   xor()  1087 MB/s
[    0.452063] raid6: neonx4   xor()  1372 MB/s
[    0.588064] raid6: neonx2   xor()  1610 MB/s
[    0.724061] raid6: neonx1   xor()  1345 MB/s
[    0.860072] raid6: int32x8  xor()   337 MB/s
[    0.996092] raid6: int32x4  xor()   373 MB/s
[    1.132087] raid6: int32x2  xor()   348 MB/s
[    1.268090] raid6: int32x1  xor()   281 MB/s
[    1.268093] raid6: .... xor() 1610 MB/s, rmw enabled

Not as fast here.
>
> The xor/parity calculations are done synchronously, while the offloading
> to hw usually requires asynchronous submit/wait mechanism. This brings
> some overhead, so it depends. The code in btrfs would need to be adapted
> to do the async way, unless it's all somehow hidden under the crypto
> API.

* Re: Question about btrfs and XOR offloading
  2021-01-04 22:56   ` Rosen Penev
@ 2021-01-05 15:33     ` David Sterba
  2021-01-05 22:40       ` Rosen Penev
  0 siblings, 1 reply; 9+ messages in thread
From: David Sterba @ 2021-01-05 15:33 UTC (permalink / raw)
  To: Rosen Penev; +Cc: dsterba, linux-btrfs

On Mon, Jan 04, 2021 at 02:56:59PM -0800, Rosen Penev wrote:
> Yeah...
> 
> [    0.316064] raid6: neonx8   xor()  1087 MB/s
> [    0.452063] raid6: neonx4   xor()  1372 MB/s
> [    0.588064] raid6: neonx2   xor()  1610 MB/s
> [    0.724061] raid6: neonx1   xor()  1345 MB/s
> [    0.860072] raid6: int32x8  xor()   337 MB/s
> [    0.996092] raid6: int32x4  xor()   373 MB/s
> [    1.132087] raid6: int32x2  xor()   348 MB/s
> [    1.268090] raid6: int32x1  xor()   281 MB/s
> [    1.268093] raid6: .... xor() 1610 MB/s, rmw enabled
> 
> Not as fast here.

What's the raw speed of the hw offload? It should be measured on large
data so that the submit overhead is negligible.

It might make sense to add async support if the speed is comparable to
or better than the CPU's, and it would also reduce the CPU load.
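
A back-of-the-envelope model of why the measurement needs large data,
sketched in Python. The 1610 MB/s CPU figure is the neonx2 xor() result
quoted above; the engine rate (2000 MB/s) and the per-request overhead
(50 microseconds) are pure assumptions chosen for illustration, not
measurements of any Marvell part.

```python
CPU_MBPS = 1610        # neonx2 xor() figure from the log quoted above
HW_MBPS = 2000         # ASSUMED raw engine rate, illustration only
OVERHEAD_S = 50e-6     # ASSUMED fixed submit/wait cost per request


def effective_mbps(size_bytes, raw_mbps, overhead_s):
    """Throughput seen by the caller once the fixed overhead is included."""
    total_s = overhead_s + size_bytes / (raw_mbps * 1e6)
    return size_bytes / total_s / 1e6


def break_even_bytes(cpu_mbps, hw_mbps, overhead_s):
    """Request size at which offload time equals CPU time.

    Returns None when the engine is not faster than the CPU, because
    then offloading never wins on speed alone.
    """
    if hw_mbps <= cpu_mbps:
        return None
    return overhead_s / (1 / (cpu_mbps * 1e6) - 1 / (hw_mbps * 1e6))


for size in (4096, 64 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    rate = effective_mbps(size, HW_MBPS, OVERHEAD_S)
    print(f"{size:>10} bytes: {rate:8.1f} MB/s effective")

print(break_even_bytes(CPU_MBPS, HW_MBPS, OVERHEAD_S))
```

With small requests the fixed cost dominates and the effective rate is
far below the raw rate; only large requests approach it. Even below the
break-even size, offloading can still be worthwhile if the goal is to
free up the CPU rather than to finish sooner.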

* Re: Question about btrfs and XOR offloading
  2021-01-05 15:33     ` David Sterba
@ 2021-01-05 22:40       ` Rosen Penev
  2021-01-06 12:37         ` David Sterba
  0 siblings, 1 reply; 9+ messages in thread
From: Rosen Penev @ 2021-01-05 22:40 UTC (permalink / raw)
  To: dsterba, Rosen Penev, linux-btrfs

On Tue, Jan 5, 2021 at 7:35 AM David Sterba <dsterba@suse.cz> wrote:
>
> What's the raw speed of the hw offload? Measured on large data so that
> the overhead is negligible.
I have no idea how to benchmark such a thing. I assume it could be
done indirectly.
>
> It might make sense to add the async support in case the speed is
> comparable or better to the CPU, but also to reduce the CPU load.
I think the latter is the reason Marvell added hardware support for
doing parity calculations.

* Re: Question about btrfs and XOR offloading
  2021-01-05 22:40       ` Rosen Penev
@ 2021-01-06 12:37         ` David Sterba
  2021-01-06 21:52           ` Rosen Penev
  0 siblings, 1 reply; 9+ messages in thread
From: David Sterba @ 2021-01-06 12:37 UTC (permalink / raw)
  To: Rosen Penev; +Cc: dsterba, linux-btrfs

On Tue, Jan 05, 2021 at 02:40:28PM -0800, Rosen Penev wrote:
> > What's the raw speed of the hw offload? Measured on large data so that
> > the overhead is negligible.
> I have no idea how to benchmark such a thing. I assume it could be
> done indirectly.
> >
> > It might make sense to add the async support in case the speed is
> > comparable or better to the CPU, but also to reduce the CPU load.
> I think the latter is the reason Marvell added hardware support for
> doing parity calculations.

The support seems to be in NAS boxes, and besides xor and raid5/6
calculations the engine can also do memcpy offload. This could gain a
lot of performance and be cheap in terms of code. Full page copies are
wrapped under copy_page, so that is where we'd need to insert the
offload code. The same goes for the raid5/6 calculations.

MD-RAID already supports offloading, so we have code to stea^Wcopy.
Overall it sounds worthwhile to add async support to btrfs, as it would
help with metadata updates too; there's a lot of memcpy/memmove there.

* Re: Question about btrfs and XOR offloading
  2021-01-06 12:37         ` David Sterba
@ 2021-01-06 21:52           ` Rosen Penev
  0 siblings, 0 replies; 9+ messages in thread
From: Rosen Penev @ 2021-01-06 21:52 UTC (permalink / raw)
  To: dsterba, Rosen Penev, linux-btrfs

On Wed, Jan 6, 2021 at 4:39 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Tue, Jan 05, 2021 at 02:40:28PM -0800, Rosen Penev wrote:
> > > What's the raw speed of the hw offload? Measured on large data so that
> > > the overhead is negligible.
> > I have no idea how to benchmark such a thing. I assume it could be
> > done indirectly.
> > >
> > > It might make sense to add the async support in case the speed is
> > > comparable or better to the CPU, but also to reduce the CPU load.
> > I think the latter is the reason Marvell added hardware support for
> > doing parity calculations.
>
> The support seems to be in NAS boxes and besides xor and raid5/6
> calculations the engine can also do a memcpy offload. This could gain a
> lot of performance and be cheap in terms of code. Full page copies are
> wrapped under copy_page so we'd need to insert the offload code. Similar
> for the raid5/6 calculations.
>
> MD-RAID already supports offloading, so we have code to stea^Wcopy.
> Overall it sounds worthwhile to add async support to btrfs, as it would
> help with metadata updates too; there's a lot of memcpy/memmove there.
That would be lovely. I assume 24-hour btrfs scrubs would become shorter.

Unfortunately, I lack the expertise to implement this properly.
