* Qualcomm Crypto Engine performance numbers on mainline kernel
@ 2021-06-04 16:49 Thara Gopinath
  2021-06-05 15:32 ` Ard Biesheuvel
  0 siblings, 1 reply; 4+ messages in thread
From: Thara Gopinath @ 2021-06-04 16:49 UTC (permalink / raw)
  To: linux-crypto


Hi All,

Below are the performance numbers from running "cryptsetup benchmark" on
CE algorithms in the mainline kernel. All numbers are in MiB/s. The
platform used is the RB3 for SDM845 and MTPs for the rest.


                     SDM845     SM8150     SM8250     SM8350
AES-CBC (128)
Encrypt / Decrypt    114/106    36/48      120/188    133/197

AES-XTS (256)
Encrypt / Decrypt    100/102    49/48      186/187    n/a


-- 
Warm Regards
Thara (She/Her/Hers)


* Re: Qualcomm Crypto Engine performance numbers on mainline kernel
  2021-06-04 16:49 Qualcomm Crypto Engine performance numbers on mainline kernel Thara Gopinath
@ 2021-06-05 15:32 ` Ard Biesheuvel
  2021-06-06  6:49   ` Gilad Ben-Yossef
  2021-06-06 10:07   ` Christian Lamparter
  0 siblings, 2 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2021-06-05 15:32 UTC (permalink / raw)
  To: Thara Gopinath, Eric Biggers; +Cc: Linux Crypto Mailing List

Hello Thara,

On Fri, 4 Jun 2021 at 18:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>
>
> Hi All,
>
> Below are the performance numbers from running "cryptsetup benchmark" on
> CE algorithms in the mainline kernel. All numbers are in MiB/s. The
> platform used is RB3 for sdm845 and MTPs for rest of them.
>
>
>                         SDM845    SM8150     SM8250     SM8350
> AES-CBC (128)
> Encrypt / Decrypt       114/106  36/48       120/188    133/197
>
> AES-XTS (256)
> Encrypt / Decrypt       100/102  49/48       186/187    n/a
>

The CPU instruction based ones are apparently an order of magnitude
faster, and are synchronous so their latency should be lower.

So, as Eric already pointed out IIRC, there doesn't seem to be much
value in enabling this IP in Linux - it should not be the default
choice/highest priority, and it is not obvious to me whether/when you
would prefer this implementation over the CPU based one. Do you have
any idea how many queues it has, or how much data it can process in
parallel? Are there other features that stand out?
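
For readers who want to check which implementation would be the default: the kernel crypto API picks the registered implementation with the highest priority for a given algorithm name, and /proc/crypto lists every registered implementation with its priority. The following is a minimal sketch that ranks implementations from /proc/crypto-style text. The driver names and priority values in SAMPLE are hypothetical and only illustrate the format; they are not taken from any of the platforms in this thread.

```python
def parse_proc_crypto(text):
    """Parse /proc/crypto-style text into a list of dicts, one per entry.

    Entries are separated by blank lines; each line is 'key : value'.
    """
    entries = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "name" in fields:
            entries.append(fields)
    return entries


def rank_implementations(text, algorithm):
    """Return entries for `algorithm`, highest priority first."""
    matches = [e for e in parse_proc_crypto(text) if e.get("name") == algorithm]
    return sorted(matches, key=lambda e: int(e.get("priority", 0)), reverse=True)


# Hypothetical sample: driver names and priorities are made up for illustration.
SAMPLE = """\
name         : cbc(aes)
driver       : cbc-aes-hw-example
module       : kernel
priority     : 300
async        : yes

name         : cbc(aes)
driver       : cbc-aes-cpu-example
module       : kernel
priority     : 400
async        : no

name         : xts(aes)
driver       : xts-aes-cpu-example
module       : kernel
priority     : 400
async        : no
"""

if __name__ == "__main__":
    for entry in rank_implementations(SAMPLE, "cbc(aes)"):
        print(entry["priority"], entry["driver"], "async=" + entry["async"])
```

On a real system, the same function could be fed the contents of /proc/crypto instead of SAMPLE; the first entry returned is the one a request for that algorithm name would normally resolve to.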


* Re: Qualcomm Crypto Engine performance numbers on mainline kernel
  2021-06-05 15:32 ` Ard Biesheuvel
@ 2021-06-06  6:49   ` Gilad Ben-Yossef
  2021-06-06 10:07   ` Christian Lamparter
  1 sibling, 0 replies; 4+ messages in thread
From: Gilad Ben-Yossef @ 2021-06-06  6:49 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: Thara Gopinath, Eric Biggers, Linux Crypto Mailing List

Hi,

On Sat, Jun 5, 2021 at 6:33 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> Hello Thara,
>
> On Fri, 4 Jun 2021 at 18:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
> >
> >
> > Hi All,
> >
> > Below are the performance numbers from running "cryptsetup benchmark" on
> > CE algorithms in the mainline kernel. All numbers are in MiB/s. The
> > platform used is RB3 for sdm845 and MTPs for rest of them.
> >
> >
> >                         SDM845    SM8150     SM8250     SM8350
> > AES-CBC (128)
> > Encrypt / Decrypt       114/106  36/48       120/188    133/197
> >
> > AES-XTS (256)
> > Encrypt / Decrypt       100/102  49/48       186/187    n/a
> >
>
> The CPU instruction based ones are apparently an order of magnitude
> faster, and are synchronous so their latency should be lower.
>
> So, as Eric already pointed out IIRC, there doesn't seem to be much
> value in enabling this IP in Linux - it should not be the default
> choice/highest priority, and it is not obvious to me whether/when you
> would prefer this implementation over the CPU based one. Do you have
> any idea how many queues it has, or how much data it can process in
> parallel? Are there other features that stand out?

One thing to consider with separate hardware blocks vis-à-vis CPU
instruction based implementations in general is that the goal is often
good enough performance while freeing the CPU for other tasks, which
gives better overall system performance, rather than the best possible
performance in the specific task at hand. This sometimes extends to
power considerations, where using the lower performance engine can
reduce power consumption. Less often, lower jitter is more important
than peak performance. I've seen this with encrypted video decoding,
for example.

Sadly, whether any of these considerations is applicable is very much
system and work load specific.

So my 2c contribution would be to include support for this, even if
it is not made the default.

Gilad




-- 
Gilad Ben-Yossef
Chief Coffee Drinker

values of β will give rise to dom!


* Re: Qualcomm Crypto Engine performance numbers on mainline kernel
  2021-06-05 15:32 ` Ard Biesheuvel
  2021-06-06  6:49   ` Gilad Ben-Yossef
@ 2021-06-06 10:07   ` Christian Lamparter
  1 sibling, 0 replies; 4+ messages in thread
From: Christian Lamparter @ 2021-06-06 10:07 UTC (permalink / raw)
  To: Ard Biesheuvel, Thara Gopinath, Eric Biggers; +Cc: Linux Crypto Mailing List

On 05/06/2021 17:32, Ard Biesheuvel wrote:
> Hello Thara,
> 
> On Fri, 4 Jun 2021 at 18:49, Thara Gopinath <thara.gopinath@linaro.org> wrote:
>>
>>
>> Hi All,
>>
>> Below are the performance numbers from running "cryptsetup benchmark" on
>> CE algorithms in the mainline kernel. All numbers are in MiB/s. The
>> platform used is RB3 for sdm845 and MTPs for rest of them.
>>
>>
>>                          SDM845    SM8150     SM8250     SM8350
>> AES-CBC (128)
>> Encrypt / Decrypt       114/106  36/48       120/188    133/197
>>
>> AES-XTS (256)
>> Encrypt / Decrypt       100/102  49/48       186/187    n/a
>>
> 
> The CPU instruction based ones are apparently an order of magnitude
> faster, and are synchronous so their latency should be lower.
> 
> So, as Eric already pointed out IIRC, there doesn't seem to be much
> value in enabling this IP in Linux - it should not be the default
> choice/highest priority, and it is not obvious to me whether/when you
> would prefer this implementation over the CPU based one. Do you have
> any idea how many queues it has, or how much data it can process in
> parallel? Are there other features that stand out?

While I can't say much about qce-crypto, I do know that "cryptsetup
benchmark" isn't the greatest for pitting hardware accelerated
crypto against the CPU in some instances.

In my case (crypto4xx / CPU is a PowerPC 464 800MHz - Hardware is a
Western Digital My Book Live - NAS) the "benchmark" results look
exceptionally poor:
#     Algorithm |       Key |      Encryption |      Decryption
         aes-cbc        128b         8.0 MiB/s         8.7 MiB/s
         aes-cbc        256b         8.7 MiB/s         8.7 MiB/s
         aes-xts        256b         5.3 MiB/s         7.9 MiB/s
         aes-xts        512b         7.9 MiB/s         7.9 MiB/s
(Hardware doesn't have cts/xts, but aes-cbc, aes-ctr and aes-gcm)

(for comparison, these are numbers that are produced by only the
800 MHz PowerPC CPU)
         aes-cbc        128b        15.8 MiB/s        16.3 MiB/s
         aes-cbc        256b        12.3 MiB/s        12.8 MiB/s
         aes-xts        256b        12.5 MiB/s        15.1 MiB/s
         aes-xts        512b        11.9 MiB/s        12.0 MiB/s


and (openssl speed -evp aes-128-cbc --elapsed -seconds 3) software
manages similar numbers:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      12646.42k    16806.66k    18349.31k    18762.07k    18896.21k    18879.83k
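
To compare the openssl figures with the cryptsetup ones, the units need converting: openssl speed reports thousands of bytes per second (the "k" suffix), while cryptsetup benchmark reports MiB/s. A minimal sketch of that conversion, using the values from the openssl line above (the function name is only illustrative):

```python
def openssl_k_to_mib_s(value_k):
    """Convert an 'openssl speed' figure (1000s of bytes/s) to MiB/s."""
    return value_k * 1000 / (1024 * 1024)


# aes-128-cbc figures from the openssl speed output above, keyed by block size.
openssl_aes_128_cbc = {
    16: 12646.42,
    64: 16806.66,
    256: 18349.31,
    1024: 18762.07,
    8192: 18896.21,
    16384: 18879.83,
}

if __name__ == "__main__":
    for block_size, value_k in openssl_aes_128_cbc.items():
        print(f"{block_size:>6} bytes: {openssl_k_to_mib_s(value_k):.1f} MiB/s")
```

With large blocks this works out to about 18 MiB/s, which is in the same range as the 15.8 MiB/s that cryptsetup benchmark reports for the CPU-only aes-cbc 128b case.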

However, when I format a partition on the NAS HDD with
cryptsetup + crypto4xx and use hdparm -t / dd

# hdparm -t /dev/mapper/aes-cbc-hw-test

/dev/mapper/aes-cbc-hw-test:
  Timing buffered disk reads:  96 MB in  3.05 seconds =  31.46 MB/sec

# dd if=/dev/mapper/aes-cbc-hw-test of=/dev/null bs=8M status=progress
5318377472 bytes (5.3 GB, 5.0 GiB) copied, 143 s, 37.2 MB/s^C
639+0 records in
638+0 records out
5351931904 bytes (5.4 GB, 5.0 GiB) copied, 144.246 s, 37.1 MB/s

whereas without crypto4xx:

# hdparm -t /dev/mapper/aes-cbc-hw-test

/dev/mapper/aes-cbc-hw-test:
  Timing buffered disk reads:  34 MB in  3.14 seconds =  10.82 MB/sec

# dd if=/dev/mapper/aes-cbc-hw-test of=/dev/null bs=8M status=progress
46+0 records in
45+0 records out
377487360 bytes (377 MB, 360 MiB) copied, 33.1952 s, 11.4 MB/s

This is roughly 3 times the throughput that the CPU alone could do.
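
The speedup can be checked directly from the dd and hdparm output above. A minimal sketch, assuming dd's "MB" means 10^6 bytes (which its own "5.4 GB, 5.0 GiB" line confirms); the variable names are only illustrative:

```python
MIB = 1024 * 1024


def throughput_mb_s(total_bytes, seconds):
    """Throughput in decimal MB/s, the unit dd reports."""
    return total_bytes / seconds / 1e6


# Final dd summary lines from the two runs above.
dd_hw = throughput_mb_s(5351931904, 144.246)   # with crypto4xx
dd_sw = throughput_mb_s(377487360, 33.1952)    # without crypto4xx

dd_speedup = dd_hw / dd_sw
hdparm_speedup = 31.46 / 10.82                 # hdparm -t figures from above

# Convert the dd result to MiB/s for comparison with cryptsetup benchmark.
dd_hw_mib_s = dd_hw * 1e6 / MIB

if __name__ == "__main__":
    print(f"dd with crypto4xx:    {dd_hw:.1f} MB/s ({dd_hw_mib_s:.1f} MiB/s)")
    print(f"dd without crypto4xx: {dd_sw:.1f} MB/s")
    print(f"dd speedup:           {dd_speedup:.2f}x")
    print(f"hdparm speedup:       {hdparm_speedup:.2f}x")
```

Note that the dd figure with crypto4xx is several times higher than the 8.0 to 8.7 MiB/s that cryptsetup benchmark reported for the same hardware, which is the point being made about the benchmark.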

@Thara, Do you have a usb-3.0 + fast 3.0 usb-stick? If so, try
to format a partition on it for cryptsetup and try it there.

Cheers,
Christian

