linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* dm-crypt optimization
@ 2016-12-20  9:41 Binoy Jayan
  2016-12-21 12:47 ` Milan Broz
  0 siblings, 1 reply; 5+ messages in thread
From: Binoy Jayan @ 2016-12-20  9:41 UTC (permalink / raw)
  To: Milan Broz, Oded, Ofir, Herbert Xu, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg
  Cc: dm-devel, linux-crypto, Rajendra, Linux kernel mailing list,
	linux-raid, Shaohua Li, Mike Snitzer

At a high level the goal is to maximize the size of data blocks that get passed
to hardware accelerators, minimizing the overhead from setting up and tearing
down operations in the hardware. Currently dm-crypt itself is a big blocker as
it manually implements ESSIV and similar algorithms which allow per-block
encryption of the data so the low level operations from the crypto API can
only operate on a single block. This is done because currently the crypto API
doesn't have software implementations of these algorithms itself so dm-crypt
can't rely on it being able to provide the functionality. The plan to address
this was to provide some software implementations in the crypto API, then
update dm-crypt to rely on those. Even for a pure software implementation
with no hardware acceleration that should hopefully provide a small
optimization, as we need to call into the crypto API less often, but it is
likely to be marginal given the overhead of the crypto itself; the real win
would be on a system with an accelerator that can replace the software
implementation.
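For reference, the ESSIV scheme dm-crypt implements derives each sector's IV
by encrypting the sector number under a hash of the volume key. A minimal,
runnable model of that derivation (stdlib-only, with a toy XOR "cipher"
standing in for AES; the function names are illustrative, not the kernel API):

```python
import hashlib

def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    # Stand-in for a real block cipher such as AES; for this sketch it only
    # needs to be a deterministic, key-dependent transformation.
    stream = hashlib.sha256(key + block).digest()
    return bytes(a ^ b for a, b in zip(block, stream))

def essiv_iv(volume_key: bytes, sector: int, iv_size: int = 16) -> bytes:
    # ESSIV: salt = H(key); IV = E_salt(sector number, little-endian).
    # The IV is unpredictable without the key, yet reproducible per sector.
    salt = hashlib.sha256(volume_key).digest()
    sector_le = sector.to_bytes(iv_size, "little")
    return toy_block_encrypt(salt, sector_le)
```

Because the IV depends only on the key and the sector number, an
implementation living below dm-crypt (in the crypto layer or in hardware) can
regenerate it per sector without a separate call from dm-crypt for each block.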

Currently dm-crypt handles data only in single blocks. This means that it can't
make good use of hardware cryptography engines since there is an overhead to
each transaction with the engine but transfers must be split into block sized
chunks. Allowing the transfer of larger blocks, e.g. a whole 'struct bio',
could mitigate these costs and improve performance in operating systems with
encrypted filesystems. Although Qualcomm chipsets support another
device-mapper variant, dm-req-crypt, it is neither generic nor in a
mainline-able state. Also, it supports only the 'XTS-AES' mode of encryption
and is not compatible with the other modes supported by dm-crypt.

However, there are some challenges and a few possibilities to address this. I
would appreciate your suggestions on whether the points mentioned below make
sense and whether this could be done differently.

1. Move the 'real' IV generation algorithms (e.g. essiv) to the crypto layer.
2. Increase the 'length' of the scatterlist nodes used in the crypto API. It
   can be made equal to the size of a main memory segment (as defined in
   'struct bio'), as segments are physically contiguous.
3. Multiple segments in a 'struct bio' can be represented as a scatterlist of
   all the segments in that bio.

4. Move the algorithms 'lmk' and 'tcw' (which are IV generators combined with
   hacks to the cbc mode) into customized cbc algorithms, each implemented in
   a separate file (e.g. cbc_lmk.c/cbc_tcw.c). As Milan suggested, these
   cannot be treated as real IVs, as they include hacks to the cbc mode (and
   directly manipulate the encrypted data).

5. Move the key selection logic to user space, or always assume a keycount of
   '1' (as mentioned in the dm-crypt param format below), so that the key
   selection logic does not depend on the sector number. This is necessary
   because the key is otherwise selected based on the sector number:

   key_index = sector & (key_count - 1)

   If the block size for scatterlist nodes is increased beyond the sector
   boundary (which is what we plan to do, for performance), the key used for a
   cipher operation cannot be changed at the sector level.

   dm-crypt param format : cipher[:keycount]-mode-iv:ivopts
   Example               : aes:2-cbc-essiv:sha256

   Also as Milan suggested, it is not wise to move the key selection logic to
   the crypto layer as it will prevent any changes to the key structure later.
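To make point 5 concrete, here is a small stdlib-only model (the helper name
is hypothetical, not dm-crypt's) of the per-sector key selection that
multi-sector batching would break:

```python
def key_index(sector: int, key_count: int) -> int:
    # dm-crypt-style per-sector key selection; key_count is a power of two,
    # so the low bits of the sector number pick the key.
    assert key_count > 0 and key_count & (key_count - 1) == 0
    return sector & (key_count - 1)

# With keycount > 1, adjacent sectors use different keys, so one request
# spanning several sectors cannot be encrypted under a single key:
keys_for_request = {key_index(s, 2) for s in range(8)}  # both key 0 and key 1
# With keycount == 1, every sector maps to the same key and batching is safe:
single_key = {key_index(s, 1) for s in range(8)}        # only key 0
```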

The following is a reference to an earlier patchset. It mixed the cipher mode
'cbc' with the IV algorithms, which is not the preferred approach.

Reference:
https://lkml.org/lkml/2016/12/13/65
https://lkml.org/lkml/2016/12/13/66

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: dm-crypt optimization
  2016-12-20  9:41 dm-crypt optimization Binoy Jayan
@ 2016-12-21 12:47 ` Milan Broz
  2016-12-22  8:25   ` Binoy Jayan
  0 siblings, 1 reply; 5+ messages in thread
From: Milan Broz @ 2016-12-21 12:47 UTC (permalink / raw)
  To: Binoy Jayan, Oded, Ofir, Herbert Xu, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg
  Cc: dm-devel, linux-crypto, Rajendra, Linux kernel mailing list,
	linux-raid, Shaohua Li, Mike Snitzer

On 12/20/2016 10:41 AM, Binoy Jayan wrote:
> At a high level the goal is to maximize the size of data blocks that get passed
> to hardware accelerators, minimizing the overhead from setting up and tearing
> down operations in the hardware. Currently dm-crypt itself is a big blocker as
> it manually implements ESSIV and similar algorithms which allow per-block
> encryption of the data so the low level operations from the crypto API can
> only operate on a single block. This is done because currently the crypto API
> doesn't have software implementations of these algorithms itself so dm-crypt
> can't rely on it being able to provide the functionality. The plan to address
> this was to provide some software implementations in the crypto API, then
> update dm-crypt to rely on those. Even for a pure software implementation
> with no hardware acceleration that should hopefully provide a small
> optimization, as we need to call into the crypto API less often, but it is
> likely to be marginal given the overhead of the crypto itself; the real win
> would be on a system with an accelerator that can replace the software
> implementation.
> 
> Currently dm-crypt handles data only in single blocks. This means that it can't
> make good use of hardware cryptography engines since there is an overhead to
> each transaction with the engine but transfers must be split into block sized
> chunks. Allowing the transfer of larger blocks, e.g. a whole 'struct bio',
> could mitigate these costs and improve performance in operating systems with
> encrypted filesystems. Although Qualcomm chipsets support another
> device-mapper variant, dm-req-crypt, it is neither generic nor in a
> mainline-able state. Also, it supports only the 'XTS-AES' mode of encryption
> and is not compatible with the other modes supported by dm-crypt.

So the core problem is that your crypto accelerator can operate efficiently only
with bigger batch sizes.

How big do the blocks need to be for your crypto hw to operate more
efficiently? What about 4k blocks (no batching), could that be a usable
trade-off?

With some (backward incompatible) changes in the LUKS format I would like to
see support for encryption blocks equal to the sector size, so for a 4k drive
that basically means a 4k encryption block.
(This should decrease overhead; currently everything is processed in 512-byte
blocks only.)
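The overhead reduction is easy to quantify: the number of crypto API calls per
request scales inversely with the encryption block size. A back-of-the-envelope
check (illustrative only, ignoring any per-byte cost):

```python
def crypto_calls(request_bytes: int, block_size: int) -> int:
    # One call into the crypto API per encryption block of the request.
    return request_bytes // block_size

MIB = 1 << 20
calls_512 = crypto_calls(MIB, 512)    # 512-byte sectors: 2048 calls per MiB
calls_4k = crypto_calls(MIB, 4096)    # 4k blocks: 256 calls per MiB, 8x fewer
```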

Support for bigger block sizes would be unsafe without an additional mechanism
that provides atomic writes of multiple sectors. (Maybe this applies to 4k as
well on some devices, though.)

The above is not going against your proposal; I am just curious whether this
is enough to provide better performance on your hw accelerator.

Milan

> However, there are some challenges and a few possibilities to address this.
> I would appreciate your suggestions on whether the points mentioned below
> make sense and whether this could be done differently.
> 
> 1. Move the 'real' IV generation algorithms (e.g. essiv) to the crypto layer.
> 2. Increase the 'length' of the scatterlist nodes used in the crypto API. It
>    can be made equal to the size of a main memory segment (as defined in
>    'struct bio'), as segments are physically contiguous.
> 3. Multiple segments in a 'struct bio' can be represented as a scatterlist of
>    all the segments in that bio.
> 
> 4. Move the algorithms 'lmk' and 'tcw' (which are IV generators combined with
>    hacks to the cbc mode) into customized cbc algorithms, each implemented in
>    a separate file (e.g. cbc_lmk.c/cbc_tcw.c). As Milan suggested, these
>    cannot be treated as real IVs, as they include hacks to the cbc mode (and
>    directly manipulate the encrypted data).
> 
> 5. Move the key selection logic to user space, or always assume a keycount of
>    '1' (as mentioned in the dm-crypt param format below), so that the key
>    selection logic does not depend on the sector number. This is necessary
>    because the key is otherwise selected based on the sector number:
> 
>    key_index = sector & (key_count - 1)
> 
>    If the block size for scatterlist nodes is increased beyond the sector
>    boundary (which is what we plan to do, for performance), the key used for a
>    cipher operation cannot be changed at the sector level.
> 
>    dm-crypt param format : cipher[:keycount]-mode-iv:ivopts
>    Example               : aes:2-cbc-essiv:sha256
> 
>    Also as Milan suggested, it is not wise to move the key selection logic to
>    the crypto layer as it will prevent any changes to the key structure later.
> 
> The following is a reference to an earlier patchset. It mixed the cipher mode
> 'cbc' with the IV algorithms, which is not the preferred approach.
> 
> Reference:
> https://lkml.org/lkml/2016/12/13/65
> https://lkml.org/lkml/2016/12/13/66
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: dm-crypt optimization
  2016-12-21 12:47 ` Milan Broz
@ 2016-12-22  8:25   ` Binoy Jayan
  2016-12-22  8:59     ` Herbert Xu
  0 siblings, 1 reply; 5+ messages in thread
From: Binoy Jayan @ 2016-12-22  8:25 UTC (permalink / raw)
  To: Milan Broz
  Cc: Oded, Ofir, Herbert Xu, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg, dm-devel,
	linux-crypto, Rajendra, Linux kernel mailing list, linux-raid,
	Shaohua Li, Mike Snitzer

Hi Milan,

On 21 December 2016 at 18:17, Milan Broz <gmazyland@gmail.com> wrote:

> So the core problem is that your crypto accelerator can operate efficiently only
> with bigger batch sizes.

Thank you for the reply. Yes, bigger block sizes would indeed be an
improvement.

> How big do the blocks need to be for your crypto hw to operate more
> efficiently? What about 4k blocks (no batching), could that be a usable
> trade-off?

The benchmark results for Qualcomm Snapdragon SoCs (link below) show a
significant improvement with 4K blocks, processed in batches of all the
contiguous segments in the block layer's request queue in the form of a
chained scatterlist. However, that setup uses the algorithm 'aes-xts' instead
of the conventional 'essiv-cbc-aes' used in dm-crypt. Also, it uses the
device mapper dm-req-crypt instead of dm-crypt.

http://nelenkov.blogspot.in/2015/05/hardware-accelerated-disk-encryption-in.html
Section : 'Performance'

It reports an IO rate of 46.3 MB/s, compared to an IO rate of 25.1 MB/s when
using software-based FDE (based on dm-crypt). But I am not sure how reliable
this data is or how it was measured.

Since Qualcomm SoCs use a hardware-backed keystore for managing keys, and
since there is no easy way to make dm-crypt work with Qualcomm's engines, I do
not have solid benchmark data showing improved performance with 4k blocks.

> With some (backward incompatible) changes in the LUKS format I would like to
> see support for encryption blocks equal to the sector size, so for a 4k drive
> that basically means a 4k encryption block.
> (This should decrease overhead; currently everything is processed in 512-byte
> blocks only.)
>
> Support for bigger block sizes would be unsafe without an additional mechanism
> that provides atomic writes of multiple sectors. (Maybe this applies to 4k as
> well on some devices, though.)

Did you mean writes to the crypto output buffers, or the actual disk writes?
I didn't quite understand how the encryption block size affects atomic writes,
as it is the block layer which handles them. As far as dm-crypt is concerned,
it just encrypts/decrypts a 'struct bio' instance and submits the IO operation
to the block layer.

> The above is not going against your proposal; I am just curious whether this
> is enough to provide better performance on your hw accelerator.

Maybe I can procure an open crypto board and get back to you with some
results, or perhaps show even a marginal improvement with a pure software
algorithm by avoiding the crypto overhead for every 512 bytes.

-Binoy

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: dm-crypt optimization
  2016-12-22  8:25   ` Binoy Jayan
@ 2016-12-22  8:59     ` Herbert Xu
  2016-12-22 10:14       ` Ofir Drang
  0 siblings, 1 reply; 5+ messages in thread
From: Herbert Xu @ 2016-12-22  8:59 UTC (permalink / raw)
  To: Binoy Jayan
  Cc: Milan Broz, Oded, Ofir, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg, dm-devel,
	linux-crypto, Rajendra, Linux kernel mailing list, linux-raid,
	Shaohua Li, Mike Snitzer

On Thu, Dec 22, 2016 at 01:55:59PM +0530, Binoy Jayan wrote:
>
> > Support for bigger block sizes would be unsafe without an additional
> > mechanism that provides atomic writes of multiple sectors. (Maybe this
> > applies to 4k as well on some devices, though.)
> 
> Did you mean writes to the crypto output buffers, or the actual disk writes?
> I didn't quite understand how the encryption block size affects atomic
> writes, as it is the block layer which handles them. As far as dm-crypt is
> concerned, it just encrypts/decrypts a 'struct bio' instance and submits the
> IO operation to the block layer.

I think Milan's talking about increasing the real block size, which
would obviously require the hardware to be able to write that out
atomically, as otherwise it breaks the crypto.

But if we can instead do the IV generation within the crypto API,
then the block size won't be an issue at all.  Because you can
supply as many blocks as you want and they would be processed
block-by-block.
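Herbert's point, that the crypto layer could accept arbitrarily large inputs
and handle per-sector IVs internally, can be sketched in stdlib-only Python.
The XOR keystream and helper names here are illustrative stand-ins for a real
cipher mode, not the actual kernel API:

```python
import hashlib

SECTOR = 512

def _keystream(key: bytes, iv: bytes, length: int) -> bytes:
    # Toy keystream generator; a real implementation would run AES in a
    # proper mode (CBC/XTS) instead.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + iv + counter.to_bytes(4, "little")).digest()
        counter += 1
    return out[:length]

def encrypt_request(key: bytes, data: bytes, start_sector: int) -> bytes:
    # Accept an arbitrarily large buffer; derive a fresh ESSIV-style IV and
    # encrypt sector by sector, so the caller never has to split the request.
    salt = hashlib.sha256(key).digest()
    out = bytearray()
    for off in range(0, len(data), SECTOR):
        sector = start_sector + off // SECTOR
        iv = hashlib.sha256(salt + sector.to_bytes(16, "little")).digest()[:16]
        chunk = data[off:off + SECTOR]
        out += bytes(a ^ b for a, b in zip(chunk, _keystream(key, iv, len(chunk))))
    return bytes(out)
```

Because each sector is processed independently under its own IV, the result of
one big call matches the concatenation of per-sector calls, which is exactly
why the block size stops mattering at the dm-crypt level.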

Now there is a disadvantage to this approach, and that is you
have to wait for the whole thing to be encrypted before you can 
start doing the IO.  I'm not sure how big a problem that is but
if it is bad enough to affect performance, we can look into adding
some form of partial completion to the crypto API.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: dm-crypt optimization
  2016-12-22  8:59     ` Herbert Xu
@ 2016-12-22 10:14       ` Ofir Drang
  0 siblings, 0 replies; 5+ messages in thread
From: Ofir Drang @ 2016-12-22 10:14 UTC (permalink / raw)
  To: Herbert Xu, Binoy Jayan
  Cc: Milan Broz, Oded Golombek, Arnd Bergmann, Mark Brown,
	Alasdair Kergon, David S. Miller, private-kwg, dm-devel,
	linux-crypto, Rajendra, Linux kernel mailing list, linux-raid,
	Shaohua Li, Mike Snitzer



-----Original Message-----
From: Herbert Xu [mailto:herbert@gondor.apana.org.au]
Sent: Thursday, December 22, 2016 10:59 AM
To: Binoy Jayan
Cc: Milan Broz; Oded Golombek; Ofir Drang; Arnd Bergmann; Mark Brown; Alasdair Kergon; David S. Miller; private-kwg@linaro.org; dm-devel@redhat.com; linux-crypto@vger.kernel.org; Rajendra; Linux kernel mailing list; linux-raid@vger.kernel.org; Shaohua Li; Mike Snitzer
Subject: Re: dm-crypt optimization

On Thu, Dec 22, 2016 at 01:55:59PM +0530, Binoy Jayan wrote:
>>
>> > Support for bigger block sizes would be unsafe without an additional
>> > mechanism that provides atomic writes of multiple sectors. (Maybe this
>> > applies to 4k as well on some devices, though.)
>>
>> Did you mean writes to the crypto output buffers, or the actual disk writes?
>> I didn't quite understand how the encryption block size affects atomic
>> writes, as it is the block layer which handles them. As far as dm-crypt is
>> concerned, it just encrypts/decrypts a 'struct bio' instance and submits
>> the IO operation to the block layer.

> I think Milan's talking about increasing the real block size, which would
> obviously require the hardware to be able to write that out atomically, as
> otherwise it breaks the crypto.
>
> But if we can instead do the IV generation within the crypto API, then the
> block size won't be an issue at all.  Because you can supply as many blocks
> as you want and they would be processed block-by-block.
>
> Now there is a disadvantage to this approach, and that is you have to wait
> for the whole thing to be encrypted before you can start doing the IO.  I'm
> not sure how big a problem that is but if it is bad enough to affect
> performance, we can look into adding some form of partial completion to the
> crypto API.
>
> Cheers,

But assuming we have a hardware accelerator that knows how to handle the IV
generation for each sector, it makes sense to send the hardware the maximum
block size, as this will allow us to better utilize the hardware and offload
the software. So, if possible, we need to provide a generic interface that can
make the best use of such hardware accelerators.

Thx Ofir

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-12-22 10:16 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-20  9:41 dm-crypt optimization Binoy Jayan
2016-12-21 12:47 ` Milan Broz
2016-12-22  8:25   ` Binoy Jayan
2016-12-22  8:59     ` Herbert Xu
2016-12-22 10:14       ` Ofir Drang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).