linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Pascal Van Leeuwen <pvanleeuwen@insidesecure.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
	"David S. Miller" <davem@davemloft.net>,
	"Jason A. Donenfeld" <Jason@zx2c4.com>,
	Eric Biggers <ebiggers@kernel.org>,
	Linux Crypto Mailing List <linux-crypto@vger.kernel.org>,
	"linux-fscrypt@vger.kernel.org" <linux-fscrypt@vger.kernel.org>,
	linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Paul Crowley <paulcrowley@google.com>,
	Greg Kaiser <gkaiser@google.com>,
	Samuel Neves <samuel.c.p.neves@gmail.com>,
	Tomer Ashur <tomer.ashur@esat.kuleuven.be>,
	Martin Willi <martin@strongswan.org>
Subject: RE: [PATCH 0/17] Add zinc using existing algorithm implementations
Date: Mon, 25 Mar 2019 09:10:08 +0000	[thread overview]
Message-ID: <AM5PR0901MB1155FD05113D6DC79A6274CED25E0@AM5PR0901MB1155.eurprd09.prod.outlook.com> (raw)
In-Reply-To: <CAHk-=wg2LJ5qQ0B2y+_6Ue62SBP4h9MLxLvn89bfcP7Cp2ac6A@mail.gmail.com>

As someone who has been working on crypto acceleration hardware for the better
part of the past 20 years, I feel compelled to respond to this, in defence of
the crypto API (which we're really happy with ...).

> And honestly, I'm 1000% with Jason on this. The crypto/ model is hard to use,
> inefficient, and completely pointless when you know what your cipher or
> hash algorithm is, and your CPU just does it well directly.
>
> > we even have support already for async accelerators that implement it,
>
> Afaik, none of the async accelerator code has ever been worth anything on
> real hardware and on any sane and real loads. The cost of going outside the
> CPU is *so* expensive that you'll always lose, unless the algorithm has been
> explicitly designed to be insanely hard on a regular CPU.
>
The days of designing them *specifically* to be hard on a CPU are over, but
nevertheless, due to required security properties, *good* crypto is usually
still hard work for a CPU.
Especially *asymmetric* crypto, which can easily take milliseconds per operation
even on a high-end CPU. You can do a lot of interrupt processing in a millisecond
Now some symmetric algorithms (AES,SHA,GHASH) have actually made it *into*
the CPU, somewhat changing the landscape, but:
a) That's really only a small subset of the crypto being used out there.
   There's much more to crypto than just AES, SHA and GHASH.
b) This only applies to high-end CPU's. The large majority of embedded CPU's do
   not have such embedded crypto instructions. This may change, but I don't
   really expect that to happen soon.
c) You're still keeping a big, power-hungry CPU busy for a lot of cycles doing
   some fairly trivial work.

> (Corollary: "insanely hard on a regular CPU" is also easy to do by making the
> CPU be weak and bad. Which is not a case we should optimize for).
>

Linux is being used a lot for embedded systems which usually DO have fairly
weak CPU's. I would argue that's exactly the case you SHOULD optimize for,
as that's where you can *really* still make the difference as a programmer.

> The whole "external accelerator" model is odd. It was wrong. It only makes
> sense if the accelerator does *everything* (ie it's the network card), and
> then you wouldn't use the wireguard thing on the CPU at all, you'd have all
> those things on the accelerator (ie a "network card that does WG").
>
> One of the (best or worst, depending on your hangups) arguments for
> external accelerators has been "but I trust the external hardware with the
> key, but not my own code", aka the TPM or Disney argument.  I don't think
> that's at all relevant to the discussion either.
>
> The whole model of async accelerators is completely bogus. The only crypto
> or hash accelerator that is worth it are the ones integrated on the CPU cores,
> which have the direct access to caches.
>
NOT true. We wouldn't still be in business after 20 years if that were true.

> And if the accelerator is some tightly coupled thing that has direct access to
> your caches, and doesn't need interrupt overhead or address translation etc
> (at which point it can be worth using) then you might as well just consider it
> an odd version of the above. You'd want to poll for the result anyway,
> because not polling is too expensive.
>
It's HARD to get the interfacing right to take advantage of the acceleration.
And that's exacly why you MUST have an async interface for that: to cope with
the large latency of external acceleration, going through the memory subsystem
and external buses as - as you already pointed out - you cannot access the CPU
cache directly (though you can be fully coherent with it).
So to cope with that latency, you will need to batch queue and pipeline your
processing. This is not unique to crypto acceleration, the same principles
apply to e.g. your GPU as well. Or any HW worth anything, for that matter.

What that DOES mean is that external crypto accelerators are indeed useless for
doing the occasional crypto operation, it really only makes sense for streaming
and/or bulk use cases, such as network packet or disk encryption. And for
operations that really take significant time, such as asymmetic crypto.
For the occasional symmetric crypto operation, by all means, do that on the CPU
using a very thin API layer for efficiency and simplicity. This is where Zinc
makes a lot of sense - I'm not against Zinc at all. But DO realise that when
you go that route, you forfeit any chances of benefitting from acceleration.

Ironically, for doing what Wireguard is doing - bulk packet encryption - the
async crypto API makes a lot more sense than Zinc. In Jason's defense, as
far as I know, there is no Poly/Chacha HW acceleration out there yet, but I
can assure you that that situation is going to change soon :-)
Still,  I would really recommend running Wireguard on top of crypto API.
How much performance would you really lose by doing that? If there's
anything wrong with the efficiency of the crypto API implementation of
P/C, then just fix that.

> Just a single interrupt would completely undo all the advantages you got
> from using specialized hardware - both power and performance.
>
We have done plenty of measurements, on both power and performance, to prove
you wrong there. Typically our HW needs *at least* a full order of a magnitude
less power to do the actual work. The CPU load for handing the interrupts etc.
tends to be around 20%. So assuming the CPU goes to sleep for the other 80%
of the time,  the combined solution would need about 1/3rd of the power of a
CPU only solution. It's one of our biggest selling points.

As for performance - in the embedded space you can normally expect the crypto
accelerator to be *much* faster than the CPU subsystem as well. So yes, big
multi-core multi-GHz Xeons can do AES-GCM extremely fast and we'd have a hard
time competing with that, but that's not the relevant use case here.

> And that kind of model would work just fine with zinc.

>
> So an accelerator ends up being useful in two cases:
>
>  - it's entirely external and part of the network card, so that there's no extra
> data transfer overhead
>
There are inherent efficiency advantages to integrating the crypto there, but
it's also very inflexible as you're stuck with a particular, usually very limited,
implementation. And it doesn't fly very well with current trends of Software
Defined Networking and Network Function Virtualization.
As for data transfer overhead: as long as the SoC memory subsystem was designed
to cope with this, it's really irrelevant as you shouldn't notice it.

>  - it's tightly coupled enough (either CPU instructions or some on-die cache
> coherent engine) that you can and will just use it synchronously anyway.
>
> In the first case, you wouldn't run wireguard on the CPU anyway - you have a
> network card that just implements the VPN.
>
The irony here is, that network card would probably be an embedded system running
Linux on an embedded CPU, offloading the crypto operations to our accelerator ...
Most smart NICs are architected that way.

Pascal van Leeuwen,
Silicon IP Architect @ Inside Secure

  reply	other threads:[~2019-03-25  9:11 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-22  6:27 [PATCH 0/17] Add zinc using existing algorithm implementations Herbert Xu
2019-03-22  6:29 ` [PATCH 1/17] asm: simd context helper API Herbert Xu
2019-03-22  6:29 ` [PATCH 2/17] crypto: chacha20 - Export chacha20 functions without crypto API Herbert Xu
2019-03-22  6:29 ` [PATCH 3/17] zinc: introduce minimal cryptography library Herbert Xu
2019-03-22  6:29 ` [PATCH 4/17] zinc: Add generic C implementation of chacha20 and self-test Herbert Xu
2019-03-22  6:29 ` [PATCH 5/17] zinc: Add x86 accelerated ChaCha20 Herbert Xu
2019-03-22  6:29 ` [PATCH 6/17] zinc: Add arm accelerated chacha20 Herbert Xu
2019-03-22  6:29 ` [PATCH 7/17] crypto: poly1305 - Export core functions without crypto API Herbert Xu
2019-03-22  6:29 ` [PATCH 8/17] zinc: Add generic C implementation of poly1305 and self-test Herbert Xu
2019-03-22  6:29 ` [PATCH 9/17] zinc: Add x86 accelerated poly1305 Herbert Xu
2019-03-22  6:29 ` [PATCH 10/17] zinc: ChaCha20Poly1305 construction and selftest Herbert Xu
2019-03-22  6:29 ` [PATCH 11/17] zinc: BLAKE2s generic C implementation " Herbert Xu
2019-03-22  6:29 ` [PATCH 12/17] zinc: BLAKE2s x86_64 implementation Herbert Xu
2019-03-22  6:29 ` [PATCH 13/17] zinc: Curve25519 generic C implementations and selftest Herbert Xu
2019-03-22  6:29 ` [PATCH 14/17] zinc: Curve25519 x86_64 implementation Herbert Xu
2019-03-22  6:29 ` [PATCH 15/17] zinc: import Bernstein and Schwabe's Curve25519 ARM implementation Herbert Xu
2019-03-22  6:29 ` [PATCH 16/17] zinc: " Herbert Xu
2019-03-22  6:29 ` [PATCH 17/17] security/keys: rewrite big_key crypto to use Zinc Herbert Xu
2019-03-22  6:41 ` [PATCH 0/17] Add zinc using existing algorithm implementations Jason A. Donenfeld
2019-03-22  7:56 ` Ard Biesheuvel
2019-03-22  8:10   ` Jason A. Donenfeld
2019-03-22 17:48   ` Linus Torvalds
2019-03-25  9:10     ` Pascal Van Leeuwen [this message]
2019-03-26  9:46       ` Riku Voipio
2019-04-09 16:14         ` Pascal Van Leeuwen
2019-03-25 10:43     ` Ard Biesheuvel
2019-03-25 10:45       ` Jason A. Donenfeld

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=AM5PR0901MB1155FD05113D6DC79A6274CED25E0@AM5PR0901MB1155.eurprd09.prod.outlook.com \
    --to=pvanleeuwen@insidesecure.com \
    --cc=Jason@zx2c4.com \
    --cc=ard.biesheuvel@linaro.org \
    --cc=davem@davemloft.net \
    --cc=ebiggers@kernel.org \
    --cc=gkaiser@google.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-fscrypt@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=martin@strongswan.org \
    --cc=paulcrowley@google.com \
    --cc=samuel.c.p.neves@gmail.com \
    --cc=tomer.ashur@esat.kuleuven.be \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).