RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

From: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Linux Crypto Mailing List <linux-crypto@vger.kernel.org>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	David Miller <davem@davemloft.net>,
	Greg KH <gregkh@linuxfoundation.org>,
	"Jason A . Donenfeld" <Jason@zx2c4.com>,
	Samuel Neves <sneves@dei.uc.pt>,
	Dan Carpenter <dan.carpenter@oracle.com>,
	Arnd Bergmann <arnd@arndb.de>, Eric Biggers <ebiggers@google.com>,
	Andy Lutomirski <luto@kernel.org>, Will Deacon <will@kernel.org>,
	Marc Zyngier <maz@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>
Subject: RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
Date: Mon, 30 Sep 2019 20:44:06 +0000
Message-ID: <CH2PR20MB2968692F693A5A8CB71CE301CA820@CH2PR20MB2968.namprd20.prod.outlook.com>
In-Reply-To: <CAHk-=wj9BSMzoDD31R-ymjGpkpt0u-ndX6+p0ZWsrJFDTAN+zg@mail.gmail.com>

> -----Original Message----
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Friday, September 27, 2019 6:24 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; Linux ARM <linux-arm-kernel@lists.infradead.org>; Herbert Xu
> <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Jason A . Donenfeld <Jason@zx2c4.com>; Samuel Neves
> <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>;
> Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon
> <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
> 
> On Fri, Sep 27, 2019 at 2:58 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > > I'd want to see wireguard in an end-to-end situation from the very
> > > client hardware. So laptops, phones, desktops. Not the untrusted (to
> > > me) hw in between.
> > >
> > I don't see why the crypto HW would deserve any less trust than, say,
> > the CPU itself. I would say CPU's don't deserve that trust at the moment.
> 
> It's not the crypto engine that is part of the untrusted hardware.
> It's the box itself, and the manufacturer, and you having to trust
> that the manufacturer didn't set up some magic knocking sequence to
> disable the encryption.
> 
> Maybe the company that makes them is trying to do a good job. But
> maybe they are based in a country that has laws that require
> backdoors.
> 
> Say, France. There's a long long history of that kind of thing.
> 
> It's all to "fight terrorism", but hey, a little industrial espionage
> is good too, isn't it? So let's just disable GSM encryption based on
> geographic locale and local regulation, shall we.
> 
> Yeah, yeah, GSM encryption wasn't all that strong to begin with, but
> it was apparently strong enough that France didn't want it.
> 
> So tell me again why I should trust that box that I have no control over?
> 
Same reason you trust your PC hardware you have no control over?
(That CPU is assembled in Malaysia, your motherboard likely in China.
And not being a US citizen, *I* wouldn't trust anything out of the US
anyway, _knowing_ they've been actively spying on us for decades ...)

In case you worry about the software part: of course you'd be running
something open-source and Linux based like DD-WRT on that router ...

Personally I'm not that paranoid and I really like to offload all the
silly crypto heavy-lifting to my router box, where it belongs.

> > Well, that's the general idea of abstraction. It also allows for
> > swapping in any other cipher with minimal effort just _because_ the
> > details were hidden from the application. So it may cost you some
> > effort initially, but it may save you effort later.
> 
> We clearly disagree on the utility of crypto agility. You point to
> things like ipsec as an argument for it.
> 
I don't recall doing specifically that, but anyway.

> And I point to ipsec as an argument *against* that horror. It's a
> bloated, inefficient, horribly complex mess. And all the "agility" is
> very much part of it.
> 
Oh really? I've been working on implementations thereof for nearly 2
decades, but I don't recognise this at all, at least not for the datapath.
IPsec actually made a significant effort to keep the packet format the
same across all extensions done over its 20+ year history. The cipher
agility is mostly abstracted away from the base protocol, allowing us to
add new ciphersuites - to hardware, no less! - with very minimal effort.

In any case, while I believe in the KISS principle, I also believe that
things should be as simple as possible, but _no simpler than that_ (A.E.)
Oversimplification is the evil twin of overcomplication.

> I also point to GSM as a reason against "agility". It has caused way
> more security problems than it has ever solved. The "agility" is
> often a way to turn off (or tune down) the encryption, not as a way to
> say "ok, we can improve it later".
> 
> That "we can improve it later" is a bedtime story. It's not how it
> gets used. Particularly as the weaknesses are often not primarily in
> the crypto algorithm itself, but in how it gets used or other session
> details.
> 
I don't see what this has to do with cipher agility. Cipher agility has
nothing to do with "improving things later" and everything to do with the
realisation that, someday, some clever person _will_ find some weakness.

> When you actually want to *improve* security, you throw the old code
> away, and start a new protocol entirely. Eg SSL -> TLS.
> 
Uhm. Now you're starting to show some ignorance ...

TLS was NOT a new protocol. It was a simple rename of a very minor evolution
of SSL 3.0. Has been for all versions up to and including TLS 1.2. And YES,
THAT was a mistake, because SSL was just a very poor starting point.
For TLS 1.3 they finally did a (reasonably) proper redesign.
(Fun fact: SSL was _not_ designed by a committee, but TLS 1.3 _was_ ...)

> So cryptographic agility is way oversold, and often people are
> actively lying about why they want it. And the people who aren't lying
> are ignoring the costs.
> 
I wouldn't know what they could be lying about, crypto agility is 
just common sense risk spreading.

> One of the reasons _I_ like wireguard is that it just went for simple
> and secure. No BS.
> 
You and me both, BTW. I just don't want it to be _too_ simple.

> And you say
> 
> > Especially since all crypto it uses comes from a single
> > source (DJB), which is frowned upon in the industry.
> 
> I'm perhaps not a fan of DJB in all respects, but there's no question
> that he's at least competent.
> 
I have nothing against DJB, I've enjoyed many of his presentations.
I might even be a fan. I certainly don't doubt his competence.

But being as paranoid as you are: can you really TRUST the guy? ;-)
And as good as he is: there may be some weakness in the algorithm(s)
discovered _tomorrow_ and in that case _I_ would want to be able to
switch to an alternative instantly.
(and I believe for some big international organisation critically 
depending on such a VPN to connect all their branch offices around
the world while protecting their trade secrets, this is likely to
be even more important - they probably wouldn't want to wait until
Jason pulls Wireguard 2.0 out of his hat and certainly not for that
to pass certification and finally hit their devices months later ...)

I'm not talking about some convoluted and fragile negotiation scheme,
a static parameter in some config file is just fine for that. The 
textual crypto templates of the Crypto API just fit that use case
perfectly.
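
To illustrate the static-parameter idea, here is a minimal userspace
sketch, not actual Wireguard code. The config values ("chacha20poly1305",
"aes-gcm") and the helper name wg_aead_template() are made up for this
example; only the returned strings are meant to be Crypto API template
names of the kind a kernel driver could hand to crypto_alloc_aead().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from a config-file value to a Crypto API
 * AEAD template name. The config values are invented for this sketch. */
struct aead_choice {
	const char *config_value;  /* what the admin writes (made up) */
	const char *template_name; /* what the kernel would be asked for */
};

static const struct aead_choice aead_choices[] = {
	{ "chacha20poly1305", "rfc7539(chacha20,poly1305)" },
	{ "aes-gcm",          "gcm(aes)" },
};

/* Returns the template name for a config value, or NULL if the value
 * is not on the allow-list. No negotiation: unknown values are rejected. */
const char *wg_aead_template(const char *config_value)
{
	size_t i;

	if (!config_value)
		return NULL;
	for (i = 0; i < sizeof(aead_choices) / sizeof(aead_choices[0]); i++) {
		if (strcmp(config_value, aead_choices[i].config_value) == 0)
			return aead_choices[i].template_name;
	}
	return NULL;
}
```

In a kernel driver the resulting name would presumably be passed to
crypto_alloc_aead(name, 0, 0) once at setup time, so nothing is
negotiated on the wire and nothing is looked up on the hot path.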

And I have other reasons not to want to use Chacha-Poly, while I would
like to use the Wireguard _protocol_ itself:

1) Contrary to popular belief, Chacha-Poly is NOT the best choice of
   algorithms in terms of performance on many modern systems. On the
   quad core Cortex A72 system I'm working on here, AES-GCM is over 2
   times faster, even including Ard's Poly1305-Neon patches of last
   week (current mainline code for PC is even slower than that).
   Also, on modern Intel systems with AES-NI or VAES, AES-GCM 
   outperforms Chacha-Poly by a considerable margin. And, to make
   matters worse, running Chacha-Poly at high throughput is known to
   result in excessive thermal throttling on some recent Intel CPU's.
   Even if you don't need that throughput, it's nice to have more CPU
   power left to do useful work.
2) Chacha-Poly is inefficient in terms of power. For our hardware,
   it uses about 2x the power of AES-GCM and I have indications (e.g.
   the thermal throttling mentioned above) that this is no better for
   software implementations.

> The "industry practice" of having committees influenced by who knows
> what isn't all that much better. Do you want to talk about NSA
> elliptic curve constant choices?
> 
Which is actually an argument _in favor_ of crypto agility - you don't
want to be stuck with just one choice you may not trust ...
Options are _good_. (but do add some implementation complexity, sure)

> Anyway, on the costs:
> 
> > >  - dynamically allocate buffers at "init time"
> >
> > Why is that so "wrong"? It sure beats doing allocations on the hot path.
> 
> It's wrong not because the allocation is costly (you do that only
> once), but because the dynamic allocation means that you can't embed
> stuff in your own native data structures as a user.
> 
> So now accessing those things is no longer dense in the cache.
> 
I don't see how data allocated at _init time_ would be local in the 
cache at the time it is _finally_ used in some remote location, far
away in both space and time.

If you init and then immediately use, you may have a point, but
that should be the exception and not the rule.

> And it's the cache that matters for a synchronous CPU algorithm. You
> don't want the keys and state to be in some other location when you
> already have your data structures for the stream that could just have
> them right there with the other data.
> 
Yeah yeah, we all know that. But that only works for stuff that stays
in scope in the cache, not for stuff that has long since been pushed
out by other local variables.

And "other" memory that's used frequently (i.e. when it matters!) CAN
be cached too, you know :-) Modern prefetchers tend to be quite good,
too, so it shouldn't even matter if it gets flushed out temporarily.

> > And you don't want to have it on the stack initially and then have
> > to _copy_ it to some DMA-able location that you allocate on the fly
> > on the hot path if you _do_ want HW acceleration.
> 
> Actually, that's *exactly* what you want. You want keys etc to be in
> regular memory in a location that is convenient to the user, and then
> only if the hardware has issues do you say "ok, copy the key to the
> hardware". Because quite often the hardware will have very special key
> caches that aren't even available to the CPU, because they are on some
> hw-private buffers.
> 
Unfortunately, the only way to get that _into_ the HW is usually DMA
and that relies on DMA-capable memory. And copying significant data
around on the CPU tends to totally kill performance if you're in the
business of HW acceleration, so it's nice if it's already in a DMA
capable buffer. Assuming the cost of having it there is not excessive.

I don't care so much about the keys BTW, that should not be performance
critical as you set it only once in a long while.
But things like IV's etc. _may_ be another matter for _some_ hardware.
(Actually, for _my_ hardware I _only_ care about not having to copy the
actual _data_, so for all _I_ care everything else can be on the stack.
But alas, I'm not alone in the world ...)

> Yes, you want to have a "key identity" model so that the hardware
> doesn't have to reload it all the time, but that's an invalidation
> protocol, not a "put the keys or nonces in special places".
> 
Actually, that _is_ exactly how (most of) _our_ hardware works :-)
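
A rough sketch of what such a "key identity" model could look like from
the software side. This is purely illustrative and does not describe any
actual hardware or driver; all names, the slot count and the slot
selection by key_id modulo are assumptions made for the example.

```c
#include <stdint.h>
#include <string.h>

#define HW_KEY_SLOTS 4
#define KEY_BYTES    32

/* Simulated hardware-private key slot (hypothetical). */
struct hw_key_slot {
	uint64_t key_id;  /* identity of the key currently loaded */
	int      valid;   /* 0 after invalidation */
	uint8_t  key[KEY_BYTES];
};

struct hw_engine {
	struct hw_key_slot slots[HW_KEY_SLOTS];
	unsigned int loads; /* counts (expensive) key uploads */
};

/* Make sure key_id is resident; upload only on a miss.
 * Returns the slot index used. */
unsigned int hw_use_key(struct hw_engine *hw, uint64_t key_id,
			const uint8_t key[KEY_BYTES])
{
	unsigned int slot = (unsigned int)(key_id % HW_KEY_SLOTS);
	struct hw_key_slot *s = &hw->slots[slot];

	if (!s->valid || s->key_id != key_id) {
		memcpy(s->key, key, KEY_BYTES); /* stands in for a DMA upload */
		s->key_id = key_id;
		s->valid = 1;
		hw->loads++;
	}
	return slot;
}

/* Invalidate a key, e.g. after a rekey; the next use reloads it. */
void hw_invalidate_key(struct hw_engine *hw, uint64_t key_id)
{
	struct hw_key_slot *s = &hw->slots[key_id % HW_KEY_SLOTS];

	if (s->valid && s->key_id == key_id)
		s->valid = 0;
}
```

The point of the sketch is that the caller only names the key; whether
the key material actually moves is decided by the identity check, and
only an explicit invalidation forces a reload.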

But I _think_ keys and nonces and whatnot are actually not the main
reason those structs can't be on the stack. Drivers tend to add their
own local data to those structs, and this may contain buffers that
are used for DMA. I know for a fact the Inside Secure driver does
this (_not_ my design, BTW). I would personally have opted for 
embedding pointers to dynamically allocated blobs elsewhere, such
that the main struct _can_ be on the stack. Food for discussion :-)
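
To make the two layouts concrete, here is a minimal userspace sketch.
It is not taken from the Inside Secure driver or the Crypto API; the
struct names, the buffer size and the use of malloc() as a stand-in for
a DMA-capable allocator are all assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define DMA_BUF_BYTES 256 /* arbitrary size for illustration */

/* Variant A: driver data embedded in the request. If dma_buf is used
 * for DMA, the whole struct has to live in DMA-capable memory. */
struct req_embedded {
	uint32_t flags;
	uint8_t  dma_buf[DMA_BUF_BYTES];
};

/* Variant B: the request only holds a pointer; the blob is allocated
 * separately, so the request itself can sit on the stack. */
struct req_indirect {
	uint32_t flags;
	uint8_t *dma_buf;
};

/* malloc() stands in for a DMA-capable allocator here. */
int req_indirect_init(struct req_indirect *req)
{
	req->flags = 0;
	req->dma_buf = malloc(DMA_BUF_BYTES);
	return req->dma_buf ? 0 : -1;
}

void req_indirect_free(struct req_indirect *req)
{
	free(req->dma_buf);
	req->dma_buf = NULL;
}
```

Variant B keeps the request small and stack-friendly at the price of an
extra pointer dereference and a separate allocation, which is exactly
the cache-density trade-off being argued about above.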

>                Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com