From: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Arnd Bergmann <arnd@arndb.de>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Greg KH <gregkh@linuxfoundation.org>,
	Eric Biggers <ebiggers@google.com>,
	Samuel Neves <sneves@dei.uc.pt>,
	Linux Crypto Mailing List <linux-crypto@vger.kernel.org>,
	Andy Lutomirski <luto@kernel.org>, Marc Zyngier <maz@kernel.org>,
	Dan Carpenter <dan.carpenter@oracle.com>,
	Will Deacon <will@kernel.org>, David Miller <davem@davemloft.net>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: RE: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption
Date: Fri, 27 Sep 2019 09:58:14 +0000	[thread overview]
Message-ID: <MN2PR20MB2973A696B92A8C5A74A738F1CA810@MN2PR20MB2973.namprd20.prod.outlook.com> (raw)
In-Reply-To: <CAHk-=wjr1w7x9Rjre_ALnDLASYNjsEHxu6VJpk4eUwZXN0ntqw@mail.gmail.com>

> > That remark is just very stupid. The hardware ALREADY exists, and
> > more hardware is in the pipeline. Once this stuff is designed in, it
> > usually stays in for many years to come. And these are chips sold in
> > _serious_ quantities, to be used in things like wireless routers and
> > DSL, cable and FTTH modems, 5G base stations, etc. etc.
> 
> Yes, I very much mentioned routers. I believe those can happen much
> more quickly.
> 
> But I would very much hope that that is not the only situation where
> you'd see wireguard used.
> 
Same here

> I'd want to see wireguard in an end-to-end situation from the very
> client hardware. So laptops, phones, desktops. Not the untrusted (to
> me) hw in between.
> 
I don't see why the crypto HW would deserve any less trust than, say,
the CPU itself. If anything, I would say CPUs don't deserve that trust
at the moment either.

> > No, these are just the routers going into *everyone's* home. And 5G
> > basestations arriving at every other street corner. I wouldn't call
> > that rare, exactly.
> 
> That's fine for a corporate tunnel between devices. Which is certainly
> one use case for wireguard.
> 
> But if you want VPN for your own needs for security, you want it at
> the _client_. Not at the router box. So that case really does matter.
> 
Personally, I would really like it in my router box so my CPU is free
to do useful work instead of boring crypto. And I know there's nothing
untrusted in between my client and the router box, so I don't need to
worry about security there. But hey, that's just me.

> And I really don't see the hardware happening in that space. So the
> bad crypto interfaces only make the client _worse_.
> 
Fully agree. We don't focus on the client side with our HW anyway.
(But then there may be that router box in between that can help out.)

> See?
> 
> But on to the arguments that we actually agree on:
> 
> > Hey, no argument there. I don't see any good reason why the key can't
> > be on the stack. I doubt any hardware would be able to DMA that as-is
> > directly, and in any case, key changes should be infrequent, so copying
> > it to some DMA buffer should not be a performance problem.
> > So maybe that's an area for improvement: allow that to be on the stack.
> 
> It's not even just the stack. It's really that the crypto interfaces
> are *designed* so that you have to allocate things separately, and
> can't embed these things in your own data structures.
> 
> And they are that way, because the crypto interfaces aren't actually
> about (just) hiding the hardware interface: they are about hiding
> _all_ the encryption details.
> 
Well, that's the general idea of abstraction. It also allows for 
swapping in any other cipher with minimal effort just _because_ the 
details were hidden from the application. So it may cost you some 
effort initially, but it may save you effort later.
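
To illustrate (just a rough sketch, the wrapper function is my own): because
the caller only ever talks to the generic AEAD interface, the choice of cipher
collapses to a single string, and nothing in the packet-processing code needs
to change.

#include <crypto/aead.h>

/* "rfc7539(chacha20,poly1305)" is the kernel's name for the AEAD that
 * Wireguard uses; "gcm(aes)" (or anything else) drops in without touching
 * any of the code that actually processes the packets.
 */
static struct crypto_aead *pick_cipher(bool use_chapoly)
{
    return crypto_alloc_aead(use_chapoly ? "rfc7539(chacha20,poly1305)"
                                         : "gcm(aes)", 0, 0);
}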

> There's no way to say "hey, I know the crypto I use, I know the key
> size I have, I know the state size it needs, I can preallocate those
> AS PART of my own data structures".
> 
> Because the interface is designed to be so "generic" that you simply
> can't do those things, they are all external allocations, which is
> inevitably slower when you don't have hardware.
> 
Hmm, Ok, I see your point here. But most of those data structures 
(like the key) should be allocated infrequently anyway, so you can
amortize that cost over _many_ crypto operations.

You _do_ realize that performing the key schedule for e.g. AES with
AES-NI also takes quite a lot of time? So you should keep your keys
alive and not reload them all the time anyway.

But I already agreed with you that there may be cases where you just
want to call the library function directly. Wireguard just isn't one
of those cases, IMHO.
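
To make the amortization concrete, here's a rough sketch (struct and function
names are mine, not from any real driver) of the per-peer setup you would do
once per key and then reuse for every packet:

#include <crypto/aead.h>
#include <linux/err.h>
#include <linux/slab.h>

/* Everything expensive - cipher allocation, key schedule, request
 * allocation - happens here, once per peer/key, not per packet.
 */
struct peer_crypto_ctx {
    struct crypto_aead  *tfm;   /* holds the expanded key schedule */
    struct aead_request *req;   /* reused for every packet until rekey */
};

static int peer_crypto_init(struct peer_crypto_ctx *ctx,
                            const u8 *key, unsigned int keylen)
{
    int ret;

    ctx->tfm = crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0);
    if (IS_ERR(ctx->tfm))
        return PTR_ERR(ctx->tfm);

    ret = crypto_aead_setkey(ctx->tfm, key, keylen);
    if (!ret)
        ret = crypto_aead_setauthsize(ctx->tfm, 16);
    if (!ret) {
        ctx->req = aead_request_alloc(ctx->tfm, GFP_KERNEL);
        if (!ctx->req)
            ret = -ENOMEM;
    }
    if (ret)
        crypto_free_aead(ctx->tfm);
    return ret;
}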

> And you've shown that you don't care about that "don't have hardware"
> situation, and seem to think it's the only case that matters. That's
> your job, after all.
> 
I don't recall putting it that strongly ... and I certainly never said
the HW acceleration thing is the _only_ case that matters. But it does
matter _significantly_ to me, for blatantly obvious reasons.

> But however much you try to claim otherwise, there's all these
> situations where the hardware just isn't there, and the crypto
> interface just forces nasty overhead for absolutely no good reason.
> 
> > I already explained the reasons for _not_ doing direct calls above.
> 
> And I've tried to explain how direct calls that do the synchronous
> thing efficiently would be possible, but then _if_ there is hardware,
> they can then fall back to an async interface.
> 
OK, I did not fully get that latter part. I would be fine with such an
approach for use cases (i.e. fixed, known crypto) where that makes sense.
It would actually be better than calling the SW-only library directly
(which was my suggestion) as it would still allow HW acceleration as
an option ... 

> > > So there is absolutely NO DOWNSIDE for hw accelerated crypto to just
> > > do it right, and use an interface like this:
> > >
> > >        if (!chacha20poly1305_decrypt_sg(sg, sg, skb->len, NULL, 0,
> > >                                         PACKET_CB(skb)->nonce, key->key,
> > >                                         simd_context))
> > >                return false;
> > >
> > Well, for one thing, a HW API should not expect the result to be
> > available when the function call returns. (if that's what you
> > mean here). That would just be WRONG.
> 
> Right. But that also shouldn't mean that when you have synchronous
> hardware (ie CPU) you have to set everything up even though it will
> never be used.
> 
> Put another way: even with hardware acceleration, the queuing
> interface should be a simple "do this" interface.
> 
OK, I don't think we disagree there. I _like_ simple. As long as it
doesn't sacrifice functionality I care about.

> The current crypto interface is basically something that requires all
> the setup up-front, whether it's needed or not. And it forces those
> very inconvenient and slow external allocations.
> 
But you should do the setup work (if by "setup" you mean things like
cipher allocation, key setup and request allocation) only _once_ in a 
_long_ while. You can just keep using it for the lifetime of the 
application (or key, for the key setup part).

If I look at my cipher fallback paths in the driver (the only places
where I actually get to _use_ the API from the "other" side), per
actual individual request they _only_ do the following - the rest is
all preallocated earlier:

_set_callback()
_set_crypt()
_set_ad()
_encrypt() or _decrypt()

And now that I look at that, I think the _set_callback() could
move to the setup phase as it's always the same callback function.
Probably, in case of Wireguard, you could even move the _set_ad()
there as it's always zero and the crypto driver is not allowed
to overwrite it in the request struct anyway.

Also, I already agreed with you that _set_crypt(), _set_ad()
and _encrypt()/_decrypt() _could_ be conveniently wrapped into
one API call instead of 3 separate ones if we think that's worth it.

BUT ... actually ... I just looked at the actual _implementation_
and it turns out these are _inlineable_ functions defined in the
header file that _just_ write to some struct fields. So they 
should not end up being function calls at all(!!).
_Only_ the _encrypt()/_decrypt() invocation will end up with a
true (indirect) function call.
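
Spelled out (again a rough sketch, using a synchronous wait for simplicity,
with the request and key set up once beforehand as in the sketch further up),
the per-packet path is really no more than this:

#include <crypto/aead.h>
#include <linux/scatterlist.h>

/* The three _set_ helpers below are inline struct-field writes; only
 * crypto_aead_encrypt() ends up as a true (indirect) function call.
 */
static int encrypt_one_packet(struct aead_request *req,
                              struct scatterlist *sg, unsigned int len,
                              u8 *iv)
{
    DECLARE_CRYPTO_WAIT(wait);

    aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                              crypto_req_done, &wait);
    aead_request_set_crypt(req, sg, sg, len, iv);
    aead_request_set_ad(req, 0);

    /* A synchronous (SW) implementation returns 0 right here; an async
     * engine returns -EINPROGRESS and crypto_wait_req() sleeps until the
     * completion callback fires.
     */
    return crypto_wait_req(crypto_aead_encrypt(req), &wait);
}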

So where are all those allocations you mention? Have you ever
actually _used_ the Crypto API for anything?

Yes, if you actually want to _queue_ requests you need to use one
request struct for every queued operation, but you could just
preallocate an array of them that you cycle through. No need to do
those allocations in the hot path.
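
Something along these lines (a rough sketch with made-up names and sizes; a
real driver would also track which slots are still in flight before reusing
them):

#include <crypto/aead.h>
#include <linux/slab.h>

#define NR_INFLIGHT_REQS 64

/* A small ring of requests allocated up front, so queueing operations to
 * an async engine needs no allocation anywhere near the hot path.
 */
struct req_ring {
    struct aead_request *req[NR_INFLIGHT_REQS];
    unsigned int next;
};

static int req_ring_init(struct req_ring *ring, struct crypto_aead *tfm)
{
    unsigned int i;

    for (i = 0; i < NR_INFLIGHT_REQS; i++) {
        ring->req[i] = aead_request_alloc(tfm, GFP_KERNEL);
        if (!ring->req[i])
            return -ENOMEM; /* caller cleans up what was allocated */
    }
    return 0;
}

/* Hot path: hand out the next preallocated slot, nothing gets kmalloc'd. */
static struct aead_request *req_ring_next(struct req_ring *ring)
{
    struct aead_request *req = ring->req[ring->next];

    ring->next = (ring->next + 1) % NR_INFLIGHT_REQS;
    return req;
}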

So is your problem really with the API _itself_ or with incorrect/
inefficient _use_ of the API in some places?

> And I'm saying that causes problems, because it fundamentally means
> that you can't do a good job for the common CPU  case, because you're
> paying all those costs even when you need absolutely none of them.
> Both at setup time, but also at run-time due to the extra indirection
> and cache misses etc.
> 
There is some cost, sure, but is it _significant_ for any use case that
_matters_? You started bringing up optimization rules, so how about
Amdahl's law?

> > Again, HW acceleration does not depend on the indirection _at all_,
> > that's there for entirely different purposes I explained above.
> > HW acceleration _does_ depend greatly on a truly async ifc though.
> 
> Can you realize that the world isn't just all hw acceleration?
> 
Sure. But there's also a lot of HW acceleration _already out there_
that _could_ have been used if only the proper SW API's had existed.
Welcome to _my_ world.

> Can you admit that the current crypto interface is just horrid for the
> non-accelerated case?
> 
Is agreeing that it is not perfect sufficient for you? :-)

> Can you perhaps then also think that "maybe there are better models".
> 
Sure. There's always better. There's also good enough though ...

> > So queue requests on one side, handle results from the other side
> > in some callback func off of an interrupt handler.
> 
> Actually, what you can do - and what people *have* done - is to admit
> that the synchronous case is real and important, and then design
> interfaces that work for that one too.
> 
But they _do_ work for that case as well. I still haven't seen any
solid evidence that they are as horribly inefficient as you are
implying for _real-life_ use cases. And even if they are, then there's
the question of whether that is the fault of the API or of incorrect
use thereof.

> You don't need to allocate resources ahead of time, and you don't have
> to disallow just having the state buffer allocated by the caller.
> 
> So here's the *wrong* way to do it (and the way that crypto does it):
> 
>  - dynamically allocate buffers at "init time"
> 
Why is that so "wrong"? It sure beats doing allocations on the hot path.
But yes, some stuff should be allowed to live on the stack. Some other
stuff can't be on the stack though, as that's gone when the calling
function exits while the background crypto processing still needs it.

And you don't want to have it on the stack initially and then have
to _copy_ it to some DMA-able location that you allocate on the fly
on the hot path if you _do_ want HW acceleration.

>  - fill in "callback fields" etc before starting the crypto, whether
> they are needed or not
> 
I think this can be done _once_ at request allocation time.
But it's just one function pointer write anyway. Is that significant?
Or: _if_ that is significant, you shouldn't be using the Crypto API for
that use case in the first place.

>  - call a "decrypt" function that then uses the indirect functions you
> set up at init time, and possibly waits for it (or calls the callbacks
> you set up)
> 
> note how it's all this "state machine" model where you add data to the
> state machine, and at some point you say "execute" and then either you
> wait for things or you get callbacks.
> 
Not sure how splitting data setup over a few separate "function" calls
suddenly makes it a "state machine model" ...

But yes, I can understand why the completion handling through this
callback function seems like unnecessary complication for the SW case.

> That makes sense for a hw crypto engine. It's how a lot of them work, after all.
> 
Oh really?

Can't speak for other people's stuff, but for our hardware you post a
request to it and then go off to do other stuff while the HW does its
thing, after which it will inform you it's done by means of an interrupt.
I don't see how this relates to the "state machine model" above: there
is no persistent state involved, it's all included in the request.
The _only_ thing that matters is that you realize it's a pipeline that
needs to be kept filled and has latency >> throughput, just like your
CPU pipeline.

> But it makes _zero_ sense for the synchronous case. You did a lot of
> extra work for that case, and because it was all a state machine, you
> did it particularly inefficiently: not only do you have those separate
> allocations with pointer following, the "decrypt()" call ends up doing
> an indirect call to the CPU implementation, which is just quite slow
> to begin with, particularly in this day and age with retpoline etc.
> 
> So what's the alternative?
> 
> I claim that a good interface would accept that "Oh, a lot of cases
> will be synchronous, and a lot of cases use one fixed
> encryption/decryption model".
> 
> And it's quite doable. Instead of having those callback fields and
> indirection etc, you could have something more akin to this:
> 
>  - let the caller know what the state size is and allocate the
> synchronous state in its own data structures
> 
>  - let the caller just call a static "decrypt_xyz()" function for xyz
> decryption.
> 
Fine for those few cases where the algorithm is known and fixed.
(You do realize that the primary use cases are IPsec, dmcrypt and
fscrypt where that is most definitely _not_ the case?)

Also, you're still ignoring the fact that there is not one single,
optimal CPU implementation either. You have to select that as well,
based on CPU features. So it's either an indirect function call that
would be well predictable - as it's always the same at that point in
the program - or it's a deep if-else tree (which might actually be
implemented by the compiler as an indirect (table) jump ...)
selecting the fastest implementation, either SW _or_ HW.
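
Just to illustrate the kind of selection I mean (a generic sketch, not the
kernel's actual code; the two implementations and the feature test are
placeholders): you pick the fastest code path once, at init, and afterwards
every call goes through one perfectly predictable pointer.

#include <linux/init.h>
#include <linux/types.h>

void chacha_block_generic(u32 state[16], u8 *out); /* plain C, placeholder */
void chacha_block_simd(u32 state[16], u8 *out);    /* e.g. NEON/AVX2, placeholder */
bool cpu_has_fast_simd(void);                      /* boot-time feature check, placeholder */

static void (*chacha_block_impl)(u32 state[16], u8 *out) = chacha_block_generic;

static int __init chacha_select_impl(void)
{
    /* Decided once at boot; the indirect call target is always the same
     * from here on, so the branch predictor has an easy job. The
     * alternative is an if/else tree at each call site, which the
     * compiler may well turn into a table jump anyway.
     */
    if (cpu_has_fast_simd())
        chacha_block_impl = chacha_block_simd;
    return 0;
}
arch_initcall(chacha_select_impl);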

>  - if you end up doing it synchronously, that function just returns
> "done". No overhead. No extra allocations. No unnecessary stuff. Just
> do it, using the buffers provided. End of story. Efficient and simple.
> 
I don't see which "extra allocations" you would be saving here.
Those shouldn't happen in the hot path either way.

>  - BUT.
> 
>  - any hardware could have registered itself for "I can do xyz", and
> the decrypt_xyz() function would know about those, and *if* it has a
> list of accelerators (hopefully sorted by preference etc), it would
> try to use them. And if they take the job (they might not - maybe
> their queues are full, maybe they don't have room for new keys at the
> moment, which might be a separate setup from the queues), the
> "decrypt_xyz()" function returns a _cookie_ for that job. It's
> probably a pre-allocated one (the hw accelerator might preallocate a
> fixed number of in-progress data structures).
> 
> And once you have that cookie, and you see "ok, I didn't get the
> answer immediately" only THEN do you start filling in things like
> callback stuff, or maybe you set up a wait-queue and start waiting for
> it, or whatever".
> 
I don't see the point of saving that single callback pointer write.
I mean, it's just _one_ CPU word memory write. Likely to the L1 cache.

But I can see the appeal of getting a "done" response on the _encrypt()/
_decrypt() call and then being able to immediately continue processing
the result data and having the async response handling separated off. 

I think it should actually be possible to change the API to work like
that without breaking backward compatibility, i.e. define some flag
specifying you actually _want_ this behavior and then define some
return code that says "I'm done processing, carry on please".
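
The return-code half of that more or less exists already (0 means the
operation completed synchronously, -EINPROGRESS/-EBUSY mean it was queued),
so the caller side I have in mind would look roughly like this - tx_continue()
being a placeholder for whatever continues packet processing:

#include <crypto/aead.h>
#include <linux/skbuff.h>

int tx_continue(struct sk_buff *skb); /* placeholder, not a real kernel API */

static int submit_packet(struct aead_request *req, struct sk_buff *skb)
{
    int ret = crypto_aead_encrypt(req);

    switch (ret) {
    case 0:
        return tx_continue(skb);  /* done synchronously, carry on now */
    case -EINPROGRESS:
    case -EBUSY:
        return 0;                 /* queued to HW, the callback finishes it */
    default:
        return ret;               /* a real error */
    }
}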

> See the difference in models? One forces that asynchronous model, and
> actively penalizes the synchronous one.
> 
> The other _allows_ an asynchronous model, but is fine with a synchronous one.
> 
> > >        aead_request_set_callback(req, 0, NULL, NULL);
> > >
> > This is just inevitable for HW acceration ...
> 
> See above. It really isn't. You could do it *after* the fact,
>
Before ... after ... the point was you need it. And it's a totally
insignificant saving anyway.

> when
> you've gotten that ticket from the hardware. Then you say "ok, if the
> ticket is done, use these callbacks". Or "I'll now wait for this
> ticket to be done" (which is what the above does by setting the
> callbacks to zero).
> 
> Wouldn't that be lovely for a user?
> 
Yes and no. The user would _still_ need to handle the callback case,
for when the request _does_ go to the HW accelerator.

So you keep the main processing path clean, I suppose, saving some
cycles there, but you still have the callbacks and the handling of
multiple queued requests as well, which now becomes a separate
_exception_ case. You now have two distinct processing paths to manage
from your application. How is that an _improvement_ for the user?
(Notwithstanding that it may be an improvement to SW-only performance.)

> I suspect it would be a nice model for a hw accelerator too. If you
> have full queues or have problems allocating new memory or whatever,
> you just let the code fall back to the synchronous interface.
> 
HW drivers typically _do_ use SW fallback for cases they cannot
handle. Actually, that works very nicely with the current API,
with the fallback cipher just being attached to the original
request's callback function ... i.e. you just do a tail call to
the fallback cipher request.
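
Roughly like this (a simplified sketch of that pattern, not literal driver
code; ctx and subreq setup omitted): the subrequest inherits the original
request's completion callback, so when the fallback completes, the original
request completes as far as the caller is concerned.

#include <crypto/aead.h>

/* 'fallback' is a SW AEAD allocated at setkey time (typically requested
 * with CRYPTO_ALG_NEED_FALLBACK); 'subreq' lives inside the driver's own
 * request context, sized via crypto_aead_set_reqsize(), so nothing is
 * allocated here.
 */
static int fallback_encrypt(struct aead_request *req,
                            struct crypto_aead *fallback,
                            struct aead_request *subreq)
{
    aead_request_set_tfm(subreq, fallback);
    aead_request_set_callback(subreq, req->base.flags,
                              req->base.complete, req->base.data);
    aead_request_set_crypt(subreq, req->src, req->dst,
                           req->cryptlen, req->iv);
    aead_request_set_ad(subreq, req->assoclen);

    return crypto_aead_encrypt(subreq); /* the tail call */
}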

> > Trust me, I have whole list of things I don't like about the
> > API myself, it's not exactly ideal for HW acceleration either.
> 
> That's the thing. It's actively detrimental for "I have no HW acceleration".
> 
You keep asserting that with no evidence whatsoever.

> And apparently it's not optimal for you guys either.
> 
True, but I accept the fact that it needs to be that way because some
_other_ HW may drive that requirement. I accept the fact that I'm not
alone in the world.

> > But the point is - there are those case where you _don't_ know and
> > _that_ is what the Crypto API is for. And just generally, crypto
> > really _should_ be switchable.
> 
> It's very much not what wireguard does.
> 
And that's very much a part of Wireguard that is _broken_. I like
Wireguard for a lot of things, but its single-cipher focus is not
one of them. Especially since all the crypto it uses comes from a
single source (DJB), which is frowned upon in the industry.

Crypto agility is a very important _security_ feature, and the whole
argument Jason makes that it is actually a weakness is _bullshit_.
(Just because SSL _implemented_ this horribly wrong doesn't mean
it's a bad thing to do - it's not, it's actually _necessary_, as
the alternative would be to either continue using broken crypto
or wait _months_ for a new implementation to reach your devices
when the crypto gets broken somehow. Not good.)

> And honestly, most of the switchable ones have caused way more
> security problems than they have "fixed" by being switchable.
> 
"most of the switchable ones"
You mean _just_ SSL/TLS. SSL/TLS before 1.3 just sucked security
wise, on so many levels. That has _nothing_ to do with the very
desirable feature of crypto agility. It _can_ be done properly and 
securely. (for one thing, it does not _need_ to be negotiable)

>                  Linus

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
