netdev.vger.kernel.org archive mirror
* chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
       [not found]   ` <MN2PR20MB29733663686FB38153BAE7EACA860@MN2PR20MB2973.namprd20.prod.outlook.com>
@ 2019-09-26 11:06     ` Jason A. Donenfeld
  2019-09-26 11:38       ` Toke Høiland-Jørgensen
                         ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Jason A. Donenfeld @ 2019-09-26 11:06 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Ard Biesheuvel, Linux Crypto Mailing List, linux-arm-kernel,
	Herbert Xu, David Miller, Greg KH, Linus Torvalds, Samuel Neves,
	Dan Carpenter, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Will Deacon, Marc Zyngier, Catalin Marinas, Willy Tarreau,
	Netdev, Toke Høiland-Jørgensen, Dave Taht

[CC +willy, toke, dave, netdev]

Hi Pascal

On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
> Actually, that assumption is factually wrong. I don't know if anything
> is *publicly* available, but I can assure you the silicon is running in
> labs already. And something will be publicly available early next year
> at the latest. Which could nicely coincide with having Wireguard support
> in the kernel (which I would also like to see happen BTW) ...
>
> Not "at some point". It will. Very soon. Maybe not in consumer or server
> CPUs, but definitely in the embedded (networking) space.
> And it *will* be much faster than the embedded CPU next to it, so it will
> be worth using it for something like bulk packet encryption.

Super! I was wondering if you could speak a bit more about the
interface. My biggest questions surround latency. Will it be
synchronous or asynchronous? If the latter, why? What will its
latencies be? How deep will its buffers be? The reason I ask is that a
lot of crypto acceleration hardware of the past has been fast but has
had very deep buffers, at great expense of latency. In the
networking context, keeping latency low is pretty important. Already
WireGuard is multi-threaded which isn't super great all the time for
latency (improvements are a work in progress). If you're involved with
the design of the hardware, perhaps this is something you can help
ensure winds up working well? For example, AES-NI is straightforward
and good, but Intel can do that because they are the CPU. It sounds
like your silicon will be adjacent. How do you envision this working
in a low latency environment?

Jason

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
@ 2019-09-26 11:38       ` Toke Høiland-Jørgensen
  2019-09-26 13:52       ` Pascal Van Leeuwen
  2019-09-26 22:47       ` Jakub Kicinski
  2 siblings, 0 replies; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-09-26 11:38 UTC (permalink / raw)
  To: Jason A. Donenfeld, Pascal Van Leeuwen
  Cc: Ard Biesheuvel, Linux Crypto Mailing List, linux-arm-kernel,
	Herbert Xu, David Miller, Greg KH, Linus Torvalds, Samuel Neves,
	Dan Carpenter, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Will Deacon, Marc Zyngier, Catalin Marinas, Willy Tarreau,
	Netdev, Dave Taht

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> [CC +willy, toke, dave, netdev]
>
> Hi Pascal
>
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
>> Actually, that assumption is factually wrong. I don't know if anything
>> is *publicly* available, but I can assure you the silicon is running in
>> labs already. And something will be publicly available early next year
>> at the latest. Which could nicely coincide with having Wireguard support
>> in the kernel (which I would also like to see happen BTW) ...
>>
>> Not "at some point". It will. Very soon. Maybe not in consumer or server
>> CPUs, but definitely in the embedded (networking) space.
>> And it *will* be much faster than the embedded CPU next to it, so it will
>> be worth using it for something like bulk packet encryption.
>
> Super! I was wondering if you could speak a bit more about the
> interface. My biggest questions surround latency. Will it be
> synchronous or asynchronous? If the latter, why? What will its
> latencies be? How deep will its buffers be? The reason I ask is that a
> lot of crypto acceleration hardware of the past has been fast but has
> had very deep buffers, at great expense of latency. In the
> networking context, keeping latency low is pretty important. Already
> WireGuard is multi-threaded which isn't super great all the time for
> latency (improvements are a work in progress). If you're involved with
> the design of the hardware, perhaps this is something you can help
> ensure winds up working well? For example, AES-NI is straightforward
> and good, but Intel can do that because they are the CPU. It sounds
> like your silicon will be adjacent. How do you envision this working
> in a low latency environment?

Being asynchronous doesn't *necessarily* have to hurt latency; you just
need the right queue back-pressure.


We already have multiple queues in the stack. With an async crypto
engine we would go from something like:

stack -> [qdisc] -> wg if -> [wireguard buffer] -> netdev driver ->
device -> [device buffer] -> wire

to

stack -> [qdisc] -> wg if -> [wireguard buffer] -> crypto stack ->
crypto device -> [crypto device buffer] -> wg post-crypto -> netdev
driver -> device -> [device buffer] -> wire

(where everything in [] is a packet queue).

The wireguard buffer is the source of the latency you're alluding to
above (the comment about multi-threaded behaviour), so we probably need
to fix that anyway. For the device buffer we have BQL to keep it at a
minimum. So that leaves the buffering in the crypto offload device. If
we add something like BQL to the crypto offload drivers, we could
conceivably avoid having that add a significant amount of latency. In
fact, doing so may benefit other users of crypto offloads as well, no?
Presumably ipsec has this same issue?
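
Just to make the "BQL for crypto offloads" idea a bit more concrete, here
is a very rough sketch of what byte-limit accounting around the offload
submit/complete path could look like, reusing the same dynamic queue
limits helpers that BQL itself is built on (the wg_crypto_* names and
driver hooks are made up purely for illustration):

#include <linux/dynamic_queue_limits.h>
#include <linux/skbuff.h>

struct wg_crypto_queue {
	struct dql dql;		/* byte-based limit, auto-tuned as in BQL */
	/* ... ring/descriptor state for the offload engine ... */
};

/* called when handing a packet to the accelerator (dql_init() done at setup) */
static bool wg_crypto_try_queue(struct wg_crypto_queue *q, struct sk_buff *skb)
{
	if (dql_avail(&q->dql) < 0)
		return false;	/* enough in flight already: push back on the qdisc */

	dql_queued(&q->dql, skb->len);
	/* ... post the descriptor to the crypto engine ... */
	return true;
}

/* called from the engine's completion interrupt/callback */
static void wg_crypto_complete(struct wg_crypto_queue *q, struct sk_buff *skb)
{
	dql_completed(&q->dql, skb->len);
	/* ... hand the now-encrypted packet on to the netdev path ... */
}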


Caveat: I am fairly ignorant about the inner workings of the crypto
subsystem, so please excuse any inaccuracies in the above; the diagrams
are solely for illustrative purposes... :)

-Toke

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
  2019-09-26 11:38       ` Toke Høiland-Jørgensen
@ 2019-09-26 13:52       ` Pascal Van Leeuwen
  2019-09-26 23:13         ` Dave Taht
  2019-09-26 22:47       ` Jakub Kicinski
  2 siblings, 1 reply; 6+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-26 13:52 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Ard Biesheuvel, Linux Crypto Mailing List, linux-arm-kernel,
	Herbert Xu, David Miller, Greg KH, Linus Torvalds, Samuel Neves,
	Dan Carpenter, Arnd Bergmann, Eric Biggers, Andy Lutomirski,
	Will Deacon, Marc Zyngier, Catalin Marinas, Willy Tarreau,
	Netdev, Toke Høiland-Jørgensen, Dave Taht

> -----Original Message-----
> From: Jason A. Donenfeld <Jason@zx2c4.com>
> Sent: Thursday, September 26, 2019 1:07 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev <netdev@vger.kernel.org>;
> Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com>
> Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> using the existing crypto API]
> 
> [CC +willy, toke, dave, netdev]
> 
> Hi Pascal
> 
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> > Actually, that assumption is factually wrong. I don't know if anything
> > is *publicly* available, but I can assure you the silicon is running in
> > labs already. And something will be publicly available early next year
> > at the latest. Which could nicely coincide with having Wireguard support
> > in the kernel (which I would also like to see happen BTW) ...
> >
> > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > CPUs, but definitely in the embedded (networking) space.
> > And it *will* be much faster than the embedded CPU next to it, so it will
> > be worth using it for something like bulk packet encryption.
> 
> Super! I was wondering if you could speak a bit more about the
> interface. My biggest questions surround latency. Will it be
> synchronous or asynchronous?
>
The hardware being external to the CPU and running in parallel with it,
obviously asynchronous.

> If the latter, why? 
>
Because, as you probably already guessed, the round-trip latency is way
longer than the actual processing time, at least for small packets.

Partly because the only way to communicate between the CPU and the HW 
accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't
keep the CPU busy moving data is through memory, with the HW doing DMA.
And, as any programmer should know, round-trip times to memory are huge
relative to the processing speed.

And partly because these accelerators are very similar to CPUs in
terms of architecture, doing pipelined processing and having multiple
of such pipelines in parallel. Except that these pipelines are not
working on low-level instructions but on full packets/blocks. So they
need to have many packets in flight to keep those pipelines fully
occupied. And packets need to move through the various pipeline stages,
so they incur the per-stage processing time multiple times (just
like e.g. a multiply instruction with a throughput of 1 per cycle
may actually need 4 or more cycles to provide its result).

Could you do that from a synchronous interface? In theory, probably, 
if you would spawn a new thread for every new packet arriving and
rely on the scheduler to preempt the waiting threads. But you'd need
as many threads as the HW accelerator can have packets in flight,
while an async interface would need only 2 threads: one to handle the
input to the accelerator and one to handle the output (or at most one
thread per CPU, if you want to divide the workload).

Such a many-thread approach seems very inefficient to me.
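
For reference, this is essentially the asynchronous model the existing
kernel crypto API already exposes: you submit a request, typically get
-EINPROGRESS back, and a completion callback fires whenever the engine
(HW or SW) is done. Very rough sketch, with the my_pkt context and its
fields being placeholders and error handling omitted:

#include <crypto/aead.h>
#include <linux/scatterlist.h>

struct my_pkt {				/* hypothetical per-packet context */
	struct scatterlist *src, *dst;
	unsigned int len, assoclen;
	u8 *iv;
};

static void chapoly_done(struct crypto_async_request *base, int err)
{
	struct my_pkt *pkt = base->data;
	/* completion context: queue pkt for transmit, free the request, ... */
}

/* tfm would come from crypto_alloc_aead("rfc7539(chacha20,poly1305)", 0, 0) */
static int chapoly_encrypt_async(struct crypto_aead *tfm, struct my_pkt *pkt)
{
	struct aead_request *req = aead_request_alloc(tfm, GFP_ATOMIC);

	if (!req)
		return -ENOMEM;

	aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
				  chapoly_done, pkt);
	aead_request_set_ad(req, pkt->assoclen);
	aead_request_set_crypt(req, pkt->src, pkt->dst, pkt->len, pkt->iv);

	/* 0: completed synchronously, -EINPROGRESS: callback fires later */
	return crypto_aead_encrypt(req);
}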

> What will its latencies be?
>
Depends very much on the specific integration scenario (e.g. bus
speed, bus hierarchy, cache hierarchy, memory speed, etc.), but on
the order of a few thousand CPU clocks is not unheard of.
Which is an eternity for the CPU, but still only a few uSec in
human time. Not a problem unless you're a high-frequency trader and
every ns counts ...
It's not like the CPU would process those packets in zero time.

> How deep will its buffers be? 
>
That of course depends on the specific accelerator implementation,
but possibly dozens of small packets in our case, as you'd need 
at least width x depth packets in there to keep the pipes busy.
Just like a modern CPU needs hundreds of instructions in flight
to keep all its resources busy.
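
Purely as a made-up illustration of the magnitudes involved (not the
spec of any actual engine):

  4 parallel pipes x 8 pipeline stages  -> ~32 packets in flight
  32 packets x 64 bytes                 -> ~2 KB of actual buffering
  at 10 Gbit/s a 64-byte frame (plus preamble/IFG) occupies the wire
  for ~67 ns, so draining 32 of them takes ~32 x 67 ns = ~2 uSec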

> The reason I ask is that a
> lot of crypto acceleration hardware of the past has been fast but has
> had very deep buffers, at great expense of latency.
>
Define "great expense". Everything is relative. The latency is very
high compared to per-packet processing time but at the same time it's
only on the order of a few uSec. Which may not even be significant in
the total time it takes for the packet to travel from input MAC to
output MAC, considering the CPU will still need to parse and classify
it and do pre- and postprocessing on it.

> In the networking context, keeping latency low is pretty important.
>
I've been doing this for IPsec for nearly 20 years now and I've never
heard anyone complain about our latency, so it must be OK.

We're also doing (fully inline, no CPU involved) MACsec cores, which
operate at layer 2 and I know it's a concern there for very specific
use cases (high frequency trading, precision time protocol, ...).
For "normal" VPN's though, a few uSec more or less should be a non-issue.

> Already
> WireGuard is multi-threaded which isn't super great all the time for
> latency (improvements are a work in progress). If you're involved with
> the design of the hardware, perhaps this is something you can help
> ensure winds up working well? For example, AES-NI is straightforward
> and good, but Intel can do that because they are the CPU. It sounds
> like your silicon will be adjacent. How do you envision this working
> in a low latency environment?
> 
Depends on how low low-latency is. If you really need minimal latency,
you need an inline implementation. Which we can also provide, BTW :-)

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
  2019-09-26 11:38       ` Toke Høiland-Jørgensen
  2019-09-26 13:52       ` Pascal Van Leeuwen
@ 2019-09-26 22:47       ` Jakub Kicinski
  2 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2019-09-26 22:47 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Pascal Van Leeuwen, Ard Biesheuvel, Linux Crypto Mailing List,
	linux-arm-kernel, Herbert Xu, David Miller, Greg KH,
	Linus Torvalds, Samuel Neves, Dan Carpenter, Arnd Bergmann,
	Eric Biggers, Andy Lutomirski, Will Deacon, Marc Zyngier,
	Catalin Marinas, Willy Tarreau, Netdev,
	Toke Høiland-Jørgensen, Dave Taht

On Thu, 26 Sep 2019 13:06:51 +0200, Jason A. Donenfeld wrote:
> On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen wrote:
> > Actually, that assumption is factually wrong. I don't know if anything
> > is *publicly* available, but I can assure you the silicon is running in
> > labs already. And something will be publicly available early next year
> > at the latest. Which could nicely coincide with having Wireguard support
> > in the kernel (which I would also like to see happen BTW) ...
> >
> > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > CPUs, but definitely in the embedded (networking) space.
> > And it *will* be much faster than the embedded CPU next to it, so it will
> > be worth using it for something like bulk packet encryption.  
> 
> Super! I was wondering if you could speak a bit more about the
> interface. My biggest questions surround latency. Will it be
> synchronous or asynchronous? If the latter, why? What will its
> latencies be? How deep will its buffers be? The reason I ask is that a
> lot of crypto acceleration hardware of the past has been fast but has
> had very deep buffers, at great expense of latency. In the
> networking context, keeping latency low is pretty important.

FWIW are you familiar with the existing kTLS and IPsec offloads in the
networking stack? They offload the crypto into the NIC, inline, which
helps with both latency and processing overhead.

There is also NIC silicon which can do some ChaCha/Poly, although
I'm not familiar enough with WireGuard to know if offload to existing
silicon will be possible.
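
For a sense of what the kTLS path looks like from the application side
(purely illustrative; where the NIC supports inline TLS offload it takes
over underneath this without any extra work in user space):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef TCP_ULP
#define TCP_ULP 31
#endif
#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Hand an already-negotiated TLS 1.2 AES-GCM-128 session to the kernel;
 * key material comes from the user space handshake. From here on, record
 * encryption happens in the kernel (or inline in the NIC, if offloaded). */
static int enable_ktls_tx(int fd, const unsigned char *key,
			  const unsigned char *iv, const unsigned char *salt,
			  const unsigned char *rec_seq)
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	memset(&ci, 0, sizeof(ci));
	ci.info.version = TLS_1_2_VERSION;
	ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
	memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
	memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
	memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
	memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

	if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")))
		return -1;
	return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}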

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 13:52       ` Pascal Van Leeuwen
@ 2019-09-26 23:13         ` Dave Taht
  2019-09-27 12:18           ` Pascal Van Leeuwen
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Taht @ 2019-09-26 23:13 UTC (permalink / raw)
  To: Pascal Van Leeuwen
  Cc: Jason A. Donenfeld, Ard Biesheuvel, Linux Crypto Mailing List,
	linux-arm-kernel, Herbert Xu, David Miller, Greg KH,
	Linus Torvalds, Samuel Neves, Dan Carpenter, Arnd Bergmann,
	Eric Biggers, Andy Lutomirski, Will Deacon, Marc Zyngier,
	Catalin Marinas, Willy Tarreau, Netdev,
	Toke Høiland-Jørgensen

On Thu, Sep 26, 2019 at 6:52 AM Pascal Van Leeuwen
<pvanleeuwen@verimatrix.com> wrote:
>
> > -----Original Message-----
> > From: Jason A. Donenfeld <Jason@zx2c4.com>
> > Sent: Thursday, September 26, 2019 1:07 PM
> > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH
> > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>;
> > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> > <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev <netdev@vger.kernel.org>;
> > Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com>
> > Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> > using the existing crypto API]
> >
> > [CC +willy, toke, dave, netdev]
> >
> > Hi Pascal
> >
> > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> > <pvanleeuwen@verimatrix.com> wrote:
> > > Actually, that assumption is factually wrong. I don't know if anything
> > > is *publicly* available, but I can assure you the silicon is running in
> > > labs already. And something will be publicly available early next year
> > > at the latest. Which could nicely coincide with having Wireguard support
> > > in the kernel (which I would also like to see happen BTW) ...
> > >
> > > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > > CPUs, but definitely in the embedded (networking) space.
> > > And it *will* be much faster than the embedded CPU next to it, so it will
> > > be worth using it for something like bulk packet encryption.
> >
> > Super! I was wondering if you could speak a bit more about the
> > interface. My biggest questions surround latency. Will it be
> > synchronous or asynchronous?
> >
> The hardware being external to the CPU and running in parallel with it,
> obviously asynchronous.
>
> > If the latter, why?
> >
> Because, as you probably already guessed, the round-trip latency is way
> longer than the actual processing time, at least for small packets.
>
> Partly because the only way to communicate between the CPU and the HW
> accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't
> keep the CPU busy moving data is through memory, with the HW doing DMA.
> And, as any programmer should know, round-trip times to memory are huge
> relative to the processing speed.
>
> And partly because these accelerators are very similar to CPU's in
> terms of architecture, doing pipelined processing and having multiple
> of such pipelines in parallel. Except that these pipelines are not
> working on low-level instructions but on full packets/blocks. So they
> need to have many packets in flight to keep those pipelines fully
> occupied. And packets need to move through the various pipeline stages,
> so they incur the time needed to process them multiple times. (just
> like e.g. a multiply instruction with a throughput of 1 per cycle
> actually may need 4 or more cycles to actually provide its result)
>
> Could you do that from a synchronous interface? In theory, probably,
> if you would spawn a new thread for every new packet arriving and
> rely on the scheduler to preempt the waiting threads. But you'd need
> as many threads as the HW  accelerator can have packets in flight,
> while an async would need only 2 threads: one to handle the input to
> the accelerator and one to handle the output (or at most one thread
> per CPU, if you want to divide the workload)
>
> Such a many-thread approach seems very inefficient to me.
>
> > What will its latencies be?
> >
> Depends very much on the specific integration scenario (i.e. bus
> speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on
> the order of a few thousand CPU clocks is not unheard of.
> Which is an eternity for the CPU, but still only a few uSec in
> human time. Not a problem unless you're a high-frequency trader and
> every ns counts ...
> It's not like the CPU would process those packets in zero time.
>
> > How deep will its buffers be?
> >
> That of course depends on the specific accelerator implementation,
> but possibly dozens of small packets in our case, as you'd need
> at least width x depth packets in there to keep the pipes busy.
> Just like a modern CPU needs hundreds of instructions in flight
> to keep all its resources busy.
>
> > The reason I ask is that a
> > lot of crypto acceleration hardware of the past has been fast but has
> > had very deep buffers, at great expense of latency.
> >
> Define "great expense". Everything is relative. The latency is very
> high compared to per-packet processing time but at the same time it's
> only on the order of a few uSec. Which may not even be significant on
> the total time it takes for the packet to travel from input MAC to
> output MAC, considering the CPU will still need to parse and classify
> it and do pre- and postprocessing on it.
>
> > In the networking context, keeping latency low is pretty important.
> >
> I've been doing this for IPsec for nearly 20 years now and I've never
> heard anyone complain about our latency, so it must be OK.

Well, it depends on where your bottlenecks are. On low-end hardware
you can and do tend to bottleneck on the crypto step, and with
uncontrolled, non-fq'd non-aqm'd buffering you get results like this:

http://blog.cerowrt.org/post/wireguard/

so in terms of "threads" I would prefer to think of flows entering
the tunnel, and of multiplexing them as well as possible across the
crypto hardware/software, so that most packets see minimal in-hw
latency and the coupled queue length does not grow out of control.
Adding fq_codel's hashing algo and queuing to ipsec, as was done in
commit 264b87fa617e758966108db48db220571ff3d60e to leverage the
inner hash, had some nice results:

Before: http://www.taht.net/~d/ipsec_fq_codel/oldqos.png (100ms spikes)
After: http://www.taht.net/~d/ipsec_fq_codel/newqos.png (2ms spikes)
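
The core trick is just to capture the flow hash while the inner headers
are still readable. Roughly this shape (not the actual commit, just the
idea; tunnel_xmit is a made-up stand-in for the tunnel's xmit path):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Capture the inner flow hash before the headers get encrypted.
 * skb_get_hash() computes a hash over the inner 5-tuple and caches it
 * in skb->hash; a later skb_get_hash() call from fq_codel (after
 * encryption/encapsulation) returns the cached value instead of trying
 * to hash the now-opaque outer packet. */
static netdev_tx_t tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
{
	skb_get_hash(skb);
	/* ... hand off to the crypto engine / encapsulate ... */
	return NETDEV_TX_OK;
}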

I'd love to see more vpn vendors evaluate their results with the rrul
test, or something even nastier that steers multiple flows over multiple
cores, rather than with dragstrip bulk throughput tests.

> We're also doing (fully inline, no CPU involved) MACsec cores, which
> operate at layer 2 and I know it's a concern there for very specific
> use cases (high frequency trading, precision time protocol, ...).
> For "normal" VPN's though, a few uSec more or less should be a non-issue.

Measured buffering is typically 1000 packets in userspace vpns. If you
can put data in faster than you can get it out, well....

> > Already
> > WireGuard is multi-threaded which isn't super great all the time for
> > latency (improvements are a work in progress). If you're involved with
> > the design of the hardware, perhaps this is something you can help
> > ensure winds up working well? For example, AES-NI is straightforward
> > and good, but Intel can do that because they are the CPU. It sounds
> > like your silicon will be adjacent. How do you envision this working
> > in a low latency environment?
> >
> Depends on how low low-latency is. If you really need minimal latency,
> you need an inline implementation. Which we can also provide, BTW :-)
>
> Regards,
> Pascal van Leeuwen
> Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
> www.insidesecure.com



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
  2019-09-26 23:13         ` Dave Taht
@ 2019-09-27 12:18           ` Pascal Van Leeuwen
  0 siblings, 0 replies; 6+ messages in thread
From: Pascal Van Leeuwen @ 2019-09-27 12:18 UTC (permalink / raw)
  To: Dave Taht
  Cc: Jason A. Donenfeld, Ard Biesheuvel, Linux Crypto Mailing List,
	linux-arm-kernel, Herbert Xu, David Miller, Greg KH,
	Linus Torvalds, Samuel Neves, Dan Carpenter, Arnd Bergmann,
	Eric Biggers, Andy Lutomirski, Will Deacon, Marc Zyngier,
	Catalin Marinas, Willy Tarreau, Netdev,
	Toke Høiland-Jørgensen


> -----Original Message-----
> From: Dave Taht <dave.taht@gmail.com>
> Sent: Friday, September 27, 2019 1:14 AM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Ard Biesheuvel <ard.biesheuvel@linaro.org>;
> Linux Crypto Mailing List <linux-crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-
> kernel@lists.infradead.org>; Herbert Xu <herbert@gondor.apana.org.au>; David Miller
> <davem@davemloft.net>; Greg KH <gregkh@linuxfoundation.org>; Linus Torvalds
> <torvalds@linux-foundation.org>; Samuel Neves <sneves@dei.uc.pt>; Dan Carpenter
> <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>; Eric Biggers
> <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon <will@kernel.org>;
> Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>; Willy Tarreau
> <w@1wt.eu>; Netdev <netdev@vger.kernel.org>; Toke Høiland-Jørgensen <toke@toke.dk>
> Subject: Re: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> using the existing crypto API]
> 
> On Thu, Sep 26, 2019 at 6:52 AM Pascal Van Leeuwen
> <pvanleeuwen@verimatrix.com> wrote:
> >
> > > -----Original Message-----
> > > From: Jason A. Donenfeld <Jason@zx2c4.com>
> > > Sent: Thursday, September 26, 2019 1:07 PM
> > > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> > > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux-
> > > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>;
> > > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg
> KH
> > > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel
> > > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann
> > > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski
> <luto@kernel.org>;
> > > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas
> > > <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev
> <netdev@vger.kernel.org>;
> > > Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com>
> > > Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard
> > > using the existing crypto API]
> > >
> > > [CC +willy, toke, dave, netdev]
> > >
> > > Hi Pascal
> > >
> > > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
> > > <pvanleeuwen@verimatrix.com> wrote:
> > > > Actually, that assumption is factually wrong. I don't know if anything
> > > > is *publicly* available, but I can assure you the silicon is running in
> > > > labs already. And something will be publicly available early next year
> > > > at the latest. Which could nicely coincide with having Wireguard support
> > > > in the kernel (which I would also like to see happen BTW) ...
> > > >
> > > > Not "at some point". It will. Very soon. Maybe not in consumer or server
> > > > CPUs, but definitely in the embedded (networking) space.
> > > > And it *will* be much faster than the embedded CPU next to it, so it will
> > > > be worth using it for something like bulk packet encryption.
> > >
> > > Super! I was wondering if you could speak a bit more about the
> > > interface. My biggest questions surround latency. Will it be
> > > synchronous or asynchronous?
> > >
> > The hardware being external to the CPU and running in parallel with it,
> > obviously asynchronous.
> >
> > > If the latter, why?
> > >
> > Because, as you probably already guessed, the round-trip latency is way
> > longer than the actual processing time, at least for small packets.
> >
> > Partly because the only way to communicate between the CPU and the HW
> > accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't
> > keep the CPU busy moving data is through memory, with the HW doing DMA.
> > And, as any programmer should know, round-trip times to memory are huge
> > relative to the processing speed.
> >
> > And partly because these accelerators are very similar to CPU's in
> > terms of architecture, doing pipelined processing and having multiple
> > of such pipelines in parallel. Except that these pipelines are not
> > working on low-level instructions but on full packets/blocks. So they
> > need to have many packets in flight to keep those pipelines fully
> > occupied. And packets need to move through the various pipeline stages,
> > so they incur the time needed to process them multiple times. (just
> > like e.g. a multiply instruction with a throughput of 1 per cycle
> > actually may need 4 or more cycles to actually provide its result)
> >
> > Could you do that from a synchronous interface? In theory, probably,
> > if you would spawn a new thread for every new packet arriving and
> > rely on the scheduler to preempt the waiting threads. But you'd need
> > as many threads as the HW  accelerator can have packets in flight,
> > while an async would need only 2 threads: one to handle the input to
> > the accelerator and one to handle the output (or at most one thread
> > per CPU, if you want to divide the workload)
> >
> > Such a many-thread approach seems very inefficient to me.
> >
> > > What will its latencies be?
> > >
> > Depends very much on the specific integration scenario (i.e. bus
> > speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on
> > the order of a few thousand CPU clocks is not unheard of.
> > Which is an eternity for the CPU, but still only a few uSec in
> > human time. Not a problem unless you're a high-frequency trader and
> > every ns counts ...
> > It's not like the CPU would process those packets in zero time.
> >
> > > How deep will its buffers be?
> > >
> > That of course depends on the specific accelerator implementation,
> > but possibly dozens of small packets in our case, as you'd need
> > at least width x depth packets in there to keep the pipes busy.
> > Just like a modern CPU needs hundreds of instructions in flight
> > to keep all its resources busy.
> >
> > > The reason I ask is that a
> > > lot of crypto acceleration hardware of the past has been fast but has
> > > had very deep buffers, at great expense of latency.
> > >
> > Define "great expense". Everything is relative. The latency is very
> > high compared to per-packet processing time but at the same time it's
> > only on the order of a few uSec. Which may not even be significant on
> > the total time it takes for the packet to travel from input MAC to
> > output MAC, considering the CPU will still need to parse and classify
> > it and do pre- and postprocessing on it.
> >
> > > In the networking context, keeping latency low is pretty important.
> > >
> > I've been doing this for IPsec for nearly 20 years now and I've never
> > heard anyone complain about our latency, so it must be OK.
> 
> Well, it depends on where your bottlenecks are. On low-end hardware
> you can and do tend to bottleneck on the crypto step, and with
> uncontrolled, non-fq'd non-aqm'd buffering you get results like this:
> 
> http://blog.cerowrt.org/post/wireguard/
> 
> so in terms of "threads" I would prefer to think of flows entering
> the tunnel and attempting to multiplex them as best as possible
> across the crypto hard/software so that minimal in-hw latencies are experienced
> for most packets and that the coupled queue length does not grow out of control,
> 
> Adding fq_codel's hashing algo and queuing to ipsec as was done in
> commit: 264b87fa617e758966108db48db220571ff3d60e to leverage
> the inner hash...
> 
> Had some nice results:
> 
> before: http://www.taht.net/~d/ipsec_fq_codel/oldqos.png (100ms spikes)
> After: http://www.taht.net/~d/ipsec_fq_codel/newqos.png (2ms spikes)
> 
> I'd love to see more vpn vendors using the rrul test or something even
> nastier to evaluate their results, rather than dragstrip bulk throughput tests,
> steering multiple flows over multiple cores.
> 
> > We're also doing (fully inline, no CPU involved) MACsec cores, which
> > operate at layer 2 and I know it's a concern there for very specific
> > use cases (high frequency trading, precision time protocol, ...).
> > For "normal" VPN's though, a few uSec more or less should be a non-issue.
> 
> Measured buffering is typically 1000 packets in userspace vpns. If you
> can put data in, faster than you can get it out, well....
> 
We don't buffer anywhere near 1000 packets in the hardware itself.
In fact, our buffers are designed to be carefully tunable to accept
the minimum number of packets required by the system as a whole.

But we do potentially need to keep a deep & wide pipeline busy, so for
the big, high-speed engines some double-digit buffering is inevitable.
It won't get anywhere near even 100 packets though, let alone 1000.

Also, the whole point of crypto HW acceleration is to ensure the crypto
is *not* the bottleneck, not even for those pesky small TCP ACK packets
when they come back-to-back (although I doubt the crypto itself is the 
bottleneck there, as there is actually very little crypto to do then).
We work very hard to ensure decent *small* packet performance and 
generally you should scale your crypto HW to be able to keep up with 
the worst case there, with margin to spare ...

> > > Already
> > > WireGuard is multi-threaded which isn't super great all the time for
> > > latency (improvements are a work in progress). If you're involved with
> > > the design of the hardware, perhaps this is something you can help
> > > ensure winds up working well? For example, AES-NI is straightforward
> > > and good, but Intel can do that because they are the CPU. It sounds
> > > like your silicon will be adjacent. How do you envision this working
> > > in a low latency environment?
> > >
> > Depends on how low low-latency is. If you really need minimal latency,
> > you need an inline implementation. Which we can also provide, BTW :-)
> >
> > Regards,
> > Pascal van Leeuwen
> > Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
> > www.insidesecure.com
> 
> 
> 
> --
> 
> Dave Täht
> CTO, TekLibre, LLC
> http://www.teklibre.com
> Tel: 1-831-205-9740

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-09-27 12:19 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190925161255.1871-1-ard.biesheuvel@linaro.org>
     [not found] ` <CAHmME9oDhnv7aX77oEERof0TGihk4mDe9B_A3AntaTTVsg9aoA@mail.gmail.com>
     [not found]   ` <MN2PR20MB29733663686FB38153BAE7EACA860@MN2PR20MB2973.namprd20.prod.outlook.com>
2019-09-26 11:06     ` chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API] Jason A. Donenfeld
2019-09-26 11:38       ` Toke Høiland-Jørgensen
2019-09-26 13:52       ` Pascal Van Leeuwen
2019-09-26 23:13         ` Dave Taht
2019-09-27 12:18           ` Pascal Van Leeuwen
2019-09-26 22:47       ` Jakub Kicinski
