* [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
@ 2016-09-30 23:21 Jason A. Donenfeld
  2016-10-01 11:51 ` [WireGuard] [Cake] " Toke Høiland-Jørgensen
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 23:21 UTC (permalink / raw)
  To: cake, make-wifi-fast, WireGuard mailing list, Dave Taht

Hi all,

On Fri, Sep 30, 2016 at 9:18 PM, Dave Taht <dave.taht@gmail.com> wrote:
> All: I've always dreamed of a vpn that could fq and - when it was
> bottlenecking on cpu - throw away packets intelligently. Wireguard,
> which is what jason & co are working on, is a really simple, elegant
> set of newer vpn ideas that currently has a queuing model designed to
> optimize for multi-cpu encryption, and not, so much, for managing
> worst case network behaviors, or fairness, or working on lower end
> hardware.

Would love any feedback and support for working on the queuing model
with WireGuard. I hear the bufferbloat folks are geniuses at that...

> Do do a git clone of the code, and take a look... somewhere on the
> wireguard list, or privately, jason'd pointed me at the relevant bits
> of the queuing model.

It was this post:
https://lists.zx2c4.com/pipermail/wireguard/2016-August/000378.html
Start reading from "There are a couple reasons" and finish at
"chunking them somehow." The rest can be disregarded.

Hope to hear from y'all soon!

Thanks,
Jason


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-09-30 23:21 [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues Jason A. Donenfeld
@ 2016-10-01 11:51 ` Toke Høiland-Jørgensen
  2016-10-01 15:40   ` Dave Taht
  0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-10-01 11:51 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: cake, make-wifi-fast, WireGuard mailing list

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hi all,
>
> On Fri, Sep 30, 2016 at 9:18 PM, Dave Taht <dave.taht@gmail.com> wrote:
>> All: I've always dreamed of a vpn that could fq and - when it was
>> bottlenecking on cpu - throw away packets intelligently. Wireguard,
>> which is what jason & co are working on, is a really simple, elegant
>> set of newer vpn ideas that currently has a queuing model designed to
>> optimize for multi-cpu encryption, and not, so much, for managing
>> worst case network behaviors, or fairness, or working on lower end
>> hardware.
>
> Would love any feedback and support for working on the queuing model
> with WireGuard. I hear the bufferbloat folks are geniuses at that...


Looking at your queueing structure description, I'd say the reusable FQ
stuff is a pretty good fit. There's a lot of conceptual overlap between
the mac80211 concept of having a queue per station, and yours of having
a queue per peer. Have a look at include/net/fq.h and
include/net/fq_impl.h - and see net/mac80211/tx.c for a usage example
(grep for 'txq'). This can be further enhanced by using CoDel as the
dequeue function (again, see mac80211 for an example).

Basically, what this gives you is a set of per-flow queues that you can
use to do fairness queueing between flows. These queues can then be
partitioned into a number of 'tins', which in your use case would
correspond to peers. This saves you from allocating a big set of queues
for each peer, but you can still get flow-based FQ to that peer
(including the nice latency prioritisation for sparse flows that
FQ-CoDel also has).

I think you could probably use the FQ structure as a drop-in replacement
for the queue you have already in peer->tx_packet_queue. That would give
you fairness between flows going to a peer, prioritisation of sparse
flows to that peer, and CoDel to push back on flows when the VPN is the
bottleneck.
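
To make that concrete, a rough sketch of the embedding (the WireGuard-side
names here are made up, and the fq_tin_enqueue()/fq_tin_dequeue() signatures
have shifted a bit between kernel versions, so treat this as an illustration
rather than working code):

struct wireguard_device {
        struct fq fq;                /* shared flow table, limits, stats */
        /* ... */
};

struct wireguard_peer {
        struct fq_tin tin;           /* this peer's partition of the flows */
        /* ... */
};

/* roughly where packets enter peer->tx_packet_queue today */
static void peer_enqueue(struct wireguard_device *wg,
                         struct wireguard_peer *peer, struct sk_buff *skb)
{
        /* hashes the 5-tuple to a flow; drops from the fattest flow when full */
        fq_tin_enqueue(&wg->fq, &peer->tin, skb, free_fn, get_default_fn);
}

static struct sk_buff *peer_dequeue(struct wireguard_device *wg,
                                    struct wireguard_peer *peer)
{
        /* the dequeue callback is where a CoDel decision can be plugged in,
         * the same way mac80211 does it */
        return fq_tin_dequeue(&wg->fq, &peer->tin, dequeue_fn);
}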


As far as the encryption step goes, I see two obvious ways of doing it:

1. Do what you are doing now, basically, i.e. just pull packets off the
   FQ, encrypt them, and send them out. This has the drawback of
   (potentially) being bursty, and it adds latency after the point where
   it can be controlled by CoDel.

2. Perform encryption on enqueue. Basically, submit a packet to
   encryption as soon as you get it, and when it finishes, stick it on
   the peer queue. Then move the dequeue operation to a separate step
   that just pulls from the queue. This would get you the benefit of
   having FQ and CoDel operate at the point right before the packets hit
   the wire.

Option 2 is a bit more complicated in that it would need to deal with
the case where no session is established. In this case, you could
probably just stick the non-encrypted packets on the queue, then when
the session is established, pull everything off the queue, encrypt it,
and put it back. A flag somewhere could then keep track of whether
packets on the queue are encrypted or not. You'd also need to deal with
the fact that the FQ hashing doesn't work for encrypted packets.
Probably this can be dealt with by hashing the packet before it is
encrypted and storing the hash in an appropriate field somewhere
(skb->cb?); then changing fq_impl.h to take a function pointer to
override fq_flow_classify().
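
Something like this, conceptually (the cb layout and macro are invented;
skb_get_hash() is the generic flow-hash helper the stack already has):

struct wg_skb_cb {
        u32 flow_hash;               /* hash of the inner (cleartext) headers */
};
#define WG_CB(skb) ((struct wg_skb_cb *)(skb)->cb)

/* before the skb is handed to the crypto step */
WG_CB(skb)->flow_hash = skb_get_hash(skb);

/* after encryption, feed the stored hash to the (modified) classifier
 * instead of re-hashing the now-opaque payload */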


Another thing that ties into the above, which I couldn't find in your
description, is how you schedule between peers. Is there a round-robin
scheduler somewhere or something?

-Toke


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 11:51 ` [WireGuard] [Cake] " Toke Høiland-Jørgensen
@ 2016-10-01 15:40   ` Dave Taht
  2016-10-01 15:51     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Taht @ 2016-10-01 15:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list

My thought - given that at least on some platforms - encrypting 1000
packets at a time is a bad idea - would be something regulating the
amount of data being crypted at a time, an equivalent to byte queue
limits - BQL - BCL? byte crypto limits - to keep no more than, say,
1ms of data in that part of the subsystem.

... also pulling stuff out of order from an already encrypted thing
leads to the same IV problems we had in mac80211.


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 15:40   ` Dave Taht
@ 2016-10-01 15:51     ` Toke Høiland-Jørgensen
  2016-10-01 17:19       ` Dave Taht
  0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-10-01 15:51 UTC (permalink / raw)
  To: Dave Taht; +Cc: cake, make-wifi-fast, WireGuard mailing list

Dave Taht <dave.taht@gmail.com> writes:

> My thought - given that at least on some platforms - encrypting 1000
> packets at a time is a bad idea - would be something regulating the
> amount of data being crypted at a time, an equivalent to byte queue
> limits - BQL - BCL? byte crypto limits - to keep no more than, say,
> 1ms of data in that part of the subsystem.

Well, the dynamic queue limit stuff is reusable (in
include/linux/dynamic_queue_limits.h). The netdev BQL stuff just uses
these functions with the packet byte sizes; so adapting it to use in
wireguard should be fairly straight forward :)

> ... also pulling stuff out of order from an already encrypted thing
> leads to the same IV problems we had in mac80211.

Yeah, but who needs IVs, really? ;)

-Toke


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 15:51     ` Toke Høiland-Jørgensen
@ 2016-10-01 17:19       ` Dave Taht
  2016-10-01 17:28         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Taht @ 2016-10-01 17:19 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list

On Sat, Oct 1, 2016 at 8:51 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Dave Taht <dave.taht@gmail.com> writes:
>
>> My thought - given that at least on some platforms - encrypting 1000
>> packets at a time is a bad idea - would be something regulating the
>> amount of data being crypted at a time, an equivalent to byte queue
>> limits - BQL - BCL? byte crypto limits - to keep no more than, say,
>> 1ms of data in that part of the subsystem.
>
> Well, the dynamic queue limit stuff is reusable (in
> include/linux/dynamic_queue_limits.h). The netdev BQL stuff just uses
> these functions with the packet byte sizes; so adapting it to use in
> wireguard should be fairly straight forward :)

Having one global queue for all of wireguard makes a lot of sense,
one that gets divvied up as per the amount of traffic for each destination,
and regulated "fairly".

The present model - one fixed-size queue per endpoint - can run you out
of memory right quick.

>> ... also pulling stuff out of order from an already encrypted thing
>> leads to the same IV problems we had in mac80211.
>
> Yeah, but who needs IVs, really? ;)


Well, in wireguard's case, it does not  (yay!) have
a-must-deliver-packet-in-IV-order mechanism like 802.11 requires. It
will merely throw away things that are outside the replay window (2k
packets). So you could *encrypt* first, deliver later, if you wanted.

Still, what I'd wanted to do was push back (or pace?) from the
interface itself, as well as from the crypto subsystem, and throw away
packets before being crypted when things are starting to get out of
hand - as well as do better fq-ish mixing of what's going out.

Consider a path like this:

testbox, flooding -> 1Gbit in -> router-encrypting -> 20Mbit out to
the internet.

If you flood the thing - you get a big queue in two places - one at
the interface qdisc where we struggle to throw already encrypted
things away (eventually), and another
inside the vpn code where we are encrypting as fast as we can get stuff in.


...

As a side note, another crazy daydream of mine is to use up an entire
/64 for the vpn endpoint identifier, and pre-negotiate a "spread
spectrum" style of sending packets there. This would eliminate needing
to bundle the IV in the packet, it would be obvious (to each side)
which IV matched what IP address, we'd save 8 bytes on every packet.

and completely break every fq+congestion control qdisc we know of, as
well as most stateful firewalls, and so on. :)

> -Toke



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 17:19       ` Dave Taht
@ 2016-10-01 17:28         ` Toke Høiland-Jørgensen
  2016-10-01 22:44           ` Jason A. Donenfeld
  0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-10-01 17:28 UTC (permalink / raw)
  To: Dave Taht; +Cc: cake, make-wifi-fast, WireGuard mailing list

Dave Taht <dave.taht@gmail.com> writes:

> On Sat, Oct 1, 2016 at 8:51 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>> Dave Taht <dave.taht@gmail.com> writes:
>>
>>> My thought - given that at least on some platforms - encrypting 1000
>>> packets at a time is a bad idea - would be something regulating the
>>> amount of data being crypted at a time, an equivalent to byte queue
>>> limits - BQL - BCL? byte crypto limits - to keep no more than, say,
>>> 1ms of data in that part of the subsystem.
>>
>> Well, the dynamic queue limit stuff is reusable (in
>> include/linux/dynamic_queue_limits.h). The netdev BQL stuff just uses
>> these functions with the packet byte sizes; so adapting it to use in
>> wireguard should be fairly straight forward :)
>
> Having one global queue for all of wireguard makes a lot of sense, one
> that gets divvied up as per the amount of traffic for each
> destination,

You'd get that with the FQ structure I described earlier. Then apply the
dql stuff to limit (globally) the number of packets (or bytes) currently
being encrypted, and you should have something fairly reasonable.

-Toke


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 17:28         ` Toke Høiland-Jørgensen
@ 2016-10-01 22:44           ` Jason A. Donenfeld
  2016-10-01 23:40             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-10-01 22:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list

Hey Toke & Dave,

Thanks a lot for your responses. This is steering me in the right direction (I
hope!). Responses are inline below.

Regards,
Jason

On Sat, Oct 1, 2016 at 1:51 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Looking at your queueing structure description, I'd say the reusable FQ
> stuff is a pretty good fit. There's a lot of conceptual overlap between
> the mac80211 concept of having a queue per station, and yours of having
> a queue per peer. Have a look at include/net/fq.h and
> include/net/fq_impl.h - and see net/mac80211/tx.c for a usage example
> (grep for 'txq'). This can be further enhanced by using CoDel as the
> dequeue function (again, see mac80211 for an example).

Thanks, I'll have a look at these. I had skimmed them earlier, and it wasn't
immediately clear how this API worked, but I'll give it a deep read.

Part of where I think we're divergent in our understanding is that in
actuality WireGuard has two queues:

    1. A per-peer (per-flow) queue. When a packet, or bundle of packets (via
    super packets) is asked to be transmitted, I put these packets in this
    per-peer queue. This is the one with a 1024 size at the moment. In the
    same function that these packets are enqueued, they're immediately after
    dequeued -- two or three lines of code later. The only reason that there's
    a queue at all is to give WireGuard time to do the crypto thing. I don't
    think this per-peer queue in its current inception is relevant to our
    discussion here, except for one factor:

        1.a. When queue (2), described next, is full, the packet that couldn't
        fit in queue (2) and all subsequent packets are requeued in the
        per-peer queue (1), and are then dequeued the next time a packet is
        enqueued and then immediately dequeued as above.

    2. A global queue for all of the WireGuard device. After packets from
    queue (1) are dequeued, they're immediately (and instantaneously in an
    unblocking manner) enqueued in this global queue (2). This queue actually
    lives inside of kernel/padata.c, and its size is limited to 1000 items.
    Items are dequeued on various kthreads that do the expensive crypto work,
    and their completion is serialized in the order that they were submitted
    to work around the nonce and packet order issue. This queue is global; it
    is not per-peer (per-flow). Items are dequeued in the order that they were
    enqueued. The reason this can be global is that the CPU doesn't care who
    the packet is destined for when it's doing the expensive crypto operation.
    (A rough sketch of the padata flow is below, after this list.)
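
For those less familiar with padata, the shape of that submit/serialize flow
is roughly the following -- the wg_* names are just illustrative stand-ins,
not the actual symbols in my tree:

struct wg_encrypt_job {
        struct padata_priv padata;   /* embedded; recovered via container_of() */
        struct sk_buff *skb;
};

static void wg_encrypt_parallel(struct padata_priv *padata)
{
        /* runs on one of the crypto worker kthreads: do the expensive
         * encryption here, then hand the job back for serialization */
        padata_do_serial(padata);
}

static void wg_encrypt_serial(struct padata_priv *padata)
{
        /* runs strictly in submission order: safe to transmit now */
}

/* submission; padata->parallel and padata->serial point at the callbacks
 * above. padata_do_parallel() returns -EBUSY once its ~1000-item limit is
 * hit, which corresponds to the "queue (2) is full" case in (1.a). */
err = padata_do_parallel(pinst, &job->padata, cb_cpu);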

So the big questions are -- how do I deal with queue (2), the global one, to
use fq_codel? In fact, don't I actually just want codel, not fq_codel, since
flow isn't really relevant for a global thread worker queue system? Or, in
fact, should it remain relevant somehow? So, the semantics behind queue (2)
need to be intelligently tweaked. However, this does not have anything to do
with queue (1). At the most, there's room for changing the behavior of point
(1.a), but that's about it.

Does this help clarify things for the continuation of the conversation?

>
> Basically, what this gives you is a set of per-flow queues that you can
> use to do fairness queueing between flows. These queues can then be
> partitioned into a number of 'tins', which in your use case would
> correspond to peers. This saves you from allocating a big set of queues
> for each peer, but you can still get flow-based FQ to that peer
> (including the nice latency prioritisation for sparse flows that
> FQ-CoDel also has).

Ahh interesting, so you believe I should in fact use flow classification (peer
classification) on the global thread pool? So that each peer has a fair
latency?

> I think you could probably use the FQ structure as a drop-in replacement
> for the queue you have already in peer->tx_packet_queue. That would give
> you fairness between flows going to a peer, prioritisation of sparse
> flows to that peer, and CoDel to push back on flows when the VPN is the
> bottleneck.

Are you sure? peer->tx_packet_queue is the queue (1) above. Does it make sense
to use the FQ structure there? Wouldn't it make more sense to somehow glue it
over queue (2)? Or would behavior (1.a) tie them together in such a way that
adding FQ to queue (1) would translate to the fairness we desire, even though
it's queue (2) that's actually doing the CPU-intensive work.

Since your mention of peer->tx_packet_queue indicates you've looked at the
code, I'll point out that this is queue (1) as discussed above, and queue (2)
lives in the various invocations of padata_do_parallel() in data.c, which for
encryption is found via the call to packet_create_data in send.c.

> As far as the encryption step goes, I see two obvious ways of doing it:
>
> 1. Do what you are doing now, basically, i.e. just pull packets off the
>    FQ, encrypt them, and send them out. This has the drawback of
>    (potentially) being bursty, and it adds latency after the point where
>    it can be controlled by CoDel.

I'm not sure I understand the last point. Where does CoDel's control kick in
during this in the first place?

>
> 2. Perform encryption on enqueue. Basically, submit a packet to
>    encryption as soon as you get it, and when it finishes, stick it on
>    the peer queue. Then move the dequeue operation to a separate step
>    that just pulls from the queue. This would get you the benefit of
>    having FQ and CoDel operate at the point right before the packets hit
>    the wire.

What happens when the encryption queue -- queue (2) -- is full? Just drop the
packet? Encryption is asynchronous, so the submission returns immediately, and
then that queue gets filled up (padata's limit is 1000). This also isn't
possible because I need to add unencrypted packets to the peer queue, since
the whole reason it exists is to buffer packets while waiting for a handshake.

And in any case, packets *are* submitted for encryption as soon as they're
received. They go in queue (1) for a second, but then are immediately dequeued
for encryption -- and enqueued in queue (2) for that -- right away.

> Option 2 is a bit more complicated in that it would need to deal with
> the case where no session is established. In this case, you could

Right.

> probably just stick the non-encrypted packets on the queue, then when
> the session is established, pull everything off the queue, encrypt it,
> and put it back. A flag somewhere could then keep track of whether
> packets on the queue are encrypted or not. You'd also need to deal with
> the fact that the FQ hashing doesn't work for encrypted packets.

FQ hashing is over the src/dst/dport/sport tuple, right? In that case I'd be
using the outer IP header anyway, which isn't encrypted, so it seems like it'd
be okay.

> Probably this can be dealt with by hashing the packet before it is
> encrypted and storing the hash in an appropriate field somewhere
> (skb->cb?); then changing fq_impl.h to take a function pointer to
> override fq_flow_classify().

The general suggestion here of adding a flag for the unencrypted
packets sheds light on the misunderstanding. Queue (1) already is this queue
for the unencrypted packets. It's queue (2) -- the padata one -- that is
handling the encryption, to which packets from queue (1) are immediately
submitted.


> Another thing that ties into the above which I couldn't find in your
> description is how you schedule between peers? Is there a round-robin
> scheduler somewhere or something?

Queue (2) is a FIFO. It serializes everything coming out of it so that even
though it's asynchronous multi-core operations occurring, they still appear to
complete in the same order they are submitted.

On Sat, Oct 1, 2016 at 5:40 PM, Dave Taht <dave.taht@gmail.com> wrote:
> My thought - given that at least on some platforms - encrypting 1000
> packets at a time is a bad idea - would be something regulating the
> amount of data being crypted at a time, an equivalent to byte queue
> limits - BQL - BCL? byte crypto limits - to keep no more than, say,
> 1ms of data in that part of the subsystem.

That might be a decent idea, since it obviously takes longer to encrypt a big
packet than a small one.

>
> ... also pulling stuff out of order from an already encrypted thing
> leads to the same IV problems we had in mac80211.

padata handles the ordering issues rather nicely.


On Sat, Oct 1, 2016 at 5:51 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Dave Taht <dave.taht@gmail.com> writes:
> Well, the dynamic queue limit stuff is reusable (in
> include/linux/dynamic_queue_limits.h). The netdev BQL stuff just uses
> these functions with the packet byte sizes; so adapting it to use in
> wireguard should be fairly straight forward :)
>
>> ... also pulling stuff out of order from an already encrypted thing
>> leads to the same IV problems we had in mac80211.
>
> Yeah, but who needs IVs, really? ;)

Sequential nonces, in my case. :P


On Sat, Oct 1, 2016 at 7:19 PM, Dave Taht <dave.taht@gmail.com> wrote:
> Having one global queue for all of wireguard makes a lot of sense,
> one that gets divvied up as per the amount of traffic for each destination,
> and regulated "fairly".

Okay, so that's sort of what I already have now, except there's no fair
regulation happening.

> The present model - of one fixed size one per endpoint can run you out
> of memory right quick.

This problem goes away if I change behavior (1.a) to drop the packet instead
of requeuing it. Does that point make sense? If not, please let me know, and
I'll try better to explain it, as this seems like an important crux.

> Well, in wireguard's case, it does not  (yay!) have
> a-must-deliver-packet-in-IV-order mechanism like 802.11 requires. It
> will merely throw away things that are outside the replay window (2k
> packets). So you could *encrypt* first, deliver later, if you wanted.

Yes, it does have a backtrack mechanism to deal with networks reordering
packets, but things work out *much* better if they're sent in order. PLEASE
see the documentation for padata. Its API makes items appear to complete in
the order that they're submitted.


> If you flood the thing - you get a big queue in two places - one at
> the interface qdisc where we struggle to throw already encrypted
> things away (eventually), and another
> inside the vpn code where we are encrypting as fast as we can get stuff in.

For what it's worth, the interface itself is marked as IFF_NO_QUEUE, so that
userspace's send() pretty much correlates to a direct call (or calls) to
xmit(), with a few exceptions for curious softirq magic.

> As a side note, another crazy daydream of mine is to use up an entire
> /64 for the vpn endpoint identifier, and pre-negotiate a "spread
> spectrum" style of sending packets there. This would eliminate needing
> to bundle the IV in the packet, it would be obvious (to each side)
> which IV matched what IP address, we'd save 8 bytes on every packet.

Sticking nonces in a v6 address -- that's a neat idea.

On Sat, Oct 1, 2016 at 7:28 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> You'd get that with the FQ structure I described earlier. Then apply the
> dql stuff to limit (globally) the number of packets (or bytes) currently
> being encrypted, and you should have something fairly reasonable.

Hm?


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 22:44           ` Jason A. Donenfeld
@ 2016-10-01 23:40             ` Toke Høiland-Jørgensen
  2016-10-02  2:25               ` Jason A. Donenfeld
  0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-10-01 23:40 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: cake, make-wifi-fast, WireGuard mailing list

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Thanks a lot for your responses. This is steering me in the right direction (I
> hope!). Responses are inline below.
>
> Regards,
> Jason
>
> On Sat, Oct 1, 2016 at 1:51 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>> Looking at your queueing structure description, I'd say the reusable FQ
>> stuff is a pretty good fit. There's a lot of conceptual overlap between
>> the mac80211 concept of having a queue per station, and yours of having
>> a queue per peer. Have a look at include/net/fq.h and
>> include/net/fq_impl.h - and see net/mac80211/tx.c for a usage example
>> (grep for 'txq'). This can be further enhanced by using CoDel as the
>> dequeue function (again, see mac80211 for an example).
>
> Thanks, I'll have a look at these. I had skimmed them earlier, and it wasn't
> immediately clear how this API worked, but I'll give it a deep read.
>
> Part of where I think we're divergent in our understanding is that in
> actuality WireGuard has two queues:

I assumed that there probably was, but was not sure where. Thanks for
clearing this up. I'll take a step back and try to describe this on the
conceptual level:

First, the reason we want better queueing is to deal with the case where
wireguard is a bottleneck. If it's not a bottleneck, no queue will
build, so there's no queueing latency to reduce. In this case the
bottleneck happens when we're waiting for the crypto step to finish,
i.e. when your queue (2) starts to fill. So we want to limit the time
packets spent in the total queueing system, i.e. from when they enter through
xmit() and until they are released by send_off_bundle().

The simple way to do this is to just apply CoDel to the whole system.
I.e. timestamp packets when they appear in xmit() and apply CoDel before
sending them off in send_off_bundle(). You'd have to make sure the
timestamp is preserved through the encryption step (keep it in the cb,
for instance), but other than that it's only a couple of handfuls of LOC
you need to add to achieve this.
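
In sketch form (WG_CB() and the drop helper are placeholders; a real
implementation would use the codel_params/codel_vars helpers from
net/codel.h rather than open-coding the decision):

/* xmit(): stamp the packet on entry */
WG_CB(skb)->enqueue_time = ktime_get();

/* send_off_bundle(): measure the sojourn time right before transmission
 * and let CoDel act on it */
sojourn = ktime_sub(ktime_get(), WG_CB(skb)->enqueue_time);
if (codel_says_drop(sojourn))        /* placeholder for the CoDel state machine */
        kfree_skb(skb);              /* the drop is what tells TCP to back off */
else
        send(skb);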


However, the FQ part of FQ-CoDel gives substantial benefits on top of
plain CoDel, namely fairness between (5-tuple) flows, and lower latency
to sparse flows (because they get priority). Since you have a strict
FIFO encryption step in the middle (and re-order sensitivity after
encryption is completed), it's difficult to apply FQ across all of
wireguard. However, it can be applied to the peer queues, as I
mentioned. This would have the benefit that when a peer is backlogged
and waiting for space on the crypto queue, packets arriving to it will
get the benefit of the FQ when transmission resumes.


BTW, when reading the code (which is very readable!), I wondered about a
couple of things:

1. When the crypto queue is full, you stick all the unencrypted packets
   on the peer queues. What mechanism resumes the transmission of those
   packets, other than a new packet arriving to the same peer? And what
   happens if several peers are backlogged at the same time? Will the order
   in which they resume transmission depend on which peer happens to get a
   new packet once the crypto queue has space?

2. There are several places where you walk the queue with a manual
   for-loop. Is there any reason why you're not using the
   skb_queue_walk_* macros? In particular, in some places you are
   storing the pointer in the beginning of a loop in case it goes away;
   that seems to be what skb_queue_walk_safe() is meant for?



-Toke


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-01 23:40             ` Toke Høiland-Jørgensen
@ 2016-10-02  2:25               ` Jason A. Donenfeld
  2016-10-02  2:40                 ` Jason A. Donenfeld
  2016-10-02 11:31                 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-10-02  2:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list

Hey Toke,

On Sun, Oct 2, 2016 at 1:40 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> I assumed that there probably was, but was not sure where. Thanks for
> clearing this up. I'll take a step back and try to describe this on the
> conceptual level:

Conceptual overview: exactly what I needed, thanks.

> First, the reason we want better queueing is to deal with the case where
> wireguard is a bottleneck. If it's not a bottleneck, no queue will
> build, so there's no queueing latency to reduce. In this case the
> bottleneck happens when we're waiting for the crypto step to finish,
> i.e. when your queue (2) starts to fill. So we want to limit the time
> packets spent in the total queueing system, i.e. from when they enter through
> xmit() and until they are released by send_off_bundle().
>
> The simple way to do this is to just apply CoDel to the whole system.
> I.e. timestamp packets when they appear in xmit() and apply CoDel before
> sending them off in send_off_bundle(). You'd have to make sure the
> timestamp is preserved through the encryption step (keep it in the cb,
> for instance), but other than that it's only a couple of handfuls of LOC
> you need to add to achieve this.

Thank you! Excellent. Okay, that clears things up for me quite a bit. So if my
understanding is correct, the goal of a smart queue is to realize when the
queue is being filled and apply artificial delay to the dequeuing process, so
that TCP's rate control algorithm won't suffer from initial bursts. So, the
whole idea is not to do anything fancy with the padata situation, but instead
record when the packet is submitted to the crypto engine and when it comes
back out of the crypto engine, feed this to CoDel, and let CoDel apply
whatever necessary delays. This should make going back and reading the
wireless code a bit more clear for me.

> However, the FQ part of FQ-CoDel gives substantial benefits on top of
> plain CoDel, namely fairness between (5-tuple) flows, and lower latency
> to sparse flows (because they get priority). Since you have a strict
>> FIFO encryption step in the middle (and re-order sensitivity after
> encryption is completed), it's difficult to apply FQ across all of
> wireguard. However, it can be applied to the peer queues, as I
> mentioned. This would have the benefit that when a peer is backlogged
> and waiting for space on the crypto queue, packets arriving to it will
> get the benefit of the FQ when transmission resumes.

Do you mean that:

   a. I should add another mechanism to the part where I remove items from the
      peer queue (queue (1)) and put it in the padata queue (queue (2)) so
      that it fairly chooses packets based on the inner IP header?

      or

   b. When queueing up any packets into the padata queue (queue (2)), I should
      somehow apply fairness in deciding which peer's packets get put from
      that peer's queue (1) into the global queue (2)?

> BTW, when reading the code (which is very readable!)

Glad you liked it! If anything turns out to be unreadable or unclear, do let
me know. This sort of thing is somewhat important to me.

> 1. When the crypto queue is full, you stick all the unencrypted packets
>    on the peer queues. What mechanism resumes the transmission of those
>    packets, other than a new packet arriving to the same peer? And what
>    happens if several peers are backlogged at the same time? Will the order
>    in which they resume transmission depend on which peer happens to get a
>    new packet once the crypto queue has space?

No mechanism. :( Right now nothing happens until the peer has another packet
to send, which is far from ideal. Due to some other interesting aspects of
WireGuard that aren't directly relevant here, if the peer receives data but
hasn't sent any data for roughly 10 seconds, the queue will start up again.
This obviously isn't a satisfactory solution though.

What would you suggest? I try again after 100ms? I try again after some magic
CoDel-defined timeout? If I set a one-off timer for 100ms after the rejection,
if several peers all get rejected at the same time, I suppose I'll have
something of a "thundering herd" situation when the timer fires, though I'm
not sure this would necessarily be a bad thing.

> 2. There are several places where you walk the queue with a manual
>    for-loop. Is there any reason why you're not using the
>    skb_queue_walk_* macros? In particular, in some places you are
>    storing the pointer in the beginning of a loop in case it goes away;
>    that seems to be what skb_queue_walk_safe() is meant for?

skb_queues are circularly linked. This requires an extra 16 bytes for the head
element. Rather than waste that in the cb, I simply use a list of skbs that
terminate with a NULL. This makes traversal a bit different, and more simple.
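
(In other words, traversal is just the usual singly-linked walk, saving the
next pointer up front because the current skb may be sent or freed in the
loop body:)

struct sk_buff *skb, *next;
for (skb = first; skb; skb = next) {
        next = skb->next;        /* save: skb may be handed off below */
        /* ... process skb ... */
}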


Thanks for the guidance here. Very much appreciated.

Regards,
Jason


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-02  2:25               ` Jason A. Donenfeld
@ 2016-10-02  2:40                 ` Jason A. Donenfeld
  2016-10-02 11:31                 ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-10-02  2:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list

On Sun, Oct 2, 2016 at 4:25 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> What would you suggest? I try again after 100ms?

This now lives in a test branch:
https://git.zx2c4.com/WireGuard/commit/?id=da9bd3a5ddd4a8edbdbb337743a14e54afb50dd5


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-02  2:25               ` Jason A. Donenfeld
  2016-10-02  2:40                 ` Jason A. Donenfeld
@ 2016-10-02 11:31                 ` Toke Høiland-Jørgensen
  2016-10-05 12:39                   ` Jason A. Donenfeld
  1 sibling, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-10-02 11:31 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: cake, make-wifi-fast, WireGuard mailing list

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hey Toke,
>
> On Sun, Oct 2, 2016 at 1:40 AM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>> I assumed that there probably was, but was not sure where. Thanks for
>> clearing this up. I'll take a step back and try to describe this on the
>> conceptual level:
>
> Conceptual overview: exactly what I needed, thanks.
>
>> First, the reason we want better queueing is to deal with the case where
>> wireguard is a bottleneck. If it's not a bottleneck, no queue will
>> build, so there's no queueing latency to reduce. In this case the
>> bottleneck happens when we're waiting for the crypto step to finish,
>> i.e. when your queue (2) starts to fill. So we want to limit the time
>> packets spent in the total queueing system, i.e. from when they enter through
>> xmit() and until they are released by send_off_bundle().
>>
>> The simple way to do this is to just apply CoDel to the whole system.
>> I.e. timestamp packets when they appear in xmit() and apply CoDel before
>> sending them off in send_off_bundle(). You'd have to make sure the
>> timestamp is preserved through the encryption step (keep it in the cb,
>> for instance), but other than that it's only a couple of handfuls of LOC
>> you need to add to achieve this.
>
> Thank you! Excellent. Okay, that clears things up for me quite a bit.
> So if my understanding is correct, the goal of a smart queue is to
> realize when the queue is being filled and apply artificial delay to
> the dequeuing process, so that TCP's rate control algorithm won't
> suffer from initial bursts. So, the whole idea is not to do anything
> fancy with the padata situation, but instead record when the packet is
> submitted to the crypto engine and when it comes back out of the
> crypto engine, feed this to CoDel, and let CoDel apply whatever
> necessary delays. This should make going back and reading the wireless
> code a bit more clear for me.

Almost: We're not adding delay, we're trying to get rid of it ;)

Basically, what CoDel does is look at the time the packet spent in the
queue. If this time grows above 5 ms for an extended period of time,
CoDel will drop a packet. This signals the TCP sender to slow down,
which will hopefully drain the queue. If it doesn't help for long
enough, CoDel will drop another packet; and so on, dropping at an
increasing rate until the queue is drained.

The basic mechanism of occasionally dropping a packet has been known for
decades; it's the same thing good old RED does. What CoDel brings to the table
is basing the drop on the actual measured time the packets spend in the
queue. This nicely avoids having to tell the algorithm how many packets
are too many.
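
(For reference, the control law: once the sojourn time has stayed above the
5 ms target for at least one 100 ms interval, CoDel drops and schedules the
next drop at interval/sqrt(count), so the drop rate ramps up the longer the
queue stays bad. Very roughly:)

/* rough sketch only; the real state machine in net/codel.h also tracks a
 * dropping state and decays count more carefully */
if (sojourn > TARGET) {                              /* TARGET = 5 ms */
        if (now >= first_above_time + INTERVAL) {    /* INTERVAL = 100 ms */
                drop(pkt);
                count++;
                drop_next = now + INTERVAL / int_sqrt(count);
        }
} else {
        first_above_time = 0;                        /* back below target */
}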

>> However, the FQ part of FQ-CoDel gives substantial benefits on top of
>> plain CoDel, namely fairness between (5-tuple) flows, and lower latency
>> to sparse flows (because they get priority). Since you have a strict
>> FIFO encryption step in the middle (and re-order sensitivity after
>> encryption is completed), it's difficult to apply FQ across all of
>> wireguard. However, it can be applied to the peer queues, as I
>> mentioned. This would have the benefit that when a peer is backlogged
>> and waiting for space on the crypto queue, packets arriving to it will
>> get the benefit of the FQ when transmission resumes.
>
> Do you mean that:
>
>    a. I should add another mechanism to the part where I remove items from the
>       peer queue (queue (1)) and put it in the padata queue (queue (2)) so
>       that it fairly chooses packets based on the inner IP header?
>
>       or
>
>    b. When queueing up any packets into the padata queue (queue (2)), I should
>       somehow apply fairness in deciding which peer's packets get put from
>       that peer's queue (1) into the global queue (2)?

I'd say both. (a) makes sure that if, e.g., a TCP flow and a sparse flow
(VoIP, ping, etc) both go to the same peer, the sparse flow will get
priority and low latency. (b) makes sure that all peers get their fair
share of the total bandwidth.

>> BTW, when reading the code (which is very readable!)
>
> Glad you liked it! If anything turns out to be unreadable or unclear, do let
> me know. This sort of thing is somewhat important to me.
>
>> 1. When the crypto queue is full, you stick all the unencrypted packets
>>    on the peer queues. What mechanism resumes the transmission of those
>>    packets, other than a new packet arriving to the same peer? And what
>>    happens if several peers are backlogged at the same time? Will the order
>>    in which they resume transmission depend on which peer happens to get a
>>    new packet once the crypto queue has space?
>
> No mechanism. :( Right now nothing happens until the peer has another
> packet to send, which is far from ideal. Due to some other interesting
> aspects of WireGuard that aren't directly relevant here, if the peer
> receives data but hasn't sent any data for roughly 10 seconds, the
> queue will start up again. This obviously isn't a satisfactory
> solution though.

Yes, I found the other timer. Okay, at least I wasn't reading the code
wrong. And I agree that is probably not optimal ;)

> What would you suggest? I try again after 100ms? I try again after
> some magic CoDel-defined timeout? If I set a one-off timer for 100ms
> after the rejection, if several peers all get rejected at the same
> time, I suppose I'll have something of a "thundering herd" situation
> when the timer fires, though I'm not sure this would necessarily be a
> bad thing.

You don't need a timer. You already have a signal for when more queue
space is available in the encryption step: When a packet finishes
encryption. All you need to do is try to enqueue another one at this
point.


Basically, putting all of the above together, I'd suggest a flow like
this (expressed in C-like pseudocode):

function enqueue(pkt) { /* this is basically your existing xmit() */
  peer = get_peer(pkt);
  timestamp(pkt);
  fq_enqueue(peer, pkt);
  list_add_end(peer, active_peers);
  schedule();
}

function schedule() {
  while (!full(encryption queue)) {
    peer = list_first(active_peers);
    if (!peer) /* no active peers */
      break;

    pkt = fq_dequeue(peer); /* Option 1: This runs CoDel */
    if (!pkt) { /* peer queue empty */
      list_remove(peer, active_peers);
      continue;
    }

    start_encryption(pkt);
    list_move_end(peer, active_peers); /* rotate list */
  }
}

function encryption_done(pkt) {
  codel(pkt); /* Option 2: Run CoDel here; may drop the packet */
  if (pkt)
    send(pkt);
  schedule();
}


This basically gets you a round-robin scheduler between stations (in
schedule()), and an FQ structure for each peer that will do per-flow
fairness and sparse flow prioritisation. You could also do a 'sparse
peer' optimisation in schedule(), but it's not strictly necessary.

I'd suggest you use the dynamic queue length stuff to check whether the
encryption queue is full. I.e. the function I called full() would
actually be a call to dql_avail() - and you'd need to add the companion
calls to dql_queued() and dql_completed() in appropriate places, of
course (for which I'd suggest you use bytes rather than number of
packets). This should keep just the right number of packets in the
encryption queue to keep things busy, but not more (which would just add
latency).
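
In sketch form (where the dql instance lives, and the exact call sites, are
of course up to you):

struct dql crypto_dql;               /* <linux/dynamic_queue_limits.h> */

dql_init(&crypto_dql, HZ);           /* once, at device init */

/* the full() check in schedule() above: */
if (dql_avail(&crypto_dql) < 0)
        break;                       /* enough bytes already in flight */

/* when a packet is handed to the crypto queue: */
dql_queued(&crypto_dql, skb->len);

/* in encryption_done(): */
dql_completed(&crypto_dql, skb->len);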

As you can see I marked two possible places you could apply CoDel in the
above. I'm not actually sure which one works best. On the one hand,
option 2 means CoDel will get a chance to react to the encryption
latency, which is good. But on the other hand, because there is fairness
queueing, there can be reordering, so you'll get some packets with very
low latency intermixed with those that have been queued for longer.
Which will basically turn CoDel off. So for this reason, I think having
CoDel in the fq_dequeue() step would be better; in this case, there are
a separate set of state vars for each flow, so the TCP elephant flows
can get packets dropped while the others won't. This is what we do in
mac80211.

Actually with this setup, things are analogous to a regular network
interface, where the encryption queue takes the place of the hardware.
So we want to keep the hardware (encryption) busy, but not have more
packets queued up there than is necessary to achieve this.

Also, I've completely ignored your existing bunching of packets after
encryption. I think you could still do that (i.e. wait to send things in
encryption_done() until you have a bunch of packets), as long as you
call schedule() for each packet that finishes encryption. I'm not sure
how that would affect performance, though; it's probably a tradeoff
between throughput and latency. I'd suggest benchmarking both cases :)

>> 2. There are several places where you walk the queue with a manual
>>    for-loop. Is there any reason why you're not using the
>>    skb_queue_walk_* macros? In particular, in some places you are
>>    storing the pointer in the beginning of a loop in case it goes away;
>>    that seems to be what skb_queue_walk_safe() is meant for?
>
> skb_queues are circularly linked. This requires an extra 16 bytes for
> the head element. Rather than waste that in the cb, I simply use a
> list of skbs that terminate with a NULL. This makes traversal a bit
> different, and more simple.

Ah, I see. Well, if you want to use the queue structures from elsewhere
(i.e. FQ), you'll probably need to use skb_queue, at least in the upper
part.

> Thanks for the guidance here. Very much appreciated.

You're welcome. I'm enjoying the chance to explain these concepts to
someone new :)

-Toke


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-02 11:31                 ` Toke Høiland-Jørgensen
@ 2016-10-05 12:39                   ` Jason A. Donenfeld
  2016-10-05 13:59                     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-10-05 12:39 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list

On Sun, Oct 2, 2016 at 1:31 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> You don't need a timer. You already have a signal for when more queue
> space is available in the encryption step: When a packet finishes
> encryption. All you need to do is try to enqueue another one at this
> point.

Oh, silly me. Yes of course. Voila:
https://git.zx2c4.com/WireGuard/commit/?id=a0ad61c1a0e25a376e145f07ca27c605d3852bc4

> Also, I've completely ignored your existing bunching of packets after
> encryption. I think you could still do that (i.e. wait to send things in
> encryption_done() until you have a bunch of packets), as long as you
> call schedule() for each packet that finishes encryption.

But, in fact, I wait until all packets are done and sent to call schedule().
The reason is that sometimes schedule() does not queue, but rather immediately
encrypts and sends the packet if it's small and if there's only one. Waiting
until they're all sent out ensures packet ordering. It also allows the next
batch of packets to be bundled up better, which provides much better
throughput. While it seems like this would contribute unnecessary latency,
the only times that packets are actually bundled in large quantities is when
doing some sort of bulk TCP data transfer, where the kernel has access to a
huge buffer and can form GSO superpackets. In these cases, latency is rarely
the concern, since it's bulk transfer.

> Almost: We're not adding delay, we're trying to get rid of it ;)
>
> Basically, what CoDel does is look at the time the packet spent in the
> queue. If this time grows above 5 ms for an extended period of time,
> CoDel will drop a packet. This signals the TCP sender to slow down,
> which will hopefully drain the queue. If it doesn't help for long
> enough, CoDel will drop another packet; and so on, dropping at an
> increasing rate until the queue is drained.
>
> The basic mechanism of occasionally dropping a packet has been known for
> decades; it's the same good old RED does. What CoDel brings to the table
> is basing the drop on the actual measured time the packets spend in the
> queue. This nicely avoids having to tell the algorithm how many packets
> are too many.

Ahh, of course. That makes sense.

>
> Basically, putting all of the above together, I'd suggest a flow like
> this (expressed in C-like pseudocode):
>
> function enqueue(pkt) { /* this is basically your existing xmit() */
>   peer = get_peer(pkt);
>   timestamp(pkt);
>   fq_enqueue(peer, pkt);
>   list_add_end(peer, active_peers);
>   schedule();
> }
>
> function schedule() {
>   while (!full(encryption queue)) {
>     peer = list_first(active_peers);
>     if (!peer) /* no active peers */
>       break;
>
>     pkt = fq_dequeue(peer); /* Option 1: This runs CoDel */
>     if (!pkt) { /* peer queue empty */
>       list_remove(peer, active_peers);
>       continue;
>     }
>
>     start_encryption(pkt);
>     list_move_end(peer, active_peers); /* rotate list */
>   }
> }
>
> function encryption_done(pkt) {
>   codel(pkt); /* Option 2: Run CoDel here; may drop the packet */
>   if (pkt)
>     send(pkt);
>   schedule();
> }
>
>
> This basically gets you a round-robin scheduler between stations (in
> schedule()), and an FQ structure for each peer that will do per-flow
> fairness and sparse flow prioritisation. You could also do a 'sparse
> peer' optimisation in schedule(), but it's not strictly necessary.
>
> I'd suggest you use the dynamic queue length stuff to check whether the
> encryption queue is full. I.e. the function I called full() would
> actually be a call to dql_avail() - and you'd need to add the companion
> calls to dql_queued() and dql_completed() in appropriate places, of
> course (for which I'd suggest you use bytes rather than number of
> packets). This should keep just the right number of packets in the
> encryption queue to keep things busy, but not more (which would just add
> latency).
>
> As you can see I marked two possible places you could apply CoDel in the
> above. I'm not actually sure which one works best. On the one hand,
> option 2 means CoDel will get a chance to react to the encryption
> latency, which is good. But on the other hand, because there is fairness
> queueing, there can be reordering, so you'll get some packets with very
> low latency intermixed with those that have been queued for longer.
> Which will basically turn CoDel off. So for this reason, I think having
> CoDel in the fq_dequeue() step would be better; in this case, there are
> a separate set of state vars for each flow, so the TCP elephant flows
> can get packets dropped while the others won't. This is what we do in
> mac80211.
>
> Actually with this setup, things are analogous to a regular network
> interface, where the encryption queue takes the place of the hardware.
> So we want to keep the hardware (encryption) busy, but not have more
> packets queued up there than is necessary to achieve this.

That's a nice design you've outlined. One issue, however, with option 2 is
that now I'm dropping packets after they've already been encrypted. This seems
like it's a bit of a waste. Option 1 avoids this, but then it doesn't take
into account the encryption latency, which is the main source of bufferbloat.
Is there a hybrid option? Specifically, is there an API or a pattern to the
tune of:

struct fq_codel_stats fqc;

xmit(struct sk_buff *skb) {
    fq_hash_t flow_hash = fq_compute_flow_hash(iphdr(skb), ...);
    codel_time_t time = codel_get_time_or_drop(&fqc, flow_hash);
    if (!time) {
        kfree_skb(skb);
        return;
    }

    skb->cb.time = time;
    skb->cb.flow_hash = flow_hash;

    put_packet_in_encryption_queue(skb);
}

encryption_finished(struct sk_buff *skb) {
    fq_codel_register_packet_dequeued(&fqc, skb->cb.time, skb->cb.flow_hash);
    send_to_udp_socket(skb);
}

The idea here would be that the fqc structure would keep track of the various
statistics required for fq_codel to do its things. Packets would be registered
coming into the queue and registered coming out of the queue. However, *before*
being added to the queue, codel would say whether or not to drop the packet
based on its flow. It seems like this setup would be much simpler, and fulfill
all the requirements, including sparse flows. The hash computation function
would be in the place with the most information, and could work per-peer,
per-flow.

Have I misunderstood something, or is this kind of simplification a real
possibility?

Jason


* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-05 12:39                   ` Jason A. Donenfeld
@ 2016-10-05 13:59                     ` Toke Høiland-Jørgensen
  2016-10-05 14:21                       ` Jonathan Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2016-10-05 13:59 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: cake, make-wifi-fast, WireGuard mailing list

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> On Sun, Oct 2, 2016 at 1:31 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>> You don't need a timer. You already have a signal for when more queue
>> space is available in the encryption step: When a packet finishes
>> encryption. All you need to do is try to enqueue another one at this
>> point.
>
> Oh, silly me. Yes of course. Voila:
> https://git.zx2c4.com/WireGuard/commit/?id=a0ad61c1a0e25a376e145f07ca27c605d3852bc4

Yup, that seems like the way to go about it :)

>> Also, I've completely ignored your existing bunching of packets after
>> encryption. I think you could still do that (i.e. wait to send things in
>> encryption_done() until you have a bunch of packets), as long as you
>> call schedule() for each packet that finishes encryption.
>
> But, in fact, I wait until all packets are done and sent to call
> schedule(). The reason is that sometimes schedule() does not queue,
> but rather immediately encrypts and sends the packet if it's small and
> if there's only one. Waiting until they're all sent out ensures packet
> ordering. It also allows the next batch of packets to be bundled up
> better, which provides much better throughput. While it seems like
> this would contribute unneccessary latency, the only times that
> packets are actually bundled in large quantities is when doing some
> sort of bulk TCP data transfer, where the kernel has access to a huge
> buffer and can form GSO superpackets. In these cases, latency is
> rarely the concern, since it's bulk transfer.

But you can have a bulk transfer and a latency sensitive flow going
through the same interface which would cause the latency sensitive flow
to be queued behind the bulk flow; exactly what we want to avoid.

A way to fix the reordering issue would be to make schedule() check if there
is already something queued up ahead of it and only do the 'send
immediately' optimisation if the queue is empty. This is similar to what
the qdisc layer does: If the qdisc is empty, just skip the whole thing
and transmit the packet immediately.
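
Roughly (names invented):

/* in xmit()/enqueue: only take the fast path when nothing else is waiting */
if (peer_queue_empty(peer) && !crypto_queue_backlogged(wg))
        encrypt_and_send_now(skb);   /* nothing ahead of us, no reordering risk */
else
        enqueue(skb);                /* normal path: FQ -> crypto queue */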

[..snip..]

>> Actually with this setup, things are analogous to a regular network
>> interface, where the encryption queue takes the place of the hardware.
>> So we want to keep the hardware (encryption) busy, but not have more
>> packets queued up there than is necessary to achieve this.
>
> That's a nice design you've outlined. One issue, however, with option
> 2 is that now I'm dropping packets after they're already been
> encrypted. This seems like it's a bit of a waste.

I wouldn't worry too much about the inefficiencies here. CoDel doesn't
really drop that many packets (most people seem to have wrong intuitions
about this). With normal TCP traffic, you'll only see on the order of
dozens of packet drops per *minute*. UDP floods are a concern, of
course, but CoDel doesn't deal well with those at all, so you'd probably
want a different mechanism for that.

> Option 1 avoids this, but then it doesn't take into account the
> encryption latency, which is the main source of bufferbloat.

I'm not sure this is actually correct. Or rather, it doesn't have to be.
If you use the DQL interface to limit the number of packets you submit
to encryption to the lowest number possible that keeps the CPU busy, you
should be able to reduce the latency of the actual encryption step to a
constant, and keep the queue in the FQ structure where it can be managed
by CoDel.

> Is there a hybrid option? Specifically, is there an API or a pattern
> to the tune of:
>
> struct fq_codel_stats fqc;
>
> xmit(struct sk_buff *skb) {
>     fq_hash_t flow_hash = fq_compute_flow_hash(iphdr(skb), ...);
>     codel_time_t time = codel_get_time_or_drop(&fqc, flow_hash);
>     if (!time) {
>         kfree_skb(skb);
>         return;
>     }
>
>     skb->cb.time = time;
>     skb->cb.flow_hash = flow_hash;
>
>     put_packet_in_encryption_queue(skb);
> }
>
> encryption_finished(struct sk_buff *skb) {
>     fq_codel_register_packet_dequeued(&fqc, skb->cb.time, skb->cb.flow_hash);
>     send_to_udp_socket(skb);
> }
>
> The idea here would be that the fqc structure would keep track of the
> various statistics required for fq_codel to do its things. Packets
> would be registered coming into the queue and registered coming out of
> the queue. However, *before* being added to the queue, codel would say
> whether or not to drop the packet based on its flow. It seems like
> this setup would be much simpler, and fulfill all the requirements,
> including sparse flows. The hash computation function would be in the
> place with the most information, and could work per-peer, per-flow.
>
> Have I misunderstood something, or is this kind of simplification a real
> possibility?

There's currently no API to do what you're describing with CoDel. You'd
basically want the control law to use the packet hash from the last
packet of the flow that was done with the encryption step. That might
work, I suppose, but it would require reworking the API somewhat.

However, the architecture you're describing (storing the packet hash in
the CB and using it after the encryption step) could be a way to get rid
of the problem with reordered packets causing CoDel to turn off that I
described earlier. With that architecture you could retain the CoDel
state variables per flow, which means different flows with different
queue lengths wouldn't confuse CoDel.
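
In rough code terms, something like this (the cb layout and the wg_flow
structure are invented for illustration; codel_time_t, struct codel_vars
and codel_get_time() are the real helpers from include/net/codel.h):

#include <net/codel.h>
#include <linux/skbuff.h>

/* Per-packet state carried through the encryption stage in skb->cb
 * (illustrative layout; it has to fit in the 48-byte cb area). */
struct wg_skb_cb {
	codel_time_t enqueue_time;	/* when the packet entered the queue */
	u32 flow_hash;			/* selects the per-flow CoDel state */
};
#define WG_SKB_CB(skb) ((struct wg_skb_cb *)(skb)->cb)

/* Hypothetical per-flow state kept alongside each FQ queue. */
struct wg_flow {
	struct codel_vars cvars;	/* separate CoDel state per flow */
	/* ... flow queue, byte count, etc. ... */
};

static void wg_stamp_packet(struct sk_buff *skb, u32 flow_hash)
{
	WG_SKB_CB(skb)->enqueue_time = codel_get_time();
	WG_SKB_CB(skb)->flow_hash = flow_hash;
}

After encryption, flow_hash picks out the right struct wg_flow again, so
the sojourn time of this packet is compared against its own flow's CoDel
state rather than whatever other flow happened to be in the queue with it.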


A suggestion might be to try implementing this, but retaining the drop
at the end (i.e. after encryption); then see how well this works,
including how many encrypted packets you are actually throwing away. And
if this does turn out to be a significant number, then start looking at
optimisations that move the drop to before the encryption step? I'm a
big fan of measuring before optimising in general :)

-Toke

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WireGuard] [Cake] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-10-05 13:59                     ` Toke Høiland-Jørgensen
@ 2016-10-05 14:21                       ` Jonathan Morton
  0 siblings, 0 replies; 17+ messages in thread
From: Jonathan Morton @ 2016-10-05 14:21 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: cake, make-wifi-fast, WireGuard mailing list


> On 5 Oct, 2016, at 16:59, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>
> UDP floods are a concern, of
> course, but CoDel doesn't deal well with those at all, so you'd probably
> want a different mechanism for that.

Cake happens to have addressed the UDP flood problem by combining Codel
with BLUE.  I call the combination COBALT.  BLUE acts on a
percentage-drop basis, which makes it much better at handling unbounded
flood traffic compared to Codel’s drops-per-second basis.

The way BLUE works means that it doesn’t kick in until Codel has
demonstrated complete inability to control the flow, which usually means
it’s an “unresponsive” flow (ie. it doesn’t respect normal congestion
control signals).  The two therefore work together very naturally.

It would also be reasonable to have the packet drops associated with
BLUE at enqueue time (ie. at the tail of the queue), so that they do not
occupy encryption bandwidth needlessly.  Codel could still act at the
queue head, which is the best position to send congestion signals from.
Cake doesn’t split the actions like that, but it doesn’t have a
particular need to.
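
Schematically, the enqueue-side half might look like this (not Cake's
actual code; the blue_vars structure and constants are invented, only
prandom_u32() and min_t() are real kernel helpers):

#include <linux/random.h>
#include <linux/kernel.h>

/* Invented per-flow BLUE state: p_drop is a fixed-point probability
 * where U32_MAX corresponds to "drop everything". */
struct blue_vars {
	u32 p_drop;
};

#define BLUE_STEP (U32_MAX / 256)	/* arbitrary step size for the sketch */

/* Consulted at enqueue time, before the packet costs any encryption. */
static bool blue_should_drop(const struct blue_vars *b)
{
	return prandom_u32() < b->p_drop;
}

/* Called when the flow overflows its queue or Codel gives up on it. */
static void blue_mark_congested(struct blue_vars *b)
{
	b->p_drop = min_t(u64, (u64)b->p_drop + BLUE_STEP, U32_MAX);
}

/* Called when the flow's queue drains completely. */
static void blue_mark_idle(struct blue_vars *b)
{
	b->p_drop -= min(b->p_drop, (u32)BLUE_STEP);
}

Codel would keep running unchanged at the head of the queue; BLUE only
bites once a flow has shown itself to be unresponsive.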

I think it’s definitely a good idea to have separate Codel (and
BLUE) state per flow, which means they *do* need to be disambiguated at
the place the drops occur.  Putting the queue ID in the cb along with
the enqueue time seems like a sensible way to do it.  This would mean
that sparse flows both don’t get signalled to unnecessarily, and
don’t disrupt the sojourn-time pattern of bulk flows.

 - Jonathan Morton

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-09-30 19:18 ` Dave Taht
@ 2016-09-30 20:12   ` Jason A. Donenfeld
  0 siblings, 0 replies; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 20:12 UTC (permalink / raw)
  To: Dave Taht; +Cc: cake, make-wifi-fast, WireGuard mailing list

Hi all,

On Fri, Sep 30, 2016 at 9:18 PM, Dave Taht <dave.taht@gmail.com> wrote:
> All: I've always dreamed of a vpn that could fq and - when it was
> bottlenecking on cpu - throw away packets intelligently. Wireguard,
> which is what jason & co are working on, is a really simple, elegant
> set of newer vpn ideas that currently has a queuing model designed to
> optimize for multi-cpu encryption, and not, so much, for managing
> worst case network behaviors, or fairness, or working on lower end
> hardware.

Would love any feedback and support for working on the queuing model
with WireGuard. I hear the bufferbloat folks are geniuses at that...

> Do do a git clone of the code, and take a look... somewhere on the
> wireguard list, or privately, jason'd pointed me at the relevant bits
> of the queuing model.

It was this post:
https://lists.zx2c4.com/pipermail/wireguard/2016-August/000378.html
Start reading from "There are a couple reasons" and finish at
"chunking them somehow." The rest can be disregarded.

Hope to hear from y'all soon!

Thanks,
Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
  2016-09-30 18:41 [WireGuard] " Jason A. Donenfeld
@ 2016-09-30 19:18 ` Dave Taht
  2016-09-30 20:12   ` Jason A. Donenfeld
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Taht @ 2016-09-30 19:18 UTC (permalink / raw)
  To: Jason A. Donenfeld, cake, make-wifi-fast; +Cc: WireGuard mailing list

Dear Jason:

Let me cross post, with a little background, for those not paying
attention on the other lists.

All: I've always dreamed of a vpn that could fq and - when it was
bottlenecking on cpu - throw away packets intelligently. Wireguard,
which is what jason & co are working on, is a really simple, elegant
set of newer vpn ideas that currently has a queuing model designed to
optimize for multi-cpu encryption, and not, so much, for managing
worst case network behaviors, or fairness, or working on lower end
hardware. There's a lede port for it that
topped out at (I think) about 16Mbits on weak hardware.

http://wireguard.io/ is really easy to compile and setup. I wrote a
bit about it in my blog as well (
http://blog.cerowrt.org/post/wireguard/ ) - and the fact that I spent
any time on it at all is symptomatic of my overall ADHD (and at the
time I was about to add a few more servers to the flent network and
didn't want to use tinc anymore).

But - As it turns out, the structure/basic concepts in the mac80211
implementation - the retry queue, the global fq_codel queue with per
station hash collision detection - seemed to match much of wireguard's
internal model, and I'd piqued jason's interest.

Do do a git clone of the code, and take a look... somewhere on the
wireguard list, or privately, jason'd pointed me at the relevant bits
of the queuing model.

On Fri, Sep 30, 2016 at 11:41 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Dave,
>
> I've been comparing graphs and bandwidth and so forth with flent's
> rrul and iperf3, trying to figure out what's going on.

A quick note on iperf3 - please see
http://burntchrome.blogspot.com/2016/09/iperf3-and-microbursts.html

There's a lesson in this, and in pacing in general: sending a giant
burst out of your retry queue after you finish negotiating the link is
a bad idea, and some sort of pacing mechanism might help. And rather than
pre-commenting here, I'll just include your last mail to these new
lists:

> Here's my
> present understanding of the queuing buffering issues. I sort of
> suspect these are issues that might not translate entirely well to the
> work you've been doing, but maybe I'm wrong. Here goes...
>
> 1. For each peer, there is a separate queue, called peer_queue. Each
> peer corresponds to a specific UDP endpoint, which means that a peer
> is a "flow".
> 2. When certain crypto handshake requirements haven't yet been met,
> packets pile up in peer_queue. Then when a handshake completes, all
> the packets that piled up are released. Because handshakes might take
> a while, peer_queue is quite big -- 1024 packets (dropping the oldest
> packets when full). In this context, that's not huge buffer bloat, but
> rather that's just a queue of packets for while the setup operation is
> occurring.
> 3. WireGuard is a net_device interface, which means it transmits
> packets from userspace in softirq. It's advertised as accepting GSO
> "super packets", so sometimes it is asked to transmit a packet that is
> 65k in length. When this happens, it splits those packets up into
> MTU-sized packets, puts them in the queue, and then processes the
> entire queue at once, immediately after.
>
> If that were the totality of things, I believe it would work quite
> well. If the description stopped there, it means packets would be
> encrypted and sent immediately in the softirq device transmit handler,
> just like how the mac80211 stack does things. The above existence of
> peer_queue wouldn't equate to any form of buffer bloat or latency
> issues, because it would just act as a simple data structure for
> immediately transmitting packets. Similarly, when receiving a packet
> from the UDP socket, we _could_ simply just decrypt in softirq, again
> like mac80211, as the packet comes in. This makes all the expensive
> crypto operations blocking to the initiator of the operation -- the
> userspace application calling send() or the udp socket receiving an
> encrypted packet. All is well.
>
> However, things get complicated and ugly when we add multi-core
> encryption and decryption. We add on to the above as follows:
>
> 4. The kernel has a library called padata (kernel/padata.c). You
> submit asynchronous jobs, which are then sent off to various CPUs in
> parallel, and then you're notified when the jobs are done, with the
> nice feature that you get these notifications in the same order that
> you submitted the jobs, so that packets don't get reordered. padata
> has a hard coded maximum of in-progress operations of 1000. We can
> artificially make this lower, if we want (currently we don't), but we
> can't make it higher.
> 5. We continue from the above described peer_queue, only this time
> instead of encrypting immediately in softirq, we simply send all of
> peer_queue off to padata. Since the actual work happens
> asynchronously, we return immediately, not spending cycles in softirq.
> When that batch of encryption jobs completes, we transmit the
> resultant encrypted packets. When we send those jobs off, it's
> possible padata already has 1000 operations in progress, in which case
> we get "-EBUSY", and can take one of two options: (a) put that packet
> back at the top of peer_queue, return from sending, and try again to
> send all of peer_queue the next time the user submits a packet, or (b)
> discard that packet, and keep trying to queue up the ones after.
> Currently we go with behavior (a).
> 6. Likewise, when receiving an encrypted packet from a UDP socket, we
> decrypt it asynchronously using padata. If there are already 1000
> operations in place, we drop the packet.
>
> If I change the length of peer_queue from 1024 to something small like
> 16, it has some effect when combined with choice (a) as opposed to
> choice (b), but I think this knob isn't so important, and I can leave
> it at 1024. However, if I change the length of padata's maximum from
> 1000 to something small like 16, I immediately get much lower latency.
> However, bandwidth suffers greatly, no matter choice (a) or choice
> (b). Padata's maximum seems to be the relevant knob. But I'm not sure
> the best way to tune it, nor am I sure the best way to interact with
> everything else here.
>
> I'm open to all suggestions, as at the moment I'm a bit in the dark on
> how to proceed. Simply saying "just throw fq_codel at it!" or "change
> your buffer lengths!" doesn't really help me much, as I believe the
> design is a bit more nuanced.
>
> Thanks,
> Jason



--
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
@ 2016-09-30 18:41 Jason A. Donenfeld
  2016-09-30 19:18 ` Dave Taht
  0 siblings, 1 reply; 17+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 18:41 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

Hey Dave,

I've been comparing graphs and bandwidth and so forth with flent's
rrul and iperf3, trying to figure out what's going on. Here's my
present understanding of the queuing buffering issues. I sort of
suspect these are issues that might not translate entirely well to the
work you've been doing, but maybe I'm wrong. Here goes...

1. For each peer, there is a separate queue, called peer_queue. Each
peer corresponds to a specific UDP endpoint, which means that a peer
is a "flow".
2. When certain crypto handshake requirements haven't yet been met,
packets pile up in peer_queue. Then when a handshake completes, all
the packets that piled up are released. Because handshakes might take
a while, peer_queue is quite big -- 1024 packets (dropping the oldest
packets when full). In this context, that's not huge buffer bloat, but
rather that's just a queue of packets held while the setup operation is
occurring.
3. WireGuard is a net_device interface, which means it transmits
packets from userspace in softirq. It's advertised as accepting GSO
"super packets", so sometimes it is asked to transmit a packet that is
65k in length. When this happens, it splits those packets up into
MTU-sized packets, puts them in the queue, and then processes the
entire queue at once, immediately after.
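
In rough code terms, point 3 amounts to something like this (a sketch
only; skb_gso_segment() is the real kernel helper, while the
peer_queue_tail() and process_peer_queue() names are stand-ins for the
real functions):

#include <linux/err.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t wg_xmit_sketch(struct sk_buff *skb, struct net_device *dev)
{
	struct sk_buff *segs, *next;

	if (skb_is_gso(skb)) {
		/* Split the 65k GSO superpacket into MTU-sized segments. */
		segs = skb_gso_segment(skb, 0);
		kfree_skb(skb);
		if (IS_ERR_OR_NULL(segs))
			return NETDEV_TX_OK;
	} else {
		segs = skb;
	}

	for (; segs; segs = next) {
		next = segs->next;
		segs->next = NULL;
		peer_queue_tail(dev, segs);	/* pile them up in peer_queue */
	}

	process_peer_queue(dev);		/* then drain the whole batch */
	return NETDEV_TX_OK;
}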

If that were the totality of things, I believe it would work quite
well. If the description stopped there, it means packets would be
encrypted and sent immediately in the softirq device transmit handler,
just like how the mac80211 stack does things. The above existence of
peer_queue wouldn't equate to any form of buffer bloat or latency
issues, because it would just act as a simple data structure for
immediately transmitting packets. Similarly, when receiving a packet
from the UDP socket, we _could_ simply just decrypt in softirq, again
like mac80211, as the packet comes in. This makes all the expensive
crypto operations blocking to the initiator of the operation -- the
userspace application calling send() or the udp socket receiving an
encrypted packet. All is well.

However, things get complicated and ugly when we add multi-core
encryption and decryption. We add on to the above as follows:

4. The kernel has a library called padata (kernel/padata.c). You
submit asynchronous jobs, which are then sent off to various CPUs in
parallel, and then you're notified when the jobs are done, with the
nice feature that you get these notifications in the same order that
you submitted the jobs, so that packets don't get reordered. padata
has a hard coded maximum of in-progress operations of 1000. We can
artificially make this lower, if we want (currently we don't), but we
can't make it higher.
5. We continue from the above described peer_queue, only this time
instead of encrypting immediately in softirq, we simply send all of
peer_queue off to padata. Since the actual work happens
asynchronously, we return immediately, not spending cycles in softirq.
When that batch of encryption jobs completes, we transmit the
resultant encrypted packets. When we send those jobs off, it's
possible padata already has 1000 operations in progress, in which case
we get "-EBUSY", and can take one of two options: (a) put that packet
back at the top of peer_queue, return from sending, and try again to
send all of peer_queue the next time the user submits a packet, or (b)
discard that packet, and keep trying to queue up the ones after.
Currently we go with behavior (a).
6. Likewise, when receiving an encrypted packet from a UDP socket, we
decrypt it asynchronously using padata. If there are already 1000
operations in place, we drop the packet.
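
For anyone not familiar with padata, the shape of points 4-6 is roughly
the following (a sketch; the wg_crypt_job wrapper and the encrypt/send
helpers are invented, but padata_do_parallel()/padata_do_serial() and
the -EBUSY behaviour are the real interface from kernel/padata.c):

#include <linux/kernel.h>
#include <linux/padata.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/smp.h>

struct wg_crypt_job {
	struct padata_priv padata;	/* padata job header */
	struct sk_buff *skb;
};

/* Runs on whatever CPU padata picked, possibly many in parallel. */
static void wg_encrypt_parallel(struct padata_priv *padata)
{
	struct wg_crypt_job *job = container_of(padata, struct wg_crypt_job, padata);

	encrypt_skb(job->skb);
	padata_do_serial(padata);	/* hand back in submission order */
}

/* Runs in submission order, so packets are not reordered on the wire. */
static void wg_encrypt_serial(struct padata_priv *padata)
{
	struct wg_crypt_job *job = container_of(padata, struct wg_crypt_job, padata);

	send_to_udp_socket(job->skb);
	kfree(job);
}

static int wg_queue_encryption(struct padata_instance *pinst, struct sk_buff *skb)
{
	struct wg_crypt_job *job = kzalloc(sizeof(*job), GFP_ATOMIC);

	if (!job)
		return -ENOMEM;
	job->skb = skb;
	job->padata.parallel = wg_encrypt_parallel;
	job->padata.serial = wg_encrypt_serial;

	/* Fails with -EBUSY once ~1000 jobs are in flight, which is where
	 * the choice between options (a) and (b) below comes in. */
	return padata_do_parallel(pinst, &job->padata, smp_processor_id());
}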

If I change the length of peer_queue from 1024 to something small like
16, it has some effect when combined with choice (a) as opposed to
choice (b), but I think this knob isn't so important, and I can leave
it at 1024. However, if I change the length of padata's maximum from
1000 to something small like 16, I immediately get much lower latency.
However, bandwidth suffers greatly, no matter choice (a) or choice
(b). Padata's maximum seems to be the relevant knob. But I'm not sure
the best way to tune it, nor am I sure the best way to interact with
everything else here.

I'm open to all suggestions, as at the moment I'm a bit in the dark on
how to proceed. Simply saying "just throw fq_codel at it!" or "change
your buffer lengths!" doesn't really help me much, as I believe the
design is a bit more nuanced.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-10-05 14:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-30 23:21 [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues Jason A. Donenfeld
2016-10-01 11:51 ` [WireGuard] [Cake] " Toke Høiland-Jørgensen
2016-10-01 15:40   ` Dave Taht
2016-10-01 15:51     ` Toke Høiland-Jørgensen
2016-10-01 17:19       ` Dave Taht
2016-10-01 17:28         ` Toke Høiland-Jørgensen
2016-10-01 22:44           ` Jason A. Donenfeld
2016-10-01 23:40             ` Toke Høiland-Jørgensen
2016-10-02  2:25               ` Jason A. Donenfeld
2016-10-02  2:40                 ` Jason A. Donenfeld
2016-10-02 11:31                 ` Toke Høiland-Jørgensen
2016-10-05 12:39                   ` Jason A. Donenfeld
2016-10-05 13:59                     ` Toke Høiland-Jørgensen
2016-10-05 14:21                       ` Jonathan Morton
  -- strict thread matches above, loose matches on Subject: below --
2016-09-30 18:41 [WireGuard] " Jason A. Donenfeld
2016-09-30 19:18 ` Dave Taht
2016-09-30 20:12   ` Jason A. Donenfeld
