Hi,

over the past few months, I was a GSoC[1] student in the WireGuard project (nominally under the Linux Foundation). Together with Thomas Gschwantner and Gauvain Roussel-Tarbouriech (who were also GSoC students), and of course Jason Donenfeld, I tried to implement some of the performance optimizations from the TODO list[2].

TL;DR: Here are the commits of mine that were merged during that time (not everything I wrote was merged):

- https://git.zx2c4.com/WireGuard/commit/?h=b289d122d8fe935de7138f77a8495dca76555b0e
- https://git.zx2c4.com/WireGuard/commit/?h=590c41089fd77bf43efda375c0e6bd3913cb9350
- https://git.zx2c4.com/WireGuard/commit/?h=0f8718b4fcbb89abc8b8b25c09510e1f7fac9625
- https://git.zx2c4.com/WireGuard/commit/?h=18822b84f88422f11611c598a0c23c56652693d9
- https://git.zx2c4.com/WireGuard/commit/?h=6008eacbf2c7a5f31b0c9d5d0a629cbdfbb8f222

Our first task was to change the trie (prefix tree) used in allowedips.c[3] to look up peers by their IP(v6) address, so that it works in native endianness (little endian on the x86 systems prevalent today) instead of big endian, which is how the bytes of an IP(v6) address are stored in a packet. This saves a few byte swaps per lookup in situations with many peers, because the number of swaps drops from one per trie edge walked to one per complete traversal of the trie (the first sketch below illustrates the idea). We implemented and discussed a few versions of this optimization; finally Jason committed his[4], along with a selftest to help ensure that the new implementation is correct[5]. I noticed that the debugging code that can print the allowedips trie in Graphviz[6] format was now wrong, and fixed it[7].

The next goal was to replace the spinlock-based ring buffer implementation[8] that WireGuard uses for its receive/transmit and encrypt/decrypt queues with a lock-free version. For this, we looked at ConcurrencyKit[9] and reimplemented its algorithms (the second sketch below shows the general flavor). However, after seeing no clear performance improvement, and in some cases serious performance degradation, we gave up on this goal. The code, along with a selftest[10], can be found on the jn/mpmc-wip branch[11].

Next, I made use[12] of the NAPI[13] infrastructure of the Linux networking subsystem to improve the performance of the receive path, with some success[14], but also a performance regression on some systems[15], which Jason appears to have solved by using GRO[16] before passing packets up the network stack[17] (the third sketch below outlines the general shape). Since my integration of NAPI relied on one NAPI context per WireGuard peer, and NAPI contexts are stored in a fixed-size hashtable in the kernel[18], Thomas disabled NAPI busy polling, which disables the use of this hashtable[19] and thus avoids overloading it when there are many WireGuard peers.

My last task was to change WireGuard's hashtables[20] from fixed-size hashtables to the resizable hashtable implementation (called "rhashtable") that is already in the upstream kernel[21]. This saves some memory when few items are stored, and avoids overloading, and thus slowing down, the hashtables when many items are stored. The main difficulty stems from the fact that WireGuard's pubkey_hashtable uses the SipHash function to be better protected against hash collision, or "hashtable poisoning", attacks[22], in which the hashed data is chosen in such a way that many items land in the same hash bucket, slowing down lookups. SipHash uses a 128-bit random seed to make the hashes less predictable, while the existing rhashtable API only allows 32 bits (the fourth sketch below illustrates the mismatch).
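To make the endianness idea concrete, here is a minimal, hypothetical sketch (not WireGuard's actual code), assuming a little-endian host and GCC/Clang builtins. The address is converted to native-endian words once per lookup, so the per-edge bit extraction needs no further swapping:

    #include <stdint.h>
    #include <string.h>

    /* Convert a big-endian 128-bit address into two native-endian
     * 64-bit words, once per lookup. (On a big-endian host, the
     * memcpy alone would suffice and no swap would be needed.) */
    static void swap_endian(uint64_t dst[2], const uint8_t src[16])
    {
        memcpy(dst, src, 16);
        dst[0] = __builtin_bswap64(dst[0]);
        dst[1] = __builtin_bswap64(dst[1]);
    }

    /* During the walk, pick the child according to the bit at `depth`,
     * using plain shifts and masks -- no byte swap per edge. */
    static int choose_child(const uint64_t ip[2], unsigned int depth)
    {
        return (ip[depth / 64] >> (63 - (depth % 64))) & 1;
    }

The point is that choose_child() runs once per trie edge, while swap_endian() runs only once per packet.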
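For flavor, here is a greatly simplified, hypothetical lock-free ring using C11 atomics. Note that it is single-producer/single-consumer, while the queues in question need the much trickier multi-producer/multi-consumer variant that we ported from ConcurrencyKit (see the branch[11] for the real code):

    #include <stdatomic.h>
    #include <stddef.h>

    #define RING_SIZE 256                /* power of two, so masking works */

    struct spsc_ring {
        _Atomic size_t head;             /* next slot the consumer reads  */
        _Atomic size_t tail;             /* next slot the producer writes */
        void *slots[RING_SIZE];
    };

    static int ring_produce(struct spsc_ring *r, void *item)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SIZE)
            return 0;                    /* full */
        r->slots[tail & (RING_SIZE - 1)] = item;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 1;
    }

    static void *ring_consume(struct spsc_ring *r)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return NULL;                 /* empty */
        void *item = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return item;
    }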
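Schematically, a per-peer NAPI integration looks something like the following kernel-style sketch (all names here are hypothetical stand-ins, not WireGuard's actual ones); napi_gro_receive() is where GRO coalesces packets before they go up the stack:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    struct peer_example {
        struct napi_struct napi;
        struct sk_buff_head rx_queue;         /* decrypted, not yet delivered */
    };

    static int peer_poll(struct napi_struct *napi, int budget)
    {
        struct peer_example *peer =
            container_of(napi, struct peer_example, napi);
        struct sk_buff *skb;
        int done = 0;

        while (done < budget && (skb = skb_dequeue(&peer->rx_queue))) {
            napi_gro_receive(napi, skb);      /* GRO before the stack */
            done++;
        }
        if (done < budget)
            napi_complete_done(napi, done);   /* queue drained; re-arm */
        return done;
    }

    /* Registered once per peer (4.18-era signature):
     *   netif_napi_add(dev, &peer->napi, peer_poll, NAPI_POLL_WEIGHT);
     * and napi_schedule(&peer->napi) is called when new packets arrive. */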
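The mismatch looks roughly like this (hypothetical entry struct and names; WireGuard's real peer struct differs): the hashfn callback of struct rhashtable_params only hands us a 32-bit seed, so SipHash's 128-bit key has to live somewhere else.

    #include <linux/rhashtable.h>
    #include <linux/siphash.h>

    struct peer_entry {                  /* hypothetical example entry */
        u8 pubkey[32];                   /* NOISE_PUBLIC_KEY_LEN bytes */
        struct rhash_head node;
    };

    static siphash_key_t pubkey_key;     /* 128 bits, from get_random_bytes() */

    static u32 pubkey_hashfn(const void *data, u32 len, u32 seed)
    {
        /* `seed` is only 32 bits -- too small for SipHash's key -- so
         * this sketch has to ignore it and use a separate global key.
         * Extending rhashtable to pass a larger seed removes this wart. */
        return (u32)siphash(data, len, &pubkey_key);
    }

    static const struct rhashtable_params pubkey_params = {
        .key_len     = 32,               /* NOISE_PUBLIC_KEY_LEN */
        .key_offset  = offsetof(struct peer_entry, pubkey),
        .head_offset = offsetof(struct peer_entry, node),
        .hashfn      = pubkey_hashfn,
    };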
Thus, to properly integrate SipHash into rhashtable, I need to extend the rhashtable API to accept a longer random seed. A work-in-progress version of this change can be found in the linux-dev repo on git.zx2c4.com[22].

Conclusion:

I have not reached all the goals that I initially set out to reach. In particular, here are some of the items that were part of my GSoC project proposal:

- Reducing bufferbloat with CoDel/DQL. Since WireGuard temporarily stores packets in queues, and queues can add latency when they are full, it would be interesting to try to limit the added latency with CoDel[23] or Linux's dynamic queue limits library (DQL)[24].

- CPU auto-scaling and NUMA awareness. Currently, WireGuard splits the encryption/decryption workload among all CPUs[25] that are currently online. For low packet rates it might be better to stick to one CPU, to avoid polluting the other CPUs' caches, and possibly to avoid waking the other CPUs from sleep states. On larger systems with two or more sets of CPUs, each with their own L3 cache and DDR memory controller, i.e. NUMA[26] systems, it might also be a good idea to confine WireGuard to one NUMA node, to save bus traffic.

- Performance profiling and benchmarking. I included performance profiling and benchmarking in my project proposal because serious performance optimization has to rely on benchmarks to find out which approach is *really* the fastest, and on profiling to figure out where the bottlenecks are. I did, however, do very little in this direction, beyond running iperf3.

Even though I did not do everything I intended to do, it was nice to bring WireGuard a bit further and to learn something about optimization and the Linux networking stack. Thanks to Jason Donenfeld, the Linux Foundation, and Google for providing this opportunity.
Jonathan Neuschäfer

[1]: https://summerofcode.withgoogle.com/
[2]: https://www.wireguard.com/todo/
[3]: https://git.zx2c4.com/WireGuard/tree/src/allowedips.c?h=0.0.20180420
[4]: https://git.zx2c4.com/WireGuard/commit/?h=5e3532eb1895ccdd0a53d14b47d7fc85234f7866
[5]: https://git.zx2c4.com/WireGuard/commit/?h=856f10593d2fc8b87992d9c5c556ddf108d313e8
[6]: http://graphviz.org/
[7]: https://git.zx2c4.com/WireGuard/commit/?h=b289d122d8fe935de7138f77a8495dca76555b0e
[8]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/ptr_ring.h?h=v4.18-rc8
[9]: https://github.com/concurrencykit/ck/blob/0.6.0/include/ck_ring.h
[10]: https://git.zx2c4.com/WireGuard/tree/src/selftest/mpmc_ring.h?h=jn/mpmc-wip
[11]: https://git.zx2c4.com/WireGuard/log/?h=jn/mpmc-wip
[12]: https://git.zx2c4.com/WireGuard/commit/?h=6008eacbf2c7a5f31b0c9d5d0a629cbdfbb8f222
[13]: https://en.wikipedia.org/wiki/New_API
[14]: https://lists.zx2c4.com/pipermail/wireguard/2018-July/003115.html
[15]: https://lists.zx2c4.com/pipermail/wireguard/2018-July/003124.html
[16]: https://lwn.net/Articles/358910/
[17]: https://git.zx2c4.com/WireGuard/commit/?h=95951af7249912a4356b9a03cf3addc7e3f8f724
[18]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/dev.c?h=v4.18-rc8#n193
[19]: https://git.zx2c4.com/WireGuard/commit/?h=fe5f0f661797b41648eac64b40e5038b25175047
[20]: https://git.zx2c4.com/WireGuard/tree/src/hashtables.h?h=0.0.20180802
[21]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/rhashtable.h?h=v4.18-rc8
[22]: https://git.zx2c4.com/linux-dev/log/?h=jn/rhashtable-next&id=6f450b7f8f722dec835fe96ff5b99be98cd5622e
[23]: https://en.wikipedia.org/wiki/CoDel
[24]: https://lwn.net/Articles/454390/
[25]: https://git.zx2c4.com/WireGuard/tree/src/queueing.h?h=0.0.20180802#n119
[26]: https://en.wikipedia.org/wiki/Non-uniform_memory_access