Hey Jason,

First off, thanks for this wonderful project.

I have a question/comment regarding a bit of kernel code.

Background:
src/receive.c has the following code (lightly edited):

---
static u64 last_under_load;
bool under_load;

under_load = ...;
if (under_load)
    last_under_load = ktime_get_boot_fast_ns();
else
    under_load = !wg_birthdate_has_expired(last_under_load, 1);

---

after which the code uses 'under_load' to determine whether
or not to demand that handshake cookies be present.

The comment above 'last_under_load' says we don't care
about races on that unsynchronized global. I assume the rationale
here is that updates to last_under_load are always values from
ktime_get_boot_fast_ns(), and therefore observing a value
produced by a different cpu core won't produce meaningfully
different behavior. I agree that this is true on 64-bit hardware,
but I disagree that the race here is benign on 32-bit systems.
If the compiler decides to access the 8-byte storage with two
32-bit accesses, then it's possible that another thread could
observe the intermediate state in which one of the two words
has been updated. (I think you're most likely to observe this
behavior if wg_receive_handshake_packet is preempted, since
last_under_load shouldn't span cache lines.)
The thread that observes a 'torn' write to last_under_load may
not compute under_load as desired, and consequently drop a
handshake packet that it would have otherwise accepted.

I don't think performance would change meaningfully if
access to last_under_load was mediated with
atomic64_read()/atomic64_set(). On ARM, for example,
those macros will typically expand to a single ldrd/strd instruction, which is what you would want.

Let me know if I've analyzed this problem incorrectly.

Thanks!
Phil