Hey Jason,

First off, thanks for this wonderful project. I have a question/comment
regarding a bit of kernel code.

Background: src/receive.c has the following code (lightly edited):

---
static u64 last_under_load;
bool under_load;

under_load = ...;
if (under_load)
	last_under_load = ktime_get_boot_fast_ns();
else
	under_load = !wg_birthdate_has_expired(last_under_load, 1);
---

after which the code uses 'under_load' to determine whether or not to
demand that handshake cookies be present.

The comment above 'last_under_load' says we don't care about races on
that unsynchronized global. I assume the rationale here is that updates
to last_under_load are always values from ktime_get_boot_fast_ns(), and
therefore observing a value produced by a different CPU core won't
produce meaningfully different behavior.

I agree that this is true on 64-bit hardware, but I disagree that the
race here is benign on 32-bit systems. If the compiler decides to access
the 8-byte storage with two 32-bit accesses, then it's possible that
another thread could observe the intermediate state in which only one of
the two words has been updated. (I think you're most likely to observe
this behavior if wg_receive_handshake_packet is preempted between the
two stores, since last_under_load shouldn't span cache lines.)

The thread that observes a 'torn' write to last_under_load may not
compute under_load as desired, and may consequently drop a handshake
packet that it would otherwise have accepted.

I don't think performance would change meaningfully if access to
last_under_load were mediated with atomic64_read()/atomic64_set(). On
ARM, for example, those macros will typically expand to a single
ldrd/strd instruction, which is what you would want.

Let me know if I've analyzed this problem incorrectly.

Thanks!
Phil
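
P.S. To make the concern concrete, here is a minimal user-space sketch of
the idea, not the actual kernel code. It uses C11 atomics as a stand-in
for atomic64_read()/atomic64_set(). The names compute_under_load(),
birthdate_has_expired() and torn_value() are hypothetical, the expiry
check only approximates what I assume wg_birthdate_has_expired() does,
and the current time is passed in as a parameter so the behavior is
deterministic.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the kernel's 'static u64 last_under_load'. */
static _Atomic uint64_t last_under_load;

/*
 * Assumed shape of the expiry check: true once 'expire_seconds' have
 * elapsed since 'birthday_ns'. The real helper reads the clock itself.
 */
static bool birthdate_has_expired(uint64_t birthday_ns,
                                  uint64_t expire_seconds, uint64_t now_ns)
{
    return (int64_t)(birthday_ns + expire_seconds * NSEC_PER_SEC) <=
           (int64_t)now_ns;
}

/*
 * The under_load logic, with every access to the shared timestamp done
 * as one indivisible 64-bit load or store, so a reader never sees a
 * half-updated value.
 */
static bool compute_under_load(bool queue_is_busy, uint64_t now_ns)
{
    bool under_load = queue_is_busy;

    if (under_load) {
        atomic_store_explicit(&last_under_load, now_ns,
                              memory_order_relaxed);
    } else {
        uint64_t last = atomic_load_explicit(&last_under_load,
                                             memory_order_relaxed);
        under_load = !birthdate_has_expired(last, 1, now_ns);
    }
    return under_load;
}

/*
 * One possible torn read: the high word of the new value combined with
 * the stale low word of the old value.
 */
static uint64_t torn_value(uint64_t old_val, uint64_t new_val)
{
    return (new_val & 0xFFFFFFFF00000000ULL) |
           (old_val & 0x00000000FFFFFFFFULL);
}
```

As an example of the failure mode: if the timestamp advances from
0x1FFFFFFFF to 0x200000000 and a reader sees the new high word with the
old low word, it reads 0x2FFFFFFFF, which is roughly 4.29 seconds in the
future. Two seconds later the real timestamp would have expired, but the
torn one has not, so that reader would still consider itself under load.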