* BPF ringbuf misses notifications due to improper coherence
@ 2022-06-03 15:44 Tatsuyuki Ishi
  2022-06-13 15:28 ` [Resend] " Tatsuyuki Ishi
  0 siblings, 1 reply; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-03 15:44 UTC (permalink / raw)
  To: bpf; +Cc: andriin

By default, the BPF ringbuf delivers epoll notifications only when
userspace appears to have caught up to the last written entry. This is
done by comparing the consumer pointer to the head of the last written
entry; if they are equal, a notification is sent.

While implementing ringbuf support in aya [1], I observed that the
epoll loop sometimes gets stuck: it enters the wait state but never
receives the notification it is supposed to get. The implementation
originally mirrored libbpf's logic, in particular its use of acquire
and release memory operations. However, it turned out that a causal
(release-acquire) memory model is not sufficient, and a seq_cst store
is required to avoid the anomaly outlined below.

The release-acquire ordering permits the following anomaly (in a
simplified model where writing a new entry completes atomically,
without going through the busy bit):

kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
user  : write c 2 -> read p 2

This is because the release-acquire model allows stale reads, and in
the case above stale reads mean that no causal effect can prevent this
anomaly. To prevent it, a total ordering needs to be enforced on
producer and consumer writes. (Interestingly, it does not need to be
enforced on reads.)

If this is correct, then the fix needed right now is to make libbpf's
stores sequentially consistent. On the kernel side, however, we have
something odd, probably suboptimal, but still correct. The kernel uses
xchg when clearing the BUSY flag [2]. This does not sound necessary,
since making the written event visible requires only release ordering.
However, it is this xchg that provides the other half of the total
ordering that prevents the anomaly: it performs an smp_mb, essentially
upgrading the prior store to seq_cst. If that was actually the
intention, it would be a really obscure and hard-to-reason-about way
to implement coherence. I'd appreciate a clarification on this.

[1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
[2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-03 15:44 BPF ringbuf misses notifications due to improper coherence Tatsuyuki Ishi
@ 2022-06-13 15:28 ` Tatsuyuki Ishi
  2022-06-13 20:26   ` Andrii Nakryiko
  0 siblings, 1 reply; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-13 15:28 UTC (permalink / raw)
  To: bpf, andriin

By default, the BPF ringbuf delivers epoll notifications only when
userspace appears to have caught up to the last written entry. This is
done by comparing the consumer pointer to the head of the last written
entry; if they are equal, a notification is sent.

While implementing ringbuf support in aya [1], I observed that the
epoll loop sometimes gets stuck: it enters the wait state but never
receives the notification it is supposed to get. The implementation
originally mirrored libbpf's logic, in particular its use of acquire
and release memory operations. However, it turned out that a causal
(release-acquire) memory model is not sufficient, and a seq_cst store
is required to avoid the anomaly outlined below.

The release-acquire ordering permits the following anomaly (in a
simplified model where writing a new entry completes atomically,
without going through the busy bit):

kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
user  : write c 2 -> read p 2

This is because the release-acquire model allows stale reads, and in
the case above stale reads mean that no causal effect can prevent this
anomaly. To prevent it, a total ordering needs to be enforced on
producer and consumer writes. (Interestingly, it does not need to be
enforced on reads.)

If this is correct, then the fix needed right now is to make libbpf's
stores sequentially consistent. On the kernel side, however, we have
something odd, probably suboptimal, but still correct. The kernel uses
xchg when clearing the BUSY flag [2]. This does not sound necessary,
since making the written event visible requires only release ordering.
However, it is this xchg that provides the other half of the total
ordering that prevents the anomaly: it performs an smp_mb, essentially
upgrading the prior store to seq_cst. If that was actually the
intention, it would be a really obscure and hard-to-reason-about way
to implement coherence. I'd appreciate a clarification on this.

[1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
[2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-13 15:28 ` [Resend] " Tatsuyuki Ishi
@ 2022-06-13 20:26   ` Andrii Nakryiko
  2022-06-14 15:31     ` Tatsuyuki Ishi
  0 siblings, 1 reply; 6+ messages in thread
From: Andrii Nakryiko @ 2022-06-13 20:26 UTC (permalink / raw)
  To: Tatsuyuki Ishi; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

On Mon, Jun 13, 2022 at 11:42 AM Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:
>
> By default, the BPF ringbuf delivers epoll notifications only when
> userspace appears to have caught up to the last written entry. This is
> done by comparing the consumer pointer to the head of the last written
> entry; if they are equal, a notification is sent.
>
> While implementing ringbuf support in aya [1], I observed that the
> epoll loop sometimes gets stuck: it enters the wait state but never
> receives the notification it is supposed to get. The implementation
> originally mirrored libbpf's logic, in particular its use of acquire
> and release memory operations. However, it turned out that a causal
> (release-acquire) memory model is not sufficient, and a seq_cst store
> is required to avoid the anomaly outlined below.
>
> The release-acquire ordering permits the following anomaly (in a
> simplified model where writing a new entry completes atomically,
> without going through the busy bit):
>
> kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
> user  : write c 2 -> read p 2
>
> This is because the release-acquire model allows stale reads, and in
> the case above stale reads mean that no causal effect can prevent this
> anomaly. To prevent it, a total ordering needs to be enforced on
> producer and consumer writes. (Interestingly, it does not need to be
> enforced on reads.)
>
> If this is correct, then the fix needed right now is to make libbpf's
> stores sequentially consistent. On the kernel side, however, we have
> something odd, probably suboptimal, but still correct. The kernel uses
> xchg when clearing the BUSY flag [2]. This does not sound necessary,
> since making the written event visible requires only release ordering.
> However, it is this xchg that provides the other half of the total
> ordering that prevents the anomaly: it performs an smp_mb, essentially
> upgrading the prior store to seq_cst. If that was actually the
> intention, it would be a really obscure and hard-to-reason-about way
> to implement coherence. I'd appreciate a clarification on this.
>
> [1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
> [2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384


Hey Tatsuyuki,

Sorry for not getting back to you sooner; I haven't missed or
forgotten about this, it's just in my TODO queue and I haven't had
time to look at it seriously. No one has reported this for libbpf
before, but that could be because most people specify a timeout for
ring_buffer__poll() and so never noticed the issue.

I need a bit more time to page in all the context and recall the
semantics around smp_load_acquire()/smp_store_release() and the like.
As for xchg, if I remember correctly it was a deliberate choice; I
remember Paul suggesting that xchg is faster for what we needed in the
ringbuf.

I'd like to get to this in the next few days, but in the meantime,
have you tried benchmarking the impact of stricter memory ordering on
the libbpf side? Do you have example changes in mind for libbpf? I can
try benchmarking them on my side as well.

Thanks!


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-13 20:26   ` Andrii Nakryiko
@ 2022-06-14 15:31     ` Tatsuyuki Ishi
  2022-06-24 18:21       ` Andrii Nakryiko
  0 siblings, 1 reply; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-14 15:31 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

Hi Andrii,
Thanks for looking into this.

> I'd like to get to this in the next few days, but in the meantime,
> have you tried benchmarking the impact of stricter memory ordering on
> the libbpf side? Do you have example changes in mind for libbpf? I can
> try benchmarking them on my side as well.

I don't have a benchmark yet. I'll try to prepare a benchmark when I
have time to do so.

The proposed change to libbpf is simply to replace the two
smp_store_release() calls with smp_store_mb(). I just realized,
though, that the Linux kernel memory model has no direct mapping for
seq_cst loads and stores, so this will lead to a redundant barrier on
AArch64 and similar architectures.

Regards,
Tatsuyuki


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-14 15:31     ` Tatsuyuki Ishi
@ 2022-06-24 18:21       ` Andrii Nakryiko
  2022-06-26  9:55         ` Tatsuyuki Ishi
  0 siblings, 1 reply; 6+ messages in thread
From: Andrii Nakryiko @ 2022-06-24 18:21 UTC (permalink / raw)
  To: Tatsuyuki Ishi; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

On Tue, Jun 14, 2022 at 8:31 AM Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:
>
> Hi Andrii,
> Thanks for looking into this.
>
> > I'd like to get to this in the next few days, but in the meantime,
> > have you tried benchmarking the impact of stricter memory ordering on
> > the libbpf side? Do you have example changes in mind for libbpf? I can
> > try benchmarking them on my side as well.
>
> I don't have a benchmark yet. I'll try to prepare a benchmark when I
> have time to do so.
>
> The proposed change to libbpf is simply to replace the two
> smp_store_release() calls with smp_store_mb(). I just realized,
> though, that the Linux kernel memory model has no direct mapping for
> seq_cst loads and stores, so this will lead to a redundant barrier on
> AArch64 and similar architectures.

Hey Tatsuyuki,

Just following up on this. We have a number of benchmarks in
selftests/bpf/bench, so I ran them after replacing smp_store_release()
in the libbpf code with atomic_store(SEQ_CST). Generally there were no
significant performance differences, except for "Single-producer,
back-to-back mode (rb-libbpf case)". You can run these benchmarks
yourself from selftests/bpf with benchs/run_bench_ringbufs.sh.

But before we make any changes, can you please share a reproducer for
this issue? And on which architecture (x86-64? arm64?) did you manage
to reproduce it?

>
> Regards,
> Tatsuyuki


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-24 18:21       ` Andrii Nakryiko
@ 2022-06-26  9:55         ` Tatsuyuki Ishi
  0 siblings, 0 replies; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-26  9:55 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

Hi Andrii,

> But before we make any changes, can you please share a reproducer for
> this issue? And which architecture (x86-64? arm64?) did you manage to
> reproduce this issue on?

I have uploaded a reproducer here:
https://github.com/ishitatsuyuki/bpf-ringbuf-examples

First, note that I only modified the ringbuf-output sample, so please
cd into src and run `make ringbuf-output`.

Second, it turned out that libbpf is actually immune to the issue
because it uses epoll in level-triggered mode. Presumably, epoll first
does some internal locking that acts as a full memory barrier, then
re-checks the ringbuf pointers, which effectively prevents the
anomaly. When ET is used, epoll instead waits for the wakeup without
double-checking the ringbuf pointers, allowing the anomaly to happen.

In the reproducer above, I modified the libbpf submodule to use
EPOLLET so that it reproduces the phenomenon I was describing, which
also means this is not an issue in libbpf per se. Still, relying on
the implicit synchronization of epoll is rather ugly and can cause
confusion when re-implementing the logic with libbpf as the reference.

The reproducer should get "stuck" within a few seconds when running on
x86-64. I don't have an AArch64 machine to test on, but as far as I
understand, the acquire/release instructions there are sequentially
consistent with respect to each other, so this issue simply cannot
surface on AArch64.

Tatsuyuki


