* BPF ringbuf misses notifications due to improper coherence
@ 2022-06-03 15:44 Tatsuyuki Ishi
  2022-06-13 15:28 ` [Resend] " Tatsuyuki Ishi
  0 siblings, 1 reply; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-03 15:44 UTC (permalink / raw)
  To: bpf; +Cc: andriin

By default, the BPF ringbuf delivers epoll notifications only when
userspace appears to have caught up to the last written entry. This is
done by comparing the consumer pointer to the head of the last written
entry; if they are equal, a notification is sent.

While implementing ringbuf support in aya [1], I observed that the
epoll loop sometimes gets stuck: it enters the wait state but never
receives the notification it is supposed to get. The implementation
originally mirrored libbpf's logic, in particular its use of acquire
and release memory operations. However, it turned out that a causal
(release-acquire) memory model is not sufficient, and a seq_cst store
is required to avoid the anomaly outlined below.

The release-acquire ordering permits the following anomaly (in a
simplified model where writing a new entry completes atomically,
without going through the busy bit):

kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
user  : write c 2 -> read p 2

This is because the release-acquire model allows stale reads, and in
the case above stale reads mean that no causal effect can prevent this
anomaly. To prevent it, a total ordering needs to be enforced on
producer and consumer writes. (Interestingly, it does not need to be
enforced on reads.)

If this is correct, then the fix needed right now is to make libbpf's
stores sequentially consistent. On the kernel side, however, we have
something odd, probably suboptimal, but still correct. The kernel uses
xchg when clearing the BUSY flag [2]. This does not sound necessary,
since making the written event visible requires only release ordering.
However, it is this xchg that provides the other half of the total
ordering that prevents the anomaly: it performs an smp_mb, essentially
upgrading the prior store to seq_cst. If that was actually the
intention, it would be a really obscure and hard-to-reason-about way
to implement coherence. I'd appreciate a clarification on this.

[1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
[2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-03 15:44 BPF ringbuf misses notifications due to improper coherence Tatsuyuki Ishi
@ 2022-06-13 15:28 ` Tatsuyuki Ishi
  2022-06-13 20:26   ` Andrii Nakryiko
  0 siblings, 1 reply; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-13 15:28 UTC (permalink / raw)
  To: bpf, andriin

By default, the BPF ringbuf delivers epoll notifications only when
userspace appears to have caught up to the last written entry. This is
done by comparing the consumer pointer to the head of the last written
entry; if they are equal, a notification is sent.

While implementing ringbuf support in aya [1], I observed that the
epoll loop sometimes gets stuck: it enters the wait state but never
receives the notification it is supposed to get. The implementation
originally mirrored libbpf's logic, in particular its use of acquire
and release memory operations. However, it turned out that a causal
(release-acquire) memory model is not sufficient, and a seq_cst store
is required to avoid the anomaly outlined below.

The release-acquire ordering permits the following anomaly (in a
simplified model where writing a new entry completes atomically,
without going through the busy bit):

kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
user  : write c 2 -> read p 2

This is because the release-acquire model allows stale reads, and in
the case above stale reads mean that no causal effect can prevent this
anomaly. To prevent it, a total ordering needs to be enforced on
producer and consumer writes. (Interestingly, it does not need to be
enforced on reads.)

If this is correct, then the fix needed right now is to make libbpf's
stores sequentially consistent. On the kernel side, however, we have
something odd, probably suboptimal, but still correct. The kernel uses
xchg when clearing the BUSY flag [2]. This does not sound necessary,
since making the written event visible requires only release ordering.
However, it is this xchg that provides the other half of the total
ordering that prevents the anomaly: it performs an smp_mb, essentially
upgrading the prior store to seq_cst. If that was actually the
intention, it would be a really obscure and hard-to-reason-about way
to implement coherence. I'd appreciate a clarification on this.

[1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
[2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-13 15:28 ` [Resend] " Tatsuyuki Ishi
@ 2022-06-13 20:26   ` Andrii Nakryiko
  2022-06-14 15:31     ` Tatsuyuki Ishi
  0 siblings, 1 reply; 6+ messages in thread
From: Andrii Nakryiko @ 2022-06-13 20:26 UTC (permalink / raw)
  To: Tatsuyuki Ishi; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

On Mon, Jun 13, 2022 at 11:42 AM Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:
>
> By default, the BPF ringbuf delivers epoll notifications only when
> userspace appears to have caught up to the last written entry. This is
> done by comparing the consumer pointer to the head of the last written
> entry; if they are equal, a notification is sent.
>
> While implementing ringbuf support in aya [1], I observed that the
> epoll loop sometimes gets stuck: it enters the wait state but never
> receives the notification it is supposed to get. The implementation
> originally mirrored libbpf's logic, in particular its use of acquire
> and release memory operations. However, it turned out that a causal
> (release-acquire) memory model is not sufficient, and a seq_cst store
> is required to avoid the anomaly outlined below.
>
> The release-acquire ordering permits the following anomaly (in a
> simplified model where writing a new entry completes atomically,
> without going through the busy bit):
>
> kernel: write p 2 -> read c X -> write p 3 -> read c 1 (X doesn't matter)
> user  : write c 2 -> read p 2
>
> This is because the release-acquire model allows stale reads, and in
> the case above stale reads mean that no causal effect can prevent this
> anomaly. To prevent it, a total ordering needs to be enforced on
> producer and consumer writes. (Interestingly, it does not need to be
> enforced on reads.)
>
> If this is correct, then the fix needed right now is to make libbpf's
> stores sequentially consistent. On the kernel side, however, we have
> something odd, probably suboptimal, but still correct. The kernel uses
> xchg when clearing the BUSY flag [2]. This does not sound necessary,
> since making the written event visible requires only release ordering.
> However, it is this xchg that provides the other half of the total
> ordering that prevents the anomaly: it performs an smp_mb, essentially
> upgrading the prior store to seq_cst. If that was actually the
> intention, it would be a really obscure and hard-to-reason-about way
> to implement coherence. I'd appreciate a clarification on this.
>
> [1]: https://github.com/aya-rs/aya/pull/294#issuecomment-1144385687
> [2]: https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/kernel/bpf/ringbuf.c#L384


Hey Tatsuyuki,

Sorry for not getting back to you sooner; I haven't missed or
forgotten about this, it's just in my TODO queue and I haven't had
time to look at it seriously. No one has reported this for libbpf
before, but that could be because most people specify a timeout for
ring_buffer__poll() and so never noticed the issue.

I need a bit more time to page in all the context and recall the
semantics around smp_load_acquire()/smp_store_release() and the like.
As for xchg, if I remember correctly it was a deliberate choice; I
remember Paul suggesting that xchg is faster for what we needed in the
ringbuf.

I'd like to get to this in the next few days, but in the meantime,
have you tried benchmarking the impact of stricter memory ordering on
the libbpf side? Do you have example changes in mind for libbpf? I can
try benchmarking them on my side as well.

Thanks!


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-13 20:26   ` Andrii Nakryiko
@ 2022-06-14 15:31     ` Tatsuyuki Ishi
  2022-06-24 18:21       ` Andrii Nakryiko
  0 siblings, 1 reply; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-14 15:31 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

Hi Andrii,
Thanks for looking into this.

> I'd like to get to this in the next few days, but in the meantime,
> have you tried benchmarking the impact of stricter memory ordering on
> the libbpf side? Do you have example changes in mind for libbpf? I can
> try benchmarking them on my side as well.

I don't have a benchmark yet. I'll try to prepare a benchmark when I
have time to do so.

The proposed change to libbpf is simply to replace the two
smp_store_release() calls with smp_store_mb(). I just realized,
though, that the Linux kernel memory model has no direct mapping for
seq_cst loads and stores, so this will lead to a redundant barrier on
AArch64 and similar architectures.

Regards,
Tatsuyuki


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-14 15:31     ` Tatsuyuki Ishi
@ 2022-06-24 18:21       ` Andrii Nakryiko
  2022-06-26  9:55         ` Tatsuyuki Ishi
  0 siblings, 1 reply; 6+ messages in thread
From: Andrii Nakryiko @ 2022-06-24 18:21 UTC (permalink / raw)
  To: Tatsuyuki Ishi; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

On Tue, Jun 14, 2022 at 8:31 AM Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:
>
> Hi Andrii,
> Thanks for looking into this.
>
> > I'd like to get to this in the next few days, but in the meantime,
> > have you tried benchmarking the impact of stricter memory ordering on
> > the libbpf side? Do you have example changes in mind for libbpf? I can
> > try benchmarking them on my side as well.
>
> I don't have a benchmark yet. I'll try to prepare a benchmark when I
> have time to do so.
>
> The proposed change to libbpf is simply to replace the two
> smp_store_release() calls with smp_store_mb(). I just realized,
> though, that the Linux kernel memory model has no direct mapping for
> seq_cst loads and stores, so this will lead to a redundant barrier on
> AArch64 and similar architectures.

Hey Tatsuyuki,

Just following up on this. We have a number of benchmarks in
selftests/bpf/bench, so I ran them after replacing smp_store_release()
in the libbpf code with atomic_store(SEQ_CST). Generally there were no
significant performance differences, except for "Single-producer,
back-to-back mode (rb-libbpf case)". You can run these benchmarks
yourself from selftests/bpf with benchs/run_bench_ringbufs.sh.

But before we make any changes, can you please share a reproducer for
this issue? And on which architecture (x86-64? arm64?) did you manage
to reproduce it?

>
> Regards,
> Tatsuyuki


* Re: [Resend] BPF ringbuf misses notifications due to improper coherence
  2022-06-24 18:21       ` Andrii Nakryiko
@ 2022-06-26  9:55         ` Tatsuyuki Ishi
  0 siblings, 0 replies; 6+ messages in thread
From: Tatsuyuki Ishi @ 2022-06-26  9:55 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Andrii Nakryiko, Paul E . McKenney

Hi Andrii,

> But before we make any changes, can you please share a reproducer for
> this issue? And which architecture (x86-64? arm64?) did you manage to
> reproduce this issue on?

I have uploaded a reproducer here:
https://github.com/ishitatsuyuki/bpf-ringbuf-examples

First, note that I only modified the ringbuf-output sample, so please
cd into src and run `make ringbuf-output`.

Second, it turned out that libbpf is actually immune to the issue
because it uses epoll in level-triggered mode. Presumably, epoll first
does some internal locking that acts as a full memory barrier, then
re-checks the ringbuf pointers, which effectively prevents the
anomaly. When ET is used, epoll instead waits for the wakeup without
double-checking the ringbuf pointers, allowing the anomaly to happen.

In the reproducer above, I modified the libbpf submodule to use
EPOLLET so that it reproduces the phenomenon I was describing, which
also means this is not an issue in libbpf per se. Still, relying on
the implicit synchronization of epoll is rather ugly and can cause
confusion when re-implementing the logic with libbpf as the reference.

The reproducer should get "stuck" within a few seconds when running on
x86-64. I don't have an AArch64 machine to test on, but as far as I
understand, the acquire/release instructions there are sequentially
consistent with respect to each other, so this issue simply cannot
surface on AArch64.

Tatsuyuki


