From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <kernel-team@fb.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Jonathan Lemon <jonathan.lemon@gmail.com>
Subject: Re: [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it
Date: Thu, 14 May 2020 13:49:30 -0700	[thread overview]
Message-ID: <CAEf4BzZp7Hzni=eQKKodfi-W1R46ia2+cRfFG1W6Li5xRX5qJA@mail.gmail.com> (raw)
In-Reply-To: <20200514190649.ca4qugueh5sp32ax@ast-mbp.dhcp.thefacebook.com>

On Thu, May 14, 2020 at 12:06 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, May 13, 2020 at 12:25:27PM -0700, Andrii Nakryiko wrote:
> > +
> > +/* Given pointer to ring buffer record metadata, restore pointer to struct
> > + * bpf_ringbuf itself by using page offset stored at offset 4
> > + */
> > +static struct bpf_ringbuf *bpf_ringbuf_restore_from_rec(void *meta_ptr)
> > +{
> > +     unsigned long addr = (unsigned long)meta_ptr;
> > +     unsigned long off = *(u32 *)(meta_ptr + 4) << PAGE_SHIFT;
>
> Looking at the code further down, it seems this one should be READ_ONCE, but...
>

This will be called from commit(), which runs only after a successful
reserve(), and reserve() does a spin_unlock at the end. So in terms of
memory ordering everything should be fine? Or am I missing some
trickier aspect?
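
To illustrate the ordering I'm relying on (just a sketch of the
argument, using the snippets from the patch): the store and the load
happen in the same task, so program order alone covers it:

  /* reserve(), running in task T */
  *(u32 *)(meta_ptr + 4) = pg_off;
  ...
  spin_unlock_irqrestore(&rb->spinlock, flags);

  /* commit(), running later in the same task T */
  off = *(u32 *)(meta_ptr + 4);  /* plain load suffices */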

> > +
> > +     return (void*)((addr & PAGE_MASK) - off);
> > +}
> > +
> > +static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
> > +{
> > +     unsigned long cons_pos, prod_pos, new_prod_pos, flags;
> > +     u32 len, pg_off;
> > +     void *meta_ptr;
> > +
> > +     if (unlikely(size > UINT_MAX))
> > +             return NULL;
> > +
> > +     len = round_up(size + RINGBUF_META_SZ, 8);
>
> it may overflow despite the check above.

As Jonathan pointed out, the check above should enforce a max size of
at most UINT_MAX/4, due to the 2 bits reserved for BUSY and DISCARD.
So it shouldn't overflow.
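
Something like this for v2 (the constant name is tentative):

  /* cap record size so that the 2 high bits of the 32-bit length
   * prefix stay reserved for BUSY and DISCARD
   */
  #define RINGBUF_MAX_RECORD_SZ (UINT_MAX / 4)

  if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
          return NULL;
  len = round_up(size + RINGBUF_META_SZ, 8);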

>
> > +     cons_pos = READ_ONCE(rb->consumer_pos);
> > +
> > +     if (in_nmi()) {
> > +             if (!spin_trylock_irqsave(&rb->spinlock, flags))
> > +                     return NULL;
> > +     } else {
> > +             spin_lock_irqsave(&rb->spinlock, flags);
> > +     }
> > +
> > +     prod_pos = rb->producer_pos;
> > +     new_prod_pos = prod_pos + len;
> > +
> > +     /* check for out of ringbuf space by ensuring producer position
> > +      * doesn't advance more than (ringbuf_size - 1) ahead
> > +      */
> > +     if (new_prod_pos - cons_pos > rb->mask) {
> > +             spin_unlock_irqrestore(&rb->spinlock, flags);
> > +             return NULL;
> > +     }
> > +
> > +     meta_ptr = rb->data + (prod_pos & rb->mask);
> > +     pg_off = bpf_ringbuf_rec_pg_off(rb, meta_ptr);
> > +
> > +     WRITE_ONCE(*(u32 *)meta_ptr, RINGBUF_BUSY_BIT | size);
> > +     WRITE_ONCE(*(u32 *)(meta_ptr + 4), pg_off);
>
> it doesn't match a few other places where a normal read is done.
> But why WRITE_ONCE here?
> How does it race with anything?
> producer_pos is updated later.

Yeah, I think you're right. I left it as WRITE_ONCE from my initial
lockless variant, but it can just be a normal store; will fix.
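
I.e., roughly this for v2:

  /* plain stores are fine: the record is not visible to the consumer
   * until producer_pos is published below
   */
  *(u32 *)meta_ptr = RINGBUF_BUSY_BIT | size;
  *(u32 *)(meta_ptr + 4) = pg_off;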

>
> > +
> > +     /* ensure length prefix is written before updating producer positions */
> > +     smp_wmb();
>
> this barrier is enough to make sure meta_ptr and meta_ptr+4 init
> is visible before producer_pos is updated below.

yep, 100% agree


>
> > +     WRITE_ONCE(rb->producer_pos, new_prod_pos);

The consumer reads this with smp_load_acquire(), so I guess for
consistency I'll switch this to smp_store_release() then, right?
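
I.e., roughly this, pairing both sides (and I believe the release
store would also let us drop the separate smp_wmb() above, since it
orders all preceding stores):

  /* kernel producer side, at the end of reserve() */
  smp_store_release(&rb->producer_pos, new_prod_pos);

  /* consumer side, pairing with the above */
  prod_pos = smp_load_acquire(&rb->producer_pos);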

> > +
> > +     spin_unlock_irqrestore(&rb->spinlock, flags);
> > +
> > +     return meta_ptr + RINGBUF_META_SZ;
> > +}
> > +
> > +BPF_CALL_3(bpf_ringbuf_reserve, struct bpf_map *, map, u64, size, u64, flags)
> > +{
> > +     struct bpf_ringbuf_map *rb_map;
> > +
> > +     if (unlikely(flags))
> > +             return -EINVAL;
> > +
> > +     rb_map = container_of(map, struct bpf_ringbuf_map, map);
> > +     return (unsigned long)__bpf_ringbuf_reserve(rb_map->rb, size);
> > +}
> > +
> > +const struct bpf_func_proto bpf_ringbuf_reserve_proto = {
> > +     .func           = bpf_ringbuf_reserve,
> > +     .ret_type       = RET_PTR_TO_ALLOC_MEM_OR_NULL,
> > +     .arg1_type      = ARG_CONST_MAP_PTR,
> > +     .arg2_type      = ARG_CONST_ALLOC_SIZE_OR_ZERO,
> > +     .arg3_type      = ARG_ANYTHING,
> > +};
> > +
> > +static void bpf_ringbuf_commit(void *sample, bool discard)
> > +{
> > +     unsigned long rec_pos, cons_pos;
> > +     u32 new_meta, old_meta;
> > +     void *meta_ptr;
> > +     struct bpf_ringbuf *rb;
> > +
> > +     meta_ptr = sample - RINGBUF_META_SZ;
> > +     rb = bpf_ringbuf_restore_from_rec(meta_ptr);
> > +     old_meta = *(u32 *)meta_ptr;
>
> I think this one will race with user space and should be READ_ONCE.

This is never modified by user-space, only by the preceding
reserve(). Is that still considered a race by some tooling?

>
> > +     new_meta = old_meta ^ RINGBUF_BUSY_BIT;
> > +     if (discard)
> > +             new_meta |= RINGBUF_DISCARD_BIT;
> > +
> > +     /* update metadata header with correct final size prefix */
> > +     xchg((u32 *)meta_ptr, new_meta);
> > +
> > +     /* if consumer caught up and is waiting for our record, notify about
> > +      * new data availability
> > +      */
> > +     rec_pos = (void *)meta_ptr - (void *)rb->data;
> > +     cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;
>
> hmm. Earlier WRITE_ONCE(rb->producer_pos) is used, but here it's load_acquire.
> Please be consistent with pairing.

Wait, it's *consumer* vs *producer* positions. consumer_pos is
updated by the consumer using smp_store_release(), so that pair is
consistent. But, as I mentioned above, producer_pos does need to be
written with smp_store_release(). Does that address your concern?
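
To spell out the pairing I have in mind (a sketch; new_cons_pos is
just a placeholder name):

  /* consumer, after it is done processing a record */
  smp_store_release(&rb->consumer_pos, new_cons_pos);

  /* kernel side, in commit() */
  cons_pos = smp_load_acquire(&rb->consumer_pos) & rb->mask;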


>
> > +     if (cons_pos == rec_pos)
> > +             wake_up_all(&rb->waitq);
>
> Is it legal to do this from a preempt-disabled region?

I don't know, I couldn't find anything about this aspect. But looking
at the perf buffer code, all the wakeups are deferred through
irq_work_queue(), so I guess it's not safe then? I'll switch to
irq_work_queue().
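
Roughly like this (the field and callback names are tentative):

  static void bpf_ringbuf_notify(struct irq_work *work)
  {
          struct bpf_ringbuf *rb = container_of(work, struct bpf_ringbuf, work);

          wake_up_all(&rb->waitq);
  }

  /* at ringbuf allocation time */
  init_irq_work(&rb->work, bpf_ringbuf_notify);

  /* in commit(), instead of calling wake_up_all() directly */
  if (cons_pos == rec_pos)
          irq_work_queue(&rb->work);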

Thread overview: 34+ messages
2020-05-13 19:25 [PATCH bpf-next 0/6] BPF ring buffer Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 1/6] bpf: implement BPF ring buffer and verifier support for it Andrii Nakryiko
2020-05-13 20:57   ` kbuild test robot
2020-05-13 21:58   ` Alan Maguire
2020-05-14  5:59     ` Andrii Nakryiko
2020-05-14 22:25       ` Alan Maguire
2020-05-13 22:16   ` kbuild test robot
2020-05-14 16:50   ` Jonathan Lemon
2020-05-14 20:11     ` Andrii Nakryiko
2020-05-14 17:33   ` sdf
2020-05-14 20:18     ` Andrii Nakryiko
2020-05-14 20:53       ` sdf
2020-05-14 21:13         ` Andrii Nakryiko
2020-05-14 21:56           ` Stanislav Fomichev
2020-05-14 19:06   ` Alexei Starovoitov
2020-05-14 20:49     ` Andrii Nakryiko [this message]
2020-05-14 19:18   ` Jakub Kicinski
2020-05-14 20:39     ` Thomas Gleixner
2020-05-14 21:30       ` Andrii Nakryiko
2020-05-14 22:13         ` Paul E. McKenney
2020-05-14 22:56         ` Alexei Starovoitov
2020-05-14 23:06           ` Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 2/6] tools/memory-model: add BPF ringbuf MPSC litmus tests Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 3/6] bpf: track reference type in verifier Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 4/6] libbpf: add BPF ring buffer support Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 5/6] selftests/bpf: add BPF ringbuf selftests Andrii Nakryiko
2020-05-13 19:25 ` [PATCH bpf-next 6/6] bpf: add BPF ringbuf and perf buffer benchmarks Andrii Nakryiko
2020-05-13 22:49 ` [PATCH bpf-next 0/6] BPF ring buffer Jonathan Lemon
2020-05-14  6:08   ` Andrii Nakryiko
2020-05-14 16:30     ` Jonathan Lemon
2020-05-14 20:06       ` Andrii Nakryiko
