Re: [Question] bpf map value increase unexpectedly with tracepoint qdisc/qdisc_dequeue

From: Yafang Shao <laoar.shao@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Wu Zongyong <wuzongyong@linux.alibaba.com>, bpf <bpf@vger.kernel.org>
Subject: Re: [Question] bpf map value increase unexpectedly with tracepoint qdisc/qdisc_dequeue
Date: Thu, 14 Apr 2022 09:50:10 +0800	[thread overview]
Message-ID: <CALOAHbAvikKyrSTm6J9YSP0-7ifkxku2q2FeUsZMyxYdDdCqpg@mail.gmail.com> (raw)
In-Reply-To: <CAEf4BzYvBHwsFrp52ZqhP=H1WDdpEeovJcgctv2nioAvBg6FBw@mail.gmail.com>

On Thu, Apr 14, 2022 at 8:25 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Apr 13, 2022 at 12:08 PM Wu Zongyong
> <wuzongyong@linux.alibaba.com> wrote:
> >
> > Hi,
> >
> > I tried to count when tracepoint qdisc/qdisc_dequeue hit each time, then read
> > the count value from userspace by bpf_map_lookup_elem().
> > With bpftrace, I can see this tracepoint is hit about 700 times, but the count
> > I get from the bpf map is below 20. It's weird. Then I tried to add a bpf_printk()
> > to the program, and the count I get is normal which is about 700.
> >
> > The bpf program codes like that:
> >
> >         struct qdisc_dequeue_ctx {
> >                 __u64           __pad;
> >                 __u64           qdisc;
> >                 __u64           txq;
> >                 int             packets;
> >                 __u64           skbaddr;
> >                 int             ifindex;
> >                 __u32           handle;
> >                 __u32           parent;
> >                 unsigned long   txq_state;
> >         };
> >
> >         struct {
> >                 __uint(type, BPF_MAP_TYPE_HASH);
> >                 __type(key, int);
> >                 __type(value, __u32);
> >                 __uint(max_entries, 1);
> >                 __uint(pinning, LIBBPF_PIN_BY_NAME);
> >         } count_map SEC(".maps");
> >
> >         SEC("tracepoint/qdisc/qdisc_dequeue")
> >         int trace_dequeue(struct qdisc_dequeue_ctx *ctx)
> >         {
> >                 int key = 0;
> >                 __u32 *value;
> >                 __u32 init = 0;
> >
> >                 value = bpf_map_lookup_elem(&count_map, &key);
> >                 if (value) {
> >                         *value += 1;
> >                 } else {
> >                         bpf_printk("value reset");
> >                         bpf_map_update_elem(&count_map, &key, &init, 0);
> >                 }
> >                 return 0;
> >         }
> >
> > Any suggestion is appreciated!
> >
>
> First, why do you use HASH map for single-key map? ARRAY would make
> more sense for keys that are small integers. But I assume your real
> world use case needs bigger and more random keys, right?
>
> Second, you have two race conditions. One, you overwrite the value in
> the map with this bpf_map_update_elem(..., 0). Use BPF_NOEXISTS for
> initialization to avoid overwriting something that another CPU already
> set. Another one is your *value += 1 is non-atomic, so you are loosing
> updates as well. Use __sync_fetch_and_add(value, 1) for atomic
> increment.
>
> Something like this:
>
> value = bpf_map_lookup_elem(&count_map, &key);
> if (!value) {
>     /* BPF_NOEXIST won't allow to override the value that's already set */
>     bpf_map_update_elem(&count_map, &key, &init, BPF_NOEXISTS);
>     value = bpf_map_lookup_elem(&count_map, &key);
> }
> if (!value)
>     return 0;
>
> __sync_fetch_and_add(value, 1);
>

Hi Andrii,

I'm curious that if we should do it as below,

new_value = *value + 1;
bpf_map_update_elem(&count_map, &key, &new_value, BPF_EXIST);

in case someone, e.g. bpftool, may delete this elem in parallel.

BTW, why don't we have a helper like bpf_map_lookup_elem_value() which
returns the value directly instead of the addr of the value ?

-- 
Thanks
Yafang