Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf

bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats
       [not found] <20200314003518.3114452-1-songliubraving@fb.com>
@ 2020-03-14  2:43 ` Alexei Starovoitov
  2020-03-14  3:47   ` Song Liu
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2020-03-14  2:43 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-fsdevel, linux-kernel, netdev, bpf, kernel-team, ast,
	daniel, mcgrof, keescook, yzaikin, peterz, bristot, mingo

On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
> Motivation (copied from 2/2):
> 
> ======================= 8< =======================
> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
> Typical userspace tools use kernel.bpf_stats_enabled as follows:
> 
>   1. Enable kernel.bpf_stats_enabled;
>   2. Check program run_time_ns;
>   3. Sleep for the monitoring period;
>   4. Check program run_time_ns again, calculate the difference;
>   5. Disable kernel.bpf_stats_enabled.
> 
> The problem with this approach is that only one userspace tool can toggle
> this sysctl. If multiple tools toggle the sysctl at the same time, the
> measurement may be inaccurate.
> 
> To fix this problem while keep backward compatibility, introduce
> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
> to the static key for each open fd. The runtime stats is enabled when
> kernel.bpf_stats_enabled == 1 or there is open fd to /dev/bpf_stats.
> 
> With /dev/bpf_stats, user space tool would have the following flow:
> 
>   1. Open a fd to /dev/bpf_stats;
>   2. Check program run_time_ns;
>   3. Sleep for the monitoring period;
>   4. Check program run_time_ns again, calculate the difference;
>   5. Close the fd.
> ======================= 8< =======================
> 
> 1/2 adds a few new API to jump_label.
> 2/2 adds the /dev/bpf_stats and adjust kernel.bpf_stats_enabled handler.
> 
> Please share your comments.

Conceptually makes sense to me. Few comments:
1. I don't understand why +2 logic is necessary.
Just do +1 for every FD and change proc_do_static_key() from doing
explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
The handler would need to check that 1->1 and 0->0 is a nop.

2. /dev is kinda awkward. May be introduce a new bpf command that returns fd?

3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
 Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
 and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
 it back to zero where it will stay zero even if users keep spamming
 sysctl kernel.bpf_stats_enabled=0.
 This way current services that use sysctl will keep working as-is.
 Multiple services that currently collide on sysctl will magically start
 working without any changes to them. It is still backwards compatible.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats
  2020-03-14  2:43 ` [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats Alexei Starovoitov
@ 2020-03-14  3:47   ` Song Liu
  2020-03-14 15:57     ` Alexei Starovoitov
  0 siblings, 1 reply; 4+ messages in thread
From: Song Liu @ 2020-03-14  3:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-fsdevel, linux-kernel, netdev, bpf, Kernel Team, ast,
	daniel, mcgrof, keescook, yzaikin, peterz, bristot, mingo



> On Mar 13, 2020, at 7:43 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
>> Motivation (copied from 2/2):
>> 
>> ======================= 8< =======================
>> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
>> Typical userspace tools use kernel.bpf_stats_enabled as follows:
>> 
>>  1. Enable kernel.bpf_stats_enabled;
>>  2. Check program run_time_ns;
>>  3. Sleep for the monitoring period;
>>  4. Check program run_time_ns again, calculate the difference;
>>  5. Disable kernel.bpf_stats_enabled.
>> 
>> The problem with this approach is that only one userspace tool can toggle
>> this sysctl. If multiple tools toggle the sysctl at the same time, the
>> measurement may be inaccurate.
>> 
>> To fix this problem while keep backward compatibility, introduce
>> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
>> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
>> to the static key for each open fd. The runtime stats is enabled when
>> kernel.bpf_stats_enabled == 1 or there is open fd to /dev/bpf_stats.
>> 
>> With /dev/bpf_stats, user space tool would have the following flow:
>> 
>>  1. Open a fd to /dev/bpf_stats;
>>  2. Check program run_time_ns;
>>  3. Sleep for the monitoring period;
>>  4. Check program run_time_ns again, calculate the difference;
>>  5. Close the fd.
>> ======================= 8< =======================
>> 
>> 1/2 adds a few new API to jump_label.
>> 2/2 adds the /dev/bpf_stats and adjust kernel.bpf_stats_enabled handler.
>> 
>> Please share your comments.
> 
> Conceptually makes sense to me. Few comments:
> 1. I don't understand why +2 logic is necessary.
> Just do +1 for every FD and change proc_do_static_key() from doing
> explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
> The handler would need to check that 1->1 and 0->0 is a nop.

With the +2/-2 logic, we use the lowest bit of the counter to remember 
the value of the sysctl. Otherwise, we cannot tell whether we are making
0->1 transition or 1->1 transition. 

> 
> 2. /dev is kinda awkward. May be introduce a new bpf command that returns fd?

Yeah, I also feel /dev is awkward. fd from bpf command sounds great. 

> 
> 3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
> Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
> and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
> it back to zero where it will stay zero even if users keep spamming
> sysctl kernel.bpf_stats_enabled=0.
> This way current services that use sysctl will keep working as-is.
> Multiple services that currently collide on sysctl will magically start
> working without any changes to them. It is still backwards compatible.

I think this is not fully backwards compatible. With current logic, the 
following sequence disables stats eventually. 

  sysctl kernel.bpf_stats_enabled=1
  sysctl kernel.bpf_stats_enabled=1
  sysctl kernel.bpf_stats_enabled=0

The same sequence will not disable stats with the ++/-- sysctl. 

Thanks,
Song




^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats
  2020-03-14  3:47   ` Song Liu
@ 2020-03-14 15:57     ` Alexei Starovoitov
  2020-03-16  5:52       ` Song Liu
  0 siblings, 1 reply; 4+ messages in thread
From: Alexei Starovoitov @ 2020-03-14 15:57 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-fsdevel, linux-kernel, netdev, bpf, Kernel Team, ast,
	daniel, mcgrof, keescook, yzaikin, peterz, bristot, mingo

On Sat, Mar 14, 2020 at 03:47:50AM +0000, Song Liu wrote:
> 
> 
> > On Mar 13, 2020, at 7:43 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
> >> Motivation (copied from 2/2):
> >> 
> >> ======================= 8< =======================
> >> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
> >> Typical userspace tools use kernel.bpf_stats_enabled as follows:
> >> 
> >>  1. Enable kernel.bpf_stats_enabled;
> >>  2. Check program run_time_ns;
> >>  3. Sleep for the monitoring period;
> >>  4. Check program run_time_ns again, calculate the difference;
> >>  5. Disable kernel.bpf_stats_enabled.
> >> 
> >> The problem with this approach is that only one userspace tool can toggle
> >> this sysctl. If multiple tools toggle the sysctl at the same time, the
> >> measurement may be inaccurate.
> >> 
> >> To fix this problem while keep backward compatibility, introduce
> >> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
> >> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
> >> to the static key for each open fd. The runtime stats is enabled when
> >> kernel.bpf_stats_enabled == 1 or there is open fd to /dev/bpf_stats.
> >> 
> >> With /dev/bpf_stats, user space tool would have the following flow:
> >> 
> >>  1. Open a fd to /dev/bpf_stats;
> >>  2. Check program run_time_ns;
> >>  3. Sleep for the monitoring period;
> >>  4. Check program run_time_ns again, calculate the difference;
> >>  5. Close the fd.
> >> ======================= 8< =======================
> >> 
> >> 1/2 adds a few new API to jump_label.
> >> 2/2 adds the /dev/bpf_stats and adjust kernel.bpf_stats_enabled handler.
> >> 
> >> Please share your comments.
> > 
> > Conceptually makes sense to me. Few comments:
> > 1. I don't understand why +2 logic is necessary.
> > Just do +1 for every FD and change proc_do_static_key() from doing
> > explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
> > The handler would need to check that 1->1 and 0->0 is a nop.
> 
> With the +2/-2 logic, we use the lowest bit of the counter to remember 
> the value of the sysctl. Otherwise, we cannot tell whether we are making
> 0->1 transition or 1->1 transition. 

that can be another static int var in the handler.
and no need for patch 1.

> > 
> > 2. /dev is kinda awkward. May be introduce a new bpf command that returns fd?
> 
> Yeah, I also feel /dev is awkward. fd from bpf command sounds great. 
> 
> > 
> > 3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
> > Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
> > and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
> > it back to zero where it will stay zero even if users keep spamming
> > sysctl kernel.bpf_stats_enabled=0.
> > This way current services that use sysctl will keep working as-is.
> > Multiple services that currently collide on sysctl will magically start
> > working without any changes to them. It is still backwards compatible.
> 
> I think this is not fully backwards compatible. With current logic, the 
> following sequence disables stats eventually. 
> 
>   sysctl kernel.bpf_stats_enabled=1
>   sysctl kernel.bpf_stats_enabled=1
>   sysctl kernel.bpf_stats_enabled=0
> 
> The same sequence will not disable stats with the ++/-- sysctl. 

sure, but if a process holding an fd 'sysctl kernel.bpf_stats_enabled=0'
won't disable stats either. So it's also not backwards compatible. imo it's a
change in behavior whichever way, but either approach doesn't break user space.
An advantage of not doing an fd is that some user that really wants to have
stats disabled for performance benchmarking can do
'sysctl kernel.bpf_stats_enabled=0' few times and the stats will be off.
We can also make 'sysctl kernel.bpf_stats_enabled' to return current counter,
so humans can see how many daemons are doing stats collection at that very
moment.
We can also do both new fd via bpf syscall and ++/-- via sysctl, but imo
++/-- via sysctl is enough to address the issue of multiple stats collecting
daemons. The patch would be small enough that we can push it via bpf tree
and into older kernels as arguable 'fix'.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats
  2020-03-14 15:57     ` Alexei Starovoitov
@ 2020-03-16  5:52       ` Song Liu
  0 siblings, 0 replies; 4+ messages in thread
From: Song Liu @ 2020-03-16  5:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-fsdevel, linux-kernel, netdev, bpf, Kernel Team, ast,
	daniel, mcgrof, keescook, yzaikin, peterz, bristot, mingo



> On Mar 14, 2020, at 8:57 AM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Sat, Mar 14, 2020 at 03:47:50AM +0000, Song Liu wrote:
>> 
>> 
>>> On Mar 13, 2020, at 7:43 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> 
>>> On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
>>>> Motivation (copied from 2/2):
>>>> 
>>>> ======================= 8< =======================
>>>> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
>>>> Typical userspace tools use kernel.bpf_stats_enabled as follows:
>>>> 
>>>> 1. Enable kernel.bpf_stats_enabled;
>>>> 2. Check program run_time_ns;
>>>> 3. Sleep for the monitoring period;
>>>> 4. Check program run_time_ns again, calculate the difference;
>>>> 5. Disable kernel.bpf_stats_enabled.
>>>> 
>>>> The problem with this approach is that only one userspace tool can toggle
>>>> this sysctl. If multiple tools toggle the sysctl at the same time, the
>>>> measurement may be inaccurate.
>>>> 
>>>> To fix this problem while keep backward compatibility, introduce
>>>> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
>>>> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
>>>> to the static key for each open fd. The runtime stats is enabled when
>>>> kernel.bpf_stats_enabled == 1 or there is open fd to /dev/bpf_stats.
>>>> 
>>>> With /dev/bpf_stats, user space tool would have the following flow:
>>>> 
>>>> 1. Open a fd to /dev/bpf_stats;
>>>> 2. Check program run_time_ns;
>>>> 3. Sleep for the monitoring period;
>>>> 4. Check program run_time_ns again, calculate the difference;
>>>> 5. Close the fd.
>>>> ======================= 8< =======================
>>>> 
>>>> 1/2 adds a few new API to jump_label.
>>>> 2/2 adds the /dev/bpf_stats and adjust kernel.bpf_stats_enabled handler.
>>>> 
>>>> Please share your comments.
>>> 
>>> Conceptually makes sense to me. Few comments:
>>> 1. I don't understand why +2 logic is necessary.
>>> Just do +1 for every FD and change proc_do_static_key() from doing
>>> explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
>>> The handler would need to check that 1->1 and 0->0 is a nop.
>> 
>> With the +2/-2 logic, we use the lowest bit of the counter to remember 
>> the value of the sysctl. Otherwise, we cannot tell whether we are making
>> 0->1 transition or 1->1 transition. 
> 
> that can be another static int var in the handler.
> and no need for patch 1.

Good idea!

> 
>>> 
>>> 2. /dev is kinda awkward. May be introduce a new bpf command that returns fd?
>> 
>> Yeah, I also feel /dev is awkward. fd from bpf command sounds great. 
>> 
>>> 
>>> 3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
>>> Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
>>> and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
>>> it back to zero where it will stay zero even if users keep spamming
>>> sysctl kernel.bpf_stats_enabled=0.
>>> This way current services that use sysctl will keep working as-is.
>>> Multiple services that currently collide on sysctl will magically start
>>> working without any changes to them. It is still backwards compatible.
>> 
>> I think this is not fully backwards compatible. With current logic, the 
>> following sequence disables stats eventually. 
>> 
>>  sysctl kernel.bpf_stats_enabled=1
>>  sysctl kernel.bpf_stats_enabled=1
>>  sysctl kernel.bpf_stats_enabled=0
>> 
>> The same sequence will not disable stats with the ++/-- sysctl. 
> 
> sure, but if a process holding an fd 'sysctl kernel.bpf_stats_enabled=0'
> won't disable stats either. So it's also not backwards compatible. imo it's a
> change in behavior whichever way, but either approach doesn't break user space.
> An advantage of not doing an fd is that some user that really wants to have
> stats disabled for performance benchmarking can do
> 'sysctl kernel.bpf_stats_enabled=0' few times and the stats will be off.
> We can also make 'sysctl kernel.bpf_stats_enabled' to return current counter,
> so humans can see how many daemons are doing stats collection at that very
> moment.
> We can also do both new fd via bpf syscall and ++/-- via sysctl, but imo
> ++/-- via sysctl is enough to address the issue of multiple stats collecting
> daemons. The patch would be small enough that we can push it via bpf tree
> and into older kernels as arguable 'fix'.

For multiple tools to share run_time_ns stats, I think we need to address two
issues:

  1. run_time_ns is turned off when the tool is monitoring; 
  2. run_time_ns is left on when no one is using it. 

On the the other hand, when the tool is not monitoring run_time_ns, it should 
not care whether the kernel is counting it. 

Currently, we have both problem #1 and #2. ++/-- sysctl will help solve #1, but
won't help #2, e.g., when the tool crashes. fd solution also solves #2. 

Also, I think sysctl should not do ++/--? Most (all?) sysctl just does "set the 
value", no? 

Eventually, we should ask all tools to use the fd interface, and deprecate the
sysctl. For backward compatibility, we can have one legacy tool (using sysctl)
and multiple new tools (using fd interface) share run_time_ns. I feel this is 
almost fully backward compatible, because when the tool is not monitoring it 
should not care whether run_time_ns is on. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-03-16  5:52 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200314003518.3114452-1-songliubraving@fb.com>
2020-03-14  2:43 ` [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats Alexei Starovoitov
2020-03-14  3:47   ` Song Liu
2020-03-14 15:57     ` Alexei Starovoitov
2020-03-16  5:52       ` Song Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).