Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

From: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
To: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>,
	linux-kernel@vger.kernel.org,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive
Date: Thu, 17 Jun 2021 21:49:36 +0530
Message-ID: <1623934820.8pqjdszq8o.naveen@linux.ibm.com>
In-Reply-To: <20210616094622.c8bd37840898c67dddde1053@kernel.org>

Masami Hiramatsu wrote:
> On Tue, 15 Jun 2021 23:11:27 +0530
> "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
> 
>> Masami Hiramatsu wrote:
>> > On Mon, 14 Jun 2021 23:33:29 +0530
>> > "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com> wrote:
>> > 
>> >> We currently limit maxactive for a kretprobe to 4096 when registering
>> >> the same through tracefs. The comment indicates that this is done so as
>> >> to keep list traversal reasonable. However, we don't ever iterate over
>> >> all kretprobe_instance structures. The core kprobes infrastructure also
>> >> imposes no such limitation.
>> >> 
>> >> Remove the limit from the tracefs interface. This limit is easy to hit
>> >> on large cpu machines when tracing functions that can sleep.
>> >> 
>> >> Reported-by: Anton Blanchard <anton@ozlabs.org>
>> >> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
>> > 
>> > OK, but I don't like to just remove the limit (since it can cause
>> > memory shortage easily.)
>> > Can't we make it configurable? I don't mean Kconfig, but 
>> > tracefs/options/kretprobe_maxactive, or kprobes's debugfs knob.
>> > 
>> > Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
>> > it can limit both trace_kprobe and kprobes itself.
>> 
>> I don't think it is good to put a new tunable in debugfs -- we don't 
>> have any kprobes tunable there, so this adds a dependency on debugfs 
>> which shouldn't be necessary.
>> 
>> /proc/sys/debug/ may be a better fit since we have the 
>> kprobes-optimization flag to disable optprobes there, though I'm not 
>> sure if a new sysctl file is agreeable.
> 
> Indeed.
> 
>> But, I'm not too sure this really is a problem. Maxactive is a user 
>> _opt-in_ feature which needs to be explicitly added to an event 
>> definition. In that sense, isn't this already a tunable?
> 
> Let me explain the background of the limitation.

Thanks for the background on this.

> 
> Maxactive currently has no limit in the kprobe kernel module API,
> because the kernel module developer must take care of the max memory
> usage (and they can).
> 
> But the tracefs user may NOT have enough information about what
> happens if they pass something like 10M for maxactive (it will consume
> around 500MB kernel memory for one kretprobe).

Ok, thinking more about this...

Right now, the only way for a user to notice that kretprobe maxactive is 
an issue is by looking at kprobe_profile.  This is not even possible if 
using a bcc tool, which uses perf_event_open().  It took the reporting 
team some effort to work out that the weird results they were getting 
when tracing were caused by the default kretprobe maxactive value, and 
then that 4096 was the hard limit through tracefs.

So, IMO, anyone using any existing bcc tool, or a pre-canned perf script 
will not even be able to identify this as a problem to begin with... at 
least, not without some effort.

To address this, as a first step, we should probably consider parsing 
kprobe_profile and printing a warning with 'perf' if we detect a 
non-zero miss count for a probe, whether it is a regular probe or a 
retprobe.

If we do this, the nice thing with kprobe_profile is that the probe miss 
count is available, and can serve as a good way to decide what a more 
reasonable maxactive value should be. This should help prevent users 
from trying with arbitrary maxactive values.
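
As a rough sketch of the kind of check I have in mind (Python purely for 
illustration, this is not the actual perf change), assuming 
kprobe_profile keeps its three-column layout of event name, hit count 
and miss count. The function names and the sample text below are made 
up:

```python
def parse_kprobe_profile(text):
    """Return {event: (hits, misses)} from kprobe_profile-style text.

    Assumes each line is "<event> <hits> <misses>"; anything else is skipped.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        name, hits, misses = fields
        try:
            stats[name] = (int(hits), int(misses))
        except ValueError:
            # Not a counter line; ignore it.
            continue
    return stats


def warn_on_misses(text):
    """Build one warning string per probe with a non-zero miss count."""
    warnings = []
    for name, (_hits, misses) in sorted(parse_kprobe_profile(text).items()):
        if misses > 0:
            warnings.append(f"warning: probe {name} missed {misses} hits")
    return warnings


# Hypothetical sample contents, not captured from a real system.
SAMPLE = """\
  myprobe        120   0
  myretprobe    4096   37
"""

for message in warn_on_misses(SAMPLE):
    print(message)
```

A real tool would read the file from tracefs instead of using a canned 
string.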

For perf_event_open(), perhaps we can introduce an ioctl to query the 
probe miss count.

> 
> To avoid such trouble, I had set the 4096 limitation for the maxactive
> parameter. Of course 4096 may not enough for some use-cases. I'm welcome
> to expand it (e.g. 32k, isn't it enough?), but removing the limitation
> may cause OOM trouble easily.

Do you have suggestions for how we can determine a better limit? As you 
point out in the other email, there could very well be 64k or more 
processes on a large machine. Since the primary concern is memory usage, 
we probably need to decide this based on total memory. But, memory usage 
will vary depending on system load...
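
To put some numbers on this, here is a back-of-the-envelope sketch. It 
assumes roughly 50 bytes per instance, which is only what the "10M 
costs around 500MB" figure above implies; the real per-instance cost 
depends on the kernel and on the probe's data size. The helper names 
are hypothetical:

```python
# Assumption: ~500MB / 10M instances, i.e. about 50 bytes per instance.
BYTES_PER_INSTANCE = 50


def kretprobe_mem_bytes(maxactive, per_instance=BYTES_PER_INSTANCE):
    """Estimated kernel memory consumed by one kretprobe's instances."""
    return maxactive * per_instance


def maxactive_for_budget(budget_bytes, per_instance=BYTES_PER_INSTANCE):
    """Largest maxactive that fits within a given memory budget."""
    return budget_bytes // per_instance


for n in (4096, 32 * 1024, 64 * 1024, 10_000_000):
    print(n, kretprobe_mem_bytes(n))
```

Under that assumption, 64k instances cost only a few MB per kretprobe, 
so the harder question is how much of total memory we are willing to 
budget for this.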

Perhaps we can start by making the maxactive limit a tunable with a 
default value of 4096, with the understanding that users will be careful 
when bumping up this value. Hopefully, scripts won't simply start 
writing into this file ;)

If we can feed back the probe miss count, tools should be able to guide 
users on what would be a reasonable maxactive value to use.
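
For illustration only, one way a tool could turn the miss count into a 
suggestion. The doubling heuristic here is my assumption, not something 
the kernel or perf does today, and the 4096 default just mirrors the 
current tracefs limit. The definition string follows the documented 
kprobe_events syntax for retprobes (r[MAXACTIVE]:EVENT SYM):

```python
def next_maxactive(current, misses, limit=4096):
    """Suggest a maxactive value for the next run (hypothetical heuristic).

    Keep the current value if nothing was missed; otherwise double it,
    clamped to the configured limit.
    """
    if misses == 0:
        return current
    return min(max(current, 1) * 2, limit)


def retprobe_definition(event, symbol, maxactive):
    """Format a kprobe_events line for a kretprobe with explicit maxactive."""
    return f"r{maxactive}:{event} {symbol}"


# Example: a probe registered with maxactive=512 that reported 37 misses.
suggested = next_maxactive(512, 37)
print(retprobe_definition("myretprobe", "do_sys_open", suggested))
```

The user would then re-run with the suggested value and check the miss 
count again, rather than guessing an arbitrary maxactive.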

Thanks,
Naveen