Re: [PATCH v2 bpf-next 2/4] bpf: introduce helper bpf_get_task_stack()

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Song Liu <songliubraving@fb.com>
Cc: bpf <bpf@vger.kernel.org>, Networking <netdev@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>,
	john fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@chromium.org>
Subject: Re: [PATCH v2 bpf-next 2/4] bpf: introduce helper bpf_get_task_stack()
Date: Fri, 26 Jun 2020 15:51:08 -0700
Message-ID: <CAEf4BzaC1Dqn3PXBJmczPRaUmjKc7pcg6_mjyKymBek-sDKv7Q@mail.gmail.com>
In-Reply-To: <C3B6DD3E-1B69-4D0C-8A55-4EB81C21C619@fb.com>

On Fri, Jun 26, 2020 at 3:45 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 26, 2020, at 1:17 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
> >> task. This is different to bpf_get_stack(), which gets stack trace of
> >> current task. One potential use case of bpf_get_task_stack() is to call
> >> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
> >>
> >> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
> >> get_perf_callchain() for kernel stack. The benefit of this choice is that
> >> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
> >> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
> >> stack trace to unsigned long array. For 32-bit systems, we need to
> >> translate it to u64 array.
> >>
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >
> > Looks great, I just think that there are cases where the user doesn't
> > necessarily have a valid task_struct pointer, just a pid, so it would be
> > nice to not artificially restrict such cases by having an extra helper.
> >
> > Acked-by: Andrii Nakryiko <andriin@fb.com>
>
> Thanks!
>
> >
> >> include/linux/bpf.h            |  1 +
> >> include/uapi/linux/bpf.h       | 35 ++++++++++++++-
> >> kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
> >> kernel/trace/bpf_trace.c       |  2 +
> >> scripts/bpf_helpers_doc.py     |  2 +
> >> tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
> >> 6 files changed, 149 insertions(+), 5 deletions(-)
> >>
> >
> > [...]
> >
> >> +       /* stack_trace_save_tsk() works on unsigned long array, while
> >> +        * perf_callchain_entry uses u64 array. For 32-bit systems, it is
> >> +        * necessary to fix this mismatch.
> >> +        */
> >> +       if (__BITS_PER_LONG != 64) {
> >> +               unsigned long *from = (unsigned long *) entry->ip;
> >> +               u64 *to = entry->ip;
> >> +               int i;
> >> +
> >> +               /* copy data from the end to avoid using extra buffer */
> >> +               for (i = entry->nr - 1; i >= (int)init_nr; i--)
> >> +                       to[i] = (u64)(from[i]);
> >
> > doing this forward would be just fine as well, no? First iteration
> > will cast and overwrite low 32-bits, all the subsequent iterations
> > won't even overlap.
>
> I think first iteration will write zeros to higher 32 bits, no?

Oh, wait, I completely misread what this is doing. It up-converts from
32-bit to 64-bit, sorry. Yeah, ignore me on this :)

But then I have another question. How do you know that entry->ip has
enough space to keep the same number of 2x bigger entries?
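
To make the space question concrete, here is a minimal userspace sketch of
the same in-place widening. It is not the kernel code: the function name
widen_u32_to_u64_inplace() is hypothetical, uint32_t stands in for unsigned
long on a 32-bit system, uint64_t stands in for u64, and memcpy() is used
only to keep the userspace version free of aliasing problems. It shows why
the copy has to run from the end, and that the buffer must already have room
for n 64-bit slots.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper for illustration only.
 *
 * On entry, buf holds n packed 32-bit values. On success, buf holds the
 * same n values zero-extended to 64 bits each. Returns 0 on success and
 * -1 if buf_bytes cannot hold n 64-bit entries.
 */
static int widen_u32_to_u64_inplace(void *buf, size_t buf_bytes, size_t n)
{
	unsigned char *p = buf;
	size_t i;

	/* The widened array is 2x bigger; refuse if it would not fit. */
	if (n > buf_bytes / sizeof(uint64_t))
		return -1;

	/*
	 * Walk from the last entry down to the first. The write for entry i
	 * starts at byte 8*i, while all unread source entries end before
	 * byte 4*i, so nothing unread is ever overwritten. A forward walk
	 * would clobber source entry 1 when writing destination entry 0.
	 */
	for (i = n; i-- > 0; ) {
		uint32_t narrow;
		uint64_t wide;

		memcpy(&narrow, p + i * sizeof(narrow), sizeof(narrow));
		wide = narrow; /* zero-extend */
		memcpy(p + i * sizeof(wide), &wide, sizeof(wide));
	}

	return 0;
}
```

The capacity check at the top is the part my question is about: the
conversion is only safe if whoever filled the buffer was limited to half as
many 32-bit entries as the raw byte size would suggest.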

>
> >
> >> +       }
> >> +
> >> +exit_put:
> >> +       put_callchain_entry(rctx);
> >> +
> >> +       return entry;
> >> +}
> >> +
> >
> > [...]
> >
> >> +BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
> >> +          u32, size, u64, flags)
> >> +{
> >> +       struct pt_regs *regs = task_pt_regs(task);
> >> +
> >> +       return __bpf_get_stack(regs, task, buf, size, flags);
> >> +}
> >
> >
> > So this takes advantage of BTF and having a direct task_struct
> > pointer. But for kprobes/tracepoint I think it would also be extremely
> > helpful to be able to request stack trace by PID. How about one more
> > helper which will wrap this one with get/put task by PID, e.g.,
> > bpf_get_pid_stack(int pid, void *buf, u32 size, u64 flags)? Would that
> > be a problem?
>
> That should work. Let me add that in a follow up patch.
>