All of lore.kernel.org
 help / color / mirror / Atom feed
From: Blaise Sanouillet <blez@fb.com>
To: Yonghong Song <yhs@fb.com>, Daniel Xu <dxu@dxuuu.xyz>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Cc: "cneirabustos@gmail.com" <cneirabustos@gmail.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>
Subject: Re: Extending bpf_get_ns_current_pid_tgid()
Date: Fri, 13 Nov 2020 12:04:43 +0000	[thread overview]
Message-ID: <MN2PR15MB2991E847DE47A265E71F1BC8A0E60@MN2PR15MB2991.namprd15.prod.outlook.com> (raw)
In-Reply-To: <13b5b2dd-bec0-cef2-7304-7e5a09bafb6c@fb.com>

> On 11/12/20 4:57 PM, Daniel Xu wrote:
>> On Thu Nov 12, 2020 at 4:27 PM PST, Yonghong Song wrote:
>>>
>>>
>>> On 11/12/20 2:20 PM, Daniel Xu wrote:
>>>> Hi,
>>>>
>>>> I'm looking at the current implementation of
>>>> bpf_get_ns_current_pid_tgid() and the helper seems to be a bit overly
>>>> restricting to me. Specifically the following line:
>>>>
>>>>       if (!ns_match(&pidns->ns, (dev_t)dev, ino))
>>>>               goto clear;
>>>>
>>>> Why bail if the inode # does not match? IIUC from the old discussions,
>>>> it was b/c in the future pidns files might belong to different devices.
>>>> It's not clear to me (possibly b/c I'm missing something) why the inode
>>>> has to match as well.
>>>
>>> Yes, pidns file might belong to different devices in theory so we need
>>> to match dev as well.
>>>
>>> The inode number needs to match so we can ensure user indeed wants to
>>> get the *current pidns* tgid/pid.
>>
>> Right, this double-checking at the API level is what feels strange to
>> me -- why make the user prove they know what they're doing?
> 
> If we do not have this checking, it is possible that in interrupt
> context, pidns #10 user may get a tgid/pid actually from pisns #11,
> and tgid/pid could be valid for pidns #10. This result will be
> actually wrong.
> 
>>
>> Furthermore, the "proof" restricts flexibility. It's as if
>> bpf_get_current_task() required a (dev,ino) pair. How would you get the
>> namespaced pid for a process you don't know about yet? eg when you're
>> profiling the system.
> 
> Did not fully understand questions here. Do you mean
>    bpf_get_current_task(dev, ino)
> that will be weird. task is not associated with dev/ino.
> 
>>
>>>
>>> (dev, ino) input expressed user intention. Without this, in no-process
>>> context, it will be hard to interpret the results.
>>
>> But bpf_get_current_pid_tgid() doesn't return errors so this shouldn't
>> either, right?
> 
> Different helpers can have different signatures.
> 
>>
>>>
>>>>
>>>> Would it be possible to instead have the helper return the pid/tgid of
>>>> the current task as viewed _from_ the `dev`/`ino` pidns? If the current
>>>> task is hidden from the `dev`/`ino` pidns, then return -ENOENT. The use
>>>> case is for bpftrace symbolize stacks when run inside a container. For
>>>> example:
>>>>
>>>>       (in-container)# bpftrace -e 'profile:hz:99 { print(ustack) }'
>>>
>>> I think you try to propose something like below:
>>> - user provides dev/ino
>>> - the helper will try to go through all pidns'es (not just active
>>> one), if any match pidns match, returns tgid/pid in that pidns,
>>> otherwise, returns -ENOENT.
>>
>> Right, exactly.
> 
> If you want to do this, you will need a new helper like
>    bpf_get_ns_pid_tgid
> 
> It actually will be weird to use this helper as it looks like
> you try to get pid/tgid of another ns. So we do need to nail
> down the use case here.

I'll try and describe the use case I have in mind. I expect folks would like to use bpftrace in container X to trace events in container Y, where X may or not be Y, provided the bpftrace process has the required access to Y's namespace. For symbolization to work, bpftrace needs to get the pid of processes in container Y but in the namespace of X. For example X could be a parent cgroup to multiple workloads including X, and the owner of these workloads has access to X but not the host itself. I don't see it as trying to get pid/tgid from another namespace, it's actually to get the pid in the namespace where it can be acted upon (i.e. X).

>>> The current helper is
>>> bpf_get_ns_current_pid_tgid
>>> you want
>>> bpf_get_ns_pid_tgid
>>>
>>> I think it is possible, you need to check
>>> pid->numbers[pid_level].ns
>>> for all pid levels. You need to get a reference count for the namespace
>>> to ensure valid result.
>>>
>>> This may work for root inode, but for container inode, it may have
>>> issues. For example,
>>> container 1: create, inode 2
>>> container 1 removed
>>> container 2: create, inode 2
>>> If you use inode 2, depending on timing you may accidentally targetting
>>> wrong container.
>>
>> Yeah, so maybe an fd to /proc/<pid>/ns/pid or something.
>>
>>>
>>> I think you can workaround the issue without this helper. See below.
>>>
>>>>
>>>> This currently does not work b/c bpftrace will generate a prog that gets
>>>> the root pidns pid, pack it with the stackid, and pass it up to
>>>> userspace. But b/c bpftrace is running inside the container, the root
>>>> pidns pid is invalid and symbolization fails.
>>>
>>> bpftrace can generate a program takes dev/inode as input parameters in
>>> map. The bpftrace will supply dev/inode value, by query the current
>>> system/container, and then run the program.
>>
>> I don't think it's very feasible to have bpftrace integrate with every
>> container runtime out there. This also becomes really difficult to
>> manage if you want to trace N processes. You'd need N maps or N progs.
> 
> Why, just one map to store dev/inode is shared among all progs, right?
> 
>>
>>>
>>>>
>>>> What would be nice is if bpftrace could generate a prog that gets the
>>>> current pid as viewed from bpftrace's pidns. Then symbolization would
>>>> work.
>>>
>>> Despite the above workaround, what you really need is although it is
>>> running on container, you want to get stack trace interpreted with
>>> root pid/tgid for symbolization purpose? But you can already achieve
>>> this with bpf_get_pid_tgid()?
>>
>> No, this isn't possible when bpftrace runs inside the container. ie
>> bpftrace is in a pidns along with the tracees. Bpftrace gets the root
>> pidns pid from the kernel but cannot resolve it to the pidns pid. That
>> means bpftrace cannot find the executable file to symbolize against.
> 
> Not sure whether I understand correct or not. You want root pid to
> find exec, right? but bpf_get_pid_tgid() will give your root pid?
> Maybe I miss something here...

Looks like your suggestion is for bpftrace to use bpf_get_pid_tgid() when it's running in the root namespace (i.e. the host), and bpf_get_ns_current_pid_tgid() when it's running in another namespace (i.e. a container). I think this would be fine in the short term, even though it doesn't cover all the cases (see above). That said, I'd say the intelligence belongs more in a bpf helper than in user space.

Thanks,
Blaise

  reply	other threads:[~2020-11-13 12:05 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-12 22:20 Extending bpf_get_ns_current_pid_tgid() Daniel Xu
2020-11-13  0:27 ` Yonghong Song
2020-11-13  0:57   ` Daniel Xu
2020-11-13  4:13     ` Yonghong Song
2020-11-13 12:04       ` Blaise Sanouillet [this message]
2020-11-13 16:57         ` Yonghong Song
     [not found]         ` <CACiB22i6d2skkJJa7uwVRrYy7dtYOxmLgFwzjtieW4BFn2tzLw@mail.gmail.com>
2020-11-13 16:59           ` Yonghong Song
     [not found]             ` <CACiB22iU3zk4Row=wAween=rSvHJ7j7M5T2KbyFk38arzEwQpQ@mail.gmail.com>
2021-06-16  9:54               ` Blaise Sanouillet
2021-06-16 17:02               ` Yonghong Song
2021-06-16 21:44                 ` Carlos Neira
2021-06-17  2:49                   ` Yonghong Song
2021-06-17 10:59                     ` Blaise Sanouillet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=MN2PR15MB2991E847DE47A265E71F1BC8A0E60@MN2PR15MB2991.namprd15.prod.outlook.com \
    --to=blez@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=cneirabustos@gmail.com \
    --cc=dxu@dxuuu.xyz \
    --cc=ebiederm@xmission.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.