Re: Extending bpf_get_ns_current_pid_tgid()

From: Blaise Sanouillet <blez@fb.com>
To: Yonghong Song <yhs@fb.com>, Daniel Xu <dxu@dxuuu.xyz>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Cc: "cneirabustos@gmail.com" <cneirabustos@gmail.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>
Subject: Re: Extending bpf_get_ns_current_pid_tgid()
Date: Fri, 13 Nov 2020 12:04:43 +0000	[thread overview]
Message-ID: <MN2PR15MB2991E847DE47A265E71F1BC8A0E60@MN2PR15MB2991.namprd15.prod.outlook.com> (raw)
In-Reply-To: <13b5b2dd-bec0-cef2-7304-7e5a09bafb6c@fb.com>

> On 11/12/20 4:57 PM, Daniel Xu wrote:
>> On Thu Nov 12, 2020 at 4:27 PM PST, Yonghong Song wrote:
>>>
>>>
>>> On 11/12/20 2:20 PM, Daniel Xu wrote:
>>>> Hi,
>>>>
>>>> I'm looking at the current implementation of
>>>> bpf_get_ns_current_pid_tgid() and the helper seems to be a bit overly
>>>> restricting to me. Specifically the following line:
>>>>
>>>>       if (!ns_match(&pidns->ns, (dev_t)dev, ino))
>>>>               goto clear;
>>>>
>>>> Why bail if the inode # does not match? IIUC from the old discussions,
>>>> it was b/c in the future pidns files might belong to different devices.
>>>> It's not clear to me (possibly b/c I'm missing something) why the inode
>>>> has to match as well.
>>>
>>> Yes, pidns file might belong to different devices in theory so we need
>>> to match dev as well.
>>>
>>> The inode number needs to match so we can ensure user indeed wants to
>>> get the *current pidns* tgid/pid.
>>
>> Right, this double-checking at the API level is what feels strange to
>> me -- why make the user prove they know what they're doing?
> 
> If we do not have this checking, it is possible that in interrupt
> context, pidns #10 user may get a tgid/pid actually from pisns #11,
> and tgid/pid could be valid for pidns #10. This result will be
> actually wrong.
> 
>>
>> Furthermore, the "proof" restricts flexibility. It's as if
>> bpf_get_current_task() required a (dev,ino) pair. How would you get the
>> namespaced pid for a process you don't know about yet? eg when you're
>> profiling the system.
> 
> Did not fully understand questions here. Do you mean
>    bpf_get_current_task(dev, ino)
> that will be weird. task is not associated with dev/ino.
> 
>>
>>>
>>> (dev, ino) input expressed user intention. Without this, in no-process
>>> context, it will be hard to interpret the results.
>>
>> But bpf_get_current_pid_tgid() doesn't return errors so this shouldn't
>> either, right?
> 
> Different helpers can have different signatures.
> 
>>
>>>
>>>>
>>>> Would it be possible to instead have the helper return the pid/tgid of
>>>> the current task as viewed _from_ the `dev`/`ino` pidns? If the current
>>>> task is hidden from the `dev`/`ino` pidns, then return -ENOENT. The use
>>>> case is for bpftrace symbolize stacks when run inside a container. For
>>>> example:
>>>>
>>>>       (in-container)# bpftrace -e 'profile:hz:99 { print(ustack) }'
>>>
>>> I think you try to propose something like below:
>>> - user provides dev/ino
>>> - the helper will try to go through all pidns'es (not just active
>>> one), if any match pidns match, returns tgid/pid in that pidns,
>>> otherwise, returns -ENOENT.
>>
>> Right, exactly.
> 
> If you want to do this, you will need a new helper like
>    bpf_get_ns_pid_tgid
> 
> It actually will be weird to use this helper as it looks like
> you try to get pid/tgid of another ns. So we do need to nail
> down the use case here.

I'll try and describe the use case I have in mind. I expect folks would like to use bpftrace in container X to trace events in container Y, where X may or not be Y, provided the bpftrace process has the required access to Y's namespace. For symbolization to work, bpftrace needs to get the pid of processes in container Y but in the namespace of X. For example X could be a parent cgroup to multiple workloads including X, and the owner of these workloads has access to X but not the host itself. I don't see it as trying to get pid/tgid from another namespace, it's actually to get the pid in the namespace where it can be acted upon (i.e. X).

>>> The current helper is
>>> bpf_get_ns_current_pid_tgid
>>> you want
>>> bpf_get_ns_pid_tgid
>>>
>>> I think it is possible, you need to check
>>> pid->numbers[pid_level].ns
>>> for all pid levels. You need to get a reference count for the namespace
>>> to ensure valid result.
>>>
>>> This may work for root inode, but for container inode, it may have
>>> issues. For example,
>>> container 1: create, inode 2
>>> container 1 removed
>>> container 2: create, inode 2
>>> If you use inode 2, depending on timing you may accidentally targetting
>>> wrong container.
>>
>> Yeah, so maybe an fd to /proc/<pid>/ns/pid or something.
>>
>>>
>>> I think you can workaround the issue without this helper. See below.
>>>
>>>>
>>>> This currently does not work b/c bpftrace will generate a prog that gets
>>>> the root pidns pid, pack it with the stackid, and pass it up to
>>>> userspace. But b/c bpftrace is running inside the container, the root
>>>> pidns pid is invalid and symbolization fails.
>>>
>>> bpftrace can generate a program takes dev/inode as input parameters in
>>> map. The bpftrace will supply dev/inode value, by query the current
>>> system/container, and then run the program.
>>
>> I don't think it's very feasible to have bpftrace integrate with every
>> container runtime out there. This also becomes really difficult to
>> manage if you want to trace N processes. You'd need N maps or N progs.
> 
> Why, just one map to store dev/inode is shared among all progs, right?
> 
>>
>>>
>>>>
>>>> What would be nice is if bpftrace could generate a prog that gets the
>>>> current pid as viewed from bpftrace's pidns. Then symbolization would
>>>> work.
>>>
>>> Despite the above workaround, what you really need is although it is
>>> running on container, you want to get stack trace interpreted with
>>> root pid/tgid for symbolization purpose? But you can already achieve
>>> this with bpf_get_pid_tgid()?
>>
>> No, this isn't possible when bpftrace runs inside the container. ie
>> bpftrace is in a pidns along with the tracees. Bpftrace gets the root
>> pidns pid from the kernel but cannot resolve it to the pidns pid. That
>> means bpftrace cannot find the executable file to symbolize against.
> 
> Not sure whether I understand correct or not. You want root pid to
> find exec, right? but bpf_get_pid_tgid() will give your root pid?
> Maybe I miss something here...

Looks like your suggestion is for bpftrace to use bpf_get_pid_tgid() when it's running in the root namespace (i.e. the host), and bpf_get_ns_current_pid_tgid() when it's running in another namespace (i.e. a container). I think this would be fine in the short term, even though it doesn't cover all the cases (see above). That said, I'd say the intelligence belongs more in a bpf helper than in user space.

Thanks,
Blaise