Re: Extending bpf_get_ns_current_pid_tgid()

From: Yonghong Song <yhs@fb.com>
To: Blaise Sanouillet <blez@fb.com>, Daniel Xu <dxu@dxuuu.xyz>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Cc: "cneirabustos@gmail.com" <cneirabustos@gmail.com>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>
Subject: Re: Extending bpf_get_ns_current_pid_tgid()
Date: Fri, 13 Nov 2020 08:57:25 -0800	[thread overview]
Message-ID: <ba5f3c14-8261-af6f-8850-90848963d63a@fb.com> (raw)
In-Reply-To: <MN2PR15MB2991E847DE47A265E71F1BC8A0E60@MN2PR15MB2991.namprd15.prod.outlook.com>

On 11/13/20 4:04 AM, Blaise Sanouillet wrote:
>> On 11/12/20 4:57 PM, Daniel Xu wrote:
>>> On Thu Nov 12, 2020 at 4:27 PM PST, Yonghong Song wrote:
>>>>
>>>>
>>>> On 11/12/20 2:20 PM, Daniel Xu wrote:
>>>>> Hi,
>>>>>
>>>>> I'm looking at the current implementation of
>>>>> bpf_get_ns_current_pid_tgid() and the helper seems to be a bit overly
>>>>> restricting to me. Specifically the following line:
>>>>>
>>>>>        if (!ns_match(&pidns->ns, (dev_t)dev, ino))
>>>>>                goto clear;
>>>>>
>>>>> Why bail if the inode # does not match? IIUC from the old discussions,
>>>>> it was b/c in the future pidns files might belong to different devices.
>>>>> It's not clear to me (possibly b/c I'm missing something) why the inode
>>>>> has to match as well.
>>>>
>>>> Yes, pidns file might belong to different devices in theory so we need
>>>> to match dev as well.
>>>>
>>>> The inode number needs to match so we can ensure user indeed wants to
>>>> get the *current pidns* tgid/pid.
>>>
>>> Right, this double-checking at the API level is what feels strange to
>>> me -- why make the user prove they know what they're doing?
>>
>> If we do not have this checking, it is possible that in interrupt
>> context, pidns #10 user may get a tgid/pid actually from pisns #11,
>> and tgid/pid could be valid for pidns #10. This result will be
>> actually wrong.
>>
>>>
>>> Furthermore, the "proof" restricts flexibility. It's as if
>>> bpf_get_current_task() required a (dev,ino) pair. How would you get the
>>> namespaced pid for a process you don't know about yet? eg when you're
>>> profiling the system.
>>
>> Did not fully understand questions here. Do you mean
>>     bpf_get_current_task(dev, ino)
>> that will be weird. task is not associated with dev/ino.
>>
>>>
>>>>
>>>> (dev, ino) input expressed user intention. Without this, in no-process
>>>> context, it will be hard to interpret the results.
>>>
>>> But bpf_get_current_pid_tgid() doesn't return errors so this shouldn't
>>> either, right?
>>
>> Different helpers can have different signatures.
>>
>>>
>>>>
>>>>>
>>>>> Would it be possible to instead have the helper return the pid/tgid of
>>>>> the current task as viewed _from_ the `dev`/`ino` pidns? If the current
>>>>> task is hidden from the `dev`/`ino` pidns, then return -ENOENT. The use
>>>>> case is for bpftrace symbolize stacks when run inside a container. For
>>>>> example:
>>>>>
>>>>>        (in-container)# bpftrace -e 'profile:hz:99 { print(ustack) }'
>>>>
>>>> I think you try to propose something like below:
>>>> - user provides dev/ino
>>>> - the helper will try to go through all pidns'es (not just active
>>>> one), if any match pidns match, returns tgid/pid in that pidns,
>>>> otherwise, returns -ENOENT.
>>>
>>> Right, exactly.
>>
>> If you want to do this, you will need a new helper like
>>     bpf_get_ns_pid_tgid
>>
>> It actually will be weird to use this helper as it looks like
>> you try to get pid/tgid of another ns. So we do need to nail
>> down the use case here.
> 
> I'll try and describe the use case I have in mind. I expect folks would like to use bpftrace in container X to trace events in container Y, where X may or not be Y, provided the bpftrace process has the required access to Y's namespace. For symbolization to work, bpftrace needs to get the pid of processes in container Y but in the namespace of X. For example X could be a parent cgroup to multiple workloads including X, and the owner of these workloads has access to X but not the host itself. I don't see it as trying to get pid/tgid from another namespace, it's actually to get the pid in the namespace where it can be acted upon (i.e. X).

Thanks for explanation. If I understand correctly, you have a privileged
namespace which is used to trace other namespaces. So the helper's input
is other namespace dev/inode and the result is other namespace tgid/pid.

I think this is a valid use case.

> 
>>>> The current helper is
>>>> bpf_get_ns_current_pid_tgid
>>>> you want
>>>> bpf_get_ns_pid_tgid
>>>>
>>>> I think it is possible, you need to check
>>>> pid->numbers[pid_level].ns
>>>> for all pid levels. You need to get a reference count for the namespace
>>>> to ensure valid result.
>>>>
>>>> This may work for root inode, but for container inode, it may have
>>>> issues. For example,
>>>> container 1: create, inode 2
>>>> container 1 removed
>>>> container 2: create, inode 2
>>>> If you use inode 2, depending on timing you may accidentally targetting
>>>> wrong container.
>>>
>>> Yeah, so maybe an fd to /proc/<pid>/ns/pid or something.
>>>
>>>>
>>>> I think you can workaround the issue without this helper. See below.
>>>>
>>>>>
>>>>> This currently does not work b/c bpftrace will generate a prog that gets
>>>>> the root pidns pid, pack it with the stackid, and pass it up to
>>>>> userspace. But b/c bpftrace is running inside the container, the root
>>>>> pidns pid is invalid and symbolization fails.
>>>>
>>>> bpftrace can generate a program takes dev/inode as input parameters in
>>>> map. The bpftrace will supply dev/inode value, by query the current
>>>> system/container, and then run the program.
>>>
>>> I don't think it's very feasible to have bpftrace integrate with every
>>> container runtime out there. This also becomes really difficult to
>>> manage if you want to trace N processes. You'd need N maps or N progs.
>>
>> Why, just one map to store dev/inode is shared among all progs, right?
>>
>>>
>>>>
>>>>>
>>>>> What would be nice is if bpftrace could generate a prog that gets the
>>>>> current pid as viewed from bpftrace's pidns. Then symbolization would
>>>>> work.
>>>>
>>>> Despite the above workaround, what you really need is although it is
>>>> running on container, you want to get stack trace interpreted with
>>>> root pid/tgid for symbolization purpose? But you can already achieve
>>>> this with bpf_get_pid_tgid()?
>>>
>>> No, this isn't possible when bpftrace runs inside the container. ie
>>> bpftrace is in a pidns along with the tracees. Bpftrace gets the root
>>> pidns pid from the kernel but cannot resolve it to the pidns pid. That
>>> means bpftrace cannot find the executable file to symbolize against.
>>
>> Not sure whether I understand correct or not. You want root pid to
>> find exec, right? but bpf_get_pid_tgid() will give your root pid?
>> Maybe I miss something here...
> 
> Looks like your suggestion is for bpftrace to use bpf_get_pid_tgid() when it's running in the root namespace (i.e. the host), and bpf_get_ns_current_pid_tgid() when it's running in another namespace (i.e. a container). I think this would be fine in the short term, even though it doesn't cover all the cases (see above). That said, I'd say the intelligence belongs more in a bpf helper than in user space.

Using bpf_get_pid_tgid() should work if you have host access. But if you 
only have a privileged namespace, this won't work and indeed a new 
helper is needed.

> 
> Thanks,
> Blaise
>