BPF Archive on lore.kernel.org
 help / color / Atom feed
From: Yonghong Song <yhs@fb.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Carlos Antonio Neira Bustos <cneirabustos@gmail.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"brouer@redhat.com" <brouer@redhat.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
Date: Fri, 13 Sep 2019 17:28:34 +0000
Message-ID: <b99306da-e689-a815-8e5b-398bd2a35506@fb.com> (raw)
In-Reply-To: <87a7b8gmj8.fsf@x220.int.ebiederm.org>



On 9/13/19 5:59 PM, Eric W. Biederman wrote:
> Yonghong Song <yhs@fb.com> writes:
> 
>> On 9/11/19 9:16 AM, Eric W. Biederman wrote:
>>> Al Viro <viro@zeniv.linux.org.uk> writes:
>>>
>>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
>>>>>
>>>>> Carlos,
>>>>>
>>>>> Discussed with Eric today for what is the best way to get
>>>>> the device number for a namespace. The following patch seems
>>>>> a reasonable start although Eric would like to see
>>>>> how the helper is used in order to decide whether the
>>>>> interface looks right.
>>>>>
>>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
>>>>> Author: Yonghong Song <yhs@fb.com>
>>>>> Date:   Mon Sep 9 21:50:51 2019 -0700
>>>>>
>>>>>        nsfs: add an interface function ns_get_inum_dev()
>>>>>
>>>>>        This patch added an interface function
>>>>>        ns_get_inum_dev(). Given a ns_common structure,
>>>>>        the function returns the inode and device
>>>>>        numbers. The function will be used later
>>>>>        by a newly added bpf helper.
>>>>>
>>>>>        Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>>
>>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>>>>> index a0431642c6b5..a603c6fc3f54 100644
>>>>> --- a/fs/nsfs.c
>>>>> +++ b/fs/nsfs.c
>>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>>>>>            return ERR_PTR(-EINVAL);
>>>>>     }
>>>>>
>>>>> +/* Get the device number for the current task pidns.
>>>>> + */
>>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
>>>>> +{
>>>>> +       *inum = ns->inum;
>>>>> +       *dev = nsfs_mnt->mnt_sb->s_dev;
>>>>> +}
>>>>
>>>> Umm...  Where would it get the device number once we get (hell knows
>>>> what for) multiple nsfs instances?  I still don't understand what
>>>> would that be about, TBH...  Is it really per-userns?  Or something
>>>> else entirely?  Eric, could you give some context?
>>>
>>> My goal is not to paint things into a corner, with future changes.
>>> Right now it is possible to stat a namespace file descriptor and
>>> get a device and inode number.  Then compare that.
>>>
>>> I don't want people using the inode number in nsfd as some magic
>>> namespace id.
>>>
>>> We have had times in the past where there was more than one superblock
>>> and thus more than one device number.  Further if userspace ever uses
>>> this heavily there may be times in the future where for
>>> checkpoint/restart purposes we will want multiple nsfd's so we can
>>> preserve the inode number accross a migration.
>>>
>>> Realistically there will probably just some kind of hotplug notification
>>> to userspace to say we have hotplugged your operatining system as
>>> a migration notification.
>>>
>>> Now the halway discussion did not quite capture everything I was trying
>>> to say but it at least got to the right ballpark.
>>>
>>> The helper in fs/nsfs.c should be:
>>>
>>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>>> {
>>>           return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>>> }
>>>
>>> That way if/when there are multiple inodes identifying the same
>>> namespace the bpf programs don't need to change.
>>
>> Thanks, Eric. This is indeed better. The bpf helper should focus
>> on comparing dev/ino, instead of return the dev/ino to bpf program.
>>
>> So overall, nsfs related change will look like:
>>
>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>> index a0431642c6b5..7e78d89c2172 100644
>> --- a/fs/nsfs.c
>> +++ b/fs/nsfs.c
>> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd)
>>           return ERR_PTR(-EINVAL);
>>    }
>>
>> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>> +{
>> +       return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>> +}
>> +
>>    static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
>>    {
>>           struct inode *inode = d_inode(dentry);
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index d31cb6215905..79639807e960 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct
>> task_struct *task,
>>    typedef struct ns_common *ns_get_path_helper_t(void *);
>>    extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t
>> ns_get_cb,
>>                               void *private_data);
>> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino);
>>
>>    extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
>>                           const struct proc_ns_operations *ns_ops);
>>
>>>
>>> Up farther in the stack it should be something like:
>>>
>>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino)
>>>> {
>>>>           return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino);
>>>> }
>>>>
>>>> const struct bpf_func_proto bpf_current_pidns_match_proto = {
>>>> 	.func		= bpf_current_pins_match,
>>>> 	.gpl_only	= true,
>>>> 	.ret_type	= RET_INTEGER
>>>> 	.arg1_type	= ARG_PTR_TO_DEVICE_NUMBER,
>>>> 	.arg2_type	= ARG_PTR_TO_INODE_NUMBER,
>>>> };
>>>
>>> That allows comparing what the bpf came up with with whatever value
>>> userspace generated by stating the file descriptor.
>>>
>>>
>>> That is the least bad suggestion I currently have for that
>>> functionality.  It really would be better to not have that filter in the
>>> bpf program itself but in the infrastructure that binds a program to a
>>> set of tasks.
>>>
>>> The problem with this approach is whatever device/inode you have when
>>> the namespace they refer to exits there is the possibility that the
>>> inode will be reused.  So your filter will eventually start matching on
>>> the wrong thing.
>>
>> I come up with a differeent helper definition, which is much more
>> similar to existing bpf_get_current_pid_tgid() and helper definition
>> much more conforms to bpf convention.
> 
> There is a problem with your bpf_get_ns_current_pid_tgid below.
> The inode number is a 64bit number.  To be nice to old userspace
> we try and not use 64bit inode numbers where they are not required
> but in this case we should not use an interface that assumes inode
> numbers are 32bit.  They just aren't.
> 
> I didn't know how to express that in the bpf proto so I did what
> I could.

We can change inum to u64. Just change the prototype like
`u64, inum` should be good.

> 
> The alternative to this would be to simply restrict this
> helper to bpf programs registered in the initial pid namespace.
> At which point you could just ensure all the numbers are in
> the global pid namespace.
> 
> Hmm.  Looing at the comment below I am confused.
> 
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 5e28718928ca..bc26903c80c7 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -11,6 +11,8 @@
>>    #include <linux/uidgid.h>
>>    #include <linux/filter.h>
>>    #include <linux/ctype.h>
>> +#include <linux/pid_namespace.h>
>> +#include <linux/proc_ns.h>
>>
>>    #include "../../lib/kstrtox.h"
>>
>> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = {
>>           .arg4_type      = ARG_PTR_TO_LONG,
>>    };
>>    #endif
>> +
>> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum)
>> +{
>> +       struct task_struct *task = current;
>> +       struct pid_namespace *pidns;
>> +       pid_t pid, tgid;
>> +
>> +       if (unlikely(!task))
>> +               return -EINVAL;
>> +
>> +
>> +       pidns = task_active_pid_ns(task);
>> +       if (unlikely(!pidns))
>> +               return -ENOENT;
>> +
>> +       if (!ns_match(&pidns->ns, (dev_t)dev, inum))
>> +               return -EINVAL;
>> +
>> +       pid = task_pid_nr_ns(task, pidns);
>> +       tgid = task_tgid_nr_ns(task, pidns);
>> +
>> +       return (u64) tgid << 32 | pid;
>> +}
>> +
>> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
>> +       .func           = bpf_get_ns_current_pid_tgid,
>> +       .gpl_only       = false,
>> +       .ret_type       = RET_INTEGER,
>> +       .arg1_type      = ARG_ANYTHING,
>> +       .arg2_type      = ARG_ANYTHING,
>> +};
>>
>> Existing usage of bpf_get_current_pid_tgid() can be converted
>> to bpf_get_ns_current_pid_tgid() if ns dev/inode number
>> is supplied. For bpf_get_ns_current_pid_tgid(), checking
>> return value ( < 0 or not) is needed.
> 
> Ok.  I missed something.
> 
> What is the problem bpf_get_ns_current_pid_tgid trying to solve
> that bpf_get_current_pid_tgid does not solve.
> 
> I would think since much of tracing ebpf is fundamentally restricted
> to the global root user.  Limiting the ebpf programs to the initial
> pid namespace should not be a problem.
> 
> So I don't understand why you need to specify the namespace in
> the ebpf call.
> 
> Can someone give me a clue what problem is being sovled by this
> new call?

We want to run the bpf program inside the namespace.
There are parallel work to introduce CAP_BPF and CAP_TRACING
(https://lore.kernel.org/bpf/20190906231053.1276792-1-ast@kernel.org/T/#t)
to facilitate this.

We have users requesting to use bcc tools inside the containers.
https://github.com/iovisor/bcc/issues/1875
https://github.com/iovisor/bcc/issues/1366
https://github.com/iovisor/bcc/issues/1329
https://github.com/iovisor/bcc/issues/1532
...

Yes, this may require granting `root` privilege to containers.
This can be done outside container as well. But it is just a
big usability improvement if people can do inside the containers.

In addition, we have requests below and internal requests as well
to filter based on containers.
https://github.com/iovisor/bcc/issues/1119
The new helper permits at root that you can filter based on
a particular container (not container id, but dev/inode should
identifying one).

Hope this clarify why this helper is useful for tracing community.

> 
> Eric
> 

  reply index

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-06 15:09 [PATCH bpf-next v10 0/4] BPF: New helper to obtain namespace data from current task Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 1/4] fs/namei.c: make available filename_lookup() for bpf helpers Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info Carlos Neira
2019-09-06 15:24   ` Al Viro
2019-09-06 15:46     ` Al Viro
2019-09-06 16:00       ` Al Viro
2019-09-06 23:21         ` Yonghong Song
2019-09-07  0:10           ` Al Viro
2019-09-07  6:34             ` Yonghong Song
2019-09-09 17:45               ` Carlos Antonio Neira Bustos
2019-09-10 22:35                 ` Yonghong Song
2019-09-10 23:15                   ` Al Viro
2019-09-11  8:16                     ` ebiederm
2019-09-12  5:49                       ` Yonghong Song
     [not found]                         ` <CACiB22j9M2gmccnh7XqqFp8g7qKFuiOrSAVJiA2tQHLB0pmoSQ@mail.gmail.com>
2019-09-13  2:56                           ` Yonghong Song
2019-09-13 11:58                             ` Carlos Antonio Neira Bustos
2019-09-13 16:59                         ` ebiederm
2019-09-13 17:28                           ` Yonghong Song [this message]
2019-09-11  4:32                   ` Carlos Antonio Neira Bustos
2019-09-11  8:17                     ` ebiederm
2019-09-10 22:46   ` Yonghong Song
2019-09-11  4:33     ` Carlos Antonio Neira Bustos
2019-09-06 15:09 ` [PATCH bpf-next v10 3/4] tools: Added bpf_get_current_pidns_info helper Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 4/4] tools/testing/selftests/bpf: Add self-tests for helper bpf_get_pidns_info Carlos Neira
2019-09-10 22:55   ` Yonghong Song

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b99306da-e689-a815-8e5b-398bd2a35506@fb.com \
    --to=yhs@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=cneirabustos@gmail.com \
    --cc=ebiederm@xmission.com \
    --cc=netdev@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org bpf@archiver.kernel.org
	public-inbox-index bpf


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/ public-inbox