BPF Archive on lore.kernel.org
 help / color / Atom feed
From: ebiederm@xmission.com (Eric W. Biederman)
To: Yonghong Song <yhs@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Carlos Antonio Neira Bustos <cneirabustos@gmail.com>,
	"netdev\@vger.kernel.org" <netdev@vger.kernel.org>,
	"brouer\@redhat.com" <brouer@redhat.com>,
	"bpf\@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
Date: Fri, 13 Sep 2019 11:59:23 -0500
Message-ID: <87a7b8gmj8.fsf@x220.int.ebiederm.org> (raw)
In-Reply-To: <7b0a325e-9187-702f-eba7-bfcc7e3f7eb4@fb.com>

Yonghong Song <yhs@fb.com> writes:

> On 9/11/19 9:16 AM, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote:
>>>>
>>>> Carlos,
>>>>
>>>> Discussed with Eric today for what is the best way to get
>>>> the device number for a namespace. The following patch seems
>>>> a reasonable start although Eric would like to see
>>>> how the helper is used in order to decide whether the
>>>> interface looks right.
>>>>
>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
>>>> Author: Yonghong Song <yhs@fb.com>
>>>> Date:   Mon Sep 9 21:50:51 2019 -0700
>>>>
>>>>       nsfs: add an interface function ns_get_inum_dev()
>>>>
>>>>       This patch added an interface function
>>>>       ns_get_inum_dev(). Given a ns_common structure,
>>>>       the function returns the inode and device
>>>>       numbers. The function will be used later
>>>>       by a newly added bpf helper.
>>>>
>>>>       Signed-off-by: Yonghong Song <yhs@fb.com>
>>>>
>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c
>>>> index a0431642c6b5..a603c6fc3f54 100644
>>>> --- a/fs/nsfs.c
>>>> +++ b/fs/nsfs.c
>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
>>>>           return ERR_PTR(-EINVAL);
>>>>    }
>>>>
>>>> +/* Get the device number for the current task pidns.
>>>> + */
>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
>>>> +{
>>>> +       *inum = ns->inum;
>>>> +       *dev = nsfs_mnt->mnt_sb->s_dev;
>>>> +}
>>>
>>> Umm...  Where would it get the device number once we get (hell knows
>>> what for) multiple nsfs instances?  I still don't understand what
>>> would that be about, TBH...  Is it really per-userns?  Or something
>>> else entirely?  Eric, could you give some context?
>> 
>> My goal is not to paint things into a corner, with future changes.
>> Right now it is possible to stat a namespace file descriptor and
>> get a device and inode number.  Then compare that.
>> 
>> I don't want people using the inode number in nsfd as some magic
>> namespace id.
>> 
>> We have had times in the past where there was more than one superblock
>> and thus more than one device number.  Further if userspace ever uses
>> this heavily there may be times in the future where for
>> checkpoint/restart purposes we will want multiple nsfd's so we can
>> preserve the inode number accross a migration.
>> 
>> Realistically there will probably just some kind of hotplug notification
>> to userspace to say we have hotplugged your operatining system as
>> a migration notification.
>> 
>> Now the halway discussion did not quite capture everything I was trying
>> to say but it at least got to the right ballpark.
>> 
>> The helper in fs/nsfs.c should be:
>> 
>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
>> {
>>          return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
>> }
>> 
>> That way if/when there are multiple inodes identifying the same
>> namespace the bpf programs don't need to change.
>
> Thanks, Eric. This is indeed better. The bpf helper should focus
> on comparing dev/ino, instead of return the dev/ino to bpf program.
>
> So overall, nsfs related change will look like:
>
> diff --git a/fs/nsfs.c b/fs/nsfs.c
> index a0431642c6b5..7e78d89c2172 100644
> --- a/fs/nsfs.c
> +++ b/fs/nsfs.c
> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd)
>          return ERR_PTR(-EINVAL);
>   }
>
> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino)
> +{
> +       return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev));
> +}
> +
>   static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
>   {
>          struct inode *inode = d_inode(dentry);
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index d31cb6215905..79639807e960 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
> task_struct *task,
>   typedef struct ns_common *ns_get_path_helper_t(void *);
>   extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
> ns_get_cb,
>                              void *private_data);
> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino);
>
>   extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
>                          const struct proc_ns_operations *ns_ops);
>
>> 
>> Up farther in the stack it should be something like:
>> 
>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino)
>>> {
>>>          return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino);
>>> }
>>>
>>> const struct bpf_func_proto bpf_current_pidns_match_proto = {
>>> 	.func		= bpf_current_pins_match,
>>> 	.gpl_only	= true,
>>> 	.ret_type	= RET_INTEGER
>>> 	.arg1_type	= ARG_PTR_TO_DEVICE_NUMBER,
>>> 	.arg2_type	= ARG_PTR_TO_INODE_NUMBER,
>>> };
>> 
>> That allows comparing what the bpf came up with with whatever value
>> userspace generated by stating the file descriptor.
>> 
>> 
>> That is the least bad suggestion I currently have for that
>> functionality.  It really would be better to not have that filter in the
>> bpf program itself but in the infrastructure that binds a program to a
>> set of tasks.
>> 
>> The problem with this approach is whatever device/inode you have when
>> the namespace they refer to exits there is the possibility that the
>> inode will be reused.  So your filter will eventually start matching on
>> the wrong thing.
>
> I come up with a differeent helper definition, which is much more
> similar to existing bpf_get_current_pid_tgid() and helper definition
> much more conforms to bpf convention.

There is a problem with your bpf_get_ns_current_pid_tgid below.
The inode number is a 64bit number.  To be nice to old userspace
we try and not use 64bit inode numbers where they are not required
but in this case we should not use an interface that assumes inode
numbers are 32bit.  They just aren't.

I didn't know how to express that in the bpf proto so I did what
I could.

The alternative to this would be to simply restrict this
helper to bpf programs registered in the initial pid namespace.
At which point you could just ensure all the numbers are in
the global pid namespace.

Hmm.  Looing at the comment below I am confused.

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 5e28718928ca..bc26903c80c7 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -11,6 +11,8 @@
>   #include <linux/uidgid.h>
>   #include <linux/filter.h>
>   #include <linux/ctype.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/proc_ns.h>
>
>   #include "../../lib/kstrtox.h"
>
> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = {
>          .arg4_type      = ARG_PTR_TO_LONG,
>   };
>   #endif
> +
> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum)
> +{
> +       struct task_struct *task = current;
> +       struct pid_namespace *pidns;
> +       pid_t pid, tgid;
> +
> +       if (unlikely(!task))
> +               return -EINVAL;
> +
> +
> +       pidns = task_active_pid_ns(task);
> +       if (unlikely(!pidns))
> +               return -ENOENT;
> +
> +       if (!ns_match(&pidns->ns, (dev_t)dev, inum))
> +               return -EINVAL;
> +
> +       pid = task_pid_nr_ns(task, pidns);
> +       tgid = task_tgid_nr_ns(task, pidns);
> +
> +       return (u64) tgid << 32 | pid;
> +}
> +
> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = {
> +       .func           = bpf_get_ns_current_pid_tgid,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_ANYTHING,
> +       .arg2_type      = ARG_ANYTHING,
> +};
>
> Existing usage of bpf_get_current_pid_tgid() can be converted
> to bpf_get_ns_current_pid_tgid() if ns dev/inode number
> is supplied. For bpf_get_ns_current_pid_tgid(), checking
> return value ( < 0 or not) is needed.

Ok.  I missed something.

What is the problem bpf_get_ns_current_pid_tgid trying to solve
that bpf_get_current_pid_tgid does not solve.

I would think since much of tracing ebpf is fundamentally restricted
to the global root user.  Limiting the ebpf programs to the initial
pid namespace should not be a problem.

So I don't understand why you need to specify the namespace in
the ebpf call.

Can someone give me a clue what problem is being sovled by this
new call?

Eric

  parent reply index

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-06 15:09 [PATCH bpf-next v10 0/4] BPF: New helper to obtain namespace data from current task Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 1/4] fs/namei.c: make available filename_lookup() for bpf helpers Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info Carlos Neira
2019-09-06 15:24   ` Al Viro
2019-09-06 15:46     ` Al Viro
2019-09-06 16:00       ` Al Viro
2019-09-06 23:21         ` Yonghong Song
2019-09-07  0:10           ` Al Viro
2019-09-07  6:34             ` Yonghong Song
2019-09-09 17:45               ` Carlos Antonio Neira Bustos
2019-09-10 22:35                 ` Yonghong Song
2019-09-10 23:15                   ` Al Viro
2019-09-11  8:16                     ` ebiederm
2019-09-12  5:49                       ` Yonghong Song
     [not found]                         ` <CACiB22j9M2gmccnh7XqqFp8g7qKFuiOrSAVJiA2tQHLB0pmoSQ@mail.gmail.com>
2019-09-13  2:56                           ` Yonghong Song
2019-09-13 11:58                             ` Carlos Antonio Neira Bustos
2019-09-13 16:59                         ` ebiederm [this message]
2019-09-13 17:28                           ` Yonghong Song
2019-09-11  4:32                   ` Carlos Antonio Neira Bustos
2019-09-11  8:17                     ` ebiederm
2019-09-10 22:46   ` Yonghong Song
2019-09-11  4:33     ` Carlos Antonio Neira Bustos
2019-09-06 15:09 ` [PATCH bpf-next v10 3/4] tools: Added bpf_get_current_pidns_info helper Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 4/4] tools/testing/selftests/bpf: Add self-tests for helper bpf_get_pidns_info Carlos Neira
2019-09-10 22:55   ` Yonghong Song

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a7b8gmj8.fsf@x220.int.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=cneirabustos@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org bpf@archiver.kernel.org
	public-inbox-index bpf


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/ public-inbox