BPF Archive on lore.kernel.org
 help / color / Atom feed
From: Yonghong Song <yhs@fb.com>
To: Carlos Antonio Neira Bustos <cneirabustos@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"ebiederm@xmission.com" <ebiederm@xmission.com>,
	"brouer@redhat.com" <brouer@redhat.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
Date: Tue, 10 Sep 2019 22:35:09 +0000
Message-ID: <dadf3657-2648-14ef-35ee-e09efb2cdb3e@fb.com> (raw)
In-Reply-To: <20190909174522.GA17882@frodo.byteswizards.com>


Discussed with Eric today for what is the best way to get
the device number for a namespace. The following patch seems
a reasonable start although Eric would like to see
how the helper is used in order to decide whether the
interface looks right.

commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
Author: Yonghong Song <yhs@fb.com>
Date:   Mon Sep 9 21:50:51 2019 -0700

     nsfs: add an interface function ns_get_inum_dev()

     This patch added an interface function
     ns_get_inum_dev(). Given a ns_common structure,
     the function returns the inode and device
     numbers. The function will be used later
     by a newly added bpf helper.

     Signed-off-by: Yonghong Song <yhs@fb.com>

diff --git a/fs/nsfs.c b/fs/nsfs.c
index a0431642c6b5..a603c6fc3f54 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
         return ERR_PTR(-EINVAL);

+/* Get the device number for the current task pidns.
+ */
+void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
+       *inum = ns->inum;
+       *dev = nsfs_mnt->mnt_sb->s_dev;
  static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
         struct inode *inode = d_inode(dentry);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..b8fc680cdf1a 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
task_struct *task,
  typedef struct ns_common *ns_get_path_helper_t(void *);
  extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
                             void *private_data);
+extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev);

  extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
                         const struct proc_ns_operations *ns_ops);

Could you put the above change and patch #1 and then have
all your other patches? In your kernel change, please use
interface function ns_get_inum_dev() to get pidns inode number
and dev number.

On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote:
> Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and
> provide technical insights needed on this one.
> But how do we move this forward?
> Al Viro's review is clear that this will not work and we should strip the name
> resolution code (thanks for your detailed analysis).
> As there is currently only one instance of the nsfs device on the system,
> I think we could leave out the retrieval of the pidns device number and address it
> when the situation changes.
> What do you think?
> On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote:
>> On 9/6/19 5:10 PM, Al Viro wrote:
>>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote:
>>>> -bash-4.4$ readlink /proc/self/ns/pid
>>>> pid:[4026531836]
>>>> -bash-4.4$ stat /proc/self/ns/pid
>>>>      File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’
>>>>      Size: 0               Blocks: 0          IO Block: 1024   symbolic link
>>>> Device: 4h/4d   Inode: 344795989   Links: 1
>>>> Access: (0777/lrwxrwxrwx)  Uid: (128203/     yhs)   Gid: (  100/   users)
>>>> Context: user_u:base_r:base_t
>>>> Access: 2019-09-06 16:06:09.431616380 -0700
>>>> Modify: 2019-09-06 16:06:09.431616380 -0700
>>>> Change: 2019-09-06 16:06:09.431616380 -0700
>>>>     Birth: -
>>>> -bash-4.4$
>>>> Based on a discussion with Eric Biederman back in 2019 Linux
>>>> Plumbers, Eric suggested that to uniquely identify a
>>>> namespace, device id (major/minor) number should also
>>>> be included. Although today's kernel implementation
>>>> has the same device for all namespace pseudo files,
>>>> but from uapi perspective, device id should be included.
>>>> That is the reason why we try to get device id which holds
>>>> pid namespace pseudo file.
>>>> Do you have a better suggestion on how to get
>>>> the device id for 'current' pid namespace? Or from design, we
>>>> really should not care about device id at all?
>>> What the hell is "device id for pid namespace"?  This is the
>>> first time I've heard about that mystery object, so it's
>>> hard to tell where it could be found.
>>> I can tell you what device numbers are involved in the areas
>>> you seem to be looking in.
>>> 1) there's whatever device number that gets assigned to
>>> (this) procfs instance.  That, ironically, _is_ per-pidns, but
>>> that of the procfs instance, not that of your process (and
>>> those can be different).  That's what you get in ->st_dev
>>> when doing lstat() of anything in /proc (assuming that
>>> procfs is mounted there, in the first place).  NOTE:
>>> that's lstat(2), not stat(2).  stat(1) uses lstat(2),
>>> unless given -L (in which case it's stat(2) time).  The
>>> difference:
>>> root@kvm1:~# stat /proc/self/ns/pid
>>>     File: /proc/self/ns/pid -> pid:[4026531836]
>>>     Size: 0               Blocks: 0          IO Block: 1024   symbolic link
>>> Device: 4h/4d   Inode: 17396       Links: 1
>>> Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Access: 2019-09-06 19:43:11.871312319 -0400
>>> Modify: 2019-09-06 19:43:11.871312319 -0400
>>> Change: 2019-09-06 19:43:11.871312319 -0400
>>>    Birth: -
>>> root@kvm1:~# stat -L /proc/self/ns/pid
>>>     File: /proc/self/ns/pid
>>>     Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>> Device: 3h/3d   Inode: 4026531836  Links: 1
>>> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Access: 2019-09-06 19:43:15.955313293 -0400
>>> Modify: 2019-09-06 19:43:15.955313293 -0400
>>> Change: 2019-09-06 19:43:15.955313293 -0400
>>>    Birth: -
>>> The former is lstat, the latter - stat.
>>> 2) device number of the filesystem where the symlink target lives.
>>> In this case, it's nsfs and there's only one instance on the entire
>>> system.  _That_ would be obtained by looking at st_dev in stat(2) on
>>> /proc/self/ns/pid (0:3 above).
>>> 3) device number *OF* the symlink.  That would be st_rdev in lstat(2).
>>> There's none - it's a symlink, not a character or block device.  It's
>>> always zero and always will be zero.
>>> 4) the same for the target; st_rdev in stat(2) results and again,
>>> there's no such beast - it's neither character nor block device.
>>> Your code is looking at (3).  Please, reread any textbook on Unix
>>> in the section that would cover stat(2) and discussion of the
>>> difference between st_dev and st_rdev.
>>> I have no idea what Eric had been talking about - it's hard to
>>> reconstruct by what you said so far.  Making nsfs per-userns,
>>> perhaps?  But that makes no sense whatsoever, not that userns
>>> ever had...  Cheap shots aside, I really can't guess what that's
>>> about.  Sorry.
>> Thanks for the detailed information. The device number we want
>> is nsfs. Indeed, currently, there is only one instance
>> on the entire system. But not exactly sure what is the possibility
>> to have more than one nsfs device in the future. Maybe per-userns
>> or any other criteria?
>>> In any case, pathname resolution is *NOT* for the situations where
>>> you can't block.  Even if it's procfs (and from the same pidns as
>>> the process) mounted there, there is no promise that the target
>>> of /proc/self has already been looked up and not evicted from
>>> memory since then.  And in case of cache miss pathwalk will
>>> have to call ->lookup(), which requires locking the directory
>>> (rw_sem, shared).  You can't do that in such context.
>>> And that doesn't even go into the possibility that process has
>>> something very different mounted on /proc.
>>> Again, I don't know what it is that you want to get to, but
>>> I would strongly recommend finding a way to get to that data
>>> that would not involve going anywhere near pathname resolution.
>>> How would you expect the userland to work with that value,
>>> whatever it might be?  If it's just a 32bit field that will
>>> never be read, you might as well store there the same value
>>> you store now (0, that is) in much cheaper and safer way ;-)
>> Suppose inside pid namespace, user can pass the device number,
>> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map
>> or JIT). At runtime, bpf program will try to get device number,
>> say n2, for the 'current' process. If n1 is not the same as
>> n2, that means they are not in the same namespace. 'current'
>> is in the same pid namespace as the user iff
>> n1 == n2 and also pidns id is the same for 'current' and
>> the one with `lsns -t pid`.
>> Are you aware of any way to get the pidns device number
>> for 'current' without going through the pathname
>> lookup?

  reply index

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-06 15:09 [PATCH bpf-next v10 0/4] BPF: New helper to obtain namespace data from current task Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 1/4] fs/namei.c: make available filename_lookup() for bpf helpers Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info Carlos Neira
2019-09-06 15:24   ` Al Viro
2019-09-06 15:46     ` Al Viro
2019-09-06 16:00       ` Al Viro
2019-09-06 23:21         ` Yonghong Song
2019-09-07  0:10           ` Al Viro
2019-09-07  6:34             ` Yonghong Song
2019-09-09 17:45               ` Carlos Antonio Neira Bustos
2019-09-10 22:35                 ` Yonghong Song [this message]
2019-09-10 23:15                   ` Al Viro
2019-09-11  8:16                     ` Eric W. Biederman
2019-09-12  5:49                       ` Yonghong Song
     [not found]                         ` <CACiB22j9M2gmccnh7XqqFp8g7qKFuiOrSAVJiA2tQHLB0pmoSQ@mail.gmail.com>
2019-09-13  2:56                           ` Yonghong Song
2019-09-13 11:58                             ` Carlos Antonio Neira Bustos
2019-09-13 16:59                         ` Eric W. Biederman
2019-09-13 17:28                           ` Yonghong Song
2019-09-11  4:32                   ` Carlos Antonio Neira Bustos
2019-09-11  8:17                     ` Eric W. Biederman
2019-09-10 22:46   ` Yonghong Song
2019-09-11  4:33     ` Carlos Antonio Neira Bustos
2019-09-06 15:09 ` [PATCH bpf-next v10 3/4] tools: Added bpf_get_current_pidns_info helper Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 4/4] tools/testing/selftests/bpf: Add self-tests for helper bpf_get_pidns_info Carlos Neira
2019-09-10 22:55   ` Yonghong Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=dadf3657-2648-14ef-35ee-e09efb2cdb3e@fb.com \
    --to=yhs@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=cneirabustos@gmail.com \
    --cc=ebiederm@xmission.com \
    --cc=netdev@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git