From: Yonghong Song <yhs@fb.com>
To: Carlos Antonio Neira Bustos <cneirabustos@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"ebiederm@xmission.com" <ebiederm@xmission.com>,
"brouer@redhat.com" <brouer@redhat.com>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.
Date: Tue, 10 Sep 2019 22:35:09 +0000 [thread overview]
Message-ID: <dadf3657-2648-14ef-35ee-e09efb2cdb3e@fb.com> (raw)
In-Reply-To: <20190909174522.GA17882@frodo.byteswizards.com>
Carlos,
Discussed with Eric today for what is the best way to get
the device number for a namespace. The following patch seems
a reasonable start although Eric would like to see
how the helper is used in order to decide whether the
interface looks right.
commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
Author: Yonghong Song <yhs@fb.com>
Date: Mon Sep 9 21:50:51 2019 -0700
nsfs: add an interface function ns_get_inum_dev()
This patch added an interface function
ns_get_inum_dev(). Given a ns_common structure,
the function returns the inode and device
numbers. The function will be used later
by a newly added bpf helper.
Signed-off-by: Yonghong Song <yhs@fb.com>
diff --git a/fs/nsfs.c b/fs/nsfs.c
index a0431642c6b5..a603c6fc3f54 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
return ERR_PTR(-EINVAL);
}
+/* Get the device number for the current task pidns.
+ */
+void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
+{
+ *inum = ns->inum;
+ *dev = nsfs_mnt->mnt_sb->s_dev;
+}
+
static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
{
struct inode *inode = d_inode(dentry);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..b8fc680cdf1a 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct
task_struct *task,
typedef struct ns_common *ns_get_path_helper_t(void *);
extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t
ns_get_cb,
void *private_data);
+extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev);
extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
const struct proc_ns_operations *ns_ops);
Could you put the above change and patch #1 and then have
all your other patches? In your kernel change, please use
interface function ns_get_inum_dev() to get pidns inode number
and dev number.
On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote:
> Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and
> provide technical insights needed on this one.
> But how do we move this forward?
> Al Viro's review is clear that this will not work and we should strip the name
> resolution code (thanks for your detailed analysis).
> As there is currently only one instance of the nsfs device on the system,
> I think we could leave out the retrieval of the pidns device number and address it
> when the situation changes.
> What do you think?
>
>
> On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote:
>>
>>
>> On 9/6/19 5:10 PM, Al Viro wrote:
>>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote:
>>>
>>>> -bash-4.4$ readlink /proc/self/ns/pid
>>>> pid:[4026531836]
>>>> -bash-4.4$ stat /proc/self/ns/pid
>>>> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’
>>>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link
>>>> Device: 4h/4d Inode: 344795989 Links: 1
>>>> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users)
>>>> Context: user_u:base_r:base_t
>>>> Access: 2019-09-06 16:06:09.431616380 -0700
>>>> Modify: 2019-09-06 16:06:09.431616380 -0700
>>>> Change: 2019-09-06 16:06:09.431616380 -0700
>>>> Birth: -
>>>> -bash-4.4$
>>>>
>>>> Based on a discussion with Eric Biederman back in 2019 Linux
>>>> Plumbers, Eric suggested that to uniquely identify a
>>>> namespace, device id (major/minor) number should also
>>>> be included. Although today's kernel implementation
>>>> has the same device for all namespace pseudo files,
>>>> but from uapi perspective, device id should be included.
>>>>
>>>> That is the reason why we try to get device id which holds
>>>> pid namespace pseudo file.
>>>>
>>>> Do you have a better suggestion on how to get
>>>> the device id for 'current' pid namespace? Or from design, we
>>>> really should not care about device id at all?
>>>
>>> What the hell is "device id for pid namespace"? This is the
>>> first time I've heard about that mystery object, so it's
>>> hard to tell where it could be found.
>>>
>>> I can tell you what device numbers are involved in the areas
>>> you seem to be looking in.
>>>
>>> 1) there's whatever device number that gets assigned to
>>> (this) procfs instance. That, ironically, _is_ per-pidns, but
>>> that of the procfs instance, not that of your process (and
>>> those can be different). That's what you get in ->st_dev
>>> when doing lstat() of anything in /proc (assuming that
>>> procfs is mounted there, in the first place). NOTE:
>>> that's lstat(2), not stat(2). stat(1) uses lstat(2),
>>> unless given -L (in which case it's stat(2) time). The
>>> difference:
>>>
>>> root@kvm1:~# stat /proc/self/ns/pid
>>> File: /proc/self/ns/pid -> pid:[4026531836]
>>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link
>>> Device: 4h/4d Inode: 17396 Links: 1
>>> Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
>>> Access: 2019-09-06 19:43:11.871312319 -0400
>>> Modify: 2019-09-06 19:43:11.871312319 -0400
>>> Change: 2019-09-06 19:43:11.871312319 -0400
>>> Birth: -
>>> root@kvm1:~# stat -L /proc/self/ns/pid
>>> File: /proc/self/ns/pid
>>> Size: 0 Blocks: 0 IO Block: 4096 regular empty file
>>> Device: 3h/3d Inode: 4026531836 Links: 1
>>> Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
>>> Access: 2019-09-06 19:43:15.955313293 -0400
>>> Modify: 2019-09-06 19:43:15.955313293 -0400
>>> Change: 2019-09-06 19:43:15.955313293 -0400
>>> Birth: -
>>>
>>> The former is lstat, the latter - stat.
>>>
>>> 2) device number of the filesystem where the symlink target lives.
>>> In this case, it's nsfs and there's only one instance on the entire
>>> system. _That_ would be obtained by looking at st_dev in stat(2) on
>>> /proc/self/ns/pid (0:3 above).
>>>
>>> 3) device number *OF* the symlink. That would be st_rdev in lstat(2).
>>> There's none - it's a symlink, not a character or block device. It's
>>> always zero and always will be zero.
>>>
>>> 4) the same for the target; st_rdev in stat(2) results and again,
>>> there's no such beast - it's neither character nor block device.
>>>
>>> Your code is looking at (3). Please, reread any textbook on Unix
>>> in the section that would cover stat(2) and discussion of the
>>> difference between st_dev and st_rdev.
>>>
>>> I have no idea what Eric had been talking about - it's hard to
>>> reconstruct by what you said so far. Making nsfs per-userns,
>>> perhaps? But that makes no sense whatsoever, not that userns
>>> ever had... Cheap shots aside, I really can't guess what that's
>>> about. Sorry.
>>
>> Thanks for the detailed information. The device number we want
>> is nsfs. Indeed, currently, there is only one instance
>> on the entire system. But not exactly sure what is the possibility
>> to have more than one nsfs device in the future. Maybe per-userns
>> or any other criteria?
>>
>>>
>>> In any case, pathname resolution is *NOT* for the situations where
>>> you can't block. Even if it's procfs (and from the same pidns as
>>> the process) mounted there, there is no promise that the target
>>> of /proc/self has already been looked up and not evicted from
>>> memory since then. And in case of cache miss pathwalk will
>>> have to call ->lookup(), which requires locking the directory
>>> (rw_sem, shared). You can't do that in such context.
>>>
>>> And that doesn't even go into the possibility that process has
>>> something very different mounted on /proc.
>>>
>>> Again, I don't know what it is that you want to get to, but
>>> I would strongly recommend finding a way to get to that data
>>> that would not involve going anywhere near pathname resolution.
>>>
>>> How would you expect the userland to work with that value,
>>> whatever it might be? If it's just a 32bit field that will
>>> never be read, you might as well store there the same value
>>> you store now (0, that is) in much cheaper and safer way ;-)
>>
>> Suppose inside pid namespace, user can pass the device number,
>> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map
>> or JIT). At runtime, bpf program will try to get device number,
>> say n2, for the 'current' process. If n1 is not the same as
>> n2, that means they are not in the same namespace. 'current'
>> is in the same pid namespace as the user iff
>> n1 == n2 and also pidns id is the same for 'current' and
>> the one with `lsns -t pid`.
>>
>> Are you aware of any way to get the pidns device number
>> for 'current' without going through the pathname
>> lookup?
>>
next prev parent reply other threads:[~2019-09-10 22:35 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-06 15:09 [PATCH bpf-next v10 0/4] BPF: New helper to obtain namespace data from current task Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 1/4] fs/namei.c: make available filename_lookup() for bpf helpers Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info Carlos Neira
2019-09-06 15:24 ` Al Viro
2019-09-06 15:46 ` Al Viro
2019-09-06 16:00 ` Al Viro
2019-09-06 23:21 ` Yonghong Song
2019-09-07 0:10 ` Al Viro
2019-09-07 6:34 ` Yonghong Song
2019-09-09 17:45 ` Carlos Antonio Neira Bustos
2019-09-10 22:35 ` Yonghong Song [this message]
2019-09-10 23:15 ` Al Viro
2019-09-11 8:16 ` Eric W. Biederman
2019-09-12 5:49 ` Yonghong Song
[not found] ` <CACiB22j9M2gmccnh7XqqFp8g7qKFuiOrSAVJiA2tQHLB0pmoSQ@mail.gmail.com>
2019-09-13 2:56 ` Yonghong Song
2019-09-13 11:58 ` Carlos Antonio Neira Bustos
2019-09-13 16:59 ` Eric W. Biederman
2019-09-13 17:28 ` Yonghong Song
2019-09-11 4:32 ` Carlos Antonio Neira Bustos
2019-09-11 8:17 ` Eric W. Biederman
2019-09-10 22:46 ` Yonghong Song
2019-09-11 4:33 ` Carlos Antonio Neira Bustos
2019-09-06 15:09 ` [PATCH bpf-next v10 3/4] tools: Added bpf_get_current_pidns_info helper Carlos Neira
2019-09-06 15:09 ` [PATCH bpf-next v10 4/4] tools/testing/selftests/bpf: Add self-tests for helper bpf_get_pidns_info Carlos Neira
2019-09-10 22:55 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=dadf3657-2648-14ef-35ee-e09efb2cdb3e@fb.com \
--to=yhs@fb.com \
--cc=bpf@vger.kernel.org \
--cc=brouer@redhat.com \
--cc=cneirabustos@gmail.com \
--cc=ebiederm@xmission.com \
--cc=netdev@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).