From: Kirill Tkhai <ktkhai@virtuozzo.com>
To: Andrei Vagin <avagin@gmail.com>
Cc: adobriyan@gmail.com, "Eric W. Biederman" <ebiederm@xmission.com>,
	viro@zeniv.linux.org.uk, davem@davemloft.net,
	akpm@linux-foundation.org, christian.brauner@ubuntu.com,
	areber@redhat.com, serge@hallyn.com,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Subject: Re: [PATCH 00/23] proc: Introduce /proc/namespaces/ directory to expose namespaces lineary
Date: Tue, 11 Aug 2020 13:23:35 +0300
Message-ID: <33565447-9b97-a820-bc2c-a4ff53a7675a@virtuozzo.com>
In-Reply-To: <20200810173431.GA68662@gmail.com>

On 10.08.2020 20:34, Andrei Vagin wrote:
> On Fri, Aug 07, 2020 at 11:47:57AM +0300, Kirill Tkhai wrote:
>> On 06.08.2020 11:05, Andrei Vagin wrote:
>>> On Mon, Aug 03, 2020 at 01:03:17PM +0300, Kirill Tkhai wrote:
>>>> On 31.07.2020 01:13, Eric W. Biederman wrote:
>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>>>
>>>>>> On 30.07.2020 17:34, Eric W. Biederman wrote:
>>>>>>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>>>>>>
>>>>>>>> Currently, there is no way to list or iterate all or a subset of the
>>>>>>>> namespaces in the system. Some namespaces are exposed in /proc/[pid]/ns/
>>>>>>>> directories, but some may also exist only as open files, which are not
>>>>>>>> attached to a process. When an open namespace fd is sent over a unix socket
>>>>>>>> and then closed, it is impossible to know whether the namespace exists or not.
>>>>>>>>
>>>>>>>> Also, even if a namespace is exposed as attached to a process or as an open file,
>>>>>>>> iteration over /proc/*/ns/* or /proc/*/fd/* namespaces is not fast, because
>>>>>>>> the work multiplies with the number of tasks and fds.
>>>>>>>
>>>>>>> I am very dubious about this.
>>>>>>>
>>>>>>> I have been avoiding exactly this kind of interface because it can
>>>>>>> create rather fundamental problems with checkpoint restart.
>>>>>>
>>>>>> restart/restore :)
>>>>>>
>>>>>>> You do have some filtering and the filtering is not based on current.
>>>>>>> Which is good.
>>>>>>>
>>>>>>> A view that is relative to a user namespace might be ok.  It almost
>>>>>>> certainly does better as its own little filesystem than as an extension
>>>>>>> to proc though.
>>>>>>>
>>>>>>> The big thing we want to ensure is that if you migrate you can restore
>>>>>>> everything.  I don't see how you will be able to restore these files
>>>>>>> after migration.  Anything like this without having a complete
>>>>>>> checkpoint/restore story is a non-starter.
>>>>>>
>>>>>> There is no difference between files in /proc/namespaces/ directory and /proc/[pid]/ns/.
>>>>>>
>>>>>> CRIU can restore open files in /proc/[pid]/ns, the same will be with /proc/namespaces/ files.
>>>>>> As a person who worked deeply for pid_ns and user_ns support in CRIU, I don't see any
>>>>>> problem here.
>>>>>
>>>>> An obvious difference is that you are adding the inode to
>>>>> the file name.  Which means that now you really do have to preserve the
>>>>> inode numbers during process migration.
>>>>>
>>>>> Which means now we have to do all of the work to make inode number
>>>>> restoration possible.  Which means now we need to have multiple
>>>>> instances of nsfs so that we can restore inode numbers.
>>>>>
>>>>> I think this is still possible but we have been delaying figuring out
>>>>> how to restore inode numbers long enough that there may be actual technical
>>>>> problems making it happen.
>>>>
>>>> Yeah, this matters. But it looks like this is not a dead end. We just need
>>>> to change the names under which the namespaces are exported to a particular
>>>> fs, and to support rename().
>>>>
>>>> Before introducing a fundamentally new filesystem type for this, can't
>>>> this be solved in the current /proc?
>>>
>>> do you mean to introduce names for namespaces which users will be able
>>> to change? By default, this can be uuid.
>>
>> Yes, I mean this.
>>
>> Currently I won't give a final answer about UUID, but I planned to show some
>> default names, which are based on namespace type and inode num. Completely custom
>> names for any /proc by default will waste too much memory.
>>
>> So, I think a good way will be:
>>
>> 1) Introduce a function which returns a hash/uuid based on ino, ns type and some
>>    static random seed, which is generated on boot;
>>
>> 2) Use the hash/uuid as default names in the newly created /proc/namespaces:
>>    pid-{hash/uuid(ino, "pid")}
>>
>> 3) Allow rename, and allocate space only for renamed names.
>>
>> Maybe 2 and 3 will be implemented as shrinkable and non-shrinkable dentries.
>>
>>> And I have a suggestion about the structure of /proc/namespaces/.
>>>
>>> Each namespace is owned by one of user namespaces. Maybe it makes sense
>>> to group namespaces by their user-namespaces?
>>>
>>> /proc/namespaces/
>>>                  user
>>>                  mnt-X
>>>                  mnt-Y
>>>                  pid-X
>>>                  uts-Z
>>>                  user-X/
>>>                         user
>>>                         mnt-A
>>>                         mnt-B
>>>                         user-C
>>>                         user-C/
>>>                                user
>>>                  user-Y/
>>>                         user
>>
>> Hm, I don't think that the user namespace is a generic key for everybody.
>> For most people's tasks a user namespace is just one namespace type among
>> others. To me it would look a bit strange to iterate user namespaces
>> to build a container's net topology.
> 
> I can’t agree with you that the user namespace is one of others. It is
> the namespace for namespaces. It sets security boundaries in the system
> and we need to know them to understand the whole system.
> 
> If user namespaces are not used in the system or on a container, you
> will see all namespaces in one directory. But if the system has a more
> complicated structure, you will be able to build a full picture of it.
> 
> You said that one of the users of this feature is CRIU (the tool to
> checkpoint/restore containers)  and you said that it would be good if
> CRIU will be able to collect all container namespaces before dumping
> processes, sockets, files etc. But how will we be able to do this if we
> will list all namespaces in one directory?

There is no problem; this looks rather simple. Two cases are possible:

1) The container has a dedicated set of namespaces, and CRIU just has to
   iterate the files in /proc/namespaces of the container's root pid
   namespace. The relationships between parent and child pid and user
   namespaces are found via ioctl(NS_GET_PARENT).

2) The container has no dedicated set of namespaces. Then CRIU just has to
   iterate all host namespaces. There is no other way to do that, because
   the container may use any of the host namespaces, and a hierarchy in
   /proc/namespaces won't help you.

> Here are my thoughts on why the suggested structure is better
> than just a list of namespaces:
> 
> * Users will be able to understand security boundaries in the system.
>   Each namespace in the system is owned by one of the user namespaces and we
>   need to know these relationships to understand the whole system.

There are already NS_GET_PARENT and NS_GET_USERNS. What is the problem with
using these interfaces?

> * This simplifies collecting namespaces which belong to one container.
> 
> For example, CRIU collects all namespaces before dumping file
> descriptors. Then it collects all sockets with socket-diag in network
> namespaces and collects mount points via /proc/pid/mountinfo in mount
> namespaces. Then this information is used to dump socket file
> descriptors and opened files.

This is exactly what I am saying. It allows us to avoid writing a recursive
dump. But it says nothing about the advantages of a hierarchy in
/proc/namespaces.

> * We are going to assign names to namespaces. But this means that we
> need to guarantee that all names in one directory are unique. The
> initial proposal was to enumerate all namespaces in one proc directory,
> that means names of all namespaces have to be unique. This can be
> problematic in some cases. For example, we may want to dump a container
> and then restore it more than once on the same host. How are we going to
> avoid namespace name conflicts in such cases?

In a previous message I wrote about .rename for proc files, and Alexey
Dobriyan said this is not taboo. Is there a problem in the case you
describe that rename does not cover?

> If we will have per-user-namespace directories, we will need to
> guarantee that names are unique only inside one user namespace.

Unique names inside one user namespace won't introduce a new /proc
mount. You can't pass a sub-directory of /proc/namespaces/ to a specific
container. To give a virtualized name you have to have a dedicated pid ns.

Suppose that in one /proc mount we have:

/mnt1/proc/namespaces/userns1/.../[namespaceX_name1 -- inode XXX]

In another /proc mount we have:

/mnt2/proc/namespaces/userns1/.../[namespaceX_name1_synonym -- inode XXX]

The virtualization is done per /proc (i.e., per pid ns). On restore, the
container should receive either /mnt1/proc or /mnt2/proc as its /proc.

A directory hierarchy makes no sense for virtualization, since you can't
use a specific sub-directory as the root directory of /proc/namespaces
for a container. You still have to introduce a new pid ns to have a
virtualized /proc.
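
To make the naming scheme concrete, here is a small userspace model, not
kernel code. It assumes the default name is "<type>-<hash>" derived from a
boot seed, the ns type and the inode number, and that each /proc mount
keeps a table only for renamed entries. Everything in it (struct
proc_mount, ns_rename, ns_name, and FNV-1a as the hash) is a hypothetical
stand-in chosen for the illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN    64
#define MAX_RENAMED 16

/* One renamed entry: only renamed namespaces consume space. */
struct ns_alias {
	uint64_t ino;
	char name[NAME_LEN];
};

/* Hypothetical per-/proc-mount (i.e. per-pid-ns) state. */
struct proc_mount {
	struct ns_alias aliases[MAX_RENAMED];
	size_t nr_aliases;
};

/* FNV-1a, used here only as a stand-in for the real hash/uuid function. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Default name: "<type>-<hash(seed, type, ino)>", no storage needed. */
static void ns_default_name(uint64_t boot_seed, const char *type,
			    uint64_t ino, char *buf, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL ^ boot_seed;

	h = fnv1a(h, type, strlen(type));
	h = fnv1a(h, &ino, sizeof(ino));
	snprintf(buf, len, "%s-%016llx", type, (unsigned long long)h);
}

/* Record a new name for `ino` in this mount only; -1 if the table is full. */
static int ns_rename(struct proc_mount *m, uint64_t ino, const char *name)
{
	size_t i;

	for (i = 0; i < m->nr_aliases; i++)
		if (m->aliases[i].ino == ino)
			break;
	if (i == m->nr_aliases) {
		if (m->nr_aliases == MAX_RENAMED)
			return -1;
		m->aliases[m->nr_aliases++].ino = ino;
	}
	snprintf(m->aliases[i].name, NAME_LEN, "%s", name);
	return 0;
}

/* Name shown in mount `m`: the alias if renamed, else the default name. */
static void ns_name(const struct proc_mount *m, uint64_t boot_seed,
		    const char *type, uint64_t ino, char *buf, size_t len)
{
	for (size_t i = 0; i < m->nr_aliases; i++) {
		if (m->aliases[i].ino == ino) {
			snprintf(buf, len, "%s", m->aliases[i].name);
			return;
		}
	}
	ns_default_name(boot_seed, type, ino, buf, len);
}
```

With two struct proc_mount instances standing for /mnt1/proc and
/mnt2/proc, a rename in the second one changes the name seen there while
the first keeps showing the default name for the same inode. This is the
per-mount synonym case from the example above.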

> * With the suggested structure, for each user namespace, we will show
>   only its subtree of namespaces. This looks more natural than
>   filtering the content of one directory.

That is rather subjective, I think. /proc is related to the pid ns, and a
user ns hierarchy does not look more natural to me.
