* A Plumber’s Wish List for Linux
@ 2011-10-06 23:17 Kay Sievers
  2011-10-06 23:46 ` Andi Kleen
                   ` (7 more replies)
  0 siblings, 8 replies; 81+ messages in thread
From: Kay Sievers @ 2011-10-06 23:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: lennart, harald, david, greg

We’d like to share our current wish list of plumbing layer features we
are hoping to see implemented in the near future in the Linux kernel and
associated tools. Some items we can implement on our own, others are not
our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not
shorter, even though we have implemented a number of other features on
our own in previous years, we are posting this list here in the hope of
finding some help.

If you happen to be interested in working on something from this list
or are able to help out, we’d be delighted. Please ping us in case you need
clarifications or more information on specific items.


Thanks,
Kay, Lennart, Harald, on behalf of all the other plumbers



And here’s the wish list, in no particular order:

* (ioctl based?) interface to query and modify the label of a mounted
FAT volume:
A FAT label is implemented as a hidden directory entry in the file
system which needs to be renamed when changing the file system label;
this is impossible to do from userspace without unmounting. Hence we’d
like to see a kernel interface that is available on the mounted file
system mount point itself. Of course, bonus points if this new interface
can be implemented for other file systems as well, and also covers fs
UUIDs in addition to labels.
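For concreteness, the hidden directory entry mentioned above is a 32-byte FAT directory entry whose attribute byte (at offset 11) has the volume-label bit 0x08 set. A minimal userspace sketch of locating such an entry in a chunk of root-directory data, with the layout taken from the FAT on-disk format (the helper name is made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAT_ATTR_VOLUME_ID 0x08
#define FAT_DIRENT_SIZE    32

/* Scan a buffer of 32-byte FAT directory entries for the volume-label
 * entry: its attribute byte (offset 11) has ATTR_VOLUME_ID set and it
 * is not a VFAT long-name entry (attribute 0x0F). Copies the 11-byte
 * label into out (NUL-terminated, trailing spaces stripped).
 * Returns 0 on success, -1 if no label entry is present. */
int fat_find_label(const uint8_t *dir, size_t len, char out[12])
{
    for (size_t off = 0; off + FAT_DIRENT_SIZE <= len; off += FAT_DIRENT_SIZE) {
        const uint8_t *e = dir + off;
        uint8_t attr = e[11];

        if (e[0] == 0x00)       /* first name byte 0: end of directory */
            break;
        if (attr == 0x0F)       /* VFAT long-name entry, skip */
            continue;
        if (attr & FAT_ATTR_VOLUME_ID) {
            memcpy(out, e, 11);
            out[11] = '\0';
            for (int i = 10; i >= 0 && out[i] == ' '; i--)
                out[i] = '\0';
            return 0;
        }
    }
    return -1;
}
```

Changing the label means rewriting this entry in place, which is exactly what cannot be done safely from userspace while the kernel owns the mounted filesystem.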

* CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
useful to allow module auto-loading of e.g. cpufreq drivers and KVM
modules. Andi Kleen has a patch to create the alias file itself. CPU
‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct
bus_type cpu’ needs to be introduced to allow proper CPU coldplug event
replay at bootup. This is one of the last remaining places where
automatic hardware-triggered module auto-loading is not available. We’d
like to see this fixed so that the numerous ugly userspace work-arounds
that achieve the same thing can go away.

* expose CAP_LAST_CAP somehow in the running kernel at runtime:
Userspace needs to know the highest valid capability of the running
kernel, which right now cannot reliably be retrieved from header files
only. The fact that this value cannot be detected properly right now
creates various problems for libraries compiled on newer header files
which are run on older kernels. They assume capabilities are available
which actually aren’t. Specifically, libcap-ng claims that all running
processes retain the higher capabilities in this case due to the
“inverted” semantics of CapBnd in /proc/$PID/status.
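Until such an interface exists, the value can be probed at runtime by walking prctl(PR_CAPBSET_READ), which fails for capability numbers the running kernel does not know about. A sketch of that workaround, i.e. precisely the kind of poking this item wants to make unnecessary:

```c
#include <sys/prctl.h>

/* Return the highest capability number the running kernel recognizes,
 * or -1 if PR_CAPBSET_READ is unsupported (pre-2.6.25 kernels).
 * PR_CAPBSET_READ returns 0 or 1 for a valid capability number and
 * fails with EINVAL once we probe past the last one. */
int probe_last_cap(void)
{
    int cap = 0;

    while (prctl(PR_CAPBSET_READ, (unsigned long)cap, 0UL, 0UL, 0UL) >= 0)
        cap++;
    return cap - 1;
}
```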

* export ‘struct device_type fb/fbcon’ of ‘struct class graphics’:
Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
without the need to match on the device name.

* allow changing argv[] of a process without mucking with environ[]:
Something like setproctitle() or a prctl() would be ideal. Of course it
is questionable whether services like sendmail would make use of this,
but on the other hand, for services which fork but do not immediately
exec() another binary, being able to rename these child processes in ps
is important.
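The status quo looks roughly like this: overwrite the contiguous argv[] area in place, with the new title limited to the space the old arguments occupied (real setproctitle() implementations additionally steal space from environ[], hence the mucking). A simplified sketch, assuming the Linux layout where the argv strings are placed back to back:

```c
#include <string.h>

/* Overwrite the argv[] strings in place with a new title. The new
 * title cannot be longer than the original argument area, which is why
 * real implementations also reuse environ[] space. Assumes the argv
 * strings are contiguous, as Linux lays them out at exec time. */
void set_title_in_place(int argc, char *argv[], const char *title)
{
    char *base = argv[0];
    /* Writable area spans from argv[0] to the end of the last argument;
     * the terminating NUL just past it is left untouched. */
    size_t avail = (size_t)(argv[argc - 1] + strlen(argv[argc - 1]) - base);
    size_t n = strlen(title);

    if (n > avail)
        n = avail;
    memset(base, 0, avail);
    memcpy(base, title, n);
}
```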

* provide a proper libmodprobe.so from module-init-tools:
Early boot tools, installers, driver install disks want to access
information about available modules to optimize bootup handling.

* fork throttling mechanism as basic cgroup functionality that is
available in all hierarchies independent of the controllers used:
This is important to implement race-free killing of all members of a
cgroup, so that cgroup member processes cannot fork faster than a
cgroup supervisor process can kill them. This needs to be recursive, so that
not only a cgroup but all its subgroups are covered as well.
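The race being described can be sketched as follows: a supervisor reads the cgroup’s tasks file and signals each member, but members may fork between the read and the kill, so the supervisor has to loop over a stale snapshot and can lose indefinitely. The path handling and helper below are illustrative only, not a real API:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Signal every PID listed in a cgroup "tasks" file once. By the time a
 * PID from the snapshot is signalled, new children may already exist
 * that were not in the snapshot -- the caller must re-read and repeat
 * until this returns 0, and without fork throttling it may never win.
 * Returns the number of PIDs signalled, or -1 on error. */
int signal_cgroup_once(const char *tasks_path, int sig)
{
    FILE *f = fopen(tasks_path, "r");
    int pid, signalled = 0;

    if (!f)
        return -1;
    while (fscanf(f, "%d", &pid) == 1) {
        kill((pid_t)pid, sig);
        signalled++;
    }
    fclose(f);
    return signalled;
}
```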

* proper cgroup-is-empty notification interface:
The current call_usermodehelper() interface is an inefficient and
ugly hack. Tools would prefer anything more lightweight like a netlink,
poll() or fanotify interface.

* allow user xattrs to be set on files in the cgroupfs (and maybe
procfs?)

* simple, reliable and future-proof way to detect whether a specific pid
is running in a CLONE_NEWPID container, i.e. not in the root PID
namespace. Currently, a few ugly hacks are available to detect this
(for example, a process wanting to know whether it is running in a PID
namespace could just look for a PID 2 named kthreadd, a kernel thread
only visible in the root namespace); however, all these solutions encode
information and expectations that shouldn’t be encoded in a namespace
test like this. This functionality is needed in particular since the
removal of the ns
cgroup controller which provided the namespace membership information to
user code.

* allow making use of the “cpu” cgroup controller by default without
breaking RT. Right now creating a cgroup in the “cpu” hierarchy that
shall be able to take advantage of RT is impossible for the generic case
since it needs an RT budget configured which is from a limited resource
pool. What we want is the ability to create cgroups in “cpu” whose
processes get a non-RT weight applied but which, for RT, take advantage
of the parent’s RT budget. We want the separation of RT and non-RT budget
assignment in the “cpu” hierarchy, because right now, you lose RT
functionality in it unless you assign an RT budget. This issue severely
limits the usefulness of “cpu” hierarchy on general purpose systems
right now.

* Add a timerslack cgroup controller, to allow increasing the timer
slack of user session cgroups when the machine is idle.

* An auxiliary metadata message for AF_UNIX called SCM_CGROUPS (or
something like that), i.e. a way to attach sender cgroup membership to
messages sent via AF_UNIX. This is useful in case services such as
syslog are shared among various containers (or service cgroups), and
the syslog implementation needs to be able to distinguish the sending
cgroup in order to separate the logs on disk. Of course,
SCM_CREDENTIALS can be used to look up the PID of the sender followed
by a check in /proc/$PID/cgroup, but that is necessarily racy, and the
race is very real in practice.
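The racy status quo reads roughly like this in code: enable SO_PASSCRED on the receiving AF_UNIX socket, pull the sender PID out of the SCM_CREDENTIALS ancillary message, and only then consult /proc/$PID/cgroup, by which point the PID may already be gone or recycled. A sketch of the receiving side (the helper name is made up; the socket options and cmsg layout are the real Linux API):

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive one datagram and return the sender PID carried in
 * SCM_CREDENTIALS, or -1 on failure. The socket must have SO_PASSCRED
 * enabled. Looking up /proc/<pid>/cgroup afterwards is the race this
 * wish-list item is about: the sender may have exited by then. */
pid_t recv_sender_pid(int fd)
{
    char data[1];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    union {
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS)
            return ((struct ucred *)CMSG_DATA(c))->pid;
    return -1;
}
```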

* SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
control message should carry the process name as available
in /proc/$PID/comm.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
@ 2011-10-06 23:46 ` Andi Kleen
  2011-10-07  0:13   ` Lennart Poettering
  2011-10-07  7:49 ` Matt Helsley
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Andi Kleen @ 2011-10-06 23:46 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, lennart, harald, david, greg

Kay Sievers <kay.sievers@vrfy.org> writes:
>
> * allow changing argv[] of a process without mucking with environ[]:
> Something like setproctitle() or a prctl() would be ideal. Of course
> it

prctl(PR_SET_NAME, ...)

The only problem is that some programs still use argv[] and get the old
name, but at least it works in "top"
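For reference, the prctl() route sets only the 16-byte comm field (what top and /proc/$PID/comm show), and it can be read back with PR_GET_NAME; a minimal sketch with illustrative wrapper names:

```c
#include <string.h>
#include <sys/prctl.h>

/* PR_SET_NAME changes the 16-byte comm field of the calling thread;
 * argv[], and therefore what ps shows by default, is unaffected. */
int set_comm(const char *name)
{
    return prctl(PR_SET_NAME, (unsigned long)name, 0UL, 0UL, 0UL);
}

/* PR_GET_NAME reads comm back; out must hold at least 16 bytes. */
int get_comm(char out[16])
{
    return prctl(PR_GET_NAME, (unsigned long)out, 0UL, 0UL, 0UL);
}
```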

> * An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or
> something like that), i.e. a way to attach sender cgroup membership to
> messages sent via AF_UNIX.

The problem is: this requires a reference count and these reference
counts can be very expensive. We had the same problem with pid
namespaces ruining AF_UNIX performance in some cases.

It can be probably done, but one would need to be very careful
about scalability issues.


> * SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
> control message should carry the process name as available
> in /proc/$PID/comm.

That sounds super racy. No guarantee at all this is unique and useful
for anything and everyone can change it.

The other ideas mostly sound reasonable to me, but I haven't thought
a lot about their details and implications.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only

* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:46 ` Andi Kleen
@ 2011-10-07  0:13   ` Lennart Poettering
  2011-10-07  1:57     ` Andi Kleen
  2011-10-19 23:16     ` H. Peter Anvin
  0 siblings, 2 replies; 81+ messages in thread
From: Lennart Poettering @ 2011-10-07  0:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Kay Sievers, linux-kernel, harald, david, greg

On Thu, 06.10.11 16:46, Andi Kleen (andi@firstfloor.org) wrote:

> 
> Kay Sievers <kay.sievers@vrfy.org> writes:
> >
> > * allow changing argv[] of a process without mucking with environ[]:
> > Something like setproctitle() or a prctl() would be ideal. Of course
> > it
> 
> prctl(PR_SET_NAME, ...)
> 
> The only problem is that some programs still use argv[] and get the old
> name, but at least it works in "top"

Well, I am aware of PR_SET_NAME, but that modifies comm, not argv[]. And
while "top" indeed shows the former, "ps" shows the latter. We are looking
for a nice way to modify argv[] without having to reuse space
from environ[] like most current Linux implementations of
setproctitle() do.

A while back there were patches for PR_SET_PROCTITLE_AREA floating
around. We'd like to see something like that merged one day.

> > * SCM_COMM, with a similar use case as SCM_CGROUPS. This auxiliary
> > control message should carry the process name as available
> > in /proc/$PID/comm.
> 
> That sounds super racy. No guarantee at all this is unique and useful
> for anything and everyone can change it.

Well, it's interesting in the syslog case, and it's OK if people can
change it. What matters is that this information is available simply for
the informational value. Right now, if one combines SCM_CREDENTIALS and
/proc/$PID/comm you often end up with no information about the sender's
name at all, since at the time you try to read comm the PID might
actually not exist anymore at all. We are simply trying to close this
particular race between receiving SCM_CREDENTIALS and reading
/proc/$PID/comm here, we are not looking for a way to make process names
trusted.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

* Re: A Plumber’s Wish List for Linux
  2011-10-07  0:13   ` Lennart Poettering
@ 2011-10-07  1:57     ` Andi Kleen
  2011-10-07 15:58       ` Lennart Poettering
  2011-10-19 23:16     ` H. Peter Anvin
  1 sibling, 1 reply; 81+ messages in thread
From: Andi Kleen @ 2011-10-07  1:57 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Andi Kleen, Kay Sievers, linux-kernel, harald, david, greg

> Well, I am aware of PR_SET_NAME, but that modifies comm, not argv[]. And
> while "top" indeed shows the former, "ps" shows the latter. We are looking
> for a way to nice way to modify argv[] without having to reuse space
> from environ[] like most current Linux implementations of
> setproctitle() do.

It's not clear to me how the kernel could change argv[] any better than you 
could in user space.

> Well, it's interesting in the syslog case, and it's OK if people can
> change it. What matters is that this information is available simply for
> the informational value. Right now, if one combines SCM_CREDENTIALS and
> /proc/$PID/comm you often end up with no information about the senders
> name at all, since at the time you try to read comm the PID might
> actually not exist anymore at all. We are simply trying to close this
> particular race between receiving SCM_CREDENTIALS and reading
> /proc/$PID/comm here, we are not looking for a way to make process names
> trusted.

The issue with all of these proposals is that the sender currently doesn't
know if the receiver needs it. Thus it always has to put it in and you
slow down the fast paths.

e.g. consider

sender sends packet
                                     receiver enables funky option
                                     receiver reads

If it was done lazily you would lose.

Also there are usually various complications with namespaces.

-Andi

* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
  2011-10-06 23:46 ` Andi Kleen
@ 2011-10-07  7:49 ` Matt Helsley
  2011-10-07 16:01   ` Lennart Poettering
  2011-10-07 10:12 ` A Plumber’s Wish List for Linux Alan Cox
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Matt Helsley @ 2011-10-07  7:49 UTC (permalink / raw)
  To: Kay Sievers
  Cc: linux-kernel, lennart, harald, david, greg, Eric W. Biederman

On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:

<snip>

> * simple, reliable and future-proof way to detect whether a specific pid
> is running in a CLONE_NEWPID container, i.e. not in the root PID
> namespace. Currently, there are available a few ugly hacks to detect

Is that precisely what's needed or would it be sufficient to know
that the pid is running in a child pid namespace of the current pid
namespace? If so, I think this could eventually be done by comparing
the inode numbers assigned to /proc/<pid>/ns/pid to those of
/proc/1/ns/pid.
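On kernels that expose /proc/&lt;pid&gt;/ns/pid (these symlinks were not yet available when this thread was written and landed in later kernels), the comparison Matt describes might look like this sketch:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Compare the PID-namespace identity of two processes by stat()ing
 * their /proc/<pid>/ns/pid links and comparing inode (and device)
 * numbers. Returns 1 if they share a PID namespace, 0 if not, and
 * -1 on error (e.g. the kernel predates /proc/<pid>/ns/pid, or we
 * lack permission to inspect the other process). */
int same_pid_ns(pid_t a, pid_t b)
{
    char pa[64], pb[64];
    struct stat sa, sb;

    snprintf(pa, sizeof(pa), "/proc/%d/ns/pid", (int)a);
    snprintf(pb, sizeof(pb), "/proc/%d/ns/pid", (int)b);
    if (stat(pa, &sa) < 0 || stat(pb, &sb) < 0)
        return -1;
    return sa.st_ino == sb.st_ino && sa.st_dev == sb.st_dev;
}
```

As Lennart notes below, this works for comparing two visible processes, but a process inside a namespace cannot stat the outside init, so it does not answer "am I in a container?" on its own.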

> * Add a timerslack cgroup controller, to allow increasing the timer
> slack of user session cgroups when the machine is idle.

There were patches for a timerslack cgroup controller but for some
reason (I don't recall why) they stalled. It might be worth digging
through the containers mailing list archives.

Cheers,
	-Matt Helsley


* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
  2011-10-06 23:46 ` Andi Kleen
  2011-10-07  7:49 ` Matt Helsley
@ 2011-10-07 10:12 ` Alan Cox
  2011-10-07 10:28   ` Kay Sievers
  2011-10-07 12:35 ` Vivek Goyal
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Alan Cox @ 2011-10-07 10:12 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, lennart, harald, david, greg

> * (ioctl based?) interface to query and modify the label of a mounted
> FAT volume:

Seems sensible - or it could go in sysfs ?

> A FAT labels is implemented as a hidden directory entry in the file
> system which need to be renamed when changing the file system label,

That would be ugly - it works for FAT as you can create an imaginary name
which is not possible on the fs, but that isn't true for, say, ext4. Sysfs
sounds the logical way, though it means adding chunks of code to various
file systems.

> * expose CAP_LAST_CAP somehow in the running kernel at runtime:
> Userspace needs to know the highest valid capability of the running
> kernel, which right now cannot reliably be retrieved from header files
> only. The fact that this value cannot be detected properly right now
> creates various problems for libraries compiled on newer header files
> which are run on older kernels. They assume capabilities are available
> which actually aren’t. Specifically, libcap-ng claims that all running
> processes retain the higher capabilities in this case due to the
> “inverted” semantics of CapBnd in /proc/$PID/status.

You can probably deduce this by poking around but to me it seems like a
very sensible idea.

> * allow changing argv[] of a process without mucking with environ[]:
> Something like setproctitle() or a prctl() would be ideal. Of course it
> is questionable if services like sendmail make use of this, but otoh for
> services which fork but do not immediately exec() another binary being
> able to rename this child processes in ps is of importance.

Yes, it's a really valuable tool for r00tkits, worms and general-purpose
deception.


* Re: A Plumber’s Wish List for Linux
  2011-10-07 10:12 ` A Plumber’s Wish List for Linux Alan Cox
@ 2011-10-07 10:28   ` Kay Sievers
  2011-10-07 10:38     ` Alan Cox
  0 siblings, 1 reply; 81+ messages in thread
From: Kay Sievers @ 2011-10-07 10:28 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, lennart, harald, david, greg

On Fri, Oct 7, 2011 at 12:12, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> * (ioctl based?) interface to query and modify the label of a mounted
>> FAT volume:
>
> Seems sensible - or it could go in sysfs ?

That would mean exporting superblocks in /sys, which isn't namespaced,
and might create issues by making information globally available that
probably shouldn't be?

>> A FAT labels is implemented as a hidden directory entry in the file
>> system which need to be renamed when changing the file system label,
>
> That would be ugly - it works for FAT as you can create an imaginary name
> which is not possible on the fs, but that isn't true for say ext4. Sysfs
> sounds the logic way, it means adding chunks of code to various file
> systems.

What do you mean would be ugly?

>> * expose CAP_LAST_CAP somehow in the running kernel at runtime:
>> Userspace needs to know the highest valid capability of the running
>> kernel, which right now cannot reliably be retrieved from header files
>> only. The fact that this value cannot be detected properly right now
>> creates various problems for libraries compiled on newer header files
>> which are run on older kernels. They assume capabilities are available
>> which actually aren’t. Specifically, libcap-ng claims that all running
>> processes retain the higher capabilities in this case due to the
>> “inverted” semantics of CapBnd in /proc/$PID/status.
>
> You can probably deduce this by poking around but to me it seems like a
> very sensible idea.
>
>> * allow changing argv[] of a process without mucking with environ[]:
>> Something like setproctitle() or a prctl() would be ideal. Of course it
>> is questionable if services like sendmail make use of this, but otoh for
>> services which fork but do not immediately exec() another binary being
>> able to rename this child processes in ps is of importance.
>
> Yes, its a real valuable tool for r00tkits, worms and general purpose
> deception.

They can do that already today.  The code to do that just looks really
ugly. So the r00tkits could have nicer looking code. :)

Thanks,
Kay

* Re: A Plumber’s Wish List for Linux
  2011-10-07 10:28   ` Kay Sievers
@ 2011-10-07 10:38     ` Alan Cox
  2011-10-07 12:46       ` Kay Sievers
  2011-10-07 16:07       ` Valdis.Kletnieks
  0 siblings, 2 replies; 81+ messages in thread
From: Alan Cox @ 2011-10-07 10:38 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, lennart, harald, david, greg

On Fri, 7 Oct 2011 12:28:46 +0200
Kay Sievers <kay.sievers@vrfy.org> wrote:

> On Fri, Oct 7, 2011 at 12:12, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> >> * (ioctl based?) interface to query and modify the label of a mounted
> >> FAT volume:
> >
> > Seems sensible - or it could go in sysfs ?
> 
> That would mean to export superblocks in /sys, which isn't namespaced,
> and which might create issues by making information globally available
> which probably shouldn't?

Possibly, otherwise you really need an ioctl on the root inode of the fs
- which is doable, NCPfs makes heavy use of that.
> 
> >> A FAT labels is implemented as a hidden directory entry in the file
> >> system which need to be renamed when changing the file system label,
> >
> > That would be ugly - it works for FAT as you can create an imaginary name
> > which is not possible on the fs, but that isn't true for say ext4. Sysfs
> > sounds the logic way, it means adding chunks of code to various file
> > systems.
> 
> What do you mean would be ugly?

I have an ext4fs. It supports every possible file name allowed by POSIX
and SuS. What name are you going to use for your 'hidden directory' that
won't clash with a real file ?


* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
                   ` (2 preceding siblings ...)
  2011-10-07 10:12 ` A Plumber’s Wish List for Linux Alan Cox
@ 2011-10-07 12:35 ` Vivek Goyal
  2011-10-07 18:59 ` Greg KH
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 81+ messages in thread
From: Vivek Goyal @ 2011-10-07 12:35 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, lennart, harald, david, greg

On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:

[..]
> * fork throttling mechanism as basic cgroup functionality that is
> available in all hierarchies independent of the controllers used:
> This is important to implement race-free killing of all members of a
> cgroup, so that cgroup member processes cannot fork faster then a cgroup
> supervisor process could kill them. This needs to be recursive, so that
> not only a cgroup but all its subgroups are covered as well.

Above should make sense for "freezer" controller too. That will allow us
reliable dynamic migration of tasks in a cgroup by first freezing them,
then change the cgroup and then unfreeze.

Thanks
Vivek

* Re: A Plumber’s Wish List for Linux
  2011-10-07 10:38     ` Alan Cox
@ 2011-10-07 12:46       ` Kay Sievers
  2011-10-07 13:39         ` Theodore Tso
                           ` (2 more replies)
  2011-10-07 16:07       ` Valdis.Kletnieks
  1 sibling, 3 replies; 81+ messages in thread
From: Kay Sievers @ 2011-10-07 12:46 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, lennart, harald, david, greg

[sorry, need to resend. I tried to reply from my cell phone but it bounced]

On Fri, Oct 7, 2011 at 12:38, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> On Fri, 7 Oct 2011 12:28:46 +0200 Kay Sievers <kay.sievers@vrfy.org> wrote:
>
>> What do you mean would be ugly?
>
> I have an ext4fs. It supports every possible file name allowed by POSIX
> and SuS. What name are you going to use for your 'hidden directory' that
> won't clash with a real file ?

Ah, no. The labels on FAT (and similarly on NTFS) are 'magic entries'
in the root dir list, not real files in the root dir.

We need kernel support for changing a mounted fs because, unlike ext4,
the blocks containing the label strings are inside the fs proper, which
the kernel might change at any time.

Kay

* Re: A Plumber’s Wish List for Linux
  2011-10-07 12:46       ` Kay Sievers
@ 2011-10-07 13:39         ` Theodore Tso
  2011-10-07 15:21         ` Hugo Mills
  2011-10-08  9:53         ` A Plumber’s " Bastien ROUCARIES
  2 siblings, 0 replies; 81+ messages in thread
From: Theodore Tso @ 2011-10-07 13:39 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Alan Cox, linux-kernel, lennart, harald, david, greg


On Oct 7, 2011, at 8:46 AM, Kay Sievers wrote:
> On Fri, Oct 7, 2011 at 12:38, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> On Fri, 7 Oct 2011 12:28:46 +0200 Kay Sievers <kay.sievers@vrfy.org> wrote:
>>> What do you mean would be ugly?
>> 
>> I have an ext4fs. It supports every possible file name allowed by POSIX
>> and SuS. What name are you going to use for your 'hidden directory' that
>> won't clash with a real file ?
> 
> Ah, no. The label on FAT (similar on NTFS) are 'magic entries' in the
> root dir list, not a real file in the root dir.
> 
> We need kernel support for changing a mounted fs, because, unlike
> ext4, the blocks containing the strings are inside the fs, which the
> kernel might change any time.

I'd suggest a syscall, not an ioctl, and if a file system has some limitation on what is a valid name (even ext4 has length limitations which might be different from other file systems), we simply return an error if it's not a valid label name.

As it turns out I went to great lengths in both the kernel and userspace implementations of e2label/tune2fs to make sure it would be safe to directly edit the superblock while the file system is mounted, but that depends on implementation details of the buffer cache in the kernel.  Better to have a formally supported interface which is file system independent.

-- Ted



* Re: A Plumber’s Wish List for Linux
  2011-10-07 12:46       ` Kay Sievers
  2011-10-07 13:39         ` Theodore Tso
@ 2011-10-07 15:21         ` Hugo Mills
  2011-10-10 11:18             ` David Sterba
  2011-10-08  9:53         ` A Plumber’s " Bastien ROUCARIES
  2 siblings, 1 reply; 81+ messages in thread
From: Hugo Mills @ 2011-10-07 15:21 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Alan Cox, linux-kernel, lennart, harald, david, greg,
	Chris Mason, Btrfs mailing list

On Fri, Oct 07, 2011 at 02:46:23PM +0200, Kay Sievers wrote:
> On Fri, Oct 7, 2011 at 12:38, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > On Fri, 7 Oct 2011 12:28:46 +0200 Kay Sievers <kay.sievers@vrfy.org> wrote:
> >
> >> What do you mean would be ugly?
> >
> > I have an ext4fs. It supports every possible file name allowed by POSIX
> > and SuS. What name are you going to use for your 'hidden directory' that
> > won't clash with a real file ?
> 
> Ah, no. The label on FAT (similar on NTFS) are 'magic entries' in the
> root dir list, not a real file in the root dir.
> 
> We need kernel support for changing a mounted fs, because, unlike
> ext4, the blocks containing the strings are inside the fs, which the
> kernel might change any time.

   It's worth noting that there are similar issues with btrfs around
changing label. A common API for it would make sense. The only btrfs
patches I've seen to change label after mkfs-time work either as:

 * unmounted only, single underlying device only, pure userspace
   implementation
 * mounted only, multiple underlying devices, kernel support needed

   The kernel-side patches never got integrated, so we're still unable
to change the label on the majority of btrfs filesystems.

   Changing the UUID for the filesystem is even harder, as I think
it's written to every metadata block. I'm not sure we can do that
sanely on a mounted filesystem.

   Hugo (just a spear-carrier from the btrfs chorus).

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- Anyone using a computer to generate random numbers is, of ---    
                       course,  in a state of sin.                       


* Re: A Plumber’s Wish List for Linux
  2011-10-07  1:57     ` Andi Kleen
@ 2011-10-07 15:58       ` Lennart Poettering
  0 siblings, 0 replies; 81+ messages in thread
From: Lennart Poettering @ 2011-10-07 15:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Kay Sievers, linux-kernel, harald, david, greg

On Fri, 07.10.11 03:57, Andi Kleen (andi@firstfloor.org) wrote:

> 
> > Well, I am aware of PR_SET_NAME, but that modifies comm, not argv[]. And
> > while "top" indeed shows the former, "ps" shows the latter. We are looking
> > for a way to nice way to modify argv[] without having to reuse space
> > from environ[] like most current Linux implementations of
> > setproctitle() do.
> 
> It's not clear to me how the kernel could change argv[] any better than you 
> could in user space.

Well, it can resize the argv[] buffer, which we can't right now in
userspace. See the PR_SET_PROCTITLE_AREA patches.

> > Well, it's interesting in the syslog case, and it's OK if people can
> > change it. What matters is that this information is available simply for
> > the informational value. Right now, if one combines SCM_CREDENTIALS and
> > /proc/$PID/comm you often end up with no information about the senders
> > name at all, since at the time you try to read comm the PID might
> > actually not exist anymore at all. We are simply trying to close this
> > particular race between receiving SCM_CREDENTIALS and reading
> > /proc/$PID/comm here, we are not looking for a way to make process names
> > trusted.
> 
> The issue with all of these proposals is that the sender currently doesn't
> know if the receiver needs it. Thus it always has to put it in and you
> slow down the fast paths.
> 
> e.g. consider
> 
> sender sends packet
>                                      receiver enables funky option
>                                      receiver reads
> 
> If it was done lazily you would lose.

Would you? I think it's OK if messages queued before the sockopt is
enabled do not carry the SCM_COMM/SCM_CGROUPS data, even if they are
dequeued after the sockopt is set. At least I wouldn't expect them to
necessarily have the data, and this is probably just a matter of
documentation, i.e. say in the man page explicitly that the control data
will only be attached to newly queued messages. Given that
SCM_COMM/SCM_CGROUPS is a completely new API anyway this should not
create any compatibility problems.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

* Re: A Plumber’s Wish List for Linux
  2011-10-07  7:49 ` Matt Helsley
@ 2011-10-07 16:01   ` Lennart Poettering
  2011-10-08  4:24     ` Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: Lennart Poettering @ 2011-10-07 16:01 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Kay Sievers, linux-kernel, harald, david, greg, Eric W. Biederman

On Fri, 07.10.11 00:49, Matt Helsley (matthltc@us.ibm.com) wrote:

> 
> On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:
> 
> <snip>
> 
> > * simple, reliable and future-proof way to detect whether a specific pid
> > is running in a CLONE_NEWPID container, i.e. not in the root PID
> > namespace. Currently, there are available a few ugly hacks to detect
> 
> Is that precisely what's needed or would it be sufficient to know
> that the pid is running in a child pid namespace of the current pid
> namespace? If so, I think this could eventually be done by comparing
> the inode numbers assigned to /proc/<pid>/ns/pid to those of
> /proc/1/ns/pid.

I think the most interesting test would be for a process to figure out
whether it itself is running in a PID namespace. And for that, comparing
inodes wouldn't work, since the namespaced process would never get
access to the inode of the outside init.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

* Re: A Plumber’s Wish List for Linux
  2011-10-07 10:38     ` Alan Cox
  2011-10-07 12:46       ` Kay Sievers
@ 2011-10-07 16:07       ` Valdis.Kletnieks
  1 sibling, 0 replies; 81+ messages in thread
From: Valdis.Kletnieks @ 2011-10-07 16:07 UTC (permalink / raw)
  To: Alan Cox; +Cc: Kay Sievers, linux-kernel, lennart, harald, david, greg

On Fri, 07 Oct 2011 11:38:20 BST, Alan Cox said:

> > What do you mean would be ugly?
> 
> I have an ext4fs. It supports every possible file name allowed by POSIX
> and SuS. What name are you going to use for your 'hidden directory' that
> won't clash with a real file ?

ext4 could always use an attribute bit for that.  Not that *that* solution is all that
much prettier, since you can't use it for filesystems that don't have attribute bits.

* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
                   ` (3 preceding siblings ...)
  2011-10-07 12:35 ` Vivek Goyal
@ 2011-10-07 18:59 ` Greg KH
  2011-10-09 12:20   ` Kay Sievers
  2011-10-09  8:45 ` Rusty Russell
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 81+ messages in thread
From: Greg KH @ 2011-10-07 18:59 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, lennart, harald, david

On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:
> * CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:
> useful to allow module auto-loading of e.g. cpufreq drivers and KVM
> modules. Andy Kleen has a patch to create the alias file itself. CPU
> ‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct
> bus_type cpu’ needs to be introduced to allow proper CPU coldplug event
> replay at bootup. This is one of the last remaining places where
> automatic hardware-triggered module auto-loading is not available. And
> we’d like to see that fix to make numerous ugly userspace work-arounds
> to achieve the same go away.

I need to get off my ass and fix this properly, now that Rafael has done
all of the hard work for sysdev already.  Thanks for reminding me.

> * export ‘struct device_type fb/fbcon’ of ‘struct class graphics’
> Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
> without the need to match on the device name.

Can't we just export a "type" file for the device for these devices?
Is it really just that simple?

> * module-init-tools: provide a proper libmodprobe.so from
> module-init-tools:
> Early boot tools, installers, driver install disks want to access
> information about available modules to optimize bootup handling.

What information do they want to know?

> * allow user xattrs to be set on files in the cgroupfs (and maybe
> procfs?)

This shouldn't be that difficult, right?

Thanks for the list, much appreciated.

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-07 16:01   ` Lennart Poettering
@ 2011-10-08  4:24     ` Eric W. Biederman
  2011-10-10 16:31       ` Lennart Poettering
  0 siblings, 1 reply; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-08  4:24 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg

Lennart Poettering <mzxreary@0pointer.de> writes:

> On Fri, 07.10.11 00:49, Matt Helsley (matthltc@us.ibm.com) wrote:
>
>> 
>> On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:
>> 
>> <snip>
>> 
>> > * simple, reliable and future-proof way to detect whether a specific pid
>> > is running in a CLONE_NEWPID container, i.e. not in the root PID
>> > namespace. Currently, only a few ugly hacks are available to detect
>> 
>> Is that precisely what's needed or would it be sufficient to know
>> that the pid is running in a child pid namespace of the current pid
>> namespace? If so, I think this could eventually be done by comparing
>> the inode numbers assigned to /proc/<pid>/ns/pid to those of
>> /proc/1/ns/pid.
>
> I think the most interesting test would be to figure out for a process
> if itself is running in a PID namespace. And for that comparing inodes
> wouldn't work since the namespace process would never get access to the
> inode of the outside init.

Strictly correct answer.  All processes are running in a pid namespace.
I think we can implement that in a libc header.

static inline bool in_pid_namespace(void)
{
        return true;
}

Why does it matter if you are running in something other than the
initial pid namespace?  I expect what you are really after is something
else entirely, and you are asking the wrong question.

Eric

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-07 12:46       ` Kay Sievers
  2011-10-07 13:39         ` Theodore Tso
  2011-10-07 15:21         ` Hugo Mills
@ 2011-10-08  9:53         ` Bastien ROUCARIES
  2011-10-09  3:15           ` Alex Elsayed
  2 siblings, 1 reply; 81+ messages in thread
From: Bastien ROUCARIES @ 2011-10-08  9:53 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Alan Cox, linux-kernel, lennart, harald, david, greg

On Fri, Oct 7, 2011 at 2:46 PM, Kay Sievers <kay.sievers@vrfy.org> wrote:
> [sorry, need to resend. I tried to reply from the cell phone but it bounced]
>
> On Fri, Oct 7, 2011 at 12:38, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> On Fri, 7 Oct 2011 12:28:46 +0200 Kay Sievers <kay.sievers@vrfy.org> wrote:
>>
>>> What do you mean would be ugly?
>>
>> I have an ext4fs. It supports every possible file name allowed by POSIX
>> and SuS. What name are you going to use for your 'hidden directory' that
>> won't clash with a real file ?
>
> Ah, no. The label on FAT (similarly on NTFS) is a 'magic entry' in the
> root dir list, not a real file in the root dir.

Why not using a special xattr namespace ?

Bastien
> We need kernel support for changing a mounted fs, because, unlike
> ext4, the blocks containing the strings are inside the fs, which the
> kernel might change at any time.
>
> Kay
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-08  9:53         ` A Plumber’s " Bastien ROUCARIES
@ 2011-10-09  3:15           ` Alex Elsayed
  0 siblings, 0 replies; 81+ messages in thread
From: Alex Elsayed @ 2011-10-09  3:15 UTC (permalink / raw)
  To: linux-kernel

Bastien ROUCARIES <roucaries.bastien <at> gmail.com> writes:

> 
> On Fri, Oct 7, 2011 at 2:46 PM, Kay Sievers <kay.sievers <at> vrfy.org> wrote:
> > On Fri, Oct 7, 2011 at 12:38, Alan Cox <alan <at> lxorguk.ukuu.org.uk> wrote:
> >> On Fri, 7 Oct 2011 12:28:46 +0200 Kay Sievers <kay.sievers <at> vrfy.org>
wrote:
> >>> What do you mean would be ugly?
> >>
> >> I have an ext4fs. It supports every possible file name allowed by POSIX
> >> and SuS. What name are you going to use for your 'hidden directory' that
> >> won't clash with a real file ?
> >
> > Ah, no. The label on FAT (similarly on NTFS) is a 'magic entry' in the
> > root dir list, not a real file in the root dir.
> 
> Why not using a special xattr namespace ?
> 
> Bastien

All of you are completely misconstruing what was said. He was NOT suggesting
magic entries as an interface to change the label. He was noting that the FAT
filesystem IMPLEMENTS its labels as a magic entry, which cannot be safely
altered from userspace on a mounted FS, necessitating help from the kernel.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
                   ` (4 preceding siblings ...)
  2011-10-07 18:59 ` Greg KH
@ 2011-10-09  8:45 ` Rusty Russell
  2011-10-11 23:16 ` Andrew Morton
  2011-10-19 21:12 ` Paul Menage
  7 siblings, 0 replies; 81+ messages in thread
From: Rusty Russell @ 2011-10-09  8:45 UTC (permalink / raw)
  To: Kay Sievers, linux-kernel; +Cc: lennart, harald, david, greg, Jon Masters

On Fri, 07 Oct 2011 01:17:02 +0200, Kay Sievers <kay.sievers@vrfy.org> wrote:
> * module-init-tools: provide a proper libmodprobe.so from
> module-init-tools:
> Early boot tools, installers, driver install disks want to access
> information about available modules to optimize bootup handling.

That's a bit too vague for my limited experience and/or lack of
imagination: what exactly do they want?  And why?

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-07 18:59 ` Greg KH
@ 2011-10-09 12:20   ` Kay Sievers
  0 siblings, 0 replies; 81+ messages in thread
From: Kay Sievers @ 2011-10-09 12:20 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, lennart, harald, david

On Fri, Oct 7, 2011 at 20:59, Greg KH <greg@kroah.com> wrote:
> On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:
>> * CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:

> I need to get off my ass and fix this properly, now that Rafael has done
> all of the hard work for sysdev already.  Thanks for reminding me.
>
>> * export ‘struct device_type fb/fbcon’ of ‘struct class graphics’
>> Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other
>> without the need to match on the device name.
>
> Can't we just export a "type" file for the device for these devices?
> Is it really just that simple?

Yeah, it's just adding a 'struct device_type' with 'name = "fb"' or
'name = "fbcon"', and DEVTYPE= will then appear as a property of the
device. So much for getting off my ass. :)

>> * module-init-tools: provide a proper libmodprobe.so from
>> module-init-tools:
>> Early boot tools, installers, driver install disks want to access
>> information about available modules to optimize bootup handling.
>
> What information do they want to know?

Resolve the alias database that 'depmod' has created from inside any
process. Udev wants to avoid calling ~60 modprobes per bootup for a
bunch of device types like USB hubs which will never have a driver to
load (optimization). Also, the installer and module-update tools
sometimes want to query the list of things to load before running all
the magic asynchronously (fewer hacks).

In general, the command-line-tool style of doing complex system
software does not really fit any more into the way we need to do
things today. We need proper libraries in the background that can be
used by whatever thing needs the information, and the tools we
already have should just be users of their own libraries. We need a
strict separation of policy and mechanics. Other users should be able
to use the 'mechanics' of a tool without executing any 'policy'.
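
As a sketch of that "mechanics as a library" idea, the alias resolution a libmodprobe.so could expose might look roughly like this. Everything here is hypothetical (no such API exists in module-init-tools); a small static table stands in for the modules.alias database depmod generates, and fnmatch(3) stands in for the real pattern matcher:

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical library-style lookup.  A real libmodprobe.so would load
 * and index the depmod-generated modules.alias file instead of using a
 * compiled-in table like this one. */
struct mod_alias {
        const char *pattern;    /* shell-glob pattern, as in modules.alias */
        const char *module;     /* module to load on a match */
};

static const struct mod_alias alias_db[] = {
        { "usb:v*p*d*dc09dsc*", "usbcore" },  /* invented hub-class entry */
        { "pci:v00008086d*",    "e1000e"  },  /* invented NIC entry */
};

/* Resolve a device's MODALIAS string to a module name, or NULL when no
 * driver matches -- the case where udev could skip spawning modprobe. */
static const char *modalias_resolve(const char *modalias)
{
        for (size_t i = 0; i < sizeof(alias_db) / sizeof(alias_db[0]); i++)
                if (fnmatch(alias_db[i].pattern, modalias, 0) == 0)
                        return alias_db[i].module;
        return NULL;
}
```

With such a call in-process, udev could consult the database once instead of forking one modprobe per uevent.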

>> * allow user xattrs to be set on files in the cgroupfs (and maybe
>> procfs?)
>
> This shouldn't be that difficult, right?

It shouldn't. We just need to be careful here about what to export,
when to use it, and not to create problems and information leaks for
namespaces, which might re-use some of the mount points.
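
The call sites would be plain setxattr(2)/getxattr(2) in the user.* namespace, as sketched below. The attribute name is invented for illustration; on cgroupfs of that era the setxattr would simply fail with EOPNOTSUPP, which is the missing piece being requested:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Tag a cgroup directory (or any file) with a user xattr.  The name
 * "user.mgr_owner" is made up for this sketch; the point is only that
 * user.* attributes are what the wish asks cgroupfs to accept.
 * Returns 0 on success, -errno on failure (-EOPNOTSUPP on filesystems
 * without user-xattr support). */
static int tag_owner(const char *path, const char *value)
{
        if (setxattr(path, "user.mgr_owner", value, strlen(value), 0) < 0)
                return -errno;
        return 0;
}

/* Read the tag back; returns bytes copied or -errno. */
static ssize_t read_owner(const char *path, char *buf, size_t len)
{
        ssize_t n = getxattr(path, "user.mgr_owner", buf, len);

        return n < 0 ? -errno : n;
}
```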

Kay

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-07 15:21         ` Hugo Mills
@ 2011-10-10 11:18             ` David Sterba
  0 siblings, 0 replies; 81+ messages in thread
From: David Sterba @ 2011-10-10 11:18 UTC (permalink / raw)
  To: Hugo Mills, Kay Sievers, Alan Cox, linux-kernel, lennart, harald

On Fri, Oct 07, 2011 at 04:21:37PM +0100, Hugo Mills wrote:
> On Fri, Oct 07, 2011 at 02:46:23PM +0200, Kay Sievers wrote:
> > On Fri, Oct 7, 2011 at 12:38, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > On Fri, 7 Oct 2011 12:28:46 +0200 Kay Sievers <kay.sievers@vrfy.org> wrote:
> > >
> > >> What do you mean would be ugly?
> > >
> > > I have an ext4fs. It supports every possible file name allowed by POSIX
> > > and SuS. What name are you going to use for your 'hidden directory' that
> > > won't clash with a real file ?
> > 
> > Ah, no. The label on FAT (similarly on NTFS) is a 'magic entry' in the
> > root dir list, not a real file in the root dir.
> > 
> > We need kernel support for changing a mounted fs, because, unlike
> > ext4, the blocks containing the strings are inside the fs, which the
> > kernel might change at any time.
> 
>    It's worth noting that there are similar issues with btrfs around
> changing label. A common API for it would make sense. The only btrfs
> patches I've seen to change label after mkfs-time work either as:
> 
>  * unmounted only, single underlying device only, pure userspace
>    implementation
>  * mounted only, multiple underlying devices, kernel support needed
> 
>    The kernel-side patches never got integrated, so we're still unable
> to change the label on the majority of btrfs filesystems.
> 
>    Changing the UUID for the filesystem is even harder, as I think
> it's written to every metadata block. I'm not sure we can do that
> sanely on a mounted filesystem.

http://marc.info/?l=linux-btrfs&m=131161949201880&w=2

"Resetting the UUID on btrfs isn't a quick-and-easy thing - you have to
walk the entire tree and change every object. We've got a bad-hack in
meego that uses btrfs-debug-tree and changes the UUID while it runs
the entire tree, but it's ugly as hell."

That's on an unmounted fs. Doing it on a mounted one seems more
complicated with respect to the intermediate state, when there are
some blocks with the old and some blocks with the new UUID. The
operation will take a long time, and I don't know if it's better to do
it in batches (and follow the usual rules for committing a transaction
every now and then), or in one go (requires: no failures, no scrub
runs, no devices added/removed). Counting all the potential problems
and the practical unusability of the FS during a UUID change, the
off-line approach seems the better way to go.


david

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-10 11:18             ` David Sterba
  (?)
@ 2011-10-10 13:09             ` Theodore Tso
  2011-10-13  0:28               ` Dave Chinner
  -1 siblings, 1 reply; 81+ messages in thread
From: Theodore Tso @ 2011-10-10 13:09 UTC (permalink / raw)
  To: dave
  Cc: Theodore Tso, Hugo Mills, Kay Sievers, Alan Cox, linux-kernel,
	lennart, harald, david, greg, Chris Mason, Btrfs mailing list


On Oct 10, 2011, at 7:18 AM, David Sterba wrote:

> "Resetting the UUID on btrfs isn't a quick-and-easy thing - you have to
> walk the entire tree and change every object. We've got a bad-hack in
> meego that uses btrfs-debug-tree and changes the UUID while it runs
> the entire tree, but it's ugly as hell."

Changing the UUID is going to be harder for ext4 as well, once we integrate metadata checksums.   So while it makes sense to have on-line ways of updating labels for mounted file systems, it probably makes much less sense to support it for UUIDs.

I suspect what it means in practice is that it will be useful for file systems to provide fs image copying tools that also generate a new UUID while you're at it, for use by IT administrators and embedded systems manufacturers.

-- Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-08  4:24     ` Eric W. Biederman
@ 2011-10-10 16:31       ` Lennart Poettering
  2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: Lennart Poettering @ 2011-10-10 16:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg

On Fri, 07.10.11 21:24, Eric W. Biederman (ebiederm@xmission.com) wrote:

> 
> Lennart Poettering <mzxreary@0pointer.de> writes:
> 
> > On Fri, 07.10.11 00:49, Matt Helsley (matthltc@us.ibm.com) wrote:
> >
> >> 
> >> On Fri, Oct 07, 2011 at 01:17:02AM +0200, Kay Sievers wrote:
> >> 
> >> <snip>
> >> 
> >> > * simple, reliable and future-proof way to detect whether a specific pid
> >> > is running in a CLONE_NEWPID container, i.e. not in the root PID
> >> > namespace. Currently, only a few ugly hacks are available to detect
> >> 
> >> Is that precisely what's needed or would it be sufficient to know
> >> that the pid is running in a child pid namespace of the current pid
> >> namespace? If so, I think this could eventually be done by comparing
> >> the inode numbers assigned to /proc/<pid>/ns/pid to those of
> >> /proc/1/ns/pid.
> >
> > I think the most interesting test would be to figure out for a process
> > if itself is running in a PID namespace. And for that comparing inodes
> > wouldn't work since the namespace process would never get access to the
> > inode of the outside init.
> 
> Strictly correct answer.  All processes are running in a pid namespace.
> I think we can implement that in a libc header.
> 
> static inline bool in_pid_namespace(void)
> {
>         return true;
> }
> 
> Why does it matter if you are running in something other than the
> initial pid namespace?  I expect what you are really after is something
> else entirely, and you are asking the wrong question.

Well, all other virtualization solutions are easily detectable via CPUID
leaf 0x1, bit 31, and via DMI and some other ways. However, for Linux
containers there is no nice way to detect them.
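
For comparison, the VM check Lennart refers to is only a couple of instructions. This is an x86-only sketch: CPUID leaf 0x1 reports the "hypervisor present" bit in ECX bit 31, and since a container runs on the bare host CPU the bit stays 0 there, which is exactly the gap being described:

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 0x1 sets ECX bit 31 when running under a hypervisor.
 * Containers share the host kernel and CPU, so the bit stays 0 for
 * them -- hence the need for a separate container indicator. */
static int running_under_hypervisor(void)
{
#if defined(__i386__) || defined(__x86_64__)
        uint32_t eax, ebx, ecx, edx;

        __asm__ volatile("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(1), "c"(0));
        return !!(ecx & (1u << 31));
#else
        return 0;       /* the CPUID trick does not exist elsewhere */
#endif
}
```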

VMs are pretty good at providing a comprehensive emulation of real
machines, and distributions running in them usually do not need
information whether they are running in a VM or not. This is very
different though for containers: Quite a few kernel subsystems are
currently not virtualized, for example SELinux, VTs, most of sysfs, most
of /proc/sys, audit, udev or file systems (by which I mean that for a
container you probably don't want to fsck the root fs, and so on), and
containers tend to be much more lightweight than real systems.

To make a standard distribution run nicely in a Linux container you
usually have to make quite a number of modifications to it and disable
certain things from the boot process. Ideally however, one could simply
boot the same image on a real machine and in a container and would just
do the right thing, fully stateless. And for that you need to be able to
detect containers, and currently you can't.

Of course, in 10 years or so containers might be much more complete than
they are right now, and virtualize all the subsystems I listed above and
maybe a ton more, but that's 10 years from now, and for now, to make
things work as cleanly as possible, it would be immensely helpful if
containers could be detected in a nice way.

Of course, in many cases there are nicer ways to shortcut the init jobs
in a container. For example, instead of bypassing the root fsck in a
container it makes a lot more sense to simply say: bypass the root fsck if
the root fs is already writable. And there's more like that. But at the
end of the day you always want to be able to bind certain things to the
fact that you are running in a container, if you want things to "just
work". And I believe that must be the goal.

I am pretty sure that having a way to detect execution in a container is
a minimum requirement to get general purpose distribution makers to
officially support and care for execution in container environments. As
you are a container guy I am sure that would be very much in your
interest.

And note that I am only interested in detecting CLONE_NEWPID, not the
other namespaces. CLONE_NEWPID is the core namespace technology that
turns a container into a container, so that's all that's needed.

And yes, CLONE_NEWPID can be useful for other purposes than just
containers as well. However, that doesn't really matter for my use case
as mentioned above: because if you run an init system in a CLONE_NEWPID
namespace, then that's what I call a container, and the init system
should have every right to detect that.

The root PID namespace is different from all other namespaces btw,
if only in the fact that the kernel threads are part of it but not of
the other namespaces.

Finally, note that it previously was very easy to detect execution
in a container, simply by checking the "ns" cgroup hierarchy (i.e. if
the path in /proc/self/cgroup for "ns" wasn't "/", you knew you were
in a container). systemd made use of that, and since very early on we
supported container boots. The removal of "ns" broke systemd in that
regard. Now, I don't want "ns" back, and I am not going to make a big
hubbub out of the fact that you guys broke userspace that way. But
what I would like to see made available again is a sane way to detect
execution in a container environment, i.e. a way for a process to detect
whether it is running in the root CLONE_NEWPID namespace.
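
The historical check described here amounts to a few lines of /proc parsing. This sketch simplifies by assuming the "ns" controller sits alone in its hierarchy; on kernels after its removal (i.e. anything current) the function just returns 0, which is the breakage being complained about:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan /proc/self/cgroup for an "ns" hierarchy whose path is not "/".
 * Lines have the form "<id>:<controllers>:<path>".  Returns 1 if the
 * old ns-cgroup check would have said "container", 0 otherwise, and
 * -1 on error.  (Simplification: assumes "ns" is mounted on its own
 * hierarchy, not co-mounted with other controllers.) */
static int in_ns_cgroup_container(void)
{
        FILE *f = fopen("/proc/self/cgroup", "re");
        char line[512];
        int found = 0;

        if (!f)
                return -1;

        while (fgets(line, sizeof(line), f)) {
                char *p = strstr(line, ":ns:");

                /* p[4] is the first path character; a bare "/" (then
                 * newline/NUL) means root, anything longer means we
                 * are inside an ns cgroup, i.e. a container. */
                if (p && p[4] == '/' && p[5] != '\n' && p[5] != '\0')
                        found = 1;
        }

        fclose(f);
        return found;
}
```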

Thanks,

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Detecting if you are running in a container
  2011-10-10 16:31       ` Lennart Poettering
@ 2011-10-10 20:59         ` Eric W. Biederman
  2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  1:32           ` Ted Ts'o
  0 siblings, 2 replies; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-10 20:59 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage


Cc's and subject updated so hopefully we get the correct people
on this discussion to make progress.

Lennart Poettering <mzxreary@0pointer.de> writes:

> To make a standard distribution run nicely in a Linux container you
> usually have to make quite a number of modifications to it and disable
> certain things from the boot process. Ideally however, one could simply
> boot the same image on a real machine and in a container and would just
> do the right thing, fully stateless. And for that you need to be able to
> detect containers, and currently you can't.

I agree getting to the point where we can run a standard distribution
unmodified in a container sounds like a reasonable goal.

> Quite a few kernel subsystems are
> currently not virtualized, for example SELinux, VTs, most of sysfs, most
> of /proc/sys, audit, udev or file systems (by which I mean that for a
> container you probably don't want to fsck the root fs, and so on), and
> containers tend to be much more lightweight than real systems.

That is an interesting viewpoint on what is not complete.  But as a
listing of the tasks that distribution startup needs to do differently in
a container the list seems more or less reasonable.

There are two questions 
- How in the general case do we detect if we are running in a container.
- How do we make reasonable tests during bootup to see if it makes sense
  to perform certain actions.

For the general detection if we are running in a linux container I can
see two reasonable possibilities.

- Put a file in / that lets you know by convention that you are in a
  linux container.  I am inclined to do this because this is something
  we can support on all kernels, old and new.

- Allow modification of the output of uname(2).  The uts namespace
  already covers uname(2), and uname is the standard method to
  communicate to userspace the vagaries of the OS-level environment
  they are running in.


My list of things that still have work left to do looks like:
- cgroups.  It is not safe to create new hierarchies with groups
  that are in existing hierarchies.  So cgroups don't work.

- user namespace.  We are very close to have something workable
  on this one, but until we do all of the users inside and outside
  of a container are the same, and pass the same permission checks.

  As a result we have to drop most of roots privileges, and we have
  to be a bit careful what binaries that can gain privileges (think suid
  root) are in the container filesystem.

- Reboot.  I know Daniel was working on something not long ago
  but I am not certain where he wound up.

- device namespaces.  We periodically think about having a separate
  set of devices and to support things like losetup in a container
  that seems necessary.  Most of the time getting all of the way
  to device namespaces seems unnecessary.


As for tests on what to startup.

- udev.  All of the kernel interfaces for udev should be supported in
  current kernels.  However, I believe udev is useless there because
  container startup drops CAP_MKNOD, so we can't do evil things.  So I
  would recommend basing the startup of udev on the presence of CAP_MKNOD.

- VTs.  Ptys should be well supported at this point.  For the rest,
  they are physical hardware that a container should not be playing
  with, so I would decide which gettys to start based on which device
  nodes are present in /dev.

- sysctls (aka /proc/sys).  That is a tricky one.  Until the user
  namespace is fleshed out a little more, sysctls are going to be a
  problem, because root can write to most of them.  My gut feeling says
  you probably want to base the decision to poke at sysctls on
  CAP_SYS_ADMIN.  At least that test will become true when user
  namespaces are rolled out, and at that point you will want to set all
  of the sysctls you have permission to.

- audit.  My memory is very fuzzy on this one.  The issue in question
  is: should we start auditd?  I believe the audit calls actually fail
  in a container, so we should be able to trigger starting auditd based
  on whether audit works at all.  If we can't do it that way, the work
  should certainly be put in so that it can be done that way.

- fsck.  A rw filesystem check like you mentioned earlier seems like a
  reasonable place to be.  I know the OpenVZ folks were talking about
  putting containers in their own block devices for their next round of
  supporting containers, at which point a filesystem check on container
  startup might not be a bad idea at all.

- cgroups hierarchies.  I don't know at which point in the system
  startup we care.  The appropriate solution would seem to be to try
  it and if the operation fails figure it isn't supported.

- selinux.  It really should be in the same category.  You should be
  able to attempt to load a policy and have it fail in a way that
  indicates that selinux is not currently supported.  I don't know if
  we can make that work right until we get the user namespace into
  a usable shape.
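
The CAP_MKNOD probe suggested for udev above can be done with a raw capget(2) call against the kernel UAPI header (a sketch; libcap's cap_get_proc() would be the friendlier route in real code):

```c
#include <assert.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Probe the effective capability set for CAP_MKNOD via raw capget(2).
 * Returns 1 if present, 0 if dropped (as at container start), -1 on
 * error.  This is the kind of "does the feature work here?" test a
 * udev init script could use instead of asking "am I in a container?". */
static int have_cap_mknod(void)
{
        struct __user_cap_header_struct hdr = {
                .version = _LINUX_CAPABILITY_VERSION_3,
                .pid = 0,               /* 0 = current task */
        };
        struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = {{0}};

        if (syscall(SYS_capget, &hdr, data) < 0)
                return -1;

        return !!(data[CAP_TO_INDEX(CAP_MKNOD)].effective
                  & CAP_TO_MASK(CAP_MKNOD));
}
```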

In general things in a container should work or the kernel feature
should fail in a way that indicates that the feature is not supported.
That currently works well for the networking stack, and with the
pending usability of the user namespace it should work just about
everywhere else as well.  For things that don't fit that model we
need to fix the kernel.

So while I agree a check to see if something is a container seems
reasonable, I do not agree that the pid namespace is the place to put
that information.  I see no natural way to put that information in the
pid namespace.

I further think there are a lot of reasonable checks for whether a
kernel feature is supported in the current environment that I would
rather pursue than hacks based on the fact that we are in a container.

Eric

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
@ 2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  5:40             ` Eric W. Biederman
                               ` (2 more replies)
  2011-10-11  1:32           ` Ted Ts'o
  1 sibling, 3 replies; 81+ messages in thread
From: Lennart Poettering @ 2011-10-10 21:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

> > Quite a few kernel subsystems are
> > currently not virtualized, for example SELinux, VTs, most of sysfs, most
> > of /proc/sys, audit, udev or file systems (by which I mean that for a
> > container you probably don't want to fsck the root fs, and so on), and
> > containers tend to be much more lightweight than real systems.
> 
> That is an interesting viewpoint on what is not complete.  But as a
> listing of the tasks that distribution startup needs to do differently in
> a container the list seems more or less reasonable.

Note that this is just what came to my mind while I was typing; I am
quite sure there's actually more like this.

> There are two questions 
> - How in the general case do we detect if we are running in a container.
> - How do we make reasonable tests during bootup to see if it makes sense
>   to perform certain actions.
> 
> For the general detection if we are running in a linux container I can
> see two reasonable possibilities.
> 
> - Put a file in / that let's you know by convention that you are in a
>   linux container.  I am inclined to do this because this is something
>   we can support on all kernels old and new.

Hmpf. That would break the stateless read-only-ness of the root dir.

After I pointed the issue out to the LXC folks, they started setting
"container=lxc" as an env var when spawning a container. In
systemd-nspawn I have since adopted a similar scheme. I'm not sure that
is particularly nice, however, since env vars are inherited further
down the process tree where we probably don't want them.

In case you are curious: this is the code we use in systemd:

http://cgit.freedesktop.org/systemd/tree/src/virt.c
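
In that scheme, detection on the init side reduces to an environment lookup. A sketch (the variable name "container" is the convention just described; its values, e.g. "lxc", are chosen by the spawning tool):

```c
#include <assert.h>
#include <stdlib.h>

/* Return the container implementation name ("lxc", "systemd-nspawn",
 * ...) if the spawning tool set the agreed-on "container" variable,
 * or NULL when nothing was passed.  Note that NULL is also what you
 * get on bare hardware, so absence is not proof of a real machine --
 * which is why userspace-set variables are an unsatisfying answer. */
static const char *detect_container_env(void)
{
        const char *e = getenv("container");

        return (e && *e) ? e : NULL;
}
```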

What matters to me though is that we can generically detect Linux
containers instead of specific implementations.

> - Allow modification to the output of uname(2).  The uts namespace
>   already covers uname(2) and uname is the standard method to
>   communicate to userspace the vagaries about the OS level environment
>   they are running in.

Well, I am not a particular fan of having userspace tell userspace about
containers. I would prefer if userspace could get that info from the
kernel without any explicit agreement to set some specific variable.

That said, detecting CLONE_NEWUTS by looking at the output of uname(2)
would be a workable solution for us. CLONE_NEWPID and CLONE_NEWUTS are
probably equally defining for what a container is, so I'd be happy if
we could detect either.

For example, if the kernel would append "(container)" or so to
utsname.machine[] after CLONE_NEWUTS is used I'd be quite happy.

> My list of things that still have work left to do looks like:
> - cgroups.  It is not safe to create a new hierarchies with groups
>   that are in existing hierarchies.  So cgroups don't work.

Well, for systemd they actually work quite fine since systemd will
always place its own cgroups below the cgroup it is started in. cgroups
hence make these things nicely stackable.

In fact, most folks involved in cgroups userspace have agreed to these
rules now:

http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

Among other things they ask all userspace code to only create subgroups
below the group it is started in, so not only systemd but everything
else following these rules should work fine in a container environment.

In other words: so far one gets away quite nicely with the fact that the
cgroup tree is not virtualized.

> - device namespaces.  We periodically think about having a separate
>   set of devices and to support things like losetup in a container
>   that seems necessary.  Most of the time getting all of the way
>   to device namespaces seems unnecessary.

Well, I am sure people use containers in all kinds of weird ways, but
for me personally I am quitre sure that containers should live in a
fully virtualized world and never get access to real devices.

> As for tests on what to startup.

Note again that my list above is not complete at all, and the point I
was trying to make is that while you can find nice hooks for this in
many cases, at the end of the day you actually do want to detect
containers for a few specific cases.

> - udev.  All of the kernel interfaces for udev should be supported in
>   current kernels.  However I believe udev is useless because container
>   start drops CAP_MKNOD so we can't do evil things.  So I would
>   recommend basing the startup of udev on presence of CAP_MKNOD.

Using CAP_MKNOD as a test here is indeed a good idea. I'll make sure
udev in a systemd world makes use of that.

> - VTs.  Ptys should be well supported at this point.  For the rest
>   they are physical hardware that a container should not be playing with
>   so I would base which gettys to start up based on which device nodes
>   are present in /dev.

Well, I am not sure it's that easy since device nodes tend to show up
dynamically in bare systems. So if you just check whether /dev/tty0 is
there you might end up thinking you are in a container when you actually
aren't simply because you did that check before udev loaded the DRI
driver or so.

> - sysctls (aka /proc/sys) that is a tricky one.  Until the user namespace
>   is fleshed out a little more sysctls are going to be a problem,
>   because root can write to most of them.  My gut feel says you probably
>   want to base the decision to poke at sysctls on CAP_SYS_ADMIN.  At least that
>   test will become true when the user namespaces are rolled out, and at
>   that point you will want to set all of the sysctls you have permission
>   to.

So what we did right now in systemd-nspawn is that the container
supervisor premounts /proc/sys read-only into the container. That way
writes to it will fail in the container, and while you get a number of
warnings things will work as they should (though not necessarily safely
since the container can still remount the fs unless you take
CAP_SYS_ADMIN away).

> - selinux.  It really should be in the same category.  You should be
>   able to attempt to load a policy and have it fail in a way that
>   indicates that selinux is currently supported.  I don't know if
>   we can make that work right until we get the user namespace into
>   a usable shape.

The SELinux folks modified libselinux on my request to consider selinux
off if /sys/fs/selinux is already mounted and read-only. That means with
a new container userspace this problem is mostly worked around too. It
is crucial to make libselinux know that selinux is off because otherwise
it will continue to muck with the xattr labels where it shouldn't. If
you want to fully virtualize this you probably should hide the SELinux
xattrs entirely in the container.

> So while I agree a check to see if something is a container seems
> reasonable.  I do not agree that the pid namespace is the place to put
> that information.  I see no natural way to put that information in the
> pid namespace.

Well, a simple way would be to have a line /proc/1/status called
"PIDNamespaceLevel:" or so which would be 0 for the root namespace, and
increased for each namespace nested in it. Then, processes could simply
read that and be happy.

> I further think there are a lot of reasonable checks for if a
> kernel feature is supported in the current environment I would
> rather pursue over hacks based on the fact we are in a container.

Well, believe me, we have been trying to find nicer hooks than explicit
checks for containers, but I am quite sure that at the end of the day
you won't be able to go without them entirely.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: Detecting if you are running in a container
  2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
  2011-10-10 21:41           ` Lennart Poettering
@ 2011-10-11  1:32           ` Ted Ts'o
  2011-10-11  2:05             ` Matt Helsley
  1 sibling, 1 reply; 81+ messages in thread
From: Ted Ts'o @ 2011-10-11  1:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Lennart Poettering, Matt Helsley, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
> Lennart Poettering <mzxreary@0pointer.de> writes:
> 
> > To make a standard distribution run nicely in a Linux container you
> > usually have to make quite a number of modifications to it and disable
> > certain things from the boot process. Ideally however, one could simply
> > boot the same image on a real machine and in a container and would just
> > do the right thing, fully stateless. And for that you need to be able to
> > detect containers, and currently you can't.
> 
> I agree getting to the point where we can run a standard distribution
> unmodified in a container sounds like a reasonable goal.

Hmm, interesting.  It's not clear to me that running a full standard
distribution in a container is always going to be what everyone wants
to do.

The whole point of containers versus VM's is that containers are
lighter weight.  And one of the ways that containers can be lighter
weight is if you don't have to have N copies of udev, dbus, running in
each container/VM.

If you end up with so much overhead to provide the desired security
and/or performance isolation, then it becomes fair to ask the question
whether you might as well pay a tad bit more and get even better
security and isolation by using a VM solution....

	     	       	  	     - Ted


* Re: Detecting if you are running in a container
  2011-10-11  1:32           ` Ted Ts'o
@ 2011-10-11  2:05             ` Matt Helsley
  2011-10-11  3:25               ` Ted Ts'o
  2011-10-11 22:25               ` david
  0 siblings, 2 replies; 81+ messages in thread
From: Matt Helsley @ 2011-10-11  2:05 UTC (permalink / raw)
  To: Ted Ts'o, Eric W. Biederman, Lennart Poettering,
	Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 09:32:01PM -0400, Ted Ts'o wrote:
> On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
> > Lennart Poettering <mzxreary@0pointer.de> writes:
> > 
> > > To make a standard distribution run nicely in a Linux container you
> > > usually have to make quite a number of modifications to it and disable
> > > certain things from the boot process. Ideally however, one could simply
> > > boot the same image on a real machine and in a container and would just
> > > do the right thing, fully stateless. And for that you need to be able to
> > > detect containers, and currently you can't.
> > 
> > I agree getting to the point where we can run a standard distribution
> > unmodified in a container sounds like a reasonable goal.
> 
> Hmm, interesting.  It's not clear to me that running a full standard
> distribution in a container is always going to be what everyone wants
> to do.
> 
> The whole point of containers versus VM's is that containers are
> lighter weight.  And one of the ways that containers can be lighter
> weight is if you don't have to have N copies of udev, dbus, running in
> each container/VM.
> 
> If you end up with so much overhead to provide the desired security and/or
> performance isolation, then it becomes fair to ask the question
> whether you might as well pay a tad bit more and get even better
> security and isolation by using a VM solution....
> 
> 	     	       	  	     - Ted

Yes, it does detract from the unique advantages of using a container.
However, I think the value here is not the efficiency of the initial
system configuration but the fact that it gives users a better place to
start.

Right now we're effectively asking users to start with non-working
and/or unfamiliar systems and repair them until they work.

By enabling unmodified distro installs in a container we're starting
at the other end. The choices may not be the most efficient but the
user may begin tuning from a working configuration. They can learn
about and tune those parts that prove significant for their workload.
This is better because in the end it's not just about how efficient the
user can make their containers but how much effort they will spend
achieving and maintaining that efficiency over time.

Cheers,
	-Matt Helsley



* Re: Detecting if you are running in a container
  2011-10-11  2:05             ` Matt Helsley
@ 2011-10-11  3:25               ` Ted Ts'o
  2011-10-11  6:42                 ` Eric W. Biederman
  2011-10-11 22:25               ` david
  1 sibling, 1 reply; 81+ messages in thread
From: Ted Ts'o @ 2011-10-11  3:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Eric W. Biederman, Lennart Poettering, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote:
> Yes, it does detract from the unique advantages of using a container.
> However, I think the value here is not the efficiency of the initial
> system configuration but the fact that it gives users a better place to
> start.
> 
> Right now we're effectively asking users to start with non-working
> and/or unfamiliar systems and repair them until they work.

If things are not working with containers, I would submit to you that
we're doing something wrong(tm).  Things should just work, except that
processes in one container can't use more than their fair share (as
dictated by policy) of memory, CPU, networking, and I/O bandwidth.

Something which is baked in my world view of containers (which I
suspect is not shared by other people who are interested in using
containers) is that given that kernel is shared, trying to use
containers to provide better security isolation between mutually
suspicious users is hopeless.  That is, it's pretty much impossible to
prevent a user from finding one or more zero day local privilege
escalation bugs that will allow a user to break root.  And at that
point, they will be able to penetrate the kernel, and from there,
break security of other processes.

So if you want that kind of security isolation, you shouldn't be using
containers in the first place.  You should be using KVM or Xen, and
then only after spending a huge amount of effort fuzz testing the
KVM/Xen paravirtualization interfaces.  So at least in my mind, adding
vast amounts of complexities to try to provide security isolation via
containers is really not worth it.  And if that's the model, then it's
a lot easier to make containers to run jobs in containers that don't
require changes to the distro plus huge increase of complexity for
containers in the kernel....

						- Ted


* Re: Detecting if you are running in a container
  2011-10-10 21:41           ` Lennart Poettering
@ 2011-10-11  5:40             ` Eric W. Biederman
  2011-10-11  6:54             ` Eric W. Biederman
  2011-10-12 16:59             ` Kay Sievers
  2 siblings, 0 replies; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-11  5:40 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

Lennart Poettering <mzxreary@0pointer.de> writes:

> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
>
>> > Quite a few kernel subsystems are
>> > currently not virtualized, for example SELinux, VTs, most of sysfs, most
>> > of /proc/sys, audit, udev or file systems (by which I mean that for a
>> > container you probably don't want to fsck the root fs, and so on), and
>> > containers tend to be much more lightweight than real systems.
>> 
>> That is an interesting viewpoint on what is not complete.  But as a
>> listing of the tasks that distribution startup needs to do differently in
>> a container the list seems more or less reasonable.
>
> Note that this is just what came to my mind while I was typing this, I
> am quite sure there's actually more like this.
>
>> There are two questions 
>> - How in the general case do we detect if we are running in a container.
>> - How do we make reasonable tests during bootup to see if it makes sense
>>   to perform certain actions.
>> 
>> For the general detection if we are running in a linux container I can
>> see two reasonable possibilities.
>> 
>> - Put a file in / that lets you know by convention that you are in a
>>   linux container.  I am inclined to do this because this is something
>>   we can support on all kernels old and new.
>
> Hmpf. That would break the stateless read-only-ness of the root dir.
>
> After pointing the issue out to the LXC folks they are now setting
> "container=lxc" as env var when spawning a container. In systemd-nspawn
> I have then adopted a similar scheme. Not sure though that that is
> particularly nice however, since env vars are inherited further down the
> tree where we probably don't want them.

Interesting.  That seems like a reasonable enough thing to require
of the programs that create containers.

> In case you are curious: this is the code we use in systemd:
>
> http://cgit.freedesktop.org/systemd/tree/src/virt.c
>
> What matters to me though is that we can generically detect Linux
> containers instead of specific implementations.

>> - Allow modification to the output of uname(2).  The uts namespace
>>   already covers uname(2) and uname is the standard method to
>>   communicate to userspace the vageries about the OS level environment
>>   they are running in.
>
> Well, I am not a particular fan of having userspace tell userspace about
> containers. I would prefer if userspace could get that info from the
> kernel without any explicit agreement to set some specific variable.

Well userspace tells userspace about stdin and it works reliably.

Containers are a userspace construct built with kernel facilities.
I don't see why asking userspace to implement a convention is any more
important than the other things that have to be done.

We do need to document the conventions, just like we document the
standard device names, but beyond that I think we should be fine.

>> My list of things that still have work left to do looks like:
>> - cgroups.  It is not safe to create a new hierarchies with groups
>>   that are in existing hierarchies.  So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
>
> In fact, most folks involved in cgroups userspace have agreed to these
> rules now:
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> Among other things they ask all userspace code to only create subgroups
> below the group it is started in, so not only systemd but everything
> else following these rules should work fine in a container environment.
>
> In other words: so far one gets away quite nicely with the fact that the
> cgroup tree is not virtualized.

Assuming you bind mount the cgroups inside and generally don't allow
people in a container to create cgroup hierarchies.  At the very least
that is nasty information leakage.

But I am glad there is a solution for right now.

For my uses I have yet to find control groups anything but borked.

>> - VTs.  Ptys should be well supported at this point.  For the rest
>>   they are physical hardware that a container should not be playing with
>>   so I would base which gettys to start up based on which device nodes
>>   are present in /dev.
>
> Well, I am not sure it's that easy since device nodes tend to show up
> dynamically in bare systems. So if you just check whether /dev/tty0 is
> there you might end up thinking you are in a container when you actually
> aren't simply because you did that check before udev loaded the DRI
> driver or so.

But the point isn't to detect a container the point is to decide if
a getty needs to be spawned.  Even with the configuration for a getty
you need to wait for the device node to exist before spawning one.

>> - sysctls (aka /proc/sys), that is a tricky one.  Until the user namespace
>>   is fleshed out a little more sysctls are going to be a problem,
>>   because root can write to most of them.  My gut feel says you probably
>>   want to base the decision to poke at sysctls on CAP_SYS_ADMIN.  At least that
>>   test will become true when the user namespaces are rolled out, and at
>>   that point you will want to set all of the sysctls you have permission
>>   to.
>
> So what we did right now in systemd-nspawn is that the container
> supervisor premounts /proc/sys read-only into the container. That way
> writes to it will fail in the container, and while you get a number of
> warnings things will work as they should (though not necessarily safely
> since the container can still remount the fs unless you take
> CAP_SYS_ADMIN away).

That sort of works.  In practice it means you can't setup interesting
things like forwarding in the networking stack.  But it certainly gets
things going.

>> So while I agree a check to see if something is a container seems
>> reasonable.  I do not agree that the pid namespace is the place to put
>> that information.  I see no natural to put that information in the
>> pid namespace.
>
> Well, a simple way would be to have a line /proc/1/status called
> "PIDNamespaceLevel:" or so which would be 0 for the root namespace, and
> increased for each namespace nested in it. Then, processes could simply
> read that and be happy.

Not a chance.  PIDNamespaceLevel is exposing an implementation
detail that may well change in the lifetime of a process.  It is true
we don't have migration merged in the kernel yet but one of these days
I expect we will.

>> I further think there are a lot of reasonable checks for if a
>> kernel feature is supported in the current environment I would
>> rather pursue over hacks based on the fact we are in a container.
>
> Well, believe me, we have been trying to find nicer hooks than explicit
> checks for containers, but I am quite sure that at the end of the day
> you won't be able to go without them entirely.

And you have explicit information you are in a container at this point.

It looks like all that is left is Documentation of the conventions.

Eric


* Re: Detecting if you are running in a container
  2011-10-11  3:25               ` Ted Ts'o
@ 2011-10-11  6:42                 ` Eric W. Biederman
  2011-10-11 12:53                   ` Theodore Tso
  0 siblings, 1 reply; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-11  6:42 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

Ted Ts'o <tytso@mit.edu> writes:

> On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote:
>> Yes, it does detract from the unique advantages of using a container.
>> However, I think the value here is not the effeciency of the initial
>> system configuration but the fact that it gives users a better place to
>> start.
>> 
>> Right now we're effectively asking users to start with non-working
>> and/or unfamiliar systems and repair them until they work.
>
> If things are not working with containers, I would submit to you that
> we're doing something wrong(tm). 

That is what this discussion is about.  What we are doing wrong(tm).
Mostly it is about the bits that have not yet been namespacified but
need to be.

I am totally in favor of not starting the entire world.  But just
like I find it convenient to loopback mount an ISO image to see
what is on a disk image, it would be handy to be able to just
download a distro image and play with it, without doing anything
special.

We can pare things down further for the people who are running 1000
copies of apache, but not requiring detailed distro surgery before
starting up the binaries on a live CD sounds handy.

> Things should just work, except that
> processes in one container can't use more than their fair share (as
> dictated by policy) of memory, CPU, networking, and I/O bandwidth.

You have to be careful with the limiters.  The fundamental reason
why containers are more efficient than hardware virtualization is
that with containers we can do over commit of resources, especially
memory.  I keep seeing implementations of resource limiters that want
to do things in a heavy handed way that break resource over commit.

> Something which is baked in my world view of containers (which I
> suspect is not shared by other people who are interested in using
> containers) is that given that kernel is shared, trying to use
> containers to provide better security isolation between mutually
> suspicious users is hopeless.  That is, it's pretty much impossible to
> prevent a user from finding one or more zero day local privilege
> escalation bugs that will allow a user to break root.  And at that
> point, they will be able to penetrate the kernel, and from there,
> break security of other processes.

You don't even have to get to security problems to have that concern.
There are enough crazy timing and side channel attacks.

I don't know what concern you have security wise, but the problem that
wants to be solved with user namespaces is something you hit much
earlier than when you worry about sharing a kernel between mutually
distrusting users.  Right now root inside a container is root rout
outside of a container just like in a chroot jail.  Where this becomes a
problem is that people change things like like
/proc/sys/kernel/print-fatal-signals expecting it to be a setting local
to their sand box when in fact the global setting and things start
behaving weirdly for other users.  Running sysctl -a during bootup 
has that problem in spades.

With user namespaces what we get is that the global root user is not the
container root user and we have been working our way through the
permission checks in the kernel to ensure we get them right in the
context of the user namespace.  This trivially means that the things
that we allow the global root user to do in /proc/ and /sysfs and
the like simply won't be allowed as a container root user.  Which
makes doing something stupid and affecting other people much more
difficult.

What the user namespace also allows is an escape hatch from the
bonds of suid.  Right now anything that could confuse an existing
app that is suid root we have to only allow to root, or risk
adding a security hole.  With the user namespaces we can relax
that check and allow it also for container root users as well
as global root users.  When we are brave enough and certain
enough of our code we can allow non-root users to create their
own user namespaces.

There is a third use for containers, where for some reason
we have uid assignment overlap.  Perhaps one distro assigns
uid 22 to sshd and another to the nobody user.  Or perhaps there
are two departments that have done the silly thing
of assigning overlapping uids to their users, and we want to
access filesystems created by both departments at the same
time without a chance of confusion and conflict.

With my sysadmin hat on I would not want to touch two untrusting groups
of users on the same machine.  Because of the probability there is at
least one security hole that can be found and exploited to allow
privilege escalation.

With my kernel developer hat on I can't just say surrender to the
idea that there will in fact be a privilege escalation bug that
is easy to exploit.  The code has to be built and designed so that
privilege escalation is difficult.  Otherwise we might as well
assume that if you visit a website a stealthy worm has taken over your
computer.

It is my hope at the end of the day that the user namespaces will be one
more line of defense in messing up and slowing down the evil omniscient
worms that seem to unerringly go for every privilege exploit there is.

Eric


* Re: Detecting if you are running in a container
  2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  5:40             ` Eric W. Biederman
@ 2011-10-11  6:54             ` Eric W. Biederman
  2011-10-12 16:59             ` Kay Sievers
  2 siblings, 0 replies; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-11  6:54 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

Lennart Poettering <mzxreary@0pointer.de> writes:

> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

>> My list of things that still have work left to do looks like:
>> - cgroups.  It is not safe to create a new hierarchies with groups
>>   that are in existing hierarchies.  So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
>
> In fact, most folks involved in cgroups userspace have agreed to these
> rules now:
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

Wow.   Are cgroups really that complicated to use?  A list of rules
a page long on what you have to do to make them useful and non-conflicting.
Something seems off.  Perhaps we need a rule: don't mount multiple
controllers in the same hierarchy.

Eric


* Re: Detecting if you are running in a container
  2011-10-11  6:42                 ` Eric W. Biederman
@ 2011-10-11 12:53                   ` Theodore Tso
  2011-10-11 21:16                     ` Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: Theodore Tso @ 2011-10-11 12:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage


On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:

> I am totally in favor of not starting the entire world.  But just
> like I find it convienient to loopback mount an iso image to see
> what is on a disk image.  It would be handy to be able to just
> download a distro image and play with it, without doing anything
> special.

Agreed, but what's wrong with firing up KVM to play with a distro image?  Personally, I don't consider that "doing something special".

> 
>> Things should just work, except that
>> processes in one container can't use more than their fair share (as
>> dictated by policy) of memory, CPU, networking, and I/O bandwidth.
> 
> You have to be careful with the limiters.  The fundamental reason
> why containers are more efficient than hardware virtualization is
> that with containers we can do over commit of resources, especially
> memory.  I keep seeing implementations of resource limiters that want
> to do things in a heavy handed way that break resource over commit.

Oh, sure.   Resource limiting is something that should be done only when there are other demands on the resource in question.   Put another way, it should be considered more of a resource guarantee than a resource limit.   (You will have at least 10% of the CPU, not at most 10% of the CPU.)

> 
> I don't know what concern you have security wise, but the problem that
> wants to be solved with user namespaces is something you hit much
> earlier than when you worry about sharing a kernel between mutually
> distrusting users.  Right now root inside a container is root
> outside of a container, just like in a chroot jail.  Where this becomes a
> problem is that people change things like
> /proc/sys/kernel/print-fatal-signals expecting it to be a setting local
> to their sandbox when in fact it is the global setting, and things start
> behaving weirdly for other users.  Running sysctl -a during bootup
> has that problem in spades.

The moment you start caring about global sysctl settings is the moment I start wondering whether or not VM and separate kernel images is the better solution.   Do we really want to add so much complexity that we are multiplexing different sysctl settings across containers?   To my mind, that way lies madness, and in some cases, it simply can't be done from a semantics perspective.

> 
> With my sysadmin hat on I would not want to touch two untrusting groups
> of users on the same machine.  Because of the probability there is at
> least one security hole that can be found and exploited to allow
> privilege escalation.
> 
> With my kernel developer hat on I can't just say surrender to the
> idea that there will in fact be a privilege escalation bug that
> is easy to exploit.  The code has to be built and designed so that
> privilege escalation is difficult.  Otherwise we might as well
> assume if you visit a website a stealthy worm has taken over your
> computer.

Oh, I agree that we should try to stop privilege escalation attacks.  And it will be a grand and glorious fight, like Leonidas and his 300 men at the pass at Thermopylae.   :-)   Or it will be like Steve Jobs struggling against cancer.  It's a fight that you know that you're going to lose, but it's not about winning or losing but how much you accomplish and how you fight that counts.

Personally, though, if the issue is worries about visiting a website, the primary protection against that has got to be done  at the browser level (i.e., the process level sandboxing done by Chrome).

-- Ted


* Re: A Plumber's Wish List for Linux
  2011-10-10 11:18             ` David Sterba
  (?)
  (?)
@ 2011-10-11 13:14             ` Serge E. Hallyn
  2011-10-11 15:49               ` Andrew G. Morgan
  -1 siblings, 1 reply; 81+ messages in thread
From: Serge E. Hallyn @ 2011-10-11 13:14 UTC (permalink / raw)
  To: Kay Sievers, Alan Cox, linux-kernel, lennart, harald, david,
	greg, Andrew Morgan, KaiGai Kohei


Unfortunately I'd deleted the early part of this thread before noticing
the mention on lwn+lkml.org, but fwiw detection of the last supported
capability has been brought up before (with patchsets floated (by KaiGai,
I'm pretty sure) which exported the list of capabilities through /sys or
/security), and I agree it's something we need.

-serge


* Re: A Plumber's Wish List for Linux
  2011-10-11 13:14             ` Serge E. Hallyn
@ 2011-10-11 15:49               ` Andrew G. Morgan
  2011-10-12  2:31                 ` Serge E. Hallyn
  2011-10-12 20:51                 ` Lennart Poettering
  0 siblings, 2 replies; 81+ messages in thread
From: Andrew G. Morgan @ 2011-10-11 15:49 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kay Sievers, Alan Cox, linux-kernel, lennart, harald, david,
	greg, KaiGai Kohei

The benefit of KaiGai's patch was that it exported the actual names
of the capabilities rather than having them stored only in libcap.

It is possible to use CAP_IS_SUPPORTED(cap) (in libcap-2.21) to figure
out the maximum capability supported by the running kernel.

  https://sites.google.com/site/fullycapable/release-notes-for-libcap

Cheers

Andrew

On Tue, Oct 11, 2011 at 6:14 AM, Serge E. Hallyn <serge@hallyn.com> wrote:
>
> Unfortunately I'd deleted the early part of this thread before noticing
> the mention on lwn+lkml.org, but fwiw detection of the last supported
> capability has been brought up before (with patchsets floated (by KaiGai,
> I'm pretty sure) which exported the list of capabilities through /sys or
> /security), and I agree it's something we need.
>
> -serge
>


* Re: Detecting if you are running in a container
  2011-10-11 12:53                   ` Theodore Tso
@ 2011-10-11 21:16                     ` Eric W. Biederman
  2011-10-11 22:30                       ` david
  2011-10-12 17:57                       ` J. Bruce Fields
  0 siblings, 2 replies; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-11 21:16 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

Theodore Tso <tytso@MIT.EDU> writes:

> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>
>> I am totally in favor of not starting the entire world.  But just
> like I find it convenient to loopback mount an iso image to see
>> what is on a disk image.  It would be handy to be able to just
>> download a distro image and play with it, without doing anything
>> special.
>
> Agreed, but what's wrong with firing up KVM to play with a distro
> image?  Personally, I don't consider that "doing something special".

Then let me flip this around and give a much more practical use case.
Testing.  An interesting number of cases involve how multiple
machines interact.  You can test a lot more logical machines interacting
with containers than you can with VMs.  And you can test on all the
architectures and platforms Linux supports, not just the handful that are
well supported by hardware virtualization.

I admit for a lot of test cases that it makes sense not to use a full
set of userspace daemons.  At the same time there is no particularly
good reason to have a design that doesn't allow you to run a full
userspace.

>>> Things should just work, except that
>>> processes in one container can't use more than their fair share (as
>>> dictated by policy) of memory, CPU, networking, and I/O bandwidth.
>> 
>> You have to be careful with the limiters.  The fundamental reason
>> why containers are more efficient than hardware virtualization is
>> that with containers we can do over commit of resources, especially
>> memory.  I keep seeing implementations of resource limiters that want
>> to do things in a heavy handed way that break resource over commit.
>
> Oh, sure.   Resource limiting is something that should be done only
> when there are other demands on the resource in question.   Put
> another way, it should be considered more of a resource guarantee than
> a resource limit.   (You will have at least 10% of the CPU, not at
> most 10% of the CPU.)

Resource guarantees I suspect may be worse.  But all of this is to say
that the problem control groups are tackling is a hard one.  Resource
control and resource limits across multiple processes is a challenging
problem, and in some contexts it is a hard problem.

My observations have been that when you want any kind of strong resource
guarantee or resource limit, it is currently a lot easier to implement
that with hardware virtualization than with control groups (at least for
memory).  I think the CPU scheduling side has been solved, but until you
also solve at least user-space memory there are going to be issues.

At the same time getting better resource controls is an area where
there is a strong interest from all over the place.

>> I don't know what concern you have security wise, but the problem that
>> wants to be solved with user namespaces is something you hit much
>> earlier than when you worry about sharing a kernel between mutually
>> distrusting users.  Right now root inside a container is root
>> outside of a container just like in a chroot jail.  Where this becomes a
>> problem is that people change things like
>> /proc/sys/kernel/print-fatal-signals expecting it to be a setting local
>> to their sandbox when in fact it is a global setting and things start
>> behaving weirdly for other users.  Running sysctl -a during bootup 
>> has that problem in spades.
>
> The moment you start caring about global sysctl settings is the moment
> I start wondering whether or not a VM and separate kernel image is the
> better solution.   Do we really want to add so much complexity that we
> are multiplexing different sysctl settings across containers?   To my
> mind, that way lies madness, and in some cases, it simply can't be
> done from a semantics perspective.

It actually isn't much complexity and for the most part the code that
I care about in that area is already merged.  In principle all I care
about is having the identity checks go from:
(uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2))

There are some per-subsystem sysctls that do make sense to make per
namespace and that work is mostly done.  I expect there are a few
more in the networking stack that are interesting to make per network
namespace.

The only real-world issue right now that I am aware of is that the user
namespaces aren't quite ready for prime time and so people run into
issues where something like sysctl -a during bootup sets a bunch of
sysctls and they change sysctls they didn't mean to.  Once the
user namespaces are in place accessing a truly global sysctl will
result in EPERM when you are in a container and everyone will be
happy. ;)


Where all of this winds up being interesting in the field of oncoming kernel
work is that uids are persistent and are stored in file systems.  So
once we have all of the permission checks in the kernel tweaked to care
about user namespaces we next look at the filesystems.   The easy
initial implementation is going to be just associating a user namespace
with a super block.  But farther out being able to store uids from
different user namespaces on the same filesystem becomes an interesting
problem.

We already have things like user mapping in 9p and nfsv4 so it isn't
wholly uncharted territory.  But it could get interesting.   Just
a heads up.

>> With my sysadmin hat on I would not want to touch two untrusting groups
>> of users on the same machine.  Because of the probability there is at
>> least one security hole that can be found and exploited to allow
>> privilege escalation.
>> 
>> With my kernel developer hat on I can't just say surrender to the
>> idea that there will in fact be a privilege escalation bug that
>> is easy to exploit.  The code has to be built and designed so that
>> privilege escalation is difficult.  Otherwise we might as well
>> assume that if you visit a website a stealthy worm has taken over your
>> computer.
>
> Oh, I agree that we should try to stop privilege escalation attacks.
> And it will be a grand and glorious fight, like Leonidas and his 300
> men at the pass at Thermopylae.  :-) Or it will be like Steve Jobs
> struggling against cancer.  It's a fight that you know that you're
> going to lose, but it's not about winning or losing but how much you
> accomplish and how you fight that counts.
>
> Personally, though, if the issue is worries about visiting a website,
> the primary protection against that has got to be done at the browser
> level (i.e., the process level sandboxing done by Chrome).

My concern is any externally implemented service, but in general 
browsers and web sites are your most likely candidates.  Both because
there is more complexity there and because http is used far more often
than other protocols.

And yes I agree that the first line of defense needs to be in the
browser source code, and then the application-level sandboxing
features that the browser takes advantage of.  Last I paid attention
one of the layers of defense that Chrome is using was to set up different
namespaces to make the sandbox tight even at the syscall level.   When
it is complete I would not be at all surprised if the user namespace
wound up being used in Chrome as well.  Just as one more thing that
helps.

I have found it very surprising how many of the namespaces are
used for what you can't do with them.

Eric


* Re: Detecting if you are running in a container
  2011-10-11  2:05             ` Matt Helsley
  2011-10-11  3:25               ` Ted Ts'o
@ 2011-10-11 22:25               ` david
  1 sibling, 0 replies; 81+ messages in thread
From: david @ 2011-10-11 22:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Ted Ts'o, Eric W. Biederman, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, 10 Oct 2011, Matt Helsley wrote:

> On Mon, Oct 10, 2011 at 09:32:01PM -0400, Ted Ts'o wrote:
>> On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
>>> Lennart Poettering <mzxreary@0pointer.de> writes:
>>>
>>>> To make a standard distribution run nicely in a Linux container you
>>>> usually have to make quite a number of modifications to it and disable
>>>> certain things from the boot process. Ideally however, one could simply
>>>> boot the same image on a real machine and in a container and would just
>>>> do the right thing, fully stateless. And for that you need to be able to
>>>> detect containers, and currently you can't.
>>>
>>> I agree getting to the point where we can run a standard distribution
>>> unmodified in a container sounds like a reasonable goal.
>>
>> Hmm, interesting.  It's not clear to me that running a full standard
>> distribution in a container is always going to be what everyone wants
>> to do.
>>
>> The whole point of containers versus VM's is that containers are
>> lighter weight.  And one of the ways that containers can be lighter
>> weight is if you don't have to have N copies of udev, dbus, running in
>> each container/VM.
>>
>> If you end up with so much overhead to provide the desired security and/or
>> performance isolation, then it becomes fair to ask the question
>> whether you might as well pay a tad bit more and get even better
>> security and isolation by using a VM solution....
>>
>> 	     	       	  	     - Ted
>
> Yes, it does detract from the unique advantages of using a container.
> However, I think the value here is not the efficiency of the initial
> system configuration but the fact that it gives users a better place to
> start.
>
> Right now we're effectively asking users to start with non-working
> and/or unfamiliar systems and repair them until they work.
>
> By enabling unmodified distro installs in a container we're starting
> at the other end. The choices may not be the most efficient but the
> user may begin tuning from a working configuration. They can learn
> about and tune those parts that prove significant for their workload.
> This is better because in the end it's not just about how efficient the
> user can make their containers but how much effort they will spend
> achieving and maintaining that efficiency over time.

what's needed isn't a way to run all the daemons, processes and startup 
scripts that a distro uses in a container without conflicting with the 
parent, but instead an easy way to create the appropriate config changes in 
the parent (bind mounts, cgroups, etc.) for the container and start up the 
apps that are wanted in the container.

This needs to be something with a lot of knowledge and hooks in the 
parent, so it's not just a matter of adding a way to detect "am I in a 
container" or not.

when I run things in containers, I want to bind mount some things from the 
parent, I want to configure syslog to listen on /dev/log inside the 
container, and then I want to start up just the processes I am planning to 
use inside the container, not all the daemons and other processes that I 
need to run the service the container is built for.

David Lang


* Re: Detecting if you are running in a container
  2011-10-11 21:16                     ` Eric W. Biederman
@ 2011-10-11 22:30                       ` david
  2011-10-12  4:26                         ` Eric W. Biederman
  2011-10-12 17:57                       ` J. Bruce Fields
  1 sibling, 1 reply; 81+ messages in thread
From: david @ 2011-10-11 22:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Tue, 11 Oct 2011, Eric W. Biederman wrote:

> Theodore Tso <tytso@MIT.EDU> writes:
>
>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>>
>>> I am totally in favor of not starting the entire world.  But just
>>> like I find it convenient to loopback mount an iso image to see
>>> what is on a disk image.  It would be handy to be able to just
>>> download a distro image and play with it, without doing anything
>>> special.
>>
>> Agreed, but what's wrong with firing up KVM to play with a distro
>> image?  Personally, I don't consider that "doing something special".
>
> Then let me flip this around and give a much more practical use case.
> Testing.  An interesting number of cases involve how multiple
> machines interact.  You can test a lot more logical machines interacting
> with containers than you can with VMs.  And you can test on all the
> architectures and platforms Linux supports, not just the handful that are
> well supported by hardware virtualization.

but in containers, you are not really testing lots of machines, you are 
testing lots of processes on the same machine (they share the same kernel)

> I admit for a lot of test cases that it makes sense not to use a full
> set of userspace daemons.  At the same time there is no particularly
> good reason to have a design that doesn't allow you to run a full
> userspace.

how do you share the display between all the different containers if they 
are trying to run the X server?

how do you avoid all the containers binding to the same port on the 
default IP address?

how do you arbitrate dbus across the containers?

when a new USB device gets plugged in, which container gets control of it?

there are a LOT of hard questions when you start talking about running a 
full system inside a container that do not apply to other uses of 
containers.

David Lang


* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
                   ` (5 preceding siblings ...)
  2011-10-09  8:45 ` Rusty Russell
@ 2011-10-11 23:16 ` Andrew Morton
  2011-10-12  0:53   ` Frederic Weisbecker
  2011-10-12  0:59   ` Frederic Weisbecker
  2011-10-19 21:12 ` Paul Menage
  7 siblings, 2 replies; 81+ messages in thread
From: Andrew Morton @ 2011-10-11 23:16 UTC (permalink / raw)
  To: Kay Sievers
  Cc: linux-kernel, lennart, harald, david, greg, Kirill A. Shutemov,
	Frederic Weisbecker


Useful email, thanks.

On Fri, 07 Oct 2011 01:17:02 +0200
Kay Sievers <kay.sievers@vrfy.org> wrote:

> We___d like to share our current wish list of plumbing layer features we

gargh.  gmail?

>
> ...
>
> * fork throttling mechanism as basic cgroup functionality that is
> available in all hierarchies independent of the controllers used:
> This is important to implement race-free killing of all members of a
> cgroup, so that cgroup member processes cannot fork faster than a cgroup
> supervisor process could kill them. This needs to be recursive, so that
> not only a cgroup but all its subgroups are covered as well.

Frederic Weisbecker's "cgroups: add a task counter subsystem" should
address this.  Does it meet these requirements?  Have you tested it?

>
> ...
>
> * Add a timerslack cgroup controller, to allow increasing the timer
> slack of user session cgroups when the machine is idle.

Kirill Shutemov has just posted "cgroups: introduce timer slack
controller".  Again, is that sufficient?  Have you reviewed and tested
it?

>
> ...
>


* Re: A Plumber’s Wish List for Linux
  2011-10-11 23:16 ` Andrew Morton
@ 2011-10-12  0:53   ` Frederic Weisbecker
  2011-10-12  0:59   ` Frederic Weisbecker
  1 sibling, 0 replies; 81+ messages in thread
From: Frederic Weisbecker @ 2011-10-12  0:53 UTC (permalink / raw)
  To: Andrew Morton, Kay Sievers, TejunHeotj
  Cc: linux-kernel, lennart, harald, david, greg, Kirill A. Shutemov

On Tue, Oct 11, 2011 at 04:16:00PM -0700, Andrew Morton wrote:
> On Fri, 07 Oct 2011 01:17:02 +0200
> Kay Sievers <kay.sievers@vrfy.org> wrote:
> > * fork throttling mechanism as basic cgroup functionality that is
> > available in all hierarchies independent of the controllers used:
> > This is important to implement race-free killing of all members of a
> > cgroup, so that cgroup member processes cannot fork faster than a cgroup
> > supervisor process could kill them. This needs to be recursive, so that
> > not only a cgroup but all its subgroups are covered as well.
> 
> Frederic Weisbecker's "cgroups: add a task counter subsystem" should
> address this.  Does it meet these requirements?  Have you tested it?

It should work for this, yeah. We in fact explored and documented that
second use case of the task counter subsystem for Kay's needs.

Now cgroup subsystems can only be bound to one hierarchy at a time,
so it couldn't be used by LXC and some other user at the same time,
and that defeats Kay's goals. But there is an old patch from Paul
Menage that allows some specific subsystems (those that don't deal
with global resources) to be mounted on many hierarchies. The task
counter would fit in and hence be usable by LXC and other users
simultaneously.

There is another solution that is to be considered. One could use
the cgroup freezer to freeze all the tasks in a cgroup and then kill
them all before thawing the whole. If the process of freezing doesn't
have races against fork then it should work as well. I only worry
about the window in copy_process() between the test on signal_pending(),
that cancels the fork if a signal is pending on the parent, and the
time the new task is eventually added to the cgroup with
cgroup_post_fork(). If the freezer misses the child while it is in that
window, then it's not going to be killed with the rest and it may even
launch some fork() soon to annoy you further. I don't know if that's
handled by the freezer. If it doesn't and that can't be fixed then that
won't work for you.

If the freezer is a possible solution then I don't know which one
is best for you. Perhaps freezing the tasks in the cgroup can make
it faster, or slower, than rejecting any fork and killing directly.
Perhaps it would be helpful to get more details about the practical
case you have.

Anyway, if you think the task counter subsystem approach suits you
better, I can rework Paul's patches that allow multi-bindable
subsystem so that it gets usable by several users simultaneously.

Thanks.


* Re: A Plumber’s Wish List for Linux
  2011-10-11 23:16 ` Andrew Morton
  2011-10-12  0:53   ` Frederic Weisbecker
@ 2011-10-12  0:59   ` Frederic Weisbecker
       [not found]     ` <20111012174014.GE6281@google.com>
  1 sibling, 1 reply; 81+ messages in thread
From: Frederic Weisbecker @ 2011-10-12  0:59 UTC (permalink / raw)
  To: Andrew Morton, Kay Sievers, Tejun Heo
  Cc: linux-kernel, lennart, harald, david, greg, Kirill A. Shutemov

(Resending because I screwed Tejun's email address...)

On Tue, Oct 11, 2011 at 04:16:00PM -0700, Andrew Morton wrote:
> On Fri, 07 Oct 2011 01:17:02 +0200
> Kay Sievers <kay.sievers@vrfy.org> wrote:
> > * fork throttling mechanism as basic cgroup functionality that is
> > available in all hierarchies independent of the controllers used:
> > This is important to implement race-free killing of all members of a
> > cgroup, so that cgroup member processes cannot fork faster than a cgroup
> > supervisor process could kill them. This needs to be recursive, so that
> > not only a cgroup but all its subgroups are covered as well.
>
> Frederic Weisbecker's "cgroups: add a task counter subsystem" should
> address this.  Does it meet these requirements?  Have you tested it?

It should work for this, yeah. We in fact explored and documented that
second use case of the task counter subsystem for Kay's needs.

Now cgroup subsystems can only be bound to one hierarchy at a time,
so it couldn't be used by LXC and some other user at the same time,
and that defeats Kay's goals. But there is an old patch from Paul
Menage that allows some specific subsystems (those that don't deal
with global resources) to be mounted on many hierarchies. The task
counter would fit in and hence be usable by LXC and other users
simultaneously.

There is another solution that is to be considered. One could use
the cgroup freezer to freeze all the tasks in a cgroup and then kill
them all before thawing the whole. If the process of freezing doesn't
have races against fork then it should work as well. I only worry
about the window in copy_process() between the test on signal_pending(),
that cancels the fork if a signal is pending on the parent, and the
time the new task is eventually added to the cgroup with
cgroup_post_fork(). If the freezer misses the child while it is in that
window, then it's not going to be killed with the rest and it may even
launch some fork() soon to annoy you further. I don't know if that's
handled by the freezer. If it doesn't and that can't be fixed then that
won't work for you.

If the freezer is a possible solution then I don't know which one
is best for you. Perhaps freezing the tasks in the cgroup can make
it faster, or slower, than rejecting any fork and killing directly.
Perhaps it would be helpful to get more details about the practical
case you have.

Anyway, if you think the task counter subsystem approach suits you
better, I can rework Paul's patches that allow multi-bindable
subsystem so that it gets usable by several users simultaneously.

Thanks.


* Re: A Plumber's Wish List for Linux
  2011-10-11 15:49               ` Andrew G. Morgan
@ 2011-10-12  2:31                 ` Serge E. Hallyn
  2011-10-12 20:51                 ` Lennart Poettering
  1 sibling, 0 replies; 81+ messages in thread
From: Serge E. Hallyn @ 2011-10-12  2:31 UTC (permalink / raw)
  To: Andrew G. Morgan
  Cc: Kay Sievers, Alan Cox, linux-kernel, lennart, harald, david,
	greg, KaiGai Kohei

Quoting Andrew G. Morgan (morgan@kernel.org):
> The benefit of KaiGai's patch was that it exported the actual names
> of the capabilities rather than have them only stored in libcap.
> 
> It is possible to use CAP_IS_SUPPORTED(cap) (in libcap-2.21) to figure
> out the maximum capability supported by the running kernel.
> 
>   https://sites.google.com/site/fullycapable/release-notes-for-libcap

I keep forgetting about that :)

thanks, Andrew.

-serge


* Re: Detecting if you are running in a container
  2011-10-11 22:30                       ` david
@ 2011-10-12  4:26                         ` Eric W. Biederman
  2011-10-12  5:10                           ` david
  0 siblings, 1 reply; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-12  4:26 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

david@lang.hm writes:

> On Tue, 11 Oct 2011, Eric W. Biederman wrote:
>
>> Theodore Tso <tytso@MIT.EDU> writes:
>>
>>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>>>
>>>> I am totally in favor of not starting the entire world.  But just
>>>> like I find it convenient to loopback mount an iso image to see
>>>> what is on a disk image.  It would be handy to be able to just
>>>> download a distro image and play with it, without doing anything
>>>> special.
>>>
>>> Agreed, but what's wrong with firing up KVM to play with a distro
>>> image?  Personally, I don't consider that "doing something special".
>>
>> Then let me flip this around and give a much more practical use case.
>> Testing.  An interesting number of cases involve how multiple
>> machines interact.  You can test a lot more logical machines interacting
>> with containers than you can with VMs.  And you can test on all the
>> architectures and platforms Linux supports, not just the handful that are
>> well supported by hardware virtualization.
>
> but in containers, you are not really testing lots of machines, you are testing
> lots of processes on the same machine (they share the same kernel)

True.  But usually that is the interesting part.

>> I admit for a lot of test cases that it makes sense not to use a full
>> set of userspace daemons.  At the same time there is no particularly
>> good reason to have a design that doesn't allow you to run a full
>> userspace.
>
> how do you share the display between all the different containers if they are
> trying to run the X server?

Either X does not start because the hardware it needs is not present or
Xnest or similar gets started.

> how do you avoid all the containers binding to the same port on the default IP
> address?

Network namespaces.

> how do you arbitrate dbus across the containers.

Why should you?

> when a new USB device gets plugged in, which container gets control of
> it?

None of them.  Although today they may all get the uevent.  None of the
containers should have permission to call mknod to mess with it.

> there are a LOT of hard questions when you start talking about running a full
> system inside a container that do not apply for other use of
> containers.

Not really; mostly the answer is that you say no.

Eric


* Re: Detecting if you are running in a container
  2011-10-12  4:26                         ` Eric W. Biederman
@ 2011-10-12  5:10                           ` david
  2011-10-12 15:08                             ` Serge E. Hallyn
  0 siblings, 1 reply; 81+ messages in thread
From: david @ 2011-10-12  5:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Tue, 11 Oct 2011, Eric W. Biederman wrote:

> david@lang.hm writes:
>
>> On Tue, 11 Oct 2011, Eric W. Biederman wrote:
>>
>>> Theodore Tso <tytso@MIT.EDU> writes:
>>>
>>>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>>>>
>>> I admit for a lot of test cases that it makes sense not to use a full
>>> set of userspace daemons.  At the same time there is no particularly
>>> good reason to have a design that doesn't allow you to run a full
>>> userspace.
>>
>> how do you share the display between all the different containers if they are
>> trying to run the X server?
>
> Either X does not start because the hardware it needs is not present or
> Xnest or similar gets started.
>
>> how do you avoid all the containers binding to the same port on the default IP
>> address?
>
> Network namespaces.
>
>> how do you arbitrate dbus across the containers.
>
> Why should you?

because the containers are simulating different machines, and dbus doesn't 
work across different machines.

>> when a new USB device gets plugged in, which container gets control of
>> it?
>
> None of them.  Although today they may all get the uevent.  None of the
> containers should have permission to call mknod to mess with it.

why would the software inside a container not have the rights to do a 
mknod inside the container?

>> there are a LOT of hard questions when you start talking about running a full
>> system inside a container that do not apply for other use of
>> containers.
>
> Not really; mostly the answer is that you say no.
>
> Eric
>

David Lang


* Re: Detecting if you are running in a container
  2011-10-12  5:10                           ` david
@ 2011-10-12 15:08                             ` Serge E. Hallyn
  0 siblings, 0 replies; 81+ messages in thread
From: Serge E. Hallyn @ 2011-10-12 15:08 UTC (permalink / raw)
  To: david
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Daniel Lezcano,
	Paul Menage

Quoting david@lang.hm (david@lang.hm):
> On Tue, 11 Oct 2011, Eric W. Biederman wrote:
> 
> >david@lang.hm writes:
> >
> >>On Tue, 11 Oct 2011, Eric W. Biederman wrote:
> >>
> >>>Theodore Tso <tytso@MIT.EDU> writes:
> >>>
> >>>>On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
> >>>>
> >>>I admit for a lot of test cases that it makes sense not to use a full
> >>>set of userspace daemons.  At the same time there is no particularly
> >>>good reason to have a design that doesn't allow you to run a full
> >>>userspace.
> >>
> >>how do you share the display between all the different containers if they are
> >>trying to run the X server?
> >
> >Either X does not start because the hardware it needs is not present or
> >Xnest or similar gets started.
> >
> >>how do you avoid all the containers binding to the same port on the default IP
> >>address?
> >
> >Network namespaces.
> >
> >>how do you arbitrate dbus across the containers.
> >
> >Why should you?
> 
> because the containers are simulating different machines, and dbus
> doesn't work across different machines.

Exactly - Eric is saying dbus should not be (and is not) shared among
containers.

> >>when a new USB device gets plugged in, which container gets control of
> >>it?
> >
> >None of them.  Although today they may all get the uevent.  None of the
> >containers should have permission to call mknod to mess with it.
> 
> why would the software inside a container not have the rights to do
> a mknod inside the container?

Why shouldn't an unprivileged user be allowed to mknod on the host?

-serge


* Re: Detecting if you are running in a container
  2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  5:40             ` Eric W. Biederman
  2011-10-11  6:54             ` Eric W. Biederman
@ 2011-10-12 16:59             ` Kay Sievers
  2011-11-01 22:05               ` [lxc-devel] " Michael Tokarev
  2 siblings, 1 reply; 81+ messages in thread
From: Kay Sievers @ 2011-10-12 16:59 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Eric W. Biederman, Matt Helsley, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote:
> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

>> - udev.  All of the kernel interfaces for udev should be supported in
>>   current kernels.  However I believe udev is useless because container
>>   start drops CAP_MKNOD so we can't do evil things.  So I would
>>   recommend basing the startup of udev on presence of CAP_MKNOD.
>
> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
> in a systemd world makes use of that.

Done.

http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb

Thanks,
Kay

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-11 21:16                     ` Eric W. Biederman
  2011-10-11 22:30                       ` david
@ 2011-10-12 17:57                       ` J. Bruce Fields
  2011-10-12 18:25                         ` Kyle Moffett
  1 sibling, 1 reply; 81+ messages in thread
From: J. Bruce Fields @ 2011-10-12 17:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
> It actually isn't much complexity and for the most part the code that
> I care about in that area is already merged.  In principle all I care
> about are having the identiy checks go from:
> (uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2))
> 
> There are some per subsystem sysctls that do make sense to make per
> subsystem and that work is mostly done.  I expect there are a few
> more in the networking stack that interesting to make per network
> namespace.
> 
> The only real world issue right now that I am aware of is the user
> namespaces aren't quite ready for prime-time and so people run into
> issues where something like sysctl -a during bootup sets a bunch of
> sysctls and they change sysctls they didn't mean to.  Once the
> user namespaces are in place accessing a truly global sysctl will
> result in EPERM when you are in a container and everyone will be
> happy. ;)
> 
> 
> Where all of this winds up interesting in the field of oncoming kernel
> work is that uids are persistent and are stored in file systems.  So
> once we have all of the permission checks in the kernel tweaked to care
> about user namespaces we next look at the filesystems.   The easy
> initial implementation is going to be just associating a user namespace
> with a super block.  But farther out being able to store uids from
> different user namespaces on the same filesystem becomes an interesting
> problem.

Yipes.  Why would anyone want to do that?

--b.

> We already have things like user mapping in 9p and nfsv4 so it isn't
> wholly uncharted territory.  But it could get interesting.   Just
> a heads up.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
       [not found]     ` <20111012174014.GE6281@google.com>
@ 2011-10-12 18:16       ` Cyrill Gorcunov
  2011-10-14 15:38         ` Frederic Weisbecker
  0 siblings, 1 reply; 81+ messages in thread
From: Cyrill Gorcunov @ 2011-10-12 18:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Andrew Morton, Kay Sievers, linux-kernel,
	lennart, harald, david, greg, Kirill A. Shutemov, Oleg Nesterov,
	Paul Menage, Rafael J. Wysocki, Pavel Emelyanov

On Wed, Oct 12, 2011 at 10:40:14AM -0700, Tejun Heo wrote:
...
> 
> In general, I think making freezer work nicely with the rest of the
> system is a good idea and have been working towards that direction.
> Allowing a frozen task to be killed is not only handy for use cases
> like above but also makes solving freezer involved deadlocks much less
> likely and easier to solve.  Another that I have in mind is allowing
> ptrace from unfrozen task to a frozen task.  This can be helpful in
> general debugging (currently attaching to multi-threaded, violently
> cloning process is quite cumbersome) and userland checkpointing.

Yeah, being able to ptrace a frozen cgroup would be great for us.
We stick with the signal start/stop cycle at the moment, but the final
target is cgroups and the freezer, of course. (Btw, while we were poking
at the freezer code I noticed that there is no shortcut to move all tasks
in a cgroup into the root cgroup, so I guess "echo -1 > tasks" might be a
good addition, to move all tasks from some particular cgroup to the root
in a single action.)

> 
> I was working toward these and had some of the patches in Rafael's
> tree but then korg went down and we lost track of the tree and I had a
> pretty long vacation.  I can't say for sure but am aiming to achieve
> the goals during the next devel cycle.
>

This is a wish list after all, so the target is set and only time is
needed to implement all this ;)

	Cyrill

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 17:57                       ` J. Bruce Fields
@ 2011-10-12 18:25                         ` Kyle Moffett
  2011-10-12 19:04                           ` J. Bruce Fields
  0 siblings, 1 reply; 81+ messages in thread
From: Kyle Moffett @ 2011-10-12 18:25 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
>> Where all of this winds up interesting in the field of oncoming kernel
>> work is that uids are persistent and are stored in file systems.  So
>> once we have all of the permission checks in the kernel tweaked to care
>> about user namespaces we next look at the filesystems.   The easy
>> initial implementation is going to be just associating a user namespace
>> with a super block.  But farther out being able to store uids from
>> different user namespaces on the same filesystem becomes an interesting
>> problem.
>
> Yipes.  Why would anyone want to do that?

Consider an NFS file server providing collaborative access to multiple
independently managed domains (EG: several different universities),
each with their own LDAP userid database and Kerberos services.

The common server is in its own realm and allows cross-realm
authentication to the other university realms, using the origin realm
to decide what namespace to map each user into.

If you are just doing read-only operations then you don't need any
kind of namespace persistence on the NFS server's storage.  On the
other hand, if you want to allow users to collaborate and create ACLs
then you need something dramatically more involved.

On the wire, the kerberos server would simply identify each NFSv4 ACL
entry with a particular realm ID, but in the backend it would need
some filesystem-level disambiguation (possibly the recently-proposed
RichACL features?)

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 18:25                         ` Kyle Moffett
@ 2011-10-12 19:04                           ` J. Bruce Fields
  2011-10-12 19:12                             ` Kyle Moffett
  0 siblings, 1 reply; 81+ messages in thread
From: J. Bruce Fields @ 2011-10-12 19:04 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 02:25:04PM -0400, Kyle Moffett wrote:
> On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
> >> Where all of this winds up interesting in the field of oncoming kernel
> >> work is that uids are persistent and are stored in file systems.  So
> >> once we have all of the permission checks in the kernel tweaked to care
> >> about user namespaces we next look at the filesystems.   The easy
> >> initial implementation is going to be just associating a user namespace
> >> with a super block.  But farther out being able to store uids from
> >> different user namespaces on the same filesystem becomes an interesting
> >> problem.
> >
> > Yipes.  Why would anyone want to do that?
> 
> Consider an NFS file server providing collaborative access to multiple
> independently managed domains (EG: several different universities),
> each with their own LDAP userid database and Kerberos services.
> 
> The common server is in its own realm and allows cross-realm
> authentication to the other university realms, using the origin realm
> to decide what namespace to map each user into.
> 
> If you are just doing read-only operations then you don't need any
> kind of namespace persistence on the NFS server's storage.  On the
> other hand, if you want to allow users to collaborate and create ACLs
> then you need something dramatically more involved.

Yeah, OK, I suppose I'd imagined mapping into the server's id space
somehow for that case, but I suppose this would be cleaner.  Still,
seems like a big pain....

> On the wire, the kerberos server would simply identify each NFSv4 ACL
> entry with a particular realm ID, but in the backend it would need
> some filesystem-level disambiguation (possibly the recently-proposed
> RichACL features?)

That doesn't help with owner and group.

--b.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 19:04                           ` J. Bruce Fields
@ 2011-10-12 19:12                             ` Kyle Moffett
  2011-10-14 15:54                               ` Ted Ts'o
  0 siblings, 1 reply; 81+ messages in thread
From: Kyle Moffett @ 2011-10-12 19:12 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 15:04, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Wed, Oct 12, 2011 at 02:25:04PM -0400, Kyle Moffett wrote:
>> On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
>> > On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
>> >> Where all of this winds up interesting in the field of oncoming kernel
>> >> work is that uids are persistent and are stored in file systems.  So
>> >> once we have all of the permission checks in the kernel tweaked to care
>> >> about user namespaces we next look at the filesystems.   The easy
>> >> initial implementation is going to be just associating a user namespace
>> >> with a super block.  But farther out being able to store uids from
>> >> different user namespaces on the same filesystem becomes an interesting
>> >> problem.
>> >
>> > Yipes.  Why would anyone want to do that?
>>
>> Consider an NFS file server providing collaborative access to multiple
>> independently managed domains (EG: several different universities),
>> each with their own LDAP userid database and Kerberos services.
>>
>> The common server is in its own realm and allows cross-realm
>> authentication to the other university realms, using the origin realm
>> to decide what namespace to map each user into.
>>
>> If you are just doing read-only operations then you don't need any
>> kind of namespace persistence on the NFS server's storage.  On the
>> other hand, if you want to allow users to collaborate and create ACLs
>> then you need something dramatically more involved.
>
> Yeah, OK, I suppose I'd imagined mapping into the server's id space
> somehow for that case, but I suppose this would be cleaner.  Still,
> seems like a big pain....
>
>> On the wire, the kerberos server would simply identify each NFSv4 ACL
>> entry with a particular realm ID, but in the backend it would need
>> some filesystem-level disambiguation (possibly the recently-proposed
>> RichACL features?)
>
> That doesn't help with owner and group.

Well, you're going to need to introduce a bunch of new xattrs to
handle the namespacing anyways.

As I understand it you can use RichACLs to grant all the same
privileges as owner and group, so you can simply map the real
namespaced owner and group into RichACLs (or another xattr) and force
the inode uid/gid to be root/root (or maybe nobody/nogroup or
something).

I am of course making it sound a million times easier than it's
actually likely to be, but I do think it's possible without too many
odd corner cases.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-11 15:49               ` Andrew G. Morgan
  2011-10-12  2:31                 ` Serge E. Hallyn
@ 2011-10-12 20:51                 ` Lennart Poettering
  1 sibling, 0 replies; 81+ messages in thread
From: Lennart Poettering @ 2011-10-12 20:51 UTC (permalink / raw)
  To: Andrew G. Morgan
  Cc: Serge E. Hallyn, Kay Sievers, Alan Cox, linux-kernel, harald,
	david, greg, KaiGai Kohei

On Tue, 11.10.11 08:49, Andrew G. Morgan (morgan@kernel.org) wrote:

> 
> The benefit of Kai Gai's patch was that it exported the actual names
> of the capabilities rather than have them only stored in libcap.
> 
> It is possible to use CAP_IS_SUPPORTED(cap) (in libcap-2.21) to figure
> out the maximum capability supported by the running kernel.
> 
>   https://sites.google.com/site/fullycapable/release-notes-for-libcap

Oh, hmm, interesting. I have now changed my code to make use of this,
but I can't say it's pretty, because I basically have to search linearly
for the highest capability supported if that's what I want to know.

So, I guess this solves the problem for now, but I'd still like to see a
proper API for this.
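For reference, the linear probe described above can be sketched without
libcap at all, using the raw prctl() call that CAP_IS_SUPPORTED() wraps
(a hedged sketch, not the actual systemd code):

```c
/* Walk capability numbers upward until PR_CAPBSET_READ reports one the
 * running kernel does not know about; the last accepted number is the
 * highest supported capability.  No privileges are required. */
#include <sys/prctl.h>

int highest_supported_cap(void)
{
    int cap = 0;

    /* PR_CAPBSET_READ fails with EINVAL for unknown capabilities */
    while (prctl(PR_CAPBSET_READ, cap, 0, 0, 0) >= 0)
        cap++;
    return cap - 1;
}
```

At the time of this thread the answer on the newest kernels would have
been around 35 (CAP_WAKE_ALARM); the point is only that the probe is
O(n), which is why a proper API is being asked for.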

Anyway, thanks for the pointer,

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-10 13:09             ` Theodore Tso
@ 2011-10-13  0:28               ` Dave Chinner
  2011-10-14 15:47                 ` Ted Ts'o
  0 siblings, 1 reply; 81+ messages in thread
From: Dave Chinner @ 2011-10-13  0:28 UTC (permalink / raw)
  To: Theodore Tso
  Cc: dave, Hugo Mills, Kay Sievers, Alan Cox, linux-kernel, lennart,
	harald, david, greg, Chris Mason, Btrfs mailing list

On Mon, Oct 10, 2011 at 09:09:37AM -0400, Theodore Tso wrote:
> 
> On Oct 10, 2011, at 7:18 AM, David Sterba wrote:
> 
> > "Resetting the UUID on btrfs isn't a quick-and-easy thing - you
> > have to walk the entire tree and change every object. We've got
> > a bad-hack in meego that uses btrfs-debug-tree and changes the
> > UUID while it runs the entire tree, but it's ugly as hell."
> 
> Changing the UUID is going to be harder for ext4 as well, once we
> integrate metadata checksums. 

And for XFS, we're modifying the on-disk format to encode the UUID
into every single piece of metadata in the filesystem. Hence
changing it entails a similar problem to btrfs - an entire
filesystem metadata RMW cycle.

> So while it makes sense to have
> on-line ways of updating labels for mounted file systems it
> probably makes muchness sense to support it for UUIDs.
                     ^^^^ less
Agreed.

> I suspect what it means in practice is that it will be useful for
> file systems to provide fs image copying tools that also generate
> a new UUID while you're at it, for use by IT administrators and
> embedded systems manufacturers.

Yup. xfs_admin already provides an interface for offline
modification of the UUID for XFS filesystems. I.e. clone the
filesystem using xfs_copy, then run xfs_admin -U generate <clone> to
generate a new uuid in the cloned copy before you mount the
clone....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-12 18:16       ` Cyrill Gorcunov
@ 2011-10-14 15:38         ` Frederic Weisbecker
  2011-10-14 16:01           ` Cyrill Gorcunov
  2011-10-19 21:19           ` Paul Menage
  0 siblings, 2 replies; 81+ messages in thread
From: Frederic Weisbecker @ 2011-10-14 15:38 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Tejun Heo, Andrew Morton, Kay Sievers, linux-kernel, lennart,
	harald, david, greg, Kirill A. Shutemov, Oleg Nesterov,
	Paul Menage, Rafael J. Wysocki, Pavel Emelyanov

On Wed, Oct 12, 2011 at 10:16:41PM +0400, Cyrill Gorcunov wrote:
> On Wed, Oct 12, 2011 at 10:40:14AM -0700, Tejun Heo wrote:
> ...
> > 
> > In general, I think making freezer work nicely with the rest of the
> > system is a good idea and have been working towards that direction.
> > Allowing a frozen task to be killed is not only handy for use cases
> > like above but also makes solving freezer involved deadlocks much less
> > likely and easier to solve.  Another that I have in mind is allowing
> > ptrace from unfrozen task to a frozen task.  This can be helpful in
> > general debugging (currently attaching to multi-threaded, violently
> > cloning process is quite cumbersome) and userland checkpointing.
> 
> Yeah, being able to ptrace a frozen cgroup would be great for us.
> We stick with the signal start/stop cycle at the moment, but the final
> target is cgroups and the freezer, of course. (Btw, while we were poking
> at the freezer code I noticed that there is no shortcut to move all tasks
> in a cgroup into the root cgroup, so I guess "echo -1 > tasks" might be a
> good addition, to move all tasks from some particular cgroup to the root
> in a single action.)

Well, wouldn't it be better to pull that complexity to userspace?
After all, moving tasks from a cgroup to another is not a performance
critical operation so that probably doesn't need to be all handled by
the kernel.

If one worries about concurrent clone/fork while moving tasks, then
freezing the cgroup and moving its tasks away from userspace could
be enough?
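As a rough illustration of how small that userspace side would be, here
is a hedged sketch of such a mover: it copies every pid listed in one
cgroup's tasks file into another's, reopening the destination for each
pid as cgroupfs expects. The paths are parameters; nothing here is taken
from an actual tool.

```c
/* Move every task listed in src_tasks into dst_tasks (e.g. a frozen
 * cgroup's tasks file into the root cgroup's).  cgroupfs accepts one
 * pid per write, so the destination is reopened for each pid.
 * Returns the number of pids moved, or -1 on error. */
#include <stdio.h>

int move_all_tasks(const char *src_tasks, const char *dst_tasks)
{
    FILE *in = fopen(src_tasks, "r");
    int pid, moved = 0;

    if (!in)
        return -1;
    while (fscanf(in, "%d", &pid) == 1) {
        FILE *out = fopen(dst_tasks, "a");
        if (!out) {
            fclose(in);
            return -1;
        }
        fprintf(out, "%d\n", pid);  /* one pid per write() */
        fclose(out);
        moved++;
    }
    fclose(in);
    return moved;
}
```

Run against a frozen group, as suggested above, so that no new pids
appear while the tasks file is being read.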

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-13  0:28               ` Dave Chinner
@ 2011-10-14 15:47                 ` Ted Ts'o
  0 siblings, 0 replies; 81+ messages in thread
From: Ted Ts'o @ 2011-10-14 15:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dave, Hugo Mills, Kay Sievers, Alan Cox, linux-kernel, lennart,
	harald, david, greg, Chris Mason, Btrfs mailing list

On Thu, Oct 13, 2011 at 11:28:39AM +1100, Dave Chinner wrote:
> Yup. xfs_admin already provides an interface for offline
> modification of the UUID for XFS filesystems. I.e. clone the
> filesystem using xfs_copy, then run xfs_admin -U generate <clone> to
> generate a new uuid in the cloned copy before you mount the
> clone....

This is probably another thing which perhaps Ric Wheeler's proposed
"generic LVM / file system management front end" should abstract away,
since every single file system has a different way of setting the UUID
in an off-line way.  It's a relatively specialized feature, so I
wouldn't call it high priority to implement first.

	      	      	       	  - Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 19:12                             ` Kyle Moffett
@ 2011-10-14 15:54                               ` Ted Ts'o
  2011-10-14 18:04                                 ` Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: Ted Ts'o @ 2011-10-14 15:54 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: J. Bruce Fields, Eric W. Biederman, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 03:12:34PM -0400, Kyle Moffett wrote:
> Well, you're going to need to introduce a bunch of new xattrs to
> handle the namespacing anyways.
> 
> As I understand it you can use RichACLs to grant all the same
> privileges as owner and group, so you can simply map the real
> namespaced owner and group into RichACLs (or another xattr) and force
> the inode uid/gid to be root/root (or maybe nobody/nogroup or
> something).

It's going to be all about mapping tables, and whether the mapping is
done in userspace or kernel space.  For example, you might want to
take a Kerberos principal name and map it to a 128-bit identifier
(64-bit realm id + 64-bit user id), and that in turn might require
mapping to some 32-bit Linux uid namespace.

If people want to support multiple 32-bit Linux uid namespaces, then
it's a question of how you name these uid name spaces, and how to
manage the mapping tables outside of kernel, and how the mapping
tables get loaded into the kernel.
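To make the shape of the problem concrete, here is a toy sketch of the
two-stage mapping described above; the table contents and the packing
are purely illustrative, not from any proposed kernel interface:

```c
/* Map a (64-bit realm id, 64-bit user id) pair -- e.g. derived from a
 * Kerberos principal -- down to a 32-bit kernel uid via a lookup
 * table.  A real implementation would load this table from userspace
 * rather than hard-code it. */
#include <stdint.h>

struct idmap_entry {
    uint64_t realm;  /* which uid namespace / Kerberos realm */
    uint64_t user;   /* user id within that realm */
    uint32_t kuid;   /* the 32-bit uid the kernel stores on disk */
};

static const struct idmap_entry idmap[] = {
    { 0x1001, 500, 10500 },  /* university A's user 500 */
    { 0x2002, 500, 20500 },  /* university B's user 500: distinct kuid */
};

#define INVALID_KUID ((uint32_t)-1)

uint32_t map_to_kuid(uint64_t realm, uint64_t user)
{
    for (unsigned i = 0; i < sizeof(idmap) / sizeof(idmap[0]); i++)
        if (idmap[i].realm == realm && idmap[i].user == user)
            return idmap[i].kuid;
    return INVALID_KUID;  /* unmapped principals get no on-disk uid */
}
```

Note that the same local user id (500) in two realms must land on two
different kernel uids, which is exactly where the management burden of
these tables comes from.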

> I am of course making it sound a million times easier than it's
> actually likely to be, but I do think it's possible without too many
> odd corner cases.

It's not the corner cases, it's all of the different name spaces that
different system administrators and their sites are going to want to
use, and how to support them all....

And of course, once we start naming uid name spaces, eventually
someone will want to virtualize containers, and then we will have
namespaces for namespaces.  (It's turtles all the way down!  :-)

						- Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-14 15:38         ` Frederic Weisbecker
@ 2011-10-14 16:01           ` Cyrill Gorcunov
  2011-10-14 16:08             ` Cyrill Gorcunov
  2011-10-19 21:19           ` Paul Menage
  1 sibling, 1 reply; 81+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 16:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tejun Heo, Andrew Morton, Kay Sievers, linux-kernel, lennart,
	harald, david, greg, Kirill A. Shutemov, Oleg Nesterov,
	Paul Menage, Rafael J. Wysocki, Pavel Emelyanov

On Fri, Oct 14, 2011 at 05:38:47PM +0200, Frederic Weisbecker wrote:
...
> 
> Well, wouldn't it be better to pull that complexity to userspace?
> After all, moving tasks from a cgroup to another is not a performance
> critical operation so that probably doesn't need to be all handled by
> the kernel.
> 
> If one worries about concurrent clone/fork while moving tasks, then
> freezing the cgroup and moving its tasks away from userspace could
> be enough?

Well, it's not a problem to do it task-by-task; still, I think
it's just a convenient shortcut :)

	Cyrill

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-14 16:01           ` Cyrill Gorcunov
@ 2011-10-14 16:08             ` Cyrill Gorcunov
  2011-10-14 16:19               ` Frederic Weisbecker
  0 siblings, 1 reply; 81+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 16:08 UTC (permalink / raw)
  To: Frederic Weisbecker, Tejun Heo, Andrew Morton, Kay Sievers,
	linux-kernel, lennart, harald, david, greg, Kirill A. Shutemov,
	Oleg Nesterov, Paul Menage, Rafael J. Wysocki, Pavel Emelyanov

On Fri, Oct 14, 2011 at 08:01:10PM +0400, Cyrill Gorcunov wrote:
...
> > Well, wouldn't it be better to pull that complexity to userspace?
> > After all, moving tasks from a cgroup to another is not a performance
> > critical operation so that probably doesn't need to be all handled by
> > the kernel.
> > 
> > If one worries about concurrent clone/fork while moving tasks, then
> > freezing the cgroup and moving its tasks away from userspace could
> > be enough?
> 
> Well, it's not a problem to do it task-by-task; still, I think
> it's just a convenient shortcut :)
> 

Frederic, don't get me wrong, but when I tried cgroups and the freezer
for the first time (and I did it not with any script, but by hand), it
made me scream that once I had moved a number of tasks into some freezer
cgroup I then needed to move them all back again. Of course there is a
way to write some script or whatever, but I thought we could have some
echo -1 shortcut. Anyway, I can live with it ;)

	Cyrill

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-14 16:08             ` Cyrill Gorcunov
@ 2011-10-14 16:19               ` Frederic Weisbecker
  0 siblings, 0 replies; 81+ messages in thread
From: Frederic Weisbecker @ 2011-10-14 16:19 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Tejun Heo, Andrew Morton, Kay Sievers, linux-kernel, lennart,
	harald, david, greg, Kirill A. Shutemov, Oleg Nesterov,
	Paul Menage, Rafael J. Wysocki, Pavel Emelyanov

On Fri, Oct 14, 2011 at 08:08:09PM +0400, Cyrill Gorcunov wrote:
> On Fri, Oct 14, 2011 at 08:01:10PM +0400, Cyrill Gorcunov wrote:
> ...
> > > Well, wouldn't it be better to pull that complexity to userspace?
> > > After all, moving tasks from a cgroup to another is not a performance
> > > critical operation so that probably doesn't need to be all handled by
> > > the kernel.
> > > 
> > > If one worries about concurrent clone/fork while moving tasks, then
> > > freezing the cgroup and moving its tasks away from userspace could
> > > be enough?
> > 
> > Well, it's not a problem to do it task-by-task; still, I think
> > it's just a convenient shortcut :)
> > 
> 
> Frederic, don't get me wrong, but when I tried cgroups and the freezer
> for the first time (and I did it not with any script, but by hand), it
> made me scream that once I had moved a number of tasks into some freezer
> cgroup I then needed to move them all back again. Of course there is a
> way to write some script or whatever, but I thought we could have some
> echo -1 shortcut. Anyway, I can live with it ;)

Using a script would be a much better shortcut.
The script may be a few dozen lines. Push that into the kernel and it
may be much more.

Have a close overall look at kernel/cgroup.c and ask yourself whether
you would like to add 100 more lines to it, just to avoid doing this in
userspace ;)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-14 15:54                               ` Ted Ts'o
@ 2011-10-14 18:04                                 ` Eric W. Biederman
  2011-10-14 21:58                                   ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-14 18:04 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering,
	Kay Sievers, linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

Ted Ts'o <tytso@mit.edu> writes:

>> I am of course making it sound a million times easier than it's
>> actually likely to be, but I do think it's possible without too many
>> odd corner cases.
>
> It's not the corner cases, it's all of the different name spaces that
> different system administrators and their sites are going to want to
> use, and how to support them all....
>
> And of course, once we start naming uid name spaces, eventually
> someone will want to virtualize containers, and then we will have
> namespaces for namespaces.  (It's turtles all the way down!  :-)

I have found and merged a solution that allows us to name namespaces
without needing namespaces for namespaces.

Eric

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-14 18:04                                 ` Eric W. Biederman
@ 2011-10-14 21:58                                   ` H. Peter Anvin
  2011-10-16  9:42                                     ` Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2011-10-14 21:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On 10/14/2011 11:04 AM, Eric W. Biederman wrote:
> 
> I have found and merged a solution that allows us to name namespaces
> without needing namespaces for namespaces.
> 

Something based on UUIDs, perhaps?

UUIDs are kind of exactly this, after all... a single namespace designed
to be large and random enough to be globally unique without a central
registration authority.

	-hpa

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-14 21:58                                   ` H. Peter Anvin
@ 2011-10-16  9:42                                     ` Eric W. Biederman
  2011-10-30 20:11                                       ` H. Peter Anvin
  0 siblings, 1 reply; 81+ messages in thread
From: Eric W. Biederman @ 2011-10-16  9:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano,
	Paul Menage

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/14/2011 11:04 AM, Eric W. Biederman wrote:
>> 
>> I have found and merged a solution that allows us to name namespaces
>> without needing namespaces for namespaces.
>> 
>
> Something based on UUIDs, perhaps?
>
> UUIDs are kind of exactly this, after all... a single namespace designed
> to be large and random enough to be globally unique without a central
> registration authority.

mount --bind /proc/self/ns/net /var/run/netns/<name>

When we want to refer to the namespace in syscalls we pass a file
descriptor we received from opening the namespace reference object.

That moves the entire naming problem into the file namespace.
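The reference objects described above behave like ordinary files, which
is what makes the scheme work; a hedged sketch follows (joining the
namespace via setns() would additionally need CAP_SYS_ADMIN, so this
only opens and identifies it):

```c
/* Open this process's network-namespace reference and return the inode
 * number that identifies the namespace.  Two processes share a network
 * namespace iff their /proc/<pid>/ns/net references share this inode;
 * bind-mounting the file elsewhere keeps the namespace alive under a
 * name of your choosing. */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

long net_ns_inode(void)
{
    struct stat st;
    int fd = open("/proc/self/ns/net", O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    /* A supervisor holding fd could pass it to setns(fd, CLONE_NEWNET)
     * to enter this namespace -- that step needs CAP_SYS_ADMIN. */
    close(fd);
    return (long)st.st_ino;
}
```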

Eric

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
                   ` (6 preceding siblings ...)
  2011-10-11 23:16 ` Andrew Morton
@ 2011-10-19 21:12 ` Paul Menage
  2011-10-19 23:03   ` Lennart Poettering
  7 siblings, 1 reply; 81+ messages in thread
From: Paul Menage @ 2011-10-19 21:12 UTC (permalink / raw)
  To: Kay Sievers; +Cc: linux-kernel, lennart, harald, david, greg

On Thu, Oct 6, 2011 at 4:17 PM, Kay Sievers <kay.sievers@vrfy.org> wrote:
>
> * fork throttling mechanism as basic cgroup functionality that is
> available in all hierarchies independent of the controllers used:
> This is important to implement race-free killing of all members of a
> cgroup, so that cgroup member processes cannot fork faster than a cgroup
> supervisor process could kill them. This needs to be recursive, so that
> not only a cgroup but all its subgroups are covered as well.

If that's your end goal, then an alternative to the freezer support
that others have mentioned would be a 'cgroup.signal' file which, when
written to, would send that signal to all members of the cgroup at
once. Perhaps simpler than having to get in the way of the fork path
more and manage a rate-limit.

>
> * allow user xattrs to be set on files in the cgroupfs (and maybe
> procfs?)

What would the use case be for this?

Paul

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-14 15:38         ` Frederic Weisbecker
  2011-10-14 16:01           ` Cyrill Gorcunov
@ 2011-10-19 21:19           ` Paul Menage
  1 sibling, 0 replies; 81+ messages in thread
From: Paul Menage @ 2011-10-19 21:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Cyrill Gorcunov, Tejun Heo, Andrew Morton, Kay Sievers,
	linux-kernel, lennart, harald, david, greg, Kirill A. Shutemov,
	Oleg Nesterov, Rafael J. Wysocki, Pavel Emelyanov

On Fri, Oct 14, 2011 at 8:38 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> Well, wouldn't it be better to pull that complexity to userspace?
> After all, moving tasks from a cgroup to another is not a performance
> critical operation so that probably doesn't need to be all handled by
> the kernel.

I'd always assumed that too, but apparently on very many (possibly the
majority of?) Linux systems, it actually is performance-critical.

Specifically, Android bounces tasks in and out of a "foreground
low-latency" cpu cgroup at a fairly high rate, and has found the
performance hit from the locking to be a problem on multi-core phones.
Hence Colin Cross's patches to avoid calls to synchronize_rcu() in the
attach path.

Paul

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-19 21:12 ` Paul Menage
@ 2011-10-19 23:03   ` Lennart Poettering
  2011-10-19 23:09     ` Paul Menage
  0 siblings, 1 reply; 81+ messages in thread
From: Lennart Poettering @ 2011-10-19 23:03 UTC (permalink / raw)
  To: Paul Menage; +Cc: Kay Sievers, linux-kernel, harald, david, greg

On Wed, 19.10.11 14:12, Paul Menage (paul@paulmenage.org) wrote:

> On Thu, Oct 6, 2011 at 4:17 PM, Kay Sievers <kay.sievers@vrfy.org> wrote:
> >
> > * fork throttling mechanism as basic cgroup functionality that is
> > available in all hierarchies independent of the controllers used:
> > This is important to implement race-free killing of all members of a
> > cgroup, so that cgroup member processes cannot fork faster than a cgroup
> > supervisor process could kill them. This needs to be recursive, so that
> > not only a cgroup but all its subgroups are covered as well.
> 
> If that's your end goal, then an alternative to the freezer support
> that others have mentioned would be a 'cgroup.signal' file which, when
> written to, would send that signal to all members of the cgroup at
> once. Perhaps simpler than having to get in the way of the fork path
> and manage a rate-limit.

For our systemd usecase a cgroup.signal file would not be useful. This
is because we actually kill all members of the service's cgroup plus the
main process of the service, which is usually also in the service's
cgroup but sometimes isn't (for example: when the user logs in, the
whole /sbin/login process ends up in the user's session cgroup, and is
removed from the original service cgroup). Since we want to avoid
killing the main service process twice in the case where it isn't in the
service cgroup we'd hence prefer to have some fork throttling logic in
place, so that we can kill members flexibly in accordance with these
rules.

> > * allow user xattrs to be set on files in the cgroupfs (and maybe
> > procfs?)
> 
> What would the use case be for this?

Attaching meta information to services, in an easily discoverable
way. For example, in systemd we create one cgroup for each service, and
could then store data like the main pid of the specific service as an
xattr on the cgroup itself. That way we'd have almost all service state
in the cgroupfs, which would make it possible to terminate systemd and
later restart it without losing any state information. But there's more:
for example, some very peculiar services cannot be terminated on
shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
services in question could just mark that on their cgroup, by setting an
xattr. On the more desktopy side of things there are other
possibilities: for example there are plans defining what an application
is along the lines of a cgroup (i.e. an app being a collection of
processes). With xattrs one could then attach an icon or human readable
program name on the cgroup.

The key idea is that this would allow attaching runtime meta information
to cgroups and everything they model (services, apps, vms) in a way that
doesn't need any complex userspace infrastructure, has good access
control (because the file system enforces that anyway, and there's the
"trusted." xattr namespace), supports notifications (inotify), and can
easily be shared among applications.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-19 23:03   ` Lennart Poettering
@ 2011-10-19 23:09     ` Paul Menage
  2011-10-19 23:31       ` Lennart Poettering
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Menage @ 2011-10-19 23:09 UTC (permalink / raw)
  To: Lennart Poettering; +Cc: Kay Sievers, linux-kernel, harald, david, greg

On Wed, Oct 19, 2011 at 4:03 PM, Lennart Poettering
<mzxreary@0pointer.de> wrote:
>
> For our systemd usecase a cgroup.signal file would not be useful. This
> is because we actually kill all members of the service's cgroup plus the
> main process of the service, which is usually also in the service's
> cgroup but sometimes isn't (for example: when the user logs in, the
> whole /sbin/login process ends up in the user's session cgroup, and is
> removed from the original service cgroup). Since we want to avoid
> killing the main service process twice in the case where it isn't in the
> service cgroup we'd hence prefer to have some fork throttling logic in
> place, so that we can kill members flexibly in accordance with these
> rules.

By fork-throttling, do you just mean "0 or unlimited", or would you
actually want some kind of rate-limited throttling? If the former,
then I agree with Frederic that his task counter should solve that
problem.

Paul

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-07  0:13   ` Lennart Poettering
  2011-10-07  1:57     ` Andi Kleen
@ 2011-10-19 23:16     ` H. Peter Anvin
  1 sibling, 0 replies; 81+ messages in thread
From: H. Peter Anvin @ 2011-10-19 23:16 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Andi Kleen, Kay Sievers, linux-kernel, harald, david, greg

On 10/06/2011 05:13 PM, Lennart Poettering wrote:
> 
> Well, I am aware of PR_SET_NAME, but that modifies comm, not argv[]. And
> while "top" indeed shows the former, "ps" shows the latter. We are looking
> for a nice way to modify argv[] without having to reuse space
> from environ[] like most current Linux implementations of
> setproctitle() do.
> 
> A while back there were patches for PR_SET_PROCTITLE_AREA floating
> around. We'd like to see something like that merged one day.
> 

A saner thing would be if the initial argv[] area couldn't be modified
at all, an explicit system call were required to change the title
displayed by ps or top, and ps or top could be forced to show the
argv as initially passed to the process.

	-hpa


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-19 23:09     ` Paul Menage
@ 2011-10-19 23:31       ` Lennart Poettering
  2011-10-22 10:21         ` Frederic Weisbecker
  0 siblings, 1 reply; 81+ messages in thread
From: Lennart Poettering @ 2011-10-19 23:31 UTC (permalink / raw)
  To: Paul Menage; +Cc: Kay Sievers, linux-kernel, harald, david, greg

On Wed, 19.10.11 16:09, Paul Menage (paul@paulmenage.org) wrote:

> On Wed, Oct 19, 2011 at 4:03 PM, Lennart Poettering
> <mzxreary@0pointer.de> wrote:
> >
> > For our systemd usecase a cgroup.signal file would not be useful. This
> > is because we actually kill all members of the service's cgroup plus the
> > main process of the service, which is usually also in the service's
> > cgroup but sometimes isn't (for example: when the user logs in, the
> > whole /sbin/login process ends up in the user's session cgroup, and is
> > removed from the original service cgroup). Since we want to avoid
> > killing the main service process twice in the case where it isn't in the
> > service cgroup we'd hence prefer to have some fork throttling logic in
> > place, so that we can kill members flexibly in accordance with these
> > rules.
> 
> By fork-throttling, do you just mean "0 or unlimited", or would you
> actually want some kind of rate-limited throttling? If the former,
> then I agree with Frederic that his task counter should solve that
> problem.

Given that shutting down some services might involve forking off a few
things (think: a shell script handling shutdown which forks off a couple
of shell utilities) we'd want something that is between "from now on no
forking at all" and "unlimited forking". This could be done in many
different ways: we'd be happy if we could do time-based rate limiting,
but we'd also be fine with defining a certain budget of additional forks
a cgroup can do (i.e. "from now on you can do 50 more forks, then you'll
get EPERM").

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-19 23:31       ` Lennart Poettering
@ 2011-10-22 10:21         ` Frederic Weisbecker
  2011-10-22 15:28           ` Lennart Poettering
  0 siblings, 1 reply; 81+ messages in thread
From: Frederic Weisbecker @ 2011-10-22 10:21 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Paul Menage, Kay Sievers, linux-kernel, harald, david, greg

On Thu, Oct 20, 2011 at 01:31:11AM +0200, Lennart Poettering wrote:
> On Wed, 19.10.11 16:09, Paul Menage (paul@paulmenage.org) wrote:
> 
> > On Wed, Oct 19, 2011 at 4:03 PM, Lennart Poettering
> > <mzxreary@0pointer.de> wrote:
> > >
> > > For our systemd usecase a cgroup.signal file would not be useful. This
> > > is because we actually kill all members of the service's cgroup plus the
> > > main process of the service, which is usually also in the service's
> > > cgroup but sometimes isn't (for example: when the user logs in, the
> > > whole /sbin/login process ends up in the user's session cgroup, and is
> > > removed from the original service cgroup). Since we want to avoid
> > > killing the main service process twice in the case where it isn't in the
> > > service cgroup we'd hence prefer to have some fork throttling logic in
> > > place, so that we can kill members flexibly in accordance with these
> > > rules.
> > 
> > By fork-throttling, do you just mean "0 or unlimited", or would you
> > actually want some kind of rate-limited throttling? If the former,
> > then I agree with Frederic that his task counter should solve that
> > problem.
> 
> Given that shutting down some services might involve forking off a few
> things (think: a shell script handling shutdown which forks off a couple
> of shell utilities) we'd want something that is between "from now on no
> forking at all" and "unlimited forking". This could be done in many
> different ways: we'd be happy if we could do time-based rate limiting,
> but we'd also be fine with defining a certain budget of additional forks
> a cgroup can do (i.e. "from now on you can do 50 more forks, then you'll
> get EPERM").

Thinking more about it, you shouldn't use the task counter subsystem for
Systemd. This is a subsystem that may bring some significant overhead
(i.e. walking the entire hierarchy on every fork and exit). That doesn't
sound like something suitable for an init process.

If you really need to stop any forks in a cgroup, then a cgroup core feature
handling that very single purpose would be better and more efficient.

That said I'm not really sure why you're using cgroups in Systemd.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-22 10:21         ` Frederic Weisbecker
@ 2011-10-22 15:28           ` Lennart Poettering
  2011-10-25  5:40             ` Li Zefan
  0 siblings, 1 reply; 81+ messages in thread
From: Lennart Poettering @ 2011-10-22 15:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul Menage, Kay Sievers, linux-kernel, harald, david, greg

On Sat, 22.10.11 12:21, Frederic Weisbecker (fweisbec@gmail.com) wrote:

> If you really need to stop any forks in a cgroup, then a cgroup core feature
> handling that very single purpose would be better and more efficient.

We'd be happy with that and this is what we originally suggested actually.

> That said I'm not really sure why you're using cgroups in Systemd.

We want to reliably label processes in a hierarchical way, so that this
is inherited by all child processes, cannot be overridden by unprivileged
code (subject to some classic Unix access control handling) and get
notifications when such a label stops referring to any process. We use
that for sticking the service name on a process, so that all CGI
processes of Apache are automatically assigned the same service as
apache itself. And we want a notification when all of apache's processes
die. And we also want to be able to kill Apache completely by killing
all its processes.

cgroups provides us with all of that, though the last two items only in
a suboptimal way: notification of cgroups running empty is ugly, since
it is done by spawning a usermode helper (we'd prefer a netlink msg or
so), and the process killing is a bit racy.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-22 15:28           ` Lennart Poettering
@ 2011-10-25  5:40             ` Li Zefan
  2011-10-30 17:18               ` Lennart Poettering
  0 siblings, 1 reply; 81+ messages in thread
From: Li Zefan @ 2011-10-25  5:40 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Frederic Weisbecker, Paul Menage, Kay Sievers, linux-kernel,
	harald, david, greg

Lennart Poettering wrote:
> On Sat, 22.10.11 12:21, Frederic Weisbecker (fweisbec@gmail.com) wrote:
> 
>> If you really need to stop any forks in a cgroup, then a cgroup core feature
>> handling that very single purpose would be better and more efficient.
> 
> We'd be happy with that and this is what we originally suggested actually.
> 
>> That said I'm not really sure why you're using cgroups in Systemd.
> 
> We want to reliably label processes in a hierarchical way, so that this
> is inherited by all child processes, cannot be overridden by unprivileged
> code (subject to some classic Unix access control handling) and get
> notifications when such a label stops referring to any process. We use
> that for sticking the service name on a process, so that all CGI
> processes of Apache are automatically assigned the same service as
> apache itself. And we want a notification when all of apache's processes
> die. And we also want to be able to kill Apache completely by killing
> all its processes.
> 
> cgroups provides us with all of that, though the last two items only in
> a suboptimal way: notification of cgroups running empty is ugly, since
> it is done by spawning a usermode helper (we'd prefer a netlink msg or
> so), and the process killing is a bit racy.
> 

How about using eventfd? You can create an eventfd for the specific "tasks"
file, and when the cgroup gets empty (no task in it), you'll get a notification.

It should be easy to implement, since cgroup already supports eventfd-based
API.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-25  5:40             ` Li Zefan
@ 2011-10-30 17:18               ` Lennart Poettering
  2011-11-01  1:27                 ` Li Zefan
  0 siblings, 1 reply; 81+ messages in thread
From: Lennart Poettering @ 2011-10-30 17:18 UTC (permalink / raw)
  To: Li Zefan
  Cc: Frederic Weisbecker, Paul Menage, Kay Sievers, linux-kernel,
	harald, david, greg

On Tue, 25.10.11 13:40, Li Zefan (lizf@cn.fujitsu.com) wrote:

> > cgroups provides us with all of that, though the last two items only in
> > a suboptimal way: notification of cgroups running empty is ugly, since
> > it is done by spawning a usermode helper (we'd prefer a netlink msg or
> > so), and the process killing is a bit racy.
> 
> How about using eventfd? You can create an eventfd for the specific "tasks"
> file, and when the cgroup gets empty (no task in it), you'll get a notification.
> 
> It should be easy to implement, since cgroup already supports eventfd-based
> API.

I am quite convinced that using eventfd() like this is quite ugly. The
current eventfd() logic is not recursive anyway, hence wouldn't help us
much.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-16  9:42                                     ` Eric W. Biederman
@ 2011-10-30 20:11                                       ` H. Peter Anvin
  2011-11-01 13:38                                         ` Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: H. Peter Anvin @ 2011-10-30 20:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano,
	Paul Menage

On 10/16/2011 02:42 AM, Eric W. Biederman wrote:
>>
>> Something based on UUIDs, perhaps?
>>
>> UUIDs are kind of exactly this, after all... a single namespace designed
>> to be large and random enough to be globally unique without a central
>> registration authority.
> 
> mount --bind /proc/self/ns/net /var/run/netns/<name>
> 
> When we want to refer to the namespace in syscalls we pass a file
> descriptor we received from opening the namespace reference object.
> 
> That moves the entire naming problem into the file namespace.
> 

That doesn't solve what I think of as the *real* problem.

The real problem is just another instance of what I sometimes refer to
as the "alien metadata problem": the alien metadata problem (which crops
up in *all kinds* of contexts, including containers, namespaces, virtual
machines, building distribution disk images, and backups) is the fact
that you would like to be able to store, manipulate and preserve, on
disk and in a mounted filesystem, a set of metadata which may not be the
"currently active" metadata.

There are two forms of "solutions" to this: one where the filesystem
still only contains one set of metadata, but it is not currently active,
and one where the filesystem contains multiple sets of metadata for the
same files at the same time, any one of which can be active (and
different ones may be active for different namespaces.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-30 17:18               ` Lennart Poettering
@ 2011-11-01  1:27                 ` Li Zefan
  0 siblings, 0 replies; 81+ messages in thread
From: Li Zefan @ 2011-11-01  1:27 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Frederic Weisbecker, Paul Menage, Kay Sievers, linux-kernel,
	harald, david, greg

Lennart Poettering wrote:
> On Tue, 25.10.11 13:40, Li Zefan (lizf@cn.fujitsu.com) wrote:
> 
>>> cgroups provides us with all of that, though the last two items only in
>>> a suboptimal way: notification of cgroups running empty is ugly, since
>>> it is done by spawning a usermode helper (we'd prefer a netlink msg or
>>> so), and the process killing is a bit racy.
>>
>> How about using eventfd? You can create an eventfd for the specific "tasks"
>> file, and when the cgroup gets empty (no task in it), you'll get a notification.
>>
>> It should be easy to implement, since cgroup already supports eventfd-based
>> API.
> 
> I am quite convinced that using eventfd() like this is quite ugly. The
> current eventfd() logic is not recursive anyway, hence wouldn't help us
> much.
> 

I remember in an earlier email you stated you want to be able to kill all tasks
in a cgroup and its children, and you used the word "recursive". But what do you
mean by "recursive" for empty cgroup notification? Do you expect the listener
to receive a message if a cgroup or any of its children becomes empty?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-30 20:11                                       ` H. Peter Anvin
@ 2011-11-01 13:38                                         ` Eric W. Biederman
  0 siblings, 0 replies; 81+ messages in thread
From: Eric W. Biederman @ 2011-11-01 13:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano,
	Paul Menage

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/16/2011 02:42 AM, Eric W. Biederman wrote:
>>>
>>> Something based on UUIDs, perhaps?
>>>
>>> UUIDs are kind of exactly this, after all... a single namespace designed
>>> to be large and random enough to be globally unique without a central
>>> registration authority.
>> 
>> mount --bind /proc/self/ns/net /var/run/netns/<name>
>> 
>> When we want to refer to the namespace in syscalls we pass a file
>> descriptor we received from opening the namespace reference object.
>> 
>> That moves the entire naming problem into the file namespace.
>> 
>
> That doesn't solve what I think of as the *real* problem.

It solves the problem of needing a namespace of namespaces, and it
avoids requiring universal agreement between all filesystems on all
operating systems on how things should look.

In not precluding different solutions it makes a large stride forward.

> The real problem is just another instance of what I sometimes refer to
> as the "alien metadata problem": the alien metadata problem (which crops
> up in *all kinds* of contexts, including containers, namespaces, virtual
> machines, building distribution disk images, and backups) is the fact
> that you would like to be able to store, manipulate and preserve, on
> disk and in a mounted filesystem, a set of metadata which may not be the
> "currently active" metadata.

When you throw in network filesystems with different notions of
meta-data, things get even more interesting.

> There are two forms of "solutions" to this: one where the filesystem
> still only contains one set of metadata, but it is not currently active,
> and one where the filesystem contains multiple sets of metadata for the
> same files at the same time, any one of which can be active (and
> different ones may be active for different namespaces.)

There is an important tool that seems to be missing from your toolbox.
- Mapping the metadata on the file into different contexts.

The way I see it, classic Unix filesystems have exactly one context
that their meta-data is expected to work in: the context in which
the filesystem is mounted.

However it is very easy to conceive of that context being specified
at a per inode granularity.  In which case at least the backup and
the distribution disk image problem can be solved by trivially
specifying a different context, and associating a user namespace with
that context.  Then you switch into the user namespace to manipulate
"alien metadata".

Where mapping comes in is when those files are accessed from
another context besides the one where all of their metadata
falls.  At which point you can map all of the files to be owned
by the user who is responsible for making backups.  The mapping
is a bit like the root squash setting.


For the common case I expect we will settle on a well defined acl across
the native unix filesystems that allows us to make this persistent.  For
network filesystems with their broader interoperability requirements how
to specify this gets a little more interesting.

For purposes of implementation it doesn't matter to me if that acl is
a uuid or a unique string.  For management of the data it might.

How I expect a native Linux filesystem to work when it encounters a
file with a user namespace acl is that it will work like nfsv4 and
do an upcall into userspace, to ask the appropriate userspace daemon
how to understand this acl.  The userspace mapping agent will say:
Oh, you want the user namespace for "hpa-backups"?  Let's see:
/var/run/userns/hpa-backups exists, let me just tell the kernel about
that mapping.  Or perhaps the user namespace does not exist, so the
mapping daemon would go out and create it by consulting configuration
files in /etc, learning that everything in "hpa-backups" should be a
child user namespace, with the user "hpa" able to switch into that
user namespace without root permission.

Files with meta-data for more than one usernamespace/context I expect
to work similarly.  Care needs to be taken that it doesn't drive the
administrator crazy.

Eric

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [lxc-devel] Detecting if you are running in a container
  2011-10-12 16:59             ` Kay Sievers
@ 2011-11-01 22:05               ` Michael Tokarev
  2011-11-01 23:51                 ` Eric W. Biederman
  0 siblings, 1 reply; 81+ messages in thread
From: Michael Tokarev @ 2011-11-01 22:05 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, greg, Paul Menage, linux-kernel, david,
	Eric W. Biederman, Linux Containers, Linux Containers,
	Serge E. Hallyn, harald

[Replying to an oldish email...]

On 12.10.2011 20:59, Kay Sievers wrote:
> On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote:
>> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
> 
>>> - udev.  All of the kernel interfaces for udev should be supported in
>>>   current kernels.  However I believe udev is useless because container
>>>   start drops CAP_MKNOD so we can't do evil things.  So I would
>>>   recommend basing the startup of udev on presence of CAP_MKNOD.
>>
>> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
>> in a systemd world makes use of that.
> 
> Done.
> 
> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb

Maybe CAP_MKNOD isn't actually a good idea, having in mind devtmpfs?

Without CAP_MKNOD, is devtmpfs still being populated internally by
the kernel, so that udev only needs to change ownership/permissions
and maintain symlinks in response to device changes, and perform
other duties (reacting to other types of events) normally?

In other words, provided devtmpfs works even without CAP_MKNOD,
I can easily imagine a whole system running without this capability
from the very early boot, with all functionality in place, including
udev and what not...

And having CAP_MKNOD in a container may not be that bad either, as long
as the cgroup device permissions are set correctly - some nodes may
still need to be created, even in unprivileged containers.  Who filters
out CAP_MKNOD during container startup (I don't see it in the code,
which only removes CAP_SYS_BOOT, and even that due to a current
limitation), and what evil things can be done if it is not filtered?

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [lxc-devel] Detecting if you are running in a container
  2011-11-01 22:05               ` [lxc-devel] " Michael Tokarev
@ 2011-11-01 23:51                 ` Eric W. Biederman
  2011-11-02  8:08                   ` Michael Tokarev
  0 siblings, 1 reply; 81+ messages in thread
From: Eric W. Biederman @ 2011-11-01 23:51 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Kay Sievers, Lennart Poettering, greg, Paul Menage, linux-kernel,
	david, Linux Containers, Linux Containers, Serge E. Hallyn,
	harald

Michael Tokarev <mjt@tls.msk.ru> writes:

> [Replying to an oldish email...]
>
> On 12.10.2011 20:59, Kay Sievers wrote:
>> On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote:
>>> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
>> 
>>>> - udev.  All of the kernel interfaces for udev should be supported in
>>>>   current kernels.  However I believe udev is useless because container
>>>>   start drops CAP_MKNOD so we can't do evil things.  So I would
>>>>   recommend basing the startup of udev on presence of CAP_MKNOD.
>>>
>>> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
>>> in a systemd world makes use of that.
>> 
>> Done.
>> 
>> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb
>
> Maybe CAP_MKNOD isn't actually a good idea, having in mind devtmpfs?
>
> Without CAP_MKNOD, is devtmpfs still being populated internally by
> the kernel, so that udev only needs to change ownership/permissions
> and maintain symlinks in response to device changes, and perform
> other duties (reacting to other types of events) normally?
>
> In other words, provided devtmpfs works even without CAP_MKNOD,
> I can easily imagine a whole system running without this capability
> from the very early boot, with all functionality in place, including
> udev and what not...

Agreed, devtmpfs does pretty much make dropping CAP_MKNOD useless.  I
expect we should verify that whoever mounts devtmpfs has CAP_MKNOD.

> And having CAP_MKNOD in a container may not be that bad either, as long
> as the cgroup device permissions are set correctly - some nodes may
> still need to be created, even in unprivileged containers.  Who filters
> out CAP_MKNOD during container startup (I don't see it in the code,
> which only removes CAP_SYS_BOOT, and even that due to a current
> limitation), and what evil things can be done if it is not filtered?

If you don't filter which device nodes a process can read/write, then
that process can access any device on the system.  Steal the keyboard,
the X display, access any filesystem, directly access memory.  Basically
the process can escalate that permission to full control of the system
without needing any kernel bugs to help it.

Eric

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [lxc-devel] Detecting if you are running in a container
  2011-11-01 23:51                 ` Eric W. Biederman
@ 2011-11-02  8:08                   ` Michael Tokarev
  0 siblings, 0 replies; 81+ messages in thread
From: Michael Tokarev @ 2011-11-02  8:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kay Sievers, Lennart Poettering, greg, Paul Menage, linux-kernel,
	david, Linux Containers, Linux Containers, Serge E. Hallyn,
	harald

On 02.11.2011 03:51, Eric W. Biederman wrote:
[]
>> And having CAP_MKNOD in a container may not be that bad either, as long
>> as the cgroup device permissions are set correctly - some nodes may
>> still need to be created, even in unprivileged containers.  Who filters
>> out CAP_MKNOD during container startup (I don't see it in the code,
>> which only removes CAP_SYS_BOOT, and even that due to a current
>> limitation), and what evil things can be done if it is not filtered?
> 
> If you don't filter which device nodes a process can read/write, then
> that process can access any device on the system.  Steal the keyboard,
> the X display, access any filesystem, directly access memory.  Basically
> the process can escalate that permission to full control of the system
> without needing any kernel bugs to help it.

There's CAP_MKNOD, and cgroup devices.{allow,deny}.  Even with CAP_MKNOD,
a container can not _use_ devices not allowed in the latter.  That's what
I'm talking about - finer-grained control than CAP_MKNOD exists.  And
my question was about this context - with proper cgroup-level device
control in place, what harm can CAP_MKNOD do?

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: A Plumber’s Wish List for Linux
  2011-10-07 13:40 ` Alan Cox
@ 2011-10-07 14:57   ` Alexander E. Patrakov
  0 siblings, 0 replies; 81+ messages in thread
From: Alexander E. Patrakov @ 2011-10-07 14:57 UTC (permalink / raw)
  To: linux-kernel; +Cc: Bastien ROUCARIES, Kay Sievers, david, greg, lennart, harald

On 07.10.2011 19:40, Alan Cox wrote:
> On Fri, 7 Oct 2011 15:09:16 +0200
> Bastien ROUCARIES<roucaries.bastien@gmail.com>  wrote:
>
>> For fat a special xattr for root inode ?
>
> If, as Kay says, it's a specific magic part of the directory and we need
> this just as a fixup for FAT and NTFS, then an ioctl on it will probably
> do the job nicely. Sometimes stretching existing APIs in semi-sane ways
> ends up producing worse special cases (like tar accidentally restoring
> the volume label, depending on its settings).

I'd say that we also need to consider exFAT, which is available only as 
a FUSE driver, and the fact that the FUSE-based NTFS driver has more 
features than the kernel one. And, frankly speaking, I don't think that 
FAT belongs in the kernel at all. So any proposed solution has to be 
extensible enough to also cover FUSE.

-- 
Alexander E. Patrakov



* Re: A Plumber’s Wish List for Linux
       [not found] <CAE2SPAZci=u__d58phePCftVr_e+i+N2YU-JYjGDG_b3TmYTSQ@mail.gmail.com>
@ 2011-10-07 13:40 ` Alan Cox
  2011-10-07 14:57   ` Alexander E. Patrakov
  0 siblings, 1 reply; 81+ messages in thread
From: Alan Cox @ 2011-10-07 13:40 UTC (permalink / raw)
  To: Bastien ROUCARIES; +Cc: Kay Sievers, david, greg, lennart, linux-kernel, harald

On Fri, 7 Oct 2011 15:09:16 +0200
Bastien ROUCARIES <roucaries.bastien@gmail.com> wrote:

> For fat a special xattr for root inode ?

If, as Kay says, it's a specific magic part of the directory and we need
this just as a fixup for FAT and NTFS, then an ioctl on it will probably
do the job nicely. Sometimes stretching existing APIs in semi-sane ways
ends up producing worse special cases (like tar accidentally restoring
the volume label, depending on its settings).

Alan


end of thread, other threads:[~2011-11-02  8:08 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
2011-10-06 23:17 A Plumber’s Wish List for Linux Kay Sievers
2011-10-06 23:46 ` Andi Kleen
2011-10-07  0:13   ` Lennart Poettering
2011-10-07  1:57     ` Andi Kleen
2011-10-07 15:58       ` Lennart Poettering
2011-10-19 23:16     ` H. Peter Anvin
2011-10-07  7:49 ` Matt Helsley
2011-10-07 16:01   ` Lennart Poettering
2011-10-08  4:24     ` Eric W. Biederman
2011-10-10 16:31       ` Lennart Poettering
2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
2011-10-10 21:41           ` Lennart Poettering
2011-10-11  5:40             ` Eric W. Biederman
2011-10-11  6:54             ` Eric W. Biederman
2011-10-12 16:59             ` Kay Sievers
2011-11-01 22:05               ` [lxc-devel] " Michael Tokarev
2011-11-01 23:51                 ` Eric W. Biederman
2011-11-02  8:08                   ` Michael Tokarev
2011-10-11  1:32           ` Ted Ts'o
2011-10-11  2:05             ` Matt Helsley
2011-10-11  3:25               ` Ted Ts'o
2011-10-11  6:42                 ` Eric W. Biederman
2011-10-11 12:53                   ` Theodore Tso
2011-10-11 21:16                     ` Eric W. Biederman
2011-10-11 22:30                       ` david
2011-10-12  4:26                         ` Eric W. Biederman
2011-10-12  5:10                           ` david
2011-10-12 15:08                             ` Serge E. Hallyn
2011-10-12 17:57                       ` J. Bruce Fields
2011-10-12 18:25                         ` Kyle Moffett
2011-10-12 19:04                           ` J. Bruce Fields
2011-10-12 19:12                             ` Kyle Moffett
2011-10-14 15:54                               ` Ted Ts'o
2011-10-14 18:04                                 ` Eric W. Biederman
2011-10-14 21:58                                   ` H. Peter Anvin
2011-10-16  9:42                                     ` Eric W. Biederman
2011-10-30 20:11                                       ` H. Peter Anvin
2011-11-01 13:38                                         ` Eric W. Biederman
2011-10-11 22:25               ` david
2011-10-07 10:12 ` A Plumber’s Wish List for Linux Alan Cox
2011-10-07 10:28   ` Kay Sievers
2011-10-07 10:38     ` Alan Cox
2011-10-07 12:46       ` Kay Sievers
2011-10-07 13:39         ` Theodore Tso
2011-10-07 15:21         ` Hugo Mills
2011-10-10 11:18           ` A Plumber’s " David Sterba
2011-10-10 11:18             ` David Sterba
2011-10-10 13:09             ` Theodore Tso
2011-10-13  0:28               ` Dave Chinner
2011-10-14 15:47                 ` Ted Ts'o
2011-10-11 13:14             ` Serge E. Hallyn
2011-10-11 15:49               ` Andrew G. Morgan
2011-10-12  2:31                 ` Serge E. Hallyn
2011-10-12 20:51                 ` Lennart Poettering
2011-10-08  9:53         ` A Plumber’s " Bastien ROUCARIES
2011-10-09  3:15           ` Alex Elsayed
2011-10-07 16:07       ` Valdis.Kletnieks
2011-10-07 12:35 ` Vivek Goyal
2011-10-07 18:59 ` Greg KH
2011-10-09 12:20   ` Kay Sievers
2011-10-09  8:45 ` Rusty Russell
2011-10-11 23:16 ` Andrew Morton
2011-10-12  0:53   ` Frederic Weisbecker
2011-10-12  0:59   ` Frederic Weisbecker
     [not found]     ` <20111012174014.GE6281@google.com>
2011-10-12 18:16       ` Cyrill Gorcunov
2011-10-14 15:38         ` Frederic Weisbecker
2011-10-14 16:01           ` Cyrill Gorcunov
2011-10-14 16:08             ` Cyrill Gorcunov
2011-10-14 16:19               ` Frederic Weisbecker
2011-10-19 21:19           ` Paul Menage
2011-10-19 21:12 ` Paul Menage
2011-10-19 23:03   ` Lennart Poettering
2011-10-19 23:09     ` Paul Menage
2011-10-19 23:31       ` Lennart Poettering
2011-10-22 10:21         ` Frederic Weisbecker
2011-10-22 15:28           ` Lennart Poettering
2011-10-25  5:40             ` Li Zefan
2011-10-30 17:18               ` Lennart Poettering
2011-11-01  1:27                 ` Li Zefan
     [not found] <CAE2SPAZci=u__d58phePCftVr_e+i+N2YU-JYjGDG_b3TmYTSQ@mail.gmail.com>
2011-10-07 13:40 ` Alan Cox
2011-10-07 14:57   ` Alexander E. Patrakov
