All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression wrt mounting /proc in user namespace in 3.13
@ 2013-11-15 16:41 Daniel P. Berrange
       [not found] ` <20131115164123.GN28794-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Daniel P. Berrange @ 2013-11-15 16:41 UTC (permalink / raw)
  To: Containers; +Cc: Eric W. Biederman

Just testing libvirt with user namespaces on current Fedora rawhide
3.13.0-0.rc0.git3.2.fc21.x86_64 kernel, I'm now getting an error when
we attempt to mount /proc

  # virsh -c lxc:/// start shell
  error: Failed to start domain shell
  error: internal error: guest failed to start: Failed to mount proc on /proc type proc flags=e: Operation not permitted

The syscall failing is

  mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)


On the host OS the default Fedora environment has the following mounts
present

  # grep /proc /proc/mounts 
  proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
  systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=41,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
  binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
  sunrpc /proc/fs/nfsd nfsd rw,relatime 0 0

  # ls /proc/fs/nfsd/
  export_features  filehandle      nfsv4gracetime  nfsv4recoverydir  pool_threads  reply_cache_stats        threads            unlock_ip
  exports          max_block_size  nfsv4leasetime  pool_stats        portlist      supported_krb5_enctypes  unlock_filesystem  versions

  # ls /proc/sys/fs/binfmt_misc/
  qemu-alpha  qemu-cris        qemu-microblazeel  qemu-mips64el  qemu-ppc64       qemu-sh4    qemu-sparc32plus  status
  qemu-arm    qemu-m68k        qemu-mips          qemu-mipsel    qemu-ppc64abi32  qemu-sh4eb  qemu-sparc64
  qemu-armeb  qemu-microblaze  qemu-mips64        qemu-ppc       qemu-s390x       qemu-sparc  register


Only if I umount both of the /proc/sys/fs/binfmt_misc/ entries
am I able to get past this EPERM error code.

Looking at GIT history I see this change as a likely candidate for
something which has changed in this area:

  commit e51db73532955dc5eaba4235e62b74b460709d5b
  Author: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  Date:   Sat Mar 30 19:57:41 2013 -0700

    userns: Better restrictions on when proc and sysfs can be mounted
    
    Rely on the fact that another flavor of the filesystem is already
    mounted and do not rely on state in the user namespace.
    
    Verify that the mounted filesystem is not covered in any significant
    way.  I would love to verify that the previously mounted filesystem
    has no mounts on top but there are at least the directories
    /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
    for other filesystems to mount on top of.
    
    Refactor the test into a function named fs_fully_visible and call that
    function from the mount routines of proc and sysfs.  This makes this
    test local to the filesystems involved and the results current of when
    the mounts take place, removing a weird threading of the user
    namespace, the mount namespace and the filesystems themselves.
    
    Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>


My guess is fs_fully_visible() is returning false, and thus causing the
proc_mount() call to return EPERM, but I'm unclear why this would happen,
or if this is indeed a correct hypothesis.


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found] ` <20131115164123.GN28794-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-11-16 16:48   ` Serge E. Hallyn
       [not found]     ` <20131116164840.GA4441-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-16 16:48 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: Containers, Eric W. Biederman

Quoting Daniel P. Berrange (berrange-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org):
> Just testing libvirt with user namespaces on current Fedora rawhide
> 3.13.0-0.rc0.git3.2.fc21.x86_64 kernel, I'm now getting an error when
> we attempt to mount /proc

Thanks, I saw the same thing with 3.12 on friday afternoon, and decided
I must be too haggard from a week of unrelated work to think straight.

This definately will be a problem, making user namespace unusable for
containers.

>   # virsh -c lxc:/// start shell
>   error: Failed to start domain shell
>   error: internal error: guest failed to start: Failed to mount proc on /proc type proc flags=e: Operation not permitted
> 
> The syscall failing is
> 
>   mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = -1 EPERM (Operation not permitted)
> 
> 
> On the host OS the default Fedora environment has the following mounts
> present
> 
>   # grep /proc /proc/mounts 
>   proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
>   systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=41,pgrp=1,timeout=300,minproto=5,maxproto=5,direct 0 0
>   binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
>   sunrpc /proc/fs/nfsd nfsd rw,relatime 0 0
> 
>   # ls /proc/fs/nfsd/
>   export_features  filehandle      nfsv4gracetime  nfsv4recoverydir  pool_threads  reply_cache_stats        threads            unlock_ip
>   exports          max_block_size  nfsv4leasetime  pool_stats        portlist      supported_krb5_enctypes  unlock_filesystem  versions
> 
>   # ls /proc/sys/fs/binfmt_misc/
>   qemu-alpha  qemu-cris        qemu-microblazeel  qemu-mips64el  qemu-ppc64       qemu-sh4    qemu-sparc32plus  status
>   qemu-arm    qemu-m68k        qemu-mips          qemu-mipsel    qemu-ppc64abi32  qemu-sh4eb  qemu-sparc64
>   qemu-armeb  qemu-microblaze  qemu-mips64        qemu-ppc       qemu-s390x       qemu-sparc  register
> 
> 
> Only if I umount both of the /proc/sys/fs/binfmt_misc/ entries
> am I able to get past this EPERM error code.
> 
> Looking at GIT history I see this change as a likely candidate for
> something which has changed in this area:
> 
>   commit e51db73532955dc5eaba4235e62b74b460709d5b
>   Author: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>   Date:   Sat Mar 30 19:57:41 2013 -0700
> 
>     userns: Better restrictions on when proc and sysfs can be mounted
>     
>     Rely on the fact that another flavor of the filesystem is already
>     mounted and do not rely on state in the user namespace.
>     
>     Verify that the mounted filesystem is not covered in any significant
>     way.  I would love to verify that the previously mounted filesystem
>     has no mounts on top but there are at least the directories
>     /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
>     for other filesystems to mount on top of.
>     
>     Refactor the test into a function named fs_fully_visible and call that
>     function from the mount routines of proc and sysfs.  This makes this
>     test local to the filesystems involved and the results current of when
>     the mounts take place, removing a weird threading of the user
>     namespace, the mount namespace and the filesystems themselves.
>     
>     Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> 
> 
> My guess is fs_fully_visible() is returning false, and thus causing the
> proc_mount() call to return EPERM, but I'm unclear why this would happen,
> or if this is indeed a correct hypothesis.
> 
> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]     ` <20131116164840.GA4441-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-17  3:06       ` Serge E. Hallyn
       [not found]         ` <20131117030653.GA7670-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-17  3:06 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Eric W. Biederman

Low on power and no charger, but a quick test printing out if a mount is
!S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:

[   92.939650] nlink is 1 for ino 8733 (0:3)

(that's major 0 minor 3)

-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]         ` <20131117030653.GA7670-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-18  3:19           ` Serge E. Hallyn
       [not found]             ` <20131118031932.GA17621-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-18  3:19 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Eric W. Biederman

Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> Low on power and no charger, but a quick test printing out if a mount is
> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
> 
> [   92.939650] nlink is 1 for ino 8733 (0:3)
> 
> (that's major 0 minor 3)

Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
underlying directory is empty, and nlink is showing up as 1.
 
Can we just get the nlink check changed to check for < 3 instead
of ==2 ?

-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]             ` <20131118031932.GA17621-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-18  4:52               ` Gao feng
       [not found]                 ` <52899D09.5080202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Gao feng @ 2013-11-18  4:52 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Eric W. Biederman

On 11/18/2013 11:19 AM, Serge E. Hallyn wrote:
> Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
>> Low on power and no charger, but a quick test printing out if a mount is
>> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
>>
>> [   92.939650] nlink is 1 for ino 8733 (0:3)
>>
>> (that's major 0 minor 3)
> 
> Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
> underlying directory is empty, and nlink is showing up as 1.
>  
> Can we just get the nlink check changed to check for < 3 instead
> of ==2 ?
> 

I already reported this problem to Eric,hi is working on fix this problem.

nlink is not the right thing to check if a directory is null. since
in all of filesystems, parent dir's nlink is increase only when we
create sub-dir.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]                 ` <52899D09.5080202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2013-11-18 14:08                   ` Serge E. Hallyn
       [not found]                     ` <20131118140830.GA22075-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-18 14:08 UTC (permalink / raw)
  To: Gao feng; +Cc: Containers, Eric W. Biederman

Quoting Gao feng (gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org):
> On 11/18/2013 11:19 AM, Serge E. Hallyn wrote:
> > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> >> Low on power and no charger, but a quick test printing out if a mount is
> >> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
> >>
> >> [   92.939650] nlink is 1 for ino 8733 (0:3)
> >>
> >> (that's major 0 minor 3)
> > 
> > Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
> > underlying directory is empty, and nlink is showing up as 1.
> >  
> > Can we just get the nlink check changed to check for < 3 instead
> > of ==2 ?
> > 
> 
> I already reported this problem to Eric,hi is working on fix this problem.
> 
> nlink is not the right thing to check if a directory is null. since
> in all of filesystems, parent dir's nlink is increase only when we
> create sub-dir.

This whole thing feels very brittle.  May I also point out that simply
setting perms appears to work just fine instead of overmounting.  If I
chmod 700 /proc/swaps, unshare my pid and mount namespaces and remount
/proc, then /proc/swaps is 700 in the new mount.  Since our concern is
with a new user namespace, which will be limited to world perms, this
should suffice and allow us to skip all this nonsense.

Eric?

-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]                     ` <20131118140830.GA22075-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-18 18:01                       ` Serge E. Hallyn
       [not found]                         ` <20131118180134.GA24156-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-18 18:01 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Eric W. Biederman

Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> Quoting Gao feng (gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org):
> > On 11/18/2013 11:19 AM, Serge E. Hallyn wrote:
> > > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> > >> Low on power and no charger, but a quick test printing out if a mount is
> > >> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
> > >>
> > >> [   92.939650] nlink is 1 for ino 8733 (0:3)
> > >>
> > >> (that's major 0 minor 3)
> > > 
> > > Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
> > > underlying directory is empty, and nlink is showing up as 1.
> > >  
> > > Can we just get the nlink check changed to check for < 3 instead
> > > of ==2 ?
> > > 
> > 
> > I already reported this problem to Eric,hi is working on fix this problem.
> > 
> > nlink is not the right thing to check if a directory is null. since
> > in all of filesystems, parent dir's nlink is increase only when we
> > create sub-dir.
> 
> This whole thing feels very brittle.  May I also point out that simply
> setting perms appears to work just fine instead of overmounting.  If I
> chmod 700 /proc/swaps, unshare my pid and mount namespaces and remount
> /proc, then /proc/swaps is 700 in the new mount.  Since our concern is
> with a new user namespace, which will be limited to world perms, this
> should suffice and allow us to skip all this nonsense.
> 
> Eric?
> 
> -serge

So yeah, I think this patch should be reverted, rather than "fixed".

-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]                         ` <20131118180134.GA24156-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-19  1:51                           ` Eric W. Biederman
       [not found]                             ` <87k3g5gnuv.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-19  1:51 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers

"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
>> Quoting Gao feng (gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org):
>> > On 11/18/2013 11:19 AM, Serge E. Hallyn wrote:
>> > > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
>> > >> Low on power and no charger, but a quick test printing out if a mount is
>> > >> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
>> > >>
>> > >> [   92.939650] nlink is 1 for ino 8733 (0:3)
>> > >>
>> > >> (that's major 0 minor 3)
>> > > 
>> > > Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
>> > > underlying directory is empty, and nlink is showing up as 1.
>> > >  
>> > > Can we just get the nlink check changed to check for < 3 instead
>> > > of ==2 ?
>> > > 
>> > 
>> > I already reported this problem to Eric,hi is working on fix this problem.
>> > 
>> > nlink is not the right thing to check if a directory is null. since
>> > in all of filesystems, parent dir's nlink is increase only when we
>> > create sub-dir.
>> 
>> This whole thing feels very brittle.  May I also point out that simply
>> setting perms appears to work just fine instead of overmounting.  If I
>> chmod 700 /proc/swaps, unshare my pid and mount namespaces and remount
>> /proc, then /proc/swaps is 700 in the new mount.  Since our concern is
>> with a new user namespace, which will be limited to world perms, this
>> should suffice and allow us to skip all this nonsense.
>> 
>> Eric?
>> 
>> -serge
>
> So yeah, I think this patch should be reverted, rather than "fixed".

I will get back to this shortly.   The trivial fix is to kill the nlink
test or make the nlink test look for nlink <= 2.

The point of the original patch was to remove the horrible special case
that implemented the test to see if proc was moutned, so that we could
trivial prevent mount of proc in chroots or other legacy environments
where root did not want proc to be mounted.

The refinement I attempted and failed was to verify that people had only
mounted on empty proc directories.  Just to verify that nothing becomes
possible when mounting a new copy of proc that wasn't possible before.

My test is totally for that last case is totally braindead.  Permissions
are nice but the actual concern is setups that did silly things that are
not namespace aware.

I am totally against reverting the cleanup.  But I have no problem
killing the nlink test.  Because we did not have protection against
overmounted directories earlier.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]                             ` <87k3g5gnuv.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-19  3:47                               ` Serge E. Hallyn
  2013-11-26 18:10                               ` Serge E. Hallyn
  1 sibling, 0 replies; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-19  3:47 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> >> Quoting Gao feng (gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org):
> >> > On 11/18/2013 11:19 AM, Serge E. Hallyn wrote:
> >> > > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> >> > >> Low on power and no charger, but a quick test printing out if a mount is
> >> > >> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
> >> > >>
> >> > >> [   92.939650] nlink is 1 for ino 8733 (0:3)
> >> > >>
> >> > >> (that's major 0 minor 3)
> >> > > 
> >> > > Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
> >> > > underlying directory is empty, and nlink is showing up as 1.
> >> > >  
> >> > > Can we just get the nlink check changed to check for < 3 instead
> >> > > of ==2 ?
> >> > > 
> >> > 
> >> > I already reported this problem to Eric,hi is working on fix this problem.
> >> > 
> >> > nlink is not the right thing to check if a directory is null. since
> >> > in all of filesystems, parent dir's nlink is increase only when we
> >> > create sub-dir.
> >> 
> >> This whole thing feels very brittle.  May I also point out that simply
> >> setting perms appears to work just fine instead of overmounting.  If I
> >> chmod 700 /proc/swaps, unshare my pid and mount namespaces and remount
> >> /proc, then /proc/swaps is 700 in the new mount.  Since our concern is
> >> with a new user namespace, which will be limited to world perms, this
> >> should suffice and allow us to skip all this nonsense.
> >> 
> >> Eric?
> >> 
> >> -serge
> >
> > So yeah, I think this patch should be reverted, rather than "fixed".
> 
> I will get back to this shortly.   The trivial fix is to kill the nlink
> test or make the nlink test look for nlink <= 2.
> 
> The point of the original patch was to remove the horrible special case
> that implemented the test to see if proc was moutned, so that we could
> trivial prevent mount of proc in chroots or other legacy environments
> where root did not want proc to be mounted.
> 
> The refinement I attempted and failed was to verify that people had only
> mounted on empty proc directories.  Just to verify that nothing becomes
> possible when mounting a new copy of proc that wasn't possible before.
> 
> My test is totally for that last case is totally braindead.  Permissions
> are nice but the actual concern is setups that did silly things that are
> not namespace aware.

They don't have to be namespace aware, and can keep doing what they're
doing - long as they also chmod o-rwx the mountpoint.

> I am totally against reverting the cleanup.  But I have no problem
> killing the nlink test.  Because we did not have protection against
> overmounted directories earlier.

Uh, I *think* what you're suggesting sounds good :)

thanks,
-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Regression wrt mounting /proc in user namespace in 3.13
       [not found]                             ` <87k3g5gnuv.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-19  3:47                               ` Serge E. Hallyn
@ 2013-11-26 18:10                               ` Serge E. Hallyn
       [not found]                                 ` <20131126181043.GA25492-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  1 sibling, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 18:10 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> >> Quoting Gao feng (gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org):
> >> > On 11/18/2013 11:19 AM, Serge E. Hallyn wrote:
> >> > > Quoting Serge E. Hallyn (serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org):
> >> > >> Low on power and no charger, but a quick test printing out if a mount is
> >> > >> !S_ISDIR or has nlink !=2 in fs_fully_visible() gives me:
> >> > >>
> >> > >> [   92.939650] nlink is 1 for ino 8733 (0:3)
> >> > >>
> >> > >> (that's major 0 minor 3)
> >> > > 
> >> > > Ok, so that is for binfmt_misc on /proc/sys/fs/binfmt_misc.  The
> >> > > underlying directory is empty, and nlink is showing up as 1.
> >> > >  
> >> > > Can we just get the nlink check changed to check for < 3 instead
> >> > > of ==2 ?
> >> > > 
> >> > 
> >> > I already reported this problem to Eric,hi is working on fix this problem.
> >> > 
> >> > nlink is not the right thing to check if a directory is null. since
> >> > in all of filesystems, parent dir's nlink is increase only when we
> >> > create sub-dir.
> >> 
> >> This whole thing feels very brittle.  May I also point out that simply
> >> setting perms appears to work just fine instead of overmounting.  If I
> >> chmod 700 /proc/swaps, unshare my pid and mount namespaces and remount
> >> /proc, then /proc/swaps is 700 in the new mount.  Since our concern is
> >> with a new user namespace, which will be limited to world perms, this
> >> should suffice and allow us to skip all this nonsense.
> >> 
> >> Eric?
> >> 
> >> -serge
> >
> > So yeah, I think this patch should be reverted, rather than "fixed".
> 
> I will get back to this shortly.   The trivial fix is to kill the nlink
> test or make the nlink test look for nlink <= 2.
> 
> The point of the original patch was to remove the horrible special case
> that implemented the test to see if proc was moutned, so that we could
> trivial prevent mount of proc in chroots or other legacy environments
> where root did not want proc to be mounted.
> 
> The refinement I attempted and failed was to verify that people had only
> mounted on empty proc directories.  Just to verify that nothing becomes
> possible when mounting a new copy of proc that wasn't possible before.
> 
> My test is totally for that last case is totally braindead.  Permissions
> are nice but the actual concern is setups that did silly things that are
> not namespace aware.
> 
> I am totally against reverting the cleanup.  But I have no problem
> killing the nlink test.  Because we did not have protection against
> overmounted directories earlier.

Ok, I do need this fixed - shall I send a patch of have you already tested
a version on  your end?

thanks,
-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [REVIEW][PATCH 0/3] userns fixes for v3.13-rc1
       [not found]                                 ` <20131126181043.GA25492-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-27  0:14                                   ` Eric W. Biederman
       [not found]                                     ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  0:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


If folks will take a look I would appreciate it.  This is my pile of
user namespace fixes against v3.13-rc1.  All of these are minor
regressions that should be backported to v3.12.  Unfortunately they were
discovered late enough and I was running sick that I haven't had a time
to get to them earlier.

Eric W. Biederman (3):
      vfs: In d_path don't call d_dname on a mount point
      fork:  Allow CLONE_PARENT after setns(CLONE_NEWPID)
      vfs: Fix a regression in mounting proc

 fs/dcache.c    |    7 ++++++-
 fs/namespace.c |    2 +-
 kernel/fork.c  |    2 +-
 3 files changed, 8 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                     ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27  0:16                                       ` Eric W. Biederman
  2013-11-27  1:58                                         ` Serge E. Hallyn
       [not found]                                         ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27  0:16                                         ` Eric W. Biederman
  2013-11-27  0:17                                       ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman
  2 siblings, 2 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  0:16 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org) wrote:
> Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
> "proc: Fix the namespace inode permission checks." converted
> the namespace files into symlinks. The same commit changed
> the way namespace bind mounts appear in /proc/mounts:
>   $ mount --bind /proc/self/ns/ipc /mnt/ipc
> Originally:
>   $ cat /proc/mounts | grep ipc
>   proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0
>
> After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
>   $ cat /proc/mounts | grep ipc
>   proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0
>
> This breaks userspace which expects the 2nd field in
> /proc/mounts to be a valid path.

The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to
dentries allocated with d_alloc_pseudo that we can mount, and
that have interesting names printed out with d_dname.

When these files are bind mounted /proc/mounts is not currently
displaying the mount point correctly because d_dname is called instead
of just displaying the path where the file is mounted.

Solve this by adding an explicit check to distinguish mounted pseudo
inodes and unmounted pseudo inodes.  Unmounted pseudo inodes always
use mount of their filesstem as the mnt_root  in their path making
these two cases easy to distinguish.

CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Reported-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4bdb300b16e2..f7282ebf1a37 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3061,8 +3061,13 @@ char *d_path(const struct path *path, char *buf, int buflen)
 	 * thus don't need to be hashed.  They also don't need a name until a
 	 * user wants to identify the object in /proc/pid/fd/.  The little hack
 	 * below allows us to generate a name for these objects on demand:
+	 *
+	 * Some pseudo inodes are mountable.  When they are mounted
+	 * path->dentry == path->mnt->mnt_root.  In that case don't call d_dname
+	 * and instead have d_path return the mounted path.
 	 */
-	if (path->dentry->d_op && path->dentry->d_op->d_dname)
+	if (path->dentry->d_op && path->dentry->d_op->d_dname &&
+	    (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root))
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	rcu_read_lock();
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
       [not found]                                     ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27  0:16                                         ` Eric W. Biederman
  2013-11-27  0:16                                         ` Eric W. Biederman
  2013-11-27  0:17                                       ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman
  2 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  0:16 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> Hi Oleg,
>
> commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
> "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
> breaks lxc-attach in 3.12.  That code forks a child which does
> setns() and then does a clone(CLONE_PARENT).  That way the
> grandchild can be in the right namespaces (which the child was
> not) and be a child of the original task, which is the monitor.
>
> lxc-attach in 3.11 was working fine with no side effects that I
> could see.  Is there a real danger in allowing CLONE_PARENT
> when current->nsproxy->pidns_for_children is not our pidns,
> or was this done out of an "over-abundance of caution"?  Can we
> safely revert that new extra check?

The two fundamental things I know we can not allow are:
- A shared signal queue aka CLONE_THREAD.  Because we compute the pid
  and uid of the signal when we place it in the queue.

- Changing the pid and by extention pid_namespace of an existing
  process.

From a parents perspective there is nothing special about the pid
namespace, to deny CLONE_PARENT, because the parent simply won't know or
care.

From the childs perspective all that is special really are shared signal
queues.

User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
in different pid namespaces is almost certainly going to break because
it is complicated.  But shared signal handlers can look at per thread
information to know which pid namespace a process is in, so I don't know
of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
at the kernel level.  It would be absolutely stupid to implement but
that is a different thing.

So hmm.

Because it can do no harm, and because it is a regression let's remove
the CLONE_PARENT check and send it stable.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Acked-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/fork.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 728d5be9548c..f82fa2ee7581 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1171,7 +1171,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * do not allow it to share a thread group or signal handlers or
 	 * parent with the forking task.
 	 */
-	if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) {
+	if (clone_flags & CLONE_SIGHAND) {
 		if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
 		    (task_active_pid_ns(current) !=
 				current->nsproxy->pid_ns_for_children))
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
@ 2013-11-27  0:16                                         ` Eric W. Biederman
  0 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  0:16 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> Hi Oleg,
>
> commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
> "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
> breaks lxc-attach in 3.12.  That code forks a child which does
> setns() and then does a clone(CLONE_PARENT).  That way the
> grandchild can be in the right namespaces (which the child was
> not) and be a child of the original task, which is the monitor.
>
> lxc-attach in 3.11 was working fine with no side effects that I
> could see.  Is there a real danger in allowing CLONE_PARENT
> when current->nsproxy->pidns_for_children is not our pidns,
> or was this done out of an "over-abundance of caution"?  Can we
> safely revert that new extra check?

The two fundamental things I know we can not allow are:
- A shared signal queue aka CLONE_THREAD.  Because we compute the pid
  and uid of the signal when we place it in the queue.

- Changing the pid and by extention pid_namespace of an existing
  process.

>From a parents perspective there is nothing special about the pid
namespace, to deny CLONE_PARENT, because the parent simply won't know or
care.

>From the childs perspective all that is special really are shared signal
queues.

User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
in different pid namespaces is almost certainly going to break because
it is complicated.  But shared signal handlers can look at per thread
information to know which pid namespace a process is in, so I don't know
of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
at the kernel level.  It would be absolutely stupid to implement but
that is a different thing.

So hmm.

Because it can do no harm, and because it is a regression let's remove
the CLONE_PARENT check and send it stable.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Acked-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/fork.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 728d5be9548c..f82fa2ee7581 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1171,7 +1171,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	 * do not allow it to share a thread group or signal handlers or
 	 * parent with the forking task.
 	 */
-	if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) {
+	if (clone_flags & CLONE_SIGHAND) {
 		if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
 		    (task_active_pid_ns(current) !=
 				current->nsproxy->pid_ns_for_children))
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                     ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27  0:16                                       ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman
  2013-11-27  0:16                                         ` Eric W. Biederman
@ 2013-11-27  0:17                                       ` Eric W. Biederman
       [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27 16:13                                         ` Oleg Nesterov
  2 siblings, 2 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  0:17 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA


Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit
e51db73532955dc5eaba4235e62b74b460709d5b
userns: Better restrictions on when proc and sysfs can be mounted
caused a regression on mounting a new instance of proc in a mount
namespace created with user namespace privileges, when binfmt_misc
is mounted on /proc/sys/fs/binfmt_misc.

This is an unintended regression caused by the absolutely bogus empty
directory check in fs_fully_visible.  The check fs_fully_visible replaced
didn't even bother to attempt to verify proc was fully visible and
hiding proc files with any kind of mount is rare.  So for now fix
the userspace regression by allowing directory with nlink == 1
as /proc/sys/fs/binfmt_misc has.

I will have a better patch but it is not stable material, or
last minute kernel material.  So it will have to wait.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/namespace.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ac2ce8a766e1..be32ebccdeb1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type)
 			struct inode *inode = child->mnt_mountpoint->d_inode;
 			if (!S_ISDIR(inode->i_mode))
 				goto next;
-			if (inode->i_nlink != 2)
+			if (inode->i_nlink > 2)
 				goto next;
 		}
 		visible = true;
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27  0:21                                           ` Andy Lutomirski
       [not found]                                             ` <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-11-27  2:00                                           ` Serge E. Hallyn
                                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 64+ messages in thread
From: Andy Lutomirski @ 2013-11-27  0:21 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Aditya Kali, Containers, Oleg Nesterov, Linux FS Devel

On Tue, Nov 26, 2013 at 4:17 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>
> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit
> e51db73532955dc5eaba4235e62b74b460709d5b
> userns: Better restrictions on when proc and sysfs can be mounted
> caused a regression on mounting a new instance of proc in a mount
> namespace created with user namespace privileges, when binfmt_misc
> is mounted on /proc/sys/fs/binfmt_misc.
>
> This is an unintended regression caused by the absolutely bogus empty
> directory check in fs_fully_visible.  The check fs_fully_visible replaced
> didn't even bother to attempt to verify proc was fully visible and
> hiding proc files with any kind of mount is rare.  So for now fix
> the userspace regression by allowing directory with nlink == 1
> as /proc/sys/fs/binfmt_misc has.
>
> I will have a better patch but it is not stable material, or
> last minute kernel material.  So it will have to wait.

Is the better fix to fix procfs to set nlink == 2?

--Andy

>
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/namespace.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ac2ce8a766e1..be32ebccdeb1 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type)
>                         struct inode *inode = child->mnt_mountpoint->d_inode;
>                         if (!S_ISDIR(inode->i_mode))
>                                 goto next;
> -                       if (inode->i_nlink != 2)
> +                       if (inode->i_nlink > 2)
>                                 goto next;
>                 }
>                 visible = true;
> --
> 1.7.5.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                             ` <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-27  0:36                                               ` Eric W. Biederman
  0 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  0:36 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Aditya Kali, Containers, Oleg Nesterov, Linux FS Devel

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Tue, Nov 26, 2013 at 4:17 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit
>> e51db73532955dc5eaba4235e62b74b460709d5b
>> userns: Better restrictions on when proc and sysfs can be mounted
>> caused a regression on mounting a new instance of proc in a mount
>> namespace created with user namespace privileges, when binfmt_misc
>> is mounted on /proc/sys/fs/binfmt_misc.
>>
>> This is an unintended regression caused by the absolutely bogus empty
>> directory check in fs_fully_visible.  The check fs_fully_visible replaced
>> didn't even bother to attempt to verify proc was fully visible and
>> hiding proc files with any kind of mount is rare.  So for now fix
>> the userspace regression by allowing directory with nlink == 1
>> as /proc/sys/fs/binfmt_misc has.
>>
>> I will have a better patch but it is not stable material, or
>> last minute kernel material.  So it will have to wait.
>
> Is the better fix to fix procfs to set nlink == 2?

The better fix should be to drop locks, read the directory
(f_op->iterate?) and ensure it is empty and then take locks again.
nlink is insufficient to check if a directory is empty and a mount is
covering a file with something interesting.

Only under /proc/sys/... do directories have nlink == 1 so the nlink
check continues to provide value for now.

The only real world reasonable cases are mounting over an empty
directory in /proc or /sys or mounting over the filesystem entirely and
the nlink check actually catches the latter because the nlink count is
correct on the root directories.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                         ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27  1:58                                           ` Serge E. Hallyn
  2013-11-30  6:15                                           ` Al Viro
  1 sibling, 0 replies; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-27  1:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> 
> Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org) wrote:
> > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
> > "proc: Fix the namespace inode permission checks." converted
> > the namespace files into symlinks. The same commit changed
> > the way namespace bind mounts appear in /proc/mounts:
> >   $ mount --bind /proc/self/ns/ipc /mnt/ipc
> > Originally:
> >   $ cat /proc/mounts | grep ipc
> >   proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0
> >
> > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
> >   $ cat /proc/mounts | grep ipc
> >   proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0
> >
> > This breaks userspace which expects the 2nd field in
> > /proc/mounts to be a valid path.
> 
> The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to
> dentries allocated with d_alloc_pseudo that we can mount, and
> that have interesting names printed out with d_dname.
> 
> When these files are bind mounted /proc/mounts is not currently
> displaying the mount point correctly because d_dname is called instead
> of just displaying the path where the file is mounted.
> 
> Solve this by adding an explicit check to distinguish mounted pseudo
> inodes and unmounted pseudo inodes.  Unmounted pseudo inodes always
> use mount of their filesstem as the mnt_root  in their path making
> these two cases easy to distinguish.
> 
> CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Reported-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> ---
>  fs/dcache.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 4bdb300b16e2..f7282ebf1a37 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3061,8 +3061,13 @@ char *d_path(const struct path *path, char *buf, int buflen)
>  	 * thus don't need to be hashed.  They also don't need a name until a
>  	 * user wants to identify the object in /proc/pid/fd/.  The little hack
>  	 * below allows us to generate a name for these objects on demand:
> +	 *
> +	 * Some pseudo inodes are mountable.  When they are mounted
> +	 * path->dentry == path->mnt->mnt_root.  In that case don't call d_dname
> +	 * and instead have d_path return the mounted path.
>  	 */
> -	if (path->dentry->d_op && path->dentry->d_op->d_dname)
> +	if (path->dentry->d_op && path->dentry->d_op->d_dname &&
> +	    (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root))
>  		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
>  
>  	rcu_read_lock();
> -- 
> 1.7.5.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
  2013-11-27  0:16                                       ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman
@ 2013-11-27  1:58                                         ` Serge E. Hallyn
       [not found]                                         ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 0 replies; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-27  1:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel,
	Aditya Kali, Oleg Nesterov, Andy Lutomirski

Quoting Eric W. Biederman (ebiederm@xmission.com):
> 
> Aditya Kali (adityakali@google.com) wrote:
> > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
> > "proc: Fix the namespace inode permission checks." converted
> > the namespace files into symlinks. The same commit changed
> > the way namespace bind mounts appear in /proc/mounts:
> >   $ mount --bind /proc/self/ns/ipc /mnt/ipc
> > Originally:
> >   $ cat /proc/mounts | grep ipc
> >   proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0
> >
> > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5:
> >   $ cat /proc/mounts | grep ipc
> >   proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0
> >
> > This breaks userspace which expects the 2nd field in
> > /proc/mounts to be a valid path.
> 
> The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to
> dentries allocated with d_alloc_pseudo that we can mount, and
> that have interesting names printed out with d_dname.
> 
> When these files are bind mounted /proc/mounts is not currently
> displaying the mount point correctly because d_dname is called instead
> of just displaying the path where the file is mounted.
> 
> Solve this by adding an explicit check to distinguish mounted pseudo
> inodes and unmounted pseudo inodes.  Unmounted pseudo inodes always
> use mount of their filesstem as the mnt_root  in their path making
> these two cases easy to distinguish.
> 
> CC: stable@vger.kernel.org
> Reported-by: Aditya Kali <adityakali@google.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>

> ---
>  fs/dcache.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 4bdb300b16e2..f7282ebf1a37 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3061,8 +3061,13 @@ char *d_path(const struct path *path, char *buf, int buflen)
>  	 * thus don't need to be hashed.  They also don't need a name until a
>  	 * user wants to identify the object in /proc/pid/fd/.  The little hack
>  	 * below allows us to generate a name for these objects on demand:
> +	 *
> +	 * Some pseudo inodes are mountable.  When they are mounted
> +	 * path->dentry == path->mnt->mnt_root.  In that case don't call d_dname
> +	 * and instead have d_path return the mounted path.
>  	 */
> -	if (path->dentry->d_op && path->dentry->d_op->d_dname)
> +	if (path->dentry->d_op && path->dentry->d_op->d_dname &&
> +	    (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root))
>  		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
>  
>  	rcu_read_lock();
> -- 
> 1.7.5.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 2/3] fork:  Allow CLONE_PARENT after setns(CLONE_NEWPID)
       [not found]                                         ` <87vbzezojq.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27  1:58                                           ` Serge E. Hallyn
  0 siblings, 0 replies; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-27  1:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> 
> Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes:
> > Hi Oleg,
> >
> > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e :
> > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
> > breaks lxc-attach in 3.12.  That code forks a child which does
> > setns() and then does a clone(CLONE_PARENT).  That way the
> > grandchild can be in the right namespaces (which the child was
> > not) and be a child of the original task, which is the monitor.
> >
> > lxc-attach in 3.11 was working fine with no side effects that I
> > could see.  Is there a real danger in allowing CLONE_PARENT
> > when current->nsproxy->pidns_for_children is not our pidns,
> > or was this done out of an "over-abundance of caution"?  Can we
> > safely revert that new extra check?
> 
> The two fundamental things I know we can not allow are:
> - A shared signal queue aka CLONE_THREAD.  Because we compute the pid
>   and uid of the signal when we place it in the queue.
> 
> - Changing the pid and by extention pid_namespace of an existing
>   process.
> 
> >From a parents perspective there is nothing special about the pid
> namespace, to deny CLONE_PARENT, because the parent simply won't know or
> care.
> 
> >From the childs perspective all that is special really are shared signal
> queues.
> 
> User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
> in different pid namespaces is almost certainly going to break because
> it is complicated.  But shared signal handlers can look at per thread
> information to know which pid namespace a process is in, so I don't know
> of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
> at the kernel level.  It would be absolutely stupid to implement but
> that is a different thing.
> 
> So hmm.
> 
> Because it can do no harm, and because it is a regression let's remove
> the CLONE_PARENT check and send it stable.
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Acked-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Acked-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>

Thanks, Eric.

> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  kernel/fork.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 728d5be9548c..f82fa2ee7581 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1171,7 +1171,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	 * do not allow it to share a thread group or signal handlers or
>  	 * parent with the forking task.
>  	 */
> -	if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) {
> +	if (clone_flags & CLONE_SIGHAND) {
>  		if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
>  		    (task_active_pid_ns(current) !=
>  				current->nsproxy->pid_ns_for_children))
> -- 
> 1.7.5.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27  0:21                                           ` Andy Lutomirski
@ 2013-11-27  2:00                                           ` Serge E. Hallyn
  2013-11-27  3:19                                           ` Gao feng
  2013-11-27 16:13                                           ` Oleg Nesterov
  3 siblings, 0 replies; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-27  2:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> 
> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit
> e51db73532955dc5eaba4235e62b74b460709d5b
> userns: Better restrictions on when proc and sysfs can be mounted
> caused a regression on mounting a new instance of proc in a mount
> namespace created with user namespace privileges, when binfmt_misc
> is mounted on /proc/sys/fs/binfmt_misc.
> 
> This is an unintended regression caused by the absolutely bogus empty
> directory check in fs_fully_visible.  The check fs_fully_visible replaced
> didn't even bother to attempt to verify proc was fully visible and
> hiding proc files with any kind of mount is rare.  So for now fix
> the userspace regression by allowing directory with nlink == 1
> as /proc/sys/fs/binfmt_misc has.
> 
> I will have a better patch but it is not stable material, or
> last minute kernel material.  So it will have to wait.
> 
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Thanks, Eric, this should make user namespaces useful again for
containers.

Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>

> ---
>  fs/namespace.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ac2ce8a766e1..be32ebccdeb1 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type)
>  			struct inode *inode = child->mnt_mountpoint->d_inode;
>  			if (!S_ISDIR(inode->i_mode))
>  				goto next;
> -			if (inode->i_nlink != 2)
> +			if (inode->i_nlink > 2)
>  				goto next;
>  		}
>  		visible = true;
> -- 
> 1.7.5.4

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27  0:21                                           ` Andy Lutomirski
  2013-11-27  2:00                                           ` Serge E. Hallyn
@ 2013-11-27  3:19                                           ` Gao feng
       [not found]                                             ` <529564AA.8050100-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
  2013-11-27  5:00                                             ` Eric W. Biederman
  2013-11-27 16:13                                           ` Oleg Nesterov
  3 siblings, 2 replies; 64+ messages in thread
From: Gao feng @ 2013-11-27  3:19 UTC (permalink / raw)
  To: Eric W. Biederman, Serge E. Hallyn
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Aditya Kali, Containers,
	Oleg Nesterov, Andy Lutomirski

On 11/27/2013 08:17 AM, Eric W. Biederman wrote:
> 
> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit
> e51db73532955dc5eaba4235e62b74b460709d5b
> userns: Better restrictions on when proc and sysfs can be mounted
> caused a regression on mounting a new instance of proc in a mount
> namespace created with user namespace privileges, when binfmt_misc
> is mounted on /proc/sys/fs/binfmt_misc.
> 
> This is an unintended regression caused by the absolutely bogus empty
> directory check in fs_fully_visible.  The check fs_fully_visible replaced
> didn't even bother to attempt to verify proc was fully visible and
> hiding proc files with any kind of mount is rare.  So for now fix
> the userspace regression by allowing directory with nlink == 1
> as /proc/sys/fs/binfmt_misc has.
> 
> I will have a better patch but it is not stable material, or
> last minute kernel material.  So it will have to wait.
> 
> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/namespace.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ac2ce8a766e1..be32ebccdeb1 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type)
>  			struct inode *inode = child->mnt_mountpoint->d_inode;
>  			if (!S_ISDIR(inode->i_mode))
>  				goto next;
> -			if (inode->i_nlink != 2)
> +			if (inode->i_nlink > 2)
>  				goto next;
>  		}
>  		visible = true;
> 

As a quick fix.

Acked-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
Tested-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

looking forward to your following patch. :)

Thanks!

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                             ` <529564AA.8050100-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2013-11-27  5:00                                               ` Eric W. Biederman
  0 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  5:00 UTC (permalink / raw)
  To: Gao feng
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:

> On 11/27/2013 08:17 AM, Eric W. Biederman wrote:
>> 
>> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit
>> e51db73532955dc5eaba4235e62b74b460709d5b
>> userns: Better restrictions on when proc and sysfs can be mounted
>> caused a regression on mounting a new instance of proc in a mount
>> namespace created with user namespace privileges, when binfmt_misc
>> is mounted on /proc/sys/fs/binfmt_misc.
>> 
>> This is an unintended regression caused by the absolutely bogus empty
>> directory check in fs_fully_visible.  The check fs_fully_visible replaced
>> didn't even bother to attempt to verify proc was fully visible and
>> hiding proc files with any kind of mount is rare.  So for now fix
>> the userspace regression by allowing directory with nlink == 1
>> as /proc/sys/fs/binfmt_misc has.
>> 
>> I will have a better patch but it is not stable material, or
>> last minute kernel material.  So it will have to wait.
>> 
>> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> ---
>>  fs/namespace.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>> 
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index ac2ce8a766e1..be32ebccdeb1 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type)
>>  			struct inode *inode = child->mnt_mountpoint->d_inode;
>>  			if (!S_ISDIR(inode->i_mode))
>>  				goto next;
>> -			if (inode->i_nlink != 2)
>> +			if (inode->i_nlink > 2)
>>  				goto next;
>>  		}
>>  		visible = true;
>> 
>
> As a quick fix.
>
> Acked-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
> Tested-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
>
> looking forward to your following patch. :)

I might have to be prodded.  Sometimes it looks easy and sometimes I go
ick locking craziness.  Once I am done sorting out the regressions I
plan on focusing on the mount issues between namespaces.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
  2013-11-27  3:19                                           ` Gao feng
       [not found]                                             ` <529564AA.8050100-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2013-11-27  5:00                                             ` Eric W. Biederman
  1 sibling, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27  5:00 UTC (permalink / raw)
  To: Gao feng
  Cc: Serge E. Hallyn, Containers, linux-fsdevel, Aditya Kali,
	Oleg Nesterov, Andy Lutomirski

Gao feng <gaofeng@cn.fujitsu.com> writes:

> On 11/27/2013 08:17 AM, Eric W. Biederman wrote:
>> 
>> Gao feng <gaofeng@cn.fujitsu.com> reported that commit
>> e51db73532955dc5eaba4235e62b74b460709d5b
>> userns: Better restrictions on when proc and sysfs can be mounted
>> caused a regression on mounting a new instance of proc in a mount
>> namespace created with user namespace privileges, when binfmt_misc
>> is mounted on /proc/sys/fs/binfmt_misc.
>> 
>> This is an unintended regression caused by the absolutely bogus empty
>> directory check in fs_fully_visible.  The check fs_fully_visible replaced
>> didn't even bother to attempt to verify proc was fully visible and
>> hiding proc files with any kind of mount is rare.  So for now fix
>> the userspace regression by allowing directory with nlink == 1
>> as /proc/sys/fs/binfmt_misc has.
>> 
>> I will have a better patch but it is not stable material, or
>> last minute kernel material.  So it will have to wait.
>> 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/namespace.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>> 
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index ac2ce8a766e1..be32ebccdeb1 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type)
>>  			struct inode *inode = child->mnt_mountpoint->d_inode;
>>  			if (!S_ISDIR(inode->i_mode))
>>  				goto next;
>> -			if (inode->i_nlink != 2)
>> +			if (inode->i_nlink > 2)
>>  				goto next;
>>  		}
>>  		visible = true;
>> 
>
> As a quick fix.
>
> Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
> Tested-by: Gao feng <gaofeng@cn.fujitsu.com>
>
> looking forward to your following patch. :)

I might have to be prodded.  Sometimes it looks easy and sometimes I go
ick locking craziness.  Once I am done sorting out the regressions I
plan on focusing on the mount issues between namespaces.

Eric


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                                                             ` (2 preceding siblings ...)
  2013-11-27  3:19                                           ` Gao feng
@ 2013-11-27 16:13                                           ` Oleg Nesterov
  3 siblings, 0 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 16:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

To all: sorry for noise, I can't comment this patch.


But Eric, could you please help me to understand? I am totally confused.

So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
should return false.

After the normal "mout -t proc none /proc/" it becomes true.

And it is still true after, say, "mount -t ramfs none /proc/sys" because
"ls -ld /proc/sys" shows ->i_nlink == 1.

However, say, "mount -t ramfs none /proc/tty/" should make
fs_fully_visible() == F, because in this case ->i_nlink == 4.

Correct?

If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible"
check actually tries to prevent and why?

Thanks,

Oleg.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
  2013-11-27  0:17                                       ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman
       [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27 16:13                                         ` Oleg Nesterov
       [not found]                                           ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 16:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel,
	Aditya Kali, Andy Lutomirski

To all: sorry for noise, I can't comment this patch.


But Eric, could you please help me to understand? I am totally confused.

So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
should return false.

After the normal "mout -t proc none /proc/" it becomes true.

And it is still true after, say, "mount -t ramfs none /proc/sys" because
"ls -ld /proc/sys" shows ->i_nlink == 1.

However, say, "mount -t ramfs none /proc/tty/" should make
fs_fully_visible() == F, because in this case ->i_nlink == 4.

Correct?

If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible"
check actually tries to prevent and why?

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                           ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-11-27 16:29                                             ` Serge E. Hallyn
       [not found]                                               ` <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  2013-11-27 16:41                                             ` Andy Lutomirski
  2013-11-27 18:51                                             ` Eric W. Biederman
  2 siblings, 1 reply; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-27 16:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman

Quoting Oleg Nesterov (oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org):
> To all: sorry for noise, I can't comment this patch.
> 
> 
> But Eric, could you please help me to understand? I am totally confused.
> 
> So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
> should return false.
> 
> After the normal "mout -t proc none /proc/" it becomes true.
> 
> And it is still true after, say, "mount -t ramfs none /proc/sys" because
> "ls -ld /proc/sys" shows ->i_nlink == 1.
> 
> However, say, "mount -t ramfs none /proc/tty/" should make
> fs_fully_visible() == F, because in this case ->i_nlink == 4.
> 
> Correct?
> 
> If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible"
> check actually tries to prevent and why?

The idea is that some admin on a host where /a/b/c/d exists, c/d should
be hidden, so overmounts a tmpfs onto /a/b/c.  In that case, an unpriv
user could clone(CLONE_NEWUSER), then clone(CLONE_NEWNS), then umount
/a/b/c and see /a/b/c/d.  This patch was to try and prevent that.

I argue that it is easy enough for the admin to make /a/b/c permissions
such that it can't be crossed by an unprivileged user, but that does
require an action on the part of admins who used this (imo misguided)
approach.

-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                           ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-11-27 16:29                                             ` Serge E. Hallyn
@ 2013-11-27 16:41                                             ` Andy Lutomirski
       [not found]                                               ` <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-11-27 18:51                                             ` Eric W. Biederman
  2 siblings, 1 reply; 64+ messages in thread
From: Andy Lutomirski @ 2013-11-27 16:41 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Aditya Kali, Containers, Linux FS Devel, Eric W. Biederman

On Wed, Nov 27, 2013 at 8:13 AM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> To all: sorry for noise, I can't comment this patch.
>
>
> But Eric, could you please help me to understand? I am totally confused.
>
> So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
> should return false.
>
> After the normal "mout -t proc none /proc/" it becomes true.
>
> And it is still true after, say, "mount -t ramfs none /proc/sys" because
> "ls -ld /proc/sys" shows ->i_nlink == 1.

Grr.  mount -t ramfs none /proc/sys should make proc be not "fully visible".

--Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                               ` <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-27 18:09                                                 ` Oleg Nesterov
  0 siblings, 0 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 18:09 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Aditya Kali, Containers, Andy Lutomirski, Eric W. Biederman,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On 11/27, Serge E. Hallyn wrote:
>
> Quoting Oleg Nesterov (oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org):
> > To all: sorry for noise, I can't comment this patch.
> >
> >
> > But Eric, could you please help me to understand? I am totally confused.
> >
> > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
> > should return false.
> >
> > After the normal "mout -t proc none /proc/" it becomes true.
> >
> > And it is still true after, say, "mount -t ramfs none /proc/sys" because
> > "ls -ld /proc/sys" shows ->i_nlink == 1.
> >
> > However, say, "mount -t ramfs none /proc/tty/" should make
> > fs_fully_visible() == F, because in this case ->i_nlink == 4.
> >
> > Correct?
> >
> > If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible"
> > check actually tries to prevent and why?
>
> The idea is that some admin on a host where /a/b/c/d exists, c/d should
> be hidden, so overmounts a tmpfs onto /a/b/c.  In that case, an unpriv
> user could clone(CLONE_NEWUSER), then clone(CLONE_NEWNS), then umount
> /a/b/c and see /a/b/c/d.  This patch was to try and prevent that.

Thanks Serge, but now I am even more confused... fs_fully_visible() is
only called by proc/sysfs_mount ?

Perhaps you meant that admin may want to hide something in /proc or /sys?
This is what I suspected initially, but see "mount /proc/sys" above...

I guess you can ignore me, I don't understand this at all. To the point,
suppose that /proc was never mounted, even in the root namespace. Then
how a sub-namespace mount it? fs_fully_visible() can't return true.

OK, sorry again for noise.

Oleg.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                               ` <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-27 18:10                                                 ` Oleg Nesterov
  0 siblings, 0 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 18:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aditya Kali, Containers, Linux FS Devel, Eric W. Biederman

On 11/27, Andy Lutomirski wrote:
>
> On Wed, Nov 27, 2013 at 8:13 AM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > To all: sorry for noise, I can't comment this patch.
> >
> >
> > But Eric, could you please help me to understand? I am totally confused.
> >
> > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
> > should return false.
> >
> > After the normal "mout -t proc none /proc/" it becomes true.
> >
> > And it is still true after, say, "mount -t ramfs none /proc/sys" because
> > "ls -ld /proc/sys" shows ->i_nlink == 1.
>
> Grr.  mount -t ramfs none /proc/sys should make proc be not "fully visible".

This was one of the reasons for my question ;)

Oleg.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                           ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-11-27 16:29                                             ` Serge E. Hallyn
  2013-11-27 16:41                                             ` Andy Lutomirski
@ 2013-11-27 18:51                                             ` Eric W. Biederman
       [not found]                                               ` <871u21oeyr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27 19:47                                               ` Oleg Nesterov
  2 siblings, 2 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27 18:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> To all: sorry for noise, I can't comment this patch.
>
>
> But Eric, could you please help me to understand? I am totally confused.
>
> So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc")
> should return false.
>
> After the normal "mout -t proc none /proc/" it becomes true.
>
> And it is still true after, say, "mount -t ramfs none /proc/sys" because
> "ls -ld /proc/sys" shows ->i_nlink == 1.
>
> However, say, "mount -t ramfs none /proc/tty/" should make
> fs_fully_visible() == F, because in this case ->i_nlink == 4.
>
> Correct?
>
> If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible"
> check actually tries to prevent and why?

Right now a more accurate name would be fs_visible.

The primary goal is to restrict the user namespace root to being able to
mount proc and sysfs if nothing new is revealed, preserving the
restrictions in jail type environments where proc and sysfs are not
mounted at all.

I would have dropped the ability for the userns root to create a fresh
mount of proc and syfs, and required a bind mount instead except
because of the namespace intertwining we need fresh mounts of proc and
sys after new namespaces are created.

What I really would love to have implemented would be a check that
nothing is mounted on top of proc of sysfs but that fails because we
have /sys/fs/.. and /proc/sys/fs_binfmt_misc.  So I had to implement
an empty directory check.

And when I first implemented that empty directory check I had a complete
and total brain fart.

So what is really implemented at the moment would be more accurately called
fs_visible().

For the real concern about jail environments where proc and sysfs are
not mounted at all a fs_visible check is all that is really required,
and if you look at 3.11? I think it was that is what was shipped.

Unfortunately I cleaned things up and totally had a massive brain fart
and failed to implement the is this an empty directory check properly
and somehow failed to test mounting /proc in a real world environment
and so caused a significant regression in 3.12 leading to a case where
proc is not mountable.  This patch just fixes that stupid regression.
As the overmount protection in general is not needed as to my knolwedge
no one overmounts proc or sysfs in a meaningful way.

Now it would be very good to go back and fix fs_fully_visible to either
be fs_visible and not care or to read overmounted directories and verify
that they are actually empty.  Reading directories at this point
requires opening a file and probably playing with locks.  Something I am
not prepared to do for a trivial bug fix.  Especially since getting it
right only closes theoretical holes at the moment.

Does that clarify things?

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                               ` <871u21oeyr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27 19:47                                                 ` Oleg Nesterov
  0 siblings, 0 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 19:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Just to avoid the possible confusion, let me repeat that the fix itsef
looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.

I am not arguing with this patch, I am just trying to understand this
logic.

On 11/27, Eric W. Biederman wrote:
>
> [... snip ...]

Thanks a lot.

> For the real concern about jail environments where proc and sysfs are
> not mounted at all a fs_visible check is all that is really required,

this is what I can't understand...

Lets ignore the implementation details. Suppose that proc was never
mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?

Oleg.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
  2013-11-27 18:51                                             ` Eric W. Biederman
       [not found]                                               ` <871u21oeyr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27 19:47                                               ` Oleg Nesterov
       [not found]                                                 ` <20131127194722.GA32673-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-11-27 19:52                                                 ` Eric W. Biederman
  1 sibling, 2 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 19:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel,
	Aditya Kali, Andy Lutomirski

Just to avoid the possible confusion, let me repeat that the fix itsef
looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.

I am not arguing with this patch, I am just trying to understand this
logic.

On 11/27, Eric W. Biederman wrote:
>
> [... snip ...]

Thanks a lot.

> For the real concern about jail environments where proc and sysfs are
> not mounted at all a fs_visible check is all that is really required,

this is what I can't understand...

Lets ignore the implementation details. Suppose that proc was never
mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?

Oleg.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                 ` <20131127194722.GA32673-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-11-27 19:52                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27 19:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> Just to avoid the possible confusion, let me repeat that the fix itsef
> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
>
> I am not arguing with this patch, I am just trying to understand this
> logic.
>
> On 11/27, Eric W. Biederman wrote:
>>
>> [... snip ...]
>
> Thanks a lot.
>
>> For the real concern about jail environments where proc and sysfs are
>> not mounted at all a fs_visible check is all that is really required,
>
> this is what I can't understand...
>
> Lets ignore the implementation details. Suppose that proc was never
> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?

Yes.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
  2013-11-27 19:47                                               ` Oleg Nesterov
       [not found]                                                 ` <20131127194722.GA32673-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-11-27 19:52                                                 ` Eric W. Biederman
       [not found]                                                   ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27 19:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel,
	Aditya Kali, Andy Lutomirski

Oleg Nesterov <oleg@redhat.com> writes:

> Just to avoid the possible confusion, let me repeat that the fix itsef
> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
>
> I am not arguing with this patch, I am just trying to understand this
> logic.
>
> On 11/27, Eric W. Biederman wrote:
>>
>> [... snip ...]
>
> Thanks a lot.
>
>> For the real concern about jail environments where proc and sysfs are
>> not mounted at all a fs_visible check is all that is really required,
>
> this is what I can't understand...
>
> Lets ignore the implementation details. Suppose that proc was never
> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?

Yes.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                   ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27 20:01                                                     ` Oleg Nesterov
  2013-11-27 20:07                                                     ` Eric W. Biederman
  1 sibling, 0 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-27 20:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On 11/27, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
> > Just to avoid the possible confusion, let me repeat that the fix itsef
> > looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
> >
> > I am not arguing with this patch, I am just trying to understand this
> > logic.
> >
> > On 11/27, Eric W. Biederman wrote:
> >>
> >> [... snip ...]
> >
> > Thanks a lot.
> >
> >> For the real concern about jail environments where proc and sysfs are
> >> not mounted at all a fs_visible check is all that is really required,
> >
> > this is what I can't understand...
> >
> > Lets ignore the implementation details. Suppose that proc was never
> > mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
>
> Yes.

OK, and I agree this makes sense. Just from the code inspection it wasn't
clear to me if this was intended or not. Plus nlink("/proc/sys") == 1 added
more confusion ;)

Thanks to all.

Oleg.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                   ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27 20:01                                                     ` Oleg Nesterov
@ 2013-11-27 20:07                                                     ` Eric W. Biederman
  2013-11-27 20:41                                                       ` Andy Lutomirski
       [not found]                                                       ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 2 replies; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-27 20:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>
>> Just to avoid the possible confusion, let me repeat that the fix itsef
>> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
>>
>> I am not arguing with this patch, I am just trying to understand this
>> logic.
>>
>> On 11/27, Eric W. Biederman wrote:
>>>
>>> [... snip ...]
>>
>> Thanks a lot.
>>
>>> For the real concern about jail environments where proc and sysfs are
>>> not mounted at all a fs_visible check is all that is really required,
>>
>> this is what I can't understand...
>>
>> Lets ignore the implementation details. Suppose that proc was never
>> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
>
> Yes.

Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.
If proc was never mounted.

Fresh mounts of proc are not allowed unless you have also created the
pid namespace.  With just CLONE_NEWUSER | NEWNS you are limited to bind
mounts.

Has this cleared up the confusion?

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                       ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-27 20:41                                                         ` Andy Lutomirski
  2013-11-29 19:53                                                         ` Oleg Nesterov
  1 sibling, 0 replies; 64+ messages in thread
From: Andy Lutomirski @ 2013-11-27 20:41 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Aditya Kali, Containers, Oleg Nesterov, Linux FS Devel

On Wed, Nov 27, 2013 at 12:07 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>
>> Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> Just to avoid the possible confusion, let me repeat that the fix itsef
>>> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
>>>
>>> I am not arguing with this patch, I am just trying to understand this
>>> logic.
>>>
>>> On 11/27, Eric W. Biederman wrote:
>>>>
>>>> [... snip ...]
>>>
>>> Thanks a lot.
>>>
>>>> For the real concern about jail environments where proc and sysfs are
>>>> not mounted at all a fs_visible check is all that is really required,
>>>
>>> this is what I can't understand...
>>>
>>> Lets ignore the implementation details. Suppose that proc was never
>>> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
>>
>> Yes.
>
> Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.
> If proc was never mounted.
>
> Fresh mounts of proc are not allowed unless you have also created the
> pid namespace.  With just CLONE_NEWUSER | NEWNS you are limited to bind
> mounts.
>
> Has this cleared up the confusion?
>
> Eric
>

This is all obnoxiously complicated.  I wonder if we can do (a lot)
better by allowing a "pid-only" variant of proc to be mounted.  It
should contain:

 - All the pid directories
 - /proc/self, /proc/net, and /proc/mounts (but possibly not
/proc/PID/net -- that's a weird interface IMO and isn't really related
to the pid)
 - keys key-users (wtf is up with that interface, though -- those
files are way too magical)
 - cpuinfo, version, and maybe other informational things (crypto?)
 - loadavg, perhaps

I wonder it would be possible to boot a reasonable container with a
heavily limited /proc like that.

--Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
  2013-11-27 20:07                                                     ` Eric W. Biederman
@ 2013-11-27 20:41                                                       ` Andy Lutomirski
       [not found]                                                         ` <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]                                                       ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 64+ messages in thread
From: Andy Lutomirski @ 2013-11-27 20:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Serge E. Hallyn, Gao feng, Containers,
	Linux FS Devel, Aditya Kali

On Wed, Nov 27, 2013 at 12:07 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>>> Just to avoid the possible confusion, let me repeat that the fix itsef
>>> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
>>>
>>> I am not arguing with this patch, I am just trying to understand this
>>> logic.
>>>
>>> On 11/27, Eric W. Biederman wrote:
>>>>
>>>> [... snip ...]
>>>
>>> Thanks a lot.
>>>
>>>> For the real concern about jail environments where proc and sysfs are
>>>> not mounted at all a fs_visible check is all that is really required,
>>>
>>> this is what I can't understand...
>>>
>>> Lets ignore the implementation details. Suppose that proc was never
>>> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
>>
>> Yes.
>
> Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.
> If proc was never mounted.
>
> Fresh mounts of proc are not allowed unless you have also created the
> pid namespace.  With just CLONE_NEWUSER | NEWNS you are limited to bind
> mounts.
>
> Has this cleared up the confusion?
>
> Eric
>

This is all obnoxiously complicated.  I wonder if we can do (a lot)
better by allowing a "pid-only" variant of proc to be mounted.  It
should contain:

 - All the pid directories
 - /proc/self, /proc/net, and /proc/mounts (but possibly not
/proc/PID/net -- that's a weird interface IMO and isn't really related
to the pid)
 - keys key-users (wtf is up with that interface, though -- those
files are way too magical)
 - cpuinfo, version, and maybe other informational things (crypto?)
 - loadavg, perhaps

I wonder it would be possible to boot a reasonable container with a
heavily limited /proc like that.

--Andy

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                         ` <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-29 14:56                                                           ` Serge E. Hallyn
  0 siblings, 0 replies; 64+ messages in thread
From: Serge E. Hallyn @ 2013-11-29 14:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aditya Kali, Eric W. Biederman, Containers, Oleg Nesterov,
	Linux FS Devel

Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org):
> On Wed, Nov 27, 2013 at 12:07 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
> >
> >> Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >>> Just to avoid the possible confusion, let me repeat that the fix itsef
> >>> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong.
> >>>
> >>> I am not arguing with this patch, I am just trying to understand this
> >>> logic.
> >>>
> >>> On 11/27, Eric W. Biederman wrote:
> >>>>
> >>>> [... snip ...]
> >>>
> >>> Thanks a lot.
> >>>
> >>>> For the real concern about jail environments where proc and sysfs are
> >>>> not mounted at all a fs_visible check is all that is really required,
> >>>
> >>> this is what I can't understand...
> >>>
> >>> Lets ignore the implementation details. Suppose that proc was never
> >>> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
> >>
> >> Yes.
> >
> > Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.
> > If proc was never mounted.
> >
> > Fresh mounts of proc are not allowed unless you have also created the
> > pid namespace.  With just CLONE_NEWUSER | NEWNS you are limited to bind
> > mounts.
> >
> > Has this cleared up the confusion?
> >
> > Eric
> >
> 
> This is all obnoxiously complicated.  I wonder if we can do (a lot)
> better by allowing a "pid-only" variant of proc to be mounted.  It
> should contain:
> 
>  - All the pid directories
>  - /proc/self, /proc/net, and /proc/mounts (but possibly not
> /proc/PID/net -- that's a weird interface IMO and isn't really related
> to the pid)
>  - keys key-users (wtf is up with that interface, though -- those
> files are way too magical)
>  - cpuinfo, version, and maybe other informational things (crypto?)
>  - loadavg, perhaps
> 
> I wonder it would be possible to boot a reasonable container with a
> heavily limited /proc like that.

Should be possible.  And heck, maybe some of the values could then
be virtualized :)  cmdline could point to the container init's
cmdline;  cpuinfo and loadavg and meminfo be filtered through
cgroupfs.

-serge

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                       ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27 20:41                                                         ` Andy Lutomirski
@ 2013-11-29 19:53                                                         ` Oleg Nesterov
       [not found]                                                           ` <20131129195327.GA12974-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-12-13 22:07                                                           ` Richard Weinberger
  1 sibling, 2 replies; 64+ messages in thread
From: Oleg Nesterov @ 2013-11-29 19:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Andy Lutomirski,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On 11/27, Eric W. Biederman wrote:
>
> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>
> > Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >> Lets ignore the implementation details. Suppose that proc was never
> >> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
> >
> > Yes.
>
> Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.

Yes, yes, I understand, the mounter should be CAP_SYS_ADMIN in
task_active_pid_ns().

> Has this cleared up the confusion?

Yes. Thanks.

Oleg.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                         ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-11-27  1:58                                           ` Serge E. Hallyn
@ 2013-11-30  6:15                                           ` Al Viro
       [not found]                                             ` <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  1 sibling, 1 reply; 64+ messages in thread
From: Al Viro @ 2013-11-30  6:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski,
	Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds

[Eric Dumazet Cc'd since ->d_dname() had been introduced by him]

On Tue, Nov 26, 2013 at 04:16:13PM -0800, Eric W. Biederman wrote:
> >   proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0
> >
> > This breaks userspace which expects the 2nd field in
> > /proc/mounts to be a valid path.
> 
> The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to
> dentries allocated with d_alloc_pseudo that we can mount, and
> that have interesting names printed out with d_dname.
> 
> When these files are bind mounted /proc/mounts is not currently
> displaying the mount point correctly because d_dname is called instead
> of just displaying the path where the file is mounted.
> 
> Solve this by adding an explicit check to distinguish mounted pseudo
> inodes and unmounted pseudo inodes.  Unmounted pseudo inodes always
> use mount of their filesstem as the mnt_root  in their path making
> these two cases easy to distinguish.

Excuse me, but this is a wrong way to deal with that.  The root of the
problem is your "let's give them ->d_dname() *and* let them live on the
same procfs vfsmount".  The latter, BTW, requires the suckers to access
nd->path.mnt, which is a bad thing by itself.  Plus the wrong way
->d_dname() is called.  It was kinda-sorta defensible back when it
had been introduced, but only because all filesystems with ->d_dname()
had been MS_NOUSER ones.  Kinda-sorta since it introduced an unpleasant
difference in semantics of d_path() and the rest of its ilk - see
e.g. tomoyo calling ->d_dname() manually.

->d_dname() addresses a real need, no arguments here - we do want to
delay name generation for pipes and sockets until asked for it.  The
problem is that it's dealt with in the wrong place - we ought to
have it handled when prepend_path() reaches a vfsmount with no
parent and an IS_ROOT() dentry.  _Then_ we need to check if it has
->d_op->d_dname() and call one if it does.

There's a bunch of unpleasant details around the handling of "what if
the final vfsmount is detached or internal" (and induced weirdness
with d_absolute_path(), etc. callers deciding whether they want to
call ->d_dname() manually, since only d_path() calls it); I'll look into
that.

As for using this procfs vfsmount...  Frankly, I would rather add a mini-fs
a-la aio one and have those inodes live _there_, with the same vfsmount
used for all those guys.  Internal, of course.  Without bothering with
register_filesystem(), with all associated headache ("what if userland
mounts the entire thing", etc.)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                             ` <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2013-11-30 17:02                                               ` Al Viro
       [not found]                                                 ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Al Viro @ 2013-11-30 17:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

On Sat, Nov 30, 2013 at 06:15:26AM +0000, Al Viro wrote:

> There's a bunch of unpleasant details around the handling of "what if
> the final vfsmount is detached or internal" (and induced weirdness
> with d_absolute_path(), etc. callers deciding whether they want to
> call ->d_dname() manually, since only d_path() calls it); I'll look into
> that.

FWIW, the other callers of prepend_path() boil down to /proc/mountinfo
handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL,
__d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path()
(this one directly calls ->d_dname() itself).  Note that /proc/mountinfo
will spew garbage for your case (binding /proc/<pid>/ns/mnt somewhere);
the mountpoint will show correctly, but the relative name won't - it
goes through seq_dentry()->dentry_path() (this and apparmour d_namespace_path()
being the only codepaths to dentry_path(), BTW).  Eric, what behaviour
do you want there?

While we are at it: lustre contains this gem:
static char* ll_d_path(struct dentry *dentry, char *buf, int bufsize)
{
        char *path = NULL;

        struct path p;

        p.dentry = dentry;
        p.mnt = current->fs->root.mnt;
        path_get(&p);
        path = d_path(&p, buf, bufsize);
        path_put(&p);

        return path;
}
That should've been dentry_path(), reimplemented here with the usual braino.
Think what this hack produces if current is chrooted into fs in question;
the "clever" trick fails and instead of intended path from fs root we get
path from wherever current's chrooted to.  There's a reason why dentry_path()
had been introduced...

<looks at the callers of that wonder>
        /* this can be called inside spin lock so use GFP_ATOMIC. */
        buf = (char *)__get_free_page(GFP_ATOMIC);
        if (buf != NULL) {
                dentry = d_find_alias(page->mapping->host);
                if (dentry != NULL)
                        path = ll_d_path(dentry, buf, PAGE_SIZE);
        }
	...
	if (dentry)
		dput(dentry);

Good luck if it ever gets called under spinlock - dput() is not a thing you
should call in such place, ditto for path_put() from ll_d_path() itself...
What are the callchains leading there?  I've tried to track them, but gave
up after a while ;-/  I really hope that it's just "called under spinlock"
and not "called from softirq" or something like that...  Who maintains
that thing, anyway?

BTW, what happens if svc_export_request() ends up with pathname filling
almost all space left, so that qword_add(bpp, blen, pth) right after the
call of d_path() in there overwrites the beginning of d_path() output?
Neil?  And while we are at it, handling of overflow in there looks also
looks fishy...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                                 ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2013-11-30 21:51                                                   ` Eric W. Biederman
       [not found]                                                     ` <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2013-12-01  5:09                                                   ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro
  2013-12-02  5:43                                                     ` NeilBrown
  2 siblings, 1 reply; 64+ messages in thread
From: Eric W. Biederman @ 2013-11-30 21:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> Eric, what behaviour do you want there?

Here is the behavior I want.

a) Something reasonable in /proc/self/fd when I just open one of these.

root@unassigned:~# exec 5< /proc/self/ns/net
root@unassigned:~# ls -l /proc/self/fd
total 0
lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0
lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0
lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd
lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955]
root@unassigned:~# ip netns add foo

b) Something reasonable in /proc/mounts when I bind mount one of these.

root@unassigned:~# ip netns add foo
root@unassigned:~# cat /proc/mounts 
[...]
proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0

The technical path that lead me to changing d_path looks something like this.

- I want bind mounts to work. So I need check_mnt to pass.
- I can't have these file descriptors show up directly in proc or useful
  permission checks will be bypassed.  Therefore d_unhashed must be true.
- I don't want every call of d_path to add (deleted) so I need
  d_unlinked to be false. Therefore S_ISROOT must be true.
- When looking at one of these file descriptors when they are not
  mounted prepend_path see that IS_ROOT is true and report /proc/ unless
  I implement d_name.
- When looking at one of these file descriptors when they are bind
  mounted I want /proc/mounts to reflect where they are mounted, which
  means I shouldn't call d_path.

Any dentry allocated with d_alloc_pseudo will have the same problem if
it is ever in a context where it can be bind mounted.  So this is a
general issue with d_name.

The other path to where I am at starts at this comment in prepend_path:
	/*
	 * Filesystems needing to implement special "root names"
	 * should do so with ->d_dname()
	 */
Which is exactly what I did only later to discover that d_path get's it
wrong if the dentry is mounted.

If I remove ns_dname I get the following:
root@unassigned:~# exec 5< /proc/self/ns/net 
root@unassigned:~# ls -l /proc/self/fd/
total 0
lrwx------ 1 root root 64 Nov 30 13:31 0 -> /dev/ttyS0
lrwx------ 1 root root 64 Nov 30 13:31 1 -> /dev/ttyS0
lrwx------ 1 root root 64 Nov 30 13:31 2 -> /dev/ttyS0
lr-x------ 1 root root 64 Nov 30 13:31 3 -> /proc/708/fd
lr-x------ 1 root root 64 Nov 30 13:31 5 -> /proc

Which is livable but particularly ugly to look at.

As for simple correctness.  There are only a handful of implementations
of d_dname today and the all set a valid path in the file descriptors
returned, in anything that can make it to d_path.  Which makes the test
for a psuedo dentry in d_path safe.  And at a very simply level we never
want to call d_dname on an actual mount point.

So in the small I will argue what I have done is very much correct.  In
the large I don't see any other set of implementation choices that does
not result in a significant redesign of something.

Furthermore I have a regression dating back a couple of kernels that
breaks /proc/mounts that should be resolved, and this patch is the
cleanest, safest, simplest solution I can see.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                                     ` <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2013-11-30 22:43                                                       ` Al Viro
       [not found]                                                         ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
                                                                           ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Al Viro @ 2013-11-30 22:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

On Sat, Nov 30, 2013 at 01:51:07PM -0800, Eric W. Biederman wrote:
> Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:
> 
> > Eric, what behaviour do you want there?
> 
> Here is the behavior I want.
> 
> a) Something reasonable in /proc/self/fd when I just open one of these.
> 
> root@unassigned:~# exec 5< /proc/self/ns/net
> root@unassigned:~# ls -l /proc/self/fd
> total 0
> lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0
> lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0
> lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0
> lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd
> lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955]
> root@unassigned:~# ip netns add foo
> 
> b) Something reasonable in /proc/mounts when I bind mount one of these.
> 
> root@unassigned:~# ip netns add foo
> root@unassigned:~# cat /proc/mounts 
> [...]
> proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0

Sigh...  What do you want to see in /proc/self/mountinfo?

> Any dentry allocated with d_alloc_pseudo will have the same problem if
> it is ever in a context where it can be bind mounted.  So this is a
> general issue with d_name.

Yes, ->d_dname() is called in a wrong place.  It should be in prepend_path(),
if not deeper.  What you are doing is growing a kludge on top of that mistake.

Note that your patch does nothing for mountinfo.

As for redesign...  That's exactly what is needed in that area, instead of
letting it fester.  Just look at the rats' nest in there - we have a maze
of helper functions, all alike, with slightly varying semantics.  With
prepend_path() having oh-so-wonderful calling conventions (0/-ENAMETOOLONG/1/2
for return value with rather opaque callers).  Outside of fs/dcache.c we
have the following pile:
	d_path() - most of the callers, periodically misused (see lustre
bug upthread)
	d_absolute_path(), __d_path(), dentry_path() - very few callers,
most in apparmour and tomoyo attempts to implement more or less the same
kind of thing.  Plus handling of /proc/<pid>/mountinfo.

As for check_mnt(old) in do_loopback()...  I wonder if we shouldn't just
turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)).
Note that MS_NOUSER in superblock flags would prevent binding pipefs and
all mount_pseudo() users, so we wouldn't change existing behaviour...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                                 ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2013-11-30 21:51                                                   ` Eric W. Biederman
@ 2013-12-01  5:09                                                   ` Al Viro
       [not found]                                                     ` <20131201050930.GB10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2013-12-01  6:15                                                     ` Tetsuo Handa
  2013-12-02  5:43                                                     ` NeilBrown
  2 siblings, 2 replies; 64+ messages in thread
From: Al Viro @ 2013-12-01  5:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Aditya Kali, Eric W. Biederman, Neil Brown, Containers,
	Oleg Nesterov, Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

On Sat, Nov 30, 2013 at 05:02:26PM +0000, Al Viro wrote:

> FWIW, the other callers of prepend_path() boil down to /proc/mountinfo
> handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL,
> __d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path()
> (this one directly calls ->d_dname() itself).

While we are at it, what's the origin of if (buflen >= 256) checks in
tomoyo_get_absolute_path() and tomoyo_get_dentry_path()?  The minimal
buflen value they can be called with is PAGE_SIZE - 1.  And if there is
a port with pages two times smaller than VAX ones, they'd better audit
the tree for places assuming that PAGE_SIZE is at least 4K...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mountpoint
       [not found]                                                     ` <20131201050930.GB10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2013-12-01  6:15                                                       ` Tetsuo Handa
  0 siblings, 0 replies; 64+ messages in thread
From: Tetsuo Handa @ 2013-12-01  6:15 UTC (permalink / raw)
  To: viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn
  Cc: adityakali-hpIqsD4AKlfQT0dZR+AlfA,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, neilb-l3A5Bk7waGM,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	oleg-H+wXaHxf7aLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Al Viro wrote:
> On Sat, Nov 30, 2013 at 05:02:26PM +0000, Al Viro wrote:
> 
> > FWIW, the other callers of prepend_path() boil down to /proc/mountinfo
> > handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL,
> > __d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path()
> > (this one directly calls ->d_dname() itself).
> 
> While we are at it, what's the origin of if (buflen >= 256) checks in
> tomoyo_get_absolute_path() and tomoyo_get_dentry_path()?  The minimal
> buflen value they can be called with is PAGE_SIZE - 1.

This is a sanity check in case the caller by error called
tomoyo_get_absolute_path() or tomoyo_get_local_path() with buflen < 2, for
they might accesses buffer[buflen - 2].

But "PAGE_SIZE <= buf_len <= (UINT_MAX + 1) / 2" will be true because doing
buf_len <<= 1 will not make buf_len == 0 since kmalloc((UINT_MAX + 1) / 2)
will return NULL. Therefore, we can ignore/kill "if (buflen >= 256)" check.

Regards.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mountpoint
  2013-12-01  5:09                                                   ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro
       [not found]                                                     ` <20131201050930.GB10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2013-12-01  6:15                                                     ` Tetsuo Handa
  1 sibling, 0 replies; 64+ messages in thread
From: Tetsuo Handa @ 2013-12-01  6:15 UTC (permalink / raw)
  To: viro
  Cc: serge, gaofeng, containers, linux-fsdevel, adityakali, oleg,
	luto, torvalds, edumazet, neilb, ebiederm

Al Viro wrote:
> On Sat, Nov 30, 2013 at 05:02:26PM +0000, Al Viro wrote:
> 
> > FWIW, the other callers of prepend_path() boil down to /proc/mountinfo
> > handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL,
> > __d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path()
> > (this one directly calls ->d_dname() itself).
> 
> While we are at it, what's the origin of if (buflen >= 256) checks in
> tomoyo_get_absolute_path() and tomoyo_get_dentry_path()?  The minimal
> buflen value they can be called with is PAGE_SIZE - 1.

This is a sanity check in case the caller by error called
tomoyo_get_absolute_path() or tomoyo_get_local_path() with buflen < 2, for
they might accesses buffer[buflen - 2].

But "PAGE_SIZE <= buf_len <= (UINT_MAX + 1) / 2" will be true because doing
buf_len <<= 1 will not make buf_len == 0 since kmalloc((UINT_MAX + 1) / 2)
will return NULL. Therefore, we can ignore/kill "if (buflen >= 256)" check.

Regards.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
  2013-11-30 17:02                                               ` Al Viro
@ 2013-12-02  5:43                                                     ` NeilBrown
  0 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2013-12-02  5:43 UTC (permalink / raw)
  To: Al Viro, J.Bruce Fields
  Cc: Aditya Kali, NFS, Containers, Oleg Nesterov, Andy Lutomirski,
	Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	Linus Torvalds, Eric W. Biederman


[-- Attachment #1.1: Type: text/plain, Size: 1678 bytes --]

On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:


> BTW, what happens if svc_export_request() ends up with pathname filling
> almost all space left, so that qword_add(bpp, blen, pth) right after the
> call of d_path() in there overwrites the beginning of d_path() output?
> Neil?  And while we are at it, handling of overflow in there looks also
> looks fishy...

In this case the returned *blen will be negative so cache_request() in
net/sunrpc/cache.h will return -EAGAIN.
cache_read() would then go into a tight loop, trying again and again to
create the request, but it will never fit in the buffer.

I guess maybe an EINVAL might help there, plus code to skip over impossible
requests.
Maybe the following.  We'd need to double check that no ->cache_request
function can fail in a way that is worth retrying, but I doubt they would.

Any thoughts Bruce?

Thanks,
NeilBrown


diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index a72de074172d..a065f827e2a3 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail,
 
 	detail->cache_request(detail, crq->item, &bp, &len);
 	if (len < 0)
-		return -EAGAIN;
+		return -EINVAL;
 	return PAGE_SIZE - len;
 }
 
@@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count,
 
 	if (rq->len == 0) {
 		err = cache_request(cd, rq);
+		if (err == -EINVAL) {
+			spin_lock(&queue_lock);
+			list_move(&rp->q.list, &rq->q.list);
+			spin_unlock(&queue_lock);
+		}
 		if (err < 0)
 			goto out;
 		rq->len = err;

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

[-- Attachment #2: Type: text/plain, Size: 205 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
@ 2013-12-02  5:43                                                     ` NeilBrown
  0 siblings, 0 replies; 64+ messages in thread
From: NeilBrown @ 2013-12-02  5:43 UTC (permalink / raw)
  To: Al Viro, J.Bruce Fields
  Cc: Aditya Kali, NFS, Containers, Oleg Nesterov, Andy Lutomirski,
	Eric Dumazet, linux-fsdevel, Linus Torvalds, Eric W. Biederman


[-- Attachment #1.1: Type: text/plain, Size: 1647 bytes --]

On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:


> BTW, what happens if svc_export_request() ends up with pathname filling
> almost all space left, so that qword_add(bpp, blen, pth) right after the
> call of d_path() in there overwrites the beginning of d_path() output?
> Neil?  And while we are at it, handling of overflow in there looks also
> looks fishy...

In this case the returned *blen will be negative so cache_request() in
net/sunrpc/cache.h will return -EAGAIN.
cache_read() would then go into a tight loop, trying again and again to
create the request, but it will never fit in the buffer.

I guess maybe an EINVAL might help there, plus code to skip over impossible
requests.
Maybe the following.  We'd need to double check that no ->cache_request
function can fail in a way that is worth retrying, but I doubt they would.

Any thoughts Bruce?

Thanks,
NeilBrown


diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index a72de074172d..a065f827e2a3 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail,
 
 	detail->cache_request(detail, crq->item, &bp, &len);
 	if (len < 0)
-		return -EAGAIN;
+		return -EINVAL;
 	return PAGE_SIZE - len;
 }
 
@@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count,
 
 	if (rq->len == 0) {
 		err = cache_request(cd, rq);
+		if (err == -EINVAL) {
+			spin_lock(&queue_lock);
+			list_move(&rp->q.list, &rq->q.list);
+			spin_unlock(&queue_lock);
+		}
 		if (err < 0)
 			goto out;
 		rq->len = err;

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

[-- Attachment #2: Type: text/plain, Size: 171 bytes --]

_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                                         ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2013-12-02  7:29                                                           ` Al Viro
  0 siblings, 0 replies; 64+ messages in thread
From: Al Viro @ 2013-12-02  7:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

BTW, this "just use nd->path.mnt" is *not* guaranteed to work - sadly,
it's possible to get a _symlink_ bound somewhere.  I'm not sure whether
we should allow that or not, but as it is that's doable.  And with
your ->follow_link() that means ending up with (vfsmount,dentry) pair
that has dentry completely unrelated to vfsmount.  The thing is,
nd->path is where we would start the traversal for a normal symlink, i.e.
the directory where that thing had been found.  Which almost always gets
covered by the vfsmount as the symlink itself, with one exception - when
that symlink is bound there.

_Normally_ namei can cope with that just fine - if that's a normal symlink,
everything ends up working and we only have to take a bit of care about
link->mnt in fs/namei.c:follow_link().  With procfs-style symlinks it
also happens to work.  With one exception - /proc/<pid>/ns/*

Hell knows; my preference would be to ban that kind of crap, especially
since it comes from a rather convoluted loophole...

BTW, Linus, remember the thread back in March when I was grumbling about
procfs-symlinks-to-symlinks?  We ended up killing that BUG_ON() in
nd_jump_link() and back then I hadn't found sufficiently convincing breakage
to require failing such suckers with -ELOOP.  Looks like now I have a
reasonably good argument - pathname lookups with LOOKUP_FOLLOW should _not_
yield symlinks.  That's the loophole refered to above...

If you have happily flushed that memory, the problem was that following
a procfs symlink gives us location in the tree, but that's not enough
when that location itself happens to be a symlink; to resolve a normal
symlink we need both the symlink itself *and* where had it been found.
When we'd arrived to that symlink by following a procfs-style symlink,
the second part is missing.

Now, normally that's not an issue - to hit that case you need O_PATH|O_NOFOLLOW
open of a symlink (giving you an O_PATH descriptor of symlink itself) and
try to follow /proc/<pid>/fd/<that descriptor>.  No other way to get that
kind of symlink-to-symlink in procfs.  I argued for ELOOP on following
those guys; you replied with
<quote>
Why? That's what I did last week, it seemed to be the RightThing(tm)
to do. It somebody has opened a symlink, opening that again through
/proc gives you the same symlink. That would be the natural semantics,
no? You do *not* want to follow it any further, afaik.
</quote>

The trouble is, there is code assuming that LOOKUP_FOLLOW pathname
resolution will stop at non-symlink or fail.  One such example is
do_loopback(), aka. mount --bind old new, which uses LOOKUP_FOLLOW to
resolve old.  However, using the trick above (open a symlink with
O_PATH|O_NOFOLLOW, then pass /proc/self/fd/<descriptor> to mount(2)
with MS_BIND in flags) you can actually have it yield a symlink.
It mostly works (Eric's proc_ns_follow_link() is actually the only
instance that has problems with that setup), but it's certainly
unexpected.

And I'm not at all sure that we have no other place using LOOKUP_FOLLOW
and breaking if what it gets is a symlink...  I can buy your argument
about opening them (O_PATH open of /proc/self/fd/something yields
O_PATH descriptor of whatever's opened there, and if it's a symlink,
so be it - you've asked for it, you've got it), but I would like to have
path_lookupat() with LOOKUP_FOLLOW fail if it's ended up in a symlink.
With ELOOP, probably.

I agree that nd_jump_link() is not a good place for that; I would add
        if (!err && nd->flags & LOOKUP_FOLLOW) {
                if (unlikely(d_is_symlink(nd->path.dentry))) {
                        path_put(&nd->path);
                        err = -ELOOP;
                }
        }
next to similar check for LOOKUP_DIRECTORY in path_lookupat().  Objections?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                                     ` <20131202164359.4f4f2c94-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2013-12-02 16:23                                                         ` J.Bruce Fields
@ 2013-12-02 16:23                                                       ` J.Bruce Fields
  1 sibling, 0 replies; 64+ messages in thread
From: J.Bruce Fields @ 2013-12-02 16:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Aditya Kali, NFS, Eric W. Biederman, Containers, Oleg Nesterov,
	Al Viro, Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

On Mon, Dec 02, 2013 at 04:43:59PM +1100, NeilBrown wrote:
> On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> 
> 
> > BTW, what happens if svc_export_request() ends up with pathname filling
> > almost all space left, so that qword_add(bpp, blen, pth) right after the
> > call of d_path() in there overwrites the beginning of d_path() output?
> > Neil?  And while we are at it, handling of overflow in there looks also
> > looks fishy...
> 
> In this case the returned *blen will be negative so cache_request() in
> net/sunrpc/cache.h will return -EAGAIN.
> cache_read() would then go into a tight loop, trying again and again to
> create the request, but it will never fit in the buffer.
> 
> I guess maybe an EINVAL might help there, plus code to skip over impossible
> requests.
> Maybe the following.  We'd need to double check that no ->cache_request
> function can fail in a way that is worth retrying, but I doubt they would.
> 
> Any thoughts Bruce?

Yes, the cache_request failures are all similarly fatal overflows.

Uh, not sure if I remember how the data structures here work: so
userspace is now going to get an -EINVAL on read, and may just end up
retrying the read?

Or do we hit the "need to release rq" case in cache_read and just
discard the request?

--b.

> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index a72de074172d..a065f827e2a3 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail,
>  
>  	detail->cache_request(detail, crq->item, &bp, &len);
>  	if (len < 0)
> -		return -EAGAIN;
> +		return -EINVAL;
>  	return PAGE_SIZE - len;
>  }
>  
> @@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count,
>  
>  	if (rq->len == 0) {
>  		err = cache_request(cd, rq);
> +		if (err == -EINVAL) {
> +			spin_lock(&queue_lock);
> +			list_move(&rp->q.list, &rq->q.list);
> +			spin_unlock(&queue_lock);
> +		}
>  		if (err < 0)
>  			goto out;
>  		rq->len = err;

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
  2013-12-02  5:43                                                     ` NeilBrown
@ 2013-12-02 16:23                                                         ` J.Bruce Fields
  -1 siblings, 0 replies; 64+ messages in thread
From: J.Bruce Fields @ 2013-12-02 16:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Al Viro, Eric W. Biederman, Serge E. Hallyn, Gao feng,
	Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Aditya Kali,
	Oleg Nesterov, Andy Lutomirski, Linus Torvalds, Eric Dumazet,
	NFS

On Mon, Dec 02, 2013 at 04:43:59PM +1100, NeilBrown wrote:
> On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> 
> 
> > BTW, what happens if svc_export_request() ends up with pathname filling
> > almost all space left, so that qword_add(bpp, blen, pth) right after the
> > call of d_path() in there overwrites the beginning of d_path() output?
> > Neil?  And while we are at it, handling of overflow in there looks also
> > looks fishy...
> 
> In this case the returned *blen will be negative so cache_request() in
> net/sunrpc/cache.h will return -EAGAIN.
> cache_read() would then go into a tight loop, trying again and again to
> create the request, but it will never fit in the buffer.
> 
> I guess maybe an EINVAL might help there, plus code to skip over impossible
> requests.
> Maybe the following.  We'd need to double check that no ->cache_request
> function can fail in a way that is worth retrying, but I doubt they would.
> 
> Any thoughts Bruce?

Yes, the cache_request failures are all similarly fatal overflows.

Uh, not sure if I remember how the data structures here work: so
userspace is now going to get an -EINVAL on read, and may just end up
retrying the read?

Or do we hit the "need to release rq" case in cache_read and just
discard the request?

--b.

> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index a72de074172d..a065f827e2a3 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail,
>  
>  	detail->cache_request(detail, crq->item, &bp, &len);
>  	if (len < 0)
> -		return -EAGAIN;
> +		return -EINVAL;
>  	return PAGE_SIZE - len;
>  }
>  
> @@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count,
>  
>  	if (rq->len == 0) {
>  		err = cache_request(cd, rq);
> +		if (err == -EINVAL) {
> +			spin_lock(&queue_lock);
> +			list_move(&rp->q.list, &rq->q.list);
> +			spin_unlock(&queue_lock);
> +		}
>  		if (err < 0)
>  			goto out;
>  		rq->len = err;


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
@ 2013-12-02 16:23                                                         ` J.Bruce Fields
  0 siblings, 0 replies; 64+ messages in thread
From: J.Bruce Fields @ 2013-12-02 16:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Al Viro, Eric W. Biederman, Serge E. Hallyn, Gao feng,
	Containers, linux-fsdevel, Aditya Kali, Oleg Nesterov,
	Andy Lutomirski, Linus Torvalds, Eric Dumazet, NFS

On Mon, Dec 02, 2013 at 04:43:59PM +1100, NeilBrown wrote:
> On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> 
> > BTW, what happens if svc_export_request() ends up with pathname filling
> > almost all space left, so that qword_add(bpp, blen, pth) right after the
> > call of d_path() in there overwrites the beginning of d_path() output?
> > Neil?  And while we are at it, handling of overflow in there looks also
> > looks fishy...
> 
> In this case the returned *blen will be negative so cache_request() in
> net/sunrpc/cache.h will return -EAGAIN.
> cache_read() would then go into a tight loop, trying again and again to
> create the request, but it will never fit in the buffer.
> 
> I guess maybe an EINVAL might help there, plus code to skip over impossible
> requests.
> Maybe the following.  We'd need to double check that no ->cache_request
> function can fail in a way that is worth retrying, but I doubt they would.
> 
> Any thoughts Bruce?

Yes, the cache_request failures are all similarly fatal overflows.

Uh, not sure if I remember how the data structures here work: so
userspace is now going to get an -EINVAL on read, and may just end up
retrying the read?

Or do we hit the "need to release rq" case in cache_read and just
discard the request?

--b.

> 
> Thanks,
> NeilBrown
> 
> 
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index a72de074172d..a065f827e2a3 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail,
>  
>  	detail->cache_request(detail, crq->item, &bp, &len);
>  	if (len < 0)
> -		return -EAGAIN;
> +		return -EINVAL;
>  	return PAGE_SIZE - len;
>  }
>  
> @@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count,
>  
>  	if (rq->len == 0) {
>  		err = cache_request(cd, rq);
> +		if (err == -EINVAL) {
> +			spin_lock(&queue_lock);
> +			list_move(&rp->q.list, &rq->q.list);
> +			spin_unlock(&queue_lock);
> +		}
>  		if (err < 0)
>  			goto out;
>  		rq->len = err;



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
       [not found]                                                           ` <20131129195327.GA12974-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-12-13 22:07                                                             ` Richard Weinberger
  0 siblings, 0 replies; 64+ messages in thread
From: Richard Weinberger @ 2013-12-13 22:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Aditya Kali, Containers, Andy Lutomirski, linux-fsdevel,
	Eric W. Biederman

On Fri, Nov 29, 2013 at 8:53 PM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On 11/27, Eric W. Biederman wrote:
>>
>> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>>
>> > Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>> >>
>> >> Lets ignore the implementation details. Suppose that proc was never
>> >> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
>> >
>> > Yes.
>>
>> Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.
>
> Yes, yes, I understand, the mounter should be CAP_SYS_ADMIN in
> task_active_pid_ns().
>
>> Has this cleared up the confusion?
>
> Yes. Thanks.
>
> Oleg.

What is the state of this work?
It would be nice to see this fixed seen.
Users on 3.12 suffer already from this regression.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc
  2013-11-29 19:53                                                         ` Oleg Nesterov
       [not found]                                                           ` <20131129195327.GA12974-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-12-13 22:07                                                           ` Richard Weinberger
  1 sibling, 0 replies; 64+ messages in thread
From: Richard Weinberger @ 2013-12-13 22:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Serge E. Hallyn, Gao feng, Containers,
	linux-fsdevel, Aditya Kali, Andy Lutomirski

On Fri, Nov 29, 2013 at 8:53 PM, Oleg Nesterov <oleg@redhat.com> wrote:
> On 11/27, Eric W. Biederman wrote:
>>
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>>
>> > Oleg Nesterov <oleg@redhat.com> writes:
>> >>
>> >> Lets ignore the implementation details. Suppose that proc was never
>> >> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS?
>> >
>> > Yes.
>>
>> Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID.
>
> Yes, yes, I understand, the mounter should be CAP_SYS_ADMIN in
> task_active_pid_ns().
>
>> Has this cleared up the confusion?
>
> Yes. Thanks.
>
> Oleg.

What is the state of this work?
It would be nice to see this fixed seen.
Users on 3.12 suffer already from this regression.

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
  2013-11-30 22:43                                                       ` Al Viro
       [not found]                                                         ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2014-01-17  3:29                                                         ` Eric W. Biederman
  2014-01-17  3:29                                                         ` Eric W. Biederman
  2 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2014-01-17  3:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:

> On Sat, Nov 30, 2013 at 01:51:07PM -0800, Eric W. Biederman wrote:
>> Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes:
>> 
>> > Eric, what behaviour do you want there?
>> 
>> Here is the behavior I want.
>> 
>> a) Something reasonable in /proc/self/fd when I just open one of these.
>> 
>> root@unassigned:~# exec 5< /proc/self/ns/net
>> root@unassigned:~# ls -l /proc/self/fd
>> total 0
>> lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0
>> lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0
>> lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0
>> lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd
>> lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955]
>> root@unassigned:~# ip netns add foo
>> 
>> b) Something reasonable in /proc/mounts when I bind mount one of these.
>> 
>> root@unassigned:~# ip netns add foo
>> root@unassigned:~# cat /proc/mounts 
>> [...]
>> proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0
>
> Sigh...  What do you want to see in /proc/self/mountinfo?

With or without my patch I see what I am after right now is:
31 12 0:3 / /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw

Apologies for not being clear on that point.

However it would be nice to see:
28 13 0:3 net:[4026532127] /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw

Not having the path be the magical floating path is not causing
regressions for people so I don't care much at the moment.

>> Any dentry allocated with d_alloc_pseudo will have the same problem if
>> it is ever in a context where it can be bind mounted.  So this is a
>> general issue with d_name.
>
> Yes, ->d_dname() is called in a wrong place.  It should be in prepend_path
> if not deeper.  What you are doing is growing a kludge on top of that
> mistake.

I disagree. I am not growing a kludge on of a mistake.  I am refining
the logic at the call site of d_dname.  The logic to not call d_dname
on a mountpoint should be there wherever we call d_dname.

Which makes my change completely orthogonal to moving the call of
d_dname into the strangeness that is prepend_path.

> Note that your patch does nothing for mountinfo.

And it doesn't need to do anything for mountinfo.  My desire and the
only sensible semantics I can think of is for non-mountpoints to have
d_dname called on them.  By definition /proc/mountinfo only shows mount
points, so there is never a neeed to call d_dpath is /proc/mountinfo.


Looking at what is going on in detail, the implementations of the
.d_dname method are:

arch/ia64/kernel/perfmon.c: pfmfs_dname
    ia64 perfmon files are unlinked character devices whose dentries are
    allocated with d_alloc, and the filesystem is mounted with kern_mount.
fs/aio.c:simple_dname
    aio private files that are non-directories whose dentries are 
    allocated with d_alloc_pseudo, and the filesystem is mounted with kern_mount.
fs/anon_inodes.c:anon_inodefs_dname
    anonymous inodes that are non-directories whose dentries are
    allocated with d_alloc_pseudo, and the filesystem is mounted with
    kern_mount.    
fs/pipe.c:pipefs_dname
    pipes that are non-directories whose dentries are allocated with
    d_alloc_psuedo, and the fs is mounted with kern_mount.
net/socket.c:sockfs_dname
    sockets are non-directories whose dentries are allocated with
    d_alloc_pseudo, and the filesystem is mounted with kern_mount.
mm/shmem.c:simple_dname
    shared memory segments are non-directories whose dentries are
    allocated with d_alloc_pseudo, and the filesystem is mounted with
    kern_mount.
fs/hugetlbfs/inode.c:simple_dname
    hugetlb shared memory segments are non-directories whose dentries
    are allocted with d_alloc_pseudo, and the filesystem is mounted with
    kern_mount_data.
fs/proc/namespaces.c:ns_dname
    namespace files are non-directories whose dentries are allocated
    with d_alloc_pseudo, and the filesystem is mounted normally.


In the worst case prepend_path is safe to use on the dentries that
implement d_dname as their dentry name field has valid data.

The callers of prepend_path that are not d_path are:
__d_path
d_absolute_path
getcwd



getcwd need not be worried about because get_fs_root_and_pwd will always
return directories.  And none of the implementations of d_dname are
directories.



__d_path is has only two callers: fs/seq_file.c:seq_path_root and
security/apparmor/path.c:d_namespace_path.

fs/seq_file.c: seq_path_root which is only used in
fs/proc_namespaces.c:show_mountinfo aka /proc/mountinfo.  Since this is
only dealing with mounted paths there will never be a need to call
d_dname.

security/apparmor/path.c:d_namespace_path in apparmor is almost
interesting, it is called on files that come from exec (brpm->file),
link, rename, open, unlink, create, truncate, chown, chmod, and getattr.
Which with clever usage of magic proc symlinks could conceivably point
to a file descriptor that implements d_dname.  However d_namespace_path
tests the mountpoint for MNT_INTERNAL and calls dentry_path if that is
the case, avoiding __d_path or d_absolute_path, except for the proc
namespace files.  In either case today if mounted a sensible result and
if not mounted an empty path.

The only other caller of d_absolute_path is
security/tomoyo/realpath.c:tomoyo_get_absolute_path.  However
tomyo_get_absolute_path is only called from tomoyo_realpath_from_path
which checks for dentries that implement d_dname and calls them
explicitly, only calling d_absolute_name if there is no such
implementation.  



Which is a long way of saying that right now moving the call of d_dname
into prepend_path would make no practical difference.   

At a practical level there are improvements to be had by calling d_dname
in dentry_path and by adding my avoidance of calling d_dname on a
mountpoint into tomoyo_realpath_from_path.

So while I see what you mean by prepend_path needing cleaning up that is
really orthogonal to the acidental regression that I am fixing.

> As for redesign...  That's exactly what is needed in that area, instead of
> letting it fester.

I can't backport a redesign to fix the regression, and a redesign
results in no practical benefit for me.  So I might be persuaded later
if I can find the time but right now a redesign is the wrong place to
start.

> As for check_mnt(old) in do_loopback()...  I wonder if we shouldn't just
> turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)).
> Note that MS_NOUSER in superblock flags would prevent binding pipefs and
> all mount_pseudo() users, so we wouldn't change existing behaviour...

Strictly speaking behavior would change for the proc namespace files, as
now you could do things by finding a mount of proc in another mount
namespace where the namespace files were available.

I would be a lot more comfortable with changing do_loopback if we could
safely remove the check_mnt(old) test altogether.  Why do we have the
check_mnt(old) test in do_loopback?   If we can't remove the
check_mnt(old) test I need verify that the reasons for that test don't
apply to my namespace file descriptor case.  Otherwise I can't see any
real problems with making that change (time permitting).



After getting this regression fix merged, the windmill that I expect I
will be tilting at are fixing the user space visible races that let a
dentry be renamed while we are mounting something on it, and the messy
fact that check_submounts_and_drop isn't part of d_invalidate.  Both of
those can potentially result in user space visible insantiy.

Eric

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
  2013-11-30 22:43                                                       ` Al Viro
       [not found]                                                         ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2014-01-17  3:29                                                         ` Eric W. Biederman
@ 2014-01-17  3:29                                                         ` Eric W. Biederman
       [not found]                                                           ` <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 64+ messages in thread
From: Eric W. Biederman @ 2014-01-17  3:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel,
	Aditya Kali, Oleg Nesterov, Andy Lutomirski, Linus Torvalds,
	Eric Dumazet, Neil Brown

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Sat, Nov 30, 2013 at 01:51:07PM -0800, Eric W. Biederman wrote:
>> Al Viro <viro@ZenIV.linux.org.uk> writes:
>> 
>> > Eric, what behaviour do you want there?
>> 
>> Here is the behavior I want.
>> 
>> a) Something reasonable in /proc/self/fd when I just open one of these.
>> 
>> root@unassigned:~# exec 5< /proc/self/ns/net
>> root@unassigned:~# ls -l /proc/self/fd
>> total 0
>> lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0
>> lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0
>> lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0
>> lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd
>> lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955]
>> root@unassigned:~# ip netns add foo
>> 
>> b) Something reasonable in /proc/mounts when I bind mount one of these.
>> 
>> root@unassigned:~# ip netns add foo
>> root@unassigned:~# cat /proc/mounts 
>> [...]
>> proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0
>
> Sigh...  What do you want to see in /proc/self/mountinfo?

With or without my patch I see what I am after right now is:
31 12 0:3 / /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw

Apologies for not being clear on that point.

However it would be nice to see:
28 13 0:3 net:[4026532127] /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw

Not having the path be the magical floating path is not causing
regressions for people so I don't care much at the moment.

>> Any dentry allocated with d_alloc_pseudo will have the same problem if
>> it is ever in a context where it can be bind mounted.  So this is a
>> general issue with d_name.
>
> Yes, ->d_dname() is called in a wrong place.  It should be in prepend_path
> if not deeper.  What you are doing is growing a kludge on top of that
> mistake.

I disagree. I am not growing a kludge on of a mistake.  I am refining
the logic at the call site of d_dname.  The logic to not call d_dname
on a mountpoint should be there wherever we call d_dname.

Which makes my change completely orthogonal to moving the call of
d_dname into the strangeness that is prepend_path.

> Note that your patch does nothing for mountinfo.

And it doesn't need to do anything for mountinfo.  My desire and the
only sensible semantics I can think of is for non-mountpoints to have
d_dname called on them.  By definition /proc/mountinfo only shows mount
points, so there is never a neeed to call d_dpath is /proc/mountinfo.


Looking at what is going on in detail, the implementations of the
.d_dname method are:

arch/ia64/kernel/perfmon.c: pfmfs_dname
    ia64 perfmon files are unlinked character devices whose dentries are
    allocated with d_alloc, and the filesystem is mounted with kern_mount.
fs/aio.c:simple_dname
    aio private files that are non-directories whose dentries are 
    allocated with d_alloc_pseudo, and the filesystem is mounted with kern_mount.
fs/anon_inodes.c:anon_inodefs_dname
    anonymous inodes that are non-directories whose dentries are
    allocated with d_alloc_pseudo, and the filesystem is mounted with
    kern_mount.    
fs/pipe.c:pipefs_dname
    pipes that are non-directories whose dentries are allocated with
    d_alloc_psuedo, and the fs is mounted with kern_mount.
net/socket.c:sockfs_dname
    sockets are non-directories whose dentries are allocated with
    d_alloc_pseudo, and the filesystem is mounted with kern_mount.
mm/shmem.c:simple_dname
    shared memory segments are non-directories whose dentries are
    allocated with d_alloc_pseudo, and the filesystem is mounted with
    kern_mount.
fs/hugetlbfs/inode.c:simple_dname
    hugetlb shared memory segments are non-directories whose dentries
    are allocted with d_alloc_pseudo, and the filesystem is mounted with
    kern_mount_data.
fs/proc/namespaces.c:ns_dname
    namespace files are non-directories whose dentries are allocated
    with d_alloc_pseudo, and the filesystem is mounted normally.


In the worst case prepend_path is safe to use on the dentries that
implement d_dname as their dentry name field has valid data.

The callers of prepend_path that are not d_path are:
__d_path
d_absolute_path
getcwd



getcwd need not be worried about because get_fs_root_and_pwd will always
return directories.  And none of the implementations of d_dname are
directories.



__d_path is has only two callers: fs/seq_file.c:seq_path_root and
security/apparmor/path.c:d_namespace_path.

fs/seq_file.c: seq_path_root which is only used in
fs/proc_namespaces.c:show_mountinfo aka /proc/mountinfo.  Since this is
only dealing with mounted paths there will never be a need to call
d_dname.

security/apparmor/path.c:d_namespace_path in apparmor is almost
interesting, it is called on files that come from exec (brpm->file),
link, rename, open, unlink, create, truncate, chown, chmod, and getattr.
Which with clever usage of magic proc symlinks could conceivably point
to a file descriptor that implements d_dname.  However d_namespace_path
tests the mountpoint for MNT_INTERNAL and calls dentry_path if that is
the case, avoiding __d_path or d_absolute_path, except for the proc
namespace files.  In either case today if mounted a sensible result and
if not mounted an empty path.

The only other caller of d_absolute_path is
security/tomoyo/realpath.c:tomoyo_get_absolute_path.  However
tomyo_get_absolute_path is only called from tomoyo_realpath_from_path
which checks for dentries that implement d_dname and calls them
explicitly, only calling d_absolute_name if there is no such
implementation.  



Which is a long way of saying that right now moving the call of d_dname
into prepend_path would make no practical difference.   

At a practical level there are improvements to be had by calling d_dname
in dentry_path and by adding my avoidance of calling d_dname on a
mountpoint into tomoyo_realpath_from_path.

So while I see what you mean by prepend_path needing cleaning up that is
really orthogonal to the acidental regression that I am fixing.

> As for redesign...  That's exactly what is needed in that area, instead of
> letting it fester.

I can't backport a redesign to fix the regression, and a redesign
results in no practical benefit for me.  So I might be persuaded later
if I can find the time but right now a redesign is the wrong place to
start.

> As for check_mnt(old) in do_loopback()...  I wonder if we shouldn't just
> turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)).
> Note that MS_NOUSER in superblock flags would prevent binding pipefs and
> all mount_pseudo() users, so we wouldn't change existing behaviour...

Strictly speaking behavior would change for the proc namespace files, as
now you could do things by finding a mount of proc in another mount
namespace where the namespace files were available.

I would be a lot more comfortable with changing do_loopback if we could
safely remove the check_mnt(old) test altogether.  Why do we have the
check_mnt(old) test in do_loopback?   If we can't remove the
check_mnt(old) test I need verify that the reasons for that test don't
apply to my namespace file descriptor case.  Otherwise I can't see any
real problems with making that change (time permitting).



After getting this regression fix merged, the windmill that I expect I
will be tilting at are fixing the user space visible races that let a
dentry be renamed while we are mounting something on it, and the messy
fact that check_submounts_and_drop isn't part of d_invalidate.  Both of
those can potentially result in user space visible insantiy.

Eric


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point
       [not found]                                                           ` <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2014-01-17  8:39                                                             ` Al Viro
       [not found]                                                               ` <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Al Viro @ 2014-01-17  8:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds

On Thu, Jan 16, 2014 at 07:29:16PM -0800, Eric W. Biederman wrote:

> However it would be nice to see:
> 28 13 0:3 net:[4026532127] /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw
> 
> Not having the path be the magical floating path is not causing
> regressions for people so I don't care much at the moment.

> >> Any dentry allocated with d_alloc_pseudo will have the same problem if
> >> it is ever in a context where it can be bind mounted.  So this is a
> >> general issue with d_name.
> >
> > Yes, ->d_dname() is called in a wrong place.  It should be in prepend_path
> > if not deeper.  What you are doing is growing a kludge on top of that
> > mistake.
> 
> I disagree. I am not growing a kludge on of a mistake.  I am refining
> the logic at the call site of d_dname.  The logic to not call d_dname
> on a mountpoint should be there wherever we call d_dname.

What, in your opinion does the word "kludge" mean?  The usual meaning is
"a change that might work, but depends on too many easy-to-break
accidental details of the current construction".  If you want to use
a different term, fine, but the problem doesn't disappear from renaming
things.

I agree that your patch doesn't break things and I've said so from the
very beginning.

[snip the obvious analysis]

> At a practical level there are improvements to be had by calling d_dname
> in dentry_path and by adding my avoidance of calling d_dname on a
> mountpoint into tomoyo_realpath_from_path.

... i.e. making things even more convoluted.  Congratulations, that's
just what we need.

> So while I see what you mean by prepend_path needing cleaning up that is
> really orthogonal to the acidental regression that I am fixing.
> 
> > As for redesign...  That's exactly what is needed in that area, instead of
> > letting it fester.
> 
> I can't backport a redesign to fix the regression, and a redesign
> results in no practical benefit for me.  So I might be persuaded later
> if I can find the time but right now a redesign is the wrong place to
> start.

Gotta love the style, but... I'm not persuaded by the above.  At all.

> > As for check_mnt(old) in do_loopback()...  I wonder if we shouldn't just
> > turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)).
> > Note that MS_NOUSER in superblock flags would prevent binding pipefs and
> > all mount_pseudo() users, so we wouldn't change existing behaviour...
> 
> Strictly speaking behavior would change for the proc namespace files, as
> now you could do things by finding a mount of proc in another mount
> namespace where the namespace files were available.

Er...  Yes, and?  If you can do that, you can bloody well just open them
there and then do exact same thing.

The visible change would be different - it would be the ability to bind
hugetlbfs and shmem anon mappings via /proc/*/map_files/*

> I would be a lot more comfortable with changing do_loopback if we could
> safely remove the check_mnt(old) test altogether.  Why do we have the
> check_mnt(old) test in do_loopback?   If we can't remove the
> check_mnt(old) test I need verify that the reasons for that test don't
> apply to my namespace file descriptor case.  Otherwise I can't see any
> real problems with making that change (time permitting).

Sigh...  You do realize that right now your proc_ns_follow_link() violates
a pretty basic assert: nd->path.mnt->mnt_sb == nd->path.dentry->d_sb,
right?  IOW, that dentry belongs to the filesystem pointed to be vfsmount.

At ->follow_link() time we have two objects to keep track of: where we are
in pathname resolution (i.e. which directory had we reached) and what
symlink are we resolving.  nd->path points to the former; the first
argument of ->follow_link() - to the latter.  _Usually_ the parent is
on the same vfsmount as the symlink, so your shortcut doesn't blow up
all the time.  However, that is not guaranteed - it *is* possible to
bind a symlink.  Not conveniently at the moment, but it can be done and
there's no fundamental reasons why it should be forbidden.  And in that
setup proc_ns_follow_link() is in trouble.

The reasons for check_mnt() there are, IIRC, more about playing silly buggers
with propagation trees from some other namespace.  IOW, they do not apply
to cloning stuff off the internal vfsmounts.

/proc/*/map_files/* is potentially a problem, though - being able to
create a binding to something on hugetlbfs might have interesting implications.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/4] d_dname cleanups
       [not found]                                                               ` <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2014-02-07  2:21                                                                 ` Eric W. Biederman
       [not found]                                                                   ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 64+ messages in thread
From: Eric W. Biederman @ 2014-02-07  2:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds


Al,

This is a small pile of cleanups to the callers d_dname.

This patset changes all implementors of d_dname to d_alloc_pseudo,
simplifies the set of conditions in which we consider calling d_dname,
moves the code into prepend_path, and finally adds a call to d_dname
into __dentry_path allowing __dentry_path will return sensible for
dentries that implement d_dname.

The patches are available at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-vfs-d_dname-cleanups

Or in the patches following this email.

Eric W. Biederman (4):
      perfmon:  Use d_alloc_pseudo like all of the d_dname callers.
      vfs: Simply when d_alloc_dname is called.
      vfs: Move the call of d_op->d_dname from d_path to prepend_path
      vfs: Call d_dname from dentry_path

 arch/ia64/kernel/perfmon.c |    4 ++--
 fs/dcache.c                |   38 +++++++++++++++++++++++---------------
 2 files changed, 25 insertions(+), 17 deletions(-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers.
       [not found]                                                                   ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
@ 2014-02-07  2:23                                                                     ` Eric W. Biederman
  2014-02-07  2:23                                                                     ` [PATCH 2/4] vfs: Simply when d_alloc_dname is called Eric W. Biederman
                                                                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2014-02-07  2:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds



Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 arch/ia64/kernel/perfmon.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index cb592773c78b..d1611f223768 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -2202,14 +2202,14 @@ pfm_alloc_file(pfm_context_t *ctx)
 	/*
 	 * allocate a new dcache entry
 	 */
-	path.dentry = d_alloc(pfmfs_mnt->mnt_root, &this);
+	path.dentry = d_alloc_pseudo(pfmfs_mnt->mnt_sb, &this);
 	if (!path.dentry) {
 		iput(inode);
 		return ERR_PTR(-ENOMEM);
 	}
 	path.mnt = mntget(pfmfs_mnt);
 
-	d_add(path.dentry, inode);
+	d_instantiate(path.dentry, inode);
 
 	file = alloc_file(&path, FMODE_READ, &pfm_file_ops);
 	if (IS_ERR(file)) {
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 2/4] vfs: Simply when d_alloc_dname is called.
       [not found]                                                                   ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2014-02-07  2:23                                                                     ` [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers Eric W. Biederman
@ 2014-02-07  2:23                                                                     ` Eric W. Biederman
  2014-02-07  2:24                                                                     ` [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path Eric W. Biederman
  2014-02-07  2:24                                                                     ` [PATCH 4/4] vfs: Call d_dname from dentry_path Eric W. Biederman
  3 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2014-02-07  2:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds


Now that all implementations of d_dname use d_alloc_psuedo only consider
calling d_dname on root dentries, as all dentries returned
d_alloc_pseudo by d_alloc_pseudo are root dentries.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c |   16 ++++++----------
 1 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce9769c..c250d97befe4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3056,18 +3056,14 @@ char *d_path(const struct path *path, char *buf, int buflen)
 	int error;
 
 	/*
-	 * We have various synthetic filesystems that never get mounted.  On
-	 * these filesystems dentries are never used for lookup purposes, and
-	 * thus don't need to be hashed.  They also don't need a name until a
-	 * user wants to identify the object in /proc/pid/fd/.  The little hack
-	 * below allows us to generate a name for these objects on demand:
-	 *
-	 * Some pseudo inodes are mountable.  When they are mounted
-	 * path->dentry == path->mnt->mnt_root.  In that case don't call d_dname
-	 * and instead have d_path return the mounted path.
+	 * We have various synthetic files allocated with
+	 * d_alloc_pseudo that are not available through
+	 * ordinary path lookup and don't need a name until
+	 * a user wants to identify the object in
+	 * /proc/pid/fd/ or similiar.
 	 */
 	if (path->dentry->d_op && path->dentry->d_op->d_dname &&
-	    (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root))
+	    IS_ROOT(path->dentry) && path->dentry != path->mnt->mnt_root)
 		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
 
 	rcu_read_lock();
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path
       [not found]                                                                   ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
  2014-02-07  2:23                                                                     ` [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers Eric W. Biederman
  2014-02-07  2:23                                                                     ` [PATCH 2/4] vfs: Simply when d_alloc_dname is called Eric W. Biederman
@ 2014-02-07  2:24                                                                     ` Eric W. Biederman
  2014-02-07  2:24                                                                     ` [PATCH 4/4] vfs: Call d_dname from dentry_path Eric W. Biederman
  3 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2014-02-07  2:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds


For clarity move the call of d_dname from d_path into prepend_path.
There are only 4 callers of prepend_path and this change does not
affect the results returned by any of them.  d_absolute_path
and __d_path return an error and ignore the output of prepend_path
error == 2. getcwd only operates on directories and all of the
implementations of d_dname are on files.

For d_path the argument of no changes in output is a little trickier.
path_with_deleted does not trigger because pseudo dentries are match
IS_ROOT so is_unlinked is false.  All of current implementations of
d_dname are files so they are not currently anyone's d_parent, which
means today that either the first iteration of prepend_path will call
d_dname or none of the iterations of prepend_path will call d_dname.

d_dname is called from the while loop in prepend_path so that if it
happens in the future that any of the implementors of d_dname are a
directory instead of a file then prepend_path will work properly on them.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c |   32 +++++++++++++++++++++-----------
 1 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c250d97befe4..c5c7847ff84b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2892,6 +2892,27 @@ restart:
 	while (dentry != root->dentry || vfsmnt != root->mnt) {
 		struct dentry * parent;
 
+		/*
+		 * We have various synthetic files allocated with
+		 * d_alloc_pseudo that are not available through
+		 * ordinary path lookup and don't need a name until
+		 * a user wants to identify the object in
+		 * /proc/pid/fd/ or similiar.
+		 */
+		if (IS_ROOT(dentry) && dentry != vfsmnt->mnt_root &&
+		    dentry->d_op && dentry->d_op->d_dname) {
+			char *buf  = bptr - blen;
+			char *name = dentry->d_op->d_dname(dentry, buf, blen);
+			if (IS_ERR(name)) {
+				error = PTR_ERR(name);
+				break;
+			}
+			blen = name - buf;
+			bptr = name;
+			error = 2; /* unmounted by definition */
+			break;
+		}
+
 		if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) {
 			struct mount *parent = ACCESS_ONCE(mnt->mnt_parent);
 			/* Global root? */
@@ -3055,17 +3076,6 @@ char *d_path(const struct path *path, char *buf, int buflen)
 	struct path root;
 	int error;
 
-	/*
-	 * We have various synthetic files allocated with
-	 * d_alloc_pseudo that are not available through
-	 * ordinary path lookup and don't need a name until
-	 * a user wants to identify the object in
-	 * /proc/pid/fd/ or similiar.
-	 */
-	if (path->dentry->d_op && path->dentry->d_op->d_dname &&
-	    IS_ROOT(path->dentry) && path->dentry != path->mnt->mnt_root)
-		return path->dentry->d_op->d_dname(path->dentry, buf, buflen);
-
 	rcu_read_lock();
 	get_fs_root_rcu(current->fs, &root);
 	error = path_with_deleted(path, &root, &res, &buflen);
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 4/4] vfs: Call d_dname from dentry_path
       [not found]                                                                   ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
                                                                                       ` (2 preceding siblings ...)
  2014-02-07  2:24                                                                     ` [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path Eric W. Biederman
@ 2014-02-07  2:24                                                                     ` Eric W. Biederman
  3 siblings, 0 replies; 64+ messages in thread
From: Eric W. Biederman @ 2014-02-07  2:24 UTC (permalink / raw)
  To: Al Viro
  Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov,
	Andy Lutomirski, Eric Dumazet,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds


If the root dentry implements d_dname instead of ignoring it in
__dentry_path call d_dname and return the result.

The motivating example are files under /proc/<pid>/ns/ are bind
mounted into other locations.  This change allow the name of the
dentry that is bind mounted to show up in /proc/<pid>/mountinfo.
However in general this is correct behavior any time dentry_path is
called on a dentry that implements d_dname.

Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/dcache.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c5c7847ff84b..a7a925fb3ce7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3162,6 +3162,8 @@ restart:
 	done_seqretry(&rename_lock, seq);
 	if (error)
 		goto Elong;
+	if (dentry->d_op && dentry->d_op->d_dname)
+		retval = dentry->d_op->d_dname(dentry, buf, len);
 	return retval;
 Elong:
 	return ERR_PTR(-ENAMETOOLONG);
-- 
1.7.5.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2014-02-07  2:24 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-15 16:41 Regression wrt mounting /proc in user namespace in 3.13 Daniel P. Berrange
     [not found] ` <20131115164123.GN28794-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-11-16 16:48   ` Serge E. Hallyn
     [not found]     ` <20131116164840.GA4441-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-17  3:06       ` Serge E. Hallyn
     [not found]         ` <20131117030653.GA7670-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-18  3:19           ` Serge E. Hallyn
     [not found]             ` <20131118031932.GA17621-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-18  4:52               ` Gao feng
     [not found]                 ` <52899D09.5080202-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-18 14:08                   ` Serge E. Hallyn
     [not found]                     ` <20131118140830.GA22075-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-18 18:01                       ` Serge E. Hallyn
     [not found]                         ` <20131118180134.GA24156-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-19  1:51                           ` Eric W. Biederman
     [not found]                             ` <87k3g5gnuv.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-19  3:47                               ` Serge E. Hallyn
2013-11-26 18:10                               ` Serge E. Hallyn
     [not found]                                 ` <20131126181043.GA25492-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-27  0:14                                   ` [REVIEW][PATCH 0/3] userns fixes for v3.13-rc1 Eric W. Biederman
     [not found]                                     ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27  0:16                                       ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman
2013-11-27  1:58                                         ` Serge E. Hallyn
     [not found]                                         ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27  1:58                                           ` Serge E. Hallyn
2013-11-30  6:15                                           ` Al Viro
     [not found]                                             ` <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2013-11-30 17:02                                               ` Al Viro
     [not found]                                                 ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2013-11-30 21:51                                                   ` Eric W. Biederman
     [not found]                                                     ` <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-30 22:43                                                       ` Al Viro
     [not found]                                                         ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2013-12-02  7:29                                                           ` Al Viro
2014-01-17  3:29                                                         ` Eric W. Biederman
2014-01-17  3:29                                                         ` Eric W. Biederman
     [not found]                                                           ` <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2014-01-17  8:39                                                             ` Al Viro
     [not found]                                                               ` <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2014-02-07  2:21                                                                 ` [PATCH 0/4] d_dname cleanups Eric W. Biederman
     [not found]                                                                   ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2014-02-07  2:23                                                                     ` [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers Eric W. Biederman
2014-02-07  2:23                                                                     ` [PATCH 2/4] vfs: Simply when d_alloc_dname is called Eric W. Biederman
2014-02-07  2:24                                                                     ` [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path Eric W. Biederman
2014-02-07  2:24                                                                     ` [PATCH 4/4] vfs: Call d_dname from dentry_path Eric W. Biederman
2013-12-01  5:09                                                   ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro
     [not found]                                                     ` <20131201050930.GB10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
2013-12-01  6:15                                                       ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mountpoint Tetsuo Handa
2013-12-01  6:15                                                     ` Tetsuo Handa
2013-12-02  5:43                                                   ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point NeilBrown
2013-12-02  5:43                                                     ` NeilBrown
     [not found]                                                     ` <20131202164359.4f4f2c94-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2013-12-02 16:23                                                       ` J.Bruce Fields
2013-12-02 16:23                                                         ` J.Bruce Fields
2013-12-02 16:23                                                       ` J.Bruce Fields
2013-11-27  0:16                                       ` [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) Eric W. Biederman
2013-11-27  0:16                                         ` Eric W. Biederman
     [not found]                                         ` <87vbzezojq.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27  1:58                                           ` Serge E. Hallyn
2013-11-27  0:17                                       ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman
     [not found]                                         ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27  0:21                                           ` Andy Lutomirski
     [not found]                                             ` <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-27  0:36                                               ` Eric W. Biederman
2013-11-27  2:00                                           ` Serge E. Hallyn
2013-11-27  3:19                                           ` Gao feng
     [not found]                                             ` <529564AA.8050100-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-27  5:00                                               ` Eric W. Biederman
2013-11-27  5:00                                             ` Eric W. Biederman
2013-11-27 16:13                                           ` Oleg Nesterov
2013-11-27 16:13                                         ` Oleg Nesterov
     [not found]                                           ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-11-27 16:29                                             ` Serge E. Hallyn
     [not found]                                               ` <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-27 18:09                                                 ` Oleg Nesterov
2013-11-27 16:41                                             ` Andy Lutomirski
     [not found]                                               ` <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-27 18:10                                                 ` Oleg Nesterov
2013-11-27 18:51                                             ` Eric W. Biederman
     [not found]                                               ` <871u21oeyr.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27 19:47                                                 ` Oleg Nesterov
2013-11-27 19:47                                               ` Oleg Nesterov
     [not found]                                                 ` <20131127194722.GA32673-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-11-27 19:52                                                   ` Eric W. Biederman
2013-11-27 19:52                                                 ` Eric W. Biederman
     [not found]                                                   ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27 20:01                                                     ` Oleg Nesterov
2013-11-27 20:07                                                     ` Eric W. Biederman
2013-11-27 20:41                                                       ` Andy Lutomirski
     [not found]                                                         ` <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-29 14:56                                                           ` Serge E. Hallyn
     [not found]                                                       ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-27 20:41                                                         ` Andy Lutomirski
2013-11-29 19:53                                                         ` Oleg Nesterov
     [not found]                                                           ` <20131129195327.GA12974-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-12-13 22:07                                                             ` Richard Weinberger
2013-12-13 22:07                                                           ` Richard Weinberger

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.