util-linux.vger.kernel.org archive mirror
* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
       [not found]                   ` <3e656465c427487e4ea14151b77d391d52cd6bad.camel@themaw.net>
@ 2020-02-27 13:45                     ` Miklos Szeredi
  2020-02-27 15:14                       ` Karel Zak
  2020-02-28  0:12                       ` Ian Kent
  0 siblings, 2 replies; 13+ messages in thread
From: Miklos Szeredi @ 2020-02-27 13:45 UTC (permalink / raw)
  To: Ian Kent
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml, Karel Zak,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
>
> On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> >
> > > At the least the question of "do we need a highly efficient way
> > > to query the superblock parameters all at once" needs to be
> > > extended to include mount table enumeration as well as getting
> > > the info.
> > >
> > > But this is just me thinking about mount table handling and the
> > > quite significant problem we now have with user space scanning
> > > the proc mount tables to get this information.
> >
> > Right.
> >
> > So the problem is that currently autofs needs to rescan the proc
> > mount
> > table on every change.   The solution to that is to
>
> Actually no, that's not quite the problem I see.
>
> autofs handles large mount tables fairly well (necessarily) and
> in time I plan to remove the need to read the proc tables at all
> (that's proven very difficult but I'll get back to that).
>
> This has to be done to resolve the age old problem of autofs not
> being able to handle large direct mount maps. But, because of
> the large number of mounts associated with large direct mount
> maps, other system processes are badly affected too.
>
> So the problem I want to see fixed is the effect of very large
> mount tables on other user space applications, particularly the
> effect when a large number of mounts or umounts are performed.
>
> Clearly large mount tables not only result from autofs and the
> problems caused by them are slightly different to the mount and
> umount problem I describe. But they are a problem nevertheless
> in the sense that frequent notifications that lead to reading
> a large proc mount table have significant overhead that can't be
> avoided because the table may have changed since the last time
> it was read.
>
> It's easy to cause several system processes to peg a fair number
> of CPUs when a large number of mounts/umounts are being performed,
> namely systemd, udisks2 and some others. Also I've seen a couple
> of application processes badly affected purely by the presence of
> a large number of mounts in the proc tables, that's not quite so
> bad though.
>
> >
> >  - add a notification mechanism   - lookup a mount based on path
> >  - and a way to selectively query mount/superblock information
> based on path ...
> >
> > right?
> >
> > For the notification we have uevents in sysfs, which also supplies
> > the
> > changed parameters.  Taking aside namespace issues and addressing
> > mounts would this work for autofs?
>
> The parameters supplied by the notification mechanism are important.
>
> The place this is needed will be libmount since it catches a broad
> number of user space applications, including those I mentioned above
> (well at least systemd, I think also udisks2, very probably others).
>
> So that means mount table info. needs to be maintained, whether that
> can be achieved using sysfs I don't know. Creating and maintaining
> the sysfs tree would be a big challenge I think.
>
> But before trying to work out how to use a notification mechanism
> just having a way to get the info provided by the proc tables using
> a path alone should give initial immediate improvement in libmount.

Adding Karel, Lennart, Zbigniew and util-linux@vger...

At a quick glance at libmount and systemd code, it appears that just
switching out the implementation in libmount will not be enough:
systemd is calling functions like mnt_table_parse_*() when it receives
a notification that the mount table changed.

What is the end purpose of parsing the mount tables?  Can systemd guys
comment on that?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 13:45                     ` [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] Miklos Szeredi
@ 2020-02-27 15:14                       ` Karel Zak
  2020-02-28  0:43                         ` Ian Kent
  2020-02-28  0:12                       ` Ian Kent
  1 sibling, 1 reply; 13+ messages in thread
From: Karel Zak @ 2020-02-27 15:14 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.

Yes, now you have to generate (in the kernel) and parse (in
userspace) the whole mount table to get information about just
one mount table entry. This is typical for umount or systemd.
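
For illustration only (a sketch, not code from this thread): the cost
described above can be seen in how a single lookup works today. The field
layout follows the mountinfo format documented in proc(5); the SAMPLE table
and the function names are made up for the example.

```python
from typing import Dict, List, Optional


def parse_mountinfo(text: str) -> List[Dict[str, str]]:
    """Parse every record of a /proc/self/mountinfo-style table."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split()
        # Optional fields (e.g. "shared:1") end at a lone "-" separator.
        sep = fields.index("-", 6)
        entries.append({
            "mount_id": fields[0],
            "parent_id": fields[1],
            "root": fields[3],
            "mountpoint": fields[4],
            "options": fields[5],
            "fstype": fields[sep + 1],
            "source": fields[sep + 2],
        })
    return entries


def find_mount(text: str, mountpoint: str) -> Optional[Dict[str, str]]:
    """Return the entry for one mount point; the whole table is still parsed."""
    match = None
    for entry in parse_mountinfo(text):
        if entry["mountpoint"] == mountpoint:
            match = entry  # a later entry shadows an earlier one
    return match


# Made-up sample table: two mounts stacked on /mnt/data.
SAMPLE = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
36 22 0:31 / /mnt/data rw,noatime shared:2 - nfs4 server:/export rw
37 22 0:32 / /mnt/data rw - tmpfs tmpfs rw,size=1024k
"""

entry = find_mount(SAMPLE, "/mnt/data")
print(entry["mount_id"], entry["fstype"])
```

Even though only one entry is wanted, every record is generated by the
kernel and split in userspace, so the work grows with the size of the table.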

> > >  - add a notification mechanism   - lookup a mount based on path
> > >  - and a way to selectively query mount/superblock information
> > based on path ...

For umount-like use-cases we need mountpoint/ to mount entry
conversion; I guess something like open(mountpoint/) + fsinfo() 
should be good enough.

For systemd we need the same, but triggered by notification. The ideal
solution is to get the mount entry ID or FD from the notification and later
use this ID or FD to ask for details about the mount entry (probably again
via fsinfo()). The notification has to be usable within an epoll() set.
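
For comparison, a sketch (not from this thread) of the mechanism that
exists today: per proc(5), a change to the mount table is reported as a
priority event on an open /proc/self/mountinfo, but the event carries no
mount ID, so the watcher has to re-read the table and compare snapshots.
The function names are illustrative assumptions.

```python
import select
from typing import Optional, Set, Tuple


def wait_for_mount_change(mountinfo_file, timeout_ms: Optional[int] = None) -> bool:
    """Wait until the kernel flags the already-open table as changed.

    Returns True on a change, False if timeout_ms elapsed first.
    """
    poller = select.poll()
    # A mount or umount shows up as a priority/error event on the file.
    poller.register(mountinfo_file.fileno(), select.POLLPRI | select.POLLERR)
    return bool(poller.poll(timeout_ms))


def mount_ids(table_text: str) -> Set[str]:
    """Collect the mount ID (first field) of every record."""
    return {line.split()[0] for line in table_text.splitlines() if line.strip()}


def diff_mount_ids(before: str, after: str) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) mount IDs between two full snapshots."""
    old, new = mount_ids(before), mount_ids(after)
    return new - old, old - new
```

A watcher would keep the file open, call wait_for_mount_change(), then
seek to offset 0, read the whole table again and diff it against the
previous snapshot. A notification that delivered the ID or FD directly
would make the full re-read and the diff unnecessary.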

This solves 99% of our performance issues I guess.

> > So that means mount table info. needs to be maintained, whether that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.

It will still be necessary to get the complete mount table sometimes, but
not in performance sensitive scenarios.

I'm not sure about sysfs/; you need to somehow resolve namespaces, the
order of the mount entries (which one is the last one), etc. IMHO
translating a mountpoint path to a sysfs/ path will be complicated.

> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
> 
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
> 
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it receives
> a notification that the mount table changed.

We're ready to change this stuff in systemd if there is something
better (something per-mount-entry).

My plan is to add a new API to libmount to query information about one
mount entry (but I have had no time to play with fsinfo yet).

> What is the end purpose of parsing the mount tables?  Can systemd guys
> comment on that?

If mount/umount is triggered by systemd, then it needs to verify
success and get the final version of the mount options. It also reads
information from libmount to get userspace mount options (e.g.
_netdev; libmount uses the mount source, target and fsroot to join
kernel and userspace stuff).

And don't forget that mount units are part of systemd dependencies, so
umount/mount is an important event for systemd and it needs details about
the changes (what, where, etc.).

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 13:45                     ` [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] Miklos Szeredi
  2020-02-27 15:14                       ` Karel Zak
@ 2020-02-28  0:12                       ` Ian Kent
  1 sibling, 0 replies; 13+ messages in thread
From: Ian Kent @ 2020-02-28  0:12 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml, Karel Zak,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, 2020-02-27 at 14:45 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
> > On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net>
> > > wrote:
> > > 
> > > > At the least the question of "do we need a highly efficient way
> > > > to query the superblock parameters all at once" needs to be
> > > > extended to include mount table enumeration as well as getting
> > > > the info.
> > > > 
> > > > But this is just me thinking about mount table handling and the
> > > > quite significant problem we now have with user space scanning
> > > > the proc mount tables to get this information.
> > > 
> > > Right.
> > > 
> > > So the problem is that currently autofs needs to rescan the proc
> > > mount
> > > table on every change.   The solution to that is to
> > 
> > Actually no, that's not quite the problem I see.
> > 
> > autofs handles large mount tables fairly well (necessarily) and
> > in time I plan to remove the need to read the proc tables at all
> > (that's proven very difficult but I'll get back to that).
> > 
> > This has to be done to resolve the age old problem of autofs not
> > being able to handle large direct mount maps. But, because of
> > the large number of mounts associated with large direct mount
> > maps, other system processes are badly affected too.
> > 
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.
> > 
> > Clearly large mount tables not only result from autofs and the
> > problems caused by them are slightly different to the mount and
> > umount problem I describe. But they are a problem nevertheless
> > in the sense that frequent notifications that lead to reading
> > a large proc mount table have significant overhead that can't be
> > avoided because the table may have changed since the last time
> > it was read.
> > 
> > It's easy to cause several system processes to peg a fair number
> > of CPUs when a large number of mounts/umounts are being performed,
> > namely systemd, udisks2 and some others. Also I've seen a couple
> > of application processes badly affected purely by the presence of
> > a large number of mounts in the proc tables, that's not quite so
> > bad though.
> > 
> > >  - add a notification mechanism   - lookup a mount based on path
> > >  - and a way to selectively query mount/superblock information
> > based on path ...
> > > right?
> > > 
> > > For the notification we have uevents in sysfs, which also
> > > supplies
> > > the
> > > changed parameters.  Taking aside namespace issues and addressing
> > > mounts would this work for autofs?
> > 
> > The parameters supplied by the notification mechanism are
> > important.
> > 
> > The place this is needed will be libmount since it catches a broad
> > number of user space applications, including those I mentioned
> > above
> > (well at least systemd, I think also udisks2, very probably
> > others).
> > 
> > So that means mount table info. needs to be maintained, whether
> > that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.
> > 
> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
> 
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
> 
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it
> receives
> a notification that the mount table changed.

Maybe I wasn't clear, my bad, sorry about that.

There's no question that change notification handling is needed too.

I'm claiming that an initial change to use something that can get
the mount information without using the proc tables will, on its own,
give an "initial immediate improvement".

The work needed to implement mount table change notification
handling will take much more time, and exactly what changes it
will bring is not clear yet. I do plan to work on that too,
together with Karel.

Ian



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-27 15:14                       ` Karel Zak
@ 2020-02-28  0:43                         ` Ian Kent
  2020-02-28  8:35                           ` Miklos Szeredi
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Kent @ 2020-02-28  0:43 UTC (permalink / raw)
  To: Karel Zak, Miklos Szeredi
  Cc: Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Thu, 2020-02-27 at 16:14 +0100, Karel Zak wrote:
> On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > > So the problem I want to see fixed is the effect of very large
> > > mount tables on other user space applications, particularly the
> > > effect when a large number of mounts or umounts are performed.
> 
> Yes, now you have to generate (in the kernel) and parse (in
> userspace) the whole mount table to get information about just
> one mount table entry. This is typical for umount or systemd.
> 
> > > >  - add a notification mechanism   - lookup a mount based on
> > > > path
> > > >  - and a way to selectively query mount/superblock information
> > > based on path ...
> 
> For umount-like use-cases we need mountpoint/ to mount entry
> conversion; I guess something like open(mountpoint/) + fsinfo() 
> should be good enough.
> 
> For systemd we need the same, but triggered by notification. The
> ideal
> solution is to get mount entry ID or FD from notification and later
> use this
> ID or FD to ask for details about the mount entry (probably again
> fsinfo()).
> The notification has to be usable within an epoll() set.
> 
> This solves 99% of our performance issues I guess.
> 
> > > So that means mount table info. needs to be maintained, whether
> > > that
> > > can be achieved using sysfs I don't know. Creating and
> > > maintaining
> > > the sysfs tree would be a big challenge I think.
> 
> It will be still necessary to get complete mount table sometimes,
> but 
> not in performance sensitive scenarios.

That was my understanding too.

Mount table enumeration is possible with fsinfo(), but you still
have to handle each and every mount, so the improvement there is not
going to be as great as in cases where the proc mount table needs to
be scanned independently for an individual mount. It will be
somewhat more straightforward without the need to dissect text
records, though.

> 
> I'm not sure about sysfs/, you need somehow resolve namespaces, order
> of the mount entries (which one is the last one), etc. IMHO translate
> mountpoint path to sysfs/ path will be complicated.

I wonder about that too, after all sysfs contains a tree of nodes
from which the view is created unlike proc which translates kernel
information directly based on what the process should see.

We'll need to wait a bit and see what Miklos has in mind for mount
table enumeration and nothing has been said about name spaces yet.

While fsinfo() is not similar to proc it does handle name spaces
in a sensible way via file handles, a bit similar to the proc fs,
and ordering is catered for in the fsinfo() enumeration in a natural
way. Not sure how that would be handled using sysfs ...

Ian



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28  0:43                         ` Ian Kent
@ 2020-02-28  8:35                           ` Miklos Szeredi
  2020-02-28 12:27                             ` Greg Kroah-Hartman
  2020-02-28 15:08                             ` James Bottomley
  0 siblings, 2 replies; 13+ messages in thread
From: Miklos Szeredi @ 2020-02-28  8:35 UTC (permalink / raw)
  To: Ian Kent
  Cc: Karel Zak, Miklos Szeredi, James Bottomley, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Greg Kroah-Hartman, util-linux

On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:

> > I'm not sure about sysfs/, you need somehow resolve namespaces, order
> > of the mount entries (which one is the last one), etc. IMHO translate
> > mountpoint path to sysfs/ path will be complicated.
>
> I wonder about that too, after all sysfs contains a tree of nodes
> from which the view is created unlike proc which translates kernel
> information directly based on what the process should see.
>
> We'll need to wait a bit and see what Miklos has in mind for mount
> table enumeration and nothing has been said about name spaces yet.

Adding Greg for sysfs knowledge.

As far as I understand the sysfs model is, basically:

  - list of devices sorted by class and address
  - with each class having a given set of attributes

Superblocks and mounts could get enumerated by a unique identifier.
mnt_id seems to be good for mounts, s_dev may or may not be good for
superblock, but  s_id (as introduced in this patchset) could be used
instead.

As for namespaces, that's "just" an access control issue, AFAICS.
For example a task with a non-initial mount namespace should not have
access to attributes of mounts outside of its namespace.  Checking
access to superblock attributes would be similar: scan the list of
mounts and only allow access if at least one mount would get access.
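
A rough model of that check, for illustration only (not kernel code and
not from the patchset): the Mount record and the ns_id field are
hypothetical stand-ins for whatever the kernel would actually consult.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Mount:
    mnt_id: int
    super_id: int
    ns_id: int  # hypothetical ID of the mount namespace owning this mount


def may_access_mount(mount: Mount, task_ns_id: int) -> bool:
    """A task may only see mounts that live in its own mount namespace."""
    return mount.ns_id == task_ns_id


def may_access_super(super_id: int, mounts: Iterable[Mount], task_ns_id: int) -> bool:
    """Allow access if at least one visible mount refers to the superblock."""
    return any(
        m.super_id == super_id and may_access_mount(m, task_ns_id)
        for m in mounts
    )


# Made-up example: superblock 7 is mounted in namespaces 1 and 2,
# superblock 9 only in namespace 1.
MOUNTS = [Mount(22, 7, 1), Mount(36, 9, 1), Mount(50, 7, 2)]
print(may_access_super(9, MOUNTS, task_ns_id=2))
```

Note that the superblock check is a linear scan over the mounts, so its
cost grows with the size of the mount table.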

> While fsinfo() is not similar to proc it does handle name spaces
> in a sensible way via. file handles, a bit similar to the proc fs,
> and ordering is catered for in the fsinfo() enumeration in a natural
> way. Not sure how that would be handled using sysfs ...

I agree that the access control is much more straightforward with
fsinfo(2) and this may be the single biggest reason to introduce a new
syscall.

Let's see what others think.

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28  8:35                           ` Miklos Szeredi
@ 2020-02-28 12:27                             ` Greg Kroah-Hartman
  2020-02-28 16:24                               ` Miklos Szeredi
  2020-02-28 16:42                               ` David Howells
  2020-02-28 15:08                             ` James Bottomley
  1 sibling, 2 replies; 13+ messages in thread
From: Greg Kroah-Hartman @ 2020-02-28 12:27 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Ian Kent, Karel Zak, Miklos Szeredi, James Bottomley,
	Steven Whitehouse, David Howells, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 09:35:17AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> 
> > > I'm not sure about sysfs/, you need somehow resolve namespaces, order
> > > of the mount entries (which one is the last one), etc. IMHO translate
> > > mountpoint path to sysfs/ path will be complicated.
> >
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> >
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes

Close enough :)

> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but  s_id (as introduced in this patchset) could be used
> instead.

So what would the sysfs tree look like with this?

> As for namespaces, that's "just" an access control issue, AFAICS.
> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

sysfs does handle namespaces, look at how networking does this.  But
it's not exactly the simplest thing to do, so be careful with that, as
this is going to be essential for this type of work.

thanks,

greg k-h


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28  8:35                           ` Miklos Szeredi
  2020-02-28 12:27                             ` Greg Kroah-Hartman
@ 2020-02-28 15:08                             ` James Bottomley
  2020-02-28 15:40                               ` Miklos Szeredi
  1 sibling, 1 reply; 13+ messages in thread
From: James Bottomley @ 2020-02-28 15:08 UTC (permalink / raw)
  To: Miklos Szeredi, Ian Kent
  Cc: Karel Zak, Miklos Szeredi, Steven Whitehouse, David Howells,
	viro, Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Greg Kroah-Hartman, util-linux

On Fri, 2020-02-28 at 09:35 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> 
> > > I'm not sure about sysfs/, you need somehow resolve namespaces,
> > > order of the mount entries (which one is the last one), etc. IMHO
> > > translate mountpoint path to sysfs/ path will be complicated.
> > 
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> > 
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
> 
> Adding Greg for sysfs knowledge.
> 
> As far as I understand the sysfs model is, basically:
> 
>   - list of devices sorted by class and address
>   - with each class having a given set of attributes
> 
> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but  s_id (as introduced in this patchset) could be used
> instead.
> 
> As for namespaces, that's "just" an access control issue, AFAICS.

That's an easy thing to say but not an easy thing to check:  it can be
made so for label based namespaces like the network, but the mount
namespace is shared/cloned tree based.  Assessing whether a given
superblock is within your current namespace root can become a large
search exercise.  You can see how much of one in fs/proc_namespaces.c
which controls how /proc/self/mounts appears in your current namespace.

> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace.  Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

That scan can be expensive as I explained above.  That's really why I
think this is a bad idea.  Sysfs itself is currently nicely restricted
to system information that most containers don't need to know, so a lot
of the sysfs issues with containers can be solved by not mounting it. 
If you suddenly make it required for filesystem information and
notifications, that security measure gets blown out of the water.

> > While fsinfo() is not similar to proc it does handle name spaces
> > in a sensible way via. file handles, a bit similar to the proc fs,
> > and ordering is catered for in the fsinfo() enumeration in a
> > natural way. Not sure how that would be handled using sysfs ...
> 
> I agree that the access control is much more straightforward with
> fsinfo(2) and this may be the single biggest reason to introduce a
> new syscall.
> 
> Let's see what others think.

Containers are file based entities, so file descriptors are their most
natural thing and they have full ACL protection within the container
(can't open the file, can't then get the fd).  The other reason
container people like file descriptors (all the Xat system calls that
have been introduced) is that if we do actually need to break the
boundaries or privileges of the container, we can do so by getting the
orchestration system to pass in a fd the interior of the container
wouldn't have access to.
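
As an illustration of the fd-passing pattern (a sketch, not from this
thread): a privileged process opens a directory and hands the descriptor
over a UNIX socket. It assumes Python 3.9 or later for socket.send_fds()
and socket.recv_fds(); the helper names are made up.

```python
import os
import socket
import tempfile


def send_directory_fd(sock: socket.socket, path: str) -> None:
    """Open a directory and pass its descriptor to the peer (SCM_RIGHTS)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        socket.send_fds(sock, [b"fd"], [fd])
    finally:
        os.close(fd)  # the receiver holds its own duplicate


def receive_fd(sock: socket.socket) -> int:
    """Receive one descriptor sent by send_directory_fd()."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


# Demo inside one process: the two ends of a socketpair stand in for the
# orchestrator and the container interior.
outside, inside = socket.socketpair()
shared_dir = tempfile.mkdtemp()
open(os.path.join(shared_dir, "hello"), "w").close()

send_directory_fd(outside, shared_dir)
received = receive_fd(inside)
print(os.listdir(received))
```

The receiver never needs a path to the directory: it works purely through
the descriptor it was given.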

James



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 15:08                             ` James Bottomley
@ 2020-02-28 15:40                               ` Miklos Szeredi
  0 siblings, 0 replies; 13+ messages in thread
From: Miklos Szeredi @ 2020-02-28 15:40 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ian Kent, Karel Zak, Miklos Szeredi, Steven Whitehouse,
	David Howells, viro, Christian Brauner, Jann Horn,
	Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Greg Kroah-Hartman, util-linux

On Fri, Feb 28, 2020 at 4:09 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:

> Containers are file based entities, so file descriptors are their most
> natural thing and they have full ACL protection within the container
> (can't open the file, can't then get the fd).  The other reason
> container people like file descriptors (all the Xat system calls that
> have been introduced) is that if we do actually need to break the
> boundaries or privileges of the container, we can do so by getting the
> orchestration system to pass in a fd the interior of the container
> wouldn't have access to.

Yeah, agreed about the simplicity of fd based access.   Then again,
filesystem access would be immediately usable from all scripts,
languages, etc.  That, I think, is a huge bonus compared to the
ioctl-like mess that the current proposal is, which would require
library, utility, language binding updates on all changes.  Ugh.

One way to resolve that is to have the mount information
magic-symlinked from /proc/PID/fdmount/FD directly to the mountinfo
dir, which would then have a link into the sbinfo dir.  With other
access denied to all except sysadmin.

Would that work?

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 12:27                             ` Greg Kroah-Hartman
@ 2020-02-28 16:24                               ` Miklos Szeredi
  2020-02-28 17:15                                 ` Al Viro
  2020-03-02 10:34                                 ` Karel Zak
  2020-02-28 16:42                               ` David Howells
  1 sibling, 2 replies; 13+ messages in thread
From: Miklos Szeredi @ 2020-02-28 16:24 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Ian Kent, Karel Zak, Miklos Szeredi, James Bottomley,
	Steven Whitehouse, David Howells, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:

> > Superblocks and mounts could get enumerated by a unique identifier.
> > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > superblock, but  s_id (as introduced in this patchset) could be used
> > instead.
>
> So what would the sysfs tree look like with this?

For a start something like this:

mounts/$MOUNT_ID/
  parent -> ../$PARENT_ID
  super -> ../../supers/$SUPER_ID
  root: path from mount root to fs root (could be optional as usually
        they are the same)
  mountpoint -> $MOUNTPOINT
  flags: mount flags
  propagation: mount propagation
  children/$CHILD_ID -> ../../$CHILD_ID

supers/$SUPER_ID/
  type: fstype
  source: mount source (devname)
  options: csv of mount options
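
To show how a script might consume such a tree, here is a sketch that
builds a mock of the layout in a temporary directory and reads one mount
from it. No such interface exists in the kernel; the IDs, attribute values
and helper names are all made up for the example.

```python
import os
import tempfile
from typing import Dict


def write_attr(directory: str, name: str, value: str) -> None:
    with open(os.path.join(directory, name), "w") as f:
        f.write(value + "\n")


def build_sample_tree(base: str) -> None:
    """Create a tiny on-disk mock of the proposed layout (one mount)."""
    sup = os.path.join(base, "supers", "7")
    os.makedirs(sup)
    write_attr(sup, "type", "ext4")
    write_attr(sup, "source", "/dev/sda1")
    write_attr(sup, "options", "rw,errors=remount-ro")

    mnt = os.path.join(base, "mounts", "22")
    os.makedirs(mnt)
    write_attr(mnt, "flags", "rw,relatime")
    write_attr(mnt, "propagation", "shared:1")
    os.symlink("../../supers/7", os.path.join(mnt, "super"))
    os.symlink("/", os.path.join(mnt, "mountpoint"))


def read_attr(path: str) -> str:
    with open(path) as f:
        return f.read().strip()


def read_mount(base: str, mount_id: str) -> Dict[str, str]:
    """Gather the attributes of one mount without touching any other."""
    mnt = os.path.join(base, "mounts", mount_id)
    sup = os.path.join(mnt, "super")  # symlink into supers/
    return {
        "mountpoint": os.readlink(os.path.join(mnt, "mountpoint")),
        "flags": read_attr(os.path.join(mnt, "flags")),
        "propagation": read_attr(os.path.join(mnt, "propagation")),
        "fstype": read_attr(os.path.join(sup, "type")),
        "source": read_attr(os.path.join(sup, "source")),
        "options": read_attr(os.path.join(sup, "options")),
    }


base = tempfile.mkdtemp()
build_sample_tree(base)
print(read_mount(base, "22"))
```

Each attribute is a separate file read, so a lookup touches only one mount
but nothing guarantees that the attributes form a consistent snapshot.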

Thanks,
Miklos


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 12:27                             ` Greg Kroah-Hartman
  2020-02-28 16:24                               ` Miklos Szeredi
@ 2020-02-28 16:42                               ` David Howells
  1 sibling, 0 replies; 13+ messages in thread
From: David Howells @ 2020-02-28 16:42 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Greg Kroah-Hartman, Ian Kent, Karel Zak,
	Miklos Szeredi, James Bottomley, Steven Whitehouse, viro,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, util-linux

Miklos Szeredi <miklos@szeredi.hu> wrote:

>   children/$CHILD_ID -> ../../$CHILD_ID

This would really suck.  This bit would particularly affect rescanning time.

You also really want to read the entire child set atomically and, ideally,
include notification counters.
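
One common way to get a consistent snapshot from an interface that exposes
a change counter is to read the counter before and after and retry on a
mismatch. The sketch below illustrates that general pattern only; it is
not how fsinfo() is implemented, and FakeMount is a made-up stand-in.

```python
from typing import Callable, List, Tuple


def read_children_consistent(
    read_counter: Callable[[], int],
    read_children: Callable[[], List[int]],
    max_retries: int = 10,
) -> Tuple[List[int], int]:
    """Return (children, counter) from a read that no change interrupted."""
    for _ in range(max_retries):
        before = read_counter()
        children = read_children()
        if read_counter() == before:
            return children, before
    raise RuntimeError("child list kept changing; giving up")


class FakeMount:
    """Made-up data source that changes once, in the middle of the first read."""

    def __init__(self) -> None:
        self.counter = 0
        self.children = [23, 24]
        self.pending_changes = 1

    def read_counter(self) -> int:
        return self.counter

    def read_children(self) -> List[int]:
        snapshot = list(self.children)
        if self.pending_changes:
            # Simulate a concurrent mount landing right after the read.
            self.pending_changes -= 1
            self.children.append(25)
            self.counter += 1
        return snapshot


fake = FakeMount()
print(read_children_consistent(fake.read_counter, fake.read_children))
```

With one symlink per child, as in the quoted layout, there is no single
read to wrap this way, which is part of the objection above.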

>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

There's a lot more to fsinfo() than just this lot - and there's the
possibility that some of the values may change depending on exactly which file
you're looking at.

David



* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 16:24                               ` Miklos Szeredi
@ 2020-02-28 17:15                                 ` Al Viro
  2020-03-02  8:43                                   ` Miklos Szeredi
  2020-03-02 10:34                                 ` Karel Zak
  1 sibling, 1 reply; 13+ messages in thread
From: Al Viro @ 2020-02-28 17:15 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Greg Kroah-Hartman, Ian Kent, Karel Zak, Miklos Szeredi,
	James Bottomley, Steven Whitehouse, David Howells,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but  s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
> 
> For a start something like this:
> 
> mounts/$MOUNT_ID/
>   parent -> ../$PARENT_ID
>   super -> ../../supers/$SUPER_ID
>   root: path from mount root to fs root (could be optional as usually
> they are the same)
>   mountpoint -> $MOUNTPOINT
>   flags: mount flags
>   propagation: mount propagation
>   children/$CHILD_ID -> ../../$CHILD_ID
> 
>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options: csv of mount options

Oh, wonderful.  So let me see if I got it right - any namespace operation
can create/destroy/move around an arbitrary amount of sysfs objects.
Better yet, we suddenly have to express the lifetime rules for struct mount
and struct superblock in terms of struct device garbage.

I'm less than thrilled by the entire fsinfo circus, but this really takes
the cake.

In case it needs to be spelled out: NAK.


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 17:15                                 ` Al Viro
@ 2020-03-02  8:43                                   ` Miklos Szeredi
  0 siblings, 0 replies; 13+ messages in thread
From: Miklos Szeredi @ 2020-03-02  8:43 UTC (permalink / raw)
  To: Al Viro
  Cc: Greg Kroah-Hartman, Ian Kent, Karel Zak, Miklos Szeredi,
	James Bottomley, Steven Whitehouse, David Howells,
	Christian Brauner, Jann Horn, Darrick J. Wong, Linux API,
	linux-fsdevel, lkml, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 6:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> > On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> >
> > > > Superblocks and mounts could get enumerated by a unique identifier.
> > > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > > superblock, but  s_id (as introduced in this patchset) could be used
> > > > instead.
> > >
> > > So what would the sysfs tree look like with this?
> >
> > For a start something like this:
> >
> > mounts/$MOUNT_ID/
> >   parent -> ../$PARENT_ID
> >   super -> ../../supers/$SUPER_ID
> >   root: path from mount root to fs root (could be optional as usually
> > they are the same)
> >   mountpoint -> $MOUNTPOINT
> >   flags: mount flags
> >   propagation: mount propagation
> >   children/$CHILD_ID -> ../../$CHILD_ID
> >
> >  supers/$SUPER_ID/
> >    type: fstype
> >    source: mount source (devname)
> >    options: csv of mount options
>
> Oh, wonderful.  So let me see if I got it right - any namespace operation
> can create/destroy/move around an arbitrary amount of sysfs objects.

Parent/children symlinks may be excessive...

> Better yet, we suddenly have to express the lifetime rules for struct mount
> and struct superblock in terms of struct device garbage.

How so?   struct mount and struct superblock would hold a ref on
struct device, not the other way round.

In any case, I'm not insistent on the use of sysfs device classes for
this; struct device (488B) does seem too heavy for struct mount
(328B).

What I'm pretty sure about is that a read(2) based interface would be
way more useful than the syscall multiplexer that the current proposal
is.

Thanks,
Miklos
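
The layout quoted above carries roughly the same fields as one line of
/proc/self/mountinfo.  A minimal sketch of that mapping, assuming the
documented mountinfo field order.  The function name parse_mountinfo, the
dictionary keys, the sample lines and the use of major:minor as a stand-in
for $SUPER_ID are all illustrative assumptions, not part of any proposal in
this thread.

```python
# Sketch only: maps mountinfo text onto the mounts/ and supers/ layout
# discussed in the thread. Names and keys are hypothetical.

def parse_mountinfo(text):
    """Return (mounts, supers) dictionaries built from mountinfo text."""
    mounts, supers = {}, {}
    for line in text.splitlines():
        # A lone "-" field separates per-mount data from per-superblock data.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue  # skip blank or malformed lines
        fields = left.split()
        mount_id, parent_id = int(fields[0]), int(fields[1])
        dev, root, mountpoint, flags = fields[2:6]
        tail = right.split()
        mounts[mount_id] = {
            "parent": parent_id,
            "super": dev,                # assumption: major:minor as super ID
            "root": root,
            "mountpoint": mountpoint,
            "flags": flags,
            "propagation": fields[6:],   # optional fields, e.g. "shared:1"
            "children": [],
        }
        supers.setdefault(dev, {
            "type": tail[0],
            "source": tail[1] if len(tail) > 1 else "",
            "options": tail[2] if len(tail) > 2 else "",
        })
    # The child set follows from the parent IDs; nothing extra is stored.
    for mount_id, mount in mounts.items():
        parent = mounts.get(mount["parent"])
        if parent is not None and parent is not mount:
            parent["children"].append(mount_id)
    return mounts, supers


# Made-up sample in mountinfo format, for illustration only.
SAMPLE = """\
21 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw,errors=remount-ro
22 21 0:20 / /proc rw,nosuid - proc proc rw
23 21 8:1 /home /mnt/home rw,relatime shared:2 - ext4 /dev/sda1 rw,errors=remount-ro
"""

mounts, supers = parse_mountinfo(SAMPLE)
```

Because each child list is computed from the parent field alone, a consumer
that can read every mount's parent ID does not strictly need separate child
links.  Reading the whole set atomically is a different problem that this
sketch does not address.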


* Re: [PATCH 00/17] VFS: Filesystem information and notifications [ver #17]
  2020-02-28 16:24                               ` Miklos Szeredi
  2020-02-28 17:15                                 ` Al Viro
@ 2020-03-02 10:34                                 ` Karel Zak
  1 sibling, 0 replies; 13+ messages in thread
From: Karel Zak @ 2020-03-02 10:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Greg Kroah-Hartman, Ian Kent, Miklos Szeredi, James Bottomley,
	Steven Whitehouse, David Howells, viro, Christian Brauner,
	Jann Horn, Darrick J. Wong, Linux API, linux-fsdevel, lkml,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek, util-linux

On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> 
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but  s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
> 
> For a start something like this:
> 
> mounts/$MOUNT_ID/
>   parent -> ../$PARENT_ID
>   super -> ../../supers/$SUPER_ID
>   root: path from mount root to fs root (could be optional as usually
> they are the same)
>   mountpoint -> $MOUNTPOINT
>   flags: mount flags
>   propagation: mount propagation
>   children/$CHILD_ID -> ../../$CHILD_ID
> 
>  supers/$SUPER_ID/
>    type: fstype
>    source: mount source (devname)
>    options:

What about use-cases where I have no ID, but I have mountpoint path
(e.g. "umount /foo")?  In this case I have to go to open() + fsinfo()
and then sysfs does not make sense for me, right?

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com
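
For reference, the path-to-ID step in the "umount /foo" case can be done
today only by scanning the proc mount table, which is the cost this thread
is trying to avoid.  A minimal sketch, assuming the documented
/proc/self/mountinfo format.  The names unescape and find_mount and the
sample lines are hypothetical, and the longest-prefix rule with "later line
wins on a tie" is a heuristic for stacked mounts, not a kernel guarantee.

```python
import re

def unescape(field):
    # mountinfo writes space, tab, newline and backslash as \ooo octal escapes.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)

def find_mount(path, mountinfo_text):
    """Return (mount_id, mountpoint) of the mount covering `path`, or None.

    `path` is assumed to be absolute and already resolved (no symlinks).
    """
    best = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mount_id, mountpoint = int(fields[0]), unescape(fields[4])
        covers = (
            mountpoint == "/"
            or path == mountpoint
            or path.startswith(mountpoint + "/")
        )
        # Prefer the longest mount point; on a tie the later line wins.
        if covers and (best is None or len(mountpoint) >= len(best[1])):
            best = (mount_id, mountpoint)
    return best


# Made-up sample; a raw string keeps the \040 escape literal.
SAMPLE = r"""21 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
30 21 8:2 / /foo rw - ext4 /dev/sda2 rw
31 21 8:3 / /my\040dir rw - ext4 /dev/sda3 rw
"""
```

On a live system the text would come from reading /proc/self/mountinfo, and
the caller would normally resolve the path first, for example with
os.path.realpath().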



end of thread, other threads:[~2020-03-02 10:34 UTC | newest]

Thread overview: 13+ messages
     [not found] <158230810644.2185128.16726948836367716086.stgit@warthog.procyon.org.uk>
     [not found] ` <1582316494.3376.45.camel@HansenPartnership.com>
     [not found]   ` <CAOssrKehjnTwbc6A1VagM5hG_32hy3mXZenx_PdGgcUGxYOaLQ@mail.gmail.com>
     [not found]     ` <1582556135.3384.4.camel@HansenPartnership.com>
     [not found]       ` <CAJfpegsk6BsVhUgHNwJgZrqcNP66wS0fhCXo_2sLt__goYGPWg@mail.gmail.com>
     [not found]         ` <a657a80e-8913-d1f3-0ffe-d582f5cb9aa2@redhat.com>
     [not found]           ` <1582644535.3361.8.camel@HansenPartnership.com>
     [not found]             ` <CAOssrKfaxnHswrKejedFzmYTbYivJ++cPes4c91+BJDfgH4xJA@mail.gmail.com>
     [not found]               ` <1c8db4e2b707f958316941d8edd2073ee7e7b22c.camel@themaw.net>
     [not found]                 ` <CAJfpegtRoXnPm5_sMYPL2L6FCZU52Tn8wk7NcW-dm4_2x=dD3Q@mail.gmail.com>
     [not found]                   ` <3e656465c427487e4ea14151b77d391d52cd6bad.camel@themaw.net>
2020-02-27 13:45                     ` [PATCH 00/17] VFS: Filesystem information and notifications [ver #17] Miklos Szeredi
2020-02-27 15:14                       ` Karel Zak
2020-02-28  0:43                         ` Ian Kent
2020-02-28  8:35                           ` Miklos Szeredi
2020-02-28 12:27                             ` Greg Kroah-Hartman
2020-02-28 16:24                               ` Miklos Szeredi
2020-02-28 17:15                                 ` Al Viro
2020-03-02  8:43                                   ` Miklos Szeredi
2020-03-02 10:34                                 ` Karel Zak
2020-02-28 16:42                               ` David Howells
2020-02-28 15:08                             ` James Bottomley
2020-02-28 15:40                               ` Miklos Szeredi
2020-02-28  0:12                       ` Ian Kent
