linux-kernel.vger.kernel.org archive mirror
* [RFC] netns / sysfs interaction
@ 2008-01-07  7:23 Al Viro
From: Al Viro @ 2008-01-07  7:23 UTC
  To: linux-kernel; +Cc: ebiederm, htejun, linux-fsdevel, gregkh

	As much as I hate to touch either subject, let alone both at
once...  Eric, would you mind explaining what exactly you want
sysfs to do in the presence of your "namespaces"?  On the "what does user
see if we do <...>" level.

	a) what happens if I do chdir("/sys/class/net/eth42/") and then
migrate?

	b) what happens to /sys/class/net/eth0/device visibility/things
it points to/etc.?

	c) what happens to open files?  E.g. to /sys/class/net, say,
if migration happens between two getdents(2) calls.

	d) what happens to visibility in other parts of sysfs?  E.g. to
things like
$ ls  /sys/devices/pci0000\:00/0000\:00\:0a.0/
bus     device  local_cpus  power      resource1         uevent
class   driver  modalias    resource   subsystem_device  vendor
config  irq     net:eth0    resource0  subsystem_vendor
$
See that net:eth0 in there?  Are all such suckers seen?

	e) while we are at it, wouldn't seeing the information in
/sys/devices/pci in general defeat whatever purpose you have in mind
for your stuff?

Context: we need sane locking for sysfs.  I think I have a more or less
workable scheme, but its feasibility depends in a big way on what netns
needs to have.


* Re: [RFC] netns / sysfs interaction
@ 2008-01-07 10:01 ` Eric W. Biederman
From: Eric W. Biederman @ 2008-01-07 10:01 UTC
  To: Al Viro; +Cc: linux-kernel, htejun, linux-fsdevel, gregkh

Al Viro <viro@ZenIV.linux.org.uk> writes:

> 	As much as I hate to touch either subject, let alone both at
> once...  Eric, would you mind explaining what exactly you want
> sysfs to do in the presence of your "namespaces"?  On the "what does user
> see if we do <...>" level.

Right.  I need to repost the patches since Greg didn't get them applied
last time.

What appears to be a clean solution is to have multiple sysfs superblocks
and to capture the namespace at mount time.  For planning purposes there
is a device namespace on the drawing board as well, so you can keep
the same major/minor numbers for devices (tty names, network-attached
disks) across a migration event.  This means netns isn't the only
namespace we will have to worry about with sysfs before it is all
done.

> 	a) what happens if I do chdir("/sys/class/net/eth42/") and then
> migrate?

It shouldn't be any better or worse than any other filesystem.  The
prerequisite for an OS-level migration is that the set of all
namespaces and all of the processes that use them move together.

As we recreate the virtual filesystem and virtual devices we should
recreate a sysfs that is essentially the same.  I doubt we will go
to the trouble of keeping the same unnamed device number we are mounted on
and the same inode numbers, but otherwise we should be able to
recreate an identical-looking sysfs (barring real hardware changes).

> 	b) what happens to /sys/class/net/eth0/device visibility/things
> it points to/etc.?

That should continue to work without any changes at all.  We only play
with /sys/class/net (and its cousin directories that only exist
when we don't enable sysfs backwards compatibility).  The symlink
might change but that is about it.

> 	c) what happens to open files?  E.g. to /sys/class/net, say,
> if migration happens between two getdents(2) calls.

How do we restore the internal state?  Hmm.  The rule is that you
are only guaranteed to see directory entries that existed
both before you started to read the directory and after you finished.

The cheap solution is just to declare that everything was hotplugged:
deleted and recreated.  That removes any meaningful guarantee of seeing
anything.

Since we only depend upon the value of f_pos, that should largely work.

If we ever figure out how to preserve inode numbers over a migration
event the current scheme will work unmodified, but that sounds like
more pain than it is worth.
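A minimal user-space sketch of the f_pos point, assuming (as the remark about inode numbers suggests) that the readdir position is an inode number and that entries are visited in inode-number order. All `model_*` names are hypothetical and this is not the sysfs code; it only shows why a resumed getdents(2) behaves when inode numbers survive, and why recreating everything with fresh numbers removes the guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: a directory is an array of entries kept sorted by
 * inode number, and the readdir position (f_pos) is simply the inode
 * number to resume from.  Nothing else about the reader is remembered. */
struct model_dirent {
	unsigned long ino;
	const char *name;
};

struct model_dir {
	struct model_dirent ent[16];
	size_t nr;
};

/* Emit up to 'max' names, starting at the first entry whose ino >= *f_pos.
 * Returns the number of names written to 'out' and advances *f_pos past
 * the last entry returned. */
static size_t model_readdir(const struct model_dir *d, unsigned long *f_pos,
			    const char **out, size_t max)
{
	size_t n = 0;

	for (size_t i = 0; i < d->nr && n < max; i++) {
		if (d->ent[i].ino < *f_pos)
			continue;	/* already returned by an earlier call */
		out[n++] = d->ent[i].name;
		*f_pos = d->ent[i].ino + 1;
	}
	return n;
}
```

With entries {eth0:10, eth1:11, lo:12}, reading two names leaves f_pos at 12, and the next call returns only "lo". If the directory is instead rebuilt with inode numbers 20 to 22 in between, the next call returns all three names again.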

> 	d) what happens to visibility in other parts of sysfs?  E.g. to
> things like
> $ ls  /sys/devices/pci0000\:00/0000\:00\:0a.0/
> bus     device  local_cpus  power      resource1         uevent
> class   driver  modalias    resource   subsystem_device  vendor
> config  irq     net:eth0    resource0  subsystem_vendor

It all shows up.  Nothing is hidden except for the directories 
and possibly the symlinks to the directories for network devices.

We aren't trying to virtualize the hardware.

> $
> See that net:eth0 in there?  Are all such suckers seen?

Yep.  Grr.  A net:eth0 link from another namespace should either
be a broken symlink or disappear completely.  It has been ages
since I looked at what my patches do in that case, but it should be
just a broken symlink.

This is a bit of a challenge to explain because the relevant directory
structure changes in sysfs when CONFIG_SYSFS_DEPRECATED=n.  Then
instead of net:eth0 we have net/eth0, with all of the device-specific
files there.
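A minimal sketch of the two options mentioned for a foreign net:eth0 entry, assuming each such link is tagged with the namespace that owns the interface and that each sysfs mount captured one namespace. The `model_*` names and the integer namespace ids are hypothetical, used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one entry in a device directory such as
 * /sys/devices/pci0000:00/0000:00:0a.0/.  Untagged entries (vendor, irq,
 * ...) describe the hardware; a "net:eth0" link is tagged with the
 * network namespace that owns eth0.  ns_id 0 means "untagged". */
struct model_entry {
	const char *name;
	unsigned int ns_id;
};

enum model_visibility { MODEL_SHOWN, MODEL_BROKEN_LINK, MODEL_HIDDEN };

/* Decide how 'e' looks from a sysfs mount that captured namespace 'ns'.
 * 'hide_foreign' selects between the two options under discussion:
 * make foreign links disappear, or leave them as dangling symlinks. */
static enum model_visibility model_visible(const struct model_entry *e,
					   unsigned int ns, bool hide_foreign)
{
	if (e->ns_id == 0 || e->ns_id == ns)
		return MODEL_SHOWN;
	return hide_foreign ? MODEL_HIDDEN : MODEL_BROKEN_LINK;
}
```

Hardware attributes stay visible everywhere; only the namespace-tagged link changes, which matches "nothing is hidden except" the network-device entries.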

> 	e) while we are at it, wouldn't seeing the information in
> /sys/devices/pci in general defeat whatever purpose you have in mind
> for your stuff?

No.

First, when you migrate or whatever, you can report that all of the
hardware in the machine was hot-unplugged and a new set of essentially
identical hardware was hotplugged.  Stuff that goes through an OS
abstraction like a filesystem doesn't care.  For stuff that talks to
the hardware directly you don't have a choice; you have to make user
space deal with it.  However, applications that care are actually
quite rare.

Second, the goal is not to hide the fact that you are running in a set
of namespaces that doesn't cover the entire machine, but to make it so
that you don't care.  That is close to, but not quite, the same thing.

Third, when the goal is isolation and not migration (a better chroot),
our hardware never changes.

> Context: we need sane locking for sysfs.  I think I have a more or less
> workable scheme, but its feasibility depends in a big way on what netns
> needs to have.

I think on the netns side Tejun and I have hashed it over enough
that the semantics, if not the implementation, come out cleanly.

The idea is supporting multiple superblocks for sysfs:

  Ultimately we capture the relevant namespace at mount time
  and, if we don't have a superblock for that namespace, create
  a new one.

  So we have one sysfs dirent tree and multiple dentry trees.

  The tricky parts are rename/move and blocking mount/unmount requests
  for sysfs until the rename operation has finished calling d_move()
  everywhere.
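The mount-time step can be sketched as a lookup-or-create keyed by the captured namespace. This is a hypothetical user-space model (`model_sb`, `model_mount` and the fixed-size table are invented for illustration); in the VFS a comparable lookup would typically go through sget() with a test callback, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one superblock per captured namespace. */
struct model_sb {
	const void *ns;		/* namespace captured at mount time */
	int refcount;		/* number of mounts sharing this superblock */
};

#define MODEL_MAX_SB 8
static struct model_sb model_sbs[MODEL_MAX_SB];

/* Return the superblock for 'ns', creating one if none exists yet.
 * Returns NULL if the table is full. */
static struct model_sb *model_mount(const void *ns)
{
	struct model_sb *free_slot = NULL;

	for (int i = 0; i < MODEL_MAX_SB; i++) {
		if (model_sbs[i].refcount && model_sbs[i].ns == ns) {
			model_sbs[i].refcount++;	/* reuse existing sb */
			return &model_sbs[i];
		}
		if (!model_sbs[i].refcount && !free_slot)
			free_slot = &model_sbs[i];	/* remember first hole */
	}
	if (free_slot) {
		free_slot->ns = ns;
		free_slot->refcount = 1;
	}
	return free_slot;
}
```

Two mounts from the same namespace share one superblock; a mount from a different namespace gets its own.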

Essentially the dentry and sysfs dirent separation was the big part I
needed.
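A hypothetical sketch of "one sysfs dirent tree, multiple dentry trees": the dirent holds the authoritative name, and each superblock's dentry caches a copy that a rename must refresh, the job d_move() does in the real VFS. All `model_*` names are invented. Locking is deliberately left out; keeping that loop safe against concurrent mounts, unmounts and lookups is exactly the open question Al raises in his reply.

```c
#include <assert.h>
#include <string.h>

#define MODEL_NAME_LEN 16
#define MODEL_MAX_VIEWS 4

/* Authoritative, namespace-independent object (the sysfs dirent). */
struct model_dirent {
	char name[MODEL_NAME_LEN];
};

/* Per-superblock cached view of that object (the dentry). */
struct model_dentry {
	const struct model_dirent *sd;	/* backing dirent */
	char name[MODEL_NAME_LEN];	/* name as seen in this mount */
};

struct model_node {
	struct model_dirent sd;
	struct model_dentry views[MODEL_MAX_VIEWS];
	int nr_views;
};

/* Rename the backing dirent, then walk every per-superblock view and
 * refresh its cached name.  No locking: this loop is the step that must
 * not race with mounts, unmounts or lookups. */
static void model_rename(struct model_node *n, const char *new_name)
{
	strncpy(n->sd.name, new_name, MODEL_NAME_LEN - 1);
	n->sd.name[MODEL_NAME_LEN - 1] = '\0';
	for (int i = 0; i < n->nr_views; i++)
		memcpy(n->views[i].name, n->sd.name, MODEL_NAME_LEN);
}
```

The cost of a rename grows with the number of mounted superblocks, which is why the coherency and locking questions dominate the rest of the thread.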

If all I had to deal with was /sys/class/net I think I would have
split that off into its own filesystem.  However, with the latest
sysfs layout we are far beyond that, and there are symlinks going
all over tying all of the pieces together.


Eric



* Re: [RFC] netns / sysfs interaction
@ 2008-01-07 10:24   ` Al Viro
From: Al Viro @ 2008-01-07 10:24 UTC
  To: Eric W. Biederman; +Cc: linux-kernel, htejun, linux-fsdevel, gregkh

On Mon, Jan 07, 2008 at 03:01:47AM -0700, Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:

> What appears to be a clean solution is to have multiple sysfs superblocks
> and to capture the namespace at mount time.

It is not a clean solution at all.  In particular, it leaves you with a hell
of a coherency problem between these trees.

>  For planning purposes there
> is a device namespace on the drawing board as well, so you can keep
> the same major/minor numbers for devices (tty names, network-attached
> disks) across a migration event.

Yes, I'm quite sure there's more coming.  Which is why I'm asking now,
before we are even deeper into that... area.

>   This means netns isn't the only
> namespace we will have to worry about with sysfs before it is all
> done.

Exciting.
 
> > 	a) what happens if I do chdir("/sys/class/net/eth42/") and then
> > migrate?
> 
> It shouldn't be any better or worse than any other filesystem.  The
> prerequisite for an OS-level migration is that the set of all
> namespaces and all of the processes that use them move together.
> As we recreate the virtual filesystem and virtual devices we should
> recreate a sysfs that is essentially the same.  I doubt we will go
> to the trouble of keeping the same unnamed device number we are mounted on
> and the same inode numbers, but otherwise we should be able to
> recreate an identical-looking sysfs (barring real hardware changes).

Have you even bothered to read the pathname in question?  Please, do so.

> > 	c) what happens to open files?  E.g. to /sys/class/net, say,
> > if migration happens between two getdents(2) calls.
> 
> How do we restore the internal state?  Hmm.  The rule is that you
> are only guaranteed to see directory entries that existed
> both before you started to read the directory and after you finished.
> 
> The cheap solution is just to declare that everything was hotplugged:
> deleted and recreated.  That removes any meaningful guarantee of seeing
> anything.
> 
> Since we only depend upon the value of f_pos, that should largely work.
> 
> If we ever figure out how to preserve inode numbers over a migration
> event the current scheme will work unmodified, but that sounds like
> more pain than it is worth.
> 

Inode numbers?  Are you suggesting a wholesale replacement of all struct
file referenced by descriptor tables, all the way down to inodes?  May I see
the patches for that, please?

> Third, when the goal is isolation and not migration (a better chroot),
> our hardware never changes.

... and you have quite a bit of system state (starting with those net:eth0
symlinks, etc.) visible in there, not just the hardware.

> The idea is supporting multiple superblocks for sysfs:
> 
>   Ultimately we capture the relevant namespace at mount time
>   and, if we don't have a superblock for that namespace, create
>   a new one.
> 
>   So we have one sysfs dirent tree and multiple dentry trees.
> 
>   The tricky parts are rename/move and blocking mount/unmount requests
>   for sysfs until the rename operation has finished calling d_move()
>   everywhere.

Excuse me, _what_?  Are you seriously suggesting going through all dentry
trees, doing d_move() in each?  I want to see your locking.  It's promising
to be worse than devfs had ever been.  Much worse.

