* Question about pseudo filesystems @ 2002-09-04 18:02 Jamie Lokier 2002-09-07 12:00 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Jamie Lokier @ 2002-09-04 18:02 UTC (permalink / raw) To: Alex Viro, linux-kernel I have a small problem with a module I'm writing. It uses a pseudo-filesystem, rather like futexes and pipes, so that it can hand out special file descriptors on request. Following the examples of pipe.c and futex.c, but specifically for a 2.4 kernel, I'm doing this: static DECLARE_FSTYPE (mymod_fs_type, "mymod_fs", mymod_read_super, FS_NOMOUNT); static int __init mymod_init (void) { int err = register_filesystem (&mymod_fs_type); if (err) return err; mymod_mnt = kern_mount (&mymod_fs_type); if (IS_ERR (mymod_mnt)) { unregister_filesystem (&mymod_fs_type); return PTR_ERR (mymod_mnt); } } static void __exit mymod_exit (void) { mntput (mymod_mnt); unregister_filesystem (&mymod_fs_type); } Unfortunately, when I come to _unload_ the module, it can't be unloaded because the kern_mount increments the module reference count. (pipe.c in 2.4 appears to have the same problem, but of course nobody can ever unload it anyway so it doesn't matter). I'm handing out file descriptors rather like futexes from 2.5: they all share the same dentry, which is the root of the filesystem. In my code's case, that dentry is created in `mymod_read_super' (just the same way as 2.4 pipe.c). Somehow, it looks like I need to mount the filesystem when it first generates an fd, and unmount it when the last fd is destroyed -- but is it safe to unmount the filesystem _within_ a release function of an inode on the filesystem? Either that, or I need something else. Thanks, -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-04 18:02 Question about pseudo filesystems Jamie Lokier @ 2002-09-07 12:00 ` Daniel Phillips 2002-09-07 13:36 ` Alexander Viro 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-07 12:00 UTC (permalink / raw) To: Jamie Lokier, Rusty Russell, linux-kernel On Wednesday 04 September 2002 20:02, Jamie Lokier wrote: > I have a small problem with a module I'm writing. It uses a > pseudo-filesystem, rather like futexes and pipes, so that it can hand > out special file descriptors on request. > > Following the examples of pipe.c and futex.c, but specifically for a 2.4 > kernel, I'm doing this: > > static DECLARE_FSTYPE (mymod_fs_type, "mymod_fs", > mymod_read_super, FS_NOMOUNT); > > > static int __init mymod_init (void) > { > int err = register_filesystem (&mymod_fs_type); > if (err) > return err; > mymod_mnt = kern_mount (&mymod_fs_type); > if (IS_ERR (mymod_mnt)) { > unregister_filesystem (&mymod_fs_type); > return PTR_ERR (mymod_mnt); > } > } > > static void __exit mymod_exit (void) > { > mntput (mymod_mnt); > unregister_filesystem (&mymod_fs_type); > } > > Unfortunately, when I come to _unload_ the module, it can't be unloaded > because the kern_mount increments the module reference count. > > (pipe.c in 2.4 appears to have the same problem, but of course nobody > can ever unload it anyway so it doesn't matter). > > I'm handing out file descriptors rather like futexes from 2.5: they all > share the same dentry, which is the root of the filesystem. In my > code's case, that dentry is created in `mymod_read_super' (just the same > way as 2.4 pipe.c). > > Somehow, it looks like I need to mount the filesystem when it first > generates an fd, and unmount it when the last fd is destroyed -- but is > it safe to unmount the filesystem _within_ a release function of an > inode on the filesystem? > > Either that, or I need something else. It's a good example of why the module interface is stupidly wrong, and __exit needs to be called by the module unloader, returning 0 if it's ok to unload. Then your __exit can whatever condition it's interested in and, if all is well, do the kern_umount. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-07 12:00 ` Daniel Phillips @ 2002-09-07 13:36 ` Alexander Viro 2002-09-07 18:27 ` Jamie Lokier 0 siblings, 1 reply; 90+ messages in thread From: Alexander Viro @ 2002-09-07 13:36 UTC (permalink / raw) To: Daniel Phillips; +Cc: Jamie Lokier, Rusty Russell, linux-kernel On Sat, 7 Sep 2002, Daniel Phillips wrote: > On Wednesday 04 September 2002 20:02, Jamie Lokier wrote: > > I have a small problem with a module I'm writing. It uses a > > pseudo-filesystem, rather like futexes and pipes, so that it can hand > > out special file descriptors on request. > > > > Following the examples of pipe.c and futex.c, but specifically for a 2.4 > > kernel, I'm doing this: > > > > static DECLARE_FSTYPE (mymod_fs_type, "mymod_fs", > > mymod_read_super, FS_NOMOUNT); > > > > > > static int __init mymod_init (void) > > { > > int err = register_filesystem (&mymod_fs_type); > > if (err) > > return err; > > mymod_mnt = kern_mount (&mymod_fs_type); > > if (IS_ERR (mymod_mnt)) { > > unregister_filesystem (&mymod_fs_type); > > return PTR_ERR (mymod_mnt); > > } > > } > > > > static void __exit mymod_exit (void) > > { > > mntput (mymod_mnt); > > unregister_filesystem (&mymod_fs_type); > > } > > > > Unfortunately, when I come to _unload_ the module, it can't be unloaded > > because the kern_mount increments the module reference count. > > > > (pipe.c in 2.4 appears to have the same problem, but of course nobody > > can ever unload it anyway so it doesn't matter). > > > > I'm handing out file descriptors rather like futexes from 2.5: they all > > share the same dentry, which is the root of the filesystem. In my > > code's case, that dentry is created in `mymod_read_super' (just the same > > way as 2.4 pipe.c). > > > > Somehow, it looks like I need to mount the filesystem when it first > > generates an fd, and unmount it when the last fd is destroyed -- but is > > it safe to unmount the filesystem _within_ a release function of an > > inode on the filesystem? > > > > Either that, or I need something else. You need something else - namely, you need the code to follow the semantics you want for the lifetime of that beast. If your rules are "it's pinned as long as there are opened files created by foo()" - very well, there are two variants. The basic idea is the same - have sum of ->mnt_count for all vfsmounts of our type bumped whenever we call foo() and drop whenever final fput() is done on a file created by foo(). 1) Trivial variant: in foo() have file->f_vfsmnt = do_kern_mount(...); file->f_dentry = dget(file->f_vfsmnt->mnt_root); and lose kern_mount() in init 2) Slightly optimized one: spinlock_t lock; int count; struct vfsmount *mnt; in foo() p = NULL; spin_lock(&lock); if (!count++) { /* slow path */ count--; spin_unlock(&lock); p = do_kern_mount(...); spin_lock(&lock); if (!count++) mnt = p; } mntget(mnt); spin_unlock(&lock); mntput(p); file->f_vfsmnt = mnt; file->f_dentry = dget(mnt->mnt_root); and in ->release() for these files spin_lock(&lock); if (!--count) mnt = NULL; spin_unlock(&lock); (and lose kern_mount() in init, again) In both variants destruction of superblock is triggered by final fput() - upon mntput(file->f_vfsmnt); the difference being that (b) takes some care to avoid allocation of new vfsmounts if we already have one pinned down by opened file - whether that optimization is worth the trouble depends on intended use. > It's a good example of why the module interface is stupidly wrong, and > __exit needs to be called by the module unloader, returning 0 if it's > ok to unload. Then your __exit can whatever condition it's interested > in and, if all is well, do the kern_umount. BS. Instead of playing silly buggers with "oh, we will start exiting and maybe we'll bail out, let's just hope we won't find that we want to do that after we'd destroyed something" you need to decide what kind of rules you really want for the module lifetime. The rest is trivial. Again, variant (a) (which is absolutely straightforward - add one line in foo(), modify one line in foo(), delete one line in init) is enough to give the desired rules. Optimizing it if needed is not too hard - see (b) for one possible variant... ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-07 13:36 ` Alexander Viro @ 2002-09-07 18:27 ` Jamie Lokier 2002-09-07 19:47 ` Alexander Viro 0 siblings, 1 reply; 90+ messages in thread From: Jamie Lokier @ 2002-09-07 18:27 UTC (permalink / raw) To: Alexander Viro; +Cc: Daniel Phillips, Rusty Russell, linux-kernel Alexander Viro wrote: > If your rules are "it's pinned as long as there are opened files created > by foo()" - very well, there are two variants. The basic idea is the same > - have sum of ->mnt_count for all vfsmounts of our type bumped whenever we > call foo() and drop whenever final fput() is done on a file created by foo(). Thanks -- that's what I implemented, except I used a semaphore instead of a spinlock. I wanted to check that it's safe to call `mntput' from `->release()', which seems like quite a dubious thing to depend on. But if you say it is safe, that's cool. > > It's a good example of why the module interface is stupidly wrong, and > > __exit needs to be called by the module unloader, returning 0 if it's > > ok to unload. Then your __exit can whatever condition it's interested > > in and, if all is well, do the kern_umount. > > BS. Instead of playing silly buggers with "oh, we will start exiting > and maybe we'll bail out, let's just hope we won't find that we want > to do that after we'd destroyed something" you need to decide what kind > of rules you really want for the module lifetime. The rest is trivial. > Again, variant (a) (which is absolutely straightforward - add one line > in foo(), modify one line in foo(), delete one line in init) is enough > to give the desired rules. Optimizing it if needed is not too hard - > see (b) for one possible variant... Unfortunately, your suggestion, which I ended up implementing, is not safe from race conditions. The problem comes during the call to `->release()'. If that's really the last reference to the module, than as soon as I call `mntput' the module might be unloaded. In practice this doesn't happen, but if there were a long scheduling delay... (see CONFIG_PREEMPT), it could. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-07 18:27 ` Jamie Lokier @ 2002-09-07 19:47 ` Alexander Viro 2002-09-08 2:21 ` Jamie Lokier 0 siblings, 1 reply; 90+ messages in thread From: Alexander Viro @ 2002-09-07 19:47 UTC (permalink / raw) To: Jamie Lokier; +Cc: Daniel Phillips, Rusty Russell, linux-kernel On Sat, 7 Sep 2002, Jamie Lokier wrote: > Alexander Viro wrote: > > If your rules are "it's pinned as long as there are opened files created > > by foo()" - very well, there are two variants. The basic idea is the same > > - have sum of ->mnt_count for all vfsmounts of our type bumped whenever we > > call foo() and drop whenever final fput() is done on a file created by foo(). > > Thanks -- that's what I implemented, except I used a semaphore instead > of a spinlock. > > I wanted to check that it's safe to call `mntput' from `->release()', > which seems like quite a dubious thing to depend on. But if you say it > is safe, that's cool. It is neither safe nor needed. Please, look at the previous posting again - neither variant calls mntput() in ->release(). Now, __fput() _does_ call mntput() - always. And yes, if that happens to be the final reference - it's OK. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-07 19:47 ` Alexander Viro @ 2002-09-08 2:21 ` Jamie Lokier 2002-09-08 2:43 ` Alexander Viro 2002-09-08 16:00 ` Question about pseudo filesystems Daniel Phillips 0 siblings, 2 replies; 90+ messages in thread From: Jamie Lokier @ 2002-09-08 2:21 UTC (permalink / raw) To: Alexander Viro; +Cc: Daniel Phillips, Rusty Russell, linux-kernel Alexander Viro wrote: > It is neither safe nor needed. Please, look at the previous posting again - > neither variant calls mntput() in ->release(). > > Now, __fput() _does_ call mntput() - always. And yes, if that happens to > be the final reference - it's OK. Thanks, that's really nice. I'd assumed `kern_mount' was similar to mounting a normal filesystem, but in a non-existent namespace. Traditionally in unix you can't unmount a filesystem when its in use, and mounts don't disappear when the last file being used on them disappears. But you've rather cutely arranged that these kinds of mount _do_ disappear when the last file being used on them disappears. Clever, if a bit disturbing. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-08 2:21 ` Jamie Lokier @ 2002-09-08 2:43 ` Alexander Viro 2002-09-15 1:41 ` Moving a mount point (was Re: Question about pseudo filesystems) Rob Landley 2002-09-08 16:00 ` Question about pseudo filesystems Daniel Phillips 1 sibling, 1 reply; 90+ messages in thread From: Alexander Viro @ 2002-09-08 2:43 UTC (permalink / raw) To: Jamie Lokier; +Cc: Daniel Phillips, Rusty Russell, linux-kernel On Sun, 8 Sep 2002, Jamie Lokier wrote: > Alexander Viro wrote: > > It is neither safe nor needed. Please, look at the previous posting again - > > neither variant calls mntput() in ->release(). > > > > Now, __fput() _does_ call mntput() - always. And yes, if that happens to > > be the final reference - it's OK. > > Thanks, that's really nice. > > I'd assumed `kern_mount' was similar to mounting a normal filesystem, > but in a non-existent namespace. Traditionally in unix you can't > unmount a filesystem when its in use, and mounts don't disappear when > the last file being used on them disappears. > > But you've rather cutely arranged that these kinds of mount _do_ > disappear when the last file being used on them disappears. Clever, if > a bit disturbing. Normal mount _is_ do_kern_mount() + pinning down the resulting vfsmount + attaching it to mountpoint. The second part is what keeps them alive until fs is unmounted. Moreover, umount itself is just check refcount, return -EBUSY if the thing is busy and MNT_DETACH is not given. detach from mountpoint drop the reference Notice that umount _with_ MNT_DETACH will happily skip that check - and that will simply detach filesystem from mountpoint with fs shutdown postponed until it stops being busy. IOW, filesystem shutdown is _always_ a garbage collection - there's nothing special about vfsmounts that are not mounted somewhere. Check for fs being busy in umount(2) is just a compatibility code - kernel is perfectly happy with dissolving mountpoints, busy or not. If stuff mounted is not busy it will be shut down immediately, if not - when it stops being busy. It's actually very similar to relation between open(), unlink() and close() - you can unlink opened files and their contents will be there until the final close(); then it will be killed. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Moving a mount point (was Re: Question about pseudo filesystems) 2002-09-08 2:43 ` Alexander Viro @ 2002-09-15 1:41 ` Rob Landley 0 siblings, 0 replies; 90+ messages in thread From: Rob Landley @ 2002-09-15 1:41 UTC (permalink / raw) To: Alexander Viro, Jamie Lokier; +Cc: Daniel Phillips, Rusty Russell, linux-kernel On Saturday 07 September 2002 10:43 pm, Alexander Viro wrote: > IOW, filesystem shutdown is _always_ a garbage collection - there's nothing > special about vfsmounts that are not mounted somewhere. Check for fs > being busy in umount(2) is just a compatibility code - kernel is perfectly > happy with dissolving mountpoints, busy or not. If stuff mounted is not > busy it will be shut down immediately, if not - when it stops being busy. This wanders into something I've been trying to figure out for a while now. I want to move an existing mount point. I can't unmount it, because it's got a file open in it, but I want to move it to another binding in the tree and make the original binding go away so the filesystem IT'S under can go away. I can create a duplicate binding, but can't get the first binding to go away while the filesystem's in use. In case you're curious, the reason is I created a system where the whole root partition is a zisofs image, with /boot, /etc, /root, and /tmp symlinks to directories under /var. When the system is fully up, /var will be a mount point for /dev/hda1 (ext3), and /home on /dev/hda3. / is loopback mounted from a file ("firmware") under /var. To make this work, I make an initrd, which mounts /dev/hda1 on /var, does an losetup on /dev/loop0, exits and lets the boot continue (with root on /dev/loop0), and then runs /sbin/init-first which does a mount --bind /initrd/var to /var (duplicating the mount point), and then "exec /sbin/init", and life goes on. (Obviously, init-first has to run before init or else /etc is a symlink pointing to nothing. Haven't cleaned up /etc enough that it can be read-only, not sure I ever will...) The downside of duplicating the binding in init-first rather than moving it is I can't unmount /initrd and free its memory. It's not a major burning issue keeping me awake at night (caffiene manages to do that quite well on its own, thank you), it's just two extra mount entries in /proc/mounts and about four megabytes of wasted ram on a system with a quarter gig of the stuff, but it seems there should be a way to do this... Since each filesystem only has one superblock (regardless of the number of mount points), there's no way it can care too deeply about where it's mounted. The losetup really shouldn't care too much either (fundamentally, it's bound to an inode). So there's no theoretical reason "mount --move /old /new" couldn't be made to work. There may need to be a loop to flush dentries, and some locking, but that would be about it. (Mount points under the mount point being moved could either be moved as well, or the action could be denied if any exist. The first is cleaner, it's sort of the opposite of chroot or pivot_root. P.S. yes, I tried a few variants of pivot_root, that fails if either of the filesystems has any files open in it...) It could just be that since the concept of mount point and filesystem instance were only separated in 2.4, that this is just a historical relic. If so, I'd like to correct it. A filesystem instance can't go away with an open file, but a mount point can... Did I miss the ability to do this already, or are the userspace tools simply not taking advantage of something the kernel can currently do, or I need to come up with a patch to make this work under the 2.4 kernel? (I've poked around a bit, and vfs.txt does a pretty good job of describing how things were in 2.1.99, back when mount points and filesystem instances were the same thing...) Rob ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-08 2:21 ` Jamie Lokier 2002-09-08 2:43 ` Alexander Viro @ 2002-09-08 16:00 ` Daniel Phillips 2002-09-09 19:48 ` Jamie Lokier 1 sibling, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-08 16:00 UTC (permalink / raw) To: Jamie Lokier, Alexander Viro; +Cc: Rusty Russell, linux-kernel On Sunday 08 September 2002 04:21, Jamie Lokier wrote: > Alexander Viro wrote: > > It is neither safe nor needed. Please, look at the previous posting again - > > neither variant calls mntput() in ->release(). > > > > Now, __fput() _does_ call mntput() - always. And yes, if that happens to > > be the final reference - it's OK. > > Thanks, that's really nice. > > I'd assumed `kern_mount' was similar to mounting a normal filesystem, > but in a non-existent namespace. Traditionally in unix you can't > unmount a filesystem when its in use, and mounts don't disappear when > the last file being used on them disappears. > > But you've rather cutely arranged that these kinds of mount _do_ > disappear when the last file being used on them disappears. Clever, if > a bit disturbing. And it's not a good way to drive module unloading. It is rmmod that should cause a module to be unloaded, not close. The final close *allows* the module to be unloaded, it does not *cause* it to be. So to get the expected behaviour, you have to lather on some other fanciful construction to garbage collect modules ready to be unloaded, or to let rmmod inquire the state of the module in the process of attempting to unload it, and not trigger the nasty races we've discussed. Enter fancy locking regime that 3 people in the world actually understand. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-08 16:00 ` Question about pseudo filesystems Daniel Phillips @ 2002-09-09 19:48 ` Jamie Lokier 2002-09-09 20:06 ` Daniel Phillips ` (2 more replies) 0 siblings, 3 replies; 90+ messages in thread From: Jamie Lokier @ 2002-09-09 19:48 UTC (permalink / raw) To: Daniel Phillips; +Cc: Alexander Viro, Rusty Russell, linux-kernel Daniel Phillips wrote: > > But you've rather cutely arranged that these kinds of mount _do_ > > disappear when the last file being used on them disappears. Clever, if > > a bit disturbing. > > And it's not a good way to drive module unloading. It is rmmod that > should cause a module to be unloaded, not close. The final close > *allows* the module to be unloaded, it does not *cause* it to be. So > to get the expected behaviour, you have to lather on some other fanciful > construction to garbage collect modules ready to be unloaded, or to let > rmmod inquire the state of the module in the process of attempting to > unload it, and not trigger the nasty races we've discussed. Enter > fancy locking regime that 3 people in the world actually understand. Eh? In this case, Al Viro's scheme is really simple and works: the kern_mount keeps the module use-count non-zero so long as any file descriptors are using the module's filesystem. fput() decrements the use-count at a safe time -- no race conditions. The expected behaviour is as it has always been: rmmod fails if anyone is using the module, and succeeds if nobody is using the module. The garbage collection of modules is done using "rmmod -a" periodically, as it always has been. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 19:48 ` Jamie Lokier @ 2002-09-09 20:06 ` Daniel Phillips 2002-09-10 0:44 ` Jamie Lokier 2002-09-11 15:28 ` Question about pseudo filesystems Bill Davidsen 2002-09-09 20:12 ` Daniel Phillips 2002-09-09 20:18 ` Daniel Phillips 2 siblings, 2 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-09 20:06 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Monday 09 September 2002 21:48, Jamie Lokier wrote: > Daniel Phillips wrote: > > > But you've rather cutely arranged that these kinds of mount _do_ > > > disappear when the last file being used on them disappears. Clever, if > > > a bit disturbing. > > > > And it's not a good way to drive module unloading. It is rmmod that > > should cause a module to be unloaded, not close. The final close > > *allows* the module to be unloaded, it does not *cause* it to be. So > > to get the expected behaviour, you have to lather on some other fanciful > > construction to garbage collect modules ready to be unloaded, or to let > > rmmod inquire the state of the module in the process of attempting to > > unload it, and not trigger the nasty races we've discussed. Enter > > fancy locking regime that 3 people in the world actually understand. > > Eh? In this case, Al Viro's scheme is really simple and works: the > kern_mount keeps the module use-count non-zero so long as any file > descriptors are using the module's filesystem. fput() decrements the > use-count at a safe time -- no race conditions. It wasn't obvious to you, was it. So how can you call it simple. For modules, we need something that is truly simple, and having __exit return a flag is it. That is not to say that Al's mechanism is wrong, at least as far as filesystem modules go. In this case you'd rely on (part of) it to determine the correct flag to return. It's just that this is not the most straightforward mechanism for the job, and therefore wrong. Besides, how much sense does it make for __exit to be incapable of returning an error code? > The expected behaviour is as it has always been: rmmod fails if anyone > is using the module, and succeeds if nobody is using the module. The > garbage collection of modules is done using "rmmod -a" periodically, as > it always has been. Al's analysis is all focussed on the filesystem case, where you can prove assertions about whether the subsystem defined by the module is active or not. This doesn't cover the whole range of module applications. There is a significant class of module types that must rely on sheduling techniques to prove inactivity. My suggestion covers both, Al has his blinders on. Designs are not always correct just because they work. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 20:06 ` Daniel Phillips @ 2002-09-10 0:44 ` Jamie Lokier 2002-09-10 1:40 ` Daniel Phillips 2002-09-10 9:15 ` Daniel Phillips 2002-09-11 15:28 ` Question about pseudo filesystems Bill Davidsen 1 sibling, 2 replies; 90+ messages in thread From: Jamie Lokier @ 2002-09-10 0:44 UTC (permalink / raw) To: Daniel Phillips; +Cc: Alexander Viro, Rusty Russell, linux-kernel Daniel Phillips wrote: > It wasn't obvious to you, was it. So how can you call it simple. I have to agree. > [...] This doesn't cover the whole range of module applications. > There is a significant class of module types that must rely on > sheduling techniques to prove inactivity. My suggestion covers both, > Al has his blinders on. Unfortunately having cleanup_module() return a value don't necessarily make things simpler. Sure, it's a general solution, but it's not always easier to use. Typically, your module's resources are protected by a lock or so. cleanup_module() could take this lock, check any private reference counts, and (because it has the lock) decide whether to proceed with unregistering the module's resources. Once it begins unregistering resources, it's pretty committed to unregistering them all and saying it exited ok. Unfortunately, once it has the lock, and the reference counts are all zero, it's still _not_ generally safe to cleanup up a module. This is because any other function, for example a release() op on a file, or a remove() op on a PCI device, can't take a module's private lock, decrement the private reference count, release the lock and return. There's a race between releasing the lock and returning, where it still isn't safe to remove the module's memory. Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is enabled. In other words, the module's idea of whether it's own resources are no longer in use _must_ be released by code outside the module - or at very least, locks protecting that information must be released outside the module. > Designs are not always correct just because they work. Unfortunately it's not immediately clear to me that having cleanup_module() be able to abort an exit actually helps. Doing so with RCU-style "wait until none of my module functions could possible be in their race window" might work, though. If you could 100% trust GCC's sibling call optimisation, variations on `spin_unlock_and_return' and `up_and_return' could be written. But even if you can write code that's safe, is it likely to be understood by most module authors? If not, it's no better than Al Viro's filesystem method. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 0:44 ` Jamie Lokier @ 2002-09-10 1:40 ` Daniel Phillips 2002-09-10 1:56 ` Jamie Lokier 2002-09-10 9:15 ` Daniel Phillips 1 sibling, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-10 1:40 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Tuesday 10 September 2002 02:44, Jamie Lokier wrote: > Daniel Phillips wrote: > > It wasn't obvious to you, was it. So how can you call it simple. > > I have to agree. > > > [...] This doesn't cover the whole range of module applications. > > There is a significant class of module types that must rely on > > sheduling techniques to prove inactivity. My suggestion covers both, > > Al has his blinders on. > > Unfortunately having cleanup_module() return a value don't necessarily > make things simpler. Sure, it's a general solution, but it's not always > easier to use. It's always as easy to use, or easier. It's certainly more obvious. > Typically, your module's resources are protected by a lock or so. > cleanup_module() could take this lock, check any private reference > counts, and (because it has the lock) decide whether to proceed with > unregistering the module's resources. Once it begins unregistering > resources, it's pretty committed to unregistering them all and saying it > exited ok. And failing silently if it can't? > Unfortunately, once it has the lock, and the reference counts are all > zero, it's still _not_ generally safe to cleanup up a module. > > This is because any other function, for example a release() op on a > file, or a remove() op on a PCI device, can't take a module's private > lock, decrement the private reference count, release the lock and > return. There's a race between releasing the lock and returning, where > it still isn't safe to remove the module's memory. > > Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is > enabled. That's exactly the race that is removed by having the module subsystem call __exit to remove the module. Since the module subsystem checks the __exit's flag on return and releases the lock, so there is no window after releasing the lock when the releasor is still executing in the module. > In other words, the module's idea of whether it's own resources are no > longer in use _must_ be released by code outside the module - or at > very least, locks protecting that information must be released outside > the module. Yup, that's exactly what I've proposed, in the simplest way possible. > > Designs are not always correct just because they work. > > Unfortunately it's not immediately clear to me that having > cleanup_module() be able to abort an exit actually helps. Silent failure is about the worst thing that we could possibly design into the system, but that's not even the issue I'm going on about - because I think it's so appallingly obvious it's wrong, I assume everybody can see that. Rather, it's the simple fact that this is the obvious interface a naive person would expect, and nobody has presented a rational argument for not using it. > Doing so with RCU-style "wait until none of my module functions could > possible be in their race window" might work, though. If you could 100% > trust GCC's sibling call optimisation, variations on > `spin_unlock_and_return' and `up_and_return' could be written. > > But even if you can write code that's safe, is it likely to be > understood by most module authors? If not, it's no better than Al > Viro's filesystem method. It solves one of the races in a tidy, obviously correct way. There are other races that are much more difficult (which don't for the most part apply to filesystem modules) where we have to do cute things to be sure that all threads are out of the module, however, that is an othogonal issue, and when last sighted, it had a workable solution on the table, which requires each task to schedule once. Config_preempt is a trivial issue: just increment the preempt counter and nobody will preempt on you while you run the magic quiescence test. The module runs this test, by the way, so that only modules that can't figure this out by some less intrusive means have to impose themselves on the rest of the system this way. Anyway, this does not replace Al's filesystem method, it *uses* it, but only where appropriate. Of course we can design a more complex method for accomplishing the same thing, but why? Have you looked at the module.c by the way? If you have and you like it, you are one sick puppy ;-) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 1:40 ` Daniel Phillips @ 2002-09-10 1:56 ` Jamie Lokier 2002-09-10 2:53 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Jamie Lokier @ 2002-09-10 1:56 UTC (permalink / raw) To: Daniel Phillips; +Cc: Alexander Viro, Rusty Russell, linux-kernel Daniel Phillips wrote: > > ... Once it begins unregistering > > resources, it's pretty committed to unregistering them all and saying it > > exited ok. > > And failing silently if it can't? It could make a noise, but it will still be a failure. Once you've unregistered one of the module's resources (e.g. a PCI device registration), there is no turning back. If the next thing you unregister (e.g. a filesystem) fails, what are you going to do? Try to re-register the PCI device? Basically, once you've started unregistering resources, if there's an error you're left in an inconsistent state. You can try to get back to one of the states "module installed" or "module removed", but there's no guarantee of either -- and trying to do that would certainly complicated cleanup_module() functions. > > Unfortunately, once it has the lock, and the reference counts are all > > zero, it's still _not_ generally safe to cleanup up a module. > > > > This is because any other function, for example a release() op on a > > file, or a remove() op on a PCI device, can't take a module's private > > lock, decrement the private reference count, release the lock and > > return. There's a race between releasing the lock and returning, where > > it still isn't safe to remove the module's memory. > > > > Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is > > enabled. > > That's exactly the race that is removed by having the module subsystem call > __exit to remove the module. Since the module subsystem checks the __exit's > flag on return and releases the lock, so there is no window after releasing > the lock when the releasor is still executing in the module. Please say if you mean "cleanup_module()" when you say "__exit". Assuming you do, how can __exit safely decide whether to return 0 or -1? > > In other words, the module's idea of whether it's own resources are no > > longer in use _must_ be released by code outside the module - or at > > very least, locks protecting that information must be released outside > > the module. > > Yup, that's exactly what I've proposed, in the simplest way possible. Then you've completely confused me, because you seem to be saying that a module's own cleanup function returns a success/failure result to the module subsystem. Or is that not what you mean when you say __exit? (Which isn't the name of a function, btw). > Silent failure is about the worst thing that we could possibly design into > the system, but that's not even the issue I'm going on about - because I > think it's so appallingly obvious it's wrong, I assume everybody can > see that. I'm sure there would be no harm in cleanup_module() returning an error code, but if it did all we could do is log it. > Rather, it's the simple fact that this is the obvious interface a naive > person would expect, and nobody has presented a rational argument for not > using it. Well it's not obvious, because either I don't understand what you are describing, or you don't understand when I explain that it doesn't work :-) > [...] we have to do cute things to be sure that all threads are out of > the module, however, that is an othogonal issue, and when last > sighted, it had a workable solution on the table, which requires each > task to schedule once. Config_preempt is a trivial issue: just > increment the preempt counter and nobody will preempt on you while you > run the magic quiescence test. You have to increment the preempt counter in the threads, not in the path that's testing for a quiescent state, so that when a thread does set_finished_flag/dec_ref_count -> unlock -> exit, there can be no preempt between the unlock and the exit. This is all made simpler by using complete_and_exit() these days. > Of course we can design a more complex method for accomplishing the same > thing, but why? Well, first can you please explain what you mean by "calling __exit"? Do you mean "calling cleanup_module()", which is the the function in a module that's declared using the `module_exit' macro? If so then I maintain there are non-trivial races that bite nearly every use of MOD_DEC_USE_COUNT from within the module. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 1:56 ` Jamie Lokier @ 2002-09-10 2:53 ` Daniel Phillips 2002-09-10 3:26 ` Jamie Lokier 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-10 2:53 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Tuesday 10 September 2002 03:56, Jamie Lokier wrote: > Daniel Phillips wrote: > > > ... Once it begins unregistering > > > resources, it's pretty committed to unregistering them all and saying it > > > exited ok. > > > > And failing silently if it can't? > > It could make a noise, but it will still be a failure. Once you've > unregistered one of the module's resources (e.g. a PCI device > registration), there is no turning back. If the next thing you > unregister (e.g. a filesystem) fails, what are you going to do? Try to > re-register the PCI device? The module unloader is going to look at the error code and decide if it can do something about it (it probably can't). Otherwise, it will report the error in a standard, sane way, and the deluxe version will attempt to wall off the damage. What would you prefer, that __exit should fail silently, leaving various resources in random states, and that the remaining, functional part of the system should never be told about it? Granted, that's a time-honored approach to error handling, but it falls far short of adequate. > Basically, once you've started unregistering resources, if there's an > error you're left in an inconsistent state. You can try to get back to > one of the states "module installed" or "module removed", but there's no > guarantee of either -- and trying to do that would certainly complicated > cleanup_module() functions. I'm not advocating that somebody solve the whole 'how to we recover' problem this week. That's the kind of trouble IBM would go to on a mainframe, and they'd get it right, but we have lots more low hanging fruit before we get to that level of detail. > > > Unfortunately, once it has the lock, and the reference counts are all > > > zero, it's still _not_ generally safe to cleanup up a module. > > > > > > This is because any other function, for example a release() op on a > > > file, or a remove() op on a PCI device, can't take a module's private > > > lock, decrement the private reference count, release the lock and > > > return. There's a race between releasing the lock and returning, where > > > it still isn't safe to remove the module's memory. > > > > > > Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is > > > enabled. > > > > That's exactly the race that is removed by having the module subsystem call > > __exit to remove the module. Since the module subsystem checks the __exit's > > flag on return and releases the lock, so there is no window after releasing > > the lock when the releasor is still executing in the module. > > Please say if you mean "cleanup_module()" when you say "__exit". Yes, I mean: -void cleanup_module(void) +int cleanup_module(void) Or possibly: +int cleanup_module(struct module *module, int funny_flags) > Assuming you do, how can __exit safely decide whether to return 0 or -1? Well, I'd suggest negative for an error code, or positive if the error is simply "this module still seems to have this many users". For filesystems it simply looks at the use count, which Al maintains appropriately. Locking has been arranged by module.c such that this won't increase during the course of the cleanup_module call. The module exit routine does: if (module->count) return module->count; return clean_up_whatever(); For more difficult cases, such as SCM where we can't know so easily if there are tasks executing in the module, it has to be more like: if (module->count) return module->count; if ((howmany = there_are_still_users_magic_test())) return howmany; return clean_up_whatever(); > > > In other words, the module's idea of whether it's own resources are no > > > longer in use _must_ be released by code outside the module - or at > > > very least, locks protecting that information must be released outside > > > the module. > > > > Yup, that's exactly what I've proposed, in the simplest way possible. > > Then you've completely confused me, because you seem to be saying that a > module's own cleanup function returns a success/failure result to the > module subsystem. That's exactly what I'm saying. > > Silent failure is about the worst thing that we could possibly design into > > the system, but that's not even the issue I'm going on about - because I > > think it's so appallingly obvious it's wrong, I assume everybody can > > see that. > > I'm sure there would be no harm in cleanup_module() returning an error > code, but if it did all we could do is log it. That's a start. If we were IBM, writing a mainframe executive, we'd also clean up all the resources, which we'd have kept track of carefully. But that's a little ambitious for us at the moment, given that we don't even have clear agreement that errors should be reported. > > Rather, it's the simple fact that this is the obvious interface a naive > > person would expect, and nobody has presented a rational argument for not > > using it. > > Well it's not obvious, because either I don't understand what you are > describing, or you don't understand when I explain that it doesn't work :-) Eh. You now have sufficient data to explain to me why it doesn't work, if it really doesn't. > > [...] we have to do cute things to be sure that all threads are out of > > the module, however, that is an othogonal issue, and when last > > sighted, it had a workable solution on the table, which requires each > > task to schedule once. Config_preempt is a trivial issue: just > > increment the preempt counter and nobody will preempt on you while you > > run the magic quiescence test. > > You have to increment the preempt counter in the threads, not in the > path that's testing for a quiescent state, so that when a thread does > set_finished_flag/dec_ref_count -> unlock -> exit, there can be no > preempt between the unlock and the exit. > > This is all made simpler by using complete_and_exit() these days. Hmm, lovely documentation there. Different rant, calm down, ignore ;-) It's not just the little detail of how you get out crashlessly after the test, the hard part is proving no other threads are in the module. And yes, you have to increment *all* the preempt counts, I did indeed omit that detail, nonetheless, it's just a detail. I'm willing to believe the scheduling crew have the there_are_still_users magic test problem under control, at least they seemed to be headed in that direction last time it got beaten to death on the list. > If so then I maintain there are non-trivial races that bite nearly every > use of MOD_DEC_USE_COUNT from within the module. Let's see if you can find one in the above. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 2:53 ` Daniel Phillips @ 2002-09-10 3:26 ` Jamie Lokier 2002-09-10 3:47 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Jamie Lokier @ 2002-09-10 3:26 UTC (permalink / raw) To: Daniel Phillips; +Cc: Alexander Viro, Rusty Russell, linux-kernel Daniel Phillips wrote: > Locking > has been arranged by module.c such that this won't increase during the > course of the cleanup_module call. The module exit routine does: > > if (module->count) > return module->count; > > return clean_up_whatever(); If any code within the module does `MOD_DEC_USE_COUNT; return;', you have a race condition. That's why nowadays you shouldn't see this. The main reference points to a module are either function calls from another module (already count as references), or a file/filesystem/blockdev/netdevice: these all have an `owner' field now, so that the MOD_DEC_USE_COUNT can take place outside their modules. If you do still see MOD_DEC_USE_COUNT in a module, then there's a race condition. Some task calls the function which does MOD_DEC_USE_COUNT, and some _other_ task calls sys_delete_module(). Lo, the module is cleaned up and destroyed by the second task while it's in the small window just before `return' in the first task. Given that, you need to either (a) not call MOD_DEC_USE_COUNT in the module, and use the `owner' fields of various structures instead, or (b) something quite clever involving a quiescent state and some flags. Note that (b) is not trivial, as you can't just do `wait_for_quiescent_state()' followed by `if (module->count)': it's possible that someone may call MOD_INC_USE_COUNT after the quiescent state has passed, but before you finish cleaning up. -- Jamie ps. I'm not sure what "SCM" means, so can't comment on that. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 3:26 ` Jamie Lokier @ 2002-09-10 3:47 ` Daniel Phillips 0 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-10 3:47 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Tuesday 10 September 2002 05:26, Jamie Lokier wrote: > Daniel Phillips wrote: > > Locking > > has been arranged by module.c such that this won't increase during the > > course of the cleanup_module call. The module exit routine does: > > > > if (module->count) > > return module->count; > > > > return clean_up_whatever(); > > If any code within the module does `MOD_DEC_USE_COUNT; return;', you > have a race condition. > > That's why nowadays you shouldn't see this. The main reference points > to a module are either function calls from another module (already count > as references), or a file/filesystem/blockdev/netdevice: these all have > an `owner' field now, so that the MOD_DEC_USE_COUNT can take place > outside their modules. Thanks for telling me this, but I know it. The manner in which the module use count is protected is completely irrelvant to the code I showed. The cleanup_module just knows the count is protected and stable. > If you do still see MOD_DEC_USE_COUNT in a module, then there's a race > condition. Some task calls the function which does MOD_DEC_USE_COUNT, > and some _other_ task calls sys_delete_module(). Lo, the module is > cleaned up and destroyed by the second task while it's in the small > window just before `return' in the first task. > > Given that, you need to either (a) not call MOD_DEC_USE_COUNT in the > module, and use the `owner' fields of various structures instead, or (b) > something quite clever involving a quiescent state and some flags. > > Note that (b) is not trivial, as you can't just do > `wait_for_quiescent_state()' followed by `if (module->count)': it's > possible that someone may call MOD_INC_USE_COUNT after the quiescent > state has passed, but before you finish cleaning up. You promised to find a race or some other deficiency in my code and you pointed out some other interesting, but incidental fact. If you look at my code snippet for cleanup_module, you'll see it doesn't decrement anything, though it could if it wanted to, because it's called under protection provided by module.c. The way the module keeps track of its unloadable/not unloadable state is up to it, which makes sense, n'est-ce pas? Given that there is no way you can make the MOD_DEC_USE_COUNT concept work for SCM, for example. There is in fact no choice other than to bite the bullet and implement some form of magic_wait_for_quiescent_state(), however hard that may be, to be used in the cases where some precise accounting method simply isn't possible. The alternative is to forbid unloading of some types of modules, and I hope you realize what kind of criticism that would invite. Anyway, solutions to the difficulties are known. We are simply quibbling over whether we should use an obscure interface or a simple, transparent one. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 0:44 ` Jamie Lokier 2002-09-10 1:40 ` Daniel Phillips @ 2002-09-10 9:15 ` Daniel Phillips 2002-09-10 10:17 ` Roman Zippel 1 sibling, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-10 9:15 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Tuesday 10 September 2002 02:44, Jamie Lokier wrote: > Typically, your module's resources are protected by a lock or so. > cleanup_module() could take this lock, check any private reference > counts, and (because it has the lock) decide whether to proceed with > unregistering the module's resources. Actually, there did turn out to be a problem with the code I showed yesterday, resulting from the fact that cleanup_module wants to be able to sleep. Because of this, there is no good way for module.c to protect module->count from try_inc_mod_count, not without changing the spinlock in try_inc_mod_count into a semaphore, which isn't nice either, because try_inc_mod_count is taken under at least one other spinlock in super.c. There's a simple solution though: examine the module->count under the same spinlock as try_inc_mod_count, which is what sys_delete_module does. We just encapsulate that check in a handy wrapper and define it as part of the try_inc_mod_count interface. At this point the thing is generalized to the point where the module count isn't used at all by module.c, so the same interface will also accomodate the still-under-construction magic wait for quiescent state(), needed for modules that don't fit the mod_count model. The nice thing is, hardly any work is required to accomplish this, though it's hard to look at module.c for more than a few seconds without seeing more things that want cleaning up. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-10 9:15 ` Daniel Phillips @ 2002-09-10 10:17 ` Roman Zippel 2002-09-11 18:35 ` [RFC] Raceless module interface Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Roman Zippel @ 2002-09-10 10:17 UTC (permalink / raw) To: Daniel Phillips; +Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel Hi, On Tue, 10 Sep 2002, Daniel Phillips wrote: > There's a simple solution though: examine the module->count under the same > spinlock as try_inc_mod_count, which is what sys_delete_module does. We just > encapsulate that check in a handy wrapper and define it as part of the > try_inc_mod_count interface. At this point the thing is generalized to the > point where the module count isn't used at all by module.c, so the same > interface will also accomodate the still-under-construction magic wait for > quiescent state(), needed for modules that don't fit the mod_count model. I implemented something like this some time ago. If module->count isn't used by module.c anymore, why should it be in the module structure? Consequently I removed it from the module struct (what breaks of course unloading of all modules, so I'll probably reintroduce it with big a warning). If the count isn't in the module structure, the locking will become quite simpler. More info is here http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2 bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* [RFC] Raceless module interface 2002-09-10 10:17 ` Roman Zippel @ 2002-09-11 18:35 ` Daniel Phillips 2002-09-11 18:53 ` Oliver Neukum ` (3 more replies) 0 siblings, 4 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-11 18:35 UTC (permalink / raw) To: Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel Hi Roman, On Tuesday 10 September 2002 12:17, Roman Zippel wrote: > I implemented something like this some time ago. If module->count isn't > used by module.c anymore, why should it be in the module structure? > Consequently I removed it from the module struct (what breaks of course > unloading of all modules, so I'll probably reintroduce it with big a > warning). If the count isn't in the module structure, the locking will > become quite simpler. More info is here > http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2 Ah, I remember your original post but I didn't fully understand what you were going on about at the time. Now I do, and we're on the same page. Not only that, but I'm getting to the point with this messy problem where I can articulate the issues clearly, so let's try to do it now. Please ack/nack me as I go. First, let's identify the central issue that's causing conceptual problems for some who've sniffed at these issues. It's simple, there are two broad categories of modules: - Those for which we account exactly for all task entry/exits - Those for which we do not, or cannot account current usage exactly Let's call the former "counting" and the latter "noncounting" modules. The number of tasks executing in a given module is the count of task entries minus the count of task exits. Obviously this is a big problem when we have no good way of counting accurately, but it is not an insoluble problem, and a solution is presented here, based on earlier discussions on lkml. Preventing Unload Races in Counting Modules ------------------------------------------- Module unload races in counting modules are easy to handle. Al Viro has written code that handles those in a passably competent way and we will improve Al's approach incrementally as described below. I'll digress for a moment and ask the question "why are counting modules countable?". Well, using filesystems as an example, the vfs initiates each and every call into a filesystem module, and every vfs high level function is structured in such a way that all calls into a filesystem module return before the the vfs goes on to do something else. So it's easy to write accounting code that relies on this behavior, and in fact, the accounting job reduces to simply counting the number of times a given module is mounted vs unmounted. Even in this simple situation there are possible races, and one of the races is dealt with by Al's odd-looking try_inc_mod_count function, which protects the incrementing of its counter in three ways: 1) by using atomic operations 2) by protecting incrementing and testing with a spin lock 3) by additionally protecting incrementing with a state flag. Decrementing the counter is protected only by the use of atomic decrement, and turns out not to need any more protection than that (granted, this interesting fact is largely useless in terms of cycles saved). So let's look at the curious structure of the count increment operation, try_inc_mod_count. This acts on a counter embedded in struct module. (As you point out, it doesn't need to, it can act on a counter embedded in the struct filesystem instead, there is a very good reason for doing that, which I'll get back to.) I'm won't show Al's version, I'll show my improved version, which substitutes a special state of the counter variable for Al's state flag, a small but useful change: int try_inc_mod_count(struct module *module) { int result = 1; if (module) { spin_lock(&unload_lock); if (atomic_read(&module->count) == -1) result = 0; else atomic_inc(&module->count); spin_unlock(&unload_lock); } return result; } If the counter has the value -1 then it can't be incremented and this function returns 0. This corresponds to the state where unloading is in progress, and for filesystems, means that an attempted mount will fail. If the counter has any other value then that value is incremented, the function returns true, and the mount proceeds. Getting rid of the state flag simplifies the later step of removing the counter from struct module. The only other thing worth noting about this function is that, to use it, you have to be able to locate the module given a struct filesystem, hence the ->owner field in that structure. With the techniques you and I have identified, we can dispense with the ->owner field, removing considerable cruft in the process. However, for the moment I think we should produce a patch that aims narrowly at getting rid of the unload races, and do the spinoff cleanups after. The other main part of this mechanism is in sys_delete_module, in which the test for zero module count and setting of the state flag are protected by the same spinlock as above. Our improvement here will be to encapsulate this functionality in a module helper function, that looks something like this: int try_mod_count(struct module *module) { int count; spin_lock(&unload_lock); count = atomic_read(&module->count); if (!count) atomic_set(&module->count, -1); spin_unlock(&unload_lock); return count; } This function returns zero if the (counting) module has no users, and returns the (approximate) number of users otherwise. If the function returns zero, then additionally, the module count is set to -1, a state in which try_inc_mod_count is prevented from incrementing it, so after the spinlock is released, any pending mounts will fail as they should, because we have been asked to unload the module and are in the process of doing so. The remaining possible races here are currently closed by the BKL, and we will change that to a semaphore in due course. The fact that we use a sleeping lock for this purpose means that a module's cleanup_function is unrestricted in terms of scheduling. The (counting) module's cleanup_function will therefore look something like this: int error = try_mod_count(module); if (!error) { unregister_filesystem(foo_filesystem); cleanup_foo(foo_filesystem); } return error; And we will later design the 'module' parameter out of this, because it is no longer needed by generic code and so the associated cruft can be eliminated. As a programmer/guru friend once said: the code "won't have its thumb tied to its nose any more". Note that our returned error can also take on a negative value (i.e., a standard error code) should module unloading fail for some other reason than the module simply having users. Additionally, a return of -1 means "you already killed me, stupid, I'm already dead, and I already unregistered and cleaned up my filesystem". The state the module is in when its count is -1 is interesting and potentially useful: in this state the module can legally be reinitialized/reactivated, instead of just being removed from the module list and destroyed. (Roman, is -1 an ok error code to return here, given that there could be other negative error values returned as well?) Now we can look at the revised version of sys_delete_module, which has the code: BUG_ON(!mod->cleanup); if ((error = mod->cleanup()) goto out_error; free_module(module); and free_module will no longer call the cleanup_function. Nice and simple, easy to understand, is it not? What's more, we don't have to change this code at all to handle the more difficult case of noncounting modules. Preventing Unload Races in Noncounting Modules ---------------------------------------------- Noncounting modules typically don't keep track of the number of task entry/exits, typically because the cost of even a simple inc/dec wrapper in generic kernel code is unacceptable, and because the wrappers just plain look ugly. The proposed Linux Security Module is a good example of a noncounting kind of module: many vfs functions that currently execute security checks using inline code now must henceforth call indirectly through a module to do it. Linus will never buy it if these calls are inefficient. At present, some of these security tests consist of just a few instructions, so wrapping an accounting interface around those is going to show up on an execution profile, never mind the extra source code. This is far from the only reason why a module might be noncounting, but it's sufficient to show why we can't ignore the problem. Fortunately, there is a well-understood way to deal with such modules, consisting of two steps: - Stub out all the module call points - Wait until ever running task on the system has scheduled at least once To make this work, we rely on the rule that no module code may sleep/schedule unless it first increments a counter. The counter incremented will, at present, be the module->count, but that is entirely up to the module itself; module.c will no longer care how it keeps track, as long as it does. With config_preempt code, we must additionally disable preemption until the module quiescence test has completed. I'm not going to go further into this, because this is deep scheduling-fu, and I haven't got working code to show yet. Suffice to say that we have the technology to build a magic_wait_for_quiescence, and we must now proceed to do that. (Robert, are you reading?) A noncounting module uses the test as follows, in its cleanup_function: unregister_callpoints(...); magic_wait_for_quiescence(); return cleanup_foo(...); Blessed relief, that was easy. Now it's just a matter of coding ;-) One can easily imagine a kind of module that needs both the try_mod_count mechanism and the magic_wait_for_quiescence, or one that also has to wait for some daemon to exit, or a module that supports more than a single subsystem, possibly of very different kinds. All these variations are handled in the obvious way: by customizing the cleanup_function, with the aid of helper functions. This allows us to avoid the strong temptation to over-engineer the module interface, trying to anticipate every possible kind of module that someone might write in the future. Now there's the question "if this is such a great approach, why not make all modules work this way, not just filesystems?". Easy: the magic scheduling approach impacts the scheduler, however lightly, and worse, we cannot put an upper bound on the time needed for magic_wait_for_quiescence to complete. So the try_count approach is preferable, where it works. Implementation Issues --------------------- OK, I think we agree on the general outline of the work that has to be done. If not, please shout now, otherwise, feel free to niggle me to death in the usual way ;-) Let me touch on just one more issue: we need to do this work without breaking every existing module in the world. In other words, we need a compatibility wrapper to get us through the period, possibly as long as a year or two, when there are still a lot of modules both in-kernel and outside, whose cleanup_functions just return void. Module.c will detect this and do theits magic_wait_for_quiescence following the module's cleanup_function. Thus, such modules will be no more broken than they ever were. What we want is a define a new 'module_kill' macro, which will live alongside a backwards-compatible 'module_exit' macro for now. The latter generates a function call hook that module.c detects and wraps like this: if (!mod->old_cleanup) { WARN(1, "Please fix your old-style module"); if ((error = try_mod_count(module))) goto out_error; mod->old_cleanup(); free_module(module); } else { <nice new interface> } Miscellaneous other improvements we might want to make are: - Use a semaphore instead of lock_kernel - Clean up the generally substandard code in module.c - Move the semaphore into struct module - Move unload_lock into module - Other things I've forgotten for the moment But these can all wait. We have races now, and we have proposals on the table that smell like over-engineering. So we need to do the work I've described above, promptly, and get it into circulation. Comments/flames/races? -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 18:35 ` [RFC] Raceless module interface Daniel Phillips @ 2002-09-11 18:53 ` Oliver Neukum 2002-09-11 19:20 ` Daniel Phillips 2002-09-12 1:31 ` Rusty Russell ` (2 subsequent siblings) 3 siblings, 1 reply; 90+ messages in thread From: Oliver Neukum @ 2002-09-11 18:53 UTC (permalink / raw) To: Daniel Phillips, Roman Zippel Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel > Now there's the question "if this is such a great approach, why not make > all modules work this way, not just filesystems?". Easy: the magic > scheduling approach impacts the scheduler, however lightly, and worse, > we cannot put an upper bound on the time needed for You are in principle describing RCU. These guys seem to have solved the problem. > magic_wait_for_quiescence to complete. So the try_count approach is > preferable, where it works. But the try_count approach hurts every user of the defined interfaces, even if modules are not used. Is the impact on the scheduler limited to the time magic_wait_for_quiescence is running. If so, the approach looks superior. Regards Oliver ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 18:53 ` Oliver Neukum @ 2002-09-11 19:20 ` Daniel Phillips 2002-09-11 20:29 ` Oliver Neukum 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-11 19:20 UTC (permalink / raw) To: Oliver Neukum, Roman Zippel Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel On Wednesday 11 September 2002 20:53, Oliver Neukum wrote: > > Now there's the question "if this is such a great approach, why not make > > all modules work this way, not just filesystems?". Easy: the magic > > scheduling approach impacts the scheduler, however lightly, and worse, > > we cannot put an upper bound on the time needed for > > You are in principle describing RCU. These guys seem to have solved the > problem. Eh, how about that, and it was obvious to me. I wonder what that means, if anything. (No, I never knew anything about RCU other than the name.) Anyway, in case I sound too snippy here, here's a hearty thanks to IBM for contributing that IP to Linux without a fuss. > > magic_wait_for_quiescence to complete. So the try_count approach is > > preferable, where it works. > > But the try_count approach hurts every user of the defined interfaces, > even if modules are not used. Well, it works great for filesystems, which is my point. And I'll bet beer that somebody out there will find another application where it works well. It's all about choice, and the thing about the proposed interface is, it gives you the flexibility to make that choice. The fact that it's also the simplest interface is just nice. > Is the impact on the scheduler limited > to the time magic_wait_for_quiescence is running. > If so, the approach looks superior. It definitely is not superior for filesystem modules. However, "damm good" would be a nice level of functionality to aim for. The more we come to rely on virtual filesystem, the more we care that the load/unload procedure should be reliable and fast. Don't forget that point about not being able to put an upper bound on the time required to complete the magic wait. Are you hinting that we only need, and only ever will need, one mechanism here, so the module doesn't need to return a result from cleanup_module? If so then I should add that we don't just have varying requirements for cleanup style between modules, but other problems, such as single modules that support multiple services, which are common in the pcmcia world. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 19:20 ` Daniel Phillips @ 2002-09-11 20:29 ` Oliver Neukum 2002-09-11 21:15 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Oliver Neukum @ 2002-09-11 20:29 UTC (permalink / raw) To: Daniel Phillips, Roman Zippel Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel > > > magic_wait_for_quiescence to complete. So the try_count approach is > > > preferable, where it works. > > > > But the try_count approach hurts every user of the defined interfaces, > > even if modules are not used. > > Well, it works great for filesystems, which is my point. And I'll > bet beer that somebody out there will find another application > where it works well. It's all about choice, and the thing about How would the other version work less well for filesystems ? How long unloading takes doesn't matter, as long as it's safe. The overhead while running is the issue, nothing else. > the proposed interface is, it gives you the flexibility to make > that choice. The fact that it's also the simplest interface is > just nice. That makes no difference if you have to support two interfaces. > > Is the impact on the scheduler limited > > to the time magic_wait_for_quiescence is running. > > If so, the approach looks superior. > > It definitely is not superior for filesystem modules. However, > "damm good" would be a nice level of functionality to aim for. > The more we come to rely on virtual filesystem, the more we care > that the load/unload procedure should be reliable and fast. What? Unloading is a rare event in any case. > Don't forget that point about not being able to put an upper > bound on the time required to complete the magic wait. > > Are you hinting that we only need, and only ever will need, one > mechanism here, so the module doesn't need to return a result > from cleanup_module? If so then I should add that we don't just Returning the error value is good. > have varying requirements for cleanup style between modules, but > other problems, such as single modules that support multiple > services, which are common in the pcmcia world. Regards Oliver ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 20:29 ` Oliver Neukum @ 2002-09-11 21:15 ` Daniel Phillips 2002-09-11 21:26 ` Jamie Lokier 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-11 21:15 UTC (permalink / raw) To: Oliver Neukum, Roman Zippel Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel On Wednesday 11 September 2002 22:29, Oliver Neukum wrote: > > > > magic_wait_for_quiescence to complete. So the try_count approach is > > > > preferable, where it works. > > > > > > But the try_count approach hurts every user of the defined interfaces, > > > even if modules are not used. > > > > Well, it works great for filesystems, which is my point. And I'll > > bet beer that somebody out there will find another application > > where it works well. It's all about choice, and the thing about > > How would the other version work less well for filesystems ? > How long unloading takes doesn't matter, as long as it's safe. Really, that's not so, there are limits. 30 seconds? Whatever. Remember, during this time the service provided by the module is unavailable, so this is denial-of-service land. You could of course put in extra code to abort the unload process on demand, but, hmm, it probably wouldn't work ;-) And even if it did work, why would that be better than staying with a simple, proven solution that works (for filesystems)? > The overhead while running is the issue, nothing else. Au contraire, there are a variety of issues, see above for one. > > the proposed interface is, it gives you the flexibility to make > > that choice. The fact that it's also the simplest interface is > > just nice. > > That makes no difference if you have to support two interfaces. Mea culpa, I failed to communicate. The whole point of my [RFC] is that one new interface does both jobs, and potentially more besides, by the simple expediant of pushing the support for variants down into module land where it belongs, and at the same time, making these variants dead simple to write. Probably with the same or less code than we have now. Downside? > > > Is the impact on the scheduler limited > > > to the time magic_wait_for_quiescence is running. > > > If so, the approach looks superior. > > > > It definitely is not superior for filesystem modules. However, > > "damm good" would be a nice level of functionality to aim for. > > The more we come to rely on virtual filesystem, the more we care > > that the load/unload procedure should be reliable and fast. > > What? Unloading is a rare event in any case. Maybe for you it is. Anyway, I prefer that my computer does whatever I tell it to as quickly as it can, that's why I pay so much to run Linux on it ;-) > > Don't forget that point about not being able to put an upper > > bound on the time required to complete the magic wait. > > > > Are you hinting that we only need, and only ever will need, one > > mechanism here, so the module doesn't need to return a result > > from cleanup_module? If so then I should add that we don't just > > Returning the error value is good. :-) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 21:15 ` Daniel Phillips @ 2002-09-11 21:26 ` Jamie Lokier 2002-09-11 21:47 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Jamie Lokier @ 2002-09-11 21:26 UTC (permalink / raw) To: Daniel Phillips Cc: Oliver Neukum, Roman Zippel, Alexander Viro, Rusty Russell, linux-kernel Daniel Phillips wrote: > Really, that's not so, there are limits. 30 seconds? Whatever. > Remember, during this time the service provided by the module is > unavailable, so this is denial-of-service land. You could of > course put in extra code to abort the unload process on demand, > but, hmm, it probably wouldn't work ;-) If you're going to do it right, you should fix that denial-of-service by waiting until the module has finished unloading and then demand-loading the module again. Ideally, those periodic "rmmod -a" calls should _never_ cause a denial-of-service. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 21:26 ` Jamie Lokier @ 2002-09-11 21:47 ` Daniel Phillips 2002-09-12 1:42 ` Rusty Russell 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-11 21:47 UTC (permalink / raw) To: Jamie Lokier Cc: Oliver Neukum, Roman Zippel, Alexander Viro, Rusty Russell, linux-kernel On Wednesday 11 September 2002 23:26, Jamie Lokier wrote: > Daniel Phillips wrote: > > Really, that's not so, there are limits. 30 seconds? Whatever. > > Remember, during this time the service provided by the module is > > unavailable, so this is denial-of-service land. You could of > > course put in extra code to abort the unload process on demand, > > but, hmm, it probably wouldn't work ;-) > > If you're going to do it right, you should fix that denial-of-service by > waiting until the module has finished unloading and then demand-loading > the module again. That doesn't make the DoS go away, it just makes it a little harder to trigger. Anyway, one thing we could do if the rest of the module mechanism is up to it, is know that somebody is trying to reactivate a module that has just returned from module_cleanup(), and immediately reactivate it instead of freeing it, hoping to save some disk activity - if this turns out to be a real problem, that is. The null solution is likely the winner here. > Ideally, those periodic "rmmod -a" calls should _never_ cause a > denial-of-service. Goodness no. By the way, nobody has asked me how rmmod -a is to be implemented. Oh well. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 21:47 ` Daniel Phillips @ 2002-09-12 1:42 ` Rusty Russell 2002-09-12 2:09 ` Jamie Lokier 0 siblings, 1 reply; 90+ messages in thread From: Rusty Russell @ 2002-09-12 1:42 UTC (permalink / raw) To: Daniel Phillips Cc: Oliver Neukum, Roman Zippel, Alexander Viro, kaos, linux-kernel In message <E17pFKr-0007V7-00@starship> you write: > On Wednesday 11 September 2002 23:26, Jamie Lokier wrote: > > Daniel Phillips wrote: > > > Really, that's not so, there are limits. 30 seconds? Whatever. > > > Remember, during this time the service provided by the module is > > > unavailable, so this is denial-of-service land. You could of > > > course put in extra code to abort the unload process on demand, > > > but, hmm, it probably wouldn't work ;-) > > > > If you're going to do it right, you should fix that denial-of-service by > > waiting until the module has finished unloading and then demand-loading > > the module again. > > That doesn't make the DoS go away, it just makes it a little > harder to trigger. Anyway, one thing we could do if the rest > of the module mechanism is up to it, is know that somebody is > trying to reactivate a module that has just returned from > module_cleanup(), and immediately reactivate it instead of > freeing it, hoping to save some disk activity - if this turns > out to be a real problem, that is. The null solution is likely > the winner here. Ah, yes, four-point module interface: init, start, stop, destroy. Then you can call stop, realize the module is not at zero refcnt (you lost a race), then call start again. Similar thing if someone requests a stopped module. Now you're going to have to change request_module() so the kernel can realize that you're requesting a module which already exists (request_module()'s effect currently depends on /etc/modules.conf of course). Now, of course, your module interface is starting to look really complex, too. Because you want to solve the "half-loaded" problem, so you really want "init" to reserve resources, and "start" to register them (ie. start can't fail). So every register_xxx interface needs to be split into reserve_xxx and use_xxx. We can do all this, of course. I have an awful lot of patches. But I'm not really happy with any of them. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 1:42 ` Rusty Russell @ 2002-09-12 2:09 ` Jamie Lokier 2002-09-12 3:13 ` Rusty Russell 2002-09-12 3:32 ` Daniel Phillips 0 siblings, 2 replies; 90+ messages in thread From: Jamie Lokier @ 2002-09-12 2:09 UTC (permalink / raw) To: Rusty Russell Cc: Daniel Phillips, Oliver Neukum, Roman Zippel, Alexander Viro, kaos, linux-kernel Rusty Russell wrote: > Ah, yes, four-point module interface: init, start, stop, destroy. > Then you can call stop, realize the module is not at zero refcnt (you > lost a race), then call start again. Similar thing if someone > requests a stopped module. That's one fairly complicated way of going about it. Simpler is three points: init, stop, destroy. Stop is that part before you call synchronise_kernel() - it's not something you can turn back from. If somebody needs the module, and it currently inside its cleanup_module() function, they simply wait until destroy finishes, and the module is unloaded, and then _reload_ the module from scratch. So, let the disk I/O happen. This is a rare. > Now you're going to have to change request_module() so the kernel can > realize that you're requesting a module which already exists > (request_module()'s effect currently depends on /etc/modules.conf of > course). Not at all. Keep request_module() simple, and change module loading, like this: 1. Just before a module's cleanup_module() function is called, mark the module as "unloading". This should force try_inc_mod_use_count() to fail, causing its caller to behave like the associated resource (e.g. filesystem) isn't actually registered, and to call request_module(). 2. request_module() should simply ignore modules marked as "unloading". It should proceed to call "insmod" etc. 3. sys_create_module() or sys_init_module() should block if there is a module of the same name currently in the "unloading" state. They should block until that module's cleanup_module() returns. 4. At this point, the new instance of the module will initialise, request_module() calls will return and the callers which called try_inc_mod_use_count() in step 1 will continue with the resource they needed. > Now, of course, your module interface is starting to look really > complex, too. Because you want to solve the "half-loaded" problem, so > you really want "init" to reserve resources, and "start" to register > them (ie. start can't fail). So every register_xxx interface needs to > be split into reserve_xxx and use_xxx. I don't see the point in this at all. What half-loaded problem? If a module is destroyed, then reloaded and fails to initialise properly: tough. It lost the resources fair and square. > We can do all this, of course. I have an awful lot of patches. But > I'm not really happy with any of them. What do you think of the idea described above?: To mark modules as "unloading" (as we do now), use synchronise_kernel() for an rcu-style safety pause (as you suggested), and change module loading so that a dying module is waited for before its replacement takes over? -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 2:09 ` Jamie Lokier @ 2002-09-12 3:13 ` Rusty Russell 2002-09-12 3:47 ` Daniel Phillips 2002-09-13 8:18 ` Helge Hafting 2002-09-12 3:32 ` Daniel Phillips 1 sibling, 2 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-12 3:13 UTC (permalink / raw) To: Jamie Lokier Cc: Daniel Phillips, Oliver Neukum, Roman Zippel, Alexander Viro, kaos, linux-kernel In message <20020912030933.A13608@kushida.apsleyroad.org> you write: > I don't see the point in this at all. Yes, I'm starting to realize that. Frankly, I'm sick of pointing out the same problems individually to every shallow thinker who thinks that "it's easy". The fundamental problems with modules are as follows: A) Many places in the kernel do not increment module reference counts for you, and it is difficult currently write a module which safely looks after its own reference counts (see net/ipv4/netfilter/ip_conntrack_core.c's ip_conntrack_cleanup()) B) We do not handle the "half init problem" where a module fails to load, eg. a = register_xxx(); b = register_yyy(); if (!b) { unregister_xxx(a); return -EBARF; } Someone can start using "a", and we are in trouble when we remove the failed module. Suggested solutions come in several forms, with different mixtures of the following: 1) Poor man's pageable kernel: o Every interface in the kernel takes a struct module *, and looks after the reference counts before and after it jumps in. o Variants with/without try_inc_mod_count() or two-stage unload. o try_inc_mod_count variant solves the half-init problem 2) Modules carry their own overhead: o Modules can look after their own reference counts. o "don't sleep until you've grabbed a refcount" & wait for quiescence. o Two-stage module delete (stop & destroy). o More advanced variations allow "restart" of stopped module if race is lost. o Still leaves the half-init problem 3) Deprecate module unload: o Don't unload modules except by kernel hacking config option. o Unloading in-use modules: "don't do that". o Still leaves the half-init problem 4) Two-stage module init o "Reserve" (can fail) and "use" (can't fail). o Solves ONLY the half-init problem 5) Primordial Soup: o Module loading/activation and deactivation/destruction occur in a state similar to the kernel at boot. o You need to stop all tasks but the loader and some kernel threads, even if module load sleeps. o You probably want to defer almost all interrupts, too. o Or you can make module_init in interrupt context, which breaks lots of register_xxx interfaces. (1) PRO: Simple to write modules. CON: try_inc_mod_count is a source of random subsystem failure. CON: Everyone pays the inc/dec price even if they're not a module. CON: Massive infrastructure change CON: Modules started as a cute hack, are they really worth this cost? (2) PRO: Non-modules don't pay the overhead for inc/dec. CON: Harder to write modules than (1). CON: Gracefully handling the "lost the race between stop and cleanup" adds complexity (3) PRO: Infrastructure simple. PRO: Module implementation simple. CON: Some broken drivers (eg. 16-bit PCMCIA) require unload to reset (PCMCIA rewrite and/or suspend hooks can help here). CON: Judging "unused" from userspace is fun & racy on real systems. CON: We never remove features from Linux. (4) PRO: Some drivers register things before setting up their internal state already (which works fine at boot). This would solve that, too. PRO: Can be implemented naively, so if a module doesn't have two-stage init, we just never free the module memory (still dangerous since there other resources, but no worse than before). CON: Requires more work for module authors in future. (5) PRO: Infrastructure simple. PRO: Module implementation simple. PRO: Would fix drivers which register too early, too. CON: I'm not smart enough to implement it (esp. before 31 October). CON: Scheduler magic lots of fun to keep common case efficient. CON: Deferring interrupts may have interesting side effects. CON: May not be possible without deadlock? CON: "stop the world" during module load may piss off some. Cheers, Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 3:13 ` Rusty Russell @ 2002-09-12 3:47 ` Daniel Phillips 2002-09-12 3:53 ` Alexander Viro 2002-09-12 4:52 ` Rusty Russell 2002-09-13 8:18 ` Helge Hafting 1 sibling, 2 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-12 3:47 UTC (permalink / raw) To: Rusty Russell, Jamie Lokier Cc: Oliver Neukum, Roman Zippel, Alexander Viro, kaos, linux-kernel On Thursday 12 September 2002 05:13, Rusty Russell wrote: > B) We do not handle the "half init problem" where a module fails to load, eg. > a = register_xxx(); > b = register_yyy(); > if (!b) { > unregister_xxx(a); > return -EBARF; > } > Someone can start using "a", and we are in trouble when we remove > the failed module. No we are not. The module remains in the 'stopped' state throughout the entire initialization process, as it should and does, in my model. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 3:47 ` Daniel Phillips @ 2002-09-12 3:53 ` Alexander Viro 2002-09-12 4:11 ` Daniel Phillips 2002-09-12 4:52 ` Rusty Russell 1 sibling, 1 reply; 90+ messages in thread From: Alexander Viro @ 2002-09-12 3:53 UTC (permalink / raw) To: Daniel Phillips Cc: Rusty Russell, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel On Thu, 12 Sep 2002, Daniel Phillips wrote: > On Thursday 12 September 2002 05:13, Rusty Russell wrote: > > B) We do not handle the "half init problem" where a module fails to load, eg. > > a = register_xxx(); > > b = register_yyy(); > > if (!b) { > > unregister_xxx(a); > > return -EBARF; > > } > > Someone can start using "a", and we are in trouble when we remove > > the failed module. > > No we are not. The module remains in the 'stopped' state > throughout the entire initialization process, as it should and > does, in my model. Bzzzert. At the very least, for block devices we need to be able to open disks during module initialization. Al, fully expecting a stack of mind-boggling (and broken) kludges to be posted... ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 3:53 ` Alexander Viro @ 2002-09-12 4:11 ` Daniel Phillips 2002-09-12 4:40 ` Rusty Russell 2002-09-12 5:35 ` Rusty Russell 0 siblings, 2 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-12 4:11 UTC (permalink / raw) To: Alexander Viro Cc: Rusty Russell, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel On Thursday 12 September 2002 05:53, Alexander Viro wrote: > On Thu, 12 Sep 2002, Daniel Phillips wrote: > > > On Thursday 12 September 2002 05:13, Rusty Russell wrote: > > > B) We do not handle the "half init problem" where a module fails to load, eg. > > > a = register_xxx(); > > > b = register_yyy(); > > > if (!b) { > > > unregister_xxx(a); > > > return -EBARF; > > > } > > > Someone can start using "a", and we are in trouble when we remove > > > the failed module. > > > > No we are not. The module remains in the 'stopped' state > > throughout the entire initialization process, as it should and > > does, in my model. > > Bzzzert. Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your peers as "shallow thinkers". Ok, let's see if there is some content in your post. > At the very least, for block devices we need to be able to open > disks during module initialization. Huh? Would you please back up and try to make a coherent point? I'd love to point out why you're wrong, but you didn't actually say anything. > Al, fully expecting a stack of mind-boggling (and broken) kludges to be > posted... As I recall, you are the one who proposed eliminating the ability to unload modules entirely, because you were not able to solve the unload races. It's a good thing that people with more sense shouted you down. Now, let's be civilized about this. If you have points, make them, we do not need them decorated with sound effects. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 4:11 ` Daniel Phillips @ 2002-09-12 4:40 ` Rusty Russell 2002-09-12 5:27 ` Daniel Phillips 2002-09-12 14:46 ` Gerhard Mack 2002-09-12 5:35 ` Rusty Russell 1 sibling, 2 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-12 4:40 UTC (permalink / raw) To: Daniel Phillips Cc: viro, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel In message <E17pLKe-0007ds-00@starship> you write: > On Thursday 12 September 2002 05:53, Alexander Viro wrote: > > Bzzzert. > > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your > peers as "shallow thinkers". I've outlined the variations on solutions multiple times. When people repeat variations and use the word "simple", they are thinking shallowly. I had patience with this thread the first few times I had to explain it, but no longer, as you can tell. And it's clear that, shallow thinker or not, you have trouble reading 8) Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 4:40 ` Rusty Russell @ 2002-09-12 5:27 ` Daniel Phillips 2002-09-12 14:46 ` Gerhard Mack 1 sibling, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-12 5:27 UTC (permalink / raw) To: Rusty Russell Cc: viro, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel On Thursday 12 September 2002 06:40, Rusty Russell wrote: > In message <E17pLKe-0007ds-00@starship> you write: > > On Thursday 12 September 2002 05:53, Alexander Viro wrote: > > > Bzzzert. > > > > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your > > peers as "shallow thinkers". > > I've outlined the variations on solutions multiple times. When people > repeat variations and use the word "simple", they are thinking > shallowly. > > I had patience with this thread the first few times I had to explain > it, but no longer, as you can tell. Well I can understand that when people are just talking and not coding, but when the latter happens, it's worth trying to make something of it. Just look at Roman's patch, there is a lot of value in it, though some central are indeed left hanging. > And it's clear that, shallow thinker or not, you have trouble reading 8) Oh crap, I misattributed the funny sound effects to you, I should have known. Sorry about that. Hey, it's *long* after midnight here. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 4:40 ` Rusty Russell 2002-09-12 5:27 ` Daniel Phillips @ 2002-09-12 14:46 ` Gerhard Mack 2002-09-13 0:39 ` Rusty Russell 1 sibling, 1 reply; 90+ messages in thread From: Gerhard Mack @ 2002-09-12 14:46 UTC (permalink / raw) To: Rusty Russell Cc: Daniel Phillips, viro, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel On Thu, 12 Sep 2002, Rusty Russell wrote: > Date: Thu, 12 Sep 2002 14:40:42 +1000 > From: Rusty Russell <rusty@rustcorp.com.au> > To: Daniel Phillips <phillips@arcor.de> > Cc: viro@math.psu.edu, Jamie Lokier <lk@tantalophile.demon.co.uk>, > Oliver Neukum <oliver@neukum.name>, Roman Zippel <zippel@linux-m68k.org>, > kaos@ocs.com.au, linux-kernel@vger.kernel.org > Subject: Re: [RFC] Raceless module interface > > In message <E17pLKe-0007ds-00@starship> you write: > > On Thursday 12 September 2002 05:53, Alexander Viro wrote: > > > Bzzzert. > > > > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your > > peers as "shallow thinkers". > > I've outlined the variations on solutions multiple times. When people > repeat variations and use the word "simple", they are thinking > shallowly. > > I had patience with this thread the first few times I had to explain > it, but no longer, as you can tell. > > And it's clear that, shallow thinker or not, you have trouble reading 8) > Rusty. Write a FAQ then... -- Gerhard Mack gmack@innerfire.net <>< As a computer I find your faith in technology amusing. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 14:46 ` Gerhard Mack @ 2002-09-13 0:39 ` Rusty Russell 2002-09-13 2:23 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Rusty Russell @ 2002-09-13 0:39 UTC (permalink / raw) To: Gerhard Mack Cc: Daniel Phillips, viro, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel In message <Pine.LNX.4.44.0209121046110.25394-100000@innerfire.net> you write: > On Thu, 12 Sep 2002, Rusty Russell wrote: > > > Date: Thu, 12 Sep 2002 14:40:42 +1000 > > From: Rusty Russell <rusty@rustcorp.com.au> > > To: Daniel Phillips <phillips@arcor.de> > > Cc: viro@math.psu.edu, Jamie Lokier <lk@tantalophile.demon.co.uk>, > > Oliver Neukum <oliver@neukum.name>, Roman Zippel <zippel@linux-m68k.or g>, > > kaos@ocs.com.au, linux-kernel@vger.kernel.org > > Subject: Re: [RFC] Raceless module interface > > > > In message <E17pLKe-0007ds-00@starship> you write: > > > On Thursday 12 September 2002 05:53, Alexander Viro wrote: > > > > Bzzzert. > > > > > > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your > > > peers as "shallow thinkers". > > > > I've outlined the variations on solutions multiple times. When people > > repeat variations and use the word "simple", they are thinking > > shallowly. > > > > I had patience with this thread the first few times I had to explain > > it, but no longer, as you can tell. > > > > And it's clear that, shallow thinker or not, you have trouble reading 8) > > Rusty. > > Write a FAQ then... People who can't find the lkml archives will read the FAQ? Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 0:39 ` Rusty Russell @ 2002-09-13 2:23 ` Daniel Phillips 0 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 2:23 UTC (permalink / raw) To: Rusty Russell, Gerhard Mack Cc: viro, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel On Friday 13 September 2002 02:39, Rusty Russell wrote: > In message <Pine.LNX.4.44.0209121046110.25394-100000@innerfire.net> you write: > > Write a FAQ then... > > People who can't find the lkml archives will read the FAQ? > > Rusty. In this case it would be a "Frequently Argued Questions" because there is no general agreement, and it is not just because everybody but you is a shallow thinker ;-) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 4:11 ` Daniel Phillips 2002-09-12 4:40 ` Rusty Russell @ 2002-09-12 5:35 ` Rusty Russell 1 sibling, 0 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-12 5:35 UTC (permalink / raw) To: Daniel Phillips Cc: Alexander Viro, Jamie Lokier, Oliver Neukum, Roman Zippel, kaos, linux-kernel In message <E17pLKe-0007ds-00@starship> you write: > As I recall, you are the one who proposed eliminating the ability > to unload modules entirely, because you were not able to solve the > unload races. Huh? I was able to solve the module races, in multiple ways. I even had an implementation, using two stage init and two stage remove. But it's not *simple*, and pushing additional constraints on kernel driver authors may be worse than not being able to remove modules. > It's a good thing that people with more sense shouted you down. There was (understandably) some resistance to removing a "working" feature, but noone produced a new workable alternative. The *point* of my presentation was to make people think of alternatives, especially, if we can't (or don't want to) solve all the unload races, is it OK to say "well, you can't unload those kind of modules"? If I had come up with a clear winner, I might have saved myself the pain of standing up in front of 70 of my peers and admitting that I wasn't smart enough, ferchissakes. But it doesn't help if you don't listen, then sprout the "one true solution" which turns out to be a wordy combination two previously discussed schemes, with the disadvantages of both. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 3:47 ` Daniel Phillips 2002-09-12 3:53 ` Alexander Viro @ 2002-09-12 4:52 ` Rusty Russell 2002-09-12 5:58 ` Daniel Phillips 1 sibling, 1 reply; 90+ messages in thread From: Rusty Russell @ 2002-09-12 4:52 UTC (permalink / raw) To: Daniel Phillips; +Cc: lk, oliver, zippel, viro, kaos, linux-kernel On Thu, 12 Sep 2002 05:47:57 +0200 Daniel Phillips <phillips@arcor.de> wrote: > On Thursday 12 September 2002 05:13, Rusty Russell wrote: > > B) We do not handle the "half init problem" where a module fails to load, eg. > > a = register_xxx(); > > b = register_yyy(); > > if (!b) { > > unregister_xxx(a); > > return -EBARF; > > } > > Someone can start using "a", and we are in trouble when we remove > > the failed module. > > No we are not. The module remains in the 'stopped' state > throughout the entire initialization process, as it should and > does, in my model. Um, so register_xxx interfaces all use try_inc_mod_count (ie. a struct module * extra arg to register_xxx)? Or those entry points are not protected by try_inc_mod_count, so it must bump the refcnt, so you need to sleep in load until the module becomes unused again. You have the same issue in the "wait for schedule" case on unload, where someone jumps in while you are unregistering. My implementation decided to ignore this problem (ie. potentially wait forever with the module half-loaded) but it is an issue. Rusty. -- there are those who do and those who hang on and you don't see too many doers quoting their contemporaries. -- Larry McVoy ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 4:52 ` Rusty Russell @ 2002-09-12 5:58 ` Daniel Phillips 2002-09-12 7:00 ` Rusty Russell 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-12 5:58 UTC (permalink / raw) To: Rusty Russell; +Cc: lk, oliver, zippel, viro, kaos, linux-kernel On Thursday 12 September 2002 06:52, Rusty Russell wrote: > On Thu, 12 Sep 2002 05:47:57 +0200 > Daniel Phillips <phillips@arcor.de> wrote: > > > On Thursday 12 September 2002 05:13, Rusty Russell wrote: > > > B) We do not handle the "half init problem" where a module fails to > > > load, eg. > > > a = register_xxx(); > > > b = register_yyy(); > > > if (!b) { > > > unregister_xxx(a); > > > return -EBARF; > > > } > > > Someone can start using "a", and we are in trouble when we remove > > > the failed module. > > > > No we are not. The module remains in the 'stopped' state > > throughout the entire initialization process, as it should and > > does, in my model. > > Um, so register_xxx interfaces all use try_inc_mod_count (ie. a > struct module * extra arg to register_xxx)? In that case they are filesystems or some other thing that fits the model closely enough for try_ind_mod_count to be efficient. This case is easy and solved as far as I'm concerned, do correct me if I'm wrong. Assuming I'm not wrong, on to a hard and interesting case. > Or those entry points > are not protected by try_inc_mod_count, so it must bump the refcnt, so > you need to sleep in load until the module becomes unused again. Let's consider LSM as a really hard case, and let's suppose that the register_*'s just plug new functions into function tables. These functions are just called indirectly, there is no module entry/exit accounting whatsoever, except that the inc/dec rule must be followed where module code itself might sleep. Anyway, at the -EBARF point, all the function pointers have been restored to their original state, but we may have some tasks executing in the module already, which got in while _xxx was there. The solution is to do the magic_wait_for_quiescence (yes I know it has a real name, I forgot it) before returning -EBARF. Refcounts never come into it. What did I miss? It was too easy. Ah, I'm glossing over how the "I need to sleep" intramodule refcounts interact with the magic quiescence test. Let's see if some sleep fixes that. I doubt this is any central issue though. We could, for instance, pass the address of the count variable to the quiescence tester, and it will be examined at schedule time, however that works (I promise to find out soon). > You have the same issue in the "wait for schedule" case on unload, > where someone jumps in while you are unregistering. We don't. Our LSM module will first restore all the function pointers to their original state, then wait for everyone to schedule. > My implementation > decided to ignore this problem (ie. potentially wait forever with the > module half-loaded) but it is an issue. I don't see the endless wait. It looks to me like it's just the time required for everyone to schedule, which is unbounded all right, but not in any interesting sense. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 5:58 ` Daniel Phillips @ 2002-09-12 7:00 ` Rusty Russell 0 siblings, 0 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-12 7:00 UTC (permalink / raw) To: Daniel Phillips; +Cc: lk, oliver, zippel, viro, kaos, linux-kernel In message <E17pMzL-0007fx-00@starship> you write: > On Thursday 12 September 2002 06:52, Rusty Russell wrote: > > Um, so register_xxx interfaces all use try_inc_mod_count (ie. a > > struct module * extra arg to register_xxx)? > > In that case they are filesystems or some other thing that fits the > model closely enough for try_ind_mod_count to be efficient. This > case is easy and solved as far as I'm concerned, do correct me if > I'm wrong. Assuming I'm not wrong, on to a hard and interesting > case. No, let's talk about a simple notifier chain. Say you want to know when CPUs go up and down... > > Or those entry points > > are not protected by try_inc_mod_count, so it must bump the refcnt, so > > you need to sleep in load until the module becomes unused again. > > Let's consider LSM as a really hard case, and let's suppose that the > register_*'s just plug new functions into function tables. These functions > are just called indirectly, there is no module entry/exit accounting > whatsoever, except that the inc/dec rule must be followed where module code > itself might sleep. > > Anyway, at the -EBARF point, all the function pointers have been restored to > their original state, but we may have some tasks executing in the module > already, which got in while _xxx was there. The solution is to do the > magic_wait_for_quiescence (yes I know it has a real name, I forgot it) before > returning -EBARF. Refcounts never come into it. > > What did I miss? It was too easy. The notifier routine did the right thing, and did: MOD_INC_USE_COUNT(); wait_for_some_event(); It's sitting there now, sleeping inside the module (which has refcnt 1) and we returned -EBARF from module_init(). > I doubt this is any central issue though. We could, for instance, pass the > address of the count variable to the quiescence tester, and it will be > examined at schedule time, however that works (I promise to find out soon). You can sleep in some way before actually kfreeing the failed-to-load module memory; preempt makes this a bit trickier but the theory is the same (it was preempt that broke my implementation, yes it was that long ago). The point is that you could be sleeping for a *long* time... > I don't see the endless wait. It looks to me like it's just the time > required for everyone to schedule, which is unbounded all right, but not in > any interesting sense. It's not the waiting to schedule. It's the waiting for the refcnt to drop to zero. I don't mind saying "we don't have to handle that corner case", but you're left with the messiness of two classes of kernel interface: those which handle your refcounting and those which don't. And the module author better get right which ones are which (remember, we're talking about people who are *not* kernel gurus, writing their first and last kernel module on paid time for their company). This is where the "how important is module unload?" thing somes in. You could get rid of the bugs by having a list of "module count safe" interfaces (ie. exported symbols), and if a module uses any others, you simply refuse to unload it. Then no modules need *ever* worry about their own reference counts. Now, the trick I have in my back pocket is a distributed reference count scheme (unimplemented so far on Linux, but another idea stolen from Paul McKenney's Dynix/PTX team). You use per-cpu counts *until* you want to see if the count is zero (ie. module unload), at which case you flip them into (slow) shared-counter mode (handwave, handwave, I know). So it *might* be cheap to apply this to every interface, and migrate interfaces across into the "module safe" category on an as-we-need-it basis. Hope that clarifies my thinking this week... Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 3:13 ` Rusty Russell 2002-09-12 3:47 ` Daniel Phillips @ 2002-09-13 8:18 ` Helge Hafting 1 sibling, 0 replies; 90+ messages in thread From: Helge Hafting @ 2002-09-13 8:18 UTC (permalink / raw) To: Rusty Russell, linux-kernel Rusty Russell wrote: > > In message <20020912030933.A13608@kushida.apsleyroad.org> you write: > > I don't see the point in this at all. > > Yes, I'm starting to realize that. > > Frankly, I'm sick of pointing out the same problems individually to > every shallow thinker who thinks that "it's easy". > > The fundamental problems with modules are as follows: > A) Many places in the kernel do not increment module reference counts > for you, and it is difficult currently write a module which safely > looks after its own reference counts (see > net/ipv4/netfilter/ip_conntrack_core.c's ip_conntrack_cleanup()) This sort of thing makes correct unloading hard, but in my eyes it is an argument for changing the module interface to something better and say "that thing CAN'T be a module of its own!" What if we *require* modules to use some open/close reference count that don't change so often as to be a performance problem? It makes the unloading/reloading problem easier, similiar to mount/umount. It'll work for lots of modules, such as: * drivers for hardware devices, char, block, and NIC. * filesystems Things with no easy refcounting (and it cannot even be grafted on) will have to be non-modular, or folded into some parent module. Maybe IP connection tracking can't be modular with these rules - so what? Make it compiled-in, or a part of the IPV4-module. So if you really need to load and unload that you unload the ipv4_with_conntrack module and then load a ipv4_without_conntrack module. To me, it seems like the current module interface is too fine-grained, and that cause trouble. I.e. the overhead of correct refcounting is so high in some cases that it isn't used - causing trouble with safe unloading. The solution is to say no to such modules. Making them work isn't easy - but why try? Helge Hafting ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 2:09 ` Jamie Lokier 2002-09-12 3:13 ` Rusty Russell @ 2002-09-12 3:32 ` Daniel Phillips 1 sibling, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-12 3:32 UTC (permalink / raw) To: Jamie Lokier, Rusty Russell Cc: Oliver Neukum, Roman Zippel, Alexander Viro, kaos, linux-kernel On Thursday 12 September 2002 04:09, Jamie Lokier wrote: > 1. Just before a module's cleanup_module() function is called, mark > the module as "unloading". This should force > try_inc_mod_use_count() to fail, causing its caller to behave like > the associated resource (e.g. filesystem) isn't actually > registered, and to call request_module(). My proposal produces the same effect using a slightly different arrangement: module_cleanup() is called, and the first thing it does is bail out with an error if there are users, or puts the module into the inactive state (count=-1). This forces try_inc_mod_use_count to fail, as you say. > 2. request_module() should simply ignore modules marked as > "unloading". It should proceed to call "insmod" etc. > > 3. sys_create_module() or sys_init_module() should block if there is > a module of the same name currently in the "unloading" state. > They should block until that module's cleanup_module() returns. OK, this is a really good example of why replacing the lock_kernel with a dedicated module_sem is good: if we do that, the right thing happens. Insmod will block on the module_sem until the module is completely unloaded and removed from the list. By the time it gets the semaphore, everything will be in a pristine state. The BKL on the other hand will happily relinquish control if the cleanup sleeps, messing things up nicely. > 4. At this point, the new instance of the module will initialise, > request_module() calls will return and the callers which called > try_inc_mod_use_count() in step 1 will continue with the resource > they needed. Yes, I don't see a problem. The problems start when you try to shortcut this cycle. Given that a shortcut is hardly going to save much at all, due to caching, I'd call it a waste of effort. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 18:35 ` [RFC] Raceless module interface Daniel Phillips 2002-09-11 18:53 ` Oliver Neukum @ 2002-09-12 1:31 ` Rusty Russell 2002-09-12 9:10 ` Oliver Neukum 2002-09-12 11:27 ` Roman Zippel 3 siblings, 0 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-12 1:31 UTC (permalink / raw) To: Daniel Phillips; +Cc: Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel In message <E17pCKQ-0007Sz-00@starship> you write: > Hi Roman, > > On Tuesday 10 September 2002 12:17, Roman Zippel wrote: > > I implemented something like this some time ago. If module->count isn't > > used by module.c anymore, why should it be in the module structure? > > Consequently I removed it from the module struct (what breaks of course > > unloading of all modules, so I'll probably reintroduce it with big a > > warning). If the count isn't in the module structure, the locking will > > become quite simpler. More info is here > > http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2 > > Ah, I remember your original post but I didn't fully understand what you were I hate people who can't be concise. It's a sign of sloppy thinking. 1) You only need reference counts if you want to unload a module. 2) A module can control its own reference counts safely if it does not sleep without holding a reference, and you use the rcu patch's synchronize_kernel() primitive. 3) Relying on *every* driver to control its own reference counts is a recipe for disaster: some subsystems will want to control module counts for their users. 4) Moving reference counts out of the module and into the particular objects is *not* a good idea, since per-cpu cache-friendly refcounting schemes are (almost by definition) about SMP_CACHE_BYTES*NR_CPUS in size. Hope I haven't missed anything, Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 18:35 ` [RFC] Raceless module interface Daniel Phillips 2002-09-11 18:53 ` Oliver Neukum 2002-09-12 1:31 ` Rusty Russell @ 2002-09-12 9:10 ` Oliver Neukum 2002-09-12 11:27 ` Roman Zippel 3 siblings, 0 replies; 90+ messages in thread From: Oliver Neukum @ 2002-09-12 9:10 UTC (permalink / raw) To: Daniel Phillips, Roman Zippel Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel > To make this work, we rely on the rule that no module code may > sleep/schedule unless it first increments a counter. The counter > incremented will, at present, be the module->count, but that is entirely > up to the module itself; module.c will no longer care how it keeps > track, as long as it does. > > With config_preempt code, we must additionally disable preemption until > the module quiescence test has completed. I'm not going to go further > into this, because this is deep scheduling-fu, and I haven't got working > code to show yet. Suffice to say that we have the technology to build a > magic_wait_for_quiescence, and we must now proceed to do that. (Robert, > are you reading?) A noncounting module uses the test as follows, in its > cleanup_function: > > unregister_callpoints(...); > magic_wait_for_quiescence(); > return cleanup_foo(...); Please correct me if I am wrong, but ... Task A Task B counter call_module() +1 schedule() still 1 unregister_callpoints() still 1 magic_wait_for_quiescence(); still 1 call_module_second_func() -> Won't work So by trying to unregister a module you make a module unusable for an unspecified amount of time. Regards Oliver ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-11 18:35 ` [RFC] Raceless module interface Daniel Phillips ` (2 preceding siblings ...) 2002-09-12 9:10 ` Oliver Neukum @ 2002-09-12 11:27 ` Roman Zippel 2002-09-12 13:03 ` Rusty Russell 3 siblings, 1 reply; 90+ messages in thread From: Roman Zippel @ 2002-09-12 11:27 UTC (permalink / raw) To: Daniel Phillips; +Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel Hi, On Wed, 11 Sep 2002, Daniel Phillips wrote: > Ah, I remember your original post but I didn't fully understand what you were > going on about at the time. Now I do, and we're on the same page. Not only > that, but I'm getting to the point with this messy problem where I can > articulate the issues clearly, so let's try to do it now. Please ack/nack me > as I go. > > First, let's identify the central issue that's causing conceptual problems > for some who've sniffed at these issues. It's simple, there are two broad > categories of modules: Sigh, please let's analyze the complete problem first, before making any more kludges. Every (loaded) module is at least registered with two hooks in the kernel - the module structure and a driver structure (e.g. file_system_type for fs). During unloading we have to remove these hooks safely again. A module basically goes through these stages during its lifetime: 1. module load 2. module init 3. module exit 4. module unload The problem now is stage 3, as it's not allowed to fail. This means before we even start with module exit, we have to make sure nobody has any reference and nobody gets any new reference to the module through any hook (the alternative would be to possibly wait forever in the exit function to wait for these references to go away). So nobody is allowed to increment the module count or to get any pointer to the driver, this on the other hand means before you take any reference to the driver, you have to check the state of the module, so any driver hook must include a pointer to the module structure. This is also what makes the locking interesting, since module structure and driver hooks are protected with different locks, both have to be taken to safely get a single reference to the driver. So how does this whole picture change, if we allow the module exit to fail? Module load/unload and module init/exit can become two independent stages. This means in order to remove a driver hook, we don't have to check the module state anymore. Module exit only has to make sure, that all driver hooks and all references to the driver are gone. How it does that exactly is up to the driver, whether it uses reference counts, RCU or other scheduler tricks doesn't matter. We gain complete flexibility here, only by allowing the exit function to fail. It's really important to understand this consequence in order to understand my patch (e.g. try_inc_mod_count() became a dummy). With the exception of module loading I completely disabled the old interface, but it probably has to be reenabled (with a big warning) until enough modules are converted. The start/stop methods I added are completely irrellevant to solving this problem, they only give us more flexibility in the module handling, e.g. we can disable new fs mounts, while the fs still has something mounted. Anyway, the important consequence is that we can sanely clean up the module interface this way. The module load/unload phase is private to the module code, the module init/exit phase is private to the driver. This means while the module or the driver is in control it doesn't has to care about the state of the other part. The core problem is actually not that difficult, you only have to forget all the old cruft, it's only causing confusion and is not worth fixing. bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 11:27 ` Roman Zippel @ 2002-09-12 13:03 ` Rusty Russell 2002-09-12 13:44 ` Roman Zippel 0 siblings, 1 reply; 90+ messages in thread From: Rusty Russell @ 2002-09-12 13:03 UTC (permalink / raw) To: Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel In message <Pine.LNX.4.44.0209121043160.28515-100000@serv> you write: > Sigh, please let's analyze the complete problem first, before making any > more kludges. > Every (loaded) module is at least registered with two hooks in the kernel > - the module structure and a driver structure (e.g. file_system_type for > fs). During unloading we have to remove these hooks safely again. > A module basically goes through these stages during its lifetime: > 1. module load > 2. module init > 3. module exit > 4. module unload > The problem now is stage 3, as it's not allowed to fail. This means before Nope, that's one of the two problems. Read my previous post: the other is partial initialization. Your patch is two-stage delete, with the additional of a usecount function. So you have to move the usecount from the module to each object it registers: great for filesystems, but I don't think it buys you anything (since they were easy anyway). Moreover, I don't see where your patch prevented someone increasing the module count during try_unregister_module(), so that check is pointless (do it in userspace unless they specify rmmod -f). Alexey said we needed two-stage module delete back in 1999, so this is not a new idea... Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 13:03 ` Rusty Russell @ 2002-09-12 13:44 ` Roman Zippel 2002-09-13 1:30 ` Rusty Russell 0 siblings, 1 reply; 90+ messages in thread From: Roman Zippel @ 2002-09-12 13:44 UTC (permalink / raw) To: Rusty Russell; +Cc: Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel Hi, On Thu, 12 Sep 2002, Rusty Russell wrote: > Nope, that's one of the two problems. Read my previous post: the > other is partial initialization. > > Your patch is two-stage delete, with the additional of a usecount > function. So you have to move the usecount from the module to each > object it registers: great for filesystems, but I don't think it buys > you anything (since they were easy anyway). I'm aware of the init problem, what I described was the core problem, which prevents any further cleanup. The usecount is optional, the only important question a module must be able to answer is: Are there any objects/references belonging to the module? It's a simple yes/no question. If a module can't answer that, it likely has more problem than just module unloading. How that interface is exactly done is open for discussion and needs to be specified. > Moreover, I don't see where your patch prevented someone increasing > the module count during try_unregister_module(), so that check is > pointless (do it in userspace unless they specify rmmod -f). I don't see your problem, if someone looks up a module, he gets a reference to the module structure, if a reference count goes to zero the module must be completely freed. State changes are protected with a separate lock, if a module is loaded/unloaded an extra reference is used to prevent module removal. bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-12 13:44 ` Roman Zippel @ 2002-09-13 1:30 ` Rusty Russell 2002-09-13 2:19 ` Daniel Phillips ` (2 more replies) 0 siblings, 3 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-13 1:30 UTC (permalink / raw) To: Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write: > Hi, > > On Thu, 12 Sep 2002, Rusty Russell wrote: > > > Nope, that's one of the two problems. Read my previous post: the > > other is partial initialization. > > > > Your patch is two-stage delete, with the additional of a usecount > > function. So you have to move the usecount from the module to each > > object it registers: great for filesystems, but I don't think it buys > > you anything (since they were easy anyway). > > I'm aware of the init problem, what I described was the core problem, > which prevents any further cleanup. I don't think of either of them as core, they are two problems. > The usecount is optional, the only important question a module must be Yes, for sure. So why check the use count at all? > able to answer is: Are there any objects/references belonging to the > module? It's a simple yes/no question. If a module can't answer that, it > likely has more problem than just module unloading. Ah, we're assuming you insert synchronize_kernel() between the call to stop and the call to exit? In which case *why* do you check the use count *inside* exit_affs_fs? Why not get exit_module() to do "if (mod->usecount() != 0) return -EBUSY; else mod->exit();"? There's the other issue of symmetry. If you allocate memory, in start, do you clean it up in stop or exit? Similarly for other resources: you call mod->exit() every time start fails, so that is supposed to check that mod->start() succeeded? Of course, separating start into "init" and "start" allows you to solve the half-initialized problem as well as clarify the rules. ================ I disagree with pushing the count into the filesystem structure: it changes nothing except hide the fact that you're only keeping reference counts for the sake of the module. Filesystems are low performance impact, but other subsystems really don't want that atomic inc and dec for every access. See my implementation (for those in the audience: http://www.kernel.org/pub/linux/kernel/people/rusty/ if (down_interruptible(&module_mutex)) return -EINTR; mod = find_module(name); if (!mod) { ret = -ENOENT; goto out; } if (!mod->stop && !mod->exit) { /* This module can't be removed */ ret = -EBUSY; goto out; } if (mod->stop) { ret = mod->stop(); if (ret != 0) goto out; } /* Remove symbols so module count can't increase */ spin_lock_irq(&modlist_lock); list_del_init(&mod->symbols.list); spin_unlock_irq(&modlist_lock); /* This may or may not decrement to zero. We may be waiting for someone else who is holding a reference. */ set_current_state(TASK_UNINTERRUPTIBLE); mod->waiting = current; module_put(mod); schedule(); /* Wait to ensure that noone is executing inside a module right now */ synchronize_kernel(); /* Final destruction now noone is using it. */ if (mod->exit) mod->exit(); free_module(mod); ret = 0; Now, I chose to leave the usage count checks as a userspace problem, and *not* to do call ->start again in this implementation, but sleep until the refcount hits 0. This works better for my in ip_conntrack, since ideally the usage count is a count of the number of packets we're tracking, which is under control of the outside world, so I wanted an "rmmod -f". I think either way is valid: effectively, where I do the wait, you do the check and reinitialize the module ("oh oh!"). Cheers, Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 1:30 ` Rusty Russell @ 2002-09-13 2:19 ` Daniel Phillips 2002-09-13 6:51 ` Rusty Russell 2002-09-13 3:14 ` David Gibson 2002-09-13 10:35 ` Roman Zippel 2 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 2:19 UTC (permalink / raw) To: Rusty Russell, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 03:30, Rusty Russell wrote: > In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write: > > The usecount is optional, the only important question a module must be > > able to answer is: Are there any objects/references belonging to the > > module? It's a simple yes/no question. If a module can't answer that, it > > likely has more problem than just module unloading. > > Ah, we're assuming you insert synchronize_kernel() between the call > to stop and the call to exit? > > In which case *why* do you check the use count *inside* exit_affs_fs? > Why not get exit_module() to do "if (mod->usecount() != 0) return > -EBUSY; else mod->exit();"? Because mod->usecount may be a totally inadequate way of determining if a module is busy. How does it work for LSM, for example? > There's the other issue of symmetry. If you allocate memory, in > start, do you clean it up in stop or exit? Actually, I'm going to press you on why you think you even need a two stage stop. I know you have your reasons, but I doubt any of the effects you aim at cannot be achieved with a single stage stop/exit. Could you please summarize the argument in favor of the two stage stop? > Similarly for other > resources: you call mod->exit() every time start fails, so that is > supposed to check that mod->start() succeeded? He does? That's not right. ->start should clean up after itself if it fails, like any other good Linux citizen. > Of course, separating start into "init" and "start" allows you to > solve the half-initialized problem as well as clarify the rules. I doubt it gives any new capability at all. The same with the entrenched separation at the user level between create and init module: what does it give you that an error exit from a single create/init would not? Sure, I know it's not going to change, but I'd like to know what the thinking was, and especially, if there's a non-bogus reason, I'd like to know it. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 2:19 ` Daniel Phillips @ 2002-09-13 6:51 ` Rusty Russell 2002-09-13 13:34 ` Daniel Phillips 2002-09-13 15:59 ` [RFC] Raceless module interface Daniel Phillips 0 siblings, 2 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-13 6:51 UTC (permalink / raw) To: Daniel Phillips, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel In message <E17pg3H-0007pb-00@starship> you write: > On Friday 13 September 2002 03:30, Rusty Russell wrote: > > In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write: > > > The usecount is optional, the only important question a module must be > > > able to answer is: Are there any objects/references belonging to the > > > module? It's a simple yes/no question. If a module can't answer that, it > > > likely has more problem than just module unloading. > > > > Ah, we're assuming you insert synchronize_kernel() between the call > > to stop and the call to exit? > > > > In which case *why* do you check the use count *inside* exit_affs_fs? > > Why not get exit_module() to do "if (mod->usecount() != 0) return > > -EBUSY; else mod->exit();"? > > Because mod->usecount may be a totally inadequate way of determining > if a module is busy. How does it work for LSM, for example? As established previously: unless the hooks do it for you, you keep track of it yourself before you sleep. > > There's the other issue of symmetry. If you allocate memory, in > > start, do you clean it up in stop or exit? > > Actually, I'm going to press you on why you think you even need a > two stage stop. I know you have your reasons, but I doubt any of > the effects you aim at cannot be achieved with a single stage > stop/exit. Could you please summarize the argument in favor of the > two stage stop? Of course you can simulate a two-stage within a single-stage, of course, by doing int single_stage(void) { stage_one(); stage_two(); }, so "need" is a bit strong. Basically, you can't do stage two until you know noone is using the module, but you can do stage one at any time. Basically stage 1 ensures that the refcount never *increases* by removing all external references to the module (ie. deregister, etc). Stage 2 then knows that if the refcnt is zero, there's no race and it can destroy they module data structures. The synchronize_kernel() covers those "I was about to bump the module count!" races, as long as noone explicitly sleeps before mod_inc, or after mod_dec. > > Similarly for other > > resources: you call mod->exit() every time start fails, so that is > > supposed to check that mod->start() succeeded? > > He does? That's not right. ->start should clean up after itself if > it fails, like any other good Linux citizen. >From my reading, yes. > > Of course, separating start into "init" and "start" allows you to > > solve the half-initialized problem as well as clarify the rules. > > I doubt it gives any new capability at all. Since I explained what it does for you at the kernel summit, you obviously aren't listening. If you split registration interfaces into reserve (can fail) and use (can't fail), then you do: int my_module_init(void) { int ret; ret = reserve_foo(); if (ret != 0) return ret; ret = reserve_bar(); if (ret != 0) unreserve_foo(); return ret; } void my_module_start(void) { use_foo(); use_bar(); } Note the symmetry here with the exit case: noone can actually use the module until my_module_start is called, so even if the reserve_bar() fails, we're safe. > The same with the entrenched separation at the user level between > create and init module: what does it give you that an error exit > from a single create/init would not? That's done for entirely different reasons, as the userspace linker needs to know the address of the module. > Sure, I know it's not going to change, but I'd like to know what the > thinking was, and especially, if there's a non-bogus reason, I'd > like to know it. You should probably start playing with my code if you're really interested. Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 6:51 ` Rusty Russell @ 2002-09-13 13:34 ` Daniel Phillips 2002-09-13 13:52 ` Thunder from the hill 2002-09-16 2:17 ` Rusty Russell 2002-09-13 15:59 ` [RFC] Raceless module interface Daniel Phillips 1 sibling, 2 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 13:34 UTC (permalink / raw) To: Rusty Russell, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 08:51, Rusty Russell wrote: > If you split registration interfaces into reserve (can fail) and > use (can't fail), then you do: > > int my_module_init(void) > { > int ret; > ret = reserve_foo(); > if (ret != 0) > return ret; > ret = reserve_bar(); > if (ret != 0) > unreserve_foo(); > return ret; > } > > void my_module_start(void) > { > use_foo(); > use_bar(); > } Why is that different from: int my_module_init(void) { int ret; ret = reserve_foo(); if (ret != 0) return ret; ret = reserve_bar(); if (ret != 0) { unreserve_foo(); return ret; } use_foo(); use_bar(); return 0; } > Note the symmetry here with the exit case: noone can actually use the > module until my_module_start is called, so even if the reserve_bar() > fails, we're safe. And in my example above, nobody can actually use the module until use_foo(). What's the difference? > > Sure, I know it's not going to change, but I'd like to know what the > > thinking was, and especially, if there's a non-bogus reason, I'd > > like to know it. > > You should probably start playing with my code if you're really > interested. Of course, and you might consider actually reading my [RFC]. We still disagree on whether your fat interface or my thin one is the right way to go. Don't forget that the Unix way has traditionally been to use the simplest interface that will do the job; if you propose a fat interface you need to prove that the thin one cannot do the job. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 13:34 ` Daniel Phillips @ 2002-09-13 13:52 ` Thunder from the hill 2002-09-13 14:09 ` Daniel Phillips 2002-09-16 2:17 ` Rusty Russell 1 sibling, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-09-13 13:52 UTC (permalink / raw) To: Daniel Phillips Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > On Friday 13 September 2002 08:51, Rusty Russell wrote: > > [cool code] > > Why is that different from: > > [more code] Because in your example, my_module_start() would not be able to run separately, and because, as Rusty mentioned, my_module_init() can fail separately (e.g. if there's no space to drop that). Thunder -- --./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .- --/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..- .- -/---/--/---/.-./.-./---/.--/.-.-.- --./.-/-.../.-./.././.-../.-.-.- ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 13:52 ` Thunder from the hill @ 2002-09-13 14:09 ` Daniel Phillips 2002-09-13 14:33 ` Thunder from the hill 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 14:09 UTC (permalink / raw) To: Thunder from the hill Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 15:52, Thunder from the hill wrote: > Hi, > > On Fri, 13 Sep 2002, Daniel Phillips wrote: > > On Friday 13 September 2002 08:51, Rusty Russell wrote: > > > [cool code] > > > > Why is that different from: > > > > [more code] > > Because in your example, my_module_start() would not be able to run > separately That's obvious. What hasn't been shown is why that's necessary. Note: this is the *real* meaning of "begs the question". You answered my question "why is it necessary the these to be separate" with "because if they were not separate, then you could not use them separately". In logical terms, it amounts to "A because A". This is a logical falacy called "begging the question". When people say "begs the question", 99% of the time they really mean "invites the question". As an exercise, try scanning lkml for "From includes Torvalds" and "begs". Linus studied debating ;-) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 14:09 ` Daniel Phillips @ 2002-09-13 14:33 ` Thunder from the hill 2002-09-13 14:44 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-09-13 14:33 UTC (permalink / raw) To: Daniel Phillips Cc: Thunder from the hill, Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > > Because in your example, my_module_start() would not be able to run > > separately > > That's obvious. What hasn't been shown is why that's necessary. Yeah, but it was also obvious that my_module_init() can fail this way. Look, first we watch the module initialization, that is, we run the critical stuff like resource allocation, data structure allocation, etc. If we fail here, we can't load the module, because it would be unoperative if we proceed. (Because the data simply isn't there.) And possibly Rusty wanted to avoid a certain race, which is unrelated to school and kids. Once we've initialized, the module can be used, earlier is balderdash. Thunder -- --./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .- --/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..- .- -/---/--/---/.-./.-./---/.--/.-.-.- --./.-/-.../.-./.././.-../.-.-.- ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 14:33 ` Thunder from the hill @ 2002-09-13 14:44 ` Daniel Phillips 2002-09-13 14:59 ` Thunder from the hill 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 14:44 UTC (permalink / raw) To: Thunder from the hill Cc: Thunder from the hill, Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 16:33, Thunder from the hill wrote: > Hi, > > On Fri, 13 Sep 2002, Daniel Phillips wrote: > > > Because in your example, my_module_start() would not be able to run > > > separately > > > > That's obvious. What hasn't been shown is why that's necessary. > > Yeah, but it was also obvious that my_module_init() can fail this way. > Look, first we watch the module initialization, that is, we run the > critical stuff like resource allocation, data structure allocation, etc. > If we fail here, we can't load the module, because it would be unoperative > if we proceed. (Because the data simply isn't there.) > > And possibly Rusty wanted to avoid a certain race, which is unrelated to > school and kids. Once we've initialized, the module can be used, earlier > is balderdash. On the face of it, your post is content-free. To redeem yourself, please identify the race Rusty avoided and show how I did not avoid the same race. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 14:44 ` Daniel Phillips @ 2002-09-13 14:59 ` Thunder from the hill 2002-09-13 15:17 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-09-13 14:59 UTC (permalink / raw) To: Daniel Phillips Cc: Thunder from the hill, Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > On Friday 13 September 2002 16:33, Thunder from the hill wrote: > > Look, first we watch the module initialization, that is, we run the > > critical stuff like resource allocation, data structure allocation, etc. > > If we fail here, we can't load the module, because it would be unoperative > > if we proceed. (Because the data simply isn't there.) -> if we don't do the starting here, we can operate on the data structures earlier, since we know they're running free. Also could we run start again, even though it sounds buggy. > > And possibly Rusty wanted to avoid a certain race, which is unrelated to > > school and kids. Ouch. I've referred to a comparison here which I've dropped lateron. Sorry if you felt insulted. > please identify the race Rusty avoided and show how I did not avoid the > same race. I'm sure Rusty could do that better. However, there might be some weird situations. For example, take someone trying to bring all modules down the moment we init. We might start running in unchecked environment, and there we fail because there is no 'we' any more. Thus rather module->init(). if (module) module->start(). Since then we can be sure that the module is locked, and if somebody unloads it, he'll have to wait for the use count to drop. Or as another example, take someone trying to use the resources we claimed before the module is really up. If you can rely on the module to be known to be up, you know what do do. Yes, usually that's no real good example, since resources ought to be locked as well. Thunder -- --./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .- --/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..- .- -/---/--/---/.-./.-./---/.--/.-.-.- --./.-/-.../.-./.././.-../.-.-.- ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 14:59 ` Thunder from the hill @ 2002-09-13 15:17 ` Daniel Phillips 2002-09-13 15:27 ` Thunder from the hill 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 15:17 UTC (permalink / raw) To: Thunder from the hill Cc: Thunder from the hill, Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 16:59, Thunder from the hill wrote: > Hi, > > On Fri, 13 Sep 2002, Daniel Phillips wrote: > > On Friday 13 September 2002 16:33, Thunder from the hill wrote: > > > Look, first we watch the module initialization, that is, we run the > > > critical stuff like resource allocation, data structure allocation, etc. > > > If we fail here, we can't load the module, because it would be unoperative > > > if we proceed. (Because the data simply isn't there.) > > -> if we don't do the starting here, we can operate on the data structures > earlier, since we know they're running free. > > Also could we run start again, even though it sounds buggy. We can do this without having start separte from init as well. > > please identify the race Rusty avoided and show how I did not avoid the > > same race. > > I'm sure Rusty could do that better. I'd be surprised if Rusty can do it any better than you. It's hard to show a race that doesn't exist, even harder to prove that a four-prong interface is necessary in order to be able to handle it. The latter is the question on the table. > However, there might be some weird > situations. For example, take someone trying to bring all modules down > the moment we init. We might start running in unchecked environment, and > there we fail because there is no 'we' any more. Oh indeed, there are weird situations, but they apply equally to the two-prong and the four-prong interfaces. > Thus rather module->init(). if (module) module->start(). Since then we can > be sure that the module is locked, and if somebody unloads it, he'll have > to wait for the use count to drop. This applies equally to the two-prong interface. > Or as another example, take someone trying to use the resources we claimed > before the module is really up. If you can rely on the module to be known > to be up, you know what do do. Yes, usually that's no real good example, > since resources ought to be locked as well. This applies equally to the two-prong interface. Do you see the pattern yet? -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:17 ` Daniel Phillips @ 2002-09-13 15:27 ` Thunder from the hill 2002-09-13 15:37 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-09-13 15:27 UTC (permalink / raw) To: Daniel Phillips Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, Linux Kernel Mailing List Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > I'd be surprised if Rusty can do it any better than you. It's hard to > show a race that doesn't exist, even harder to prove that a four-prong > interface is necessary in order to be able to handle it. The latter is > the question on the table. Not as easy as missing ones point, yes. > > However, there might be some weird situations. For example, take > > someone trying to bring all modules down the moment we init. We might > > start running in unchecked environment, and there we fail because > > there is no 'we' any more. > > Oh indeed, there are weird situations, but they apply equally to the > two-prong and the four-prong interfaces. > > > Thus rather module->init(). if (module) module->start(). Since then we can > > be sure that the module is locked, and if somebody unloads it, he'll have > > to wait for the use count to drop. > > This applies equally to the two-prong interface. > > > Or as another example, take someone trying to use the resources we claimed > > before the module is really up. If you can rely on the module to be known > > to be up, you know what do do. Yes, usually that's no real good example, > > since resources ought to be locked as well. > > This applies equally to the two-prong interface. Do you see the pattern > yet? Yes, but you don't seem to. (No, I don't want to insult you here.) Just to draw that: 2p: thread1 thread2 struct x *y = malloc(sizeof(struct x)); check y; blah(); cleanup(y et al); touch y->blah; /* bang */ 4p: thread1 thread2 struct x *y = malloc(sizeof(struct x)); check y; struct z *a = malloc(sizeof(struct z)); check a; cleanup(y et al); struct u *v = malloc(sizeof(struct u)); check v; return success; (back in caller - from that moment we're protected against arbitrariness from other threads) check y; <-- detected You know? We allocate, we lock, we check, we win. Thunder -- --./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .- --/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..- .- -/---/--/---/.-./.-./---/.--/.-.-.- --./.-/-.../.-./.././.-../.-.-.- ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:27 ` Thunder from the hill @ 2002-09-13 15:37 ` Daniel Phillips 0 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 15:37 UTC (permalink / raw) To: Thunder from the hill Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, Linux Kernel Mailing List On Friday 13 September 2002 17:27, Thunder from the hill wrote: > > This applies equally to the two-prong interface. Do you see the pattern > > yet? > > Yes, but you don't seem to. (No, I don't want to insult you here.) > > Just to draw that: > > 2p: > > thread1 thread2 > struct x *y = malloc(sizeof(struct x)); > check y; > blah(); cleanup(y et al); > touch y->blah; /* bang */ This can't happen because a semaphore serializes the load and unload. (Currently, module.c uses lock_kernel, which is obviously inadequate.) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 13:34 ` Daniel Phillips 2002-09-13 13:52 ` Thunder from the hill @ 2002-09-16 2:17 ` Rusty Russell 2002-09-16 16:13 ` Daniel Phillips ` (2 more replies) 1 sibling, 3 replies; 90+ messages in thread From: Rusty Russell @ 2002-09-16 2:17 UTC (permalink / raw) To: Daniel Phillips, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel In message <E17pqaz-000891-00@starship> you write: > On Friday 13 September 2002 08:51, Rusty Russell wrote: > > If you split registration interfaces into reserve (can fail) and > > use (can't fail), then you do: > > Why is that different from: > > int my_module_init(void) > { > int ret; > ret = reserve_foo(); > if (ret != 0) > return ret; > ret = reserve_bar(); > if (ret != 0) { > unreserve_foo(); > return ret; > } > use_foo(); > use_bar(); > return 0; > } As I said: R> Of course you can simulate a two-stage within a single-stage, of course, R> by doing int single_stage(void) { stage_one(); stage_two(); }, so R> "need" is a bit strong. > Of course, and you might consider actually reading my [RFC]. You weighed into a debate without background, with a longwinded "original" suggestion which wasn't, and then you accuse *me* of not reading? You divided modules into counting and non-counting. This is overly simplistic. In fact, *interfaces* are divided into counting and non-counting: a module may use both. A module which only uses counting interfaces is trivially safe from unload races. The interesting problem is module which control their own reference counts, because they use one or more non-counting interfaces. Your "solution" does not work: > unregister_callpoints(...); > magic_wait_for_quiescence(); > return cleanup_foo(...); In fact, it would look like: > unregister_callpoints(...); > synchronize_kernel(); > if (atomic_read(&usecount) != 0) { > reregister_callpoints(...); > return -EBUSY; > } > cleanup_foo(); Now, think what happens if reregister_callpoints() fails. So we need "unuse_xxx" here then. Now, *think* for one moment, from the point of view of the author of one of these modules. Now, how are you going to explain the subtle requirements of your "two stage in one" interfaces? Bear in mind that the init races weren't even understood by anyone on linux-kernel until two years ago, and you're dealing with a newbie kernel programmer. Now do you see my preference for taking the weight off the shoulders of module authors? It's just not sane to ask them to deal with these fairly esoteric races, and expect them to get it right. We could simply ban modules from using non-counting interfaces. Or we could introduce two-stage registration interfaces and then simply ban their unloading. Or we can make their unloading a kernel hacking option. Or we can provide all the infrastructure, and allow the module authors to set their own comfort level. > Don't forget that the Unix way has traditionally been to use the > simplest interface that will do the job; if you propose a fat > interface you need to prove that the thin one cannot do the job. Gee, really? You're so clever! You patronising little shit, Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-16 2:17 ` Rusty Russell @ 2002-09-16 16:13 ` Daniel Phillips 2002-09-16 16:36 ` Understanding the Principles of Argumentation #3 Daniel Phillips 2002-09-16 22:31 ` David Woodhouse 2 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-16 16:13 UTC (permalink / raw) To: Rusty Russell, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel On Monday 16 September 2002 04:17, Rusty Russell wrote: > You weighed into a debate without background, with a longwinded > "original" suggestion which wasn't, and then you accuse *me* of not > reading? Yup. If you had actually read it, I apologize, but you showed every sign of having not read it. > You divided modules into counting and non-counting. This is overly > simplistic. In fact, *interfaces* are divided into counting and > non-counting: a module may use both. Read closely and you will see I covered that. Sure, my [rfc] may have been less consise than the perfect piece of prose you would have submitted in my place, but other than that, it seems to have held up rather well under attack. > A module which only uses > counting interfaces is trivially safe from unload races. The > interesting problem is module which control their own reference > counts, because they use one or more non-counting interfaces. > > Your "solution" does not work: > > > unregister_callpoints(...); > > magic_wait_for_quiescence(); > > return cleanup_foo(...); > > In fact, it would look like: > > > unregister_callpoints(...); > > synchronize_kernel(); > > if (atomic_read(&usecount) != 0) { > > reregister_callpoints(...); > > return -EBUSY; > > } > > cleanup_foo(); Ah no, I don't like the second one, it will introduce erratic behaviour in LSM. You will have a state where some security calls are implemented and others not, then you will return to a state where all are. Eep. Granted, LSM will always have a state when the calls are partially unplugged, but the next state after that should be complete removal. The admin is going to be awfully surprised if something else happens. So what I had in mind here (and did mention it) is to take the counter into account in magic_wait_for_quiescence(), on the understanding that this counter only counts possibly-sleeping states of module threads. Anyway, how do you propose to handle this: task in module: inc count sleep synchronize_kernel: schedule() wake up on cpu 1 wake up on cpu 0 dec count if (atomic_read(&usecount) != 0) /* false */ free_module(module); BOOM! > Now, think what happens if reregister_callpoints() fails. So we need > "unuse_xxx" here then. > > Now, *think* for one moment, from the point of view of the author of > one of these modules. Now, how are you going to explain the subtle > requirements of your "two stage in one" interfaces? Bear in mind that > the init races weren't even understood by anyone on linux-kernel until > two years ago, and you're dealing with a newbie kernel programmer. OK, *finally* we get to the core of your argument. You accused me of being long-winded, but you have just taken several days to admit that there is no inherent difference in capability between your proposal and mine. Though to be sure, some of your commentary was informative and useful, still, you should have taken the position you are now about to take right from the beginning. > Now do you see my preference for taking the weight off the shoulders > of module authors? It's just not sane to ask them to deal with these > fairly esoteric races, and expect them to get it right. I feel no need to take any weight off the shoulders of the LSM authors, they are big boys and they better know what they're doing. IMHO, they'd appreciate maximum control at the insertion/removal stage. But proc modules, device drivers, even filesystems: I strongly agree we need a dumbed-down interface. However, please forgive me for not immediately accepting the proposition that yours is the final word in dumbed-down module interfaces. I'm thinking that a set of easy-to-use library calls will be just as good and will win on the grounds of using the simpler interface. > We could simply ban modules from using non-counting interfaces. No we can't. As I mentioned in my [rfc], there are some modules that simply cannot use this technique, and I hope we've already rejected the strategy of disallowing removal of such modules. (The Chinese have a saying that covers this: "cut off toes to avoid worm coming".) > Or we > could introduce two-stage registration interfaces and then simply ban > their unloading. No, the complaining would never end. > Or we can make their unloading a kernel hacking > option. Or we can provide all the infrastructure, and allow the > module authors to set their own comfort level. This is a logical falacy, let's call it "trifurcation". You have presented three alternatives as if they were the only alternatives. Two were bogus anyway; by either measure this last qualifies as pure rhetoric. In summary, you have retreated from the position that my interface is somehow less capable than yours, and you have dug in to defend the position that your interface is better for newbie programmers, which appears to be your only argument at the moment. If you turn out to be right, then good and my hat off to you. If you are wrong, and I can present an interface that is just as easy for newbie programmers, than I am right and you should thank me for sticking to my guns and insisting that we go with the simplest inteface that will do the job. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Understanding the Principles of Argumentation #3 2002-09-16 2:17 ` Rusty Russell 2002-09-16 16:13 ` Daniel Phillips @ 2002-09-16 16:36 ` Daniel Phillips 2002-09-16 16:42 ` Robinson Maureira Castillo 2002-09-16 17:29 ` Cort Dougan 2002-09-16 22:31 ` David Woodhouse 2 siblings, 2 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-16 16:36 UTC (permalink / raw) To: Rusty Russell, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel This is the third in my "Understanding the Principles of Argumentation" series. Rusty has obligingly provided us with a fine example of an "ad hominem attack": On Monday 16 September 2002 04:17, Rusty Russell wrote: > > Don't forget that the Unix way has traditionally been to use the > > simplest interface that will do the job; if you propose a fat > > interface you need to prove that the thin one cannot do the job. > > Gee, really? You're so clever! > > You patronising little shit, An ad hominem attack is a logical fallacy where the arguer attacks the person, rather than the issue: http://home.mcn.net/~montanabw/fallacies.html -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Understanding the Principles of Argumentation #3 2002-09-16 16:36 ` Understanding the Principles of Argumentation #3 Daniel Phillips @ 2002-09-16 16:42 ` Robinson Maureira Castillo 2002-09-16 17:29 ` Cort Dougan 1 sibling, 0 replies; 90+ messages in thread From: Robinson Maureira Castillo @ 2002-09-16 16:42 UTC (permalink / raw) To: Daniel Phillips Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel Can you PLEASE discuss that in PRIVATE? Thanks in advice. On Mon, 16 Sep 2002, Daniel Phillips wrote: > This is the third in my "Understanding the Principles of Argumentation" > series. Rusty has obligingly provided us with a fine example of an "ad > hominem attack": > > On Monday 16 September 2002 04:17, Rusty Russell wrote: > > > Don't forget that the Unix way has traditionally been to use the > > > simplest interface that will do the job; if you propose a fat > > > interface you need to prove that the thin one cannot do the job. > > > > Gee, really? You're so clever! > > > > You patronising little shit, > > An ad hominem attack is a logical fallacy where the arguer attacks the > person, rather than the issue: > > http://home.mcn.net/~montanabw/fallacies.html > > -- Robinson Maureira Castillo Asesor DAI INACAP ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Understanding the Principles of Argumentation #3 2002-09-16 16:36 ` Understanding the Principles of Argumentation #3 Daniel Phillips 2002-09-16 16:42 ` Robinson Maureira Castillo @ 2002-09-16 17:29 ` Cort Dougan 1 sibling, 0 replies; 90+ messages in thread From: Cort Dougan @ 2002-09-16 17:29 UTC (permalink / raw) To: Daniel Phillips Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel In cases where the issue is extraordinarily boring and pointless, it's sometimes more entertaining to attack the person. His attack is not baseless, though. You are being a "patronizing little shit". Solve the problem and move on. Do not argue for days about a hangnail. } This is the third in my "Understanding the Principles of Argumentation" } series. Rusty has obligingly provided us with a fine example of an "ad } hominem attack": ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Understanding the Principles of Argumentation #3 2002-09-16 2:17 ` Rusty Russell 2002-09-16 16:13 ` Daniel Phillips 2002-09-16 16:36 ` Understanding the Principles of Argumentation #3 Daniel Phillips @ 2002-09-16 22:31 ` David Woodhouse 2002-10-01 14:13 ` Daniel Phillips 2002-10-01 14:27 ` David Woodhouse 2 siblings, 2 replies; 90+ messages in thread From: David Woodhouse @ 2002-09-16 22:31 UTC (permalink / raw) To: Daniel Phillips Cc: Rusty Russell, Roman Zippel, Jamie Lokier, Alexander Viro, linux-kernel phillips@arcor.de said: > This is the third in my "Understanding the Principles of > Argumentation" series. Rusty has obligingly provided us with a fine > example of an "ad hominem attack": It is my understanding that the ad hominem fallacy takes the form "you are a patronising little shit, therefore what you say cannot be true". Rusty's comment seemed to be only that you were a patronising little shit, and not that this proved you to be incorrect -- hence it doesn't appear to be a particularly fine example of 'ad hominem' at all. -- dwmw2 ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Understanding the Principles of Argumentation #3 2002-09-16 22:31 ` David Woodhouse @ 2002-10-01 14:13 ` Daniel Phillips 2002-10-01 14:27 ` David Woodhouse 1 sibling, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-10-01 14:13 UTC (permalink / raw) To: David Woodhouse; +Cc: linux-kernel On Tuesday 17 September 2002 00:31, David Woodhouse wrote: > phillips@arcor.de said: > > This is the third in my "Understanding the Principles of > > Argumentation" series. Rusty has obligingly provided us with a fine > > example of an "ad hominem attack": > > It is my understanding that the ad hominem fallacy takes the form "you are a > patronising little shit, therefore what you say cannot be true". > > Rusty's comment seemed to be only that you were a patronising little shit, > and not that this proved you to be incorrect -- hence it doesn't appear to > be a particularly fine example of 'ad hominem' at all. You are entirely incorrect. The issue raised was "given two interfaces with the same functionality, choose the simpler of them". Instead of addressing that issue, the author was attacked. Perfect example, as I said. Now, if you think this thread is too long, why did you post to it? -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Understanding the Principles of Argumentation #3 2002-09-16 22:31 ` David Woodhouse 2002-10-01 14:13 ` Daniel Phillips @ 2002-10-01 14:27 ` David Woodhouse 1 sibling, 0 replies; 90+ messages in thread From: David Woodhouse @ 2002-10-01 14:27 UTC (permalink / raw) To: Daniel Phillips; +Cc: linux-kernel phillips@arcor.de said: > > It is my understanding that the ad hominem fallacy takes the form "you > > are a patronising little shit, therefore what you say cannot be true". > > > > Rusty's comment seemed to be only that you were a patronising little > > shit, and not that this proved you to be incorrect -- hence it doesn't > > appear to be a particularly fine example of 'ad hominem' at all. > You are entirely incorrect. The issue raised was "given two > interfaces with the same functionality, choose the simpler of them". > Instead of addressing that issue, the author was attacked. Perfect > example, as I said. No, Daniel. Do try to pay attention. The ad hominem fallacy take the form "You are a patronising little shit, therefore what you say cannot be true.". Rusty's mail contained only the first half of that 'logic', and hence was not an example of ad hominem at all. Stating that you are a patronising little shit was a digression and was irrelevant to the argument. Stating that _because_ you are a patronising little shit your claims must therefore be wrong, would be an example of the ad hominem fallacy. But as I said -- Rusty didn't do that; he only did the former. See http://www.nizkor.org/features/fallacies/ad-hominem.html Note the words "Second, this attack is taken to be evidence against the claim or argument the person in question is making (or presenting)." -- dwmw2 ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 6:51 ` Rusty Russell 2002-09-13 13:34 ` Daniel Phillips @ 2002-09-13 15:59 ` Daniel Phillips 1 sibling, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 15:59 UTC (permalink / raw) To: Rusty Russell, Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 08:51, Rusty Russell wrote: > In message <E17pg3H-0007pb-00@starship> you write: > > The same with the entrenched separation at the user level between > > create and init module: what does it give you that an error exit > > from a single create/init would not? > > That's done for entirely different reasons, as the userspace linker > needs to know the address of the module. Thanks for clearing that up. Perhaps a callback would have been better, but it's obviously moot now. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 1:30 ` Rusty Russell 2002-09-13 2:19 ` Daniel Phillips @ 2002-09-13 3:14 ` David Gibson 2002-09-13 10:35 ` Roman Zippel 2 siblings, 0 replies; 90+ messages in thread From: David Gibson @ 2002-09-13 3:14 UTC (permalink / raw) To: Rusty Russell Cc: Roman Zippel, Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel On Fri, Sep 13, 2002 at 11:30:47AM +1000, Paul 'Rusty' Russell wrote: > In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write: > > Hi, > > > > On Thu, 12 Sep 2002, Rusty Russell wrote: > > > > > Nope, that's one of the two problems. Read my previous post: the > > > other is partial initialization. > > > > > > Your patch is two-stage delete, with the additional of a usecount > > > function. So you have to move the usecount from the module to each > > > object it registers: great for filesystems, but I don't think it buys > > > you anything (since they were easy anyway). > > > > I'm aware of the init problem, what I described was the core problem, > > which prevents any further cleanup. > > I don't think of either of them as core, they are two problems. Actually, with one stage init, module unload is essentially a special case of module load failure, consider: module_init() { /* initialize stuff */ ... wait_event_interruptible(wq, 0 == 1); /* clean stuff up */ ... return -EINTR } -- David Gibson | For every complex problem there is a david@gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 1:30 ` Rusty Russell 2002-09-13 2:19 ` Daniel Phillips 2002-09-13 3:14 ` David Gibson @ 2002-09-13 10:35 ` Roman Zippel 2002-09-13 13:53 ` Daniel Phillips 2002-09-16 1:49 ` Rusty Russell 2 siblings, 2 replies; 90+ messages in thread From: Roman Zippel @ 2002-09-13 10:35 UTC (permalink / raw) To: Rusty Russell; +Cc: Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel Hi, On Fri, 13 Sep 2002, Rusty Russell wrote: > > I'm aware of the init problem, what I described was the core problem, > > which prevents any further cleanup. > > I don't think of either of them as core, they are two problems. Your init problem is easily solvable for try_inc_mod_count() users, just add a check for MOD_INITIALIZING, so it will fail during module init. On the other hand forcing everyone to use try_inc_mod_count is a serious problem. > > able to answer is: Are there any objects/references belonging to the > > module? It's a simple yes/no question. If a module can't answer that, it > > likely has more problem than just module unloading. > > Ah, we're assuming you insert synchronize_kernel() between the call > to stop and the call to exit? > > In which case *why* do you check the use count *inside* exit_affs_fs? > Why not get exit_module() to do "if (mod->usecount() != 0) return > -EBUSY; else mod->exit();"? > > There's the other issue of symmetry. If you allocate memory, in > start, do you clean it up in stop or exit? Similarly for other > resources: you call mod->exit() every time start fails, so that is > supposed to check that mod->start() succeeded? Actually I consider removing the start/stop methods, if one can't get it right without them, one won't get it right with them either and it's easy enough to add them later again. The exit function should always be called after the init function (even if it failed, I don't do it in the patch, that's a bug). The fs init/exit would like this then: static int __init init_fs(void) { inode_cachep = kmem_cache_create(...); if (!inode_cachep) return -ENOMEM; return register_filesystem(&fs_type); } static int __exit exit_fs(void) { int err; err = unregister_filesystem(&fs_type); if (err) return err; kmem_cache_destroy(inode_cachep); inode_cachep = NULL; return 0; } > I disagree with pushing the count into the filesystem structure: it > changes nothing except hide the fact that you're only keeping > reference counts for the sake of the module. Filesystems are low > performance impact, but other subsystems really don't want that atomic > inc and dec for every access. As I already said before, it doesn't has to be an atomic reference count. > if (down_interruptible(&module_mutex)) > return -EINTR; I really don't like that the global module lock is held during calls to the driver, e.g. it makes it impossible to request further modules during module load. > This works better for my in ip_conntrack, since ideally the usage > count is a count of the number of packets we're tracking, which is > under control of the outside world, so I wanted an "rmmod -f". The ip_conntrack problem is a variation of the LSM problem and both are a problem of the driver not of the module code. Basically you have to wait long enough until no new package can call to the module. This means after removing the hooks, you have to check how much packages are busy and wait for them to be processed. bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 10:35 ` Roman Zippel @ 2002-09-13 13:53 ` Daniel Phillips 2002-09-13 15:13 ` Roman Zippel 2002-09-16 1:49 ` Rusty Russell 1 sibling, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 13:53 UTC (permalink / raw) To: Roman Zippel, Rusty Russell; +Cc: Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 12:35, Roman Zippel wrote: > Hi, > > On Fri, 13 Sep 2002, Rusty Russell wrote: > > > > I'm aware of the init problem, what I described was the core problem, > > > which prevents any further cleanup. > > > > I don't think of either of them as core, they are two problems. > > Your init problem is easily solvable for try_inc_mod_count() users, just > add a check for MOD_INITIALIZING, so it will fail during module init. On > the other hand forcing everyone to use try_inc_mod_count is a serious > problem. Well now. We agree about that, do we not? I go on to conclude that it is better to leave the job of monitoring module activity and determining quiescence up to the module itself, do we agree about that as well? > Actually I consider removing the start/stop methods, if one can't get it > right without them, one won't get it right with them either and it's easy > enough to add them later again. Exactly my feeling. > The exit function should always be called after the init function (even if > it failed, I don't do it in the patch, that's a bug). The fs init/exit > would like this then: Perhaps, but if so, the module itself should call the exit function in its failure path itself. Doing the full exit whether it needs to be done or not is wasteful and opens up new DoS opportunities. In the example you give below you must rely on register_filesystem tolerating unregistering a nonexistent filesystem. That's sloppy at best, and you will have to ensure *every* helper used by ->exit is similarly sloppy. > static int __init init_fs(void) > { > inode_cachep = kmem_cache_create(...); > if (!inode_cachep) > return -ENOMEM; > return register_filesystem(&fs_type); > } > > static int __exit exit_fs(void) > { > int err; > err = unregister_filesystem(&fs_type); > if (err) > return err; > kmem_cache_destroy(inode_cachep); > inode_cachep = NULL; > return 0; > } -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 13:53 ` Daniel Phillips @ 2002-09-13 15:13 ` Roman Zippel 2002-09-13 15:30 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Roman Zippel @ 2002-09-13 15:13 UTC (permalink / raw) To: Daniel Phillips; +Cc: Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > > The exit function should always be called after the init function (even if > > it failed, I don't do it in the patch, that's a bug). The fs init/exit > > would like this then: > > Perhaps, but if so, the module itself should call the exit function in > its failure path itself. Doing the full exit whether it needs to be > done or not is wasteful and opens up new DoS opportunities. The exit itself can fail as well, so it has to be done by the module code anyway (until it suceeds). What DoS opportunities are there? Module init failure is the exception case and usally needs further attention, so we could actually disable further attempts to load this module, unless the user tells us specifically so. > In the example you give below you must rely on register_filesystem > tolerating unregistering a nonexistent filesystem. That's sloppy at > best, and you will have to ensure *every* helper used by ->exit is > similarly sloppy. Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well. bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:13 ` Roman Zippel @ 2002-09-13 15:30 ` Daniel Phillips 2002-09-13 15:55 ` Roman Zippel ` (2 more replies) 0 siblings, 3 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 15:30 UTC (permalink / raw) To: Roman Zippel; +Cc: Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 17:13, Roman Zippel wrote: > Hi, > > On Fri, 13 Sep 2002, Daniel Phillips wrote: > > > > The exit function should always be called after the init function (even if > > > it failed, I don't do it in the patch, that's a bug). The fs init/exit > > > would like this then: > > > > Perhaps, but if so, the module itself should call the exit function in > > its failure path itself. Doing the full exit whether it needs to be > > done or not is wasteful and opens up new DoS opportunities. > > The exit itself can fail as well, so it has to be done by the module code > anyway (until it suceeds). That's debatable. Arguably, a failed ->module_cleanup() should be retried on every rmmod -a, but expecting module.c to just keep retrying mindlessly on its own sounds too much like a busy wait. > What DoS opportunities are there? Suppose the module exit relies on synchronize_kernel. The attacker can force repeated synchronize_kernels, knowing that module.c will mindlessly do a synchronize_kernel every time a module init fails, whether needed or not. Each synchronize_kernel takes an unbounded amount of time to complete, across which module.c holds a lock. > Module init failure is the exception > case and usally needs further attention, so we could actually disable > further attempts to load this module, unless the user tells us > specifically so. Sure, you can fix it by lathering on more complexity. What you have to do is explain why we should do that, when there is a simpler and faster approach that doesn't introduce the problem in the first place. > > In the example you give below you must rely on register_filesystem > > tolerating unregistering a nonexistent filesystem. That's sloppy at > > best, and you will have to ensure *every* helper used by ->exit is > > similarly sloppy. > > Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well. That is sloppy. Different discussion. I take it that the points you didn't reply to are points that you agree with? (The main point being, that we both advocate a simple, two-method interface for module load/unload.) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:30 ` Daniel Phillips @ 2002-09-13 15:55 ` Roman Zippel 2002-09-13 16:09 ` Daniel Phillips 2002-09-13 16:39 ` Thunder from the hill 2002-09-16 0:24 ` Bill Davidsen 2 siblings, 1 reply; 90+ messages in thread From: Roman Zippel @ 2002-09-13 15:55 UTC (permalink / raw) To: Daniel Phillips; +Cc: Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > > The exit itself can fail as well, so it has to be done by the module code > > anyway (until it suceeds). > > That's debatable. Arguably, a failed ->module_cleanup() should be > retried on every rmmod -a, but expecting module.c to just keep > retrying mindlessly on its own sounds too much like a busy wait. That's not what I meant, if module_init fails the module goes directly to the cleanup state and the module code calls module_exit. Depending on this return value it continues to the exit state. Further exit attempts (if necessary) are done on user request. > > What DoS opportunities are there? > > Suppose the module exit relies on synchronize_kernel. The attacker > can force repeated synchronize_kernels, knowing that module.c will > mindlessly do a synchronize_kernel every time a module init fails, > whether needed or not. Each synchronize_kernel takes an unbounded > amount of time to complete, across which module.c holds a lock. This can't happen: if (hook) { hook = NULL; synchronize(); } > > Module init failure is the exception > > case and usally needs further attention, so we could actually disable > > further attempts to load this module, unless the user tells us > > specifically so. > > Sure, you can fix it by lathering on more complexity. What you have > to do is explain why we should do that, when there is a simpler and > faster approach that doesn't introduce the problem in the first > place. It doesn't add any complexity (at least not to the kernel). A simple approach might be that a failed kernel module cannot be loaded with modprobe anymore, this sort of policy can be done in userspace. > I take it that the points you didn't reply to are points that you > agree with? (The main point being, that we both advocate a simple, > two-method interface for module load/unload.) Basically yes, it's just that your initial RFC was more confusing than helpful. bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:55 ` Roman Zippel @ 2002-09-13 16:09 ` Daniel Phillips 0 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 16:09 UTC (permalink / raw) To: Roman Zippel; +Cc: Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 17:55, Roman Zippel wrote: > Hi, > > On Fri, 13 Sep 2002, Daniel Phillips wrote: > > > > The exit itself can fail as well, so it has to be done by the module code > > > anyway (until it suceeds). > > > > That's debatable. Arguably, a failed ->module_cleanup() should be > > retried on every rmmod -a, but expecting module.c to just keep > > retrying mindlessly on its own sounds too much like a busy wait. > > That's not what I meant, if module_init fails the module goes directly to > the cleanup state and the module code calls module_exit. Depending on this > return value it continues to the exit state. What I don't like about that is, module_exit doesn't know the exact state module_cleanup was in when it encountered the error. Look at how error cleanup is done normally: a bunch of gotos that jump into a sequence that releases resources in reverse order to the way they were allocated, so that each initialization state has a corresponding error cleanup state. How are you going to recover that level of precision if the error cleanup is in a separate function? > Further exit attempts (if necessary) are done on user request. Yep, another point nailed down. I'm keeping a list ;-) > > > What DoS opportunities are there? > > > > Suppose the module exit relies on synchronize_kernel. The attacker > > can force repeated synchronize_kernels, knowing that module.c will > > mindlessly do a synchronize_kernel every time a module init fails, > > whether needed or not. Each synchronize_kernel takes an unbounded > > amount of time to complete, across which module.c holds a lock. > > This can't happen: > > if (hook) { > hook = NULL; > synchronize(); > } Ah, this relys on synchronize_kernel only being called when necessary. In general, that can only be known by the module itself. This is one more point we agree on, and which differs from Rusty's proposal. > > > Module init failure is the exception > > > case and usally needs further attention, so we could actually disable > > > further attempts to load this module, unless the user tells us > > > specifically so. > > > > Sure, you can fix it by lathering on more complexity. What you have > > to do is explain why we should do that, when there is a simpler and > > faster approach that doesn't introduce the problem in the first > > place. > > It doesn't add any complexity (at least not to the kernel). A simple > approach might be that a failed kernel module cannot be loaded with > modprobe anymore, this sort of policy can be done in userspace. User space complexity is an issue too, if there is an approach that avoids complexity in both the kernel and user space. > > I take it that the points you didn't reply to are points that you > > agree with? (The main point being, that we both advocate a simple, > > two-method interface for module load/unload.) > > Basically yes, it's just that your initial RFC was more confusing than > helpful. Humble apologies. I will rewrite it now that I've had some practice at explaining the points. The second try should be much more concise and backed with more actual code. Some of which I will shamelessly lift from you and Rusty ;-) -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:30 ` Daniel Phillips 2002-09-13 15:55 ` Roman Zippel @ 2002-09-13 16:39 ` Thunder from the hill 2002-09-13 17:12 ` Daniel Phillips 2002-09-16 0:24 ` Bill Davidsen 2 siblings, 1 reply; 90+ messages in thread From: Thunder from the hill @ 2002-09-13 16:39 UTC (permalink / raw) To: Daniel Phillips Cc: Roman Zippel, Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Fri, 13 Sep 2002, Daniel Phillips wrote: > That's debatable. Arguably, a failed ->module_cleanup() should be > retried on every rmmod -a, but expecting module.c to just keep > retrying mindlessly on its own sounds too much like a busy wait. Hmmm. You might as well give it back to the user. rmmod: remove failed: do it again! So the cleanup code could as well just do it on its own. > > Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well. > > That is sloppy. Different discussion. What should kfree do in your opinion? BUG()? doodle.c:12: attempted to free NULL pointer, as you know it already is. > I take it that the points you didn't reply to are points that you > agree with? (The main point being, that we both advocate a simple, > two-method interface for module load/unload.) You could even do it using three methods. Thunder -- --./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .- --/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..- .- -/---/--/---/.-./.-./---/.--/.-.-.- --./.-/-.../.-./.././.-../.-.-.- ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 16:39 ` Thunder from the hill @ 2002-09-13 17:12 ` Daniel Phillips 0 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-13 17:12 UTC (permalink / raw) To: Thunder from the hill Cc: Roman Zippel, Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel On Friday 13 September 2002 18:39, Thunder from the hill wrote: > Hi, > > On Fri, 13 Sep 2002, Daniel Phillips wrote: > > That's debatable. Arguably, a failed ->module_cleanup() should be > > retried on every rmmod -a, but expecting module.c to just keep > > retrying mindlessly on its own sounds too much like a busy wait. > > Hmmm. You might as well give it back to the user. > > rmmod: remove failed: do it again! That's what happens now. We can certainly improve the error message, and actually, that just falls out, from properly adding an error return code to ->cleanup_module() > So the cleanup code could as well just do it on its own. ??? > > > Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well. > > > > That is sloppy. Different discussion. > > What should kfree do in your opinion? BUG()? Yuppers: static inline void kfree_test(void *object) { if (object) kfree(object); } #define kfree_sloppy kfree_test s/kfree/kfree_sloppy/g (But see "different discussion" above.) > doodle.c:12: attempted to free NULL pointer, as you know it already is. Um. You know it's NULL and you freed it anyway? (But see "different discussion" above.) > > I take it that the points you didn't reply to are points that you > > agree with? (The main point being, that we both advocate a simple, > > two-method interface for module load/unload.) > > You could even do it using three methods. Yes, or two, my favorite. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 15:30 ` Daniel Phillips 2002-09-13 15:55 ` Roman Zippel 2002-09-13 16:39 ` Thunder from the hill @ 2002-09-16 0:24 ` Bill Davidsen 2 siblings, 0 replies; 90+ messages in thread From: Bill Davidsen @ 2002-09-16 0:24 UTC (permalink / raw) To: Daniel Phillips Cc: Roman Zippel, Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel On Fri, 13 Sep 2002, Daniel Phillips wrote: > > What DoS opportunities are there? > > Suppose the module exit relies on synchronize_kernel. The attacker > can force repeated synchronize_kernels, knowing that module.c will > mindlessly do a synchronize_kernel every time a module init fails, > whether needed or not. Each synchronize_kernel takes an unbounded > amount of time to complete, across which module.c holds a lock. That's a good argument WRT the kernel autoloader, but not for manual load by root. The system should not refuse to attempt an operation about which root may have additional information (odd hardware or the like). We don't want to have the kernel DOS itself, but root should be allowed to at least try. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-13 10:35 ` Roman Zippel 2002-09-13 13:53 ` Daniel Phillips @ 2002-09-16 1:49 ` Rusty Russell 2002-09-16 21:36 ` Roman Zippel 1 sibling, 1 reply; 90+ messages in thread From: Rusty Russell @ 2002-09-16 1:49 UTC (permalink / raw) To: Roman Zippel; +Cc: Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel In message <Pine.LNX.4.44.0209131131210.28515-100000@serv> you write: > > I don't think of either of them as core, they are two problems. > > Your init problem is easily solvable for try_inc_mod_count() users, just > add a check for MOD_INITIALIZING, so it will fail during module init. On > the other hand forcing everyone to use try_inc_mod_count is a serious > problem. Well, yes. I think Daniel, yourself and I all share a dislike for try_inc_mod_count() everywhere, but it *is* one solution. > The exit function should always be called after the init function (even if > it failed, I don't do it in the patch, that's a bug). The fs init/exit > would like this then: I disagree with this, but really, it's a detail. > > I disagree with pushing the count into the filesystem structure: it > > changes nothing except hide the fact that you're only keeping > > reference counts for the sake of the module. Filesystems are low > > performance impact, but other subsystems really don't want that atomic > > inc and dec for every access. > > As I already said before, it doesn't has to be an atomic reference count. Please explain? It has to be atomic for all the interesting cases (sure, fs mounting might be protected by a lock, but other things aren't). > > if (down_interruptible(&module_mutex)) > > return -EINTR; > > I really don't like that the global module lock is held during calls to > the driver, e.g. it makes it impossible to request further modules during > module load. Yes, this is a problem. It's a little icky to fix though, since you must disallow the *same* module being loaded. I can certainly do it: well spotted. > > This works better for my in ip_conntrack, since ideally the usage > > count is a count of the number of packets we're tracking, which is > > under control of the outside world, so I wanted an "rmmod -f". > > The ip_conntrack problem is a variation of the LSM problem and both are a > problem of the driver not of the module code. > Basically you have to wait long enough until no new package can call to > the module. This means after removing the hooks, you have to check how > much packages are busy and wait for them to be processed. Yes, which means an atomic reference count somewhere. At the moment I spin inside the cleanup() routine (not nice at all). Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-16 1:49 ` Rusty Russell @ 2002-09-16 21:36 ` Roman Zippel 2002-09-16 21:48 ` Daniel Phillips 0 siblings, 1 reply; 90+ messages in thread From: Roman Zippel @ 2002-09-16 21:36 UTC (permalink / raw) To: Rusty Russell; +Cc: Jamie Lokier, Alexander Viro, Daniel Phillips, linux-kernel Hi, On Mon, 16 Sep 2002, Rusty Russell wrote: > > Your init problem is easily solvable for try_inc_mod_count() users, just > > add a check for MOD_INITIALIZING, so it will fail during module init. On > > the other hand forcing everyone to use try_inc_mod_count is a serious > > problem. > > Well, yes. I think Daniel, yourself and I all share a dislike for > try_inc_mod_count() everywhere, but it *is* one solution. And it's an overly complex one, with my patch I demonstrated, how it could be removed and made simpler. > > The exit function should always be called after the init function (even if > > it failed, I don't do it in the patch, that's a bug). The fs init/exit > > would like this then: > > I disagree with this, but really, it's a detail. I wouldn't call it a detail, it's an important part of the complete module load/unload mechanism. > > > I disagree with pushing the count into the filesystem structure: it > > > changes nothing except hide the fact that you're only keeping > > > reference counts for the sake of the module. Filesystems are low > > > performance impact, but other subsystems really don't want that atomic > > > inc and dec for every access. > > > > As I already said before, it doesn't has to be an atomic reference count. > > Please explain? It has to be atomic for all the interesting cases > (sure, fs mounting might be protected by a lock, but other things aren't). Let me explain it in a larger context. You and Daniel are making it really more complex than necessary. Only the module itself can really answer the question whether it's safe to unload or not. So all the module code needs to do is some general module management and ask the module somehow, whether it's safe to unload, when the user requests it. The easiest way for modules to check for this is to use counters, it's very simple and covers the majority of modules. The few modules that don't want/can't use counters can use whatever they want and they usually know better how to synchronize with the part of the kernel where they installed their hooks into, without disturbing too much other parts of the kernel. I really don't like the synchronize calls as general mechanism, either they have to to do too much and possibly disturb other parts without good reason or modules have to take care of it (e.g. don't call somehow schedule() without extra protection). This makes the whole synchronize mechanism very fragile, which I'd prefer to keep it in the modules which really need it, where it can be better controlled. > > The ip_conntrack problem is a variation of the LSM problem and both are a > > problem of the driver not of the module code. > > Basically you have to wait long enough until no new package can call to > > the module. This means after removing the hooks, you have to check how > > much packages are busy and wait for them to be processed. > > Yes, which means an atomic reference count somewhere. At the moment I > spin inside the cleanup() routine (not nice at all). Packets don't come out of nowhere. I'm not familiar with the locking there, but it should be possible to use a cheap nonatomic count, which is read only accessible outside the normal locking. bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-16 21:36 ` Roman Zippel @ 2002-09-16 21:48 ` Daniel Phillips 2002-09-16 22:44 ` Roman Zippel 0 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-16 21:48 UTC (permalink / raw) To: Roman Zippel, Rusty Russell; +Cc: Jamie Lokier, Alexander Viro, linux-kernel On Monday 16 September 2002 23:36, Roman Zippel wrote: > > > > I disagree with pushing the count into the filesystem structure: it > > > > changes nothing except hide the fact that you're only keeping > > > > reference counts for the sake of the module. Filesystems are low > > > > performance impact, but other subsystems really don't want that atomic > > > > inc and dec for every access. > > > > > > As I already said before, it doesn't has to be an atomic reference count. > > > > Please explain? It has to be atomic for all the interesting cases > > (sure, fs mounting might be protected by a lock, but other things aren't). > > Let me explain it in a larger context. You and Daniel are making it really > more complex than necessary. Only the module itself can really answer the > question whether it's safe to unload or not. Excuse me, Roman, but that's the central thesis of my [rfc]. If I didn't express it concisely enough to make that obvious, I apologize. > So all the module code > needs to do is some general module management and ask the module somehow, > whether it's safe to unload, when the user requests it. The easiest way > for modules to check for this is to use counters, it's very simple and > covers the majority of modules. Err, check. In the [rfc]. > The few modules that don't want/can't use counters can use whatever they > want and they usually know better how to synchronize with the part of the > kernel where they installed their hooks into, without disturbing too much > other parts of the kernel. Check. In the [rfc]. > I really don't like the synchronize calls as general mechanism, either > they have to to do too much and possibly disturb other parts without good > reason Check. In the [rfc]. > or modules have to take care of it (e.g. don't call somehow > schedule() without extra protection). This makes the whole synchronize > mechanism very fragile, which I'd prefer to keep it in the modules which > really need it, where it can be better controlled. Yuppers. Are there any points at all we disagree on? -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [RFC] Raceless module interface 2002-09-16 21:48 ` Daniel Phillips @ 2002-09-16 22:44 ` Roman Zippel 0 siblings, 0 replies; 90+ messages in thread From: Roman Zippel @ 2002-09-16 22:44 UTC (permalink / raw) To: Daniel Phillips; +Cc: Rusty Russell, Jamie Lokier, Alexander Viro, linux-kernel Hi, On Mon, 16 Sep 2002, Daniel Phillips wrote: > > Let me explain it in a larger context. You and Daniel are making it really > > more complex than necessary. Only the module itself can really answer the > > question whether it's safe to unload or not. > > Excuse me, Roman, but that's the central thesis of my [rfc]. If I didn't > express it concisely enough to make that obvious, I apologize. Sorry, but even your later mails are not very concise, I rather wait for a new rfc (preferably with patch) before commenting on it. :-) bye, Roman ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 20:06 ` Daniel Phillips 2002-09-10 0:44 ` Jamie Lokier @ 2002-09-11 15:28 ` Bill Davidsen 2002-09-11 19:36 ` Daniel Phillips 1 sibling, 1 reply; 90+ messages in thread From: Bill Davidsen @ 2002-09-11 15:28 UTC (permalink / raw) To: Daniel Phillips; +Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel On Mon, 9 Sep 2002, Daniel Phillips wrote: > > The expected behaviour is as it has always been: rmmod fails if anyone > > is using the module, and succeeds if nobody is using the module. The > > garbage collection of modules is done using "rmmod -a" periodically, as > > it always has been. I can't disagree with any of that, but the idea of releasing resources when they are not in use and preventing new use of the resource from the time of the release request is hardly new to UNIX. It's exactly the way filespace is treated when a file is unlinked while open. > Al's analysis is all focussed on the filesystem case, where you can > prove assertions about whether the subsystem defined by the module is > active or not. This doesn't cover the whole range of module applications. Which if true doesn't preclude solving the problems it does address. > There is a significant class of module types that must rely on sheduling > techniques to prove inactivity. My suggestion covers both, Al has his > blinders on. Clearly there are people who don't agree, and you don't help by insulting them. I'd like a single solution, but it looks as if Al's idea will work for filesystems, and it's relatively simple. > Designs are not always correct just because they work. Solutions which work are most certainly correct, and no degree of elegance will make a non-working solution correct for any definition of correct I use. I think you mean "optimal" here, and clearly an optimal solution is simple, reliable, efficient, general, and easy to code and understand. Anything less is subject to improvement, and that certainly applies to module removal ;-) -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-11 15:28 ` Question about pseudo filesystems Bill Davidsen @ 2002-09-11 19:36 ` Daniel Phillips 0 siblings, 0 replies; 90+ messages in thread From: Daniel Phillips @ 2002-09-11 19:36 UTC (permalink / raw) To: Bill Davidsen; +Cc: Jamie Lokier, Alexander Viro, Rusty Russell, linux-kernel On Wednesday 11 September 2002 17:28, Bill Davidsen wrote: > On Mon, 9 Sep 2002, Daniel Phillips wrote: > > > > The expected behaviour is as it has always been: rmmod fails if anyone > > > is using the module, and succeeds if nobody is using the module. The > > > garbage collection of modules is done using "rmmod -a" periodically, as > > > it always has been. > > I can't disagree with any of that, but the idea of releasing resources > when they are not in use and preventing new use of the resource from the > time of the release request is hardly new to UNIX. It's exactly the way > filespace is treated when a file is unlinked while open. It's a little different for modules. For one thing, modules are sometimes used like switches: to make the system behave differently, plug in some module, to make it stop doing that, remove the module again. The user wants those transitions to happen *now*, like a switch. Sure, we can make various guru arguments about why the user is wrong, but in the end, the user is never wrong, in the customer sense. Anyway, this problem is under control now and the nice solution also happens to be the simplest, so what more could you ask for. > > Al's analysis is all focussed on the filesystem case, where you can > > prove assertions about whether the subsystem defined by the module is > > active or not. This doesn't cover the whole range of module applications. > > Which if true doesn't preclude solving the problems it does address. Oh absolutely. Please see my recent [RFC] Raceless module interface. <flamebait>Not to say that I was unable to improve Al's code.</flamebait> > > There is a significant class of module types that must rely on sheduling > > techniques to prove inactivity. My suggestion covers both, Al has his > > blinders on. > > Clearly there are people who don't agree, and you don't help by insulting > them. I wish I could agree with you about that, but Al is a special case. I think he actually *likes* it, like a Pit Bull likes fighting. He certainly became easier to deal with after I fell into the habit of returning each insult promptly and unambiguously. Note: Al is the one person on the list that I deal with this way, and he worked really hard to get there. > I'd like a single solution, but it looks as if Al's idea will work > for filesystems, and it's relatively simple. And what do you propose to do about the modules it doesn't work for? Sorry, rhetorical question, the solution is laid out in the above [RFC]. > > Designs are not always correct just because they work. > > Solutions which work are most certainly correct, and no degree of elegance > will make a non-working solution correct for any definition of correct I > use. I think you mean "optimal" here, and clearly an optimal solution is > simple, reliable, efficient, general, and easy to code and understand. > Anything less is subject to improvement, and that certainly applies to > module removal ;-) Yes, well please read the [RFC] and pass judgement on whether it's optimal or not. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 19:48 ` Jamie Lokier 2002-09-09 20:06 ` Daniel Phillips @ 2002-09-09 20:12 ` Daniel Phillips 2002-09-09 22:56 ` Jamie Lokier 2002-09-09 20:18 ` Daniel Phillips 2 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-09 20:12 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Monday 09 September 2002 21:48, Jamie Lokier wrote: > The expected behaviour is as it has always been: rmmod fails if anyone > is using the module, and succeeds if nobody is using the module. The > garbage collection of modules is done using "rmmod -a" periodically, as > it always has been. When you say 'rmmod modulename' the module is supposed to be removed, if it can be. That is the user's expectation, and qualifies as 'obviously correct'. Garbage collecting should *not* be the primary mechanism for removing modules, that is what rmmod is for. Neither should a filesystem module magically disappear from the system just because the last mount went away, unless the module writer very specifically desires that. This is where the obfuscating opinion is coming from: Al has come up with an application where he wants the magic disappearing behavior and wants to impose it on the rest of the world, regardless of whether it makes sense. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 20:12 ` Daniel Phillips @ 2002-09-09 22:56 ` Jamie Lokier 2002-09-10 1:39 ` Alexander Viro 0 siblings, 1 reply; 90+ messages in thread From: Jamie Lokier @ 2002-09-09 22:56 UTC (permalink / raw) To: Daniel Phillips; +Cc: Alexander Viro, Rusty Russell, linux-kernel Daniel Phillips wrote: > > The expected behaviour is as it has always been: rmmod fails if anyone > > is using the module, and succeeds if nobody is using the module. The > > garbage collection of modules is done using "rmmod -a" periodically, as > > it always has been. > > When you say 'rmmod modulename' the module is supposed to be removed, if > it can be. That is the user's expectation, and qualifies as 'obviously > correct'. > > Garbage collecting should *not* be the primary mechanism for removing > modules, that is what rmmod is for. Neither should a filesystem module > magically disappear from the system just because the last mount went > away, unless the module writer very specifically desires that. This is > where the obfuscating opinion is coming from: Al has come up with an > application where he wants the magic disappearing behavior and wants > to impose it on the rest of the world, regardless of whether it makes > sense. I think you've misunderstood. The module does _not_ disappear when the last file reference is closed. It's reference count is decremented, that is all. Just the same as if you managed the reference count yourself. You still need rmmod to actually remove the module. -- Jamie ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 22:56 ` Jamie Lokier @ 2002-09-10 1:39 ` Alexander Viro 0 siblings, 0 replies; 90+ messages in thread From: Alexander Viro @ 2002-09-10 1:39 UTC (permalink / raw) To: Jamie Lokier; +Cc: Daniel Phillips, Rusty Russell, linux-kernel On Mon, 9 Sep 2002, Jamie Lokier wrote: > Daniel Phillips wrote: > > > The expected behaviour is as it has always been: rmmod fails if anyone > > > is using the module, and succeeds if nobody is using the module. The > > > garbage collection of modules is done using "rmmod -a" periodically, as > > > it always has been. > > > > When you say 'rmmod modulename' the module is supposed to be removed, if > > it can be. That is the user's expectation, and qualifies as 'obviously > > correct'. > > > > Garbage collecting should *not* be the primary mechanism for removing > > modules, that is what rmmod is for. Neither should a filesystem module > > magically disappear from the system just because the last mount went > > away, unless the module writer very specifically desires that. This is > > where the obfuscating opinion is coming from: Al has come up with an > > application where he wants the magic disappearing behavior and wants > > to impose it on the rest of the world, regardless of whether it makes > > sense. Huh? > I think you've misunderstood. The module does _not_ disappear when the > last file reference is closed. It's reference count is decremented, > that is all. Just the same as if you managed the reference count > yourself. You still need rmmod to actually remove the module. Never let the facts to stand in a way of a rant. Or presume that ability to write implies ability to read, for that matter... ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 19:48 ` Jamie Lokier 2002-09-09 20:06 ` Daniel Phillips 2002-09-09 20:12 ` Daniel Phillips @ 2002-09-09 20:18 ` Daniel Phillips 2002-09-10 6:48 ` Kai Henningsen 2 siblings, 1 reply; 90+ messages in thread From: Daniel Phillips @ 2002-09-09 20:18 UTC (permalink / raw) To: Jamie Lokier; +Cc: Alexander Viro, Rusty Russell, linux-kernel On Monday 09 September 2002 21:48, Jamie Lokier wrote: > The expected behaviour is as it has always been: rmmod fails if anyone > is using the module, and succeeds if nobody is using the module. The > garbage collection of modules is done using "rmmod -a" periodically, as > it always has been. Actually, it would be more useful if I stated the following simple fact: Returning a flag from __exit definitively gets rid of one race, that is the race where a module's memory can be freed while __exit is active. To get rid of this race by other means you have to put in place some fancy mechanism. This alone should be enough reason to do it the way I suggest, besides the fact that it is a simpler, more obvious and more robust interface. -- Daniel ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: Question about pseudo filesystems 2002-09-09 20:18 ` Daniel Phillips @ 2002-09-10 6:48 ` Kai Henningsen 0 siblings, 0 replies; 90+ messages in thread From: Kai Henningsen @ 2002-09-10 6:48 UTC (permalink / raw) To: linux-kernel phillips@arcor.de (Daniel Phillips) wrote on 09.09.02 in <E17oUzP-0006tu-00@starship>: > On Monday 09 September 2002 21:48, Jamie Lokier wrote: > > The expected behaviour is as it has always been: rmmod fails if anyone > > is using the module, and succeeds if nobody is using the module. The > > garbage collection of modules is done using "rmmod -a" periodically, as > > it always has been. > > Actually, it would be more useful if I stated the following simple fact: > Returning a flag from __exit definitively gets rid of one race, that is > the race where a module's memory can be freed while __exit is active. > > To get rid of this race by other means you have to put in place some > fancy mechanism. This alone should be enough reason to do it the way > I suggest, besides the fact that it is a simpler, more obvious and more > robust interface. I just love the way you propose insanely hard-to-get-right, hard-to- understand, and complicated interfaces with known broken cases you admit to, to replace simple, obviously correct and race-free mechanisms, all with a supposed goal to make things simpler and safer. Is there some fancy word to describe something that looks like irony, sarcasm, or satire, except for the fact it's actually meant entirely seriously? MfG Kai ^ permalink raw reply [flat|nested] 90+ messages in thread
end of thread, other threads:[~2002-10-01 14:22 UTC | newest] Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2002-09-04 18:02 Question about pseudo filesystems Jamie Lokier 2002-09-07 12:00 ` Daniel Phillips 2002-09-07 13:36 ` Alexander Viro 2002-09-07 18:27 ` Jamie Lokier 2002-09-07 19:47 ` Alexander Viro 2002-09-08 2:21 ` Jamie Lokier 2002-09-08 2:43 ` Alexander Viro 2002-09-15 1:41 ` Moving a mount point (was Re: Question about pseudo filesystems) Rob Landley 2002-09-08 16:00 ` Question about pseudo filesystems Daniel Phillips 2002-09-09 19:48 ` Jamie Lokier 2002-09-09 20:06 ` Daniel Phillips 2002-09-10 0:44 ` Jamie Lokier 2002-09-10 1:40 ` Daniel Phillips 2002-09-10 1:56 ` Jamie Lokier 2002-09-10 2:53 ` Daniel Phillips 2002-09-10 3:26 ` Jamie Lokier 2002-09-10 3:47 ` Daniel Phillips 2002-09-10 9:15 ` Daniel Phillips 2002-09-10 10:17 ` Roman Zippel 2002-09-11 18:35 ` [RFC] Raceless module interface Daniel Phillips 2002-09-11 18:53 ` Oliver Neukum 2002-09-11 19:20 ` Daniel Phillips 2002-09-11 20:29 ` Oliver Neukum 2002-09-11 21:15 ` Daniel Phillips 2002-09-11 21:26 ` Jamie Lokier 2002-09-11 21:47 ` Daniel Phillips 2002-09-12 1:42 ` Rusty Russell 2002-09-12 2:09 ` Jamie Lokier 2002-09-12 3:13 ` Rusty Russell 2002-09-12 3:47 ` Daniel Phillips 2002-09-12 3:53 ` Alexander Viro 2002-09-12 4:11 ` Daniel Phillips 2002-09-12 4:40 ` Rusty Russell 2002-09-12 5:27 ` Daniel Phillips 2002-09-12 14:46 ` Gerhard Mack 2002-09-13 0:39 ` Rusty Russell 2002-09-13 2:23 ` Daniel Phillips 2002-09-12 5:35 ` Rusty Russell 2002-09-12 4:52 ` Rusty Russell 2002-09-12 5:58 ` Daniel Phillips 2002-09-12 7:00 ` Rusty Russell 2002-09-13 8:18 ` Helge Hafting 2002-09-12 3:32 ` Daniel Phillips 2002-09-12 1:31 ` Rusty Russell 2002-09-12 9:10 ` Oliver Neukum 2002-09-12 11:27 ` Roman Zippel 2002-09-12 13:03 ` Rusty Russell 2002-09-12 13:44 ` Roman Zippel 2002-09-13 1:30 ` Rusty Russell 2002-09-13 2:19 ` Daniel Phillips 2002-09-13 6:51 ` Rusty Russell 2002-09-13 13:34 ` Daniel Phillips 2002-09-13 13:52 ` Thunder from the hill 2002-09-13 14:09 ` Daniel Phillips 2002-09-13 14:33 ` Thunder from the hill 2002-09-13 14:44 ` Daniel Phillips 2002-09-13 14:59 ` Thunder from the hill 2002-09-13 15:17 ` Daniel Phillips 2002-09-13 15:27 ` Thunder from the hill 2002-09-13 15:37 ` Daniel Phillips 2002-09-16 2:17 ` Rusty Russell 2002-09-16 16:13 ` Daniel Phillips 2002-09-16 16:36 ` Understanding the Principles of Argumentation #3 Daniel Phillips 2002-09-16 16:42 ` Robinson Maureira Castillo 2002-09-16 17:29 ` Cort Dougan 2002-09-16 22:31 ` David Woodhouse 2002-10-01 14:13 ` Daniel Phillips 2002-10-01 14:27 ` David Woodhouse 2002-09-13 15:59 ` [RFC] Raceless module interface Daniel Phillips 2002-09-13 3:14 ` David Gibson 2002-09-13 10:35 ` Roman Zippel 2002-09-13 13:53 ` Daniel Phillips 2002-09-13 15:13 ` Roman Zippel 2002-09-13 15:30 ` Daniel Phillips 2002-09-13 15:55 ` Roman Zippel 2002-09-13 16:09 ` Daniel Phillips 2002-09-13 16:39 ` Thunder from the hill 2002-09-13 17:12 ` Daniel Phillips 2002-09-16 0:24 ` Bill Davidsen 2002-09-16 1:49 ` Rusty Russell 2002-09-16 21:36 ` Roman Zippel 2002-09-16 21:48 ` Daniel Phillips 2002-09-16 22:44 ` Roman Zippel 2002-09-11 15:28 ` Question about pseudo filesystems Bill Davidsen 2002-09-11 19:36 ` Daniel Phillips 2002-09-09 20:12 ` Daniel Phillips 2002-09-09 22:56 ` Jamie Lokier 2002-09-10 1:39 ` Alexander Viro 2002-09-09 20:18 ` Daniel Phillips 2002-09-10 6:48 ` Kai Henningsen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).