Re: Modules with list

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: Modules with list
@ 2002-11-29  8:23 Adam J. Richter
  0 siblings, 0 replies; 21+ messages in thread
From: Adam J. Richter @ 2002-11-29  8:23 UTC (permalink / raw)
  To: rusty; +Cc: davem, linux-kernel, vandrove, zippel

>In message <200211280025.QAA07845@baldur.yggdrasil.com> you write:
>> Rusty Russel wrote:

>That's Russell 8)

>> >Sorry, that's why I said "*If O_NONBLOCK* is specified" (ie. still
>> >drop it for the --wait case).
>> 
>> 	Oops!  Sorry for misreading your message.
>> 
>> 	Even though it was not responsive to what you described, I do
>> hope you see my point about the problem with "rmmod --wait".

>But's it's an absolute requirement, to make modules removable in some
>circumstances, otherwise you end up being starved and the module
>cannot be removed (security modules and netfilter modules strike me as
>the obvious cases, but there are probably more).

	That may be a drawback, but I would necessarily say that
avoiding those drawbacks is an "absolute requirement" based on the
disadvantages you listed, nor would attaching that label magically
eliminate the problem that I described.


Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-28  0:25 Adam J. Richter
@ 2002-11-28 23:53 ` Rusty Russell
  0 siblings, 0 replies; 21+ messages in thread
From: Rusty Russell @ 2002-11-28 23:53 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: linux-kernel, vandrove, zippel, davem

In message <200211280025.QAA07845@baldur.yggdrasil.com> you write:
> Rusty Russel wrote:

That's Russell 8)

> >Sorry, that's why I said "*If O_NONBLOCK* is specified" (ie. still
> >drop it for the --wait case).
> 
> 	Oops!  Sorry for misreading your message.
> 
> 	Even though it was not responsive to what you described, I do
> hope you see my point about the problem with "rmmod --wait".

But's it's an absolute requirement, to make modules removable in some
circumstances, otherwise you end up being starved and the module
cannot be removed (security modules and netfilter modules strike me as
the obvious cases, but there are probably more).

Cheers!
Rusty.
PS.  You're right, net/ipv4/ doesn't use skb->destructor.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-28  1:54 Adam J. Richter
  0 siblings, 0 replies; 21+ messages in thread
From: Adam J. Richter @ 2002-11-28  1:54 UTC (permalink / raw)
  To: kaos; +Cc: linux-kernel

On 2002-11-27 22:42:46 GMT, Keith Owens wrote:
>On Wed, 27 Nov 2002 10:19:02 -0800, 
>"Adam J. Richter" <adam@yggdrasil.com> wrote:
>>       Hmm.  We could certainly have a binary editing tool to remove
>>what was the .exit{,func} sections after the link...
>>       In this scenario, the .exit{,func} section would be linked
>>in and then discarded by the module loader, by the process of the
>>kernel releasing its .init{,data} areas, or, if you wanted to build
>>a bzImage without CONFIG_HOTPLUG, by using a binary editing tool or
>>perhaps an ld script as I mentioned earlier in this response.
>>
>>       The point of the .devexit_p_refs section would just be to
>>set those references to NULL if that was useful.  The kernel module
>>load code would do something like:
>
>You have it back to front.  The real problem is open code that calls
>functions in sections that have been discarded, that code is an oops
>just waiting to happen.  When binutils was changed to detect such
>dangling references, it found a lot of bad code on rarely tested error
>paths.  Your method would stop binutils finding the dangling references
>and open the kernel up to bad code again.

	Currently, for __devexit{,func}, this is only detected on
CONFIG_HOTPLUG=n systems.  Under my scheme we would always build
.devexit.{text,data} sections, so we could actually test this more
widely, although it would require making a binary tool to delete the
relocations at addresses pointed to by the .devexit_p_ptrs entries.
We could have a regession test that would run that tool on every
module, then run an ld script to delete .devexit.{data,text} and see
if there are any dangling references.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-28  0:25 Adam J. Richter
  2002-11-28 23:53 ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-28  0:25 UTC (permalink / raw)
  To: rusty; +Cc: linux-kernel, vandrove, zippel

Rusty Russel wrote:
>In message <200211270722.XAA23313@adam.yggdrasil.com> you write:
>> Rusty Russell wrote:
>> >In message <200211260649.WAA22216@adam.yggdrasil.com> you write:
>> >> >This would only happen if someone says "rmmod --wait".
>> 
>> >As I realized last night after I wrote this, there is a bug in
>> >module.c.  If O_NONBLOCK is specified, we shouldn't drop the module
>> >sempaphore at all, for exactly this reason.  A bug I introduced while
>> >"cleaning up" the "--wait" path.
>> 
>> >Sorry for the confusion.
>> 
>> 	Then if you do "rmmod --wait" on some module that is in use,
>> every lsmod, insmod and rmmod will hang while attempting to acquire

>Sorry, that's why I said "*If O_NONBLOCK* is specified" (ie. still
>drop it for the --wait case).

	Oops!  Sorry for misreading your message.

	Even though it was not responsive to what you described, I do
hope you see my point about the problem with "rmmod --wait".

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-27  7:22 Adam J. Richter
@ 2002-11-27 23:15 ` Rusty Russell
  0 siblings, 0 replies; 21+ messages in thread
From: Rusty Russell @ 2002-11-27 23:15 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: linux-kernel, vandrove, zippel

In message <200211270722.XAA23313@adam.yggdrasil.com> you write:
> Rusty Russell wrote:
> >In message <200211260649.WAA22216@adam.yggdrasil.com> you write:
> >> >This would only happen if someone says "rmmod --wait".
> 
> >As I realized last night after I wrote this, there is a bug in
> >module.c.  If O_NONBLOCK is specified, we shouldn't drop the module
> >sempaphore at all, for exactly this reason.  A bug I introduced while
> >"cleaning up" the "--wait" path.
> 
> >Sorry for the confusion.
> 
> 	Then if you do "rmmod --wait" on some module that is in use,
> every lsmod, insmod and rmmod will hang while attempting to acquire

Sorry, that's why I said "*If O_NONBLOCK* is specified" (ie. still
drop it for the --wait case).

Hope that clarifies,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-27 18:19 Adam J. Richter
@ 2002-11-27 22:42 ` Keith Owens
  0 siblings, 0 replies; 21+ messages in thread
From: Keith Owens @ 2002-11-27 22:42 UTC (permalink / raw)
  To: linux-kernel

On Wed, 27 Nov 2002 10:19:02 -0800, 
"Adam J. Richter" <adam@yggdrasil.com> wrote:
>	Hmm.  We could certainly have a binary editing tool to remove
>what was the .exit{,func} sections after the link...
>	In this scenario, the .exit{,func} section would be linked
>in and then discarded by the module loader, by the process of the
>kernel releasing its .init{,data} areas, or, if you wanted to build
>a bzImage without CONFIG_HOTPLUG, by using a binary editing tool or
>perhaps an ld script as I mentioned earlier in this response.
>
>	The point of the .devexit_p_refs section would just be to
>set those references to NULL if that was useful.  The kernel module
>load code would do something like:

You have it back to front.  The real problem is open code that calls
functions in sections that have been discarded, that code is an oops
just waiting to happen.  When binutils was changed to detect such
dangling references, it found a lot of bad code on rarely tested error
paths.  Your method would stop binutils finding the dangling references
and open the kernel up to bad code again.

__devexit_p tells the build "these functions are known to be omitted at
link time, do not reference them at build time".  IOW, we tell the
kernel that these references are safe, allowing binutils to find all
the other references to discarded code - the bad ones.

Removing the check from binutils is not an option.  The problem was
originally detected on ia64 and mips which use program counter relative
calls with a small number of bits for the offset from current PC.  The
dangling references to functions in discarded sections resulted in
calls to address 0 or to small addresses (actually the offset of the
target function from the start of the discarded section).  Trying to
convert that into PC relative format resulted in an offset which would
not fit in the number of available bits and the linker/insmod flagged a
relocation error.

Bottom line - unsafe references to discarded sections must be detected
by ld/insmod.  Kernel developers have to flag safe references to
discarded sections via __devexit_p().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-27 19:04 Adam J. Richter
  0 siblings, 0 replies; 21+ messages in thread
From: Adam J. Richter @ 2002-11-27 19:04 UTC (permalink / raw)
  To: kai; +Cc: ingo.oeser, linux-kernel, rusty, vandrove, zippel

	I wrote two really buggy examples in my previous reply
to Kai.  First of all, here is a more "correct" version of
vmlinux.lds.S for determining the disposition of devexit at
kernel load time, although I don't know if ld will allow it:

--- linux/arch/i386/vmlinux.lds.S.orig  2002-11-27 09:51:42.000000000 -0800
+++ linux/arch/i386/vmlinux.lds.S       2002-11-27 10:56:20.000000000 -0800
@@ -1,6 +1,8 @@
 /* ld script to make i386 Linux kernel
  * Written by Martin Mares <mj@atrey.karlin.mff.cuni.cz>;
  */
+#include <linux/config.h>
+
 OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
 OUTPUT_ARCH(i386)
 ENTRY(_start)
@@ -12,6 +14,9 @@
   _text = .;                   /* Text and read-only data */
   .text : {
        *(.text)
+#ifdef CONFIG_HOTPLUG
+       *(.devexit.text)
+#endif
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
@@ -38,6 +43,9 @@
   /* writeable */
   .data : {                    /* Data */
        *(.data)
+#ifdef CONFIG_HOTPLUG
+       *(.devexit.data)
+#endif
        CONSTRUCTORS
        }
 
@@ -96,6 +104,13 @@
 
   _end = . ;
 
+#ifndef CONFIG_HOTPLUG
+  .discard : {
+       *(.devexit.text)
+       *(.devexit.data)
+       } = 0x001               /* Like bss: SEC_ALLOC, but no SEC_LOAD */
+#endif
+
   /* Sections to be discarded */
   /DISCARD/ : {
        *(.exit.text)


Also here is a "corrected" bit of untested code for clearing
devexit references at module load time:

        if (!module->removable) {
                void ***pptr = module->devexit_p_start;
                while (pptr != module->devexit_p_end) {
                        **pptr = NULL;
                        pptr++;
                }
        }

Sorry for responding too hastily before.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-27 18:19 Adam J. Richter
  2002-11-27 22:42 ` Keith Owens
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-27 18:19 UTC (permalink / raw)
  To: kai; +Cc: ingo.oeser, linux-kernel, rusty, vandrove, zippel

Kai Germaschewski wrote:
>On Tue, 26 Nov 2002, Adam J. Richter wrote:
>> >No. That means dangling pointers everywhere. Remember dev_exit_p() 
>> >and why it was introduced.
>> 
>> 	First of all, let's understand how small this problem is.
>> dev_exit_p was introduceed to allow a debugging using a feature of
>> newer versions of ld.  It was never essential.  devexit_p() is only
>> relevant to non-hotplug systems.

>That's wrong. [...]

>[...] It's not a debugging feature of ld, it's just
>correct behavior and it's not possible to turn it off.

	Hmm.  We could certainly have a binary editing tool to remove
what was the .exit{,func} sections after the link.  Perhaps we can
even express this as an ld script for linking the core kernel.  We
could put .exit{,func} at the end of the kernel and turn it into
section like bss (SHT_NOBITS?) that has valid addresses but no
contents.  I haven't tried this, so I don't know if ld would
barf when asked to throw away data this way, but anyhow here
is an untried diff just to illustrate:

--- linux/arch/i386/vmlinux.lds.orig    2002-11-27 09:51:42.000000000 -0800
+++ linux/arch/i386/vmlinux.lds.S       2002-11-27 09:54:17.000000000 -0800
@@ -97,11 +97,11 @@
   _end = . ;
 
   /* Sections to be discarded */
-  /DISCARD/ : {
+  .discard : {
        *(.exit.text)
        *(.exit.data)
        *(.exitcall.exit)
-       }
+       } = 0x0001              /* Like bss: SEC_ALLOC, but no SEC_LOAD */
 
   /* Stabs debugging sections.  */
   .stab 0 : { *(.stab) }



[...]
>> 	It's probably overkill, but, in the longer term, we could
>> eliminate CONFIG_HOTPLUG, have .devexit{,data} sections, and build yet
>> another ELF section listing the devexit_p pointers, by defining
>> devexit_p like so.
>> 
>> #define devexit_p(symbol)	( \
>> 			asm (".pushsection .devexit_p_refs\n"	\
>> 			     ".long " #symbol "\n"		\
>> 			     ".popsection\n");			\
>> 			symbol )

>Again, that doesn't fix the problem that we have a reference to a symbol 
>which doesn't exist since we just discarded it.

	In this scenario, the .exit{,func} section would be linked
in and then discarded by the module loader, by the process of the
kernel releasing its .init{,data} areas, or, if you wanted to build
a bzImage without CONFIG_HOTPLUG, by using a binary editing tool or
perhaps an ld script as I mentioned earlier in this response.

	The point of the .devexit_p_refs section would just be to
set those references to NULL if that was useful.  The kernel module
load code would do something like:

	if (!module->removable) {
		void **pptr = module->devexit_p_start;
		while (pptr != module->devexit_p_end) {
			*pptr = NULL;
			pptr++;
		}
	}

>One way to fix this is to make my_remove a weak symbol, so that the linker
>just silently puts in NULL when it's not defined elsewhere,

	I don't believe that will work, since the "weak" facility
is for resolving external symbols.  For example, as barfs when I try
the following:

	.weak foo
foo=0
	.section .exit
foo:


[...]
>Another way to fix this is of course to always have CONFIG_HOTPLUG=y,
>which may become necessary anyway when properly shutting down devices at
>shutdown/reboot.

Going off topic here:

	Most devices do not need custom reset code because the parent
bus's reset operation will reset them.  Those few devices that need
custom reset code can register a device_driver->shutdown instead of
or in addition to device_driver->remove, regardless of CONFIG_HOTPLUG.

	Calling every device's remove function wastes time, especially
since remove functions have to block until all IO's terminate in one
form or another so that the system can reuse things like USB device
ID's in future, something completely unnecessary if a bus level reset
and system reboot are about to be done.  More importantly, calling
every driver's ->remove() function means decreasing reliability,
because one reason for rebooting is that a driver is confused, and
each driver's ->remove() function does a fair amount of house keeping.
So, it would become much more common to be unable to do a remote soft
reboot to recover from a confused device driver.


Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-26 16:18 Adam J. Richter
@ 2002-11-27 17:05 ` Kai Germaschewski
  0 siblings, 0 replies; 21+ messages in thread
From: Kai Germaschewski @ 2002-11-27 17:05 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: ingo.oeser, linux-kernel, rusty, vandrove, zippel

On Tue, 26 Nov 2002, Adam J. Richter wrote:

> >No. That means dangling pointers everywhere. Remember dev_exit_p() 
> >and why it was introduced.
> 
> 	First of all, let's understand how small this problem is.
> dev_exit_p was introduceed to allow a debugging using a feature of
> newer versions of ld.  It was never essential.  devexit_p() is only
> relevant to non-hotplug systems.

That's wrong. It's essential, since the following happens:

static void __devexit my_remove()
{
}

static struct pci_driver my_driver = {
	.name		= "my",
	.probe		= my_probe,
	.remove		= __devexit_p(my_remove),
};

For CONFIG_HOTPLUG set, __devexit == "" and __devexit_p(x) == x,
so everything's obviously fine.

For CONFIG_HOTPLUT not set, my_remove gets discarded from vmlinux, so we 
cannot set my_driver.remove = my_remove, since that symbol doesn't even 
exist anymore. Older binutils quietly accepted this, but recent ones
correctly barf. That's why in that case __devexit_p(x) == NULL, so
my_remove isn't referenced. It's not a debugging feature of ld, it's just
correct behavior and it's not possible to turn it off.

> 	devexit_p() is currently is used only for static
> initialization of driver->remove(), although I suspect that that ld
> debugging feature has probably exposed some bugs that have been fixed.
> So, in practice, all of the dangling pointers that it currently
> avoids could be avoided by adding a few lines in places like
> driver_register():
> 
> #ifndef CONFIG_HOTPLUG
> 	if (!driver->module->removable)
> 		driver->remove = invoke_bug;
> #endif

Since in that case driver->remove == NULL, that gives you a nice oops
when trying to invoke it, no need for an explicit invoke_bug ;)

> 	It's probably overkill, but, in the longer term, we could
> eliminate CONFIG_HOTPLUG, have .devexit{,data} sections, and build yet
> another ELF section listing the devexit_p pointers, by defining
> devexit_p like so.
> 
> #define devexit_p(symbol)	( \
> 			asm (".pushsection .devexit_p_refs\n"	\
> 			     ".long " #symbol "\n"		\
> 			     ".popsection\n");			\
> 			symbol )

Again, that doesn't fix the problem that we have a reference to a symbol 
which doesn't exist since we just discarded it.

One way to fix this is to make my_remove a weak symbol, so that the linker
just silently puts in NULL when it's not defined elsewhere, which would
avoid the ugly __devexit_p(). Not sure if that doesn't end up in even 
uglier hacks, though.

Another way to fix this is of course to always have CONFIG_HOTPLUG=y,
which may become necessary anyway when properly shutting down devices at
shutdown/reboot.

--Kai

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-27  7:22 Adam J. Richter
  2002-11-27 23:15 ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-27  7:22 UTC (permalink / raw)
  To: rusty; +Cc: linux-kernel, vandrove, zippel

Rusty Russell wrote:
>In message <200211260649.WAA22216@adam.yggdrasil.com> you write:
>> >This would only happen if someone says "rmmod --wait".

>As I realized last night after I wrote this, there is a bug in
>module.c.  If O_NONBLOCK is specified, we shouldn't drop the module
>sempaphore at all, for exactly this reason.  A bug I introduced while
>"cleaning up" the "--wait" path.

>Sorry for the confusion.

	Then if you do "rmmod --wait" on some module that is in use,
every lsmod, insmod and rmmod will hang while attempting to acquire
module_mutex until the reference count on the module that you're
waiting to remove drops to zero, and there is no guarantee that that
will ever happen.  Some program might have to decide to close a file
descriptor or unmount a file system.  I think what you want is to
release the mutex before blocking and then reacquire it when you
continue.  Of course, you'll again have the scenario that I described.

	However, if you get rid of this idea of a blocking rmmod or
change it so that it does not make the module as "dead" if it is going
to block, then your idea of locking should work in the cases where
try_get_module() failure is handled by doing a request_module() and
retrying (because the insmod that request_module causes will acquire
module_mutex, so it will block until that module unload that was
causing the problem completes).

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-27  7:01 Adam J. Richter
  0 siblings, 0 replies; 21+ messages in thread
From: Adam J. Richter @ 2002-11-27  7:01 UTC (permalink / raw)
  To: rusty; +Cc: linux-kernel, vandrove, zippel

Rusty Russell writes:
>In message <200211260649.WAA22216@adam.yggdrasil.com> you write:
>> >TCP for example, sets the destructor function for the skb.  It can be
>> >called an arbitrary time later.  Netfilter modules do a similar thing,
>> >for similar reasons.  You'd better grab a reference to *something*.
>> 
>> 	The ->remove() function of a network device driver will
>> not return until it has freed all receive skb's that it allocated
>> and all transmit skb's that were passed to its transmit function.

>I'm not talking about a device driver, but modularizing the IPv4
>stack.

	I don't see skb->destructor being set in net/ipv4 (although I
see it in other net/ subdirectories).  Anyhow, I don't see why ipv4
would need to increment or decrement a module reference count every
time a packet is sent or received.  It should suffice to do so when a
file descriptor is opened or closed and when a network connection is
created or completely forgotten (if that does not necessarily happen
before the close system call returns).

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-26  6:49 Adam J. Richter
@ 2002-11-26 22:40 ` Rusty Russell
  0 siblings, 0 replies; 21+ messages in thread
From: Rusty Russell @ 2002-11-26 22:40 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: linux-kernel, vandrove, zippel

In message <200211260649.WAA22216@adam.yggdrasil.com> you write:
> >TCP for example, sets the destructor function for the skb.  It can be
> >called an arbitrary time later.  Netfilter modules do a similar thing,
> >for similar reasons.  You'd better grab a reference to *something*.
> 
> 	The ->remove() function of a network device driver will
> not return until it has freed all receive skb's that it allocated
> and all transmit skb's that were passed to its transmit function.

I'm not talking about a device driver, but modularizing the IPv4
stack.

> >This would only happen if someone says "rmmod --wait".

As I realized last night after I wrote this, there is a bug in
module.c.  If O_NONBLOCK is specified, we shouldn't drop the module
sempaphore at all, for exactly this reason.  A bug I introduced while
"cleaning up" the "--wait" path.

Sorry for the confusion.

> 	I think of people would consider it to be progress to have
> non-removable modular IP, which what I run, but I digress again.

Non-removable is easy.  It's now possible to do removable IPv4 and
IPv6, which was kind of the point of the exercise.

Hope that helps,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-26 16:18 Adam J. Richter
  2002-11-27 17:05 ` Kai Germaschewski
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-26 16:18 UTC (permalink / raw)
  To: ingo.oeser; +Cc: linux-kernel, rusty, vandrove, zippel

>>> = Adam Richter
>>  = Rusty Russell
>   = Ingo Oeser

>>>     5. At modprobe time, being able to decide to load a module
>>>        as non-removable to avoid loading .exit{,data} for a smaller
>>>        kernel footprint.  This might only require insmod changes
>>>        for the user level insmod.
>> Hmm, I already discard these if !CONFIG_MODULE_UNLOAD, but it'd be a
>> cute hack to let the user do this.
> 
>No. That means dangling pointers everywhere. Remember dev_exit_p() 
>and why it was introduced.

	First of all, let's understand how small this problem is.
dev_exit_p was introduceed to allow a debugging using a feature of
newer versions of ld.  It was never essential.  devexit_p() is only
relevant to non-hotplug systems.

	devexit_p() is currently is used only for static
initialization of driver->remove(), although I suspect that that ld
debugging feature has probably exposed some bugs that have been fixed.
So, in practice, all of the dangling pointers that it currently
avoids could be avoided by adding a few lines in places like
driver_register():

#ifndef CONFIG_HOTPLUG
	if (!driver->module->removable)
		driver->remove = invoke_bug;
#endif

	It's probably overkill, but, in the longer term, we could
eliminate CONFIG_HOTPLUG, have .devexit{,data} sections, and build yet
another ELF section listing the devexit_p pointers, by defining
devexit_p like so.

#define devexit_p(symbol)	( \
			asm (".pushsection .devexit_p_refs\n"	\
			     ".long " #symbol "\n"		\
			     ".popsection\n");			\
			symbol )


	Then "--permanent --disable-hotplug" could be selected
at modprobe time and at boot time (the .devexit{,func} sections
would be loaded just before .init{,func}, and could be dropped or
kept).  modprobe could then clear all devexit_p references if we
really wanted to bother. and the kernel could do the same for
its built-in objects.  We could also have a binary utility to
scrape out the hotplug support from a .o file based on the
contents of the .devexit_p_refs section if we wanted to use
that ld debugging feature or for smaller .o and bzImage file
sizes.

	If you want to be even fancier, we could have separate
CONFIG_PCI_HOTPLUG, CONFIG_USB_HOTPLUG, and pci_devexit_p(),
usb_devexit_p(), a .pci_devexit_p section, a .usb_devexit_p
section, and so on for each pluggable bus type.

	Would this be overkill?  CONFIG_HOTPLUG currently only
controls hot plugging for busses where it is wiredly used (USB,
CardBus, but not ordinary PCI cards).  To my knowledge, nobody has
complained about the lack of "!CONFIG_HOTPLUG" for pcmcia, for
example.  An alternative is to drop "!CONFIG_HOTPLUG" support, and
have CardBus and USB always support hotplug.  Eliminating
CONFIG_HOTPLUG would simplify a lot of source code at the expense of
making object files and the kernel footprint slightly larger for users
that would otherwise want to compile it out.

	Anyhow, eliminating "ifdef MODULE" from <linux/init.h> is
not immenent, and having driver_register() clobber driver->remove
for non-removable modules in non-hotplug systems should initially
address the runtime issues (although not the loss of that ld debugging
check).

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-26  0:35 ` Rusty Russell
@ 2002-11-26  7:57   ` Ingo Oeser
  0 siblings, 0 replies; 21+ messages in thread
From: Ingo Oeser @ 2002-11-26  7:57 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Adam J. Richter, vandrove, zippel, linux-kernel

On Tue, Nov 26, 2002 at 11:35:09AM +1100, Rusty Russell wrote:
> In message <200211252211.OAA02085@baldur.yggdrasil.com> you write:
> > 	2. Eventually have the same build command for modules and
> > 	   compiled in objects so that distribution makes can ship an
> > 	   "all modules" build and link script to allow much more
> > 	   customization by users who do not want to recompile kernel code.
> 
> Hmm, I've never really aimed for this, and as you've noticed, there
> are a few issues.
 
Maybe that could be done already by having a list of modules for
initramfs? That's Alans plan anyway, so we might as well solve it
here.

> > 		2c. Eliminate "#ifdef MODULE" init.h, module.h, and,
> > 		    eventually, almost everywhere.
> > 
> > 		2d. In the core kernel, THIS_MODULE would point to
> > 		    a struct module rather than being NULL (eliminating
> > 		    many little banches).
> 
> I thought about doing this, but the branch cost IRL is trivial on
> modern processors with decent branch prediction (since it will almost
> always be the same way).
 
It's not about branch prediction, it's about the branch
instruction and readable code. Most code dependend on MODULE can
be made dependend on CONFIG_MODULE_UNLOAD, because the rest is
common or should be rewritten that way.

> > 	5. At modprobe time, being able to decide to load a module
> > 	   as non-removable to avoid loading .exit{,data} for a smaller
> > 	   kernel footprint.  This might only require insmod changes
> > 	   for the user level insmod.
> Hmm, I already discard these if !CONFIG_MODULE_UNLOAD, but it'd be a
> cute hack to let the user do this.
 
No. That means dangling pointers everywhere. Remember dev_exit_p() 
and why it was introduced.

> > 	10. Move tracking of dependencies among loaded modules to
> > 	    user land (and be able to reconstruct in some cases
> > 	    from modules.dep).
> 
> Personally, I think the userspace module loaders are clearly inferior,
> especially as you're gonna break userspace with almost every one of
> these changes.  Sure, you can use a kernel-specific library to give
> you back the interface flexibility, but why?  You gain complexity and
> your kernel doesn't get any smaller anyway.
> 
> Anyway, I think supporting both doesn't make sense.  Either the
> in-kernel module loader is better, in which case it should be kept, or
> it isn't in which case it should be junked.

At least resolving module name aliases to modules and options
hould be done in user space, because that's critical to auto
configuration and readable configuration of the system.

module_name_deamon anyone?

This resolving is clearly seperateable and might not even require
root privileges and can be done as a special user (passed as
kernel parameter and defaulting to UID 0), because we just need
to read a kind of database.

That reduces buffer overflow attacks and the like.

That resolving I'm really missing from the new scheme.

Regards

Ingo Oeser
-- 
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-26  6:53 Adam J. Richter
  0 siblings, 0 replies; 21+ messages in thread
From: Adam J. Richter @ 2002-11-26  6:53 UTC (permalink / raw)
  To: rusty; +Cc: linux-kernel, vandrove, zippel

I made a little typo:

>        Why would setting skb->destructor attempt to increment the use
>count on a module?  As far as I can tell, it's own incremented in
                                                ^^^
>dev_open().

That should be "[...] it's *only* increment in dev_open()."

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-26  6:49 Adam J. Richter
  2002-11-26 22:40 ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-26  6:49 UTC (permalink / raw)
  To: rusty; +Cc: linux-kernel, vandrove, zippel

>In message <200211260406.UAA22079@adam.yggdrasil.com> you write:
>> Rusty Russell wrote:
>> >In message <200211252211.OAA02085@baldur.yggdrasil.com> you write:
>> [...]
>> >> 	4. failureless raceless module unloading by the module->rwsem_list
>> >> 	   system that I described toward the bottom of this message:
>> >> 	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103773401411324&w=2
>> 
>> >OK, I've read this in detail now.  It's a fine scheme, but the devil
>> >is inside here:
>> 
>> >struct device_driver *
>> >get_driver_by_name(struct bus_type *bus_type, const char *name)
>> >{
>> >	down_read(&bus->rwsem);
>> 
>> >Now, that's perfectly fine for drivers, but try doing that on every
>> >network packet.
>> 
>> 	You're looking up a driver by name or incrementing a module
>> reference count on every network packet?  Where?  Show me, please.

>TCP for example, sets the destructor function for the skb.  It can be
>called an arbitrary time later.  Netfilter modules do a similar thing,
>for similar reasons.  You'd better grab a reference to *something*.

	The ->remove() function of a network device driver will
not return until it has freed all receive skb's that it allocated
and all transmit skb's that were passed to its transmit function.

	Why would setting skb->destructor attempt to increment the use
count on a module?  As far as I can tell, it's own incremented in
dev_open().


[...]
>> 	With your scheme, you really do have unnecessary failures.
>> For example, the system really can tell you that the iso9660
>> filesystem is not found when, in fact, there is a module for it
>> (because you asked at just the wrong moment, when it was being
>> unloaded).

>This would only happen if someone says "rmmod --wait".

	No.  Let's walk through an example of get_fs_type in
fs/filesystems.c, starting at line 229:

struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *fs;

        read_lock(&file_systems_lock);
        fs = *(find_filesystem(name));
        if (fs && !try_inc_mod_count(fs->owner))
                fs = NULL;
        read_unlock(&file_systems_lock);
        if (!fs && (request_module(name) == 0)) {
                read_lock(&file_systems_lock);
                fs = *(find_filesystem(name));
                if (fs && !try_inc_mod_count(fs->owner))
                        fs = NULL;
                read_unlock(&file_systems_lock);
        }
        return fs;
}


A: "mount -t iso9660 /dev/cdrom /mnt".

B: invoked from cron, does "modprobe --remove" to remove stale
B: modules.  isofs is stale right now, so it does
B: sys_delete_module("isofs").
B: sys_delete_module does stop_refcounts(), which returns 0
B: clears mod->live, calls restart_refcounts(), does up(&module_mutex).
B: Has not yet called mod->exit().

A: calls get_fs_type("iso9660") (fs/filesystems.c line 226)
A: read_lock(&file_systems_lock) succeeds
A: finds the iso9660 filesystem entry
A: calls try_inc_mod_count, which fails, sets fs to NULL
A: get_fs_type releases lock, calls request_module("iso9660")
A: iso9660 module load fails because there is already a module by that name
A: (even if module load were to succeed, register_filesystem would fail
A: because there already is a filesystem with that name)
A: Because request_module() failed, get_fs_type returns NULL.
A: The user sees mount fail with "Unknown filesystem type or bad superblock."

	   
[...]

>> >Your code here seems flawed.  You grab the write sem(s) in the remove
>> >code to prevent anyone bumping the refcount.  If someone is recursing
>> >(trying to grab another refcount while already holding one), you
>> >*must* fail them, otherwise you'll deadlock.
>> 
>> 	No, the point is that these rwsem's do not just lock the
>> try_module_get() call.  They lock a slightly larger section of code,
>> so that the other process will block when it attempts to look up the
>> resource containing the module pointer by name.  In your example, the
>> module unload will complete, the other process will then be allowed to
>> continue, and it will discover that whatever driver it was asking
>> about is not currently loaded (for example, the iso9660 filesystem is
>> not found, and it will run modprobe to reload it).

>Process A calls "get_driver_by_name("foo")" successfully.  Module
>reference count now 1.

>Process B calls sys_delete_module: grabs sem in write mode, finds
>refcount not 0, waits because --wait is specified (I assume that's
>what the ... there does?).

	"--wait" is your own addition.  Elminating the race that I
described above is much more important for reliability than your
new "--wait" feature.

	For what it's worth, I think "--wait" could work to some
degree under my scheme by changing the up(&module_mutex) call in
sys_delete_module to release all of the locks and then having it go
back to the top of sys_delete_module after schedule() returns, with no
guarantee that it will not block again because someone else has added
a reference count in the meantime (since try_get_module never fails
under my scheme).

	Anyhow, so under my scheme, sys_delete_module() would fail
if we do not support the "--wait" function.  If we support it as
described in the previous paragraph, sys_delete_module blocks.


>Process A calls "get_driver_by_name("foo")" again.

	Because sys_delete_module released the module locks before
calling schedule(), process A's get_deriver_by_name("foo") call
succeeds.  The reference count for the underlying module is now 2.
The "rmmod --wait" process continues to wait.

[...]
>Yes, filesystems are basically fairly clean already (nothing really
>changed for them).  Filesystems are easy.  Devices are easy.

>Networking is *hard*, which is why Dave and Alexey never merged any
>"modularize ip" patches,

	I think of people would consider it to be progress to have
non-removable modular IP, which what I run, but I digress again.

>and why I keep my own reference counts and
>spin (potentially forever!) in ip_conntrack's cleanup code 8(

	I do not see where any of this code increments a module
reference count except in dev_open.  For that one, I would add an
rwsem that dev_ifsioc would down_read() before calling __dev_get_by_name
and which it would hold through the return of dev_change_flags (which
does the call to dev_open).

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-26  4:06 Adam J. Richter
@ 2002-11-26  5:04 ` Rusty Russell
  0 siblings, 0 replies; 21+ messages in thread
From: Rusty Russell @ 2002-11-26  5:04 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: linux-kernel, vandrove, zippel

In message <200211260406.UAA22079@adam.yggdrasil.com> you write:
> Rusty Russell wrote:
> >In message <200211252211.OAA02085@baldur.yggdrasil.com> you write:
> [...]
> >> 	4. failureless raceless module unloading by the module->rwsem_list
> >> 	   system that I described toward the bottom of this message:
> >> 	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103773401411324&w=2
> 
> >OK, I've read this in detail now.  It's a fine scheme, but the devil
> >is inside here:
> 
> >struct device_driver *
> >get_driver_by_name(struct bus_type *bus_type, const char *name)
> >{
> >	down_read(&bus->rwsem);
> 
> >Now, that's perfectly fine for drivers, but try doing that on every
> >network packet.
> 
> 	You're looking up a driver by name or incrementing a module
> reference count on every network packet?  Where?  Show me, please.

TCP for example, sets the destructor function for the skb.  It can be
called an arbitrary time later.  Netfilter modules do a similar thing,
for similar reasons.  You'd better grab a reference to *something*.

> >No.  What you're missing is that there is a bogolock inside
> >try_module_get().  Think of it as a rwlock.
> 
> 	I don't know exactly what you mean by a bogolock (some
> reference to local_inc or your per-CPU reference counts?).

It's slightly magic (see the module.c code).  Functionally, think of
try_module_get() as having a "read_lock_irqsave(&bogolock, flags)/
read_unlock_irqrestore(&bogolock, flags)" around it, and
"stop_refcounts()" being "write_lock_irq(&bogolock)".

> 	With your scheme, you really do have unnecessary failures.
> For example, the system really can tell you that the iso9660
> filesystem is not found when, in fact, there is a module for it
> (because you asked at just the wrong moment, when it was being
> unloaded).

This would only happen if someone says "rmmod --wait".  In which case,
that's *exactly* the right thing to do.  The admin wanted that module
out of there for a reason.

> [...]
> >> 	4. This kind of race is not really specific to modules, although
> >> they may be the only example that comes up in practice.
> 
> >Your code here seems flawed.  You grab the write sem(s) in the remove
> >code to prevent anyone bumping the refcount.  If someone is recursing
> >(trying to grab another refcount while already holding one), you
> >*must* fail them, otherwise you'll deadlock.
> 
> 	No, the point is that these rwsem's do not just lock the
> try_module_get() call.  They lock a slightly larger section of code,
> so that the other process will block when it attempts to look up the
> resource containing the module pointer by name.  In your example, the
> module unload will complete, the other process will then be allowed to
> continue, and it will discover that whatever driver it was asking
> about is not currently loaded (for example, the iso9660 filesystem is
> not found, and it will run modprobe to reload it).

Process A calls "get_driver_by_name("foo")" successfully.  Module
reference count now 1.

Process B calls sys_delete_module: grabs sem in write mode, finds
refcount not 0, waits because --wait is specified (I assume that's
what the ... there does?).

Process A calls "get_driver_by_name("foo")" again.

Now, maybe process A should never do this.  But opening the same file
twice seems to be a similar case, no?  Maybe I need to see the rest of
the implementation.

> >This works *fine* until the object contains a function pointer to
> >inside a module: at this stage you need to make sure the module
> >doesn't disappear as well before the object is really deleted.
> 
> 	That's why open increments the module reference count,
> currently usually done via get_fops, as I'm sure you know, as I see
> you patched it to use try_module_get recently.  The module will not be
> unloaded and the storage will not be freed until the last file
> descriptor is closed, even if the device has been physically removed
> (hardware specific resources like consistent DMA memory, IRQ's, etc.
> could be released once the ->remove() function returns, but I digress).

We're fervently agreeing here I think, that, this is how it should
work.

> >Roman
> >solved this by refusing to deregister an object which was in use,
> >which isn't a really nice solution IMHO.
> 
> 	Solved what?  What was the problem?

Solved the problem of keeping explicit module reference counts (the
modules themselves did it).

> >They really are separate things, we just usually never bothered
> >refcounting our functions.  The obvious solution is to hold a
> >reference to the module as well, and that is in fact what this
> >solution does.
> 
> >Whatever else, it's conceptually simple.
> 
> 	I really do not understand what problem you're referring to.
> I believe that open() has incremented module reference counts for
> ages.  I'm not talking about eliminating or adding any calls that
> modify module reference counts, just tracking the locks that protect
> (or should protect) them and turning those locks into rw_sem's
> (really, I could have four separate lists for {rw_,}{semaphore,lock},
> but, so far, rw_semaphore seems fine for every case that I've examined).

Yes, filesystems are basically fairly clean already (nothing really
changed for them).  Filesystems are easy.  Devices are easy.

Networking is *hard*, which is why Dave and Alexey never merged any
"modularize ip" patches, and why I keep my own reference counts and
spin (potentially forever!) in ip_conntrack's cleanup code 8(

Hope that clarifies?
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-26  4:06 Adam J. Richter
  2002-11-26  5:04 ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-26  4:06 UTC (permalink / raw)
  To: rusty; +Cc: linux-kernel, vandrove, zippel

Rusty Russell wrote:
>In message <200211252211.OAA02085@baldur.yggdrasil.com> you write:
[...]
>> 	4. failureless raceless module unloading by the module->rwsem_list
>> 	   system that I described toward the bottom of this message:
>> 	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103773401411324&w=2

>OK, I've read this in detail now.  It's a fine scheme, but the devil
>is inside here:

>struct device_driver *
>get_driver_by_name(struct bus_type *bus_type, const char *name)
>{
>	down_read(&bus->rwsem);

>Now, that's perfectly fine for drivers, but try doing that on every
>network packet.

	You're looking up a driver by name or incrementing a module
reference count on every network packet?  Where?  Show me, please.
I find it hard to believe that that is necessary.  As far as I
can tell, nothing calls the only thing that is remotely performance
critical that increments module reference counts are open(2)/close(2),
and they do generally do many heavier operations that grabbing an
rwsem.

[...]

>BTW,

>>	2. try_module_get() introduces new failures that other software
>> has to anticipate.  For example, if I try to mount an ext3 file system
>> and it happens that ext3 was being automatically removed (for lack of
>> use) at this time, the attempt to get the ext3 filesystem can fail
>> without request_module() being called to reload it.

>No.  What you're missing is that there is a bogolock inside
>try_module_get().  Think of it as a rwlock.

	I don't know exactly what you mean by a bogolock (some
reference to local_inc or your per-CPU reference counts?).  However,
what you seem to miss is that the lock has to bracket a little more
code than just the inside of try_module_get.

	With your scheme, you really do have unnecessary failures.
For example, the system really can tell you that the iso9660
filesystem is not found when, in fact, there is a module for it
(because you asked at just the wrong moment, when it was being
unloaded).

[...]
>> 	4. This kind of race is not really specific to modules, although
>> they may be the only example that comes up in practice.

>Your code here seems flawed.  You grab the write sem(s) in the remove
>code to prevent anyone bumping the refcount.  If someone is recursing
>(trying to grab another refcount while already holding one), you
>*must* fail them, otherwise you'll deadlock.

	No, the point is that these rwsem's do not just lock the
try_module_get() call.  They lock a slightly larger section of code,
so that the other process will block when it attempts to look up the
resource containing the module pointer by name.  In your example, the
module unload will complete, the other process will then be allowed to
continue, and it will discover that whatever driver it was asking
about is not currently loaded (for example, the iso9660 filesystem is
not found, and it will run modprobe to reload it).

>>	4. This kind of race is not really specific to modules, although
>> they may be the only example that comes up in practice.

>Actually, it is.  For years, the networking code has used refcounts
>and two-stage delete almost everywhere: it's now well accepted as an
>orthodox method.  So when you "unregister" a socket etc, it removes
>the element from lists, and decs the refcount.  Whoever decs the
>refcount to zero frees it up (it's assumed that it will monotonically
>decrease to zero, since there are no external references any more).
>You know this already, of course.

	Yes.  I talked about it regarding my patch for adding code in
drivers/base/ to kmalloc the fixed block of private memory for device
drivers.  I think it's a good model for any kind of device that a file
descriptor can have open.

>This works *fine* until the object contains a function pointer to
>inside a module: at this stage you need to make sure the module
>doesn't disappear as well before the object is really deleted.

	That's why open increments the module reference count,
currently usually done via get_fops, as I'm sure you know, as I see
you patched it to use try_module_get recently.  The module will not be
unloaded and the storage will not be freed until the last file
descriptor is closed, even if the device has been physically removed
(hardware specific resources like consistent DMA memory, IRQ's, etc.
could be released once the ->remove() function returns, but I digress).

>Roman
>solved this by refusing to deregister an object which was in use,
>which isn't a really nice solution IMHO.

	Solved what?  What was the problem?

>They really are separate things, we just usually never bothered
>refcounting our functions.  The obvious solution is to hold a
>reference to the module as well, and that is in fact what this
>solution does.

>Whatever else, it's conceptually simple.

	I really do not understand what problem you're referring to.
I believe that open() has incremented module reference counts for
ages.  I'm not talking about eliminating or adding any calls that
modify module reference counts, just tracking the locks that protect
(or should protect) them and turning those locks into rw_sem's
(really, I could have four separate lists for {rw_,}{semaphore,lock},
but, so far, rw_semaphore seems fine for every case that I've examined).

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
  2002-11-25 22:11 Adam J. Richter
@ 2002-11-26  0:35 ` Rusty Russell
  2002-11-26  7:57   ` Ingo Oeser
  0 siblings, 1 reply; 21+ messages in thread
From: Rusty Russell @ 2002-11-26  0:35 UTC (permalink / raw)
  To: Adam J. Richter; +Cc: vandrove, zippel, linux-kernel

In message <200211252211.OAA02085@baldur.yggdrasil.com> you write:
> 	Here is a list of changes that I'm thinking about trying to
> make to modules, in case anyone is interested or wants to show me the
> error of my ways.  Most of these changes do not depend on whether the
> module loader is in the kernel or a user level program.  I've labelled
> the items that are only applicable to user level modules with "user
> level version:".
> 
> 	1. Allow multiple MODULE_DEVICE_TABLE's of the same type in the
> 	   same .c file instead of the combined_dev_id_table hack that
> 	   is now used by modules that really need to load separate
> 	   but related drivers.

Makes sense, should be simple.

> 	2. Eventually have the same build command for modules and
> 	   compiled in objects so that distribution makes can ship an
> 	   "all modules" build and link script to allow much more
> 	   customization by users who do not want to recompile kernel code.

Hmm, I've never really aimed for this, and as you've noticed, there
are a few issues.

> 		2a. Compile module_init, subsys_init, etc. by the
> 		    same mechanism used by kernel objects.

Roman has a patch for this, which is quite nice.  (Look for "module
initcall depends" in the misc section on my patches page).  Basically,
Kai (IIRC) pointed out that module initialization are explicitly
ordered by their dependencies already, and voila (patch needs work to
apply).

Turns out the current scheme is "good enough" and noone seemed
interested, so it'll probably hang around until someone comes up with
a problem so horrible this patch is easier.

> 		2b. Pass module parameter by __setup() rather than
> 		    MODULE_PARM().

Well, s/__setup/__module_param_call/ and that seems like a fine plan.

> 		2c. Eliminate "#ifdef MODULE" init.h, module.h, and,
> 		    eventually, almost everywhere.
> 
> 		2d. In the core kernel, THIS_MODULE would point to
> 		    a struct module rather than being NULL (eliminating
> 		    many little banches).

I thought about doing this, but the branch cost IRL is trivial on
modern processors with decent branch prediction (since it will almost
always be the same way).

> 	3. To prevent rmmod's during modprobe, have
> 	    rmmod do flock(/proc/modules, LOCK_EX) and
> 	    modprobe do flock(/proc/modules, LOCK_SH).  Yes, you can
> 	    detect this already, but this way you it does not cause
> 	    failure and you do not need retry code.

Seems like a perfectly reasonable thing to do.

> 	Other wishes that probably do not effect module-init-tools,
> 	at least when the module loader is in the kernel:
> 
> 	4. failureless raceless module unloading by the module->rwsem_list
> 	   system that I described toward the bottom of this message:
> 	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103773401411324&w=2

OK, I've read this in detail now.  It's a fine scheme, but the devil
is inside here:

struct device_driver *
get_driver_by_name(struct bus_type *bus_type, const char *name)
{
	down_read(&bus->rwsem);

Now, that's perfectly fine for drivers, but try doing that on every
network packet.  You can't use a semaphore, and if you introduce a
shared lock, Dave's going to rip your code out again (this is the guy
who wrote brlocks, remember).

BTW,

>	2. try_module_get() introduces new failures that other software
> has to anticipate.  For example, if I try to mount an ext3 file system
> and it happens that ext3 was being automatically removed (for lack of
> use) at this time, the attempt to get the ext3 filesystem can fail
> without request_module() being called to reload it.

No.  What you're missing is that there is a bogolock inside
try_module_get().  Think of it as a rwlock.

The result is that while your scheme has multiple locks held by
different subsystems which are registered with the module code, my
scheme has one lock in the module code registered with the different
subsystems.  The reason for this is simple: any really fast
lock/refcount structure is going to be fairly big (NR_CPUS *
SMP_CACHE_BYTES).  It's also simple, and I don't have to worry about
locking order.

The best bit, of course, is that when a radical new funky scheme comes
along which solves all these problems, the damage is contained to the
implementation of the interfaces.

>	3. try_module_get() introduces yet another "most fundamental"
> lock type.  We have a bunch of facilities vying to do that, and I
> think it's going to be a source of bugs.  It would be better to avoid
> introducing a new layer of locking if possible.

New code may have bugs, yes <shrug>.  But it's pretty simple. and a
bug is yet to be found here.

> 	4. This kind of race is not really specific to modules, although
> they may be the only example that comes up in practice.

Your code here seems flawed.  You grab the write sem(s) in the remove
code to prevent anyone bumping the refcount.  If someone is recursing
(trying to grab another refcount while already holding one), you
*must* fail them, otherwise you'll deadlock.

And then it becomes try_module_get().

>	4. This kind of race is not really specific to modules, although
> they may be the only example that comes up in practice.

Actually, it is.  For years, the networking code has used refcounts
and two-stage delete almost everywhere: it's now well accepted as an
orthodox method.  So when you "unregister" a socket etc, it removes
the element from lists, and decs the refcount.  Whoever decs the
refcount to zero frees it up (it's assumed that it will monotonically
decrease to zero, since there are no external references any more).
You know this already, of course.

This works *fine* until the object contains a function pointer to
inside a module: at this stage you need to make sure the module
doesn't disappear as well before the object is really deleted.  Roman
solved this by refusing to deregister an object which was in use,
which isn't a really nice solution IMHO.

They really are separate things, we just usually never bothered
refcounting our functions.  The obvious solution is to hold a
reference to the module as well, and that is in fact what this
solution does.

Whatever else, it's conceptually simple.

> 	5. At modprobe time, being able to decide to load a module
> 	   as non-removable to avoid loading .exit{,data} for a smaller
> 	   kernel footprint.  This might only require insmod changes
> 	   for the user level insmod.

Hmm, I already discard these if !CONFIG_MODULE_UNLOAD, but it'd be a
cute hack to let the user do this.

> 	6. kmalloc'ing small modules for less memory consumption and
> 	   perhaps so that they can avoid using TLB entries on certain
> 	   architectures (412 of 1129 modules on my system have
> 	   .text + .data < 4096).

Yeah, this is trivial with the current scheme, and was one of the
aims.  The alloc is arch-specific.

> 	7. User level version: optionally be able to move all symbols
> 	   to user land at the expense of losing kksymoops (would save
> 	   ~100kB on my system).
> 
> 	8. User level version (already done in kernel loader version):
> 	   eliminate dependence on struct module using a module-start.o
> 	   based on what Roman Zippel proposed at
> 	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103740379811285&w=2
> 	   (but using a module-end.o file and eliminating the linker script).
> 
> 	9. User level version: load module contents with mmap(/dev/kmem),
> 	   reducing initial memory requirements by avoiding a malloc
> 	   and copy.
> 
> 	10. Move tracking of dependencies among loaded modules to
> 	    user land (and be able to reconstruct in some cases
> 	    from modules.dep).

Personally, I think the userspace module loaders are clearly inferior,
especially as you're gonna break userspace with almost every one of
these changes.  Sure, you can use a kernel-specific library to give
you back the interface flexibility, but why?  You gain complexity and
your kernel doesn't get any smaller anyway.

Anyway, I think supporting both doesn't make sense.  Either the
in-kernel module loader is better, in which case it should be kept, or
it isn't in which case it should be junked.

Thanks for the mail!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Modules with list
@ 2002-11-25 22:39 Adam J. Richter
  0 siblings, 0 replies; 21+ messages in thread
From: Adam J. Richter @ 2002-11-25 22:39 UTC (permalink / raw)
  To: rusty, vandrove, zippel; +Cc: linux-kernel

I wrote:
>Subject: Modules with list

That should read "Modules wish list".  Sorry for the really stupid typo.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Modules with list
@ 2002-11-25 22:11 Adam J. Richter
  2002-11-26  0:35 ` Rusty Russell
  0 siblings, 1 reply; 21+ messages in thread
From: Adam J. Richter @ 2002-11-25 22:11 UTC (permalink / raw)
  To: rusty, vandrove, zippel; +Cc: linux-kernel

	Here is a list of changes that I'm thinking about trying to
make to modules, in case anyone is interested or wants to show me the
error of my ways.  Most of these changes do not depend on whether the
module loader is in the kernel or a user level program.  I've labelled
the items that are only applicable to user level modules with "user
level version:".

	1. Allow multiple MODULE_DEVICE_TABLE's of the same type in the
	   same .c file instead of the combined_dev_id_table hack that
	   is now used by modules that really need to load separate
	   but related drivers.

	2. Eventually have the same build command for modules and
	   compiled in objects so that distribution makes can ship an
	   "all modules" build and link script to allow much more
	   customization by users who do not want to recompile kernel code.

		2a. Compile module_init, subsys_init, etc. by the
		    same mechanism used by kernel objects.

		2b. Pass module parameter by __setup() rather than
		    MODULE_PARM().

		2c. Eliminate "#ifdef MODULE" init.h, module.h, and,
		    eventually, almost everywhere.

		2d. In the core kernel, THIS_MODULE would point to
		    a struct module rather than being NULL (eliminating
		    many little banches).

	3. To prevent rmmod's during modprobe, have
	    rmmod do flock(/proc/modules, LOCK_EX) and
	    modprobe do flock(/proc/modules, LOCK_SH).  Yes, you can
	    detect this already, but this way you it does not cause
	    failure and you do not need retry code.

	Other wishes that probably do not effect module-init-tools,
	at least when the module loader is in the kernel:

	4. failureless raceless module unloading by the module->rwsem_list
	   system that I described toward the bottom of this message:
	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103773401411324&w=2

	5. At modprobe time, being able to decide to load a module
	   as non-removable to avoid loading .exit{,data} for a smaller
	   kernel footprint.  This might only require insmod changes
	   for the user level insmod.

	6. kmalloc'ing small modules for less memory consumption and
	   perhaps so that they can avoid using TLB entries on certain
	   architectures (412 of 1129 modules on my system have
	   .text + .data < 4096).

		5a. maybe load .text and .data separately for modules where
		    .text + .data >= 4096 && .text < 4096 && .data < 4096
		    (26 of 1129 modules have this property on my system).
		    Probably not worth it.

	7. User level version: optionally be able to move all symbols
	   to user land at the expense of losing kksymoops (would save
	   ~100kB on my system).

	8. User level version (already done in kernel loader version):
	   eliminate dependence on struct module using a module-start.o
	   based on what Roman Zippel proposed at
	   http://marc.theaimsgroup.com/?l=linux-kernel&m=103740379811285&w=2
	   (but using a module-end.o file and eliminating the linker script).

	9. User level version: load module contents with mmap(/dev/kmem),
	   reducing initial memory requirements by avoiding a malloc
	   and copy.

	10. Move tracking of dependencies among loaded modules to
	    user land (and be able to reconstruct in some cases
	    from modules.dep).

	Hopefully, posting this list will reduce the chances of
duplication of effort or help expose a problem or potential
improvement I hadn't considered.

Adam J. Richter     __     ______________   575 Oroville Road
adam@yggdrasil.com     \ /                  Milpitas, California 95035
+1 408 309-6081         | g g d r a s i l   United States of America
                         "Free Software For The Rest Of Us."

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2002-11-29  8:16 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-11-29  8:23 Modules with list Adam J. Richter
  -- strict thread matches above, loose matches on Subject: below --
2002-11-28  1:54 Adam J. Richter
2002-11-28  0:25 Adam J. Richter
2002-11-28 23:53 ` Rusty Russell
2002-11-27 19:04 Adam J. Richter
2002-11-27 18:19 Adam J. Richter
2002-11-27 22:42 ` Keith Owens
2002-11-27  7:22 Adam J. Richter
2002-11-27 23:15 ` Rusty Russell
2002-11-27  7:01 Adam J. Richter
2002-11-26 16:18 Adam J. Richter
2002-11-27 17:05 ` Kai Germaschewski
2002-11-26  6:53 Adam J. Richter
2002-11-26  6:49 Adam J. Richter
2002-11-26 22:40 ` Rusty Russell
2002-11-26  4:06 Adam J. Richter
2002-11-26  5:04 ` Rusty Russell
2002-11-25 22:39 Adam J. Richter
2002-11-25 22:11 Adam J. Richter
2002-11-26  0:35 ` Rusty Russell
2002-11-26  7:57   ` Ingo Oeser

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).