linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Rusty Russell <rusty@rustcorp.com.au>
To: Werner Almesberger <wa@almesberger.net>
Cc: kuznet@ms2.inr.ac.ru, Roman Zippel <zippel@linux-m68k.org>,
	kronos@kronoz.cjb.net, linux-kernel@vger.kernel.org
Subject: Re: [RFC] Migrating net/sched to new module interface
Date: Fri, 17 Jan 2003 13:27:56 +1100	[thread overview]
Message-ID: <20030117022833.229992C365@lists.samba.org> (raw)
In-Reply-To: Your message of "Wed, 15 Jan 2003 23:42:58 -0300." <20030115234258.E1521@almesberger.net>

In message <20030115234258.E1521@almesberger.net> you write:
> Rusty Russell wrote:
> > Deprecating every module, and rewriting their initialization routines
> > is ambitious beyond the scale of anything you have mentioned.
> 
> Well, it has happened before, e.g. sleep_on is now deprecated,
> cli() doesn't give the comprehensive protection it used to,

They gave us SMP.  What do we gain for your change?

> holding spinlocks while sleeping used to be frowned upon, but
> it's only recently that it moved to being forbidden, etc.

No, it's always been forbidden, it's only recently we detect it.

But apologies for the tone of my previous mail: it seems I'm
oversensitive to criticism of the module stuff now 8(

To go someway towards an explanation, at least, I humbly submit a
fairly complete description of the approach used (assuming the module
init race fix patch gets merged).

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

================
Rusty's Unreliable Guide To The 2.5 Module Locking Implementation (draft)
aka. All Locking and No Play Make Rusty Go Crazy

Any object in the kernel which might disappear, needs to have some
form of locking; we're very familiar with locking data structures in
the kernel.  Modules are no exception, except that in this case it's
the functions themselves need locking.

The main difference is that we are not prepared to pay the price for
locking every function we call: the vast majority of speed-critical
functions are not in modules at all, and for many others, the call is
internal to the same module.  So any locking solution must be
lightweight, and (or course) vanish if CONFIG_MODULES is not set.

Now, we don't need to lock unless we are going across a module
boundary (presumbably if you are already in the module, the locking is
sorted out already), and it's easy to ensure that you don't need to
lock for a normal function call, since the module loader can make the
caller statically depend on the callee: if module B calls "foo", which
is in module A, the loader knows that B refers to "foo" when it loads
B, so simply enforces that B uses A, and A cannot be unloaded before
B.

The cases which remain, however, are function calls through pointers:
such strategy functions are used widely, and naturally need some form
of protection against going away.  Since the advent of SMP, and
preemption, it is extremely difficult for such functions to protect
themselves.

Two Stage Delete
================

So let's apply standard locking techniques.  A standard approach to
this for data (eg networking packets) involves two-stage delete: we
keep a reference count in the object, which is usually 1 + number of
users, and to delete it we mark it deleted so noone new can use it,
and drop the reference count.  The last user, who drops the reference
count to zero, actually does the free.

This is called two-stage delete: Alexey Kuznetsov's reason why IPv4
was not a module was the lack of two-stage delete support for modules.

The usual and logical place for the deleted flag and reference count
is in the structure being protected: in this case, that's the module
itself.  You can, instead, have a flag and/or reference count in every
object used by the module, but it's not quite the same thing, and
deactivating them all atomically is impossible, which introduces its
own set of problems.

The main problem advantage of a single flag which says whether the
module is active or not, is that every module would need to add a
function which deactivated it.  There are 1600 modules in the kernel,
and they work fine when built into the kernel: such a change merely
for the module case seems overly disruptive.  Also, there is the other
case to consider: modules can disappear because they failed to
initialize.  The same "deleted" flag can be used to isolate modules
during initialization: otherwise each module would need to implement
two-stage initialization as well.

Finally, the try_module_get() primitive (which combines the deleted
flag test with the reference count increment in one convenient form)
was already fairly widely used by the filesystem code and others,
where it was called "try_mod_inc_use_count()".  Most people seemed to
have little problem using the primitive correctly.

Corner Cases And Optimizations
==============================

Some interfaces need to do more than simply flip a flag on activation:
they might want to fire off a hotplug event, for example, or scan
partition tables.  So a notifier chain is supplied for them to do
exactly this (in the "Proposed Module Init Race Fix" patch).  Modules
are free to "roll their own" two-stage init using module_make_live()
if they want, too.

One of the critical things when designing this, was that getting
references to modules had to be fast, and it didn't matter if removing
modules was relatively slow.  The logical implementation of
try_module_get() looks like:

	static int try_module_get(struct module *mod)
	{
		int ret;
		unsigned long flags;

		read_lock_irqsave(&module_lock, flags);
		if (mod->deleted)
			ret = 0;
		else {
			ret = 1;
			atomic_inc(&mod->use);
		}
		read_unlock_irqrestore(&module_lock, flags);
		return ret;
	}

	static int module_put(struct module *mod)
	{
		if (atomic_dec_and_test(&mod->use))
			if (mod->deleted)
				wake_up(mod->whoever_is_waiting);
	}

And module unloading would look like:

	...
	write_lock_irq(&module_lock);
	if (atomic_read(&mod->use) != 0)
		ret = -EBUSY;
	else {
		mod->deleted = 1;
		ret = 0;
	}
	write_unlock_irq(&module_lock);

But we can do better than this.  The try_module_get contains interrupt
saving code (even though it's often unneccessary), a spin lock and an
atomic operation.  Worse, on SMP the spinlock and atomic variable will
have to bounce from one CPU to another.  The answer is to use an
atomic counter per CPU, and a "bogolock", which looks (conceptually)
like this:

	void bogo_read_lock(void)
	{
		preempt_disable();
	}

	void bogo_read_unlock(void)
	{
		preempt_enable();
	}

	void bogo_write_lock(void)
	{
		run thread on every other CPU
		tell them all to stop interrupts
		stop interrupts locally
	}

	void bogo_write_unlock(void)
	{
		tell threads to unblock interrupts and exit
		restore interrupts locally
	}

So the real try_module_get and module_put look like:

	static inline int try_module_get(struct module *module)
	{
		int ret = 1;

		if (module) {
			unsigned int cpu = get_cpu(); /* Disables preemption */
			if (likely(module_is_live(module)))
				local_inc(&module->ref[cpu].count);
			else
				ret = 0;
			put_cpu();
		}
		return ret;
	}

	static inline void module_put(struct module *module)
	{
		if (module) {
			unsigned int cpu = get_cpu();
			local_dec(&module->ref[cpu].count);
			/* Maybe they're waiting for us to drop reference? */
			if (unlikely(!module_is_live(module)))
				wake_up_process(module->waiter);
			put_cpu();
		}
	}

And the remove code looks like:

	ret = stop_refcounts(); /* bogo_write_lock */
	if (ret != 0)
		goto out;

	/* If it's not unused, quit unless we are told to block. */
	if ((flags & O_NONBLOCK) && module_refcount(mod) != 0) {
		forced = try_force(flags);
		if (!forced)
			ret = -EWOULDBLOCK;
	} else {
		mod->waiter = current;
		mod->state = MODULE_STATE_GOING;
	}
	restart_refcounts(); /* bogo)write_unlock */

Cheers,
Rusty.


  parent reply	other threads:[~2003-01-17  2:19 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2003-01-02 22:50 [RFC] Migrating net/sched to new module interface Kronos
2003-01-03  5:10 ` Rusty Russell
2003-01-03  8:37   ` David S. Miller
2003-01-04  6:09     ` Rusty Russell
2003-01-04 16:21       ` Kronos
2003-01-13 22:32   ` kuznet
2003-01-13 23:23     ` Max Krasnyansky
2003-01-14 17:49     ` Kronos
2003-01-15  0:21       ` Roman Zippel
2003-01-15  1:19         ` kuznet
2003-01-15  7:31           ` Werner Almesberger
2003-01-15  8:16             ` Rusty Russell
2003-01-15  9:33               ` Werner Almesberger
2003-01-16  1:12                 ` Rusty Russell
2003-01-16  2:42                   ` Werner Almesberger
2003-01-16  3:31                     ` Rusty Russell
2003-01-16  4:20                       ` Werner Almesberger
2003-01-16  4:25                       ` David S. Miller
2003-01-16  4:49                         ` Werner Almesberger
2003-01-16 16:05                         ` Roman Zippel
2003-01-16 18:15                     ` Roman Zippel
2003-01-16 18:58                       ` Werner Almesberger
2003-01-16 23:53                         ` Roman Zippel
2003-01-17  1:04                           ` Greg KH
2003-01-17  2:27                     ` Rusty Russell [this message]
2003-01-17 21:40                       ` Roman Zippel
2003-02-13 23:16                       ` Werner Almesberger
2003-02-14  1:57                         ` Rusty Russell
2003-02-14  3:44                           ` Werner Almesberger
2003-02-14 11:16                           ` Roman Zippel
2003-02-14 12:04                             ` Rusty Russell
2003-02-14 12:49                               ` Roman Zippel
2003-02-17  1:59                                 ` Rusty Russell
2003-02-17 10:53                                   ` Roman Zippel
2003-02-17 23:31                                     ` Rusty Russell
2003-02-18 12:26                                       ` Roman Zippel
2003-02-14 13:21                               ` Roman Zippel
2003-02-14 13:53                                 ` Werner Almesberger
2003-02-14 14:24                                   ` Roman Zippel
2003-02-14 18:30                                     ` Werner Almesberger
2003-02-14 20:09                                       ` Roman Zippel
2003-02-15  0:12                                         ` Werner Almesberger
2003-02-15  0:51                                           ` Roman Zippel
2003-02-15  2:28                                             ` Werner Almesberger
2003-02-15 23:20                                               ` Roman Zippel
2003-02-17 17:04                                                 ` Werner Almesberger
2003-02-17 23:09                                                   ` [RFC] Is an alternative module interface needed/possible? Roman Zippel
2003-02-18  1:18                                                     ` Werner Almesberger
2003-02-18  4:54                                                       ` Rusty Russell
2003-02-18  7:20                                                         ` Werner Almesberger
2003-02-18 12:06                                                           ` Roman Zippel
2003-02-18 14:12                                                             ` Werner Almesberger
2003-02-18 12:45                                                               ` Thomas Molina
2003-02-18 17:22                                                               ` Werner Almesberger
2003-02-19  3:30                                                                 ` Rusty Russell
2003-02-19  4:11                                                                   ` Werner Almesberger
2003-02-19 23:38                                                                     ` Rusty Russell
2003-02-20  9:46                                                                       ` Roman Zippel
2003-02-20  0:40                                                                 ` Roman Zippel
2003-02-20  2:17                                                                   ` Werner Almesberger
2003-02-23 16:02                                                                     ` Roman Zippel
2003-02-26 23:26                                                                       ` Werner Almesberger
2003-02-27 12:34                                                                         ` Roman Zippel
2003-02-27 13:20                                                                           ` Werner Almesberger
2003-02-27 14:33                                                                             ` Roman Zippel
2003-02-23 23:34                                                                 ` Kevin O'Connor
2003-02-24 12:14                                                                   ` Roman Zippel
2003-02-18 12:35                                                           ` Roman Zippel
2003-02-18 14:14                                                             ` Werner Almesberger
2003-02-19  1:48                                                       ` Roman Zippel
2003-02-19  2:27                                                         ` Werner Almesberger
2003-01-16 13:44                   ` [RFC] Migrating net/sched to new module interface Roman Zippel
2003-01-15 17:04               ` Roman Zippel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20030117022833.229992C365@lists.samba.org \
    --to=rusty@rustcorp.com.au \
    --cc=kronos@kronoz.cjb.net \
    --cc=kuznet@ms2.inr.ac.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=wa@almesberger.net \
    --cc=zippel@linux-m68k.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).