* Real-Time Preemption and RCU
@ 2005-03-18  0:20 Paul E. McKenney
  2005-03-18  7:49 ` Ingo Molnar
                   ` (6 more replies)
  0 siblings, 7 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18  0:20 UTC (permalink / raw)
  To: mingo
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel

Hello!

As promised/threatened earlier in another forum...

							Thanx, Paul

------------------------------------------------------------------------

The Real-Time Preemption patch modified RCU to permit its read-side
critical sections to be safely preempted.  This has worked, but there
are a few serious problems with this variant of RCU.  [If you just want
to skip directly to the code, search for "synchronize_kernel(void)".
There are five occurrences, each a variation on the theme.  I recommend
the fifth one.  The third one might also be OK in some environments.
If you have better approaches, please do not keep them a secret!!!]

So, why am I saying that there are problems with the real-time preemption
implementation of RCU?

o	RCU read-side critical sections cannot be freely nested,
	since the read-side critical section now acquires locks.
	This means that the real-time preemption variant of RCU
	is subject to deadlock conditions that "classic" RCU
	is immune to (one such deadlock is sketched just after this
	list of problems).  This is not just a theoretical concern;
	for example, see nf_hook_slow() in linux/net/core/netfilter.c:

		+       /*
		+        * PREEMPT_RT semantics: different-type read-locks
		+        * dont nest that easily:
		+        */
		+//     rcu_read_lock_read(&ptype_lock);

	A number of other RCU read-side critical sections have been
	similarly disabled (17 total in the patch).  Perhaps most embedded
	systems will not be using netfilter heavily, but this does bring
	up a very real stability concern, even on UP systems (since
	preemption will eventually occur when and where least expected).

o	RCU read-side critical sections cannot be unconditionally
	upgraded to write-side critical sections in all cases.
	For example, in classic RCU, it is perfectly legal to do
	the following:

		rcu_read_lock();
		list_for_each_entry_rcu(lep, head, p) {
			if (p->needs_update) {
				spin_lock(&update_lock);
				update_it(p);
				spin_unlock(&update_lock);
			}
		}
		rcu_read_unlock()

	This would need to change to the following for real-time
	preempt kernels:

		rcu_read_lock_spin(&update_lock);
		list_for_each_entry_rcu(lep, head, p) {
			if (p->needs_update) {
				spin_lock(&update_lock);
				update_it(p);
				spin_unlock(&update_lock);
			}
		}
		rcu_read_unlock_spin(&update_lock)

	This results in self-deadlock: rcu_read_lock_spin() already
	holds update_lock when the inner spin_lock() attempts to
	acquire it a second time.

o	There is an API expansion, with five different variants
	of rcu_read_lock():

	API			    # uses
	------------------------    ------
	rcu_read_lock_spin()		11
	rcu_read_unlock_spin()		12
	rcu_read_lock_read()		42
	rcu_read_unlock_read()		42
	rcu_read_lock_bh_read()		 2
	rcu_read_unlock_bh_read()	 3
	rcu_read_lock_down_read()	14
	rcu_read_unlock_up_read()	20
	rcu_read_lock_nort()		 3
	rcu_read_unlock_nort()		 4

	TOTAL			       153

o	The need to modify lots of RCU code expands the size of this
	patch -- roughly 10% of the 20K lines of this patch are devoted
	to modifying RCU code to meet this new API.  10% may not sound
	like much, but it comes to more than 2,000 lines of context diffs.

Seems to me that it would be good to have an RCU implementation
that meets the requirements of the Real-Time Preemption patch,
but that is 100% compatible with the "classic RCU" API.  Such
an implementation must meet a number of requirements, which are
listed at the end of this message (search for "REQUIREMENTS").

I have looked into a number of seductive but subtly broken "solutions"
to this problem.  The solution listed here is not perfect, but I believe
that it has enough advantages to be worth pursuing.  The solution is
quite simple, and I feel a bit embarrassed that it took me so long
to come up with it.  All I can say in my defense is that the idea
of -adding- locks to improve scalability and eliminate deadlocks is
quite counterintuitive to me.  And, like I said earlier, if you
know of a better approach, please don't keep it a secret!

The following verbiage steps through several variations on this
solution, as follows:

1.	"Toy" implementation that has numerous API, scalability,
	and realtime problems, but is a very simple 28-line
	illustration of the underlying principles.  (In case you
	get excited about this being much smaller than "classic
	RCU", keep in mind that a similar "toy" implementation of
	"classic RCU" is even smaller.  To say nothing of faster
	and more scalable.)

2.	Expanded version of the "toy" implementation that exactly
	matches the "classic RCU" API, but still has realtime and
	scalability problems.

3.	Expanded version of solution #2 that meets realtime
	requirements (at least -I- think it does...).  It still does
	not scale well.

4.	Expanded version of solution #2 that scales, but that does
	not meet realtime requirements.

5.	The final version, which both scales and meets realtime
	requirements, as well as exactly matching the "classic RCU"
	API.

I have tested this approach, but in user-level scaffolding.  All of
these implementations should therefore be regarded with great suspicion:
untested, probably don't even compile.  Besides which, I certainly can't
claim to fully understand the real-time preempt patch, so I am bound
to have gotten something wrong somewhere.  In any case, none of these
implementations are a suitable replacement for "classic RCU" on large
servers, since they acquire locks in the RCU read-side critical sections.
However, they should scale enough to support small SMP systems, inflicting
only a modest performance penalty.

I believe that implementation #5 is most appropriate for real-time
preempt kernels.  Note that #1, #2, and #4 are not at all appropriate for
-any- kernel.  In theory, #3 might be appropriate, but if I understand
the real-time preempt implementation of reader-writer lock, it will
not perform well if there are long RCU read-side critical sections,
even in UP kernels.


------------------------------------------------------------------------

1. "Toy" Implementation.

[Inappropriate for actual use.]

The basic principle may be familiar to people who worked with the Linux
2.4 and early 2.5 networking code.  The RCU read-side critical section
read-acquires a lock, recursively if need be.  This lock is -not- acquired
by the update side, thus preventing the deadlocks caused by the current
real-time preempt implementation of RCU (and thus also deviating from the
aforementioned networking approach).  Instead, the lock is write-acquired
(and then immediately released) before freeing any memory removed from the
data structure.  The lock cannot be write-acquired until all currently
running read-side critical sections complete.  Therefore, the memory
cannot be released until all read-side critical sections have released
any references that they might have to the memory.

	rwlock_t rcu_deferral_lock = RW_LOCK_UNLOCKED;

	void
	rcu_read_lock(void)
	{
		read_lock(&rcu_deferral_lock);
	}

	void
	rcu_read_unlock(void)
	{
		read_unlock(&rcu_deferral_lock);
	}

	void
	synchronize_kernel(void)
	{
		write_lock(&rcu_deferral_lock);
		write_unlock(&rcu_deferral_lock);
	}

	void
	call_rcu(struct rcu_head *head,
		 void (*func)(struct rcu_head *rcu))
	{
		synchronize_kernel();
		func(head);
	}

This is quite small and simple, but before getting too excited, keep
in mind that a similar "toy" implementation of classic RCU would be
even smaller and simpler.

Note that, just as with classic RCU, the read-side and update-side
critical sections can run concurrently.  This means that the current
rcu_dereference() and rcu_assign_pointer() macros remain just as they
are in the current in-tree RCU implementation.  This is also true of
the other four implementations.

Note also that, just as with classic RCU, there has to be some sort
of mechanism (lock, semaphore, non-blocking synchronization, ...)
that prevents concurrent updates from making a mess.
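
For concreteness, here is a hedged sketch of typical reader and updater
code under any of these implementations -- identical to classic RCU
usage.  The element type, list, lock, and helper functions below are
made up purely for illustration:

	struct foo {
		struct list_head list;
		int key;
		struct rcu_head rcu;
	};
	LIST_HEAD(foo_list);
	spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;

	/* Reader: no locks beyond rcu_read_lock() itself. */
	rcu_read_lock();
	list_for_each_entry_rcu(p, &foo_list, list)
		if (p->key == key)
			do_something_with(p);
	rcu_read_unlock();

	/* Callback invoked after a grace period has elapsed. */
	static void foo_reclaim(struct rcu_head *rcu)
	{
		kfree(container_of(rcu, struct foo, rcu));
	}

	/* Updater: foo_lock serializes concurrent updates. */
	spin_lock(&foo_lock);
	list_del_rcu(&p->list);
	spin_unlock(&foo_lock);
	call_rcu(&p->rcu, foo_reclaim);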

Quick Quiz 1:  In what way does implementation #1 violate the
	current "classic RCU" API?  (Search for "ANSWER" if you can't
	wait.)

------------------------------------------------------------------------

2. "Toy" Implementation That Exactly Matches the "Classic RCU" API.

[Inappropriate for actual use.]

The following implementation exactly matches the "Classic RCU" API,
but does not meet realtime requirements, nor does it scale.

Quick Quiz 2: Why does implementation #2 not meet realtime requirements?

Quick Quiz 3: Why does implementation #2 fail to scale?

	struct rcu_data {
		long		batch;
		struct rcu_head	*waitlist;
		struct rcu_head	**waittail;
	};
	struct rcu_ctrlblk {
		rwlock_t	lock;
		long		batch;
	};
	DEFINE_PER_CPU(struct rcu_data, rcu_data);
	struct rcu_ctrlblk rcu_ctrlblk = {
		.lock = RW_LOCK_UNLOCKED,
		.batch = 0,
	};

	void
	rcu_read_lock(void)
	{
		read_lock(&rcu_ctrlblk.lock);
	}

	void
	rcu_read_unlock(void)
	{
		read_unlock(&rcu_ctrlblk.lock);
	}

	void
	synchronize_kernel(void)
	{
		write_lock(&rcu_ctrlblk.lock);
		rcu_ctrlblk.batch++;
		write_unlock(&rcu_ctrlblk.lock);
	}

	void
	call_rcu(struct rcu_head *head,
		 void (*func)(struct rcu_head *rcu))
	{
		unsigned long flags;
		struct rcu_data *rdp;

		head->func = func;
		head->next = NULL;
		local_irq_save(flags);
		rcu_do_my_batch();
		rdp = &__get_cpu_var(rcu_data);
		*rdp->waittail = head;
		rdp->waittail = &head->next;
		local_irq_restore(flags);
	}

	void rcu_do_my_batch(void)
	{
		unsigned long flags;
		struct rcu_data *rdp;
		struct rcu_head *next, *list;

		local_irq_save(flags);
		rdp = &__get_cpu_var(rcu_data);
		smp_mb();  /* prevent sampling batch # before list removal. */
		if (rdp->batch == rcu_ctrlblk.batch) {
			local_irq_restore(flags);
			return;
		}
		list = rdp->waitlist;
		rdp->waitlist = NULL;
		rdp->waittail = &rdp->waitlist;
		rdp->batch = rcu_ctrlblk.batch;
		local_irq_restore(flags);
		while (list) {
			next = list->next;
			list->func(list);
			list = next;
		}
	}

	/*
	 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
	 * from current in-tree implementation.
	 */

The trick here is that call_rcu() does not write-acquire the lock,
either directly or indirectly.  This means that there is no way
for it to deadlock with an RCU read-side critical section.
However, the callbacks are still invoked with interrupts disabled,
so this implementation still does not meet realtime requirements.
In addition, it still uses a global reader-writer lock, so it does
not scale, either.

However, it -does- faithfully implement the classic RCU API.
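
In particular, because call_rcu() no longer waits for a grace period,
it may once again be invoked from within a read-side critical section,
which is exactly the classic-RCU property that implementation #1
violated (see the answer to Quick Quiz 1).  A minimal illustration,
with "p" assumed to point at an element containing a struct rcu_head
named "rcu", and free_it() hypothetical:

	rcu_read_lock();
	/* Legal here and in classic RCU; would self-deadlock in #1,
	 * whose call_rcu() invokes synchronize_kernel() directly. */
	call_rcu(&p->rcu, free_it);
	rcu_read_unlock();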


------------------------------------------------------------------------

3. Meets Realtime Requirements, Doesn't Scale

The trick here is to break the callback processing out into a separate
rcu_process_callbacks() function, so that it may be invoked from a kernel
daemon, a work queue, or whatever.  It must be called periodically.
If the kernel is under memory pressure, one can get immediate relief
by doing the following:

	synchronize_kernel();
	rcu_advance_callbacks();
	rcu_process_callbacks();

Even more relief can be obtained by using something like
smp_call_function() in order to invoke rcu_process_callbacks() on each
CPU in an SMP system.  Alternatively, one could awaken the kernel daemons
or schedule the function into the appropriate work queue, depending
on how this code fragment is integrated in.  Or, perhaps better yet,
restrict use of this approach to UP kernels.  ;-)
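
As one hedged sketch of what that periodic invocation might look like
(the "krcud" thread below is hypothetical and not part of this
proposal; it assumes the functions defined in the listing that follows,
and since it only processes the CPU it happens to run on, it is really
only suitable for the UP case just mentioned):

	static int krcud(void *unused)
	{
		while (!kthread_should_stop()) {
			synchronize_kernel();		/* force a grace period. */
			rcu_advance_callbacks();	/* waitlist -> donelist. */
			rcu_process_callbacks();	/* invoke "done" callbacks. */
			set_current_state(TASK_INTERRUPTIBLE);
			schedule_timeout(HZ / 10);
		}
		return 0;
	}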

	struct rcu_data {
		long		batch;
		struct rcu_head	*waitlist;
		struct rcu_head	**waittail;
		struct rcu_head	*donelist;
		struct rcu_head	**donetail;
	};
	struct rcu_ctrlblk {
		rwlock_t	lock;
		long		batch;
	};
	DEFINE_PER_CPU(struct rcu_data, rcu_data);
	struct rcu_ctrlblk rcu_ctrlblk = {
		.lock = RW_LOCK_UNLOCKED,
		.batch = 0,
	};

	void
	rcu_read_lock(void)
	{
		read_lock(&rcu_ctrlblk.lock);
	}

	void
	rcu_read_unlock(void)
	{
		read_unlock(&rcu_ctrlblk.lock);
	}

	void
	synchronize_kernel(void)
	{
		write_lock(&rcu_ctrlblk.lock);
		rcu_ctrlblk.batch++;
		write_unlock(&rcu_ctrlblk.lock);
	}

	void
	call_rcu(struct rcu_head *head,
		 void (*func)(struct rcu_head *rcu))
	{
		unsigned long flags;
		struct rcu_data *rdp;

		head->func = func;
		head->next = NULL;
		local_irq_save(flags);
		rcu_advance_callbacks();
		rdp = &__get_cpu_var(rcu_data);
		*rdp->waittail = head;
		rdp->waittail = &head->next;
		local_irq_restore(flags);
	}

	void rcu_advance_callbacks(void)
	{
		unsigned long flags;
		struct rcu_data *rdp;

		local_irq_save(flags);
		rdp = &__get_cpu_var(rcu_data);
		smp_mb();  /* prevent sampling batch # before list removal. */
		if (rdp->batch != rcu_ctrlblk.batch) {
			*rdp->donetail = rdp->waitlist;
			rdp->donetail = rdp->waittail;
			rdp->waitlist = NULL;
			rdp->waittail = &rdp->waitlist;
			rdp->batch = rcu_ctrlblk.batch;
		}
		local_irq_restore(flags);
	}

	void rcu_process_callbacks(void)
	{
		unsigned long flags;
		struct rcu_head *next, *list;
		struct rcu_data *rdp;

		local_irq_save(flags);
		rdp = &__get_cpu_var(rcu_data);
		list = rdp->donelist;
		if (list == NULL) {
			local_irq_restore(flags);
			return;
		}
		rdp->donelist = NULL;
		rdp->donetail = &rdp->donelist;
		local_irq_restore(flags);
		while (list) {
			next = list->next;
			list->func(list);
			list = next;
		}
	}

	/*
	 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
	 * from current in-tree implementation.
	 */

Since a single global rwlock is used, this implementation still does
not scale.


------------------------------------------------------------------------

4. Scalable Implementation

[Inappropriate for realtime use.]

This variant scales, though not well enough to recommend use on a
mid-range POWER machine, let alone a full-blown Altix.  In addition,
it fails to meet some realtime requirements.

Quick Quiz 4: Why does implementation #4 fail the realtime test?

	#define GRACE_PERIODS_PER_SEC 10

	struct rcu_data {
		rwlock_t	lock;
		long		batch;
		struct rcu_head	*waitlist;
		struct rcu_head	**waittail;
	};
	struct rcu_ctrlblk {
		long		batch;
	};
	DEFINE_PER_CPU(struct rcu_data, rcu_data);
	struct rcu_ctrlblk rcu_ctrlblk = {
		.batch = 0,
	};

	void
	rcu_read_lock(void)
	{
		read_lock(&__get_cpu_var(rcu_data).lock);
	}

	void
	rcu_read_unlock(void)
	{
		read_unlock(&__get_cpu_var(rcu_data).lock);
	}

	void
	_synchronize_kernel(void)
	{
		int cpu;

		for_each_cpu(cpu) {
			write_lock(&per_cpu(rcu_data, cpu).lock);
		}
		rcu_ctrlblk.batch++;
		for_each_cpu(cpu) {
			write_unlock(&per_cpu(rcu_data, cpu).lock);
		}
	}

	void
	synchronize_kernel(void)
	{
		long oldbatch;

		smp_mb();
		oldbatch = rcu_ctrlblk.batch;
		schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
		if (rcu_ctrlblk.batch == oldbatch) {
			_synchronize_kernel();
		}
	}

	void
	call_rcu(struct rcu_head *head,
		 void (*func)(struct rcu_head *rcu))
	{
		unsigned long flags;
		struct rcu_data *rdp;

		head->func = func;
		head->next = NULL;
		local_irq_save(flags);
		rcu_do_my_batch();
		rdp = &__get_cpu_var(rcu_data);
		*rdp->waittail = head;
		rdp->waittail = &head->next;
		local_irq_restore(flags);
	}

	void rcu_do_my_batch(void)
	{
		unsigned long flags;
		struct rcu_data *rdp;
		struct rcu_head *next, *list;

		local_irq_save(flags);
		rdp = &__get_cpu_var(rcu_data);
		smp_mb();  /* prevent sampling batch # before list removal. */
		if (rdp->batch == rcu_ctrlblk.batch) {
			local_irq_restore(flags);
			return;
		}
		list = rdp->waitlist;
		rdp->waitlist = NULL;
		rdp->waittail = &rdp->waitlist;
		rdp->batch = rcu_ctrlblk.batch;
		local_irq_restore(flags);
		while (list) {
			next = list->next;
			list->func(list);
			list = next;
		}
	}

	/*
	 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
	 * from current in-tree implementation.
	 */

Quick Quiz 5: Why the use of for_each_cpu() rather than (say)
	for_each_online_cpu() in _synchronize_kernel() in
	implementation #4?

Quick Quiz 6: Why aren't there write-side contention problems
	in implementation #4?


------------------------------------------------------------------------

5. Scalability -and- Realtime Response.

The trick here is to keep track of the CPU that did the rcu_read_lock()
in the task structure.  If there is a preemption, there will be some
slight inefficiency, but the correct lock will be released, preserving
correctness.

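Implementation #5 assumes two new task-structure fields that do not
appear in the listing itself; a minimal sketch of the additions (the
field names are the ones used by the code below, the exact types are
an assumption) might be:

	/* In struct task_struct: */
	int		rcu_read_lock_nesting;	/* rcu_read_lock() nesting depth. */
	rwlock_t	*rcu_read_lock_ptr;	/* per-CPU rcu_data lock read-held, if any. */
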
	#define GRACE_PERIODS_PER_SEC 10

	struct rcu_data {
		rwlock_t	lock;
		long		batch;
		struct rcu_head	*waitlist;
		struct rcu_head	**waittail;
		struct rcu_head	*donelist;
		struct rcu_head	**donetail;
	};
	struct rcu_ctrlblk {
		long		batch;
	};
	DEFINE_PER_CPU(struct rcu_data, rcu_data);
	struct rcu_ctrlblk rcu_ctrlblk = {
		.batch = 0,
	};

	void
	rcu_read_lock(void)
	{
		preempt_disable();
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_read_lock_ptr =
				&__get_cpu_var(rcu_data).lock;
			read_lock(current->rcu_read_lock_ptr);
		}
		preempt_enable();
	}

	void
	rcu_read_unlock(void)
	{
		preempt_disable();
		if (--current->rcu_read_lock_nesting == 0) {
			read_unlock(current->rcu_read_lock_ptr);
		}
		preempt_enable();
	}

	void
	_synchronize_kernel(void)
	{
		int cpu;

		for_each_cpu(cpu) {
			write_lock(&per_cpu(rcu_data, cpu).lock);
		}
		rcu_ctrlblk.batch++;
		for_each_cpu(cpu) {
			write_unlock(&per_cpu(rcu_data, cpu).lock);
		}
	}

	void
	synchronize_kernel(void)
	{
		long oldbatch;

		smp_mb();
		oldbatch = rcu_ctrlblk.batch;
		schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
		if (rcu_ctrlblk.batch == oldbatch) {
			_synchronize_kernel();
		}
	}

	void
	call_rcu(struct rcu_head *head,
		 void (*func)(struct rcu_head *rcu))
	{
		unsigned long flags;
		struct rcu_data *rdp;

		head->func = func;
		head->next = NULL;
		local_irq_save(flags);
		rcu_advance_callbacks();
		rdp = &__get_cpu_var(rcu_data);
		*rdp->waittail = head;
		rdp->waittail = &head->next;
		local_irq_restore(flags);
	}

	void rcu_advance_callbacks(void)
	{
		unsigned long flags;
		struct rcu_data *rdp;

		local_irq_save(flags);  /* allow invocation from OOM handler. */
		rdp = &__get_cpu_var(rcu_data);
		smp_mb(); /* prevent sampling batch # before list removal. */
		if (rdp->batch != rcu_ctrlblk.batch) {
			*rdp->donetail = rdp->waitlist;
			rdp->donetail = rdp->waittail;
			rdp->waitlist = NULL;
			rdp->waittail = &rdp->waitlist;
			rdp->batch = rcu_ctrlblk.batch;
		}
		local_irq_restore(flags);
	}

	void rcu_process_callbacks(void)
	{
		unsigned long flags;
		struct rcu_head *next, *list;
		struct rcu_data *rdp;

		local_irq_save(flags);
		rdp = &__get_cpu_var(rcu_data);
		list = rdp->donelist;
		if (list == NULL) {
			local_irq_restore(flags);
			return;
		}
		rdp->donelist = NULL;
		rdp->donetail = &rdp->donelist;
		local_irq_restore(flags);
		while (list) {
			next = list->next;
			list->func(list);
			list = next;
		}
	}

	/*
	 * Note: rcu_dereference() and rcu_assign_pointer() unchanged
	 * from current in-tree implementation.
	 */

This implementation meets all of the requirements, -except- that it
imposes significant (but probably not crippling) synchronization overhead
on RCU read-side critical sections.  This implementation is probably
serviceable on systems with up to 2-4 CPUs, and perhaps a few more.
However, it is not going to cut it on a mid-range POWER system, much
less a high-end Altix system.  So, if you don't need heavy-duty realtime
response, you should stick with "classic RCU".  Or, better yet, come
up with a single RCU implementation that is suited to both realtime and
server usage!

Quick Quiz 7: What if there is real interrupt or softirq code that
	needs to call rcu_read_lock()?

Quick Quiz 8: What other bugs are present in implementation #5?


------------------------------------------------------------------------

REQUIREMENTS

These are -not- in strict priority order.  Nor can they be, since
different people have different priorities.  ;-)

1. Deferred destruction.

	No data element may be destroyed (for example, freed) while an
	RCU read-side critical section is referencing it.  This property
	absolutely must be implemented, as failing to do so will break
	pretty much all the code that uses RCU.

2. Reliable.

	The implementation must not be prone to gratuitous failure; it
	must be able to run 24x7 for months at a time, and preferably
	for years.

	[This is a problem for so-called "hazard pointer" mechanisms,
	which require that hazard pointers be allocated and freed.
	So what do you do when the allocation fails while nested
	way down in the bowels of some lock's critical section?
	Avoiding this sort of unwind code was a major motivation
	for Jack and me to come up with RCU in the first place!]

3. Callable from IRQ.

	Since RCU is used from softirq state, a realtime implementation
	must either eliminate softirq and interrupt states, eliminate use
	of RCU from these states, or make RCU callable from these states.

	[I believe that the real-time preemption patch moves code
	from softirq/interrupt to process context, but could easily be
	missing something.  If need be, there are ways of handling cases
	where realtime RCU is called from softirq and interrupt contexts;
	for an example approach, see the Quick Quiz answers.]

4. Preemptible read side.

	RCU read-side critical sections can (in theory, anyway) be quite
	large, degrading realtime scheduling response.	Preemptible RCU
	read-side critical sections avoid such degradation.  Manual
	insertion of "preemption points" might be an alternative as well.
	But I have no intention of trying to settle the long-running
	argument between proponents of preemption and of preemption
	points.  Not today, anyway!  ;-)

	[This is one of classic RCU's weak points.  No surprise, as
	it was not designed with aggressive realtime in mind.]

5. Small memory footprint.

	Many realtime systems are configured with modest amounts
	of memory, so it is highly desirable to limit the number of
	outstanding RCU callbacks.  It is also necessary to force
	"reaping" of callbacks should the system run short of memory.

	[This is another of classic RCU's weak points.  No surprise,
	it was not designed with tiny-memory embedded systems in
	mind.]

6. Independent of memory blocks.

	The implementation should not make assumptions about the size
	and extent of the data elements being protected by RCU, since
	such assumptions constrain memory allocation design and can
	impose increased complexity.

	[This requirement makes it difficult to use so-called
	"hazard pointer" techniques.]

7. Synchronization-free read side.

	RCU read-side critical sections should avoid expensive
	synchronization instructions or cache misses.  It is most
	important to avoid global locks and atomic instructions that
	modify global memory, as these operations can inflict severe
	performance and scalability bottlenecks.  Avoiding memory barriers
	is desirable but not as critical.

	[This is where classic RCU shines.  The realtime scheme does not
	have a synchronization-free read side, but implementation #5 at
	least has decent cache locality for read-mostly workloads.]

8. Freely nestable read-side critical sections.

	Restrictions on nestability can result in code duplication,
	so should be avoided if possible.

	[This is where the current real-time preempt RCU implementation
	has -serious- problems.]

9. Unconditional read-to-write upgrade.

	RCU permits a read-side critical section to acquire the
	corresponding write-side lock.	This capability can be convenient
	in some cases.	However, since it is rarely used, it cannot be
	considered mandatory.  Then again, working around the lack of
	such upgrade usually results in code bloat.

	[This is another big issue with the current real-time preempt
	RCU implementation.]

10. Compatible API.

	A realtime RCU implementation should have an API compatible with
	that in the kernel.

	[Another big issue with the current real-time preempt RCU
	implementation.]


------------------------------------------------------------------------

ANSWERS TO QUICK QUIZ

Quick Quiz 1:  In what way does implementation #1 violate the
	current "classic RCU" API?

	Answer: Unlike "classic RCU", it is not legal to invoke
		call_rcu() from within an RCU read-side critical
		section.  Doing so will result in deadlock.

Quick Quiz 2: Why does implementation #2 not meet realtime requirements?

	Answer: Because all the callbacks are invoked with interrupts
		disabled.  If there are a lot of callbacks, realtime
		latency will suffer.

Quick Quiz 3: Why does implementation #2 fail to scale?

	Answer: Because of the large cache-miss penalties of modern
		microprocessors.  See http://linuxjournal.com/article/6993
		for more details on why this is so.

Quick Quiz 4: Why does implementation #4 fail the realtime test?

	Answer: Because the realtime-preempt version of locks can be
		preempted.  Therefore, the rcu_read_unlock() might
		run on a different CPU than did the corresponding
		rcu_read_lock(), resulting in horrible confusion and
		a great decrease in the expected lifespan of your
		kernel.
		
		Also because it reverts to invoking the callbacks
		with interrupts disabled.

Quick Quiz 5: Why the use of for_each_cpu() rather than (say)
	for_each_online_cpu() in _synchronize_kernel()?

	Answer: To completely avoid any races with CPU hotplug.
		Admittedly a large hammer to use, but it does
		have the advantage of simplicity.  Suggestions
		for improvements welcome!

Quick Quiz 6: Why aren't there write-side contention problems
	in implementation #4?

	Answer: Because synchronize_kernel() refuses to acquire
		the lock more than 10 times a second.  Of course,
		if you invoke _synchronize_kernel() frequently
		from an OOM handler, or if you are running on
		a 100-CPU machine, some adjustments may be needed.
		In a -lot- of places in the latter case.

Quick Quiz 7: What if there is real interrupt or softirq code that
	needs to call rcu_read_lock()?

	Answer: Use something like the call_rcu_bh() API.  This could
		have a similar implementation to the ones laid out here,
		but rcu_read_lock_bh() must disable preemption, and
		perhaps also interrupts.  As must _synchronize_kernel().

Quick Quiz 8: What other bugs are present in implementation #5?

	Answer: Beats me!  If I knew, I might have fixed them.  ;-)

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
@ 2005-03-18  7:49 ` Ingo Molnar
  2005-03-18 16:43   ` Paul E. McKenney
  2005-03-18  8:44 ` Ingo Molnar
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  7:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> Seems to me that it would be good to have an RCU implementation that
> meet the requirements of the Real-Time Preemption patch, but that is
> 100% compatible with the "classic RCU" API.  Such an implementation
> must meet a number of requirements, which are listed at the end of
> this message (search for "REQUIREMENTS").

[ Wow. you must be a secret telepathic mind-reader - yesterday i was
  thinking about mailing you, because my approach to RCU preemptability
  (the API variants) clearly sucked and caused problems all around, both
  in terms of maintainability and in terms of stability and
  scalability. ]

> 5.	The final version, which both scales and meets realtime
> 	requirements, as well as exactly matching the "classic RCU"
> 	API.
> 
> I have tested this approach, but in user-level scaffolding.  All of
> these implementations should therefore be regarded with great
> suspicion: untested, probably don't even compile.  Besides which, I
> certainly can't claim to fully understand the real-time preempt patch,
> so I am bound to have gotten something wrong somewhere.  In any case,
> none of these implementations are a suitable replacement for "classic
> RCU" on large servers, since they acquire locks in the RCU read-side
> critical sections. However, they should scale enough to support small
> SMP systems, inflicting only a modest performance penalty.

basically for PREEMPT_RT the only constraint is that RCU sections should
be preemptable. Whatever the performance cost. If PREEMPT_RT is merged
into the upstream kernel then it will (at least initially) be at a
status similar to NOMMU: it will be tolerated as long as it causes no
'drag' on the main code. The RCU API variants i introduced clearly
violated this requirement, and were my #1 worry wrt. upstream
mergability.

> I believe that implementation #5 is most appropriate for real-time
> preempt kernels. [...]

yeah, agreed - it looks perfect - both the read and write sides are
preemptable. Can i just plug the code you sent into rcupdate.c and
expect it to work, or would you like to send a patch? If you prefer you
can make it an unconditional patch against an upstream kernel to keep
things simple for you - i'll then massage it to be properly PREEMPT_RT
dependent.

> [...] In theory, #3 might be appropriate, but if I understand the
> real-time preempt implementation of reader-writer lock, it will not
> perform well if there are long RCU read-side critical sections, even
> in UP kernels.

all RCU-locked sections must be preemptable in -RT.  Basically RCU is a
mainstream API that is used by lots of code and will be introduced in
many other areas as well. From the -RT kernel's POV, this is an
'uncontrollable latency source' that keeps introducing critical
sections. One major goal of PREEMPT_RT is to convert all popular
critical section APIs into preemptible sections, so that the amount of
code that is non-preemptable is drastically reduced and can be managed
(and thus can be trusted). This goal has a higher priority than any
performance consideration, because it doesnt matter what performance you
have, if you cannot trust the kernel to be deterministic.

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
  2005-03-18  7:49 ` Ingo Molnar
@ 2005-03-18  8:44 ` Ingo Molnar
  2005-03-18  9:04 ` Ingo Molnar
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  8:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> 	[I believe that the real-time preemption patch moves code
> 	from softirq/interrupt to process context, but could easily be
> 	missing something.  If need be, there are ways of handling cases
> 	were realtime RCU is called from softirq and interrupt contexts,
> 	for an example approach, see the Quick Quiz answers.]

correct, on PREEMPT_RT, IRQ processing is (and must be) in process
context.

(it's pretty much a must-have thing to enable the preemption of irq
handlers and softirq processing: if irq/softirq contexts nested on the
kernel stack then irq contexts would 'pin down' the process context that
was interrupted, so even if irq preemption was possible, the underlying
process context couldnt be freely scheduled.)

still, it's nice that your #5 design does not rely on it - it's a sign
of fundamental robustness, plus we could (in theory) offer preemptable
RCU on PREEMPT_DESKTOP (== current PREEMPT) as well - just like there is
PREEMPT_BKL.

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
  2005-03-18  7:49 ` Ingo Molnar
  2005-03-18  8:44 ` Ingo Molnar
@ 2005-03-18  9:04 ` Ingo Molnar
  2005-03-18  9:38   ` Ingo Molnar
  2005-03-18  9:13 ` Ingo Molnar
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  9:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> 4. Preemptible read side.
> 
> 	RCU read-side critical sections can (in theory, anyway) be quite
> 	large, degrading realtime scheduling response.	Preemptible RCU
> 	read-side critical sections avoid such degradation.  Manual
> 	insertion of "preemption points" might be an alternative as well.
> 	But I have no intention of trying to settle the long-running
> 	argument between proponents of preemption and of preemption
> 	points.  Not today, anyway!  ;-)

i'm cleverly sidestepping that argument by offering 4 preemption models
in the -RT tree :-) We dont have to pick a winner, users will. The way
low latencies are achieved depends on the preemption model:

               ( ) No Forced Preemption (Server)
               ( ) Voluntary Kernel Preemption (Desktop)
               ( ) Preemptible Kernel (Low-Latency Desktop)
               (X) Complete Preemption (Real-Time)

"Server" is the current default !PREEMPT model. Best throughput, bad
worst-case latencies.

"Low-Latency Desktop" is the current PREEMPT model. Has some runtime
overhead relative to Server, offers fair worst-case latencies.

"Desktop" is a new mode that is somewhere between Server and Low-Latency
Desktop: it's what was initially called 'voluntary preemption'. It
doesnt make the kernel forcibly preemptible anywhere, but inserts a fair
number of preemption points to decrease latencies statistically, while
keeping the runtime overhead close to the "Server". (a variant of this
model is utilized by Fedora and RHEL4 currently, so if this gets
upstream i expect distros to pick it up - it can be a migration helper
towards the "Low-Latency Desktop" model.)

"Real-Time" is the no-compromises hard-RT model: almost everything but
the scheduler itself is fully preemptible. Phenomenal worst-case
latencies in every workload scenario, but also has the highest runtime
overhead.

preemptable RCU makes sense for the "Low-Latency Desktop" and
"Real-Time" preemption models - these are the ones that do forced
preemption.

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
                   ` (2 preceding siblings ...)
  2005-03-18  9:04 ` Ingo Molnar
@ 2005-03-18  9:13 ` Ingo Molnar
  2005-03-18  9:28   ` Ingo Molnar
  2005-03-18 10:03 ` Ingo Molnar
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  9:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> I have tested this approach, but in user-level scaffolding.  All of
> these implementations should therefore be regarded with great
> suspicion: untested, probably don't even compile.  Besides which, I
> certainly can't claim to fully understand the real-time preempt patch,
> so I am bound to have gotten something wrong somewhere. [...]

you dont even have to consider the -RT patchset: if the scheme allows
forced preemption of read-side RCU sections on current upstream
CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  9:13 ` Ingo Molnar
@ 2005-03-18  9:28   ` Ingo Molnar
  2005-03-18  9:53     ` Ingo Molnar
  2005-03-19  5:03     ` Manfred Spraul
  0 siblings, 2 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  9:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Paul E. McKenney <paulmck@us.ibm.com> wrote:
> 
> > I have tested this approach, but in user-level scaffolding.  All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile.  Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere. [...]
> 
> you dont even have to consider the -RT patchset: if the scheme allows
> forced preemption of read-side RCU sections on current upstream
> CONFIG_PREEMPT, then it's perfect for PREEMPT_RT too.

there's one detail on PREEMPT_RT though (which i think you noticed too). 

Priority inheritance handling can be done in a pretty straightforward
way as long as no true read-side nesting is allowed for rwsems and
rwlocks - i.e. there's only one owner of a lock at a time. So PREEMPT_RT
restricts rwsem and rwlock concurrency: readers are writers, with the
only exception that they are allowed to 'self-nest'. I.e. things like:

	read_lock(&rwlock);
	...
		read_lock(&rwlock);

are still legal. (it's also done quite often.)

(it is virtually impossible to implement priority inheritance for true
multi-reader locks in any sane way: i've done it initially and it sucks
very much. It also fundamentally increases the 'lock-dependent'
worst-case latencies - imagine 4 readers having to finish first if a
higher-prio writer comes along. It's insane.)

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  9:04 ` Ingo Molnar
@ 2005-03-18  9:38   ` Ingo Molnar
  0 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  9:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> * Paul E. McKenney <paulmck@us.ibm.com> wrote:
> 
> > 4. Preemptible read side.
> > 
> > 	RCU read-side critical sections can (in theory, anyway) be quite
> > 	large, degrading realtime scheduling response.	Preemptible RCU
> > 	read-side critical sections avoid such degradation.  Manual
> > 	insertion of "preemption points" might be an alternative as well.
> > 	But I have no intention of trying to settle the long-running
> > 	argument between proponents of preemption and of preemption
> > 	points.  Not today, anyway!  ;-)
> 
> i'm cleverly sidestepping that argument by offering 4 preemption
> models in the -RT tree :-) We dont have to pick a winner, users will.
> [...]

also, it turned out that "preemption points" vs. "forced preemption" are
not exclusive concepts: PREEMPT_RT relies on _both_ of them.

when a highprio task tries to acquire a lock that another, lower-prio
task already holds, then 'the time it takes for the lowprio task to drop
the lock' directly depends on the frequency of explicit preemption
points within the critical section. So to get good 'lock dependent'
latencies on PREEMPT_RT we need both forced preemption and a good
distribution of preemption points (within locked sections).

(obviously, 'lock independent' preemption latencies purely depend on
the quality of forced preemption and the size of non-preempt critical
sections, not on any explicit preemption point.)

dont we love seemingly conflicting concepts that end up helping each
other? It's a nice flamewar-killer ;-)

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  9:28   ` Ingo Molnar
@ 2005-03-18  9:53     ` Ingo Molnar
  2005-03-18 15:33       ` Paul E. McKenney
  2005-03-19  5:03     ` Manfred Spraul
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18  9:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> there's one detail on PREEMPT_RT though (which i think you noticed too). 
> 
> Priority inheritance handling can be done in a pretty straightforward
> way as long as no true read-side nesting is allowed for rwsems and
> rwlocks - i.e. there's only one owner of a lock at a time. So
> PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> writers, with the only exception that they are allowed to 'self-nest'.
> [...]

this does not affect read-side RCU, because read-side RCU can never
block a higher-prio thread. (as long as callback processing is pushed
into a separate kernel thread.)

so RCU will be pretty much the only mechanism (besides lock-free code)
that allows reader concurrency on PREEMPT_RT.

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
                   ` (3 preceding siblings ...)
  2005-03-18  9:13 ` Ingo Molnar
@ 2005-03-18 10:03 ` Ingo Molnar
  2005-03-18 11:30   ` Ingo Molnar
  2005-03-18 15:48   ` Paul E. McKenney
  2005-03-18 11:38 ` Ingo Molnar
  2005-03-18 12:56 ` Bill Huey
  6 siblings, 2 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 10:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


there's a problem in #5's rcu_read_lock():

        void
        rcu_read_lock(void)
        {
                preempt_disable();
                if (current->rcu_read_lock_nesting++ == 0) {
                        current->rcu_read_lock_ptr =
                                &__get_cpu_var(rcu_data).lock;
                        read_lock(current->rcu_read_lock_ptr);
                }
                preempt_enable();
        }

not only are read_lock()-ed sections preemptible, read_lock() itself may
block, so it cannot be called from within preempt_disable(). How about
something like:

        void
        rcu_read_lock(void)
        {
                preempt_disable();
                if (current->rcu_read_lock_nesting++ == 0) {
                        current->rcu_read_lock_ptr =
                                &__get_cpu_var(rcu_data).lock;
                        preempt_enable();
                        read_lock(current->rcu_read_lock_ptr);
                } else
                        preempt_enable();
        }

this would still make it 'statistically scalable' - but is it correct?

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18 10:03 ` Ingo Molnar
@ 2005-03-18 11:30   ` Ingo Molnar
  2005-03-18 16:48     ` Esben Nielsen
  2005-03-18 15:48   ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 11:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> [...] How about something like:
> 
>         void
>         rcu_read_lock(void)
>         {
>                 preempt_disable();
>                 if (current->rcu_read_lock_nesting++ == 0) {
>                         current->rcu_read_lock_ptr =
>                                 &__get_cpu_var(rcu_data).lock;
>                         preempt_enable();
>                         read_lock(current->rcu_read_lock_ptr);
>                 } else
>                         preempt_enable();
>         }
> 
> this would still make it 'statistically scalable' - but is it correct?

thinking some more about it, i believe it's correct, because it picks
one particular CPU's lock and correctly releases that lock.

(read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
fine.)

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
                   ` (4 preceding siblings ...)
  2005-03-18 10:03 ` Ingo Molnar
@ 2005-03-18 11:38 ` Ingo Molnar
  2005-03-18 12:56 ` Bill Huey
  6 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 11:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> 5. Scalability -and- Realtime Response.
> 
> The trick here is to keep track of the CPU that did the
> rcu_read_lock() in the task structure.  If there is a preemption,
> there will be some slight inefficiency, but the correct lock will be
> released, preserving correctness.

the inefficiency will be larger, because (as explained in a previous
mail) multiple concurrent owners of the read-lock are not allowed. This
adds to the overhead of PREEMPT_RT on SMP, but is an intentional
tradeoff. I dont expect PREEMPT_RT to be used on an Altix :-|

#5 is still the best option, because in the normal 'infrequent
preemption' case it will behave in a cache-friendly way. A positive
effect of the lock serializing is that the callback backlog will stay
relatively small and that the RCU backlog processing can be made
deterministic as well (if needed), by making the backlog processing
thread(s) SCHED_FIFO.

	Ingo

* Re: Real-Time Preemption and RCU
  2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
                   ` (5 preceding siblings ...)
  2005-03-18 11:38 ` Ingo Molnar
@ 2005-03-18 12:56 ` Bill Huey
  2005-03-18 13:17   ` Bill Huey
  2005-03-18 15:54   ` Paul E. McKenney
  6 siblings, 2 replies; 58+ messages in thread
From: Bill Huey @ 2005-03-18 12:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: mingo, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> 5. Scalability -and- Realtime Response.
...

> 	void
> 	rcu_read_lock(void)
> 	{
> 		preempt_disable();
> 		if (current->rcu_read_lock_nesting++ == 0) {
> 			current->rcu_read_lock_ptr =
> 				&__get_cpu_var(rcu_data).lock;
> 			read_lock(current->rcu_read_lock_ptr);
> 		}
> 		preempt_enable();
> 	}

Ok, here's a rather unsure question...

Uh, is that a sleep violation if that is exclusively held since it
can block within an atomic critical section (deadlock) ?

bill



* Re: Real-Time Preemption and RCU
  2005-03-18 12:56 ` Bill Huey
@ 2005-03-18 13:17   ` Bill Huey
  2005-03-18 15:57     ` Paul E. McKenney
  2005-03-18 16:02     ` Ingo Molnar
  2005-03-18 15:54   ` Paul E. McKenney
  1 sibling, 2 replies; 58+ messages in thread
From: Bill Huey @ 2005-03-18 13:17 UTC (permalink / raw)
  To: Bill Huey
  Cc: Paul E. McKenney, mingo, dipankar, shemminger, akpm, torvalds,
	rusty, tgall, jim.houston, manfred, gh, linux-kernel

On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
> 
> > 	void
> > 	rcu_read_lock(void)
> > 	{
> > 		preempt_disable();
> > 		if (current->rcu_read_lock_nesting++ == 0) {
> > 			current->rcu_read_lock_ptr =
> > 				&__get_cpu_var(rcu_data).lock;
> > 			read_lock(current->rcu_read_lock_ptr);
> > 		}
> > 		preempt_enable();
> > 	}
> 
> Ok, here's a rather unsure question...
> 
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

I'd like to note another problem. Mingo's current implementation of rt_mutex
(super mutex for all blocking synchronization) is still missing reader counts
and something like that would have to be implemented if you want to do priority
inheritance over blocks.

This is going to throw a wrench into your implementation if you assume that.

bill



* Re: Real-Time Preemption and RCU
  2005-03-18  9:53     ` Ingo Molnar
@ 2005-03-18 15:33       ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 15:33 UTC (permalink / raw)
  To: Ingo Molnar

On Fri, Mar 18, 2005 at 10:53:27AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > there's one detail on PREEMPT_RT though (which i think you noticed too). 
> > 
> > Priority inheritance handling can be done in a pretty straightforward
> > way as long as no true read-side nesting is allowed for rwsems and
> > rwlocks - i.e. there's only one owner of a lock at a time. So
> > PREEMPT_RT restricts rwsem and rwlock concurrency: readers are
> > writers, with the only exception that they are allowed to 'self-nest'.
> > [...]
> 
> this does not affect read-side RCU, because read-side RCU can never
> block a higher-prio thread. (as long as callback processing is pushed
> into a separate kernel thread.)
> 
> so RCU will be pretty much the only mechanism (besides lock-free code)
> that allows reader concurrency on PREEMPT_RT.

This is a relief!  I was wondering how on earth I was going to solve
the multi-task priority-inheritance problem!

But...  How do we handle the following scenario?

0.	A bunch of low-priority threads are preempted in the
	middle of various RCU read-side critical sections.

1.	High-priority thread does kmalloc(), but there is no
	memory, so it blocks.

2.	OOM handling notices, and decides to clean up the outstanding
	RCU callbacks.  It therefore invokes _synchronize_kernel()
	(in implementation #5).

3.	The _synchronize_kernel() function tries to acquire all of
	the read-side locks, which are held by numerous preempted
	low-priority threads.

4.	What now???

Or does the current patch do priority inheritance across the memory
allocator?  In other words, can we get away with ignoring this one?  ;-)

						Thanx, Paul

* Re: Real-Time Preemption and RCU
  2005-03-18 10:03 ` Ingo Molnar
  2005-03-18 11:30   ` Ingo Molnar
@ 2005-03-18 15:48   ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 15:48 UTC (permalink / raw)
  To: Ingo Molnar

On Fri, Mar 18, 2005 at 11:03:39AM +0100, Ingo Molnar wrote:
> 
> there's a problem in #5's rcu_read_lock():
> 
>         void
>         rcu_read_lock(void)
>         {
>                 preempt_disable();
>                 if (current->rcu_read_lock_nesting++ == 0) {
>                         current->rcu_read_lock_ptr =
>                                 &__get_cpu_var(rcu_data).lock;
>                         read_lock(current->rcu_read_lock_ptr);
>                 }
>                 preempt_enable();
>         }
> 
> not only are read_lock()-ed sections preemptible, read_lock() itself may
> block, so it cannot be called from within preempt_disable(). How about
> something like:
> 
>         void
>         rcu_read_lock(void)
>         {
>                 preempt_disable();
>                 if (current->rcu_read_lock_nesting++ == 0) {
>                         current->rcu_read_lock_ptr =
>                                 &__get_cpu_var(rcu_data).lock;
>                         preempt_enable();
>                         read_lock(current->rcu_read_lock_ptr);
>                 } else
>                         preempt_enable();
>         }
> 
> this would still make it 'statistically scalable' - but is it correct?

Good catch!

Also good question...

Strictly speaking, it is not necessary to block callback invocation until
just after rcu_read_lock() returns.

It is correct as long as there is no sort of "upcall" or "callback" that
can masquerade as this task.  I know of no such thing in the Linux kernel.
In fact such a thing would break a lot of code, right?

Any tool that relied on the ->rcu_read_lock_nesting counter to deduce
RCU state would be confused by this change, but there might be other
ways of handling this.  Also, we are currently making do without such
a tool.

It should be possible to move the preempt_enable() further forward
ahead of the assignment to ->rcu_read_lock_ptr, since the assignment
to ->rcu_read_lock_ptr is strictly local.  Not sure that this is
worthwhile, thoughts?

        void
        rcu_read_lock(void)
        {
                preempt_disable();
                if (current->rcu_read_lock_nesting++ == 0) {
                        preempt_enable();
                        current->rcu_read_lock_ptr =
                                &__get_cpu_var(rcu_data).lock;
                        read_lock(current->rcu_read_lock_ptr);
                } else
                        preempt_enable();
        }

The other question is whether preempt_disable() is needed in the first
place.  The two task-structure fields are not accessed except by the
task itself.  I bet that the following is just fine:

        void
        rcu_read_lock(void)
        {
                if (current->rcu_read_lock_nesting++ == 0) {
                        current->rcu_read_lock_ptr =
                                &__get_cpu_var(rcu_data).lock;
                        read_lock(current->rcu_read_lock_ptr);
                }
        }

Thoughts?

					Thanx, Paul

* Re: Real-Time Preemption and RCU
  2005-03-18 12:56 ` Bill Huey
  2005-03-18 13:17   ` Bill Huey
@ 2005-03-18 15:54   ` Paul E. McKenney
  2005-03-18 15:58     ` Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 15:54 UTC (permalink / raw)
  To: Bill Huey
  Cc: mingo, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > 5. Scalability -and- Realtime Response.
> ...
> 
> > 	void
> > 	rcu_read_lock(void)
> > 	{
> > 		preempt_disable();
> > 		if (current->rcu_read_lock_nesting++ == 0) {
> > 			current->rcu_read_lock_ptr =
> > 				&__get_cpu_var(rcu_data).lock;
> > 			read_lock(current->rcu_read_lock_ptr);
> > 		}
> > 		preempt_enable();
> > 	}
> 
> Ok, here's a rather unsure question...
> 
> Uh, is that a sleep violation if that is exclusively held since it
> can block within an atomic critical section (deadlock) ?

Hey, I wasn't joking when I said that I probably got something wrong!  ;-)

My current thought is that the preempt_disable()/preempt_enable() can
be dropped entirely.  Messes up any tool that browses through
->rcu_read_lock_nesting, but don't see any other problem.  Yet, anyway!

						Thanx, Paul

* Re: Real-Time Preemption and RCU
  2005-03-18 13:17   ` Bill Huey
@ 2005-03-18 15:57     ` Paul E. McKenney
  2005-03-18 16:02     ` Ingo Molnar
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 15:57 UTC (permalink / raw)
  To: Bill Huey
  Cc: mingo, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Fri, Mar 18, 2005 at 05:17:29AM -0800, Bill Huey wrote:
> On Fri, Mar 18, 2005 at 04:56:41AM -0800, Bill Huey wrote:
> > On Thu, Mar 17, 2005 at 04:20:26PM -0800, Paul E. McKenney wrote:
> > > 5. Scalability -and- Realtime Response.
> > ...
> > 
> > > 	void
> > > 	rcu_read_lock(void)
> > > 	{
> > > 		preempt_disable();
> > > 		if (current->rcu_read_lock_nesting++ == 0) {
> > > 			current->rcu_read_lock_ptr =
> > > 				&__get_cpu_var(rcu_data).lock;
> > > 			read_lock(current->rcu_read_lock_ptr);
> > > 		}
> > > 		preempt_enable();
> > > 	}
> > 
> > Ok, here's a rather unsure question...
> > 
> > Uh, is that a sleep violation if that is exclusively held since it
> > can block within an atomic critical section (deadlock) ?
> 
> I'd like to note another problem. Mingo's current implementation of rt_mutex
> (super mutex for all blocking synchronization) is still missing reader counts
> and something like that would have to be implemented if you want to do priority
> inheritance over blocks.
> 
> This is going to throw a wrench into your implementation if you assume that.

If we need to do priority inheritance across the memory allocator, so
that high-priority tasks blocked waiting for memory pass their priority
on to RCU readers, agreed.  But I don't see any sign that real-time
preempt does this.

In absence of this, as Ingo noted, the fact that readers don't block
each other should make things be safe.  I think...

						Thanx, Paul

* Re: Real-Time Preemption and RCU
  2005-03-18 15:54   ` Paul E. McKenney
@ 2005-03-18 15:58     ` Ingo Molnar
  0 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 15:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Bill Huey, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> > > 		preempt_disable();
> > > 		if (current->rcu_read_lock_nesting++ == 0) {
> > > 			current->rcu_read_lock_ptr =
> > > 				&__get_cpu_var(rcu_data).lock;
> > > 			read_lock(current->rcu_read_lock_ptr);
> > > 		}
> > > 		preempt_enable();

> My current thought is that the preempt_disable()/preempt_enable() can
> be dropped entirely.  Messes up any tool that browses through
> ->rcu_read_lock_nesting, but don't see any other problem.  Yet,
> anyway!

yeah - this sounds good. (We are not aiming for irq-safe RCU anyway, on
PREEMPT_RT.)

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 13:17   ` Bill Huey
  2005-03-18 15:57     ` Paul E. McKenney
@ 2005-03-18 16:02     ` Ingo Molnar
  2005-03-18 16:55       ` Esben Nielsen
  2005-03-18 22:26       ` Herbert Xu
  1 sibling, 2 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 16:02 UTC (permalink / raw)
  To: Bill Huey
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel


* Bill Huey <bhuey@lnxw.com> wrote:

> I'd like to note another problem. Mingo's current implementation of
> rt_mutex (super mutex for all blocking synchronization) is still
> missing reader counts and something like that would have to be
> implemented if you want to do priority inheritance over blocks.

i really have no intention to allow multiple readers for rt-mutexes. We
got away with that so far, and i'd like to keep it so. Imagine 100
threads all blocked in the same critical section (holding the read-lock)
when a highprio writer thread comes around: instant 100x latency to let
all of them roll forward. The only sane solution is to not allow
excessive concurrency. (That limits SMP scalability, but there's no
other choice i can see.)

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18  7:49 ` Ingo Molnar
@ 2005-03-18 16:43   ` Paul E. McKenney
  2005-03-18 17:11     ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 16:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel

On Fri, Mar 18, 2005 at 08:49:03AM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@us.ibm.com> wrote:
> 
> > Seems to me that it would be good to have an RCU implementation that
> > meet the requirements of the Real-Time Preemption patch, but that is
> > 100% compatible with the "classic RCU" API.  Such an implementation
> > must meet a number of requirements, which are listed at the end of
> > this message (search for "REQUIREMENTS").
> 
> [ Wow. you must be a secret telepathic mind-reader - yesterday i was
>   thinking about mailing you, because my approach to RCU preemptability
>   (the API variants) clearly sucked and caused problems all around, both
>   in terms of maintainability and in terms of stability and
>   scalability. ]

Wish I could claim that, but must credit dumb luck.  ;-)

> > 5.	The final version, which both scales and meets realtime
> > 	requirements, as well as exactly matching the "classic RCU"
> > 	API.
> > 
> > I have tested this approach, but in user-level scaffolding.  All of
> > these implementations should therefore be regarded with great
> > suspicion: untested, probably don't even compile.  Besides which, I
> > certainly can't claim to fully understand the real-time preempt patch,
> > so I am bound to have gotten something wrong somewhere.  In any case,
> > none of these implementations are a suitable replacement for "classic
> > RCU" on large servers, since they acquire locks in the RCU read-side
> > critical sections. However, they should scale enough to support small
> > SMP systems, inflicting only a modest performance penalty.
> 
> basically for PREEMPT_RT the only constraint is that RCU sections should
> be preemptable. Whatever the performance cost. If PREEMPT_RT is merged
> into the upstream kernel then it will (at least initially) be at a
> status similar to NOMMU: it will be tolerated as long as it causes no
> 'drag' on the main code. The RCU API variants i introduced clearly
> violated this requirement, and were my #1 worry wrt. upstream
> mergability.

Fair enough!

> > I believe that implementation #5 is most appropriate for real-time
> > preempt kernels. [...]
> 
> yeah, agreed - it looks perfect - both the read and write sides are
> preemptable. Can i just plug the code you sent into rcupdate.c and
> expect it to work, or would you like to send a patch? If you prefer you
> can make it an unconditional patch against an upstream kernel to keep
> things simple for you - i'll then massage it to be properly PREEMPT_RT
> dependent.

I will give it a shot, but feel free to grab anything at any point,
starting either from the patch I will send in a bit or from my earlier
email.  Do whatever works best for you.

For the patch, here are my questions:

o	What is the best way to select between classic RCU and this
	scheme?

	1.	Massive #ifdef across rcupdate.c

	2.	Create an rcupdate_rt.c and browbeat the build system
		into picking one or the other (no clue if this is
		possible...)

	3.	Create an rcupdate_rt.c and rely on the linker to pick
		one or the other, with rcupdate.h generating different
		external symbol names to make the choice.

	Left to myself, I would grit my teeth and do #1, since I
	understand how to do it.  Flames welcome, as long as accompanied
	by good advice!

o	How best to interface to OOM?  Left to myself, I leave this
	for later.  ;-)

I will take the cowardly approach of patching against the upstream kernel.

> > [...] In theory, #3 might be appropriate, but if I understand the
> > real-time preempt implementation of reader-writer lock, it will not
> > perform well if there are long RCU read-side critical sections, even
> > in UP kernels.
> 
> all RCU-locked sections must be preemptable in -RT.  Basically RCU is a
> mainstream API that is used by lots of code and will be introduced in
> many other areas as well. From the -RT kernel's POV this is an
> 'uncontrollable latency source', which keeps introducing critical
> sections. One major goal of PREEMPT_RT is to convert all popular
> critical section APIs into preemptible sections, so that the amount of
> code that is non-preemptable is drastically reduced and can be managed
> (and thus can be trusted). This goal has a higher priority than any
> performance consideration, because it doesn't matter what performance you
> have, if you cannot trust the kernel to be deterministic.

Makes sense to me!

						Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 11:30   ` Ingo Molnar
@ 2005-03-18 16:48     ` Esben Nielsen
  2005-03-18 17:19       ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Esben Nielsen @ 2005-03-18 16:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

On Fri, 18 Mar 2005, Ingo Molnar wrote:

> 
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > [...] How about something like:
> > 
> >         void
> >         rcu_read_lock(void)
> >         {
> >                 preempt_disable();
> >                 if (current->rcu_read_lock_nesting++ == 0) {
> >                         current->rcu_read_lock_ptr =
> >                                 &__get_cpu_var(rcu_data).lock;
> >                         preempt_enable();
> >                         read_lock(current->rcu_read_lock_ptr);
> >                 } else
> >                         preempt_enable();
> >         }
> > 
> > this would still make it 'statistically scalable' - but is it correct?
> 
> thinking some more about it, i believe it's correct, because it picks
> one particular CPU's lock and correctly releases that lock.
> 
> (read_unlock() is atomic even on PREEMPT_RT, so rcu_read_unlock() is
> fine.)
> 

Why should there only be one RCU reader per CPU at any given
instant? Even on a real-time UP system it would be very helpful to have
RCU areas be enterable by several tasks at once. It would perform
better, both wrt. latencies and throughput:
With the above implementation a high priority task entering an RCU area
will have to boost the current RCU reader, make a task switch until that
one finishes, and then make yet another task switch to get back to the high
priority task. With an RCU implementation which can take n RCU readers per
CPU there is no such problem.

Also, having all tasks serialize on one lock (per CPU) really destroys
the real-time properties: the latency of anything which uses RCU will be
the worst latency of anything done under the RCU lock.

When I looked briefly at it in the fall, the following solution jumped to
mind: Have an RCU-reader count, rcu_read_count, for each CPU. When you
enter an RCU read region increment it, and decrement it when you go out of
it. When it is 0, RCU cleanups are allowed - a perfect quiescent state - so
call rcu_qsctr_inc() at that point. Or call it in schedule() as now, just
with an if (rcu_read_count == 0) around it.

I don't think I understand the current code. But if it works now with
preempt_disable()/preempt_enable() around all the read regions, it ought to
work with
    preempt_enable();
    rcu_read_count++/--;
    preempt_disable()
around the same regions, plus the above check for rcu_read_count == 0 in or
around rcu_qsctr_inc().
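
To make that concrete, here is roughly what I am thinking of (completely
untested; rcu_read_count and the ->rcu_read_count_ptr task field are
made-up names, and the pointer is saved per task, like Paul's
->rcu_read_lock_ptr, so that a reader which migrates still decrements the
same counter it incremented):

	DEFINE_PER_CPU(atomic_t, rcu_read_count); /* readers that entered here */

	void rcu_read_lock(void)
	{
		preempt_disable();
		if (current->rcu_read_lock_nesting++ == 0) {
			current->rcu_read_count_ptr =
				&__get_cpu_var(rcu_read_count);
			atomic_inc(current->rcu_read_count_ptr);
		}
		preempt_enable();
	}

	void rcu_read_unlock(void)
	{
		if (--current->rcu_read_lock_nesting == 0)
			atomic_dec(current->rcu_read_count_ptr);
	}

	/* Quiescent-state check, wherever rcu_qsctr_inc(cpu) is driven from: */
	if (atomic_read(&per_cpu(rcu_read_count, cpu)) == 0)
		rcu_qsctr_inc(cpu);

The counter has to be atomic because a reader which has migrated may
decrement it from another CPU while new readers on the original CPU are
incrementing it.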

It might take a long time before the rcu-batches are actually called,
though - but that is a different story, which can be improved upon. An
improvement would be to boost the non-RT tasks entering an RCU read region
to the lowest RT priority. That way there can't be a lot of low
priority tasks hanging around keeping rcu_read_count non-zero for a long
period of time, since these tasks can only be preempted by RT tasks while
in the RCU region.

> 	Ingo

Esben




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 16:02     ` Ingo Molnar
@ 2005-03-18 16:55       ` Esben Nielsen
  2005-03-22 10:04         ` Bill Huey
  2005-03-18 22:26       ` Herbert Xu
  1 sibling, 1 reply; 58+ messages in thread
From: Esben Nielsen @ 2005-03-18 16:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Bill Huey, Paul E. McKenney, dipankar, shemminger, akpm,
	torvalds, rusty, tgall, jim.houston, manfred, gh, linux-kernel


On Fri, 18 Mar 2005, Ingo Molnar wrote:

> 
> * Bill Huey <bhuey@lnxw.com> wrote:
> 
> > I'd like to note another problem. Mingo's current implementation of
> > rt_mutex (super mutex for all blocking synchronization) is still
> > missing reader counts and something like that would have to be
> > implemented if you want to do priority inheritance over blocks.
> 
> i really have no intention to allow multiple readers for rt-mutexes. We
> got away with that so far, and i'd like to keep it so. Imagine 100
> threads all blocked in the same critical section (holding the read-lock)
> when a highprio writer thread comes around: instant 100x latency to let
> all of them roll forward. The only sane solution is to not allow
> excessive concurrency. (That limits SMP scalability, but there's no
> other choice i can see.)
>

Unless a design change is made: one could argue for a semantics where
write-locking _isn't_ deterministic and thus does not have to boost all the
readers. Readers boost the writers, but not the other way around. Readers
will be deterministic, but not writers.
Such a semantics would probably work for a lot of RT applications that
happen not to take any write-locks - these will in fact perform better.
But it will give the rest a lot of problems.
 
> 	Ingo

Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 16:43   ` Paul E. McKenney
@ 2005-03-18 17:11     ` Ingo Molnar
  2005-03-18 17:29       ` Paul E. McKenney
  2005-03-18 20:35       ` Paul E. McKenney
  0 siblings, 2 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 17:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> For the patch, here are my questions:
> 
> o	What is the best way to select between classic RCU and this
> 	scheme?
> 
> 	1.	Massive #ifdef across rcupdate.c
> 
> 	2.	Create an rcupdate_rt.c and browbeat the build system
> 		into picking one or the other (no clue if this is
> 		possible...)
> 
> 	3.	Create an rcupdate_rt.c and rely on the linker to pick
> 		one or the other, with rcupdate.h generating different
> 		external symbol names to make the choice.

you can also go for option #0: just replace the existing RCU code with
the new one, and i'll then deal with the configuration details.

what will have to happen is most likely #2 (since there is near zero
code sharing between the two variants, right?). Picking rcupdate_rt.c is
as simple as doing this:

 obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
 obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o

and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
(default) or CONFIG_PREEMPT_RCU (if the user selects it).

but it's not yet clear whether we want to offer this to users as a
configurable option. The simplest solution for you would be to go with
option #0 :-) [or if you prefer switchability, #1 is good too - i can
then extract the bits and do #2 based on that.]

> o	How best to interface to OOM?  Left to myself, I leave this
> 	for later.  ;-)

yeah, i'd not worry about OOM that much at this stage.

> I will take the cowardly approach of patching against the upstream
> kernel.

sure. This is in fact easier for me: i'll first rip all my RCU hackery
out of -RT and then add your code, so the base i'll be merging against
will be closer to upstream than to current -RT.

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 16:48     ` Esben Nielsen
@ 2005-03-18 17:19       ` Ingo Molnar
  2005-03-20 13:29         ` Esben Nielsen
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-18 17:19 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel


* Esben Nielsen <simlo@phys.au.dk> wrote:

> Why should there only be one RCU reader per CPU at any given
> instant? Even on a real-time UP system it would be very helpful to
> have RCU areas be enterable by several tasks at once. It would
> perform better, both wrt. latencies and throughput: With the above
> implementation a high priority task entering an RCU area will have to
> boost the current RCU reader, make a task switch until that one
> finishes, and then make yet another task switch to get back to the high
> priority task. With an RCU implementation which can take n RCU readers
> per CPU there is no such problem.

correct, for RCU we could allow multiple readers per lock, because the
'blocking' side of RCU (callback processing) is never (supposed to be)
in any latency path.

except if someone wants to make RCU callback processing deterministic at
some point. (e.g. for memory management reasons.)

clearly the simplest solution is to go with the single-reader locks for
now - a separate experiment could be done with a new type of rwlock that
can only be used by the RCU code. (I'm not quite sure whether we could
guarantee a minimum rate of RCU callback processing under such a scheme
though. It's an eventual memory DoS otherwise.)

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 17:11     ` Ingo Molnar
@ 2005-03-18 17:29       ` Paul E. McKenney
  2005-03-18 20:35       ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 17:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel

On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@us.ibm.com> wrote:
> 
> > For the patch, here are my questions:
> > 
> > o	What is the best way to select between classic RCU and this
> > 	scheme?
> > 
> > 	1.	Massive #ifdef across rcupdate.c
> > 
> > 	2.	Create an rcupdate_rt.c and browbeat the build system
> > 		into picking one or the other (no clue if this is
> > 		possible...)
> > 
> > 	3.	Create an rcupdate_rt.c and rely on the linker to pick
> > 		one or the other, with rcupdate.h generating different
> > 		external symbol names to make the choice.
> 
> you can also go for option #0: just replace the existing RCU code with
> the new one, and i'll then deal with the configuration details.

Having just spent the past few minutes descending into #ifdef hell,
I agree completely with your option #0.

> what will have to happen is most likely #2 (since there is near zero
> code sharing between the two variants, right?). Picking rcupdate_rt.c is
> as simple as doing this:
> 
>  obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
>  obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o
> 
> and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
> (default) or CONFIG_PREEMPT_RCU (if the user selects it).

Cool!  Thank you for the tutorial!  And yes, there is no shared code
that I can see.

> but it's not yet clear whether we want to offer this to users as a
> configurable option. The simplest solution for you would be to go with
> option #0 :-) [or if you prefer switchability, #1 is good too - i can
> then extract the bits and do #2 based on that.]

Option #0 it is -- I will stick with the locking algorithms and let
wiser and more experienced heads work out the configuration issues.

> > o	How best to interface to OOM?  Left to myself, I leave this
> > 	for later.  ;-)
> 
> yeah, i'd not worry about OOM that much at this stage.
> 
> > I will take the cowardly approach of patching against the upstream
> > kernel.
> 
> sure. This is in fact easier for me: i'll first rip all my RCU hackery
> out of -RT and then add your code, so the base i'll be merging against
> will be closer to upstream than to current -RT.

Sounds good to me!

						Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 17:11     ` Ingo Molnar
  2005-03-18 17:29       ` Paul E. McKenney
@ 2005-03-18 20:35       ` Paul E. McKenney
  2005-03-18 22:22         ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 20:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 25604 bytes --]

On Fri, Mar 18, 2005 at 06:11:26PM +0100, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paulmck@us.ibm.com> wrote:
> 
> > For the patch, here are my questions:
> > 
> > o	What is the best way to select between classic RCU and this
> > 	scheme?
> > 
> > 	1.	Massive #ifdef across rcupdate.c
> > 
> > 	2.	Create an rcupdate_rt.c and browbeat the build system
> > 		into picking one or the other (no clue if this is
> > 		possible...)
> > 
> > 	3.	Create an rcupdate_rt.c and rely on the linker to pick
> > 		one or the other, with rcupdate.h generating different
> > 		external symbol names to make the choice.
> 
> you can also go for option #0: just replace the existing RCU code with
> the new one, and i'll then deal with the configuration details.
> 
> what will have to happen is most likely #2 (since there is near zero
> code sharing between the two variants, right?). Picking rcupdate_rt.c is
> as simple as doing this:
> 
>  obj-$(CONFIG_DONT_PREEMPT_RCU) += rcupdate.o
>  obj-$(CONFIG_PREEMPT_RCU) += rcupdate_rt.o
> 
> and then use Kconfig to generate either CONFIG_DONT_PREEMPT_RCU
> (default) or CONFIG_PREEMPT_RCU (if the user selects it).
> 
> but it's not yet clear whether we want to offer this to users as a
> configurable option. The simplest solution for you would be to go with
> option #0 :-) [or if you prefer switchability, #1 is good too - i can
> then extract the bits and do #2 based on that.]
> 
> > o	How best to interface to OOM?  Left to myself, I leave this
> > 	for later.  ;-)
> 
> yeah, i'd not worry about OOM that much at this stage.
> 
> > I will take the cowardly approach of patching against the upstream
> > kernel.
> 
> sure. This is in fact easier for me: i'll first rip all my RCU hackery
> out of -RT and then add your code, so the base i'll be merging against
> will be closer to upstream than to current -RT.

Compiles, probably dies horribly.  "diff" didn't do such a good job
on this one, so attaching the raw rcupdate.[hc] files as well.

							Thanx, Paul

Signed-off-by: <paulmck@us.ibm.com>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h	Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h	Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
        (ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
 
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-	long	cur;		/* Current batch number.                      */
-	long	completed;	/* Number of the last completed batch         */
-	int	next_pending;	/* Is the next batch already waiting?         */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-        return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-        return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-	/* 1) quiescent state handling : */
-	long		quiescbatch;     /* Batch # for grace period */
-	int		passed_quiesc;	 /* User-mode/idle loop etc. */
-	int		qs_pending;	 /* core waits for quiesc state */
-
-	/* 2) batch handling */
-	long  	       	batch;           /* Batch # for current RCU batch */
-	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
-	struct rcu_head *donelist;
-	struct rcu_head **donetail;
-	int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-						struct rcu_data *rdp)
-{
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
-
-	/* This cpu has no pending entries, but there are new entries */
-	if (!rdp->curlist && rdp->nxtlist)
-		return 1;
-
-	/* This cpu has finished callbacks to invoke */
-	if (rdp->donelist)
-		return 1;
-
-	/* The rcu core waits for a quiescent state from the cpu */
-	if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-		return 1;
-
-	/* nothing to do */
-	return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-	return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-		__rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections.  One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes call_rcu() to register
- * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
- * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
- * callback is invoked.  This is legal, because the RCU read-side critical
- * section that was running concurrently with the call_rcu() (and which
- * therefore might be referencing something that the corresponding RCU
- * callback would free up) has completed before the corresponding
- * RCU callback is invoked.
- *
- * RCU read-side critical sections may be nested.  Any deferred actions
- * will be deferred until the outermost RCU read-side critical section
- * completes.
- *
- * It is illegal to block while in an RCU read-side critical section.
- */
-#define rcu_read_lock()		preempt_disable()
-
-/**
- * rcu_read_unlock - marks the end of an RCU read-side critical section.
- *
- * See rcu_read_lock() for more information.
- */
-#define rcu_read_unlock()	preempt_enable()
-
-/*
- * So where is rcu_write_lock()?  It does not exist, as there is no
- * way for writers to lock out RCU readers.  This is a feature, not
- * a bug -- this property is what provides RCU's performance benefits.
- * Of course, writers must coordinate with each other.  The normal
- * spinlock primitives work well for this, but any other technique may be
- * used as well.  RCU does not care how the writers keep out of each
- * others' way, as long as they do so.
- */
-
-/**
- * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
- *
- * This is equivalent of rcu_read_lock(), but to be used when updates
- * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
- * consider completion of a softirq handler to be a quiescent state,
- * a process in RCU read-side critical section must be protected by
- * disabling softirqs. Read-side critical sections in interrupt context
- * can use just rcu_read_lock().
- *
- */
-#define rcu_read_lock_bh()	local_bh_disable()
-
-/*
- * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
- *
- * See rcu_read_lock_bh() for more information.
- */
-#define rcu_read_unlock_bh()	local_bh_enable()
+#define rcu_read_lock_bh() rcu_read_lock()
+#define rcu_read_unlock_bh() rcu_read_unlock()
+#define call_rcu_bh(head, func) call_rcu(head, func)
+#define rcu_bh_qsctr_inc(cpu)
+#define rcu_qsctr_inc(cpu)
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
@@ -257,15 +99,15 @@ static inline int rcu_pending(int cpu)
 					})
 
 extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);
 
 /* Exported interfaces */
 extern void FASTCALL(call_rcu(struct rcu_head *head, 
 				void (*func)(struct rcu_head *head)));
-extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
-				void (*func)(struct rcu_head *head)));
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
 extern void synchronize_kernel(void);
+extern int rcu_pending(int cpu);
+extern void rcu_check_callbacks(int cpu, int user);
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUPDATE_H */
diff -urpN -X ../dontdiff linux-2.5/include/linux/sched.h linux-2.5-rtRCU/include/linux/sched.h
--- linux-2.5/include/linux/sched.h	Wed Mar  9 12:37:07 2005
+++ linux-2.5-rtRCU/include/linux/sched.h	Fri Mar 18 11:33:13 2005
@@ -707,6 +707,9 @@ struct task_struct {
   	struct mempolicy *mempolicy;
 	short il_next;
 #endif
+
+	int rcu_read_lock_nesting;
+	rwlock_t *rcu_read_lock_ptr;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X ../dontdiff linux-2.5/kernel/rcupdate.c linux-2.5-rtRCU/kernel/rcupdate.c
--- linux-2.5/kernel/rcupdate.c	Wed Mar  9 12:37:22 2005
+++ linux-2.5-rtRCU/kernel/rcupdate.c	Fri Mar 18 11:22:24 2005
@@ -47,424 +47,166 @@
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
 
-/* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk = 
-	{ .cur = -300, .completed = -300 };
-struct rcu_ctrlblk rcu_bh_ctrlblk =
-	{ .cur = -300, .completed = -300 };
-
-/* Bookkeeping of the progress of the grace period */
-struct rcu_state {
-	spinlock_t	lock; /* Guard this struct and writes to rcu_ctrlblk */
-	cpumask_t	cpumask; /* CPUs that need to switch in order    */
-	                              /* for current batch to proceed.        */
+#define GRACE_PERIODS_PER_SEC 10
+
+struct rcu_data {
+	rwlock_t	lock;
+	long		batch;
+	struct rcu_head	*waitlist;
+	struct rcu_head	**waittail;
+	struct rcu_head	*donelist;
+	struct rcu_head	**donetail;
+};
+struct rcu_ctrlblk {
+	long		batch;
+	unsigned long	last_sk;
+};
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
+	.lock = RW_LOCK_UNLOCKED,
+	.batch = 0,
+	.waitlist = NULL,
+	.donelist = NULL
+};
+struct rcu_ctrlblk rcu_ctrlblk = {
+	.batch = 0,
 };
 
-static struct rcu_state rcu_state ____cacheline_maxaligned_in_smp =
-	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-static struct rcu_state rcu_bh_state ____cacheline_maxaligned_in_smp =
-	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10;
-
-/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void fastcall call_rcu(struct rcu_head *head,
-				void (*func)(struct rcu_head *rcu))
+void rcu_init(void)
 {
-	unsigned long flags;
+	int cpu;
 	struct rcu_data *rdp;
 
-	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
-	local_irq_restore(flags);
+	for_each_cpu(cpu) {
+		rdp = &per_cpu(rcu_data, cpu);
+		rdp->waittail = &rdp->waitlist;
+		rdp->donetail = &rdp->donelist;
+	}
 }
 
-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
-				void (*func)(struct rcu_head *rcu))
+void rcu_read_lock(void)
 {
-	unsigned long flags;
-	struct rcu_data *rdp;
-
-	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_bh_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
-	local_irq_restore(flags);
+	preempt_disable();
+	if (current->rcu_read_lock_nesting++ == 0 * current->static_prio) {
+		current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
+		read_lock(current->rcu_read_lock_ptr);
+	}
+	preempt_enable();
 }
 
-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
+void rcu_read_unlock(void)
 {
-	struct rcu_head *next, *list;
-	int count = 0;
-
-	list = rdp->donelist;
-	while (list) {
-		next = rdp->donelist = list->next;
-		list->func(list);
-		list = next;
-		if (++count >= maxbatch)
-			break;
+	preempt_disable();
+	if (--current->rcu_read_lock_nesting == 0) {
+		read_unlock(current->rcu_read_lock_ptr);
 	}
-	if (!rdp->donelist)
-		rdp->donetail = &rdp->donelist;
-	else
-		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+	preempt_enable();
 }
 
-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- *   This is done by rcu_start_batch. The start is not broadcasted to
- *   all cpus, they must pick this up by comparing rcp->cur with
- *   rdp->quiescbatch. All cpus are recorded  in the
- *   rcu_state.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- *   Since the start of the grace period is not broadcasted, at least two
- *   calls to rcu_check_quiescent_state are required:
- *   The first call just notices that a new grace period is running. The
- *   following calls check if there was a quiescent state since the beginning
- *   of the grace period. If so, it updates rcu_state.cpumask. If
- *   the bitmap is empty, then the grace period is completed.
- *   rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- *   period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_state.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
-				int next_pending)
+void _synchronize_kernel(void)
 {
-	if (next_pending)
-		rcp->next_pending = 1;
+	int cpu;
 
-	if (rcp->next_pending &&
-			rcp->completed == rcp->cur) {
-		/* Can't change, since spin lock held. */
-		cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
-		rcp->next_pending = 0;
-		/* next_pending == 0 must be visible in __rcu_process_callbacks()
-		 * before it can see new value of cur.
-		 */
-		smp_wmb();
-		rcp->cur++;
+	for_each_cpu(cpu) {  /* _online() or _present() races with hotplug */
+		write_lock(per_cpu(rcu_data, cpu));
 	}
-}
-
-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
-{
-	cpu_clear(cpu, rsp->cpumask);
-	if (cpus_empty(rsp->cpumask)) {
-		/* batch completed ! */
-		rcp->completed = rcp->cur;
-		rcu_start_batch(rcp, rsp, 0);
+	rcu_ctrlblk.batch++;
+	rcu_ctrlblk.last_sk = jiffies;
+	for_each_cpu(cpu) {
+		write_unlock(per_cpu(rcu_data, cpu));
 	}
 }
 
-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
-			struct rcu_state *rsp, struct rcu_data *rdp)
+void synchronize_kernel(void)
 {
-	if (rdp->quiescbatch != rcp->cur) {
-		/* start new grace period: */
-		rdp->qs_pending = 1;
-		rdp->passed_quiesc = 0;
-		rdp->quiescbatch = rcp->cur;
-		return;
-	}
-
-	/* Grace period already completed for this cpu?
-	 * qs_pending is checked instead of the actual bitmap to avoid
-	 * cacheline trashing.
-	 */
-	if (!rdp->qs_pending)
-		return;
-
-	/* 
-	 * Was there a quiescent state since the beginning of the grace
-	 * period? If no, then exit and wait for the next call.
-	 */
-	if (!rdp->passed_quiesc)
-		return;
-	rdp->qs_pending = 0;
-
-	spin_lock(&rsp->lock);
-	/*
-	 * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
-	 * during cpu startup. Ignore the quiescent state.
-	 */
-	if (likely(rdp->quiescbatch == rcp->cur))
-		cpu_quiet(rdp->cpu, rcp, rsp);
-
-	spin_unlock(&rsp->lock);
-}
-
+	long oldbatch;
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
-				struct rcu_head **tail)
-{
-	local_irq_disable();
-	*this_rdp->nxttail = list;
-	if (list)
-		this_rdp->nxttail = tail;
-	local_irq_enable();
+	smp_mb();
+	oldbatch = rcu_ctrlblk.batch;
+	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
+	if (rcu_ctrlblk.batch == oldbatch) {
+		_synchronize_kernel();
+	}
 }
 
-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
-	struct rcu_ctrlblk *rcp, struct rcu_state *rsp, struct rcu_data *rdp)
+void rcu_advance_callbacks(void)
 {
-	/* if the cpu going offline owns the grace period
-	 * we can block indefinitely waiting for it, so flush
-	 * it here
-	 */
-	spin_lock_bh(&rsp->lock);
-	if (rcp->cur != rcp->completed)
-		cpu_quiet(rdp->cpu, rcp, rsp);
-	spin_unlock_bh(&rsp->lock);
-	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
-	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+	unsigned long flags;
+	struct rcu_data *rdp;
 
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	rdp = &__get_cpu_var(rcu_data);
+	smp_mb();	/* prevent sampling batch # before list removal. */
+	if (rdp->batch != rcu_ctrlblk.batch) {
+		*rdp->donetail = rdp->waitlist;
+		rdp->donetail = rdp->waittail;
+		rdp->waitlist = NULL;
+		rdp->waittail = &rdp->waitlist;
+		rdp->batch = rcu_ctrlblk.batch;
+	}
+	local_irq_restore(flags);
 }
-static void rcu_offline_cpu(int cpu)
+
+void call_rcu(struct rcu_head *head,
+	 void (*func)(struct rcu_head *rcu))
 {
-	struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
-	struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+	unsigned long flags;
+	struct rcu_data *rdp;
 
-	__rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &rcu_state,
-					&per_cpu(rcu_data, cpu));
-	__rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &rcu_bh_state,
-					&per_cpu(rcu_bh_data, cpu));
-	put_cpu_var(rcu_data);
-	put_cpu_var(rcu_bh_data);
-	tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+	head->func = func;
+	head->next = NULL;
+	local_irq_save(flags);
+	rcu_advance_callbacks();
+	rdp = &__get_cpu_var(rcu_data);
+	*rdp->waittail = head;
+	rdp->waittail = &head->next;
+	local_irq_restore(flags);
 }
 
-#else
-
-static void rcu_offline_cpu(int cpu)
+void rcu_process_callbacks(void)
 {
-}
-
-#endif
+	unsigned long flags;
+	struct rcu_head *next, *list;
+	struct rcu_data *rdp;
 
-/*
- * This does the RCU processing work from tasklet context. 
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
-			struct rcu_state *rsp, struct rcu_data *rdp)
-{
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
-		*rdp->donetail = rdp->curlist;
-		rdp->donetail = rdp->curtail;
-		rdp->curlist = NULL;
-		rdp->curtail = &rdp->curlist;
+	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_data);
+	list = rdp->donelist;
+	if (list == NULL) {
+		local_irq_restore(flags);
+		return;
 	}
-
-	local_irq_disable();
-	if (rdp->nxtlist && !rdp->curlist) {
-		rdp->curlist = rdp->nxtlist;
-		rdp->curtail = rdp->nxttail;
-		rdp->nxtlist = NULL;
-		rdp->nxttail = &rdp->nxtlist;
-		local_irq_enable();
-
-		/*
-		 * start the next batch of callbacks
-		 */
-
-		/* determine batch number */
-		rdp->batch = rcp->cur + 1;
-		/* see the comment and corresponding wmb() in
-		 * the rcu_start_batch()
-		 */
-		smp_rmb();
-
-		if (!rcp->next_pending) {
-			/* and start it/schedule start if it's a new batch */
-			spin_lock(&rsp->lock);
-			rcu_start_batch(rcp, rsp, 1);
-			spin_unlock(&rsp->lock);
-		}
-	} else {
-		local_irq_enable();
+	rdp->donelist = NULL;
+	rdp->donetail = &rdp->waitlist;
+	local_irq_restore(flags);
+	while (list) {
+		next = list->next;
+		list->func(list);
+		list = next;
 	}
-	rcu_check_quiescent_state(rcp, rsp, rdp);
-	if (rdp->donelist)
-		rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
-	__rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
-				&__get_cpu_var(rcu_data));
-	__rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
-				&__get_cpu_var(rcu_bh_data));
 }
 
 void rcu_check_callbacks(int cpu, int user)
 {
-	if (user || 
-	    (idle_cpu(cpu) && !in_softirq() && 
-				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
-		rcu_qsctr_inc(cpu);
-		rcu_bh_qsctr_inc(cpu);
-	} else if (!in_softirq())
-		rcu_bh_qsctr_inc(cpu);
-	tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
-						struct rcu_data *rdp)
-{
-	memset(rdp, 0, sizeof(*rdp));
-	rdp->curtail = &rdp->curlist;
-	rdp->nxttail = &rdp->nxtlist;
-	rdp->donetail = &rdp->donelist;
-	rdp->quiescbatch = rcp->completed;
-	rdp->qs_pending = 0;
-	rdp->cpu = cpu;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
-	rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
-	rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
-	tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __devinit rcu_cpu_notify(struct notifier_block *self, 
-				unsigned long action, void *hcpu)
-{
-	long cpu = (long)hcpu;
-	switch (action) {
-	case CPU_UP_PREPARE:
-		rcu_online_cpu(cpu);
-		break;
-	case CPU_DEAD:
-		rcu_offline_cpu(cpu);
-		break;
-	default:
-		break;
+	if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) > 
+	    HZ/GRACE_PERIODS_PER_SEC) {
+		synchronize_kernel();
+		rcu_advance_callbacks();
+		rcu_process_callbacks();
 	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block __devinitdata rcu_nb = {
-	.notifier_call	= rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism.  Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
-void __init rcu_init(void)
-{
-	rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
-			(void *)(long)smp_processor_id());
-	/* Register notifier for non-boot CPUs */
-	register_cpu_notifier(&rcu_nb);
 }
 
-struct rcu_synchronize {
-	struct rcu_head head;
-	struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head  *head)
-{
-	struct rcu_synchronize *rcu;
-
-	rcu = container_of(head, struct rcu_synchronize, head);
-	complete(&rcu->completion);
-}
-
-/**
- * synchronize_kernel - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void synchronize_kernel(void)
+int rcu_pending(int cpu)
 {
-	struct rcu_synchronize rcu;
-
-	init_completion(&rcu.completion);
-	/* Will wake me after RCU finished */
-	call_rcu(&rcu.head, wakeme_after_rcu);
+	unsigned long flags;
+	struct rcu_data *rdp;
+	int retval;
 
-	/* Wait for it */
-	wait_for_completion(&rcu.completion);
+	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_data);
+	retval = (rdp->waitlist || rdp->donelist);
+	local_irq_restore(flags);
+	return (retval);
 }
 
-module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
-EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);

[-- Attachment #2: rcupdate.h --]
[-- Type: text/plain, Size: 3680 bytes --]

/*
 * Read-Copy Update mechanism for mutual exclusion 
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (C) IBM Corporation, 2001
 *
 * Author: Dipankar Sarma <dipankar@in.ibm.com>
 * 
 * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
 * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
 * Papers:
 * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
 * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
 *
 * For detailed explanation of Read-Copy Update mechanism see -
 * 		http://lse.sourceforge.net/locking/rcupdate.html
 *
 */

#ifndef __LINUX_RCUPDATE_H
#define __LINUX_RCUPDATE_H

#ifdef __KERNEL__

#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/threads.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/seqlock.h>

/**
 * struct rcu_head - callback structure for use with RCU
 * @next: next update requests in a list
 * @func: actual update function to call after the grace period.
 */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

#define RCU_HEAD_INIT(head) { .next = NULL, .func = NULL }
#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT(head)
#define INIT_RCU_HEAD(ptr) do { \
       (ptr)->next = NULL; (ptr)->func = NULL; \
} while (0)

#define rcu_read_lock_bh() rcu_read_lock()
#define rcu_read_unlock_bh() rcu_read_unlock()
#define call_rcu_bh(head, func) call_rcu(head, func)
#define rcu_bh_qsctr_inc(cpu)
#define rcu_qsctr_inc(cpu)

/**
 * rcu_dereference - fetch an RCU-protected pointer in an
 * RCU read-side critical section.  This pointer may later
 * be safely dereferenced.
 *
 * Inserts memory barriers on architectures that require them
 * (currently only the Alpha), and, more importantly, documents
 * exactly which pointers are protected by RCU.
 */

#define rcu_dereference(p)     ({ \
				typeof(p) _________p1 = p; \
				smp_read_barrier_depends(); \
				(_________p1); \
				})

/**
 * rcu_assign_pointer - assign (publicize) a pointer to a newly
 * initialized structure that will be dereferenced by RCU read-side
 * critical sections.  Returns the value assigned.
 *
 * Inserts memory barriers on architectures that require them
 * (pretty much all of them other than x86), and also prevents
 * the compiler from reordering the code that initializes the
 * structure after the pointer assignment.  More importantly, this
 * call documents which pointers will be dereferenced by RCU read-side
 * code.
 */

#define rcu_assign_pointer(p, v)	({ \
						smp_wmb(); \
						(p) = (v); \
					})

extern void rcu_init(void);

/* Exported interfaces */
extern void FASTCALL(call_rcu(struct rcu_head *head, 
				void (*func)(struct rcu_head *head)));
extern void rcu_read_lock(void);
extern void rcu_read_unlock(void);
extern void synchronize_kernel(void);
extern int rcu_pending(int cpu);
extern void rcu_check_callbacks(int cpu, int user);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */

[-- Attachment #3: rcupdate.c --]
[-- Type: text/plain, Size: 5002 bytes --]

/*
 * Read-Copy Update mechanism for mutual exclusion
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (C) IBM Corporation, 2001
 *
 * Authors: Dipankar Sarma <dipankar@in.ibm.com>
 *	    Manfred Spraul <manfred@colorfullife.com>
 * 
 * Based on the original work by Paul McKenney <paulmck@us.ibm.com>
 * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
 * Papers:
 * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
 * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
 *
 * For detailed explanation of Read-Copy Update mechanism see -
 * 		http://lse.sourceforge.net/locking/rcupdate.html
 *
 */
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/sched.h>
#include <asm/atomic.h>
#include <linux/bitops.h>
#include <linux/module.h>
#include <linux/completion.h>
#include <linux/moduleparam.h>
#include <linux/percpu.h>
#include <linux/notifier.h>
#include <linux/rcupdate.h>
#include <linux/cpu.h>

#define GRACE_PERIODS_PER_SEC 10

struct rcu_data {
	rwlock_t	lock;
	long		batch;
	struct rcu_head	*waitlist;
	struct rcu_head	**waittail;
	struct rcu_head	*donelist;
	struct rcu_head	**donetail;
};
struct rcu_ctrlblk {
	long		batch;
	unsigned long	last_sk;
};
DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
	.lock = RW_LOCK_UNLOCKED,
	.batch = 0,
	.waitlist = NULL,
	.donelist = NULL
};
struct rcu_ctrlblk rcu_ctrlblk = {
	.batch = 0,
};

void rcu_init(void)
{
	int cpu;
	struct rcu_data *rdp;

	for_each_cpu(cpu) {
		rdp = &per_cpu(rcu_data, cpu);
		rdp->waittail = &rdp->waitlist;
		rdp->donetail = &rdp->donelist;
	}
}

void rcu_read_lock(void)
{
	preempt_disable();
	if (current->rcu_read_lock_nesting++ == 0 * current->static_prio) {
		current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
		read_lock(current->rcu_read_lock_ptr);
	}
	preempt_enable();
}

void rcu_read_unlock(void)
{
	preempt_disable();
	if (--current->rcu_read_lock_nesting == 0) {
		read_unlock(current->rcu_read_lock_ptr);
	}
	preempt_enable();
}

void _synchronize_kernel(void)
{
	int cpu;

	for_each_cpu(cpu) {  /* _online() or _present() races with hotplug */
		write_lock(per_cpu(rcu_data, cpu));
	}
	rcu_ctrlblk.batch++;
	rcu_ctrlblk.last_sk = jiffies;
	for_each_cpu(cpu) {
		write_unlock(per_cpu(rcu_data, cpu));
	}
}

void synchronize_kernel(void)
{
	long oldbatch;

	smp_mb();
	oldbatch = rcu_ctrlblk.batch;
	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
	if (rcu_ctrlblk.batch == oldbatch) {
		_synchronize_kernel();
	}
}

void rcu_advance_callbacks(void)
{
	unsigned long flags;
	struct rcu_data *rdp;

	local_irq_save(flags);  /* allow invocation from OOM handler. */
	rdp = &__get_cpu_var(rcu_data);
	smp_mb();	/* prevent sampling batch # before list removal. */
	if (rdp->batch != rcu_ctrlblk.batch) {
		*rdp->donetail = rdp->waitlist;
		rdp->donetail = rdp->waittail;
		rdp->waitlist = NULL;
		rdp->waittail = &rdp->waitlist;
		rdp->batch = rcu_ctrlblk.batch;
	}
	local_irq_restore(flags);
}

void call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rcu_advance_callbacks();
	rdp = &__get_cpu_var(rcu_data);
	*rdp->waittail = head;
	rdp->waittail = &head->next;
	local_irq_restore(flags);
}

void rcu_process_callbacks(void)
{
	unsigned long flags;
	struct rcu_head *next, *list;
	struct rcu_data *rdp;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	list = rdp->donelist;
	if (list == NULL) {
		local_irq_restore(flags);
		return;
	}
	rdp->donelist = NULL;
	rdp->donetail = &rdp->waitlist;
	local_irq_restore(flags);
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
	}
}

void rcu_check_callbacks(int cpu, int user)
{
	if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) > 
	    HZ/GRACE_PERIODS_PER_SEC) {
		synchronize_kernel();
		rcu_advance_callbacks();
		rcu_process_callbacks();
	}
}

int rcu_pending(int cpu)
{
	unsigned long flags;
	struct rcu_data *rdp;
	int retval;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	retval = (rdp->waitlist || rdp->donelist);
	local_irq_restore(flags);
	return (retval);
}

EXPORT_SYMBOL(call_rcu);
EXPORT_SYMBOL(synchronize_kernel);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 20:35       ` Paul E. McKenney
@ 2005-03-18 22:22         ` Paul E. McKenney
  2005-03-19  0:48           ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-18 22:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 24448 bytes --]

On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote:
> Compiles, probably dies horribly.  "diff" didn't do such a good job
> on this one, so attaching the raw rcupdate.[hc] files as well.

My prediction was all too accurate.  ;-)

The attached patch at least boots on a 1-CPU x86 box.  I added some
interrupt disabling that is a bad idea in real-time preempt kernels,
but necessary for stock kernels to even have a ghost of a chance.

Again, the diff is quite confusing to read (for me, anyway!), so attached
the rcupdate.[hc] files.

Assuming this patch survives the LTP run (hah!!!), next step is a small
SMP system.

						Thanx, Paul

Signed-off-by: <paulmck@us.ibm.com>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h	Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h	Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
        (ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
 
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-	long	cur;		/* Current batch number.                      */
-	long	completed;	/* Number of the last completed batch         */
-	int	next_pending;	/* Is the next batch already waiting?         */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-        return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-        return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-	/* 1) quiescent state handling : */
-	long		quiescbatch;     /* Batch # for grace period */
-	int		passed_quiesc;	 /* User-mode/idle loop etc. */
-	int		qs_pending;	 /* core waits for quiesc state */
-
-	/* 2) batch handling */
-	long  	       	batch;           /* Batch # for current RCU batch */
-	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
-	struct rcu_head *donelist;
-	struct rcu_head **donetail;
-	int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-						struct rcu_data *rdp)
-{
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
-
-	/* This cpu has no pending entries, but there are new entries */
-	if (!rdp->curlist && rdp->nxtlist)
-		return 1;
-
-	/* This cpu has finished callbacks to invoke */
-	if (rdp->donelist)
-		return 1;
-
-	/* The rcu core waits for a quiescent state from the cpu */
-	if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-		return 1;
-
-	/* nothing to do */
-	return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-	return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-		__rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections.  One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes call_rcu() to register
- * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
- * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
- * callback is invoked.  This is legal, because the RCU read-side critical
- * section that was running concurrently with the call_rcu() (and which
- * therefore might be referencing something that the corresponding RCU
- * callback would free up) has completed before the corresponding
- * RCU callback is invoked.
- *
- * RCU read-side critical sections may be nested.  Any deferred actions
- * will be deferred until the outermost RCU read-side critical section
- * completes.
- *
- * It is illegal to block while in an RCU read-side critical section.
- */
-#define rcu_read_lock()		preempt_disable()
-
-/**
- * rcu_read_unlock - marks the end of an RCU read-side critical section.
- *
- * See rcu_read_lock() for more information.
- */
-#define rcu_read_unlock()	preempt_enable()
-
-/*
- * So where is rcu_write_lock()?  It does not exist, as there is no
- * way for writers to lock out RCU readers.  This is a feature, not
- * a bug -- this property is what provides RCU's performance benefits.
- * Of course, writers must coordinate with each other.  The normal
- * spinlock primitives work well for this, but any other technique may be
- * used as well.  RCU does not care how the writers keep out of each
- * others' way, as long as they do so.
- */
-
-/**
- * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
- *
- * This is equivalent of rcu_read_lock(), but to be used when updates
- * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
- * consider completion of a softirq handler to be a quiescent state,
- * a process in RCU read-side critical section must be protected by
- * disabling softirqs. Read-side critical sections in interrupt context
- * can use just rcu_read_lock().
- *
- */
-#define rcu_read_lock_bh()	local_bh_disable()
-
-/*
- * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
- *
- * See rcu_read_lock_bh() for more information.
- */
-#define rcu_read_unlock_bh()	local_bh_enable()
+#define rcu_read_lock_bh() rcu_read_lock()
+#define rcu_read_unlock_bh() rcu_read_unlock()
+#define call_rcu_bh(head, func) call_rcu(head, func)
+#define rcu_bh_qsctr_inc(cpu)
+#define rcu_qsctr_inc(cpu)
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
@@ -257,15 +99,15 @@ static inline int rcu_pending(int cpu)
 					})
 
 extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);
 
 /* Exported interfaces */
 extern void FASTCALL(call_rcu(struct rcu_head *head, 
 				void (*func)(struct rcu_head *head)));
-extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
-				void (*func)(struct rcu_head *head)));
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
 extern void synchronize_kernel(void);
+extern int rcu_pending(int cpu);
+extern void rcu_check_callbacks(int cpu, int user);
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUPDATE_H */
diff -urpN -X ../dontdiff linux-2.5/include/linux/sched.h linux-2.5-rtRCU/include/linux/sched.h
--- linux-2.5/include/linux/sched.h	Wed Mar  9 12:37:07 2005
+++ linux-2.5-rtRCU/include/linux/sched.h	Fri Mar 18 11:33:13 2005
@@ -707,6 +707,9 @@ struct task_struct {
   	struct mempolicy *mempolicy;
 	short il_next;
 #endif
+
+	int rcu_read_lock_nesting;
+	rwlock_t *rcu_read_lock_ptr;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X ../dontdiff linux-2.5/kernel/rcupdate.c linux-2.5-rtRCU/kernel/rcupdate.c
--- linux-2.5/kernel/rcupdate.c	Wed Mar  9 12:37:22 2005
+++ linux-2.5-rtRCU/kernel/rcupdate.c	Fri Mar 18 13:21:55 2005
@@ -47,424 +47,173 @@
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
 
-/* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk = 
-	{ .cur = -300, .completed = -300 };
-struct rcu_ctrlblk rcu_bh_ctrlblk =
-	{ .cur = -300, .completed = -300 };
-
-/* Bookkeeping of the progress of the grace period */
-struct rcu_state {
-	spinlock_t	lock; /* Guard this struct and writes to rcu_ctrlblk */
-	cpumask_t	cpumask; /* CPUs that need to switch in order    */
-	                              /* for current batch to proceed.        */
+#define GRACE_PERIODS_PER_SEC 10
+
+struct rcu_data {
+	rwlock_t	lock;
+	long		batch;
+	struct rcu_head	*waitlist;
+	struct rcu_head	**waittail;
+	struct rcu_head	*donelist;
+	struct rcu_head	**donetail;
+};
+struct rcu_ctrlblk {
+	long		batch;
+	unsigned long	last_sk;
+};
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
+	.lock = RW_LOCK_UNLOCKED,
+	.batch = 0,
+	.waitlist = NULL,
+	.donelist = NULL
+};
+struct rcu_ctrlblk rcu_ctrlblk = {
+	.batch = 0,
 };
 
-static struct rcu_state rcu_state ____cacheline_maxaligned_in_smp =
-	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-static struct rcu_state rcu_bh_state ____cacheline_maxaligned_in_smp =
-	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10;
-
-/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void fastcall call_rcu(struct rcu_head *head,
-				void (*func)(struct rcu_head *rcu))
+void rcu_init(void)
 {
-	unsigned long flags;
+	int cpu;
 	struct rcu_data *rdp;
 
-	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
-	local_irq_restore(flags);
+	for_each_cpu(cpu) {
+		rdp = &per_cpu(rcu_data, cpu);
+		rdp->waittail = &rdp->waitlist;
+		rdp->donetail = &rdp->donelist;
+	}
 }
 
-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
-				void (*func)(struct rcu_head *rcu))
+void rcu_read_lock(void)
 {
 	unsigned long flags;
-	struct rcu_data *rdp;
 
-	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_bh_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
+		read_lock(current->rcu_read_lock_ptr);
+	}
 	local_irq_restore(flags);
 }
 
-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
+void rcu_read_unlock(void)
 {
-	struct rcu_head *next, *list;
-	int count = 0;
+	unsigned long flags;
 
-	list = rdp->donelist;
-	while (list) {
-		next = rdp->donelist = list->next;
-		list->func(list);
-		list = next;
-		if (++count >= maxbatch)
-			break;
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	if (--current->rcu_read_lock_nesting == 0) {
+		read_unlock(current->rcu_read_lock_ptr);
 	}
-	if (!rdp->donelist)
-		rdp->donetail = &rdp->donelist;
-	else
-		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+	local_irq_restore(flags);
 }
 
-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- *   This is done by rcu_start_batch. The start is not broadcasted to
- *   all cpus, they must pick this up by comparing rcp->cur with
- *   rdp->quiescbatch. All cpus are recorded  in the
- *   rcu_state.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- *   Since the start of the grace period is not broadcasted, at least two
- *   calls to rcu_check_quiescent_state are required:
- *   The first call just notices that a new grace period is running. The
- *   following calls check if there was a quiescent state since the beginning
- *   of the grace period. If so, it updates rcu_state.cpumask. If
- *   the bitmap is empty, then the grace period is completed.
- *   rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- *   period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_state.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
-				int next_pending)
+void _synchronize_kernel(void)
 {
-	if (next_pending)
-		rcp->next_pending = 1;
+	int cpu;
+	unsigned long flags;
 
-	if (rcp->next_pending &&
-			rcp->completed == rcp->cur) {
-		/* Can't change, since spin lock held. */
-		cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
-		rcp->next_pending = 0;
-		/* next_pending == 0 must be visible in __rcu_process_callbacks()
-		 * before it can see new value of cur.
-		 */
-		smp_wmb();
-		rcp->cur++;
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	for_each_cpu(cpu) {  /* _online() or _present() races with hotplug */
+		write_lock(per_cpu(rcu_data, cpu));
+	}
+	rcu_ctrlblk.batch++;
+	rcu_ctrlblk.last_sk = jiffies;
+	for_each_cpu(cpu) {
+		write_unlock(per_cpu(rcu_data, cpu));
 	}
+	local_irq_restore(flags);
 }
 
-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
+void synchronize_kernel(void)
 {
-	cpu_clear(cpu, rsp->cpumask);
-	if (cpus_empty(rsp->cpumask)) {
-		/* batch completed ! */
-		rcp->completed = rcp->cur;
-		rcu_start_batch(rcp, rsp, 0);
-	}
-}
+	long oldbatch;
 
-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
-			struct rcu_state *rsp, struct rcu_data *rdp)
-{
-	if (rdp->quiescbatch != rcp->cur) {
-		/* start new grace period: */
-		rdp->qs_pending = 1;
-		rdp->passed_quiesc = 0;
-		rdp->quiescbatch = rcp->cur;
-		return;
+	smp_mb();
+	oldbatch = rcu_ctrlblk.batch;
+	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
+	if (rcu_ctrlblk.batch == oldbatch) {
+		_synchronize_kernel();
 	}
-
-	/* Grace period already completed for this cpu?
-	 * qs_pending is checked instead of the actual bitmap to avoid
-	 * cacheline trashing.
-	 */
-	if (!rdp->qs_pending)
-		return;
-
-	/* 
-	 * Was there a quiescent state since the beginning of the grace
-	 * period? If no, then exit and wait for the next call.
-	 */
-	if (!rdp->passed_quiesc)
-		return;
-	rdp->qs_pending = 0;
-
-	spin_lock(&rsp->lock);
-	/*
-	 * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
-	 * during cpu startup. Ignore the quiescent state.
-	 */
-	if (likely(rdp->quiescbatch == rcp->cur))
-		cpu_quiet(rdp->cpu, rcp, rsp);
-
-	spin_unlock(&rsp->lock);
-}
-
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
-				struct rcu_head **tail)
-{
-	local_irq_disable();
-	*this_rdp->nxttail = list;
-	if (list)
-		this_rdp->nxttail = tail;
-	local_irq_enable();
 }
 
-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
-	struct rcu_ctrlblk *rcp, struct rcu_state *rsp, struct rcu_data *rdp)
+void rcu_advance_callbacks(void)
 {
-	/* if the cpu going offline owns the grace period
-	 * we can block indefinitely waiting for it, so flush
-	 * it here
-	 */
-	spin_lock_bh(&rsp->lock);
-	if (rcp->cur != rcp->completed)
-		cpu_quiet(rdp->cpu, rcp, rsp);
-	spin_unlock_bh(&rsp->lock);
-	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
-	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+	unsigned long flags;
+	struct rcu_data *rdp;
 
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	rdp = &__get_cpu_var(rcu_data);
+	smp_mb();	/* prevent sampling batch # before list removal. */
+	if (rdp->batch != rcu_ctrlblk.batch) {
+		*rdp->donetail = rdp->waitlist;
+		rdp->donetail = rdp->waittail;
+		rdp->waitlist = NULL;
+		rdp->waittail = &rdp->waitlist;
+		rdp->batch = rcu_ctrlblk.batch;
+	}
+	local_irq_restore(flags);
 }
-static void rcu_offline_cpu(int cpu)
+
+void call_rcu(struct rcu_head *head,
+	 void (*func)(struct rcu_head *rcu))
 {
-	struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
-	struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+	unsigned long flags;
+	struct rcu_data *rdp;
 
-	__rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &rcu_state,
-					&per_cpu(rcu_data, cpu));
-	__rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &rcu_bh_state,
-					&per_cpu(rcu_bh_data, cpu));
-	put_cpu_var(rcu_data);
-	put_cpu_var(rcu_bh_data);
-	tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+	head->func = func;
+	head->next = NULL;
+	local_irq_save(flags);
+	rcu_advance_callbacks();
+	rdp = &__get_cpu_var(rcu_data);
+	*rdp->waittail = head;
+	rdp->waittail = &head->next;
+	local_irq_restore(flags);
 }
 
-#else
-
-static void rcu_offline_cpu(int cpu)
+void rcu_process_callbacks(void)
 {
-}
-
-#endif
+	unsigned long flags;
+	struct rcu_head *next, *list;
+	struct rcu_data *rdp;
 
-/*
- * This does the RCU processing work from tasklet context. 
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
-			struct rcu_state *rsp, struct rcu_data *rdp)
-{
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
-		*rdp->donetail = rdp->curlist;
-		rdp->donetail = rdp->curtail;
-		rdp->curlist = NULL;
-		rdp->curtail = &rdp->curlist;
+	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_data);
+	list = rdp->donelist;
+	if (list == NULL) {
+		local_irq_restore(flags);
+		return;
 	}
-
-	local_irq_disable();
-	if (rdp->nxtlist && !rdp->curlist) {
-		rdp->curlist = rdp->nxtlist;
-		rdp->curtail = rdp->nxttail;
-		rdp->nxtlist = NULL;
-		rdp->nxttail = &rdp->nxtlist;
-		local_irq_enable();
-
-		/*
-		 * start the next batch of callbacks
-		 */
-
-		/* determine batch number */
-		rdp->batch = rcp->cur + 1;
-		/* see the comment and corresponding wmb() in
-		 * the rcu_start_batch()
-		 */
-		smp_rmb();
-
-		if (!rcp->next_pending) {
-			/* and start it/schedule start if it's a new batch */
-			spin_lock(&rsp->lock);
-			rcu_start_batch(rcp, rsp, 1);
-			spin_unlock(&rsp->lock);
-		}
-	} else {
-		local_irq_enable();
+	rdp->donelist = NULL;
+	rdp->donetail = &rdp->waitlist;
+	local_irq_restore(flags);
+	while (list) {
+		next = list->next;
+		list->func(list);
+		list = next;
 	}
-	rcu_check_quiescent_state(rcp, rsp, rdp);
-	if (rdp->donelist)
-		rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
-	__rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
-				&__get_cpu_var(rcu_data));
-	__rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
-				&__get_cpu_var(rcu_bh_data));
 }
 
 void rcu_check_callbacks(int cpu, int user)
 {
-	if (user || 
-	    (idle_cpu(cpu) && !in_softirq() && 
-				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
-		rcu_qsctr_inc(cpu);
-		rcu_bh_qsctr_inc(cpu);
-	} else if (!in_softirq())
-		rcu_bh_qsctr_inc(cpu);
-	tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
-						struct rcu_data *rdp)
-{
-	memset(rdp, 0, sizeof(*rdp));
-	rdp->curtail = &rdp->curlist;
-	rdp->nxttail = &rdp->nxtlist;
-	rdp->donetail = &rdp->donelist;
-	rdp->quiescbatch = rcp->completed;
-	rdp->qs_pending = 0;
-	rdp->cpu = cpu;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
-	rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
-	rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
-	tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __devinit rcu_cpu_notify(struct notifier_block *self, 
-				unsigned long action, void *hcpu)
-{
-	long cpu = (long)hcpu;
-	switch (action) {
-	case CPU_UP_PREPARE:
-		rcu_online_cpu(cpu);
-		break;
-	case CPU_DEAD:
-		rcu_offline_cpu(cpu);
-		break;
-	default:
-		break;
+	if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) > 
+	    HZ/GRACE_PERIODS_PER_SEC) {
+		synchronize_kernel();
+		rcu_advance_callbacks();
+		rcu_process_callbacks();
 	}
-	return NOTIFY_OK;
 }
 
-static struct notifier_block __devinitdata rcu_nb = {
-	.notifier_call	= rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism.  Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
-void __init rcu_init(void)
+int rcu_pending(int cpu)
 {
-	rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
-			(void *)(long)smp_processor_id());
-	/* Register notifier for non-boot CPUs */
-	register_cpu_notifier(&rcu_nb);
-}
-
-struct rcu_synchronize {
-	struct rcu_head head;
-	struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head  *head)
-{
-	struct rcu_synchronize *rcu;
-
-	rcu = container_of(head, struct rcu_synchronize, head);
-	complete(&rcu->completion);
-}
-
-/**
- * synchronize_kernel - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void synchronize_kernel(void)
-{
-	struct rcu_synchronize rcu;
-
-	init_completion(&rcu.completion);
-	/* Will wake me after RCU finished */
-	call_rcu(&rcu.head, wakeme_after_rcu);
+	unsigned long flags;
+	struct rcu_data *rdp;
+	int retval;
 
-	/* Wait for it */
-	wait_for_completion(&rcu.completion);
+	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_data);
+	retval = (rdp->waitlist || rdp->donelist);
+	local_irq_restore(flags);
+	return (retval);
 }
 
-module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
-EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);

[-- Attachment #2: rcupdate.h --]
[-- Type: text/plain, Size: 3680 bytes --]

/*
 * Read-Copy Update mechanism for mutual exclusion 
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (C) IBM Corporation, 2001
 *
 * Author: Dipankar Sarma <dipankar@in.ibm.com>
 * 
 * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
 * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
 * Papers:
 * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
 * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
 *
 * For detailed explanation of Read-Copy Update mechanism see -
 * 		http://lse.sourceforge.net/locking/rcupdate.html
 *
 */

#ifndef __LINUX_RCUPDATE_H
#define __LINUX_RCUPDATE_H

#ifdef __KERNEL__

#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/threads.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/seqlock.h>

/**
 * struct rcu_head - callback structure for use with RCU
 * @next: next update requests in a list
 * @func: actual update function to call after the grace period.
 */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

#define RCU_HEAD_INIT(head) { .next = NULL, .func = NULL }
#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT(head)
#define INIT_RCU_HEAD(ptr) do { \
       (ptr)->next = NULL; (ptr)->func = NULL; \
} while (0)

#define rcu_read_lock_bh() rcu_read_lock()
#define rcu_read_unlock_bh() rcu_read_unlock()
#define call_rcu_bh(head, func) call_rcu(head, func)
#define rcu_bh_qsctr_inc(cpu)
#define rcu_qsctr_inc(cpu)

/**
 * rcu_dereference - fetch an RCU-protected pointer in an
 * RCU read-side critical section.  This pointer may later
 * be safely dereferenced.
 *
 * Inserts memory barriers on architectures that require them
 * (currently only the Alpha), and, more importantly, documents
 * exactly which pointers are protected by RCU.
 */

#define rcu_dereference(p)     ({ \
				typeof(p) _________p1 = p; \
				smp_read_barrier_depends(); \
				(_________p1); \
				})

/**
 * rcu_assign_pointer - assign (publicize) a pointer to a newly
 * initialized structure that will be dereferenced by RCU read-side
 * critical sections.  Returns the value assigned.
 *
 * Inserts memory barriers on architectures that require them
 * (pretty much all of them other than x86), and also prevents
 * the compiler from reordering the code that initializes the
 * structure after the pointer assignment.  More importantly, this
 * call documents which pointers will be dereferenced by RCU read-side
 * code.
 */

#define rcu_assign_pointer(p, v)	({ \
						smp_wmb(); \
						(p) = (v); \
					})

extern void rcu_init(void);

/* Exported interfaces */
extern void FASTCALL(call_rcu(struct rcu_head *head, 
				void (*func)(struct rcu_head *head)));
extern void rcu_read_lock(void);
extern void rcu_read_unlock(void);
extern void synchronize_kernel(void);
extern int rcu_pending(int cpu);
extern void rcu_check_callbacks(int cpu, int user);

#endif /* __KERNEL__ */
#endif /* __LINUX_RCUPDATE_H */

[-- Attachment #3: rcupdate.c --]
[-- Type: text/plain, Size: 5248 bytes --]

/*
 * Read-Copy Update mechanism for mutual exclusion
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (C) IBM Corporation, 2001
 *
 * Authors: Dipankar Sarma <dipankar@in.ibm.com>
 *	    Manfred Spraul <manfred@colorfullife.com>
 * 
 * Based on the original work by Paul McKenney <paulmck@us.ibm.com>
 * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
 * Papers:
 * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
 * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
 *
 * For detailed explanation of Read-Copy Update mechanism see -
 * 		http://lse.sourceforge.net/locking/rcupdate.html
 *
 */
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/smp.h>
#include <linux/interrupt.h>
#include <linux/sched.h>
#include <asm/atomic.h>
#include <linux/bitops.h>
#include <linux/module.h>
#include <linux/completion.h>
#include <linux/moduleparam.h>
#include <linux/percpu.h>
#include <linux/notifier.h>
#include <linux/rcupdate.h>
#include <linux/cpu.h>

#define GRACE_PERIODS_PER_SEC 10

struct rcu_data {
	rwlock_t	lock;
	long		batch;
	struct rcu_head	*waitlist;
	struct rcu_head	**waittail;
	struct rcu_head	*donelist;
	struct rcu_head	**donetail;
};
struct rcu_ctrlblk {
	long		batch;
	unsigned long	last_sk;
};
DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
	.lock = RW_LOCK_UNLOCKED,
	.batch = 0,
	.waitlist = NULL,
	.donelist = NULL
};
struct rcu_ctrlblk rcu_ctrlblk = {
	.batch = 0,
};

void rcu_init(void)
{
	int cpu;
	struct rcu_data *rdp;

	for_each_cpu(cpu) {
		rdp = &per_cpu(rcu_data, cpu);
		rdp->waittail = &rdp->waitlist;
		rdp->donetail = &rdp->donelist;
	}
}

void rcu_read_lock(void)
{
	unsigned long flags;

	local_irq_save(flags);  /* allow invocation from OOM handler. */
	if (current->rcu_read_lock_nesting++ == 0) {
		current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
		read_lock(current->rcu_read_lock_ptr);
	}
	local_irq_restore(flags);
}

void rcu_read_unlock(void)
{
	unsigned long flags;

	local_irq_save(flags);  /* allow invocation from OOM handler. */
	if (--current->rcu_read_lock_nesting == 0) {
		read_unlock(current->rcu_read_lock_ptr);
	}
	local_irq_restore(flags);
}

void _synchronize_kernel(void)
{
	int cpu;
	unsigned long flags;

	local_irq_save(flags);  /* allow invocation from OOM handler. */
	for_each_cpu(cpu) {  /* _online() or _present() races with hotplug */
		write_lock(per_cpu(rcu_data, cpu));
	}
	rcu_ctrlblk.batch++;
	rcu_ctrlblk.last_sk = jiffies;
	for_each_cpu(cpu) {
		write_unlock(per_cpu(rcu_data, cpu));
	}
	local_irq_restore(flags);
}

void synchronize_kernel(void)
{
	long oldbatch;

	smp_mb();
	oldbatch = rcu_ctrlblk.batch;
	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
	if (rcu_ctrlblk.batch == oldbatch) {
		_synchronize_kernel();
	}
}

void rcu_advance_callbacks(void)
{
	unsigned long flags;
	struct rcu_data *rdp;

	local_irq_save(flags);  /* allow invocation from OOM handler. */
	rdp = &__get_cpu_var(rcu_data);
	smp_mb();	/* prevent sampling batch # before list removal. */
	if (rdp->batch != rcu_ctrlblk.batch) {
		*rdp->donetail = rdp->waitlist;
		rdp->donetail = rdp->waittail;
		rdp->waitlist = NULL;
		rdp->waittail = &rdp->waitlist;
		rdp->batch = rcu_ctrlblk.batch;
	}
	local_irq_restore(flags);
}

void call_rcu(struct rcu_head *head,
	 void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rcu_advance_callbacks();
	rdp = &__get_cpu_var(rcu_data);
	*rdp->waittail = head;
	rdp->waittail = &head->next;
	local_irq_restore(flags);
}

void rcu_process_callbacks(void)
{
	unsigned long flags;
	struct rcu_head *next, *list;
	struct rcu_data *rdp;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	list = rdp->donelist;
	if (list == NULL) {
		local_irq_restore(flags);
		return;
	}
	rdp->donelist = NULL;
	rdp->donetail = &rdp->waitlist;
	local_irq_restore(flags);
	while (list) {
		next = list->next;
		list->func(list);
		list = next;
	}
}

void rcu_check_callbacks(int cpu, int user)
{
	if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) > 
	    HZ/GRACE_PERIODS_PER_SEC) {
		synchronize_kernel();
		rcu_advance_callbacks();
		rcu_process_callbacks();
	}
}

int rcu_pending(int cpu)
{
	unsigned long flags;
	struct rcu_data *rdp;
	int retval;

	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	retval = (rdp->waitlist || rdp->donelist);
	local_irq_restore(flags);
	return (retval);
}

EXPORT_SYMBOL(call_rcu);
EXPORT_SYMBOL(synchronize_kernel);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 16:02     ` Ingo Molnar
  2005-03-18 16:55       ` Esben Nielsen
@ 2005-03-18 22:26       ` Herbert Xu
  2005-03-19 16:31         ` Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Herbert Xu @ 2005-03-18 22:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: bhuey, paulmck, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

Ingo Molnar <mingo@elte.hu> wrote:
>
> i really have no intention to allow multiple readers for rt-mutexes. We
> got away with that so far, and i'd like to keep it so. Imagine 100
> threads all blocked in the same critical section (holding the read-lock)
> when a highprio writer thread comes around: instant 100x latency to let
> all of them roll forward. The only sane solution is to not allow
> excessive concurrency. (That limits SMP scalability, but there's no
> other choice i can see.)

What about allowing only as many concurrent readers as there are CPUs?
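
As a rough sketch of what such a cap might look like (hypothetical only, not
from any posted patch -- it reuses the per-task nesting counter from Paul's
patch, and rcu_reader_slots/rcu_reader_slots_init() are made-up names):

	static struct semaphore rcu_reader_slots;

	void __init rcu_reader_slots_init(void)
	{
		/* One reader slot per CPU present at boot. */
		sema_init(&rcu_reader_slots, num_online_cpus());
	}

	void rcu_read_lock(void)
	{
		/* At most num_online_cpus() outermost readers at a time. */
		if (current->rcu_read_lock_nesting++ == 0)
			down(&rcu_reader_slots);
	}

	void rcu_read_unlock(void)
	{
		if (--current->rcu_read_lock_nesting == 0)
			up(&rcu_reader_slots);
	}

A writer wanting to exclude readers would then have to down() every slot.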
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 22:22         ` Paul E. McKenney
@ 2005-03-19  0:48           ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-19  0:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dipankar, shemminger, akpm, torvalds, rusty, tgall, jim.houston,
	manfred, gh, linux-kernel

On Fri, Mar 18, 2005 at 02:22:30PM -0800, Paul E. McKenney wrote:
> On Fri, Mar 18, 2005 at 12:35:17PM -0800, Paul E. McKenney wrote:
> > Compiles, probably dies horribly.  "diff" didn't do such a good job
> > on this one, so attaching the raw rcupdate.[hc] files as well.
> 
> My prediction was all too accurate.  ;-)
> 
> The attached patch at least boots on a 1-CPU x86 box.  I added some
> interrupt disabling that is a bad idea in real-time preempt kernels,
> but necessary for stock kernels to even have a ghost of a chance.
> 
> Again, the diff is quite confusing to read (for me, anyway!), so I have
> attached the rcupdate.[hc] files.
> 
> Assuming this patch survives the LTP run (hah!!!), next step is a small
> SMP system.

It did survive the UP LTP run; here is an updated patch with a fix needed
for SMP.  I am having trouble with my test system (independent of this patch),
but hopefully will get more testing in over the weekend.
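
The SMP fix is confined to _synchronize_kernel(): the write side must take
the address of each CPU's rwlock rather than passing the per-CPU structure
itself, as in the hunk in the diff below:

	for_each_cpu(cpu) {  /* _online() or _present() races with hotplug */
		write_lock(&per_cpu(rcu_data, cpu).lock);
	}
	rcu_ctrlblk.batch++;
	rcu_ctrlblk.last_sk = jiffies;
	for_each_cpu(cpu) {
		write_unlock(&per_cpu(rcu_data, cpu).lock);
	}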

						Thanx, Paul
 
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>

diff -urpN -X ../dontdiff linux-2.5/include/linux/rcupdate.h linux-2.5-rtRCU/include/linux/rcupdate.h
--- linux-2.5/include/linux/rcupdate.h	Wed Mar  9 12:37:06 2005
+++ linux-2.5-rtRCU/include/linux/rcupdate.h	Fri Mar 18 11:37:02 2005
@@ -58,169 +58,11 @@ struct rcu_head {
        (ptr)->next = NULL; (ptr)->func = NULL; \
 } while (0)
 
-
-
-/* Global control variables for rcupdate callback mechanism. */
-struct rcu_ctrlblk {
-	long	cur;		/* Current batch number.                      */
-	long	completed;	/* Number of the last completed batch         */
-	int	next_pending;	/* Is the next batch already waiting?         */
-} ____cacheline_maxaligned_in_smp;
-
-/* Is batch a before batch b ? */
-static inline int rcu_batch_before(long a, long b)
-{
-        return (a - b) < 0;
-}
-
-/* Is batch a after batch b ? */
-static inline int rcu_batch_after(long a, long b)
-{
-        return (a - b) > 0;
-}
-
-/*
- * Per-CPU data for Read-Copy UPdate.
- * nxtlist - new callbacks are added here
- * curlist - current batch for which quiescent cycle started if any
- */
-struct rcu_data {
-	/* 1) quiescent state handling : */
-	long		quiescbatch;     /* Batch # for grace period */
-	int		passed_quiesc;	 /* User-mode/idle loop etc. */
-	int		qs_pending;	 /* core waits for quiesc state */
-
-	/* 2) batch handling */
-	long  	       	batch;           /* Batch # for current RCU batch */
-	struct rcu_head *nxtlist;
-	struct rcu_head **nxttail;
-	struct rcu_head *curlist;
-	struct rcu_head **curtail;
-	struct rcu_head *donelist;
-	struct rcu_head **donetail;
-	int cpu;
-};
-
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-extern struct rcu_ctrlblk rcu_ctrlblk;
-extern struct rcu_ctrlblk rcu_bh_ctrlblk;
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-
-static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
-						struct rcu_data *rdp)
-{
-	/* This cpu has pending rcu entries and the grace period
-	 * for them has completed.
-	 */
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
-		return 1;
-
-	/* This cpu has no pending entries, but there are new entries */
-	if (!rdp->curlist && rdp->nxtlist)
-		return 1;
-
-	/* This cpu has finished callbacks to invoke */
-	if (rdp->donelist)
-		return 1;
-
-	/* The rcu core waits for a quiescent state from the cpu */
-	if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
-		return 1;
-
-	/* nothing to do */
-	return 0;
-}
-
-static inline int rcu_pending(int cpu)
-{
-	return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
-		__rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
-}
-
-/**
- * rcu_read_lock - mark the beginning of an RCU read-side critical section.
- *
- * When synchronize_kernel() is invoked on one CPU while other CPUs
- * are within RCU read-side critical sections, then the
- * synchronize_kernel() is guaranteed to block until after all the other
- * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
- * on one CPU while other CPUs are within RCU read-side critical
- * sections, invocation of the corresponding RCU callback is deferred
- * until after the all the other CPUs exit their critical sections.
- *
- * Note, however, that RCU callbacks are permitted to run concurrently
- * with RCU read-side critical sections.  One way that this can happen
- * is via the following sequence of events: (1) CPU 0 enters an RCU
- * read-side critical section, (2) CPU 1 invokes call_rcu() to register
- * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
- * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
- * callback is invoked.  This is legal, because the RCU read-side critical
- * section that was running concurrently with the call_rcu() (and which
- * therefore might be referencing something that the corresponding RCU
- * callback would free up) has completed before the corresponding
- * RCU callback is invoked.
- *
- * RCU read-side critical sections may be nested.  Any deferred actions
- * will be deferred until the outermost RCU read-side critical section
- * completes.
- *
- * It is illegal to block while in an RCU read-side critical section.
- */
-#define rcu_read_lock()		preempt_disable()
-
-/**
- * rcu_read_unlock - marks the end of an RCU read-side critical section.
- *
- * See rcu_read_lock() for more information.
- */
-#define rcu_read_unlock()	preempt_enable()
-
-/*
- * So where is rcu_write_lock()?  It does not exist, as there is no
- * way for writers to lock out RCU readers.  This is a feature, not
- * a bug -- this property is what provides RCU's performance benefits.
- * Of course, writers must coordinate with each other.  The normal
- * spinlock primitives work well for this, but any other technique may be
- * used as well.  RCU does not care how the writers keep out of each
- * others' way, as long as they do so.
- */
-
-/**
- * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
- *
- * This is equivalent of rcu_read_lock(), but to be used when updates
- * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
- * consider completion of a softirq handler to be a quiescent state,
- * a process in RCU read-side critical section must be protected by
- * disabling softirqs. Read-side critical sections in interrupt context
- * can use just rcu_read_lock().
- *
- */
-#define rcu_read_lock_bh()	local_bh_disable()
-
-/*
- * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
- *
- * See rcu_read_lock_bh() for more information.
- */
-#define rcu_read_unlock_bh()	local_bh_enable()
+#define rcu_read_lock_bh() rcu_read_lock()
+#define rcu_read_unlock_bh() rcu_read_unlock()
+#define call_rcu_bh(head, func) call_rcu(head, func)
+#define rcu_bh_qsctr_inc(cpu)
+#define rcu_qsctr_inc(cpu)
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
@@ -257,15 +99,15 @@ static inline int rcu_pending(int cpu)
 					})
 
 extern void rcu_init(void);
-extern void rcu_check_callbacks(int cpu, int user);
-extern void rcu_restart_cpu(int cpu);
 
 /* Exported interfaces */
 extern void FASTCALL(call_rcu(struct rcu_head *head, 
 				void (*func)(struct rcu_head *head)));
-extern void FASTCALL(call_rcu_bh(struct rcu_head *head,
-				void (*func)(struct rcu_head *head)));
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
 extern void synchronize_kernel(void);
+extern int rcu_pending(int cpu);
+extern void rcu_check_callbacks(int cpu, int user);
 
 #endif /* __KERNEL__ */
 #endif /* __LINUX_RCUPDATE_H */
diff -urpN -X ../dontdiff linux-2.5/include/linux/sched.h linux-2.5-rtRCU/include/linux/sched.h
--- linux-2.5/include/linux/sched.h	Wed Mar  9 12:37:07 2005
+++ linux-2.5-rtRCU/include/linux/sched.h	Fri Mar 18 11:33:13 2005
@@ -707,6 +707,9 @@ struct task_struct {
   	struct mempolicy *mempolicy;
 	short il_next;
 #endif
+
+	int rcu_read_lock_nesting;
+	rwlock_t *rcu_read_lock_ptr;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X ../dontdiff linux-2.5/kernel/rcupdate.c linux-2.5-rtRCU/kernel/rcupdate.c
--- linux-2.5/kernel/rcupdate.c	Wed Mar  9 12:37:22 2005
+++ linux-2.5-rtRCU/kernel/rcupdate.c	Fri Mar 18 15:28:28 2005
@@ -47,424 +47,173 @@
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
 
-/* Definition for rcupdate control block. */
-struct rcu_ctrlblk rcu_ctrlblk = 
-	{ .cur = -300, .completed = -300 };
-struct rcu_ctrlblk rcu_bh_ctrlblk =
-	{ .cur = -300, .completed = -300 };
-
-/* Bookkeeping of the progress of the grace period */
-struct rcu_state {
-	spinlock_t	lock; /* Guard this struct and writes to rcu_ctrlblk */
-	cpumask_t	cpumask; /* CPUs that need to switch in order    */
-	                              /* for current batch to proceed.        */
+#define GRACE_PERIODS_PER_SEC 10
+
+struct rcu_data {
+	rwlock_t	lock;
+	long		batch;
+	struct rcu_head	*waitlist;
+	struct rcu_head	**waittail;
+	struct rcu_head	*donelist;
+	struct rcu_head	**donetail;
+};
+struct rcu_ctrlblk {
+	long		batch;
+	unsigned long	last_sk;
+};
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = {
+	.lock = RW_LOCK_UNLOCKED,
+	.batch = 0,
+	.waitlist = NULL,
+	.donelist = NULL
+};
+struct rcu_ctrlblk rcu_ctrlblk = {
+	.batch = 0,
 };
 
-static struct rcu_state rcu_state ____cacheline_maxaligned_in_smp =
-	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-static struct rcu_state rcu_bh_state ____cacheline_maxaligned_in_smp =
-	  {.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };
-
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
-
-/* Fake initialization required by compiler */
-static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
-static int maxbatch = 10;
-
-/**
- * call_rcu - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void fastcall call_rcu(struct rcu_head *head,
-				void (*func)(struct rcu_head *rcu))
+void rcu_init(void)
 {
-	unsigned long flags;
+	int cpu;
 	struct rcu_data *rdp;
 
-	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
-	local_irq_restore(flags);
+	for_each_cpu(cpu) {
+		rdp = &per_cpu(rcu_data, cpu);
+		rdp->waittail = &rdp->waitlist;
+		rdp->donetail = &rdp->donelist;
+	}
 }
 
-/**
- * call_rcu_bh - Queue an RCU for invocation after a quicker grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual update function to be invoked after the grace period
- *
- * The update function will be invoked some time after a full grace
- * period elapses, in other words after all currently executing RCU
- * read-side critical sections have completed. call_rcu_bh() assumes
- * that the read-side critical sections end on completion of a softirq
- * handler. This means that read-side critical sections in process
- * context must not be interrupted by softirqs. This interface is to be
- * used when most of the read-side critical sections are in softirq context.
- * RCU read-side critical sections are delimited by rcu_read_lock() and
- * rcu_read_unlock(), * if in interrupt context or rcu_read_lock_bh()
- * and rcu_read_unlock_bh(), if in process context. These may be nested.
- */
-void fastcall call_rcu_bh(struct rcu_head *head,
-				void (*func)(struct rcu_head *rcu))
+void rcu_read_lock(void)
 {
 	unsigned long flags;
-	struct rcu_data *rdp;
 
-	head->func = func;
-	head->next = NULL;
-	local_irq_save(flags);
-	rdp = &__get_cpu_var(rcu_bh_data);
-	*rdp->nxttail = head;
-	rdp->nxttail = &head->next;
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_read_lock_ptr = &__get_cpu_var(rcu_data).lock;
+		read_lock(current->rcu_read_lock_ptr);
+	}
 	local_irq_restore(flags);
 }
 
-/*
- * Invoke the completed RCU callbacks. They are expected to be in
- * a per-cpu list.
- */
-static void rcu_do_batch(struct rcu_data *rdp)
+void rcu_read_unlock(void)
 {
-	struct rcu_head *next, *list;
-	int count = 0;
+	unsigned long flags;
 
-	list = rdp->donelist;
-	while (list) {
-		next = rdp->donelist = list->next;
-		list->func(list);
-		list = next;
-		if (++count >= maxbatch)
-			break;
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	if (--current->rcu_read_lock_nesting == 0) {
+		read_unlock(current->rcu_read_lock_ptr);
 	}
-	if (!rdp->donelist)
-		rdp->donetail = &rdp->donelist;
-	else
-		tasklet_schedule(&per_cpu(rcu_tasklet, rdp->cpu));
+	local_irq_restore(flags);
 }
 
-/*
- * Grace period handling:
- * The grace period handling consists out of two steps:
- * - A new grace period is started.
- *   This is done by rcu_start_batch. The start is not broadcasted to
- *   all cpus, they must pick this up by comparing rcp->cur with
- *   rdp->quiescbatch. All cpus are recorded  in the
- *   rcu_state.cpumask bitmap.
- * - All cpus must go through a quiescent state.
- *   Since the start of the grace period is not broadcasted, at least two
- *   calls to rcu_check_quiescent_state are required:
- *   The first call just notices that a new grace period is running. The
- *   following calls check if there was a quiescent state since the beginning
- *   of the grace period. If so, it updates rcu_state.cpumask. If
- *   the bitmap is empty, then the grace period is completed.
- *   rcu_check_quiescent_state calls rcu_start_batch(0) to start the next grace
- *   period (if necessary).
- */
-/*
- * Register a new batch of callbacks, and start it up if there is currently no
- * active batch and the batch to be registered has not already occurred.
- * Caller must hold rcu_state.lock.
- */
-static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
-				int next_pending)
+void _synchronize_kernel(void)
 {
-	if (next_pending)
-		rcp->next_pending = 1;
+	int cpu;
+	unsigned long flags;
 
-	if (rcp->next_pending &&
-			rcp->completed == rcp->cur) {
-		/* Can't change, since spin lock held. */
-		cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);
-
-		rcp->next_pending = 0;
-		/* next_pending == 0 must be visible in __rcu_process_callbacks()
-		 * before it can see new value of cur.
-		 */
-		smp_wmb();
-		rcp->cur++;
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	for_each_cpu(cpu) {  /* _online() or _present() races with hotplug */
+		write_lock(&per_cpu(rcu_data, cpu).lock);
+	}
+	rcu_ctrlblk.batch++;
+	rcu_ctrlblk.last_sk = jiffies;
+	for_each_cpu(cpu) {
+		write_unlock(&per_cpu(rcu_data, cpu).lock);
 	}
+	local_irq_restore(flags);
 }
 
-/*
- * cpu went through a quiescent state since the beginning of the grace period.
- * Clear it from the cpu mask and complete the grace period if it was the last
- * cpu. Start another grace period if someone has further entries pending
- */
-static void cpu_quiet(int cpu, struct rcu_ctrlblk *rcp, struct rcu_state *rsp)
+void synchronize_kernel(void)
 {
-	cpu_clear(cpu, rsp->cpumask);
-	if (cpus_empty(rsp->cpumask)) {
-		/* batch completed ! */
-		rcp->completed = rcp->cur;
-		rcu_start_batch(rcp, rsp, 0);
-	}
-}
+	long oldbatch;
 
-/*
- * Check if the cpu has gone through a quiescent state (say context
- * switch). If so and if it already hasn't done so in this RCU
- * quiescent cycle, then indicate that it has done so.
- */
-static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
-			struct rcu_state *rsp, struct rcu_data *rdp)
-{
-	if (rdp->quiescbatch != rcp->cur) {
-		/* start new grace period: */
-		rdp->qs_pending = 1;
-		rdp->passed_quiesc = 0;
-		rdp->quiescbatch = rcp->cur;
-		return;
+	smp_mb();
+	oldbatch = rcu_ctrlblk.batch;
+	schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);
+	if (rcu_ctrlblk.batch == oldbatch) {
+		_synchronize_kernel();
 	}
-
-	/* Grace period already completed for this cpu?
-	 * qs_pending is checked instead of the actual bitmap to avoid
-	 * cacheline trashing.
-	 */
-	if (!rdp->qs_pending)
-		return;
-
-	/* 
-	 * Was there a quiescent state since the beginning of the grace
-	 * period? If no, then exit and wait for the next call.
-	 */
-	if (!rdp->passed_quiesc)
-		return;
-	rdp->qs_pending = 0;
-
-	spin_lock(&rsp->lock);
-	/*
-	 * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
-	 * during cpu startup. Ignore the quiescent state.
-	 */
-	if (likely(rdp->quiescbatch == rcp->cur))
-		cpu_quiet(rdp->cpu, rcp, rsp);
-
-	spin_unlock(&rsp->lock);
-}
-
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-/* warning! helper for rcu_offline_cpu. do not use elsewhere without reviewing
- * locking requirements, the list it's pulling from has to belong to a cpu
- * which is dead and hence not processing interrupts.
- */
-static void rcu_move_batch(struct rcu_data *this_rdp, struct rcu_head *list,
-				struct rcu_head **tail)
-{
-	local_irq_disable();
-	*this_rdp->nxttail = list;
-	if (list)
-		this_rdp->nxttail = tail;
-	local_irq_enable();
 }
 
-static void __rcu_offline_cpu(struct rcu_data *this_rdp,
-	struct rcu_ctrlblk *rcp, struct rcu_state *rsp, struct rcu_data *rdp)
+void rcu_advance_callbacks(void)
 {
-	/* if the cpu going offline owns the grace period
-	 * we can block indefinitely waiting for it, so flush
-	 * it here
-	 */
-	spin_lock_bh(&rsp->lock);
-	if (rcp->cur != rcp->completed)
-		cpu_quiet(rdp->cpu, rcp, rsp);
-	spin_unlock_bh(&rsp->lock);
-	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
-	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
+	unsigned long flags;
+	struct rcu_data *rdp;
 
+	local_irq_save(flags);  /* allow invocation from OOM handler. */
+	rdp = &__get_cpu_var(rcu_data);
+	smp_mb();	/* prevent sampling batch # before list removal. */
+	if (rdp->batch != rcu_ctrlblk.batch) {
+		*rdp->donetail = rdp->waitlist;
+		rdp->donetail = rdp->waittail;
+		rdp->waitlist = NULL;
+		rdp->waittail = &rdp->waitlist;
+		rdp->batch = rcu_ctrlblk.batch;
+	}
+	local_irq_restore(flags);
 }
-static void rcu_offline_cpu(int cpu)
+
+void call_rcu(struct rcu_head *head,
+	 void (*func)(struct rcu_head *rcu))
 {
-	struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
-	struct rcu_data *this_bh_rdp = &get_cpu_var(rcu_bh_data);
+	unsigned long flags;
+	struct rcu_data *rdp;
 
-	__rcu_offline_cpu(this_rdp, &rcu_ctrlblk, &rcu_state,
-					&per_cpu(rcu_data, cpu));
-	__rcu_offline_cpu(this_bh_rdp, &rcu_bh_ctrlblk, &rcu_bh_state,
-					&per_cpu(rcu_bh_data, cpu));
-	put_cpu_var(rcu_data);
-	put_cpu_var(rcu_bh_data);
-	tasklet_kill_immediate(&per_cpu(rcu_tasklet, cpu), cpu);
+	head->func = func;
+	head->next = NULL;
+	local_irq_save(flags);
+	rcu_advance_callbacks();
+	rdp = &__get_cpu_var(rcu_data);
+	*rdp->waittail = head;
+	rdp->waittail = &head->next;
+	local_irq_restore(flags);
 }
 
-#else
-
-static void rcu_offline_cpu(int cpu)
+void rcu_process_callbacks(void)
 {
-}
-
-#endif
+	unsigned long flags;
+	struct rcu_head *next, *list;
+	struct rcu_data *rdp;
 
-/*
- * This does the RCU processing work from tasklet context. 
- */
-static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
-			struct rcu_state *rsp, struct rcu_data *rdp)
-{
-	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
-		*rdp->donetail = rdp->curlist;
-		rdp->donetail = rdp->curtail;
-		rdp->curlist = NULL;
-		rdp->curtail = &rdp->curlist;
+	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_data);
+	list = rdp->donelist;
+	if (list == NULL) {
+		local_irq_restore(flags);
+		return;
 	}
-
-	local_irq_disable();
-	if (rdp->nxtlist && !rdp->curlist) {
-		rdp->curlist = rdp->nxtlist;
-		rdp->curtail = rdp->nxttail;
-		rdp->nxtlist = NULL;
-		rdp->nxttail = &rdp->nxtlist;
-		local_irq_enable();
-
-		/*
-		 * start the next batch of callbacks
-		 */
-
-		/* determine batch number */
-		rdp->batch = rcp->cur + 1;
-		/* see the comment and corresponding wmb() in
-		 * the rcu_start_batch()
-		 */
-		smp_rmb();
-
-		if (!rcp->next_pending) {
-			/* and start it/schedule start if it's a new batch */
-			spin_lock(&rsp->lock);
-			rcu_start_batch(rcp, rsp, 1);
-			spin_unlock(&rsp->lock);
-		}
-	} else {
-		local_irq_enable();
+	rdp->donelist = NULL;
+	rdp->donetail = &rdp->waitlist;
+	local_irq_restore(flags);
+	while (list) {
+		next = list->next;
+		list->func(list);
+		list = next;
 	}
-	rcu_check_quiescent_state(rcp, rsp, rdp);
-	if (rdp->donelist)
-		rcu_do_batch(rdp);
-}
-
-static void rcu_process_callbacks(unsigned long unused)
-{
-	__rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
-				&__get_cpu_var(rcu_data));
-	__rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
-				&__get_cpu_var(rcu_bh_data));
 }
 
 void rcu_check_callbacks(int cpu, int user)
 {
-	if (user || 
-	    (idle_cpu(cpu) && !in_softirq() && 
-				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
-		rcu_qsctr_inc(cpu);
-		rcu_bh_qsctr_inc(cpu);
-	} else if (!in_softirq())
-		rcu_bh_qsctr_inc(cpu);
-	tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
-}
-
-static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
-						struct rcu_data *rdp)
-{
-	memset(rdp, 0, sizeof(*rdp));
-	rdp->curtail = &rdp->curlist;
-	rdp->nxttail = &rdp->nxtlist;
-	rdp->donetail = &rdp->donelist;
-	rdp->quiescbatch = rcp->completed;
-	rdp->qs_pending = 0;
-	rdp->cpu = cpu;
-}
-
-static void __devinit rcu_online_cpu(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);
-
-	rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
-	rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
-	tasklet_init(&per_cpu(rcu_tasklet, cpu), rcu_process_callbacks, 0UL);
-}
-
-static int __devinit rcu_cpu_notify(struct notifier_block *self, 
-				unsigned long action, void *hcpu)
-{
-	long cpu = (long)hcpu;
-	switch (action) {
-	case CPU_UP_PREPARE:
-		rcu_online_cpu(cpu);
-		break;
-	case CPU_DEAD:
-		rcu_offline_cpu(cpu);
-		break;
-	default:
-		break;
+	if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) > 
+	    HZ/GRACE_PERIODS_PER_SEC) {
+		synchronize_kernel();
+		rcu_advance_callbacks();
+		rcu_process_callbacks();
 	}
-	return NOTIFY_OK;
 }
 
-static struct notifier_block __devinitdata rcu_nb = {
-	.notifier_call	= rcu_cpu_notify,
-};
-
-/*
- * Initializes rcu mechanism.  Assumed to be called early.
- * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
- * Note that rcu_qsctr and friends are implicitly
- * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
- */
-void __init rcu_init(void)
+int rcu_pending(int cpu)
 {
-	rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
-			(void *)(long)smp_processor_id());
-	/* Register notifier for non-boot CPUs */
-	register_cpu_notifier(&rcu_nb);
-}
-
-struct rcu_synchronize {
-	struct rcu_head head;
-	struct completion completion;
-};
-
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head  *head)
-{
-	struct rcu_synchronize *rcu;
-
-	rcu = container_of(head, struct rcu_synchronize, head);
-	complete(&rcu->completion);
-}
-
-/**
- * synchronize_kernel - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
- */
-void synchronize_kernel(void)
-{
-	struct rcu_synchronize rcu;
-
-	init_completion(&rcu.completion);
-	/* Will wake me after RCU finished */
-	call_rcu(&rcu.head, wakeme_after_rcu);
+	unsigned long flags;
+	struct rcu_data *rdp;
+	int retval;
 
-	/* Wait for it */
-	wait_for_completion(&rcu.completion);
+	local_irq_save(flags);
+	rdp = &__get_cpu_var(rcu_data);
+	retval = (rdp->waitlist || rdp->donelist);
+	local_irq_restore(flags);
+	return (retval);
 }
 
-module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
-EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18  9:28   ` Ingo Molnar
  2005-03-18  9:53     ` Ingo Molnar
@ 2005-03-19  5:03     ` Manfred Spraul
  2005-03-19 16:26       ` Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Manfred Spraul @ 2005-03-19  5:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, gh, linux-kernel

Ingo Molnar wrote:

>	read_lock(&rwlock);
>	...
>		read_lock(&rwlock);
>
>are still legal. (it's also done quite often.)
>
>  
>
How do you handle the write_lock_irq()/read_lock locks?
E.g. the tasklist_lock or the fasync_lock.

--
    Manfred

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-19  5:03     ` Manfred Spraul
@ 2005-03-19 16:26       ` Ingo Molnar
  2005-03-20  6:36         ` Manfred Spraul
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-19 16:26 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, gh, linux-kernel


* Manfred Spraul <manfred@colorfullife.com> wrote:

> Ingo Molnar wrote:
> 
> >	read_lock(&rwlock);
> >	...
> >		read_lock(&rwlock);
> >
> >are still legal. (it's also done quite often.)
> >
> > 
> >
>
> How do you handle the write_lock_irq()/read_lock locks? E.g. the
> tasklist_lock or the fasync_lock.

which precise locking situation do you mean?

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 22:26       ` Herbert Xu
@ 2005-03-19 16:31         ` Ingo Molnar
  2005-03-20  8:01           ` Kyle Moffett
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-19 16:31 UTC (permalink / raw)
  To: Herbert Xu
  Cc: bhuey, paulmck, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel


* Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Ingo Molnar <mingo@elte.hu> wrote:
> >
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
> 
> What about allowing only as many concurrent readers as there are CPUs?

since a reader may be preempted by a higher prio task, there is no
linear relationship between CPU utilization and the number of readers
allowed. You could easily end up having all the nr_cpus readers
preempted on one CPU. It gets pretty messy.

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-19 16:26       ` Ingo Molnar
@ 2005-03-20  6:36         ` Manfred Spraul
  2005-03-20  9:25           ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Manfred Spraul @ 2005-03-20  6:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, gh, linux-kernel

Ingo Molnar wrote:

>which precise locking situation do you mean?
>
>  
>
cpu 1:
acquire random networking spin_lock_bh()

cpu 2:
read_lock(&tasklist_lock) from process context
interrupt. softirq. within softirq: try to acquire the networking lock.
* spins.

cpu 1:
hardware interrupt
within hw interrupt: signal delivery. tries to acquire tasklist_lock.

--> deadlock.

--
    Manfred

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-19 16:31         ` Ingo Molnar
@ 2005-03-20  8:01           ` Kyle Moffett
  2005-03-22  8:08             ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Kyle Moffett @ 2005-03-20  8:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Herbert Xu, rusty, bhuey, gh, tgall, jim.houston, linux-kernel,
	torvalds, akpm, paulmck, manfred, shemminger, dipankar

On Mar 19, 2005, at 11:31, Ingo Molnar wrote:
>> What about allowing only as many concurrent readers as there are CPUs?
>
> since a reader may be preempted by a higher prio task, there is no
> linear relationship between CPU utilization and the number of readers
> allowed. You could easily end up having all the nr_cpus readers
> preempted on one CPU. It gets pretty messy

One solution I can think of, although it bloats memory usage for many-way
boxen, is to just have a table in the rwlock with one entry per cpu.  Each
CPU would get one concurrent reader; others would need to sleep.
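
Purely as a sketch of the data structure (names and types here are made
up, and PI is glossed over entirely), I'm imagining something like:

	struct pcpu_rwlock {
		raw_spinlock_t		lock;			/* protects the slots      */
		struct task_struct	*reader[NR_CPUS];	/* one reader slot per CPU */
		struct task_struct	*writer;		/* current writer, if any  */
		wait_queue_head_t	wait;			/* extra readers + writer  */
	};

	/*
	 * Read side: take ->lock, and if reader[smp_processor_id()] is free
	 * and there is no writer, claim the slot; otherwise sleep on ->wait.
	 * Write side: set ->writer once every reader slot is NULL, sleeping
	 * on ->wait until then.
	 */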

Cheers,
Kyle Moffett

-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCM/CS/IT/U d- s++: a18 C++++>$ UB/L/X/*++++(+)>$ P+++(++++)>$
L++++(+++) E W++(+) N+++(++) o? K? w--- O? M++ V? PS+() PE+(-) Y+
PGP+++ t+(+++) 5 X R? tv-(--) b++++(++) DI+ D+ G e->++++$ h!*()>++$ r  
!y?(-)
------END GEEK CODE BLOCK------



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20  6:36         ` Manfred Spraul
@ 2005-03-20  9:25           ` Thomas Gleixner
  2005-03-20 16:57             ` Manfred Spraul
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2005-03-20  9:25 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Ingo Molnar, Paul E. McKenney, dipankar, shemminger,
	Andrew Morton, Linus Torvalds, rusty, tgall, jim.houston, gh,
	LKML

On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:
> cpu 1:
> acquire random networking spin_lock_bh()
> 
> cpu 2:
> read_lock(&tasklist_lock) from process context
> interrupt. softirq. within softirq: try to acquire the networking lock.
> * spins.
> 
> cpu 1:
> hardware interrupt
> within hw interrupt: signal delivery. tries to acquire tasklist_lock.
> 
> --> deadlock.

Signal delivery from hw interrupt context (interrupt is flagged
SA_NODELAY) is not possible in RT preemption mode. The
local_irq_save_nort() check in __cache_alloc will catch you.

When it happens from a threaded irq handler the situation is solvable by
the PI code.

tglx



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 17:19       ` Ingo Molnar
@ 2005-03-20 13:29         ` Esben Nielsen
  2005-03-20 22:38           ` Paul E. McKenney
  2005-03-22 10:56           ` Ingo Molnar
  0 siblings, 2 replies; 58+ messages in thread
From: Esben Nielsen @ 2005-03-20 13:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2642 bytes --]

On Fri, 18 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > Why can should there only be one RCU-reader per CPU at each given
> > instance? Even on a real-time UP system it would be very helpfull to
> > have RCU areas to be enterable by several tasks as once. It would
> > perform better, both wrt. latencies and throughput: With the above
> > implementation an high priority task entering an RCU area will have to
> > boost the current RCU reader, make a task switch until that one
> > finishes and makes yet another task switch. to get back to the high
> > priority task. With an RCU implementation which can take n RCU readers
> > per CPU there is no such problem.
> 
> correct, for RCU we could allow multiple readers per lock, because the
> 'blocking' side of RCU (callback processing) is never (supposed to be)
> in any latency path.
> 
> except if someone wants to make RCU callback processing deterministic at
> some point. (e.g. for memory management reasons.)

I think it can be deterministic (on the long timescale of memory management) 
anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
This way only the RT tasks plus one non-RT task can be within those
regions. The RT-tasks are supposed to have some kind of upper bound to
their CPU-usage. The non-RT task will also finish "soon" as it is
boosted. If the RCU batches are also at the lowest RT-priority they can be
run immediately after the non-RT task is done.
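
As a very rough sketch (the boost/unboost helpers are hypothetical, and
nesting is glossed over), the read side could look like this, reusing the
per-CPU counter from the patch I attach below:

	static inline void rcu_read_lock(void)
	{
		/*
		 * Hypothetical helper: if current is not already an RT task,
		 * lift it to the lowest RT priority and remember its old
		 * scheduling parameters.
		 */
		rcu_boost_reader(current);
		preempt_disable();
		__get_cpu_var(rcu_data).active_readers++;
		preempt_enable();
	}

	static inline void rcu_read_unlock(void)
	{
		preempt_disable();
		__get_cpu_var(rcu_data).active_readers--;
		preempt_enable();
		/*
		 * Hypothetical helper: restore whatever rcu_boost_reader()
		 * saved (a no-op for tasks that were RT to begin with).
		 */
		rcu_unboost_reader(current);
	}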

> 
> clearly the simplest solution is to go with the single-reader locks for
> now - a separate experiment could be done with a new type of rwlock that
> can only be used by the RCU code. (I'm not quite sure whether we could
> guarantee a minimum rate of RCU callback processing under such a scheme
> though. It's an eventual memory DoS otherwise.)
> 

Why is a lock needed at all? If it is doable without locking for a
non-preemptible SMP kernel it must be doable for a preemptible kernel as
well. I am convinced some kind of per-CPU rcu_read_count as I specified in
my previous mail can work some way or the other. call_rcu() might need to
do more complicated stuff and thus use CPU, but call_rcu() is supposed to
be a relatively rare event not worth optimizing for.  Such an
implementation will work for any preemptible kernel, not only PREEMPT_RT.
As far as performance is concerned, it is important not to acquire any
locks in the rcu-read regions.

I tried this approach. My UP laptop did boot on it, but I haven't tested
it further. I have included the very small patch as an attachment.

> 	Ingo

I have not yet looked at -V0.7.41-00...

Esben


[-- Attachment #2: Type: TEXT/PLAIN, Size: 3738 bytes --]

diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
--- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h	2005-03-11 23:40:13.000000000 +0100
+++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h	2005-03-19 12:47:09.000000000 +0100
@@ -85,6 +85,7 @@
  * curlist - current batch for which quiescent cycle started if any
  */
 struct rcu_data {
+	long            active_readers;
 	/* 1) quiescent state handling : */
 	long		quiescbatch;     /* Batch # for grace period */
 	int		passed_quiesc;	 /* User-mode/idle loop etc. */
@@ -115,12 +116,14 @@
 static inline void rcu_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+	if(rdp->active_readers==0)
+		rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+	if(rdp->active_readers==0)
+		rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,29 +186,27 @@
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()		preempt_disable()
+static inline void rcu_read_lock(void)
+{	
+	preempt_disable(); 
+	__get_cpu_var(rcu_data).active_readers++;
+	preempt_enable();
+}
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()	preempt_enable()
+static inline void rcu_read_unlock(void)
+{	
+	preempt_disable(); 
+	__get_cpu_var(rcu_data).active_readers--;
+	preempt_enable();
+}
 
 #define IGNORE_LOCK(op, lock)	do { (void)(lock); op(); } while (0)
 
-#ifdef CONFIG_PREEMPT_RT
-# define rcu_read_lock_spin(lock)	spin_lock(lock)
-# define rcu_read_unlock_spin(lock)	spin_unlock(lock)
-# define rcu_read_lock_read(lock)	read_lock(lock)
-# define rcu_read_unlock_read(lock)	read_unlock(lock)
-# define rcu_read_lock_bh_read(lock)	read_lock_bh(lock)
-# define rcu_read_unlock_bh_read(lock)	read_unlock_bh(lock)
-# define rcu_read_lock_down_read(rwsem)	down_read(rwsem)
-# define rcu_read_unlock_up_read(rwsem)	up_read(rwsem)
-# define rcu_read_lock_nort()		do { } while (0)
-# define rcu_read_unlock_nort()		do { } while (0)
-#else
 # define rcu_read_lock_spin(lock)	IGNORE_LOCK(rcu_read_lock, lock)
 # define rcu_read_unlock_spin(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
 # define rcu_read_lock_read(lock)	IGNORE_LOCK(rcu_read_lock, lock)
@@ -216,15 +217,10 @@
 # define rcu_read_unlock_nort()		rcu_read_unlock()
 # define rcu_read_lock_bh_read(lock)	IGNORE_LOCK(rcu_read_lock_bh, lock)
 # define rcu_read_unlock_bh_read(lock)	IGNORE_LOCK(rcu_read_unlock_bh, lock)
-#endif
 
-#ifdef CONFIG_PREEMPT_RT
-# define rcu_read_lock_sem(lock)	down(lock)
-# define rcu_read_unlock_sem(lock)	up(lock)
-#else
 # define rcu_read_lock_sem(lock)	IGNORE_LOCK(rcu_read_lock, lock)
 # define rcu_read_unlock_sem(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
-#endif
+
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
  * way for writers to lock out RCU readers.  This is a feature, not
diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
--- linux-2.6.11-final-V0.7.40-00/Makefile	2005-03-11 23:40:13.000000000 +0100
+++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile	2005-03-19 12:41:09.000000000 +0100
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 11
-EXTRAVERSION = -RT-V0.7.40-00
+EXTRAVERSION = -RT-V0.7.40-00-RCU
 NAME=Woozy Numbat
 
 # *DOCUMENTATION*

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20  9:25           ` Thomas Gleixner
@ 2005-03-20 16:57             ` Manfred Spraul
  2005-03-20 21:38               ` Bill Huey
  0 siblings, 1 reply; 58+ messages in thread
From: Manfred Spraul @ 2005-03-20 16:57 UTC (permalink / raw)
  To: tglx
  Cc: Ingo Molnar, Paul E. McKenney, dipankar, shemminger,
	Andrew Morton, Linus Torvalds, rusty, tgall, jim.houston, gh,
	LKML

Thomas Gleixner wrote:

>On Sun, 2005-03-20 at 07:36 +0100, Manfred Spraul wrote:
>  
>
>>cpu 1:
>>acquire random networking spin_lock_bh()
>>
>>cpu 2:
>>read_lock(&tasklist_lock) from process context
>>interrupt. softirq. within softirq: try to acquire the networking lock.
>>* spins.
>>
>>cpu 1:
>>hardware interrupt
>>within hw interrupt: signal delivery. tries to acquire tasklist_lock.
>>
>>--> deadlock.
>>    
>>
>
>Signal delivery from hw interrupt context (interrupt is flagged
>SA_NODELAY) is not possible in RT preemption mode. The
>local_irq_save_nort() check in __cache_alloc will catch you.
>
>  
>
That was just one random example.
Another one would be:

drivers/char/tty_io.c, __do_SAK() contains
    read_lock(&tasklist_lock);
    task_lock(p);

kernel/sys.c, sys_setrlimit contains
    task_lock(current->group_leader);
    read_lock(&tasklist_lock);

task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is 
a normal spinlock, then this is an A/B B/A deadlock.

--
    Manfred

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20 16:57             ` Manfred Spraul
@ 2005-03-20 21:38               ` Bill Huey
  2005-03-20 21:59                 ` Bill Huey
  0 siblings, 1 reply; 58+ messages in thread
From: Bill Huey @ 2005-03-20 21:38 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: tglx, Ingo Molnar, Paul E. McKenney, dipankar, shemminger,
	Andrew Morton, Linus Torvalds, rusty, tgall, jim.houston, gh,
	LKML

On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> That was just one random example.
> Another one would be :
> 
> drivers/chat/tty_io.c, __do_SAK() contains
>    read_lock(&tasklist_lock);
>    task_lock(p);
> 
> kernel/sys.c, sys_setrlimit contains
>    task_lock(current->group_leader);
>    read_lock(&tasklist_lock);
> 
> task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is 
> a normal spinlock, then this is an A/B B/A deadlock.

That code was already dubious in the first place just because it
contained that circularity. If you had a rwlock that block on an
upper read count maximum a deadlock situation would trigger anyways,
say, upon a flood of threads trying to do that sequence of aquires.

I'd probably experiment with using the {spin,read,write}-trylock
logic and release the all locks contains in a sequence like that
on the failure to aquire any of the locks in the chain as an
initial fix. A longer term fix might be to break things up a bit
so that whatever ordering being done would have that circularity.

BTW, the runtime lock circularity detector was designed to trigger
on that situation anyways.

That's my thoughts on the matter.

bill


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20 21:38               ` Bill Huey
@ 2005-03-20 21:59                 ` Bill Huey
  0 siblings, 0 replies; 58+ messages in thread
From: Bill Huey @ 2005-03-20 21:59 UTC (permalink / raw)
  To: Bill Huey
  Cc: Manfred Spraul, tglx, Ingo Molnar, Paul E. McKenney, dipankar,
	shemminger, Andrew Morton, Linus Torvalds, rusty, tgall,
	jim.houston, gh, LKML, Bill Huey (hui)

On Sun, Mar 20, 2005 at 01:38:24PM -0800, Bill Huey wrote:
> On Sun, Mar 20, 2005 at 05:57:23PM +0100, Manfred Spraul wrote:
> > That was just one random example.
> > Another one would be :
> > 
> > drivers/chat/tty_io.c, __do_SAK() contains
> >    read_lock(&tasklist_lock);
> >    task_lock(p);
> > 
> > kernel/sys.c, sys_setrlimit contains
> >    task_lock(current->group_leader);
> >    read_lock(&tasklist_lock);
> > 
> > task_lock is a shorthand for spin_lock(&p->alloc_lock). If read_lock is 
> > a normal spinlock, then this is an A/B B/A deadlock.
> 
> That code was already dubious in the first place just because it
> contained that circularity. If you had a rwlock that block on an
> upper read count maximum[,] a deadlock situation would trigger anyways,
> say, upon a flood of threads trying to do that sequence of aquires.

The RT patch uses the lock ordering "in place", and whatever nasty
situation was going on previously will effectively be under high load,
which increases the chance of it being triggered. Removal of the
read-side semantic just increases load more, so that those cases can
trigger.

I disagree with this approach and I have an alternate implementation
here that restores it. It's only half tested and fairly meaningless
until an extreme contention case is revealed with the current rt lock
implementation. Numbers need to be gathered to prove or disprove this
conjecture.

> I'd probably experiment with using the {spin,read,write}-trylock
> logic and release the all locks contains in a sequence like that
> on the failure to aquire any of the locks in the chain as an
> initial fix. A longer term fix might be to break things up a bit
> so that whatever ordering being done would have that circularity.

Excuse me, ...would *not* have that circularity.

bill


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20 13:29         ` Esben Nielsen
@ 2005-03-20 22:38           ` Paul E. McKenney
  2005-03-20 23:23             ` Esben Nielsen
  2005-03-22 10:56           ` Ingo Molnar
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-20 22:38 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> 
> > 
> > * Esben Nielsen <simlo@phys.au.dk> wrote:
> > 
> > > Why can should there only be one RCU-reader per CPU at each given
> > > instance? Even on a real-time UP system it would be very helpfull to
> > > have RCU areas to be enterable by several tasks as once. It would
> > > perform better, both wrt. latencies and throughput: With the above
> > > implementation an high priority task entering an RCU area will have to
> > > boost the current RCU reader, make a task switch until that one
> > > finishes and makes yet another task switch. to get back to the high
> > > priority task. With an RCU implementation which can take n RCU readers
> > > per CPU there is no such problem.
> > 
> > correct, for RCU we could allow multiple readers per lock, because the
> > 'blocking' side of RCU (callback processing) is never (supposed to be)
> > in any latency path.
> > 
> > except if someone wants to make RCU callback processing deterministic at
> > some point. (e.g. for memory management reasons.)
> 
> I think it can be deterministic (on the long timescale of memory management) 
> anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
> This way only all the RT tasks + one non-RT task can be within those
> regions. The RT-tasks are supposed to have some kind of upper bound to
> their CPU-usage. The non-RT task will also finish "soon" as it is
> boosted. If the RCU batches are also at the lowest RT-priority they can be
> run immediately after the non-RT task is done.

Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)

> > clearly the simplest solution is to go with the single-reader locks for
> > now - a separate experiment could be done with a new type of rwlock that
> > can only be used by the RCU code. (I'm not quite sure whether we could
> > guarantee a minimum rate of RCU callback processing under such a scheme
> > though. It's an eventual memory DoS otherwise.)
> > 
> 
> Why are a lock needed at all? If it is doable without locking for an
> non-preemptable SMP kernel it must be doable for an preemptable kernel as
> well.I am convinced some kind of per-CPU rcu_read_count as I specified in
> my previous mail can work some way or the other. call_rcu() might need to
> do more complicated stuff and thus use CPU but call_rcu() is supposed to
> be an relative rare event not worth optimizing for.  Such an
> implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> For performance is considered it is important not to acquire any locks in
> the rcu-read regions. 

You definitely don't need a lock -- you can just suppress preemption
on the read side instead.  But that potentially makes for long scheduling
latencies.

The counter approach might work, and is also what implementation #5
does -- check out rcu_read_lock() in Ingo's most recent patch.

						Thanx, Paul

> I tried this approach. My UP labtop did boot on it, but I haven't testet
> it further. I have included the very small patch as an attachment.
> 
> > 	Ingo
> 
> I have not yet looked at -V0.7.41-00...
> 
> Esben
> 

> diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h	2005-03-11 23:40:13.000000000 +0100
> +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h	2005-03-19 12:47:09.000000000 +0100
> @@ -85,6 +85,7 @@
>   * curlist - current batch for which quiescent cycle started if any
>   */
>  struct rcu_data {
> +	long            active_readers;
>  	/* 1) quiescent state handling : */
>  	long		quiescbatch;     /* Batch # for grace period */
>  	int		passed_quiesc;	 /* User-mode/idle loop etc. */
> @@ -115,12 +116,14 @@
>  static inline void rcu_qsctr_inc(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> -	rdp->passed_quiesc = 1;
> +	if(rdp->active_readers==0)
> +		rdp->passed_quiesc = 1;
>  }
>  static inline void rcu_bh_qsctr_inc(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> -	rdp->passed_quiesc = 1;
> +	if(rdp->active_readers==0)
> +		rdp->passed_quiesc = 1;
>  }
>  
>  static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> @@ -183,29 +186,27 @@
>   *
>   * It is illegal to block while in an RCU read-side critical section.
>   */
> -#define rcu_read_lock()		preempt_disable()
> +static inline void rcu_read_lock(void)
> +{	
> +	preempt_disable(); 
> +	__get_cpu_var(rcu_data).active_readers++;
> +	preempt_enable();
> +}
>  
>  /**
>   * rcu_read_unlock - marks the end of an RCU read-side critical section.
>   *
>   * See rcu_read_lock() for more information.
>   */
> -#define rcu_read_unlock()	preempt_enable()
> +static inline void rcu_read_unlock(void)
> +{	
> +	preempt_disable(); 
> +	__get_cpu_var(rcu_data).active_readers--;
> +	preempt_enable();
> +}
>  
>  #define IGNORE_LOCK(op, lock)	do { (void)(lock); op(); } while (0)
>  
> -#ifdef CONFIG_PREEMPT_RT
> -# define rcu_read_lock_spin(lock)	spin_lock(lock)
> -# define rcu_read_unlock_spin(lock)	spin_unlock(lock)
> -# define rcu_read_lock_read(lock)	read_lock(lock)
> -# define rcu_read_unlock_read(lock)	read_unlock(lock)
> -# define rcu_read_lock_bh_read(lock)	read_lock_bh(lock)
> -# define rcu_read_unlock_bh_read(lock)	read_unlock_bh(lock)
> -# define rcu_read_lock_down_read(rwsem)	down_read(rwsem)
> -# define rcu_read_unlock_up_read(rwsem)	up_read(rwsem)
> -# define rcu_read_lock_nort()		do { } while (0)
> -# define rcu_read_unlock_nort()		do { } while (0)
> -#else
>  # define rcu_read_lock_spin(lock)	IGNORE_LOCK(rcu_read_lock, lock)
>  # define rcu_read_unlock_spin(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
>  # define rcu_read_lock_read(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> @@ -216,15 +217,10 @@
>  # define rcu_read_unlock_nort()		rcu_read_unlock()
>  # define rcu_read_lock_bh_read(lock)	IGNORE_LOCK(rcu_read_lock_bh, lock)
>  # define rcu_read_unlock_bh_read(lock)	IGNORE_LOCK(rcu_read_unlock_bh, lock)
> -#endif
>  
> -#ifdef CONFIG_PREEMPT_RT
> -# define rcu_read_lock_sem(lock)	down(lock)
> -# define rcu_read_unlock_sem(lock)	up(lock)
> -#else
>  # define rcu_read_lock_sem(lock)	IGNORE_LOCK(rcu_read_lock, lock)
>  # define rcu_read_unlock_sem(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
> -#endif
> +
>  /*
>   * So where is rcu_write_lock()?  It does not exist, as there is no
>   * way for writers to lock out RCU readers.  This is a feature, not
> diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
> --- linux-2.6.11-final-V0.7.40-00/Makefile	2005-03-11 23:40:13.000000000 +0100
> +++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile	2005-03-19 12:41:09.000000000 +0100
> @@ -1,7 +1,7 @@
>  VERSION = 2
>  PATCHLEVEL = 6
>  SUBLEVEL = 11
> -EXTRAVERSION = -RT-V0.7.40-00
> +EXTRAVERSION = -RT-V0.7.40-00-RCU
>  NAME=Woozy Numbat
>  
>  # *DOCUMENTATION*


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20 22:38           ` Paul E. McKenney
@ 2005-03-20 23:23             ` Esben Nielsen
  2005-03-22  5:53               ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Esben Nielsen @ 2005-03-20 23:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Sun, 20 Mar 2005, Paul E. McKenney wrote:

> On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > 
> > > [...]
> > 
> > I think it can be deterministic (on the long timescale of memory management) 
> > anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
> > This way only all the RT tasks + one non-RT task can be within those
> > regions. The RT-tasks are supposed to have some kind of upper bound to
> > their CPU-usage. The non-RT task will also finish "soon" as it is
> > boosted. If the RCU batches are also at the lowest RT-priority they can be
> > run immediately after the non-RT task is done.
> 
> Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)
> 
Well, I was actually thinking of an API like
 preempt_by_nonrt_disable()
 preempt_by_nonrt_enable()
working like the old preempt_disable()/preempt_enable() but still
allowing RT tasks (as well as priority inheriting non-RT tasks) to be
scheduled. That would kind of help "split" the kernel into two halves: the
RT part and the non-RT part. The non-RT part would in many ways work as it
has always done.
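
A minimal sketch of what I mean (the helpers are hypothetical, and nesting
is ignored):

	static inline void preempt_by_nonrt_disable(void)
	{
		/*
		 * Hypothetical helper: raise the current task to the lowest
		 * RT priority, remembering its old scheduling parameters.
		 * SCHED_NORMAL tasks can then no longer preempt it, while RT
		 * tasks (and tasks PI-boosted above it) still can.
		 */
		sched_raise_to_lowest_rt(current);
	}

	static inline void preempt_by_nonrt_enable(void)
	{
		/* Hypothetical helper: restore the parameters saved above. */
		sched_restore_saved_prio(current);
	}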

> > > clearly the simplest solution is to go with the single-reader locks for
> > > now - a separate experiment could be done with a new type of rwlock that
> > > can only be used by the RCU code. (I'm not quite sure whether we could
> > > guarantee a minimum rate of RCU callback processing under such a scheme
> > > though. It's an eventual memory DoS otherwise.)
> > > 
> > 
> > Why are a lock needed at all? If it is doable without locking for an
> > non-preemptable SMP kernel it must be doable for an preemptable kernel as
> > well.I am convinced some kind of per-CPU rcu_read_count as I specified in
> > my previous mail can work some way or the other. call_rcu() might need to
> > do more complicated stuff and thus use CPU but call_rcu() is supposed to
> > be an relative rare event not worth optimizing for.  Such an
> > implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> > For performance is considered it is important not to acquire any locks in
> > the rcu-read regions. 
> 
> You definitely don't need a lock -- you can just suppress preemption
> on the read side instead.  But that potentially makes for long scheduling
> latencies.

Well, in my patch I do not leave preemption off - only while doing the
simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
might be active, i.e. preempted, on the current CPU, so that this isn't
a quiescent point after all.
(To others: Paul nicely unfolded my attachment below - I left it in
the mail so that you can read it.)
The problem with this approach is of course that user space programs might
preempt an RCU reader for a very long time, so that RCU batches are never
really run. The boosting of non-RT tasks mentioned above would help a
lot.
A plus(?) in it: You can actually sleep while holding rcu_read_lock!

> 
> The counter approach might work, and is also what the implementation #5
> does -- check out rcu_read_lock() in Ingo's most recent patch.
> 

Do you refer to your original mail implementing it in 5 steps?
In #5 in that one (-V0.7.41-00, right?) you use a lock, and as you say that
forces synchronization between the CPUs - bad for scaling. It does make the
RCU batches somewhat deterministic, as the RCU task can boost the readers
to the rcu-task's priority.
The problem with this approach is that everybody calling into RCU code
has a worst case behaviour of the systemwide worst case RCU reader
section - which can be pretty large (in principle infinite).
So if somebody calls a function containing an RCU read area, the worst
case behaviour would be the same as the worst case latency in the simple
world where preempt_disable()/preempt_enable() was used.

> 						Thanx, Paul
> 
> > I tried this approach. My UP labtop did boot on it, but I haven't testet
> > it further. I have included the very small patch as an attachment.
> > 
> > > 	Ingo
> > 
> > I have not yet looked at -V0.7.41-00...
> > 
> > Esben
> > 
> 
> > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h	2005-03-11 23:40:13.000000000 +0100
> > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h	2005-03-19 12:47:09.000000000 +0100
> > @@ -85,6 +85,7 @@
> >   * curlist - current batch for which quiescent cycle started if any
> >   */
> >  struct rcu_data {
> > +	long            active_readers;
> >  	/* 1) quiescent state handling : */
> >  	long		quiescbatch;     /* Batch # for grace period */
> >  	int		passed_quiesc;	 /* User-mode/idle loop etc. */
> > @@ -115,12 +116,14 @@
> >  static inline void rcu_qsctr_inc(int cpu)
> >  {
> >  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> > -	rdp->passed_quiesc = 1;
> > +	if(rdp->active_readers==0)
> > +		rdp->passed_quiesc = 1;
> >  }
> >  static inline void rcu_bh_qsctr_inc(int cpu)
> >  {
> >  	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> > -	rdp->passed_quiesc = 1;
> > +	if(rdp->active_readers==0)
> > +		rdp->passed_quiesc = 1;
> >  }
> >  
> >  static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> > @@ -183,29 +186,27 @@
> >   *
> >   * It is illegal to block while in an RCU read-side critical section.
> >   */
> > -#define rcu_read_lock()		preempt_disable()
> > +static inline void rcu_read_lock(void)
> > +{	
> > +	preempt_disable(); 
> > +	__get_cpu_var(rcu_data).active_readers++;
> > +	preempt_enable();
> > +}
> >  
> >  /**
> >   * rcu_read_unlock - marks the end of an RCU read-side critical section.
> >   *
> >   * See rcu_read_lock() for more information.
> >   */
> > -#define rcu_read_unlock()	preempt_enable()
> > +static inline void rcu_read_unlock(void)
> > +{	
> > +	preempt_disable(); 
> > +	__get_cpu_var(rcu_data).active_readers--;
> > +	preempt_enable();
> > +}
> >  
> >  #define IGNORE_LOCK(op, lock)	do { (void)(lock); op(); } while (0)
> >  
> > -#ifdef CONFIG_PREEMPT_RT
> > -# define rcu_read_lock_spin(lock)	spin_lock(lock)
> > -# define rcu_read_unlock_spin(lock)	spin_unlock(lock)
> > -# define rcu_read_lock_read(lock)	read_lock(lock)
> > -# define rcu_read_unlock_read(lock)	read_unlock(lock)
> > -# define rcu_read_lock_bh_read(lock)	read_lock_bh(lock)
> > -# define rcu_read_unlock_bh_read(lock)	read_unlock_bh(lock)
> > -# define rcu_read_lock_down_read(rwsem)	down_read(rwsem)
> > -# define rcu_read_unlock_up_read(rwsem)	up_read(rwsem)
> > -# define rcu_read_lock_nort()		do { } while (0)
> > -# define rcu_read_unlock_nort()		do { } while (0)
> > -#else
> >  # define rcu_read_lock_spin(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> >  # define rcu_read_unlock_spin(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
> >  # define rcu_read_lock_read(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> > @@ -216,15 +217,10 @@
> >  # define rcu_read_unlock_nort()		rcu_read_unlock()
> >  # define rcu_read_lock_bh_read(lock)	IGNORE_LOCK(rcu_read_lock_bh, lock)
> >  # define rcu_read_unlock_bh_read(lock)	IGNORE_LOCK(rcu_read_unlock_bh, lock)
> > -#endif
> >  
> > -#ifdef CONFIG_PREEMPT_RT
> > -# define rcu_read_lock_sem(lock)	down(lock)
> > -# define rcu_read_unlock_sem(lock)	up(lock)
> > -#else
> >  # define rcu_read_lock_sem(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> >  # define rcu_read_unlock_sem(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
> > -#endif
> > +
> >  /*
> >   * So where is rcu_write_lock()?  It does not exist, as there is no
> >   * way for writers to lock out RCU readers.  This is a feature, not
> > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
> > --- linux-2.6.11-final-V0.7.40-00/Makefile	2005-03-11 23:40:13.000000000 +0100
> > +++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile	2005-03-19 12:41:09.000000000 +0100
> > @@ -1,7 +1,7 @@
> >  VERSION = 2
> >  PATCHLEVEL = 6
> >  SUBLEVEL = 11
> > -EXTRAVERSION = -RT-V0.7.40-00
> > +EXTRAVERSION = -RT-V0.7.40-00-RCU
> >  NAME=Woozy Numbat
> >  
> >  # *DOCUMENTATION*
> 


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20 23:23             ` Esben Nielsen
@ 2005-03-22  5:53               ` Paul E. McKenney
  2005-03-22  8:55                 ` Esben Nielsen
  0 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-22  5:53 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> On Sun, 20 Mar 2005, Paul E. McKenney wrote:
> 
> > On Sun, Mar 20, 2005 at 02:29:17PM +0100, Esben Nielsen wrote:
> > > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > 
> > > > [...]
> > > 
> > > I think it can be deterministic (on the long timescale of memory management) 
> > > anyway: Boost any non-RT task entering an RCU region to the lowest RT priority.
> > > This way only all the RT tasks + one non-RT task can be within those
> > > regions. The RT-tasks are supposed to have some kind of upper bound to
> > > their CPU-usage. The non-RT task will also finish "soon" as it is
> > > boosted. If the RCU batches are also at the lowest RT-priority they can be
> > > run immediately after the non-RT task is done.
> > 
> > Hmmm...  Sort of a preemptive-first-strike priority boost.  Cute!  ;-)
> > 
> Well, I was actually thinking of an API like
>  preempt_by_nonrt_disable()
>  preempt_by_nonrt_enable()
> working like the old preempt_disable()/preempt_enable() but still
> allowing RT tasks (as well as priority inheriting non-RT tasks) to be
> scheduled. That would kind of help "split" the kernel into two halfs: the
> RT part and the non-RT part. The non-RT part would in many ways work as it
> has always done.

Does sound in some ways similar to the migration approach -- there, the
RT/non-RT split is made across CPUs.  But if RT is allowed to preempt,
then you still have to deal with preemption for locking correctness, right?

> > > > clearly the simplest solution is to go with the single-reader locks for
> > > > now - a separate experiment could be done with a new type of rwlock that
> > > > can only be used by the RCU code. (I'm not quite sure whether we could
> > > > guarantee a minimum rate of RCU callback processing under such a scheme
> > > > though. It's an eventual memory DoS otherwise.)
> > > > 
> > > 
> > > Why are a lock needed at all? If it is doable without locking for an
> > > non-preemptable SMP kernel it must be doable for an preemptable kernel as
> > > well.I am convinced some kind of per-CPU rcu_read_count as I specified in
> > > my previous mail can work some way or the other. call_rcu() might need to
> > > do more complicated stuff and thus use CPU but call_rcu() is supposed to
> > > be an relative rare event not worth optimizing for.  Such an
> > > implementation will work for any preemptable kernel, not only PREEMPT_RT. 
> > > For performance is considered it is important not to acquire any locks in
> > > the rcu-read regions. 
> > 
> > You definitely don't need a lock -- you can just suppress preemption
> > on the read side instead.  But that potentially makes for long scheduling
> > latencies.
> 
> Well, in my patch I do not leave preemption off - only while doing the
> simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
> might be active, i.e. preempted, on the current CPU such that this isn't
> and quiescent point after all.
> (To others: Paul nicely unfolded my attachment below - I left it in
> the mail such you can read it.)
> The problem with this approach is ofcourse that user space programs might
> preempt an RCU reader for a very long time such that RCU batches are never
> really run. The boosting of non-RT tasks mentioned above would help a
> lot.
> A plus(?) in it: You can actually sleep while having the rcu_read_lock !!

This is in some ways similar to the K42 approach to RCU (which they call
"generations").  Dipankar put together a similar patch for Linux, but
the problem was that grace periods could be deferred for an extremely
long time.  Which I suspect is what you were calling out as causing
RCU batches never to run.

> > The counter approach might work, and is also what the implementation #5
> > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > 
> 
> Do you refer to your original mail with implementing it in 5 steps?
> In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> forces syncronization between the CPUs - bad for scaling. It does make the
> RCU batches somewhat deterministic, as the RCU task can boost the readers
> to the rcu-task's priority.
> The problem about this approach is that everybody calling into RCU code
> have a worst case behaviour of the systemwide worst case RCU reader 
> section - which can be pretty large (in principle infinite if somebody.)
> So if somebody uses a call to a function in the code containing a RCU read
> area the worst case behavious would be the same as teh worst case latency
> in the simple world where preempt_disable()/preempt_enable() was used.

I missed something here -- readers would see the worst-case -writer-
latency rather than the worst-case -reader- latency, right?  Or are you
concerned about the case where some reader blocks the write-side
acquisitions in _synchronize_kernel(), but not before the writer has
grabbed a lock, blocking any readers on the corresponding CPU?

Yes, but this is true of every other lock in the system as well, not?
In a separate conversation a few weeks ago, Steve Hemminger suggested
placing checks in long RCU read-side critical sections -- this could
be used to keep the worst-case reader latency within the desired bounds.
Not pretty, but, then again, bounding lock latency will require a fair
number of similar changes, I suspect.
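
One way such a check might look (the name is made up, and it is only legal
at a point where the reader holds no references to RCU-protected data,
since a grace period may elapse across it):

	#define rcu_read_checkpoint()					\
	do {								\
		rcu_read_unlock();	/* let a grace period advance */ \
		cond_resched();		/* and let higher-prio work run */ \
		rcu_read_lock();					\
	} while (0)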

						Thanx, Paul

> > > I tried this approach. My UP labtop did boot on it, but I haven't testet
> > > it further. I have included the very small patch as an attachment.
> > > 
> > > > 	Ingo
> > > 
> > > I have not yet looked at -V0.7.41-00...
> > > 
> > > Esben
> > > 
> > 
> > > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h
> > > --- linux-2.6.11-final-V0.7.40-00/include/linux/rcupdate.h	2005-03-11 23:40:13.000000000 +0100
> > > +++ linux-2.6.11-final-V0.7.40-00-RCU/include/linux/rcupdate.h	2005-03-19 12:47:09.000000000 +0100
> > > @@ -85,6 +85,7 @@
> > >   * curlist - current batch for which quiescent cycle started if any
> > >   */
> > >  struct rcu_data {
> > > +	long            active_readers;
> > >  	/* 1) quiescent state handling : */
> > >  	long		quiescbatch;     /* Batch # for grace period */
> > >  	int		passed_quiesc;	 /* User-mode/idle loop etc. */
> > > @@ -115,12 +116,14 @@
> > >  static inline void rcu_qsctr_inc(int cpu)
> > >  {
> > >  	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> > > -	rdp->passed_quiesc = 1;
> > > +	if(rdp->active_readers==0)
> > > +		rdp->passed_quiesc = 1;
> > >  }
> > >  static inline void rcu_bh_qsctr_inc(int cpu)
> > >  {
> > >  	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
> > > -	rdp->passed_quiesc = 1;
> > > +	if(rdp->active_readers==0)
> > > +		rdp->passed_quiesc = 1;
> > >  }
> > >  
> > >  static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
> > > @@ -183,29 +186,27 @@
> > >   *
> > >   * It is illegal to block while in an RCU read-side critical section.
> > >   */
> > > -#define rcu_read_lock()		preempt_disable()
> > > +static inline void rcu_read_lock(void)
> > > +{	
> > > +	preempt_disable(); 
> > > +	__get_cpu_var(rcu_data).active_readers++;
> > > +	preempt_enable();
> > > +}
> > >  
> > >  /**
> > >   * rcu_read_unlock - marks the end of an RCU read-side critical section.
> > >   *
> > >   * See rcu_read_lock() for more information.
> > >   */
> > > -#define rcu_read_unlock()	preempt_enable()
> > > +static inline void rcu_read_unlock(void)
> > > +{	
> > > +	preempt_disable(); 
> > > +	__get_cpu_var(rcu_data).active_readers--;
> > > +	preempt_enable();
> > > +}
> > >  
> > >  #define IGNORE_LOCK(op, lock)	do { (void)(lock); op(); } while (0)
> > >  
> > > -#ifdef CONFIG_PREEMPT_RT
> > > -# define rcu_read_lock_spin(lock)	spin_lock(lock)
> > > -# define rcu_read_unlock_spin(lock)	spin_unlock(lock)
> > > -# define rcu_read_lock_read(lock)	read_lock(lock)
> > > -# define rcu_read_unlock_read(lock)	read_unlock(lock)
> > > -# define rcu_read_lock_bh_read(lock)	read_lock_bh(lock)
> > > -# define rcu_read_unlock_bh_read(lock)	read_unlock_bh(lock)
> > > -# define rcu_read_lock_down_read(rwsem)	down_read(rwsem)
> > > -# define rcu_read_unlock_up_read(rwsem)	up_read(rwsem)
> > > -# define rcu_read_lock_nort()		do { } while (0)
> > > -# define rcu_read_unlock_nort()		do { } while (0)
> > > -#else
> > >  # define rcu_read_lock_spin(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> > >  # define rcu_read_unlock_spin(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
> > >  # define rcu_read_lock_read(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> > > @@ -216,15 +217,10 @@
> > >  # define rcu_read_unlock_nort()		rcu_read_unlock()
> > >  # define rcu_read_lock_bh_read(lock)	IGNORE_LOCK(rcu_read_lock_bh, lock)
> > >  # define rcu_read_unlock_bh_read(lock)	IGNORE_LOCK(rcu_read_unlock_bh, lock)
> > > -#endif
> > >  
> > > -#ifdef CONFIG_PREEMPT_RT
> > > -# define rcu_read_lock_sem(lock)	down(lock)
> > > -# define rcu_read_unlock_sem(lock)	up(lock)
> > > -#else
> > >  # define rcu_read_lock_sem(lock)	IGNORE_LOCK(rcu_read_lock, lock)
> > >  # define rcu_read_unlock_sem(lock)	IGNORE_LOCK(rcu_read_unlock, lock)
> > > -#endif
> > > +
> > >  /*
> > >   * So where is rcu_write_lock()?  It does not exist, as there is no
> > >   * way for writers to lock out RCU readers.  This is a feature, not
> > > diff -Naur --exclude-from diff_exclude linux-2.6.11-final-V0.7.40-00/Makefile linux-2.6.11-final-V0.7.40-00-RCU/Makefile
> > > --- linux-2.6.11-final-V0.7.40-00/Makefile	2005-03-11 23:40:13.000000000 +0100
> > > +++ linux-2.6.11-final-V0.7.40-00-RCU/Makefile	2005-03-19 12:41:09.000000000 +0100
> > > @@ -1,7 +1,7 @@
> > >  VERSION = 2
> > >  PATCHLEVEL = 6
> > >  SUBLEVEL = 11
> > > -EXTRAVERSION = -RT-V0.7.40-00
> > > +EXTRAVERSION = -RT-V0.7.40-00-RCU
> > >  NAME=Woozy Numbat
> > >  
> > >  # *DOCUMENTATION*
> > 
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20  8:01           ` Kyle Moffett
@ 2005-03-22  8:08             ` Ingo Molnar
  0 siblings, 0 replies; 58+ messages in thread
From: Ingo Molnar @ 2005-03-22  8:08 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Herbert Xu, rusty, bhuey, gh, tgall, jim.houston, linux-kernel,
	torvalds, akpm, paulmck, manfred, shemminger, dipankar


* Kyle Moffett <mrmacman_g4@mac.com> wrote:

> One solution I can think of, although it bloats memory usage for
> many-way boxen, is to just have a table in the rwlock with one entry
> per cpu.  Each CPU would get one concurrent reader, others would need
> to sleep

yes, it bloats memory usage, and complicates PI quite significantly. 
We've been there and have done something similar to that in earlier -RT
patches - 64 'owner' entries in the lock already caused problems on 8K
stacks. 32 seemed to work but caused quite some bloat in data structure
sizes.

i'd rather live with some scalability bottleneck for now, in exchange for
a vastly easier to handle design. We can still complicate things later
on for better scalability, once all the other issues have been sorted
out.

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22  5:53               ` Paul E. McKenney
@ 2005-03-22  8:55                 ` Esben Nielsen
  2005-03-22  9:20                   ` Ingo Molnar
  2005-03-23  5:40                   ` Paul E. McKenney
  0 siblings, 2 replies; 58+ messages in thread
From: Esben Nielsen @ 2005-03-22  8:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel


On Mon, 21 Mar 2005, Paul E. McKenney wrote:

> On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > [...] 
> > Well, I was actually thinking of an API like
> >  preempt_by_nonrt_disable()
> >  preempt_by_nonrt_enable()
> > working like the old preempt_disable()/preempt_enable() but still
> > allowing RT tasks (as well as priority inheriting non-RT tasks) to be
> > scheduled. That would kind of help "split" the kernel into two halfs: the
> > RT part and the non-RT part. The non-RT part would in many ways work as it
> > has always done.
> 
> Does sound in some ways similar to the migration approach -- there, the
> RT/non-RT split is made across CPUs.  But if RT is allowed to preempt,
> then you still have to deal with preemption for locking correctness, right?
> 
Yes. It is not a locking mechanism; it should just prevent scheduling of
"normal" userspace tasks.

> > [..] 
> > Well, in my patch I do not leave preemption off - only while doing the
> > simple ++/--. In effect, I let rcu_qsctr_inc know that some RCU reader
> > might be active, i.e. preempted, on the current CPU such that this isn't
> > and quiescent point after all.
> > (To others: Paul nicely unfolded my attachment below - I left it in
> > the mail such you can read it.)
> > The problem with this approach is ofcourse that user space programs might
> > preempt an RCU reader for a very long time such that RCU batches are never
> > really run. The boosting of non-RT tasks mentioned above would help a
> > lot.
> > A plus(?) in it: You can actually sleep while having the rcu_read_lock !!
> 
> This is in some ways similar to the K42 approach to RCU (which they call
> "generations").  Dipankar put together a similar patch for Linux, but
> the problem was that grace periods could be deferred for an extremely
> long time.  Which I suspect is what you were calling out as causing
> RCU batches never to run.
> 

That is where the preempt_by_nonrt_disable/enable() is supposed to help:
Then it can't take longer than the normal kernel in the situation where
there are no RT tasks running. RT tasks will prolong the grace periods if
they go into RCU regions, but they are supposed to be relatively small -
and deterministic!

> > > The counter approach might work, and is also what the implementation #5
> > > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > > 
> > 
> > Do you refer to your original mail with implementing it in 5 steps?
> > In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> > forces syncronization between the CPUs - bad for scaling. It does make the
> > RCU batches somewhat deterministic, as the RCU task can boost the readers
> > to the rcu-task's priority.
> > The problem about this approach is that everybody calling into RCU code
> > have a worst case behaviour of the systemwide worst case RCU reader 
> > section - which can be pretty large (in principle infinite if somebody.)
> > So if somebody uses a call to a function in the code containing a RCU read
> > area the worst case behavious would be the same as teh worst case latency
> > in the simple world where preempt_disable()/preempt_enable() was used.
> 
> I missed something here -- readers would see the worst-case -writer-
> latency rather than the worst-case -reader- latency, right?  Or are you
> concerned about the case where some reader blocks the write-side
> acquisitions in _synchronize_kernel(), but not before the writer has
> grabbed a lock, blocking any readers on the corresponding CPU?
> 

I am concerned that readers block each other, too. You do need an rw-mutex
allowing an unlimited number of readers for doing this. With the current
rw-mutex the readers block each other, i.e. the worst case latency is the
worst case reader latency - globally!
On the other hand, with an rw-lock allowing unlimited readers - and thus not
keeping track of its readers - the readers can't be boosted by the writer.
Then you are back to square 1: the grace period can take a very long time.

> Yes, but this is true of every other lock in the system as well, not?

Other locks are not globally used but only used for a specific subsystem.
On a real-time system you are supposed to know which subsystems you can
call into and still have a low enough latency, as each subsystem has its
own bound. But with a global RCU locking mechanism all RCU-using code is
to be regarded as _one_ such subsystem.

> In a separate conversation a few weeks ago, Steve Hemminger suggested
> placing checks in long RCU read-side critical sections -- this could
> be used to keep the worst-case reader latency within the desired bounds.
> Not pretty, but, then again, bounding lock latency will require a fair
> number of similar changes, I suspect.
> 
> 						Thanx, Paul
> 

Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22  8:55                 ` Esben Nielsen
@ 2005-03-22  9:20                   ` Ingo Molnar
  2005-03-22 10:19                     ` Esben Nielsen
  2005-03-23  5:40                   ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-22  9:20 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel


* Esben Nielsen <simlo@phys.au.dk> wrote:

> On the other hand with a rw-lock being unlimited - and thus do not
> keep track of it readers - the readers can't be boosted by the writer.
> Then you are back to square 1: The grace period can take a very long
> time.

btw., is the 'very long grace period' a real issue? We could avoid all
the RCU read-side locking latencies by making it truly unlocked and just
living with the long grace periods. Perhaps it's enough to add an
emergency mechanism to the OOM handler, which frees up all the 'blocked
by preemption' RCU callbacks via some scheduler magic. (e.g. such an
emergency mechanism could be _conditional_ locking on the read side -
i.e. new RCU read-side users would be blocked until the OOM situation
goes away, or something like that.)

your patch is implementing just that, correct? Would you mind redoing it
against a recent -RT base? (-40-04 or so)
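
to be concrete about the 'conditional locking on the read side' bit, i'm
thinking of something like this (the flag and waitqueue are made up, and
it ignores readers in atomic context), on top of your counter:

	static inline void rcu_read_lock(void)
	{
		/*
		 * normally free-running: readers never block.  Only when the
		 * OOM handler has raised the (hypothetical) emergency flag do
		 * new readers wait, so that pending RCU callbacks can drain.
		 */
		if (unlikely(rcu_oom_emergency))
			wait_event(rcu_oom_wait, !rcu_oom_emergency);
		preempt_disable();
		__get_cpu_var(rcu_data).active_readers++;
		preempt_enable();
	}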

also, what would be the worst-case workload causing long grace periods?

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-18 16:55       ` Esben Nielsen
@ 2005-03-22 10:04         ` Bill Huey
  2005-03-22 10:17           ` Bill Huey
  2005-03-22 10:38           ` Esben Nielsen
  0 siblings, 2 replies; 58+ messages in thread
From: Bill Huey @ 2005-03-22 10:04 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, Bill Huey, Paul E. McKenney, dipankar, shemminger,
	akpm, torvalds, rusty, tgall, jim.houston, manfred, gh,
	linux-kernel

On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > i really have no intention to allow multiple readers for rt-mutexes. We
> > got away with that so far, and i'd like to keep it so. Imagine 100
> > threads all blocked in the same critical section (holding the read-lock)
> > when a highprio writer thread comes around: instant 100x latency to let
> > all of them roll forward. The only sane solution is to not allow
> > excessive concurrency. (That limits SMP scalability, but there's no
> > other choice i can see.)
> 
> Unless a design change is made: One could argue for a semantics where
> write-locking _isn't_ deterministic and thus do not have to boost all the

RCU isn't write deterministic like typical RT apps are we can... (below :-))

> readers. Readers boost the writers but not the other way around. Readers
> will be deterministic, but not writers.
> Such a semantics would probably work for a lot of RT applications
> happening not to take any write-locks - these will in fact perform better. 
> But it will give the rest a lot of problems.

Just came up with an idea after I thought about how much of a bitch it
would be to get a fast RCU multipule reader semantic (our current shared-
exclusive lock inserts owners into a sorted priority list per-thread which
makes it very expensive for a simple RCU case since they are typically very
small batches of items being altered). Basically the RCU algorithm has *no*
notion of writer priority and to propagate a PI operation down all reader
is meaningless, so why not revert back to the original rwlock-semaphore to
get the multipule reader semantics ?

A notion of priority across a quiescience operation is crazy anyways, so
it would be safe just to use to the old rwlock-semaphore "in place" without
any changes or priorty handling addtions. The RCU algorithm is only concerned
with what is basically a coarse data guard and it isn't time or priority
critical.

What do you folks think ? That would make Paul's stuff respect multipule
readers which reduces contention and gets around the problem of possibly
overloading the current rt lock implementation that we've been bitching
about. The current RCU development track seem wrong in the first place and
this seem like it could be a better and more complete solution to the problem.

If this works, well, you heard it here first. :)

bill


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22 10:04         ` Bill Huey
@ 2005-03-22 10:17           ` Bill Huey
  2005-03-22 10:34             ` Bill Huey
  2005-03-22 10:38           ` Esben Nielsen
  1 sibling, 1 reply; 58+ messages in thread
From: Bill Huey @ 2005-03-22 10:17 UTC (permalink / raw)
  To: Bill Huey
  Cc: Esben Nielsen, Ingo Molnar, Paul E. McKenney, dipankar,
	shemminger, akpm, torvalds, rusty, tgall, jim.houston, manfred,
	gh, linux-kernel

On Tue, Mar 22, 2005 at 02:04:46AM -0800, Bill Huey wrote:
> RCU isn't write deterministic like typical RT apps are[, so] we can... (below :-))
... 
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case[,] since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader[s]
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

The original lock, for those that don't know, doesn't strictly track read owners
so reentrancy is cheap.

> A notion of priority across a quiescience operation is crazy anyways[-,-] so
> it would be safe just to use to the old rwlock-semaphore "in place" without
> any changes or priorty handling add[i]tions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

A little jitter in a quiescence operation isn't going to hurt things, right?

> What do you folks think ? That would make Paul's stuff respect multipule
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.

Got to get rid of those typos :)

bill


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22  9:20                   ` Ingo Molnar
@ 2005-03-22 10:19                     ` Esben Nielsen
  0 siblings, 0 replies; 58+ messages in thread
From: Esben Nielsen @ 2005-03-22 10:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

On Tue, 22 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > On the other hand with a rw-lock being unlimited - and thus do not
> > keep track of it readers - the readers can't be boosted by the writer.
> > Then you are back to square 1: The grace period can take a very long
> > time.
> 
> btw., is the 'very long grace period' a real issue? We could avoid all
> the RCU read-side locking latencies by making it truly unlocked and just
> living with the long grace periods. Perhaps it's enough to add an
> emergency mechanism to the OOM handler, which frees up all the 'blocked
> by preemption' RCU callbacks via some scheduler magic. (e.g. such an
> emergency mechanism could be _conditional_ locking on the read side -
> i.e. new RCU read-side users would be blocked until the OOM situation
> goes away, or something like that.)

You won't catch RCU read sides that have already been entered - see below.

> 
> your patch is implementing just that, correct? Would you mind redoing it
> against a recent -RT base? (-40-04 or so)
>

In fact I am working on a clean 2.6.12-rc1 right now. I figured this is
orthogonal to the rest of the RT patch and can probably go upstream
immediately. It seemed to work. I'll try to turn it into a clean patch
soonish and also try it on -40-04.
I am trying to make a boosting mechanism in the scheduler such that RCU
readers are boosted to MAX_RT_PRIO when preempted. I have to take that
out first.

Any specific tests I have to run? I am considering making an RCU test
device.

> also, what would be the worst-case workload causing long grace periods?

A nice-19 task, A, enters an RCU region and is preempted. A lot of other
tasks start running. Task A might then be starved for _minutes_, so that
there are no RCU grace periods in all that time.

> 
> 	Ingo

Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22 10:17           ` Bill Huey
@ 2005-03-22 10:34             ` Bill Huey
  0 siblings, 0 replies; 58+ messages in thread
From: Bill Huey @ 2005-03-22 10:34 UTC (permalink / raw)
  To: Bill Huey
  Cc: Esben Nielsen, Ingo Molnar, Paul E. McKenney, dipankar,
	shemminger, akpm, torvalds, rusty, tgall, jim.houston, manfred,
	gh, linux-kernel

On Tue, Mar 22, 2005 at 02:17:27AM -0800, Bill Huey wrote:
> > A notion of priority across a quiescience operation is crazy anyways[-,-] so
> > it would be safe just to use to the old rwlock-semaphore "in place" without
> > any changes or priorty handling add[i]tions. The RCU algorithm is only concerned
> > with what is basically a coarse data guard and it isn't time or priority
> > critical.
> 
> A little jitter in a quiescence operation isn't going to hurt things right ?. 

The only thing that I can think of that can go wrong here is what kind
of effect it would have on a thread write-blocking against a bunch of
RCU readers. It could introduce a chain of delays into, say, a timer event
and might cause problems/side-effects for other things being processed.
RCU processing might have to be decoupled and processed by a different
thread to avoid some of that latency weirdness.

What do you folks think ?

bill


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22 10:04         ` Bill Huey
  2005-03-22 10:17           ` Bill Huey
@ 2005-03-22 10:38           ` Esben Nielsen
  1 sibling, 0 replies; 58+ messages in thread
From: Esben Nielsen @ 2005-03-22 10:38 UTC (permalink / raw)
  To: Bill Huey
  Cc: Ingo Molnar, Paul E. McKenney, dipankar, shemminger, akpm,
	torvalds, rusty, tgall, jim.houston, manfred, gh, linux-kernel

On Tue, 22 Mar 2005, Bill Huey wrote:

> On Fri, Mar 18, 2005 at 05:55:44PM +0100, Esben Nielsen wrote:
> > On Fri, 18 Mar 2005, Ingo Molnar wrote:
> > > i really have no intention to allow multiple readers for rt-mutexes. We
> > > got away with that so far, and i'd like to keep it so. Imagine 100
> > > threads all blocked in the same critical section (holding the read-lock)
> > > when a highprio writer thread comes around: instant 100x latency to let
> > > all of them roll forward. The only sane solution is to not allow
> > > excessive concurrency. (That limits SMP scalability, but there's no
> > > other choice i can see.)
> > 
> > Unless a design change is made: One could argue for a semantics where
> > write-locking _isn't_ deterministic and thus do not have to boost all the
> 
> RCU isn't write deterministic like typical RT apps are we can... (below :-))

It is: the write takes place right away. But it is non-deterministic when
_all_ readers actually see it, and the cleanup is non-deterministic too.
So as long as you defer the cleanup instead of actually _waiting_ for it
to happen, you can safely do RCU writes in an RT task.

> 
> > readers. Readers boost the writers but not the other way around. Readers
> > will be deterministic, but not writers.
> > Such a semantics would probably work for a lot of RT applications
> > happening not to take any write-locks - these will in fact perform better. 
> > But it will give the rest a lot of problems.
> 
> Just came up with an idea after I thought about how much of a bitch it
> would be to get a fast RCU multipule reader semantic (our current shared-
> exclusive lock inserts owners into a sorted priority list per-thread which
> makes it very expensive for a simple RCU case since they are typically very
> small batches of items being altered). Basically the RCU algorithm has *no*
> notion of writer priority and to propagate a PI operation down all reader
> is meaningless, so why not revert back to the original rwlock-semaphore to
> get the multipule reader semantics ?

Remember to boost the writer so that RT tasks can enter read regions. I
must also warn against the dangers: a lot of code where a write-lock is
taken needs to be marked as non-deterministic, i.e. must not be called
from RT tasks (maybe put a WARN_ON(rt_task(current)) in the write-lock
operation?)

> 
> A notion of priority across a quiescience operation is crazy anyways, so
> it would be safe just to use to the old rwlock-semaphore "in place" without
> any changes or priorty handling addtions. The RCU algorithm is only concerned
> with what is basically a coarse data guard and it isn't time or priority
> critical.

I don't find it crazy. I think it is elegant - but also dangerous as it
might take a long time.

> 
> What do you folks think ? That would make Paul's stuff respect multipule
> readers which reduces contention and gets around the problem of possibly
> overloading the current rt lock implementation that we've been bitching
> about. The current RCU development track seem wrong in the first place and
> this seem like it could be a better and more complete solution to the problem.
> 
> If this works, well, you heard it here first. :)
> 
> bill
> 
Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-20 13:29         ` Esben Nielsen
  2005-03-20 22:38           ` Paul E. McKenney
@ 2005-03-22 10:56           ` Ingo Molnar
  2005-03-22 11:39             ` Esben Nielsen
  1 sibling, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-22 10:56 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel


* Esben Nielsen <simlo@phys.au.dk> wrote:

> +static inline void rcu_read_lock(void)
> +{	
> +	preempt_disable(); 
> +	__get_cpu_var(rcu_data).active_readers++;
> +	preempt_enable();
> +}

this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
the same CPU, and hence ->active_readers can get out of sync.

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22 10:56           ` Ingo Molnar
@ 2005-03-22 11:39             ` Esben Nielsen
  2005-03-22 13:10               ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Esben Nielsen @ 2005-03-22 11:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

On Tue, 22 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > +static inline void rcu_read_lock(void)
> > +{	
> > +	preempt_disable(); 
> > +	__get_cpu_var(rcu_data).active_readers++;
> > +	preempt_enable();
> > +}
> 
> this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> the same CPU, and hence ->active_readers can get out of sync.
> 

Ok, this has to be handled in the migration code somehow. I have already
added a
  current->rcu_read_depth++
so it ought to be painless. A simple solution would be not to migrate
threads with rcu_read_depth != 0.

> 	Ingo
> 

Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22 11:39             ` Esben Nielsen
@ 2005-03-22 13:10               ` Ingo Molnar
  2005-03-22 15:08                 ` Esben Nielsen
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2005-03-22 13:10 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 783 bytes --]


* Esben Nielsen <simlo@phys.au.dk> wrote:

> > > +static inline void rcu_read_lock(void)
> > > +{	
> > > +	preempt_disable(); 
> > > +	__get_cpu_var(rcu_data).active_readers++;
> > > +	preempt_enable();
> > > +}
> > 
> > this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> > the same CPU, and hence ->active_readers can get out of sync.
> > 
> 
> Ok, this have to be handled in the mitigration code somehow. I have already 
> added an 
>   current->rcu_read_depth++
> so it ought to be painless. A simple solution would be not to
> mititagrate threads with rcu_read_depth!=0.

could you elaborate?

In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
upwards). I've also attached a separate patch, it should apply cleanly
to 2.6.12-rc1.

	Ingo

[-- Attachment #2: preempt-rcu-2.6.12-rc1-A0 --]
[-- Type: text/plain, Size: 3492 bytes --]

--- linux/kernel/rcupdate.c.orig
+++ linux/kernel/rcupdate.c
@@ -468,3 +468,36 @@ module_param(maxbatch, int, 0);
 EXPORT_SYMBOL(call_rcu);
 EXPORT_SYMBOL(call_rcu_bh);
 EXPORT_SYMBOL(synchronize_kernel);
+
+#ifdef CONFIG_PREEMPT_RCU
+
+void rcu_read_lock(void)
+{
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_data = &get_cpu_var(rcu_data);
+		atomic_inc(&current->rcu_data->active_readers);
+		put_cpu_var(rcu_data);
+	}
+}
+EXPORT_SYMBOL(rcu_read_lock);
+
+void rcu_read_unlock(void)
+{
+	int cpu;
+
+	if (--current->rcu_read_lock_nesting == 0) {
+		atomic_dec(&current->rcu_data->active_readers);
+		/*
+		 * Check whether we have reached quiescent state.
+		 * Note! This is only for the local CPU, not for
+		 * current->rcu_data's CPU [which typically is the
+		 * current CPU, but may also be another CPU].
+		 */
+		cpu = get_cpu();
+		rcu_qsctr_inc(cpu);
+		put_cpu();
+	}
+}
+EXPORT_SYMBOL(rcu_read_unlock);
+
+#endif
--- linux/include/linux/rcupdate.h.orig
+++ linux/include/linux/rcupdate.h
@@ -99,6 +99,9 @@ struct rcu_data {
 	struct rcu_head *donelist;
 	struct rcu_head **donetail;
 	int cpu;
+#ifdef CONFIG_PREEMPT_RCU
+	atomic_t	active_readers;
+#endif
 };
 
 DECLARE_PER_CPU(struct rcu_data, rcu_data);
@@ -115,12 +118,18 @@ extern struct rcu_ctrlblk rcu_bh_ctrlblk
 static inline void rcu_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }
 static inline void rcu_bh_qsctr_inc(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
+#ifdef CONFIG_PREEMPT_RCU
+	if (!atomic_read(&rdp->active_readers))
+#endif
+		rdp->passed_quiesc = 1;
 }
 
 static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +192,22 @@ static inline int rcu_pending(int cpu)
  *
  * It is illegal to block while in an RCU read-side critical section.
  */
-#define rcu_read_lock()		preempt_disable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_lock(void);
+#else
+# define rcu_read_lock() preempt_disable()
+#endif
 
 /**
  * rcu_read_unlock - marks the end of an RCU read-side critical section.
  *
  * See rcu_read_lock() for more information.
  */
-#define rcu_read_unlock()	preempt_enable()
+#ifdef CONFIG_PREEMPT_RCU
+  extern void rcu_read_unlock(void);
+#else
+# define rcu_read_unlock() preempt_enable()
+#endif
 
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
@@ -213,14 +230,16 @@ static inline int rcu_pending(int cpu)
  * can use just rcu_read_lock().
  *
  */
-#define rcu_read_lock_bh()	local_bh_disable()
+//#define rcu_read_lock_bh()	local_bh_disable()
+#define rcu_read_lock_bh()	{ rcu_read_lock(); local_bh_disable(); }
 
 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
-#define rcu_read_unlock_bh()	local_bh_enable()
+//#define rcu_read_unlock_bh()	local_bh_enable()
+#define rcu_read_unlock_bh()	{ local_bh_enable(); rcu_read_unlock(); }
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -727,6 +872,10 @@ struct task_struct {
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
 #endif
+#ifdef CONFIG_PREEMPT_RCU
+	int rcu_read_lock_nesting;
+	struct rcu_data *rcu_data;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22 13:10               ` Ingo Molnar
@ 2005-03-22 15:08                 ` Esben Nielsen
  0 siblings, 0 replies; 58+ messages in thread
From: Esben Nielsen @ 2005-03-22 15:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, dipankar, shemminger, akpm, torvalds, rusty,
	tgall, jim.houston, manfred, gh, linux-kernel

On Tue, 22 Mar 2005, Ingo Molnar wrote:

> 
> * Esben Nielsen <simlo@phys.au.dk> wrote:
> 
> > > > +static inline void rcu_read_lock(void)
> > > > +{	
> > > > +	preempt_disable(); 
> > > > +	__get_cpu_var(rcu_data).active_readers++;
> > > > +	preempt_enable();
> > > > +}
> > > 
> > > this is buggy. Nothing guarantees that we'll do the rcu_read_unlock() on
> > > the same CPU, and hence ->active_readers can get out of sync.
> > > 
> > 
> > Ok, this have to be handled in the mitigration code somehow. I have already 
> > added an 
> >   current->rcu_read_depth++
> > so it ought to be painless. A simple solution would be not to
> > mititagrate threads with rcu_read_depth!=0.
> 
> could you elaborate?
> 
Put an rcu_read_depth on each task. In rcu_read_lock() do a
 current->rcu_read_depth++;
and vice versa in rcu_read_unlock(). In can_migrate_task() add
   if (p->rcu_read_depth)
         return 0;
That might do the trick.
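
Put together, that looks roughly like the sketch below (rcu_read_depth is
the hypothetical per-task field from this discussion, not an existing one;
note that by itself this only pins a reader to its CPU, which is what keeps
the per-CPU reader accounting from getting out of sync):

	/* Per-task read-side nesting counter (sketch, not a posted patch). */
	static inline void rcu_read_lock(void)
	{
		current->rcu_read_depth++;
		barrier();
	}

	static inline void rcu_read_unlock(void)
	{
		barrier();
		current->rcu_read_depth--;
	}

	/* In can_migrate_task(), before the existing checks: */
	if (p->rcu_read_depth)
		return 0;	/* never move a task inside a read-side section */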
 

> In any case, see the new PREEMPT_RCU code in the -40-07 patch (and
> upwards). I've also attached a separate patch, it should apply cleanly
> to 2.6.12-rc1.
> 

I barely have time to download the patches - let alone apply them!

Anyway, one thing I don't like: using atomic_inc()/dec() in
rcu_read_lock()/unlock() to touch an rcu_data which might be on another
CPU. Then rcu_data is not really per-CPU data anymore, and it also hurts
the performance of RCU readers.
I think it will be cheaper to use the above rcu_read_depth and then either
not migrate such tasks at all, or make the migration code take care of
moving the task's contribution to the reader count over to the new CPU -
one would have to take care to increment it in the rcu_data of the new CPU
on the new CPU (it isn't atomic) and then decrement it in the rcu_data of
the old CPU on the old CPU - in that order.
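
As a sketch of that ordering (hypothetical helpers, not from any posted
patch, and assuming active_readers were made a plain per-CPU int as
suggested here rather than the atomic_t in the patch above):

	/* Runs on the destination CPU, before the migrated task runs there. */
	static void rcu_reader_arrive(void)
	{
		__get_cpu_var(rcu_data).active_readers++;
	}

	/* Runs on the source CPU, only afterwards, so the reader is never
	 * invisible to both CPUs at the same time. */
	static void rcu_reader_depart(void)
	{
		__get_cpu_var(rcu_data).active_readers--;
	}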


> 	Ingo
> 


Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-22  8:55                 ` Esben Nielsen
  2005-03-22  9:20                   ` Ingo Molnar
@ 2005-03-23  5:40                   ` Paul E. McKenney
  2005-03-23 11:44                     ` Esben Nielsen
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-23  5:40 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> On Mon, 21 Mar 2005, Paul E. McKenney wrote:
[ . . . ]
> > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > This is in some ways similar to the K42 approach to RCU (which they call
> > "generations").  Dipankar put together a similar patch for Linux, but
> > the problem was that grace periods could be deferred for an extremely
> > long time.  Which I suspect is what you were calling out as causing
> > RCU batches never to run.
> 
> That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> Then it can't take longer than the normal kernel in the situation where
> there is no RT tasks running. RT tasks will prolong the grace periods if
> they go into RCU regions, but they are supposed to be relatively small -
> and deterministic!

The part that I am missing is how this helps in the case where a non-RT
task gets preempted in the middle of an RCU read-side critical section
indefinitely.  Or are you boosting the priority of any task that
enters an RCU read-side critical section?

> > > > The counter approach might work, and is also what the implementation #5
> > > > does -- check out rcu_read_lock() in Ingo's most recent patch.
> > > > 
> > > 
> > > Do you refer to your original mail with implementing it in 5 steps?
> > > In #5 in that one (-V0.7.41-00, right?) you use a lock and as you say that
> > > forces syncronization between the CPUs - bad for scaling. It does make the
> > > RCU batches somewhat deterministic, as the RCU task can boost the readers
> > > to the rcu-task's priority.
> > > The problem about this approach is that everybody calling into RCU code
> > > have a worst case behaviour of the systemwide worst case RCU reader 
> > > section - which can be pretty large (in principle infinite if somebody.)
> > > So if somebody uses a call to a function in the code containing a RCU read
> > > area the worst case behavious would be the same as teh worst case latency
> > > in the simple world where preempt_disable()/preempt_enable() was used.
> > 
> > I missed something here -- readers would see the worst-case -writer-
> > latency rather than the worst-case -reader- latency, right?  Or are you
> > concerned about the case where some reader blocks the write-side
> > acquisitions in _synchronize_kernel(), but not before the writer has
> > grabbed a lock, blocking any readers on the corresponding CPU?
> 
> I am conserned that readers block each other, too. You do need an rw-mutex
> allowing an unlimited number of readers for doing this. With the current
> rw-mutex the readers block each other. I.e. the worst case latency is the
> worst case reader latency - globally!
> On the other hand with a rw-lock being unlimited - and thus do not keep
> track of it readers - the readers can't be boosted by the writer. Then you
> are back to square 1: The grace period can take a very long time.

OK, good point.

> > Yes, but this is true of every other lock in the system as well, not?
> 
> Other locks are not globaly used but only used for a specific subsystem.
> On a real-time system you are supposed to know which subsystems you can
> call into and still have a low enough latency as each subsystem has it's
> own bound. But with a global RCU locking mechanism all RCU using code is
> to be regarded as _one_ such subsystem.

Yep.  As would the things protected by the dcache lock, task list lock,
and so on, right?

						Thanx, Paul

> > In a separate conversation a few weeks ago, Steve Hemminger suggested
> > placing checks in long RCU read-side critical sections -- this could
> > be used to keep the worst-case reader latency within the desired bounds.
> > Not pretty, but, then again, bounding lock latency will require a fair
> > number of similar changes, I suspect.
> > 
> > 						Thanx, Paul
> > 
> 
> Esben
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-23  5:40                   ` Paul E. McKenney
@ 2005-03-23 11:44                     ` Esben Nielsen
  2005-03-24  7:02                       ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Esben Nielsen @ 2005-03-23 11:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Tue, 22 Mar 2005, Paul E. McKenney wrote:

> On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> [ . . . ]
> > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > This is in some ways similar to the K42 approach to RCU (which they call
> > > "generations").  Dipankar put together a similar patch for Linux, but
> > > the problem was that grace periods could be deferred for an extremely
> > > long time.  Which I suspect is what you were calling out as causing
> > > RCU batches never to run.
> > 
> > That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> > Then it can't take longer than the normal kernel in the situation where
> > there is no RT tasks running. RT tasks will prolong the grace periods if
> > they go into RCU regions, but they are supposed to be relatively small -
> > and deterministic!
> 
> The part that I am missing is how this helps in the case where a non-RT
> task gets preempted in the middle of an RCU read-side critical section
> indefinitely.  Or are you boosting the priority of any task that
> enters an RCU read-side critical section?

Yes, in effect: I set the priority to MAX_RT_PRIO. But actually I am
playing around (when I get time for it, that is :-( ) with a cheaper
solution:
I assume the regions where you don't want to be preempted by non-RT
tasks are relatively short, so the risk of getting preempted there is
small. Moving the priority is expensive, since you need to lock the
runqueue, so I only want to do the move when a preemption actually
happens. Therefore I added code in schedule() to take care of it: if a
task is in an RCU read section, is non-RT and is preempted, its
priority is set to MAX_RT_PRIO for the time being. It will keep that
priority until the priority is recalculated, but that shouldn't hurt
anyone.
I am not happy about adding code to schedule(), but setting the
priority in there is very cheap because it already holds the lock
on the runqueue. Furthermore, I assume it only happens very rarely. In
the execution of schedule() my code only adds a single test on whether
the previous task was in an RCU section or not. That is not very much
code.

I have not yet tested it (no time :-( )
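
An illustrative sketch of that check (not the actual code; rcu_read_depth
is the hypothetical per-task counter from earlier in the thread, and the
exact boost value is a tuning choice) - schedule() would gain something
like:

	/* In schedule(), once 'prev' has been selected for switch-out: */
	if (unlikely(prev->rcu_read_depth &&		/* inside an RCU read side */
		     prev->state == TASK_RUNNING &&	/* preempted, not sleeping */
		     !rt_task(prev))) {
		/*
		 * Boost to an RT priority so the reader cannot be starved;
		 * a complete version would also requeue prev into the matching
		 * priority array, and the normal priority comes back the next
		 * time it is recalculated.
		 */
		prev->prio = MAX_RT_PRIO - 1;
	}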


> [...]
> > > Yes, but this is true of every other lock in the system as well, not?
> > 
> > Other locks are not globaly used but only used for a specific subsystem.
> > On a real-time system you are supposed to know which subsystems you can
> > call into and still have a low enough latency as each subsystem has it's
> > own bound. But with a global RCU locking mechanism all RCU using code is
> > to be regarded as _one_ such subsystem.
> 
> Yep.  As would the things protected by the dcache lock, task list lock,
> and so on, right?

Yep

> 
> 						Thanx, Paul
> 
Esben


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-Time Preemption and RCU
  2005-03-23 11:44                     ` Esben Nielsen
@ 2005-03-24  7:02                       ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2005-03-24  7:02 UTC (permalink / raw)
  To: Esben Nielsen
  Cc: Ingo Molnar, dipankar, shemminger, akpm, torvalds, rusty, tgall,
	jim.houston, manfred, gh, linux-kernel

On Wed, Mar 23, 2005 at 12:44:52PM +0100, Esben Nielsen wrote:
> On Tue, 22 Mar 2005, Paul E. McKenney wrote:
> 
> > On Tue, Mar 22, 2005 at 09:55:26AM +0100, Esben Nielsen wrote:
> > > On Mon, 21 Mar 2005, Paul E. McKenney wrote:
> > [ . . . ]
> > > > On Mon, Mar 21, 2005 at 12:23:22AM +0100, Esben Nielsen wrote:
> > > > This is in some ways similar to the K42 approach to RCU (which they call
> > > > "generations").  Dipankar put together a similar patch for Linux, but
> > > > the problem was that grace periods could be deferred for an extremely
> > > > long time.  Which I suspect is what you were calling out as causing
> > > > RCU batches never to run.
> > > 
> > > That is where the preempt_by_nonrt_disable/enable() is supposed to help:
> > > Then it can't take longer than the normal kernel in the situation where
> > > there is no RT tasks running. RT tasks will prolong the grace periods if
> > > they go into RCU regions, but they are supposed to be relatively small -
> > > and deterministic!
> > 
> > The part that I am missing is how this helps in the case where a non-RT
> > task gets preempted in the middle of an RCU read-side critical section
> > indefinitely.  Or are you boosting the priority of any task that
> > enters an RCU read-side critical section?
> 
> Yes in effect: I set the priority to MAX_RT_PRIO. But actually I am
> playing around (when I get time for it that is :-( ) with cheaper
> solution: 
> I assume you enter these regions where you don't want to be
> preempted by non-RT tasks are relatively short. Therefore the risc of
> getting preempted is small. Moving the priority is expensive since you
> need to lock the runqueue. I only want to do the movement when
> there is an preemption. Therefore I added code in schedule() to take care
> of it: If a task is in a rcu-read section, is non-RT and is preempted it's
> priority is set to MAX_RT_PRIO for the time being. It will keep that
> priority until the priority is recalculated, but that shouldn't hurt
> anyone. 
> I am not happy about adding code to schedule() but setting the
> priority in there is very cheap because it already has the lock
> on the runqueue. Furthermore, I assume it only happens very rarely. In the
> execution of schedule() my code only takes a single test on wether the
> previous task was in a rcu-section or not. That is not very much code.

Interesting approach -- could come in handy.

> I have not yet tested it (no time :-( )

Well, being as I haven't got the lock-based scheme fully running yet,
I can't give you too much trouble about that.  :-/

							Thanx, Paul

> > [...]
> > > > Yes, but this is true of every other lock in the system as well, not?
> > > 
> > > Other locks are not globaly used but only used for a specific subsystem.
> > > On a real-time system you are supposed to know which subsystems you can
> > > call into and still have a low enough latency as each subsystem has it's
> > > own bound. But with a global RCU locking mechanism all RCU using code is
> > > to be regarded as _one_ such subsystem.
> > 
> > Yep.  As would the things protected by the dcache lock, task list lock,
> > and so on, right?
> 
> Yep
> 
> > 
> > 						Thanx, Paul
> > 
> Esben
> 
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* real-time preemption and RCU
@ 2009-06-11 22:57 James Huang
  0 siblings, 0 replies; 58+ messages in thread
From: James Huang @ 2009-06-11 22:57 UTC (permalink / raw)
  To: linux-kernel, paulmck


Hi Paul,
 
        I have read through your 2005 document on real-time preemption and RCU.
It was very interesting, and your approach to the problem (gradual improvement in each new implementation) makes the ideas very clear.
However, I am baffled by the following potential race condition that exists in implementations 2 through 5.
To keep the case simple, let's choose implementation 2 to illustrate:
 
 
CPU0      |<-- delete M1 -->|            ||          |<-- delete M2 -->|      |<-- delete M3 -->|
                                         ||
CPU1    |<----- read M1 ----->|          ||     |<-------------------- read M2 -------------------->|
                                         ||
CPU2                                 time T: execute synchronize_kernel: rcu_ctrlblk.batch++

Assume initially 

rcu_data[cpu0].batch = 1
rcu_data[cpu1].batch = 1
rcu_data[cpu2].batch = 1
rcu_ctrlblk.batch = 1

The following steps are executed:
(1)  cpu1 read-locked rcu_ctrlblk.lock, read M1, read-unlocked rcu_ctrlblk.lock
(2)  cpu0 deleted M1
(3)  At time T (marked by || ), cpu2 executed synchronize_kernel: write-locked rcu_ctrlblk.lock, incremented rcu_ctrlblk.batch to 2, and write-unlocked rcu_ctrlblk.lock
(4) cpu1 read-locked rcu_ctrlblk.lock, spent a long time in its rcu read-side critical section, read M2, read-unlocked rcu_ctrlblk.lock
(5) cpu0 deleted M2.  But when it executed run_rcu(), cpu0 DID NOT see the most up-to-date value of rcu_ctrlblk.batch.
     So cpu0 just inserted M2 into cpu0's waitlist, but did not free up M1 and did not update rcu_data[cpu0].batch (i.e. it was still equal to 1).
(6) cpu0 deleted M3. At this time cpu0 saw the most up-to-date value of rcu_ctrlblk.batch (2).
     Since rcu_ctrlblk.batch (2) is larger than rcu_data[cpu0].batch (1), cpu0 freed up memory blocks in its waitlist.
     So both M1 and M2 were freed up by cpu0.  But if cpu1 was still accessing M2, this will be a problem.

Am I missing something here?  Does the smp_mb() within run_do_my_batch() have anything to do with this issue?


-- James Huang


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2009-06-11 23:04 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-03-18  0:20 Real-Time Preemption and RCU Paul E. McKenney
2005-03-18  7:49 ` Ingo Molnar
2005-03-18 16:43   ` Paul E. McKenney
2005-03-18 17:11     ` Ingo Molnar
2005-03-18 17:29       ` Paul E. McKenney
2005-03-18 20:35       ` Paul E. McKenney
2005-03-18 22:22         ` Paul E. McKenney
2005-03-19  0:48           ` Paul E. McKenney
2005-03-18  8:44 ` Ingo Molnar
2005-03-18  9:04 ` Ingo Molnar
2005-03-18  9:38   ` Ingo Molnar
2005-03-18  9:13 ` Ingo Molnar
2005-03-18  9:28   ` Ingo Molnar
2005-03-18  9:53     ` Ingo Molnar
2005-03-18 15:33       ` Paul E. McKenney
2005-03-19  5:03     ` Manfred Spraul
2005-03-19 16:26       ` Ingo Molnar
2005-03-20  6:36         ` Manfred Spraul
2005-03-20  9:25           ` Thomas Gleixner
2005-03-20 16:57             ` Manfred Spraul
2005-03-20 21:38               ` Bill Huey
2005-03-20 21:59                 ` Bill Huey
2005-03-18 10:03 ` Ingo Molnar
2005-03-18 11:30   ` Ingo Molnar
2005-03-18 16:48     ` Esben Nielsen
2005-03-18 17:19       ` Ingo Molnar
2005-03-20 13:29         ` Esben Nielsen
2005-03-20 22:38           ` Paul E. McKenney
2005-03-20 23:23             ` Esben Nielsen
2005-03-22  5:53               ` Paul E. McKenney
2005-03-22  8:55                 ` Esben Nielsen
2005-03-22  9:20                   ` Ingo Molnar
2005-03-22 10:19                     ` Esben Nielsen
2005-03-23  5:40                   ` Paul E. McKenney
2005-03-23 11:44                     ` Esben Nielsen
2005-03-24  7:02                       ` Paul E. McKenney
2005-03-22 10:56           ` Ingo Molnar
2005-03-22 11:39             ` Esben Nielsen
2005-03-22 13:10               ` Ingo Molnar
2005-03-22 15:08                 ` Esben Nielsen
2005-03-18 15:48   ` Paul E. McKenney
2005-03-18 11:38 ` Ingo Molnar
2005-03-18 12:56 ` Bill Huey
2005-03-18 13:17   ` Bill Huey
2005-03-18 15:57     ` Paul E. McKenney
2005-03-18 16:02     ` Ingo Molnar
2005-03-18 16:55       ` Esben Nielsen
2005-03-22 10:04         ` Bill Huey
2005-03-22 10:17           ` Bill Huey
2005-03-22 10:34             ` Bill Huey
2005-03-22 10:38           ` Esben Nielsen
2005-03-18 22:26       ` Herbert Xu
2005-03-19 16:31         ` Ingo Molnar
2005-03-20  8:01           ` Kyle Moffett
2005-03-22  8:08             ` Ingo Molnar
2005-03-18 15:54   ` Paul E. McKenney
2005-03-18 15:58     ` Ingo Molnar
2009-06-11 22:57 real-time preemption " James Huang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).