Kernel rwlock design, Multicore and IGMP

* Kernel rwlock design, Multicore and IGMP
@ 2010-11-11 13:49 Cypher Wu
  2010-11-11 15:23 ` Eric Dumazet
  2010-11-13 22:52 ` Kernel rwlock design, Multicore and IGMP Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: Cypher Wu @ 2010-11-11 13:49 UTC (permalink / raw)
  To: linux-kernel

I'm using TILEPro and its rwlock in kernel is a liitle different than
other platforms. It have a priority for write lock that when tried it
will block the following read lock even if read lock is hold by
others. Its code can be read in Linux Kernel 2.6.36 in
arch/tile/lib/spinlock_32.c.

That different could cause a deadlock in kernel if we join/leave
Multicast Group simultaneous and frequently on mutlicores. IGMP
message is sent by

igmp_ifc_timer_expire() -> igmpv3_send_cr() -> igmpv3_sendpack()

in timer interrupt, igmpv3_send_cr() will generate the sk_buff for
IGMP message with mc_list_lock read locked and then call
igmpv3_sendpack() with it unlocked.
But if we have so many join/leave messages have to generate and it
can't be sent in one sk_buff then igmpv3_send_cr() -> add_grec() will
call igmpv3_sendpack() to send it and reallocate a new buffer. When
the message is sent:

__mkroute_output() -> ip_check_mc()

will read lock mc_list_lock again. If there is another core is try
write lock mc_list_lock between the two read lock, then deadlock
ocurred.

The rwlock on other platforms I've check, say, PowerPC, x86, ARM, is
just read lock shared and write_lock mutex, so if we've hold read lock
the write lock will just wait, and if there have a read lock again it
will success.

So, What's the criteria of rwlock design in the Linux kernel? Is that
read lock re-hold of IGMP a design error in Linux kernel, or the read
lock has to be design like that?

There is a other thing, that the timer interrupt will start timer on
the same in_dev, should that be optimized?

BTW: If we have so many cores, say 64, is there other things we have
to think about spinlock? If there have collisions ocurred, should we
just read the shared memory again and again, or just a very little
'delay' is better? I've seen relax() is called in the implementation
of spinlock on TILEPro platform.

^ permalink raw reply	[flat|nested] 42+ messages in thread