* Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro
@ 2011-02-17  3:39 Cypher Wu
  2011-02-17  4:49 ` Américo Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Cypher Wu @ 2011-02-17  3:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Chris Metcalf, Américo Wang, Eric Dumazet, netdev

---------- Forwarded message ----------
From: Cypher Wu <cypher.w@gmail.com>
Date: Wed, Feb 16, 2011 at 5:58 PM
Subject: IGMP and rwlock: Deadlock occurred again on TILEPro
To: linux-kernel@vger.kernel.org


The rwlock and spinlock implementations on the TILEPro platform use the
TNS instruction to test the lock value, but if interrupts are not masked,
read_lock() can still deadlock when read_lock() is called again from the
bottom half (softirq) of an interrupt.

The original code:
void __raw_read_lock_slow(raw_rwlock_t *rwlock, u32 val)
{
   u32 iterations = 0;
   do {
       if (!(val & 1))
           rwlock->lock = val;
       delay_backoff(iterations++);
       val = __insn_tns((int *)&rwlock->lock);
   } while ((val << RD_COUNT_WIDTH) != 0);
   rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

I've modified it to get some information:
void __raw_read_lock_slow(raw_rwlock_t *rwlock, u32 val)
{
   u32 iterations = 0;
   do {
       if (!(val & 1))
       {
           rwlock->lock = val;
           iterations = 0;
       }
       delay_backoff(iterations++);
       if (iterations > 0x1000000)
       {
           dump_stack();
           iterations = 0;
       }

       val = __insn_tns((int *)&rwlock->lock);
   } while ((val << RD_COUNT_WIDTH) != 0);
   rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

And this is the stack info:

Starting stack dump of tid 837, pid 837 (ff0) on cpu 55 at cycle 10180633928773
 frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
 frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
 frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
 frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
 frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
 frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
 frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
 frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
 frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
 frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
 <interrupt 25 while in kernel mode>
 frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
 frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
 frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
 frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
 frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
 frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
 <syscall while in user mode>
 frame 16: 0x3342a0 (sp 0xbfddfb00)
 frame 17: 0x16130 (sp 0xbfddfb08)
 frame 18: 0x16640 (sp 0xbfddfb38)
 frame 19: 0x16ee8 (sp 0xbfddfc58)
 frame 20: 0x345a08 (sp 0xbfddfc90)
 frame 21: 0x10218 (sp 0xbfddfe48)
Stack dump complete

I don't know the precise semantics rwlock and spinlock are required to
have in Linux, but the implementations on other platforms such as x86,
PowerPC and ARM don't have this issue. The use of TNS causes a race
condition between system-call context and interrupt context.
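
Roughly, the bad interleaving looks like this (a simplified sketch; the
__raw_read_lock() fast path quoted later in this thread has this shape):

  /* process context (the setsockopt path above) calls read_lock() */
  val = __insn_tns((int *)&rwlock->lock);  /* lock word now temporarily reads 1 */

  /* timer interrupt on the same cpu: the softirq (igmp_ifc_timer_expire)
   * calls read_lock() on the same rwlock, sees the word == 1 ("tns in
   * progress") and backs off and retries forever, because only the
   * interrupted context can store the real value back */

  rwlock->lock = val + (1 << _RD_COUNT_SHIFT);  /* never reached -> deadlock */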

Along the packet-sending call tree some other rwlocks are taken as well,
e.g. read_lock(&fib_hash_lock) in fn_hash_lookup(), which is called from
ip_route_output_slow(). I've seen a deadlock on fib_hash_lock too, but
haven't reproduced it with this debug instrumentation yet.

IGMP may not be the only affected path: the TCP retransmit timer will
also end up calling read_lock(&fib_hash_lock).

--
Cyberman Wu



-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  3:39 Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro Cypher Wu
@ 2011-02-17  4:49 ` Américo Wang
  2011-02-17  5:04   ` Cypher Wu
  0 siblings, 1 reply; 21+ messages in thread
From: Américo Wang @ 2011-02-17  4:49 UTC (permalink / raw)
  To: Cypher Wu
  Cc: linux-kernel, Chris Metcalf, Américo Wang, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 11:39:22AM +0800, Cypher Wu wrote:
>---------- Forwarded message ----------
>From: Cypher Wu <cypher.w@gmail.com>
>Date: Wed, Feb 16, 2011 at 5:58 PM
>Subject: GMP and rwlock: Dead ocurred again on TILEPro
>To: linux-kernel@vger.kernel.org
>
>
>The rwlock and spinlock of TILEPro platform use TNS instruction to
>test the value of lock, but if interrupt is not masked, read_lock()
>have another chance to deadlock while read_lock() called in bh of
>interrupt.
>


In this case, you should call read_lock_bh() instead of read_lock().


> frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
> frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
> frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
> frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
> frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
> frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
> frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
> frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
> frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
> frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
> <interrupt 25 while in kernel mode>
> frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
> frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
> frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
> frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
> frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
> frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
> <syscall while in user mode>
> frame 16: 0x3342a0 (sp 0xbfddfb00)
> frame 17: 0x16130 (sp 0xbfddfb08)
> frame 18: 0x16640 (sp 0xbfddfb38)
> frame 19: 0x16ee8 (sp 0xbfddfc58)
> frame 20: 0x345a08 (sp 0xbfddfc90)
> frame 21: 0x10218 (sp 0xbfddfe48)
>Stack dump complete
>
>I don't know the clear definition of rwlock & spinlock in Linux, but
>the implementation of other platforms
>like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
>race condition between system
>call and interrupt.
>

Have you turned CONFIG_LOCKDEP on?

I think Eric already converted that rwlock into RCU lock, thus
this problem should disappear. Could you try a new kernel?
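
For reference, the direction of that conversion is to make the read side
RCU instead of an rwlock (a rough sketch, not the actual commit):

  rcu_read_lock();
  /* walk the multicast list under RCU protection */
  rcu_read_unlock();

rcu_read_lock() never spins and nests freely, so the interrupt-reentrancy
problem goes away.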

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  4:49 ` Américo Wang
@ 2011-02-17  5:04   ` Cypher Wu
  2011-02-17  5:42     ` Américo Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Cypher Wu @ 2011-02-17  5:04 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, Chris Metcalf, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 12:49 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 11:39:22AM +0800, Cypher Wu wrote:
>>---------- Forwarded message ----------
>>From: Cypher Wu <cypher.w@gmail.com>
>>Date: Wed, Feb 16, 2011 at 5:58 PM
>>Subject: GMP and rwlock: Dead ocurred again on TILEPro
>>To: linux-kernel@vger.kernel.org
>>
>>
>>The rwlock and spinlock of TILEPro platform use TNS instruction to
>>test the value of lock, but if interrupt is not masked, read_lock()
>>have another chance to deadlock while read_lock() called in bh of
>>interrupt.
>>
>
>
> In this case, you should call read_lock_bh() instead of read_lock().
>
>
>> frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
>> frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
>> frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
>> frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
>> frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
>> frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
>> frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
>> frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
>> frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
>> frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
>> <interrupt 25 while in kernel mode>
>> frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
>> frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
>> frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
>> frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
>> frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
>> frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
>> <syscall while in user mode>
>> frame 16: 0x3342a0 (sp 0xbfddfb00)
>> frame 17: 0x16130 (sp 0xbfddfb08)
>> frame 18: 0x16640 (sp 0xbfddfb38)
>> frame 19: 0x16ee8 (sp 0xbfddfc58)
>> frame 20: 0x345a08 (sp 0xbfddfc90)
>> frame 21: 0x10218 (sp 0xbfddfe48)
>>Stack dump complete
>>
>>I don't know the clear definition of rwlock & spinlock in Linux, but
>>the implementation of other platforms
>>like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
>>race condition between system
>>call and interrupt.
>>
>
> Have you turned CONFIG_LOCKDEP on?
>
> I think Eric already converted that rwlock into RCU lock, thus
> this problem should disappear. Could you try a new kernel?
>
> Thanks.
>

I haven't turned CONFIG_LOCKDEP on for this test, since it didn't give us
much information when we tried to figure out the previous deadlock.

IGMP uses read_lock() rather than read_lock_bh() because read_lock() can
normally be taken recursively, and today I also read the MIPS
implementation, which should work fine in that situation. The TILEPro
implementation causes the problem because there is a race window after
TNS sets the lock value to 1 (returning the original value) and before
the code stores a new value back into the lock.

It's not practical to upgrade the kernel.

Thanks.

-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  5:04   ` Cypher Wu
@ 2011-02-17  5:42     ` Américo Wang
  2011-02-17  5:46       ` David Miller
  2011-02-17  6:42       ` Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro Cypher Wu
  0 siblings, 2 replies; 21+ messages in thread
From: Américo Wang @ 2011-02-17  5:42 UTC (permalink / raw)
  To: Cypher Wu
  Cc: Américo Wang, linux-kernel, Chris Metcalf, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>
>> Have you turned CONFIG_LOCKDEP on?
>>
>> I think Eric already converted that rwlock into RCU lock, thus
>> this problem should disappear. Could you try a new kernel?
>>
>> Thanks.
>>
>
>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>much information when we tried to figured out the former deadlock.
>
>IGMP used read_lock() instead of read_lock_bh() since usually
>read_lock() can be called recursively, and today I've read the
>implementation of MIPS, it's should also works fine in that situation.
>The implementation of TILEPro cause problem since after it use TNS set
>the lock-val to 1 and hold the original value and before it re-set
>lock-val a new value, it a race condition window.
>

I see no reason why you can't call read_lock_bh() recursively;
read_lock_bh() is roughly equivalent to local_bh_disable() + read_lock(),
and both of those can be recursive.
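
Roughly, the equivalence is (a sketch, not the kernel's literal
definition; the helper name here is made up):

  static inline void read_lock_bh_sketch(rwlock_t *lock)
  {
          local_bh_disable();  /* keep softirqs (e.g. the IGMP timer) off this cpu */
          read_lock(lock);     /* then take the reader side as usual */
  }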

But I may be missing something here. :-/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  5:42     ` Américo Wang
@ 2011-02-17  5:46       ` David Miller
  2011-02-17  6:39         ` Eric Dumazet
  2011-02-17 22:49         ` Chris Metcalf
  2011-02-17  6:42       ` Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro Cypher Wu
  1 sibling, 2 replies; 21+ messages in thread
From: David Miller @ 2011-02-17  5:46 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: cypher.w, linux-kernel, cmetcalf, eric.dumazet, netdev

From: Américo Wang <xiyou.wangcong@gmail.com>
Date: Thu, 17 Feb 2011 13:42:37 +0800

> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>>
>>> Have you turned CONFIG_LOCKDEP on?
>>>
>>> I think Eric already converted that rwlock into RCU lock, thus
>>> this problem should disappear. Could you try a new kernel?
>>>
>>> Thanks.
>>>
>>
>>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>>much information when we tried to figured out the former deadlock.
>>
>>IGMP used read_lock() instead of read_lock_bh() since usually
>>read_lock() can be called recursively, and today I've read the
>>implementation of MIPS, it's should also works fine in that situation.
>>The implementation of TILEPro cause problem since after it use TNS set
>>the lock-val to 1 and hold the original value and before it re-set
>>lock-val a new value, it a race condition window.
>>
> 
> I see no reason why you can't call read_lock_bh() recursively,
> read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
> both can be recursive.
> 
> But I may miss something here. :-/

IGMP is doing this so that taking the read lock does not stop packet
processing.

TILEPro's rwlock implementation is simply buggy and needs to be fixed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  5:46       ` David Miller
@ 2011-02-17  6:39         ` Eric Dumazet
  2011-02-17 22:49         ` Chris Metcalf
  1 sibling, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2011-02-17  6:39 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, cmetcalf, netdev

On Wednesday 16 February 2011 at 21:46 -0800, David Miller wrote:
> From: Américo Wang <xiyou.wangcong@gmail.com>
> Date: Thu, 17 Feb 2011 13:42:37 +0800
> 
> > On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
> >>>
> >>> Have you turned CONFIG_LOCKDEP on?
> >>>
> >>> I think Eric already converted that rwlock into RCU lock, thus
> >>> this problem should disappear. Could you try a new kernel?
> >>>
> >>> Thanks.
> >>>
> >>
> >>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
> >>much information when we tried to figured out the former deadlock.
> >>
> >>IGMP used read_lock() instead of read_lock_bh() since usually
> >>read_lock() can be called recursively, and today I've read the
> >>implementation of MIPS, it's should also works fine in that situation.
> >>The implementation of TILEPro cause problem since after it use TNS set
> >>the lock-val to 1 and hold the original value and before it re-set
> >>lock-val a new value, it a race condition window.
> >>
> > 
> > I see no reason why you can't call read_lock_bh() recursively,
> > read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
> > both can be recursive.
> > 
> > But I may miss something here. :-/
> 
> IGMP is doing this so that taking the read lock does not stop packet
> processing.
> 
> TILEPro's rwlock implementation is simply buggy and needs to be fixed.

Yep. Finding all the recursive read locks in the kernel and converting
them to another locking model is probably more expensive than fixing the
TILEPro rwlock implementation.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Fwd: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  5:42     ` Américo Wang
  2011-02-17  5:46       ` David Miller
@ 2011-02-17  6:42       ` Cypher Wu
  1 sibling, 0 replies; 21+ messages in thread
From: Cypher Wu @ 2011-02-17  6:42 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, Chris Metcalf, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 1:42 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>>
>>> Have you turned CONFIG_LOCKDEP on?
>>>
>>> I think Eric already converted that rwlock into RCU lock, thus
>>> this problem should disappear. Could you try a new kernel?
>>>
>>> Thanks.
>>>
>>
>>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>>much information when we tried to figured out the former deadlock.
>>
>>IGMP used read_lock() instead of read_lock_bh() since usually
>>read_lock() can be called recursively, and today I've read the
>>implementation of MIPS, it's should also works fine in that situation.
>>The implementation of TILEPro cause problem since after it use TNS set
>>the lock-val to 1 and hold the original value and before it re-set
>>lock-val a new value, it a race condition window.
>>
>
> I see no reason why you can't call read_lock_bh() recursively,
> read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
> both can be recursive.
>
> But I may miss something here. :-/
>

Of course read_lock_bh() can be called recursively, but read_lock() is
already sufficient for IGMP; the only reason for the deadlock is that the
TNS instruction's temporary store of 1 into the lock word opens another
race window.

-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17  5:46       ` David Miller
  2011-02-17  6:39         ` Eric Dumazet
@ 2011-02-17 22:49         ` Chris Metcalf
  2011-02-17 22:53           ` David Miller
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Metcalf @ 2011-02-17 22:49 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev

On 2/17/2011 12:46 AM, David Miller wrote:
> From: Américo Wang <xiyou.wangcong@gmail.com>
> Date: Thu, 17 Feb 2011 13:42:37 +0800
>
>> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>>> Have you turned CONFIG_LOCKDEP on?
>>>>
>>>> I think Eric already converted that rwlock into RCU lock, thus
>>>> this problem should disappear. Could you try a new kernel?
>>>>
>>>> Thanks.
>>>>
>>> I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>>> much information when we tried to figured out the former deadlock.
>>>
>>> IGMP used read_lock() instead of read_lock_bh() since usually
>>> read_lock() can be called recursively, and today I've read the
>>> implementation of MIPS, it's should also works fine in that situation.
>>> The implementation of TILEPro cause problem since after it use TNS set
>>> the lock-val to 1 and hold the original value and before it re-set
>>> lock-val a new value, it a race condition window.
>>>
>> I see no reason why you can't call read_lock_bh() recursively,
>> read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(),
>> both can be recursive.
>>
>> But I may miss something here. :-/
> IGMP is doing this so that taking the read lock does not stop packet
> processing.
>
> TILEPro's rwlock implementation is simply buggy and needs to be fixed.

Cypher, thanks for tracking this down with a good bug report.

The fix is to disable interrupts for the arch_read_lock family of methods. 
In my fix I'm using the "hard" disable that locks out NMIs as well, so that
in the event the NMI handler needs to share an rwlock with regular code it
would be possible (plus, it's more efficient).  I believe it's not
necessary to worry about similar protection for the arch_write_lock
methods, since they aren't guaranteed to be re-entrant anyway (you'd have
to use write_lock_irqsave or equivalent).

I'll send the patch to LKML after letting it bake internally for a little
while.

Thanks again!

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 22:49         ` Chris Metcalf
@ 2011-02-17 22:53           ` David Miller
  2011-02-17 23:04             ` Chris Metcalf
  0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2011-02-17 22:53 UTC (permalink / raw)
  To: cmetcalf; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev

From: Chris Metcalf <cmetcalf@tilera.com>
Date: Thu, 17 Feb 2011 17:49:46 -0500

> The fix is to disable interrupts for the arch_read_lock family of methods. 

How does that help handle the race when it happens between different
cpus, instead of between IRQ and non-IRQ context on the same CPU?

Why don't you just use the generic spinlock based rwlock code on Tile,
since that is all that your atomic instructions can handle
sufficiently?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 22:53           ` David Miller
@ 2011-02-17 23:04             ` Chris Metcalf
  2011-02-17 23:11               ` David Miller
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Metcalf @ 2011-02-17 23:04 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev

On 2/17/2011 5:53 PM, David Miller wrote:
> From: Chris Metcalf <cmetcalf@tilera.com>
> Date: Thu, 17 Feb 2011 17:49:46 -0500
>
>> The fix is to disable interrupts for the arch_read_lock family of methods. 
> How does that help handle the race when it happens between different
> cpus, instead of between IRQ and non-IRQ context on the same CPU?

There's no race in that case, since the lock code properly backs off and
retries until the other cpu frees it.  The distinction here is that the
non-IRQ context is "wedged" by the IRQ context.

> Why don't you just use the generic spinlock based rwlock code on Tile,
> since that is all that your atomic instructions can handle
> sufficiently?

The tile-specific code encodes reader/writer information in the same 32-bit
word that the test-and-set instruction manipulates, so it's more efficient
both in space and time.  This may not really matter for rwlocks, since no
one cares much about them any more, but that was the motivation.
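
For reference, the fields packed into that 32-bit word, as far as the
constants quoted elsewhere in this thread show (a sketch; the exact
writer-ticket shift values are not visible in this excerpt):

  /* bit 0                          : reads as 1 while a tns/store-back pair is in flight
   * _WR_CURR_SHIFT / _WR_NEXT_SHIFT: "current" and "next" writer tickets
   * _RD_COUNT_SHIFT (24), width 8  : reader count in the top byte
   */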

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 23:04             ` Chris Metcalf
@ 2011-02-17 23:11               ` David Miller
  2011-02-17 23:18                 ` Chris Metcalf
  0 siblings, 1 reply; 21+ messages in thread
From: David Miller @ 2011-02-17 23:11 UTC (permalink / raw)
  To: cmetcalf; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev

From: Chris Metcalf <cmetcalf@tilera.com>
Date: Thu, 17 Feb 2011 18:04:13 -0500

> On 2/17/2011 5:53 PM, David Miller wrote:
>> From: Chris Metcalf <cmetcalf@tilera.com>
>> Date: Thu, 17 Feb 2011 17:49:46 -0500
>>
>>> The fix is to disable interrupts for the arch_read_lock family of methods. 
>> How does that help handle the race when it happens between different
>> cpus, instead of between IRQ and non-IRQ context on the same CPU?
> 
> There's no race in that case, since the lock code properly backs off and
> retries until the other cpu frees it.  The distinction here is that the
> non-IRQ context is "wedged" by the IRQ context.
> 
>> Why don't you just use the generic spinlock based rwlock code on Tile,
>> since that is all that your atomic instructions can handle
>> sufficiently?
> 
> The tile-specific code encodes reader/writer information in the same 32-bit
> word that the test-and-set instruction manipulates, so it's more efficient
> both in space and time.  This may not really matter for rwlocks, since no
> one cares much about them any more, but that was the motivation.

Ok, but IRQ disabling is going to be very expensive.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 23:11               ` David Miller
@ 2011-02-17 23:18                 ` Chris Metcalf
  2011-02-18  3:16                   ` Cypher Wu
                                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Chris Metcalf @ 2011-02-17 23:18 UTC (permalink / raw)
  To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev

On 2/17/2011 6:11 PM, David Miller wrote:
> From: Chris Metcalf <cmetcalf@tilera.com>
> Date: Thu, 17 Feb 2011 18:04:13 -0500
>
>> On 2/17/2011 5:53 PM, David Miller wrote:
>>> From: Chris Metcalf <cmetcalf@tilera.com>
>>> Date: Thu, 17 Feb 2011 17:49:46 -0500
>>>
>>>> The fix is to disable interrupts for the arch_read_lock family of methods. 
>>> How does that help handle the race when it happens between different
>>> cpus, instead of between IRQ and non-IRQ context on the same CPU?
>> There's no race in that case, since the lock code properly backs off and
>> retries until the other cpu frees it.  The distinction here is that the
>> non-IRQ context is "wedged" by the IRQ context.
>>
>>> Why don't you just use the generic spinlock based rwlock code on Tile,
>>> since that is all that your atomic instructions can handle
>>> sufficiently?
>> The tile-specific code encodes reader/writer information in the same 32-bit
>> word that the test-and-set instruction manipulates, so it's more efficient
>> both in space and time.  This may not really matter for rwlocks, since no
>> one cares much about them any more, but that was the motivation.
> Ok, but IRQ disabling is going to be very expensive.

The interrupt architecture on Tile allows a write to a special-purpose
register to put you into a "critical section" where no interrupts or faults
are delivered.  So we just need to bracket the read_lock operations with
two SPR writes; each takes six machine cycles, so we're only adding 12
cycles to the total cost of taking or releasing a read lock on an rwlock.
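
Schematically, the bracketing looks like this (a sketch; the SPR name and
intrinsics are the ones used in the code later in this thread, and
"new_val" is just a placeholder for the updated lock word):

  __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);  /* no interrupts or faults delivered */
  val = __insn_tns((int *)&rwlock->lock);           /* lock word briefly reads 1 */
  rwlock->lock = new_val;                           /* store the real value back */
  __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);  /* interrupts deliverable again */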

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 23:18                 ` Chris Metcalf
@ 2011-02-18  3:16                   ` Cypher Wu
  2011-02-18  3:19                   ` Cypher Wu
  2011-02-18  7:08                   ` Cypher Wu
  2 siblings, 0 replies; 21+ messages in thread
From: Cypher Wu @ 2011-02-18  3:16 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev

On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 2/17/2011 6:11 PM, David Miller wrote:
>> From: Chris Metcalf <cmetcalf@tilera.com>
>> Date: Thu, 17 Feb 2011 18:04:13 -0500
>>
>>> On 2/17/2011 5:53 PM, David Miller wrote:
>>>> From: Chris Metcalf <cmetcalf@tilera.com>
>>>> Date: Thu, 17 Feb 2011 17:49:46 -0500
>>>>
>>>>> The fix is to disable interrupts for the arch_read_lock family of methods.
>>>> How does that help handle the race when it happens between different
>>>> cpus, instead of between IRQ and non-IRQ context on the same CPU?
>>> There's no race in that case, since the lock code properly backs off and
>>> retries until the other cpu frees it.  The distinction here is that the
>>> non-IRQ context is "wedged" by the IRQ context.
>>>
>>>> Why don't you just use the generic spinlock based rwlock code on Tile,
>>>> since that is all that your atomic instructions can handle
>>>> sufficiently?
>>> The tile-specific code encodes reader/writer information in the same 32-bit
>>> word that the test-and-set instruction manipulates, so it's more efficient
>>> both in space and time.  This may not really matter for rwlocks, since no
>>> one cares much about them any more, but that was the motivation.
>> Ok, but IRQ disabling is going to be very expensive.
>
> The interrupt architecture on Tile allows a write to a special-purpose
> register to put you into a "critical section" where no interrupts or faults
> are delivered.  So we just need to bracket the read_lock operations with
> two SPR writes; each takes six machine cycles, so we're only adding 12
> cycles to the total cost of taking or releasing a read lock on an rwlock.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>

I agree that disabling interrupts just for the read operations should be
enough, but read_unlock() is also a place where we should disable
interrupts, right? If an interrupt occurs while the lock value is held
after the TNS, the deadlock can still occur.

When will you release that patch? Time is tight, so I may have to fix it
up myself. Although the problem is clear now, I still have two questions
to confirm:

1. If we use SPR_INTERRUPT_CRITICAL_SECTION, it will disable all the
interrupts that are marked 'CM', is that right? Do we have to save its
original value and restore it later?
There is some code in Linux:
int __tns_atomic_acquire(atomic_t *lock)
{
	int ret;
	u32 iterations = 0;

	/* Must not already be inside an interrupt critical section. */
	BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION));
	__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);

	/* Spin while the tns keeps returning 1, i.e. the word is busy. */
	while ((ret = __insn_tns(&lock->counter)) == 1)
		delay_backoff(iterations++);
	return ret;
}
It just uses BUG_ON to check that SPR_INTERRUPT_CRITICAL_SECTION is 0;
does that mean this SPR is only meant to be used in special situations
like that?

2. Should we disable interrupts for the whole read_lock()/read_unlock()
operation, or should we leave the interrupt critical section when we fall
into __raw_read_lock_slow() and have to delay_backoff() for a while, and
then re-enter the critical section just before the TNS?




-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 23:18                 ` Chris Metcalf
  2011-02-18  3:16                   ` Cypher Wu
@ 2011-02-18  3:19                   ` Cypher Wu
  2011-02-18  7:08                   ` Cypher Wu
  2 siblings, 0 replies; 21+ messages in thread
From: Cypher Wu @ 2011-02-18  3:19 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev

On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 2/17/2011 6:11 PM, David Miller wrote:
>> From: Chris Metcalf <cmetcalf@tilera.com>
>> Date: Thu, 17 Feb 2011 18:04:13 -0500
>>
>>> On 2/17/2011 5:53 PM, David Miller wrote:
>>>> From: Chris Metcalf <cmetcalf@tilera.com>
>>>> Date: Thu, 17 Feb 2011 17:49:46 -0500
>>>>
>>>>> The fix is to disable interrupts for the arch_read_lock family of methods.
>>>> How does that help handle the race when it happens between different
>>>> cpus, instead of between IRQ and non-IRQ context on the same CPU?
>>> There's no race in that case, since the lock code properly backs off and
>>> retries until the other cpu frees it.  The distinction here is that the
>>> non-IRQ context is "wedged" by the IRQ context.
>>>
>>>> Why don't you just use the generic spinlock based rwlock code on Tile,
>>>> since that is all that your atomic instructions can handle
>>>> sufficiently?
>>> The tile-specific code encodes reader/writer information in the same 32-bit
>>> word that the test-and-set instruction manipulates, so it's more efficient
>>> both in space and time.  This may not really matter for rwlocks, since no
>>> one cares much about them any more, but that was the motivation.
>> Ok, but IRQ disabling is going to be very expensive.
>
> The interrupt architecture on Tile allows a write to a special-purpose
> register to put you into a "critical section" where no interrupts or faults
> are delivered.  So we just need to bracket the read_lock operations with
> two SPR writes; each takes six machine cycles, so we're only adding 12
> cycles to the total cost of taking or releasing a read lock on an rwlock.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>


By the way, other RISC platforms, say ARM and MIPS, use store-conditional
rather than TNS'ing a temporary value into the lock word; does Fx have
similar instructions?


-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-17 23:18                 ` Chris Metcalf
  2011-02-18  3:16                   ` Cypher Wu
  2011-02-18  3:19                   ` Cypher Wu
@ 2011-02-18  7:08                   ` Cypher Wu
  2011-02-18 21:51                     ` Chris Metcalf
  2011-03-01 18:30                     ` [PATCH] arch/tile: fix deadlock bugs in rwlock implementation Chris Metcalf
  2 siblings, 2 replies; 21+ messages in thread
From: Cypher Wu @ 2011-02-18  7:08 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev

On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 2/17/2011 6:11 PM, David Miller wrote:
>> From: Chris Metcalf <cmetcalf@tilera.com>
>> Date: Thu, 17 Feb 2011 18:04:13 -0500
>>
>>> On 2/17/2011 5:53 PM, David Miller wrote:
>>>> From: Chris Metcalf <cmetcalf@tilera.com>
>>>> Date: Thu, 17 Feb 2011 17:49:46 -0500
>>>>
>>>>> The fix is to disable interrupts for the arch_read_lock family of methods.
>>>> How does that help handle the race when it happens between different
>>>> cpus, instead of between IRQ and non-IRQ context on the same CPU?
>>> There's no race in that case, since the lock code properly backs off and
>>> retries until the other cpu frees it.  The distinction here is that the
>>> non-IRQ context is "wedged" by the IRQ context.
>>>
>>>> Why don't you just use the generic spinlock based rwlock code on Tile,
>>>> since that is all that your atomic instructions can handle
>>>> sufficiently?
>>> The tile-specific code encodes reader/writer information in the same 32-bit
>>> word that the test-and-set instruction manipulates, so it's more efficient
>>> both in space and time.  This may not really matter for rwlocks, since no
>>> one cares much about them any more, but that was the motivation.
>> Ok, but IRQ disabling is going to be very expensive.
>
> The interrupt architecture on Tile allows a write to a special-purpose
> register to put you into a "critical section" where no interrupts or faults
> are delivered.  So we just need to bracket the read_lock operations with
> two SPR writes; each takes six machine cycles, so we're only adding 12
> cycles to the total cost of taking or releasing a read lock on an rwlock.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>

Adding those two SPR writes should be fine, but won't it delay interrupts
a little more than other platforms' read_lock()?

Another question: what does "NMI" in the earlier mail mean?

Looking forward to your patch.

Regards

-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-18  7:08                   ` Cypher Wu
@ 2011-02-18 21:51                     ` Chris Metcalf
  2011-02-19  4:07                       ` Cypher Wu
  2011-03-01 18:30                     ` [PATCH] arch/tile: fix deadlock bugs in rwlock implementation Chris Metcalf
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Metcalf @ 2011-02-18 21:51 UTC (permalink / raw)
  To: Cypher Wu
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev

On 2/17/2011 10:16 PM, Cypher Wu wrote:
> On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
>> The interrupt architecture on Tile allows a write to a special-purpose
>> register to put you into a "critical section" where no interrupts or faults
>> are delivered.  So we just need to bracket the read_lock operations with
>> two SPR writes; each takes six machine cycles, so we're only adding 12
>> cycles to the total cost of taking or releasing a read lock on an rwlock
> I agree that just lock interrupt for read operations should be enough,
> but read_unlock() is also the place we should lock interrupt, right?
> If interrupt occurred when it hold lock-val after TNS deadlock still
> can occur.

Correct; that's what I meant by "read_lock operations".  This includes
lock, trylock, and unlock.

> When will you release out that patch? Since time is tight, so maybe
> I've to fix-up it myself.

I heard from one of our support folks that you were asking through that
channel, so I asked him to go ahead and give you the spinlock sources
directly.  I will be spending time next week syncing up our internal tree
with the public git repository so you'll see it on LKML at that time.

> 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the
> interrupt which claimed 'CM', is that right? Should we have to same
> its original value and restore it later?

We don't need to save and restore, since INTERRUPT_CRITICAL_SECTION is
almost always zero except in very specific situations.

> 2. Should we lock interrupt for the whole operation of
> read_lock()/read_unlock(), or we should leave interrupt critical
> section if it run into  __raw_read_lock_slow() and before have to
> delay_backoff() some time, and re-enter interrupt critical section
> again before TNS?

Correct, the fix only holds the critical section around the tns and the
write-back, not during the delay_backoff().
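
So the read-lock slow path ends up shaped like this (a sketch; it matches
the structure of arch_read_lock() in the patch at the end of this thread):

  for (;;) {
          __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
          val = __insn_tns((int *)&rwlock->lock);
          /* ...either take the lock and break, or store the old value back... */
          __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
          delay_backoff(iterations++);  /* back off with interrupts deliverable again */
  }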

> Bye the way, other RISC platforms, say ARM and MIPS, use store
> conditional rather that TNS a temp value for lock-val, does Fx have
> similar instructions?

TILEPro does not have anything more than test-and-set; TILE-Gx (the 64-bit
processor) has a full array of atomic instructions.

> Adding that to SPR writes should be fine, but it may cause interrupt
> delay a little more that other platform's read_lock()?

A little, but I think it's in the noise relative to the basic cost of
read_lock in the absence of full-fledged atomic instructions.

> Another question: What NMI in the former mail means?

Non-maskable interrupt, such as performance counter interrupts.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-18 21:51                     ` Chris Metcalf
@ 2011-02-19  4:07                       ` Cypher Wu
  2011-02-20 13:33                         ` Chris Metcalf
  0 siblings, 1 reply; 21+ messages in thread
From: Cypher Wu @ 2011-02-19  4:07 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev

On Sat, Feb 19, 2011 at 5:51 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
> On 2/17/2011 10:16 PM, Cypher Wu wrote:
>> On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
>>> The interrupt architecture on Tile allows a write to a special-purpose
>>> register to put you into a "critical section" where no interrupts or faults
>>> are delivered.  So we just need to bracket the read_lock operations with
>>> two SPR writes; each takes six machine cycles, so we're only adding 12
>>> cycles to the total cost of taking or releasing a read lock on an rwlock
>> I agree that just lock interrupt for read operations should be enough,
>> but read_unlock() is also the place we should lock interrupt, right?
>> If interrupt occurred when it hold lock-val after TNS deadlock still
>> can occur.
>
> Correct; that's what I meant by "read_lock operations".  This include lock,
> trylock, and unlock.
>
>> When will you release out that patch? Since time is tight, so maybe
>> I've to fix-up it myself.
>
> I heard from one of our support folks that you were asking through that
> channel, so I asked him to go ahead and give you the spinlock sources
> directly.  I will be spending time next week syncing up our internal tree
> with the public git repository so you'll see it on LKML at that time.
>
>> 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the
>> interrupt which claimed 'CM', is that right? Should we have to same
>> its original value and restore it later?
>
> We don't need to save and restore, since INTERRUPT_CRITICAL_SECTION is
> almost always zero except in very specific situations.
>
>> 2. Should we lock interrupt for the whole operation of
>> read_lock()/read_unlock(), or we should leave interrupt critical
>> section if it run into  __raw_read_lock_slow() and before have to
>> delay_backoff() some time, and re-enter interrupt critical section
>> again before TNS?
>
> Correct, the fix only holds the critical section around the tns and the
> write-back, not during the delay_backoff().
>
>> Bye the way, other RISC platforms, say ARM and MIPS, use store
>> conditional rather that TNS a temp value for lock-val, does Fx have
>> similar instructions?
>
> TILEPro does not have anything more than test-and-set; TILE-Gx (the 64-bit
> processor) has a full array of atomic instructions.
>
>> Adding that to SPR writes should be fine, but it may cause interrupt
>> delay a little more that other platform's read_lock()?
>
> A little, but I think it's in the noise relative to the basic cost of
> read_lock in the absence of full-fledged atomic instructions.
>
>> Another question: What NMI in the former mail means?
>
> Non-maskable interrupt, such as performance counter interrupts.
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>
>

I've got your source code, thank you very much.

There are still two more questions:
1. Why did you merge the inlined code and the *_slow paths into
non-inlined functions?
2. I've seen the use of mb() in the unlock operation, but we don't use it
in the lock operation.

I've released a temporary version with this modification at our
customer's request, since they want to run a long test over the weekend.
I'd appreciate any comments on my modifications:

*** /opt/TileraMDE-2.1.0.98943/tilepro/src/sys/linux/include/asm-tile/spinlock_32.h    2010-04-02 11:07:47.000000000 +0800
--- include/asm-tile/spinlock_32.h      2011-02-18 17:09:40.000000000 +0800
***************
*** 12,17 ****
--- 12,27 ----
   *   more details.
   *
   * 32-bit SMP spinlocks.
+  *
+  *
+  * The use of TNS instruction cause race condition for system call and
+  * interrupt, so we have to lock interrupt when we trying lock-value.
+  * However, since write_lock() is exclusive so if we really need to
+  * operate it in interrupt then system call have to use write_lock_irqsave(),
+  * So it don't need to lock interrupt here.
+  * Spinlock is also exclusive so we don't take care about it.
+  *
+  * Modified by Cyberman Wu on Feb 18th, 2011.
   */

  #ifndef _ASM_TILE_SPINLOCK_32_H
*************** void __raw_read_unlock_slow(raw_rwlock_t
*** 86,91 ****
--- 96,114 ----
  void __raw_write_lock_slow(raw_rwlock_t *, u32);
  void __raw_write_unlock_slow(raw_rwlock_t *, u32);

+
+ static inline void __raw_read_lock_enter_critical(void)
+ {
+       BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION));
+       __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
+ }
+
+ static inline void __raw_read_lock_leave_critical(void)
+ {
+       __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+ }
+
+
  /**
   * __raw_read_can_lock() - would read_trylock() succeed?
   */
*************** static inline int __raw_write_can_lock(r
*** 107,121 ****
   */
  static inline void __raw_read_lock(raw_rwlock_t *rwlock)
  {
!       u32 val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val << _RD_COUNT_WIDTH)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
                __raw_read_lock_slow(rwlock, val);
                return;
        }
        rwlock->lock = val + (1 << _RD_COUNT_SHIFT);
  }

  /**
--- 130,148 ----
   */
  static inline void __raw_read_lock(raw_rwlock_t *rwlock)
  {
!     u32 val;
!       __raw_read_lock_enter_critical();
!       /*u32 */val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val << _RD_COUNT_WIDTH)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
                __raw_read_lock_slow(rwlock, val);
+               __raw_read_lock_leave_critical();
                return;
        }
        rwlock->lock = val + (1 << _RD_COUNT_SHIFT);
+       __raw_read_lock_leave_critical();
  }

  /**
*************** static inline void __raw_write_lock(raw_
*** 140,154 ****
  static inline int __raw_read_trylock(raw_rwlock_t *rwlock)
  {
        int locked;
!       u32 val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val & 1)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
!               return __raw_read_trylock_slow(rwlock);
        }
        locked = (val << _RD_COUNT_WIDTH) == 0;
        rwlock->lock = val + (locked << _RD_COUNT_SHIFT);
        return locked;
  }

--- 167,187 ----
  static inline int __raw_read_trylock(raw_rwlock_t *rwlock)
  {
        int locked;
!     u32 val;
!       __raw_read_lock_enter_critical();
!       /*u32 */val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val & 1)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
!               // return __raw_read_trylock_slow(rwlock);
!               locked =__raw_read_trylock_slow(rwlock);
!               __raw_read_lock_leave_critical();
!               return locked;
        }
        locked = (val << _RD_COUNT_WIDTH) == 0;
        rwlock->lock = val + (locked << _RD_COUNT_SHIFT);
+       __raw_read_lock_leave_critical();
        return locked;
  }

*************** static inline void __raw_read_unlock(raw
*** 184,198 ****
--- 217,234 ----
  {
        u32 val;
        mb();
+       __raw_read_lock_enter_critical();
        val = __insn_tns((int *)&rwlock->lock);
        if (unlikely(val & 1)) {
  #ifdef __TILECC__
  #pragma frequency_hint NEVER
  #endif
                __raw_read_unlock_slow(rwlock);
+               __raw_read_lock_leave_critical();
                return;
        }
        rwlock->lock = val - (1 << _RD_COUNT_SHIFT);
+       __raw_read_lock_leave_critical();
  }



--- /opt/TileraMDE-2.1.0.98943/tilepro/src/sys/linux/arch/tile/lib/spinlock_32.c       2010-04-02 11:08:02.000000000 +0800
+++ arch/tile/lib/spinlock_32.c 2011-02-18 16:05:31.000000000 +0800
@@ -98,7 +98,18 @@ static inline u32 get_rwlock(raw_rwlock_
 #ifdef __TILECC__
 #pragma frequency_hint NEVER
 #endif
+            /*
+             * get_rwlock() now have to be called in Interrupt
+             * Critical Section, so it can't be called in the
+             * these __raw_write_xxx() anymore!!!!!
+             *
+             * We leave Interrupt Critical Section for making
+             * interrupt delay minimal.
+             * Is that really needed???
+             */
+            __raw_read_lock_leave_critical();
                        delay_backoff(iterations++);
+            __raw_read_lock_enter_critical();
                        continue;
                }
                return val;
@@ -152,7 +163,14 @@ void __raw_read_lock_slow(raw_rwlock_t *
        do {
                if (!(val & 1))
                        rwlock->lock = val;
+        /*
+         * We leave Interrupt Critical Section for making
+         * interrupt delay minimal.
+         * Is that really needed???
+         */
+        __raw_read_lock_leave_critical();
                delay_backoff(iterations++);
+        __raw_read_lock_enter_critical();
                val = __insn_tns((int *)&rwlock->lock);
        } while ((val << RD_COUNT_WIDTH) != 0);
        rwlock->lock = val + (1 << RD_COUNT_SHIFT);
@@ -166,23 +184,30 @@ void __raw_write_lock_slow(raw_rwlock_t
         * when we compare them.
         */
        u32 my_ticket_;
+       u32 iterations = 0;

-       /* Take out the next ticket; this will also stop would-be readers. */
-       if (val & 1)
-               val = get_rwlock(rwlock);
-       rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
+       /*
+        * Wait until there are no readers, then bump up the next
+        * field and capture the ticket value.
+        */
+       for (;;) {
+               if (!(val & 1)) {
+                       if ((val >> RD_COUNT_SHIFT) == 0)
+                               break;
+                       rwlock->lock = val;
+               }
+               delay_backoff(iterations++);
+               val = __insn_tns((int *)&rwlock->lock);
+       }

-       /* Extract my ticket value from the original word. */
+       /* Take out the next ticket and extract my ticket value. */
+       rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
        my_ticket_ = val >> WR_NEXT_SHIFT;

-       /*
-        * Wait until the "current" field matches our ticket, and
-        * there are no remaining readers.
-        */
+       /* Wait until the "current" field matches our ticket. */
        for (;;) {
                u32 curr_ = val >> WR_CURR_SHIFT;
-               u32 readers = val >> RD_COUNT_SHIFT;
-               u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
+               u32 delta = ((my_ticket_ - curr_) & WR_MASK);
                if (likely(delta == 0))
                        break;


-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: IGMP and rwlock: Deadlock occurred again on TILEPro
  2011-02-19  4:07                       ` Cypher Wu
@ 2011-02-20 13:33                         ` Chris Metcalf
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Metcalf @ 2011-02-20 13:33 UTC (permalink / raw)
  To: Cypher Wu
  Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev

On 2/18/2011 11:07 PM, Cypher Wu wrote:
> On Sat, Feb 19, 2011 at 5:51 AM, Chris Metcalf <cmetcalf@tilera.com> wrote:
>> I heard from one of our support folks that you were asking through that
>> channel, so I asked him to go ahead and give you the spinlock sources
>> directly.  I will be spending time next week syncing up our internal tree
>> with the public git repository so you'll see it on LKML at that time.
> I've got your source code, thank you very much.
>
> There is still two more question:
> 1. Why we merge the inlined code and the *_slow into none inlined functions?

Those functions were always borderline in terms of being sensible inlined
functions.  In my opinion, adding the SPR writes as well pushed them over
the edge, so I made them just straight function calls instead, for code
density reasons.  It also makes the code simpler, which is a plus.  And
since I was changing the read_lock versions I changed the write_lock
versions as well for consistency.

> 2. I've seen the use of 'mb()' in unlock operation, but we don't use
> that in the lock operation.

You don't need a memory barrier when acquiring a lock.  (Well, some
architectures require a read barrier, but Tile doesn't speculate loads past
control dependencies at the moment.)

> I've released a temporary version with that modification under our
> customer' demand, since they want to do a long time test though this
> weekend. I'll appreciate that if you gave some comment on my
> modifications:

It seems OK functionally, and has the advantage of addressing the deadlock
without changing the module API, so it's appropriate if you're trying to
maintain binary compatibility.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] arch/tile: fix deadlock bugs in rwlock implementation
  2011-02-18  7:08                   ` Cypher Wu
  2011-02-18 21:51                     ` Chris Metcalf
@ 2011-03-01 18:30                     ` Chris Metcalf
  1 sibling, 0 replies; 21+ messages in thread
From: Chris Metcalf @ 2011-03-01 18:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: David Miller, xiyou.wangcong, eric.dumazet, netdev

The first issue fixed in this patch is that pending rwlock write locks
could lock out new readers; this could cause a deadlock if a read lock was
held on cpu 1, a write lock was then attempted on cpu 2 and was pending,
and cpu 1 was interrupted and attempted to re-acquire a read lock.
The write lock code was modified to not lock out new readers.

The second issue fixed is that there was a narrow race window where a tns
instruction had been issued (setting the lock value to "1") and the store
instruction to reset the lock value correctly had not yet been issued.
In this case, if an interrupt occurred and the same cpu then tried to
manipulate the lock, it would find the lock value set to "1" and spin
forever, assuming some other cpu was partway through updating it.  The fix
is to enforce an interrupt critical section around the tns/store pair.

Since these changes make the rwlock "fast path" code heavier weight,
I decided to move all the rwlock code all out of line, leaving only the
conventional spinlock code with fastpath inlines.

As part of this change I also eliminate support for the now-obsolete
tns_atomic mode.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
 arch/tile/include/asm/spinlock_32.h |   84 ++----------------
 arch/tile/lib/spinlock_32.c         |  163 +++++++++++++++++++++--------------
 2 files changed, 108 insertions(+), 139 deletions(-)

diff --git a/arch/tile/include/asm/spinlock_32.h b/arch/tile/include/asm/spinlock_32.h
index 88efdde..32ba1fe 100644
--- a/arch/tile/include/asm/spinlock_32.h
+++ b/arch/tile/include/asm/spinlock_32.h
@@ -21,6 +21,7 @@
 #include <asm/page.h>
 #include <asm/system.h>
 #include <linux/compiler.h>
+#include <arch/spr_def.h>
 
 /*
  * We only use even ticket numbers so the '1' inserted by a tns is
@@ -78,13 +79,6 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock);
 #define _RD_COUNT_SHIFT 24
 #define _RD_COUNT_WIDTH 8
 
-/* Internal functions; do not use. */
-void arch_read_lock_slow(arch_rwlock_t *, u32);
-int arch_read_trylock_slow(arch_rwlock_t *);
-void arch_read_unlock_slow(arch_rwlock_t *);
-void arch_write_lock_slow(arch_rwlock_t *, u32);
-void arch_write_unlock_slow(arch_rwlock_t *, u32);
-
 /**
  * arch_read_can_lock() - would read_trylock() succeed?
  */
@@ -104,94 +98,32 @@ static inline int arch_write_can_lock(arch_rwlock_t *rwlock)
 /**
  * arch_read_lock() - acquire a read lock.
  */
-static inline void arch_read_lock(arch_rwlock_t *rwlock)
-{
-	u32 val = __insn_tns((int *)&rwlock->lock);
-	if (unlikely(val << _RD_COUNT_WIDTH)) {
-		arch_read_lock_slow(rwlock, val);
-		return;
-	}
-	rwlock->lock = val + (1 << _RD_COUNT_SHIFT);
-}
+void arch_read_lock(arch_rwlock_t *rwlock);
 
 /**
- * arch_read_lock() - acquire a write lock.
+ * arch_write_lock() - acquire a write lock.
  */
-static inline void arch_write_lock(arch_rwlock_t *rwlock)
-{
-	u32 val = __insn_tns((int *)&rwlock->lock);
-	if (unlikely(val != 0)) {
-		arch_write_lock_slow(rwlock, val);
-		return;
-	}
-	rwlock->lock = 1 << _WR_NEXT_SHIFT;
-}
+void arch_write_lock(arch_rwlock_t *rwlock);
 
 /**
  * arch_read_trylock() - try to acquire a read lock.
  */
-static inline int arch_read_trylock(arch_rwlock_t *rwlock)
-{
-	int locked;
-	u32 val = __insn_tns((int *)&rwlock->lock);
-	if (unlikely(val & 1))
-		return arch_read_trylock_slow(rwlock);
-	locked = (val << _RD_COUNT_WIDTH) == 0;
-	rwlock->lock = val + (locked << _RD_COUNT_SHIFT);
-	return locked;
-}
+int arch_read_trylock(arch_rwlock_t *rwlock);
 
 /**
  * arch_write_trylock() - try to acquire a write lock.
  */
-static inline int arch_write_trylock(arch_rwlock_t *rwlock)
-{
-	u32 val = __insn_tns((int *)&rwlock->lock);
-
-	/*
-	 * If a tns is in progress, or there's a waiting or active locker,
-	 * or active readers, we can't take the lock, so give up.
-	 */
-	if (unlikely(val != 0)) {
-		if (!(val & 1))
-			rwlock->lock = val;
-		return 0;
-	}
-
-	/* Set the "next" field to mark it locked. */
-	rwlock->lock = 1 << _WR_NEXT_SHIFT;
-	return 1;
-}
+int arch_write_trylock(arch_rwlock_t *rwlock);
 
 /**
  * arch_read_unlock() - release a read lock.
  */
-static inline void arch_read_unlock(arch_rwlock_t *rwlock)
-{
-	u32 val;
-	mb();  /* guarantee anything modified under the lock is visible */
-	val = __insn_tns((int *)&rwlock->lock);
-	if (unlikely(val & 1)) {
-		arch_read_unlock_slow(rwlock);
-		return;
-	}
-	rwlock->lock = val - (1 << _RD_COUNT_SHIFT);
-}
+void arch_read_unlock(arch_rwlock_t *rwlock);
 
 /**
  * arch_write_unlock() - release a write lock.
  */
-static inline void arch_write_unlock(arch_rwlock_t *rwlock)
-{
-	u32 val;
-	mb();  /* guarantee anything modified under the lock is visible */
-	val = __insn_tns((int *)&rwlock->lock);
-	if (unlikely(val != (1 << _WR_NEXT_SHIFT))) {
-		arch_write_unlock_slow(rwlock, val);
-		return;
-	}
-	rwlock->lock = 0;
-}
+void arch_write_unlock(arch_rwlock_t *rwlock);
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
 #define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 5cd1c40..723e789 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -91,75 +91,82 @@ EXPORT_SYMBOL(arch_spin_unlock_wait);
 #define RD_COUNT_MASK   ((1 << RD_COUNT_WIDTH) - 1)
 
 
-/* Lock the word, spinning until there are no tns-ers. */
-static inline u32 get_rwlock(arch_rwlock_t *rwlock)
+/*
+ * We spin until everything but the reader bits (which are in the high
+ * part of the word) are zero, i.e. no active or waiting writers, no tns.
+ *
+ * We guard the tns/store-back with an interrupt critical section to
+ * preserve the semantic that the same read lock can be acquired in an
+ * interrupt context.
+ *
+ * ISSUE: This approach can permanently starve readers.  A reader who sees
+ * a writer could instead take a ticket lock (just like a writer would),
+ * and atomically enter read mode (with 1 reader) when it gets the ticket.
+ * This way both readers and writers will always make forward progress
+ * in a finite time.
+ */
+void arch_read_lock(arch_rwlock_t *rwlock)
 {
-	u32 iterations = 0;
+	u32 val, iterations = 0;
+
 	for (;;) {
-		u32 val = __insn_tns((int *)&rwlock->lock);
-		if (unlikely(val & 1)) {
-			delay_backoff(iterations++);
-			continue;
+		__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
+		val = __insn_tns((int *)&rwlock->lock);
+		if (likely((val << _RD_COUNT_WIDTH) == 0)) {
+			rwlock->lock = val + (1 << RD_COUNT_SHIFT);
+			__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+			break;
 		}
-		return val;
+		if ((val & 1) == 0)
+			rwlock->lock = val;
+		__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+		delay_backoff(iterations++);
 	}
 }
+EXPORT_SYMBOL(arch_read_lock);
 
-int arch_read_trylock_slow(arch_rwlock_t *rwlock)
+int arch_read_trylock(arch_rwlock_t *rwlock)
 {
-	u32 val = get_rwlock(rwlock);
-	int locked = (val << RD_COUNT_WIDTH) == 0;
-	rwlock->lock = val + (locked << RD_COUNT_SHIFT);
+	u32 val;
+	int locked;
+
+	__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
+	val = __insn_tns((int *)&rwlock->lock);
+	locked = ((val << _RD_COUNT_WIDTH) == 0);
+	if (likely(locked))
+		rwlock->lock = val + (1 << RD_COUNT_SHIFT);
+	else if ((val & 1) == 0)
+		rwlock->lock = val;
+	__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
 	return locked;
 }
-EXPORT_SYMBOL(arch_read_trylock_slow);
+EXPORT_SYMBOL(arch_read_trylock);
 
-void arch_read_unlock_slow(arch_rwlock_t *rwlock)
+void arch_read_unlock(arch_rwlock_t *rwlock)
 {
-	u32 val = get_rwlock(rwlock);
-	rwlock->lock = val - (1 << RD_COUNT_SHIFT);
-}
-EXPORT_SYMBOL(arch_read_unlock_slow);
+	u32 val, iterations = 0;
 
-void arch_write_unlock_slow(arch_rwlock_t *rwlock, u32 val)
-{
-	u32 eq, mask = 1 << WR_CURR_SHIFT;
-	while (unlikely(val & 1)) {
-		/* Limited backoff since we are the highest-priority task. */
-		relax(4);
+	mb();  /* guarantee anything modified under the lock is visible */
+	for (;;) {
+		__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
 		val = __insn_tns((int *)&rwlock->lock);
+		if (likely((val & 1) == 0)) {
+			rwlock->lock = val - (1 << _RD_COUNT_SHIFT);
+			__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+			break;
+		}
+		__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+		delay_backoff(iterations++);
 	}
-	val = __insn_addb(val, mask);
-	eq = __insn_seqb(val, val << (WR_CURR_SHIFT - WR_NEXT_SHIFT));
-	val = __insn_mz(eq & mask, val);
-	rwlock->lock = val;
 }
-EXPORT_SYMBOL(arch_write_unlock_slow);
+EXPORT_SYMBOL(arch_read_unlock);
 
 /*
- * We spin until everything but the reader bits (which are in the high
- * part of the word) are zero, i.e. no active or waiting writers, no tns.
- *
- * ISSUE: This approach can permanently starve readers.  A reader who sees
- * a writer could instead take a ticket lock (just like a writer would),
- * and atomically enter read mode (with 1 reader) when it gets the ticket.
- * This way both readers and writers will always make forward progress
- * in a finite time.
+ * We don't need an interrupt critical section here (unlike for
+ * arch_read_lock) since we should never use a bare write lock where
+ * it could be interrupted by code that could try to re-acquire it.
  */
-void arch_read_lock_slow(arch_rwlock_t *rwlock, u32 val)
-{
-	u32 iterations = 0;
-	do {
-		if (!(val & 1))
-			rwlock->lock = val;
-		delay_backoff(iterations++);
-		val = __insn_tns((int *)&rwlock->lock);
-	} while ((val << RD_COUNT_WIDTH) != 0);
-	rwlock->lock = val + (1 << RD_COUNT_SHIFT);
-}
-EXPORT_SYMBOL(arch_read_lock_slow);
-
-void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
+void arch_write_lock(arch_rwlock_t *rwlock)
 {
 	/*
 	 * The trailing underscore on this variable (and curr_ below)
@@ -168,6 +175,12 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
 	 */
 	u32 my_ticket_;
 	u32 iterations = 0;
+	u32 val = __insn_tns((int *)&rwlock->lock);
+
+	if (likely(val == 0)) {
+		rwlock->lock = 1 << _WR_NEXT_SHIFT;
+		return;
+	}
 
 	/*
 	 * Wait until there are no readers, then bump up the next
@@ -206,23 +219,47 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
 			relax(4);
 	}
 }
-EXPORT_SYMBOL(arch_write_lock_slow);
+EXPORT_SYMBOL(arch_write_lock);
 
-int __tns_atomic_acquire(atomic_t *lock)
+int arch_write_trylock(arch_rwlock_t *rwlock)
 {
-	int ret;
-	u32 iterations = 0;
+	u32 val = __insn_tns((int *)&rwlock->lock);
 
-	BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION));
-	__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
+	/*
+	 * If a tns is in progress, or there's a waiting or active locker,
+	 * or active readers, we can't take the lock, so give up.
+	 */
+	if (unlikely(val != 0)) {
+		if (!(val & 1))
+			rwlock->lock = val;
+		return 0;
+	}
 
-	while ((ret = __insn_tns((void *)&lock->counter)) == 1)
-		delay_backoff(iterations++);
-	return ret;
+	/* Set the "next" field to mark it locked. */
+	rwlock->lock = 1 << _WR_NEXT_SHIFT;
+	return 1;
 }
+EXPORT_SYMBOL(arch_write_trylock);
 
-void __tns_atomic_release(atomic_t *p, int v)
+void arch_write_unlock(arch_rwlock_t *rwlock)
 {
-	p->counter = v;
-	__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
+	u32 val, eq, mask;
+
+	mb();  /* guarantee anything modified under the lock is visible */
+	val = __insn_tns((int *)&rwlock->lock);
+	if (likely(val == (1 << _WR_NEXT_SHIFT))) {
+		rwlock->lock = 0;
+		return;
+	}
+	while (unlikely(val & 1)) {
+		/* Limited backoff since we are the highest-priority task. */
+		relax(4);
+		val = __insn_tns((int *)&rwlock->lock);
+	}
+	mask = 1 << WR_CURR_SHIFT;
+	val = __insn_addb(val, mask);
+	eq = __insn_seqb(val, val << (WR_CURR_SHIFT - WR_NEXT_SHIFT));
+	val = __insn_mz(eq & mask, val);
+	rwlock->lock = val;
 }
+EXPORT_SYMBOL(arch_write_unlock);
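
For readers without the tile headers handy, the byte arithmetic in
arch_write_unlock() above only makes sense against the layout the shift
macros encode.  The sketch below is an assumption (byte-aligned fields
with an 8-bit reader count), not a quotation of spinlock_32.h:

/*
 * Assumed layout of the 32-bit rwlock->lock word (illustrative only):
 *
 *   bit  0   "tns" marker: __insn_tns() atomically stores 1, so a low
 *            bit of 1 means another cpu is between its tns and its
 *            store-back
 *   byte 1   "next" writer ticket   (_WR_NEXT_SHIFT)
 *   byte 2   "curr" writer ticket   (WR_CURR_SHIFT)
 *   byte 3   active reader count    (RD_COUNT_SHIFT / RD_COUNT_WIDTH)
 *
 * Under that assumption, (val << _RD_COUNT_WIDTH) == 0 reads as "no tns
 * in flight, no waiting or active writers", i.e. the read fast path,
 * and the addb/seqb/mz sequence bumps the "curr" byte and zeroes the
 * whole word once "curr" catches up with "next" (no writers queued).
 */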
-- 
1.6.5.2
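
To make the "same read lock can be acquired in an interrupt context"
semantic concrete, here is a minimal usage sketch of the nesting that
deadlocked before this patch.  my_lock, my_list, my_timer_fn and
my_update_path are made-up names for illustration, not the IGMP code;
the comments describe the behaviour with the interrupt-critical-section
guard added above:

#include <linux/spinlock.h>
#include <linux/list.h>

static DEFINE_RWLOCK(my_lock);     /* hypothetical lock */
static LIST_HEAD(my_list);         /* hypothetical data it protects */

/* Runs in softirq context, e.g. from a timer, the way
 * igmp_ifc_timer_expire() does in the stack dump earlier in the thread. */
static void my_timer_fn(unsigned long data)
{
	read_lock(&my_lock);
	/* ... walk my_list ... */
	read_unlock(&my_lock);
}

/* Runs in process context, e.g. somewhere under sys_setsockopt(). */
static void my_update_path(void)
{
	read_lock(&my_lock);
	/*
	 * A timer interrupt can fire here and re-enter read_lock() on the
	 * same cpu.  With the old tns-based fast path this could spin
	 * forever if the interrupt landed between the tns and the
	 * store-back; with the SPR_INTERRUPT_CRITICAL_SECTION guard that
	 * window can no longer be interrupted, so the nested read_lock()
	 * just sees one more reader and proceeds.
	 */
	read_unlock(&my_lock);
}

Note that a writer in process context still needs the _bh or _irq
variant if an interrupt handler can take the same lock for reading; the
patch only makes reader-in-reader nesting behave the way it already
does on x86, PowerPC and ARM.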


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2011-03-01 19:37 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-17  3:39 Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu
2011-02-17  4:49 ` Américo Wang
2011-02-17  5:04   ` Cypher Wu
2011-02-17  5:42     ` Américo Wang
2011-02-17  5:46       ` David Miller
2011-02-17  6:39         ` Eric Dumazet
2011-02-17 22:49         ` Chris Metcalf
2011-02-17 22:53           ` David Miller
2011-02-17 23:04             ` Chris Metcalf
2011-02-17 23:11               ` David Miller
2011-02-17 23:18                 ` Chris Metcalf
2011-02-18  3:16                   ` Cypher Wu
2011-02-18  3:19                   ` Cypher Wu
2011-02-18  7:08                   ` Cypher Wu
2011-02-18 21:51                     ` Chris Metcalf
2011-02-19  4:07                       ` Cypher Wu
2011-02-20 13:33                         ` Chris Metcalf
2011-03-01 18:30                     ` [PATCH] arch/tile: fix deadlock bugs in rwlock implementation Chris Metcalf
2011-02-17  6:42       ` Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu
  -- strict thread matches above, loose matches on Subject: below --
2011-02-17  2:17 Cypher Wu
2011-02-17  2:17 Cypher Wu
