* Possible process deadlock in RMPP flow
@ 2009-09-23 15:04 Eli Cohen
  2009-09-23 16:08 ` Sean Hefty
  0 siblings, 1 reply; 12+ messages in thread
From: Eli Cohen @ 2009-09-23 15:04 UTC (permalink / raw)
  To: Sean Hefty; +Cc: Linux RDMA list, ewg, general-list

Hi Sean,
one of our customers experiences problems when running ibnetdiscover.
The problem happens from time to time.
Here is the call stack that he gets:

ibnetdiscover D ffffffff80149b8d     0 26968  26544 (L-TLB)
 ffff8102c900bd88 0000000000000046 ffff81037e8e0000 ffff81037e8e02e8
 ffff8102c900bd78 000000000000000a ffff8102c5b50820 ffff81038a929820
 0000011837bf6105 0000000000000ede ffff8102c5b50a08 0000000100000000
Call Trace:
 [<ffffffff80064207>] wait_for_completion+0x79/0xa2
 [<ffffffff8008b4cc>] default_wake_function+0x0/0xe
 [<ffffffff882271d9>] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
 [<ffffffff88224485>] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
 [<ffffffff883983e9>] :ib_umad:ib_umad_close+0x9d/0xd6
 [<ffffffff80012e22>] __fput+0xae/0x198
 [<ffffffff80023de6>] filp_close+0x5c/0x64
 [<ffffffff800393df>] put_files_struct+0x63/0xae
 [<ffffffff80015b26>] do_exit+0x31c/0x911
 [<ffffffff8004971a>] cpuset_exit+0x0/0x6c
 [<ffffffff8005e116>] system_call+0x7e/0x83

From the dump it seems that the process is waiting on the call to
flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
OFED 1.4.2.
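
For readers reproducing this: the `D` after the command name in the first line of the dump is the task state (uninterruptible sleep). A minimal userspace sketch for checking the state of a running task is shown below. It assumes a Linux `/proc` filesystem; `task_state()` is a hypothetical helper name, not part of any library.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the scheduler state letter of a task
 * ('R', 'S', 'D', ...), or 0 on failure. It reads /proc/<pid>/stat,
 * where the state is the first field after the parenthesised
 * command name. */
static char task_state(pid_t pid)
{
    char path[64], buf[512];
    FILE *f;
    char *p;
    size_t n;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return 0;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* The command name may itself contain spaces or ')', so search
     * for the last closing parenthesis rather than splitting on spaces. */
    p = strrchr(buf, ')');
    if (!p || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}
```

Calling `task_state()` with the PID of the stuck ibnetdiscover process would be expected to return `'D'` for as long as it sits in wait_for_completion().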

Do you have any ideas or suggestions on how to sort this out?

Thanks.

* RE: Possible process deadlock in RMPP flow
  2009-09-23 15:04 Possible process deadlock in RMPP flow Eli Cohen
@ 2009-09-23 16:08 ` Sean Hefty
       [not found]   ` <7A32EEE20DF5432CADB60B8F8B1E0093-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Hefty @ 2009-09-23 16:08 UTC (permalink / raw)
  To: 'Eli Cohen'; +Cc: Linux RDMA list, general-list, ewg

>ibnetdiscover D ffffffff80149b8d     0 26968  26544
>(L-TLB)
> ffff8102c900bd88 0000000000000046 ffff81037e8e0000 ffff81037e8e02e8
> ffff8102c900bd78 000000000000000a ffff8102c5b50820 ffff81038a929820
> 0000011837bf6105 0000000000000ede ffff8102c5b50a08 0000000100000000
>Call Trace:
> [<ffffffff80064207>] wait_for_completion+0x79/0xa2
> [<ffffffff8008b4cc>] default_wake_function+0x0/0xe
> [<ffffffff882271d9>] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
> [<ffffffff88224485>] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
> [<ffffffff883983e9>] :ib_umad:ib_umad_close+0x9d/0xd6
> [<ffffffff80012e22>] __fput+0xae/0x198
> [<ffffffff80023de6>] filp_close+0x5c/0x64
> [<ffffffff800393df>] put_files_struct+0x63/0xae
> [<ffffffff80015b26>] do_exit+0x31c/0x911
> [<ffffffff8004971a>] cpuset_exit+0x0/0x6c
> [<ffffffff8005e116>] system_call+0x7e/0x83
>
>From the dump it seems that the process is waiting on the call to
>flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
>OFED 1.4.2.

Roland just submitted a patch in this area yesterday.  I don't know if the patch
would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
to?

What RMPP messages does ibnetdiscover use?  If the program is completing
successfully, there may be a different race with the rmpp cleanup.  I'll see if
anything else stands out in that area.

- Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Possible process deadlock in RMPP flow
       [not found]   ` <7A32EEE20DF5432CADB60B8F8B1E0093-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2009-09-23 16:20     ` Hal Rosenstock
  2009-09-23 17:25     ` Eli Cohen
  1 sibling, 0 replies; 12+ messages in thread
From: Hal Rosenstock @ 2009-09-23 16:20 UTC (permalink / raw)
  To: Sean Hefty; +Cc: Linux RDMA list, ewg, general-list


On Wed, Sep 23, 2009 at 12:08 PM, Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> >ibnetdiscover D ffffffff80149b8d     0 26968  26544
> >(L-TLB)
> > ffff8102c900bd88 0000000000000046 ffff81037e8e0000 ffff81037e8e02e8
> > ffff8102c900bd78 000000000000000a ffff8102c5b50820 ffff81038a929820
> > 0000011837bf6105 0000000000000ede ffff8102c5b50a08 0000000100000000
> >Call Trace:
> > [<ffffffff80064207>] wait_for_completion+0x79/0xa2
> > [<ffffffff8008b4cc>] default_wake_function+0x0/0xe
> > [<ffffffff882271d9>] :ib_mad:ib_cancel_rmpp_recvs+0x87/0xde
> > [<ffffffff88224485>] :ib_mad:ib_unregister_mad_agent+0x30d/0x424
> > [<ffffffff883983e9>] :ib_umad:ib_umad_close+0x9d/0xd6
> > [<ffffffff80012e22>] __fput+0xae/0x198
> > [<ffffffff80023de6>] filp_close+0x5c/0x64
> > [<ffffffff800393df>] put_files_struct+0x63/0xae
> > [<ffffffff80015b26>] do_exit+0x31c/0x911
> > [<ffffffff8004971a>] cpuset_exit+0x0/0x6c
> > [<ffffffff8005e116>] system_call+0x7e/0x83
> >
> >From the dump it seems that the process is waiting on the call to
> >flush_workqueue() in ib_cancel_rmpp_recvs(). The package they use is
> >OFED 1.4.2.
>
> Roland just submitted a patch in this area yesterday.  I don't know if the patch
> would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
> to?
>
> What RMPP messages does ibnetdiscover use?


None AFAIK.

-- Hal


> If the program is completing successfully, there may be a different race
> with the rmpp cleanup.  I'll see if anything else stands out in that area.
>
> - Sean

_______________________________________________
ewg mailing list
ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

* Re: Possible process deadlock in RMPP flow
       [not found]   ` <7A32EEE20DF5432CADB60B8F8B1E0093-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  2009-09-23 16:20     ` Hal Rosenstock
@ 2009-09-23 17:25     ` Eli Cohen
  2009-09-24  6:38       ` Or Gerlitz
  1 sibling, 1 reply; 12+ messages in thread
From: Eli Cohen @ 2009-09-23 17:25 UTC (permalink / raw)
  To: Sean Hefty; +Cc: Linux RDMA list, ewg, general-list

On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
> 
> Roland just submitted a patch in this area yesterday.  I don't know if the patch
> would fix their issue, but it may be worth trying.  What kernel does 1.4.2 map
> to?
I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL
5.3.
Thanks, we'll try this.

> 
> What RMPP messages does ibnetdiscover use?  If the program is completing
> successfully, there may be a different race with the rmpp cleanup.  I'll see if
> anything else stands out in that area.
> 
> - Sean

* Re: Possible process deadlock in RMPP flow
  2009-09-23 17:25     ` Eli Cohen
@ 2009-09-24  6:38       ` Or Gerlitz
       [not found]         ` <4ABB13F3.1060702-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Or Gerlitz @ 2009-09-24  6:38 UTC (permalink / raw)
  To: Eli Cohen, Sean Hefty
  Cc: Linux RDMA list, Roland Dreier, ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5

Eli Cohen wrote:
> On Wed, Sep 23, 2009 at 09:08:28AM -0700, Sean Hefty wrote:
>> What kernel does 1.4.2 map to?
> I think OFED 1.4.2 is based on kernel 2.6.27 but they're using RHEL 5.3

Yes, the usual mess: OFED X is based on kernel Y1, with some additions from kernel Y2 plus plenty of unreviewed and unmerged patches. Distro Z picks OFED X, and the result is 99% unsupportable, as Roland said. Somehow this OFED creature is still hanging around, working on the next damage it's going to bring into this world (code name 1.5).

Eli, here's a little tip for you: I had the displeasure of resolving a bunch of support cases originating from the fact that the two-year-old commit below was missing from some OFED version (sorry, I forgot the number...). Maybe it would help you as well?

Under a normal setting, if this commit actually solved a bug being hit by many customers, someone would have opened a distro bugzilla case saying, "please pick this commit for your kernel", and the customers would have either waited for the next distro update or used an intermediate distro kernel. Currently, as I understand it, distros just pick OFED versions and that's it.

Or.

commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
Author: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Date:   Fri Nov 30 17:30:18 2007 -0800

    IB/mad: Fix incorrect access to items on local_list

* Re: Possible process deadlock in RMPP flow
       [not found]         ` <4ABB13F3.1060702-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2009-09-24  7:36           ` Eli Cohen
  2009-09-24 15:53             ` Sean Hefty
  2009-10-04  7:04             ` Or Gerlitz
  0 siblings, 2 replies; 12+ messages in thread
From: Eli Cohen @ 2009-09-24  7:36 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Linux RDMA list, Sean Hefty,
	ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Roland Dreier

On Thu, Sep 24, 2009 at 09:38:43AM +0300, Or Gerlitz wrote:
> 
> commit b61d92d8ae6aa13b17d1c31e69d123879cec2ee2
> Author: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Date:   Fri Nov 30 17:30:18 2007 -0800
> 
>     IB/mad: Fix incorrect access to items on local_list
> 
Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
different problem. Once I have information whether the patch Roland
posted fixed it I will update the list.

* RE: Possible process deadlock in RMPP flow
  2009-09-24  7:36           ` Eli Cohen
@ 2009-09-24 15:53             ` Sean Hefty
       [not found]               ` <53F5ED0C557B4667B22A27A0459B800B-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  2009-10-04  7:04             ` Or Gerlitz
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Hefty @ 2009-09-24 15:53 UTC (permalink / raw)
  To: 'Eli Cohen', Or Gerlitz
  Cc: Linux RDMA list, ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Roland Dreier

>Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
>different problem. Once I have information whether the patch Roland
>posted fixed it I will update the list.

If ibnetdiscover doesn't use RMPP as Hal indicated, I don't think Roland's patch
will help.

* Re: Possible process deadlock in RMPP flow
       [not found]               ` <53F5ED0C557B4667B22A27A0459B800B-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2009-09-27  8:01                 ` Eli Cohen
  0 siblings, 0 replies; 12+ messages in thread
From: Eli Cohen @ 2009-09-27  8:01 UTC (permalink / raw)
  To: Sean Hefty
  Cc: Or Gerlitz, Linux RDMA list,
	ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Roland Dreier

On Thu, Sep 24, 2009 at 08:53:24AM -0700, Sean Hefty wrote:
> >Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
> >different problem. Once I have information whether the patch Roland
> >posted fixed it I will update the list.
> 
> If ibnetdiscover doesn't use RMPP as Hal indicated, I don't think Roland's patch
> will help.

Right, it doesn't help. Still, it appears that ibnetdiscover triggers
this problem, and the lockup seems to occur in ib_cancel_rmpp_recvs()
waiting for flush_workqueue() to return. Do you know which apps or
ULPs make use of RMPP?
Any other ideas what this could be?

* Re: Re: Possible process deadlock in RMPP flow
  2009-09-24  7:36           ` Eli Cohen
  2009-09-24 15:53             ` Sean Hefty
@ 2009-10-04  7:04             ` Or Gerlitz
       [not found]               ` <4AC8BE74.3050200@mellanox.co.il>
  1 sibling, 1 reply; 12+ messages in thread
From: Or Gerlitz @ 2009-10-04  7:04 UTC (permalink / raw)
  To: Eli Cohen
  Cc: Linux RDMA list, Sean Hefty,
	ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Roland Dreier

Eli Cohen wrote:
> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a 
> different problem. Once I have information whether the patch Roland 
> posted fixed it I will update the list.
Eli, did you find a commit that fixes the problem you reported on?

Or.

* RE: Re: Possible process deadlock in RMPP flow
       [not found]                 ` <4AC8BE74.3050200-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
@ 2009-10-19 20:30                   ` Sean Hefty
       [not found]                     ` <BFC792E8570C48B8981C34F913243887-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Hefty @ 2009-10-19 20:30 UTC (permalink / raw)
  To: tziporet-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb, Or Gerlitz
  Cc: Linux RDMA list, Roland Dreier, ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5

>>> Thanks Or. This one is already in OFED 1.4.2 but apparently this is a
>>> different problem. Once I have information whether the patch Roland
>>> posted fixed it I will update the list.
>> Eli, did you find a commit that fixes the problem you reported on?
>>
>> Or.
>>
>>
>Not yet :-(

I can't find anything off in the code for this.  It's odd, since
unregister_mad_agent() does:

        flush_workqueue(port_priv->wq);
        ib_cancel_rmpp_recvs(mad_agent_priv);

and ib_cancel_rmpp_recvs() does:

        spin_lock_irqsave(&agent->lock, flags);
        list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
                cancel_delayed_work(&rmpp_recv->timeout_work);
                cancel_delayed_work(&rmpp_recv->cleanup_work);
        }
        spin_unlock_irqrestore(&agent->lock, flags);

        flush_workqueue(agent->qp_info->port_priv->wq);

which basically just flushes the same work queue.
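
As a rough illustration of why such a flush can still block: a flush waits for every queued work item to finish, so a single item that is itself waiting on an event that never arrives keeps the flusher asleep. The sketch below is a userspace pthread model under that assumption, not the ib_mad code and not a diagnosis of this report. `toy_wq`, `toy_queue_work()`, `toy_flush()` and `wait_for_device()` are hypothetical names, and unlike flush_workqueue() the toy flush takes a timeout so the demo cannot hang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a workqueue: one worker thread runs a
 * single work item; "pending" counts items that have not finished. */
struct toy_wq {
    pthread_mutex_t lock;
    pthread_cond_t idle;          /* signalled when pending drops to 0 */
    int pending;
    pthread_t worker;
    void (*fn)(void *);
    void *arg;
};

static void *toy_worker(void *p)
{
    struct toy_wq *wq = p;

    wq->fn(wq->arg);              /* run the work item to completion */
    pthread_mutex_lock(&wq->lock);
    wq->pending--;
    pthread_cond_broadcast(&wq->idle);
    pthread_mutex_unlock(&wq->lock);
    return NULL;
}

/* Queue one item and start the worker. Returns 0 on success. */
static int toy_queue_work(struct toy_wq *wq, void (*fn)(void *), void *arg)
{
    pthread_mutex_init(&wq->lock, NULL);
    pthread_cond_init(&wq->idle, NULL);
    wq->pending = 1;
    wq->fn = fn;
    wq->arg = arg;
    return pthread_create(&wq->worker, NULL, toy_worker, wq);
}

/* Wait until all queued work has finished, giving up after timeout_ms.
 * Returns true if the queue drained, false if the wait timed out. */
static bool toy_flush(struct toy_wq *wq, int timeout_ms)
{
    struct timespec deadline;
    bool drained;

    assert(timeout_ms >= 0);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&wq->lock);
    while (wq->pending > 0) {
        if (pthread_cond_timedwait(&wq->idle, &wq->lock, &deadline) != 0)
            break;                /* timed out: the work item is stuck */
    }
    drained = (wq->pending == 0);
    pthread_mutex_unlock(&wq->lock);
    return drained;
}

/* Work item that polls for a "device" event that may never arrive. */
static void wait_for_device(void *arg)
{
    atomic_int *done = arg;
    struct timespec nap = { 0, 1000000L };   /* 1 ms */

    while (!atomic_load(done))
        nanosleep(&nap, NULL);
}
```

With the flag already set, `toy_flush()` returns true almost immediately. With the flag never set it returns false once the timeout expires, which is the point where a flush without a timeout would simply stay blocked.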

I haven't been able to reproduce the problem, but I'm running the latest kernel
- not sure that matters in this case.  Does ibnetdiscover just hang forever at
the end of the test when this occurs?  Is there any more information available?

- Sean 

* Re: [ewg] Re: Possible process deadlock in RMPP flow
       [not found]                     ` <BFC792E8570C48B8981C34F913243887-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
@ 2009-10-20  6:06                       ` Tziporet Koren
  2009-10-20  7:48                       ` Eli Cohen
  1 sibling, 0 replies; 12+ messages in thread
From: Tziporet Koren @ 2009-10-20  6:06 UTC (permalink / raw)
  To: Sean Hefty
  Cc: tziporet-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb, Or Gerlitz,
	Linux RDMA list, Roland Dreier,
	ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5

Sean Hefty wrote:
> I can't find anything off in the code for this.  
In the end it turned out to be a FW issue, which is fixed in our new 2.7.0 release.

Tziporet

* Re: Re: Possible process deadlock in RMPP flow
       [not found]                     ` <BFC792E8570C48B8981C34F913243887-Zpru7NauK7drdx17CPfAsdBPR1lH4CV8@public.gmane.org>
  2009-10-20  6:06                       ` [ewg] " Tziporet Koren
@ 2009-10-20  7:48                       ` Eli Cohen
  1 sibling, 0 replies; 12+ messages in thread
From: Eli Cohen @ 2009-10-20  7:48 UTC (permalink / raw)
  To: Sean Hefty
  Cc: Linux RDMA list, ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5, Roland Dreier

On Mon, Oct 19, 2009 at 01:30:47PM -0700, Sean Hefty wrote:
> 
> I can't find anything off in the code for this.  It's odd, since
> unregister_mad_agent() does:
> 
>         flush_workqueue(port_priv->wq);
>         ib_cancel_rmpp_recvs(mad_agent_priv);
> 
> and ib_cancel_rmpp_recvs() does:
> 
>         spin_lock_irqsave(&agent->lock, flags);
>         list_for_each_entry(rmpp_recv, &agent->rmpp_list, list) {
>                 cancel_delayed_work(&rmpp_recv->timeout_work);
>                 cancel_delayed_work(&rmpp_recv->cleanup_work);
>         }
>         spin_unlock_irqrestore(&agent->lock, flags);
> 
>         flush_workqueue(agent->qp_info->port_priv->wq);
> 
> which basically just flushes the same work queue.
> 
> I haven't been able to reproduce the problem, but I'm running the latest kernel
> - not sure that matters in this case.  Does ibnetdiscover just hang forever at
> the end of the test when this occurs?  Is there any more information available?
> 

We are checking whether the problem is a firmware bug; it looks like it is.
Once we verify this I will send an update.
