* [IPoIB] Missing join mcast events causing full machine lockup
From: Nikolay Borisov @ 2016-07-21  7:31 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, SiteGround Operations

Hello, 

At the risk of sounding like a broken record, I came across 
another case where ipoib can cause the machine to go haywire due to 
missed join requests. This is on a 4.4.14 kernel. Here is what I believe 
happens:

1. Ipoib connectivity breaks, which causes a workqueue task to stall: 
[1297655.474707] kworker/u96:1   D ffff88026b057c48     0  6581      2 0x00000000
[1297655.474714] Workqueue: ipoib_wq ipoib_mcast_restart_task [ib_ipoib]
[1297655.474715]  ffff88026b057c48 ffff883ff29c6040 ffff880b2b5f2940 ffff88026b058000
[1297655.474717]  7fffffffffffffff ffff8820e2f809d8 ffff880b2b5f2940 ffff880b2b5f2940
[1297655.474718]  ffff88026b057c60 ffffffff816103dc ffff8820e2f809d0 ffff88026b057ce0
[1297655.474720] Call Trace:
[1297655.474722]  [<ffffffff816103dc>] schedule+0x3c/0x90
[1297655.474724]  [<ffffffff81613642>] schedule_timeout+0x202/0x260
[1297655.474728]  [<ffffffff81308645>] ? find_next_bit+0x15/0x20
[1297655.474734]  [<ffffffff812f409f>] ? cpumask_next_and+0x2f/0x40
[1297655.474737]  [<ffffffff8108db8c>] ? load_balance+0x1cc/0x9a0
[1297655.474739]  [<ffffffff816118df>] wait_for_completion+0xcf/0x130
[1297655.474742]  [<ffffffff8107cd30>] ? wake_up_q+0x70/0x70
[1297655.474745]  [<ffffffffa02de354>] ipoib_mcast_restart_task+0x3a4/0x4d0 [ib_ipoib]
[1297655.474748]  [<ffffffff81079a86>] ? finish_task_switch+0x76/0x220
[1297655.474750]  [<ffffffff8106bdf9>] process_one_work+0x159/0x450
[1297655.474752]  [<ffffffff8106c4a9>] worker_thread+0x69/0x490
[1297655.474753]  [<ffffffff8106c440>] ? rescuer_thread+0x350/0x350
[1297655.474755]  [<ffffffff8106c440>] ? rescuer_thread+0x350/0x350
[1297655.474757]  [<ffffffff8107161f>] kthread+0xef/0x110
[1297655.474759]  [<ffffffff81071530>] ? kthread_park+0x60/0x60
[1297655.474761]  [<ffffffff816149ff>] ret_from_fork+0x3f/0x70
[1297655.474763]  [<ffffffff81071530>] ? kthread_park+0x60/0x60

ipoib_mcast_restart_task+0x3a4 corresponds to: 


/*
 * make sure the in-flight joins have finished before we attempt
 * to leave
 */
	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
			wait_for_completion(&mcast->done);

However, wait_for_completion never returns. An admin then logs on to
the node and downs the ib0 interface in the hope of resolving the
situation, which in turn leads to the following backtrace: 
[1297655.475229] ip              D ffff8820cc1e35a8     0 24895      1 0x00000004
[1297655.475231]  ffff8820cc1e35a8 ffff883fbf2a2940 ffff88239d7ae040 ffff8820cc1e4000
[1297655.475233]  7fffffffffffffff ffff8820cc1e3700 ffff88239d7ae040 ffff88239d7ae040
[1297655.475234]  ffff8820cc1e35c0 ffffffff816103dc ffff8820cc1e36f8 ffff8820cc1e3640
[1297655.475236] Call Trace:
[1297655.475238]  [<ffffffff816103dc>] schedule+0x3c/0x90
[1297655.475239]  [<ffffffff81613642>] schedule_timeout+0x202/0x260
[1297655.475241]  [<ffffffff8107c8b9>] ? try_to_wake_up+0x49/0x430
[1297655.475244]  [<ffffffff810b0f94>] ? lock_timer_base.isra.37+0x54/0x70
[1297655.475246]  [<ffffffff816118df>] wait_for_completion+0xcf/0x130
[1297655.475247]  [<ffffffff8107cd30>] ? wake_up_q+0x70/0x70
[1297655.475249]  [<ffffffff8106986a>] flush_workqueue+0x11a/0x5d0
[1297655.475253]  [<ffffffffa02dda76>] ipoib_mcast_stop_thread+0x46/0x50 [ib_ipoib]
[1297655.475255]  [<ffffffffa02dbca2>] ipoib_ib_dev_down+0x22/0x40 [ib_ipoib]
[1297655.475257]  [<ffffffffa02d7f8d>] ipoib_stop+0x2d/0xb0 [ib_ipoib]
[1297655.475261]  [<ffffffff81546f28>] __dev_close_many+0x98/0xf0
[1297655.475263]  [<ffffffff815470d6>] __dev_close+0x36/0x50
[1297655.475266]  [<ffffffff8154ff6d>] __dev_change_flags+0x9d/0x160
[1297655.475268]  [<ffffffff81550059>] dev_change_flags+0x29/0x70
[1297655.475269]  [<ffffffff81308645>] ? find_next_bit+0x15/0x20
[1297655.475271]  [<ffffffff8155de2b>] do_setlink+0x5db/0xad0
[1297655.475272]  [<ffffffff8108d115>] ? update_sd_lb_stats+0x115/0x510
[1297655.475275]  [<ffffffff8114898c>] ? zone_statistics+0x7c/0xa0
[1297655.475277]  [<ffffffff8114898c>] ? zone_statistics+0x7c/0xa0
[1297655.475278]  [<ffffffff8114898c>] ? zone_statistics+0x7c/0xa0
[1297655.475283]  [<ffffffff8131e972>] ? nla_parse+0x32/0x100
[1297655.475284]  [<ffffffff8155f498>] rtnl_newlink+0x528/0x8c0
[1297655.475289]  [<ffffffff81131ed6>] ? __alloc_pages_nodemask+0x1a6/0xb90
[1297655.475291]  [<ffffffff8131e972>] ? nla_parse+0x32/0x100
[1297655.475293]  [<ffffffff8155c9e2>] rtnetlink_rcv_msg+0x92/0x230
[1297655.475295]  [<ffffffff815392aa>] ? __alloc_skb+0x7a/0x1d0
[1297655.475296]  [<ffffffff8155c950>] ? rtnetlink_rcv+0x30/0x30
[1297655.475298]  [<ffffffff8157ef84>] netlink_rcv_skb+0xa4/0xc0
[1297655.475299]  [<ffffffff8155c948>] rtnetlink_rcv+0x28/0x30
[1297655.475301]  [<ffffffff8157e763>] netlink_unicast+0x103/0x180
[1297655.475303]  [<ffffffff8157ec9c>] netlink_sendmsg+0x4bc/0x5d0
[1297655.475305]  [<ffffffff81531748>] sock_sendmsg+0x38/0x50
[1297655.475306]  [<ffffffff81531c55>] ___sys_sendmsg+0x285/0x290
[1297655.475308]  [<ffffffff8153097f>] ? sock_destroy_inode+0x2f/0x40
[1297655.475310]  [<ffffffff811b39fe>] ? evict+0x12e/0x190
[1297655.475312]  [<ffffffff811ae9ee>] ? dentry_free+0x4e/0x90
[1297655.475313]  [<ffffffff811af6f2>] ? __dentry_kill+0x162/0x1e0
[1297655.475315]  [<ffffffff811af965>] ? dput+0x1f5/0x230
[1297655.475317]  [<ffffffff811b8c24>] ? mntput+0x24/0x40
[1297655.475319]  [<ffffffff8119a968>] ? __fput+0x188/0x1f0
[1297655.475320]  [<ffffffff81532322>] __sys_sendmsg+0x42/0x80
[1297655.475322]  [<ffffffff81532372>] SyS_sendmsg+0x12/0x20
[1297655.475324]  [<ffffffff8161465b>] entry_SYSCALL_64_fastpath+0x16/0x6e

What goes wrong here is that this task hangs waiting for the workqueue 
to be flushed (while holding rtnl_lock), and the flush never completes 
because ipoib_mcast_restart_task is itself stuck on the join. 

This makes me wonder whether using a timeout is actually better than
blindly relying on the join completing. So Doug, what would
you say about the following as a proposed fix (not tested)?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 87799de90a1d..f6f15d36b02d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -947,7 +947,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
         */
        list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
                if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
-                       wait_for_completion(&mcast->done);
+                       wait_for_completion_timeout(&mcast->done, 30 * HZ);
 
        list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
                ipoib_mcast_leave(mcast->dev, mcast);

Given the loop afterwards, which uses ipoib_mcast_leave/free, that should work?
Looking at the code in ipoib_mcast_leave it seems we are going to trigger a
warning, which is preferable to grinding the machine to a halt? 

Does the proposed patch break things horribly?

Regards, 
Nikolay 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [IPoIB] Missing join mcast events causing full machine lockup
From: Doug Ledford @ 2016-08-02 19:21 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, SiteGround Operations


On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
> Hello, 
> 
> With running the risk of sounding like a broken record, I came
> across 
> another case where ipoib can cause the machine to go haywire due to 
> missed join requests. This is on 4.4.14 kernel. Here is what I
> believe 
> happens:

[ snip long traces ]

> This makes me wonder if using timeouts is actually better than
> blindly relying on completing the join.

Blindly relying on the join completions is not what we do.  We are very
careful to make sure we always have the right locking so that we never
leave a join request in the BUSY state without running the completion
at some time.  If you are seeing us do that, then it means we have a
bug in our locking or state processing.  The answer then is to find
that bug and not to paper over it with a timeout.  Can you find some
way to reproduce this with a 4.7 kernel?

>  So Doug, what would
> you say about the following as a proposed fix (not tested):
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 87799de90a1d..f6f15d36b02d 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -947,7 +947,7 @@ void ipoib_mcast_restart_task(struct work_struct
> *work)
>          */
>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
>                 if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> -                       wait_for_completion(&mcast->done);
> +                       wait_for_completion_timeout(&mcast->done, 30
> * HZ);
>  
>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
>                 ipoib_mcast_leave(mcast->dev, mcast);
> 
> Given the loop afterwards which uses ipoib_mcast_(leave_free) that
> should work?
> Looking at the code in ipoib_mcast_leave it seems we are going to
> trigger a warning, 
> which is preferable to putting the machine to a grinding halt? 
> 
> Does the proposed patch break things horribly ?

It violates the intent of the join processing.  And if we have the
problem you are seeing, we really need to know if it's broken in IPoIB
or deeper down in the core portion of the stack.  Breaking out and
continuing might be OK, but if we do, we are likely going to either
leak something or have a use-after-free or something like that, so I
would have to spend some time thinking about how things might go wrong
and whether or not it's better to stop the machine when this happens,
or continue and hope we don't corrupt memory somehow.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



* Re: [IPoIB] Missing join mcast events causing full machine lockup
From: Nikolay Borisov @ 2016-08-02 20:18 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Nikolay Borisov, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	SiteGround Operations

On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
>> Hello,
>>
>> With running the risk of sounding like a broken record, I came
>> across
>> another case where ipoib can cause the machine to go haywire due to
>> missed join requests. This is on 4.4.14 kernel. Here is what I
>> believe
>> happens:
>
> [ snip long traces ]
>
>> This makes me wonder if using timeouts is actually better than
>> blindly relying on completing the join.
>
> Blindly relying on the join completions is not what we do.  We are very
> careful to make sure we always have the right locking so that we never
> leave a join request in the BUSY state without running the completion
> at some time.  If you are seeing us do that, then it means we have a
> bug in our locking or state processing.  The answer then is to find
> that bug and not to paper over it with a timeout.  Can you find some
> way to reproduce this with a 4.7 kernel?

Unfortunately my environment is constrained to the 4.4 kernel. I will,
however, try to get a couple of IB-enabled nodes on 4.7 and see if
something shows up. And while I don't have a 100% reproducer, I see these
symptoms rather regularly on production nodes. I'm able and happy to
extract any runtime state that might be useful in debugging this, i.e.
I can obtain crashdumps and recover the state of the ipoib stacks. I've
seen this issue on 3.12 and on 4.4. Some of my previous emails also show
this manifesting as hangs in cm_destroy_id as well. So clearly there is
a problem there, but it proves very elusive.

>
>>  So Doug, what would
>> you say about the following as a proposed fix (not tested):
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index 87799de90a1d..f6f15d36b02d 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -947,7 +947,7 @@ void ipoib_mcast_restart_task(struct work_struct
>> *work)
>>          */
>>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
>>                 if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>> -                       wait_for_completion(&mcast->done);
>> +                       wait_for_completion_timeout(&mcast->done, 30
>> * HZ);
>>
>>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
>>                 ipoib_mcast_leave(mcast->dev, mcast);
>>
>> Given the loop afterwards which uses ipoib_mcast_(leave_free) that
>> should work?
>> Looking at the code in ipoib_mcast_leave it seems we are going to
>> trigger a warning,
>> which is preferable to putting the machine to a grinding halt?
>>
>> Does the proposed patch break things horribly ?
>
> It violates the intent of the join processing.  And if we have the
> problem you are seeing, we really need to know if it's broken in IPoIB
> or deeper down in the core portion of the stack.  Breaking out and
> continuing might be OK, but if we do, we are likely going to either
> leak something or have a use-after-free or something like that, so I
> would have to spend some time thinking about how things might go wrong
> and whether or not it's better to stop the machine when this happens,
> or continue and hope we don't corrupt memory somehow.
>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>               GPG KeyID: 0E572FDD


* Re: [IPoIB] Missing join mcast events causing full machine lockup
From: Doug Ledford @ 2016-08-02 20:29 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, SiteGround Operations


On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> wrote:
> > 
> > On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
> > > 
> > > Hello,
> > > 
> > > With running the risk of sounding like a broken record, I came
> > > across
> > > another case where ipoib can cause the machine to go haywire due
> > > to
> > > missed join requests. This is on 4.4.14 kernel. Here is what I
> > > believe
> > > happens:
> > 
> > [ snip long traces ]
> > 
> > > 
> > > This makes me wonder if using timeouts is actually better than
> > > blindly relying on completing the join.
> > 
> > Blindly relying on the join completions is not what we do.  We are
> > very
> > careful to make sure we always have the right locking so that we
> > never
> > leave a join request in the BUSY state without running the
> > completion
> > at some time.  If you are seeing us do that, then it means we have
> > a
> > bug in our locking or state processing.  The answer then is to find
> > that bug and not to paper over it with a timeout.  Can you find
> > some
> > way to reproduce this with a 4.7 kernel?
> 
> Unfortunately my environment is constrained to 4.4 kernel. I will,
> however,
> try and check if I can get a couple of IB-enabled nodes on 4.7 and
> see
> if something
> shows up. And while I don't have a 100% reproducer for it I see those
> symptoms rather regularly
> on production nodes. I'm able and happy to extract any runtime state
> that might be useful in debugging this i.e I can obtain crashdumps
> and
> reverse the state of the ipoib stacks. I've seen this issue on 3.12
> and on 4.4.
> Some of my previous emails also show this manifesting in hangs in
> cm_destroy_id
> as well. So clearly there is a problem there but it proves very
> elusive.

Can you give any clues as to what's causing it?  Do you have link flap?
SM bounces?  Lots of multicast joins/leaves?

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



* Re: [IPoIB] Missing join mcast events causing full machine lockup
From: Nikolay Borisov @ 2016-08-03  8:18 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, SiteGround Operations



On 08/02/2016 11:29 PM, Doug Ledford wrote:
> On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
>> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> wrote:
>>>
>>> On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
>>>>
>>>> Hello,
>>>>
>>>> With running the risk of sounding like a broken record, I came
>>>> across
>>>> another case where ipoib can cause the machine to go haywire due
>>>> to
>>>> missed join requests. This is on 4.4.14 kernel. Here is what I
>>>> believe
>>>> happens:
>>>
>>> [ snip long traces ]
>>>
>>>>
>>>> This makes me wonder if using timeouts is actually better than
>>>> blindly relying on completing the join.
>>>
>>> Blindly relying on the join completions is not what we do.  We are
>>> very
>>> careful to make sure we always have the right locking so that we
>>> never
>>> leave a join request in the BUSY state without running the
>>> completion
>>> at some time.  If you are seeing us do that, then it means we have
>>> a
>>> bug in our locking or state processing.  The answer then is to find
>>> that bug and not to paper over it with a timeout.  Can you find
>>> some
>>> way to reproduce this with a 4.7 kernel?
>>
>> Unfortunately my environment is constrained to 4.4 kernel. I will,
>> however,
>> try and check if I can get a couple of IB-enabled nodes on 4.7 and
>> see
>> if something
>> shows up. And while I don't have a 100% reproducer for it I see those
>> symptoms rather regularly
>> on production nodes. I'm able and happy to extract any runtime state
>> that might be useful in debugging this i.e I can obtain crashdumps
>> and
>> reverse the state of the ipoib stacks. I've seen this issue on 3.12
>> and on 4.4.
>> Some of my previous emails also show this manifesting in hangs in
>> cm_destroy_id
>> as well. So clearly there is a problem there but it proves very
>> elusive.
> 
> Can you give any clues as to what's causing it?  Do you have link flap?
> SM bounces?  Lots of multicast joins/leaves?

I spoke with the network admins and they said that the network is not
flapping, and we shouldn't have a lot of joins/leaves since the network is
not that big and is stable, e.g. once nodes join they usually are not
restarted. 

Here are some messages which show up after the said hangs happen: 

Aug  1 04:53:51 node1 kernel: [29100.763267] ib0: Budget exhausted after napi rescheduled 
Jul 31 21:29:46 node1 kernel: [ 2457.666476] NETDEV WATCHDOG: ib0 (ib_qib): transmit queue 0 timed out
Jul 29 05:17:36 node1 kernel: [ 8797.968402] ib0: dev_queue_xmit failed to requeue packet
Jul 23 19:27:22 node1 kernel: ib0: packet len 2200 (> 2048) too long to send, dropping
Jul 25 01:01:52 node1 kernel: ib0: queue stopped 1, tx_head 124520708, tx_tail 124520580

Aug  2 10:05:26 node15 bird6: LocalIPv6: Socket error on ib0: No buffer space available

Also, I'm told that a remote port reset *sometimes* fixes the issue, but
only sometimes; otherwise the port is completely inactive. 

I realize this is not much information, but this issue really just rears
its ugly head out of nowhere, and usually there doesn't seem to be much
information to go on ;(

> 


* Re: [IPoIB] Missing join mcast events causing full machine lockup
From: Marian Marinov @ 2016-08-04  0:17 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, SiteGround Operations

On 08/03/2016 11:18 AM, Nikolay Borisov wrote:
> 
> 
> On 08/02/2016 11:29 PM, Doug Ledford wrote:
>> On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
>>> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> wrote:
>>>>
>>>> On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> With running the risk of sounding like a broken record, I came
>>>>> across
>>>>> another case where ipoib can cause the machine to go haywire due
>>>>> to
>>>>> missed join requests. This is on 4.4.14 kernel. Here is what I
>>>>> believe
>>>>> happens:
>>>>
>>>> [ snip long traces ]
>>>>
>>>>>
>>>>> This makes me wonder if using timeouts is actually better than
>>>>> blindly relying on completing the join.
>>>>
>>>> Blindly relying on the join completions is not what we do.  We are
>>>> very
>>>> careful to make sure we always have the right locking so that we
>>>> never
>>>> leave a join request in the BUSY state without running the
>>>> completion
>>>> at some time.  If you are seeing us do that, then it means we have
>>>> a
>>>> bug in our locking or state processing.  The answer then is to find
>>>> that bug and not to paper over it with a timeout.  Can you find
>>>> some
>>>> way to reproduce this with a 4.7 kernel?
>>>
>>> Unfortunately my environment is constrained to 4.4 kernel. I will,
>>> however,
>>> try and check if I can get a couple of IB-enabled nodes on 4.7 and
>>> see
>>> if something
>>> shows up. And while I don't have a 100% reproducer for it I see those
>>> symptoms rather regularly
>>> on production nodes. I'm able and happy to extract any runtime state
>>> that might be useful in debugging this i.e I can obtain crashdumps
>>> and
>>> reverse the state of the ipoib stacks. I've seen this issue on 3.12
>>> and on 4.4.
>>> Some of my previous emails also show this manifesting in hangs in
>>> cm_destroy_id
>>> as well. So clearly there is a problem there but it proves very
>>> elusive.
>>
>> Can you give any clues as to what's causing it?  Do you have link flap?
>> SM bounces?  Lots of multicast joins/leaves?
> 
> I spoke with the network admins and they said that the network is not flapping, 
> we shouldn't have a lot of join/leaves since the network is not that big and is 
> stable. E.g. once nodes joins they usually are not restarted. 
> 
> Here are some messages which result after the said hangs happen: 
> 
> Aug  1 04:53:51 node1 kernel: [29100.763267] ib0: Budget exhausted after napi rescheduled 
> Jul 31 21:29:46 node1 kernel: [ 2457.666476] NETDEV WATCHDOG: ib0 (ib_qib): transmit queue 0 timed out
> Jul 29 05:17:36 node1 kernel: [ 8797.968402] ib0: dev_queue_xmit failed to requeue packet
> Jul 23 19:27:22 node1 kernel: ib0: packet len 2200 (> 2048) too long to send, dropping
> Jul 25 01:01:52 node1 kernel: ib0: queue stopped 1, tx_head 124520708, tx_tail 124520580
> 
> Aug  2 10:05:26 node15 bird6: LocalIPv6: Socket error on ib0: No buffer space available
> 
> Also I'm being told that *sometimes* doing a remote port reset actually fixes the issue,
> but only sometimes as otherwise the port is completely inactive. 
> 
> I realize this is not much information but this issue really just rears its ugly head out
> of nowhere and usually there doesn't seem to be that much information ;(

Actually, this happens when we set up 6-8 IB switches in a ring topology
with the MinHop SM routing algorithm. Since this topology does not work
well with MinHop, the SM regularly drops some of the switches, leaving the
nodes there configured but without access to the actual IB network outside
their current switch. After some time the SM re-enables that port but
disables another, so the network topology actually changes, frequently.

We are now changing the topology, but we would like to find out why IPoIB
behaves the way it does.

Marian
> 
>>
> 


* Re: [IPoIB] Missing join mcast events causing full machine lockup
From: Nikolay Borisov @ 2016-08-17 11:26 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, SiteGround Operations



On 08/02/2016 11:29 PM, Doug Ledford wrote:
> On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
>> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> wrote:
>>>
>>> On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
>>>>
>>>> Hello,
>>>>
>>>> With running the risk of sounding like a broken record, I came
>>>> across
>>>> another case where ipoib can cause the machine to go haywire due
>>>> to
>>>> missed join requests. This is on 4.4.14 kernel. Here is what I
>>>> believe
>>>> happens:
>>>
>>> [ snip long traces ]
>>>
>>>>
>>>> This makes me wonder if using timeouts is actually better than
>>>> blindly relying on completing the join.
>>>
>>> Blindly relying on the join completions is not what we do.  We are
>>> very
>>> careful to make sure we always have the right locking so that we
>>> never
>>> leave a join request in the BUSY state without running the
>>> completion
>>> at some time.  If you are seeing us do that, then it means we have
>>> a
>>> bug in our locking or state processing.  The answer then is to find
>>> that bug and not to paper over it with a timeout.  Can you find
>>> some
>>> way to reproduce this with a 4.7 kernel?
>>
>> Unfortunately my environment is constrained to 4.4 kernel. I will,
>> however,
>> try and check if I can get a couple of IB-enabled nodes on 4.7 and
>> see
>> if something
>> shows up. And while I don't have a 100% reproducer for it I see those
>> symptoms rather regularly
>> on production nodes. I'm able and happy to extract any runtime state
>> that might be useful in debugging this i.e I can obtain crashdumps
>> and
>> reverse the state of the ipoib stacks. I've seen this issue on 3.12
>> and on 4.4.
>> Some of my previous emails also show this manifesting in hangs in
>> cm_destroy_id
>> as well. So clearly there is a problem there but it proves very
>> elusive.
> 
> Can you give any clues as to what's causing it?  Do you have link flap?
> SM bounces?  Lots of multicast joins/leaves?

Hello again. After some testing and a lot more reboots, I think we've
managed to isolate a culprit. Based on data we've observed on the
switches, it seems that when a particular switch is congested it starts
queuing packets internally, and once its queue overflows it starts
dropping packets. Our switches show that they discard a lot of packets
when we increase the amount of traffic. Since our network is linear,
e.g. switch 1 -> switch 2 -> switch 3, if a node on sw1 sends packets to
a node on sw2 and sw2 is congested, those packets may be silently
discarded. This in turn causes ipoib (and the MAD drivers) to wait for a
response to a packet that never reached its destination. Does that sound
plausible?


> 

