* mlx4 catas_reset hangs when using the CM
@ 2012-03-29 11:41 sebastien dugue
  2012-03-30 18:33 ` Roland Dreier
  0 siblings, 1 reply; 3+ messages in thread
From: sebastien dugue @ 2012-03-29 11:41 UTC (permalink / raw)
  To: linux-rdma; +Cc: Sean Hefty, Roland Dreier, Vincent, Celine


  Hi,

  when the mlx4 FW generates an internal error, the driver's catas code tries to
reset the HCA and restart the stack. However, if the CM is in use at that moment,
the stack shutdown never completes and hangs in the CM waiting for a refcount
that never reaches 0.

  I don't know much about how the CM works, but here is the stack of the hung task:

crash> ps mlx4
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
   5166      2   0  ffff88033c779040  UN   0.0       0      0  [mlx4]
crash> bt 5166
PID: 5166   TASK: ffff88033c779040  CPU: 0   COMMAND: "mlx4"
 #0 [ffff880336113a30] schedule at ffffffff8147dddc
 #1 [ffff880336113af8] schedule_timeout at ffffffff8147eb25
 #2 [ffff880336113ba8] wait_for_common at ffffffff8147e767
 #3 [ffff880336113c38] wait_for_completion at ffffffff8147e8cd
 #4 [ffff880336113c48] cma_remove_one at ffffffffa03ada3e [rdma_cm]
 #5 [ffff880336113ce8] ib_unregister_device at ffffffffa01f7677 [ib_core]
 #6 [ffff880336113d28] mlx4_ib_remove at ffffffffa036a0b9 [mlx4_ib]
 #7 [ffff880336113d58] mlx4_remove_device at ffffffffa0340fb4 [mlx4_core]
 #8 [ffff880336113d88] mlx4_unregister_device at ffffffffa034100b [mlx4_core]
 #9 [ffff880336113da8] mlx4_remove_one at ffffffffa034180e [mlx4_core]
#10 [ffff880336113dd8] mlx4_restart_one at ffffffffa0344c46 [mlx4_core]
#11 [ffff880336113df8] catas_reset at ffffffffa033b115 [mlx4_core]
#12 [ffff880336113e38] worker_thread at ffffffff810749a0
#13 [ffff880336113ee8] kthread at ffffffff81079f36
#14 [ffff880336113f48] kernel_thread at ffffffff810041aa


  and the following cma_device data:

crash> cma_device 0xffff88033c04f1c0
struct cma_device {
  list = {
    next = 0xdead000000100100, 
    prev = 0xdead000000200200
  }, 
  device = 0xffff8802e06b0000, 
  comp = {
    done = 0, 
    wait = {
      lock = {
        raw_lock = {
          slock = 65537
        }
      }, 
      task_list = {
        next = 0xffff880336113be8, 
        prev = 0xffff880336113be8
      }
    }
  }, 
  refcount = {
    counter = 1
  }, 
  id_list = {
    next = 0xffff88033c04f200, 
    prev = 0xffff88033c04f200
  }
}
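
  One way to read the dump above is to check its pointers against each other.
The sketch below does that arithmetic. The poison constants are the x86_64
LIST_POISON1/LIST_POISON2 values. The offset of id_list and the 8 KiB
stack size are assumptions inferred from the field order in the dump and from
the frame addresses in the backtrace, not taken from kernel headers.

```python
# Values copied from the crash output in this message.
CMA_DEV_ADDR = 0xffff88033c04f1c0
LIST_NEXT = 0xdead000000100100
LIST_PREV = 0xdead000000200200
ID_LIST_NEXT = 0xffff88033c04f200
ID_LIST_PREV = 0xffff88033c04f200
WAITER = 0xffff880336113be8        # comp.wait.task_list.next (== prev)
FRAME_ADDR = 0xffff880336113c38    # frame #3, wait_for_completion

# x86_64 list poison: POISON_POINTER_DELTA + 0x00100100 / 0x00200200.
LIST_POISON1 = 0xdead000000000000 + 0x00100100
LIST_POISON2 = 0xdead000000000000 + 0x00200200

# ASSUMPTION: layout inferred from the dump's field order on x86_64:
# list (16) + device (8) + comp (32) + refcount padded to 8 bytes.
ID_LIST_OFFSET = 16 + 8 + 32 + 8

# ASSUMPTION: 8 KiB kernel stacks aligned on their own size.
STACK_SIZE = 8192


def removed_from_dev_list():
    """True if list_del() already unlinked the device from the global list."""
    return (LIST_NEXT, LIST_PREV) == (LIST_POISON1, LIST_POISON2)


def id_list_empty():
    """True if id_list points back at itself, i.e. no ids remain attached."""
    head = CMA_DEV_ADDR + ID_LIST_OFFSET
    return ID_LIST_NEXT == head and ID_LIST_PREV == head


def waiter_on_hung_stack():
    """True if the single completion waiter lives on the hung task's stack."""
    base = FRAME_ADDR & ~(STACK_SIZE - 1)
    return base <= WAITER < base + STACK_SIZE
```

  Under these assumptions all three checks hold: the device is already off
the device list, no ids remain on id_list, and the only waiter on comp is the
mlx4 worker itself, while refcount is still 1.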


  So it looks like cma_process_remove() did all of its cleanup work
but is hung waiting for the client refcount to reach 0, which never happens.

  This happens with OFED from 1.5.3 to 1.5.4.1.


  The steps to reproduce:

  1. start a qperf using the CM between 2 nodes (client and victim):

victim$ qperf
client$ qperf victim -cm 1 -t 10000 rc_bw

  2. trigger a catas reset on the victim node by writing a non-zero value at the
     head of the mlx4 error buffer:

victim# mstmwrite mlx4_0 0x1f020 1
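
  After step 2, the hang can also be spotted on the victim without crash by
looking for an mlx4 task that stays in uninterruptible sleep (the UN state in
the ps output above, shown as D in /proc). A minimal sketch, assuming the
standard Linux /proc/<pid>/stat format; the helper names are hypothetical:

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(comm_prefix="mlx4", proc="/proc"):
    """List (pid, comm) of tasks in state D whose name starts with comm_prefix."""
    hung = []
    try:
        entries = os.listdir(proc)
    except OSError:
        return hung  # no procfs available
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited meanwhile, or unreadable entry
        if state == "D" and comm.startswith(comm_prefix):
            hung.append((pid, comm))
    return hung
```

  A task can legitimately sit in D state for a short time, so only a task
that is reported on every call over several seconds indicates the hang.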


  Sebastien.



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: mlx4 catas_reset hangs when using the CM
  2012-03-29 11:41 mlx4 catas_reset hangs when using the CM sebastien dugue
@ 2012-03-30 18:33 ` Roland Dreier
  0 siblings, 1 reply; 3+ messages in thread
From: Roland Dreier @ 2012-03-30 18:33 UTC (permalink / raw)
  To: sebastien dugue; +Cc: linux-rdma, Sean Hefty, Vincent, Celine

On Thu, Mar 29, 2012 at 4:41 AM, sebastien dugue
<sebastien.dugue-6ktuUTfB/bM@public.gmane.org> wrote:
>  So it looks like cma_process_remove() did all of its cleanup work
> but is hung waiting for the client refcount to reach 0, which never happens.

This is unfortunately expected with the current implementation.  Because we
don't have a way to revoke active userspace users of the device, we just send
them a catastrophic error async event and then expect them to close the
device.  But if they never close the device, we just wait forever.
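
The pattern can be illustrated with a small userspace model. This is only a
sketch of the refcount-plus-completion idea, not the rdma_cm code: the class
and function names are hypothetical, and the timeout exists only so that the
example terminates, whereas the wait described above has none.

```python
import threading


class ModelDevice:
    """Userspace model of a device guarded by a refcount and a completion."""

    def __init__(self):
        self._lock = threading.Lock()
        self.refcount = 1              # reference held by the registration
        self.comp = threading.Event()  # signalled when refcount reaches 0

    def get(self):
        """A consumer takes a reference (e.g. when it opens the device)."""
        with self._lock:
            self.refcount += 1

    def put(self):
        """Drop a reference; the last one out signals the completion."""
        with self._lock:
            self.refcount -= 1
            last = self.refcount == 0
        if last:
            self.comp.set()


def remove(dev, consumers, timeout):
    """Notify consumers, drop the registration reference, then wait.

    Returns True if every reference was released before the timeout.
    """
    for on_fatal in consumers:
        on_fatal(dev)   # models delivery of the catastrophic error event
    dev.put()
    return dev.comp.wait(timeout)


def well_behaved(dev):
    dev.put()           # closes the device when told about the error


def stuck(dev):
    pass                # ignores the event and keeps its reference
```

With one consumer holding a reference, remove(dev, [well_behaved], 0.1)
returns True, while remove(dev, [stuck], 0.1) returns False and leaves
refcount at 1, the same value as in the cma_device dump.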

 - R.

* Re: mlx4 catas_reset hangs when using the CM
@ 2012-04-02  7:02     ` sebastien dugue
  0 siblings, 0 replies; 3+ messages in thread
From: sebastien dugue @ 2012-04-02  7:02 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, Sean Hefty, Vincent, Celine

On Fri, 30 Mar 2012 11:33:56 -0700
Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:

> On Thu, Mar 29, 2012 at 4:41 AM, sebastien dugue
> <sebastien.dugue-6ktuUTfB/bM@public.gmane.org> wrote:
> >  So it looks like cma_process_remove() did all of its cleanup work
> > but is hung waiting for the client refcount to reach 0, which never happens.
> 
> This is unfortunately expected with the current implementation.  Because we
> don't have a way to revoke active userspace users of the device, we just send
> them a catastrophic error async event and then expect them to close the
> device.  But if they never close the device, we just wait forever.

  Yep, that's what I gathered. Unfortunately, having Lustre as a client
is not going to help.

  Thanks,

  Sébastien.

> 
>  - R.
