All of lore.kernel.org
* [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation
@ 2018-01-06  0:22 Randy Dunlap
       [not found] ` <5a5016c0.4c0a620a.ed2b3.60da-ATjtLOhZ0NVl57MIdRCFDg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Randy Dunlap @ 2018-01-06  0:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Randy Dunlap

Use correct parameter names in kernel-doc notation to eliminate
warnings from scripts/kernel-doc.

../drivers/infiniband/ulp/srpt/ib_srpt.c:1146: warning: Excess function parameter 'context' description in 'srpt_abort_cmd'
../drivers/infiniband/ulp/srpt/ib_srpt.c:1482: warning: Excess function parameter 'ioctx' description in 'srpt_handle_new_iu'

Signed-off-by: Randy Dunlap <rdunlap-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 drivers/infiniband/ulp/srpt/ib_srpt.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-next-20171222.orig/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ linux-next-20171222/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -1139,7 +1139,6 @@ static struct srpt_send_ioctx *srpt_get_
 /**
  * srpt_abort_cmd() - Abort a SCSI command.
  * @ioctx:   I/O context associated with the SCSI command.
- * @context: Preferred execution context.
  */
 static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
 {
@@ -1473,7 +1472,8 @@ fail:
 /**
  * srpt_handle_new_iu() - Process a newly received information unit.
  * @ch:    RDMA channel through which the information unit has been received.
- * @ioctx: SRPT I/O context associated with the information unit.
+ * @recv_ioctx: SRPT I/O context associated with the receive information unit.
+ * @send_ioctx: SRPT I/O context associated with the send information unit.
  */
 static void srpt_handle_new_iu(struct srpt_rdma_ch *ch,
 			       struct srpt_recv_ioctx *recv_ioctx,


-- 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation
       [not found] ` <5a5016c0.4c0a620a.ed2b3.60da-ATjtLOhZ0NVl57MIdRCFDg@public.gmane.org>
@ 2018-01-06  0:36   ` Bart Van Assche
       [not found]     ` <fcc3f226-848d-abc4-2a81-f4fd821761c9-Sjgp3cTcYWE@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-06  0:36 UTC (permalink / raw)
  To: Randy Dunlap, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Randy Dunlap

On 01/05/18 16:22, Randy Dunlap wrote:
> Use correct parameter names in kernel-doc notation to eliminate
> warnings from scripts/kernel-doc.
> 
> ../drivers/infiniband/ulp/srpt/ib_srpt.c:1146: warning: Excess function parameter 'context' description in 'srpt_abort_cmd'
> ../drivers/infiniband/ulp/srpt/ib_srpt.c:1482: warning: Excess function parameter 'ioctx' description in 'srpt_handle_new_iu'
> 
> Signed-off-by: Randy Dunlap <rdunlap-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
>   drivers/infiniband/ulp/srpt/ib_srpt.c |    4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-next-20171222.orig/drivers/infiniband/ulp/srpt/ib_srpt.c
> +++ linux-next-20171222/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -1139,7 +1139,6 @@ static struct srpt_send_ioctx *srpt_get_
>   /**
>    * srpt_abort_cmd() - Abort a SCSI command.
>    * @ioctx:   I/O context associated with the SCSI command.
> - * @context: Preferred execution context.
>    */
>   static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
>   {
> @@ -1473,7 +1472,8 @@ fail:
>   /**
>    * srpt_handle_new_iu() - Process a newly received information unit.
>    * @ch:    RDMA channel through which the information unit has been received.
> - * @ioctx: SRPT I/O context associated with the information unit.
> + * @recv_ioctx: SRPT I/O context associated with the receive information unit.
> + * @send_ioctx: SRPT I/O context associated with the send information unit.
>    */
>   static void srpt_handle_new_iu(struct srpt_rdma_ch *ch,
>   			       struct srpt_recv_ioctx *recv_ioctx,

Please drop this patch. It conflicts with a patch series I'm working on.

Thanks,

Bart.


* Re: [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation
       [not found]     ` <fcc3f226-848d-abc4-2a81-f4fd821761c9-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-06  5:55       ` Randy Dunlap
       [not found]         ` <31f69352-b8b1-9ed1-635b-2c654b49c775-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2018-01-09 20:15       ` Laurence Oberman
  1 sibling, 1 reply; 35+ messages in thread
From: Randy Dunlap @ 2018-01-06  5:55 UTC (permalink / raw)
  To: Bart Van Assche, Randy Dunlap, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 01/05/18 16:36, Bart Van Assche wrote:
> On 01/05/18 16:22, Randy Dunlap wrote:
>> Use correct parameter names in kernel-doc notation to eliminate
>> warnings from scripts/kernel-doc.
>>
>> ../drivers/infiniband/ulp/srpt/ib_srpt.c:1146: warning: Excess function parameter 'context' description in 'srpt_abort_cmd'
>> ../drivers/infiniband/ulp/srpt/ib_srpt.c:1482: warning: Excess function parameter 'ioctx' description in 'srpt_handle_new_iu'
>>
>> Signed-off-by: Randy Dunlap <rdunlap-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
>> Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> ---
>>   drivers/infiniband/ulp/srpt/ib_srpt.c |    4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> --- linux-next-20171222.orig/drivers/infiniband/ulp/srpt/ib_srpt.c
>> +++ linux-next-20171222/drivers/infiniband/ulp/srpt/ib_srpt.c
>> @@ -1139,7 +1139,6 @@ static struct srpt_send_ioctx *srpt_get_
>>   /**
>>    * srpt_abort_cmd() - Abort a SCSI command.
>>    * @ioctx:   I/O context associated with the SCSI command.
>> - * @context: Preferred execution context.
>>    */
>>   static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
>>   {
>> @@ -1473,7 +1472,8 @@ fail:
>>   /**
>>    * srpt_handle_new_iu() - Process a newly received information unit.
>>    * @ch:    RDMA channel through which the information unit has been received.
>> - * @ioctx: SRPT I/O context associated with the information unit.
>> + * @recv_ioctx: SRPT I/O context associated with the receive information unit.
>> + * @send_ioctx: SRPT I/O context associated with the send information unit.
>>    */
>>   static void srpt_handle_new_iu(struct srpt_rdma_ch *ch,
>>                      struct srpt_recv_ioctx *recv_ioctx,
> 
> Please drop this patch. It conflicts with a patch series I'm working on.
> 

Sure, no problem.  Do you have this kernel-doc notation fixed, then?


-- 
~Randy


* Re: [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation
       [not found]         ` <31f69352-b8b1-9ed1-635b-2c654b49c775-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2018-01-06 16:50           ` Bart Van Assche
  0 siblings, 0 replies; 35+ messages in thread
From: Bart Van Assche @ 2018-01-06 16:50 UTC (permalink / raw)
  To: rdunlap-wEGCiKHe2LqWVfeAwA7xHQ, rd.dunlab-Re5JQEeQqe8AvxtiuMwx3w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


On Fri, 2018-01-05 at 21:55 -0800, Randy Dunlap wrote:
> On 01/05/18 16:36, Bart Van Assche wrote:
> > Please drop this patch. It conflicts with a patch series I'm working on.
> 
> Sure, no problem.  Do you have this kernel-doc notation fixed, then?

Yes. Please have a look at the series I posted three days ago:
https://www.spinics.net/lists/linux-rdma/msg58896.html.

Thanks,

Bart.


* Re: [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation
       [not found]     ` <fcc3f226-848d-abc4-2a81-f4fd821761c9-Sjgp3cTcYWE@public.gmane.org>
  2018-01-06  5:55       ` Randy Dunlap
@ 2018-01-09 20:15       ` Laurence Oberman
       [not found]         ` <1515528956.3919.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-09 20:15 UTC (permalink / raw)
  To: Bart Van Assche, Randy Dunlap, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Randy Dunlap

On Fri, 2018-01-05 at 16:36 -0800, Bart Van Assche wrote:
> On 01/05/18 16:22, Randy Dunlap wrote:
> > Use correct parameter names in kernel-doc notation to eliminate
> > warnings from scripts/kernel-doc.
> > 
> > ../drivers/infiniband/ulp/srpt/ib_srpt.c:1146: warning: Excess
> > function parameter 'context' description in 'srpt_abort_cmd'
> > ../drivers/infiniband/ulp/srpt/ib_srpt.c:1482: warning: Excess
> > function parameter 'ioctx' description in 'srpt_handle_new_iu'
> > 
> > Signed-off-by: Randy Dunlap <rdunlap-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> > Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > ---
> >   drivers/infiniband/ulp/srpt/ib_srpt.c |    4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > --- linux-next-20171222.orig/drivers/infiniband/ulp/srpt/ib_srpt.c
> > +++ linux-next-20171222/drivers/infiniband/ulp/srpt/ib_srpt.c
> > @@ -1139,7 +1139,6 @@ static struct srpt_send_ioctx *srpt_get_
> >   /**
> >    * srpt_abort_cmd() - Abort a SCSI command.
> >    * @ioctx:   I/O context associated with the SCSI command.
> > - * @context: Preferred execution context.
> >    */
> >   static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
> >   {
> > @@ -1473,7 +1472,8 @@ fail:
> >   /**
> >    * srpt_handle_new_iu() - Process a newly received information
> > unit.
> >    * @ch:    RDMA channel through which the information unit has
> > been received.
> > - * @ioctx: SRPT I/O context associated with the information unit.
> > + * @recv_ioctx: SRPT I/O context associated with the receive
> > information unit.
> > + * @send_ioctx: SRPT I/O context associated with the send
> > information unit.
> >    */
> >   static void srpt_handle_new_iu(struct srpt_rdma_ch *ch,
> >   			       struct srpt_recv_ioctx
> > *recv_ioctx,
> 
> Please drop this patch. It conflicts with a patch series I'm working
> on.
> 
> Thanks,
> 
> Bart.

Hello Bart

As agreed, I pulled your tree and checked out the block-scsi-for-next
branch. I built a kernel to test on mlx5, booted into that kernel, and
mapped my SRP devices.

The first test I always run is a reboot after mapping the LUNs; only the
client is running your kernel so far, not the server.

Anyway, the client panicked due to a list corruption; the capture is
below.

I suspect you may not have seen this because you have only mlx4, not
mlx5, in your test bed.

[  202.449161] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
[  202.478733] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
[  202.508986] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
[  202.538082] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
[  202.568329] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
[  202.598275] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
[  202.627607] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
[  202.657557] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
[  202.686773] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
[  202.716227] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
[  202.746555] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
[  202.777826] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
[  202.808770] sd 1:0:0:13: [sday] Synchronizing SCSI cache
[  202.839954] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
[  202.870355] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
[  202.900917] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
[  202.930718] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
[  202.960734] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
[  202.990976] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
[  203.020733] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
[  203.050828] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
[  203.081566] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
[  203.112472] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
[  203.143305] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
[  203.174065] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
[  203.205173] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
[  203.236178] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
[  203.266446] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
[  203.297050] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
[  203.327570] sd 1:0:0:0: [sdah] Synchronizing SCSI cache
[  203.357475] sd 2:0:0:1: [sdag] Synchronizing SCSI cache
[  203.387259] sd 2:0:0:2: [sdaf] Synchronizing SCSI cache
[  203.416950] sd 2:0:0:3: [sdae] Synchronizing SCSI cache
[  203.447112] sd 2:0:0:4: [sdad] Synchronizing SCSI cache
[  203.477650] sd 2:0:0:5: [sdac] Synchronizing SCSI cache
[  203.508438] sd 2:0:0:6: [sdab] Synchronizing SCSI cache
[  203.539018] sd 2:0:0:7: [sdaa] Synchronizing SCSI cache
[  203.568806] sd 2:0:0:8: [sdz] Synchronizing SCSI cache
[  203.598575] sd 2:0:0:9: [sdy] Synchronizing SCSI cache
[  203.628063] sd 2:0:0:10: [sdx] Synchronizing SCSI cache
[  203.658096] sd 2:0:0:11: [sdw] Synchronizing SCSI cache
[  203.687453] sd 2:0:0:12: [sdv] Synchronizing SCSI cache
[  203.718127] sd 2:0:0:13: [sdu] Synchronizing SCSI cache
[  203.747953] sd 2:0:0:14: [sdt] Synchronizing SCSI cache
[  203.777593] sd 2:0:0:15: [sds] Synchronizing SCSI cache
[  203.808214] sd 2:0:0:16: [sdr] Synchronizing SCSI cache
[  203.837516] sd 2:0:0:17: [sdq] Synchronizing SCSI cache
[  203.866690] sd 2:0:0:18: [sdp] Synchronizing SCSI cache
[  203.896013] sd 2:0:0:19: [sdo] Synchronizing SCSI cache
[  203.925029] sd 2:0:0:20: [sdn] Synchronizing SCSI cache
[  203.953954] sd 2:0:0:21: [sdm] Synchronizing SCSI cache
[  203.982830] sd 2:0:0:22: [sdl] Synchronizing SCSI cache
[  204.012713] sd 2:0:0:23: [sdk] Synchronizing SCSI cache
[  204.043456] sd 2:0:0:24: [sdj] Synchronizing SCSI cache
[  204.073671] sd 2:0:0:25: [sdi] Synchronizing SCSI cache
[  204.104050] sd 2:0:0:26: [sdh] Synchronizing SCSI cache
[  204.134239] sd 2:0:0:27: [sdg] Synchronizing SCSI cache
[  204.164603] sd 2:0:0:28: [sdf] Synchronizing SCSI cache
[  204.195387] sd 2:0:0:29: [sde] Synchronizing SCSI cache
[  204.225894] sd 2:0:0:0: [sdd] Synchronizing SCSI cache
[  204.256062] mlx5_core 0000:08:00.1: Shutdown was called
[  204.286882] mlx5_core 0000:08:00.1:
mlx5_cmd_force_teardown_hca:245:(pid 15875): teardown with force mode
failed
[  204.296810] mlx5_core 0000:08:00.1: mlx5_cmd_comp_handler:1445:(pid
1028): Command completion arrived after timeout (entry idx = 0).
[  207.477515] mlx5_1:wait_for_async_commands:735:(pid 15875): done
with all pending requests
[  207.529305] sd 1:0:0:0: [sdah] Synchronizing SCSI cache
[  207.563161] scsi 1:0:0:0: alua: Detached
[  207.586589] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
[  207.623036] scsi 1:0:0:29: alua: Detached
[  207.646005] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
[  207.690180] scsi 1:0:0:28: alua: Detached
[  207.713360] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
[  207.749020] scsi 1:0:0:27: alua: Detached
[  207.771957] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
[  207.808036] scsi 1:0:0:26: alua: Detached
[  207.831913] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
[  207.872192] scsi 1:0:0:25: alua: Detached
[  207.895678] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
[  207.931020] scsi 1:0:0:24: alua: Detached
[  207.954279] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
[  207.990180] scsi 1:0:0:23: alua: Detached
[  208.013315] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
[  208.049012] scsi 1:0:0:22: alua: Detached
[  208.072381] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
[  208.112041] scsi 1:0:0:21: alua: Detached
[  208.135881] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
[  208.176006] scsi 1:0:0:20: alua: Detached
[  208.199316] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
[  208.235018] scsi 1:0:0:19: alua: Detached
[  208.257835] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
[  208.294019] scsi 1:0:0:18: alua: Detached
[  208.317725] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
[  208.357016] scsi 1:0:0:17: alua: Detached
[  208.380742] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
[  208.417015] scsi 1:0:0:16: alua: Detached
[  208.440017] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
[  208.479001] scsi 1:0:0:15: alua: Detached
[  208.501658] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
[  208.536039] scsi 1:0:0:14: alua: Detached
[  208.559162] sd 1:0:0:13: [sday] Synchronizing SCSI cache
[  208.595027] scsi 1:0:0:13: alua: Detached
[  208.618418] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
[  208.662175] scsi 1:0:0:12: alua: Detached
[  208.685158] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
[  208.723993] scsi 1:0:0:11: alua: Detached
[  208.747988] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
[  208.787003] scsi 1:0:0:10: alua: Detached
[  208.810841] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
[  208.850000] scsi 1:0:0:9: alua: Detached
[  208.873249] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
[  208.913186] scsi 1:0:0:8: alua: Detached
[  208.936783] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
[  208.973192] scsi 1:0:0:7: alua: Detached
[  208.995709] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
[  209.031179] scsi 1:0:0:6: alua: Detached
[  209.053746] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
[  209.089022] scsi 1:0:0:5: alua: Detached
[  209.112261] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
[  209.148024] scsi 1:0:0:4: alua: Detached
[  209.171354] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
[  209.207014] scsi 1:0:0:3: alua: Detached
[  209.229514] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
[  209.267005] scsi 1:0:0:2: alua: Detached
[  209.290867] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
[  209.326303] scsi 1:0:0:1: alua: Detached
[  211.376056] ib0: multicast join failed for
ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
[  211.439940] scsi host1: ib_srp: connection closed
[  211.466771] scsi host1: ib_srp: connection closed
[  211.493623] scsi host1: ib_srp: connection closed
[  213.425511] ib0: multicast join failed for
ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
[  217.521341] ib0: multicast join failed for
ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
[  220.843344] ------------[ cut here ]------------
[  220.869309] list_add corruption. prev->next should be next
(000000002a07d255), but was           (null). (prev=000000000edf5e8c).
[  220.935392] WARNING: CPU: 1 PID: 694 at lib/list_debug.c:28
__list_add_valid+0x6a/0x70
[  220.979462] Modules linked in: xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi
ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si crypto_simd
dm_service_time iTCO_wdt hpwdt iTCO_vendor_support glue_helper cryptd
ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
acpi_power_meter i7core_edac shpchp
[  221.385270]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core mlxfw
sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw bnx2
devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[  221.554496] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted:
G          I      4.15.0-rc7+ #1
[  221.606907] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[  221.642980] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[  221.674616] RIP: 0010:__list_add_valid+0x6a/0x70
[  221.700561] RSP: 0018:ffffb2bdc75c7cf0 EFLAGS: 00010086
[  221.730608] RAX: 0000000000000000 RBX: ffff94342d610880 RCX:
ffffffff8ba62928
[  221.771490] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
0000000000000046
[  221.812721] RBP: ffff94342d6108b8 R08: 0000000000000000 R09:
0000000000000722
[  221.853073] R10: 0000000000000000 R11: ffffb2bdc75c7a58 R12:
0000000000000200
[  221.894156] R13: 0000000000000246 R14: ffff943fb7fd5000 R15:
ffff943fb7fd5000
[  221.935233] FS:  0000000000000000(0000) GS:ffff944033200000(0000)
knlGS:0000000000000000
[  221.980521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.013062] CR2: 00007f1bdc0ee910 CR3: 00000017e7e0a002 CR4:
00000000000206e0
[  222.052302] Call Trace:
[  222.065971]  ib_mad_post_receive_mads+0x177/0x310 [ib_core]
[  222.097349]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
[  222.124387]  __ib_process_cq+0x55/0xa0 [ib_core]
[  222.150827]  ib_cq_poll_work+0x1b/0x60 [ib_core]
[  222.177751]  process_one_work+0x141/0x340
[  222.200383]  worker_thread+0x47/0x3e0
[  222.220641]  kthread+0xf5/0x130
[  222.238951]  ? rescuer_thread+0x380/0x380
[  222.262034]  ? kthread_associate_blkcg+0x90/0x90
[  222.288514]  ? do_group_exit+0x39/0xa0
[  222.309492]  ret_from_fork+0x1f/0x30
[  222.330073] Code: fe 31 c0 48 c7 c7 98 36 89 8b e8 02 9c cf ff 0f ff
31 c0 c3 48 89 d1 48 c7 c7 48 36 89 8b 48 89 f2 48 89 c6 31 c0 e8 e6 9b
cf ff <0f> ff 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48 8b 
[  222.438058] ---[ end trace 5d41544bf17ab73b ]---
[  222.465993] BUG: unable to handle kernel NULL pointer dereference at
0000000000000028
[  222.510316] IP: ib_mad_post_receive_mads+0x3c/0x310 [ib_core]
[  222.543188] PGD 0 P4D 0 
[  222.557625] Oops: 0000 [#1] SMP PTI
[  222.576674] Modules linked in: xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi
ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si crypto_simd
dm_service_time iTCO_wdt hpwdt iTCO_vendor_support glue_helper cryptd
ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
acpi_power_meter i7core_edac shpchp
[  222.981443]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc
dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core mlxfw
sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw bnx2
devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[  223.152359] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted: G        W
I      4.15.0-rc7+ #1
[  223.198577] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[  223.235101] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[  223.266750] RIP: 0010:ib_mad_post_receive_mads+0x3c/0x310 [ib_core]
[  223.303012] RSP: 0018:ffffb2bdc75c7cf8 EFLAGS: 00010286
[  223.333022] RAX: 0000000000000000 RBX: ffff94342d610908 RCX:
ffff94342d610948
[  223.373307] RDX: 0000000000000001 RSI: ffff94342d6108c0 RDI:
ffff94342d610908
[  223.414451] RBP: ffff94342d610940 R08: ffff94342a8e64c0 R09:
ffff94342a8e64e8
[  223.454789] R10: ffff94342a8e64e8 R11: ffff94342d6109a8 R12:
ffff944029c2e048
[  223.496554] R13: 0000000000000000 R14: ffff94342a8e64c0 R15:
ffff94342d6108c0
[  223.537489] FS:  0000000000000000(0000) GS:ffff944033200000(0000)
knlGS:0000000000000000
[  223.583538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  223.616545] CR2: 0000000000000028 CR3: 00000017e7e0a002 CR4:
00000000000206e0
[  223.657337] Call Trace:
[  223.671022]  ? find_mad_agent+0x77/0x1b0 [ib_core]
[  223.698581]  ? __kmalloc+0x1be/0x1f0
[  223.719074]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
[  223.747190]  __ib_process_cq+0x55/0xa0 [ib_core]
[  223.774140]  ib_cq_poll_work+0x1b/0x60 [ib_core]
[  223.800719]  process_one_work+0x141/0x340
[  223.824120]  worker_thread+0x47/0x3e0
[  223.845133]  kthread+0xf5/0x130
[  223.863116]  ? rescuer_thread+0x380/0x380
[  223.886173]  ? kthread_associate_blkcg+0x90/0x90
[  223.912207]  ? do_group_exit+0x39/0xa0
[  223.933198]  ret_from_fork+0x1f/0x30
[  223.953218] Code: 55 41 54 55 48 8d 6f 38 53 48 89 fb 48 83 ec 50 65
48 8b 04 25 28 00 00 00 48 89 44 24 48 31 c0 48 8b 07 48 85 f6 48 89 4c
24 08 <48> 8b 50 28 8b 12 48 c7 44 24 28 00 00 00 00 c7 44 24 40 01 00 
[  224.059985] RIP: ib_mad_post_receive_mads+0x3c/0x310 [ib_core] RSP:
ffffb2bdc75c7cf8
[  224.103994] CR2: 0000000000000028




* Re: [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation
       [not found]         ` <1515528956.3919.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-09 20:31           ` Laurence Oberman
       [not found]             ` <1515529869.3919.4.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-09 20:31 UTC (permalink / raw)
  To: Bart Van Assche, Randy Dunlap, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Randy Dunlap

On Tue, 2018-01-09 at 15:15 -0500, Laurence Oberman wrote:
> On Fri, 2018-01-05 at 16:36 -0800, Bart Van Assche wrote:
> > On 01/05/18 16:22, Randy Dunlap wrote:
> > > Use correct parameter names in kernel-doc notation to eliminate
> > > warnings from scripts/kernel-doc.
> > > 
> > > ../drivers/infiniband/ulp/srpt/ib_srpt.c:1146: warning: Excess
> > > function parameter 'context' description in 'srpt_abort_cmd'
> > > ../drivers/infiniband/ulp/srpt/ib_srpt.c:1482: warning: Excess
> > > function parameter 'ioctx' description in 'srpt_handle_new_iu'
> > > 
> > > Signed-off-by: Randy Dunlap <rdunlap-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
> > > Cc: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > > Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > > Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > ---
> > >   drivers/infiniband/ulp/srpt/ib_srpt.c |    4 ++--
> > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > --- linux-next-
> > > 20171222.orig/drivers/infiniband/ulp/srpt/ib_srpt.c
> > > +++ linux-next-20171222/drivers/infiniband/ulp/srpt/ib_srpt.c
> > > @@ -1139,7 +1139,6 @@ static struct srpt_send_ioctx *srpt_get_
> > >   /**
> > >    * srpt_abort_cmd() - Abort a SCSI command.
> > >    * @ioctx:   I/O context associated with the SCSI command.
> > > - * @context: Preferred execution context.
> > >    */
> > >   static int srpt_abort_cmd(struct srpt_send_ioctx *ioctx)
> > >   {
> > > @@ -1473,7 +1472,8 @@ fail:
> > >   /**
> > >    * srpt_handle_new_iu() - Process a newly received information
> > > unit.
> > >    * @ch:    RDMA channel through which the information unit has
> > > been received.
> > > - * @ioctx: SRPT I/O context associated with the information
> > > unit.
> > > + * @recv_ioctx: SRPT I/O context associated with the receive
> > > information unit.
> > > + * @send_ioctx: SRPT I/O context associated with the send
> > > information unit.
> > >    */
> > >   static void srpt_handle_new_iu(struct srpt_rdma_ch *ch,
> > >   			       struct srpt_recv_ioctx
> > > *recv_ioctx,
> > 
> > Please drop this patch. It conflicts with a patch series I'm
> > working
> > on.
> > 
> > Thanks,
> > 
> > Bart.
> 
> Hello Bart
> 
> As agreed, I pulled your tree and checked out the block-scsi-for-next
> branch. I built a kernel to test on mlx5, booted into that kernel, and
> mapped my SRP devices.
> 
> The first test I always run is a reboot after mapping the LUNs; only the
> client is running your kernel so far, not the server.
> 
> Anyway, the client panicked due to a list corruption; the capture is
> below.
> 
> I suspect you may not have seen this because you have only mlx4, not
> mlx5, in your test bed.
> 
> [  202.449161] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
> [  202.478733] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
> [  202.508986] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
> [  202.538082] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
> [  202.568329] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
> [  202.598275] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
> [  202.627607] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
> [  202.657557] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
> [  202.686773] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
> [  202.716227] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
> [  202.746555] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
> [  202.777826] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
> [  202.808770] sd 1:0:0:13: [sday] Synchronizing SCSI cache
> [  202.839954] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
> [  202.870355] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
> [  202.900917] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
> [  202.930718] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
> [  202.960734] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
> [  202.990976] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
> [  203.020733] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
> [  203.050828] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
> [  203.081566] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
> [  203.112472] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
> [  203.143305] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
> [  203.174065] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
> [  203.205173] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
> [  203.236178] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
> [  203.266446] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
> [  203.297050] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
> [  203.327570] sd 1:0:0:0: [sdah] Synchronizing SCSI cache
> [  203.357475] sd 2:0:0:1: [sdag] Synchronizing SCSI cache
> [  203.387259] sd 2:0:0:2: [sdaf] Synchronizing SCSI cache
> [  203.416950] sd 2:0:0:3: [sdae] Synchronizing SCSI cache
> [  203.447112] sd 2:0:0:4: [sdad] Synchronizing SCSI cache
> [  203.477650] sd 2:0:0:5: [sdac] Synchronizing SCSI cache
> [  203.508438] sd 2:0:0:6: [sdab] Synchronizing SCSI cache
> [  203.539018] sd 2:0:0:7: [sdaa] Synchronizing SCSI cache
> [  203.568806] sd 2:0:0:8: [sdz] Synchronizing SCSI cache
> [  203.598575] sd 2:0:0:9: [sdy] Synchronizing SCSI cache
> [  203.628063] sd 2:0:0:10: [sdx] Synchronizing SCSI cache
> [  203.658096] sd 2:0:0:11: [sdw] Synchronizing SCSI cache
> [  203.687453] sd 2:0:0:12: [sdv] Synchronizing SCSI cache
> [  203.718127] sd 2:0:0:13: [sdu] Synchronizing SCSI cache
> [  203.747953] sd 2:0:0:14: [sdt] Synchronizing SCSI cache
> [  203.777593] sd 2:0:0:15: [sds] Synchronizing SCSI cache
> [  203.808214] sd 2:0:0:16: [sdr] Synchronizing SCSI cache
> [  203.837516] sd 2:0:0:17: [sdq] Synchronizing SCSI cache
> [  203.866690] sd 2:0:0:18: [sdp] Synchronizing SCSI cache
> [  203.896013] sd 2:0:0:19: [sdo] Synchronizing SCSI cache
> [  203.925029] sd 2:0:0:20: [sdn] Synchronizing SCSI cache
> [  203.953954] sd 2:0:0:21: [sdm] Synchronizing SCSI cache
> [  203.982830] sd 2:0:0:22: [sdl] Synchronizing SCSI cache
> [  204.012713] sd 2:0:0:23: [sdk] Synchronizing SCSI cache
> [  204.043456] sd 2:0:0:24: [sdj] Synchronizing SCSI cache
> [  204.073671] sd 2:0:0:25: [sdi] Synchronizing SCSI cache
> [  204.104050] sd 2:0:0:26: [sdh] Synchronizing SCSI cache
> [  204.134239] sd 2:0:0:27: [sdg] Synchronizing SCSI cache
> [  204.164603] sd 2:0:0:28: [sdf] Synchronizing SCSI cache
> [  204.195387] sd 2:0:0:29: [sde] Synchronizing SCSI cache
> [  204.225894] sd 2:0:0:0: [sdd] Synchronizing SCSI cache
> [  204.256062] mlx5_core 0000:08:00.1: Shutdown was called
> [  204.286882] mlx5_core 0000:08:00.1:
> mlx5_cmd_force_teardown_hca:245:(pid 15875): teardown with force mode
> failed
> [  204.296810] mlx5_core 0000:08:00.1:
> mlx5_cmd_comp_handler:1445:(pid
> 1028): Command completion arrived after timeout (entry idx = 0).
> [  207.477515] mlx5_1:wait_for_async_commands:735:(pid 15875): done
> with all pending requests
> [  207.529305] sd 1:0:0:0: [sdah] Synchronizing SCSI cache
> [  207.563161] scsi 1:0:0:0: alua: Detached
> [  207.586589] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
> [  207.623036] scsi 1:0:0:29: alua: Detached
> [  207.646005] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
> [  207.690180] scsi 1:0:0:28: alua: Detached
> [  207.713360] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
> [  207.749020] scsi 1:0:0:27: alua: Detached
> [  207.771957] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
> [  207.808036] scsi 1:0:0:26: alua: Detached
> [  207.831913] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
> [  207.872192] scsi 1:0:0:25: alua: Detached
> [  207.895678] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
> [  207.931020] scsi 1:0:0:24: alua: Detached
> [  207.954279] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
> [  207.990180] scsi 1:0:0:23: alua: Detached
> [  208.013315] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
> [  208.049012] scsi 1:0:0:22: alua: Detached
> [  208.072381] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
> [  208.112041] scsi 1:0:0:21: alua: Detached
> [  208.135881] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
> [  208.176006] scsi 1:0:0:20: alua: Detached
> [  208.199316] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
> [  208.235018] scsi 1:0:0:19: alua: Detached
> [  208.257835] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
> [  208.294019] scsi 1:0:0:18: alua: Detached
> [  208.317725] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
> [  208.357016] scsi 1:0:0:17: alua: Detached
> [  208.380742] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
> [  208.417015] scsi 1:0:0:16: alua: Detached
> [  208.440017] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
> [  208.479001] scsi 1:0:0:15: alua: Detached
> [  208.501658] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
> [  208.536039] scsi 1:0:0:14: alua: Detached
> [  208.559162] sd 1:0:0:13: [sday] Synchronizing SCSI cache
> [  208.595027] scsi 1:0:0:13: alua: Detached
> [  208.618418] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
> [  208.662175] scsi 1:0:0:12: alua: Detached
> [  208.685158] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
> [  208.723993] scsi 1:0:0:11: alua: Detached
> [  208.747988] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
> [  208.787003] scsi 1:0:0:10: alua: Detached
> [  208.810841] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
> [  208.850000] scsi 1:0:0:9: alua: Detached
> [  208.873249] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
> [  208.913186] scsi 1:0:0:8: alua: Detached
> [  208.936783] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
> [  208.973192] scsi 1:0:0:7: alua: Detached
> [  208.995709] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
> [  209.031179] scsi 1:0:0:6: alua: Detached
> [  209.053746] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
> [  209.089022] scsi 1:0:0:5: alua: Detached
> [  209.112261] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
> [  209.148024] scsi 1:0:0:4: alua: Detached
> [  209.171354] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
> [  209.207014] scsi 1:0:0:3: alua: Detached
> [  209.229514] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
> [  209.267005] scsi 1:0:0:2: alua: Detached
> [  209.290867] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
> [  209.326303] scsi 1:0:0:1: alua: Detached
> [  211.376056] ib0: multicast join failed for
> ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
> [  211.439940] scsi host1: ib_srp: connection closed
> [  211.466771] scsi host1: ib_srp: connection closed
> [  211.493623] scsi host1: ib_srp: connection closed
> [  213.425511] ib0: multicast join failed for
> ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
> [  217.521341] ib0: multicast join failed for
> ff12:601b:ffff:0000:0000:0000:0000:0001, status -22
> [  220.843344] ------------[ cut here ]------------
> [  220.869309] list_add corruption. prev->next should be next
> (000000002a07d255), but was           (null).
> (prev=000000000edf5e8c).
> [  220.935392] WARNING: CPU: 1 PID: 694 at lib/list_debug.c:28
> __list_add_valid+0x6a/0x70
> [  220.979462] Modules linked in: xt_CHECKSUM iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
> nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> iscsi_target_mod target_core_mod ib_iser libiscsi
> scsi_transport_iscsi
> ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
> rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> kvm_intel
> kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si crypto_simd
> dm_service_time iTCO_wdt hpwdt iTCO_vendor_support glue_helper cryptd
> ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
> acpi_power_meter i7core_edac shpchp
> [  221.385270]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace
> sunrpc
> dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit
> drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core mlxfw
> sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw bnx2
> devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> [  221.554496] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted:
> G          I      4.15.0-rc7+ #1
> [  221.606907] Hardware name: HP ProLiant DL380 G7, BIOS P67
> 08/16/2015
> [  221.642980] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> [  221.674616] RIP: 0010:__list_add_valid+0x6a/0x70
> [  221.700561] RSP: 0018:ffffb2bdc75c7cf0 EFLAGS: 00010086
> [  221.730608] RAX: 0000000000000000 RBX: ffff94342d610880 RCX:
> ffffffff8ba62928
> [  221.771490] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
> 0000000000000046
> [  221.812721] RBP: ffff94342d6108b8 R08: 0000000000000000 R09:
> 0000000000000722
> [  221.853073] R10: 0000000000000000 R11: ffffb2bdc75c7a58 R12:
> 0000000000000200
> [  221.894156] R13: 0000000000000246 R14: ffff943fb7fd5000 R15:
> ffff943fb7fd5000
> [  221.935233] FS:  0000000000000000(0000) GS:ffff944033200000(0000)
> knlGS:0000000000000000
> [  221.980521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  222.013062] CR2: 00007f1bdc0ee910 CR3: 00000017e7e0a002 CR4:
> 00000000000206e0
> [  222.052302] Call Trace:
> [  222.065971]  ib_mad_post_receive_mads+0x177/0x310 [ib_core]
> [  222.097349]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
> [  222.124387]  __ib_process_cq+0x55/0xa0 [ib_core]
> [  222.150827]  ib_cq_poll_work+0x1b/0x60 [ib_core]
> [  222.177751]  process_one_work+0x141/0x340
> [  222.200383]  worker_thread+0x47/0x3e0
> [  222.220641]  kthread+0xf5/0x130
> [  222.238951]  ? rescuer_thread+0x380/0x380
> [  222.262034]  ? kthread_associate_blkcg+0x90/0x90
> [  222.288514]  ? do_group_exit+0x39/0xa0
> [  222.309492]  ret_from_fork+0x1f/0x30
> [  222.330073] Code: fe 31 c0 48 c7 c7 98 36 89 8b e8 02 9c cf ff 0f
> ff
> 31 c0 c3 48 89 d1 48 c7 c7 48 36 89 8b 48 89 f2 48 89 c6 31 c0 e8 e6
> 9b
> cf ff <0f> ff 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad de 48
> 8b 
> [  222.438058] ---[ end trace 5d41544bf17ab73b ]---
> [  222.465993] BUG: unable to handle kernel NULL pointer dereference
> at
> 0000000000000028
> [  222.510316] IP: ib_mad_post_receive_mads+0x3c/0x310 [ib_core]
> [  222.543188] PGD 0 P4D 0 
> [  222.557625] Oops: 0000 [#1] SMP PTI
> [  222.576674] Modules linked in: xt_CHECKSUM iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
> nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> iscsi_target_mod target_core_mod ib_iser libiscsi
> scsi_transport_iscsi
> ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
> rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> kvm_intel
> kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si crypto_simd
> dm_service_time iTCO_wdt hpwdt iTCO_vendor_support glue_helper cryptd
> ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
> acpi_power_meter i7core_edac shpchp
> [  222.981443]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace
> sunrpc
> dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit
> drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core mlxfw
> sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw bnx2
> devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
> [  223.152359] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted: G        W
> I      4.15.0-rc7+ #1
> [  223.198577] Hardware name: HP ProLiant DL380 G7, BIOS P67
> 08/16/2015
> [  223.235101] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> [  223.266750] RIP: 0010:ib_mad_post_receive_mads+0x3c/0x310
> [ib_core]
> [  223.303012] RSP: 0018:ffffb2bdc75c7cf8 EFLAGS: 00010286
> [  223.333022] RAX: 0000000000000000 RBX: ffff94342d610908 RCX:
> ffff94342d610948
> [  223.373307] RDX: 0000000000000001 RSI: ffff94342d6108c0 RDI:
> ffff94342d610908
> [  223.414451] RBP: ffff94342d610940 R08: ffff94342a8e64c0 R09:
> ffff94342a8e64e8
> [  223.454789] R10: ffff94342a8e64e8 R11: ffff94342d6109a8 R12:
> ffff944029c2e048
> [  223.496554] R13: 0000000000000000 R14: ffff94342a8e64c0 R15:
> ffff94342d6108c0
> [  223.537489] FS:  0000000000000000(0000) GS:ffff944033200000(0000)
> knlGS:0000000000000000
> [  223.583538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  223.616545] CR2: 0000000000000028 CR3: 00000017e7e0a002 CR4:
> 00000000000206e0
> [  223.657337] Call Trace:
> [  223.671022]  ? find_mad_agent+0x77/0x1b0 [ib_core]
> [  223.698581]  ? __kmalloc+0x1be/0x1f0
> [  223.719074]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
> [  223.747190]  __ib_process_cq+0x55/0xa0 [ib_core]
> [  223.774140]  ib_cq_poll_work+0x1b/0x60 [ib_core]
> [  223.800719]  process_one_work+0x141/0x340
> [  223.824120]  worker_thread+0x47/0x3e0
> [  223.845133]  kthread+0xf5/0x130
> [  223.863116]  ? rescuer_thread+0x380/0x380
> [  223.886173]  ? kthread_associate_blkcg+0x90/0x90
> [  223.912207]  ? do_group_exit+0x39/0xa0
> [  223.933198]  ret_from_fork+0x1f/0x30
> [  223.953218] Code: 55 41 54 55 48 8d 6f 38 53 48 89 fb 48 83 ec 50
> 65
> 48 8b 04 25 28 00 00 00 48 89 44 24 48 31 c0 48 8b 07 48 85 f6 48 89
> 4c
> 24 08 <48> 8b 50 28 8b 12 48 c7 44 24 28 00 00 00 00 c7 44 24 40 01
> 00 
> [  224.059985] RIP: ib_mad_post_receive_mads+0x3c/0x310 [ib_core]
> RSP:
> ffffb2bdc75c7cf8
> [  224.103994] CR2: 0000000000000028
> 
> 

Hi Bart

Just wanted to add that the panic is consistent: I rebooted with only a
single path to my SRP LUNs and hit the same panic again on reboot.

Regards
Laurence
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]             ` <1515529869.3919.4.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-09 20:51               ` Bart Van Assche
       [not found]                 ` <1515531079.2721.26.camel-Sjgp3cTcYWE@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-09 20:51 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, loberman-H+wXaHxf7aLQT0dZR+AlfA

On Tue, 2018-01-09 at 15:31 -0500, Laurence Oberman wrote:
> On Tue, 2018-01-09 at 15:15 -0500, Laurence Oberman wrote:
> > [  220.843344] ------------[ cut here ]------------
> > [ ... identical trace trimmed; full log quoted earlier in the thread ... ]
> 
> Just wanted to add that the panic is consistent, rebooted into only a
> single path to my SRP LUNS and on reboot had the same panic.

Hello Laurence,

Can you repeat your test with the following two kernels:
* v4.15-rc7 (Linus' latest).
* The for-next branch of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.

I'm asking this because the crash occurred in a code path that is not modified by
any of my patches.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                 ` <1515531079.2721.26.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-09 21:00                   ` Laurence Oberman
       [not found]                     ` <1515531652.26021.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-09 21:00 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2018-01-09 at 20:51 +0000, Bart Van Assche wrote:
> On Tue, 2018-01-09 at 15:31 -0500, Laurence Oberman wrote:
> > On Tue, 2018-01-09 at 15:15 -0500, Laurence Oberman wrote:
> > > [  220.843344] ------------[ cut here ]------------
> > > [ ... identical trace trimmed; full log quoted earlier in the thread ... ]
> > 
> > Just wanted to add that the panic is consistent, rebooted into only
> > a
> > single path to my SRP LUNS and on reboot had the same panic.
> 
> Hello Laurence,
> 
> Can you repeat your test with the following two kernels:
> * v4.15-rc7 (Linus' latest).
> * The for-next branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.
> 
> I'm asking this because the crash occurred in a code path that is not
> modified by
> any of my patches.
> 
> Thanks,
> 
> Bart.

Bart, yes, I saw that the crash was not in code touched by your
patches.

I am doing that now, although I had already tested 4.15.0-rc4 from Mike
Snitzer's tree, which only had NVMe changes in it, and did not see the
crash there. So maybe it crept into the kernels you mentioned.

It's clearly in the ib_mad_* code.

I will baseline again with the two kernels you asked me to test:
* v4.15-rc7 (Linus' latest).
* The for-next branch of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.

Back later
Regards
Laurence

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                     ` <1515531652.26021.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-09 22:40                       ` Laurence Oberman
       [not found]                         ` <1515537614.26021.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-09 22:40 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2018-01-09 at 16:00 -0500, Laurence Oberman wrote:
> On Tue, 2018-01-09 at 20:51 +0000, Bart Van Assche wrote:
> > On Tue, 2018-01-09 at 15:31 -0500, Laurence Oberman wrote:
> > > On Tue, 2018-01-09 at 15:15 -0500, Laurence Oberman wrote:
> > > > [  220.843344] ------------[ cut here ]------------
> > > > [  220.869309] list_add corruption. prev->next should be next
> > > > (000000002a07d255), but was           (null).
> > > > (prev=000000000edf5e8c).
> > > > [  220.935392] WARNING: CPU: 1 PID: 694 at lib/list_debug.c:28
> > > > __list_add_valid+0x6a/0x70
> > > > [  220.979462] Modules linked in: xt_CHECKSUM iptable_mangle
> > > > ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
> > > > nf_nat
> > > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> > > > ipt_REJECT
> > > > nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> > > > ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> > > > iscsi_target_mod target_core_mod ib_iser libiscsi
> > > > scsi_transport_iscsi
> > > > ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs
> > > > ib_umad
> > > > rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> > > > kvm_intel
> > > > kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> > > > ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si crypto_simd
> > > > dm_service_time iTCO_wdt hpwdt iTCO_vendor_support glue_helper
> > > > cryptd
> > > > ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
> > > > acpi_power_meter i7core_edac shpchp
> > > > [  221.385270]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd
> > > > grace
> > > > sunrpc
> > > > dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit
> > > > drm_kms_helper
> > > > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core
> > > > mlxfw
> > > > sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw
> > > > bnx2
> > > > devlink scsi_transport_sas dm_mirror dm_region_hash dm_log
> > > > dm_mod
> > > > [  221.554496] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted:
> > > > G          I      4.15.0-rc7+ #1
> > > > [  221.606907] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > > 08/16/2015
> > > > [  221.642980] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> > > > [  221.674616] RIP: 0010:__list_add_valid+0x6a/0x70
> > > > [  221.700561] RSP: 0018:ffffb2bdc75c7cf0 EFLAGS: 00010086
> > > > [  221.730608] RAX: 0000000000000000 RBX: ffff94342d610880 RCX:
> > > > ffffffff8ba62928
> > > > [  221.771490] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
> > > > 0000000000000046
> > > > [  221.812721] RBP: ffff94342d6108b8 R08: 0000000000000000 R09:
> > > > 0000000000000722
> > > > [  221.853073] R10: 0000000000000000 R11: ffffb2bdc75c7a58 R12:
> > > > 0000000000000200
> > > > [  221.894156] R13: 0000000000000246 R14: ffff943fb7fd5000 R15:
> > > > ffff943fb7fd5000
> > > > [  221.935233] FS:  0000000000000000(0000)
> > > > GS:ffff944033200000(0000)
> > > > knlGS:0000000000000000
> > > > [  221.980521] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > 0000000080050033
> > > > [  222.013062] CR2: 00007f1bdc0ee910 CR3: 00000017e7e0a002 CR4:
> > > > 00000000000206e0
> > > > [  222.052302] Call Trace:
> > > > [  222.065971]  ib_mad_post_receive_mads+0x177/0x310 [ib_core]
> > > > [  222.097349]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
> > > > [  222.124387]  __ib_process_cq+0x55/0xa0 [ib_core]
> > > > [  222.150827]  ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > > [  222.177751]  process_one_work+0x141/0x340
> > > > [  222.200383]  worker_thread+0x47/0x3e0
> > > > [  222.220641]  kthread+0xf5/0x130
> > > > [  222.238951]  ? rescuer_thread+0x380/0x380
> > > > [  222.262034]  ? kthread_associate_blkcg+0x90/0x90
> > > > [  222.288514]  ? do_group_exit+0x39/0xa0
> > > > [  222.309492]  ret_from_fork+0x1f/0x30
> > > > [  222.330073] Code: fe 31 c0 48 c7 c7 98 36 89 8b e8 02 9c cf
> > > > ff
> > > > 0f
> > > > ff
> > > > 31 c0 c3 48 89 d1 48 c7 c7 48 36 89 8b 48 89 f2 48 89 c6 31 c0
> > > > e8
> > > > e6
> > > > 9b
> > > > cf ff <0f> ff 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad
> > > > de
> > > > 48
> > > > 8b 
> > > > [  222.438058] ---[ end trace 5d41544bf17ab73b ]---
> > > > [  222.465993] BUG: unable to handle kernel NULL pointer
> > > > dereference
> > > > at
> > > > 0000000000000028
> > > > [  222.510316] IP: ib_mad_post_receive_mads+0x3c/0x310
> > > > [ib_core]
> > > > [  222.543188] PGD 0 P4D 0 
> > > > [  222.557625] Oops: 0000 [#1] SMP PTI
> > > > [  222.576674] Modules linked in: xt_CHECKSUM iptable_mangle
> > > > ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
> > > > nf_nat
> > > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> > > > ipt_REJECT
> > > > nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> > > > ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> > > > iscsi_target_mod target_core_mod ib_iser libiscsi
> > > > scsi_transport_iscsi
> > > > ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs
> > > > ib_umad
> > > > rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> > > > kvm_intel
> > > > kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> > > > ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si crypto_simd
> > > > dm_service_time iTCO_wdt hpwdt iTCO_vendor_support glue_helper
> > > > cryptd
> > > > ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
> > > > acpi_power_meter i7core_edac shpchp
> > > > [  222.981443]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd
> > > > grace
> > > > sunrpc
> > > > dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit
> > > > drm_kms_helper
> > > > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core
> > > > mlxfw
> > > > sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw
> > > > bnx2
> > > > devlink scsi_transport_sas dm_mirror dm_region_hash dm_log
> > > > dm_mod
> > > > [  223.152359] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted:
> > > > G        W
> > > > I      4.15.0-rc7+ #1
> > > > [  223.198577] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > > 08/16/2015
> > > > [  223.235101] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> > > > [  223.266750] RIP: 0010:ib_mad_post_receive_mads+0x3c/0x310
> > > > [ib_core]
> > > > [  223.303012] RSP: 0018:ffffb2bdc75c7cf8 EFLAGS: 00010286
> > > > [  223.333022] RAX: 0000000000000000 RBX: ffff94342d610908 RCX:
> > > > ffff94342d610948
> > > > [  223.373307] RDX: 0000000000000001 RSI: ffff94342d6108c0 RDI:
> > > > ffff94342d610908
> > > > [  223.414451] RBP: ffff94342d610940 R08: ffff94342a8e64c0 R09:
> > > > ffff94342a8e64e8
> > > > [  223.454789] R10: ffff94342a8e64e8 R11: ffff94342d6109a8 R12:
> > > > ffff944029c2e048
> > > > [  223.496554] R13: 0000000000000000 R14: ffff94342a8e64c0 R15:
> > > > ffff94342d6108c0
> > > > [  223.537489] FS:  0000000000000000(0000)
> > > > GS:ffff944033200000(0000)
> > > > knlGS:0000000000000000
> > > > [  223.583538] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > 0000000080050033
> > > > [  223.616545] CR2: 0000000000000028 CR3: 00000017e7e0a002 CR4:
> > > > 00000000000206e0
> > > > [  223.657337] Call Trace:
> > > > [  223.671022]  ? find_mad_agent+0x77/0x1b0 [ib_core]
> > > > [  223.698581]  ? __kmalloc+0x1be/0x1f0
> > > > [  223.719074]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
> > > > [  223.747190]  __ib_process_cq+0x55/0xa0 [ib_core]
> > > > [  223.774140]  ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > > [  223.800719]  process_one_work+0x141/0x340
> > > > [  223.824120]  worker_thread+0x47/0x3e0
> > > > [  223.845133]  kthread+0xf5/0x130
> > > > [  223.863116]  ? rescuer_thread+0x380/0x380
> > > > [  223.886173]  ? kthread_associate_blkcg+0x90/0x90
> > > > [  223.912207]  ? do_group_exit+0x39/0xa0
> > > > [  223.933198]  ret_from_fork+0x1f/0x30
> > > > [  223.953218] Code: 55 41 54 55 48 8d 6f 38 53 48 89 fb 48 83
> > > > ec
> > > > 50
> > > > 65
> > > > 48 8b 04 25 28 00 00 00 48 89 44 24 48 31 c0 48 8b 07 48 85 f6
> > > > 48
> > > > 89
> > > > 4c
> > > > 24 08 <48> 8b 50 28 8b 12 48 c7 44 24 28 00 00 00 00 c7 44 24
> > > > 40
> > > > 01
> > > > 00 
> > > > [  224.059985] RIP: ib_mad_post_receive_mads+0x3c/0x310
> > > > [ib_core]
> > > > RSP:
> > > > ffffb2bdc75c7cf8
> > > > [  224.103994] CR2: 0000000000000028
> > > 
> > > Just wanted to add that the panic is consistent, rebooted into
> > > only
> > > a
> > > single path to my SRP LUNS and on reboot had the same panic.
> > 
> > Hello Laurence,
> > Can you repeat your test with the following two kernels:
> > * v4.15-rc7 (Linus' latest).
> > * The for-next branch of
> > git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.
> > 
> > I'm asking this because the crash occurred in a code path that is
> > not modified by any of my patches.
> > 
> > Thanks,
> > 
> > Bart.
> 
> Bart, yep, I saw it was not in code you touched specific to your
> patches.
> 
> Doing that now, although I had already tested 4.15.0-rc4 from Mike
> Snitzer's tree that only had NVMe changes in it and did not see it.
> So maybe it crept in in the kernels you mentioned.
> 
> It's clearly in the ib_mad_xxxx code.
> 
> I will baseline again on the ones you asked me to test:
> v4.15-rc7 (Linus' latest) and the for-next branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.
> 
> Back later
> Regards
> Laurence

Interesting.

On Linus's kernel we don't panic, but we see the warning below.
I will reboot one more time to validate the same behavior and then try
the rdma tree.
I am pretty sure there are changes in that RDMA tree that piggy-back on
the below scenario to trigger the panic.
And I know, Bart, that you have the RDMA stuff pulled in to yours.

If needed I can capture a vmcore to fully triage the panic.

Rebooting.
[ 1358.714127] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
[ 1358.744171] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
[ 1358.773412] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
[ 1358.803791] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
[ 1358.833925] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
[ 1358.864175] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
[ 1358.893766] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
[ 1358.924356] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
[ 1358.954940] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
[ 1358.985734] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
[ 1359.015816] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
[ 1359.046156] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
[ 1359.076851] sd 1:0:0:13: [sday] Synchronizing SCSI cache
[ 1359.106872] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
[ 1359.137053] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
[ 1359.167544] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
[ 1359.197517] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
[ 1359.229360] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
[ 1359.258470] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
[ 1359.286950] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
[ 1359.317636] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
[ 1359.348601] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
[ 1359.379196] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
[ 1359.409689] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
[ 1359.440632] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
[ 1359.470780] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
[ 1359.501198] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
[ 1359.531820] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
[ 1359.561622] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
[ 1359.591658] sd 1:0:0:0: [sdah] Synchronizing SCSI cache
[ 1359.621801] sd 2:0:0:1: [sdag] Synchronizing SCSI cache
[ 1359.651696] sd 2:0:0:2: [sdaf] Synchronizing SCSI cache
[ 1359.681975] sd 2:0:0:3: [sdae] Synchronizing SCSI cache
[ 1359.712012] sd 2:0:0:4: [sdad] Synchronizing SCSI cache
[ 1359.741984] sd 2:0:0:5: [sdac] Synchronizing SCSI cache
[ 1359.771704] sd 2:0:0:6: [sdab] Synchronizing SCSI cache
[ 1359.801829] sd 2:0:0:7: [sdaa] Synchronizing SCSI cache
[ 1359.832076] sd 2:0:0:8: [sdz] Synchronizing SCSI cache
[ 1359.861697] sd 2:0:0:9: [sdy] Synchronizing SCSI cache
[ 1359.890470] sd 2:0:0:10: [sdx] Synchronizing SCSI cache
[ 1359.920747] sd 2:0:0:11: [sdw] Synchronizing SCSI cache
[ 1359.950125] sd 2:0:0:12: [sdv] Synchronizing SCSI cache
[ 1359.978736] sd 2:0:0:13: [sdu] Synchronizing SCSI cache
[ 1360.008490] sd 2:0:0:14: [sdt] Synchronizing SCSI cache
[ 1360.037894] sd 2:0:0:15: [sds] Synchronizing SCSI cache
[ 1360.067282] sd 2:0:0:16: [sdr] Synchronizing SCSI cache
[ 1360.095579] sd 2:0:0:17: [sdq] Synchronizing SCSI cache
[ 1360.125297] sd 2:0:0:18: [sdp] Synchronizing SCSI cache
[ 1360.154539] sd 2:0:0:19: [sdo] Synchronizing SCSI cache
[ 1360.184087] sd 2:0:0:20: [sdn] Synchronizing SCSI cache
[ 1360.213859] sd 2:0:0:21: [sdm] Synchronizing SCSI cache
[ 1360.243405] sd 2:0:0:22: [sdl] Synchronizing SCSI cache
[ 1360.272676] sd 2:0:0:23: [sdk] Synchronizing SCSI cache
[ 1360.303088] sd 2:0:0:24: [sdj] Synchronizing SCSI cache
[ 1360.332838] sd 2:0:0:25: [sdi] Synchronizing SCSI cache
[ 1360.362778] sd 2:0:0:26: [sdh] Synchronizing SCSI cache
[ 1360.392887] sd 2:0:0:27: [sdg] Synchronizing SCSI cache
[ 1360.422989] sd 2:0:0:28: [sdf] Synchronizing SCSI cache
[ 1360.452909] sd 2:0:0:29: [sde] Synchronizing SCSI cache
[ 1360.482103] sd 2:0:0:0: [sdd] Synchronizing SCSI cache
[ 1360.511682] mlx5_core 0000:08:00.1: Shutdown was called
[ 1360.550531] mlx5_core 0000:08:00.1: mlx5_enter_error_state:121:(pid
15149): start
[ 1360.593520] ------------[ cut here ]------------
[ 1360.619930] got unsolicited completion for CQ 0x0000000068694acd
[ 1360.654434] WARNING: CPU: 15 PID: 15149 at
drivers/infiniband/core/cq.c:80 ib_cq_completion_direct+0x28/0x30
[ib_core]
[ 1360.716099] Modules linked in: xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi
ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
ghash_clmulni_intel pcbc joydev aesni_intel dm_service_time ipmi_si
crypto_simd glue_helper sg hpilo cryptd hpwdt ipmi_devintf iTCO_wdt
gpio_ich acpi_power_meter iTCO_vendor_support ipmi_msghandler shpchp
pcspkr i7core_edac lpc_ich
[ 1361.120851]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace
dm_multipath sunrpc ip_tables xfs libcrc32c radeon i2c_algo_bit
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm sd_mod
drm mlx5_core mlxfw ptp serio_raw crc32c_intel i2c_core hpsa pps_core
bnx2 devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 1361.288913] CPU: 15 PID: 15149 Comm: reboot Tainted:
G          I      4.15.0-rc7 #1
[ 1361.333577] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1361.369976] RIP: 0010:ib_cq_completion_direct+0x28/0x30 [ib_core]
[ 1361.404971] RSP: 0018:ffffa08c8747fc60 EFLAGS: 00010086
[ 1361.435007] RAX: 0000000000000000 RBX: ffff8d37a6f8b468 RCX:
ffffffffae662928
[ 1361.474397] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
0000000000000046
[ 1361.515097] RBP: ffff8d2bb07e0000 R08: 0000000000000000 R09:
0000000000000717
[ 1361.555054] R10: 0000000000000000 R11: ffffa08c8747f9c8 R12:
ffff8d2ed1edc264
[ 1361.595593] R13: ffff8d37a6f8b400 R14: ffffa08c8747fca8 R15:
0000000000000083
[ 1361.635133] FS:  00007fc09956a880(0000) GS:ffff8d37b33c0000(0000)
knlGS:0000000000000000
[ 1361.681800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1361.714217] CR2: 0000000001034f80 CR3: 0000000ba0f9e005 CR4:
00000000000206e0
[ 1361.754794] Call Trace:
[ 1361.768980]  mlx5_ib_event+0x335/0x410 [mlx5_ib]
[ 1361.795303]  mlx5_core_event+0x7b/0x1a0 [mlx5_core]
[ 1361.823438]  ? synchronize_irq+0x35/0xa0
[ 1361.845962]  mlx5_enter_error_state+0xe4/0x1c0 [mlx5_core]
[ 1361.877382]  shutdown+0x127/0x170 [mlx5_core]
[ 1361.902688]  pci_device_shutdown+0x31/0x60
[ 1361.925924]  device_shutdown+0x101/0x1d0
[ 1361.948642]  kernel_restart+0xe/0x60
[ 1361.968517]  SYSC_reboot+0x1e8/0x210
[ 1361.988062]  ? __audit_syscall_entry+0xaf/0x100
[ 1362.013500]  ? syscall_trace_enter+0x1cc/0x2b0
[ 1362.038483]  ? __audit_syscall_exit+0x1ff/0x280
[ 1362.064598]  do_syscall_64+0x61/0x1a0
[ 1362.084635]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1362.111113] RIP: 0033:0x7fc098377a56
[ 1362.131668] RSP: 002b:00007ffd4b3377e8 EFLAGS: 00000206 ORIG_RAX:
00000000000000a9
[ 1362.174578] RAX: ffffffffffffffda RBX: 0000000000000004 RCX:
00007fc098377a56
[ 1362.213620] RDX: 0000000001234567 RSI: 0000000028121969 RDI:
fffffffffee1dead
[ 1362.255259] RBP: 0000000000000000 R08: 000056141a7642a0 R09:
00007ffd4b336eb0
[ 1362.296293] R10: 0000000000000024 R11: 0000000000000206 R12:
0000000000000000
[ 1362.338341] R13: 00007ffd4b337ab0 R14: 0000000000000000 R15:
0000000000000000
[ 1362.378518] Code: 00 00 00 66 66 66 66 90 80 3d 65 e1 02 00 00 74 02
f3 c3 48 89 fe 31 c0 48 c7 c7 68 58 92 c0 c6 05 4e e1 02 00 01 e8 a8 23
d8 ec <0f> ff c3 0f 1f 44 00 00 66 66 66 66 90 41 55 45 89 c5 41 54 49 
[ 1362.483962] ---[ end trace 528ee06930a5763f ]---
[ 1362.509435] mlx5_1:mlx5_ib_event:2992:(pid 15149): warning: event on
port 0
[ 1362.548716] scsi host2: ib_srp: failed RECV status WR flushed (5)
for CQE 0000000023e53497
[ 1362.595980] mlx5_core 0000:08:00.1: mlx5_enter_error_state:128:(pid
15149): end
[ 1362.637630] mlx5_core 0000:08:00.0: Shutdown was called
[ 1362.677523] mlx5_core 0000:08:00.0: mlx5_enter_error_state:121:(pid
15149): start
[ 1362.720734] mlx5_0:mlx5_ib_event:2992:(pid 15149): warning: event on
port 0
[ 1362.760795] scsi host1: ib_srp: failed RECV status WR flushed (5)
for CQE 000000009ad07e27
[ 1362.806977] mlx5_core 0000:08:00.0: mlx5_enter_error_state:128:(pid
15149): end
[ 1363.331808] reboot: Restarting system
[ 1363.349889] reboot: machine restart

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                         ` <1515537614.26021.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-10 13:42                           ` Laurence Oberman
       [not found]                             ` <1515591723.26021.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-10 13:42 UTC (permalink / raw)
  To: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Dutile, Don

On Tue, 2018-01-09 at 17:40 -0500, Laurence Oberman wrote:
> On Tue, 2018-01-09 at 16:00 -0500, Laurence Oberman wrote:
> > On Tue, 2018-01-09 at 20:51 +0000, Bart Van Assche wrote:
> > > On Tue, 2018-01-09 at 15:31 -0500, Laurence Oberman wrote:
> > > > On Tue, 2018-01-09 at 15:15 -0500, Laurence Oberman wrote:
> > > > > [  220.843344] ------------[ cut here ]------------
> > > > > [  220.869309] list_add corruption. prev->next should be next
> > > > > (000000002a07d255), but was           (null).
> > > > > (prev=000000000edf5e8c).
> > > > > [  220.935392] WARNING: CPU: 1 PID: 694 at
> > > > > lib/list_debug.c:28
> > > > > __list_add_valid+0x6a/0x70
> > > > > [  220.979462] Modules linked in: xt_CHECKSUM iptable_mangle
> > > > > ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
> > > > > nf_nat
> > > > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> > > > > ipt_REJECT
> > > > > nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> > > > > ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> > > > > iscsi_target_mod target_core_mod ib_iser libiscsi
> > > > > scsi_transport_iscsi
> > > > > ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs
> > > > > ib_umad
> > > > > rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> > > > > kvm_intel
> > > > > kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> > > > > ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si
> > > > > crypto_simd
> > > > > dm_service_time iTCO_wdt hpwdt iTCO_vendor_support
> > > > > glue_helper
> > > > > cryptd
> > > > > ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
> > > > > acpi_power_meter i7core_edac shpchp
> > > > > [  221.385270]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd
> > > > > grace
> > > > > sunrpc
> > > > > dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit
> > > > > drm_kms_helper
> > > > > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core
> > > > > mlxfw
> > > > > sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw
> > > > > bnx2
> > > > > devlink scsi_transport_sas dm_mirror dm_region_hash dm_log
> > > > > dm_mod
> > > > > [  221.554496] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted:
> > > > > G          I      4.15.0-rc7+ #1
> > > > > [  221.606907] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > > > 08/16/2015
> > > > > [  221.642980] Workqueue: ib-comp-wq ib_cq_poll_work
> > > > > [ib_core]
> > > > > [  221.674616] RIP: 0010:__list_add_valid+0x6a/0x70
> > > > > [  221.700561] RSP: 0018:ffffb2bdc75c7cf0 EFLAGS: 00010086
> > > > > [  221.730608] RAX: 0000000000000000 RBX: ffff94342d610880
> > > > > RCX:
> > > > > ffffffff8ba62928
> > > > > [  221.771490] RDX: 0000000000000001 RSI: 0000000000000082
> > > > > RDI:
> > > > > 0000000000000046
> > > > > [  221.812721] RBP: ffff94342d6108b8 R08: 0000000000000000
> > > > > R09:
> > > > > 0000000000000722
> > > > > [  221.853073] R10: 0000000000000000 R11: ffffb2bdc75c7a58
> > > > > R12:
> > > > > 0000000000000200
> > > > > [  221.894156] R13: 0000000000000246 R14: ffff943fb7fd5000
> > > > > R15:
> > > > > ffff943fb7fd5000
> > > > > [  221.935233] FS:  0000000000000000(0000)
> > > > > GS:ffff944033200000(0000)
> > > > > knlGS:0000000000000000
> > > > > [  221.980521] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > 0000000080050033
> > > > > [  222.013062] CR2: 00007f1bdc0ee910 CR3: 00000017e7e0a002
> > > > > CR4:
> > > > > 00000000000206e0
> > > > > [  222.052302] Call Trace:
> > > > > [  222.065971]  ib_mad_post_receive_mads+0x177/0x310
> > > > > [ib_core]
> > > > > [  222.097349]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
> > > > > [  222.124387]  __ib_process_cq+0x55/0xa0 [ib_core]
> > > > > [  222.150827]  ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > > > [  222.177751]  process_one_work+0x141/0x340
> > > > > [  222.200383]  worker_thread+0x47/0x3e0
> > > > > [  222.220641]  kthread+0xf5/0x130
> > > > > [  222.238951]  ? rescuer_thread+0x380/0x380
> > > > > [  222.262034]  ? kthread_associate_blkcg+0x90/0x90
> > > > > [  222.288514]  ? do_group_exit+0x39/0xa0
> > > > > [  222.309492]  ret_from_fork+0x1f/0x30
> > > > > [  222.330073] Code: fe 31 c0 48 c7 c7 98 36 89 8b e8 02 9c
> > > > > cf
> > > > > ff
> > > > > 0f
> > > > > ff
> > > > > 31 c0 c3 48 89 d1 48 c7 c7 48 36 89 8b 48 89 f2 48 89 c6 31
> > > > > c0
> > > > > e8
> > > > > e6
> > > > > 9b
> > > > > cf ff <0f> ff 31 c0 c3 90 48 8b 07 48 b9 00 01 00 00 00 00 ad
> > > > > de
> > > > > 48
> > > > > 8b 
> > > > > [  222.438058] ---[ end trace 5d41544bf17ab73b ]---
> > > > > [  222.465993] BUG: unable to handle kernel NULL pointer
> > > > > dereference
> > > > > at
> > > > > 0000000000000028
> > > > > [  222.510316] IP: ib_mad_post_receive_mads+0x3c/0x310
> > > > > [ib_core]
> > > > > [  222.543188] PGD 0 P4D 0 
> > > > > [  222.557625] Oops: 0000 [#1] SMP PTI
> > > > > [  222.576674] Modules linked in: xt_CHECKSUM iptable_mangle
> > > > > ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4
> > > > > nf_nat
> > > > > nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack
> > > > > ipt_REJECT
> > > > > nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> > > > > ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> > > > > iscsi_target_mod target_core_mod ib_iser libiscsi
> > > > > scsi_transport_iscsi
> > > > > ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs
> > > > > ib_umad
> > > > > rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> > > > > kvm_intel
> > > > > kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> > > > > ghash_clmulni_intel pcbc aesni_intel joydev ipmi_si
> > > > > crypto_simd
> > > > > dm_service_time iTCO_wdt hpwdt iTCO_vendor_support
> > > > > glue_helper
> > > > > cryptd
> > > > > ipmi_devintf sg gpio_ich pcspkr hpilo ipmi_msghandler lpc_ich
> > > > > acpi_power_meter i7core_edac shpchp
> > > > > [  222.981443]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd
> > > > > grace
> > > > > sunrpc
> > > > > dm_multipath ip_tables xfs libcrc32c radeon i2c_algo_bit
> > > > > drm_kms_helper
> > > > > syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core
> > > > > mlxfw
> > > > > sd_mod drm ptp hpsa pps_core crc32c_intel i2c_core serio_raw
> > > > > bnx2
> > > > > devlink scsi_transport_sas dm_mirror dm_region_hash dm_log
> > > > > dm_mod
> > > > > [  223.152359] CPU: 1 PID: 694 Comm: kworker/1:1H Tainted:
> > > > > G        W
> > > > > I      4.15.0-rc7+ #1
> > > > > [  223.198577] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > > > 08/16/2015
> > > > > [  223.235101] Workqueue: ib-comp-wq ib_cq_poll_work
> > > > > [ib_core]
> > > > > [  223.266750] RIP: 0010:ib_mad_post_receive_mads+0x3c/0x310
> > > > > [ib_core]
> > > > > [  223.303012] RSP: 0018:ffffb2bdc75c7cf8 EFLAGS: 00010286
> > > > > [  223.333022] RAX: 0000000000000000 RBX: ffff94342d610908
> > > > > RCX:
> > > > > ffff94342d610948
> > > > > [  223.373307] RDX: 0000000000000001 RSI: ffff94342d6108c0
> > > > > RDI:
> > > > > ffff94342d610908
> > > > > [  223.414451] RBP: ffff94342d610940 R08: ffff94342a8e64c0
> > > > > R09:
> > > > > ffff94342a8e64e8
> > > > > [  223.454789] R10: ffff94342a8e64e8 R11: ffff94342d6109a8
> > > > > R12:
> > > > > ffff944029c2e048
> > > > > [  223.496554] R13: 0000000000000000 R14: ffff94342a8e64c0
> > > > > R15:
> > > > > ffff94342d6108c0
> > > > > [  223.537489] FS:  0000000000000000(0000)
> > > > > GS:ffff944033200000(0000)
> > > > > knlGS:0000000000000000
> > > > > [  223.583538] CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > 0000000080050033
> > > > > [  223.616545] CR2: 0000000000000028 CR3: 00000017e7e0a002
> > > > > CR4:
> > > > > 00000000000206e0
> > > > > [  223.657337] Call Trace:
> > > > > [  223.671022]  ? find_mad_agent+0x77/0x1b0 [ib_core]
> > > > > [  223.698581]  ? __kmalloc+0x1be/0x1f0
> > > > > [  223.719074]  ib_mad_recv_done+0x471/0x9c0 [ib_core]
> > > > > [  223.747190]  __ib_process_cq+0x55/0xa0 [ib_core]
> > > > > [  223.774140]  ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > > > [  223.800719]  process_one_work+0x141/0x340
> > > > > [  223.824120]  worker_thread+0x47/0x3e0
> > > > > [  223.845133]  kthread+0xf5/0x130
> > > > > [  223.863116]  ? rescuer_thread+0x380/0x380
> > > > > [  223.886173]  ? kthread_associate_blkcg+0x90/0x90
> > > > > [  223.912207]  ? do_group_exit+0x39/0xa0
> > > > > [  223.933198]  ret_from_fork+0x1f/0x30
> > > > > [  223.953218] Code: 55 41 54 55 48 8d 6f 38 53 48 89 fb 48
> > > > > 83
> > > > > ec
> > > > > 50
> > > > > 65
> > > > > 48 8b 04 25 28 00 00 00 48 89 44 24 48 31 c0 48 8b 07 48 85
> > > > > f6
> > > > > 48
> > > > > 89
> > > > > 4c
> > > > > 24 08 <48> 8b 50 28 8b 12 48 c7 44 24 28 00 00 00 00 c7 44 24
> > > > > 40
> > > > > 01
> > > > > 00 
> > > > > [  224.059985] RIP: ib_mad_post_receive_mads+0x3c/0x310
> > > > > [ib_core]
> > > > > RSP:
> > > > > ffffb2bdc75c7cf8
> > > > > [  224.103994] CR2: 0000000000000028
> > > > 
> > > > Just wanted to add that the panic is consistent, rebooted into
> > > > only
> > > > a
> > > > single path to my SRP LUNS and on reboot had the same panic.
> > > 
> > > Hello Laurence,
> > > Can you repeat your test with the following two kernels:
> > > * v4.15-rc7 (Linus' latest).
> > > * The for-next branch of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.
> > > 
> > > I'm asking this because the crash occurred in a code path that is
> > > not modified by any of my patches.
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > Bart, yep, I saw it was not in code you touched specific to your
> > patches.
> > 
> > Doing that now, although I had already tested 4.15.0-rc4 from Mike
> > Snitzer's tree that only had NVMe changes in it and did not see it.
> > So maybe it crept in in the kernels you mentioned.
> > 
> > It's clearly in the ib_mad_xxxx code.
> > 
> > I will baseline again on the ones you asked me to test:
> > v4.15-rc7 (Linus' latest) and the for-next branch of
> > git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.
> > 
> > Back later
> > Regards
> > Laurence
> 
> Interesting.
> 
> On Linus's kernel we don't panic, but we see the warning below.
> I will reboot one more time to validate the same behavior and then
> try the rdma tree.
> I am pretty sure there are changes in that RDMA tree that piggy-back
> on the below scenario to trigger the panic.
> And I know, Bart, that you have the RDMA stuff pulled in to yours.
> 
> If needed I can capture a vmcore to fully triage the panic.
> 
> Rebooting.
> [ 1358.714127] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
> [ 1358.744171] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
> [ 1358.773412] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
> [ 1358.803791] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
> [ 1358.833925] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
> [ 1358.864175] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
> [ 1358.893766] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
> [ 1358.924356] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
> [ 1358.954940] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
> [ 1358.985734] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
> [ 1359.015816] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
> [ 1359.046156] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
> [ 1359.076851] sd 1:0:0:13: [sday] Synchronizing SCSI cache
> [ 1359.106872] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
> [ 1359.137053] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
> [ 1359.167544] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
> [ 1359.197517] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
> [ 1359.229360] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
> [ 1359.258470] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
> [ 1359.286950] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
> [ 1359.317636] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
> [ 1359.348601] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
> [ 1359.379196] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
> [ 1359.409689] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
> [ 1359.440632] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
> [ 1359.470780] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
> [ 1359.501198] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
> [ 1359.531820] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
> [ 1359.561622] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
> [ 1359.591658] sd 1:0:0:0: [sdah] Synchronizing SCSI cache
> [ 1359.621801] sd 2:0:0:1: [sdag] Synchronizing SCSI cache
> [ 1359.651696] sd 2:0:0:2: [sdaf] Synchronizing SCSI cache
> [ 1359.681975] sd 2:0:0:3: [sdae] Synchronizing SCSI cache
> [ 1359.712012] sd 2:0:0:4: [sdad] Synchronizing SCSI cache
> [ 1359.741984] sd 2:0:0:5: [sdac] Synchronizing SCSI cache
> [ 1359.771704] sd 2:0:0:6: [sdab] Synchronizing SCSI cache
> [ 1359.801829] sd 2:0:0:7: [sdaa] Synchronizing SCSI cache
> [ 1359.832076] sd 2:0:0:8: [sdz] Synchronizing SCSI cache
> [ 1359.861697] sd 2:0:0:9: [sdy] Synchronizing SCSI cache
> [ 1359.890470] sd 2:0:0:10: [sdx] Synchronizing SCSI cache
> [ 1359.920747] sd 2:0:0:11: [sdw] Synchronizing SCSI cache
> [ 1359.950125] sd 2:0:0:12: [sdv] Synchronizing SCSI cache
> [ 1359.978736] sd 2:0:0:13: [sdu] Synchronizing SCSI cache
> [ 1360.008490] sd 2:0:0:14: [sdt] Synchronizing SCSI cache
> [ 1360.037894] sd 2:0:0:15: [sds] Synchronizing SCSI cache
> [ 1360.067282] sd 2:0:0:16: [sdr] Synchronizing SCSI cache
> [ 1360.095579] sd 2:0:0:17: [sdq] Synchronizing SCSI cache
> [ 1360.125297] sd 2:0:0:18: [sdp] Synchronizing SCSI cache
> [ 1360.154539] sd 2:0:0:19: [sdo] Synchronizing SCSI cache
> [ 1360.184087] sd 2:0:0:20: [sdn] Synchronizing SCSI cache
> [ 1360.213859] sd 2:0:0:21: [sdm] Synchronizing SCSI cache
> [ 1360.243405] sd 2:0:0:22: [sdl] Synchronizing SCSI cache
> [ 1360.272676] sd 2:0:0:23: [sdk] Synchronizing SCSI cache
> [ 1360.303088] sd 2:0:0:24: [sdj] Synchronizing SCSI cache
> [ 1360.332838] sd 2:0:0:25: [sdi] Synchronizing SCSI cache
> [ 1360.362778] sd 2:0:0:26: [sdh] Synchronizing SCSI cache
> [ 1360.392887] sd 2:0:0:27: [sdg] Synchronizing SCSI cache
> [ 1360.422989] sd 2:0:0:28: [sdf] Synchronizing SCSI cache
> [ 1360.452909] sd 2:0:0:29: [sde] Synchronizing SCSI cache
> [ 1360.482103] sd 2:0:0:0: [sdd] Synchronizing SCSI cache
> [ 1360.511682] mlx5_core 0000:08:00.1: Shutdown was called
> [ 1360.550531] mlx5_core 0000:08:00.1:
> mlx5_enter_error_state:121:(pid
> 15149): start
> [ 1360.593520] ------------[ cut here ]------------
> [ 1360.619930] got unsolicited completion for CQ 0x0000000068694acd
> [ 1360.654434] WARNING: CPU: 15 PID: 15149 at
> drivers/infiniband/core/cq.c:80 ib_cq_completion_direct+0x28/0x30
> [ib_core]
> [ 1360.716099] Modules linked in: xt_CHECKSUM iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
> nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
> ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
> iscsi_target_mod target_core_mod ib_iser libiscsi
> scsi_transport_iscsi
> ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
> rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> kvm_intel
> kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
> ghash_clmulni_intel pcbc joydev aesni_intel dm_service_time ipmi_si
> crypto_simd glue_helper sg hpilo cryptd hpwdt ipmi_devintf iTCO_wdt
> gpio_ich acpi_power_meter iTCO_vendor_support ipmi_msghandler shpchp
> pcspkr i7core_edac lpc_ich
> [ 1361.120851]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace
> dm_multipath sunrpc ip_tables xfs libcrc32c radeon i2c_algo_bit
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm
> sd_mod
> drm mlx5_core mlxfw ptp serio_raw crc32c_intel i2c_core hpsa pps_core
> bnx2 devlink scsi_transport_sas dm_mirror dm_region_hash dm_log
> dm_mod
> [ 1361.288913] CPU: 15 PID: 15149 Comm: reboot Tainted:
> G          I      4.15.0-rc7 #1
> [ 1361.333577] Hardware name: HP ProLiant DL380 G7, BIOS P67
> 08/16/2015
> [ 1361.369976] RIP: 0010:ib_cq_completion_direct+0x28/0x30 [ib_core]
> [ 1361.404971] RSP: 0018:ffffa08c8747fc60 EFLAGS: 00010086
> [ 1361.435007] RAX: 0000000000000000 RBX: ffff8d37a6f8b468 RCX:
> ffffffffae662928
> [ 1361.474397] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
> 0000000000000046
> [ 1361.515097] RBP: ffff8d2bb07e0000 R08: 0000000000000000 R09:
> 0000000000000717
> [ 1361.555054] R10: 0000000000000000 R11: ffffa08c8747f9c8 R12:
> ffff8d2ed1edc264
> [ 1361.595593] R13: ffff8d37a6f8b400 R14: ffffa08c8747fca8 R15:
> 0000000000000083
> [ 1361.635133] FS:  00007fc09956a880(0000) GS:ffff8d37b33c0000(0000)
> knlGS:0000000000000000
> [ 1361.681800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1361.714217] CR2: 0000000001034f80 CR3: 0000000ba0f9e005 CR4:
> 00000000000206e0
> [ 1361.754794] Call Trace:
> [ 1361.768980]  mlx5_ib_event+0x335/0x410 [mlx5_ib]
> [ 1361.795303]  mlx5_core_event+0x7b/0x1a0 [mlx5_core]
> [ 1361.823438]  ? synchronize_irq+0x35/0xa0
> [ 1361.845962]  mlx5_enter_error_state+0xe4/0x1c0 [mlx5_core]
> [ 1361.877382]  shutdown+0x127/0x170 [mlx5_core]
> [ 1361.902688]  pci_device_shutdown+0x31/0x60
> [ 1361.925924]  device_shutdown+0x101/0x1d0
> [ 1361.948642]  kernel_restart+0xe/0x60
> [ 1361.968517]  SYSC_reboot+0x1e8/0x210
> [ 1361.988062]  ? __audit_syscall_entry+0xaf/0x100
> [ 1362.013500]  ? syscall_trace_enter+0x1cc/0x2b0
> [ 1362.038483]  ? __audit_syscall_exit+0x1ff/0x280
> [ 1362.064598]  do_syscall_64+0x61/0x1a0
> [ 1362.084635]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 1362.111113] RIP: 0033:0x7fc098377a56
> [ 1362.131668] RSP: 002b:00007ffd4b3377e8 EFLAGS: 00000206 ORIG_RAX:
> 00000000000000a9
> [ 1362.174578] RAX: ffffffffffffffda RBX: 0000000000000004 RCX:
> 00007fc098377a56
> [ 1362.213620] RDX: 0000000001234567 RSI: 0000000028121969 RDI:
> fffffffffee1dead
> [ 1362.255259] RBP: 0000000000000000 R08: 000056141a7642a0 R09:
> 00007ffd4b336eb0
> [ 1362.296293] R10: 0000000000000024 R11: 0000000000000206 R12:
> 0000000000000000
> [ 1362.338341] R13: 00007ffd4b337ab0 R14: 0000000000000000 R15:
> 0000000000000000
> [ 1362.378518] Code: 00 00 00 66 66 66 66 90 80 3d 65 e1 02 00 00 74
> 02
> f3 c3 48 89 fe 31 c0 48 c7 c7 68 58 92 c0 c6 05 4e e1 02 00 01 e8 a8
> 23
> d8 ec <0f> ff c3 0f 1f 44 00 00 66 66 66 66 90 41 55 45 89 c5 41 54
> 49 
> [ 1362.483962] ---[ end trace 528ee06930a5763f ]---
> [ 1362.509435] mlx5_1:mlx5_ib_event:2992:(pid 15149): warning: event
> on
> port 0
> [ 1362.548716] scsi host2: ib_srp: failed RECV status WR flushed (5)
> for CQE 0000000023e53497
> [ 1362.595980] mlx5_core 0000:08:00.1:
> mlx5_enter_error_state:128:(pid
> 15149): end
> [ 1362.637630] mlx5_core 0000:08:00.0: Shutdown was called
> [ 1362.677523] mlx5_core 0000:08:00.0:
> mlx5_enter_error_state:121:(pid
> 15149): start
> [ 1362.720734] mlx5_0:mlx5_ib_event:2992:(pid 15149): warning: event
> on
> port 0
> [ 1362.760795] scsi host1: ib_srp: failed RECV status WR flushed (5)
> for CQE 000000009ad07e27
> [ 1362.806977] mlx5_core 0000:08:00.0:
> mlx5_enter_error_state:128:(pid
> 15149): end
> [ 1363.331808] reboot: Restarting system
> [ 1363.349889] reboot: machine restart


Hello Bart,

Confirmed the panic in 4.15.0-rc2.rdma+.

This kernel is built off the for-next branch of
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git.

Leon and the RDMA folks, can you look into this so I can avoid a
bisect, please?

The snippet below seems to be the important part:

[  938.938946] mlx5_core 0000:08:00.1: Shutdown was called
[  938.968423] mlx5_core 0000:08:00.1:
mlx5_cmd_force_teardown_hca:245:(pid 14752): teardown with force mode
failed
[  938.978359] mlx5_core 0000:08:00.1: mlx5_cmd_comp_handler:1445:(pid
13186): Command completion arrived after timeout (entry idx = 0).
[  942.209464] mlx5_1:wait_for_async_commands:735:(pid 14752): done
with all pending requests

Rebooting.
[  937.142748] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
[  937.173076] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
[  937.203855] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
[  937.234117] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
[  937.264894] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
[  937.295257] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
[  937.325107] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
[  937.354969] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
[  937.385332] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
[  937.414118] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
[  937.444397] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
[  937.473847] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
[  937.504494] sd 1:0:0:13: [sday] Synchronizing SCSI cache
[  937.535032] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
[  937.565076] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
[  937.596843] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
[  937.627773] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
[  937.657933] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
[  937.688828] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
[  937.720082] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
[  937.750818] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
[  937.781396] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
[  937.811480] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
[  937.841955] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
[  937.872095] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
[  937.902674] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
[  937.932741] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
[  937.962074] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
[  937.991450] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
[  938.020907] sd 2:0:0:1: [sdah] Synchronizing SCSI cache
[  938.050879] sd 2:0:0:2: [sdag] Synchronizing SCSI cache
[  938.080505] sd 2:0:0:3: [sdaf] Synchronizing SCSI cache
[  938.110263] sd 2:0:0:4: [sdae] Synchronizing SCSI cache
[  938.139903] sd 2:0:0:5: [sdad] Synchronizing SCSI cache
[  938.169782] sd 2:0:0:6: [sdac] Synchronizing SCSI cache
[  938.199754] sd 2:0:0:7: [sdab] Synchronizing SCSI cache
[  938.230013] sd 2:0:0:8: [sdaa] Synchronizing SCSI cache
[  938.259811] sd 2:0:0:9: [sdz] Synchronizing SCSI cache
[  938.289633] sd 2:0:0:10: [sdy] Synchronizing SCSI cache
[  938.318870] sd 2:0:0:11: [sdx] Synchronizing SCSI cache
[  938.348961] sd 2:0:0:12: [sdw] Synchronizing SCSI cache
[  938.378774] sd 2:0:0:13: [sdv] Synchronizing SCSI cache
[  938.408485] sd 2:0:0:14: [sdu] Synchronizing SCSI cache
[  938.438075] sd 2:0:0:15: [sdt] Synchronizing SCSI cache
[  938.466951] sd 2:0:0:16: [sds] Synchronizing SCSI cache
[  938.496511] sd 2:0:0:17: [sdr] Synchronizing SCSI cache
[  938.526718] sd 2:0:0:18: [sdq] Synchronizing SCSI cache
[  938.556687] sd 2:0:0:19: [sdp] Synchronizing SCSI cache
[  938.585200] sd 2:0:0:20: [sdo] Synchronizing SCSI cache
[  938.614874] sd 2:0:0:21: [sdn] Synchronizing SCSI cache
[  938.644621] sd 2:0:0:22: [sdm] Synchronizing SCSI cache
[  938.674399] sd 2:0:0:23: [sdl] Synchronizing SCSI cache
[  938.704763] sd 2:0:0:24: [sdk] Synchronizing SCSI cache
[  938.734895] sd 2:0:0:25: [sdj] Synchronizing SCSI cache
[  938.765294] sd 2:0:0:26: [sdi] Synchronizing SCSI cache
[  938.794818] sd 2:0:0:27: [sdh] Synchronizing SCSI cache
[  938.824447] sd 2:0:0:28: [sdg] Synchronizing SCSI cache
[  938.853212] sd 2:0:0:29: [sdf] Synchronizing SCSI cache
[  938.881769] sd 2:0:0:0: [sde] Synchronizing SCSI cache
[  938.910640] sd 1:0:0:0: [sdd] Synchronizing SCSI cache
[  938.938946] mlx5_core 0000:08:00.1: Shutdown was called
[  938.968423] mlx5_core 0000:08:00.1:
mlx5_cmd_force_teardown_hca:245:(pid 14752): teardown with force mode
failed
[  938.978359] mlx5_core 0000:08:00.1: mlx5_cmd_comp_handler:1445:(pid
13186): Command completion arrived after timeout (entry idx = 0).
[  942.209464] mlx5_1:wait_for_async_commands:735:(pid 14752): done
with all pending requests
[  942.259812] sd 1:0:0:0: [sdd] Synchronizing SCSI cache
[  942.294448] scsi 1:0:0:0: alua: Detached
[  942.317433] sd 1:0:0:29: [sdai] Synchronizing SCSI cache
[  942.355461] scsi 1:0:0:29: alua: Detached
[  942.379602] sd 1:0:0:28: [sdaj] Synchronizing SCSI cache
[  942.418441] scsi 1:0:0:28: alua: Detached
[  942.440965] sd 1:0:0:27: [sdak] Synchronizing SCSI cache
[  942.479447] scsi 1:0:0:27: alua: Detached
[  942.502351] sd 1:0:0:26: [sdal] Synchronizing SCSI cache
[  942.537745] scsi 1:0:0:26: alua: Detached
[  942.561479] sd 1:0:0:25: [sdam] Synchronizing SCSI cache
[  942.599444] scsi 1:0:0:25: alua: Detached
[  942.623153] sd 1:0:0:24: [sdan] Synchronizing SCSI cache
[  942.659633] scsi 1:0:0:24: alua: Detached
[  942.682904] sd 1:0:0:23: [sdao] Synchronizing SCSI cache
[  942.722444] scsi 1:0:0:23: alua: Detached
[  942.745058] sd 1:0:0:22: [sdap] Synchronizing SCSI cache
[  942.780644] scsi 1:0:0:22: alua: Detached
[  942.803690] sd 1:0:0:21: [sdaq] Synchronizing SCSI cache
[  942.839647] scsi 1:0:0:21: alua: Detached
[  942.863364] sd 1:0:0:20: [sdar] Synchronizing SCSI cache
[  942.899617] scsi 1:0:0:20: alua: Detached
[  942.922661] sd 1:0:0:19: [sdas] Synchronizing SCSI cache
[  942.957640] scsi 1:0:0:19: alua: Detached
[  942.981039] sd 1:0:0:18: [sdat] Synchronizing SCSI cache
[  943.016637] scsi 1:0:0:18: alua: Detached
[  943.040163] sd 1:0:0:17: [sdau] Synchronizing SCSI cache
[  943.075648] scsi 1:0:0:17: alua: Detached
[  943.099057] sd 1:0:0:16: [sdav] Synchronizing SCSI cache
[  943.135627] scsi 1:0:0:16: alua: Detached
[  943.159647] sd 1:0:0:15: [sdaw] Synchronizing SCSI cache
[  943.199447] scsi 1:0:0:15: alua: Detached
[  943.222318] sd 1:0:0:14: [sdax] Synchronizing SCSI cache
[  943.256648] scsi 1:0:0:14: alua: Detached
[  943.279739] sd 1:0:0:13: [sday] Synchronizing SCSI cache
[  943.319442] scsi 1:0:0:13: alua: Detached
[  943.341975] sd 1:0:0:12: [sdaz] Synchronizing SCSI cache
[  943.377454] scsi 1:0:0:12: alua: Detached
[  943.400574] sd 1:0:0:11: [sdba] Synchronizing SCSI cache
[  943.436438] scsi 1:0:0:11: alua: Detached
[  943.459168] sd 1:0:0:10: [sdbb] Synchronizing SCSI cache
[  943.495649] scsi 1:0:0:10: alua: Detached
[  943.518395] sd 1:0:0:9: [sdbc] Synchronizing SCSI cache
[  943.554455] scsi 1:0:0:9: alua: Detached
[  943.577524] sd 1:0:0:8: [sdbd] Synchronizing SCSI cache
[  943.617643] scsi 1:0:0:8: alua: Detached
[  943.640599] sd 1:0:0:7: [sdbe] Synchronizing SCSI cache
[  943.676596] scsi 1:0:0:7: alua: Detached
[  943.699790] sd 1:0:0:6: [sdbf] Synchronizing SCSI cache
[  943.737440] scsi 1:0:0:6: alua: Detached
[  943.760309] sd 1:0:0:5: [sdbg] Synchronizing SCSI cache
[  943.796634] scsi 1:0:0:5: alua: Detached
[  943.819456] sd 1:0:0:4: [sdbh] Synchronizing SCSI cache
[  943.854634] scsi 1:0:0:4: alua: Detached
[  943.877433] sd 1:0:0:3: [sdbi] Synchronizing SCSI cache
[  943.914621] scsi 1:0:0:3: alua: Detached
[  943.938146] sd 1:0:0:2: [sdbj] Synchronizing SCSI cache
[  943.973712] scsi 1:0:0:2: alua: Detached
[  943.995848] sd 1:0:0:1: [sdbk] Synchronizing SCSI cache
[  944.029648] scsi 1:0:0:1: alua: Detached
[  946.135367] scsi host1: ib_srp: connection closed
[  946.159601] scsi host1: ib_srp: connection closed
[  946.185789] scsi host1: ib_srp: connection closed
[  946.647514] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[  946.691954] BUG: unable to handle kernel paging request at
00000000a2129b93
[  946.731023] IP: 0xffff9dcd6684dfc0
[  946.749587] PGD 1346a3b067 P4D 1346a3b067 PUD 8000000b000001e3 
[  946.783502] Oops: 0011 [#1] SMP
[  946.800543] Modules linked in: xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi
ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel
kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel dm_service_time joydev hpilo pcbc sg ipmi_si
aesni_intel ipmi_devintf crypto_simd glue_helper gpio_ich cryptd hpwdt
ipmi_msghandler iTCO_wdt acpi_power_meter iTCO_vendor_support
i7core_edac shpchp pcspkr lpc_ich
[  947.201595]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd dm_multipath
grace sunrpc ip_tables xfs libcrc32c radeon i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core mlxfw
sd_mod drm ptp pps_core i2c_core crc32c_intel hpsa serio_raw devlink
bnx2 scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[  947.368091] CPU: 0 PID: 832 Comm: kworker/0:1H Tainted:
G          I      4.15.0-rc2.rdma+ #1
[  947.416086] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[  947.452642] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[  947.484966] task: 00000000f16afaf6 task.stack: 000000000a5a26e0
[  947.519275] RIP: 0010:0xffff9dcd6684dfc0
[  947.541610] RSP: 0018:ffffbf04072d3e28 EFLAGS: 00010282
[  947.571795] RAX: ffff9dce20c57a10 RBX: 0000000000000048 RCX:
0000000000000000
[  947.612745] RDX: ffff9dce2facd500 RSI: ffff9dda2a4d6848 RDI:
ffff9dce2e07a800
[  947.652788] RBP: 0000000000000090 R08: ffff9dce2facd4c8 R09:
ffff9dce2facd4c8
[  947.693568] R10: 0000000000000000 R11: ffff9dce2e07a9d0 R12:
0000000000000002
[  947.733332] R13: 0000000000000000 R14: 0000000000010000 R15:
ffff9dce2e07a800
[  947.772737] FS:  0000000000000000(0000) GS:ffff9dce37a00000(0000)
knlGS:0000000000000000
[  947.817595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  947.850061] CR2: ffff9dcd6684dfc0 CR3: 0000001346009002 CR4:
00000000000206f0
[  947.889552] Call Trace:
[  947.903724]  ? __ib_process_cq+0x55/0xa0 [ib_core]
[  947.931179]  ? ib_cq_poll_work+0x1b/0x60 [ib_core]
[  947.958153]  ? process_one_work+0x141/0x340
[  947.981362]  ? worker_thread+0x47/0x3e0
[  948.002102]  ? kthread+0xf5/0x130
[  948.020538]  ? rescuer_thread+0x380/0x380
[  948.043180]  ? kthread_associate_blkcg+0x90/0x90
[  948.070184]  ? ret_from_fork+0x1f/0x30
[  948.091250] Code: 00 00 00 10 40 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 40 4c cb bc cd 9d ff ff 00 00 00 00 00 00
00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
[  948.199700] RIP: 0xffff9dcd6684dfc0 RSP: ffffbf04072d3e28
[  948.229734] CR2: ffff9dcd6684dfc0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                             ` <1515591723.26021.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-10 18:26                               ` Jason Gunthorpe
       [not found]                                 ` <20180110182648.GI4518-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-01-10 18:26 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dutile, Don

On Wed, Jan 10, 2018 at 08:42:03AM -0500, Laurence Oberman wrote:

> [  946.647514] kernel tried to execute NX-protected page - exploit
> attempt? (uid: 0)
> [  946.691954] BUG: unable to handle kernel paging request at
> 00000000a2129b93

> [  947.889552] Call Trace:
> [  947.903724]  ? __ib_process_cq+0x55/0xa0 [ib_core]
> [  947.931179]  ? ib_cq_poll_work+0x1b/0x60 [ib_core]
> [  947.958153]  ? process_one_work+0x141/0x340
> [  947.981362]  ? worker_thread+0x47/0x3e0
> [  948.002102]  ? kthread+0xf5/0x130
> [  948.020538]  ? rescuer_thread+0x380/0x380
> [  948.043180]  ? kthread_associate_blkcg+0x90/0x90
> [  948.070184]  ? ret_from_fork+0x1f/0x30

These oopses you have strongly suggest that ib_wc->wr_cqe
is garbage.

Did SRP somehow free its wr_cqe data before the completion fired?

Turn on slab poisoning to confirm?

Jason

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                 ` <20180110182648.GI4518-uk2M96/98Pc@public.gmane.org>
@ 2018-01-10 18:40                                   ` Bart Van Assche
       [not found]                                     ` <1515609623.2745.20.camel-Sjgp3cTcYWE@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-10 18:40 UTC (permalink / raw)
  To: jgg-uk2M96/98Pc, loberman-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 11:26 -0700, Jason Gunthorpe wrote:
> On Wed, Jan 10, 2018 at 08:42:03AM -0500, Laurence Oberman wrote:
> 
> > [  946.647514] kernel tried to execute NX-protected page - exploit
> > attempt? (uid: 0)
> > [  946.691954] BUG: unable to handle kernel paging request at
> > 00000000a2129b93
> > [  947.889552] Call Trace:
> > [  947.903724]  ? __ib_process_cq+0x55/0xa0 [ib_core]
> > [  947.931179]  ? ib_cq_poll_work+0x1b/0x60 [ib_core]
> > [  947.958153]  ? process_one_work+0x141/0x340
> > [  947.981362]  ? worker_thread+0x47/0x3e0
> > [  948.002102]  ? kthread+0xf5/0x130
> > [  948.020538]  ? rescuer_thread+0x380/0x380
> > [  948.043180]  ? kthread_associate_blkcg+0x90/0x90
> > [  948.070184]  ? ret_from_fork+0x1f/0x30
> 
> These oopses you have strongly suggest that ib_wc->wr_cqe
> is garbage.
> 
> Did SRP somehow free its wr_cqe data before the completion fired?
> 
> Turn on slab poisoning to confirm?

Hello Jason,

It's easy to see in drivers/infiniband/core/cq.c that polling is stopped
before a completion queue is destroyed (see also the cancel_work_sync(&cq->work)
and the cq->device->destroy_cq(cq) calls in ib_free_cq()).

BTW, I run all my tests with SLAB poisoning enabled. My SRP tests pass if I run
the SRP initiator and target drivers on top of the mlx4 and rdma_rxe drivers.

Bart.


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                     ` <1515609623.2745.20.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-10 18:59                                       ` Laurence Oberman
       [not found]                                         ` <1515610750.10153.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-01-10 19:17                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-10 18:59 UTC (permalink / raw)
  To: Bart Van Assche, jgg-uk2M96/98Pc
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 18:40 +0000, Bart Van Assche wrote:
> On Wed, 2018-01-10 at 11:26 -0700, Jason Gunthorpe wrote:
> > On Wed, Jan 10, 2018 at 08:42:03AM -0500, Laurence Oberman wrote:
> > 
> > > [  946.647514] kernel tried to execute NX-protected page -
> > > exploit
> > > attempt? (uid: 0)
> > > [  946.691954] BUG: unable to handle kernel paging request at
> > > 00000000a2129b93
> > > [  947.889552] Call Trace:
> > > [  947.903724]  ? __ib_process_cq+0x55/0xa0 [ib_core]
> > > [  947.931179]  ? ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > [  947.958153]  ? process_one_work+0x141/0x340
> > > [  947.981362]  ? worker_thread+0x47/0x3e0
> > > [  948.002102]  ? kthread+0xf5/0x130
> > > [  948.020538]  ? rescuer_thread+0x380/0x380
> > > [  948.043180]  ? kthread_associate_blkcg+0x90/0x90
> > > [  948.070184]  ? ret_from_fork+0x1f/0x30
> > 
> > These oopses you have strongly suggest that ib_wc->wr_cqe
> > is garbage.
> > 
> > Did SRP somehow free its wr_cqe data before the completion fired?
> > 
> > Turn on slab poisoning to confirm?
> 
> Hello Jason,
> 
> It's easy to see in drivers/infiniband/core/cq.c that polling is
> stopped
> before a completion queue is destroyed (see also the
> cancel_work_sync(&cq->work)
> and the cq->device->destroy_cq(cq) calls in ib_free_cq()).
> 
> BTW, I run all my tests with SLAB poisoning enabled. My SRP tests
> pass if I run
> the SRP initiator and target drivers on top of the mlx4 and rdma_rxe
> drivers.
> 
> Bart.

Hi Jason

Yep, this seems specific to mlx5 and IB.
The problem, though, is that Linus's tree at 4.15-rc7 already contains
enough of the RDMA updates to show these issues.

With his tree I don't panic, but I see this:

[ 1360.511682] mlx5_core 0000:08:00.1: Shutdown was called
[ 1360.550531] mlx5_core 0000:08:00.1: mlx5_enter_error_state:121:(pid
15149): start
[ 1360.593520] ------------[ cut here ]------------
[ 1360.619930] got unsolicited completion for CQ 0x0000000068694acd
[ 1360.654434] WARNING: CPU: 15 PID: 15149 at
drivers/infiniband/core/cq.c:80 ib_cq_completion_direct+0x28/0x30
[ib_core]
[ 1360.716099] Modules linked in: xt_CHECKSUM iptable_mangle
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert
iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi
ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif
ghash_clmulni_intel pcbc joydev aesni_intel dm_service_time ipmi_si
crypto_simd glue_helper sg hpilo cryptd hpwdt ipmi_devintf iTCO_wdt
gpio_ich acpi_power_meter iTCO_vendor_support ipmi_msghandler shpchp
pcspkr i7core_edac lpc_ich
[ 1361.120851]  pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace
dm_multipath sunrpc ip_tables xfs libcrc32c radeon i2c_algo_bit
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm sd_mod
drm mlx5_core mlxfw ptp serio_raw crc32c_intel i2c_core hpsa pps_core
bnx2 devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[ 1361.288913] CPU: 15 PID: 15149 Comm: reboot Tainted:
G          I      4.15.0-rc7 #1
[ 1361.333577] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1361.369976] RIP: 0010:ib_cq_completion_direct+0x28/0x30 [ib_core]
[ 1361.404971] RSP: 0018:ffffa08c8747fc60 EFLAGS: 00010086
[ 1361.435007] RAX: 0000000000000000 RBX: ffff8d37a6f8b468 RCX:
ffffffffae662928
[ 1361.474397] RDX: 0000000000000001 RSI: 0000000000000082 RDI:
0000000000000046
[ 1361.515097] RBP: ffff8d2bb07e0000 R08: 0000000000000000 R09:
0000000000000717
[ 1361.555054] R10: 0000000000000000 R11: ffffa08c8747f9c8 R12:
ffff8d2ed1edc264
[ 1361.595593] R13: ffff8d37a6f8b400 R14: ffffa08c8747fca8 R15:
0000000000000083
[ 1361.635133] FS:  00007fc09956a880(0000) GS:ffff8d37b33c0000(0000)
knlGS:0000000000000000
[ 1361.681800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1361.714217] CR2: 0000000001034f80 CR3: 0000000ba0f9e005 CR4:
00000000000206e0
[ 1361.754794] Call Trace:
[ 1361.768980]  mlx5_ib_event+0x335/0x410 [mlx5_ib]
[ 1361.795303]  mlx5_core_event+0x7b/0x1a0 [mlx5_core]
[ 1361.823438]  ? synchronize_irq+0x35/0xa0
[ 1361.845962]  mlx5_enter_error_state+0xe4/0x1c0 [mlx5_core]
[ 1361.877382]  shutdown+0x127/0x170 [mlx5_core]
[ 1361.902688]  pci_device_shutdown+0x31/0x60
[ 1361.925924]  device_shutdown+0x101/0x1d0
[ 1361.948642]  kernel_restart+0xe/0x60
[ 1361.968517]  SYSC_reboot+0x1e8/0x210
[ 1361.988062]  ? __audit_syscall_entry+0xaf/0x100
[ 1362.013500]  ? syscall_trace_enter+0x1cc/0x2b0
[ 1362.038483]  ? __audit_syscall_exit+0x1ff/0x280
[ 1362.064598]  do_syscall_64+0x61/0x1a0
[ 1362.084635]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1362.111113] RIP: 0033:0x7fc098377a56
[ 1362.131668] RSP: 002b:00007ffd4b3377e8 EFLAGS: 00000206 ORIG_RAX:
00000000000000a9
[ 1362.174578] RAX: ffffffffffffffda RBX: 0000000000000004 RCX:
00007fc098377a56
[ 1362.213620] RDX: 0000000001234567 RSI: 0000000028121969 RDI:
fffffffffee1dead
[ 1362.255259] RBP: 0000000000000000 R08: 000056141a7642a0 R09:
00007ffd4b336eb0
[ 1362.296293] R10: 0000000000000024 R11: 0000000000000206 R12:
0000000000000000
[ 1362.338341] R13: 00007ffd4b337ab0 R14: 0000000000000000 R15:
0000000000000000
[ 1362.378518] Code: 00 00 00 66 66 66 66 90 80 3d 65 e1 02 00 00 74 02
f3 c3 48 89 fe 31 c0 48 c7 c7 68 58 92 c0 c6 05 4e e1 02 00 01 e8 a8 23
d8 ec <0f> ff c3 0f 1f 44 00 00 66 66 66 66 90 41 55 45 89 c5 41 54 49 
[ 1362.483962] ---[ end trace 528ee06930a5763f ]---
[ 1362.509435] mlx5_1:mlx5_ib_event:2992:(pid 15149): warning: event on
port 0
[ 1362.548716] scsi host2: ib_srp: failed RECV status WR flushed (5)
for CQE 0000000023e53497
[ 1362.595980] mlx5_core 0000:08:00.1: mlx5_enter_error_state:128:(pid
15149): end
[ 1362.637630] mlx5_core 0000:08:00.0: Shutdown was called
[ 1362.677523] mlx5_core 0000:08:00.0: mlx5_enter_error_state:121:(pid
15149): start
[ 1362.720734] mlx5_0:mlx5_ib_event:2992:(pid 15149): warning: event on
port 0
[ 1362.760795] scsi host1: ib_srp: failed RECV status WR flushed (5)
for CQE 000000009ad07e27
[ 1362.806977] mlx5_core 0000:08:00.0: mlx5_enter_error_state:128:(pid
15149): end

With the latest RDMA tree additions I panic every time on shutdown.
This is built against 4.15.0-rc2 plus whatever other patches are in
the RDMA tree.

I was testing Bart's tree when it panicked, and we now know we have an
issue in mlx5/ib.

I am waiting to see what Leon and the RDMA folks want to do so I can
avoid another bisect, but if I have to instrument and/or bisect I will
do it.

Regards
Laurence



* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                         ` <1515610750.10153.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-10 19:15                                           ` Jason Gunthorpe
       [not found]                                             ` <20180110191510.GK4518-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-01-10 19:15 UTC (permalink / raw)
  To: Laurence Oberman, Leon Romanovsky
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, Jan 10, 2018 at 01:59:10PM -0500, Laurence Oberman wrote:

> Yep, this seems specific to the mlx5 and IB. 
> The problem though is Linus's tree 4.15-rc-7 already has enough of the
> part of the RDMA updates to see issues.

Every time you post a backtrace it is different. The only commonality
seems to be that the CQ completion core appears to be processing
garbage, accompanied by these sorts of sketchy kernel messages from mlx5:

> [ 1360.511682] mlx5_core 0000:08:00.1: Shutdown was called
> [ 1360.550531] mlx5_core 0000:08:00.1: mlx5_enter_error_state:121:(pid

> [  938.938946] mlx5_core 0000:08:00.1: Shutdown was called
> [  938.968423] mlx5_core 0000:08:00.1: mlx5_cmd_force_teardown_hca:245:(pid 14752): teardown with force mode failed
> [  938.978359] mlx5_core 0000:08:00.1: mlx5_cmd_comp_handler:1445:(pid 13186): Command completion arrived after timeout (entry idx = 0).
> [  942.209464] mlx5_1:wait_for_async_commands:735:(pid 14752): done with all pending requests

My other guess is an mlx5 issue where it is returning CQ wrids it
should not return?

Leon?

I don't see anything changing in this area in rdma.git for-rc, so I
can't give you a guess on a patch, sorry.

Do you think this test ever worked for you? You said bisect, so I
assume so?

Jason

* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                     ` <1515609623.2745.20.camel-Sjgp3cTcYWE@public.gmane.org>
  2018-01-10 18:59                                       ` Laurence Oberman
@ 2018-01-10 19:17                                       ` Jason Gunthorpe
       [not found]                                         ` <20180110191758.GL4518-uk2M96/98Pc@public.gmane.org>
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-01-10 19:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: loberman-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, Jan 10, 2018 at 06:40:25PM +0000, Bart Van Assche wrote:
> On Wed, 2018-01-10 at 11:26 -0700, Jason Gunthorpe wrote:
> > On Wed, Jan 10, 2018 at 08:42:03AM -0500, Laurence Oberman wrote:
> > 
> > > [  946.647514] kernel tried to execute NX-protected page - exploit
> > > attempt? (uid: 0)
> > > [  946.691954] BUG: unable to handle kernel paging request at
> > > 00000000a2129b93
> > > [  947.889552] Call Trace:
> > > [  947.903724]  ? __ib_process_cq+0x55/0xa0 [ib_core]
> > > [  947.931179]  ? ib_cq_poll_work+0x1b/0x60 [ib_core]
> > > [  947.958153]  ? process_one_work+0x141/0x340
> > > [  947.981362]  ? worker_thread+0x47/0x3e0
> > > [  948.002102]  ? kthread+0xf5/0x130
> > > [  948.020538]  ? rescuer_thread+0x380/0x380
> > > [  948.043180]  ? kthread_associate_blkcg+0x90/0x90
> > > [  948.070184]  ? ret_from_fork+0x1f/0x30
> > 
> > These oopses you have strongly suggest that ib_wc->wr_cqe
> > is garbage.
> > 
> > Did SRP somehow free its wr_cqe data before the completion fired?
> > 
> > Turn on slab poisoning to confirm?
> 
> It's easy to see in drivers/infiniband/core/cq.c that polling is
> stopped before a completion queue is destroyed (see also the
> cancel_work_sync(&cq->work) and the cq->device->destroy_cq(cq) calls
> in ib_free_cq()).

But that has nothing directly to do with the lifetime of, say, struct
srp_request, which contains the ib_wc->wr_cqe?

E.g., freeing a struct srp_request before the wrid has passed through
the CQ poll would produce exactly these sorts of symptoms...

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                             ` <20180110191510.GK4518-uk2M96/98Pc@public.gmane.org>
@ 2018-01-10 19:30                                               ` Laurence Oberman
       [not found]                                                 ` <1515612639.10153.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-10 19:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 12:15 -0700, Jason Gunthorpe wrote:
> On Wed, Jan 10, 2018 at 01:59:10PM -0500, Laurence Oberman wrote:
> 
> > Yep, this seems specific to the mlx5 and IB. 
> > The problem though is Linus's tree 4.15-rc-7 already has enough of
> > the
> > part of the RDMA updates to see issues.
> 
> Every time you post a backtrace it is different.. The only
> commonality
> seems to be that the CQ completion core appears to be processing
> garbage, accompanied by these sorts of sketch kernel messages from
> mlx5:
> 
> > [ 1360.511682] mlx5_core 0000:08:00.1: Shutdown was called
> > [ 1360.550531] mlx5_core 0000:08:00.1:
> > mlx5_enter_error_state:121:(pid
> > [  938.938946] mlx5_core 0000:08:00.1: Shutdown was called
> > [  938.968423] mlx5_core 0000:08:00.1:
> > mlx5_cmd_force_teardown_hca:245:(pid 14752): teardown with force
> > mode failed
> > [  938.978359] mlx5_core 0000:08:00.1:
> > mlx5_cmd_comp_handler:1445:(pid 13186): Command completion arrived
> > after timeout (entry idx = 0).
> > [  942.209464] mlx5_1:wait_for_async_commands:735:(pid 14752): done
> > with all pending requests
> 
> My other guess is a mlx5 issue where it is returning CQ wrids it
> should not return?
> 
> Leon?
> 
> I don't see anything changing in this area in rdma.git for-rc, so I
> can't give you a guess on a patch, sorry.
> 
> Do you think this test ever worked for you? You said bisect, so I
> assume so?
> 
> Jason
Hi Jason

Just to be clear, I have posted two types of stack traces: one where I
panic, and the other, here above, where I am not panicking.

This is not any special type of test. I booted the kernel, mapped the
SRP devices from the target server and proceeded to shut down the
client with "shutdown -r now".
This is part of the holistic test I always do against new patches in
Bart's tree.
I start with reboots, then rmmods, etc., before I go on to perform I/O
against the LUNs from the target.

The panic was the first issue I came across after building a kernel
with Bart's tree.
I have not even started testing anything else yet.

The trace above was provided because Bart asked me to test two kernels,
 
1. Linus's tree 4.15-rc7 
2. The RDMA tree.

Bart's tree panics the same as the RDMA tree I cloned.

I will look at prior release candidates in Linus's tree and see where
this may have crept in. I am of course puzzled why I am the only one to
see it; other folks must have mlx5 (CX4) hardware like I do.

Would be good to know what test was last performed on the current RDMA
tree by Leon and team.

Regards
Laurence



* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                         ` <20180110191758.GL4518-uk2M96/98Pc@public.gmane.org>
@ 2018-01-10 19:32                                           ` Bart Van Assche
       [not found]                                             ` <1515612733.2745.27.camel-Sjgp3cTcYWE@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-10 19:32 UTC (permalink / raw)
  To: jgg-uk2M96/98Pc
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA, loberman-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 12:17 -0700, Jason Gunthorpe wrote:
> But that has nothing directly to do with the lifetime of, say, struct
> srp_request which contains ib_wc->wr_cqe?
> 
> eg freeing struct srp_request before the wrid has passed through the
> CQ poll would produce these sorts of symptoms...

Hello Jason,

The SRP initiator driver RDMA channel shutdown sequence is as follows:
* srp_remove_target() calls srp_free_ch_ib(). That last function calls
  srp_destroy_qp() which in turn calls ib_drain_qp(). ib_drain_qp() waits until
  all CQEs have been dequeued by changing the QP state into IB_QPS_ERR and by
  waiting until the completion for a newly posted request has been received.
* srp_remove_target() calls srp_free_req_data(). That last function
  calls kfree(ch->req_ring), that is the data structure that contains the
  SRP request structures.

Bart.


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                 ` <1515612639.10153.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-10 20:52                                                   ` Jason Gunthorpe
       [not found]                                                     ` <20180110205243.GP4776-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-01-10 20:52 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, Jan 10, 2018 at 02:30:39PM -0500, Laurence Oberman wrote:

> Just to be clear, I have posted two types of stack traces, one where I
> panic the other here above where I am not panicking.

Guessing it is just luck which you hit.. Random corrupted memory and
all..

> This is not any special type of test. I booted the kernel, mapped
> the SRP devices from the target server and proceeded to shutdown the
> client with shutdown -r now.  This is part of my holistic test I
> always do against new patches in Bart's tree.  I start with reboots,
> them rmmod's etc. before I go on to perform I/O against the LUNS
> from the target.

Well, your shutdown is triggering the mlx driver shutdown code,
and then it looks like the SRP stuff gets cleaned up? That certainly is
getting a bit exciting code-wise.

I see there have been some changes in the mlx5 shutdown handling
recently..

As an experiment comment out the '.shutdown = shutdown' in
drivers/net/ethernet/mellanox/mlx5/core/main.c?

And it would be interesting to know if your past success kernels were
printing the mlx5 shutdown message too? Perhaps something core kernel
changed to enable this path for your test?

Jason


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                     ` <20180110205243.GP4776-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-10 21:11                                                       ` Laurence Oberman
       [not found]                                                         ` <1515618674.10153.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-10 21:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 13:52 -0700, Jason Gunthorpe wrote:
> On Wed, Jan 10, 2018 at 02:30:39PM -0500, Laurence Oberman wrote:
> 
> > Just to be clear, I have posted two types of stack traces, one
> > where I
> > panic the other here above where I am not panicking.
> 
> Guessing it is just luck which you hit.. Random corrupted memory and
> all..
> 
> > This is not any special type of test. I booted the kernel, mapped
> > the SRP devices from the target server and proceeded to shutdown
> > the
> > client with shutdown -r now.  This is part of my holistic test I
> > always do against new patches in Bart's tree.  I start with
> > reboots,
> > them rmmod's etc. before I go on to perform I/O against the LUNS
> > from the target.
> 
> Well, your shtudown is triggering the mlx driver shutdown code,
> then it looks like the SRP stuff gets cleaned up? That certainly is
> getting a bit exciting code wise
> 
> I see there have been some changes in the mlx5 shutdown handling
> recently..
> 
> As an experiment comment out the '.shutdown = shutdown' in
> drivers/net/ethernet/mellanox/mlx5/core/main.c?
> 
> And it would be interesting to know if your past success kernels were
> printing the mlx5 shutdown message too? Perhaps something core kernel
> changed to enable this path for your test?
> 
> Jason

It's a solid issue each time: the shutdown.

Here is rc6. I am building rc1 now and will then go to 4.14 to peel
this onion.

4.15.0-rc6

[  150.600416] ---[ end trace fc9e16dc996e3246 ]---
[  150.626405] mlx5_1:mlx5_ib_event:2992:(pid 14203): warning: event on port 0
[  150.666308] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE 00000000ecb7c551
[  150.712873] mlx5_core 0000:08:00.1: mlx5_enter_error_state:128:(pid 14203): end
[  150.753463] mlx5_core 0000:08:00.0: Shutdown was called
[  150.793126] mlx5_core 0000:08:00.0: mlx5_enter_error_state:121:(pid 14203): start
[  150.835047] mlx5_0:mlx5_ib_event:2992:(pid 14203): warning: event on port 0
[  150.874155] scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE 00000000f7f26a7b
[  150.919317] mlx5_core 0000:08:00.0: mlx5_enter_error_state:128:(pid 14203): end
[  151.449010] reboot: Restarting system
[  151.467644] reboot: machine restart


It almost looks like the changes made may require new firmware for my
CX4 card, because it's coming from here and I don't like to see
pci_err** called.

static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
                                              pci_channel_state_t state)
{
        struct mlx5_core_dev *dev = pci_get_drvdata(pdev);
        struct mlx5_priv *priv = &dev->priv;

        dev_info(&pdev->dev, "%s was called\n", __func__);

        mlx5_enter_error_state(dev, false);
        mlx5_unload_one(dev, priv, false);
        /* In case of kernel call drain the health wq */
        if (state) {
                mlx5_drain_health_wq(dev);
                mlx5_pci_disable_device(dev);
        }

        return state == pci_channel_io_perm_failure ?
                PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
}



* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                         ` <1515618674.10153.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-10 21:15                                                           ` Jason Gunthorpe
       [not found]                                                             ` <20180110211501.GS4776-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2018-01-10 21:17                                                           ` Laurence Oberman
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2018-01-10 21:15 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, Jan 10, 2018 at 04:11:14PM -0500, Laurence Oberman wrote:

> Almost looks like changes made may require new Firmware maybe for my
> CX4 card because its coming from here and I dont like to see pci_err**
> called.

git tells me the shutdown error feature is newish, and surely
interacts with CQ shutdown, so guessing it is broken is a pretty good
starting point.

Jason


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                         ` <1515618674.10153.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-01-10 21:15                                                           ` Jason Gunthorpe
@ 2018-01-10 21:17                                                           ` Laurence Oberman
  1 sibling, 0 replies; 35+ messages in thread
From: Laurence Oberman @ 2018-01-10 21:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 16:11 -0500, Laurence Oberman wrote:
> On Wed, 2018-01-10 at 13:52 -0700, Jason Gunthorpe wrote:
> > On Wed, Jan 10, 2018 at 02:30:39PM -0500, Laurence Oberman wrote:
> > 
> > > Just to be clear, I have posted two types of stack traces, one
> > > where I
> > > panic the other here above where I am not panicking.
> > 
> > Guessing it is just luck which you hit.. Random corrupted memory
> > and
> > all..
> > 
> > > This is not any special type of test. I booted the kernel, mapped
> > > the SRP devices from the target server and proceeded to shutdown
> > > the
> > > client with shutdown -r now.  This is part of my holistic test I
> > > always do against new patches in Bart's tree.  I start with
> > > reboots,
> > > them rmmod's etc. before I go on to perform I/O against the LUNS
> > > from the target.
> > 
> > Well, your shtudown is triggering the mlx driver shutdown code,
> > then it looks like the SRP stuff gets cleaned up? That certainly is
> > getting a bit exciting code wise
> > 
> > I see there have been some changes in the mlx5 shutdown handling
> > recently..
> > 
> > As an experiment comment out the '.shutdown = shutdown' in
> > drivers/net/ethernet/mellanox/mlx5/core/main.c?
> > 
> > And it would be interesting to know if your past success kernels
> > were
> > printing the mlx5 shutdown message too? Perhaps something core
> > kernel
> > changed to enable this path for your test?
> > 
> > Jason
> 
> Its a solid issue each time, the shutdown.
> 
> Here is rc6, I am building rc1 now and will then go to 4.14 to peel
> this onion
> 
> 4.15.0-rc6
> 
> [  150.600416] ---[ end trace fc9e16dc996e3246 ]---
> [  150.626405] mlx5_1:mlx5_ib_event:2992:(pid 14203): warning: event
> on
> port 0
> [  150.666308] scsi host1: ib_srp: failed RECV status WR flushed (5)
> for CQE 00000000ecb7c551
> [  150.712873] mlx5_core 0000:08:00.1:
> mlx5_enter_error_state:128:(pid
> 14203): end
> [  150.753463] mlx5_core 0000:08:00.0: Shutdown was called
> [  150.793126] mlx5_core 0000:08:00.0:
> mlx5_enter_error_state:121:(pid
> 14203): start
> [  150.835047] mlx5_0:mlx5_ib_event:2992:(pid 14203): warning: event
> on
> port 0
> [  150.874155] scsi host2: ib_srp: failed RECV status WR flushed (5)
> for CQE 00000000f7f26a7b
> [  150.919317] mlx5_core 0000:08:00.0:
> mlx5_enter_error_state:128:(pid
> 14203): end
> [  151.449010] reboot: Restarting system
> [  151.467644] reboot: machine restart
> 
> 
> Almost looks like changes made may require new Firmware maybe for my
> CX4 card because its coming from here and I dont like to see
> pci_err**
> called.
> 
> static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>                                               pci_channel_state_t state)
> {
>         struct mlx5_core_dev *dev = pci_get_drvdata(pdev);
>         struct mlx5_priv *priv = &dev->priv;
> 
>         dev_info(&pdev->dev, "%s was called\n", __func__);
> 
>         mlx5_enter_error_state(dev, false);
>         mlx5_unload_one(dev, priv, false);
>         /* In case of kernel call drain the health wq */
>         if (state) {
>                 mlx5_drain_health_wq(dev);
>                 mlx5_pci_disable_device(dev);
>         }
> 
>         return state == pci_channel_io_perm_failure ?
>                 PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
> }
> 

I will do this next. It's possible it's been there for a while and was
missed, as with no panics the messages would not have been a focus.

However, keep in mind that with the other changes in the RDMA tree this
is more serious: the shutdown then leads to list corruptions and panics.

Starting to make sense now based on what you said about the new
shutdown code.

Just going to try rc1 and then will do below as a test.

"As an experiment comment out the '.shutdown = shutdown' in
> > drivers/net/ethernet/mellanox/mlx5/core/main.c?
"



* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                             ` <1515612733.2745.27.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-10 22:43                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2018-01-10 22:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA, loberman-H+wXaHxf7aLQT0dZR+AlfA

On Wed, Jan 10, 2018 at 12:32 PM, Bart Van Assche
<Bart.VanAssche-Sjgp3cTcYWE@public.gmane.org> wrote:
> On Wed, 2018-01-10 at 12:17 -0700, Jason Gunthorpe wrote:
>> But that has nothing directly to do with the lifetime of, say, struct
>> srp_request which contains ib_wc->wr_cqe?
>>
>> eg freeing struct srp_request before the wrid has passed through the
>> CQ poll would produce these sorts of symptoms...
>
> Hello Jason,
>
> The SRP initiator driver RDMA channel shutdown sequence is as follows:
> * srp_remove_target() calls srp_free_ch_ib(). That last function calls
>   srp_destroy_qp() which in turn calls ib_drain_qp(). ib_drain_qp() waits until
>   all CQEs have been dequeued by changing the QP state into IB_QPS_ERR and by
>   waiting until the completion for a newly posted request has been received.
> * srp_remove_target() calls srp_free_req_data(). That last function
>   calls kfree(ch->req_ring), that is the data structure that contains the
>   SRP request structures.

Thanks Bart, informative..

Jason


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                             ` <20180110211501.GS4776-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-11 13:02                                                               ` Laurence Oberman
       [not found]                                                                 ` <1515675741.21421.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-11 13:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Wed, 2018-01-10 at 14:15 -0700, Jason Gunthorpe wrote:
> On Wed, Jan 10, 2018 at 04:11:14PM -0500, Laurence Oberman wrote:
> 
> > Almost looks like changes made may require new Firmware maybe for
> > my
> > CX4 card because its coming from here and I dont like to see
> > pci_err**
> > called.
> 
> git tells me the shutdown error feature is newish, and surely
> interacts with CQ shutdown, so guessing it is broken is a pretty good
> starting point.
> 
> Jason

So, as expected, removing the shutdown call stops this from happening.

"
As an experiment comment out the '.shutdown = shutdown' in
drivers/net/ethernet/mellanox/mlx5/core/main.c?
"

Leon, can you look into this please? The issue is already in
Linus' tree, and with the changes in the RDMA tree it's not just
messaging; we panic on shutdown.

Thanks
Laurence


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                 ` <1515675741.21421.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-11 18:20                                                                   ` Laurence Oberman
       [not found]                                                                     ` <1515694855.21421.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-01-11 20:43                                                                   ` Kernel v4.16 / v4.17 SRP and SRPT patches Laurence Oberman
  1 sibling, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-11 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Thu, 2018-01-11 at 08:02 -0500, Laurence Oberman wrote:
> On Wed, 2018-01-10 at 14:15 -0700, Jason Gunthorpe wrote:
> > On Wed, Jan 10, 2018 at 04:11:14PM -0500, Laurence Oberman wrote:
> > 
> > > Almost looks like changes made may require new Firmware maybe for
> > > my
> > > CX4 card because its coming from here and I dont like to see
> > > pci_err**
> > > called.
> > 
> > git tells me the shutdown error feature is newish, and surely
> > interacts with CQ shutdown, so guessing it is broken is a pretty
> > good
> > starting point.
> > 
> > Jason
> 
> So as expected removing the shutdown call stops this happening.
> 
> "
> As an experiment comment out the '.shutdown = shutdown' in
> drivers/net/ethernet/mellanox/mlx5/core/main.c?
> "
> 
> Leon, can you look into this please becaus ethe issue is already in
> Linus' tree and with the changes in the RDMA tree its not just
> messaging, we panic on shutdown.
> 
> Thanks
> Laurence

Moving this to a new thread, as I have a patch I am submitting for it.
I will then continue with all the rest of the tests on Bart's latest
SRP/SRPT patches.




* Patch:  RDMA mlx5_core.c : mlx5_try_fast_unload causes panics
       [not found]                                                                     ` <1515694855.21421.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-11 18:35                                                                       ` Laurence Oberman
  0 siblings, 0 replies; 35+ messages in thread
From: Laurence Oberman @ 2018-01-11 18:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

The mlx5_core changes call mlx5_try_fast_unload() during shutdown.
This causes error messages on shutdown and, with the latest rdma tree,
panics due to list corruption.
Remove the mlx5_try_fast_unload() call so that we go back to calling
only the original mlx5_unload_one().

Tested-by:     Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

This patch was tested against the latest RDMA for-next tree

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index d4a471a..1c66df6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1522,9 +1522,7 @@ static void shutdown(struct pci_dev *pdev)
 	int err;
 
 	dev_info(&pdev->dev, "Shutdown was called\n");
-	err = mlx5_try_fast_unload(dev);
-	if (err)
-		mlx5_unload_one(dev, priv, false);
+	mlx5_unload_one(dev, priv, false);
 	mlx5_pci_disable_device(dev);
 }
 
-- 
1.8.3.1


Now on shutdown we are clean
Rebooting.
[  203.281646] kvm: exiting hardware virtualization
[  203.309916] sd 2:0:0:1: [sdbk] Synchronizing SCSI cache
..
..
[  204.240158] sd 1:0:0:2: [sdaf] Synchronizing SCSI cache
[  204.269623] sd 1:0:0:3: [sdae] Synchronizing SCSI cache
[  204.298736] sd 1:0:0:4: [sdad] Synchronizing SCSI cache
..
..
[  205.074525] sd 1:0:0:0: [sdd] Synchronizing SCSI cache
[  205.103639] mlx5_core 0000:08:00.1: Shutdown was called
[  208.244242] mlx5_1:wait_for_async_commands:735:(pid 14464): done
with all pending requests
..
..
[  208.294459] sd 1:0:0:0: [sdd] Synchronizing SCSI cache
[  208.329616] scsi 1:0:0:0: alua: Detached
[  208.352899] sd 1:0:0:29: [sde] Synchronizing SCSI cache
[  208.388955] scsi 1:0:0:29: alua: Detached
..
..
[  212.230718] scsi host1: ib_srp: connection closed
[  226.697119] mlx5_core 0000:08:00.0: Shutdown was called
[  229.899254] mlx5_0:wait_for_async_commands:735:(pid 14464): done
with all pending requests


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                 ` <1515675741.21421.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-01-11 18:20                                                                   ` Laurence Oberman
@ 2018-01-11 20:43                                                                   ` Laurence Oberman
       [not found]                                                                     ` <1515703435.21421.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-11 20:43 UTC (permalink / raw)
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	ddutile-H+wXaHxf7aLQT0dZR+AlfA

Hi Bart

So after sorting out the panic issue I started full testing of your
patch sets for SRPT and SRP.

This is the first time I am using your kernel for both the SRPT side
(server) and the SRP side (client).

Server and Client both running kernel off your tree

[root@fedstorage ~]# uname -a
Linux fedstorage.bos.redhat.com 4.15.0-rc7.bart+ #1 SMP Thu Jan 11
14:17:33 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

Prior to booting the server on your tree I could find my LUNs;
I was using 4.15 from Mike Snitzer's tree for the server.

Now I see this on probe for SRP devices.

I have not changed any configuration entries for the srpt driver.

Still at
[root@fedstorage modprobe.d]# cat ib_srpt.conf 
options ib_srpt srp_max_req_size=8296 srpt_srq_size=32768



Server log
----------
Linux fedstorage.bos.redhat.com 4.15.0-rc7.bart+ #1 SMP Thu Jan 11
14:17:33 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@fedstorage ~]# [  533.405291] IPv6: ADDRCONF(NETDEV_CHANGE): ib0:
link becomes ready
[  533.442884] IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
[  540.024438] ib_srpt Received SRP_LOGIN_REQ with i_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6ed2, t_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
(guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4e); pkey 0xffff
[  540.141464] ib_srpt failed to create queue pair with sq_size = 16384
(-12) - retrying
[  540.186159] ib_srpt failed to create queue pair with sq_size = 8192
(-12) - retrying
[  540.231794] ib_srpt Rejected login for initiator
7cfe:9003:0072:6ed2: ret = -13.
[  540.274077] ib_srpt Rejecting login with reason 0x10006
[  540.304117] ib_srpt Received SRP_LOGIN_REQ with i_port_id
7cfe:9003:0072:6e4f:7cfe:9003:0072:6ed3, t_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
(guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4f); pkey 0xffff
[  540.426295] ib_srpt failed to create queue pair with sq_size = 16384
(-12) - retrying
[  540.470829] ib_srpt failed to create queue pair with sq_size = 8192
(-12) - retrying
[  540.515817] ib_srpt Rejected login for initiator
7cfe:9003:0072:6ed3: ret = -13.
[  540.556485] ib_srpt Rejecting login with reason 0x10006


Client log
----------
[  554.414466] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[  559.264828] scsi host2: ib_srp: SRP LOGIN from
fe80:0000:0000:0000:7cfe:9003:0072:6ed2 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4e REJECTED, reason 0x00010006
[  559.342571] scsi host2: ib_srp: Connection 0/8 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4e failed
[  559.546713] scsi host1: ib_srp: SRP LOGIN from
fe80:0000:0000:0000:7cfe:9003:0072:6ed3 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4f REJECTED, reason 0x00010006
[  559.626139] scsi host1: ib_srp: Connection 0/8 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4f failed
[  591.325685] scsi host1: ib_srp: SRP LOGIN from
fe80:0000:0000:0000:7cfe:9003:0072:6ed2 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4e REJECTED, reason 0x00010006
[  591.404376] scsi host1: ib_srp: Connection 0/8 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4e failed
[  591.605330] scsi host2: ib_srp: SRP LOGIN from
fe80:0000:0000:0000:7cfe:9003:0072:6ed3 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4f REJECTED, reason 0x00010006
[  591.681975] scsi host2: ib_srp: Connection 0/8 to
fe80:0000:0000:0000:7cfe:9003:0072:6e4f failed


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                     ` <1515703435.21421.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-11 21:15                                                                       ` Bart Van Assche
       [not found]                                                                         ` <1515705340.2752.60.camel-Sjgp3cTcYWE@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-11 21:15 UTC (permalink / raw)
  To: loberman-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Thu, 2018-01-11 at 15:43 -0500, Laurence Oberman wrote:
> Server log
> ----------
> Linux fedstorage.bos.redhat.com 4.15.0-rc7.bart+ #1 SMP Thu Jan 11
> 14:17:33 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
> [root@fedstorage ~]# [  533.405291] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
> [  533.442884] IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
> [  540.024438] ib_srpt Received SRP_LOGIN_REQ with i_port_id 7cfe:9003:0072:6e4e:7cfe:9003:0072:6ed2, t_port_id 7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
> (guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4e); pkey 0xffff
> [  540.141464] ib_srpt failed to create queue pair with sq_size = 16384 (-12) - retrying
> [  540.186159] ib_srpt failed to create queue pair with sq_size = 8192 (-12) - retrying
> [  540.231794] ib_srpt Rejected login for initiator 7cfe:9003:0072:6ed2: ret = -13.
> [  540.274077] ib_srpt Rejecting login with reason 0x10006

Hello Laurence,

I think that means that no ACL was configured on the target side for
initiator port 7cfe:9003:0072:6ed2. Can you provide the output of
(cd /sys/kernel/config/target/srpt && ls -d */*/acls/*)?

Thanks,

Bart.


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                         ` <1515705340.2752.60.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-11 21:33                                                                           ` Laurence Oberman
       [not found]                                                                             ` <1515706433.21421.11.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-11 21:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Thu, 2018-01-11 at 21:15 +0000, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 15:43 -0500, Laurence Oberman wrote:
> > Server log
> > ----------
> > Linux fedstorage.bos.redhat.com 4.15.0-rc7.bart+ #1 SMP Thu Jan 11
> > 14:17:33 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
> > [root@fedstorage ~]# [  533.405291] IPv6: ADDRCONF(NETDEV_CHANGE):
> > ib0: link becomes ready
> > [  533.442884] IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes
> > ready
> > [  540.024438] ib_srpt Received SRP_LOGIN_REQ with i_port_id
> > 7cfe:9003:0072:6e4e:7cfe:9003:0072:6ed2, t_port_id
> > 7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port
> > 1
> > (guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4e); pkey 0xffff
> > [  540.141464] ib_srpt failed to create queue pair with sq_size =
> > 16384 (-12) - retrying
> > [  540.186159] ib_srpt failed to create queue pair with sq_size =
> > 8192 (-12) - retrying
> > [  540.231794] ib_srpt Rejected login for initiator
> > 7cfe:9003:0072:6ed2: ret = -13.
> > [  540.274077] ib_srpt Rejecting login with reason 0x10006
> 
> Hello Laurence,
> 
> I think that means that no ACL was configured on the target side for
> initiator port 7cfe:9003:0072:6ed2. Can you provide the output of
> (cd /sys/kernel/config/target/srpt && ls -d */*/acls/*)?
> 
> Thanks,
> 
> Bart.


Hi Bart

I have not changed the configuration in the LIO target, though.
The /etc/target JSON file is the same as always, and it's been like this
for years.

I just rebooted the server into 4.13, and it's fine again: it found all
the targets with the same kernel on the client.

So it's specific to your new tree with srpt.

I will reboot again, re-load LIO, and show you, but here is my ACL
list, which has been this way for some time.


o- srpt ...................................................................... [Targets: 2]
  | o- ib.fe800000000000007cfe900300726e4e .................................. [no-gen-acls]
  | | o- acls ................................................................... [ACLs: 8]
  | | | o- ib.4e6e72000390fe7c7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  | | | o- ib.4e6e72000390fe7c7cfe900300726ed3 .......................... [Mapped LUNs: 30]
  | | | o- ib.4f6e72000390fe7c7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  | | | o- ib.4f6e72000390fe7c7cfe900300726ed3 .......................... [Mapped LUNs: 30]
  | | | o- ib.7cfe900300726e4e7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  | | | o- ib.7cfe900300726e4e7cfe900300726ed3 .......................... [Mapped LUNs: 30]
  | | | o- ib.7cfe900300726e4f7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  | | | o- ib.fe800000000000007cfe900300726e4e .......................... [Mapped LUNs: 30]
  | o- ib.fe800000000000007cfe900300726e4f .................................. [no-gen-acls]
  |   o- acls ................................................................... [ACLs: 8]
  |   | o- ib.4e6e72000390fe7c7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  |   | o- ib.4e6e72000390fe7c7cfe900300726ed3 .......................... [Mapped LUNs: 30]
  |   | o- ib.4f6e72000390fe7c7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  |   | o- ib.4f6e72000390fe7c7cfe900300726ed3 .......................... [Mapped LUNs: 30]
  |   | o- ib.7cfe900300726e4e7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  |   | o- ib.7cfe900300726e4e7cfe900300726ed3 .......................... [Mapped LUNs: 30]
  |   | o- ib.7cfe900300726e4f7cfe900300726ed2 .......................... [Mapped LUNs: 30]
  |   | o- ib.7cfe900300726e4f7cfe900300726ed3 .......................... [Mapped LUNs: 30]




* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                             ` <1515706433.21421.11.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-11 21:43                                                                               ` Bart Van Assche
  2018-01-12 21:11                                                                               ` Bart Van Assche
  1 sibling, 0 replies; 35+ messages in thread
From: Bart Van Assche @ 2018-01-11 21:43 UTC (permalink / raw)
  To: loberman-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA


On Thu, 2018-01-11 at 16:33 -0500, Laurence Oberman wrote:
> I have not changed the configuration in LIO target though.

Yes, I trust that you haven't changed the configuration, but I wasn't sure
whether my own ib_srpt changes had preserved backwards compatibility :-)

>   | | | o- ib.4e6e72000390fe7c7cfe900300726ed2
> ...................................................................
> 
> [ ... ]
> [Mapped LUNs: 30]
>   | | | o- ib.4f6e72000390fe7c7cfe900300726ed2

Thanks, this confirms my suspicion. I will see what needs to be changed in
the ib_srpt driver to restore backwards compatibility.

Bart.


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                             ` <1515706433.21421.11.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2018-01-11 21:43                                                                               ` Bart Van Assche
@ 2018-01-12 21:11                                                                               ` Bart Van Assche
       [not found]                                                                                 ` <1515791472.2396.57.camel-Sjgp3cTcYWE@public.gmane.org>
  1 sibling, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-12 21:11 UTC (permalink / raw)
  To: loberman-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Thu, 2018-01-11 at 16:33 -0500, Laurence Oberman wrote:
> I just rebooted the server into 4.13 and its fine again and found all
> the targets with the same kernel on the client.
> 
> So its specific to your new tree with srpt
> 
> I will reboot again and re-load LIO and show you but here is my ACL
> list that has been this way for some time.
> 
> 
> o- srpt
> .......................................................................
> ...................................... [Targets: 2]
>   | o- ib.fe800000000000007cfe900300726e4e
> .......................................................................
> .... [no-gen-acls]
>   | | o- acls
> .......................................................................
> ..................................... [ACLs: 8]
>   | | | o- ib.4e6e72000390fe7c7cfe900300726ed2
> 
> [ ... ]

Hello Laurence,

Although I'm not sure, I think I found the root cause of this failure.
The following patch should fix it:

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 96142110a155..5297963c834d 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -2083,7 +2083,7 @@ static int srpt_cm_req_recv(struct srpt_device *const sdev,
 		struct ib_cm_rep_param ib_cm;
 	} *rep_param = NULL;
 	struct srpt_rdma_ch *ch;
-	char i_port_id[24];
+	char i_port_id[36];
 	u32 it_iu_len;
 	int i, ret;
 
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h b/drivers/infiniband/ulp/srpt/ib_srpt.h
index bf4525b24d98..02883f8e9c71 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -308,7 +308,7 @@ struct srpt_rdma_ch {
 	bool			using_rdma_cm;
 	bool			processing_wait_list;
 	struct se_session	*sess;
-	u8			sess_name[36];
+	u8			sess_name[24];
 	struct work_struct	release_work;
 };
 

I wrote "should" because targetcli is not installed on my test setup and
because I have not yet verified this change with targetcli. If you have the
time to verify this change that would be great. If not then I will install
targetcli myself and verify this change.

Thanks,

Bart.


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                                 ` <1515791472.2396.57.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-13  0:09                                                                                   ` Laurence Oberman
       [not found]                                                                                     ` <1515802177.1566.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-13  0:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Fri, 2018-01-12 at 21:11 +0000, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 16:33 -0500, Laurence Oberman wrote:
> > I just rebooted the server into 4.13 and its fine again and found
> > all
> > the targets with the same kernel on the client.
> > 
> > So its specific to your new tree with srpt
> > 
> > I will reboot again and re-load LIO and show you but here is my ACL
> > list that has been this way for some time.
> > 
> > 
> > o- srpt
> > ...................................................................
> > ....
> > ...................................... [Targets: 2]
> >   | o- ib.fe800000000000007cfe900300726e4e
> > ...................................................................
> > ....
> > .... [no-gen-acls]
> >   | | o- acls
> > ...................................................................
> > ....
> > ..................................... [ACLs: 8]
> >   | | | o- ib.4e6e72000390fe7c7cfe900300726ed2
> > 
> > [ ... ]
> 
> Hello Laurence,
> 
> Although I'm not sure, I think I found the root cause of this failure.
> The
> following patch should fix the failure:
> 
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c
> b/drivers/infiniband/ulp/srpt/ib_srpt.c
> index 96142110a155..5297963c834d 100644
> --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> @@ -2083,7 +2083,7 @@ static int srpt_cm_req_recv(struct srpt_device
> *const sdev,
>  		struct ib_cm_rep_param ib_cm;
>  	} *rep_param = NULL;
>  	struct srpt_rdma_ch *ch;
> -	char i_port_id[24];
> +	char i_port_id[36];
>  	u32 it_iu_len;
>  	int i, ret;
>  
> diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h
> b/drivers/infiniband/ulp/srpt/ib_srpt.h
> index bf4525b24d98..02883f8e9c71 100644
> --- a/drivers/infiniband/ulp/srpt/ib_srpt.h
> +++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
> @@ -308,7 +308,7 @@ struct srpt_rdma_ch {
>  	bool			using_rdma_cm;
>  	bool			processing_wait_list;
>  	struct se_session	*sess;
> -	u8			sess_name[36];
> +	u8			sess_name[24];
>  	struct work_struct	release_work;
>  };
>  
> 
> I wrote "should" because targetcli is not installed on my test setup
> and
> because I have not yet verified this change with targetcli. If you
> have the
> time to verify this change that would be great. If not then I will
> install
> targetcli myself and verify this change.
> 
> Thanks,
> 
> Bart.


Hi Bart

I will get this tested tonight and report back.

Fix makes sense.

Regards
Laurence


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                                     ` <1515802177.1566.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-13  1:57                                                                                       ` Laurence Oberman
       [not found]                                                                                         ` <1515808673.11354.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-13  1:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Fri, 2018-01-12 at 19:09 -0500, Laurence Oberman wrote:
> On Fri, 2018-01-12 at 21:11 +0000, Bart Van Assche wrote:
> > On Thu, 2018-01-11 at 16:33 -0500, Laurence Oberman wrote:
> > > I just rebooted the server into 4.13 and its fine again and found
> > > all
> > > the targets with the same kernel on the client.
> > > 
> > > So its specific to your new tree with srpt
> > > 
> > > I will reboot again and re-load LIO and show you but here is my
> > > ACL
> > > list that has been this way for some time.
> > > 
> > > 
> > > o- srpt
> > > .................................................................
> > > ..
> > > ....
> > > ...................................... [Targets: 2]
> > >   | o- ib.fe800000000000007cfe900300726e4e
> > > .................................................................
> > > ..
> > > ....
> > > .... [no-gen-acls]
> > >   | | o- acls
> > > .................................................................
> > > ..
> > > ....
> > > ..................................... [ACLs: 8]
> > >   | | | o- ib.4e6e72000390fe7c7cfe900300726ed2
> > > 
> > > [ ... ]
> > 
> > Hello Laurence,
> > 
> > Although I'm not sure I think I found the root cause of this
> > failure.
> > The
> > following patch should fix the failure:
> > 
> > diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c
> > b/drivers/infiniband/ulp/srpt/ib_srpt.c
> > index 96142110a155..5297963c834d 100644
> > --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> > +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> > @@ -2083,7 +2083,7 @@ static int srpt_cm_req_recv(struct
> > srpt_device
> > *const sdev,
> >  		struct ib_cm_rep_param ib_cm;
> >  	} *rep_param = NULL;
> >  	struct srpt_rdma_ch *ch;
> > -	char i_port_id[24];
> > +	char i_port_id[36];
> >  	u32 it_iu_len;
> >  	int i, ret;
> >  
> > diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h
> > b/drivers/infiniband/ulp/srpt/ib_srpt.h
> > index bf4525b24d98..02883f8e9c71 100644
> > --- a/drivers/infiniband/ulp/srpt/ib_srpt.h
> > +++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
> > @@ -308,7 +308,7 @@ struct srpt_rdma_ch {
> >  	bool			using_rdma_cm;
> >  	bool			processing_wait_list;
> >  	struct se_session	*sess;
> > -	u8			sess_name[36];
> > +	u8			sess_name[24];
> >  	struct work_struct	release_work;
> >  };
> >  
> > 
> > I wrote "should" because targetcli is not installed on my test
> > setup
> > and
> > because I have not yet verified this change with targetcli. If you
> > have the
> > time to verify this change that would be great. If not then I will
> > install
> > targetcli myself and verify this change.
> > 
> > Thanks,
> > 
> > Bart.
> 
> 
> Hi Bart
> 
> I will get this tested tonight and report back.
> 
> Fix makes sense.
> 
> Regards
> Laurence

Hello Bart
For the patch above:

This corrects the connectivity issue with LIO targets and I will
continue now testing your patches from your tree.

Reviewed-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Tested-by:   Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Thank you for your quick response, sir.

Laurence



* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                                         ` <1515808673.11354.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-13 14:53                                                                                           ` Laurence Oberman
       [not found]                                                                                             ` <1515855226.32050.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Laurence Oberman @ 2018-01-13 14:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Fri, 2018-01-12 at 20:57 -0500, Laurence Oberman wrote:
> On Fri, 2018-01-12 at 19:09 -0500, Laurence Oberman wrote:
> > On Fri, 2018-01-12 at 21:11 +0000, Bart Van Assche wrote:
> > > On Thu, 2018-01-11 at 16:33 -0500, Laurence Oberman wrote:
> > > > I just rebooted the server into 4.13 and its fine again and
> > > > found
> > > > all
> > > > the targets with the same kernel on the client.
> > > > 
> > > > So its specific to your new tree with srpt
> > > > 
> > > > I will reboot again and re-load LIO and show you but here is my
> > > > ACL
> > > > list that has been this way for some time.
> > > > 
> > > > 
> > > > o- srpt
> > > > ...............................................................
> > > > ..
> > > > ..
> > > > ....
> > > > ...................................... [Targets: 2]
> > > >   | o- ib.fe800000000000007cfe900300726e4e
> > > > ...............................................................
> > > > ..
> > > > ..
> > > > ....
> > > > .... [no-gen-acls]
> > > >   | | o- acls
> > > > ...............................................................
> > > > ..
> > > > ..
> > > > ....
> > > > ..................................... [ACLs: 8]
> > > >   | | | o- ib.4e6e72000390fe7c7cfe900300726ed2
> > > > 
> > > > [ ... ]
> > > 
> > > Hello Laurence,
> > > 
> > > Although I'm not sure, I think I found the root cause of this
> > > failure.
> > > The
> > > following patch should fix the failure:
> > > 
> > > diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c
> > > b/drivers/infiniband/ulp/srpt/ib_srpt.c
> > > index 96142110a155..5297963c834d 100644
> > > --- a/drivers/infiniband/ulp/srpt/ib_srpt.c
> > > +++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
> > > @@ -2083,7 +2083,7 @@ static int srpt_cm_req_recv(struct
> > > srpt_device
> > > *const sdev,
> > >  		struct ib_cm_rep_param ib_cm;
> > >  	} *rep_param = NULL;
> > >  	struct srpt_rdma_ch *ch;
> > > -	char i_port_id[24];
> > > +	char i_port_id[36];
> > >  	u32 it_iu_len;
> > >  	int i, ret;
> > >  
> > > diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h
> > > b/drivers/infiniband/ulp/srpt/ib_srpt.h
> > > index bf4525b24d98..02883f8e9c71 100644
> > > --- a/drivers/infiniband/ulp/srpt/ib_srpt.h
> > > +++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
> > > @@ -308,7 +308,7 @@ struct srpt_rdma_ch {
> > >  	bool			using_rdma_cm;
> > >  	bool			processing_wait_list;
> > >  	struct se_session	*sess;
> > > -	u8			sess_name[36];
> > > +	u8			sess_name[24];
> > >  	struct work_struct	release_work;
> > >  };
> > >  
> > > 
> > > I wrote "should" because targetcli is not installed on my test
> > > setup
> > > and
> > > because I have not yet verified this change with targetcli. If
> > > you
> > > have the
> > > time to verify this change that would be great. If not then I
> > > will
> > > install
> > > targetcli myself and verify this change.
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > 
> > Hi Bart
> > 
> > I will get this tested tonight and report back.
> > 
> > Fix makes sense.
> > 
> > Regards
> > Laurence
> 
> Hello Bart
> For the patch above:
> 
> This corrects the connectivity issue with LIO targets and I will
> continue now testing your patches from your tree.
> 
> Reviewed-by: Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Tested-by:   Laurence Oberman <loberman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> Thank you for your quick response Sir.
> 
> Laurence
> 

Hello Bart

I missed some logs when I tested last night.
It's working fine, as mentioned, with the above patch, and I see all
the targets (that's what I checked for).

However, I still see the following on the srpt server, even though I now
have access to all the targets on the client.

[  239.502025] ib_srpt Received SRP_LOGIN_REQ with i_port_id
7cfe:9003:0072:6e4f:7cfe:9003:0072:6ed3, t_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
(guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4f); pkey 0xffff
[  239.623881] ib_srpt failed to create queue pair with sq_size = 16384
(-12) - retrying
[  239.669381] ib_srpt failed to create queue pair with sq_size = 8192
(-12) - retrying
[  239.715366] ib_srpt Received SRP_LOGIN_REQ with i_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6ed2, t_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
(guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4e); pkey 0xffff
[  239.831661] ib_srpt failed to create queue pair with sq_size = 16384
(-12) - retrying
[  239.877193] ib_srpt failed to create queue pair with sq_size = 8192
(-12) - retrying
[  239.967259] ib_srpt Received SRP_LOGIN_REQ with i_port_id
7cfe:9003:0072:6e4f:7cfe:9003:0072:6ed3, t_port_id
7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
(guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4f); pkey 0xffff
[  240.087362] ib_srpt failed to create queue pair with sq_size = 16384
(-12) - retrying
[  240.130981] ib_srpt failed to create queue pair with sq_size = 8192
(-12) - retrying
..
..

So the functional report was valid, but we need to see why we are still
getting the messages above.

Apologies; I should have checked all the logs last night before my first
reply.



* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                                             ` <1515855226.32050.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-15 16:12                                                                                               ` Bart Van Assche
       [not found]                                                                                                 ` <1516032762.3951.5.camel-Sjgp3cTcYWE@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Bart Van Assche @ 2018-01-15 16:12 UTC (permalink / raw)
  To: loberman-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Sat, 2018-01-13 at 09:53 -0500, Laurence Oberman wrote:
> [  239.502025] ib_srpt Received SRP_LOGIN_REQ with i_port_id
> 7cfe:9003:0072:6e4f:7cfe:9003:0072:6ed3, t_port_id
> 7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
> (guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4f); pkey 0xffff
> [  239.623881] ib_srpt failed to create queue pair with sq_size = 16384
> (-12) - retrying
> [  239.669381] ib_srpt failed to create queue pair with sq_size = 8192
> (-12) - retrying
> [  239.715366] ib_srpt Received SRP_LOGIN_REQ with i_port_id
> 7cfe:9003:0072:6e4e:7cfe:9003:0072:6ed2, t_port_id
> 7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port 1
> (guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4e); pkey 0xffff
> [  239.831661] ib_srpt failed to create queue pair with sq_size = 16384
> (-12) - retrying
> [  239.877193] ib_srpt failed to create queue pair with sq_size = 8192
> (-12) - retrying

Hello Laurence,

These messages are expected and do not indicate a failure. The retry loop
that the above messages refer to was introduced a long time ago:

commit ab477c1ff5e0a744c072404bf7db51bfe1f05b6e
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Sun Oct 19 18:05:33 2014 +0300

    srp-target: Retry when QP creation fails with ENOMEM
    
    It is not guaranteed to that srp_sq_size is supported
    by the HCA. So if we failed to create the QP with ENOMEM,
    try with a smaller srp_sq_size. Keep it up until we hit
    MIN_SRPT_SQ_SIZE, then fail the connection.
    
[ ... ]

The only recent change in that code is that retry attempts are now logged.
From commit 0e9949f1db6c "IB/srpt: Add RDMA/CM support":

+       if (ret) {
+               bool retry = sq_size > MIN_SRPT_SQ_SIZE;
+
+               pr_err("failed to create queue pair with sq_size = %d (%d)%s\n",
+                      sq_size, ret, retry ? " - retrying" : "");
+               if (retry) {
+                       ib_free_cq(ch->cq);
+                       sq_size = max(sq_size / 2, MIN_SRPT_SQ_SIZE);
+                       goto retry;
+               } else {
+                       goto err_destroy_cq;
                }
-               pr_err("failed to create_qp ret= %d\n", ret);
-               goto err_destroy_cq;
        }

Do you perhaps want that pr_err() to be changed into a pr_debug() for retry
attempts?

Thanks,

Bart.


* Re: Kernel v4.16 / v4.17 SRP and SRPT patches
       [not found]                                                                                                 ` <1516032762.3951.5.camel-Sjgp3cTcYWE@public.gmane.org>
@ 2018-01-15 16:52                                                                                                   ` Laurence Oberman
  0 siblings, 0 replies; 35+ messages in thread
From: Laurence Oberman @ 2018-01-15 16:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, ddutile-H+wXaHxf7aLQT0dZR+AlfA

On Mon, 2018-01-15 at 16:12 +0000, Bart Van Assche wrote:
> On Sat, 2018-01-13 at 09:53 -0500, Laurence Oberman wrote:
> > [  239.502025] ib_srpt Received SRP_LOGIN_REQ with i_port_id
> > 7cfe:9003:0072:6e4f:7cfe:9003:0072:6ed3, t_port_id
> > 7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port
> > 1
> > (guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4f); pkey 0xffff
> > [  239.623881] ib_srpt failed to create queue pair with sq_size =
> > 16384
> > (-12) - retrying
> > [  239.669381] ib_srpt failed to create queue pair with sq_size =
> > 8192
> > (-12) - retrying
> > [  239.715366] ib_srpt Received SRP_LOGIN_REQ with i_port_id
> > 7cfe:9003:0072:6e4e:7cfe:9003:0072:6ed2, t_port_id
> > 7cfe:9003:0072:6e4e:7cfe:9003:0072:6e4e and it_iu_len 2116 on port
> > 1
> > (guid=fe80:0000:0000:0000:7cfe:9003:0072:6e4e); pkey 0xffff
> > [  239.831661] ib_srpt failed to create queue pair with sq_size =
> > 16384
> > (-12) - retrying
> > [  239.877193] ib_srpt failed to create queue pair with sq_size =
> > 8192
> > (-12) - retrying
> 
> Hello Laurence,
> 
> These messages are expected and do not indicate a failure. The retry
> loop
> the above messages refer to got introduced a long time ago:
> 
> commit ab477c1ff5e0a744c072404bf7db51bfe1f05b6e
> Author: Bart Van Assche <bvanassche-HInyCGIudOg@public.gmane.org>
> Date:   Sun Oct 19 18:05:33 2014 +0300
> 
>     srp-target: Retry when QP creation fails with ENOMEM
>     
>     It is not guaranteed to that srp_sq_size is supported
>     by the HCA. So if we failed to create the QP with ENOMEM,
>     try with a smaller srp_sq_size. Keep it up until we hit
>     MIN_SRPT_SQ_SIZE, then fail the connection.
>     
> [ ... ]
> 
> The only recent change in that code is that retry attempts are now
> logged.
> From commit 0e9949f1db6c "IB/srpt: Add RDMA/CM support":
> 
> +       if (ret) {
> +               bool retry = sq_size > MIN_SRPT_SQ_SIZE;
> +
> +               pr_err("failed to create queue pair with sq_size = %d
> (%d)%s\n",
> +                      sq_size, ret, retry ? " - retrying" : "");
> +               if (retry) {
> +                       ib_free_cq(ch->cq);
> +                       sq_size = max(sq_size / 2, MIN_SRPT_SQ_SIZE);
> +                       goto retry;
> +               } else {
> +                       goto err_destroy_cq;
>                 }
> -               pr_err("failed to create_qp ret= %d\n", ret);
> -               goto err_destroy_cq;
>         }
> 
> Do you perhaps want that pr_err() to be changed into a pr_debug() for
> retry
> attempts?
> 
> Thanks,
> 
> Bart.

Hi Bart, 

I recognized those as possibly just informational messages, so I thought
we were good with the recent patch that fixed the connection issue.
However, when I attempted to actually use the targets with your latest
SRPT, I had failures on the client.

It was a tough weekend for me, and maybe I made mistakes.
Let me complete the irq/cpu test Ming is waiting for, and I will revisit
this fully with a clean build and your most recent patch.

I will answer off-list while we figure it out.

Many Thanks
Laurence


end of thread, other threads:[~2018-01-15 16:52 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-06  0:22 [PATCH 5/8] infiniband: fix ulp/srpt/ib_srpt.c kernel-doc notation Randy Dunlap
     [not found] ` <5a5016c0.4c0a620a.ed2b3.60da-ATjtLOhZ0NVl57MIdRCFDg@public.gmane.org>
2018-01-06  0:36   ` Bart Van Assche
     [not found]     ` <fcc3f226-848d-abc4-2a81-f4fd821761c9-Sjgp3cTcYWE@public.gmane.org>
2018-01-06  5:55       ` Randy Dunlap
     [not found]         ` <31f69352-b8b1-9ed1-635b-2c654b49c775-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2018-01-06 16:50           ` Bart Van Assche
2018-01-09 20:15       ` Laurence Oberman
     [not found]         ` <1515528956.3919.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-09 20:31           ` Laurence Oberman
     [not found]             ` <1515529869.3919.4.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-09 20:51               ` Kernel v4.16 / v4.17 SRP and SRPT patches Bart Van Assche
     [not found]                 ` <1515531079.2721.26.camel-Sjgp3cTcYWE@public.gmane.org>
2018-01-09 21:00                   ` Laurence Oberman
     [not found]                     ` <1515531652.26021.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-09 22:40                       ` Laurence Oberman
     [not found]                         ` <1515537614.26021.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-10 13:42                           ` Laurence Oberman
     [not found]                             ` <1515591723.26021.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-10 18:26                               ` Jason Gunthorpe
     [not found]                                 ` <20180110182648.GI4518-uk2M96/98Pc@public.gmane.org>
2018-01-10 18:40                                   ` Bart Van Assche
     [not found]                                     ` <1515609623.2745.20.camel-Sjgp3cTcYWE@public.gmane.org>
2018-01-10 18:59                                       ` Laurence Oberman
     [not found]                                         ` <1515610750.10153.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-10 19:15                                           ` Jason Gunthorpe
     [not found]                                             ` <20180110191510.GK4518-uk2M96/98Pc@public.gmane.org>
2018-01-10 19:30                                               ` Laurence Oberman
     [not found]                                                 ` <1515612639.10153.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-10 20:52                                                   ` Jason Gunthorpe
     [not found]                                                     ` <20180110205243.GP4776-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-10 21:11                                                       ` Laurence Oberman
     [not found]                                                         ` <1515618674.10153.6.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-10 21:15                                                           ` Jason Gunthorpe
     [not found]                                                             ` <20180110211501.GS4776-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-11 13:02                                                               ` Laurence Oberman
     [not found]                                                                 ` <1515675741.21421.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-11 18:20                                                                   ` Laurence Oberman
     [not found]                                                                     ` <1515694855.21421.3.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-11 18:35                                                                       ` Patch: RDMA mlx5_core.c : mlx5_try_fast_unload causes panics Laurence Oberman
2018-01-11 20:43                                                                   ` Kernel v4.16 / v4.17 SRP and SRPT patches Laurence Oberman
     [not found]                                                                     ` <1515703435.21421.9.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-11 21:15                                                                       ` Bart Van Assche
     [not found]                                                                         ` <1515705340.2752.60.camel-Sjgp3cTcYWE@public.gmane.org>
2018-01-11 21:33                                                                           ` Laurence Oberman
     [not found]                                                                             ` <1515706433.21421.11.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-11 21:43                                                                               ` Bart Van Assche
2018-01-12 21:11                                                                               ` Bart Van Assche
     [not found]                                                                                 ` <1515791472.2396.57.camel-Sjgp3cTcYWE@public.gmane.org>
2018-01-13  0:09                                                                                   ` Laurence Oberman
     [not found]                                                                                     ` <1515802177.1566.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-13  1:57                                                                                       ` Laurence Oberman
     [not found]                                                                                         ` <1515808673.11354.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-13 14:53                                                                                           ` Laurence Oberman
     [not found]                                                                                             ` <1515855226.32050.1.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-15 16:12                                                                                               ` Bart Van Assche
     [not found]                                                                                                 ` <1516032762.3951.5.camel-Sjgp3cTcYWE@public.gmane.org>
2018-01-15 16:52                                                                                                   ` Laurence Oberman
2018-01-10 21:17                                                           ` Laurence Oberman
2018-01-10 19:17                                       ` Jason Gunthorpe
     [not found]                                         ` <20180110191758.GL4518-uk2M96/98Pc@public.gmane.org>
2018-01-10 19:32                                           ` Bart Van Assche
     [not found]                                             ` <1515612733.2745.27.camel-Sjgp3cTcYWE@public.gmane.org>
2018-01-10 22:43                                               ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.