linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
@ 2022-04-19 15:53 Bernard Metzler
  2022-04-21  2:02 ` Cheng Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Bernard Metzler @ 2022-04-19 15:53 UTC (permalink / raw)
  To: Cheng Xu, Luis Chamberlain, Bart Van Assche
  Cc: linux-block, linux-rdma, Pankaj Raghav, Pankaj Raghav

> -----Original Message-----
> From: Cheng Xu <chengyou@linux.alibaba.com>
> Sent: Monday, 18 April 2022 10:29
> To: Luis Chamberlain <mcgrof@kernel.org>; Bernard Metzler
> <BMT@zurich.ibm.com>; Bart Van Assche <bvanassche@acm.org>
> Cc: linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; Pankaj Raghav
> <pankydev8@gmail.com>; Pankaj Raghav <p.raghav@samsung.com>
> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> warning while testing blktests srp/002 v5.17-rc7
> 
> 
> 
> On 4/15/22 7:31 AM, Luis Chamberlain wrote:
> 
> <...>
> 
> > [  195.218783] ------------[ cut here ]------------
> > [  195.221242] WARNING: CPU: 7 PID: 201 at
> drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 [siw]
> > [  195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E)
> target_core_pscsi(E) target_core_file(E) ib_srpt(E) target_core_iblock(E)
> target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E) siw(E)
> null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E) dm_service_time(E)
> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E) ib_core(E)
> dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E)
> crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E)
> cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E) drm_shmem_helper(E)
> drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E) drm(E)
> configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E)
> jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E)
> libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E) failover(E)
> virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E) crc32_pclmul(E)
> crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E) virtio_pci(E)
> > [  195.222986]  virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E)
> usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E) i2c_piix4(E)
> virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk]
> > [  195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn
> > [  195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded
> Tainted: G            E     5.17.0-rc7 #1
> > [  195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.15.0-1 04/01/2014
> > [  195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
> > [  195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw]
> > [  195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d e9 b1 d6 ef
> d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff ff <0f> 0b e9 1c
> ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53
> > [  195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286
> > [  195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
> 0000000000000000
> > [  195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI:
> ffffa03d1102a924
> > [  195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09:
> ffffbc53404ebc50
> > [  195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12:
> ffffa03cc4297000
> > [  195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15:
> ffffa03cc427ad58
> > [  195.275575] FS:  0000000000000000(0000) GS:ffffa03df7c80000(0000)
> knlGS:0000000000000000
> > [  195.278932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4:
> 0000000000770ee0
> > [  195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> > [  195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> > [  195.286522] PKRU: 55555554
> > [  195.287998] Call Trace:
> > [  195.289210]  <TASK>
> > [  195.290969]  siw_reject+0xac/0x180 [siw]
> > [  195.292679]  iw_cm_reject+0x68/0xc0 [iw_cm]
> > [  195.294136]  cm_work_handler+0x59d/0xe20 [iw_cm]
> > [  195.295588]  process_one_work+0x1e2/0x3b0
> > [  195.298338]  worker_thread+0x50/0x3a0
> > [  195.300330]  ? rescuer_thread+0x390/0x390
> > [  195.302269]  kthread+0xe5/0x110
> > [  195.304062]  ? kthread_complete_and_exit+0x20/0x20
> > [  195.307612]  ret_from_fork+0x1f/0x30
> > [  195.309585]  </TASK>
> > [  195.310674] ---[ end trace 0000000000000000 ]---
> > [  195.313290] scsi host4: ib_srp: REJ received
> > [  195.313293] scsi host4:   REJ reason 0xffffff98
> > [  195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113 failed
> > [  195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 -> 172.17.8.113:0
> > [  195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 ->
> 172.17.8.113:5555
> > [  195.472807] ib_srp:srp_parse_in: ib_srp: [fe80::5054:ff:fe5b:90dc%3] ->
> [fe80::5054:ff:fe5b:90dc]:0/202442865%3
> > > [0] INVALID URI REMOVED
> 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx-
> siA1ZOg&r=2TaYXQ0T-
> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy
> XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa-
> ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e=
> >  >    Luis
> 
> Hi, Bernard
> 
> I reproduced this issue, and it looks like a condition race between
> 'cm_work_handler' and 'siw_cm_work_handler'.
> 
> ----------------------------------------------------------------
>   Thread0:                         Thread1:
>   siw_cm_work_handler              cm_work_handler
> ----------------------------------------------------------------
> step0:
> siw_cm_upcall with
> IW_CM_EVENT_CONNECT_REQUEST
> 
>                              ===> cm_conn_req_handler
>                                     ...
>                                       cm_id->cm_handler (failed)
>                                       iw_cm_reject
>                                            siw_reject
> 
> *step1*:
> detach cep with listen_cep
> ----------------------------------------------------------------
> 
> When siw_reject is called in cm_work_handler, the related cep may have
> not been detached with its listen_cep, through the two steps are very
> close.
> 
> I think one simple way to fix this issue is keep step1 under
> siw_cep_set_inuse's protection, and this will make siw_reject will be
> pending util siw_cm_work_handler release the lock:
> 
> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> b/drivers/infiniband/sw/siw/siw_cm.c
> index 7acdd3c3a599..f033b6da1e9f 100644
> --- a/drivers/infiniband/sw/siw/siw_cm.c
> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep
> *cep)
> 
>                  siw_cep_set_inuse(new_cep);
>                  rv = siw_proc_mpareq(new_cep);
> -               siw_cep_set_free(new_cep);
> 
>                  if (rv != -EAGAIN) {
>                          siw_cep_put(cep);
>                          new_cep->listen_cep = NULL;
> +                       siw_cep_set_free(new_cep);
>                          if (rv)
>                                  goto error;
> +               } else {
> +                       siw_cep_set_free(new_cep);
>                  }
>          }
>          return;
> 
> Thanks,
> Cheng Xu

Hi Cheng, many thanks for looking into it! Unfortunately
I am out next 12 days until May. I will immediately look into
it when back. Your explanation sounds reasonable, but I'd
like to fully understand. Was it fixing the issue for you?

Thanks, Bernard.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
  2022-04-19 15:53 Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7 Bernard Metzler
@ 2022-04-21  2:02 ` Cheng Xu
  2022-05-04 20:40   ` Luis Chamberlain
  0 siblings, 1 reply; 7+ messages in thread
From: Cheng Xu @ 2022-04-21  2:02 UTC (permalink / raw)
  To: Bernard Metzler, Luis Chamberlain, Bart Van Assche
  Cc: linux-block, linux-rdma, Pankaj Raghav, Pankaj Raghav



On 4/19/22 11:53 PM, Bernard Metzler wrote:
>> -----Original Message-----
>> From: Cheng Xu <chengyou@linux.alibaba.com>
>> Sent: Monday, 18 April 2022 10:29
>> To: Luis Chamberlain <mcgrof@kernel.org>; Bernard Metzler
>> <BMT@zurich.ibm.com>; Bart Van Assche <bvanassche@acm.org>
>> Cc: linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; Pankaj Raghav
>> <pankydev8@gmail.com>; Pankaj Raghav <p.raghav@samsung.com>
>> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
>> warning while testing blktests srp/002 v5.17-rc7
>>
>>
>>
>> On 4/15/22 7:31 AM, Luis Chamberlain wrote:
>>
>> <...>
>>
>>> [  195.218783] ------------[ cut here ]------------
>>> [  195.221242] WARNING: CPU: 7 PID: 201 at
>> drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 [siw]
>>> [  195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E)
>> target_core_pscsi(E) target_core_file(E) ib_srpt(E) target_core_iblock(E)
>> target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E) siw(E)
>> null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E) dm_service_time(E)
>> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E) ib_core(E)
>> dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E)
>> crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E)
>> cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E) drm_shmem_helper(E)
>> drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E) drm(E)
>> configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E)
>> jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E)
>> libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E) failover(E)
>> virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E) crc32_pclmul(E)
>> crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E) virtio_pci(E)
>>> [  195.222986]  virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E)
>> usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E) i2c_piix4(E)
>> virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk]
>>> [  195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn
>>> [  195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded
>> Tainted: G            E     5.17.0-rc7 #1
>>> [  195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS 1.15.0-1 04/01/2014
>>> [  195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
>>> [  195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw]
>>> [  195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d e9 b1 d6 ef
>> d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff ff <0f> 0b e9 1c
>> ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53
>>> [  195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286
>>> [  195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
>> 0000000000000000
>>> [  195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI:
>> ffffa03d1102a924
>>> [  195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09:
>> ffffbc53404ebc50
>>> [  195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12:
>> ffffa03cc4297000
>>> [  195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15:
>> ffffa03cc427ad58
>>> [  195.275575] FS:  0000000000000000(0000) GS:ffffa03df7c80000(0000)
>> knlGS:0000000000000000
>>> [  195.278932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4:
>> 0000000000770ee0
>>> [  195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>>> [  195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>>> [  195.286522] PKRU: 55555554
>>> [  195.287998] Call Trace:
>>> [  195.289210]  <TASK>
>>> [  195.290969]  siw_reject+0xac/0x180 [siw]
>>> [  195.292679]  iw_cm_reject+0x68/0xc0 [iw_cm]
>>> [  195.294136]  cm_work_handler+0x59d/0xe20 [iw_cm]
>>> [  195.295588]  process_one_work+0x1e2/0x3b0
>>> [  195.298338]  worker_thread+0x50/0x3a0
>>> [  195.300330]  ? rescuer_thread+0x390/0x390
>>> [  195.302269]  kthread+0xe5/0x110
>>> [  195.304062]  ? kthread_complete_and_exit+0x20/0x20
>>> [  195.307612]  ret_from_fork+0x1f/0x30
>>> [  195.309585]  </TASK>
>>> [  195.310674] ---[ end trace 0000000000000000 ]---
>>> [  195.313290] scsi host4: ib_srp: REJ received
>>> [  195.313293] scsi host4:   REJ reason 0xffffff98
>>> [  195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113 failed
>>> [  195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 -> 172.17.8.113:0
>>> [  195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 ->
>> 172.17.8.113:5555
>>> [  195.472807] ib_srp:srp_parse_in: ib_srp: [fe80::5054:ff:fe5b:90dc%3] ->
>> [fe80::5054:ff:fe5b:90dc]:0/202442865%3
>>>> [0] INVALID URI REMOVED
>> 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx-
>> siA1ZOg&r=2TaYXQ0T-
>> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy
>> XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa-
>> ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e=
>>>   >    Luis
>>
>> Hi, Bernard
>>
>> I reproduced this issue, and it looks like a condition race between
>> 'cm_work_handler' and 'siw_cm_work_handler'.
>>
>> ----------------------------------------------------------------
>>    Thread0:                         Thread1:
>>    siw_cm_work_handler              cm_work_handler
>> ----------------------------------------------------------------
>> step0:
>> siw_cm_upcall with
>> IW_CM_EVENT_CONNECT_REQUEST
>>
>>                               ===> cm_conn_req_handler
>>                                      ...
>>                                        cm_id->cm_handler (failed)
>>                                        iw_cm_reject
>>                                             siw_reject
>>
>> *step1*:
>> detach cep with listen_cep
>> ----------------------------------------------------------------
>>
>> When siw_reject is called in cm_work_handler, the related cep may have
>> not been detached with its listen_cep, through the two steps are very
>> close.
>>
>> I think one simple way to fix this issue is keep step1 under
>> siw_cep_set_inuse's protection, and this will make siw_reject will be
>> pending util siw_cm_work_handler release the lock:
>>
>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>> b/drivers/infiniband/sw/siw/siw_cm.c
>> index 7acdd3c3a599..f033b6da1e9f 100644
>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep
>> *cep)
>>
>>                   siw_cep_set_inuse(new_cep);
>>                   rv = siw_proc_mpareq(new_cep);
>> -               siw_cep_set_free(new_cep);
>>
>>                   if (rv != -EAGAIN) {
>>                           siw_cep_put(cep);
>>                           new_cep->listen_cep = NULL;
>> +                       siw_cep_set_free(new_cep);
>>                           if (rv)
>>                                   goto error;
>> +               } else {
>> +                       siw_cep_set_free(new_cep);
>>                   }
>>           }
>>           return;
>>
>> Thanks,
>> Cheng Xu
> 
> Hi Cheng, many thanks for looking into it! Unfortunately
> I am out next 12 days until May. I will immediately look into
> it when back. Your explanation sounds reasonable, but I'd
> like to fully understand. 
I'd like to send a patch to fix this. When you back, you can review this
issue and the patch.

Was it fixing the issue for you?

Sure, With this change, the WARN in dmesg does not appear any more in
my tests.

Thanks,
Cheng Xu

> 
> Thanks, Bernard.
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
  2022-04-21  2:02 ` Cheng Xu
@ 2022-05-04 20:40   ` Luis Chamberlain
  2022-05-05  8:38     ` Cheng Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Luis Chamberlain @ 2022-05-04 20:40 UTC (permalink / raw)
  To: Cheng Xu
  Cc: Bernard Metzler, Bart Van Assche, linux-block, linux-rdma,
	Pankaj Raghav, Pankaj Raghav

On Thu, Apr 21, 2022 at 10:02:47AM +0800, Cheng Xu wrote:
> 
> 
> On 4/19/22 11:53 PM, Bernard Metzler wrote:
> > > -----Original Message-----
> > > From: Cheng Xu <chengyou@linux.alibaba.com>
> > > Sent: Monday, 18 April 2022 10:29
> > > To: Luis Chamberlain <mcgrof@kernel.org>; Bernard Metzler
> > > <BMT@zurich.ibm.com>; Bart Van Assche <bvanassche@acm.org>
> > > Cc: linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; Pankaj Raghav
> > > <pankydev8@gmail.com>; Pankaj Raghav <p.raghav@samsung.com>
> > > Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> > > warning while testing blktests srp/002 v5.17-rc7
> > > 
> > > 
> > > 
> > > On 4/15/22 7:31 AM, Luis Chamberlain wrote:
> > > 
> > > <...>
> > > 
> > > > [  195.218783] ------------[ cut here ]------------
> > > > [  195.221242] WARNING: CPU: 7 PID: 201 at
> > > drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 [siw]
> > > > [  195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E)
> > > target_core_pscsi(E) target_core_file(E) ib_srpt(E) target_core_iblock(E)
> > > target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E) siw(E)
> > > null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E) dm_service_time(E)
> > > scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E) ib_core(E)
> > > dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E)
> > > crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E)
> > > cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E) drm_shmem_helper(E)
> > > drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E) drm(E)
> > > configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E)
> > > jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E)
> > > libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E) failover(E)
> > > virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E) crc32_pclmul(E)
> > > crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E) virtio_pci(E)
> > > > [  195.222986]  virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E)
> > > usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E) i2c_piix4(E)
> > > virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk]
> > > > [  195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn
> > > > [  195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded
> > > Tainted: G            E     5.17.0-rc7 #1
> > > > [  195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > BIOS 1.15.0-1 04/01/2014
> > > > [  195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
> > > > [  195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw]
> > > > [  195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d e9 b1 d6 ef
> > > d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff ff <0f> 0b e9 1c
> > > ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53
> > > > [  195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286
> > > > [  195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
> > > 0000000000000000
> > > > [  195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI:
> > > ffffa03d1102a924
> > > > [  195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09:
> > > ffffbc53404ebc50
> > > > [  195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12:
> > > ffffa03cc4297000
> > > > [  195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15:
> > > ffffa03cc427ad58
> > > > [  195.275575] FS:  0000000000000000(0000) GS:ffffa03df7c80000(0000)
> > > knlGS:0000000000000000
> > > > [  195.278932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [  195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4:
> > > 0000000000770ee0
> > > > [  195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > > [  195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > > [  195.286522] PKRU: 55555554
> > > > [  195.287998] Call Trace:
> > > > [  195.289210]  <TASK>
> > > > [  195.290969]  siw_reject+0xac/0x180 [siw]
> > > > [  195.292679]  iw_cm_reject+0x68/0xc0 [iw_cm]
> > > > [  195.294136]  cm_work_handler+0x59d/0xe20 [iw_cm]
> > > > [  195.295588]  process_one_work+0x1e2/0x3b0
> > > > [  195.298338]  worker_thread+0x50/0x3a0
> > > > [  195.300330]  ? rescuer_thread+0x390/0x390
> > > > [  195.302269]  kthread+0xe5/0x110
> > > > [  195.304062]  ? kthread_complete_and_exit+0x20/0x20
> > > > [  195.307612]  ret_from_fork+0x1f/0x30
> > > > [  195.309585]  </TASK>
> > > > [  195.310674] ---[ end trace 0000000000000000 ]---
> > > > [  195.313290] scsi host4: ib_srp: REJ received
> > > > [  195.313293] scsi host4:   REJ reason 0xffffff98
> > > > [  195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113 failed
> > > > [  195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 -> 172.17.8.113:0
> > > > [  195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 ->
> > > 172.17.8.113:5555
> > > > [  195.472807] ib_srp:srp_parse_in: ib_srp: [fe80::5054:ff:fe5b:90dc%3] ->
> > > [fe80::5054:ff:fe5b:90dc]:0/202442865%3
> > > > > [0] INVALID URI REMOVED
> > > 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx-
> > > siA1ZOg&r=2TaYXQ0T-
> > > r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy
> > > XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa-
> > > ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e=
> > > >   >    Luis
> > > 
> > > Hi, Bernard
> > > 
> > > I reproduced this issue, and it looks like a condition race between
> > > 'cm_work_handler' and 'siw_cm_work_handler'.
> > > 
> > > ----------------------------------------------------------------
> > >    Thread0:                         Thread1:
> > >    siw_cm_work_handler              cm_work_handler
> > > ----------------------------------------------------------------
> > > step0:
> > > siw_cm_upcall with
> > > IW_CM_EVENT_CONNECT_REQUEST
> > > 
> > >                               ===> cm_conn_req_handler
> > >                                      ...
> > >                                        cm_id->cm_handler (failed)
> > >                                        iw_cm_reject
> > >                                             siw_reject
> > > 
> > > *step1*:
> > > detach cep with listen_cep
> > > ----------------------------------------------------------------
> > > 
> > > When siw_reject is called in cm_work_handler, the related cep may have
> > > not been detached with its listen_cep, through the two steps are very
> > > close.
> > > 
> > > I think one simple way to fix this issue is keep step1 under
> > > siw_cep_set_inuse's protection, and this will make siw_reject will be
> > > pending util siw_cm_work_handler release the lock:
> > > 
> > > diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> > > b/drivers/infiniband/sw/siw/siw_cm.c
> > > index 7acdd3c3a599..f033b6da1e9f 100644
> > > --- a/drivers/infiniband/sw/siw/siw_cm.c
> > > +++ b/drivers/infiniband/sw/siw/siw_cm.c
> > > @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep
> > > *cep)
> > > 
> > >                   siw_cep_set_inuse(new_cep);
> > >                   rv = siw_proc_mpareq(new_cep);
> > > -               siw_cep_set_free(new_cep);
> > > 
> > >                   if (rv != -EAGAIN) {
> > >                           siw_cep_put(cep);
> > >                           new_cep->listen_cep = NULL;
> > > +                       siw_cep_set_free(new_cep);
> > >                           if (rv)
> > >                                   goto error;
> > > +               } else {
> > > +                       siw_cep_set_free(new_cep);
> > >                   }
> > >           }
> > >           return;
> > > 
> > > Thanks,
> > > Cheng Xu
> > 
> > Hi Cheng, many thanks for looking into it! Unfortunately
> > I am out next 12 days until May. I will immediately look into
> > it when back. Your explanation sounds reasonable, but I'd
> > like to fully understand.
> I'd like to send a patch to fix this. When you back, you can review this
> issue and the patch.
> 
> Was it fixing the issue for you?
> 
> Sure, With this change, the WARN in dmesg does not appear any more in
> my tests.
> 
> Thanks,
> Cheng Xu

*poke*

Would be good to get a fix merged. And if a patch is posted does this
need to go to stable?

  Luis

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
  2022-05-04 20:40   ` Luis Chamberlain
@ 2022-05-05  8:38     ` Cheng Xu
  2022-05-05 11:42       ` Bernard Metzler
  0 siblings, 1 reply; 7+ messages in thread
From: Cheng Xu @ 2022-05-05  8:38 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Bernard Metzler, Bart Van Assche, linux-block, linux-rdma,
	Pankaj Raghav, Pankaj Raghav



On 5/5/22 4:40 AM, Luis Chamberlain wrote:
> On Thu, Apr 21, 2022 at 10:02:47AM +0800, Cheng Xu wrote:
>>
>>
>> On 4/19/22 11:53 PM, Bernard Metzler wrote:
>>>> -----Original Message-----
>>>> From: Cheng Xu <chengyou@linux.alibaba.com>
>>>> Sent: Monday, 18 April 2022 10:29
>>>> To: Luis Chamberlain <mcgrof@kernel.org>; Bernard Metzler
>>>> <BMT@zurich.ibm.com>; Bart Van Assche <bvanassche@acm.org>
>>>> Cc: linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; Pankaj Raghav
>>>> <pankydev8@gmail.com>; Pankaj Raghav <p.raghav@samsung.com>
>>>> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
>>>> warning while testing blktests srp/002 v5.17-rc7
>>>>
>>>>
>>>>
>>>> On 4/15/22 7:31 AM, Luis Chamberlain wrote:
>>>>
>>>> <...>
>>>>
>>>>> [  195.218783] ------------[ cut here ]------------
>>>>> [  195.221242] WARNING: CPU: 7 PID: 201 at
>>>> drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 [siw]
>>>>> [  195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E)
>>>> target_core_pscsi(E) target_core_file(E) ib_srpt(E) target_core_iblock(E)
>>>> target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E) siw(E)
>>>> null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E) dm_service_time(E)
>>>> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E) ib_core(E)
>>>> dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E)
>>>> crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E)
>>>> cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E) drm_shmem_helper(E)
>>>> drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E) drm(E)
>>>> configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E)
>>>> jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E)
>>>> libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E) failover(E)
>>>> virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E) crc32_pclmul(E)
>>>> crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E) virtio_pci(E)
>>>>> [  195.222986]  virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E)
>>>> usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E) i2c_piix4(E)
>>>> virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk]
>>>>> [  195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn
>>>>> [  195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded
>>>> Tainted: G            E     5.17.0-rc7 #1
>>>>> [  195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>>> BIOS 1.15.0-1 04/01/2014
>>>>> [  195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
>>>>> [  195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw]
>>>>> [  195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d e9 b1 d6 ef
>>>> d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff ff <0f> 0b e9 1c
>>>> ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53
>>>>> [  195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286
>>>>> [  195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
>>>> 0000000000000000
>>>>> [  195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI:
>>>> ffffa03d1102a924
>>>>> [  195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09:
>>>> ffffbc53404ebc50
>>>>> [  195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12:
>>>> ffffa03cc4297000
>>>>> [  195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15:
>>>> ffffa03cc427ad58
>>>>> [  195.275575] FS:  0000000000000000(0000) GS:ffffa03df7c80000(0000)
>>>> knlGS:0000000000000000
>>>>> [  195.278932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [  195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4:
>>>> 0000000000770ee0
>>>>> [  195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>>> [  195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>>> 0000000000000400
>>>>> [  195.286522] PKRU: 55555554
>>>>> [  195.287998] Call Trace:
>>>>> [  195.289210]  <TASK>
>>>>> [  195.290969]  siw_reject+0xac/0x180 [siw]
>>>>> [  195.292679]  iw_cm_reject+0x68/0xc0 [iw_cm]
>>>>> [  195.294136]  cm_work_handler+0x59d/0xe20 [iw_cm]
>>>>> [  195.295588]  process_one_work+0x1e2/0x3b0
>>>>> [  195.298338]  worker_thread+0x50/0x3a0
>>>>> [  195.300330]  ? rescuer_thread+0x390/0x390
>>>>> [  195.302269]  kthread+0xe5/0x110
>>>>> [  195.304062]  ? kthread_complete_and_exit+0x20/0x20
>>>>> [  195.307612]  ret_from_fork+0x1f/0x30
>>>>> [  195.309585]  </TASK>
>>>>> [  195.310674] ---[ end trace 0000000000000000 ]---
>>>>> [  195.313290] scsi host4: ib_srp: REJ received
>>>>> [  195.313293] scsi host4:   REJ reason 0xffffff98
>>>>> [  195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113 failed
>>>>> [  195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 -> 172.17.8.113:0
>>>>> [  195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 ->
>>>> 172.17.8.113:5555
>>>>> [  195.472807] ib_srp:srp_parse_in: ib_srp: [fe80::5054:ff:fe5b:90dc%3] ->
>>>> [fe80::5054:ff:fe5b:90dc]:0/202442865%3
>>>>>> [0] INVALID URI REMOVED
>>>> 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx-
>>>> siA1ZOg&r=2TaYXQ0T-
>>>> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy
>>>> XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa-
>>>> ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e=
>>>>>    >    Luis
>>>>
>>>> Hi, Bernard
>>>>
>>>> I reproduced this issue, and it looks like a condition race between
>>>> 'cm_work_handler' and 'siw_cm_work_handler'.
>>>>
>>>> ----------------------------------------------------------------
>>>>     Thread0:                         Thread1:
>>>>     siw_cm_work_handler              cm_work_handler
>>>> ----------------------------------------------------------------
>>>> step0:
>>>> siw_cm_upcall with
>>>> IW_CM_EVENT_CONNECT_REQUEST
>>>>
>>>>                                ===> cm_conn_req_handler
>>>>                                       ...
>>>>                                         cm_id->cm_handler (failed)
>>>>                                         iw_cm_reject
>>>>                                              siw_reject
>>>>
>>>> *step1*:
>>>> detach cep with listen_cep
>>>> ----------------------------------------------------------------
>>>>
>>>> When siw_reject is called in cm_work_handler, the related cep may have
>>>> not been detached with its listen_cep, through the two steps are very
>>>> close.
>>>>
>>>> I think one simple way to fix this issue is keep step1 under
>>>> siw_cep_set_inuse's protection, and this will make siw_reject will be
>>>> pending util siw_cm_work_handler release the lock:
>>>>
>>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>>>> b/drivers/infiniband/sw/siw/siw_cm.c
>>>> index 7acdd3c3a599..f033b6da1e9f 100644
>>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
>>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>>>> @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep
>>>> *cep)
>>>>
>>>>                    siw_cep_set_inuse(new_cep);
>>>>                    rv = siw_proc_mpareq(new_cep);
>>>> -               siw_cep_set_free(new_cep);
>>>>
>>>>                    if (rv != -EAGAIN) {
>>>>                            siw_cep_put(cep);
>>>>                            new_cep->listen_cep = NULL;
>>>> +                       siw_cep_set_free(new_cep);
>>>>                            if (rv)
>>>>                                    goto error;
>>>> +               } else {
>>>> +                       siw_cep_set_free(new_cep);
>>>>                    }
>>>>            }
>>>>            return;
>>>>
>>>> Thanks,
>>>> Cheng Xu
>>>
>>> Hi Cheng, many thanks for looking into it! Unfortunately
>>> I am out next 12 days until May. I will immediately look into
>>> it when back. Your explanation sounds reasonable, but I'd
>>> like to fully understand.
>> I'd like to send a patch to fix this. When you back, you can review this
>> issue and the patch.
>>
>> Was it fixing the issue for you?
>>
>> Sure, With this change, the WARN in dmesg does not appear any more in
>> my tests.
>>
>> Thanks,
>> Cheng Xu
> 
> *poke*
> 
> Would be good to get a fix merged. And if a patch is posted does this
> need to go to stable?
> 
>    Luis

The patch has been accepted and merged to for-rc, see:

https://lore.kernel.org/all/d528d83466c44687f3872eadcb8c184528b2e2d4.1650526554.git.chengyou@linux.alibaba.com/T/

I think this patch need not be merged back to stable, because the issue
is not a functional problem, but only produce a WARN in dmesg.

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
  2022-05-05  8:38     ` Cheng Xu
@ 2022-05-05 11:42       ` Bernard Metzler
  2022-05-05 12:48         ` Luis Chamberlain
  0 siblings, 1 reply; 7+ messages in thread
From: Bernard Metzler @ 2022-05-05 11:42 UTC (permalink / raw)
  To: Cheng Xu, Luis Chamberlain
  Cc: Bart Van Assche, linux-block, linux-rdma, Pankaj Raghav, Pankaj Raghav


> -----Original Message-----
> From: Cheng Xu <chengyou@linux.alibaba.com>
> Sent: Thursday, 5 May 2022 10:38
> To: Luis Chamberlain <mcgrof@kernel.org>
> Cc: Bernard Metzler <BMT@zurich.ibm.com>; Bart Van Assche
> <bvanassche@acm.org>; linux-block@vger.kernel.org; linux-
> rdma@vger.kernel.org; Pankaj Raghav <pankydev8@gmail.com>; Pankaj Raghav
> <p.raghav@samsung.com>
> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> warning while testing blktests srp/002 v5.17-rc7
> 
> 
> 
> On 5/5/22 4:40 AM, Luis Chamberlain wrote:
> > On Thu, Apr 21, 2022 at 10:02:47AM +0800, Cheng Xu wrote:
> >>
> >>
> >> On 4/19/22 11:53 PM, Bernard Metzler wrote:
> >>>> -----Original Message-----
> >>>> From: Cheng Xu <chengyou@linux.alibaba.com>
> >>>> Sent: Monday, 18 April 2022 10:29
> >>>> To: Luis Chamberlain <mcgrof@kernel.org>; Bernard Metzler
> >>>> <BMT@zurich.ibm.com>; Bart Van Assche <bvanassche@acm.org>
> >>>> Cc: linux-block@vger.kernel.org; linux-rdma@vger.kernel.org; Pankaj
> Raghav
> >>>> <pankydev8@gmail.com>; Pankaj Raghav <p.raghav@samsung.com>
> >>>> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> >>>> warning while testing blktests srp/002 v5.17-rc7
> >>>>
> >>>>
> >>>>
> >>>> On 4/15/22 7:31 AM, Luis Chamberlain wrote:
> >>>>
> >>>> <...>
> >>>>
> >>>>> [  195.218783] ------------[ cut here ]------------
> >>>>> [  195.221242] WARNING: CPU: 7 PID: 201 at
> >>>> drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130
> [siw]
> >>>>> [  195.222838] Modules linked in: ib_srp(E) scsi_transport_srp(E)
> >>>> target_core_pscsi(E) target_core_file(E) ib_srpt(E)
> target_core_iblock(E)
> >>>> target_core_mod(E) rdma_cm(E) iw_cm(E) ib_cm(E) scsi_debug(E)
> siw(E)
> >>>> null_blk(E) ib_umad(E) ib_uverbs(E) sd_mod(E) sg(E)
> dm_service_time(E)
> >>>> scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) dm_multipath(E)
> ib_core(E)
> >>>> dm_mod(E) nvme_fabrics(E) kvm_intel(E) kvm(E) irqbypass(E)
> >>>> crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E)
> crypto_simd(E)
> >>>> cryptd(E) joydev(E) evdev(E) serio_raw(E) cirrus(E)
> drm_shmem_helper(E)
> >>>> drm_kms_helper(E) virtio_balloon(E) cec(E) i6300esb(E) button(E)
> drm(E)
> >>>> configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E)
> mbcache(E)
> >>>> jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E)
> zstd_compress(E)
> >>>> libcrc32c(E) crc32c_generic(E) virtio_net(E) net_failover(E)
> failover(E)
> >>>> virtio_blk(E) ata_generic(E) uhci_hcd(E) ehci_hcd(E)
> crc32_pclmul(E)
> >>>> crc32c_intel(E) ata_piix(E) psmouse(E) nvme(E) libata(E)
> virtio_pci(E)
> >>>>> [  195.222986]  virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E)
> >>>> usbcore(E) virtio(E) usb_common(E) scsi_mod(E) nvme_core(E)
> i2c_piix4(E)
> >>>> virtio_ring(E) t10_pi(E) scsi_common(E) [last unloaded: null_blk]
> >>>>> [  195.241036] sd 3:0:0:1: [sdd] Attached SCSI disn
> >>>>> [  195.241188] CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded
> >>>> Tainted: G            E     5.17.0-rc7 #1
> >>>>> [  195.246053] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996),
> >>>> BIOS 1.15.0-1 04/01/2014
> >>>>> [  195.249123] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
> >>>>> [  195.251274] RIP: 0010:siw_cep_put+0x125/0x130 [siw]
> >>>>> [  195.253548] Code: bb c0 e8 ae 74 0f d7 48 89 ef 5d 41 5c 41 5d
> e9 b1 d6 ef
> >>>> d6 5d be 03 00 00 00 41 5c 41 5d e9 22 b7 0c d7 0f 0b e9 f3 fe ff
> ff <0f> 0b e9 1c
> >>>> ff ff ff 0f 1f 40 00 0f 1f 44 00 00 55 48 8d 6f 20 53
> >>>>> [  195.258982] RSP: 0018:ffffbc53404ebc98 EFLAGS: 00010286
> >>>>> [  195.261018] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
> >>>> 0000000000000000
> >>>>> [  195.263569] RDX: 0000000000000001 RSI: 0000000000000246 RDI:
> >>>> ffffa03d1102a924
> >>>>> [  195.266151] RBP: ffffa03d1102a900 R08: ffffa03d1102a920 R09:
> >>>> ffffbc53404ebc50
> >>>>> [  195.269150] R10: ffffffff98a060e0 R11: 0000000000000000 R12:
> >>>> ffffa03cc4297000
> >>>>> [  195.272744] R13: ffffa03d2a48aea0 R14: ffffa03d2a48ae78 R15:
> >>>> ffffa03cc427ad58
> >>>>> [  195.275575] FS:  0000000000000000(0000)
> GS:ffffa03df7c80000(0000)
> >>>> knlGS:0000000000000000
> >>>>> [  195.278932] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [  195.280963] CR2: 00005590bc2e4fe8 CR3: 000000008500a004 CR4:
> >>>> 0000000000770ee0
> >>>>> [  195.282803] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>> 0000000000000000
> >>>>> [  195.284650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>>> 0000000000000400
> >>>>> [  195.286522] PKRU: 55555554
> >>>>> [  195.287998] Call Trace:
> >>>>> [  195.289210]  <TASK>
> >>>>> [  195.290969]  siw_reject+0xac/0x180 [siw]
> >>>>> [  195.292679]  iw_cm_reject+0x68/0xc0 [iw_cm]
> >>>>> [  195.294136]  cm_work_handler+0x59d/0xe20 [iw_cm]
> >>>>> [  195.295588]  process_one_work+0x1e2/0x3b0
> >>>>> [  195.298338]  worker_thread+0x50/0x3a0
> >>>>> [  195.300330]  ? rescuer_thread+0x390/0x390
> >>>>> [  195.302269]  kthread+0xe5/0x110
> >>>>> [  195.304062]  ? kthread_complete_and_exit+0x20/0x20
> >>>>> [  195.307612]  ret_from_fork+0x1f/0x30
> >>>>> [  195.309585]  </TASK>
> >>>>> [  195.310674] ---[ end trace 0000000000000000 ]---
> >>>>> [  195.313290] scsi host4: ib_srp: REJ received
> >>>>> [  195.313293] scsi host4:   REJ reason 0xffffff98
> >>>>> [  195.315433] scsi host4: ib_srp: Connection 0/8 to 172.17.8.113
> failed
> >>>>> [  195.472718] ib_srp:srp_parse_in: ib_srp: 172.17.8.113 ->
> 172.17.8.113:0
> >>>>> [  195.472739] ib_srp:srp_parse_in: ib_srp: 172.17.8.113:5555 ->
> >>>> 172.17.8.113:5555
> >>>>> [  195.472807] ib_srp:srp_parse_in: ib_srp:
> [fe80::5054:ff:fe5b:90dc%3] ->
> >>>> [fe80::5054:ff:fe5b:90dc]:0/202442865%3
> >>>>>> [0] INVALID URI REMOVED
> >>>> 3A__github.com_mcgrof_kdevops&d=DwIGaQ&c=jf_iaSHvJObTbx-
> >>>> siA1ZOg&r=2TaYXQ0T-
> >>>> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=7dWDVPFaNFXoRqokXmPFFy
> >>>> XkVL2yItLNzYUDfM4ULTg&s=1ezv_qa-
> >>>> ujLTftm7OxJ5xNZuoKrc70DJPBDccqZokbY&e=
> >>>>>    >    Luis
> >>>>
> >>>> Hi, Bernard
> >>>>
> >>>> I reproduced this issue, and it looks like a condition race between
> >>>> 'cm_work_handler' and 'siw_cm_work_handler'.
> >>>>
> >>>> ----------------------------------------------------------------
> >>>>     Thread0:                         Thread1:
> >>>>     siw_cm_work_handler              cm_work_handler
> >>>> ----------------------------------------------------------------
> >>>> step0:
> >>>> siw_cm_upcall with
> >>>> IW_CM_EVENT_CONNECT_REQUEST
> >>>>
> >>>>                                ===> cm_conn_req_handler
> >>>>                                       ...
> >>>>                                         cm_id->cm_handler (failed)
> >>>>                                         iw_cm_reject
> >>>>                                              siw_reject
> >>>>
> >>>> *step1*:
> >>>> detach cep with listen_cep
> >>>> ----------------------------------------------------------------
> >>>>
> >>>> When siw_reject is called in cm_work_handler, the related cep may
> have
> >>>> not been detached with its listen_cep, through the two steps are
> very
> >>>> close.
> >>>>
> >>>> I think one simple way to fix this issue is keep step1 under
> >>>> siw_cep_set_inuse's protection, and this will make siw_reject will
> be
> >>>> pending util siw_cm_work_handler release the lock:
> >>>>
> >>>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
> >>>> b/drivers/infiniband/sw/siw/siw_cm.c
> >>>> index 7acdd3c3a599..f033b6da1e9f 100644
> >>>> --- a/drivers/infiniband/sw/siw/siw_cm.c
> >>>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
> >>>> @@ -968,13 +968,15 @@ static void siw_accept_newconn(struct siw_cep
> >>>> *cep)
> >>>>
> >>>>                    siw_cep_set_inuse(new_cep);
> >>>>                    rv = siw_proc_mpareq(new_cep);
> >>>> -               siw_cep_set_free(new_cep);
> >>>>
> >>>>                    if (rv != -EAGAIN) {
> >>>>                            siw_cep_put(cep);
> >>>>                            new_cep->listen_cep = NULL;
> >>>> +                       siw_cep_set_free(new_cep);
> >>>>                            if (rv)
> >>>>                                    goto error;
> >>>> +               } else {
> >>>> +                       siw_cep_set_free(new_cep);
> >>>>                    }
> >>>>            }
> >>>>            return;
> >>>>
> >>>> Thanks,
> >>>> Cheng Xu
> >>>
> >>> Hi Cheng, many thanks for looking into it! Unfortunately
> >>> I am out next 12 days until May. I will immediately look into
> >>> it when back. Your explanation sounds reasonable, but I'd
> >>> like to fully understand.
> >> I'd like to send a patch to fix this. When you back, you can review
> this
> >> issue and the patch.
> >>
> >> Was it fixing the issue for you?
> >>
> >> Sure, With this change, the WARN in dmesg does not appear any more in
> >> my tests.
> >>
> >> Thanks,
> >> Cheng Xu
> >
> > *poke*
> >
> > Would be good to get a fix merged. And if a patch is posted does this
> > need to go to stable?
> >
> >    Luis
> 
> The patch has been accepted and merged to for-rc, see:
> 
> INVALID URI REMOVED
> 3A__lore.kernel.org_all_d528d83466c44687f3872eadcb8c184528b2e2d4.1650526
> 554.git.chengyou-40linux.alibaba.com_T_&d=DwICaQ&c=jf_iaSHvJObTbx-
> siA1ZOg&r=2TaYXQ0T-r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=gj2AyKoOM_k9fYF-
> _XQ4HcYw_viOIwl6lDNPHqp7L1y2OiVRWvZkTFGFHSSZInor&s=P_HaXIXt9mBbCeBNBLsWe
> RTz5hvnUGUvObzs8lowzCM&e=
> 
> I think this patch need not be merged back to stable, because the issue
> is not a functional problem, but only produce a WARN in dmesg.
> 
> Thanks,
> Cheng Xu

I agree. It does not fix a memory leak or some such.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
  2022-05-05 11:42       ` Bernard Metzler
@ 2022-05-05 12:48         ` Luis Chamberlain
  0 siblings, 0 replies; 7+ messages in thread
From: Luis Chamberlain @ 2022-05-05 12:48 UTC (permalink / raw)
  To: Bernard Metzler
  Cc: Cheng Xu, Bart Van Assche, linux-block, linux-rdma,
	Pankaj Raghav, Pankaj Raghav

On Thu, May 05, 2022 at 11:42:55AM +0000, Bernard Metzler wrote:
> 
> > -----Original Message-----
> > >
> > > *poke*
> > >
> > > Would be good to get a fix merged. And if a patch is posted does this
> > > need to go to stable?
> > >
> > >    Luis
> > 
> > The patch has been accepted and merged to for-rc, see:
> > 
> > INVALID URI REMOVED
> > 3A__lore.kernel.org_all_d528d83466c44687f3872eadcb8c184528b2e2d4.1650526
> > 554.git.chengyou-40linux.alibaba.com_T_&d=DwICaQ&c=jf_iaSHvJObTbx-
> > siA1ZOg&r=2TaYXQ0T-r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=gj2AyKoOM_k9fYF-
> > _XQ4HcYw_viOIwl6lDNPHqp7L1y2OiVRWvZkTFGFHSSZInor&s=P_HaXIXt9mBbCeBNBLsWe
> > RTz5hvnUGUvObzs8lowzCM&e=
> > 
> > I think this patch need not be merged back to stable, because the issue
> > is not a functional problem, but only produce a WARN in dmesg.
> > 
> > Thanks,
> > Cheng Xu
> 
> I agree. It does not fix a memory leak or some such.

If the warning triggers on older kernels it means testing using this
driver will fail and those tests will be skipped. In this case the
test srp/002 would be skipped unless this is fixed to not trigger
a kernel warning.

  Luis

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7
@ 2022-05-05 13:40 Bernard Metzler
  0 siblings, 0 replies; 7+ messages in thread
From: Bernard Metzler @ 2022-05-05 13:40 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Cheng Xu, Bart Van Assche, linux-block, linux-rdma,
	Pankaj Raghav, Pankaj Raghav



> -----Original Message-----
> From: Luis Chamberlain <mcgrof@infradead.org> On Behalf Of Luis
> Chamberlain
> Sent: Thursday, 5 May 2022 14:48
> To: Bernard Metzler <BMT@zurich.ibm.com>
> Cc: Cheng Xu <chengyou@linux.alibaba.com>; Bart Van Assche
> <bvanassche@acm.org>; linux-block@vger.kernel.org; linux-
> rdma@vger.kernel.org; Pankaj Raghav <pankydev8@gmail.com>; Pankaj Raghav
> <p.raghav@samsung.com>
> Subject: [EXTERNAL] Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel
> warning while testing blktests srp/002 v5.17-rc7
> 
> On Thu, May 05, 2022 at 11:42:55AM +0000, Bernard Metzler wrote:
> >
> > > -----Original Message-----
> > > >
> > > > *poke*
> > > >
> > > > Would be good to get a fix merged. And if a patch is posted does
> this
> > > > need to go to stable?
> > > >
> > > >    Luis
> > >
> > > The patch has been accepted and merged to for-rc, see:
> > >
> > > INVALID URI REMOVED
> > >
> 3A__lore.kernel.org_all_d528d83466c44687f3872eadcb8c184528b2e2d4.1650526
> > > 554.git.chengyou-40linux.alibaba.com_T_&d=DwICaQ&c=jf_iaSHvJObTbx-
> > > siA1ZOg&r=2TaYXQ0T-
> r8ZO1PP1alNwU_QJcRRLfmYTAgd3QCvqSc&m=gj2AyKoOM_k9fYF-
> > >
> _XQ4HcYw_viOIwl6lDNPHqp7L1y2OiVRWvZkTFGFHSSZInor&s=P_HaXIXt9mBbCeBNBLsWe
> > > RTz5hvnUGUvObzs8lowzCM&e=
> > >
> > > I think this patch need not be merged back to stable, because the
> issue
> > > is not a functional problem, but only produce a WARN in dmesg.
> > >
> > > Thanks,
> > > Cheng Xu
> >
> > I agree. It does not fix a memory leak or some such.
> 
> If the warning triggers on older kernels it means testing using this
> driver will fail and those tests will be skipped. In this case the
> test srp/002 would be skipped unless this is fixed to not trigger
> a kernel warning.
> 
>   Luis


Ah okay.
Since it was an easy fix, I'd go for merging back to stable,
if it's not too much of an effort.

Thanks!
Bernard.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-05-05 13:41 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-19 15:53 Re: siw_cm.c:255 siw_cep_put+0x125/0x130 kernel warning while testing blktests srp/002 v5.17-rc7 Bernard Metzler
2022-04-21  2:02 ` Cheng Xu
2022-05-04 20:40   ` Luis Chamberlain
2022-05-05  8:38     ` Cheng Xu
2022-05-05 11:42       ` Bernard Metzler
2022-05-05 12:48         ` Luis Chamberlain
2022-05-05 13:40 Bernard Metzler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).