* Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Johannes Thumshirn @ 2017-03-08 15:35 UTC
To: Moni Shoua
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
Hi Moni et al.,
I'm getting a KASAN stack-out-of-bounds in rxe_post_send+0xdfe/0x1830
[rdma_rxe] at addr ffff8800187072e8 with v4.11-rc1
rxe_post_send+0xdfe resolves to the following (note: the pr_err was
inserted by me to aid debugging):
(gdb) list *(rxe_post_send+0xdfe)
0x1dc3e is in rxe_post_send (drivers/infiniband/sw/rxe/rxe_verbs.c:765).
760                     pr_err("%s: *_wr(ibwr): %p\n",
761                            __func__, (void *)(mask & WR_ATOMIC_MASK ?
762                                               atomic_wr(ibwr) : rdma_wr(ibwr)));
763
764                     wqe->iova = (mask & WR_ATOMIC_MASK) ?
765                                 atomic_wr(ibwr)->remote_addr :
766                                 rdma_wr(ibwr)->remote_addr;
767                     wqe->mask = mask;
768                     wqe->dma.length = length;
769                     wqe->dma.resid = length;
Coincidentally, ffff8800187072e8 = ibwr + 0x28. ibwr comes from
nvme_rdma_post_send() and has an opcode of IB_WR_SEND (verified). So the
rdma_wr(ibwr) call cannot return a correct/valid parent object (and
neither could atomic_wr(ibwr)).
So much for the easy/mechanical part.
I can special-case IB_WR_SEND in rxe's init_send_wqe(), but I neither
know whether that is correct nor how the wqe elements (especially
wqe->iova) should be set up.
So any help would be appreciated here.
Thanks in advance,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Johannes Thumshirn @ 2017-03-08 16:11 UTC
To: Moni Shoua
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
On 03/08/2017 04:35 PM, Johannes Thumshirn wrote:
> Hi Moni et al.,
>
> I'm getting a KASAN stack-out-of-bounds in rxe_post_send+0xdfe/0x1830
> [rdma_rxe] at addr ffff8800187072e8 with v4.11-rc1
>
> rxe_post_send+0xdfe is the following (note: the pr_err was inserted by
> me to aid debugging).
Quick follow-up to my last mail.
Slamming in the following hunk:
@@ -753,9 +757,18 @@ static int init_send_wqe(struct rxe_qp *qp, struct ib_send_wr *ibwr,
        memcpy(wqe->dma.sge, ibwr->sg_list,
               num_sge * sizeof(struct ib_sge));
-       wqe->iova = (mask & WR_ATOMIC_MASK) ?
-               atomic_wr(ibwr)->remote_addr :
-               rdma_wr(ibwr)->remote_addr;
+
+       if (ibwr->opcode == IB_WR_RDMA_WRITE ||
+           ibwr->opcode == IB_WR_RDMA_WRITE_WITH_IMM ||
+           ibwr->opcode == IB_WR_ATOMIC_CMP_AND_SWP ||
+           ibwr->opcode == IB_WR_ATOMIC_FETCH_AND_ADD)
+               wqe->iova = (mask & WR_ATOMIC_MASK) ?
+                       atomic_wr(ibwr)->remote_addr :
+                       rdma_wr(ibwr)->remote_addr;
+
        wqe->mask = mask;
        wqe->dma.length = length;
        wqe->dma.resid = length;
This gives me:
[ 4.286632] rdma_rxe: qp#17 moved to error state
[ ...hang... ]
[ 64.847464] nvme nvme0: Connect command failed, error wo/DNR bit: 7
[   64.859829] ==================================================================
[   64.861048] BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x12f3/0x1880 [rdma_rxe] at addr ffff88001f787838
Which translates to:
(gdb) list *(rxe_post_send+0x12f3)
0x1e133 is in rxe_post_send (drivers/infiniband/sw/rxe/rxe_verbs.c:685).
680                     switch (wr->opcode) {
681                     case IB_WR_RDMA_WRITE_WITH_IMM:
682                             wr->ex.imm_data = ibwr->ex.imm_data;
683                     case IB_WR_RDMA_READ:
684                     case IB_WR_RDMA_WRITE:
685                             wr->wr.rdma.remote_addr = rdma_wr(ibwr)->remote_addr;
686                             wr->wr.rdma.rkey = rdma_wr(ibwr)->rkey;
687                             break;
688                     case IB_WR_SEND_WITH_IMM:
689                             wr->ex.imm_data = ibwr->ex.imm_data;
* Re: Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Moni Shoua @ 2017-03-08 16:33 UTC
To: Johannes Thumshirn
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
On Wed, Mar 8, 2017 at 5:35 PM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> Hi Moni et al.,
>
> I'm getting a KASAN stack-out-of-bounds in rxe_post_send+0xdfe/0x1830
> [rdma_rxe] at addr ffff8800187072e8 with v4.11-rc1
>
> rxe_post_send+0xdfe is the following (note: the pr_err was inserted by
> me to aid debugging).
>
> (gdb) list *(rxe_post_send+0xdfe)
> 0x1dc3e is in rxe_post_send (drivers/infiniband/sw/rxe/rxe_verbs.c:765).
> 760                     pr_err("%s: *_wr(ibwr): %p\n",
> 761                            __func__, (void *)(mask & WR_ATOMIC_MASK ?
> 762                                               atomic_wr(ibwr) : rdma_wr(ibwr)));
> 763
> 764                     wqe->iova = (mask & WR_ATOMIC_MASK) ?
> 765                                 atomic_wr(ibwr)->remote_addr :
> 766                                 rdma_wr(ibwr)->remote_addr;
> 767                     wqe->mask = mask;
> 768                     wqe->dma.length = length;
> 769                     wqe->dma.resid = length;
>
> Coincidentally, ffff8800187072e8 = ibwr + 0x28. ibwr comes from
> nvme_rdma_post_send() and has an opcode of IB_WR_SEND (verified). So the
> rdma_wr(ibwr) call cannot return a correct/valid parent object (and
> neither could atomic_wr(ibwr)).
>
> So much for the easy/mechanical part.
>
> I can special-case IB_WR_SEND in rxe's init_send_wqe(), but I neither
> know whether that is correct nor how the wqe elements (especially
> wqe->iova) should be set up.
>
> So any help would be appreciated here.
>
> Thanks in advance,
> Johannes
> --
Hi Johannes,
Your report and analysis seem to be accurate (regarding the value of
wqe->iova).
Unfortunately we haven't yet had a chance to run kernel application
tests, but I will try to add them soon so I can debug this myself.
In the meantime:
1. Did the test fail completely, or is it just the KASAN error that
made you look at init_send_wqe()?
2. You can take a look at librxe's implementation of init_send_wqe() (it
looks slightly different from the kernel's) and see what happens if you
change the kernel implementation accordingly.
Thanks,
Moni
* Re: Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Johannes Thumshirn @ 2017-03-09 7:57 UTC
To: Moni Shoua
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
Hi Moni,
On 03/08/2017 05:33 PM, Moni Shoua wrote:
> Your report and analysis seem to be accurate (regarding value of wqe->iova)
> Unfortunately we didn't have a chance yet to run kernel application
> tests but I will try to add them soon and be able to debug it myself.
> In the meantime
OK, thanks. This is highly appreciated as I think Soft RoCE is the
coolest thing since sliced bread for quick NVMf/iSER and SRP tests.
> 1. Did the test fail completely, or is it just the KASAN error that
> made you look at init_send_wqe()?
No, it fails completely. I can see both hosts talking RDMA/NVMf to each
other, but the initiator side can't establish a connection. Without KASAN
I couldn't find a reason for it other than the following log message:
rdma_rxe: qp#17 moved to error state
I must admit I haven't looked at its source yet, as my first test had
KASAN enabled.
> 2. You can take a look at librxe implementation of init_send_wqe() (it
> looks slightly different from kernel's implementation) and see what
> happens if you change implementation accordingly.
OK, I'll have a look and hopefully come back with an (RFC) patch
(fingers crossed).
Thanks,
Johannes
* Re: Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Moni Shoua @ 2017-03-09 10:59 UTC
To: Johannes Thumshirn
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
>> 2. You can take a look at librxe implementation of init_send_wqe() (it
>> looks slightly different from kernel's implementation) and see what
>> happens if you change implementation accordingly.
>
> OK I'll have a look and hopefully come back with a (RFC) patch (fingers
> crossed).
Thanks. Waiting to see what you find; in the meantime I'll try to
reproduce it in my setup.
* Re: Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Johannes Thumshirn @ 2017-03-09 11:04 UTC
To: Moni Shoua
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
On 03/09/2017 11:59 AM, Moni Shoua wrote:
>>> 2. You can take a look at librxe implementation of init_send_wqe() (it
>>> looks slightly different from kernel's implementation) and see what
>>> happens if you change implementation accordingly.
>>
>> OK I'll have a look and hopefully come back with a (RFC) patch (fingers
>> crossed).
>
> Thanks. Waiting to see what you found and in the meantime I'll try to
> reproduce in my setup
From what I can see, librxe will have the same problem. The affected
code is more or less a copy of the kernel version.
I'm currently skimming through mlx5/qp.c trying to understand the
known-good code path.
Byte,
Johannes
* Re: Need some pointers to debug a KASAN splat in NVMe over Fabrics with rdma-rxe
From: Johannes Thumshirn @ 2017-03-21 13:12 UTC
To: Moni Shoua
Cc: Linux NVMe Mailinglist, linux-rdma, Sagi Grimberg, Leon Romanovsky, Christoph Hellwig
On Thu, Mar 09, 2017 at 12:59:31PM +0200, Moni Shoua wrote:
> >> 2. You can take a look at librxe implementation of init_send_wqe() (it
> >> looks slightly different from kernel's implementation) and see what
> >> happens if you change implementation accordingly.
> >
> > OK I'll have a look and hopefully come back with a (RFC) patch (fingers
> > crossed).
>
> Thanks. Waiting to see what you found and in the meantime I'll try to
> reproduce in my setup
Hi Moni,
I traced the NVMf-with-RXE connection problems (minus the KASAN splats)
down to the check_rkey() function. It bails out in this code:
	if (pkt->mask & RXE_WRITE_MASK) {
		if (resid > mtu) {
			if (pktlen != mtu || bth_pad(pkt)) {
				state = RESPST_ERR_LENGTH;
				goto err;
			}
			qp->resp.resid = mtu;
		} else {
			if (pktlen != resid) {
This even happens if I set the MTU to 9000.
I instrumented the driver to get some more debug information out of it and
here's the last output I see before nvme enters an error state and reconnects:
rdma_rxe: write_data_in: data_len: 4096, qp->resp.resid: 4096
rdma_rxe: check_rkey: mtu: 4096, resid: 12288, pktlen: 4096
rdma_rxe: write_data_in: data_len: 4096, qp->resp.resid: 4096
rdma_rxe: check_rkey: mtu: 4096, resid: 0, pktlen: 4096
rdma_rxe: qp#19 moved to error state
nvme nvme0: RECV for CQE 0xffff88001f0ef240 failed with status WR flushed (5)
qp->resp.resid comes from this hunk in check_rkey():
	if (pkt->mask & (RXE_READ_MASK | RXE_WRITE_MASK)) {
		if (pkt->mask & RXE_RETH_MASK) {
			qp->resp.va = reth_va(pkt);
			qp->resp.rkey = reth_rkey(pkt);
			qp->resp.resid = reth_len(pkt);
So I suppose reth_len() has some hiccups here. I'll continue debugging and
report back any new findings.
Byte,
Johannes