* "memory management error" with NFS/RDMA on RoCE
@ 2017-06-22 18:28 Chuck Lever
       [not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2017-06-22 18:28 UTC (permalink / raw)
  To: linux-rdma

While running xfstests on an NFS/RDMA mount, I see this in
the client's /var/log/messages multiple times:

Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)

As far as I can tell the client is able to recover and continue
the test. However, this error is not supposed to happen in normal
operation.

This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
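
For anyone reading the dump above, here is a minimal sketch of how the
error CQE might be decoded. It assumes the 64-byte layout of
struct mlx5_err_cqe from include/linux/mlx5/device.h (vendor_err_synd at
byte 54, syndrome at byte 55, then s_wqe_opcode_qpn, wqe_counter,
signature and op_own); the helper name and the dictionary keys are
made up for illustration, so check the offsets against your own tree.

```python
import struct

# The four rows printed under "dump error cqe" in the report above.
DUMP_ROWS = [
    "00000000 00000000 00000000 00000000",
    "00000000 00000000 00000000 00000000",
    "00000000 00000000 00000000 00000000",
    "00000000 08007806 250000cd 024027d3",
]


def decode_err_cqe(rows):
    """Decode a 64-byte mlx5 error CQE dump into named fields.

    Assumed layout (struct mlx5_err_cqe): bytes 0..53 are reserved or
    srqn, followed by vendor_err_synd, syndrome, s_wqe_opcode_qpn,
    wqe_counter, signature and op_own, all big-endian.
    """
    raw = bytes.fromhex("".join(rows).replace(" ", ""))
    if len(raw) != 64:
        raise ValueError("expected a 64-byte CQE, got %d bytes" % len(raw))

    vendor_err, syndrome, opcode_qpn, wqe_counter, signature, op_own = (
        struct.unpack_from(">BBIHBB", raw, 54)
    )
    return {
        "vendor_err_synd": vendor_err,
        "syndrome": syndrome,
        "wqe_opcode": opcode_qpn >> 24,    # top byte of s_wqe_opcode_qpn
        "qpn": opcode_qpn & 0xFFFFFF,      # low 24 bits
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,         # high nibble of op_own
    }


for name, value in decode_err_cqe(DUMP_ROWS).items():
    print(f"{name:16s} 0x{value:x}")
```

Under that assumed layout, the syndrome and vendor syndrome bytes come
out as 0x06 and 0x78, which matches the "(6/0x78)" in the rpcrdma
message. The WQE opcode byte is 0x25, which I believe is the UMR opcode
used for fast registration, but treat that reading as unverified.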


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: "memory management error" with NFS/RDMA on RoCE
       [not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-06-22 20:57   ` Robert LeBlanc
  2017-06-27  9:28   ` Sagi Grimberg
  1 sibling, 0 replies; 12+ messages in thread
From: Robert LeBlanc @ 2017-06-22 20:57 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma

On Thu, Jun 22, 2017 at 12:28 PM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> While running xfstests on an NFS/RDMA mount, I see this in
> the client's /var/log/messages multiple times:
>
> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>
> As far as I can tell the client is able to recover and continue
> the test. However, this error is not supposed to happen in normal
> operation.
>
> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>
>
> --
> Chuck Lever

Surprisingly, I hit this today on iSER too, but on 4.9 with a CX4 in RoCEv2 mode over IPv6:

[Thu Jun 22 14:37:11 2017] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
[Thu Jun 22 14:37:11 2017] 00000000 00000000 00000000 00000000
[Thu Jun 22 14:37:11 2017] 00000000 00000000 00000000 00000000
[Thu Jun 22 14:37:11 2017] 00000000 00000000 00000000 00000000
[Thu Jun 22 14:37:11 2017] 00000000 08007806 2500011f ca4f0fd3
[Thu Jun 22 14:37:11 2017] iser: iser_err_comp: memreg failure: memory management operation error (6) vend_err 78
[Thu Jun 22 14:37:11 2017]  connection4:0: detected conn error (1011)

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

* Re: "memory management error" with NFS/RDMA on RoCE
       [not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2017-06-22 20:57   ` Robert LeBlanc
@ 2017-06-27  9:28   ` Sagi Grimberg
       [not found]     ` <797a43c4-f30d-9deb-a332-c62cbd01be7b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2017-06-27  9:28 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma


> While running xfstests on an NFS/RDMA mount, I see this in
> the client's /var/log/messages multiple times:
> 
> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> 
> As far as I can tell the client is able to recover and continue
> the test. However, this error is not supposed to happen in normal
> operation.
> 
> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.

Is this a regression? What kernel version are you running?
FW revision?

Is the below commit applied?
commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date:   Sun May 28 10:53:11 2017 +0300

     RDMA/mlx5: set UMR wqe fence according to HCA cap

     Cache the needed umr_fence and set the wqe ctrl segmennt
     accordingly.

     Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
     Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
     Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
     Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

This is the only thing that changed in that area
lately...

Can you try without it?

* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]     ` <797a43c4-f30d-9deb-a332-c62cbd01be7b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-06-27 14:56       ` Chuck Lever
       [not found]         ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2017-06-27 14:56 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-rdma

Hi Sagi-

> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>> While running xfstests on an NFS/RDMA mount, I see this in
>> the client's /var/log/messages multiple times:
>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>> As far as I can tell the client is able to recover and continue
>> the test. However, this error is not supposed to happen in normal
>> operation.
>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> 
> Is this a regression?

I can't answer that question with authority, because I just
started trying out NFS/RDMA on RoCE with mlx5. But Robert has
reported very similar symptoms with iSER on v4.9. It appears
to have been around for a while, if these are the same.


> What kernel version are you running?

v4.12-rc2.


> FW revision?

12.18.2000


> Is the below commit applied?

This commit does not appear to be applied to my kernel.


> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Date:   Sun May 28 10:53:11 2017 +0300
> 
>    RDMA/mlx5: set UMR wqe fence according to HCA cap
> 
>    Cache the needed umr_fence and set the wqe ctrl segmennt
>    accordingly.
> 
>    Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>    Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>    Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>    Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 
> This is the only thing that changed in that area
> lately...
> 
> Can you try without it?

I haven't tried with it. I can pull it and see if it helps.

I have tried:

- with and without IOMMU enabled
- with RoCE v1 and v2
- with instrumentation:

This can happen to any MR at any time after any number of
uses. It does not appear to be "sticky" (ie, xprtrdma
recovery from a memory management error clears the problem
successfully by releasing the MR and allocating a new one).

So it feels like a f/w or driver problem to me, at this
point.

--
Chuck Lever




* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]         ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-06-27 16:08           ` Sagi Grimberg
       [not found]             ` <a82056d7-5685-5b85-8226-c54065e729fe-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  2017-06-27 17:36           ` Leon Romanovsky
  2017-08-08 15:45           ` Max Gurtovoy
  2 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2017-06-27 16:08 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma


> Hi Sagi-
> 
>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>
>>
>>> While running xfstests on an NFS/RDMA mount, I see this in
>>> the client's /var/log/messages multiple times:
>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>> As far as I can tell the client is able to recover and continue
>>> the test. However, this error is not supposed to happen in normal
>>> operation.
>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>
>> Is this a regression?
> 
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.

Is Robert running 4.9 on the initiator side?

* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]             ` <a82056d7-5685-5b85-8226-c54065e729fe-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-06-27 17:03               ` Robert LeBlanc
  0 siblings, 0 replies; 12+ messages in thread
From: Robert LeBlanc @ 2017-06-27 17:03 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Chuck Lever, linux-rdma

On Tue, Jun 27, 2017 at 10:08 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>> Hi Sagi-
>>
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>>
>>>
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>>
>>>
>>> Is this a regression?
>>
>>
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>
>
> Is Robert running 4.9 on the initiator side?

Yes, this was 4.9 on the initiator side.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]         ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2017-06-27 16:08           ` Sagi Grimberg
@ 2017-06-27 17:36           ` Leon Romanovsky
       [not found]             ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  2017-08-08 15:45           ` Max Gurtovoy
  2 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2017-06-27 17:36 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Sagi Grimberg, linux-rdma

On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> Hi Sagi-
>
> > On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> >
> >
> >> While running xfstests on an NFS/RDMA mount, I see this in
> >> the client's /var/log/messages multiple times:
> >> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >> As far as I can tell the client is able to recover and continue
> >> the test. However, this error is not supposed to happen in normal
> >> operation.
> >> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> >
> > Is this a regression?
>
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.
>
>
> > What kernel version are you running?
>
> v4.12-rc2.
>
>
> > FW revision?
>
> 12.18.2000
>
>
> > Is the below commit applied?
>
> This commit does not appear to be applied to my kernel.
>
>
> > commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> > Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Date:   Sun May 28 10:53:11 2017 +0300
> >
> >    RDMA/mlx5: set UMR wqe fence according to HCA cap
> >
> >    Cache the needed umr_fence and set the wqe ctrl segmennt
> >    accordingly.
> >
> >    Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >    Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> >    Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> >    Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >
> > This is the only thing that changed in that area
> > lately...
> >
> > Can you try without it?
>
> I haven't tried with it. I can pull it and see if it helps.
>
> I have tried:
>
> - with and without IOMMU enabled
> - with RoCE v1 and v2
> - with instrumentation:
>
> This can happen to any MR at any time after any number of
> uses. It does not appear to be "sticky" (ie, xprtrdma
> recovery from a memory management error clears the problem
> successfully by releasing the MR and allocating a new one).
>
> So it feels like a f/w or driver problem to me, at this
> point.

Jack and I discussed your issue, and we have a strong
feeling that it is FW.

Thanks

>
> --
> Chuck Lever
>
>
>

* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]             ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-06-27 19:30               ` Chuck Lever
  2017-07-05 14:40               ` Chuck Lever
  1 sibling, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2017-06-27 19:30 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Sagi Grimberg, linux-rdma


> On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 
> On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
>> Hi Sagi-
>> 
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>> 
>>> 
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>> 
>>> Is this a regression?
>> 
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>> 
>> 
>>> What kernel version are you running?
>> 
>> v4.12-rc2.
>> 
>> 
>>> FW revision?
>> 
>> 12.18.2000
>> 
>> 
>>> Is the below commit applied?
>> 
>> This commit does not appear to be applied to my kernel.
>> 
>> 
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Date:   Sun May 28 10:53:11 2017 +0300
>>> 
>>>   RDMA/mlx5: set UMR wqe fence according to HCA cap
>>> 
>>>   Cache the needed umr_fence and set the wqe ctrl segmennt
>>>   accordingly.
>>> 
>>>   Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>   Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>>   Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>>   Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> 
>>> This is the only thing that changed in that area
>>> lately...
>>> 
>>> Can you try without it?
>> 
>> I haven't tried with it. I can pull it and see if it helps.
>> 
>> I have tried:
>> 
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>> 
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
>> 
>> So it feels like a f/w or driver problem to me, at this
>> point.
> 
> Jack and me discussed your issue tomorrow morning and we have strong
> feeling that it is FW.

A little more information, though I'm not sure it is helpful:

The flushed FastReg WRs are all for IB_ACCESS_REMOTE_READ.


> Thanks
> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 

--
Chuck Lever




* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]             ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  2017-06-27 19:30               ` Chuck Lever
@ 2017-07-05 14:40               ` Chuck Lever
       [not found]                 ` <06510488-DB16-4781-8E5A-FDFFDDD00B4F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2017-07-05 14:40 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Sagi Grimberg, linux-rdma


> On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 
> On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
>> Hi Sagi-
>> 
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>> 
>>> 
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>> 
>>> Is this a regression?
>> 
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>> 
>> 
>>> What kernel version are you running?
>> 
>> v4.12-rc2.
>> 
>> 
>>> FW revision?
>> 
>> 12.18.2000
>> 
>> 
>>> Is the below commit applied?
>> 
>> This commit does not appear to be applied to my kernel.
>> 
>> 
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Date:   Sun May 28 10:53:11 2017 +0300
>>> 
>>>   RDMA/mlx5: set UMR wqe fence according to HCA cap
>>> 
>>>   Cache the needed umr_fence and set the wqe ctrl segmennt
>>>   accordingly.
>>> 
>>>   Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>   Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>>   Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>>   Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> 
>>> This is the only thing that changed in that area
>>> lately...
>>> 
>>> Can you try without it?
>> 
>> I haven't tried with it. I can pull it and see if it helps.
>> 
>> I have tried:
>> 
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>> 
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
>> 
>> So it feels like a f/w or driver problem to me, at this
>> point.
> 
> Jack and me discussed your issue tomorrow morning and we have strong
> feeling that it is FW.

Hi Leon-

Who is going to drive this issue to resolution? Do you need me
to do something?


> Thanks
> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 

--
Chuck Lever




* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]                 ` <06510488-DB16-4781-8E5A-FDFFDDD00B4F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-07-05 15:29                   ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2017-07-05 15:29 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Sagi Grimberg, linux-rdma

On Wed, Jul 05, 2017 at 10:40:41AM -0400, Chuck Lever wrote:
>
> > On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >
> > On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> >> Hi Sagi-
> >>
> >>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> >>>
> >>>
> >>>> While running xfstests on an NFS/RDMA mount, I see this in
> >>>> the client's /var/log/messages multiple times:
> >>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >>>> As far as I can tell the client is able to recover and continue
> >>>> the test. However, this error is not supposed to happen in normal
> >>>> operation.
> >>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> >>>
> >>> Is this a regression?
> >>
> >> I can't answer that question with authority, because I just
> >> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> >> reported very similar symptoms with iSER on v4.9. It appears
> >> to have been around for a while, if these are the same.
> >>
> >>
> >>> What kernel version are you running?
> >>
> >> v4.12-rc2.
> >>
> >>
> >>> FW revision?
> >>
> >> 12.18.2000
> >>
> >>
> >>> Is the below commit applied?
> >>
> >> This commit does not appear to be applied to my kernel.
> >>
> >>
> >>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> >>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>> Date:   Sun May 28 10:53:11 2017 +0300
> >>>
> >>>   RDMA/mlx5: set UMR wqe fence according to HCA cap
> >>>
> >>>   Cache the needed umr_fence and set the wqe ctrl segmennt
> >>>   accordingly.
> >>>
> >>>   Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>>   Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> >>>   Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> >>>   Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>
> >>> This is the only thing that changed in that area
> >>> lately...
> >>>
> >>> Can you try without it?
> >>
> >> I haven't tried with it. I can pull it and see if it helps.
> >>
> >> I have tried:
> >>
> >> - with and without IOMMU enabled
> >> - with RoCE v1 and v2
> >> - with instrumentation:
> >>
> >> This can happen to any MR at any time after any number of
> >> uses. It does not appear to be "sticky" (ie, xprtrdma
> >> recovery from a memory management error clears the problem
> >> successfully by releasing the MR and allocating a new one).
> >>
> >> So it feels like a f/w or driver problem to me, at this
> >> point.
> >
> > Jack and me discussed your issue tomorrow morning and we have strong
> > feeling that it is FW.
>
> Hi Leon-
>
> Who is going to drive this issue to resolution? Do you need me
> to do something?

I don't think so; Jack was supposed to do it.

>
>
> > Thanks
> >
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
>
> --
> Chuck Lever
>
>
>


* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]         ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2017-06-27 16:08           ` Sagi Grimberg
  2017-06-27 17:36           ` Leon Romanovsky
@ 2017-08-08 15:45           ` Max Gurtovoy
       [not found]             ` <7ef3ca44-1253-7aae-1b46-f78cc15e627d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 12+ messages in thread
From: Max Gurtovoy @ 2017-08-08 15:45 UTC (permalink / raw)
  To: Chuck Lever, Sagi Grimberg; +Cc: linux-rdma

Hi all,
sorry for the late response.

On 6/27/2017 5:56 PM, Chuck Lever wrote:
> Hi Sagi-
>
>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>
>>
>>> While running xfstests on an NFS/RDMA mount, I see this in
>>> the client's /var/log/messages multiple times:
>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>> As far as I can tell the client is able to recover and continue
>>> the test. However, this error is not supposed to happen in normal
>>> operation.
>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>
>> Is this a regression?
>
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.
>
>
>> What kernel version are you running?
>
> v4.12-rc2.
>
>
>> FW revision?
>
> 12.18.2000
>
>
>> Is the below commit applied?
>
> This commit does not appear to be applied to my kernel.
>
>
>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Date:   Sun May 28 10:53:11 2017 +0300
>>
>>    RDMA/mlx5: set UMR wqe fence according to HCA cap
>>
>>    Cache the needed umr_fence and set the wqe ctrl segmennt
>>    accordingly.
>>
>>    Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>    Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>    Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>    Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> This is the only thing that changed in that area
>> lately...
>>
>> Can you try without it?
>
> I haven't tried with it. I can pull it and see if it helps.

Chuck,
any updates using my patch above? (Actually, you also need commit
1410a90ae449061b7e1ae19d275148f36948801b as a precondition.)


>
> I have tried:
>
> - with and without IOMMU enabled
> - with RoCE v1 and v2
> - with instrumentation:
>
> This can happen to any MR at any time after any number of
> uses. It does not appear to be "sticky" (ie, xprtrdma
> recovery from a memory management error clears the problem
> successfully by releasing the MR and allocating a new one).
>

I'm not very familiar with the NFS/RDMA I/O path yet, but are you using
remote invalidation from the server side, or do you run local invalidation?
Which side initiates the RDMA_READ/WRITE operations?

> So it feels like a f/w or driver problem to me, at this
> point.
>
> --
> Chuck Lever

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: "memory management error" with NFS/RDMA on RoCE
       [not found]             ` <7ef3ca44-1253-7aae-1b46-f78cc15e627d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-08-08 16:14               ` Chuck Lever
  0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2017-08-08 16:14 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: Sagi Grimberg, linux-rdma


> On Aug 8, 2017, at 11:45 AM, Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> 
> Hi all,
> sorry for the late response.
> 
> On 6/27/2017 5:56 PM, Chuck Lever wrote:
>> Hi Sagi-
>> 
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>> 
>>> 
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>> 
>>> Is this a regression?
>> 
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>> 
>> 
>>> What kernel version are you running?
>> 
>> v4.12-rc2.
>> 
>> 
>>> FW revision?
>> 
>> 12.18.2000
>> 
>> 
>>> Is the below commit applied?
>> 
>> This commit does not appear to be applied to my kernel.
>> 
>> 
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Date:   Sun May 28 10:53:11 2017 +0300
>>> 
>>>   RDMA/mlx5: set UMR wqe fence according to HCA cap
>>> 
>>>   Cache the needed umr_fence and set the wqe ctrl segmennt
>>>   accordingly.
>>> 
>>>   Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>   Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>>   Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>>   Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> 
>>> This is the only thing that changed in that area
>>> lately...
>>> 
>>> Can you try without it?
>> 
>> I haven't tried with it. I can pull it and see if it helps.
> 
> Chuck,
> any updates using my patch above? (Actually, you also need commit
> 1410a90ae449061b7e1ae19d275148f36948801b as a precondition.)

My client is at v4.13-rc3 now, and I haven't seen this issue recur
recently.


>> I have tried:
>> 
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>> 
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
>> 
> 
> I'm not very familiar with the NFS/RDMA I/O path yet, but are you using
> remote invalidation from the server side, or do you run local invalidation?
> Which side initiates the RDMA_READ/WRITE operations?

Remote Invalidation should be in use, but I haven't confirmed that.

The storage target (the NFS server) issues the RDMA Read and
RDMA Write operations.


>> So it feels like a f/w or driver problem to me, at this
>> point.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 12+ messages in thread
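
For reference, the error CQE quoted in this thread can be decoded by hand.
What follows is a minimal sketch in Python, not code from any poster. It
assumes that dump_cqe prints the 64-byte CQE as sixteen big-endian 32-bit
words, and that the tail of the entry follows the struct mlx5_err_cqe layout
in include/linux/mlx5/device.h (vendor_err_synd at byte 54, followed by
syndrome, s_wqe_opcode_qpn, wqe_counter, signature and op_own). Both are
assumptions about the mlx5 driver that should be checked against the kernel
tree in use. The function name and dictionary keys are hypothetical.

```python
import struct

# The four rows of the error CQE exactly as they appear in the log.
DUMP = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 08007806 250000cd 024027d3
"""


def decode_err_cqe(dump):
    """Decode the tail of a dumped 64-byte mlx5 error CQE.

    Assumed layout (struct mlx5_err_cqe): byte 54 vendor_err_synd,
    byte 55 syndrome, bytes 56-59 s_wqe_opcode_qpn, bytes 60-61
    wqe_counter, byte 62 signature, byte 63 op_own.
    """
    words = [int(w, 16) for w in dump.split()]
    if len(words) != 16:
        raise ValueError("expected 16 32-bit words, got %d" % len(words))

    # Rebuild the raw bytes; the words are assumed to be big-endian.
    raw = struct.pack(">16I", *words)

    (vendor_err_synd, syndrome, opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack_from(">BBIHBB", raw, 54)

    return {
        "vendor_err_synd": vendor_err_synd,
        "syndrome": syndrome,
        "wqe_opcode": opcode_qpn >> 24,      # top byte: opcode of the failed WQE
        "qpn": opcode_qpn & 0xFFFFFF,        # low 24 bits: queue pair number
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,           # high nibble of op_own
    }


if __name__ == "__main__":
    for name, value in decode_err_cqe(DUMP).items():
        print("%-16s 0x%x" % (name, value))
```

Under those assumptions the syndrome byte is 0x06 and the vendor syndrome is
0x78, which lines up with the "(6/0x78)" printed by rpcrdma. The WQE opcode
byte is 0x25, which appears to correspond to MLX5_OPCODE_UMR in the mlx5
headers and would be consistent with a failed fast-registration work request.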

end of thread, other threads:[~2017-08-08 16:14 UTC | newest]

Thread overview: 12+ messages
2017-06-22 18:28 "memory management error" with NFS/RDMA on RoCE Chuck Lever
     [not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-22 20:57   ` Robert LeBlanc
2017-06-27  9:28   ` Sagi Grimberg
     [not found]     ` <797a43c4-f30d-9deb-a332-c62cbd01be7b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-27 14:56       ` Chuck Lever
     [not found]         ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-27 16:08           ` Sagi Grimberg
     [not found]             ` <a82056d7-5685-5b85-8226-c54065e729fe-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-27 17:03               ` Robert LeBlanc
2017-06-27 17:36           ` Leon Romanovsky
     [not found]             ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-27 19:30               ` Chuck Lever
2017-07-05 14:40               ` Chuck Lever
     [not found]                 ` <06510488-DB16-4781-8E5A-FDFFDDD00B4F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-05 15:29                   ` Leon Romanovsky
2017-08-08 15:45           ` Max Gurtovoy
     [not found]             ` <7ef3ca44-1253-7aae-1b46-f78cc15e627d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-08-08 16:14               ` Chuck Lever
