* "memory management error" with NFS/RDMA on RoCE
@ 2017-06-22 18:28 Chuck Lever
[not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2017-06-22 18:28 UTC (permalink / raw)
To: linux-rdma
While running xfstests on an NFS/RDMA mount, I see this in
the client's /var/log/messages multiple times:
Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
As far as I can tell the client is able to recover and continue
the test. However, this error is not supposed to happen in normal
operation.
This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
--
Chuck Lever
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-06-22 20:57 ` Robert LeBlanc
2017-06-27 9:28 ` Sagi Grimberg
1 sibling, 0 replies; 12+ messages in thread
From: Robert LeBlanc @ 2017-06-22 20:57 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-rdma
On Thu, Jun 22, 2017 at 12:28 PM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> While running xfstests on an NFS/RDMA mount, I see this in
> the client's /var/log/messages multiple times:
>
> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>
> As far as I can tell the client is able to recover and continue
> the test. However, this error is not supposed to happen in normal
> operation.
>
> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>
>
> --
> Chuck Lever
Surprisingly, I hit this today on iSER as well, but on 4.9, using a CX4 in RoCEv2 mode over IPv6:
[Thu Jun 22 14:37:11 2017] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
[Thu Jun 22 14:37:11 2017] 00000000 00000000 00000000 00000000
[Thu Jun 22 14:37:11 2017] 00000000 00000000 00000000 00000000
[Thu Jun 22 14:37:11 2017] 00000000 00000000 00000000 00000000
[Thu Jun 22 14:37:11 2017] 00000000 08007806 2500011f ca4f0fd3
[Thu Jun 22 14:37:11 2017] iser: iser_err_comp: memreg failure: memory
management operation error (6) vend_err 78
[Thu Jun 22 14:37:11 2017] connection4:0: detected conn error (1011)
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-22 20:57 ` Robert LeBlanc
@ 2017-06-27 9:28 ` Sagi Grimberg
[not found] ` <797a43c4-f30d-9deb-a332-c62cbd01be7b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2017-06-27 9:28 UTC (permalink / raw)
To: Chuck Lever, linux-rdma
> While running xfstests on an NFS/RDMA mount, I see this in
> the client's /var/log/messages multiple times:
>
> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>
> As far as I can tell the client is able to recover and continue
> the test. However, this error is not supposed to happen in normal
> operation.
>
> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
Is this a regression? What kernel version are you running?
FW revision?
Is the below commit applied?
commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Date: Sun May 28 10:53:11 2017 +0300
RDMA/mlx5: set UMR wqe fence according to HCA cap
Cache the needed umr_fence and set the wqe ctrl segmennt
accordingly.
Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
This is the only thing that changed in that area
lately...
Can you try without it?
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <797a43c4-f30d-9deb-a332-c62cbd01be7b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-06-27 14:56 ` Chuck Lever
[not found] ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2017-06-27 14:56 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: linux-rdma
Hi Sagi-
> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>
>
>> While running xfstests on an NFS/RDMA mount, I see this in
>> the client's /var/log/messages multiple times:
>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>> As far as I can tell the client is able to recover and continue
>> the test. However, this error is not supposed to happen in normal
>> operation.
>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>
> Is this a regression?
I can't answer that question with authority, because I just
started trying out NFS/RDMA on RoCE with mlx5. But Robert has
reported very similar symptoms with iSER on v4.9. It appears
to have been around for a while, if these are the same.
> What kernel version are you running?
v4.12-rc2.
> FW revision?
12.18.2000
> Is the below commit applied?
This commit does not appear to be applied to my kernel.
> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Date: Sun May 28 10:53:11 2017 +0300
>
> RDMA/mlx5: set UMR wqe fence according to HCA cap
>
> Cache the needed umr_fence and set the wqe ctrl segmennt
> accordingly.
>
> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>
> This is the only thing that changed in that area
> lately...
>
> Can you try without it?
I haven't tried with it. I can pull it and see if it helps.
I have tried:
- with and without IOMMU enabled
- with RoCE v1 and v2
- with instrumentation:
This can happen to any MR at any time after any number of
uses. It does not appear to be "sticky" (ie, xprtrdma
recovery from a memory management error clears the problem
successfully by releasing the MR and allocating a new one).
So it feels like a f/w or driver problem to me, at this
point.
--
Chuck Lever
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-06-27 16:08 ` Sagi Grimberg
[not found] ` <a82056d7-5685-5b85-8226-c54065e729fe-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-27 17:36 ` Leon Romanovsky
2017-08-08 15:45 ` Max Gurtovoy
2 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2017-06-27 16:08 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-rdma
> Hi Sagi-
>
>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>
>>
>>> While running xfstests on an NFS/RDMA mount, I see this in
>>> the client's /var/log/messages multiple times:
>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>> As far as I can tell the client is able to recover and continue
>>> the test. However, this error is not supposed to happen in normal
>>> operation.
>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>
>> Is this a regression?
>
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.
Is Robert running 4.9 on the initiator side?
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <a82056d7-5685-5b85-8226-c54065e729fe-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-06-27 17:03 ` Robert LeBlanc
0 siblings, 0 replies; 12+ messages in thread
From: Robert LeBlanc @ 2017-06-27 17:03 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Chuck Lever, linux-rdma
On Tue, Jun 27, 2017 at 10:08 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>> Hi Sagi-
>>
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>>
>>>
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error
>>>> cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management
>>>> operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>>
>>>
>>> Is this a regression?
>>
>>
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>
>
> Is Robert running 4.9 on the initiator side?
Yes, this was 4.9 on the initiator side.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-27 16:08 ` Sagi Grimberg
@ 2017-06-27 17:36 ` Leon Romanovsky
[not found] ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-08-08 15:45 ` Max Gurtovoy
2 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2017-06-27 17:36 UTC (permalink / raw)
To: Chuck Lever; +Cc: Sagi Grimberg, linux-rdma
On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> Hi Sagi-
>
> > On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> >
> >
> >> While running xfstests on an NFS/RDMA mount, I see this in
> >> the client's /var/log/messages multiple times:
> >> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >> As far as I can tell the client is able to recover and continue
> >> the test. However, this error is not supposed to happen in normal
> >> operation.
> >> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> >
> > Is this a regression?
>
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.
>
>
> > What kernel version are you running?
>
> v4.12-rc2.
>
>
> > FW revision?
>
> 12.18.2000
>
>
> > Is the below commit applied?
>
> This commit does not appear to be applied to my kernel.
>
>
> > commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> > Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Date: Sun May 28 10:53:11 2017 +0300
> >
> > RDMA/mlx5: set UMR wqe fence according to HCA cap
> >
> > Cache the needed umr_fence and set the wqe ctrl segmennt
> > accordingly.
> >
> > Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> > Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >
> > This is the only thing that changed in that area
> > lately...
> >
> > Can you try without it?
>
> I haven't tried with it. I can pull it and see if it helps.
>
> I have tried:
>
> - with and without IOMMU enabled
> - with RoCE v1 and v2
> - with instrumentation:
>
> This can happen to any MR at any time after any number of
> uses. It does not appear to be "sticky" (ie, xprtrdma
> recovery from a memory management error clears the problem
> successfully by releasing the MR and allocating a new one).
>
> So it feels like a f/w or driver problem to me, at this
> point.
Jack and I discussed your issue this morning, and we have a strong
feeling that it is FW.
Thanks
>
> --
> Chuck Lever
>
>
>
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2017-06-27 19:30 ` Chuck Lever
2017-07-05 14:40 ` Chuck Lever
1 sibling, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2017-06-27 19:30 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Sagi Grimberg, linux-rdma
> On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
>> Hi Sagi-
>>
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>>
>>>
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>>
>>> Is this a regression?
>>
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>>
>>
>>> What kernel version are you running?
>>
>> v4.12-rc2.
>>
>>
>>> FW revision?
>>
>> 12.18.2000
>>
>>
>>> Is the below commit applied?
>>
>> This commit does not appear to be applied to my kernel.
>>
>>
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Date: Sun May 28 10:53:11 2017 +0300
>>>
>>> RDMA/mlx5: set UMR wqe fence according to HCA cap
>>>
>>> Cache the needed umr_fence and set the wqe ctrl segmennt
>>> accordingly.
>>>
>>> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>
>>> This is the only thing that changed in that area
>>> lately...
>>>
>>> Can you try without it?
>>
>> I haven't tried with it. I can pull it and see if it helps.
>>
>> I have tried:
>>
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>>
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
>>
>> So it feels like a f/w or driver problem to me, at this
>> point.
>
> Jack and I discussed your issue this morning, and we have a strong
> feeling that it is FW.
A little more, not sure this is helpful:
The flushed FastReg WRs are all for IB_ACCESS_REMOTE_READ.
> Thanks
>
>>
>> --
>> Chuck Lever
>>
>>
>>
--
Chuck Lever
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <20170627173620.GT1248-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-27 19:30 ` Chuck Lever
@ 2017-07-05 14:40 ` Chuck Lever
[not found] ` <06510488-DB16-4781-8E5A-FDFFDDD00B4F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2017-07-05 14:40 UTC (permalink / raw)
To: Leon Romanovsky; +Cc: Sagi Grimberg, linux-rdma
> On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
>> Hi Sagi-
>>
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>>
>>>
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>>
>>> Is this a regression?
>>
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>>
>>
>>> What kernel version are you running?
>>
>> v4.12-rc2.
>>
>>
>>> FW revision?
>>
>> 12.18.2000
>>
>>
>>> Is the below commit applied?
>>
>> This commit does not appear to be applied to my kernel.
>>
>>
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Date: Sun May 28 10:53:11 2017 +0300
>>>
>>> RDMA/mlx5: set UMR wqe fence according to HCA cap
>>>
>>> Cache the needed umr_fence and set the wqe ctrl segmennt
>>> accordingly.
>>>
>>> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>
>>> This is the only thing that changed in that area
>>> lately...
>>>
>>> Can you try without it?
>>
>> I haven't tried with it. I can pull it and see if it helps.
>>
>> I have tried:
>>
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>>
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
>>
>> So it feels like a f/w or driver problem to me, at this
>> point.
>
> Jack and I discussed your issue this morning, and we have a strong
> feeling that it is FW.
Hi Leon-
Who is going to drive this issue to resolution? Do you need me
to do something?
> Thanks
>
>>
>> --
>> Chuck Lever
>>
>>
>>
--
Chuck Lever
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <06510488-DB16-4781-8E5A-FDFFDDD00B4F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2017-07-05 15:29 ` Leon Romanovsky
0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2017-07-05 15:29 UTC (permalink / raw)
To: Chuck Lever; +Cc: Sagi Grimberg, linux-rdma
On Wed, Jul 05, 2017 at 10:40:41AM -0400, Chuck Lever wrote:
>
> > On Jun 27, 2017, at 1:36 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >
> > On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> >> Hi Sagi-
> >>
> >>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> >>>
> >>>
> >>>> While running xfstests on an NFS/RDMA mount, I see this in
> >>>> the client's /var/log/messages multiple times:
> >>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >>>> As far as I can tell the client is able to recover and continue
> >>>> the test. However, this error is not supposed to happen in normal
> >>>> operation.
> >>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> >>>
> >>> Is this a regression?
> >>
> >> I can't answer that question with authority, because I just
> >> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> >> reported very similar symptoms with iSER on v4.9. It appears
> >> to have been around for a while, if these are the same.
> >>
> >>
> >>> What kernel version are you running?
> >>
> >> v4.12-rc2.
> >>
> >>
> >>> FW revision?
> >>
> >> 12.18.2000
> >>
> >>
> >>> Is the below commit applied?
> >>
> >> This commit does not appear to be applied to my kernel.
> >>
> >>
> >>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> >>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>> Date: Sun May 28 10:53:11 2017 +0300
> >>>
> >>> RDMA/mlx5: set UMR wqe fence according to HCA cap
> >>>
> >>> Cache the needed umr_fence and set the wqe ctrl segmennt
> >>> accordingly.
> >>>
> >>> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>> Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> >>> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> >>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>
> >>> This is the only thing that changed in that area
> >>> lately...
> >>>
> >>> Can you try without it?
> >>
> >> I haven't tried with it. I can pull it and see if it helps.
> >>
> >> I have tried:
> >>
> >> - with and without IOMMU enabled
> >> - with RoCE v1 and v2
> >> - with instrumentation:
> >>
> >> This can happen to any MR at any time after any number of
> >> uses. It does not appear to be "sticky" (ie, xprtrdma
> >> recovery from a memory management error clears the problem
> >> successfully by releasing the MR and allocating a new one).
> >>
> >> So it feels like a f/w or driver problem to me, at this
> >> point.
> >
> > Jack and I discussed your issue this morning, and we have a strong
> > feeling that it is FW.
>
> Hi Leon-
>
> Who is going to drive this issue to resolution? Do you need me
> to do something?
I don't think so, Jack was supposed to do it.
>
>
> > Thanks
> >
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
>
> --
> Chuck Lever
>
>
>
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <2FEEE227-9BCF-4454-A056-3997C1E54686-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-27 16:08 ` Sagi Grimberg
2017-06-27 17:36 ` Leon Romanovsky
@ 2017-08-08 15:45 ` Max Gurtovoy
[not found] ` <7ef3ca44-1253-7aae-1b46-f78cc15e627d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2 siblings, 1 reply; 12+ messages in thread
From: Max Gurtovoy @ 2017-08-08 15:45 UTC (permalink / raw)
To: Chuck Lever, Sagi Grimberg; +Cc: linux-rdma
Hi all,
sorry for the late response.
On 6/27/2017 5:56 PM, Chuck Lever wrote:
> Hi Sagi-
>
>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>
>>
>>> While running xfstests on an NFS/RDMA mount, I see this in
>>> the client's /var/log/messages multiple times:
>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>> As far as I can tell the client is able to recover and continue
>>> the test. However, this error is not supposed to happen in normal
>>> operation.
>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>
>> Is this a regression?
>
> I can't answer that question with authority, because I just
> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> reported very similar symptoms with iSER on v4.9. It appears
> to have been around for a while, if these are the same.
>
>
>> What kernel version are you running?
>
> v4.12-rc2.
>
>
>> FW revision?
>
> 12.18.2000
>
>
>> Is the below commit applied?
>
> This commit does not appear to be applied to my kernel.
>
>
>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Date: Sun May 28 10:53:11 2017 +0300
>>
>> RDMA/mlx5: set UMR wqe fence according to HCA cap
>>
>> Cache the needed umr_fence and set the wqe ctrl segmennt
>> accordingly.
>>
>> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>
>> This is the only thing that changed in that area
>> lately...
>>
>> Can you try without it?
>
> I haven't tried with it. I can pull it and see if it helps.
Chuck,
any updates using my patch above? (You actually need commit
1410a90ae449061b7e1ae19d275148f36948801b as a precondition.)
>
> I have tried:
>
> - with and without IOMMU enabled
> - with RoCE v1 and v2
> - with instrumentation:
>
> This can happen to any MR at any time after any number of
> uses. It does not appear to be "sticky" (ie, xprtrdma
> recovery from a memory management error clears the problem
> successfully by releasing the MR and allocating a new one).
>
I'm not so familiar with the NFS/RDMA I/O path yet, but are you using
remote invalidation from the server side, or do you run local invalidation?
Which side initiates the RDMA_READ/WRITE operations?
> So it feels like a f/w or driver problem to me, at this
> point.
>
> --
> Chuck Lever
* Re: "memory management error" with NFS/RDMA on RoCE
[not found] ` <7ef3ca44-1253-7aae-1b46-f78cc15e627d-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-08-08 16:14 ` Chuck Lever
0 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2017-08-08 16:14 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: Sagi Grimberg, linux-rdma
> On Aug 8, 2017, at 11:45 AM, Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
> Hi all,
> Sorry for the late response.
>
> On 6/27/2017 5:56 PM, Chuck Lever wrote:
>> Hi Sagi-
>>
>>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>>>
>>>
>>>> While running xfstests on an NFS/RDMA mount, I see this in
>>>> the client's /var/log/messages multiple times:
>>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
>>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
>>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
>>>> As far as I can tell the client is able to recover and continue
>>>> the test. However, this error is not supposed to happen in normal
>>>> operation.
>>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
>>>
>>> Is this a regression?
>>
>> I can't answer that question with authority, because I just
>> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
>> reported very similar symptoms with iSER on v4.9. It appears
>> to have been around for a while, if these are the same.
>>
>>
>>> What kernel version are you running?
>>
>> v4.12-rc2.
>>
>>
>>> FW revision?
>>
>> 12.18.2000
>>
>>
>>> Is the below commit applied?
>>
>> This commit does not appear to be applied to my kernel.
>>
>>
>>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
>>> Author: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Date: Sun May 28 10:53:11 2017 +0300
>>>
>>> RDMA/mlx5: set UMR wqe fence according to HCA cap
>>>
>>> Cache the needed umr_fence and set the wqe ctrl segment
>>> accordingly.
>>>
>>> Signed-off-by: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>> Acked-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
>>> Reviewed-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>
>>> This is the only thing that changed in that area
>>> lately...
>>>
>>> Can you try without it?
>>
>> I haven't tried with it. I can pull it and see if it helps.
>
> Chuck,
> any updates using my patch above? (You actually need 1410a90ae449061b7e1ae19d275148f36948801b as a precondition.)
My client is at v4.13-rc3 now, and I haven't seen this issue recur
recently.
>> I have tried:
>>
>> - with and without IOMMU enabled
>> - with RoCE v1 and v2
>> - with instrumentation:
>>
>> This can happen to any MR at any time after any number of
>> uses. It does not appear to be "sticky" (ie, xprtrdma
>> recovery from a memory management error clears the problem
>> successfully by releasing the MR and allocating a new one).
>>
>
> I'm not yet familiar with the NFS/RDMA I/O path, but are you using remote invalidation from the server side, or do you run local invalidation?
> Which side initiates the RDMA_READ/WRITE operations?
Remote Invalidation should be in use, but I haven't confirmed that.
The storage target (the NFS server) issues the RDMA Read and
RDMA Write operations.
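[Editorial note: the distinction above can be sketched as a toy model. This is illustrative pseudologic only, not actual xprtrdma or ib_verbs code; the names `Completion` and `needs_local_invalidate` are hypothetical. The idea: when the server's final Send-with-Invalidate carries the MR's rkey, the HCA has already invalidated the registration, so the client can skip posting its own IB_WR_LOCAL_INV; otherwise the client must invalidate locally before reusing the MR.]

```python
# Toy model of the requester-side invalidation decision. Hypothetical
# names; cf. struct ib_wc, IB_WC_WITH_INVALIDATE, and IB_WR_LOCAL_INV
# in the kernel verbs API.
from dataclasses import dataclass

@dataclass
class Completion:
    """Minimal stand-in for a receive completion (cf. struct ib_wc)."""
    with_invalidate: bool      # cf. the IB_WC_WITH_INVALIDATE flag
    invalidated_rkey: int = 0  # cf. wc->ex.invalidate_rkey

def needs_local_invalidate(wc: Completion, mr_rkey: int) -> bool:
    """True if the client must post its own local-invalidate WR.

    If the server used Send-with-Invalidate on this MR's rkey, the HCA
    already invalidated it and a local invalidate would be redundant.
    """
    return not (wc.with_invalidate and wc.invalidated_rkey == mr_rkey)

# Remotely invalidated with the matching rkey: no local WR needed.
assert needs_local_invalidate(Completion(True, 0x1234), 0x1234) is False
# Plain Send, or the wrong rkey: the client must invalidate locally.
assert needs_local_invalidate(Completion(False), 0x1234) is True
assert needs_local_invalidate(Completion(True, 0x9999), 0x1234) is True
```

Note the rkey comparison: a completion that invalidates some *other* MR's rkey must not be treated as having invalidated this one.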
>> So it feels like a f/w or driver problem to me, at this
>> point.
--
Chuck Lever
end of thread, other threads:[~2017-08-08 16:14 UTC | newest]
Thread overview: 12+ messages
2017-06-22 18:28 "memory management error" with NFS/RDMA on RoCE Chuck Lever
2017-06-22 20:57 ` Robert LeBlanc
2017-06-27 9:28 ` Sagi Grimberg
2017-06-27 14:56 ` Chuck Lever
2017-06-27 16:08 ` Sagi Grimberg
2017-06-27 17:03 ` Robert LeBlanc
2017-06-27 17:36 ` Leon Romanovsky
2017-06-27 19:30 ` Chuck Lever
2017-07-05 14:40 ` Chuck Lever
2017-07-05 15:29 ` Leon Romanovsky
2017-08-08 15:45 ` Max Gurtovoy
2017-08-08 16:14 ` Chuck Lever